- Published on
Resume Summary and Skills Extraction using NLP and Machine Learning
- Authors
- Name
- Astik Dahal
Introduction
I took on an exciting challenge recently—building a system that could automatically analyze resumes and extract valuable insights. The idea was simple: instead of manually skimming through resumes, why not let machine learning (ML) and natural language processing (NLP) do the heavy lifting?
In this blog, I'll share my journey of developing a resume summarization and skill extraction system. Here's what we'll explore:
- Why automating resume screening matters.
- How I built an NLP-powered summarization system.
- Extracting skills using machine learning and clustering.
- The results and future improvements.
The Problem
While researching the challenges of resume screening, I realized how tedious the process can be. Each resume comes in different formats, with varying levels of detail. This inconsistency makes manual screening inefficient and prone to errors.
Common challenges:
- Resumes are unstructured and come in different file formats.
- Extracting relevant skills is difficult.
- Summarizing long resumes manually is time-consuming.
Objective: To build an automated system that summarizes resumes and extracts key skills.
My Approach
To tackle this, I combined several techniques to create a robust system:
- Text extraction: Using pdfminer.six to pull raw text from PDFs.
- Summarization: Leveraging Facebook’s BART model for concise summaries.
- Skill extraction: A mix of regex patterns, spaCy NLP, and clustering techniques.
Tools & Libraries Used:
- Python
- Transformers (facebook/bart-large-cnn)
- spaCy (en_core_web_trf)
- KeyBERT
- Scikit-learn
Step 1: Extracting Text from Resumes
First, I needed a way to extract content from PDF resumes. I used the pdfminer.six library for this purpose.
from pdfminer.high_level import extract_text
from google.colab import files
def get_resume_text():
    # Upload a PDF through the Colab file picker and return its raw text.
    uploaded = files.upload()
    pdf_name = list(uploaded.keys())[0]
    return extract_text(pdf_name)
resume_text = get_resume_text()
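Note that files.upload() ties this snippet to Google Colab. Outside Colab, a minimal alternative is to pass a local path straight to pdfminer's extract_text (the resume.pdf filename below is just a placeholder):
from pdfminer.high_level import extract_text

# Outside Colab: read the resume from a local path instead of an upload widget.
resume_text = extract_text("resume.pdf")  # placeholder filename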
Step 2: Summarization with BART
To generate concise summaries, I opted for Facebook's BART transformer, known for its efficiency in text generation tasks.
BART is particularly effective when fine-tuned for text generation tasks such as summarization and translation, but it also works well for comprehension tasks like text classification and question answering. The facebook/bart-large-cnn checkpoint is pre-trained on English and fine-tuned on CNN Daily Mail, a large collection of text-summary pairs. The model was introduced in the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Lewis et al. and first released in the fairseq repository (https://github.com/pytorch/fairseq/tree/master/examples/bart).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
def summarize(text):
    # Load the BART summarization checkpoint, truncate the input to the model's
    # 1024-token limit, and generate a short beam-searched summary.
    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
    inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
    summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
summary = summarize(resume_text)
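As a quick cross-check, the same checkpoint can also be driven through the high-level pipeline helper from transformers. This is just a minimal sketch; the character slice and length limits are arbitrary choices I made to keep the input under the model's token cap, not values from the original notebook:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# Slice the text roughly so it stays within the model's input limit.
result = summarizer(resume_text[:3000], max_length=150, min_length=40, do_sample=False)
print(result[0]["summary_text"])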
Step 3: Skill Extraction Using NLP and Clustering
Step 3.1: Initializing the Skill Extractor
I created a class called UniversalSkillExtractor to initialize the NLP models and tools needed for skill extraction.
import spacy
from keybert import KeyBERT
class UniversalSkillExtractor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_trf")
        self.kw_model = KeyBERT()
The spaCy model helps with named entity recognition, while KeyBERT is used for extracting relevant keywords from text.
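For context, a standalone KeyBERT call looks roughly like this. The ngram range and top_n below are values I picked for illustration, not settings from the project:
from keybert import KeyBERT

kw_model = KeyBERT()
# Returns (keyword, relevance score) pairs ranked against the full document.
keywords = kw_model.extract_keywords(
    resume_text,
    keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
    stop_words="english",
    top_n=10,
)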
Step 3.2: Extracting Section-Based Skills
I implemented a function to capture skills listed under specific resume sections such as 'Skills', 'Experience', and 'Technologies'.
import re

def _extract_section_skills(self, text):
    return re.findall(r'(?i)\b(?:Python|Machine Learning|Deep Learning|NLP|SQL)\b', text)
In its current form, this function relies on a regex over a fixed list of common skill keywords, so it matches these terms anywhere in the resume rather than only within those sections.
Step 3.3: Extracting Pattern-Based Skills
To improve accuracy, I designed a method to detect skill patterns based on common phrases such as "proficient in", "experience with", and "skilled in".
def _extract_pattern_based_skills(self, text):
    # A non-capturing group for the trigger phrase means findall returns only
    # the skill token that follows it, not (phrase, skill) tuples.
    patterns = [
        r'(?i)\b(?:proficient in|skilled in|knowledge of|experience with)\s+([\w+/-]+)'
    ]
    skills = []
    for pattern in patterns:
        skills.extend(re.findall(pattern, text))
    return list(set(skills))
This function captures the term that immediately follows each trigger phrase, including technical tokens like C++ or CI/CD thanks to the +, /, and - characters allowed in the match.
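As a quick illustration of what this returns, here is a hypothetical input sentence, assuming the method above is defined on UniversalSkillExtractor. With the non-capturing trigger group, only the token after each phrase comes back, and the order varies because of set():
extractor = UniversalSkillExtractor()  # loads the spaCy and KeyBERT models
sample = "Proficient in Python, with experience with SQL and knowledge of Docker."
print(extractor._extract_pattern_based_skills(sample))
# -> something like ['Python', 'SQL', 'Docker'] (order not guaranteed)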
Step 3.4: Clustering Extracted Skills
To group related skills and remove redundancies, I used TF-IDF vectorization and KMeans clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
def _cluster_skills(self, candidates):
    # Vectorize the candidate strings and group them with KMeans.
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(candidates)
    kmeans = KMeans(n_clusters=min(5, len(candidates)), random_state=42)
    kmeans.fit(X)
    # Cluster labels live in kmeans.labels_; the returned list is the
    # deduplicated candidates sorted from shortest to longest.
    return sorted(set(candidates), key=len)
This step is intended to group similar skills; at the moment the cluster labels are computed but the returned list is simply the deduplicated candidates sorted from shortest to longest, so there is room to use the clusters more directly in the final output.
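The blog presents these helpers one at a time, so here is a rough sketch of a top-level method that could chain them. The extract_skills name is my own, not from the original notebook, and it assumes all three helpers live on UniversalSkillExtractor:
def extract_skills(self, text):
    # Combine keyword matches and phrase-pattern matches, then dedupe/cluster.
    candidates = self._extract_section_skills(text)
    candidates += self._extract_pattern_based_skills(text)
    return self._cluster_skills(candidates)

# With the methods attached to the class, usage would look like:
# extractor = UniversalSkillExtractor()
# skills = extractor.extract_skills(resume_text)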
Evaluation Results
=== DETECTED SKILLS ===
['Python and OpenCV with KNN for license plate detection ', 'data preprocessing pipelines', 'backend functionality', 'front-end development', 'frontend technologies', 'skilled professionals', 'user interface basics', 'Frontend Developer', 'design principles', 'iterative design', 'web development', 'product design', 'UI/UX efforts', 'game controls', 'Javascript']
The detected skills look promising, covering both technical and design aspects. However, there’s room for improvement in filtering out generic terms and better categorizing the extracted skills.
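One low-effort improvement would be a post-filter over the detected list, for example a small hand-maintained denylist of vague phrases. The terms below are purely illustrative, taken from the output above:
GENERIC_TERMS = {"skilled professionals", "design principles", "backend functionality"}

def filter_generic(skills):
    # Drop denylisted entries (case-insensitive) and overly short strings.
    return [s for s in skills
            if s.strip().lower() not in GENERIC_TERMS and len(s.strip()) > 2]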
Conclusion
This project has shown me the immense potential of NLP and ML in automating tedious processes like resume screening, with the broader goal of improving human-computer interaction. While the current results are promising, there's still work to be done in refining the system to better handle the nuances of real resumes.
The code is available as a Google Colab notebook. Feel free to check it out.