- Published on
Resume Summary and Skills Extraction using NLP and Machine Learning
- Authors
- Name
- Astik Dahal
Introduction
I took on an exciting challenge recently—building a system that could automatically analyze resumes and extract valuable insights. The idea was simple: instead of manually skimming through resumes, why not let machine learning (ML) and natural language processing (NLP) do the heavy lifting?
In this blog, I'll share my journey of developing a resume summarization and skill extraction system. Here's what we'll explore:
- Why automating resume screening matters.
- How I built an NLP-powered summarization system.
- Extracting skills using machine learning and clustering.
- The results and future improvements.
The Problem
While researching the challenges of resume screening, I realized how tedious the process can be. Each resume comes in different formats, with varying levels of detail. This inconsistency makes manual screening inefficient and prone to errors.
Common challenges:
- Resumes are unstructured and come in different file formats.
- Extracting relevant skills is difficult.
- Summarizing long resumes manually is time-consuming.
Objective: To build an automated system that summarizes resumes and extracts key skills.
My Approach
To tackle this, I combined several techniques to create a robust system:
- Text extraction: Using pdfminer.six to pull raw text from PDFs.
- Summarization: Leveraging Facebook’s BART model for concise summaries.
- Skill extraction: A mix of regex patterns, spaCy NLP, and clustering techniques.
Tools & Libraries Used:
- Python
- Transformers (facebook/bart-large-cnn)
- spaCy (en_core_web_trf)
- KeyBERT
- Scikit-learn
Step 1: Extracting Text from Resumes
First, I needed a way to extract content from PDF resumes. I used the pdfminer.six library for this purpose.
from pdfminer.high_level import extract_text
from google.colab import files
def get_resume_text():
    # Upload a PDF through the Colab file picker and return its raw text.
    uploaded = files.upload()
    pdf_name = list(uploaded.keys())[0]
    return extract_text(pdf_name)
resume_text = get_resume_text()
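Note that files.upload() ties this snippet to Google Colab. Outside Colab, a minimal alternative is to pass a local path straight to pdfminer's extract_text (the resume.pdf filename below is just a placeholder):
from pdfminer.high_level import extract_text

# Outside Colab: read the resume from a local path instead of an upload widget.
resume_text = extract_text("resume.pdf")  # placeholder filename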
Step 2: Summarization with BART
To generate concise summaries, I opted for Facebook's BART transformer, known for its efficiency in text generation tasks.
BART is particularly effective when fine-tuned for text generation tasks such as summarization and translation, but it also works well for comprehension tasks like text classification and question answering. The facebook/bart-large-cnn checkpoint is pre-trained on English and fine-tuned on CNN Daily Mail, a large collection of text-summary pairs. The model was introduced in the paper BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension by Lewis et al. and first released in the fairseq repository (https://github.com/pytorch/fairseq/tree/master/examples/bart).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
def summarize(text):
    # Load the BART summarization checkpoint, truncate the input to the model's
    # 1024-token limit, and generate a short beam-searched summary.
    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
    model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
    inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
    summary_ids = model.generate(inputs["input_ids"], max_length=150, num_beams=4)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
summary = summarize(resume_text)
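As a quick cross-check, the same checkpoint can also be driven through the high-level pipeline helper from transformers. This is just a minimal sketch; the character slice and length limits are arbitrary choices I made to keep the input under the model's token cap, not values from the original notebook:
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
# Slice the text roughly so it stays within the model's input limit.
result = summarizer(resume_text[:3000], max_length=150, min_length=40, do_sample=False)
print(result[0]["summary_text"])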
Step 3: Skill Extraction Using NLP and Clustering
Step 3.1: Initializing the Skill Extractor
I created a class called UniversalSkillExtractor to initialize the NLP models and tools needed for skill extraction.
import spacy
from keybert import KeyBERT
class UniversalSkillExtractor:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_trf")
        self.kw_model = KeyBERT()
The spaCy model helps with named entity recognition, while KeyBERT is used for extracting relevant keywords from text.
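For context, a standalone KeyBERT call looks roughly like this. The ngram range and top_n below are values I picked for illustration, not settings from the project:
from keybert import KeyBERT

kw_model = KeyBERT()
# Returns (keyword, relevance score) pairs ranked against the full document.
keywords = kw_model.extract_keywords(
    resume_text,
    keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
    stop_words="english",
    top_n=10,
)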
Step 3.2: Extracting Section-Based Skills
I implemented a function to capture skills listed under specific resume sections such as 'Skills', 'Experience', and 'Technologies'.
import re

def _extract_section_skills(self, text):
    return re.findall(r'(?i)\b(?:Python|Machine Learning|Deep Learning|NLP|SQL)\b', text)
In its current form, this function relies on a regex over a fixed list of common skill keywords, so it matches these terms anywhere in the resume rather than only within those sections.
Step 3.3: Extracting Pattern-Based Skills
To improve accuracy, I designed a method to detect skill patterns based on common phrases such as "proficient in", "experience with", and "skilled in".
def _extract_pattern_based_skills(self, text):
    # A non-capturing group for the trigger phrase means findall returns only
    # the skill token that follows it, not (phrase, skill) tuples.
    patterns = [
        r'(?i)\b(?:proficient in|skilled in|knowledge of|experience with)\s+([\w+/-]+)'
    ]
    skills = []
    for pattern in patterns:
        skills.extend(re.findall(pattern, text))
    return list(set(skills))
This function captures the term that immediately follows each trigger phrase, including technical tokens like C++ or CI/CD thanks to the +, /, and - characters allowed in the match.
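As a quick illustration of what this returns, here is a hypothetical input sentence, assuming the method above is defined on UniversalSkillExtractor. With the non-capturing trigger group, only the token after each phrase comes back, and the order varies because of set():
extractor = UniversalSkillExtractor()  # loads the spaCy and KeyBERT models
sample = "Proficient in Python, with experience with SQL and knowledge of Docker."
print(extractor._extract_pattern_based_skills(sample))
# -> something like ['Python', 'SQL', 'Docker'] (order not guaranteed)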
Step 3.4: Clustering Extracted Skills
To group related skills and remove redundancies, I used TF-IDF vectorization and KMeans clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
def _cluster_skills(self, candidates):
    # Vectorize the candidate strings and group them with KMeans.
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(candidates)
    kmeans = KMeans(n_clusters=min(5, len(candidates)), random_state=42)
    kmeans.fit(X)
    # Cluster labels live in kmeans.labels_; the returned list is the
    # deduplicated candidates sorted from shortest to longest.
    return sorted(set(candidates), key=len)
This step is intended to group similar skills; at the moment the cluster labels are computed but the returned list is simply the deduplicated candidates sorted from shortest to longest, so there is room to use the clusters more directly in the final output.
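The blog presents these helpers one at a time, so here is a rough sketch of a top-level method that could chain them. The extract_skills name is my own, not from the original notebook, and it assumes all three helpers live on UniversalSkillExtractor:
def extract_skills(self, text):
    # Combine keyword matches and phrase-pattern matches, then dedupe/cluster.
    candidates = self._extract_section_skills(text)
    candidates += self._extract_pattern_based_skills(text)
    return self._cluster_skills(candidates)

# With the methods attached to the class, usage would look like:
# extractor = UniversalSkillExtractor()
# skills = extractor.extract_skills(resume_text)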
Evaluation Results
=== DETECTED SKILLS ===
['Python and OpenCV with KNN for license plate detection ', 'data preprocessing pipelines', 'backend functionality', 'front-end development', 'frontend technologies', 'skilled professionals', 'user interface basics', 'Frontend Developer', 'design principles', 'iterative design', 'web development', 'product design', 'UI/UX efforts', 'game controls', 'Javascript']
The detected skills look promising, covering both technical and design aspects. However, there’s room for improvement in filtering out generic terms and better categorizing the extracted skills.
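One low-effort improvement would be a post-filter over the detected list, for example a small hand-maintained denylist of vague phrases. The terms below are purely illustrative, taken from the output above:
GENERIC_TERMS = {"skilled professionals", "design principles", "backend functionality"}

def filter_generic(skills):
    # Drop denylisted entries (case-insensitive) and overly short strings.
    return [s for s in skills
            if s.strip().lower() not in GENERIC_TERMS and len(s.strip()) > 2]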
Conclusion
This project has shown me the immense potential of NLP and ML in automating tedious processes like resume screening, with the broader goal of improving human-computer interaction. While the current results are promising, there's still work to be done in refining the system to better handle the nuances of real resumes.
The code is available as a Google Colab notebook. Feel free to check it out.