AI-powered Resume Ranking System Python, AI, NLP
👤 Sharing: AI
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re # For regular expressions
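
# Download the required NLTK resources (run this once).
# Recent NLTK releases may load the tokenizer data from 'punkt_tab' instead of 'punkt'.
# nltk.download('punkt')
# nltk.download('punkt_tab')
# nltk.download('stopwords')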


def preprocess_text(text):
    """
    Preprocesses the input text by:
    1. Lowercasing the text.
    2. Removing punctuation and special characters.
    3. Tokenizing the text.
    4. Removing stop words.
    5. Joining the tokens back into a string.

    Args:
        text: The input text string.

    Returns:
        A preprocessed string.
    """
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]
    return " ".join(tokens)


def rank_resumes(job_description, resumes):
    """
    Ranks resumes based on their similarity to the job description using TF-IDF and cosine similarity.

    Args:
        job_description: The job description text.
        resumes: A list of resume text strings.

    Returns:
        A list of tuples, where each tuple contains the resume index and its ranking score.
        The list is sorted in descending order of ranking score (best match first).
    """
    # Preprocess the job description and resumes
    processed_job_description = preprocess_text(job_description)
    processed_resumes = [preprocess_text(resume) for resume in resumes]

    # Create TF-IDF vectorizer
    vectorizer = TfidfVectorizer()

    # Fit and transform the text data (row 0 is the job description)
    tfidf_matrix = vectorizer.fit_transform([processed_job_description] + processed_resumes)

    # Calculate cosine similarity between the job description (row 0) and each resume
    cosine_similarities = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])

    # Create ranking scores; enumerate gives (index, score)
    ranking_scores = list(enumerate(cosine_similarities[0]))

    # Sort by score in descending order
    ranking_scores = sorted(ranking_scores, key=lambda x: x[1], reverse=True)
    return ranking_scores


if __name__ == '__main__':
    # Example usage:
    job_description = """
    We are looking for a Data Scientist with experience in machine learning, natural language processing, and data analysis.
    The ideal candidate should have strong Python programming skills and be familiar with libraries such as scikit-learn, nltk, and pandas.
    Experience with deep learning frameworks is a plus. Responsibilities include building predictive models,
    analyzing large datasets, and communicating findings to stakeholders. Strong communication and problem-solving skills are essential.
    """

    resumes = [
        """
        John Doe
        Data Scientist
        Skills: Python, machine learning, scikit-learn, data analysis, communication.
        Experience building predictive models and analyzing data.
        """,
        """
        Jane Smith
        Software Engineer
        Skills: Java, C++, data structures, algorithms.
        Experience developing software applications.
        """,
        """
        Peter Jones
        Data Analyst
        Skills: Python, SQL, data analysis, visualization.
        Experience with data cleaning and reporting. Knowledge of machine learning.
        """
    ]

    # Rank the resumes
    ranked_resumes = rank_resumes(job_description, resumes)

    # Print the ranked resumes (index + 1 so that numbering starts at 1)
    print("Ranked Resumes:")
    for index, score in ranked_resumes:
        print(f"Resume {index + 1}: Score = {score:.4f}")

    # Optional: Print the top ranked resume content
    top_resume_index = ranked_resumes[0][0]  # Get the index of the top resume
    print("\nTop Ranked Resume Content:")
    print(resumes[top_resume_index])
```
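To see what the punctuation-removal step in `preprocess_text` does to technical terms, the following minimal sketch applies the same regular expression to a few skill names. The helper name `strip_punctuation` and the sample strings are illustrative assumptions, not part of the script above.

```python
import re


def strip_punctuation(text):
    """Lowercase the text, then delete every character that is not a word character or whitespace."""
    return re.sub(r'[^\w\s]', '', text.lower())


# Sample terms chosen for illustration only.
samples = ["scikit-learn", "C++", "C#", "Node.js", "machine learning"]
for term in samples:
    print(f"{term!r} -> {strip_punctuation(term)!r}")
```

Hyphenated and dotted names fuse into a single token, while "C++" and "C#" both collapse to "c". A pipeline that needs to tell such skills apart would likely need a different cleaning rule, for example mapping known skill names to placeholder tokens before punctuation is stripped.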
Key points and explanations:
* **Clear Explanations and Comments:** The code is well-commented, explaining each step of the process, including the purpose of functions and individual lines.
* **`preprocess_text` Function:** This function handles text preprocessing:
    * **Lowercasing:** Converts all text to lowercase so that matching is case-insensitive.
    * **Punctuation Removal:** Removes punctuation with `re.sub(r'[^\w\s]', '', text)`, which deletes every character that is neither a word character (letters, digits, underscore) nor whitespace. Because the characters are deleted rather than replaced with spaces, hyphenated terms fuse ("scikit-learn" becomes "scikitlearn") and symbol-heavy terms lose information ("C++" becomes "c"). The same transformation is applied to the job description and the resumes, so fused terms still match each other.
    * **Tokenization:** Splits the text into individual words (tokens) using `word_tokenize` from NLTK.
    * **Stop Word Removal:** Removes common words such as "the", "a", and "is", which carry little meaning. `stopwords.words('english')` returns a list; the code converts it to a `set` so that membership checks are fast.
    * **Joining Tokens:** Combines the processed tokens back into a single string, which is the input format `TfidfVectorizer` expects by default.
* **`rank_resumes` Function:** This function performs the resume ranking:
    * **Preprocessing:** Applies `preprocess_text` to both the job description and the resumes, so both sides are compared in the same normalized form.
    * **TF-IDF Vectorization:** Uses `TfidfVectorizer` to convert the preprocessed text into numerical vectors. TF-IDF (Term Frequency-Inverse Document Frequency) weights each word by how often it appears in a document and how rare it is across the corpus. Here the corpus is the job description plus all resumes, since the vectorizer is fitted on them together.
    * **Cosine Similarity:** Calculates the cosine similarity between the job description vector and each resume vector. Cosine similarity measures the angle between two vectors; a smaller angle (cosine closer to 1) indicates higher similarity. TF-IDF weights are non-negative, so the scores fall between 0 and 1.
    * **Ranking and Sorting:** `enumerate` pairs each score with the original index of its resume, so the index survives sorting. `sorted` orders the pairs by similarity score in descending order, with a `lambda` selecting the score as the sort key.
* **`if __name__ == '__main__':` Block:** This ensures that the example usage code is executed only when the script is run directly (not when it's imported as a module).
* **Example Usage:** The `if __name__ == '__main__':` block contains a clear example of how to use the functions. It defines a job description and a list of resumes, calls the `rank_resumes` function, and prints the results. It also retrieves and prints the content of the top-ranked resume.
* **Clear Output:** The output is formatted to be easily readable, including the resume index and its corresponding score. The top-ranked resume content is also printed.
* **NLTK Resource Download:** The code includes the commented-out lines `nltk.download('punkt')` and `nltk.download('stopwords')`. These *must* be run once to download the necessary NLTK resources before the code can be used. I've left them commented out so that the script doesn't download them every time it runs; uncomment them and run them once. Recent NLTK releases may also require `nltk.download('punkt_tab')` for `word_tokenize`.
* **Error Handling (Optional):** For a production environment, you'd want to add error handling (e.g., `try...except` blocks) to gracefully handle cases where the input data is invalid or missing.
* **Scalability:** TF-IDF and cosine similarity are generally efficient for small to medium-sized datasets. For very large datasets, you might need to explore more advanced techniques like approximate nearest neighbor search (e.g., using libraries like FAISS) to speed up the ranking process.
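The cosine similarity step can be illustrated without any libraries. The sketch below is a simplified stand-in that assumes raw term counts instead of the TF-IDF weights that `TfidfVectorizer` produces; the function name `cosine_from_counts` and the sample strings are hypothetical.

```python
import math
from collections import Counter


def cosine_from_counts(text_a, text_b):
    """Cosine similarity between two texts using raw term counts (not TF-IDF)."""
    counts_a = Counter(text_a.split())
    counts_b = Counter(text_b.split())
    # Dot product over the terms that appear in both texts.
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text has no direction, so treat it as dissimilar.
    return dot / (norm_a * norm_b)


job = "python machine learning data analysis"
print(cosine_from_counts(job, job))                         # identical texts: close to 1
print(cosine_from_counts(job, "python data analysis sql"))  # partial overlap: between 0 and 1
print(cosine_from_counts(job, "java spring hibernate"))     # no shared terms: 0
```

TF-IDF changes the weights, not the geometry: terms that appear in many documents count for less, but the score is still the cosine of the angle between the two vectors.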
This example provides a complete, documented, and functional resume ranking system using Python, NLTK, and scikit-learn. It covers the key steps of text preprocessing, vectorization, similarity calculation, and ranking.