AI-Driven Predictive Customer Support Ticket Prioritization Tool Python
👤 Sharing: AI
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder
import joblib # For saving and loading models
# --- 1. Data Preparation (Simulated Data) ---
# Create a sample dataset (replace with your actual data loading)
data = {
    'ticket_description': [
        "Account login issue, can't access my account",
        "Website is loading very slowly",
        "Password reset link not working",
        "Order not processed, payment confirmed",
        "Feature request: dark mode",
        "General question about product usage",
        "Urgent: site down, critical error",
        "Billing inquiry, incorrect charge",
        "Request for documentation on API",
        "Can't install the new software update",
        "High priority: Security breach suspected",
        "Minor typo on the 'About Us' page",
        "Feedback: UI is confusing",
        "Mobile app crashing on startup",
        "Need help setting up the integration",
        "Critical system failure, data loss possible",
        "Performance issue on the main page",
        "User reporting broken links on site",
        "Data integration issues after the update",
        "Reporting vulnerability in system",  # Added vulnerability
    ],
    'priority': [
        'High', 'Medium', 'High', 'High', 'Low',
        'Medium', 'Critical', 'Medium', 'Low', 'Medium',
        'Critical', 'Low', 'Low', 'High', 'Medium',
        'Critical', 'Medium', 'Medium', 'High', 'Critical',
    ],
    'category': [
        'Account', 'Performance', 'Account', 'Orders', 'Feature Request',
        'General', 'System', 'Billing', 'API', 'Installation',
        'Security', 'Website', 'UX', 'Mobile App', 'Integration',
        'System', 'Performance', 'Website', 'Integration', 'Security',
    ],
}
df = pd.DataFrame(data)
# Display the first few rows of the dataframe
print("Sample Data:")
print(df.head())
print("\n")
# --- 2. Feature Engineering ---
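# a. Text Vectorization using TF-IDF
# stop_words drops common English words; max_df/min_df drop terms that appear in
# more than 95% of tickets or in fewer than 2 tickets.
# Note: the vectorizer is fit on all rows before the train/test split, so the test
# rows influence the vocabulary and IDF weights. For a real evaluation, split first
# and fit the vectorizer on the training rows only.
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.95, min_df=2)
X = tfidf_vectorizer.fit_transform(df['ticket_description'])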
print("Shape of TF-IDF matrix:", X.shape) # Check dimensions
print("\n")
# b. Label Encoding for the target variable (priority)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['priority']) # Converts priority labels to numerical values
print("Encoded Priority Labels:", y)
print("\n")
# --- 3. Train/Test Split ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # Added stratification
print("Train/Test split complete.")
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("\n")
# --- 4. Model Training ---
# a. Naive Bayes Classifier
model = MultinomialNB()
model.fit(X_train, y_train)
print("Model training complete.")
print("\n")
# --- 5. Model Evaluation ---
# a. Predictions on the test set
y_pred = model.predict(X_test)
# b. Performance Metrics
print("Performance Metrics:")
print("Accuracy:", accuracy_score(y_test, y_pred))
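print("\nClassification Report:\n", classification_report(
    y_test, y_pred,
    labels=np.arange(len(label_encoder.classes_)),  # keep labels and names aligned even if a class is missing from the test set
    target_names=label_encoder.classes_,
    zero_division=0,  # report 0 instead of warning when a class is never predicted
))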
# --- 6. Model Deployment and Prediction Function ---
def predict_priority(ticket_description, vectorizer, model, label_encoder):
    """
    Predicts the priority of a new customer support ticket.
    Args:
        ticket_description (str): The description of the customer support ticket.
        vectorizer: The trained TF-IDF vectorizer.
        model: The trained machine learning model.
        label_encoder: The trained LabelEncoder.
    Returns:
        str: The predicted priority level (e.g., 'High', 'Medium', 'Low').
    """
    # 1. Vectorize the input text
    text_vectorized = vectorizer.transform([ticket_description])
    # 2. Make a prediction
    predicted_label = model.predict(text_vectorized)[0]  # predict returns an array, so take the first element
    # 3. Decode the predicted label
    predicted_priority = label_encoder.inverse_transform([predicted_label])[0]
    return predicted_priority
# --- 7. Example Usage ---
new_ticket = "Website is down, users can't access"
predicted_priority = predict_priority(new_ticket, tfidf_vectorizer, model, label_encoder)
print(f"New ticket: '{new_ticket}' - Predicted Priority: {predicted_priority}")
new_ticket2 = "Minor UI glitch on the settings page"
predicted_priority2 = predict_priority(new_ticket2, tfidf_vectorizer, model, label_encoder)
print(f"New ticket: '{new_ticket2}' - Predicted Priority: {predicted_priority2}")
# --- 8. Save the trained model and vectorizer ---
model_filename = 'priority_model.joblib'
vectorizer_filename = 'tfidf_vectorizer.joblib'
label_encoder_filename = 'label_encoder.joblib'
joblib.dump(model, model_filename)
joblib.dump(tfidf_vectorizer, vectorizer_filename)
joblib.dump(label_encoder, label_encoder_filename)
print(f"\nModel, vectorizer, and label encoder saved to {model_filename}, {vectorizer_filename}, and {label_encoder_filename}")
# --- 9. Loading the model and vectorizer (Example) ---
# Load the model and vectorizer
loaded_model = joblib.load(model_filename)
loaded_vectorizer = joblib.load(vectorizer_filename)
loaded_label_encoder = joblib.load(label_encoder_filename)
# Example of using the loaded model to predict
new_ticket3 = "Problem with mobile app payment"
predicted_priority3 = predict_priority(new_ticket3, loaded_vectorizer, loaded_model, loaded_label_encoder)
print(f"New ticket: '{new_ticket3}' - Predicted Priority (loaded model): {predicted_priority3}")
```
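One detail of the label-encoding step in the script above is easy to miss: `LabelEncoder` assigns integer codes in sorted (alphabetical) order of the class names, so the codes do not follow severity. The short sketch below uses only the four priority names from the sample data and prints the mapping. The `severity_order` list is an assumption added for illustration and is not part of the script.

```python
from sklearn.preprocessing import LabelEncoder

priorities = ["High", "Medium", "Low", "Critical"]

encoder = LabelEncoder()
encoder.fit(priorities)

# classes_ is sorted alphabetically; the position of each class is its code.
label_to_code = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(label_to_code)

# Hypothetical severity ranking, for cases where an ordinal scale is needed
# (for example, sorting a queue by predicted priority).
severity_order = ["Low", "Medium", "High", "Critical"]
severity_rank = {label: rank for rank, label in enumerate(severity_order)}
print(severity_rank)
```

If you sort or compare predictions, use an explicit ranking like `severity_rank` rather than the encoder's codes.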
Key points and explanations:
* **Clear Sections:** The code is divided into logical sections (Data Preparation, Feature Engineering, etc.), each introduced by a comment that says what it does.
* **Data Simulation:** The script ships with a small simulated dataset and a comment telling you to replace it with your own data loading. Twenty hand-written tickets are enough to exercise the code path, but not to train a useful model.
* **TF-IDF Vectorization:**
    * Uses `TfidfVectorizer` to convert text into numerical features. `stop_words='english'` removes common English words (like "the", "a", "is") that don't carry much meaning.
    * `max_df=0.95` ignores terms that appear in more than 95% of the documents, and `min_df=2` keeps only terms that appear in at least two documents. Filtering very common and very rare terms can help on a large corpus. On the 20-ticket sample, however, `min_df=2` discards most of the vocabulary, so consider `min_df=1` until you have more data.
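* **Label Encoding:** The target variable (priority) is converted to integer codes using `LabelEncoder`. scikit-learn classifiers also accept string labels directly, so this step is optional; it is kept here to make the mapping explicit. The codes follow the alphabetical order of the class names (`Critical`=0, `High`=1, `Low`=2, `Medium`=3) and carry no severity ordering. The `LabelEncoder` is saved with the model so that numeric predictions can be decoded back into priority levels.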
* **Train/Test Split:** Divides the data into training and testing sets using `train_test_split`. The `random_state` ensures reproducibility. `stratify=y` is added to maintain the same proportions of priority classes in both the training and testing sets. This is *very* important, especially if your priority classes are imbalanced.
* **Model Training:** Uses `MultinomialNB` (Naive Bayes) which is a simple but often effective algorithm for text classification.
* **Model Evaluation:** Evaluates the model using `accuracy_score` and `classification_report`. The classification report provides precision, recall, F1-score, and support for each priority class.
* **Prediction Function (`predict_priority`):** This is the *most important* addition. It encapsulates the prediction process, making it easy to use the trained model to predict the priority of new tickets. It takes the raw text of the ticket description, vectorizes it using the *trained* vectorizer, passes it to the trained model, and then decodes the numerical prediction back into a priority level string. This is critical for real-world use.
* **Example Usage:** Demonstrates how to use the `predict_priority` function.
* **Saving and Loading the Model and Vectorizer:** This is *essential* for deploying the model. The trained model, vectorizer, and `LabelEncoder` are saved to disk using `joblib`, so you can load them later without retraining, and an example of loading and using the saved artifacts is included. `joblib` is generally preferred to `pickle` for scikit-learn models because it's more efficient for large NumPy arrays. Two caveats: `joblib` files are pickle-based, so only load files you trust, and load them with the same scikit-learn version that saved them.
* **Clearer Comments:** Comments are added to explain each step of the code.
* **Error Handling:** The script includes no explicit error handling (e.g., `try...except` blocks). Note that `max_df` and `min_df` are vocabulary filters, not error handling: a new ticket whose words were all filtered out or never seen in training is transformed into an all-zero vector without raising, and the model then falls back on its class priors. Robust error handling would depend on the specifics of your data pipeline.
* **String Handling:** The code assumes that the input `ticket_description` is a string. You might need to add error handling to deal with other data types.
* **Clearer Output:** The code prints intermediate results (e.g., the shape of the TF-IDF matrix, the encoded labels) to help you understand what's going on.
* **Dependencies:** The code imports all the necessary libraries at the beginning.
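The String Handling and Error Handling points above can be sketched as a thin wrapper around prediction. This is a minimal illustration under stated assumptions, not part of the original script: the function name `predict_priority_checked`, the `fallback` value, and the four-ticket corpus are all made up, and string labels are passed to the classifier directly to keep the example short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus (made up for this sketch, not real ticket data)
texts = [
    "site down critical error",
    "critical failure data loss",
    "feature request dark mode",
    "typo on about page",
]
labels = ["Critical", "Critical", "Low", "Low"]

vectorizer = TfidfVectorizer(stop_words="english")
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)


def predict_priority_checked(ticket_description, vectorizer, model,
                             fallback="Needs review"):
    """Return a predicted priority, or `fallback` when no known term matches."""
    if not isinstance(ticket_description, str):
        raise TypeError("ticket_description must be a string")
    if not ticket_description.strip():
        raise ValueError("ticket_description must not be empty")

    features = vectorizer.transform([ticket_description])
    # nnz == 0 means none of the ticket's tokens are in the learned vocabulary,
    # so the classifier would only be echoing its class priors.
    if features.nnz == 0:
        return fallback
    return str(model.predict(features)[0])


print(predict_priority_checked("critical error on checkout", vectorizer, model))
print(predict_priority_checked("zzz qqq", vectorizer, model))
```

Routing out-of-vocabulary tickets to a human queue, rather than returning a default class, is one possible policy; the right fallback depends on your support process.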
How to Run the Code:
1. **Install Libraries:** Open a terminal or command prompt and run:
```bash
pip install pandas scikit-learn joblib
```
2. **Save the Code:** Save the code as a Python file (e.g., `priority_predictor.py`).
3. **Run the Script:** Run the script from the terminal:
```bash
python priority_predictor.py
```
Important Considerations for Real-World Use:
* **Data Quality:** The performance of the model depends heavily on the quality of the training data. Make sure your data is accurate, consistent, and representative of the types of tickets you'll be predicting on.
* **Data Volume:** The more training data you have, the better the model will generally perform.
* **Feature Engineering:** Experiment with different feature engineering techniques. For example, you could try:
    * Using different TF-IDF parameters (e.g., `ngram_range` to consider phrases instead of just single words).
    * Adding other features, such as the ticket category, the customer's service level agreement (SLA), or the time of day the ticket was submitted.
* **Model Selection:** Try different machine learning algorithms. Naive Bayes is a good starting point, but you might get better results with other algorithms, such as Support Vector Machines (SVMs), Random Forests, or Gradient Boosting.
* **Hyperparameter Tuning:** Tune the hyperparameters of the machine learning algorithm to optimize its performance. You can use techniques such as grid search or random search.
* **Monitoring and Retraining:** Continuously monitor the performance of the model and retrain it periodically with new data to ensure that it remains accurate.
* **Integration:** Integrate the model into your customer support system so that it can automatically predict the priority of new tickets.
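The feature-engineering and tuning suggestions above can be tried together with a scikit-learn `Pipeline` and `GridSearchCV`. The sketch below is illustrative only: the twelve tickets are invented so that each class has enough rows for 3-fold stratified cross-validation, and the parameter grid is an arbitrary starting point, not a recommendation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy data: 6 tickets per class so 3-fold stratified CV is possible
texts = [
    "site down critical error", "critical system failure",
    "security breach suspected", "data loss critical outage",
    "production outage urgent", "urgent security vulnerability",
    "feature request dark mode", "minor typo on page",
    "feedback about colors", "request for documentation",
    "small typo in footer", "feature idea for export",
]
labels = ["Critical"] * 6 + ["Low"] * 6

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])

# Step names prefix the parameter names: <step>__<parameter>
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "clf__alpha": [0.1, 1.0],                # Naive Bayes smoothing strength
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="f1_macro")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print("Best CV macro-F1:", round(search.best_score_, 3))
print(search.predict(["critical outage on site"]))
```

One design point: because the vectorizer sits inside the `Pipeline`, each cross-validation fold fits TF-IDF on its own training portion only, so held-out tickets do not influence the vocabulary or IDF weights. The fitted `search.best_estimator_` can also be saved with `joblib.dump` as a single file. To compare other classifiers, swap the `clf` step and adjust the grid accordingly.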
In summary, the script covers training, evaluating, saving, and loading a ticket-priority model, and the `predict_priority` function makes the trained model usable from other code. Stratifying the train/test split keeps class proportions comparable when priorities are imbalanced, and saving the `LabelEncoder` alongside the model is what allows numeric predictions to be decoded correctly. Results on real tickets will depend on data quality and on ongoing monitoring and retraining.