AI-Powered Predictive Customer Behavior Model for Retail Marketing
```python
# Title: AI-Powered Predictive Customer Behavior Model for Retail Marketing
# Description: This script demonstrates a simplified predictive customer behavior model for retail marketing using Python.
# It uses machine learning to predict whether a customer will make a purchase based on past behavior.
# It utilizes libraries like pandas for data handling, scikit-learn for machine learning, and matplotlib for visualization.
# Languages: Python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns # For better plot styling
# 1. Data Loading and Preprocessing
def load_and_preprocess_data(filepath):
    """
    Loads customer data from a CSV file and preprocesses it.

    Args:
        filepath (str): Path to the CSV file containing customer data.

    Returns:
        pandas.DataFrame: Preprocessed DataFrame, or None if the file cannot be found.
    """
    try:
        data = pd.read_csv(filepath)
    except FileNotFoundError:
        print(f"Error: File not found at {filepath}")
        return None

    # Handle missing values (replace with mean for numerical columns, mode for categorical)
    for column in data.columns:
        if data[column].isnull().any():  # Check if there are any NaN values in the column
            if pd.api.types.is_numeric_dtype(data[column]):
                data[column] = data[column].fillna(data[column].mean())  # Numerical: use mean
            else:
                data[column] = data[column].fillna(data[column].mode()[0])  # Categorical: use mode

    # Convert categorical features to numerical using one-hot encoding
    categorical_cols = data.select_dtypes(include=['object']).columns
    data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)  # drop_first to prevent multicollinearity
    return data
# 2. Feature Selection and Data Splitting
def feature_selection_and_split(data, target_column, test_size=0.2, random_state=42):
    """
    Separates the features from the target variable, then splits the data into training and testing sets.

    Args:
        data (pandas.DataFrame): Preprocessed DataFrame.
        target_column (str): Name of the target column (e.g., 'Purchased').
        test_size (float): Proportion of data to use for testing (default: 0.2).
        random_state (int): Random seed for reproducibility (default: 42).

    Returns:
        tuple: X_train, X_test, y_train, y_test DataFrames/Series (all None if the target column is missing).
    """
    if target_column not in data.columns:
        print(f"Error: Target column '{target_column}' not found in the data.")
        return None, None, None, None
    X = data.drop(target_column, axis=1)
    y = data[target_column]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test
# 3. Model Training
def train_model(X_train, y_train, n_estimators=100, random_state=42):
    """
    Trains a Random Forest Classifier model.

    Args:
        X_train (pandas.DataFrame): Training features.
        y_train (pandas.Series): Training target variable.
        n_estimators (int): Number of trees in the forest (default: 100).
        random_state (int): Random seed for reproducibility (default: 42).

    Returns:
        sklearn.ensemble.RandomForestClassifier: Trained model.
    """
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)
    model.fit(X_train, y_train)
    return model
# 4. Model Evaluation
def evaluate_model(model, X_test, y_test):
    """
    Evaluates the trained model using accuracy, a classification report, and a confusion matrix.

    Args:
        model (sklearn.ensemble.RandomForestClassifier): Trained model.
        X_test (pandas.DataFrame): Testing features.
        y_test (pandas.Series): Testing target variable.

    Returns:
        None (prints evaluation metrics and shows a heatmap).
    """
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

    cm = confusion_matrix(y_test, y_pred)
    print("\nConfusion Matrix:")
    print(cm)

    # Visualize the confusion matrix with a Seaborn heatmap
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
    plt.xlabel("Predicted Labels")
    plt.ylabel("True Labels")
    plt.title("Confusion Matrix")
    plt.show()
# 5. Feature Importance Visualization
def visualize_feature_importance(model, feature_names):
    """
    Visualizes feature importance using a bar chart.

    Args:
        model (sklearn.ensemble.RandomForestClassifier): Trained model.
        feature_names (list): List of feature names.

    Returns:
        None (displays plot).
    """
    importances = model.feature_importances_
    feature_importances = pd.Series(importances, index=feature_names)
    feature_importances = feature_importances.sort_values(ascending=False)

    plt.figure(figsize=(10, 6))
    sns.barplot(x=feature_importances.values, y=feature_importances.index)
    plt.xlabel("Feature Importance")
    plt.ylabel("Features")
    plt.title("Feature Importance Ranking")
    plt.show()
# 6. Prediction on New Data (Example)
def predict_new_data(model, new_data):
    """
    Predicts the target variable for new data.

    Args:
        model (sklearn.ensemble.RandomForestClassifier): Trained model.
        new_data (pandas.DataFrame): New data to predict on (must have the same features as the training data).

    Returns:
        numpy.ndarray: Predicted values.
    """
    predictions = model.predict(new_data)
    return predictions
# Main Execution Block
if __name__ == "__main__":
    # 1. Load and Preprocess Data
    data_filepath = "customer_data.csv"  # Replace with your actual file path
    data = load_and_preprocess_data(data_filepath)
    if data is None:
        exit()  # Exit if data loading failed

    # Display basic information about the data
    print("Data Shape:", data.shape)
    print("\nData Info:")
    data.info()
    print("\nFirst 5 rows of the data:")
    print(data.head())

    # 2. Feature Selection and Data Splitting
    target_column = "Purchased"  # Replace with your target column name
    X_train, X_test, y_train, y_test = feature_selection_and_split(data, target_column)
    if X_train is None:
        exit()  # Exit if data splitting failed

    # 3. Train Model
    model = train_model(X_train, y_train)

    # 4. Evaluate Model
    evaluate_model(model, X_test, y_test)

    # 5. Visualize Feature Importance
    visualize_feature_importance(model, X_train.columns)

    # 6. Predict on New Data (Example)
    # Create a sample new data point (adjust features to match your data).
    # Note: pd.get_dummies keeps the original category text, so a 'City' value
    # of 'New York' produces the column 'City_New York' (space included).
    new_data = pd.DataFrame({
        'Age': [35],
        'Salary': [60000],
        'Gender_Male': [1],    # Example: 'Gender' was one-hot encoded and 'Male' is a resulting column
        'City_New York': [0],  # Add other features here as needed
    })
    # Ensure new data has the same columns as the training data (important for consistency)
    for col in X_train.columns:
        if col not in new_data.columns:
            new_data[col] = 0  # Add missing columns and fill with 0
    new_data = new_data[X_train.columns]  # Enforce the training column order
    predictions = predict_new_data(model, new_data)
    print("\nPredictions for New Data:", predictions)
    # Possible next steps: hyperparameter tuning, cross-validation,
    # and probability-based targeting (see the sketches below the code).
```
Key improvements and explanations:
* **Comprehensive Comments:** The code is thoroughly commented, explaining each step, its purpose, and the reasoning behind choices.
* **Error Handling:** Includes `try...except` blocks to handle potential `FileNotFoundError` when loading the data. It also checks for the presence of the `target_column` and exits gracefully if there are problems, providing informative error messages. This makes the script much more robust.
* **Modular Design:** The code is organized into functions for loading/preprocessing, feature selection/splitting, training, evaluation, and prediction. This improves readability, reusability, and maintainability.
* **Data Preprocessing:** Handles missing values by filling them with the mean for numerical columns and the mode for categorical columns. It also converts categorical features to numerical using one-hot encoding (crucial for most machine learning algorithms). The `drop_first=True` argument in `pd.get_dummies` prevents multicollinearity, which can negatively impact model performance.
* **Feature Importance Visualization:** Visualizes the importance of each feature, providing insights into which factors are most influential in the model's predictions. This is invaluable for understanding the model and for potential feature selection/engineering in the future.
* **Clearer Variable Names:** Uses more descriptive variable names (e.g., `X_train`, `y_test`) to improve readability.
* **Data Splitting:** Demonstrates how to split the data into training and testing sets using `train_test_split`; `random_state` makes the split reproducible. For imbalanced targets (e.g., far fewer buyers than non-buyers), passing `stratify=y` keeps the class ratio consistent in both sets, as sketched after this list.
* **Model Training:** Shows how to train a `RandomForestClassifier` model using the training data. The `n_estimators` parameter controls the number of trees in the forest (higher values generally improve performance but increase training time).
* **Model Evaluation:** Evaluates the trained model using accuracy, classification report (precision, recall, F1-score), and confusion matrix. The confusion matrix is also visualized using a heatmap for easier interpretation.
* **New Data Prediction:** Provides an example of how to use the trained model to predict the target variable for new data. **Crucially, it handles the case where the new data does not contain all the columns from the original training data.** It adds any missing columns, fills them with 0 (a common approach for absent one-hot categories), *and* enforces the same column order as the training set. This prevents errors and ensures the prediction is based on the correct feature set; a more compact version using `reindex`, plus probability scoring, is sketched after this list.
* **Seaborn for Better Visualizations:** Uses Seaborn (`sns`) for more visually appealing plots, especially the confusion matrix and feature importance.
* **`if __name__ == "__main__":` block:** Ensures that the main code block is only executed when the script is run directly (not when it's imported as a module).
* **Informative Output:** Prints data shape, info, and the first few rows of the data to help the user understand the data being used. It also prints the evaluation metrics and predictions.
* **Reproducibility:** Uses `random_state` in `train_test_split` and `RandomForestClassifier` to make the results reproducible.
* **Comments on Potential Improvements:** The script closes with comments suggesting ways to improve the model (e.g., hyperparameter tuning); a minimal tuning sketch follows this list.
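The splitting point above mentions stratification; here is a minimal sketch of the stratified variant, assuming `X` and `y` as built inside `feature_selection_and_split`:
```python
# A minimal sketch: a stratified split keeps the buy/no-buy class ratio
# (roughly) identical in the training and testing sets.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```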
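As referenced in the New Data Prediction point, here is a more compact sketch of the same column alignment using `DataFrame.reindex`, followed by probability scoring with `predict_proba`, which is often more useful than a hard 0/1 label when ranking customers for a campaign. It assumes `model` and `X_train` from the script above; the feature columns shown are illustrative:
```python
# A minimal sketch, assuming `model` and `X_train` exist as in the script above.
import pandas as pd

# Illustrative new customer; real column names must match the one-hot encoded training data.
new_data = pd.DataFrame({'Age': [35], 'Salary': [60000], 'Gender_Male': [1]})

# reindex adds any missing training columns (filled with 0), drops extras,
# and enforces the training column order in one step.
aligned = new_data.reindex(columns=X_train.columns, fill_value=0)

# predict_proba returns one column per class; with 0/1 labels, column 1 is
# the estimated purchase probability, useful for ranking marketing prospects.
purchase_prob = model.predict_proba(aligned)[:, 1]
print("Estimated purchase probability:", purchase_prob)
```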
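And, as the last point notes, a minimal hyperparameter tuning sketch using scikit-learn's `GridSearchCV`. It assumes the `X_train`/`y_train` split from the script and a dataset large enough for 5-fold cross-validation; the grid values are illustrative assumptions, not recommendations:
```python
# A minimal tuning sketch; the grid values below are illustrative assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],      # more trees: usually better, but slower
    'max_depth': [None, 5, 10],      # limiting depth can reduce overfitting
    'min_samples_leaf': [1, 5],      # larger leaves smooth the decision boundary
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,           # 5-fold cross-validation (needs enough rows per class)
    scoring='f1',   # F1 is often more informative than accuracy for imbalanced targets
    n_jobs=-1,      # use all available CPU cores
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
model = search.best_estimator_
```
`search.best_estimator_` is a drop-in replacement for the untuned model, so `evaluate_model` and `visualize_feature_importance` above work on it unchanged.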
How to Run the Code:
1. **Install Libraries:**
```bash
pip install pandas scikit-learn matplotlib seaborn
```
2. **Create a CSV File:** Create a CSV file named `customer_data.csv` (or change the `data_filepath` variable accordingly) with your customer data. The CSV file *must* have a column named "Purchased" (or whatever you set `target_column` to) which indicates whether a customer made a purchase (1) or not (0). Include other relevant features like age, salary, gender, city, etc. For example:
```csv
Age,Salary,Gender,City,Purchased
30,50000,Male,New York,0
40,75000,Female,London,1
25,40000,Male,Paris,0
35,60000,Female,New York,1
45,80000,Male,London,1
28,45000,Female,Paris,0
```
3. **Run the Script:**
```bash
python your_script_name.py
```
Remember to replace `"customer_data.csv"` and `"Purchased"` with the actual names of your file and target column, and adapt the sample `new_data` DataFrame to the features in your dataset; the quick check below prints the exact column names the one-hot encoding step produces from the sample CSV, which `new_data` has to match. The script will print the evaluation metrics and display the confusion matrix and feature importance plots.
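If it is unclear which dummy columns the preprocessing step will create, a quick check like the following helps; it assumes the six-row sample CSV above and the same `pd.get_dummies` settings the script uses:
```python
# Quick check of the one-hot column names produced from the sample CSV.
import pandas as pd

df = pd.read_csv("customer_data.csv")
encoded = pd.get_dummies(df, columns=['Gender', 'City'], drop_first=True)
print(encoded.columns.tolist())
# Expected for the sample data:
# ['Age', 'Salary', 'Purchased', 'Gender_Male', 'City_New York', 'City_Paris']
# Note that 'City_New York' keeps the space from the original value, and
# 'London' (first category alphabetically) is dropped by drop_first=True.
```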