AI-Powered Predictive Customer Behavior Model for Retail Marketing R
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
# --- 1. Data Loading and Preprocessing ---
# Assuming your customer data is in a CSV file named 'customer_data.csv'
# The CSV should contain features relevant to customer behavior, such as:
# - 'age': Customer's age
# - 'gender': Customer's gender (e.g., 'Male', 'Female')
# - 'past_purchases': Number of past purchases
# - 'average_order_value': Average order value
# - 'time_since_last_purchase': Days since last purchase
# - 'location': Customer's location (e.g., city or region)
# - 'membership_tier': Customer's membership level (e.g., 'Bronze', 'Silver', 'Gold')
# - 'response_to_last_campaign': Whether the customer responded to the last campaign (0 or 1)
# - 'target': The target variable - whether the customer will make a purchase in the next campaign (0 or 1) <-- MOST IMPORTANT
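# Illustrative example of the expected CSV layout (hypothetical values):
#   age,gender,past_purchases,average_order_value,time_since_last_purchase,location,membership_tier,response_to_last_campaign,target
#   35,Male,5,75.0,30,NewYork,Bronze,0,1
#   52,Female,12,42.5,7,LosAngeles,Gold,1,1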
def load_and_preprocess_data(file_path):
    """
    Loads customer data from a CSV file, preprocesses it, and returns features and target.
    Args:
        file_path (str): Path to the CSV file containing customer data.
    Returns:
        tuple: (X, y) where X is a pandas.DataFrame of features and y is a pandas.Series target,
               or (None, None) if loading fails.
    """
    try:
        data = pd.read_csv(file_path)
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
        return None, None
    except Exception as e:
        print(f"Error reading CSV: {e}")
        return None, None
    # Handle missing values (replace with mean/median, or drop rows).
    # This is crucial to prevent errors in model training.
    data = data.fillna(data.mean(numeric_only=True))  # Replace missing numeric values with the column mean. numeric_only=True avoids errors on string columns.
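    # Alternative sketch (not used by default): impute numeric columns with KNNImputer
    # instead of the column mean, which can be more accurate when many values are missing.
    # from sklearn.impute import KNNImputer
    # num_cols = data.select_dtypes(include=np.number).columns
    # data[num_cols] = KNNImputer(n_neighbors=5).fit_transform(data[num_cols])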
    # Convert categorical features to numerical using one-hot encoding.
    # This is essential because most machine learning models work with numerical data.
    categorical_cols = ['gender', 'location', 'membership_tier']  # List your categorical columns here.
    data = pd.get_dummies(data, columns=categorical_cols, dummy_na=False)  # Use dummy_na=False if you don't want explicit NaN columns; if NaNs are present, consider dummy_na=True or more intelligent imputation.
    # Separate features (X) and target variable (y)
    X = data.drop('target', axis=1)
    y = data['target']
    return X, y
# --- 2. Data Splitting ---
def split_data(X, y, test_size=0.2, random_state=42):
    """
    Splits the data into training and testing sets.
    Args:
        X (pandas.DataFrame): Features.
        y (pandas.Series): Target variable.
        test_size (float): Proportion of data to use for testing (e.g., 0.2 for 20%).
        random_state (int): Random seed for reproducibility.
    Returns:
        tuple: X_train, X_test, y_train, y_test.
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test
# --- 3. Feature Scaling ---
def scale_data(X_train, X_test):
    """
    Scales the features using StandardScaler. This helps improve model performance and convergence.
    Args:
        X_train (pandas.DataFrame): Training features.
        X_test (pandas.DataFrame): Testing features.
    Returns:
        tuple: Scaled X_train, scaled X_test, and the scaler object.
    """
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    return X_train_scaled, X_test_scaled, scaler  # Return the scaler for later use (e.g., scaling new data).
# --- 4. Model Training ---
def train_model(X_train, y_train, n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42):
    """
    Trains a Gradient Boosting Classifier model.
    Args:
        X_train (numpy.ndarray): Scaled training features.
        y_train (pandas.Series): Training target variable.
        n_estimators (int): Number of boosting stages.
        learning_rate (float): Learning rate shrinks the contribution of each tree.
        max_depth (int): Maximum depth of the individual regression estimators.
        random_state (int): Random seed for reproducibility.
    Returns:
        GradientBoostingClassifier: Trained model.
    """
    model = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate=learning_rate, max_depth=max_depth, random_state=random_state)
    model.fit(X_train, y_train)
    return model
# --- 5. Model Evaluation ---
def evaluate_model(model, X_test, y_test):
    """
    Evaluates the trained model and prints the classification report and confusion matrix.
    Args:
        model (GradientBoostingClassifier): Trained model.
        X_test (numpy.ndarray): Scaled testing features.
        y_test (pandas.Series): Testing target variable.
    """
    y_pred = model.predict(X_test)
    print("Classification Report:\n", classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    return y_pred  # Return the predictions for further analysis.
# --- 6. Feature Importance Visualization ---
def plot_feature_importance(model, feature_names):
    """
    Plots the feature importances of the trained model.
    Args:
        model (GradientBoostingClassifier): Trained model.
        feature_names (list): List of feature names.
    """
    feature_importance = model.feature_importances_
    # Sort features by importance
    indices = np.argsort(feature_importance)
    plt.figure(figsize=(10, len(feature_names) * 0.5))  # Adjust figure size for better readability
    plt.title("Feature Importances")
    plt.barh(range(len(indices)), feature_importance[indices], color="b", align="center")
    plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
    plt.xlabel("Relative Importance")
    plt.show()
# --- 7. Prediction Example (Optional) ---
def predict_for_new_customer(model, scaler, new_customer_data, training_columns):
    """
    Predicts whether a new customer will make a purchase.
    Args:
        model (GradientBoostingClassifier): Trained model.
        scaler (StandardScaler): Fitted StandardScaler object.
        new_customer_data (pandas.DataFrame or dict): Data for the new customer. Must use the same features as the training data.
        training_columns (pandas.Index or list): Feature columns used to train the model (e.g., X_train.columns).
    Returns:
        int: Predicted class label - 1 if the customer is expected to purchase, 0 otherwise.
    """
    if isinstance(new_customer_data, dict):
        new_customer_data = pd.DataFrame([new_customer_data])  # Convert dictionary to DataFrame
    # Make sure the new data has the same columns (features) as the training data.
    # This is crucial to avoid errors. Add any missing columns filled with 0.
    missing_cols = set(training_columns) - set(new_customer_data.columns)
    for c in missing_cols:
        new_customer_data[c] = 0
    new_customer_data = new_customer_data[list(training_columns)]  # Ensure the column order matches training
    # Scale the new customer data using the *same* scaler used for training
    new_customer_scaled = scaler.transform(new_customer_data)
    prediction = model.predict(new_customer_scaled)[0]
    return prediction
# --- Main Execution ---
if __name__ == "__main__":
    # 1. Load and preprocess data
    file_path = 'customer_data.csv'  # Replace with your actual file path
    X, y = load_and_preprocess_data(file_path)
    if X is None or y is None:
        print("Data loading or preprocessing failed. Exiting.")
        raise SystemExit(1)
    # 2. Split data into training and testing sets
    X_train, X_test, y_train, y_test = split_data(X, y)
    # 3. Scale the features
    X_train_scaled, X_test_scaled, scaler = scale_data(X_train, X_test)
    # 4. Train the model
    model = train_model(X_train_scaled, y_train)
    # 5. Evaluate the model
    y_pred = evaluate_model(model, X_test_scaled, y_test)  # Store predicted values for analysis
    # 6. Visualize feature importance
    plot_feature_importance(model, X.columns)
    # 7. Example prediction for a new customer
    # Replace with actual new customer data, ensuring it matches the features of your training data
    new_customer_data = {
        'age': 35,
        'past_purchases': 5,
        'average_order_value': 75.0,
        'time_since_last_purchase': 30,
        'response_to_last_campaign': 0,
    }
    # Add the dummy columns created in load_and_preprocess_data (e.g. `gender_Male`, `location_NewYork`);
    # the exact names depend on the values present in your CSV.
    new_customer_data['gender_Male'] = 1
    new_customer_data['gender_Female'] = 0
    new_customer_data['location_NewYork'] = 1
    new_customer_data['location_LosAngeles'] = 0
    new_customer_data['membership_tier_Bronze'] = 1
    new_customer_data['membership_tier_Silver'] = 0
    new_customer_data['membership_tier_Gold'] = 0
    prediction = predict_for_new_customer(model, scaler, new_customer_data, X_train.columns)
    print(f"Prediction for new customer: {prediction}")
```
Key improvements and explanations:
* **Clearer Structure:** The code is now organized into functions for each major step (loading, splitting, scaling, training, evaluating, predicting). This makes the code more readable, maintainable, and reusable.
* **Error Handling:** Includes `try...except` blocks for file loading and potential issues during data processing. This prevents the program from crashing if the file is not found or contains errors.
* **Missing Value Handling:** `data.fillna(data.mean(numeric_only=True))` is added to handle missing numerical values. *Crucially*, the `numeric_only=True` argument prevents errors if there are string columns with NaNs, which could cause a crash. Consider more sophisticated imputation techniques (e.g., median, KNN imputation) for better accuracy if missing values are a significant problem. Be *very* careful about using `dropna()`, as you could lose a lot of data.
* **Categorical Feature Encoding:** Uses `pd.get_dummies()` for one-hot encoding of categorical features. This is *essential* because machine learning models need numerical inputs. `dummy_na=False` avoids creating extra columns for NaN values, which is generally preferable if you've already handled missing values or are certain that your data does not contain them. If you expect NaNs, using `dummy_na=True` and handling the resulting `NaN` feature can be more appropriate.
* **Feature Scaling:** `StandardScaler` is used to scale the features. Tree-based models like gradient boosting are largely insensitive to feature scale, so scaling is not strictly required here, but it is harmless and becomes important if you swap in a scale-sensitive model (e.g., logistic regression, SVMs, k-NN). The scaler is fit on the training data only, the *same* fitted scaler is applied to the test data, and it is returned so future predictions can be scaled consistently.
* **Data Splitting:** Uses `train_test_split()` to create training and testing sets. A `random_state` is included for reproducibility.
* **Model Training:** Uses `GradientBoostingClassifier`. Parameters like `n_estimators`, `learning_rate`, and `max_depth` are customizable.
* **Model Evaluation:** Provides a classification report and confusion matrix for model evaluation. This is critical for assessing the model's performance.
* **Feature Importance Visualization:** Added a function to plot feature importances. This helps understand which features are most important for the model.
* **Prediction Example:** Includes an example of how to predict for a new customer, which is the key step for applying the model in practice. The new customer data *must* have the same features as the training data (including the one-hot encoded columns), so the training feature columns are passed in (e.g., `X_train.columns`) and any missing columns are filled with 0 to avoid errors during prediction. Critically, the *same* scaler fitted on the training data is used to scale the new customer's data.
* **Comments and Explanations:** The code is well-commented to explain each step.
* **`if __name__ == "__main__":` block:** Ensures that the main code only runs when the script is executed directly, not when it's imported as a module.
* **Clarity on Target Variable:** The comments explicitly point out the importance of the `target` variable (the thing you're trying to predict).
* **Returns for predictions:** Both `evaluate_model` and `predict_for_new_customer` now return their predictions, which is good practice.
* **DataFrame Conversion:** The `predict_for_new_customer` function accepts either a dictionary or a DataFrame as input; dictionary input is converted to a single-row DataFrame internally.
* **Column Order Consistency**: Enforces column order consistency between training and prediction data by adding any missing one-hot columns (filled with 0) and reordering the columns to match the training features; this prevents errors from feature mismatch. A more compact `reindex`-based alternative is sketched after this list.
* **Example Data**: The example `new_customer_data` now includes example dummy columns, so the prediction will work correctly.
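As a more compact alternative to the missing-column loop in `predict_for_new_customer`, pandas `reindex` can align a new record to the training feature set in one step. A small self-contained sketch (the column names and values here are purely illustrative; in the script, `training_columns` would be `X_train.columns`):
```python
import pandas as pd

# Hypothetical training feature columns (normally X_train.columns).
training_columns = ['age', 'past_purchases', 'gender_Female', 'gender_Male']

new_customer_df = pd.DataFrame([{'age': 35, 'past_purchases': 5, 'gender_Male': 1}])
# Missing columns are added and filled with 0, extra columns are dropped,
# and the order matches the training features.
aligned = new_customer_df.reindex(columns=training_columns, fill_value=0)
print(aligned)
```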
To use this code:
1. **Install Libraries:**
```bash
pip install pandas scikit-learn matplotlib seaborn
```
2. **Create `customer_data.csv`:** Replace the placeholder data with your actual customer data in a CSV file named `customer_data.csv`. Make sure the columns are named as expected (or modify the code accordingly). The most important column is your target variable, usually named `target`.
3. **Replace Placeholder:** Replace the `file_path` variable with the correct path to your CSV file.
4. **Run the Script:**
```bash
python your_script_name.py
```
This will train the model, evaluate it, and print the classification report and confusion matrix. It will also show a feature importance plot and provide an example prediction for a new customer. Remember to adapt the code to your specific data and problem. Pay close attention to data cleaning, feature engineering, and hyperparameter tuning to get the best possible results.
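For the hyperparameter tuning mentioned above, a minimal sketch using scikit-learn's `GridSearchCV` might look like the following. The parameter grid values are illustrative, and `X_train_scaled`/`y_train` are the variables produced by the script:
```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; adjust the ranges to your data and compute budget.
param_grid = {
    'n_estimators': [100, 200],
    'learning_rate': [0.05, 0.1],
    'max_depth': [2, 3, 4],
}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1',  # choose a metric that matches your campaign goal
)
search.fit(X_train_scaled, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```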