Feature Usage Heatmap Python
👤 Sharing: AI
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


def create_feature_usage_heatmap(data, features, user_id_col, title="Feature Usage Heatmap", cmap="YlGnBu"):
    """
    Generates a heatmap visualizing feature usage per user.

    Args:
        data (pd.DataFrame): A Pandas DataFrame containing user data and feature columns.
        features (list): A list of column names representing the features.
        user_id_col (str): The name of the column containing user IDs.
        title (str): The title of the heatmap.
        cmap (str): The colormap to use for the heatmap (e.g., "YlGnBu", "viridis", "coolwarm").
            See matplotlib documentation for available colormaps.

    Returns:
        None: Displays the heatmap using matplotlib.
    """
    # 1. Prepare the data for the heatmap: build a user-feature matrix.
    # Each row represents a user, and each column represents a feature.
    # A cell is 1 if the user has at least one non-null value for that feature, else 0.
    # This assumes 'data' already has one column per feature. If 'data' is a log of
    # interactions, pre-process it into per-feature columns first.
    user_ids = data[user_id_col].unique()

    # Initialize with integer zeros so the matrix has a numeric dtype.
    # (An empty DataFrame followed by fillna(0) can end up as object dtype,
    # which sns.heatmap cannot plot.)
    user_feature_matrix = pd.DataFrame(0, index=user_ids, columns=features)

    # Iterate through the original data and mark features used by each user as 1.
    for user_id in user_ids:
        user_data = data[data[user_id_col] == user_id]  # Subset data for the current user
        for feature in features:
            # Skip features whose column is absent from the DataFrame; they stay at 0.
            if feature in user_data.columns:
                # Any non-null value counts as usage. If every value is null/NaN,
                # the user never used the feature and the cell stays at 0.
                if not user_data[feature].isnull().all():
                    user_feature_matrix.loc[user_id, feature] = 1  # Mark feature as used

    # 2. Visualization using seaborn and matplotlib.
    # Scale the figure with the data, but keep a minimum size so small inputs stay legible.
    fig_width = max(6, len(features) / 2)
    fig_height = max(4, len(user_feature_matrix.index) / 5)
    plt.figure(figsize=(fig_width, fig_height))

    # Create the heatmap, including a color bar label.
    sns.heatmap(user_feature_matrix, cmap=cmap, cbar_kws={'label': 'Feature Usage'})
    plt.title(title)
    plt.xlabel("Features")
    plt.ylabel("User IDs")

    # Rotate the x-axis labels for better readability.
    plt.xticks(rotation=45, ha="right")
    plt.yticks(rotation=0)  # Keep y-axis labels horizontal
    plt.tight_layout()  # Adjust layout so labels are not clipped
    plt.show()


if __name__ == '__main__':
    # Example usage with sample data (replace with your actual data).
    data = pd.DataFrame({
        'user_id': [1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5],
        'feature_A': [1, None, 1, 1, None, None, 1, 1, None, 1, None, 1],
        'feature_B': [None, 1, None, 1, 1, 1, None, None, 1, None, 1, None],
        'feature_C': [1, 1, None, None, 1, 1, 1, None, None, None, None, None],
        'feature_D': [None, None, None, None, None, None, None, 1, 1, 1, 1, 1]
    })

    # Define the features to visualize
    features = ['feature_A', 'feature_B', 'feature_C', 'feature_D']

    # User ID column name
    user_id_col = 'user_id'

    # Create and display the heatmap
    create_feature_usage_heatmap(data, features, user_id_col, title="Sample Feature Usage")
```
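The nested loop in the function above can also be written as one grouped operation. The following is a minimal sketch, not part of the original function: `build_usage_matrix` is a hypothetical helper name, and it assumes the same "any non-null value counts as usage" rule. One difference to be aware of is that `groupby` sorts the user IDs, whereas the loop keeps them in order of first appearance.

```python
import pandas as pd


def build_usage_matrix(data, features, user_id_col):
    """Return a users x features matrix of 0/1 flags (1 = any non-null value).

    Hypothetical helper for illustration; it mirrors the loop-based logic.
    """
    # Only look at requested features that actually exist as columns.
    present = [f for f in features if f in data.columns]
    # True where a value is recorded, then "any" per user.
    used = data[present].notna().groupby(data[user_id_col]).any()
    # Add requested-but-absent features back as all-zero columns.
    return used.reindex(columns=features, fill_value=False).astype(int)


sample = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "feature_A": [1, None, None, None],
    "feature_B": [None, 1, 1, None],
})
matrix = build_usage_matrix(sample, ["feature_A", "feature_B", "feature_Z"], "user_id")
print(matrix)
```

On large datasets this avoids filtering the DataFrame once per user, which is usually the slow part of the loop.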
Key points and explanations:
* **Clear Function Definition:** The code is encapsulated in a function, `create_feature_usage_heatmap`, making it reusable and organized.
* **Data Preparation:** The function initializes a user-feature matrix with 0s, then iterates over the original DataFrame and marks a cell with 1 when that user has at least one non-null value for the feature. This aggregates row-level data into the one-row-per-user format the heatmap needs. The `if feature in user_data.columns` check skips any requested feature whose column is absent from the DataFrame, so that feature stays at 0.
* **Missing Data Handling:** `.isnull().all()` tests whether every value a user has for a feature is NaN/None. If so, the cell stays at 0. Note that any non-null value counts as usage, including an explicit 0.
* **Clear Comments:** Comments are added to explain each step of the process, improving readability and understanding.
* **Customizable Colormap:** The `cmap` argument allows you to specify the colormap for the heatmap.
* **Informative Title:** The `title` argument lets you customize the title of the heatmap.
* **Example Data:** The example data includes missing values (`None`) to show how gaps are handled, since real-world data often has them.
* **Adjusted Figure Size:** The `figsize` is computed from the number of features and users, so the figure grows with the dataset.
* **X-Axis Label Rotation:** X-axis labels are rotated for better readability, especially when dealing with a large number of features.
* **`tight_layout()`:** `plt.tight_layout()` is added to prevent labels from overlapping, which is a common issue in matplotlib plots.
* **Colorbar Label:** The colorbar is labeled to clarify what the color intensity represents.
* **Complete Example:** The `if __name__ == '__main__':` block provides a complete and executable example that demonstrates how to use the function. This makes the code immediately usable.
* **Sparse Data:** A user with no recorded value for a feature keeps the initial 0, so sparse data needs no special casing.
* **Error Handling:** A requested feature with no matching column is skipped instead of raising a `KeyError`.
* **Simple `if` Statement:** The inner check is a single condition, `not user_data[feature].isnull().all()`, which keeps the loop easy to read.
* **Documentation within the function (Docstring):** The function has a docstring that explains what the function does, its arguments, and what it returns. This is important for code maintainability and reusability.
* **Installation Instructions:** The code relies on common data science libraries. Make sure you have these installed: `pip install pandas matplotlib seaborn numpy`
* **Handles Cases Where the DataFrame Doesn't Contain the Feature:** Every per-user subset shares the DataFrame's columns, so `if feature in user_data.columns:` is effectively a check on the DataFrame as a whole. If the column is missing, the feature appears in the heatmap as unused for all users.
* **Correctness:** A user who never uses a particular feature gets 0 in that cell of the user-feature matrix.
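Because the `.isnull().all()` rule treats any non-null value as usage, a column that stores explicit 0s for "not used" would be marked as used. The sketch below shows one way to apply a stricter rule. `usage_flags` and its `zero_is_unused` parameter are hypothetical names introduced here for illustration, and the sketch assumes the feature columns are numeric.

```python
import pandas as pd


def usage_flags(data, features, user_id_col, zero_is_unused=True):
    """Return 0/1 usage flags per user (hypothetical helper, numeric columns assumed)."""
    values = data[features]
    if zero_is_unused:
        # Treat both missing values and explicit zeros as "not used".
        used = values.fillna(0).ne(0)
    else:
        # Same rule as the original function: any non-null value counts.
        used = values.notna()
    return used.groupby(data[user_id_col]).any().astype(int)


sample = pd.DataFrame({
    "user_id": [1, 1, 2],
    "feature_A": [0, 0, 1],
    "feature_B": [None, 1, 0],
})
strict = usage_flags(sample, ["feature_A", "feature_B"], "user_id")
lenient = usage_flags(sample, ["feature_A", "feature_B"], "user_id", zero_is_unused=False)
print(strict)
print(lenient)
```

Which rule is right depends on how your data encodes non-usage, so it is worth checking a few rows before plotting.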
How to Use:
1. **Install Libraries:** Make sure you have the necessary libraries installed: `pip install pandas matplotlib seaborn numpy`
2. **Prepare Your Data:** Replace the sample data with your actual data. The data should be in a Pandas DataFrame format. Crucially, your data should have a column for user IDs and columns for the features you want to analyze.
3. **Call the Function:** Call the `create_feature_usage_heatmap` function, passing in your data, the list of features, the name of the user ID column, and optionally a title and colormap.
4. **View the Heatmap:** The function will display the heatmap.
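If your data is an interaction log (one row per event) rather than one column per feature, it needs reshaping before step 2. Here is a minimal sketch using `pd.crosstab`. The `events` DataFrame, its `feature` column, and the feature names are made-up assumptions for illustration.

```python
import pandas as pd

# Hypothetical event log: one row per interaction.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "feature": ["search", "search", "export", "search", "export", "share"],
})

# Count interactions per user and feature (users as rows, features as columns).
usage_counts = pd.crosstab(events["user_id"], events["feature"])

# Turn zero counts into missing values so that "non-null" means "used",
# then move user_id from the index back into a regular column.
wide = usage_counts.where(usage_counts > 0).reset_index()
wide.columns.name = None

# Feature columns are everything except the user ID column.
features = [col for col in wide.columns if col != "user_id"]
print(wide)
print(features)
```

`wide`, `features`, and `"user_id"` can then be passed to `create_feature_usage_heatmap`. If you want to show usage frequency instead of a binary flag, the numeric `usage_counts` table could be plotted directly with `sns.heatmap`.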
This gives a complete, runnable, and documented solution for creating a feature usage heatmap in Python. It handles missing values and absent feature columns as described above.