Feature Usage Heatmap Python

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

def create_feature_usage_heatmap(data, features, user_id_col, title="Feature Usage Heatmap", cmap="YlGnBu"):
    """
    Generates a heatmap visualizing feature usage per user.

    Args:
        data (pd.DataFrame):  A Pandas DataFrame containing user data and feature columns.
        features (list): A list of column names representing the features.
        user_id_col (str): The name of the column containing user IDs.
        title (str): The title of the heatmap.
        cmap (str): The colormap to use for the heatmap (e.g., "YlGnBu", "viridis", "coolwarm").
                     See matplotlib documentation for available colormaps.

    Returns:
        None: Displays the heatmap using matplotlib.
    """

    # 1. Build the user-feature matrix: one row per user, one column per feature.
    #    A cell is 1 if the user has at least one non-null value for that feature
    #    and 0 otherwise. Any non-null value counts as usage, including 0, so
    #    encode "not used" as None/NaN. If 'data' is a raw interaction log, reshape
    #    it into one column per feature before calling this function.

    # Initialize with integer zeros so seaborn receives numeric (not object) data.
    user_ids = data[user_id_col].unique()
    user_feature_matrix = pd.DataFrame(0, index=user_ids, columns=features, dtype=int)

    # Mark each feature a user has touched with 1.
    for user_id in user_ids:
        user_data = data[data[user_id_col] == user_id]  # Rows for the current user
        for feature in features:
            # Features that are not columns of 'data' are skipped and stay 0.
            if feature in user_data.columns and user_data[feature].notna().any():
                user_feature_matrix.loc[user_id, feature] = 1

    # 2. Visualization using seaborn and matplotlib

    # Scale the figure with the matrix, with a floor so small inputs stay legible.
    n_users = len(user_feature_matrix.index)
    plt.figure(figsize=(max(6, len(features) / 2), max(4, n_users / 5)))

    # Create the heatmap
    sns.heatmap(user_feature_matrix, cmap=cmap, cbar_kws={'label': 'Feature Usage'}) # Include the color bar label

    plt.title(title)
    plt.xlabel("Features")
    plt.ylabel("User IDs")

    # Rotate the x-axis labels for better readability
    plt.xticks(rotation=45, ha="right")
    plt.yticks(rotation=0)  # Keep Y axis labels horizontal

    plt.tight_layout()  # Adjust layout to prevent labels from overlapping
    plt.show()


if __name__ == '__main__':
    # Example Usage

    # Sample Data (replace with your actual data)
    data = pd.DataFrame({
        'user_id': [1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5, 5],
        'feature_A': [1, None, 1, 1, None, None, 1, 1, None, 1, None, 1],
        'feature_B': [None, 1, None, 1, 1, 1, None, None, 1, None, 1, None],
        'feature_C': [1, 1, None, None, 1, 1, 1, None, None, None, None, None],
        'feature_D': [None, None, None, None, None, None, None, 1, 1, 1, 1, 1]
    })

    # Define the features to visualize
    features = ['feature_A', 'feature_B', 'feature_C', 'feature_D']

    # User ID column name
    user_id_col = 'user_id'

    # Create and display the heatmap
    create_feature_usage_heatmap(data, features, user_id_col, title="Sample Feature Usage")
```

Key points and explanations:

* **Reusable function:** The logic is wrapped in `create_feature_usage_heatmap`, so it can be called on any DataFrame that has a user ID column and feature columns.
* **Data preparation:** The function initializes a user-feature matrix of 0s (one row per unique user ID, one column per feature), then sets a cell to 1 when that user has at least one non-null value for that feature. This collapses multiple rows per user into the single row the heatmap needs.
* **Missing data handling:** A feature counts as unused for a user only when every one of that user's values for it is `None`/NaN. Any non-null value counts as usage, including 0, so "not used" should be encoded as a missing value rather than as 0.
* **Missing feature columns:** The `if feature in user_data.columns` check skips any name in `features` that is not a column of the DataFrame. Its cells stay 0 instead of raising a `KeyError`. The check applies to the whole DataFrame, not to individual users.
* **Customizable colormap and title:** The `cmap` and `title` arguments set the colormap and the plot title.
* **Colorbar label:** The colorbar is labeled "Feature Usage" to show what the color intensity represents.
* **Example data:** The sample data has several rows per user and includes `None` values, so both the per-user aggregation and the missing-value handling are exercised.
* **Figure size:** The `figsize` is derived from the number of features and users, so larger matrices get a larger figure. Very small or very large inputs may still need manual adjustment.
* **Axis label rotation:** X-axis labels are rotated 45 degrees, which helps when there are many features. Y-axis labels stay horizontal.
* **`tight_layout()`:** `plt.tight_layout()` adjusts the layout to reduce overlapping and clipped labels, a common issue in matplotlib plots.
* **Complete example:** The `if __name__ == '__main__':` block is a runnable example of calling the function.
* **Docstring:** The docstring documents what the function does, its arguments, and what it returns.
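
The data preparation step described in the list above can also be written without explicit loops. The following is a minimal sketch, not the function's actual implementation: `build_usage_matrix` is a hypothetical helper name, and it assumes the same rule as the function (any non-null value counts as usage). One difference to be aware of is that `groupby` sorts the user IDs and drops rows whose user ID is missing, whereas the loop version keeps users in order of first appearance.

```python
import pandas as pd

def build_usage_matrix(data, features, user_id_col):
    """Return a users x features DataFrame of 0/1 flags (hypothetical helper)."""
    # True wherever a feature cell holds any non-null value.
    has_value = data[features].notna()
    # A user counts as having used a feature if any of their rows is non-null.
    used = has_value.groupby(data[user_id_col]).any()
    # Convert booleans to integers so the heatmap receives numeric data.
    return used.astype(int)

data = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "feature_A": [1, None, None, None],
    "feature_B": [None, 1, 1, None],
})
matrix = build_usage_matrix(data, ["feature_A", "feature_B"], "user_id")
print(matrix)
```

The resulting `matrix` can be passed straight to `sns.heatmap`. Unlike the loop version, this sketch raises a `KeyError` if a name in `features` is not a column of `data`.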

How to Use:

1.  **Install Libraries:**  Make sure you have the necessary libraries installed: `pip install pandas matplotlib seaborn numpy`
2.  **Prepare Your Data:** Replace the sample data with your actual data. It should be a Pandas DataFrame with one column of user IDs and one column per feature you want to analyze. Leave a feature's cell empty (`None`/NaN) in rows where the user did not use it.
3.  **Call the Function:** Call the `create_feature_usage_heatmap` function, passing in your data, the list of features, the name of the user ID column, and optionally a title and colormap.
4.  **View the Heatmap:** The function will display the heatmap.
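
If your data is a raw interaction log rather than one column per feature, step 2 needs a reshape first. The sketch below is one possible way to do it with `pd.crosstab`. The `events` frame and its `feature` column are hypothetical, chosen only for illustration. Because the function treats any non-null value as usage, zero counts are converted to NaN before the result is handed over.

```python
import pandas as pd

# Hypothetical event log: one row per interaction.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 3, 3],
    "feature": ["search", "search", "export", "search", "export", "share"],
})

# Count interactions per user and feature; absent pairs become 0.
counts = pd.crosstab(events["user_id"], events["feature"])

# Replace zero counts with NaN so "never used" reads as missing,
# then move user_id back into an ordinary column.
wide = counts.where(counts > 0).reset_index()
features = list(counts.columns)
print(wide)
```

With this shape, `create_feature_usage_heatmap(wide, features, "user_id")` should work as in the example above.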

Together, the function and the example give a runnable, documented starting point for a feature usage heatmap in Python. Missing values are handled as described above. Data in other shapes, such as a raw interaction log, must first be reshaped into one column per feature.
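
One caveat to the missing-value handling is worth seeing concretely: the function's rule is "any non-null value means used", so an explicit 0 is counted as usage. The sketch below compares that rule with a threshold-based alternative. The threshold rule is an assumption about what some datasets need, not something the function does, and the sample values are made up for illustration.

```python
import pandas as pd

data = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "feature_A": [0, 0, 1, 0],           # user 1 has explicit zeros only
    "feature_B": [None, 2, None, None],  # user 2 has no values at all
})
features = ["feature_A", "feature_B"]
users = data["user_id"]

# Null-based rule (what the function uses): any non-null value counts as usage.
by_notnull = data[features].notna().groupby(users).any().astype(int)

# Threshold-based rule (an assumption): only values greater than 0 count.
by_positive = data[features].fillna(0).gt(0).groupby(users).any().astype(int)

print(by_notnull)
print(by_positive)
```

The two results differ only for user 1 and `feature_A`, where the zeros are non-null but not positive. Which rule is right depends on how your data encodes "not used".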