Automated Data Visualization Generator for Business Intelligence (Python)

👤 Sharing: AI
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os  # Import the 'os' module
import warnings
warnings.filterwarnings("ignore")  # To suppress some warnings

class AutoViz:
    """
    Automated Data Visualization Generator for Business Intelligence.

    This class takes a Pandas DataFrame and automatically generates various visualizations
    to provide insights into the data.  It leverages matplotlib and seaborn libraries.

    Attributes:
        data (pd.DataFrame): The input Pandas DataFrame.
        output_path (str):  The path to the directory where the generated visualizations will be saved.
                         Defaults to 'visualizations'.
        file_prefix (str): A prefix for the generated image filenames. Defaults to 'auto_viz'.
    """

    def __init__(self, data, output_path='visualizations', file_prefix='auto_viz'):
        """
        Initializes the AutoViz object.

        Args:
            data (pd.DataFrame): The input Pandas DataFrame.
            output_path (str, optional): The directory to save visualizations. Defaults to 'visualizations'.
            file_prefix (str, optional): A prefix for the filenames. Defaults to 'auto_viz'.
        """
        if not isinstance(data, pd.DataFrame):
            raise TypeError("Input data must be a Pandas DataFrame.")

        self.data = data
        self.output_path = output_path
        self.file_prefix = file_prefix

        # Create the output directory if it doesn't exist; needed before any figure can be saved.
        os.makedirs(self.output_path, exist_ok=True)

    def visualize_distributions(self, numerical_cols, categorical_cols):
        """
        Generates histograms and box plots for numerical columns and count plots for categorical columns.

        Args:
            numerical_cols (list): A list of numerical column names.
            categorical_cols (list): A list of categorical column names.
        """
        # Numerical Columns Visualizations
        for col in numerical_cols:
            plt.figure(figsize=(12, 6))

            # Histogram
            plt.subplot(1, 2, 1)  # 1 row, 2 columns, first plot
            sns.histplot(self.data[col], kde=True)  # Enable Kernel Density Estimate
            plt.title(f'Distribution of {col}')
            plt.xlabel(col)
            plt.ylabel('Frequency')

            # Box Plot
            plt.subplot(1, 2, 2)  # 1 row, 2 columns, second plot
            sns.boxplot(x=self.data[col])
            plt.title(f'Box Plot of {col}')
            plt.xlabel(col)

            plt.tight_layout()  # Adjust layout to prevent overlapping
            filename = os.path.join(self.output_path, f"{self.file_prefix}_{col}_distribution.png") # Correct path joining
            plt.savefig(filename)
            plt.close() # close the figure to prevent memory leaks

        # Categorical Columns Visualizations
        for col in categorical_cols:
            plt.figure(figsize=(10, 6))
            sns.countplot(x=self.data[col], order = self.data[col].value_counts().index) # Order for better visualization
            plt.title(f'Count Plot of {col}')
            plt.xlabel(col)
            plt.ylabel('Count')
            plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
            plt.tight_layout()
            filename = os.path.join(self.output_path, f"{self.file_prefix}_{col}_countplot.png")
            plt.savefig(filename)
            plt.close()

    def visualize_relationships(self, numerical_cols, target_col=None):
        """
        Generates pairwise scatter plots of the numerical columns or, if a valid target column
        is provided, box plots of each numerical column grouped by the target.

        Args:
            numerical_cols (list): A list of numerical column names.
            target_col (str, optional): The name of the target variable column. Defaults to None.
        """

        if target_col:
            if target_col not in self.data.columns:
                print(f"Target column '{target_col}' not found in the data.  Visualizing pairwise relationships without a target.")
                target_col = None

        if target_col is None:  # Pairwise scatter plots
            num_cols = len(numerical_cols)
            for i in range(num_cols):
                for j in range(i + 1, num_cols):
                    col1 = numerical_cols[i]
                    col2 = numerical_cols[j]
                    plt.figure(figsize=(8, 6))
                    sns.scatterplot(x=self.data[col1], y=self.data[col2])
                    plt.title(f'Scatter Plot of {col1} vs {col2}')
                    plt.xlabel(col1)
                    plt.ylabel(col2)
                    plt.tight_layout()
                    filename = os.path.join(self.output_path, f"{self.file_prefix}_{col1}_vs_{col2}_scatter.png")
                    plt.savefig(filename)
                    plt.close()

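        else:  # Box plots against target
            for col in numerical_cols:
                if col == target_col:
                    continue  # A numeric target is also in numerical_cols; don't plot it against itself
                plt.figure(figsize=(10, 6))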
                sns.boxplot(x=self.data[target_col], y=self.data[col])
                plt.title(f'Box Plot of {col} vs {target_col}')
                plt.xlabel(target_col)
                plt.ylabel(col)
                plt.xticks(rotation=45, ha='right')
                plt.tight_layout()
                filename = os.path.join(self.output_path, f"{self.file_prefix}_{col}_vs_{target_col}_boxplot.png")
                plt.savefig(filename)
                plt.close()



    def visualize_correlations(self, numerical_cols):
        """
        Generates a heatmap of the correlation matrix for numerical columns.

        Args:
            numerical_cols (list): A list of numerical column names.
        """
        correlation_matrix = self.data[numerical_cols].corr()
        plt.figure(figsize=(10, 8))
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
        plt.title('Correlation Matrix')
        plt.tight_layout()
        filename = os.path.join(self.output_path, f"{self.file_prefix}_correlation_matrix.png")
        plt.savefig(filename)
        plt.close()


    def run(self, target_col=None):
        """
        Runs the automated visualization process.

        Args:
            target_col (str, optional): The name of the target variable column. Defaults to None.
        """

        numerical_cols = self.data.select_dtypes(include=['number']).columns.tolist()
        categorical_cols = self.data.select_dtypes(include=['object', 'category']).columns.tolist()

        if not numerical_cols and not categorical_cols:
            print("No numerical or categorical columns found in the data.  Nothing to visualize.")
            return

        # Run the distribution step unconditionally so that count plots are still
        # generated when the data has categorical columns but no numerical ones.
        self.visualize_distributions(numerical_cols, categorical_cols)

        if numerical_cols:
            self.visualize_relationships(numerical_cols, target_col)
            self.visualize_correlations(numerical_cols)

        print(f"Visualizations saved to: {self.output_path}")


# Example Usage (Illustrative)
if __name__ == '__main__':
    # 1. Sample Data (Replace with your actual data loading)
    data = pd.DataFrame({
        'age': [25, 30, 22, 35, 28, 40, 45, 32, 27, 38],
        'income': [50000, 60000, 45000, 70000, 55000, 80000, 90000, 65000, 52000, 75000],
        'city': ['New York', 'London', 'New York', 'Paris', 'London', 'Tokyo', 'Tokyo', 'Paris', 'New York', 'London'],
        'gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female'],
        'purchased': [0, 1, 0, 1, 0, 1, 1, 0, 0, 1]  # Example target variable
    })


    # 2. Instantiate the AutoViz class
    auto_viz = AutoViz(data, output_path='my_visualizations', file_prefix='data_insights')

    # 3. Run the visualization process (optionally specify a target variable)
    auto_viz.run(target_col='purchased') # Specify a target column if you have one.  Otherwise, omit.

    print("Visualization complete. Check the 'my_visualizations' directory.")
```
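
The filenames in the code above are built directly from column names, so a column such as `price/unit` would yield a path containing an unintended subdirectory. The following is a minimal sketch of one way to guard against that. `safe_name` and `plot_path` are hypothetical helpers, not part of the original class, and the choice to keep only letters, digits, `.`, `_`, and `-` is an assumption:

```python
import os
import re


def safe_name(column_name):
    """Hypothetical helper: turn a column name into a filesystem-friendly token."""
    # Replace every run of characters outside [A-Za-z0-9._-] with one underscore.
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", str(column_name))
    # Trim underscores left at the ends; fall back to a fixed token if nothing remains.
    return cleaned.strip("_") or "column"


def plot_path(output_path, file_prefix, column_name, suffix):
    """Build an output path in the same pattern as AutoViz, with a sanitized column name."""
    return os.path.join(output_path, f"{file_prefix}_{safe_name(column_name)}_{suffix}.png")


if __name__ == "__main__":
    print(plot_path("my_visualizations", "data_insights", "price/unit (USD)", "distribution"))
```

Two different column names can map to the same sanitized token (for example `a/b` and `a b`), so a real implementation might also need to append an index to keep filenames unique.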

Key improvements and explanations:

* **Clear Class Structure:** The code is organized into a class `AutoViz` for better structure, reusability, and maintainability.  This is crucial for larger projects.
* **Error Handling:** Raises a `TypeError` if the input is not a Pandas DataFrame. If `target_col` is not a column in the DataFrame, the code prints a message and falls back to pairwise scatter plots. If no numerical or categorical columns are found, it prints a message and returns without plotting.
* **Output Directory Creation:** The code now explicitly creates the output directory (`output_path`) if it doesn't exist using `os.makedirs`.  This is *essential* for the program to function correctly when saving images.  It prevents `FileNotFoundError` errors.
* **File Naming Convention:** Builds each path with `os.path.join`, which uses the correct separator on Windows, macOS, and Linux. Every filename starts with `file_prefix` and describes its plot, following patterns such as `<prefix>_<column>_distribution.png` and `<prefix>_<col1>_vs_<col2>_scatter.png`.
* **Numerical and Categorical Column Handling:** `run` identifies numerical columns with `select_dtypes(include=['number'])` and categorical columns with `select_dtypes(include=['object', 'category'])`. `visualize_distributions` draws count plots for the categorical columns. Columns of other dtypes, such as boolean or datetime, are not plotted.
* **Visualization Customization:**
    * **Histograms:** Added `kde=True` to `sns.histplot` to include Kernel Density Estimate lines, providing a smoother representation of the distribution.
    * **Count Plots:**  Orders the count plots by frequency using `order=self.data[col].value_counts().index` for better readability.
    * **Axis Labels:** Rotates x-axis labels in count plots and target-variable box plots using `plt.xticks(rotation=45, ha='right')` to prevent overlapping, especially with longer labels.
    * **Correlation Matrix:** Uses `annot=True` in the heatmap to display correlation values and `fmt=".2f"` to format them to two decimal places.
    * **Tight Layout:** Calls `plt.tight_layout()` after each plot to prevent labels and titles from overlapping.
* **Memory Management:** Calls `plt.close()` after saving each figure to release its memory. Without this, every figure stays open and memory use grows with each plot. This matters most for DataFrames with many columns, because the number of pairwise scatter plots grows quadratically with the number of numerical columns.
* **Clearer Plot Titles and Labels:**  Provides more informative titles and labels for each plot, making the visualizations easier to understand.
* **Target Variable Handling:** The `run` method now accepts an optional `target_col` argument.  If provided, it generates box plots of numerical features against the target variable.  If not provided, it generates pairwise scatter plots of numerical features.
* **Concise Code and Comments:**  The code is written in a clean and concise manner with helpful comments explaining each step.
* **Example Usage:** The `if __name__ == '__main__':` block shows how to use the `AutoViz` class: it builds a small sample DataFrame, instantiates the class, and runs the visualization process. The sample data includes a target variable, `purchased`, to demonstrate the target-column path.
* **Seaborn Plotting:** Uses Seaborn's plotting functions on top of matplotlib. The code does not call `sns.set_theme()`, so plots use matplotlib's default style; add that call if you want Seaborn's theme.
* **Warning Suppression:** Uses `warnings.filterwarnings("ignore")` to suppress some warnings that might be generated by the libraries, making the output cleaner. Use this with caution; you may want to remove this if you're debugging and need to see warnings.
* **Comprehensive Docstrings:** Includes docstrings for the class and each method, explaining their purpose and arguments. This helps with maintainability and collaboration.
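
To make the column-detection step concrete, here is a small sketch of how the two `select_dtypes` calls used in `run` classify a mixed DataFrame. The sample frame and the `skipped` list are illustrative assumptions, not part of the original code:

```python
import pandas as pd

# Illustrative frame with one column per dtype family (assumed sample data).
df = pd.DataFrame({
    "age": [25, 30, 22],
    "city": pd.Series(["Paris", "Tokyo", "Paris"], dtype="object"),
    "segment": pd.Categorical(["a", "b", "a"]),
    "is_member": [True, False, True],
    "signup": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "purchased": [0, 1, 0],
})

# The same two selections that run() performs.
numerical_cols = df.select_dtypes(include=["number"]).columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

# Columns matched by neither selection are never plotted.
skipped = [c for c in df.columns if c not in numerical_cols + categorical_cols]

print(numerical_cols)    # ['age', 'purchased']
print(categorical_cols)  # ['city', 'segment']
print(skipped)           # ['is_member', 'signup']

# Casting a boolean column to 'category' is one way to have it count-plotted.
df["is_member"] = df["is_member"].astype("category")
recast_categorical = df.select_dtypes(include=["object", "category"]).columns.tolist()
print(recast_categorical)  # ['city', 'segment', 'is_member']
```

Note that a numerically coded target such as `purchased` lands in the numerical list, so it also appears in the histograms and the correlation matrix.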

How to use:

1. **Install Libraries:**
   ```bash
   pip install pandas matplotlib seaborn
   ```

2. **Replace Sample Data:**  Replace the sample data in the `if __name__ == '__main__':` block with your own data loading code (e.g., `pd.read_csv('your_data.csv')`).

3. **Instantiate and Run:** Create an instance of the `AutoViz` class, specifying your data, output path, and file prefix:

   ```python
   auto_viz = AutoViz(your_data, output_path='your_output_directory', file_prefix='your_prefix')
   ```

4. **Run the Visualization:** Call the `run()` method:

   ```python
   auto_viz.run(target_col='your_target_column')  # If you have a target variable
   auto_viz.run() # If you don't have a target variable
   ```

5. **Check Output:**  The generated visualizations will be saved in the directory you specified in the `output_path`.
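
For the final step, a quick way to confirm what was written is to list the PNG files that carry your prefix. This is a minimal sketch; `list_outputs` is a hypothetical helper that is not part of `AutoViz`, and it assumes the `<prefix>_*.png` naming pattern used by the class:

```python
from pathlib import Path


def list_outputs(output_path, file_prefix):
    """Hypothetical helper: sorted names of PNG files saved under a given prefix."""
    directory = Path(output_path)
    if not directory.is_dir():
        return []  # Nothing has been generated yet
    return sorted(p.name for p in directory.glob(f"{file_prefix}_*.png"))


if __name__ == "__main__":
    # Directory and prefix taken from the example usage in the main script.
    for name in list_outputs("my_visualizations", "data_insights"):
        print(name)
```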

This revised code provides a robust, well-structured, and easy-to-use solution for automated data visualization. Remember to replace the example data with your own dataset.