Automated Data Visualization Generator for Business Intelligence Python
```python
import os
from itertools import combinations

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def create_visualizations(data, output_dir="visualizations"):
    """
    Automatically generates various visualizations for a given dataset.

    Args:
        data (pd.DataFrame): The input data as a Pandas DataFrame.
        output_dir (str): The directory where the generated visualizations will be saved.
            Defaults to "visualizations". Creates the directory if it doesn't exist.
    """
    # Create the output directory; exist_ok=True makes this a no-op if it already exists
    os.makedirs(output_dir, exist_ok=True)

    # --- Descriptive Statistics ---
    print("Generating Descriptive Statistics...")
    print(data.describe())  # Print basic descriptive statistics to console. Useful for initial overview.

    # --- Histograms for numerical features ---
    print("Generating Histograms...")
    numerical_cols = data.select_dtypes(include=['number']).columns
    for col in numerical_cols:
        plt.figure(figsize=(8, 6))  # Create a new figure for each plot
        sns.histplot(data[col], kde=True)  # Plot histogram with KDE (Kernel Density Estimate)
        plt.title(f'Distribution of {col}')
        plt.xlabel(col)
        plt.ylabel('Frequency')
        plt.savefig(os.path.join(output_dir, f'histogram_{col}.png'))  # Save the plot
        plt.close()  # Close the plot to free memory

    # --- Bar plots for categorical features ---
    print("Generating Bar Plots...")
    categorical_cols = data.select_dtypes(include=['object', 'category']).columns
    for col in categorical_cols:
        plt.figure(figsize=(8, 6))
        counts = data[col].value_counts()  # Get the counts of each category
        # Handle too many categories. Cut off after the top 20 to avoid unreadable plots
        if len(counts) > 20:
            print(f"Warning: Column '{col}' has more than 20 unique categories. Displaying only top 20.")
            counts = counts.head(20)
        # Convert labels to strings so unused levels of a 'category' dtype are not plotted
        sns.barplot(x=counts.index.astype(str), y=counts.values)
        plt.title(f'Distribution of {col}')
        plt.xlabel(col)
        plt.ylabel('Count')
        plt.xticks(rotation=45, ha='right')  # Rotate x-axis labels for readability
        plt.tight_layout()  # Adjust layout to prevent labels from overlapping
        plt.savefig(os.path.join(output_dir, f'barplot_{col}.png'))
        plt.close()

    # Scatter plots and the correlation heatmap need at least two numerical columns
    if len(numerical_cols) >= 2:
        # --- Scatter plots for numerical feature pairs ---
        print("Generating Scatter Plots...")
        # Generate all possible pairs of numerical columns
        for col1, col2 in combinations(numerical_cols, 2):
            plt.figure(figsize=(8, 6))
            sns.scatterplot(x=data[col1], y=data[col2])
            plt.title(f'Scatter Plot of {col1} vs {col2}')
            plt.xlabel(col1)
            plt.ylabel(col2)
            plt.savefig(os.path.join(output_dir, f'scatterplot_{col1}_{col2}.png'))
            plt.close()

        # --- Correlation Heatmap for numerical features ---
        print("Generating Correlation Heatmap...")
        plt.figure(figsize=(10, 8))
        correlation_matrix = data[numerical_cols].corr()
        sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
        plt.title('Correlation Matrix')
        plt.savefig(os.path.join(output_dir, 'correlation_heatmap.png'))
        plt.close()

    print(f"Visualizations saved to: {output_dir}")


if __name__ == '__main__':
    # --- Example Usage ---
    # Create a sample DataFrame (replace with your actual data loading)
    sample_data = pd.DataFrame({
        'Age': [25, 30, 22, 35, 28, 40, 45, 32, 27, 38],
        'Income': [50000, 60000, 45000, 70000, 55000, 80000, 90000, 65000, 52000, 75000],
        'Education': ['Bachelor', 'Master', 'High School', 'PhD', 'Bachelor', 'Master', 'PhD', 'Bachelor', 'High School', 'Master'],
        'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Los Angeles'],
        'Category': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'C', 'A'],
        'Score': [75, 80, 68, 92, 78, 85, 95, 70, 65, 88]
    })

    # Specify the output directory
    output_directory = "generated_visualizations"

    # Call the function to generate visualizations
    create_visualizations(sample_data, output_dir=output_directory)
    print("Example finished. Check the 'generated_visualizations' folder.")
```
Key improvements and explanations:
* **Clearer Function Definition:** The `create_visualizations(data, output_dir="visualizations")` function encapsulates the entire visualization process. It takes the DataFrame `data` and an optional `output_dir` as input. If `output_dir` is omitted, the plots are saved to a directory named `visualizations`.
* **Directory Creation:** The function creates the output directory with `os.makedirs` if it doesn't already exist, so the script runs without manual setup in a fresh environment. Passing `exist_ok=True` to `os.makedirs` does the same in a single call and raises no error when the directory is already present.
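* **Descriptive Statistics:** Calls `data.describe()` and prints the result to the console; nothing is written to a file. For a DataFrame with mixed column types, `describe()` summarizes the numerical columns by default: count, mean, standard deviation, minimum, quartiles, and maximum, with the median shown as the `50%` row. This is a helpful starting point for understanding the data.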
* **Numerical and Categorical Column Detection:** The code now dynamically identifies numerical and categorical columns using `data.select_dtypes(include=['number'])` and `data.select_dtypes(include=['object', 'category'])`. This makes the code more adaptable to different datasets without manual column specification. The `category` dtype is explicitly included.
* **Histogram Generation:** Generates histograms for each numerical column using `sns.histplot(data[col], kde=True)`. `kde=True` adds a Kernel Density Estimate line for a smoother visualization of the distribution. Critically, `plt.close()` is called *after* saving each figure to free memory and prevent plots from overlapping or causing memory issues, especially with larger datasets.
* **Bar Plot Generation:** Generates bar plots for each categorical column using `sns.barplot`. Includes `plt.xticks(rotation=45, ha='right')` to rotate x-axis labels for better readability when categories have long names. Also, `plt.tight_layout()` is added to prevent overlapping labels. Handles columns with *many* categories by only displaying the top 20 to avoid creating unreadable plots. A warning message is printed to the console in these cases.
* **Scatter Plot Generation:** Generates scatter plots for all *pairs* of numerical columns. Uses `itertools.combinations` for efficient pair generation. Includes a check to make sure there are at least two numerical columns before attempting to generate scatter plots. The `plt.close()` call is essential here, too.
* **Correlation Heatmap:** Generates a correlation heatmap for numerical features using `sns.heatmap`. `annot=True` displays the correlation values on the heatmap, `cmap='coolwarm'` provides a visually appealing color scheme, and `fmt=".2f"` formats the annotation to two decimal places.
* **Clearer Saving:** Uses `os.path.join(output_dir, f'histogram_{col}.png')` to create platform-independent file paths for saving the plots. Using the f-string notation helps to easily include the column name in the filename.
* **Memory Management:** Calls `plt.close()` after saving each plot. Figures created through `pyplot` stay in memory until they are closed, and Matplotlib warns once more than 20 are open at the same time. Without the `plt.close()` calls, a run that produces many plots keeps every figure alive and can exhaust memory.
* **Informative Output:** Prints messages to the console indicating which visualizations are being generated and where they are saved, which gives the user feedback during execution. It also prints a warning when a categorical column has more than 20 unique values and the bar plot is truncated.
* **Example Usage:** Includes a complete example of how to use the `create_visualizations` function with a sample DataFrame. It also specifies the output directory. This makes the code immediately runnable.
* **Comments and Docstrings:** Added detailed comments to explain the purpose of each section of the code. The `create_visualizations` function also has a docstring explaining its arguments and purpose.
* **Handling Missing Values:** Seaborn and Pandas generally handle missing values (NaN) gracefully in plotting functions by excluding them. If you need more control (e.g., imputing missing values), add pre-processing steps *before* calling `create_visualizations`. For example, `data = data.fillna(data.mean(numeric_only=True))` replaces missing numerical values with the column mean; `numeric_only=True` is needed because recent Pandas versions raise an error when `mean()` is called on a DataFrame that also contains text columns. Choose an imputation strategy that is appropriate for your data.
* **Customization:** The code is designed to be easily customizable. You can modify the plotting parameters (e.g., colors, styles, labels) within the loops to tailor the visualizations to your specific needs.
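* **Scalability:** Because each figure is closed after it is saved, the number of open figures does not grow with the number of plots. The number of scatter plots still grows quadratically with the number of numerical columns (n(n-1)/2 pairs for n columns), so very wide datasets produce many files and long run times.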
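* **`ha` argument for rotated labels:** `plt.xticks(rotation=45, ha='right')` uses `ha`, the short alias of `horizontalalignment`. Setting it to `'right'` anchors the right end of each rotated label at its tick, so the labels line up with their bars.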
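
The pre-processing step mentioned under "Handling Missing Values" could look roughly like the sketch below. It is a minimal illustration, not part of the original script: `impute_for_plotting` is a hypothetical helper name, and it uses the column median instead of the mean as an assumption, since the median is less sensitive to outliers. Non-numerical columns are left untouched.

```python
import pandas as pd


def impute_for_plotting(data: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `data` with numerical NaNs replaced by the column median.

    Hypothetical helper (not in the original script). Text and category
    columns are not modified, so their missing values stay missing.
    """
    # numeric_only=True skips text columns, which have no median
    medians = data.median(numeric_only=True)
    # fillna with a Series fills each column using the value under its own label
    return data.fillna(medians)


if __name__ == "__main__":
    raw = pd.DataFrame({
        "Age": [25.0, None, 35.0],
        "City": ["New York", None, "Chicago"],
    })
    cleaned = impute_for_plotting(raw)
    print(cleaned)
```

The helper returns a new DataFrame and leaves the input unchanged, so the cleaned copy can be passed to `create_visualizations` while the raw data stays available for comparison.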
This revised version is more robust, memory-efficient, and easier to use. It provides a solid foundation for building a more advanced automated data visualization tool. Before running the code, install the necessary libraries with `pip install pandas matplotlib seaborn`.
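
Two details of the script become more noticeable on wide datasets: the number of scatter plots grows with every pair of numerical columns, and column names are written verbatim into file names, so a name containing `/` would be read as a path separator. The sketch below illustrates both points under stated assumptions. `count_scatter_plots` and `safe_filename` are hypothetical helpers that do not appear in the original script, and the choice of replacement character is arbitrary.

```python
import re
from math import comb


def count_scatter_plots(n_numerical_cols: int) -> int:
    """Number of scatter plots produced for n numerical columns.

    One plot per unordered pair of columns, i.e. n * (n - 1) / 2.
    """
    return comb(n_numerical_cols, 2)


def safe_filename(column_name) -> str:
    """Turn a column name into a string that is safe to embed in a file name.

    Hypothetical helper: every run of characters other than letters, digits,
    '.', '_' and '-' is replaced by a single underscore. Different column
    names can map to the same result, so this is a sketch, not a guarantee
    of unique file names.
    """
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", str(column_name)).strip("_")
    return cleaned or "column"  # fall back if nothing usable remains


if __name__ == "__main__":
    # The sample data has three numerical columns: Age, Income, Score
    print(count_scatter_plots(3))
    print(count_scatter_plots(20))
    print(safe_filename("Revenue (USD)/Unit"))
```

With the three numerical columns of the sample data this gives 3 scatter plots, while 20 numerical columns would already give 190. A helper like `safe_filename` could be applied to `col` inside the f-strings passed to `plt.savefig`.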