Generating synthetic datasets for ML training in Python

```python
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split

def generate_classification_dataset(n_samples=1000, n_features=20, n_classes=2,
                                    n_informative=2, random_state=42, test_size=0.2):
    """
    Generates a synthetic classification dataset using sklearn's make_classification.

    Args:
        n_samples (int): Number of samples to generate. Defaults to 1000.
        n_features (int): Number of features (independent variables). Defaults to 20.
        n_classes (int): Number of classes (target variable values). Defaults to 2.
        n_informative (int): Number of informative features. make_classification
            requires n_classes * n_clusters_per_class (2 by default) to be at most
            2 ** n_informative, so raise this when asking for more than two classes.
            Defaults to 2.
        random_state (int): Seed for the random number generator. Defaults to 42.
        test_size (float): Proportion of the dataset to include in the test split. Defaults to 0.2.

    Returns:
        tuple: (X_train, X_test, y_train, y_test) - NumPy arrays of training and
            testing data. X holds the features and y the target labels.
    """
    X, y = make_classification(n_samples=n_samples,
                               n_features=n_features,
                               n_informative=n_informative,
                               n_redundant=0,  # no linear combinations of informative features
                               n_repeated=0,   # no duplicated features
                               n_classes=n_classes,
                               random_state=random_state)

    # Hold out a test set; pass stratify=y to train_test_split if you need the
    # class proportions preserved across both splits.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    return X_train, X_test, y_train, y_test

def generate_regression_dataset(n_samples=1000, n_features=10, noise=0.1, random_state=42, test_size=0.2):
    """
    Generates a synthetic regression dataset using sklearn's make_regression.

    Args:
        n_samples (int): Number of samples to generate. Defaults to 1000.
        n_features (int): Number of features (independent variables). Defaults to 10.
        noise (float): Standard deviation of Gaussian noise added to the output. Defaults to 0.1.
        random_state (int): Seed for the random number generator. Defaults to 42.
        test_size (float): Proportion of the dataset to include in the test split. Defaults to 0.2.

    Returns:
        tuple: (X_train, X_test, y_train, y_test) - NumPy arrays of training and
            testing data. X holds the features and y the continuous target values.
    """
    X, y = make_regression(n_samples=n_samples,
                           n_features=n_features,
                           noise=noise,
                           random_state=random_state)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

    return X_train, X_test, y_train, y_test

def create_dataframe(X, y, problem_type="classification"):
    """
    Creates a Pandas DataFrame from the generated NumPy arrays. Allows easy
    inspection and saving to CSV.

    Args:
        X (np.ndarray): Feature data.
        y (np.ndarray): Target data.
        problem_type (str): Either "classification" or "regression". Used only to
            validate the input; the target column is named "target" in both cases.

    Returns:
        pd.DataFrame: A Pandas DataFrame containing the features and target.
    """
    if problem_type not in ("classification", "regression"):
        raise ValueError("Invalid problem_type. Must be 'classification' or 'regression'.")
    df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
    df["target"] = y
    return df

if __name__ == "__main__":
    # Example usage:
    # 1. Generate a classification dataset
    X_train_class, X_test_class, y_train_class, y_test_class = generate_classification_dataset()
    print("Classification Dataset Shape:")
    print("X_train:", X_train_class.shape)
    print("X_test:", X_test_class.shape)
    print("y_train:", y_train_class.shape)
    print("y_test:", y_test_class.shape)

    # Create a DataFrame from the classification dataset
    classification_df = create_dataframe(X_train_class, y_train_class, problem_type="classification")
    print("\nClassification DataFrame head:")
    print(classification_df.head())
    # Save to CSV (optional)
    # classification_df.to_csv("synthetic_classification_data.csv", index=False)


    # 2. Generate a regression dataset
    X_train_reg, X_test_reg, y_train_reg, y_test_reg = generate_regression_dataset()
    print("\nRegression Dataset Shape:")
    print("X_train:", X_train_reg.shape)
    print("X_test:", X_test_reg.shape)
    print("y_train:", y_train_reg.shape)
    print("y_test:", y_test_reg.shape)

    # Create a DataFrame from the regression dataset
    regression_df = create_dataframe(X_train_reg, y_train_reg, problem_type="regression")
    print("\nRegression DataFrame head:")
    print(regression_df.head())
    # Save to CSV (optional)
    # regression_df.to_csv("synthetic_regression_data.csv", index=False)
```
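
As a quick illustration of varying the parameters, here is a minimal sketch. It assumes the code above is saved as `synthetic_data_generator.py` (the filename suggested in the run instructions below) and shows a three-class dataset and a noisier regression problem:

```python
# Sketch: customizing the generators above. Assumes the module is saved as
# synthetic_data_generator.py, as suggested in the run instructions.
from synthetic_data_generator import (
    generate_classification_dataset,
    generate_regression_dataset,
)

# Three classes need more informative features:
# n_classes * n_clusters_per_class (default 2) must be <= 2 ** n_informative.
X_tr, X_te, y_tr, y_te = generate_classification_dataset(
    n_samples=5000, n_classes=3, n_informative=4
)
print("3-class train/test sizes:", X_tr.shape, X_te.shape)

# A harder regression problem: fewer samples, much more target noise.
Xr_tr, Xr_te, yr_tr, yr_te = generate_regression_dataset(
    n_samples=500, n_features=5, noise=10.0
)
print("regression train/test sizes:", Xr_tr.shape, Xr_te.shape)
```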

Key improvements and explanations:

* **Clear Function Structure:** The code is organized into well-defined functions, each with a specific purpose. This makes the code more readable, maintainable, and reusable.

* **Docstrings:**  Each function includes a detailed docstring explaining its purpose, arguments, and return values.  This is crucial for understanding and using the code.

* **Classification Dataset Generation:** The `generate_classification_dataset` function uses `sklearn.datasets.make_classification` to create a synthetic classification dataset. You can control the number of samples, features, classes, and informative features, while `n_redundant` and `n_repeated` are pinned to 0 so every non-informative feature is pure noise. Remember the constraint `n_classes * n_clusters_per_class <= 2 ** n_informative` when asking for more than two classes. Crucially, the function also *splits* the data into training and testing sets using `train_test_split`, which is very important for machine learning workflows.

* **Regression Dataset Generation:**  The `generate_regression_dataset` function uses `sklearn.datasets.make_regression` to create a synthetic regression dataset.  It also splits the data into training and testing sets. The `noise` parameter controls the amount of random noise added to the target variable.

* **DataFrame Creation:** The `create_dataframe` function converts the generated NumPy arrays into a Pandas DataFrame, which is very useful for data exploration, visualization, and saving the data to a file. It generates meaningful column names (`feature_0`, `feature_1`, etc.) and stores the labels in a `target` column. The `problem_type` argument is validated up front, and a `ValueError` is raised if anything other than 'classification' or 'regression' is passed.

* **Train/Test Split:** Both the classification and regression generation functions split the data into training and testing sets using `train_test_split`, which is essential for evaluating the performance of a machine learning model. The `test_size` parameter controls the proportion of the data held out for testing.

* **Example Usage ( `if __name__ == "__main__":`)**:  The `if __name__ == "__main__":` block demonstrates how to use the functions to generate both classification and regression datasets.  It prints the shapes of the generated arrays and the head of the created DataFrames. The commented-out lines show how to save the DataFrames to CSV files.

* **Meaningful Parameters:**  The parameters of the functions have sensible default values.  You can easily adjust these parameters to generate datasets with different characteristics.  For example, you can control the number of samples, the number of features, the amount of noise, and the number of classes.

* **Random State:** Using `random_state` ensures reproducibility: with the same `random_state` value, you get the same dataset every time you run the code (see the reproducibility sketch just after this list).

* **Clear Output:**  The code prints the shapes of the generated arrays and the head of the DataFrames to the console. This makes it easy to verify that the code is working correctly.

* **Error Handling:** The `create_dataframe` function validates `problem_type` and raises a clear `ValueError` for invalid values, so mistakes fail fast with an informative message rather than silently producing a mislabeled DataFrame.

* **Pandas and Scikit-learn Dependency:**  The code now explicitly imports the necessary libraries (Pandas and Scikit-learn) and uses them effectively.
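
To make the reproducibility point concrete, here is a minimal sketch that checks two calls with the same `random_state` produce identical arrays. It again assumes the module is saved as `synthetic_data_generator.py`:

```python
import numpy as np

from synthetic_data_generator import generate_classification_dataset

# Two calls with the same random_state should yield identical splits.
a = generate_classification_dataset(random_state=7)
b = generate_classification_dataset(random_state=7)
print(all(np.array_equal(x, y) for x, y in zip(a, b)))  # True

# A different seed yields a different dataset.
c = generate_classification_dataset(random_state=8)
print(np.array_equal(a[0], c[0]))  # False
```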

How to run the code:

1.  **Install necessary libraries:**
    ```bash
    pip install pandas scikit-learn numpy
    ```
2.  **Save the code:** Save the code as a Python file (e.g., `synthetic_data_generator.py`).
3.  **Run the code:**
    ```bash
    python synthetic_data_generator.py
    ```

This will generate synthetic classification and regression datasets and print their shapes, along with the first few rows of the corresponding DataFrames, to the console. The commented-out lines demonstrate how to save the DataFrames to CSV files, which you can then use to train and evaluate your machine learning models.
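
If you want to close the loop and actually fit a model on the generated data, a minimal sketch might look like the following. Note that `LogisticRegression` is an assumption here for illustration; it is not part of the script above:

```python
from sklearn.linear_model import LogisticRegression

from synthetic_data_generator import generate_classification_dataset

# Generate a dataset and fit a simple baseline classifier.
X_train, X_test, y_train, y_test = generate_classification_dataset()
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on the held-out test split produced by the generator.
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```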