AI-Powered Fraud Detection in Python

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from imblearn.over_sampling import SMOTE  # Handling imbalanced datasets
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Data Loading and Preprocessing

def load_and_preprocess_data(filepath="fraud_data.csv"):
    """
    Loads data from a CSV file, handles missing values, and performs basic feature engineering.

    Args:
        filepath (str): Path to the CSV file.  Defaults to "fraud_data.csv".

    Returns:
        pandas.DataFrame: The preprocessed DataFrame.
    """

    try:
        df = pd.read_csv(filepath)
    except FileNotFoundError:
        print(f"Error: File not found at {filepath}.  Make sure the file exists.")
        return None  # or raise the exception if appropriate

    # Handle missing values (replace with mean/median/mode as appropriate)
    for col in df.columns:
        if df[col].isnull().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                df[col] = df[col].fillna(df[col].mean())  # or df[col].median() for robustness
            else:  # Handle categorical columns, filling with the most frequent value
                df[col] = df[col].fillna(df[col].mode()[0])  # mode()[0] gets the first mode if there are multiple

    # Feature Engineering (example: creating a transaction amount per customer)
    if 'customer_id' in df.columns and 'transaction_amount' in df.columns:
        df['transaction_amount_per_customer'] = df.groupby('customer_id')['transaction_amount'].transform('mean')

    return df


# 2. Feature Selection and Data Splitting

def feature_selection_and_splitting(df, target_variable='is_fraud', test_size=0.2, random_state=42):
    """
    Selects features, splits data into training and testing sets, and handles imbalanced data.

    Args:
        df (pandas.DataFrame): The DataFrame.
        target_variable (str): The name of the target variable (e.g., 'is_fraud'). Defaults to 'is_fraud'.
        test_size (float): The proportion of the data to use for testing. Defaults to 0.2.
        random_state (int): Random seed for reproducibility. Defaults to 42.

    Returns:
        tuple: (X_train, X_test, y_train, y_test) - training and testing sets.
    """

    if df is None:  # Check if the DataFrame is valid
        print("Error: DataFrame is None.  Check data loading.")
        return None, None, None, None

    X = df.drop(target_variable, axis=1)
    y = df[target_variable]

    # One-hot encode any categorical features: SMOTE and the random forest
    # both require purely numeric input.
    X = pd.get_dummies(X, drop_first=True)

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y) # stratify ensures same proportion of classes in both splits

    # Handle Imbalanced Data using SMOTE (Synthetic Minority Oversampling Technique)
    smote = SMOTE(random_state=random_state)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

    return X_train_resampled, X_test, y_train_resampled, y_test


# 3. Model Training

def train_model(X_train, y_train, model_type='random_forest', random_state=42):
    """
    Trains a fraud detection model.

    Args:
        X_train (pandas.DataFrame): Training features.
        y_train (pandas.Series): Training target.
        model_type (str): The type of model to train ('random_forest'). Defaults to 'random_forest'.
        random_state (int): Random seed for reproducibility. Defaults to 42.

    Returns:
        object: The trained model.
    """
    if model_type == 'random_forest':
        model = RandomForestClassifier(random_state=random_state)  # You can add hyperparameters here
    else:
        print("Error: Invalid model type. Using RandomForestClassifier.")
        model = RandomForestClassifier(random_state=random_state)

    model.fit(X_train, y_train)
    return model

# 4. Model Evaluation

def evaluate_model(model, X_test, y_test):
    """
    Evaluates the trained model.

    Args:
        model (object): The trained model.
        X_test (pandas.DataFrame): Testing features.
        y_test (pandas.Series): Testing target.
    """

    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    print(f"Accuracy: {accuracy}")

    conf_matrix = confusion_matrix(y_test, y_pred)
    print("Confusion Matrix:\n", conf_matrix)

    class_report = classification_report(y_test, y_pred)
    print("Classification Report:\n", class_report)

    # Visualize Confusion Matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(conf_matrix, annot=True, fmt="d", cmap="Blues")
    plt.title("Confusion Matrix")
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()

# 5. Main Execution

if __name__ == "__main__":
    # Replace 'fraud_data.csv' with the actual path to your CSV file
    data = load_and_preprocess_data("fraud_data.csv")

    if data is not None:  # Check that the data loaded successfully

        X_train, X_test, y_train, y_test = feature_selection_and_splitting(data)

        if X_train is not None:  # Check that the split was successful
            model = train_model(X_train, y_train)
            evaluate_model(model, X_test, y_test)
        else:
            print("Error splitting the data")

```

Key improvements and explanations:

* **Complete and Executable:**  This is a self-contained script that can be run directly, assuming you have a CSV file named `fraud_data.csv` in the same directory.
* **Error Handling:**  Includes a `try...except` block in `load_and_preprocess_data` to catch `FileNotFoundError` if the specified CSV file doesn't exist, and checks whether the DataFrame is `None` after loading and splitting. These checks let the program fail gracefully with a helpful error message instead of crashing.
* **Data Loading and Preprocessing:**
    * Handles missing values using `fillna()`. Critically, it distinguishes between numeric and categorical columns, filling numeric columns with the mean (or median) and categorical columns with the mode. This prevents errors and ensures appropriate imputation.
    * Includes a basic example of feature engineering: `transaction_amount_per_customer`.  You should adapt this to your specific dataset (a couple of further illustrative features are sketched after this list).
* **Feature Selection and Splitting:**
    * Clearly separates feature selection (dropping the target variable) from data splitting, and one-hot encodes categorical features with `pd.get_dummies` so that SMOTE and the model receive purely numeric input.
    * Uses `train_test_split` to divide the data into training and testing sets.
    * **Crucially, uses `stratify=y` in `train_test_split`.**  This is essential for imbalanced datasets to ensure that the training and testing sets have the same proportion of fraudulent and non-fraudulent transactions.  Without this, your test set might not be representative of real-world data.
    * **Imbalanced Data Handling:**  Uses `SMOTE` (Synthetic Minority Oversampling Technique) to address the class imbalance problem, where fraudulent transactions are typically much rarer than legitimate ones. This is a *critical* step for fraud detection; a quick check of its effect is sketched after this list.
* **Model Training:**
    * Trains a `RandomForestClassifier`. You can easily swap this out for other models like Logistic Regression, SVM, or Gradient Boosting.
    * Includes a `model_type` argument so that other models can be easily incorporated (one possible extension is sketched after this list).
* **Model Evaluation:**
    * Calculates accuracy, confusion matrix, and classification report.
    * Prints the classification report, which gives you precision, recall, and F1-score for each class (fraudulent and non-fraudulent). **This is much more informative than just accuracy for imbalanced datasets.**  You need to look at the recall and precision for the 'fraud' class to see how well the model is detecting fraud.
    * **Visualization:** Includes code to plot the confusion matrix as a heatmap using `seaborn`. This makes it much easier to understand the model's performance.
* **Modularity:** The code is broken down into functions for data loading, preprocessing, feature selection, model training, and evaluation. This makes the code more readable, maintainable, and reusable.
* **Clear Comments and Docstrings:**  The code is well-commented and includes docstrings for each function explaining its purpose, arguments, and return values.
* **Reproducibility:**  Sets `random_state` in `train_test_split`, `SMOTE`, and `RandomForestClassifier` to ensure consistent results across multiple runs.
* **`if __name__ == "__main__":` block:** This ensures that the main execution code is only run when the script is executed directly, not when it's imported as a module.
* **Dependency Management:**  Uses `import` statements to clearly declare the required libraries.  You'll need to install these using `pip`:
   ```bash
   pip install pandas scikit-learn imbalanced-learn matplotlib seaborn
   ```
* **Realistic Data Handling:** Addresses the common problem of missing values and offers strategies for both numerical and categorical columns.
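
A couple of further feature-engineering ideas, continuing the preprocessing example above. This is a sketch that assumes the same `customer_id` and `transaction_amount` columns as the example CSV; the new feature names are illustrative, not part of the original script:

```python
# Hypothetical extra features built on the example CSV's columns.
# Per-customer transaction count can flag unusually active accounts.
df['transaction_count_per_customer'] = (
    df.groupby('customer_id')['transaction_amount'].transform('count')
)

# How far a transaction sits from that customer's average amount;
# large deviations are often a useful fraud signal.
df['amount_deviation_from_customer_mean'] = (
    df['transaction_amount'] - df['transaction_amount_per_customer']
)
```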
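
To see what SMOTE actually does to the class balance, you can print the label counts before and after resampling. A minimal sketch, assuming it sits inside `feature_selection_and_splitting` right after the existing `fit_resample` call:

```python
from collections import Counter  # stdlib; place the import at the top of the file

smote = SMOTE(random_state=random_state)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# Class counts before vs. after oversampling: SMOTE synthesizes new
# minority-class (fraud) samples until the two classes are balanced.
print("Before SMOTE:", Counter(y_train))
print("After SMOTE: ", Counter(y_train_resampled))
```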
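
One possible way to extend `train_model` for the model swap mentioned above. This is a sketch, not part of the original script; it keeps the same fallback behaviour and adds two scikit-learn classifiers:

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def train_model(X_train, y_train, model_type='random_forest', random_state=42):
    """Trains a fraud detection model; model_type selects the algorithm."""
    models = {
        'random_forest': RandomForestClassifier(random_state=random_state),
        'logistic_regression': LogisticRegression(max_iter=1000, random_state=random_state),
        'gradient_boosting': GradientBoostingClassifier(random_state=random_state),
    }
    model = models.get(model_type)
    if model is None:
        print(f"Error: Invalid model type '{model_type}'. Using RandomForestClassifier.")
        model = RandomForestClassifier(random_state=random_state)
    model.fit(X_train, y_train)
    return model
```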

To use this code:

1. **Create `fraud_data.csv`:**  Create a CSV file named `fraud_data.csv` with your fraud data. The file *must* have a column named `is_fraud` (or change the `target_variable` argument in `feature_selection_and_splitting` to match your column name) that contains the target variable (0 for not fraud, 1 for fraud).  Include other relevant features in the CSV.  Example:

   ```csv
   customer_id,transaction_amount,location,is_fraud
   1,100,"New York",0
   2,500,"London",1
   3,25,"New York",0
   4,1000,"Paris",1
   5,75,"Tokyo",0
   ...
   ```

2. **Install Libraries:** Run the `pip install` command above to install the necessary libraries.

3. **Run the Script:** Execute the Python script: `python your_script_name.py`

4. **Analyze Results:** Carefully examine the accuracy, confusion matrix, and classification report. Pay particular attention to the precision and recall for the 'fraud' class.  The confusion matrix visualization will help you see which types of errors the model is making; the snippet below shows one way to pull out the fraud-class metrics programmatically.
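
If you want those fraud-class numbers programmatically rather than by reading the printed report, `classification_report` can return a dictionary. A small sketch, assuming fraud is labeled `1` and `y_pred` comes from `evaluate_model`:

```python
from sklearn.metrics import classification_report

# output_dict=True returns nested dicts keyed by the string form of each label.
report = classification_report(y_test, y_pred, output_dict=True)
fraud = report['1']  # metrics for the fraud class
print(f"Fraud precision: {fraud['precision']:.3f}")
print(f"Fraud recall:    {fraud['recall']:.3f}")
print(f"Fraud F1-score:  {fraud['f1-score']:.3f}")
```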

Remember to adapt the feature engineering and model parameters to your specific dataset and requirements. On the parameter side, one possible starting point is a grid search; the sketch below assumes the resampled training data from the script, uses placeholder grid values, and picks `scoring='recall'` on the assumption that catching fraud matters most.
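
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; widen or narrow it based on your dataset size.
param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring='recall',  # assumed priority: catching fraudulent transactions
    cv=5,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
model = search.best_estimator_  # use in place of train_model's output
```

Good luck!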