AI-Powered Fraud Detection System (Python, AI, Neural Networks)
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from sklearn.metrics import confusion_matrix, classification_report
# --- 1. Data Preparation ---
# Simulate a dataset (replace with your actual data loading)
def create_fraud_dataset(num_samples=10000, fraud_percentage=0.05):  # 5% fraud for a more realistic class imbalance
    """
    Generates a synthetic dataset for fraud detection. The dataset includes
    features like transaction amount, frequency, location, and a binary label
    indicating fraud (1) or not (0).

    Args:
        num_samples (int): The total number of data points in the dataset.
        fraud_percentage (float): The fraction of fraudulent transactions (e.g. 0.05 = 5%).

    Returns:
        pandas.DataFrame: A DataFrame containing the synthetic dataset.
    """
    np.random.seed(42)  # for reproducibility
    num_fraud = int(num_samples * fraud_percentage)
    num_normal = num_samples - num_fraud

    # Features for normal transactions
    normal_data = {
        'amount': np.random.normal(50, 20, num_normal),       # Mean amount $50, std dev $20
        'frequency': np.random.poisson(5, num_normal),        # Average 5 transactions per period
        'location_x': np.random.normal(100, 30, num_normal),  # Arbitrary location coordinates
        'location_y': np.random.normal(200, 40, num_normal),
        'is_fraud': np.zeros(num_normal, dtype=int)           # Label: 0 for normal
    }

    # Features for fraudulent transactions (different distributions)
    fraud_data = {
        'amount': np.random.normal(200, 50, num_fraud),       # Higher mean amount, higher std dev
        'frequency': np.random.poisson(1, num_fraud),         # Lower transaction frequency
        'location_x': np.random.normal(50, 20, num_fraud),    # Different location distribution
        'location_y': np.random.normal(150, 30, num_fraud),
        'is_fraud': np.ones(num_fraud, dtype=int)             # Label: 1 for fraud
    }

    normal_df = pd.DataFrame(normal_data)
    fraud_df = pd.DataFrame(fraud_data)
    df = pd.concat([normal_df, fraud_df], ignore_index=True)
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # Shuffle
    return df
data = create_fraud_dataset()
print(data.head())
print(data['is_fraud'].value_counts()) # Check class distribution
# Separate features (X) and target (y)
X = data.drop('is_fraud', axis=1)
y = data['is_fraud']
# --- 2. Data Preprocessing ---
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) #stratify to maintain class balance
# Feature Scaling (important for neural networks)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # Fit on training data and transform
X_test = scaler.transform(X_test) # Transform test data using the same scaler
# --- 3. Model Building ---
# Define the Neural Network model
model = Sequential()
model.add(Input(shape=(X_train.shape[1],))) # Explicit input layer (number of features)
model.add(Dense(128, activation='relu')) # First hidden layer
model.add(Dropout(0.2)) # Dropout for regularization
model.add(Dense(64, activation='relu')) # Second hidden layer
model.add(Dropout(0.2)) # Another dropout layer
model.add(Dense(1, activation='sigmoid')) # Output layer (sigmoid for binary classification)
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# --- 4. Model Training ---
# Train the model
history = model.fit(X_train, y_train, epochs=10, batch_size=32, validation_split=0.1) #validation_split to track validation loss and accuracy
# --- 5. Model Evaluation ---
# Evaluate the model on the test set
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Test Loss: {loss:.4f}')
print(f'Test Accuracy: {accuracy:.4f}')
# Make predictions
y_pred_proba = model.predict(X_test) # Predicted fraud probabilities
y_pred = (y_pred_proba > 0.5).astype(int) # Convert probabilities to binary predictions (0.5 threshold)
# Confusion Matrix and Classification Report
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
# --- 6. Model Saving (Optional) ---
# model.save('fraud_detection_model.h5')
# --- 7. (Optional) Visualization of Training History ---
import matplotlib.pyplot as plt
# Plot training & validation accuracy values
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('Model accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
# Plot training & validation loss values
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.tight_layout()
plt.show()
```
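If you uncomment the optional `model.save(...)` line, the trained network can be reloaded later for scoring. The sketch below shows one way to do that; the filenames are hypothetical, and it assumes the fitted `StandardScaler` was persisted separately with `joblib`, which the script above does not do.
```python
# Inference sketch: load the saved model and scaler, then score new transactions.
# Assumes model.save('fraud_detection_model.h5') was run and the scaler was saved
# beforehand with joblib.dump(scaler, 'fraud_scaler.joblib') -- both hypothetical.
import joblib
import pandas as pd
from tensorflow.keras.models import load_model

model = load_model('fraud_detection_model.h5')
scaler = joblib.load('fraud_scaler.joblib')

# New transactions must use the same columns, in the same order, as the training data.
new_transactions = pd.DataFrame({
    'amount': [45.0, 230.0],
    'frequency': [4, 1],
    'location_x': [105.0, 48.0],
    'location_y': [210.0, 152.0],
})

probabilities = model.predict(scaler.transform(new_transactions))
flags = (probabilities > 0.5).astype(int)  # same 0.5 threshold as in the evaluation step
print(probabilities.ravel())
print(flags.ravel())
```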
Key improvements and explanations:
* **Clearer Structure:** The code is now divided into logical sections (Data Preparation, Preprocessing, Model Building, Training, Evaluation, Saving, Visualization) with comments.
* **Data Simulation:** The `create_fraud_dataset` function generates synthetic data with different distributions for normal and fraudulent transactions (higher amounts, lower frequency, and shifted `location_x`/`location_y` coordinates for fraud), which is useful for testing and demonstrating the model. The `fraud_percentage` of 5% gives a more realistic, challenging class imbalance, and the fixed `np.random.seed` and `random_state` values keep the generated data reproducible.
* **Class Imbalance Handling:** This simplified example does not use resampling techniques like SMOTE, but the `stratify=y` argument in `train_test_split` is *essential*: it keeps roughly the same proportion of fraud cases in the training and testing sets as in the original dataset, so evaluation is not skewed towards the majority (non-fraudulent) class. For a weighting-based alternative, see the class-weight sketch after this list.
* **Feature Scaling:** Uses `StandardScaler`. *Crucially*, `scaler.fit_transform` is called on the *training* data only, and `scaler.transform` on the *test* data, so no information from the test set leaks into the preprocessing.
* **Neural Network Architecture:** A basic, but effective, neural network is defined using `Sequential`. Dropout layers are added after each dense layer to reduce overfitting. This helps the model generalize better to unseen data.
* **Model Compilation:** The model is compiled with the `adam` optimizer, `binary_crossentropy` loss (appropriate for binary classification), and `accuracy` as the metric.
* **Model Training:** The `fit` method trains the model. Crucially, `validation_split=0.1` is added to monitor the model's performance on a validation set during training. This helps detect overfitting early on. The `history` object returned by `fit` stores the training and validation loss and accuracy.
* **Model Evaluation:** The `evaluate` method calculates the loss and accuracy on the test set. Predictions are made using `model.predict`. The probabilities are converted to binary predictions using a threshold of 0.5.
* **Confusion Matrix and Classification Report:** The `confusion_matrix` and `classification_report` provide detailed evaluation metrics, including precision, recall, F1-score, and support for each class. This allows you to assess the model's performance in detail, including its ability to detect fraud cases (recall) and avoid false positives (precision).
* **Model Saving (Optional):** Includes code to save the trained model to a file. This allows you to load the model later without retraining it.
* **Visualization (Optional):** Adds visualization of training and validation accuracy and loss curves. This helps you diagnose overfitting or underfitting.
* **Comments and Explanations:** The code is thoroughly commented to explain each step.
* **Reproducibility:** `np.random.seed(42)` ensures consistent results when the code is run multiple times.
* **Error Handling (Minimal):** In a real-world application, you'd need to add more robust error handling and data validation.
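One lightweight way to address the imbalance discussed above, short of resampling, is to weight the fraud class more heavily during training. A minimal sketch, assuming the variables from the script above are in scope and used in place of the original `model.fit(...)` call:
```python
# Class-weight sketch: make errors on the rare fraud class cost more during training.
# Weights are derived from the training labels; values below are illustrative, not tuned.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}  # roughly {0: 0.53, 1: 10.0} at 5% fraud

history = model.fit(
    X_train, y_train,
    epochs=10, batch_size=32,
    validation_split=0.1,
    class_weight=class_weight,  # up-weights the minority (fraud) class
)
```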
How to run the code:
1. **Install Libraries:**
```bash
pip install numpy pandas scikit-learn tensorflow matplotlib
```
2. **Run the Python script:** Save the code as a `.py` file (e.g., `fraud_detection.py`) and run it from your terminal:
```bash
python fraud_detection.py
```
The code will:
* Generate a synthetic fraud dataset.
* Split the data into training and testing sets.
* Scale the features.
* Build and train a neural network model.
* Evaluate the model's performance.
* Print the confusion matrix and classification report.
* Optionally save the trained model and display visualizations.
This improved example provides a much more comprehensive and realistic starting point for building an AI-powered fraud detection system. Remember to adapt the code to your specific dataset and requirements. Pay close attention to feature engineering, data preprocessing, and model selection to achieve the best possible performance. Consider more advanced techniques for handling class imbalance if it's a significant issue in your data.
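As one illustration of such a technique, the sketch below oversamples the fraud class with SMOTE from the `imbalanced-learn` package (`pip install imbalanced-learn`), which the script above does not use. It assumes the scaled `X_train`/`y_train` from the script are in scope; only the training split is resampled, so the test set keeps the real class ratio.
```python
# SMOTE sketch: synthesize extra minority-class samples in the training split only.
import numpy as np
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print(np.bincount(np.asarray(y_train_res)))  # classes are now balanced

# Train on the resampled data; evaluate on the untouched X_test / y_test as before.
history = model.fit(X_train_res, y_train_res, epochs=10, batch_size=32, validation_split=0.1)
```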