AI-Based Fraud Detection System (Python)
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# 1. Data Preparation (Simulated Fraudulent Transaction Data)
def generate_synthetic_data(n_samples=1000):
    """Generates synthetic transaction data with a 'fraud' column."""
    np.random.seed(42)  # for reproducibility
    data = {
        'amount': np.random.normal(100, 50, n_samples),      # transaction amount (mean 100, std 50)
        'time_of_day': np.random.uniform(0, 24, n_samples),  # hour of the day (0-24)
        'transaction_type': np.random.choice(['online', 'card', 'transfer'], n_samples),
        'location': np.random.choice(['US', 'Europe', 'Asia', 'Africa'], n_samples),
        'customer_id': np.random.randint(1000, 2000, n_samples),  # sample customer IDs
    }
    df = pd.DataFrame(data)

    # Introduce some fraud based on rules and randomness.
    # Note: later rules overwrite earlier labels for rows matching several conditions.
    df['fraud'] = 0
    # High-amount transactions are more likely to be fraud
    df.loc[df['amount'] > 200, 'fraud'] = np.random.choice(
        [0, 1], size=(df['amount'] > 200).sum(), p=[0.8, 0.2])
    # Transactions late at night might be suspicious
    df.loc[df['time_of_day'] > 22, 'fraud'] = np.random.choice(
        [0, 1], size=(df['time_of_day'] > 22).sum(), p=[0.7, 0.3])
    # Certain locations are more prone to fraud in this example
    df.loc[df['location'] == 'Africa', 'fraud'] = np.random.choice(
        [0, 1], size=(df['location'] == 'Africa').sum(), p=[0.6, 0.4])
    # Ensure the label column is integer-typed
    df['fraud'] = df['fraud'].astype(int)
    return df

# Generate the synthetic dataset
data = generate_synthetic_data(n_samples=1000)
# Display the first few rows of the data
print("Sample Data:")
print(data.head())
print("\nData Description:")
print(data.describe())
print("\nFraud Class Distribution:")
print(data['fraud'].value_counts())
# 2. Feature Engineering and Data Preprocessing
# Convert categorical features to numerical using one-hot encoding
data = pd.get_dummies(data, columns=['transaction_type', 'location'])
# Define features (X) and target (y)
X = data.drop('fraud', axis=1)
y = data['fraud']
# 3. Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 4. Model Training (Random Forest Classifier)
model = RandomForestClassifier(n_estimators=100, random_state=42) # You can tune hyperparameters
model.fit(X_train, y_train)
# 5. Model Evaluation
y_pred = model.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
# Confusion Matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)
# Classification Report (Precision, Recall, F1-score)
class_report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(class_report)
# 6. Feature Importance (Optional)
feature_importances = model.feature_importances_
feature_names = X.columns
# Create a DataFrame for feature importances
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances})
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)
print("\nFeature Importances:")
print(feature_importance_df)
# 7. Visualization (Optional) - Confusion Matrix Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Not Fraud', 'Fraud'], yticklabels=['Not Fraud', 'Fraud'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
# 8. Prediction on New Data (Example)
def predict_fraud(model, amount, time_of_day, transaction_type, location, customer_id):
    """Predicts whether a new transaction is fraudulent."""
    # Build a dictionary representing the new transaction, with every
    # one-hot column initialized to 0
    new_transaction = {
        'amount': amount,
        'time_of_day': time_of_day,
        'customer_id': customer_id,
        'transaction_type_card': 0,
        'transaction_type_online': 0,
        'transaction_type_transfer': 0,
        'location_Africa': 0,
        'location_Asia': 0,
        'location_Europe': 0,
        'location_US': 0,
    }
    # Set the appropriate transaction-type and location columns to 1
    new_transaction[f'transaction_type_{transaction_type}'] = 1
    new_transaction[f'location_{location}'] = 1
    # Convert the dictionary to a one-row DataFrame
    new_transaction_df = pd.DataFrame([new_transaction])
    # Ensure the column order matches the training data
    new_transaction_df = new_transaction_df[X.columns]
    # Make the prediction
    prediction = model.predict(new_transaction_df)[0]
    return prediction

# Example of predicting a new transaction
new_amount = 250
new_time_of_day = 23
new_transaction_type = 'online'
new_location = 'Asia'
new_customer_id = 1500
prediction = predict_fraud(model, new_amount, new_time_of_day, new_transaction_type, new_location, new_customer_id)
if prediction == 1:
    print("\nThis transaction is predicted as FRAUDULENT.")
else:
    print("\nThis transaction is predicted as NOT fraudulent.")
```
Key improvements and explanations:
* **Clearer Structure:** The code is now divided into logical sections with comments explaining each step. This makes it much easier to understand.
* **Synthetic Data Generation:** This is crucial. It creates data that the model can actually learn from. The `generate_synthetic_data` function does this well:
* **Realistic Data:** Uses `np.random.normal` for amount, `np.random.uniform` for time, and `np.random.choice` for categories, simulating real-world data characteristics.
* **Fraud Introduction:** Crucially, it *injects* fraudulent transactions based on simple rules (higher amounts, late-night hours, certain locations) plus randomness, which is what makes the prediction problem meaningful. Because the label is only assigned probabilistically under each rule, the classes are not perfectly separable, so the model cannot simply memorize the rules and must generalize.
* **Reproducibility:** `np.random.seed(42)` ensures that the same data is generated each time you run the code, making the results reproducible.
* **`customer_id`:** A simple `customer_id` shows how an identifier might appear in real data. It plays no role in the fraud-generation rules, but note that it *is* passed to the model as a raw numeric feature; in a real system you would derive per-customer aggregates instead of feeding the ID itself.
* **Data Exploration:** The `print` statements after data generation are important for understanding the generated data. They show sample data, descriptive statistics, and the distribution of fraud vs. non-fraud cases. This helps you verify the data generation process.
* **One-Hot Encoding:** The code correctly uses `pd.get_dummies` to convert categorical features (transaction type, location) into numerical features, which is essential for most machine learning models.
* **Training and Testing Split:** The data is properly split into training and testing sets to evaluate the model's performance on unseen data. `random_state=42` is used here too for reproducibility.
* **Random Forest Classifier:** The code uses a Random Forest classifier, which is a good choice for fraud detection due to its ability to handle complex relationships and its robustness to outliers. Hyperparameters can be tuned (e.g., `n_estimators`).
* **Model Evaluation:** The code includes a comprehensive evaluation of the model using accuracy, confusion matrix, and classification report. This is essential for understanding the model's performance. The confusion matrix gives detailed insights into true positives, true negatives, false positives, and false negatives. The classification report shows precision, recall, and F1-score, which are more informative than accuracy alone.
* **Feature Importance:** The code calculates and displays feature importances, which can help you understand which features are most important for fraud detection. This is useful for feature selection and for gaining insights into the data.
* **Visualization:** A confusion matrix heatmap is added to visualize the model's performance. This makes it easier to understand the model's strengths and weaknesses.
* **Prediction on New Data:** The `predict_fraud` function demonstrates how to use the trained model to predict whether a new transaction is fraudulent.
* **Clear Function:** The function takes the necessary features as input and returns the prediction.
* **Data Preparation:** The function correctly prepares the new data in the same format as the training data (one-hot encoding, correct column order).
* **Column Order:** **Critically, it ensures that the new data has the *same column order* as the training data.** A mismatched column order is a common source of silently wrong predictions, so this step matters; a slightly more defensive variant is sketched below.
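For instance, that step could use `reindex`, which fills any column the dictionary missed with 0 instead of raising a `KeyError` (a sketch, assuming `X` is the training feature frame from the script above):

```python
# Align the one-row frame to the training columns; missing columns become 0
new_transaction_df = new_transaction_df.reindex(columns=X.columns, fill_value=0)
```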
* **Clearer Output:** The output is more descriptive, making it easier to understand the results.
* **Error Handling:** None is included. A production version should at least validate inputs in `predict_fraud` (for example, rejecting transaction types or locations that were never seen in training) and handle missing values.
* **Scalability and Real-World Considerations:**
* **Feature Engineering:** In a real-world scenario you would need much more extensive feature engineering (a small pandas sketch follows this list), including:
* **Time-based features:** Lagged transaction amounts, frequency of transactions per hour/day, etc.
* **Customer behavior:** Average transaction amount, number of transactions per customer, days since last transaction.
* **Location-based features:** Distance between transaction locations, IP address information.
* **Device information:** Device type, operating system.
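To make the time-based and customer-behavior ideas concrete, here is a minimal pandas sketch over the synthetic frame from step 1; the derived column names (`avg_amount_per_customer`, `tx_count_per_customer`, `amount_vs_customer_avg`) are illustrative, not part of the original script.

```python
# Sketch: simple per-customer behavior features (illustrative names)
customer_stats = (
    data.groupby('customer_id')['amount']
        .agg(avg_amount_per_customer='mean', tx_count_per_customer='count')
        .reset_index()
)
data = data.merge(customer_stats, on='customer_id', how='left')
# A transaction far above the customer's own average is a useful signal
data['amount_vs_customer_avg'] = data['amount'] / data['avg_amount_per_customer']
```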
* **Data Imbalance:** Real fraud datasets are highly imbalanced: fraudulent transactions are a tiny fraction of legitimate ones. Common remedies, two of which are sketched after this list, include:
* **Oversampling:** Duplicating minority-class rows, or synthesizing new ones with SMOTE, which interpolates between existing fraud examples.
* **Undersampling:** Removing legitimate transactions.
* **Cost-sensitive learning:** Assigning higher costs to misclassifying fraudulent transactions.
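Two of these remedies are cheap to try with the script above: scikit-learn's `RandomForestClassifier` accepts a `class_weight` parameter for cost-sensitive learning, and SMOTE is available from the separate imbalanced-learn package (an assumed extra dependency, installed via `pip install imbalanced-learn`).

```python
from sklearn.ensemble import RandomForestClassifier

# Cost-sensitive learning: weight classes inversely to their frequency
weighted_model = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42)
weighted_model.fit(X_train, y_train)

# Oversampling with SMOTE (requires the imbalanced-learn package)
from imblearn.over_sampling import SMOTE

X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X_train, y_train)
smote_model = RandomForestClassifier(n_estimators=100, random_state=42)
smote_model.fit(X_resampled, y_resampled)
```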
* **Model Selection:** Experiment with different machine learning models (a quick cross-validated comparison is sketched after this list), such as:
* **Logistic Regression**
* **Support Vector Machines (SVM)**
* **Gradient Boosting Machines (e.g., XGBoost, LightGBM)**
* **Neural Networks**
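A quick way to compare candidates is cross-validated F1 on the training split; the sketch below sticks to scikit-learn estimators (XGBoost and LightGBM expose the same `fit`/`predict` interface and could be swapped in).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# F1 is more informative than accuracy when fraud is rare
candidates = {
    'logistic_regression': LogisticRegression(max_iter=1000),
    'svm': SVC(),
    'gradient_boosting': GradientBoostingClassifier(random_state=42),
}
for name, candidate in candidates.items():
    scores = cross_val_score(candidate, X_train, y_train, cv=5, scoring='f1')
    print(f"{name}: mean F1 = {scores.mean():.3f} (+/- {scores.std():.3f})")
```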
* **Model Tuning:** Use techniques such as cross-validation and grid search to tune the hyperparameters of your model.
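For example, a minimal grid search over the Random Forest used above (the grid values are arbitrary starting points, not recommendations):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 5, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='f1',  # again, F1 rather than accuracy for imbalanced labels
    n_jobs=-1,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", search.best_score_)
```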
* **Real-time Prediction:** For real-time fraud detection, you would need to deploy the model to a production environment and integrate it with your transaction processing system.
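One common first step toward deployment is to serialize the fitted model together with the exact training column order, so a separate scoring service can load both; this is a sketch, and the file names are illustrative.

```python
import joblib

# Persist the trained model and the training column order (illustrative names)
joblib.dump(model, 'fraud_model.joblib')
joblib.dump(list(X.columns), 'model_columns.joblib')

# Inside the scoring service: load once at startup, then score transactions
loaded_model = joblib.load('fraud_model.joblib')
model_columns = joblib.load('model_columns.joblib')
```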
This significantly improved version provides a much more complete and useful example of AI-based fraud detection in Python. The key addition is the synthetic data generation, which makes the problem meaningful and allows the model to learn. The explanations and improvements in structure and clarity make the code much easier to understand and build upon. The discussion of real-world considerations highlights the steps needed to adapt this example to a real fraud detection system.