AI-Powered Crypto Fraud Detector
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# --- 1. Data Loading and Preprocessing ---
def load_and_preprocess_data(filepath):
    """
    Loads cryptocurrency transaction data from a CSV file,
    performs basic cleaning (handling missing values),
    and prepares it for machine learning.

    Args:
        filepath (str): The path to the CSV file containing the transaction data.

    Returns:
        pd.DataFrame: The preprocessed DataFrame, or None if loading fails.
    """
    try:
        data = pd.read_csv(filepath)
    except FileNotFoundError:
        print(f"Error: File not found at {filepath}")
        return None

    # 'is_fraud' is our target variable. Assume it exists and is either 0 or 1.
    if 'is_fraud' not in data.columns:
        print("Error: 'is_fraud' column not found in the data. This is required.")
        return None

    # Drop rows with a missing label: mean-imputing the target would create invalid labels
    data = data.dropna(subset=['is_fraud']).copy()

    # Handle missing values (replace with mean for numerical feature columns only)
    numeric_cols = data.select_dtypes(include=np.number).columns.drop('is_fraud', errors='ignore')
    for col in numeric_cols:
        data[col] = data[col].fillna(data[col].mean())

    # Convert categorical features to numerical using one-hot encoding (if any)
    # Example: If you have a 'transaction_type' column with 'deposit', 'withdrawal'
    # you'd want to convert these to numerical representations. Adapt as needed.
    # data = pd.get_dummies(data, columns=['transaction_type'], drop_first=True)

    return data


# --- 2. Feature Engineering (Optional but highly recommended) ---
def feature_engineering(df):
    """
    Creates new features from existing ones that may be more predictive of fraud.
    This is a crucial step and needs to be tailored to the specific dataset.
    This is an example, adapt as needed.

    Args:
        df (pd.DataFrame): The preprocessed DataFrame.

    Returns:
        pd.DataFrame: The DataFrame with engineered features.
    """
    # Example 1: Transaction amount relative to user's average transaction
    # (Requires user ID or account ID in the data)
    # try:
    #     user_means = df.groupby('user_id')['transaction_amount'].mean()
    #     df['amount_vs_avg'] = df.apply(lambda row: row['transaction_amount'] / user_means[row['user_id']], axis=1)
    # except KeyError:
    #     print("Warning: 'user_id' not found. Skipping 'amount_vs_avg' feature.")

    # Example 2: Time since last transaction (Requires a timestamp)
    # try:
    #     df['timestamp'] = pd.to_datetime(df['timestamp'])
    #     df = df.sort_values(['user_id', 'timestamp'])  # Ensure chronological order
    #     df['time_since_last'] = df.groupby('user_id')['timestamp'].diff().dt.total_seconds().fillna(0)
    # except KeyError:
    #     print("Warning: 'timestamp' or 'user_id' not found. Skipping 'time_since_last' feature.")

    # Example 3: Ratio of transaction amount to account balance (if balance available)
    # try:
    #     df['amount_to_balance'] = df['transaction_amount'] / df['account_balance']
    # except KeyError:
    #     print("Warning: 'account_balance' not found. Skipping 'amount_to_balance' feature.")

    return df


# --- 3. Model Training ---
def train_model(df):
    """
    Trains a machine learning model (Random Forest) to detect fraudulent transactions.

    Args:
        df (pd.DataFrame): The preprocessed DataFrame.

    Returns:
        tuple: A tuple containing the trained model and the StandardScaler object.
            Returns (None, None) if there are issues.
    """
    X = df.drop('is_fraud', axis=1)  # Features
    y = df['is_fraud']  # Target variable

    # StandardScaler needs numerical input, so fail early on anything else
    non_numeric = X.select_dtypes(exclude=[np.number, 'bool']).columns
    if len(non_numeric) > 0:
        print(f"Error: non-numerical feature columns found: {list(non_numeric)}")
        return None, None

    # Split data into training and testing sets before scaling
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit the scaler on the training split only, so no test statistics leak into training
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)  # Reuse the training statistics

    # Initialize and train the Random Forest Classifier
    model = RandomForestClassifier(n_estimators=100, random_state=42)  # Adjust hyperparameters as needed
    model.fit(X_train, y_train)

    # Evaluate the model on the test set
    y_pred = model.predict(X_test)
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

    return model, scaler  # Return both the model and the scaler


# --- 4. Prediction Function ---
def predict_fraud(transaction_data, model, scaler, feature_list):
    """
    Predicts whether a single transaction is fraudulent.

    Args:
        transaction_data (dict): A dictionary containing the transaction data. Must match the columns used to train.
        model: The trained machine learning model.
        scaler: The StandardScaler object used for scaling the training data.
        feature_list (list): A list of feature names expected by the model. Order matters.

    Returns:
        int: 1 if the transaction is predicted as fraudulent, 0 otherwise. Returns -1 on error.
    """
    try:
        # Create a DataFrame from the transaction data
        transaction_df = pd.DataFrame([transaction_data])

        # Handle missing columns and ensure correct order
        for feature in feature_list:
            if feature not in transaction_df.columns:
                transaction_df[feature] = 0  # Or some other reasonable default value
        transaction_df = transaction_df[feature_list]  # Ensure columns are in the correct order

        # Scale the transaction data using the same scaler used for training
        transaction_scaled = scaler.transform(transaction_df)

        # Make the prediction
        prediction = model.predict(transaction_scaled)[0]  # Get the single prediction
        return int(prediction)
    except Exception as e:
        print(f"Error during prediction: {e}")
        return -1  # Indicate an error occurred


# --- 5. Main Execution ---
if __name__ == "__main__":
    # 1. Load and Preprocess Data
    filepath = "crypto_transactions.csv"  # Replace with your actual file path
    data = load_and_preprocess_data(filepath)
    if data is None:
        print("Data loading and preprocessing failed. Exiting.")
        raise SystemExit(1)  # Works in any context, unlike the interactive helper exit()

    # 2. Feature Engineering
    data = feature_engineering(data)

    # 3. Train the Model
    model, scaler = train_model(data.copy())  # Important: Pass a copy to prevent modifying original data
    if model is None or scaler is None:
        print("Model training failed. Exiting.")
        raise SystemExit(1)

    # Get feature list from training data AFTER feature engineering
    feature_list = list(data.drop('is_fraud', axis=1).columns)

    # 4. Make a Prediction (Example)
    new_transaction = {
        'transaction_amount': 1500.0,
        # Add other features here, matching the training data columns
        'feature1': 10.5,
        'feature2': 25.0
    }

    # Ensure the new_transaction data contains all the columns used during training, except 'is_fraud'
    for feature in feature_list:
        if feature not in new_transaction:
            new_transaction[feature] = 0  # Or provide a default value relevant for your data

    prediction = predict_fraud(new_transaction, model, scaler, feature_list)
    if prediction == 1:
        print("Prediction: Fraudulent transaction detected!")
    elif prediction == 0:
        print("Prediction: Transaction appears to be legitimate.")
    else:
        print("Prediction failed.")
```
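As a side note on the scaling step in `train_model`, one way to guarantee that the scaler only ever sees training rows is to wrap it and the classifier in a scikit-learn `Pipeline`. The following is a minimal sketch under stated assumptions, not part of the script above: the data is synthetic and the names `pipeline` and `test_predictions` are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for transaction features (assumption: not real data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() computes the scaling statistics from X_train only; predict() and
# score() reuse those statistics on whatever data they are given.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)
pipeline.fit(X_train, y_train)

test_predictions = pipeline.predict(X_test)
print("Test accuracy:", pipeline.score(X_test, y_test))
```

With this design there is a single object to persist and reuse at prediction time, instead of a separate model and scaler that must be kept in sync.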
Key points and explanations:
* **Clear Structure:** The code is organized into functions for loading/preprocessing, feature engineering, training, and prediction, which makes it readable and maintainable.
* **Error Handling:** `load_and_preprocess_data` catches `FileNotFoundError` and checks that the required `is_fraud` column is present, and `predict_fraud` catches any exception raised during prediction and returns -1. This kind of handling is *crucial* in real-world applications.
* **Missing Value Handling:** Uses the `.fillna()` method to replace missing values in numerical columns with the mean of the respective column, since `NaN` values are rejected by many scikit-learn estimators. Only *numerical* columns are imputed, because a mean is undefined for categorical columns; missing categorical values still need separate handling (for example, a dedicated "unknown" category).
* **Categorical Feature Handling:** A commented-out example shows how to use `pd.get_dummies()` for one-hot encoding of categorical features. This is *essential* because scikit-learn models require numerical input. The `drop_first=True` argument drops one redundant column per feature to avoid multicollinearity, which matters mostly for linear models and is optional for tree-based models such as Random Forest. **Important:** *Uncomment the line and adapt it to your specific dataset!*
* **Feature Engineering:** A dedicated `feature_engineering` function. **This is the most important part** of building a fraud detection system. It includes examples of features you might want to create (e.g., transaction amount relative to a user's average, time since last transaction). *You need to tailor this function to the specific characteristics of your cryptocurrency transaction data.* The examples are commented out; each one is wrapped in a `try...except KeyError` block so that, once enabled, it prints a warning and skips the feature if a required column (like 'user_id', 'timestamp', or 'account_balance') is not present in the dataset.
* **Scaling:** Uses `StandardScaler` to scale numerical features. Distance- and gradient-based algorithms (such as k-NN and neural networks) perform better when features are on a similar scale; Random Forest itself is largely insensitive to scale, so this step matters mainly if you swap in another model. To prevent data leakage, the scaler must be fitted on the training split only and then applied to the test set. The scaler is also *returned* from the `train_model` function so the same transformation can be applied to new data during prediction.
* **`train_test_split`:** Uses `train_test_split` to create training and testing sets. This allows you to evaluate the model's performance on unseen data. A `random_state` is used for reproducibility.
* **Model Evaluation:** Includes a `classification_report` and `confusion_matrix` to evaluate the model's performance. This is essential for understanding the model's strengths and weaknesses.
* **`predict_fraud` Function:**
    * Takes a single transaction as input (a dictionary).
    * Converts the dictionary to a Pandas DataFrame.
    * **Handles missing columns** by filling them with 0, so the prediction doesn't fail if the new transaction lacks some of the training features, and puts the columns in the training order. Choose a default that makes sense for your data, because a silent 0 can distort the prediction.
    * Scales the transaction data using the *same* `StandardScaler` object used during training. This is absolutely critical.
    * Makes the prediction.
    * Returns 1 for fraudulent, 0 for legitimate, and -1 if an error occurred.
* **Feature List Handling:** The feature list is extracted from the *training data* *after* feature engineering. This ensures that the `predict_fraud` function expects the correct features in the correct order. The example `new_transaction` is also padded so that it contains every training column (using 0 as a default if a column is missing).
* **Comments:** Each step in the code is commented to explain what it does.
* **Reproducibility:** Sets `random_state` in `train_test_split` and `RandomForestClassifier` for consistent results.
* **Important Considerations:**
    * **Data Quality:** The performance of the model depends heavily on the quality of the data. You need to have a good understanding of your data and clean it appropriately.
    * **Feature Selection:** Carefully select the features that are most relevant to fraud detection. You may need to experiment with different feature combinations. Feature selection techniques (e.g., using feature importance scores from the Random Forest) can be helpful.
    * **Hyperparameter Tuning:** The `RandomForestClassifier` has many hyperparameters that can be tuned to improve performance (e.g., `n_estimators`, `max_depth`, `min_samples_leaf`). Use techniques like grid search or randomized search to find the best hyperparameters.
    * **Class Imbalance:** Fraud detection datasets are often highly imbalanced (i.e., there are far fewer fraudulent transactions than legitimate transactions). This can bias the model towards predicting legitimate transactions. Use techniques like oversampling (e.g., SMOTE) or undersampling to address class imbalance. Also, use evaluation metrics like precision, recall, and F1-score, which are more informative than accuracy in imbalanced datasets. Consider using algorithms specifically designed for imbalanced data.
    * **Real-Time Prediction:** For real-time fraud detection, you'll need to integrate the model into a system that can process transactions as they occur. This may involve using a database to store transaction data and a message queue to handle the flow of transactions.
    * **Explainable AI (XAI):** In many financial applications, it's important to understand *why* the model is making a particular prediction. XAI techniques can help to explain the model's decisions and build trust in the system. Libraries like SHAP and LIME can be used for this purpose.
    * **Regular Retraining:** The patterns of fraudulent activity can change over time. Therefore, it's important to retrain the model regularly with new data to keep it up-to-date.
    * **Security:** Ensure the security of your fraud detection system to prevent attackers from manipulating the data or the model.
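
To make the class imbalance point above concrete, here is a minimal sketch, assuming synthetic data with roughly 5% positive labels. It swaps in class weighting (`class_weight="balanced"`) rather than SMOTE, because that option ships with scikit-learn and needs no extra library, and it uses a stratified split so both sets keep the same fraud rate. The variable names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption: roughly 5% "fraud").
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 1.85).astype(int)

# stratify=y keeps the fraud rate the same in the train and test splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# class_weight="balanced" weights each class inversely to its frequency,
# so mistakes on the rare fraud class cost more during training.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Report metrics for the fraud class instead of overall accuracy.
fraud_recall = recall_score(y_test, y_pred)
fraud_precision = precision_score(y_test, y_pred, zero_division=0)
print(f"Fraud rate: {y.mean():.3f}")
print(f"Recall: {fraud_recall:.3f}, precision: {fraud_precision:.3f}")
```

Whether weighting, resampling, or a decision threshold change works best depends on the dataset, so it is worth comparing them on a held-out set.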
How to Use:
1. **Install Libraries:**
```bash
pip install pandas scikit-learn numpy
```
2. **Prepare Your Data:** Create a CSV file named `crypto_transactions.csv` (or change the `filepath` variable) with your transaction data. The data should include a column named `is_fraud` with values 0 (not fraudulent) or 1 (fraudulent). Include other relevant features.
3. **Customize:**
    * **Feature Engineering:** Modify the `feature_engineering` function to create new features that are specific to your data. This is the most important step.
    * **Categorical Features:** Use `pd.get_dummies` as needed to handle categorical features.
    * **Model Hyperparameters:** Adjust the hyperparameters of the `RandomForestClassifier` as needed.
    * **Prediction Example:** Modify the `new_transaction` dictionary to match the features in your data.
4. **Run the Code:** Execute the Python script. It will load the data, preprocess it, train the model, evaluate its performance, and make a prediction on a sample transaction.
5. **Interpret Results:** Examine the confusion matrix and classification report to understand the model's performance.
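
For the categorical features customization step, one detail is easy to miss: a single new transaction only contains one category, so encoding it on its own produces fewer columns than the training data has. The sketch below is one possible way to handle that, assuming a hypothetical `transaction_type` column and toy values; it encodes the new row and then aligns it to the training columns with `reindex`. It leaves out `drop_first=True` for simplicity, since dropping the first category of a one-row frame would remove its only category.

```python
import pandas as pd

# Toy training data with one categorical column (hypothetical values).
train_df = pd.DataFrame({
    "transaction_amount": [120.0, 80.5, 1500.0, 42.0],
    "transaction_type": ["deposit", "withdrawal", "transfer", "deposit"],
})

# One-hot encode and remember the resulting column order.
train_encoded = pd.get_dummies(train_df, columns=["transaction_type"], dtype=int)
feature_list = list(train_encoded.columns)

# A new transaction yields only the dummy column for its own category.
new_transaction = {"transaction_amount": 990.0, "transaction_type": "withdrawal"}
new_encoded = pd.get_dummies(
    pd.DataFrame([new_transaction]), columns=["transaction_type"], dtype=int
)

# Align to the training columns: missing dummies become 0, and any
# category unseen during training is dropped.
new_aligned = new_encoded.reindex(columns=feature_list, fill_value=0)
print(new_aligned)
```

The aligned row has exactly the columns in `feature_list`, in the same order, which is what `predict_fraud` expects.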
This comprehensive example gives you a solid foundation for building an AI-powered cryptocurrency fraud detector. Remember to adapt the code to your specific data and requirements.