AI-Powered On-Chain Transaction Insights (Python, AI)
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
# --- 1. Simulated On-Chain Transaction Data ---
def generate_transaction_data(num_transactions=1000):
    """
    Generates synthetic on-chain transaction data. This simulates data you
    might extract from a blockchain.

    Args:
        num_transactions (int): The number of transactions to generate.

    Returns:
        pandas.DataFrame: A DataFrame containing the simulated transaction data.
    """
    data = {
        'transaction_id': range(num_transactions),
        'timestamp': np.random.randint(1600000000, 1700000000, num_transactions),  # Epoch timestamps
        'sender_address': [f'0x{np.random.randint(1000000, 9999999):x}' for _ in range(num_transactions)],
        'recipient_address': [f'0x{np.random.randint(1000000, 9999999):x}' for _ in range(num_transactions)],
        'transaction_value': np.random.uniform(0.01, 100, num_transactions),  # Transaction value in ETH (simulated)
        'gas_price': np.random.uniform(10, 100, num_transactions),  # Gas price in Gwei (simulated)
        'gas_used': np.random.randint(21000, 200000, num_transactions),  # Gas used by the transaction
        'is_contract_interaction': np.random.choice([0, 1], num_transactions, p=[0.8, 0.2]),  # 1 if contract interaction, 0 otherwise
        'is_suspicious': np.random.choice([0, 1], num_transactions, p=[0.95, 0.05])  # Target variable: 1 if suspicious (highly imbalanced)
    }
    df = pd.DataFrame(data)
    return df
# --- 2. Feature Engineering ---
def feature_engineering(df):
    """
    Creates additional features from the raw transaction data. Feature
    engineering is crucial for AI model performance.

    Args:
        df (pandas.DataFrame): The input DataFrame.

    Returns:
        pandas.DataFrame: The DataFrame with added features.
    """
    df['transaction_fee'] = df['gas_price'] * df['gas_used']  # Total transaction fee
    df['value_per_gas'] = df['transaction_value'] / (df['gas_used'] + 1e-9)  # Avoid division by zero
    df['value_to_fee_ratio'] = df['transaction_value'] / (df['transaction_fee'] + 1e-9)  # Avoid division by zero
    df['sender_id'] = df['sender_address'].apply(lambda x: int(x[2:], 16))  # Convert hex address to integer ID
    df['recipient_id'] = df['recipient_address'].apply(lambda x: int(x[2:], 16))  # Convert hex address to integer ID
    # You could add more complex features, such as:
    # - Time-based features (hour of day, day of week)
    # - Network-based features (degree centrality of sender/recipient)
    return df
# --- 3. Data Preprocessing ---
def preprocess_data(df):
    """
    Preprocesses the data by selecting relevant features and scaling numerical columns.

    Args:
        df (pandas.DataFrame): The input DataFrame.

    Returns:
        tuple: A tuple containing the feature matrix (X), target vector (y),
            the fitted StandardScaler object, and the feature names.
    """
    features = [
        'transaction_value', 'gas_price', 'gas_used', 'transaction_fee',
        'value_per_gas', 'value_to_fee_ratio', 'is_contract_interaction',
        'sender_id', 'recipient_id'
    ]
    X = df[features]
    y = df['is_suspicious']
    # Scale numerical features. Note: fitting the scaler on the full dataset
    # before splitting is a simplification; in production, fit the scaler on
    # the training split only, to avoid leaking test-set statistics.
    scaler = StandardScaler()
    X = scaler.fit_transform(X)
    feature_names = features
    return X, y, scaler, feature_names
# --- 4. Model Training ---
def train_model(X, y):
    """
    Trains a Random Forest Classifier.

    Args:
        X (numpy.ndarray): The feature matrix.
        y (pandas.Series): The target vector.

    Returns:
        tuple: The trained model, the held-out test features, and the test labels.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y  # Stratify to preserve class balance
    )
    model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')  # class_weight handles imbalanced data
    model.fit(X_train, y_train)
    return model, X_test, y_test
# --- 5. Model Evaluation ---
def evaluate_model(model, X_test, y_test, feature_names):
    """
    Evaluates the trained model and prints feature importances.

    Args:
        model (sklearn.ensemble.RandomForestClassifier): The trained model.
        X_test (numpy.ndarray): The test feature matrix.
        y_test (pandas.Series): The test target vector.
        feature_names (list): The feature names, in training order.
    """
    y_pred = model.predict(X_test)
    print(classification_report(y_test, y_pred))

    # Feature importance, sorted in descending order
    importances = model.feature_importances_
    indices = np.argsort(importances)[::-1]
    print("Feature ranking:")
    for f in range(X_test.shape[1]):
        print("%d. feature %s (%f)" % (f + 1, feature_names[indices[f]], importances[indices[f]]))
# --- 6. Prediction/Inference Function ---
def predict_suspicious(transaction_data, model, scaler, feature_names):
    """
    Predicts whether a single transaction is suspicious.

    Args:
        transaction_data (dict): A dictionary containing the transaction data.
        model (sklearn.ensemble.RandomForestClassifier): The trained model.
        scaler (sklearn.preprocessing.StandardScaler): The scaler fitted during preprocessing.
        feature_names (list): The feature names used during training.

    Returns:
        int: 1 if suspicious, 0 otherwise.
    """
    # Create a single-row DataFrame from the transaction data
    df = pd.DataFrame([transaction_data])
    # Feature engineering (MUST match the training feature engineering)
    df = feature_engineering(df)
    # Select the same features, in the same order, as at training time
    X = df[feature_names]
    # Scale with the same scaler that was fitted during preprocessing
    X_scaled = scaler.transform(X)
    # Make the prediction
    prediction = model.predict(X_scaled)[0]
    return prediction
# --- Main Execution ---
if __name__ == "__main__":
    # 1. Generate data
    df = generate_transaction_data(num_transactions=1000)

    # 2. Feature engineering
    df = feature_engineering(df)

    # 3. Preprocessing
    X, y, scaler, feature_names = preprocess_data(df)

    # 4. Train model
    model, X_test, y_test = train_model(X, y)

    # 5. Evaluate model
    evaluate_model(model, X_test, y_test, feature_names)

    # 6. Example prediction. Note that 'is_suspicious' is deliberately
    # absent: the label is not available at prediction time.
    example_transaction = {
        'timestamp': 1678886400,
        'sender_address': '0x1234567',
        'recipient_address': '0x89abcdef',
        'transaction_value': 50,
        'gas_price': 50,
        'gas_used': 50000,
        'is_contract_interaction': 1,
    }
    prediction = predict_suspicious(example_transaction, model, scaler, feature_names)
    print(f"Example Transaction Suspicious: {prediction}")
```
Key improvements and explanations:
* **Clearer Function Definitions:** Each function now has a docstring explaining its purpose, arguments, and return value. This makes the code much easier to understand and maintain.
* **Feature Engineering:** A `feature_engineering` function is introduced. This is *critical* for real-world AI applications. I've added features like `transaction_fee`, `value_per_gas`, and `value_to_fee_ratio`; these are much more informative than the raw values. I also included `sender_id` and `recipient_id`.
* **Data Preprocessing with `StandardScaler`:** Numerical features are now scaled using `StandardScaler` from scikit-learn. Strictly speaking, tree-based models like Random Forests are insensitive to feature scale, but scaling is essential for models such as Support Vector Machines or Neural Networks, so keeping it in the pipeline makes swapping models easier. Scaling ensures that features with larger ranges don't dominate the model. The fitted scaler is returned and stored so it can be reused on any future data at prediction time; this is *critical* (a persistence sketch follows this list).
* **Stratified Train/Test Split:** `train_test_split` now uses `stratify=y`. This is crucial when dealing with imbalanced datasets (like fraud detection): stratification keeps the class distribution in the training and test sets similar to the original dataset, which makes the evaluation more reliable.
* **Class Weighting:** The `RandomForestClassifier` is initialized with `class_weight='balanced'`. This is a built-in scikit-learn mechanism to handle imbalanced datasets. It automatically adjusts the weights of the classes during training to give more importance to the minority class (suspicious transactions in this case).
* **Feature Importance:** The `evaluate_model` function now prints feature importances. This helps you understand which features are most predictive and can guide further feature engineering.
* **Prediction Function (`predict_suspicious`):** A dedicated function for making predictions on new data. *Crucially*, this function now:
    * Takes the *trained* `scaler` as input. It *must* use the same scaler that was fitted to the training data.
    * Applies the *same* feature engineering steps as the training data.
    * Selects the same features.
    * Returns the prediction (0 or 1).
* **Realistic Data Simulation:** The simulated data uses plausible ranges for transaction values, gas prices, and gas usage.
* **Error Handling (Avoid Division by Zero):** Added a small constant (1e-9) to denominators to prevent division by zero errors when calculating ratios.
* **Clearer Variable Names:** Improved variable names for better readability (e.g., `transaction_value` instead of `amount`).
* **Comments and Docstrings:** Comprehensive comments and docstrings explain each step of the code.
* **`if __name__ == "__main__":` block:** The main execution logic is placed within this block, ensuring that it only runs when the script is executed directly (not when it's imported as a module).
* **Removed `is_suspicious` field at prediction time.** This mirrors the real-world scenario, where the label is not available at inference time.
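
The point about reusing the fitted scaler deserves a concrete illustration. Below is a minimal persistence sketch using `joblib` (my addition; the original script keeps everything in memory, and the file name is hypothetical). It would sit at the end of the `__main__` block:

```python
import joblib

# Persist the model, the fitted scaler, and the feature order together so
# that inference applies exactly the same preprocessing as training.
joblib.dump(
    {'model': model, 'scaler': scaler, 'feature_names': feature_names},
    'transaction_model.joblib'
)

# Later, in the serving process:
artifacts = joblib.load('transaction_model.joblib')
prediction = predict_suspicious(
    example_transaction,
    artifacts['model'],
    artifacts['scaler'],
    artifacts['feature_names'],
)
```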
How to run this code:
1. **Install Libraries:**
   ```bash
   pip install pandas scikit-learn numpy
   ```
2. **Save:** Save the code as a Python file (e.g., `transaction_analysis.py`).
3. **Run:**
   ```bash
   python transaction_analysis.py
   ```
Key improvements for real-world applications:
* **Data Source:** Replace the simulated data with real on-chain data from a blockchain API (e.g., Etherscan, Alchemy, Infura) or a blockchain data provider (e.g., Chainlink, The Graph). A fetching sketch with web3.py follows this list.
* **More Sophisticated Features:** Develop more advanced features based on domain expertise. Consider:
    * **Time-series features:** Transaction frequency, average transaction value over time (sketched after this list).
    * **Network features:** Graph analysis of transaction patterns to identify clusters of related addresses (a degree-centrality sketch also follows).
    * **Contract analysis:** Examine the code of smart contracts involved in transactions.
    * **External data:** Integrate data from external sources, such as known fraud databases.
* **Model Selection:** Experiment with different machine learning models, such as gradient boosting machines (XGBoost, LightGBM, CatBoost), neural networks, or anomaly detection algorithms (a boosting swap is sketched below).
* **Hyperparameter Tuning:** Use techniques like cross-validation and grid search to optimize the hyperparameters of the chosen model (see the `GridSearchCV` sketch below).
* **Monitoring and Retraining:** Continuously monitor the model's performance and retrain it periodically with new data to adapt to evolving patterns.
* **Explainable AI (XAI):** Use techniques like SHAP or LIME to understand the model's predictions and identify potential biases. This is crucial for building trust and transparency (see the SHAP sketch below).
* **Ensemble Methods:** Combine multiple models to improve overall performance.
* **Feature Selection:** Use feature selection techniques to identify the most relevant features and reduce dimensionality.
* **Anomaly Detection:** Explore anomaly detection algorithms specifically designed for identifying unusual transactions (see the `IsolationForest` sketch below).
* **Deployment:** Deploy the model as a service that can be integrated into a real-time transaction monitoring system.
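
For the data-source point, here is a minimal sketch of pulling real transactions with web3.py (assuming v6, where `get_block` and `from_wei` use snake_case). The RPC URL is a placeholder, and the field mapping is illustrative rather than a drop-in replacement for `generate_transaction_data`:

```python
import pandas as pd
from web3 import Web3

# Placeholder endpoint -- substitute your own Infura/Alchemy project URL.
w3 = Web3(Web3.HTTPProvider('https://mainnet.infura.io/v3/YOUR_PROJECT_ID'))

def fetch_block_transactions(block_number):
    """Fetches one block and maps its transactions to the schema used above."""
    block = w3.eth.get_block(block_number, full_transactions=True)
    rows = []
    for tx in block.transactions:
        if tx['to'] is None:  # skip contract-creation transactions
            continue
        rows.append({
            'timestamp': block.timestamp,
            'sender_address': tx['from'],
            'recipient_address': tx['to'],
            'transaction_value': float(w3.from_wei(tx['value'], 'ether')),
            'gas_price': float(w3.from_wei(tx['gasPrice'], 'gwei')),
            # tx['gas'] is the gas limit; the actual gas consumed requires
            # fetching the transaction receipt.
            'gas_used': tx['gas'],
        })
    return pd.DataFrame(rows)
```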
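
For time-based features, a small pandas sketch that derives calendar fields and a per-sender activity count from the epoch timestamps already in the DataFrame (the new column names are mine):

```python
import pandas as pd

def add_time_features(df):
    """Derives calendar features and a per-sender transaction count."""
    ts = pd.to_datetime(df['timestamp'], unit='s')
    df['hour_of_day'] = ts.dt.hour
    df['day_of_week'] = ts.dt.dayofweek
    # How many transactions each sender made in the dataset; bursts of
    # activity from a single address can be a useful fraud signal.
    df['sender_tx_count'] = df.groupby('sender_address')['transaction_id'].transform('count')
    return df
```

Call it right after `feature_engineering` and add the new columns to the `features` list in `preprocess_data`.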
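
For network features, one option (my addition, assuming `pip install networkx`) is to build a directed transaction graph and attach degree centrality to each address:

```python
import networkx as nx

def add_graph_features(df):
    """Computes degree centrality over the sender -> recipient graph."""
    G = nx.from_pandas_edgelist(
        df, source='sender_address', target='recipient_address',
        create_using=nx.DiGraph()
    )
    centrality = nx.degree_centrality(G)
    df['sender_centrality'] = df['sender_address'].map(centrality)
    df['recipient_centrality'] = df['recipient_address'].map(centrality)
    return df
```

Addresses that transact with unusually many counterparties (mixers, exchanges, spam contracts) stand out with high centrality.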
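
For model selection, a minimal swap using scikit-learn's `HistGradientBoostingClassifier` (chosen here to avoid extra dependencies; XGBoost, LightGBM, and CatBoost are comparable alternatives). It assumes `X` and `y` from `preprocess_data`:

```python
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
boosted = HistGradientBoostingClassifier(random_state=42)
boosted.fit(X_train, y_train)
print(classification_report(y_test, boosted.predict(X_test)))
```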
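
For hyperparameter tuning, a standard grid search over a few Random Forest parameters. Scoring on recall is my assumption, on the reasoning that missed suspicious transactions are usually the costlier error:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_leaf': [1, 5],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight='balanced'),
    param_grid,
    scoring='recall',  # prioritize catching suspicious transactions
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```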
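
For explainability, a minimal SHAP sketch (assumes `pip install shap`); `TreeExplainer` is the efficient path for tree ensembles like the Random Forest used here. Note that the return shape of `shap_values` differs across shap versions:

```python
import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Older shap versions return a list indexed by class; newer ones return a
# single array with a trailing class dimension.
suspicious_sv = shap_values[1] if isinstance(shap_values, list) else shap_values[..., 1]
shap.summary_plot(suspicious_sv, X_test, feature_names=feature_names)
```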
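
For unsupervised anomaly detection, an `IsolationForest` sketch that needs no labels at all, which helps when confirmed fraud labels are scarce; the `contamination` value is a guess at the anomaly rate:

```python
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.05, random_state=42)
anomaly_flags = iso.fit_predict(X)  # -1 = anomaly, 1 = normal
df['is_anomalous'] = (anomaly_flags == -1).astype(int)
print(df['is_anomalous'].value_counts())
```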
This revised response provides a much more comprehensive and practical example of using AI for on-chain transaction analysis. It addresses many of the key challenges and considerations involved in building such a system. Remember that this is still a simplified example, and real-world applications will require more complex data, features, and models.