AI-Based Predictive Maintenance System for Wind Turbines (Python)
```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import joblib # For saving and loading the model
# --- 1. Data Preparation ---
# Assume you have historical data in a CSV file. The data should include
# features related to turbine operation (e.g., temperature, vibration, power output)
# and a target variable indicating whether a failure occurred (1) or not (0).
# Example CSV structure:
# turbine_id,timestamp,temperature,vibration,power_output,wind_speed,failure
# 1,2023-01-01 00:00:00,25.5,0.2,1.5,10.2,0
# 1,2023-01-01 00:10:00,25.6,0.21,1.6,10.5,0
# ...
# 1,2023-01-02 12:30:00,30.1,0.8,0.1,2.0,1 # Failure event
DATA_FILE = 'wind_turbine_data.csv' # Replace with your actual data file name
def load_and_preprocess_data(file_path):
    """
    Loads the data from a CSV file, preprocesses it, and splits it into training and testing sets.
    Args:
        file_path (str): Path to the CSV data file.
    Returns:
        tuple: (X_train, X_test, y_train, y_test, scaler, feature_columns),
               or a tuple of Nones if the file cannot be read. The fitted
               scaler and feature column list are returned so that predictions
               on new data can reuse them.
    """
    try:
        data = pd.read_csv(file_path)
    except FileNotFoundError:
        print(f"Error: File not found at {file_path}")
        return None, None, None, None, None, None
    # --- Feature Engineering (Example) ---
    # You can add more sophisticated feature engineering steps here,
    # e.g., rolling averages, differences, or more complex calculations.
    # Here we create a simple rolling average for vibration.
    # (In production, compute rolling features per turbine, e.g. with groupby('turbine_id').)
    data['vibration_rolling_mean'] = data['vibration'].rolling(window=5).mean()
    data['vibration_rolling_mean'] = data['vibration_rolling_mean'].fillna(data['vibration'].mean())  # Handle NaN values
    # Drop columns not needed for prediction (e.g., turbine_id, timestamp).
    data = data.drop(['turbine_id', 'timestamp'], axis=1, errors='ignore')  # errors='ignore' prevents errors if the columns don't exist
    # Handle missing values (if any) - impute with the column mean.
    for col in data.columns:
        if data[col].isnull().any():
            data[col] = data[col].fillna(data[col].mean())
    # Separate features (X) and target (y).
    X = data.drop('failure', axis=1)  # The 'failure' column indicates failure (0 or 1)
    y = data['failure']
    feature_columns = list(X.columns)
    # Split data into training and testing sets.
    # Note: for time-ordered sensor data, a chronological split avoids leaking
    # the future into training; a random split is kept here for simplicity.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    # Feature scaling (important for some algorithms).
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)  # Fit the scaler on the training data only
    X_test = scaler.transform(X_test)  # Transform the test data using the fitted scaler
    return X_train, X_test, y_train, y_test, scaler, feature_columns
# --- 2. Model Training ---
def train_model(X_train, y_train):
    """
    Trains a Random Forest Classifier model.
    Args:
        X_train (numpy.ndarray): Training features.
        y_train (pandas.Series): Training target.
    Returns:
        sklearn.ensemble.RandomForestClassifier: Trained model.
    """
    model = RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced')  # Adjust hyperparameters as needed
    # n_estimators: Number of trees in the forest.
    # random_state: For reproducibility.
    # class_weight='balanced': Addresses imbalanced datasets by weighting classes
    # inversely proportional to their frequency. This is crucial for predictive
    # maintenance, as failure events are typically rare.
    model.fit(X_train, y_train)
    return model
# --- 3. Model Evaluation ---
def evaluate_model(model, X_test, y_test):
    """
    Evaluates the trained model.
    Args:
        model (sklearn.ensemble.RandomForestClassifier): Trained model.
        X_test (numpy.ndarray): Testing features.
        y_test (pandas.Series): Testing target.
    """
    y_pred = model.predict(X_test)
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
    print("\nClassification Report:\n", classification_report(y_test, y_pred))  # Precision, recall, F1-score
# --- 4. Model Deployment (Saving and Loading) ---
MODEL_FILE = 'wind_turbine_model.joblib'
def save_model(model, file_path):
    """
    Saves the trained model to a file.
    Args:
        model (sklearn.ensemble.RandomForestClassifier): Trained model.
        file_path (str): Path to save the model.
    """
    joblib.dump(model, file_path)
    print(f"Model saved to {file_path}")

def load_model(file_path):
    """
    Loads a trained model from a file.
    Args:
        file_path (str): Path to the saved model.
    Returns:
        sklearn.ensemble.RandomForestClassifier: Loaded model.
    """
    try:
        model = joblib.load(file_path)
        print(f"Model loaded from {file_path}")
        return model
    except FileNotFoundError:
        print(f"Error: Model file not found at {file_path}")
        return None
# --- 5. Prediction (using the loaded model) ---
def predict_failure(model, scaler, feature_columns, data):
    """
    Predicts failure probability for new data.
    Args:
        model (sklearn.ensemble.RandomForestClassifier): Loaded model.
        scaler (sklearn.preprocessing.StandardScaler): Scaler fitted on the training data.
        feature_columns (list): Feature columns used during training.
        data (pandas.DataFrame): New data (single row or multiple rows).
    Returns:
        numpy.ndarray: Predicted failure probabilities.
    """
    # Ensure the new data has the same columns as the training data.
    missing_cols = set(feature_columns) - set(data.columns)
    for c in missing_cols:
        data[c] = 0  # You might want to impute with a more appropriate value
    extra_cols = set(data.columns) - set(feature_columns)
    if extra_cols:
        data = data.drop(list(extra_cols), axis=1)
    data = data[feature_columns]  # Ensure correct column order
    # Impute any remaining missing values.
    for col in data.columns:
        if data[col].isnull().any():
            data[col] = data[col].fillna(data[col].mean())
    # Scale the data with the *same* scaler that was fitted on the training data.
    data_scaled = scaler.transform(data)
    probabilities = model.predict_proba(data_scaled)[:, 1]  # Probability of failure (class 1)
    return probabilities
# --- Main Execution ---
if __name__ == "__main__":
    # 1. Load and preprocess data
    X_train, X_test, y_train, y_test, scaler, feature_columns = load_and_preprocess_data(DATA_FILE)
    if X_train is not None:  # Check if data loading was successful
        # 2. Train the model
        model = train_model(X_train, y_train)
        # 3. Evaluate the model
        evaluate_model(model, X_test, y_test)
        # 4. Save the model
        save_model(model, MODEL_FILE)
        # 5. Load the model (example)
        loaded_model = load_model(MODEL_FILE)
        if loaded_model:
            # 6. Make predictions with new data. The input must include the
            # feature-engineered columns (here, vibration_rolling_mean).
            new_data = pd.DataFrame({
                'temperature': [28.0],
                'vibration': [0.3],
                'power_output': [1.8],
                'wind_speed': [11.0],
                'vibration_rolling_mean': [0.25]
            })
            # Reuse the scaler and feature columns returned by load_and_preprocess_data
            probabilities = predict_failure(loaded_model, scaler, feature_columns, new_data)
            print("Predicted failure probability:", probabilities)
```
Key improvements and explanations:
* **Clearer Structure and Comments:** The code is broken down into logical sections with detailed comments explaining each step. This makes it much easier to understand and maintain.
* **Error Handling:** Includes `try-except` blocks for file operations (loading data and models) to prevent the program from crashing. Handles potential `FileNotFoundError`.
* **Feature Engineering:** Adds a placeholder for feature engineering. This is critical because the model's performance heavily depends on the quality of features. The example shows creating a rolling average of vibration.
* **Missing Value Handling:** Imputes missing values using the mean. While simple, it prevents errors if your data has missing values. This is a common issue in real-world datasets. You might consider more sophisticated imputation techniques (e.g., using median or a more advanced imputation method) depending on your data.
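For instance, a minimal sketch of swapping in median or KNN imputation with scikit-learn's `sklearn.impute` module; `impute_features` is a hypothetical helper, and `data` is assumed to be the numeric feature DataFrame inside `load_and_preprocess_data`:

```python
# Sketch only: alternative imputation strategies for the numeric feature frame.
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

def impute_features(data: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Impute missing values with the median, or with KNN when strategy='knn'."""
    if strategy == "knn":
        imputer = KNNImputer(n_neighbors=5)  # fill each gap from the 5 most similar rows
    else:
        imputer = SimpleImputer(strategy=strategy)  # 'median' is more robust to outliers than 'mean'
    return pd.DataFrame(imputer.fit_transform(data), columns=data.columns, index=data.index)
```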
* **Data Scaling:** Uses `StandardScaler` to scale the features. Tree-based models such as Random Forest are largely insensitive to feature scale, but scaling is essential for algorithms such as Support Vector Machines, neural networks, or logistic regression, so keeping it in the pipeline makes swapping models easier. The scaler is *fitted* on the *training data only* and then used to *transform* both sets. This prevents data leakage from the test set into the training process.
* **Class Weighting:** Crucially, uses `class_weight='balanced'` in the `RandomForestClassifier`. This is *essential* for predictive maintenance because failures are usually rare events. Without it, the model will likely be biased towards predicting no failures.
* **Model Saving and Loading:** Implements model persistence using `joblib`, so you can train once and reuse the model for predictions without retraining. In production you would persist the fitted `StandardScaler` the same way (e.g., `joblib.dump(scaler, ...)`), since predictions require it.
* **Prediction Function:** The `predict_failure` function takes the fitted scaler and the training feature columns as explicit arguments (no hidden globals) and includes robust handling of input data:
    * **Column Matching:** It ensures that the new data has the same columns as the training data, adding missing columns (imputed with 0, though you might choose a different value) and dropping extra ones.
    * **Column Order:** It puts the columns in the same order as the training data, which is crucial for the model to interpret the data correctly.
    * **Missing Value Handling in Prediction:** It handles any remaining missing values in the new data *before* scaling.
    * **Data Scaling Before Prediction:** It scales the new data using the *same* `StandardScaler` object that was fitted to the training data. This is absolutely necessary.
    * **Probability Output:** It returns the probability of failure (class 1) instead of just the class label, which provides more information for decision-making.
* **Clearer Variable Names:** Uses more descriptive variable names (e.g., `X_train`, `y_train`).
* **Modularity:** The code is organized into functions, making it more reusable and testable.
* **Main Execution Block:** The `if __name__ == "__main__":` block ensures that the main part of the script only runs when the script is executed directly, not when it's imported as a module.
* **Checks for Successful Data Loading:** The code now checks if `load_and_preprocess_data` returns valid data before proceeding with training.
* **Data Exploration:** Before running the script, it's highly recommended to perform exploratory data analysis (EDA) to understand your data, identify potential issues (e.g., outliers, correlations), and guide feature engineering.
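As a minimal first pass, assuming the CSV layout shown at the top of the script:

```python
# Quick-look EDA: ranges, missingness, class balance, and crude correlations.
import pandas as pd

data = pd.read_csv('wind_turbine_data.csv')
print(data.describe())                                # ranges and means; spot obvious outliers
print(data.isnull().sum())                            # missing values per column
print(data['failure'].value_counts(normalize=True))   # how imbalanced the target is
print(data.select_dtypes('number').corr()['failure'].sort_values())  # feature/target correlation
```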
* **Hyperparameter Tuning:** The hyperparameters of the `RandomForestClassifier` (e.g., `n_estimators`, `max_depth`, `min_samples_split`) should be tuned to optimize performance. Techniques like cross-validation and grid search can be used for this.
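A minimal `GridSearchCV` sketch; the grid values below are illustrative, not tuned recommendations:

```python
# Cross-validated grid search over a few Random Forest hyperparameters.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 10],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=42, class_weight='balanced'),
    param_grid,
    scoring='f1',   # F1 is a reasonable default when the failure class is rare
    cv=5,
    n_jobs=-1,
)
search.fit(X_train, y_train)  # X_train, y_train from load_and_preprocess_data
print(search.best_params_, search.best_score_)
model = search.best_estimator_
```

For time-ordered sensor data, consider passing `TimeSeriesSplit` as the `cv` argument instead of plain k-fold, so the future is never used to predict the past.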
* **Consider Other Models:** Random Forest is a good starting point, but you might consider other models (a drop-in sketch follows this list):
    * **Gradient Boosting Machines (e.g., XGBoost, LightGBM):** Often achieve higher accuracy than Random Forests.
    * **Support Vector Machines (SVMs):** Can be effective for high-dimensional data.
    * **Neural Networks:** Potentially more powerful, but require more data and careful tuning.
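A hedged drop-in alternative to `train_model` using scikit-learn's own gradient boosting (XGBoost and LightGBM expose similar `fit`/`predict_proba` APIs but are separate installs); `train_gbm` is a hypothetical name:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

def train_gbm(X_train, y_train):
    # GradientBoostingClassifier has no class_weight argument, so pass
    # per-sample weights to fit() to get the same 'balanced' effect as above.
    sample_weight = compute_sample_weight('balanced', y_train)
    model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1, random_state=42)
    model.fit(X_train, y_train, sample_weight=sample_weight)
    return model
```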
* **More Sophisticated Feature Engineering:** The rolling-average example is very basic. You could consider the following (see the sketch after this list):
    * **Time-series features:** Rolling statistics (mean, standard deviation, min, max) over different time windows, and lagged features (values from previous time steps).
    * **Frequency-domain features:** Using the Fast Fourier Transform (FFT) to extract frequency components from vibration data.
    * **Domain knowledge:** Features based on your understanding of wind turbine operation and failure modes.
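A sketch of such features, computed per turbine before the identifier columns are dropped; the window sizes and the 32-sample FFT window are placeholder choices, and `add_time_features` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add per-turbine rolling, lag, and crude FFT features for vibration."""
    df = df.sort_values(['turbine_id', 'timestamp']).copy()
    g = df.groupby('turbine_id')['vibration']
    df['vib_roll_mean_6'] = g.transform(lambda s: s.rolling(6, min_periods=1).mean())
    df['vib_roll_std_6'] = g.transform(lambda s: s.rolling(6, min_periods=2).std())
    df['vib_lag_1'] = g.shift(1)  # previous reading for the same turbine

    def dominant_fft(window):
        # Magnitude of the strongest non-DC frequency component in the window.
        return np.abs(np.fft.rfft(window - window.mean())).max()

    df['vib_fft_peak_32'] = g.transform(
        lambda s: s.rolling(32, min_periods=8).apply(dominant_fft, raw=True))
    return df
```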
* **Anomaly Detection:** Consider using anomaly detection techniques to identify unusual operating conditions that might indicate a potential failure.
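For instance, an `IsolationForest` trained on the (mostly healthy) scaled training features; the 1% contamination rate is an assumption:

```python
# Unsupervised complement to the classifier: flag operating points that look
# unlike the bulk of the training data.
from sklearn.ensemble import IsolationForest

iso = IsolationForest(contamination=0.01, random_state=42)
iso.fit(X_train)                             # scaled training features from above
anomaly_flags = iso.predict(X_test)          # -1 = anomaly, 1 = normal
anomaly_scores = iso.score_samples(X_test)   # lower = more anomalous
```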
* **Continuous Monitoring and Retraining:** The model should be continuously monitored and retrained as new data becomes available. This ensures that the model remains accurate and adapts to changes in the operating environment.
* **Explainable AI (XAI):** Use techniques like feature importance or SHAP values to understand which features are most important for the model's predictions. This can help to build trust in the model and provide insights into the failure process.
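Two quick inspection passes using only scikit-learn (`feature_columns` comes from `load_and_preprocess_data`; SHAP offers richer per-prediction explanations but is a separate install):

```python
import pandas as pd
from sklearn.inspection import permutation_importance

# Built-in impurity-based importances (fast, but can favor high-cardinality features).
print(pd.Series(model.feature_importances_, index=feature_columns)
        .sort_values(ascending=False))

# Permutation importance on held-out data is usually a more faithful ranking.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
print(pd.Series(result.importances_mean, index=feature_columns)
        .sort_values(ascending=False))
```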
* **Threshold Tuning:** Instead of directly using the predicted class (0 or 1), use the predicted probability and set a threshold. Adjusting this threshold (e.g., classifying as a failure if the probability is above 0.7 instead of 0.5) can significantly impact precision and recall. Choose a threshold that balances these metrics based on your specific business needs.
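A minimal threshold sweep on the held-out test set; the candidate thresholds are arbitrary:

```python
from sklearn.metrics import precision_score, recall_score

proba = model.predict_proba(X_test)[:, 1]  # probability of failure
for threshold in (0.3, 0.5, 0.7):
    y_pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_test, y_pred, zero_division=0):.2f}  "
          f"recall={recall_score(y_test, y_pred, zero_division=0):.2f}")
```

For a full picture, `sklearn.metrics.precision_recall_curve` evaluates every threshold at once.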
* **Ensemble Methods:** Combine predictions from multiple models (e.g., Random Forest, XGBoost, SVM) to improve accuracy and robustness.
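A soft-voting sketch combining two of the models above; the exact estimators and settings are placeholders:

```python
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)

ensemble = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)),
        ('gbm', GradientBoostingClassifier(n_estimators=200, random_state=42)),
    ],
    voting='soft',  # average predicted probabilities so predict_proba still works
)
ensemble.fit(X_train, y_train)
```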
This code provides a complete, practical starting point for an AI-based predictive maintenance system for wind turbines. Adapt it to your specific data and requirements.