Blending statistical and ML forecasts for more accurate projections
Okay, here's a Python program demonstrating how to blend statistical (ARIMA) and machine learning (Random Forest) forecasts for potentially improved accuracy. I'll include explanations within the code.
```python
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
# 1. Generate Sample Time Series Data (Replace with your actual data)
np.random.seed(42) # for reproducibility
dates = pd.date_range(start='2023-01-01', end='2024-01-01', freq='D')
values = np.sin(np.linspace(0, 10*np.pi, len(dates))) + np.random.normal(0, 0.2, len(dates)) # Example: Sine wave + noise
data = pd.DataFrame({'Date': dates, 'Value': values})
data.set_index('Date', inplace=True)
# 2. Split Data into Training and Testing Sets
train_size = int(len(data) * 0.8)
train_data = data[:train_size]
test_data = data[train_size:]
# 3. ARIMA Model Training and Forecasting
# - p, d, q: Order of the ARIMA model (AR, differencing, MA components)
# - Careful selection of p, d, q is crucial for good ARIMA performance
# - You would typically use techniques like ACF/PACF plots or auto_arima to determine appropriate values.
arima_order = (5, 1, 0) # Example order - adjust based on your data
arima_model = ARIMA(train_data['Value'], order=arima_order)
arima_fit = arima_model.fit()
arima_predictions = arima_fit.forecast(steps=len(test_data)) # Forecast over the test period
# Use .values so the forecast lines up with the test dates even when statsmodels
# cannot infer a frequency from the index and returns integer-labeled output.
arima_predictions_df = pd.DataFrame({'ARIMA': arima_predictions.values}, index=test_data.index)
# 4. Random Forest Model Training and Forecasting
# Feature Engineering: Create lagged features for the Random Forest
def create_lagged_features(df, lags):
    """Add lag_1 ... lag_{lags} columns built from past values of 'Value'."""
    for lag in range(1, lags + 1):
        df[f'lag_{lag}'] = df['Value'].shift(lag)
    df.dropna(inplace=True)  # Remove the first `lags` rows, whose lag columns are NaN
    return df
lags = 7 # Number of lagged values to use as features (e.g., past 7 days)
train_data_rf = train_data.copy() # Create a copy to avoid modifying the original
train_data_rf = create_lagged_features(train_data_rf, lags)
test_data_rf = test_data.copy() # Create a copy for test data
# Note: building the lags from the test set itself means each prediction uses
# actual past observations (one-step-ahead style), and the first `lags` test
# rows are dropped because their lag columns are NaN.
test_data_rf = create_lagged_features(test_data_rf, lags)
X_train = train_data_rf.drop('Value', axis=1)
y_train = train_data_rf['Value']
X_test = test_data_rf.drop('Value', axis=1)
y_test = test_data_rf['Value']
# Train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42) # n_estimators: Number of trees
rf_model.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
rf_predictions_df = pd.DataFrame({'RF': rf_predictions}, index=test_data_rf.index) # index is shortened by the dropped lag rows
# 5. Blend the Forecasts
# Align the ARIMA forecast with the (shorter) RF test index: creating the lagged
# features dropped the first `lags` rows of the test set, so all comparisons
# below are restricted to test_data_rf.index.
arima_aligned = arima_predictions_df.loc[test_data_rf.index, 'ARIMA']
# Simple averaging: assign a weight to each model (weights should sum to 1)
arima_weight = 0.5
rf_weight = 0.5
blended_predictions = arima_weight * arima_aligned + rf_weight * rf_predictions_df['RF']
blended_predictions_df = pd.DataFrame({'Blended': blended_predictions}, index=test_data_rf.index)
# 6. Evaluate the Results (all series now share test_data_rf.index, so lengths match)
rmse_arima = np.sqrt(mean_squared_error(test_data_rf['Value'], arima_aligned))
rmse_rf = np.sqrt(mean_squared_error(test_data_rf['Value'], rf_predictions_df['RF']))
rmse_blended = np.sqrt(mean_squared_error(test_data_rf['Value'], blended_predictions_df['Blended']))
print(f'ARIMA RMSE: {rmse_arima}')
print(f'Random Forest RMSE: {rmse_rf}')
print(f'Blended RMSE: {rmse_blended}')
# 7. Plot the Results
plt.figure(figsize=(12, 6))
plt.plot(test_data_rf.index, test_data_rf['Value'], label='Actual', color='blue')
plt.plot(test_data_rf.index, arima_aligned, label='ARIMA', color='orange')
plt.plot(test_data_rf.index, rf_predictions_df['RF'], label='Random Forest', color='green')
plt.plot(test_data_rf.index, blended_predictions_df['Blended'], label='Blended', color='red')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Time Series Forecasting with Blending')
plt.legend()
plt.grid(True)
plt.show()
```
Key improvements and explanations:
* **Data Splitting:** Explicitly splits the data into training and testing sets to properly evaluate the models.
* **ARIMA Model:** Uses `statsmodels.tsa.arima.model.ARIMA`. Includes comments on the importance of selecting appropriate ARIMA order (p, d, q) and mentions techniques for doing so.
* **Random Forest:**
* **Feature Engineering:** Crucially, lagged values of the time series are created and used as features for the Random Forest. This is essential for the RF model to learn temporal dependencies. A function `create_lagged_features` is added to handle this. The number of lags is configurable.
* **Data Copying:** Creates copies of the training and testing data (`train_data.copy()`, `test_data.copy()`) before feature engineering to avoid modifying the original dataframes, which can lead to unexpected behavior.
* **Index Alignment:** Creating the lagged features drops the first `lags` rows of the test set, so the ARIMA forecast is sliced to `test_data_rf.index` (as `arima_aligned`) before blending, evaluation, and plotting. Without this alignment the series have different lengths and the RMSE and plot calls fail.
* **Blending:** Demonstrates a simple averaging approach for blending. The weights can be adjusted.
* **Evaluation:** Calculates the Root Mean Squared Error (RMSE) for each model and the blended forecast. RMSE is a common metric for evaluating time series forecasting accuracy.
* **Plotting:** Plots the actual values, ARIMA forecasts, Random Forest forecasts, and the blended forecast for comparison.
* **Clarity and Comments:** Added more comments to explain each step.
* **NaN Handling:** `create_lagged_features()` calls `.dropna()` to remove the rows whose lag columns are `NaN` after the `shift()` operation, which would otherwise break model fitting.
* **Reproducibility:** Added `np.random.seed(42)` to make the random number generation consistent across runs, making the results reproducible.
How to Run:
1. **Install Libraries:**
```bash
pip install pandas numpy statsmodels scikit-learn matplotlib
```
2. **Run the Python script:** `python your_script_name.py`
Important Considerations:
* **ARIMA Order (p, d, q):** Selecting the correct order for the ARIMA model is critical. Use ACF/PACF plots or an automated search such as `auto_arima` from the `pmdarima` library (a minimal sketch appears after this list).
* **Lagged Features (Random Forest):** The number of lags used as features for the Random Forest is a hyperparameter to tune. Try different values (e.g., 3, 7, 14) and compare them, ideally with time series cross-validation (see the `TimeSeriesSplit` sketch below).
* **Blending Weights:** Experiment with different weights for the ARIMA and Random Forest models in the blending step; giving more weight to one model often improves overall accuracy. You can also optimize the weights on a validation set (see the weight-search sketch below).
* **Data Preprocessing:** Consider whether your data needs any preprocessing steps, such as scaling or detrending, before training the models.
* **More Sophisticated Blending:** Instead of simple averaging, you can use more advanced blending techniques, such as:
    * **Regression-Based Blending:** Train a regression model (e.g., linear regression) on the individual models' forecasts so it learns the optimal weight for each model.
    * **Stacking:** Use the forecasts from the individual models as input features to another machine learning model (a "meta-learner") that makes the final prediction. A minimal meta-learner sketch appears after this list.
* **Cross-Validation:** Use time series cross-validation to evaluate your models more robustly, especially with limited data; scikit-learn provides `TimeSeriesSplit` for this (sketch below).
* **Stationarity:** The ARIMA model assumes stationarity. If your time series is not stationary, apply differencing (the 'd' parameter in ARIMA) or another transformation first; the Augmented Dickey-Fuller test is a quick check (sketch below).
* **Seasonality:** If your data has seasonality (e.g., daily, weekly, or yearly patterns), consider a seasonal ARIMA (SARIMA) model or seasonal features for the Random Forest (see the `SARIMAX` sketch below).
* **Feature Importance (Random Forest):** Examine the Random Forest's feature importances to see which lagged values matter most for prediction; this can guide your feature engineering (sketch below).
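A minimal sketch of automated order selection with `pmdarima`'s `auto_arima` (install with `pip install pmdarima`); it reuses the `train_data` frame defined above:
```python
from pmdarima import auto_arima

auto_model = auto_arima(
    train_data['Value'],
    seasonal=False,        # set seasonal=True (with m=<period>) for seasonal data
    stepwise=True,         # stepwise search is much faster than a full grid
    suppress_warnings=True,
)
print(auto_model.order)    # the selected (p, d, q), e.g. to reuse as arima_order above
```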
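For the blending weights, a simple grid search works. This sketch scores candidate weights on the test period for brevity; in practice, carve a validation window out of the training data so the weight is not tuned on the data you report results for:
```python
# Grid search over the ARIMA weight; reuses arima_aligned and rf_predictions_df
# from steps 4-5 above.
actual = test_data_rf['Value'].values
arima_vals = arima_aligned.values
rf_vals = rf_predictions_df['RF'].values

best_w, best_rmse = 0.5, float('inf')
for w in np.linspace(0.0, 1.0, 21):  # candidate ARIMA weights in steps of 0.05
    rmse = np.sqrt(np.mean((actual - (w * arima_vals + (1 - w) * rf_vals)) ** 2))
    if rmse < best_rmse:
        best_w, best_rmse = w, rmse
print(f'Best ARIMA weight: {best_w:.2f} (RMSE {best_rmse:.4f})')
```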
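For regression-based blending and stacking, here is a sketch of a linear meta-learner fitted on the two base forecasts. For illustration it fits on the first half of the test window and predicts the second; a real setup would generate base forecasts on a separate validation window first:
```python
from sklearn.linear_model import LinearRegression

# Stack the two base forecasts column-wise as meta-features
meta_X = np.column_stack([arima_aligned.values, rf_predictions_df['RF'].values])
meta_y = test_data_rf['Value'].values
split = len(meta_y) // 2

meta_model = LinearRegression()
meta_model.fit(meta_X[:split], meta_y[:split])      # learn the blend weights
stacked_preds = meta_model.predict(meta_X[split:])  # blended forecast
print('Learned weights:', meta_model.coef_, 'intercept:', meta_model.intercept_)
print('Stacked RMSE:', np.sqrt(mean_squared_error(meta_y[split:], stacked_preds)))
```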
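For cross-validation (and for tuning the lag count), a sketch using scikit-learn's `TimeSeriesSplit`. It runs on the full `data` frame for simplicity, which includes the final test period, so treat the scores as model-selection guidance only:
```python
from sklearn.model_selection import TimeSeriesSplit

for candidate_lags in (3, 7, 14):
    df = create_lagged_features(data.copy(), candidate_lags)
    X, y = df.drop('Value', axis=1), df['Value']
    scores = []
    # Each split trains on an initial segment and tests on the segment after it
    for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
        model = RandomForestRegressor(n_estimators=100, random_state=42)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        preds = model.predict(X.iloc[test_idx])
        scores.append(np.sqrt(mean_squared_error(y.iloc[test_idx], preds)))
    print(f'lags={candidate_lags}: mean CV RMSE = {np.mean(scores):.4f}')
```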
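For a quick stationarity check, the Augmented Dickey-Fuller test from `statsmodels`:
```python
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, *_ = adfuller(train_data['Value'])
print(f'ADF statistic: {adf_stat:.3f}, p-value: {p_value:.4f}')
if p_value > 0.05:
    print('Series looks non-stationary; consider differencing (increase d).')
```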
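For seasonal data, a sketch of a seasonal ARIMA via `statsmodels`' `SARIMAX`. The `seasonal_order` values here are placeholders to replace with orders appropriate to your series; the final element (`m=7`) assumes a weekly cycle in daily data:
```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

sarima_model = SARIMAX(train_data['Value'], order=(1, 1, 1),
                       seasonal_order=(1, 1, 1, 7))  # (P, D, Q, m); m = season length
sarima_fit = sarima_model.fit(disp=False)
sarima_forecast = sarima_fit.forecast(steps=len(test_data))
```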
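And a two-liner to inspect which lags the fitted Random Forest actually relies on:
```python
# Pair each lag column with its importance score from the fitted model
importances = pd.Series(rf_model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```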
This comprehensive example provides a solid foundation for blending statistical and machine learning forecasts in Python. Remember to adapt the code to your specific data and problem, and to carefully tune the hyperparameters of the models.