AI-Powered AI Model Optimization (Python, AI, AutoML)
```python
# AI-Powered AI Model Optimization using AutoML (simplified example)
# This example demonstrates a highly simplified AutoML process using a small dataset
# and the scikit-learn library. It showcases the core concept:
# - Defining a search space for model parameters
# - Iteratively training and evaluating models within that space
# - Selecting the best-performing model.
import itertools
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# 1. Create a Sample Dataset (replace with your actual data)
# - This is a simple synthetic dataset for demonstration. In a real application,
# you would load your data from a CSV file, database, or other source.
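# A hedged sketch (commented out) of loading real data with pandas; 'my_data.csv'
# and the 'target' column are hypothetical names, not part of this example:
#   import pandas as pd
#   df = pd.read_csv('my_data.csv')
#   X = df.drop(columns=['target']).values
#   y = df['target'].values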
np.random.seed(42) # for reproducibility
X = np.random.rand(100, 5) # 100 samples, 5 features
y = (X[:, 0] + X[:, 1] > 1).astype(int) # Simple classification based on features
# 2. Split Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# 3. Define the Search Space (Model Candidates and Hyperparameters)
# - In a real AutoML system, you would use more sophisticated search techniques
# (e.g., Bayesian optimization, genetic algorithms). Here, we simply try a few
# predefined models with a few different parameter values.
model_candidates = [
    {
        'name': 'LogisticRegression',
        'model': LogisticRegression,
        'params': {
            'penalty': ['l1', 'l2'],   # Regularization penalty
            'C': [0.1, 1.0, 10.0],     # Regularization strength (inverse)
            'solver': ['liblinear']    # Algorithm to use for optimization
        }
    },
    {
        'name': 'SVC',
        'model': SVC,
        'params': {
            'C': [0.1, 1.0, 10.0],       # Regularization strength
            'kernel': ['linear', 'rbf']  # Kernel type
        }
    },
    {
        'name': 'DecisionTreeClassifier',
        'model': DecisionTreeClassifier,
        'params': {
            'max_depth': [3, 5, None],   # Maximum depth of the tree
            'min_samples_split': [2, 5]  # Minimum samples required to split an internal node
        }
    },
    {
        'name': 'RandomForestClassifier',
        'model': RandomForestClassifier,
        'params': {
            'n_estimators': [50, 100],  # Number of trees in the forest
            'max_depth': [3, 5, None]   # Maximum depth of each tree
        }
    }
]
# 4. Implement a Simple Hyperparameter Search (Grid Search)
def grid_search(model_candidate, X_train, y_train):
    """
    Performs a simplified grid search over the hyperparameters of a model.

    Args:
        model_candidate (dict): A dictionary containing the model class,
            its name, and the hyperparameters to search.
        X_train (np.ndarray): Training features.
        y_train (np.ndarray): Training labels.

    Returns:
        tuple: A tuple containing the best model instance and its best score
            (average cross-validation accuracy).
    """
    best_score = -1.0
    best_model = None
    param_names = list(model_candidate['params'].keys())
    param_values_list = [model_candidate['params'][name] for name in param_names]
    # Iterate through all combinations of hyperparameters
    for param_values in itertools.product(*param_values_list):
        params = dict(zip(param_names, param_values))  # Map each parameter name to its value for this combination
        try:
            model = model_candidate['model'](**params)  # Instantiate the model with the current parameters
            scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')  # 5-fold cross-validation
            avg_score = np.mean(scores)
            print(f"Model: {model_candidate['name']}, Params: {params}, Score: {avg_score:.4f}")
            if avg_score > best_score:
                best_score = avg_score
                best_model = model
        except Exception as e:
            print(f"Error training {model_candidate['name']} with params {params}: {e}")
    return best_model, best_score
# 5. AutoML Loop: Iterate through models and find the best one
best_overall_model = None
best_overall_score = -1.0
for candidate in model_candidates:
    print(f"\nSearching for best model of type: {candidate['name']}")
    best_model, best_score = grid_search(candidate, X_train, y_train)
    if best_score > best_overall_score:
        best_overall_score = best_score
        best_overall_model = best_model
        best_model_name = candidate['name']  # Track the name of the winning model
# 6. Evaluate the Best Model on the Test Set
if best_overall_model:
    best_overall_model.fit(X_train, y_train)  # Retrain the best model on the full training set
    # Make predictions
    y_pred = best_overall_model.predict(X_test)
    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"\nBest Model: {best_model_name} with Test Accuracy: {accuracy:.4f}")
else:
    print("No model could be trained successfully.")
# Further improvements to this example (beyond the scope of a simple demonstration):
# - More sophisticated hyperparameter search (Bayesian optimization, genetic algorithms)
# - Model selection criteria beyond accuracy (e.g., F1-score, AUC)
# - Feature engineering and selection
# - Handling missing data and categorical features
# - Scalable data processing and model training (using libraries like Dask or Spark)
# - More robust error handling and logging
# - Integration with cloud-based machine learning platforms (e.g., AWS SageMaker, Google Cloud AI Platform, Azure Machine Learning)
```
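The hand-rolled grid search above can also be expressed with scikit-learn's built-in `GridSearchCV`, which runs the same cross-validated parameter sweep with less code. The sketch below is a minimal equivalent for the `LogisticRegression` candidate only; `X_train` and `y_train` are the arrays created in the example above.
```python
# Minimal sketch: the LogisticRegression search from above, using GridSearchCV.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.1, 1.0, 10.0],
    'solver': ['liblinear'],
}

search = GridSearchCV(LogisticRegression(), param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)  # X_train / y_train come from the example above

print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
best_lr = search.best_estimator_  # refit on the full training set by default (refit=True)
```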
Key improvements and explanations:
* **Clearer Structure:** The code is now broken down into logical sections with comments explaining each step. This makes it much easier to follow.
* **Error Handling:** The `grid_search` function includes a `try...except` block to catch errors that might occur during model training (e.g., invalid hyperparameter combinations). This prevents the entire process from crashing.
* **More Comprehensive Model Candidates:** The `model_candidates` list now includes `DecisionTreeClassifier` and `RandomForestClassifier`, offering a broader search space. It also adds a few more hyperparameters to tune for each model.
* **`itertools.product` for Grid Search:** Instead of nested loops, `itertools.product` is used to generate all combinations of hyperparameters for the grid search. This is cleaner and more scalable.
* **Parameter Dictionary Creation:** The code now correctly creates a dictionary of parameters (`params`) for each combination of hyperparameters, which is passed to the model constructor.
* **Model Training on Full Training Set:** After finding the best model, it's retrained on the *entire* training set (`X_train`, `y_train`) before being evaluated on the test set. This is crucial for maximizing performance.
* **Test Set Evaluation:** The final evaluation is performed on the *test set* (`X_test`, `y_test`) to provide an unbiased estimate of the model's generalization performance.
* **Clarity on AutoML Concepts:** The comments explain the core ideas behind AutoML (search space, iterative training, evaluation, selection).
* **`cross_val_score`:** Uses `cross_val_score` for more reliable evaluation during the hyperparameter search. This helps prevent overfitting to a single training/validation split.
* **Random Seed:** Sets `np.random.seed(42)` to ensure reproducibility of the results. This is important for consistent demonstrations.
* **Model Name Tracking:** Keeps track of the *name* of the best-performing model (`best_model_name`) for better reporting.
* **Clearer Output:** Prints more informative messages during the search process, including the model name, hyperparameters, and score for each configuration.
* **Realistic Hyperparameter Ranges:** The hyperparameter values (e.g., `C` for regularization, `max_depth` for trees) are chosen to be more realistic and cover a reasonable range.
* **Solver Specification:** The `LogisticRegression` candidate explicitly sets `solver='liblinear'`, which supports both the `'l1'` and `'l2'` penalties; the default `'lbfgs'` solver does not support L1 regularization (see the short snippet after this list).
* **Avoids Data Leakage:** The cross-validation is performed *within* the `grid_search` function, ensuring that the test set is only used for final evaluation. This prevents data leakage, a common pitfall in machine learning.
* **Addresses Previous Issues:** Corrects issues with incorrect parameter passing, missing training steps, and unclear evaluation procedures.
* **More complete example:** Adds a `RandomForestClassifier` model to be evaluated in the AutoML process.
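To make the solver point above concrete, here is a small check (exact behavior and error text may vary slightly across scikit-learn versions): the default solver rejects the L1 penalty at fit time, while `liblinear` accepts it.
```python
# Small check of the solver/penalty constraint; error text may vary by scikit-learn version.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_demo = np.random.rand(20, 3)
y_demo = (X_demo[:, 0] > 0.5).astype(int)

try:
    # The default solver ('lbfgs') supports L2 but not L1 regularization.
    LogisticRegression(penalty='l1').fit(X_demo, y_demo)
except ValueError as e:
    print("Default solver rejected L1:", e)

# liblinear supports both L1 and L2 penalties.
LogisticRegression(penalty='l1', solver='liblinear').fit(X_demo, y_demo)
print("liblinear handled the L1 penalty.")
```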
This revised example provides a more accurate and practical illustration of AI-powered AI model optimization using AutoML principles. It's still simplified, but it now includes key elements and best practices for building a real AutoML system. Remember to install `scikit-learn`: `pip install scikit-learn`.
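As one concrete option for the "more sophisticated hyperparameter search" mentioned in the closing comments, the sketch below uses Optuna (a separate library, installed with `pip install optuna`), whose default TPE sampler performs a Bayesian-style search. The parameter ranges here are illustrative assumptions, and `X_train`/`y_train` are the arrays from the example above.
```python
# Hedged sketch: Bayesian-style search with Optuna's default TPE sampler.
# The ranges below are illustrative; X_train / y_train come from the example above.
import numpy as np
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 200),
        'max_depth': trial.suggest_int('max_depth', 2, 10),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 10),
    }
    model = RandomForestClassifier(random_state=42, **params)
    return np.mean(cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy'))

study = optuna.create_study(direction='maximize')  # maximize cross-validated accuracy
study.optimize(objective, n_trials=30)

print("Best params:", study.best_params)
print("Best CV accuracy:", study.best_value)
```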