Automated Machine Learning (AutoML) with H2O

Automated Machine Learning (AutoML) is the process of automating the end-to-end application of machine learning, from raw dataset to deployable machine learning models. Its primary goal is to simplify and accelerate the deployment of machine learning by automating tasks such as feature engineering, algorithm selection, hyperparameter tuning, and model validation. This significantly reduces the need for extensive human expertise and manual effort, making ML more accessible to non-experts and improving the efficiency of data scientists.

H2O, developed by H2O.ai, is an open-source, in-memory, distributed machine learning platform built for speed and scalability. It provides implementations of a wide range of statistical and machine learning algorithms, including Generalized Linear Models (GLM), Gradient Boosting Machines (GBM), Random Forests, Deep Learning, XGBoost, and more. H2O can run standalone or on Hadoop or Spark clusters, and offers APIs for popular data science languages such as Python and R.

H2O exposes AutoML through the `H2OAutoML` class. When invoked, it trains and cross-validates a series of models (a set of base learners plus stacked ensembles built from them) until it reaches a user-specified time limit or model count. It searches over algorithms and hyperparameter combinations to find the best-performing model for the given dataset and problem type (classification or regression). The output includes a 'leaderboard' of all trained models, ranked by a performance metric, with the best model (the 'leader') at the top. This lets users reach strong predictive performance without detailed knowledge of every algorithm or of hyperparameter optimization, which speeds up model development and deployment.
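
The search-and-rank idea can be illustrated without H2O at all. The following is a minimal pure-Python sketch, not H2O's implementation: the function names (`toy_automl`, `cv_rmse`, `fit_mean`, `fit_line`), the two toy candidate models, and the synthetic data are all assumptions made for illustration. It cross-validates each candidate, stops when a model-count or time budget is reached, and returns a leaderboard sorted by RMSE.

```python
import random
import statistics
import time

def k_fold_indices(n_rows, n_folds, seed):
    """Shuffle row indices and deal them into n_folds roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_folds] for i in range(n_folds)]

def fit_mean(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = statistics.fmean(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Second candidate: one-feature least-squares line."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: mean_y + slope * (x - mean_x)

def cv_rmse(fit, xs, ys, n_folds=5, seed=42):
    """Cross-validated RMSE: train on k-1 folds, score on the held-out fold."""
    squared_errors = []
    for fold in k_fold_indices(len(xs), n_folds, seed):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        squared_errors += [(model(xs[i]) - ys[i]) ** 2 for i in fold]
    return statistics.fmean(squared_errors) ** 0.5

def toy_automl(candidates, xs, ys, max_models=None, max_runtime_secs=None):
    """Score candidates until a budget is hit; return a leaderboard sorted by RMSE."""
    start = time.monotonic()
    leaderboard = []
    for name, fit in candidates:
        if max_models is not None and len(leaderboard) >= max_models:
            break
        if max_runtime_secs is not None and time.monotonic() - start > max_runtime_secs:
            break
        leaderboard.append((name, cv_rmse(fit, xs, ys)))
    return sorted(leaderboard, key=lambda row: row[1])

# Synthetic data (made up for this sketch): y is roughly linear in x.
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 0.5) for x in xs]

candidates = [("mean_baseline", fit_mean), ("linear_fit", fit_line)]
leaderboard = toy_automl(candidates, xs, ys, max_models=10)
for name, score in leaderboard:
    print(f"{name:15s} cv_rmse={score:.3f}")
leader_name = leaderboard[0][0]  # The 'leader' is the first row of the sorted leaderboard
```

A real AutoML system searches a far larger space of algorithms and hyperparameters and adds stacked ensembles, but the budget check and the sorted leaderboard follow the same outline.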

Example Code

import h2o
from h2o.automl import H2OAutoML

# 1. Initialize the H2O cluster
# Starts a local H2O cluster. Adjust max_mem_size and nthreads as needed.
h2o.init(max_mem_size="4G", nthreads=-1)  # Allocate 4 GB of memory, use all available cores

try:
    # 2. Load sample data
    # Using the Abalone dataset for a regression problem (predicting age from physical measurements).
    data_path = "https://raw.githubusercontent.com/h2oai/h2o-3/master/h2o-docs/src/product/tutorials/automl/data/abalone.csv"
    data = h2o.import_file(path=data_path)

    # 3. Identify features (x) and target (y)
    # 'Rings' is the target; the abalone's age in years is Rings + 1.5.
    y = "Rings"
    x = data.col_names
    x.remove(y)  # Remove the target column from the list of features

    # For a classification problem with a numeric target, convert it to a factor first:
    # data[y] = data[y].asfactor()

    # 4. Split data into training and test sets
    train, test = data.split_frame(ratios=[0.8], seed=42)  # Approximately 80% for training, 20% for testing

    print("\n--- Training Data Head ---")
    print(train.head())
    print(f"\nTraining data shape: {train.shape}")
    print(f"Test data shape: {test.shape}")

    # 5. Run H2OAutoML
    # max_runtime_secs: Stop after this many seconds. If neither this nor max_models is set, AutoML runs for 1 hour.
    # max_models: Stop after training this many models (stacked ensembles are not counted).
    # seed: For reproducibility (time-limited runs can still vary between machines).
    # project_name: An optional name for the AutoML run.
    # nfolds: Number of folds for cross-validation.
    aml = H2OAutoML(
        max_runtime_secs=120,  # Run for 2 minutes for demonstration
        seed=42,
        project_name="abalone_automl_regression",
        nfolds=5,  # Perform 5-fold cross-validation
        sort_metric="rmse"  # Sort leaderboard by Root Mean Squared Error
    )

    print("\n--- Starting H2O AutoML Training ---")
    aml.train(x=x, y=y, training_frame=train)

    # 6. Get the AutoML leaderboard
    # The leaderboard shows the performance of all trained models, sorted by the specified metric.
    lb = aml.leaderboard
    print("\n--- AutoML Leaderboard ---")
    print(lb)

    # 7. Get the best model (the leader) from the leaderboard
    leader_model = aml.leader
    print(f"\n--- Leader Model ID: {leader_model.model_id} ---")

    # 8. Evaluate the leader model on the test set
    print("\n--- Leader Model Performance on Test Data ---")
    performance = leader_model.model_performance(test)
    print(performance)

    # 9. Make predictions using the leader model
    predictions = leader_model.predict(test)
    print("\n--- Sample Predictions (first 10) ---")
    print(predictions.head(10))

except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # 10. Shut down the H2O cluster
    # It's good practice to shut down the H2O instance when done to free up resources.
    # h2o.cluster().shutdown() replaces the deprecated h2o.shutdown().
    h2o.cluster().shutdown(prompt=False)
    print("\nH2O cluster shut down.")
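
To make the leaderboard metric concrete, here is a small sketch of how RMSE is computed and how a predicted ring count maps to an age via the +1.5 offset mentioned in step 3. The helper names (`rmse`, `rings_to_age`) and the sample values are made up for illustration; they are not part of the H2O API and not output from the run above.

```python
import math

def rmse(actual, predicted):
    """Root Mean Squared Error: square root of the mean squared difference."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def rings_to_age(rings):
    """Convert a ring count to an age in years using the +1.5 offset."""
    return rings + 1.5

# Hypothetical values, chosen only to show the arithmetic.
actual_rings = [9, 10, 7, 15]
predicted_rings = [10.0, 9.0, 8.0, 13.0]

score = rmse(actual_rings, predicted_rings)
print(f"RMSE in rings: {score:.4f}")
print("Predicted ages:", [rings_to_age(r) for r in predicted_rings])
```

Because the age is the ring count plus a constant, an RMSE measured in rings is the same RMSE in years.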
