Categorical Data Model + CatBoost

Categorical Data Model

Categorical data represents types of data which may be divided into groups. These are qualitative variables that take on values that are names or labels. Examples include 'color' (red, blue, green), 'gender' (male, female), 'city' (New York, London, Paris), or 'product type' (electronics, apparel, food). Unlike numerical data, categorical values carry no inherent mathematical meaning: apart from ordinal variables (e.g. 'low', 'medium', 'high'), they have no natural order, and they cannot be used directly in arithmetic operations.
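
As a minimal illustration (assuming pandas is installed), the 'category' dtype stores such labels alongside arbitrary integer codes, and arithmetic on the labels is rejected:

import pandas as pd

# A small categorical column: labels, not numbers
colors = pd.Series(['red', 'blue', 'green', 'blue'], dtype='category')

print(colors.cat.categories)       # the distinct labels
print(colors.cat.codes.tolist())   # internal integer codes, with no numeric meaning

# Adding the labels together is not a defined operation
try:
    colors + colors
except TypeError as exc:
    print(f"Arithmetic on categories fails: {exc}")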

Machine learning models, especially traditional ones, typically require numerical input. Therefore, handling categorical features is a crucial preprocessing step. Common techniques include:

1. One-Hot Encoding: Creates a new binary feature for each unique category, indicating the presence (1) or absence (0) of that category. This can lead to a very high-dimensional sparse dataset for features with many unique categories (high cardinality).
2. Label Encoding: Assigns a unique integer to each category (e.g., 'red'=0, 'blue'=1, 'green'=2). This implies an ordinal relationship that might not exist, which can mislead models.
3. Target Encoding (Mean Encoding): Replaces each category with the mean of the target variable for that category. This can be powerful but is prone to overfitting if not regularized or done carefully (e.g., using cross-validation or adding noise).
4. Frequency/Count Encoding: Replaces each category with its frequency or count in the dataset.

Each method has its pros and cons, and the choice often depends on the dataset characteristics and the chosen machine learning algorithm.
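
To make these concrete, here is a rough sketch of the four encodings on a toy 'city' column (assuming pandas; the smoothing constant in the target encoding is an illustrative choice, not a prescribed value):

import pandas as pd

df = pd.DataFrame({
    'city':   ['London', 'Paris', 'London', 'Tokyo', 'Paris', 'London'],
    'target': [1, 0, 1, 0, 1, 0],
})

# 1. One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df['city'], prefix='city')

# 2. Label encoding: an arbitrary integer per category (implies a false order)
label_encoded = df['city'].astype('category').cat.codes

# 3. Target (mean) encoding, smoothed toward the global mean to limit
#    overfitting on rare categories (smoothing=2 is an illustrative choice)
global_mean = df['target'].mean()
stats = df.groupby('city')['target'].agg(['mean', 'count'])
smoothing = 2
target_encoded = df['city'].map(
    (stats['mean'] * stats['count'] + global_mean * smoothing)
    / (stats['count'] + smoothing)
)

# 4. Frequency/count encoding: how often each category appears
freq_encoded = df['city'].map(df['city'].value_counts())

print(one_hot)
print(label_encoded.tolist())
print(target_encoded.round(3).tolist())
print(freq_encoded.tolist())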

CatBoost

CatBoost (Categorical Boosting) is a high-performance, open-source library for gradient boosting on decision trees, developed by Yandex. It stands out for its ability to handle categorical features natively and effectively, without requiring explicit preprocessing steps such as one-hot encoding or label encoding.

Key Features and Advantages of CatBoost:

1. Native Categorical Feature Handling: CatBoost's most significant strength. It uses a permutation-driven approach to compute 'ordered target statistics' (a more robust form of target encoding) during training: the statistic for each example is derived only from examples that precede it in a random permutation. This avoids target leakage (an example's own target never influences its encoding) and the resulting prediction shift that plagues naive target encoding. For low-cardinality categorical features, it can simply use one-hot encoding instead.
2. Oblivious Decision Trees: CatBoost uses symmetric trees, also known as oblivious trees, where the same splitting criterion is used for all nodes at the same level of the tree. This design makes the model simpler, faster to predict, and helps prevent overfitting, leading to more robust models.
3. Ordered Boosting: CatBoost uses a novel gradient boosting scheme called 'Ordered Boosting'. It tackles the 'prediction shift' problem (gradients estimated on the same data points used to train the current model are biased) by randomly permuting the training data and estimating the gradient for each object with a model trained only on the objects preceding it in the permutation. This significantly reduces overfitting and improves generalization.
4. Resistance to Overfitting: Through its ordered boosting, oblivious trees, and specific handling of categorical features, CatBoost inherently offers strong resistance to overfitting compared to many other boosting algorithms.
5. Fast Prediction: The structure of oblivious trees allows for very fast inference.
6. GPU Support: CatBoost offers robust GPU support for accelerated training, which is particularly beneficial for large datasets.
7. Missing Value Handling: It can handle missing values in both numerical and categorical features gracefully.
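
Several of these properties surface directly as constructor options. The sketch below only constructs a model (no training) to show where oblivious trees, ordered boosting, low-cardinality one-hot encoding, missing-value handling, and GPU training are configured; the specific values are illustrative, not tuned recommendations:

from catboost import CatBoostClassifier

model = CatBoostClassifier(
    boosting_type='Ordered',      # ordered boosting (point 3); 'Plain' is the classic scheme
    grow_policy='SymmetricTree',  # oblivious/symmetric trees (point 2); this is the default
    one_hot_max_size=4,           # one-hot encode categorical features with <= 4 unique values (point 1)
    nan_mode='Min',               # treat missing numerical values as the minimum (point 7)
    # task_type='GPU',            # uncomment to train on GPU if one is available (point 6)
    iterations=200,
    random_seed=42,
    verbose=False,
)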

How CatBoost Handles Categorical Features Internally:

For each categorical feature, CatBoost generates several new numerical features. It does this by calculating 'target statistics' (similar to target encoding) for each category. However, to avoid target leakage and prediction shift, it computes these statistics dynamically:

- For each example in the training set, the target statistic for its categorical feature is calculated using only the examples observed before it in a random permutation of the dataset. This ensures that the statistic for an example doesn't use information from the example itself, preventing leakage.
- It can combine multiple categorical features into a single new categorical feature before encoding, allowing the model to capture interactions between them.
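
A simplified sketch of this "preceding examples only" idea is shown below. It mimics the mechanism rather than CatBoost's exact internal formula; the prior value is an assumption for illustration:

import numpy as np

# Ordered target statistic for one categorical column: each example is encoded
# using only the examples that come before it in a random permutation.
def ordered_target_statistic(categories, targets, prior=0.5, seed=0):
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(len(categories))
    sums, counts = {}, {}
    encoded = np.empty(len(categories))
    for idx in permutation:
        cat = categories[idx]
        # Statistic uses only the examples already seen in the permutation
        encoded[idx] = (sums.get(cat, 0.0) + prior) / (counts.get(cat, 0) + 1.0)
        sums[cat] = sums.get(cat, 0.0) + targets[idx]
        counts[cat] = counts.get(cat, 0) + 1
    return encoded

cities = ['London', 'Paris', 'London', 'Tokyo', 'Paris', 'London']
target = [1, 0, 1, 0, 1, 0]
print(ordered_target_statistic(cities, target).round(3))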

In summary, CatBoost automates a crucial and often complex part of the machine learning pipeline – categorical feature engineering – while delivering state-of-the-art performance and robustness.

Example Code

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from catboost import CatBoostClassifier, Pool

# 1. Create a synthetic dataset with categorical and numerical features
data = {
    'age': np.random.randint(20, 60, 1000),
    'gender': np.random.choice(['Male', 'Female'], 1000),
    'city': np.random.choice(['New York', 'London', 'Paris', 'Tokyo', 'Sydney'], 1000),
    'education': np.random.choice(['High School', 'Bachelors', 'Masters', 'PhD'], 1000),
    'income': np.random.randint(30000, 150000, 1000),
    'has_children': np.random.choice([0, 1], 1000),
    'target': np.random.choice([0, 1], 1000, p=[0.6, 0.4])  # binary classification target
}
df = pd.DataFrame(data)

print("Original DataFrame head:")
print(df.head())
print("\nDataFrame info:")
df.info()

# 2. Separate features (X) and target (y)
X = df.drop('target', axis=1)
y = df['target']

# 3. Identify categorical features
#    Plain string ('object') columns are not picked up automatically:
#    they must be declared via cat_features (by index or by name).
categorical_features_indices = np.where(X.dtypes == 'object')[0]
categorical_feature_names = X.columns[categorical_features_indices].tolist()

print(f"\nIdentified categorical features: {categorical_feature_names}")

# 4. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 5. Initialize and train a CatBoostClassifier
#    We pass the categorical feature names via cat_features;
#    CatBoost will handle these columns appropriately internally.
model = CatBoostClassifier(
    iterations=100,  # number of boosting iterations (trees)
    learning_rate=0.1,
    depth=6,
    loss_function='Logloss',  # for binary classification
    eval_metric='Accuracy',
    random_seed=42,
    verbose=10,  # print metrics every 10 iterations
    cat_features=categorical_feature_names  # pass categorical feature names here
)

print("\nTraining CatBoostClassifier...")
model.fit(X_train, y_train, eval_set=(X_test, y_test), early_stopping_rounds=20)

# 6. Make predictions on the test set
y_pred_proba = model.predict_proba(X_test)[:, 1]
y_pred = model.predict(X_test)

# 7. Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest Accuracy: {accuracy:.4f}")

print("\nFeature importances:")
feature_importances = model.get_feature_importance(Pool(X_test, y_test, cat_features=categorical_feature_names))
for score, feature in sorted(zip(feature_importances, X.columns), reverse=True):
    print(f"{feature}: {score:.4f}")

# Note how CatBoost handled the categorical features internally: no manual
# one-hot encoding or label encoding was needed; CatBoost processed
# 'gender', 'city', and 'education' directly.
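
# As a follow-up, the trained model can score a new, unseen row containing raw
# string categories without any manual encoding step (the feature values below
# are made up purely for illustration).
new_customer = pd.DataFrame([{
    'age': 35,
    'gender': 'Female',
    'city': 'Paris',
    'education': 'Masters',
    'income': 72000,
    'has_children': 1,
}])
print("Predicted class:", model.predict(new_customer)[0])
print("Predicted probability of class 1:", model.predict_proba(new_customer)[0, 1])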