Kubeflow: Hyperparameter Tuning with Katib
Katib is a Kubernetes-native system for automated machine learning (AutoML). It supports hyperparameter tuning and neural architecture search (NAS), helping data scientists and ML engineers find well-performing configurations for their models. Katib is a core component of Kubeflow and uses Kubernetes orchestration to run many training jobs in parallel.
1. Key Concepts in Katib
- Experiment: The core Katib resource that defines an AutoML job. An experiment specifies:
  - Search Algorithm: The strategy to explore the hyperparameter space (e.g., Grid Search, Random Search, Bayesian Optimization, Hyperband).
  - Objective Metric: The metric to optimize (e.g., accuracy, loss, F1-score).
  - Parameters: The hyperparameters to tune, including their type (int, double, categorical), range, and step.
  - Trial Template: A Kubernetes Job or Kubeflow TFJob/PyTorchJob template that Katib uses to run individual training jobs (trials) with different hyperparameter combinations.
- Trial: A single run of the training code with a specific set of hyperparameter values, managed by Katib.
- Suggestion: A set of hyperparameter values proposed by Katib's search algorithm for a new trial.
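To make the relationship between these concepts concrete, here is a minimal plain-Python sketch of an experiment loop: a suggestion is drawn, a trial runs with it, and the best trial is kept. It does not use Katib at all; the class names, `suggest`, `run_experiment`, and `toy_objective` are hypothetical stand-ins chosen for illustration, and random sampling is assumed as the search strategy.

```python
import random
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Trial:
    """One training run with a fixed set of hyperparameter values."""
    assignments: dict
    metric: Optional[float] = None


@dataclass
class Experiment:
    """Search space, objective direction, and trial budget."""
    parameters: dict      # parameter name -> (min, max)
    objective_type: str   # "maximize" or "minimize"
    max_trials: int
    trials: list = field(default_factory=list)


def suggest(experiment, rng):
    """Propose one set of values (a suggestion) by uniform random sampling."""
    return {name: rng.uniform(lo, hi)
            for name, (lo, hi) in experiment.parameters.items()}


def run_experiment(experiment, train_fn, seed=0):
    """Run trials sequentially and return the best one."""
    rng = random.Random(seed)
    for _ in range(experiment.max_trials):
        trial = Trial(assignments=suggest(experiment, rng))
        trial.metric = train_fn(**trial.assignments)
        experiment.trials.append(trial)
    pick = max if experiment.objective_type == "maximize" else min
    return pick(experiment.trials, key=lambda t: t.metric)


def toy_objective(learning_rate):
    """Hypothetical stand-in for training: peaks at learning_rate = 0.01."""
    return 1.0 - abs(learning_rate - 0.01)


experiment = Experiment(
    parameters={"learning_rate": (0.001, 0.1)},
    objective_type="maximize",
    max_trials=20,
)
best = run_experiment(experiment, toy_objective)
print(best.assignments, best.metric)
```

In Katib the same roles are played by Kubernetes resources, and trials run as separate jobs in parallel rather than in a single loop.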
2. Setting up Katib (Conceptual)
Katib is installed as part of a full Kubeflow deployment; it can also be installed on its own. Once installed, its components (such as the controller and the UI) run in the cluster.
3. Defining a Katib Experiment
A Katib experiment is defined using a YAML manifest (Experiment kind).
Example: Hyperparameter Tuning for a TensorFlow Model
Let's assume you want to tune the learning rate, number of layers, and number of units per layer for a simple TensorFlow model.
# filename: tf-experiment-example.yaml
apiVersion: kubeflow.org/v1beta1
kind: Experiment
metadata:
  name: tf-hyperparam-tuning
  namespace: kubeflow  # Or your specific namespace
spec:
  objective:
    type: maximize  # or minimize
    goal: 0.99  # Optional: target value for the metric
    objectiveMetricName: accuracy  # The metric Katib will optimize
  algorithm:
    algorithmName: random  # Choose a search algorithm: random, grid, bayesianoptimization, hyperband
  parallelTrialCount: 3  # Trials to run at the same time
  maxTrialCount: 12  # Stop after this many trials even if the goal is not reached
  maxFailedTrialCount: 3  # Stop if this many trials fail
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
        step: "0.01"  # Needed by grid search; optional for random/bayesian
    - name: num_layers
      parameterType: int
      feasibleSpace:
        min: "1"
        max: "3"
    - name: units_per_layer
      parameterType: int
      feasibleSpace:
        min: "32"
        max: "128"
        step: "32"
  trialTemplate:
    # This template defines how to run a single trial.
    # It usually points to a container image that has your training code.
    primaryContainerName: tf-train-container
    # Conditions for a batch/v1 Job; TFJob/PyTorchJob report "Succeeded" instead of "Complete".
    successCondition: 'status.conditions.#(type=="Complete")#|#(status=="True")#'
    failureCondition: 'status.conditions.#(type=="Failed")#|#(status=="True")#'
    trialParameters:  # Parameters that will be passed to your trial container
      - name: lr
        description: Learning rate for the model
        reference: learning_rate  # Maps to the parameter name defined above
      - name: layers
        description: Number of hidden layers
        reference: num_layers
      - name: units
        description: Number of units per layer
        reference: units_per_layer
    # This is a generic Kubernetes Job template.
    # For TFJob/PyTorchJob, the template would be slightly different.
    # Below is a simplified example of a K8s Job.
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: tf-train-container
                image: your-custom-tf-trainer-image:latest  # Your Docker image for training
                command:
                  - "python"
                  - "train.py"
                  - "--learning_rate=${trialParameters.lr}"  # Pass hyperparams to script
                  - "--num_layers=${trialParameters.layers}"
                  - "--units_per_layer=${trialParameters.units}"
                # Your training script (train.py) should:
                # 1. Parse these command-line arguments.
                # 2. Train the model with these hyperparameters.
                # 3. Print the objective metric (e.g., 'accuracy=0.92') to stdout for Katib to capture.
                # Katib uses regex to parse metrics from logs.
            restartPolicy: Never
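The `${trialParameters.<name>}` placeholders in the manifest are filled in per trial: each placeholder name is looked up in `trialParameters`, its `reference` points at a search-space parameter, and that parameter's suggested value is substituted. The sketch below imitates this with plain string substitution. It is an illustration of the idea, not Katib's implementation; `render_trial_spec` and the sample values are hypothetical.

```python
import re


def render_trial_spec(template, trial_parameters, assignments):
    """Replace ${trialParameters.<name>} placeholders with suggested values.

    trial_parameters maps a placeholder name to the search-space parameter
    it references; assignments maps search-space parameters to values.
    """
    def substitute(match):
        placeholder = match.group(1)
        reference = trial_parameters[placeholder]
        return str(assignments[reference])

    return re.sub(r"\$\{trialParameters\.(\w+)\}", substitute, template)


template = (
    "python train.py "
    "--learning_rate=${trialParameters.lr} "
    "--num_layers=${trialParameters.layers}"
)
trial_parameters = {"lr": "learning_rate", "layers": "num_layers"}
assignments = {"learning_rate": "0.03", "num_layers": "2"}  # one suggestion

rendered = render_trial_spec(template, trial_parameters, assignments)
print(rendered)
```

The indirection through `reference` is what lets the command-line flag names in the container differ from the parameter names in the search space.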
Example train.py (Conceptual)
Your train.py script would look something like this:
# train.py (Simplified)
import argparse

import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--learning_rate', type=float, default=0.01)
    parser.add_argument('--num_layers', type=int, default=1)
    parser.add_argument('--units_per_layer', type=int, default=64)
    args = parser.parse_args()

    # Build a simple model
    model = keras.Sequential()
    model.add(keras.Input(shape=(10,)))
    for _ in range(args.num_layers):
        model.add(layers.Dense(args.units_per_layer, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    # Generate dummy data for training
    X_train = np.random.rand(100, 10).astype('float32')
    y_train = np.random.randint(0, 2, size=(100, 1)).astype('float32')

    # Train (briefly for demonstration)
    history = model.fit(X_train, y_train, epochs=2, verbose=0)  # Set verbose to 0 for cleaner logs

    # Extract the metric Katib is looking for and print to stdout
    # Katib uses regex to parse this output (e.g., 'accuracy=(\d+\.\d+)')
    final_accuracy = history.history['accuracy'][-1]
    print(f"accuracy={final_accuracy}")  # IMPORTANT: Print the metric in this format
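Because the metric is recovered from stdout by pattern matching, it can help to check locally that your log lines are parseable before submitting an experiment. The following sketch shows one way a `name=value` line could be extracted. The regular expression and `parse_metrics` are assumptions written for this example and are not copied from Katib's collector, so treat it as a sanity check rather than a specification.

```python
import re

# Matches lines such as "accuracy=0.92" or "loss = 6.9e-01".
METRIC_LINE = re.compile(
    r"([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
)


def parse_metrics(log_text, metric_names):
    """Return the last reported value for each requested metric name."""
    latest = {}
    for line in log_text.splitlines():
        for name, value in METRIC_LINE.findall(line):
            if name in metric_names:
                latest[name] = float(value)
    return latest


sample_log = """Epoch 1/2
accuracy=0.48
Epoch 2/2
accuracy=0.5299999713897705
loss=6.9e-01
"""

metrics = parse_metrics(sample_log, {"accuracy"})
print(metrics)
```

Note that extra text on the same line is tolerated here, but a metric printed under a different name (for example `acc=` instead of `accuracy=`) would be silently ignored, which is a common reason for trials that finish without a recorded objective value.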
Steps to Run a Katib Experiment (Conceptual)
- Build Docker Image: Create a Docker image containing your train.py script and its dependencies. Push it to a container registry.
- Apply YAML: kubectl apply -f tf-experiment-example.yaml
- Monitor: Use the Kubeflow UI (Katib section) or kubectl get experiments -n kubeflow to monitor the experiment's progress. You can see individual trials, their logs, and the best-found hyperparameters.
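For scripted monitoring, the experiment's JSON can be summarized with a few lines of Python. The sketch below assumes the status exposes the best trial under `status.currentOptimalTrial` with `bestTrialName`, `parameterAssignments`, and `observation.metrics`; verify these field names against the output of your own cluster before relying on them. The sample document and the trial name in it are made up for illustration.

```python
import json


def summarize_best_trial(experiment_json):
    """Extract the best trial's name, parameters, and metrics from JSON text."""
    experiment = json.loads(experiment_json)
    optimal = experiment.get("status", {}).get("currentOptimalTrial", {})
    parameters = {
        p["name"]: p["value"]
        for p in optimal.get("parameterAssignments", [])
    }
    metrics = {
        m["name"]: m.get("latest")
        for m in optimal.get("observation", {}).get("metrics", [])
    }
    return {
        "trial": optimal.get("bestTrialName"),
        "parameters": parameters,
        "metrics": metrics,
    }


# Hypothetical sample shaped like the assumed status layout.
sample = json.dumps({
    "status": {
        "currentOptimalTrial": {
            "bestTrialName": "tf-hyperparam-tuning-abc123",
            "parameterAssignments": [
                {"name": "learning_rate", "value": "0.03"},
                {"name": "num_layers", "value": "2"},
            ],
            "observation": {
                "metrics": [{"name": "accuracy", "latest": "0.53"}]
            },
        }
    }
})

summary = summarize_best_trial(sample)
print(summary)
```

In practice the input would come from something like kubectl get experiment tf-hyperparam-tuning -n kubeflow -o json, read from a file or a pipe. The function returns empty values rather than raising if the experiment has no completed trials yet.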
4. Search Algorithms
Katib supports various search algorithms:
- Random Search: Randomly samples hyperparameter combinations.
- Grid Search: Exhaustively searches a predefined grid of hyperparameters.
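- Bayesian Optimization: Builds a probabilistic model of the objective from completed trials and uses it to choose the next hyperparameters to try; it often needs fewer trials than random or grid search. Tree-structured Parzen Estimator (TPE) is a related model-based method that Katib offers as a separate algorithm.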
- Hyperband: A bandit-based strategy that adaptively allocates resources to promising hyperparameter configurations.
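The difference between these strategies is easier to see on a small search space. The sketch below enumerates a grid, draws random samples from the same space, and runs successive halving, the subroutine that Hyperband builds on (the full Hyperband bracket schedule is not shown). None of this is Katib code; the helper names, the step of 1 for `num_layers`, and `toy_score` are assumptions made for illustration.

```python
import itertools
import random


def grid_points(min_value, max_value, step):
    """Values from min_value to max_value inclusive, in increments of step."""
    count = int(round((max_value - min_value) / step)) + 1
    return [min_value + i * step for i in range(count)]


def grid_search_space(space):
    """Every combination of grid points (exhaustive grid search)."""
    names = list(space)
    axes = [grid_points(*space[name]) for name in names]
    return [dict(zip(names, combo)) for combo in itertools.product(*axes)]


def random_search_space(space, n, seed=0):
    """n independent uniform integer samples (random search)."""
    rng = random.Random(seed)
    return [
        {name: rng.randint(lo, hi) for name, (lo, hi, _step) in space.items()}
        for _ in range(n)
    ]


def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the top 1/eta of configs each round while growing the budget."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]


def toy_score(config, budget):
    """Hypothetical score; a real version would train for `budget` epochs."""
    return -abs(config["num_layers"] - 2) - abs(config["units_per_layer"] - 64) / 32


# name -> (min, max, step)
space = {"num_layers": (1, 3, 1), "units_per_layer": (32, 128, 32)}

grid = grid_search_space(space)
samples = random_search_space(space, n=5)
winner = successive_halving(grid, toy_score)
print(len(grid), samples[0], winner)
```

Grid search cost grows multiplicatively with each added parameter (here 3 x 4 = 12 trials), whereas random search and successive halving let you fix the trial budget up front.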
Further Topics:
- Neural Architecture Search (NAS) with Katib.
- Visualizing Katib experiment results in the Kubeflow UI.
- Integrating Katib experiments into Kubeflow Pipelines.
- Advanced trial templates (e.g., using TFJob or PyTorchJob custom resources).
- Early stopping for trials within Katib.
Katib automates the tedious, compute-intensive process of hyperparameter tuning at scale, which can lead to better-performing ML models and more efficient use of computational resources.