Kubeflow: Hyperparameter Tuning with Katib

Katib is a Kubernetes-native system for automated machine learning (AutoML). It supports hyperparameter tuning and neural architecture search (NAS), helping data scientists and ML engineers find well-performing configurations for their models. Katib is part of Kubeflow and uses Kubernetes's orchestration capabilities to run many training jobs in parallel.

1. Key Concepts in Katib

Experiment: The core Katib resource that defines an AutoML job. An experiment specifies:
- Search Algorithm: The strategy to explore the hyperparameter space (e.g., Grid Search, Random Search, Bayesian Optimization, Hyperband).
- Objective Metric: The metric to optimize (e.g., accuracy, loss, F1-score).
- Parameters: The hyperparameters to tune, including their type (int, double, discrete, categorical) and feasible space (a min/max range with an optional step, or an explicit list of values).
- Trial Template: A Kubernetes Job or Kubeflow TFJob/PyTorchJob template that Katib uses to run individual training jobs (trials) with different hyperparameter combinations.
Trial: A single run of the training code with a specific set of hyperparameter values, managed by Katib.
Suggestion: A set of hyperparameter values proposed by Katib's search algorithm for a new trial.
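
The relationship between these three concepts can be sketched as a small loop in plain Python. This is only a toy model, not Katib code: suggest, run_trial, and run_experiment are hypothetical names, and the scoring formula is made up so the example runs without any training.

```python
import random

def suggest(space, rng):
    """Search-algorithm stand-in: propose one set of hyperparameter values."""
    return {
        "learning_rate": rng.uniform(*space["learning_rate"]),
        "num_layers": rng.randint(*space["num_layers"]),  # inclusive bounds
    }

def run_trial(params):
    """Trial stand-in: a made-up score that peaks at learning_rate=0.01, num_layers=2."""
    return 1.0 - abs(params["learning_rate"] - 0.01) * 5 - abs(params["num_layers"] - 2) * 0.1

def run_experiment(space, max_trials, seed=0):
    """Experiment stand-in: request suggestions, run trials, keep the best (maximize)."""
    rng = random.Random(seed)
    trials = []
    for _ in range(max_trials):
        params = suggest(space, rng)
        trials.append({"params": params, "objective": run_trial(params)})
    best = max(trials, key=lambda t: t["objective"])
    return trials, best

space = {"learning_rate": (0.001, 0.1), "num_layers": (1, 3)}
trials, best = run_experiment(space, max_trials=10)
print(f"ran {len(trials)} trials; best objective {best['objective']:.3f}")
```

In Katib the same roles are played by separate Kubernetes resources, and each trial runs as its own job rather than as a function call.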

2. Setting up Katib (Conceptual)

Katib is installed as part of a full Kubeflow deployment, and it can also be installed standalone on a Kubernetes cluster. Once installed, Katib's components (such as the controller and UI) should be available.

3. Defining a Katib Experiment

A Katib experiment is defined using a YAML manifest (Experiment kind).

Example: Hyperparameter Tuning for a TensorFlow Model

Let's assume you want to tune the learning rate, number of layers, and number of units per layer for a simple TensorFlow model.

# filename: tf-experiment-example.yaml
apiVersion: "kubeflow.org/v1beta1"
kind: Experiment
metadata:
  name: tf-hyperparam-tuning
  namespace: kubeflow # Or your specific namespace
spec:
  objective:
    type: maximize # or minimize
    goal: 0.99     # Optional: target value for the metric
    objectiveMetricName: accuracy # The metric Katib will optimize
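  algorithm:
    algorithmName: random # Choose a search algorithm, e.g. random, grid, bayesianoptimization, tpe, hyperband
  # Trial budget (illustrative values). Without a limit, the experiment keeps running until the goal is reached.
  parallelTrialCount: 3   # Number of trials run concurrently
  maxTrialCount: 12       # Stop after this many trials
  maxFailedTrialCount: 3  # Mark the experiment as failed after this many failed trials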
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        min: "0.001"
        max: "0.1"
        step: "0.01" # Used by grid search; random and Bayesian search sample from the min/max range
    - name: num_layers
      parameterType: int
      feasibleSpace:
        min: "1"
        max: "3"
    - name: units_per_layer
      parameterType: int
      feasibleSpace:
        min: "32"
        max: "128"
        step: "32"
  trialTemplate:
    # This template defines how to run a single trial.
    # It usually points to a container image that has your training code.
    trialParameters: # Parameters that will be passed to your trial container
      - name: lr
        description: Learning rate for the model
        reference: learning_rate # Maps to the parameter name defined above
      - name: layers
        description: Number of hidden layers
        reference: num_layers
      - name: units
        description: Number of units per layer
        reference: units_per_layer
    primaryContainerName: tf-train-container
    # A batch/v1 Job reports success with a "Complete" condition (not "Succeeded").
    successCondition: 'status.conditions.#(type=="Complete")#|#(status=="True")#'
    failureCondition: 'status.conditions.#(type=="Failed")#|#(status=="True")#'
    # This is a generic Kubernetes Job template.
    # For TFJob/PyTorchJob, the template would be slightly different.
    # Below is a simplified example of a K8s Job.
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          spec:
            containers:
              - name: tf-train-container
                image: your-custom-tf-trainer-image:latest # Your Docker image for training
                command:
                  - "python"
                  - "train.py"
                  - "--learning_rate=${trialParameters.lr}" # Pass hyperparams to script
                  - "--num_layers=${trialParameters.layers}"
                  - "--units_per_layer=${trialParameters.units}"
                # Your training script (train.py) should:
                # 1. Parse these command-line arguments.
                # 2. Train the model with these hyperparameters.
                # 3. Print the objective metric (e.g., 'accuracy=0.92') to stdout for Katib to capture.
                #    Katib uses regex to parse metrics from logs.
            restartPolicy: Never
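
To make the mapping between parameters, trialParameters, and the container command concrete, here is a minimal Python sketch of the placeholder substitution. It illustrates the idea only and is not Katib's implementation: render_trial_command is a hypothetical helper, and the suggested values are invented.

```python
def render_trial_command(command, trial_parameters, suggestion):
    """Replace ${trialParameters.<name>} placeholders with suggested values.

    trial_parameters maps each placeholder name to the experiment parameter
    it references (the 'reference' field in the manifest).
    """
    rendered = []
    for arg in command:
        for name, reference in trial_parameters.items():
            placeholder = "${trialParameters." + name + "}"
            arg = arg.replace(placeholder, str(suggestion[reference]))
        rendered.append(arg)
    return rendered

command = [
    "python",
    "train.py",
    "--learning_rate=${trialParameters.lr}",
    "--num_layers=${trialParameters.layers}",
    "--units_per_layer=${trialParameters.units}",
]
trial_parameters = {"lr": "learning_rate", "layers": "num_layers", "units": "units_per_layer"}
# Invented suggestion for one trial; values are strings, as in the manifest.
suggestion = {"learning_rate": "0.0123", "num_layers": "2", "units_per_layer": "64"}

rendered = render_trial_command(command, trial_parameters, suggestion)
print(" ".join(rendered))
```

Each trial gets its own rendered command, so the training script only ever sees ordinary command-line arguments.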

Example `train.py` (Conceptual)

Your train.py script would look something like this:

# train.py (Simplified)
import argparse
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--learning_rate', type=float, default=0.01)
    parser.add_argument('--num_layers', type=int, default=1)
    parser.add_argument('--units_per_layer', type=int, default=64)
    args = parser.parse_args()

    # Build a simple model
    model = keras.Sequential()
    model.add(keras.Input(shape=(10,)))  # keras.Input works in both TF 2.x Keras and Keras 3
    for _ in range(args.num_layers):
        model.add(layers.Dense(args.units_per_layer, activation='relu'))
    model.add(layers.Dense(1, activation='sigmoid'))

    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=args.learning_rate),
                  loss='binary_crossentropy',
                  metrics=['accuracy'])

    # Generate dummy data for training
    X_train = np.random.rand(100, 10).astype('float32')
    y_train = np.random.randint(0, 2, size=(100, 1)).astype('float32')

    # Train (briefly for demonstration)
    history = model.fit(X_train, y_train, epochs=2, verbose=0) # Set verbose to 0 for cleaner logs

    # Extract the metric Katib is looking for and print to stdout
    # Katib uses regex to parse this output (e.g., 'accuracy=(\d+\.\d+)')
    final_accuracy = history.history['accuracy'][-1]
    print(f"accuracy={final_accuracy}") # IMPORTANT: Print the metric in this format
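
The name=value convention matters because the metric is recovered from the trial's log text. The sketch below shows one way such parsing could work. The regular expression and the parse_metrics helper are assumptions made for illustration, not Katib's actual metrics collector, and the log text is invented.

```python
import re

# Assumed pattern: a whole line of the form "<name>=<number>", with optional
# whitespace around "=" and optional scientific notation.
METRIC_LINE = re.compile(r"^\s*([\w-]+)\s*=\s*([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*$")

def parse_metrics(log_text, wanted):
    """Return the last reported value for each wanted metric name found in the log."""
    latest = {}
    for line in log_text.splitlines():
        match = METRIC_LINE.match(line)
        if match and match.group(1) in wanted:
            latest[match.group(1)] = float(match.group(2))
    return latest

log = "Epoch 1 done\naccuracy=0.5\nloss=0.69\nEpoch 2 done\naccuracy=0.53\n"
metrics = parse_metrics(log, {"accuracy"})
print(metrics)
```

A line such as "accuracy: 0.53" would not match this pattern, which is why the training script should print the metric on its own line in name=value form.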

Steps to Run a Katib Experiment (Conceptual)

Build Docker Image: Create a Docker image containing your train.py script and its dependencies. Push it to a container registry.
Apply YAML: Run kubectl apply -f tf-experiment-example.yaml to create the experiment.
Monitor: Use the Kubeflow UI (Katib section) or kubectl get experiments -n kubeflow to monitor the experiment's progress. You can see individual trials, their logs, and the best-found hyperparameters.
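
For scripted monitoring, the JSON printed by kubectl get experiment tf-hyperparam-tuning -n kubeflow -o json can be summarized with a few lines of Python. The sketch below assumes the best trial is reported under status.currentOptimalTrial with parameterAssignments and observation.metrics fields; verify these names against your cluster's output. The sample document is hand-written for illustration (including the trial name) and does not come from a real run.

```python
import json

def summarize_best_trial(experiment):
    """Extract the best trial's name, parameters, and metrics from an Experiment dict."""
    optimal = experiment.get("status", {}).get("currentOptimalTrial") or {}
    params = {p["name"]: p["value"] for p in optimal.get("parameterAssignments") or []}
    observation = optimal.get("observation") or {}
    metrics = {m["name"]: m.get("latest") for m in observation.get("metrics") or []}
    return {"trial": optimal.get("bestTrialName"), "parameters": params, "metrics": metrics}

# Hand-written sample with an assumed layout; all values are illustrative strings.
sample_json = """
{
  "metadata": {"name": "tf-hyperparam-tuning"},
  "status": {
    "currentOptimalTrial": {
      "bestTrialName": "tf-hyperparam-tuning-abc123",
      "parameterAssignments": [
        {"name": "learning_rate", "value": "0.021"},
        {"name": "num_layers", "value": "2"},
        {"name": "units_per_layer", "value": "64"}
      ],
      "observation": {
        "metrics": [{"name": "accuracy", "latest": "0.61", "max": "0.61", "min": "0.55"}]
      }
    }
  }
}
"""

summary = summarize_best_trial(json.loads(sample_json))
print(summary)
```

An experiment that has not finished any trial yet simply yields empty parameters and metrics, so the helper can be called repeatedly while the experiment runs.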

4. Search Algorithms

Katib supports various search algorithms:

Random Search: Randomly samples hyperparameter combinations.
Grid Search: Exhaustively searches a predefined grid of hyperparameters.
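Bayesian Optimization: Uses a probabilistic model of the objective to guide the search, and often needs fewer trials than random or grid search. In Katib, Gaussian-process-based optimization is selected with bayesianoptimization, while the Tree-structured Parzen Estimator is a separate algorithm named tpe.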
Hyperband: A bandit-based strategy that adaptively allocates resources to promising hyperparameter configurations.
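
The differences between these strategies are easier to see in code. The following plain-Python sketch is not Katib's implementation: it enumerates a grid, draws random samples, and runs a single successive-halving bracket, which is the building block that Hyperband repeats with different starting budgets. The toy_score function is a made-up stand-in for training a model with a given budget.

```python
import itertools
import random

def grid_search(grid):
    """Enumerate every combination of the listed values."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in itertools.product(*(grid[n] for n in names))]

def random_search(space, n, rng):
    """Draw n configurations uniformly from (min, max) ranges."""
    return [{name: rng.uniform(lo, hi) for name, (lo, hi) in space.items()} for _ in range(n)]

def successive_halving(configs, evaluate, min_budget=1, eta=2):
    """Keep the top 1/eta of configurations each round while multiplying the budget by eta."""
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = ranked[: max(1, len(ranked) // eta)]
        budget *= eta
    return survivors[0]

def toy_score(config, budget):
    """Made-up objective: best near learning_rate=0.01, slightly better with more budget."""
    return 1.0 - abs(config["learning_rate"] - 0.01) - 0.1 / budget

grid_configs = grid_search({"num_layers": [1, 2, 3], "units_per_layer": [32, 64, 96, 128]})
random_configs = random_search({"learning_rate": (0.001, 0.1)}, n=8, rng=random.Random(0))
winner = successive_halving(random_configs, toy_score)

print(f"grid size: {len(grid_configs)}")
print(f"successive halving kept learning_rate={winner['learning_rate']:.4f}")
```

Grid search grows multiplicatively with each added parameter, whereas random search and successive halving work with a fixed number of configurations chosen up front.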

Further Topics:

Neural Architecture Search (NAS) with Katib.
Visualizing Katib experiment results in the Kubeflow UI.
Integrating Katib experiments into Kubeflow Pipelines.
Advanced trial templates (e.g., using TFJob or PyTorchJob custom resources).
Early stopping for trials within Katib.

Katib automates the tedious and computationally intensive process of hyperparameter tuning and scales it across a Kubernetes cluster, which can lead to better-performing ML models and more efficient use of computational resources.