Kubeflow: Model Serving with KFServing (KServe)
KFServing (now officially renamed to KServe and part of the Cloud Native Computing Foundation) is a serverless model serving platform built on Kubernetes. It provides a standardized and scalable way to deploy, monitor, and manage ML models in production environments. KServe abstracts away the complexities of Kubernetes, allowing data scientists and ML engineers to focus on their models.
1. Key Concepts in KServe
- InferenceService: The core KServe Custom Resource Definition (CRD) that defines how an ML model should be deployed. It encapsulates the model's server (e.g., Triton, ONNX Runtime, a custom server), the model artifact, and other serving configurations.
- Predictor: The main component of an InferenceService that runs the ML model and performs inference. KServe supports various pre-built model servers (TensorFlow, PyTorch, Scikit-learn, XGBoost, ONNX, etc.) or allows custom predictors.
- Transformer: (Optional) A component that performs data transformation (e.g., preprocessing or post-processing) before/after model inference.
- Explainer: (Optional) A component that provides model explanations (e.g., using AI Explainability methods like SHAP or LIME).
- Canary Rollouts & A/B Testing: KServe leverages Istio for intelligent routing, enabling seamless canary deployments, A/B testing, and traffic splitting for models.
- Autoscaling: Integrates with Knative for request-based autoscaling, scaling down to zero when idle.
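To make the request-based autoscaling idea concrete, here is a deliberately simplified sketch of how a replica count could be derived from in-flight requests. It is not Knative's actual algorithm (which averages metrics over time windows and has more settings); the function name `desired_replicas` and its parameters are hypothetical and chosen only for illustration.

```python
import math


def desired_replicas(in_flight_requests, target_per_replica,
                     min_replicas=0, max_replicas=3):
    """Simplified model of request-based autoscaling (illustrative only).

    Divides the observed number of concurrent requests by the per-replica
    target, rounds up, and clamps the result to [min_replicas, max_replicas].
    """
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    wanted = math.ceil(in_flight_requests / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# With min_replicas=0 an idle service can scale to zero;
# with min_replicas=1 one replica always stays up.
print(desired_replicas(0, 10))
print(desired_replicas(0, 10, min_replicas=1))
print(desired_replicas(25, 10))
```

The clamp is the part that corresponds to the replica bounds you set on an InferenceService: a minimum of zero permits scale-to-zero, while a minimum of one trades idle cost for the absence of cold starts.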
2. Setting up KServe (Conceptual)
KServe is deployed on top of Kubernetes. A typical installation involves:
1. Kubernetes Cluster: A running Kubernetes cluster.
2. Istio: A service mesh for traffic management.
3. Knative Serving: A serverless platform for deploying and managing serverless workloads.
4. KServe: The KServe components themselves.
(Refer to the official KServe documentation for detailed installation instructions, as they can vary by Kubernetes environment and KServe version.)
3. Deploying a Model with KServe (InferenceService)
Deploying a model with KServe involves creating an InferenceService YAML manifest and applying it to your Kubernetes cluster.
Example: Deploying a Scikit-learn Model
Let's assume you have a trained Scikit-learn model (e.g., a LogisticRegression model saved as a joblib file) stored in a cloud storage bucket (e.g., S3, GCS, Azure Blob Storage).
```yaml
# filename: sklearn-isvc.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-classifier
spec:
  predictor:
    sklearn:
      # Path to your trained model artifact in a storage bucket
      # This example uses Google Cloud Storage (GCS)
      storageUri: gs://your-model-bucket/iris_classifier.joblib
      # For AWS S3: s3://your-model-bucket/iris_classifier.joblib
      # For Azure Blob Storage: https://your-storage-account.blob.core.windows.net/your-container/iris_classifier.joblib
      # For a model on a PersistentVolumeClaim: pvc://your-pvc-name/path/iris_classifier.joblib
      protocolVersion: v2 # Recommended for newer KServe versions
    minReplicas: 1 # Ensure at least one replica is always running (no scale-to-zero)
    maxReplicas: 3 # Scale up to 3 replicas based on traffic
```
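If you would rather generate the manifest than hand-write it, the same structure can be built as a plain Python dictionary and written out as JSON. This is a minimal sketch using only the standard library; `build_sklearn_isvc` is a hypothetical helper name, not part of any KServe package, and the field layout simply mirrors the manifest above.

```python
import json


def build_sklearn_isvc(name, storage_uri, min_replicas=1, max_replicas=3):
    """Build an InferenceService manifest for a Scikit-learn model as a dict."""
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    return {
        "apiVersion": "serving.kserve.io/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": name},
        "spec": {
            "predictor": {
                # Replica bounds belong to the predictor itself
                "minReplicas": min_replicas,
                "maxReplicas": max_replicas,
                # Model-specific settings belong to the framework block
                "sklearn": {
                    "storageUri": storage_uri,
                    "protocolVersion": "v2",
                },
            }
        },
    }


manifest = build_sklearn_isvc(
    "sklearn-iris-classifier",
    "gs://your-model-bucket/iris_classifier.joblib",
)
print(json.dumps(manifest, indent=2))
```

Because `kubectl apply -f` accepts JSON as well as YAML, the printed output can be saved to a file and applied in the same way as the YAML manifest.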
Steps to deploy (conceptual):
- Save your model:

  ```python
  # Example Python code to save a Scikit-learn model
  import joblib
  from sklearn.datasets import load_iris
  from sklearn.linear_model import LogisticRegression

  iris = load_iris()
  X, y = iris.data, iris.target
  # The default solver handles the three iris classes directly;
  # max_iter is raised so the optimizer converges on this dataset
  model = LogisticRegression(max_iter=200)
  model.fit(X, y)

  # Save the model to a file
  joblib.dump(model, 'iris_classifier.joblib')

  # Upload 'iris_classifier.joblib' to your configured storageUri
  # e.g., gsutil cp iris_classifier.joblib gs://your-model-bucket/
  ```

- Apply the YAML:

  ```bash
  kubectl apply -f sklearn-isvc.yaml
  ```

- Check status:

  ```bash
  kubectl get isvc sklearn-iris-classifier
  ```

  Wait until the READY column shows True.

- Get the inference URL:

  ```bash
  kubectl get inferenceservice sklearn-iris-classifier -o jsonpath='{.status.address.url}'
  ```

  This returns the URL endpoint for your model. Note that `.status.address.url` is the cluster-internal address; the externally routable URL, when one is configured, is reported in `.status.url`.
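The status check in the steps above can also be scripted by reading the JSON form of the resource (`kubectl get isvc sklearn-iris-classifier -o json`). The sketch below only parses such a document; the helper names `isvc_is_ready` and `isvc_urls` are hypothetical, and the embedded sample is a trimmed, made-up status written for illustration rather than captured output.

```python
import json


def isvc_is_ready(isvc):
    """Return True if the resource has a Ready condition with status "True"."""
    conditions = isvc.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True"
        for c in conditions
    )


def isvc_urls(isvc):
    """Return (external_url, cluster_internal_url); either may be None."""
    status = isvc.get("status", {})
    return status.get("url"), status.get("address", {}).get("url")


# Hypothetical, trimmed example of `kubectl get isvc ... -o json` output
sample = json.loads("""
{
  "metadata": {"name": "sklearn-iris-classifier"},
  "status": {
    "url": "http://sklearn-iris-classifier.default.example.com",
    "address": {"url": "http://sklearn-iris-classifier.default.svc.cluster.local"},
    "conditions": [
      {"type": "PredictorReady", "status": "True"},
      {"type": "Ready", "status": "True"}
    ]
  }
}
""")

print(isvc_is_ready(sample))
print(isvc_urls(sample))
```

Checking the `Ready` condition rather than the presence of a URL avoids sending requests to a service whose predictor has not finished starting.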
Making an Inference Request (Conceptual)
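KServe supports both its original V1 protocol and the V2 Inference Protocol (also known as the Open Inference Protocol). Because the manifest above sets protocolVersion: v2, requests to this model use the V2 format.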
```bash
# Example using curl (assuming your model URL is accessible)
MODEL_URL="http://sklearn-iris-classifier.default.example.com/v2/models/sklearn-iris-classifier/infer" # Replace with your actual URL
curl -v -H "Content-Type: application/json" -d '{
  "inputs": [
    {
      "name": "input-0",
      "shape": [1, 4],
      "datatype": "FP32",
      "data": [[5.1, 3.5, 1.4, 0.2]]
    }
  ]
}' "$MODEL_URL"
```
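The same request can be sent from Python. The following is a minimal sketch using only the standard library; `build_v2_request` and `infer` are hypothetical helper names, and the tensor name `input-0` and datatype `FP32` are carried over from the curl example rather than required values. The payload flattens the rows in row-major order, which the V2 protocol permits alongside the nested form used above.

```python
import json
import urllib.request


def build_v2_request(rows, name="input-0", datatype="FP32"):
    """Build a V2 inference request body for a 2-D batch of feature rows."""
    if not rows or not rows[0]:
        raise ValueError("rows must contain at least one non-empty row")
    n_cols = len(rows[0])
    if any(len(row) != n_cols for row in rows):
        raise ValueError("all rows must have the same number of features")
    # Flatten row by row; "shape" tells the server how to rebuild the tensor
    flat = [float(value) for row in rows for value in row]
    return {
        "inputs": [
            {
                "name": name,
                "shape": [len(rows), n_cols],
                "datatype": datatype,
                "data": flat,
            }
        ]
    }


def infer(model_url, rows, timeout=10.0):
    """POST the request to a V2 /infer endpoint and return the parsed JSON reply."""
    body = json.dumps(build_v2_request(rows)).encode("utf-8")
    request = urllib.request.Request(
        model_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


payload = build_v2_request([[5.1, 3.5, 1.4, 0.2]])
print(json.dumps(payload, indent=2))
```

Calling `infer(MODEL_URL, [[5.1, 3.5, 1.4, 0.2]])` with the URL from the curl example would send the request; it is left out of the snippet because it needs a reachable endpoint.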
4. Deploying Other Model Types
KServe supports various model servers:
- TensorFlow:

  ```yaml
  # ...
  spec:
    predictor:
      tensorflow:
        storageUri: gs://your-tf-model-bucket/saved_model_path/
        # The TensorFlow Serving runtime speaks the V1 protocol (the default),
        # so protocolVersion is omitted here
  ```

- PyTorch:

  ```yaml
  # ...
  spec:
    predictor:
      pytorch:
        # KServe serves PyTorch models through TorchServe, which expects a
        # directory containing a model store of .mar archives and a config,
        # rather than a bare model.pt or model.pth file
        storageUri: gs://your-pytorch-model-bucket/your-model-dir/
        # runtimeVersion, if set, selects the serving runtime image tag;
        # it is not the PyTorch version
        protocolVersion: v2
  ```

- XGBoost:

  ```yaml
  # ...
  spec:
    predictor:
      xgboost:
        storageUri: gs://your-xgboost-model-bucket/model.bst # model.bst or model.json
        protocolVersion: v2
  ```

- Custom Predictor: For models not supported by built-in servers, you can provide a custom Docker image.

  ```yaml
  # ...
  spec:
    predictor:
      containers:
        - name: kserve-container
          image: your-custom-model-server:latest # Your Docker image
          command: ["python", "app.py"] # Command to run your server
          env:
            - name: MODEL_NAME
              value: mymodel
            - name: STORAGE_URI
              value: gs://your-custom-model-bucket/model/
  ```
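For the custom predictor, the `app.py` referenced in the manifest is whatever server your image runs. The sketch below shows one possible shape for it using only the standard library, under several assumptions that are not from the manifest itself: the server answers V1-style requests at `/v1/models/<MODEL_NAME>:predict`, listens on port 8080, and uses a placeholder "model" that sums each input row. A real server would load the artifact and would more likely be built on the KServe Python SDK.

```python
import json
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

# Read from the environment variable set in the manifest; falls back to a default
MODEL_NAME = os.environ.get("MODEL_NAME", "mymodel")


def predict(instances):
    """Placeholder model (assumption): returns the sum of each input row."""
    return [sum(row) for row in instances]


def handle_predict(path, raw_body):
    """Map a request path and body to (HTTP status, JSON-serializable reply)."""
    if path != f"/v1/models/{MODEL_NAME}:predict":
        return 404, {"error": f"unknown path: {path}"}
    try:
        instances = json.loads(raw_body)["instances"]
        return 200, {"predictions": predict(instances)}
    except (ValueError, KeyError, TypeError):
        return 400, {"error": 'expected a JSON body like {"instances": [[...]]}'}


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, reply = handle_predict(self.path, self.rfile.read(length))
        body = json.dumps(reply).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Start the server; call this from the container entry point."""
    HTTPServer(("", port), PredictHandler).serve_forever()
```

Keeping the request logic in `handle_predict`, separate from the HTTP handler class, lets you unit-test the predictor without opening a socket.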
5. Advanced KServe Features
- Traffic Splitting: Route a percentage of traffic to a new version of a model (canary deployment). You update the predictor spec in place (for example, with a new storageUri) and set canaryTrafficPercent; the previously rolled-out revision keeps the remaining traffic.

  ```yaml
  # ...
  spec:
    predictor:
      canaryTrafficPercent: 20 # Send 20% of traffic to the latest (canary) revision
      # ... (updated model spec, e.g., a new storageUri)
  ```

- Transformers: Preprocessing steps before sending to the predictor, or post-processing of results.
- Explainers: Integrating model explainability tools.
- Multi-Model Serving: Deploying multiple models on a single InferenceService instance (e.g., for very small models).
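As an illustration of the traffic-splitting item, a canary rollout amounts to editing an existing manifest: point the predictor at the new model and set `canaryTrafficPercent`. The sketch below does this on a manifest held as a Python dictionary. The helper name `with_canary` and its `framework` parameter are hypothetical, and the snippet only produces the updated manifest; applying it to the cluster is a separate step.

```python
import copy


def with_canary(isvc, new_storage_uri, percent, framework="sklearn"):
    """Return a copy of the manifest that rolls out a new model as a canary.

    The original dict is left untouched so it can be re-applied to roll back.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    updated = copy.deepcopy(isvc)
    predictor = updated["spec"]["predictor"]
    predictor[framework]["storageUri"] = new_storage_uri
    predictor["canaryTrafficPercent"] = percent
    return updated


current = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris-classifier"},
    "spec": {
        "predictor": {
            "sklearn": {"storageUri": "gs://your-model-bucket/iris_classifier.joblib"}
        }
    },
}

# The v2 path below is a made-up example location for the new model
canary = with_canary(current, "gs://your-model-bucket/v2/iris_classifier.joblib", 20)
print(canary["spec"]["predictor"])
```

Raising `percent` in later updates shifts more traffic to the new revision; setting it to 100 (or removing the field) completes the rollout.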
Further Topics:
- Integrating KServe with Kubeflow Pipelines.
- Monitoring deployed models.
- Security considerations for model serving.
- Performance optimization for inference.
- KServe client SDK for programmatic deployment.
KServe (KFServing) is a powerful tool for MLOps, providing a robust, scalable, and standardized solution for deploying machine learning models in a cloud-native environment.