30 Days of MLOps Challenge · Day 21

Scaling ML Model Inference with Kubernetes

By Aviraj Kawade · September 16, 2025 · 8 min read

Learn how to scale model inference with Kubernetes to ensure high availability and low-latency predictions in production ML systems. Mastering this enables dynamic autoscaling, cost optimization, and seamless deployment of AI workloads at scale.

💡 Hey, it's Aviraj Kawade 👋

Today we're looking at how to scale model inference with Kubernetes so production ML systems stay highly available and serve low-latency predictions. Mastering this enables dynamic autoscaling, cost optimization, and repeatable deployment of AI workloads at scale.

📚 Key Learnings

  • Why Kubernetes is ideal for scaling model inference workloads
  • Understand deployment options: FastAPI/Flask containers as inference services
  • How to expose model APIs using Services, Ingress, and LoadBalancers
  • Scaling inference using Horizontal Pod Autoscaling (HPA) based on CPU/GPU/requests
  • Explore real-world deployments on KinD/minikube, AWS EKS, and Google GKE

🧠 Learn here

[Diagram: Kubernetes ML scaling overview]

Kubernetes (K8s) has emerged as the de facto standard for deploying, scaling, and managing containerized applications. When it comes to machine learning (ML) model inference, Kubernetes offers a highly efficient and robust platform to handle dynamic and resource-intensive workloads.

Why Kubernetes is Ideal for Scaling Model Inference:

1. Horizontal Pod Autoscaling (HPA)

  • Automatically scales the number of inference pods based on CPU, memory usage, or custom metrics (like request rate or latency)
  • Enables responsive scaling during peak traffic periods

2. Custom Resource Definitions (CRDs) for ML

  • Frameworks like Kubeflow, KServe, or Seldon Core extend Kubernetes with ML-specific capabilities
  • Abstracts complex deployment logic into simple YAMLs or CRDs for versioned and repeatable deployments

3. Multi-Tenant and Resource Isolation

  • Namespaces and resource quotas isolate workloads across teams, models, or services
  • Prevents one model from starving resources used by another
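
For example, a ResourceQuota caps what one team's inference workloads can consume in their namespace. A minimal sketch (the namespace name and numbers are placeholders):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a        # hypothetical team namespace
spec:
  hard:
    requests.cpu: "8"      # total CPU requests allowed in the namespace
    requests.memory: 16Gi  # total memory requests allowed
    limits.cpu: "16"       # total CPU limits allowed
    limits.memory: 32Gi    # total memory limits allowed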

4. GPU/TPU Scheduling and Node Affinity

  • Kubernetes supports GPU/TPU acceleration with node affinity, tolerations, and taints
  • Enables dynamic scheduling of inference jobs based on hardware requirements
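
As a sketch, a pod spec along these lines requests one GPU and combines a toleration with a node selector so it only lands on GPU nodes (the accelerator label is an assumption about how your nodes are labeled):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference
spec:
  nodeSelector:
    accelerator: nvidia-gpu      # hypothetical label on your GPU node pool
  tolerations:
  - key: nvidia.com/gpu          # tolerate the taint commonly applied to GPU nodes
    operator: Exists
    effect: NoSchedule
  containers:
  - name: inference
    image: your-dockerhub-username/inference-api:latest
    resources:
      limits:
        nvidia.com/gpu: 1        # requires the NVIDIA device plugin on the node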

5. Observability and Monitoring

  • Native integration with Prometheus, Grafana, and OpenTelemetry for metrics, traces, and logs
  • Allows you to monitor inference latency, throughput, and model performance in real time

6. Load Balancing and Canary Deployments

  • Kubernetes Services + Ingress + Service Meshes (like Istio) provide advanced routing, traffic splitting, and A/B testing
  • Safely roll out new versions of models without downtime
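
As a hedged sketch of traffic splitting with Istio (assuming a DestinationRule already defines v1 and v2 subsets of model-service), a VirtualService can send a small share of requests to the new model version:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-canary
spec:
  hosts:
  - model-service
  http:
  - route:
    - destination:
        host: model-service
        subset: v1        # current model version
      weight: 90
    - destination:
        host: model-service
        subset: v2        # canary model version
      weight: 10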

7. Autoscaling Infrastructure (Cluster Autoscaler / Karpenter)

  • Scales the underlying nodes when pod demand exceeds capacity
  • Efficient use of resources with cost-aware node provisioning (Karpenter)

Sample Use Case: Real-Time Fraud Detection

  • Deploy a fraud detection model as a REST API using KServe on Kubernetes
  • Use HPA to scale inference pods based on HTTP request rate
  • Enable GPU acceleration for high-throughput scoring
  • Monitor model accuracy and latency via Prometheus + Grafana
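
To make this use case concrete, a minimal KServe InferenceService might look like the sketch below (the model location is a placeholder, and exact fields vary by KServe version):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    sklearn:
      storageUri: gs://your-bucket/models/fraud-detector   # placeholder model artifact location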

🧠 Production Considerations

  • Use pod disruption budgets and readiness probes to ensure high availability
  • Co-locate model and feature store pods using affinity rules for performance
  • Apply network policies for secure access
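
As a hedged sketch of the first point, a PodDisruptionBudget keeps at least one inference pod running during voluntary disruptions:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: inference-pdb
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: inference-api

And a readiness probe can be added under the container spec of the Deployment (the /healthz endpoint is an assumption; the sample apps below would need such a route):

readinessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint, not defined in the sample apps
    port: 80
  initialDelaySeconds: 5
  periodSeconds: 10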

AI/ML Companies Using Kubernetes

Company     | Use Case on Kubernetes
----------- | -----------------------------------------------------------
OpenAI      | Scalable model serving and orchestration of inference jobs
Spotify     | Model training and real-time personalization
Pinterest   | Running inference services and recommendation engines
Netflix     | Content personalization and observability for ML pipelines
LinkedIn    | Feature store management, training workflows, inference
DoorDash    | Real-time ETA prediction and ML observability stack
NVIDIA      | GPU-accelerated inference and model serving infrastructure
Salesforce  | Einstein model deployment, monitoring, and scaling
Snap Inc.   | Image recognition models deployed via Kubernetes
Shopify     | Fraud detection and personalization pipelines

Deployment Options: FastAPI & Flask Containers as Inference Services

Deploying ML models via FastAPI or Flask containers is one of the most practical approaches for serving inference workloads in both development and production environments. This method wraps your trained model in a lightweight web service, allowing for flexible and scalable deployment.

FastAPI vs Flask for Inference

Feature        | FastAPI                                  | Flask
-------------- | ---------------------------------------- | -----------------------------------
Speed          | Very fast (ASGI-based)                   | Slower (WSGI-based)
Async Support  | Native async/await                       | Requires extensions or workarounds
Type Hints     | Full support for Pydantic & validation   | Minimal support
Documentation  | Auto-generates Swagger & ReDoc           | Requires manual setup
Maturity       | Newer, modern                            | Older, battle-tested

Common Use Case

  • Scenario: You have a trained ML model (e.g., sklearn, PyTorch, or TensorFlow) and want to expose it as an API for inference
  • Solution: Wrap the model in a FastAPI or Flask app and containerize it using Docker

FastAPI Inference Example

# app.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

model = joblib.load("model.pkl")
app = FastAPI()

class InputData(BaseModel):
    feature1: float
    feature2: float

@app.post("/predict")
def predict(data: InputData):
    pred = model.predict([[data.feature1, data.feature2]])
    return {"prediction": pred.tolist()}

Dockerfile

FROM python:3.9
WORKDIR /app
COPY . .
RUN pip install fastapi uvicorn scikit-learn joblib
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]

Flask Inference Example

# app.py
from flask import Flask, request, jsonify
import joblib

model = joblib.load("model.pkl")
app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    data = request.json
    pred = model.predict([[data['feature1'], data['feature2']]])
    return jsonify({"prediction": pred.tolist()})

# Needed because the Dockerfile below starts the service with `python app.py`
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)

Dockerfile

FROM python:3.9
WORKDIR /app
COPY . .
RUN pip install flask scikit-learn joblib
CMD ["python", "app.py"]

Deployment on Kubernetes

  1. Build and push the container image to a registry
  2. Create a Kubernetes Deployment and Service manifest
  3. Optionally expose via Ingress for external access
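
A typical flow for the first two steps might look like this (the registry name is a placeholder matching the image used in deployment.yaml below):

docker build -t your-dockerhub-username/inference-api:latest .
docker push your-dockerhub-username/inference-api:latest
kubectl apply -f deployment.yaml -f service.yaml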

deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-api
  template:
    metadata:
      labels:
        app: inference-api
    spec:
      containers:
      - name: inference-container
        image: your-dockerhub-username/inference-api:latest
        ports:
        - containerPort: 80
        resources:
          limits:
            cpu: "500m"
            memory: "512Mi"
          requests:
            cpu: "250m"
            memory: "256Mi"

service.yaml

apiVersion: v1
kind: Service
metadata:
  name: inference-service
spec:
  selector:
    app: inference-api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
  type: ClusterIP

Exposing Model APIs on Kubernetes

Once your ML model is containerized and deployed on Kubernetes, the next critical step is exposing it for external or internal consumption. Kubernetes provides multiple mechanisms to expose services, depending on use case, security, and traffic requirements.

Key Methods to Expose Model APIs

1. ClusterIP (Default)

  • Use Case: Internal-only services (e.g., model used by internal pipeline)
  • Access: Not reachable from outside the cluster
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-api
  ports:
    - port: 80
      targetPort: 80
  type: ClusterIP

2. NodePort

  • Use Case: Expose to external users (dev/test), not recommended for production
  • Access: Available on <NodeIP>:<NodePort>
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-api
  type: NodePort
  ports:
    - port: 80
      targetPort: 80
      nodePort: 30007

3. LoadBalancer (Production-grade External Access)

  • Use Case: Exposing the model to the internet via a cloud-managed external load balancer (e.g., AWS ELB, GCP LB)
  • Access: Public IP assigned by cloud provider
apiVersion: v1
kind: Service
metadata:
  name: model-service
spec:
  selector:
    app: model-api
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 80

Cloud provider assigns an external IP:

kubectl get svc model-service

4. Ingress (With NGINX/ALB/GKE Controller)

  • Use Case: Route traffic via domain/subdomain (e.g., /predict) with SSL termination
  • Access: Expose multiple APIs with a single LB + domain
Ingress Manifest:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: model-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
  - host: model.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: model-service
            port:
              number: 80

Note: You need an Ingress Controller (NGINX, ALB, etc.) installed.
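
If you don't have one yet, the NGINX Ingress Controller can typically be installed with Helm roughly as follows (check the ingress-nginx docs for provider-specific options):

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace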

💡 Pro Tips

  • Use ClusterIP for internal inference pipelines
  • Use LoadBalancer for quick external access (dev/test/staging)
  • Use Ingress for production-ready setup with domain, TLS, path-based routing
  • Secure APIs with network policies, authentication, and rate limiting

Scaling Inference Using Horizontal Pod Autoscaling (HPA)

Horizontal Pod Autoscaler (HPA) is a Kubernetes feature that automatically adjusts the number of pods in a deployment based on observed metrics such as CPU usage, GPU usage, or custom request rate metrics. This makes it an ideal solution for scaling inference workloads dynamically based on traffic.

Why Use HPA for Inference APIs?

  • Automatically adds/removes pods based on real-time usage
  • Ensures optimal resource usage and cost efficiency
  • Prevents model latency spikes during traffic surges
  • Helps maintain SLA and system responsiveness

Prerequisites

  • Metrics Server must be installed in your cluster
  • Your deployment must define resource requests and limits for CPU/Memory
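
If Metrics Server is not already present, it can usually be installed from the upstream manifest (verify compatibility with your cluster version; some local clusters such as KinD need the --kubelet-insecure-tls argument added to the metrics-server deployment):

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml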

Example 1: Scale Based on CPU Usage

FastAPI/Flask deployment snippet:

resources:
  requests:
    cpu: "200m"
    memory: "256Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"

HPA Manifest:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

Example 2: Scale Based on Request Rate (Custom Metrics)

Use Prometheus Adapter to expose custom metrics like requests per second (RPS).

metrics:
- type: Pods
  pods:
    metric:
      name: rps
    target:
      type: AverageValue
      averageValue: "50"

Example 3: GPU-Based Autoscaling (Experimental)

You can use extended HPA controllers or KEDA + NVIDIA DCGM exporter.

metrics:
- type: External
  external:
    metric:
      name: nvidia.com/gpu_utilization
    target:
      type: Value
      value: "60"

View Scaling in Action

kubectl get hpa inference-hpa --watch

💡 Best Practices

  • Set proper resource requests/limits for reliable autoscaling
  • Always test with load simulators (e.g., Locust, k6)
  • Combine HPA with Cluster Autoscaler/Karpenter for full elasticity
  • Use VPA (Vertical Pod Autoscaler) in "recommendation mode" for tuning
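
For instance, a quick load test with hey (one option alongside Locust or k6) against the /predict endpoint might look like this (flags and payload are illustrative; point it at your service's address):

hey -z 60s -c 50 -m POST \
  -H "Content-Type: application/json" \
  -d '{"feature1": 1.2, "feature2": 3.4}' \
  http://<EXTERNAL-IP>/predict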

Real-World Inference API Deployments on KinD, Minikube, AWS EKS, and GCP GKE

Now, let's explore real-world deployment strategies for ML inference APIs (FastAPI/Flask) across different Kubernetes environments, from local development setups like KinD and Minikube to cloud-native managed clusters like AWS EKS and Google GKE.

1. Local: KinD (Kubernetes-in-Docker)

Use Case

  • Fast local dev/testing for multi-node clusters
  • CI pipeline simulation (GitHub Actions, etc.)

Setup

Create a cluster with a custom config (multi-node example):

# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker
  - role: worker

kind create cluster --name ml-inference --config kind-config.yaml

Deploy FastAPI Inference App

Build Docker image:

docker build -t inference-api:latest .

Load image to KinD:

kind load docker-image inference-api:latest --name ml-inference

Apply manifests:

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

Access Locally

kubectl port-forward svc/inference-service 8080:80
curl -X POST http://localhost:8080/predict \
  -H "Content-Type: application/json" \
  -d '{"feature1": 1.2, "feature2": 3.4}'

2. Local: Minikube

Use Case

  • Ideal for quick experiments with full K8s functionality
  • Supports LoadBalancer simulation

Setup

minikube start --driver=docker
minikube addons enable ingress
minikube addons enable metrics-server

Deploy Inference API:

Build the image against Minikube's Docker daemon so the cluster can use it without a registry (run the eval first, then build):

eval $(minikube docker-env)
docker build -t inference-api:latest .

Apply deployment:

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

Access Service

minikube service inference-service

3. Cloud: AWS EKS

Use Case

  • Production-grade managed Kubernetes for high-scale inference

Cluster Setup with eksctl

eksctl create cluster \
  --name ml-inference \
  --region us-east-1 \
  --nodes 3 \
  --node-type t3.medium \
  --with-oidc \
  --managed

Deploy to EKS

Tag and push the Docker image to Amazon ECR (replace <aws_account_id> with your account ID):

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com
docker tag inference-api:latest <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/inference-api:latest
docker push <aws_account_id>.dkr.ecr.us-east-1.amazonaws.com/inference-api:latest

Apply Kubernetes manifests:

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

Access via Load Balancer

kubectl get svc inference-service

Use the EXTERNAL-IP to test:

curl -X POST http://<EXTERNAL-IP>/predict \
  -H "Content-Type: application/json" \
  -d '{"feature1": 1.2, "feature2": 3.4}'

4. Cloud: Google Kubernetes Engine (GKE)

Use Case

  • Auto-upgrades, deep GCP service integration

Setup with gcloud

gcloud container clusters create ml-inference \
  --zone us-central1-a \
  --num-nodes 3 \
  --enable-ip-alias

Build & Push Image to Google Container Registry (GCR)

docker tag inference-api:latest gcr.io/<PROJECT_ID>/inference-api:latest
docker push gcr.io/<PROJECT_ID>/inference-api:latest

Deploy App

kubectl apply -f deployment.yaml
kubectl apply -f service.yaml

External Access

kubectl get svc inference-service
curl -X POST http://<EXTERNAL-IP>/predict \
  -H "Content-Type: application/json" \
  -d '{"feature1": 1.2, "feature2": 3.4}'

🧠 Best Practices Across All Platforms

  • Use Helm/Kustomize for environment-specific configuration
  • Use ConfigMaps and Secrets for API config and auth keys
  • Enable PodDisruptionBudgets and readiness/liveness probes
  • Use HorizontalPodAutoscaler for scalable deployments
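
For the ConfigMap/Secret point, a hedged sketch might look like this (the keys are hypothetical settings your app would need to read):

apiVersion: v1
kind: ConfigMap
metadata:
  name: inference-config
data:
  MODEL_PATH: /models/model.pkl   # hypothetical setting consumed by the app
---
apiVersion: v1
kind: Secret
metadata:
  name: inference-secrets
type: Opaque
stringData:
  API_KEY: replace-me             # placeholder; never commit real secrets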

🔥 Challenges

Basic Deployment

  • Deploy your ML inference API on Minikube or EKS with proper YAML
  • Expose the service using type: LoadBalancer or Ingress
  • Access the model from browser or Postman and test predictions

Autoscaling

  • Set CPU requests/limits for your container
  • Enable HPA and simulate high load using ab or hey tools
  • Observe pod scaling (kubectl get hpa and kubectl get pods)

Advanced

  • Deploy on EKS using eksctl or Terraform
  • Set up logging and monitoring for your inference workload (Prometheus/Grafana)
  • Use taints and tolerations or node selectors to isolate inference workloads (prod-like setup)
  • Use KEDA for event-driven autoscaling based on request queue size (e.g., Kafka, SQS)