30 Days of MLOps Challenge · Day 13

ML Model Deployment – Batch vs Real-time Inference

By Aviraj Kawade · July 10, 2025 · 5 min read

Understanding both batch and real-time inference helps you choose the right serving strategy for latency, scalability, and cost—delivering efficient, reliable user experiences in production.

💡 Hey — It's Aviraj Kawade 👋

📚 Key Learnings

  • Understand batch vs real-time (online) inference.
  • Pick the right method based on latency, scale, and cost.
  • Implement and deploy both patterns in practice.

Inference can be done in two primary ways, each with clear trade‑offs.

  • Batch Inference — process large datasets on a schedule or trigger.
  • Real-Time (Online) Inference — request/response predictions via an API.

🧠 Learn here


Batch Inference

Batch inference pipeline diagram

Example Use Cases

  • Daily product recommendations
  • Bulk risk scoring
  • Predictive maintenance reports

Example Flow

  1. Scheduler triggers job (Airflow, CronJob)
  2. Preprocess and feature engineer
  3. Run batch inference at scale
  4. Store outputs (DB, warehouse, lake)
  5. Downstream consumption
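
Step 4 above is often more than writing a CSV: predictions are usually appended to a queryable store so downstream jobs and dashboards can pick them up. A minimal sketch, assuming a local SQLite database; the file, table, and column names are illustrative:

# hypothetical: batch_inference/store_predictions.py
import sqlite3
import pandas as pd

def store_predictions(df: pd.DataFrame, db_path: str = "predictions.db"):
    # Append this run's scores so downstream consumers can query them later
    conn = sqlite3.connect(db_path)
    df.to_sql("daily_predictions", conn, if_exists="append", index=False)
    conn.close()

if __name__ == "__main__":
    scores = pd.DataFrame({"user_id": [1, 2], "score": [0.83, 0.12]})
    store_predictions(scores)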

Real-Time (Online) Inference

Real-time inference architecture diagram

Example Use Cases

  • Fraud detection during transactions
  • Chatbot and conversational AI
  • Recommendations on click

Example Flow

  1. Request hits API gateway
  2. Model server processes input (FastAPI, Flask, TF Serving)
  3. Optional feature store retrieval
  4. Return prediction within SLA
  5. Monitor latency, errors, throughput
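
Step 3 is typically a low-latency lookup keyed by an ID carried in the request. A minimal sketch, assuming precomputed features are cached in Redis as one hash per user (key layout and feature names are illustrative); the result would be merged with the request payload before calling the model:

# hypothetical: realtime_inference/feature_lookup.py
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_features(user_id: str) -> list[float]:
    # Fetch precomputed features for this user; fall back to defaults if the key is missing
    cached = r.hgetall(f"user_features:{user_id}")
    if not cached:
        return [0.0, 0.0, 0.0]
    return [float(cached["feature1"]), float(cached["feature2"]), float(cached["feature3"])]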

Comparison: Batch vs Real-Time

| Feature | Batch | Real-Time |
|---|---|---|
| Latency | High (minutes to hours) | Low (ms to seconds) |
| Data volume | Large datasets | Single/small records |
| Infrastructure | Data pipelines, batch jobs (Airflow) | Web APIs, model servers (FastAPI, TF Serving) |
| Trigger | Scheduled (cron, DAG) | On-demand (request/event) |
| Cost | Efficient for large volume | Higher for low latency |
| Deployment | Offline/internal processes | Exposed API endpoint |
| Feedback loop | Slower | Fast/immediate |
| Common tools | Spark, Pandas, Airflow | FastAPI, Seldon, Triton |

Tools & Frameworks

| Purpose | Batch Inference | Real-Time Inference |
|---|---|---|
| Frameworks | Apache Spark, Dask, Pandas | FastAPI, Flask, gRPC, TF Serving |
| Orchestration | Airflow, KFP | KServe, Seldon, BentoML |
| Deployment | Kubernetes CronJobs | Kubernetes Deployment + Service |

Choosing the Right Method

Use Case

  • Real‑time user interaction → Online inference
  • Analytics & reporting → Batch inference
  • Stream processing → Hybrid/near real‑time

Latency Requirements

| Latency | Recommended |
|---|---|
| < 1 second | Real-time (Online) |
| 1 sec – few minutes | Near real-time / Micro-batch |
| Minutes – hours | Batch |

Scalability

| Traffic Pattern | Serving Method |
|---|---|
| High, bursty | Real-time + autoscaling |
| Periodic jobs | Batch pipelines |
| Mixed | Hybrid (Batch + Real-time) |

Decision Matrix

| Use Case | Latency | Traffic | Recommended |
|---|---|---|---|
| Product recommendation on click | < 1 s | High | Real-time |
| Daily risk score calculation | Hours | Low | Batch |
| Fraud detection during payment | < 500 ms | Very high | Real-time |
| IoT sensor stream classification | ~1 s | High, continuous | Near real-time / Hybrid |
| Email campaign personalization | Minutes | Low | Batch |

Quick Examples

| Scenario | Recommended Type |
|---|---|
| Daily/weekly reports | Batch |
| Prediction affects immediate UX | Real-time |
| Limited real-time infra | Batch |
| Prevent fraud instantly | Real-time |

Hands-on Demo: Structure

ml-inference-demo/
├── batch_inference/
│   ├── preprocess.py
│   ├── inference.py
│   ├── run_batch_job.py
│   └── airflow_dag.py
├── realtime_inference/
│   ├── main.py
│   ├── model.pkl
│   └── Dockerfile
├── models/
│   └── train_model.ipynb
├── data/
│   └── sample_input.csv
└── README.md

Batch Scripts

# batch_inference/preprocess.py
import pandas as pd

def preprocess(input_path, output_path):
    df = pd.read_csv(input_path)
    df.fillna(0, inplace=True)  # simple imputation: replace missing values with 0
    df.to_csv(output_path, index=False)

# batch_inference/inference.py
import pandas as pd
import joblib

def run_inference(input_path, model_path, output_path):
    df = pd.read_csv(input_path)
    model = joblib.load(model_path)  # load the trained model once per job
    predictions = model.predict(df)
    pd.DataFrame({'prediction': predictions}).to_csv(output_path, index=False)
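
For inputs too large to load at once, the same job can stream the CSV in chunks so memory stays bounded. A sketch assuming the same model interface as inference.py above (the chunk size is arbitrary):

# hypothetical: batch_inference/inference_chunked.py
import pandas as pd
import joblib

def run_inference_chunked(input_path, model_path, output_path, chunksize=10_000):
    model = joblib.load(model_path)
    first = True
    for chunk in pd.read_csv(input_path, chunksize=chunksize):
        out = pd.DataFrame({'prediction': model.predict(chunk)})
        # Write the header only for the first chunk, then append
        out.to_csv(output_path, mode='w' if first else 'a', header=first, index=False)
        first = False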

# batch_inference/run_batch_job.py
from preprocess import preprocess
from inference import run_inference

# Clean the raw input, then score it with the trained model
preprocess('data/sample_input.csv', 'data/cleaned.csv')
run_inference('data/cleaned.csv', 'models/model.pkl', 'data/output.csv')

# batch_inference/airflow_dag.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
from inference import run_inference  # inference.py must be importable from the DAGs folder

def batch_job():
    run_inference('data/cleaned.csv', 'models/model.pkl', 'data/output.csv')

# catchup=False prevents Airflow from backfilling a run for every day since start_date
dag = DAG('batch_inference', start_date=datetime(2024, 1, 1), schedule_interval='@daily', catchup=False)
run_batch = PythonOperator(task_id='run_batch', python_callable=batch_job, dag=dag)

Real-Time API

# realtime_inference/main.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.pkl")  # loaded once at startup, reused for every request

class InputData(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(data: InputData):
    prediction = model.predict([data.features])
    return {"prediction": float(prediction[0])}

# realtime_inference/Dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
# Pin versions in a real project so the serving image matches the training environment
RUN pip install --no-cache-dir fastapi uvicorn joblib pydantic scikit-learn
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# models/train_model.ipynb (Python cells)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import joblib

data = pd.read_csv('../data/sample_input.csv')
X = data.drop('target', axis=1)
y = data['target']

model = RandomForestClassifier()
model.fit(X, y)

# Persist the trained model where the FastAPI service expects it
joblib.dump(model, '../realtime_inference/model.pkl')

# data/sample_input.csv
feature1,feature2,feature3,target
1.0,2.5,3.0,0
2.1,0.3,3.3,1
1.2,1.8,2.2,0

Deployment Options

| Layer | Batch Inference | Real-Time Inference |
|---|---|---|
| Scheduler | Airflow, CronJob | N/A |
| Serving | Offline scripts | FastAPI + Uvicorn |
| Containerization | Docker (optional) | Docker |
| Orchestration | K8s Job / CronJob | K8s Deployment + Service |
| Storage | S3, data warehouse, DB | Redis/NoSQL/in-memory (optional) |
| Monitoring | Airflow UI | Prometheus + Grafana |

Monitoring & Logging

Batch

  • Track DAG success/failure in Airflow
  • Persist logs (e.g., S3, logging service)
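
A minimal sketch of shipping a run log to S3 after the job finishes, assuming AWS credentials are already configured; the bucket name and key layout are illustrative:

# hypothetical: batch_inference/upload_logs.py
import boto3
from datetime import date

def upload_log(local_path: str, bucket: str = "my-ml-logs"):
    # Store each day's batch log under a dated prefix for later inspection
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, f"batch_inference/{date.today()}/run.log")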

Real‑Time

  • Monitor latency, throughput, error rates (Prometheus)
  • Set alerts for p95/p99, 5xx rates
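
A minimal sketch of exposing latency and error metrics from the FastAPI service with the prometheus_client library (metric names are illustrative); Prometheus scrapes the mounted /metrics endpoint and Grafana plots the series:

# hypothetical: realtime_inference/metrics.py
import time
from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()
REQUEST_LATENCY = Histogram("predict_latency_seconds", "Prediction request latency in seconds")
REQUEST_ERRORS = Counter("predict_errors_total", "Prediction requests that raised an error")

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    try:
        response = await call_next(request)
    except Exception:
        REQUEST_ERRORS.inc()
        raise
    REQUEST_LATENCY.observe(time.perf_counter() - start)
    return response

# Expose all registered metrics for Prometheus to scrape
app.mount("/metrics", make_asgi_app())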

Common Pitfalls

  • Batch: Missing schema validation; brittle DAG error handling.
  • Real‑Time: Re‑loading model per request; underutilizing micro‑batching when acceptable.
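
The model-reloading pitfall usually appears when joblib.load ends up inside the request handler. A sketch contrasting the anti-pattern with the loaded-once pattern already used in main.py above (same toy model.pkl):

# hypothetical: realtime_inference/model_loading.py
import joblib
from fastapi import FastAPI

app = FastAPI()

# Anti-pattern: loading the model inside the handler adds disk I/O and
# deserialization to every request
@app.post("/predict_slow")
def predict_slow(features: list[float]):
    model = joblib.load("model.pkl")  # reloaded on every call
    return {"prediction": float(model.predict([features])[0])}

# Fix: load once at startup and reuse the same object across requests
model = joblib.load("model.pkl")

@app.post("/predict")
def predict(features: list[float]):
    return {"prediction": float(model.predict([features])[0])}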

🔥 Challenges

  • List ML tasks from prior days and categorize into batch vs real‑time.
  • Read AWS SageMaker Batch Transform vs Real‑Time Endpoints and write a short gist.
  • Prepare a dummy dataset (CSV/JSON) for batch processing.
  • Reuse your model from Day 11 or Day 12 and deploy both patterns.
← Back to MLOps Roadmap