30 Days of MLOps Challenge · Day 13

ML Model Deployment – Batch vs Real-time Inference

By Aviraj Kawade · July 10, 2025 · 5 min read

Understanding both batch and real-time inference helps you choose the right serving strategy for latency, scalability, and cost—delivering efficient, reliable user experiences in production.

💡 Hey — It's Aviraj Kawade 👋

📚 Key Learnings

  • Understand batch vs real-time (online) inference.
  • Pick the right method based on latency, scale, and cost.
  • Implement and deploy both patterns in practice.

Inference can be done in two primary ways, each with clear trade‑offs.

  • Batch Inference — process large datasets on a schedule or trigger.
  • Real-Time (Online) Inference — request/response predictions via an API.

🧠 Learn here


Batch Inference

Batch inference pipeline diagram

Example Use Cases

  • Daily product recommendations
  • Bulk risk scoring
  • Predictive maintenance reports

Example Flow

  1. Scheduler triggers job (Airflow, CronJob)
  2. Preprocess and feature engineer
  3. Run batch inference at scale
  4. Store outputs (DB, warehouse, lake)
  5. Downstream consumption
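
Step 4 above is often more than writing a CSV: predictions are usually appended to a queryable store so downstream jobs and dashboards can pick them up. A minimal sketch, assuming a local SQLite database; the file, table, and column names are illustrative:

# hypothetical: batch_inference/store_predictions.py
import sqlite3
import pandas as pd

def store_predictions(df: pd.DataFrame, db_path: str = "predictions.db"):
    # Append this run's scores so downstream consumers can query them later
    conn = sqlite3.connect(db_path)
    df.to_sql("daily_predictions", conn, if_exists="append", index=False)
    conn.close()

if __name__ == "__main__":
    scores = pd.DataFrame({"user_id": [1, 2], "score": [0.83, 0.12]})
    store_predictions(scores)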

Real-Time (Online) Inference

Real-time inference architecture diagram

Example Use Cases

  • Fraud detection during transactions
  • Chatbot and conversational AI
  • Recommendations on click

Example Flow

  1. Request hits API gateway
  2. Model server processes input (FastAPI, Flask, TF Serving)
  3. Optional feature store retrieval
  4. Return prediction within SLA
  5. Monitor latency, errors, throughput
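
Step 3 is typically a low-latency lookup keyed by an ID carried in the request. A minimal sketch, assuming precomputed features are cached in Redis as one hash per user (key layout and feature names are illustrative); the result would be merged with the request payload before calling the model:

# hypothetical: realtime_inference/feature_lookup.py
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def get_features(user_id: str) -> list[float]:
    # Fetch precomputed features for this user; fall back to defaults if the key is missing
    cached = r.hgetall(f"user_features:{user_id}")
    if not cached:
        return [0.0, 0.0, 0.0]
    return [float(cached["feature1"]), float(cached["feature2"]), float(cached["feature3"])]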

Comparison: Batch vs Real-Time

| Feature | Batch | Real-Time |
|---|---|---|
| Latency | High (minutes to hours) | Low (ms to seconds) |
| Data volume | Large datasets | Single/small records |
| Infrastructure | Data pipelines, batch jobs (Airflow) | Web APIs, model servers (FastAPI, TF Serving) |
| Trigger | Scheduled (cron, DAG) | On-demand (request/event) |
| Cost | Efficient for large volume | Higher for low latency |
| Deployment | Offline/internal processes | Exposed API endpoint |
| Feedback loop | Slower | Fast/immediate |
| Common tools | Spark, Pandas, Airflow | FastAPI, Seldon, Triton |

Tools & Frameworks

| Purpose | Batch Inference | Real-Time Inference |
|---|---|---|
| Frameworks | Apache Spark, Dask, Pandas | FastAPI, Flask, gRPC, TF Serving |
| Orchestration | Airflow, KFP | KServe, Seldon, BentoML |
| Deployment | Kubernetes CronJobs | Kubernetes Deployment + Service |

Choosing the Right Method

Use Case

  • Real‑time user interaction → Online inference
  • Analytics & reporting → Batch inference
  • Stream processing → Hybrid/near real‑time

Latency Requirements

| Latency | Recommended |
|---|---|
| < 1 second | Real-time (Online) |
| 1 sec – few minutes | Near real-time / Micro-batch |
| Minutes – hours | Batch |

Scalability

| Traffic Pattern | Serving Method |
|---|---|
| High, bursty | Real-time + autoscaling |
| Periodic jobs | Batch pipelines |
| Mixed | Hybrid (Batch + Real-time) |

Decision Matrix

| Use Case | Latency | Traffic | Recommended |
|---|---|---|---|
| Product recommendation on click | < 1 s | High | Real-time |
| Daily risk score calculation | Hours | Low | Batch |
| Fraud detection during payment | < 500 ms | Very high | Real-time |
| IoT sensor stream classification | ~1 s | High, continuous | Near real-time / Hybrid |
| Email campaign personalization | Minutes | Low | Batch |

Quick Examples

| Scenario | Recommended Type |
|---|---|
| Daily/weekly reports | Batch |
| Prediction affects immediate UX | Real-time |
| Limited real-time infra | Batch |
| Prevent fraud instantly | Real-time |

Hands-on Demo: Structure

ml-inference-demo/
├── batch_inference/
│   ├── preprocess.py
│   ├── inference.py
│   ├── run_batch_job.py
│   └── airflow_dag.py
├── realtime_inference/
│   ├── main.py
│   ├── model.pkl
│   └── Dockerfile
├── models/
│   └── train_model.ipynb
├── data/
│   └── sample_input.csv
└── README.md

Batch Scripts

# batch_inference/preprocess.py
import pandas as pd

def preprocess(input_path, output_path):
    df = pd.read_csv(input_path)
    df.fillna(0, inplace=True)  # simple imputation: replace missing values with 0
    df.to_csv(output_path, index=False)

# batch_inference/inference.py
import pandas as pd
import joblib

def run_inference(input_path, model_path, output_path):
    df = pd.read_csv(input_path)
    model = joblib.load(model_path)  # load the trained model once per job
    predictions = model.predict(df)
    pd.DataFrame({'prediction': predictions}).to_csv(output_path, index=False)
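
For inputs too large to load at once, the same job can stream the CSV in chunks so memory stays bounded. A sketch assuming the same model interface as inference.py above (the chunk size is arbitrary):

# hypothetical: batch_inference/inference_chunked.py
import pandas as pd
import joblib

def run_inference_chunked(input_path, model_path, output_path, chunksize=10_000):
    model = joblib.load(model_path)
    first = True
    for chunk in pd.read_csv(input_path, chunksize=chunksize):
        out = pd.DataFrame({'prediction': model.predict(chunk)})
        # Write the header only for the first chunk, then append
        out.to_csv(output_path, mode='w' if first else 'a', header=first, index=False)
        first = False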

# batch_inference/run_batch_job.py
from preprocess import preprocess
from inference import run_inference

# Clean the raw input, then score it with the trained model
preprocess('data/sample_input.csv', 'data/cleaned.csv')
run_inference('data/cleaned.csv', 'models/model.pkl', 'data/output.csv')

# batch_inference/airflow_dag.py
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
from inference import run_inference  # inference.py must be importable from the DAGs folder

def batch_job():
    run_inference('data/cleaned.csv', 'models/model.pkl', 'data/output.csv')

# catchup=False prevents Airflow from backfilling a run for every day since start_date
dag = DAG('batch_inference', start_date=datetime(2024, 1, 1), schedule_interval='@daily', catchup=False)
run_batch = PythonOperator(task_id='run_batch', python_callable=batch_job, dag=dag)

Real-Time API

# realtime_inference/main.py
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load("model.pkl")  # loaded once at startup, reused for every request

class InputData(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(data: InputData):
    prediction = model.predict([data.features])
    return {"prediction": float(prediction[0])}

# realtime_inference/Dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY . .
# Pin versions in a real project so the serving image matches the training environment
RUN pip install --no-cache-dir fastapi uvicorn joblib pydantic scikit-learn
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# models/train_model.ipynb (Python cells)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import joblib

data = pd.read_csv('../data/sample_input.csv')
X = data.drop('target', axis=1)
y = data['target']

model = RandomForestClassifier()
model.fit(X, y)

# Persist the trained model where the FastAPI service expects it
joblib.dump(model, '../realtime_inference/model.pkl')

# data/sample_input.csv
feature1,feature2,feature3,target
1.0,2.5,3.0,0
2.1,0.3,3.3,1
1.2,1.8,2.2,0

Deployment Options

| Layer | Batch Inference | Real-Time Inference |
|---|---|---|
| Scheduler | Airflow, CronJob | N/A |
| Serving | Offline scripts | FastAPI + Uvicorn |
| Containerization | Docker (optional) | Docker |
| Orchestration | K8s Job / CronJob | K8s Deployment + Service |
| Storage | S3, data warehouse, DB | Redis/NoSQL/in-memory (optional) |
| Monitoring | Airflow UI | Prometheus + Grafana |

Monitoring & Logging

Batch

  • Track DAG success/failure in Airflow
  • Persist logs (e.g., S3, logging service)
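
A minimal sketch of shipping a run log to S3 after the job finishes, assuming AWS credentials are already configured; the bucket name and key layout are illustrative:

# hypothetical: batch_inference/upload_logs.py
import boto3
from datetime import date

def upload_log(local_path: str, bucket: str = "my-ml-logs"):
    # Store each day's batch log under a dated prefix for later inspection
    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, f"batch_inference/{date.today()}/run.log")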

Real‑Time

  • Monitor latency, throughput, error rates (Prometheus)
  • Set alerts for p95/p99, 5xx rates
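
A minimal sketch of exposing latency and error metrics from the FastAPI service with the prometheus_client library (metric names are illustrative); Prometheus scrapes the mounted /metrics endpoint and Grafana plots the series:

# hypothetical: realtime_inference/metrics.py
import time
from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

app = FastAPI()
REQUEST_LATENCY = Histogram("predict_latency_seconds", "Prediction request latency in seconds")
REQUEST_ERRORS = Counter("predict_errors_total", "Prediction requests that raised an error")

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    start = time.perf_counter()
    try:
        response = await call_next(request)
    except Exception:
        REQUEST_ERRORS.inc()
        raise
    REQUEST_LATENCY.observe(time.perf_counter() - start)
    return response

# Expose all registered metrics for Prometheus to scrape
app.mount("/metrics", make_asgi_app())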

Common Pitfalls

  • Batch: Missing schema validation; brittle DAG error handling.
  • Real‑Time: Re‑loading model per request; underutilizing micro‑batching when acceptable.
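
The model-reloading pitfall usually appears when joblib.load ends up inside the request handler. A sketch contrasting the anti-pattern with the loaded-once pattern already used in main.py above (same toy model.pkl):

# hypothetical: realtime_inference/model_loading.py
import joblib
from fastapi import FastAPI

app = FastAPI()

# Anti-pattern: loading the model inside the handler adds disk I/O and
# deserialization to every request
@app.post("/predict_slow")
def predict_slow(features: list[float]):
    model = joblib.load("model.pkl")  # reloaded on every call
    return {"prediction": float(model.predict([features])[0])}

# Fix: load once at startup and reuse the same object across requests
model = joblib.load("model.pkl")

@app.post("/predict")
def predict(features: list[float]):
    return {"prediction": float(model.predict([features])[0])}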

🔥 Challenges

  • List ML tasks from prior days and categorize into batch vs real‑time.
  • Read AWS SageMaker Batch Transform vs Real‑Time Endpoints and write a short gist.
  • Prepare a dummy dataset (CSV/JSON) for batch processing.
  • Reuse your model from Day 11 or Day 12 and deploy both patterns.
← Back to MLOps Roadmap