Monitoring ML Systems in Production – Metrics, Logging, Alerting
Master monitoring to ensure ML models in production are healthy, reliable, and performant. By tracking key metrics, analyzing logs, and setting up alerts, you can proactively detect model drift, data quality issues, and service failures before they impact users.
📚 Key Learnings
- Understand the importance of observability in ML systems: uptime, performance, correctness
- Learn what to monitor: prediction latency, model accuracy, data quality, service health
- Differentiate between system metrics and model metrics
- Set up logging, dashboards, and alerting for deployed ML services
- Use Prometheus, Grafana, and custom logging solutions for insight and traceability
🧠 Learn here

Observability in ML systems is crucial for ensuring that models behave as expected in production. It helps detect issues related to performance, uptime, data quality, drift, and overall system health. Without observability, it's difficult to debug or maintain trust in deployed ML models.
Why Observability Matters in ML
- Uptime: Ensures your model endpoints are live and accessible
- Performance: Tracks latency, throughput, and resource consumption of your ML services
- Correctness: Monitors input/output distributions, predictions, and error rates to catch anomalies or degradation
- Compliance: Supports auditing and governance by keeping a history of inputs, predictions, and model versions
- Root Cause Analysis: Aids debugging of failures, unexpected behaviors, or drifts
Key Observability Metrics
Category | Metrics |
---|---|
System | Uptime, response time, CPU/GPU usage, memory consumption |
Model | Prediction latency, prediction confidence, accuracy (if feedback exists) |
Data | Feature distribution, missing/null values, data schema validation |
Drift | Data drift, concept drift, input/output change rate |
Tools for Observability in ML
- Prometheus + Grafana: For infrastructure and model performance metrics
- OpenTelemetry: Distributed tracing and metrics collection
- EvidentlyAI: Data drift, model performance, and monitoring dashboards
- Seldon Alibi Detect / Fiddler AI / WhyLabs: For advanced monitoring, explainability, and drift detection
- MLflow: Experiment tracking, model versioning, logging parameters and metrics
- Elastic Stack (ELK): Centralized logging and visualization
Best Practices
- Instrument all components: Include data pipeline, model serving layer, and infrastructure
- Set alerts: Define thresholds and set up notifications for anomalies
- Monitor input data: Compare live input features with the training data distribution (see the sketch after this list)
- Track predictions: Log and analyze prediction confidence and output distribution
- Version control everything: Track changes in code, models, and data schemas
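For the "monitor input data" practice, a lightweight approach is to keep a reference sample of training features and compare each live batch against it. The sketch below is one way to do that, assuming scipy is installed; the two-sample Kolmogorov–Smirnov test, the numeric-column handling, and the 0.05 threshold are illustrative choices, not a prescribed method.
import pandas as pd
from scipy.stats import ks_2samp

def check_feature_drift(reference: pd.DataFrame, live: pd.DataFrame, alpha: float = 0.05) -> dict:
    """Flag numeric features whose live distribution differs from the training reference."""
    drifted = {}
    for col in reference.select_dtypes("number").columns:
        if col not in live.columns:
            continue
        stat, p_value = ks_2samp(reference[col].dropna(), live[col].dropna())
        if p_value < alpha:  # a low p-value suggests the live distribution has shifted
            drifted[col] = {"ks_statistic": round(stat, 3), "p_value": round(p_value, 4)}
    return drifted

# Hypothetical usage: reference_df is a sample of training data, live_df a recent batch
# print(check_feature_drift(reference_df, live_df))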
Example Alert Thresholds to Track in Production
- Model response time > 300ms
- Drift in feature user_age > 10% from training distribution
- Model confidence < 0.5 for more than 5% of predictions
- Increase in null/missing values in real-time data > 2%
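These thresholds only help if something actually evaluates them. As a minimal illustration, a periodic job could compute them from a DataFrame of recent prediction logs; the column names (latency_ms, confidence) and the log source are assumptions for this sketch.
import pandas as pd

def check_thresholds(log_df: pd.DataFrame) -> list:
    """Evaluate the example thresholds above on recent prediction logs.
    Assumed columns: latency_ms, confidence, plus raw feature columns."""
    alerts = []
    if log_df["latency_ms"].mean() > 300:
        alerts.append("Mean model response time above 300 ms")
    if (log_df["confidence"] < 0.5).mean() > 0.05:
        alerts.append("More than 5% of predictions below 0.5 confidence")
    worst_null_rate = log_df.isnull().mean().max()
    if worst_null_rate > 0.02:
        alerts.append(f"Null/missing rate {worst_null_rate:.1%} exceeds 2%")
    return alerts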
ML Monitoring: What to Monitor?
1. Prediction Latency
Latency indicates the time taken to serve a prediction request. High latency affects user experience and system performance.
Example: FastAPI + Prometheus
from fastapi import FastAPI, Request
import time
from prometheus_client import Histogram, start_http_server

app = FastAPI()

REQUEST_LATENCY = Histogram('prediction_latency_seconds', 'Time for prediction', ['endpoint'])

@app.post("/predict")
async def predict(request: Request):
    start_time = time.time()
    # simulate prediction
    time.sleep(0.3)  # replace with actual model.predict()
    latency = time.time() - start_time
    REQUEST_LATENCY.labels(endpoint="/predict").observe(latency)
    return {"prediction": 42}

# Start Prometheus metrics server
start_http_server(8001)
2. Model Accuracy (Offline Evaluation)
Once ground-truth labels arrive through feedback, compare them with the model's predictions to evaluate accuracy.
Example: Scikit-Learn
from sklearn.metrics import accuracy_score
import pandas as pd
# Simulated predictions and actuals
preds = [1, 0, 1, 1, 0]
y_true = [1, 0, 0, 1, 0]
accuracy = accuracy_score(y_true, preds)
print(f"Accuracy: {accuracy * 100:.2f}%")
In production, log and evaluate this periodically using feedback labels.
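One common pattern is to log every prediction with an identifier, then join delayed feedback labels onto that log and recompute accuracy on a schedule. The sketch below uses a CSV file and assumed column names purely for illustration; a database table or feature store would work the same way.
import csv
import uuid
from datetime import datetime, timezone

import pandas as pd
from sklearn.metrics import accuracy_score

PREDICTION_LOG = "prediction_log.csv"  # assumed sink; could be a database table instead

def log_prediction(features: dict, prediction: int) -> str:
    """Append one prediction with a unique id so feedback can be joined later."""
    prediction_id = str(uuid.uuid4())
    with open(PREDICTION_LOG, "a", newline="") as f:
        csv.writer(f).writerow(
            [prediction_id, datetime.now(timezone.utc).isoformat(), features, prediction]
        )
    return prediction_id

def evaluate_with_feedback(feedback: pd.DataFrame) -> float:
    """feedback is assumed to have columns: prediction_id, true_label."""
    log = pd.read_csv(PREDICTION_LOG, names=["prediction_id", "ts", "features", "prediction"])
    joined = log.merge(feedback, on="prediction_id")
    return accuracy_score(joined["true_label"], joined["prediction"])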
3. Data Quality
Check for missing values, type mismatches, or invalid ranges in live input data.
Example: Real-Time Data Checks
import pandas as pd

def validate_data(df):
    issues = {}
    if df.isnull().sum().any():
        issues['nulls'] = df.isnull().sum().to_dict()
    if (df['age'] < 0).any():
        issues['age_range'] = "Negative age found"
    if df['salary'].dtype != float:
        issues['salary_type'] = "Expected float for salary"
    return issues

# Example incoming request payload
data = pd.DataFrame({"age": [25, -1], "salary": [50000, "unknown"]})
print(validate_data(data))
4. Service Health
Check uptime, HTTP error codes, memory/CPU usage, etc.
Example: FastAPI + Prometheus Metrics
from prometheus_client import Counter

REQUEST_COUNT = Counter('request_count', 'App Request Count', ['method', 'endpoint'])
ERROR_COUNT = Counter('error_count', 'Error Count', ['method', 'endpoint'])

@app.middleware("http")
async def count_requests(request: Request, call_next):
    method = request.method
    endpoint = request.url.path
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
    try:
        response = await call_next(request)
    except Exception:
        ERROR_COUNT.labels(method=method, endpoint=endpoint).inc()
        raise
    return response
Use tools like Grafana to visualize the metrics.
Recommended Tools:
- Prometheus + Grafana for latency and health metrics
- MLflow or custom DB to store prediction vs actuals
- EvidentlyAI for data quality and drift detection
- OpenTelemetry for tracing
✅ Pro Tips:
- Always define thresholds and alerting rules
- Track metrics at both infra (CPU, memory) and model levels
- Monitor data inputs continuously to prevent garbage-in garbage-out
- Monitoring = Observability + Actionable Insights
System Metrics vs Model Metrics
System metrics focus on infrastructure performance, while model metrics focus on model behavior and outcomes.
Understanding the difference between system and model metrics is crucial for building and maintaining reliable ML systems in production.
Aspect | System Metrics | Model Metrics |
---|---|---|
Definition | Metrics related to infrastructure and application performance | Metrics that evaluate ML model behavior and performance |
Focus Area | Hardware, network, service uptime | Prediction quality, fairness, drift, and model reliability |
Examples | CPU usage, memory, disk I/O, network latency, HTTP errors | Accuracy, precision, recall, AUC, prediction latency, drift |
Collection Tools | Prometheus, Grafana, Datadog, CloudWatch | MLflow, EvidentlyAI, custom logging, Alibi Detect |
Usage | Ensures system availability and performance | Ensures model is accurate, fair, and not degrading |
Alert Triggers | High CPU/memory, service down, HTTP 500 errors | High prediction errors, drift detected, confidence drop |
Visualization | System dashboards (Grafana, Cloud dashboards) | ML dashboards (Evidently, custom visualizations) |
Granularity | Service-level or node-level | Feature-level or prediction-level |
Who Monitors | DevOps, SREs | ML Engineers, Data Scientists |
Both metric types are essential for a robust and reliable ML system in production.
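The two kinds of metrics can even be served from the same endpoint: on Linux, prometheus_client exposes default process_* system metrics (CPU time, memory, open file descriptors) automatically, and model-level metrics can be registered next to them. A small sketch, where the accuracy value and version label are assumed to come from a periodic evaluation job and deployment config:
from prometheus_client import Gauge, start_http_server

# Model-level metrics published alongside the default process_* system metrics
MODEL_ACCURACY = Gauge("model_rolling_accuracy", "Accuracy over the most recent feedback window")
MODEL_INFO = Gauge("model_version_info", "Deployed model version (value is always 1)", ["version"])

start_http_server(8002)  # assumed free port; /metrics now serves both metric families

# Hypothetical updates, e.g. from a scheduled evaluation job:
MODEL_ACCURACY.set(0.93)
MODEL_INFO.labels(version="v1.4.2").set(1)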
Logging, Dashboards, and Alerting for Deployed ML Services
These components are critical for production readiness, troubleshooting, and operational excellence.
1. Logging
Logging is essential for capturing runtime behavior, inputs/outputs, and model predictions for debugging and auditing.
Example: Logging in FastAPI ML Endpoint
from fastapi import FastAPI, Request
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

app = FastAPI()

@app.post("/predict")
async def predict(request: Request):
    payload = await request.json()
    logging.info(f"Received input: {payload}")
    result = {"prediction": 1}  # Simulated prediction
    logging.info(f"Prediction result: {result}")
    return result
Tip: Use structured logging (e.g., JSON logs) for better parsing in tools like ELK or Datadog.
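A minimal way to get structured logs without extra dependencies is a custom formatter that emits JSON; the field names below are just one reasonable layout, and dedicated libraries (e.g. python-json-logger) offer the same idea with more features.
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info("Received input: %s", {"age": 25, "salary": 50000})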
2. Dashboards
Dashboards provide visual insights into system health, prediction quality, latency, and data distribution.
Example: Prometheus + Grafana Setup for ML Service
Expose metrics in FastAPI
from prometheus_client import start_http_server, Summary

PREDICTION_TIME = Summary('prediction_time_seconds', 'Time spent making prediction')

@app.post("/predict")
@PREDICTION_TIME.time()  # timer sits below the route decorator so FastAPI registers the timed function
async def predict(request: Request):
    # Simulated logic
    return {"prediction": 1}

start_http_server(8001)  # Expose metrics at :8001/metrics
Grafana Dashboard Panels:
- Model prediction latency (from prediction_time_seconds)
- Number of requests
- Error rates
- CPU/memory usage
Use labels to filter metrics by endpoint, model version, or feature.
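Labels are what make those filters possible. A short sketch of version-labelled metrics (the model_version value is a placeholder that would normally come from config or an environment variable):
from prometheus_client import Counter, Summary

MODEL_VERSION = "v1.4.2"  # placeholder; usually read from config or an env variable

PREDICTIONS = Counter("predictions_total", "Predictions served", ["endpoint", "model_version"])
VERSIONED_LATENCY = Summary(
    "prediction_latency_by_version_seconds", "Prediction latency", ["model_version"]
)

def record_prediction(endpoint: str, latency_seconds: float) -> None:
    PREDICTIONS.labels(endpoint=endpoint, model_version=MODEL_VERSION).inc()
    VERSIONED_LATENCY.labels(model_version=MODEL_VERSION).observe(latency_seconds)

# A Grafana panel query such as rate(predictions_total{model_version="v1.4.2"}[5m])
# then shows traffic for a single model version.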
3. Alerting
Alerts notify you when things go wrong or deviate from expected behavior.
Example: Prometheus Alert Rules
groups:
  - name: ml-service-alerts
    rules:
      - alert: HighLatency
        # average prediction latency over the last 5 minutes exceeds 1 second
        expr: rate(prediction_time_seconds_sum[5m]) / rate(prediction_time_seconds_count[5m]) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High prediction latency detected"
      - alert: RequestFailures
        expr: increase(http_requests_total{status="500"}[5m]) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High number of 500 errors"
Alerting Integrations
- Email / Slack / Teams (via Alertmanager)
- PagerDuty / Opsgenie for on-call management
Always include summary and description in alerts for quick troubleshooting.
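For the Slack route mentioned above, Alertmanager sends alerts to a configured receiver. A minimal sketch of alertmanager.yml; the webhook URL and channel are placeholders:
route:
  receiver: slack-notifications
  group_by: ['alertname']

receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder webhook
        channel: '#ml-alerts'
        title: '{{ .CommonAnnotations.summary }}'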
Recommended Tool Stack
Purpose | Tools |
---|---|
Logging | Python Logging, FluentBit, ELK, Loki |
Metrics | Prometheus, OpenTelemetry |
Dashboards | Grafana, Kibana |
Alerting | Prometheus Alertmanager, PagerDuty |
Insight and Traceability with Prometheus, Grafana, and Custom Logging
Let's understand how to implement insight and traceability in ML services using Prometheus, Grafana, and custom logging with a FastAPI-based model serving app.
🧠 Objective
To gain deep observability into:
- Prediction latency
- Request volumes and failures
- Runtime logs for input/output tracing
- Real-time dashboards for alerting and troubleshooting
Stack Overview
Component | Tool |
---|---|
Metrics | Prometheus |
Visualization | Grafana |
Logging | Python Logging + Fluent Bit (optional) |
App Framework | FastAPI |
Step 1: FastAPI App with Prometheus Metrics & Logging
main.py
from fastapi import FastAPI, Request
from prometheus_client import Counter, Summary, make_asgi_app
import logging

# Set up logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Metrics
REQUEST_COUNT = Counter("request_count", "Total request count", ['method', 'endpoint'])
PREDICTION_LATENCY = Summary("prediction_latency_seconds", "Prediction latency in seconds")

# FastAPI app with Prometheus metrics mounted at /metrics
app = FastAPI()
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    # Count every request by method and endpoint; prediction latency is measured by the
    # @PREDICTION_LATENCY.time() decorator on /predict so that scrapes of /metrics and
    # other endpoints do not skew the latency metric.
    REQUEST_COUNT.labels(method=request.method, endpoint=request.url.path).inc()
    return await call_next(request)

@app.post("/predict")
@PREDICTION_LATENCY.time()
async def predict(request: Request):
    payload = await request.json()
    logging.info(f"Incoming payload: {payload}")
    prediction = {"result": 1}  # Dummy model
    logging.info(f"Prediction: {prediction}")
    return prediction
Step 2: Run Prometheus
prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'fastapi-app'
    static_configs:
      - targets: ['localhost:8000']
Start Prometheus
prometheus --config.file=prometheus.yml
Step 3: Set Up Grafana Dashboard
- Run Grafana:
  docker run -d -p 3000:3000 grafana/grafana
- Add Prometheus as a data source (URL: http://host.docker.internal:9090)
- Create panels for:
  - request_count
  - prediction_latency_seconds
🧠 You can import a JSON dashboard config or create visual alerts from panels.
Step 4: Custom Logging with Fluent Bit
If using containers, route logs to Elastic or Loki for full observability; a minimal Fluent Bit sketch follows the log config below.
Example: Log to File (Basic)
uvicorn main:app --host 0.0.0.0 --port 8000 --log-config log_config.yaml
Sample log_config.yaml
version: 1
formatters:
  default:
    format: '%(asctime)s - %(levelname)s - %(message)s'
handlers:
  file:
    class: logging.FileHandler
    filename: logs/app.log
    formatter: default
root:
  level: INFO
  handlers: [file]
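If you then route that file through Fluent Bit, a minimal fluent-bit.conf could tail the log and forward it to Loki (or, with the es output plugin, to Elasticsearch). The path, host, and labels below are placeholders:
[INPUT]
    Name   tail
    Path   /app/logs/app.log
    Tag    ml-service

[OUTPUT]
    Name   loki
    Match  ml-service
    Host   loki
    Port   3100
    Labels job=ml-service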
🧠 Expected Outcome
With the above setup, you get:
- Real-time metrics scraped by Prometheus
- Visual dashboards via Grafana
- Logged inputs/outputs for traceability
- Alert conditions (high latency, error rate) visible in Grafana
This setup ensures production-grade observability for your ML services. Enhance further by adding tracing (e.g., OpenTelemetry) and distributed log collectors (Fluent Bit, Logstash).
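As a starting point for the tracing suggestion, the OpenTelemetry FastAPI instrumentation attaches a trace (and trace ID) to every request. This sketch assumes the opentelemetry-sdk and opentelemetry-instrumentation-fastapi packages are installed and exports spans to the console purely for demonstration; a real setup would export to a collector.
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Name the service so spans are attributable in the tracing backend
provider = TracerProvider(resource=Resource.create({"service.name": "ml-service"}))
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

app = FastAPI()  # or reuse the app defined in main.py
FastAPIInstrumentor.instrument_app(app)  # every request now carries a trace context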
🔥 Challenges
Instrumentation
- Add a metrics endpoint (e.g., /metrics) to your FastAPI/Flask app
- Track and expose latency of each prediction
- Log every request and response in a file or stdout
Monitoring Setup
- Install and configure Prometheus to collect app metrics
- Connect Grafana to Prometheus and build a dashboard
- Monitor system-level metrics like CPU, memory via node_exporter
Alerting
- Set up a rule: Alert if avg latency > 500ms for 5 mins
- Send alert to Slack/email (use Alertmanager or a webhook tool)
Advanced
- Use EvidentlyAI or WhyLabs to log and visualize prediction distributions
- Implement trace IDs for incoming requests to correlate logs and metrics