30 Days of MLOps Challenge · Day 19

Monitoring ML Systems in Production – Metrics, Logging, Alerting

By Aviraj Kawade · September 16, 2025 · 7 min read

Master monitoring to ensure ML models in production are healthy, reliable, and performant. By tracking key metrics, analyzing logs, and setting up alerts, you can proactively detect model drift, data quality issues, and service failures before they impact users.

💡 Hey — It's Aviraj Kawade 👋

We must master monitoring to ensure ML models in production are healthy, reliable, and performant. By tracking key metrics, analyzing logs, and setting up alerts, we can proactively detect model drift, data quality issues, and service failures before they impact users.

📚 Key Learnings

  • Understand the importance of observability in ML systems: uptime, performance, correctness
  • Learn what to monitor: prediction latency, model accuracy, data quality, service health
  • Differentiate between system metrics and model metrics
  • Set up logging, dashboards, and alerting for deployed ML services
  • Use Prometheus, Grafana, and custom logging solutions for insight and traceability

🧠 Learn here

ML Monitoring overview diagram

Observability in ML systems is crucial for ensuring that models behave as expected in production. It helps detect issues related to performance, uptime, data quality, drift, and overall system health. Without observability, it's difficult to debug or maintain trust in deployed ML models.

Why Observability Matters in ML

  • Uptime: Ensures your model endpoints are live and accessible
  • Performance: Tracks latency, throughput, and resource consumption of your ML services
  • Correctness: Monitors input/output distributions, predictions, and error rates to catch anomalies or degradation
  • Compliance: Supports auditing and governance by keeping a history of inputs, predictions, and model versions
  • Root Cause Analysis: Aids debugging of failures, unexpected behaviors, or drifts

Key Observability Metrics

| Category | Metrics |
| --- | --- |
| System | Uptime, response time, CPU/GPU usage, memory consumption |
| Model | Prediction latency, prediction confidence, accuracy (if feedback exists) |
| Data | Feature distribution, missing/null values, data schema validation |
| Drift | Data drift, concept drift, input/output change rate |
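
For the drift row in particular, a common approach is to compare the live feature distribution against the training distribution with a statistic such as the Population Stability Index (PSI). Below is a minimal sketch assuming two NumPy arrays of values for a single feature; the bin count and the 0.2 rule of thumb are illustrative, not prescriptive.

import numpy as np

def population_stability_index(expected, actual, bins=10):
    """Rough PSI between a training (expected) and live (actual) sample of one feature."""
    # Bin edges are derived from the training distribution
    edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero / log(0) on empty bins
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct)))

# Illustrative usage with synthetic data
train_ages = np.random.normal(35, 10, 10_000)
live_ages = np.random.normal(40, 12, 1_000)   # shifted distribution
psi = population_stability_index(train_ages, live_ages)
print(f"PSI for user_age: {psi:.3f}")  # > 0.2 is a common rule of thumb for significant drift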

Tools for Observability in ML

  • Prometheus + Grafana: For infrastructure and model performance metrics
  • OpenTelemetry: Distributed tracing and metrics collection
  • EvidentlyAI: Data drift, model performance, and monitoring dashboards
  • Seldon Alibi Detect / Fiddler AI / WhyLabs: For advanced monitoring, explainability, and drift detection
  • MLflow: Experiment tracking, model versioning, logging parameters and metrics
  • Elastic Stack (ELK): Centralized logging and visualization

Best Practices

  • Instrument all components: Include data pipeline, model serving layer, and infrastructure
  • Set alerts: Define thresholds and set up notifications for anomalies
  • Monitor input data: Compare live input features with training data
  • Track predictions: Log and analyze prediction confidence and output distribution (see the sketch after this list)
  • Version control everything: Track changes in code, models, and data schemas
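
For the "track predictions" practice, one lightweight option is to expose prediction confidence as a Prometheus Histogram so the output distribution can be charted and alerted on. This is a minimal sketch; the metric name and bucket boundaries are assumptions, not a standard convention.

from prometheus_client import Histogram

# Distribution of model confidence scores; buckets chosen for a probability in [0, 1]
PREDICTION_CONFIDENCE = Histogram(
    "prediction_confidence",
    "Distribution of prediction confidence scores",
    buckets=[0.1, 0.25, 0.5, 0.75, 0.9, 1.0],
)

def record_prediction(confidence: float):
    """Call after each prediction so the output distribution shows up in Grafana."""
    PREDICTION_CONFIDENCE.observe(confidence)

record_prediction(0.93)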

Example Alert Conditions to Track in Production

  • Model response time > 300ms
  • Drift in feature user_age > 10% from training distribution
  • Model confidence < 0.5 for more than 5% of predictions
  • Increase in null/missing values in real-time data > 2%
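
As a sketch of how conditions like the last two could be evaluated over a batch of logged predictions, the check below uses illustrative column names and limits (assumptions, not part of any standard API):

import pandas as pd

def check_thresholds(batch: pd.DataFrame) -> dict:
    """Evaluate illustrative alert thresholds over a batch of logged predictions."""
    alerts = {}
    # Share of low-confidence predictions (confidence < 0.5 for more than 5%)
    low_conf_share = (batch["confidence"] < 0.5).mean()
    if low_conf_share > 0.05:
        alerts["low_confidence"] = f"{low_conf_share:.1%} of predictions below 0.5"
    # Share of null/missing values across all features (> 2%)
    null_share = batch.isnull().mean().mean()
    if null_share > 0.02:
        alerts["missing_values"] = f"{null_share:.1%} of values are null"
    return alerts

batch = pd.DataFrame({"confidence": [0.9, 0.4, 0.3, 0.95], "user_age": [25, None, 40, 31]})
print(check_thresholds(batch))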

ML Monitoring: What to Monitor?

1. Prediction Latency

Latency indicates the time taken to serve a prediction request. High latency affects user experience and system performance.

Example: FastAPI + Prometheus

from fastapi import FastAPI, Request
import time
from prometheus_client import Histogram, start_http_server

app = FastAPI()
REQUEST_LATENCY = Histogram('prediction_latency_seconds', 'Time for prediction', ['endpoint'])

@app.post("/predict")
async def predict(request: Request):
    start_time = time.time()
    # simulate prediction
    time.sleep(0.3)  # replace with actual model.predict()
    latency = time.time() - start_time
    REQUEST_LATENCY.labels(endpoint="/predict").observe(latency)
    return {"prediction": 42}

# Start Prometheus metrics server
start_http_server(8001)

2. Model Accuracy (Offline Evaluation)

Once ground-truth labels arrive through a feedback loop, compare them with the logged predictions to evaluate accuracy.

Example: with Scikit-Learn

from sklearn.metrics import accuracy_score
import pandas as pd

# Simulated predictions and actuals
preds = [1, 0, 1, 1, 0]
y_true = [1, 0, 0, 1, 0]
accuracy = accuracy_score(y_true, preds)
print(f"Accuracy: {accuracy * 100:.2f}%")

In production, log and evaluate this periodically using feedback labels.
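
A minimal sketch of that loop: log each prediction with an identifier, then join against feedback labels once they arrive and score only the labeled rows. The table layout and the prediction_id join key here are assumptions.

import pandas as pd
from sklearn.metrics import accuracy_score

# Logged at serving time (e.g., appended to a table or file)
predictions = pd.DataFrame({
    "prediction_id": [101, 102, 103, 104],
    "prediction": [1, 0, 1, 1],
})

# Arrives later via a feedback loop (labels confirmed by users or downstream systems)
feedback = pd.DataFrame({
    "prediction_id": [101, 102, 103],
    "label": [1, 0, 0],
})

# Join on the shared identifier and score only the rows that have feedback
joined = predictions.merge(feedback, on="prediction_id", how="inner")
accuracy = accuracy_score(joined["label"], joined["prediction"])
print(f"Rolling accuracy on {len(joined)} labeled predictions: {accuracy:.2%}")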

3. Data Quality

Check for missing values, type mismatches, or invalid ranges in live input data.

Example: Real-Time Data Checks

import pandas as pd

def validate_data(df):
    issues = {}
    if df.isnull().sum().any():
        issues['nulls'] = df.isnull().sum().to_dict()
    if (df['age'] < 0).any():
        issues['age_range'] = "Negative age found"
    if df['salary'].dtype != float:
        issues['salary_type'] = "Expected float for salary"
    return issues

# Example incoming request payload
data = pd.DataFrame({"age": [25, -1], "salary": [50000, "unknown"]})
print(validate_data(data))

4. Service Health

Check uptime, HTTP error codes, memory/CPU usage, etc.

Example: FastAPI + Prometheus Metrics

from fastapi import Request
from prometheus_client import Counter

# Extends the FastAPI `app` defined in the latency example above
REQUEST_COUNT = Counter('request_count', 'App Request Count', ['method', 'endpoint'])
ERROR_COUNT = Counter('error_count', 'Error Count', ['method', 'endpoint'])

@app.middleware("http")
async def count_requests(request: Request, call_next):
    method = request.method
    endpoint = request.url.path
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
    try:
        response = await call_next(request)
    except Exception:
        ERROR_COUNT.labels(method=method, endpoint=endpoint).inc()
        raise
    return response

Use tools like Grafana to visualize the metrics.

Recommended Tools:

  • Prometheus + Grafana for latency and health metrics
  • MLflow or custom DB to store prediction vs actuals
  • EvidentlyAI for data quality and drift detection
  • OpenTelemetry for tracing

✅ Pro Tips:

  • Always define thresholds and alerting rules
  • Track metrics at both infra (CPU, memory) and model levels
  • Monitor data inputs continuously to prevent garbage-in garbage-out
  • Monitoring = Observability + Actionable Insights

System Metrics vs Model Metrics

System metrics focus on infrastructure performance, while model metrics focus on model behavior and outcomes.

Understanding the difference between system and model metrics is crucial for building and maintaining reliable ML systems in production.

| Aspect | System Metrics | Model Metrics |
| --- | --- | --- |
| Definition | Metrics related to infrastructure and application performance | Metrics that evaluate ML model behavior and performance |
| Focus Area | Hardware, network, service uptime | Prediction quality, fairness, drift, and model reliability |
| Examples | CPU usage, memory, disk I/O, network latency, HTTP errors | Accuracy, precision, recall, AUC, prediction latency, drift |
| Collection Tools | Prometheus, Grafana, Datadog, CloudWatch | MLflow, EvidentlyAI, custom logging, Alibi Detect |
| Usage | Ensures system availability and performance | Ensures model is accurate, fair, and not degrading |
| Alert Triggers | High CPU/memory, service down, HTTP 500 errors | High prediction errors, drift detected, confidence drop |
| Visualization | System dashboards (Grafana, Cloud dashboards) | ML dashboards (Evidently, custom visualizations) |
| Granularity | Service-level or node-level | Feature-level or prediction-level |
| Who Monitors | DevOps, SREs | ML Engineers, Data Scientists |

Both metric types are essential for a robust and reliable ML system in production.
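
To make the split concrete, model metrics can be published through the same Prometheus pipeline as system metrics. Here is a minimal sketch of a Gauge holding the rolling accuracy from an offline evaluation job, tagged by model version; the metric name and label are assumptions.

from prometheus_client import Gauge

# Model metric: rolling accuracy computed from feedback labels, tagged by model version
MODEL_ACCURACY = Gauge("model_accuracy", "Rolling accuracy from feedback labels", ["model_version"])

def publish_accuracy(accuracy: float, model_version: str = "v1"):
    """Push the latest offline accuracy so Grafana can plot it next to CPU/memory panels."""
    MODEL_ACCURACY.labels(model_version=model_version).set(accuracy)

publish_accuracy(0.87)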

Logging, Dashboards, and Alerting for Deployed ML Services

These components are critical for production readiness, troubleshooting, and operational excellence.

1. Logging

Logging is essential for capturing runtime behavior, inputs/outputs, and model predictions for debugging and auditing.

Example: Logging in FastAPI ML Endpoint

from fastapi import FastAPI, Request
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
app = FastAPI()

@app.post("/predict")
async def predict(request: Request):
    payload = await request.json()
    logging.info(f"Received input: {payload}")
    result = {"prediction": 1}  # Simulated prediction
    logging.info(f"Prediction result: {result}")
    return result

Tip: Use structured logging (e.g., JSON logs) for better parsing in tools like ELK or Datadog.
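
As a minimal, standalone sketch using only the standard library, a custom formatter can emit each record as one JSON line (in practice you may prefer a library such as python-json-logger):

import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for easy parsing in ELK/Datadog."""
    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

logging.info("Received input: %s", {"age": 25, "salary": 50000})
# -> {"time": "...", "level": "INFO", "message": "Received input: {'age': 25, 'salary': 50000}"}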

2. Dashboards

Dashboards provide visual insights into system health, prediction quality, latency, and data distribution.

Example: Prometheus + Grafana Setup for ML Service

Expose metrics in FastAPI

from fastapi import FastAPI, Request
from prometheus_client import start_http_server, Summary

app = FastAPI()
PREDICTION_TIME = Summary('prediction_time_seconds', 'Time spent making prediction')

@app.post("/predict")
@PREDICTION_TIME.time()  # placed below the route decorator so the registered handler is the timed one
async def predict(request: Request):
    # Simulated logic
    return {"prediction": 1}

start_http_server(8001)  # Expose metrics at :8001/metrics

Grafana Dashboard Panels:
  • Model prediction latency (from prediction_time_seconds)
  • Number of requests
  • Error rates
  • CPU/memory usage

Use labels to filter metrics by endpoint, model version, or feature.
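
For example, here is a sketch of the same latency metric declared with endpoint and model_version labels. The label names and values are assumptions, and this labeled variant would replace, not sit alongside, the unlabeled Summary above.

from prometheus_client import Summary

# Labeled variant of the earlier Summary (define one or the other, not both)
PREDICTION_TIME = Summary(
    "prediction_time_seconds",
    "Time spent making prediction",
    ["endpoint", "model_version"],
)

# Observed inside the request handler; label values let Grafana filter and compare model versions
PREDICTION_TIME.labels(endpoint="/predict", model_version="v2").observe(0.12)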

3. Alerting

Alerts notify you when things go wrong or deviate from expected behavior.

Example: Prometheus Alert Rules

groups:
  - name: ml-service-alerts
    rules:
      - alert: HighLatency
        # Average latency over the last 5 minutes; the raw sum/count ratio would only give a lifetime average
        expr: rate(prediction_time_seconds_sum[5m]) / rate(prediction_time_seconds_count[5m]) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High prediction latency detected"

      - alert: RequestFailures
        expr: increase(http_requests_total{status="500"}[5m]) > 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "High number of 500 errors"

Alerting Integrations

  • Email / Slack / Teams (via Alertmanager)
  • PagerDuty / Opsgenie for on-call management

Always include summary and description in alerts for quick troubleshooting.

Recommended Tool Stack

| Purpose | Tools |
| --- | --- |
| Logging | Python Logging, FluentBit, ELK, Loki |
| Metrics | Prometheus, OpenTelemetry |
| Dashboards | Grafana, Kibana |
| Alerting | Prometheus Alertmanager, PagerDuty |

Insight and Traceability with Prometheus, Grafana, and Custom Logging

Let's understand how to implement insight and traceability in ML services using Prometheus, Grafana, and custom logging with a FastAPI-based model serving app.

🧠 Objective

To gain deep observability into:

  • Prediction latency
  • Request volumes and failures
  • Runtime logs for input/output tracing
  • Real-time dashboards for alerting and troubleshooting

Stack Overview

| Component | Tool |
| --- | --- |
| Metrics | Prometheus |
| Visualization | Grafana |
| Logging | Python Logging + Fluent Bit (optional) |
| App Framework | FastAPI |

Step 1: FastAPI App with Prometheus Metrics & Logging

main.py

from fastapi import FastAPI, Request
from prometheus_client import Counter, Summary, make_asgi_app
import time
import logging

# Setup Logging
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")

# Metrics
REQUEST_COUNT = Counter("request_count", "Total request count", ['method', 'endpoint'])
PREDICTION_LATENCY = Summary("prediction_latency_seconds", "Prediction latency in seconds")

# FastAPI app
app = FastAPI()
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

@app.middleware("http")
async def record_metrics(request: Request, call_next):
    method = request.method
    endpoint = request.url.path
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
    start_time = time.time()
    response = await call_next(request)
    latency = time.time() - start_time
    PREDICTION_LATENCY.observe(latency)
    return response

@app.post("/predict")
@PREDICTION_LATENCY.time()
async def predict(request: Request):
    payload = await request.json()
    logging.info(f"Incoming payload: {payload}")
    prediction = {"result": 1}  # Dummy model
    logging.info(f"Prediction: {prediction}")
    return prediction

Step 2: Run Prometheus

prometheus.yml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'fastapi-app'
    static_configs:
      - targets: ['localhost:8000']

Start Prometheus

prometheus --config.file=prometheus.yml

Step 3: Set Up Grafana Dashboard

  • Run Grafana: docker run -d -p 3000:3000 grafana/grafana
  • Add Prometheus as a data source (URL: http://host.docker.internal:9090)
  • Create panels for:
    • request_count
    • prediction_latency_seconds

🧠 You can import a JSON dashboard config or create visual alerts from panels.

Step 4: Custom Logging with Fluent Bit

If using containers, route logs to Elastic or Loki for full observability.

Example: Log to File (Basic)

uvicorn main:app --host 0.0.0.0 --port 8000 --log-config log_config.yaml

Sample log_config.yaml

version: 1
disable_existing_loggers: false   # keep uvicorn's own loggers active
formatters:
  default:
    format: '%(asctime)s - %(levelname)s - %(message)s'
handlers:
  file:
    class: logging.FileHandler
    filename: logs/app.log          # the logs/ directory must already exist
    formatter: default
root:
  level: INFO
  handlers: [file]

🧠 Expected Outcome

With the above setup, you get:

  • Real-time metrics scraped by Prometheus
  • Visual dashboards via Grafana
  • Logged inputs/outputs for traceability
  • Alert conditions (high latency, error rate) visible in Grafana

This setup ensures production-grade observability for your ML services. Enhance further by adding tracing (e.g., OpenTelemetry) and distributed log collectors (Fluent Bit, Logstash).
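
As a starting point for the tracing part, here is a minimal sketch of instrumenting the same FastAPI app with OpenTelemetry. It assumes the opentelemetry-sdk and opentelemetry-instrumentation-fastapi packages and uses the console exporter purely for local experimentation; in production you would swap in an OTLP exporter pointed at your collector.

from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure a tracer provider that prints spans to stdout (swap for an OTLP exporter in production)
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

app = FastAPI()

@app.post("/predict")
async def predict():
    return {"prediction": 1}

# Auto-instrument all routes: each request becomes a span with method, route, and status
FastAPIInstrumentor.instrument_app(app)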

🔥 Challenges

Instrumentation

  • Add a metrics endpoint (e.g., /metrics) to your FastAPI/Flask app
  • Track and expose latency of each prediction
  • Log every request and response in a file or stdout

Monitoring Setup

  • Install and configure Prometheus to collect app metrics
  • Connect Grafana to Prometheus and build a dashboard
  • Monitor system-level metrics like CPU, memory via node_exporter

Alerting

  • Set up a rule: Alert if avg latency > 500ms for 5 mins
  • Send alert to Slack/email (use Alertmanager or a webhook tool)

Advanced

  • Use EvidentlyAI or WhyLabs to log and visualize prediction distributions
  • Implement trace IDs for incoming requests to correlate logs and metrics