MLOps Tools Landscape – Explore the Ecosystem
Understand the tools that power end‑to‑end ML workflows—from data versioning to orchestration, deployment, and monitoring.
Diagram: Tools Landscape

Key Learnings
- Categories of tools across the ML lifecycle.
- Open‑source vs managed services: trade‑offs and when to use each.
- Where tools fit: versioning, training, orchestration, deployment, monitoring.
- Deep dive into MLflow, DVC, Kubeflow, Airflow, SageMaker, and Vertex AI.
Categories of Tools Across the ML Lifecycle
- Data Engineering & Preparation
Collecting, cleaning, transforming, and storing data.
- Data Collection: Apache Nifi, Kafka, Web Scrapers, APIs
- Data Cleaning: OpenRefine, Pandas, DataWrangler
- Data Transformation: Spark, dbt, Airbyte
- Data Storage: PostgreSQL, S3, Delta Lake, BigQuery
- Experimentation & Development
Writing ML code, tracking experiments, and versioning datasets.
- Notebooks & IDEs: Jupyter, Colab, VS Code
- Experiment Tracking: MLflow, Weights & Biases, Neptune.ai
- Data Versioning: DVC, LakeFS
- Feature Stores: Feast, Tecton
- Model Training & Optimization
Training models, tuning hyperparameters, and managing compute.
- Frameworks: TensorFlow, PyTorch, Scikit‑learn, XGBoost
- HPO: Optuna, Ray Tune, Hyperopt
- Distributed Training: Horovod, SageMaker, Vertex AI
- Model Packaging & Deployment
Containerizing models and deploying them as APIs or batch jobs.
- Packaging: ONNX, TorchScript, BentoML
- Deployment: KServe, Seldon Core, SageMaker, Vertex AI
- Containerization: Docker, Podman
- CI/CD: Jenkins, GitHub Actions, GitLab CI
- Model Monitoring & Observability
Tracking performance, drift, and logs in production.
- Monitoring: Prometheus, Grafana, Evidently AI
- Drift Detection: WhyLabs, Fiddler, Arize
- Logging & Tracing: ELK Stack, Jaeger, OpenTelemetry
- Governance & Compliance
Explainability, fairness, and secure access.
- Explainability: SHAP, LIME
- Fairness: AI Fairness 360, Fairlearn
- Security & Access: Vault, OPA, IAM
- Workflow Orchestration
Pipeline orchestration and task scheduling.
- Engines: Apache Airflow, Kubeflow Pipelines, Prefect, Dagster
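The Airflow and Kubeflow Pipelines engines are covered in depth later on this page; as a quick taste of the lighter-weight engines, here is a minimal Prefect 2-style flow sketch (task names and data are placeholders, not a production pipeline):
from prefect import flow, task

@task
def load_data():
    return [1, 2, 3]

@task
def train(data):
    print(f"Training on {len(data)} rows")

@flow
def ml_pipeline():
    # Calling tasks inside a flow runs them and returns their results.
    data = load_data()
    train(data)

if __name__ == "__main__":
    ml_pipeline()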
Open‑Source vs Managed Services
| Feature/Aspect | Open‑Source (MLflow, DVC, Kubeflow) | Managed (SageMaker, Vertex AI, Azure ML) |
| --- | --- | --- |
| Ease of Setup | Manual install and config | Out‑of‑the‑box, minimal config |
| Customization | Highly customizable and flexible | Limited by provider implementation |
| Scalability | Manual scaling on your infra | Auto‑scaled by provider |
| Integration | Broad, may need glue code | Native within ecosystem |
| Cost | Low upfront; infra cost grows | Pay‑as‑you‑go; can be expensive |
| Data Security | Full control of data policies | Provider compliance/standards |
| Maintenance | Manual upgrades and monitoring | Handled by provider |
| Learning Curve | Steeper; infra + tools | Easier; infra abstracted |
| Support | Large OSS communities | Official vendor support |
| Vendor Lock‑In | No lock‑in; portable | Higher lock‑in risk |
Summary: Open‑source suits teams needing flexibility and control. Managed services fit teams optimizing for speed and low ops overhead.
MLOps Tools by Stage
1. Versioning
Track versions of code, data, and models.
- Git – Code versioning
- DVC – Data and model versioning
- MLflow – Model versioning and tracking
- Weights & Biases – Experiment tracking
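A minimal experiment-tracking sketch with Weights & Biases (assumes you have already run wandb login; the project name and metric values are placeholders):
import wandb

# Start a run; config values are stored alongside logged metrics.
run = wandb.init(project="mlops-demo", config={"lr": 0.01, "epochs": 5})

for epoch in range(run.config.epochs):
    # Replace the placeholder value with real training/validation metrics.
    wandb.log({"epoch": epoch, "val_accuracy": 0.80 + 0.02 * epoch})

run.finish()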
2. Training
Model development, experimentation, and tuning.
- TensorFlow, PyTorch, Scikit‑learn
- MLflow – Run tracking and logging
- Weights & Biases – Hyperparameter tuning, metrics
- Keras Tuner, Optuna – HPO
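A minimal hyperparameter-tuning sketch with Optuna; the objective below returns a toy score, where a real objective would train and validate a model with the suggested values:
import optuna

def objective(trial):
    # Suggest hyperparameters; a real objective would train a model with them.
    alpha = trial.suggest_float("alpha", 1e-4, 1.0, log=True)
    max_depth = trial.suggest_int("max_depth", 2, 10)
    return (alpha - 0.1) ** 2 + (max_depth - 5) ** 2  # stand-in for validation loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20)
print(study.best_params)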
3. Orchestration
Automate and manage ML workflows and pipelines.
- Apache Airflow – General orchestration
- Kubeflow Pipelines – K8s‑native ML pipelines
- Argo Workflows – Container‑native orchestration
- MLflow Projects, Metaflow
4. Deployment
Serve trained models for real‑time or batch inference; a sample client request follows the list below.
- Seldon Core, KServe – K8s serving
- TensorFlow Serving
- MLflow Models, BentoML
- SageMaker – Managed deploy
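Deployed models are typically called over REST or gRPC. Here is a sketch of a REST prediction request, assuming a TensorFlow Serving container is already running locally with a model named my_model on the default port 8501:
import requests

# Assumes TensorFlow Serving is running, e.g.:
#   docker run -p 8501:8501 -v "$PWD/my_model:/models/my_model" \
#       -e MODEL_NAME=my_model tensorflow/serving
url = "http://localhost:8501/v1/models/my_model:predict"
payload = {"instances": [[5.1, 3.5, 1.4, 0.2]]}  # placeholder feature vector

response = requests.post(url, json=payload, timeout=10)
print(response.json())  # e.g. {"predictions": [...]}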
5. Monitoring
Monitor model performance, detect drift, and ensure reliability.
- Prometheus & Grafana
- Evidently AI, WhyLabs, Arize AI
- SageMaker Model Monitor
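A minimal drift-report sketch with Evidently, assuming the Report and metric-preset API of recent Evidently releases and two hypothetical CSV samples (training reference vs. production data):
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Hypothetical reference (training) and current (production) samples.
reference = pd.read_csv("reference.csv")
current = pd.read_csv("current.csv")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("data_drift_report.html")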
In‑Depth: Popular Tools
1) MLflow — Tracking, Registry, Deployment
Manage the ML lifecycle: experiments, reproducibility, and model serving.
- Tracking: log params, metrics, artifacts
- Projects: package ML code
- Models & Registry: package, version, promote
import mlflow

with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.78)
    # mlflow.sklearn.log_model(model, "model")
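After logging a few runs, the mlflow ui command starts the local tracking UI (http://localhost:5000 by default) for browsing and comparing parameters and metrics.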
2) DVC — Data & Model Versioning
Git‑compatible versioning for datasets, models, and pipelines.
- Version control for datasets
- Pipelines and reproducibility
- Remote storage (S3, GCS, Azure)
# Initialize DVC
dvc init
# Track dataset
dvc add data/train.csv
git add data/train.csv.dvc .gitignore
git commit -m "Add training data"
# Push data to remote storage
dvc remote add -d myremote s3://ml-data-store
dvc push
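On another machine, or after checking out an older commit, git checkout followed by dvc pull restores exactly the data version recorded in the .dvc file.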
3) Kubeflow — End‑to‑End on Kubernetes
Cloud‑native platform for pipelines, training, and serving on K8s.
from kfp import dsl

@dsl.pipeline(
    name='Basic pipeline',
    description='An example pipeline.'
)
def basic_pipeline():
    # A single container step (KFP v1 SDK style).
    op = dsl.ContainerOp(
        name='echo',
        image='alpine',
        command=['echo', 'Hello Kubeflow!']
    )
4) Apache Airflow — Workflow Orchestration
Programmatically author, schedule, and monitor workflows.
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def train_model():
    print("Training model...")

def evaluate_model():
    print("Evaluating model...")

dag = DAG('ml_pipeline', start_date=datetime(2023, 1, 1), schedule_interval='@daily')

train = PythonOperator(task_id='train', python_callable=train_model, dag=dag)
evaluate = PythonOperator(task_id='evaluate', python_callable=evaluate_model, dag=dag)

train >> evaluate
5) Amazon SageMaker — Managed ML Platform
Build, train, and deploy ML models at scale with AWS.
from sagemaker.sklearn.estimator import SKLearn

# 'SageMakerRole' stands in for an IAM role with SageMaker permissions.
sklearn = SKLearn(entry_point='train.py',
                  role='SageMakerRole',
                  instance_type='ml.m5.large',
                  framework_version='1.2-1')  # a supported scikit-learn version
sklearn.fit()  # optionally pass S3 input channels, e.g. {'train': 's3://...'}
predictor = sklearn.deploy(instance_type='ml.m5.large', initial_instance_count=1)
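The returned predictor exposes predict() for real-time inference; call predictor.delete_endpoint() when you are done, since the endpoint is billed for as long as it runs.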
6) Google Vertex AI — Unified ML Platform
GCP's platform for AutoML, custom training, pipelines, and monitoring.
from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')

job = aiplatform.CustomTrainingJob(
    display_name='my-training-job',
    script_path='train.py',
    container_uri='gcr.io/cloud-aiplatform/training/tf-cpu.2-2:latest',
    requirements=['pandas']
)

model = job.run(replica_count=1, model_display_name='my-model')
Challenges
- Create a visual diagram mapping tools to lifecycle stages.
- Write: “MLflow vs SageMaker — Which to start with and why?”
- Install MLflow locally and log a dummy metric.
- Create a DVC pipeline to version a small CSV dataset.