30 Days of MLOps Challenge · Day 2

MLOps Tools Landscape – Explore the Ecosystem

By Aviraj Kawade · June 4, 2025 · 6 min read

Understand the tools that power end‑to‑end ML workflows—from data versioning to orchestration, deployment, and monitoring.

Diagram: MLOps tools landscape

Key Learnings

  • Categories of tools across the ML lifecycle.
  • Open‑source vs managed services: trade‑offs and when to use each.
  • Where tools fit: versioning, training, orchestration, deployment, monitoring.
  • Deep dive into MLflow, DVC, Kubeflow, Airflow, SageMaker, and Vertex AI.

Categories of Tools Across the ML Lifecycle

  1. Data Engineering & Preparation

    Collecting, cleaning, transforming, and storing data.

    • Data Collection: Apache NiFi, Kafka, Web Scrapers, APIs
    • Data Cleaning: OpenRefine, Pandas, DataWrangler
    • Data Transformation: Spark, dbt, Airbyte
    • Data Storage: PostgreSQL, S3, Delta Lake, BigQuery
  2. Experimentation & Development

    Writing ML code, tracking experiments, and versioning datasets.

    • Notebooks & IDEs: Jupyter, Colab, VS Code
    • Experiment Tracking: MLflow, Weights & Biases, Neptune.ai
    • Data Versioning: DVC, LakeFS
    • Feature Stores: Feast, Tecton
  3. Model Training & Optimization

    Training models, tuning hyperparameters, and managing compute.

    • Frameworks: TensorFlow, PyTorch, Scikit‑learn, XGBoost
    • HPO: Optuna, Ray Tune, Hyperopt
    • Distributed Training: Horovod, SageMaker, Vertex AI
  4. Model Packaging & Deployment

    Containerizing models and deploying them as APIs or batch jobs.

    • Packaging: ONNX, TorchScript, BentoML
    • Deployment: KServe, Seldon Core, SageMaker, Vertex AI
    • Containerization: Docker, Podman
    • CI/CD: Jenkins, GitHub Actions, GitLab CI
  5. Model Monitoring & Observability

    Tracking performance, drift, and logs in production.

    • Monitoring: Prometheus, Grafana, Evidently AI
    • Drift Detection: WhyLabs, Fiddler, Arize
    • Logging & Tracing: ELK Stack, Jaeger, OpenTelemetry
  6. Governance & Compliance

    Explainability, fairness, and secure access.

    • Explainability: SHAP, LIME
    • Fairness: AI Fairness 360, Fairlearn
    • Security & Access: Vault, OPA, IAM
  7. Workflow Orchestration

    Pipeline orchestration and task scheduling.

    • Engines: Apache Airflow, Kubeflow Pipelines, Prefect, Dagster

Open‑Source vs Managed Services

Feature/Aspect  | Open‑Source (MLflow, DVC, Kubeflow) | Managed (SageMaker, Vertex, Azure ML)
----------------|-------------------------------------|--------------------------------------
Ease of Setup   | Manual install and config           | Out‑of‑the‑box, minimal config
Customization   | Highly customizable and flexible    | Limited by provider implementation
Scalability     | Manual scaling on your infra        | Auto‑scaled by provider
Integration     | Broad, may need glue code           | Native within ecosystem
Cost            | Low upfront; infra cost grows       | Pay‑as‑you‑go; can be expensive
Data Security   | Full control of data policies       | Provider compliance/standards
Maintenance     | Manual upgrades and monitoring      | Handled by provider
Learning Curve  | Steeper; infra + tools              | Easier; infra abstracted
Support         | Large OSS communities               | Official vendor support
Vendor Lock‑In  | No lock‑in; portable                | Higher lock‑in risk

Summary: Open‑source suits teams needing flexibility and control. Managed services fit teams optimizing for speed and low ops overhead.

MLOps Tools by Stage

1. Versioning

Track versions of code, data, and models.

  • Git – Code versioning
  • DVC – Data and model versioning
  • MLflow – Model versioning and tracking
  • Weights & Biases – Experiment tracking

2. Training

Model development, experimentation, and tuning.

  • TensorFlow, PyTorch, Scikit‑learn
  • MLflow – Run tracking and logging
  • Weights & Biases – Hyperparameter tuning, metrics
  • Keras Tuner, Optuna – HPO (see the Optuna sketch below)
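
Optuna, named above, searches hyperparameters by optimizing a user‑defined objective. A minimal, self‑contained sketch (the RandomForest objective and search ranges are illustrative, not from this article):

import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Optuna proposes values from these ranges on each trial
    max_depth = trial.suggest_int("max_depth", 2, 16)
    n_estimators = trial.suggest_int("n_estimators", 50, 300)
    model = RandomForestClassifier(max_depth=max_depth, n_estimators=n_estimators)
    X, y = load_iris(return_X_y=True)
    return cross_val_score(model, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)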

3. Orchestration

Automate and manage ML workflows and pipelines.

  • Apache Airflow – General orchestration
  • Kubeflow Pipelines – K8s‑native ML pipelines
  • Argo Workflows – Container‑native orchestration
  • MLflow Projects, Metaflow

4. Deployment

Serve trained models for real‑time or batch inference.

  • Seldon Core, KServe – K8s serving
  • TensorFlow Serving
  • MLflow Models, BentoML (batch-scoring sketch below)
  • SageMaker – Managed deploy
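
For the batch path, MLflow's generic pyfunc loader turns scoring a dataframe into a few lines. A minimal sketch (the registry URI, model name, and columns are illustrative; a registered model is assumed):

import mlflow
import pandas as pd

# Load a registered model by its registry URI (name and version are illustrative)
model = mlflow.pyfunc.load_model("models:/churn-model/1")

# Score a batch; column names must match the model's training schema
batch = pd.DataFrame({"tenure": [3, 24], "monthly_charges": [70.5, 20.0]})
print(model.predict(batch))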

5. Monitoring

Monitor model performance, detect drift, and ensure reliability.

  • Prometheus & Grafana
  • Evidently AI, WhyLabs, Arize AI (drift-report sketch below)
  • SageMaker Model Monitor
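
Evidently, listed above, compares a reference dataset against live data and renders a drift report. A minimal sketch, assuming the Evidently 0.4-style API (the toy dataframes are illustrative):

import numpy as np
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Toy data: the production ("current") feature has drifted from the reference
reference_df = pd.DataFrame({"feature": np.random.normal(0.0, 1.0, 1000)})
current_df = pd.DataFrame({"feature": np.random.normal(0.5, 1.0, 1000)})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("drift_report.html")  # open in a browser to inspect drift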

In‑Depth: Popular Tools

1) MLflow — Tracking, Registry, Deployment

Manage the ML lifecycle: experiments, reproducibility, and model serving.

  • Tracking: log params, metrics, artifacts
  • Projects: package ML code
  • Models & Registry: package, version, promote
import mlflow

# Start a run and log a hyperparameter, a metric, and (optionally) the model
with mlflow.start_run():
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("rmse", 0.78)
    # mlflow.sklearn.log_model(model, "model")
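
Once a model is logged, promoting it to the Model Registry is one more call. A minimal sketch (the run ID placeholder and model name are illustrative; a tracking server with the registry enabled is assumed):

import mlflow

# Assumes a model was logged to the "model" artifact path in the run above;
# replace <run_id> with the run ID shown in the MLflow UI
result = mlflow.register_model("runs:/<run_id>/model", "my-model")
print(result.version)  # the registry assigns an incrementing version number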

2) DVC — Data & Model Versioning

Git‑compatible versioning for datasets, models, and pipelines.

  • Version control for datasets
  • Pipelines and reproducibility
  • Remote storage (S3, GCS, Azure)
# Initialize DVC
dvc init

# Track dataset
dvc add data/train.csv

# Commit the lightweight .dvc pointer file to Git
git add data/train.csv.dvc .gitignore
git commit -m "Add training data"

# Push data to remote storage
dvc remote add -d myremote s3://ml-data-store
dvc push
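
From Python, a versioned file can then be read back at any Git revision. A minimal sketch using dvc.api (the path and revision are illustrative):

import dvc.api

# Open the exact dataset version pinned at the given Git revision
with dvc.api.open('data/train.csv', rev='HEAD') as f:
    print(f.readline())  # e.g. the CSV header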

3) Kubeflow — End‑to‑End on Kubernetes

Cloud‑native platform for pipelines, training, and serving on K8s.

from kfp import dsl

@dsl.pipeline(
    name='Basic pipeline',
    description='An example pipeline.'
)
def basic_pipeline():
    # KFP v1 SDK: ContainerOp wraps a container image as a pipeline step
    op = dsl.ContainerOp(
        name='echo',
        image='alpine',
        command=['echo', 'Hello Kubeflow!']
    )

# Compile for upload: kfp.compiler.Compiler().compile(basic_pipeline, 'pipeline.yaml')

4) Apache Airflow — Workflow Orchestration

Programmatically author, schedule, and monitor workflows.

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

def train_model():
    print("Training model...")

def evaluate_model():
    print("Evaluating model...")

dag = DAG('ml_pipeline', start_date=datetime(2023, 1, 1), schedule='@daily')

train = PythonOperator(task_id='train', python_callable=train_model, dag=dag)
evaluate = PythonOperator(task_id='evaluate', python_callable=evaluate_model, dag=dag)

train >> evaluate

5) Amazon SageMaker — Managed ML Platform

Build, train, and deploy ML models at scale with AWS.

from sagemaker.sklearn.estimator import SKLearn

# role must be an IAM role ARN with SageMaker permissions (placeholder account ID)
sklearn = SKLearn(entry_point='train.py',
                  role='arn:aws:iam::123456789012:role/SageMakerRole',
                  instance_type='ml.m5.large',
                  framework_version='1.2-1')

# Launch a managed training job; train.py runs on the provisioned instance
sklearn.fit()

# Deploy the trained model behind a real-time endpoint
predictor = sklearn.deploy(instance_type='ml.m5.large', initial_instance_count=1)
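
Invoking the endpoint is then an ordinary predict call. A short sketch continuing from above (the feature vector is illustrative; the accepted input format depends on the model in train.py):

result = predictor.predict([[5.1, 3.5, 1.4, 0.2]])
print(result)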

6) Google Vertex AI — Unified ML Platform

GCP's platform for AutoML, custom training, pipelines, and monitoring.

from google.cloud import aiplatform

aiplatform.init(project='my-project', location='us-central1')

job = aiplatform.CustomTrainingJob(
    display_name='my-training-job',
    script_path='train.py',
    container_uri='gcr.io/cloud-aiplatform/training/tf-cpu.2-2:latest',
    # A serving image is required for run() to return a registered Model;
    # this prebuilt prediction image URI is illustrative
    model_serving_container_image_uri='gcr.io/cloud-aiplatform/prediction/tf2-cpu.2-2:latest',
    requirements=['pandas']
)

model = job.run(replica_count=1, model_display_name='my-model')
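
With a serving image configured, online serving is one step further. A minimal sketch continuing from the job above (machine type and payload are illustrative):

# Deploy the trained model to a managed endpoint for online predictions
endpoint = model.deploy(machine_type='n1-standard-4')

# The instances payload must match the serving container's expected input
prediction = endpoint.predict(instances=[[0.1, 0.2, 0.3]])
print(prediction.predictions)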

Challenges

  • Create a visual diagram mapping tools to lifecycle stages.
  • Write: “MLflow vs SageMaker — Which to start with and why?”
  • Install MLflow locally and log a dummy metric.
  • Create a DVC pipeline to version a small CSV dataset.