30 Days of MLOps Challenge · Day 22

MLOps with ML Platforms (SageMaker & Vertex AI)

By Aviraj Kawade · September 16, 2025 · 8 min read

Learn platforms like SageMaker and Vertex AI because they offer end-to-end managed services for model training, deployment, and monitoring, drastically reducing infrastructure overhead. Mastering these platforms enables faster experimentation, scalable automation, and production-grade ML workflows with built-in security, CI/CD, and governance.

💡 Hey — It's Aviraj Kawade 👋


📚 Key Learnings

  • Understand the MLOps capabilities of SageMaker and Vertex AI
  • Compare features of SageMaker Studio vs Vertex AI Workbench
  • Learn about integrations with Git, Terraform, and CI/CD
  • Get hands-on with training and deploying a simple model

🧠 Learn here

Let's start with Managed ML Platforms!

Managed ML Platform

A Managed ML Platform abstracts away infrastructure provisioning, scalability concerns, and low-level configurations.

It lets data scientists, ML engineers, and developers focus on model development and experimentation while the platform takes care of the rest.

Ideally, a managed ML platform should have:

  • Data Preparation & Labeling tools
  • AutoML capabilities
  • Model Training & Tuning (incl. hyperparameter optimization)
  • Model Deployment (real-time & batch)
  • Model Monitoring (drift detection, latency, accuracy)
  • Versioning & Reproducibility
  • Integrated Security & Compliance

Popular Managed ML Platforms

| Platform | Provider | Highlights |
|---|---|---|
| Amazon SageMaker | AWS | Fully managed, supports Studio IDE, Autopilot, Pipelines, Model Monitor |
| Vertex AI | Google Cloud | Unified platform, strong AutoML, integration with BigQuery & notebooks |
| Azure ML | Microsoft | MLOps support with Azure DevOps, drag-and-drop UI, scalable endpoints |
| Databricks ML | Databricks | ML on top of Spark, great for large-scale data workflows |

Why Use Managed ML Platforms?

  • 🚀 Faster model development lifecycle
  • 💰 Cost-optimized compute (pay-as-you-go)
  • 🔒 Built-in security and compliance
  • 🔄 Scalable from prototype to production
  • 🧑‍🔧 Reduced need for infra & DevOps skills

For now, we will focus on SageMaker and Vertex AI.

Amazon SageMaker

Amazon SageMaker overview diagram

Amazon SageMaker is a fully managed service that provides tools to build, train, and deploy machine learning models quickly and at scale.

Features:

  • Data Preparation: Built-in Jupyter notebooks, SageMaker Data Wrangler, and Feature Store
  • Model Building: Supports popular ML frameworks (TensorFlow, PyTorch, XGBoost), built-in algorithms, and custom containers
  • Training: Distributed training, automatic model tuning (hyperparameter optimization)
  • Deployment: One-click model deployment to auto-scaling endpoints
  • MLOps & Monitoring: Model monitoring, endpoint drift detection, A/B testing, CI/CD integration

Components:

  • SageMaker Studio: Integrated visual interface for building ML workflows
  • SageMaker Processing: For running data pre-processing and post-processing jobs
  • SageMaker Training: Managed training jobs with distributed support
  • SageMaker Inference: Real-time, batch, and asynchronous inference options
  • SageMaker Pipelines: End-to-end ML pipeline orchestration (see the sketch below)
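
SageMaker Pipelines deserves a closer look. Below is a minimal, hedged sketch of a one-step pipeline; it assumes `role` and `estimator` are configured as in the XGBoost example later in this section, the bucket name is a placeholder, and exact API details vary by SDK version.

from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# Wrap a configured estimator in a pipeline training step
step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput("s3://your-bucket/input", content_type="text/csv")},
)

pipeline = Pipeline(name="demo-pipeline", steps=[step_train])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # launch a run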

Getting Started with SageMaker:

  1. Install the AWS CLI and Boto3
pip install awscli boto3
  2. Set up an IAM role with SageMaker permissions
  3. Launch a SageMaker Notebook Instance or SageMaker Studio from the AWS Console

Example: Training a Built-in XGBoost Model

import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

role = get_execution_role()
sess = sagemaker.Session()

# Resolve the built-in XGBoost container image for this region
xgboost_container = sagemaker.image_uris.retrieve("xgboost", sess.boto_region_name, "1.5-1")

estimator = Estimator(
    image_uri=xgboost_container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://your-bucket/output",
    sagemaker_session=sess,
)

# The built-in XGBoost algorithm requires num_round to be set
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# Built-in XGBoost expects a channel named "train" (here: CSV data in S3)
estimator.fit({"train": TrainingInput("s3://your-bucket/input", content_type="text/csv")})

Use Cases:

  • Predictive Analytics
  • Image and Text Classification
  • Time Series Forecasting
  • Anomaly Detection
  • Natural Language Processing (NLP)

Security & Compliance

  • VPC support for secure networking
  • KMS for encryption at rest
  • IAM roles for fine-grained access control
  • Audit trails via AWS CloudTrail

Deployment Options

  • Real-time Endpoints
  • Batch Transform (see the sketch after this list)
  • Asynchronous Inference
  • Edge Deployment via SageMaker Neo
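
As a hedged sketch of Batch Transform (assuming the trained `estimator` from the XGBoost example above and placeholder S3 paths), batch inference runs as an offline job rather than a persistent endpoint:

# Create a transformer from the trained estimator and run offline inference
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://your-bucket/batch-output",  # placeholder bucket
)
transformer.transform(
    "s3://your-bucket/batch-input",  # placeholder input prefix
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()  # block until the job finishes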

🧠 Pro Tips

  • Use SageMaker Studio for an all-in-one visual experience
  • Use Model Monitor to detect drift in production (sketch below)
  • Optimize cost with spot instances and multi-model endpoints
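
For the Model Monitor tip, a hedged sketch: enable data capture at deploy time so request/response traffic lands in S3 for later drift analysis. The bucket name is a placeholder and `estimator` is the trained estimator from the example above.

from sagemaker.model_monitor import DataCaptureConfig

# Capture 100% of requests and responses to S3 for Model Monitor
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://your-bucket/data-capture",  # placeholder bucket
)

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    data_capture_config=capture_config,
)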

Vertex AI

Vertex AI overview diagram

Vertex AI is Google Cloud's managed machine learning platform that helps data scientists and ML engineers build, train, and deploy ML models faster using unified tools and services.

Features:

  • Unified Platform: Manage data, train models, and deploy them from a single interface
  • Custom and AutoML Models: Supports AutoML for beginners and custom training for experts
  • Integrated MLOps: Pipelines, CI/CD, and model monitoring
  • Scalable Infrastructure: Train on CPUs, GPUs, TPUs
  • Prebuilt & Custom Containers: Use optimized Google containers or bring your own

Key Components:

  • Vertex AI Workbench: Managed JupyterLab notebooks with integration to BigQuery, GCS, etc.
  • Vertex AI Pipelines: Orchestrate ML workflows using Kubeflow Pipelines (see the sketch after this list)
  • Vertex AI Training: Custom training with Docker containers or prebuilt frameworks
  • Vertex AI Prediction: Online and batch prediction services
  • Vertex AI Model Registry: Versioned model repository
  • Vertex AI Experiments: Track model training runs and parameters
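
As a minimal, hedged sketch of Vertex AI Pipelines (assuming the kfp v2 SDK is installed and that the project and bucket names are placeholders), a trivial pipeline is compiled locally and submitted as a PipelineJob:

from kfp import dsl, compiler
from google.cloud import aiplatform

@dsl.component
def say_hello() -> str:
    return "hello"

@dsl.pipeline(name="demo-pipeline")
def demo_pipeline():
    say_hello()

# Compile the pipeline to a JSON spec, then run it on Vertex AI
compiler.Compiler().compile(demo_pipeline, "demo_pipeline.json")

aiplatform.init(project="your-project-id", location="us-central1")
aiplatform.PipelineJob(
    display_name="demo-pipeline",
    template_path="demo_pipeline.json",
    pipeline_root="gs://your-bucket/pipeline-root",  # placeholder bucket
).run()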

Getting Started:

  1. Enable Vertex AI API in Google Cloud Console
  2. Create a Cloud Storage bucket for datasets and model artifacts
  3. Install Google Cloud SDK & Libraries
pip install google-cloud-aiplatform

Initialize Vertex AI SDK

from google.cloud import aiplatform

aiplatform.init(project='your-project-id', location='us-central1')

Example: Train a Custom Model

job = aiplatform.CustomContainerTrainingJob(
    display_name="my-training-job",
    container_uri="gcr.io/my-project/my-training-image",
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest"
)

model = job.run(
    model_display_name="my-model",
    replica_count=1,
    machine_type="n1-standard-4",
    args=["--epochs", "5"]
)

Use Cases:

  • Image Classification & Object Detection
  • Natural Language Processing (NLP)
  • Time Series Forecasting
  • Recommendation Systems
  • Tabular Data Models

Security & Compliance:

  • IAM for access control
  • VPC Service Controls
  • CMEK for data encryption
  • Audit logs and monitoring via Cloud Logging

Deployment Options:

  • Online Predictions (Real-time Inference)
  • Batch Predictions (see the sketch after this list)
  • Export to Edge via TensorFlow Lite or Coral
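
A hedged sketch of batch predictions (GCS paths are placeholders; `model` is the Model object returned by the training example above):

# Run an offline batch prediction job; by default this blocks until done
batch_job = model.batch_predict(
    job_display_name="iris-batch",
    gcs_source="gs://your-bucket/batch-input.jsonl",          # placeholder input
    gcs_destination_prefix="gs://your-bucket/batch-output/",  # placeholder output
    machine_type="n1-standard-4",
)
print(batch_job.output_info)  # where the predictions were written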

🧠 Pro Tips

  • Use Workbenches to interactively develop and test code
  • Track experiment runs using Vertex AI Experiments (sketch below)
  • Schedule training using Vertex AI Pipelines with CI/CD triggers
  • Monitor drift and health with Vertex AI Model Monitoring
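
A hedged sketch of the Experiments tip (project, experiment, and run names are placeholders, as are the logged values):

from google.cloud import aiplatform

# Associate this session with a named experiment
aiplatform.init(project="your-project-id", location="us-central1",
                experiment="iris-exp")

aiplatform.start_run("run-1")                     # begin a tracked run
aiplatform.log_params({"epochs": 5, "lr": 0.01})  # hyperparameters
aiplatform.log_metrics({"accuracy": 0.95})        # results
aiplatform.end_run()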

SageMaker vs Vertex AI

| Feature | SageMaker Studio | Vertex AI Workbench |
|---|---|---|
| Platform | AWS | Google Cloud |
| IDE Integration | Fully integrated JupyterLab-based IDE | JupyterLab integration with enhanced GCP tools |
| Notebook Type | Jupyter notebooks, SageMaker notebooks | Jupyter notebooks (managed and user-managed) |
| Compute Options | On-demand, spot, and SageMaker-provided ML instances | Custom VM types, GPU/TPU support |
| Auto-scaling | Yes (via SageMaker endpoints or pipelines) | Yes (via Vertex AI Training and Workbench) |
| Built-in Version Control | Git integration built-in | GitHub integration available |
| ML Frameworks Support | TensorFlow, PyTorch, MXNet, Scikit-learn, etc. | TensorFlow, PyTorch, Scikit-learn, XGBoost, etc. |
| Experiment Tracking | SageMaker Experiments | Vertex AI Experiments |
| Pipeline Support | SageMaker Pipelines | Vertex AI Pipelines |
| Model Registry | SageMaker Model Registry | Vertex AI Model Registry |
| Monitoring and Debugging | SageMaker Debugger, Model Monitor | Vertex AI Model Monitoring |
| MLOps Integration | SageMaker Projects with CI/CD templates | Cloud Build, Vertex Pipelines for MLOps |
| Security and IAM | Integrated with AWS IAM | Integrated with Google IAM |
| Data Access | Access to S3, Athena, Redshift, etc. | Access to BigQuery, Cloud Storage, etc. |
| Pricing | Pay-per-use based on compute and storage | Pay-per-use with VM cost + notebook pricing |
| Notebook Scheduling | Not native (can be done via Lambda/Step Functions) | Built-in scheduled executions |
| Custom Container Support | Yes (bring your own container to Studio) | Yes (via custom containers on Notebooks or Pipelines) |
| Extension Ecosystem | Supports Jupyter extensions, Studio add-ons | Supports JupyterLab extensions |
| Multi-user Support | Yes, with IAM roles and domain setup | Yes, with GCP IAM and shared Workbench environments |

ML Platform Integrations: Git, Terraform, CI/CD with SageMaker & Vertex AI

Version Control with Git

SageMaker

SageMaker Studio Git Integration: Built-in support to clone, commit, and push Git repositories from Studio UI.

Best Practices:
  • Use Git for managing notebooks, training scripts, Dockerfiles, and pipeline definitions
  • Organize repos with /src, /notebooks, /pipelines, /deploy folders

Vertex AI

Workbench Git Integration: Managed JupyterLab with Git extension enabled.

Best Practices:
  • Store Kubeflow pipeline YAMLs and training scripts in Git
  • Track experiment metadata and commit hashes for reproducibility (sketch below)
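
One hedged way to implement the commit-hash practice: log the current commit as a run parameter. This assumes the code runs inside a git checkout and reuses the Experiments setup sketched earlier (placeholder project, experiment, and run names).

import subprocess
from google.cloud import aiplatform

# Read the commit hash of the current checkout
commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

aiplatform.init(project="your-project-id", location="us-central1",
                experiment="iris-exp")
aiplatform.start_run("run-1")
aiplatform.log_params({"git_commit": commit})  # tie the run to exact code
aiplatform.end_run()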

Infrastructure as Code (IaC) with Terraform

SageMaker

Terraform AWS Provider: Supports creating resources like:

  • aws_sagemaker_notebook_instance
  • aws_sagemaker_model
  • aws_sagemaker_endpoint_config
  • aws_sagemaker_endpoint
Example:
resource "aws_sagemaker_model" "example" {
  name               = "example-model"
  execution_role_arn = aws_iam_role.sagemaker_role.arn
  primary_container {
    image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image"
    model_data_url = "s3://bucket/model.tar.gz"
  }
}

Vertex AI

Terraform GCP Provider: Supports:

  • google_vertex_ai_endpoint
  • google_vertex_ai_dataset
  • google_vertex_ai_featurestore
  • google_vertex_ai_tensorboard

(Model upload and pipeline runs are typically driven from the gcloud CLI or the Python SDK rather than Terraform.)
Example:
resource "google_vertex_ai_model" "model" {
  display_name = "vertex-model"
  container_spec {
    image_uri = "gcr.io/project/image"
  }
}

CI/CD Integration

SageMaker

CI/CD Tools: GitHub Actions, CodePipeline, Jenkins

Popular Tools:

  • sagemaker-training-toolkit and the SageMaker Pipelines SDK (part of the sagemaker Python package)
  • Amazon SageMaker Projects for CI/CD automation
Example GitHub Action:
jobs:
  train-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Python
        uses: actions/setup-python@v2
      - run: pip install sagemaker
      - run: python pipeline.py --train

Vertex AI

CI/CD Tools: Cloud Build, GitHub Actions, Tekton

Popular Practices:

  • Trigger pipeline runs using Cloud Build triggers
  • Store and version datasets/models in GCS
Cloud Build YAML Example (a worker pool spec is required; the image and machine type below are placeholders):
steps:
  - name: 'gcr.io/cloud-builders/gcloud'
    args:
      - ai
      - custom-jobs
      - create
      - --display-name=my-job
      - --region=us-central1
      - --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,container-image-uri=gcr.io/my-project/my-training-image

Hands-On: Train and Deploy a Simple Model on SageMaker & Vertex AI

Part 1: Amazon SageMaker

Train and deploy a simple scikit-learn model using SageMaker built-in containers.

Prerequisites

  • AWS account with SageMaker access
  • S3 bucket
  • IAM role with SageMaker permissions
  • Python environment with boto3, sagemaker

Steps

1. Prepare Training Script: train.py
import os
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier()
model.fit(X_train, y_train)

# SageMaker mounts the model output directory at /opt/ml/model (SM_MODEL_DIR)
model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
joblib.dump(model, os.path.join(model_dir, 'model.joblib'))

# Required by the SageMaker scikit-learn serving container at inference time
def model_fn(model_dir):
    return joblib.load(os.path.join(model_dir, 'model.joblib'))
2. Upload Script to S3 (optional: the SKLearn estimator below uploads the entry_point script itself)
aws s3 cp train.py s3://your-bucket/code/train.py
3. Train with SageMaker
import sagemaker
from sagemaker.sklearn.estimator import SKLearn
from sagemaker import get_execution_role

sess = sagemaker.Session()

sklearn_estimator = SKLearn(
    entry_point='train.py',
    role=get_execution_role(),
    instance_type='ml.m5.large',
    framework_version='0.23-1',
    py_version='py3',
    sagemaker_session=sess
)
# No input channels needed: train.py loads the iris dataset itself
sklearn_estimator.fit()
4. Deploy as Endpoint
predictor = sklearn_estimator.deploy(
    instance_type='ml.m5.large', 
    initial_instance_count=1
)
predictor.predict([[5.1, 3.5, 1.4, 0.2]])
5. Clean Up
predictor.delete_endpoint()

Part 2: Google Vertex AI

Train and deploy a simple scikit-learn model using Vertex AI custom training job.

Prerequisites

  • GCP project with Vertex AI API enabled
  • GCS bucket
  • Python environment with google-cloud-aiplatform

Steps

1. Create Training Script: train.py

Same as above, except the model should be saved to the location Vertex AI passes via the AIP_MODEL_DIR environment variable (a GCS URI) instead of /opt/ml/model, for example by uploading model.joblib with the google-cloud-storage client, so the artifact can be served by the prebuilt sklearn container.

2. Build Docker Image

Create Dockerfile:

FROM python:3.9
RUN pip install scikit-learn joblib google-cloud-storage
COPY train.py .
CMD ["python", "train.py"]

Build & push:

gcloud builds submit --tag gcr.io/YOUR_PROJECT_ID/iris-trainer
3. Submit Custom Job
from google.cloud import aiplatform

aiplatform.init(project='YOUR_PROJECT_ID', location='us-central1')

job = aiplatform.CustomContainerTrainingJob(
    display_name='iris-train',
    container_uri='gcr.io/YOUR_PROJECT_ID/iris-trainer',
    model_serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest'
)

model = job.run(
    model_display_name='iris-model', 
    replica_count=1, 
    machine_type='n1-standard-4'
)
4. Deploy Model
endpoint = model.deploy(machine_type='n1-standard-4')
endpoint.predict(instances=[[5.1, 3.5, 1.4, 0.2]])
5. Clean Up
endpoint.undeploy_all()
endpoint.delete()

🔥 Challenges

  • Launch a notebook in SageMaker Studio or Vertex AI Workbench
  • Train a model using a built-in algorithm or scikit-learn
  • Deploy it as a real-time endpoint
  • Track experiment metadata (parameters, metrics)
  • Enable drift monitoring or logging on the deployed endpoint
  • Use CloudWatch (SageMaker) or Cloud Logging (Vertex AI) to view logs
  • Create a simple pipeline with preprocessing, training, and evaluation steps
  • Set up a CI/CD job (GitHub Actions / Cloud Build) to retrain on commit
  • Compare latency and performance between SageMaker and Vertex AI endpoints
  • Use the SageMaker Model Registry or Vertex AI Model Registry to manage versions