30 Days of MLOps Challenge · Day 22

MLOps with ML Platforms (SageMaker & Vertex AI)

By Aviraj Kawade · September 16, 2025 · 8 min read

Learn platforms like SageMaker and Vertex AI because they offer end-to-end managed services for model training, deployment, and monitoring, drastically reducing infrastructure overhead. Mastering these platforms enables faster experimentation, scalable automation, and production-grade ML workflows with built-in security, CI/CD, and governance.

💡 Hey — It's Aviraj Kawade 👋


📚 Key Learnings

  • Understand the MLOps capabilities of SageMaker and Vertex AI
  • Compare features of SageMaker Studio vs Vertex AI Workbench
  • Learn about integrations with Git, Terraform, and CI/CD
  • Get hands-on with training and deploying a simple model

🧠 Learn here

Let's start with Managed ML Platforms!

Managed ML Platform

A Managed ML Platform abstracts away infrastructure provisioning, scalability concerns, and low-level configurations.

It lets data scientists, ML engineers, and developers focus on model development and experimentation while the platform takes care of the rest.

Ideally, a managed ML platform should have:

  • Data Preparation & Labeling tools
  • AutoML capabilities
  • Model Training & Tuning (incl. hyperparameter optimization)
  • Model Deployment (real-time & batch)
  • Model Monitoring (drift detection, latency, accuracy)
  • Versioning & Reproducibility
  • Integrated Security & Compliance

Popular Managed ML Platforms

| Platform | Provider | Highlights |
|---|---|---|
| Amazon SageMaker | AWS | Fully managed, supports Studio IDE, Autopilot, Pipelines, Model Monitor |
| Vertex AI | Google Cloud | Unified platform, strong AutoML, integration with BigQuery & notebooks |
| Azure ML | Microsoft | MLOps support with Azure DevOps, drag-and-drop UI, scalable endpoints |
| Databricks ML | Databricks | ML on top of Spark, great for large-scale data workflows |

Why Use Managed ML Platforms?

  • 🚀 Faster model development lifecycle
  • 💰 Cost-optimized compute (pay-as-you-go)
  • 🔒 Built-in security and compliance
  • 🔄 Scalable from prototype to production
  • 🧑‍🔧 Reduced need for infra & DevOps skills

For now, we will focus on SageMaker and Vertex AI.

Amazon SageMaker

Amazon SageMaker overview diagram

Amazon SageMaker is a fully managed service that provides tools to build, train, and deploy machine learning models quickly and at scale.

Features:

  • Data Preparation: Built-in Jupyter notebooks, SageMaker Data Wrangler, and Feature Store
  • Model Building: Supports popular ML frameworks (TensorFlow, PyTorch, XGBoost), built-in algorithms, and custom containers
  • Training: Distributed training, automatic model tuning (hyperparameter optimization)
  • Deployment: One-click model deployment to auto-scaling endpoints
  • MLOps & Monitoring: Model monitoring, endpoint drift detection, A/B testing, CI/CD integration

Components:

  • SageMaker Studio: Integrated visual interface for building ML workflows
  • SageMaker Processing: For running data pre-processing and post-processing jobs
  • SageMaker Training: Managed training jobs with distributed support
  • SageMaker Inference: Real-time, batch, and asynchronous inference options
  • SageMaker Pipelines: End-to-end ML pipeline orchestration (see the sketch below)
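
SageMaker Pipelines deserves a closer look. Below is a minimal, hedged sketch of a one-step pipeline; it assumes `role` and `estimator` are configured as in the XGBoost example later in this section, the bucket name is a placeholder, and exact API details vary by SDK version.

from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

# Wrap a configured estimator in a pipeline training step
step_train = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput("s3://your-bucket/input", content_type="text/csv")},
)

pipeline = Pipeline(name="demo-pipeline", steps=[step_train])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
execution = pipeline.start()    # launch a run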

Getting Started with SageMaker:

  1. Install the AWS CLI and Boto3
pip install awscli boto3
  2. Set up an IAM role with SageMaker permissions
  3. Launch a SageMaker Notebook Instance or SageMaker Studio from the AWS Console

Example: Training a Built-in XGBoost Model

import sagemaker
from sagemaker import get_execution_role
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

role = get_execution_role()
sess = sagemaker.Session()

# Resolve the built-in XGBoost container image for this region
xgboost_container = sagemaker.image_uris.retrieve("xgboost", sess.boto_region_name, "1.5-1")

estimator = Estimator(
    image_uri=xgboost_container,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://your-bucket/output",
    sagemaker_session=sess,
)

# The built-in XGBoost algorithm requires num_round to be set
estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# Built-in XGBoost expects a channel named "train" (here: CSV data in S3)
estimator.fit({"train": TrainingInput("s3://your-bucket/input", content_type="text/csv")})

Use Cases:

  • Predictive Analytics
  • Image and Text Classification
  • Time Series Forecasting
  • Anomaly Detection
  • Natural Language Processing (NLP)

Security & Compliance

  • VPC support for secure networking
  • KMS for encryption at rest
  • IAM roles for fine-grained access control
  • Audit trails via AWS CloudTrail

Deployment Options

  • Real-time Endpoints
  • Batch Transform (see the sketch after this list)
  • Asynchronous Inference
  • Edge Deployment via SageMaker Neo
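
As a hedged sketch of Batch Transform (assuming the trained `estimator` from the XGBoost example above and placeholder S3 paths), batch inference runs as an offline job rather than a persistent endpoint:

# Create a transformer from the trained estimator and run offline inference
transformer = estimator.transformer(
    instance_count=1,
    instance_type="ml.m5.large",
    output_path="s3://your-bucket/batch-output",  # placeholder bucket
)
transformer.transform(
    "s3://your-bucket/batch-input",  # placeholder input prefix
    content_type="text/csv",
    split_type="Line",
)
transformer.wait()  # block until the job finishes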

🧠 Pro Tips

  • Use SageMaker Studio for an all-in-one visual experience
  • Use Model Monitor to detect drift in production (sketch below)
  • Optimize cost with spot instances and multi-model endpoints
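
For the Model Monitor tip, a hedged sketch: enable data capture at deploy time so request/response traffic lands in S3 for later drift analysis. The bucket name is a placeholder and `estimator` is the trained estimator from the example above.

from sagemaker.model_monitor import DataCaptureConfig

# Capture 100% of requests and responses to S3 for Model Monitor
capture_config = DataCaptureConfig(
    enable_capture=True,
    sampling_percentage=100,
    destination_s3_uri="s3://your-bucket/data-capture",  # placeholder bucket
)

predictor = estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    data_capture_config=capture_config,
)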

Vertex AI

Vertex AI overview diagram

Vertex AI is Google Cloud's managed machine learning platform that helps data scientists and ML engineers build, train, and deploy ML models faster using unified tools and services.

Features:

  • Unified Platform: Manage data, train models, and deploy them from a single interface
  • Custom and AutoML Models: Supports AutoML for beginners and custom training for experts
  • Integrated MLOps: Pipelines, CI/CD, and model monitoring
  • Scalable Infrastructure: Train on CPUs, GPUs, TPUs
  • Prebuilt & Custom Containers: Use optimized Google containers or bring your own

Key Components:

  • Vertex AI Workbench: Managed JupyterLab notebooks with integration to BigQuery, GCS, etc.
  • Vertex AI Pipelines: Orchestrate ML workflows using Kubeflow Pipelines (see the sketch after this list)
  • Vertex AI Training: Custom training with Docker containers or prebuilt frameworks
  • Vertex AI Prediction: Online and batch prediction services
  • Vertex AI Model Registry: Versioned model repository
  • Vertex AI Experiments: Track model training runs and parameters
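
As a minimal, hedged sketch of Vertex AI Pipelines (assuming the kfp v2 SDK is installed and that the project and bucket names are placeholders), a trivial pipeline is compiled locally and submitted as a PipelineJob:

from kfp import dsl, compiler
from google.cloud import aiplatform

@dsl.component
def say_hello() -> str:
    return "hello"

@dsl.pipeline(name="demo-pipeline")
def demo_pipeline():
    say_hello()

# Compile the pipeline to a JSON spec, then run it on Vertex AI
compiler.Compiler().compile(demo_pipeline, "demo_pipeline.json")

aiplatform.init(project="your-project-id", location="us-central1")
aiplatform.PipelineJob(
    display_name="demo-pipeline",
    template_path="demo_pipeline.json",
    pipeline_root="gs://your-bucket/pipeline-root",  # placeholder bucket
).run()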

Getting Started:

  1. Enable Vertex AI API in Google Cloud Console
  2. Create a Cloud Storage bucket for datasets and model artifacts
  3. Install Google Cloud SDK & Libraries
pip install google-cloud-aiplatform

Initialize Vertex AI SDK

from google.cloud import aiplatform

aiplatform.init(project='your-project-id', location='us-central1')

Example: Train a Custom Model

job = aiplatform.CustomContainerTrainingJob(
    display_name="my-training-job",
    container_uri="gcr.io/my-project/my-training-image",
    model_serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest"
)

model = job.run(
    model_display_name="my-model",
    replica_count=1,
    machine_type="n1-standard-4",
    args=["--epochs", "5"]
)

Use Cases:

  • Image Classification & Object Detection
  • Natural Language Processing (NLP)
  • Time Series Forecasting
  • Recommendation Systems
  • Tabular Data Models

Security & Compliance:

  • IAM for access control
  • VPC Service Controls
  • CMEK for data encryption
  • Audit logs and monitoring via Cloud Logging

Deployment Options:

  • Online Predictions (Real-time Inference)
  • Batch Predictions (see the sketch after this list)
  • Export to Edge via TensorFlow Lite or Coral
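
A hedged sketch of batch predictions (GCS paths are placeholders; `model` is the Model object returned by the training example above):

# Run an offline batch prediction job; by default this blocks until done
batch_job = model.batch_predict(
    job_display_name="iris-batch",
    gcs_source="gs://your-bucket/batch-input.jsonl",          # placeholder input
    gcs_destination_prefix="gs://your-bucket/batch-output/",  # placeholder output
    machine_type="n1-standard-4",
)
print(batch_job.output_info)  # where the predictions were written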

🧠 Pro Tips

  • Use Workbenches to interactively develop and test code
  • Track experiment runs using Vertex AI Experiments (sketch below)
  • Schedule training using Vertex AI Pipelines with CI/CD triggers
  • Monitor drift and health with Vertex AI Model Monitoring
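
A hedged sketch of the Experiments tip (project, experiment, and run names are placeholders, as are the logged values):

from google.cloud import aiplatform

# Associate this session with a named experiment
aiplatform.init(project="your-project-id", location="us-central1",
                experiment="iris-exp")

aiplatform.start_run("run-1")                     # begin a tracked run
aiplatform.log_params({"epochs": 5, "lr": 0.01})  # hyperparameters
aiplatform.log_metrics({"accuracy": 0.95})        # results
aiplatform.end_run()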

SageMaker vs Vertex AI

| Feature | SageMaker Studio | Vertex AI Workbench |
|---|---|---|
| Platform | AWS | Google Cloud |
| IDE Integration | Fully integrated JupyterLab-based IDE | JupyterLab integration with enhanced GCP tools |
| Notebook Type | Jupyter notebooks, SageMaker notebooks | Jupyter notebooks (managed and user-managed) |
| Compute Options | On-demand, spot, and SageMaker-provided ML instances | Custom VM types, GPU/TPU support |
| Auto-scaling | Yes (via SageMaker endpoints or pipelines) | Yes (via Vertex AI Training and Workbench) |
| Built-in Version Control | Git integration built-in | GitHub integration available |
| ML Frameworks Support | TensorFlow, PyTorch, MXNet, Scikit-learn, etc. | TensorFlow, PyTorch, Scikit-learn, XGBoost, etc. |
| Experiment Tracking | SageMaker Experiments | Vertex AI Experiments |
| Pipeline Support | SageMaker Pipelines | Vertex AI Pipelines |
| Model Registry | SageMaker Model Registry | Vertex AI Model Registry |
| Monitoring and Debugging | SageMaker Debugger, Model Monitor | Vertex AI Model Monitoring |
| MLOps Integration | SageMaker Projects with CI/CD templates | Cloud Build, Vertex Pipelines for MLOps |
| Security and IAM | Integrated with AWS IAM | Integrated with Google IAM |
| Data Access | Access to S3, Athena, Redshift, etc. | Access to BigQuery, Cloud Storage, etc. |
| Pricing | Pay-per-use based on compute and storage | Pay-per-use with VM cost + notebook pricing |
| Notebook Scheduling | Not native (can be done via Lambda/Step Functions) | Built-in scheduled executions |
| Custom Container Support | Yes (bring your own container to Studio) | Yes (via custom containers on Notebooks or Pipelines) |
| Extension Ecosystem | Supports Jupyter extensions, Studio add-ons | Supports JupyterLab extensions |
| Multi-user Support | Yes, with IAM roles and domain setup | Yes, with GCP IAM and shared Workbench environments |

ML Platform Integrations: Git, Terraform, CI/CD with SageMaker & Vertex AI

Version Control with Git

SageMaker

SageMaker Studio Git Integration: Built-in support to clone, commit, and push Git repositories from Studio UI.

Best Practices:
  • Use Git for managing notebooks, training scripts, Dockerfiles, and pipeline definitions
  • Organize repos with /src, /notebooks, /pipelines, /deploy folders

Vertex AI

Workbench Git Integration: Managed JupyterLab with Git extension enabled.

Best Practices:
  • Store Kubeflow pipeline YAMLs and training scripts in Git
  • Track experiment metadata and commit hashes for reproducibility (sketch below)
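
One hedged way to implement the commit-hash practice: log the current commit as a run parameter. This assumes the code runs inside a git checkout and reuses the Experiments setup sketched earlier (placeholder project, experiment, and run names).

import subprocess
from google.cloud import aiplatform

# Read the commit hash of the current checkout
commit = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

aiplatform.init(project="your-project-id", location="us-central1",
                experiment="iris-exp")
aiplatform.start_run("run-1")
aiplatform.log_params({"git_commit": commit})  # tie the run to exact code
aiplatform.end_run()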

Infrastructure as Code (IaC) with Terraform

SageMaker

Terraform AWS Provider: Supports creating resources like:

  • aws_sagemaker_notebook_instance
  • aws_sagemaker_model
  • aws_sagemaker_endpoint_config
  • aws_sagemaker_endpoint
Example:
resource "aws_sagemaker_model" "example" {
  name               = "example-model"
  execution_role_arn = aws_iam_role.sagemaker_role.arn
  primary_container {
    image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image"
    model_data_url = "s3://bucket/model.tar.gz"
  }
}

Vertex AI

Terraform GCP Provider: Supports:

  • google_vertex_ai_endpoint
  • google_vertex_ai_dataset
  • google_vertex_ai_featurestore
  • google_vertex_ai_tensorboard

(Model upload and pipeline runs are typically driven from the gcloud CLI or the Python SDK rather than Terraform.)
Example:
resource "google_vertex_ai_model" "model" {
  display_name = "vertex-model"
  container_spec {
    image_uri = "gcr.io/project/image"
  }
}

CI/CD Integration

SageMaker

CI/CD Tools: GitHub Actions, CodePipeline, Jenkins

Popular Tools:

  • sagemaker-training-toolkit and the SageMaker Pipelines SDK (part of the sagemaker Python package)
  • Amazon SageMaker Projects for CI/CD automation
Example GitHub Action:
jobs:
  train-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Setup Python
        uses: actions/setup-python@v2
      - run: pip install sagemaker
      - run: python pipeline.py --train

Vertex AI

CI/CD Tools: Cloud Build, GitHub Actions, Tekton

Popular Practices:

  • Trigger pipeline runs using Cloud Build triggers
  • Store and version datasets/models in GCS
Cloud Build YAML Example (a worker pool spec is required; the image and machine type below are placeholders):
steps:
  - name: 'gcr.io/cloud-builders/gcloud'
    args:
      - ai
      - custom-jobs
      - create
      - --display-name=my-job
      - --region=us-central1
      - --worker-pool-spec=machine-type=n1-standard-4,replica-count=1,container-image-uri=gcr.io/my-project/my-training-image

Hands-On: Train and Deploy a Simple Model on SageMaker & Vertex AI

Part 1: Amazon SageMaker

Train and deploy a simple scikit-learn model using SageMaker built-in containers.

Prerequisites

  • AWS account with SageMaker access
  • S3 bucket
  • IAM role with SageMaker permissions
  • Python environment with boto3, sagemaker

Steps

1. Prepare Training Script: train.py
import os
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestClassifier()
model.fit(X_train, y_train)

# SageMaker mounts the model output directory at /opt/ml/model (SM_MODEL_DIR)
model_dir = os.environ.get('SM_MODEL_DIR', '/opt/ml/model')
joblib.dump(model, os.path.join(model_dir, 'model.joblib'))

# Required by the SageMaker scikit-learn serving container at inference time
def model_fn(model_dir):
    return joblib.load(os.path.join(model_dir, 'model.joblib'))
2. Upload Script to S3 (optional: the SKLearn estimator below uploads the entry_point script itself)
aws s3 cp train.py s3://your-bucket/code/train.py
3. Train with SageMaker
import sagemaker
from sagemaker.sklearn.estimator import SKLearn
from sagemaker import get_execution_role

sess = sagemaker.Session()

sklearn_estimator = SKLearn(
    entry_point='train.py',
    role=get_execution_role(),
    instance_type='ml.m5.large',
    framework_version='0.23-1',
    py_version='py3',
    sagemaker_session=sess
)
# No input channels needed: train.py loads the iris dataset itself
sklearn_estimator.fit()
4. Deploy as Endpoint
predictor = sklearn_estimator.deploy(
    instance_type='ml.m5.large', 
    initial_instance_count=1
)
predictor.predict([[5.1, 3.5, 1.4, 0.2]])
5. Clean Up
predictor.delete_endpoint()

Part 2: Google Vertex AI

Train and deploy a simple scikit-learn model using Vertex AI custom training job.

Prerequisites

  • GCP project with Vertex AI API enabled
  • GCS bucket
  • Python environment with google-cloud-aiplatform

Steps

1. Create Training Script: train.py

Same as above, except the model should be saved to the location Vertex AI passes via the AIP_MODEL_DIR environment variable (a GCS URI) instead of /opt/ml/model, for example by uploading model.joblib with the google-cloud-storage client, so the artifact can be served by the prebuilt sklearn container.

2. Build Docker Image

Create Dockerfile:

FROM python:3.9
RUN pip install scikit-learn joblib google-cloud-storage
COPY train.py .
CMD ["python", "train.py"]

Build & push:

gcloud builds submit --tag gcr.io/YOUR_PROJECT_ID/iris-trainer
3. Submit Custom Job
from google.cloud import aiplatform

aiplatform.init(project='YOUR_PROJECT_ID', location='us-central1')

job = aiplatform.CustomContainerTrainingJob(
    display_name='iris-train',
    container_uri='gcr.io/YOUR_PROJECT_ID/iris-trainer',
    model_serving_container_image_uri='us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.0-24:latest'
)

model = job.run(
    model_display_name='iris-model', 
    replica_count=1, 
    machine_type='n1-standard-4'
)
4. Deploy Model
endpoint = model.deploy(machine_type='n1-standard-4')
endpoint.predict(instances=[[5.1, 3.5, 1.4, 0.2]])
5. Clean Up
endpoint.undeploy_all()
endpoint.delete()

🔥 Challenges

  • Launch a notebook in SageMaker Studio or Vertex AI Workbench
  • Train a model using a built-in algorithm or scikit-learn
  • Deploy it as a real-time endpoint
  • Track experiment metadata (parameters, metrics)
  • Enable drift monitoring or logging on the deployed endpoint
  • Use CloudWatch (SageMaker) or Cloud Logging (Vertex AI) to view logs
  • Create a simple pipeline with preprocessing, training, and evaluation steps
  • Set up a CI/CD job (GitHub Actions / Cloud Build) to retrain on commit
  • Compare latency and performance between SageMaker and Vertex AI endpoints
  • Use the SageMaker Model Registry or Vertex AI Model Registry to manage versions