30 Days of MLOps Challenge · Day 6

Training ML Models with Scikit-learn & TensorFlow – Build & Save Your Models Like a Pro

By Aviraj Kawade · June 20, 2025 · 11 min read

Understand ML model training with Scikit‑learn and TensorFlow, and learn to persist models for reliable deployment.

Key Learnings

  • Train models with Scikit‑learn and TensorFlow.
  • Understand how classic ML workflows differ from deep learning workflows.
  • Persist models with joblib, pickle, or the SavedModel format for deployment.
  • Structure training as a modular pipeline: load → preprocess → train → evaluate → save.

In Simple Terms: What is an ML Model?

An ML model learns patterns from historical data. Given new inputs, it combines the input features with its learned weights, according to its algorithm, to produce a prediction.

[Diagram: an ML model in simple terms, mapping input features through learned weights to a prediction]
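
To make this concrete, here is a toy sketch (illustrative only, with made-up weights): a "trained" model is just stored weights applied to new feature values.

# Toy illustration: a model is learned weights applied to input features.
# These weights are invented for demonstration; a real model learns them from data.
weights = {"study_hours": 0.8, "absences": -0.5}
bias = 0.1

def predict(features):
    # Weighted sum of features plus a bias term: the core of a linear model.
    score = bias + sum(weights[name] * value for name, value in features.items())
    return "Pass" if score >= 0.5 else "Fail"

print(predict({"study_hours": 6.5, "absences": 2}))  # prints "Pass"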

Scikit‑learn Overview

Open‑source Python library with consistent APIs for classification, regression, clustering, model selection, preprocessing, and more. Ideal for prototyping and small‑to‑medium projects.

Core Concepts

  • Estimators (fit), Predictors (predict), Transformers (transform), illustrated in the sketch below
  • Pipelines, model selection, cross‑validation, metrics
  • Model persistence using joblib or pickle
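
The consistent API is the main draw: every estimator exposes fit, transformers add transform, and predictors add predict. A minimal sketch using a built-in toy dataset (Iris here is just a convenient example):

from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

scaler = StandardScaler()               # a Transformer: fit + transform
X_scaled = scaler.fit_transform(X)

clf = LogisticRegression(max_iter=200)  # an Estimator/Predictor: fit + predict
clf.fit(X_scaled, y)
print(clf.predict(X_scaled[:5]))        # predictions for the first five rows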

Project: Student Pass/Fail Prediction (Scikit‑learn)

Build a pipeline to predict student outcomes and save the trained model.

# requirements
pip install pandas scikit-learn joblib fastapi uvicorn

# data_generation.py
import pandas as pd
import random

random.seed(42)
schools = ['Greenwood High', 'Sunrise Public', 'Hillview School']
data = []
for _ in range(100):
    school = random.choice(schools)
    study_hours = round(random.uniform(1, 10), 1)
    absences = random.randint(0, 10)
    grade = round(random.uniform(40, 100), 1)
    passed = 1 if grade >= 50 else 0  # in this toy data, the label depends only on grade
    data.append([school, study_hours, absences, grade, passed])

df = pd.DataFrame(data, columns=['school', 'study_hours', 'absences', 'grade', 'passed'])
df.to_csv("students.csv", index=False)

# train_model.py
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import joblib

df = pd.read_csv("students.csv")
X = df.drop("passed", axis=1)
y = df["passed"]

preprocessor = ColumnTransformer([
    # One-hot encode the categorical school column; numeric columns pass through unchanged.
    # handle_unknown="ignore" keeps serving from failing on schools unseen during training.
    ("school_encoder", OneHotEncoder(handle_unknown="ignore"), ["school"])
], remainder="passthrough")

pipeline = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42))
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print(f"Test accuracy: {pipeline.score(X_test, y_test):.2f}")

joblib.dump(pipeline, "student_model.pkl")
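
The Core Concepts list above mentioned cross-validation. Before trusting a single train/test split on only 100 rows, you could score the same pipeline with cross_val_score (a quick sketch that would run in the same script):

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation gives a more stable accuracy estimate than one split.
scores = cross_val_score(pipeline, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.2f} ± {scores.std():.2f}")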

# predict.py
import joblib
import pandas as pd

model = joblib.load("student_model.pkl")

sample = pd.DataFrame([{
    "school": "Greenwood High",
    "study_hours": 6.5,
    "absences": 2,
    "grade": 78.0
}])

pred = model.predict(sample)
print(f"Predicted: {'Pass' if pred[0] == 1 else 'Fail'}")

# serve_model.py
from fastapi import FastAPI
from pydantic import BaseModel
import pandas as pd
import joblib

app = FastAPI()
model = joblib.load("student_model.pkl")

class StudentData(BaseModel):
    school: str
    study_hours: float
    absences: int
    grade: float

@app.post("/predict")
def predict(data: StudentData):
    df = pd.DataFrame([data.model_dump()])  # Pydantic v2; on Pydantic v1, use data.dict()
    prediction = model.predict(df)[0]
    return {"prediction": "Pass" if prediction == 1 else "Fail"}

# run
# uvicorn serve_model:app --reload
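
With the server running, you can exercise the endpoint from a small client script. A sketch using the requests library, which is not in the requirements above and would need its own pip install:

# test_client.py (hypothetical helper script)
import requests

payload = {"school": "Greenwood High", "study_hours": 6.5, "absences": 2, "grade": 78.0}
resp = requests.post("http://127.0.0.1:8000/predict", json=payload)
print(resp.json())  # e.g. {"prediction": "Pass"}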

TensorFlow Overview

Open‑source machine learning framework from Google. It runs on CPUs, GPUs, and TPUs, and ships with the Keras API, TensorBoard for visualization, and TensorFlow Serving for production deployment.

  • Tensors, Keras layers, optimizers, losses, metrics (see the mini example below)
  • tf.function, SavedModel format, TensorFlow Serving
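
A minimal taste of the building blocks, just to show a tensor flowing through a Keras layer:

import tensorflow as tf

x = tf.constant([[1.0, 2.0]])                         # a tensor of shape (1, 2)
layer = tf.keras.layers.Dense(3, activation="relu")   # a Keras layer with learnable weights
y = layer(x)                                          # forward pass
print(y.shape)                                        # (1, 3)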

Project: Time Series Forecasting (TensorFlow)

# install
pip install tensorflow pandas numpy scikit-learn

# data/generate_dataset.py
import numpy as np
import pandas as pd

np.random.seed(42)  # seed so the synthetic series is reproducible

days = np.arange(365)
# Linear warming trend + seasonal sine wave + Gaussian noise.
temperature = 10 + 0.02 * days + np.sin(0.1 * days) + np.random.normal(0, 0.5, size=(365,))
df = pd.DataFrame({'day': days, 'temperature': temperature})
df.to_csv("sample_data.csv", index=False)

# src/train_model.py
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("sample_data.csv")
data = df['temperature'].values.reshape(-1, 1)

scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)

def create_sequences(data, window_size):
    X, y = [], []
    for i in range(len(data) - window_size):
        X.append(data[i:i+window_size])
        y.append(data[i+window_size])
    return np.array(X), np.array(y)

window_size = 10
X, y = create_sequences(data_scaled, window_size)

split = int(0.8 * len(X))
X_train, y_train = X[:split], y[:split]
X_test, y_test = X[split:], y[split:]

model = Sequential([
    LSTM(64, input_shape=(window_size, 1)),
    Dense(1)
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))

predictions = model.predict(X_test)
mse = float(np.mean((predictions - y_test) ** 2))
print(f"Test MSE (scaled units): {mse:.4f}")

# Keras 3 (TF 2.16+) expects a .keras or .h5 extension here; on older TF versions,
# model.save("models/saved_model") wrote the SavedModel directory instead.
model.save("models/saved_model.keras")

# src/serve_model.py
from tensorflow.keras.models import load_model
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("sample_data.csv")
data = df['temperature'].values.reshape(-1, 1)

scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)

model = load_model("models/saved_model.keras")  # must match the path used at save time

window_size = 10
recent_sequence = data_scaled[-window_size:].reshape(1, window_size, 1)
future_scaled = model.predict(recent_sequence)

# Undo the MinMax scaling so the prediction is in degrees rather than 0-1 units.
future_temp = scaler.inverse_transform(future_scaled)
print(f"Predicted next day's temperature: {future_temp[0][0]:.2f}")

Why Save Models?

  • Deployment: load in APIs/apps without retraining.
  • Reproducibility: consistent results over time.
  • Performance: avoid expensive retraining.
  • Portability/versioning across teams and systems.

What to Use

Tool       | Best For               | Format    | Notes
-----------|------------------------|-----------|--------------------------------------------------
joblib     | Scikit‑learn models    | .pkl      | Efficient for NumPy arrays
pickle     | General Python objects | .pkl      | Flexible; less efficient for large NumPy models
SavedModel | TensorFlow/Keras       | directory | Graph + weights + metadata; pairs with TF Serving
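
For the two pickle-based options, persistence is a short round trip. A self-contained sketch with a throwaway model (any fitted estimator would do):

import joblib
import pickle
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=200).fit(X, y)

# joblib: preferred for Scikit-learn models holding large NumPy arrays.
joblib.dump(model, "model.pkl")
model = joblib.load("model.pkl")

# pickle: standard-library serialization for arbitrary Python objects.
with open("model_pickle.pkl", "wb") as f:
    pickle.dump(model, f)
with open("model_pickle.pkl", "rb") as f:
    model = pickle.load(f)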

Modular Training Pipelines

# pipeline.py: orchestrates the steps; each import is a small module you write yourself
# (see the save_load.py sketch after this section's bullet list)
from data_loader import load_data
from preprocessing import preprocess_data
from model import build_model
from train import train_model
from evaluate import evaluate_model
from save_load import save_model

raw_data = load_data("data/data.csv")
X_train, X_test, y_train, y_test = preprocess_data(raw_data)
model = build_model()
trained_model = train_model(model, X_train, y_train)
evaluate_model(trained_model, X_test, y_test)
save_model(trained_model, "models/model.pkl")

  • Maintainable, testable, reusable components.
  • Scales to multiple datasets/models.
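
The imported module names above are placeholders for files you write yourself. As one example, save_load.py could be as small as this (a sketch assuming joblib-serializable models):

# save_load.py (hypothetical module)
import joblib

def save_model(model, path):
    """Persist a trained model to disk."""
    joblib.dump(model, path)

def load_model(path):
    """Load a previously saved model."""
    return joblib.load(path)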

Challenges

  • Perform feature engineering on a CSV dataset with Pandas.
  • Train and save a Scikit‑learn model on the Iris or Titanic dataset.
  • Train a TensorFlow model on MNIST or Fashion‑MNIST and save it.
  • Evaluate and log metrics (accuracy, loss) to JSON/text.
  • Modularize training into data.py, train.py, evaluate.py, save.py.
  • Load saved models and run inference on new samples.
  • Document the workflow in README.md.