Machine learning
Anonymized Case Study · Deployment and ops

Production ML infrastructure

What it actually takes to run a real ML system in production. MLflow registry, FastAPI service layer, React frontend, Docker + Coolify + Cloudflare, JWT + SES email auth, quarterly retrain runbook, drift monitoring, autoresearch off-net runtime.

Most data-science portfolios skip the part where the model has to keep running on a Wednesday morning when the data pipeline broke at 2am. This is that part.
Program card
type: Deployment and ops
status: Anonymized Case Study
page: /ml-infrastructure

What Production ML infrastructure proves.

This page covers the data problem behind the production ML infrastructure and the decisions the numbers made easier. Where I can share figures, I do. Where I can't, I say so.

1. The deploy

Two Docker apps behind Traefik. The API image is python:3.12-slim running FastAPI. The frontend is a multi-stage build: node:20-alpine builds the React SPA and nginx:alpine serves it. Coolify orchestrates both. Cloudflare sits in front with SSL mode Full (strict). MLflow tracking and the model registry are deployed at their own subdomain; FastAPI loads models by alias=best, so model promotion is a registry click, not a redeploy.
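A minimal sketch of what loading by alias can look like on the service side. The `models:/<name>@<alias>` URI form and `mlflow.pyfunc.load_model` are standard MLflow; the model name `demand_forecast`, the `AliasModelCache` class, and the explicit `refresh` step are assumptions made for illustration, not the case study's actual code.

```python
from typing import Any, Callable, Dict


def model_uri(name: str, alias: str = "best") -> str:
    """Build a registry URI that points at an alias rather than a fixed version."""
    return f"models:/{name}@{alias}"


def _mlflow_loader(uri: str) -> Any:
    # Imported lazily so this module can be imported without MLflow installed.
    # MLflow reads the tracking server location from MLFLOW_TRACKING_URI.
    import mlflow.pyfunc

    return mlflow.pyfunc.load_model(uri)


class AliasModelCache:
    """Hypothetical helper: caches one loaded model per alias URI."""

    def __init__(self, loader: Callable[[str], Any] = _mlflow_loader) -> None:
        self._loader = loader
        self._models: Dict[str, Any] = {}

    def get(self, name: str, alias: str = "best") -> Any:
        """Return the cached model, loading it on first use."""
        uri = model_uri(name, alias)
        if uri not in self._models:
            self._models[uri] = self._loader(uri)
        return self._models[uri]

    def refresh(self, name: str, alias: str = "best") -> Any:
        """Re-resolve the alias, picking up whichever version it now points at."""
        uri = model_uri(name, alias)
        self._models[uri] = self._loader(uri)
        return self._models[uri]
```

Because the URI names the alias and not a version number, moving `best` to a new version in the registry changes what the next `refresh` loads without rebuilding the image. How and when the running service re-resolves the alias (restart, timer, admin endpoint) is a design choice the page does not specify.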

2. The retrain runbook

Quarterly. Validate the data first: the most common failure mode is upstream data drift, not model drift. Generate an accuracy report on the last quarter's predictions vs actuals. Refresh segments. Train pooled CV models. Register in MLflow. Build inference artifacts. Generate outlook + snapshot. Compare monthly checkpoints against last quarter. If WMAPE crosses its threshold or a KS test flags drift on key features, escalate before promoting.
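The escalation gate at the end of the runbook can be sketched in plain Python. WMAPE is computed as the sum of absolute errors divided by the sum of absolute actuals, and drift is checked with a two-sample Kolmogorov-Smirnov statistic against the large-sample critical value at the 5% level. The 0.25 WMAPE threshold, the function names, and the choice of significance level are assumptions for illustration; the page does not state the real values.

```python
import math
from bisect import bisect_right
from typing import Dict, List, Sequence


def wmape(actuals: Sequence[float], predictions: Sequence[float]) -> float:
    """Weighted MAPE: sum(|actual - predicted|) / sum(|actual|)."""
    if len(actuals) != len(predictions):
        raise ValueError("actuals and predictions must have the same length")
    denom = sum(abs(a) for a in actuals)
    if denom == 0:
        raise ValueError("WMAPE is undefined when all actuals are zero")
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / denom


def ks_statistic(reference: Sequence[float], current: Sequence[float]) -> float:
    """Two-sample KS statistic: the largest gap between the empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in ref + cur:  # the gap can only change at observed values
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def ks_flags_drift(reference: Sequence[float], current: Sequence[float],
                   c_alpha: float = 1.358) -> bool:
    """Flag drift when D exceeds c(alpha) * sqrt((n + m) / (n * m)).

    c_alpha = 1.358 is the asymptotic coefficient for alpha = 0.05.
    """
    n, m = len(reference), len(current)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, current) > critical


def escalation_reasons(actuals: Sequence[float], predictions: Sequence[float],
                       reference_features: Dict[str, Sequence[float]],
                       current_features: Dict[str, Sequence[float]],
                       wmape_threshold: float = 0.25) -> List[str]:
    """Return the reasons to escalate; an empty list means promotion can proceed."""
    reasons = []
    score = wmape(actuals, predictions)
    if score > wmape_threshold:
        reasons.append(f"WMAPE {score:.3f} above threshold {wmape_threshold}")
    for name, ref in reference_features.items():
        if ks_flags_drift(ref, current_features[name]):
            reasons.append(f"KS test flags drift on feature '{name}'")
    return reasons
```

Returning a list of reasons instead of a boolean keeps the escalation message specific: the person on call sees which metric or feature tripped the gate. In practice `scipy.stats.ks_2samp` would usually replace the hand-rolled statistic and provide a p-value.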

3. The autoresearch runtime

Separate from production. An autonomous Claude Code agent ran 369 experiments over 10 days against the same datasets, off-network from the production deploy, with structured experiment logging. Each experiment got a one-line summary, full configuration, and held-out result. Seven named discoveries came out the other end, none of which I'd have tried on my own that quickly.
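One way to structure such a log is an append-only JSON Lines file with one record per experiment. The sketch below is an assumption about shape only: the field names, the `ExperimentRecord` class, and the JSONL format are hypothetical and are not taken from the actual runtime.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Dict, List


def _utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ExperimentRecord:
    """One experiment: one-line summary, full configuration, held-out result."""

    experiment_id: str
    summary: str
    config: Dict[str, Any]
    holdout_metric: str
    holdout_value: float
    logged_at: str = field(default_factory=_utc_now)

    def __post_init__(self) -> None:
        # Enforce the one-line rule so the log stays skimmable.
        if "\n" in self.summary:
            raise ValueError("summary must be a single line")


def append_record(path: Path, record: ExperimentRecord) -> None:
    """Append one record as a single JSON line; earlier lines are never rewritten."""
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def read_records(path: Path) -> List[Dict[str, Any]]:
    """Load every logged experiment back as a list of dictionaries."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

An append-only file suits an off-network agent: it needs no database, survives interrupted runs line by line, and can be filtered later by metric or configuration key.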

4. Two parallel runtimes

Public Coolify deploy plus an internal-network deploy with its own LFS contract, hosts override, and Windows launcher. The point: a real ML system has to live both behind the firewall (where the data lives) and on the public internet (where the dashboard users are).