Production ML infrastructure
What it actually takes to run a real ML system in production. MLflow registry, FastAPI service layer, React frontend, Docker + Coolify + Cloudflare, JWT + SES email auth, quarterly retrain runbook, drift monitoring, autoresearch off-net runtime.
What Production ML infrastructure proves.
The data problem behind Production ML infrastructure and the decision the numbers made easier. Where I can share the figures I do. Where I can't, I say so.
Here's a summary of this page by my AI Avatar
The deploy
Two Docker apps behind Traefik. Python 3.12-slim runs FastAPI. node:20-alpine → nginx:alpine serves the React SPA. Coolify orchestrates both. Cloudflare in front (Full Strict SSL). MLflow tracking + registry deployed at its own subdomain; FastAPI loads models by alias=best so model promotion is a registry click, not a redeploy.
The retrain runbook
Quarterly. Validate the data first (the most common failure mode is upstream data drift, not model drift). Generate an accuracy report on the last quarter's predictions vs actuals. Refresh segments. Train pooled CV models. Register in MLflow. Build inference artifacts. Generate outlook + snapshot. Compare monthly checkpoints against last quarter. If WMAPE crosses threshold or KS-test flags drift on key features, escalate before promoting.
The autoresearch runtime
Separate from production. An autonomous Claude Code agent ran 369 experiments over 10 days against the same datasets, off-network from the production deploy, with structured experiment logging. Each experiment got a one-line summary, full configuration, and held-out result. Seven named discoveries came out the other end, none of which I'd have tried on my own that quickly.
Two parallel runtimes
Public Coolify deploy plus an internal-network deploy with its own LFS contract, hosts override, and Windows launcher. The point: a real ML system has to live both behind the firewall (where the data lives) and on the public internet (where the dashboard users are).