KnowMBA Advisory
Data Strategy · Advanced · 8 min read

ML Engineering Practice

ML Engineering is the practice of taking trained models and operating them reliably in production: serving infrastructure, feature pipelines, monitoring, retraining, A/B testing, rollback, and cost control. ML engineers are the bridge between data scientists (who research and train models) and production systems (which serve predictions to real users at scale). The job is mostly software engineering: most ML failures in production are not 'wrong model' but 'feature pipeline broke,' 'serving latency exploded,' or 'training data drifted and no one noticed.' KnowMBA POV: companies that hire data scientists without ML engineers end up with a wall of demos that never ship; companies that hire ML engineers without an MLOps platform end up with engineers spending 80% of their time on plumbing instead of impact.

Also known as: MLE, Machine Learning Engineering, Production ML Team

The Trap

The trap is assuming the data science team can 'also do production.' Data scientists optimize for model accuracy in notebooks; ML engineers optimize for system reliability and cost in production. The skills overlap maybe 30%, and the kind of person who excels at one usually doesn't excel at the other. Asking your top researcher to also own a 99.9% uptime model serving system is how you get both a worse model AND a worse system.

What to Do

Establish the practice with five capabilities: (1) Feature pipelines that are versioned and tested. (2) A model registry that tracks lineage from training data to deployed artifact. (3) A serving layer (real-time API, batch scoring, or edge) with SLOs. (4) Monitoring that catches data drift, prediction drift, and performance regressions. (5) An incident process for model rollback. Without all five, you don't have ML engineering; you have data scientists shipping notebooks to a server and hoping.
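Capability (2) is the easiest to under-build. A minimal sketch of what a registry record needs to capture, enough lineage to trace a live prediction back to the exact data, feature code, and artifact that produced it. All names, fields, and paths here are illustrative, not any particular registry's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class ModelRecord:
    """One immutable registry entry per deployed model version.
    Field names are hypothetical; the point is the lineage they carry."""
    model_name: str
    version: str
    training_data_snapshot: str    # e.g. a dataset hash or partition ID
    feature_pipeline_version: str  # the versioned transform code
    artifact_uri: str              # where the serialized model lives
    offline_metrics: dict
    registered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

# Illustrative entry (all values invented for the sketch)
record = ModelRecord(
    model_name="fraud-detector",
    version="2024-06-01.3",
    training_data_snapshot="s3://datalake/fraud/snapshot=2024-05-31",
    feature_pipeline_version="features-v17",
    artifact_uri="s3://models/fraud-detector/2024-06-01.3/model.pkl",
    offline_metrics={"auc": 0.94},
)
```

When an incident forces a rollback, this record is what tells you which previous version is safe to restore and which feature pipeline it expects.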

Formula

Time-to-Production for New Model = Days from Trained Model → Live Predictions Serving Real Users
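The metric is deliberately simple: the elapsed days between two timestamps you should already be recording. A trivial sketch (the timestamp names are assumptions, not any platform's fields):

```python
from datetime import datetime

def time_to_production_days(trained_at: datetime,
                            first_live_prediction_at: datetime) -> float:
    """Days from a trained artifact to live predictions for real users."""
    return (first_live_prediction_at - trained_at).total_seconds() / 86400

# Example: model trained March 1, first live prediction March 15
ttp = time_to_production_days(datetime(2024, 3, 1), datetime(2024, 3, 15))
# → 14.0 days
```

Track this per model over time; a healthy ML engineering practice drives it down from months to days.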

In Practice

Uber's Michelangelo platform, built starting in 2015, formalized ML engineering at scale. The platform standardized feature stores, model training, deployment, and monitoring across hundreds of models powering ETAs, fraud detection, pricing, and matching. By 2018, Michelangelo enabled Uber to operate thousands of models in production with a relatively small central ML platform team, because the platform did the engineering work uniformly instead of every team rebuilding it. The Michelangelo paper became one of the most influential references in industry MLOps.

Pro Tips

  • 01

    Hire your first ML engineer BEFORE your fifth data scientist. The bottleneck in most ML orgs is not 'we need more models'; it's 'we can't ship the models we have.' One MLE who can productionize work unlocks the entire research team.

  • 02

    Feature parity between training and serving is the #1 source of silent ML production failures. The same logical feature must produce the same value at training time and at serving time. Investing in a feature store (or rigorous feature pipeline contracts) prevents the class of bugs where 'the model worked in offline eval but performs terribly in production.'

  • 03

    Monitor PREDICTION distributions in production, not just model performance. You usually can't measure accuracy in real-time (no immediate ground truth), but a sudden shift in the distribution of predictions is an early warning that something broke upstream.
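One common way to watch prediction distributions without ground truth is the Population Stability Index (PSI), which compares a baseline histogram of predictions against a live window. The sketch below is a from-scratch illustration under a simple equal-width binning assumption, not a production implementation:

```python
import math

def psi(expected: list, observed: list, bins: int = 10) -> float:
    """Population Stability Index between a baseline prediction
    distribution (expected) and a live one (observed).
    Common rule of thumb: < 0.1 stable, 0.1-0.25 investigate, > 0.25 alarm."""
    lo = min(min(expected), min(observed))
    hi = max(max(expected), max(observed))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def bucket_fracs(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)
            counts[i] += 1
        # floor each bucket at a tiny value to avoid log(0)
        return [max(c / len(xs), 1e-6) for c in counts]

    e, o = bucket_fracs(expected), bucket_fracs(observed)
    return sum((oi - ei) * math.log(oi / ei) for ei, oi in zip(e, o))

# Toy data: a healthy live window vs. one where predictions collapsed high
baseline = [0.1, 0.12, 0.11, 0.4, 0.42, 0.9, 0.88, 0.5, 0.3, 0.2]
live_ok  = [0.11, 0.13, 0.1, 0.41, 0.43, 0.89, 0.87, 0.52, 0.31, 0.19]
live_bad = [0.9, 0.92, 0.95, 0.91, 0.93, 0.94, 0.96, 0.9, 0.97, 0.98]
```

Scheduled against a rolling window of serving logs, a PSI spike like the `live_bad` case fires long before fraud losses show up in the business metrics.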

Myth vs Reality

Myth

"If we use SageMaker / Vertex AI / Databricks, we don't need ML engineers"

Reality

Managed platforms reduce engineering toil; they don't eliminate ML engineering judgment. Someone still has to design feature pipelines, set monitoring thresholds, design rollback strategies, and own the on-call rotation when predictions go sideways. Platforms are tools, not teams.

Myth

"ML engineers and data scientists are the same role with different titles"

Reality

Different optimization functions, different daily tools, different failure modes. Data scientists win with offline metrics and notebook experiments; ML engineers win with production reliability and cost. The Venn overlap is small enough that most people are good at one or the other, not both.


Knowledge Check

Your data science team trained a fraud detection model that achieves 0.94 AUC offline. Three months after deployment, fraud losses are UP, not down. Most likely cause?

Real-world cases

Companies that lived this.

Verified narratives with the numbers that prove (or break) the concept.

🚗

Uber Michelangelo

2015-Present

Outcome: Success

Uber built Michelangelo as an internal ML platform after recognizing that every ML team was rebuilding feature pipelines, training infrastructure, and serving layers from scratch. The platform standardized feature stores, distributed training, model deployment, monitoring, and rollback. By 2018, Michelangelo was running thousands of models in production powering ETAs, fraud detection, pricing, matching, and many more use cases, with a central platform team that scaled sub-linearly with model count. The Michelangelo architecture paper became one of the most-cited references for industry MLOps.

Production Models: Thousands
Use Cases: ETA, fraud, pricing, matching, more
Central Platform Team Size: Sub-linear vs model count
Industry Influence: Reference architecture for MLOps

ML engineering as a centralized platform practice scales sub-linearly. Without it, every team rebuilds the same plumbing, and you pay for the engineering work N times instead of once.

🎬

Netflix Metaflow

2017-Present (Open Sourced 2019)

Outcome: Success

Netflix built Metaflow to give data scientists production-grade workflow capabilities without forcing them to learn engineering tooling. The framework abstracts compute, scheduling, and versioning behind Pythonic decorators. Open-sourced in 2019, Metaflow became a popular ML workflow framework explicitly designed around the insight that data scientists' productivity is the bottleneck and ML engineers' role is to remove engineering friction from the research path.

Internal Users at Netflix (peak): Hundreds of data scientists
Open Source Adoption: Thousands of GitHub stars
Design Philosophy: Engineering invisible to the data scientist

The best ML engineering platforms make the data scientist's job easier, not harder. If your platform requires data scientists to learn Kubernetes, you've built the wrong abstraction.



Beyond the concept

Turn ML Engineering Practice into a live operating decision.

Use this concept as the framing layer, then move into a diagnostic if it maps directly to a current bottleneck.

Typical response time: 24h · No retainer required
