Engineering

ML is an Engineering Discipline.
Treat It Like One.

Youssef Zaky · April 2025 · 7 min read

There's a habit in the ML world of treating models as outputs rather than components. You run experiments, pick the best checkpoint, wrap it in an endpoint, and ship it. If accuracy degrades in production, you re-train. If a dependency changes and results shift, you shrug and re-run. If someone asks you to reproduce a result from six months ago, you hope you still have the right environment file.

This is not how software engineering works. And yet, for some reason, the ML field has largely accepted it as normal.

The argument I want to make is simple: ML systems are software systems. They should be held to the same engineering standards — rigorous testing, reproducibility, careful component selection, and maintainability over time. The fact that there's a learned function in the middle doesn't change this. It raises the stakes.

The cost of ML debt

Software engineering has a well-understood concept of technical debt — shortcuts taken under pressure that compound over time and slow future progress. ML has its own version, and it's worse in several ways.

In regular software, a messy function is still deterministic. In ML, a poorly tracked training run is gone. A model trained without proper data versioning can't be reproduced. A pipeline with undocumented preprocessing steps creates a hidden assumption that will eventually be violated and will take days to diagnose. A "works on my machine" model deployment produces inconsistent results under subtle input distribution shifts that no one thought to test for.

These failures are expensive, slow, and — critically — preventable. The practices that prevent them aren't exotic. They're just software engineering applied carefully to the specific properties of ML systems.

Benchmarking: beyond the validation set

The most common ML evaluation anti-pattern is evaluating only on a held-out test set from the same distribution as training data, reporting a single accuracy number, and calling it done. This tells you very little about how the model will actually behave.

Rigorous benchmarking means:

- Reporting per-class and per-slice metrics, not just a single aggregate number.
- Evaluating on data that reflects production conditions, including distribution shift, not only a held-out split from the training distribution.
- Stress-testing the inputs and segments that matter most operationally, because that's where failures are most expensive.

A concrete example: On an industrial segmentation project, the model's overall accuracy was well above threshold. A per-class breakdown revealed one category — the most operationally critical one — was sitting at 71%. The aggregate metric had completely obscured it. Slice evaluation caught the issue before deployment; the aggregate metric would have shipped it.
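A per-class breakdown like the one that caught this issue is a few lines of code. The sketch below uses made-up class names and predictions purely for illustration; it is not the project's actual data.

```python
# Sketch of a per-class accuracy breakdown (labels and predictions here are
# hypothetical, chosen only to show how an aggregate can hide a weak class).
from collections import defaultdict

def per_class_accuracy(y_true, y_pred):
    """Return overall accuracy plus an accuracy-per-class breakdown."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    overall = sum(correct.values()) / len(y_true)
    by_class = {c: correct[c] / total[c] for c in total}
    return overall, by_class

y_true = ["weld", "weld", "weld", "crack", "crack", "surface", "surface", "surface"]
y_pred = ["weld", "weld", "weld", "crack", "surface", "surface", "surface", "surface"]
overall, by_class = per_class_accuracy(y_true, y_pred)
# A healthy overall number can coexist with a weak class:
# inspect by_class, not just overall, before calling a model ready.
```

The point is not the helper function; it's that slice evaluation should be a standing part of the evaluation report, not an ad-hoc investigation after something goes wrong.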

Testing: behavioral contracts for models

You wouldn't ship a feature without tests. Why ship a model without them?

Model testing is different from software testing, but the goal is the same: verify that the component behaves as specified. For ML systems, this means:

- Invariance checks: small, irrelevant perturbations to an input should not flip the prediction.
- Minimum-performance thresholds on the slices that matter, enforced before any deployment.
- Regression tests built from previously discovered failure cases, so a fixed bug stays fixed.
- Input validation at the model boundary, so malformed or out-of-range inputs fail loudly instead of silently producing garbage.

None of this requires a research breakthrough. It requires the same discipline you'd apply to any other component in the stack.

Reproducibility: if you can't reproduce it, you don't own it

A training run you can't reproduce is a liability, not an asset. Someone will eventually ask you to retrain with a bug fix, to ablate a design decision, or to understand why performance degraded after a data pipeline change. If you can't answer those questions, you're flying blind.

Reproducibility in practice requires:

- Versioning the training data alongside the code, not just the code.
- Recording the full configuration of every run: hyperparameters, random seeds, and preprocessing steps.
- Pinning the environment, so dependency versions don't silently drift between runs.
- Tracking which data, code, and configuration produced each model artifact.
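Much of this reduces to writing a manifest next to every run. The sketch below assumes you can hash the exact training data file and serialize the config; field names and paths are illustrative, and a real setup would pair this with a dependency lockfile and an experiment tracker.

```python
# Sketch of a minimal run manifest (illustrative fields, not a standard format).
import hashlib
import json
import platform
import random

def file_sha256(path):
    """Hash a file in chunks so large datasets don't need to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def make_manifest(config, data_path, seed):
    random.seed(seed)  # seed every source of randomness your stack actually uses
    return {
        "config": config,                       # full hyperparameters, not a summary
        "data_sha256": file_sha256(data_path),  # pins the exact training data
        "seed": seed,
        "python": platform.python_version(),    # pair with a lockfile for deps
    }

# Hypothetical usage:
# manifest = make_manifest({"lr": 3e-4, "epochs": 10}, "train.csv", seed=42)
# json.dump(manifest, open("run_manifest.json", "w"), indent=2)
```

Twenty lines of bookkeeping like this is the difference between "we can rerun that experiment" and "we hope we still have the right environment file."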

This isn't bureaucratic overhead. It's the minimum needed to debug production failures, onboard collaborators, and make confident architectural decisions. The cost of setting it up once is orders of magnitude smaller than the cost of reconstructing it after something breaks.

Component selection: choosing deliberately, not by default

The ML ecosystem moves fast, and there's a persistent temptation to use whatever is newest, most hyped, or most commonly cited in recent papers. This is usually the wrong selection criterion for production systems.

Choosing components carefully means asking:

- Does it solve the actual problem better than a simpler, better-understood baseline?
- Is it actively maintained, and can the team debug it when it misbehaves?
- Does it meet the system's real constraints: latency, memory, licensing, operational cost?
- Will it still be a defensible choice in two years, or is it this month's hype?

Maintainability: models are long-lived infrastructure

Models in production don't retire the day you ship them. They're there when the data distribution shifts six months later. They're there when a new team member needs to understand what they do. They're there when someone discovers a systematic failure mode on a customer segment you didn't test.

Maintainability is what makes it possible to improve models over time rather than merely accumulate them. In practice that means:

- Documenting what the model does, what it assumes, and where it is known to fail.
- Monitoring production inputs and outputs so distribution shift is detected, not discovered.
- Keeping the retraining path automated and regularly exercised, so updating the model is routine rather than heroic.
- Writing down ownership: who is responsible for this model when it breaks?
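Even the monitoring piece can start very simply. The sketch below is a deliberately crude drift heuristic comparing live input statistics to a training baseline; production systems typically use proper statistical tests (e.g., Kolmogorov–Smirnov), which this does not implement.

```python
# Sketch of a crude drift check: flag when the live input mean is far from
# the training baseline mean, measured in baseline standard deviations.
# A heuristic only; not a substitute for a real statistical test.
import statistics

def drift_alert(baseline, live, z_threshold=3.0):
    """Return True if live data looks shifted relative to the baseline."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    if sigma == 0:
        return statistics.mean(live) != mu
    z = abs(statistics.mean(live) - mu) / sigma
    return z > z_threshold

# Illustrative numbers: a stable feature, then an obviously shifted one.
baseline = [10.0, 10.5, 9.8, 10.2, 10.1, 9.9]
assert drift_alert(baseline, [10.0, 10.1, 9.9]) is False   # in-distribution
assert drift_alert(baseline, [25.0, 26.0, 24.5]) is True   # shifted inputs
```

The value is not in the sophistication of the check but in the fact that it runs continuously: shift gets detected by a monitor, not discovered by a customer.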

Why this matters

None of these practices are glamorous. They don't make the paper. They don't generate Twitter engagement. They make models that work reliably for the people and systems that depend on them — and that can be understood, debugged, and improved by engineers who weren't in the room when they were built.

This is what we mean by rigorous ML engineering. Not a slower process — a more disciplined one. The ML field is mature enough to hold itself to this standard. The cost of not doing so is paid by the teams and users who depend on the systems we build.

Youssef Zaky is the founder of Rigor AI, an ML engineering consulting firm based in Nova Scotia, Canada. He has 10+ years of experience building production CV, RL, and foundation model systems across robotics, retail AI, gaming, and industrial applications.

Get in touch if you're thinking about how to raise the engineering bar on your ML systems.