There's a habit in the ML world of treating models as outputs rather than components. You run experiments, pick the best checkpoint, wrap it in an endpoint, and ship it. If accuracy degrades in production, you re-train. If a dependency changes and results shift, you shrug and re-run. If someone asks you to reproduce a result from six months ago, you hope you still have the right environment file.
This is not how software engineering works. And yet, for some reason, the ML field has largely accepted it as normal.
The argument I want to make is simple: ML systems are software systems. They should be held to the same engineering standards — rigorous testing, reproducibility, careful component selection, and maintainability over time. The fact that there's a learned function in the middle doesn't change this. It raises the stakes.
The cost of ML debt
Software engineering has a well-understood concept of technical debt — shortcuts taken under pressure that compound over time and slow future progress. ML has its own version, and it's worse in several ways.
In regular software, a messy function is at least deterministic: you can read it, test it, and re-run it. In ML, a poorly tracked training run is simply gone; the weights exist, but the process that produced them doesn't. A model trained without proper data versioning can't be reproduced. A pipeline with undocumented preprocessing steps creates a hidden assumption that will eventually be violated and will take days to diagnose. A "works on my machine" model deployment produces inconsistent results under subtle input distribution shifts that no one thought to test for.
These failures are expensive, slow, and — critically — preventable. The practices that prevent them aren't exotic. They're just software engineering applied carefully to the specific properties of ML systems.
Benchmarking: beyond the validation set
The most common ML evaluation anti-pattern is evaluating only on a held-out test set from the same distribution as training data, reporting a single accuracy number, and calling it done. This tells you very little about how the model will actually behave.
Rigorous benchmarking means:
- Slice evaluation. How does the model perform on specific subpopulations — rare classes, edge lighting conditions, unusual aspect ratios, geographic regions? Aggregate accuracy hides failures on minority slices that matter in production.
- Out-of-distribution testing. Deliberately test on data that differs from training: different cameras, different time of day, different hardware, different operators. If you can't characterize the failure boundary, you can't claim the model is production-ready.
- Regression benchmarks. Every model version should be evaluated against a fixed, versioned benchmark suite. "Better on average" is not enough — you need to know what regressed before you ship.
- Latency and resource profiling. Accuracy in isolation is meaningless if the model misses the latency budget. Benchmark throughput, memory footprint, and tail latency under realistic load — not just a single forward pass on a quiet machine.
A concrete example: On an industrial segmentation project, the model's overall accuracy was well above threshold. A per-class breakdown revealed one category — the most operationally critical one — was sitting at 71%. The aggregate metric had completely obscured it. Slice evaluation caught the issue before deployment; the aggregate metric would have shipped it.
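To make that concrete: below is a minimal sketch of a slice-aware release gate, assuming only that labels, predictions, and a slice assignment per example (class, region, camera, whatever matters operationally) are available as parallel sequences. The function names and the 0.90 floor are illustrative placeholders, not a prescribed API.

```python
from collections import defaultdict

def per_slice_accuracy(labels, preds, slices):
    """Accuracy per slice. slices[i] names the slice of example i
    (class, region, camera, ...); all names here are illustrative."""
    correct, total = defaultdict(int), defaultdict(int)
    for y, y_hat, s in zip(labels, preds, slices):
        total[s] += 1
        correct[s] += int(y == y_hat)
    return {s: correct[s] / total[s] for s in total}

def gate_release(labels, preds, slices, floor=0.90):
    """Block the release if ANY slice falls below the floor,
    no matter how good the aggregate looks."""
    by_slice = per_slice_accuracy(labels, preds, slices)
    aggregate = sum(int(y == y_hat) for y, y_hat in zip(labels, preds)) / len(labels)
    failing = {s: acc for s, acc in by_slice.items() if acc < floor}
    print(f"aggregate={aggregate:.3f}  failing slices={failing or 'none'}")
    return not failing
```

The point of the gate is the return value: a release either passes on every slice or it doesn't ship.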
Testing: behavioral contracts for models
You wouldn't ship a feature without tests. Why ship a model without them?
Model testing is different from software testing, but the goal is the same: verify that the component behaves as specified. For ML systems, this means:
- Invariance tests. Inputs that should produce the same output — a rotated image, a translated bounding box, a paraphrased query — should not cause meaningful output variation. Test this explicitly (see the sketch after this list).
- Directional expectation tests. If you add noise to an input, confidence should decrease. If you increase the contrast of a relevant feature, detection should improve. These are testable behavioral claims.
- Minimum functionality tests. A set of hand-curated examples the model must always get right. Obvious failures on easy cases indicate regressions that automated evaluation might miss.
- Pipeline integration tests. The model is not the whole system. Test preprocessing, postprocessing, batching behavior, and error handling end-to-end — not just the model in isolation.
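Here is a minimal sketch of what the first three kinds of tests can look like as ordinary pytest cases. The `my_project.inference` import and the `predict(image) -> (label, confidence)` interface are hypothetical stand-ins for your own inference API, and the fixture paths are placeholders.

```python
import numpy as np
import pytest

# Hypothetical stand-ins for your own inference API and checkpoint path.
from my_project.inference import load_model

@pytest.fixture(scope="session")
def model():
    return load_model("checkpoints/current")

def test_invariance_horizontal_flip(model):
    # A transformation that should not change the prediction.
    image = np.load("tests/fixtures/easy_example.npy")
    label, _ = model.predict(image)
    flipped_label, _ = model.predict(np.fliplr(image))
    assert label == flipped_label

def test_directional_noise_lowers_confidence(model):
    # Adding noise should not make the model MORE confident.
    image = np.load("tests/fixtures/easy_example.npy")
    _, clean_conf = model.predict(image)
    noisy = image + np.random.default_rng(0).normal(0, 0.1, image.shape)
    _, noisy_conf = model.predict(noisy)
    assert noisy_conf <= clean_conf

@pytest.mark.parametrize("path, expected", [
    ("tests/fixtures/mft_case_01.npy", "defect"),  # hand-curated must-pass cases
    ("tests/fixtures/mft_case_02.npy", "ok"),
])
def test_minimum_functionality(model, path, expected):
    label, _ = model.predict(np.load(path))
    assert label == expected
```

These run in CI like any other test suite, which is exactly the point.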
None of this requires a research breakthrough. It requires the same discipline you'd apply to any other component in the stack.
Reproducibility: if you can't reproduce it, you don't own it
A training run you can't reproduce is a liability, not an asset. Someone will eventually ask you to retrain with a bug fix, to ablate a design decision, or to understand why performance degraded after a data pipeline change. If you can't answer those questions, you're flying blind.
Reproducibility in practice requires:
- Data versioning. Every training run should reference a versioned, immutable snapshot of its training data. "The production dataset as of last Tuesday" is not a version.
- Dependency pinning. Environments should be fully specified — down to CUDA version and framework patch level. "It worked in the container" is not a reproducibility story.
- Experiment tracking. Hyperparameters, random seeds, hardware, commit hash, and data version should be logged automatically for every run. Not optionally — automatically (see the sketch after this list).
- Artifact provenance. Every deployed model should have a traceable lineage: which data, which code, which config, which run produced this checkpoint.
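As a sketch of the experiment-tracking point, the minimum viable version is a provenance record written at the start of every run. The field names below are illustrative; in practice a tracking service (MLflow, Weights & Biases) handles this, but the principle is the same: the record is written by code, not from memory.

```python
import json
import subprocess
import sys
import time
from pathlib import Path

def log_run_metadata(out_dir, hparams, data_version, seed):
    """Write a minimal provenance record for a training run.
    Field names are illustrative placeholders."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "git_commit": commit,
        "python": sys.version,
        "seed": seed,
        "data_version": data_version,  # an immutable snapshot id, never "latest"
        "hparams": hparams,
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "run_metadata.json").write_text(json.dumps(record, indent=2))
    return record
```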
This isn't bureaucratic overhead. It's the minimum needed to debug production failures, onboard collaborators, and make confident architectural decisions. The cost of setting it up once is orders of magnitude smaller than the cost of reconstructing it after something breaks.
Component selection: choosing deliberately, not by default
The ML ecosystem moves fast, and there's a persistent temptation to use whatever is newest, most hyped, or most commonly cited in recent papers. This is usually the wrong selection criterion for production systems.
Choosing components carefully means asking:
- Does the architecture match the constraint? A large transformer is not always the right choice for an edge device. A general-purpose object detector is not always better than a lighter, specialized one for a narrow domain. Fit the component to the problem, not the other way around.
- What are the operational properties? Inference latency, memory footprint, quantization behavior, and ONNX/TensorRT compatibility are not afterthoughts — they're selection criteria. Evaluate them before committing to an architecture (a profiling sketch follows this list).
- What's the maintenance burden? Dependencies that are actively maintained, well-tested, and have clear upgrade paths reduce long-term risk. A novel architecture with a single contributor and no tests is a liability regardless of its benchmark numbers.
- What does the failure mode look like? A model that fails gracefully — returning low-confidence outputs, triggering a fallback, or raising an error — is better than one that fails silently. This is a design choice, not an accident.
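To illustrate the operational-properties criterion, here is a small profiling harness you might run against each candidate before committing. `predict_fn` is any callable wrapping a model's forward pass; the warmup count, run count, and the 50 ms budget in the usage comment are placeholders for your real constraints.

```python
import time
import numpy as np

def profile_latency(predict_fn, inputs, warmup=20, runs=200):
    """Median and tail latency for one candidate model.
    predict_fn is any callable wrapping the forward pass."""
    for x in inputs[:warmup]:  # warm caches, JIT compilation, GPU clocks
        predict_fn(x)
    latencies_ms = []
    for x in inputs[:runs]:
        start = time.perf_counter()
        predict_fn(x)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    p50, p99 = np.percentile(latencies_ms, [50, 99])
    return {"p50_ms": float(p50), "p99_ms": float(p99)}

# Select on the tail, not the mean. Example gate (names and the 50 ms
# budget are placeholders for your model, eval inputs, and constraint):
#   stats = profile_latency(model.predict, benchmark_inputs)
#   assert stats["p99_ms"] < 50, f"tail latency over budget: {stats}"
```

A model that meets the budget at the median but blows it at p99 will still fail in production.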
Maintainability: models are long-lived infrastructure
Models in production don't retire the day you ship them. They're there when the data distribution shifts six months later. They're there when a new team member needs to understand what they do. They're there when someone discovers a systematic failure mode on a customer segment you didn't test.
Maintainability is what makes it possible to improve models over time instead of letting debt accumulate around them:
- Model cards. Document what the model does, what it was trained on, what it was tested on, known failure modes, and what it was explicitly not designed to handle. This sounds obvious. Most teams don't do it.
- Clear abstraction boundaries. The preprocessing logic that feeds a model should be as versioned and testable as the model itself. Hidden coupling between pipeline stages is where production failures live.
- Monitoring and alerting. Distribution shift, latency regression, output distribution changes — these should be measured and alerted on, not discovered when a customer reports a problem (see the drift-check sketch after this list).
- Deprecation plans. Models have lifespans. When you deploy a new version, the old one should have a clear sunset timeline. Zombie models that nobody maintains but everyone depends on are a common source of production incidents.
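As one concrete sketch of the monitoring bullet: a scheduled job can compare live inputs against a frozen reference sample, feature by feature, using a two-sample Kolmogorov-Smirnov test. The alpha threshold and the alerting action are placeholders for whatever your monitoring stack provides.

```python
import numpy as np
from scipy.stats import ks_2samp

def check_feature_drift(reference, live, alpha=0.01):
    """Per-feature two-sample KS test of live inputs against a frozen
    reference sample. Returns the indices of drifted features."""
    drifted = []
    for j in range(reference.shape[1]):
        result = ks_2samp(reference[:, j], live[:, j])
        if result.pvalue < alpha:
            drifted.append((j, result.statistic))
    return drifted  # non-empty -> fire an alert, don't wait for a report

# Illustrative run on synthetic data where feature 1 has shifted.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=(5000, 3))
live = rng.normal(0.0, 1.0, size=(5000, 3))
live[:, 1] += 0.5  # simulated distribution shift
print(check_feature_drift(reference, live))  # expect feature 1 flagged
```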
Why this matters
None of these practices are glamorous. They don't make the paper. They don't generate Twitter engagement. They make models that work reliably for the people and systems that depend on them — and that can be understood, debugged, and improved by engineers who weren't in the room when they were built.
This is what we mean by rigorous ML engineering. Not a slower process — a more disciplined one. The ML field is mature enough to hold itself to this standard. The cost of not doing so is paid by the teams and users who depend on the systems we build.
Youssef Zaky is the founder of Rigor AI, an ML engineering consulting firm based in Nova Scotia, Canada. He has 10+ years of experience building production CV, RL, and foundation model systems across robotics, retail AI, gaming, and industrial applications.
Get in touch if you're thinking about how to raise the engineering bar on your ML systems.