There’s a pattern I’ve seen repeat itself enough times now that it feels like a law: a team integrates a language model into a workflow, gets impressive early results, and then spends the next six months quietly negotiating with the failure modes they didn’t anticipate.
This isn’t a criticism of the technology. It’s a structural feature of building on systems whose behavior is probabilistic rather than deterministic. The mental model most engineers carry — inputs, logic, outputs — doesn’t hold cleanly when the “logic” is a 70-billion-parameter network trained on the internet.
The prototype trap
Prototypes flatter AI systems. A demo that works 90% of the time in a controlled setting can feel remarkable. But most production systems are implicitly asking for something much closer to 99.9% — and the distance between one failure in ten and one failure in a thousand is where the real work lives.
The gap isn’t primarily a model quality problem. It’s a problem of surface area. Every prompt path, every edge-case input, every unusual combination of context represents a new sample from a distribution your evals almost certainly didn’t cover.
“Most complexity is accidental — the product of decisions made without a clear picture of their downstream consequences.”
What I’ve seen work
The teams that navigate this well share a few habits. They instrument everything from the start — not to tune the model, but to understand the actual distribution of inputs hitting the system in production. They design for graceful degradation. And they resist the temptation to make the AI responsible for too many things at once.
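To make that concrete, here is a minimal sketch of the instrument-everything-and-degrade-gracefully pattern. The names are hypothetical: `call_model` stands in for whatever client the system actually uses, and `fallback_answer` for a rule-based backup. The thresholds and fields logged are illustrative, not a prescription.

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("llm_gateway")

def answer_with_fallback(query: str, call_model, fallback_answer) -> dict:
    """Wrap a model call so every request is logged and failures degrade gracefully.

    `call_model` and `fallback_answer` are placeholders for the real client
    and the rule-based backup path.
    """
    request_id = str(uuid.uuid4())
    record = {
        "request_id": request_id,
        "input_chars": len(query),        # cheap proxies for the input distribution
        "input_words": len(query.split()),
        "timestamp": time.time(),
    }
    try:
        start = time.monotonic()
        result = call_model(query)
        record.update({
            "latency_s": round(time.monotonic() - start, 3),
            "outcome": "model",
            "output_chars": len(result),
        })
        return {"answer": result, "source": "model", "request_id": request_id}
    except Exception as exc:  # timeouts, rate limits, malformed output, etc.
        record.update({"outcome": "fallback", "error": type(exc).__name__})
        return {"answer": fallback_answer(query), "source": "fallback", "request_id": request_id}
    finally:
        # One structured log line per request, on both paths. This is what lets
        # you see the real input distribution later, not just the cases you imagined.
        logger.info(json.dumps(record))
```

The point of logging in the `finally` block is that the successful path and the degraded path produce the same record, so the production distribution you analyze later is complete rather than survivorship-biased.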
Narrow scope with deep evaluation beats broad scope with shallow evaluation almost every time. A model that reliably does one thing well is more useful than a model that plausibly does five things.
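What “deep evaluation of a narrow scope” can look like is a small, hard-nosed harness over many labelled cases for the one task. This is a sketch under assumptions: `summarize_ticket` and the example cases are hypothetical stand-ins for whatever the system actually does.

```python
# One task, many labelled cases, blunt pass/fail criteria.
EVAL_CASES = [
    {"ticket": "App crashes when I upload a photo larger than 10MB.",
     "must_mention": ["crash", "upload"]},
    {"ticket": "I was charged twice for my March subscription.",
     "must_mention": ["charge", "march"]},
    # ...in practice this list runs to hundreds of cases drawn from production logs
]

def run_eval(summarize_ticket) -> float:
    """Score a single capability against every labelled case and report the pass rate."""
    passed = 0
    for case in EVAL_CASES:
        summary = summarize_ticket(case["ticket"]).lower()
        if all(term in summary for term in case["must_mention"]):
            passed += 1
        else:
            print(f"FAIL: {case['ticket'][:60]!r}")
    rate = passed / len(EVAL_CASES)
    print(f"pass rate: {rate:.1%} ({passed}/{len(EVAL_CASES)})")
    return rate
```

Printing each failing case individually matters as much as the aggregate score: a narrow eval earns its keep by showing you exactly which inputs regressed, not just that a number moved.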
None of this is a reason not to build with these tools. The productivity gains are real. But they accrue to teams that treat AI as a component in a system — with all the engineering discipline that implies — rather than a magic layer that handles whatever you throw at it.