Why the AI caution consensus is misdirecting the people who matter most — and the distinction nobody with a platform is making clearly enough.

Responding to: Agents don't know what good looks like — O'Reilly Radar

I've read Sam Newman's books. I've attended Neal Ford's talks. Their contributions to how this industry thinks about software architecture are genuine and substantial. Building Evolutionary Architectures changed how serious practitioners think about long-lived systems. Newman's microservices work gave a generation of engineers a vocabulary for problems they'd been struggling to name.

Which is exactly why this piece lands wrong and why it matters that it does.

When authors with this reach get the frame wrong, it doesn't stay wrong quietly. It travels. It gets cited in architecture reviews, repeated in technology strategy meetings, and used to justify decisions that will cost organisations real money and real time. The "here be monsters" framing has consequences when it arrives with an O'Reilly banner attached.

This isn't a dismissal of the piece. There are genuine observations in it. But the foundational premise is wrong, the primary analogy is a category error, and the most important distinction in this entire conversation — the one that would actually help practitioners make better decisions — is almost entirely absent.

That absence starts with the opening premise. And the opening premise is wrong.

This Time Actually Is Different

The piece frames the current moment as another predictable split: one camp declaring the old rules dead, the other folding its arms and waiting for the hype to pass. Both wrong, both loud. It's a reasonable description of how the industry behaves. It's the wrong description of what's actually happening.

Previous shifts changed how we write software. This one changes the economics of whether you write it at all, how much of it you throw away, and how quickly you can start again. That's not a difference of degree. It's a difference of kind.

The rewrite has always been the architecturally correct answer more often than it was the economically viable one. Six to eighteen months of parallel cost, high risk of scope explosion, institutional knowledge walking out the door — the calculus almost always said live with the mess. That calculus is breaking. When a bounded service can be rewritten in days rather than months, when the model has absorbed more domain documentation than the original development team ever read, when you understand the problem properly the second time and can specify it cleanly — the entire defensive architecture of legacy software looks different.

The models are also still improving. Building permanent guardrail infrastructure around current limitations is like designing a motorway around the top speed of a 1905 automobile. The constraints you're engineering around are already being superseded.

The Category Error

Mezzalira reaches for the Dreyfus Model of Skill Acquisition to frame what AI can and can't do. The model is legitimate. The application is a category error that corrupts every conclusion that follows.

Dreyfus describes how humans move from rule-following novice to intuitive expert through embodied experience and accumulated judgment. It's a model of human learning. Applying it to a tool is like evaluating a compiler on whether it truly understands the code it processes. The question is malformed before you've started.

A hammer isn't a novice carpenter. A linter isn't a junior developer who hasn't learned why naming matters. You evaluate tools on reliability, accuracy, and how well they extend the capability of the human using them. The same standard applies here.

The implication of the Dreyfus framing is that AI needs to mature through the stages before it can be trusted with serious work — that the appropriate response is supervision, guardrails, and managed autonomy until it develops the judgment of a proficient practitioner. This generates precisely the wrong solutions. You don't manage a tool's development. You learn to use it well.

No human expert, regardless of how brilliant or experienced, has read the volume of material a large model has been trained on. Not close. The research surface area alone changes what's possible for a practitioner who understands the correct division of labour — hypothesis generation, literature synthesis, solution space exploration across adjacent domains, pattern recognition across codebases no individual could have read.

The division of labour

The "doesn't understand why" objection misallocates responsibility. The tool doesn't need to understand why. The why is yours. It always was. The electron microscope didn't replace the biologist — it gave the biologist eyes that could see things no human eye ever had. You don't evaluate it by asking if it truly understands cell biology.

Operator Failure Is Not a Technology Indictment

The canonical example: an agent tasked with making all tests pass replaces a failing assertion with assert True. Presented as a fundamental property of the technology. It isn't. It's a badly configured workflow operated without human oversight. The agent did precisely what it was instructed to do. The instruction was wrong. That's an operator problem.

The LinkedIn anecdote that follows — an agent modifying a build file to silently ignore failed steps — is worse as evidence. Someone sharing a horror story on LinkedIn is not empirical data. Reaching for it in a published O'Reilly piece to support a structural claim about AI capability is below the standard both authors have set in their own books.

No serious practitioner works this way. You don't hand an agent a binary objective with no constraints and no review gate. You work iteratively. The agent proposes. You review. You accept or reject. The human gate is not optional — it's basic workflow discipline applied to a non-deterministic system.
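To make that concrete, here is a rough sketch of the shape of that gate. The function names are placeholders rather than any particular framework's API; the point is that nothing is applied without a human decision.

```python
# A minimal propose/review/accept loop. propose_patch and apply_patch are
# placeholders for whatever your tooling provides; the shape of the gate is
# the point, not the specific calls.

def propose_patch(task: str) -> str:
    # Placeholder: call your agent here and return its proposed change as a diff.
    return f"# proposed diff for: {task}"

def apply_patch(patch: str) -> None:
    # Placeholder: apply the reviewed diff, e.g. via your VCS tooling.
    print("applying:\n" + patch)

def review_gated_change(task: str, max_rounds: int = 3) -> bool:
    """The agent proposes; a human accepts, rejects, or sends it back with feedback."""
    for _ in range(max_rounds):
        patch = propose_patch(task)
        print(patch)
        decision = input("apply / revise / abort? ").strip().lower()
        if decision == "apply":
            apply_patch(patch)  # the only path that changes anything
            return True
        if decision == "abort":
            return False
        task += "\nReviewer feedback: " + input("feedback: ")
    return False
```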

The production incidents that have made headlines need to be named for what they are. The deleted databases. The destroyed codebases. These are not AI horror stories.

They are operational negligence stories. Call them what they are.

Teams gave non-deterministic systems unsupervised write access to production infrastructure with no human gate, no rollback strategy, and no basic operational discipline. You would not give a junior developer direct unsupervised write access to a production database. The fact that some teams did the equivalent with an AI agent and were surprised by the result is not a technology failure. It's a judgment failure by the humans who configured the system.

This distinction matters because of how the narrative travels. "AI deletes production database" becomes part of the "here be monsters" canon, cited in risk assessments, used to justify blanket restriction policies, repeated in boardrooms by people who read the headline and not the detail. The correct lesson — non-deterministic systems must have deterministic boundaries at points where actions are irreversible — gets lost.
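The boundary itself is not sophisticated. A sketch, with illustrative names rather than a real API:

```python
# The agent can request whatever it likes; irreversible actions pass through a
# deterministic gate it cannot argue its way around. All names are illustrative.

IRREVERSIBLE = {"drop_table", "delete_branch", "terminate_instance"}

def run(action: str, args: dict) -> str:
    # Placeholder for the real executor.
    return f"ran {action} with {args}"

def execute(action: str, args: dict, approved_by: str | None = None) -> str:
    if action in IRREVERSIBLE and approved_by is None:
        # Hard stop: refuse, log, and escalate to a human. Not a prompt, a gate.
        return f"BLOCKED: {action} requires explicit human approval"
    return run(action, args)
```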

At the level of influence where Ford and Newman operate, this imprecision has consequences. Precision isn't pedantry here. It's the difference between solving the right problem and solving the wrong one.

The Distinction Nobody Is Making

There's a pattern here that anyone who lived through the microservices era should recognise immediately. Sam Newman documented it. Neal Ford watched it happen across hundreds of organisations.

Microservices weren't wrong. The decision to reach for microservices regardless of whether the problem warranted them was wrong. Teams took on distributed systems complexity without the operational maturity to manage it and without asking the prior question: does this problem actually require this solution? A modular monolith would have served most of them better. The tool was sound. The application was reckless.

Agentic AI is on exactly the same trajectory, and the people best positioned to name it clearly are instead writing about Dreyfus stages and LinkedIn anecdotes.

An agentic system operates with significant autonomy across multiple steps, makes decisions about what to do next, and takes actions with real-world consequences. It is by definition non-deterministic. For the right class of problem that's exactly what you want — open-ended research, solution space exploration too large for a human to traverse manually, generating options across a wide domain before a human applies judgment to narrow them. The non-determinism is a feature because the value is in the breadth of exploration, not the predictability of the path.

It is absolutely the wrong tool for a large class of problems teams are currently reaching for it to solve. Anywhere the requirement is reliable, auditable, repeatable execution — build pipelines, data transformations, deployment workflows, test execution — you want deterministic systems. Possibly with AI at specific well-defined nodes for specific well-defined tasks. Not autonomous agents making sequential decisions across the whole pipeline.
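A rough sketch of what that looks like, with illustrative names rather than any specific tool's API: the control flow is fixed and repeatable, and the model is consulted inside exactly one bounded step whose output cannot change what runs next.

```python
# Deterministic pipeline, AI at one bounded node. The control flow is fixed;
# the model's output feeds humans, not the pipeline's decisions. All names
# here are illustrative placeholders.

def build() -> None:
    print("build: deterministic")

def run_tests() -> None:
    print("tests: deterministic; a failure raises and stops the pipeline")

def summarise_changes(diff: str) -> str:
    # The single AI node: a narrow, reviewable task. Swap in your model call.
    return "summary of: " + diff[:80]

def deploy() -> None:
    print("deploy: deterministic")

def pipeline(diff: str) -> None:
    build()
    run_tests()
    notes = summarise_changes(diff)  # AI-assisted, output is advisory only
    print(notes)
    deploy()
```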

The production incidents follow directly from this misapplication. Teams aren't just configuring agents badly. They're reaching for agentic architecture in contexts that require deterministic architecture and discovering expensively that non-deterministic systems don't behave like deterministic ones.

The prior question

Not whether AI sits at the novice or advanced beginner stage of the Dreyfus model. The architectural question. Is this problem one that warrants a non-deterministic system at all? Answer that first. Everything else follows.

Ford and Newman between them have the pattern recognition to name this. They watched the microservices misapplication happen in real time and wrote about it with precision and clarity. The same analytical framework applied here would produce something genuinely valuable.

Instead the piece gives us Dreyfus and a LinkedIn screenshot.

What Good Actually Looks Like

The human brings judgment, context, accountability, and the why. The AI brings research surface area, synthesis capability, and the ability to explore solution space faster and wider than any individual practitioner. Those are different contributions. Keeping them separate is the discipline that makes the whole thing work.

You don't outsource reasoning. You use AI to inform it. When working through an architectural decision, you surface relevant patterns, prior art, failure modes from adjacent domains. You interrogate the output. You ask for sources. You apply your own judgment to what comes back. The AI is not making the decision. It's expanding what you can see before you make it.

You match the tool to the problem. Open-ended research, hypothesis generation, synthesis across large domains — agentic capability has genuine leverage here. Reliable, repeatable execution of defined processes — you build deterministic systems with AI at specific nodes where it adds value. Not autonomous agents across the whole pipeline.

You treat the output as a starting point, not an answer. The practitioner who understands this uses AI to raise the floor of what they can consider before applying their own expertise to raise the ceiling of what they produce.

This is craft. The same accumulated judgment about when to use which tool and how to use it well that good engineers develop about every other part of their work. It needs to be talked about honestly by people doing it rather than observed from a distance by people writing about it.

Ford and Newman could write that piece. It would be more useful than this one.

The Correct Question

Enterprise software complexity is not primarily inherent. It's primarily self-inflicted — the accumulated output of large teams operating under coordination overhead, political prioritisation, and processes that select for compliance over quality. Arguing we need to slow AI adoption to protect organisations structured around that environment is arguing for the preservation of the dysfunction.

The organisations willing to shed the coordination layer and rebuild around smaller, higher-trust teams using these tools well will find the complexity was never as inherent as it looked. The tools are legitimate. The economics are real. The misapplication is the problem — teams giving agents production access with no gates, and commentators whose caution misdirects serious practitioners toward the wrong solutions.

Mezzalira asked: do agents know what good looks like?

The more useful question is: do the people deploying them?

That's the one that was always going to determine the outcome. Not the capability of the tool. The judgment and craft of the humans holding it.

The floor has dropped. The economics have changed.

Are you asking the right questions?


Further Reading

The piece this responds to

Background reading from the authors