
The Better AI Gets at Writing Code, the Worse We Get at Reviewing It

Patrick Hammond
Atomic Robot CTO
February 25, 2026 · 13 min read

You’re three hours into reviewing AI-generated code across multiple threads. Auth module in one tab, billing logic in another, test refactors in a third. Everything looks clean. Everything looks right. And that’s when you should be worried.

Something has shifted in the shape of engineering work. Not the amount of effort — the kind. Less time writing. More time reviewing, catching subtle errors, supervising across parallel streams.

AI hasn’t reduced the cognitive load of software engineering. It has redirected it — toward a kind of reviewing we’re not built for. You’re evaluating output you didn’t conceive, in patterns you didn’t choose, at a pace you don’t control — often without understanding why a particular approach was taken. You’re reconstructing intent, not confirming a familiar signal.

This feels new to software engineers. It is not new to humans.

For decades, other industries have faced the same fundamental challenge: skilled professionals whose primary job shifted from doing the work to monitoring systems that do it for them. Aviation, radiology, among others — these fields have deep research on exactly this cognitive pattern. The concepts have names: vigilance decrement, automation complacency, monitoring fatigue.

But the research also points toward solutions. Other industries have already mapped this problem space, and software engineering already has the practices it calls for. We just need to apply them to the new shape of the work.

What the Research Already Knows

Three well-studied phenomena converge in AI-assisted development workflows. Each has its own body of research. Together, they describe a problem that is larger than any one of them.

Vigilance Decrement

Sustained attention is more expensive than it feels. As Warm, Parasuraman, and Matthews (2008) showed, detection performance declines during sustained monitoring — not because the brain loses motivation, but because it runs out of resources. “Vigilance requires hard mental work and is stressful.”

The decline starts well before most code review sessions end. A 2025 review in Cognitive Science describes the vigilance decrement as “one of the most robust findings in attention research,” and Klein and Feltmate (2025), surveying 75 years of this research, confirm that performance drops after the first half hour on task.

Automation Complacency

When automated systems are mostly correct, humans systematically over-trust them — and this cannot be overcome with simple practice or instructions.

Lisanne Bainbridge articulated this paradox in 1983: the more reliable the automation, the less the human operator is able to contribute when it fails. The designer who tries to eliminate the human operator still leaves the human with the tasks the designer cannot automate — and these tend to be the hardest tasks of all.

The numbers are striking. In a review of automation complacency research, McBride, Rogers, and Fisk (2014) reported that a key study by Parasuraman and colleagues (1993) found that when automation was consistently reliable, operators detected only about 30% of automation errors. When the system sometimes failed visibly, detection rates jumped to roughly 75%.

The better AI gets, the worse we can expect to get at catching what it misses. And the harder the output is to verify, the more likely we are to just accept it.

As AI improves, the problem gets worse, not better. Mica Endsley, former Chief Scientist of the U.S. Air Force, calls it the “automation conundrum”:

“The more automation is added to a system, and the more reliable and robust that automation is, the less likely it is that human operators overseeing the automation will be aware of critical information and able to take over manual control when needed.”

If you’re waiting for AI to get good enough that human review becomes unnecessary, the research suggests that the path between here and there is the most dangerous stretch.

Context Switching

AI-assisted development doesn’t just ask you to review more — it asks you to review across more contexts at once. Every context switch leaves cognitive residue that degrades performance on the next task, and the cost compounds.

“Attention residue” — cognitive activity about a prior task that persists even after you’ve moved on — was named by Sophie Leroy in 2009. People experiencing attention residue fail to notice errors and miss optimal solutions. As she described it: “It’s like Windows staying open in our brains, and it makes it hard to focus on the intervening work.” Every open tab, every half-reviewed PR, every thread you’ll get back to later — they don’t close cleanly. They stay resident, competing for the same attention you need for the task in front of you.

The neuroscience explains why humans aren’t great at this. Wiehler and colleagues, in a 2022 study published in Current Biology, found that sustained high-demand cognitive work causes glutamate to accumulate in the lateral prefrontal cortex — the brain region responsible for executive control. The buildup makes cognitive control literally more expensive to activate over time. This isn’t about willpower. It’s neurochemistry.

You can see this play out in practice. Parnin and Rugaber (2011) studied 10,000 recorded sessions from 86 programmers and found that only 10% of programming sessions had coding activity begin in under a minute. In 93% of sessions, programmers navigated to other locations before editing — rebuilding the mental model they’d lost.

AI-enabled workflows amplify all of this. Parallel streams — auth in one thread, billing in another, infrastructure in a third — multiply the switching cost beyond what traditional development workflows demanded. Each switch means reloading a mental model. Each reload is more expensive than the last. And we’re doing it more frequently.

Review Fatigue: Naming the Experience

Any one of these would degrade code review quality on its own. But in AI-assisted development, they don’t take turns — they stack. Your attention depletes, your trust in the output inflates, and every context switch makes both worse. The result is what we’re calling review fatigue: the point where sustained code review quietly shifts from deep evaluation to surface scanning — not because you stopped caring, but because vigilance decrement, automation complacency, and context-switching costs are all operating on you at once.

You might already know what it feels like. Everything starts looking correct. The urge to approve grows. Edge cases stop triggering alarm bells. You stop asking second-order questions — what happens if this input is null? What about concurrency? You trust the pattern. Everyone has their own version of the tell. For me, it’s when I notice I’m scanning at higher and higher levels instead of actually understanding — when I just start approving faster. That’s my cue to pause and step back before blindly accepting everything.

That self-awareness is the first line of defense — but it requires honesty about what’s happening. And because AI output looks clean and confident when it’s mostly correct, it masks the uncertainty underneath, making the complacency even more dangerous. The danger concentrates in that middle zone: good enough to trust, wrong enough to hurt.

The Amplifier Loop

AI increases output volume, which increases review volume, which increases vigilance demand and context switches, which degrades review quality — and bugs slip through. But the system still looks productive. Throughput metrics go up. The erosion is invisible until something breaks.

Teams are already responding to this pressure by using AI to review AI-generated code — a quick scan of the output, then an AI-driven review of the PR, then directing an agent to address the comments. It’s a pragmatic adaptation — and it’s happening whether we theorize about it or not. But automating the review of automated output doesn’t eliminate the human factors problem — it nests it. Now you’re monitoring the monitor, and the same complacency dynamics apply one level up. Bainbridge’s irony of automation (1983) is recursive.

The Deeper Cost

There’s a deeper cost, though — one that layered automation can’t address at all. Review isn’t just quality assurance — it’s how you learn the system you’re building. As Adam Toennis, a principal engineer on our team, puts it: the differentiator for an engineer in an AI-assisted workflow is “higher-level architectural decisions — being able to describe how data flows between all the layers and why they’re flowing that way.”

That knowledge doesn’t come from writing prompts. It comes from reviewing carefully enough to understand what was built and why. If you skip the review or do it poorly, you don’t just risk defects — you lose comprehension of the system itself. The cost of review fatigue isn’t just escaped bugs. It’s eroding the very understanding that makes you capable of leading the next round of work.

Lessons from Other Industries

The good news for us: we’re not the first profession to face this.

Other industries have already lived through what happens when sustained monitoring fails — and the consequences were severe enough to force systemic change.

A note of proportion: for most of us, the software we build is not life-and-death. The case studies below involve catastrophic failures, and the risks in a typical software project are not that. But the cognitive mechanisms don’t care about those details — the science is the same.

They studied the problem because the consequences forced them to. We have the luxury of learning from their research before ours force us to.

In autonomous vehicles, an Uber self-driving vehicle struck and killed a pedestrian in Tempe, Arizona in 2018. The safety driver’s sole job was to monitor the automation. The NTSB investigation found that Uber lacked adequate mechanisms for addressing operator automation complacency. The entire role was vigilance work — and complacency made that role useless. The parallel to assigning a developer to “just review the AI’s code” is direct.

In radiology, a 2013 study embedded a gorilla — 48 times the size of a lung nodule — into a series of CT scans and asked expert radiologists to search for nodules. 83% did not see the gorilla. Eye tracking showed that over half of those who missed it looked directly at its location. When you’re monitoring for one category of signal, you become blind to others — even enormous ones. When you’re scanning AI-generated code for syntax and logic errors, what are you not seeing?

In cybersecurity, a 2023 Vectra AI report found SOC analysts receive an average of 4,484 alerts daily. 67% go uninvestigated. 83% are false positives. The volume problem in security monitoring maps directly to the volume problem in AI code review: more signals than a human can process, most of them benign, with the critical ones hiding in the noise.

Across every one of these industries, the same pattern repeats:

  1. A system is designed assuming sustained human monitoring accuracy
  2. The assumption fails
  3. Individuals are blamed
  4. Investigations reveal systemic causes
  5. Structural changes follow — duty hours, mandatory rest, checklists, redundancy, alarm redesign

The question for software engineering is whether we’ll learn from this research — or wait until our own consequences teach us the same lessons.

The Verification Gap

Those industries faced monitoring fatigue within domains their operators already understood. Software engineering has an additional problem: AI makes it easy to generate code in domains you haven’t mastered yet — and complacency hits harder when you lack the expertise to spot what’s wrong.

One of AI’s genuine strengths is making new domains more approachable — an experienced engineer can move into an unfamiliar framework and be productive faster than ever before. But that accessibility requires discipline. Barry Geipel, a principal engineer on our team with four decades of experience, brings deep architectural expertise that translates across any stack. On a recent React Native project, his solutions reflected decades of hard-won judgment. But he’s honest about the gap: “I can review some things, but others I have no good grounding on.”

His architectural instincts are sharp — but idiomatic patterns in an unfamiliar framework are a different kind of knowledge. It’s the difference between knowing a power tool is powerful and knowing every way it can kick back.

The evidence confirms this isn’t just a feeling. Perry, Srivastava, Kumar, and Boneh (2023) studied developers using AI coding assistants and found that participants with AI access wrote significantly less secure code than those without — yet were more likely to believe they had written secure code. The assistance increased confidence while decreasing quality.

Goddard, Roudsari, and Wyatt (2012), in a systematic review of automation bias, found that erroneous automated advice was followed at a 26% higher rate among groups using automated recommendations — and that task inexperience correlated with increased automation bias errors. The less you know a domain, the more you trust the machine.

When AI makes a new domain accessible, the verification challenge doesn’t disappear — it shifts into territory where you may not yet have the fluency to distinguish correct from plausible. The more complex the AI-generated code, the harder it is to verify — and the more engineers will simply trust it.

Shifting How We Work

The picture is clear. But it also points toward the response — and the encouraging part is that most of what’s needed isn’t new. The answer isn’t more discipline. It’s better design — of our workflows, our habits, and our expectations.

The Case for Slowing Down

If AI increases output volume, then review capacity becomes the limiting reagent. Managing an AI-augmented team means managing cognitive load, not just managing tickets. And the instinct most teams have — to push through faster — is exactly backward.

Aviation learned this the hard way. After decades of accidents caused by rushing through checklists, the industry didn’t tell pilots to read faster — it mandated slower, more deliberate procedures and built in mandatory pauses. The teams under the most time pressure got more structure, not less. The same principle applies here.

Slowing down isn’t lost productivity — it’s where the quality lives. The team that feels too busy to slow down for review is the team that needs it most. Here’s a reframe I keep coming back to: the faster the code is written, the slower the review should be. Break the assumption that those two rates should be linearly related.

What the Research Suggests

Every industry that has studied monitoring fatigue has concluded the same thing: the solution is structural, not individual. You cannot train, motivate, or discipline your way past biological limits. As Nancy Leveson writes in Engineering a Safer World: Systems Thinking Applied to Safety (2012):

“Depending on humans not to make mistakes is an almost certain way to guarantee that accidents will happen.”

Treat review like high-focus work, not background noise. Vigilance requires hard mental work. It is not passive scanning. Schedule it accordingly — and protect it like you would any other high-concentration task.

Protect uninterrupted review windows. Context switching leaves cognitive residue that degrades performance. Fragmenting code reviews across meetings and Slack guarantees missed errors.

Normalize stepping away. NIOSH recommends breaks every two hours minimum. Air traffic control mandates rest between positions. In practice, treat review sessions like focused intervals — 25 to 30 minutes of deep review, then a deliberate break (if you’re already using the Pomodoro Technique, this fits naturally). The vigilance decrement research is clear: a break can reduce or reset the decline. In software engineering, stepping away from review is treated as lost time. It should be treated as quality assurance.

Practices We Already Have

The tools are already in your toolbox — pair programming, TDD, small PRs, WIP limits, CI, observability, retros. Those aren’t relics of the pre-AI era. They’re exactly what this moment calls for.

Pair on reviews. Aviation doesn’t put one pilot in the cockpit. Two sets of eyes, shared cognitive load, real-time calibration through dialogue. When you’re narrating your reasoning aloud, you force active evaluation over passive scanning. This is especially valuable for the riskiest reviews: large AI-generated diffs, unfamiliar domains, late in the day.

Small PRs, small review units. We already know small, focused pull requests get better reviews. If AI generates a large block of code, break the review into smaller, focused units. Don’t review a 500-line AI-generated diff in one pass.

Seek out failure. Bahner and colleagues (2008) at TU Berlin found that exposing operators to automation failures during training significantly decreased complacency — one of the few proven countermeasures. Software engineering already has a practice built on this principle: red-green-refactor. The red step forces you to see the failure state before the success state. Apply the same discipline to AI output: before accepting, verify you can articulate what incorrect would look like. If you can’t describe the failure mode, you aren’t reviewing — you’re rubber-stamping.
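A minimal sketch of that discipline, using a hypothetical AI-generated helper (the function and its checks are illustrative, not from the article): write down the failure modes as assertions first, then run the generated code against them.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical AI-generated implementation under review."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


def review_checks() -> None:
    # The "green" we expect: happy path and boundaries.
    assert apply_discount(100.0, 10) == 90.0
    assert apply_discount(100.0, 0) == 100.0
    assert apply_discount(100.0, 100) == 0.0
    # The "red" we articulated before accepting: invalid input must
    # fail loudly, not silently produce a price.
    try:
        apply_discount(100.0, -10)
        raise AssertionError("negative percent was silently accepted")
    except ValueError:
        pass


review_checks()
```

If you can’t write the equivalent of `review_checks` for a diff, that’s the signal you’re rubber-stamping rather than reviewing.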

Limit WIP. AI makes parallelization cheap. Cognitive switching is not. Every open review thread is an open loop consuming working memory — and research tells us we only have about four slots of working memory available. Finish one review before opening the next.

Strengthen automated layers. Type systems, linters, static analysis, CI — these exist because we’ve always known human review is imperfect. If human review is more fallible than we assumed, the automated layers become more critical, not less.

But when the tests are also AI-generated, we face a question most teams haven’t confronted: are we generating incorrect verification of incorrect code?

If the same system — or the same cognitive session — produces both the implementation and the tests, the tests may encode the same flawed assumptions as the code they’re meant to validate. These layers need deliberate checkpoints where humans have ownership and accountability — gates in the workflow where a person must actively choose to accept rather than passively allowing code to flow through. Those gates are the part we can’t automate away.
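One way such a gate could be sketched, with the caveat that the path prefixes and the PR shape here are assumptions for illustration, not a standard CI API: flag any change that touches both the implementation and its tests without an explicit human approval.

```python
def needs_human_gate(changed_files: list[str], human_approvals: int) -> bool:
    """Return True when a change should not flow through passively.

    Hypothetical policy: if implementation and tests were produced in
    the same change, they may encode the same flawed assumptions, so a
    person must actively choose to accept before merge.
    """
    touches_impl = any(f.startswith("src/") for f in changed_files)
    touches_tests = any(f.startswith("tests/") for f in changed_files)
    return touches_impl and touches_tests and human_approvals == 0
```

A check like this doesn’t verify the code; it only guarantees the workflow has a point where a human must say yes.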

Document intent. One reason AI-generated code is harder to review is that you’re reconstructing intent. Architecture Decision Records, design rationale, prompt logs — these restore the context that makes review meaningful. You’re not just asking “does this code work?” but “does this code do what we intended?” As we’ve written about previously, the second question is unanswerable without documentation.

Retro on review quality. Add review fatigue as a retrospective topic: “Did we catch ourselves rubber-stamping this sprint? Where did AI-generated code slip through that shouldn’t have?” This surfaces the problem at the team level — where the real fixes live. At the individual level, a refine-then-fresh-perspective pass at the end of each session builds the same reflective habit into your daily workflow.

Watch for the signals. These are hypotheses grounded in what the research predicts — not proven metrics. But they give you something concrete to instrument:

  • Review approval speed trending faster without corresponding decrease in PR complexity
  • Review comment density declining with fewer questions asked per review
  • Bug escape rate increasing despite stable or increasing test coverage
  • Late-in-day approvals correlating with higher defect rates
  • PRs approved per day rising without corresponding increase in review time
  • Passive acceptance masquerading as productivity, with engineers following AI rather than directing it

Review fatigue is invisible until you build the dashboard for it. Don’t measure productivity purely by output volume — AI throughput that outpaces review capacity creates silent quality erosion. The system looks productive while judgment degrades.
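As a starting point for that dashboard, here is a sketch of instrumenting one signal from the list above: approval speed trending faster while diff size stays flat. The `PullRequest` shape and the normalization are assumptions, not a standard schema.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class PullRequest:
    lines_changed: int
    review_minutes: float  # time from review start to approval


def minutes_per_100_lines(prs: list[PullRequest]) -> float:
    """Normalize review time by diff size so speed trends are comparable."""
    return mean(pr.review_minutes / max(pr.lines_changed, 1) * 100 for pr in prs)


def approval_speedup(last_sprint: list[PullRequest],
                     this_sprint: list[PullRequest]) -> float:
    """Ratio below 1.0 means reviews got faster per line of code.

    A sharp drop without a matching drop in PR complexity is a
    retro question, not a win.
    """
    return minutes_per_100_lines(this_sprint) / minutes_per_100_lines(last_sprint)
```

Even a rough ratio like this turns "reviews feel faster lately" into a number the team can discuss in a retro.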

The Institute of Medicine didn’t tell doctors to be more careful. They redesigned the system. The Federal Aviation Administration didn’t tell pilots to pay more attention. They mandated rest, checklists, and co-pilots. The solution was never “try harder.” It was “work smarter.”

If the Work Shifts, Shift How You Work

When a team gets this right, the work actually feels better. Reviews are focused, not frantic. Developers trust the process because the process is designed for how humans actually work. AI does what it’s good at: generating, iterating, exploring. Humans do what they’re good at: judging, questioning, deciding. The pace is sustainable because it’s intentional.

This isn’t a crisis. It’s a transition — and we’re in known territory. We’ll have to shift past our muscle memory and familiar patterns — but that’s something we engineers have always been good at.


These aren’t theoretical principles for us — they’re how we work. We’ve built our own engineering workflows around the research in this article: structured review practices, deliberate human checkpoints, and team rhythms designed for how people actually think. The harder part isn’t knowing what works — it’s embedding it into your team’s culture and making it stick under real deadline pressure. If your team is feeling the tension between AI-accelerated output and review capacity, we’d welcome the conversation.
