Development in the Age of AI: The Illusion of Correctness

The better AI-generated code looks, the less critical thinking we apply to it. A look at the automation paradox — Air France 447, Tesla Autopilot — and what Stanford, METR, Veracode, and Apiiro reveal about AI-assisted development.

Full Stack Developer · since 2018 Building web products with Node.js, TypeScript, React, and Next.js. On the backend — NestJS, microservices, PostgreSQL/MongoDB, deployed on AWS. Domain experience: automation, healthcare, insurance, transport/logistics, web hosting. Quietly building SaaS on the side. Writing about what usually goes unspoken.

A year or two ago, programming with AI (LLMs) was in its early stages; today, the market demands the use of various AI tools. And this applies not only to programming: design, analytics, customer support, copywriting, translation, and even routine clerical work have all been handed over to AI agents of one sort or another.

What does this mean for us as developers? More and more code is being written, validated, and tested with the help of AI. At nearly every stage of building a software product, an LLM is involved — from the initial idea to the finished product. And the more we use AI, the more problems we discover (both obvious and subtle), regardless of the benefits it brings.

What does this look like in practice? A paradox: the better the AI output looks, the less critical thinking we apply to it. Neat indentation, sensible names, appropriate comments — and the code slips past our internal skepticism filter. In a serious team, code review, linters, tests, and CI will catch it. But on the twentieth identical PR of the day, before a release, under a deadline, scrutiny becomes more lenient. It looks fine, after all. The tests are green.

And this is precisely where a specific class of problems sneaks through — the kind that standard filters miss. A hallucinated API that compiles because the AI picked a similarly named function — and it silently returns the wrong thing. A race condition that can't be reproduced on staging without real concurrent load. An off-by-one in pagination that drops the last element once in every thousand requests. A dependency with a hallucinated name that an attacker has registered on npm with malware. An N+1 query that brings the database to its knees only under production traffic.
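
To make the pagination case concrete, here is a minimal TypeScript sketch. Every name in it (`getPage`, `fetchAllBuggy`, `fetchAll`, `range`) is hypothetical and invented for this illustration, and the in-memory array stands in for a real paged API. The buggy loop looks tidy and passes any test whose data size is a round multiple of the page size, yet it loses the final item whenever exactly one item is left over.

```typescript
// Hypothetical stand-in for a paged API: returns up to `pageSize` items from `offset`.
function getPage<T>(items: readonly T[], offset: number, pageSize: number): T[] {
  return items.slice(offset, offset + pageSize);
}

// Buggy: `length - 1` in the loop bound. For pageSize > 1, data is lost only
// when exactly one item remains for the final page (length % pageSize === 1).
function fetchAllBuggy<T>(items: readonly T[], pageSize: number): T[] {
  const result: T[] = [];
  for (let offset = 0; offset < items.length - 1; offset += pageSize) {
    result.push(...getPage(items, offset, pageSize));
  }
  return result;
}

// Fixed: keep requesting pages while any item is still unread.
function fetchAll<T>(items: readonly T[], pageSize: number): T[] {
  const result: T[] = [];
  for (let offset = 0; offset < items.length; offset += pageSize) {
    result.push(...getPage(items, offset, pageSize));
  }
  return result;
}

// Test data helper: [0, 1, ..., n - 1].
function range(n: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < n; i++) out.push(i);
  return out;
}

console.log(fetchAllBuggy(range(20), 10).length); // 20, looks correct
console.log(fetchAllBuggy(range(21), 10).length); // 20, the 21st item is silently dropped
console.log(fetchAll(range(21), 10).length);      // 21
```

A test suite that only uses 20 or 25 records stays green for both versions, which is why this kind of bug tends to surface in production rather than in review.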

The most insidious case is a quiet violation of business logic. The tests are green because they verify what the developer thought was important. QA validated against a limited dataset. But without a full domain context, the AI may have shifted the logic imperceptibly: a rounding step that creates a penny-level discrepancy in finances on an edge case. A condition that applies to standard users but not to corporate users on a special tariff. A filter that quietly drops one category of records. These bugs don't crash, don't trigger alerts. They do the wrong thing — and only surface when a customer notices a discrepancy in a report.
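
The rounding case can be sketched in a few lines. This is an illustrative example under stated assumptions, not code from any real system: amounts are non-negative integer cents, `parts` is at least 1, and the function names `splitNaive`, `splitExact`, and `sum` are hypothetical. The naive version rounds one share and repeats it, which is fine for totals that divide evenly and off by a cent otherwise.

```typescript
// Naive: compute one rounded share and repeat it. The shares may not add up
// to the original total.
function splitNaive(totalCents: number, parts: number): number[] {
  const share = Math.round(totalCents / parts);
  const shares: number[] = [];
  for (let i = 0; i < parts; i++) shares.push(share);
  return shares;
}

// Remainder-aware: floor the share, then hand the leftover cents to the
// first shares, one cent each, so the sum always equals the total.
function splitExact(totalCents: number, parts: number): number[] {
  const base = Math.floor(totalCents / parts);
  const remainder = totalCents - base * parts;
  const shares: number[] = [];
  for (let i = 0; i < parts; i++) {
    shares.push(i < remainder ? base + 1 : base);
  }
  return shares;
}

function sum(values: number[]): number {
  return values.reduce((acc, v) => acc + v, 0);
}

console.log(sum(splitNaive(1000, 4))); // 1000, the case a quick test covers
console.log(sum(splitNaive(1000, 3))); // 999, one cent has vanished
console.log(splitExact(1000, 3));      // [ 334, 333, 333 ]
```

Nothing crashes and nothing alerts in the naive version. The only signal is a report that doesn't reconcile.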

And the worst of it is that AI code looks best precisely where we look least: in standard CRUD, boilerplate endpoints, and typical forms. Not because AI does worse there, but because that's where our attention is lowest. We'll re-read a complex algorithm three times. The twentieth controller of the day passes with minimal resistance. The false sense of correctness is strongest exactly where we're most vulnerable.

Who gets electrocuted

Ask any electrician: who gets shocked most often? The answer is the same — beginners and the very experienced. The middle ground is rarely affected. A beginner gets shocked because they don't know where the danger is. The experienced one gets shocked because they've stopped checking — the brain has automated the process and discarded the "unnecessary" safety steps.

The same logic, in my view, applies to AI-driven development. A junior accepts AI code because they can't assess whether it's correct: it compiles, the tests are green, so it must be fine. A senior accepts it because they're tired of checking: the first hundred times they read it carefully, everything was fine, and the brain concluded, "AI is reliable in this context". And that's exactly where a race condition slips through, one that a junior wouldn't have written because they wouldn't have known how to write it so elegantly. Mid-level developers, in my experience, tend to be the most critical: they already know where the AI lies because they've been burned by it themselves, but they haven't yet settled into habitual trust.

Note:

A caveat upfront: this is a working hypothesis based on an analogy, not a conclusion drawn from data. I haven't come across any direct studies broken down by experience level — most of the research cited below tested experienced developers. So treat this section as the author's view, not as fact.

We've seen this before — in pilots and drivers

This mechanism is well known outside IT. In 2009, an Airbus A330 operating Air France flight 447 crashed into the Atlantic Ocean, killing all 228 people on board. The official BEA report (the French equivalent of the NTSB) described the chain of events: the pitot tubes iced over at cruising altitude → the autopilot disengaged automatically → the pilots failed to understand what was happening and, through incorrect inputs, put the aircraft into a sustained stall. The stall warning sounded continuously for 54 seconds — the crew ignored it.

The BEA's conclusion wasn't "the pilots were bad". It was something more chilling: the pilots had lost the skill of manual flight at altitude because they always flew on autopilot. They had only ever practiced stall recovery for low altitudes — takeoff and landing. No one had prepared them for this situation at cruising altitude because "the autopilot is reliable, after all."

A more recent example is Tesla Autopilot. In August 2025, a federal jury in Miami found Tesla partly liable, for the first time, in a fatal crash involving Autopilot, awarding $243 million in damages (of which $200 million was punitive, i.e., intended as punishment). The driver had dropped his phone, leaned down to pick it up, and run a stop sign at 100 km/h, killing a pedestrian. The driver later said: "I thought the system would brake if there were an obstacle." In February 2026, the judge denied Tesla's motion to overturn the verdict, leaving the award in place. According to NHTSA data, Autopilot has been linked to at least 467 collisions and 13 deaths.

Coverage of the BEA report on AF447, such as IEEE Spectrum's, called this the automation paradox: the more reliable the automation, the fewer skills the human retains for the moment when manual control suddenly becomes necessary. Tesla shows the same pattern at a different scale: the system handles the overwhelming majority of cases, the human stops monitoring it, and it's precisely in the remaining minority that catastrophe occurs.

The same data in software development

What the aviation and automotive industries have collected over decades, software development has measured in the past two years. And the figures paint the same picture.

Stanford (Perry et al., ACM CCS 2023). 47 developers, 5 security tasks, a controlled experiment. Participants using AI wrote significantly less secure code on 4 out of 5 tasks. The paradox: they were also more confident that their code was secure than those who wrote without AI — classic automation bias, captured in data.

METR (July 2025). A randomized controlled trial with 16 experienced open-source developers working on large repositories they were already familiar with. Before the experiment, they predicted AI would speed them up by 24%. After, they estimated it had sped them up by 20%. The actual measurement: AI slowed them down by 19%. A 39-point gap between perception and reality — developers couldn't trust their own sense of productivity. An important caveat: this is a specific setup (experienced people on familiar code), and it doesn't mean AI "always slows you down". It means the perception of speed is unreliable.

Veracode 2025 GenAI Code Security Report. Tested 100+ LLMs across 4 languages. AI-generated code contains 2.74× more vulnerabilities than human-written code. 45% of AI samples contain vulnerabilities from the OWASP Top 10. Java fared the worst: a 72% security failure rate.

Apiiro (Fortune 50, 2025). A study covering 7,000 developers and 62,000 repositories. Over six months (December 2024 → June 2025), security findings grew tenfold. Privilege escalation paths were up 322%, architectural design flaws up 153%, and credential leaks up 40%. Developers using AI made 3–4 times more commits but fewer PRs — meaning PRs grew larger, and reviews became more superficial.

Stack Overflow Developer Survey 2025 (49,000 developers). 84% use AI, with half using it daily. But trust is falling: 46% don't trust AI for accuracy (up from 31% a year ago). 66% spend more time debugging AI-generated code than writing it.

The picture is consistent: developers see the problem — and use AI anyway. Like pilots who know about the automation paradox but still fly on autopilot 99% of the time.

What's going on in our heads

We've covered what's happening. Now, briefly, why. This isn't a question of discipline or laziness. It's basic psychology, studied within human factors science for the past 50 years.

Automation bias (Parasuraman & Manzey, 2010 — the seminal work). When a system appears competent, the brain hands over control and stops critically evaluating its output. It manifests in two ways: errors of omission, where a person fails to spot a mistake the system missed; and errors of commission, where a person acts on an incorrect prompt from the system, ignoring their own judgment. The Stanford study captured this in developers' own words: participants using AI wrote less secure code while being more confident that it was secure.

The complacency effect. The more reliable a system is most of the time, the harder it becomes to notice a failure in the rest. If AI produces correct code 95 times, on the 96th, you're no longer reading it carefully. And that's where the race condition lives. It's the same mechanism by which an electrician skips the gloves and a pilot stops monitoring the instruments. Reliability breeds inattention.
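
A back-of-the-envelope sketch shows why the unread 96th change matters. It rests on assumptions that are mine, not measured data: every accepted change independently carries the same small probability `p` of hiding a subtle defect, and none of them is re-read. The function name `chanceOfAtLeastOneDefect` and the numbers below are invented purely for illustration.

```typescript
// Probability that at least one of `n` accepted changes hides a defect,
// assuming each change independently has defect probability `p`.
function chanceOfAtLeastOneDefect(p: number, n: number): number {
  return 1 - Math.pow(1 - p, n);
}

// Assumed numbers, for illustration only: a 1% defect rate over 100 unread changes.
console.log(chanceOfAtLeastOneDefect(0.01, 100).toFixed(3)); // "0.634"
```

Under these assumptions, a tool that is right 99% of the time still leaves better-than-even odds of at least one hidden defect after a hundred unread changes.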

Cognitive offloading. The brain consciously offloads tasks externally to conserve resources. This is normal; it's how delegation works. The problem is that an unused skill is an atrophying skill. For pilots, this is manual flight. For developers, it's the ability to read code carefully, hold the architecture in your head, and anticipate edge cases.

A Microsoft Research and Carnegie Mellon study (CHI 2025) demonstrated this directly across 319 knowledge workers and 936 real-world cases: the higher the confidence in the AI, the less critical thinking applied to its output. Work shifts from execution to oversight. And the skill of oversight without the skill of execution is exactly the situation we saw in Air France 447.

What to do about it

I'm deliberately not going to offer system-level answers here ("rebuild your CI", "buy a SAST tool") — that's a separate topic, and every team solves it differently. What's more interesting is what you can do yourself, without coordinating with anyone, starting with your next PR.

Individual reading techniques

Right now, no agreements required.

Adversarial review. Read AI-generated code as if you were paid a bonus for every bug you found. Not "does this work?" but "where will this break?" A different mindset surfaces different mistakes.

The "how do you know this?" question. Make the AI explain its reasoning. Where does this function come from? Why this approach and not another? Which edge cases did you consider? In my experience, a good share of hallucinations fall apart at this stage: the model starts contradicting its own explanations.

Methodology

Changes how you work.

Tests before accepting AI code, not after. A classic trap: AI generates code → we write tests against that code → the tests pass → we conclude everything is fine. But those tests aren't verifying correctness; they're verifying that the code is consistent with itself. First, decide what the function should do. Then decide how to test it. Then, and only then, write the implementation, whether yours or the AI's.
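
One way to sketch this order of work in TypeScript is shown below. The business rule is an assumption made up for the example (standard customers get 10% off orders of 10000 cents or more, corporate customers always get 15% off), and every name (`Plan`, `PriceFn`, `spec`, `failingCases`, `candidate`, `reviewed`) is hypothetical. The point is only that the expected values come from the requirement and exist before any implementation does.

```typescript
type Plan = "standard" | "corporate";
type PriceFn = (plan: Plan, cents: number) => number;

interface Case {
  name: string;
  plan: Plan;
  cents: number;
  expected: number;
}

// Step 1 and 2: the requirement, written as cases before any code exists.
// Expected values come from the (assumed) rule, never from running an implementation.
const spec: Case[] = [
  { name: "standard below threshold", plan: "standard", cents: 5000, expected: 5000 },
  { name: "standard at threshold", plan: "standard", cents: 10000, expected: 9000 },
  { name: "corporate small order", plan: "corporate", cents: 5000, expected: 4250 },
  { name: "corporate large order", plan: "corporate", cents: 10000, expected: 8500 },
];

// Runs any implementation against the spec and returns the names of failing cases.
function failingCases(impl: PriceFn): string[] {
  return spec
    .filter((c) => impl(c.plan, c.cents) !== c.expected)
    .map((c) => c.name);
}

// Step 3a: a plausible-looking candidate. It ignores the plan and uses `>`
// where the rule says "or more".
const candidate: PriceFn = (_plan, cents) =>
  cents > 10000 ? Math.round((cents * 90) / 100) : cents;

// Step 3b: an implementation written against the spec.
const reviewed: PriceFn = (plan, cents) => {
  if (plan === "corporate") return Math.round((cents * 85) / 100);
  return cents >= 10000 ? Math.round((cents * 90) / 100) : cents;
};

console.log(failingCases(candidate).length); // 3
console.log(failingCases(reviewed).length);  // 0
```

Had the tests been derived from `candidate` itself, all four cases would have been written to match its output and would have passed.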

AI-free days. Periodically write code without AI at all — not on principle, but to keep the skill alive. After Air France 447, Lufthansa pilots were required to complete mandatory simulator hours to prevent the skill from disappearing. For an experienced developer, one day a week or one pet project without Copilot is enough. If you're a junior, this isn't "keeping in shape" — it's actually building the skill in the first place: a skill that doesn't yet exist won't appear from staring at finished AI output, no matter how much of it you look at.

Team rules

Require agreement with colleagues.

Code review without AI on the other side. If the reviewer is asking Copilot whether the code is good, the review doesn't exist. That's AI reviewing AI, with you as the middleman. The point of review is a second pair of eyes that thinks differently. The moment both sides use the same model, you've lost the second perspective.

In closing

Collins Dictionary named vibe coding its Word of the Year for 2025. That same year, Merriam-Webster chose slop, the umbrella term for low-quality AI output, as its own. Two sides of the same coin, and both sides are real.

AI as a development tool is powerful, useful, and already inseparable from the work. The question isn't whether to use it. The question is how to preserve critical thinking when everything around you looks correct. Because a false sense of correctness is the most expensive thing in engineering. We pay for it in race conditions in production, penny-level discrepancies in financial reports, malware in production dependencies, and, in some industries, human lives.

For electricians, the rule of always testing with an indicator didn't become ritual overnight; it took decades and no small number of deaths. In AI-assisted development, the equivalent culture is only just emerging. And the sooner "re-read the diff line by line" becomes an automatic motion, the fewer stories of "we did it this way a thousand times, and it was always fine" we'll have to write.

References

Academic research

  • Perry, N., Srivastava, M., Kumar, D., Boneh, D. (2023). Do Users Write More Insecure Code with AI Assistants? ACM Conference on Computer and Communications Security (CCS '23). arXiv · GitHub data
  • Becker, J., Rush, N., Barnes, B., Rein, D. (2025). Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. METR. arXiv · Blog · Update (February 2026)
  • Lee, H.-P. et al. (2025). The Impact of Generative AI on Critical Thinking: Self-Reported Reductions in Cognitive Effort and Confidence Effects From a Survey of Knowledge Workers. Microsoft Research + Carnegie Mellon University. CHI 2025. PDF · ACM
  • Parasuraman, R., & Manzey, D. H. (2010). Complacency and Bias in Human Use of Automation: An Attentional Integration. Human Factors, 52(3), 381–410. DOI

Industry reports

Aviation: Air France 447

  • BEA (2012). Final Report — Accident on 1st June 2009 to the Airbus A330-203, flight AF 447. Presentation · PDF on the FAA site
  • IEEE Spectrum (2012). Air France Flight 447 Crash Causes in Part Point to Automation Paradox. IEEE Spectrum

Tesla Autopilot: Benavides v. Tesla

  • NBC News (1 August 2025). Tesla hit with $243 million in damages after jury finds its Autopilot feature contributed to fatal crash. NBC News
  • CNBC (20 February 2026). Tesla loses bid to toss $243 million verdict in fatal Autopilot crash suit. CNBC
  • Electrek (20 February 2026). Tesla has to pay historic $243 million judgement over Autopilot crash, judge says. Electrek

Vibe coding and terminology

  • Karpathy, A. (2 February 2025). The original tweet about vibe coding: x.com
  • Collins Dictionary — vibe coding as Word of the Year 2025.
  • Merriam-Webster — slop as Word of the Year 2025.

Slopsquatting

  • toxsec (2026). What is Slopsquatting? AI Hallucinations Ship Malware. toxsec.com
  • Aikido Security (2026). Slopsquatting: The AI Package Hallucination Attack Already Happening. aikido.dev
  • Mend.io (2025). The Hallucinated Package Attack: Slopsquatting Explained. mend.io
  • InfoWorld (2026). Supply-chain attacks take aim at your AI coding agents. infoworld.com

Context and further reading

  • MIT Technology Review (December 2025). AI coding is now everywhere. But not everyone is convinced. technologyreview.com
  • The Register (February 2025). Some workers are already outsourcing their brains to AI. theregister.com
  • The Register (September 2025). AI code assistants improve production of security problems. theregister.com
  • Fortune (February 2025). AI might already be warping our brains, leaving our judgment and critical thinking 'atrophied and unprepared'. fortune.com


Dmytro Spivak Blog
