Transcripts mislead on AI coding gains
Transcripts promise neat bounds on AI coding gains, but they gloss over what real productivity depends on. They’re showroom artifacts, not the codebase or the team dynamics where productivity actually lives.
The METR note makes a clean, comforting promise: analyze coding-agent transcripts and you can put an upper bound on the productivity gains from AI agents. A lab-grade ceiling for a messy reality.
But transcripts are the prettiest artifact in the room.
They’re not the fossil record of a codebase or the political ledger of an engineering team. They’re what you’d show a regulator or an investor on demo day. Not what you’d show a post-incident review.
What transcripts actually buy you
The method has real use. If you want a theoretical maximum—what’s possible in controlled conditions—coding agent transcripts help. They surface where agents shine: legible, bounded tasks; churn through boilerplate; pattern-matching fixes. METR’s note is doing something rare in this space: putting a brake on wild promises instead of flooring the accelerator.
That ceiling is a signal.
The danger starts when people mistake it for a forecast.
What the transcripts hide
The note treats transcripts like a laboratory: clean prompt, clean response, easy attribution of “who did what.” Great for a research note. Dangerous for a board deck.
Transcripts freeze the moment an agent looks competent.
They capture micro-tasks—autocomplete here, quick refactor there, snippet generation on demand. They don’t show the downstream friction: how long it took to integrate that snippet, how many review cycles it triggered, whether the change played nicely with the rest of the system, or how many hours a human spent unpicking a too-confident function.
They certainly don’t show the pull requests that quietly died in review for security, reliability, or just “this is the wrong abstraction.”
Here’s what they won’t tell you: productivity in software is mostly everything that happens between commits.
Coordination, code ownership, onboarding, incident response, technical debt triage. A transcript can show a model producing “usable” code. It can’t show how brittle that code becomes under real load, or how much cognitive overhead it adds for the next engineer who stumbles into it six months later.
So an “upper bound” drawn from transcripts might be mathematically elegant and practically skewed. An arithmetic ceiling for a social system.
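The skew is easy to put numbers on with a back-of-the-envelope, Amdahl's-law-style model. The figures below are purely illustrative assumptions, not from the METR note: if transcript-visible coding is only a fraction of an engineer's week, even a dramatic speedup on that fraction yields a modest overall gain.

```python
def overall_speedup(coding_fraction: float, coding_speedup: float) -> float:
    """Amdahl's-law-style bound: only the coding fraction of the work
    gets faster; everything between commits stays at 1x."""
    return 1.0 / ((1.0 - coding_fraction) + coding_fraction / coding_speedup)

# Suppose transcripts show a 3x speedup on pure coding tasks,
# but coding is only 30% of total engineering time (review,
# coordination, incident response, and debt triage make up the rest).
print(overall_speedup(0.3, 3.0))  # ≈ 1.25x overall, not 3x
```

The ceiling the transcript shows and the ceiling the organization experiences can differ by a factor of two or more, simply because the denominator is the whole job, not the demo.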
Bias wears a lab coat
The note leans on transcripts as if they’re a neutral sample of real work. But whose transcripts? From which agents, on what tasks, under what pressure?
Which ones get saved, shared, and polished into a research artifact?
The clean wins. The sharp demos. The “look, it wrote a whole function” moments. Convenient, isn’t it.
Selection bias pushes the dataset toward short, solvable, visually satisfying prompts. The hairy conversations—architecture tradeoffs, refactoring a decade-old module, negotiating with product over scope creep—either don’t live in a coding transcript or never get curated.
There’s survivorship bias too: successful runs become examples; frustrating dead-ends vanish into private Slack channels.
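How much can curation alone inflate an estimate? A minimal simulation sketch (with made-up distributions, purely to illustrate the mechanism): if only clearly successful runs get saved and shared, the mean of the curated set can look strongly positive even when the true average gain is zero.

```python
import random

random.seed(0)

# Simulate 1000 agent runs. Each run's "time saved" can be negative:
# sometimes the agent's output costs more to fix than it saved.
# True average effect here is deliberately zero.
runs = [random.gauss(mu=0.0, sigma=1.0) for _ in range(1000)]

# Curation keeps only the clean wins — the runs worth showing off.
curated = [r for r in runs if r > 0.5]

true_mean = sum(runs) / len(runs)
curated_mean = sum(curated) / len(curated)

print(f"true mean saving:    {true_mean:+.2f}")     # near zero
print(f"curated mean saving: {curated_mean:+.2f}")  # comfortably positive
```

The frustrating dead-ends never make it into the sample, so the "evidence" is systematically rosier than the process that produced it.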
Who benefits from ceilings gently tilted upward by these biases?
Venture capital firms and platform vendors. Follow the money. An optimistic “upper bound” gives investors a narrative: there’s still air above us, still upside to fund. It hands procurement teams a headline number to dangle in front of a skeptical finance department. It lets product leaders skip the grind of careful pilots because “even if we get a fraction of this, we’re fine.”
But a ceiling is not a strategy.
Transcripts miss culture
Ask anyone who’s watched two teams use the same tool with wildly different results. Culture eats automation for breakfast.
Engineering norms—review practices, pair programming, documentation discipline, incident rituals—determine how far an AI agent’s potential actually travels. A disciplined team with reliable tests and continuous integration might funnel agent output through a safety net and gain real speed. A loose team shipping straight from local branches might just add faster ways to create incidents.
The transcript doesn’t capture fear.
It won’t tell you which engineer will get blamed if an agent-introduced bug takes down billing. It won’t show who is comfortable overruling the model, or who silently rubber-stamps suggestions to hit a productivity metric.
That’s where incentives creep in.
Follow the incentives
If leadership treats METR’s upper bound as a north star instead of a boundary, hiring and performance reviews start to warp.
Fewer experienced engineers, more “agent operators.” Lines of code and PR counts become proxy metrics for “AI impact.” The quiet work—simplifying architectures, paying down debt, saying no to brittle shortcuts—starts to look like underperformance next to flashy transcript clips.
Who gets credit when an agent drafts a test that a human curates, extends, and defends in production? If the transcript is the evidence, the agent gets more of the halo than the human who made it safe.
We’ve seen this movie before. When GitHub first popularized contribution graphs, some managers fixated on green squares. Developers gamed the metric with tiny commits and cosmetic changes. The tool was fine; the incentives were toxic.
Transcripts are just the new green squares.
What a sober use of the note looks like
The METR research can still be useful—if you treat it as a stress test, not a promise.
If you’re an engineering manager or buyer, treat transcripts as a first meeting, not a wedding. Start with their ceilings, then carve out sandbox pilots that mirror your worst days: incident response, hairy migrations, compliance reviews, not just tidy feature tickets.
Don’t let a ceiling replace reality checks.
Ask vendors and internal champions for their failure logs, not just their hero runs. Ask where agents slowed things down, where engineers stopped trusting suggestions, where governance overhead wiped out speed gains. Good partners will bring those scars into the room. Honest costs make better math.
The METR note gives the market something solid: a disciplined way to say “this is as good as it gets in this narrow frame.” The real test will be whether boards and executives treat that frame as context—or as a number to tattoo onto roadmaps.