The number everyone quotes this year is the time horizon. METR's measurements show the length of task a frontier agent can finish on its own, measured by how long it would take a person, now stretching into hours and doubling roughly every seven months. The curve is real, I feel it in my own work, and I think the public conversation about it is celebrating the wrong half of the transaction.
Every hour of autonomous work produces an hour's worth of output someone has to trust or check. Generation got cheap and got fast and keeps compounding. Verification did not. There is no curve anywhere showing how much output a person can responsibly vouch for doubling every seven months, and the distance between those two lines is being papered over, team by team, with a quiet and mostly unexamined increase in faith.
The industry's own surveys say the quiet part in numbers, nearly nine in ten agent teams have observability in place while only about half run evals. We are watching what our systems do far more than we are judging whether it was right.
Verification is a budget, whether or not you set one
I run a content factory that generates math problems for children, which means I have spent two years inside this exact asymmetry at miniature scale. The factory can produce more in an hour than I could carefully review in a week. The gap between those two rates is not a temporary inconvenience awaiting a better model. It is the permanent operating condition, and it forced me to treat verification as a budget with explicit line items, because the alternative was pretending I checked things I skimmed.
What that budget bought, concretely, is a ladder. Hard programmatic gates reject malformed output before anyone's attention is spent. A scored judgment routes everything doubtful to a person, and the line between machine-cleared and human-reviewed was drawn from months of recorded disagreement, not preference. A blind visual judge re-solves every problem from rendered pixels, because the artifact a child receives is a screen, never a row of JSON. And above all of it, ongoing human sampling keeps testing whether the automated layers still deserve the trust the architecture extends them.
I describe the ladder not because the specifics generalize, yours would differ, but because of what each rung actually is. Every rung is a place where I stopped looking at everything and started trusting a proxy. The gates are faith in my own rule-writing. The threshold is faith in a calibration. The sampling rate is faith in statistics. Verification at scale is not the opposite of faith, it is faith made explicit, with its assumptions written down where they can be audited and revised. What worries me is not the teams with small verification budgets. It is the teams that could not tell you where their checking ends and their believing begins.
Review becomes sampling, sampling becomes statistics, statistics become culture
The progression I lived through in miniature is, I suspect, the one most engineering organizations are entering now at full size.
First you review everything, and it works until volume kills it. Then you review by exception, machine-cleared work flows through and doubt routes to people, and the integrity of the whole arrangement quietly relocates into whatever decides what counts as doubt. Then even the exceptions outgrow you and you sample, which means accepting, out loud if you are honest, that unexamined work is reaching the world carrying your name. Each step is rational. Each step is also a transfer of trust from eyes to instruments, and instruments drift. The discipline that keeps the ladder honest is unglamorous, recording human judgments so they can recalibrate the machines, watching agreement rates for rot, and treating a rising auto-clear rate as a question instead of a victory.
What worries me about this year's tooling landscape is how lopsided the investment runs. Agents that take on hours-long tasks are a product category. The instruments a person needs to honestly vouch for that much machine output, interfaces built for judgment, calibration loops, drift alarms, disagreement queues, barely have names yet. We are shipping the strong half of the transaction and improvising the half I think was always the point.
Where my own faith begins
So, in the spirit of the ledger this post is asking for, here is mine. I trust my gates as far as the rules I can read. I trust the 0.85 line on concepts with deep review history and trust it progressively less the newer the territory, which is why new concepts start at full human review. I trust the blind judge's disagreements far more than its agreements, it is a model too. And past the sampling rate, I am believing, not checking, and I try to say so in exactly those words when I describe the system, because the day I stop noticing that boundary is the day it starts moving without me.
The bottleneck has moved from writing to verifying. I doubt it moves back soon, though I have been wrong about this field before. The capability curves all point one direction, and the doubling will keep outrunning any human review practice, which means every team shipping agentic work now owns a faith budget whether they have itemized it or not. My one suggestion is the itemizing. Know where your checking ends. Write down what you trust past that point and why, and revisit the list on a schedule, because the systems on the other side of it are improving faster than your reasons.