Do AI agents actually work yet? What the benchmarks show

The demos are real and the leaderboards are climbing fast. Turned loose on real multi-step work, though, the best agents finish about a third of it, and the reason is arithmetic: reliability compounds downward. Here is what the benchmarks show, the half-life math behind it, and where agents actually pay today.

Turn the best AI agent on the market loose inside a realistic software company, give it the chat tool, the project board, the code repo, and 175 ordinary office tasks, and it finishes about 30% of them on its own. That is not an old model and it is not a typo. It is Gemini 2.5 Pro, one of the strongest agents available, on Carnegie Mellon’s TheAgentCompany benchmark, scoring 30.3% fully autonomous and creeping to 39% only when you hand out partial credit for getting part-way. The rest it botched or left half-done.

Hold that next to the demos. Every agent launch shows a clean run: the agent books the trip, files the expense, fixes the bug, all by itself, and the demo is real. It is also a sample size of one. A business does not run on one good run; it runs on the thousandth, and the distance between “worked once on stage” and “works every time, unattended” is the entire question. The short answer to “do AI agents work yet” is this: on narrow, short, supervised tasks, genuinely yes. On the open-ended multi-step work people keep promising they will automate, not reliably and not yet, and the reason is arithmetic rather than a bug that gets patched next month.

The short version

Agents earn their keep inside tight lines and lose it the moment you widen them. If you are putting one to work to make money:

Keep the scope narrow and the task short. Reliability falls off a cliff as a job gets longer, and the data below is brutal on this. One tool, one well-defined job, a handful of steps, is where agents already pay.
Put a human at the checkpoints. Not watching every token, but approving the few moves that spend money or cannot be undone. The agent drafts, you commit.
Make failure cheap and reversible. Point an agent at a sandbox, a draft, a proposal, never a production database or a live send. Each of the headline agent disasters would have been fine but for one irreversible action.
Add a retry and a checker. A second pass that verifies the first one’s work buys back a lot of the lost reliability. Bare single-shot autonomy is the weakest setup there is.
Do not buy the leaderboard. The most eye-popping scores come from bespoke multi-model rigs you cannot actually purchase, not from the agent you would deploy.
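
The retry-and-checker advice can be sketched in a few lines of Python. This is a minimal illustration, not any vendor's API: `run_with_checker`, `attempt`, and `check` are hypothetical names, and the closed-form estimate assumes attempts fail independently and the checker is never wrong, which real checkers are not.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_checker(
    attempt: Callable[[], T],
    check: Callable[[T], bool],
    max_attempts: int = 3,
) -> Optional[T]:
    """Return the first result that passes `check`, or None to hand off to a human."""
    for _ in range(max_attempts):
        result = attempt()
        if check(result):
            return result
    return None  # nothing verified: escalate instead of committing


def success_with_retries(p: float, attempts: int) -> float:
    """Idealized success rate: independent attempts and a perfect checker (assumptions)."""
    return 1 - (1 - p) ** attempts


if __name__ == "__main__":
    # Stand-in for an agent: two empty drafts, then a usable one.
    drafts = iter(["", "", "final draft"])
    print(run_with_checker(lambda: next(drafts), bool))
    for attempts in (1, 2, 3):
        print(f"{attempts} attempt(s) at 60% each -> {success_with_retries(0.6, attempts):.1%}")
```

Under those generous assumptions, a 60% agent with three checked attempts lands near 94%. A flawed checker or correlated failures will give back some of that gain, which is why the human checkpoint still matters.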

Everything after this is the evidence, and the math that says why.

Turned loose, they fail most of the job

That 30% is not an outlier. Salesforce built its own benchmark, CRMArena-Pro, to see how agents handle real sales and service work, the exact thing it sells agents to do. On single-step tasks the best model cleared about 58%. The moment the task became a normal back-and-forth where the agent had to ask for a missing detail, that fell to roughly 35%. The same agents almost never refused to hand over confidential data unless they were specifically told to be careful, and telling them to be careful dragged completion down further. This is the vendor’s own evidence against its own pitch, which makes it hard to wave away.

The deeper problem is not the average, it is the consistency. Sierra’s tau-bench sits an agent in a customer-service seat and asks it to follow company policy through a multi-step chat. The best agent it tested got the average retail task right about 61% of the time, which sounds usable. Then the researchers asked a sharper question: can it get the same task right eight times in a row, once for each of eight customers? The share it nailed every single time dropped below 25%. An agent that works two times in three is a fine demo and a poor employee, because the customer who draws the third outcome is a refund, a chargeback, or a complaint.
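
The gap between an average score and an every-time score is easy to reproduce. The sketch below uses a combinatorial estimate of "all k trials succeed": for each task, the share of k-sized subsets of the recorded trials that are all successes, averaged over tasks. The per-task counts are made up for illustration and are not tau-bench data, and `all_k_pass_rate` is a hypothetical helper name.

```python
from math import comb


def all_k_pass_rate(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Average over tasks of C(c, k) / C(n, k), where c is the task's success count."""
    if not 1 <= k <= n_trials:
        raise ValueError("k must be between 1 and n_trials")
    total = sum(comb(c, k) for c in successes_per_task)  # comb(c, k) is 0 when c < k
    return total / (comb(n_trials, k) * len(successes_per_task))


if __name__ == "__main__":
    # Hypothetical results: 10 tasks, 8 trials each, number of successful trials per task.
    counts = [8, 7, 6, 5, 4, 8, 3, 2, 6, 0]
    for k in (1, 2, 4, 8):
        print(f"k={k}: {all_k_pass_rate(counts, 8, k):.1%}")
```

With k=1 this is just the ordinary average success rate. As k rises, only the tasks the agent gets right every time keep contributing, so the figure can fall sharply even when the average looks healthy.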

The math nobody demos

Here is why a healthy per-task score still does not add up to an autonomous worker, and it is just multiplication. Say an agent is 95% reliable on each individual step, which is better than most manage. Chain 20 of those steps into one task, the kind of thing “book my travel and file the expenses” actually involves, and the odds it gets the whole task right are 0.95 to the twentieth power, or about 36%. At ten steps it is still only 60%. Reliability compounds, and it compounds downward, because a long task is a chain of subtasks where failing any one fails the whole thing.
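
The multiplication is short enough to check yourself. A minimal sketch, assuming every step succeeds or fails independently with the same probability (real agents are messier); `chain_success` and `required_per_step` are illustrative names:

```python
def chain_success(per_step: float, steps: int) -> float:
    """Chance that every step in a chain succeeds, assuming independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability")
    return per_step ** steps


def required_per_step(target: float, steps: int) -> float:
    """Per-step reliability needed to hit `target` end to end (same assumption)."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    return target ** (1 / steps)


if __name__ == "__main__":
    for steps in (1, 5, 10, 20, 50):
        print(f"{steps:>2} steps at 95% each -> {chain_success(0.95, steps):.0%} end to end")
    print(f"per-step rate needed for 90% over 20 steps: {required_per_step(0.90, 20):.2%}")
```

Run in reverse, the same arithmetic says a 20-step task needs roughly 99.5% per-step reliability to finish nine times in ten.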

Oxford’s Toby Ord gave this its name in a 2025 paper, “Is there a half-life for the success rates of AI agents?” Working from the benchmark data, he found agent success decays exponentially with task length, as cleanly as if each agent ran a fixed risk of failing per minute of work, a half-life. One consequence is worth keeping on a sticky note: every extra “nine” of reliability you demand divides the length of task you can trust by roughly ten. On that model, an agent reliable enough to finish an hour-long job half the time can be trusted only on a job of about 19 minutes if you need it to work four times in five, and about 9 minutes if you need nine times in ten. The demo is the 50% case. A business runs on the 99% case, and that is a different, much smaller job.
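
The half-life rule can be turned into a small calculator. This sketch assumes the idealized constant-hazard model, success(t) = 0.5 ** (t / horizon), where `horizon_50` is the task length the agent finishes half the time; `trusted_length` is a hypothetical helper, and measured curves need not follow the model exactly.

```python
import math


def trusted_length(horizon_50: float, target: float) -> float:
    """Task length at which success equals `target`, if success(t) = 0.5 ** (t / horizon_50)."""
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return horizon_50 * math.log(target) / math.log(0.5)


if __name__ == "__main__":
    horizon = 60.0  # minutes: the agent finishes an hour-long job half the time
    for target in (0.5, 0.9, 0.99, 0.999):
        minutes = trusted_length(horizon, target)
        print(f"{target:.1%} reliable -> about {minutes:.2f} minutes")
```

Each added nine shrinks the trusted length by a factor of about ten, which is the sticky-note rule in numeric form.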

The cliff, measured

The research group METR put numbers on the cliff. It measures an agent’s “50% time horizon,” the length of task, scored by how long it takes a person, that an agent finishes on its own about half the time. The headline that gets quoted is that this horizon is doubling every few months, and it is: the frontier has climbed from minutes to a few hours of human-equivalent work in about two years. The line underneath is the one that matters here. The 80% horizon, the length you can trust four times in five, runs about five times shorter than the 50% one. In METR’s data, models nailed almost every task that takes a human under four minutes and succeeded less than 10% of the time on tasks that take a human more than about four hours. “Half the time on a long task” is the headline. “Almost never, when it has to be reliable” is the same finding.

Why the scoreboards disagree

You can also find a leaderboard that says agents already match people. On GAIA, a benchmark of real-world assistant tasks, humans score 92%, versus 15% for GPT-4 with plugins when the benchmark launched, and the top entries now claim around 92% too. Read the fine print. Those top scores come from custom multi-model rigs, scaffolding bolted onto several models and tuned for the test, submitted by their own authors and not independently reproduced. The best agent an outside lab has actually reproduced sits near 75%, and a bare model with no scaffolding is far below that. The number that matters for you is not the one a research team can hit on a hidden test set with a hand-built ensemble. It is the one you can buy, point at your work, and trust without a specialist babysitting it, and that number is a good deal lower than the leaderboard.

The fair counterpoint, and it is fair, is that all of these numbers are moving up quickly. METR’s horizon is doubling on the order of every four to seven months. Today’s 30% will not still be 30% two years from now. So the right posture is neither “agents are a toy” nor “agents will run the company by Friday.” It is to build for where the cliff sits today, and to keep watching it recede.
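
How far the cliff moves depends heavily on which end of that doubling range holds. A back-of-envelope sketch, assuming the trend simply continues (an extrapolation for illustration, not a METR forecast), with `growth_factor` as a hypothetical helper:

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplier on the time horizon after `months`, if it doubles every `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2 ** (months / doubling_months)


if __name__ == "__main__":
    for doubling in (4, 7):
        factor = growth_factor(24, doubling)
        print(f"doubling every {doubling} months -> horizon x{factor:.0f} after 24 months")
```

Under that assumption, two years multiplies the horizon by somewhere between roughly 11 and 64. A range that wide is itself a reason to plan around today's numbers and re-measure often.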

What works right now

The agents earning real money in mid-2026 are not running businesses unattended. They are doing narrow, bounded, supervised work, and doing it well: drafting the first version, triaging the inbox, pulling the data, writing the boilerplate, with a person owning the few decisions that count and a cheap way to catch the misses. That is augmentation, and it pays. The trap is the next sentence in every pitch, where the bounded helper quietly becomes an unattended employee. That sentence is where the 36% lives.

One more thing if you are building this rather than buying it: the long, looping, multi-step runs that fail most often are also the most expensive ones to serve, because every retry and every step is more tokens on the meter. The work that breaks your reliability is the same work that breaks your margin, which is a strong hint about where to point an agent and where not to. Where agents do pay in coding specifically, the speed comes with its own bill.

Do AI agents work yet? On a leash, yes, and the leash is getting longer fast. Off the leash, on the open-ended multi-step work the word “agent” was supposed to promise, they finish about a third of it, and the math says that will not flip just because the demo ran clean. An agent that works half the time is a tool you supervise, not an employee you trust. Build for the half that fails, and you keep the half that pays.
