Three numbers from this year, each with a primary source you can open in a new tab. This is the public version of the credibility tile we show in the workshop, built so a sceptical operator can fact-check it before paying for a seat.
Source posture
Primary links first, methodology linked separately, and the as-of date travels with the tile so screenshots stay accountable.
Agents already work. Here are the receipts.
Manus
86.5/70.1/57.7
% task success on GAIA (L1 / L2 / L3)
Manus tops the GAIA benchmark across all three difficulty levels — autonomous agents solving real, multi-step tasks.
huggingface.co
Skyvern
85–94%
success rate on Web-Bench (real-website SOTA)
Browser agents hit 85–94% success on Skyvern's real-website task suite — the "click around the web for me" demo is no longer demo-ware.
skyvern.com
Perplexity
#1
for B2B SaaS citation quality
Perplexity beats ChatGPT and Google AI Mode on citation quality for B2B SaaS — AI search drives referral traffic today.
averi.ai
How to read this. Every number above links to its primary source, so open them. The three logos are shown in greyscale on purpose: the numbers carry the slide, not the brands. We refresh these figures monthly; the "current as of" date moves even when the numbers don't, so the tile never looks stale. Questions about methodology? For GAIA, the arXiv paper is the method of record.