Defensible Benchmarks · verify before you sign up

Agents already work. Here's the receipts.

Three numbers from this year, each with a primary source you can open in a new tab. This is the public version of the credibility tile we show in the workshop, built so a sceptical operator can fact-check it before paying for a seat.

Source posture

Primary links first, methodology linked separately, and the as-of date travels with the tile so screenshots stay accountable.

Deep-research agents: GPT-5.5 Pro 90.1% on BrowseComp (OpenAI, May 2026) — a related but distinct job.

Sources, in order: huggingface.co/spaces/gaia-benchmark/leaderboard (GAIA — official live leaderboard; Manus team-author submission) · arxiv.org/abs/2311.12983 (GAIA benchmark paper, Mialon et al., 2023 — methodology of record) · skyvern.com/blog/web-bench (Web-Bench) · openai.com/index/browsecomp (BrowseComp — deep-research callout) · averi.ai B2B SaaS citation benchmarks report (2026) (B2B SaaS citation benchmark). Benchmarks current as of 2026-05-31 — refreshed monthly under the Benchmark of the Month routine.

How to read this. Every number above links to its primary source — open them. The three logos are shown in greyscale on purpose: the numbers carry the slide, not the brands. We refresh these figures monthly; the "current as of" date moves even when the numbers don't, so the tile never goes stale-looking. Questions about methodology? The arXiv paper is the GAIA method of record.