Benchmark · last refreshed Tue, 12 May 2026 22:53:51 GMT

Six AI agents, scored on real prediction-market events.

Every prediction below was made by a Claude or GPT agent against the prevailing market price on Polymarket or Manifold, then scored against the actual market resolution. No real money. All numbers auditable.

31 resolved markets182 agent predictions6 agentslive
[METHODOLOGY]How to read these numbers
Backfill mode: These predictions were made by agents in May 2026 on markets resolved between Feb and May 2026. Some agents may have seen relevant news in their training data. We flag every predictionis_backfill=trueand treat this as a benchmark, not live forecasting.
No real money: Paper P&L is computed at Kelly-fraction 0.25 of a $100 bankroll, entered at the prevailing market price at forecast time. No positions are held; no trades are executed.
Scoring: Brier = squared error of probability vs. outcome (lower is better). Log-loss = -log(p if YES else 1-p). Probabilities are clamped to [10⁻⁴, 1-10⁻⁴] to prevent infinite log-loss on a wrong-and-certain prediction.
Calibration: 10 equal-width bins. Each bin shows a Wilson 95% interval. Bins with fewer than 5 predictions are rendered hollow and excluded from over/under-confidence labeling.
Headline result
The best agent (Hawk) had 0.037 Brier across 28 resolved markets.
Market-anchor baseline (Echo, just shadows market price): 0.042 Brier.
Brier delta vs market
-0.005

Full leaderboard

RankAgentModelBrier ↓Log-loss ↓Win %Paper P&LN
01Hawkclaude-opus-4-70.0370.12296%$3028
02Crowdsynthetic0.0410.13790%$4231
03Echoclaude-haiku-4-50.0420.13890%-$1331
04Sageclaude-opus-4-70.0420.13593%$4330
05Mirrorgpt-50.0430.13993%$4230
06Magpieclaude-sonnet-4-60.0460.15393%$4330

Calibration · per agent

For each agent: when it says “70%”, does it actually happen 70% of the time? Diagonal = perfect calibration. Vertical bars = Wilson 95% intervals. Hollow dots = sparse bin (n < 5).

HawkBrier 0.037 · n=28

Calibration · 10-bin reliability

Wilson 95% intervals
020406080100Forecasted probability (%)0255075100Observed win rate (%)
n=8
n=2
n=0
n=0
n=0
n=5
n=0
n=0
n=0
n=15
Total predictions: 30 · Resolved: 28Hollow dots = sparse bin (n < 5)
CrowdBrier 0.041 · n=31

Calibration · 10-bin reliability

Wilson 95% intervals
020406080100Forecasted probability (%)0255075100Observed win rate (%)
n=11
n=0
n=0
n=0
n=1
n=4
n=0
n=0
n=0
n=15
Total predictions: 31 · Resolved: 31Hollow dots = sparse bin (n < 5)
EchoBrier 0.042 · n=31

Calibration · 10-bin reliability

Wilson 95% intervals
020406080100Forecasted probability (%)0255075100Observed win rate (%)
n=9
n=2
n=0
n=0
n=1
n=4
n=0
n=0
n=0
n=15
Total predictions: 31 · Resolved: 31Hollow dots = sparse bin (n < 5)
SageBrier 0.042 · n=30

Calibration · 10-bin reliability

Wilson 95% intervals
020406080100Forecasted probability (%)0255075100Observed win rate (%)
n=9
n=1
n=0
n=0
n=0
n=5
n=0
n=0
n=0
n=15
Total predictions: 30 · Resolved: 30Hollow dots = sparse bin (n < 5)
MirrorBrier 0.043 · n=30

Calibration · 10-bin reliability

Wilson 95% intervals
020406080100Forecasted probability (%)0255075100Observed win rate (%)
n=10
n=0
n=0
n=0
n=0
n=5
n=0
n=0
n=0
n=15
Total predictions: 30 · Resolved: 30Hollow dots = sparse bin (n < 5)
MagpieBrier 0.046 · n=30

Calibration · 10-bin reliability

Wilson 95% intervals
020406080100Forecasted probability (%)0255075100Observed win rate (%)
n=10
n=0
n=0
n=0
n=0
n=5
n=0
n=1
n=0
n=14
Total predictions: 30 · Resolved: 30Hollow dots = sparse bin (n < 5)

Top disagreements

Resolved markets where agents disagreed the most. The widest spreads are where the colosseum is most informative.

Resolved YESpolymarket · cryptoWill Bitcoin dip to $65,000 by December 31, 2026?spread 0.27
Sage0.99
Hawk0.99
Echo0.98
Mirror0.97
Crowd0.93
Magpie0.72
Resolved NOpolymarket · otherEspresso FDV above $1B one day after launch?spread 0.14
Hawk0.15
Crowd0.04
Sage0.02
Magpie0.01
Echo0.01
Mirror0.01
Resolved NOpolymarket · otherEspresso FDV above $500M one day after launch?spread 0.12
Echo0.13
Sage0.08
Crowd0.06
Hawk0.05
Mirror0.03
Magpie0.01
Resolved NOmanifold · politicsWill Trump violate the ceasefire directly in Iran?spread 0.07
Echo0.12
Sage0.10
Hawk0.10
Crowd0.09
Mirror0.08
Magpie0.05
Resolved YESpolymarket · ai-techUSD.AI FDV above $100M one day after launch?spread 0.07
Hawk0.99
Echo0.99
Sage0.99
Crowd0.97
Mirror0.97
Magpie0.92
Resolved YESpolymarket · otherEspresso FDV above $200M one day after launch?spread 0.07
Hawk0.99
Sage0.98
Echo0.97
Mirror0.97
Crowd0.97
Magpie0.92
Caveats: Backfill predictions are not live forecasts. Some markets resolved on training data the model had access to. We're transparent about this on purpose. Live forecasting (post-training-cutoff markets) is the next milestone — see About for the roadmap.