Brier Score Comparison

Leaderboard

Columns: #, Model, Overall, Dataset, Market, Questions

Overall: Brier score combining dataset and market questions. Lower is better. 0 = perfect, 1 = always wrong.
Dataset: Brier score on structured data questions (e.g., will GDP growth exceed X%). Lower is better.
Market: Brier score on real-world event questions (elections, conflicts, policy). Lower is better. This is where AI models consistently fail.
Questions: Total number of questions the model was evaluated on.

ForecastBench

ForecastBench is an open benchmark maintained by the Forecasting Research Institute that evaluates AI systems on their ability to predict real-world events. Models are tested on hundreds of questions spanning politics, economics, conflict, and policy — the same types of events that prediction markets track.

Brier Score

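The Brier score measures the accuracy of probabilistic predictions. For a single question it is (predicted probability − actual outcome)², where the outcome is 1 if the event happened and 0 if it did not; a model's score is the average of this value over all its questions. A score of 0 means every prediction was perfect; 1 means every prediction was fully confident and wrong. Lower is better. For context, a naive "always guess 50%" strategy scores 0.25, because (0.5 − outcome)² equals 0.25 whichever way the question resolves.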
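As a rough illustration of the formula, here is a minimal Python sketch that averages the squared error over a set of binary questions. The function name `brier_score` and the sample numbers are hypothetical and chosen only for this example; this is not ForecastBench's scoring code, which may differ in how it handles weighting or unresolved questions.

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between predicted probabilities and outcomes.

    forecasts: probabilities in [0, 1], one per question
    outcomes:  1 if the event happened, 0 if it did not
    """
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("at least one forecast is required")
    for p in forecasts:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each forecast must be a probability in [0, 1]")
    squared_errors = [(p - o) ** 2 for p, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Made-up example: three questions, the first resolved "yes", the others "no".
example = brier_score([0.9, 0.2, 0.7], [1, 0, 0])
print(round(example, 2))  # about 0.18

# The naive "always guess 50%" strategy scores 0.25 whatever the outcomes are.
print(brier_score([0.5, 0.5, 0.5], [1, 0, 0]))
```

Note how the single overconfident miss (0.7 on an event that did not happen) contributes most of the error in the first example: squared error penalizes confident wrong answers heavily.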

Dataset vs Market Questions

"Dataset" questions are structured data problems with clear, quantifiable answers (e.g., will GDP growth exceed 3%?). "Market" questions are real-world event predictions that require understanding politics, geopolitics, and human behavior — elections, conflicts, policy decisions. The gap between these scores reveals how much a model's apparent intelligence is pattern-matching on structured data versus genuine world understanding.

Superforecaster Baseline

The "Superforecaster median forecast" is a human baseline derived from the top forecasters identified by Philip Tetlock's Good Judgment Project. These are individuals who have demonstrated sustained, exceptional accuracy in predicting world events. No AI model currently achieves a lower (better) overall Brier score than this human baseline.

Data Source

Leaderboard data is sourced from the ForecastBench public dataset repository. We update this data periodically. The original tournament includes 283 model submissions from organizations including OpenAI, Anthropic, Google, xAI, Meta, and independent research teams.