LLM Rankings

Unified benchmark leaderboard — percentile normalization & statistical tiering across multiple leaderboards

Benchmarks in this category

Methodology

- Scores are percentile-normalized across all models on each benchmark (0 = best, 1 = worst).
- A model's final score is the median percentile across all benchmarks it was evaluated on.
- Models evaluated on fewer than 3 benchmarks receive a sparse-data penalty: +0.25 for n = 1, +0.10 for n = 2.
- Tiers use the "Indistinguishable from Best" method: models whose asymmetric Q1–Q3 interval overlaps the tier leader's are grouped into the same tier.
- Error bars span Q1 to Q3 around the median (asymmetric).
- In the ranking chart, marker size scales with cost; diamonds mark open-weight models; † = n = 2 benchmarks, ‡ = n = 1 benchmark.
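The scoring rules above can be sketched in a few lines of Python. This is an illustrative reading of the methodology, not the site's actual implementation; the function names (`percentile_scores`, `final_score`, `assign_tiers`) and the evenly spaced rank-to-percentile mapping are assumptions.

```python
from statistics import median

def percentile_scores(benchmark):
    """Map raw benchmark scores (higher = better) to percentiles in
    [0, 1], where 0 = best and 1 = worst. Assumes evenly spaced ranks."""
    ranked = sorted(benchmark, key=benchmark.get, reverse=True)
    n = len(ranked)
    if n == 1:
        return {ranked[0]: 0.0}
    return {model: i / (n - 1) for i, model in enumerate(ranked)}

def final_score(percentiles):
    """Median percentile across a model's benchmarks, plus the
    sparse-data penalty (+0.25 for n = 1, +0.10 for n = 2)."""
    n = len(percentiles)
    penalty = 0.25 if n == 1 else 0.10 if n == 2 else 0.0
    return median(percentiles) + penalty

def assign_tiers(models):
    """'Indistinguishable from Best' tiering: walk models in order of
    median score; a model joins the current tier if its Q1 overlaps
    (is <=) the tier leader's Q3, otherwise it starts a new tier.
    Each model is a (name, q1, med, q3) tuple."""
    tiers, leader_q3 = [], None
    for name, q1, med, q3 in sorted(models, key=lambda m: m[2]):
        if leader_q3 is not None and q1 <= leader_q3:
            tiers[-1].append(name)
        else:
            tiers.append([name])
            leader_q3 = q3
    return tiers
```

For example, a model scored on only two benchmarks with percentiles 0.2 and 0.4 gets `median = 0.3` plus the n = 2 penalty of 0.10, for a final score of 0.4.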

Changelog