Language Model Council
Democratically Benchmarking Foundation Models on Highly Subjective Tasks
What if we let the models themselves decide who's the best?
Democratically Decided Leaderboard for Emotional Intelligence
Note: It's been a while since the council last met (the original data was collected in 2024). You'll notice that almost all of these models are deprecated or nearing end-of-life. One day, a new council may reconvene to evaluate its members again.
Affinities (Raw): Judge vs. Respondent
Affinities (Council-Normalized): Judge vs. Respondent
Expected Win Rates (Bradley-Terry): Respondent vs. Respondent
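For readers unfamiliar with the Bradley-Terry model named in this panel, the sketch below shows one common way to turn pairwise preference counts into expected win rates. It is a textbook illustration, not the council's actual pipeline: the page does not say how its rates were fitted, and the function names (`fit_bradley_terry`, `expected_win_rate`), the iterative update, and the toy win counts are all assumptions made for the example.

```python
from typing import List


def fit_bradley_terry(wins: List[List[float]], iterations: int = 200) -> List[float]:
    """Estimate Bradley-Terry strengths from a pairwise win matrix.

    wins[i][j] is the number of times respondent i was preferred over
    respondent j (hypothetical input format). Uses the standard
    minorization-maximization update.
    """
    n = len(wins)
    strengths = [1.0] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (strengths[i] + strengths[j])
                for j in range(n)
                if j != i
            )
            # Keep the old value if respondent i has no recorded comparisons.
            updated.append(total_wins / denom if denom > 0 else strengths[i])
        scale = sum(updated)
        if scale == 0:
            break
        # Rescale so the strengths average to 1 (only ratios matter).
        strengths = [s * n / scale for s in updated]
    return strengths


def expected_win_rate(strengths: List[float], i: int, j: int) -> float:
    """Probability that respondent i is preferred over respondent j."""
    return strengths[i] / (strengths[i] + strengths[j])


# Toy data (made up): respondent 0 beat respondent 1 three times, lost once.
toy_wins = [[0, 3], [1, 0]]
toy_strengths = fit_bradley_terry(toy_wins)
toy_rate = expected_win_rate(toy_strengths, 0, 1)
```

With only two respondents, the fitted rate simply matches the observed share of wins; the model becomes useful when many respondents are compared and some pairs have few or no direct comparisons.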
Judge Agreement (Cohen's Kappa): Judge vs. Judge
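Cohen's kappa measures how often two judges agree beyond what chance alone would produce. The sketch below implements the standard two-rater definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from each judge's label frequencies. The function name `cohens_kappa`, the label values, and the sample verdicts are hypothetical; the page does not specify how judge verdicts were encoded.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(judge_a: Sequence[Hashable], judge_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two judges who rated the same items in the same order."""
    if len(judge_a) != len(judge_b) or len(judge_a) == 0:
        raise ValueError("Both judges must rate the same non-empty set of items.")
    n = len(judge_a)

    # Observed agreement: fraction of items with identical verdicts.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n

    # Chance agreement: product of each judge's marginal label frequencies.
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        # Both judges used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up verdicts: which of two responses ("A" or "B") each judge preferred.
verdicts_a = ["A", "A", "B", "B"]
verdicts_b = ["A", "B", "B", "B"]
kappa = cohens_kappa(verdicts_a, verdicts_b)
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.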