Language Model Council
Democratically Benchmarking Foundation Models on Highly Subjective Tasks
What if we let the models themselves decide who's the best?
Democratically Decided Leaderboard for Emotional Intelligence
Note: It's been a while since the council last met (the original data was collected in 2024), so almost all of these models are now deprecated or nearing end-of-life. One day, a new council may convene to evaluate itself again.
Scenario
Responses
Affinities (Raw): Judge vs. Respondent
Affinities (Council-Normalized): Judge vs. Respondent
Expected Win Rates (Bradley-Terry): Respondent vs. Respondent
Judge Agreement (Cohen's Kappa): Judge vs. Judge