Language Model Council

Democratically Benchmarking Foundation Models on Highly Subjective Tasks

Paper | GitHub | Data | Recording | Slides

What if we let the models themselves decide who's the best?

Democratically Decided Leaderboard for Emotional Intelligence

Note: The council has not met in a while (the original data was collected in 2024), so almost all of these models are now deprecated or nearing end-of-life. One day, a new council may convene to evaluate its own members again.

Scenario

Responses

Affinities (Raw): Judge vs. Respondent

Affinities (Council-Normalized): Judge vs. Respondent

Expected Win Rates (Bradley-Terry): Respondent vs. Respondent
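For readers unfamiliar with the Bradley-Terry model, the sketch below shows one common way to turn pairwise win counts into expected win rates. It is illustrative only and is not taken from the council's code: the function names, the `wins` matrix layout, and the example counts are all hypothetical, and the fitting routine is the standard minorization-maximization (MM) update rather than whatever procedure the council actually used. It assumes every respondent has at least one win.

```python
def fit_bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the standard MM update.

    wins[i][j] is the number of times respondent i beat respondent j.
    Hypothetical helper; assumes every respondent has at least one win.
    """
    n = len(wins)
    p = [1.0] * n  # start with equal strengths
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        # Strengths are only defined up to scale, so rescale to mean 1.
        scale = n / sum(new_p)
        p = [x * scale for x in new_p]
    return p


def expected_win_rate(p, i, j):
    """Model probability that respondent i beats respondent j."""
    return p[i] / (p[i] + p[j])


# Made-up counts for three hypothetical respondents.
example_wins = [
    [0, 6, 8],
    [4, 0, 7],
    [2, 3, 0],
]
strengths = fit_bradley_terry(example_wins)
rate_0_over_2 = expected_win_rate(strengths, 0, 2)
```

Under this model, the entry in row i and column j of a respondent-vs-respondent grid would be `expected_win_rate(strengths, i, j)`, and the entries for (i, j) and (j, i) sum to 1.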

Judge Agreement (Cohen's Kappa): Judge vs. Judge
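As a reminder of what Cohen's kappa measures, here is a minimal sketch that computes it for two judges who each labelled the same items. Kappa compares the observed agreement rate with the agreement expected by chance from each judge's label frequencies. The function name, the label values, and the example verdicts are hypothetical, and how the council encoded its judges' verdicts is an assumption here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters over the same items.

    kappa = (observed - expected) / (1 - expected), where `expected`
    is the chance agreement implied by each rater's label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts: which of two responses ("A" or "B") each judge preferred.
judge_1 = ["A", "A", "B", "B"]
judge_2 = ["A", "B", "B", "B"]
kappa = cohens_kappa(judge_1, judge_2)
```

A value of 1 means the two judges always agree, 0 means they agree no more often than chance would predict, and negative values mean they agree less often than chance.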