Monte Carlo Simulator: Why Balanced Accuracy Wins

This simulator answers one question: which binary classification metric is best for selecting the LLM judge that will most accurately rank models by prevalence?

It generates random scenarios with multiple models and judges, determines which judge is objectively best at ranking models, then checks whether each metric (Balanced Accuracy, Accuracy, F1+, Macro F1) would have selected that best judge using only a golden validation set.
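The loop above can be sketched in a few dozen lines. This is a minimal illustration, not the simulator's actual code: the parameter ranges, the pairwise-inversion definition of the "objectively best" judge, the 30% golden-set prevalence, and all function names are assumptions made for the sketch.

```python
# Hypothetical sketch of the Monte Carlo loop; parameter ranges and the
# "best judge" criterion are illustrative assumptions, not the simulator's.
import random

def judge_label(true_label, tpr, tnr, rng):
    # A judge is modeled as a noisy binary classifier with fixed
    # sensitivity (TPR) and specificity (TNR).
    if true_label == 1:
        return 1 if rng.random() < tpr else 0
    return 0 if rng.random() < tnr else 1

def metrics(tp, fp, tn, fn):
    # The four candidate selection metrics, from one confusion matrix.
    n = tp + fp + tn + fn
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    f1_pos = 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0
    f1_neg = 2 * tn / (2 * tn + fn + fp) if 2 * tn + fn + fp else 0.0
    return {"bacc": 0.5 * (tpr + tnr), "acc": (tp + tn) / n,
            "f1+": f1_pos, "macro_f1": 0.5 * (f1_pos + f1_neg)}

def run_trial(rng, n_models=5, n_judges=4, n_per_model=300, n_golden=200):
    # Random scenario: each model has a true prevalence of positives,
    # each judge has its own (TPR, TNR).
    prevalences = [rng.uniform(0.05, 0.5) for _ in range(n_models)]
    judges = [(rng.uniform(0.55, 0.99), rng.uniform(0.55, 0.99))
              for _ in range(n_judges)]

    # "Objectively best" judge (assumed definition): fewest pairwise
    # ranking inversions when estimating each model's prevalence.
    def ranking_errors(tpr, tnr):
        ests = []
        for p in prevalences:
            labels = [1 if rng.random() < p else 0 for _ in range(n_per_model)]
            ests.append(sum(judge_label(y, tpr, tnr, rng) for y in labels)
                        / n_per_model)
        return sum((prevalences[i] < prevalences[j]) != (ests[i] < ests[j])
                   for i in range(n_models) for j in range(i + 1, n_models))
    best_judge = min(range(n_judges), key=lambda k: ranking_errors(*judges[k]))

    # Golden validation set: the only data a practitioner would see.
    golden = [1 if rng.random() < 0.3 else 0 for _ in range(n_golden)]
    scores = []
    for tpr, tnr in judges:
        preds = [judge_label(y, tpr, tnr, rng) for y in golden]
        tp = sum(p and y for p, y in zip(preds, golden))
        fp = sum(p and not y for p, y in zip(preds, golden))
        tn = sum((not p) and (not y) for p, y in zip(preds, golden))
        fn = sum((not p) and y for p, y in zip(preds, golden))
        scores.append(metrics(tp, fp, tn, fn))

    # Success for a metric = it picks the objectively best judge.
    return {m: max(range(n_judges), key=lambda k: scores[k][m]) == best_judge
            for m in scores[0]}

rng = random.Random(0)
trials = 200
wins = {m: 0 for m in ("bacc", "acc", "f1+", "macro_f1")}
for _ in range(trials):
    for m, ok in run_trial(rng).items():
        wins[m] += ok
print({m: wins[m] / trials for m in wins})
```

The success rate reported per metric corresponds to the "Metric Selection Success Rate" panel; the simulator additionally tracks the rank gap between the selected and the best judge.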

Based on Collot et al., "Balanced Accuracy: The Right Metric for Evaluating LLM Judges" (EACL 2026).

The page is organized into three configuration panels and a results dashboard:

Configuration: Models, Judges, and Simulation (with a progress counter out of 1000 trials).

Results: Metric Selection Success Rate and Average Rank Gap Loss, each reported for Balanced Accuracy, Accuracy, F1+, and Macro F1 (shown as "--" until the simulation has run).

Charts: Success Rate Over Time and Rank Gap Loss Over Time.