Sheet Music Bench

Pick a piece to see every model's answer to every question, scored against the ground truth.

Question correlation

Conditional probability of answering B correctly given that A is correct, across all model × piece cells. Off-diagonal cells reveal which questions test overlapping vs. independent sub-skills. Hover any cell for the underlying probability, baseline, uplift, and sample count.

Two views: raw P(B | A), and uplift vs. baseline.
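As a rough illustration of the quantities shown on hover, here is a minimal sketch of how such a table could be computed. It is not the dashboard's actual code. It assumes each model × piece cell is a mapping from question id to a correctness flag, that the baseline is the unconditional P(B), that uplift is the difference P(B | A) − P(B), and that the sample count is the number of cells where A is correct. The question names `key` and `tempo` are hypothetical.

```python
from itertools import product

def conditional_table(cells, questions):
    """Build P(B | A), baseline, uplift and sample count for every (A, B) pair.

    cells: list of dicts, one per model x piece cell, mapping
           question id -> bool (True if answered correctly).
    Assumptions (not confirmed by the source): baseline = P(B),
    uplift = P(B | A) - P(B), n = number of cells where A is correct.
    """
    table = {}
    for a, b in product(questions, repeat=2):
        given_a = [c for c in cells if c[a]]          # cells where A is correct
        n = len(given_a)
        baseline = sum(c[b] for c in cells) / len(cells)
        p = sum(c[b] for c in given_a) / n if n else None
        uplift = None if p is None else p - baseline
        table[(a, b)] = {"p_b_given_a": p, "baseline": baseline,
                         "uplift": uplift, "n": n}
    return table

# Toy data: four model x piece cells, two hypothetical questions.
cells = [
    {"key": True,  "tempo": True},
    {"key": True,  "tempo": True},
    {"key": False, "tempo": True},
    {"key": False, "tempo": False},
]
table = conditional_table(cells, ["key", "tempo"])
print(table[("key", "tempo")])
```

In this toy data, every cell that gets `key` right also gets `tempo` right, so P(tempo | key) exceeds the baseline P(tempo). A positive uplift of this kind is what would suggest overlapping sub-skills, while an uplift near zero would suggest independence. Diagonal entries are trivially 1, which is why only off-diagonal cells are informative.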

Per-piece accuracy

Mean accuracy across all models, one cell per (piece, question). Click any column header to sort by it. Click a piece to see every model's answers in the Examples tab.
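The aggregation behind this table can be sketched as follows. This is an illustrative reconstruction, not the dashboard's implementation: the record layout `(piece, question, model, correct)`, the function names, and the piece and model names are all hypothetical.

```python
from collections import defaultdict

def per_piece_accuracy(records):
    """Mean accuracy across models for each (piece, question) cell.

    records: iterable of (piece, question, model, correct) tuples,
             a hypothetical layout chosen for this sketch.
    """
    totals = defaultdict(lambda: [0, 0])   # (piece, question) -> [hits, count]
    for piece, question, _model, correct in records:
        cell = totals[(piece, question)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: hits / count for key, (hits, count) in totals.items()}

def sort_pieces(accuracy, question, descending=True):
    """Order pieces by their accuracy on one question (one table column)."""
    pieces = {piece for piece, _ in accuracy}
    # Pieces with no data for this question are treated as 0.0 here.
    return sorted(pieces,
                  key=lambda p: accuracy.get((p, question), 0.0),
                  reverse=descending)

records = [
    ("piece_a", "key", "model_1", True),
    ("piece_a", "key", "model_2", False),
    ("piece_b", "key", "model_1", True),
    ("piece_b", "key", "model_2", True),
]
accuracy = per_piece_accuracy(records)
ranking = sort_pieces(accuracy, "key")
print(accuracy, ranking)
```

Sorting by a column then amounts to ordering pieces by a single question's mean accuracy, which is what `sort_pieces` does for the chosen question.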

Sheet Music Bench (v1)
