Sheet Music Bench (v1)
Frontier multimodal LLMs answer five basic questions about each of 35 sheet-music images.
Pick a piece to see every model's answer to every question, scored against the ground truth.
All
Wrong only
Question correlation
Conditional probability of answering B correctly given that A is correct,
across all model × piece cells. Off-diagonal cells reveal which questions
test overlapping vs. independent sub-skills. Hover any cell for the
underlying probability, baseline, uplift, and sample count.
Raw P(B | A)
Uplift vs. baseline
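As a rough illustration of the quantities behind each heatmap cell, here is a minimal sketch. It assumes each model × piece cell is a dict mapping a question id to whether the answer was correct, and it assumes uplift means the difference P(B | A) − P(B); the page does not state the formula, and a ratio would be an equally plausible reading. The function name, field names, and sample data are hypothetical.

```python
from itertools import product


def conditional_matrix(cells, questions):
    """Return {(a, b): stats} for every ordered pair of questions.

    cells: list of dicts, one per model × piece cell, mapping
           question id -> bool (True if answered correctly).
    """
    n_total = len(cells)
    out = {}
    for a, b in product(questions, repeat=2):
        # Restrict to the cells where question A was answered correctly.
        given = [c for c in cells if c[a]]
        n = len(given)
        # Baseline: unconditional probability of getting B right.
        baseline = sum(c[b] for c in cells) / n_total if n_total else None
        # Conditional: probability of getting B right among those cells.
        p = sum(c[b] for c in given) / n if n else None
        # ASSUMPTION: uplift is the difference, not the ratio.
        uplift = p - baseline if p is not None and baseline is not None else None
        out[(a, b)] = {"p": p, "baseline": baseline, "uplift": uplift, "n": n}
    return out


# Hypothetical toy data: four model × piece cells, two questions.
cells = [
    {"q1": True, "q2": True},
    {"q1": True, "q2": False},
    {"q1": False, "q2": False},
    {"q1": True, "q2": True},
]
matrix = conditional_matrix(cells, ["q1", "q2"])
print(matrix[("q1", "q2")])
```

Diagonal entries are always 1 whenever A was answered correctly at least once, which is why only the off-diagonal cells carry information. The sample count `n` is the number of cells where A was correct, so a large uplift with a small `n` deserves caution.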
Per-piece accuracy
Mean accuracy across all models, one cell per (piece, question).
Click any column header to sort by it. Click a piece to see every
model's answers in the Examples tab.
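The per-piece table can be sketched the same way. The example below assumes the results are available as (model, piece, question, correct) tuples, which is a hypothetical layout rather than the page's actual data format; the function names are likewise illustrative.

```python
from collections import defaultdict


def per_piece_accuracy(records):
    """Mean accuracy across models for each (piece, question) pair.

    records: iterable of (model, piece, question, correct) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # (piece, question) -> [hits, count]
    for _model, piece, question, correct in records:
        cell = totals[(piece, question)]
        cell[0] += int(correct)
        cell[1] += 1
    return {key: hits / count for key, (hits, count) in totals.items()}


def sort_pieces(acc, question, descending=False):
    """Order pieces by accuracy on one question, like sorting by a column."""
    pieces = {piece for piece, _question in acc}
    # The piece name is a tie-breaker so the order is deterministic.
    return sorted(
        pieces,
        key=lambda p: (acc.get((p, question), 0.0), p),
        reverse=descending,
    )


# Hypothetical toy data: two models, two pieces, one question.
records = [
    ("model_a", "piece_1", "q1", True),
    ("model_b", "piece_1", "q1", False),
    ("model_a", "piece_2", "q1", True),
    ("model_b", "piece_2", "q1", True),
]
accuracy = per_piece_accuracy(records)
print(sort_pieces(accuracy, "q1"))
```

Sorting ascending puts the hardest pieces for a given question first, which is usually the more useful order when looking for failure cases.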