Pick a piece to see every model's answer to every question, scored against the ground truth.
All / Wrong only
Question correlation
Conditional uplift: P(B correct | A correct) − P(B correct),
computed across all model × piece cells. Off-diagonal cells reveal
which questions test overlapping sub-skills (large positive uplift)
and which test independent ones (uplift near zero).
Hover for the underlying conditional probability and sample count.
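The uplift definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the dashboard's actual code: the function name `conditional_uplift` and the input shape (one dict of question id to correctness per model × piece cell) are hypothetical.

```python
from itertools import product


def conditional_uplift(results):
    """Compute P(B correct | A correct) - P(B correct) for each ordered pair.

    `results` is a list with one dict per (model, piece) cell, mapping a
    question id to True/False. This input shape is an assumption made for
    the sketch. Returns {(a, b): (uplift, p_b_given_a, n_a_correct)}.
    """
    questions = sorted({q for row in results for q in row})
    out = {}
    for a, b in product(questions, repeat=2):
        if a == b:
            continue  # the diagonal is not informative
        # Only cells that contain both questions contribute to the pair.
        both = [r for r in results if a in r and b in r]
        a_correct = [r for r in both if r[a]]
        if not both or not a_correct:
            # The conditional probability is undefined without samples.
            out[(a, b)] = (None, None, len(a_correct))
            continue
        p_b = sum(r[b] for r in both) / len(both)
        p_b_given_a = sum(r[b] for r in a_correct) / len(a_correct)
        out[(a, b)] = (p_b_given_a - p_b, p_b_given_a, len(a_correct))
    return out
```

The second and third tuple entries correspond to the conditional probability and sample count shown on hover. Note that the matrix is not symmetric: the uplift of B given A generally differs from the uplift of A given B.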
Per-piece accuracy
Mean accuracy across all models, one cell per (piece, question).
Pieces sorted by overall difficulty — hardest at top.
Click a piece to see every model's answers in the Examples tab.
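As a rough sketch of how the per-piece grid could be built: the Python below averages correctness over models for each (piece, question) cell, then orders pieces from hardest to easiest. The function name, the record layout, and the choice to define a piece's overall difficulty as the mean of its per-question accuracies are all assumptions for illustration.

```python
from collections import defaultdict


def per_piece_accuracy(records):
    """Aggregate accuracy per (piece, question) and rank pieces by difficulty.

    `records` is an iterable of (model, piece, question, correct) tuples;
    this layout is hypothetical. Returns (cells, order), where `cells` maps
    (piece, question) to mean accuracy across models and `order` lists
    pieces from hardest (lowest accuracy) to easiest.
    """
    totals = defaultdict(lambda: [0, 0])  # (piece, question) -> [correct, total]
    for _model, piece, question, correct in records:
        cell = totals[(piece, question)]
        cell[0] += bool(correct)
        cell[1] += 1
    cells = {key: c / n for key, (c, n) in totals.items()}

    # Assumption: overall difficulty is the mean of a piece's cell accuracies.
    by_piece = defaultdict(list)
    for (piece, _question), acc in cells.items():
        by_piece[piece].append(acc)
    overall = {p: sum(v) / len(v) for p, v in by_piece.items()}

    # Lowest accuracy first; ties are broken by piece name for stable output.
    order = sorted(overall, key=lambda p: (overall[p], p))
    return cells, order
```

Averaging per-question accuracies weights every question equally within a piece; pooling all answers instead would weight questions by how many models answered them, which may or may not match the dashboard's ordering.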