Benchmarks
Datasets for evaluation and monitoring
Import
Generate
Well-known Benchmarks
Import popular public benchmarks
GSM8K (Math Word Problems)
Grade-school math problems with free-form reasoning.
MMLU (Massive Multitask Language Understanding)
Multiple-choice questions across 57 subjects, from STEM to the humanities and social sciences.
ARC (AI2 Reasoning Challenge)
Challenging grade-school science questions in multiple-choice format.
TruthfulQA
Measures whether answers are truthful and informative on questions that invite common misconceptions.
HellaSwag
Commonsense inference with adversarial distractors.
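The benchmarks listed above fall into two scoring styles: free-form answers (GSM8K) and multiple choice (MMLU, ARC, HellaSwag, and the multiple-choice variants of TruthfulQA). The sketch below shows one way each style might be scored after import. It is illustrative only: the function names `extract_gsm8k_answer` and `multiple_choice_accuracy` are hypothetical and not part of any importer, and it assumes GSM8K reference solutions end with a `#### <number>` line, as in the public dataset release.

```python
import re
from typing import Optional, Sequence


def extract_gsm8k_answer(solution: str) -> Optional[str]:
    """Return the final numeric answer from a GSM8K-style solution.

    Assumes the solution ends with a line such as '#### 72'.
    Thousands separators are removed so '1,234' and '1234' compare equal.
    """
    match = re.search(r"####\s*(-?[\d,]*\.?\d+)", solution)
    if match is None:
        return None
    return match.group(1).replace(",", "")


def multiple_choice_accuracy(
    predictions: Sequence[str], answers: Sequence[str]
) -> float:
    """Fraction of predicted option labels that match the answer key.

    Comparison ignores case and surrounding whitespace.
    """
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(
        p.strip().upper() == a.strip().upper()
        for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


if __name__ == "__main__":
    reference = "She sold 48 + 24 = 72 clips in total.\n#### 72"
    print(extract_gsm8k_answer(reference))
    print(multiple_choice_accuracy(["A", "b", "C"], ["A", "B", "D"]))
```

Exact match on the extracted number is a common but strict choice for GSM8K; a production scorer would likely also normalize units and formatting in model output.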