Evaluation · Updated 2026-04
AI Benchmark
Definition
An AI benchmark is a standardized test that measures and compares AI model performance on specific tasks.
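The idea can be sketched in a few lines: a benchmark is a fixed set of task/answer pairs, and a model's score is the fraction it answers correctly. The `model` callable and the toy tasks below are hypothetical stand-ins, not a real benchmark or API.

```python
# Minimal sketch of benchmark scoring, assuming the benchmark is a list
# of (prompt, expected_answer) pairs and `model` is any callable that
# maps a prompt to an answer string (a hypothetical stand-in here).

def evaluate(model, benchmark):
    """Return the model's accuracy over (prompt, expected_answer) pairs."""
    correct = sum(1 for prompt, expected in benchmark if model(prompt) == expected)
    return correct / len(benchmark)

# Toy benchmark with a trivial "model" that solves arithmetic prompts.
toy_benchmark = [("2+2", "4"), ("3*3", "9"), ("10-7", "3")]
toy_model = lambda prompt: str(eval(prompt))  # stand-in for a real model
print(evaluate(toy_model, toy_benchmark))  # 1.0
```

Because the test set is fixed and the scoring rule is deterministic, two models evaluated this way can be compared directly, which is what makes the results standardized.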
See also in the glossary
L
LLM (Large Language Model)
An LLM is an AI model trained on massive amounts of text, capable of understanding and generating human language.
F
Foundation Model
A foundation model is a large AI model pre-trained on massive data, adaptable to multiple tasks.
A
AI Hallucination
An AI hallucination is a response generated by an AI model that appears plausible but is factually incorrect or fabricated.
A
AI Reasoning
AI reasoning refers to a model's ability to break down a problem into logical steps to reach a conclusion, rather than answering instinctively.
Frequently Asked Questions
What are the most popular AI benchmarks?
MMLU (general knowledge), HumanEval (code generation), MATH (mathematics), HellaSwag (commonsense reasoning), and the LMSYS Chatbot Arena Elo (ranking from human votes).
Are benchmarks reliable?
Partially. Models can be optimized to score well on benchmarks (through overfitting or training-data contamination) without being better in practice. The Arena Elo ranking, based on blind human votes, is often considered the most representative.
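The Elo ranking mentioned above works by pairwise comparisons: after each human vote between two models, the winner's rating rises and the loser's falls, by an amount that depends on how surprising the result was. This is a sketch of the classic chess-style update with the standard K-factor and 400-point scale; real leaderboards may fit ratings differently (e.g. with a Bradley-Terry model), so treat it as illustrative.

```python
# Minimal sketch of an Elo update for arena-style model rankings,
# assuming the standard chess formula (K = 32, 400-point scale).

def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    """Return both models' updated ratings after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)
    score_a = 1.0 if a_won else 0.0
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Two equally rated models: the winner gains k/2 = 16 points.
print(elo_update(1000, 1000, a_won=True))  # (1016.0, 984.0)
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected win, which is why the ranking converges even from noisy individual votes.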