AITM-ECI: AI Experimentation Associate — Body of Knowledge Article 10 of 14
Large language model evaluation inherits all of classical ML evaluation and adds failure modes classical ML does not have. The output is unstructured text, not a scalar. The model is often closed-weight, so counterfactuals against training data are unavailable. The benchmarks it scores on may be contaminated by its own training data. The metric choice is contested, and the community has not settled on a canonical evaluation suite. An LLM evaluation harness that borrows from classical ML without adaptation will miss most of what matters. This article teaches the shape of an LLM evaluation harness that produces decision-grade evidence.
The five evaluation modes for LLMs
LLM evaluation practitioners use five distinct modes, and a mature harness uses all five.
Capability benchmarks. Standardized tests that probe specific capabilities: MMLU (general knowledge), HumanEval (code generation), GSM8K (math reasoning), BIG-Bench (multiple categories), HELM (holistic, many subtasks), MT-Bench (multi-turn conversation). Capability benchmarks are the sharpest tool for comparing models across vendors on a fixed yardstick[1][2].
Regression sets. Curated input-output pairs specific to the deployed feature. The regression set is the practitioner’s hedge against prompt and model changes silently breaking known-good behaviors. Regression sets grow over time as incidents surface new failure cases.
Safety sweeps. Batteries of adversarial inputs probing failure modes — prompt injection, jailbreaks, harmful-content generation, sensitive-information leakage. Article 11 develops safety-sweep design in depth.
Human review. A sample of production conversations or outputs reviewed by a trained human against a rubric. Human review is the only mode that reliably catches “the model is technically correct but practically useless” failures.
LLM-as-judge. A separate LLM scores outputs against a rubric. Less reliable than human review but orders of magnitude cheaper; suitable for large-scale screening with periodic human calibration. Zheng et al. (MT-Bench, NeurIPS 2023) documented LLM-as-judge as a viable scalable technique when calibrated against human preferences[2].
The modes are complementary. A harness that uses only benchmarks is gameable (Goodhart’s law on public benchmarks). A harness that uses only human review does not scale. A harness that uses only LLM-as-judge inherits the judge’s biases. The composition is the discipline.
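The composition can be sketched as a small harness that runs each mode as a named scorer and reports per-mode results without collapsing them into a single number. This is a minimal illustration; the mode names, record shapes, and scoring lambdas are hypothetical, not part of any standard framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalMode:
    """One evaluation mode: a name plus a scorer over a shared output log."""
    name: str
    run: Callable[[list[dict]], float]  # returns a score in [0, 1]

def run_harness(modes: list[EvalMode], outputs: list[dict]) -> dict[str, float]:
    """Run every mode and report per-mode scores. No single aggregate is
    emitted, because the composition (not any one score) is the evidence."""
    return {mode.name: mode.run(outputs) for mode in modes}

# Two trivial stand-in modes: exact-match regression and a judge-score average.
regression = EvalMode(
    "regression",
    lambda outs: sum(o["expected"] == o["actual"] for o in outs) / len(outs),
)
judge = EvalMode(
    "llm_judge",
    lambda outs: sum(o.get("judge_score", 0.0) for o in outs) / len(outs),
)

outputs = [
    {"expected": "4", "actual": "4", "judge_score": 1.0},
    {"expected": "Paris", "actual": "Lyon", "judge_score": 0.0},
]
print(run_harness([regression, judge], outputs))  # {'regression': 0.5, 'llm_judge': 0.5}
```

A real harness would add safety-sweep and human-review modes behind the same interface, so a new mode is one registration rather than a new pipeline.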
[DIAGRAM: HubSpokeDiagram — aitm-eci-article-10-llm-eval-hub — Central hub “LLM Evaluation” with spokes: benchmark suite, regression set, safety red-team, human review panel, LLM-as-judge, monitoring. Each spoke names a typical cadence and ownership.]
Benchmark contamination
Public LLM benchmarks have a specific failure mode: contamination. The test set has leaked into training data. The model has seen the evaluation inputs (and sometimes the answers) during pretraining. Benchmark scores become inflated by memorization rather than by capability.
Contamination is not hypothetical. Sainz et al. (EMNLP 2023) documented systematic contamination of widely used benchmarks across widely used models, and showed that a model’s score on a contaminated benchmark can be 10–30 points higher than its score on a held-out sibling of the same benchmark[3]. Stanford HELM maintains a contamination-aware scoring track that scores models on benchmarks with known contamination status separately[1].
Three defenses are available to practitioners.
Prefer contamination-resistant benchmarks. Benchmarks released after the model’s training cutoff. Benchmarks whose authors actively rotate questions. Benchmarks whose answers are not publicly posted.
Create private benchmarks. A curated evaluation set that never leaves the organization’s storage and cannot have been present in any public training corpus. The Stanford HELM project, the Google DeepMind evaluation suites, and several enterprise programs maintain private benchmarks for this reason.
Use held-out regression sets. The practitioner’s own regression set, built from production traffic or from curated cases, is by construction never in a training corpus unless the practitioner leaked it.
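A crude version of the overlap check underlying these defenses can be sketched as an n-gram screen: flag an evaluation item whose n-grams appear verbatim in a corpus sample. This is a toy illustration; production contamination detection (as studied by Sainz et al.) uses careful tokenization and indexed corpora, and the 8-gram size here is an assumption, not a standard.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Whitespace-token n-grams of the text (lowercased)."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_ratio(item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams appearing verbatim in any corpus document.
    High overlap flags the item as possibly memorized rather than solved."""
    item_grams = ngrams(item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set().union(*(ngrams(d, n) for d in corpus_docs))
    return len(item_grams & corpus_grams) / len(item_grams)

doc = "the quick brown fox jumps over the lazy dog near the river bank today"
hit = overlap_ratio("quick brown fox jumps over the lazy dog near the river", [doc])
miss = overlap_ratio("an entirely novel benchmark question about tidal energy storage systems", [doc])
print(hit, miss)  # 1.0 0.0
```

The same screen, run against the organization’s own regression set before publication, is how a practitioner avoids leaking the held-out set back into public corpora.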
The LLM-as-judge pattern and its hazards
LLM-as-judge uses a separate LLM to score outputs. The pattern is cheap, fast, and scalable, but it carries several specific hazards.
Judge bias. The judge model tends to prefer outputs in the style of its own training distribution. A GPT-class judge may prefer GPT-class outputs. Calibration against human judgments, and use of multiple judges from different providers, mitigates this.
Position bias. When the judge is asked to compare two outputs, it tends to prefer the first (or last) regardless of content. Randomizing position per query and computing agreement across random-order runs is the standard correction[2].
Self-enhancement bias. A model evaluating its own outputs scores them more favorably than an independent judge does. The defense is not to use the same model to evaluate itself.
Rubric underspecification. A rubric like “rate helpfulness from 1 to 5” produces inconsistent scores because the scale is underspecified. A rubric with anchored examples at each level produces more consistent scores.
Frameworks implementing LLM-as-judge patterns with these corrections are widely available: Ragas for RAG evaluation, DeepEval, promptfoo, LangSmith’s evaluation harness, Humanloop, Langfuse, Braintrust, OpenAI Evals, Anthropic Workbench, W&B Weave[4]. The choice of framework is less important than the discipline of running it with the corrections.
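The position-bias correction can be sketched as a both-orders protocol: run the judge with the candidates in each order and accept a verdict only when it survives the swap, as in the MT-Bench protocol of Zheng et al. Here `judge_fn` is an assumption standing in for any LLM judging call, not a real API.

```python
from typing import Callable

def debiased_compare(judge_fn: Callable[[str, str, str], str],
                     prompt: str, out1: str, out2: str) -> str:
    """Run the judge in both orders; accept the verdict only when it is
    order-invariant, otherwise record a tie (position-bias correction)."""
    forward = judge_fn(prompt, out1, out2)   # out1 shown in slot A
    backward = judge_fn(prompt, out2, out1)  # out1 shown in slot B
    if forward == "A" and backward == "B":
        return "out1"
    if forward == "B" and backward == "A":
        return "out2"
    return "tie"  # judge preferred the same slot both times: position bias

# A deliberately biased toy judge that always prefers the first slot:
always_first = lambda prompt, a, b: "A"
print(debiased_compare(always_first, "q", "x", "y"))  # tie
```

A content-sensitive judge whose preference tracks the outputs rather than their slots passes the swap and yields a usable verdict; a slot-biased judge is neutralized into ties rather than silently skewing the win rate.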
Human-review harness design
Human review is expensive and irreplaceable. The harness design makes it feasible at scale.
Sampling. Review cannot be complete. Sampling strategies include stratified random (ensure coverage across user segments, time windows, input categories), uncertainty-based (review outputs where model confidence or LLM-as-judge score is mid-range), and incident-driven (review outputs associated with complaints, rollbacks, guardrail hits).
Rubric. A structured rubric turns human judgment into data. A typical rubric has 3–6 dimensions (correctness, groundedness, safety, tone, helpfulness, brevity) each scored on a 3- or 5-point scale with anchored examples.
Calibration. Multiple reviewers score the same items to measure inter-rater agreement. Agreement below a threshold (Cohen’s kappa < 0.6) is a signal the rubric is underspecified. Calibration is done at the start of a review program and repeated periodically.
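Cohen’s kappa itself is a short computation: observed agreement corrected for the agreement two raters would reach by chance given their individual score distributions. A self-contained sketch, with illustrative rater labels:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters over the same items:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two reviewers scoring ten outputs on a 3-point scale:
a = ["good", "good", "bad", "ok", "good", "bad", "ok", "good", "bad", "ok"]
b = ["good", "ok",   "bad", "ok", "good", "bad", "good", "good", "bad", "ok"]
print(round(cohens_kappa(a, b), 3))  # 0.697
```

Here the raters agree on 8 of 10 items (0.8 observed) but chance alone predicts 0.34, giving kappa ≈ 0.70, above the 0.6 threshold; raw percent agreement alone would have overstated the rubric’s reliability.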
Feedback loop. Findings from human review flow into the regression set (new failure cases become regression tests) and into the safety sweep (new adversarial classes become safety tests). The loop is the mechanism by which the harness learns.
Grounding evaluation for retrieval-augmented systems
Retrieval-augmented systems have an evaluation requirement classical ML does not: groundedness. Did the model’s output remain within the information retrieved, or did it extrapolate beyond it?
Grounding metrics come in two families.
Extraction-based. The output is compared to the retrieved context, and a score reflects how much of the output’s factual content is attributable to the context. Faithfulness (Ragas), context-precision, context-recall, and several similar metrics implement this[4].
Citation-based. The output includes citations to its source passages, and a score reflects whether the cited passages actually support the claims they are cited for. Citation metrics are the gold standard but require outputs to be structured to support citation.
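A toy version of an extraction-based groundedness score can be sketched as sentence-level token overlap with the retrieved context. Real faithfulness metrics (such as Ragas’s) decompose the output into claims with an LLM; the whitespace tokenization, short-token filter, and 0.6 threshold here are arbitrary illustrations.

```python
def grounded_fraction(output: str, context: str, threshold: float = 0.6) -> float:
    """Fraction of output sentences whose content words mostly appear in the
    retrieved context. Toy stand-in for an LLM-based faithfulness metric."""
    ctx_tokens = set(context.lower().split())
    sentences = [s.strip() for s in output.split(".") if s.strip()]

    def supported(sentence: str) -> bool:
        # Keep tokens longer than 3 chars as a crude content-word filter.
        toks = [t for t in sentence.lower().split() if len(t) > 3]
        return bool(toks) and sum(t in ctx_tokens for t in toks) / len(toks) >= threshold

    return sum(supported(s) for s in sentences) / len(sentences) if sentences else 1.0

context = "the refund policy allows returns within thirty days of purchase"
output = ("Returns are allowed within thirty days of purchase. "
          "Shipping is always free worldwide.")
print(grounded_fraction(output, context))  # 0.5: second sentence is ungrounded
```

Even this crude score, tracked over time on a fixed sample, surfaces the degradation pattern the section describes: a drop in groundedness after a retrieval or prompt change, before users report hallucinations.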
Grounding evaluation is the right mode for most customer-facing generative AI applications, because most of the user-visible failures are hallucinations or ungrounded claims. A harness that tracks groundedness over time catches degradation before it becomes an incident.
Cadence and ownership
LLM evaluation runs on a multi-cadence schedule.
- On every prompt change or model change. Regression set and safety sweep. Automated, under 30 minutes.
- On every canary promotion. Benchmark run on the canary model against relevant capability benchmarks. Automated, under 2 hours.
- Weekly. Human review of a production sample. Manual, typically 2–4 hours of reviewer time per week for a feature.
- Monthly. Full-harness run including capability benchmarks, private benchmarks, regression set, safety sweep, and a human-review batch. Automated plus manual.
Ownership: the ML engineer owns regression and benchmark components; a product or domain expert owns the human-review rubric; a safety or security lead owns the safety sweep. The coordination belongs on the experiment brief (Article 14).
[DIAGRAM: MatrixDiagram — aitm-eci-article-10-llm-eval-modes — 2x2 of “Human involvement (low vs. high)” by “Automation scale (low vs. high)” placing capability benchmarks, regression sets, LLM-as-judge, and safety sweeps in the low-human/high-automation quadrant and human review in the high-human/low-automation quadrant.]
Regulatory alignment
The NIST AI 600-1 Generative AI Profile enumerates the MEASURE subcategories most relevant to LLM evaluation, including benchmark performance under prospective and operational conditions, safety evaluation, and monitoring[5]. The EU AI Act places LLMs that qualify as general-purpose AI models with systemic risk under specific evaluation and reporting obligations (Article 51 et seq.), including adversarial testing and serious-incident reporting[6]. A practitioner’s harness is the evidentiary substrate demonstrating that these obligations are met.
Two real references in the LLM evaluation vocabulary
Stanford HELM. Liang et al. (TMLR 2023) introduced HELM as a holistic evaluation framework reporting across many metrics (accuracy, calibration, robustness, fairness, bias, toxicity, efficiency) on many subtasks[1]. HELM’s methodological contribution is the argument that a single benchmark number is insufficient; evaluation must be multi-metric and multi-task.
UK AI Safety Institute (AISI) frontier-model evaluations. The UK AISI has run a government-scale evaluation program on frontier models, with published methodology and partial results[7]. The program is a real-world reference for large-scale safety evaluation and is the sort of program a regulated enterprise can point to when arguing for comparable in-house practice.
Summary
LLM evaluation uses five modes: capability benchmarks, regression sets, safety sweeps, human review, LLM-as-judge. A mature harness composes all five. Benchmark contamination is a specific hazard; private benchmarks and held-out regression sets are the defense. LLM-as-judge has known biases (judge, position, self-enhancement) with standard corrections. Human review requires sampling, rubrics, calibration, and feedback loops. Grounding metrics are mandatory for retrieval-augmented systems. The harness runs on a multi-cadence schedule; ownership is distributed across ML, product, and safety roles. The regulatory anchors determine the evidence the harness must produce. Stanford HELM and UK AISI are reference programs. Article 11 develops safety and red-team experimentation in depth.
Further reading in the Core Stream: Evaluating Agentic AI: Goal Achievement and Behavioral Assessment.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes
1. Percy Liang et al. Holistic Evaluation of Language Models (HELM). Transactions on Machine Learning Research, 2023. https://crfm.stanford.edu/helm/ — accessed 2026-04-19.
2. Lianmin Zheng et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023. https://arxiv.org/abs/2306.05685 — accessed 2026-04-19.
3. Oscar Sainz et al. NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination. EMNLP 2023. https://aclanthology.org/2023.findings-emnlp.722/ — accessed 2026-04-19.
4. Ragas, DeepEval, promptfoo, LangSmith, Humanloop, Langfuse, Braintrust, OpenAI Evals, Anthropic Workbench, W&B Weave documentation. — accessed 2026-04-19.
5. Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, July 2024. National Institute of Standards and Technology. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf — accessed 2026-04-19.
6. Regulation (EU) 2024/1689 (EU AI Act), Articles 51–56 (general-purpose AI models). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-19.
7. UK AI Safety Institute — frontier AI evaluation program. https://www.aisi.gov.uk/work — accessed 2026-04-19.