AITE M1.1-Art11 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Evaluation Architecture: Offline, Online, and Human



AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 11 of 35


Every AI system accumulates quality debt the moment it ships. The prompt drifts as new edge cases arrive and get patched. The retrieval corpus grows and its distribution shifts. The model gets upgraded by the provider, and the upgrade is not always an improvement for the specific workload. The tool schema gains a new entry, and the new tool surfaces a failure mode the old schema did not have. A system without an evaluation harness has no way to notice any of these shifts until a customer, a regulator, or an executive surfaces the problem in public. The evaluation harness is the instrument by which the architect keeps the system under observation — not a one-time test the team runs before launch. This article builds the harness from first principles and shows the architect how to sequence its components across the development lifecycle.

The four evaluation modes

Evaluation has four complementary modes, and a production-grade system runs all four.

Offline evaluation runs the system against a curated set of inputs with known expected outputs (or rubric-bound targets). Results are scored by automated metrics, LLM-as-judge, or a hybrid. Offline evaluation is the fastest, cheapest mode and runs on every commit in a CI pipeline. It catches regressions at the point they are introduced, before any user sees them.
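
As a concrete sketch, an offline CI gate can be as small as a loop over the golden set plus a pass/fail threshold. Everything here is illustrative: `run_system` stands in for the real pipeline, and the exact-match scorer would in practice be one metric among several.

```python
def run_system(query: str) -> str:
    # Placeholder for the system under test; a real harness calls the
    # deployed pipeline (prompt + retrieval + model) here.
    return {"capital of France?": "Paris"}.get(query, "unknown")

def offline_eval(golden_set, threshold: float = 0.9) -> dict:
    """Score the system on (query, expected) pairs; CI blocks the merge
    when the mean score falls below the threshold."""
    scores = [1.0 if run_system(query) == expected else 0.0
              for query, expected in golden_set]
    mean = sum(scores) / len(scores)
    return {"mean_score": mean, "passed": mean >= threshold}
```

A CI job would call `offline_eval` on every commit and exit non-zero when `passed` is false, which is the regression-blocking behavior described above.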

Online evaluation runs the system in production and scores a sample of live traffic. Online evaluation captures distribution shifts the offline golden set misses — new query patterns users are asking, new failure modes that emerge under scale, new side effects of integrations with downstream systems. Online evaluation is slower and more expensive than offline but is the only mode that ground-truths against real traffic.

Shadow and canary evaluation run a new version alongside the current production version on a subset of traffic. The comparison is recorded (often with the user seeing only the current version’s response) so the team can measure before-and-after quality on the exact same traffic. Shadow runs in silent mode; canary routes a small percentage (often 1–5%) of real traffic to the new version. Both are gated promotion mechanisms.
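
The routing and automatic-abort logic behind a canary reduces to a few lines. The 5% fraction and the error-rate tolerance below are illustrative defaults, not prescribed values.

```python
import random

def route(canary_fraction: float = 0.05) -> str:
    """Route a request: the canary version gets roughly canary_fraction of traffic."""
    return "canary" if random.random() < canary_fraction else "stable"

def should_abort(canary_errors: int, canary_total: int,
                 stable_error_rate: float, tolerance: float = 0.02) -> bool:
    """Abort the canary automatically when its observed error rate exceeds
    the stable version's rate by more than the tolerance."""
    if canary_total == 0:
        return False  # not enough canary traffic yet to judge
    return (canary_errors / canary_total) > stable_error_rate + tolerance
```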

Human review samples outputs for expert evaluation. Humans score the outputs against a rubric, validate LLM-as-judge calibration, flag novel failure modes, and produce the training data for the next round of automated metrics. Human review is the slowest and most expensive mode and is the only mode that closes the loop on quality definitions the automated systems cannot self-define. Article 12 develops human-review pipelines in depth.

Each mode has a different feedback latency. Offline returns results in minutes; online returns results in hours; shadow and canary return results in days; human review returns results in weeks. An evaluation architecture that relies on a single mode has blind spots proportional to the gaps between them.

The golden set

The golden set is the curated collection of inputs and expected outputs against which offline evaluation runs. It is not a benchmark dataset downloaded from a public source, and it is not a scraped sample of historical traffic. It is a deliberately constructed representation of the workload’s real distribution, with three design rules.

First, the golden set covers the distribution’s frequency head and tail. The most common query patterns are represented in proportion to their production frequency, and the rare-but-critical patterns (regulatory edge cases, high-value transactions, known prior-incident categories) are over-represented because they matter more than their frequency suggests.

Second, the golden set is versioned as code. Golden-set entries live in a Git repository, are reviewed in pull requests, and are changed only with team consensus. This discipline is non-negotiable — a golden set that drifts silently stops being evidence of quality.

Third, the golden set includes adversarial examples: known prompt-injection attempts, jailbreak patterns, toxicity probes, bias probes, and out-of-scope queries. The system’s behavior on adversarial inputs is as much a quality property as its behavior on benign inputs. Article 14 develops security architecture, which depends on the adversarial portion of the golden set being maintained.

A production golden set for a non-trivial use case has 500 to 5,000 entries across all these categories. A golden set smaller than 100 entries cannot meaningfully estimate quality; a golden set larger than 10,000 entries becomes expensive to maintain and to re-evaluate on every run.
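
In code, a golden-set entry and its pull-request-time checks might look like the following sketch. The field names and category labels are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GoldenEntry:
    """One versioned golden-set entry, mirroring the three design rules above."""
    id: str
    query: str
    expected: Optional[str]      # None when the target is rubric-bound instead
    category: str                # e.g. "head", "tail", "adversarial"
    rubric: Optional[str] = None

def validate(entries: list) -> bool:
    """Consistency checks a golden-set pull request would also enforce."""
    ids = [entry.id for entry in entries]
    assert len(ids) == len(set(ids)), "duplicate entry ids"
    assert any(entry.category == "adversarial" for entry in entries), \
        "golden set must keep its adversarial entries"
    return True
```

Running `validate` in CI on every pull request is one way to enforce the versioned-as-code discipline mechanically rather than by convention.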

Automated metrics

Three families of automated metrics cover most RAG and generation use cases.

Reference-based metrics compare the system’s output to a reference answer — BLEU, ROUGE, BERTScore, and exact-match. These metrics work when the expected output is well-defined (translation, structured extraction, summarization against a reference). They break down on open-ended generation where many correct answers exist.
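
Exact match and a token-overlap F1 (the flavor used in extractive QA scoring) are small enough to sketch directly; BLEU, ROUGE, and BERTScore come from libraries in practice.

```python
def exact_match(pred: str, ref: str) -> float:
    """1.0 when prediction and reference match after normalization."""
    return float(pred.strip().lower() == ref.strip().lower())

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between prediction and reference."""
    pred_tokens, ref_tokens = pred.lower().split(), ref.lower().split()
    common = sum(min(pred_tokens.count(t), ref_tokens.count(t))
                 for t in set(pred_tokens))
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```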

Retrieval metrics measure whether the retrieval stage returned relevant chunks — recall at k, mean reciprocal rank, precision at k, normalized discounted cumulative gain. These metrics are narrow and interpretable; they surface whether the retrieval stage is the cause of downstream quality problems.
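
Recall at k and mean reciprocal rank reduce to a few lines over lists of chunk ids; this sketch assumes the retrieval results and the relevance labels are both id lists.

```python
def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Fraction of relevant chunk ids found in the top-k retrieved list."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list, relevant: list) -> float:
    """Reciprocal rank of the first relevant chunk; 0.0 when none appears."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0
```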

Task-specific rubric metrics define quality in terms of the use case’s own dimensions — factual correctness, grounding in the retrieved context, completeness, safety, tone adherence, instruction following. Rubric metrics are scored by LLM-as-judge, by humans, or by hybrid pipelines. They are the most expensive and the most reflective of real quality for open-ended generation.

The architect combines metrics: retrieval metrics for the retrieval stage, reference-based metrics where a reference exists, rubric metrics for everything else. A dashboard that reports a single quality number hides more than it reveals; a dashboard that reports a vector of metrics lets the architect diagnose which sub-system caused a regression.
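
The diagnostic value of a metric vector shows up in a comparison sketch like this one, where the regressing metric names the sub-system to inspect. The metric names and tolerance are illustrative.

```python
def diagnose(baseline: dict, candidate: dict, tolerance: float = 0.02) -> list:
    """Name the metrics on which the candidate regressed past the tolerance.

    A per-metric vector, unlike a single quality number, points at the
    sub-system (retrieval vs. generation) behind a regression.
    """
    return [metric for metric, base_score in baseline.items()
            if candidate.get(metric, 0.0) < base_score - tolerance]
```

For example, a candidate whose `recall@5` drops while its groundedness holds steady points the investigation at the retrieval stage, not the prompt.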

LLM-as-judge

LLM-as-judge is an LLM (often a stronger model than the one under evaluation) prompted to score outputs against a rubric. It is the most powerful evaluation tool to emerge in the past two years because it scales rubric-based scoring that previously required humans. It is also the easiest to misuse. Article 12 covers LLM-as-judge mechanics, biases (position bias, verbosity bias, self-preference), and calibration against human labels in depth; for Article 11’s purposes, LLM-as-judge is one of the metrics in the harness, not the whole harness.

The architect uses LLM-as-judge for tasks where a rubric is definable, where human labels exist to calibrate the judge, and where the judge’s evaluation cost is a fraction of the system’s inference cost. The architect does not use LLM-as-judge on its own as the sole evaluation signal because the calibration drifts as models and prompts change.
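
Mechanically, a judge is a rubric-bearing prompt plus a strict parser. The template below is an illustrative sketch (the call to the stronger judge model is omitted), and the parser returns `None` on unparseable replies so they can be flagged rather than silently scored.

```python
import re

JUDGE_TEMPLATE = """You are grading an answer against a rubric.
Rubric: {rubric}
Question: {question}
Answer: {answer}
Reply with a single integer score from 1 to 5."""

def build_judge_prompt(rubric: str, question: str, answer: str) -> str:
    """Fill the rubric-scoring template for one output under evaluation."""
    return JUDGE_TEMPLATE.format(rubric=rubric, question=question, answer=answer)

def parse_judge_score(reply: str):
    """Extract the 1-5 score; None lets the caller flag unparseable replies."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None
```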

[DIAGRAM: TimelineDiagram — aite-sat-article-11-lifecycle-evaluation-timeline — Horizontal timeline spanning the development lifecycle from left to right: “Pre-deployment baseline (offline + human calibration)” → “Shadow evaluation (new version silent on 100% traffic)” → “Canary (new version on 1-5% traffic with automatic abort)” → “A/B experiment (split traffic, measure outcome metrics)” → “Ongoing regression (offline every commit + online sampling daily)”. Beneath each phase, annotations show which evaluation modes run, what the gate criteria are, and the typical duration (hours, days, weeks).]

Online experimentation

Once a version clears offline and shadow, online experimentation — a split-traffic test between the current and new versions — is the final evidence that the new version performs at least as well as the current one on real users and real outcomes. The test runs long enough to detect a meaningful effect size at the chosen confidence level; it does not run so long that the new version accumulates opportunity cost by serving only part of traffic.

The architect instruments outcome metrics that matter to the business — task completion rate, user escalation rate, time-to-answer, per-query cost — alongside the system-level quality metrics. A new version that is objectively higher-quality on the rubric but produces lower task completion (perhaps because its responses are longer and users abandon mid-read) is not actually an improvement for the business. The architect guards against the temptation to celebrate rubric-score wins that do not translate into outcome-metric wins.
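
Whether an outcome-metric difference is real is a standard two-proportion test. This sketch computes a two-sided p-value for a task-completion-rate difference using only the standard library; a production experiment platform would also handle sequential peeking and power analysis.

```python
from math import sqrt, erf

def two_proportion_z(success_a: int, n_a: int,
                     success_b: int, n_b: int) -> float:
    """Two-sided z-test p-value for a difference in completion rates."""
    rate_a, rate_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (rate_b - rate_a) / se
    # Standard normal CDF via erf; two-sided tail probability.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
```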

Safety evaluation

Safety is not a separate metric; it is a set of evaluation suites run against the same system with the same cadence as the quality suites. Toxicity, bias, harmful-instruction compliance, prompt-injection susceptibility, data-leak tests — each has its own golden set, its own metrics, and its own pass thresholds. The architect treats a safety regression as a release blocker regardless of the quality gain the new version delivers on other metrics.
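
The release-blocker rule reduces to a gate that fails when any safety suite scores under its threshold, whatever the quality metrics show. Suite names and thresholds below are illustrative.

```python
def release_gate(safety_results: dict, thresholds: dict) -> bool:
    """Block the release when ANY safety suite scores below its pass
    threshold, regardless of quality gains on other metrics."""
    failures = [suite for suite, score in safety_results.items()
                if score < thresholds.get(suite, 1.0)]
    return not failures
```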

Public safety benchmarks — RealToxicityPrompts, ToxiGen, bias-probe datasets, TruthfulQA, HELM’s safety suite — provide the starting points, but the use-case-specific safety suite is built from the prior-incident library and the team’s own red-team exercises.1 The benchmark gives coverage on known failure modes; the internal suite gives coverage on the specific failures this system has or could have.

[DIAGRAM: HubSpokeDiagram — aite-sat-article-11-evaluation-harness-hub-spoke — Hub labelled “Evaluation harness” with spokes radiating out to each component. Spoke 1: “Golden sets (versioned, reviewed, 500–5000 entries)”. Spoke 2: “Regression suite (offline CI, every commit)”. Spoke 3: “Safety suite (toxicity, bias, injection, leakage)”. Spoke 4: “Human review pipeline (expert scoring, judge calibration)”. Spoke 5: “Online experiments (A/B, shadow, canary)”. Spoke 6: “Metrics backend (observability stack, dashboards, alerts)”. Spoke 7: “Evidence pack (conformity assessment, audit trail)”. Each spoke annotated with its owner role and its cadence.]

Evaluation infrastructure

The harness is infrastructure. The architect specifies the runner (often a CI job or a dedicated evaluation service), the storage (recording runs, scores, golden-set versions, and the version of the system under test), the dashboards (per-version quality vectors, per-run drill-down), and the alerting (threshold-based regression alerts into the team’s paging system).

Open-source reference stacks for evaluation include MLflow’s evaluation APIs, LangSmith (commercial but with a free tier), Promptfoo, Ragas, DeepEval, and Arize Phoenix.2 Each covers a different slice — MLflow and Phoenix cover the experiment-tracking layer, Ragas and DeepEval cover the metrics layer, LangSmith and Promptfoo cover end-to-end orchestration. The architect picks one that covers the needed slice and composes as needed; none of them is a complete harness out of the box.

Two real-world examples

Stanford HELM. The Holistic Evaluation of Language Models project at Stanford published a multi-dimensional benchmark suite covering accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency across a large panel of models.3 HELM is the public reference for what “multi-metric evaluation” looks like at scale. The architectural point for the AITE-SAT learner is not that HELM is the right benchmark for every use case — it usually is not — but that the way HELM structures its evaluation (explicit scenarios, explicit metrics per scenario, explicit model coverage) is the way a production evaluation harness should be structured. Borrow the structure; substitute the specific scenarios and metrics that match the workload.

Arize Phoenix evaluation patterns. Arize AI’s Phoenix project, open-source under Apache 2.0, ships reference evaluation patterns for RAG quality (retrieval relevance, response relevance, hallucination detection, QA correctness) with runnable notebooks.4 The project is the vendor-neutral reference for how to compose LLM-as-judge metrics, trace-level evaluation, and experiment comparison into a working harness. The architectural point is that evaluation tooling is mature enough in 2025–2026 that no team has to build the harness from zero; the work is in picking the pieces and wiring them to the workload’s specific metrics.

MLflow Evaluation documentation. MLflow’s model-evaluation APIs cover classic ML metrics plus LLM-specific evaluators (QA metrics, toxicity, perplexity) and integrate with the broader MLflow experiment-tracking layer.5 MLflow is the vendor-neutral OSS reference for the experiment-tracking half of the harness. An architect who picks MLflow for model-artifact tracking gets evaluation tracking in the same tool.

Evaluation cadence and ownership

Every item in the harness has an owner and a cadence. Offline evaluation on the regression suite runs on every commit, owned by the development team, with failure blocking the merge. Safety evaluation runs on every commit for the change-prone surface (prompt templates, tool schemas, retrieval logic), owned by the security engineering function. Online evaluation sampling runs continuously, owned by the AI platform team, with daily summaries surfaced in the team’s dashboard. Human review runs weekly against a sampled and incident-triggered pool, owned by the quality function with reviewer management often outsourced to a qualified vendor. Quarterly evaluation planning — the session where the team revisits golden-set composition, metric definitions, and threshold calibrations — is owned by the architect because it is the point at which the harness itself is evolved rather than exercised.
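
Ownership can itself be made checkable. A registry sketch like this one (role names are illustrative) lets CI fail when a harness item is left without a named owner.

```python
# Hypothetical registry mapping each harness item to its owner and cadence,
# following the assignments described above.
HARNESS = {
    "regression_suite": {"owner": "development team",     "cadence": "every commit"},
    "safety_suite":     {"owner": "security engineering", "cadence": "every commit"},
    "online_sampling":  {"owner": "AI platform team",     "cadence": "continuous"},
    "human_review":     {"owner": "quality function",     "cadence": "weekly"},
    "eval_planning":    {"owner": "architect",            "cadence": "quarterly"},
}

def unowned(registry: dict) -> list:
    """Items with no named owner: the decay mode the text warns about."""
    return [item for item, meta in registry.items() if not meta.get("owner")]
```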

Without named ownership per item, evaluation decays into a collection of scripts nobody runs and reports nobody reads. Named ownership is the mechanism that keeps the harness alive through team turnover and organizational change.

Regulatory alignment

EU AI Act Article 15 requires that high-risk AI systems achieve “an appropriate level of accuracy, robustness and cybersecurity” across their lifecycle, and Article 12 requires that systems keep logs that allow the system’s functioning to be traced.6 Both articles imply an evaluation harness that measures accuracy and robustness on an ongoing basis, records the measurements, and preserves them for audit. The harness architecture is the architect’s primary piece of evidence for Article 15 conformity. ISO/IEC 42001 Clause 9 (performance evaluation) and Clause 10 (continual improvement) impose similar expectations on the management-system side; the harness is what the management system measures against.

Summary

The evaluation harness is the instrument by which the architect keeps the AI system under observation. The four modes — offline, online, shadow and canary, human review — have complementary feedback latencies and coverage. The golden set is versioned code, representative of production distribution, and includes adversarial and edge-case entries. Automated metrics, LLM-as-judge, and human review combine into a per-run quality vector rather than a single number. Online experimentation guards against rubric-score wins that do not translate into outcome-metric wins. Safety evaluation runs alongside quality evaluation, and a safety regression is a release blocker. Open-source tooling (MLflow, Phoenix, Ragas, DeepEval, Promptfoo) covers most of the harness; the architect composes rather than builds from scratch. Regulatory alignment under EU AI Act Articles 12 and 15 and ISO 42001 Clauses 9 and 10 depends on the harness being designed, documented, and continually exercised.

Further reading in the Core Stream: Continuous Evaluation of AI Systems and Testing and Validation for AI Systems.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Public safety benchmarks: RealToxicityPrompts (Gehman et al., 2020, arXiv 2009.11462). https://arxiv.org/abs/2009.11462 — accessed 2026-04-20. ToxiGen (Hartvigsen et al., 2022, arXiv 2203.09509). https://arxiv.org/abs/2203.09509 — accessed 2026-04-20. TruthfulQA (Lin et al., 2022, arXiv 2109.07958). https://arxiv.org/abs/2109.07958 — accessed 2026-04-20.

  2. Promptfoo. https://www.promptfoo.dev/ — accessed 2026-04-20. Ragas. https://docs.ragas.io/ — accessed 2026-04-20. DeepEval. https://docs.confident-ai.com/ — accessed 2026-04-20. LangSmith. https://docs.smith.langchain.com/ — accessed 2026-04-20. Arize Phoenix. https://docs.arize.com/phoenix — accessed 2026-04-20.

  3. Percy Liang et al., “Holistic Evaluation of Language Models” (HELM), Stanford CRFM. https://crfm.stanford.edu/helm/ — accessed 2026-04-20.

  4. Arize AI Phoenix project. https://github.com/Arize-ai/phoenix — accessed 2026-04-20.

  5. MLflow Evaluation documentation. https://mlflow.org/docs/latest/llms/llm-evaluate/index.html — accessed 2026-04-20.

  6. Regulation (EU) 2024/1689, Articles 12 and 15. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-20. ISO/IEC 42001:2023, Clauses 9 and 10. https://www.iso.org/standard/81230.html — accessed 2026-04-20.