AITE M1.1-Art52 v1.0 Reviewed 2026-04-06 Open Access

Lab 02: Build an LLM Evaluation Harness with Offline, Online, and Human Components



AITE-SAT: AI Solution Architecture Expert — Body of Knowledge Lab Notebook 2 of 5


Scenario

Your team is preparing to release DraftMate, an email-reply drafting assistant inside a mid-market customer service platform. The feature suggests three candidate replies when an agent opens a ticket; the agent accepts, edits, or rejects a draft. DraftMate is retrieval-augmented against a knowledge base of product articles and prior resolved tickets, and it is built on a closed-weight managed API for the generator and on an open-weight reranker for retrieval. You have inherited a prototype with no evaluation harness. The release gate is ten working days away. The release criterion is that the primary metric, accept-without-edit rate, improves over the legacy template-based system by at least 4 absolute percentage points on a held-out ticket sample, with no guardrail regression.

Your assignment is to build the evaluation harness end to end. The harness must run in continuous integration on every pull request that touches the prompt, the retriever, the index, or the generator choice, and it must produce a gate-readable result in under 20 minutes on the pull-request path and under two hours on the nightly path. It must also serve the online layer and the human-review layer after release.

Part 1: Golden set design and construction (45 minutes)

Produce a plan for the offline golden set. The plan covers:

  • Size and composition. Target size, segmentation by ticket category (billing, technical, account, refund, shipping, other), language distribution (at least English and one non-English language — Spanish, French, or German are acceptable choices), and tenant distribution (across at least three tenant sizes to prevent single-tenant overfit).
  • Construction protocol. How raw ticket-and-reply pairs are sampled from the production log, how personal data is redacted before an example enters the set, how the acceptable-reply annotation is produced (human-only, with an inter-rater agreement cadence), and how examples are retired when they become stale (for example, when the product changes and the reference reply no longer applies).
  • Anti-leakage discipline. How the golden set is excluded from any fine-tuning or few-shot-selection corpus; how a test-set-in-training check runs in CI; how a new release is blocked if the check flags overlap.
  • Versioning. How a golden-set version is pinned in a CI run, how a version change is approved, and how historical results remain comparable across version changes.
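The anti-leakage discipline above can be sketched as a small CI check. This is an illustrative implementation, not the lab's prescribed one; the field names (`id`, `ticket_excerpt`) and the shingle length are assumptions. It flags any golden-set example whose normalized character shingles also appear in the fine-tuning or few-shot-selection corpus, so the release can be blocked on overlap.

```python
# Hypothetical test-set-in-training CI check: flag golden-set examples whose
# normalized character k-grams also occur in the training / few-shot corpus.

def shingles(text: str, k: int = 30) -> set[str]:
    """Lowercased, whitespace-collapsed character k-grams of a text.

    Texts shorter than k yield a single (short) shingle and so can only
    match verbatim; that is acceptable for a leakage check.
    """
    t = " ".join(text.lower().split())
    return {t[i:i + k] for i in range(max(len(t) - k + 1, 1))}

def leaked_examples(golden: list[dict], corpus: list[str], k: int = 30) -> list[str]:
    """IDs of golden-set examples sharing any k-gram with the corpus."""
    corpus_shingles: set[str] = set()
    for doc in corpus:
        corpus_shingles |= shingles(doc, k)
    return [ex["id"] for ex in golden
            if shingles(ex["ticket_excerpt"], k) & corpus_shingles]
```

In CI, a non-empty return value fails the build, which is the "block the release on overlap" behavior the plan calls for.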

Document the set schema: the fields each example carries (ticket excerpt, retrieval context signals, reference reply, rubric-level annotations, metadata). Include a note on whether you will carry reference replies verbatim (preferred for compact comparisons) or whether the offline scorer will use semantic-similarity scoring against one or more reference replies.
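A minimal sketch of such a schema, assuming Python dataclasses; every field name and enumeration here is illustrative rather than prescribed by the lab:

```python
# Illustrative per-example golden-set record; field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class GoldenExample:
    example_id: str
    golden_set_version: str        # pinned by CI (see the versioning bullet)
    ticket_excerpt: str            # PII-redacted before admission to the set
    ticket_category: str           # billing | technical | account | refund | shipping | other
    language: str                  # e.g. "en", "es"
    tenant_size_band: str          # small | mid | large, to prevent single-tenant overfit
    retrieval_context_ids: list[str] = field(default_factory=list)
    reference_replies: list[str] = field(default_factory=list)          # one or more references
    rubric_annotations: dict[str, int] = field(default_factory=dict)    # grounding, helpfulness, ...
    retired: bool = False          # set when a product change makes the reference stale
```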

Expected artifact: DraftMate-Golden-Set-Plan.md.

Part 2: Offline scorer and LLM-as-judge pipeline (60 minutes)

Design the offline scoring pipeline. The pipeline ingests a candidate reply and a set of references and produces a score on each of five rubric dimensions: grounding (does the reply align with retrieved context?), helpfulness (does it address the ticket?), safety (does it refuse appropriately on out-of-policy asks?), style (does it match brand guidelines?), and factuality (does it avoid unsupported claims?). Specify:

  • The scoring mix. For each rubric dimension, say whether the scorer is deterministic (regex, schema validator, code-based checker), an LLM-as-judge call, or a hybrid. At least two dimensions must have a deterministic component to anchor the LLM-as-judge calibration.
  • The judge-model choice. Name the judge model, the prompt template, the output schema (a strict JSON envelope with a per-dimension numeric score and a free-text rationale), and the temperature and top-p settings. Name one alternative judge model for rotation.
  • The calibration protocol. How you measure judge-to-human agreement on a calibration subset of the golden set, at what agreement level you declare the judge fit for purpose (for example, κ ≥ 0.6 on a subset of 200 examples), and how you re-run calibration when the judge model updates.
  • The reproducibility controls. How the judge’s random seed, the retrieval snapshot, the prompt template, and the candidate-generation chain are captured in a single run record.
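The strict JSON envelope can be enforced with a small validator. This sketch assumes a 1–5 score scale (not fixed by the lab) and the five rubric dimensions named above, and treats any malformed output as a judge failure rather than silently repairing it:

```python
# Hypothetical validator for the judge's strict JSON envelope:
# per-dimension numeric scores plus a free-text rationale.
import json

DIMENSIONS = ("grounding", "helpfulness", "safety", "style", "factuality")

def parse_judge_envelope(raw: str) -> dict:
    data = json.loads(raw)          # malformed JSON raises -> counted as judge failure
    scores = data["scores"]
    for dim in DIMENSIONS:
        s = scores[dim]             # missing dimension raises KeyError
        if not isinstance(s, (int, float)) or not 1 <= s <= 5:
            raise ValueError(f"score for {dim!r} out of range: {s!r}")
    if not isinstance(data.get("rationale"), str) or not data["rationale"].strip():
        raise ValueError("missing free-text rationale")
    return {"scores": {d: float(scores[d]) for d in DIMENSIONS},
            "rationale": data["rationale"]}
```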

Write 300 words on the failure modes an LLM-as-judge can introduce (position bias, verbosity bias, style sycophancy, reasoning shortcut, self-preference when the judge and the candidate are the same model family) and the specific counter-measure in your pipeline for each.
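As one concrete example, the position-bias counter-measure is commonly implemented by scoring both orderings of a pairwise comparison and accepting the verdict only when they agree; `judge` below is a stand-in for the real LLM-as-judge call and is an assumption, not part of the lab:

```python
# Sketch: neutralize position bias by judging both orderings.
# `judge(ticket, first, second)` returns "A", "B", or "tie" for the
# reply shown first vs. second.
def debiased_pairwise_verdict(judge, ticket: str, reply_a: str, reply_b: str) -> str:
    first = judge(ticket, reply_a, reply_b)
    swapped = judge(ticket, reply_b, reply_a)
    # Map the swapped verdict back into the original A/B frame.
    swapped = {"A": "B", "B": "A", "tie": "tie"}[swapped]
    # A position-biased judge disagrees with itself across orderings;
    # disagreement yields no credit either way.
    return first if first == swapped else "tie"
```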

Expected artifact: DraftMate-Offline-Scorer-Spec.md.

Part 3: Online evaluation path and canary guardrails (45 minutes)

Design the online evaluation path. Specify:

  • The canary ramp protocol. The traffic-allocation schedule (for example, 1% → 5% → 25% → 50%), the minimum time per step, the per-step sample size required to read the primary metric at the team's pre-committed alpha and power, and the randomization unit.
  • The guardrail set. At least five guardrails, including latency p99, answer-acceptance rate, refusal rate, unsafe-content rate, and cost per ticket. For each, specify the baseline, the breach threshold, the breach window (point-in-time or rolling), and the action on breach (auto-rollback, page the owner, or slow the ramp).
  • The post-deployment monitoring. The online pipeline that samples live traffic for rubric-scoring, the cadence, the dashboard the on-call engineer watches, and the alerting destinations.
  • The rollback mechanics. The feature-flag mechanics and their per-request evaluation cost, the time-to-rollback SLO, and the data-plane effect of a rollback (for example, in-flight requests complete on the new path; new requests arrive on the old path).
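The per-step sample-size requirement in the canary ramp bullet can be estimated with the standard two-proportion approximation. The baseline rate below is illustrative; the 4-point uplift is the release criterion from the scenario:

```python
# Sketch: per-arm sample size to detect an uplift from p_base to p_new
# in a proportion metric, using the normal approximation.
from math import ceil
from statistics import NormalDist

def n_per_arm(p_base: float, p_new: float,
              alpha: float = 0.05, power: float = 0.8) -> int:
    z = NormalDist().inv_cdf
    z_a, z_b = z(1 - alpha / 2), z(power)
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_a + z_b) ** 2 * var / (p_new - p_base) ** 2)
```

With an illustrative 30% legacy accept-without-edit rate and the +4 absolute point gate, this works out to roughly 2,100 tickets per arm per step at α = 0.05 and 80% power, which is the kind of arithmetic the reviewer's "sample-size mathematics" check is looking for.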

Expected artifact: DraftMate-Online-Eval-Plan.md.

Part 4: Human-review console and disagreement adjudication (30 minutes)

Design the human-review console used by the content-quality team. Specify:

  • The sampling rate and stratification (for example, 2% of production traffic, stratified by ticket category and by the LLM-as-judge score quartile so that low-scoring cases are over-sampled).
  • The rater UX: what the rater sees (ticket excerpt, retrieval-context signals, candidate reply, reference replies), the rubric they apply, and the time budget per item.
  • The adjudication rule: what happens when the LLM-as-judge and the human rater disagree by more than one point on any dimension. Whose label enters the golden set. How the disagreement feeds the next calibration round.
  • The rater-quality controls: inter-rater agreement, hidden-gold audit items, training cadence, and how a rater is offboarded for disagreement drift.
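The inter-rater agreement control above can be computed with Cohen's kappa over two raters' labels on shared items; a real console would track this per rubric dimension and per rater pair, which this sketch omits:

```python
# Sketch: Cohen's kappa for two raters over the same items.
from collections import Counter

def cohens_kappa(labels_a: list, labels_b: list) -> float:
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    ca, cb = Counter(labels_a), Counter(labels_b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if p_e == 1.0:
        return 1.0   # degenerate: both raters use a single label throughout
    return (p_o - p_e) / (1 - p_e)
```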

Expected artifact: DraftMate-Human-Review-Console-Spec.md.

Final deliverable and what good looks like

Package the four artifacts into DraftMate-Evaluation-Harness.md with a one-page summary stating the release-gate readout format (a single go/no-go with the per-dimension subscores), the on-call responsibilities after release, and the cadence on which the harness itself is re-evaluated.

A reviewer will look for: a golden set with a concrete anti-leakage check; at least two deterministic rubric signals anchoring an LLM-as-judge layer; named calibration protocol with an agreement threshold; a canary plan with sample-size mathematics; and a human-review console with a documented disagreement rule. Harnesses that score only with LLM-as-judge and no deterministic anchor fail review; so do harnesses with guardrails but no breach window.


© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.