AITM M1.2-Art52 v1.0 Reviewed 2026-04-06 Open Access


Lab 02: Design an Evaluation Harness for a Retrieval-Augmented Feature — Transformation Design & Program Architecture — Applied depth — COMPEL Body of Knowledge.

7 min read · Article 52

AITM-PEW: Prompt Engineering Associate — Body of Knowledge Lab Notebook 2 of 2


Scenario

Your organisation runs LegalDesk, a retrieval-augmented feature that answers questions from the in-house legal team using the organisation’s policy library, past contracts, and decision memoranda. The feature has been live for four months. It was built by a small team that has since moved on. Quality has recently degraded: the legal team’s head of practice has raised concerns about answers citing chunks that are either irrelevant or, in one flagged case, drawn from a policy that has been superseded. Your assignment is to design the evaluation harness LegalDesk should have had from the start, run a first offline pass, build the dashboard that will carry the ongoing evaluation, and produce the runbook that tells the on-call engineer what to do when the harness alerts.

The team has given you the current prompt template, the retrieval configuration, a representative sample of one hundred anonymised production conversations from the last month, and a list of the legal team’s stated concerns.

Part 1: Dimension and case design (30 minutes)

Produce the harness specification covering the six dimensions from Article 8. For each dimension, specify:

  • The test-case population. Where the cases come from: a curated happy-path set, edge cases, adversarial probes, production-sample draws, or a combination. Give the target size of the test-case population per dimension.
  • The scoring method. Automated check, LLM-as-judge with prompted rubric, human review sample, or a combination. Justify the choice.
  • The threshold. What score on this dimension is acceptable for LegalDesk to remain in production? Thresholds should reflect that LegalDesk serves an internal expert user base that will catch obvious errors but relies on the system for accuracy on subtler questions.
  • The failure consequence. What happens when a dimension falls below threshold? Page the on-call engineer, alert the prompt owner, block the next release, or initiate a rollback.
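The four bullets above can be captured as a small data structure so the specification stays machine-checkable. A minimal Python sketch; the dimension name, population text, and threshold value are illustrative placeholders, not values the lab prescribes:

```python
from dataclasses import dataclass
from enum import Enum

class Consequence(Enum):
    PAGE_ONCALL = "page the on-call engineer"
    ALERT_OWNER = "alert the prompt owner"
    BLOCK_RELEASE = "block the next release"
    ROLLBACK = "initiate a rollback"

@dataclass
class DimensionSpec:
    name: str                 # one of the six dimensions from Article 8
    population: str           # where the test cases come from
    target_size: int          # cases per dimension
    scoring: str              # automated / LLM-as-judge / human sample
    threshold: float          # minimum acceptable score to stay in production
    consequence: Consequence  # what happens below threshold

# Example entry (figures illustrative only)
grounding = DimensionSpec(
    name="grounding",
    population="production draws + synthetic superseded-policy cases",
    target_size=15,
    scoring="automated citation-resolution check + human review sample",
    threshold=0.95,
    consequence=Consequence.PAGE_ONCALL,
)
```

A structure like this also makes the Part 1 deliverable trivially renderable as the required per-dimension table.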

Particular attention is required on grounding: LegalDesk’s flagged incident involved a citation to a superseded policy. Design at least one test-case category that would have detected that incident earlier, such as a periodic check that cited chunks resolve to documents still marked as current rather than archived.
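Such a periodic check can be sketched in a few lines. The function below is a hypothetical illustration: it assumes answers expose cited chunk IDs and that the corpus metadata carries a `status` field distinguishing current from archived documents; LegalDesk's actual schema may differ.

```python
def stale_citations(answer_citations, corpus_metadata):
    """Return the cited chunk IDs whose source document is no longer current.

    answer_citations: list of chunk IDs cited in an answer.
    corpus_metadata: maps chunk ID -> {"status": "current" | "archived"}.
    Both shapes are assumptions about how LegalDesk exposes its corpus.
    """
    return [c for c in answer_citations
            if corpus_metadata.get(c, {}).get("status") != "current"]
```

For example, `stale_citations(["p-101", "p-204"], {"p-101": {"status": "current"}, "p-204": {"status": "archived"}})` returns `["p-204"]`; a non-empty result is a grounding test-case failure.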

Deliverable: LegalDesk-Harness-Specification.md with a per-dimension table covering population, scoring, threshold, and failure consequence.

Part 2: Build and run the offline pass (30 minutes)

Populate the harness with the test cases specified in Part 1. You should end with at least fifty concrete test cases distributed across the six dimensions in the proportions your specification demands. Source test cases as follows:

  • Curate happy-path cases from the production sample where the answer was correct and the user was satisfied.
  • Curate edge and adversarial cases from published injection and hallucination taxonomies (OWASP LLM Top 10, MITRE ATLAS, plus the attack classes from Article 7).
  • Design grounding cases from the retrieval corpus: include at least three cases where the correct answer requires combining two chunks; at least three where retrieval returns irrelevant chunks and the system should refuse; at least three where a chunk that was once authoritative has been superseded (simulate this by marking the chunk archived in the corpus metadata).
  • Include stability cases by running the same happy-path input five times and computing variance.
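The stability measurement in the last bullet can be sketched as follows, with `ask` standing in for whatever callable invokes LegalDesk (an assumption for illustration, not an actual API):

```python
from collections import Counter

def stability_score(ask, prompt, runs=5):
    """Run the same input `runs` times and measure answer agreement.

    Returns the share of runs matching the modal answer: 1.0 means
    perfectly stable, 1/runs means every run produced a different answer.
    Exact-string agreement is the crudest variance measure; a semantic
    similarity comparison would be a natural refinement.
    """
    answers = [ask(prompt) for _ in range(runs)]
    modal_count = Counter(answers).most_common(1)[0][1]
    return modal_count / runs
```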

Run the offline pass. Record, per case and per dimension, the raw result, the pass/fail, and any anomalies. Produce a per-dimension summary.

Deliverable: LegalDesk-Offline-Pass-Results.csv with raw results and LegalDesk-Offline-Pass-Summary.md with the dimension-by-dimension summary.

Part 3: Dashboard design (20 minutes)

Design the dashboard that will present ongoing harness results to three audiences: engineering, product, and governance. The dashboard design does not require building a working dashboard; it requires specifying what each audience sees, at what cadence, and what triggers an alert. Deliver three mockups or wireframes:

  • Engineering dashboard. Shows all six dimensions with current value, threshold, trend sparkline, per-test-case drill-down, and the latency and cost metrics from Article 8. Shows the cadence of offline and online runs and the next scheduled run.
  • Product dashboard. Shows a rolled-up quality-trend chart, recent incidents, outstanding drift alerts, and a qualitative summary of the last week’s user feedback. Audience is the product owner, not the engineer, so detail is compressed.
  • Governance dashboard. Shows adversarial resistance, grounding, audit-trail completeness, and change-log summary over the period. Audience is the governance reviewer, so the orientation is compliance and residual-risk posture.

You may use ASCII art, tables, or a simple sketch tool. The dashboard is a specification artefact; a downstream engineer will build it.

Deliverable: LegalDesk-Dashboards.md with three named mockups, each annotated with the metrics displayed, the refresh cadence, and the alert thresholds.
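One way to make the refresh cadence and alert thresholds explicit in the mockup annotations is a small config fragment. Every key and figure below is an illustrative placeholder, not a value the lab prescribes:

```python
# Illustrative alert wiring for the three dashboards; all figures
# are placeholders to be replaced by the Part 1 thresholds.
DASHBOARD_ALERTS = {
    "engineering": {
        "refresh": "every offline/online run",
        "alerts": {"any_dimension_below_threshold": "page on-call",
                   "p95_latency_ms_above": 4000},
    },
    "product": {
        "refresh": "daily",
        "alerts": {"quality_trend_declining_for_days": 3},
    },
    "governance": {
        "refresh": "weekly",
        "alerts": {"adversarial_success_rate_above": 0.02,
                   "audit_trail_completeness_below": 0.99},
    },
}
```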

Part 4: Runbook (15 minutes)

Produce the on-call runbook that tells the engineer what to do when the harness alerts. The runbook addresses at least the following alert classes:

  • Correctness regression. The correctness dimension drops from its baseline. Runbook should include: immediate steps (confirm the regression, identify the last change), diagnosis steps (check model version, retrieval source version, prompt version), and escalation path.
  • Grounding regression. Citation accuracy drops. Runbook should address: the superseded-policy scenario specifically; whether a rollback of the retrieval index is needed; who to page if the retrieval index itself is suspect.
  • Safety / adversarial-success rate rise. Adversarial cases succeed at a higher rate than before. Runbook should address: whether to disable the feature pending investigation; the security owner to page; the communication duties if a user-facing incident has occurred.
  • Cost or latency regression. Tokens per request or p95 latency rise beyond threshold. Runbook should address: possible causes (prompt growth, retrieval growth, model change) and the triage sequence.
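For the cost-or-latency class, the triage sequence can be sketched as a deterministic check over telemetry. The field names and the 20% growth heuristic below are assumptions for illustration, not LegalDesk's real telemetry schema:

```python
def triage_cost_regression(current, baseline):
    """Order the likely causes of a token/latency regression for triage.

    current, baseline: dicts with "model", "prompt_tokens", and
    "retrieved_chunks" keys (an assumed telemetry shape).
    Returns triage steps in the order the on-call engineer should take them.
    """
    steps = []
    if current["model"] != baseline["model"]:
        steps.append("model changed: confirm the change was intended, else roll back")
    if current["prompt_tokens"] > baseline["prompt_tokens"] * 1.2:
        steps.append("prompt grew >20%: diff the prompt template versions")
    if current["retrieved_chunks"] > baseline["retrieved_chunks"]:
        steps.append("retrieval returns more chunks: check retriever config and top-k")
    return steps or ["no obvious cause: escalate to the prompt owner"]
```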

Each runbook entry names the responsible role, the tools the engineer uses, and the decision criteria. Vague runbooks are not useful runbooks; specifics matter.

Deliverable: LegalDesk-Runbook.md with entries for each alert class.

Reflection questions (10 minutes)

Write one paragraph on each of the following:

  1. The superseded-policy incident. Your Part 1 design should include a grounding check for archived chunks. Would the same check have caught the incident if the chunk were still marked current but the underlying policy had been superseded in substance? What additional control would close that gap, and where does that control live (in the retrieval pipeline, in the prompt, or in a human review process)?

  2. LLM-as-judge limits. LLM-as-judge scoring is convenient but imperfect; judges disagree with each other and with humans. Where in your Part 1 specification did you use LLM-as-judge, and how do you intend to validate that the judge is itself calibrated? Propose one concrete calibration test you could run.

  3. Harness as governance instrument. In the LegalDesk incident, the harness did not exist. The harness you have now designed is substantial. What risk is introduced by the harness itself, and what control would address that risk? Consider, for example, what happens if the harness’s test set becomes stale or if its adversarial probe set fails to track current attack techniques.
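For reflection question 2, one concrete calibration test is to double-label a sample of cases with both the LLM judge and a human reviewer, then compare. A minimal sketch, assuming boolean pass/fail labels; Cohen's kappa corrects raw agreement for the agreement expected by chance:

```python
def judge_agreement(judge_labels, human_labels):
    """Raw agreement and Cohen's kappa between an LLM judge and humans.

    judge_labels, human_labels: parallel lists of booleans (pass/fail).
    A kappa near 0 means the judge agrees with humans no better than
    chance, even if raw agreement looks high.
    """
    n = len(judge_labels)
    agree = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    p_judge = sum(judge_labels) / n
    p_human = sum(human_labels) / n
    chance = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    kappa = (agree - chance) / (1 - chance) if chance < 1 else 1.0
    return agree, kappa
```

A judge that passes everything scores kappa 0 against a balanced human sample, which is exactly the failure mode raw agreement hides.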

Final deliverable

A single governance-grade package named LegalDesk-Harness-Package.md combining all four artefacts, the three reflection paragraphs, and a one-page executive summary at the top naming: the feature, its current harness maturity, the top three residual risks even with the harness in place, and the recommended next investments in harness expansion. The package should run to approximately ten to fourteen pages and should be reviewable by the same governance committee that would see the PolicyGuide selection brief from Lab 01.

What good looks like

A reviewer will look for:

  • All six dimensions represented with distinct test-case populations and distinct scoring approaches.
  • At least fifty test cases with real inputs and real expected properties, not hypothetical placeholders.
  • A specific grounding investment that would have detected the superseded-policy incident.
  • Three audience-specific dashboards, not a single dashboard relabelled three times.
  • A runbook whose entries name specific tools, specific roles, and specific decision criteria.
  • A candid reflection on the harness’s own limits.

© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.