AITE M1.1-Art72 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Artifact Template: LLM Evaluation Harness Specification



AITE-SAT: AI Solution Architecture Expert — Body of Knowledge Artifact Template


How to use this template

This template is the companion artifact to Template 1 Section 7 and to Lab 2. The harness specification is authored by the evaluation owner (often the solution architect in collaboration with an applied-science partner), reviewed by the governance owner, and committed alongside the architecture design document at the gate that authorizes implementation. The harness itself is a product — it ships, it runs, it is maintained — and the specification governs its evolution.

Every section is required. Sections that do not apply to the feature are completed with a one-paragraph statement of why. Empty sections are rejected in review.


LLM Evaluation Harness Specification — [Feature Name]

1. Identification and ownership

| Field | Value |
| --- | --- |
| Feature name | [link to architecture design document] |
| Harness name | [typically `<feature>-evals` or similar] |
| Evaluation owner | [single accountable individual] |
| Governance reviewer | [name and role] |
| Pull-request CI path | [repository, branch, CI workflow name] |
| Production telemetry path | [observability backend, dashboard ID or URL] |
| Authored | YYYY-MM-DD |
| Review cadence | [e.g., quarterly, or on prompt/model change] |

2. Primary and secondary metrics

Primary metric. The single metric that the release gate reads. Name, operational definition (numerator, denominator, window, filters), expected baseline from production, minimum detectable effect for the release change, alpha and power. A single primary metric is a design discipline; a harness that releases on a vote across many metrics is a harness that cannot be governed.

Secondary metrics. The short list of supporting metrics that the release readout includes, with rationale for each.

Guardrail metrics. The safety-net metrics whose breach forces a decision regardless of primary-metric performance. At minimum: latency p99, refusal rate, unsafe-content rate, cost per session. For each: breach threshold, breach window, action on breach.
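The primary-metric row asks for a minimum detectable effect, alpha, and power; those three numbers imply a sample size, and writing it down early catches golden sets and canary steps that are too small to decide anything. A minimal sketch, assuming the primary metric is a proportion compared between two arms (function name and example figures are illustrative, not targets from this template):

```python
from math import ceil
from statistics import NormalDist

def samples_per_arm(p_baseline: float, p_target: float,
                    alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size for a two-sided two-proportion z-test.

    p_baseline: expected primary-metric rate from production.
    p_target:   baseline plus the minimum detectable effect.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p_baseline * (1 - p_baseline) + p_target * (1 - p_target)
    effect = p_target - p_baseline
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Example: detect a 3-point lift on an 85% baseline at alpha 0.05, power 0.8.
print(samples_per_arm(0.85, 0.88))  # roughly 2,000 examples per arm
```

If the computed n exceeds what the golden set or a canary step can supply, either the minimum detectable effect is renegotiated or the ramp step is lengthened; the specification should record which.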

3. Offline layer

Golden set

| Field | Value |
| --- | --- |
| Size | [number of examples at launch] |
| Segmentation dimensions | [e.g., topic, language, tenant size, difficulty] |
| Construction protocol | [how examples are sampled, annotated, and maintained] |
| Anti-leakage discipline | [the check that prevents training-set overlap, CI-enforced] |
| Refresh cadence | [how often the set is grown, revised, retired] |
| Versioning | [pin format, approval for version change, historical comparability] |
| Owner | [name or team] |
| Storage location | [retention and access] |
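The anti-leakage row asks for a CI-enforced check against training-set overlap. One common shape for that check is exact matching on normalized text hashes; the sketch below assumes that approach (function names are illustrative, and a real check may add near-duplicate detection on top):

```python
import hashlib
import re

def _fingerprint(text: str) -> str:
    """Normalize whitespace and case before hashing, so trivial edits
    do not hide an overlap."""
    canonical = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def leaked_examples(golden_set: list[str], training_corpus: list[str]) -> list[str]:
    """Return golden-set examples whose normalized hash appears in the
    training corpus. A CI job would fail the build if this is non-empty."""
    train_hashes = {_fingerprint(t) for t in training_corpus}
    return [g for g in golden_set if _fingerprint(g) in train_hashes]

golden = ["How do I reset my password?", "Cancel my subscription"]
train = ["how do i  reset my password?", "unrelated document"]
print(leaked_examples(golden, train))  # ['How do I reset my password?']
```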

Scoring

For each rubric dimension, specify the scorer type (deterministic, LLM-as-judge, or hybrid). At minimum, two rubric dimensions must be anchored by a deterministic signal.

| Rubric dimension | Scorer type | Named implementation | Pass threshold |
| --- | --- | --- | --- |
| [Grounding] | Hybrid (citation check + LLM-as-judge) | [implementation name] | [e.g., ≥ 0.85 on a 1.0 scale] |
| [Helpfulness] | LLM-as-judge | [implementation name] | [threshold] |
| [Safety] | Deterministic (policy-match) | [implementation name] | [threshold] |
| [Style] | LLM-as-judge | [implementation name] | [threshold] |
| [Factuality] | Deterministic where possible | [implementation name] | [threshold] |
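To make the deterministic anchor concrete: the citation-check half of a hybrid grounding scorer can be a pure function of the answer and the retrieval results, with no model in the loop. A minimal sketch, assuming a `[doc-N]` citation syntax (the syntax and function name are illustrative):

```python
import re

def citation_grounding_score(answer: str, retrieved_ids: set[str]) -> float:
    """Deterministic half of a hybrid grounding scorer: the fraction of
    cited document IDs (e.g. '[doc-12]') that actually appear in the
    retrieval results for this turn."""
    cited = re.findall(r"\[(doc-\d+)\]", answer)
    if not cited:
        return 0.0  # an uncited answer cannot pass a grounding threshold
    return sum(1 for c in cited if c in retrieved_ids) / len(cited)

answer = "Refunds take 5 days [doc-3], except annual plans [doc-9]."
print(citation_grounding_score(answer, {"doc-3", "doc-7"}))  # 0.5
```

Because this signal is deterministic, a regression in it across harness runs is attributable to the system under test rather than to judge drift, which is what makes it a useful anchor.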

LLM-as-judge configuration

| Field | Value |
| --- | --- |
| Judge model (primary) | [named provider and model] |
| Judge model (alternate) | [for rotation, different provider or family] |
| Prompt template version | [pinned version] |
| Output schema | [JSON envelope with per-dimension numeric score and rationale] |
| Temperature, top-p | [for reproducibility] |
| Calibration protocol | [kappa or equivalent agreement with human labels, threshold, subset size] |
| Recalibration trigger | [on judge-model update, on quarterly cadence, on drift signal] |
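The output-schema row matters operationally: a judge that emits malformed JSON should be rejected and retried, not silently scored as zero. A minimal validation sketch, assuming an envelope with a per-dimension `score` and `rationale` (the dimension names are illustrative):

```python
import json

REQUIRED_DIMENSIONS = {"grounding", "helpfulness", "safety"}  # illustrative

def parse_judge_output(raw: str) -> dict[str, float]:
    """Validate the judge's JSON envelope: every required dimension must
    carry a numeric score in [0, 1] plus a non-empty rationale."""
    data = json.loads(raw)
    scores = {}
    for dim in REQUIRED_DIMENSIONS:
        entry = data.get(dim)
        if not isinstance(entry, dict):
            raise ValueError(f"missing dimension: {dim}")
        score, rationale = entry.get("score"), entry.get("rationale")
        if not isinstance(score, (int, float)) or not 0.0 <= score <= 1.0:
            raise ValueError(f"score out of range for {dim}: {score!r}")
        if not isinstance(rationale, str) or not rationale.strip():
            raise ValueError(f"empty rationale for {dim}")
        scores[dim] = float(score)
    return scores

raw = json.dumps({
    "grounding":   {"score": 0.9, "rationale": "All claims cited."},
    "helpfulness": {"score": 0.8, "rationale": "Answers the question."},
    "safety":      {"score": 1.0, "rationale": "No policy match."},
})
print(parse_judge_output(raw))
```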

CI integration

| Field | Value |
| --- | --- |
| Pull-request path runtime | [target ≤ 20 minutes] |
| Nightly path runtime | [target ≤ 2 hours] |
| Failure mode on CI | [block merge, warn but allow, waiver protocol] |
| Result storage | [where historical results persist] |

4. Online layer

Canary ramp

| Step | Traffic | Minimum time | Pre-condition to advance | Rollback trigger |
| --- | --- | --- | --- | --- |
| 1 | [e.g., 1%] | [e.g., 24 hours] | [guardrail clean, primary metric within expected band] | [guardrail breach or sustained primary-metric regression] |
| 2 | [5%] | […] | […] | […] |
| 3 | [25%] | […] | […] | […] |
| 4 | [50%] | […] | […] | […] |
| 5 | [100%] | […] | […] | […] |
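The ramp table is meant to be mechanical: nobody should be debating at 3 a.m. whether to advance. One way to see the logic the table encodes, as a sketch (names and the three-way outcome are illustrative):

```python
from enum import Enum

class Action(Enum):
    ADVANCE = "advance"
    HOLD = "hold"
    ROLLBACK = "rollback"

def canary_decision(guardrails_clean: bool, primary_in_band: bool,
                    hours_at_step: float, min_hours: float) -> Action:
    """Mechanical reading of the ramp table: a guardrail breach rolls back
    immediately regardless of other signals; advancing requires both the
    minimum soak time and a primary metric inside its expected band."""
    if not guardrails_clean:
        return Action.ROLLBACK
    if hours_at_step >= min_hours and primary_in_band:
        return Action.ADVANCE
    return Action.HOLD

print(canary_decision(True, True, 25, 24))   # Action.ADVANCE
print(canary_decision(False, True, 2, 24))   # Action.ROLLBACK
print(canary_decision(True, False, 30, 24))  # Action.HOLD
```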

Guardrails in production

Provide a detailed table with the same structure as §2, with the breach window and breach action specified per metric.

Online rubric sampling

| Field | Value |
| --- | --- |
| Sampling rate | [percentage of production traffic] |
| Stratification | [e.g., by ticket category and LLM-as-judge score quartile] |
| Scorer pipeline | [same as offline or a reduced variant] |
| Dashboard location | [URL] |
| Alert channels | [destinations] |
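Stratification is the row most often under-specified: a global sampling rate starves rare strata of coverage. A minimal per-stratum sketch, assuming sampling by a single key such as ticket category (names and the one-per-stratum floor are illustrative choices):

```python
import random
from collections import defaultdict

def stratified_sample(records, rate, key, seed=0):
    """Sample `rate` of traffic per stratum rather than globally, so thin
    strata (rare ticket categories, low score quartiles) still appear in
    the scored sample. A fixed seed keeps the draw reproducible."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * rate))  # floor of one per stratum
        sample.extend(rng.sample(group, k))
    return sample

records = [{"category": c, "id": i} for i, c in
           enumerate(["billing"] * 90 + ["fraud"] * 10)]
picked = stratified_sample(records, 0.05, key=lambda r: r["category"])
print(len(picked))  # 5: the fraud stratum gets one sample despite 5% of 10 < 1
```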

5. Human-review layer

| Field | Value |
| --- | --- |
| Console owner | [team] |
| Sampling rate | [percentage] |
| Stratification | [dimensions] |
| Rubric | [dimensions matching offline rubric] |
| Rater pool size | [count] |
| Inter-rater agreement target | [e.g., κ ≥ 0.7] |
| Hidden-gold cadence | [percentage of items that are pre-labeled gold for rater QA] |
| Disagreement adjudication rule | [who decides when judge and human disagree] |
| Feedback loop to golden set | [how adjudicated items feed the next set refresh] |
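The inter-rater agreement target is usually Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch for the two-rater case (the example labels are illustrative):

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: observed agreement minus chance agreement, scaled.
    Chance agreement uses each rater's own label frequencies."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    chance = sum((freq_a[lbl] / n) * (freq_b[lbl] / n) for lbl in freq_a)
    return (observed - chance) / (1 - chance)

a = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "fail"]
b = ["pass", "fail", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "fail"]
print(round(cohens_kappa(a, b), 2))  # 0.6 — raw agreement is 0.8, but chance is 0.5
```

Note the gap between raw agreement (0.8 here) and kappa (0.6): this is why the template asks for a kappa target rather than a raw-agreement target.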

6. Cost model

| Category | Estimate | Unit |
| --- | --- | --- |
| Offline scoring compute | [amount] | [dollars per run] |
| LLM-as-judge inference | [amount] | [dollars per run, with assumption on example count] |
| Online rubric-sampling inference | [amount] | [dollars per month] |
| Human-review hours | [amount] | [hours per month] |
| Annual harness operating cost | [amount] | [dollars per year] |

7. Reproducibility and governance

| Field | Value |
| --- | --- |
| Tracking system | [MLflow, W&B, Langfuse, Arize, Humanloop, or other] |
| Run record ID format | […] |
| Code pinning | [repository, branch, commit SHA discipline] |
| Data pinning | [golden-set version, retrieval-snapshot version] |
| Container pinning | [registry and SHA] |
| Regulatory anchors | [EU AI Act Articles 9, 12, 13, 14, 15; NIST AI RMF MEASURE; ISO 42001 Clause 9.1; other] |
| Retention | [duration, storage, access controls] |

8. Review and amendments

| Role | Name | Decision | Date |
| --- | --- | --- | --- |
| Evaluation owner | […] | Authored | YYYY-MM-DD |
| Peer reviewer | […] | Approved | YYYY-MM-DD |
| Governance reviewer | […] | Approved | YYYY-MM-DD |

Amendment log. Each entry: date, author, change summary, sections affected, re-approvals obtained.


Notes on use

When to use this template. Every production LLM feature. The discipline is the same at small scale and at large; the absolute targets differ.

Common errors in first-time use. Missing deterministic rubric anchor; LLM-as-judge without a calibration protocol; canary ramp without sample-size mathematics; guardrails without breach windows; human-review layer without a disagreement rule; cost model missing the judge-inference line. Reviewers treat these as blocking.

What follows. The harness specification feeds the feature’s release-gate readout and, through its offline and online metrics, the post-release monitoring posture. The harness is revisited when the model, the prompt, the retrieval corpus, or the tool inventory changes.


© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.