AITM-ECI: AI Experimentation Associate — Body of Knowledge Lab Notebook 1 of 2
Scenario
Your organization is an insurance company piloting an AI-assisted claims triage feature called TriagePilot. The product team has trained a gradient-boosted classifier that predicts, from a claim’s initial text description, metadata fields, and customer history features, whether the claim is likely to be routine (can be fast-tracked), complex (needs human adjuster attention), or potentially fraudulent (needs investigation). The training dataset has 2.4 million historical claims spanning 6 years of customer interactions. The team plans to deploy the model in recommendation mode (the model’s output is surfaced to the human adjuster, who makes the final decision), first on a shadow basis and then with a gradual ramp.
Your role is the experimentation practitioner assigned to the pre-deployment offline evaluation. Your deliverables, produced across the four parts of this lab, will form the offline evaluation package that goes to the deployment gate.
Part 1: Split design (30 minutes)
Produce a split design document describing how you will partition the 2.4M claims dataset into training, validation, and test slices. Your design must address the dataset’s structure.
- Temporal structure. The data spans 6 years. How will you respect temporal order?
- Group structure. The dataset has many claims per customer. How will you prevent identity leakage?
- Class imbalance. Routine claims dominate; fraud claims are <2%. How will you maintain class proportions in the test slice?
- Seasonal effects. Claim types shift seasonally (weather events, year-end deductible-reset effects). How will you ensure the test slice reflects seasonal variety?
For each of the four considerations, record:
- The specific partitioning rule you will apply.
- The resulting per-slice size (approximate).
- The risk the rule does not address and what complementary control will.
Conclude with a diagram or table showing the final split structure.
Expected artifact: Split-Design.md (one to two pages), with the four considerations addressed explicitly and the final split structure illustrated.
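A group-aware temporal split of this kind can be sketched in a few lines. Everything below is synthetic and illustrative — the schema, the 2023 cutoff date, and the class weights are stand-ins, not the real TriagePilot data. The sketch applies a single temporal cutoff, keeps each customer in exactly one slice, and checks that the fraud rate survives the split:

```python
import random
from datetime import date, timedelta

random.seed(0)

# Synthetic stand-in for the claims table: (claim_id, customer_id, filed_date, label).
claims = []
start = date(2018, 1, 1)
for claim_id in range(10_000):
    customer_id = random.randrange(2_000)  # many claims per customer
    filed = start + timedelta(days=random.randrange(6 * 365))
    label = random.choices(["routine", "complex", "fraud"],
                           weights=[0.85, 0.13, 0.02])[0]
    claims.append((claim_id, customer_id, filed, label))

# Rule 1 (temporal): everything filed on or after the cutoff is test-only.
cutoff = date(2023, 1, 1)

# Rule 2 (identity): a customer appears in exactly one slice. We assign
# customers, not claims, so no customer straddles train and test.
test_customers = {c for (_, c, filed, _) in claims if filed >= cutoff}

train, test = [], []
for row in claims:
    _, customer_id, filed, _ = row
    if customer_id in test_customers:
        # Pre-cutoff claims of test customers are dropped, not leaked
        # into training.
        if filed >= cutoff:
            test.append(row)
    else:
        train.append(row)

# Rule 3 (class balance): verify the fraud rate survives the split.
def fraud_rate(rows):
    return sum(r[3] == "fraud" for r in rows) / len(rows)

print(f"train={len(train)}  test={len(test)}")
print(f"fraud rate  train={fraud_rate(train):.3f}  test={fraud_rate(test):.3f}")
```

Note the cost this design makes explicit: assigning whole customers to the test slice discards their pre-cutoff history from training, which is part of the per-slice size arithmetic your document must record.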
Part 2: Leakage audit (25 minutes)
Apply the five-class leakage taxonomy from Article 3 (target, temporal, identity, duplicate, preprocessing) to the feature list below. For each feature, record the leakage mode it is at risk for, the concrete failure the mode would produce, and the specific check or transformation that prevents it.
Feature list:
- days_since_first_claim — days between the customer’s first historical claim and the claim being predicted on.
- customer_total_claims_count — count of all claims the customer has ever filed.
- claim_amount_vs_customer_average — ratio of the current claim’s amount to the customer’s average historical claim amount.
- adjuster_action_flag — a flag that is set after human adjuster review.
- neighborhood_fraud_rate — a 30-day rolling average of fraud rates in the claim’s postal code.
- days_to_resolution — days between claim filing and resolution.
- claim_id_hash — a 128-bit hash of the claim ID, intended as an anonymization stand-in.
- policy_active_flag — whether the policy was active on the claim date.
- customer_segment_embedding — a 32-dimensional vector produced by a separate pipeline that runs nightly.
Include a one-paragraph reflection: which feature do you think is the highest-risk leakage case, and why? This paragraph becomes part of the evaluation package and will be specifically asked about in the review.
Expected artifact: Leakage-Audit.md with a table row per feature and the reflection paragraph.
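For several of these features, the preventive transformation your audit should name is a point-in-time cutoff: the feature is computed only from records that existed before the claim being predicted on was filed. A minimal sketch, using a hypothetical three-claim history for one customer (the data and function name are illustrative, not part of the TriagePilot pipeline):

```python
from datetime import date

# Hypothetical mini-history for one customer:
# (customer_id, filed_date, resolved_date).
history = [
    (7, date(2022, 1, 5),  date(2022, 2, 1)),
    (7, date(2022, 6, 9),  date(2022, 7, 3)),
    (7, date(2023, 3, 14), date(2023, 4, 2)),
]

def claims_count_asof(customer_id, asof, rows):
    """Point-in-time version of customer_total_claims_count: count only
    claims filed strictly before the prediction claim's filing date."""
    return sum(1 for (c, filed, _) in rows
               if c == customer_id and filed < asof)

# Predicting on the 2023-03-14 claim: a naive "all claims ever filed"
# count would return 3 (including the claim itself); the point-in-time
# version returns 2.
print(claims_count_asof(7, date(2023, 3, 14), history))  # → 2
```

The same as-of discipline generalizes to the ratio and rolling-rate features; the audit row for each should name the cutoff date it is computed against.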
Part 3: Slice evaluation plan (30 minutes)
Design a slice evaluation for the TriagePilot test set. The slice plan must cover at least the dimensions below, plus at least two additional dimensions you consider relevant for this feature.
Required dimensions:
- Geographic slice (claim’s province or equivalent).
- Temporal slice (most recent 3 months versus earlier period).
- Claim-amount slice (quintiles of claim amount).
- Fraud-class slice (evaluated separately because of low base rate).
For each dimension, specify:
- The slice boundaries.
- The minimum sample size per slice that the test set must preserve (otherwise the slice metric is not reliable).
- The metric you will report per slice (the primary aggregate metric, at minimum; additional slice-specific metrics if appropriate).
- The rule for flagging a per-slice regression against the production baseline.
Include a paragraph explaining which two additional slices you added and why those slices are relevant for TriagePilot specifically (a customer-facing insurance claims product, regulated in most jurisdictions, with a non-trivial fraud-detection component).
Expected artifact: Slice-Evaluation-Plan.md with a table of slices and the justification paragraph.
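The minimum-sample-size and regression-flagging rules combine into a simple gating loop. The sketch below uses synthetic data; the province weights, the 200-example floor, the 0.90 baseline, and the 2-point margin are illustrative placeholders, not prescribed thresholds:

```python
import random

random.seed(1)

MIN_SLICE_N = 200          # below this, the slice metric is reported but not gated
BASELINE = 0.90            # production aggregate accuracy (illustrative)
REGRESSION_MARGIN = 0.02   # flag a slice more than 2 points below baseline

# Synthetic predictions: (province, prediction-correct?) pairs.
provinces = ["ON", "QC", "BC", "YT"]
rows = [(random.choices(provinces, weights=[50, 30, 19, 1])[0],
         random.random() < 0.9)
        for _ in range(5_000)]

report = {}
for prov in provinces:
    hits = [ok for (p, ok) in rows if p == prov]
    n = len(hits)
    acc = sum(hits) / n if n else float("nan")
    reliable = n >= MIN_SLICE_N
    # A regression flag fires only when the slice is large enough to trust.
    flagged = reliable and acc < BASELINE - REGRESSION_MARGIN
    report[prov] = (n, acc, reliable, flagged)
    print(f"{prov}: n={n:4d} acc={acc:.3f} reliable={reliable} flagged={flagged}")
```

The small-territory slice (YT here) illustrates the failure mode the minimum-size rule exists for: its accuracy estimate swings too widely to gate on, so the plan must say what happens to slices below the floor (merge, widen, or report-only).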
Part 4: Reproducibility and confidence-interval record (25 minutes)
Produce the reproducibility-and-confidence section of the offline evaluation report. The section must enable a colleague on a different team, using a different tracking tool, to reproduce your offline evaluation and reach substantially the same conclusion.
Content required:
- The exact commands (or equivalent) to reproduce the evaluation from the dataset version pinned to a date.
- The software versions and the hardware class needed (CPU vs. GPU, approximate memory).
- The expected runtime and expected primary-metric value (with declared tolerance).
- A bootstrap-based confidence interval protocol for each primary metric. Specify the bootstrap sample count, the percentile convention, and the random seed policy.
- The distribution-gap analysis: which statistic(s) will you compute between training and test, and what threshold will flag an actionable gap?
- The record of what you will do if the gap threshold is breached (escalate, rerun, or proceed with documented caveat).
A pointer to the tracking tool you would use in your organization (MLflow, W&B, Neptune, Aim, SageMaker Experiments, Vertex AI Experiments, Azure ML, Databricks, or another) is acceptable; name your choice and the rationale in one sentence. The choice is a realism exercise, not a branded recommendation.
Expected artifact: Reproducibility-and-Confidence.md with the commands, the bootstrap protocol, and the distribution-gap protocol.
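Both the percentile-bootstrap protocol and the distribution-gap check can be prototyped in a few lines. The seed, the 1,000-resample count, and the 0.10 KS threshold below are illustrative placeholders; your report should record whichever values you actually commit to:

```python
import bisect
import random

random.seed(42)  # seed policy: one fixed, recorded seed per evaluation run

# Synthetic per-example correctness indicators for the primary metric.
outcomes = [random.random() < 0.88 for _ in range(2_000)]

def bootstrap_ci(values, n_boot=1_000, alpha=0.05):
    """Percentile bootstrap CI for the mean: resample with replacement,
    recompute the metric, take the alpha/2 and 1-alpha/2 percentiles."""
    n = len(values)
    stats = sorted(sum(random.choices(values, k=n)) / n
                   for _ in range(n_boot))
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

point = sum(outcomes) / len(outcomes)
lo, hi = bootstrap_ci(outcomes)
print(f"accuracy={point:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")

# Distribution-gap sketch: two-sample Kolmogorov–Smirnov statistic on a
# numeric feature (claim amount), flagged when it exceeds the threshold.
def ks_statistic(a, b):
    a, b = sorted(a), sorted(b)
    xs = sorted(set(a) | set(b))
    return max(abs(bisect.bisect_right(a, x) / len(a)
                   - bisect.bisect_right(b, x) / len(b)) for x in xs)

train_amt = [random.lognormvariate(7.0, 1.0) for _ in range(1_000)]
test_amt = [random.lognormvariate(7.05, 1.0) for _ in range(1_000)]
gap = ks_statistic(train_amt, test_amt)
print(f"KS={gap:.3f}  actionable={gap > 0.10}")
```

In practice you would reach for a library implementation of the KS test; the hand-rolled version here only makes the statistic's definition explicit for the reproducibility record.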
Reflection questions (10 minutes)
Write one paragraph on each. These paragraphs join the evaluation package as rationale the deployment gate will read.
- The adjuster action flag. Feature #4 in Part 2 is adjuster_action_flag. The product team is reluctant to drop it because it is predictive. How do you explain, in governance terms and without being dismissive, why it must be dropped or reframed?
- The shadow plan. The team’s plan is “offline test passes, then shadow, then ramp”. Given your Part 1–3 artifacts, what gap does shadow cover that offline cannot, and what would an inadequate shadow test miss that a good one would catch? Refer to Article 1’s four-mode vocabulary.
- The fraud slice. Fraud is <2% of claims and has the highest human cost per error (wrongly flagging a non-fraud claim harms a customer; wrongly missing a fraud claim costs the insurer). How does your Part 3 plan ensure the fraud-slice metrics are both sized and interpreted correctly?
Final deliverable
A single offline evaluation package named TriagePilot-Offline-Evaluation-Package.md combining the four artifact files and the reflection paragraphs in order, with a one-page executive summary at the top stating: the feature, the residual risks after your proposed mitigations, the go/no-go recommendation for moving to shadow deployment, and the conditions on that recommendation. The package runs to approximately eight to ten pages.
What good looks like
A deployment-gate reviewer will look for:
- Completeness. All four partitioning considerations addressed. All nine features audited. All required slice dimensions plus two relevant additions. A concrete reproducibility protocol.
- Specificity. Named partition rules, named leakage transformations, named slice thresholds, named confidence-interval protocol. Not “we will check for leakage” but “we will apply a point-in-time feature cutoff at the claim-filing date”.
- Neutrality. Tracking-tool choice named with one-sentence rationale; no unqualified vendor endorsement.
- Feature-specific coverage. The TriagePilot-specific elements (fraud class, insurance regulation, adjuster interaction) visibly shape the artifacts.
- A clear recommendation. The executive summary takes a position and attaches conditions, rather than hedging.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.