AITE M1.1-Art51 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Lab 01: Design a RAG Reference Architecture for a Regulated Internal Knowledge Assistant



AITE-SAT: AI Solution Architecture Expert — Body of Knowledge Lab Notebook 1 of 5


Scenario

You are the solution architect assigned to PolicyPilot, an internal knowledge assistant for underwriters and claims specialists at a composite European insurer headquartered in Dublin, with operating branches in Madrid and Frankfurt. The assistant must answer natural-language questions over a corpus of roughly 180,000 documents: product wordings, reinsurance treaties, claims guidelines, regulatory circulars from EIOPA, the Central Bank of Ireland, and BaFin, and internal underwriting memos. Answers must cite the source paragraph; hallucinated citations are a release blocker. The product is scoped as a decision-support feature (not a decision-making system), so under the EU AI Act it is not presumed high-risk, but the parent system supporting underwriting decisions is Annex III high-risk, and PolicyPilot must be engineered so its evidence pack can fold into that system’s conformity assessment. Data residency is EU-only; personal data appears in claims memos and must be handled under GDPR Article 9 (special categories, given the medical context in life and health claims).

Business targets at steady state are 1,200 internal daily active users, median question-to-answer latency under 6 seconds, answer-acceptance rate (a thumbs-up rate with a calibrated definition) of 70% or higher on a held-out underwriter golden set, and per-user marginal cost under 8 cents per session. Build-versus-buy is open; the architect’s recommendation must be defensible against both a managed-API path and a self-hosted open-weight path.

Your deliverable is a complete architecture package submitted to the Model Risk Committee.

Part 1: Reference architecture diagram and narrative (60 minutes)

Produce a reference architecture that shows, at a minimum, the ingress tier, the retrieval tier, the generation tier, the evaluation path, the observability path, and the governance boundary. At each tier, annotate:

  • The component’s responsibility in one sentence.
  • The failure mode the component is the primary defense against.
  • Whether the component is stateful, and if so where its state lives.
  • The authentication or authorization step present at each inter-tier boundary.

The diagram must be technology-neutral: name capabilities (dense retriever, sparse retriever, reranker, policy engine, telemetry bus, audit log) rather than vendors, and provide a sidebar that lists at least two viable implementations per capability drawn from different stack families. At least one implementation per capability must work on a self-hosted open-weight path (for example, Llama 3 or Mistral via vLLM, pgvector or Qdrant, bge-reranker) and at least one must work on a managed cloud-API path (for example, Bedrock, Azure AI Foundry, or Vertex AI with Pinecone or Weaviate).
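As a starting point, the sidebar could be seeded from a capability map and checked for multi-stack parity mechanically. The sketch below uses only the example implementations named in this brief; the pairings are illustrative, not endorsements:

```python
# Illustrative capability map for the multi-stack sidebar.
# Each capability names one self-hosted open-weight option and one
# managed cloud-API option, per the lab's parity requirement.
CAPABILITY_MAP = {
    "generator": {
        "open_weight": "Llama 3 or Mistral via vLLM",
        "managed": "Bedrock / Azure AI Foundry / Vertex AI",
    },
    "dense_retriever": {
        "open_weight": "pgvector or Qdrant",
        "managed": "Pinecone or Weaviate",
    },
    "reranker": {
        "open_weight": "bge-reranker",
        "managed": "managed reranking endpoint (platform-specific)",
    },
}

def missing_parity(capability_map: dict) -> list[str]:
    """Return capabilities lacking either an open-weight or a managed option."""
    return [
        cap for cap, impls in capability_map.items()
        if not (impls.get("open_weight") and impls.get("managed"))
    ]
```

A register like this makes the “multi-stack parity” review criterion from the final section testable rather than a matter of diagram inspection.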

Write a 400-to-500-word narrative that walks a reader through the request lifecycle from the underwriter’s browser to the cited answer. Call out explicitly where a personal identifier could be reflected into a log or prompt, and what redaction occurs before that point.
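For the redaction call-out, a minimal pre-logging pass might look like the sketch below. The patterns are assumptions for illustration (the policy-number format is invented); a real deployment would pair regexes with an NER-based PII detector and the insurer’s actual identifier schemes:

```python
import re

# Hypothetical patterns; the POL- policy-number format is assumed,
# not taken from any real system.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "POLICY_NO": re.compile(r"\bPOL-\d{8}\b"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}

def redact(text: str) -> str:
    """Replace each match with a typed placeholder before the text
    reaches a log line or a prompt template."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

The narrative should state exactly which hop applies this pass, so that everything downstream of it (telemetry bus, audit log, prompt) sees placeholders only.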

Expected artifact: PolicyPilot-Reference-Architecture.md with a single system diagram and the narrative.

Part 2: Data-contract register (40 minutes)

Your index must be trustworthy. Produce a data-contract register listing every source system that feeds the index, with, for each source:

| Field | What to record |
| --- | --- |
| Source ID and owner | Team name, accountable individual |
| Refresh cadence | Event-driven, hourly, daily, weekly, or bounded-staleness |
| Sensitivity class | Public, internal, confidential, restricted (GDPR-special) |
| Residency constraint | EU-only, country-specific, none |
| Ingestion SLA | Time from source update to index availability |
| Retention rule | Delete-from-index rules for source-deleted or retracted documents |
| Access filter | Tenant, business-unit, or jurisdiction scoping applied at retrieval |
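One way to keep the register machine-checkable rather than a static document is to model each row as a record. A minimal sketch, with the allowed values taken from the field descriptions above (the class and field names are illustrative):

```python
from dataclasses import dataclass

# Allowed values per the register's field descriptions.
CADENCES = {"event-driven", "hourly", "daily", "weekly", "bounded-staleness"}
SENSITIVITY = {"public", "internal", "confidential", "restricted"}

@dataclass
class ContractEntry:
    source_id: str
    owner_team: str
    accountable_individual: str
    refresh_cadence: str         # one of CADENCES
    sensitivity_class: str       # one of SENSITIVITY
    residency_constraint: str    # e.g. "EU-only"
    ingestion_sla_hours: float   # source update -> index availability
    retention_rule: str
    access_filter: str           # tenant / business-unit / jurisdiction scope

    def validate(self) -> list[str]:
        """Return a list of contract violations (empty means valid)."""
        errors = []
        if self.refresh_cadence not in CADENCES:
            errors.append(f"bad cadence: {self.refresh_cadence}")
        if self.sensitivity_class not in SENSITIVITY:
            errors.append(f"bad sensitivity: {self.sensitivity_class}")
        return errors
```

A CI job that runs `validate()` over the register catches drift (a new source with no owner, an unrecognised cadence) before it reaches the Model Risk Committee.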

Include a one-paragraph section on how a document retraction (for example, a withdrawn EIOPA guideline) propagates from the source repository to the live index within the stated SLA, and how a retrieval that surfaced the retracted paragraph before the propagation completed is detected and remediated.
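The retraction path can be prototyped against an in-memory stand-in before the real index is chosen. The sketch below is an assumption-laden toy (the real system would delete from the vector/keyword index and replay the serving log from the telemetry bus), but it shows the two obligations in the paragraph: purge the retracted chunks, and surface the retrievals that cited them before propagation completed:

```python
import time

class ToyIndex:
    """In-memory stand-in for the retrieval index plus serving log."""
    def __init__(self):
        self.chunks = {}       # chunk_id -> source document id
        self.served_log = []   # (timestamp, chunk_id) cited in an answer

    def retrieve(self, chunk_id: str):
        self.served_log.append((time.time(), chunk_id))
        return self.chunks.get(chunk_id)

def retract(index: ToyIndex, doc_id: str) -> list[tuple]:
    """Delete every chunk of a retracted document, then return the
    serving-log entries that cited it so those answers can be flagged
    for remediation."""
    retracted = {cid for cid, d in index.chunks.items() if d == doc_id}
    for cid in retracted:
        del index.chunks[cid]
    return [entry for entry in index.served_log if entry[1] in retracted]
```

The returned entries are the input to the remediation step: each maps back to an answer that cited a now-withdrawn paragraph within the SLA window.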

Expected artifact: PolicyPilot-Data-Contract-Register.md with the source table and the retraction paragraph.

Part 3: Evaluation plan aligned to the production path (30 minutes)

Produce a three-layer evaluation plan: offline, online, and human review. Specify:

  • Offline. The golden set composition (size, domain coverage, refresh cadence, ownership), the faithfulness and citation-validity checks, the retrieval-quality metrics (hit-rate-at-K, reciprocal rank), and how the set defends against leakage into the training of any fine-tuned component.
  • Online. The canary ramp protocol, the guardrails (latency, refusal rate, unsafe-content rate, cost per session), and the rollback trigger on each guardrail.
  • Human review. The sampling rate for human rating, the rubric dimensions (grounding, completeness, style), the rater calibration cadence, and how disagreements with an LLM-as-judge pipeline are reconciled.
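The two retrieval-quality metrics named in the offline layer reduce to a few lines each. A minimal sketch, where each query pairs a ranked list of retrieved chunk IDs with the set of relevant IDs from the golden set:

```python
def hit_rate_at_k(results, k: int) -> float:
    """Fraction of queries with at least one relevant chunk in the top K."""
    hits = sum(
        1 for ranked, relevant in results
        if any(cid in relevant for cid in ranked[:k])
    )
    return hits / len(results)

def mean_reciprocal_rank(results) -> float:
    """Average of 1/rank of the first relevant chunk (0 when none retrieved)."""
    total = 0.0
    for ranked, relevant in results:
        for rank, cid in enumerate(ranked, start=1):
            if cid in relevant:
                total += 1.0 / rank
                break
    return total / len(results)
```

Pinning these definitions in the plan itself avoids the common review failure where "hit rate" means top-1 to one team and top-K to another.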

Name a tracking platform for each layer (for example, Langfuse, Arize, Humanloop, MLflow, or a build-your-own logging stack) with a one-sentence rationale, and name two alternatives. Naming a platform is a realism exercise, not an endorsement.

Expected artifact: PolicyPilot-Evaluation-Plan.md with the three layers and guardrail table.

Part 4: Architecture decision record for the generator choice (20 minutes)

Produce a two-page ADR for the generator decision. Use the standard ADR structure (context, decision, consequences, alternatives, revisit trigger). The decision must include:

  • The chosen stack family (managed API, cloud platform, or self-hosted open-weight) with the primary and secondary candidate models.
  • The grounding and refusal behaviour expected of the generator, and how the chosen model has been verified to exhibit them.
  • The data-egress posture: what, if anything, leaves the EU boundary, and the contractual and technical controls in place.
  • The revisit trigger: the condition under which the committee reopens the decision (a material change in model pricing, a shift in regulatory guidance, or a sustained evaluation gap).

The ADR must be defensible. A reviewer should be able to read it and follow the reasoning without having participated in the discussion.

Expected artifact: ADR-001-Generator-Choice.md.

Final deliverable and what good looks like

Package the four artifacts into PolicyPilot-Architecture-Package.md with a one-page executive summary stating the target operating envelope, the residual architecture risks, and the go-to-build recommendation with conditions.

A reviewer will look for: completeness across all four parts; multi-stack parity (at least one open-weight and one managed-API implementation named per capability); an explicit grounding and citation-validity test; a concrete retraction-propagation SLA; and an ADR that takes a position and names the revisit trigger. Vague architecture (“an LLM generates the answer”) and single-vendor architecture (“PolicyPilot uses vendor X”) both fail review.


© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.