AITE-SAT: AI Solution Architecture Expert — Body of Knowledge Lab Notebook 4 of 5
Scenario
You are the architect assigned to GateKeep, the enterprise LLM gateway for a global professional-services firm with 85,000 staff across 60 countries. GateKeep sits between every internal application and every model provider the firm contracts with. Applications (an internal chat client, a document-drafting assistant, a RAG-backed knowledge tool, an agent that summarizes client calls) submit model requests to GateKeep, and GateKeep routes them to the approved provider under the approved configuration, with redaction and policy enforcement in between. No internal application is permitted to call a provider directly.
The firm has contracts with three managed-API providers (select any three from OpenAI, Anthropic, Gemini, Mistral, Cohere, or AI21), two cloud platforms (Bedrock and Azure AI Foundry), and a self-hosted open-weight serving stack (vLLM on an internal GPU cluster). The firm is subject to GDPR, the UK Data Protection Act, an array of professional-privilege obligations on client matter, and industry-specific rules for its financial-services and health-sector client engagements. Client-matter content cannot leave jurisdictional boundaries agreed with each client; some client matters are explicitly barred from any third-party LLM provider and must route to the self-hosted path only.
Your assignment is to design GateKeep so that an auditor, reading the gateway’s code and policy records, can see how any given request was handled, why, and under whose authority. Performance is a constraint: the gateway must add less than 80 milliseconds to the p99 latency of a normal request path, and it must sustain 4,000 requests per second across the firm at peak.
Part 1: Request lifecycle and policy decision point (45 minutes)
Produce the end-to-end request lifecycle diagram and narrative. The narrative walks a reader through a request from the moment it leaves the calling application to the moment the response arrives back at the application. At a minimum, the lifecycle includes:
- Authentication. The calling application presents a signed identity (mTLS, workload identity, or a short-lived token). The gateway verifies the identity and resolves it to a service record.
- Attribute enrichment. The gateway attaches the tenant, the client matter if the request carries one, the jurisdiction, the data-class (public, internal, confidential, restricted), and the purpose tag.
- Policy decision. The policy engine evaluates the request against the current policy. The decision is permit, permit-with-redaction, permit-with-route-override (force to self-hosted path), or deny. The decision is a first-class object, persisted with the request record.
- Redaction. If the decision includes redaction, the redaction pipeline runs over the prompt and any attached context before the request leaves the gateway. Redaction is lossless in the sense that the gateway retains the original; the provider receives only the redacted version.
- Provider routing. The gateway routes to the chosen provider endpoint (managed API, cloud platform, or self-hosted), attaching the gateway’s own authentication to the provider, and stripping the calling application’s identity from the outbound request.
- Response handling. The response is scanned for sensitive output, stamped with the gateway's traceability header, and returned to the caller with a per-request cost record.
Document the authorization boundaries at each step. The policy engine must be a distinct component (Open Policy Agent, Cedar, an internal rule engine, or an equivalent); stating this explicitly is part of the exercise.
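The lifecycle above can be sketched as a pipeline whose output is a first-class, persisted decision object. The following is a minimal Python sketch under stated assumptions: `PolicyDecision`, `decide`, the rule IDs, and the attribute names are all illustrative, and `decide` is a toy stand-in for the distinct policy engine (OPA, Cedar, or an internal engine), not a prescribed API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
import uuid

class Effect(Enum):
    PERMIT = "permit"
    PERMIT_WITH_REDACTION = "permit-with-redaction"
    PERMIT_WITH_ROUTE_OVERRIDE = "permit-with-route-override"
    DENY = "deny"

@dataclass(frozen=True)
class PolicyDecision:
    # First-class object, persisted with the request record for the auditor.
    decision_id: str
    request_id: str
    effect: Effect
    rule_id: str      # the policy rule that produced the decision
    route: str        # chosen provider path (e.g. "self-hosted")
    decided_at: str

def decide(request_id: str, attributes: dict) -> PolicyDecision:
    """Toy stand-in for the external policy engine; rule IDs are made up."""
    if attributes.get("no_third_party"):
        effect, rule, route = Effect.PERMIT_WITH_ROUTE_OVERRIDE, "R-017", "self-hosted"
    elif attributes.get("data_class") == "restricted":
        effect, rule, route = Effect.PERMIT_WITH_REDACTION, "R-031", attributes["default_provider"]
    else:
        effect, rule, route = Effect.PERMIT, "R-001", attributes["default_provider"]
    return PolicyDecision(
        decision_id=str(uuid.uuid4()),
        request_id=request_id,
        effect=effect,
        rule_id=rule,
        route=route,
        decided_at=datetime.now(timezone.utc).isoformat(),
    )
```

Note that the route override for no-third-party matters takes precedence over the redaction branch: once a request is forced to the self-hosted path, jurisdictional routing is already satisfied, though redaction may still apply downstream.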
Expected artifact: GateKeep-Request-Lifecycle.md.
Part 2: Allow-list and route-override policy (30 minutes)
Produce the allow-list design. The allow-list is not a single list; it is a matrix of (calling application × data class × client matter × jurisdiction) → allowed providers and configurations. Specify:
- The policy language and engine. Sketch three representative rules in the chosen language. Examples:
- “The internal chat client, on internal data, for non-client-matter purposes, in the EU, can route to any approved provider with EU data residency; default is a specific provider, override is a named alternative.”
- “The call-summarizer agent, on any data class, for any client matter flagged with the ‘no-third-party’ attribute, must route to the self-hosted path only; a deny response is returned if the path is unhealthy.”
- “Any request carrying restricted data class must pass the redaction pipeline; an unredacted request on restricted data is denied.”
- The policy-change workflow. Who authors changes, how changes are reviewed, how changes are rolled out (feature flag, percentage ramp, immediate for emergency), and the rollback protocol.
- The deny response. What the calling application receives (an explicit deny code, the policy rule ID that denied, no leakage of other rules), and how the denial is logged for audit.
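The three sample rules could be expressed as a first-match-wins rule table. The sketch below uses Python as a stand-in for the internal-rule-engine option named in Part 1; in a real deployment these would be Rego or Cedar policies. All field names, rule IDs, and provider names are illustrative, and the default is deny when nothing matches.

```python
# First-match-wins rule table over the (application x data class x
# client matter x jurisdiction) matrix. Illustrative only.
RULES = [
    {   # no-third-party client matters: self-hosted only
        "id": "R-017",
        "when": lambda r: r.get("no_third_party", False),
        "effect": "permit-with-route-override",
        "routes": ["self-hosted"],
    },
    {   # restricted data that has not passed redaction is denied
        "id": "R-031",
        "when": lambda r: r.get("data_class") == "restricted" and not r.get("redacted"),
        "effect": "deny",
        "routes": [],
    },
    {   # EU chat client on internal, non-client-matter data
        "id": "R-001",
        "when": lambda r: (r.get("app") == "chat-client"
                           and r.get("data_class") == "internal"
                           and r.get("client_matter") is None
                           and r.get("jurisdiction") == "EU"),
        "effect": "permit",
        "routes": ["provider-a-eu", "provider-b-eu"],  # default first, named alternative second
    },
]

def evaluate(request: dict) -> tuple[str, str]:
    """Return (rule_id, effect); default-deny when no rule matches.

    The caller receives only the deny code and the matching rule ID,
    never the contents of other rules.
    """
    for rule in RULES:
        if rule["when"](request):
            return rule["id"], rule["effect"]
    return "R-DEFAULT", "deny"
```

Ordering matters in a first-match-wins table, which is one reason the policy-change workflow needs review and a percentage ramp: inserting a rule above R-017 could silently widen the no-third-party matters' routing.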
Expected artifact: GateKeep-Policy-Specification.md with the three sample rules.
Part 3: Redaction pipeline and output scanning (30 minutes)
Produce the redaction design. The pipeline covers both inbound (prompt + retrieved context) and outbound (response) streams. Specify:
- The detection taxonomy. At least eight classes: personal identifiers (names, emails, phone numbers, national IDs); sensitive identifiers (bank accounts, medical codes, credentials); client-matter identifiers (matter numbers, opposing parties); secret patterns (API keys, access tokens); location data; protected characteristics (GDPR Article 9 categories); commercial-sensitive markers (deal codes, transaction IDs); and freeform classifiers for content that matches domain-specific patterns.
- The detection implementation. A hybrid of named-entity recognition (open-source models are acceptable), regex, and deny-list look-ups. Specify false-positive and false-negative targets, and the evaluation set used to measure them.
- The replacement policy. Each detected span is replaced with a typed placeholder that preserves type (for example, [PERSON_NAME], [EMAIL]) and, where the downstream prompt depends on consistent reference, a stable surrogate (the same name maps to the same surrogate within a request). The gateway retains the mapping so the response can be de-redacted before return.
- The failure mode. If the detector confidence is below a threshold, the gateway fails closed (deny) rather than pass an uncertain prompt through. Document the threshold-and-override workflow.
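The replacement policy can be sketched as follows. This is a regex-only toy for two detection classes; the real pipeline combines NER models, regexes, and deny-lists as specified above, and the patterns, placeholder names, and matter-number format here are assumptions, not the firm's actual formats.

```python
import re

# Two illustrative detectors; a real pipeline covers all eight classes.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "MATTER_ID": re.compile(r"\bM-\d{6}\b"),
}

def redact(text: str) -> tuple[str, dict[str, str]]:
    """Replace detected spans with typed, stable surrogates.

    The same original value maps to the same surrogate within a request,
    and the surrogate -> original mapping is retained so the response can
    be de-redacted before it is returned to the caller.
    """
    mapping: dict[str, str] = {}   # surrogate -> original (retained by gateway)
    seen: dict[str, str] = {}      # original -> surrogate (stability within request)
    counters: dict[str, int] = {}

    def replace(kind: str, match: re.Match) -> str:
        original = match.group(0)
        if original not in seen:
            counters[kind] = counters.get(kind, 0) + 1
            surrogate = f"[{kind}_{counters[kind]}]"
            seen[original] = surrogate
            mapping[surrogate] = original
        return seen[original]

    for kind, pattern in PATTERNS.items():
        text = pattern.sub(lambda m, k=kind: replace(k, m), text)
    return text, mapping

def deredact(text: str, mapping: dict[str, str]) -> str:
    """Restore originals in the provider's response before returning it."""
    for surrogate, original in mapping.items():
        text = text.replace(surrogate, original)
    return text
```

The numbered surrogate (`[EMAIL_1]`, `[EMAIL_2]`) rather than a bare `[EMAIL]` is what preserves consistent reference when a prompt mentions the same entity twice.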
Expected artifact: GateKeep-Redaction-Spec.md.
Part 4: Rate-limiting, cost-attribution, and tenancy (30 minutes)
Produce the runtime-quality design. Specify:
- Rate limiting. Per-application, per-tenant, and per-user limits. At least two algorithms should be offered (token bucket, leaky bucket, or concurrency-based); state which you use where and why. The limit response is a typed 429 with a retry-after hint.
- Cost attribution. The gateway must tag every request with the paying cost center, compute the marginal cost (input tokens × input price + output tokens × output price, or a platform-specific model) at request time, and emit a billing record. Specify the reconciliation cadence against the provider’s own invoice and the tolerance before a discrepancy is investigated.
- Multi-tenancy. The gateway serves many internal products (tenants). Specify the tenant isolation in the policy store, in the rate-limit store, in the log stream, and in the cost stream. A per-tenant outage must not degrade other tenants’ traffic.
- Observability. The trace schema (propagating an end-to-end trace ID from the calling application through the gateway to the provider and back), the SLO dashboard (availability, latency, deny rate by rule), and the incident-response playbook for a policy-engine outage.
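Two of the mechanisms above can be sketched concretely: a token bucket (one of the candidate rate-limit algorithms) whose rejection carries the retry-after hint for the typed 429, and the marginal-cost formula from the cost-attribution bullet. Prices, model names, and cost-center IDs are illustrative, not contracted values.

```python
import time
from dataclasses import dataclass

class TokenBucket:
    """Per-key token bucket; refills continuously at `rate` tokens/sec."""
    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> tuple[bool, float]:
        """Return (allowed, retry_after_seconds) for the typed 429 hint."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0.0
        return False, (cost - self.tokens) / self.rate

@dataclass(frozen=True)
class BillingRecord:
    cost_center: str
    input_tokens: int
    output_tokens: int
    marginal_cost_usd: float

# Illustrative per-million-token prices; real values come from provider contracts.
PRICES = {"example-model": (3.00, 15.00)}  # (input, output) USD per 1M tokens

def bill(cost_center: str, model: str, in_tok: int, out_tok: int) -> BillingRecord:
    """Marginal cost = input tokens x input price + output tokens x output price."""
    in_price, out_price = PRICES[model]
    cost = (in_tok * in_price + out_tok * out_price) / 1_000_000
    return BillingRecord(cost_center, in_tok, out_tok, round(cost, 6))
```

The bucket answers in constant time with no background refill thread, which matters inside an 80 ms p99 budget at 4,000 RPS; a concurrency-based limiter would be the better fit for long-running streaming responses, where per-request admission says little about in-flight load.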
Expected artifact: GateKeep-Runtime-Spec.md.
Final deliverable and what good looks like
Package the four artifacts into GateKeep-Architecture-Package.md with a one-page summary stating the performance envelope (p99 overhead, peak RPS, policy-decision latency), the three most material residual risks, and the roll-out plan (which applications migrate first, in what order, under what success criterion).
A reviewer will look for: a policy decision as a first-class persisted object; a policy engine distinct from the gateway code; at least three representative rules in a named policy language; a redaction pipeline with false-positive and false-negative targets; rate-limit and cost-attribution designs that survive a single-tenant outage; and an auditor-readable request lifecycle. Architectures that embed policy in application code, rather than in a separate policy engine, fail review.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.