AITE M1.1-Art23 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Architecture Decision Records and Documentation

Architecture Decision Records and Documentation — AI Strategy & Vision — Advanced depth — COMPEL Body of Knowledge.

11 min read Article 23 of 48

This article defines the AITE-SAT ADR template, covers the four AI decision types that almost always warrant an ADR, and walks through three worked examples pulled from the public engineering record.

The classic ADR template

Nygard’s original template has five sections: title, status, context, decision, consequences. Many teams extend it with options considered, decision criteria, and decision owners. Spotify, Amazon Web Services, and the LangChain and LlamaIndex open-source projects all ship public ADRs that illustrate the spread of styles; the AWS Prescriptive Guidance library maintains a canonical worked set.1

An ADR is intentionally short. Two to three pages beats a twenty-page design doc because the audience is the future reader trying to reconstruct why the team chose what they chose. Long design documents have their place but do not replace ADRs. A design document is a narrative of how the system works; an ADR is the record of a fork in the road and the reason the team took one branch.

The AITE-SAT ADR extensions

For AI systems, the classic template is necessary but not sufficient. The following extensions earn their place because each corresponds to a failure mode that ADR-less teams routinely repeat.

Model-family choice. Which family of models does this system rely on, and why. Record the closed-weight versus open-weight position (Article 2), the multi-provider fallback if any, the route-and-escalate policy, and the version-pin strategy. A team building on an Anthropic Claude managed API with a Llama 3 self-hosted fallback writes that down; a team that writes only “we use Claude” will scramble when a provider outage or policy change forces re-architecture.

Retrieval strategy. Does the system use RAG (Article 4), and if so: chunking strategy (Article 5), vector store (Article 6), hybrid retrieval balance, reranker choice, and freshness target. The ADR records not only what was chosen but what was tested and rejected. A Weaviate decision over pgvector is qualitatively different from a pgvector decision because Postgres already exists in the stack.

Evaluation target. What metric, on what eval set, at what threshold, is the release bar. Linking the ADR to the eval harness (Article 11) closes a common gap where the team knows the target informally but the evidence pack cannot prove it.

Fine-tuning boundary. A sister decision framework (Article 10) answers whether the system fine-tunes. The ADR records the answer and the trigger for revisiting it — typically a specific eval-score target that the non-fine-tuned path has failed to reach over a defined window.

Fallback plan. What happens when the model is unavailable, the response fails validation, the cost budget burns, or the kill-switch fires (Article 20). Silent failure is not a fallback. The ADR records the graceful degradation behaviour and the user-facing experience.
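As a minimal sketch, the extended template can be expressed as a structured record so that incomplete ADRs are detectable mechanically. The field names below mirror the five extensions, but the class and helper are illustrative assumptions, not an AITE-SAT schema:

```python
from dataclasses import dataclass

# Illustrative sketch: classic Nygard fields plus the five AITE-SAT extensions.
@dataclass
class AiAdr:
    # Classic template
    title: str
    status: str          # "drafting" | "review" | "accepted" | "superseded"
    context: str
    decision: str
    consequences: str
    # AITE-SAT AI-specific extensions (blank means "not yet recorded")
    model_family: str = ""
    retrieval_strategy: str = ""
    evaluation_target: str = ""
    fine_tuning_boundary: str = ""
    fallback_plan: str = ""

AI_FIELDS = ("model_family", "retrieval_strategy", "evaluation_target",
             "fine_tuning_boundary", "fallback_plan")

def missing_ai_fields(adr: AiAdr) -> list[str]:
    """Return the AI-specific extension fields left blank."""
    return [f for f in AI_FIELDS if not getattr(adr, f).strip()]
```

A team that writes only “we use Claude” would fail this check on four of the five extension fields, which is exactly the gap the extensions exist to close.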

The four AI decisions that always warrant an ADR

Not every decision is an ADR candidate. Nygard’s guidance — record architecturally significant decisions that change the shape of the system or commit to a cost — still holds. For AI systems the following four classes always cross the threshold.

1. Model-family commitment. Managed closed-weight API versus open-weight self-hosted versus cloud-platform catalogue versus hybrid. This decision sets the cost curve (Article 9), the data residency posture (Article 18), the security boundary (Article 14), and the customisation path (Article 10). The Klarna deployment on OpenAI is a public example where the downstream architecture was legible because the anchor decision was public.2 The Bloomberg-built BloombergGPT was the opposite — a full custom pre-train on proprietary financial data that reshaped the entire cost and security architecture.3

2. Retrieval architecture. Whether the system is a plain model, a RAG system, a fine-tuned model, or a stack of the above. A decision to adopt hybrid retrieval with a reranker over a Qdrant index, for instance, is an ADR because the retrieval architecture drives latency (Article 17), freshness (Article 20), and the evaluation cadence.

3. Evaluation contract. The target metric, the eval set, the threshold, and the action on regression. Often the evaluation contract is the most contested ADR in the system because it defines when the team is allowed to ship. Locking it down early resolves many later disputes.

4. Agency boundary. Is the system allowed to write back to other systems, and under what conditions. The decision to permit tool-calling or not (Article 7), and which tools, and with what guardrails, deserves its own ADR. Agentic architectures (Article 32) elevate this decision further and earn multiple interrelated ADRs.

Other decisions — vector-store selection, orchestration framework choice, inference routing strategy — are ADR-worthy when they cross cost, security, or lock-in thresholds. Teams tend to over-document in year one and under-document in year three; the architect’s calibration is a learned skill.
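The agency-boundary decision in particular lends itself to an executable guardrail. A minimal sketch, assuming hypothetical tool names and a fail-closed default for unknown tools (neither is prescribed by the template):

```python
# Guardrail sketch for an agency-boundary ADR: reads are allowed freely,
# writes require explicit confirmation, unknown tools fail closed.
READ_TOOLS = {"lookup_order", "search_kb"}       # assumed tool names
WRITE_TOOLS = {"issue_refund", "update_listing"}  # assumed tool names

def authorise(tool: str, confirmed: bool = False) -> str:
    """Return the policy decision for a requested tool call."""
    if tool in READ_TOOLS:
        return "allow"
    if tool in WRITE_TOOLS:
        return "allow" if confirmed else "needs_confirmation"
    return "deny"  # anything not in the ADR's tool list is denied
```

The point of putting the boundary in an ADR is that this table of permitted tools is a recorded decision with named alternatives, not an accident of whichever tools the orchestration framework happened to expose.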

The ADR lifecycle

An ADR progresses through four statuses: drafting, review, accepted, and (eventually) superseded. Nygard’s original insight was that superseded ADRs are never deleted; they are linked forward to the ADR that replaced them so that the history of reasoning is preserved. This matters more in AI than in classical systems because the decision space moves faster. An ADR written in 2023 that chose GPT-3.5 over GPT-4 for cost reasons is not wrong history; it is relevant context for the 2025 ADR that re-opened the decision.

The review stage is where the architect earns their salary. A reviewer asks: what alternatives were considered, what are the reversibility costs, what evidence backs the choice, what is the trigger to revisit. An ADR that cannot answer these questions is not ready.
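The lifecycle can be sketched as an explicit transition table. The statuses come from the text above; the review-back-to-drafting bounce and the function name are illustrative assumptions:

```python
# ADR lifecycle as a transition table. "superseded" is terminal: the record
# is never deleted, only linked forward to its replacement.
TRANSITIONS = {
    "drafting": {"review"},
    "review": {"drafting", "accepted"},  # review may send the ADR back
    "accepted": {"superseded"},
    "superseded": set(),                 # terminal state
}

def can_transition(current: str, new: str) -> bool:
    """Check whether a status change is a legal lifecycle step."""
    return new in TRANSITIONS.get(current, set())
```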

Worked example 1 — Klarna customer-service assistant

Klarna publicly announced in early 2024 that its OpenAI-powered assistant was handling the workload equivalent of 700 agents and producing measurable CSAT outcomes.2 The architecture decisions leading there are partly disclosed in press and vendor case material. A plausible ADR corpus would include:

  • ADR-001 — Model family: Managed GPT-4 class via OpenAI API. Alternatives: Anthropic Claude, open-weight Llama 3 self-hosted. Decision driver: scale, multilingual coverage, response quality. Fallback: graceful degradation to scripted FAQ + human-agent escalation.
  • ADR-002 — Retrieval: RAG over internal knowledge base, chunked by Q/A pair. Alternatives: fine-tuned model, pure prompt-engineered zero-shot. Decision driver: freshness, auditability.
  • ADR-003 — Evaluation target: Composite of intent-classification accuracy, citation accuracy, escalation-rate, and CSAT. Threshold: parity with prior chatbot baseline at launch; monthly review.
  • ADR-004 — Fallback: On model failure or validator-fail, route to scripted FAQ. On escalation trigger (refund disputes, fraud), route to human agent.

The ADRs that matter most in hindsight are usually the ones that looked least controversial at the time. ADR-004 (fallback) looks administrative but defines the user experience during every outage.
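ADR-004’s routing policy fits in a few lines of code, which is part of why it looks administrative. A sketch, where the trigger names are assumptions and only the three routes (model response, scripted FAQ, human agent) come from the example above:

```python
# Illustrative routing sketch of an ADR-004-style fallback policy.
ESCALATION_TOPICS = {"refund_dispute", "fraud"}  # assumed trigger names

def route(model_ok: bool, validator_ok: bool, topic: str) -> str:
    """Decide where a customer-service turn goes."""
    if topic in ESCALATION_TOPICS:
        return "human_agent"          # escalation triggers override everything
    if not model_ok or not validator_ok:
        return "scripted_faq"         # graceful degradation, never silent failure
    return "model_response"
```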

Worked example 2 — Bloomberg BloombergGPT

Bloomberg’s 2023 paper described a 50B-parameter model pre-trained on a blend of financial and general-purpose data.3 The decision was atypical; most enterprises do not pre-train from scratch. The ADRs that shaped it:

  • ADR — Model family: Custom pre-train. Alternatives: fine-tune open-weight (Llama, MPT), API-only, managed enterprise offering. Decision driver: proprietary corpus, domain performance, security posture.
  • ADR — Training data governance: Multi-source blend with per-source licence tracking. This decision anticipates the data-governance obligations of EU AI Act Article 10, pre-dating the regulation itself.
  • ADR — Cost boundary: Fixed training budget; inference-serving on internal GPU capacity.
  • ADR — Release and access: Internal-only; no public weights; per-product permissioning.

The interesting counterfactual is the decision Bloomberg did not make: they did not release weights. That non-decision is itself worth an ADR because it preserves a commercial and data-leak boundary.

Worked example 3 — Shopify Sidekick

Shopify’s Sidekick merchant assistant, disclosed in a 2023 engineering blog and subsequent talks, is a commerce-domain agent.4 Sidekick’s public architecture notes suggest a multi-model, multi-tool architecture:

  • ADR — Model family: Mixed. OpenAI for reasoning, specialised smaller models for routing and structured tasks.
  • ADR — Tool boundary: Read-only access to merchant data; write actions require confirmation.
  • ADR — Evaluation: Per-tool correctness plus end-to-end task completion.
  • ADR — Fallback: When tool response is ambiguous or policy-restricted, route to human merchant operations support.

The Sidekick example shows the agency-boundary ADR in action. Splitting reasoning from tool execution, and requiring explicit confirmation on writes, is an architectural answer to tool-misuse risk (Article 7, Article 32).

Storage and tooling

ADRs live in the repository alongside the code they govern. Markdown in a dedicated adr/ directory with a monotonic numbering scheme is the standard. Teams that use Confluence or Notion sometimes mirror ADRs outside the repo, but the source of truth should remain in the code repository for the same reason code review happens there. Tools like adr-tools or log4brains automate the numbering, linking, and rendering.5
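The monotonic numbering that those tools automate is simple enough to sketch by hand. The 0001-title.md filename convention below is common practice rather than a fixed standard, and the stub content is illustrative:

```python
import re
from pathlib import Path

def new_adr(adr_dir: str, title: str) -> Path:
    """Create the next monotonically numbered ADR stub in adr_dir."""
    d = Path(adr_dir)
    d.mkdir(parents=True, exist_ok=True)
    # Find the highest existing NNNN- prefix among Markdown files.
    nums = [int(m.group(1)) for p in d.glob("*.md")
            if (m := re.match(r"(\d{4})-", p.name))]
    n = max(nums, default=0) + 1
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    path = d / f"{n:04d}-{slug}.md"
    path.write_text(f"# {n}. {title}\n\nStatus: drafting\n")
    return path
```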

In the COMPEL platform, ADRs are first-class records that link to the reference architecture (Article 1), the eval harness spec (Article 11), the registry entries (Article 21), and the stage-gate readouts (Articles 28 through 30). An ADR without those links is a fragment; the platform’s value is in the bidirectional linking.

Anti-patterns

  • The retrospective ADR. Writing ADRs six months after the decision is made captures the conclusion but not the deliberation. Reviewers read ADRs for the alternatives rejected as much as for the alternative chosen.
  • The ADR that is really a design document. Twelve-page ADRs with diagrams of every module tend to be design documents wearing ADR clothing. Keep both artefacts; do not conflate them.
  • The ADR nobody reads. Living on a wiki page that nobody opens means the decision is effectively unrecorded. Link ADRs into onboarding, code-review comments, and stage-gate templates so they show up where decisions are being reconsidered.
  • The ADR with no supersession policy. An ADR without a trigger for revisiting ages poorly. For AI systems the trigger is often an eval-score threshold, a cost threshold, or a regulatory change.
  • The ADR that says “we chose X” without naming the alternatives. Reviewers cannot evaluate a decision without the rejected alternatives.
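The last two anti-patterns are mechanically checkable. A minimal linter sketch, assuming illustrative section headings rather than a mandated layout:

```python
# Completeness check for two anti-patterns: an ADR must name the rejected
# alternatives and a trigger for revisiting. Heading names are assumptions.
REQUIRED_SECTIONS = ("## Alternatives considered", "## Revisit trigger")

def lint_adr(text: str) -> list[str]:
    """Return the required sections missing from an ADR body."""
    return [s for s in REQUIRED_SECTIONS if s not in text]
```

Wired into CI over the adr/ directory, a check like this turns “we chose X” ADRs from a review-time argument into a failing build.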

Governance integration

Articles 9 and 11 of the EU AI Act require documented risk-management reasoning and technical documentation.6 A mature ADR corpus is the primary evidence that a high-risk deployment has run a defensible decision process. ISO/IEC 42001 clauses on documented information (7.5) and change management map to ADR acceptance and supersession. NIST AI RMF GOVERN 1.3 (accountability) and MAP 2.3 (context documentation) are direct ADR outputs.

A notified body or internal auditor reading the evidence pack works through the ADR corpus in three passes: first the index (what decisions were taken), then spot checks (are the chosen ADRs well-reasoned and linked), then the change log (what was superseded and why).

Summary

The AITE-SAT ADR template extends the classic Nygard format with five AI-specific fields: model-family, retrieval strategy, evaluation target, fine-tuning boundary, fallback plan. The four AI decisions that always deserve an ADR are model-family commitment, retrieval architecture, evaluation contract, and agency boundary. ADRs live in the repository, are linked bidirectionally to the platform’s other artefacts, and form the backbone of both the EU AI Act Article 11 technical documentation and the team’s own memory.

Key terms

  • Architecture Decision Record (ADR)
  • AI-extended ADR template
  • Superseded ADR
  • Agency-boundary decision
  • Evaluation contract

Learning outcomes

After this article the learner can: explain the AI-extended ADR template; classify four decision types needing ADRs; evaluate an ADR for completeness against the template; design an ADR for a real decision in their own deployment.

Footnotes

  1. AWS Prescriptive Guidance, “Architecture Decision Records”; Spotify Engineering, public ADR references; LangChain and LlamaIndex GitHub repositories.

  2. Klarna, “Klarna AI assistant handles two-thirds of customer service chats in its first month” (2024 press announcement).

  3. Wu et al., “BloombergGPT: A Large Language Model for Finance” (Bloomberg, 2023 paper).

  4. Shopify Engineering Blog on Sidekick (2023).

  5. adr-tools and log4brains open-source projects.

  6. Regulation (EU) 2024/1689 (AI Act), Articles 9 and 11.