AITE M1.1-Art04 v1.0 Reviewed 2026-04-06 Open Access

Retrieval-Augmented Generation: When, Why, How Much



AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 4 of 35


Retrieval-augmented generation is the single most consequential architectural pattern of enterprise AI’s first generation. It separates what the model was trained on from what the organization wants the model to know at query time. It keeps the model small, the index large, and the answer grounded in verifiable sources. It became the default not because it is elegant but because it solves the practical problem every enterprise hits within the first month of a generative-AI pilot: the model is confident about things it should not be, and it is confidently wrong about the specific things the organization cares about. RAG is the architectural answer.

What RAG is and why it exists

The RAG pattern was introduced by Lewis et al. at NeurIPS 2020, in a paper proposing an architecture in which a retriever fetches relevant passages from a knowledge store and supplies them as additional context to a generator conditioned on the user’s query.1 The paper’s motivation was factual accuracy on knowledge-intensive question answering; the enterprise application addresses the same set of problems at a scale three orders of magnitude larger. An enterprise has hundreds of thousands or millions of internal documents: policies, tickets, emails, research notes, product specifications, training materials, and regulatory submissions. A closed-book LLM cannot memorize them, and even if it could, it could not update when they change. RAG is the architecture that lets the model remain stable while the organization’s knowledge moves.

The pattern at its simplest has five stages. A user asks a question. The system embeds the question into a vector. The retriever searches an index and returns the top-k passages most similar to that vector. The orchestration layer composes a prompt that includes the retrieved passages as context, the user’s question, and the system’s instructions. The model generates an answer grounded in the supplied context.
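The five stages can be sketched end to end in a few dozen lines. This is a minimal, self-contained illustration, not a production pipeline: the bag-of-words "embedding" stands in for a real embedding model, and stage 5 (generation) is left as a comment because it would call whatever model the architecture selects.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stage 2 stand-in: a toy bag-of-words 'embedding'.
    A real system would call an embedding model here."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, index, k=2):
    """Stage 3: search the index, return the top-k most similar passages."""
    q = embed(query)
    return sorted(index, key=lambda p: cosine(q, embed(p)), reverse=True)[:k]

def compose_prompt(question, passages):
    """Stage 4: assemble retrieved context, instructions, and the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return ("Answer ONLY from the context below and cite the passage used.\n"
            f"Context:\n{context}\nQuestion: {question}")

# Stage 5 would send `prompt` to the chosen generation model.
index = [
    "Expense reports must be filed within 30 days of travel.",
    "Remote work requires manager approval beyond two days a week.",
    "The cafeteria is open from 8am to 3pm on weekdays.",
]
question = "How long do I have to file an expense report?"
prompt = compose_prompt(question, retrieve(question, index))
```

The toy corpus and question are illustrative; the structure (embed, retrieve, compose, generate) is the part that carries over to every real implementation.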

The pattern looks like a library reference desk where the librarian listens to a question, pulls the right books off the shelves, and hands them to the reader. The model is the reader; the retriever is the librarian; the index is the shelf. What the architect decides is what books go on the shelf, how the librarian finds them, how many books the reader gets to see, and what happens when the reader has a question the library cannot answer.

Why RAG became the default

RAG is the default because of five architectural advantages.

Freshness. The index updates independently of the model. A policy change indexed this morning is answerable this afternoon; retraining the model would take weeks.

Verifiability. Every answer can cite the passages it used. Citation is the property a regulated industry wants most. Article 13 of the EU AI Act requires that high-risk systems provide information to users in a form that allows understanding of the outputs; citations are a concrete implementation of that obligation.2

Scope control. The index determines what the system knows about. Questions outside the index return “I cannot answer that based on the knowledge available to me” more reliably than a closed-book model that will confabulate.

Cost economy. Retrieval plus a small prompt is often cheaper than attempting the same quality with a much larger model operating closed-book. Retrieval cost is predictable and amortizable; larger model cost is per-token and scales worse.

Lifecycle separation. The knowledge plane and the model plane can evolve independently. Changing the embedding model does not require changing the generation model. Rotating to a cheaper generator does not require reindexing.
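The scope-control advantage above can be enforced mechanically: when no retrieved passage clears a similarity threshold, the system refuses rather than letting the model guess. A minimal sketch, in which the threshold value, function names, and the placeholder return string are all illustrative rather than prescribed:

```python
REFUSAL = "I cannot answer that based on the knowledge available to me."

def guard_scope(scored_passages, threshold=0.25):
    """Return passages that clear the threshold, or None to signal refusal.

    `scored_passages` is a list of (similarity, passage) pairs from retrieval;
    the 0.25 threshold is illustrative and must be tuned per corpus."""
    kept = [p for score, p in scored_passages if score >= threshold]
    return kept or None

def answer(scored_passages):
    """Refuse when retrieval finds nothing in scope; otherwise hand off to generation."""
    passages = guard_scope(scored_passages)
    if passages is None:
        return REFUSAL
    return "GENERATE_FROM:" + " | ".join(passages)  # placeholder for the model call
```

The refusal path is an architectural decision, not a model behavior: it fires before the generator ever sees the question.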

Two public enterprise deployments illustrate the payoff.

Morgan Stanley’s wealth-management assistant, delivered with OpenAI in 2023, indexes approximately 100,000 internal research documents and exposes them to GPT-class models through a RAG architecture. Advisors receive grounded answers, including citations, and the index updates as research changes. The architecture is explicitly RAG; the model is not fine-tuned on the corpus.3

LexisNexis Lexis+ AI, launched in 2023, indexes legal case law, statutes, and commentary and exposes them to LLMs through a RAG architecture that requires citation for every answer. The use case — legal research — is uninterested in models that sound right; it is interested in models that can be verified. RAG is the architecture that serves that interest; LexisNexis’s product pages say as much.4

[DIAGRAM: ConcentricRingsDiagram — aite-sat-article-4-knowledge-layers — Concentric rings from center outward: “Model pre-training knowledge (static, dated)”, “System prompt context (instructions and persona)”, “Retrieved passages (fresh, cited, scoped)”, “User question (immediate)”. Arrows show how each outer ring grounds the inner ring and how the retrieval layer is the bridge between static model and fresh knowledge.]

What RAG is not

RAG is not a cure for hallucination. A model given good passages can still ignore them, misquote them, or blend them with its prior. RAG changes the probability distribution toward grounded answers; it does not eliminate ungrounded ones. The architecture therefore needs downstream defenses: citation verification (the model claims passage X says Y; the orchestration layer checks whether Y actually appears in X), answer regeneration when citation fails, and evaluation (Article 11) that measures grounding faithfulness.
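The citation-verification defense can be sketched as a verbatim containment check. The normalization rules and function names below are illustrative; production systems often use fuzzier matching or an entailment model, but the shape of the check is the same.

```python
import re

def _normalize(text):
    """Lowercase and collapse whitespace so cosmetic differences
    do not fail the check."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def citation_holds(claimed_quote, cited_passage):
    """True if the quote the model attributes to a passage
    actually appears in that passage, modulo case and whitespace."""
    return _normalize(claimed_quote) in _normalize(cited_passage)

def failed_citations(citations):
    """Return the (quote, passage) pairs that fail verification;
    a non-empty result should trigger answer regeneration."""
    return [(q, p) for q, p in citations if not citation_holds(q, p)]
```

The orchestration layer runs this after generation and regenerates (or refuses) when the failure list is non-empty.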

RAG is not a substitute for fine-tuning when the task is style or format rather than knowledge. An agent that must consistently speak in the brand voice, or produce strict structured outputs in a non-standard schema, is not served by retrieval; it is served by fine-tuning or by carefully engineered system prompts. The decision tree in Article 10 develops this trade-off.

RAG is not a replacement for semantic understanding when the query is about relationships the corpus does not encode. If the user asks “which policy contradicts which,” a naive RAG system will retrieve each policy separately and fail to notice the contradiction. Graph-structured retrieval and agentic RAG patterns extend the basic architecture for such cases; Article 5 and Article 7 develop them.

RAG is not scope-proof. The corpus determines what the system can answer; adversaries can exploit that. A carefully crafted query can retrieve passages that, in isolation, mislead the model. The architect treats retrieval results as adversarially selectable, the same way application security treats all user input as adversarial.

Suitability classification

A use case needs RAG when the system must answer from a corpus the model was not trained on, when answers must be verifiable, when the corpus changes, or when scope control is required. A use case does not need RAG when the knowledge is already in the model, when answers do not require verification, when the corpus is static, or when format and style dominate content. The table classifies five example use cases.

Use case | RAG suitability | Reason
Internal policy assistant over 10,000 policy documents | High | Corpus-grounded, verifiable, changes often, scope must be limited
Creative marketing copy generator | Low | No corpus, style-dominated, no verifiability requirement
Customer-support knowledge-base chatbot | High | Product documentation corpus, changes with releases, citation desired
Code completion in an IDE | Partial | Local context (open files) is the “retrieval”; enterprise repo indexing extends it
Legal brief drafting assistant | Critical | Case law corpus, citation is legally required, hallucination has high cost

The table is guidance; an architect never outsources the decision to a generic category. A customer-support chatbot over a tiny product line that never changes may not justify the index infrastructure; a creative-writing tool over a brand-voice corpus may.

Scope creep in RAG proposals

The most common failure mode in RAG design is scope creep. A pilot begins with a small, controlled corpus — one department’s policies, one product’s documentation — and ships successfully. The next version adds three more corpora. The version after that adds a tenth. Within a year the retrieval pipeline is fetching from a stitched-together data lake, the embeddings are wrong for half the new content, the quality metrics are trending down, and no one can identify which corpus contributed which passage to which answer. The architecture’s scope was never bounded, so every new corpus was additive and every incremental addition was cheaper than saying no.

Scope control is an architectural decision. The architect defines, at design time, a corpus policy: what documents are eligible, who owns the ingestion of each source, what the retention policy is for each source, how conflicts are resolved when two sources disagree, what the re-indexing cadence is for each source. The policy is version-controlled and governance-owned. New corpora are additions subject to policy review, not default inclusions. Scope creep is not prevented by saying no to requests; it is prevented by having a document that says which requests are acceptable.
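A corpus policy of the kind described can be captured as a small, version-controlled data structure. The field names and example sources below are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class CorpusSource:
    name: str
    owner: str               # who owns ingestion for this source
    retention_days: int      # retention policy for the source
    reindex_cadence: str     # e.g. "daily", "weekly"
    priority: int            # lower number wins when two sources disagree

@dataclass
class CorpusPolicy:
    version: str
    sources: list = field(default_factory=list)

    def is_eligible(self, source_name):
        """New corpora are additions subject to policy review,
        not default inclusions."""
        return any(s.name == source_name for s in self.sources)

    def resolve_conflict(self, a, b):
        """When two sources disagree, the higher-priority
        (lower-numbered) source wins."""
        return a if a.priority <= b.priority else b

# Illustrative policy instance; the source names are hypothetical.
policy = CorpusPolicy(version="1.2.0", sources=[
    CorpusSource("hr-policies", "HR Ops", 730, "weekly", 1),
    CorpusSource("support-kb", "Support Eng", 365, "daily", 2),
])
```

Because the policy is data, it can live in version control and be diffed at governance review; a new corpus shows up as a reviewable change, not a silent addition.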

[DIAGRAM: BridgeDiagram — aite-sat-article-4-closed-to-rag-bridge — Horizontal bridge from “Closed-book LLM” (left) to “RAG-augmented LLM” (right), with intermediate components named: embedding model, chunking pipeline, vector index, retrieval function, reranker (optional), prompt assembler, citation validator, evaluation harness. Annotations show which intermediate component is added at each stage of architectural maturity.]

Technology-neutral RAG

Every component of the RAG architecture has multiple implementations. The embedding model can be OpenAI’s text-embedding models, Cohere’s embed family, Google’s embedding APIs, open-weight embeddings such as BAAI’s bge family or Alibaba’s gte models served locally, or sentence-transformer variants trained in-house. The vector index can be Pinecone, Weaviate, Qdrant, pgvector inside PostgreSQL, Milvus, Chroma, or OpenSearch k-NN. The orchestration can be LangChain, LlamaIndex, DSPy, Haystack, Semantic Kernel, or a bespoke library. The model for generation can be any of the options from Article 2. The architect chooses each component on the criteria developed earlier; the RAG architecture itself does not prefer any particular stack.
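That component neutrality can be expressed directly in code: write the five-stage pattern once against interfaces, and any embedder, store, or generator plugs in. A sketch using Python Protocols; the toy classes at the bottom are illustrative stand-ins, not real backends.

```python
from typing import List, Protocol

class Embedder(Protocol):
    """Any embedding backend: hosted API or locally served open-weight model."""
    def embed(self, text: str) -> List[float]: ...

class VectorIndex(Protocol):
    """Any vector store: Pinecone, pgvector, Qdrant, or an in-memory list."""
    def search(self, vector: List[float], k: int) -> List[str]: ...

class Generator(Protocol):
    """Any generation model behind a prompt-in, text-out interface."""
    def generate(self, prompt: str) -> str: ...

def rag_answer(question: str, embedder: Embedder, index: VectorIndex,
               generator: Generator, k: int = 4) -> str:
    """The five-stage pattern written once; swapping any component
    does not change this function."""
    passages = index.search(embedder.embed(question), k)
    prompt = "Context:\n" + "\n".join(passages) + f"\nQuestion: {question}"
    return generator.generate(prompt)

# Toy stand-ins showing that anything satisfying the interfaces works.
class ToyEmbedder:
    def embed(self, text): return [float(len(text))]

class ToyIndex:
    def __init__(self, passages): self.passages = passages
    def search(self, vector, k): return self.passages[:k]

class EchoGenerator:
    def generate(self, prompt): return "GROUNDED ON:\n" + prompt
```

Replacing `ToyIndex` with a pgvector-backed class or `EchoGenerator` with an API client changes nothing above the interface line, which is the architectural invariance the section describes.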

Two example deployments illustrate the neutrality.

Notion AI was built with Pinecone as the vector store and has been documented through Pinecone’s case study library; the generation model and orchestration have varied over time, and the company has not committed to one permanently.5

Supabase’s pgvector-based AI examples show the same RAG architecture implemented with PostgreSQL’s pgvector extension as the store and a range of generation models including open-weight options; the public documentation emphasizes that the pattern is model-agnostic.6

The two stacks look nothing alike from the operations perspective. They look identical at the architecture plane: a knowledge plane with ingestion, embedding, and retrieval; an orchestration plane that composes the prompt; a model plane that generates. The architect who recognizes the invariance can move between the stacks without re-learning the pattern.

The one-page RAG proposal

Every RAG project the AITE-SAT holder reviews should carry a one-page proposal that answers the following: what the corpus is, who owns it, how it is updated, what the embedding strategy is, which vector store is selected and why, what the retrieval strategy is (dense, sparse, hybrid; filtering by metadata or tenant), whether reranking is used, how the composed prompt is assembled, how citations are validated, what the evaluation plan is, what the rollback plan is, and what the scope-creep defense is. One page is enough to force precision; anything longer becomes unreadable. A proposal that cannot fit on one page is a proposal with unresolved ambiguity.
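The proposal's completeness can even be checked mechanically. The field names below are one illustrative mapping of the questions just listed, not a mandated schema:

```python
# Hypothetical field names mapping the one-page proposal's questions.
REQUIRED_ANSWERS = {
    "corpus", "corpus_owner", "update_process", "embedding_strategy",
    "vector_store_rationale", "retrieval_strategy", "reranking_decision",
    "prompt_assembly", "citation_validation", "evaluation_plan",
    "rollback_plan", "scope_creep_defense",
}

def unanswered(proposal):
    """Return the questions a proposal leaves blank or missing;
    an empty set means the proposal is ready for review."""
    return {k for k in REQUIRED_ANSWERS if not str(proposal.get(k, "")).strip()}
```

A design-review gate that rejects proposals with a non-empty `unanswered` set turns the one-page discipline from a convention into a check.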

Regulatory alignment

RAG architectures have specific alignment points with AI governance frameworks. EU AI Act Article 10 (data and data governance) requires that high-risk AI systems use datasets subject to appropriate governance; in a RAG system the retrieval corpus is a dataset in the Article 10 sense, and the corpus policy is the Article 10 record.7 Article 13 (transparency) requires that information be provided to users to enable them to interpret outputs; citation is a direct implementation.8 ISO/IEC 42001 Clause 8.3 requires life-cycle management for AI systems; the retrieval corpus has a life cycle that includes ingestion, indexing, retention, deprecation, and retirement, and Clause 8.3 requires all five to be managed.9 NIST AI RMF MEASURE 2.7 requires safety testing; the retrieval layer’s failure modes (wrong passages, missing passages, poisoned passages) are testable and must be measured.10

Summary

RAG is the default architecture for most enterprise AI features because it separates model training from organizational knowledge, supports freshness, enables verification through citations, and provides scope control. It is not a cure for hallucination, not a substitute for fine-tuning when style dominates, not semantic-graph reasoning by itself, and not scope-proof. Suitability depends on whether the use case needs corpus grounding and verifiability. Scope creep is the primary failure mode and is prevented by a corpus policy that is version-controlled and governance-owned. The architecture is technology-neutral; the same five stages appear in every implementation regardless of which embedding model, vector store, orchestration framework, or generation model is chosen. Morgan Stanley’s wealth-management assistant and LexisNexis Lexis+ AI are two public deployments that illustrate the pattern at scale. A one-page proposal is the working artifact of a RAG design review. Regulatory alignment with the EU AI Act, ISO 42001, and NIST AI RMF is satisfied by the governance artifacts RAG architectures produce naturally — corpus policy, citation, evaluation.

Further reading in the Core Stream: Grounding, Retrieval, and Factual Integrity for AI Agents, AI Integration Patterns for the Enterprise, and Data Architecture for Enterprise AI.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Patrick Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 2020. https://arxiv.org/abs/2005.11401 — accessed 2026-04-19.

  2. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 (EU AI Act), Article 13. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-19.

  3. Morgan Stanley Wealth Management deploys OpenAI-powered AI @ Morgan Stanley Assistant. Morgan Stanley press release, September 2023. https://www.morganstanley.com/press-releases/key-milestone-in-innovation-journey-with-openai — accessed 2026-04-19.

  4. LexisNexis launches Lexis+ AI. LexisNexis press release, 2023, and product documentation. https://www.lexisnexis.com/en-us/products/lexis-plus-ai.page — accessed 2026-04-19.

  5. Pinecone case study: Notion AI. Pinecone Customers. https://www.pinecone.io/customers/ — accessed 2026-04-19.

  6. Supabase pgvector documentation and AI examples. Supabase. https://supabase.com/docs/guides/ai — accessed 2026-04-19.

  7. Regulation (EU) 2024/1689, Article 10 (data and data governance). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-19.

  8. Regulation (EU) 2024/1689, Article 13 (transparency). Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-19.

  9. ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system, Clause 8.3. International Organization for Standardization. https://www.iso.org/standard/81230.html — accessed 2026-04-19.

  10. Artificial Intelligence Risk Management Framework (AI RMF 1.0), NIST AI 100-1, MEASURE function, Subcategory 2.7. National Institute of Standards and Technology, January 2023. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf — accessed 2026-04-19.