AITM-PEW: Prompt Engineering Associate — Body of Knowledge Article 4 of 10
Retrieval-augmented generation is the pattern most enterprise language-model features eventually adopt, and for good reason. It anchors the model’s answers in a corpus the organisation controls, it reduces the surface area for pretraining-era confabulation, and it produces an audit trail the organisation can show to a regulator or an unhappy customer. It also introduces a new class of failure modes that a practitioner must learn to recognise and a new set of governance duties that a practitioner must learn to enforce. This article concerns the prompt side of RAG; the retrieval pipeline itself is developed in the AITE Solutions Architect credential, and the safety side of the retrieval surface is developed in Article 7.
What RAG actually changes
The foundational paper is Lewis et al. 2020, which introduced retrieval-augmented generation as a way to combine a parametric language model with a non-parametric index of documents [1]. The architecture has since become the default for any feature that must answer questions grounded in an organisation-specific corpus: policy assistants, product documentation bots, legal research tools, and internal knowledge search. In each, the prompt to the model includes a chunk or set of chunks retrieved from the index.
The new failure modes are three. A confabulation can now arise from an uninformative retrieval, in which the index returned chunks that do not actually answer the question, and the model filled the gap with plausible prose. A contradiction can arise when retrieved chunks disagree with each other and the model silently picks one. And indirect prompt injection, the OWASP LLM01 risk, can arrive inside a retrieved chunk: if an adversary has placed instruction-shaped content in a document the index has absorbed, those instructions enter the prompt alongside the legitimate chunks [2]. Article 7 develops the defence; here, the concern is how the prompt itself is written so that grounded answers are produced, refusals are clean, and citations are honest.
The grounded-answer template
A practitioner’s baseline template for a RAG prompt has four parts. The first is a system instruction declaring the scope: the assistant answers questions about the organisation’s policies (or products, or cases, or whatever the corpus is) using only the retrieved context below. The second is the retrieved context itself, clearly delimited, with each chunk labelled with an identifier that the model can cite. The third is the user question. The fourth is an answer instruction: produce a concise answer grounded in the retrieved context, citing the identifiers of the chunks that support each factual claim, and refusing if the retrieved context does not contain sufficient information to answer.
The template is short enough to read in one screen. Its value is that every clause is load-bearing. The scope clause prevents the model from ranging outside the corpus. The delimiter around the retrieved context gives the model a consistent signal about where operator content ends and authoritative context begins. The citation requirement forces the model to produce an audit trail. The refusal requirement is the most important: it is the clause that makes confabulation-on-empty-retrieval explicit, rather than hoping the model behaves itself.
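The four parts can be assembled mechanically. A minimal sketch in Python, assuming a hypothetical list of chunk records with `id` and `text` fields; the delimiter style and identifier format are illustrative choices, not a provider requirement:

```python
# Minimal sketch of the four-part grounded-answer template.
# Chunk records, delimiters, and identifier style are illustrative.

def build_rag_prompt(chunks: list[dict], question: str) -> str:
    """Assemble scope, delimited context, user question, and the
    answer-with-citation-or-refusal instruction."""
    context = "\n\n".join(f"[{c['id']}]\n{c['text']}" for c in chunks)
    return (
        "You answer questions about the organisation's policies using "
        "ONLY the retrieved context below.\n\n"
        "=== RETRIEVED CONTEXT ===\n"
        f"{context}\n"
        "=== END CONTEXT ===\n\n"
        f"Question: {question}\n\n"
        "Answer concisely. Cite the identifier of every chunk that "
        "supports a factual claim, e.g. [policy-leave-s2]. If the "
        "retrieved context does not contain sufficient information, "
        "say so and refuse rather than guessing."
    )

prompt = build_rag_prompt(
    [{"id": "policy-leave-s2",
      "text": "Employees accrue 2 days of leave per month."}],
    "How much leave do I accrue?",
)
```

Every clause in the assembled string corresponds to one of the load-bearing clauses described above; deleting any of them reintroduces a failure mode.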
[DIAGRAM: HubSpoke — aitm-pew-article-4-rag-pipeline — Hub: grounded-answer prompt; spokes: query rewriter, retriever, reranker, context assembler, citation checker, confidence scorer, refusal branch.]
A practitioner with a working template should treat it as a starting point, not a finished artefact. The Mata v. Avianca case of June 2023 [3], in which two attorneys were sanctioned for filing a federal-court brief containing citations to non-existent cases generated by a public language model, is the teaching anchor for why a template without a citation-checking pass is not yet a safe RAG prompt. The attorneys did not invent the citations; the model did, in a closed-book zero-shot prompt. A legal research tool using RAG would sit the model on top of a retrieval over a real case corpus, require citations to be of specific chunk identifiers, and include a post-generation pass that verifies each cited identifier resolves to a real case. Each of those steps is implied by the template and is the template’s reason to exist.
Chunking discipline from the prompt’s perspective
The chunking strategy is an architectural concern, but the prompt-side consequences are immediate. Chunks that are too small do not contain enough context for the model to produce a coherent answer; chunks that are too large dilute the signal and waste tokens. A practical heuristic is that each chunk should be a coherent unit of meaning (a section of a policy, a product feature description, a step of a procedure), not an arbitrary slice of N tokens. The retrieval context in the prompt then reads as a handful of coherent units rather than a wall of broken sentences.
Chunk identifiers should be stable. A citation in an answer is useless if the chunk it references cannot be traced back to its source document and its specific location within that document. The simplest identifier is a document title and a section number or a URL with a fragment. More elaborate systems use stable chunk IDs linked to content-addressed storage, so that the citation survives content edits and version changes.
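A stable, content-addressed identifier is straightforward to sketch. The `doc/section#digest` shape below is an illustrative convention, not a standard; the digest changes only when the chunk text changes, so a citation survives reindexing as long as the content is unchanged:

```python
import hashlib

def chunk_id(doc_title: str, section: str, text: str) -> str:
    """Stable, content-addressed chunk identifier: a human-readable
    document/section prefix plus a short digest of the chunk text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]
    return f"{doc_title}/{section}#{digest}"

cid = chunk_id("leave-policy", "s2",
               "Employees accrue 2 days of leave per month.")
```

The prefix keeps the citation human-traceable; the digest makes it verifiable against content-addressed storage.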
When retrieval returns nothing relevant, the prompt’s refusal clause activates. The model is asked to state that the corpus does not contain sufficient information to answer and to offer a next step (contact a human, try a different phrasing, look at a related topic). A feature that fails to say “I do not know” when its retrieval returns nothing is a feature that will eventually produce a Mata-style confabulation.
Confidence and hedging
A well-written RAG prompt also governs hedging. The model is asked to distinguish answers it can derive from the retrieved context fully (high confidence, state directly), partially (medium confidence, state with a caveat and cite), or not at all (low confidence, refuse and explain). This is not a safety filter; it is a discipline about how the output is phrased. The disciplined phrasing makes the evaluation harness in Article 8 easier to score, because a sentence marked as partial confidence and cited is a different correctness category from an unqualified assertion.
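The three tiers can be made explicit in operator-side code that decides the phrasing discipline before an answer is rendered. The coverage score and its thresholds below are hypothetical; a real feature would derive coverage from the grounding validator rather than set it by hand:

```python
def confidence_tier(coverage: float) -> str:
    """Map how much of the answer the retrieved context supports
    (0.0-1.0, a hypothetical score) to the phrasing discipline:
    answer directly, answer with a caveat and cite, or refuse."""
    if coverage >= 0.8:
        return "answer"              # high confidence: state directly
    if coverage >= 0.4:
        return "answer_with_caveat"  # partial: caveat and cite
    return "refuse"                  # low: refuse and explain
```

Tagging each sentence with its tier is what lets the Article 8 harness score a caveated claim as a different correctness category from an unqualified assertion.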
[DIAGRAM: Matrix — aitm-pew-article-4-confidence-matrix — 2x2: retrieval confidence (low/high) on one axis, model-stated confidence (low/high) on the other; cells: answer, answer with caveat, defer to human, refuse.]
Indirect-injection hygiene in the prompt
Even before Article 7 develops the defence in depth, the prompt can do preliminary work. Two clauses matter. The first tells the model that the retrieved context is data, not instruction, and that any instruction-shaped content found inside a retrieved chunk must be treated as content about instructions, not as an instruction to the model. The second asks the model, if it detects such content, to note the detection in its response rather than acting on it.
These clauses do not replace a layered defence. They do reduce the volume of the simplest indirect-injection variants, and they produce a signal the observability layer can monitor. Llama Guard [4], NeMo Guardrails [5], Guardrails AI, Azure AI Content Safety [6], Amazon Bedrock Guardrails [7], OpenAI Moderation, and Gemini safety filters are each available, on their respective stacks, to deliver the deeper defences Article 7 describes; the prompt clauses are the first layer of a layered system, never the only one.
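The detection signal can also be produced deterministically on the operator side, before the chunk ever reaches the model. A naive keyword screen, for the observability layer only; the pattern list is illustrative and trivially evaded, which is exactly why the Article 7 layered defences exist:

```python
import re

# Naive first-layer screen for instruction-shaped content inside a
# retrieved chunk. A monitoring signal, not a defence: the pattern
# list is illustrative and an adversary can rephrase around it.
SUSPECT = re.compile(
    r"\b(ignore (all|previous|the) instructions|you are now|system prompt)\b",
    re.IGNORECASE,
)

def flag_injection(chunk_text: str) -> bool:
    """True when a retrieved chunk contains instruction-shaped content
    that should be logged and surfaced to the observability layer."""
    return bool(SUSPECT.search(chunk_text))
```

A flagged chunk can still be passed to the model as data; the value of the flag is the audit trail it leaves when the model's behaviour is later questioned.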
Query rewriting and retrieval-time discipline
The prompt’s quality depends in part on the quality of the retrieval it receives, and that quality depends in turn on what the retrieval chain does with the user’s question. A raw user question is often suboptimal for retrieval: it may be ambiguous, may contain shorthand the corpus does not use, or may span multiple sub-questions that should be retrieved independently. A query-rewriting step, itself typically a model call, produces a cleaner retrieval query.
Query rewriting is a prompt design in its own right and should be versioned and evaluated with the same discipline as the main answering prompt. A rewrite prompt’s failure mode is to drop or hallucinate aspects of the user’s original intent; the evaluation for query rewriting is whether the rewritten query produces retrieval results that contain the information needed to answer the original question. Many teams keep the original user question alongside the rewritten query and include both in the evaluation harness, so that rewrite quality is tracked distinctly from answer quality.
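The evaluation described above can be sketched as a recall measure over hand-labelled gold chunks: did the rewritten query's retrieval still return the chunks needed to answer the original question? `toy_retrieve` is a stand-in for the real retriever, and the corpus contents are illustrative:

```python
def rewrite_recall(retrieve, rewritten_q: str, gold_chunk_ids: list[str]) -> float:
    """Fraction of the hand-labelled gold chunks (the ones needed to
    answer the ORIGINAL question) that the rewritten query's retrieval
    still returns. `retrieve` stands in for the real retriever."""
    hits = {c["id"] for c in retrieve(rewritten_q)}
    return len(hits & set(gold_chunk_ids)) / len(gold_chunk_ids)

# Toy retriever standing in for the real pipeline.
def toy_retrieve(query: str) -> list[dict]:
    index = {
        "leave accrual":   [{"id": "leave-s2", "text": "accrual section"}],
        "leave carryover": [{"id": "leave-s4", "text": "carryover section"}],
    }
    return index.get(query, [])

# The rewrite dropped the carryover aspect of the original question,
# so recall is 0.5; the harness tracks this separately from answer quality.
score = rewrite_recall(toy_retrieve, "leave accrual", ["leave-s2", "leave-s4"])
```

A falling rewrite-recall curve points the debugging effort at the rewrite prompt rather than the answering prompt.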
The retrieval chain itself may include reranking, deduplication, recency weighting, and source-credibility weighting. Each is a decision a practitioner should know about, even if the practitioner does not own the retrieval implementation. A feature that retrieves three chunks of which two are near-duplicates has effectively retrieved a single chunk from a model’s perspective; a feature that weights an outdated document higher than a current one because the outdated one has more matching keywords has a grounding defect at the retrieval layer rather than at the prompt layer.
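Near-duplicate collapse is the retrieval-chain step most amenable to a concrete sketch. The version below uses a character-level similarity ratio; the 0.9 threshold is an illustrative choice, and a production pipeline would more likely compare embeddings:

```python
from difflib import SequenceMatcher

def dedupe_chunks(chunks: list[dict], threshold: float = 0.9) -> list[dict]:
    """Drop retrieved chunks that are near-duplicates of an earlier
    one; from the model's perspective they add tokens, not signal.
    The 0.9 character-similarity threshold is illustrative."""
    kept: list[dict] = []
    for c in chunks:
        if all(SequenceMatcher(None, c["text"], k["text"]).ratio() < threshold
               for k in kept):
            kept.append(c)
    return kept

retrieved = [
    {"id": "a", "text": "Employees accrue 2 days of leave per month."},
    {"id": "b", "text": "Employees accrue 2 days of leave per month!"},
    {"id": "c", "text": "Refunds are issued within 30 days of purchase."},
]
kept = dedupe_chunks(retrieved)  # chunk "b" collapses into "a"
```

The same shape accommodates recency or source-credibility weighting by sorting `chunks` before the loop, so the preferred copy of a near-duplicate pair is the one that survives.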
Refusal discipline in practice
The refusal clause in the template is where many RAG features quietly misbehave. A weak refusal script produces answers like “I could not find specific information on that, but here is what I know”, which then proceed to ungrounded generation. A strong refusal script produces answers like “The available policy documents do not contain information on this question; please contact the HR team for authoritative guidance.”
The difference is structural. A strong refusal names the specific absence (the corpus searched, the fact not found), commits to no compensating prose, and offers a concrete next step. It does not apologise verbosely, does not speculate, and does not try to be helpful by adding generic context the user did not ask for. Several providers document example refusal patterns in their prompting guides; the disciplined version typically outperforms the verbose version when measured against user satisfaction, because users prefer knowing the limits to receiving confident-sounding nothings.
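The structural difference can be captured as a template so that the refusal shape is fixed by the operator rather than improvised by the model. The wording below is an illustrative pattern, not a provider-documented script:

```python
def strong_refusal(corpus_name: str, next_step: str) -> str:
    """Disciplined refusal shape: name the specific absence, add no
    compensating prose, offer one concrete next step. Wording is an
    illustrative pattern, not a provider-documented script."""
    return (
        f"The {corpus_name} do not contain information on this question. "
        f"{next_step}"
    )

msg = strong_refusal(
    "available policy documents",
    "Please contact the HR team for authoritative guidance.",
)
```

Rendering the refusal operator-side also makes the refusal branch trivially detectable in logs, which the evaluation harness exploits.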
Evaluating grounding
A RAG prompt is evaluated not on how fluent its answers are but on how grounded they are. A grounded answer is one whose every factual claim is supported by a retrieved chunk that the answer cites. The evaluation harness in Article 8 exercises this property explicitly. For a set of known questions with known good answers, the harness checks: did the retrieved context contain the information needed to answer; did the model produce an answer that used that information; did the citations point to chunks that actually support the claims; did the model refuse appropriately when the context was insufficient. Each dimension is measured separately, because each is a distinct failure mode.
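Scoring each dimension separately might look like the following; the field names are illustrative, and the per-dimension rate is what keeps one failure mode from hiding behind another:

```python
from dataclasses import dataclass

@dataclass
class GroundingResult:
    """One harness case, scored per dimension so each failure mode
    is visible on its own. Field names are illustrative."""
    context_sufficient: bool   # retrieval contained the needed facts
    answer_used_context: bool  # answer actually drew on those facts
    citations_support: bool    # cited chunks support the claims made
    refused_correctly: bool    # refusal fired iff context was insufficient

def grounding_rate(results: list[GroundingResult], dim: str) -> float:
    """Pass rate on a single dimension across the harness run."""
    return sum(getattr(r, dim) for r in results) / len(results)
```

A feature can score 1.0 on `context_sufficient` and still fail on `citations_support`; averaging the four into one number would hide exactly that defect.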
Stanford HELM, the holistic evaluation framework published by Liang et al. at TMLR 2023 [8], treats grounding as one of several complementary dimensions alongside accuracy, robustness, calibration, fairness, and efficiency. The HELM vocabulary is a useful anchor because it prevents a practitioner from believing their grounding is good when what they measured was only accuracy on happy-path questions.
Open-source and commercial parity
The grounded-answer template works identically on a managed-API stack using OpenAI, Anthropic, Gemini, or another closed-weight model, and on a self-hosted stack using Llama, Mistral, or Qwen. What varies is the surrounding tooling. On a managed stack, the team typically uses an orchestration framework like LangChain, LlamaIndex, or Haystack together with a vector store like Pinecone, Weaviate, or pgvector. On a self-hosted stack, the same frameworks cover the orchestration and the vector store can be Qdrant, Milvus, Chroma, or any Postgres instance with pgvector. The prompt discipline does not change. The credential’s position is that a practitioner who can write a grounded-answer prompt for one stack can write it for any other.
Citation fidelity and verification
A citation is only valuable if it is verifiable. A prompt that instructs the model to cite sources produces confident-looking citations even when the model’s retrieval was inadequate; the mismatch between cited chunk and supporting claim is a quiet, pernicious failure mode. The control is a post-generation citation check: a deterministic validator takes the answer and the cited chunk identifiers, retrieves the chunks, and verifies that the claimed sentences are actually supported by the cited chunks.
The validator can be a simple string-similarity check for direct quotations, a semantic-similarity check for paraphrased claims, or a second model call asking whether the claim is supported by the cited text. Each has trade-offs; a mature feature runs more than one. The validator’s results go into the evaluation harness from Article 8 under the grounding dimension and produce a measurable citation-fidelity rate that the feature’s dashboard reports.
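A deterministic first-pass validator for direct or lightly paraphrased claims might look like the following; the naive sentence split and the 0.6 ratio threshold are illustrative simplifications, with the semantic check and the second model call layering on top:

```python
from difflib import SequenceMatcher

def citation_supported(claim: str, cited_chunk_text: str,
                       min_ratio: float = 0.6) -> bool:
    """Deterministic first-pass check: does any sentence in the cited
    chunk resemble the claim? The naive full-stop sentence split and
    the 0.6 similarity threshold are illustrative simplifications."""
    sentences = [s.strip() for s in cited_chunk_text.split(".") if s.strip()]
    return any(
        SequenceMatcher(None, claim.lower(), s.lower()).ratio() >= min_ratio
        for s in sentences
    )
```

Claims that fail this cheap check get escalated to the more expensive semantic or model-based validators, and the aggregate pass rate is the citation-fidelity number on the dashboard.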
Dynamic retrieval and adaptive grounding
Not every question needs the same depth of retrieval. A question answerable from a single policy paragraph needs fewer, tighter chunks than a question requiring synthesis across multiple documents. Dynamic retrieval strategies adjust the retrieval’s breadth and depth to the question: a classifier or a preliminary model call estimates the question’s complexity and adjusts the retrieval parameters accordingly.
Adaptive patterns are a small additional layer, but they pay off when the feature’s question distribution is wide. A question-answering feature over a product catalogue may receive both simple lookups and complex comparative questions; a uniform retrieval configuration tuned for one will misperform on the other. A practitioner worth the title recognises when the question distribution justifies adaptation and when it does not.
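A crude version of the complexity estimate can start as a keyword heuristic before graduating to a classifier or a preliminary model call; the cue words and parameter values below are illustrative:

```python
def retrieval_params(question: str) -> dict:
    """Crude complexity heuristic: comparative or multi-part questions
    widen the retrieval, simple lookups keep it tight. Cue words and
    parameter values are illustrative; production systems would use a
    classifier or a preliminary model call instead."""
    q = question.lower()
    comparative = any(w in q for w in
                      ("compare", "versus", " vs ", "difference between"))
    multi_part = question.count("?") > 1 or " and " in q
    if comparative or multi_part:
        return {"top_k": 12, "per_doc_limit": 3}  # broad, synthesising
    return {"top_k": 4, "per_doc_limit": 2}       # tight, single-fact
```

The payoff of even this crude split is that the single-fact path stops paying the token cost, and the dilution risk, of a synthesis-sized context.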
Summary
RAG is a pattern that reshapes confabulation risk, introduces new failure modes, and produces an audit trail if the prompt enforces citation. The grounded-answer template has four parts: scope, delimited retrieved context, user question, and answer-with-citation-or-refusal instruction. Chunking discipline, confidence hedging, and preliminary indirect-injection hygiene each add to the template. Evaluation asks not whether the answer is fluent but whether it is grounded. Article 5 turns to tool use, the pattern that turns a language model from an answerer into an actor.
Further reading in the Core Stream: Grounding, Retrieval, and Factual Integrity for AI Agents.
© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.
Footnotes

1. Patrick Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. https://arxiv.org/abs/2005.11401 — accessed 2026-04-19.
2. OWASP Top 10 for Large Language Model Applications, 2025. OWASP Foundation. https://genai.owasp.org/llm-top-10/ — accessed 2026-04-19.
3. Mata v. Avianca, Inc., No. 22-cv-1461 (S.D.N.Y. 2023), order of 22 June 2023. https://storage.courtlistener.com/recap/gov.uscourts.nysd.575368/gov.uscourts.nysd.575368.54.0.pdf — accessed 2026-04-19.
4. Llama Guard model documentation. Meta AI. https://ai.meta.com/research/publications/llama-guard-llm-based-input-output-safeguard-for-human-ai-conversations/ — accessed 2026-04-19.
5. NeMo Guardrails open-source toolkit. NVIDIA. https://github.com/NVIDIA/NeMo-Guardrails — accessed 2026-04-19.
6. Azure AI Content Safety. Microsoft documentation. https://learn.microsoft.com/en-us/azure/ai-services/content-safety/overview — accessed 2026-04-19.
7. Amazon Bedrock Guardrails. AWS documentation. https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails.html — accessed 2026-04-19.
8. Percy Liang et al. Holistic Evaluation of Language Models. TMLR 2023. https://crfm.stanford.edu/helm/ — accessed 2026-04-19.