AITM M1.2-Art02 v1.0 Reviewed 2026-04-06 Open Access
M1.2 The COMPEL Six-Stage Lifecycle
AITF · Foundations

Foundational Prompting Patterns



AITM-PEW: Prompt Engineering Associate — Body of Knowledge Article 2 of 10


A practitioner who chases prompting techniques by fashion will try every trend that surfaces on social media and end up with an incoherent pile of prompts that no one can reason about. A practitioner who works from the pattern taxonomy will pick the technique that fits the task, justify the choice with a short rationale, and move on. This article teaches the taxonomy that covers the overwhelming majority of production cases. Four patterns suffice for most work: zero-shot instruction, few-shot demonstration, chain-of-thought, and self-consistency. Every other technique in current circulation is an elaboration or a specialisation of one of these four.

Zero-shot instruction

The zero-shot pattern gives the model an instruction without examples. The prompt describes the task, specifies the constraints, and expects the model to perform. The foundational paper is Brown et al. 2020, which documented that sufficiently large language models can execute tasks they have never seen by following natural-language instructions [1]. Subsequent work showed that zero-shot performance on reasoning tasks improves dramatically when the instruction includes the phrase “think step by step” or an equivalent cue; Kojima et al. demonstrated this result under the title Large Language Models are Zero-Shot Reasoners [2].

Zero-shot works well when the task is familiar to the model from pretraining, when the output format is easy to describe in prose, and when the evaluation cost of a failure is bounded. It works poorly when the task is nuanced, when the desired output has a specific structure that prose cannot fully convey, or when the task sits near the edge of the model’s knowledge and a concrete example would pin the behaviour down.

A zero-shot prompt for a customer-support triage feature might read: classify the following user message into exactly one of the categories billing, technical, account, other, and respond with only the category name in lowercase. The instruction is unambiguous, the output format is trivial, and the model can execute reliably. A zero-shot prompt for a legal summarisation task, by contrast, is almost always underspecified, because the desired style, citation format, and degree of detail are difficult to convey without examples.
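
The triage example can be sketched in code. The prompt text is the one given above; the response-parsing helper and its fallback to other are illustrative assumptions, not part of any provider's API, and the model call itself is elided.

```python
# Allowed categories from the zero-shot triage instruction above.
CATEGORIES = {"billing", "technical", "account", "other"}

ZERO_SHOT_PROMPT = (
    "Classify the following user message into exactly one of the "
    "categories billing, technical, account, other, and respond with "
    "only the category name in lowercase.\n\nMessage: {message}"
)

def parse_triage_response(raw: str) -> str:
    """Normalise the model's reply; fall back to 'other' on format drift."""
    category = raw.strip().lower()
    return category if category in CATEGORIES else "other"
```

Validating the reply against the closed category set means a drifting model output degrades to the catch-all category rather than crashing the downstream router.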

Every major provider’s guide endorses zero-shot as the default starting point: OpenAI’s guide, Anthropic’s, and Google’s each recommend trying zero-shot first and adding examples only when the zero-shot behaviour is insufficient [3][4][5]. The rationale is simple: a shorter prompt is cheaper, easier to maintain, and less prone to accidental drift when the model is updated.

Few-shot demonstration

The few-shot pattern adds labelled examples to the prompt. The examples show the input-output mapping the practitioner wants the model to reproduce. Brown et al. 2020 coined the term in-context learning for the phenomenon [1]; years of subsequent work have refined it but not displaced it.

Few-shot shines when the task is format-sensitive (the output must match a specific JSON shape, a specific citation style, a specific refusal phrasing); when the task is ambiguous in prose (distinguishing complaints from suggestions depends on tacit signals easier to demonstrate than to describe); or when the practitioner has training data from previous runs and can curate the highest-quality outputs as examples.

Three practical cautions govern few-shot use. The first is example selection: examples should cover the range of inputs the feature will see, including edge cases, rather than being three near-duplicate easy cases. The second is example ordering: several studies have shown that the order of examples can affect model behaviour, particularly for smaller models, so the practitioner should treat example ordering as a tunable parameter rather than a random choice. The third is leakage: an example chosen from the evaluation set will produce misleading quality numbers. The evaluation set must be held out before any example is lifted.

A few-shot prompt for the customer-support triage feature might include three examples:

user: I was charged twice for my subscription this month
category: billing

user: my dashboard shows a blank chart after login
category: technical

user: please delete my account and all associated data
category: account

The prompt is concrete. A new user message is classified by analogy to the examples. If the triage scheme gains a new category later, the few-shot set needs to be regenerated; this is a concrete reason to keep the set in a version-controlled file rather than inline.
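
A minimal sketch of that discipline, assuming the examples live in a version-controlled data file (shown here inline as a list; the JSON file name and the prompt layout are assumptions):

```python
# The three triage examples above, kept as data so the set can be
# regenerated when the category scheme changes (e.g. loaded from a
# version-controlled examples.json rather than hard-coded).
EXAMPLES = [
    {"user": "I was charged twice for my subscription this month", "category": "billing"},
    {"user": "my dashboard shows a blank chart after login", "category": "technical"},
    {"user": "please delete my account and all associated data", "category": "account"},
]

def build_few_shot_prompt(message: str, examples: list) -> str:
    """Render the examples in the user/category layout, ending with the
    new message so the model completes the final 'category:' line."""
    shots = "\n\n".join(
        f"user: {ex['user']}\ncategory: {ex['category']}" for ex in examples
    )
    return f"{shots}\n\nuser: {message}\ncategory:"
```

Regenerating the prompt is then a data change, not a prose edit.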

Chain-of-thought

The chain-of-thought pattern asks the model to produce intermediate reasoning steps before its final answer. Wei et al. 2022, published at NeurIPS, is the foundational paper; their experiments on grade-school math (GSM8K), commonsense reasoning, and symbolic manipulation showed that explicit reasoning steps could lift accuracy by tens of percentage points on tasks that required multi-step inference [6]. Kojima et al. 2022 showed that a single phrase inserted before the answer, commonly rendered as “let us think step by step”, produced much of the same effect without the need for curated reasoning examples [2].

Chain-of-thought applies to tasks where the answer depends on intermediate facts the model must surface. Multi-step math, multi-hop question answering, logical deduction, and complex classification that depends on several sub-criteria all benefit. The pattern applies less well to tasks that are essentially one-step lookups. Forcing a one-step task into chain-of-thought wastes tokens and can introduce confabulation, where the model invents reasoning to justify an answer it already had.

Two practical notes. The first is that the chain-of-thought output is often not what the practitioner wants to show the end user. A production implementation typically either parses the final answer out of the response or uses a two-call pattern, where a first call produces the reasoning and a second call produces the user-visible answer grounded in the reasoning. The second is that chain-of-thought, while effective, is not a safety control. A model that can reason out loud can still reason its way to a wrong or harmful answer. The pattern is a capability enhancer, not a governance device.
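
The parse-out-the-answer approach can be sketched as follows. The marker phrase Final answer: is an assumption here, not a provider convention; it works only if the instruction tells the model to end its response with that marked line.

```python
def extract_final_answer(response: str, marker: str = "Final answer:") -> str:
    """Return the text after the last marked answer line; if the model
    omitted the marker, surface the whole response rather than guess."""
    for line in reversed(response.splitlines()):
        if line.strip().startswith(marker):
            return line.strip()[len(marker):].strip()
    return response.strip()
```

The reasoning stays available for logging and debugging while only the final line reaches the end user.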

[DIAGRAM: Matrix — aitm-pew-article-2-pattern-selection — 2x2: task complexity on x-axis (simple / multi-step), example availability on y-axis (absent / present); cells: zero-shot, zero-shot with think-step-by-step, few-shot, few-shot with chain-of-thought.]

Self-consistency and its siblings

Self-consistency, introduced by Wang et al. 2023 at ICLR, runs the chain-of-thought prompt several times at non-zero temperature and selects the answer that appears most often across the samples [7]. The intuition is that a correct multi-step answer is more likely to be reached by several distinct reasoning paths than a particular incorrect answer. Self-consistency costs N times the inference of a single call and is therefore reserved for tasks where correctness is worth the additional spend.
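
The mechanism reduces to a majority vote, sketched below. sample_answer stands in for one model call at non-zero temperature followed by final-answer extraction; the function name and signature are illustrative.

```python
from collections import Counter

def self_consistent_answer(sample_answer, n: int = 8) -> str:
    """Draw n sampled answers and return the most frequent one.
    Each call to sample_answer() represents one chain-of-thought
    completion at non-zero temperature, reduced to its final answer."""
    answers = [sample_answer() for _ in range(n)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner
```

The N-times cost is visible directly in the loop, which is why the pattern is reserved for tasks where a wrong answer is expensive.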

Siblings of self-consistency include self-critique, where the model is asked to review its own output for errors and produce a revised version, and verify-then-rewrite, where a second model run checks the first run against explicit criteria before the output is emitted. Shinn et al. 2023 introduced Reflexion, a related pattern in which the model produces verbal feedback on its own trajectories across episodes [8]. These patterns are effective on the right tasks but are easy to over-apply; a simple classification does not need reflection, and bolting it on adds latency and cost without meaningful benefit.

How to choose

The four patterns form a small decision tree. The first question is whether the task is one-step or multi-step. One-step tasks default to zero-shot; multi-step tasks benefit from a reasoning pattern. The second question is whether the output format is loose or tight. Loose formats tolerate zero-shot; tight formats need examples or a structured-output pattern (which is the subject of Article 3). The third question is whether a single answer is sufficient or whether the cost of a wrong answer justifies redundancy. Single-answer sufficiency points to a single call; a need for redundancy points to self-consistency or a verifier pattern.
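
The three questions can be encoded directly. The function below is a compact restatement of the decision tree, not a prescription; the boolean inputs and returned pattern names are illustrative.

```python
def choose_pattern(multi_step: bool, tight_format: bool, needs_redundancy: bool) -> str:
    """Map the three decision-tree questions to a base pattern choice."""
    if not multi_step:
        # One-step tasks: examples only if the output format is tight.
        return "few-shot" if tight_format else "zero-shot"
    # Multi-step tasks get a reasoning pattern, with examples if needed.
    base = "few-shot chain-of-thought" if tight_format else "chain-of-thought"
    # Redundancy is justified only when a wrong answer is expensive.
    return f"self-consistency over {base}" if needs_redundancy else base
```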

[DIAGRAM: StageGateFlow — aitm-pew-article-2-pattern-decision — Decision flow: is the task multi-step? -> is an example set available? -> is output format tight? -> is correctness worth Nx cost? -> pattern recommendation.]

The decision is not final. A practitioner starts with the lightest pattern that plausibly works, runs the evaluation harness (Article 8), and steps up to a heavier pattern only when the evaluation indicates it is needed. Over-engineering a prompt is cheaper than under-engineering it, but only narrowly: a prompt that is longer than it needs to be consumes tokens forever, and the cumulative cost of an oversized system prompt across millions of requests is not trivial.

Pattern composition and the techniques not worth learning separately

Many techniques circulating in the literature are compositions of the four patterns. Instruction-tuned prompting is zero-shot with a carefully authored instruction. Role prompting is zero-shot where the instruction assigns a persona. Generated-knowledge prompting is a two-stage zero-shot pipeline that asks the model to surface relevant knowledge first, then to use it to answer. Tree-of-thoughts, introduced by Yao et al. 2023 [9], is chain-of-thought augmented with branching and backtracking over intermediate reasoning states. Graph-of-thoughts, skeleton-of-thoughts, and least-to-most prompting are further elaborations. A practitioner does not need to memorise each named pattern to evaluate a task; they need to recognise the base pattern and the elaboration so they can judge whether the elaboration’s cost is worth its benefit for the task at hand.

The practical discipline is compositional. Start with zero-shot; if the output format drifts, add few-shot examples; if the reasoning is shallow, insert chain-of-thought; if correctness matters more than a single pass can deliver, wrap the whole thing in self-consistency or self-critique. Each layer is justified against its incremental cost. A prompt that has accreted four layers and works well on tests has also accreted four layers of complexity that someone will have to debug. A prompt that works well at two layers is, other things equal, a better prompt.
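
The escalation ladder can be sketched as a prompt builder in which each layer is optional and added only when evaluation shows the lighter configuration is insufficient. The function and its layout are illustrative assumptions.

```python
def compose_prompt(instruction: str, examples=None, chain_of_thought=False) -> str:
    """Build a prompt by layering: start from the zero-shot instruction,
    optionally add few-shot (input, output) pairs, optionally add a
    chain-of-thought cue. Each layer should be justified by evaluation."""
    parts = [instruction]
    if examples:
        parts += [f"input: {i}\noutput: {o}" for i, o in examples]
    if chain_of_thought:
        parts.append("Think step by step before giving your final answer.")
    return "\n\n".join(parts)
```

Self-consistency is deliberately absent here: it wraps the call loop, not the prompt text, so it belongs a layer above this builder.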

Another category of patterns concerns the instruction’s voice and specificity. Providers’ published guidance converges on a small set of authoring disciplines: be specific about the desired output; state negative constraints as positive requirements where possible (prefer answer in two sentences over do not ramble); supply context generously at the start of the instruction and narrow toward the specific task; and structure the instruction using markers or sections when the instruction is longer than a paragraph. These disciplines apply across all four base patterns and are cheap to adopt.

A worked contrast: sentiment classification versus mathematical word problems

Sentiment classification of short product reviews is a task the model knows extraordinarily well from pretraining. A zero-shot instruction of the form classify the sentiment of the following review as positive, negative, or neutral and respond with only the category name is sufficient for an internal prototype. For production, a handful of examples covering sarcasm, mixed sentiment, and domain-specific phrasing sharpens the behaviour and pins the output format.

A mathematical word problem of the kind found in GSM8K is a task the model sometimes solves by pattern-matching rather than by computation, producing confidently wrong answers. Zero-shot accuracy on GSM8K is weaker than few-shot chain-of-thought accuracy; self-consistency over eight samples closes much of the remaining gap [6]. A production feature answering numeric questions about business data would typically use chain-of-thought, often paired with a calculator tool (Article 5) to avoid leaving arithmetic to the language model at all.

The two cases show why the taxonomy matters. The right pattern for sentiment is the wrong pattern for math; the right pattern for math is overkill for sentiment. A practitioner who has internalised the taxonomy picks the right pattern in both cases without consulting a blog post.

Working with the model’s idiosyncrasies

Each model provider ships idiosyncrasies that affect how foundational patterns land. Some models respond especially well to explicit structure markers in the instruction; others perform better when the instruction is expressed in continuous prose. Some models’ performance on chain-of-thought is sensitive to the exact phrasing of the reasoning cue; “let us think step by step” outperforms “think step by step” on some providers and vice versa on others. Some models respond to few-shot examples best when the examples are formatted as a conversation; others benefit from a structured question-answer layout.

The practitioner’s discipline is to write the prompt to the pattern and then to tune the surface details to the target model. A prompt that has been tuned for one model should be re-evaluated when porting to another; the evaluation harness in Article 8 is the instrument that catches the surface-level idiosyncrasies. The pattern itself, the base choice among zero-shot, few-shot, chain-of-thought, and self-consistency, survives the port; the surface details do not always.
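
One way to keep the pattern fixed while tuning surface details is to externalise the per-model phrasings. The model names and cue mapping below are hypothetical; both cue variants appear in the discussion above.

```python
# Hypothetical per-model surface tuning: the chain-of-thought pattern is
# constant, only the cue phrasing varies by target model.
REASONING_CUES = {
    "model-a": "Let us think step by step.",
    "model-b": "Think step by step.",
}

def cot_prompt(task: str, model: str) -> str:
    """Append the reasoning cue tuned for the target model; fall back to
    a generic cue for models not yet evaluated."""
    cue = REASONING_CUES.get(model, "Think step by step.")
    return f"{task}\n\n{cue}"
```

Porting to a new provider then means re-running the evaluation harness and, at most, adding one dictionary entry.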

The honest corollary is that a prompt is not perfectly portable across providers. A practitioner who needs portability designs around the pattern, evaluates against several providers, and accepts a small quality gap rather than a major redesign.

Anti-patterns to avoid

Several anti-patterns recur in teams new to prompt engineering. Over-prompting names the habit of writing a system prompt of several thousand tokens in the hope that more instruction produces more reliable behaviour; the opposite is typically true, because long prompts dilute the signal and admit contradictions. Magic-phrase accretion names the habit of adding one more please be careful or one more this is very important every time the feature misbehaves; the phrases are placebos and the underlying issue usually needs a structural fix. Instruction-by-example-only names the habit of relying entirely on few-shot examples without any explicit instruction; the result is a prompt that works on inputs resembling the examples and fails elsewhere. Each anti-pattern has a corrective, and the corrective is usually to return to the base pattern’s discipline.

Summary

Four patterns cover most enterprise prompting. Zero-shot is the default starting point. Few-shot is the way to pin format and resolve ambiguity with examples. Chain-of-thought elicits intermediate reasoning on multi-step tasks. Self-consistency and its siblings trade cost for reliability on high-stakes multi-step tasks. The choice between patterns is driven by task structure, format tightness, and the cost of error, not by fashion. Article 3 extends the taxonomy to output structuring, where the prompt pattern is paired with a decoding-time constraint to produce outputs a downstream system can consume.

Further reading in the Core Stream: Generative AI and Large Language Models.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. Tom B. Brown et al. Language Models are Few-Shot Learners. NeurIPS 2020. https://arxiv.org/abs/2005.14165 — accessed 2026-04-19.

  2. Takeshi Kojima et al. Large Language Models are Zero-Shot Reasoners. NeurIPS 2022. https://arxiv.org/abs/2205.11916 — accessed 2026-04-19.

  3. Prompt engineering. OpenAI Platform documentation. https://platform.openai.com/docs/guides/prompt-engineering — accessed 2026-04-19.

  4. Prompt engineering overview. Anthropic documentation. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering — accessed 2026-04-19.

  5. Prompting strategies. Google Gemini API documentation. https://ai.google.dev/gemini-api/docs/prompting-intro — accessed 2026-04-19.

  6. Jason Wei et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. https://arxiv.org/abs/2201.11903 — accessed 2026-04-19.

  7. Xuezhi Wang et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023. https://arxiv.org/abs/2203.11171 — accessed 2026-04-19.

  8. Noah Shinn et al. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023. https://arxiv.org/abs/2303.11366 — accessed 2026-04-19.

  9. Shunyu Yao et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023. https://arxiv.org/abs/2305.10601 — accessed 2026-04-19.