AITE M1.1-Art05 v1.0 Reviewed 2026-04-06 Open Access
M1.1 Foundations of AI Transformation
AITF · Foundations

Chunking and Embedding Strategy


12 min read Article 5 of 48

AITE-SAT: AI Solutions Architect Expert — Body of Knowledge Article 5 of 35


Most RAG failures are not retrieval failures. They are chunking and embedding failures that manifest as retrieval failures. A user asks a clear question. The retriever returns three passages, all from the wrong chapter of a 300-page document. The model gives a plausible-sounding answer grounded in irrelevant text. The team investigates the retriever. The retriever is doing its job; the index contained exactly the chunks the retriever found. The real defect is upstream: the chunks were created with a strategy that split the relevant section into pieces that never embedded together, or the embedding model was trained on short passages and performs poorly on the long technical prose the corpus contains. Chunking and embedding are the foundation on which retrieval rests; when the foundation is wrong, the rest of the architecture cannot compensate.

What chunking does

A chunking strategy is the function that takes a document and produces a sequence of text fragments (chunks) that get indexed. Each chunk is embedded and stored in the vector index with metadata linking it back to its source. At retrieval time, the query is embedded and the index returns chunks whose embeddings are most similar to the query embedding. The architect’s decisions are: how large the chunks should be, how much overlap they should have with their neighbors, how document structure (headings, sections, paragraphs, sentences) should influence the boundaries, what metadata travels with each chunk, and what preprocessing (OCR cleanup, language detection, table extraction) runs before chunking.

Chunks that are too large dilute the signal: a relevant sentence is embedded together with many irrelevant ones, and the chunk’s embedding does not emphasize the sentence strongly enough to be retrieved for a narrow query. Chunks that are too small lose context: a relevant sentence embedded alone without its surrounding paragraph may be retrieved but cannot be understood by the generator without the context the chunk no longer provides. The Goldilocks size depends on the embedding model, the document type, and the query distribution.

The five principal strategies

Fixed-window chunking splits a document at a fixed character or token count, typically with overlap. A 1,000-token window with 200-token overlap produces chunks that each carry roughly four-fifths of new content plus one-fifth of context from the prior chunk. Fixed-window is simple to implement, predictable in size, and baseline-quality for a wide class of documents. It is the default in most introductory RAG tutorials and is implemented out of the box in LangChain’s RecursiveCharacterTextSplitter and LlamaIndex’s SentenceSplitter.1 It works less well when document structure matters — a section-boundary cut in the middle of a paragraph loses information the structure would have preserved.
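A fixed-window splitter is a few lines of code. The sketch below approximates tokens by whitespace splitting; a production pipeline would count tokens with the embedding model's own tokenizer, and the function name is illustrative:

```python
def fixed_window_chunks(text, window=1000, overlap=200):
    """Split text into fixed-size windows with overlap.

    Tokens are approximated by whitespace splitting here; real
    pipelines use the embedding model's tokenizer instead.
    """
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    tokens = text.split()
    step = window - overlap  # each new chunk repeats `overlap` tokens of prior context
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + window]))
        if start + window >= len(tokens):
            break
    return chunks
```

With the article's 1,000-token window and 200-token overlap, each chunk after the first carries 800 new tokens plus 200 tokens repeated from its predecessor.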

Sentence-window chunking embeds each sentence as its own chunk but returns, at retrieval time, the sentence plus a window of neighbors. This produces embeddings that are maximally specific (a sentence-level signal is not diluted) while giving the model a larger window of context when a chunk is selected. LlamaIndex’s SentenceWindowNodeParser implements this pattern directly.2 It works well for FAQ-style corpora and short-answer retrieval and is harder to tune for long, flowing prose.
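The pattern can be sketched as follows. The node shape and field names are illustrative, not LlamaIndex's actual API: each sentence becomes the embedded retrieval unit, while the context window of neighbors travels with it for the generator.

```python
def sentence_window_nodes(sentences, window_size=2):
    """Pair each sentence (the embedded unit) with a window of
    neighboring sentences (the context handed to the generator)."""
    nodes = []
    for i, sent in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append({
            "embed_text": sent,                       # what gets embedded and matched
            "context": " ".join(sentences[lo:hi]),    # what the generator receives
        })
    return nodes
```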

Semantic chunking uses an embedding or a classifier to decide where chunk boundaries should fall, so boundaries align with topic shifts rather than with fixed counts. The boundary detector looks at consecutive sentence embeddings and places a boundary where embeddings diverge. This preserves topical coherence within chunks at the cost of variable chunk size and extra indexing compute. It works well for long narrative documents where fixed windows cut awkwardly. Implementation examples appear in both LangChain and Unstructured.io.3
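A minimal boundary detector, assuming sentence embeddings have already been computed by some model; the similarity threshold is illustrative and would be tuned on the corpus:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Place a chunk boundary wherever the similarity between
    consecutive sentence embeddings falls below `threshold`."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```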

Hierarchical chunking indexes the document at multiple granularities simultaneously: sentences, paragraphs, sections, full documents. Retrieval can first find the relevant section, then drill down into the relevant paragraph, then return the relevant sentences plus the surrounding paragraph as context. LlamaIndex’s hierarchical node parsers and DocumentSummaryIndex implement the pattern.4 The architect pays extra indexing cost for the capability to answer queries at multiple levels of detail.
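A sketch of the multi-granularity indexing step, assuming the document arrives pre-parsed into sections of paragraphs; the node shape and id scheme are illustrative. Every node records its parent, so retrieval can drill down from a matched section or expand a matched paragraph back to its section:

```python
def hierarchical_nodes(doc):
    """Index a document at two granularities at once. `doc` is a
    list of sections, each a list of paragraph strings; parent
    links let retrieval drill down or expand context."""
    nodes = []
    for s_idx, section in enumerate(doc):
        sec_id = f"sec-{s_idx}"
        nodes.append({"id": sec_id, "level": "section",
                      "parent": None, "text": " ".join(section)})
        for p_idx, para in enumerate(section):
            nodes.append({"id": f"{sec_id}-p{p_idx}", "level": "paragraph",
                          "parent": sec_id, "text": para})
    return nodes
```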

Late chunking is a newer pattern, published in 2024, where the entire document is embedded as a single sequence and chunks are extracted from the resulting contextualized token representations after the fact. The result is chunks whose embeddings reflect the whole-document context, not just the local chunk. Late chunking is implemented by Jina AI and Cohere reference stacks; it shows quality improvements on long-document retrieval benchmarks at the cost of requiring an embedding model that supports long context.5

Each strategy has a use case. The architect tests at least two on the corpus before committing.

[DIAGRAM: TimelineDiagram — aite-sat-article-5-chunking-timeline — Horizontal timeline of a single document passing through chunking stages: “Raw file” → “Preprocess (OCR, cleanup)” → “Structure detection (headings, tables)” → “Chunk boundary rule (fixed/sentence-window/semantic/hierarchical/late)” → “Chunk + metadata” → “Embed” → “Write to index”. Metadata carried through each stage shown as annotations on the arrows.]

Embedding model selection

The embedding model is the function that turns text into vectors. Closed-weight options include OpenAI’s text-embedding models, Cohere’s embed family, Google’s embedding APIs through Vertex, and Voyage AI’s specialized models. Open-weight options include BAAI’s bge family (small, base, large, M3), Alibaba’s gte and Qwen embedding models, Mixedbread’s mxbai embeddings, and the broader sentence-transformers ecosystem. The choice depends on five criteria.

Quality on the corpus. The architect evaluates each candidate on a representative sample of the corpus’s query distribution. The MTEB benchmark provides a general-purpose comparison across tasks, but the use-case-specific evaluation is the binding one.6 Two models that score within 1% on MTEB may differ by 10% on a specific corpus.

Dimensionality. Embedding dimension determines index size and retrieval latency. A 3,072-dimensional embedding index is four times the size of a 768-dimensional one; for a billion-chunk index the difference is operational. Dimensionality-reduction techniques (such as Matryoshka representation learning, which several 2024-era embeddings support natively) let the architect truncate embeddings at query time for a cost-quality trade-off.7
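Matryoshka-style truncation is mechanically simple: keep the first components and re-normalize. A sketch, assuming the model was trained with Matryoshka representation learning; plain truncation of an ordinary embedding degrades quality unpredictably:

```python
import math

def truncate_embedding(vec, dim):
    """Truncate a Matryoshka-style embedding to its first `dim`
    components and re-normalize to unit length, trading a small
    quality loss for a smaller index and faster similarity search."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head
```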

Context length. The maximum input length matters when the corpus has long items. A model with a 512-token context cannot embed a 2,000-word legal passage as a single chunk without truncation or aggregation. Long-context embeddings (Voyage’s voyage-large-2-instruct, Cohere embed v3, Nomic nomic-embed-text-v1.5, Jina v3) handle this natively.

Language coverage. Multilingual corpora need multilingual embeddings. BAAI’s bge-m3, Cohere embed multilingual, and Google’s text-multilingual-embedding models differ in which languages they cover with what quality.

Operational model. A closed-weight embedding API is paid per call, is hosted by the provider, and cannot be self-hosted; an open-weight embedding model runs on the organization’s own infrastructure with predictable unit economics. The selection mirrors the model-plane selection discussed in Article 2.

A crucial rule: changing the embedding model requires reindexing the entire corpus. An architecture that ties itself to one embedding provider pays an exit cost equal to the cost of reprocessing every document. The architect plans reindexing capacity from the beginning rather than discovering the requirement when the provider deprecates a model.

Build-your-own reference

Article 5 is the first article where the curriculum explicitly teaches a build-your-own reference pattern alongside managed alternatives, per the technology-neutrality requirement. The same chunking-and-embedding pipeline runs three ways.

Managed API stack. The architect uses OpenAI’s text-embedding-3-large (or Anthropic-recommended equivalents for their ecosystem) to embed, Pinecone as the vector index, and LangChain as the orchestration layer. Ingestion is a managed-service concern; cost is per-call.

Cloud platform stack. The architect uses Azure OpenAI embeddings through Azure AI Foundry, Azure AI Search as the vector index, and Semantic Kernel as the orchestration layer; or Amazon Titan embeddings through Bedrock with Amazon OpenSearch as the index. Cost is per-call plus platform service fees.

Self-hosted open-source stack. The architect uses BAAI’s bge-large embeddings served via a Text Embeddings Inference container, pgvector inside PostgreSQL as the index, and LlamaIndex as the orchestration layer. Cost is amortized GPU-hours plus database-hours; the stack has no per-call external billing.

The three stacks look different operationally. They look identical at the architecture plane: the same five chunking strategies apply, the same five selection criteria apply, the same reindexing discipline applies. The architect who learns the pattern once applies it in each stack; the architect who learns the pattern in a managed stack alone cannot recognize it in a self-hosted one.

| Stack family | Embedding option | Vector store option | Orchestration option |
| --- | --- | --- | --- |
| Closed-weight managed APIs | OpenAI text-embedding-3-large; Cohere embed-v3; Voyage voyage-large-2 | Pinecone; Weaviate Cloud; Qdrant Cloud | LangChain; LlamaIndex |
| Open-weight self-hosted | BAAI bge-large; Qwen embedding; Mixedbread mxbai-embed-large | pgvector; Milvus; Qdrant self-hosted | LlamaIndex; Haystack; DSPy |
| Cloud platforms | Amazon Titan (Bedrock); Google text-embedding-005 (Vertex); Azure OpenAI | Amazon OpenSearch; Vertex Vector Search; Azure AI Search | Semantic Kernel; Bedrock Agents; Vertex AI Extensions |

Evaluating chunking and embedding together

Chunking strategy and embedding model interact; evaluating one at a time produces misleading results. The evaluation protocol is: for each combination of chunking strategy and embedding model, index the corpus, run the golden-set queries, and measure retrieval metrics — recall at k, mean reciprocal rank, and a task-specific metric such as answer correctness after generation. The combinations that perform best are the candidates; the one with the best cost-quality trade-off is selected.
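The protocol reduces to a grid search. A minimal sketch, assuming a `retrieve(combo, query)` function (hypothetical here) that returns ranked chunk ids from an index built with that chunking-embedding combination:

```python
def recall_at_k(ranked_ids, relevant_ids, k=5):
    """Fraction of relevant chunks appearing in the top k results."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant chunk, or 0 if none found."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0

def evaluate_grid(combos, golden_set, retrieve, k=5):
    """Average recall@k and MRR over golden-set queries for each
    (chunking strategy, embedding model) combination."""
    results = {}
    for combo in combos:
        recalls, mrrs = [], []
        for query, relevant in golden_set:
            ranked = retrieve(combo, query)
            recalls.append(recall_at_k(ranked, relevant, k))
            mrrs.append(mrr(ranked, relevant))
        results[combo] = {"recall@k": sum(recalls) / len(recalls),
                          "mrr": sum(mrrs) / len(mrrs)}
    return results
```

Answer correctness after generation would be scored separately, since it requires running the generator over each retrieved context.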

Two real cases illustrate how chunking decisions ripple.

Harvey AI, a legal AI company, has described chunking considerations on its engineering blog. Harvey’s corpora include case law and legal briefs whose structure (numbered sections, footnotes, citations) differs from general web text; Harvey’s chunking strategy preserves legal-document structure and embeds citation markers as first-class metadata so citations can be retrieved directly.8 The embedding model is chosen with the legal vocabulary in mind rather than a generic web corpus. The architecture is the same RAG pattern; the chunking and embedding details are domain-specific.

Perplexity AI, the search-oriented application, has discussed chunking in public blog posts describing how they process web content for passage-level retrieval. The chunking is aggressive at passage granularity because the use case is short-answer question answering where a long context would be ignored by the generator; the embedding model is optimized for passage-level tasks.9 Perplexity’s decisions are almost the opposite of Harvey’s, and both are correct; the corpora and query distributions are different.

Metadata and filtering

A chunk without metadata is indistinguishable from every other chunk in the index. A chunk with metadata — source document identifier, author, creation date, section heading, document type, security classification, tenant identifier — can be retrieved with filters applied. Metadata turns “retrieve the top-k chunks similar to this query” into “retrieve the top-k chunks similar to this query from documents owned by this tenant, created in the last year, in the policy-document class.” The architect plans metadata as a first-class part of the chunking stage, not as an afterthought. Retrieval-time filters that were not present at ingestion time cannot be added without reindexing; metadata designed in at ingestion carries forever.
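The filter-then-rank mechanics can be sketched in memory; real vector stores push the filter into the index rather than scanning candidates, and the chunk shape used here is illustrative:

```python
def filtered_top_k(chunks, query_vec, filters, k=3):
    """Apply exact-match metadata filters, then rank the surviving
    chunks by dot-product similarity (assuming unit-length vectors).
    Each chunk is a dict with 'vec', 'text', and 'meta' keys."""
    def matches(meta):
        return all(meta.get(key) == value for key, value in filters.items())
    candidates = [c for c in chunks if matches(c["meta"])]
    candidates.sort(key=lambda c: -sum(a * b for a, b in zip(c["vec"], query_vec)))
    return candidates[:k]
```

A tenant filter such as `{"tenant": "acme"}` simply never admits another tenant's chunks into the ranking, which is why the metadata must exist at ingestion time.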

Tenant isolation is often enforced through metadata filtering (Article 16 develops multi-tenancy); access control is enforced through metadata on security classification; temporal filtering is enforced through creation-date metadata. All three examples show why metadata is not optional.

[DIAGRAM: MatrixDiagram — aite-sat-article-5-chunk-size-overlap — A 2D heatmap matrix with “Chunk size” (tokens: 256 / 512 / 1024 / 2048) on one axis and “Chunk overlap” (percentage: 0 / 10 / 20 / 30) on the other. Cells are shaded by representative retrieval-quality score from an illustrative corpus; the best-performing cell is marked, and annotations note how the heatmap changes for a different corpus type.]

Regulatory alignment

Chunking and embedding strategies affect EU AI Act Article 10 (data and data governance). Article 10 requires that training, validation, and testing datasets be relevant, representative, and free of errors; in a RAG system, the retrieval corpus and its chunks are analogous to training data for the retrieval function.10 An architect who cannot describe how their chunks are produced, how metadata is assigned, and how the corpus is governed is failing an implicit Article 10 expectation. ISO/IEC 42001 Clause 8.3 requires life-cycle management for AI; the chunking pipeline has a life cycle whose changes must be managed like any other artifact change.11

Summary

Chunking and embedding are the foundation of RAG quality. The five principal chunking strategies — fixed-window, sentence-window, semantic, hierarchical, late — each fit different corpora and query distributions. Embedding model selection depends on quality, dimensionality, context length, language coverage, and operational model; changing the embedding model requires reindexing. The same pipeline runs on managed APIs, cloud platforms, and self-hosted open-source stacks; the architecture is invariant. Metadata is a first-class part of chunking, enabling tenant isolation, access control, and temporal filtering. Harvey AI’s legal-corpus chunking and Perplexity’s passage-level web chunking are two public examples of the same discipline applied to different domains. The architect evaluates chunking and embedding together on the corpus’s own golden set, never on benchmarks alone. Regulatory alignment with the EU AI Act and ISO 42001 is satisfied by documenting the chunking pipeline as a governed life-cycle artifact.

Further reading in the Core Stream: Data Architecture for Enterprise AI, Grounding, Retrieval, and Factual Integrity for AI Agents, and Data Governance for AI.



© FlowRidge.io — COMPEL AI Transformation Methodology. All rights reserved.

Footnotes

  1. LangChain RecursiveCharacterTextSplitter reference. LangChain Documentation. https://python.langchain.com/docs/how_to/recursive_text_splitter/ — accessed 2026-04-19. LlamaIndex SentenceSplitter reference. https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/ — accessed 2026-04-19.

  2. LlamaIndex SentenceWindowNodeParser and SentenceWindowRetriever. LlamaIndex Documentation. https://docs.llamaindex.ai/en/stable/examples/node_postprocessor/MetadataReplacementDemo/ — accessed 2026-04-19.

  3. Unstructured.io Partitioning and Chunking documentation. https://docs.unstructured.io/open-source/core-functionality/chunking — accessed 2026-04-19.

  4. LlamaIndex HierarchicalNodeParser and DocumentSummaryIndex. LlamaIndex Documentation. https://docs.llamaindex.ai/en/stable/ — accessed 2026-04-19.

  5. Michael Günther et al., “Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models,” Jina AI, 2024. https://jina.ai/news/late-chunking-in-long-context-embedding-models/ — accessed 2026-04-19.

  6. Niklas Muennighoff et al., “MTEB: Massive Text Embedding Benchmark.” https://huggingface.co/spaces/mteb/leaderboard — accessed 2026-04-19.

  7. Aditya Kusupati et al., “Matryoshka Representation Learning,” NeurIPS 2022. https://arxiv.org/abs/2205.13147 — accessed 2026-04-19.

  8. Harvey AI engineering blog (legal corpus retrieval). https://www.harvey.ai/blog — accessed 2026-04-19.

  9. Perplexity AI engineering blog posts on search and retrieval architecture. https://www.perplexity.ai/ — accessed 2026-04-19.

  10. Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 (EU AI Act), Article 10. Official Journal of the European Union. https://eur-lex.europa.eu/eli/reg/2024/1689/oj — accessed 2026-04-19.

  11. ISO/IEC 42001:2023 — Information technology — Artificial intelligence — Management system, Clause 8.3. International Organization for Standardization. https://www.iso.org/standard/81230.html — accessed 2026-04-19.