This article walks through the five multimodal architectural patterns, classifies three enterprise use cases by best-fit pattern, and covers the safety and evaluation extensions each modality requires.
Modality scope
A short inventory keeps the discussion concrete.
Text is the baseline. Every system handles text.
Image covers photos, screenshots, charts, scanned documents, and diagrams. Vision input has been production-ready for enterprise use since late 2023; vision output (image generation) is a separate stack and less common in enterprise AI.
Audio covers speech input (speech-to-text) and speech output (text-to-speech). Classical speech recognition has been production-ready for two decades, and foundation-model-era speech services (Whisper, AssemblyAI, Deepgram, Azure Speech, Amazon Transcribe, Google Cloud Speech) have dramatically improved it.
Video is expensive and mostly frame-by-frame today, effectively chaining image understanding over time. Genuinely multimodal video models exist but are not yet common in enterprise production.
Document is a special category. PDFs, Word documents, slides, and scanned images blend text, layout, images, and sometimes tables. Document AI is a distinct subfield with its own tooling — Unstructured.io, Azure AI Document Intelligence, Google Document AI, Amazon Textract, Mistral OCR, and open-source parsers built on layout models.¹
Structured or screen inputs — HTML, UI states, web pages — are increasingly relevant for agentic and automation use cases. Anthropic’s Computer Use and OpenAI’s comparable capabilities make screen-understanding a first-class modality.²
The five multimodal patterns
1. Separate encoders, text pivot
Each modality is processed by its own specialist encoder into text, and downstream reasoning is purely textual. A claim photo is classified by a vision model and described in a natural-language caption; the caption enters the orchestrator alongside the claim narrative. The LLM reasons over text.
This is the most common enterprise pattern in 2025 and 2026 because it decouples concerns cleanly. The vision model, the speech-to-text model, and the document parser each evolve independently. The orchestrator is the boring text-reasoning layer it already was. Cost is reasonably predictable. Failure modes are localised: when the image caption is wrong, the output is wrong in a recognisable way.
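The decoupling can be sketched in a few lines. This is a minimal illustration, not a provider API: `caption_image` and `reason_over_text` are hypothetical stubs standing in for a managed vision model and the orchestrator LLM.

```python
# Sketch of the separate-encoder, text-pivot pattern. Both functions below
# are hypothetical stand-ins for managed-service calls; the point is that
# every modality is reduced to text before reasoning begins.

def caption_image(image_bytes: bytes) -> str:
    """Stand-in for a specialist vision model returning a text caption."""
    return "Photo of a sedan with a dented front-left door panel."

def reason_over_text(prompt: str) -> str:
    """Stand-in for the text-only orchestrator LLM."""
    return f"DECISION based on: {prompt[:80]}..."

def handle_claim(photo: bytes, narrative: str) -> str:
    # The orchestrator never sees pixels; it sees a caption, so it
    # remains the boring text-reasoning layer it already was.
    caption = caption_image(photo)
    prompt = f"Claim narrative: {narrative}\nPhoto description: {caption}"
    return reason_over_text(prompt)

print(handle_claim(b"...", "Rear-ended at a stop light."))
```

Because the vision stub returns plain text, swapping in a better vision model changes nothing downstream — the localised failure mode described above.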
2. Unified multimodal model
A single model (GPT-4o, Claude 3.7 Sonnet, Gemini 1.5 Pro, Qwen-VL) accepts text and one or more other modalities in the same context and reasons across them directly. The advantage: the model can attend to low-level image features the caption would have discarded. The disadvantage: higher cost, more variable latency, and harder evaluation because the failure modes mix.
Unified models are increasingly the right choice for visual question-answering, chart reasoning, and OCR-plus-reasoning tasks where the text version cannot preserve enough signal. For routine classification or summarisation of images, the separate-encoder pattern is often simpler and cheaper.
3. Pre-processing pipeline (document-heavy)
For documents, a layered pre-processing pipeline does the modality-specific work before the model sees the content. A typical PDF pipeline: ingest -> layout parser (detect text, tables, images, headers) -> OCR on scanned regions -> semantic chunking that respects layout -> embedding -> index. Retrieval happens over the resulting chunks; the LLM reasons over retrieved text, possibly with accompanying image excerpts for visual references.
This pattern is the dominant enterprise approach for document-heavy workloads. The pipeline can be built on Unstructured.io, Azure Document Intelligence, Google Document AI, open-source equivalents, or a bespoke mix. The architect picks based on document variety, throughput, and residency requirements.
4. Modality-specific retrieval
Where the corpus contains image or audio assets that need to be searchable, the retrieval layer is extended to handle those modalities. CLIP-style embeddings for images, audio embeddings for sound clips. The retriever returns the modality-appropriate asset and the reasoning layer decides how to present or process it.
This pattern shows up in media archives, training-content search, and compliance review over recorded calls. The architect specifies the embedding model, the index, and the retrieval scoring.
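The retrieval scoring itself is modality-agnostic once assets are embedded. A minimal sketch follows, with toy three-dimensional vectors standing in for CLIP-family image embeddings and audio embeddings; only the cosine-similarity ranking logic is real.

```python
# Modality-specific retrieval sketch. The embeddings are illustrative toy
# vectors, not real CLIP output; a production index would hold
# high-dimensional embeddings in a vector store.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy index: asset id -> (pretend) embedding, mixed modalities.
index = {
    "dent_photo.jpg":   [0.9, 0.1, 0.0],
    "invoice_scan.png": [0.1, 0.8, 0.2],
    "call_clip.wav":    [0.0, 0.2, 0.9],
}

def retrieve(query_embedding, k=2):
    scored = sorted(index.items(),
                    key=lambda kv: cosine(query_embedding, kv[1]),
                    reverse=True)
    return [asset for asset, _ in scored[:k]]

print(retrieve([1.0, 0.0, 0.1]))
```

The reasoning layer then decides how to present the returned asset — inline image, audio excerpt, or a text description of it.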
5. Multimodal output
The system produces non-text output: a chart, a generated image, a voice response, a formatted document. Output modality adds stages — text-to-speech for voice, chart-rendering libraries for visualisations, DOCX or PDF generation for documents, image generation models for graphical output. Each output modality has its own eval surface.
Document pipeline in depth
Documents deserve their own pipeline section because document AI is the most common multimodal enterprise workload. A production-grade PDF pipeline handles these stages in order:
- Format detection and normalisation. PDF, DOCX, HTML, image, scanned PDF, encrypted PDF — each routed to the appropriate path.
- Layout parsing. A layout model (DiT, LayoutLMv3-family, or a managed service) identifies blocks: paragraphs, tables, images, headers, footers.
- OCR for scanned content. Tesseract, cloud OCR services, or integrated layout-plus-OCR paths.
- Table extraction. A specialised step because tables carry structured meaning that text flattening loses. Managed document AI services or libraries like Camelot, pdfplumber, or the model-based equivalents.
- Semantic chunking. The chunking decisions from Article 5 apply but with layout awareness — keep tables together, keep headers with their sections.
- Embedding and indexing. As covered in Articles 5 and 6.
- Retrieval-time reranking. Document chunks benefit from rerankers because layout-respecting chunking tends to produce slightly noisier top-k results than pure text.
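The staged structure above can be sketched as composable steps. The parser here is a hypothetical stub (a real build would call Unstructured.io, Azure Document Intelligence, or similar); the sketch shows the layout-aware chunking rule — headers stay with their sections, tables are never split.

```python
# Document-pipeline sketch under stated assumptions: parse_layout is a
# stand-in for a real layout model and returns typed blocks in reading
# order. Only the chunking logic is meant literally.

def parse_layout(raw: bytes) -> list[dict]:
    # Hypothetical layout parser output for a policy PDF.
    return [
        {"type": "header", "text": "Coverage Limits"},
        {"type": "paragraph", "text": "The insurer shall pay up to..."},
        {"type": "table", "text": "Limit | Deductible\n50,000 | 500"},
        {"type": "header", "text": "Exclusions"},
        {"type": "paragraph", "text": "Damage from wear and tear..."},
    ]

def chunk_with_layout(blocks: list[dict]) -> list[str]:
    # A new header closes the previous chunk, so each section
    # (header + paragraphs + tables) travels together.
    chunks, current = [], []
    for block in blocks:
        if block["type"] == "header" and current:
            chunks.append("\n".join(current))
            current = []
        current.append(block["text"])
    if current:
        chunks.append("\n".join(current))
    return chunks

print(chunk_with_layout(parse_layout(b"%PDF...")))
```

Each resulting chunk is what gets embedded and indexed; keeping the table inside its section is what preserves the structured meaning that flat text extraction would lose.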
The pipeline is a natural candidate for a platform service (Article 24). Multiple product teams need document handling; standardising the pipeline saves significant cost and enforces quality.
Evaluation for multimodal systems
Each modality adds an evaluation surface. The text eval harness (Article 11) is necessary but not sufficient. Extensions:
Vision eval: visual question-answering accuracy on a gold set; OCR accuracy on scanned content; image classification accuracy where applicable. Public benchmarks — ChartQA, DocVQA, MMMU — provide reference baselines.³
Audio eval: word error rate (WER) on speech-to-text, speaker diarisation accuracy, speech-output quality (often subjective MOS scores).
Document eval: layout-detection F1, table-extraction precision and recall, end-to-end question-answering accuracy on documents.
Cross-modal consistency eval: does the model’s description of an image match the image; does the summary of a call match the transcript.
Eval set curation is more expensive for multimodal than for text because the assets are harder to collect and annotate. Budget accordingly.
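Of the metrics above, word error rate is the most mechanical to compute: the Levenshtein distance between reference and hypothesis word sequences, divided by the reference length. A minimal implementation for gold-set scoring:

```python
# Word error rate (WER) via classic dynamic-programming edit distance
# over words: deletions, insertions, and substitutions all cost 1.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution in four reference words -> WER of 0.25.
print(word_error_rate("the claim was approved", "the claim was denied"))
```

Note that a low WER on clean studio audio says nothing about dialect coverage or noisy lines — which is why the anti-patterns section insists on evaluating the whole audio path.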
Cost and latency implications
Every added modality multiplies the cost and latency surface. A rough order-of-magnitude reference (varies by provider and model tier):
- Text-only model call: ~1x cost and latency baseline.
- Vision-plus-text model call: 2-5x baseline (image tokens plus vision encoder time).
- Audio transcription plus model call: add 1-3 seconds of pre-processing latency; transcription cost often low ($0.006/minute class).
- Document pipeline plus model call: 5-30 seconds end-to-end for a 20-page PDF; cost depends heavily on pipeline choices.
- Image generation output: adds seconds to tens of seconds; cost is per-image.
The architect models the cost curves per use case (Article 33). A document-heavy workload where every ticket includes a PDF will have total cost dominated by the document pipeline, not the model call itself.
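A back-of-envelope model makes that claim concrete. All figures below are illustrative placeholders, not provider pricing: an assumed text-call baseline, an assumed per-page pipeline cost, and the "$0.006/minute class" transcription rate from the list above.

```python
# Toy per-ticket cost model for a document-heavy workload. Every constant
# is an assumption for illustration; substitute real provider pricing.

BASELINE_CALL_USD = 0.01        # assumed text-only model call
PER_PAGE_PIPELINE_USD = 0.002   # assumed parse/OCR/embed cost per page
PER_MINUTE_AUDIO_USD = 0.006    # transcription, per the list above

def ticket_cost(pages: int, audio_minutes: float = 0.0) -> float:
    doc_pipeline = pages * PER_PAGE_PIPELINE_USD
    transcription = audio_minutes * PER_MINUTE_AUDIO_USD
    return doc_pipeline + transcription + BASELINE_CALL_USD

# A 20-page PDF per ticket: the pipeline ($0.04) dominates the
# model call ($0.01), matching the observation above.
print(f"${ticket_cost(pages=20):.3f} per ticket")
```

Even with crude constants, the shape of the curve — pipeline cost scaling with pages while the model call stays flat — is what the architect needs to see before committing to a design.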
Safety extensions per modality
Each modality brings its own safety considerations.
Image input: adversarial images can carry prompt injections (text in images that the model reads). CSAM and other content-safety concerns apply to user-uploaded media. PII in images (license plates, faces, document images) must be handled per the data-governance model.
Audio input: voice-printing and biometric considerations; regional wiretap and call-recording rules; PII in transcripts.
Document input: sensitive data often at higher concentration than text-only inputs; careful redaction before ingestion to platforms that log retrieved content.
Image output: generated-image safety, including the question of whether the generated image could be mistaken for a photograph (misinformation risk) or impinges on an identifiable person or copyrighted work.
Audio output: voice cloning and deepfake concerns; regulatory requirements to disclose synthesised voices in some jurisdictions.
The architect defines modality-specific input filters, output filters, and consent paths where appropriate. The responsible-AI patterns in Article 31 apply to all modalities.
Worked example — Insurance claims adjustment (vision + document + text)
A claims-adjustment assistant accepts an accident photo, the policy document, and the claimant narrative and produces a classified coverage decision, a damage estimate range, and a draft letter.
- Pattern: Separate encoders with text pivot for the core reasoning, supplemented by the vision model when a detailed damage visualisation is required.
- Document pipeline: parse policy PDF, extract relevant clauses via retrieval.
- Vision: classify damage type (dent, scratch, total loss) via a specialist vision model; pass the classification as text into the orchestrator.
- Text reasoning: LLM composes the coverage decision by reasoning over the policy clauses, the narrative, and the damage classification.
- Evaluation: vision accuracy on a held-out gold set of photos; end-to-end decision accuracy on a held-out gold set of claims; citation accuracy on the retrieved policy clauses.
- Safety: human approval mandatory before the draft letter is sent; audit log of all coverage decisions with the AI’s contribution flagged.
Worked example — Call-centre assistant (audio + text)
A call-centre assistant transcribes the live call, highlights compliance concerns to the agent in real time, and drafts a post-call summary.
- Pattern: Separate encoders with text pivot. Audio -> transcript via a managed speech service; transcript -> LLM for real-time analysis; transcript + notes -> LLM for post-call summary.
- Latency: strict — the agent’s coaching overlay is useless if it arrives more than a few seconds late. Streaming transcription is required.
- Safety: regulatory obligations around call recording (both-party consent in many US states, GDPR implications in the EU); sensitive-topic detection for routing to supervisors.
- Evaluation: WER on transcription quality; precision/recall on compliance-concern detection; summary quality on a rubric.
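The real-time overlay in this example can be reduced to a toy: scan each streamed transcript segment as it arrives and flag watchlist phrases. The watchlist and segments are illustrative; a production system would use a classifier or an LLM call inside the latency budget, not keyword matching.

```python
# Toy compliance-flagging pass over streamed transcript segments.
# Watchlist terms are hypothetical examples of phrases an agent
# should not say on a recorded call.

WATCHLIST = {"guarantee", "refund in full", "no risk"}

def flag_segment(segment: str) -> list[str]:
    lowered = segment.lower()
    return sorted(term for term in WATCHLIST if term in lowered)

stream = [
    "Thanks for calling, how can I help?",
    "I can guarantee you a refund in full today.",
]
for segment in stream:
    hits = flag_segment(segment)
    if hits:
        print("COMPLIANCE FLAG:", hits)
```

Segment-at-a-time processing is what makes the strict latency budget feasible: the overlay never waits for the full transcript.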
Worked example — Compliance document review (document + text)
A compliance team uploads a 200-page contract and asks the assistant to identify deviations from a standard template.
- Pattern: Document pipeline plus retrieval plus text reasoning.
- Pipeline: PDF -> layout parse -> table extraction -> semantic chunks -> embed and index.
- Reasoning: LLM reasons over retrieved chunks plus the standard template’s deviation catalogue.
- Safety: no document content leaves the residency region; logs redact deal-sensitive identifiers; human sign-off required before any action is taken on the deviations.
- Evaluation: deviation-detection precision and recall against a labelled corpus; citation accuracy (does the detected deviation cite the right clause).
Anti-patterns
- Pushing all modalities through the biggest multimodal model. Expensive, slow, and unnecessary for routine cases. Separate encoders plus text pivot is cheaper and easier to evaluate.
- Document pipelines without validation gates. Silently passing low-quality OCR into the index produces bad retrieval that looks normal. Gates with confidence thresholds and dead-letter routing are mandatory.
- Assuming vision models are accurate. Vision models hallucinate object counts, misread charts, and confabulate text from blurry images. The eval harness must cover these failure modes.
- Conflating audio quality with transcription quality. A transcript can be clean while the original audio is compromised by background noise or dialect coverage gaps. Eval the whole path, not just one stage.
- No modality-specific fallback. When the vision model is unavailable, what happens to the claim? A graceful fallback to text-only review is better than a broken feature.
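The validation-gate anti-pattern above has a simple remedy: a confidence threshold in front of the index, with dead-letter routing for everything below it. A minimal sketch, with an illustrative threshold and toy OCR result dicts:

```python
# Validation gate with confidence threshold and dead-letter routing.
# Low-confidence OCR is quarantined for human review instead of being
# silently indexed. The threshold value is an illustrative assumption.

OCR_CONFIDENCE_THRESHOLD = 0.85
index, dead_letter = [], []

def gate_ocr_result(result: dict) -> None:
    if result["confidence"] >= OCR_CONFIDENCE_THRESHOLD:
        index.append(result["text"])
    else:
        dead_letter.append(result)  # quarantined, never indexed

for result in [
    {"text": "Policy limit: 50,000", "confidence": 0.97},
    {"text": "P0l1cy l1m1t: 5O,OOO", "confidence": 0.41},
]:
    gate_ocr_result(result)

print(len(index), len(dead_letter))
```

The same gate shape applies to the fallback anti-pattern: when a modality-specific service is down, route to a degraded text-only path rather than failing the whole request.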
Governance integration
EU AI Act Articles 10 (data governance) and 15 (cybersecurity) extend to every modality.⁴ The data governance obligations apply whether the training and retrieval data is text, image, or audio. Biometric processing under Article 5 and Article 6 / Annex III raises the stakes for any system that processes voice or face data. The architect’s compliance review explicitly enumerates the modalities in scope.
Summary
Multimodal architecture in 2026 is largely a pattern-selection exercise across five patterns: separate encoders with text pivot, unified models, pre-processing pipelines for documents, modality-specific retrieval, and modality-specific output. Document pipelines are the most common multimodal workload and deserve platform-level standardisation. Each modality adds cost, latency, eval surface, and safety considerations; the architect specifies all four before the feature ships.
Key terms
- Multimodal architecture
- Separate-encoder pattern
- Unified multimodal model
- Document pipeline
- Modality-specific eval
Learning outcomes
After this article the learner can: explain five multimodal patterns; classify three enterprise use cases by pattern fit; evaluate a multimodal design for safety coverage across modalities; design a multimodal architecture brief for a given use case.
Footnotes
1. Unstructured.io open-source and managed documentation; Azure AI Document Intelligence documentation; Google Document AI documentation; Amazon Textract documentation.
2. Anthropic Computer Use public documentation (2024); OpenAI and Google public descriptions of comparable capabilities.
3. ChartQA (Masry et al., 2022); DocVQA (Mathew et al., 2021); MMMU (Yue et al., 2023).
4. Regulation (EU) 2024/1689 (AI Act), Articles 5, 6, 10, 15; Annex III.