
Multi-Modal AI

Multi-modal AI refers to AI systems that can process and reason across multiple types of data simultaneously, such as text, images, audio, and video.

What this means in practice

Modern foundation models such as GPT-4 and Gemini can analyze an image and answer questions about it in text, or combine visual and textual information for a more comprehensive understanding. Multi-modal capabilities are particularly valuable in enterprise settings where business problems involve diverse data types, for example processing insurance claims that include photographs, written descriptions, and structured data in a single workflow.
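As an illustration, multi-modal models typically accept a single message composed of typed content parts. The sketch below builds such a payload for an insurance claim in the content-parts style used by multi-modal chat APIs (such as OpenAI's Chat Completions); the helper name `build_claim_message` and the example URL are hypothetical, not part of COMPEL.

```python
import json

def build_claim_message(description: str, photo_url: str, claim_data: dict) -> dict:
    """Build one multi-modal chat message that combines free text, an image
    reference, and structured claim fields (serialized to text).

    Each content part carries a "type" field, following the convention
    used by multi-modal chat APIs such as OpenAI's Chat Completions.
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": f"Adjuster notes: {description}"},
            {"type": "image_url", "image_url": {"url": photo_url}},
            {"type": "text", "text": "Structured claim data: " + json.dumps(claim_data)},
        ],
    }

# Example: a claim mixing a photograph, a written description, and fields.
msg = build_claim_message(
    "Rear bumper damage after a low-speed collision.",
    "https://example.com/claims/1234/photo.jpg",
    {"claim_id": "1234", "policy": "AUTO-88", "estimate_usd": 1450},
)
```

Sending one combined message like this lets the model reason across all three inputs at once, rather than forcing a human to synthesize separate text, image, and database outputs.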

Why it matters

Business problems rarely involve only one data type. Insurance claims combine photographs and written descriptions. Manufacturing quality inspection combines visual and sensor data. Multi-modal AI can process these diverse inputs simultaneously, enabling automation of tasks that previously required human judgment to synthesize information across different formats. This expands the frontier of automatable enterprise tasks significantly.

How COMPEL uses it

During Calibrate, the organization's ability to process diverse data types is assessed under the Technology pillar. The Model stage evaluates whether multi-modal AI is appropriate for specific use cases in the portfolio when judging feasibility. The Evaluate stage measures system performance across all input modalities to ensure balanced quality.
