Model Updated 2026-04

Multimodal

Definition

A multimodal model processes, and in some cases generates, more than one type of data: text, images, audio, and video.
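One way to picture the definition is a single request made of parts, each tagged with its data type. The sketch below is purely illustrative: the `Part` type, the `modalities_used` and `is_multimodal` helpers, and the sample payloads are hypothetical names invented for this example and do not correspond to any vendor's API.

```python
from dataclasses import dataclass
from typing import Literal

# Hypothetical modality labels, taken from the definition above.
Modality = Literal["text", "image", "audio", "video"]


@dataclass(frozen=True)
class Part:
    """One piece of a request, tagged with its modality (illustrative type)."""
    modality: Modality
    payload: bytes


def modalities_used(parts: list[Part]) -> set[str]:
    """Return the distinct modalities present in a request."""
    return {p.modality for p in parts}


def is_multimodal(parts: list[Part]) -> bool:
    """Treat a request as multimodal when it mixes at least two modalities."""
    return len(modalities_used(parts)) >= 2


# A question in text plus an image to look at (placeholder bytes, not a real file).
request = [
    Part("text", b"What is shown in this picture?"),
    Part("image", b"\x89PNG..."),
]
```

Here `is_multimodal(request)` is true because the request combines text and an image, while a request containing only text parts would not qualify under this simplified rule.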

See also in the glossary

LLM (Large Language Model)

An LLM is an AI model trained on very large volumes of text, capable of understanding and generating human language.

Generative AI refers to artificial intelligence systems capable of creating original content: text, images, video, audio, code.

Text-to-Image refers to generating images from text descriptions using generative AI models.

Text-to-Speech converts written text into spoken voice using AI, with increasingly realistic results.

Tools that use multimodal models

The world's most used conversational AI assistant

Google's AI assistant with 1M token context

The AI that understands nuance, by Anthropic

The rebellious AI from xAI, connected to X in real time

Frequently Asked Questions

Which LLMs are multimodal?

Examples include GPT-4o, Gemini 2.0, and Claude Opus. As of 2026, most major LLMs are multimodal.

Does multimodal mean the model does everything?

No. A multimodal model processes multiple input types but doesn't necessarily excel at each one.