Technique · Updated 2026-04
Tokenizer
Definition
A tokenizer is the algorithm that splits text into tokens (elementary units) before it is processed by an LLM.
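As a minimal sketch of how a tokenizer splits text, here is a greedy longest-match segmenter over a toy, hand-written vocabulary (real tokenizers such as BPE learn their vocabulary from data; the vocabulary below is purely illustrative):

```python
def tokenize(text, vocab):
    """Greedy longest-match: at each position, emit the longest vocab entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking one char at a time.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary, for illustration only.
vocab = {"token", "ing", "able"}
print(tokenize("tokenizing", vocab))  # ['token', 'i', 'z', 'ing']
```

Characters not covered by the vocabulary fall back to single-character tokens, which is one reason unusual words or scripts consume more tokens than common English text.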
See also in the glossary
Token
A token is the basic unit processed by an LLM. It is a word fragment, punctuation mark, or single character that the model uses to understand and generate text.
LLM (Large Language Model)
An LLM is an AI model trained on massive text corpora, capable of understanding and generating human language.
Embedding
An embedding is a numerical representation (vector) of text or data, capturing its semantic meaning.
Context Window
The context window is the maximum number of tokens an LLM can process in a single request.
Frequently Asked Questions
Why is the tokenizer important?
It determines how many tokens a text consumes, and therefore what a request costs and whether the text fits in the context window. An inefficient tokenizer wastes tokens, raising cost and shrinking the usable context.
Do all LLMs use the same tokenizer?
No. OpenAI uses tiktoken, while Anthropic and Google each have their own tokenizers. The same text might be 100 tokens for GPT-4 and 120 for Claude.
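To see why counts differ across models, here is a sketch with two hypothetical vocabularies segmenting the same word (real tokenizer vocabularies are learned from data and contain tens of thousands of entries; these are toy stand-ins):

```python
def tokenize(text, vocab):
    """Greedy longest-match over a toy vocabulary (illustrative only)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown char becomes its own token
            i += 1
    return tokens

# Two hypothetical vocabularies split the same text into different counts.
vocab_a = {"un", "believ", "able"}
vocab_b = {"unbeliev", "able"}
text = "unbelievable"
print(tokenize(text, vocab_a))  # ['un', 'believ', 'able'] -> 3 tokens
print(tokenize(text, vocab_b))  # ['unbeliev', 'able']     -> 2 tokens
```

The same input costs 3 tokens under one vocabulary and 2 under the other, which is exactly why the same prompt is billed differently across providers.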