Technique · Updated 2026-04
Tokenizer
Definition
A tokenizer is the algorithm that splits text into tokens (elementary units) before it is processed by an LLM.
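As a minimal sketch of how a tokenizer splits text, here is a greedy longest-match segmenter over a toy, hand-written vocabulary (real tokenizers such as BPE learn their vocabulary from data; the vocabulary below is purely illustrative):

```python
def tokenize(text, vocab):
    """Greedy longest-match: at each position, emit the longest vocab entry."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, shrinking one char at a time.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(text[i])
            i += 1
    return tokens

# Hypothetical vocabulary, for illustration only.
vocab = {"token", "ing", "able"}
print(tokenize("tokenizing", vocab))  # ['token', 'i', 'z', 'ing']
```

Characters not covered by the vocabulary fall back to single-character tokens, which is one reason unusual words or scripts consume more tokens than common English text.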
See also in the glossary
Token
A token is the basic unit processed by an LLM. It is a word fragment, punctuation mark, or single character that the model uses to understand and generate text.
LLM (Large Language Model)
An LLM is an AI model trained on massive text corpora, capable of understanding and generating human language.
Embedding
An embedding is a numerical representation (vector) of text or data, capturing its semantic meaning.
Context Window
The context window is the maximum number of tokens an LLM can process in a single request.
Frequently Asked Questions
Why is the tokenizer important?
It determines how many tokens a text consumes, and therefore what a request costs and whether the text fits in the context window. An inefficient tokenizer wastes tokens, raising cost and shrinking the usable context.
Do all LLMs use the same tokenizer?
No. OpenAI uses tiktoken, while Anthropic and Google each have their own tokenizers. The same text might be 100 tokens for GPT-4 and 120 for Claude.
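To see why counts differ across models, here is a sketch with two hypothetical vocabularies segmenting the same word (real tokenizer vocabularies are learned from data and contain tens of thousands of entries; these are toy stand-ins):

```python
def tokenize(text, vocab):
    """Greedy longest-match over a toy vocabulary (illustrative only)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown char becomes its own token
            i += 1
    return tokens

# Two hypothetical vocabularies split the same text into different counts.
vocab_a = {"un", "believ", "able"}
vocab_b = {"unbeliev", "able"}
text = "unbelievable"
print(tokenize(text, vocab_a))  # ['un', 'believ', 'able'] -> 3 tokens
print(tokenize(text, vocab_b))  # ['unbeliev', 'able']     -> 2 tokens
```

The same input costs 3 tokens under one vocabulary and 2 under the other, which is exactly why the same prompt is billed differently across providers.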