Technique · Updated 2026-04
Quantization
Definition
Quantization reduces the numerical precision of an AI model's weights (for example, from 32-bit floats to 8-bit integers) to make the model smaller and faster, with minimal quality loss.
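To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, one common scheme: each float weight is scaled into the range [-127, 127] and rounded to an integer. The function names are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: scale floats into [-127, 127] and round."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 tensor and its scale."""
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# each weight is reconstructed to within half a quantization step (scale / 2)
```

Each int8 value takes 1 byte instead of 4 for a float32, a 4x reduction; the rounding introduces an error of at most half a quantization step per weight.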
See also in the glossary
LLM (Large Language Model)
An LLM is an AI model trained on billions of texts, capable of understanding and generating human language.
AI Inference
Inference is the process of using a trained AI model to generate predictions or responses from new data.
SLM (Small Language Model)
An SLM is a compact language model optimized to run on local devices with targeted performance on specific tasks.
GPU Cloud
GPU Cloud provides on-demand graphics processors for training and running AI models without hardware investment.
Frequently Asked Questions
4-bit, 8-bit quantization, what's the difference?
Original models store their weights as 16- or 32-bit numbers. 8-bit quantization roughly halves the size, and 4-bit cuts it to a quarter. A 70B-parameter LLM at 4 bits needs about 35 GB for its weights alone (70 × 10⁹ parameters × 0.5 bytes).
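The size arithmetic above can be checked with a one-line formula: parameter count times bits per weight, divided by 8 to get bytes. This counts weights only and ignores activation memory and runtime overhead; the function name is illustrative.

```python
def model_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: parameters x bits / 8, overhead ignored."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 70B model, weights only:
# 16-bit -> 140.0 GB, 8-bit -> 70.0 GB, 4-bit -> 35.0 GB
for bits in (16, 8, 4):
    print(f"{bits}-bit: {model_size_gb(70, bits):.1f} GB")
```

In practice a quantized model file is somewhat larger than this estimate, since scales and other metadata are stored alongside the integer weights.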
Does quality drop significantly?
At 8-bit, the drop is barely noticeable. At 4-bit, there is a slight drop on complex tasks, but it remains acceptable for most uses. At 2-bit, the loss is notable.