Technique · Updated 2026-04
Quantization
Definition
Quantization reduces the numerical precision of an AI model's weights (for example, from 32-bit floats to 8-bit integers) to make the model smaller and faster, with minimal quality loss.
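To make the idea concrete, here is a minimal sketch of symmetric per-tensor int8 quantization, one common scheme: each float weight is scaled into the range [-127, 127] and rounded to an integer. The function names are illustrative, not from any particular library.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 quantization: scale floats into [-127, 127] and round."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 tensor and its scale."""
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.5, 0.33, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# each weight is reconstructed to within half a quantization step (scale / 2)
```

Each int8 value takes 1 byte instead of 4 for a float32, a 4x reduction; the rounding introduces an error of at most half a quantization step per weight.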
See also in the glossary
LLM (Large Language Model)
An LLM is an AI model trained on billions of texts, capable of understanding and generating human language.
AI Inference
Inference is the process of using a trained AI model to generate predictions or responses from new data.
SLM (Small Language Model)
An SLM is a compact language model optimized to run on local devices with targeted performance on specific tasks.
GPU Cloud
GPU Cloud provides on-demand graphics processors for training and running AI models without hardware investment.
Frequently Asked Questions
4-bit, 8-bit quantization, what's the difference?
Original models store their weights as 16- or 32-bit numbers. 8-bit quantization roughly halves the size, and 4-bit cuts it to a quarter. A 70B-parameter LLM at 4 bits needs about 35 GB for its weights alone (70 × 10⁹ parameters × 0.5 bytes).
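The size arithmetic above can be checked with a one-line formula: parameter count times bits per weight, divided by 8 to get bytes. This counts weights only and ignores activation memory and runtime overhead; the function name is illustrative.

```python
def model_size_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB: parameters x bits / 8, overhead ignored."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# 70B model, weights only:
# 16-bit -> 140.0 GB, 8-bit -> 70.0 GB, 4-bit -> 35.0 GB
for bits in (16, 8, 4):
    print(f"{bits}-bit: {model_size_gb(70, bits):.1f} GB")
```

In practice a quantized model file is somewhat larger than this estimate, since scales and other metadata are stored alongside the integer weights.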
Does quality drop significantly?
At 8-bit, the drop is barely noticeable. At 4-bit, there is a slight drop on complex tasks, but it remains acceptable for most uses. At 2-bit, the loss is notable.