Technique · Updated 2026-04

RLHF (Reinforcement Learning from Human Feedback)

Definition

RLHF is a training technique that uses human feedback to align an LLM's behavior with user expectations.

Frequently Asked Questions

Why is RLHF necessary?
Without RLHF, a pretrained LLM is capable but hard to use: trained only to predict the next token, it can produce toxic, off-topic, or overly verbose responses. RLHF steers the model toward behavior that is helpful, harmless, and honest.
How does RLHF work?
Human annotators rank multiple model responses to the same prompt. A separate reward model is trained on these preference comparisons, and the LLM is then fine-tuned with reinforcement learning to maximize the learned reward, typically with a penalty that keeps it close to the original model.
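The reward-model step above is commonly trained with a pairwise (Bradley-Terry) loss: the model should assign a higher score to the response the human preferred. A minimal sketch in plain Python, with hypothetical reward scores standing in for a real model's outputs:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for reward-model training:
    negative log-probability that the human-preferred response
    scores higher than the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward margin in favor of the
# preferred response grows, and blows up when the ranking is wrong.
print(preference_loss(2.0, 0.5))  # correct ranking: small loss
print(preference_loss(0.5, 2.0))  # inverted ranking: large loss
```

Averaged over many human comparisons, minimizing this loss makes the reward model a stand-in for human judgment, which the reinforcement-learning step then optimizes against.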