Technique · Updated 2026-04
RLHF (Reinforcement Learning from Human Feedback)
Definition
RLHF is a training technique that uses human feedback to align an LLM's behavior with user expectations.
See also in the glossary
LLM (Large Language Model)
An LLM is an AI model trained on billions of texts, capable of understanding and generating human language.
Fine-tuning
Fine-tuning is the process of retraining an existing AI model on a specific dataset to adapt it to a particular domain or task.
AI Alignment
AI alignment aims to ensure an artificial intelligence system acts in accordance with human values and intentions.
Machine Learning
Machine Learning is a branch of AI where systems learn from data to improve their performance without being explicitly programmed for each task.
Frequently Asked Questions
Why is RLHF necessary?
Without RLHF, a pretrained LLM is capable but not reliably useful: it may produce toxic, off-topic, or overly verbose responses. RLHF steers it toward being helpful, harmless, and honest.
How does RLHF work?
Human annotators rank several responses the model produces for the same prompt. A reward model is trained to predict those preferences, and the LLM is then fine-tuned with reinforcement learning to maximize the learned reward.
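The reward-model step above can be sketched with a toy example. This is a minimal illustration, not a production RLHF pipeline: responses are stand-in feature vectors, the "human" preferences are simulated from a hidden scoring direction, and the reward model is a simple linear scorer trained with the Bradley-Terry pairwise loss commonly used for preference modeling. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical data): each response is a feature vector, and the
# simulated human prefers whichever response scores higher under a hidden
# "true" preference direction.
dim = 8
true_w = rng.normal(size=dim)
pairs = []  # list of (chosen_features, rejected_features)
for _ in range(500):
    a, b = rng.normal(size=(2, dim))
    chosen, rejected = (a, b) if true_w @ a > true_w @ b else (b, a)
    pairs.append((chosen, rejected))

# Reward model: a linear scorer r(x) = w @ x, trained to minimize the
# Bradley-Terry loss  -log sigmoid(r(chosen) - r(rejected)).
w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    grad = np.zeros(dim)
    for chosen, rejected in pairs:
        margin = w @ (chosen - rejected)
        p = 1.0 / (1.0 + np.exp(-margin))        # P(chosen preferred)
        grad += (p - 1.0) * (chosen - rejected)  # gradient of -log p
    w -= lr * grad / len(pairs)

# The learned reward should now rank pairs the way the labels do.
correct = sum((w @ c > w @ r) for c, r in pairs)
print(f"agreement with simulated preferences: {correct / len(pairs):.0%}")
```

In a full RLHF pipeline, the linear scorer would be a neural reward model over (prompt, response) text, and the final step would fine-tune the LLM against that reward with a policy-gradient method such as PPO.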