Technique · Updated 2026-04

RLHF (Reinforcement Learning from Human Feedback)

Definition

RLHF is a training technique that uses human feedback to align an LLM's behavior with user expectations.

Frequently Asked Questions

Why is RLHF necessary?
Without RLHF, a pretrained LLM is capable but hard to use: trained only to predict the next token, it can produce toxic, off-topic, or overly verbose responses. RLHF steers the model toward behavior that is helpful, harmless, and honest.
How does RLHF work?
Human annotators rank multiple model responses to the same prompt. A separate reward model is trained on these preference comparisons, and the LLM is then fine-tuned with reinforcement learning to maximize the learned reward, typically with a penalty that keeps it close to the original model.
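The reward-model step above is commonly trained with a pairwise (Bradley-Terry) loss: the model should assign a higher score to the response the human preferred. A minimal sketch in plain Python, with hypothetical reward scores standing in for a real model's outputs:

```python
import math

def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss for reward-model training:
    negative log-probability that the human-preferred response
    scores higher than the rejected one."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward margin in favor of the
# preferred response grows, and blows up when the ranking is wrong.
print(preference_loss(2.0, 0.5))  # correct ranking: small loss
print(preference_loss(0.5, 2.0))  # inverted ranking: large loss
```

Averaged over many human comparisons, minimizing this loss makes the reward model a stand-in for human judgment, which the reinforcement-learning step then optimizes against.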