Reinforcement Learning from Human Feedback (RLHF)
Reinforcement Learning from Human Feedback (RLHF) is a technique for aligning large language model behavior with human preferences and safety requirements.
What this means in practice
In the RLHF process, human evaluators rate model outputs on quality, helpfulness, and safety. These ratings train a reward model that captures human preferences. The language model is then fine-tuned with reinforcement learning to produce outputs that score highly according to the reward model. RLHF is a key part of what makes LLMs helpful, harmless, and honest rather than mere predictors of likely text.

However, RLHF introduces governance challenges:

- Feedback quality and bias: if evaluators have narrow perspectives, the model inherits those biases.
- Reward hacking: the model may optimize for the reward signal rather than genuine quality.
- Value alignment stability: preferences encoded at one point in time may become stale as organizational values evolve.
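The reward-modeling step can be illustrated with a minimal sketch. The code below trains a toy linear reward model on pairwise human preferences using the Bradley-Terry objective, which is commonly used for this step; the feature vectors, pairs, and function names are illustrative assumptions, not part of any real LLM pipeline.

```python
import math

def reward(w, x):
    # Toy linear reward model: r(x) = w . x over hand-made feature vectors.
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """pairs: list of (chosen_features, rejected_features) from evaluators."""
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            # Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected)
            p = sigmoid(reward(w, chosen) - reward(w, rejected))
            # Gradient ascent on the log-likelihood of the human preference
            g = 1.0 - p
            for i in range(dim):
                w[i] += lr * g * (chosen[i] - rejected[i])
    return w

# Hypothetical data: feature 0 = "helpfulness", feature 1 = "verbosity".
# Evaluators here prefer helpful outputs regardless of verbosity.
pairs = [([1.0, 0.2], [0.1, 0.9]),
         ([0.9, 0.5], [0.2, 0.4]),
         ([0.8, 0.1], [0.3, 0.8])]
w = train_reward_model(pairs, dim=2)

# After training, the reward model should rank each chosen output
# above its rejected counterpart.
for chosen, rejected in pairs:
    assert reward(w, chosen) > reward(w, rejected)
```

In a real pipeline the linear model is replaced by a neural network scoring full model outputs, and the learned reward then serves as the optimization target for the RL fine-tuning step. Note the reward-hacking risk mentioned above: the policy is rewarded for whatever the learned model scores highly, not for genuine quality.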