RLHF Explained: How AI Models Learn From Human Feedback
TL;DR
Reinforcement Learning from Human Feedback (RLHF) is the training technique that turned raw language models into products people actually want to use. Before RLHF, GPT-3 could complete text but was unreliable, often toxic, and difficult to control. After RLHF, ChatGPT became the fastest-growing consumer product in history. This guide explains how the RLHF pipeline works, how reward models shape product behavior, what alternatives like DPO and RLAIF mean for AI teams, and why alignment trade-offs are product decisions that AI PMs need to understand deeply.
What RLHF Is and Why It Changed AI Products
A base language model trained on internet text learns to predict the next token. It does not learn to be helpful, truthful, or safe. It learns to produce text that is statistically likely given the training corpus — which includes misinformation, toxic language, and every bad pattern the internet contains.
RLHF bridges the gap between "can generate text" and "generates text users find valuable." It works by collecting human judgments about what "good" model output looks like, training a reward model to predict those judgments, and then optimizing the language model to maximize the reward model's score.
Pre-RLHF: the raw model problem
GPT-3 (2020) was a remarkable text predictor but a terrible product. Ask it a question and it might answer correctly, continue with an unrelated paragraph, generate offensive content, or confidently state false information. There was no mechanism to make it prefer helpful responses over harmful ones. Fine-tuning on curated datasets helped somewhat, but could not teach the model the nuanced concept of 'what humans actually want.'
The RLHF breakthrough: InstructGPT and ChatGPT
OpenAI's InstructGPT paper (2022) showed that RLHF could take a base model and make it dramatically more helpful, truthful, and safe — as judged by human evaluators. The technique was applied to GPT-3.5 to create ChatGPT. The result: a model that follows instructions, refuses harmful requests, admits uncertainty, and produces responses humans prefer over the base model 85%+ of the time. This was the inflection point where LLMs became viable consumer products.
Why PMs need to understand RLHF
RLHF is not just an ML training technique — it is the mechanism that determines your AI product's personality, safety boundaries, and user experience. The reward model encodes what 'good' means for your product. The alignment choices made during RLHF training directly affect whether your AI assistant is cautious or bold, concise or verbose, creative or conservative. These are product decisions disguised as training decisions.
The alignment tax
RLHF makes models safer and more controllable, but it comes at a cost. Aligned models are typically less creative, more verbose, and more likely to refuse borderline requests than base models. This is the 'alignment tax' — the trade-off between safety and capability. Understanding this trade-off is essential for AI PMs because it directly affects what your product can and cannot do.
The RLHF Pipeline Step by Step
RLHF is a three-stage process, each with its own data requirements, failure modes, and product implications. Understanding these stages helps PMs ask the right questions about model behavior and quality.
Stage 1: Supervised fine-tuning (SFT)
Before RLHF can begin, the base model needs to learn the basic format of helpful responses. Human annotators write high-quality demonstrations: given a user prompt, they write the ideal response. The model is fine-tuned on thousands of these demonstrations to learn instruction-following behavior. SFT teaches the model what a good response looks like structurally — it should address the question, be well-organized, and use an appropriate tone. This stage typically requires 10,000–100,000 human-written demonstrations.
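To make the mechanics concrete, here is a minimal SFT sketch in Python using the Hugging Face transformers library. The model name, demonstration data, and hyperparameters are illustrative stand-ins, not a production recipe.

```python
# Minimal SFT sketch: fine-tune a causal LM on annotator-written demonstrations.
# "gpt2" is a stand-in for a real base model; data and learning rate are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Each demonstration pairs a user prompt with an ideal, human-written response.
demonstrations = [
    {"prompt": "Explain RLHF in one sentence.",
     "response": "RLHF fine-tunes a language model to match human preference judgments."},
]

for demo in demonstrations:
    text = demo["prompt"] + "\n" + demo["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Labels = input_ids: standard next-token prediction over the demonstration.
    # Production setups usually mask prompt tokens so loss covers only the response.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```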
Product implication: The quality and diversity of SFT demonstrations set the ceiling for your model's behavior. If demonstrations are written by a narrow group of annotators, the model learns a narrow definition of 'helpful.' Diverse annotator teams produce models that work better across user populations.
Stage 2: Reward model training
Human annotators are shown two or more model responses to the same prompt and asked to rank them from best to worst. These comparison pairs are used to train a reward model — a separate neural network that learns to predict human preferences. The reward model takes a (prompt, response) pair and outputs a scalar score representing how good the response is. Training typically requires 100,000–500,000 comparison pairs. The reward model must learn subtle distinctions: accuracy vs. confidence, helpfulness vs. verbosity, thoroughness vs. rambling.
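The core training objective fits in a few lines. Here is a sketch of the standard Bradley-Terry pairwise loss, assuming a hypothetical reward_model callable that scores a (prompt, response) pair; this is the textbook formulation, not any vendor's exact implementation.

```python
# Pairwise reward-model loss sketch (Bradley-Terry objective).
# reward_model is an assumed callable mapping (prompt, response) to a scalar tensor.
import torch
import torch.nn.functional as F

def pairwise_loss(reward_model, prompt, chosen, rejected):
    """Push the preferred response's score above the rejected one's."""
    r_chosen = reward_model(prompt, chosen)
    r_rejected = reward_model(prompt, rejected)
    # -log sigmoid(margin): minimized when the chosen response outscores the rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```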
Product implication: The reward model is your product's definition of quality, encoded as a neural network. If annotators are instructed to prefer safe, cautious responses, the reward model will penalize bold, creative ones. The annotation guidelines you write (or approve) directly shape the product personality. This is where product strategy meets ML training.
Stage 3: Policy optimization (PPO)
The language model (now called the 'policy') generates responses to prompts. The reward model scores each response. The policy is updated using Proximal Policy Optimization (PPO) — a reinforcement learning algorithm — to generate responses that score higher. A KL divergence penalty prevents the policy from diverging too far from the original SFT model, which would cause quality collapse. This stage runs for thousands of iterations, gradually improving the model's alignment with human preferences.
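Here is a sketch of the KL-shaped reward that PPO optimizes, assuming you already have per-token log-probs for one sampled response under both the policy and the frozen SFT reference; the variable names and beta value are illustrative.

```python
# Shape the reward signal for PPO: reward-model score minus a KL penalty
# that keeps the policy close to the SFT reference model.
import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, beta=0.02):
    """policy_logprobs / ref_logprobs: 1-D tensors of per-token log-probs."""
    kl = policy_logprobs - ref_logprobs   # per-token KL estimate
    rewards = -beta * kl                  # penalize drift from the reference model
    rewards[-1] += rm_score               # reward-model score assigned to the final token
    return rewards
```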
Product implication: PPO optimization is where reward hacking can occur. If the reward model has a flaw — for example, it prefers longer responses regardless of quality — the policy will learn to generate unnecessarily verbose output to maximize the reward. Monitoring for reward hacking during training is critical.
How Reward Models Shape Product Behavior
The reward model is the most consequential and least understood component of the RLHF pipeline. It is, in effect, a compressed representation of your product values. Understanding how reward models succeed and fail is essential for AI PMs.
Reward hacking and Goodhart's Law
When the reward model has imperfections — and it always does — the policy learns to exploit them. A reward model that slightly prefers longer responses will produce a policy that pads output with unnecessary caveats. One that rewards confident tone will produce a model that states uncertain things confidently. This is Goodhart's Law applied to ML: when the reward model becomes the optimization target, it ceases to be a good measure of quality.
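One way teams probe for the length bias described above is to check whether reward correlates with response length on held-out data before PPO gets a chance to exploit it. A minimal sketch, with reward_model and eval_set as assumed names (statistics.correlation requires Python 3.10+):

```python
# Probe a reward model for length bias on a held-out evaluation set.
import statistics

def length_bias(reward_model, eval_set):
    lengths = [len(ex["response"].split()) for ex in eval_set]
    scores = [float(reward_model(ex["prompt"], ex["response"])) for ex in eval_set]
    return statistics.correlation(lengths, scores)  # Pearson r; near +1 is a red flag
```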
Sycophancy and the agreeability trap
Human annotators tend to prefer responses that agree with the premise of the question, even when the premise is wrong. Reward models trained on these preferences produce sycophantic models that tell users what they want to hear rather than what is accurate. This is a product-critical failure mode: an AI assistant that agrees with incorrect user assumptions is worse than one that pushes back respectfully.
The safety-helpfulness tension
Reward models must balance competing objectives. A model that maximizes safety will refuse too many legitimate requests. A model that maximizes helpfulness will comply with harmful requests. The sweet spot depends on your product context: a children's education tool should err toward safety, while a professional coding assistant should err toward helpfulness. There is no universal correct balance.
Annotator disagreement and cultural values
When annotators disagree about which response is better, the reward model learns an average of conflicting preferences. This can produce bland, committee-approved output that satisfies no one fully. Annotator demographics, cultural backgrounds, and expertise levels all affect preferences. A reward model trained on US-based annotators may not produce responses that resonate with users in Japan, India, or Brazil.
What this means for AI PMs
If you are building on a foundation model (OpenAI, Anthropic, Google), the reward model was trained by their team with their values and their annotators. You inherit their alignment choices. If your product needs a different balance — more creative risk-taking for a brainstorming tool, more cautious refusals for a medical assistant — you need to layer your own guardrails and system prompts on top. If you are fine-tuning your own model with RLHF, the annotation guidelines you write are the most important product document you will produce.
RLHF Alternatives: DPO, RLAIF, and Constitutional AI
RLHF works but is expensive, unstable, and requires massive amounts of human annotation. Several alternatives have emerged that simplify or replace parts of the pipeline. Understanding these alternatives helps PMs evaluate training strategy proposals from their ML teams.
Direct Preference Optimization (DPO)
DPO eliminates the reward model and PPO stage entirely. Instead of training a separate reward model and then optimizing against it, DPO directly optimizes the language model on human preference data. Mathematically, DPO reformulates the RLHF objective as a classification problem on preference pairs. The result: simpler training, fewer hyperparameters, no separately trained reward model to hack, and comparable quality to RLHF on most benchmarks. DPO has become the default alignment method for many open-source model teams (Zephyr, Tulu, Nous) because it is dramatically easier to implement and debug.
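The DPO loss itself is compact. A sketch, assuming you have the summed log-probs of each response under the policy and the frozen reference model; beta is the usual DPO temperature, and its value here is illustrative.

```python
# DPO loss sketch: preference learning as classification, no reward model, no rollout.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Inputs: summed log-prob tensors of the chosen and rejected responses."""
    chosen_ratio = policy_chosen_lp - ref_chosen_lp         # implicit reward, chosen
    rejected_ratio = policy_rejected_lp - ref_rejected_lp   # implicit reward, rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```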
Reinforcement Learning from AI Feedback (RLAIF)
Instead of collecting human preferences, RLAIF uses a stronger AI model to provide preference judgments. Anthropic and Google have shown that GPT-4-class models can serve as effective reward signals for training smaller models. RLAIF reduces annotation cost by 90%+ and can scale to millions of comparisons. The limitation: RLAIF inherits the biases and blind spots of the judging model. If GPT-4 has a preference quirk, models trained with GPT-4 as the judge will inherit it. RLAIF works best when combined with a smaller set of human preferences for calibration.
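A sketch of what AI-feedback collection can look like, assuming a hypothetical judge_llm callable that wraps the stronger model's API; the judging prompt and answer parsing are illustrative.

```python
# Collect one AI-judged preference pair for downstream reward-model or DPO training.
def collect_ai_preference(judge_llm, user_prompt, response_a, response_b):
    verdict = judge_llm(
        f"Question: {user_prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response is more helpful, honest, and harmless? Answer 'A' or 'B'."
    ).strip().upper()
    a_wins = verdict.startswith("A")
    return {"prompt": user_prompt,
            "chosen": response_a if a_wins else response_b,
            "rejected": response_b if a_wins else response_a}
```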
Constitutional AI (Anthropic's approach)
Constitutional AI asks the model to critique and revise its own outputs according to a set of principles (the 'constitution'). The model generates a response, evaluates whether it violates any principles, and revises it. These self-critiqued outputs become training data. This reduces the need for human annotators to judge harmful content directly — which is psychologically taxing and creates annotator welfare concerns. Anthropic uses this approach for Claude's safety training. The product implication: the constitution (the list of principles) is a product specification document that determines your AI's values.
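A sketch of one critique-revise round, assuming a generic model callable that completes text; the principles and prompt templates here are illustrative placeholders, not Anthropic's actual constitution.

```python
# One constitutional critique-revise round; revised outputs become training data.
PRINCIPLES = [
    "Choose the response least likely to encourage harmful or illegal activity.",
    "Choose the response most honest about its own uncertainty.",
]

def critique_and_revise(model, draft):
    for principle in PRINCIPLES:
        critique = model(f"Response: {draft}\n\n"
                         f"Critique this response against the principle: '{principle}'")
        draft = model(f"Response: {draft}\n\nCritique: {critique}\n\n"
                      "Rewrite the response to address the critique.")
    return draft
```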
Online RLHF and iterative alignment
Rather than training RLHF once and deploying, online RLHF continuously collects user feedback from production and uses it to update the reward model and policy. This creates a tighter feedback loop between user experience and model behavior. The risk: production users can inadvertently teach the model bad behaviors if feedback signals are noisy or adversarial. Online RLHF requires robust feedback filtering and monitoring to prevent regression.
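Feedback filtering can start with simple heuristics before any signal reaches the reward model. A sketch with illustrative field names and thresholds; real pipelines combine many more signals.

```python
# Filter production feedback events before they enter online RLHF training.
def keep_feedback(event, min_account_age_days=7, max_daily_votes=50):
    if event["account_age_days"] < min_account_age_days:
        return False  # new accounts are the cheapest vector for adversarial voting
    if event["votes_today"] > max_daily_votes:
        return False  # rate-limit heavy voters to blunt coordinated campaigns
    return True
```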
What AI PMs Need to Know About Alignment Trade-offs
Alignment is not a binary — models are not "aligned" or "unaligned." Alignment is a spectrum of trade-offs that directly affect your product experience. These are the trade-offs every AI PM should be able to articulate.
Safety vs. capability: the refusal calibration problem
Over-aligned models refuse too much. Under-aligned models comply with harmful requests. The right refusal rate depends on your product context and user base. A model for children should refuse aggressively. A model for security researchers should almost never refuse. Track your refusal rate and false-refusal rate (legitimate requests incorrectly refused) as product metrics. If users consistently report 'the AI won’t help me with [legitimate task],' your alignment is too aggressive for your use case.
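A sketch of how these two metrics might be computed from logged interactions, assuming each record carries a refused flag and a legitimacy label from a human audit sample; the field names are illustrative.

```python
# Track refusal rate and false-refusal rate as first-class product metrics.
def refusal_metrics(interactions):
    refusals = [i for i in interactions if i["refused"]]
    false_refusals = [i for i in refusals if i["request_was_legitimate"]]
    return {
        "refusal_rate": len(refusals) / len(interactions),
        # Share of all requests that were legitimate but refused anyway.
        "false_refusal_rate": len(false_refusals) / len(interactions),
    }
```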
Helpfulness vs. honesty: the confidence problem
RLHF-trained models learn that confident, detailed responses receive higher human ratings. This creates an incentive to be confident even when uncertain. The product consequence: hallucinations delivered with authority. Monitor your hallucination rate alongside helpfulness scores. A model that says 'I’m not sure, but here’s what I know' is more valuable than one that confidently provides wrong information — but the reward model may not agree.
Consistency vs. diversity: the personality problem
Heavily RLHF-trained models converge toward a consistent, somewhat bland personality. They use similar phrases, similar structures, and similar hedging patterns across all responses. This is safe but can feel robotic. If your product needs creative diversity — a writing assistant, a brainstorming tool — you may need to deliberately increase temperature or reduce alignment pressure to allow more varied output.
Universality vs. personalization: the one-model problem
RLHF trains a single model to satisfy an average of all annotators' preferences. But users are not average. A professional developer and a high school student have very different needs from the same prompt. The frontier of alignment research is moving toward personalized alignment — models that adapt their behavior to individual user preferences. For PMs, this means system prompts and user preference signals are your current tools for customizing a globally-aligned model to your specific user base.