For a Product Manager owning Search AI experiences, the feedback loops you design—thumbs up, thumbs down, and side-by-side (SxS) comparisons—are not just vanity metrics. They are the raw fuel that directly reshapes the model's weights. But how exactly does a user clicking "Better" on the left-hand response translate into floating-point adjustments in a 70-billion-parameter model?
This guide explains the specific mechanics of weight adjustment across the three critical stages of training: Pre-training, Supervised Fine-Tuning (SFT), and Preference Tuning (RLHF/DPO).
The Mental Model: The Giant Spreadsheet
Before diving into the stages, visualize the Large Language Model (LLM) not as a brain, but as a colossal Excel spreadsheet containing billions of numbers (weights).
Every time the model generates a word, it runs a calculation using these numbers. "Training" is simply the process of identifying which specific numbers in the spreadsheet contributed to an error and nudging them slightly up or down so that error is less likely to happen next time. This process is called Backpropagation (calculating the blame) followed by Gradient Descent (nudging the numbers).
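The nudge-the-numbers loop can be sketched in miniature. The toy below trains a single "cell" of the spreadsheet (one weight) with gradient descent; the function, data, and learning rate are illustrative inventions, not anything from a real LLM.

```python
# Toy illustration (not a real LLM): one "cell" of the spreadsheet.
# The model predicts y = w * x; we nudge w to reduce squared error.

def train_step(w, x, target, lr=0.1):
    prediction = w * x
    error = prediction - target          # how wrong we were
    gradient = 2 * error * x             # backpropagation: assign blame to w
    return w - lr * gradient             # gradient descent: nudge the number

w = 0.0
for _ in range(50):
    w = train_step(w, x=2.0, target=6.0)  # true relationship is y = 3x

print(round(w, 2))  # -> 3.0: the weight has been nudged to the right value
```

A real model repeats exactly this loop, just across billions of weights at once.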
Stage 1: Pre-training (The Foundation)
Goal: Teach the model the statistical structure of language.
The Data: Trillions of tokens of raw text (web pages, books, code).
In this stage, the model plays a game of "Guess the Next Word." It looks at a sentence fragment like "The cat sat on the..." and calculates probabilities for the next word.
- The Weight Adjustment: If the model predicts "Cloud" (probability 0.8) and the actual word was "Mat" (probability 0.01), the model calculates a Cross-Entropy Loss. This is a mathematical score representing how surprised the model was by the truth.
- The Update: The optimizer looks at every weight that contributed to the wrong guess and nudges them to slightly lower the probability of "Cloud" and raise the probability of "Mat."
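The "surprise" score can be sketched directly, using the hypothetical probabilities from the example above (this ignores vocabulary size and batching, which a real training loop handles):

```python
import math

# Cross-entropy for one next-token prediction: loss = -log(p_correct).
def cross_entropy(predicted_probs, correct_token):
    return -math.log(predicted_probs[correct_token])

# Before the update: confident and wrong ("Mat" was the truth).
probs = {"cloud": 0.80, "mat": 0.01, "roof": 0.19}
print(round(cross_entropy(probs, "mat"), 2))  # -> 4.61 (high surprise)

# After some nudging: probability mass has shifted toward "Mat".
probs_after = {"cloud": 0.30, "mat": 0.50, "roof": 0.20}
print(round(cross_entropy(probs_after, "mat"), 2))  # -> 0.69 (less surprise)
```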
PM Takeaway: At this stage, the model is not learning to be helpful; it is learning to be probable. It captures world knowledge but lacks direction.
Stage 2: Supervised Fine-Tuning (SFT)
Goal: Teach the model to follow instructions and adopt a specific persona.
The Data: Golden pairs of (Prompt, Response) written by expert human labelers.
Here, we stop showing the model raw internet text and start showing it "perfect" behavior.
- The Mechanism: We use the exact same Next Token Prediction mechanics as pre-training. We feed the model a user prompt (e.g., "Summarize this article") and force it to calculate the probability of the expert-written summary.
- The Weight Adjustment: The loss function penalizes the model for any deviation from the expert's specific wording. We are forcing the model's weights to mimic the style, format, and reasoning of the human annotators.
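That penalty can be sketched as the average per-token cross-entropy over the expert's response. The per-token probabilities below are hypothetical, chosen only to show how deviation from the expert's wording raises the loss.

```python
import math

# SFT loss: average cross-entropy over every token of the expert response.
def sft_loss(token_probs):
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Model's probability for each token of the expert-written summary:
close_mimic = [0.90, 0.80, 0.95, 0.85]  # model already matches the expert
off_style   = [0.90, 0.10, 0.05, 0.85]  # model deviates on two tokens

print(sft_loss(close_mimic) < sft_loss(off_style))  # -> True: deviation costs more
```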
PM Takeaway: SFT is "behavior cloning." If your golden data contains short, punchy answers, the weights shift to penalize verbosity. This is where you set the "product voice."
Stage 3: Preference Tuning (Where Your User Data Lives)
This is the most critical stage for a PM owning feedback. SFT models are good, but they often hallucinate or ramble because they are just mimicking text. To fix this, we use Reinforcement Learning from Human Feedback (RLHF) or the newer Direct Preference Optimization (DPO).
This is where your SxS (Side-by-Side) and Thumbs Up/Down data directly intervene.
The Data: Pairwise Preferences
Your logging pipeline collects data where a human (or user) saw two model responses to the same prompt and picked the winner, yielding a (Prompt, Winning Response, Losing Response) record.
There are two main ways we use this data to adjust weights:
Method A: The Reward Model Approach (Classic RLHF)
This is a two-step process.
1. Train a Judge (Reward Model): We do not update the main LLM yet. Instead, we train a separate, smaller model called a Reward Model (RM). We show the RM the Winning and Losing responses and force it to output a score (scalar reward). The Ranking Loss function adjusts the RM's weights until it consistently assigns a higher score to the Winner than the Loser.
   - Result: You now have a mathematical function that can score any text.
2. Train the LLM (PPO): Now we go back to the main LLM. We let it generate new answers. The Reward Model scores them. We then use an algorithm called Proximal Policy Optimization (PPO) to update the LLM's weights.
   - The Loss Function: The PPO loss function has two competing terms: "Maximize the Reward Score" (change weights to get high scores) AND "Don't drift too far from the SFT model" (a KL-Divergence penalty). This prevents the model from gaming the system by outputting gibberish that happens to trick the Reward Model.
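Both losses can be sketched in a few lines. Everything here is heavily simplified and the numbers are hypothetical: a real reward model scores token sequences, and a real PPO loop works with advantages and clipped probability ratios, not bare scalars.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Step 1 -- Reward Model ranking loss (Bradley-Terry style):
# pushes the RM's weights until score(winner) > score(loser).
def ranking_loss(score_winner, score_loser):
    return -math.log(sigmoid(score_winner - score_loser))

print(round(ranking_loss(2.0, -1.0), 3))  # -> 0.049: RM already ranks correctly
print(round(ranking_loss(-1.0, 2.0), 3))  # -> 3.049: RM ranks them backwards

# Step 2 -- PPO-style objective (simplified): maximize reward,
# minus a KL penalty for drifting from the SFT model.
def ppo_objective(reward, kl_from_sft, beta=0.1):
    return reward - beta * kl_from_sft

# A high-reward answer that drifted far from the SFT model can lose
# to a slightly lower-reward answer that stayed close:
print(ppo_objective(reward=5.0, kl_from_sft=60.0))  # -> -1.0
print(ppo_objective(reward=4.0, kl_from_sft=5.0))   # -> 3.5
```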
Method B: Direct Preference Optimization (DPO)
This is the modern, more efficient alternative that many product teams are adopting. It skips the "Reward Model" step entirely.
- The Mechanism: We feed the Pairwise data (Winner/Loser) directly into the training loop of the main LLM.
- The Weight Adjustment: The DPO loss function mathematically derives the update that would have happened if we had used a reward model, but does it directly. It increases the likelihood of the "Winner" tokens and decreases the likelihood of the "Loser" tokens relative to the base model.
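A simplified sketch of the DPO loss follows. The inputs are log-probabilities of the Winner and Loser under the model being trained (the "policy") and under the frozen SFT model (the "reference"); all values here are hypothetical, and a real implementation averages this loss over a batch.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# DPO loss (simplified): reward the model for preferring the Winner
# more strongly than the reference (SFT) model does.
def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = (policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l)
    return -math.log(sigmoid(beta * margin))

# Model already prefers the Winner relative to the reference: low loss.
print(round(dpo_loss(-10.0, -30.0, -20.0, -20.0), 3))  # -> 0.127
# Model prefers the Loser: high loss, so the weights get a large nudge.
print(round(dpo_loss(-30.0, -10.0, -20.0, -20.0), 3))  # -> 2.127
```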
PM Takeaway:
- SxS data is the highest-value currency here. It provides the gradient signal that differentiates "correct grammar" (SFT) from "helpful answer" (Preference Tuning).
- Thumbs Up/Down is often converted into weak pairwise data (e.g., Thumbs Up response > Random other response) or used to filter SFT data, but it is noisier than SxS.
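As an illustration of that conversion, here is a hypothetical pipeline sketch. The event schema, field names, and pairing strategy are invented for this example; real pipelines are far more careful about sampling and deduplication.

```python
import random

# Hypothetical sketch: turn raw feedback events into the
# (prompt, winner, loser) pairs that preference tuning consumes.

def to_preference_pairs(events, fallback_responses):
    pairs = []
    for e in events:
        if e["type"] == "sxs":                        # strongest signal
            pairs.append((e["prompt"], e["chosen"], e["rejected"]))
        elif e["type"] == "thumbs" and e["rating"] == "up":
            # Weak signal: pair the liked response against a random
            # other response, as described above.
            loser = random.choice(fallback_responses)
            pairs.append((e["prompt"], e["response"], loser))
    return pairs

events = [
    {"type": "sxs", "prompt": "Summarize X", "chosen": "A", "rejected": "B"},
    {"type": "thumbs", "rating": "up", "prompt": "Explain Y",
     "response": "liked answer"},
]
pairs = to_preference_pairs(events, fallback_responses=["other answer"])
print(pairs)
# -> [('Summarize X', 'A', 'B'), ('Explain Y', 'liked answer', 'other answer')]
```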
Summary of the Value Chain
- User Clicks "Better": This creates a (Winner, Loser) data point.
- Loss Calculation: During training, the system calculates the probability gap between the Winner and Loser.
- Backpropagation: The system computes the gradient—the specific direction each weight needs to move to make the Winner more likely and the Loser less likely.
- Optimizer Step: The weights are permanently altered.
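The whole chain fits in a toy one-weight example (illustrative only): a single weight controls how much probability the model assigns to the Winner, and repeated loss → gradient → optimizer steps flip its preference.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# End-to-end toy of the value chain: one weight, one (Winner, Loser) pair.
w = -1.0                                  # before: model prefers the Loser
for _ in range(100):
    p_winner = sigmoid(w)                 # loss calculation
    gradient = p_winner - 1.0             # backpropagation (d loss / d w)
    w -= 0.5 * gradient                   # optimizer step: permanent change

print(sigmoid(w) > 0.9)  # -> True: the Winner is now the likely output
```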
Your feedback UI is not just a reporting tool; it is the steering wheel for the model's convergence.
Backgrounder Notes
The concepts below are central to the argument above but involve complex underlying mechanics. These backgrounders provide the context needed to fully grasp the definitions and implications mentioned in the text.
Token: While often equated to "words," tokens are actually the fundamental units of text a model processes, which can be whole words, sub-words, or even single characters. For estimation purposes, 1,000 tokens is roughly equivalent to 750 English words.
Cross-Entropy Loss: Mentioned in Stage 1, this is a metric that quantifies the difference between the probability distribution the model predicted and the actual distribution (the truth). In simple terms, it measures how "surprised" the model is by the correct answer; high surprise equals high error, prompting a larger adjustment to the weights.
Backpropagation: This is the core mathematical algorithm used to train neural networks, functioning by moving backward from the output error to the input to assign "blame." It calculates the gradient (rate of change) for every single weight in the network, determining how much each specific number needs to change to reduce the error.
KL-Divergence (Kullback-Leibler Divergence): Referenced in Method A, this is a statistical measure of how much one probability distribution differs from a reference distribution. In RLHF, it serves as a "drift penalty" to ensure the model doesn't stray so far from its original training that it starts speaking gibberish just to game the reward system.
Proximal Policy Optimization (PPO): This is the industry-standard algorithm used in Method A to update the model's policy (behavior) based on the Reward Model's scores. It is designed to take small, safe, iterative update steps so the training process remains stable, preventing the model from changing too drastically at once and "collapsing."
Hallucination: Briefly noted as a reason for Stage 3, this phenomenon occurs when an LLM generates text that is grammatically and syntactically confident but factually incorrect or nonsensical. It happens because the model is designed to predict probable next words based on patterns, not to access a database of verified facts.
Ranking Loss: Used when training the Reward Model, this loss function does not care about the absolute score of a response, only the relative order. It trains the model to ensure that the mathematical gap between a "better" response and a "worse" response is sufficiently large.