The pairwise revolution: How side-by-side feedback teaches LLMs to think
By Aurally AI
If you have ever used ChatGPT, Claude, or Gemini, you have likely participated in the most critical phase of their training without realizing it. When a model presents two different answers to your question and asks, "Which response is better?", it is not just gathering user satisfaction data: it is collecting the raw fuel for the alignment engine that makes modern AI useful.
This process is called pairwise side-by-side (SBS) feedback, and it is the industry standard for teaching Large Language Models (LLMs) to follow instructions, avoid toxicity, and act helpfully. While the concept is simple (picking a winner between two options), the technical machinery underneath is a sophisticated blend of probability theory and reinforcement learning. Here is a technical deep dive into how a simple "A vs. B" choice transforms a raw text predictor into a helpful assistant.
### Why Pairwise? The Human Factor
Before we look at the math, we have to look at the human constraints. Early attempts to train AI used absolute scoring, asking humans to rate an answer on a scale of 1 to 5. This failed because humans are notoriously inconsistent calibrators. One person’s "4" is another person’s "2."
However, humans are excellent at ranking. If you show a person two summaries of a news article, they can almost instantly tell you which one is more concise and accurate, even if they struggle to assign an abstract numerical score to it. This insight, that relative ranking is more robust than absolute scoring, is the foundation of Reinforcement Learning from Human Feedback (RLHF).
### Step 1: The Reward Model (The Judge)
The first technical challenge is converting these qualitative human preferences into a quantitative signal the AI can learn from. We cannot have a human sitting in the loop for every single update the AI makes. Instead, researchers train a Reward Model (RM): a separate neural network whose sole job is to mimic human preferences.
To train this model, developers collect a massive dataset of prompts ($x$) and pairs of model responses ($y_A$ and $y_B$). Human labelers mark which one is the "winner" ($y_w$) and which is the "loser" ($y_l$).
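Concretely, each labeled comparison can be stored as a simple record. The schema below is illustrative (field names vary between labs and datasets), but every preference dataset carries the same three pieces of information:

```python
# Illustrative preference record; the field names ("prompt", "chosen",
# "rejected") are a common convention, not a universal standard.
preference_example = {
    "prompt": "Summarize the article in two sentences.",            # x
    "chosen": "The article explains how pairwise human feedback "
              "is used to align language models.",                  # y_w (winner)
    "rejected": "Articles are texts. Summaries make them shorter.",  # y_l (loser)
}
```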
The mathematical framework used here is the Bradley-Terry model, a probabilistic model originally developed in the 1950s. It posits that the probability of preferring response A over response B is a function of the difference in their latent "rewards" (scores).
$$P(A > B) = \sigma(r(A) - r(B))$$
Here, $\sigma$ is the sigmoid function and $r(\cdot)$ is the scalar score the Reward Model assigns to a piece of text. During training, the Reward Model minimizes a pairwise ranking loss function:
$$\mathcal{L} = -\log(\sigma(r(y_w) - r(y_l)))$$
In plain English, the model adjusts its weights to maximize the score gap between the winning response and the losing response. It learns to assign high numbers to helpful, safe answers and low numbers to hallucinations or toxicity.
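As a minimal sketch of that loss in PyTorch, assuming `reward_model` is any network that maps a tokenized response to one scalar score per sequence (tokenization and batching details are omitted):

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(reward_model, winner_tokens, loser_tokens):
    """Bradley-Terry ranking loss: -log sigmoid(r(y_w) - r(y_l))."""
    r_w = reward_model(winner_tokens)   # scalar score per winning response
    r_l = reward_model(loser_tokens)    # scalar score per losing response
    return -F.logsigmoid(r_w - r_l).mean()

# The same score gap also gives the implied preference probability:
# torch.sigmoid(r_w - r_l) is the model's estimate of P(A preferred over B).
```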
### Step 2: Policy Optimization (The Student)
Once the Reward Model is trained, it acts as a proxy for the human. Now the actual LLM (the "Policy Model") can be fine-tuned using Reinforcement Learning (RL), typically with an algorithm called Proximal Policy Optimization (PPO).
In this phase, the LLM generates a response, the Reward Model gives it a numerical score (for example, +2.5 for a great answer, -1.0 for a bad one), and the LLM updates its internal weights to generate more high-scoring answers in the future.
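The heart of PPO is a clipped objective that keeps each update small. The sketch below shows only that core term and assumes the response-level log-probabilities and the advantage (the Reward Model's score minus a baseline) have already been computed; production RLHF implementations also add a value-function loss and a KL penalty against a frozen reference model.

```python
import torch

def ppo_clipped_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Minimal PPO surrogate loss for a batch of sampled responses.

    logp_new / logp_old: summed log-probabilities of each response under the
    current policy and the policy that generated the sample.
    advantage: how much better the response scored than expected.
    """
    ratio = torch.exp(logp_new - logp_old)               # importance ratio
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    # Take the pessimistic (clipped) estimate so one update cannot move too far.
    return -torch.min(unclipped, clipped).mean()
```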
### The New Wave: Direct Preference Optimization (DPO)
While the RLHF pipeline (supervised fine-tuning (SFT), then a Reward Model, then PPO) is powerful, it is also complex and unstable. Training a separate Reward Model adds computational overhead, and PPO is notoriously finicky to tune.
Enter Direct Preference Optimization (DPO), a breakthrough technique that simplifies the process. DPO leverages a mathematical trick: it re-derives the optimal policy directly from the pairwise data, without needing a separate Reward Model.
Instead of training a judge to score answers, DPO treats the LLM itself as an implicit reward model. It optimizes the model using a loss function that directly increases the likelihood of the preferred response ($y_w$) relative to the rejected response ($y_l$).
The DPO loss looks surprisingly similar to the Reward Model loss, but it is applied directly to the language model's probabilities:
$$\mathcal{L}_{DPO} = -\log \sigma \left( \beta \log \frac{\pi(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_{ref}(y_l|x)} \right)$$
This formula effectively says: "Make the winning response more probable than the reference model would have, and make the losing response less probable." Here $\pi$ is the policy being trained, $\pi_{ref}$ is a frozen reference copy, and $\beta$ controls how far the trained policy is allowed to drift from that reference.
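A minimal sketch of that loss in PyTorch, assuming the summed log-probabilities of each response under the trained policy and the frozen reference model have already been gathered (how those are extracted from the models' logits is omitted here):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for a batch of (winner, loser) preference pairs.

    Each input is the summed log-probability of the winning (w) or losing (l)
    response under the policy being trained or the frozen reference model.
    """
    # Implicit per-response rewards: beta * log(pi / pi_ref).
    reward_w = beta * (policy_logp_w - ref_logp_w)
    reward_l = beta * (policy_logp_l - ref_logp_l)
    # Same Bradley-Terry form as the reward model loss, applied to the policy itself.
    return -F.logsigmoid(reward_w - reward_l).mean()
```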
### The Loop: Online Iterative Training
The process doesn't end with one round of training. If an LLM is trained against a static dataset for too long, it eventually learns to "game" the reward model, finding weird grammatical tricks that get high scores but don't actually make sense to humans. This is called Reward Hacking.
To combat this, leading labs use Online Iterative RLHF. In this setup, the model is retrained and then used to generate new pairs of answers. Human labelers rank these new pairs, creating a fresh dataset that captures the model's current behavior. This feedback loop keeps the reward model robust and lets the LLM continue to improve without drifting off course.
So, the next time you click "Better" on a chatbot response, know that you aren't just giving a thumbs up. You are providing the essential gradient signal that steers the output probabilities of the world's most advanced AI systems.
### Backgrounder Notes
Below are the key concepts and technical terms used in the article, along with short backgrounders to provide additional context for the reader.
Key Concepts & Definitions
Pairwise Side-by-Side (SBS) Feedback
A comparative evaluation method where a human is presented with two distinct outputs and asked to select the superior one. This approach is favored in data science because humans are statistically more consistent at ranking items against each other than they are at assigning absolute numerical scores.
Reinforcement Learning from Human Feedback (RLHF)
A machine learning paradigm that incorporates human judgment into the training loop to align an AI's behavior with human values and instructions. It transforms a model from a simple "next-token predictor" into a helpful assistant by fine-tuning it based on what humans find useful or safe.
Reward Model (RM)
A separate neural network, often smaller than the model being trained, built specifically to act as a "proxy judge" for human preferences. Once trained on human-ranked data, it can automatically score millions of AI responses, allowing the primary model to learn at a scale that would be impossible if humans had to manually grade every update.
Bradley-Terry Model
A mathematical framework developed in the 1950s used to predict the outcome of a comparison or "tournament" between items based on their relative strengths. In AI, it provides the statistical logic for converting a human’s "A vs. B" choice into a mathematical probability that the model can use for optimization.
Sigmoid Function ($\sigma$)
A mathematical function that maps any real-valued number into a value between 0 and 1, creating a characteristic "S-shaped" curve. In the context of the article, it is used to translate the difference between two reward scores into a probability of preference.
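As a quick numeric illustration (the reward gaps here are hypothetical, not figures from the article), a score difference of +2 between two responses corresponds to roughly an 88% preference probability:

```python
import math

def sigmoid(x: float) -> float:
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(2.0))  # ~0.88: a +2 reward gap means A is preferred ~88% of the time
print(sigmoid(0.0))  # 0.5: equal rewards means the choice is a coin flip
```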
Proximal Policy Optimization (PPO)
A reinforcement learning algorithm designed to make the training process stable by ensuring the model doesn't change its behavior too drastically in a single update. It is the "engine" used in the traditional RLHF pipeline to incrementally improve the model's performance based on reward scores.
Direct Preference Optimization (DPO)
A modern alternative to RLHF that eliminates the need for a separate Reward Model, instead optimizing the Large Language Model directly on preference data. This technique is gaining popularity because it is computationally more efficient and less prone to the stability issues found in traditional reinforcement learning.
Reward Hacking
A failure mode in AI training where a model discovers unintended "loopholes" to achieve a high score from the Reward Model without actually performing the task well. For example, a model might learn that using a specific polite tone always gets a high score, even if the actual information provided is incorrect.
Policy Model
In reinforcement learning terminology, the "policy" is the strategy the AI uses to decide which word comes next; therefore, the Policy Model is the actual Large Language Model being trained. It is the "student" in the training process, constantly adjusting its internal weights to maximize the "grades" it receives from the Reward Model.
Gradient Signal
A mathematical "direction" provided during the training process that tells the model how to adjust its internal parameters to reduce error. In the context of the article, human feedback provides this signal, essentially pointing the model toward the types of responses that humans find more acceptable.