Building sophisticated models is one challenge when it comes to generative AI and large language models (LLMs). Knowing whether they're truly effective is another.
Traditional metrics often fall short, reducing complex behaviors to simple scores that don’t reflect how well these models actually make decisions.
This issue becomes even more pronounced with In-Context Learning (ICL), where models don’t just rely on trained knowledge but adapt to new examples in real-time. Unlike traditional machine learning models that need a complete retraining to pick up new tasks, ICL lets models adapt on the fly just by showing them a few examples in their input prompts.
But here’s the catch: evaluating how well ICL works isn’t straightforward.
Traditional metrics like cosine similarity and perplexity often miss the mark. It’s a bit like judging a teacher solely by their qualifications instead of how much their students actually learn.
This blog will cover how ICL truly works, why some evaluation metrics fall short, and how DPO is changing the game.
Let's start simple, to make sure we are all on the same page: what is in-context learning, really?
In short: rather than retraining or fine-tuning a model for every new task, ICL lets it pick up a task on the fly from a handful of examples placed directly in the input prompt.
What makes this particularly fascinating is that the model isn’t changing its weights or “learning” in the traditional sense. Instead, it’s using its pre-trained knowledge to recognize patterns and adapt its behavior based on the examples provided.
This distinction is crucial — we’re not teaching the model new information; we’re showing it how to apply its existing knowledge in new ways.
ICL works through three key mechanisms:
1. Pattern Recognition: Imagine teaching someone to spot spam emails. Instead of explaining complex rules, you show them a few examples of spam and legitimate emails. The model learns to identify structure, tone, and content patterns that differentiate the categories. This happens without explicitly programming rules — the model automatically discovers these patterns.
2. Contextual Mapping: This is where ICL truly shines. The model doesn’t just memorize patterns — it learns to map them to new contexts. For instance, if you show examples of French-to-English translations about the weather, it can often translate sentences about food correctly, even though it hasn’t seen food-related examples. This transfer ability is what makes ICL so powerful.
3. Task Structure Inference: Perhaps most impressively, models learn how to approach a task from the examples’ structure. Show it a few examples of step-by-step math solutions, and it learns to break down new problems similarly. Show it concise answers instead, and it adapts to that style. The model infers not just what to do but how to do it.
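To make this concrete, here is a minimal sketch of a few-shot prompt. The task, examples, and formatting below are invented purely for illustration; the point is that the examples alone tell the model what to do, with no weight updates involved.

```python
# Minimal few-shot prompt: the examples define the task, its labels,
# and its output format -- the model's weights never change.
examples = [
    ("I waited 45 minutes and the food was cold.", "negative"),
    ("The staff went out of their way to help us.", "positive"),
    ("Decent portions, nothing memorable.", "neutral"),
]
query = "Great coffee, but the seating is cramped."

prompt = "Classify the sentiment of each review.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # send this string to any instruction-following LLM
```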
The key idea of in-context learning is to learn from analogy.
The figure below demonstrates how language models make decisions using in-context learning (ICL).
Now, how do you evaluate ICL for LLMs?
Or, more interestingly, how do we "know" that the examples chosen for our few-shot approach are actually good ones?
Let's dive in.
Let's look at three main types of metrics here: 1) Relevance Metrics, 2) Diversity Metrics, and 3) Quality Metrics.
Cosine Similarity: The Double-Edged Sword: Traditional cosine similarity measures how closely two pieces of text align in their vector representations.
While intuitive, this approach has subtle but critical limitations. Consider teaching someone about metaphors — showing them similar metaphors might seem helpful but could limit their understanding of the concept’s breadth. The metric converts texts into high-dimensional vectors and measures the angle between them. Smaller angles suggest a greater similarity. However, this can be misleading — two texts might share many words but convey entirely different concepts or use different words to express the same idea.
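Here is a small sketch of that pitfall using plain TF-IDF vectors and scikit-learn; the sentences are made up for illustration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "The bank raised its interest rates this quarter.",       # finance
    "The bank of the river was raised by the spring flood.",  # geography
    "Central banks tightened monetary policy this year.",     # finance, different wording
]

vectors = TfidfVectorizer().fit_transform(texts)
print(cosine_similarity(vectors).round(2))
# Word overlap can make texts 0 and 1 look similar despite very different meanings,
# while texts 0 and 2 (same topic, different vocabulary) may score lower.
```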
BM25: The Evolution of Search
Building on traditional TF-IDF (Term Frequency-Inverse Document Frequency), BM25 attempts to capture the importance of words in context. It’s particularly clever in handling document length — unlike more straightforward metrics, it recognizes that a word appearing twice in a short document might be more significant than appearing twice in a long one.
However, BM25’s focus on keyword matching can miss deeper semantic relationships. It might rate an example highly because it shares keywords with our query, even if the underlying concept differs.
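For reference, here is a compact, self-contained version of the Okapi BM25 scoring formula; the k1 and b values are the commonly used defaults, and the toy documents are purely illustrative:

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each pre-tokenized document against the query with Okapi BM25."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter(t for d in docs_tokens for t in set(d))  # document frequencies
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = ["the cat sat on the mat", "dogs and cats are pets", "markets fell sharply today"]
print(bm25_scores("cat pets".split(), [d.split() for d in docs]))
```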
Semantic Search: The Neural Approach: Modern semantic search uses neural networks to capture meaning rather than word overlap. This is a significant improvement, but it still faces a fundamental limitation: similarity in meaning doesn’t necessarily translate to educational value. Sometimes, the most helpful examples are those that highlight contrasts rather than similarities.
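A minimal semantic-search sketch using the sentence-transformers library; the model name and texts below are just placeholders, and any sentence-embedding model would work the same way:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model; swap in any embedder

query = "How do I reset my password?"
candidates = [
    "Steps to recover access to your account.",
    "Our password policy requires twelve characters.",
    "Troubleshooting login problems after an update.",
]

q_emb = model.encode(query, convert_to_tensor=True)
c_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_embs)[0]  # meaning-level similarity, not word overlap

for text, score in sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {text}")
```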
Maximum Marginal Relevance (MMR): MMR tries to balance relevance with diversity by selecting examples that are relevant to the query but different from each other. The key insight is that showing ten very similar examples might be less helpful than showing a diverse set that covers different aspects of the concept.
For instance, when teaching the concept of “velocity,” MMR might select examples involving cars, planets, and particles — each relevant but showcasing different aspects of the concept.
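A greedy MMR selector can be sketched as follows over pre-computed embeddings; the lambda value and helper names are illustrative choices, not a fixed standard:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query_vec, candidate_vecs, k=3, lam=0.7):
    """Greedily pick k examples, balancing relevance to the query
    against redundancy with already-selected examples."""
    relevance = [cos(query_vec, c) for c in candidate_vecs]
    selected, remaining = [], list(range(len(candidate_vecs)))
    while remaining and len(selected) < k:
        def mmr_score(i):
            redundancy = max((cos(candidate_vecs[i], candidate_vecs[j])
                              for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected  # indices of chosen examples: relevant but non-redundant
```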
Coverage Scores: These metrics attempt to measure how much of the possible “space” of examples we’re covering. Think of it like teaching someone to cook — you want examples that cover different techniques, ingredients, and cuisines, not just variations of the same dish.
The challenge here is defining what constitutes complete coverage. In most real-world tasks, the space of possible examples is vast and often undefined.
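One simple (and admittedly crude) proxy is to partition the candidate pool into regions and count how many regions the selected examples touch; the cluster count and the random embeddings below are arbitrary stand-ins:

```python
import numpy as np
from sklearn.cluster import KMeans

def coverage_score(candidate_embs, selected_idx, n_regions=10, seed=0):
    """Fraction of embedding-space regions touched by the selected examples."""
    km = KMeans(n_clusters=n_regions, n_init=10, random_state=seed).fit(candidate_embs)
    return len({km.labels_[i] for i in selected_idx}) / n_regions

rng = np.random.default_rng(0)
pool = rng.normal(size=(200, 384))  # stand-in for 200 candidate example embeddings
print(coverage_score(pool, [3, 17, 42, 56, 88, 101, 150, 199]))
```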
Perplexity: Measuring “Naturalness”: Perplexity measures how “surprised” a language model is by a piece of text. Lower perplexity suggests more natural, well-formed text. While this seems logical — we want clear, well-written examples — it can be misleading. Sometimes, the most educational examples are those that challenge our expectations or highlight edge cases.
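Computing perplexity is straightforward with the transformers library; GPT-2 is used here only because it is small, and the score is simply the exponential of the average token-level cross-entropy:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

print(perplexity("The cat sat on the mat."))           # low: very natural text
print(perplexity("Mat the on sat cat the colorless"))  # high: surprising word order
```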
Source Authority: This approach prioritizes examples from reputable sources — textbooks, expert writings, or verified datasets. While intuitively appealing, it suffers from a crucial flaw: authority doesn’t always correlate with teaching effectiveness. A complex, technically correct explanation might be less helpful than a simple, slightly imprecise one for learning purposes.
The core issue with all these metrics is that they measure the properties of the examples themselves, not their effectiveness in teaching.
It’s like judging a teacher’s ability by their credentials rather than their students’ learning outcomes.
This leads us to a critical question: how do we measure whether a set of examples actually helps the model, rather than how good the examples look on paper?
Direct Preference Optimization (DPO) was initially introduced in the context of reinforcement learning, specifically for training language models from human preferences.
The method was presented in the paper “Direct Preference Optimization: Your Language Model is Secretly a Reward Model” by Rafailov et al. in 2023.
The genesis of DPO came from a fundamental challenge in reinforcement learning from human feedback (RLHF): the standard pipeline is complex, requiring a separate reward model and a reinforcement learning algorithm on top of it.
What made DPO interesting was its insight that we could bypass the traditional reinforcement learning pipeline altogether.
Instead of training a reward model on human preferences, using it to guide policy optimization, and applying complex RL algorithms like PPO (Proximal Policy Optimization), DPO showed that we could directly optimize a language model to match human preferences in a single step.
The mathematics revealed that the reward model and policy optimization could be combined into a single, more straightforward objective.
The core mathematical insight of DPO came from connecting two seemingly separate pieces: the reward model learned from human preferences and the policy being optimized against it.
It showed that these could be unified into a single training objective that can be optimized directly on preference pairs, requiring neither a separate reward model nor a reinforcement learning loop.
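Concretely, the DPO objective from the paper optimizes the policy directly on preference pairs, where $y_w$ is the preferred and $y_l$ the dispreferred response for a prompt $x$, using only log-probability ratios against a frozen reference model:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta;\, \pi_{\text{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l)\sim \mathcal{D}}
\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$

Here $\beta$ controls how far the policy may drift from the reference model, and the implicit reward is exactly the $\beta$-scaled log-ratio, which is why no separate reward model is needed.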
Although DPO originally emerged as a fine-tuning technique for refining language models through human feedback, recent work has expanded its role.
It has been applied as a metric for evaluating few-shot strategies and now serves as a ground-truth benchmark for ranking embedding models.
To recap, the 2023 paper by Rafailov et al. introduced DPO as a method for training language models by directly aligning them with human preferences.
Traditionally, fine-tuning models to adhere to human judgment has been complex and computationally intensive. It often relies on reinforcement learning from human feedback (RLHF), which requires training separate reward models.
As we discussed, DPO streamlines this process by eliminating the need for an explicit reward model. Instead, it adjusts the language model to maximize the probability of human-preferred responses directly.
In practice, DPO redefines the training objective to favor outputs that align with human preferences without extensive policy optimization.
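In code, the resulting loss is only a few lines. The sketch below assumes you have already computed summed log-probabilities of the chosen and rejected responses under both the policy and the frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs."""
    # Implicit rewards: beta-scaled log-ratios of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred response's implicit reward above the rejected one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```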
This breakthrough simplifies fine-tuning while achieving stable results, making DPO a powerful tool for developers aiming to fine-tune models quickly and efficiently.
In essence, DPO here serves as an accelerator for model alignment with human judgment, making it highly practical for deploying LLMs in real-world applications where preferences change rapidly.
The application of DPO evolved with publications such as RAGSys: Item-Cold-Start Recommender as RAG System, which applied DPO as a metric to evaluate the quality of few-shot learning strategies in in-context learning (ICL).
Unlike its role in fine-tuning, here, DPO is used as a comparative measure rather than a training objective.
By using DPO as a metric, this approach evaluates how effectively different few-shot strategies — whether promoting diversity or consistency in examples — improve the model’s ability to generalize to new queries.
For instance, using diverse examples in a few-shot prompt might help the model generalize across contexts, while consistent examples may reinforce specific patterns.
By scoring few-shot strategies with DPO, we could objectively quantify how these strategies impact model performance, bringing transparency to ICL methods that were previously hard to evaluate rigorously.
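The RAGSys paper's exact formulation isn't reproduced here, but the flavor of using a DPO-style score comparatively can be sketched: build the prompt under a given few-shot strategy, then measure how strongly the prompted model prefers a known good answer over a bad one. Everything below (the function names, the margin definition, the beta value) is an illustrative assumption, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def answer_logprob(model, tokenizer, context, answer):
    """Approximate summed log-probability of `answer` given `context`
    (boundary tokenization effects are ignored in this sketch)."""
    full_ids = tokenizer(context + answer, return_tensors="pt").input_ids
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logits = model(full_ids).logits
    logps = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_logps = logps.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, ctx_len - 1:].sum()  # answer tokens only

def preference_margin(model, tokenizer, few_shot_prompt, query,
                      preferred, rejected, beta=0.1):
    """DPO-style score: how strongly the prompted model prefers the good answer."""
    context = few_shot_prompt + query
    margin = beta * (answer_logprob(model, tokenizer, context, preferred)
                     - answer_logprob(model, tokenizer, context, rejected))
    return F.logsigmoid(margin).item()

# To compare two selection strategies, average this score over held-out queries:
# the strategy whose prompts yield higher margins is the one teaching the model more.
```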
In a further adaptation, DPO is being used in our forthcoming embedding benchmark to score the quality of few-shot examples for a given query, ranking existing embedding models against this metric.
While previous literature used DPO primarily for preference alignment and comparison, our benchmark leverages it as a ground truth score — a fundamental standard for assessing how well a few-shot example set serves its intended purpose.
In this setup, DPO evaluates whether selected few-shot examples effectively enhance the model’s understanding and response accuracy. By scoring each example set’s effectiveness with DPO, we can create a benchmark that ranks embedding models on their ability to boost ICL.
This novel application of DPO goes beyond fine-tuning or strategy evaluation; it positions DPO as a universal metric to assess few-shot learning quality across various model architectures.
The progression of DPO from a fine-tuning method to a metric and then to a ground-truth benchmark highlights its versatility.
As a fine-tuning tool, DPO optimizes models to align with human preferences in a straightforward and computationally efficient way. When used as a metric, DPO provides a transparent, preference-based comparison for few-shot strategies.
Finally, as a ground-truth benchmark, DPO serves as a universal standard, enabling fair and consistent evaluation of example quality across different models and tasks.
Let me sum up what’s truly important about DPO’s evolution in language model development.
DPO's journey has been remarkable: it started as a fine-tuning method, became a metric for comparing few-shot strategies, and now serves as a ground-truth benchmark for embedding models.
This isn’t just about finding new uses for a technique — it’s about discovering fundamental principles in how language models learn and adapt.
The practical implications are significant: simpler and more stable preference alignment, an objective way to compare few-shot selection strategies, and a consistent standard for ranking embedding models on how well they support ICL.
For anyone working with language models, DPO’s evolution demonstrates something crucial: fundamental insights in machine learning often transcend their original applications. Understanding this progression isn’t just about staying current — it’s about seeing how core principles can transform multiple aspects of language model development.