Rethinking Retrieval for RAG: Introducing ICLERB

In the rapidly evolving world of Large Language Models (LLMs), techniques like In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) have significantly changed how these models adapt to new tasks and domains. As we leverage retrieval methods to enhance LLM performance, it’s crucial to reassess whether our current strategies are truly adapted to these advanced learning techniques.

In this blog post, we explore the current state of retrieval in RAG and ICL, uncover the limitations of the traditional evaluation of retrieval methods, and introduce our novel benchmark, the In-Context Learning Embedding and Reranker Benchmark (ICLERB). 

More details and findings can be found in our paper.

Unlocking LLM Capabilities with In-Context Learning (ICL)

In-Context Learning (ICL) enables LLMs to perform new tasks by providing examples or relevant information directly within the prompt. Instead of updating the model’s weights through fine-tuning, the model is conditioned with context that guides its response.

One form of ICL is few-shot learning. Suppose we want the model to translate English sentences into French. We can craft a prompt that includes examples:

[Prompt]

     Translate the following English sentences to French:

     “It’s raining cats and dogs.” → “Il pleut des cordes.”

     “Bite the bullet.” → “Prendre son courage à deux mains.”

     “Under the weather.” → “Ne pas être dans son assiette.”

     “Once in a blue moon.” →

[Model output]

     “Tous les trente-six du mois.”

By providing these examples through prompt engineering, we enable the model to infer the pattern and generate the correct translation. Another powerful ICL technique is Chain-of-Thought prompting, which encourages the model to “think” through the problem step by step.

For example:

[Prompt]

     Question: Emma thinks of a number. She doubles it, adds 10, and the result is 30.

     What number did Emma think of?

     Answer: Let the original number be “x”. After doubling and adding 10: 2x + 10 = 30.
     Subtract 10 and divide by 2: x = 20 / 2 = 10. Emma thought of the number “10”.

     Question: A number is multiplied by 3, then decreased by 4, giving 11.
     What was the original number?

[Model output]
     Answer: Let the original number be "x". After multiplying by 3 and subtracting 4: 3x - 4 = 11.
     Add 4 and divide by 3: x = 15 / 3 = 5. The original number was "5".

By prompting the model to break down the problem, we achieve more accurate and interpretable results.

Enhancing ICL with Retrieval-Augmented Generation (RAG)

In the previous examples of ICL, we provided pre-selected, static examples within the prompt. While effective, this approach relies on manually chosen examples, which may not suit every question, especially questions that differ from the patterns or content of those examples.

Retrieval-Augmented Generation (RAG) builds upon ICL by integrating dynamically selected external information directly into the model's prompt. 

Imagine the LLM is asked: “What’s the schedule for the 2024 Summer Olympics?” Since the model’s knowledge only extends up to a specific cutoff date, it cannot provide an accurate response on its own. RAG systems allow dynamic retrieval of up-to-date or domain-specific information from external sources to augment the model’s knowledge.

RAG operates through:

  • User query: A question is submitted to the system.
  • Retrieval step: Documents are retrieved based on their relevance to the query. At the core of this process are the retrieval methods (detailed in the next section) that decide which documents are selected, coupled with storage and indexing tools that ensure efficient retrieval.
  • Combining information: Retrieved documents are combined with the original query to form an augmented prompt.
  • Generation step: The LLM processes the augmented prompt and generates a response that is conditioned on the retrieved information, as illustrated in the sketch below.
The Retrieval-Augmented Generation (RAG) stack
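
To make these steps concrete, below is a minimal sketch of the RAG loop in Python. The vector_store and llm objects are hypothetical stand-ins for whichever retrieval backend and LLM client is used; the sketch only shows how the steps fit together, not a production implementation.

    # Minimal RAG loop (illustrative sketch).
    # `vector_store.search` and `llm.generate` are hypothetical stand-ins for a
    # vector database query and an LLM call; swap in your own components.

    def answer_with_rag(query: str, vector_store, llm, k: int = 3) -> str:
        # Retrieval step: fetch the k documents most relevant to the query.
        documents = vector_store.search(query, top_k=k)

        # Combining information: build an augmented prompt from the retrieved documents.
        context = "\n\n".join(doc.text for doc in documents)
        prompt = (
            "Use the following context to answer the question.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\nAnswer:"
        )

        # Generation step: the response is conditioned on the retrieved information.
        return llm.generate(prompt)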

Existing Retrieval Methods: Embedding Models and Rerankers

While some RAG systems still use traditional retrieval methods like BM25 due to their simplicity and efficiency, these term-based methods often underperform because they rely on exact term matching. This fails to capture the complex semantics of natural language, making it challenging to retrieve relevant documents that don’t share the exact wording of the query. 
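
To see this failure mode concretely, the short sketch below uses the rank_bm25 package (one of several BM25 implementations). A document that paraphrases the query without sharing its terms receives a BM25 score of zero, while a less relevant document that happens to contain the word "car" scores higher.

    # Exact-term matching limitation, illustrated with the rank_bm25 package
    # (one of several BM25 implementations; pip install rank_bm25).
    from rank_bm25 import BM25Okapi

    corpus = [
        "automobile servicing and upkeep guide",  # relevant, but shares no terms with the query
        "the history of the car industry",        # less relevant, but shares the term "car"
    ]
    bm25 = BM25Okapi([doc.split() for doc in corpus])

    query = "how to maintain a car"
    print(bm25.get_scores(query.split()))
    # The paraphrased document scores 0; the literal term match scores higher.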

More advanced retrieval methods have thus become prevalent, notably embedding models and rerankers:

  • Embedding Models: These models, such as OpenAI's embeddings models and SBERT, encode queries and documents into high-dimensional vectors. Retrieval is performed by measuring the similarity between the query and document embeddings, usually using cosine similarity. This approach allows for efficient retrieval, as document embeddings can be precomputed and indexed for fast similarity searches across large datasets. To store and manage these embeddings effectively, various storage and indexing tools are employed, including vector databases such as Pinecone and VertexAI (refer to the integrated vector stores of LangChain and LlamaIndex for available tools), or in-memory indexes such as FAISS and nmslib. However, since embedding models encode queries and documents separately, they may miss context-specific nuances or query-document interactions, potentially overlooking important connections.
  • Rerankers: Rerankers employ more complex architectures, such as cross-encoders, that jointly process queries and documents. These architectures enable the modeling of intricate dependencies between queries and documents and are expected to reach superior retrieval performance. However, rerankers can be computationally intensive, especially when handling a large number of query-document pairs at runtime. Current vector databases and indexes are optimized for simple metrics like dot product or cosine similarity and don’t natively support the complex interactions modeled by rerankers. This limitation makes rerankers prohibitively expensive for large-scale systems. 

To overcome this challenge, retrieval systems often adopt a two-stage approach: an initial retrieval using embedding models quickly narrows down the pool of candidate documents, and then a reranker refines the list by reordering the top candidates. This two-stage approach combines the scalability of embedding models with the nuanced understanding of rerankers.

The two-stage retrieval strategy using embedding models, vector databases, and rerankers
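
As an illustration, the sketch below implements this two-stage strategy with the sentence-transformers library. The model names are placeholders chosen for illustration, and in a real system the document embeddings would typically live in a vector database rather than in memory.

    # Two-stage retrieval sketch: embedding-based candidate retrieval followed by
    # cross-encoder reranking. Model names are illustrative placeholders.
    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    embedder = SentenceTransformer("all-MiniLM-L6-v2")               # bi-encoder (embedding model)
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # cross-encoder (reranker)

    documents = [
        "The best practice to secure a REST API is to use authentication.",
        "Use OAuth 2.0 with scopes, rate limiting, and mutual TLS for robust API security.",
        "REST is an architectural style for building web services.",
    ]

    # Stage 1: embed the query and documents, then narrow down candidates by cosine similarity.
    doc_embeddings = embedder.encode(documents, convert_to_tensor=True)
    query = "What are the best practices for securing REST APIs?"
    query_embedding = embedder.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, doc_embeddings, top_k=2)[0]
    candidates = [documents[hit["corpus_id"]] for hit in hits]

    # Stage 2: rerank the top candidates with the cross-encoder, which scores
    # each query-document pair jointly.
    scores = reranker.predict([(query, doc) for doc in candidates])
    reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
    print(reranked[0])
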
Practical Options: Using APIs or Local Models

When implementing retrieval methods in RAG systems, practitioners have various options for embedding models and rerankers, including API-based services and locally hosted models.

Many organizations, including OpenAI, Cohere, and Voyage AI, provide these models as API services, simplifying integration.
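
As a minimal illustration, here is what an API-based embedding call looks like with OpenAI's Python client; the model name is one of OpenAI's published embedding models, and an API key is assumed to be available in the environment.

    # Minimal sketch of an API-based embedding call using the OpenAI Python client
    # (assumes the OPENAI_API_KEY environment variable is set).
    from openai import OpenAI

    client = OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=["What are the best practices for securing REST APIs?"],
    )
    embedding = response.data[0].embedding  # list of floats
    print(len(embedding))                   # dimensionality of the embedding vector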

These APIs eliminate the need to manage underlying infrastructure but come with considerations such as recurring costs per API call (refer to pricing of OpenAI, Cohere, Voyage AI), potential latency due to network communication, and data privacy concerns.

Alternatively, platforms like Hugging Face offer a vast range of embedding models and rerankers that can be downloaded and run locally. At the time of writing, there are 10.8k models labeled for the task of “Feature Extraction” in NLP and 7.2k for “Sentence Similarity,” both of which are relevant to this context.

Several rerankers are also available, such as mixedbread-ai/mxbai-rerank-base-v1 and jinaai/jina-reranker-v2-base-multilingual. There are actually 415 models that have the keyword “rerank” in their name.

Running models locally gives practitioners full control over their data and can reduce long-term costs by eliminating per-call fees. However, it requires investment in hardware and expertise in deploying models.

With so many embedding models and rerankers available, each varying in aspects such as fine-tuning datasets, architectural complexities, and model sizes, understanding how these retrieval methods influence RAG performance is essential.

The Current Approach for Selecting the “Best” Retrieval Method for RAG

When it comes to choosing a retrieval method for RAG, practitioners often turn to benchmarks like the Massive Text Embedding Benchmark (MTEB). MTEB evaluates models based on their ability to retrieve or rank documents in tasks resembling traditional search scenarios. Despite being first introduced in 2022, before the widespread adoption of LLMs and the emergence of RAG, MTEB remains a common reference point in industry discussions on optimizing retrieval for RAG, as seen in several industry blogs.

Industry blogs referring to MTEB for choosing the best embedding model for RAG (Pinecone, MongoDB, Modal, and Unstructured)

However, MTEB primarily assesses how well models retrieve semantically similar documents or rank them according to relevance in search tasks. The datasets used for evaluation include DBPedia Entity, MS MARCO, and Quora, which are meant to evaluate search and semantic similarity (more on these datasets in BEIR here).

This traditional approach frames retrieval as a search problem. While effective for search engines, it was not designed with the concept of RAG in mind. Moreover, many retrieval models are fine-tuned on these search-oriented datasets, like ColBERT on MS MARCO. This fine-tuning reinforces the emphasis on semantic similarity, which may not align with the objectives of RAG.

Reframing Retrieval as a Recommendation Problem

To better serve RAG and ICL, we propose reframing the retrieval task in RAG as a recommendation problem. In recommender systems, the goal is to select items that optimize downstream metrics, directly enhancing overall performance. 

Applying this to RAG, the retrieval process should prioritize documents that will most significantly boost the LLM’s performance on the task at hand.

To illustrate this point, consider the following example: 

     User query: “What are the best practices for securing REST APIs?”

     Document A: "The best practice to secure a REST API is to use authentication."

     Document B: "For robust security of REST APIs, implement advanced measures like utilizing
     OAuth 2.0 with scopes for granular access control, applying rate limiting to mitigate DDoS
     attacks, and adopting mutual TLS for client and server authentication to prevent
     man-in-the-middle attacks."

While Document A is closer semantically to the query, it provides a well-known principle that might already be within the LLM's knowledge. Document B, though less similar to the query in meaning, offers advanced strategies that can enhance the LLM’s response. A good retrieval approach should retrieve Document B instead of Document A.

Introducing ICLERB: A New Benchmark for Retrieval in ICL

To address existing limitations, we introduce ICLERB — the In-Context Learning Embedding and Reranker Benchmark. ICLERB evaluates retrieval methods based on how effectively the retrieved documents enhance the LLM’s performance in ICL tasks.

Key features of ICLERB include:

  • Comparing Embedding Models and Rerankers: ICLERB provides a standardized framework to compare embedding models and rerankers specifically for ICL tasks. 
  • Benchmarking for RAG Utility: Instead of focusing on semantic similarity, ICLERB evaluates retrieval models based on their ability to retrieve documents that improve the LLM performance across various datasets and language models.
  • Defining Utility with DPO: Central to ICLERB is the notion of utility, measured using the Direct Preference Optimization (DPO) metric. DPO quantifies the utility of including a document in the LLM's context by measuring its impact on the model's output probability distribution.

Calculating DPO values requires LLMs that expose probability outputs for the provided prompts, since we need access to the probabilities of the answers included in the prompt. Models that do not expose these probabilities over their input, such as most proprietary LLMs, cannot be evaluated using ICLERB.
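
The exact DPO-based utility is defined in our paper. The simplified sketch below only illustrates why these probabilities matter: it compares the log-probability that a small open LLM assigns to a known answer with and without a document in the prompt, using the Hugging Face transformers library, and treats the gain as a rough utility signal. The model choice and prompt format are arbitrary placeholders, not the benchmark's setup.

    # Simplified illustration (not ICLERB's exact DPO computation): compare the
    # log-probability of a target answer with and without a document in context.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # placeholder: any causal LM that exposes logits
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def answer_log_prob(prompt: str, answer: str) -> float:
        """Sum of log-probabilities of the answer tokens, given the prompt."""
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
        with torch.no_grad():
            log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
        # Each answer token is predicted from the position immediately before it.
        return sum(
            log_probs[0, pos - 1, full_ids[0, pos]].item()
            for pos in range(prompt_len, full_ids.shape[1])
        )

    query, answer = "Question: What is the capital of France?\nAnswer:", " Paris"
    document = "France is a country in Europe. Its capital city is Paris."

    gain = answer_log_prob(document + "\n\n" + query, answer) - answer_log_prob(query, answer)
    print(f"Rough utility signal of the document: {gain:.3f}")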

ICLERB Results and Findings

In the initial release of ICLERB, focusing on few-shot learning scenarios, we observe several noteworthy findings regarding the performance of prominent embedding models and rerankers for RAG and ICL.

ICLERB evaluation of embedding models and rerankers for ICL, sorted by nDCG@10
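
For readers unfamiliar with the metric, nDCG@10 rewards retrieval methods that place high-utility documents near the top of the ranked list. A generic nDCG@k computation, shown below as a simplified sketch rather than the benchmark's exact implementation, looks like this:

    # Generic nDCG@k (simplified sketch). In ICLERB, the relevance of a document
    # reflects its utility to the LLM rather than its semantic similarity to the query.
    import math

    def ndcg_at_k(relevances, k=10):
        """relevances: relevance scores of the retrieved documents, in ranked order."""
        def dcg(scores):
            return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(scores[:k]))
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    # Example: a ranking that places the highest-utility document third instead of first.
    print(ndcg_at_k([0.2, 0.0, 1.0, 0.5]))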

Finding #1: Rerankers Underperform Compared to Their Embedding Model Counterparts

Contrary to expectations, rerankers from Cohere and Voyage AI underperform compared to their embedding model counterparts. This trend is also seen with NVIDIA's retriever and embedding models, despite both being competitive overall.

This suggests that rerankers, often fine-tuned for semantic similarity or search relevance, may not be as effective in selecting documents that enhance LLM performance in ICL tasks. Embedding models seem to capture a wider range of contextual information beneficial for improving LLM outputs.

This finding has significant implications for practitioners employing the two-stage retrieval strategy of initially retrieving documents using embeddings and then refining the ordering with rerankers. Adopting this strategy is not only unnecessary, but could actually degrade performance in ICL tasks when using these specific rerankers (though this might not be the case with other rerankers like cm-rerank-mxbai-rlaif-v0.1).

Finding #2: Size Isn’t Everything

An interesting observation from ICLERB is that larger model size doesn't necessarily correlate with better performance in ICL tasks. While many top-ranking models are indeed large, our model, cm-rerank-mxbai-rlaif-v0.1 (more details here), with only 335M parameters, outperforms several larger models. This indicates that models specifically optimized for utility in ICL can achieve superior results without the need for massive parameter counts.

Similarly, within the Snowflake model family, the smallest model evaluated, snowflake-arctic-embed-s (33M parameters), surpasses its larger counterparts in ICL performance. This underscores the importance of targeted optimization over sheer model size.

Retrieval performance for the task of ICL as measured by nDCG@10 in ICLERB vs. model size in millions of parameters

Finding #3: Discrepancies with MTEB

Comparing ICLERB results with those from MTEB reveals important differences in embedding model rankings. Some models performing well on traditional semantic similarity benchmarks don’t necessarily excel in ICL tasks.

While models like BAAI’s bge-en-icl and NVIDIA’s NV-Embed-v2 maintain strong positions across both benchmarks, other models exhibit notable shifts:

  • Salesforce's SFR-Embedding-2_R ranks higher on ICLERB than on MTEB, indicating enhanced effectiveness for ICL tasks beyond what traditional retrieval metrics suggest. 
  • OpenAI’s text-embedding-3-large is outperformed by Cohere’s embedding model on ICLERB, a reversal of their standings on MTEB. On the other hand, OpenAI’s text-embedding-3-small surpasses all Snowflake models on ICLERB, which is not the case on MTEB.
  • MixedBread AI's mxbai-embed-large-v1 (335M parameters) outperforms the much larger Zeta-Alpha-E5-Mistral (7.1B parameters) on ICLERB, a contrast not mirrored in MTEB results where the former ranks at position 54 versus the latter at position 11.

These discrepancies highlight that traditional benchmarks like MTEB may not fully capture models' effectiveness in RAG and ICL contexts, emphasizing the need for specialized benchmarks like ICLERB.

Finding #4: Performance Variability Across Datasets

ICLERB results also reveal that some models exhibit significant performance variability across different datasets. While NVIDIA's NV-Embed-v2 and our cm-rerank-mxbai-rlaif-v0.1 maintain consistently high performance, BAAI's bge-en-icl shows fluctuations depending on the dataset used. This suggests certain models may be sensitive to specific data characteristics, impacting their generalizability across diverse ICL tasks.

What’s Next?

ICLERB represents a step forward in evaluating retrieval methods for RAG and ICL by focusing on the utility of retrieved documents in enhancing LLM performance. By reframing retrieval as a recommendation problem, we align the evaluation process with the core objectives of these advanced systems.

Future developments for ICLERB include expanding the benchmark to cover a broader range of LLMs and datasets, including more varied document types. This will provide deeper insights into retrieval methods across different RAG scenarios and contribute to a more robust and comprehensive assessment of retrieval strategies.

As we refine ICLERB, our goal is to facilitate the development of retrieval methods tailored for utility in ICL, ultimately advancing the capabilities of LLMs in providing accurate and contextually rich responses.

Author
Marie Al Ghossein
Staff Research Scientist

Marie Al Ghossein holds a Ph.D. in Computer Science and an engineering degree, both from Télécom Paris in France. Her research focuses on user modeling and recommender systems in real-world settings. She has mainly worked on context-aware recommender systems in the tourism domain and in streaming environments. She has published several papers in prestigious conferences and journals and has organized international workshops related to these topics. Marie held research positions in France and in Finland, and her experience also extends to the industry through various collaborations.