Closing the Loop: Real-Time Self-Improvement for LLMs with RAGSys

Large Language Models (LLMs) generate useful responses across a wide range of queries but often lack the specialized knowledge and skills needed for niche tasks. They do not have direct access to private or recently updated data that falls outside their pre-training corpus. Additionally, they can struggle with complex tasks, such as generating valid code or following precise domain-specific guidelines.

One way to address these limitations is by augmenting the LLM’s prompt with additional context. This is commonly achieved using Retrieval Augmented Generation (RAG), where relevant information is retrieved from a database and added to the prompt at runtime, grounding the model’s response in external knowledge.

We introduce a new framework that generalizes RAG, transforming it from a retrieval system into a dynamic optimization process. By systematically refining how context is retrieved and composed, this framework enables LLMs to improve continuously, optimizing their outputs in real time without retraining.

RAG generalized

Retrieval Augmented Generation (RAG) enhances an LLM’s prompt by dynamically incorporating relevant external information. The most well-known implementation of this is Document-RAG, where document chunks are retrieved from a knowledge base based on embedding similarity and inserted into a system message to guide the model’s response.
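As a concrete reference point, here is a minimal sketch of that Document-RAG pipeline. The function names are illustrative, and it assumes a generic embed() call and unit-normalized chunk embeddings rather than any specific library.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: in practice this calls an embedding model (e.g. a
    sentence-transformer or an embeddings API) and returns a unit vector."""
    raise NotImplementedError

def retrieve_chunks(query: str, chunks: list[str],
                    chunk_vecs: np.ndarray, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the query."""
    q = embed(query)
    sims = chunk_vecs @ q            # cosine similarity if vectors are unit-normalized
    top = np.argsort(-sims)[:k]
    return [chunks[i] for i in top]

def build_prompt(query: str, retrieved: list[str]) -> list[dict]:
    """Insert the retrieved chunks into a system message ahead of the user query."""
    system = "Answer using the following context:\n\n" + "\n---\n".join(retrieved)
    return [{"role": "system", "content": system},
            {"role": "user", "content": query}]
```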

While Document-RAG effectively integrates external knowledge, retrieval is not limited to documents. The same retrieval principles apply to other types of context—such as tailored instructions or few-shot examples—that refine an LLM’s ability to generate accurate and context-aware responses. To generalize RAG, it is essential to break down its core components and their roles in the system.

At a high level, a RAG system retrieves and composes context dynamically to construct an optimized prompt before passing it to the LLM. The key components of this process include:

  • Query: The input the LLM responds to, typically a user message.
  • Prompt: The full input text provided to the LLM, including the query and additional context such as a system message, chat history, or few-shot examples.
  • Response: The output generated by the LLM based on the prompt.
  • Context: Any additional information retrieved to improve response quality. This includes document chunks, tailored instructions, and relevant few-shot examples.
  • Retriever: A function that selects the most useful contexts based on the query.
  • Composer: A function that integrates the retrieved contexts into a structured prompt to maximize the LLM’s effectiveness.

Each of these components serves a distinct function. The retriever identifies relevant information that enhances the model’s response, while the composer determines how that information is framed within the prompt for maximum utility. Together, they dictate the effectiveness of RAG in guiding LLM behavior.
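The sketch below shows how these pieces fit together as typed components. The names and types are illustrative, not taken from a specific library.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Context:
    """Any retrievable piece of context; `kind` distinguishes document chunks,
    tailored instructions, and few-shot examples."""
    kind: str
    text: str

# Hypothetical type aliases for the two functional components.
Retriever = Callable[[str], list[Context]]        # query -> useful contexts
Composer = Callable[[str, list[Context]], str]    # query + contexts -> full prompt

def answer(query: str, retrieve: Retriever, compose: Composer,
           llm: Callable[[str], str]) -> str:
    """Generalized RAG loop: retrieve contexts, compose the prompt, call the LLM."""
    contexts = retrieve(query)
    prompt = compose(query, contexts)
    return llm(prompt)
```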

By treating retrieval as a method for dynamically constructing prompts, rather than simply supplementing an LLM with external facts, RAG becomes a flexible optimization mechanism. Retrieving and composing any context that enhances response quality—whether factual data, domain-specific guidelines, or curated examples—extends RAG beyond static document retrieval, making it a more precise and adaptable tool for improving LLM outputs.

In-context learning

LLMs generate responses based on the information in their prompts, and some prompts are more effective than others. The goal is to construct prompts that maximize response accuracy by providing the most useful supporting context.

Retrieval plays a key role in optimizing prompts. Document-RAG enhances response quality by grounding the model in factual knowledge, but retrieval need not be limited to documents. Other forms of context—such as instructions and few-shot examples—also improve LLM outputs by shaping how the model processes and generates responses.

However, instructions and few-shot examples are often static, meaning the same predefined context is used for every query. This rigid approach fails to account for variation across queries: different queries require different supporting contexts to produce the best responses.

RAG removes this limitation by making context dynamic. Instead of relying on fixed examples or generic instructions, retrieval selects the most useful guidance for each query, ensuring the LLM is always working with the most effective context available.

Prompt tuning

Retrieval determines what information is available to an LLM, but not all retrieved context contributes equally to response quality. Traditional retrieval methods focus on similarity to the query, but similarity alone does not guarantee an improved response. What matters is utility—how much a retrieved context enhances the likelihood of a correct answer.

Critically, this utility can be measured. Given a prompt, a correct response, and a piece of context, we can directly evaluate its impact on response accuracy. Since this effect is quantifiable, retrieval can be optimized to prioritize high-utility contexts—those that provide the most significant improvement—rather than simply retrieving the most related ones.
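One way to quantify this is to compare the model's likelihood of the known-correct response with and without the context. The sketch assumes access to a scoring function loglik(prompt, response) that returns the model's log-probability of a response given a prompt; that interface is an assumption, since APIs differ.

```python
def context_utility(query: str, correct_response: str,
                    context: str, loglik) -> float:
    """Utility of a context for a given query: how much it raises the model's
    log-likelihood of the known-correct response.
    `loglik(prompt, response)` is an assumed interface returning log P(response | prompt)."""
    without_ctx = loglik(query, correct_response)
    with_ctx = loglik(f"{context}\n\n{query}", correct_response)
    return with_ctx - without_ctx
```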

A well-tuned retriever aligns retrieval with measured utility. The simplest approach is to bias retrieval toward contexts with a track record of high utility. Beyond that, prompts and contexts can be mapped into a space where similarity reflects utility. Two approaches are typical: train a small neural network to remap pre-trained embeddings into such a utility-aligned space, or use an LLM to rewrite prompts and contexts so that their similarity better predicts utility.
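As a sketch of the first approach, a small adapter network can be trained so that cosine similarity in the adapted space tracks measured utility. The architecture and loss here are illustrative assumptions, not a prescribed recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAdapter(nn.Module):
    """Maps pre-trained embeddings into a space where query-context similarity
    reflects utility rather than raw semantic similarity."""
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x), dim=-1)

def training_step(adapter: EmbeddingAdapter, opt: torch.optim.Optimizer,
                  q_emb: torch.Tensor, ctx_emb: torch.Tensor,
                  utility: torch.Tensor) -> float:
    """Regress cosine similarity in the adapted space onto measured utility
    (utility assumed rescaled to [-1, 1] to be comparable with cosine similarity)."""
    sim = (adapter(q_emb) * adapter(ctx_emb)).sum(dim=-1)
    loss = F.mse_loss(sim, utility)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```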

By shifting retrieval from similarity-based selection to utility-driven optimization, the process becomes more than just retrieving information—it acts as a fine-tuning mechanism that continuously improves model performance without modifying its weights. This allows LLMs to adapt in real time, making them more effective for specialized tasks and evolving information needs.

Closing the loop

Each time an LLM generates a response, the system stores an interaction, which consists of the query, the model’s response, and the resulting outcome—an observable effect that provides feedback on performance. Outcomes take various forms: a tool’s output, a follow-up message, an explicit correction, or a downstream KPI such as engagement or resolution rate.

All interactions, both positive and negative, are stored and used by the system to adjust its retrieval strategy and create high-utility contexts in real time.
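A minimal sketch of what such a log might look like follows. The field names are illustrative, and the outcome is reduced to a scalar for simplicity.

```python
from dataclasses import dataclass, field
import time

@dataclass
class Interaction:
    """One logged exchange between a user and the LLM."""
    query: str
    response: str
    outcome: float                                      # e.g. +1 resolved, -1 corrected, or a KPI delta
    contexts: list[str] = field(default_factory=list)   # contexts that were in the prompt
    timestamp: float = field(default_factory=time.time)

class InteractionStore:
    """Append-only store that the retriever and instruction generator read from."""
    def __init__(self) -> None:
        self.items: list[Interaction] = []

    def log(self, interaction: Interaction) -> None:
        self.items.append(interaction)

    def positives(self) -> list[Interaction]:
        return [i for i in self.items if i.outcome > 0]

    def negatives(self) -> list[Interaction]:
        return [i for i in self.items if i.outcome <= 0]
```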

Correcting Negative Outcomes

When an interaction results in an undesirable outcome—such as an incorrect response, hallucination, or ambiguity—the system generates a corrective instruction to guide future responses.

Rather than relying on manually crafted instructions, rejection sampling constructs them automatically. An LLM generates multiple variations of a corrective instruction, evaluates their impact, and selects the most effective one. Each candidate is scored against the stored interactions by measuring how much it, when used as context, would have increased the chance of generating responses with positive outcomes while decreasing the chance of generating responses with negative outcomes.
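A sketch of that scoring loop, reusing the Interaction store and loglik interface assumed above; the prompt wording and candidate count are illustrative.

```python
def best_corrective_instruction(failed: Interaction, store: InteractionStore,
                                generate, loglik, n: int = 8) -> str:
    """Rejection sampling over candidate corrective instructions.
    `generate(prompt, n)` asks an LLM for n candidate instructions (assumed interface);
    `loglik(prompt, response)` returns log P(response | prompt)."""
    candidates = generate(
        "The response below had a negative outcome. Write a short instruction "
        "that would have prevented the mistake.\n"
        f"Query: {failed.query}\nResponse: {failed.response}",
        n,
    )

    def score(instruction: str) -> float:
        # Reward instructions that make past positive-outcome responses more likely
        # and past negative-outcome responses less likely.
        total = 0.0
        for ix in store.items:
            gain = (loglik(f"{instruction}\n\n{ix.query}", ix.response)
                    - loglik(ix.query, ix.response))
            total += gain if ix.outcome > 0 else -gain
        return total

    return max(candidates, key=score)
```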

These corrective instructions are stored in the retrieval database and dynamically injected into prompts for relevant queries, allowing the system to self-correct without modifying the model’s weights.

Reinforcing Positive Outcomes

When an interaction produces a strong response, it is stored as a candidate for few-shot examples, reinforcing effective behavior.

Beyond the training phase, the retriever is continuously tuned on the fly to prioritize high-utility examples—examples that consistently improve response quality. As the dataset evolves, examples with broader utility become preferred, while those with diminishing impact are deprioritized. This ensures that few-shot learning remains adaptive, always retrieving the most effective examples.
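One simple way to implement this re-weighting is to keep a running utility score per example and rank candidates by it at retrieval time. The update rule below is an illustrative assumption, not a prescribed method.

```python
def update_example_score(prev_score: float, observed_utility: float,
                         alpha: float = 0.1) -> float:
    """Exponential moving average of an example's observed utility; recent
    evidence gradually outweighs older evidence as the dataset evolves."""
    return (1 - alpha) * prev_score + alpha * observed_utility

def select_few_shot(scored_examples: list[tuple[str, float]], k: int = 4) -> list[str]:
    """Prefer the examples with the highest running utility scores."""
    ranked = sorted(scored_examples, key=lambda e: e[1], reverse=True)
    return [text for text, _ in ranked[:k]]
```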

In cases where explicit feedback is unavailable, the system assigns an implicit reward using long-term credit assignment—tracing system-wide KPIs back to individual interactions. For example, an LLM deployed on a website can use engagement metrics, resolution rates, or user retention as feedback signals, refining retrieval strategies even in the absence of direct corrections.
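A sketch of one such credit-assignment scheme, again using the Interaction records assumed earlier: spread an observed KPI change across recent interactions with an exponential time decay. The decay rule is an illustrative choice, not the only option.

```python
import math
import time

def assign_implicit_rewards(interactions: list[Interaction], kpi_delta: float,
                            half_life_s: float = 3600.0) -> None:
    """Distribute a KPI change over recent interactions, weighting newer ones more."""
    now = time.time()
    weights = [math.exp(-math.log(2) * (now - ix.timestamp) / half_life_s)
               for ix in interactions]
    total = sum(weights) or 1.0
    for ix, w in zip(interactions, weights):
        ix.outcome += kpi_delta * w / total
```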

By embedding feedback directly into retrieval, this system closes the loop between interaction and improvement. The LLM continuously refines its responses, optimizes retrieval, and evolves over time—without manual updates or retraining. Instead of relying on static datasets or human intervention, the system learns from its own interactions, identifying high-utility contexts and dynamically adapting retrieval strategies to maximize performance.

Conclusion

Traditional supervised fine-tuning improves LLM performance by modifying model parameters, but it is costly, introduces long delays, and primarily learns from positive demonstrations rather than a full spectrum of feedback. In contrast, this retrieval-driven approach enables real-time learning from both successes and failures, integrating corrective instructions, few-shot examples, and long-term feedback signals to continuously refine responses.

By embedding retrieval as an optimization layer, the system adapts dynamically, selecting the most effective context for each query and adjusting to new information without retraining. Corrective instructions address mistakes, while high-utility examples ensure adaptability, allowing the model to improve its reasoning over time.

This closes the loop between interaction and improvement, creating a system that learns autonomously from real-world usage. By shifting from static retrieval to continuous optimization, LLMs become more accurate, reliable, and responsive—bridging the gap between pre-trained knowledge and real-time adaptation.

Author
Garrin McGoldrick
Director, Applied Machine Learning and NLP

Garrin made the switch from Physics to Machine Learning after earning his Ph.D. in Experimental High Energy Particle Physics from the University of Toronto in collaboration with CERN. He used the software engineering and statistical learnings from his Ph.D. to co-found a startup working at the forefront of natural language processing and later joined Borealis AI, where he led a team conducting research in transformer networks and applying these advancements to the financial industry. Garrin holds several patent publications and has a strong track record of developing and deploying solutions that merge cutting-edge research with practical impact.