This paper introduces Direct Retrieval-augmented Optimization (DRO), a framework that synergizes knowledge selection and LLM generation via end-to-end variational training, achieving 5-15% improvements in EM and F1 scores across five QA datasets.
Retrieval-Augmented Generation, Large Language Model, End-to-End Training, Knowledge Selection, Importance Sampling, Question Answering
Zhengliang Shi, Lingyong Yan, Weiwei Sun, Yue Feng, Pengjie Ren, Xinyu Ma, Shuaiqiang Wang, Dawei Yin, Maarten de Rijke, Zhaochun Ren
Shandong University, Baidu Inc., Carnegie Mellon University, University of Birmingham, University of Amsterdam, Leiden University
Generated by grok-3
Background Problem
Retrieval-Augmented Generation (RAG) combines large language models (LLMs) with external knowledge retrievers to enhance factuality in knowledge-intensive tasks such as question answering. However, optimizing RAG performance is challenging because there is no end-to-end training supervision connecting the retriever and generator components. Previous approaches either fine-tuned the retriever in isolation or trained the LLM against a frozen off-the-shelf retriever, often under the unrealistic assumption that retrieved documents are independent of one another, and thus struggle with complex real-world scenarios such as multi-hop QA. This paper introduces Direct Retrieval-augmented Optimization (DRO) to tackle these issues by enabling synergistic, end-to-end training of both components.
Method
The proposed DRO framework synergizes a generative knowledge-selection model and an LLM generator through a variational approach that alternates between two phases. (1) Document Permutation Estimation (E-step): document permutations are treated as latent variables, and their distribution is estimated via importance sampling from the selection model. An off-the-shelf retriever (e.g., ColBERT) first retrieves the top-20 documents, after which the selection model autoregressively generates a permutation of document identifiers (e.g., [1] > [2] > [3]) over a subset of k=5 documents. (2) Re-weighted Maximization (M-step): importance weights correct the bias introduced by sampling, and the selection model and the LLM generator are jointly optimized by maximizing an evidence lower bound (ELBO) on the log-likelihood of the answer. Alternating the two steps drives mutual improvement, and the authors draw a theoretical parallel to policy-gradient reinforcement learning: the selection model is reinforced to choose permutations that improve generation performance, as sketched below.
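To make the alternation concrete, the sketch below gives one standard way to write the two steps as importance-sampled EM. The notation is illustrative rather than copied from the paper, and the paper's exact objective may differ in details such as how the weights are normalized.

```latex
% Schematic rendering of the two alternating steps as importance-sampled EM
% (notation illustrative, not copied from the paper).
% x: question, \mathcal{D}: retrieved documents, z: document permutation,
% y: answer, p_\phi: selection model, p_\theta: LLM generator.

% E-step: sample permutations z_1, ..., z_N from the selection model and
% re-weight them toward the posterior
% p(z \mid x, y) \propto p_\phi(z \mid x, \mathcal{D})\, p_\theta(y \mid x, z):
w_i \;\propto\; \frac{p_\phi(z_i \mid x, \mathcal{D})\, p_\theta(y \mid x, z_i)}
                     {p_\phi(z_i \mid x, \mathcal{D})}
     \;=\; p_\theta(y \mid x, z_i),
\qquad \sum_{i=1}^{N} w_i = 1.

% M-step: jointly update (\theta, \phi) by maximizing the re-weighted
% evidence lower bound:
\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi)
  \;=\; \sum_{i=1}^{N} w_i \Big[ \log p_\phi(z_i \mid x, \mathcal{D})
        \;+\; \log p_\theta(y \mid x, z_i) \Big].
```

Under this reading, the selection-model gradient takes the form Σ_i w_i ∇ log p_φ(z_i | x, D), i.e., a REINFORCE-style update in which the normalized generator likelihood plays the role of the reward, which is the policy-gradient parallel noted above.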
Experiment
Experiments were conducted on five question-answering benchmarks: Natural Questions (NQ), HotpotQA, MuSiQue, 2WikiMultihopQA, and Wizard-of-Wikipedia (WoW), with the 2018 Wikipedia dump as the retrieval corpus. Llama-3-8B and Mistral-7B served as backbones for both the selector and the generator; ColBERTv2.0 retrieved the top-20 documents, from which 5 were selected. Metrics were Exact Match (EM) and F1 for answers and Recall@K for document selection. DRO outperformed baselines by 5-15% in EM and F1 (e.g., EM of 45.76 on NQ vs. 40.74 for the best baseline) and improved document-selection recall by 17.78% on average. The experimental design was comprehensive, covering single-hop and multi-hop QA, and ablation studies confirmed the necessity of joint optimization. Convergence and stability were demonstrated through iterative performance gains and reduced variance across training iterations. However, the fixed retrieval and selection sizes (20 and 5) and the specific backbones may limit generalizability, and the computational overhead of sampling (8 permutations per query) and potential early-training instability were noted as practical concerns. Overall, the results matched the expectation that end-to-end training improves performance.
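For reference, a minimal sketch of the reported metrics (EM, F1, Recall@K), assuming the common SQuAD-style answer normalization and set-based recall over gold evidence documents; the paper's exact evaluation scripts may differ.

```python
# Minimal sketch of the evaluation metrics: EM and F1 for answers,
# Recall@K for document selection (standard definitions, assumed here).
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def recall_at_k(selected_doc_ids, gold_doc_ids, k: int = 5) -> float:
    """Fraction of gold evidence documents found among the top-k selected."""
    top_k = set(selected_doc_ids[:k])
    return len(top_k & set(gold_doc_ids)) / max(len(set(gold_doc_ids)), 1)

if __name__ == "__main__":
    print(exact_match("The Eiffel Tower", "eiffel tower"))     # 1.0
    print(round(f1_score("Paris, France", "Paris"), 2))        # 0.67
    print(recall_at_k(["d3", "d7", "d1"], ["d1", "d9"], k=3))  # 0.5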
Further Thoughts
The DRO framework presents a compelling advancement in RAG by addressing the critical gap in end-to-end optimization, but it opens several avenues for deeper exploration. One intriguing aspect is its potential application to multi-modal RAG scenarios, as hinted in the conclusion. For instance, integrating visual or audio data alongside textual documents could further enhance factuality in tasks like video question answering, though this would require adapting the importance sampling strategy to handle heterogeneous data distributions. Additionally, the observed failure cases (incorrect selection and generation mismatch) resonate with broader challenges in RAG systems, such as handling noisy contexts, which are also evident in works like Self-RAG or RAAT. A potential cross-connection could be exploring hybrid approaches that combine DRO’s end-to-end training with self-reflection mechanisms from Self-RAG to mitigate generation mismatches. Furthermore, the computational cost of sampling multiple permutations per query suggests a need for adaptive sampling techniques—perhaps inspired by active learning paradigms—where the number of samples dynamically adjusts based on training stability or query complexity. Lastly, the theoretical analogy to policy-gradient reinforcement learning invites a comparison with RL-based RAG methods like DDR-RAG; a deeper investigation into whether DRO’s importance weights could be enhanced with explicit reward shaping from human feedback (e.g., RLHF) could yield even more robust selection-generation synergy. These thoughts underscore DRO’s potential while highlighting practical and theoretical challenges that future research must address.