InstructRAG

Instructing Retrieval-Augmented Generation with Explicit Denoising
[arXiv] [Website] [Model] [Dataset] [X Summary]

InstructRAG is a simple yet effective RAG framework that allows LMs to explicitly denoise retrieved contents by generating rationales for better verifiability and trustworthiness.

InstructRAG Key Features:

  • 🤖 Self-Synthesis: Leverage instruction-tuned LMs to generate their OWN supervision for denoising.
  • 🔌 Easy-to-Use: Support both in-context learning (ICL) and supervised fine-tuning (SFT).
  • 🚀 Effectiveness: Up to 8.3% better results across 5 benchmarks (Table 5).
  • 💪 Noise Robustness: Robust to increased noise ratios in various scenarios (Figure 3).
  • 🔁 Task Transferability: InstructRAG can also solve out-of-domain unseen tasks (Figure 4).

Please see also our paper and X summary for more details.

Installation

Run the following script to create a Python virtual environment and install all required packages.

bash setup.sh

Alternatively, create a conda environment directly from the provided configuration file.

conda env create -f environment.yml

Training Script

To train the model (i.e., InstructRAG-FT), activate the environment and run the following training script. The training config is set for 4 x H100 80GB GPUs. On different hardware, adjust NUM_DEVICE and PER_DEVICE_BATCH_SIZE to match the number of GPUs and the memory available on each.

conda activate instrag
bash train.sh
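When NUM_DEVICE or PER_DEVICE_BATCH_SIZE is lowered, the number of sequences per optimizer step shrinks unless gradient accumulation makes up the difference. The sketch below shows that arithmetic only. It assumes the training setup exposes a gradient-accumulation setting, which this README does not confirm, and the target of 128 sequences per step is a hypothetical value, not the authors' configuration.

```python
import math


def grad_accum_steps(global_batch_size: int, num_device: int, per_device_batch_size: int) -> int:
    """Return the accumulation steps needed to reach global_batch_size per optimizer step."""
    if min(global_batch_size, num_device, per_device_batch_size) <= 0:
        raise ValueError("all arguments must be positive")
    # Sequences processed by all devices in one forward/backward pass.
    per_pass = num_device * per_device_batch_size
    # Round up so the effective batch is never smaller than the target.
    return math.ceil(global_batch_size / per_pass)


# Hypothetical target of 128 sequences per optimizer step.
print(grad_accum_steps(128, num_device=4, per_device_batch_size=8))  # 4
print(grad_accum_steps(128, num_device=2, per_device_batch_size=4))  # 16
```

Halving both the device count and the per-device batch size quarters the work done per pass, so the accumulation steps grow fourfold to keep the effective batch size unchanged.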

Evaluation

There are two instantiations of our framework:

  • InstructRAG-ICL: training-free & easy-to-adapt
  • InstructRAG-FT: trainable & better performance

Use the following script to evaluate InstructRAG in both training-free and trainable settings. You can specify the task and model by adjusting DATASET and MODEL in eval.sh.

conda activate instrag
bash eval.sh

Generation Example

The following case study shows that InstructRAG can identify relevant information in noisy input and fall back on its own knowledge to answer correctly when the retrieved documents are not sufficient. Red text marks irrelevant or inaccurate model generations, and green text marks content relevant to the question.

Model Checkpoints

Below is the full list of InstructRAG models fine-tuned on each dataset in our work.

Dataset           | HF Model Repo                            | Retriever
PopQA             | meng-lab/PopQA-InstructRAG-FT            | Contriever
TriviaQA          | meng-lab/TriviaQA-InstructRAG-FT         | Contriever
Natural Questions | meng-lab/NaturalQuestions-InstructRAG-FT | DPR
ASQA              | meng-lab/ASQA-InstructRAG-FT             | GTR
2WikiMultiHopQA   | meng-lab/2WikiMultiHopQA-InstructRAG-FT  | BM25
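As a rough sketch of how the list above might be used from Python, the mapping below records each dataset's repo and retriever exactly as listed. The `load_checkpoint` helper is hypothetical: it assumes the repos are standard Hugging Face causal-LM checkpoints loadable with the `transformers` Auto classes, which this README does not state, and it downloads the weights when called.

```python
from typing import NamedTuple


class Checkpoint(NamedTuple):
    repo_id: str
    retriever: str


# Copied from the checkpoint list above.
CHECKPOINTS = {
    "PopQA": Checkpoint("meng-lab/PopQA-InstructRAG-FT", "Contriever"),
    "TriviaQA": Checkpoint("meng-lab/TriviaQA-InstructRAG-FT", "Contriever"),
    "Natural Questions": Checkpoint("meng-lab/NaturalQuestions-InstructRAG-FT", "DPR"),
    "ASQA": Checkpoint("meng-lab/ASQA-InstructRAG-FT", "GTR"),
    "2WikiMultiHopQA": Checkpoint("meng-lab/2WikiMultiHopQA-InstructRAG-FT", "BM25"),
}


def load_checkpoint(dataset: str):
    """Hypothetical helper: load tokenizer and model for a dataset.

    Assumes a standard causal-LM checkpoint; requires `transformers`
    and network access, so the import is deferred until it is called.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    repo_id = CHECKPOINTS[dataset].repo_id
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    return tokenizer, model


print(CHECKPOINTS["ASQA"].repo_id)
```

Keeping the retriever next to the repo id is a reminder that each checkpoint was fine-tuned on documents from a specific retriever, so evaluation inputs should presumably come from the same one.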

Bugs or Questions?

If you have any questions related to the code or the paper, feel free to email Zhepei ([email protected]). If you encounter any problems when using the code, or want to report a bug, feel free to open an issue. Please describe the problem in detail so we can help you more quickly and effectively.

Citation

Please cite our paper if you find the repo helpful in your work:

@article{wei2024instructrag,
  title={{InstructRAG}: Instructing Retrieval-Augmented Generation with Explicit Denoising},
  author={Wei, Zhepei and Chen, Wei-Lin and Meng, Yu},
  year={2024}
}

instructrag's Issues

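Context documents are missing from the dataset

The dataset is missing the context documents, so the code at line 108 of data_utils.py, example["ctxs"], raises an error when it tries to read the contexts. How should this be resolved?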
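One way to narrow this down before rerunning the pipeline is to check which examples actually lack retrieved documents. The sketch below is a diagnostic only, not the repository's code: `find_missing_ctxs` is a hypothetical helper, and the sample records are invented for illustration. The only detail taken from the report is that each example is indexed with the key "ctxs".

```python
def find_missing_ctxs(examples):
    """Return indices of examples with no "ctxs" key or an empty "ctxs" list."""
    return [i for i, ex in enumerate(examples) if not ex.get("ctxs")]


# Invented sample records; the real dataset schema may differ.
sample = [
    {"question": "q1", "ctxs": [{"text": "a retrieved passage"}]},
    {"question": "q2"},              # key absent: example["ctxs"] raises KeyError
    {"question": "q3", "ctxs": []},  # key present but no documents
]

print(find_missing_ctxs(sample))  # [1, 2]
```

If every index is reported, the file being loaded probably contains only questions and answers, and the retrieved documents have to be obtained separately.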

A question about cache_dir

Dear author, could you please tell me what this cache is? Thanks a lot.

~/workspace/InstructRAG-main$ sh generate_rationale.sh
usage: inference.py [-h] [--dataset_name DATASET_NAME] [--rag_model {InstructRAG-FT,InstructRAG-ICL}]
[--model_name_or_path MODEL_NAME_OR_PATH] [--load_local_model] [--do_rationale_generation] [--n_docs N_DOCS]
[--output_dir OUTPUT_DIR] [--cache_dir CACHE_DIR] [--prompt_dict_path PROMPT_DICT_PATH] [--temperature TEMPERATURE]
[--max_tokens MAX_TOKENS] [--seed SEED] [--max_instances MAX_INSTANCES]
inference.py: error: argument --cache_dir: expected one argument
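The message in the log comes from Python's argparse: it is raised when `--cache_dir` appears on the command line with no value after it, which typically happens when the shell variable the script expands into that flag is empty or unset. The sketch below reproduces this with a stand-in parser that declares only the one flag. It is not the repository's inference.py, and `/path/to/cache` is a placeholder.

```python
import argparse
import contextlib
import io


def build_parser() -> argparse.ArgumentParser:
    # Stand-in for the real parser: only the flag from the error message.
    parser = argparse.ArgumentParser(prog="inference.py")
    parser.add_argument("--cache_dir", type=str, default=None)
    return parser


def try_parse(argv):
    """Return (True, namespace) on success or (False, error_text) on failure."""
    err = io.StringIO()
    try:
        with contextlib.redirect_stderr(err):
            return True, build_parser().parse_args(argv)
    except SystemExit:
        # argparse prints usage plus the error to stderr, then exits with code 2.
        return False, err.getvalue()


ok, message = try_parse(["--cache_dir"])  # flag given without a value
print(ok, "expected one argument" in message)  # False True

ok, args = try_parse(["--cache_dir", "/path/to/cache"])
print(ok, args.cache_dir)  # True /path/to/cache
```

Setting the corresponding variable in generate_rationale.sh to an existing directory, or removing the flag so the default applies, should get past the parsing error. What the directory is used for is not stated in this README.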
