activerag's Introduction

OpenMatch v2

An all-in-one toolkit for information retrieval. Under active development.

Install

git clone https://github.com/OpenMatch/OpenMatch.git
cd OpenMatch
pip install -e .

The -e flag installs the package in editable mode, so changes you make to the code in your local directory take effect without reinstalling.

The package does not declare all of its requirements. You may need to install torch and tensorboard manually.

You may also need faiss for dense retrieval. You can install either faiss-cpu or faiss-gpu, depending on your environment. Note that if you want to run search on GPUs, you need a version of faiss-gpu that is compatible with your CUDA installation. In some cases (usually CUDA >= 11.0), pip installs an incompatible version. If you encounter errors during GPU search, try installing faiss-gpu from conda instead.
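As a quick sanity check after installation, the following minimal sketch reports which of the optional dependencies named above can be imported. The helper name check_optional_deps is hypothetical and not part of OpenMatch. The GPU check assumes that faiss.get_num_gpus may be missing from CPU-only builds, so it is looked up defensively.

```python
import importlib.util


def check_optional_deps(names):
    """Return {module_name: bool} telling whether each module can be found.

    Hypothetical helper for illustration; it only locates the modules and
    does not import them.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Both faiss-cpu and faiss-gpu are imported under the module name "faiss".
    status = check_optional_deps(["torch", "tensorboard", "faiss"])
    for name, found in status.items():
        print(f"{name}: {'found' if found else 'MISSING'}")

    if status["faiss"]:
        import faiss

        # Assumption: get_num_gpus may not be exposed by CPU-only builds,
        # so a missing attribute is treated as "no GPUs visible".
        get_num_gpus = getattr(faiss, "get_num_gpus", None)
        print("faiss GPUs visible:", get_num_gpus() if get_num_gpus else 0)
```

If faiss is found but reports zero GPUs on a machine that has them, that is consistent with the version mismatch described above.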

Features

Human-friendly interface for dense retriever and re-ranker training and testing
Various PLMs supported (BERT, RoBERTa, T5...)
Native support for common IR & QA Datasets (MS MARCO, NQ, KILT, BEIR, ...)
Deep integration with Huggingface Transformers and Datasets
Efficient training and inference via stream-style data loading

Docs

We are actively working on the docs.

Project Organizers

Zhiyuan Liu
- Tsinghua University
- Homepage
Zhenghao Liu
- Northeastern University
- Homepage
Chenyan Xiong
- Microsoft Research AI
- Homepage
Maosong Sun
- Tsinghua University
- Homepage

Acknowledgments

Our implementation uses Tevatron as the starting point. We thank its authors for their contributions.

Contact

Please email [email protected].


activerag's Issues

About Dataset

I'm new to RAG research and would like to know how the dataset was sampled.
Take the PopQA dataset as an example. The source dataset only has these fields:
'id', 'subj', 'prop', 'obj', 'subj_id', 'prop_id', 'obj_id', 's_aliases', 'o_aliases', 's_uri', 'o_uri', 's_wiki_title', 'o_wiki_title', 's_pop', 'o_pop', 'question', 'possible_answers'
So how do I get the passages?
Should I download the full Wikipedia content, use the PopQA queries (questions) to retrieve Wikipedia passages, and then use the retrieved lists to rebuild a dataset for evaluation, like data_popqa_sampled.jsonl?
I would be grateful for a reply.
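The workflow described in this question can be sketched as follows. This is an illustration only, not the authors' confirmed procedure. A toy token-overlap ranker stands in for a real retriever over a Wikipedia passage collection, the corpus and the example row are made up, and the output field name "answers" is an assumption. Only "id", "question" and "passages" are field names that appear in these issues.

```python
import json
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question, corpus, k=2):
    """Toy lexical retriever: rank passages by token overlap with the question.

    A stand-in for a dense retriever; ties keep the original corpus order.
    """
    q_tokens = tokenize(question)
    ranked = sorted(corpus, key=lambda p: len(q_tokens & tokenize(p)), reverse=True)
    return ranked[:k]


def build_record(row, corpus, k=2):
    """Combine one PopQA-style row with its top-k retrieved passages."""
    return {
        "id": row["id"],
        "question": row["question"],
        "answers": row["possible_answers"],  # assumed output field name
        "passages": retrieve(row["question"], corpus, k),
    }


# Made-up corpus and row, for illustration only.
corpus = [
    "George Rankin was an Australian soldier and politician.",
    "Paris is the capital of France.",
    "A politician is a person active in party politics.",
]
row = {
    "id": 1,
    "question": "What is George Rankin's occupation?",
    "possible_answers": ["politician"],
}

record = build_record(row, corpus)
# One JSON object per line gives the .jsonl layout.
print(json.dumps(record))
```

In a real setup, the retrieve function would be replaced by a search over an index of the passage collection, and each record would be written as one line of the output file.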

About the generation of the data_nq_sampled.jsonl file passages field

data_nq_sampled.jsonl

In your last issue you wrote that you used OpenMatch/t5-ance to compute the embeddings, so I generated new vector embeddings using generate_passage_embeddings.py from self-rag, with the original text taken from the psgs_w100.tsv file. Using the first NQ example, {"id": "-5011119306587397321", "question": "who sings the theme song to all that"}, the retrieved results are not quite the same as yours.
The text in this file is relatively long, but the array elements in your passages field are relatively short, and many of them contain the marker "BULLET::::".
Was the text column of psgs_w100.tsv chunked and processed further before generating the embeddings?
