seal's Introduction

SEAL: Search Engines with Autoregressive LMs

This repo hosts the code for our paper, SEAL.

@inproceedings{bevilacqua2022autoregressive,
 title={Autoregressive Search Engines: Generating Substrings as Document Identifiers}, 
 author={Michele Bevilacqua and Giuseppe Ottaviano and Patrick Lewis and Wen-tau Yih and Sebastian Riedel and Fabio Petroni},
 booktitle={arXiv pre-print 2204.10628},
 url={https://arxiv.org/abs/2204.10628},
 year={2022},
}

https://arxiv.org/abs/2204.10628

Changelog

UPDATE! (05/22/2022) Preprocessing/training scripts added!

Introduction

We propose an approach to retrieval that uses guided LM decoding to search for occurrences of ngrams of any size in an arbitrarily large collection of documents. Constrained decoding blocks the generation of ngrams that never appear in the corpus, so generated ngrams are always grounded in one or more documents in the retrieval corpus. Documents are then scored by aggregating the scores of the individual generated "identifiers".
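The aggregation step can be pictured with a toy example. This is a minimal sketch that assumes a plain sum of ngram scores per document; the names `aggregate_scores`, `ngram_scores`, and `ngram_to_docs` are hypothetical, and the scoring actually implemented in seal/retrieval.py may weight or filter ngrams differently.

```python
from collections import defaultdict

def aggregate_scores(ngram_scores, ngram_to_docs):
    """Sum the score of every generated ngram into each document that contains it."""
    doc_scores = defaultdict(float)
    for ngram, score in ngram_scores.items():
        # An ngram can occur in several documents; each one receives its score.
        for doc_id in ngram_to_docs.get(ngram, ()):
            doc_scores[doc_id] += score
    # Highest aggregated score first.
    return sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical generated ngrams with toy scores, and the documents they occur in.
ngram_scores = {" eating soup": 2.0, " fork": 1.0}
ngram_to_docs = {" eating soup": ["doc1"], " fork": ["doc1", "doc2"]}

ranking = aggregate_scores(ngram_scores, ngram_to_docs)
print(ranking)
```

Here doc1 is ranked first because it matches both ngrams, while doc2 only matches the more frequent and lower-scored one.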

We use the Ferragina-Manzini index (FM-index), an opportunistic, compressed suffix array, as the unified data structure for constrained decoding, retrieval, and full-text storage.
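To give an intuition for how an FM-index answers "how often does this ngram occur?", here is a toy pure-Python sketch of backward search over a Burrows-Wheeler transform. It is not the sdsl-lite implementation that SEAL relies on: the suffix array is built by naive sorting and `rank` is a linear scan, whereas real implementations use succinct structures such as wavelet trees. The function names are hypothetical.

```python
def build_fm_index(tokens):
    """Build a toy FM-index over a sequence of token ids >= 1 (0 is the sentinel)."""
    text = list(tokens) + [0]
    # Suffix array by naive sorting: fine for a toy, far too slow for a real corpus.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the symbol preceding each sorted suffix (text[-1] is the sentinel).
    bwt = [text[i - 1] for i in sa]
    # first[c] = number of symbols in the text that are strictly smaller than c.
    counts = {}
    for c in text:
        counts[c] = counts.get(c, 0) + 1
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    return bwt, first

def rank(bwt, c, i):
    """Number of occurrences of c in bwt[:i] (linear scan in this toy version)."""
    return sum(1 for x in bwt[:i] if x == c)

def count(bwt, first, pattern):
    """Count occurrences of `pattern` with backward search, one symbol at a time."""
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + rank(bwt, c, lo)
        hi = first[c] + rank(bwt, c, hi)
        if lo >= hi:
            return 0
    return hi - lo

bwt, first = build_fm_index([1, 2, 1, 2, 3])
print(count(bwt, first, [1, 2]))  # 2
print(count(bwt, first, [3, 1]))  # 0
```

Each pattern symbol costs two rank queries, so the total cost depends on the pattern length and on how fast rank is answered, not on the corpus size.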

SEAL architecture

The FM-index

You can think of the FM-index as a trie that indexes not only a set of strings, but every substring of each string. This lets us perform constrained decoding of ngrams of unbounded length starting from any point in the retrieval corpus, from simple unigrams to entire sentences.
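The "trie over all substrings" view can be sketched without any index at all. The snippet below is an illustration only: it scans the corpus naively instead of querying an FM-index, uses greedy decoding rather than the beam search used in SEAL, and replaces the language model with a hypothetical `score_fn`. All names are assumptions made for this example.

```python
def allowed_next_tokens(corpus, prefix):
    """Tokens that follow `prefix` somewhere in any document of `corpus`."""
    n = len(prefix)
    allowed = set()
    for doc in corpus:
        for start in range(len(doc) - n):
            if doc[start:start + n] == prefix:
                allowed.add(doc[start + n])
    return allowed

def constrained_greedy(score_fn, corpus, max_length):
    """Greedy decoding that can only emit ngrams occurring in the corpus."""
    ngram = []
    while len(ngram) < max_length:
        allowed = allowed_next_tokens(corpus, ngram)
        if not allowed:
            break  # the current ngram cannot be extended inside the corpus
        # Pick the best-scoring token among the allowed continuations only.
        ngram.append(max(allowed, key=lambda tok: score_fn(ngram, tok)))
    return ngram

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "down"],
]
# Stand-in for a language model: a fixed preference per token.
preference = {"the": 1.0, "dog": 0.9, "sat": 0.8, "down": 0.7, "cat": 0.5}
result = constrained_greedy(lambda prefix, tok: preference[tok], corpus, max_length=10)
print(result)
```

With an empty prefix every token in the corpus is allowed, which mirrors decoding "from any point in the retrieval corpus"; as the prefix grows, the set of allowed continuations shrinks to what actually follows it in some document.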

Our implementation relies on sdsl-lite.

Install

SEAL needs a working installation of SWIG, e.g. (on Ubuntu):

sudo apt install swig

We also assume that PyTorch is already available in your environment. SEAL has been tested with version 1.11.

Clone this repo with --recursive so that you also include the submodule in res/external.

git clone --recursive https://github.com/facebookresearch/SEAL.git

Compile and install sdsl-lite:

env CFLAGS='-fPIC' CXXFLAGS='-fPIC' res/external/sdsl-lite/install.sh

Install other dependencies:

pip install -r requirements.txt

# pyserini
# pip install -r requirements_extra.txt

Now install this library.

pip install -e .

Download

We make available model checkpoints and pre-built indices for both Natural Questions and the KILT benchmark.

Retrieval

Command-line interface

To run prediction, launch the following command:

TOKENIZERS_PARALLELISM=false python -m seal.search \
    --topics_format dpr --topics input.json \
    --output_format dpr --output output.json \
    --checkpoint checkpoint.pt \
    --fm_index fm_index \
    --jobs 75 --progress --device cuda:0 --batch_size 20 \
    --beam 15

The script will generate the DPR prediction file output.json. The kilt format is also supported.
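For reference, a topics file can be written with a few lines of Python. This sketch assumes that the dpr format is a JSON list of objects with `question` and `answers` fields, as in the data files distributed with DPR; that layout is an assumption and should be checked against seal/search.py before relying on it.

```python
import json
import os
import tempfile

# Assumed DPR-style layout: one object per query, with the question text and
# a (possibly empty) list of gold answer strings.
topics = [
    {"question": "can you eat soup with a fork", "answers": []},
]

# Written to a temporary directory here; use your own path for input.json.
path = os.path.join(tempfile.mkdtemp(), "input.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(topics, f, indent=2)

print(path)
```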

The Searcher class

Our codebase relies on a pyserini-like searcher class that encapsulates both constrained decoding and retrieval. You can use it programmatically:

from seal import SEALSearcher

searcher = SEALSearcher.load('fm_index', 'checkpoint.pt')
searcher.include_keys = True

query = "can you eat soup with a fork"

for i, doc in enumerate(searcher.search(query, k=3)):
    print(i, doc.score, doc.docid, *doc.text(), sep='\t')
    print("Matched:")
    matched = sorted(doc.keys, reverse=True, key=lambda x:x[2])
    matched = matched[:5]
    for ngram, freq, score in matched:
        print("{:.1f}".format(score).zfill(5), freq, repr(ngram), sep='\t')

# 0	375.03041350768547	13796077	Chopsticks	are similar, finer points can differ from region to region. 
# In Cambodia, a fork and spoon are the typical utensils used in Cambodian dining and etiquette. Spoons are 
# used to scoop up food or water and the fork is there to help guide the food onto the spoon. Chopsticks 
# are normally used in noodle dishes such as the Kuy Tiev and soup dishes. When eating soup the chopsticks 
# will typically be paired with the spoon, where the chopsticks will pick up the food and the spoon will be 
# used to drink the broth. Forks are never to touch the mouth,
# Matched:
# 161.3	10	' eating soup'
# 059.5	9390	' fork'
# ...

Constrained decoding

Building the FM-index (CLI)

The most straightforward way to build the FM-index is to use the script we have provided in scripts/build_fm_index.py! You only need to put your retrieval corpus in a very simple TSV format, as in the following example:

doc1    Doc 1   This is a sample document
doc2    Doc 2   This is another sample document
doc3    Doc 3   And here you find the final one

Fields are:

  • document id
  • document title
  • text
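A corpus file in this layout can be produced with plain Python. The sketch below assumes that the three fields are separated by single tab characters and that tabs or newlines inside a field should be collapsed to spaces so that each document stays on one line; `to_tsv_line` is a hypothetical helper, not part of SEAL.

```python
def to_tsv_line(doc_id, title, text):
    """Format one document as 'id<TAB>title<TAB>text'."""
    fields = [doc_id, title, text]
    # Tabs or newlines inside a field would break the three-column layout,
    # so every run of whitespace is collapsed to a single space.
    cleaned = [" ".join(field.split()) for field in fields]
    return "\t".join(cleaned)

rows = [
    ("doc1", "Doc 1", "This is a sample document"),
    ("doc2", "Doc 2", "This is another sample document"),
    ("doc3", "Doc 3", "And here you find the final one"),
]

tsv = "\n".join(to_tsv_line(*row) for row in rows) + "\n"
print(tsv, end="")
```

The resulting string can be written to a file and passed to the build script as the input corpus.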

Then you can build the FM-index with:

FILE_I=res/sample/sample_corpus.tsv
FILE_O=res/sample/sample_corpus.fm_index

python scripts/data/build_fm_index.py \
    "$FILE_I" "$FILE_O" \
    --hf_model facebook/bart-large \
    --jobs 40 --include_title

The parameter --jobs only speeds up the tokenization at the moment. --include_title only makes sense if your retrieval corpus has non-empty titles.

Building the FM-index (Python)

from seal import FMIndex
from transformers import AutoTokenizer

corpus = [
    "Doc 1 @@ This is a sample document",
    "Doc 2 @@ This is another sample document",
    "Doc 3 @@ And here you find the final one",
]
labels = ['doc1', 'doc2', 'doc3']

tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large')
def preprocess(doc):
    doc = ' ' + doc
    doc = tokenizer(doc, add_special_tokens=False)['input_ids']
    doc += [tokenizer.eos_token_id]
    return doc

corpus_tokenized = [preprocess(doc) for doc in corpus]

index = FMIndex()
index.initialize(corpus_tokenized, in_memory=True)
index.labels = labels

index.save('res/sample/sample_corpus.fm_index')
# writes res/sample/sample_corpus.fm_index.fmi
# writes res/sample/sample_corpus.fm_index.oth

index = FMIndex.load('res/sample/sample_corpus.fm_index')

Check out seal/fm_index.py!
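If your corpus is already in the TSV layout described for the CLI, the `corpus` and `labels` lists used above can be derived from it. This is a sketch under assumptions: titles and text are joined with " @@ " as in the sample strings above, and documents with an empty title are indexed by text alone. Whether scripts/data/build_fm_index.py does exactly this is not confirmed here, and `load_corpus` is a hypothetical helper.

```python
def load_corpus(tsv_text, include_title=True):
    """Turn 'id<TAB>title<TAB>text' lines into parallel corpus and label lists."""
    corpus, labels = [], []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        doc_id, title, text = line.split("\t", 2)
        labels.append(doc_id)
        if include_title and title:
            # Assumed separator, copied from the sample corpus strings.
            corpus.append(f"{title} @@ {text}")
        else:
            corpus.append(text)
    return corpus, labels

tsv_text = (
    "doc1\tDoc 1\tThis is a sample document\n"
    "doc2\tDoc 2\tThis is another sample document\n"
    "doc3\tDoc 3\tAnd here you find the final one\n"
)
corpus, labels = load_corpus(tsv_text)
print(labels)
print(corpus[0])
```

The two lists can then be passed through `preprocess` and `index.initialize` exactly as in the example above.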

Decoding with the FM-index

You can easily plug our constrained decoding code into your project by using the fm_index_generate function. In the following snippet we show a use case beyond retrieval: paraphrase mining.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from seal import fm_index_generate, FMIndex

tokenizer = AutoTokenizer.from_pretrained('tuner007/pegasus_paraphrase')
model = AutoModelForSeq2SeqLM.from_pretrained('tuner007/pegasus_paraphrase')

# building the corpus from a single long string
corpus = " ".join("""
They also were found to have perfectly coiffed hair, and wore what appeared to be Dior makeup. 
“We were shocked to discover the unicorns,” said anthropologist Daniel St. Maurice. “They were 
like nothing we had ever seen before. We had heard legends of the unicorns, but never thought 
they actually existed.” When the scientists first arrived in the valley, the unicorns were 
surprised and startled by the presence of humans, but were also excited. The unicorns welcomed 
the researchers and explained that they had been waiting for them for a very long time. “The 
unicorns said that they had been waiting for us for a very long time,” said Dr. St. Maurice. 
“They said they had always known that humans would eventually discover them, but that they had 
also always known that humans would be too stupid to realize the unicorns had been waiting for 
them.”
""".split()).strip()
corpus = tokenizer(' ' + corpus, add_special_tokens=False)['input_ids'] + [tokenizer.eos_token_id]
index = FMIndex()
index.initialize([corpus], in_memory=True)

# constrained generation
query = " ".join("""
The unicorns greeted the scientists, explaining that they had been expecting the encounter for
a while.'
”""".split()).strip()
out = fm_index_generate(
    model, index,
    **tokenizer([' ' + query], return_tensors='pt'),
    keep_history=False,
    transformers_output=True,
    always_allow_eos=True,
    max_length=100,
)
print(tokenizer.decode(out[0], skip_special_tokens=True).strip())
# unicorns welcomed the researchers and explained that they had been waiting for them for a very long time.

Licence

SEAL is licensed under the CC-BY-NC 4.0 license. The text of the license can be found here.

seal's People

Contributors

fabiopetroni, mbevila


seal's Issues

Retrieval process

Hi,
The retrieval section of the README says "To run prediction, launch the command: .... --topics_format dpr --topics input.json ....".
What's input.json here? Is it the NQ Test/KILT dev/ KILT test downstream task?

Thank you!

How to reproduce the training process?

Hi,

Thanks for the great work!
I only found the inference part in the code. Could you share the training code and training data (or a script to construct the training data)?

Evaluation scripts

Hi, can you release the evaluation code? I want to know how to get retrieval result metrics.

Confused about the time complexity?

Thanks for your great work!
In your paper, you say "The FM-index can be used to count the frequency of any sequence of tokens n in O(|n|log|V|)".
But according to Wikipedia's explanation, the count operation in the FM-index can be done in O(|n|) time for a pattern with n items. I don't understand why there is a log|V| factor in your result.
Similarly, I don't understand why "the list of possible token successors can be obtained in O(|V|log|V|)". Why does the time depend on V? Could you give a short explanation? I would really appreciate it!

Steps to reproduce LM+FM result in Table 3

Thanks for your great work!

Using the checkpoints you released, I am able to get the result of SEAL (LM+FM, intersective) in Table 3. However, I am wondering how to obtain the result of SEAL (LM, |n| = 5) and SEAL (LM+FM) in the table. There are several flags in SEALSearcher in retrieval.py. Which ones should be set?

In addition, I notice that in retrieval.py, the found_keys are always rescored by rk.rescore_keys. May I know what this function does? I can't seem to find it in the paper.

Thanks in advance!

Generation failure, TypeError: ones(): argument 'size' must be tuple of ints, but found element of type Tensor at pos 1

Hi Team,

I am able to install and import the SEAL modules. Also able to create the index, and load it with the checkpoint.

However, when I try to perform any search, it gives the following error -
I reckon it might be occurring due to torch and transformer version incompatibility. I am using torch - 1.11.0 and transformers - 4.13.0

I am planning to debug the beam_search script if required, please let me know if I shall try something else instead.

Thanks,
Pranav

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-722411077b17> in <module>
----> 1 searcher.search("how to find my account number", k=3)

projects/seal-exp/SEAL/seal/retrieval.py in search(self, query, k, added_documents, detokenize)
    645         if added_documents is not None:
    646             added_documents = [added_documents]
--> 647         return self.batch_search([query], k=k, added_documents=added_documents, detokenize=True)[0]
    648 
    649     def batch_search(self, queries, k: int = 10, added_documents=None, detokenize=None) -> List[List[SEALDocument]]:

projects/seal-exp/SEAL/seal/retrieval.py in batch_search(self, queries, k, added_documents, detokenize)
    658                 keys = ((kk, None, added_documents[i]) for i, kk in enumerate(keys))
    659 
--> 660         results, keys = zip(*self.batch_retrieve_from_keys(keys))
    661 
    662         keys = list({k for kk in keys for k in kk})

projects/seal-exp/SEAL/seal/retrieval.py in batch_retrieve_from_keys(self, keys)
    758             yield from self._mp_batch_retrieve_from_keys(keys)
    759         else:
--> 760             yield from self._batch_retrieve_from_keys(keys)
    761 
    762     def _mp_batch_retrieve_from_keys(self, keys):

projects/seal-exp/SEAL/seal/retrieval.py in _batch_retrieve_from_keys(self, keys)
    799                 disable=not self.progress
    800             )
--> 801         for i, kk in enumerate(keys):
    802             if self.print_n_doc:
    803                 print(i)

~/.conda/envs/answer_engine_env/lib/python3.8/site-packages/tqdm/std.py in __iter__(self)
   1164         # (note: keep this check outside the loop for performance)
   1165         if self.disable:
-> 1166             for obj in iterable:
   1167                 yield obj
   1168             return

projects/seal-exp/SEAL/seal/retrieval.py in batch_generate_keys(searcher, queries, constrained_generation)
    308         batches = ichunked(queries, searcher.batch_size)
    309         for batch in batches:
--> 310             for instance in process_batch(batch):
    311                 bar.update()
    312                 yield instance

projects/seal-exp/SEAL/seal/retrieval.py in process_batch(inputs)
     68             batch = {k: v.to(searcher.device) for k, v in batch.items()}
     69 
---> 70             found_keys = fm_index_generate(
     71                 searcher.bart_model, searcher.fm_index,
     72                 **batch,

~/.conda/envs/answer_engine_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     25         def decorate_context(*args, **kwargs):
     26             with self.clone():
---> 27                 return func(*args, **kwargs)
     28         return cast(F, decorate_context)
     29 

projects/seal-exp/SEAL/seal/beam_search.py in fm_index_generate(model, index, input_ids, attention_mask, min_length, max_length, length_penalty, num_beams, diverse_bs_groups, diverse_bs_penalty, eos_token_id, force_decoding_from, always_allow_eos, keep_history, disable_fm_index, sample, stop_at_count, topk, transformers_output, **kwargs)
    482     model_kwargs['use_cache'] = True
    483 
--> 484     decoder_input_ids = model._prepare_decoder_input_ids_for_generation(
    485         input_ids,
    486         decoder_start_token_id=model.config.decoder_start_token_id,

~/.conda/envs/answer_engine_env/lib/python3.8/site-packages/transformers/generation_utils.py in _prepare_decoder_input_ids_for_generation(self, batch_size, decoder_start_token_id, bos_token_id)
    433         decoder_start_token_id = self._get_decoder_start_token_id(decoder_start_token_id, bos_token_id)
    434 
--> 435         decoder_input_ids = torch.ones((batch_size, 1), dtype=torch.long, device=self.device) * decoder_start_token_id
    436         return decoder_input_ids
    437 

TypeError: ones(): argument 'size' must be tuple of ints, but found element of type Tensor at pos 1

Settings on NQ320K

Hi, I'm wondering what the settings of SEAL on the NQ320K dataset are. (Originally I thought it was using the DPR embeddings, forgive me if I'm wrong, but it seems like DPR doesn't have results on the NQ320K dataset.)
Thank you very much!

Failure in SEAL pip installation

Hi,

I am facing issues in pip installation.

Initially, due to the git permission issue, I could not clone the repo in a recursive manner so I manually put the SDSL-lite (commit - c32874cb2d8524119f25f3b501526fe692df29f4 ) in res/external dir.

After that, I installed the necessary requirements, and I have torch 1.10 installed; However, I am not able to install SEAL from the source.

pip install -e . gives me the following error -

Obtaining projects/seal-exp/SEAL
Installing collected packages: SEAL
  Running setup.py develop for SEAL
    ERROR: Command errored out with exit status 1:
     command: /home/pranav/.conda/envs/answer_engine_env/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'projects/seal-exp/SEAL/setup.py'"'"'; __file__='"'"projects/seal-exp/SEAL/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps
         cwd: projects/seal-exp/SEAL/
    Complete output (12 lines):
    running develop
    running egg_info
    writing SEAL.egg-info/PKG-INFO
    writing dependency_links to SEAL.egg-info/dependency_links.txt
    writing top-level names to SEAL.egg-info/top_level.txt
    reading manifest file 'SEAL.egg-info/SOURCES.txt'
    writing manifest file 'SEAL.egg-info/SOURCES.txt'
    running build_ext
    building 'seal.cpp_modules._fm_index' extension
    swigging seal/cpp_modules/fm_index.i to seal/cpp_modules/fm_index_wrap.cpp
    swig -python -I../include -c++ -o seal/cpp_modules/fm_index_wrap.cpp seal/cpp_modules/fm_index.i
    error: command 'swig' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /home/pranav/.conda/envs/answer_engine_env/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'projects/seal-exp/SEAL/setup.py'"'"'; __file__='"'"projects/seal-exp/SEAL/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' develop --no-deps Check the logs for full command output.

Any idea on how to resolve this?

Thanks,
Pranav

Unable to load checkpoints.pt

I pass an fm_index and the KILT checkpoints downloaded from this repo as arguments to the load method of the SEALSearcher class, but it is unable to read them correctly.

Page-level retrieval KILT and KILT scores

Thanks for your excellent work; I'm confused about page-level retrieval details in the KILT test. Does it do the same thing as passage-level? How can page-level retrieval results be used with FiD for KILT scores evaluation?

Preprocessing KILT

Hi, nice work!

There is a 'kb' parameter in make_supervised_kilt_dataset.py. How do I get the kb file?
