castorini / birch
Document ranking via sentence modeling using BERT
Hi,
Thank you for your nice work!
I took a look at the project but did not find test collections from the TREC Microblog Tracks (Lin et al., 2014) from 2011 to 2014, which were used to fine-tune BERT as described in your paper.
Could you please kindly let me know where I could find the collections?
Best,
Yumo
When training BERT to get a query-document score, are you training at the sentence level or the document level? If sentence level, what is the label for each example, and how do you choose the best BERT model with the dev set?
Looking forward to your reply!
Dear Sir/Madam,
According to the paper "Applying BERT to Document Retrieval with Birch", you made Google Colab notebooks to run Birch. Can I use them? Where are they?
Regards
Throw away all sentences that don't have at least one term matching the query? Other pruning scenarios?
Basically, what the title says...
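One simple baseline along these lines: keep only sentences that share at least one term with the query. A minimal sketch (the function name, whitespace tokenization, and overlap threshold are illustrative assumptions, not from the repo):

```python
def prune_sentences(sentences, query, min_overlap=1):
    """Keep only sentences sharing at least `min_overlap` terms with the query.

    Uses naive lowercase whitespace tokenization; a real pipeline would
    apply the same analyzer as the index.
    """
    query_terms = set(query.lower().split())
    return [s for s in sentences
            if len(query_terms & set(s.lower().split())) >= min_overlap]

# Example: only the first sentence mentions a query term.
kept = prune_sentences(["the cat sat on the mat", "dogs bark loudly"],
                       "cat videos")
```

Raising `min_overlap` would give a stricter pruning scenario; another variant would be to weight terms by IDF instead of counting raw overlap.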
Hi,
Thanks again for your nice work!
I am quite interested in the QA model that was trained on the data described in your arXiv paper, namely
the union of the TrecQA (Yao et al., 2013) and WikiQA (Yang et al., 2015) datasets.
Since I am now also trying to train a similar model and have several minor questions, I would really appreciate it if you could kindly clarify them for me:
Best,
Yumo
I put Anserini and Birch in the same directory and ran "./train.sh mb 5" in the shell. However, it returns the error "FileNotFoundError: [Errno 2] No such file or directory: 'robust04_rm3_5cv_sent_fields.txt'", and I can't find robust04_rm3_5cv_sent_fields.txt anywhere either.
Thanks a lot for open sourcing birch. What is the license for the code in this repository?
I can't find the sample notebooks in this birch repository.
What is the index_path?
Which file generates it?
Could it be generated by Pyserini?
Downloading emnlp_bert4ir_v2 is too slow. Could you upload it to Google Drive?
I'd like to get some performance figures on latency on a CPU - queries per second, latency for each individual BERT inference, etc.
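A rough harness for collecting those numbers could look like the following (a sketch; `infer` stands in for whatever per-query BERT inference call ends up in the repo, and the warmup count is an arbitrary assumption):

```python
import time

def measure_latency(infer, queries, warmup=2):
    """Time `infer` over `queries`; return (mean latency in s, queries/sec).

    A few warmup calls are run first so lazy initialization and caches
    don't distort the measurement.
    """
    for q in queries[:warmup]:
        infer(q)
    start = time.perf_counter()
    latencies = []
    for q in queries:
        t0 = time.perf_counter()
        infer(q)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    return sum(latencies) / len(latencies), len(queries) / total

# Example with a dummy inference function standing in for BERT.
mean_latency, qps = measure_latency(lambda q: q.upper(), ["a", "b", "c"])
```

Per-sentence BERT inference times could be collected the same way by passing the sentence-scoring call as `infer`.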
We should be able to run ranking end-to-end, so we should fold BERT inference code into this repo.
I followed the README exactly, but there was an error... Could you tell me how to fix it? Thanks!
```
Running eval/trec_eval.9.0.4/trec_eval data/qrels/qrels.mb.txt data/predictions/predict.tmp -m map -m P.20 -m ndcg_cut.20
Traceback (most recent call last):
  File "src/main.py", line 92, in <module>
    main()
  File "src/main.py", line 32, in main
    train(args)
  File "/home/castil/xueee/birch-master/src/model/train.py", line 53, in train
    best_score = eval_select(args, model, tokenizer, validate_dataset, args.model_path, best_score, epoch)
  File "/home/castil/xueee/birch-master/src/model/test.py", line 9, in eval_select
    scores_dev = test(args, split='dev', model=model, test_dataset=validate_dataset)
  File "/home/castil/xueee/birch-master/src/model/test.py", line 87, in test
    qrels_file=os.path.join(args.data_path, 'qrels', 'qrels.{}.txt'.format(args.collection)))
  File "/home/castil/xueee/birch-master/src/model/eval.py", line 14, in evaluate
    map = float(lines[0].strip().split()[-1])
IndexError: list index out of range
```
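For what it's worth, this `IndexError` usually means `trec_eval` produced no output at all (binary not built, or bad run/qrels paths), so `lines[0]` does not exist. A defensive parser sketch (assuming the standard three-column `metric qid value` stdout of trec_eval; the function name and error message are my own):

```python
def parse_trec_eval(output):
    """Extract metric values from trec_eval stdout lines like 'map all 0.3021'."""
    metrics = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3:
            metric, _, value = parts
            try:
                metrics[metric] = float(value)
            except ValueError:
                pass  # skip non-numeric rows such as runid
    if not metrics:
        raise RuntimeError("trec_eval produced no metrics -- check that the "
                           "binary is built and the run/qrels paths exist")
    return metrics
```

Failing with an explicit message instead of an `IndexError` makes the real cause (a missing or failing trec_eval invocation) much easier to spot.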
A small difference (i.e., in the third decimal place) in floating-point sentence scores leads to a ~0.1-0.5 difference in AP. We also observe this in hyperparameter fine-tuning (currently addressed by picking the smaller numbers).
Relevant: https://cs.uwaterloo.ca/~jimmylin/publications/Lin_Yang_SIGIR2019.pdf
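The sensitivity is easy to reproduce in isolation: with few relevant documents, a third-decimal score difference that swaps two adjacent documents can move AP a lot. A self-contained sketch (toy scores and labels, not real run data):

```python
def average_precision(ranked_rels):
    """AP for a ranked list of 0/1 relevance labels."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

def rank_labels(scores, labels):
    """Sort labels by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return [labels[i] for i in order]

labels = [1, 0, 0]                                   # only doc 0 is relevant
ap_a = average_precision(rank_labels([0.501, 0.500, 0.1], labels))  # 1.0
ap_b = average_precision(rank_labels([0.500, 0.501, 0.1], labels))  # 0.5
```

A 0.001 perturbation flips two documents and halves AP here, which is the same mechanism behind the run-to-run variance described above.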
Hi, thanks for this awesome work.
The first question is about document retrieval.
I'd like to use birch as a tool so I can retrieve relevant documents given a query.
I am not clear how to use birch to achieve this after reading the README.
Can I have more instructions? Thank you!
The second is about sentence retrieval.
How do I use birch for sentence selection? Like what Figure 2 describes in 'Applying BERT to Document Retrieval with Birch', can I get the most relevant sentences in a document, given a query?
In interactive mode we're supposed to pass the argument --interactive_name. What is that supposed to be?
I know @emmileaf is working on this, but I just wanted to have explicit documentation. Let's make sure we can replicate exactly the results in https://arxiv.org/abs/1903.10972
This would be a critical blocker to getting the birch image ready for OSIRRC.
Add instructions for replicating results in: https://arxiv.org/abs/1903.10972
In the README, can I have a snippet for playing with BERT interactively? I.e., fire up Python interpreter, load model, issue a query and a sentence. Should be just a few lines, right?
Hi, congrats on this great work!
As a new user without much experience in the IR research field, please don't mind if I have some naive questions. For example:
```
python src/utils/split_docs.py --collection <robust04, core17, core18> \
    --index <path/to/index> --data_path data --anserini_path <path/to/anserini/root>
```
What does `index` here mean, a document index? If I want to use it on my own dataset, what kind of value should I put here?
For `data` (path), if I want to use my own dataset, what format should it be like? What should the data look like?
`anserini_path` should be the anserini folder path after I execute `!git clone https://github.com/castorini/anserini.git` and `!cd anserini && mvn clean package appassembler:assemble`, right?
Thanks for answering my questions!
Could you provide the 'robust04_cv.py' script for us?
Hi thanks for sharing birch!
I am trying to predict relevant sentences using the 'saved.msmarco_mb_1' model. One thing I am curious about is the prediction score I get from 'predictions = model(tokens_tensor, segments_tensor, mask_tensor)'. The values in each tuple in 'predictions' do not sum to 1. Is it supposed to be a binary classification score?
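For context, if the head is a two-class classifier, the raw outputs are logits, not probabilities, so they will not sum to 1; applying a softmax converts them. A minimal sketch, independent of the actual model code (the example logit values are made up):

```python
import math

def softmax(logits):
    """Convert raw class logits to probabilities that sum to 1."""
    m = max(logits)                         # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Example: a raw (non-relevant, relevant) logit pair from a classifier head.
probs = softmax([-1.3, 2.7])
relevance = probs[1]                        # probability of the "relevant" class
```

So the two raw values per example are best read as unnormalized class scores; softmax (or just comparing which logit is larger) gives the binary decision.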
Hi, thanks for your effort in providing this code.
I couldn't figure out how to use your code to get the embedding of a textual document (with thousands of words). Is it possible to do that with your framework?
Thanks