castorini / buboqa
Simple question answering over knowledge graphs (Mohammed et al., NAACL 2018)
Hey, I am having an issue. Can I add my own instances directly to processed_simplequestions_dataset/test.txt, or do I need to put the new instances into annotated_fb_data_test? Whenever I try to change test.txt, I get an error saying the Example object has no attribute 'text'.
Since we've identified entity linking as a problem we need to examine in isolation, we should generate a specific dataset for it. Let's use this thread to discuss how best to do that.
My suggestion is to just augment the original fb dataset with new fields. So, annotated_fb_data_train.txt:
www.freebase.com/m/04whkz5 www.freebase.com/book/written_work/subjects www.freebase.com/m/01cj3p what is the book e about
www.freebase.com/m/0tp2p24 www.freebase.com/music/release_track/release www.freebase.com/m/0sjc7c1 to what release does the release track cardiac arrest come from
...
Tack on the following column, from the official alias:
e
cardiac arrest
In most cases, the alias should be a proper substring of the question, but there might be cases in which it doesn't match exactly.
My suggestion is to write a script that generates this file and include the Python script in the repo.
Thoughts @salman1993 and @Impavidity ?
In scripts/util.py, www2fb() performs some manual corrections for MIDs.
Can you explain a little why you have to do that?
Hi everyone:
Did you try to re-implement the paper "Simple Question Answering by Attentive Convolutional Neural Network"?
Thanks
Hi @salman1993, can you build an iPython notebook for the e2e system? I want to put in "when was albert einstein born" and get back something like [mid, 'albert einstein', rel].
Hi, I'm really thankful to you for this paper and code.
I'm following what you wrote,
and I don't know why, but there's a "core dumped" error on "sudo docker build -t buboqa .".
Can I get some hints or tips for solving this problem?
My computer:
*-cpu
description: CPU
product: AMD Ryzen 5 1600 Six-Core Processor
vendor: Advanced Micro Devices [AMD]
physical id: 11
bus info: cpu@0
version: AMD Ryzen 5 1600 Six-Core Processor
serial: Unknown
slot: AM4
size: 1374MHz
capacity: 3700MHz
width: 64 bits
clock: 100MHz
*-memory
description: System Memory
physical id: c
slot: System board or motherboard
size: 16GiB
@salman1993 IIRC, your code builds various indexes on the kg, but you don't actually index the graph itself. So if the question were "when was albert einstein born" and we get back something like [mid, 'albert einstein', rel], we still need some functionality to actually look up the triple from the kg, right?
We should probably use some sort of DB? Poking around, there are some options:
Perhaps think about storing the index structures in the same db?
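For concreteness, here is a minimal sketch of one such option: the kg triples in SQLite with a (subject, predicate) index, so the lookup from a predicted [mid, rel] pair is a single query. The schema, file name, and the fb:book.written_work.subjects spelling of the relation are assumptions for illustration, not anything already in the repo:
import sqlite3

# Assumed schema: one row per (subject, predicate, object) triple.
conn = sqlite3.connect('kg.db')
conn.execute('CREATE TABLE IF NOT EXISTS triples (subject TEXT, predicate TEXT, object TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_sp ON triples (subject, predicate)')
conn.execute('INSERT INTO triples VALUES (?, ?, ?)',
             ('fb:m.04whkz5', 'fb:book.written_work.subjects', 'fb:m.01cj3p'))
conn.commit()
# Look up the object for a (mid, relation) pair produced by the pipeline.
print(conn.execute('SELECT object FROM triples WHERE subject = ? AND predicate = ?',
                   ('fb:m.04whkz5', 'fb:book.written_work.subjects')).fetchone())
The index structures mentioned above could live in additional tables of the same database file.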
In entity detection text preprocessing, give me a function called get_query_text(question) that takes in the text of one question and returns the query text, e.g.:
"where was sasha vujacic born?" => "sasha vujacic"
When pickling the dictionary, using
"if names.get(entity) is None:"
instead of
"if entity not in names.keys():"
would speed up the search.
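To illustrate the difference (a sketch; `names` and `entity` are placeholders): in Python 2, names.keys() builds a list, so the `in` test scans it linearly, while dict.get is a single hash lookup.
names = {'fb:m.04whkz5': 'e'}
entity = 'fb:m.0tp2p24'
if names.get(entity) is None:   # O(1) hash lookup, no key list built
    names[entity] = 'cardiac arrest'
# Note: "entity not in names" (without .keys()) is equally fast and idiomatic.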
All the files go through this manual correction - why is this manual correction needed?
# Manual correction
if in_str == 'fb:m.07s9rl0':
    in_str = 'fb:m.02822'
if in_str == 'fb:m.0bb56b6':
    in_str = 'fb:m.0dn0r'
if in_str == 'fb:m.01g81dw':
    in_str = 'fb:m.01g_bfh'
if in_str == 'fb:m.0y7q89y':
    in_str = 'fb:m.0wrt1c5'
if in_str == 'fb:m.0b0w7':
    in_str = 'fb:m.0fq0s89'
Let's run a logistic regression baseline for relation prediction. We can either use something like Mallet, or I suppose we can use one-hot vectors with a softmax?
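A minimal sketch of what such a baseline could look like, here using scikit-learn bag-of-words features as a stand-in for Mallet or explicit one-hot vectors (the library choice and the two sample rows are assumptions for illustration):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

questions = ["what is the book e about",
             "to what release does the release track cardiac arrest come from"]
relations = ["book/written_work/subjects", "music/release_track/release"]

vectorizer = CountVectorizer()               # sparse one-hot-style word features
X = vectorizer.fit_transform(questions)
clf = LogisticRegression(max_iter=1000).fit(X, relations)
print(clf.predict(vectorizer.transform(["what is the book e about"])))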
Let's run an entity detection baseline using Stanford NER straight out of the box.
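Something like the following could work via NLTK's wrapper (a sketch; the paths are assumptions and should point at the stanford-ner-2017-06-09 download mentioned later in this thread):
from nltk.tag import StanfordNERTagger

tagger = StanfordNERTagger(
    'stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz',
    'stanford-ner-2017-06-09/stanford-ner.jar')
# Tag a tokenized question; spans tagged PERSON/LOCATION/etc. are entity candidates.
print(tagger.tag("where was sasha vujacic born ?".split()))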
How do you get FB5M.name.txt? Is it derived from freebase-FB5M.txt? I can't find the relation fb:type.object.name there. The names in this file seem to have been processed by a word-segmentation tool, but I want to replace the entities in the SimpleQuestions dataset.
@Impavidity I've given you access to https://git.uwaterloo.ca/jimmylin/BuboQA-data
Currently, we're downloading datasets from web sources... Let's move to a fixed repo?
I noticed that the setup script performs some post-processing - can you make sure to distinguish the "raw" (original, downloaded) version from the post-processed version?
If we load in the FB knowledge graph as a DataFrame, we can interactively explore it and play around.
@salman1993 please set up a "stub" iPython notebook that contains the boilerplate of loading everything?
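A stub for that loading boilerplate might start with something like this (the file name and the three-column tab-separated layout are assumptions about the Freebase subset):
import pandas as pd

# Load the kg triples into a DataFrame for interactive exploration.
kg = pd.read_csv('freebase-FB5M.txt', sep='\t',
                 names=['subject', 'predicate', 'object'])
print(kg[kg['subject'] == 'fb:m.04whkz5'])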
Currently, everything is in the ferhan_simple_qa_rnn/ directory. It would make sense to move data-processing scripts etc. to the top level - you might want to reorganize the directory structure while you're at it. A suggestion might be:
relation_prediction/
relation_prediction/tj_rnn/ (tj for Ture and Jojic)
relation_prediction/lr_baseline/
Ks_2sampResult(statistic=1.0, pvalue=2.1646881714606301e-23)
Ks_2sampResult(statistic=0.95999999999999996, pvalue=1.3674173963612658e-21)
Ks_2sampResult(statistic=0.80000000000000004, pvalue=4.0088870352288605e-15)
Ks_2sampResult(statistic=0.92000000000000004, pvalue=7.2931782229598809e-20)
Ks_2sampResult(statistic=0.58000000000000007, pvalue=3.761754930361375e-08)
Ks_2sampResult(statistic=0.76000000000000001, pvalue=1.0866243200614003e-13)
Ks_2sampResult(statistic=0.65999999999999992, pvalue=1.9824376695414084e-10)
Ks_2sampResult(statistic=0.83999999999999997, pvalue=1.2487577181530575e-16)
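For context, these look like outputs of SciPy's two-sample Kolmogorov-Smirnov test; a minimal sketch of how such a result is produced (the sample data is illustrative):
import numpy as np
from scipy.stats import ks_2samp

a = np.random.normal(0.0, 1.0, size=50)
b = np.random.normal(1.0, 1.0, size=50)
print(ks_2samp(a, b))   # => Ks_2sampResult(statistic=..., pvalue=...)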
@salman1993 there's a lot of junk checked into the repo. Try running something like this:
while read -r largefile; do
    echo $largefile | awk '{printf "%s %s ", $1, $3 ; system("git rev-list --all --objects | grep " $1 " | cut -d \" \" -f 2-")}'
done <<< "$(git rev-list --all --objects | awk '{print $1}' | git cat-file --batch-check | sort -k3nr | head -n 20)"
Also see: "How to find out which files take up the most space in a git repo?"
Let's clear out the junk... This will require rewriting history and force-pushing, so let's do it first, before the directory reorg?
I am using torch 0.4 and torchtext 0.2.3. When I load the train, dev, and test data, I find that TEXT.vocab.num is 12. How can I fix this issue?
When I run entity_linking.py, I am having trouble with the data:
FileNotFoundError: [Errno 2] No such file or directory: '../entity_detection/crf/query_text/query.valid'
Then I ran sh auto_run.sh, which downloads https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip.
I find the data differs from that in the entity detection part - why don't you use the same data?
Hey,
Thank you for sharing the great work! Would it be possible to add a license to the code? And maybe also for the preprocessed data?
Thank you!
We need to upgrade the code base to PyTorch 0.4.
One of the obvious weaknesses of the current approach is ambiguity in the entities - e.g., there are many "Einsteins" to choose from.
It would be nice if we could assign some sort of "prior" to the entities in the kg that reflects their importance. Two simple approaches off the top of my head:
I'd like to see a complete e2e demo where I can type in a textual question 'when was albert einstein born' and get back the answer 'March 14, 1879', showing intermediate results along the way (entity detection, entity linking, relation prediction).
Ideally I'd like this all wrapped up in a REST endpoint, but I'm open to discussion.
This is blocked by #3
@aliceranzhou can you please take the lead on this? Coordinate w/ @salman1993 re: #2 to help you out.
I am interested in running your entity detection, and I am having trouble with the dataset.
I followed the process below, and I got the error "FileNotFoundError: [Errno 2] No such file or directory: '../../data/processed_simplequestions_dataset/train.txt'"!
Process:
How can I get "/processed_simplequestions_dataset/train.txt"?
Hi,
Just want to let you know that the link to the file "FB5M.name.txt" is broken. The following line in scripts/fetch_dataset.sh is saving a 404 page to the txt file:
wget https://www.dropbox.com/s/yqbesl07hsw297w/FB5M.name.txt
We should also build a CNN relation prediction baseline - maybe use Kim CNN as a start?
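A minimal PyTorch sketch of a Kim (2014)-style CNN classifier over relation labels (layer sizes and names are illustrative assumptions, not code from the repo):
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_relations,
                 num_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per kernel width, max-pooled over time.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_relations)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)    # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2)[0] for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # relation logits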
I met the following error for entity_detection:
$ python train.py --entity_detection_mode LSTM --fix_embed
Note: You are using GPU for training
Traceback (most recent call last):
File "train.py", line 36, in <module>
train, dev, test = SQdataset.splits(TEXT, ED, args.data_dir)
File "BuboQA-master/entity_detection/nn/sq_entity_dataset.py", line 10, in splits
('obj', None), ('text', text_field), ('ed', label_field)]
File "anaconda2/envs/pytorch/lib/python3.6/site-packages/torchtext/data/dataset.py", line 76, in splits
os.path.join(path, train), **kwargs)
File "anaconda2/envs/pytorch/lib/python3.6/site-packages/torchtext/data/dataset.py", line 235, in __init__
examples = [make_example(line, fields) for line in reader]
File "anaconda2/envs/pytorch/lib/python3.6/site-packages/torchtext/data/dataset.py", line 235, in <listcomp>
examples = [make_example(line, fields) for line in reader]
File "anaconda2/envs/pytorch/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 4585: invalid continuation byte
My torchtext version is 0.2.3; I'm not sure if that's why I'm running into this problem.
Thanks!
In augment_process_dataset.py, some questions in the test dataset are skipped. It is unfair to use such results for comparison with the state of the art; nothing should be changed in the test data.