castorini / buboqa
Simple question answering over knowledge graphs (Mohammed et al., NAACL 2018)
Hey, I am having an issue. Can I add my own instances directly to processed_simplequestions_dataset/test.txt, or do I need to put the new instances into annotated_fb_data_test? Whenever I try to change test.txt, I get an error saying the Example object has no attribute 'text'.
Since we've identified entity linking as a problem we need to examine in isolation, we should generate a specific dataset for it. Let's use this thread to discuss how best to do that.
My suggestion is to just augment the original fb dataset with new fields. So, annotated_fb_data_train.txt:
www.freebase.com/m/04whkz5 www.freebase.com/book/written_work/subjects www.freebase.com/m/01cj3p what is the book e about
www.freebase.com/m/0tp2p24 www.freebase.com/music/release_track/release www.freebase.com/m/0sjc7c1 to what release does the release track cardiac arrest come from
...
Tack on the following column, from the official alias:
e
cardiac arrest
In most cases, the alias should be a proper substring of the question, but there might be cases in which it doesn't match exactly.
My suggestion is to write a script that generates this file and include the Python script in the repo.
Thoughts @salman1993 and @Impavidity ?
In scripts/util.py, www2fb() performs some manual corrections for MIDs.
Can you explain a little why you have to do that?
Hi everyone:
Did you try to re-implement the paper "Simple Question Answering by Attentive Convolutional Neural Network"?
Thanks
Hi @salman1993, can you build an iPython notebook for the e2e system? I want to put in "when was albert einstein born" and get back something like [mid, 'albert einstein', rel].
Hi, I'm really thankful to you for this paper and code.
I'm following what you wrote,
and I don't know why, but there's a "core dumped" error on "sudo docker build -t buboqa .".
Can I get some hints or tips for solving this problem?
My computer:
*-cpu
description: CPU
product: AMD Ryzen 5 1600 Six-Core Processor
vendor: Advanced Micro Devices [AMD]
physical id: 11
bus info: cpu@0
version: AMD Ryzen 5 1600 Six-Core Processor
serial: Unknown
slot: AM4
size: 1374MHz
capacity: 3700MHz
width: 64 bits
clock: 100MHz
*-memory
description: System Memory
physical id: c
slot: System board or motherboard
size: 16GiB
@salman1993 IIRC, your code builds various indexes on the kg, but you don't actually index the graph itself. So if the question were "when was albert einstein born" and we get back something like [mid, 'albert einstein', rel], we still need some functionality to actually look up the triple from the kg, right?
We should probably use some sort of DB? Poking around, there are some options:
Perhaps think about storing the index structures in the same db?
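For concreteness, here is a minimal sketch of one such option: the kg triples in SQLite with a (subject, predicate) index, so the lookup from a predicted [mid, rel] pair is a single query. The schema, file name, and the fb:book.written_work.subjects spelling of the relation are assumptions for illustration, not anything already in the repo:
import sqlite3

# Assumed schema: one row per (subject, predicate, object) triple.
conn = sqlite3.connect('kg.db')
conn.execute('CREATE TABLE IF NOT EXISTS triples (subject TEXT, predicate TEXT, object TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_sp ON triples (subject, predicate)')
conn.execute('INSERT INTO triples VALUES (?, ?, ?)',
             ('fb:m.04whkz5', 'fb:book.written_work.subjects', 'fb:m.01cj3p'))
conn.commit()
# Look up the object for a (mid, relation) pair produced by the pipeline.
print(conn.execute('SELECT object FROM triples WHERE subject = ? AND predicate = ?',
                   ('fb:m.04whkz5', 'fb:book.written_work.subjects')).fetchone())
The index structures mentioned above could live in additional tables of the same database file.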
In entity detection text preprocessing, give me a function called get_query_text(question) that takes in the text of one question and returns the query text, e.g.:
"where was sasha vujacic born?" => "sasha vujacic"
When pickling the dictionary, using
"if names.get(entity) is None:"
instead of
"if entity not in names.keys():"
would speed up the search.
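To illustrate the difference (a sketch; `names` and `entity` are placeholders): in Python 2, names.keys() builds a list, so the `in` test scans it linearly, while dict.get is a single hash lookup.
names = {'fb:m.04whkz5': 'e'}
entity = 'fb:m.0tp2p24'
if names.get(entity) is None:   # O(1) hash lookup, no key list built
    names[entity] = 'cardiac arrest'
# Note: "entity not in names" (without .keys()) is equally fast and idiomatic.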
All the files go through this manual correction - why is this manual correction needed?
# Manual correction
if in_str == 'fb:m.07s9rl0':
    in_str = 'fb:m.02822'
if in_str == 'fb:m.0bb56b6':
    in_str = 'fb:m.0dn0r'
if in_str == 'fb:m.01g81dw':
    in_str = 'fb:m.01g_bfh'
if in_str == 'fb:m.0y7q89y':
    in_str = 'fb:m.0wrt1c5'
if in_str == 'fb:m.0b0w7':
    in_str = 'fb:m.0fq0s89'
Let's run a logistic regression baseline for relation prediction. We can either use something like Mallet, or I suppose we can use one-hot vectors with a softmax?
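A minimal sketch of what such a baseline could look like, here using scikit-learn bag-of-words features as a stand-in for Mallet or explicit one-hot vectors (the library choice and the two sample rows are assumptions for illustration):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

questions = ["what is the book e about",
             "to what release does the release track cardiac arrest come from"]
relations = ["book/written_work/subjects", "music/release_track/release"]

vectorizer = CountVectorizer()               # sparse one-hot-style word features
X = vectorizer.fit_transform(questions)
clf = LogisticRegression(max_iter=1000).fit(X, relations)
print(clf.predict(vectorizer.transform(["what is the book e about"])))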
Let's run an entity detection baseline using Stanford NER straight out of the box.
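Something like the following could work via NLTK's wrapper (a sketch; the paths are assumptions and should point at the stanford-ner-2017-06-09 download mentioned later in this thread):
from nltk.tag import StanfordNERTagger

tagger = StanfordNERTagger(
    'stanford-ner-2017-06-09/classifiers/english.all.3class.distsim.crf.ser.gz',
    'stanford-ner-2017-06-09/stanford-ner.jar')
# Tag a tokenized question; spans tagged PERSON/LOCATION/etc. are entity candidates.
print(tagger.tag("where was sasha vujacic born ?".split()))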
How do you get FB5M.name.txt? Is it derived from freebase-FB5M.txt? I can't find the relation fb:type.object.name there. The names in this file seem to have been processed by a word-segmentation tool, but I want to replace the entities in the SimpleQuestions dataset.
@Impavidity I've given you access to https://git.uwaterloo.ca/jimmylin/BuboQA-data
Currently, we're downloading datasets from web sources... Let's move to a fixed repo?
I noticed that the setup script performs some post-processing - can you make sure to distinguish the "raw" (original, downloaded) version from the post-processed version?
If we load in the FB knowledge graph as a DataFrame, we can interactively explore it and play around.
@salman1993 please set up a "stub" iPython notebook that contains the boilerplate of loading everything?
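A stub for that loading boilerplate might start with something like this (the file name and the three-column tab-separated layout are assumptions about the Freebase subset):
import pandas as pd

# Load the kg triples into a DataFrame for interactive exploration.
kg = pd.read_csv('freebase-FB5M.txt', sep='\t',
                 names=['subject', 'predicate', 'object'])
print(kg[kg['subject'] == 'fb:m.04whkz5'])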
Currently, everything is in the ferhan_simple_qa_rnn/ directory. It would make sense to move data-processing scripts etc. to the top level - you might want to reorganize the directory structure while you're at it. A suggestion might be:
relation_prediction/
relation_prediction/tj_rnn/ (tj for Ture and Jojic)
relation_prediction/lr_baseline/
Ks_2sampResult(statistic=1.0, pvalue=2.1646881714606301e-23)
Ks_2sampResult(statistic=0.95999999999999996, pvalue=1.3674173963612658e-21)
Ks_2sampResult(statistic=0.80000000000000004, pvalue=4.0088870352288605e-15)
Ks_2sampResult(statistic=0.92000000000000004, pvalue=7.2931782229598809e-20)
Ks_2sampResult(statistic=0.58000000000000007, pvalue=3.761754930361375e-08)
Ks_2sampResult(statistic=0.76000000000000001, pvalue=1.0866243200614003e-13)
Ks_2sampResult(statistic=0.65999999999999992, pvalue=1.9824376695414084e-10)
Ks_2sampResult(statistic=0.83999999999999997, pvalue=1.2487577181530575e-16)
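For context, these look like outputs of SciPy's two-sample Kolmogorov-Smirnov test; a minimal sketch of how such a result is produced (the sample data is illustrative):
import numpy as np
from scipy.stats import ks_2samp

a = np.random.normal(0.0, 1.0, size=50)
b = np.random.normal(1.0, 1.0, size=50)
print(ks_2samp(a, b))   # => Ks_2sampResult(statistic=..., pvalue=...)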
@salman1993 there's a lot of junk checked into the repo. Try running something like this:
while read -r largefile; do
    echo $largefile | awk '{printf "%s %s ", $1, $3 ; system("git rev-list --all --objects | grep " $1 " | cut -d \" \" -f 2-")}'
done <<< "$(git rev-list --all --objects | awk '{print $1}' | git cat-file --batch-check | sort -k3nr | head -n 20)"
Also see: "How to find out which files take up the most space in a git repo?"
Let's clear out the junk... This will require rewriting history and force-pushing, so let's do it first, before the directory reorg?
I am using torch 0.4 and torchtext 0.2.3. When I load the train, dev, and test data, I find that TEXT.vocab.num is 12. How can I fix this issue?
When I run entity_linking.py, I am having trouble with the data:
FileNotFoundError: [Errno 2] No such file or directory: '../entity_detection/crf/query_text/query.valid'
Then I ran sh auto_run.sh, which downloads https://nlp.stanford.edu/software/stanford-ner-2017-06-09.zip.
I find the data differs from that in the entity detection part - why don't you use the same data?
Hey,
Thank you for sharing the great work! Would it be possible to add a license to the code? And maybe also for the preprocessed data?
Thank you!
We need to upgrade the code base to PyTorch 0.4.
One of the obvious weaknesses of the current approach is ambiguity in the entities - e.g., there are many "Einsteins" to choose from.
It would be nice if we could assign some sort of "prior" to the entities in the kg that reflects their importance. Two simple approaches off the top of my head:
I'd like to see a complete e2e demo where I can type in a textual question 'when was albert einstein born' and get back the answer 'March 14, 1879', showing intermediate results along the way (entity detection, entity linking, relation prediction).
Ideally I'd like this all wrapped up in a REST endpoint, but I'm open to discussion.
This is blocked by #3
@aliceranzhou can you please take the lead on this? Coordinate w/ @salman1993 re: #2 to help you out.
I am interested in running your entity detection, and I am having trouble with the dataset.
I followed the process below, and I got the error "FileNotFoundError: [Errno 2] No such file or directory: '../../data/processed_simplequestions_dataset/train.txt'"!
Process:
How can I get "/processed_simplequestions_dataset/train.txt"?
Hi,
Just want to let you know that the link to the file "FB5M.name.txt" is broken. The following line in scripts/fetch_dataset.sh is saving a 404 page to the txt file:
wget https://www.dropbox.com/s/yqbesl07hsw297w/FB5M.name.txt
We should also build a CNN relation prediction baseline - maybe use Kim CNN as a start?
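A minimal PyTorch sketch of a Kim (2014)-style CNN classifier over relation labels (layer sizes and names are illustrative assumptions, not code from the repo):
import torch
import torch.nn as nn
import torch.nn.functional as F

class KimCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_relations,
                 num_filters=100, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One 1-D convolution per kernel width, max-pooled over time.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_relations)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)    # (batch, embed_dim, seq_len)
        pooled = [F.relu(conv(x)).max(dim=2)[0] for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # relation logits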
I met the following error for entity_detection:
$ python train.py --entity_detection_mode LSTM --fix_embed
Note: You are using GPU for training
Traceback (most recent call last):
File "train.py", line 36, in <module>
train, dev, test = SQdataset.splits(TEXT, ED, args.data_dir)
File "BuboQA-master/entity_detection/nn/sq_entity_dataset.py", line 10, in splits
('obj', None), ('text', text_field), ('ed', label_field)]
File "anaconda2/envs/pytorch/lib/python3.6/site-packages/torchtext/data/dataset.py", line 76, in splits
os.path.join(path, train), **kwargs)
File "anaconda2/envs/pytorch/lib/python3.6/site-packages/torchtext/data/dataset.py", line 235, in __init__
examples = [make_example(line, fields) for line in reader]
File "anaconda2/envs/pytorch/lib/python3.6/site-packages/torchtext/data/dataset.py", line 235, in <listcomp>
examples = [make_example(line, fields) for line in reader]
File "anaconda2/envs/pytorch/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 4585: invalid continuation byte
My torchtext version is 0.2.3; I'm not sure if that's why I'm running into this problem.
Thanks!
In augment_process_dataset.py, some questions in the test dataset are skipped. It is unfair to use such results for comparison with the state of the art; nothing should be changed in the test data.