
buboqa's People

Contributors

aliceranzhou · impavidity · lintool · salman1993


buboqa's Issues

Change in test set

Hey, I am having an issue. Can I add my own instances directly to processed_simplequestions_dataset/test.txt, or do I need to put the new instances in annotated_fb_data_test? Whenever I try to change test.txt, an error occurs saying the Example object has no attribute 'text'.

Generate gold standard entity detection dataset

Since we've identified entity linking as a problem we need to examine in isolation, we should generate a specific dataset for it. Let's use this thread to discuss how best to do it.

My suggestion is to just augment the original fb dataset with new fields. So, annotated_fb_data_train.txt:

www.freebase.com/m/04whkz5	www.freebase.com/book/written_work/subjects	www.freebase.com/m/01cj3p	what is the book e about
www.freebase.com/m/0tp2p24	www.freebase.com/music/release_track/release	www.freebase.com/m/0sjc7c1	to what release does the release track cardiac arrest come from
...

Tack on the following column, taken from the entity's official alias:

e
cardiac arrest

In most cases the alias should be a proper substring of the question, but there may be cases where it doesn't match exactly.

My suggestion is to write a script that generates this file and check the Python script into the repo; a rough sketch is below.
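
A rough sketch of such a script (my own, not the repo's; it assumes an alias file mapping each MID to a canonical name, e.g. FB5M.name.txt, whose exact format may differ):

# Sketch: append the subject's official alias as a new column to
# annotated_fb_data_train.txt. The alias-file parsing below is an
# assumption; adjust it to the real format of the names file.

def load_aliases(path):
    aliases = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip('\n').split('\t')
            # assume the MID is the first column, the alias the last
            aliases.setdefault(parts[0], parts[-1])
    return aliases

def augment(in_path, out_path, aliases):
    with open(in_path, encoding='utf-8') as fin, \
         open(out_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            subj, rel, obj, question = line.rstrip('\n').split('\t')
            mid = subj.rsplit('/', 1)[-1]  # 'www.freebase.com/m/04whkz5' -> '04whkz5'
            alias = aliases.get(mid, '')   # empty when no alias is known
            fout.write('\t'.join([subj, rel, obj, question, alias]) + '\n')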

Thoughts @salman1993 and @Impavidity ?

Manual Correction

In www2fb() in scripts/util.py
there are some manual corrections for MIDs.
Can you explain a little why you have to do that?

Question about re-implementation

Hi everyone:
Have you tried to re-implement the paper "Simple Question Answering by Attentive Convolutional Neural Network"?

Thanks

create entity detection dataset like Ferhan described in email

email

Since we have the relation prediction model exactly as Ferhan described, let's also create the entity detection dataset the same way and try to replicate his model.

Steps:

  • create this dataset
  • run our model and train on this dataset
  • modify results_to_query.py file in entity_detection directory

IPython notebook for e2e system

Hi @salman1993, can you build an IPython notebook for the e2e system? I want to put in "when was albert einstein born" and get back something like [mid, 'albert einstein', rel].

Core dumped error when running "sudo docker build -t buboqa ."

Hi, thank you very much for the paper and code.

I'm following your instructions, and I don't know why, but I get a "core dumped" error on "sudo docker build -t buboqa .".
Can I get some hints or tips to solve this problem?

  • Please tell me if you need more information.

My computer:
*-cpu
description: CPU
product: AMD Ryzen 5 1600 Six-Core Processor
vendor: Advanced Micro Devices [AMD]
physical id: 11
bus info: cpu@0
version: AMD Ryzen 5 1600 Six-Core Processor
serial: Unknown
slot: AM4
size: 1374MHz
capacity: 3700MHz
width: 64 bits
clock: 100MHz

*-memory
description: System Memory
physical id: c
slot: System board or motherboard
size: 16GiB

Storage solution for the knowledge graph

@salman1993 IIRC, your code builds various indexes on the kg, but you don't actually index the graph itself. So if the question were "when was albert einstein born" and we got back something like [mid, 'albert einstein', rel], we would still need some functionality to actually look up the triple in the kg, right?

We should probably use some sort of DB? Poking around, there are some options:

Perhaps think about storing the index structures in the same db?
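
As a concrete example, a minimal SQLite sketch (the table name and schema here are my own assumptions, not existing code) that would cover the lookup:

import sqlite3

# Store one (subject, relation, object) row per triple and index
# (subject, relation) so answer lookups are a single indexed query.
conn = sqlite3.connect('kg.db')
conn.execute('CREATE TABLE IF NOT EXISTS triples (subj TEXT, rel TEXT, obj TEXT)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_subj_rel ON triples (subj, rel)')

def lookup(mid, rel):
    # Return all objects for an (entity, relation) pair, e.g.
    # lookup('fb:m.0jcx', 'fb:people.person.date_of_birth').
    cur = conn.execute('SELECT obj FROM triples WHERE subj = ? AND rel = ?',
                       (mid, rel))
    return [row[0] for row in cur]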

Dictionary Search Speed Up

When pickling the dictionary, using
"if names.get(entity) is None:"
instead of
"if entity not in names.keys():"
would speed up the lookup.
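
To illustrate (my own example): on Python 2, names.keys() materializes a list, so the membership test scans every key, while a dict lookup is O(1). (On Python 3, keys() returns a view and is already O(1), but testing membership on the dict directly is still the idiomatic form.)

names = {'fb:m.0jcx': 'Albert Einstein'}
entity = 'fb:m.02822'

# Python 2: names.keys() builds a list, so this scans every key.
slow = entity not in names.keys()

# Suggested form: a single O(1) hash lookup (but note it treats a stored
# value of None the same as a missing key).
fast = names.get(entity) is None

# Idiomatic alternative: O(1) membership test directly on the dict.
also_fast = entity not in names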

www2fb function: why is manual correction needed?

All the files go through this manual correction; why is it necessary?

if in_str == 'fb:m.07s9rl0':
    in_str = 'fb:m.02822'
if in_str == 'fb:m.0bb56b6':
    in_str = 'fb:m.0dn0r'
# Manual Correction
if in_str == 'fb:m.01g81dw':
    in_str = 'fb:m.01g_bfh'
if in_str == 'fb:m.0y7q89y':
    in_str = 'fb:m.0wrt1c5'
if in_str == 'fb:m.0b0w7':
    in_str = 'fb:m.0fq0s89'
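
For what it's worth, an equivalent table-driven form (my own refactor, not the repo's code) would make the corrections easier to audit and extend:

# Map each known-bad MID to its corrected MID; unknown MIDs pass through.
MID_CORRECTIONS = {
    'fb:m.07s9rl0': 'fb:m.02822',
    'fb:m.0bb56b6': 'fb:m.0dn0r',
    'fb:m.01g81dw': 'fb:m.01g_bfh',
    'fb:m.0y7q89y': 'fb:m.0wrt1c5',
    'fb:m.0b0w7': 'fb:m.0fq0s89',
}
in_str = MID_CORRECTIONS.get(in_str, in_str)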

FB5M.name.txt

How do you get FB5M.name.txt? Is it derived from freebase-FB5M.txt? I can't find the relation fb:type.object.name there. The names in this file seem to have been processed by a word segmentation tool, but I want to replace the entities in the SimpleQuestions dataset.

Reorganize directory structure, moving common data out of ferhan_simple_qa_rnn

Currently, everything is in the ferhan_simple_qa_rnn/ directory. It would make sense to move the data-processing scripts etc. to the top level; we might also want to reorganize the directory structure while we're at it. A suggestion:

  • relation_prediction/
  • relation_prediction/tj_rnn/ (tj for Ture and Jojic)
  • relation_prediction/lr_baseline/
  • etc.

Comparison of Results (Relation Prediction w/ violin plots)

  • Ks_2sampResult is the output of the two-sample Kolmogorov-Smirnov significance test
  • 50 experiment runs for each model (see the sketch below for how the test is computed)
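
For reference, these statistics can be reproduced with SciPy's two-sample KS test; a minimal sketch with placeholder data (the real inputs would be the 50 per-run accuracies of each model):

import numpy as np
from scipy.stats import ks_2samp

# Placeholder accuracy samples standing in for 50 runs of two models.
rng = np.random.RandomState(0)
cnn_acc = rng.normal(loc=0.82, scale=0.01, size=50)
gru_acc = rng.normal(loc=0.81, scale=0.01, size=50)

print(ks_2samp(cnn_acc, gru_acc))  # Ks_2sampResult(statistic=..., pvalue=...)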

Relation Prediction

Top 1 Results on Valid

[violin plot]
Ks_2sampResult(statistic=1.00, pvalue=2.16e-23)
[violin plot]
Ks_2sampResult(statistic=0.96, pvalue=1.37e-21)

Top 1 Results on Test

[violin plot]
Ks_2sampResult(statistic=0.80, pvalue=4.01e-15)
[violin plot]
Ks_2sampResult(statistic=0.92, pvalue=7.29e-20)

Top 5 Results on Valid

[violin plot]
Ks_2sampResult(statistic=0.58, pvalue=3.76e-08)
[violin plot]
Ks_2sampResult(statistic=0.76, pvalue=1.09e-13)

Top 5 Results on Test

[violin plot]
Ks_2sampResult(statistic=0.66, pvalue=1.98e-10)
[violin plot]
Ks_2sampResult(statistic=0.84, pvalue=1.25e-16)

  • Conclusion:
    For top-1 results: CNN > GRU > LSTM
    For top-5 results: GRU > CNN > LSTM

Remove junk from repo

@salman1993 there's a lot of junk checked into the repo. Try running something like this:

# List the 20 largest objects anywhere in the repo's history, then map
# each object hash back to the file path(s) it appears under.
while read -r largefile; do
    echo $largefile | awk '{printf "%s %s ", $1, $3 ; system("git rev-list --all --objects | grep " $1 " | cut -d \" \" -f 2-")}'
done <<< "$(git rev-list --all --objects | awk '{print $1}' | git cat-file --batch-check | sort -k3nr | head -n 20)"

Also see: "How to find out which files take up the most space in a git repo?"

Let's clear out the junk... This will require rewriting history and force-pushing, so let's do it before the directory reorg?

load data issue

I am using torch 0.4 and torchtext 0.2.3. When I load the train, dev, and test data, I find that TEXT.vocab.num is 12. How can I fix this issue?

entity "prior" (i.e., importance)

One of the obvious weaknesses of the current approach is ambiguity in the entities; e.g., there are many "Einsteins" to choose from.

It would be nice if we could assign some sort of "prior" to each entity in the kg that reflects its importance. Two simple approaches off the top of my head (sketched after the list):

  • number of out-going edges - heuristic is: more relations = more important entity
  • run PageRank on the kg and assign PageRank score to entity
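
A small sketch of both priors using networkx (the toy graph and entity names are made up for illustration):

import networkx as nx

# Toy stand-in for the kg: one directed edge per (subject, object) pair.
kg = nx.DiGraph()
kg.add_edges_from([
    ('m.einstein', 'm.ulm'),
    ('m.einstein', 'm.physics'),
    ('m.einstein', 'm.nobel_prize'),
    ('m.other_einstein', 'm.somewhere'),
])

# Prior 1: out-degree -- more outgoing relations = more important entity.
degree_prior = dict(kg.out_degree())

# Prior 2: PageRank score over the whole graph.
pagerank_prior = nx.pagerank(kg)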

e2e demo

I'd like to see a complete e2e demo where I can type in a textual question 'when was albert einstein born' and get back the answer 'March 14, 1879', showing intermediate results along the way (entity detection, entity linking, relation prediction).

Ideally I'd like this all wrapped up in a REST endpoint, but I'm open to discussion.
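
One possible shape for that endpoint, as a minimal Flask sketch; the route, field names, and hard-coded values are placeholders of mine, not the actual implementation:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/answer')
def answer():
    question = request.args.get('q', '')
    # A real demo would run entity detection, entity linking, and
    # relation prediction here; these values are hard-coded placeholders.
    return jsonify({
        'question': question,
        'entity_span': 'albert einstein',
        'mid': 'fb:m.0jcx',
        'relation': 'fb:people.person.date_of_birth',
        'answer': 'March 14, 1879',
    })

if __name__ == '__main__':
    app.run(port=5000)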

This is blocked by #3

@aliceranzhou can you please take the lead on this? Coordinate w/ @salman1993 re: #2 to help you out.

Data set is not found!

I am interested in running your entity detection, and I am having trouble with the dataset.
I followed the process below and got the error "FileNotFoundError: [Errno 2] No such file or directory: '../../data/processed_simplequestions_dataset/train.txt'"!

Process:

  1. sh setup.sh
  2. python train.py --entity_detection_mode LSTM --fix_embed ← here, I got the error....

How can I get "/processed_simplequestions_dataset/train.txt"?

FB5M.name.txt cannot be downloaded

Hi,
Just want to let you know that the link to the file "FB5M.name.txt" is broken. The following line in scripts/fetch_dataset.sh ends up saving a 404 webpage to the txt file:

wget https://www.dropbox.com/s/yqbesl07hsw297w/FB5M.name.txt

UnicodeDecodeError for entity_detection

I ran into the following error with entity_detection:

$ python train.py --entity_detection_mode LSTM --fix_embed
Note: You are using GPU for training
Traceback (most recent call last):
  File "train.py", line 36, in <module>
    train, dev, test = SQdataset.splits(TEXT, ED, args.data_dir)
  File "BuboQA-master/entity_detection/nn/sq_entity_dataset.py", line 10, in splits
    ('obj', None), ('text', text_field), ('ed', label_field)]
  File "anaconda2/envs/pytorch/lib/python3.6/site-packages/torchtext/data/dataset.py", line 76, in splits
    os.path.join(path, train), **kwargs)
  File "anaconda2/envs/pytorch/lib/python3.6/site-packages/torchtext/data/dataset.py", line 235, in __init__
    examples = [make_example(line, fields) for line in reader]
  File "anaconda2/envs/pytorch/lib/python3.6/site-packages/torchtext/data/dataset.py", line 235, in <listcomp>
    examples = [make_example(line, fields) for line in reader]
  File "anaconda2/envs/pytorch/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc5 in position 4585: invalid continuation byte

My torchtext version is 0.2.3; not sure if that's why I'm seeing this problem.
Thanks!

Unfair to skip questions in test dataset

In augment_process_dataset.py, some questions in the test dataset are skipped. It is unfair to use such results for comparison with the state of the art; nothing in the test data should be changed.
