WebQA-Using-DGCNN

A reimplementation of DGCNN using PyTorch. The idea comes from Jianlin Su's blog. The Dilated Gated Convolutional Neural Network (DGCNN) is based on CNNs and a simple attention mechanism. It is efficient and lightweight because the model contains no RNN components. It was designed specifically for the WebQA task.
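
For readers unfamiliar with the building block, here is a minimal pure-Python sketch of one dilated gated convolution over a sequence of scalars. It assumes a common residual-gate formulation, out = x * (1 - gate) + conv(x) * gate. The function names and toy weights are illustrative and are not taken from this repository, whose actual layers are written in PyTorch and operate on vectors.

```python
import math


def dilated_conv1d(x, weights, dilation):
    """Zero-padded 1D convolution that keeps the sequence length.

    Taps are spaced `dilation` positions apart, so a kernel of size 3
    with dilation 2 looks at positions t-2, t and t+2.
    """
    center = len(weights) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(weights):
            idx = t + (j - center) * dilation
            if 0 <= idx < len(x):  # positions outside the sequence count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_block(x, w_value, w_gate, dilation):
    """Assumed formulation: x * (1 - gate) + conv_value(x) * gate."""
    value = dilated_conv1d(x, w_value, dilation)
    gate = [sigmoid(g) for g in dilated_conv1d(x, w_gate, dilation)]
    return [xi * (1.0 - g) + v * g for xi, v, g in zip(x, value, gate)]


# With weights [1, 0, 0] and dilation 2, each output copies the input two steps back.
print(dilated_conv1d([1, 2, 3, 4, 5], [1, 0, 0], 2))  # [0.0, 0.0, 1.0, 2.0, 3.0]
```

Stacking such blocks with growing dilation widens the receptive field without any recurrence, which is where the speed advantage over RNNs comes from.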

Dataset

The dataset I use is WebQA, from Baidu Research's paper:

  • Peng Li, Wei Li, Zhengyan He, Xuguang Wang, Ying Cao, Jie Zhou, and Wei Xu. 2016. Dataset and Neural Recurrent Sequence Labeling Model for Open-Domain Factoid Question Answering. arXiv:1607.06275

Thanks to the authors for sharing it.

Based on the above dataset, Jianlin Su created a cleaned version which may be more suitable for students. He processed the raw data and shared it on his blog.

Because the dataset is too large, I did not push it to GitHub. You can download the dataset by following Su's blog, and then move the data under the data directory.

The structure of WebQA:

WebQA
|- readme.md
|- me_test.ann.json
|- me_test.ir.json
|- me_train.json
|- me_validation.ann.json
|- me_validation.ir.json

The difference between ir and ann is:

  • In the ann dataset, every data item has one question and one corresponding evidence, which contains the answer.
  • In the ir dataset, every data item has one question and multiple evidences, which may not contain the true answer.

The structure of the JSON file:

  • Read the file and parse it with json.loads to get a dict. Each key of this dict, for example "Q_TRN_010878", is the index of a question.
  • We can get a data item through d['Q_TRN_010878']. Every data item is a dict which has two keys: 'question' and 'evidences'.
  • Through d['Q_TRN_010878']['question'], we can get the text of the question. For example: "勇敢的心霍笑林的父亲是谁出演的"
  • Through d['Q_TRN_010878']['evidences'], we can get a dict. Each key of this dict is the index of an evidence.
  • Through d['Q_TRN_010878']['evidences']['Q_TRN_010878#05'], we also get a dict, which has two keys: 'evidence' and 'answer'.
  • evidence is the text of the evidence, and answer is a list (there may be more than one answer). If there is no answer, answer is ['no_answer'].

filename                itemNums
me_train.json           36181
me_validation.json      3018
me_test.json            3024
answer_train.json       140897
no_answer_train.json    307547
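
The nested JSON layout described above can be walked with a small generator. This is a sketch only: the helper name iter_evidences and the placeholder evidence text are made up for illustration, and the file path in the comment is an assumption about where you placed the download.

```python
import json


def iter_evidences(data):
    """Yield (question_id, question, evidence_id, evidence_text, answers)."""
    for qid, item in data.items():
        question = item["question"]
        for eid, ev in item["evidences"].items():
            yield qid, question, eid, ev["evidence"], ev["answer"]


# A tiny in-memory sample that mimics the layout described above.
# The evidence text is a made-up placeholder, not real dataset content.
sample_text = json.dumps(
    {
        "Q_TRN_010878": {
            "question": "勇敢的心霍笑林的父亲是谁出演的",
            "evidences": {
                "Q_TRN_010878#05": {
                    "evidence": "placeholder evidence text",
                    "answer": ["no_answer"],
                }
            },
        }
    },
    ensure_ascii=False,
)

data = json.loads(sample_text)
rows = list(iter_evidences(data))
print(rows[0])

# For the real file (path is an assumption), something like:
# with open("data/WebQA/me_train.json", encoding="utf-8") as f:
#     data = json.load(f)
```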

Data Preprocess

The raw data is not convenient to use directly. For the training set, I separate the evidences so that every data item has one question, one evidence, and an answer list. Some items have an answer and some do not; their ratio is 1:1.

For the validation and test sets, I only separate the evidences, so that each item has one question, one evidence, and an answer list.

Just like this:

[
  {
    "question":"世界第一高峰是什么?",
    "evidence":"世界上的第一高峰是珠穆朗玛峰",
    "answer":["珠穆朗玛峰"]
  },
  ...
  {
    "question":"世界第一高峰是什么?",
    "evidence":"武夷山很高",
    "answer":["no_answer"]
  }
]
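
The flattening and 1:1 balancing described above could look roughly like the sketch below. It is not the code in script/buildDataset.py: the function names, the use of random downsampling to reach the 1:1 ratio, and the toy input are all assumptions made for illustration.

```python
import random

NO_ANSWER = "no_answer"


def flatten(data):
    """Turn {question_id: {question, evidences}} into one item per evidence."""
    items = []
    for item in data.values():
        for ev in item["evidences"].values():
            items.append(
                {
                    "question": item["question"],
                    "evidence": ev["evidence"],
                    "answer": list(ev["answer"]),
                }
            )
    return items


def balance(items, seed=0):
    """Downsample the larger group so answered:unanswered is 1:1 (assumed strategy)."""
    answered = [it for it in items if it["answer"] != [NO_ANSWER]]
    unanswered = [it for it in items if it["answer"] == [NO_ANSWER]]
    n = min(len(answered), len(unanswered))
    rng = random.Random(seed)
    balanced = rng.sample(answered, n) + rng.sample(unanswered, n)
    rng.shuffle(balanced)
    return balanced


# Toy input: one question with one answered and two unanswered evidences.
raw = {
    "Q1": {
        "question": "世界第一高峰是什么?",
        "evidences": {
            "Q1#00": {"evidence": "世界上的第一高峰是珠穆朗玛峰", "answer": ["珠穆朗玛峰"]},
            "Q1#01": {"evidence": "武夷山很高", "answer": ["no_answer"]},
            "Q1#02": {"evidence": "泰山在山东", "answer": ["no_answer"]},
        },
    }
}

flat = flatten(raw)
train = balance(flat)
print(len(flat), len(train))  # 3 2
```

For the validation and test sets only flatten would apply, since those keep every evidence.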

Usage

Download the dataset and the pretrained word vectors

(For word vectors I use BaiDuBaike: https://pan.baidu.com/s/1YYE2T3f-lPyLBrJuUowAsA Password: 5p0h)

Download them and place them under the data directory.

Preprocess the word vectors and the dataset

Run:

python script/buildDataset.py

python script/generateWordVec.py

Afterwards you will find that:

Under the data directory there is an additional directory, dataset.

Under the ChinsesWordVec_baike directory there are four additional files: word_embedding.npy, char_embedding.npy, char2id.pkl, and word2id.pkl.
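
As a rough illustration of how a lookup file such as char2id.pkl might be used, the sketch below writes a toy vocabulary to a temporary directory, loads it back with pickle, and maps text to ids. It assumes the pickle holds a plain dict from character to integer id and that id 0 stands for unknown characters; neither assumption is confirmed by the repository. The .npy files can be read with numpy.load.

```python
import os
import pickle
import tempfile


def load_vocab(path):
    """Load a pickled {token: id} dict (the dict layout is an assumption)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def encode(text, char2id, unk_id=0):
    """Map each character to its id, falling back to unk_id."""
    return [char2id.get(ch, unk_id) for ch in text]


# Toy vocabulary written to a temporary file so the example is self-contained.
toy_vocab = {"<unk>": 0, "世": 1, "界": 2}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "char2id.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_vocab, f)
    char2id = load_vocab(path)

ids = encode("世界第", char2id)
print(ids)  # [1, 2, 0]
```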

Training and Prediction

python src/run.py

You can find log information in your terminal and in the log file under the log directory.

The trained model is stored under the result directory.

Problem

At first I failed to reproduce the expected result of this model.

I found that with the pointer-label model, the final classifier tends to predict zero for every item.

As a result, in the validation step, the F1 score, recall, and precision are all zero.
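
The sketch below illustrates why an all-zero predictor is tempting for the model and still scores zero. It assumes that "pointer labels" means two 0/1 sequences marking the start and end character of the answer inside the evidence, and it computes the metrics per position; the repository may compute them differently (for example per answer string).

```python
def pointer_labels(evidence, answer):
    """Return (start, end) 0/1 lists marking the first occurrence of answer."""
    start = [0] * len(evidence)
    end = [0] * len(evidence)
    pos = evidence.find(answer)
    if pos >= 0:
        start[pos] = 1
        end[pos + len(answer) - 1] = 1
    return start, end


def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1, returning 0.0 when a ratio is undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


evidence = "世界上的第一高峰是珠穆朗玛峰"
start, end = pointer_labels(evidence, "珠穆朗玛峰")

# A classifier that always outputs zero is right at almost every position...
all_zero = [0] * len(evidence)
accuracy = sum(t == p for t, p in zip(start, all_zero)) / len(start)
print(accuracy)

# ...but it never finds the answer, so precision, recall and F1 are all zero.
print(precision_recall_f1(start, all_zero))  # (0.0, 0.0, 0.0)
```

Because only one position per sequence is positive, the loss can be driven low early on by predicting zeros everywhere, which is consistent with the symptom described above.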

Update

2020/09/24 opened an issue at bojone/dgcnn_for_reading_comprehension#5

2020/09/25 tried adjusting the random seed to give it a shot 2333.

2020/10/06 training for more epochs solved the above problem; the final F1 score is 0.64.

Requirement

This repo was tested on Python 3.6.11 and torch 1.5.1+cu101. The main requirements are:

  • tqdm
  • codecs
  • torch = 1.5.1+cu101

Contributors

zrealshadow
