Comments (6)
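I don't think I explained this clearly:
First of all, thanks for your code.
I want to use your preprocessing pipeline to get the id-mapped context, question, and answer, along with the answer span, and I would like the id mapping to produce as few UNKs as possible. Where can I control this through parameters?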
from drqa.
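Hi,
First, whether a word becomes UNK here is determined by whether it appears in GloVe, and the resulting vocabulary size is as expected. Second, UNK is introduced at this step because the integer ids are used to index the embedding matrix (not to restore the text). This makes all UNKs share the same embedding vector, which is exactly what we want for model training.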
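To make the id mapping described above concrete, here is a minimal sketch. It is not the repository's preprocessing code: the names `pretrained_words`, `build_vocab`, `to_ids`, and the special tokens `<PAD>` and `<UNK>` are hypothetical, and a small set stands in for the GloVe vocabulary.

```python
# Toy stand-in for the set of words covered by the pre-trained vectors (hypothetical).
pretrained_words = {"the", "cat", "sat", "on", "mat"}

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical special tokens


def build_vocab(tokens, known_words):
    """Keep only words that have a pre-trained vector; ids 0 and 1 are reserved."""
    vocab = [PAD, UNK] + sorted({t for t in tokens if t in known_words})
    return {word: idx for idx, word in enumerate(vocab)}


def to_ids(tokens, word2id):
    """Map tokens to ids; every out-of-vocabulary token gets the single UNK id."""
    unk_id = word2id[UNK]
    return [word2id.get(t, unk_id) for t in tokens]


tokens = ["the", "cat", "sat", "on", "the", "zyzzyva", "mat", "qwerty"]
word2id = build_vocab(tokens, pretrained_words)
ids = to_ids(tokens, word2id)
print(ids)
```

Both out-of-vocabulary tokens end up with the same id, so they would index the same row of an embedding matrix.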
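To restore the original text, the correct approach is to use each word's character range in the original text (context_span). See the predict function in drqa/model.py for details. Specifically, to restore the content from word s_i through word e_i of the i-th article, you can use the following code: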
def get_original_text(data, i, s_i, e_i):
    # data[i][-2]: raw context string; data[i][-1]: per-word (start, end) character spans
    return data[i][-2][data[i][-1][s_i][0]: data[i][-1][e_i][1]]

text = get_original_text(train, 0, 2, 4)  # text from the 2nd word to the 4th word in the 0-th training example
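As a self-contained illustration of the same idea, the sketch below builds the character spans for a toy sentence and slices the raw string with them. The whitespace tokenizer and the names `tokenize_with_spans` and `restore_text` are assumptions made for this example; the repository's own tokenizer and data layout may differ.

```python
import re


def tokenize_with_spans(text):
    """Split on runs of non-whitespace, keeping (start, end) character offsets."""
    matches = list(re.finditer(r"\S+", text))
    tokens = [m.group() for m in matches]
    spans = [(m.start(), m.end()) for m in matches]
    return tokens, spans


def restore_text(text, spans, start_tok, end_tok):
    """Return the raw text covering tokens start_tok..end_tok (inclusive)."""
    return text[spans[start_tok][0]:spans[end_tok][1]]


# Note the double space after "Dame": slicing by span preserves it.
context = "Notre Dame  was founded in 1842 by Edward Sorin."
tokens, spans = tokenize_with_spans(context)
answer = restore_text(context, spans, 5, 5)
phrase = restore_text(context, spans, 1, 2)
print(repr(answer), repr(phrase))
```

Because the slice is taken from the raw string, the result keeps the original casing, spacing, and any words that were mapped to UNK during id conversion.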
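As for the second comment, could you elaborate on the motivation for a full id mapping? At the moment I don't see the point of doing so.
Thanks for your feedback.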
from drqa.
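"This makes all UNKs share the same embedding vector, which is exactly what we want for model training."
My question is mainly about this point. When I built vocabularies before, I made them as large as possible within acceptable performance limits, even when using pre-trained word vectors. What is the benefit of mapping every word outside GloVe to UNK? It feels like this loses too much information.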
from drqa.
Uh, that question is resolved. One more thing I'd like to ask: in load_squad, what do feature, tag, and ents each represent? I see that all three have the same length.
from drqa.
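Hi,
On the first question, the prior belief is this: SQuAD is collected from Wikipedia and contains only a little over 500 articles (despite having 100k question-answer pairs). That may not be enough data to train many rare words, so their word vectors would overfit easily. It is therefore better to let the model learn to handle UNK, since at test time it will still encounter many previously unseen UNKs. In practice the conclusion may differ, of course; if you are interested in running a comparison, you are welcome to share the results :)
On the second question, the three correspond to the match features plus the TF feature, POS, and NER in the paper, respectively.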
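For readers unfamiliar with the match and TF features, here is a rough sketch of how such per-token features could be computed. It is an assumption-laden illustration, not the repository's implementation: the function name `match_and_tf_features` is hypothetical, only exact and lowercase matching are shown (lemma matching is left out because it needs an NLP library), and TF is taken as the token's lowercase count divided by the context length.

```python
from collections import Counter


def match_and_tf_features(context_tokens, question_tokens):
    """For each context token, return (exact_match, lowercase_match, term_frequency)."""
    question_exact = set(question_tokens)
    question_lower = {t.lower() for t in question_tokens}
    counts = Counter(t.lower() for t in context_tokens)
    total = len(context_tokens)
    features = []
    for tok in context_tokens:
        features.append((
            tok in question_exact,          # same surface form appears in the question
            tok.lower() in question_lower,  # case-insensitive match
            counts[tok.lower()] / total,    # normalized term frequency in the context
        ))
    return features


context_tokens = ["The", "cat", "chased", "the", "mouse"]
question_tokens = ["What", "did", "the", "cat", "chase", "?"]
features = match_and_tf_features(context_tokens, question_tokens)
for tok, feat in zip(context_tokens, features):
    print(tok, feat)
```

The output has one feature tuple per context token, which is consistent with the observation that the feature, tag, and entity lists have the same length as the context.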
from drqa.
Oh, got it. Thanks.
from drqa.