Comments (6)
Yes. It's hard to automatically fix the tokenization errors.
from drqa.
Also, a follow-up question, removal of punctuation is not necessary from the contexts and questions, right?
No, it is not necessary.
Your script has only fixed the spaces before building a vocab. Even lowercasing the text before building the vocab is not necessary, right?
If you use the lower-cased GloVe, you should lowercase the text before building the vocab. Otherwise, the vocab and the embedding tokens may not match.
In the case of GloVe 840B, keeping the data as it is does not affect the vocab a lot. But in the case of GloVe 6B, lowercasing the data reduces the number of out-of-vocabulary words to a fair extent.
Yes, you should lowercase the data when using the lowercased GloVe 6B version.
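To make the vocab/embedding mismatch concrete, here is a minimal sketch that measures the out-of-vocabulary rate with and without lowercasing. The helper names (`build_vocab`, `oov_rate`) and the toy token set are hypothetical and are not taken from the repo; a real check would load the token column of the GloVe file instead.

```python
from collections import Counter


def build_vocab(texts, lowercase=False):
    """Count whitespace-separated tokens, optionally lowercasing first."""
    counter = Counter()
    for text in texts:
        if lowercase:
            text = text.lower()
        counter.update(text.split())
    return counter


def oov_rate(vocab, embedding_tokens):
    """Fraction of distinct vocab entries that have no pretrained vector."""
    if not vocab:
        return 0.0
    missing = [tok for tok in vocab if tok not in embedding_tokens]
    return len(missing) / len(vocab)


# Toy stand-in for the token set of a lowercased embedding file (assumption).
lowercased_embedding_tokens = {"the", "eiffel", "tower", "is", "in", "paris"}
texts = ["The Eiffel Tower is in Paris"]

cased_oov = oov_rate(build_vocab(texts), lowercased_embedding_tokens)
lower_oov = oov_rate(build_vocab(texts, lowercase=True), lowercased_embedding_tokens)
print(f"OOV without lowercasing: {cased_oov:.2f}, with lowercasing: {lower_oov:.2f}")
```

With a cased embedding set the two rates would be much closer, which is the behaviour described above for GloVe 840B.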
from drqa.
It's probably due to a tokenization inconsistency between the annotated answer span and the spaCy tokenization. It's likely to happen where the corpus has unusual punctuation.
If the annotated answer_start or answer_end lies in the middle of a token produced by spaCy, a ValueError is raised.
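A minimal sketch of how such an error can arise, assuming the character offsets are mapped to token indices with `list.index`. The regex tokenizer below is only a stand-in for spaCy, and the function names are hypothetical, not the repo's.

```python
import re


def tokenize_with_offsets(text):
    """Return (token, start, end) triples; a regex stand-in for a real tokenizer."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def char_span_to_token_span(tokens, answer_start, answer_end):
    """Map character offsets to token indices.

    list.index raises ValueError when an offset is not a token boundary,
    i.e. when the annotated span starts or ends in the middle of a token.
    """
    starts = [start for _, start, _ in tokens]
    ends = [end for _, _, end in tokens]
    return starts.index(answer_start), ends.index(answer_end)


context = "The tower is 300metres tall."
tokens = tokenize_with_offsets(context)

# "tower" occupies characters 4..9 and is exactly one token.
print(char_span_to_token_span(tokens, 4, 9))

# "300" occupies characters 13..16, but the tokenizer produced "300metres",
# so the end offset falls inside a token.
try:
    char_span_to_token_span(tokens, 13, 16)
except ValueError:
    print("answer span is not aligned with token boundaries")
```

Examples that hit the `except` branch are the ones that have to be either repaired or dropped.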
from drqa.
So you're not considering those examples for training, right?
from drqa.
I banged my head for some days trying to debug and fix them. It's largely due to the absence of a space character (' ') just before or just after the answer span in the context. I reduced the errors to 10-15 erroneous examples and finally dropped them.
Also, a follow-up question, removal of punctuation is not necessary from the contexts and questions, right? Your script has only fixed the spaces before building a vocab. Even lowercasing the text before building the vocab is not necessary, right?
In the case of GloVe 840B, keeping the data as it is does not affect the vocab a lot. But in the case of GloVe 6B, lowercasing the data reduces the number of out-of-vocabulary words to a fair extent.
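One way to repair the missing-space cases is to pad the answer span with spaces before tokenizing. The sketch below is an assumption about how that could be done, not the script discussed in this thread; `pad_answer_span` is a hypothetical helper. It is deliberately blunt: it also separates the answer from adjacent punctuation, and it shifts the offsets of any other answers that come later in the same context, so those would need to be adjusted by the caller.

```python
def pad_answer_span(context, answer_start, answer_text):
    """Insert a space on either side of the answer if one is missing.

    Returns the modified context and the new answer_start.
    """
    answer_end = answer_start + len(answer_text)
    if context[answer_start:answer_end] != answer_text:
        raise ValueError("answer text does not match the context at answer_start")

    before, after = context[:answer_start], context[answer_end:]
    if before and not before[-1].isspace():
        before += " "
    if after and not after[0].isspace():
        after = " " + after
    return before + answer_text + after, len(before)


new_context, new_start = pad_answer_span("The tower is 300metres tall.", 13, "300")
print(new_context)   # The tower is 300 metres tall.
print(new_start)     # 13
```

After padding, the answer boundaries coincide with whitespace, so a tokenizer that splits on whitespace will place token boundaries there.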
Thank you for your help!
from drqa.
Thanks a lot!
from drqa.