
Trying to understand the index_answer function (drqa issue, CLOSED)

kushalj001 commented on June 21, 2024

Trying to understand the index_answer function

from drqa.

Comments (6)

hitvoice commented on June 21, 2024 1

Yes. It's hard to automatically fix the tokenization errors.


hitvoice commented on June 21, 2024 1

> Also, a follow-up question, removal of punctuation is not necessary from the contexts and questions, right?

No, it is not necessary.

> Your script has only fixed the spaces before building a vocab. Even lowercasing the text is not necessary, right before building the vocab?

If you use the lowercased GloVe, you should lowercase the text before building the vocab. Otherwise, the vocab and the embedding tokens may not match.

> In the case of GloVe 840B, keeping the data as it is does not affect the vocab a lot. But in the case of GloVe 6B, lowercasing the data reduces the out-of-vocabulary words to a fair extent.

Yes, you should lowercase the data when using the lowercased GloVe 6B version.
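To make the vocab/embedding mismatch concrete, here is a minimal sketch that measures the out-of-vocabulary rate with and without lowercasing. The `oov_rate` helper and the tiny `embedding_vocab` set are hypothetical stand-ins for illustration; they are not part of the drqa code, and a real check would use the word list loaded from the GloVe file.

```python
def oov_rate(tokens, embedding_vocab, lowercase):
    """Fraction of distinct token types that have no entry in the embedding vocab."""
    types = {t.lower() if lowercase else t for t in tokens}
    missing = [t for t in types if t not in embedding_vocab]
    return len(missing) / len(types)


# Toy stand-in for a lowercased embedding vocabulary (assumption, not real GloVe data).
embedding_vocab = {"the", "eiffel", "tower", "is", "in", "paris"}
tokens = ["The", "Eiffel", "Tower", "is", "in", "Paris"]

rate_cased = oov_rate(tokens, embedding_vocab, lowercase=False)
rate_lower = oov_rate(tokens, embedding_vocab, lowercase=True)
print(rate_cased, rate_lower)
```

In this toy case, four of the six cased types miss the lowercased vocab, while lowercasing first matches all of them.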


hitvoice commented on June 21, 2024

It's probably due to an inconsistency between the annotated answer span and the spaCy tokenization. This is likely to happen where the corpus has unusual punctuation.
If the annotated answer_start or answer_end lies in the middle of a token produced by spaCy, a ValueError is raised.
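The failure mode described above can be sketched as follows. This is a simplified illustration, not the repository's actual code: it assumes the answer is located by looking up its character offsets in the lists of token start and end offsets, and it uses a whitespace regex tokenizer as a stand-in for spaCy so the example runs without extra dependencies.

```python
import re


def token_spans(text):
    """Character (start, end) offsets of whitespace-delimited tokens (stand-in for spaCy)."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def index_answer(spans, answer_start, answer_end):
    """Map character offsets to token indices; (None, None) if they fall inside a token."""
    starts, ends = zip(*spans)
    try:
        # tuple.index raises ValueError when the offset is not a token boundary.
        return starts.index(answer_start), ends.index(answer_end)
    except ValueError:
        return None, None


context = "He scored 30-40 points"
spans = token_spans(context)

aligned = index_answer(spans, 3, 9)       # "scored" matches a whole token
misaligned = index_answer(spans, 13, 15)  # "40" starts inside the token "30-40"
print(aligned, misaligned)
```

Here the answer "40" begins in the middle of the token "30-40", so no token starts at offset 13 and the lookup fails.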


kushalj001 commented on June 21, 2024

So you're not considering those examples for training, right?


kushalj001 commented on June 21, 2024

I banged my head against this for some days trying to debug and fix them. It's largely due to the absence of a space character (' ') just before or just after the answer span. I got the errors down to 10-15 examples and finally dropped those.
Also, a follow-up question, removal of punctuation is not necessary from the contexts and questions, right? Your script has only fixed the spaces before building a vocab. Even lowercasing the text is not necessary, right before building the vocab?
In the case of GloVe 840B, keeping the data as it is does not affect the vocab a lot. But in the case of GloVe 6B, lowercasing the data reduces the out-of-vocabulary words to a fair extent.

Thank you for your help!
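One possible way to handle the missing-space problem described above is to insert a space on either side of the answer span before tokenizing, and shift the answer offset accordingly. The `pad_answer_span` helper below is a hypothetical sketch of that idea; it is not taken from the drqa preprocessing script and may not match what either commenter actually did.

```python
def pad_answer_span(context, answer_start, answer_text):
    """Ensure whitespace surrounds the answer; return (new_context, new_answer_start)."""
    answer_end = answer_start + len(answer_text)
    if context[answer_start:answer_end] != answer_text:
        raise ValueError("answer_text does not match context at answer_start")
    before, after = context[:answer_start], context[answer_end:]
    # Add a space only where the neighbouring character is not already whitespace.
    if before and not before[-1].isspace():
        before += " "
    if after and not after[0].isspace():
        after = " " + after
    return before + answer_text + after, len(before)


new_context, new_start = pad_answer_span("It costs $5million today", 10, "5")
print(new_context, new_start)
```

After padding, a whitespace-sensitive tokenizer sees the answer as its own token, at the cost of slightly altering the original context text.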


kushalj001 commented on June 21, 2024

Thanks a lot!

