Comments (12)
Here's an example breakdown of precision, recall, and F1 by label:
           F1     Prec   Recall
Micro Avg  90.82  91.05  90.60
------------------------------
LOC        92.56  92.21  92.93
MISC       80.75  81.45  80.06
PER        95.70  96.54  94.87
ORG        88.59  88.61  88.56
I deal with O by training it like any other label.
from dilated-cnn-ner.
This is exactly what happens in the early stages of training. Initially, the biggest gains in loss reduction come from predicting the majority class. Over time, however, the model begins to distinguish the non-O classes in order to further reduce the loss. We can see that this is exactly what happens by looking at the model's final F1 score, which shows it is clearly not just predicting O for every token. If the O class were sufficiently over-represented you might have to address the imbalance directly, but in this dataset that is not an issue.
Given the annotations of CONLL 2003, shouldn't it have around 9 classes?
'B-LOC': 7140,
'B-MISC': 3438,
'B-ORG': 6321,
'B-PER': 6600,
'I-LOC': 1157,
'I-MISC': 1155,
'I-ORG': 3704,
'I-PER': 4528,
'O': 169578
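Counts like these can be reproduced with `collections.Counter` over the flat tag list (a sketch; the short `labels` list here is just the tags of the first CoNLL-2003 sentence, not the full training split):

```python
from collections import Counter

# Tags for the first CoNLL-2003 sentence, as a stand-in for the full tag list.
labels = ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']
print(Counter(labels))  # Counter({'O': 6, 'B-MISC': 2, 'B-ORG': 1})
```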
If the model is trained with 'O' tags, shouldn't there be an F1 score for the 'O' tag as well (like the ones in my last comment)?
Moreover, since there is high class imbalance (most tags are 'O'), shouldn't we take the macro average instead of the micro average? Otherwise, I think the reported findings will be biased towards the performance of the dominant class.
As an example:
Class0 - TPR: 9999/10000=0.9999
Class1 - TPR: 0/1=0.0
micro-average TPR: (9999+0)/(10000+1)=0.9998
macro-average TPR: (0.9999+0.0)/2=0.49995
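The arithmetic in this example can be checked directly:

```python
# Micro vs. macro averaging for the two-class TPR example above.
tp = [9999, 0]    # true positives per class
pos = [10000, 1]  # actual positives per class

micro = sum(tp) / sum(pos)                             # pool counts, then divide
macro = sum(t / p for t, p in zip(tp, pos)) / len(tp)  # average per-class rates

print(round(micro, 4))  # 0.9998
print(round(macro, 5))  # 0.49995
```

The micro average is dominated by the large class, while the macro average weights both classes equally, which is exactly the discrepancy shown above.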
Regards
Hello Emma,
Please find the response inline:
'The non-O classes aren't really imbalanced in this dataset'
But the 'O' class is over-represented; don't you think that will cause the neural network to, in effect, memorize that almost every tag is 'O', and that it will get away with predicting 'O' for most words?
Agree on the rest.
Regards
I am trying to incorporate a context window to fix the sequence lengths. This handles variable sequence lengths well, but each sequence of length n produces n such context windows, which in turn over-represents the 'O' tags.
Given the approach above, I have approximately 180k 'O' tag samples but only 10k PER tag samples. Weighted sampling doesn't seem to help, and given the challenge of maintaining contexts, crude over-sampling or under-sampling doesn't feel right.
Any suggestions on this, Mr. @patverga?
Regards
Hello Ms. @strubell,
I am not trying to copy-paste this code, but I am certainly trying to replicate the findings of the related research paper. Apologies; I tried starting a discussion on ResearchGate, but the thread seems pretty dormant there.
On your question about the CNN: I am trying to do something like the following instead of maxlen padding (which seems to be a very bad idea):
input sentence = ('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')
output_windows = [
('<PAD>', '<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to'),
('<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott'),
('<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British'),
('<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb'),
('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'),
('rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>'),
('German', 'call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>'),
('call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>', '<PAD>'),
('to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>', '<PAD>', '<PAD>')]
(A single sentence is converted to multiple sequences, so the number of samples from the already over-represented classes grows even further.)
This aggravates the class imbalance even more. Each tuple above, such as:
('<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott')
is treated as a sentence, which is then embedded and fed into an ID-CNN block along with its labels.
This method was meant to feed fixed-length sentences into the network and avoid the problem of variable-length sequences, since it converts a sentence of n tokens into n sequences of a fixed length.
In the case shown above, each sequence has a fixed length (len=9). Apologies for being a novice, but I couldn't think of any other way to handle variable-length sequences going into an ID-CNN block.
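A minimal sketch of this windowing (the helper name `context_windows` is mine, not from any repo): it centers one window of length 9 on each token, so a sentence of n tokens yields n windows, padded at the edges.

```python
def context_windows(tokens, size=9, pad="<PAD>"):
    """Return one fixed-length window centered on each token position."""
    half = size // 2  # pads needed on each side
    padded = [pad] * half + list(tokens) + [pad] * half
    # One window per token: a sentence of n tokens yields n windows.
    return [tuple(padded[i:i + size]) for i in range(len(tokens))]

sentence = ('EU', 'rejects', 'German', 'call', 'to',
            'boycott', 'British', 'lamb', '.')
windows = context_windows(sentence)
print(len(windows))  # 9
print(windows[-1])
# ('to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>', '<PAD>', '<PAD>')
```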
I hope that clarifies my position. In case it doesn't, please feel free to question back. It would be a privilege to have any suggestions from you or your team's end.
I am currently working on CONLL-2003 and I plan on implementing this on Ontonotes 5.0 after this succeeds.
Regards
Hello Ms. @strubell,
Thanks for the wonderful insights.
The shortest sentence had 2 tokens while the longest had over 100. The idea behind not using maxlen padding was to avoid creating sparse representations of sentences. With maxlen padding, a sentence with few tokens (say 5) would look like the one below, assuming the longest sentence is 100 tokens long (each number is a word index in the given vocabulary; the representations would still be sparse after embedding):
[15619, 3259, 15052, 29961, 48521, 0, 0, 0, 0, 0, 0, 0, ..., 0]  (100 terms in total)
Any thoughts on this? Are sparse representations good for ConvNets?
Further, don't we have to teach the model to distinguish between pads and real tokens by labeling the pads anyway?
Is there any other way you might suggest, besides padding, to handle variable-length sequences?
To answer your question, I am using a similar padding scheme for test and validation data too.
I had actually applied the padding scheme to the entire dataset (that gives approximately 2100k sentences, like so: ('<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British')) and then divided it into train, test, and validation sets.
Regards
Your understanding of how padding works in CNNs for text is incorrect. We don't have to train the model to predict pad tokens; in fact we do the opposite: we mask the padding so that the model doesn't incur a loss for those tokens, and we ignore the predictions. Similarly, we zero out the padding so it's not provided as input. This is the same thing you would do for, e.g., an LSTM or any other batched sequence model. I wouldn't call these sparse inputs, since the part the model is actually trained to reason over is very much dense.
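What this masking amounts to, in a simplified numpy sketch (not the repo's actual TensorFlow code; the `token_loss` matrix here is made up for illustration):

```python
import numpy as np

# Hypothetical per-token cross-entropy for a batch of 2 sequences, padded to 5.
token_loss = np.array([[0.2, 0.5, 0.1, 0.0, 0.0],   # length-3 sentence, 2 pads
                       [0.3, 0.4, 0.6, 0.2, 0.1]])  # length-5 sentence
# Mask is 1 for real tokens, 0 for padding.
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1]], dtype=float)

# Average loss over real tokens only; padded positions contribute no gradient.
masked_loss = (token_loss * mask).sum() / mask.sum()
print(round(masked_loss, 4))  # 0.3
```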
One of the ways we avoid the slowdown due to extra computation on padding is to batch sequences with other sequences of similar length. When doing this you'll usually never have a sequence of length 5 in the same batch as a sequence of length 100; the padding is never as drastic as your example.
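The batching-by-length idea can be sketched as follows (a simplification; `bucket_batches` is a hypothetical helper, not a function from the repo):

```python
# Sort sentences by length so that each batch only needs to be padded
# to the longest sentence within that batch, keeping padding minimal.
def bucket_batches(sentences, batch_size):
    ordered = sorted(sentences, key=len)
    return [ordered[i:i + batch_size]
            for i in range(0, len(ordered), batch_size)]

sents = [['tok'] * n for n in (5, 100, 7, 98, 6, 99)]
for batch in bucket_batches(sents, batch_size=3):
    print([len(s) for s in batch])
# [5, 6, 7]
# [98, 99, 100]
```

With this grouping, a length-5 sentence is never padded out to length 100.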
The way you're handling evaluation also doesn't make sense. Not only will your evaluation be incomparable to other work, which evaluates on the normal data, but think about the actual use case: if someone wants to use your code to tag a sentence, how would they use the output of your model? Your model will produce N different labelings of the same sequence.
from dilated-cnn-ner.
I did realize that, and I am currently working on masking the pads. What I couldn't understand is: if we provide sequences of different lengths in different batches, how does my ConvNet handle the varying dimensions of the input sequences?
Given the batches from your example earlier (batch size = 128 and embedding dimension = 50, say):
Batch dimensions for length-5 sequences after embedding: (128, 50, 5)
Batch dimensions for length-100 sequences: (128, 50, 100)
Shouldn't the ConvNet be fed fixed-dimensional inputs?
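For intuition (illustrative only, not the repo's implementation): in a 1-D convolution over text, the kernel has shape (width, emb_dim) and slides along the time axis, so only the embedding dimension is fixed; the output length simply tracks the input length. A plain numpy sketch, with `conv1d_valid` as a made-up helper:

```python
import numpy as np

def conv1d_valid(x, kernel):
    """Single-filter 'valid' 1-D convolution over a (seq_len, emb_dim) input."""
    width = kernel.shape[0]
    n = x.shape[0] - width + 1  # output length depends on input length
    return np.array([np.sum(x[i:i + width] * kernel) for i in range(n)])

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 50))   # one filter: width 3, 50-dim embeddings
short = rng.standard_normal((5, 50))    # length-5 sequence
longer = rng.standard_normal((100, 50)) # length-100 sequence
print(conv1d_valid(short, kernel).shape)   # (3,)
print(conv1d_valid(longer, kernel).shape)  # (98,)
```

The same filter weights apply to any sequence length; lengths only need to match within a batch, which is what the bucketing above provides.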
About the evaluation, apologies, but is there anything wrong with expecting an output of
[org, o, o, o, o, o, o, o, o]
for the sequence below?
('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')
Or do you suggest labeling it as ['org'] only?
Could we also generate additional feature tags, like POS tags for the other words, to train the model on the context around each word, and then just ignore the 'O' labels?
Regards