
Comments (12)

strubell commented on June 13, 2024

Here's an example breakdown of precision, recall, and F1 by label:

                F1      Prec    Recall
Micro Avg       90.82   91.05   90.60
-------
       LOC      92.56   92.21   92.93
      MISC      80.75   81.45   80.06
       PER      95.70   96.54   94.87
       ORG      88.59   88.61   88.56

I deal with O by training it like any other label.
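For reference, a minimal sketch of how a per-label breakdown like the one above can be computed from BIO-tagged sequences (this assumes the seqeval package; it is not the repo's own evaluation code):

from seqeval.metrics import classification_report

# Toy gold and predicted tag sequences; entity-level precision/recall/F1 is reported per label type.
gold = [['B-PER', 'I-PER', 'O', 'B-LOC', 'O'],
        ['B-ORG', 'O', 'B-MISC', 'O']]
pred = [['B-PER', 'I-PER', 'O', 'B-LOC', 'O'],
        ['B-ORG', 'O', 'O', 'O']]

print(classification_report(gold, pred, digits=2))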

from dilated-cnn-ner.

patverga commented on June 13, 2024

This is exactly what happens in the early stages of training. Initially, the biggest gains in loss reduction come from predicting the majority class. Over time, however, the model begins to distinguish the non-O classes in order to reduce the loss further. We can see that this is exactly what happens by looking at the final F1 score of the model, which shows it is clearly not just predicting O for every token. If the O class were over-represented severely enough, you might have to address the imbalance directly, but in this dataset that is not an issue.
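If you ever did need to address the imbalance directly, one common option is to down-weight the 'O' class in the loss. A rough TensorFlow sketch (not this repo's code; the class ids and weights below are made up):

import tensorflow as tf

def weighted_token_loss(logits, labels, class_weights):
    # logits: [batch, seq_len, num_classes]; labels: [batch, seq_len] integer tag ids.
    per_token = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    weights = tf.gather(class_weights, labels)  # weight of each token's gold class
    return tf.reduce_sum(per_token * weights) / tf.reduce_sum(weights)

# Example: down-weight 'O' (assumed to have id 0) relative to the eight B-/I- entity classes.
class_weights = tf.constant([0.2] + [1.0] * 8)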

from dilated-cnn-ner.

marc88 commented on June 13, 2024

Given the annotations of CONLL 2003, shouldn't it have around 9 classes?

'B-LOC': 7140,
'B-MISC': 3438,
'B-ORG': 6321,
'B-PER': 6600,
'I-LOC': 1157,
'I-MISC': 1155,
'I-ORG': 3704,
'I-PER': 4528,
'O': 169578

If the model is trained with 'O' tags, shouldn't there be an F1 score for the 'O' tag as well (like the one in my last comment)?

Moreover, since there is a high class imbalance (most tags are 'O'), shouldn't we take the macro average instead of the micro average? Otherwise, I think the reported numbers will be biased towards the performance of the dominant class.

As an example:

Class0 - TPR: 9999/10000=0.9999
Class1 - TPR: 0/1=0.0
micro-average TPR: (9999+0)/(10000+1)=0.9998
macro-average TPR: (0.9999+0.0)/2=0.49995
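A quick way to sanity-check these numbers (assuming scikit-learn; not part of the repo):

import numpy as np
from sklearn.metrics import recall_score

# 10000 samples of class 0 (9999 predicted correctly), 1 sample of class 1 (missed).
y_true = np.array([0] * 10000 + [1])
y_pred = np.array([0] * 9999 + [1] + [0])

print(recall_score(y_true, y_pred, average='micro'))  # ~0.9998
print(recall_score(y_true, y_pred, average='macro'))  # ~0.49995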

Regards

from dilated-cnn-ner.

strubell commented on June 13, 2024

from dilated-cnn-ner.

marc88 commented on June 13, 2024

Hello Emma,

Please find my response inline.

'The non-O classes aren't really imbalanced in this dataset'

But the 'O' class is over-represented, and don't you think that will cause the neural network to essentially memorize that 'almost every tag is an O, and I'll get away with it most of the time if I predict that a word belongs to class O'?

Agree on the rest.

Regards

from dilated-cnn-ner.

marc88 commented on June 13, 2024

I am trying to incorporate a context window to fix the sequence lengths. This handles variable sequence lengths well, but each sequence of length n produces n such context windows, which in turn over-represents the 'O' tags.
With this approach I have approximately 180k 'O' tag samples but only 10k PER tag samples. Weighted sampling doesn't seem to help, and given the challenge of maintaining contexts, crude over-sampling or under-sampling doesn't feel right.

Any suggestions on this, Mr. @patverga?

Regards

from dilated-cnn-ner.

strubell commented on June 13, 2024

from dilated-cnn-ner.

marc88 commented on June 13, 2024

Hello Ms. @strubell ,

I am not exactly trying to copy-paste this code, but I am certainly trying to replicate the findings of the related research paper. Apologies for that; I tried starting a discussion on ResearchGate, but the thread seems pretty dormant there.

On your question about describing the CNN: instead of maxlen padding (which seems like a very bad idea), I am trying to do something like the following:

input_sentence = ('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')

output_windows = [
('<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott'),
('<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British'),
('<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb'),
('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'),
('rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>'),
('German', 'call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>'),
('call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>', '<PAD>'),
('to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>', '<PAD>', '<PAD>')
]
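A minimal sketch of one way to generate windows of this kind (the exact pad offsets may differ from the listing above):

def context_windows(tokens, window_len=9, pad='<PAD>'):
    # One fixed-length window per token, padded with '<PAD>' at the edges.
    left = (window_len - 1) // 2
    right = window_len - 1 - left
    padded = [pad] * left + list(tokens) + [pad] * right
    return [tuple(padded[i:i + window_len]) for i in range(len(tokens))]

for window in context_windows(input_sentence):
    print(window)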

(A single sentence is converted to multiple sequences, so the number of samples from the over-represented classes increases even further.)
This obviously makes the class imbalance even worse. Each tuple above, such as
('<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott'),
is treated as a sentence, which is then embedded and fed into an ID-CNN block along with its labels.

This method feeds fixed-length sentences into the network and avoids the problem of variable-length sequences, since it converts a sentence of n tokens into n sequences of a fixed length.
In the case shown above, each sequence has a fixed length (len=9). Apologies for being a novice, but I couldn't think of any other way to deal with variable-length sequences going into an ID-CNN block.

I hope that clarifies my position; if it doesn't, please feel free to ask further questions. It would be a privilege to get suggestions from you or your team.

I am currently working on CoNLL-2003, and I plan to move on to OntoNotes 5.0 once this succeeds.

Regards

from dilated-cnn-ner.

strubell commented on June 13, 2024

from dilated-cnn-ner.

marc88 commented on June 13, 2024

Hello Ms. @strubell ,

Thanks for the wonderful insights.

The shortest sentence had 2 tokens, while the longest had over 100 tokens. The idea behind not using maxlen padding was to avoid creating sparse representations of sentences. With maxlen padding, a sentence with only a few tokens (say 5) would look like the one below (assuming the longest sentence is 100 tokens; each number is a word index in the given vocabulary, and the representations would still be sparse after embedding):
[15619, 3259, 15052, 29961, 48521, 0, 0, 0, 0, 0, 0, 0, 0, ... up to 100 terms]
Any thoughts on this? Are sparse representations good for ConvNets?

Further, we would have to teach the model to distinguish between pads and real tokens by labeling the pads anyway.
Is there any other way you would suggest, besides padding, to handle variable-length sequences?

To answer your question, I am using a similar padding scheme for the test and validation data too.
I actually applied the padding scheme to the entire dataset (which gives roughly 2100k sequences like ('<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British')) and then divided it into train, test, and validation sets.

Regards

from dilated-cnn-ner.

strubell commented on June 13, 2024

Your understanding of how padding works in CNNs for text is incorrect. We don't have to train the model to predict pad tokens, in fact we do the opposite, and mask the padding so that the model doesn't get a loss for those tokens, and we ignore the predictions. Similarly, we zero out the padding so it's not provided as input. This is the same thing you would do for e.g. an LSTM or any other batched sequence model. I wouldn't call these sparse inputs, since the part the model is actually trained to reason over is very much dense.
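A minimal sketch of the masking described above (not the repo's actual code; the names are assumptions):

import tensorflow as tf

def masked_token_loss(logits, labels, seq_lens):
    # logits: [batch, max_len, num_classes]; labels: [batch, max_len]; seq_lens: [batch].
    # Pad positions get zero weight, so they contribute no loss, and their
    # predictions can simply be ignored at evaluation time.
    mask = tf.sequence_mask(seq_lens, maxlen=tf.shape(labels)[1], dtype=tf.float32)
    per_token = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    return tf.reduce_sum(per_token * mask) / tf.reduce_sum(mask)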

One of the ways we avoid the slowdown due to extra computation on padding is to batch sequences with other sequences of similar length. When doing this you'll usually never have a sequence of length 5 in the same batch as a sequence of length 100; the padding is never as drastic as your example.
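One simple way to get that behaviour (a sketch of the general idea, not necessarily how this repo builds its batches):

def length_bucketed_batches(sentences, batch_size):
    # Sort by length so each batch contains sentences of similar length,
    # keeping the amount of padding per batch small.
    ordered = sorted(sentences, key=len)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]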

The way you're handling evaluation also doesn't make sense. Not only will your evaluation not be comparable to other work which evaluates on the normal data, but think about the actual use case. If someone wants to use your code to tag a sentence, how would they use the output of your model? Your model will produce N different labelings of the sequence.

from dilated-cnn-ner.

marc88 commented on June 13, 2024

I did realize that, and I am currently working on masking the pads. What I couldn't understand is: if we provide sequences of different lengths in different batches, how does my convnet handle this variation in the dimensions of the input sequences?
Given the batches from your earlier example (batch size = 128 and embedding dimension = 50, say):
Batch dimensions for the length-5 sequences after embedding: (128, 50, 5)
Batch dimensions for the length-100 sequences: (128, 50, 100)

Shouldn't the convnet be fed fixed-dimensional inputs?
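A quick sketch (assuming tf.keras and a channels-last layout, unlike the shapes quoted above) showing that a 1-D convolution only fixes the embedding/channel dimension, not the sequence length:

import tensorflow as tf

conv = tf.keras.layers.Conv1D(filters=64, kernel_size=3, padding='same')

short = tf.random.normal([128, 5, 50])    # 128 sequences of length 5, 50-dim embeddings
long_ = tf.random.normal([128, 100, 50])  # 128 sequences of length 100, same embedding size

print(conv(short).shape)  # (128, 5, 64)
print(conv(long_).shape)  # (128, 100, 64)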

About evaluation: apologies, but is there anything wrong with expecting an output of
[org, o, o, o, o, o, o, o, o]
for the sequence below?
('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')

Or do you suggest labeling it as ['org'] only?
Could we generate some additional feature tags, like POS tags, for the other words so the model is trained on the context around the word, and then just ignore the 'O' labels?

Regards

from dilated-cnn-ner.
