
Comments (12)

strubell commented on June 13, 2024

Here's an example breakdown of precision, recall, and F1 by label:

                F1      Prec    Recall
Micro Avg       90.82   91.05   90.60
-------
       LOC      92.56   92.21   92.93
      MISC      80.75   81.45   80.06
       PER      95.70   96.54   94.87
       ORG      88.59   88.61   88.56

I deal with O by training it like any other label.
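For reference, a minimal sketch of how a per-label breakdown like the one above can be computed from BIO-tagged sequences (this assumes the seqeval package; it is not the repo's own evaluation code):

from seqeval.metrics import classification_report

# Toy gold and predicted tag sequences; entity-level precision/recall/F1 is reported per label type.
gold = [['B-PER', 'I-PER', 'O', 'B-LOC', 'O'],
        ['B-ORG', 'O', 'B-MISC', 'O']]
pred = [['B-PER', 'I-PER', 'O', 'B-LOC', 'O'],
        ['B-ORG', 'O', 'O', 'O']]

print(classification_report(gold, pred, digits=2))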

from dilated-cnn-ner.

patverga commented on June 13, 2024

This is exactly what happens in the early stages of training. Initially, the biggest gains in loss reduction come from predicting the majority class. Over time, however, the model begins to distinguish the non-O classes in order to reduce the loss further. We can see that this is exactly what happens by looking at the final F1 score of the model, which shows it is clearly not just predicting O for every token. If the O class were over-represented severely enough, you might have to address the imbalance directly, but in this dataset that is not an issue.
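If you ever did need to address the imbalance directly, one common option is to down-weight the 'O' class in the loss. A rough TensorFlow sketch (not this repo's code; the class ids and weights below are made up):

import tensorflow as tf

def weighted_token_loss(logits, labels, class_weights):
    # logits: [batch, seq_len, num_classes]; labels: [batch, seq_len] integer tag ids.
    per_token = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    weights = tf.gather(class_weights, labels)  # weight of each token's gold class
    return tf.reduce_sum(per_token * weights) / tf.reduce_sum(weights)

# Example: down-weight 'O' (assumed to have id 0) relative to the eight B-/I- entity classes.
class_weights = tf.constant([0.2] + [1.0] * 8)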

from dilated-cnn-ner.

marc88 commented on June 13, 2024

Given the annotations of CONLL 2003, shouldn't it have around 9 classes?

'B-LOC': 7140,
'B-MISC': 3438,
'B-ORG': 6321,
'B-PER': 6600,
'I-LOC': 1157,
'I-MISC': 1155,
'I-ORG': 3704,
'I-PER': 4528,
'O': 169578

If the model is trained with 'O' tags, shouldn't there be an F1 score for the 'O' tag as well (like the one in my last comment)?

Moreover, since there is a high class imbalance (most tags are 'O'), shouldn't we take the macro average instead of the micro average? Otherwise, I think the reported numbers will be biased towards the performance of the dominant class.

As an example:

Class0 - TPR: 9999/10000=0.9999
Class1 - TPR: 0/1=0.0
micro-average TPR: (9999+0)/(10000+1)=0.9998
macro-average TPR: (0.9999+0.0)/2=0.49995
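A quick way to sanity-check these numbers (assuming scikit-learn; not part of the repo):

import numpy as np
from sklearn.metrics import recall_score

# 10000 samples of class 0 (9999 predicted correctly), 1 sample of class 1 (missed).
y_true = np.array([0] * 10000 + [1])
y_pred = np.array([0] * 9999 + [1] + [0])

print(recall_score(y_true, y_pred, average='micro'))  # ~0.9998
print(recall_score(y_true, y_pred, average='macro'))  # ~0.49995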

Regards

from dilated-cnn-ner.

strubell commented on June 13, 2024

from dilated-cnn-ner.

marc88 commented on June 13, 2024

Hello Emma,

Please find my response inline.

'The non-O classes aren't really imbalanced in this dataset'

But the 'O' class is over-represented, and don't you think that will cause the neural network to essentially memorize that 'almost every tag is an O, and I'll get away with it most of the time if I predict that a word belongs to class O'?

Agree on the rest.

Regards

from dilated-cnn-ner.

marc88 commented on June 13, 2024

I am trying to incorporate a context window to fix the sequence lengths. This handles variable sequence lengths well, but each sequence of length n produces n such context windows, which in turn over-represents the 'O' tags.
With this approach I have approximately 180k 'O' tag samples but only 10k PER tag samples. Weighted sampling doesn't seem to help, and given the challenge of maintaining contexts, crude over-sampling or under-sampling doesn't feel right.

Any suggestions on this, Mr. @patverga?

Regards

from dilated-cnn-ner.

strubell commented on June 13, 2024

from dilated-cnn-ner.

marc88 commented on June 13, 2024

Hello Ms. @strubell ,

I am not exactly trying to copy-paste this code, but I am certainly trying to replicate the findings of the related research paper. Apologies for that; I tried starting a discussion on ResearchGate, but the thread seems pretty dormant there.

On your question about describing the CNN: instead of maxlen padding (which seems like a very bad idea), I am trying to do something like the following:

input_sentence = ('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')

output_windows = [
('<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott'),
('<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British'),
('<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb'),
('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'),
('rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>'),
('German', 'call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>'),
('call', 'to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>', '<PAD>'),
('to', 'boycott', 'British', 'lamb', '.', '<PAD>', '<PAD>', '<PAD>', '<PAD>')
]
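A minimal sketch of one way to generate windows of this kind (the exact pad offsets may differ from the listing above):

def context_windows(tokens, window_len=9, pad='<PAD>'):
    # One fixed-length window per token, padded with '<PAD>' at the edges.
    left = (window_len - 1) // 2
    right = window_len - 1 - left
    padded = [pad] * left + list(tokens) + [pad] * right
    return [tuple(padded[i:i + window_len]) for i in range(len(tokens))]

for window in context_windows(input_sentence):
    print(window)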

(A single sentence is converted to multiple sequences, so the number of samples from the over-represented classes increases even further.)
This obviously makes the class imbalance even worse. Each tuple above, such as
('<PAD>', '<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott'),
is treated as a sentence, which is then embedded and fed into an ID-CNN block along with its labels.

This method feeds fixed-length sentences into the network and avoids the problem of variable-length sequences, since it converts a sentence of n tokens into n sequences of a fixed length.
In the case shown above, each sequence has a fixed length (len=9). Apologies for being a novice, but I couldn't think of any other way to deal with variable-length sequences going into an ID-CNN block.

I hope that clarifies my position; if it doesn't, please feel free to ask further questions. It would be a privilege to get suggestions from you or your team.

I am currently working on CoNLL-2003, and I plan to move on to OntoNotes 5.0 once this succeeds.

Regards

from dilated-cnn-ner.

strubell commented on June 13, 2024

from dilated-cnn-ner.

marc88 commented on June 13, 2024

Hello Ms. @strubell ,

Thanks for the wonderful insights.

The shortest sentence had 2 tokens, while the longest had over 100 tokens. The idea behind not using maxlen padding was to avoid creating sparse representations of sentences. With maxlen padding, a sentence with only a few tokens (say 5) would look like the one below (assuming the longest sentence is 100 tokens; each number is a word index in the given vocabulary, and the representations would still be sparse after embedding):
[15619, 3259, 15052, 29961, 48521, 0, 0, 0, 0, 0, 0, 0, 0, ... up to 100 terms]
Any thoughts on this? Are sparse representations good for ConvNets?

Further, we would have to teach the model to distinguish between pads and real tokens by labeling the pads anyway.
Is there any other way you would suggest, besides padding, to handle variable-length sequences?

To answer your question, I am using a similar padding scheme for the test and validation data too.
I actually applied the padding scheme to the entire dataset (which gives roughly 2100k sequences like ('<PAD>', '<PAD>', 'EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British')) and then divided it into train, test, and validation sets.

Regards

from dilated-cnn-ner.

strubell commented on June 13, 2024

Your understanding of how padding works in CNNs for text is incorrect. We don't have to train the model to predict pad tokens, in fact we do the opposite, and mask the padding so that the model doesn't get a loss for those tokens, and we ignore the predictions. Similarly, we zero out the padding so it's not provided as input. This is the same thing you would do for e.g. an LSTM or any other batched sequence model. I wouldn't call these sparse inputs, since the part the model is actually trained to reason over is very much dense.
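A minimal sketch of the masking described above (not the repo's actual code; the names are assumptions):

import tensorflow as tf

def masked_token_loss(logits, labels, seq_lens):
    # logits: [batch, max_len, num_classes]; labels: [batch, max_len]; seq_lens: [batch].
    # Pad positions get zero weight, so they contribute no loss, and their
    # predictions can simply be ignored at evaluation time.
    mask = tf.sequence_mask(seq_lens, maxlen=tf.shape(labels)[1], dtype=tf.float32)
    per_token = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits)
    return tf.reduce_sum(per_token * mask) / tf.reduce_sum(mask)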

One of the ways we avoid the slowdown due to extra computation on padding is to batch sequences with other sequences of similar length. When doing this you'll usually never have a sequence of length 5 in the same batch as a sequence of length 100; the padding is never as drastic as your example.
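One simple way to get that behaviour (a sketch of the general idea, not necessarily how this repo builds its batches):

def length_bucketed_batches(sentences, batch_size):
    # Sort by length so each batch contains sentences of similar length,
    # keeping the amount of padding per batch small.
    ordered = sorted(sentences, key=len)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]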

The way you're handling evaluation also doesn't make sense. Not only will your evaluation not be comparable to other work which evaluates on the normal data, but think about the actual use case. If someone wants to use your code to tag a sentence, how would they use the output of your model? Your model will produce N different labelings of the sequence.

from dilated-cnn-ner.

marc88 commented on June 13, 2024

I did realize that, and I am currently working on masking the pads. What I couldn't understand is: if we provide sequences of different lengths in different batches, how does my convnet handle this variation in the dimensions of the input sequences?
Given the batches from your earlier example (batch size = 128 and embedding dimension = 50, say):
Batch dimensions for the length-5 sequences after embedding: (128, 50, 5)
Batch dimensions for the length-100 sequences: (128, 50, 100)

Shouldn't the convnet be fed fixed-dimensional inputs?
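A quick sketch (assuming tf.keras and a channels-last layout, unlike the shapes quoted above) showing that a 1-D convolution only fixes the embedding/channel dimension, not the sequence length:

import tensorflow as tf

conv = tf.keras.layers.Conv1D(filters=64, kernel_size=3, padding='same')

short = tf.random.normal([128, 5, 50])    # 128 sequences of length 5, 50-dim embeddings
long_ = tf.random.normal([128, 100, 50])  # 128 sequences of length 100, same embedding size

print(conv(short).shape)  # (128, 5, 64)
print(conv(long_).shape)  # (128, 100, 64)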

About evaluation: apologies, but is there anything wrong with expecting an output of
[org, o, o, o, o, o, o, o, o]
for the sequence below?
('EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.')

Or do you suggest labeling it as ['org'] only?
Could we generate some additional feature tags, like POS tags, for the other words so the model is trained on the context around the word, and then just ignore the 'O' labels?

Regards

from dilated-cnn-ner.
