Context I'm training an NER model using the H

[Question] Underspecified Labels w/ out Fine-Grained Label

norskregnesentral commented on June 12, 2024

[Question] Underspecified Labels w/ out Fine-Grained Label


Comments (5)

plison commented on June 12, 2024

Hmm, this shouldn't happen indeed. Your code seems correct; I don't see any error. Would it be possible to send me the spaCy document (with annotated spans) that triggers the error?

Here's a minimal piece of code I used to test the behavior:

import spacy, skweak
nlp = spacy.load("en_core_web_md")
doc = nlp("This is a test for Pierre Lison living in Oslo, and here is another random Entity, "
          + "and a final person peter jackson.")
doc.spans["lf1"] = [spacy.tokens.Span(doc,5,7, "A"), 
                    spacy.tokens.Span(doc, 22, 24, "A")]
doc.spans["lf2"] = [spacy.tokens.Span(doc,9,10, "B")]
doc.spans["lf3"] = [spacy.tokens.Span(doc, 5,7, "C"), 
                    spacy.tokens.Span(doc,9,10, "C"), 
                    spacy.tokens.Span(doc, 16, 17, "C")]

hmm = skweak.aggregation.HMM("hmm", ["A", "B"], sequence_labelling=True)
hmm.add_underspecified_label("C", ["A", "B"])
_ = hmm.fit_and_aggregate([doc])

When it comes to your questions: no, your initial code was correct, you shouldn't include C as a possible label option if C is an underspecified label. Basically, the underspecified labels are part of the possible HMM observations (outputs from the labelling functions), but are not part of the HMM states. If you call the pretty_print function, you can see the observation matrices (one per labelling function), where the possible states only include A and B, while the LF observations include A, B and C.
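To make the distinction between states and observations concrete, here is a small pure-Python sketch. It is only a conceptual illustration and does not reproduce skweak's internals; the names `STATES`, `UNDERSPECIFIED`, `OBSERVATIONS` and `compatible_states` are hypothetical and introduced just for this example.

```python
# Conceptual sketch only: this mirrors the idea described above,
# not skweak's actual implementation.

# Hidden states of the HMM: only the fully specified labels.
STATES = ["A", "B"]

# Each underspecified label maps to the set of states it is compatible with
# (hypothetical structure, chosen for illustration).
UNDERSPECIFIED = {"C": {"A", "B"}}

# Labels a labelling function may emit: the states plus the underspecified labels.
OBSERVATIONS = STATES + sorted(UNDERSPECIFIED)


def compatible_states(observed_label):
    """Return the hidden states that an observed label is consistent with."""
    if observed_label in UNDERSPECIFIED:
        return sorted(UNDERSPECIFIED[observed_label])
    if observed_label in STATES:
        return [observed_label]
    raise ValueError(f"Unknown label: {observed_label!r}")


for label in OBSERVATIONS:
    print(label, "->", compatible_states(label))
```

In this picture, "C" can be emitted by a labelling function, but it is never a state the model can be in: it only narrows the candidate states down to A or B.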


schopra8 commented on June 12, 2024

Thanks Pierre! I've included the .spacy file in the attached zip folder.

Replication:

Skweak Version: GitHub Master Branch
Code Snippet:

from skweak import aggregation
from skweak.utils import docbin_reader

# NER Labels
HRD_TAG = 'HRD' # Underspecified Label
PGL_TAG = 'PGL'
DB_TAG = 'DB'
SW_TAG = 'SW'
ORG_TAG = 'ORG'

# docbin_reader yields documents, so collect them into a list
docs = list(docbin_reader('example_error.spacy'))
hmm = aggregation.HMM("hmm", [PGL_TAG, DB_TAG, SW_TAG, ORG_TAG], sequence_labelling=True)
hmm.add_underspecified_label(HRD_TAG, [PGL_TAG, DB_TAG, SW_TAG])
_ = hmm.fit_and_aggregate(docs)

example_error.spacy.zip


plison commented on June 12, 2024

Thanks! I had a look at your document, but it seems like the only label that is actually provided in this dataset is the underspecified HRD label:

saro_products_lf: {angular: 'HRD'}
saro_tools_lf: {}
o_net_skills_lf: {}
dice_skills_lf: {angular: 'HRD'}
multi_token_dice_detector: {}
multi_token_saro_products_detector: {}
multi_token_saro_tools_detector: {}
multi_token_o_net_skills_detector: {}
digital_com_pgls_lf: {}
ne_pgls_lf: {}
so_pgls_lf: {}
st_pgls_lf: {}
wiki_pgls_lf: {}
db_engines_dbs_lf: {}
nosql_wiki_dbs_lf: {}
popular_dbs_lf: {}
rbdms_wiki_dbs_lf: {}
so_dbs_lf: {}
software_companies_lf: {}
company_with_punctuation_hard_skill_detector: {}
company_within_noun_phrase_detector: {}
company_with_acronym_detector: {}
company_into_database_detector: {}
oracle_into_database_detector: {}
hmm: {}
verb_detector: {}

That's what confuses the model: it hasn't seen any observations of the actual labels you want to aggregate (PGL, DB, etc.), which means the transition and observation models are impossible to estimate.
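A quick way to catch this situation before fitting is to count which labels the labelling functions actually emitted. The sketch below is a hedged illustration, not part of skweak: `observed_label_counts` and `unobserved_states` are hypothetical helpers, and they only assume that each document exposes a dict-like `spans` attribute whose spans carry a `label_`, as spaCy `Doc` objects do. The `fake_doc` object is a stand-in so the example runs without spaCy or the attached file.

```python
from collections import Counter
from types import SimpleNamespace


def observed_label_counts(docs):
    """Count how often each label appears across all span groups of all docs."""
    counts = Counter()
    for doc in docs:
        for spans in doc.spans.values():
            for span in spans:
                counts[span.label_] += 1
    return counts


def unobserved_states(docs, state_labels):
    """Return the state labels that no labelling function ever emitted."""
    counts = observed_label_counts(docs)
    return [label for label in state_labels if counts[label] == 0]


# Stand-in for a spaCy Doc, mimicking the situation reported above:
# only the underspecified HRD label is ever produced.
fake_doc = SimpleNamespace(spans={
    "saro_products_lf": [SimpleNamespace(label_="HRD")],
    "dice_skills_lf": [SimpleNamespace(label_="HRD")],
    "so_pgls_lf": [],
})

missing = unobserved_states([fake_doc], ["PGL", "DB", "SW", "ORG"])
print(missing)
```

With real data, the same helpers could be called on the list returned by `docbin_reader`. If the documents already contain an aggregated span group (such as "hmm"), you may want to exclude it from the count.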


schopra8 commented on June 12, 2024

Thanks for the explanation Pierre! I think I've misunderstood how underspecified labels work.

I've been presuming that:

1. I can label sequences with the underspecified label (if no other, more specific labels are present)
2. I can "back off" to the underspecified label if there is disagreement between more specific labels

Am I correct in stating that assumption 1 is incorrect and assumption 2 is correct? And is there a good way to realize both assumptions in skweak?

As you saw in the example doc, I have entities that I know belong to "HRD", but I don't know which specific sub-category they should be assigned to (i.e., they are not present in the sub-category gazetteers). Thanks again!


plison commented on June 12, 2024

Yes, you are correct, the underspecified labels are employed to provide a "weaker" signal (i.e. allowing a labelling function to output a subset of possible labels instead of a single one). But they are not meant to be used as some kind of hierarchical labelling, where one can "back-off" to the underspecified value. That would indeed be very interesting to investigate, but it would require a much more advanced probabilistic model than a classical HMM.

I guess a relatively quick fix would be to add this HRD value to the list of output labels that can be aggregated over, like this:

docs = list(skweak.utils.docbin_reader('example_error.spacy'))
hmm = skweak.aggregation.HMM("hmm", [PGL_TAG, DB_TAG, SW_TAG, ORG_TAG, HRD_TAG], sequence_labelling=True)
hmm.add_underspecified_label(HRD_TAG, [PGL_TAG, DB_TAG, SW_TAG, HRD_TAG])
_ = hmm.fit_and_aggregate(docs)

But I'm not really sure about what kind of solutions the EM algorithm will converge to in this setting.
