Comments (5)

plison avatar plison commented on June 12, 2024

Hmm, this indeed shouldn't happen. Your code seems correct; I don't see any error. Would it be possible to send me the spaCy document (with annotated spans) that triggers the error?

Here's a minimal piece of code I used to test the behavior:

import spacy, skweak
nlp = spacy.load("en_core_web_md")
doc = nlp("This is a test for Pierre Lison living in Oslo, and here is another random Entity, "
          + "and a final person peter jackson.")
doc.spans["lf1"] = [spacy.tokens.Span(doc,5,7, "A"), 
                    spacy.tokens.Span(doc, 22, 24, "A")]
doc.spans["lf2"] = [spacy.tokens.Span(doc,9,10, "B")]
doc.spans["lf3"] = [spacy.tokens.Span(doc, 5,7, "C"), 
                    spacy.tokens.Span(doc,9,10, "C"), 
                    spacy.tokens.Span(doc, 16, 17, "C")]

hmm = skweak.aggregation.HMM("hmm", ["A", "B"], sequence_labelling=True)
hmm.add_underspecified_label("C", ["A", "B"])
_ = hmm.fit_and_aggregate([doc])

When it comes to your questions: no, your initial code was correct. You shouldn't include C as a possible label option if C is an underspecified label. Basically, the underspecified labels are part of the possible HMM observations (outputs from the labelling functions), but are not part of the HMM states. If you call the pretty_print function, you can see the observation matrices (one per labelling function), where the possible states only include A and B, while the LF observations include A, B and C.
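To make the state/observation distinction concrete, here is a small plain-Python sketch. It does not use skweak and is not how the library stores things internally; the names `STATES`, `UNDERSPECIFIED`, `OBSERVATIONS` and `compatible_states`, and the extra "O" (no entity) state, are assumptions made up for illustration. It only shows the bookkeeping: C is something a labelling function may output, but never a state the HMM can be in.

```python
# Illustration only: hypothetical names, not skweak's internal representation.
STATES = ["O", "A", "B"]            # hidden states ("O" = no entity, assumed here)
UNDERSPECIFIED = {"C": {"A", "B"}}  # C means "either A or B"

# Labelling functions may emit any state label or any underspecified label.
OBSERVATIONS = STATES + sorted(UNDERSPECIFIED)


def compatible_states(observed_label):
    """Return the hidden states that an observed LF label is consistent with."""
    if observed_label in UNDERSPECIFIED:
        return set(UNDERSPECIFIED[observed_label])
    if observed_label in STATES:
        return {observed_label}
    raise ValueError(f"unknown label: {observed_label!r}")


print(OBSERVATIONS)
print(compatible_states("C"))
print(compatible_states("A"))
```

In this picture, adding C to the list of states would change its meaning from "A or B" to a third, separate category.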

from skweak.

schopra8 avatar schopra8 commented on June 12, 2024

Thanks Pierre! I've included the .spacy file in the attached zip folder.

Replication:

  • Skweak Version: GitHub Master Branch
  • Code Snippet:
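# Imports assumed by the snippet
from skweak import aggregation
from skweak.utils import docbin_reader

# NER labels
HRD_TAG = 'HRD'  # underspecified label
PGL_TAG = 'PGL'
DB_TAG = 'DB'
SW_TAG = 'SW'
ORG_TAG = 'ORG'

# Load the annotated documents and aggregate the labelling functions
docs = list(docbin_reader('example_error.spacy'))
hmm = aggregation.HMM("hmm", [PGL_TAG, DB_TAG, SW_TAG, ORG_TAG], sequence_labelling=True)
hmm.add_underspecified_label(HRD_TAG, [PGL_TAG, DB_TAG, SW_TAG])
_ = hmm.fit_and_aggregate(docs)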

example_error.spacy.zip

plison avatar plison commented on June 12, 2024

Thanks! I had a look at your document, but it seems like the only label that is actually provided in this dataset is the underspecified HRD label:

saro_products_lf: {angular: 'HRD'}
saro_tools_lf: {}
o_net_skills_lf: {}
dice_skills_lf: {angular: 'HRD'}
multi_token_dice_detector: {}
multi_token_saro_products_detector: {}
multi_token_saro_tools_detector: {}
multi_token_o_net_skills_detector: {}
digital_com_pgls_lf: {}
ne_pgls_lf: {}
so_pgls_lf: {}
st_pgls_lf: {}
wiki_pgls_lf: {}
db_engines_dbs_lf: {}
nosql_wiki_dbs_lf: {}
popular_dbs_lf: {}
rbdms_wiki_dbs_lf: {}
so_dbs_lf: {}
software_companies_lf: {}
company_with_punctuation_hard_skill_detector: {}
company_within_noun_phrase_detector: {}
company_with_acronym_detector: {}
company_into_database_detector: {}
oracle_into_database_detector: {}
hmm: {}
verb_detector: {}

That's what confuses the model: it hasn't seen any observation of the actual labels you want to aggregate (PGL, DB, etc.), which means the transition and observation models are impossible to estimate.
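A quick way to spot this situation before fitting is to count which labels the labelling functions actually produce. The sketch below is plain Python over a dictionary shaped like the output above; `count_labels` and `missing_labels` are hypothetical helpers, not part of skweak, and only a few of the labelling functions are copied in.

```python
from collections import Counter


def count_labels(lf_outputs):
    """Count how often each label is produced across all labelling functions."""
    counts = Counter()
    for spans in lf_outputs.values():
        counts.update(spans.values())
    return counts


def missing_labels(lf_outputs, target_labels):
    """Return the target labels that no labelling function ever outputs."""
    counts = count_labels(lf_outputs)
    return [label for label in target_labels if counts[label] == 0]


# Subset of the output listed above (the other labelling functions are empty too).
lf_outputs = {
    "saro_products_lf": {"angular": "HRD"},
    "saro_tools_lf": {},
    "dice_skills_lf": {"angular": "HRD"},
    "so_pgls_lf": {},
}

print(count_labels(lf_outputs))
print(missing_labels(lf_outputs, ["PGL", "DB", "SW", "ORG"]))
```

If every label you want to aggregate shows up as missing, the model has nothing to estimate its parameters from.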

schopra8 avatar schopra8 commented on June 12, 2024

Thanks for the explanation Pierre! I think I've misunderstood how underspecified labels work.

I've been presuming that:

  1. I can label sequences with the underspecified label (if no more specific labels are present)
  2. I can "back off" to the underspecified label if there is disagreement between more specific labels

Am I correct in stating that assumption 1 is incorrect and assumption 2 is correct? And is there a good way to realize both assumptions in skweak?

As you saw in the example doc, I have entities that I know belong to "HRD", but I don't know which specific sub-category they should be assigned to (i.e., they are not present in the sub-category gazetteers). Thanks again!

plison avatar plison commented on June 12, 2024

Yes, you are correct, the underspecified labels are employed to provide a "weaker" signal (i.e. allowing a labelling function to output a subset of possible labels instead of a single one). But they are not meant to be used as some kind of hierarchical labelling, where one can "back-off" to the underspecified value. That would indeed be very interesting to investigate, but it would require a much more advanced probabilistic model than a classical HMM.
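As a rough intuition for what "weaker signal" means, here is a toy Bayes update in plain Python. All the numbers are invented, the "O" (no entity) state is an assumption, and this is not how skweak estimates or applies its observation model; it only compares a specific observation with an underspecified one.

```python
def posterior(prior, likelihood):
    """Bayes' rule over a finite set of states: normalise prior * likelihood."""
    unnormalised = {s: prior[s] * likelihood[s] for s in prior}
    total = sum(unnormalised.values())
    return {s: p / total for s, p in unnormalised.items()}


# Invented numbers, for illustration only.
prior = {"O": 0.8, "A": 0.1, "B": 0.1}

# P(LF outputs this label | true state):
# a specific "A" points at one state, while the underspecified "C"
# is equally likely under A and under B.
lik_specific = {"O": 0.01, "A": 0.90, "B": 0.01}
lik_underspecified = {"O": 0.01, "A": 0.90, "B": 0.90}

post_specific = posterior(prior, lik_specific)
post_under = posterior(prior, lik_underspecified)

print(post_specific)
print(post_under)
```

The underspecified observation moves probability away from O but leaves A and B tied. Nothing in this picture lets the model output C itself, which is why a back-off to the underspecified value is not available.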

I guess a relatively quick fix would be to add this HRD value to the list of output labels that can be aggregated over, like this:

docs = list(skweak.utils.docbin_reader('example_error.spacy'))
hmm = skweak.aggregation.HMM("hmm", [PGL_TAG, DB_TAG, SW_TAG, ORG_TAG, HRD_TAG], sequence_labelling=True)
hmm.add_underspecified_label(HRD_TAG, [PGL_TAG, DB_TAG, SW_TAG, HRD_TAG])
_ = hmm.fit_and_aggregate(docs)

But I'm not really sure about what kind of solutions the EM algorithm will converge to in this setting.
