Comments (5)
Hmm, this shouldn't happen indeed. Your code seems correct, and I don't see any error. Would it be possible to send me the spaCy document (with annotated spans) that triggers the error?
Here's a minimal piece of code I used to test the behavior:
import spacy
import skweak

nlp = spacy.load("en_core_web_md")
doc = nlp("This is a test for Pierre Lison living in Oslo, and here is another random Entity, "
          + "and a final person peter jackson.")
doc.spans["lf1"] = [spacy.tokens.Span(doc, 5, 7, "A"),
                    spacy.tokens.Span(doc, 22, 24, "A")]
doc.spans["lf2"] = [spacy.tokens.Span(doc, 9, 10, "B")]
doc.spans["lf3"] = [spacy.tokens.Span(doc, 5, 7, "C"),
                    spacy.tokens.Span(doc, 9, 10, "C"),
                    spacy.tokens.Span(doc, 16, 17, "C")]
hmm = skweak.aggregation.HMM("hmm", ["A", "B"], sequence_labelling=True)
hmm.add_underspecified_label("C", ["A", "B"])
_ = hmm.fit_and_aggregate([doc])
When it comes to your questions: no, your initial code was correct, you shouldn't include C as a possible label option if C is an underspecified label. Basically, the underspecified labels are part of the possible HMM observations (outputs from the labelling functions), but are not part of the HMM states. If you call the pretty_print function, you can see the observation matrices (one per labelling function), where the possible states only include A and B, while the LF observations include A, B and C.
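To make the distinction between states and observations concrete, here is a small plain-Python sketch. It does not use skweak and does not reproduce its internals; the function name `split_label_spaces` is hypothetical and only illustrates how the two label sets relate under the description above.

```python
def split_label_spaces(state_labels, underspecified):
    """Illustrative only (hypothetical helper, not part of skweak).

    state_labels: labels the HMM can output, e.g. ["A", "B"].
    underspecified: maps each underspecified label to the state labels
        it may stand for, e.g. {"C": ["A", "B"]}.
    Returns (states, observations): what the HMM can be in, and what
    a labelling function is allowed to emit.
    """
    states = list(state_labels)
    observations = list(states)
    for label, covered in underspecified.items():
        # An underspecified label must point at known state labels.
        unknown = [c for c in covered if c not in states]
        if unknown:
            raise ValueError(f"{label!r} refers to unknown labels: {unknown}")
        # It widens the observation space, but never the state space.
        if label not in observations:
            observations.append(label)
    return states, observations


states, observations = split_label_spaces(["A", "B"], {"C": ["A", "B"]})
print(states)        # ['A', 'B']
print(observations)  # ['A', 'B', 'C']
```

In this picture, an LF emitting C only tells the model "the true state is A or B", which matches what the observation matrices from pretty_print should show.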
from skweak.
Thanks Pierre! I've included the .spacy file in the attached zip folder.
Replication:
- Skweak Version: GitHub Master Branch
- Code Snippet:
from skweak import aggregation
from skweak.utils import docbin_reader

# NER Labels
HRD_TAG = 'HRD'  # Underspecified Label
PGL_TAG = 'PGL'
DB_TAG = 'DB'
SW_TAG = 'SW'
ORG_TAG = 'ORG'

# Load the annotated documents from the attached file
docs = list(docbin_reader('example_error.spacy'))
hmm = aggregation.HMM("hmm", [PGL_TAG, DB_TAG, SW_TAG, ORG_TAG], sequence_labelling=True)
hmm.add_underspecified_label(HRD_TAG, [PGL_TAG, DB_TAG, SW_TAG])
_ = hmm.fit_and_aggregate(docs)
from skweak.
Thanks! I had a look at your document, but it seems like the only label that is actually provided in this dataset is the underspecified HRD label:
saro_products_lf: {angular: 'HRD'}
saro_tools_lf: {}
o_net_skills_lf: {}
dice_skills_lf: {angular: 'HRD'}
multi_token_dice_detector: {}
multi_token_saro_products_detector: {}
multi_token_saro_tools_detector: {}
multi_token_o_net_skills_detector: {}
digital_com_pgls_lf: {}
ne_pgls_lf: {}
so_pgls_lf: {}
st_pgls_lf: {}
wiki_pgls_lf: {}
db_engines_dbs_lf: {}
nosql_wiki_dbs_lf: {}
popular_dbs_lf: {}
rbdms_wiki_dbs_lf: {}
so_dbs_lf: {}
software_companies_lf: {}
company_with_punctuation_hard_skill_detector: {}
company_within_noun_phrase_detector: {}
company_with_acronym_detector: {}
company_into_database_detector: {}
oracle_into_database_detector: {}
hmm: {}
verb_detector: {}
That's what confuses the model: it hasn't seen any observation of the actual labels you want to aggregate (PGL, DB, etc.), which means the transition model and observation models are impossible to estimate.
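A quick sanity check along these lines can be run before fitting. The sketch below is plain Python and assumes the LF outputs have already been collected into a dict of the shape printed above (source name mapped to {span text: label}); the helper `missing_concrete_labels` is hypothetical and not part of skweak.

```python
from collections import Counter


def missing_concrete_labels(lf_outputs, state_labels):
    """Return the state labels that no labelling function ever emitted.

    lf_outputs: {source_name: {span_text: label}} (assumed layout,
        mirroring the listing above).
    state_labels: the labels passed to the HMM as output labels.
    """
    counts = Counter()
    for spans in lf_outputs.values():
        counts.update(spans.values())
    return [label for label in state_labels if counts[label] == 0]


# A few of the sources from the listing above.
lf_outputs = {
    "saro_products_lf": {"angular": "HRD"},
    "saro_tools_lf": {},
    "dice_skills_lf": {"angular": "HRD"},
    "so_pgls_lf": {},
}
missing = missing_concrete_labels(lf_outputs, ["PGL", "DB", "SW", "ORG"])
print(missing)  # ['PGL', 'DB', 'SW', 'ORG']
```

If every state label comes back as missing, as it does here, the data contains nothing from which to estimate the transition and observation models.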
from skweak.
Thanks for the explanation Pierre! I think I've misunderstood how underspecified labels work.
I've been presuming that:
- I can label sequences with the under-specified label (if no other more specific labels are present)
- "back-off" to the underspecified label if there is disagreement between more specific labels
Am I correct in stating that assumption 1 is incorrect and assumption 2 is correct? And is there a good way to realize both assumptions in skweak?
As you saw in the example doc, I have entities that I know belong to "HRD", but I don't know which specific sub-category they should be assigned to (i.e., they are not present in the sub-category gazetteers). Thanks again!
from skweak.
Yes, you are correct, the underspecified labels are employed to provide a "weaker" signal (i.e. allowing a labelling function to output a subset of possible labels instead of a single one). But they are not meant to be used as some kind of hierarchical labelling, where one can "back-off" to the underspecified value. That would indeed be very interesting to investigate, but it would require a much more advanced probabilistic model than a classical HMM.
I guess a relatively quick fix would be to add this HRD value to the list of output labels that can be aggregated over, like this:
docs = list(skweak.utils.docbin_reader('example_error.spacy'))
hmm = skweak.aggregation.HMM("hmm", [PGL_TAG, DB_TAG, SW_TAG, ORG_TAG, HRD_TAG], sequence_labelling=True)
hmm.add_underspecified_label(HRD_TAG, [PGL_TAG, DB_TAG, SW_TAG, HRD_TAG])
_ = hmm.fit_and_aggregate(docs)
But I'm not really sure about what kind of solutions the EM algorithm will converge to in this setting.
from skweak.