Comments (10)
This seems like the approach that CRF implementations such as allennlp take for the transition probabilities, e.g. https://github.com/allenai/allennlp/blob/3fa519333c0042a1b378bd8ac1788d42edaa70be/allennlp/modules/conditional_random_field.py#L372
Not sure if that will cause issues here, but it seems a reasonable approach.
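To make the masking idea concrete, here is a minimal sketch of replacing disallowed transition scores with a large negative finite constant instead of minus infinity. It assumes plain NumPy arrays; the function name `mask_transitions`, the fill value and the example matrices are hypothetical and are not taken from allennlp or skweak.

```python
import numpy as np

def mask_transitions(transitions, allowed, fill=-10000.0):
    """Keep the score where `allowed` is True, use `fill` elsewhere.

    `fill` is an arbitrary large negative constant chosen for illustration.
    """
    allowed = np.asarray(allowed, dtype=float)
    # allowed == 1 keeps the original score, allowed == 0 yields `fill`
    return transitions * allowed + fill * (1.0 - allowed)

# Hypothetical 2-state transition scores; the move 0 -> 1 is disallowed
transitions = np.array([[0.5, -0.2],
                        [0.1, 0.3]])
allowed = np.array([[True, False],
                    [True, True]])

masked = mask_transitions(transitions, allowed)
print(masked)
```

Because every entry stays finite, later sums and maxima over these scores cannot produce NaN, which is the usual motivation for preferring a finite constant.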
from skweak.
Thanks for that thought @david-waterworth !
I also think that masking with a very small value instead of minus infinity should not do any harm.
I just wondered whether it breaks the test for tokens with no possible state. If there are other ways of getting very unlikely states, then the test still makes sense. If the only way of getting such small probabilities is the masking, and the masking does not do any harm, then the test could just be disabled. If, however, the test is important and thus cannot be disabled, and there is no way of getting such small probabilities other than the masking, then something ought to be adapted.
I am just unsure which of these scenarios is the case. Do you know more about the reasoning behind the "test for tokens with no possible state" @david-waterworth ?
@mnschmit in their paper they mention "The likelihood function also includes a constraint that requires latent labels to be observed in at least one labelling function to have a non-zero probability. This constraint reduces the search space to a few labels at each step."
I have a feeling the "test for tokens with no possible state" may be the implementation of the above. I'm not 100% sure, though.
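As a rough illustration of how such a constraint and check could interact, here is a minimal sketch. It is an assumption about the general idea, not skweak's actual code: the function `apply_observation_constraint`, its parameters and the example arrays are all hypothetical. States not proposed by any labelling function are masked, and a token whose states are all masked with minus infinity triggers the error.

```python
import numpy as np

def apply_observation_constraint(logprob, observed, mask_value=-np.inf):
    """Mask states that no labelling function proposes (hypothetical helper).

    logprob:  (n_tokens, n_states) array of log-likelihoods
    observed: boolean array of the same shape
    """
    masked = np.where(observed, logprob, mask_value)
    # A token has no valid state if even its best state is minus infinity
    invalid = np.flatnonzero(np.isneginf(masked.max(axis=1)))
    if invalid.size > 0:
        raise RuntimeError("No valid state found at position %i" % invalid[0])
    return masked

logprob = np.log(np.array([[0.7, 0.3],
                           [0.5, 0.5]]))
# Token 1 has no observed state at all
observed = np.array([[True, False],
                     [False, False]])

try:
    apply_observation_constraint(logprob, observed)
except RuntimeError as err:
    print(err)

# With a finite mask value the same input passes the check
relaxed = apply_observation_constraint(logprob, observed, mask_value=-100000.0)
```

In this sketch, switching to a finite mask value silences the check entirely, so whether that is acceptable depends on whether the check is meant as a safety net or as part of the model.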
Interesting idea @david-waterworth
But then I wonder why those tokens do not have 100% probability for the O label. If no labeling function marks them as potential entities, they should likely be Os, no?
Yes, you are right, the current implementation for those corner cases is not optimal. The motivation for having this check was to be able to detect early on when there is a problem with the aggregation, leading to only very improbable states. This was a useful behaviour when implementing the aggregation model itself (to quickly find out when something goes wrong), but I agree that it can create a number of unintended problems.
That being said, I also wonder why you do not get 100% probability for the O label. Would it be possible to get a small example document (or part of a document) where this happens, so I can investigate this further?
Thank you for the explanation and having a look at this @plison !
Yes, no problem, I can give you the document where the problem occurs for me. It is this very short document (in French):
"Une lettre de
super héros"
It has two newlines at the end (so the additional white space in the quote is intentional; the " only mark the document boundaries and are not part of the actual text). It probably does not matter, but I prefer giving you the authentic sample. The problem occurs with the first token "Une".
(For anyone reading this who might not speak French, it means "A super hero letter".)
I have a labeling function saying titlecased words like "Une" should be marked as potential entity candidates. So this is the one labeling function firing for the document, which I mentioned in the first post.
I'm using fr_core_news_sm for tokenization.
I hope you can reproduce it!
I've now just released a new version of skweak and have tried your example and do not seem to get an error. Could you check on your side whether things are now working?
Unfortunately, I still seem to get the same error. Now it does not happen after the first couple of documents but right away:
Starting iteration 1
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/martin/repositories/ner-data-manipulation/src/create_data.py", line 190, in <module>
    main(arguments)
  File "/home/martin/repositories/ner-data-manipulation/src/create_data.py", line 161, in main
    process_texts_and_store_annotations(
  File "/home/martin/repositories/ner-data-manipulation/src/create_data.py", line 107, in process_texts_and_store_annotations
    hmm.fit(annotated_docs)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/aggregation.py", line 87, in fit
    self._fit(obs_generator, **kwargs)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/generative.py", line 121, in _fit
    curr_logprob += self._accumulate_statistics(X)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/generative.py", line 630, in _accumulate_statistics
    framelogprob = self._get_log_likelihood(X)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/generative.py", line 167, in _get_log_likelihood
    raise RuntimeError("No valid state found at position %i"%pos)
If I replace -np.inf with -100000 in line 162, it goes away again.
I don't really manage to reproduce this problem. Here is what I ran:
import spacy, skweak
nlp = spacy.load("fr_core_news_sm")
doc = nlp("""Une lettre de
super héros
""")
lf = skweak.heuristics.TokenConstraintAnnotator("test", lambda x: x.text.istitle(), "ENT")
doc = lf(doc)
hmm = skweak.generative.HMM("hmm", ["ENT"])
hmm.fit([doc]*100)
doc = hmm(doc)
And I don't get any error?
Unfortunately, I was not able to reproduce it in a minimal example either...
I tried your code -> no error. I added some of my other labeling functions -> no error. I added some other random documents from my collection -> no error. It only seems to happen when I run it on all my data.
At least, I found that changing -np.inf to -100000 did not change the training output of the HMM at all in the example runs I tried. So that might be enough for me at the moment.
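One plausible reason for the identical output, sketched here under the assumption of float64 arithmetic and not verified against skweak's internals: once log-probabilities are exponentiated and normalised, exp(-100000) underflows to exactly 0.0, just like exp(-inf). The helper `normalise_log_probs` and the example scores below are hypothetical.

```python
import numpy as np

def normalise_log_probs(logprob):
    """Turn one row of log-probabilities into probabilities summing to 1."""
    shifted = logprob - np.max(logprob)  # subtract the max for stability
    probs = np.exp(shifted)
    return probs / probs.sum()

# Hypothetical scores for three states; the third one is masked out
scores = np.array([-1.2, -0.4, -3.0])
observed = np.array([True, True, False])

with_inf = normalise_log_probs(np.where(observed, scores, -np.inf))
with_finite = normalise_log_probs(np.where(observed, scores, -100000.0))

print(np.array_equal(with_inf, with_finite))
```

The two variants only diverge when every state of a token is masked: with minus infinity the subtraction yields NaN, whereas the finite value yields a uniform distribution over the masked states.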
I fear I would have to dig a little deeper to pinpoint exactly what conditions cause the error and I fear I won't have time for that in the near future. But if I do, I'll let you know!
Anyway, a big thank you for your time @plison ! I'm closing this for now since I don't think we can do much until we have figured out how to reproduce it in a minimal example.