
Comments (10)

david-waterworth commented on June 5, 2024

This seems like the approach that CRF implementations such as allennlp take for the transition probabilities, i.e. https://github.com/allenai/allennlp/blob/3fa519333c0042a1b378bd8ac1788d42edaa70be/allennlp/modules/conditional_random_field.py#L372

Not sure if that will cause issues here but it seems a reasonable approach

from skweak.

mnschmit commented on June 5, 2024

Thanks for that thought @david-waterworth !
I also think that masking with a very small value instead of minus infinity should not do any harm.

I just wondered whether it breaks the test for tokens with no possible state. If there are other ways of ending up with very unlikely states, then the test still makes sense. If masking is the only way to get such small probabilities, and masking with a finite value does no harm, then the test could simply be disabled. If, however, the test is important and cannot be disabled, and there is no way of getting such small probabilities other than the masking, then something ought to be adapted.

I am just unsure which of these scenarios is the case. Do you know more about the reasoning behind the "test for tokens with no possible state" @david-waterworth ?
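
To make the question concrete, here is a minimal sketch of how a per-position check of this kind interacts with the mask value. It is not skweak's code: `mask_states`, `check_positions` and the `threshold` parameter are hypothetical names chosen for illustration, and only the standard library is used.

```python
import math

def mask_states(log_probs, allowed, fill):
    """Replace the log-probability of every disallowed state with `fill`."""
    return [lp if ok else fill for lp, ok in zip(log_probs, allowed)]

def check_positions(frame_log_probs, threshold=-math.inf):
    """Return the positions whose best state is not above `threshold`.

    With the default threshold, a position is flagged only when every
    state has log-probability -inf (a hypothetical stand-in for a
    "no valid state" test).
    """
    return [pos for pos, row in enumerate(frame_log_probs)
            if max(row) <= threshold]

log_probs = [math.log(0.6), math.log(0.4)]

# Position 0 allows one state; position 1 allows none.
allowed = [[True, False], [False, False]]

hard = [mask_states(log_probs, a, -math.inf) for a in allowed]
soft = [mask_states(log_probs, a, -100000.0) for a in allowed]

print(check_positions(hard))                       # all-masked position is flagged
print(check_positions(soft))                       # nothing flagged: -100000 > -inf
print(check_positions(soft, threshold=-100000.0))  # flagged again with a finite threshold
```

In this sketch, a finite mask value silently bypasses a check that looks for -inf, unless the check compares against a finite threshold instead.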


david-waterworth commented on June 5, 2024

@mnschmit in their paper they mention "The likelihood function also includes a constraint that requires latent labels to be observed in at least one labelling function to have a non-zero probability. This constraint reduces the search space to a few labels at each step."

I have a feeling the "test for tokens with no possible state" may be the implementation of the above. I'm not 100% sure though
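
If that reading is right, the constraint could be sketched roughly as below. This only illustrates the quoted sentence and is not skweak's implementation: `candidate_labels` and the input layout (one list of per-token labels per labelling function, with `None` where the function abstains) are assumptions made for the example.

```python
def candidate_labels(lf_outputs, num_tokens):
    """Collect, per token, the labels proposed by at least one labelling function."""
    candidates = [set() for _ in range(num_tokens)]
    for labels in lf_outputs:
        for pos, label in enumerate(labels):
            if label is not None:
                candidates[pos].add(label)
    return candidates

lf_outputs = [
    ["ENT", None, None],  # hypothetical title-case function
    [None, None, "O"],    # hypothetical function that votes "O" on the last token
]
candidates = candidate_labels(lf_outputs, num_tokens=3)

# Positions for which no labelling function proposed anything.
empty = [pos for pos, labels in enumerate(candidates) if not labels]
print(candidates)
print(empty)
```

Under a strict reading of the constraint, the middle token here has no admissible label at all, which is the situation a "no possible state" test would detect.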


mnschmit commented on June 5, 2024

Interesting idea @david-waterworth
But then I wonder why those tokens do not have 100% probability for the O label. If no labeling function marks them as potential entities, they should likely be Os, no?


plison commented on June 5, 2024

Yes, you are right, the current implementation for those corner cases is not optimal. The motivation for having this check was to be able to detect early on when there is a problem with the aggregation, leading to only very improbable states. This was a useful behaviour when implementing the aggregation model itself (to quickly find out when something goes wrong), but I agree that it can create a number of unintended problems.

That said, I also wonder why you do not get 100% probability for the O label. Could you share a small example document (or part of one) where this happens, so I can investigate further?


mnschmit commented on June 5, 2024

Thank you for the explanation and having a look at this @plison !

Yes, no problem, I can give you the document where the problem occurs for me. It is this very short document (in French):

"Une lettre de  
super héros

"

It has two newlines at the end (so the additional white space in the quote is intentional; the quotation marks only mark the document boundaries and are not part of the actual text). It probably does not matter, but I prefer giving you the authentic sample. The problem occurs with the first token "Une".
(For anyone reading this who might not speak French, it means "A super hero letter".)

I have a labeling function saying titlecased words like "Une" should be marked as potential entity candidates. So this is the one labeling function firing for the document, which I mentioned in the first post.
I'm using fr_core_news_sm for tokenization.

I hope you can reproduce it!


plison commented on June 5, 2024

I have just released a new version of skweak and tried your example with it, and I do not get an error. Could you check on your side whether things are now working?


mnschmit commented on June 5, 2024

Unfortunately, I still seem to get the same error. Now it does not happen after the first couple of documents but right away:

Starting iteration 1
Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/martin/repositories/ner-data-manipulation/src/create_data.py", line 190, in <module>
    main(arguments)
  File "/home/martin/repositories/ner-data-manipulation/src/create_data.py", line 161, in main
    process_texts_and_store_annotations(
  File "/home/martin/repositories/ner-data-manipulation/src/create_data.py", line 107, in process_texts_and_store_annotations
    hmm.fit(annotated_docs)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/aggregation.py", line 87, in fit
    self._fit(obs_generator, **kwargs)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/generative.py", line 121, in _fit
    curr_logprob += self._accumulate_statistics(X)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/generative.py", line 630, in _accumulate_statistics
    framelogprob = self._get_log_likelihood(X)
  File "/home/martin/.local/share/virtualenvs/ner-data-manipulation-R28K2yLy/lib/python3.9/site-packages/skweak/generative.py", line 167, in _get_log_likelihood
    raise RuntimeError("No valid state found at position %i"%pos)

If I replace -np.inf with -100000 in line 162, the error goes away again.


plison commented on June 5, 2024

I can't reproduce this problem. Here is what I ran:

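import spacy
import skweak

# Tokenize the sample document with the small French pipeline
nlp = spacy.load("fr_core_news_sm")
doc = nlp("""Une lettre de  
super héros

""")
# Labelling function: mark every title-cased token as "ENT"
lf = skweak.heuristics.TokenConstraintAnnotator("test", lambda x: x.text.istitle(), "ENT")
doc = lf(doc)
# Fit the HMM on 100 copies of the document, then apply it
hmm = skweak.generative.HMM("hmm", ["ENT"])
hmm.fit([doc]*100)
doc = hmm(doc)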

And I don't get any error?


mnschmit commented on June 5, 2024

Unfortunately, I was not able to reproduce it in a minimal example either...

I tried your code -> no error. I added some of my other labeling functions -> no error. I added some other random documents from my collection -> no error. It only seems to happen when I run it on all my data.
At least, I found that changing -np.inf to -100000 did not change the training output of the HMM at all in the example runs I tried. So that might be enough for me at the moment.
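
As a rough illustration of why the two mask values can give the same result, here is a toy log-space forward pass. It is a sketch under stated assumptions, not skweak's training code: the two-state model, its probabilities, and the helper names `logsumexp`, `forward_loglik` and `masked` are all made up for the example.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))); returns -inf when every entry is -inf."""
    peak = max(values)
    if peak == -math.inf:
        return -math.inf
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def forward_loglik(log_start, log_trans, frame):
    """Log-likelihood of a sequence via the forward algorithm in log space."""
    n = len(log_start)
    alpha = [log_start[j] + frame[0][j] for j in range(n)]
    for row in frame[1:]:
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n)]) + row[j]
            for j in range(n)
        ]
    return logsumexp(alpha)

def masked(frame, allowed, fill):
    """Replace the log-likelihood of disallowed states with `fill`."""
    return [[lp if ok else fill for lp, ok in zip(row, a)]
            for row, a in zip(frame, allowed)]

# Toy two-state model; every position keeps at least one allowed state.
log_start = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.2), math.log(0.8)]]
frame = [[math.log(0.3), math.log(0.7)]] * 3
allowed = [[True, False], [True, True], [False, True]]

hard = forward_loglik(log_start, log_trans, masked(frame, allowed, -math.inf))
soft = forward_loglik(log_start, log_trans, masked(frame, allowed, -100000.0))
print(hard, soft)
```

The two values agree here because exp(-100000) underflows to zero next to any allowed state. They would differ only at a position where every state is masked: the -inf version then yields -inf, while the finite version yields a very negative but finite number.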
I fear I would have to dig a little deeper to pinpoint exactly what conditions cause the error and I fear I won't have time for that in the near future. But if I do, I'll let you know!

Anyway, a big thank you for your time @plison ! I'm closing this for now since I don't think we can do much until we have figured out how to reproduce it in a minimal example.

