
Comments (4)

neilmizzi commented on June 12, 2024

Hello, has anyone here tried to implement a model/framework other than spaCy (NER) as a labeling function? I tried to work with Flair, but it didn't work. Can anyone help me? Thanks in advance.

I am also having this issue at the moment! So far, I've been able to use external models which can be fit into the SpaCy framework: https://spacy.io/universe/category/models

Namely, I've had success with Stanza as there's already a version with the SpaCy wrapper available for this model: spacy-stanza

I'm currently trying to use Flair in a number of ways (the SpaCy-wrap tool, switching to the SpaCy tokenizer), but so far I have not had any success. I will update if I manage to do this!

In the meantime, having the option to integrate non-SpaCy-based models more easily would definitely be appreciated. I will see how far I get with trying to get Flair to work and keep this thread updated.

This is the error I get right now when I simply run the Flair model, retrieve its entities, and pass them to the FunctionAnnotator:

IndexError: [E035] Error creating span with start 1 and end 6 for Doc of length 2.

It seems to me that even though I override Flair's tokenization by using the SpacyTokenizer, there are still some differences, which lead to conflicting token indices on certain documents.

UPDATE: It seems to me that if there were a way to use character offsets rather than token indices, this issue could easily be mitigated and virtually any model could be used. If there's a way to use SpanAnnotator or FunctionAnnotator in this manner, I'd really appreciate knowing how it is done!
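To illustrate why character offsets are more portable than token indices, here is a minimal sketch in plain Python. It is not skweak or spaCy code: the helper names `token_offsets` and `char_to_token_span` are hypothetical, and the whitespace tokenizer is an assumption chosen only to keep the example short. The idea is that a character span from any model can be mapped onto the token boundaries of whichever tokenizer the pipeline uses, and rejected when it does not line up.

```python
import re
from typing import List, Optional, Tuple


def token_offsets(text: str) -> List[Tuple[int, int]]:
    """Return (start_char, end_char) for each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_to_token_span(
    offsets: List[Tuple[int, int]], start_char: int, end_char: int
) -> Optional[Tuple[int, int]]:
    """Map a character span to (start_token, end_token), end exclusive.

    Returns None when either boundary does not coincide with a token boundary.
    """
    starts = {s: i for i, (s, _) in enumerate(offsets)}
    ends = {e: i + 1 for i, (_, e) in enumerate(offsets)}
    if start_char not in starts or end_char not in ends:
        return None
    start_tok, end_tok = starts[start_char], ends[end_char]
    return (start_tok, end_tok) if start_tok < end_tok else None


text = "Barack Obama visited Paris"
offsets = token_offsets(text)
print(char_to_token_span(offsets, 0, 12))  # (0, 2): "Barack Obama"
print(char_to_token_span(offsets, 3, 12))  # None: starts in the middle of a token
```

Two tokenizers can disagree on how many tokens a text has, but they agree on its characters, which is why a token index produced by one tokenizer can fall outside a Doc produced by another.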

from skweak.

plison commented on June 12, 2024

Yes, I agree that using character spans instead of token-level spans would make skweak less spaCy-dependent and provide more flexibility. But it would mean rewriting a lot of the code, since right now the results of the labelling functions are stored as Span objects, which require token-level indexing. So it's definitely something worth looking into, but it's not in the pipeline at the moment.


neilmizzi commented on June 12, 2024

Thank you for your input @plison! I can understand that this may not be as straightforward to implement.

In the meantime, I have figured out a workaround to get Flair working with the FunctionAnnotator: I call doc.char_span on the existing doc object to convert each entity's character offsets into a token-aligned span. You can see an example below.

This workaround should in theory work for any model that provides character spans for the retrieved entities. However, I have only tested it with Flair so far, and I haven't had any major issues.

from skweak import heuristics

from flair.data import Sentence
from flair.models import SequenceTagger

...

flair_classifier = SequenceTagger.load("flair/ner-english-large")

def flair_annotator(doc):
    sentence = Sentence(doc.text)
    flair_classifier.predict(sentence)

    for entity in sentence.get_spans('ner'):
        # doc.char_span returns None when the character offsets do not
        # align with the token boundaries of the spaCy doc
        span = doc.char_span(entity.start_position, entity.end_position, label=entity.labels[0].value)
        if span is not None:   # skip entities that could not be aligned
            # yield the label as a string (label_) rather than its integer hash
            yield span.start, span.end, span.label_

...

# declare and set Flair annotations
# use a separate name so the labelling function above is not overwritten
flair_lf = heuristics.FunctionAnnotator("flair_annotator", flair_annotator)

docs = list(flair_lf.pipe(docs))
...
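The same pattern can be sketched in a model-agnostic way. This is only an illustration under stated assumptions: `make_char_span_lf` is a hypothetical helper (not part of skweak), and `predict` stands for any callable that takes the raw text and returns `(start_char, end_char, label)` triples. The only thing required of `doc` is a `char_span` method that behaves like spaCy's and returns None for spans that do not align with token boundaries.

```python
from typing import Callable, Iterable, Iterator, Tuple

CharSpan = Tuple[int, int, str]  # (start_char, end_char, label)


def make_char_span_lf(predict: Callable[[str], Iterable[CharSpan]]):
    """Wrap a character-level predictor (hypothetical interface) into a
    labelling function that yields token-level (start, end, label) triples."""

    def labelling_function(doc) -> Iterator[Tuple[int, int, str]]:
        for start_char, end_char, label in predict(doc.text):
            # char_span is expected to return None for misaligned offsets
            span = doc.char_span(start_char, end_char, label=label)
            if span is not None:
                yield span.start, span.end, label

    return labelling_function


# Possible usage with skweak, assuming `my_predict` wraps some external model:
# annotator = heuristics.FunctionAnnotator("my_model", make_char_span_lf(my_predict))
# docs = list(annotator.pipe(docs))
```

Entities that cannot be aligned are silently dropped here, as in the Flair example; depending on the use case, it may be preferable to log them instead.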


plison commented on June 12, 2024

Thanks, that's very useful indeed!

