
Comments (8)

syllog1sm commented on April 27, 2024

I'm not really happy with any of the existing algorithms, so I've been working on a novel shift-reduce approach. Briefly, previous work usually encodes the structure into sequence tags so that a finite-state machine can be used. I think it makes more sense to use a push-down automaton, now that shift-reduce parsing is so well understood.
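To make the contrast with sequence tagging concrete, here is a toy sketch of a stack-based transition system for entity spans. It is an illustration only, not the implementation discussed in this thread: the action names (`SHIFT`, `OUT`, `REDUCE-<label>`) and the function `apply_actions` are assumptions, and the statistical model that would choose each action is left out entirely.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def apply_actions(tokens: List[str], actions: List[str]) -> List[Span]:
    """Run a toy shift-reduce machine over `tokens`.

    Hypothetical action set (not taken from the thread):
      SHIFT           push the next token onto the stack (it is inside an entity)
      OUT             skip the next token (it is outside any entity)
      REDUCE-<label>  pop the whole stack and emit it as one labelled entity
    """
    stack: List[int] = []   # indices of tokens in the currently open entity
    spans: List[Span] = []
    i = 0                   # position of the next unread token

    for action in actions:
        if action == "SHIFT":
            if i >= len(tokens):
                raise ValueError("SHIFT with no tokens left")
            stack.append(i)
            i += 1
        elif action == "OUT":
            if stack:
                raise ValueError("OUT while an entity is still open")
            if i >= len(tokens):
                raise ValueError("OUT with no tokens left")
            i += 1
        elif action.startswith("REDUCE-"):
            if not stack:
                raise ValueError("REDUCE with an empty stack")
            spans.append((stack[0], stack[-1] + 1, action[len("REDUCE-"):]))
            stack.clear()
        else:
            raise ValueError(f"unknown action: {action}")

    if stack or i != len(tokens):
        raise ValueError("action sequence did not consume the input cleanly")
    return spans


tokens = ["Mr.", "Stone", "told", "his", "story", "."]
actions = ["SHIFT", "SHIFT", "REDUCE-PERSON", "OUT", "OUT", "OUT", "OUT"]
print(apply_actions(tokens, actions))
```

The point of the sketch is only that the entity label is decided once, at the reduce step, with the whole span on the stack, rather than being spread across per-token tags.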

I'm just starting to get results for this. Currently accuracy is only 77% on OntoNotes, where the Stanford NER system reportedly gets around 84%. I still need to do a lot of bug-fixing and tuning, and I'm not using gazetteers or any semi-supervised learning at the moment.

So, in short: yes, NER is planned, and the bulk of the work is done. It remains to be seen whether my approach will hit comparable accuracy to previous work, but imo it should. Once the accuracy is good, I then need to design and implement the Python API, and write the testing and deployment code. Probably about 1 month all up, given other things I'm working on.

from spacy.

viksit commented on April 27, 2024

Thanks, that's an interesting approach. Are there any specific papers you recommend for PDA based NER?

Also, are you inviting code/collaboration on this yet?


syllog1sm commented on April 27, 2024

As far as I know PDA for NER is a new idea, since most of the previous work uses HMMs and CRFs. If it works, I'll write it up.

I need to set up the contributor agreement first, and then I can accept contributions. That said, I think it's easiest if I do the research parts myself; collaborating on that gets complicated.

If you want to weigh in on what sort of API you'd like to see though, that would be very welcome.


viksit commented on April 27, 2024

Ah, I didn't realize it hadn't been tried before. I remember coming across a Chinese NER system that used PDAs, but I can't find that paper. Would you be interested in sharing some high-level thoughts on the PDA/NER approach that you're taking?

Re: collaborating on the research parts, just an idea: it might be interesting to have a shared IPython notebook or some such, on one of the GitHub-style research collaboration platforms.

Definitely, let me think about the APIs. I've always thought that the GATE and UIMA style APIs, and even the Stanford NER API, are super heavyweight.

As this progresses, it would be good to have a visual representation of the parse tree, like NLTK's:

(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.) 
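A bracketed view like the one above does not need much machinery. The following is a minimal sketch in plain Python that renders tagged tokens plus labelled spans in that layout; `render_chunks` and its arguments are hypothetical names used for illustration, not part of NLTK or of any API proposed in this thread.

```python
from typing import List, Tuple

TaggedToken = Tuple[str, str]   # (word, tag)
Span = Tuple[int, int, str]     # (start, end_exclusive, label)


def render_chunks(tagged: List[TaggedToken], spans: List[Span], root: str = "S") -> str:
    """Render tokens and non-overlapping spans as an NLTK-style bracketed string."""
    by_start = {start: (end, label) for start, end, label in spans}
    lines = [f"({root}"]
    i = 0
    while i < len(tagged):
        if i in by_start:
            # Emit the whole span on one line, then jump past it.
            end, label = by_start[i]
            inner = " ".join(f"{word}/{tag}" for word, tag in tagged[i:end])
            lines.append(f"  ({label} {inner})")
            i = end
        else:
            word, tag = tagged[i]
            lines.append(f"  {word}/{tag}")
            i += 1
    return "\n".join(lines) + ")"


tagged = [("Mr.", "NNP"), ("Stone", "NNP"), ("told", "VBD"),
          ("his", "PRP$"), ("story", "NN"), (".", ".")]
spans = [(0, 2, "NP"), (3, 5, "NP")]
print(render_chunks(tagged, spans))
```

The same function would work for entity spans by passing labels such as `PERSON` instead of `NP`.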


syllog1sm commented on April 27, 2024

Quick update:

This is progressing well: I'm now getting 81% on the OntoNotes WSJ corpus. I expect gazetteers from Wikidata will bring this in line with the current state of the art.

It's hard to say, but this might be ready by the end of April.


honnibal commented on April 27, 2024

NER now included, although the model still needs accuracy improvements. Currently it's getting 82% F on OntoNotes, and 86% on CoNLL '03. State-of-the-art is around 85% and 90% on these benchmarks. Improvements are in the works.
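For readers unfamiliar with the "F" figures: the thread does not say exactly how these scores were computed, but NER results on benchmarks like these are commonly reported as entity-level F1, where a predicted entity counts as correct only if its span and label both match the gold annotation exactly. A minimal sketch under that assumption (`prf` is a hypothetical helper, not a spaCy function):

```python
from typing import Iterable, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def prf(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Exact-match entity-level precision, recall and F1."""
    gold_set, pred_set = set(gold), set(predicted)
    true_pos = len(gold_set & pred_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 2, "PERSON"), (5, 6, "ORG")]
predicted = [(0, 2, "PERSON"), (3, 4, "GPE")]
print(prf(gold, predicted))  # (0.5, 0.5, 0.5)
```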


viksit commented on April 27, 2024

Sweet - is there a pointer on usage?


lock commented on April 27, 2024

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
