Comments (8)
I'm not really happy with any of the existing algorithms, so I've been working on a novel shift-reduce approach. Briefly: previous work usually encodes the structure into sequence tags, so that a finite-state machine can be used. I think it makes more sense to use a push-down automaton, now that shift-reduce parsing is so well understood.
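To make the contrast with tag-based encodings concrete, here is a minimal sketch of a stack-based transition system for entity spans. It is not the implementation discussed in this thread: the action names (`SHIFT`, `OUT`, `REDUCE`), the `apply_actions` helper, and the `PERSON` label on the example are all hypothetical, and a real system would score actions with a trained model rather than take them as input.

```python
from typing import List, Tuple

Action = Tuple[str, ...]     # hypothetical encoding: ("SHIFT",), ("OUT",), ("REDUCE", label)
Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def apply_actions(tokens: List[str], actions: List[Action]) -> List[Span]:
    """Replay a sequence of transitions and return the entity spans it builds."""
    stack: List[int] = []  # indices of tokens in the entity currently being built
    spans: List[Span] = []
    i = 0                  # index of the next unread token in the buffer
    for action in actions:
        name = action[0]
        if name == "SHIFT":
            # Move the next token onto the stack as part of an open entity.
            if i >= len(tokens):
                raise ValueError("SHIFT with an empty buffer")
            stack.append(i)
            i += 1
        elif name == "OUT":
            # The next token is outside any entity; only legal with an empty stack.
            if stack:
                raise ValueError("OUT while an entity is still open")
            if i >= len(tokens):
                raise ValueError("OUT with an empty buffer")
            i += 1
        elif name == "REDUCE":
            # Close the open entity: pop the whole stack and emit a labelled span.
            if not stack:
                raise ValueError("REDUCE with an empty stack")
            spans.append((stack[0], stack[-1] + 1, action[1]))
            stack.clear()
        else:
            raise ValueError(f"unknown action: {name}")
    if stack or i != len(tokens):
        raise ValueError("incomplete derivation")
    return spans


tokens = ["Over", "a", "cup", "of", "coffee", ",",
          "Mr.", "Stone", "told", "his", "story", "."]
actions = [("OUT",)] * 6 + [("SHIFT",), ("SHIFT",), ("REDUCE", "PERSON")] + [("OUT",)] * 4
print(apply_actions(tokens, actions))  # [(6, 8, 'PERSON')]
```

The legality checks are the point of the design in this sketch: an ill-formed sequence (for example, skipping a token while an entity is open) is rejected by the transition system itself instead of needing constraints over tag pairs.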
I'm just starting to get results for this. Currently accuracy is only 77% on OntoNotes, where the Stanford NER system reportedly gets around 84%. I still need to do a lot of bug-fixing and tuning, and I'm not using gazetteers or any semi-supervised learning at the moment.
So, in short: yes, NER is planned, and the bulk of the work is done. It remains to be seen whether my approach will hit comparable accuracy to previous work, but imo it should. Once the accuracy is good, I then need to design and implement the Python API, and write the testing and deployment code. Probably about 1 month all up, given other things I'm working on.
from spacy.
Thanks, that's an interesting approach. Are there any specific papers you recommend for PDA based NER?
Also, are you inviting code/collaboration on this yet?
As far as I know, using a PDA for NER is a new idea, since most of the previous work uses HMMs and CRFs. If it works, I'll write it up.
I need to set up the contributors agreement, but then I could accept contributions. But, I think it's easiest if I do the research parts myself. Collaborating on that gets complicated.
If you want to weigh in on what sort of API you'd like to see though, that would be very welcome.
Ah, I didn't realize it hadn't been tried before. I remember coming across a Chinese NER system that used PDAs, but I can't find that paper. Would you be interested in sharing some high-level thoughts on the PDA/NER approach that you're taking?
Re: collaborating on the research parts: just an idea, but it might be interesting to have a shared ipynb or some such, on one of the GitHub-style research collaboration platforms.
Definitely, let me think about the APIs. I've always thought that the GATE and UIMA style APIs, and even the Stanford NER API, are super heavyweight.
It would be good to have a visual representation of the parse tree, like NLTK's, as this progresses:
(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)
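As a rough illustration of the kind of output being requested, the following sketch renders tagged tokens and labelled spans in the bracketed style shown above. It uses plain Python only and is not NLTK's or spaCy's API; the `format_chunk_tree` name and the `(start, end, label)` span format are assumptions made for this example.

```python
from typing import List, Tuple


def format_chunk_tree(
    tagged: List[Tuple[str, str]],
    chunks: List[Tuple[int, int, str]],
    root: str = "S",
) -> str:
    """Render (word, tag) pairs and non-overlapping (start, end, label) spans as a bracketed tree."""
    starts = {}
    for start, end, label in chunks:
        if end <= start:
            raise ValueError("chunk spans must be non-empty")
        starts[start] = (end, label)

    lines = []
    i = 0
    while i < len(tagged):
        if i in starts:
            # Emit the whole chunk on one line and jump past it.
            end, label = starts[i]
            inner = " ".join(f"{word}/{tag}" for word, tag in tagged[i:end])
            lines.append(f"  ({label} {inner})")
            i = end
        else:
            # A token outside any chunk gets its own line.
            word, tag = tagged[i]
            lines.append(f"  {word}/{tag}")
            i += 1
    return f"({root}\n" + "\n".join(lines) + ")"


tagged = [("Mr.", "NNP"), ("Stone", "NNP"), ("told", "VBD"),
          ("his", "PRP$"), ("story", "NN"), (".", ".")]
chunks = [(0, 2, "NP"), (3, 5, "NP")]
print(format_chunk_tree(tagged, chunks))
```

The same function could render entity spans instead of noun-phrase chunks by passing entity labels in `chunks`.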
Quick update:
This is progressing well: I'm now getting 81% on the OntoNotes WSJ corpus. I expect gazetteers from Wikidata will bring this in line with the current state of the art.
It's hard to say, but this might be ready by the end of April.
NER now included, although the model still needs accuracy improvements. Currently it's getting 82% F on OntoNotes, and 86% on CoNLL '03. State-of-the-art is around 85% and 90% on these benchmarks. Improvements are in the works.
Sweet - is there a pointer on usage?