mikahama / natas
Python 3 library for processing historical English
License: Other
Hey, I would use is_correctly_spelled to catch OCR errors, but it seems overly generous.
I guess the pages that it finds are https://en.wiktionary.org/wiki/HATO and https://en.wiktionary.org/wiki/Teri; it did correctly reject 'itsejf'.
I wonder if it could match more conservatively and only accept lowercase entries as real words? Perhaps optionally?
Thanks!
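The stricter check requested above could look roughly like the following sketch. It does not use natas itself: `LEMMAS` is a small hypothetical stand-in for a case-sensitive lemma list, and `is_conservatively_spelled` is an illustrative name, not a function the library provides.

```python
# Hypothetical stand-in for a case-sensitive lemma list (not natas's real dictionary).
LEMMAS = {"HATO", "Teri", "itself", "logic"}


def is_conservatively_spelled(word, lemmas=LEMMAS):
    """Accept a word only if it is lowercase and has an exact-case entry."""
    return word.islower() and word in lemmas


# Entries that exist only in capitalised form no longer count as real words.
results = {w: is_conservatively_spelled(w) for w in ["hato", "teri", "itself", "itsejf"]}
```

With an exact-case lookup, 'hato' and 'teri' are rejected because only 'HATO' and 'Teri' are listed, while 'itself' still passes.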
Running the test examples, it seems to work very well, except that there seems to be a problem with using a set here. I'm probably just using it wrong, so any advice would be helpful.
seed_words = set(["logic", "logical"]) #list of correctly spelled words you want to find matching OCR errors for
dictionary = wiktionary #Lemmas of the English Wiktionary, you will need to change this if working with any other language
lemmatize = True #Uses Spacy with English model, use natas.set_spacy(nlp) for other models and languages
results = ocr_builder.extract_parallel(seed_words, model, dictionary=dictionary, lemmatize=lemmatize)
I get the error: TypeError: 'set' object is not subscriptable
Any idea what might be going on? Thanks!
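One possible explanation, which the thread does not confirm: Python sets do not support indexing, so any code that evaluates `seed_words[i]` raises exactly this error. Whether `extract_parallel` indexes its first argument is an assumption here. The sketch below reproduces the error in plain Python and shows the usual workaround of passing a list instead.

```python
seed_words = set(["logic", "logical"])

# Indexing a set raises the TypeError quoted above.
try:
    seed_words[0]
    message = None
except TypeError as err:
    message = str(err)

# Converting to a list (sorted for a stable order) makes the collection indexable.
seed_list = sorted(seed_words)
first_word = seed_list[0]
```

If the assumption holds, passing `seed_list` in place of `seed_words` should avoid the error.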
First off, this is a great library. I've been quite impressed with the results of its normalization functionality. One thing I'd be keen to see, however, is some kind of metric with which to associate the topn candidate words produced by the normalization process (say, for instance, vector similarity or the NMT model's prediction score). Is Natas already capable of doing this (in which case I'm missing something), or are there plans to implement such functionality?
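Until such a score is available, candidates could be ranked after the fact with an external measure. The sketch below is only an illustration: it uses the standard library's `difflib.SequenceMatcher` ratio as a stand-in metric, which is neither vector similarity nor an NMT prediction score, and the candidate list is hypothetical rather than real Natas output.

```python
from difflib import SequenceMatcher


def rank_candidates(source, candidates):
    """Pair each candidate with a string-similarity score, highest first."""
    scored = [(c, SequenceMatcher(None, source, c).ratio()) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates for the OCR error 'itsejf' (not produced by Natas).
ranked = rank_candidates("itsejf", ["its", "itself"])
```

Each entry in `ranked` is a `(candidate, score)` tuple, so the score can be thresholded or shown alongside the candidate.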