
augury's Introduction

Augury

Further work is happening in the portent repo: https://github.com/jeanbern/portent


A small collection of natural language processing tools in C#. Augury is intended for use as a text predictor/spell-checker.
Using a DAWG (directed acyclic word graph) and the Jaro-Winkler distance, we generate candidate word completions and spelling corrections. The candidates are then ranked using modified Kneser-Ney smoothing, and the top results are returned. Symmetric-delete correction is also supported as an alternative.
Most behavior is injectable and interfaces are provided to enable extension.
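
As a rough illustration of the symmetric-delete idea mentioned above, here is a minimal sketch. The class and method names are hypothetical and are not taken from the Augury source. It generates every string reachable by deleting up to a given number of characters, and treats two words as candidate matches when their delete sets overlap. An overlap only nominates a candidate; a real edit-distance or string-metric check would still be needed afterwards.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not the Augury implementation.
public static class SymmetricDelete
{
    // Returns the word itself plus every string obtained by deleting
    // between 1 and maxDistance characters from it.
    public static HashSet<string> Deletes(string word, int maxDistance)
    {
        var results = new HashSet<string> { word };
        var frontier = new List<string> { word };
        for (var depth = 0; depth < maxDistance; depth++)
        {
            var next = new List<string>();
            foreach (var current in frontier)
            {
                for (var i = 0; i < current.Length; i++)
                {
                    var shorter = current.Remove(i, 1);
                    // Only expand strings we have not seen before.
                    if (results.Add(shorter)) next.Add(shorter);
                }
            }
            frontier = next;
        }
        return results;
    }

    // Two words are candidate matches when their delete sets share an entry.
    public static bool MayMatch(string a, string b, int maxDistance)
    {
        return Deletes(a, maxDistance).Overlaps(Deletes(b, maxDistance));
    }
}
```

For example, `SymmetricDelete.MayMatch("cat", "cart", 1)` is true because deleting the "r" from "cart" yields "cat".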

Documentation coming soon.


The MIT License (MIT)

Copyright (c) 2016 Jean-Bernard Pellerin

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

augury's People

Contributors

jeanbern, waffle-iron

augury's Issues

Add learning to ModifiedKnesserNey

This can be as simple as storing all text that passes through it in a spare file and recalculating with the corpus + that file.

Could also be done on the fly; I think manipulating the counts and NW* values is O(1).
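
A minimal sketch of what on-the-fly updating could look like, assuming the NW* values are continuation-style counts (the number of distinct words seen before or after a given word). All names here are hypothetical. Each observation touches a constant number of dictionary entries, so the update is O(1) on average. Discount parameters and higher-order n-grams are left out.

```csharp
using System.Collections.Generic;

// Hypothetical bigram store; not the ModifiedKnesserNey class itself.
public sealed class BigramCounts
{
    private readonly Dictionary<(string, string), int> _pairCounts =
        new Dictionary<(string, string), int>();
    private readonly Dictionary<string, int> _distinctFollowers =
        new Dictionary<string, int>();
    private readonly Dictionary<string, int> _distinctPredecessors =
        new Dictionary<string, int>();

    // Records one occurrence of "previous word" in the incoming text.
    public void Observe(string previous, string word)
    {
        var key = (previous, word);
        _pairCounts.TryGetValue(key, out var count);
        _pairCounts[key] = count + 1;
        if (count == 0)
        {
            // First time this pair is seen: both continuation counts grow.
            Increment(_distinctFollowers, previous);
            Increment(_distinctPredecessors, word);
        }
    }

    public int Count(string previous, string word) =>
        _pairCounts.TryGetValue((previous, word), out var c) ? c : 0;

    public int DistinctFollowers(string previous) =>
        _distinctFollowers.TryGetValue(previous, out var c) ? c : 0;

    public int DistinctPredecessors(string word) =>
        _distinctPredecessors.TryGetValue(word, out var c) ? c : 0;

    private static void Increment(Dictionary<string, int> table, string key)
    {
        table.TryGetValue(key, out var value);
        table[key] = value + 1;
    }
}
```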

Consider interface for Auger

Is this just another case of INextWordModel? It fits the description, would just require a single rename of the Predict(...) method in Auger.

Could alternatively argue for a different interface that takes a list of previous words and a string representing the partial next word.

Add unit testing for IPrefixLookup

  • Create with small, known corpus. Test things we know are true and things we know aren't true.
  • Create with large corpus. Present with selections from corpus.
    • Create some metric for evaluating prediction score. Words guessed immediately should give more points than after 1 letter. Guesses after 3 or more letters should rank poorly.
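
One possible shape for such a metric, as a sketch only. The point values and all names are made up for illustration. The predictor is passed in as a plain function from prefix to suggestions, so nothing here depends on the real IPrefixLookup signature.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical scoring helper for prediction tests.
public static class PredictionScore
{
    // Index = letters typed before the target was suggested.
    // The weights are arbitrary placeholders.
    private static readonly int[] Points = { 8, 4, 2 };

    // Returns how many letters of the target had to be typed before the
    // predictor suggested it, or -1 if it was never suggested.
    public static int LettersNeeded(string target, Func<string, IEnumerable<string>> predict)
    {
        for (var typed = 0; typed <= target.Length; typed++)
        {
            if (predict(target.Substring(0, typed)).Contains(target)) return typed;
        }
        return -1;
    }

    // Immediate guesses score highest; three or more letters score poorly.
    public static int Score(int lettersNeeded)
    {
        if (lettersNeeded < 0) return 0;
        return lettersNeeded < Points.Length ? Points[lettersNeeded] : 1;
    }
}
```

Summing `Score(LettersNeeded(...))` over a held-out sample of the corpus would give a single number to compare lookup implementations against each other.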

Add injection for IStringMetric to SpellCheck and SymmetricPredictor

Could add them as a passed-in parameter in the constructors. Will also need to add the information to the serializers.

Should be pretty easy:

  • Create ISerializer for each implementation of IStringMetric. That serializer can be empty and return an object created with the default constructor.
    • Could create an abstract generic class constrained with where T : class, new(), then an empty subclass for each string metric.
  • SpellCheck and SymmetricPredictor can use Serializer.SerializeInterface
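
A sketch of the empty-serializer idea. The ISerializer<T> interface below is a stand-in written for this example; the real interface in Augury may have a different shape. ExactMatchMetric is likewise a hypothetical placeholder for a stateless string metric.

```csharp
using System.IO;

// Hypothetical interface, assumed for illustration only.
public interface ISerializer<T>
{
    void Serialize(T value, Stream stream);
    T Deserialize(Stream stream);
}

// Writes nothing and rebuilds the object with its default constructor.
// Only suitable for types that carry no state.
public abstract class DefaultConstructedSerializer<T> : ISerializer<T>
    where T : class, new()
{
    public void Serialize(T value, Stream stream)
    {
        // Intentionally empty: there is no state to persist.
    }

    public T Deserialize(Stream stream)
    {
        return new T();
    }
}

// Placeholder metric with no fields.
public sealed class ExactMatchMetric
{
    public double Similarity(string a, string b) => a == b ? 1.0 : 0.0;
}

// The per-metric subclass stays empty.
public sealed class ExactMatchMetricSerializer
    : DefaultConstructedSerializer<ExactMatchMetric>
{
}
```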

Add unit testing for IStringMetric.

More generic than Augury.Lucene unit tests.

  • Ensure results are the same when order is swapped.
  • Ensure matching value is greater than or equal to non-matching value.
  • Do the same for transpositions.
  • Insertions
  • Deletions
  • Test against empty and null strings. Test different lengths.

Add unit testing for ISerializer

Check to see if .Equals is overridden. If so, can simply serialize and deserialize. If not, need custom equality code in the unit test.

Investigate Double-Array Dawg

Expected outcome is that it will take a bit less space but run slower.

Could just implement both and leave them available for injection based on user preference.

Trade-offs:

  • DADAWG uses less memory. Maybe somewhere on the order of 30% less.
  • DADAWG can traverse down a known branch in O(1) while DAWG needs O(log |alphabet|).
  • DAWG can traverse all branches in O(|branches|) while DADAWG requires O(|alphabet|) in all cases.
  • Creation of DADAWG might be slower. Depends on how good of a space-filling algorithm I can create.
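
To make the traversal trade-off concrete, here is a small sketch of the usual base/check double-array transition. It is a simplified illustration with hypothetical names, and it takes prebuilt arrays rather than constructing them, since construction is the hard part noted above. Step is a single array lookup, while Children has to probe every symbol of the alphabet.

```csharp
using System.Collections.Generic;

// Hypothetical double-array automaton over prebuilt arrays.
// State 0 is the root; symbols are coded 1..alphabetSize with 'a' = 1.
public sealed class DoubleArrayAutomaton
{
    private readonly int[] _base;
    private readonly int[] _check;
    private readonly bool[] _terminal;
    private readonly int _alphabetSize;

    public DoubleArrayAutomaton(int[] baseArray, int[] check, bool[] terminal, int alphabetSize)
    {
        _base = baseArray;
        _check = check;
        _terminal = terminal;
        _alphabetSize = alphabetSize;
    }

    // O(1): the target slot is valid only if check points back at us.
    public int Step(int state, int symbol)
    {
        var next = _base[state] + symbol;
        return next < _check.Length && _check[next] == state ? next : -1;
    }

    // O(|alphabet|): every symbol must be probed, even for sparse states.
    public IEnumerable<int> Children(int state)
    {
        for (var symbol = 1; symbol <= _alphabetSize; symbol++)
        {
            var next = Step(state, symbol);
            if (next >= 0) yield return next;
        }
    }

    public bool Contains(string word)
    {
        var state = 0;
        foreach (var c in word)
        {
            var symbol = c - 'a' + 1;
            if (symbol < 1 || symbol > _alphabetSize) return false;
            state = Step(state, symbol);
            if (state < 0) return false;
        }
        return _terminal[state];
    }
}
```

With base = {1, 0, 2, 0, 0}, check = {-1, -1, 0, 0, 2}, terminal = {false, false, false, true, true} and an alphabet of {a, b}, the automaton accepts exactly "ab" and "b".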

Investigate removing string indexed dictionaries

These all cause strings to be stored in memory/interned. That takes up a large majority of the memory consumed by Auger.

Could consider creating a radix trie that stores int indices instead.

If using a trie, the DAWG will be redundant information. The exact same traversal method can be used.
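
A minimal sketch of the index-storing idea, using a plain character trie rather than a radix trie to keep it short. Names are hypothetical. Each word is assigned an int the first time it is added, so downstream tables could be keyed by int instead of by string.

```csharp
using System.Collections.Generic;

// Hypothetical word-to-index trie; a radix trie would merge
// single-child chains, which this simplified version does not do.
public sealed class WordIndexTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public int Index = -1; // -1 means "no word ends here"
    }

    private readonly Node _root = new Node();
    private int _nextIndex;

    // Returns the existing index for the word, or assigns the next free one.
    public int GetOrAdd(string word)
    {
        var node = _root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out var child))
            {
                child = new Node();
                node.Children[c] = child;
            }
            node = child;
        }
        if (node.Index < 0) node.Index = _nextIndex++;
        return node.Index;
    }

    public bool TryGetIndex(string word, out int index)
    {
        index = -1;
        var node = _root;
        foreach (var c in word)
        {
            if (!node.Children.TryGetValue(c, out node)) return false;
        }
        index = node.Index;
        return index >= 0;
    }
}
```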

Clean up Auger

Pretty sure it could use a rewrite of that ugly switch statement. Something like

// requires: using System.Linq;
var temp = history.Skip(history.Length - 3).ToList();
previous = temp.Last();
partial = temp.Take(temp.Count - 1);

Or, just change the method signature to accept partial and history separately.

Improve the Tokenizer

I can probably draw from some existing code online. Also, try to make it work on a stream rather than a giant string.
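
A sketch of the streaming approach, assuming a very simple token rule (runs of letters, digits and apostrophes, lowercased). The rule and the names are placeholders, not the existing Tokenizer behavior. Reading from a TextReader one character at a time means the whole corpus never has to sit in memory as a single string.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical streaming tokenizer.
public static class StreamTokenizer
{
    // Lazily yields lowercased tokens as characters are read.
    public static IEnumerable<string> Tokenize(TextReader reader)
    {
        var current = new StringBuilder();
        int read;
        while ((read = reader.Read()) != -1)
        {
            var c = (char)read;
            if (char.IsLetterOrDigit(c) || c == '\'')
            {
                current.Append(char.ToLowerInvariant(c));
            }
            else if (current.Length > 0)
            {
                // Any other character ends the current token.
                yield return current.ToString();
                current.Clear();
            }
        }
        if (current.Length > 0) yield return current.ToString();
    }
}
```

The same method works over a file with `new StreamReader(path)` or over an in-memory string with `new StringReader(text)`.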

Add unit testing for ISpellCheck

  • Create with small, known corpus. Test things we know are true and things we know aren't true.
    • Use the injected string metric to validate results. Add manual results that you know were traversed in the tree (or whatever structure is used) and should not make it past the metric. Use a case where you know all the expected words and make sure they're present.
  • Create with large corpus. Present with selections from corpus.

Add unit testing for Augury.Lucene

Test JaroWinkler and BoundedJaroWinkler.

  • Make sure BoundedJaroWinkler is always less than or equal to JaroWinkler.
  • Test known values. Test cases with 0,1,all of matches and transpositions.
  • Test various string lengths. Nulls and empty strings as well.
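
For the known-value tests, a reference implementation to compare against could look like the sketch below. This is the textbook Jaro-Winkler similarity (prefix capped at 4, scaling factor 0.1, no boost threshold), written independently for illustration. It is not the Augury.Lucene code, and the Lucene-derived version may differ in details such as thresholds. Null inputs are not handled here.

```csharp
using System;

// Textbook Jaro-Winkler similarity; hypothetical reference for tests.
public static class JaroWinklerSketch
{
    public static double Jaro(string a, string b)
    {
        if (a.Length == 0 && b.Length == 0) return 1.0;
        if (a.Length == 0 || b.Length == 0) return 0.0;

        // Characters match only if they are within this distance of each other.
        var window = Math.Max(0, Math.Max(a.Length, b.Length) / 2 - 1);
        var aMatched = new bool[a.Length];
        var bMatched = new bool[b.Length];
        var matches = 0;
        for (var i = 0; i < a.Length; i++)
        {
            var start = Math.Max(0, i - window);
            var end = Math.Min(b.Length, i + window + 1);
            for (var j = start; j < end; j++)
            {
                if (bMatched[j] || a[i] != b[j]) continue;
                aMatched[i] = bMatched[j] = true;
                matches++;
                break;
            }
        }
        if (matches == 0) return 0.0;

        // Walk both match sequences in order and count disagreements.
        var outOfOrder = 0;
        for (int i = 0, j = 0; i < a.Length; i++)
        {
            if (!aMatched[i]) continue;
            while (!bMatched[j]) j++;
            if (a[i] != b[j]) outOfOrder++;
            j++;
        }

        double m = matches;
        var transpositions = outOfOrder / 2;
        return (m / a.Length + m / b.Length + (m - transpositions) / m) / 3.0;
    }

    public static double JaroWinkler(string a, string b, double scaling = 0.1)
    {
        var jaro = Jaro(a, b);
        var prefix = 0;
        var limit = Math.Min(4, Math.Min(a.Length, b.Length));
        while (prefix < limit && a[prefix] == b[prefix]) prefix++;
        return jaro + prefix * scaling * (1.0 - jaro);
    }
}
```

The classic check value is "MARTHA" against "MARHTA": six matches, one transposition, a shared prefix of three, giving roughly 0.9611.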

Add unit testing for ILanguageModel

  • Use small corpus and hand-calculated values to double check.
  • Use large corpus and ensure some select known phrases compare predictably with each other, e.g. "How are you" should score higher than "How are monkeys".

Add unit testing for PriorityQueue.WordQueue

It should be simple enough.

  • Attempt inserts in sorted, unsorted, reversed orders.
  • Try with lists that are full, too small, empty, too big.
  • Try with tie breakers. Use a list where you would expect them and validate. Then add a value and see that the tie-breakers are gone. Then add more tie-breakers and validate.
  • Try negative values only. Mixed negative and positive. Positive only.
  • Try adding the same string twice.
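
To pin down what these tests would be checking, here is a sketch of a bounded top-N queue. It is a stand-in with hypothetical names and tie-breaking rules; PriorityQueue.WordQueue itself may handle ties and duplicates differently. In this version an earlier entry wins a tie, and duplicate strings are not deduplicated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical fixed-capacity queue that keeps the highest-scoring words.
public sealed class TopWords
{
    private readonly int _capacity;
    private readonly List<KeyValuePair<string, double>> _items =
        new List<KeyValuePair<string, double>>();

    public TopWords(int capacity)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _capacity = capacity;
    }

    // Returns false when the score is too low to enter a full queue.
    public bool TryAdd(string word, double score)
    {
        // Insert before the first strictly lower score, so ties keep arrival order.
        var index = _items.FindIndex(item => item.Value < score);
        if (index < 0) index = _items.Count;
        if (index >= _capacity) return false;

        _items.Insert(index, new KeyValuePair<string, double>(word, score));
        if (_items.Count > _capacity) _items.RemoveAt(_items.Count - 1);
        return true;
    }

    // Words ordered from highest to lowest score.
    public IReadOnlyList<string> Words => _items.Select(item => item.Key).ToList();
}
```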

Add smarter traversal for SpellCheck.PrefixLookup

Instead of the completely arbitrary "maxErrors = prefix.Length > 4 ? 2 : prefix.Length > 2 ? 1 : 0", we could use the metric. A more likely scenario is a new interface type that would be a preemptive filter-metric, since a partial traversal of "word" by "wo" will fail, even though we mean to go further.
If there is a new interface for this and we feed it to the constructor, we should really provide a default, because this is going too far down the rabbit-hole of customization.
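
For reference, the current rule written out as a function, wrapped in one possible shape for the injectable default. The interface and class names are hypothetical; only the expression inside MaxErrors comes from the issue text above.

```csharp
// Hypothetical interface for the pre-emptive filter described above.
public interface IPrefixErrorBudget
{
    int MaxErrors(int prefixLength);
}

// Default implementation reproducing the existing hard-coded rule:
// 0 errors up to length 2, 1 error for lengths 3-4, 2 errors beyond that.
public sealed class LengthBasedErrorBudget : IPrefixErrorBudget
{
    public int MaxErrors(int prefixLength)
    {
        return prefixLength > 4 ? 2 : prefixLength > 2 ? 1 : 0;
    }
}
```

A constructor could then take an optional IPrefixErrorBudget and fall back to LengthBasedErrorBudget when none is supplied, which keeps the default behavior unchanged.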
