mycroft's Issues

Documentation pass

Go over the codebase and insert inline documentation for the classes and functions.

Predict needs --label-name and --omit-labels options

The "coarse" labels in the Stanford Sentiment Treebank include a "neutral" value which should not be included in any evaluation. It would be good to be able to handle this directly from the Mycroft command line.

Add --label-name and --omit-labels just as in the evaluate command.

If you supply these labels we can also print the accuracy to standard out.
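A sketch of what the proposed options would do in predict (the helper name `filter_and_score` is hypothetical): drop samples whose gold label is in the omitted set, then compute accuracy over what remains.

```python
# Hypothetical helper illustrating --label-name/--omit-labels in predict:
# drop rows whose gold label is omitted, then report accuracy on the rest.

def filter_and_score(gold_labels, predicted_labels, omit_labels=("neutral",)):
    """Drop samples whose gold label is omitted, then compute accuracy."""
    kept = [(g, p) for g, p in zip(gold_labels, predicted_labels)
            if g not in omit_labels]
    correct = sum(1 for g, p in kept if g == p)
    return correct / len(kept) if kept else 0.0

accuracy = filter_and_score(
    ["positive", "neutral", "negative", "positive"],
    ["positive", "positive", "negative", "negative"])
print(accuracy)  # 2 of the 3 non-neutral samples are correct
```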

Support command-line interface for custom models

It would be a waste of time for me to try to put all the possible Keras options you could tweak into the command line interface. Ultimately Mycroft is more useful if you can write your own models in code.

Still, Mycroft provides some advantages over starting from scratch every time: all the pandas data processing, seamless incorporation of word embedding vectors, prediction and evaluation code that stays the same regardless of model, and a command line interface. It would be good to be able to leverage these easily for new models.

Do this by adding command line support to the TextEmbeddingClassifier. Add to the interface a command_line function that returns an argparse subparser. Refactor the current models to use this and call it from console.py:main to build the command line parser. Also export a main function that somebody else's code could call passing in a TextEmbeddingClassifier as an argument. This would build a command-line interface for just this class.

Would need to explain this clearly in the README, where the value-add over writing Keras code from scratch lies.
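The interface described above could look something like the following sketch. `MyClassifier` is a stand-in for a `TextEmbeddingClassifier` subclass; the exact signatures are assumptions, not the project's actual API.

```python
# Sketch of the proposed command_line/main interface (details assumed).
import argparse

class MyClassifier:  # stands in for a TextEmbeddingClassifier subclass
    @staticmethod
    def command_line(subparsers):
        """Return an argparse subparser holding this model's options."""
        parser = subparsers.add_parser("my-model", help="train my custom model")
        parser.add_argument("--units", type=int, default=64)
        return parser

def main(classifier_class):
    """Build a command-line interface for a single classifier class."""
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest="model")
    classifier_class.command_line(subparsers)
    return parser

args = main(MyClassifier).parse_args(["my-model", "--units", "128"])
print(args.model, args.units)  # my-model 128
```

With this shape, console.py:main would call `command_line` on each built-in model, and third-party code could call `main` with its own class to get a CLI for just that model.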

Optionally use a TF-IDF SVM bag of words model

Mycroft is primarily a tool for doing learning over word embeddings, so this is here as a convenient baseline.

There's no point in putting a lot of work into supporting all the various grid parameters of an SVM, or different learning algorithms. Anyone who wants to seriously explore this space can write their own sklearn code.

Add TensorBoard Support

Keras supports logging to the TensorBoard format. Provide a command line option to use this. Maybe settle on TensorBoard as Mycroft's preferred format for detailed analysis of the training process. Maybe even ditch the history.json file.

Smoke test for Stanford Sentiment data set

Before trying to reproduce published results (#11), just get a set of results for the Stanford Sentiment data set. This is a smoke test to give me some confidence that this is working for release 1.0.0.

Do this as an experimental grid over a few different model/hyperparameter combinations on the coarse (either "positive" or "negative") sentiment samples. Do an 80/20 split and measure validation accuracy and loss.

These were run on a G1 GPU server on Floydhub with 4 cores, 1 GPU, 61G memory, and 12G GPU memory.

Model  Parameters           Time     Accuracy  Loss     URL
svm    defaults             1:08:28  0.81020   1.14729  mycroft/9
nbow   batch=32             2:38     0.70025   1.91377  mycroft/10
nbow   batch=64             1:33     0.70969   1.96907  mycroft/11
nseq   batch=32, units=32   2:26:52  0.80131   2.45249  mycroft/5
nseq   batch=32, units=64   2:28:39  0.80594   2.40330  mycroft/6
nseq   batch=32, units=256  2:59:28  0.80365   2.37600  mycroft/7
nseq   batch=256, units=64  24:18    0.75026   2.53510  mycroft/13

Conclusions:

  • SVM and sequential models are about the same. Both outperform a neural bag of words model.
  • SVM trains a lot faster.
  • The literature says 80% is a decent baseline for this data set, so the code does appear to be working.

Models are subcommands of a training command

Instead of creating a separate training command for each model, make models a subcommand of a shared train command. The training arguments will be the same across all of them.

The train command could then also have a load subcommand that reloads an existing model instead of creating a new one.
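The layout described above can be sketched with argparse subparsers, sharing the training arguments through a parent parser. The option names here mirror ones seen elsewhere in this page; the exact structure is an assumption.

```python
# Hypothetical layout: one "train" command with per-model subcommands plus
# "load", all sharing the same training arguments via a parent parser.
import argparse

shared = argparse.ArgumentParser(add_help=False)
shared.add_argument("--epochs", type=int, default=10)
shared.add_argument("--batch-size", type=int, default=32)

parser = argparse.ArgumentParser(prog="mycroft")
commands = parser.add_subparsers(dest="command")
train = commands.add_parser("train")
models = train.add_subparsers(dest="model")
for name in ("nbow", "nseq", "conv"):
    models.add_parser(name, parents=[shared])
models.add_parser("load", parents=[shared]).add_argument("model_directory")

args = parser.parse_args(["train", "conv", "--epochs", "5"])
print(args.command, args.model, args.epochs)  # train conv 5
```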

Demo

Have a demo command that, given a text file, randomly shuffles the words in some of its lines, then builds and runs a classifier to detect the difference between shuffled and non-shuffled English text.

It writes the data files and output model to the current directory so you have an example of how to configure your own data files.
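The data-generation half of the demo could be as simple as the following sketch; `make_demo_data` is a hypothetical helper, not part of the current code.

```python
# Minimal sketch of the demo's data generation: shuffle the words in roughly
# half the lines and label each line "shuffled" or "original".
import random

def make_demo_data(lines, shuffle_fraction=0.5, seed=0):
    rng = random.Random(seed)
    samples = []
    for line in lines:
        words = line.split()
        if rng.random() < shuffle_fraction:
            rng.shuffle(words)
            samples.append((" ".join(words), "shuffled"))
        else:
            samples.append((line, "original"))
    return samples

data = make_demo_data(["the cat sat on the mat", "dogs chase cats"])
for text, label in data:
    print(label, text)
```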

Change meaning of maximum vocabulary in training

If no vocabulary size argument is specified simply make the vocabulary the set of all types in the language model for which we have embedding vectors. Don't tie this to the number of types in the data.

Including all the types in the language model vocabulary will always be the "best" we can do, so it should be the default.

Syntactically driven convolutional models

The standard convolutional model generates features from windows of contiguous words. If we are given a syntactic parse of a sentence we can instead generate a window for each syntactic component. Perhaps the pooling layer could extract the maximum value from sets of constituents at the same parsing depth.

The more general idea is to redefine the notion of "window" to be something more relevant than contiguous tokens. Maybe implement a more generic version of Keras' Conv1D.

Check the literature to see if this has been done. This is basically a research paper idea.

Break out API

Separate code that interacts with the console or the file system from code that does the work. All the former should go into console.py. The latter should be organized as an API.

(One exception: training will write a model file.)

Create a website for this project

Use all the standard GitHub tools. Have autogenerated API documentation, tutorials on writing your own classifiers, and write-ups of experiments comparable to published baselines.

Error when specifying a non-default sequence length

For example

mycroft train conv ../data/MR/scaledata.csv --validation-fraction 0.1 --sequence-length 100 --text-name subj --label-name label.3class
usage: mycroft train conv [-h] [--limit LIMIT]
                          [--validation-fraction FRACTION]
                          [--validation-data FILE] [--text-name NAME]
                          [--label-name NAME]
                          [--omit-labels [LABEL [LABEL ...]]]
                          [--epochs EPOCHS] [--batch-size SIZE]
                          [--model-directory DIRECTORY]
                          [--logging {none,progress,epoch}]
                          [--tensor-board DIRECTORY]
                          [--sequence-length LENGTH] [--vocabulary-size SIZE]
                          [--dropout DROPOUT] [--filters FILTERS]
                          [--kernel-size SIZE] [--pool-factor FACTOR]
                          [--language-model MODEL]
                          TRAINING-FILE
mycroft train conv: error: argument --sequence-length: invalid NoneType value: '100'
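The message "invalid NoneType value" suggests argparse ended up with `type(None)` as the converter, which can happen when the parser derives `type=` from a default value that is None. A guess at a minimal reproduction and the likely fix:

```python
# argparse reports "invalid NoneType value" when the type callable is
# NoneType, e.g. if the code derives the type from a default of None.
import argparse

broken = argparse.ArgumentParser()
default = None
broken.add_argument("--sequence-length", type=type(default))  # NoneType!

rejected = False
try:
    broken.parse_args(["--sequence-length", "100"])
except SystemExit:
    rejected = True  # argparse printed "invalid NoneType value: '100'"

# The fix is an explicit converter alongside the None default:
fixed = argparse.ArgumentParser()
fixed.add_argument("--sequence-length", type=int, default=None)
print(fixed.parse_args(["--sequence-length", "100"]).sequence_length)  # 100
```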

Support for Very Large Vocabularies

Don't serialize the vocabulary when serializing a TextSequenceEmbedder object. Instead, recreate the vocabulary when deserializing.

This would require breaking off the vocabulary creation at the end of the TextSequenceEmbedder constructor into its own class and overriding the __getstate__ and __setstate__ pickle functions.

This is good because you may want to have huge vocabularies (e.g. 2,000,000 words) that work just fine as embedding matrices but cause pickle to run out of memory when trying to serialize them.
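The pickle protocol hooks mentioned above work roughly like this sketch; `build_vocabulary` stands in for the factored-out vocabulary creation, and the class here is a simplified stand-in for the real TextSequenceEmbedder.

```python
# Sketch of the proposal: drop the (potentially huge) vocabulary from the
# pickled state and rebuild it from the language model on load.
import pickle

def build_vocabulary(language_model):
    # Stand-in for the expensive vocabulary-creation step.
    return {word: i for i, word in enumerate(language_model)}

class TextSequenceEmbedder:
    def __init__(self, language_model):
        self.language_model = language_model
        self.vocabulary = build_vocabulary(language_model)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["vocabulary"]  # don't serialize the big mapping
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.vocabulary = build_vocabulary(self.language_model)

embedder = TextSequenceEmbedder(["the", "cat", "sat"])
restored = pickle.loads(pickle.dumps(embedder))
print(restored.vocabulary == embedder.vocabulary)  # True
```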

Add regularization to built-in models

Give callers the option of specifying l1 or l2 and a regularization rate.

Apply this to the Dense layer at the end of the two built-in models.

Again, don't want to add too many more bells and whistles to the built-in models beyond this.
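Conceptually, the option adds a weight penalty to the loss; in the built-in models it would become a kernel regularizer on the final Dense layer. A pure-Python sketch of the penalty itself (the function name and option plumbing are assumptions):

```python
# Conceptual sketch: l1/l2 regularization adds a weight penalty to the loss,
# scaled by the caller-specified rate.

def regularization_penalty(weights, kind="l2", rate=0.01):
    if kind == "l1":
        return rate * sum(abs(w) for w in weights)
    elif kind == "l2":
        return rate * sum(w * w for w in weights)
    raise ValueError("kind must be 'l1' or 'l2'")

weights = [0.5, -1.0, 2.0]
print(regularization_penalty(weights, "l1", 0.1))  # rate * (0.5 + 1.0 + 2.0)
print(regularization_penalty(weights, "l2", 0.1))  # rate * (0.25 + 1.0 + 4.0)
```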

Cross-validation

As an alternative to specifying a validation set (#31) or proportion, you should be able to specify the number of folds for cross-validation. Models will be saved to subdirectories of the model directory. Print the mean and standard deviation of the statistics.

Obviously you can do this by making multiple runs with a validation proportion, but that's a lot of command lines to keep track of.
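The fold bookkeeping the option would automate can be sketched as follows; the training run itself is replaced by a placeholder, and `k_fold_indices` is a hypothetical helper.

```python
# Sketch of the proposed --folds behavior: split sample indices into k folds,
# hold each fold out in turn, then report mean and standard deviation of the
# per-fold statistic.
import statistics

def k_fold_indices(n, folds):
    """Yield (train_indices, validation_indices) pairs for k-fold CV."""
    fold_size, remainder = divmod(n, folds)
    start = 0
    for fold in range(folds):
        size = fold_size + (1 if fold < remainder else 0)
        validation = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, validation
        start += size

fold_accuracies = []
for train, validation in k_fold_indices(10, 5):
    accuracy = len(validation) / 10  # placeholder for a real training run
    fold_accuracies.append(accuracy)

print(statistics.mean(fold_accuracies), statistics.stdev(fold_accuracies))
```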

Delete the svm Model

I put that in to make things easier for Anna's intern, but really supporting every possible text classifier is beyond the scope of this project. Mycroft should just focus on deep learning methods. If you want to do something else, you're better off using scikit-learn.

Also, SVM support complicates the architecture and user interface somewhat.

Get vocabulary from data

Instead of specifying the vocabulary size and creating an Embedder from that, do a pass over the training data and build an embedder that contains precisely the words in that data.

Actually, the interesting experimental question is: does this help, or is it just insignificant fine tuning?
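The pass over the data amounts to something like this sketch, where the embeddings dict stands in for the language model:

```python
# Sketch of the proposal: one pass over the training texts, keeping only the
# types that also have an embedding vector.
from collections import Counter

def vocabulary_from_data(texts, embeddings):
    counts = Counter(token for text in texts for token in text.split())
    return {token for token in counts if token in embeddings}

embeddings = {"the": [0.1], "cat": [0.2], "sat": [0.3]}
texts = ["the cat sat", "the dog ran"]
print(sorted(vocabulary_from_data(texts, embeddings)))  # ['cat', 'sat', 'the']
```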

Support multiple text inputs

Mycroft should support multiple text inputs so that it can be used for multi-text labeling tasks like textual entailment.

TextEmbeddingClassifier.train would take a list of text arguments and pass a list of training_vectors to self.model.fit. (Maybe the list would be converted to a single vector if there is only one text input.)

The --text-name option would take multiple arguments. I'd need to add a mechanism for the TextEmbeddingClassifier subclass to specify what these would be. The default would remain a single argument called "text".
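The command-line side of that change is straightforward with argparse; this sketch shows the assumed option shape, not the project's actual parser.

```python
# Sketch of the --text-name change: accept multiple column names, defaulting
# to a single column called "text".
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--text-name", nargs="+", default=["text"], metavar="NAME")

single = parser.parse_args([]).text_name
multi = parser.parse_args(["--text-name", "premise", "hypothesis"]).text_name
print(single)  # ['text']
print(multi)   # ['premise', 'hypothesis']
```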

Add support for grid search over hyperparameters?

This is a nice-to-have. Of course you can always do grid search by generating Mycroft command lines outside this package, but maybe it's worth doing inside this code if (a) there would be a convenient command line interface (b) it doesn't require much implementation work.

The criterion for (b) is "can I wrap these models in Keras' scikit-learn layer?" If not, I'm not going to reimplement scikit-learn's grid search capability.

For (a) I guess you could simply change all the command line arguments to take multiple values, though you might end up wanting to be able to launch a job from a configuration file.

This would need some investigation to see if it's worthwhile.
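The "generating Mycroft command lines outside this package" approach mentioned above is already a few lines of stdlib Python; the command skeleton below is illustrative, not an endorsement of specific option names.

```python
# Sketch of grid search by generated command lines: take the cross product of
# a parameter grid and emit one command line per combination.
from itertools import product

def command_lines(grid, base="mycroft train nseq train.csv"):
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        options = " ".join(f"--{k} {v}" for k, v in zip(keys, values))
        yield f"{base} {options}"

grid = {"batch-size": [32, 256], "dropout": [0.2, 0.5]}
for line in command_lines(grid):
    print(line)  # four command lines, one per combination
```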

Explain the metrics

Explain that acc is classification accuracy and loss is cross-entropy loss in the documentation and the online help.
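The two metrics computed by hand, which the documentation could include as a worked example (the dictionary-of-probabilities representation here is for illustration only):

```python
# acc is the fraction of correct predictions; loss is the mean negative log
# probability the model assigned to the correct label (cross-entropy).
import math

def accuracy(gold, predicted):
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def cross_entropy(gold, probabilities):
    """Mean negative log probability assigned to the correct label."""
    return -sum(math.log(p[g]) for g, p in zip(gold, probabilities)) / len(gold)

gold = [1, 0]
probabilities = [{0: 0.2, 1: 0.8}, {0: 0.9, 1: 0.1}]
print(accuracy(gold, [1, 0]))  # 1.0
print(cross_entropy(gold, probabilities))  # -(ln 0.8 + ln 0.9) / 2
```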

Reproduce published results from Yoon Kim 2014

Use this code to reproduce published results for convolutional models for text classification. Put the results and references in the README.

Nothing exhaustive, just reproduce something to show that it's working.
