mycroft's Issues

Documentation pass

Go over the codebase and insert inline documentation for the classes and functions.

Predict needs --label-name and --omit-labels options

The "coarse" labels in the Stanford Sentiment Treebank include a "neutral" value which should not be included in any evaluation. It would be good to be able to handle this directly from the Mycroft command line.

Add --label-name and --omit-labels just as in the evaluate command.

If you supply these labels we can also print the accuracy to standard out.
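A sketch of what the proposed options would do in predict (the helper name `filter_and_score` is hypothetical): drop samples whose gold label is in the omitted set, then compute accuracy over what remains.

```python
# Hypothetical helper illustrating --label-name/--omit-labels in predict:
# drop rows whose gold label is omitted, then report accuracy on the rest.

def filter_and_score(gold_labels, predicted_labels, omit_labels=("neutral",)):
    """Drop samples whose gold label is omitted, then compute accuracy."""
    kept = [(g, p) for g, p in zip(gold_labels, predicted_labels)
            if g not in omit_labels]
    correct = sum(1 for g, p in kept if g == p)
    return correct / len(kept) if kept else 0.0

accuracy = filter_and_score(
    ["positive", "neutral", "negative", "positive"],
    ["positive", "positive", "negative", "negative"])
print(accuracy)  # 2 of the 3 non-neutral samples are correct
```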

Support command-line interface for custom models

It would be a waste of time for me to try to put all the possible Keras options you could tweak into the command line interface. Ultimately Mycroft is more useful if you can write your own models in code.

Still, Mycroft provides some advantages over starting from scratch every time: all the pandas data processing, seamless incorporation of word embedding vectors, prediction and evaluation code that stays the same regardless of model, and a command line interface. It would be good to be able to leverage these easily for new models.

Do this by adding command line support to the TextEmbeddingClassifier. Add to the interface a command_line function that returns an argparse subparser. Refactor the current models to use this and call it from console.py:main to build the command line parser. Also export a main function that somebody else's code could call passing in a TextEmbeddingClassifier as an argument. This would build a command-line interface for just this class.

Would need to explain this clearly in the README, where the value-add over writing Keras code from scratch lies.
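The interface described above could look something like the following sketch. `MyClassifier` is a stand-in for a `TextEmbeddingClassifier` subclass; the exact signatures are assumptions, not the project's actual API.

```python
# Sketch of the proposed command_line/main interface (details assumed).
import argparse

class MyClassifier:  # stands in for a TextEmbeddingClassifier subclass
    @staticmethod
    def command_line(subparsers):
        """Return an argparse subparser holding this model's options."""
        parser = subparsers.add_parser("my-model", help="train my custom model")
        parser.add_argument("--units", type=int, default=64)
        return parser

def main(classifier_class):
    """Build a command-line interface for a single classifier class."""
    parser = argparse.ArgumentParser()
    subparsers = parser.add_subparsers(dest="model")
    classifier_class.command_line(subparsers)
    return parser

args = main(MyClassifier).parse_args(["my-model", "--units", "128"])
print(args.model, args.units)  # my-model 128
```

With this shape, console.py:main would call `command_line` on each built-in model, and third-party code could call `main` with its own class to get a CLI for just that model.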

Optionally use a TF-IDF SVM bag of words model

Mycroft is primarily a tool for doing learning over word embeddings, so this is here as a convenient baseline.

There's no point in putting a lot of work into supporting all the various grid parameters of an SVM, or different learning algorithms. Anyone who wants to seriously explore this space can write their own sklearn code.

Add TensorBoard Support

Keras supports logging to the TensorBoard format. Provide a command line option to use this. Maybe settle on TensorBoard as Mycroft's preferred format for detailed analysis of the training process. Maybe even ditch the history.json file.

Smoke test for Stanford Sentiment data set

Before trying to reproduce published results (#11), just get a set of results for the Stanford Sentiment data set. This is a smoke test to give me some confidence that this is working for release 1.0.0.

Do this as an experimental grid over a few different model/hyperparameter combinations on the coarse (either "positive" or "negative") sentiment samples. Do an 80/20 split and measure validation accuracy and loss.

These were run on a G1 GPU server on Floydhub with 4 cores, 1 GPU, 61G memory, and 12G GPU memory.

Model  Parameters           Time     Accuracy  Loss     URL
svm    defaults             1:08:28  0.81020   1.14729  mycroft/9
nbow   batch=32             2:38     0.70025   1.91377  mycroft/10
nbow   batch=64             1:33     0.70969   1.96907  mycroft/11
nseq   batch=32, units=32   2:26:52  0.80131   2.45249  mycroft/5
nseq   batch=32, units=64   2:28:39  0.80594   2.40330  mycroft/6
nseq   batch=32, units=256  2:59:28  0.80365   2.37600  mycroft/7
nseq   batch=256, units=64  24:18    0.75026   2.53510  mycroft/13

Conclusions:

  • SVM and sequential models are about the same. Both outperform a neural bag of words model.
  • SVM trains a lot faster.
  • The literature says 80% is a decent baseline for this data set, so the code does appear to be working.

Models are subcommands of a training command

Instead of creating a separate training command for each model, make models a subcommand of a shared train command. The training arguments will be the same across all of them.

The train command could then also have a load subcommand that reloads an existing model instead of creating a new one.
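The layout described above can be sketched with argparse subparsers, sharing the training arguments through a parent parser. The option names here mirror ones seen elsewhere in this page; the exact structure is an assumption.

```python
# Hypothetical layout: one "train" command with per-model subcommands plus
# "load", all sharing the same training arguments via a parent parser.
import argparse

shared = argparse.ArgumentParser(add_help=False)
shared.add_argument("--epochs", type=int, default=10)
shared.add_argument("--batch-size", type=int, default=32)

parser = argparse.ArgumentParser(prog="mycroft")
commands = parser.add_subparsers(dest="command")
train = commands.add_parser("train")
models = train.add_subparsers(dest="model")
for name in ("nbow", "nseq", "conv"):
    models.add_parser(name, parents=[shared])
models.add_parser("load", parents=[shared]).add_argument("model_directory")

args = parser.parse_args(["train", "conv", "--epochs", "5"])
print(args.command, args.model, args.epochs)  # train conv 5
```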

Demo

Have a demo command that, given a text file, randomly shuffles the words in some of its lines, then builds and runs a classifier to detect the difference between shuffled and non-shuffled English text.

It writes the data files and output model to the current directory so you have an example of how to configure your own data files.
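The data-generation half of the demo could be as simple as the following sketch; `make_demo_data` is a hypothetical helper, not part of the current code.

```python
# Minimal sketch of the demo's data generation: shuffle the words in roughly
# half the lines and label each line "shuffled" or "original".
import random

def make_demo_data(lines, shuffle_fraction=0.5, seed=0):
    rng = random.Random(seed)
    samples = []
    for line in lines:
        words = line.split()
        if rng.random() < shuffle_fraction:
            rng.shuffle(words)
            samples.append((" ".join(words), "shuffled"))
        else:
            samples.append((line, "original"))
    return samples

data = make_demo_data(["the cat sat on the mat", "dogs chase cats"])
for text, label in data:
    print(label, text)
```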

Change meaning of maximum vocabulary in training

If no vocabulary size argument is specified simply make the vocabulary the set of all types in the language model for which we have embedding vectors. Don't tie this to the number of types in the data.

Including all the types in the language model vocabulary will always be the "best" we can do, so it should be the default.

Syntactically driven convolutional models

The standard convolutional model generates features from windows of contiguous words. If we are given a syntactic parse of a sentence we can instead generate a window for each syntactic component. Perhaps the pooling layer could extract the maximum value from sets of constituents at the same parsing depth.

The more general idea is to redefine the notion of "window" to be something more relevant than contiguous tokens. Maybe implement a more generic version of Keras' Conv1D.

Check the literature to see if this has been done. This is basically a research paper idea.

Break out API

Separate code that interacts with the console or the file system from code that does the work. All the former should go into console.py. The latter should be organized as an API.

(One exception: training will write a model file.)

Create a website for this project

Use all the standard GitHub tools. Have autogenerated API documentation, tutorials on writing your own classifiers, and write-ups of experiments comparable to published baselines.

Error when specifying a non-default sequence length

For example

mycroft train conv ../data/MR/scaledata.csv --validation-fraction 0.1 --sequence-length 100 --text-name subj --label-name label.3class
usage: mycroft train conv [-h] [--limit LIMIT]
                          [--validation-fraction FRACTION]
                          [--validation-data FILE] [--text-name NAME]
                          [--label-name NAME]
                          [--omit-labels [LABEL [LABEL ...]]]
                          [--epochs EPOCHS] [--batch-size SIZE]
                          [--model-directory DIRECTORY]
                          [--logging {none,progress,epoch}]
                          [--tensor-board DIRECTORY]
                          [--sequence-length LENGTH] [--vocabulary-size SIZE]
                          [--dropout DROPOUT] [--filters FILTERS]
                          [--kernel-size SIZE] [--pool-factor FACTOR]
                          [--language-model MODEL]
                          TRAINING-FILE
mycroft train conv: error: argument --sequence-length: invalid NoneType value: '100'
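The message "invalid NoneType value" suggests argparse ended up with `type(None)` as the converter, which can happen when the parser derives `type=` from a default value that is None. A guess at a minimal reproduction and the likely fix:

```python
# argparse reports "invalid NoneType value" when the type callable is
# NoneType, e.g. if the code derives the type from a default of None.
import argparse

broken = argparse.ArgumentParser()
default = None
broken.add_argument("--sequence-length", type=type(default))  # NoneType!

rejected = False
try:
    broken.parse_args(["--sequence-length", "100"])
except SystemExit:
    rejected = True  # argparse printed "invalid NoneType value: '100'"

# The fix is an explicit converter alongside the None default:
fixed = argparse.ArgumentParser()
fixed.add_argument("--sequence-length", type=int, default=None)
print(fixed.parse_args(["--sequence-length", "100"]).sequence_length)  # 100
```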

Support for Very Large Vocabularies

Don't serialize the vocabulary when serializing a TextSequenceEmbedder object. Instead, recreate the vocabulary when deserializing.

This would require breaking off the vocabulary creation at the end of the TextSequenceEmbedder constructor into its own class and overriding the __getstate__ and __setstate__ pickle functions.

This is good because you may want to have huge vocabularies (e.g. 2,000,000 words) that work just fine as embedding matrices but cause pickle to run out of memory when trying to serialize them.
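The pickle protocol hooks mentioned above work roughly like this sketch; `build_vocabulary` stands in for the factored-out vocabulary creation, and the class here is a simplified stand-in for the real TextSequenceEmbedder.

```python
# Sketch of the proposal: drop the (potentially huge) vocabulary from the
# pickled state and rebuild it from the language model on load.
import pickle

def build_vocabulary(language_model):
    # Stand-in for the expensive vocabulary-creation step.
    return {word: i for i, word in enumerate(language_model)}

class TextSequenceEmbedder:
    def __init__(self, language_model):
        self.language_model = language_model
        self.vocabulary = build_vocabulary(language_model)

    def __getstate__(self):
        state = self.__dict__.copy()
        del state["vocabulary"]  # don't serialize the big mapping
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.vocabulary = build_vocabulary(self.language_model)

embedder = TextSequenceEmbedder(["the", "cat", "sat"])
restored = pickle.loads(pickle.dumps(embedder))
print(restored.vocabulary == embedder.vocabulary)  # True
```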

Add regularization to built-in models

Give callers the option of specifying l1 or l2 and a regularization rate.

Apply this to the Dense layer at the end of the two built-in models.

Again, don't want to add too many more bells and whistles to the built-in models beyond this.
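Conceptually, the option adds a weight penalty to the loss; in the built-in models it would become a kernel regularizer on the final Dense layer. A pure-Python sketch of the penalty itself (the function name and option plumbing are assumptions):

```python
# Conceptual sketch: l1/l2 regularization adds a weight penalty to the loss,
# scaled by the caller-specified rate.

def regularization_penalty(weights, kind="l2", rate=0.01):
    if kind == "l1":
        return rate * sum(abs(w) for w in weights)
    elif kind == "l2":
        return rate * sum(w * w for w in weights)
    raise ValueError("kind must be 'l1' or 'l2'")

weights = [0.5, -1.0, 2.0]
print(regularization_penalty(weights, "l1", 0.1))  # rate * (0.5 + 1.0 + 2.0)
print(regularization_penalty(weights, "l2", 0.1))  # rate * (0.25 + 1.0 + 4.0)
```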

Cross-validation

As an alternative to specifying a validation set (#31) or proportion, you should be able to specify the number of folds for cross-validation. Models will be saved to subdirectories of the model directory. Print the mean and standard deviation of the statistics.

Obviously you can do this by making multiple runs with a validation proportion, but that's a lot of command lines to keep track of.
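The fold bookkeeping the option would automate can be sketched as follows; the training run itself is replaced by a placeholder, and `k_fold_indices` is a hypothetical helper.

```python
# Sketch of the proposed --folds behavior: split sample indices into k folds,
# hold each fold out in turn, then report mean and standard deviation of the
# per-fold statistic.
import statistics

def k_fold_indices(n, folds):
    """Yield (train_indices, validation_indices) pairs for k-fold CV."""
    fold_size, remainder = divmod(n, folds)
    start = 0
    for fold in range(folds):
        size = fold_size + (1 if fold < remainder else 0)
        validation = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, validation
        start += size

fold_accuracies = []
for train, validation in k_fold_indices(10, 5):
    accuracy = len(validation) / 10  # placeholder for a real training run
    fold_accuracies.append(accuracy)

print(statistics.mean(fold_accuracies), statistics.stdev(fold_accuracies))
```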

Delete the svm Model

I put that in to make things easier for Anna's intern, but really supporting every possible text classifier is beyond the scope of this project. Mycroft should just focus on deep learning methods. If you want to do something else, you're better off using scikit-learn.

Also, SVM support complicates the architecture and user interface somewhat.

Get vocabulary from data

Instead of specifying the vocabulary size and creating an Embedder from that, do a pass over the training data and build an embedder that contains precisely the words in that data.

Actually, the interesting experimental question is: does this help, or is it just insignificant fine tuning?
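The pass over the data amounts to something like this sketch, where the embeddings dict stands in for the language model:

```python
# Sketch of the proposal: one pass over the training texts, keeping only the
# types that also have an embedding vector.
from collections import Counter

def vocabulary_from_data(texts, embeddings):
    counts = Counter(token for text in texts for token in text.split())
    return {token for token in counts if token in embeddings}

embeddings = {"the": [0.1], "cat": [0.2], "sat": [0.3]}
texts = ["the cat sat", "the dog ran"]
print(sorted(vocabulary_from_data(texts, embeddings)))  # ['cat', 'sat', 'the']
```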

Support multiple text inputs

Mycroft should support multiple text inputs so that it can be used for multi-text labeling tasks like textual entailment.

TextEmbeddingClassifier.train would take a list of text arguments and pass a list of training_vectors to self.model.fit. (Maybe the list would be converted to a single vector if there is only one text input.)

The --text-name option would take multiple arguments. I'd need to add a mechanism for the TextEmbeddingClassifier subclass to specify what these would be. The default would remain a single argument called "text".
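The command-line side of that change is straightforward with argparse; this sketch shows the assumed option shape, not the project's actual parser.

```python
# Sketch of the --text-name change: accept multiple column names, defaulting
# to a single column called "text".
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--text-name", nargs="+", default=["text"], metavar="NAME")

single = parser.parse_args([]).text_name
multi = parser.parse_args(["--text-name", "premise", "hypothesis"]).text_name
print(single)  # ['text']
print(multi)   # ['premise', 'hypothesis']
```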

Add support for grid search over hyperparameters?

This is a nice-to-have. Of course you can always do grid search by generating Mycroft command lines outside this package, but maybe it's worth doing inside this code if (a) there would be a convenient command line interface (b) it doesn't require much implementation work.

The criterion for (b) is "can I wrap these models in Keras' scikit-learn layer?" If not, I'm not going to reimplement scikit-learn's grid search capability.

For (a) I guess you could simply change all the command line arguments to take multiple values, though you might end up wanting to be able to launch a job from a configuration file.

This would need some investigation to see if it's worthwhile.
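The "generating Mycroft command lines outside this package" approach mentioned above is already a few lines of stdlib Python; the command skeleton below is illustrative, not an endorsement of specific option names.

```python
# Sketch of grid search by generated command lines: take the cross product of
# a parameter grid and emit one command line per combination.
from itertools import product

def command_lines(grid, base="mycroft train nseq train.csv"):
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        options = " ".join(f"--{k} {v}" for k, v in zip(keys, values))
        yield f"{base} {options}"

grid = {"batch-size": [32, 256], "dropout": [0.2, 0.5]}
for line in command_lines(grid):
    print(line)  # four command lines, one per combination
```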

Explain the metrics

Explain that acc is classification accuracy and loss is cross-entropy loss in the documentation and the online help.
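The two metrics computed by hand, which the documentation could include as a worked example (the dictionary-of-probabilities representation here is for illustration only):

```python
# acc is the fraction of correct predictions; loss is the mean negative log
# probability the model assigned to the correct label (cross-entropy).
import math

def accuracy(gold, predicted):
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def cross_entropy(gold, probabilities):
    """Mean negative log probability assigned to the correct label."""
    return -sum(math.log(p[g]) for g, p in zip(gold, probabilities)) / len(gold)

gold = [1, 0]
probabilities = [{0: 0.2, 1: 0.8}, {0: 0.9, 1: 0.1}]
print(accuracy(gold, [1, 0]))  # 1.0
print(cross_entropy(gold, probabilities))  # -(ln 0.8 + ln 0.9) / 2
```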

Reproduce published results from Yoon Kim 2014

Use this code to reproduce published results for convolutional models for text classification. Put the results and references in the README.

Nothing exhaustive, just reproduce something to show that it's working.
