
xnmt's Issues

Ability to set learning rate of trainer

It would be nice to have an option to set the learning rate of the trainer. If the option is not specified, use the default learning rate.

Also, I think we should make Adam the default trainer. SGD takes forever to train.

Make docstring formatting consistent

Currently the format of our docstrings is not consistent. I'd suggest following the format in ResidualLSTMEncoder, as it seems to be the most thoroughly documented. We should also add the documentation style to the README.md or some other coding style document.

Update: To be more clear -- This means "use double quotes for docstrings" and "mark parameters as @param"
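
As an illustration of the convention described in the update, a docstring in that style might look like the following. The function and its parameters are hypothetical, and the `@return` tag is an assumption that goes beyond the two rules stated above.

```python
def pad_sentence(words, length, pad_token="<pad>"):
  """
  Pad a sentence to a fixed length (hypothetical helper, shown only for its docstring).

  @param words: list of tokens in the sentence
  @param length: target length after padding
  @param pad_token: token appended until the target length is reached
  @return: a new list with at least `length` tokens
  """
  return list(words) + [pad_token] * max(0, length - len(words))
```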

Error during decoding

Hi,

I ran a super-small experiment training on the dev set from the example Japanese data (see xnmt-small.yaml in mcds-exp.zip) and got the following error. It looks like this might be because search_strategy is outputting NumPy arrays instead of integers at each timestep. Is anyone else getting this problem?

[dynet] random seed: 852191151
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
=> Running ja_check
   > Training   
   Start training in minibatch mode...   
   Epoch 1.0000: train_ppl=372.9873 (words=5057, time=0-00:00:01)   
   Epoch 1.0000: test_ppl=220.3990 (words=5057, time=0-00:00:02)   
   Epoch 1.0000: best dev loss, writing model to xnmtmodel/dev.mod   
   > Evaluating   
   Traceback (most recent call last):
     File "/home/gneubig/work/xnmt/xnmt/xnmt_run_experiments.py", line 123, in <module>
          xnmt_trainer.input_reader.vocab, xnmt_trainer.output_reader.vocab, xnmt_trainer.translator))
     File "/usr0/home/gneubig/work/xnmt/xnmt/xnmt_decode.py", line 52, in xnmt_decode
          target_sentence = output_generator.process(token_string)[0]
     File "/usr0/home/gneubig/work/xnmt/xnmt/output.py", line 32, in process
          self.token_string.append(self.vocab[token])
     File "/usr0/home/gneubig/work/xnmt/xnmt/vocab.py", line 35, in __getitem__
          return self.i2w[i]
   TypeError: only integer scalar arrays can be converted to a scalar index
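
If the cause is indeed array-valued token ids, one possible workaround is to coerce each id to a plain Python int before the vocabulary lookup. The sketch below is an assumption about the fix, not the actual xnmt code: `to_token_id` is a hypothetical helper, and it relies only on the `.item()` method that NumPy scalars and single-element arrays provide.

```python
def to_token_id(token):
  """
  Convert a token id to a plain Python int.

  @param token: an int, or an object exposing .item() (e.g. a NumPy scalar
                or a single-element array)
  @return: the id as a built-in int, usable as a list index
  """
  if hasattr(token, "item"):
    token = token.item()
  return int(token)

# Hypothetical stand-ins for vocab.i2w and for the decoder output.
i2w = ["<s>", "</s>", "hello", "world"]
print([i2w[to_token_id(t)] for t in [2, 3, 1]])
```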

Standard example fails (on Python3)

Currently the standard example seems to fail on the master branch with Python3:

(python3) gneubig@lor:~/work/xnmt$ python xnmt/xnmt_run_experiments.py examples/standard.yaml 
Traceback (most recent call last):
  File "xnmt/xnmt_run_experiments.py", line 18, in <module>
    import xnmt.xnmt_preproc, xnmt.xnmt_train, xnmt.xnmt_decode, xnmt.xnmt_evaluate
ModuleNotFoundError: No module named 'xnmt'

@msperber @philip30, have you tested on Python3 and not encountered this error (i.e. it's a problem with my environment), or have you not tested at all?

Cython setup doesn't work on Mac OS?

It seems that the Cython build is broken on macOS, e.g.:

(python3) neubig@itachi:~/work/xnmt$ python setup.py build_ext --inplace --use-cython-extensions
running build_ext
building 'xnmt.cython.xnmt_cython' extension
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/neubig/anaconda/envs/python3/include -arch x86_64 -I/Users/neubig/anaconda/envs/python3/include -arch x86_64 -I/Users/neubig/anaconda/envs/python3/include/python3.6m -c xnmt/cython/xnmt_cython.cpp -o build/temp.macosx-10.7-x86_64-3.6/xnmt/cython/xnmt_cython.o -std=c++11
gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/neubig/anaconda/envs/python3/include -arch x86_64 -I/Users/neubig/anaconda/envs/python3/include -arch x86_64 -I/Users/neubig/anaconda/envs/python3/include/python3.6m -c xnmt/cython/src/functions.cpp -o build/temp.macosx-10.7-x86_64-3.6/xnmt/cython/src/functions.o -std=c++11
xnmt/cython/src/functions.cpp:3:10: fatal error: 'unordered_map' file not found
#include <unordered_map>
         ^~~~~~~~~~~~~~~
1 error generated.
error: command 'gcc' failed with exit status 1

Create tested "recipes"

It would be nice if xnmt had a list of "recipes" that get competitive results on standard datasets. These would be in contrast to "examples", which do not necessarily have to have this accuracy guarantee, or do not have to be on standard datasets. Some examples would be:

  • Standard attentional model on WMT2014
  • Speech-to-text translation on Fisher corpus

Any other interesting things.

Corpus filtering option in xnmt_train, or as a pre-processing step

So as we've seen, having super-long sentences in our training set can cause xnmt to run out of memory. I think there are a couple of ways around this:

  • Create xnmt_preprocess.py, which will perform pre-processing of a corpus. One of the options could be to remove all sentences that are over a certain length.
  • Change the corpus reading code in xnmt_train.py to allow it to throw away sentences over a certain length when reading in the corpus.

What do you think?
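
Either option would boil down to a length filter over the parallel corpus. A minimal sketch of that filter, assuming sentences are already tokenized into lists (`filter_by_length` and `max_len` are hypothetical names, not existing xnmt options):

```python
def filter_by_length(src_sents, trg_sents, max_len):
  """
  Drop sentence pairs in which either side exceeds max_len tokens.

  @param src_sents: list of source sentences, each a list of tokens
  @param trg_sents: list of target sentences, parallel to src_sents
  @param max_len: maximum number of tokens allowed on either side
  @return: (filtered_src, filtered_trg), still parallel to each other
  """
  if len(src_sents) != len(trg_sents):
    raise ValueError("source and target corpora must have the same number of sentences")
  kept = [(s, t) for s, t in zip(src_sents, trg_sents)
          if len(s) <= max_len and len(t) <= max_len]
  return [s for s, _ in kept], [t for _, t in kept]
```

Filtering on both sides at once keeps the two files aligned, which matters whichever of the two places the filter ends up in.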

Memory leak when saving model?

Hi guys, I tried using xnmt on Graham's advice, but I run into a problem when running:

python xnmt_run_experiments.py ../test/experiments-config.txt

At the end of each epoch, the memory consumed by the program either grows by about 500MB or keeps increasing until my computer freezes.

I use the CPU version of DyNet and my OS is Ubuntu 15.10.

EDIT: After doing more tests I have more information:

  • Not sure if this is model saving. It seems to happen exclusively between two epochs.
  • Maybe this is linked to DyNet models, which are known to be buggy? (i.e. if you delete a model, the memory allocated to its parameters is not freed, cf. clab/dynet#418)

Also, pic or it didn't happen:

[Screenshot from 2017-04-13 14-56-09]

Examples on Standard Datasets

It would be nice if we had examples of how to run xnmt on standard datasets and get great scores, the best scores! I think this could be implemented by restructuring the examples folder to have one large README.md explaining what each of the examples is, then sub-directories with a README.md explaining the various commands that need to be run to obtain the data and train the model. For models that can be run as-is from the top directory of xnmt using the example data, it's fine to leave them as-is, although a short explanation in the top README.md might be warranted.

Decorators break Python 2

I get an error with the most recent code using Python 2:

Traceback (most recent call last):
  File "xnmt/xnmt_run_experiments.py", line 10, in <module>
    import xnmt_preproc, xnmt_train, xnmt_decode, xnmt_evaluate
  File "/Users/neubig/work/xnmt/xnmt/xnmt_train.py", line 13, in <module>
    from encoder import *
  File "/Users/neubig/work/xnmt/xnmt/encoder.py", line 8, in <module>
    from decorators import recursive
  File "/Users/neubig/work/xnmt/xnmt/decorators.py", line 29
    def rec_f(obj, *args, **kwargs, context=None):
                                      ^
 SyntaxError: invalid syntax

Any ideas @philip30 ?
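
For what it's worth, `**kwargs` has to be the last parameter in both Python 2 and Python 3, so the signature above is rejected by either interpreter. Python 3 would accept `context=None` placed before `**kwargs` as a keyword-only argument, but Python 2 has no keyword-only arguments. One portable option is to pop `context` out of `kwargs`. The sketch below only illustrates the signature; the body of `rec_f` is a made-up placeholder, not the real decorator logic.

```python
def rec_f(obj, *args, **kwargs):
  # Works in Python 2 and 3: pull the optional keyword out of kwargs by hand.
  context = kwargs.pop("context", None)
  # Placeholder body: just report what was received.
  return obj, args, kwargs, context

print(rec_f("enc", 1, 2, dropout=0.5, context="train"))
```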

Better Support for Sharing Components, Multi-task Learning

Currently it is very difficult to implement multi-task learning in xnmt. This could probably be fixed by doing a few things:

  1. Making it possible to define something like CompoundModel, which can contain a Translator and a Retriever, two Translators, etc.
  2. Making it easier to "reference" previously defined model components. For example, Translator number two might "reference" the Encoder of Translator number 1. This would allow them to share a single encoder and train them in a multi-task fashion.
  3. Coming up with a new TrainingTask interface that references a model and its training data and parameters, a DecodingTask that performs decoding, and an EvaluationTask that performs evaluation for the various tasks.

This would be a large refactoring of the code, but could potentially make things much more flexible, so it would potentially be nice to have.
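
To make points 1 and 2 a bit more concrete, here is a rough sketch of two translators referencing one encoder object inside a compound model. Only the name CompoundModel comes from the list above; the classes are plain-Python stand-ins with hypothetical constructors, and no real xnmt or DyNet API is implied.

```python
class Encoder(object):
  """Stand-in for an encoder that owns trainable parameters."""
  def __init__(self, name):
    self.name = name

class Translator(object):
  """Stand-in for a task model built on top of an encoder."""
  def __init__(self, encoder, task):
    self.encoder = encoder
    self.task = task

class CompoundModel(object):
  """Container holding several task models that are trained together."""
  def __init__(self, models):
    self.models = models

  def shared_components(self):
    """@return: encoders that are referenced by more than one model"""
    seen, shared = set(), []
    for model in self.models:
      enc = model.encoder
      if id(enc) in seen and enc not in shared:
        shared.append(enc)
      seen.add(id(enc))
    return shared

encoder = Encoder("shared_encoder")
translator1 = Translator(encoder, task="ja-en")
translator2 = Translator(encoder, task="ja-de")  # references translator1's encoder
compound = CompoundModel([translator1, translator2])
print([e.name for e in compound.shared_components()])
```

Because both translators hold the same object, any update to the encoder made while training one task is seen by the other, which is the multi-task behavior described in point 2.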

standard example seems broken

Hi,

First of all, as a dynet/nmt fan, this project is very exciting!

To the issue: I tried running the standard example from the documentation using:

python xnmt/xnmt_run_experiments.py examples/standard.yaml

And got the following output:

[dynet] random seed: 2045434078
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
Traceback (most recent call last):
  File "xnmt/xnmt_run_experiments.py", line 108, in <module>
    config = config_parser.args_from_config_file(args.experiments_file)
  File "/home/nlp/aharonr6/git/xnmt/xnmt/options.py", line 105, in args_from_config_file
    {name: self.check_and_convert(task_name, name, value) for name, value in exp_task_values.items()})
  File "/home/nlp/aharonr6/git/xnmt/xnmt/options.py", line 105, in <dictcomp>
    {name: self.check_and_convert(task_name, name, value) for name, value in exp_task_values.items()})
  File "/home/nlp/aharonr6/git/xnmt/xnmt/options.py", line 42, in check_and_convert
    raise RuntimeError("Unknown option {} for task {}".format(option_name, task_name))
RuntimeError: Unknown option encoder_layers for task train

Is the example broken or is it me doing something wrong?
Thanks!

Error while loading the pre-trained model

initialized BilingualTrainingCorpus({'dev_src': '/projects/tir2/users/sjpadman/temp_data/bilingual_dev_src.txt', 'dev_trg': '/projects/tir2/users/sjpadman/temp_data/bilingual_dev_tar.txt', 'train_src': '/projects/tir2/users/sjpadman/temp_data/bilingual_train_src.txt', 'train_trg': '/projects/tir2/users/sjpadman/temp_data/bilingual_train_tar.txt'})
   Traceback (most recent call last):
     File "xnmt/xnmt_run_experiments.py", line 166, in <module>
          sys.exit(main())
     File "xnmt/xnmt_run_experiments.py", line 120, in main
          xnmt_trainer = xnmt.xnmt_train.XnmtTrainer(train_args)
     File "/projects/tir1/users/sjpadman/xnmt/xnmt/xnmt_train.py", line 101, in __init__
          self.load_corpus_and_model()
     File "/projects/tir1/users/sjpadman/xnmt/xnmt/xnmt_train.py", line 162, in load_corpus_and_model
          self.corpus_parser = self.model_serializer.initialize_object(corpus_parser) if self.need_deserialization else self.args.corpus_parser
     File "/projects/tir1/users/sjpadman/xnmt/xnmt/serializer.py", line 54, in initialize_object
          return self.init_components_bottom_up(deserialized_yaml, deserialized_yaml.dependent_init_params(), context=context)
     File "/projects/tir1/users/sjpadman/xnmt/xnmt/serializer.py", line 139, in init_components_bottom_up
          init_params[init_arg] = self.init_components_bottom_up(val, sub_dependent_init_params, context)
     File "/projects/tir1/users/sjpadman/xnmt/xnmt/serializer.py", line 139, in init_components_bottom_up
          init_params[init_arg] = self.init_components_bottom_up(val, sub_dependent_init_params, context)
     File "/projects/tir1/users/sjpadman/xnmt/xnmt/serializer.py", line 158, in init_components_bottom_up
          print("initialized %s(%s)" % (obj.__class__.__name__, init_params))
     File "/projects/tir1/users/sjpadman/xnmt/xnmt/tee.py", line 40, in write
          self.stdstream.write(" " * self.indent + data)
   UnicodeEncodeError: 'ascii' codec can't encode character '\xe1' in position 99: ordinal not in range(128)

The above error is thrown while trying to load a pre-trained model.
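
The traceback shows tee.py writing to a stream whose encoding is ASCII, which usually comes from the locale of the shell. Setting the PYTHONIOENCODING=utf-8 environment variable is one common way around that. Another option, sketched below for Python 3, is to make the write itself tolerant. `safe_write` is a hypothetical helper, not something that exists in xnmt.

```python
import io

def safe_write(stream, data):
  """
  Write text to a stream, escaping characters the stream cannot encode.

  @param stream: a text stream such as sys.stdout
  @param data: the string to write
  """
  try:
    stream.write(data)
  except UnicodeEncodeError:
    encoding = getattr(stream, "encoding", None) or "ascii"
    # Replace unencodable characters with backslash escapes such as \xe1.
    escaped = data.encode(encoding, errors="backslashreplace").decode(encoding)
    stream.write(escaped)

# Simulate a terminal that only accepts ASCII.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
safe_write(ascii_stream, u"caf\xe1")
ascii_stream.flush()
print(ascii_stream.buffer.getvalue())
```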

Configuration Files shouldn't be copied if not provided

I think the YAML configuration shouldn't be copied if the yaml_file config is not specified.
It seems odd to copy the configuration into the directory where the scripts are being run.

For example, if you run the test of XNMT, it will copy all the test/config/*.yaml to the root of xnmt.

Feature Request: Tokenization

It would be nice to be able to perform tokenization/detokenization as part of the preprocessing capability: #104

Options include BPE, sentencepiece, or manual tokenization like the Moses tokenizer. For ease of implementation, particularly for sentencepiece, I think it's OK to assume a call to an external program when implementing these.

Two issues with dev set evaluation when doing minibatching

Hi @CharlotteKay, I have two questions about evaluation when using minibatching.

First, it looks like the number of words evaluated in the dev set differs depending on whether minibatching is used. Here is without:

[dynet] random seed: 3841206789
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
Start training in non-minibatch mode...
0.01 Dev perplexity: 616.9578868143398 (32490.217478 over 5057 words)
0.02 Dev perplexity: 468.18571355561767 (31094.810513 over 5057 words)
0.03 Dev perplexity: 428.57042830985233 (30647.721363 over 5057 words)
0.04 Dev perplexity: 495.58422153685274 (31382.413588 over 5057 words)

and here is with (32 sentences):

[dynet] random seed: 811370858
[dynet] allocating memory: 512MB
[dynet] memory allocation done.
Start training in minibatch mode...
0.33557046979865773 Dev perplexity: 347.5941161824009 (13638.763672 over 2331 words)
0.6711409395973155 Dev perplexity: 311.98835044678697 (13386.853394 over 2331 words)

Second, I think we should probably evaluate after the same number of sentences regardless of the minibatch size. Right now we evaluate every 100*minibatch_size sentences, but let's add an eval_every setting that is specified in sentences, and then evaluate every eval_every sentences.
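
A rough sketch of how a sentence-based eval_every could be tracked independently of the minibatch size. Only the eval_every name comes from the proposal above; the function and its arguments are hypothetical.

```python
def eval_batches(batch_sizes, eval_every):
  """
  Find the minibatches after which a dev evaluation should run.

  @param batch_sizes: number of sentences in each minibatch, in training order
  @param eval_every: evaluate once per this many training sentences
  @return: list of 0-based minibatch indices that trigger an evaluation
  """
  if eval_every <= 0:
    raise ValueError("eval_every must be positive")
  triggers = []
  sentences_seen = 0
  next_eval = eval_every
  for index, size in enumerate(batch_sizes):
    sentences_seen += size
    if sentences_seen >= next_eval:
      triggers.append(index)
      # Skip any thresholds this batch already passed, so that one batch
      # triggers at most one evaluation.
      while next_eval <= sentences_seen:
        next_eval += eval_every
  return triggers

print(eval_batches([32] * 10, eval_every=100))
print(eval_batches([1] * 250, eval_every=100))
```

With minibatches of 32 sentences and eval_every=100, evaluation fires after 128, 224 and 320 sentences; with single-sentence batches it fires after exactly 100 and 200. The evaluation points stay within one minibatch of the requested interval.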

Multi-dataset Evaluation

It would be nice if we could evaluate on multiple test sets. This could be done in one of two ways:

  • Pass in multiple files
  • Pass in a single file, but specify a range of lines that correspond to each different set

I prefer the first, but the second might be OK as well.

Create over-arching layer size option

Currently xnmt has a bunch of different places to specify the size of the embeddings, encoder, decoder, etc. I think it would be helpful to make it possible to specify a default layer size that is used in all places, unless something else is specified explicitly.
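
One way this fallback could work is sketched below. The option names (`default_layer_size`, `encoder_layer_size`, and so on) are made up for illustration and are not the current xnmt option names.

```python
def resolve_layer_sizes(options, components=("embedder", "encoder", "decoder")):
  """
  Pick a layer size for each component, falling back to a shared default.

  @param options: dict of user-specified options
  @param components: names of the components that need a size
  @return: dict mapping component name to its layer size
  """
  default = options.get("default_layer_size")
  sizes = {}
  for component in components:
    size = options.get(component + "_layer_size")
    if size is None:
      size = default  # nothing explicit for this component
    if size is None:
      raise ValueError("no layer size given for %s and no default set" % component)
    sizes[component] = size
  return sizes

print(resolve_layer_sizes({"default_layer_size": 512, "decoder_layer_size": 1024}))
```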

Print training speed in words/sec

Just for convenience, could we print the number of words processed per second every time we print logging information? This could be done for the training and dev sets.

Specify experiment to run by name

Given an experiment file, it might be nice to be able to specify which experiments to run via a command line option. If the command line option was not specified, we could revert to the current behavior of running all of them.
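
The selection logic could be as small as the sketch below, where `requested` would come from a command line option. The function name and error handling are assumptions.

```python
def select_experiments(available, requested=None):
  """
  Choose which experiments to run.

  @param available: experiment names in the order they appear in the file
  @param requested: names given on the command line, or None to run everything
  @return: list of experiment names to run, in file order
  """
  if not requested:
    return list(available)  # current behavior: run all of them
  unknown = [name for name in requested if name not in available]
  if unknown:
    raise ValueError("unknown experiment(s): " + ", ".join(unknown))
  return [name for name in available if name in requested]

print(select_experiments(["baseline", "big_lstm", "dropout"], ["dropout", "baseline"]))
```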

masking not implemented correctly with initialization in lstm.py

In lstm.py, the customLSTMbuilder uses the class LSTMState. When you call add_input, only the previous state is passed in. However, if you call a customLSTMbuilder with initial_state, it returns an LSTMState with initialized c and h values, but this initial state has no previous_state property. So when you call add_input on it, the initialized c and h are not passed in. Does that mean the initialization information is lost?

Report fine-grained statistics for BLEU

It would be nice if BLEU could also report fine-grained statistics similar to the following (from the mt-evaluator program of my travatar toolkit)

e.g.: BLEU = 0.56557, 0.82951/0.757936/0.71831/0.68288 (BP=0.758942, ratio=0.783803, hyp_len=5449, ref_len=6952)

This gives you the precision of each n-gram, the brevity penalty, and the overall length compared to the reference. This is really useful in debugging, as sometimes we're getting a low BLEU score just because our method is outputting hypotheses that are too short.
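
For reference, the numbers in that line are related as follows: the ratio is hyp_len/ref_len, the brevity penalty is exp(1 - 1/ratio) when the hypothesis is shorter than the reference, and BLEU is the brevity penalty times the geometric mean of the n-gram precisions. A small sketch that recomputes the score from the reported statistics (the function name is hypothetical, and the n-gram counting itself is left out):

```python
import math

def bleu_from_stats(precisions, hyp_len, ref_len):
  """
  Combine n-gram precisions and lengths into a BLEU score.

  @param precisions: list of n-gram precisions (1-gram to 4-gram)
  @param hyp_len: total length of the hypotheses
  @param ref_len: total length of the references
  @return: (bleu, brevity_penalty, length_ratio)
  """
  if hyp_len == 0:
    return 0.0, 0.0, 0.0
  ratio = hyp_len / float(ref_len)
  # Penalize only hypotheses that are shorter than the reference.
  bp = 1.0 if ratio >= 1.0 else math.exp(1.0 - 1.0 / ratio)
  if min(precisions) <= 0.0:
    return 0.0, bp, ratio
  log_avg = sum(math.log(p) for p in precisions) / len(precisions)
  return bp * math.exp(log_avg), bp, ratio

bleu, bp, ratio = bleu_from_stats([0.82951, 0.757936, 0.71831, 0.68288], 5449, 6952)
print("BLEU = %.5f (BP=%.6f, ratio=%.6f)" % (bleu, bp, ratio))
```

In the example above the geometric mean of the precisions is about 0.745, so most of the gap down to 0.566 comes from the brevity penalty, which is exactly the "hypotheses too short" case.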

Ability to Check if Decoding Matches Loss Calculation

It would be nice if we had a testing setup that allowed us to check if the score calculated during decoding matched the score by calc_loss. This would greatly help with debugging one of the most common errors when implementing models (train-test differences).

PolynomialNormalization missing attribute

c6fac72 seems to have introduced a bug with PolynomialNormalization; can be reproduced by running the standard.yaml config file in examples/.

Call:
python3 xnmt/xnmt_run_experiments.py examples/standard.yaml --dynet-gpu

Produces this output (truncated):

> Training
   Epoch 0.1002: train_loss/word=7.026294 (words=9426, words/sec=1150.65, time=0-00:00:08)
   Epoch 0.2000: train_loss/word=6.703589 (words=18864, words/sec=1155.63, time=0-00:00:16)
   Epoch 0.3003: train_loss/word=6.518545 (words=28110, words/sec=1188.03, time=0-00:00:24)
   Epoch 0.4005: train_loss/word=6.392840 (words=37588, words/sec=1138.11, time=0-00:00:32)
   Epoch 0.5003: train_loss/word=6.307682 (words=46648, words/sec=1159.72, time=0-00:00:40)
   Epoch 0.6004: train_loss/word=6.225853 (words=55790, words/sec=1144.99, time=0-00:00:48)
   Epoch 0.7000: train_loss/word=6.154869 (words=65143, words/sec=1154.20, time=0-00:00:56)
   Epoch 0.8003: train_loss/word=6.088884 (words=74466, words/sec=1143.59, time=0-00:01:04)
   Epoch 0.9000: train_loss/word=6.024372 (words=83640, words/sec=1188.11, time=0-00:01:12)
   Epoch 1.0000: train_loss/word=5.964533 (words=93086, words/sec=1154.51, time=0-00:01:20)
   Traceback (most recent call last):
     File "xnmt/xnmt_run_experiments.py", line 154, in <module>
          sys.exit(main())
     File "xnmt/xnmt_run_experiments.py", line 118, in main
          training_regimen.run_epochs(exp_args["run_for_epochs"])
     File "/home/ziyux/installs/miniconda3/envs/dynet/lib/python3.6/site-packages/xnmt-0.0.1-py3.6.egg/xnmt/train.py", line 190, in run_epochs
          self.one_epoch()
     File "/home/ziyux/installs/miniconda3/envs/dynet/lib/python3.6/site-packages/xnmt-0.0.1-py3.6.egg/xnmt/train.py", line 241, in one_epoch
          self.dev_evaluation()
     File "/home/ziyux/installs/miniconda3/envs/dynet/lib/python3.6/site-packages/xnmt-0.0.1-py3.6.egg/xnmt/train.py", line 260, in dev_evaluation
          xnmt.xnmt_decode.xnmt_decode(model_elements=(self.corpus_parser, self.model), **self.decode_args)
     File "/home/ziyux/installs/miniconda3/envs/dynet/lib/python3.6/site-packages/xnmt-0.0.1-py3.6.egg/xnmt/xnmt_decode.py", line 111, in xnmt_decode
          output = generator.generate_output(src, i, forced_trg_ids=ref_ids)
     File "/home/ziyux/installs/miniconda3/envs/dynet/lib/python3.6/site-packages/xnmt-0.0.1-py3.6.egg/xnmt/generator.py", line 6, in generate_output
          generation_output = self.generate(*args, **kwargs)
     File "/home/ziyux/installs/miniconda3/envs/dynet/lib/python3.6/site-packages/xnmt-0.0.1-py3.6.egg/xnmt/translator.py", line 127, in generate
          output_actions, score = self.search_strategy.generate_output(self.decoder, self.attender, self.trg_embedder, dec_state, src_length=len(sents), forced_trg_ids=forced_trg_ids)
     File "/home/ziyux/installs/miniconda3/envs/dynet/lib/python3.6/site-packages/xnmt-0.0.1-py3.6.egg/xnmt/search_strategy.py", line 97, in generate_output
          new_set.append(self.Hypothesis(self.len_norm.normalize_partial(hyp.score, score[cur_id], len(new_list)),
     File "/home/ziyux/installs/miniconda3/envs/dynet/lib/python3.6/site-packages/xnmt-0.0.1-py3.6.egg/xnmt/length_normalization.py", line 73, in normalize_partial
          return (score_so_far * pow(new_len-1, self.m) + score_to_add) / pow(new_len, self.m)
   AttributeError: 'PolynomialNormalization' object has no attribute 'm'
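
I have not checked what c6fac72 changed, so the following is only a guess at the shape of a fix: normalize_partial reads self.m, so the constructor has to assign it. The class name and the formula are taken from the traceback; the constructor signature and the default m=1 are assumptions.

```python
class PolynomialNormalization(object):
  """
  Length normalization that divides a hypothesis score by length ** m.
  """
  def __init__(self, m=1):
    # The attribute that the traceback reports as missing.
    self.m = m

  def normalize_partial(self, score_so_far, score_to_add, new_len):
    """
    @param score_so_far: normalized score of the hypothesis before this step
    @param score_to_add: unnormalized score of the new token
    @param new_len: hypothesis length after adding the token
    @return: normalized score of the extended hypothesis
    """
    return (score_so_far * pow(new_len - 1, self.m) + score_to_add) / pow(new_len, self.m)

norm = PolynomialNormalization(m=1)
print(norm.normalize_partial(-2.0, -4.0, 3))
```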

Travis CI is failing

Travis CI checks are failing on the pip install of DyNet. I'm guessing that this is something simple like not installing mercurial or the compile environment beforehand, and might be resolved by adding packages to the .travis.yml file: http://dynet.readthedocs.io/en/latest/python.html

@philip30 if you have time today perhaps you could take a look? If not, I'll try to take a look later.

Loss Calculator

Specifying the loss_calculator within the model seems like a good idea, but it binds the model to a specific training process, because the entire model specification is read from the pre-trained model. This prevents us from fine-tuning a model with a training process other than the one it was initially trained with.
Should this be made more flexible? Or am I missing something?

Terminology Confusing

Currently, some key terms are reused for different concepts:

  • model_globals.params (global hyperparams + dynet weights)
  • model_globals.params.model (dynet weights)
  • model in the YAML config (top of the model hierarchy, e.g. translator or retriever)
  • ModelParams: container for serialization, contains YAML model, corpus parser, global_params

Residual network serialization fails

When saving a model that contains a residual encoder (e.g. a ResidualLSTMEncoder), save_to_file fails with "Class LookupParameters is not serializable. Try adding serialize_params to it."

However it seems that the model_lookup field in ResidualLSTMEncoder (the one that's causing the issue) is never used anywhere in the code (since lookup can be performed directly from the embeddings field of the embedder). Just deleting that field makes model serialization work. I just wanted to confirm that my understanding was correct and that the field can safely be removed.
