
morph-segmentation's People

Contributors

e9t, juditacs


morph-segmentation's Issues

Standardize dataset

Create and document a standardized dataset. This will be used during the camp.

I'll document the preprocessing steps in the Wiki.

Ensure all data splits contain at least one sample

This is important when using toy datasets, where the validation and test splits can easily end up empty.

I'll probably change the sampling method to fixed-size splits instead of random sampling with a threshold.
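A minimal sketch of the fixed-size variant; the function name, split sizes and seed here are placeholders, not the project's actual settings:

    import random

    def split_fixed(samples, valid_size=1000, test_size=1000, seed=42):
        # shuffle once, then carve out fixed-size valid and test splits
        # so neither can end up empty, even on a toy dataset
        samples = list(samples)
        random.Random(seed).shuffle(samples)
        valid = samples[:valid_size]
        test = samples[valid_size:valid_size + test_size]
        train = samples[valid_size + test_size:]
        return train, valid, test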

STOP symbol is added after padding in the decoder data

This is the current implementation:

padded = ['GO'] + dec + ['PAD' for p in range(self.maxlen_dec - len(dec))] + ['STOP']

I don't want to change it right now because I'm in the middle of debugging #9
but let's not forget it.

BTW this means that the model can learn to stop at a certain point even without a STOP symbol.
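
For reference, one possible fix is simply to move the STOP symbol before the padding (a sketch using the same variables as above):

padded = ['GO'] + dec + ['STOP'] + ['PAD' for p in range(self.maxlen_dec - len(dec))]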

Motivation for this project

Explain the motivation for this project with use cases in downstream tasks.

Questions to be answered:

  • Why is this necessary?
  • Why not just use the rule-based analyzers?
  • Why not just use word-level approaches with more data?

Config: move defaults to default.yaml

The config currently has hard-coded defaults in the source code, which is hard to maintain and bad practice in general. They should be moved to default.yaml and loaded from there.
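
A minimal sketch of the loading side, assuming PyYAML; the keys shown are just the seq2seq parameters listed further down, and the exact layout of default.yaml is still to be decided:

    import yaml

    # default.yaml could contain, for example:
    #   cell_size: 64
    #   cell_type: LSTM
    #   embedding_dim: 20

    def load_config(path='default.yaml', overrides=None):
        with open(path) as f:
            config = yaml.safe_load(f)
        # experiment-specific settings override the shared defaults
        config.update(overrides or {})
        return config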

Create one common DataSet class

Both supervised models use their own DataSet implementation. There should be one base class, with subclasses only where needed.
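
A possible layout, just as a sketch; none of these class or method names exist in the current code:

    class DataSet:
        """Common logic: reading the input, building the vocabulary, padding, batching."""
        def __init__(self, path, maxlen):
            self.path = path
            self.maxlen = maxlen

        def batches(self, batch_size):
            raise NotImplementedError

    class Seq2seqDataSet(DataSet):
        """Adds GO/STOP symbols and decoder-side padding."""

    class BoundaryDataSet(DataSet):
        """Produces per-character boundary labels instead of an output sequence."""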

Penalize length difference

The output word should have the same length as the input word, excluding spaces (morpheme boundaries). The current loss function does not penalize length differences, and many errors are outputs of the wrong length:

    gold segmentation    model output
    zöld ek et           zöldeek et
    segítség hez         segít ég hez
    át rohan t           átrrohantt
    fa telep             fattelep

I'm really not sure how to implement this, or whether it is worth implementing. The character code of the space would need to be masked out, and the difference between the output and input sequence lengths added to the loss function.
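
A minimal sketch of what such a penalty term could look like in TensorFlow; the tensor names, the space and padding ids, and the weighting are all assumptions, not the project's actual variables:

    import tensorflow as tf

    def length_penalty(output_ids, input_ids, space_id, pad_id):
        # count non-space, non-pad characters on both sides
        out_mask = tf.logical_and(tf.not_equal(output_ids, space_id),
                                  tf.not_equal(output_ids, pad_id))
        in_mask = tf.not_equal(input_ids, pad_id)
        out_len = tf.reduce_sum(tf.cast(out_mask, tf.float32), axis=1)
        in_len = tf.reduce_sum(tf.cast(in_mask, tf.float32), axis=1)
        # absolute length difference, averaged over the batch
        return tf.reduce_mean(tf.abs(out_len - in_len))

    # total_loss = seq2seq_loss + penalty_weight * length_penalty(...)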

Python2 support

The code currently runs on Python 3, but Google Cloud does not support it yet. Make it version-agnostic or Python 2 only.

Create a sandhi (assimilation) corpus

Create a sandhi corpus from morphologically analyzed Hungarian text.

I have two ideas, please let me know what you think. @e9t @kornai @DavidNemeskey

  1. take a few inflection rules that cause assimilation, such as the instrumental case, and extract words with those inflections
  2. find words where the lemma is not a substring of the inflected word; I'm checking this option right now and it might introduce many false positives (see the sketch below)
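
A minimal sketch of the check behind idea 2; reading the lemma and surface-form pairs from the analyzed corpus is left out:

    def lemma_changed(lemma, surface):
        # candidate assimilation example if the lemma does not survive
        # as a contiguous substring of the inflected form
        return lemma not in surface

    # lemma_changed('zöld', 'zöldeket')  -> False  (lemma kept intact)
    # lemma_changed('fa', 'fát')         -> True   (vowel lengthening: fa -> fát)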

Seq2seq output evaluation

Seq2seq can and does change the input word, which is not taken into account by the boundary prediction evaluation. How should I handle this? @e9t

Use gs:// paths for train files

gs:// paths are not directly usable from Python's built-in file handling.

tensorflow.python.lib.io.file_io.FileIO solves this issue and I implemented it for plain text reading in f606943

Gzip and STDIN reading, however, are not yet supported.
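
For reference, the plain-text case roughly looks like this (the path is a placeholder):

    from tensorflow.python.lib.io import file_io

    # FileIO handles local paths and gs:// URLs transparently
    with file_io.FileIO('gs://bucket/webcorp.train.txt', mode='r') as f:
        for line in f:
            word = line.strip()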

seq2seq: retrain and save best configurations

Experiments (to be finished by Monday)

  • normal-normal
  • normal-reversed
  • reversed-normal
  • reversed-reversed

Setup

Dataset

  • Train data: data/webcorp/webcorp.all.freqs.train.gz (400k word types)
  • Test data: data/webcorp/webcorp.all.freqs.test.gz (100k word types)

Both the input and the output can be reversed. I will try all 4 combinations.

Model parameters

  • early stopping threshold: 0.001
  • early stopping patience: 3
  • cell size: 64
  • cell type: LSTM
  • embedding dim: 20
  • single layer

Trained models will be saved to the results/models subdirectory.
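
A rough sketch of how the model parameters above map onto the TensorFlow 1.x graph; the vocabulary size and variable names are placeholders, and early stopping (threshold 0.001, patience 3) lives in the training loop rather than in the graph:

    import tensorflow as tf

    vocab_size = 100                                             # placeholder
    embedding = tf.get_variable('embedding', [vocab_size, 20])   # embedding dim: 20
    cell = tf.nn.rnn_cell.LSTMCell(64)                           # single-layer LSTM, cell size 64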

Decrease memory usage

Memory consumption is too high right now: it runs out of memory on 50k samples in some cases. I'm now trying to run it with tf.int8 placeholders.
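
A sketch of the smaller placeholder (the names are placeholders); note that tf.nn.embedding_lookup expects int32 or int64 ids, so the values have to be cast back before the lookup:

    import tensorflow as tf

    # int8 ids take a quarter of the memory of int32 in the fed arrays,
    # as long as the character vocabulary stays below 128
    enc_input = tf.placeholder(tf.int8, shape=[None, None], name='encoder_input')
    enc_ids = tf.cast(enc_input, tf.int32)  # cast back before the embedding lookup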
