
morph-segmentation's People

Contributors

e9t, juditacs


morph-segmentation's Issues

Standardize dataset

Create and document a standardized dataset. This will be used during the camp.

I'll document the preprocessing steps in the Wiki.

Ensure all data splits contain at least one sample

This is important when using toy datasets, where the validation and test splits can easily end up empty.

I'll probably change the sampling method to fixed-size splits instead of random sampling with a threshold.
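A minimal sketch of the fixed-size variant; the function name, split sizes and seed here are placeholders, not the project's actual settings:

    import random

    def split_fixed(samples, valid_size=1000, test_size=1000, seed=42):
        # shuffle once, then carve out fixed-size valid and test splits
        # so neither can end up empty, even on a toy dataset
        samples = list(samples)
        random.Random(seed).shuffle(samples)
        valid = samples[:valid_size]
        test = samples[valid_size:valid_size + test_size]
        train = samples[valid_size + test_size:]
        return train, valid, test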

STOP symbol is added after padding in the decoder data

This is the current implementation:

padded = ['GO'] + dec + ['PAD' for p in range(self.maxlen_dec - len(dec))] + ['STOP']

I don't want to change it right now because I'm in the middle of debugging #9
but let's not forget it.

BTW this means that the model can learn to stop at a certain point even without a STOP symbol.
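
For reference, one possible fix is simply to move the STOP symbol before the padding (a sketch using the same variables as above):

padded = ['GO'] + dec + ['STOP'] + ['PAD' for p in range(self.maxlen_dec - len(dec))]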

Motivation for this project

Explain the motivation for this project with use cases in downstream tasks.

Questions to be answered:

  • Why is this necessary?
  • Why not just use the rule-based analyzers?
  • Why not just use word-level approaches with more data?

Config: move defaults to default.yaml

The config currently has hard-coded defaults in the source code, which is hard to maintain and bad practice in general. They should be moved to default.yaml and loaded from there.
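
A minimal sketch of the loading side, assuming PyYAML; the keys shown are just the seq2seq parameters listed further down, and the exact layout of default.yaml is still to be decided:

    import yaml

    # default.yaml could contain, for example:
    #   cell_size: 64
    #   cell_type: LSTM
    #   embedding_dim: 20

    def load_config(path='default.yaml', overrides=None):
        with open(path) as f:
            config = yaml.safe_load(f)
        # experiment-specific settings override the shared defaults
        config.update(overrides or {})
        return config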

Create one common DataSet class

Both supervised models use their own DataSet implementation. There should be one base class, with subclasses only where needed.
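
A possible layout, just as a sketch; none of these class or method names exist in the current code:

    class DataSet:
        """Common logic: reading the input, building the vocabulary, padding, batching."""
        def __init__(self, path, maxlen):
            self.path = path
            self.maxlen = maxlen

        def batches(self, batch_size):
            raise NotImplementedError

    class Seq2seqDataSet(DataSet):
        """Adds GO/STOP symbols and decoder-side padding."""

    class BoundaryDataSet(DataSet):
        """Produces per-character boundary labels instead of an output sequence."""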

Penalize length difference

The output word should have the same length as the input word, excluding spaces (morpheme boundaries). The current loss function does not penalize length differences, and many errors are outputs of the wrong length:

    gold segmentation    model output
    zöld ek et           zöldeek et
    segítség hez         segít ég hez
    át rohan t           átrrohantt
    fa telep             fattelep

I'm really not sure how to implement this, or whether it is worth implementing. The character code of the space would need to be masked out, and the difference between the output and input sequence lengths added to the loss function.
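
A minimal sketch of what such a penalty term could look like in TensorFlow; the tensor names, the space and padding ids, and the weighting are all assumptions, not the project's actual variables:

    import tensorflow as tf

    def length_penalty(output_ids, input_ids, space_id, pad_id):
        # count non-space, non-pad characters on both sides
        out_mask = tf.logical_and(tf.not_equal(output_ids, space_id),
                                  tf.not_equal(output_ids, pad_id))
        in_mask = tf.not_equal(input_ids, pad_id)
        out_len = tf.reduce_sum(tf.cast(out_mask, tf.float32), axis=1)
        in_len = tf.reduce_sum(tf.cast(in_mask, tf.float32), axis=1)
        # absolute length difference, averaged over the batch
        return tf.reduce_mean(tf.abs(out_len - in_len))

    # total_loss = seq2seq_loss + penalty_weight * length_penalty(...)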

Python2 support

The code currently runs on Python 3, but Google Cloud does not support it yet. Make it version-agnostic or Python 2 only.

Create a sandhi (assimilation) corpus

Create a sandhi corpus from morphologically analyzed Hungarian text.

I have two ideas, please let me know what you think. @e9t @kornai @DavidNemeskey

  1. take a few inflection rules that cause assimilation, such as the instrumental case, and extract words with those inflections
  2. find words where the lemma is not a substring of the inflected word; I'm checking this option right now and it might introduce many false positives (see the sketch below)
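
A minimal sketch of the check behind idea 2; reading the lemma and surface-form pairs from the analyzed corpus is left out:

    def lemma_changed(lemma, surface):
        # candidate assimilation example if the lemma does not survive
        # as a contiguous substring of the inflected form
        return lemma not in surface

    # lemma_changed('zöld', 'zöldeket')  -> False  (lemma kept intact)
    # lemma_changed('fa', 'fát')         -> True   (vowel lengthening: fa -> fát)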

Seq2seq output evaluation

Seq2seq can and does change the input word, which is not taken into account by the boundary prediction evaluation. How should I handle this? @e9t

Use gs:// paths for train files

gs:// paths are not directly usable from Python's built-in file handling.

tensorflow.python.lib.io.file_io.FileIO solves this issue and I implemented it for plain text reading in f606943

Gzip and STDIN reading, however, are not yet supported.
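
For reference, the plain-text case roughly looks like this (the path is a placeholder):

    from tensorflow.python.lib.io import file_io

    # FileIO handles local paths and gs:// URLs transparently
    with file_io.FileIO('gs://bucket/webcorp.train.txt', mode='r') as f:
        for line in f:
            word = line.strip()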

seq2seq: retrain and save best configurations

Experiments (to be finished by Monday)

  • normal-normal
  • normal-reversed
  • reversed-normal
  • reversed-reversed

Setup

Dataset

  • Train data: data/webcorp/webcorp.all.freqs.train.gz (400k word types)
  • Test data: data/webcorp/webcorp.all.freqs.test.gz (100k word types)

Both the input and the output can be reversed. I will try all 4 combinations.

Model parameters

  • early stopping threshold: 0.001
  • early stopping patience: 3
  • cell size: 64
  • cell type: LSTM
  • embedding dim: 20
  • single layer

Trained models will be saved to the results/models subdirectory.
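
A rough sketch of how the model parameters above map onto the TensorFlow 1.x graph; the vocabulary size and variable names are placeholders, and early stopping (threshold 0.001, patience 3) lives in the training loop rather than in the graph:

    import tensorflow as tf

    vocab_size = 100                                             # placeholder
    embedding = tf.get_variable('embedding', [vocab_size, 20])   # embedding dim: 20
    cell = tf.nn.rnn_cell.LSTMCell(64)                           # single-layer LSTM, cell size 64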

Decrease memory usage

Memory consumption is too high right now: it runs out of memory on 50k samples in some cases. I'm now trying to run it with tf.int8 placeholders.
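
A sketch of the smaller placeholder (the names are placeholders); note that tf.nn.embedding_lookup expects int32 or int64 ids, so the values have to be cast back before the lookup:

    import tensorflow as tf

    # int8 ids take a quarter of the memory of int32 in the fed arrays,
    # as long as the character vocabulary stays below 128
    enc_input = tf.placeholder(tf.int8, shape=[None, None], name='encoder_input')
    enc_ids = tf.cast(enc_input, tf.int32)  # cast back before the embedding lookup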
