juditacs / morph-segmentation
Experimenting with supervised morphological segmentation
License: MIT License
Create inference script for the sequence tagger.
Create and document a standardized dataset. This will be used during the camp.
I'll document the preprocessing steps in the Wiki.
feed_previous is always set to False
Add reverse_input and reverse_output options to the training pipeline. The inference phase should reverse them if necessary. These parameters have to be saved in the dataset_params.json file.
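A minimal sketch of how the two flags could be stored and applied. The helper names (save_reverse_flags, load_params, prepare_sample, restore_output) are hypothetical and not taken from the repository; only the option names and the dataset_params.json file come from the issue.

```python
import json
import os


def save_reverse_flags(path, reverse_input, reverse_output):
    """Merge the two flags into the params file, keeping any existing keys."""
    params = {}
    if os.path.exists(path):
        with open(path) as f:
            params = json.load(f)
    params["reverse_input"] = bool(reverse_input)
    params["reverse_output"] = bool(reverse_output)
    with open(path, "w") as f:
        json.dump(params, f, indent=2)


def load_params(path):
    with open(path) as f:
        return json.load(f)


def prepare_sample(enc, dec, params):
    """Apply the reversal flags to one (input, output) training pair."""
    if params.get("reverse_input", False):
        enc = enc[::-1]
    if params.get("reverse_output", False):
        dec = dec[::-1]
    return enc, dec


def restore_output(decoded, params):
    """At inference time, undo the output reversal so the result reads left to right."""
    return decoded[::-1] if params.get("reverse_output", False) else decoded
```

At inference the input has to be reversed the same way as during training, while the decoded output has to be reversed back, which is why both flags need to travel with the dataset parameters.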
This is important when using toy datasets. Valid and test splits can easily end up empty.
I'll probably change the sampling method to draw fixed-size samples instead of sampling randomly against a threshold.
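One way the fixed-size sampling could look, as a sketch only. The function name split_fixed and its parameters are assumptions for illustration, not the repository's API.

```python
import random


def split_fixed(samples, n_valid, n_test, seed=0):
    """Shuffle once, then cut off fixed-size valid and test splits.

    Unlike per-sample random assignment against a threshold, the valid and
    test splits always have exactly the requested sizes, so they cannot end
    up empty on a toy dataset.
    """
    samples = list(samples)
    if n_valid + n_test >= len(samples):
        raise ValueError("not enough samples for the requested split sizes")
    random.Random(seed).shuffle(samples)
    test = samples[:n_test]
    valid = samples[n_test:n_test + n_valid]
    train = samples[n_test + n_valid:]
    return train, valid, test
```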
This is the current implementation:
padded = ['GO'] + dec + ['PAD' for p in range(self.maxlen_dec - len(dec))] + ['STOP']
I don't want to change it right now because I'm in the middle of debugging #9, but let's not forget it.
BTW this means that even without a STOP symbol, it can learn to stop at a certain point.
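For reference, a sketch of the alternative ordering, assuming the intended layout is GO, the target symbols, STOP, and only then the padding. The function name pad_decoder_sequence is hypothetical; the total length stays maxlen_dec + 2, as in the current implementation.

```python
def pad_decoder_sequence(dec, maxlen_dec):
    """Return GO + symbols + STOP + PAD..., with STOP directly after the last symbol."""
    dec = list(dec)
    if len(dec) > maxlen_dec:
        raise ValueError("sequence is longer than maxlen_dec")
    return ['GO'] + dec + ['STOP'] + ['PAD'] * (maxlen_dec - len(dec))
```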
Explain the motivation for this project with use cases in downstream tasks.
Questions to be answered:
If maxlen_enc or maxlen_dec is specified in the function's arguments instead of being derived from the samples, samples longer than maxlen should be filtered out.
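A sketch of the filtering step, assuming samples are (encoder sequence, decoder sequence) pairs. The function name filter_by_length is hypothetical; maxlen_enc and maxlen_dec are the argument names from the issue.

```python
def filter_by_length(pairs, maxlen_enc=None, maxlen_dec=None):
    """Drop pairs that exceed an explicitly given maximum length.

    A limit left as None is derived from the kept samples instead,
    so nothing is filtered on that side.
    """
    kept = [
        (enc, dec) for enc, dec in pairs
        if (maxlen_enc is None or len(enc) <= maxlen_enc)
        and (maxlen_dec is None or len(dec) <= maxlen_dec)
    ]
    if maxlen_enc is None:
        maxlen_enc = max((len(enc) for enc, _ in kept), default=0)
    if maxlen_dec is None:
        maxlen_dec = max((len(dec) for _, dec in kept), default=0)
    return kept, maxlen_enc, maxlen_dec
```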
Move s2s to a subdirectory and update README.
Config currently has hard-coded defaults in the source code, which is very hard to maintain and bad practice in general. They should be moved to default.yaml and loaded from there.
Reimplement seq2seq according to Google NMT tutorial: https://github.com/tensorflow/nmt
Use other inference strategies than greedy.
Both supervised models use their own DataSet implementation. There should be one base class and several subclasses if needed.
I get tensor shape mismatch errors such as:
InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [40,20] rhs shape= [36,20]
The input word should be the same length as the output word excluding spaces (morpheme boundaries). The current loss function does not penalize length and many errors are words of different length:
expected        output
zöld ek et      zöldeek et
segítség hez    segít ég hez
át rohan t      átrrohantt
fa telep        fattelep
I'm really not sure how to implement it and whether it is worth implementing. The character code of space would need to be masked and then the length of the output sequence minus the length of the input sequence needs to be added to the loss function.
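The quantity described above can be illustrated on plain strings. This is only a sketch of the length term, not a TensorFlow loss; the function name length_mismatch is hypothetical, and taking the absolute value before adding it to the loss is an assumption.

```python
def length_mismatch(input_word, output_segmentation, boundary=" "):
    """Output length minus input length, ignoring boundary characters.

    Zero means the output has as many non-boundary characters as the input.
    """
    output_chars = sum(1 for ch in output_segmentation if ch != boundary)
    return output_chars - len(input_word)


def length_penalty(input_word, output_segmentation, weight=1.0):
    """Assumed penalty form: weighted absolute mismatch, to be added to the loss."""
    return weight * abs(length_mismatch(input_word, output_segmentation))
```

For example, the input "fatelep" with the output "fattelep" gives a mismatch of 1, while the correct segmentation "fa telep" gives 0.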
Unidirectional encoder doesn't work yet.
Sequence tagger performs very poorly right now, I suspect there are bugs.
The code currently runs on Python 3, but Google Cloud does not support it yet. Make it version-agnostic or Python 2 only.
Add bidirectional encoder option to seq2seq.
The natural representation of Korean is the 'disassembled' form,
e.g. 이건 --> ㅇ ㅣ ㄱ ㅓ ㄴ
This way, you can segment Korean words just as you do Hungarian ones.
This is the repo that preprocesses the Korean dataset.
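The disassembly shown above can be sketched with the standard Unicode arithmetic for precomposed Hangul syllables (U+AC00 to U+D7A3). This is an illustration and not the code of the preprocessing repo; the function name decompose is hypothetical, and compound final consonants are kept as a single letter here.

```python
# Compatibility jamo in the order used by the Unicode syllable composition.
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"       # 21 vowels
FINALS = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ",
          "ㄻ", "ㄼ", "ㄽ", "ㄾ", "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ",
          "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]   # 27 finals + "no final"

SYLLABLE_BASE = 0xAC00
SYLLABLE_LAST = 0xD7A3


def decompose(text):
    """Split each precomposed Hangul syllable into its letters.

    Characters outside the syllable block are passed through unchanged.
    """
    letters = []
    for ch in text:
        code = ord(ch)
        if SYLLABLE_BASE <= code <= SYLLABLE_LAST:
            index = code - SYLLABLE_BASE
            letters.append(INITIALS[index // (21 * 28)])
            letters.append(MEDIALS[(index % (21 * 28)) // 28])
            final = FINALS[index % 28]
            if final:
                letters.append(final)
        else:
            letters.append(ch)
    return letters


print(" ".join(decompose("이건")))
```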
Create a sandhi corpus from morphologically analyzed Hungarian text.
I have two ideas, please let me know what you think. @e9t @kornai @DavidNemeskey
Seq2seq can and does change the input word which is not taken into account at boundary prediction evaluation. How should I handle this? @e9t
Implement CNN tagging.
gs:// paths are not directly usable from Python. tensorflow.python.lib.io.file_io.FileIO solves this issue, and I implemented it for plain text reading in f606943. Gzip and STDIN reading, however, are not yet supported.
data/webcorp/webcorp.all.freqs.train.gz (400k word types)
data/webcorp/webcorp.all.freqs.test.gz (100k word types)
Both the input and the output can be reversed. I will try all 4 combinations.
Trained models will be saved to the results/models subdirectory.
Currently 10% of the data is always reserved for testing. This is only a problem if there's not enough training data.
Memory consumption is too large right now. It runs out of memory on 50k samples in some cases. I'm now trying to run it using tf.int8 placeholders.