We provide a general framework for training subword-informed word representations by varying the following components:
- subword segmentation methods;
- subword embeddings and position embeddings;
- composition functions.

For the overall framework architecture and more details, please refer to the reference below.
There are 4 segmentation methods, 3 possible ways of embedding subwords, 3 ways of enhancing with position embeddings, and 3 different composition functions.
Here is a full table of different options and their labels:
| Component | Option | Label |
| --- | --- | --- |
| Segmentation methods | CHIPMUNK | sms |
| | Morfessor | morf |
| | BPE | bpe |
| | Character n-gram | charn |
| Subword embeddings | w/o word token | - |
| | w/ word token | ww |
| | w/ morphotactic tag (only for sms) | wp |
| Position embeddings | w/o position embedding | - |
| | addition | pp (not applicable to wp) |
| | elementwise multiplication | mp (not applicable to wp) |
| Composition functions | addition | add |
| | single self-attention | att |
| | multi-head self-attention | mtxatt |
For example, sms.wwppmtxatt means we use CHIPMUNK as the segmentation method, insert the word token into the subword sequence, enhance it with additive position embeddings, and use multi-head self-attention as the composition function.
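For illustration, such a label can be decomposed back into its four components with a small helper (`parse_label` is a hypothetical function written for this example, not part of the repo):

```python
# Option labels from the table above; "" stands for the "-" (absent) option.
SUBWORD = ("ww", "wp", "")
POSITION = ("pp", "mp", "")
COMPOSITION = ("mtxatt", "att", "add")

def parse_label(label):
    """Split a label like 'sms.wwppmtxatt' into
    (segmentation, subword option, position option, composition)."""
    seg, rest = label.split(".")
    for sw in SUBWORD:
        for pos in POSITION:
            for comp in COMPOSITION:
                if sw + pos + comp == rest:
                    return seg, sw or "-", pos or "-", comp
    raise ValueError(f"unrecognized label: {label}")
```

For example, `parse_label("sms.wwppmtxatt")` yields `("sms", "ww", "pp", "mtxatt")`.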
Taking the word dishonestly as an example, with different segmentation methods, the word will be segmented into the following subword sequence:
- CHIPMUNK: (<dis, honest, ly>) + (PREFIX, ROOT, SUFFIX)
- Morfessor: (<dishonest, ly>)
- BPE (10k merge ops): (<dish, on, est, ly>)
- Character n-gram (from 3 to 6): (<di, dis, ... , ly>, <dis, ... ,tly>, <dish, ... , stly>, <disho, ... , estly>)
where < and > are word start and end markers.
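The character n-gram segmentation above can be sketched as follows (a minimal illustration of extracting all n-grams from 3 to 6 over the marked word; not the repo's actual implementation):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Extract all character n-grams (n from n_min to n_max) from a word
    wrapped in the < and > boundary markers, as in the charn segmentation."""
    w = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

# For "dishonestly" this yields <di, dis, ..., ly>, then <dis, ..., tly>, etc.
ngrams = char_ngrams("dishonestly")
```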
After the segmentation, we will obtain a subword sequence S for each segmentation method, and, for sms, an additional morphotactic tag sequence T.
We can embed the subword sequence S directly into a subword embedding sequence by looking the subwords up in the subword embedding matrix, or first insert a word token (ww) into S, i.e. for sms it will be (<dis, honest, ly>, <dishonestly>).
Then we can enhance the subword embeddings with additive (pp) or elementwise multiplicative (mp) position embeddings.
For sms, we can also embed the concatenation of each subword and its morphotactic tag (wp): (<dis:PREFIX, honest:ROOT, ly>:SUFFIX), and <dishonestly>:WORD will be inserted if we choose ww. Note that position embeddings are not applicable to wp, as a kind of morphological position information has already been provided.
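The lookup, position-embedding, and additive-composition steps can be sketched with NumPy (all shapes, names, and random matrices here are illustrative assumptions, not the repo's actual API):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 300                                  # embedding dimensionality (illustrative)
subwords = ["<dis", "honest", "ly>"]       # sms segmentation of "dishonestly"
vocab = {s: i for i, s in enumerate(subwords)}

E = rng.normal(size=(len(vocab), dim))     # subword embedding matrix
P = rng.normal(size=(len(subwords), dim))  # position embeddings

S = E[[vocab[s] for s in subwords]]        # look up subword embeddings
S_pp = S + P                               # additive position embedding (pp)
S_mp = S * P                               # elementwise multiplication (mp)

word_vec = S_pp.sum(axis=0)                # additive composition (add)
```

The self-attention compositions (att, mtxatt) replace the final sum with an attention-weighted combination of the same subword embeddings.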
Call `gen_word_emb.py` to generate embeddings of new words for a specific composition function, or use `batch_gen_word_emb.sh` to generate them for all composition functions.
Your input file, passed via the `--in_file` argument, needs to be a word list in which each line consists of a single word.
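A minimal usage sketch (the generation commands are shown as comments, since their remaining arguments depend on your trained model and setup; check the scripts themselves):

```shell
# Prepare the input word list: exactly one word per line
printf 'dishonestly\nunhappiness\n' > new_words.txt

# Then generate the embeddings (further arguments depend on your setup):
#   python gen_word_emb.py --in_file new_words.txt
#   bash batch_gen_word_emb.sh
```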
- A Systematic Study of Leveraging Subword Information for Learning Word Representations. Yi Zhu, Ivan Vulić, and Anna Korhonen. In Proc. of NAACL 2019.