
sw_study's Introduction

Subword-informed word representation training framework

We provide a general framework for training subword-informed word representations by varying the following components: the subword segmentation method, the subword embedding scheme, the position embedding scheme, and the composition function.

For the whole framework architecture and more details, please refer to the reference.

There are 4 segmentation methods, 3 ways of embedding subwords, 3 ways of enhancing subword embeddings with position embeddings, and 3 different composition functions.

Here is a full table of different options and their labels:

| Component | Option | Label |
| --- | --- | --- |
| Segmentation methods | CHIPMUNK | sms |
| | Morfessor | morf |
| | BPE | bpe |
| | Character n-gram | charn |
| Subword embeddings | w/o word token | - |
| | w/ word token | ww |
| | w/ morphotactic tag (only for sms) | wp |
| Position embeddings | w/o position embedding | - |
| | addition | pp (not applicable to wp) |
| | elementwise multiplication | mp (not applicable to wp) |
| Composition functions | addition | add |
| | single self-attention | att |
| | multi-head self-attention | mtxatt |

For example, sms.wwppmtxatt means we use CHIPMUNK for segmentation, insert the word token into the subword sequence, enhance with additive position embeddings, and use multi-head self-attention as the composition function.
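The labels compose by simple concatenation, so a configuration string can be split back into its four components. As a quick illustration, here is a hypothetical Python helper (not part of this repo) that does so:

```python
SEGS = {"sms", "morf", "bpe", "charn"}
COMPS = {"add", "att", "mtxatt"}

def parse_label(label):
    """Split a label like 'sms.wwppmtxatt' into its four components."""
    seg, rest = label.split(".")
    assert seg in SEGS, f"unknown segmentation: {seg}"
    # Optional subword-embedding tag: ww (word token) or wp (morphotactic tag).
    sub = "-"
    for tag in ("ww", "wp"):
        if rest.startswith(tag):
            sub, rest = tag, rest[len(tag):]
            break
    # Optional position-embedding tag: pp (additive) or mp (multiplicative).
    pos = "-"
    for tag in ("pp", "mp"):
        if rest.startswith(tag):
            pos, rest = tag, rest[len(tag):]
            break
    assert rest in COMPS, f"unknown composition: {rest}"
    return {"seg": seg, "subword": sub, "position": pos, "composition": rest}
```

For instance, `parse_label("sms.wwppmtxatt")` recovers CHIPMUNK segmentation, word-token insertion, additive position embeddings, and multi-head self-attention composition.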

Subword segmentation methods

Taking the word dishonestly as an example, the different segmentation methods produce the following subword sequences:

  • ChipMunk: (<dis, honest, ly>) + (PREFIX, ROOT, SUFFIX)
  • Morfessor: (<dishonest, ly>)
  • BPE (10k merge ops): (<dish, on, est, ly>)
  • Character n-gram (from 3 to 6): (<di, dis, ... , ly>, <dis, ... ,tly>, <dish, ... , stly>, <disho, ... , estly>)

where < and > are word start and end markers.
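The character n-gram segmentation above can be sketched in a few lines of Python, with the word wrapped in the start/end markers and the 3-to-6 range from the example (an illustrative sketch, not the repo's implementation):

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return all character n-grams of the marker-wrapped word, n_min <= n <= n_max."""
    w = "<" + word + ">"  # add word start and end markers
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]
```

For dishonestly this yields the sequence shown above, starting with the 3-grams `<di`, `dis`, ... and ending with the 6-grams `<disho`, ..., `estly>`.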

After segmentation, we obtain a subword sequence S for each segmentation method, plus an additional morphotactic tag sequence T for sms.

Subword embeddings and position embeddings

We can embed the subword sequence S directly into a subword embedding sequence by looking it up in the subword embedding matrix, or insert a word token (ww) into S before embedding, e.g. for sms this gives (<dis, honest, ly>, <dishonestly>).

Then we can enhance the subword embeddings with additive (pp) or elementwise multiplicative (mp) position embeddings.

For sms, we can also embed the concatenation of each subword and its morphotactic tag (wp): (<dis:PREFIX, honest:ROOT, ly>:SUFFIX). <dishonestly>:WORD will be inserted if we also choose ww. Note that position embeddings are not applicable to wp, as a kind of morphological position information has already been provided.
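As a rough numerical sketch of the pp/mp enhancement and the add/att composition steps described above (using NumPy with random stand-ins for the learned embedding matrices; this is not the repo's code):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 3                      # embedding dim, number of subwords
E = rng.normal(size=(n, d))      # subword embeddings, e.g. for (<dis, honest, ly>)
P = rng.normal(size=(n, d))      # position embeddings for slots 0..n-1

E_pp = E + P                     # additive position enhancement (pp)
E_mp = E * P                     # elementwise multiplicative enhancement (mp)

# Composition by addition (add): sum the sequence into one word vector.
w_add = E_pp.sum(axis=0)

# Single self-attention composition (att): softmax scores from a learned query q,
# then an attention-weighted sum of the subword embeddings.
q = rng.normal(size=d)
a = np.exp(E_pp @ q)
a /= a.sum()                     # attention weights, sum to 1
w_att = a @ E_pp
```

Multi-head self-attention (mtxatt) repeats the att step with several independent queries and combines the resulting vectors.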

Prerequisites

Calculate new word embeddings from subword embeddings

Call gen_word_emb.py to generate embeddings of new words for a specific composition function or use batch_gen_word_emb.sh to generate for all composition functions.

Your input, i.e. --in_file in the input args, needs to be a word list, where each line consists of exactly one word.
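A minimal way to prepare such an input file (new_words.txt is a hypothetical file name; the words are just examples):

```shell
# One word per line, nothing else on the line.
printf 'dishonestly\nunhappiness\nrestfulness\n' > new_words.txt
```

The resulting file can then be passed as --in_file to gen_word_emb.py (or used with batch_gen_word_emb.sh).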

References


sw_study's Issues

GPU utilization

Hi, I have a problem when using this method.
I tried to train English embeddings on a downsampled corpus (100M tokens) using the bpe method.
My problem occurs when the code reaches this part of the train function:

```python
if pairs:
    while pairs:
        total_loss += train_batch(m, optimizer, pairs[:m.bs], len(pairs[:m.bs]), m.neg_idxs)
        ...
```

The problem is that this is very slow and training takes too much time.
I tried to enable the CUDA flag, but it did not change anything: it only uses my GPU memory, while GPU utilization (volatile) remains zero.
Do you know what might be wrong here, or is this method simply very time-consuming?

scripts to generate files for sms

Hi, I am having a hard time finding the script (if there is one) that generates the file en.sent.1m.5.sms. I understand that the en.sent.1m has to go through CHIPMUNK, but en.sent.1m.5.sms is a vocabulary using the CHIPMUNK output. How can I get that? Thanks.
