finalfusion / finalfrontier
Context-sensitive word embeddings with subwords. In Rust.
Home Page: https://finalfusion.github.io/finalfrontier
License: Other
One way to reduce the memory use of the vocabulary (besides memory mapping the vocabulary) would be to store the vocabulary in a dictionary automaton. Investigate whether the reduction in memory use is worthwhile.
#13 was a first stab at it; maybe we can include this in 0.6? This entails changes to finalfusion-rust for serialization/usage.
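For reference, the fst crate implements exactly this kind of dictionary automaton. A minimal sketch, assuming the fst crate, of storing the word-to-index mapping in a finite-state transducer (the function name is illustrative, not finalfrontier API):

use fst::Map;

/// Build a vocabulary automaton mapping words to indices.
/// fst requires keys to be inserted in lexicographic order.
fn build_vocab_fst(mut words: Vec<(String, u64)>) -> Result<Map<Vec<u8>>, fst::Error> {
    words.sort_by(|a, b| a.0.cmp(&b.0));
    Map::from_iter(words)
}

// Lookup goes through the automaton instead of a HashMap:
// let idx = vocab_fst.get("embedding");

Note that fst keys are sorted, so frequency-ordered indices would have to be stored as values rather than implied by key position.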
Options (none of them great):
- A separate vocab type (duplicating vocab.rs for all variants is quite unwieldy).
- Parameterize SubwordVocab with an Indexer.
- Separate ngram and word indices in the index through an enum (might get ugly, considering how many type parameters we already have for the trainer structs).

Follow-up work in finalfusion-rust:
- Update the finalfusion-rust dependency once it supports NGramVocabs.
- Update the finalfusion dependency with the release.

I see two ways:
1. Extend finalfusion to handle files with both input and output matrices.
2. Store the output matrix as a separate finalfusion model.

Re 1.: More work, and it might make the APIs (more) complex.
Re 2.: Hackier solution; output types need to implement to_string(), so lookup is consequently also done through stringly-typed keys.
I just want to say thanks. You saved me from a text classifier deadline where flair/BERT embeddings would be too slow, and I was unable to find any magical invocation (tried version 3.8, version 4, other parameters) to get gensim to train a working word2vec; the model would simply not converge and got worse with each epoch. You rock, and from my POV finalfrontier looks like the only game in town (spaCy was lackluster even with floret)!
Hello. I'm a genomics researcher interested in using finalfrontier to create embeddings based on DNA and protein sequences. Unfortunately, I'm a bit new to rust and very new to the finalfrontier codebase. I've got a few issues right away I need some help with (but likely to have many more).
For DNA we use kmers (like ngrams, as DNA is essentially one very large continuous string) with a sliding window approach, and I have code to count a large corpus in about 3 hours (300 million unique kmers, 50 GB compressed) -- almost as fast as FastText, without the extra front-end processing or the extra few hundred GB of data. I'd like to get this into a SubwordVocab.
It would be great to have a function that supplements count (vocab/mod.rs) with a known value. Since the corpus is already processed, this would speed things up compared with calling count() many times. I can create a PR if that would help.
Is it possible to create a way to skip bracketing for ngram creation (see the sketch below)? Happy to create a PR as well.
Is it possible to specify a set of specific ngram lengths instead of a range: e.g. 9 and 11, rather than 9, 10, and 11?
I am storing everything as Vec<u8>, but it seems like everything is String in finalfusion. This is more of a performance question: will it hurt anything if I switch all of my kmers over to String?
Or: should I focus instead on creating a different vocab implementation, so as not to mess up anything you have already?
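To make the bracketing and ngram-length questions concrete, here is a rough sketch of the extraction I mean: explicit kmer lengths instead of a min-max range, and no bracketing. The kmers helper is hypothetical, not existing finalfrontier API:

/// Extract kmers of explicit lengths (e.g. 9 and 11) from a sequence,
/// without the <...> bracketing used for natural-language words.
/// Assumes ASCII sequences, as is the case for DNA.
fn kmers<'a>(seq: &'a str, lengths: &[usize]) -> Vec<&'a str> {
    let mut out = Vec::new();
    for &k in lengths {
        if seq.len() < k {
            continue;
        }
        // Sliding window: every substring of length k.
        for start in 0..=seq.len() - k {
            out.push(&seq[start..start + k]);
        }
    }
    out
}

// kmers("ACGTACGTACGTACG", &[9, 11]) yields all 9-mers and 11-mers.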
Any and all help is greatly appreciated!
Cheers,
--Joseph
We store embeddings in little-endian byte order. However, the byte order is not taken into account when embedding matrices are memory mapped. Consequently, incorrect embeddings will be used on big-endian platforms when memory mapping is used.
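A minimal sketch of the guard this implies, assuming the fix is to mmap only when the host byte order matches the on-disk (little-endian) order and to read-and-convert otherwise; the function names are illustrative:

/// Mapping the little-endian data directly is only safe on LE hosts.
fn can_mmap_directly() -> bool {
    cfg!(target_endian = "little")
}

/// Fallback path for big-endian platforms: read and byte-swap.
fn load_matrix(bytes: &[u8]) -> Vec<f32> {
    assert_eq!(bytes.len() % 4, 0);
    bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}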
This would permit people to install finalfrontier on a Mac with brew install finalfrontier, without compilation.
I wonder whether it is still necessary to retain finalfrontier's own binary format. It is now possible on master to convert embeddings to the new rust2vec format. However, it is an extra step. For users, it would be much simpler if finalfrontier directly stored trained embeddings in rust2vec format. Additionally, this would allow us to remove a lot of code from finalfrontier, such as Model and the similarity/analogy query functionality.
Stuff that is currently stored in the finalfrontier format but lost in the rust2vec conversion:
Like we did in finalfusion-utils.
We currently have finalfrontier-skipgram(1) and finalfrontier-deps(1). We should have finalfrontier(1), which describes what finalfrontier is and gives a brief overview of the subcommands, with pointers to their man pages. See man cargo or man git for some conventions.
As you are using Hogwild! for the multicore SGD implementation, it would perhaps be interesting to investigate whether you can speed up the optimization with
PS: nice project you have there!
Support memory mapping of the embedding matrix to reduce memory use.
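A minimal sketch of what this could look like with the memmap2 crate (names are illustrative; a real integration would go through finalfusion's storage types, and see the byte-order caveat above):

use std::fs::File;
use memmap2::Mmap;

/// Map an embeddings file instead of reading it into memory; the
/// embedding matrix is then viewed in place rather than copied into
/// an owned allocation.
fn map_embeddings(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the file must not be modified while the map is alive.
    unsafe { Mmap::map(&file) }
}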
At least, we do not have binaries in the assets of 0.7.0.
It's redundant because we train with punctuation. Also, EOS pops up in different, somewhat unrelated components; e.g., vocabs need to explicitly match the EOS appended by SentenceIterator.
In the old finalfrontier format, we saved the norms of the word embeddings before normalization. This information is lost now that we save directly in finalfusion format. Add an appropriate chunk to finalfusion and restore this functionality.
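A sketch of the bookkeeping involved, assuming ndarray (the function name is illustrative); the norm is captured before normalization so it can be written to the new chunk:

use ndarray::Array1;

/// Normalize an embedding in place and return its original l2 norm,
/// so the norm can be stored in a separate norms chunk.
fn normalize(embedding: &mut Array1<f32>) -> f32 {
    let norm = embedding.dot(embedding).sqrt();
    if norm > 0.0 {
        *embedding /= norm;
    }
    norm
}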
https://travis-ci.org/finalfusion/finalfrontier/jobs/590369304#L562-L565
+'[' '!' rustc --version
+grep '^rustc 1.31.0' ']'
ci/script.sh: line 9: [: missing `]'
grep: ]: No such file or directory
The trace shows that ! and the pipeline are passed to [ as literal arguments; the check should drop the brackets, e.g. if ! rustc --version | grep -q '^rustc 1.31.0'; then ... fi.
Also, while we're at it, we might want to move cargo fmt to the beginning of the script so that builds fail sooner.
I have implemented support for training floret embeddings, but the command line gets a bit unwieldy. Floret is quite a bit different from what we have so far.
I see two ways forward. Option (2) would be separate subcommands: finalfrontier skipgram floret, finalfrontier skipgram fasttext, finalfrontier skipgram buckets, and finalfrontier skipgram explicit, and the same for deps. For (2), I am not sure if this is the best partitioning.
- ff-train: verify that the options are still in sync
- ff-deps: write the manpage

The vast majority of time during training is spent in the dot product and scaled additions. We have been doing unaligned loads so far. I have made a quick modification that ensures that every embedding is aligned on a 16-byte boundary and changed the SSE code to do aligned loads; the compiled machine code seems fine and the compiler even performs some loop unrolling.
Unfortunately, using aligned data/loads does not seem to have a measurable impact on running time. This is probably caused by those functions being constrained by memory bandwidth.
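For illustration, a minimal sketch of an SSE dot product with aligned loads (x86_64 only; assumes 16-byte-aligned slices of equal length, a multiple of 4; this is a sketch, not the actual finalfrontier kernel):

/// SSE dot product using aligned loads.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse")]
unsafe fn dot_aligned(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;

    let mut sums = _mm_setzero_ps();
    for i in (0..a.len()).step_by(4) {
        // _mm_load_ps requires 16-byte alignment; _mm_loadu_ps does not.
        let va = _mm_load_ps(a.as_ptr().add(i));
        let vb = _mm_load_ps(b.as_ptr().add(i));
        sums = _mm_add_ps(sums, _mm_mul_ps(va, vb));
    }

    // Horizontal sum of the four lanes.
    let mut lanes = [0f32; 4];
    _mm_storeu_ps(lanes.as_mut_ptr(), sums);
    lanes.iter().sum()
}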
I just wanted to jot down two possible opportunities for reducing cache misses that might have an impact on performance.
Some papers that replace the core word2vec computations by kernels sample one set of negatives per sentence, rather than per token. In the best case, the number of cache misses due to negatives is reduced by a factor corresponding to the sentence length. Of course, this modification may have an impact on the quality of the embeddings.
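A sketch of this first idea, assuming the rand crate (names are illustrative): one negative set is drawn per sentence and reused for every token, so the corresponding output-matrix rows stay hot in cache.

use rand::distributions::{Distribution, WeightedIndex};
use rand::Rng;

/// Draw one shared set of negatives per sentence instead of per token,
/// trading embedding quality for cache locality.
fn sentence_negatives<R: Rng>(
    rng: &mut R,
    unigram_table: &WeightedIndex<f32>, // negative-sampling distribution
    n_negatives: usize,
) -> Vec<usize> {
    (0..n_negatives).map(|_| unigram_table.sample(rng)).collect()
}

// Per sentence: let negatives = sentence_negatives(&mut rng, &table, 15);
// Every (input, output) pair in the sentence then reuses `negatives`.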
The embeddings in the output matrix and the vocab part of the input matrix are ordered by the frequencies of the corresponding tokens. This might improve locality (due to Zipf's law). However, lookups for subword units are randomized by the hash function. Maybe something can be gained by ordering the embeddings in the subword matrix by hash code frequency. However, in the most obvious implementation this would add an indirection (hash code -> index).
The ngram vocab currently brackets the </s> marker and extracts ngrams from "<</s>>". Those subwords aren't trained, because the indices are never added to the subwords Vec in the vocab:
if word.word() == util::EOS {
    // EOS gets an empty subword list, so its bracketed ngrams are
    // never trained.
    subword_indices.push(Vec::new());
    continue;
}
It's fairly unlikely to encounter those ngrams anywhere, but we should fix this once we have figured out the bug.
Some other embedding packages allow you to set a target vocabulary size. In this case, mincount is chosen such that the vocabulary size does not exceed the target size.
Note that this is different than taking the N most frequent items, since that might remove words with a certain frequency, while retaining other words with the same frequency.
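A sketch of the cutoff computation, with illustrative names:

/// Find the smallest mincount such that the number of types with
/// frequency >= mincount does not exceed target_size.
fn mincount_for_target(mut counts: Vec<u64>, target_size: usize) -> u64 {
    // Sort frequencies in descending order.
    counts.sort_unstable_by(|a, b| b.cmp(a));

    if counts.len() <= target_size {
        return 1;
    }

    // All types with the cutoff frequency are removed together, so the
    // resulting vocabulary may be smaller than target_size, never larger.
    let cutoff = counts[target_size];
    cutoff + 1
}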
I think the norms storage change is pretty important. The earlier we push 0.6.0 out, the better, since it reduces the number of embeddings in the wild that do not have norms.
That said, I think it would be nice to have Nicole's directional skipgram implementation in as well, since then we also have a nice user-visible feature.
Is there anything else that we want to add before branching 0.6?