finalfusion / finalfrontier
Context-sensitive word embeddings with subwords. In Rust.
Home Page: https://finalfusion.github.io/finalfrontier
License: Other
One way to reduce the memory use of the vocabulary (besides memory mapping the vocabulary) would be to store the vocabulary in a dictionary automaton. Investigate whether the reduction in memory use is worthwhile.
#13 was a first stab at it; maybe we can include this in 0.6? This entails changes to finalfusion-rust for serialization/usage.
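For reference, the fst crate implements exactly this kind of dictionary automaton. A minimal sketch, assuming the fst crate, of storing the word-to-index mapping in a finite-state transducer (the function name is illustrative, not finalfrontier API):

use fst::Map;

/// Build a vocabulary automaton mapping words to indices.
/// fst requires keys to be inserted in lexicographic order.
fn build_vocab_fst(mut words: Vec<(String, u64)>) -> Result<Map<Vec<u8>>, fst::Error> {
    words.sort_by(|a, b| a.0.cmp(&b.0));
    Map::from_iter(words)
}

// Lookup goes through the automaton instead of a HashMap:
// let idx = vocab_fst.get("embedding");

Note that fst keys are sorted, so frequency-ordered indices would have to be stored as values rather than implied by key position.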
Options (none of them great):
- A separate vocab type (duplicating vocab.rs for all variants is quite unwieldy).
- Parameterize SubwordVocab with an Indexer.
- Separate ngram and word indices in the index through an enum (might get ugly, considering how many type parameters we already have for the trainer structs).

Follow-up work in finalfusion-rust:
- Update the finalfusion-rust dependency once it supports NGramVocabs.
- Update the finalfusion dependency with the release.

I see two ways:
1. Extend finalfusion to handle files with both input and output matrices.
2. Store the output matrix as a separate finalfusion model.

Re 1.: More work, and it might make the APIs (more) complex.
Re 2.: Hackier solution; output types need to implement to_string(), so lookup is consequently also done through stringly-typed keys.
I just want to say thanks. You saved me from a text classifier deadline where flair/BERT embeddings would be too slow, and I was unable to find any magical invocation (tried version 3.8, version 4, other parameters) to get gensim to train a working word2vec; the model would simply not converge and got worse with each epoch. You rock, and from my POV finalfrontier looks like the only game in town (spaCy was lackluster even with floret)!
Hello. I'm a genomics researcher interested in using finalfrontier to create embeddings based on DNA and protein sequences. Unfortunately, I'm a bit new to rust and very new to the finalfrontier codebase. I've got a few issues right away I need some help with (but likely to have many more).
For DNA we use kmers (like ngrams, as DNA is essentially one very large continuous string) with a sliding window approach, and I have code to count a large corpus in about 3 hours (300 million unique kmers, 50 GB compressed) -- almost as fast as FastText, without the extra front-end processing or the extra few hundred GB of data. I'd like to get this into a SubwordVocab.
It would be great to have a function that supplements count (vocab/mod.rs) with a known value. Since the corpus is already processed, this would speed things up compared with calling count() many times. I can create a PR if that would help.
Is it possible to create a way to skip bracketing for ngram creation (see the sketch below)? Happy to create a PR as well.
Is it possible to specify a set of specific ngram lengths instead of a range: e.g. 9 and 11, rather than 9, 10, and 11?
I am storing everything as Vec<u8>, but it seems like everything is String in finalfusion. This is more of a performance question: will it hurt anything if I switch all of my kmers over to String?
Or: should I focus instead on creating a different vocab implementation, so as not to mess up anything you have already?
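To make the bracketing and ngram-length questions concrete, here is a rough sketch of the extraction I mean: explicit kmer lengths instead of a min-max range, and no bracketing. The kmers helper is hypothetical, not existing finalfrontier API:

/// Extract kmers of explicit lengths (e.g. 9 and 11) from a sequence,
/// without the <...> bracketing used for natural-language words.
/// Assumes ASCII sequences, as is the case for DNA.
fn kmers<'a>(seq: &'a str, lengths: &[usize]) -> Vec<&'a str> {
    let mut out = Vec::new();
    for &k in lengths {
        if seq.len() < k {
            continue;
        }
        // Sliding window: every substring of length k.
        for start in 0..=seq.len() - k {
            out.push(&seq[start..start + k]);
        }
    }
    out
}

// kmers("ACGTACGTACGTACG", &[9, 11]) yields all 9-mers and 11-mers.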
Any and all help is greatly appreciated!
Cheers,
--Joseph
We store embeddings in little-endian byte order. However, the byte order is not taken into account when embedding matrices are memory mapped. Consequently, incorrect embeddings will be used on big-endian platforms when memory mapping is used.
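A minimal sketch of the guard this implies, assuming the fix is to mmap only when the host byte order matches the on-disk (little-endian) order and to read-and-convert otherwise; the function names are illustrative:

/// Mapping the little-endian data directly is only safe on LE hosts.
fn can_mmap_directly() -> bool {
    cfg!(target_endian = "little")
}

/// Fallback path for big-endian platforms: read and byte-swap.
fn load_matrix(bytes: &[u8]) -> Vec<f32> {
    assert_eq!(bytes.len() % 4, 0);
    bytes
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect()
}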
This would permit people to install finalfrontier on a Mac with brew install finalfrontier, without compilation.
I wonder whether it is still necessary to retain finalfrontier's own binary format. It is now possible on master to convert embeddings to the new rust2vec format. However, it is an extra step. For users, it would be much simpler if finalfrontier directly stored trained embeddings in rust2vec format. Additionally, this would allow us to remove a lot of code from finalfrontier, such as Model and the similarity/analogy query functionality.
Stuff that is currently stored in the finalfrontier format but lost in the rust2vec conversion:
Like we did in finalfusion-utils.
We currently have finalfrontier-skipgram(1) and finalfrontier-deps(1). We should have finalfrontier(1), which describes what finalfrontier is and gives a brief overview of the subcommands, with pointers to their man pages. See man cargo or man git for some conventions.
As you are using Hogwild! for the multicore SGD implementation, it would perhaps be interesting to investigate whether you can speed up the optimization with
PS: nice project you have there!
Support memory mapping of the embedding matrix to reduce memory use.
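A minimal sketch of what this could look like with the memmap2 crate (names are illustrative; a real integration would go through finalfusion's storage types, and see the byte-order caveat above):

use std::fs::File;
use memmap2::Mmap;

/// Map an embeddings file instead of reading it into memory; the
/// embedding matrix is then viewed in place rather than copied into
/// an owned allocation.
fn map_embeddings(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Safety: the file must not be modified while the map is alive.
    unsafe { Mmap::map(&file) }
}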
At least, we do not have binaries in the assets of 0.7.0.
It's redundant because we train with punctuation. Also, EOS pops up in different, somewhat unrelated components; e.g., vocabs need to explicitly match the EOS appended by SentenceIterator.
In the old finalfrontier format, we saved the norms of the word embeddings before normalization. This information is lost now that we save directly in finalfusion format. Add an appropriate chunk to finalfusion and restore this functionality.
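A sketch of the bookkeeping involved, assuming ndarray (the function name is illustrative); the norm is captured before normalization so it can be written to the new chunk:

use ndarray::Array1;

/// Normalize an embedding in place and return its original l2 norm,
/// so the norm can be stored in a separate norms chunk.
fn normalize(embedding: &mut Array1<f32>) -> f32 {
    let norm = embedding.dot(embedding).sqrt();
    if norm > 0.0 {
        *embedding /= norm;
    }
    norm
}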
https://travis-ci.org/finalfusion/finalfrontier/jobs/590369304#L562-L565
+'[' '!' rustc --version
+grep '^rustc 1.31.0' ']'
ci/script.sh: line 9: [: missing `]'
grep: ]: No such file or directory
The trace shows that ! and the pipeline are passed to [ as literal arguments; the check should drop the brackets, e.g. if ! rustc --version | grep -q '^rustc 1.31.0'; then ... fi.
Also, while we're at it, we might want to move cargo fmt to the beginning of the script so that builds fail sooner.
I have implemented support for training floret embeddings, but the command line gets a bit unwieldy. Floret is quite a bit different from what we have so far.
I see two ways forward. Option (2) would be separate subcommands: finalfrontier skipgram floret, finalfrontier skipgram fasttext, finalfrontier skipgram buckets, and finalfrontier skipgram explicit, and the same for deps. For (2), I am not sure if this is the best partitioning.
- ff-train: verify that the options are still in sync
- ff-deps: write the manpage

The vast majority of time during training is spent in the dot product and scaled additions. We have been doing unaligned loads so far. I have made a quick modification that ensures that every embedding is aligned on a 16-byte boundary and changed the SSE code to do aligned loads; the compiled machine code seems fine and the compiler even performs some loop unrolling.
Unfortunately, using aligned data/loads does not seem to have a measurable impact on running time. This is probably caused by those functions being constrained by memory bandwidth.
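For illustration, a minimal sketch of an SSE dot product with aligned loads (x86_64 only; assumes 16-byte-aligned slices of equal length, a multiple of 4; this is a sketch, not the actual finalfrontier kernel):

/// SSE dot product using aligned loads.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "sse")]
unsafe fn dot_aligned(a: &[f32], b: &[f32]) -> f32 {
    use std::arch::x86_64::*;

    let mut sums = _mm_setzero_ps();
    for i in (0..a.len()).step_by(4) {
        // _mm_load_ps requires 16-byte alignment; _mm_loadu_ps does not.
        let va = _mm_load_ps(a.as_ptr().add(i));
        let vb = _mm_load_ps(b.as_ptr().add(i));
        sums = _mm_add_ps(sums, _mm_mul_ps(va, vb));
    }

    // Horizontal sum of the four lanes.
    let mut lanes = [0f32; 4];
    _mm_storeu_ps(lanes.as_mut_ptr(), sums);
    lanes.iter().sum()
}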
I just wanted to jot down two possible opportunities for reducing cache misses that might have an impact on performance.
Some papers that replace the core word2vec computations by kernels sample one set of negatives per sentence, rather than per token. In the best case, the number of cache misses due to negatives is reduced by a factor corresponding to the sentence length. Of course, this modification may have an impact on the quality of the embeddings.
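A sketch of this first idea, assuming the rand crate (names are illustrative): one negative set is drawn per sentence and reused for every token, so the corresponding output-matrix rows stay hot in cache.

use rand::distributions::{Distribution, WeightedIndex};
use rand::Rng;

/// Draw one shared set of negatives per sentence instead of per token,
/// trading embedding quality for cache locality.
fn sentence_negatives<R: Rng>(
    rng: &mut R,
    unigram_table: &WeightedIndex<f32>, // negative-sampling distribution
    n_negatives: usize,
) -> Vec<usize> {
    (0..n_negatives).map(|_| unigram_table.sample(rng)).collect()
}

// Per sentence: let negatives = sentence_negatives(&mut rng, &table, 15);
// Every (input, output) pair in the sentence then reuses `negatives`.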
The embeddings in the output matrix and the vocab part of the input matrix are ordered by the frequencies of the corresponding tokens. This might improve locality (due to Zipf's law). However, lookups for subword units are randomized by the hash function. Maybe something can be gained by ordering the embeddings in the subword matrix by hash code frequency. However, in the most obvious implementation this would add an indirection (hash code -> index).
The ngram vocab currently brackets the </s> marker and extracts ngrams from "<</s>>". Those subwords aren't trained, because the indices are never added to the subwords Vec in the vocab:
if word.word() == util::EOS {
    // EOS gets an empty subword list, so its bracketed ngrams are
    // never trained.
    subword_indices.push(Vec::new());
    continue;
}
It's fairly unlikely to encounter those ngrams anywhere, but we should fix this once we have figured out the bug.
Some other embedding packages allow you to set a target vocabulary size. In this case, mincount is chosen such that the vocabulary size does not exceed the target size.
Note that this is different than taking the N most frequent items, since that might remove words with a certain frequency, while retaining other words with the same frequency.
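A sketch of the cutoff computation, with illustrative names:

/// Find the smallest mincount such that the number of types with
/// frequency >= mincount does not exceed target_size.
fn mincount_for_target(mut counts: Vec<u64>, target_size: usize) -> u64 {
    // Sort frequencies in descending order.
    counts.sort_unstable_by(|a, b| b.cmp(a));

    if counts.len() <= target_size {
        return 1;
    }

    // All types with the cutoff frequency are removed together, so the
    // resulting vocabulary may be smaller than target_size, never larger.
    let cutoff = counts[target_size];
    cutoff + 1
}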
I think the norms storage change is pretty important. The earlier we push 0.6.0 out, the better, since it reduces the number of embeddings in the wild that do not have norms.
That said, I think it would be nice to have Nicole's directional skipgram implementation in as well, since then we also have a nice user-visible feature.
Is there anything else that we want to add before branching 0.6?