
Comments (90)

sebpuetz avatar sebpuetz commented on June 23, 2024 1

Threads could really be our answer, this is from tesniere with 20 threads on 0.5:

finalfusion compute-accuracy twe-bucket-vocab-test_0.5_20threads.fifu ../finalfusion-utils/de_trans_Google_analogies.txt --threads 40
██████████████████████████████ 100%  ETA: 00:00:00
capital-common-countries: 454/506 correct, accuracy: 89.72, avg cos: 0.76, skipped: 0
capital-world: 3885/4524 correct, accuracy: 85.88, avg cos: 0.74, skipped: 0
city-in-state: 876/2467 correct, accuracy: 35.51, avg cos: 0.69, skipped: 0
currency: 95/866 correct, accuracy: 10.97, avg cos: 0.59, skipped: 0
family: 289/506 correct, accuracy: 57.11, avg cos: 0.74, skipped: 0
gram2-opposite: 188/812 correct, accuracy: 23.15, avg cos: 0.65, skipped: 0
gram3-comparative: 893/1332 correct, accuracy: 67.04, avg cos: 0.67, skipped: 0
gram4-superlative: 282/1122 correct, accuracy: 25.13, avg cos: 0.57, skipped: 0
gram5-present-participle: 199/1056 correct, accuracy: 18.84, avg cos: 0.65, skipped: 0
gram6-nationality-adjective: 793/1599 correct, accuracy: 49.59, avg cos: 0.73, skipped: 0
gram7-past-tense: 812/1560 correct, accuracy: 52.05, avg cos: 0.68, skipped: 0
gram8-plural: 708/1332 correct, accuracy: 53.15, avg cos: 0.65, skipped: 0
gram9-plural-verbs: 704/870 correct, accuracy: 80.92, avg cos: 0.75, skipped: 0
Total: 10178/18552 correct, accuracy: 54.86, avg cos: 0.69
Skipped: 0/18552 (0%)

And turing on 0.5 with 30 threads:

 finalfusion compute-accuracy ../finalfrontier/twe-bucket-vocab-test_0.5.fifu ~/finalfrontier/analogies/de_trans_Google_analogies.txt --threads 40 
capital-common-countries: 273/506 correct, accuracy: 53.95, avg cos: 0.75, skipped: 0
capital-world: 2581/4524 correct, accuracy: 57.05, avg cos: 0.73, skipped: 0
city-in-state: 713/2467 correct, accuracy: 28.90, avg cos: 0.68, skipped: 0
currency: 73/866 correct, accuracy: 8.43, avg cos: 0.61, skipped: 0
family: 282/506 correct, accuracy: 55.73, avg cos: 0.74, skipped: 0
gram2-opposite: 179/812 correct, accuracy: 22.04, avg cos: 0.65, skipped: 0
gram3-comparative: 882/1332 correct, accuracy: 66.22, avg cos: 0.66, skipped: 0
gram4-superlative: 279/1122 correct, accuracy: 24.87, avg cos: 0.56, skipped: 0
gram5-present-participle: 196/1056 correct, accuracy: 18.56, avg cos: 0.64, skipped: 0
gram6-nationality-adjective: 568/1599 correct, accuracy: 35.52, avg cos: 0.70, skipped: 0
gram7-past-tense: 688/1560 correct, accuracy: 44.10, avg cos: 0.66, skipped: 0
gram8-plural: 491/1332 correct, accuracy: 36.86, avg cos: 0.68, skipped: 0
gram9-plural-verbs: 537/870 correct, accuracy: 61.72, avg cos: 0.72, skipped: 0
Total: 7742/18552 correct, accuracy: 41.73, avg cos: 0.68
Skipped: 0/18552 (0%)

I'll give master on tesniere and turing with 20 threads a go now.

We should add n_threads to metadata.
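
Hypothetically, the change could look like the sketch below; the struct and field names are assumed from the [common_config] metadata dumps in this thread, not taken from the actual finalfrontier code.

    use serde::Serialize;

    // Assumed shape of the serialized training config; n_threads is the proposed addition.
    #[derive(Serialize)]
    struct CommonConfig {
        dims: u32,
        epochs: u32,
        lr: f32,
        negative_samples: u32,
        zipf_exponent: f32,
        // loss etc. omitted for brevity
        n_threads: u32, // record the thread count used for training
    }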

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Before going deeper into NGram vocabs in this repo, I'd like to pull in some more logic from finalfusion-rust. There's a lot of duplicate code floating around in the subword and vocab department which I think can be simplified a fair bit.

For instance, we should be able to use the finalfusion-rust SubwordVocab to cover both bucket and ngram vocabs if we wrap it in some TrainVocab in finalfrontier. Then we wouldn't need any subword module in finalfrontier and would only need to provide additional methods required for training on the vocabs.
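
As a rough sketch of that wrapper idea (names and fields here are illustrative assumptions, not the actual finalfusion or finalfrontier API):

    // Keep word/subword lookup in the finalfusion vocab type and layer the
    // training-only state on top of it.
    pub struct TrainVocab<V> {
        vocab: V,            // e.g. a finalfusion SubwordVocab with a bucket or ngram indexer
        discards: Vec<f32>,  // per-word keep probabilities for frequency subsampling
        n_tokens: usize,     // corpus token count, used for the learning-rate schedule
    }

    impl<V> TrainVocab<V> {
        /// Lookup and subword indexing are delegated to the wrapped finalfusion vocab.
        pub fn vocab(&self) -> &V {
            &self.vocab
        }

        /// Probability of keeping an occurrence of the in-vocabulary word at `idx`.
        pub fn discard(&self, idx: usize) -> f32 {
            self.discards[idx]
        }
    }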

What do you think? @danieldk @NianhengWu

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Yes, I think this is a great idea. The finalfusion crate wasn't around when we implemented finalfrontier, but now that it exists, it makes sense to reuse as many primitives from finalfusion as we can.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

This one's pretty much done; what's missing now is a release of `finalfusion-rust`, then we can replace the dependencies with proper versions.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Maybe it would be good to do test runs and compute-accuracy for finalfrontier/finalfusion before the changes and then after (obviously, the n-grams model can only be done after), to see whether there are any regressions.

The last time we did big changes, this shook out an important bug.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

I'm currently training an ngram model on turing (started with ngram-mincount=100), that'll be done some time tonight. After that there are two more models lined up with ngram-mincount=50 and ngram-mincount=30.

I can train a bucket-vocab model in parallel to catch possible regressions; I guess a skipgram model should suffice, since we didn't touch the training routine, only the vocab and config. The ngram models I'm training are all structgram models, since we've been using structgram for almost all experiments.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

~48% on compute-accuracy for a skipgram model with

[common_config]
dims = 300
epochs = 15
loss = 'LogisticNegativeSampling'
lr = 0.05000000074505806
negative_samples = 5
zipf_exponent = 0.5

[model_config]
context_size = 10
model = 'SkipGram'
type = 'SkipGramLike'

[vocab_config]
discard_threshold = 0.00009999999747378752
max_n = 6
min_count = 30
min_n = 3
type = 'SubwordVocab'

[vocab_config.indexer]
buckets_exp = 21
type = 'Buckets'

That doesn't seem right. I'm now training on 0.6.1 with the same parameters to see where we might have introduced a regression.

@danieldk did we verify anything after adding support for no-subwords training?

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

I checked finalfusion-utils for possible regressions in compute-accuracy, but our public skipgram model still gets 57.2%. No change on that end.

Unfortunately I didn't change the name before restarting training and since finalfrontier eagerly creates the model file before training, the model from my previous comment is gone.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

So, accuracy is down ~9%, that must be a bug. I guess the next step is finding out whether it's finalfusion or finalfrontier changes causing this.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Do you remember if we verified models before the 0.6 release?

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Of finalfrontier? I don't think so. We verified 0.5.0, where we had the bug where the ZipfRangeGenerator was accidentally replaced by an RNG.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

My hunch is that it's related to the 0.6 release. We changed quite a lot directly related to the training. If the cargo-installed ff-train-skipgram results in similar accuracy, we might be onto something.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

0.5.0 -> 0.6.1 is fairly small though: embedding norms storage, better default hyperparameters, the directional skipgram model, and that's pretty much it. All the big changes were in 0.5.0, which we checked.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

I misremembered releasing the no-subwords training. I meant to say it's probably related to changing the vocabs and indexing.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

I misremembered releasing the no-subwords training. I meant to say it's probably related to changing the vocabs and indexing.

That's likely. One possibility is some incorrect offset in the embedding matrix (where the subword index is not added to the known vocab size).
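
For instance, under the usual layout where the known vocabulary occupies the first rows of the matrix and the subword buckets come after it, the offsetting would look roughly like this (a sketch of the invariant being discussed, not the actual finalfrontier code):

    // A subword index has to be shifted past the known vocabulary before it is
    // used as a row index into the embedding matrix.
    fn subword_row(subword_idx: usize, vocab_len: usize) -> usize {
        vocab_len + subword_idx
    }

    // E.g. with 50_000 known words, subword bucket 3 maps to row 50_003;
    // forgetting the offset would silently update some word's row instead.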

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

What version of ff-compute-accuracy did you use? Should be easy to check on the old embeddings whether the finalfusion part is correct.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

I verified against finalfusion-utils/master with an updated finalfusion dependency pointing to finalfusion = { git = "https://github.com/finalfusion/finalfusion-rust" }.

So I think that rules out most of the finalfusion changes, unless the serialization was broken in the meantime.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

It shouldn't be about subword indices: adding this to SkipgramTrainer::train_iter_from doesn't panic, but it would if there were multiple indices into the word matrix for a single word.

                // Count how many of the indices point into the known-word part of
                // the matrix; more than one would mean a duplicated word index.
                let mut word_n = 0;
                for &idx in &idx {
                    if (idx as usize) < self.vocab.len() {
                        if word_n >= 1 {
                            panic!("multiple word indices for a single token");
                        }
                        word_n += 1;
                    }
                }

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024
  • Every word - apart from the EOS marker - has at least one subword index, so we're properly generating the subwords.

  • TrainModel::mean_embedding also gets the correct number of indices.

  • Running in debug mode also doesn't produce any crashes.

  • Upper bounds of negative sampling are correctly set to the vocab length, and negative sampling doesn't seem to be periodic; it produces random-looking output (see the sketch below).
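
For reference, a bounded Zipf-like sampler could look roughly like this; this is an illustrative sketch, not finalfrontier's ZipfRangeGenerator, and the inverse-CDF formula is only a continuous approximation:

    use rand::Rng;

    // Draw ranks in [0, vocab_len) with probability roughly proportional to
    // (rank + 1)^(-exponent); the result never reaches vocab_len, which is the
    // upper-bound property checked above.
    fn zipf_sample<R: Rng>(rng: &mut R, vocab_len: usize, exponent: f64) -> usize {
        let u: f64 = rng.gen();
        let n = vocab_len as f64;
        let t = (n + 1.0).powf(1.0 - exponent);
        let x = (1.0 + u * (t - 1.0)).powf(1.0 / (1.0 - exponent));
        ((x.floor() as usize) - 1).min(vocab_len - 1)
    }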

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Can we see the difference in performance in the loss for a relatively small training corpus? If so, you could git bisect to find the offending commit.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

0.4

$ cargo install finalfrontier-utils --force --version ^0.4
$ ff-train tdz-text.conll throwaway --threads 14 --dims 100 --epochs 5
    100% loss: 1.66629 lr: 0.00064 ETA: 00:00:00
$ ff-train tdz-text.conll throwaway --threads 14 --dims 100 --epochs 5
    100% loss: 1.65123 lr: 0.00013 ETA: 00:00:00

0.5

$ cargo install finalfrontier-utils --force --version ^0.5
$ ff-train-skipgram tdz-text.conll throwaway --threads 14 --dims 100 --epochs 5
    100% loss: 1.66744 lr: 0.00053 ETA: 00:00:00
$ ff-train-skipgram tdz-text.conll throwaway --threads 14 --dims 100 --epochs 5
    100% loss: 1.64953 lr: 0.00010 ETA: 00:00:00

0.6

$ cargo install finalfrontier-utils --force
$ ff-train-skipgram tdz-text.conll throwaway --threads 14 --dims 100 --epochs 5
    100% loss: 1.20165 lr: 0.00002 ETA: 00:00:00
$ ff-train-skipgram tdz-text.conll throwaway --threads 14 --dims 100 --epochs 5
    100% loss: 1.20786 lr: 0.00020 ETA: 00:00:00

master

$ cargo build --release
$ target/release/ff-train-skipgram tdz-text.conll throwaway --threads 14 --dims 100 --epochs 5
    100% loss: 1.20882 lr: 0.00021 ETA: 00:00:00
$ target/release/ff-train-skipgram tdz-text.conll throwaway --threads 14 --dims 100 --epochs 5
    100% loss: 1.21025 lr: 0.00020 ETA: 00:00:00

Starting with 0.6, something apparently changed: the loss is now much lower. Not sure whether that's really indicative, though; we changed something about the loss calculation at some point.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

The default ctx-size changed, so that explains my differences but not the regression.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

We changed some defaults:

    The changed defaults are:
    
    - Context size: 5 -> 10
    - Dims: 100 -> 300
    - Epochs: 5 -> 15

So I guess you have to modify the context size.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Heh, simultaneous.

Maybe you could redo them with the same context size?

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Already done:

: 1570446569:0;cargo install finalfrontier-utils --force --version ^0.4
: 1570446620:0;ff-train tdz-text.conll throwaway --threads 14 --dims 100 --epochs 5 --context 10
    100% loss: 1.19736 lr: 0.00018 ETA: 00:00:00

and

: 1570446436:0;cargo install finalfrontier-utils --force --version ^0.5
: 1570446497:0;ff-train-skipgram tdz-text.conll throwaway --threads 14 --dims 100 --epochs 5 --context 10
    100% loss: 1.20563 lr: 0.00017 ETA: 00:00:00

Loss is the same as on 0.6 and master now. I'll convert one of our public models with current finalfusion-utils and run it against compute-accuracy to see whether serialization is broken...

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024
$ finalfusion convert /zdv/sfb833-a3/public-data/finalfusion/german-skipgram-mincount-30-ctx-10-dims-300.fifu convert_test.fifu -f finalfusion -t finalfusion
$ finalfusion compute-accuracy ./convert_test.fifu ~/finalfrontier/analogies/de_trans_Google_analogies.txt --threads 10
[...]
    Total: 10611/18552 correct, accuracy: 57.20, avg cos: 0.69

Looks fine...

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

The cargo-installed finalfrontier-utils comes in with even lower scores:

$ finalfusion metadata twe-bucket-vocab-test.fifu
[common_config]
dims = 300
epochs = 15
loss = 'LogisticNegativeSampling'
lr = 0.05000000074505806
negative_samples = 5
zipf_exponent = 0.5

[model_config]
context_size = 10
model = 'SkipGram'
type = 'SkipGramLike'

[vocab_config]
buckets_exp = 21
discard_threshold = 0.00009999999747378752
max_n = 6
min_count = 30
min_n = 3
type = 'SubwordVocab'

The cargo-installed finalfusion-utils produces the same numbers.

$ finalfusion compute-accuracy ../finalfrontier/twe-bucket-vocab-test.fifu ~/finalfrontier/analogies/de_trans_Google_analogies.txt --threads 40                            
██████████████████████████████ 100%  ETA: 00:00:00
capital-common-countries: 282/506 correct, accuracy: 55.73, avg cos: 0.74, skipped: 0
capital-world: 2678/4524 correct, accuracy: 59.20, avg cos: 0.73, skipped: 0
city-in-state: 716/2467 correct, accuracy: 29.02, avg cos: 0.68, skipped: 0
currency: 58/866 correct, accuracy: 6.70, avg cos: 0.60, skipped: 0
family: 275/506 correct, accuracy: 54.35, avg cos: 0.74, skipped: 0
gram2-opposite: 181/812 correct, accuracy: 22.29, avg cos: 0.65, skipped: 0
gram3-comparative: 901/1332 correct, accuracy: 67.64, avg cos: 0.66, skipped: 0
gram4-superlative: 310/1122 correct, accuracy: 27.63, avg cos: 0.56, skipped: 0
gram5-present-participle: 202/1056 correct, accuracy: 19.13, avg cos: 0.64, skipped: 0
gram6-nationality-adjective: 583/1599 correct, accuracy: 36.46, avg cos: 0.70, skipped: 0
gram7-past-tense: 734/1560 correct, accuracy: 47.05, avg cos: 0.66, skipped: 0
gram8-plural: 494/1332 correct, accuracy: 37.09, avg cos: 0.67, skipped: 0
gram9-plural-verbs: 601/870 correct, accuracy: 69.08, avg cos: 0.72, skipped: 0
Total: 8015/18552 correct, accuracy: 43.20, avg cos: 0.68
Skipped: 0/18552 (0%)

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

So, does this mean that the bug was already in 0.6.1?

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Either that or we're missing something else. I started training a model on 0.5 last night to see whether that matches scores from the last verification.

I looked a bit into the model last night but couldn't find anything unexpected. Vocab size is correct, the right part of the matrix is normalized, shape of the matrix is fine...

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

That's really strange. finalfusion seems to be fine, because it works correctly with the old embeddings. But from 0.5.0 -> 0.6.1 there are only trivial changes:

  • Directional skipgram (should not affect normal skipgram).
  • Addition of the embedding norms.

It would really be good to see the results for 0.5. In the meantime I'll check whether we aren't using the norms somewhere.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

It seems that it really is the embeddings that are incorrect. I took an old version of ff-convert from before norms and all and did a fifu -> word2vec -> fifu round trip using your twe-bucket-vocab-test.fifu. This excludes the norms layer (and does not restore norms in the word2vec flavor) and discards the subword units. The results are just as bad, so it really seems that the trained embeddings are problematic (which could of course still be related to the subword units, which are summed into the word embedding after training).

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Another datapoint: the German skipgram embeddings that work correctly are even from before the addition of dependency embeddings, so they must have been trained with 0.4.0 or 0.4.1.

Edit: Nicole posted some results for 0.6.0 and they were also low, but only trained for 1 epoch:
#45 (comment)

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

We did verify scores before releasing 0.5 though; we fixed the RNG before the release, IIRC.

I just found another model that I trained for my ngram experiments:

$ ls -hl 
[...]
3,2G Jul 24 00:11 twe-subword_vocab-skipgram-15eps-10ctx-5ns-300dims-30min.fifu
$ finalfusion metadata twe-subword_vocab-skipgram-15eps-10ctx-5ns-300dims-30min.fifu 
[common_config]
dims = 300
epochs = 15
loss = 'LogisticNegativeSampling'
lr = 0.05000000074505806
negative_samples = 5
zipf_exponent = 0.5

[model_config]
context_size = 10
model = 'SkipGram'
type = 'SkipGramLike'

[vocab_config]
buckets_exp = 21
discard_threshold = 0.00009999999747378752
max_n = 6
min_count = 30
min_n = 3
type = 'SubwordVocab'

That model was trained on this branch https://github.com/sfb833-a3/finalfrontier/tree/ngram-vocab (includes the no-subword changes) with a dependency on https://github.com/sfb833-a3/finalfusion-rust/tree/ngram-vocab (including the actually-write-norms commit)

        if let Some(norms) = self.norms() {
            norms.write_chunk(write)?;
        }

The numbers I collected back then were fine for the analogy datasets. So it might be down to serialization in finalfusion. I'm running compute-accuracy against this model, should be done in a few minutes.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

The numbers I collected back then were fine for the analogy datasets. So it might be down to serialization in finalfusion. I'm running compute-accuracy against this model, should be done in a few minutes.

But it is strange that read -> write roundtrip with finalfusion does not cause problems. What would be different? (Except if the writer code that feeds things to finalfusion is wrong in finalfrontier). Also, 0.6.1 uses a much older version of finalfusion.

There could still be a bug there, but it would be somewhat surprising.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Done running, scores are as we like them:

$ finalfusion compute-accuracy twe-subword_vocab-skipgram-15eps-10ctx-5ns-300dims-30min.fifu ~/finalfrontier/analogies/de_trans_Google_analogies.txt --threads 10 
capital-common-countries: 466/506 correct, accuracy: 92.09, avg cos: 0.77, skipped: 0
capital-world: 4063/4524 correct, accuracy: 89.81, avg cos: 0.75, skipped: 0
city-in-state: 917/2467 correct, accuracy: 37.17, avg cos: 0.68, skipped: 0
currency: 104/866 correct, accuracy: 12.01, avg cos: 0.59, skipped: 0
family: 279/506 correct, accuracy: 55.14, avg cos: 0.74, skipped: 0
gram2-opposite: 188/812 correct, accuracy: 23.15, avg cos: 0.65, skipped: 0
gram3-comparative: 892/1332 correct, accuracy: 66.97, avg cos: 0.67, skipped: 0
gram4-superlative: 249/1122 correct, accuracy: 22.19, avg cos: 0.57, skipped: 0
gram5-present-participle: 197/1056 correct, accuracy: 18.66, avg cos: 0.66, skipped: 0
gram6-nationality-adjective: 785/1599 correct, accuracy: 49.09, avg cos: 0.74, skipped: 0
gram7-past-tense: 847/1560 correct, accuracy: 54.29, avg cos: 0.68, skipped: 0
gram8-plural: 872/1332 correct, accuracy: 65.47, avg cos: 0.66, skipped: 0
gram9-plural-verbs: 714/870 correct, accuracy: 82.07, avg cos: 0.76, skipped: 0
Total: 10573/18552 correct, accuracy: 56.99, avg cos: 0.69
Skipped: 0/18552 (0%)

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

So, up until 369f474 everything should be fine. That's the last commit included in the sfb-fork.

For finalfusion, this is the lower bound; unfortunately there are a ton of changes coming after that: finalfusion/finalfusion-rust@46e5ff1

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Ok, now I am even more confused ;). This is also exactly the commit that made 0.6.1. How can it be that this works, but 0.6.1 (which you mentioned above) does not work?

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

A bunch of commits are excluded from 0.6.1 but are on the sfb-branch. Maybe we can find something by diff-ing the released version and the one I used.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

We can probably look at the commit before the ngram-vocab changes, since that iteration was much less intrusive: I just added a new vocab type and a config. SubwordVocab and NGramVocab are distinct types on that branch.

edit: Although I'm unsure whether it makes sense since master includes all those commits...

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

edit: Although I'm unsure whether it makes sense since master includes all those commits...

Like you have said, there are a lot of changes. So I think we should first figure out if 0.5.0 works correctly (which it should if we verified it correctly back then) and if so, diff that version and 0.6.1 and the corresponding finalfusion-rust versions.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

finalfusion-rust diff of the versions used by finalfrontier 0.5.0 and 0.6.1:

finalfusion/finalfusion-rust@v0.5.1...0.7.1

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Just to exclude another factor: do you use the default build or do you enable optimizations that'd trigger the AVX code paths?

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Never used any explicit features, just cargo build --release and cargo install finalfusion-utils

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

I found this model again on tesniere #39 (comment)

On current finalfusion-utils I get the same 56.94% accuracy that we had back then.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Something's really off, section accuracies above 1, sections getting 4/506 correct but accuracy is 0.79...

# finalfrontier master
target/release/ff-train-skipgram tdz-text.conll throwaway --threads 14 --dims 10 --epochs 5 --context 10
██████████████████████████████ 10.82MB/10.82MB ETA: 00:00:00
██████████████████████████████ 100% loss: 1.22040 lr: 0.00013 ETA: 00:00:00
# finalfusion-utils master
$ target/release/finalfusion compute-accuracy ../finalfrontier/throwaway /data/de_trans_Google_analogies.txt  --threads 14
██████████████████████████████ 100%  ETA: 00:00:00
capital-common-countries: 4/506 correct, accuracy: 0.79, avg cos: 0.94, skipped: 0
capital-world: 12/4524 correct, accuracy: 0.27, avg cos: 0.95, skipped: 0
city-in-state: 0/2467 correct, accuracy: 0.00, avg cos: 0.95, skipped: 0
currency: 0/866 correct, accuracy: 0.00, avg cos: 0.95, skipped: 0
family: 6/506 correct, accuracy: 1.19, avg cos: 0.96, skipped: 0
gram2-opposite: 2/812 correct, accuracy: 0.25, avg cos: 0.97, skipped: 0
gram3-comparative: 3/1332 correct, accuracy: 0.23, avg cos: 0.96, skipped: 0
gram4-superlative: 1/1122 correct, accuracy: 0.09, avg cos: 0.96, skipped: 0
gram5-present-participle: 2/1056 correct, accuracy: 0.19, avg cos: 0.97, skipped: 0
gram6-nationality-adjective: 0/1599 correct, accuracy: 0.00, avg cos: 0.96, skipped: 0
gram7-past-tense: 7/1560 correct, accuracy: 0.45, avg cos: 0.96, skipped: 0
gram8-plural: 32/1332 correct, accuracy: 2.40, avg cos: 0.96, skipped: 0
gram9-plural-verbs: 15/870 correct, accuracy: 1.72, avg cos: 0.97, skipped: 0
Total: 84/18552 correct, accuracy: 0.45, avg cos: 0.96
Skipped: 0/18552 (0%)

Computation of skips is completely off. We never skip in bucket vocabs, although we should if the answer is unknown. We only check whether idx() returns Some, which is always the case for bucket vocabs.
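
Conceptually, the check has to distinguish a known-word lookup from a subword-only lookup, along these lines (a hypothetical sketch that mirrors the idea, not finalfusion's actual types):

    // Stand-in for the result of a vocab lookup.
    enum Lookup {
        Word(usize),          // the token is in the known vocabulary
        Subwords(Vec<usize>), // only bucket/ngram indices, i.e. effectively unknown
    }

    fn should_skip(answer: Option<&Lookup>) -> bool {
        // Checking only `is_some()` never skips for bucket vocabs, because every
        // string hashes to some buckets; skipping has to require a known word.
        !matches!(answer, Some(Lookup::Word(_)))
    }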

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Something's really off, section accuracies above 1, sections getting 4/506 correct but accuracy is 0.79...

(4 / 506.) * 100
0.7905138339920948

Seems correct? Remember that these are scores between 0-100.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Oops, makes sense. Accuracy is correct; I didn't read it as a percentage. Skips are still not counted, though, and the result is distorted because of it.

If we correct for skips:

target/release/finalfusion compute-accuracy ../finalfrontier/throwaway /data/de_trans_Google_analogies.txt  --threads 14
██████████████████████████████ 100%  ETA: 00:00:00
capital-common-countries: 4/484 correct, accuracy: 0.83, avg cos: 0.94, skipped: 22
capital-world: 12/3276 correct, accuracy: 0.37, avg cos: 0.94, skipped: 1248
city-in-state: 0/1778 correct, accuracy: 0.00, avg cos: 0.95, skipped: 689
currency: 0/230 correct, accuracy: 0.00, avg cos: 0.95, skipped: 636
family: 6/396 correct, accuracy: 1.52, avg cos: 0.96, skipped: 110
gram2-opposite: 2/224 correct, accuracy: 0.89, avg cos: 0.97, skipped: 588
gram3-comparative: 3/1044 correct, accuracy: 0.29, avg cos: 0.96, skipped: 288
gram4-superlative: 1/297 correct, accuracy: 0.34, avg cos: 0.96, skipped: 825
gram5-present-participle: 2/96 correct, accuracy: 2.08, avg cos: 0.97, skipped: 960
gram6-nationality-adjective: 0/156 correct, accuracy: 0.00, avg cos: 0.96, skipped: 1443
gram7-past-tense: 7/1170 correct, accuracy: 0.60, avg cos: 0.96, skipped: 390
gram8-plural: 32/1044 correct, accuracy: 3.07, avg cos: 0.96, skipped: 288
gram9-plural-verbs: 15/754 correct, accuracy: 1.99, avg cos: 0.97, skipped: 116
Total: 84/10949 correct, accuracy: 0.77, avg cos: 0.95
Skipped: 7603/18552 (40.982104355325575%)

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Indeed. This is a porting error from finalfrontier (where it did the correct thing) to finalfusion. But let's fix this after we've fixed the bug.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Random non-bug observation: I have been running 0.5.0 and 0.6.1 on hopper with a few false starts. And it seems that with every run 0.5.0 is faster (they started with a 15 minute difference and now it's ~35 minutes).

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Are you currently running 0.5 and 0.6.1 in parallel? That would be somewhat unexpected, since there weren't really changes to the training loop... apart from an added match arm.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Surprised me too. Could just be a fluke. But one of the matches is in a hot loop (via output_), so it is possible that we have less inlining or that branching became more expensive.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

But aren't you running 0.4 with ff-train? IIRC, ff-train-skipgram was introduced with the dependency embeddings release in 0.5?

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

You are right! I am running 0.4.0 and 0.6.1. Good catch! More changes then.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

finalfrontier 0.4.0:

❯ ~/.cargo/bin/finalfusion compute-accuracy skipgram-0.4.0.fifu de_trans_Google_analogies.txt
capital-common-countries: 461/506 correct, accuracy: 91.11, avg cos: 0.76, skipped: 0
capital-world: 3993/4524 correct, accuracy: 88.26, avg cos: 0.74, skipped: 0
city-in-state: 862/2467 correct, accuracy: 34.94, avg cos: 0.68, skipped: 0
currency: 97/866 correct, accuracy: 11.20, avg cos: 0.59, skipped: 0
family: 283/506 correct, accuracy: 55.93, avg cos: 0.73, skipped: 0
gram2-opposite: 190/812 correct, accuracy: 23.40, avg cos: 0.65, skipped: 0
gram3-comparative: 878/1332 correct, accuracy: 65.92, avg cos: 0.66, skipped: 0
gram4-superlative: 262/1122 correct, accuracy: 23.35, avg cos: 0.56, skipped: 0
gram5-present-participle: 195/1056 correct, accuracy: 18.47, avg cos: 0.64, skipped: 0
gram6-nationality-adjective: 773/1599 correct, accuracy: 48.34, avg cos: 0.73, skipped: 0
gram7-past-tense: 809/1560 correct, accuracy: 51.86, avg cos: 0.67, skipped: 0
gram8-plural: 745/1332 correct, accuracy: 55.93, avg cos: 0.65, skipped: 0
gram9-plural-verbs: 694/870 correct, accuracy: 79.77, avg cos: 0.75, skipped: 0
Total: 10242/18552 correct, accuracy: 55.21, avg cos: 0.68
Skipped: 0/18552 (0%)

0.6.1 should be done in half an hour or so.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

0.6.1:

❯ ~/.cargo/bin/finalfusion compute-accuracy skipgram-0.6.1.fifu de_trans_Google_analogies.txt                                                                                                                                                
capital-common-countries: 454/506 correct, accuracy: 89.72, avg cos: 0.76, skipped: 0
capital-world: 3864/4524 correct, accuracy: 85.41, avg cos: 0.73, skipped: 0
city-in-state: 879/2467 correct, accuracy: 35.63, avg cos: 0.68, skipped: 0
currency: 90/866 correct, accuracy: 10.39, avg cos: 0.59, skipped: 0
family: 298/506 correct, accuracy: 58.89, avg cos: 0.73, skipped: 0
gram2-opposite: 177/812 correct, accuracy: 21.80, avg cos: 0.64, skipped: 0
gram3-comparative: 901/1332 correct, accuracy: 67.64, avg cos: 0.66, skipped: 0
gram4-superlative: 282/1122 correct, accuracy: 25.13, avg cos: 0.56, skipped: 0
gram5-present-participle: 202/1056 correct, accuracy: 19.13, avg cos: 0.64, skipped: 0
gram6-nationality-adjective: 766/1599 correct, accuracy: 47.90, avg cos: 0.73, skipped: 0
gram7-past-tense: 822/1560 correct, accuracy: 52.69, avg cos: 0.67, skipped: 0
gram8-plural: 696/1332 correct, accuracy: 52.25, avg cos: 0.65, skipped: 0
gram9-plural-verbs: 694/870 correct, accuracy: 79.77, avg cos: 0.75, skipped: 0
Total: 10125/18552 correct, accuracy: 54.58, avg cos: 0.68
Skipped: 0/18552 (0%)

Still a bit worse, but not the 44-45% scores.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

~55% sounds justifiable. Are you going to train a model on master on hopper? 0.5 on tesniere will be done within the next hour, the one on turing is going to take another 12 hours.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

~55% sounds justifiable. Are you going to train a model on master on hopper? 0.5 on tesniere will be done within the next hour, the one on turing is going to take another 12 hours.

I'll do master on hopper.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

This is becoming really, really strange. Results from tesniere are the worst so far?

Results are below 35%

$ finalfusion compute-accuracy twe-bucket-vocab-test_0.5.fifu ../finalfusion-utils/de_trans_Google_analogies.txt --threads 40
██████████████████████████████ 100%  ETA: 00:00:00
capital-common-countries: 244/506 correct, accuracy: 48.22, avg cos: 0.76, skipped: 0
capital-world: 2433/4524 correct, accuracy: 53.78, avg cos: 0.73, skipped: 0
city-in-state: 673/2467 correct, accuracy: 27.28, avg cos: 0.68, skipped: 0
currency: 46/866 correct, accuracy: 5.31, avg cos: 0.65, skipped: 0
family: 260/506 correct, accuracy: 51.38, avg cos: 0.73, skipped: 0
gram2-opposite: 194/812 correct, accuracy: 23.89, avg cos: 0.65, skipped: 0
gram3-comparative: 889/1332 correct, accuracy: 66.74, avg cos: 0.66, skipped: 0
gram4-superlative: 309/1122 correct, accuracy: 27.54, avg cos: 0.57, skipped: 0
gram5-present-participle: 82/1056 correct, accuracy: 7.77, avg cos: 0.72, skipped: 0
gram6-nationality-adjective: 321/1599 correct, accuracy: 20.08, avg cos: 0.72, skipped: 0
gram7-past-tense: 265/1560 correct, accuracy: 16.99, avg cos: 0.67, skipped: 0
gram8-plural: 424/1332 correct, accuracy: 31.83, avg cos: 0.69, skipped: 0
gram9-plural-verbs: 281/870 correct, accuracy: 32.30, avg cos: 0.72, skipped: 0
Total: 6421/18552 correct, accuracy: 34.61, avg cos: 0.69
Skipped: 0/18552 (0%)

Metadata looks correct

$ finalfusion metadata twe-bucket-vocab-test_0.5.fifu 
[common_config]
dims = 300
epochs = 15
loss = 'LogisticNegativeSampling'
lr = 0.05000000074505806
negative_samples = 5
zipf_exponent = 0.5

[model_config]
context_size = 10
model = 'SkipGram'
type = 'SkipGramLike'

[vocab_config]
buckets_exp = 21
discard_threshold = 0.00009999999747378752
max_n = 6
min_count = 30
min_n = 3
type = 'SubwordVocab'

Most recent cargo install was version ^0.5

tac ~/.zsh_history | grep "cargo install" | head
: 1570782512:0;tac ~/.zsh_history | grep "cargo install" | head -2
: 1570706483:0;cargo install finalfrontier-utils --force --version ^0.5

Currently installed ff-train-skipgram has 0.5's defaults (100 dims, etc.)

$ ff-train-skipgram --help
ff-train-skipgram 

USAGE:
    ff-train-skipgram <CORPUS> <OUTPUT>

OPTIONS:
        --buckets <EXP>             Number of buckets: 2^EXP (default: 21)
        --context <CONTEXT_SIZE>    Context size (default: 5)
        --dims <DIMENSIONS>         Embedding dimensionality (default: 100)
        --discard <THRESHOLD>       Discard threshold (default: 1e-4)
        --epochs <N>                Number of epochs (default: 5)
    -h, --help                      Prints help information
        --lr <LEARNING_RATE>        Initial learning rate (default: 0.05)
        --maxn <LEN>                Maximum ngram length (default: 6)
        --mincount <FREQ>           Minimum token frequency (default: 5)
        --minn <LEN>                Minimum ngram length (default: 3)
        --model <MODEL>             Model: skipgram or structgram
        --ns <FREQ>                 Negative samples per word (default: 5)
        --threads <N>               Number of threads (default: logical_cpus / 2)
    -V, --version                   Prints version information
        --zipf <EXP>                Exponent Zipf distribution for negative sampling (default: 0.5)

ARGS:
    <CORPUS>    Tokenized corpus
    <OUTPUT>    Embeddings output

Training command I used (ff-train-skipgram, not a path to target/release/ff-train-skipgram).

ff-train-skipgram taz-wiki-ep-sentences.txt twe-bucket-vocab-test_0.5.fifu --epochs 15 --context 10 --dims 300 --mincount 30 --model skipgram --ns 5 --threads 40

Training data is also not a disguised conll file:

$ head taz-wiki-ep-sentences.txt 
Die einzige Weltmacht : Amerikas Strategie der Vorherrschaft

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Training command I used (ff-train-skipgram, not a path to target/release/ff-train-skipgram).

ff-train-skipgram taz-wiki-ep-sentences.txt twe-bucket-vocab-test_0.5.fifu --epochs 15 --context 10 --dims 300 --mincount 30 --model skipgram --ns 5 --threads 40

Could you try to retrain with exactly the same setup but half the threads (20)? How many threads did you use with the other training runs?

I just want to exclude the possibility that the matrix is missing updates due to the lack of synchronization in Hogwild training (where one update overwrites another). It is known that the number of threads can affect embedding accuracy, but I wouldn't expect such decreases. But it'd be good to exclude this possibility.

My 54-55 results from above were trained with 30 threads. I used to train with 20 threads.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

On turing all models were trained with 30 threads, so that would match your number...

$ ff-train-skipgram taz-wiki-ep-sentences.txt twe-bucket-vocab-test_0.5_20threads.fifu --epochs 15 --context 10 --dims 300 --mincount 30 --model skipgram --ns 5 --threads 20

ETA ~19 hours. (I think it'd be nice to get elapsed time once training is done btw. Or add timestamps to the metadata...)

rustc on tesniere:

rustc --version            
rustc 1.37.0 (eae3437df 2019-08-13)

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Also see:

http://www.ece.ubc.ca/~matei/papers/ipdps16.pdf

Particularly Figure 2 is interesting. They show the speedup in convergence compared to serial optimization. They measure in this way because parallelism speeds up epochs, but you may need more epochs due to missed updates. It shows that, for them, improvement in convergence flattens around 28 threads for a regular corpus. For a dense corpus, Hogwild barely works at all and regresses to serial optimization with more than 16 threads.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

On turing all models were trained with 30 threads, so that would match your number...

But I think there are other factors. First of all, if the memory bandwidth is smaller, then updates will probably propagate to memory/caches less quickly. I guess that there may be other effects as well, e.g. if you are running multiple training processes at the same time and if there are other processes running on the machine. This is all possibly made worse by hyper threading.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

If it's really down to missed updates, then we should see differences in losses, right?

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Are spectre mitigations turned on for hopper?

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

If it's really down to missed updates, then we should see differences in losses, right?

At some point yes, but we are obscuring losses a bit by averaging over all time. The initial updates will always be much larger and the effect is probably less observable. The long tail we cannot observe as closely due to averaging. We did it this way because we imitated word2vec, but we should really show some moving average.
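
For instance, an exponentially decaying average would keep late-training behaviour visible instead of averaging it away (a minimal sketch, not a concrete proposal for the progress output):

    // Recent updates dominate the reported value, so a long tail of missed
    // Hogwild updates would still show up in the displayed loss.
    struct MovingLoss {
        value: f64,
        alpha: f64, // smoothing factor, e.g. 1e-4
    }

    impl MovingLoss {
        fn update(&mut self, loss: f64) {
            self.value += self.alpha * (loss - self.value);
        }
    }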

BTW, I'm not saying that this is the culprit, but it's one of the few things I could think of that would explain why a single version produces both good and bad results between runs.

It would also explain why the older embeddings do so much better. I think I used to train with 20 threads at most. Sometimes even fewer (e.g. on shaw).

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Are spectre mitigations turned on for hopper?

Yes, didn't disable them yet.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

I'll restart master training on hopper with 10 rather than 30 threads. Would be good if we didn't train other embeddings on the same machine.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Losses for the 30-thread models on turing are super close to each other at 0.00195 and 0.00203.
Tesniere on 40 threads was a fair bit higher at 0.00282.

What did you get on hopper?

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

0.4.0: 0.00178
0.6.1: 0.00212

But again, let's not over-analyze an average loss, which is also updated with Hogwild ;). (Both were with 30 threads.)

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Sure, just thought it'd be good to verify we're not orders of magnitude apart. Btw, 0.4.0 probably still had the incorrect loss calculation. Never mind: 9bb40df

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

It would not be very satisfying, but it at least would be a good explanation. I agree on adding n_threads to the metadata. Once we have established what works, we may also want to print a warning when someone uses high thread counts.

If this is indeed the culprit, we should also investigate switching to HogBatch. From a quick glance over the paper (I still need to read it in detail), this seems to be simple enough to implement: it updates the 'global' matrix every n instances rather than per instance. We would need a performant sparse matrix implementation, but there is a lot of prior work there (and only the rows need to be sparse; within a row we could use dense vectors).
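
Roughly, the per-thread side of that could look like the sketch below; this only illustrates the batched-update idea from the paper, and all names and the batching policy are assumptions:

    use std::collections::HashMap;

    // Accumulate per-row gradient sums locally and flush them to the shared
    // matrix every `batch_size` instances instead of writing through per instance.
    struct RowBatch {
        updates: HashMap<usize, Vec<f32>>, // row index -> dense gradient sum for that row
        seen: usize,
        batch_size: usize,
    }

    impl RowBatch {
        fn add(&mut self, row: usize, grad: &[f32]) {
            let acc = self
                .updates
                .entry(row)
                .or_insert_with(|| vec![0.0; grad.len()]);
            for (a, g) in acc.iter_mut().zip(grad) {
                *a += *g;
            }
            self.seen += 1;
        }

        fn should_flush(&self) -> bool {
            self.seen >= self.batch_size
        }
    }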

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

I haven't really looked at more than the figures in the paper; if I find time, I'll also go through it. There was also an issue that suggested implementing some different optimization techniques: #42

Another small datapoint, not related to the bug but to turing: with identical hyperparameters, turing shows an ETA of +24h compared to tesniere.

Are your results from hopper in yet?

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

There's also additional support for the threads theory: the skipgram model from a couple of months ago that wasn't broken was also trained with 20 threads. Found that by accident in my bash history.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Are your results from hopper in yet?

Nope, but I suspended the process overnight, because I had to run something else (prepare the Dutch treebank), and I didn't want that to influence the result in any way. I resumed training this morning, but the ETA is completely unreliable now. IIRC it would take a bit more than a day with 10 threads and it's now at 26%.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Another small datapoint, not related to the bug but to turing: with identical hyperparameters, turing shows an ETA of +24h compared to tesniere.

I continue to be surprised about this, since they have identical CPUs. Of course, there is the VM, but that should only be a few percent difference. I have reset the CPU affinities of the VMs. I also noticed that the powersave CPU governor was in use again since the last reboot; I changed it to performance.

Edit: tesniere also uses powersave.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

The difference is still almost 20 hours; current ETAs are 17h on tesniere and 36h on turing.

Turing is running an older Ubuntu release (16.04) while tesniere is on 18.04, although that shouldn't explain such a dramatic difference.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

You could perf record both processes for a bit (perf record -p <PID>) and read the perf report to see if the hot loops are spending most time in the same machine code.

The Ubuntu versions should indeed not make a large difference. On both machines, the same compiler is used. Should produce roughly the same machine code. And our hot loops don't do syscalls anyway.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Results from tesniere on master are in: 56.51%. Looks like everything was fine all along...

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

master with 10 threads should be done in half an hour or so ;).

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Turing won't be done until the early morning. Anyway, I don't expect different results from turing; we've seen on 0.5, 0.6, and master that the number of threads can really do this. Which probably means we should set upper bounds on the default number of threads: something like n_cpus / 2, capped at 18 or 20.
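
Something along these lines, with 20 as the cap floated above (illustrative only, not a decided policy):

    // Default to half the logical CPUs, but never more than 20 threads,
    // and always at least one.
    fn default_n_threads(logical_cpus: usize) -> usize {
        (logical_cpus / 2).max(1).min(20)
    }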

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

10 threads on hopper on master:

capital-common-countries: 471/506 correct, accuracy: 93.08, avg cos: 0.77, skipped: 0
capital-world: 4029/4524 correct, accuracy: 89.06, avg cos: 0.75, skipped: 0
city-in-state: 825/2467 correct, accuracy: 33.44, avg cos: 0.69, skipped: 0
currency: 94/866 correct, accuracy: 10.85, avg cos: 0.60, skipped: 0
family: 300/506 correct, accuracy: 59.29, avg cos: 0.75, skipped: 0
gram2-opposite: 192/812 correct, accuracy: 23.65, avg cos: 0.65, skipped: 0
gram3-comparative: 888/1332 correct, accuracy: 66.67, avg cos: 0.67, skipped: 0
gram4-superlative: 278/1122 correct, accuracy: 24.78, avg cos: 0.57, skipped: 0
gram5-present-participle: 194/1056 correct, accuracy: 18.37, avg cos: 0.65, skipped: 0
gram6-nationality-adjective: 785/1599 correct, accuracy: 49.09, avg cos: 0.74, skipped: 0
gram7-past-tense: 822/1560 correct, accuracy: 52.69, avg cos: 0.68, skipped: 0
gram8-plural: 793/1332 correct, accuracy: 59.53, avg cos: 0.66, skipped: 0
gram9-plural-verbs: 718/870 correct, accuracy: 82.53, avg cos: 0.76, skipped: 0
Total: 10389/18552 correct, accuracy: 56.00, avg cos: 0.70
Skipped: 0/18552 (0%)

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Which probably means we should set upper bounds on the default number of threads: something like n_cpus / 2, capped at 18 or 20.

Yes. And we have to put a warning in the README. Most people will think 'I have 40 cores, let's use all of them'.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Ok, seems like this issue could almost be closed. Maybe we should also run master with 20 threads for structured skipgram, dependency embeddings, and directional skipgram, to ensure that those are fine too. I am not sure if we want to do bucketed/explicit ngrams for each of these? I guess that we would detect regressions if we do something like:

skipgram - bucket
structured skipgram - bucket
directional skipgram - bucket
dependency - bucket
skipgram - explicit n-grams

We should probably do all of them with the fixed compute-accuracy and then replace https://github.com/finalfusion/finalfrontier/blob/master/ACCURACY.md with the results (that file is outdated anyway), so that we have everything readily available for testing 0.8 and beyond.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Well, there's still #72 to fix; the finalfusion dependency needs to be updated to a release.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Well, there's still #72 to fix; the finalfusion dependency needs to be updated to a release.

That's true. But we currently don't add EOS markers anyway, right? So this would only affect a hypothetical </s> in the data that makes the cut-off.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

Never mind, we do.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

They appear among the most frequent tokens in the vocab, usually the top index.

from finalfrontier.

danieldk avatar danieldk commented on June 23, 2024

I had a vague recollection that we once had a discussion about not using EOS.

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

That was in #60, but we had forgotten about it. I think we could patch this out, since we include punctuation in training, which essentially acts as EOS marker(s).

from finalfrontier.

sebpuetz avatar sebpuetz commented on June 23, 2024

Btw, following these changes, enabling training of actual fastText embeddings is only a From<SubwordVocab<FastTextIndexer>> impl and the addition of a "fasttext" choice to the subwords option away.

from finalfrontier.
