chunyuanli / optimus



Optimus: the first large-scale pre-trained VAE language model

Python 96.85% Shell 3.15%

language-model pretrained-models vae vae-pytorch


optimus's Issues

additional sampling scheme

Currently, text is generated from a latent point by sampling from the distributions the generator produces over the token vocabulary. But since z is a multivariate Gaussian, we can also sample from it, which would give more diversity and nuance in the generated samples.
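A minimal sketch of the sampling scheme suggested above, assuming the encoder yields a per-dimension mean and log-variance for z. The function name `sample_latent`, the `scale` knob, and the example numbers are hypothetical and not taken from the Optimus code.

```python
import math
import random

def sample_latent(mean, logvar, rng, scale=1.0):
    """Draw z = mean + scale * sigma * eps, with eps ~ N(0, I).

    sigma is recovered from the log-variance as exp(0.5 * logvar).
    scale=0.0 falls back to the deterministic mean.
    """
    return [
        m + scale * math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, logvar)
    ]

rng = random.Random(0)
mean = [0.2, -1.0, 0.5]        # hypothetical posterior mean
logvar = [-2.0, -2.0, -2.0]    # hypothetical posterior log-variance
samples = [sample_latent(mean, logvar, rng) for _ in range(5)]
```

Each sampled z could then be decoded separately, so the diversity comes from the latent space in addition to the token-level sampling.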

GPT2ForLatentConnector

Hello, great work there !

Everything is fine with the Docker environment, and the code works amazingly. However, I would like to run the code in my own conda environment, and it seems that the version of pytorch-transformers (1.2.0) in your requirements file doesn't include GPT2ForLatentConnector or BertForLatentConnector. I couldn't find them in other versions either.

Could you tell me the actual version used in your code, or share the source code for these classes?

Thanks a lot !
Romain

Pre-trained model download is not available.

Optimus/doc/optimus_finetune_language_models.md

beta=0, latent size = 32 https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip

Pre-trained model download is not available.
Need Permissions?

DailyDialogue dataset

Where can I get the preprocessed dailydialog dataset used for spacefusion pretraining code? Any suggestion on how to preprocess the original dailydialog would be appreciated! Thanks

The loss_encoder and loss_lsc in cara.py cancel each other

I'm not sure what the purpose of these two losses is if they cancel each other. I believe this is a mistake. Can you please explain the goal of these two losses (cara.py, lines 77-85)?

Missing requirements file

Thank you for sharing this great work!

I have been trying some of your checkpoints but they don't even seem to perform reconstruction right (even at beta=0).
Then I tried interpolating between a sentence and the same sentence with 10 steps (which should give the same sentence 10 times), but it yields different sentences.

I suspect this is because of a difference in the version of the python libraries we use (since I didn't modify anything in your scripts).
Could you please provide a requirements.txt file with the exact version of the libraries you used ?
Thanks a lot !

How about the reconstruction BLEU of AE and VAE?

Dear Researcher:

I trained an AE on the Flickr30k dataset and found that the reconstruction BLEU score is about 35 on the validation set.
I think the reconstruction ability of an AE is better than that of a VAE.
Did you test the reconstruction ability of both models?
Do you have any reconstruction results or examples?

Thanks,
Chawdoe

Format of input files split by NLTK used as input for preprocessing: "wikipedia.segmented.nltk.split.seq64.0.json"

Please, delete this.

issue about reproducing results on SNLI dataset

Hi! I'm trying to reproduce the reported result on SNLI. I followed the doc 'optimus_for_snli.md' and successfully downloaded the checkpoints, but when I run your examples, the sample_sequence_conditional function in run_latent_generation.py receives 'input_ids' and 'past' with mismatched shapes. I can work around this with past = torch.mean(past, dim=0).unsqueeze(0), but is that right? Thanks for reading.
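To make the effect of that workaround concrete, here is a small sketch with plain Python lists standing in for tensors. It only shows what averaging over dimension 0 does to a batch of latent codes compared with picking one row; which of the two the authors intended is not stated in the thread. The helper names and values are hypothetical.

```python
def mean_over_batch(latents):
    """Equivalent of torch.mean(past, dim=0).unsqueeze(0) for a list of rows."""
    n = len(latents)
    return [[sum(column) / n for column in zip(*latents)]]

def select_row(latents, index):
    """Keep a single latent code, with a leading batch dimension of 1."""
    return [list(latents[index])]

past = [[1.0, 0.0], [3.0, 2.0]]      # hypothetical batch of two latent codes
averaged = mean_over_batch(past)      # blends both codes into one
single = select_row(past, 0)          # keeps only the first code
```

If the rows of `past` encode different sentences, the averaged code is a blend of them rather than the code of any single input, which may change what gets generated.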

How to run the Label-Conditional Text Generation experiment on YELP dataset

Dears

Thanks for sharing your amazing work!

I am trying to run the Label-Conditional Text Generation experiment, but unfortunately I couldn't find the entry point for training: no code calls the class "Ctrl_Gen".
I would appreciate it if you could point me to where I can reproduce your results for the Label-Conditional Text Generation experiment.

Thanks in advance!

running your docker on an arm computer

Hello,

Following up on the previous issue: I cannot afford to run your Docker image on a cloud instance and I don't have a GPU. Do you have any suggestions? I could try to reimplement this.

About Pre-training on the Wikipedia dataset

Hi Chunyuan,

Thank you for sharing the source code.

I am wondering why we need to fine-tune Optimus on the Wikipedia dataset first and then fine-tune it on the other four datasets (i.e., ptb, snli, etc.) for performance evaluation.

Pre-training Optimus directly on the four datasets is also possible, though the results will definitely be worse than the reported ones. The effectiveness of Optimus should be independent of whether it was pre-trained on the Wikipedia dataset, right? Otherwise, if we want to develop a new model, we have to fine-tune it on the Wikipedia dataset first.

Looking forward to hearing from you.

Best,

Dong

Seems like checkpoints for {beta=0, beta=0.5} latent size=32 are the same checkpoints

For the following two checkpoints listed in optimus_finetune_language_models.md:

beta=0, latent size = 32
https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip

beta=0.5, latent size = 32
https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.5_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip

Their sums of all parameters are the same. So I think they are the same checkpoints.
Could anyone please double-check this?

Btw, thanks for publishing your work on github.
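Equal parameter sums are suggestive but not conclusive, since different weights can add up to the same total. One way to double-check is to compare cryptographic hashes of the files, sketched below with the standard library only. The file names and contents are made up for the demo. A zip archive can differ in metadata even when the weights inside match, so hashing the extracted weight files is likely the more meaningful comparison.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def same_file(path_a, path_b):
    """True when both files have byte-identical content."""
    return file_sha256(path_a) == file_sha256(path_b)

# Demo on throwaway files (hypothetical stand-ins for two checkpoints).
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    for name, payload in [("a.bin", b"weights-v1"),
                          ("b.bin", b"weights-v1"),
                          ("c.bin", b"weights-v2")]:
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], "wb") as handle:
            handle.write(payload)
    identical = same_file(paths["a.bin"], paths["b.bin"])
    different = same_file(paths["a.bin"], paths["c.bin"])
```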

interpolation scheme

congratulations! indeed, controlled text generation works!

quick experiments are very promising

experiment 1: the purpose is to generate sentences where the boy's age increases continuously and is spelled out in words

src/target: 1 - > 100
seed sentence: the boy is twelve years old.

0: 0.000000 the boy is twelve years old.
13: 0.206349 the boy is twenty years old.
24: 0.380952 the boy is forty years old.
59: 0.936508 the boy is fifty years old.

(showing only unique samples)

experiment 2: controlling both increasing age and gender

src/target 1: 1 - > 100
src/target 2: man - > woman
seed sentence: the boy is twelve years old.

0: 0.000000 the boy is twelve years old.
40: 0.317460 the girl is twelve years old.
49: 0.388889 the girl is twenty years old.

(showing only unique samples)

experiment 3: interpolation

0: 0.00 the sisters are hugging while holding up goodbye to get snacks before going home.
1: 0.10 the sisters are hugging while holding up snacks next to goodbye for their dad.
2: 0.20 the sisters are hugging while holding up goodbye to shopping bags in a .
3: 0.30 the sisters are hugging while holding up a sign in front of york airlines.
4: 0.40 the girl wearing beanies stands next to a truck while celebrating together.
5: 0.50 a girl in blue shirts stands posing next to a refrigerator while holding up important .
6: 0.60 a boy in a blue shirt standing amidst all construction logos is hugging while laying down a
7: 0.70 a man in a blue shirt standing next to packaging constructions with their thumbs in a row.
8: 0.80 a man in a blue outfit standing in front of a building styled like garage vaults with
9: 0.90 a man in a blue shirt standing in front of a construction base with styled decorations
10: 1.00 a man in a blue shirt standing in front of a design center with structure painted `` funhouse ''

i like that it goes from "sisters" to "man" through "girl" and "boy"; this is also smooth in some sense :)

just amazing !!!
not every run gives good results but it is definitely a step forward! just a question of time to get it working right.

and here is the issue/question

I noticed you use a linear interpolation scheme, but as Ferenc Huszár pointed out
here https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/
it makes sense to move the interpolation trajectory along the surface of a sphere.
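A minimal sketch of the spherical alternative, using the standard slerp formula on plain Python lists. It assumes both latent vectors are non-zero. The function names and the two example vectors are illustrative and not part of the Optimus code.

```python
import math

def lerp(z0, z1, t):
    """Linear interpolation between two vectors."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]

def slerp(z0, z1, t, eps=1e-8):
    """Spherical linear interpolation between two non-zero vectors."""
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    cos_omega = sum(a * b for a, b in zip(z0, z1)) / (norm0 * norm1)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Vectors are (anti-)parallel; fall back to the linear path.
        return lerp(z0, z1, t)
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]

z0, z1 = [1.0, 0.0], [0.0, 1.0]   # hypothetical unit-norm latent codes
mid_linear = lerp(z0, z1, 0.5)    # norm shrinks toward the origin
mid_sphere = slerp(z0, z1, 0.5)   # norm stays on the unit circle
```

For these two orthogonal unit vectors, the linear midpoint has norm sqrt(0.5) while the spherical midpoint keeps norm 1, which is the "soap bubble" argument in miniature.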

demo website

Hi, great work using VAE.
I can't open your demo website. Could you update the demo website link?

Hyper-parameters to reproduce language modelling results

Thank you for this great repo !
I was trying to use it for language modeling but I couldn't find, amongst the checkpoints you provide, any model that performed well in terms of perplexity. I measure perplexity on your SNLI test set with code/examples/big_ae/run_lm_vae_training.py by setting the --do_eval option (and without the --do_train option). This yielded high KL (~2000) for all the checkpoints you provide.

I tried finetuning a wikipedia checkpoint with your script on SNLI but I only get the following results:

with high beta (1.0) and low r0 (0.1): perplexity on the order of 30, with KL around 10 and mutual info ~0.2
with low beta (0.5) and high r0 (0.5): perplexity on the order of 1000, with KL around 75 and mutual info ~1.5

I can't seem to get it to have low perplexity with high mutual information. Could you provide a language modeling checkpoint or just specify the hyper-parameters and wikipedia pretrained model used to produce the results in the paper ?

Thank you very much for your help !
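One possible reading of the trade-off above, sketched under a loud assumption: that the reported perplexity is derived from the negative ELBO, i.e. exp((reconstruction NLL + KL) / number of tokens). The thread does not confirm that this is how the script computes it, and the function name and numbers below are hypothetical.

```python
import math

def elbo_perplexity(reconstruction_nll, kl, num_tokens):
    """Perplexity upper bound from a negative ELBO, in nats.

    Assumes the ELBO is reconstruction NLL plus KL, both summed
    over the same set of sentences that contain num_tokens tokens.
    """
    return math.exp((reconstruction_nll + kl) / num_tokens)

# Hypothetical single sentence of 10 tokens with the same reconstruction NLL.
low_kl = elbo_perplexity(reconstruction_nll=25.0, kl=10.0, num_tokens=10)
high_kl = elbo_perplexity(reconstruction_nll=25.0, kl=75.0, num_tokens=10)
```

Under this assumption, every extra nat of KL is paid for in the exponent, so a model that stores a lot of information in z needs a correspondingly lower reconstruction NLL to keep perplexity down.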

Chinese Pretrained Model

Hi!
Thanks for releasing the code and checkpoints. Have you released a model pre-trained on a Chinese dataset?
Looking forward to your reply!

Question about mutual information

Hello, thank you very much for making the code available. I'm confused about the mutual information math, more specifically about the line

     E_{q(z|x)}log(q(z|x)) = -0.5*nz*log(2*\pi) - 0.5*(1+logvar).sum(-1)
    neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (1 + logvar).sum(-1)).sum().item()

When I derive it, it gives me
neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (logvar).sum(-1)).sum().item()

So I think I must have made a mistake somewhere? Thank you very much
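A possible resolution, offered as a sketch: assuming q(z|x) is a diagonal Gaussian with mean \mu and variances \sigma_i^2, and that logvar holds \log\sigma_i^2, the log-density contains a quadratic term whose expectation under q is 1 per dimension. That term is where the extra "1 +" would come from.

```latex
\begin{aligned}
\log q(z \mid x)
  &= -\tfrac{n_z}{2}\log(2\pi)
     - \tfrac{1}{2}\sum_{i=1}^{n_z}\log\sigma_i^2
     - \tfrac{1}{2}\sum_{i=1}^{n_z}\frac{(z_i-\mu_i)^2}{\sigma_i^2}, \\
\mathbb{E}_{q(z \mid x)}\!\left[\frac{(z_i-\mu_i)^2}{\sigma_i^2}\right]
  &= 1, \\
\mathbb{E}_{q(z \mid x)}\bigl[\log q(z \mid x)\bigr]
  &= -\tfrac{n_z}{2}\log(2\pi)
     - \tfrac{1}{2}\sum_{i=1}^{n_z}\bigl(1 + \log\sigma_i^2\bigr).
\end{aligned}
```

Dropping the quadratic term before taking the expectation gives the version without the 1, which matches the expression derived in the question.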

Suggestion for some added functions

Your program works very well! I rewrote the interpolation function to make it easier for me to use in different ways. Perhaps others would also find this useful.

import torch

def latent_code_from_text(text, encoder_tokenizer, model_vae, args):
    # Wrap the token ids with BERT's [CLS] (101) and [SEP] (102) ids.
    tokenized1 = encoder_tokenizer.encode(text)
    tokenized1 = [101] + tokenized1 + [102]
    # Build a LongTensor of shape (1, seq_len) directly on the target device.
    x0 = torch.tensor([tokenized1], dtype=torch.long, device=args.device)
    with torch.no_grad():
        pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1]
        mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1)
        # Use the posterior mean as a deterministic latent code.
        latent_z = mean.squeeze(1)
    coded_length = len(tokenized1)
    return latent_z, coded_length

def text_from_latent_code(latent_z, model_vae, sentence_length, args, decoder_tokenizer):
    past = latent_z
    context_tokens = decoder_tokenizer.encode('<BOS>')
    length = torch.Tensor([[sentence_length]])
    # sample_sequence_conditional is the function from run_latent_generation.py.
    out = sample_sequence_conditional(
        model=model_vae.decoder,
        context=context_tokens,
        past=past,
        length=length,  # Chunyuan: Fix length; or use <EOS> to complete a sentence
        temperature=args.temperature,
        top_k=args.top_k,
        top_p=args.top_p,
        device=args.device,
        decoder_tokenizer=decoder_tokenizer
    )
    text_x1 = decoder_tokenizer.decode(out[0, :].tolist(), clean_up_tokenization_spaces=True)
    # Drop the first and last whitespace-separated tokens of the decoded text.
    text_x1 = text_x1.split()[1:-1]
    text_x1 = ' '.join(text_x1)
    return text_x1

...

# and then in the main function         
latent_z1, coded_length1 = latent_code_from_text("a brown dog likes to eat his food very slowly .", tokenizer_encoder, model_vae, args)
latent_z2, coded_length2 = latent_code_from_text("a yellow cat likes to chase a long string .", tokenizer_encoder, model_vae, args)
    
result = text_from_latent_code((latent_z1 + latent_z2)/2, model_vae,coded_length1,args, tokenizer_decoder)
print(result)

Demo website is dead

It seems that your demo website cannot be accessed. Can you fix it?

Question: why this choice of BERT and GPT2?

Hi,
Thank you for this work and for releasing the code as well ! 🎉
I was wondering if there was any reason you chose to use BERT as an encoder and GPT2 as a decoder, instead of other pretrained language models? In particular, why not consider models that already have an encoder/decoder architecture, such as T5 or BART?
Thanks

The default value of the argument "--decoder_model_name_or_path" is bert-base-cased

In the scripts under "code/examples/big_ae", the default model name of the decoder is bert-base-cased. I think it should be gpt2.
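A sketch of what the suggested default could look like with argparse. Only the flag name `--decoder_model_name_or_path` comes from the issue; the helper `build_parser` and the help text are hypothetical, and the real scripts define many more arguments.

```python
import argparse

def build_parser():
    """Parser with a GPT-2 default for the decoder (illustrative only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--decoder_model_name_or_path",
        default="gpt2",
        type=str,
        help="Decoder checkpoint name or path.",
    )
    return parser

# With no flags given, the decoder falls back to the default.
args = build_parser().parse_args([])
```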

Question Label-Conditional Text Generation

Dear authors,

First, thank you for sharing the code!

I am interested in the Label-Conditional Text Generation experiment and have some questions about the losses.

loss_lsd corresponds to equation 16 of the paper, and I guess loss_lsg complements equation 16. Is there a reason why this is not mentioned in Eq. 16?
I have the feeling that loss_encode cancels out loss_lsc when they are added together in loss. Is that correct?
What would need to change to handle, for example, 3 classes? Only this?

Thanks a lot for your answers!

How can I load your docker in colab?

I'd like to run this on Colab. Any suggestions? Awesome work, super excited to experiment!

One question about the decoder of vae

File: https://github.com/ChunyuanLI/Optimus/blob/master/code/examples/big_ae/modules/vae.py

code in line 188, 133, 143: outputs = self.decoder(input_ids=labels, past=latent_z, labels=labels, label_ignore=self.pad_token_id)
This line passes labels as the decoder's input_ids. I wonder whether this is an error.
Should it be input_ids=inputs? Since labels contains -1 values, it causes an error at line 460 (inputs_embeds = self.wte(input_ids)) of modeling_gpt2.py in pytorch_transformers.

Thanks.

when will release the pre-training codes and pre-trained models ?

how about using gpt2 as encoder and decoder?

Hi! Great work. But one open question :D

I am curious about the performance of using gpt2 as both encoder and decoder? I am not sure if the discrepancy from different tokenization can result in performance degradation.

Thanks

Dataset access denied

Can't access datasets at "https://github.com/ChunyuanLI/Optimus/blob/master/data/download_datasets.md"

Curious about the Computing Resources for Pre-training Optimus

The paper says:

First, our pre-trained language VAE is still under-trained due to limited compute resource, as the training reconstruction loss can still decrease. One may further train the models with higher latent dimension and longer time to fully release the power of pre-trained latent spaces.

So how long did it take to pre-train Optimus, in days or weeks, with its encoder and decoder initialized with the weights of BERT and GPT-2 respectively?

Number of pretraining epochs

The settings in 'train_vae_wikipedia.yaml' and 'train_vae_wikipedia_distributed.yaml' seem to differ a lot (20 in the former, 1 in the latter).
How many pretraining epochs did you run, and which script should I refer to?
