Optimus: the first large-scale pre-trained VAE language model
Currently, text is generated from a latent point by sampling from the distributions that the generator produces over the token vocabulary. But since z is a multivariate Gaussian, we can also sample from it, which would give more diversity and nuance in the generated samples.
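A minimal sketch of what sampling z could look like, assuming the encoder yields a per-dimension mean and log-variance (as in the usual diagonal-Gaussian VAE). The function name `sample_latent` and the example values are hypothetical and are not taken from the Optimus code.

```python
import math
import random


def sample_latent(mean, logvar, rng=random):
    """Draw z ~ N(mean, diag(exp(logvar))) using z = mean + std * eps."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, logvar)
    ]


# Hypothetical 3-dimensional posterior, for illustration only.
rng = random.Random(0)
mean = [0.5, -1.0, 2.0]
logvar = [-2.0, -2.0, -2.0]

# Each draw is a different latent point near the mean; decoding each one
# would give a different variation of the same sentence.
samples = [sample_latent(mean, logvar, rng) for _ in range(5)]
for z in samples:
    print([round(value, 3) for value in z])
```

Using the mean alone gives a single deterministic latent point, while the draws above trade some fidelity for diversity, with the spread controlled by `logvar`.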
Hello, great work there!
Everything is fine with the docker env, and the code works amazingly. However, I would like to run the code in my own conda environment, and it seems that the version of pytorch-transformers (1.2.0) in your requirements file doesn't have GPT2ForLatentConnector and BertForLatentConnector. I couldn't find them in other versions either.
Could you tell me the actual version that is used in your code, or even share the source code for these classes?
Thanks a lot!
Romain
Optimus/doc/optimus_finetune_language_models.md
beta=0, latent size = 32 https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip
Pre-trained model download is not available.
Need Permissions?
Where can I get the preprocessed dailydialog dataset used for spacefusion pretraining code? Any suggestion on how to preprocess the original dailydialog would be appreciated! Thanks
Thank you for sharing this great work!
I have been trying some of your checkpoints, but they don't even seem to perform reconstruction correctly (even at beta=0).
Then I tried interpolating between a sentence and the same sentence with 10 steps (which should give the same sentence 10 times), but it yields different sentences.
I suspect this is because of a difference in the versions of the Python libraries we use (since I didn't modify anything in your scripts).
Could you please provide a requirements.txt file with the exact versions of the libraries you used?
Thanks a lot!
Dear Researcher:
I trained an AE on the Flickr30k dataset and found that the reconstruction BLEU score is about 35 on the validation set.
I think the reconstruction ability of an AE is better than that of a VAE.
Did you test the reconstruction ability of both models?
Do you have any results or examples of reconstruction?
Thanks,
Chawdoe
Please, delete this.
Hi! I'm trying to reproduce the reported result on SNLI. I followed the doc 'optimus_for_snli.md' and successfully downloaded the checkpoints, but when I run your examples, the sample_sequence_conditional function in run_latent_generation.py receives 'input_ids' and 'past' with mismatched shapes. I can work around this with past = torch.mean(past, dim=0).unsqueeze(0), but is that right? Thanks for reading.
Dear authors,
Thanks for sharing your amazing work!
I am trying to run the Label-Conditional Text Generation experiment, but unfortunately I couldn't find the entry point for training: no code calls the class "Ctrl_Gen".
I would appreciate it if you could point me to where I can reproduce your results for the Label-Conditional Text Generation experiment.
Thanks in advance!
Hello,
Following up on the previous issue: I cannot afford to run your docker image on a cloud instance, and I don't have a GPU. Do you have any suggestions? I could try to reimplement this.
Hi Chunyuan,
Thank you for sharing the source code.
I am wondering whether there is a reason why we need to fine-tune Optimus on the Wikipedia dataset first and then fine-tune it on the other four datasets (i.e., ptb, snli, etc.) for performance evaluation.
Pre-training Optimus directly on the four datasets is also possible, though the results will definitely be worse than the reported ones. The effectiveness of Optimus should be independent of whether it was pre-trained on the Wikipedia dataset, right? Otherwise, if we want to develop a new model, we have to fine-tune it on the Wikipedia dataset first.
Looking forward to hearing from you.
Best,
Dong
For the following two checkpoints listed in optimus_finetune_language_models.md:
beta=0, latent size = 32
https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.0_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip
beta=0.5, latent size = 32
https://chunylcus.blob.core.windows.net/machines/msrdl/optimus/output/pretrain/philly_rr3_vc4_g8_base_vae_wikipedia_pretraining_beta_schedule_beta0.5_d1.0_ro0.5_ra0.25_32_v2/checkpoint-508523.zip
The sum of all parameters is the same for both, so I think they are the same checkpoint.
Could anyone please double-check this?
Btw, thanks for publishing your work on GitHub.
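Equal parameter sums are strong evidence but not proof, so a byte-level comparison may be a more direct check. Below is a minimal sketch using only the standard library. The helper names `file_sha256` and `same_file` are hypothetical, and the paths in the usage comment are placeholders for wherever the weight files were extracted.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_file(path_a, path_b):
    """True if both files have identical contents (up to SHA-256 collisions)."""
    return file_sha256(path_a) == file_sha256(path_b)


# Usage (placeholder paths, adjust to your local layout):
# print(same_file("beta0.0/weights.bin", "beta0.5/weights.bin"))
```

It is probably safer to hash the extracted weight files rather than the .zip archives, since two archives can differ in metadata such as timestamps even when the files inside are identical.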
congratulations! indeed, controlled text generation works!
quick experiments are very promising
experiment 1: the purpose is to generate sentences where the age of the boy increases continuously and is spelled out in words
src/target: 1 -> 100
seed sentence: the boy is twelve years old.
0: 0.000000 the boy is twelve years old.
13: 0.206349 the boy is twenty years old.
24: 0.380952 the boy is forty years old.
59: 0.936508 the boy is fifty years old.
(showing only unique samples)
experiment 2: controlling both increasing age and gender
src/target 1: 1 -> 100
src/target 2: man -> woman
seed sentence: the boy is twelve years old.
0: 0.000000 the boy is twelve years old.
40: 0.317460 the girl is twelve years old.
49: 0.388889 the girl is twenty years old.
(showing only unique samples)
experiment 3: interpolation
0: 0.00 the sisters are hugging while holding up goodbye to get snacks before going home.
1: 0.10 the sisters are hugging while holding up snacks next to goodbye for their dad.
2: 0.20 the sisters are hugging while holding up goodbye to shopping bags in a .
3: 0.30 the sisters are hugging while holding up a sign in front of york airlines.
4: 0.40 the girl wearing beanies stands next to a truck while celebrating together.
5: 0.50 a girl in blue shirts stands posing next to a refrigerator while holding up important .
6: 0.60 a boy in a blue shirt standing amidst all construction logos is hugging while laying down a
7: 0.70 a man in a blue shirt standing next to packaging constructions with their thumbs in a row.
8: 0.80 a man in a blue outfit standing in front of a building styled like garage vaults with
9: 0.90 a man in a blue shirt standing in front of a construction base with styled decorations
10: 1.00 a man in a blue shirt standing in front of a design center with structure painted `` funhouse ''
i like that it goes from "sisters" to "man" through "girl" and "boy"; this is also smooth in some sense :)
just amazing !!!
not every run gives good results, but it is definitely a step forward! it is just a question of time to get it working right.
and here is the issue/question:
I noticed you use a linear interpolation scheme, but as Ferenc Huszár pointed out
here https://www.inference.vc/high-dimensional-gaussian-distributions-are-soap-bubble/
it makes sense to move the interpolation trajectory along the surface of a sphere.
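For reference, a minimal sketch of spherical linear interpolation (slerp) next to plain linear interpolation, written with plain Python lists so it has no dependencies. The function names are hypothetical and this is not how Optimus implements interpolation.

```python
import math


def lerp(z0, z1, t):
    """Linear interpolation: (1 - t) * z0 + t * z1."""
    return [(1.0 - t) * a + t * b for a, b in zip(z0, z1)]


def slerp(z0, z1, t, eps=1e-6):
    """Spherical interpolation along the great circle between z0 and z1."""
    norm0 = math.sqrt(sum(a * a for a in z0))
    norm1 = math.sqrt(sum(b * b for b in z1))
    if norm0 == 0.0 or norm1 == 0.0:
        return lerp(z0, z1, t)  # angle undefined for a zero vector
    dot = sum(a * b for a, b in zip(z0, z1))
    cos_omega = max(-1.0, min(1.0, dot / (norm0 * norm1)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        return lerp(z0, z1, t)  # (anti)parallel vectors: fall back to lerp
    w0 = math.sin((1.0 - t) * omega) / sin_omega
    w1 = math.sin(t * omega) / sin_omega
    return [w0 * a + w1 * b for a, b in zip(z0, z1)]


def norm(z):
    return math.sqrt(sum(v * v for v in z))


# Two orthogonal unit vectors: the lerp midpoint shrinks toward the origin,
# while the slerp midpoint keeps the same norm as the endpoints.
z0, z1 = [1.0, 0.0], [0.0, 1.0]
print(norm(lerp(z0, z1, 0.5)), norm(slerp(z0, z1, 0.5)))
```

The shrinking norm of the linear midpoint is the effect the linked post describes: in high dimensions, points sampled from a Gaussian concentrate near a shell, so a straight line between two samples passes through a region the decoder rarely saw during training.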
Hi, great work using a VAE.
I can't open your demo website. Could you update the demo website link?
Thank you for this great repo!
I was trying to use it for language modeling, but I couldn't find, among the checkpoints you provide, any model that performs well in terms of perplexity. I measure perplexity on your SNLI test set with code/examples/big_ae/run_lm_vae_training.py by setting the --do_eval option (and without the --do_train option). This yielded a high KL (~2000) for all the checkpoints you provide.
I tried fine-tuning a Wikipedia checkpoint with your script on SNLI, but I only get the following results:
I can't seem to get low perplexity together with high mutual information. Could you provide a language modeling checkpoint, or just specify the hyper-parameters and the Wikipedia pre-trained model used to produce the results in the paper?
Thank you very much for your help!
Hi!
Thanks for releasing the code and checkpoints. Have you released a model pre-trained on a Chinese dataset?
Looking forward to your reply!
Hello, thank you very much for making the code available. I'm confused about the mutual information math, more specifically about the line
E_{q(z|x)}log(q(z|x)) = -0.5*nz*log(2*\pi) - 0.5*(1+logvar).sum(-1)
neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (1 + logvar).sum(-1)).sum().item()
When I derive it, it gives me
neg_entropy += (-0.5 * nz * math.log(2 * math.pi)- 0.5 * (logvar).sum(-1)).sum().item()
So I think I must have made a mistake somewhere? Thank you very much
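One way to see which of the two expressions holds is a quick numerical check. For a diagonal Gaussian, log q(z|x) contains the term -0.5 * (z - mean)^2 / var in each dimension, and its expectation under q is -0.5, which is where the extra 1 comes from. The sketch below compares the closed form from the code against a Monte Carlo estimate using only the standard library. The function names and the example mean and logvar are made up for illustration.

```python
import math
import random


def neg_entropy_closed_form(logvar):
    """E_q[log q(z|x)] = -0.5 * nz * log(2*pi) - 0.5 * sum(1 + logvar)."""
    nz = len(logvar)
    return -0.5 * nz * math.log(2 * math.pi) - 0.5 * sum(1.0 + lv for lv in logvar)


def log_density(z, mean, logvar):
    """log N(z; mean, diag(exp(logvar)))."""
    return sum(
        -0.5 * math.log(2 * math.pi) - 0.5 * lv - 0.5 * (zi - m) ** 2 / math.exp(lv)
        for zi, m, lv in zip(z, mean, logvar)
    )


def neg_entropy_monte_carlo(mean, logvar, num_samples=100000, seed=0):
    """Estimate E_q[log q(z|x)] by averaging log q over samples drawn from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, logvar)]
        total += log_density(z, mean, logvar)
    return total / num_samples


# Made-up posterior parameters, for illustration only.
mean = [0.3, -1.2]
logvar = [-0.5, 0.7]
print("closed form :", neg_entropy_closed_form(logvar))
print("monte carlo :", neg_entropy_monte_carlo(mean, logvar))
```

The two printed values should agree up to sampling noise, whereas the expression without the 1 differs from them by 0.5 per latent dimension.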
Your program works very well! I rewrote the interpolation function to make it easier for me to use in different ways. Perhaps others would also find this useful.
import torch

# sample_sequence_conditional is the helper from run_latent_generation.py.


def latent_code_from_text(text, encoder_tokenizer, model_vae, args):
    """Encode a sentence; return the posterior mean as the latent code, plus the token count."""
    # 101 and 102 are the [CLS] and [SEP] ids in the standard BERT vocabularies.
    tokenized1 = [101] + encoder_tokenizer.encode(text) + [102]
    coded1 = torch.tensor([tokenized1], dtype=torch.long)
    with torch.no_grad():
        x0 = coded1.to(args.device)
        pooled_hidden_fea = model_vae.encoder(x0, attention_mask=(x0 > 0).float())[1]
        mean, logvar = model_vae.encoder.linear(pooled_hidden_fea).chunk(2, -1)
        latent_z = mean.squeeze(1)
    coded_length = len(tokenized1)
    return latent_z, coded_length


def text_from_latent_code(latent_z, model_vae, sentence_length, args, decoder_tokenizer):
    """Decode a latent code into a sentence of a fixed length."""
    past = latent_z
    context_tokens = decoder_tokenizer.encode('<BOS>')
    length = torch.Tensor([[sentence_length]])
    out = sample_sequence_conditional(
        model=model_vae.decoder,
        context=context_tokens,
        past=past,
        length=length,  # Chunyuan: Fix length; or use <EOS> to complete a sentence
        temperature=args.temperature,
        top_k=args.top_k,
        top_p=args.top_p,
        device=args.device,
        decoder_tokenizer=decoder_tokenizer,
    )
    text_x1 = decoder_tokenizer.decode(out[0, :].tolist(), clean_up_tokenization_spaces=True)
    # Drop the first and last whitespace-separated tokens (presumably the <BOS>/<EOS> markers).
    text_x1 = text_x1.split()[1:-1]
    return ' '.join(text_x1)

...
# and then in the main function
latent_z1, coded_length1 = latent_code_from_text("a brown dog likes to eat his food very slowly .", tokenizer_encoder, model_vae, args)
latent_z2, coded_length2 = latent_code_from_text("a yellow cat likes to chase a long string .", tokenizer_encoder, model_vae, args)
# Decode the midpoint between the two latent codes.
result = text_from_latent_code((latent_z1 + latent_z2) / 2, model_vae, coded_length1, args, tokenizer_decoder)
print(result)
It seems that your demo website cannot be accessed. Can you fix it?
Hi,
Thank you for this work and for releasing the code as well! 🎉
I was wondering if there was any reason you chose to use BERT as the encoder and GPT2 as the decoder, instead of other pre-trained language models? In particular, why not consider models that already have an encoder/decoder architecture, such as T5 or BART?
Thanks
In the path "code/examples/big_ae", the default model name of the decoder is bert-base-cased. I think it should probably be gpt2.
Dear authors,
First, thank you for sharing the code!
I am interested in the Label-Conditional Text Generation experiment and have some questions about the losses.
loss_lsd corresponds to equation 16 of the paper, and I guess loss_lsg complements equation 16. Is there a reason why it is not mentioned in Eq. 16?
I have the feeling that loss_encode cancels out loss_lsc when they are added together in loss. Is that correct?
What would need to change to handle, for example, 3 classes? Only this?
Thanks a lot for your answers!
I'd like to run this on Colab. Any suggestions? Awesome work, super excited to experiment!
File: https://github.com/ChunyuanLI/Optimus/blob/master/code/examples/big_ae/modules/vae.py
Code in lines 188, 133, 143: outputs = self.decoder(input_ids=labels, past=latent_z, labels=labels, label_ignore=self.pad_token_id)
This line passes labels as the input_ids of the decoder. I wonder whether this is an error.
Should it be input_ids=inputs? Since labels contains -1 entries, an error is raised at line 460 (inputs_embeds = self.wte(input_ids)) of modeling_gpt2.py in pytorch_transformers.
Thanks.
Hi! Great work. One open question, though :D
I am curious about the performance of using gpt2 as both encoder and decoder. I am not sure whether the discrepancy between the two tokenizations can degrade performance.
Thanks
Can't access datasets at "https://github.com/ChunyuanLI/Optimus/blob/master/data/download_datasets.md"
The paper says:
First, our pre-trained language VAE is still under-trained due to limited compute resource, as the training reconstruction loss can still decrease. One may further train the models with higher latent dimension and longer time to fully release the power of pre-trained latent spaces.
So how long, in days or weeks, did it take to pre-train Optimus with its encoder and decoder initialized with the weights of BERT and GPT-2 respectively?
The settings in 'train_vae_wikipedia.yaml' and 'train_vae_wikipedia_distributed.yaml' seem to differ a lot (20 in the former, 1 in the latter).
How many pre-training epochs did you run, and which script should I refer to?