
diffusion-lm's Introduction

Diffusion-LM Improves Controllable Text Generation

https://arxiv.org/pdf/2205.14217.pdf


Conda Setup:

conda install mpi4py
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
pip install -e improved-diffusion/ 
pip install -e transformers/
pip install spacy==3.2.4
pip install datasets==1.8.0 
pip install huggingface_hub==0.4.0 
pip install wandb

Train Diffusion-LM:

cd improved-diffusion; mkdir diffusion_models;

python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 200000 --seed 102 --noise_schedule sqrt --in_channel 16 --modality e2e-tgt --submit no --padding_mode block --app "--predict_xstart True --training_mode e2e --vocab_size 821 --e2e_train ../datasets/e2e_data " --notes xstart_e2e

python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 400000 --seed 101 --noise_schedule sqrt --in_channel 128 --modality roc --submit no --padding_mode pad --app "--predict_xstart True --training_mode e2e --vocab_size 11043 --roc_train ../datasets/ROCstory " --notes xstart_e2e --bsz 64


Decode Diffusion-LM:

mkdir generation_outputs

python scripts/batch_decode.py {path-to-diffusion-lm} -1.0 ema


Controllable Text Generation

First, train the classifier used to guide the generation (e.g., a syntactic parser):

python train_run.py --experiment e2e-tgt-tree --app "--init_emb {path-to-diffusion-lm} --n_embd {16} --learned_emb yes " --pretrained_model bert-base-uncased --epoch 6 --bsz 10

Then, we can use the trained classifier to guide generation. (Currently, you need to update the classifier directory in scripts/infill.py; I will clean this up in the next release.)

python scripts/infill.py --model_path {path-to-diffusion-lm} --eval_task_ 'control_tree' --use_ddim True --notes "tree_adagrad" --eta 1. --verbose pipe
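For intuition, here is a hypothetical sketch of the gradient-guided update that the control step performs on the latents (the real logic lives in the langevin_fn helpers used by scripts/infill.py; the names and hyperparameters below are illustrative assumptions, not the repository's exact code):

    import torch

    def guided_update(x_t, classifier_logprob, mean_pred, sigma, steps=3, lr=0.1, coef=0.01):
        """Sketch: refine latents x_t so a classifier prefers them, while staying fluent.

        classifier_logprob: callable returning log p(control target | latents).
        mean_pred / sigma: mean and scale of the diffusion model's proposal at this step.
        """
        latents = torch.nn.Parameter(x_t.clone())
        optimizer = torch.optim.Adagrad([latents], lr=lr)
        for _ in range(steps):
            optimizer.zero_grad()
            # fluency term: stay close to the diffusion model's proposed mean
            prior = coef * ((mean_pred - latents) ** 2 / sigma).mean()
            # control term: raise the classifier's log-probability of the target attribute
            loss = prior - classifier_logprob(latents)
            loss.backward()
            optimizer.step()
        return latents.detach()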


For details of the methods and results, please refer to our paper.

@article{Li-2022-DiffusionLM,
  title={Diffusion-LM Improves Controllable Text Generation},
  author={Xiang Lisa Li and John Thickstun and Ishaan Gulrajani and Percy Liang and Tatsunori Hashimoto},
  journal={ArXiv},
  year={2022},
  volume={abs/2205.14217}
}

diffusion-lm's People

Contributors

xiangli1999


diffusion-lm's Issues

Fork of Transformers

Hey! Thanks so much for making your code available. I'm following the setup right now, but could you perhaps explain why you're using a fork of the Hugging Face Transformers repo? I imagine that you made some changes to it; if so, could you point out what and where they are?

Thanks again!

train the diffusion model for sentence infilling

Hi Lisa,

Thanks for releasing the code for your interesting research.
I have tried to train a diffusion model using the following code

python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 400000 --seed 101 --noise_schedule sqrt --in_channel 128 --modality roc --submit no --padding_mode pad --app "--predict_xstart True --training_mode e2e --vocab_size 11043 --roc_train ../datasets/ROCstory " --notes xstart_e2e --bsz 64

and used the following code to generate, but the model seems to output whatever I input to it. Is there something wrong with my procedure?

python scripts/infill.py --model_path diffusion_models/diff_roc_pad_rand128_transformer_lr0.0001_0.0_2000_sqrt_Lsimple_h128_s2_d0.1_sd101_xstart_e2e/  --eval_task_ 'infill' --use_ddim True --notes "tree_adagrad" --eta 1. --verbose yes --partial_seq "My dog loved tennis balls."

Also, I am curious about the detailed functions of the parameters listed above; could you please add some explanation?

License

Thanks for the very interesting work. Can you include a LICENSE file? Something permissive like Apache 2.0, MIT, or BSD would be great.

A question about gradient

    target = {
        ModelMeanType.PREVIOUS_X: self.q_posterior_mean_variance(
            x_start=x_start, x_t=x_t, t=t
        )[0],
        ModelMeanType.START_X: x_start,
        ModelMeanType.EPSILON: noise,
    }[self.model_mean_type]
    assert model_output.shape == target.shape == x_start.shape
    terms["mse"] = mean_flat((target - model_output) ** 2)

Hi, I see there are three main losses. MSE, decoder_nll and kl_T. I think decoder_nll is designed for the decoder and MSE is designed for the diffusion model. However, I see x_start is not detached from these two losses, so these two losses also influence the embedding part. Is this a bug or a particular design?
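For concreteness, a self-contained sketch of the alternative this question is about (hypothetical, not the repository's code): detaching the target stops gradients from this MSE term reaching the embedding that produced x_start.

    import torch

    def mse_term(model_output, x_start, detach_target=False):
        """Sketch of the MSE loss with an optional stop-gradient on the target.

        detach_target=False matches the snippet above (the embedding is also trained
        by this term); detach_target=True would cut that gradient path.
        """
        target = x_start.detach() if detach_target else x_start
        # mean over all non-batch dimensions, like mean_flat
        return ((target - model_output) ** 2).mean(dim=list(range(1, x_start.dim())))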

Where is train_run.py

Hi, I'm interested in your work and trying to follow up.
For the CTG part, the classifier is trained with python train_run.py, but this file does not exist.

Losses for E2E Training

Hi, thanks for releasing the code! I had a quick question about the different loss functions in the code.

I'm trying to wrap my head around the loss function presented in the paper and compare it to what's in the code. I'm taking a look at the function
$\mathcal{L}^{\text{e2e}}_{\text{simple}}(\mathbf{w}) = \mathbb{E}_{q_\phi(\mathbf{x}_{0:T} \mid \mathbf{w})}\!\left[\mathcal{L}_{\text{simple}}(\mathbf{x}_0) + \lVert \mathrm{EMB}(\mathbf{w}) - \mu_\theta(\mathbf{x}_1, 1)\rVert^2 - \log p_\theta(\mathbf{w} \mid \mathbf{x}_0)\right]$

LSimple appears to be this line

The loss between the embeddings seems to be these lines

And the cross entropy loss between the logits and input tokens appears to be here

However, I'm a little confused about what these lines account for. From my debugging, this just seems to be taking the embeddings multiplied with noise, multiplied with sqrt_alphas_cumprod across all timesteps.

Am I misinterpreting what's in the code versus what's in the paper?
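For reference, a rough PyTorch-style paraphrase of the three terms in the objective above (all names are illustrative assumptions; this is the structure I believe the code mirrors, not the repository's implementation):

    import torch
    import torch.nn.functional as F

    def e2e_simple_loss(model, emb, lm_head, w, x_0, x_1, x_t, t):
        # (1) L_simple: predict x_0 from the noised latent x_t (x0-parameterisation)
        l_simple = ((model(x_t, t) - x_0) ** 2).mean()

        # (2) keep mu_theta(x_1, 1) close to the word embeddings EMB(w)
        l_emb = ((emb(w) - model(x_1, torch.ones_like(t))) ** 2).mean()

        # (3) decoder NLL: round x_0 back to discrete tokens, -log p(w | x_0)
        logits = lm_head(x_0)                                   # (batch, seq, vocab)
        l_nll = F.cross_entropy(logits.transpose(1, 2), w)

        return l_simple + l_emb + l_nll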

Evaluating methods

Hi, I finished all the commands that you provided, but I don't know how to evaluate the generated files.
I noticed that there is a script (Diffusion-LM/improved-diffusion/control_gen/eval_control.py) for evaluation, but I can't find "diffusion_lm/misc/self-attentive-parser" and "evaluate". Could you add them or reply here with the evaluation method?
Thanks!

What if DiffusionLM is initialized with BERT?

Hi, Lisa.

Thank you for your wonderful paper and for sharing the code. I notice in the code that one can initialize the transformer encoder with BERT. I'm wondering what will such initialization bring. Does it help DiffusionLM to converge way faster or achieve better generation results? And is there possibly any negative effect on DiffusionLM if initialized with BERT? Thanks!

some problems on reproducing the results

Thanks for your brilliant work. I was reproducing the results on ROC unconditional generation, but I ran into some problems.

When training on ROC story, your code ends with an eval loss of ~0.055.
1. But when I simplified your code and tried to reproduce it, the sqrt schedule converges at an eval loss of ~0.09 and the linear schedule achieved ~0.07, which is still not good enough for my future work.

2. Here I noticed:
in sqrt schedule:
betas = [0.01464131 0.00888909 0.00706818 ... 0.35722328 0.55561113 0.999 ]
sqrt_one_minus_alphas_cumprod = [0.12100128 0.1529714 0.17407766 ... 0.99977265 0.99989897 0.9999999 ]

in linear schedule:
betas = [5.00000000e-05 5.49774887e-05 5.99549775e-05 ... 9.99004502e-03, 9.99502251e-03 1.00000000e-02]
sqrt_one_minus_alphas_cumprod = [0.00707107 0.01024572 0.01284225 ... 0.9999787 0.99997891 0.99997912]

Is that normal, given that the sqrt betas are much greater than the linear ones?

I've spent a lot of time debugging and checked every detail (my model has the same output, loss, and gradients as yours, and stays identical after one optimizer step), but my code cannot converge to a better eval loss. Can you give me any advice? I'd be really grateful.

Thank you for your time; I could really use some help.
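For what it's worth, the betas you printed can be reproduced with the standard alpha-bar construction; the sketch below is my reconstruction (ᾱ(t) = 1 − √(t + s) with s = 1e-4 for the sqrt schedule), not a quote of the repository code, but it shows why the early sqrt betas are orders of magnitude larger than the early linear ones.

    import numpy as np

    def betas_for_alpha_bar(T, alpha_bar, max_beta=0.999):
        # beta_t = 1 - alpha_bar(t)/alpha_bar(t-1), clipped at max_beta
        betas = []
        for i in range(T):
            t1, t2 = i / T, (i + 1) / T
            betas.append(min(1 - alpha_bar(t2) / alpha_bar(t1), max_beta))
        return np.array(betas)

    T = 2000
    sqrt_betas = betas_for_alpha_bar(T, lambda t: 1 - np.sqrt(t + 1e-4))  # "sqrt" schedule
    linear_betas = np.linspace(5e-5, 1e-2, T)                             # "linear" schedule

    print(sqrt_betas[:2], sqrt_betas[-1])      # ~[0.0146 0.0089] ... 0.999
    print(linear_betas[:2], linear_betas[-1])  # ~[5.0e-05 5.5e-05] ... 0.01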

How to train a new diffusion model & classifer with different diff_steps or embedding dimension?

Hi again! I would like to train a new diffusion model and a matched classifier with different diff_steps or embedding dimension, but I am confused about the parameters that need to be changed.

  1. For a different diff_steps, for example 3000, I change --diff_steps to 3000 when running improved-diffusion/scripts/run_train.py, and the variables named diffusion_steps in transformers/examples/pytorch/language-modeling/run_clm.py and improved-diffusion/scripts/infill.py to 3000.
    However, when running infill.py, there are errors shown as follows:
/opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [173,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [173,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
......
/opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [306,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [306,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [306,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [306,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [306,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
scripts/infill.py:656: UserWarning: Use of masked_fill_ on expanded tensors is deprecated. Please clone() the tensor before performing this operation. This also applies to advanced indexing e.g. tensor[mask] = scalar (Triggered internally at  /opt/conda/conda-bld/pytorch_1640811805959/work/aten/src/ATen/native/TensorAdvancedIndexing.cpp:1280.)
  encoded_seq.masked_fill_(encoded_seq == todo_pad_token, 3)
ddim_sample_loop_progressive device: cuda:0
ddim_sample_loop_progressive noise: None
ddim_sample_loop_progressive progress: False
Traceback (most recent call last):
  File "scripts/infill.py", line 1131, in <module>
    args = main()
  File "scripts/infill.py", line 698, in main
    eta=args.eta,
  File "/Diffusion-LM/improved-diffusion/improved_diffusion/gaussian_diffusion.py", line 1163, in ddim_sample_loop_progressive
    langevin_fn=langevin_fn,
  File "/Diffusion-LM/improved-diffusion/improved_diffusion/gaussian_diffusion.py", line 1039, in ddim_sample
    sample=langevin_fn(sample, mean_pred, sigma, self.alphas_cumprod_prev[t[0]], t, x)
  File "/Diffusion-LM/improved-diffusion/scripts/infill_util.py", line 162, in langevin_fn_tone_length
    model_kwargs={},
  File "/Diffusion-LM/improved-diffusion/improved_diffusion/respace.py", line 93, in p_mean_variance
    return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
  File "/Diffusion-LM/improved-diffusion/improved_diffusion/gaussian_diffusion.py", line 479, in p_mean_variance
    model_output = model(x, self._scale_timesteps(t), **model_kwargs)
  File "/Diffusion-LM/improved-diffusion/improved_diffusion/respace.py", line 122, in __call__
    map_tensor = th.tensor(self.timestep_map, device=ts.device, dtype=ts.dtype)
RuntimeError: CUDA error: device-side assert triggered

I failed to handle the error, so could you show me how to change diff_steps correctly?

  2. For a different embedding dimension, for example 32 (originally 16), is the modification shown below sufficient? (other parameters omitted)
    run_train.py --in_channel 32
    run_clm.py --n_embd 32

Missing `training_args.json` during training

Thanks so much for making the code available!
I'm trying to run the example commands in the README. However, I get the same error about a missing file named training_args.json during both Train Diffusion-LM and Controllable Text Generation.

python scripts/run_train.py --diff_steps 2000 --model_arch transformer --lr 0.0001 --lr_anneal_steps 200000 --seed 102 --noise_schedule sqrt --in_channel 16 --modality e2e-tgt --submit no --padding_mode block --app "--predict_xstart True --training_mode e2e --vocab_size 821 --e2e_train ../datasets/e2e_data " --notes xstart_e2e

python train_run.py --experiment e2e-tgt-tree --app "--init_emb {path-to-diffusion-lm} --n_embd {16} --learned_emb yes " --pretrained_model bert-base-uncased --epoch 6 --bsz 10

Could you please tell me where the file is or how to make it?
Thanks a lot!

Where is the mask indicator to calculate the decoder loss?

Hi, you use token_discrete_loss to compute the decoder loss. However, it seems to include the cross-entropy loss for all input tokens, which may be unreasonable.

Shouldn't the PAD tokens be excluded from the decoder loss?

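For illustration, a hypothetical sketch of a padding-masked decoder loss (pad_token_id and the shapes are assumptions; the repository's token_discrete_loss may be organised differently):

    import torch
    import torch.nn.functional as F

    def masked_token_nll(logits, input_ids, pad_token_id):
        """Cross-entropy over the vocabulary, ignoring PAD positions.

        logits: (batch, seq, vocab); input_ids: (batch, seq).
        """
        nll = F.cross_entropy(logits.transpose(1, 2), input_ids, reduction="none")  # (batch, seq)
        mask = (input_ids != pad_token_id).float()
        return (nll * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)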

Why model.model.module instead of model.model?

Hi, may I ask a question about DDP?

I notice that you use model.model.module to access some customized attributes like get_embedding(). However, as far as I know, the right way ought to be model.module.

Interestingly, model.model.module works well in your original version.


Dependencies and Data-path to read

Thanks for making the code implementation publicly available!

Minor things:

  • Missing dependencies: stanza, spacy-stanza, benepar (+ download the nltk data for local reading), scikit-learn
  • The path for the e2e data in run_clm.py needs to be corrected (I have not checked the other datasets) in ~/transformers/../../run_clm.py: ./datasets/e2e_data/

^ which also implies that there is a modification in the transformers library to train the control-based classifier.


Hugging-face online model not found when decoding Diffusion-LM

Hey there, I was just trying to run the decoding script when I got an error:

OSError: We couldn't connect to 'https://huggingface.co/' to load this model and it looks like predictability/diff_models/e2e-tgt_e=15_b=20_m=gpt2_wikitext-103-raw-v1_101_None is not the path to a directory containing a config.json file.

Just 10 hours ago I was able to successfully load the model from Hugging Face; I wonder if this is a problem on my end?

top_p parameter and scaling of timesteps

Hi
Thanks for sharing the code. I wonder why there is top_p in the code (the part where you are adjusting the noise), and also why the timesteps are scaled here:

    def _scale_timesteps(self, t):
        if self.rescale_timesteps:
            return t.float() * (1000.0 / self.num_timesteps)
        return t

Are these necessary? Thanks.

Strength of classifiers vs. results: do all baselines in the paper use the same classifier as Diffusion-LM?

Thank you so much for making your code public. Again, very interesting work!
This is not really an issue but more of a question.

I have a feeling that the strength of the classifier really matters in plug-and-play controlled text generation. I was wondering if the models that you compare against all use the same classifiers. If not, can you comment on differences (number of parameters, pretrained or not, architectures, etc...)?

Thank you.

Importance of small vocab size and dimensionality of diffusion space for e2e-tgt experiments

Hi,

I notice that with the default settings for training the diffusion model, a small-vocab custom tokenizer is used together with a matching randomly initialized embedding layer (821 -> 16).

I also know this setting uses a small latent space of dimension 16 that the hidden states are projected to and from during diffusion.

How important were these two factors for training? I see that using the pretrained BERT tokenizer was an option; I assume this doesn't work as well? Were these smaller-scale components required to get Diffusion-LM to train stably and converge?

about {path-to-diffusion-lm}

When finally generating controllable text with the python scripts/infill.py --model_path {path-to-diffusion-lm} command, which path does "path-to-diffusion-lm" refer to: the diffusion model trained in the first step, or the model produced by the classifier training?
"improved diffusion/diffusion_models/diff_e2e-tgt_block_rand16_transformer_lr0.0001_0.0_2000_sqrt_Lsimple_h128_s2_d0.1_sd102_xstart_e2e" or "Classifier_models/e2e-tgt-tree_e=6_b=10_m=bert base uncased_wikitext-103-raw-v1_101_wp_None"

How did you derive your sampling algo?

Hi Lisa,

Thanks for your wonderful work.

May I ask how you derived the sampling algorithm mathematically for the x_0 prediction? (I am looking for the sort of proof given in DDPM for the ε-prediction.)
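For reference, the standard DDPM posterior mean written in terms of a predicted $\hat{x}_0$ rather than predicted noise (my reading of the x0-parameterisation, not a statement about the repository's exact derivation) is

$\tilde{\mu}_t(x_t, \hat{x}_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\,\hat{x}_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\,x_t, \qquad \hat{x}_0 = f_\theta(x_t, t),$

and sampling sets $x_{t-1} = \tilde{\mu}_t + \sigma_t z$ with $z \sim \mathcal{N}(0, I)$; the familiar ε-prediction form follows by substituting $\hat{x}_0 = (x_t - \sqrt{1-\bar{\alpha}_t}\,\hat{\epsilon})/\sqrt{\bar{\alpha}_t}$.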

Necessity of using diffusion model

Hi!
If possible, can you explain why your team chose a diffusion model for the task of controllable text generation?

How to control the length

In the paper, the length of the generated sentence can be controlled without a classifier.
How can we do this? Could you give us a detailed explanation or implementation for this?

Best,

Gwanghyun

How to deal with sequences with different lengths?

Thank you for your great work! I've read your paper and am having trouble understanding how sequences with different lengths are generated. It seems to me that since you fix n=64 in experiments, you can't change it anymore, as the hidden size d'=n*d in the Transformer is fixed. As a result, it should be impossible at inference time to generate sequences with a length other than 64...?

The effect of "logp_term"

Thanks for sharing the code, but I would like to ask some questions.
In lines 51-54 of infill_util.py, what is the effect of logp_term and how is the value of coef chosen? Why is logp_term added to model_out.loss?
coef = 0.01
logp_term = coef * ((mean - input_embs_param) ** 2 / sigma).mean(dim=0).sum()
loss = model_out.loss + logp_term
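My reading (an assumption, not an authoritative statement): logp_term is a Gaussian-style penalty that keeps the optimised embeddings close to the diffusion model's proposed mean, roughly

$\texttt{logp\_term} = \lambda\,\frac{(\texttt{mean} - \texttt{input\_embs})^2}{\sigma} \;\propto\; -\log \mathcal{N}(\texttt{input\_embs} \mid \texttt{mean}, \sigma) + \text{const},$

so adding it to model_out.loss trades off control (the classifier loss) against fluency (staying near what the diffusion model would have sampled), with coef = 0.01 setting the balance.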

Where is the mbr.py file?

Hi, I notice that when modality == 'e2e', batch_decode.py calls the diffusion_lm/e2e_data/mbr.py file.
However, I failed to find this script. Is this an error or a typo?

    elif modality == 'e2e':
        COMMAND1 = f"python diffusion_lm/e2e_data/mbr.py {out_path2}"
        os.system(COMMAND1)

Missing `control_gen/target_tree.json` file for infill script

I'm trying to run the last example command in your README

python scripts/infill.py --model_path {path-to-diffusion-lm}  --model_arch transformer --eval_task_ 'control_tree' --use_ddim True  --notes "tree_adagrad" --eta 1. --verbose pipe

and with a few edits I get it to run up until the line for loading the control_label_lst for the tree task.

Could you add the control_gen/target_tree.json file or reply here with the file contents? I'm trying to get a minimum working execution of the entire pipeline for at least one task to better understand how you implemented the conditional denoising/generation using the embedded control sequence. With this file I think I can get the infill script to execute fully, which should help in this process.

Thanks!

Infilling and text generation in specific contexts.

Hello, thank you for the great work.
Can someone guide me on how to test the infilling task, i.e., give surrounding context and have a sentence in the middle generated? Also, I would be thankful if you could showcase how the length of the generated response can be controlled.

Would you please explain all the experiment names in your code?

Hi, thanks for your interesting work!

There are a lot of arguments for model_args.experiment:

  • 'e2e-tgt-tree', 'e2e-tgt-gen-tree', 'e2e-tgt-gen-spans'
  • e2e-tgt-gen-length
  • 'e2e-tgt-pos', 'e2e-tgt-gen-pos'
  • 'synth_emb', 'pos_emb', 'roc_emb', 'simple-wiki_emb', 'e2e-tgt_emb'
  • ....

These choices make it hard to follow your work and reproduce some experiments. I guess many of the choices are for baseline implementations. Which experiments should I choose if I want to run the control tasks from your paper with Diffusion-LM? Would you please explain them in your README?

Thanks a lot!

ModuleNotFoundError: No module named 'mpi4py'

I am getting this error after installing the dependencies mentioned. Could you please update the installation instructions and rerun the commands to make sure everything works with the listed dependencies?

Using learned_emb for training the classifier

Hi Lisa,
Thanks for sharing the code! I am trying to run the scripts for the 'e2e-tgt-tree' task, and I noticed that in the instruction for training the syntactic parser classifier, the "--learned_emb yes " option is not used in the code. Perhaps I am reading this wrong, but in this line it looks like the randomly initialized embedding is loaded instead of the trained embedding weights. Could I ask whether my understanding is correct, or did I miss anything? Thank you for your help!

How could I perform semantic experiment in e2e dataset?

Hello, thank you for the great work!
How can I perform the semantic control experiment (using an attribute such as 'food: Japanese' to control the model) on the e2e dataset? I don't know which command should be executed.
Is there an example please?

In addition, what should I do if I want to use more than one attribute, such as "name", "type", etc., given in the e2e dataset?

HF diffusers and wandb Reports

Hello there. I was wondering if you are planning to add your model to HF diffusers.
Also, do you plan to release the wandb logs/reports to the community? I am sure a lot of insights can be derived from them.

Train on Multi GPU

Hi Lisa, I have successfully run your training code, but during training it turns out that GPU memory is not enough. The cluster I am using has multiple GPUs, but the training script didn't seem to utilize them. Is there a way to train Diffusion-LM on multiple GPUs? Thank you very much; I am just a newbie in this field.

Is the embedding model trainable during the training process?

Hi, thanks for providing the code. However, I am confused regarding the embedding layer.

In the train.py script, the model weights are loaded from ema_0.9999_200000.pt for the 'roc' dataset. This indicates that the embedding layer is using pre-trained parameters.

   if args.experiment == 'random1':
        args.experiment = 'random'
        print('loading from the vocabs here.')
        assert args.in_channel == 64
        assert args.modality == 'roc'
        model22 = torch.nn.Embedding(args.vocab_size, args.in_channel)
        model22_weight = torch.load('predictability/diffusion_models_v7/diff_roc-aug_pad_rand64_'
                                    'transformer_lr0.0001_0.0_2000_sqrt_Lsimple_h128_s2_d0.1_sd108_xstart_e2e/'
                                    'ema_0.9999_200000.pt', map_location='cpu')['word_embedding.weight']
        model22.weight = model22_weight
        model22.weight.requires_grad=False

But for other datasets, or for experiment = 'random', the embedding layer is randomly initialized.

        model = torch.nn.Embedding(len(vocab_dict), data_args.in_channel)
        print('initializing the random embeddings', model)
        torch.nn.init.normal_(model.weight)
        path_save = f'{data_args.checkpoint_path}/random_emb.torch'
        print(f'save the random encoder to {data_args.checkpoint_path}/random_emb.torch')
        torch.save(model.state_dict(), path_save)

So, first of all, I guess that this embedding model is trained during the training process. Am I right?

Nevertheless, when we decode text batches and sample texts with batch_decode.py and text_sample.py, the embedding model loads the weights of the randomly initialized model, which suggests that the embedding layer is not trained during the training process. This is very weird, isn't it?

            model = torch.nn.Embedding(len(tokenizer), emb_dim)
            path_save = '{}/random_emb.torch'.format(file)
            model.load_state_dict(torch.load(path_save))

To summarize, I am uncertain why you load a randomly initialized embedding layer rather than a well-trained one when decoding the batches.
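For what it's worth, a hypothetical sketch of loading the trained word embedding from the diffusion checkpoint instead of random_emb.torch (the key word_embedding.weight is taken from the snippet above; the path is a placeholder, and whether this is actually needed depends on how random_emb.torch relates to the trained weights):

    import torch

    ckpt = torch.load("diffusion_models/<run-name>/ema_0.9999_200000.pt", map_location="cpu")
    weight = ckpt["word_embedding.weight"]          # (vocab_size, emb_dim)

    emb = torch.nn.Embedding(*weight.shape)
    with torch.no_grad():
        emb.weight.copy_(weight)                    # use the trained embedding for decoding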

problem about attention_mask

Hey! I'm reading the code and I see that the core model architecture (self.input_transformers) is a Hugging Face BERT encoder. It doesn't seem to pass an attention_mask into the encoder, even though the input is padded. So is it necessary to add the attention_mask, and to mask the same positions in the MSE loss?
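For illustration, a hedged sketch of passing a padding mask to a Hugging Face BERT model and masking the MSE loss the same way (this uses transformers.BertModel for self-containment; the repository's encoder and forward signature may differ):

    import torch
    from transformers import BertConfig, BertModel

    encoder = BertModel(BertConfig(hidden_size=128, num_attention_heads=8,
                                   num_hidden_layers=4, intermediate_size=512))

    def masked_mse(x_t, target, pad_mask):
        """pad_mask: (batch, seq) with 1 for real tokens and 0 for padding."""
        pad_mask = pad_mask.float()
        hidden = encoder(inputs_embeds=x_t, attention_mask=pad_mask).last_hidden_state
        per_tok = ((hidden - target) ** 2).mean(dim=-1)            # (batch, seq)
        return (per_tok * pad_mask).sum(-1) / pad_mask.sum(-1).clamp(min=1)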

Are these normal results?

Hi, I got some results by following the README and running infill.py with the task "control_pos". Each sentence is 64 words long and doesn't seem to match the target POS sequence very well, as shown in the following example.

{('START', 'ADV', 'NOUN', 'ADV', 'ADP', 'PROPN', 'PROPN', 'PUNCT', 'DET', 'PROPN', 'NOUN', 'NOUN', 'VERB', 'ADJ', 'NOUN', 'NOUN', 'CCONJ', 'AUX', 'PART', 'VERB', 'NOUN', 'NOUN', 'PUNCT', 'PROPN', 'END'): [
    "House shop found 's House type type usually bad usually French and highly - French family and family upscale as for makes convenient our - cuisine cuisine at at has very views at price very low , fine he non type non he perfect UNK call non call Children family by a Mid family family convenient family family family kids at family at convenient", 
    "called stop family 's many usually usually at high usually diner child highly family French French for family upscale as makes make convenient usually grub at because cuisine non has pub non at one price we at UNK non as they he new perfect UNK many atmosphere House he who UNK Bibimbap who new family convenient UNK family But new at family at convenient", 
    "House . at 's many type type take French usually while and Join friendly serving with and family usually as so tag are serve 's non among fine perfect it UNK we , options he perfect he he who UNK customers who he - - Conveniently new who who most options a - a having quiet view at quiet has at a by by", 
    "House One make 's 's usually usually very family usually French child Is for French and for family tasty highly French family UNK UNK pub cuisine cuisine Cambridge in cuisine has has has a cuisine out is UNK UNK The by A he out by at atmosphere atmosphere UNK who a he UNK most beautiful quiet who family But who kids a kids convenient", 
    "called shop family and many type usually at somewhere when and French of - for French and by upscale low with gave UNK may pub has can at at has has with UNK price good the , we 're restaurant atmosphere who are UNK he UNK UNK he UNK can 're family who a family new family family who a at a by convenient", 
    ...

And I have the same question for the other tasks except "control_length".

Could you please tell me whether these are normal results? If so, how can I get decent sentences like the ones you show in the paper?
Thanks a lot!

Wandb log or Codalab log

#10
Hi,
Currently the link is broken, and I don't see log files publicly available on your Codalab homepage.

Do we need to scale word embeddings to [-1, 1]?

Hi there, thank you very much for providing the code!

I am new to diffusion models, so I apologize in advance if I ask a dumb question.

In this line, it seems we are getting word embeddings and adding noise directly to them, without making sure the word embeddings are within [-1, 1].

In DDPM, we need to scale images to [-1, 1] for the noise scheduler's parameters to work properly.

I am wondering how we control the scale in text.

Thank you very much!
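For context, my understanding (not something the repository states explicitly): the forward process is $q(x_t \mid x_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_t}\,x_0,\;(1-\bar{\alpha}_t)\,I\big)$, so the signal-to-noise ratio at each step depends on the scale of $x_0$ relative to the unit-variance noise. Images are rescaled to $[-1, 1]$ to fix that scale in advance; here the word embeddings are learned jointly with the diffusion model, so their scale can adapt during training instead of being normalised beforehand.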

Training Cost due to the EMA mechanism

Thanks for such nice work and for kindly releasing the code! I have just tried it and found that an EMA mechanism is used in the optimization of Diffusion-LM, which constrains the update of the model parameters considerably. Such a scheme may stabilize training, but it also increases the training cost. If it were removed, would the performance of Diffusion-LM degrade a lot? Or could the training cost be reduced substantially?

I would really like to know the motivation for using EMA in your approach. Looking forward to your response.

dependency conflict

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 1.8.0 requires huggingface-hub<0.1.0, but you have huggingface-hub 0.4.0 which is incompatible.

Time embeddings

Hi,

The paper (Appendix F, 4th bullet point) states that time embeddings are incorporated via a softmax applied to learnable scaling and offsetting operations. I could not find the part of the code that does this. My first impression is that the time embedding is applied at line 902 of "transformer_model2.py", where the learnable time embeddings are simply added linearly to the input features. Am I missing something? Thank you...
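For comparison, a minimal sketch of the plain additive scheme described above (an illustration only, not the repository's exact module):

    import torch
    import torch.nn as nn

    class AdditiveTimeEmbedding(nn.Module):
        """Adds a learned per-timestep vector to every position's features."""

        def __init__(self, num_steps, dim):
            super().__init__()
            self.time_emb = nn.Embedding(num_steps, dim)

        def forward(self, x, t):
            # x: (batch, seq, dim); t: (batch,) integer timesteps
            return x + self.time_emb(t).unsqueeze(1)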

Error message when trying to train the model

I receive this message when I try to train the model: ['OPENAI_LOGDIR' is not recognized as an internal or external command, operable program or batch file.] Could anyone help me with this?
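That error suggests the command was launched from Windows cmd, where the Unix-style OPENAI_LOGDIR=... prefix before a command is not understood. One workaround (the directory below is a placeholder) is to set the variable before the script runs, e.g. from Python, or with `set OPENAI_LOGDIR=...` in cmd / `$env:OPENAI_LOGDIR = "..."` in PowerShell:

    import os

    # Placeholder path; point this wherever logs/checkpoints should go,
    # then launch the training entry point from the same process/shell.
    os.environ["OPENAI_LOGDIR"] = "diffusion_models/my_run"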

decoding diffusion-lm and Controllable text generation

Hello!
Would you mind telling me what I should put in {path-to-diffusion-lm} when decoding Diffusion-LM? I have tried several paths, but they all return an empty output list.
Also, may I ask where you implemented the formula from Section 5.1, Controllable Text Generation?

Thank you!

Paths to models/hf hub, extra transformer subclasses

Hi there, great work on this paper!

I was just trying to run some of the code to understand your full pipeline and I was able to "Train Diffusion-LM" using the scripts/run_train.py under improved-diffusion. However, the next utility scripts/batch_decode.py has a series of paths like predictability/diff_models/... which I'm pretty sure were all local paths during dev, as the code is unable to pull them from huggingface model hub. The issue occurs after generation, when scripts/ppl_under_ar.py is launched. It can't get this model from the hub https://huggingface.co/predictability/diff_models/e2e-tgt_e=15_b=20_m=gpt2_wikitext-103-raw-v1_101_None/resolve/main/config.json

My understanding is that ppl_under_ar.py is supposed to use your "teacher LM (i.e., a carefully fine-tuned GPT-2 model)" to assess the generation quality of the diffusion model trained by scripts/run_train.py
(my trained diffusion model is at Diffusion-LM/improved-diffusion/diffusion_models/diff_e2e-tgt_block_rand16_transformer_lr0.0001_0.0_2000_sqrt_Lsimple_h128_s2_d0.1_sd102_xstart_e2e)

I assume that you used your modified run_clm.py at Diffusion-LM/transformers/examples/pytorch/language-modeling to fine-tune a GPT-2 model for the task of measuring perplexity of the generated samples. Is this correct? And as such, one could use any autoregressive LM; they just won't get the same perplexities/"lm-scores" you reported?

Thanks for helping clarify how the components work and what models are available for download (if any).

PS: what are the various custom model subclasses for, and why the "compression"? I can't really match these ideas or these models to the paper: {GPT2LMHeadModelCompress, BERTModelCompress, AutoEncoderWithNoise, AR_for_cont, GPT2VAE}. Is the compression/down-projection your way of enabling the model to diffuse in a reduced-dimension space?
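As an aside, a minimal sketch of scoring generated sentences with an off-the-shelf GPT-2 via Hugging Face (any autoregressive LM works here, but the resulting perplexities will only match the paper's lm-scores if the same fine-tuned teacher model is used):

    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tok = GPT2TokenizerFast.from_pretrained("gpt2")
    lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def mean_nll(sentence):
        ids = tok(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            return lm(ids, labels=ids).loss.item()   # exp() of this is the perplexity

    print(mean_nll("The Vaults is a family-friendly restaurant near the river."))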

Final e2e training objective definition in code

Hi again,

I am working on understanding the training setup and want to make sure I am clear on the loss function that worked best in your paper, i.e., the one you'd consider best in the end for learning a BERT-backbone word-vector diffusion model.

I believe training_losses_e2e gives the three-term loss, which roughly matches the VLB equation. But what does training_losses_e2e_simple give you exactly? Which one (or both?) did you find optimal for learning the embedding step, the decoding step, and the denoising step simultaneously? And which terms correspond to which components?

Thanks!
