
ssd-lm's People

Contributors

xhan77


ssd-lm's Issues

Where is the Training Code?

I think this is interesting work, and I would like to train SSD-LM myself. The implementation looks elegant. When will the training code be uploaded?

Any plan to release the pretrained diffusion LM?

The pretrained diffusion LM outperforms GPT-2, which means a lot for diffusion model research in natural language processing. I am interested in using the pretrained diffusion LM on downstream tasks, so may I ask whether there is any plan to release the model, e.g., by adding it to Hugging Face?

parameter setting

Hi, how should the argument args.remove_noise_mode be configured? I am a bit confused; could you enumerate some example values? Thanks.

question about the metrics used in the paper

Hi

  1. Could you please clarify which of the Zipf coefficients is reported? I see three of them and am not sure which one appears in the paper.
  2. For the |\Delta log(PPL)| metric in the results table, could you please point me to how it is computed? I only see the plain PPL metric in the code (see the sketch after this list). Thanks.
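Not an official answer, but based on the wording in the paper (the difference of log perplexity between generated text and human-written continuations, both scored by the same external LM), one plausible reading could look like the minimal sketch below; the function name and the use of GPT-Neo as the scoring model are assumptions for illustration, not the repository's actual implementation.

import math

def abs_delta_log_ppl(gen_ppl: float, gold_ppl: float) -> float:
    # |\Delta log(PPL)|: absolute difference between the log perplexity of the
    # generated continuation and that of the human-written continuation, where
    # both perplexities are measured by the same external LM (e.g., GPT-Neo).
    return abs(math.log(gen_ppl) - math.log(gold_ppl))

# Example: generated text scored at PPL 18.0, gold continuation at PPL 22.5.
print(abs_delta_log_ppl(18.0, 22.5))  # ~0.223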

Classifier-guidance generation code

Hi, Thanks for sharing your implementation. It is really helpful and easy to follow.

In addition to prompt-based generation, your experiments also include controlled text generation, which I believe is called classifier guidance in the diffusion-model literature. However, I could not find the code to run the controlled text generation; a generic sketch of the idea is included below for reference.

Do you have any plan to open-source this?
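For context, here is a minimal sketch of the generic classifier-guidance idea for logit/simplex-space diffusion LMs, assuming a hypothetical attribute classifier that consumes token probability simplexes; this illustrates the concept only and is not the repository's actual controlled-generation code.

import torch

def guide_logits(logits, classifier, target_label, guidance_weight=1.0):
    # Nudge the model's predicted token logits toward a target attribute by
    # ascending the gradient of an attribute classifier's log-probability.
    # `classifier` is a hypothetical module mapping (batch, seq, vocab) simplexes
    # to per-label logits of shape (batch, num_labels).
    logits = logits.detach().requires_grad_(True)
    probs = torch.softmax(logits, dim=-1)
    class_log_probs = torch.log_softmax(classifier(probs), dim=-1)
    objective = class_log_probs[:, target_label].sum()
    grad = torch.autograd.grad(objective, logits)[0]
    return (logits + guidance_weight * grad).detach()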

Is the model fully trained?

The representation dimension equals the vocabulary size. With such a large dimension, I wonder whether the training data and training time are enough to fully tune the model parameters.

Paper, equation (5)

Hi! Thanks for the paper! A few questions:

  1. In equation (5), K is not defined. Is it just a hyperparameter? (A small sketch of this representation appears after this list.)
  2. Why -K rather than 0, as in a regular one-hot vector? Is it to make the softmax more extreme/discrete?
  3. In equation (8), I think you forgot to wrap the logits in a 'softmax', since the model receives the softmax simplex (before the softmax, it is not a simplex).
  4. In equation (10), why did you write 'logits'? Since you wrote that they "are converted to a dist using softmax", shouldn't these be probabilities rather than logits?
  5. In equation (10), do you predict the original tokens or the previous timestep (i.e., less noisy tokens)?
  6. In Table 1, why isn't the arrow next to PPL pointing down, as in 'lower is better'?
  7. Regarding the sentence: "Prior works, however, have shown that low perplexity of generated text is not necessarily an indication of high quality but of degenerate behavior (Nadeem et al., 2020; Zhang et al., 2021) and have proposed closeness to the perplexity of human-written text as a better evaluation. Hence, we also report the difference of log perplexity between the generated text and human-written continuations" - I couldn't find the word 'perplexity' in the cited article. What formula do you use? Do you compute the difference between the log perplexity GPT-Neo assigns to the gold continuation and the log perplexity it assigns to the text generated by the specific model being tested?
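For readers following along, here is a minimal sketch of an "almost-one-hot" logit representation with +K at the token's index and -K everywhere else, as asked about in questions 1 and 2; the value K = 5.0 is an illustrative assumption, not the paper's setting.

import torch

K = 5.0  # illustrative value; in the paper K is a hyperparameter

def token_to_logits(token_ids, vocab_size, k=K):
    # Map token ids to almost-one-hot logit vectors: +k at the token index, -k elsewhere.
    logits = torch.full((*token_ids.shape, vocab_size), -k)
    logits.scatter_(-1, token_ids.unsqueeze(-1), k)
    return logits

ids = torch.tensor([2, 7])
w0 = token_to_logits(ids, vocab_size=10)
probs = torch.softmax(w0, dim=-1)  # near-one-hot distributions on the probability simplex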

getting a datasets.utils.info_utils.NonMatchingSplitsSizesError when downloading the openwebtext dataset from huggingface

hello,
I am trying to download the openwebtext dataset from huggingface, but I keep getting the following error:

Downloading data: 100%|██████████████████████████████████████████| 12.9G/12.9G [25:43<00:00, 8.35MB/s]
/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/download/download_manager.py:527: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass `DownloadConfig(num_proc=<num_proc>)` to the initializer instead.
  warnings.warn(
Extracting data files: 100%|██████████████████████████████████████████| 20610/20610 [9:43:42<00:00,  1.70s/it]
Traceback (most recent call last):
  File "ssd_process_data.py", line 485, in <module>
    main()
  File "ssd_process_data.py", line 369, in main
    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 985, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 100, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=39769494896, num_examples=8013769, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=39769065791, num_examples=8013740, shard_lengths=[101000, 100000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 101000, 101000, 101000, 101000, 102000, 102000, 100000, 101000, 100000, 101000, 102000, 101000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 101000, 101000, 102000, 101000, 102000, 101000, 101000, 100000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 100000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 102000, 102000, 101000, 101000, 102000, 102000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 102000, 101000, 13740], dataset_name='openwebtext')}]

I have tried forcing a re-download of the dataset by passing the download_mode="force_redownload" parameter, but it yielded the same error.

I have also tried passing the ignore_verifications=True parameter, but this in turn yielded the following error:

    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1754, in load_dataset
    verification_mode = VerificationMode(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 339, in __call__
    return cls.__new__(cls, value)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 663, in __new__
    raise ve_exc
ValueError: 'none' is not a valid VerificationMode

Has anyone encountered such a problem, or knows what I can do?
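Not an authoritative fix, but one workaround worth trying, assuming a datasets version in which ignore_verifications has been superseded by the verification_mode argument, is to disable the split-size checks explicitly:

from datasets import load_dataset

# Skip the split-size verification that raises NonMatchingSplitsSizesError.
# Recent `datasets` releases accept "no_checks", "basic_checks", or "all_checks".
raw_train = load_dataset("openwebtext", split="train", verification_mode="no_checks")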

reproducing GPT3 results

Hi
thanks for sharing the code and for the great work!
To reproduce the GPT-3 results with the script loop_baseline_gpt2.sh, the output file ctx25_trunc150_depth1_ctrlr0.0_step1000_topp0.0_sad_gen.jsonl is needed. Could you also provide the files needed to run this script?
It would help a lot for checking our outputs.

thanks

data used for evaluation?

Would it be possible to share the generated text used to compute metrics for SSD-LM and baselines?
I am interested in doing some analysis of the outputs with respect to measures beyond those used in the paper and hope to avoid rerunning the full generation.

Thank you (and thank you for the generally well-documented code and interesting paper!).

question about metrics and zipf

Hi
thanks for the code and the great work!

  • The Zipf evaluation returns three numbers ("-s, -r, p"); could you clarify which of these is reported?
  • There is a metric called |\Delta log(PPL)| that I could not find among the implemented metrics; could you provide the code for computing it (see also the sketch under the earlier metrics question)? Thanks.

A few questions regarding the paper and the code.

Hi Xiaochuan,

Thanks for your wonderful work. It is really eye-opening to come up with the diffusion process on the logits space instead of the token embeddings!
I have a few questions regarding the paper and the code. Could you kindly respond to them?

  1. It is also possible to perform diffusion directly on the probability space; we would just need to re-normalize the probabilities by dividing by their sum at each timestep. Why do you choose to perform diffusion in the logit space instead of the probability space?
  2. Why don't you tie the parameter matrices of "embedding_sum_layer" and "model_embedding_lut"? You copy the parameters at the beginning of training but do not link them, so the two matrices may become different after training (see the sketch below).
  3. You did not use the transformer-style sine-cosine function for the timestep embedding. Is this choice based on empirical results or intuition?

Thank you so much!
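To make question 2 above concrete, here is a minimal sketch contrasting copying and tying, using hypothetical stand-ins for the two modules named in the question; the sizes are illustrative assumptions, not the repository's actual configuration.

import torch.nn as nn

vocab_size, hidden_size = 50265, 768  # illustrative sizes only

model_embedding_lut = nn.Embedding(vocab_size, hidden_size)           # weight: (vocab, hidden)
embedding_sum_layer = nn.Linear(vocab_size, hidden_size, bias=False)  # weight: (hidden, vocab)

# Copying (what the question describes): the matrices are equal only at initialization
# and are free to drift apart during training.
embedding_sum_layer.weight.data.copy_(model_embedding_lut.weight.data.t())

# Tying would instead share a single Parameter so the two stay identical throughout
# training; because the shapes are transposed, that requires an explicit transpose in
# the forward pass rather than a direct weight assignment.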

Question about SSD decoding algorithm

Hi,

Thank you for the great work and the well-documented code! I have a general question regarding the SSD decoding algorithm. In the paper you mention that "The DDPM decoding is designed for diffusion in a continuous space and failed to generate sensible outputs in our preliminary experiments based on simplexes.", with details in the appendix. What would explain the worse sampling performance of continuous DDPM decoding on simplexes, and what was the intuition behind designing the modified sampling procedure? For example, why does using a noise z instead of the deterministic z help?

Thank you in advance for your help!
