lucidrains / denoising-diffusion-pytorch
Implementation of Denoising Diffusion Probabilistic Model in Pytorch
License: MIT License
I'm pretty sure the answer is yes, and this is more of a sanity check: does the following line imply that the batches will be shuffled?
Is there something in the repository that treats mirrors about the y-axis as equivalent? I'm wondering if this is just an artifact of how the checkpoint sample-##.png images are stacked, or if those are the real outputs.
I'm using ~16k grayscale training images, prepared with the following setup:
conda create -n xtal2png-ddpm python==3.9.*
conda activate xtal2png-ddpm
pip install denoising_diffusion_pytorch xtal2png
from os import path
import torch
from denoising_diffusion_pytorch import GaussianDiffusion, Trainer, Unet
from mp_time_split.core import MPTimeSplit
from xtal2png.core import XtalConverter
mpt = MPTimeSplit()
mpt.load()
fold = 0
train_inputs, val_inputs, train_outputs, val_outputs = mpt.get_train_and_val_data(fold)
data_path = path.join("data", "preprocessed", "mp-time-split")
xc = XtalConverter(save_dir=data_path)
xc.xtal2png(train_inputs.tolist())
model = Unet(dim=64, dim_mults=(1, 2, 4, 8), channels=1).cuda()
diffusion = GaussianDiffusion(
    model, channels=1, image_size=64, timesteps=1000, loss_type="l1"
).cuda()
trainer = Trainer(
    diffusion,
    data_path,
    image_size=64,
    train_batch_size=32,
    train_lr=2e-5,
    train_num_steps=700000,  # total training steps
    gradient_accumulate_every=2,  # gradient accumulation steps
    ema_decay=0.995,  # exponential moving average decay
    amp=True,  # turn on mixed precision
)
trainer.train()
sampled_images = diffusion.sample(batch_size=100)
Notice how they're all oriented with the small square of zeros at the top-left (this is the case for all training data).
loss: 0.0535:   4%| | 25507/700000 [4:26:37<94:32:36, 1.98it/s]
Notice how many of them are mirrored about the y-axis, which is not desired. If it's just an artifact of how the images are stacked, that's one thing - but if it's the actual sampled images it's a bit worrisome. Note also that it never seems to do an x-axis mirror with the small square of zeros at the bottom-right or bottom-left.
cc my labmate @hasan-sayeed who has also been working on this
Curious if you have any thoughts on how this can be applied to something other than image data. In my case, it's a task in materials informatics.
The cuda call in the training loop
while self.step < self.train_num_steps:
    for i in range(self.gradient_accumulate_every):
        data = next(self.dl).cuda()  # here
        ....
leads to a CUDA error when running on CPU: AssertionError: Torch not compiled with CUDA enabled.
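A hedged sketch of a device-agnostic workaround (not the repo's official fix): resolve the device once from availability and move batches with .to(device) instead of the hard-coded .cuda().

import torch

# pick the device once, based on what is actually available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# inside the Trainer loop, the offending line would then read something like:
# data = next(self.dl).to(device)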
It seems that "groups" for the Unet aren't used. It's not passed to the ResNetBlock modules nor the Block in the final_conv.
Dear @lucidrains ,
Thanks for the great work.
I changed the image_size to (256, 256)
and encountered the following error.
It would be highly appreciated if you could give me some guidance on fixing this error.
trainer.train()
File "/home/diffusion/denoising-diffusion-pytorch/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 566, in train
all_images_list = list(map(lambda n: self.ema_model.sample(batch_size=n), batches))
File "/home/diffusion/denoising-diffusion-pytorch/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 566, in <lambda>
all_images_list = list(map(lambda n: self.ema_model.sample(batch_size=n), batches))
File "/home/diffusion/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/diffusion/denoising-diffusion-pytorch/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 398, in sample
return self.p_sample_loop((batch_size, channels, image_size, image_size))
File "/home/diffusion/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
return func(*args, **kwargs)
File "/home/diffusion/denoising-diffusion-pytorch/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 388, in p_sample_loop
img = torch.randn(shape, device=device)
TypeError: randn(): argument 'size' must be tuple of ints, but found element of type tuple at pos 3
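For what it's worth, the failure appears to be simply that image_size is passed through as a tuple, which then ends up nested inside the shape built in p_sample_loop. Since (256, 256) is square anyway, a hedged sketch of a setup that avoids the error is to pass a plain int:

from denoising_diffusion_pytorch import Unet, GaussianDiffusion

# hedged sketch: this version appears to expect a single int for image_size,
# so a square 256x256 run would pass 256 rather than the tuple (256, 256)
model = Unet(dim = 64, dim_mults = (1, 2, 4, 8))
diffusion = GaussianDiffusion(
    model,
    image_size = 256,   # int, not (256, 256)
    timesteps = 1000,
)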
Hello, I downloaded your latest commit. I feel that the reconstruction quality is much worse than with the code I downloaded in March. Have you observed the same phenomenon?
Thanks a lot!
In the "Improved Denoising Diffusion Probabilistic Models" paper, the authors claim that cosine schedule of beta makes alpha-bar change more smoothly, leading to better results. Then I wonder why not linearly schedule alpha-bar, instead of beta? We can compute beta from alpha-bar. Anyone tried that?
Hey there,
After training the model, I try to sample images with
sampled_images = diffusion.sample(128, batch_size = 750).
My question is: do we get new, unique images every time we execute the above line of code? That is, is the first batch of 750 images different from the batch I get the second time I sample?
Best
It looks like it only supports single-GPU training right now. Do you have any plans to add a multi-GPU training option any time soon? Much appreciated.
Thanks for all the work you shared.
I found a possible bug: the cosine_beta_schedule
function should be:
def cosine_beta_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0, 0.999)
Now the function is:
def cosine_beta_schedule(timesteps, s = 0.008):
    steps = timesteps + 1
    x = torch.linspace(0, steps, steps)
    alphas_cumprod = torch.cos(((x / steps) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0, 0.999)
Hi everyone,
Is there a way to compute the negative log-likelihood using the function discretized_gaussian_log_likelihood(x, *, means, log_scales, thres = 0.999)
defined in the learned_gaussian_diffusion file?
What are the parameters of the function, and how do I get them after training the model?
Hi, thanks for porting the TF project to torch and open-sourcing the code.
Looking at the linear attention code, it seems quite different from standard self-attention in at least two ways:
The images must be square, since image_size determines both the height and the width.
But what if the dataset is not square, like CelebA or the Fashion Product Images Dataset?
Thanks
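One common workaround (not a feature of this repo, just a sketch): make the data square before training, e.g. resize the short side and center-crop, so image_size can stay a single int. The transform below assumes torchvision, and the target size of 128 is an arbitrary example.

from torchvision import transforms as T

image_size = 128
square_transform = T.Compose([
    T.Resize(image_size),       # resize so the shorter side == image_size
    T.CenterCrop(image_size),   # crop to image_size x image_size
    T.ToTensor(),
])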
Hi, thanks for the useful repo. I tried the code on a single spiral image. After 10k training steps, the model still doesn't converge. How could this happen? I also don't understand why the artifacts occur around the border.
I am showing the results (left) and the training image (right) below.
I'm using the default setting for the loss and the lr.
Hi, thanks for your excellent project!
I am new to diffusion models and I am training a vanilla diffusion model on a small dataset (100 images). After training for 3000 epochs (around 10k iterations) with an initial learning rate of 5e-5, the model reaches a small training loss but still produces very noisy images, like the attached sample.
How can I resolve the problem? Should I train for longer or increase the learning rate? Are there any useful tricks for debugging?
Appreciate any help!
Could you please provide the code by which the "sample.png" is generated (with the dataset path)?
Or, even better, sample code with MNIST.
Thanks
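Not an official example, but a rough sketch of what an MNIST run could look like, mirroring the snippet earlier on this page: dump MNIST to a folder of PNGs, then point the Trainer at that folder. The paths and hyperparameters below are placeholders.

import os
from torchvision import datasets
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

# export MNIST digits as grayscale PNGs so the Trainer's folder-based Dataset can read them
mnist_dir = "data/mnist_png"
os.makedirs(mnist_dir, exist_ok=True)
for i, (img, _) in enumerate(datasets.MNIST(root="data", download=True)):
    img.save(os.path.join(mnist_dir, f"{i:05d}.png"))

model = Unet(dim=64, dim_mults=(1, 2, 4), channels=1).cuda()
diffusion = GaussianDiffusion(
    model, channels=1, image_size=28, timesteps=1000, loss_type="l1"
).cuda()
trainer = Trainer(
    diffusion,
    mnist_dir,
    image_size=28,
    train_batch_size=32,
    train_lr=2e-5,
    train_num_steps=100000,
)
trainer.train()
sampled_images = diffusion.sample(batch_size=16)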
In denoising_diffusion_pytorch/denoising_diffusion_pytorch.py, the Trainer class
takes an argument update_ema_every = 10
and sets self.update_ema_every = update_ema_every,
but this is then never used: self.ema.update()
is called every step.
Suggested fix:
if self.step != 0 and self.step % self.update_ema_every == 0:
    self.ema.update()
EDIT: I see ema_pytorch can take an update_every argument, so a better fix is to switch self.ema = EMA(diffusion_model, beta = ema_decay) to self.ema = EMA(diffusion_model, beta = ema_decay, update_every = self.update_ema_every).
Not sure if this is just a temporary issue due to the EMA bits being partly moved to ema_pytorch but figured I should point it out in case it got missed :)
All of my outputs from the diffusion model, following the code snippets, are bounded between -1 and 1. Is this meant to be the case? I couldn't find anything obvious in this repo that would cause that.
My images are all square, matching the image size that's passed into the GaussianDiffusion
model, and the pixel values are between 0 and 255. Whenever I tried to display the images returned by model.sample(...)
I was getting what looked like noise, and I also noticed that during training my loss is stuck around 0.8 after 2.5k iterations, as shown below (image size: 128, batch: 32, l1 loss, lr: 0.0001, time_steps: 4000 instead of the default 1000; I saw similar behaviour with the default time steps):
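A hedged sketch for inspecting the samples, assuming your version's sample() really does return values in [-1, 1] as described (diffusion being the trained GaussianDiffusion instance): map them back to [0, 1] before saving, since torchvision's save_image expects floats in that range.

from torchvision.utils import save_image

samples = diffusion.sample(batch_size = 16)     # (16, C, H, W), reportedly in [-1, 1]
samples = (samples.clamp(-1, 1) + 1) * 0.5      # [-1, 1] -> [0, 1]
save_image(samples, "samples.png", nrow = 4)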
Hi!
Thank you so much for your nice work.
I have a question regarding the ResNet structure that you used in the code. Reading the ResNet architecture in Jonathan Ho's repository, I saw a somewhat different implementation there. Could you please let me know what enhancements you applied on top of his implementation?
Thanks.
I don't think this was meant to stay in the code.
It's printing "hmmm" on every ResnetBlock forward.
Edit: Turns out it does have a channels argument, it just wasn't shown in the readme. Never mind.
I'd like to train on grayscale images without wasting computation on three channels. I reckon other folks might like the option of 4 channels for RGBA, or who knows what else.
Hi,
May I ask a question about the DDPM (based on this implementation)?
I am a bit confused about its training loss: loss = (noise - x_recon).abs().mean().
Is noise just random noise? I am confused about why the loss is based on random noise here. Forgive my misunderstanding; could you please explain a bit?
Thanks for your help.
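For context, here is a minimal, self-contained sketch of the epsilon-prediction objective this loss corresponds to (toy shapes and a stand-in model, not the repo's code): the network is trained to predict the exact noise that was mixed into x_start, so the target of the loss really is the sampled random noise.

import torch

timesteps = 1000
betas = torch.linspace(1e-4, 0.02, timesteps)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)

x_start = torch.randn(8, 1, 32, 32)                      # a toy batch of "images"
t = torch.randint(0, timesteps, (8,))
a = alphas_cumprod[t].view(-1, 1, 1, 1)

noise = torch.randn_like(x_start)                        # the random noise
x_noisy = a.sqrt() * x_start + (1 - a).sqrt() * noise    # forward diffusion q(x_t | x_0)

model = torch.nn.Conv2d(1, 1, 3, padding=1)              # stand-in for the U-Net (which also takes t)
predicted_noise = model(x_noisy)
loss = (noise - predicted_noise).abs().mean()            # the L1 loss quoted above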
Hi, I would like to know if there is a pre-trained model that we can play with?
I noticed the trainer reports the loss and step manually with print.
Moving to tqdm or something of the sort, to match the usual PyTorch conventions, would be a quality-of-life boost.
I'd be happy to make the update if needed.
Hi,
I was wondering why every diffusion model implementation uses this specific sampling procedure.
When I look at the DDPM paper, they give the sampling procedure in Algorithm 2.
However, it seems that no implementation follows it directly; instead they take the more involved route of first predicting the noise, then calculating x_0, then the mean and log-variance, and then constructing x_{t-1} from that.
I implemented the above algorithm while using your codebase:
@torch.no_grad()
def my_sample(self, n):
    x = torch.randn((n, 3, self.image_size, self.image_size)).to(self.device)
    for i in tqdm(reversed(range(1, self.num_timesteps)), position=0):
        t = (torch.ones(n) * i).long().to(self.device)
        predicted_noise = self.denoise_fn(x, t)
        beta = extract(self.betas, t, x.shape)
        alpha_hat = extract(self.alphas_cumprod, t, x.shape)
        alpha = 1. - beta
        if i > 1:
            noise = torch.randn_like(x)
        else:
            noise = torch.zeros_like(x)
        x = 1 / torch.sqrt(alpha) * (x - ((1 - alpha) / (torch.sqrt(1 - alpha_hat))) * predicted_noise) + torch.sqrt(beta) * noise
        x = x.clamp(-1., 1.)
    return x.add(1).mul(0.5)
But the results are just gray images with a bit of shape and colour:
(top: the normal sampling, as in your code; bottom: using the above sampling function)
Do you have any idea why this kind of sampling does not work?
What does the last term of the loss, -log p(x_0 | x_1), mean? It seems similar to the log-likelihood of a single data point from a VAE's ELBO. If that's what it is, I'm puzzled how to interpret the conditional | x_1.
Also, the paper's authors mention that this L_0 term is included in the L_simple loss that they used, saying:
The t = 1 case corresponds to L0 with the integral in the discrete decoder definition (13) approximated by the Gaussian probability density function times the bin width, ignoring σ₁² and edge effects.
How does this correspondence work?
It seems that even though you implemented a LinearAttention module, you are not actually using it, since every time you instantiate it you use
Residual(Rezero(LinearAttention(mid_dim)))
where the Rezero module discards its input parameter, resulting in an identity operation.
Thanks for sharing your code. When reading the code alongside the DDPM paper, I find two different ways of obtaining the sample:
- compute the mean directly, then use the mean to do the sampling;
- first predict x_start, then use x_start and x_t to compute the mean, then use the mean to do the sampling.
I don't think these two ways are the same in the formulation, which confuses me a lot. Can you give me some suggestions about it? Thanks.
Thank you for such great work.
I have a question about p2_loss_weight
here.
# calculate p2 reweighting
register_buffer("p2_loss_weight", (p2_loss_weight_k + alphas_cumprod / (1 - alphas_cumprod)) ** -p2_loss_weight_gamma)
The comment says it follows the original paper.
Eq. 5 of the paper is:
But your implementation is:
I don't understand how you derived the formula used in the implementation.
Thank you.
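For what it's worth, one possible way to reconcile the two, assuming Eq. 5 is the P2 weighting with SNR(t) defined as alpha-bar over one minus alpha-bar (this is my reading of the paper, not something stated in this repo):

% hedged reading: the registered buffer is the 1/(k + SNR(t))^gamma factor
\lambda'_t = \frac{\lambda_t}{\bigl(k + \mathrm{SNR}(t)\bigr)^{\gamma}}, \qquad
\mathrm{SNR}(t) = \frac{\bar\alpha_t}{1-\bar\alpha_t}
\;\Longrightarrow\;
\Bigl(k + \frac{\bar\alpha_t}{1-\bar\alpha_t}\Bigr)^{-\gamma}
  = \frac{1}{\bigl(k+\mathrm{SNR}(t)\bigr)^{\gamma}}

i.e. the buffer would be the 1/(k + SNR(t))^gamma part, with the remaining lambda_t factor coming from the unweighted simple loss.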
Dear @lucidrains ,
Thanks for the great repo.
I was wondering whether the model is simply reproducing training set images.
Here is a saved sample from during training. The results look very good, but I found that all of the images are training images.
In denoising_diffusion_pytorch.py, the Trainer class __init__ has an option image_size with a default value of 128, and these arguments are not documented anywhere I could locate.
The issue then comes later inside the __init__ method, where you set:
self.image_size = diffusion_model.image_size
however, when the Dataset instance is declared, it is initialized with:
self.ds = Dataset(folder, image_size, augment_horizontal_flip = augment_horizontal_flip)
So, if one originally created a diffusion_model with an image size of 64 but does not know to set the argument on the Trainer, a default image size of 128 will be used and cause a conflict.
I believe it can be fixed by simply declaring the Dataset with self.image_size instead of image_size.
Hi!
What loss values (L1 or L2) should I expect for a properly trained model? And which loss usually performs better?
I just started training this model, but I noticed that it takes an enormous number of hours.
According to the README, checkpoints are saved during training, but I don't know how to use one to start the training process again.
Could you tell me how to restart training?
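If I'm reading the Trainer correctly, it saves milestones as model-<milestone>.pt in the results folder and has a load method keyed by that milestone number, so resuming would look roughly like the sketch below (paths, hyperparameters, and the milestone number are placeholders and must match your original run):

from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer

# rebuild the model/diffusion/trainer exactly as in the original run...
model = Unet(dim=64, dim_mults=(1, 2, 4, 8))
diffusion = GaussianDiffusion(model, image_size=128, timesteps=1000)
trainer = Trainer(diffusion, "path/to/images", train_batch_size=32)

# ...then load the saved milestone (e.g. model-10.pt in ./results) and keep training
trainer.load(10)
trainer.train()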
File "/usr/local/lib/python3.6/dist-packages/denoising_diffusion_pytorch/denoising_diffusion_pytorch.py", line 358, in cosine_beta_schedule
alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * torch.pi * 0.5) ** 2
AttributeError: module 'torch' has no attribute 'pi'
I'm guessing this doesn't support some versions of pytorch?
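If it's just the missing torch.pi attribute (older PyTorch versions don't define it), a drop-in workaround is to use math.pi, mirroring the version of the schedule quoted earlier on this page:

import math
import torch

def cosine_beta_schedule(timesteps, s=0.008):
    # same cosine schedule, but with math.pi so it also runs on older PyTorch versions
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0, 0.999)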
Unfortunately I can't really wrap my head around what the major benefit is. Any short explanation?
It seems that the image loading in the code is slow.
For example, I think the code should load the training images in parallel.
Current:
self.dl = cycle(data.DataLoader(self.ds, batch_size = train_batch_size, shuffle=True, pin_memory=True))
After:
self.dl = cycle(data.DataLoader(self.ds, batch_size = train_batch_size, num_workers=os.cpu_count(), shuffle=True, pin_memory=True))
May I make a pull request?
Recently I have been trying to use the diffusion model for a 1-D vector generation task, such as generating sentence embeddings originally produced by BERT. I have some questions about it.
Thanks for your suggestions.
Looking at your ConvNextBlock, I noticed your LayerNorm is different from torch.nn.LayerNorm(). Yours appears to normalize over the channel dimension, whereas in Liu et al. 2022 they specifically mention layer normalization. Using a true LayerNorm would put a crimp on the model definition, since the image-plane dimensions would have to be fixed when the model is defined, but if Liu et al. are right it 'should' improve models.
Is there a reason, which I may have missed in the papers, why you are normalizing along channels?
Thanks.
Thanks for all the work you shared.
I noticed the recent update in the DDPM code and found a possible bug.
The line was previously:
From my understanding, the padding should be applied on the left instead of the right, which may require the modification:
alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value = 1.)
to make them equivalent.
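As a quick sanity check of the padding direction (toy values, not from the repo), F.pad with (1, 0) prepends the value on the left, which is what gives the shifted-by-one alphas_cumprod_prev starting at 1:

import torch
import torch.nn.functional as F

alphas_cumprod = torch.tensor([0.9, 0.8, 0.7])
# (1, 0) pads one element on the left of the last dimension
print(F.pad(alphas_cumprod[:-1], (1, 0), value = 1.))   # tensor([1.0000, 0.9000, 0.8000])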
Hi,
first of all, thanks for this repository and for your work. I'm a student working with this code for a university project. My aim is to replicate the results reported in this paper using the same parameters.
I tried to run the code on Colab, but the computation time per step is really large (e.g. 1000 steps in 4 hours); how can I reduce the computation time per step? Is there some optimization I can do to speed up the training?
Any suggestions (also from other users, not only the creator of this repo) would be appreciated. Thank you!
P.S. I'm using CIFAR10 as the dataset.
I was recently thinking of doing the same, but was wondering if you have tried it and seen any actual benefit?
I am also currently using the U-Net architecture from OpenAI's improved/guided diffusion.
Thanks
Hello,
Do you think it is possible to train a DDPM with images that have different resolutions, and therefore sample images at different resolutions from the model?
I am currently trying it, but the samples don't seem good. Do you see any changes to the current architecture/training that would address this issue?
Thanks for your help.
Hi Phil,
Thanks for all your amazing repos. @kashif and I wrote a blog post about your diffusion repo, where we go over your code step by step. We called it the "Annotated Diffusion Model".
https://huggingface.co/blog/annotated-diffusion
Let me know if you like it or whether there's anything to be improved!
Is it possible to use this for conditional generation?
I'm training a model on illustrations with the new ConvNext modules, and it's not giving me the results I expected from training with the same parameters and data before the ConvNext change. Previously it used to create lines and shapes of solid color, but now there aren't any smooth lines in the output; it looks like fuzzy scribbles instead.
Previous version without ConvNext: https://i.imgur.com/qeF0dNn.png
Current version with ConvNext: https://i.imgur.com/S3ierK5.png
These outputs are typical of the old and new code across the different models I trained with differing parameters. Without ConvNext I get lines and shapes; with ConvNext I get lots of fuzzy but hard lines/edges.
It seems like it may be very slowly becoming less fuzzy over time, but given the amount of time that has to be put into training, I want to be sure that it's definitely going to work and that these changes have actually produced good results.
For example, if I set 700k training steps and stop at ~30k with 15k training images, is there a way to tell how many times the training data has been cycled through?
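A rough way to estimate it, assuming the default train_batch_size=32 and gradient_accumulate_every=2 from the README snippet (your actual settings may differ):

# back-of-the-envelope epoch count under the assumed defaults
train_batch_size = 32
gradient_accumulate_every = 2
steps_done = 30_000
dataset_size = 15_000

images_seen = steps_done * train_batch_size * gradient_accumulate_every
epochs = images_seen / dataset_size
print(epochs)   # 30_000 * 64 / 15_000 = 128 passes over the data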