
ncsn's Introduction

Generative Modeling by Estimating Gradients of the Data Distribution

This repo contains the official implementation for the NeurIPS 2019 paper Generative Modeling by Estimating Gradients of the Data Distribution,

by Yang Song and Stefano Ermon (Stanford AI Lab).

Note: The method has been greatly stabilized by the subsequent work Improved Techniques for Training Score-Based Generative Models (code) and more recently extended by Score-Based Generative Modeling through Stochastic Differential Equations (code). This codebase is therefore no longer recommended for new projects.


We describe a new method of generative modeling based on estimating the gradient of the log density function (a.k.a. the Stein score) of the data distribution. We first perturb our training data with Gaussian noise of progressively smaller variances. Next, we estimate the score function of each perturbed data distribution by training a shared neural network, the Noise Conditional Score Network (NCSN), with score matching. We can then produce samples directly from the NCSN with annealed Langevin dynamics.
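For intuition, below is a minimal sketch of the annealed Langevin dynamics sampler described in the paper (Algorithm 1). The signature of score_net and the hyperparameter values are illustrative assumptions, not this repo's exact API.

import torch

@torch.no_grad()
def annealed_langevin_dynamics(score_net, x, sigmas, n_steps_each=100, step_lr=2e-5):
    # Sketch of Algorithm 1: run Langevin dynamics at each noise level, from the
    # largest sigma down to the smallest. score_net(x, labels) is assumed to return
    # the score at noise level sigmas[labels]; this mirrors the paper, not the repo's code.
    for i, sigma in enumerate(sigmas):                      # sigmas sorted in decreasing order
        labels = torch.full((x.shape[0],), i, dtype=torch.long, device=x.device)
        step_size = step_lr * (sigma / sigmas[-1]) ** 2     # alpha_i = eps * sigma_i^2 / sigma_L^2
        for _ in range(n_steps_each):
            noise = torch.randn_like(x)
            x = x + step_size / 2 * score_net(x, labels) + noise * step_size ** 0.5
    return x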

Dependencies

  • PyTorch

  • PyYAML

  • tqdm

  • pillow

  • tensorboardX

  • seaborn

Running Experiments

Project Structure

main.py is the common gateway to all experiments. Type python main.py --help to get its usage description.

usage: main.py [-h] [--runner RUNNER] [--config CONFIG] [--seed SEED]
               [--run RUN] [--doc DOC] [--comment COMMENT] [--verbose VERBOSE]
               [--test] [--resume_training] [-o IMAGE_FOLDER]

optional arguments:
  -h, --help            show this help message and exit
  --runner RUNNER       The runner to execute
  --config CONFIG       Path to the config file
  --seed SEED           Random seed
  --run RUN             Path for saving running related data.
  --doc DOC             A string for documentation purpose
  --verbose VERBOSE     Verbose level: info | debug | warning | critical
  --test                Whether to test the model
  --resume_training     Whether to resume training
  -o IMAGE_FOLDER, --image_folder IMAGE_FOLDER
                        The directory of image outputs

There are four runner classes.

  • AnnealRunner The main runner class for experiments related to NCSN and annealed Langevin dynamics.
  • BaselineRunner Compared to AnnealRunner, this one does not anneal the noise. Instead, it uses a single fixed noise variance.
  • ScoreNetRunner This is the runner class for reproducing the experiments of Figure 1 (middle, right).
  • ToyRunner This is the runner class for reproducing the experiments of Figure 2 and Figure 3.

Configuration files are stored in configs/. For example, the configuration file of AnnealRunner is configs/anneal.yml. Log files are commonly stored in run/logs/doc_name, and tensorboard files are in run/tensorboard/doc_name. Here doc_name is the value fed to option --doc.
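As a rough illustration only (the loading code and field names below are assumptions, not this repo's actual schema), a YAML config of this kind is typically parsed with PyYAML and exposed to the runner as an object:

import argparse
import yaml

# Hypothetical sketch: parse configs/anneal.yml into a namespace-like object.
with open('configs/anneal.yml') as f:
    raw = yaml.safe_load(f)             # nested dict, e.g. {'training': {...}, 'sampling': {...}}
config = argparse.Namespace(**raw)      # attribute access such as config.training (illustrative)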

Training

The usage of main.py is quite self-evident. For example, we can train an NCSN by running

python main.py --runner AnnealRunner --config anneal.yml --doc cifar10

Then the model will be trained according to the configuration file configs/anneal.yml. The log files will be stored in run/logs/cifar10, and the tensorboard logs are in run/tensorboard/cifar10.

Sampling

Suppose the log files are stored in run/logs/cifar10. We can produce samples into the folder samples by running

python main.py --runner AnnealRunner --test -o samples

Checkpoints

We provide pretrained checkpoints in run.zip. Extract the file to the root folder. You should be able to produce samples like the following using this checkpoint.

[Table of sampling-procedure animations for MNIST, CelebA, and CIFAR-10 (images omitted).]

Evaluation

Please refer to Appendix B.2 of our paper for details on hyperparameters and model selection. When computing Inception and FID scores, we first generate images from our model, then use the official code from OpenAI and the original code from the TTUR authors to obtain the scores.
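Purely as an illustrative alternative (not what was used for the paper's numbers), FID can also be estimated with torchmetrics, assuming real and generated samples are available as uint8 image tensors:

import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Illustrative only; the paper's scores were computed with the OpenAI / TTUR code instead.
fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)    # real_images: uint8 tensor of shape (N, 3, H, W), assumed given
fid.update(fake_images, real=False)   # fake_images: uint8 samples generated by the model
print(fid.compute())                  # scalar FID estimate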

References

Large parts of the code are derived from this GitHub repo (the official implementation of the sliced score matching paper).

If you find the code / idea inspiring for your research, please consider citing the following

@inproceedings{song2019generative,
  title={Generative Modeling by Estimating Gradients of the Data Distribution},
  author={Song, Yang and Ermon, Stefano},
  booktitle={Advances in Neural Information Processing Systems},
  pages={11895--11907},
  year={2019}
}

and / or

@inproceedings{song2019sliced,
  author    = {Yang Song and
               Sahaj Garg and
               Jiaxin Shi and
               Stefano Ermon},
  title     = {Sliced Score Matching: {A} Scalable Approach to Density and Score
               Estimation},
  booktitle = {Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial
               Intelligence, {UAI} 2019, Tel Aviv, Israel, July 22-25, 2019},
  pages     = {204},
  year      = {2019},
  url       = {http://auai.org/uai2019/proceedings/papers/204.pdf},
}


ncsn's Issues

Why is noise added twice?

Hi,

Thank you for the paper and code.

I notice that when training with the annealing strategy, the samples are corrupted twice -- the first time is here, and the second time is here. In my understanding, the second one corresponds to the { \sigma_1, ..., \sigma_L } described in the paper. May I ask why the first one is needed?

Thank you and look forward to your reply!
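For readers with the same question: one reading, sketched below with illustrative variable names (not the repo's exact code), is that the first corruption is uniform dequantization of the 8-bit pixel values, while the second is the Gaussian perturbation from the { \sigma_1, ..., \sigma_L } schedule used by the denoising score matching loss.

import torch

# Illustrative setup (not the repo's code): [0, 1]-valued images and a geometric noise schedule.
X = torch.randint(0, 256, (16, 3, 32, 32)).float() / 255.
sigmas = torch.logspace(0, -2, steps=10)            # sigma_1 = 1.0 down to sigma_L = 0.01

# (1) Uniform dequantization: the first corruption, spreading the discrete pixel values over [0, 1).
X = X / 256. * 255. + torch.rand_like(X) / 256.

# (2) Gaussian perturbation: the second corruption, implementing the {sigma_1, ..., sigma_L} schedule.
labels = torch.randint(0, len(sigmas), (X.shape[0],))
used_sigmas = sigmas[labels].view(X.shape[0], *([1] * (X.dim() - 1)))
X_tilde = X + torch.randn_like(X) * used_sigmas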

Annealed GMM Analytical Log Probabilities

This may be a bit pedantic, but I wonder if there is a bug in the way the log probabilities are calculated for the annealed GMM distribution. My understanding is that annealed sampling uses scores from the 'noise distribution', which should reflect a Gaussian-perturbed input. I'm assuming you achieve this by adding the noise distribution to the original GMM, hence allowing you to calculate the true log probabilities from the same GMM, now scaled with the noise sigmas.

However, the sum of two independent Gaussian random variables has a variance equal to the sum of the two variances, as shown (taken from Wikipedia):

[Formula: if X ~ N(mu_X, sigma_X^2) and Y ~ N(mu_Y, sigma_Y^2) are independent, then X + Y ~ N(mu_X + mu_Y, sigma_X^2 + sigma_Y^2).]

In this case sigma_x would be the original GMM sigma (which is set to 1), and sigma_y would correspond to the noise level. So shouldn't the calculation in the lines below use something like sigma = np.sqrt(sigma**2 + self.sigma**2) instead of just the sigma passed into the function? In my own tests, sampling with these updated sigmas gives results more faithful to the original distribution.

ncsn/models/gmm.py

Lines 45 to 46 in adb98fb

logps.append((-((samples - self.means[i]) ** 2).sum(dim=-1) / (2 * sigma ** 2) - 0.5 * np.log(
    2 * np.pi * sigma ** 2)) + self.mix_probs[i].log())
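A standalone sketch of the correction proposed above, assuming the annealed target is the original GMM convolved with N(0, noise_sigma^2) noise so that component variances add. The helper below is illustrative and not part of the repo:

import numpy as np
import torch

def perturbed_gmm_logp(samples, means, mix_probs, gmm_sigma, noise_sigma):
    # Log-density of an isotropic GMM convolved with N(0, noise_sigma^2 I) noise:
    # each component keeps its mean but its std becomes sqrt(gmm_sigma^2 + noise_sigma^2).
    # means: list of tensors; mix_probs: list of Python floats (assumed).
    sigma = np.sqrt(gmm_sigma ** 2 + noise_sigma ** 2)
    d = samples.shape[-1]
    logps = []
    for mean, pi in zip(means, mix_probs):
        logps.append(-((samples - mean) ** 2).sum(dim=-1) / (2 * sigma ** 2)
                     - 0.5 * d * np.log(2 * np.pi * sigma ** 2) + np.log(pi))
    return torch.logsumexp(torch.stack(logps, dim=0), dim=0)

Unlike the repo snippet above, this includes the full d-dimensional normalizer; the normalizer does not affect the scores, only the reported log probabilities.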

A question on convergence of DSM

Hey,

I'm currently tracing out the story of diffusion generative models, and right now I'm studying the denoising score matching (DSM) objective. I've noticed that your multi-scale approach relies heavily on it (and the original paper is quite old), so I decided to ask my question here.

I have gone through the theory of DSM and have a good grip on how and why it works. However, in practice I observe slow convergence (much slower than with ISM) on toy examples. In particular, I believe this might be due to the type of noise distribution selected. While not restricted, everyone seems to go with a normal distribution since it provides a simple derivative: 1/sigma**2 * (orig - perturbed). In practice, I've observed that the scale term in front causes the derivative to take values on the order of 1e4 for sigma = 1e-2, and the loss jumps around quite heavily. The smaller sigma is, the slower the convergence. The loss never actually decreases, but the resulting gradient field looks comparable to what ISM gives.

Did you observe this in your experiments as well?
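For context, the paper handles exactly this scale issue by weighting each noise level's DSM term with lambda(sigma) = sigma^2, which keeps the per-level losses on a comparable order of magnitude. A minimal sketch, with an assumed score_net(x, labels) signature rather than the repo's exact implementation:

import torch

def anneal_dsm_loss(score_net, X, sigmas, labels):
    # sigma^2-weighted denoising score matching (sketch of the paper's objective).
    used_sigmas = sigmas[labels].view(X.shape[0], *([1] * (X.dim() - 1)))
    X_tilde = X + torch.randn_like(X) * used_sigmas
    target = (X - X_tilde) / used_sigmas ** 2            # score of the Gaussian perturbation kernel
    scores = score_net(X_tilde, labels)
    per_sample = 0.5 * ((scores - target) ** 2).flatten(1).sum(dim=-1)
    return (per_sample * sigmas[labels] ** 2).mean()     # lambda(sigma) = sigma^2 cancels the 1/sigma^2 blow-up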

About the conditioning in the conditional score network

Hi,
Thanks for your great work and for sharing the code!
I have a little confusion after tracing the code.
In the original paper, the objective of Eq. (2) can be expanded as Eq. (5) under the denoising score matching scenario. In that equation, the score network produces the gradient of the log data density, conditioned on the sample and the noise scale sigma. The relevant snippet is on page 6:

[screenshot of the objective from page 6 of the paper]

However, in the code the score network computes a score conditioned on the data sample and an integer label; here is the training code. So which one is correct?

By the way, this is not a big issue :-)
Since a new noise conditioning technique is introduced in the later version (NCSNv2), where the score network is conditioned only on the sample, I just want to confirm the correctness of the old version.
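One way to see why the two are equivalent in the original code (an illustrative sketch with made-up values): the integer label simply indexes a fixed noise schedule, so it determines sigma_i uniquely, and conditioning on the label carries the same information as conditioning on sigma_i itself.

import torch

sigmas = torch.tensor([1.0, 0.6, 0.35, 0.2, 0.1])   # fixed schedule known at training time (values illustrative)
labels = torch.randint(0, len(sigmas), (8,))         # integer noise-level labels fed to the network
used_sigmas = sigmas[labels]                         # the label uniquely determines sigma_i, so
                                                     # s_theta(x, label) is effectively s_theta(x, sigma_i)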

Question about adding noise to input

Hi!
I am currently studying your NCSN project, and I am a bit confused about the handling of the input:
Line 136 of ncsn/runners/anneal_runner.py does "X = X / 256. * 255. + torch.rand_like(X) / 256." Why is this small uniform noise added to X here, when the predefined Gaussian noise is already added in the function "anneal_dsm_score_estimation"?
I am curious about why this is necessary and would appreciate any insights you could provide.

Custom dataset

Hi, thanks for the great work. Because I couldn't find any information about custom datasets, I wonder whether this repo supports training and generation on a custom dataset. If so, do I have to convert my images to (for example) the CIFAR-10 format?

Error loading pre-trained checkpoints

Hi,

Thank you for your great work. I downloaded the checkpoints on Google drive and tried to do sampling, but I got this error:

RuntimeError: Error(s) in loading state_dict for DataParallel:
	Missing key(s) in state_dict: "module.normalizer.alpha", "module.normalizer.gamma", "module.normalizer.beta", "module.res1.0.normalize2.alpha", "module.res1.0.normalize2.gamma"
....

I was wondering how you and others load the model wrapped with DataParallel. I can train the model from scratch and fix the saving, but I wanted to check if there is a better workaround.
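One common workaround, sketched below on the assumption that the mismatch comes from the DataParallel "module." key prefix (the checkpoint path and layout are illustrative): either wrap your model in torch.nn.DataParallel before calling load_state_dict, or remap the keys.

import torch
from collections import OrderedDict

states = torch.load('run/logs/cifar10/checkpoint.pth', map_location='cpu')   # path illustrative
state_dict = states[0] if isinstance(states, (list, tuple)) else states      # assumption about checkpoint layout

def strip_module_prefix(sd):
    # Drop a leading 'module.' so the weights load into a model that is not wrapped in DataParallel.
    return OrderedDict((k[len('module.'):] if k.startswith('module.') else k, v) for k, v in sd.items())

model.load_state_dict(strip_module_prefix(state_dict))      # model: your unwrapped score network (assumed defined)
# Alternatively: torch.nn.DataParallel(model).load_state_dict(state_dict)

If the missing keys persist even after the prefix is handled, the more likely cause is a config mismatch (e.g., a different normalization setting than the one the checkpoint was trained with).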

Baseline does not converge

I ran the baseline model with the default settings on CIFAR-10, but it does not seem to converge; the loss is huge, around 6,600,000. Is this normal, or is it just my run? If it is normal, why? From what I have seen, traditional Langevin dynamics can converge within a restricted number of steps, such as 50, so I'm quite curious why the authors set the number of sampling steps to 1000 with a relatively small step size.

Thanks for your excellent work; the annealed model converges well, by the way. I'm just trying to figure out why this method works. Could you please explain?

Is it feasible to directly calculate the DSM loss function without scorenet estimation?

I saw the implementation below of this formula. I wonder whether it is feasible to directly calculate the loss via torch.autograd operations if I define an energy function, like the logSumExp of classifier logits in the paper 'Your classifier is secretly an energy based model and you should treat it like one'.

ncsn/losses/dsm.py

Lines 5 to 15 in 7f27f4a

def dsm(energy_net, samples, sigma=1):
    samples.requires_grad_(True)
    vector = torch.randn_like(samples) * sigma
    perturbed_inputs = samples + vector
    logp = -energy_net(perturbed_inputs)
    dlogp = sigma ** 2 * autograd.grad(logp.sum(), perturbed_inputs, create_graph=True)[0]
    kernel = vector
    loss = torch.norm(dlogp + kernel, dim=-1) ** 2
    loss = loss.mean() / 2.
    return loss
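In principle, dsm() only needs a callable energy function, so a JEM-style logsumexp energy can be plugged in and torch.autograd handles the rest (note the excerpt above additionally relies on "from torch import autograd" at the top of the file). A hedged sketch with a placeholder classifier:

import torch
import torch.nn as nn
from torch import autograd   # needed by dsm() above

# Placeholder classifier; energy E(x) = -logsumexp_y f(x)[y], as in the JEM paper cited above.
classifier = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

def energy_net(x):
    return -torch.logsumexp(classifier(x), dim=-1)    # one scalar energy per sample

x = torch.rand(16, 784)                               # placeholder flattened batch
loss = dsm(energy_net, x, sigma=0.1)                  # double backprop via autograd.grad(..., create_graph=True)
loss.backward()                                       # gradients reach the classifier parameters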
