Code Monkey home page Code Monkey logo

nvae's Introduction

The Official PyTorch Implementation of "NVAE: A Deep Hierarchical Variational Autoencoder" (NeurIPS 2020 Spotlight Paper)



NVAE is a deep hierarchical variational autoencoder that enables training SOTA likelihood-based generative models on several image datasets.

Requirements

NVAE is built in Python 3.7 using PyTorch 1.6.0. Use the following command to install the requirements:

pip install -r requirements.txt

Set up file paths and data

We have examined NVAE on several datasets. For large datasets, we store the data in LMDB datasets for I/O efficiency. Click below on each dataset to see how you can prepare your data. Below, $DATA_DIR indicates the path to a data directory that will contain all the datasets and $CODE_DIR refers to the code directory:

MNIST and CIFAR-10

These datasets will be downloaded automatically, when you run the main training for NVAE using train.py for the first time. You can use --data=$DATA_DIR/mnist or --data=$DATA_DIR/cifar10, so that the datasets are downloaded to the corresponding directories.

CelebA 64 Run the following commands to download the CelebA images and store them in an LMDB dataset:
cd $CODE_DIR/scripts
python create_celeba64_lmdb.py --split train --img_path $DATA_DIR/celeba_org --lmdb_path $DATA_DIR/celeba64_lmdb
python create_celeba64_lmdb.py --split valid --img_path $DATA_DIR/celeba_org --lmdb_path $DATA_DIR/celeba64_lmdb
python create_celeba64_lmdb.py --split test  --img_path $DATA_DIR/celeba_org --lmdb_path $DATA_DIR/celeba64_lmdb

Above, the images will be downloaded to $DATA_DIR/celeba_org automatically and then then LMDB datasets are created at $DATA_DIR/celeba64_lmdb.

ImageNet 32x32

Run the following commands to download tfrecord files from GLOW and to convert them to LMDB datasets

mkdir -p $DATA_DIR/imagenet-oord
cd $DATA_DIR/imagenet-oord
wget https://storage.googleapis.com/glow-demo/data/imagenet-oord-tfr.tar
tar -xvf imagenet-oord-tfr.tar
cd $CODE_DIR/scripts
python convert_tfrecord_to_lmdb.py --dataset=imagenet-oord_32 --tfr_path=$DATA_DIR/imagenet-oord/mnt/host/imagenet-oord-tfr --lmdb_path=$DATA_DIR/imagenet-oord/imagenet-oord-lmdb_32 --split=train
python convert_tfrecord_to_lmdb.py --dataset=imagenet-oord_32 --tfr_path=$DATA_DIR/imagenet-oord/mnt/host/imagenet-oord-tfr --lmdb_path=$DATA_DIR/imagenet-oord/imagenet-oord-lmdb_32 --split=validation
CelebA HQ 256

Run the following commands to download tfrecord files from GLOW and to convert them to LMDB datasets

mkdir -p $DATA_DIR/celeba
cd $DATA_DIR/celeba
wget https://storage.googleapis.com/glow-demo/data/celeba-tfr.tar
tar -xvf celeba-tfr.tar
cd $CODE_DIR/scripts
python convert_tfrecord_to_lmdb.py --dataset=celeba --tfr_path=$DATA_DIR/celeba/celeba-tfr --lmdb_path=$DATA_DIR/celeba/celeba-lmdb --split=train
python convert_tfrecord_to_lmdb.py --dataset=celeba --tfr_path=$DATA_DIR/celeba/celeba-tfr --lmdb_path=$DATA_DIR/celeba/celeba-lmdb --split=validation
FFHQ 256

Visit this Google drive location and download images1024x1024.zip. Run the following commands to unzip the images and to store them in LMDB datasets:

mkdir -p $DATA_DIR/ffhq
unzip images1024x1024.zip -d $DATA_DIR/ffhq/
cd $CODE_DIR/scripts
python create_ffhq_lmdb.py --ffhq_img_path=$DATA_DIR/ffhq/images1024x1024/ --ffhq_lmdb_path=$DATA_DIR/ffhq/ffhq-lmdb --split=train
python create_ffhq_lmdb.py --ffhq_img_path=$DATA_DIR/ffhq/images1024x1024/ --ffhq_lmdb_path=$DATA_DIR/ffhq/ffhq-lmdb --split=validation
LSUN

We use LSUN datasets in our follow-up works. Visit LSUN for instructions on how to download this dataset. Since the LSUN scene datasets come in the LMDB format, they are ready to be loaded using torchvision data loaders.

Running the main NVAE training and evaluation scripts

We use the following commands on each dataset for training NVAEs on each dataset for Table 1 in the paper. In all the datasets but MNIST normalizing flows are enabled. Check Table 6 in the paper for more information on training details. Note that for the multinode training (more than 8-GPU experiments), we use the mpirun command to run the training scripts on multiple nodes. Please adjust the commands below according to your setup. Below IP_ADDR is the IP address of the machine that will host the process with rank 0 (see here). NODE_RANK is the index of each node among all the nodes that are running the job.

MNIST

Two 16-GB V100 GPUs are used for training NVAE on dynamically binarized MNIST. Training takes about 21 hours.

export EXPR_ID=UNIQUE_EXPR_ID
export DATA_DIR=PATH_TO_DATA_DIR
export CHECKPOINT_DIR=PATH_TO_CHECKPOINT_DIR
export CODE_DIR=PATH_TO_CODE_DIR
cd $CODE_DIR
python train.py --data $DATA_DIR/mnist --root $CHECKPOINT_DIR --save $EXPR_ID --dataset mnist --batch_size 200 \
        --epochs 400 --num_latent_scales 2 --num_groups_per_scale 10 --num_postprocess_cells 3 --num_preprocess_cells 3 \
        --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 --num_latent_per_group 20 --num_preprocess_blocks 2 \
        --num_postprocess_blocks 2 --weight_decay_norm 1e-2 --num_channels_enc 32 --num_channels_dec 32 --num_nf 0 \
        --ada_groups --num_process_per_node 2 --use_se --res_dist --fast_adamax 
CIFAR-10

Eight 16-GB V100 GPUs are used for training NVAE on CIFAR-10. Training takes about 55 hours.

export EXPR_ID=UNIQUE_EXPR_ID
export DATA_DIR=PATH_TO_DATA_DIR
export CHECKPOINT_DIR=PATH_TO_CHECKPOINT_DIR
export CODE_DIR=PATH_TO_CODE_DIR
cd $CODE_DIR
python train.py --data $DATA_DIR/cifar10 --root $CHECKPOINT_DIR --save $EXPR_ID --dataset cifar10 \
        --num_channels_enc 128 --num_channels_dec 128 --epochs 400 --num_postprocess_cells 2 --num_preprocess_cells 2 \
        --num_latent_scales 1 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 \
        --num_preprocess_blocks 1 --num_postprocess_blocks 1 --num_groups_per_scale 30 --batch_size 32 \
        --weight_decay_norm 1e-2 --num_nf 1 --num_process_per_node 8 --use_se --res_dist --fast_adamax 
CelebA 64

Eight 16-GB V100 GPUs are used for training NVAE on CelebA 64. Training takes about 92 hours.

export EXPR_ID=UNIQUE_EXPR_ID
export DATA_DIR=PATH_TO_DATA_DIR
export CHECKPOINT_DIR=PATH_TO_CHECKPOINT_DIR
export CODE_DIR=PATH_TO_CODE_DIR
cd $CODE_DIR
python train.py --data $DATA_DIR/celeba64_lmdb --root $CHECKPOINT_DIR --save $EXPR_ID --dataset celeba_64 \
        --num_channels_enc 64 --num_channels_dec 64 --epochs 90 --num_postprocess_cells 2 --num_preprocess_cells 2 \
        --num_latent_scales 3 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 \
        --num_preprocess_blocks 1 --num_postprocess_blocks 1 --weight_decay_norm 1e-1 --num_groups_per_scale 20 \
        --batch_size 16 --num_nf 1 --ada_groups --num_process_per_node 8 --use_se --res_dist --fast_adamax
ImageNet 32x32

24 16-GB V100 GPUs are used for training NVAE on ImageNet 32x32. Training takes about 70 hours.

export EXPR_ID=UNIQUE_EXPR_ID
export DATA_DIR=PATH_TO_DATA_DIR
export CHECKPOINT_DIR=PATH_TO_CHECKPOINT_DIR
export CODE_DIR=PATH_TO_CODE_DIR
export IP_ADDR=IP_ADDRESS
export NODE_RANK=NODE_RANK_BETWEEN_0_TO_2
cd $CODE_DIR
mpirun --allow-run-as-root -np 3 -npernode 1 bash -c \
        'python train.py --data $DATA_DIR/imagenet-oord/imagenet-oord-lmdb_32 --root $CHECKPOINT_DIR --save $EXPR_ID --dataset imagenet_32 \
        --num_channels_enc 192 --num_channels_dec 192 --epochs 45 --num_postprocess_cells 2 --num_preprocess_cells 2 \
        --num_latent_scales 1 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 \
        --num_preprocess_blocks 1 --num_postprocess_blocks 1 --num_groups_per_scale 28 \
        --batch_size 24 --num_nf 1 --warmup_epochs 1 \
        --weight_decay_norm 1e-2 --weight_decay_norm_anneal --weight_decay_norm_init 1e0 \
        --num_process_per_node 8 --use_se --res_dist \
        --fast_adamax --node_rank $NODE_RANK --num_proc_node 3 --master_address $IP_ADDR '
CelebA HQ 256

24 32-GB V100 GPUs are used for training NVAE on CelebA HQ 256. Training takes about 94 hours.

export EXPR_ID=UNIQUE_EXPR_ID
export DATA_DIR=PATH_TO_DATA_DIR
export CHECKPOINT_DIR=PATH_TO_CHECKPOINT_DIR
export CODE_DIR=PATH_TO_CODE_DIR
export IP_ADDR=IP_ADDRESS
export NODE_RANK=NODE_RANK_BETWEEN_0_TO_2
cd $CODE_DIR
mpirun --allow-run-as-root -np 3 -npernode 1 bash -c \
        'python train.py --data $DATA_DIR/celeba/celeba-lmdb --root $CHECKPOINT_DIR --save $EXPR_ID --dataset celeba_256 \
        --num_channels_enc 30 --num_channels_dec 30 --epochs 300 --num_postprocess_cells 2 --num_preprocess_cells 2 \
        --num_latent_scales 5 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 \
        --num_preprocess_blocks 1 --num_postprocess_blocks 1 --weight_decay_norm 1e-2 --num_groups_per_scale 16 \
        --batch_size 4 --num_nf 2 --ada_groups --min_groups_per_scale 4 \
        --weight_decay_norm_anneal --weight_decay_norm_init 1. --num_process_per_node 8 --use_se --res_dist \
        --fast_adamax --num_x_bits 5 --node_rank $NODE_RANK --num_proc_node 3 --master_address $IP_ADDR '

In our early experiments, a smaller model with 24 channels instead of 30, could be trained on only 8 GPUs in the same time (with the batch size of 6). The smaller models obtain only 0.01 bpd higher negative log-likelihood.

FFHQ 256

24 32-GB V100 GPUs are used for training NVAE on FFHQ 256. Training takes about 160 hours.

export EXPR_ID=UNIQUE_EXPR_ID
export DATA_DIR=PATH_TO_DATA_DIR
export CHECKPOINT_DIR=PATH_TO_CHECKPOINT_DIR
export CODE_DIR=PATH_TO_CODE_DIR
export IP_ADDR=IP_ADDRESS
export NODE_RANK=NODE_RANK_BETWEEN_0_TO_2
cd $CODE_DIR
mpirun --allow-run-as-root -np 3 -npernode 1 bash -c \
        'python train.py --data $DATA_DIR/ffhq/ffhq-lmdb --root $CHECKPOINT_DIR --save $EXPR_ID --dataset ffhq \
        --num_channels_enc 30 --num_channels_dec 30 --epochs 200 --num_postprocess_cells 2 --num_preprocess_cells 2 \
        --num_latent_scales 5 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 \
        --num_preprocess_blocks 1 --num_postprocess_blocks 1 --weight_decay_norm 1e-1  --num_groups_per_scale 16 \
        --batch_size 4 --num_nf 2  --ada_groups --min_groups_per_scale 4 \
        --weight_decay_norm_anneal --weight_decay_norm_init 1. --num_process_per_node 8 --use_se --res_dist \
        --fast_adamax --num_x_bits 5 --learning_rate 8e-3 --node_rank $NODE_RANK --num_proc_node 3 --master_address $IP_ADDR '

In our early experiments, a smaller model with 24 channels instead of 30, could be trained on only 8 GPUs in the same time (with the batch size of 6). The smaller models obtain only 0.01 bpd higher negative log-likelihood.

If for any reason your training is stopped, use the exact same commend with the addition of --cont_training to continue training from the last saved checkpoint. If you observe NaN, continuing the training using this flag usually will not fix the NaN issue.

Known Issues

Cannot build CelebA 64 or training gives NaN right at the beginning on this dataset

Several users have reported issues building CelebA 64 or have encountered NaN at the beginning of training on this dataset. If you face similar issues on this dataset, you can download this dataset manually and build LMDBs using instructions on this issue #2 .

Getting NaN after a few epochs of training

One of the main challenges in training very deep hierarchical VAEs is training instability that we discussed in the paper. We have verified that the settings in the commands above can be trained in a stable way. If you modify the settings above and you encounter NaN after a few epochs of training, you can use these tricks to stabilize your training: i) increase the spectral regularization coefficient, --weight_decay_norm. ii) Use exponential decay on --weight_decay_norm using --weight_decay_norm_anneal and --weight_decay_norm_init. iii) Decrease learning rate.

Training freezes with no NaN

In some very rare cases, we observed that training freezes after 2-3 days of training. We believe the root cause of this is because of a racing condition that is happening in one of the low-level libraries. If for any reason the training is stopped, kill your current run, and use the exact same commend with the addition of --cont_training to continue training from the last saved checkpoint.

Monitoring the training progress

While running any of the commands above, you can monitor the training progress using Tensorboard:

Click here
tensorboard --logdir $CHECKPOINT_DIR/eval-$EXPR_ID/

Above, $CHECKPOINT_DIR and $EXPR_ID are the same variables used for running the main training script.

Post-training sampling, evaluation, and checkpoints

Evaluating Log-Likelihood

You can use the following command to load a trained model and evaluate it on the test datasets:

cd $CODE_DIR
python evaluate.py --checkpoint $CHECKPOINT_DIR/eval-$EXPR_ID/checkpoint.pt --data $DATA_DIR/mnist --eval_mode=evaluate --num_iw_samples=1000

Above, --num_iw_samples indicates the number of importance weighted samples used in evaluation. $CHECKPOINT_DIR and $EXPR_ID are the same variables used for running the main training script. Set --data to the same argument that was used when training NVAE (our example is for MNIST).

Sampling

You can also use the following command to generate samples from a trained model:

cd $CODE_DIR
python evaluate.py --checkpoint $CHECKPOINT_DIR/eval-$EXPR_ID/checkpoint.pt --eval_mode=sample --temp=0.6 --readjust_bn

where --temp sets the temperature used for sampling and --readjust_bn enables readjustment of the BN statistics as described in the paper. If you remove --readjust_bn, the sampling will proceed with BN layer in the eval mode (i.e., BN layers will use running mean and variances extracted during training).

Computing FID

You can compute the FID score using 50K samples. To do so, you will need to create a mean and covariance statistics file on the training data using a command like:

cd $CODE_DIR
python scripts/precompute_fid_statistics.py --data $DATA_DIR/cifar10 --dataset cifar10 --fid_dir /tmp/fid-stats/

The command above computes the references statistics on the CIFAR-10 dataset and stores them in the --fid_dir durectory. Given the reference statistics file, we can run the following command to compute the FID score:

cd $CODE_DIR
python evaluate.py --checkpoint $CHECKPOINT_DIR/eval-$EXPR_ID/checkpoint.pt --data $DATA_DIR/cifar10 --eval_mode=evaluate_fid  --fid_dir /tmp/fid-stats/ --temp=0.6 --readjust_bn

where --temp sets the temperature used for sampling and --readjust_bn enables readjustment of the BN statistics as described in the paper. If you remove --readjust_bn, the sampling will proceed with BN layer in the eval mode (i.e., BN layers will use running mean and variances extracted during training). Above, $CHECKPOINT_DIR and $EXPR_ID are the same variables used for running the main training script. Set --data to the same argument that was used when training NVAE (our example is for MNIST).

Checkpoints

We provide checkpoints on MNIST, CIFAR-10, CelebA 64, CelebA HQ 256, FFHQ in this Google drive directory. For CIFAR10, we provide two checkpoints as we observed that a multiscale NVAE provides better qualitative results than a single scale model on this dataset. The multiscale model is only slightly worse in terms of log-likelihood (0.01 bpd). We also observe that one of our early models on CelebA HQ 256 with 0.01 bpd worse likelihood generates much better images in low temperature on this dataset.

You can use the commands above to evaluate or sample from these checkpoints.

How to construct smaller NVAE models

In the commands above, we are constructing big NVAE models that require several days of training in most cases. If you'd like to construct smaller NVAEs, you can use these tricks:

  • Reduce the network width: --num_channels_enc and --num_channels_dec are controlling the number of initial channels in the bottom-up and top-down networks respectively. Recall that we halve the number of channels with every spatial downsampling layer in the bottom-up network, and we double the number of channels with every upsampling layer in the top-down network. By reducing --num_channels_enc and --num_channels_dec, you can reduce the overall width of the networks.

  • Reduce the number of residual cells in the hierarchy: --num_cell_per_cond_enc and --num_cell_per_cond_dec control the number of residual cells used between every latent variable group in the bottom-up and top-down networks respectively. In most of our experiments, we are using two cells per group for both networks. You can reduce the number of residual cells to one to make the model smaller.

  • Reduce the number of epochs: You can reduce the training time by reducing --epochs.

  • Reduce the number of groups: You can make NVAE smaller by using a smaller number of latent variable groups. We use two schemes for setting the number of groups:

    1. An equal number of groups: This is set by --num_groups_per_scale which indicates the number of groups in each scale of latent variables. Reduce this number to have a small NVAE.

    2. An adaptive number of groups: This is enabled by --ada_groups. In this case, the highest resolution of latent variables will have --num_groups_per_scale groups and the smaller scales will get half the number of groups successively (see groups_per_scale in utils.py). We don't let the number of groups go below --min_groups_per_scale. You can reduce the total number of groups by reducing --num_groups_per_scale and --min_groups_per_scale when --ada_groups is enabled.

Understanding the implementation

If you are modifying the code, you can use the following figure to map the code to the paper.

Traversing the latent space

We can generate images by traversing in the latent space of NVAE. This sequence is generated using our model trained on CelebA HQ, by interpolating between samples generated with temperature 0.6. Some artifacts are due to color quantization in GIFs.

License

Please check the LICENSE file. NVAE may be used non-commercially, meaning for research or evaluation purposes only. For business inquiries, please contact [email protected].

You should take into consideration that VAEs are trained to mimic the training data distribution, and, any bias introduced in data collection will make VAEs generate samples with a similar bias. Additional bias could be introduced during model design, training, or when VAEs are sampled using small temperatures. Bias correction in generative learning is an active area of research, and we recommend interested readers to check this area before building applications using NVAE.

Bibtex:

Please cite our paper, if you happen to use this codebase:

@inproceedings{vahdat2020NVAE,
  title={{NVAE}: A Deep Hierarchical Variational Autoencoder},
  author={Vahdat, Arash and Kautz, Jan},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2020}
}

nvae's People

Contributors

arash-vahdat avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

nvae's Issues

Importance of pre processing the Gaussian parameters ?

Hi, thank you for your outstanding work in making VAEs great again !

My question is about the pre processing of Gaussian parameters in distributions.py:

def soft_clamp5(x: torch.Tensor):
        return x.div_(5.).tanh_().mul(5.)    #  5. * torch.tanh(x / 5.) <--> soft differentiable clamp between [-5, 5]

[...]

        self.mu = soft_clamp5(mu)
        log_sigma = soft_clamp5(log_sigma)
        self.sigma = torch.exp(log_sigma) + 1e-2

I don't think this is discussed in the paper, what is the role of this pre processing ?
It seems to be linked with the model's stability when I remove it. Do you have results on the relationship between this and the other stabilization methods discussed in the paper ?

Thank you

i am getting this error when try to run train.py file

_norm_anneal=False, weight_decay_norm_init=10.0)
01/04 09:33:02 AM (Elapsed: 00:00:23) param size = 6.482514M
01/04 09:33:02 AM (Elapsed: 00:00:23) groups per scale: [10], total_groups: 10
01/04 09:33:02 AM (Elapsed: 00:00:23) epoch 0
/usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run /usr/bin/nvidia-modprobe --help for
usage information.

/usr/bin/nvidia-modprobe: unrecognized option: "-s"

ERROR: Invalid commandline, please run /usr/bin/nvidia-modprobe --help for
usage information.

CelebA 64

First of all I would like to say thank you for the great VAE implementation. Looking forward to Celeb256 training instructions!
I tried preprocessing CelebA64 but got an error when executing create_celeba64_lmdb.py. The way you obtain the dataset (dset.celeba.CelebA) downloads corrupted img_align_celeba.zip, I was able to fix the problem by replacing 'dset.celeba.CelebA' with 'dset.CelebA'. According to the official API this is correct way to do it (https://pytorch.org/docs/stable/torchvision/datasets.html#celeba).

Why some of the generate images by the official checkpoint of CelebA64 are NaN-value?

Great work!
I have observed that some generated images by official checkpoint of CelebA64 are NaN . This might be caused by "Batch Norm Statistics Readjusted" introduced by the paper, becuase without BN statistics readjusted, none of the generated images are NaN (but the generated iamges quality have a sharp decline). Do you have this problem and how to deal with it?
Thank you for the excelllent paper and the great repo!

Generating Encoder Output from Images

Hi!
I would like to explore the latent variables for given images, in order to cluster images based on their latent space representations.
Which output do I have to take to get the variables?
Do I have to take z after applying normalizing flows, so around model.py#L366?

Thanks for your help!
Marc

PS: Great paper, and thanks for the helpful reference implementation!

Recontructed images visualization

Hello,
I'm especially interested in the reconstruction of the images given as input to the model, and I would like to get and visualize the reconstructed images. To do so, is it the right way to use output_img = output.sample() after logits, log_q, log_p, kl_all, _ = model(x) and output = model.decoder_output(logits) in the train and test functions ? Would you recommend to take several samples for the same image ?
Thank you for the work and the release,
Elsa

CelebA-HQ 256x256 Data Pre-processing

Thank you team for the sharing the project resources. I am trying to process the CelebA-HQ 256x256 dataset for the DDGAN model. The DDGAN repository recommends going over the dataset preparation methods in the NVAE repository (this repository).


The following commands will download tfrecord files from GLOW and convert them to store them in an LMDB dataset.

Use the link by openai/glow for downloading the CelebA-HQ 256x256 dataset (4 Gb).
To convert/store the CelebA-HQ 256x256 dataset to/as the lmdb dataset one needs to install module called "tfrecord".
The missing module error can be rectified by simply executing the command pip install tfrecord.



!mkdir -p $DATA_DIR/celeba
%cd $DATA_DIR/celeba
!wget https://openaipublic.azureedge.net/glow-demo/data/celeba-tfr.tar
!tar -xvf celeba-tfr.tar
%cd $CODE_DIR/scripts
!pip install tfrecord
!python convert_tfrecord_to_lmdb.py --dataset=celeba --tfr_path=$DATA_DIR/celeba/celeba-tfr --lmdb_path=$DATA_DIR/celeba/celeba-lmdb --split=train



The final command !python convert_tfrecord_to_lmdb.py --dataset=celeba --tfr_path=$DATA_DIR/celeba/celeba-tfr --lmdb_path=$DATA_DIR/celeba/celeba-lmdb --split=train gives the following output:

.
.
.
26300
26400
26500
26600
26700
26800
26900
27000
added 27000 items to the LMDB dataset.
Traceback (most recent call last):
  File "convert_tfrecord_to_lmdb.py", line 73, in <module>
    main(args.dataset, args.split, args.tfr_path, args.lmdb_path)
  File "convert_tfrecord_to_lmdb.py", line 58, in main
    print('added %d items to the LMDB dataset.' % count)
lmdb.Error: mdb_txn_commit: Disk quota exceeded


I am not sure I have made the LMDB dataset properly, I request you to guide me.

FFHQ Training

Thank you for sharing the implementation of the DDGAN model. I am trying to train the model on FFHQ 256x256 dataset. I used the NVLabs/NVAE repository for the dataset preparation. I have the file structure as follows:

image

To use another dataset similar to the CelebA-HQ 256x256, I modified the train function given in the line 190 of the train_ddgan.py file.

    elif args.dataset == 'ffhq_256':
        train_transform = transforms.Compose([
                transforms.Resize(args.image_size),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                transforms.Normalize((0.5,0.5,0.5), (0.5,0.5,0.5))
            ])
        dataset = LMDBDataset(root='/datasets/ffhq-lmdb/', name='ffhq', train=True, transform=train_transform)


My implementation for the DDGAN uses 4 NVIDIA GTX 1080ti GPUs with a total batch size of 32 for training the CelebA-HQ 256x256 dataset

(--batch_size 8 and --num_process_per_node 4)

I use the following command for training!python3 train_ddgan.py --dataset ffhq_256 --image_size 256 --exp ddgan_celebahq_exp1 --num_channels 3 --num_channels_dae 64 --ch_mult 1 1 2 2 4 4 --num_timesteps 2 --num_res_blocks 2 --batch_size 8 --num_epoch 800 --ngf 64 --embedding_type positional --use_ema --r1_gamma 2. --z_emb_dim 256 --lr_d 1e-4 --lr_g 2e-4 --lazy_reg 10 --num_process_per_node 4 --save_content



I am getting the following output message:

Node rank 0, local proc 0, global proc 0
Node rank 0, local proc 1, global proc 1
Node rank 0, local proc 2, global proc 2
Node rank 0, local proc 3, global proc 3
Process Process-4:
Traceback (most recent call last):
  File "/usr/local/apps/python-3.8.3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/apps/python-3.8.3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "train_ddgan.py", line 482, in init_processes
    fn(rank, gpu, args)
  File "train_ddgan.py", line 248, in train
    dataset = LMDBDataset(root='/datasets/ffhq-lmdb/', name='ffhq', train=True, transform=train_transform)
  File "/home/manisha.padala/gan/denoising-diffusion-gan/datasets_prep/lmdb_datasets.py", line 33, in __init__
    self.data_lmdb = lmdb.open(lmdb_path, readonly=True, max_readers=1,
lmdb.Error: /datasets/ffhq-lmdb/train.lmdb: No such file or directory
Process Process-2:
Traceback (most recent call last):
  File "/usr/local/apps/python-3.8.3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/apps/python-3.8.3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "train_ddgan.py", line 482, in init_processes
    fn(rank, gpu, args)
  File "train_ddgan.py", line 248, in train
    dataset = LMDBDataset(root='/datasets/ffhq-lmdb/', name='ffhq', train=True, transform=train_transform)
  File "/home/manisha.padala/gan/denoising-diffusion-gan/datasets_prep/lmdb_datasets.py", line 33, in __init__
    self.data_lmdb = lmdb.open(lmdb_path, readonly=True, max_readers=1,
lmdb.Error: /datasets/ffhq-lmdb/train.lmdb: No such file or directory
Process Process-1:
Traceback (most recent call last):
  File "/usr/local/apps/python-3.8.3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/apps/python-3.8.3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "train_ddgan.py", line 482, in init_processes
    fn(rank, gpu, args)
  File "train_ddgan.py", line 248, in train
    dataset = LMDBDataset(root='/datasets/ffhq-lmdb/', name='ffhq', train=True, transform=train_transform)
  File "/home/manisha.padala/gan/denoising-diffusion-gan/datasets_prep/lmdb_datasets.py", line 33, in __init__
    self.data_lmdb = lmdb.open(lmdb_path, readonly=True, max_readers=1,
lmdb.Error: /datasets/ffhq-lmdb/train.lmdb: No such file or directory
Process Process-3:
Traceback (most recent call last):
  File "/usr/local/apps/python-3.8.3/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/apps/python-3.8.3/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "train_ddgan.py", line 482, in init_processes
    fn(rank, gpu, args)
  File "train_ddgan.py", line 248, in train
    dataset = LMDBDataset(root='/datasets/ffhq-lmdb/', name='ffhq', train=True, transform=train_transform)
  File "/home/manisha.padala/gan/denoising-diffusion-gan/datasets_prep/lmdb_datasets.py", line 33, in __init__
    self.data_lmdb = lmdb.open(lmdb_path, readonly=True, max_readers=1,
lmdb.Error: /datasets/ffhq-lmdb/train.lmdb: No such file or directory

Problems occurred in training CelebA64 data

Update:
--num_process_per_node 8 denotes the gpu number !!! So i need to change it to 1.

The first silly question asked under a great project, just laughed, I hope everyone succeeds


Hi, dear Arash Vahdat ,

NVAE is a great job! We are excited to meet this official implement.

After hesitating for two days, I still can’t help but ask, any friend else has successfully reproduce this implement in private machine?

During my running process, there still some errors, and I'm not sure whether this is purely resulted from my gpu.

Here is the traceback message:

(hsj-torch-gpu16) hsj@hsj:/data/hsj/NVAE$ python train.py --data ./scripts/data1/datasets/celeba_org/celeba64_lmdb --root ./CHECKPOINT_DIR --save ./EXPR_ID --dataset celeba_64 --num_channels_enc 32 --num_channels_dec 32 --epochs 90 --num_postprocess_cells 2 --num_preprocess_cells 2 --num_latent_scales 3 --num_latent_per_group 20 --num_cell_per_cond_enc 1 --num_cell_per_cond_dec 1 --num_preprocess_blocks 1 --num_postprocess_blocks 1 --weight_decay_norm 1e-1 --num_groups_per_scale 5 --batch_size 1 --num_nf 1 --ada_groups --num_process_per_node 8 --use_se --res_dist --fast_adamax Experiment dir : ./CHECKPOINT_DIR/eval-./EXPR_ID Node rank 0, local proc 0, global proc 0 Node rank 0, local proc 1, global proc 1 Node rank 0, local proc 2, global proc 2 Node rank 0, local proc 3, global proc 3 Node rank 0, local proc 4, global proc 4 Node rank 0, local proc 5, global proc 5 Node rank 0, local proc 6, global proc 6 Node rank 0, local proc 7, global proc 7 Process Process-1: Traceback (most recent call last): File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 93, in run self._target(*self._args, **self._kwargs) File "train.py", line 280, in init_processes dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=size) File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group store, rank, world_size = next(rendezvous_iterator) File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout) RuntimeError: Address already in use THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal Process Process-3: Traceback (most recent call last): File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 93, in run self._target(*self._args, **self._kwargs) File "train.py", line 279, in init_processes torch.cuda.set_device(args.local_rank) File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/site-packages/torch/cuda/__init__.py", line 281, in set_device torch._C._cuda_setDevice(device) RuntimeError: cuda runtime error (101) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59 THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal Process Process-4: Traceback (most recent call last): File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 93, in run self._target(*self._args, **self._kwargs) File "train.py", line 279, in init_processes torch.cuda.set_device(args.local_rank) File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/site-packages/torch/cuda/__init__.py", line 281, in set_device torch._C._cuda_setDevice(device) RuntimeError: cuda runtime error (101) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59 THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal Process 
Process-5: Traceback (most recent call last): File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 93, in run self._target(*self._args, **self._kwargs) File "train.py", line 279, in init_processes torch.cuda.set_device(args.local_rank) File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/site-packages/torch/cuda/__init__.py", line 281, in set_device torch._C._cuda_setDevice(device) RuntimeError: cuda runtime error (101) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59 THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal Process Process-6: Traceback (most recent call last): File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 93, in run self._target(*self._args, **self._kwargs) File "train.py", line 279, in init_processes torch.cuda.set_device(args.local_rank) File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/site-packages/torch/cuda/__init__.py", line 281, in set_device torch._C._cuda_setDevice(device) RuntimeError: cuda runtime error (101) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59 THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal THCudaCheck FAIL file=/pytorch/torch/csrc/cuda/Module.cpp line=59 error=101 : invalid device ordinal Process Process-8: Process Process-7: Traceback (most recent call last): File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 93, in run self._target(*self._args, **self._kwargs) File "train.py", line 279, in init_processes torch.cuda.set_device(args.local_rank) File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/site-packages/torch/cuda/__init__.py", line 281, in set_device torch._C._cuda_setDevice(device) RuntimeError: cuda runtime error (101) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59 Traceback (most recent call last): File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 93, in run self._target(*self._args, **self._kwargs) File "train.py", line 279, in init_processes torch.cuda.set_device(args.local_rank) File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/site-packages/torch/cuda/__init__.py", line 281, in set_device torch._C._cuda_setDevice(device) RuntimeError: cuda runtime error (101) : invalid device ordinal at /pytorch/torch/csrc/cuda/Module.cpp:59 Process Process-2: Traceback (most recent call last): File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap self.run() File "/home/bjfu/anaconda3/envs/hsj-torch-gpu16/lib/python3.6/multiprocessing/process.py", line 93, in run self._target(*self._args, **self._kwargs) File "train.py", line 281, in init_processes fn(args) File "train.py", line 42, in main model = AutoEncoder(args, writer, arch_instance) File "/data/hsj/NVAE/model.py", line 163, in __init__ self.init_normal_sampler(mult) 
File "/data/hsj/NVAE/model.py", line 270, in init_normal_sampler nf_cells.append(PairedCellAR(self.num_latent_per_group, num_c1, num_c2, arch)) File "/data/hsj/NVAE/model.py", line 93, in __init__ self.cell1 = CellAR(num_z, num_ftr, num_c, arch, mirror=False) File "/data/hsj/NVAE/model.py", line 66, in __init__ self.conv = ARInvertedResidual(num_z, num_ftr, ex=ex, mirror=mirror) File "/data/hsj/NVAE/neural_ar_operations.py", line 147, in __init__ layers.extend([ARConv2d(inz, hidden_dim, kernel_size=3, padding=1, masked=True, mirror=mirror, zero_diag=True), File "/data/hsj/NVAE/neural_ar_operations.py", line 87, in __init__ self.mask = torch.from_numpy(create_conv_mask(kernel_size, C_in, groups, C_out, zero_diag, mirror)).cuda() RuntimeError: CUDA error: out of memory (hsj-torch-gpu16) bjfu@bjfu-15043:/data/hsj/NVAE$ lspci -vnn | grep -A6 "VGA" File "/data/hsj/NVAE/neural_ar_operations.py", line 147, in __init__ layers.extend([ARConv2d(inz, hidden_dim, kernel_size=3, padding=1, masked=True, mirror=mirror, zero_diag=True), File "/data/hsj/NVAE/neural_ar_operations.py", line 87, in __init__ self.mask = torch.from_numpy(create_conv_mask(kernel_size, C_in, groups, C_out, zero_diag, mirror)).cuda() RuntimeError: CUDA error: out of memory

Ps. Due to the GoogleDrive problem, I download the data specifically, and added into /data1 file ,then convert them into lmdb type data.

My first instinct is that the gpu memory is not enough, so I have reduced the batch parameters and model parameters as much as possible, as you can see in the first command line, however, it still don't work.

Ps. Devices Information:
`
NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2

2 * 1080ti ,while second one is running other job, only device 0 available

torch==1.6.0
torchvision==0.7.0
`

This NVAE is a great breakthrough job, expecting that we all can reproduce this job and get more inspirations from this.

Looking forward to get any useful suggestion,

Sincerely

Luke Huang

Understanding the relationship between the code and the paper

I think this work is likely to win the outstanding paper award. I am looking forward to the oral presentation.

First of all, the code works out-of-the-box. I am currently training on MNIST and the results are reasonable. I do have a couple of questions, which I think would benefit others as well. Some of the questions might be trivial, given that I am not an expert in VAEs.

  1. From the paper, Figure 2 (caption): "... and h is a trainable parameter":
    The description of h does not appear again in the paper (I might have missed it). What is it? Does it correspond to self.prior_ftr0 in line147?

  2. What are the magic numbers (multiplication and subtraction) in line 334? Is this basically transforming the intensity range [0.0, 1.0] to [-1.0, 1.0]?

  3. Do combiner_enc cells in line 344 correspond to the red ⊕ symbols along the encoder path in the in Figure 2?

  4. What does the enc0 function (or equivalently ftr) in line 355 represent? Is this the initial diamond residual layer that immediately follows x in Figure 2?

  5. What is the function of pre-processing layers and more precisely down_pre and normal_pre in the init_pre_process function in line 201? I can tell they have something to do with the bottom-up and top-down paths in Figure 2 but I am not sure what they do.

  6. Similar question to 5 regarding "post-processing" layers. What is their function? Why do you use them?

  7. Would it be fair to say that the latent representation of an image using the encoder network would be the mean of each residual normal distribution? If that is the case, would this essentially be mu_q in line 394 (corresponding to z1 in the decoder path) and z in line 396 for z2 and subsequent latents?

Thank you for sharing your code. I wish you the best.

Query: CelebA HQ 256

I am trying to implement the DDGAN paper. The authors ask the users to refer to this repository to for the dataset preparation.

The following command is suggested for downloading the dataset from openai's glow project:-
mkdir -p $DATA_DIR/celeba
cd $DATA_DIR/celeba

wget https://storage.googleapis.com/glow-demo/data/celeba-tfr.tar
tar -xvf celeba-tfr.tar
cd $CODE_DIR/scripts

python convert_tfrecord_to_lmdb.py --dataset=celeba --tfr_path=$DATA_DIR/celeba/celeba-tfr --lmdb_path=$DATA_DIR/celeba/celeba-lmdb --split=train

python convert_tfrecord_to_lmdb.py --dataset=celeba --tfr_path=$DATA_DIR/celeba/celeba-tfr --lmdb_path=$DATA_DIR/celeba/celeba-lmdb --split=validation

However when I implement the commands, I face the following errors:
`--2022-07-27 09:03:47-- https://storage.googleapis.com/glow-demo/data/celeba-tfr.tar
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.4.128, 172.217.194.128, 142.251.10.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.4.128|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2022-07-27 09:03:48 ERROR 404: Not Found.

tar: celeba-tfr.tar: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
`

Prafulla Dhariwal has addressed the some related errors here openai/glow/issues/1 for reference.

Celeba64 training yielding nan during training

Hello Dr. Vahdat,

This is indeed impressive work !!! However I am struggling with the training process.

Using Pytorch 1.6.0 with cuda 10.1

Training using 4 (Not V-100) GPUs of size ~12GB each. Reduced batch size to 8 to fit memory. No other changes apart from this. Followed the instructions exactly as given in Readme. But the training logs show that I am get "nan" losses
image

Is there any other pre-processing step I need to do for the dataset? Perhaps any other minor detail which you felt was irrelevant to mention in the readme? Any help you can provide is greatly appreciated.

why is there self.prior_ftr0 in the decoder model?

Hello, great work! I am trying to understand NVAE, but I could not figure out the role of self.prior_ftr0 or h block in Fig2 in the decoder model. Correct me if im wrong, but i could not find any details regarding h in the paper; and the code defines it to be a random vector generated in the initialization of the model and then being used in the forward pass to kickstart the decoder model. my question is why did you chose this design rather than starting with just z1? Is there some part that Im missing here?

thank you so much for the great paper and repo.

Query: Dataset CelebA-HQ 256x256 issue

When i run the folllowing command to store the CelebA-HQ 256x256 dataset in an LMDB dataset,
python convert_tfrecord_to_lmdb.py --dataset=celeba --tfr_path=$DATA_DIR/celeba/celeba-tfr --lmdb_path=$DATA_DIR/celeba/celeba-lmdb --split=train

I am getting the following error:
Traceback (most recent call last): File "convert_tfrecord_to_lmdb.py", line 12, in <module> from tfrecord.torch.dataset import TFRecordDataset ModuleNotFoundError: No module named 'tfrecord'

I found only the following github issue relevant to the query: vahidk/tfrecord#1

I request you to please guide me.

Can you provide pretrained models?

Hi, I have no that powerful resources to train the models (2GPUs), could you provide pretrained models for evaluations purposes regarding the latent space? Thanks.

NaN values in gradients

Hi, in my experiment, I used Moving-MNIST dataset. But here are my problems during training that I couldn't find an answer:

I tried to play with a small network by using only num_latent_scale=1 and num_groups_per_scale=1. Then I realized there were no gradients generated for parameters including prior.ftr0 and an error was given to stop the training.

If I increase num_groups_per_scale from 1 to 2 or more, I still got Nan in some of the gradients in the first iteration, then they went away, but the training continues without errors.

I'm wondering if you could provide some hint or clue to why such behavior happens? Thank you in advance!

FID score?

Hi thanks for the great work!

I am just wondering do you have plans for releasing the FID scores on varying datasets for NVAE?

Thanks!

num_latent_scales on ImageNet 32x32 dataset

Thanks for sharing your code. I have found that the hyper-parameter 'num_latent_scales' on ImageNet 32x32 and CIFAR-10 has been set to be 1, which is quite smaller than that of 5 on FFHQ and CelebA HQ dataset. What will happen if we set this number as 5 on ImageNet 32x32 dataset ? Thanks~

Query: FFHQ Pre-Processing

Thank you sir for sharing the scripts for dataset preparation. I am trying to implement the DDGAN model on the FFHQ 256x256 dataset. I have used the FFHQ 256x256 resized dataset from the kaggle since the FFHQ 1024x1024 dataset has a size of 90 GB, which exceeds the limits of my resources.


The Kaggle dataset has the files in archive.zip file, which has a directory "resized" which contains the 70k .jpg files.


The file structure is as follows:
archive.zip
├  resized
├ (70k images)


I am using google drive and colab notebooks for the implementation. I am using the file setup with CODE_DIR = "/content/drive/MyDrive/Repositories/NVAE" and DATA_DIR = "/content/drive/MyDrive/Repositories/NVAE/dataset_nvae". When I try to run the command !python create_ffhq_lmdb.py --ffhq_img_path=$DATA_DIR/ffhq/resized/ --ffhq_lmdb_path=$DATA_DIR/ffhq/ffhq-lmdb --split=train, I get the following error message:

Traceback (most recent call last):
  File "create_ffhq_lmdb.py", line 70, in <module>
    main(args.split, args.ffhq_img_path, args.ffhq_lmdb_path)
  File "create_ffhq_lmdb.py", line 46, in main
    im = Image.open(img_path)
  File "/usr/local/lib/python3.7/dist-packages/PIL/Image.py", line 2843, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/Repositories/NVAE/dataset_nvae/ffhq/resized/55962.png'

NomalDecoder & num_bits

Hi and thanks for sharing the code.

I am doing some tests with medical images, 16-bit and one channel input, therefore employing the NormalDecoder.
The implementation takes as optional parameter num_bits but is not used as for the DiscMixLogistic. Is the implementation complete? or is just included for some kind of legacy?

Thanks in advance!

Question about KL computation

In distributions.py, the KL is computed as indicated in section 3.2 of the paper (residual normal distributions, Equation 2):

def kl(self, normal_dist):
    term1 = (self.mu - normal_dist.mu) / normal_dist.sigma
    term2 = self.sigma / normal_dist.sigma

    return 0.5 * (term1 * term1 + term2 * term2) - 0.5 - torch.log(term2) 

What I don't understand is, why you compute term2 = self.sigma / normal_dist.sigma. Shouldn't it be:
term2 = self.sigma - normal_dist.sigma?

FFHQ model checkpoint leads to out of memory even on single image inference

I am trying to play with the results of the model with the pretrained checkpoints, however, I continue to get out of memory errors when passing even a single image through the model for inference.

I load the state_dict and args of the FFHQ pretrained checkpoint into the model and pass in a single image of shape (1,3,256,256), dtype torch.float32, with values between [0, 1] (per the LMDB DataLoader) into the model's forward pass. I am using a Tesla V100 16GB, torch==1.6.0, CUDA 11.0, and Ubuntu 18.04.

I did a bit of debugging -- it looks like this Out of Memory error occurs when passing the image through the loop over the dec_tower.

Is there any reason, algorithmically, that the information for a single 256x256 image should consume more 16GB of GPU memory? Is there a memory leak somewhere?

Any insight and guidance on how to fix this would be appreciated.

No rendezvous handler for env://

As per the readme file, I have changed some values in the arguments to make the model easier to train. Since, I have only one gpu, I have change num_process_per_node to 1.

But I am getting this error.

python train.py --data D:\Project\TAMU\Hierarchical_Variational_Autoencoders\NVAE#start-of-content\mnist --root D:\Project\TAMU\Hierarchical_Variational_Autoencoders\NVAE#start-of-content\CHECKPOINT\ --save 01 --dataset mnist --batch_size 200 --epochs 100 --num_latent_scales 2 --num_groups_per_scale 10 --num_postprocess_cells 3 --num_preprocess_cells 3 --num_cell_per_cond_enc 1 --num_cell_per_cond_dec 1 --num_latent_per_group 20 --num_preprocess_blocks 2 --num_postprocess_blocks 2 --weight_decay_norm 1e-2 --num_channels_enc 16 --num_channels_dec 16 --num_nf 0 --ada_groups --num_process_per_node 1 --use_se --res_dist --fast_adamax

Experiment dir : D:\Project\TAMU\Hierarchical_Variational_Autoencoders\NVAE#start-of-content\CHECKPOINT/eval-01
starting in debug mode
Traceback (most recent call last):
File "train.py", line 415, in
init_processes(0, size, main, args)
File "train.py", line 280, in init_processes
dist.init_process_group(backend='nccl', init_method='env://', rank=rank, world_size=size)
File "D:\ProgramData\Anaconda3\lib\site-packages\torch\distributed\distributed_c10d.py", line 421, in init_process_group
init_method, rank, world_size, timeout=timeout
File "D:\ProgramData\Anaconda3\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for env://

I am new to PyTorch too, used Tensorflow till now.

Can you please tell me what type of error is this and how can I solve it?

Tips for latent space manipulation and interpolation?

Hi there, I was wondering if I can ask a question and get a bit of help about the representation power of NVAE. I would really appreciate it.

For one scenario, if I have two people, A, and B, and I get their representations, Z_a, and Z_b (all 25 Z's for 3 scales), I can somewhat change from person A to person B, by doing an interpolation such that new Z_a = (1-alpha) * Z_a + (alpha) * Z_b and decode, though the representation is not too smooth and alpha needs to be high (0.8) to see a change. I was wondering if the authors have tried something like this and see something similar?

Now, I only have Z_a, encoded from a person A. But unfortunately, I cannot modify Z_a by any means, to get a different person C, no matter what I add to the Z_a. If I add a large direction to it, I get a grainier/noisier image, but not a new person C.

I have also tried disabling flows but no luck. I feel that perhaps the mu and sigma coming from the encoder is too strong so that I cannot change the Z to get a new person (bc there's a part where NVAE concat the encoding and decoding mu,sigma). But then again I'm not sure.

Last point. in sampling, I see that we generate a new z_0 by sampling from N(0,1), and after going through the hierarchical Guassian, we get a nice new image. But let say I replace the z_0 by the z_0 of the encoded Z's of the person A, i get a junky/bad/trippy image, so I am at a loss bc the z_0 of a person A going through the hierarchical NVAE does not yield a person image.

CelebA-HQ 256 Dataset

Thanks for the interesting work.

I want to reproduce the NVAE model in CelebAHQ 256 dataset. Could you provide the main NVAE training scripts with detailed parameters in the Celeba-HQ dataset?

Thanks for your help.

Problem while converting tfrecord to lmdb data AttributeError: 'bytes' object has no attribute 'cpu'

Hey, I am trying to train your https://github.com/NVlabs/denoising-diffusion-gan Denoising Diffusion GAN model and my image size is 256. However, the problem is that repository recommends this repository to download and extract celeb as a lmdb file. However, when I try to run the script as it is described, I am getting the following error.
Screenshot 2023-11-17 at 17 16 12

Can you fix the problem. I can see that tfrecords repository has been archived but may be the problem is different. Thanks a lot.

Some questions about the inverse autoregressive flow

Hi there, I am just confused by the inverse autoregressive flow. Do you use another network model to fit the distribution q(z|x)? Can I understand in the following way?
As far as my know, in the flow based model, people want to model the data distribution p(x) , so from a random z, we can get x=f(z). Here in this paper , q(z|x) is your data distribution you will model, you train another network g, again, from a random e~N(0,I), you get a z=g(e), then you can sample a realistic image through the decoder of the NVAE model, Decoder(z).

RuntimeError: NCCL error

Hi,

I am trying to run NVAE on my machine with your command line for CIFAR10 (updating only the .. from 8 to 4 cause I own 4 GPUs):

export EXPR_ID=/home/dsi/eyalbetzalel/NVAE/logs  
export DATA_DIR=/home/dsi/eyalbetzalel/NVAE/data 
export CHECKPOINT_DIR=/home/dsi/eyalbetzalel/NVAE/cpt  
export CODE_DIR=/home/dsi/eyalbetzalel/NVAE  
cd $CODE_DIR

nohup python train.py --data $DATA_DIR/cifar10 --root $CHECKPOINT_DIR --save $EXPR_ID --dataset cifar10 \
        --num_channels_enc 128 --num_channels_dec 128 --epochs 400 --num_postprocess_cells 2 --num_preprocess_cells 2 \
        --num_latent_scales 1 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 \
        --num_preprocess_blocks 1 --num_postprocess_blocks 1 --num_groups_per_scale 30 --batch_size 32 \
        --weight_decay_norm 1e-2 --num_nf 1 --num_process_per_node 4 --use_se --res_dist --fast_adamax &> NVAE_DSIGPU13_test_2_22102020.out &

and get this error:

File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "train.py", line 281, in init_processes
fn(args)
File "train.py", line 92, in main
train_nelbo, global_step = train(train_queue, model, cnn_optimizer, grad_scalar, global_step, warmup_iters, writer, logging)
File "train.py", line 160, in train
utils.average_params(model.parameters(), args.distributed)
File "/home/dsi/eyalbetzalel/NVAE/utils.py", line 274, in average_params
dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
File "/home/dsi/eyalbetzalel/miniconda3/envs/NVAE_env/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 936, in all_reduce
work = _default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:518, unhandled system error, NCCL version 2.4.8

am I doing something wrong?

Thanks,
Eyal

@

"arch_instance" argument

Hi there, thank you for the great job!

I noticed that your implementation has an argument called arch_instance, set to default as res_mbconv. From the provided readme, it appears that this parameter is set as default in all your runs.

My question is: is this just an old argument used for testing different architectures? When training a new model, is it correct to leave it as default?
Can you also please provide some more info on what this argument does/what are the different tried architectures?

Thank you

Questions about Traversing the latent space

Thanks for your interesting works.

I wonder how to generate the interpolations by traversing in the latent space.
Could you provide the source code about the latent space traverse if convenient?

Thanks.

CelebaA HQ 256 out of memory and NaN

Arash Vahdat and Jan Kautz, thank you for a great paper and the code that you are providing!
My GPUs: 8 V100s 32GB.
Out of GPU Memory.
I would like to point out that the provided command for training on CelebA HQ 256 runs out of GPU memory, whether it is the default command with a reduced number of GPUs (24 down to 8) and batch_size 4, or the command suggested for 8 GPUs with batch_size 6. With the reduced number of GPUs and 30 channels, a batch_size of 3 works. Command:

python train.py --data /celeba/celeba-lmdb --root /NVAE/checkpoints --save 1 --dataset celeba_256 --num_channels_enc 30 --num_channels_dec 30 --epochs 300 --num_postprocess_cells 2 --num_preprocess_cells 2 --num_latent_scales 5 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 --num_preprocess_blocks 1 --num_postprocess_blocks 1 --weight_decay_norm 1e-2 --num_groups_per_scale 16 --batch_size 3 --num_nf 2 --ada_groups --min_groups_per_scale 4 --weight_decay_norm_anneal --weight_decay_norm_init 1. --num_process_per_node 8 --use_se --res_dist --fast_adamax --num_x_bits 5 --cont_training

The command suggested for 8 GPUs with 24 channels works with a batch_size of 5. Command:

python train.py --data /celeba/celeba-lmdb --root /NVAE/checkpoints --save 1 --dataset celeba_256 --num_channels_enc 24 --num_channels_dec 24 --epochs 300 --num_postprocess_cells 2 --num_preprocess_cells 2 --num_latent_scales 5 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 --num_preprocess_blocks 1 --num_postprocess_blocks 1 --weight_decay_norm 1e-2 --num_groups_per_scale 16 --batch_size 5 --num_nf 2 --ada_groups --min_groups_per_scale 4 --weight_decay_norm_anneal --weight_decay_norm_init 1. --num_process_per_node 8 --use_se --res_dist --fast_adamax --num_x_bits 5

So in both cases the batch_size had to be reduced by 1, otherwise training runs out of GPU memory.

NaN
I made 4 runs overall with the 2 suggested commands, running each command twice, but none of them even reached epoch 100. "NaN or Inf found in input tensor" was encountered; sometimes training breaks at epoch 35, sometimes at 70. Restarting from the last checkpoint goes nowhere, same problem. The problem is listed in "Known Issues", where you mention that the "commands above can be trained in a stable way"; in my case the given commands were unstable, and the only difference is a batch_size reduced by one, which I doubt can make such a big difference.
Did anyone else encounter these issues?
I'll play around with the listed tricks to stabilize the training and report back if something remedies the NaN.
Thanks!

Why is the output for the 3rd channel unused in the logistic mixture?

Hello! Can anyone explain one thing to me: when computing mean3 for the 3rd channel (blue, I suppose), why don't we use samples[:, 2, :, :, :]?

NVAE/distributions.py, lines 139 to 152 at commit 9fc1a28:

samples = samples.unsqueeze(4)                                                 # B, 3, H, W
samples = samples.expand(-1, -1, -1, -1, self.num_mix).permute(0, 1, 4, 2, 3)  # B, 3, M, H, W
mean1 = self.means[:, 0, :, :, :]                                              # B, M, H, W
mean2 = self.means[:, 1, :, :, :] + \
        self.coeffs[:, 0, :, :, :] * samples[:, 0, :, :, :]                    # B, M, H, W
mean3 = self.means[:, 2, :, :, :] + \
        self.coeffs[:, 1, :, :, :] * samples[:, 0, :, :, :] + \
        self.coeffs[:, 2, :, :, :] * samples[:, 1, :, :, :]                    # B, M, H, W
mean1 = mean1.unsqueeze(1)                                                     # B, 1, M, H, W
mean2 = mean2.unsqueeze(1)                                                     # B, 1, M, H, W
mean3 = mean3.unsqueeze(1)                                                     # B, 1, M, H, W
means = torch.cat([mean1, mean2, mean3], dim=1)                                # B, 3, M, H, W
centered = samples - means                                                     # B, 3, M, H, W

Also, why do we need to update the means with the samples when computing the log-prob? For example, in the Tacotron 2 code there are no such updates of the means.
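For context, a hedged reading of the excerpt above: the discretized mixture of logistics appears to use the PixelCNN++-style autoregressive parameterization over sub-pixels, where each channel's mean is shifted only by the already-observed preceding channels and never by the channel itself:

$$ \hat{\mu}_R = \mu_R, \qquad \hat{\mu}_G = \mu_G + c_0 \cdot x_R, \qquad \hat{\mu}_B = \mu_B + c_1 \cdot x_R + c_2 \cdot x_G $$

which would explain why samples[:, 2, :, :, :] never appears on the right-hand side of mean3.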

How to get Latent variables

Thank you for sharing your work.
I want to know how to get the latent variables and save them.
It's really important to me.

Minor error in execution script for cifar-10

Thanks for this implementation!

I think the execution script for CIFAR-10 provided in the README has an extra --save in it and should read:

export EXPR_ID=UNIQUE_EXPR_ID
export DATA_DIR=PATH_TO_DATA_DIR
export CHECKPOINT_DIR=PATH_TO_CHECKPOINT_DIR
export CODE_DIR=PATH_TO_CODE_DIR
cd $CODE_DIR
python train.py --data $DATA_DIR/cifar10 --root $CHECKPOINT_DIR --save $EXPR_ID --dataset cifar10 \
        --num_channels_enc 128 --num_channels_dec 128 --epochs 400 --num_postprocess_cells 2 --num_preprocess_cells 2 \
        --num_latent_scales 1 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 \
        --num_preprocess_blocks 1 --num_postprocess_blocks 1 --num_groups_per_scale 30 --batch_size 32 \
        --weight_decay_norm 1e-2 --num_nf 1 --num_process_per_node 8 --use_se --res_dist --fast_adamax

It crashes as is right now.

Possible typo in the log_p() function

Hi there, and thanks for the amazing work! I was exploring the code and, in the module distributions.py, found that the logarithm of the standard normal distribution (normalized_samples in the code) is:

$$ \ln(p) = -0.5 \cdot x^2 - 0.5 \cdot \ln(2 \cdot \pi) - \ln(\sigma). $$

But why do we need the $\ln(\sigma)$ part? Shouldn't the formula above look like:

$$ \ln(p) = -0.5 \cdot x^2 - 0.5 \cdot \ln(2 \cdot \pi)? $$
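(For context, a possible reading rather than anything stated in the code: if the normalized samples are $x = (y - \mu)/\sigma$ for $y \sim \mathcal{N}(\mu, \sigma^2)$, then

$$ \ln p(y) = -0.5 \cdot \left(\frac{y - \mu}{\sigma}\right)^2 - 0.5 \cdot \ln(2 \cdot \pi) - \ln(\sigma), $$

so the extra $\ln(\sigma)$ term would come from evaluating the density of $y$ itself rather than of the normalized variable.)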

Thanks in advance for the clarifications!

Could you explain kl_balancer in detail?

NVAE/utils.py, line 213 at commit 38eb997:

kl_coeff_i = kl_coeff_i / alpha_i * total_kl

  • The * total_kl in this line is redundant, because in the next line kl_coeff_i is divided by its mean.
  • Why is mean(abs(kl_i)) used as a weight factor for kl_i?
  • Is / alpha_i equivalent to * num_group / feature_resolution?
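As a hedged sketch of the general idea only (not the repo's exact code): KL balancing weights each latent group's KL term by how large that group's KL currently is, then renormalizes the weights so they have mean one, which keeps the overall penalty on roughly the same scale:

import torch

# Hypothetical sketch of KL balancing across latent groups;
# kl_per_group is a list of per-group KL tensors.
def kl_balance(kl_per_group):
    kl = torch.stack([k.mean() for k in kl_per_group])
    coeff = kl.abs()                     # larger-KL groups get larger weights
    coeff = coeff / coeff.mean()         # normalize so the weights average to one
    return (coeff.detach() * kl).sum()   # detach: the weights themselves are not trained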

Small question about the lmdb_datasets.py implementation

Nice to meet you again, Dr. @arash-vahdat!

  1. Here is a little question about line 43: what does target = [0] mean? I can't figure out where this item is used; I can only see that the returned img data is fed into the NVAE model.

  2. How do I apply data with channel == 1 to the NVAE model? I still can't find the parameter that sets the number of input data channels. Or are no extra operations needed, so the model adapts by itself? I'm still confused and would appreciate your kind help. Sincerely!

Hoping to get your help. Thank you again!

All the best,

Luke Huang

How to run without using parallelization?

Can we run this model without using the distributed parts of the code? I simply want to run the code on 1024x1024 images without any parallelization. I can't believe I have been struggling with this for a week now...

tensorboard images not showing

Hi all,

I've been testing the default code for training MNIST and a custom rgb_64x64 dataset.

In both cases, I can see the resulting training graphs, but the images in the TensorBoard images section are not showing; they look like a 1x1 black pixel or are simply empty.

The shape of the visualized image for the MNIST reconstruction is (1, int(sqrt(batch)), 2*int(sqrt(batch))) as expected, but it is not displayed. Saving the image with torchvision.utils.save_image() produces an image of the right size.


What's the assumed tensorboard version?

tensorboard.__version__: '1.15.0+nv'
torch.__version__: '1.6.0a0+9907a3e'
torchvision.__version__: '0.7.0a0'
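As a hedged debugging sketch (generic names, not the repo's code), one way to rule out a value-range or layout problem is to build the image grid explicitly, clamp it to [0, 1], and log it with an explicit dataformats argument:

import torch
import torchvision.utils as vutils
from torch.utils.tensorboard import SummaryWriter

# Hypothetical example: log a grid of random MNIST-sized images to TensorBoard.
writer = SummaryWriter('runs/debug')
images = torch.rand(16, 1, 28, 28)                        # batch of fake single-channel images
grid = vutils.make_grid(images, nrow=4).clamp(0.0, 1.0)   # 3 x H x W grid with values in [0, 1]
writer.add_image('debug/reconstruction', grid, global_step=0, dataformats='CHW')
writer.close()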

Question regarding traversing the latent space

Hi, thanks a lot for the nicely documented repo. I have a question regarding traversing the latent space.

To perform the traversal experiment, do I only need to modify mu_q in line 363 of model.py, for example by replacing one of the columns of mu_q with values between [-3, 3]? Or do I also need to consider the values in self.enc_sampler?

I really appreciate any help you can provide.

FID score of CelebA-HQ 256x256

I'm quite confused about the FID of CelebA-HQ.

In the NCP-VAE and VAEBM papers, it is reported as 40.26, while the recent LSGM paper reports it as 29.76.

Were there further improvements to NVAE after the publication of NCP-VAE and VAEBM?
