
taming-transformers's Introduction

Taming Transformers for High-Resolution Image Synthesis

CVPR 2021 (Oral)

Patrick Esser*, Robin Rombach*, Björn Ommer
* equal contribution

tl;dr We combine the efficiency of convolutional approaches with the expressivity of transformers by introducing a convolutional VQGAN, which learns a codebook of context-rich visual parts, whose composition is modeled with an autoregressive transformer.

arXiv | BibTeX | Project Page

News

2021

  • Thanks to rom1504 it is now easy to train a VQGAN on your own datasets.
  • Included a bugfix for the quantizer. For backward compatibility it is disabled by default (which corresponds to always training with beta=1.0). Use legacy=False in the quantizer config to enable it. Thanks richcmwang and wcshin-git!
  • Our paper received an update: See https://arxiv.org/abs/2012.09841v3 and the corresponding changelog.
  • Added a pretrained, 1.4B transformer model trained for class-conditional ImageNet synthesis, which obtains state-of-the-art FID scores among autoregressive approaches and outperforms BigGAN.
  • Added pretrained, unconditional models on FFHQ and CelebA-HQ.
  • Added accelerated sampling via caching of keys/values in the self-attention operation, used in scripts/sample_fast.py.
  • Added a checkpoint of a VQGAN trained with f8 compression and Gumbel-Quantization. See also our updated reconstruction notebook.
  • We added a colab notebook which compares two VQGANs and OpenAI's DALL-E. See also this section.
  • We now include an overview of pretrained models in Tab.1. We added models for COCO and ADE20k.
  • The streamlit demo now supports image completions.
  • We now include a couple of examples from the D-RIN dataset so you can run the D-RIN demo without preparing the dataset first.
  • You can now jump right into sampling with our Colab quickstart notebook.
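The accelerated sampling mentioned in the list above caches keys/values in the self-attention operation. The following is a minimal, framework-free sketch of that general idea, not the code in scripts/sample_fast.py: `ToyProjection` is a hypothetical stand-in for the learned key/value projections, and the query is taken to be the raw token for brevity. It only illustrates that caching yields the same outputs while projecting each token once.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over all keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]


class ToyProjection:
    """Hypothetical stand-in for learned key/value projections; counts its calls."""

    def __init__(self):
        self.calls = 0

    def __call__(self, token):
        self.calls += 1
        key = [2.0 * x for x in token]
        value = [x + 1.0 for x in token]
        return key, value


def decode_without_cache(tokens, project):
    """Re-project the whole prefix at every step."""
    outputs = []
    for step in range(1, len(tokens) + 1):
        keys, values = zip(*(project(tok) for tok in tokens[:step]))
        outputs.append(attend(tokens[step - 1], keys, values))
    return outputs


def decode_with_cache(tokens, project):
    """Project only the newest token and append the result to the cache."""
    key_cache, value_cache, outputs = [], [], []
    for tok in tokens:
        key, value = project(tok)
        key_cache.append(key)
        value_cache.append(value)
        outputs.append(attend(tok, key_cache, value_cache))
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow, fast = ToyProjection(), ToyProjection()
assert decode_without_cache(tokens, slow) == decode_with_cache(tokens, fast)
print(slow.calls, fast.calls)  # 10 projections without the cache, 4 with it
```

For a sequence of n tokens the uncached loop performs n(n+1)/2 projections while the cached loop performs n, which is where the speed-up during autoregressive sampling comes from.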

Requirements

A suitable conda environment named taming can be created and activated with:

conda env create -f environment.yaml
conda activate taming

Overview of pretrained models

The following table provides an overview of all models that are currently available. FID scores were evaluated using torch-fidelity. For reference, we also include a link to the recently released autoencoder of the DALL-E model. See the corresponding colab notebook for a comparison and discussion of reconstruction capabilities.

Dataset | FID vs train | FID vs val | Link | Samples (256x256) | Comments
--- | --- | --- | --- | --- | ---
FFHQ (f=16) | 9.6 | -- | ffhq_transformer | ffhq_samples |
CelebA-HQ (f=16) | 10.2 | -- | celebahq_transformer | celebahq_samples |
ADE20K (f=16) | -- | 35.5 | ade20k_transformer | ade20k_samples.zip [2k] | evaluated on val split (2k images)
COCO-Stuff (f=16) | -- | 20.4 | coco_transformer | coco_samples.zip [5k] | evaluated on val split (5k images)
ImageNet (cIN) (f=16) | 15.98/15.78/6.59/5.88/5.20 | -- | cin_transformer | cin_samples | different decoding hyperparameters
FacesHQ (f=16) | -- | -- | faceshq_transformer | |
S-FLCKR (f=16) | -- | -- | sflckr | |
D-RIN (f=16) | -- | -- | drin_transformer | |
VQGAN ImageNet (f=16), 1024 | 10.54 | 7.94 | vqgan_imagenet_f16_1024 | reconstructions | Reconstruction-FIDs.
VQGAN ImageNet (f=16), 16384 | 7.41 | 4.98 | vqgan_imagenet_f16_16384 | reconstructions | Reconstruction-FIDs.
VQGAN OpenImages (f=8), 256 | -- | 1.49 | https://ommer-lab.com/files/latent-diffusion/vq-f8-n256.zip | --- | Reconstruction-FIDs. Available via latent diffusion.
VQGAN OpenImages (f=8), 16384 | -- | 1.14 | https://ommer-lab.com/files/latent-diffusion/vq-f8.zip | --- | Reconstruction-FIDs. Available via latent diffusion.
VQGAN OpenImages (f=8), 8192, GumbelQuantization | 3.24 | 1.49 | vqgan_gumbel_f8 | --- | Reconstruction-FIDs.
DALL-E dVAE (f=8), 8192, GumbelQuantization | 33.88 | 32.01 | https://github.com/openai/DALL-E | reconstructions | Reconstruction-FIDs.

Running pretrained models

The commands below start a streamlit demo which supports sampling at different resolutions and image completions. To run a non-interactive version of the sampling process, replace streamlit run scripts/sample_conditional.py -- with python scripts/make_samples.py --outdir <path_to_write_samples_to> and keep the remaining command line arguments.

To sample from unconditional or class-conditional models, run python scripts/sample_fast.py -r <path/to/config_and_checkpoint>. We describe below how to use this script to sample from the ImageNet, FFHQ, and CelebA-HQ models, respectively.

S-FLCKR

You can also run this model in a Colab notebook, which includes all necessary steps to start sampling.

Download the 2020-11-09T13-31-51_sflckr folder and place it into logs. Then, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-09T13-31-51_sflckr/

ImageNet

Download the 2021-04-03T19-39-50_cin_transformer folder and place it into logs. Sampling from the class-conditional ImageNet model does not require any data preparation. To produce 50 samples for each of the 1000 classes of ImageNet, with k=600 for top-k sampling, p=0.92 for nucleus sampling and temperature t=1.0, run

python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25   

To restrict the model to certain classes, provide them via the --classes argument, separated by commas. For example, to sample 50 ostriches, border collies and whiskey jugs, run

python scripts/sample_fast.py -r logs/2021-04-03T19-39-50_cin_transformer/ -n 50 -k 600 -t 1.0 -p 0.92 --batch_size 25 --classes 9,232,901   

We recommend experimenting with the autoregressive decoding parameters (top-k, top-p and temperature) for best results.
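To make the roles of the three decoding parameters concrete, here is a small sketch of temperature scaling followed by top-k and nucleus (top-p) filtering on a single logit vector. It is an illustration only: the function names are hypothetical, and the order of operations (temperature, then top-k, then top-p) is an assumption rather than a description of scripts/sample_fast.py.

```python
import math
import random


def filter_probs(logits, top_k=None, top_p=1.0, temperature=1.0):
    """Return sampling probabilities after temperature, top-k and top-p filtering."""
    scaled = [value / temperature for value in logits]
    # Token indices sorted from most to least likely.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]  # keep only the k most likely tokens
    peak = scaled[order[0]]
    exps = {i: math.exp(scaled[i] - peak) for i in order}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}
    # Nucleus filtering: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    result = [0.0] * len(logits)
    for i in kept:
        result[i] = probs[i] / mass  # renormalize over the surviving tokens
    return result


def sample_token(logits, rng, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = filter_probs(logits, **kwargs)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
print(filter_probs(logits, top_k=3, top_p=0.8))
print(sample_token(logits, random.Random(0), top_k=3, top_p=0.8))
```

With these toy logits, top_k=3 drops the least likely token and top_p=0.8 then keeps only the first two, leaving probabilities of roughly 0.625 and 0.375. Lower k or p makes samples more conservative; higher values increase diversity.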

FFHQ/CelebA-HQ

Download the 2021-04-23T18-19-01_ffhq_transformer and 2021-04-23T18-11-19_celebahq_transformer folders and place them into logs. Again, sampling from these unconditional models does not require any data preparation. To produce 50000 samples, with k=250 for top-k sampling, p=1.0 for nucleus sampling and temperature t=1.0, run

python scripts/sample_fast.py -r logs/2021-04-23T18-19-01_ffhq_transformer/   

for FFHQ and

python scripts/sample_fast.py -r logs/2021-04-23T18-11-19_celebahq_transformer/   

to sample from the CelebA-HQ model. For both models it can be advantageous to vary the top-k/top-p parameters for sampling.

FacesHQ

Download 2020-11-13T21-41-45_faceshq_transformer and place it into logs. Follow the data preparation steps for CelebA-HQ and FFHQ. Run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-13T21-41-45_faceshq_transformer/

D-RIN

Download 2020-11-20T12-54-32_drin_transformer and place it into logs. To run the demo on a couple of example depth maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.imagenet.DRINExamples}}}"

To run the demo on the complete validation set, first follow the data preparation steps for ImageNet and then run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T12-54-32_drin_transformer/

COCO

Download 2021-01-20T16-04-20_coco_transformer and place it into logs. To run the demo on a couple of example segmentation maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2021-01-20T16-04-20_coco_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.coco.Examples}}}"

ADE20k

Download 2020-11-20T21-45-44_ade20k_transformer and place it into logs. To run the demo on a couple of example segmentation maps included in the repository, run

streamlit run scripts/sample_conditional.py -- -r logs/2020-11-20T21-45-44_ade20k_transformer/ --ignore_base_data data="{target: main.DataModuleFromConfig, params: {batch_size: 1, validation: {target: taming.data.ade20k.Examples}}}"

Scene Image Synthesis

Scene image generation based on bounding box conditionals, as done in our CVPR 2021 AI4CC workshop paper High-Resolution Complex Scene Synthesis with Transformers (see talk on workshop page). The COCO and Open Images datasets are supported.

Training

Download the first-stage models COCO-8k-VQGAN for COCO or COCO/Open-Images-8k-VQGAN for Open Images. Change ckpt_path in configs/coco_scene_images_transformer.yaml and configs/open_images_scene_images_transformer.yaml to point to the downloaded first-stage models. Download the full COCO/OI datasets and adapt data_path in the same files, unless the 100 files provided for training and validation already suit your needs.

Start training with python main.py --base configs/coco_scene_images_transformer.yaml -t True --gpus 0, for COCO, or with python main.py --base configs/open_images_scene_images_transformer.yaml -t True --gpus 0, for Open Images.

Sampling

Train a model as described above or download a pre-trained model:

  • Open Images: a 1 billion parameter model, trained for 100 epochs. On 256x256 pixels: FID 41.48±0.21, SceneFID 14.60±0.15, Inception Score 18.47±0.27. The model was trained with 2d crops of images and is thus well-prepared for the task of generating high-resolution images, e.g. 512x512.
  • Open Images: a distilled version of the above model with 125 million parameters, which allows sampling on smaller GPUs (4 GB is enough for sampling 256x256 px images). The model was trained for 60 epochs with 10% soft loss, 90% hard loss. On 256x256 pixels: FID 43.07±0.40, SceneFID 15.93±0.19, Inception Score 17.23±0.11.
  • COCO, 30 epochs
  • COCO, 60 epochs (find model statistics for both COCO versions in assets/coco_scene_images_training.svg)
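The "10% soft loss, 90% hard loss" weighting of the distilled model in the list above can be read as a standard knowledge-distillation objective. The sketch below assumes that the soft loss is the cross-entropy against the teacher's predicted distribution and the hard loss is the cross-entropy against the ground-truth token; the source does not spell out either definition, and `distillation_loss` is a hypothetical name.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]


def distillation_loss(student_logits, teacher_logits, target, soft_weight=0.1):
    """Weighted sum of a soft (teacher) and a hard (ground-truth) cross-entropy."""
    student_logp = log_softmax(student_logits)
    teacher_p = [math.exp(v) for v in log_softmax(teacher_logits)]
    # Soft loss: cross-entropy between teacher distribution and student prediction.
    soft = -sum(p * lp for p, lp in zip(teacher_p, student_logp))
    # Hard loss: negative log-likelihood of the ground-truth token index.
    hard = -student_logp[target]
    return soft_weight * soft + (1.0 - soft_weight) * hard


print(distillation_loss([2.0, 0.5, -1.0], [1.5, 1.0, -2.0], target=0))
```

Setting soft_weight=0.0 recovers ordinary training on the ground-truth tokens only.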

When downloading a pre-trained model, remember to change ckpt_path in configs/*project.yaml to point to your downloaded first-stage model (see the Training section above).

Scene image generation can be run with python scripts/make_scene_samples.py --outdir=/some/outdir -r /path/to/pretrained/model --resolution=512,512

Training on custom data

Training on your own dataset can be beneficial to get better tokens and hence better images for your domain. Follow these steps to make this work:

  1. install the repo with conda env create -f environment.yaml, conda activate taming and pip install -e .
  2. put your .jpg files in a folder your_folder
  3. create two text files, xx_train.txt and xx_test.txt, that list the files in your training and test sets respectively (for example find $(pwd)/your_folder -name "*.jpg" > train.txt)
  4. adapt configs/custom_vqgan.yaml to point to these 2 files
  5. run python main.py --base configs/custom_vqgan.yaml -t True --gpus 0,1 to train on two GPUs. Use --gpus 0, (with a trailing comma) to train on a single GPU.
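Step 3 above can also be done from Python. The helper below is a sketch under stated assumptions: `write_file_lists`, the 10% test fraction and the fixed seed are all hypothetical choices, not part of the repository. It writes one absolute path per line, matching what the find example produces.

```python
import random
from pathlib import Path


def write_file_lists(image_dir, train_list, test_list, test_fraction=0.1, seed=0):
    """Split all .jpg files under image_dir into a train and a test file list."""
    paths = sorted(str(p.resolve()) for p in Path(image_dir).rglob("*.jpg"))
    random.Random(seed).shuffle(paths)  # deterministic shuffle for reproducibility
    n_test = max(1, int(len(paths) * test_fraction)) if paths else 0
    test, train = paths[:n_test], paths[n_test:]
    # One absolute path per line.
    Path(train_list).write_text("".join(p + "\n" for p in train))
    Path(test_list).write_text("".join(p + "\n" for p in test))
    return len(train), len(test)
```

Calling write_file_lists("your_folder", "xx_train.txt", "xx_test.txt") returns the number of training and test images written; the two output paths are the ones to reference in configs/custom_vqgan.yaml.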

Data Preparation

ImageNet

The code will try to download (through Academic Torrents) and prepare ImageNet the first time it is used. However, since ImageNet is quite large, this requires a lot of disk space and time. If you already have ImageNet on your disk, you can speed things up by putting the data into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ (which defaults to ~/.cache/autoencoders/data/ILSVRC2012_{split}/data/), where {split} is one of train/validation. It should have the following structure:

${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/
├── n01440764
│   ├── n01440764_10026.JPEG
│   ├── n01440764_10027.JPEG
│   ├── ...
├── n01443537
│   ├── n01443537_10007.JPEG
│   ├── n01443537_10014.JPEG
│   ├── ...
├── ...

If you haven't extracted the data, you can also place ILSVRC2012_img_train.tar/ILSVRC2012_img_val.tar (or symlinks to them) into ${XDG_CACHE}/autoencoders/data/ILSVRC2012_train/ / ${XDG_CACHE}/autoencoders/data/ILSVRC2012_validation/, which will then be extracted into the above structure without downloading it again. Note that this will only happen if neither a folder ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/data/ nor a file ${XDG_CACHE}/autoencoders/data/ILSVRC2012_{split}/.ready exists. Remove them if you want to force running the dataset preparation again.
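The condition described above can be checked before starting a long run. This is a sketch of that rule only, not the repository's own preparation code: `dataset_root` and `preparation_would_run` are hypothetical helpers, and the cache directory is passed explicitly, defaulting to ~/.cache as stated in the text.

```python
from pathlib import Path


def dataset_root(split, cache_home="~/.cache"):
    """Folder that holds one ImageNet split, following the layout described above."""
    return Path(cache_home).expanduser() / "autoencoders" / "data" / f"ILSVRC2012_{split}"


def preparation_would_run(split, cache_home="~/.cache"):
    """True if neither the data/ folder nor the .ready marker exists for this split."""
    root = dataset_root(split, cache_home)
    return not (root / "data").is_dir() and not (root / ".ready").is_file()


for split in ("train", "validation"):
    print(split, dataset_root(split), preparation_would_run(split))
```

If the function returns False although you want the data to be prepared again, remove the data/ folder and the .ready file for that split.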

You will then need to prepare the depth data using MiDaS. Create a symlink data/imagenet_depth pointing to a folder with two subfolders train and val, each mirroring the structure of the corresponding ImageNet folder described above and containing a png file for each of ImageNet's JPEG files. The png encodes float32 depth values obtained from MiDaS as RGBA images. We provide the script scripts/extract_depth.py to generate this data. Please note that this script uses MiDaS via PyTorch Hub. When we prepared the data, the hub provided the MiDaS v2.0 version, but now it provides a v2.1 version. We haven't tested our models with depth maps obtained via v2.1 and if you want to make sure that things work as expected, you must adjust the script to make sure it explicitly uses v2.0!
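Storing float32 depth values as RGBA images works because a float32 occupies exactly four bytes, one per 8-bit channel. The sketch below shows the round trip for a single pixel with the standard library; the little-endian byte order and the helper names are assumptions made for illustration, so check scripts/extract_depth.py for the layout actually used.

```python
import struct


def depth_to_rgba(depth):
    """Pack one float32 depth value into four 8-bit channel values (assumed little-endian)."""
    return tuple(struct.pack("<f", depth))


def rgba_to_depth(rgba):
    """Recover the float32 depth value from its four channel values."""
    return struct.unpack("<f", bytes(rgba))[0]


pixel = depth_to_rgba(1.0)
print(pixel)                 # (0, 0, 128, 63), the bytes of float32 1.0
print(rgba_to_depth(pixel))  # 1.0
```

Because the channels carry raw bytes rather than colors, such a png must be stored losslessly and must not be resized or color-converted before decoding.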

CelebA-HQ

Create a symlink data/celebahq pointing to a folder containing the .npy files of CelebA-HQ (instructions to obtain them can be found in the PGGAN repository).

FFHQ

Create a symlink data/ffhq pointing to the images1024x1024 folder obtained from the FFHQ repository.

S-FLCKR

Unfortunately, we are not allowed to distribute the images we collected for the S-FLCKR dataset and can therefore only give a description of how it was produced. There are many resources on collecting images from the web to get started. We collected sufficiently large images from flickr (see data/flickr_tags.txt for a full list of tags used to find images) and various subreddits (see data/subreddits.txt for all subreddits that were used). Overall, we collected 107625 images, and split them randomly into 96861 training images and 10764 validation images. We then obtained segmentation masks for each image using DeepLab v2 trained on COCO-Stuff. We used a PyTorch reimplementation and include an example script for this process in scripts/extract_segmentation.py.

COCO

Create a symlink data/coco containing the images from the 2017 split in train2017 and val2017, and their annotations in annotations. Files can be obtained from the COCO webpage. In addition, we use the Stuff+thing PNG-style annotations on COCO 2017 trainval annotations from COCO-Stuff, which should be placed under data/cocostuffthings.

ADE20k

Create a symlink data/ade20k_root containing the contents of ADEChallengeData2016.zip from the MIT Scene Parsing Benchmark.

Training models

FacesHQ

Train a VQGAN with

python main.py --base configs/faceshq_vqgan.yaml -t True --gpus 0,

Then, adjust the checkpoint path of the config key model.params.first_stage_config.params.ckpt_path in configs/faceshq_transformer.yaml (or download 2020-11-09T13-33-36_faceshq_vqgan and place into logs, which corresponds to the preconfigured checkpoint path), then run

python main.py --base configs/faceshq_transformer.yaml -t True --gpus 0,

D-RIN

Train a VQGAN on ImageNet with

python main.py --base configs/imagenet_vqgan.yaml -t True --gpus 0,

or download a pretrained one from 2020-09-23T17-56-33_imagenet_vqgan and place under logs. If you trained your own, adjust the path in the config key model.params.first_stage_config.params.ckpt_path of configs/drin_transformer.yaml.

Train a VQGAN on Depth Maps of ImageNet with

python main.py --base configs/imagenetdepth_vqgan.yaml -t True --gpus 0,

or download a pretrained one from 2020-11-03T15-34-24_imagenetdepth_vqgan and place under logs. If you trained your own, adjust the path in the config key model.params.cond_stage_config.params.ckpt_path of configs/drin_transformer.yaml.

To train the transformer, run

python main.py --base configs/drin_transformer.yaml -t True --gpus 0,

More Resources

Comparing Different First Stage Models

The reconstruction and compression capabilities of different first-stage models can be analyzed in this colab notebook. In particular, the notebook compares two VQGANs with a downsampling factor of f=16 and codebook sizes of 1024 and 16384 entries, a VQGAN with f=8 and 8192 codebook entries, and the discrete autoencoder of OpenAI's DALL-E (which has f=8 and 8192 codebook entries).
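The trade-off between these first-stage models can be made tangible with a little arithmetic. Assuming a 256x256 input (the sample resolution used in the model table) and one codebook index per latent position, the hypothetical helper below computes the size of the latent grid, the number of tokens, and the number of bits needed to store them.

```python
import math


def latent_stats(image_size, f, codebook_entries):
    """Latent grid side, token count and total bits for a square image."""
    side = image_size // f                       # spatial downsampling by factor f
    tokens = side * side                         # one codebook index per position
    bits = tokens * math.log2(codebook_entries)  # bits to store all indices
    return side, tokens, bits


for name, f, entries in [
    ("VQGAN f=16, 1024", 16, 1024),
    ("VQGAN f=16, 16384", 16, 16384),
    ("VQGAN f=8, 8192", 8, 8192),
]:
    print(name, latent_stats(256, f, entries))
# VQGAN f=16, 1024 (16, 256, 2560.0)
# VQGAN f=16, 16384 (16, 256, 3584.0)
# VQGAN f=8, 8192 (32, 1024, 13312.0)
```

An f=8 model therefore produces four times as many tokens as an f=16 model for the same image, which helps reconstruction quality but lengthens the sequence a second-stage transformer has to model.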

Other

Text-to-Image Optimization via CLIP

VQGAN has been successfully used as an image generator guided by the CLIP model, both for pure image generation from scratch and image-to-image translation. We recommend the following notebooks/videos/resources:

Text prompt: 'A bird drawn by a child'

Shout-outs

Thanks to everyone who makes their code and models available.

BibTeX

@misc{esser2020taming,
      title={Taming Transformers for High-Resolution Image Synthesis}, 
      author={Patrick Esser and Robin Rombach and Björn Ommer},
      year={2020},
      eprint={2012.09841},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

taming-transformers's People

Contributors

carmocca, manolo-lolo, olaviinha, pesser, rom1504, rromb, tgisaturday, wcshin-git

taming-transformers's Issues

Load an example segmentation and visualize

I get this issue when I use my own image in the Load an example segmentation and visualize section

How can I fix this?? Thanks.

IndexError                                Traceback (most recent call last)
<ipython-input-46-1334a87733d0> in <module>()
      4 segmentation = Image.open(segmentation_path)
      5 segmentation = np.array(segmentation)
----> 6 segmentation = np.eye(182)[segmentation]
      7 segmentation = torch.tensor(segmentation.transpose(2,0,1)[None]).to(dtype=torch.float32, device=model.device)

IndexError: index 255 is out of bounds for axis 0 with size 182

Starting point for training sflckr

Hey guys, great work!
I'm trying to run training on a dataset similar to your sflckr. However I'm hitting this error immediately after validation or training starts, right after "Summoning checkpoint.":
assert t <= self.block_size, "Cannot forward, model block size is exhausted."
AssertionError: Cannot forward, model block size is exhausted.
Assuming this was GPU memory related I reduced the model size but the error persisted. So I started to think that perhaps this has something to do with the configuration. My starting point is your sflckr.yaml
python main.py --base configs/sflckr.yaml -t True --gpus 0,
Any hints are highly appreciated. Thanks!

Straight-through estimator

Thank you for sharing your amazing work.

Could you tell me where is the implementation of the "straight-through estimator" for training the encoder?

inference time

are there ways to reduce inference time? Currently takes about 13 minutes on a k80 for the norway example at 432x288

Request: Test the model without downloading the whole dataset.

Is there an easier way to test out the D-RIN and FacesHQ models? ImageNet is just too big for trying out the results to see how it works. I wish there were a way to test 10 examples, like for the S-FLCKR model. Thanks!

The S-FLCKR model is so mind blowing incredible! Great work!

config files for conditional training on segmentation maps

Great paper! I am trying to retrain this model on an image dataset where I'm able to generate the segmentation masks using DeepLab v2. However, I don't have a config yaml file for training transformer as for faceHQ or D-RIN. Could you please provide a sample yaml file training with segmentation masks? Many Thanks

colab

Hi, can you please add a google colab for inference thanks

error on train resume

I am getting the error below on train resume. Say,
python main.py --base configs/pic.yaml -t True --gpus 0, --max_epochs 2 --resume logs/2021-05-01T01-34-57_pic
Oddly enough, it seems to work fine when I retry.

Traceback (most recent call last):
  File "main.py", line 562, in <module>
    trainer.fit(model, data)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/trainer.py", line 445, in fit
    results = self.accelerator_backend.train()
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 148, in train
    results = self.ddp_train(process_idx=self.task_idx, model=model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/accelerators/ddp_accelerator.py", line 279, in ddp_train
    self.trainer.train_loop.setup_training(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/training_loop.py", line 174, in setup_training
    self.trainer.checkpoint_connector.restore_weights(model)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 75, in restore_weights
    self.restore(self.trainer.resume_from_checkpoint, on_gpu=self.trainer.on_gpu)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 107, in restore
    self.restore_model_state(model, checkpoint)
  File "/usr/local/lib/python3.6/dist-packages/pytorch_lightning/trainer/connectors/checkpoint_connector.py", line 128, in restore_model_state
    model.load_state_dict(checkpoint['state_dict'])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Net2NetTransformer:
    Unexpected key(s) in state_dict: "cond_stage_model.colorize".

When to use sliding attention window?

Hi, I find the code of sliding attention window in sample_conditional.py, but I cannot find where the sliding attention window is used in the training stage. Is this technique only used when sampling? Or both sampling and training? Thanks.

Rebuilding PyTorch Lightning Training Loop

Hello,
I'm currently trying to rebuild the model for a different use, namely pose to image (which is covered on the website, but is not mentioned here). If I already have input images that are presegmented (e.g. by OpenPose), how would I get this to work?
The diagrams seem to indicate that the input image is downsampled, or encoded, passed through the transformer in a patch-wise fashion, then upsampled again, but I'm struggling to see how the code does this (the model definition just seems to be instantiated from the config). I would appreciate any help in picking apart this code and reassembling it.

How much VRAM does it take to train COCO-Stuff/ADE20K transformer models?

Thank you for sharing this great work!

Could you give more information on training COCO-Stuff/ADE20K transformer models?
I got OOM even with a batch size of 1 when training these transformer models with the hyperparameters specified in the appendix on a GPU with 11GB VRAM. Is this expected? If so, what's the minimum amount of VRAM per GPU to train these conditional models?

Thank You!

I just wanted to personally thank everyone involved in this effort. Training is now far more accessible on DALLE-pytorch using the pretrained VAE you provided. Compute and memory costs are substantially lower and it's even possible for people to train a relatively large transformer under 16 GiB of VRAM.

It's early days, and no one has trained a "full DALL-E" yet, but this helps with that plenty and momentum is already picking up on the repo.

So thanks and great work everyone. You're awesome.

Generating higher resolution images for FacesHQ data

Hello authors,

Thank you for the amazing work! I am trying to generate a face image of higher resolution (512x512). My strategy is to initiate the z_q as a random vector of integers between (0, 1023) of dimension 1x1024, reshape this to 32x32 and then use model.decode_to_img to make an image of 512x512. To make a sensible image of faces I autoregressively generate the next codebook token using row-major sequence on the 32x32 matrix, similar mechanism as given in the notebook here . Unfortunately, the final image I get is something like this

download

It is something like the repetition of a pattern of faces. Could you please guide me on this?

Thanks!

Overfitting problem when training transformer

I trained the transformer but found that it overfits after 30-40 epochs: the validation loss goes up while the training loss becomes very small.
In case you run into this problem during training: I am now trying pkeep=0.9 in cond_transformer.py to avoid overfitting.

Segmentation map palette

Hi, thank you for sharing your amazing work. I want to play with it a bit and especially with my own segmentation maps, where I can find which color represents what material in the landscape model?

Here is an example of a segmentation map:
download (1)

Gradient flow

Hi guys, first of all, impressive work you have done here.

Skimming through the repo I noticed that the critic/discriminator receives gradients through both losses on account of it not having its gradients frozen, when the autoencoder part is optimized. Do I see that correctly? and if so, why did you choose to do that?

Has anyone succeeded in reproducing the results?

I am still struggling with training VQ-GAN in the first stage, not even the conditional transformer which is a second stage.
The result looks fine before the discriminator loss is injected. BUT using the discriminator loss suddenly ruins the reconstructed images. disc_loss remains 1.0 during the training. Why??

Discriminator loss remains unchanged during training

Great work!

I have been working on my own dataset recently. During training, I found two odd things about the loss. I would really appreciate your guidance if you've had the same problems before.

a. I fit the model on my own dataset and the whole process runs well, except that the D loss stays at 1 during training! I followed the same procedure, where the discriminator starts after several epochs. It seems that D loses its ability to distinguish real from fake. I decreased the number of pre-running epochs but ended up with the same result.

b. I tried to exclude the D loss and keep the perceptual loss. The reconstructed results seem fine, except that there are some blocking artifacts in areas with complex patterns. I wonder whether you guys ran into the same oddity before.

All in all, I really think this piece of work is a big step toward better text-image generation.

Pose to Image, Segmentation Mask Question

I was hoping to reimplement the pose to image portion of the paper with a couple modifications, does anyone have any information for the range of possible values allowed in segmentation, and also what color they correspond to?
Also, can you share the config and how you trained this model?

How to inpaint images?

Thanks. In the paper, top row of Figure 4 shows two image inpainting results. I wonder how to do that. Can I replace the image coordinates with masked images?

Setting for the reported experimental results

Hi @rromb , some quantitative results of FID on CelebA-HQ, ADE20k, e.g., are reported in this repo. But the model setting is not clear, such as if the model includes a conditional input (e.g., semantic map) or is unconditioned. Can you add the model setting in the table? Thanks.

New conditional model training

So the paper is fantastic and had lot of fun playing with the pretrained models :)
However I'm slightly confused about training a new model with COCO dataset . What I understood is the following:

  1. Create a new training config file which creates image, conditional image pairs as part of the training and validation dataloaders.
  2. Train the VQGAN with COCO dataset
  3. Train the VQGAN with conditional image (Is this needed? Because with DRIN there were two training steps but with coco segmentation only one VQGAN training step is required. )
  4. Train the Transformer with the above trained VQGAN checkpoints.

Can someone confirm if my understanding is correct?

Are the grid patterns normal?

I am trying to train VQ-GAN on the COCO dataset, but I got reconstructed images with grid patterns on them, and the reconstructed image doesn't look like the input. Is this normal at the beginning of the training? Am I doing anything wrong?

image

License

Hi, please add a license file thanks

pip installable package

Hi! Thank you for the great paper :)

I am the owner of https://github.com/lucidrains/DALLE-pytorch and was thinking of offering the users a way to train DALL-E using your pretrained VQ-GAN, specifically the one with a codebook of size 1024. lucidrains/DALLE-pytorch#75 I was wondering if you would be open to making your repository a pip installable package, with all the necessary dependencies (omegaconf and pytorch-lightning), so that it could be installed with

$ pip install taming-transformers

followed by

from taming.model.vqgan import VQModel

multi-gpu training hangs

Hmm, training with --gpus 0, works fine but training with --gpus 0,1 hangs right at initializing ddp ...

How to decide the training epochs or early stop condition?

I really like your paper, thanks for your open source!
It seems that you did not use early stop in the ModelCheckpoint. Could you please tell me how many epochs you trained the VQGAN and transformer? Or do you have suggestions about the training epochs on new datasets?

CelebA-HQ

Regarding CelebA-HQ - in the link you shared there are instructions on how to create tfrecords for CelebA-HQ but not the npy which are required in your code. Can you please provide some guidance on this?
Thank you

Skip steps in Inferencing to reduce runtime?

I've been looking at the inference scripts, mostly the taming-transformers.ipynb file, and I can't figure out how to get the transformer to process every other step, every 3 steps, every n steps, etc. How would I modify the script to skip steps, at the expense of quality?

The file hosting server heibox.uni-heidelberg.de is down

https://heibox.uni-heidelberg.de/seafhttp/files/0cc07b02-72f5-4615-a2ac-ace188cf0ed0/last.ckpt

remote: Enumerating objects: 13, done.
remote: Counting objects: 100% (13/13), done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 671 (delta 4), reused 7 (delta 3), pack-reused 658
Receiving objects: 100% (671/671), 116.29 MiB | 24.59 MiB/s, done.
Resolving deltas: 100% (139/139), done.
/content/taming-transformers
--2021-03-24 20:53:31--  https://heibox.uni-heidelberg.de/f/140747ba53464f49b476/?dl=1
Resolving heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)... 129.206.7.113
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://heibox.uni-heidelberg.de/seafhttp/files/0cc07b02-72f5-4615-a2ac-ace188cf0ed0/last.ckpt [following]
--2021-03-24 20:53:31--  https://heibox.uni-heidelberg.de/seafhttp/files/0cc07b02-72f5-4615-a2ac-ace188cf0ed0/last.ckpt
Reusing existing connection to heibox.uni-heidelberg.de:443.
HTTP request sent, awaiting response... 200 OK
Length: 957954257 (914M) [application/octet-stream]
Saving to: ‘logs/vqgan_imagenet_f16_1024/checkpoints/last.ckpt’

logs/vqgan_imagenet   0%[                    ]       0  --.-KB/s    in 29s     

2021-03-24 20:54:01 (0.00 B/s) - Connection closed at byte 0. Retrying.

--2021-03-24 20:54:02--  (try: 2)  https://heibox.uni-heidelberg.de/seafhttp/files/0cc07b02-72f5-4615-a2ac-ace188cf0ed0/last.ckpt
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 502 Proxy Error
2021-03-24 20:54:32 ERROR 502: Proxy Error.

--2021-03-24 20:54:32--  https://heibox.uni-heidelberg.de/f/6ecf2af6c658432c8298/?dl=1
Resolving heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)... 129.206.7.113
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://heibox.uni-heidelberg.de/seafhttp/files/3dbcbfc9-5824-4909-8237-df3035a8d83b/model.yaml [following]
--2021-03-24 20:54:32--  https://heibox.uni-heidelberg.de/seafhttp/files/3dbcbfc9-5824-4909-8237-df3035a8d83b/model.yaml
Reusing existing connection to heibox.uni-heidelberg.de:443.
HTTP request sent, awaiting response... 502 Proxy Error
2021-03-24 20:55:02 ERROR 502: Proxy Error.

--2021-03-24 20:55:03--  https://heibox.uni-heidelberg.de/f/867b05fc8c4841768640/?dl=1
Resolving heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)... 129.206.7.113
Connecting to heibox.uni-heidelberg.de (heibox.uni-heidelberg.de)|129.206.7.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://heibox.uni-heidelberg.de/seafhttp/files/5baa72fd-1411-420c-b711-69aeacccbf8d/last.ckpt [following]
--2021-03-24 20:55:03--  https://heibox.uni-heidelberg.de/seafhttp/files/5baa72fd-1411-420c-b711-69aeacccbf8d/last.ckpt
Reusing existing connection to heibox.uni-heidelberg.de:443.
HTTP request sent, awaiting response...

I can't access the checkpoints and can no longer use them for our implementation. Please host them on a different server!
