
biggan-pytorch's People

Contributors

ajbrock, alexandonian, cclauss, darthsuogles, ishengfang, jeffling, mikeshatch


biggan-pytorch's Issues

Runtime Error when saving model

Hello, I'm having the following run-time error when saving my model at the first model save point. Any ideas or help would be excellent. Thank you.

RuntimeError: The size of tensor a (25) must match the size of tensor b (50) at non-singleton dimension 0

More context:

Saving weights to weights/BigGAN_C100_seed1_Gch64_Dch64_bs50_nDs4_Glr2.0e-04_Dlr2.0e-04_Gnlrelu_Dnlrelu_GinitN02_DinitN02_ema/copy0...
Traceback (most recent call last):
  File "train.py", line 227, in <module>
    main()
  File "train.py", line 224, in main
    run(config)
  File "train.py", line 206, in run
    state_dict, config, experiment_name)
  File "/workspace/BigGAN-PyTorch/train_fns.py", line 140, in save_and_sample
    z_=z_)
  File "/workspace/BigGAN-PyTorch/utils.py", line 895, in sample_sheet
    o = nn.parallel.data_parallel(G, (z_[:classes_per_sheet], G.shared(y)))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 207, in data_parallel
    outputs = parallel_apply(replicas, inputs, module_kwargs, used_device_ids)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
    raise output
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/BigGAN-PyTorch/BigGAN.py", line 248, in forward
    h = block(h, ys[index])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/BigGAN-PyTorch/layers.py", line 399, in forward
    h = self.activation(self.bn1(x, y))
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/workspace/BigGAN-PyTorch/layers.py", line 325, in forward
    return out * gain + bias

Performance on small datasets?

Hi,
does anyone know how BigGAN performs on small datasets, e.g. the birds dataset (Caltech-UCSD Birds-200-2011)? In that dataset we only have about 50 images per class.
I have tried to run BigGAN on the birds dataset, but the results don't look very good (worse than baseline models). Since BigGAN is a large model, does it need a large dataset to perform well?

How should I conditionally generate samples

I found that the categorical label distribution is sampled during the sampling step. How do I condition on a specific label to generate samples?

Thanks for the awesome code!
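
For concreteness, here is a minimal sketch of what I mean by conditioning on a fixed label, assuming the G(z, G.shared(y)) calling convention used elsewhere in this repo (the class index 207 is just a made-up example):

import torch

# Assumes G is an already-loaded Generator in eval mode on the GPU.
batch_size = 8
class_idx = 207  # hypothetical target class

with torch.no_grad():
    z = torch.randn(batch_size, G.dim_z, device='cuda')
    # Use a fixed class index instead of sampling y from the categorical distribution.
    y = torch.full((batch_size,), class_idx, dtype=torch.long, device='cuda')
    images = G(z, G.shared(y))  # outputs in [-1, 1]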

Error encountered in parallel training

Hello, I was training with my own dataset, which has 3 categories and 10K images in each. I use the launch_BigGAN_bs256x8.sh script.
The training exits with the following error message:

157/157 ( 99.36%) (TE/ET1k: 72:21 / 391:26) ... D_loss_real : +0.914, D_loss_fake : +0.899
Traceback (most recent call last):
  File "train.py", line 227, in <module>
    main()
  File "train.py", line 224, in main
    run(config)
  File "train.py", line 184, in run
    metrics = train(x, y)
  File "/share/vision/BigGAN-PyTorch/train_fns.py", line 42, in train
    split_D=config['split_D'])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/opt/conda/lib/python3.6/site-packages/torch/_utils.py", line 369, in reraise
    raise self.exc_type(msg)
TypeError: Caught TypeError in replica 7 on device 7.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
TypeError: forward() missing 2 required positional arguments: 'z' and 'gy'

Looks like something is wrong in data-parallel mode. I tried with PyTorch 1.0.1 and 1.2.0; the error stays the same.
Could you take a look at it, please?

Why channel drop?

I was checking another repository (before checking this one) and found a strange channel-drop trick.
huggingface/pytorch-pretrained-BigGAN#9

I can see you also use it here:

# Drop channels in x if necessary
if self.in_channels != self.out_channels:
    x = x[:, :self.out_channels]

Could you explain why you do this? It seems strange to train with more channels than necessary and then drop some of them at inference time. Does this trick somehow help training?

z_dim for 512x512 model

The paper mentions in Appendix B that the 512x512 model should use a 160-dimensional z. But both in this codebase and on TF Hub, a 128-dimensional z space is used (which is smaller than the 140-dimensional z space of the 256x256 model). Which is correct? If the code is correct, should the paper be updated, and why does the 512x512 model use a smaller z space than the 256x256 model?

Why is drop_last of the DataLoader disabled?

I cut num_workers to 0 due to lack of RAM and ran BigGAN_bs256x8.sh, and ended up with the error below:
[screenshot of the error]
I noticed that it was processing the last batch of the first epoch, so I dug into the dataloader code and found that drop_last is disabled when use_multiepoch_sampler is enabled. Could that cause the tuple error I am seeing?
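
To make the suspicion concrete, here is a small standalone example (plain PyTorch, not the repo's loader) showing how a short final batch shows up when drop_last is False:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(10).float(), torch.arange(10))

# drop_last=False: the last batch is short (10 % 4 == 2), so code that assumes
# a fixed batch size can blow up on the final iteration of the epoch.
for x, y in DataLoader(dataset, batch_size=4, drop_last=False):
    print(x.shape[0])  # 4, 4, 2

# drop_last=True: the short final batch is discarded and every batch has 4 samples.
for x, y in DataLoader(dataset, batch_size=4, drop_last=True):
    print(x.shape[0])  # 4, 4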

FID is nan

Hi guys, thank you for the amazing work!

I'm having trouble with FID. During training, the log often shows FID as nan. It doesn't happen every time, but more than half the time.

So I wonder: in which cases can FID be nan? And what can I do to prevent it?

Thank you very much!

Question regarding the MultiEpochSampler

I've been reading the codebase, and the MultiEpochSampler looks weird to me. I don't know if I understand it correctly, but it seems to me that in utils.get_data_loaders:
train_set is a vanilla dataset that is read through once, i.e. one epoch of data;
sampler is a MultiEpochSampler whose length is len(train_set) * num_epochs.
During training there is a nested loop that iterates through sampler num_epochs times. In total the model is trained on num_epochs^2 * len(train_set) images, so the "real" number of epochs is actually num_epochs^2.

I'm wondering if it's on purpose or a bug. Thank you very much.
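
To make the counting concrete, here is a small sketch with made-up numbers (not the repo's code):

# Hypothetical sizes, just to illustrate the double counting described above.
len_train_set = 1_000_000   # images seen in one pass over the dataset
num_epochs = 100

# The MultiEpochSampler yields this many indices in a single pass over it:
len_sampler = len_train_set * num_epochs           # 100,000,000

# If the training loop then also iterates over the loader num_epochs times,
# the total number of images seen is:
total_images = num_epochs * len_sampler            # 10,000,000,000
effective_epochs = total_images // len_train_set   # num_epochs ** 2 == 10,000
print(effective_epochs)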

Do you have any plan to release the Discriminator weights for 256 and 512 models?

Hi ajbrock,
Do you have any plans to release the discriminator weights for the 256 or 512 models? I tried to use the discriminator for the 128 model and I think it is a conditional discriminator. Is there any way to convert it to an unconditional discriminator? Please correct me if I've misunderstood anything.
Thanks so much!

Any difference between Attention and SAGAN?

https://github.com/heykeetae/Self-Attention-GAN/blob/master/sagan_models.py
In this code

import torch
import torch.nn as nn

class Self_Attn(nn.Module):
    """Self-attention layer."""
    def __init__(self, in_dim, activation):
        super(Self_Attn, self).__init__()
        self.chanel_in = in_dim
        self.activation = activation

        self.query_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim // 8, kernel_size=1)
        self.key_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim // 8, kernel_size=1)
        self.value_conv = nn.Conv2d(in_channels=in_dim, out_channels=in_dim, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))

        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x):
        """
        inputs:
            x : input feature maps (B x C x W x H)
        returns:
            out : self-attention value + input feature
            attention : B x N x N (N is Width*Height)
        """
        m_batchsize, C, width, height = x.size()
        proj_query = self.query_conv(x).view(m_batchsize, -1, width * height).permute(0, 2, 1)  # B x N x C'
        proj_key = self.key_conv(x).view(m_batchsize, -1, width * height)                       # B x C' x N
        energy = torch.bmm(proj_query, proj_key)                                                # B x N x N
        attention = self.softmax(energy)
        proj_value = self.value_conv(x).view(m_batchsize, -1, width * height)                   # B x C x N

        out = torch.bmm(proj_value, attention.permute(0, 2, 1))
        out = out.view(m_batchsize, C, width, height)

        out = self.gamma * out + x
        return out, attention
# A non-local block as used in SA-GAN
# Note that the implementation as described in the paper is largely incorrect;
# refer to the released code for the actual implementation.

Which implementation do you mean is largely incorrect, and what is the actual implementation? Thanks.

Undefined name 'self' in layers.py

flake8 testing of https://github.com/ajbrock/BigGAN-PyTorch on Python 3.7.1

$ flake8 . --count --select=E9,F63,F72,F82 --show-source --statistics

./layers.py:261:14: F821 undefined name 'self'
  if 'ch' in self.norm_style:
             ^
./layers.py:262:14: F821 undefined name 'self'
    ch = int(self.norm_style.split('_')[-1])
             ^
./layers.py:265:17: F821 undefined name 'self'
  elif 'grp' in self.norm_style:
                ^
./layers.py:266:18: F821 undefined name 'self'
    groups = int(self.norm_style.split('_')[-1])
                 ^
./utils.py:1005:35: F632 use ==/!= to compare str, bytes, and int literals
  'Gattn%s' % config['G_attn'] if config['G_attn'] is not '0' else None,
                                  ^
./utils.py:1006:35: F632 use ==/!= to compare str, bytes, and int literals
  'Dattn%s' % config['D_attn'] if config['D_attn'] is not '0' else None,
                                  ^
./train_fns.py:165:28: F821 undefined name 'z_'
                           z_, y_, config['n_classes'],
                           ^
./train_fns.py:165:32: F821 undefined name 'y_'
                           z_, y_, config['n_classes'],
                               ^
./sync_batchnorm/batchnorm_reimpl.py:15:1: F822 undefined name 'BatchNormReimpl' in __all__
__all__ = ['BatchNormReimpl']
^
2     F632 use ==/!= to compare str, bytes, and int literals
6     F821 undefined name 'self'
1     F822 undefined name 'BatchNormReimpl' in __all__
9

E901, E999, F821, F822, F823 are the "showstopper" flake8 issues that can halt the runtime with a SyntaxError, NameError, etc. These 5 are different from most other flake8 issues, which are merely "style violations" -- useful for readability but they do not affect runtime safety.

  • F821: undefined name name
  • F822: undefined name name in __all__
  • F823: local variable name referenced before assignment
  • E901: SyntaxError or IndentationError
  • E999: SyntaxError -- failed to compile a file into an Abstract Syntax Tree

Sync Batchnorm

The PyTorch team recently released an official SyncBatchNorm implementation. It requires a specific setup where we use torch.nn.parallel.DistributedDataParallel(...) instead of nn.DataParallel(...) and launch a separate process for each GPU.

I wrote a small step-by-step here: https://github.com/dougsouza/pytorch-sync-batchnorm-example.

In my experiments SyncBatchNorm worked well. Also, using torch.nn.parallel.DistributedDataParallel(...) with one process per GPU provides a huge speedup in training: the gain from adding more GPUs is almost linear, and it performs a lot faster than nn.DataParallel(...). I believe you could reduce training time drastically by switching to torch.nn.parallel.DistributedDataParallel(...).

BTW, thanks for this implementation!
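
For reference, a minimal sketch of the setup I mean, assuming one process per GPU (launched e.g. via torch.distributed.launch) and a local_rank argument passed in; the model here is a placeholder:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel

def wrap_model_for_ddp(model, local_rank):
    # One process per GPU; the NCCL backend is the usual choice for GPU training.
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(local_rank)

    # Replace every BatchNorm layer with its synchronized counterpart.
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    model = model.cuda(local_rank)

    # Gradients are all-reduced across processes automatically.
    return DistributedDataParallel(model, device_ids=[local_rank])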

Output images have holes in them

I have been training my own models with 256x256 outputs, using my own dataset, with 1 class only. The training is working as expected, but I keep noticing the outputs from the GAN have holes in them. See example below.

Initially I thought it is a matter of training for longer, or rather that the training data is not very clean. But when I tried the same with a different cleaner dataset (for a different class), I still noticed a hole in all the outputs, even after training for longer.

I am using 4 GPUs, with a batch size of 40 and --num_G_accumulations 4 --num_D_accumulations 4.

I am wondering if anyone ran in the same issue? and what could be the problem?

I included an example below: you can see the hole in the center of the image.

[example output image showing the hole in the center]

Singular value clamping code?

I am looking for the code that clamps the singular values of the weight matrices:
[screenshot of the relevant equation]

(i.e. page 6 of the arXiv paper)

but I can't seem to find it in the training loop. Does anyone know where it is?

Thanks

groupnorm function in layers.py

Hi, when I read the code in layers.py, I found that at L322 the inputs are x and self.normstyle, but self.normstyle is not defined anywhere earlier. Should it be self.norm_style? Also, in the definition of the groupnorm function at L259, norm_style can have a format like "ch_32" or "grp_16", but if that norm_style is the self.norm_style used at L322, it can only be one of "bn", "ln", "in", or "gn". So I am quite confused about this function. Could you give some more explanation? Thanks!

Places365

Hi!
I'd like to ask about pretrained models for Places365. The README says that Places365 pretrained models are coming soon. Is there still a plan to make them available? I believe I wouldn't be the only one who would find them useful, so it would be great if you could release them.

A mismatch problem on fine-tune

Hi, I tried to fine-tune the pre-trained model on my dataset, but the following error was raised, which probably means a mismatch in the number of classes between ImageNet (1000) and my dataset (2). Is there any suggestion on how to fine-tune with a different number of classes? Thanks.

RuntimeError: Error(s) in loading state_dict for Generator:
size mismatch for shared.weight: copying a param with shape torch.Size([1000, 128]) from checkpoint, the shape in current model is torch.Size([2, 128]).
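
A hedged sketch of the workaround I am considering: load the checkpoint but drop every entry whose shape no longer matches the current model (such as the 1000-class shared embedding), so those layers keep their fresh initialization:

import torch

def load_matching_weights(model, checkpoint_path):
    # Keep only parameters whose name and shape match the current model.
    ckpt = torch.load(checkpoint_path, map_location='cpu')
    own = model.state_dict()
    matched = {k: v for k, v in ckpt.items()
               if k in own and v.shape == own[k].shape}
    own.update(matched)
    model.load_state_dict(own)
    skipped = [k for k in ckpt if k not in matched]
    print('Skipped (name or shape mismatch):', skipped)
    return model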

Looking forward to official TPU-version BigGAN code~

Hi ajbrock,
Thanks for open-sourcing the BigGAN code, which has helped a lot of further exploration of BigGAN. Considering the time cost, we hope to train BigGAN on TPU, so we used a TensorFlow implementation that looks close to yours (https://github.com/Octavian-ai/BigGAN-TPU-TensorFlow), but many of our attempts turned out badly. Could you also release the official TPU version of the BigGAN code?

Sincerely

config:
Imagenet2012
tf-1.12.2
GCP v3-8 pod
training for up to 300k steps

Some failed results (sample grids at 100k, 180k, and 300k steps):
[100k samples]
[180k samples]
[300k samples]

Running out of memory

Hi,

I'm running ./scripts/launch_BigGAN_bs256x8.sh and getting the following error:

RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 7.44 GiB total capacity; 5.47 GiB already allocated; 487.56 MiB free; 1.07 GiB cached)
@ellismarte

Is there something I can configure so that that much memory doesn't get used and I don't run out of memory?

RuntimeError: The size of tensor a (1000) must match the size of tensor b (10) at non-singleton dimension 0

Hi, I want to fine-tune BigGAN on my own dataset, which has 10 classes. I have modified num_classes and made sure the checkpoint can be loaded into the model, but when I run it, I get this error. I cannot figure out what other config option I may have left out. Do I need to modify the number of classes of the inception_v3 model as well?

1/488 ( 0.00%) Traceback (most recent call last):
  File "../train.py", line 228, in <module>
    main()
  File "../train.py", line 224, in main
    run(config)
  File "../train.py", line 184, in run
    metrics = train(x, y)
  File "/cache/code/BigGAN-PyTorch/train_fns.py", line 58, in train
    D.optim.step()
  File "/root/miniconda3/lib/python3.6/site-packages/torch/optim/adam.py", line 93, in step
    exp_avg.mul_(beta1).add_(1 - beta1, grad)
RuntimeError: The size of tensor a (1000) must match the size of tensor b (10) at non-singleton dimension 0
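
My current reading of the traceback, stated as a guess: the saved Adam state still has moments shaped for the 1000-class embedding, so restoring it clashes with the new 10-class model. A minimal sketch of the workaround I am trying (paths are placeholders; G and D are already constructed):

import torch

weights_root = 'weights/my_finetune_run'  # hypothetical path

# Restore only the network weights (already adapted to 10 classes)...
G.load_state_dict(torch.load('%s/G.pth' % weights_root))
D.load_state_dict(torch.load('%s/D.pth' % weights_root))

# ...and deliberately skip G_optim.pth / D_optim.pth, so G.optim and D.optim
# keep their freshly-initialized, correctly-shaped Adam state.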

How to test on only one GPU?

Hi, we are a group of students reproducing this BigGAN model for our coursework. One question: we only have one GPU on Colab, and we are wondering how to modify the model/training for that. BTW, we are trying to use another dataset too, and there are some problems there as well. Hope to get your reply; we really appreciate it :).

Question about truncation trick in Big GAN.

I think the PyTorch implementation slightly differs from the one on TensorFlow Hub w.r.t. the truncation trick. Here is the situation: when I set truncation to 1, the two models (TF model vs. PyTorch model converted from the TF model) produce the same results. However, this no longer holds when truncation is less than 1.

According to the code on https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/biggan_generation_with_tf_hub.ipynb, the truncation factor not only acts on the input latent vector but is also passed to the module through a placeholder. In contrast, in this PyTorch version the truncation factor only affects the input noise z in sample.py.

I have also found that the following two settings give same outputs:
(1) With Tensorflow Hub,

z = 0.02 * truncnorm.rvs(-2, 2, size=(1, dim_z), random_state=np.random.RandomState(0))
truncation = 1.0

(2) With PyTorch,

z = 0.02 * truncnorm.rvs(-2, 2, size=(1, dim_z), random_state=np.random.RandomState(0))

But the following two settings give different syntheses:
(3) With Tensorflow Hub,

z = 0.02 * truncnorm.rvs(-2, 2, size=(1, dim_z), random_state=np.random.RandomState(0))
truncation = 0.02

(4) With PyTorch,

z = 0.02 * truncnorm.rvs(-2, 2, size=(1, dim_z), random_state=np.random.RandomState(0))

Would you please help me figure this out? I want to know how the official BigGAN applies the truncation trick besides truncating the latent code. BTW, I experimented with the 512x512 BigGAN model (https://tfhub.dev/deepmind/biggan-512/2).
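
For reference, this is the only place I apply truncation on the PyTorch side right now (a minimal sketch; dim_z and the truncation value are placeholders), which is why I suspect the difference lies in whatever else the TF Hub module does with its truncation input:

import numpy as np
import torch
from scipy.stats import truncnorm

def truncated_z(batch_size, dim_z, truncation, seed=0):
    # Sample from a standard normal truncated to [-2, 2], then scale by `truncation`.
    state = np.random.RandomState(seed)
    values = truncnorm.rvs(-2, 2, size=(batch_size, dim_z), random_state=state)
    return torch.from_numpy(truncation * values).float()

z = truncated_z(batch_size=1, dim_z=128, truncation=0.5)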

How to understand the result given by discriminator?

Hi, I want to use the discriminator alone.

Following is my code:

import torch
from BigGAN import Discriminator
import utils
from utils import Distribution

def load():
    path ='pre-trained/138k/'
    d_state_dict = torch.load(path + 'D.pth')
    D = Discriminator(D_ch=96, skip_init=True)
    D.load_state_dict(d_state_dict)
    return D

if __name__ == '__main__':
    D = load()
    D.eval()
    D.cuda()
    x = ...  # x is an image (left elided here)
    x = x.to('cuda')
    y_ = Distribution(torch.zeros(1, requires_grad=False))
    y_.init_distribution('categorical', num_categories=1000)
    y_ = y_.to('cuda', torch.int64, non_blocking=False, copy=False)
    y_.sample_()
    print(D(x, y_[:1]))

I have three questions:

  1. Is my code correct? Especially the preparation of y_.
  2. How to preprocess x? I use imagenet val 2012.
  3. How to interpret the output of D? I find the value can be positive or negative; which range means the discriminator thinks the input is fake, and which means real?

Thanks for your time.
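
To make question 2 concrete, here is the preprocessing I am currently guessing at (assuming the 138k checkpoint is the 128x128 model and that inputs should be normalized to [-1, 1]; the file path is a placeholder):

from PIL import Image
import torchvision.transforms as transforms

preprocess = transforms.Compose([
    transforms.Resize(128),
    transforms.CenterCrop(128),
    transforms.ToTensor(),                                   # [0, 1]
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),  # -> [-1, 1]
])

img = Image.open('ILSVRC2012_val_00000001.JPEG').convert('RGB')  # hypothetical file
x = preprocess(img).unsqueeze(0)  # shape [1, 3, 128, 128]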

Using grayscale input images instead of RGB.

Hello @ajbrock! Thank you so much for making your model available for others to use. I'm trying to re-purpose it at the moment for a research project.

I have a two-fold issue: one piece is data-related, the other architecture-related.

  1. I am trying to use a dataset of .png grayscale images produced by an analogue-to-digital converter. The image dimensions are 512x512 and there is only 1 class. I have made the following modifications in order to get the dataset loaded: (larcv is the dataset name)

In utils.py

# Convenience dicts
dset_dict = {'larcv_png': dset.ImageFolder, 'larcv_hdf5': dset.ILSVRC_HDF5,
             'I32': dset.ImageFolder, 'I64': dset.ImageFolder,
             'I128': dset.ImageFolder, 'I256': dset.ImageFolder,
             'I32_hdf5': dset.ILSVRC_HDF5, 'I64_hdf5': dset.ILSVRC_HDF5,
             'I128_hdf5': dset.ILSVRC_HDF5, 'I256_hdf5': dset.ILSVRC_HDF5,
             'C10': dset.CIFAR10, 'C100': dset.CIFAR100}
imsize_dict = {'larcv_png': 512, 'larcv_hdf5': 512,
               'I32': 32, 'I32_hdf5': 32,
               'I64': 64, 'I64_hdf5': 64,
               'I128': 128,
               'I128_hdf5': 128,
               'I256': 256, 'I256_hdf5': 256,
               'C10': 32, 'C100': 32}
root_dict = {'larcv_png': 'larcv_png', 'larcv_hdf5': 'ILSVRC512.hdf5',
             'I32': 'ImageNet', 'I32_hdf5': 'ILSVRC32.hdf5',
             'I64': 'ImageNet', 'I64_hdf5': 'ILSVRC64.hdf5',
             'I128': 'ImageNet', 'I128_hdf5': 'ILSVRC128.hdf5',
             'I256': 'ImageNet', 'I256_hdf5': 'ILSVRC256.hdf5',
             'C10': 'cifar', 'C100': 'cifar'}
nclass_dict = {'larcv_png': 1, 'larcv_hdf5': 1,
               'I32': 1000, 'I32_hdf5': 1000,
               'I64': 1000, 'I64_hdf5': 1000,
               'I128': 1000, 'I128_hdf5': 1000,
               'I256': 1000, 'I256_hdf5': 1000,
               'C10': 10, 'C100': 100}
# Number of classes to put per sample sheet
classes_per_sheet_dict = {'larcv_png': 1, 'larcv_hdf5': 1,
                          'I32': 50, 'I32_hdf5': 50,
                          'I64': 50, 'I64_hdf5': 50,
                          'I128': 20, 'I128_hdf5': 20,
                          'I256': 20, 'I256_hdf5': 20,
                          'C10': 10, 'C100': 100}

The dataset does serialize and load successfully, but when I check the dimensions of the images inside the ILSVRC_HDF5 class in datasets.py using img.shape, the dimensions show as [3, 512, 512].

This leads to a size-mismatch in the forward function of G_D at the line:
D_input = torch.cat([G_z, x], 0) if x is not None else G_z where G_z.shape = [4, 1, 512, 512] and x.shape = [4, 3, 512, 512]

  2. I've made the following changes to the D_arch dictionary in order to accommodate the 512x512 images:

  arch[512] = {'in_channels' :  [1] + [ch*item for item in [1, 2, 4, 8, 8, 16, 16]],
               'out_channels' : [item * ch for item in [1, 2, 4, 4, 8, 8, 16, 16]],
               'downsample' : [True] * 7 + [False],
               'resolution' : [512, 256, 128, 64, 32, 16, 8, 4],
               'attention' : {2**i: 2**i in [int(item) for item in attention.split('_')]
                              for i in range(2,10)}}

I have also modified the last layer of the Generator to output 1-channel images:

    # output layer: batchnorm-relu-conv.
    # Consider using a non-spectral conv here
    self.output_layer = nn.Sequential(layers.bn(self.arch['out_channels'][-1],
                                                cross_replica=self.cross_replica,
                                                mybn=self.mybn),
                                    self.activation,
                                    self.which_conv(self.arch['out_channels'][-1], 1))

My questions are:

  • How can I get the images to load with only 1 channel?
  • Are the architecture modifications I've made appropriate?

Thank you so much.
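
Regarding the first question, here is a minimal sketch of the direction I am considering for forcing single-channel loading (plain torchvision, not the repo's datasets.py; paths are placeholders):

import torchvision.datasets as dset
import torchvision.transforms as transforms

# Force 1-channel images regardless of how the PNGs are stored on disk.
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),  # single-channel mean/std
])

dataset = dset.ImageFolder(root='data/larcv_png', transform=transform)
img, label = dataset[0]
print(img.shape)  # torch.Size([1, 512, 512])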

MemoryError when calculate IS

I want to train BigGAN on the Places365 dataset and rewrote the dataloader. Making the hdf5 file works fine. However, when I run calculate_inception_moments.py, it hits a MemoryError. The log is below:

Traceback (most recent call last):
  File "calculate_inception_moments.py", line 93, in <module>
    main()
  File "calculate_inception_moments.py", line 89, in main
    run(config)
  File "calculate_inception_moments.py", line 68, in run
    pool, logits, labels = [np.concatenate(item, 0) for item in [pool, logits, labels]]
  File "calculate_inception_moments.py", line 68, in <listcomp>
    pool, logits, labels = [np.concatenate(item, 0) for item in [pool, logits, labels]]
MemoryError
Do you know how to solve the problem? Looking forward to your reply.
Thanks
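
In case it helps narrow things down, here is a hedged sketch of the direction I am considering: accumulating the mean and covariance of the pool features incrementally instead of concatenating every batch into one huge array (standalone numpy, not the repo's code):

import numpy as np

class RunningMoments:
    """Accumulate mean and covariance without storing all activations."""
    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim, dtype=np.float64)
        self.outer = np.zeros((dim, dim), dtype=np.float64)

    def update(self, batch):
        # batch: array of shape [batch_size, dim]
        batch = np.asarray(batch, dtype=np.float64)
        self.n += batch.shape[0]
        self.sum += batch.sum(axis=0)
        self.outer += batch.T @ batch

    def moments(self):
        mu = self.sum / self.n
        # Unbiased covariance, matching np.cov(pool, rowvar=False).
        sigma = (self.outer - self.n * np.outer(mu, mu)) / (self.n - 1)
        return mu, sigma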

Resuming training from given 100k checkpoint collapses

Training on a single V100 GPU, resuming from the checkpoint at 100,000 iterations linked to from the README, consistently collapses straight away.

Screenshot of script to resume training from checkpoint: https://drive.google.com/open?id=1ik16IVE9G8l7dCnpCjQStVslkE9Ltg19

The differences I can see, besides using just one GPU (and therefore a smaller batch size and more accumulations), are:

  • not using parallel
  • using 0 workers
  • not using multiepoch sampler
  • not using the EMA for evaluation (to show collapse)

I'm very confused as to why this is not working - none of those changes should make a difference, as far as I can tell - any suggestions greatly appreciated!

GIF showing collapse over about 130 iterations, sampling every 5: https://giphy.com/gifs/ifMnHURSReyjyE3IfC/html5

Logs: https://drive.google.com/drive/folders/11Far9osGQ2KslKY4Gyio8AHotBGXTKI6?usp=sharing

unsupervised GAN without class as conditional input?

Hi ajbrock,

Thanks a lot for sharing this great repo! I'm trying to train the model on CIFAR-10 first, but I found that the training setup for CIFAR-10 is similar to the ImageNet one: it requires a latent vector z and a class label y as inputs. In most papers, however, CIFAR-10 is mainly used for unsupervised GANs, which take only a latent vector z. Therefore, is there a way to train the model on CIFAR-10 in an unsupervised manner?

By the way, when I directly run the CIFAR-10 training script, the log shows that the FID score sometimes turns to NaN. What is the reason behind that?

Thank you very much!
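
For what it's worth, the workaround I am considering (a guess, not something from the repo) is to collapse everything to a single class so the conditioning becomes a constant:

import torch

# Hedged sketch: with n_classes = 1 (e.g. by editing nclass_dict['C10'] in
# utils.py), every sample gets the same label, so the class embedding is a
# constant and the model behaves essentially unconditionally.
batch_size = 64
y = torch.zeros(batch_size, dtype=torch.long)  # every sample is "class 0"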

Fine-tuning

Hi,
Has anyone fine-tuned from the provided model (source: ImageNet, target: tench (n01440764))? I used the same parameters to train. Before training, the result looks as follows:
[tench samples]

After one batch (note that I set the loss to 0 by multiplying it by 0.):

[tench samples after one batch]

Does anyone have the same problem?

OSError

Hi, I met this error: unable to open file: name = 'data/ILSVRC128.hdf5', error message = 'No such file or directory', flags = 0, o_flags = 0).

No space left on device

I'm running sh scripts/utils/prepare_data.sh and getting the following error

ubuntu@ip-172-31-13-86:~/BigGAN-PyTorch$ sh scripts/utils/prepare_data.sh
{'dataset': 'I128', 'data_root': 'data', 'batch_size': 128, 'num_workers': 4, 'chunk_size': 100, 'compression': False}
Using dataset root location data/ImageNet
Data will not be augmented...
Generating Index file I128_imgs.npz...
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 7.39it/s]
Starting to load I128 into an HDF5 file with chunk size 100 and compression None...
0%| | 0/783 [00:00<?, ?it/s]Producing dataset of len 100163
Image chunks chosen as (100, 3, 128, 128)
Label chunks chosen as (100,)
 83%|...| 653/783 [00:24<00:04, 26.39it/s]
Traceback (most recent call last):
  File "make_hdf5.py", line 110, in <module>
    main()
  File "make_hdf5.py", line 107, in main
    run(config)
  File "make_hdf5.py", line 97, in run
    f['imgs'][-x.shape[0]:] = x
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/h5py/_hl/dataset.py", line 632, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 221, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 132, in h5py._proxy.dset_rw
  File "h5py/_proxy.pyx", line 93, in h5py._proxy.H5PY_H5Dwrite
OSError: Can't prepare for writing data (file write failed: time = Sun Jul 21 20:37:20 2019
, filename = 'data/ILSVRC128.hdf5', file descriptor = 22, errno = 28, error message = 'No space left on device', buf = 0x55da1f79c868, total write size = 735016, bytes this sub-write = 735016, bytes actually written = 18446744073709551615, offset = 4123893760)

{'dataset': 'I128_hdf5', 'data_root': 'data', 'batch_size': 64, 'parallel': False, 'augment': False, 'num_workers': 8, 'shuffle': False, 'seed': 0}
Using dataset root location data/ILSVRC128.hdf5
Downloading: "https://download.pytorch.org/models/inception_v3_google-1a9a5a14.pth" to /home/ubuntu/.cache/torch/checkpoints/inception_v3_google-1a9a5a14.pth
0%| | 16384/108857766 [00:00<00:02, 49977801.26it/s]
Traceback (most recent call last):
  File "calculate_inception_moments.py", line 91, in <module>
    main()
  File "calculate_inception_moments.py", line 87, in main
    run(config)
  File "calculate_inception_moments.py", line 55, in run
    net = inception_utils.load_inception_net(parallel=config['parallel'])
  File "/home/ubuntu/BigGAN-PyTorch/inception_utils.py", line 262, in load_inception_net
    inception_model = inception_v3(pretrained=True, transform_input=False)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torchvision/models/inception.py", line 45, in inception_v3
    progress=progress)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/hub.py", line 433, in load_state_dict_from_url
    _download_url_to_file(url, cached_file, hash_prefix, progress=progress)
  File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/hub.py", line 367, in _download_url_to_file
    f.write(buffer)
  File "/home/ubuntu/anaconda3/lib/python3.6/tempfile.py", line 485, in func_wrapper
    return func(*args, **kwargs)
OSError: [Errno 28] No space left on device

What do I need to change in order to solve this?

Training results(IS and FID) are not good as yours with same training process

Hi ajbrock,
I was running the training code on ImageNet using the default launch_BigGAN_bs256x8.sh script. It has finished 134k iterations, and here is the log file:
[screenshot of the training log]

Compared with the log file you released, I got worse results. I kept all the parameters the same as your default settings. The training is on 8xV100. Do you have any suggestions to make it better? Or what should I check to get a result similar to yours?

Thanks a lot!

training at 512x512

Hello, I am trying to train BigGAN on a custom dataset whose resolution is 512x512.
I edited one of the provided scripts to launch the training, but I got a KeyError when the code builds the discriminator. I noticed that there is no discriminator architecture for 512 resolution. Could you provide a discriminator architecture for that resolution?

Thanks!

Use my own dataset

If I use my own dataset, do I need to train an inception_v3 on my own dataset when computing the initial (reference) inception measurements?

Minimal working example for sampling from pre-trained BigGAN?

Hi ajbrock,
I am so excited that you released the PyTorch version of BigGAN. I am trying to sample some results. Could you provide a minimal working example for sampling from a pre-trained BigGAN? @airalcorn2 and I wrote a piece of code for sampling, but the results look bad.
Here is our sample code.
Here is our sample code.

import functools
import numpy as np
import torch
import utils

from PIL import Image

parser = utils.prepare_parser()
parser = utils.add_sample_parser(parser)
config = vars(parser.parse_args())

# update config (see train.py for explanation)
config["resolution"] = utils.imsize_dict[config["dataset"]]
config["n_classes"] = utils.nclass_dict[config["dataset"]]
config["G_activation"] = utils.activation_dict[config["G_nl"]]
config["D_activation"] = utils.activation_dict[config["D_nl"]]
config = utils.update_config_roots(config)
config["skip_init"] = True
config["no_optim"] = True
device = "cuda:7"

# Seed RNG
utils.seed_rng(config["seed"])

# Setup cudnn.benchmark for free speed
torch.backends.cudnn.benchmark = True

# Import the model--this line allows us to dynamically select different files.
model = __import__(config["model"])
experiment_name = utils.name_from_config(config)
G = model.Generator(**config).to(device)
utils.count_parameters(G)

# Load weights
G.load_state_dict(torch.load("/mnt/raid/qi/biggan_weighs/G_optim.pth"), strict=False)

# Update batch size setting used for G
G_batch_size = max(config["G_batch_size"], config["batch_size"])
(z_, y_) = utils.prepare_z_y(
    G_batch_size,
    G.dim_z,
    config["n_classes"],
    device=device,
    fp16=config["G_fp16"],
    z_var=config["z_var"],
)

G.eval()

# Sample function
sample = functools.partial(utils.sample, G=G, z_=z_, y_=y_, config=config)

with torch.no_grad():
    z_.sample_()
    y_.sample_()
    image_tensors = G(z_, G.shared(y_))


for i in range(len(image_tensors)):
    image_array = image_tensors[i].permute(1, 2, 0).detach().cpu().numpy()
    image_array = np.uint8(255 * (1 + image_array) / 2)
    Image.fromarray(image_array).save(f"./test_images/{i}.png")

Here is one of our results:
[generated sample]

Thanks a lot.

Why train_transform wasn't applied in utils.get_data_loaders()

Hi,
I am building my own model on top of this code and am trying to speed up I/O, so I am working on using an hdf5 file for my custom dataset input. But in the get_data_loaders function in utils.py, it seems that train_transform is not applied, and I could not find an appropriate image preprocessing/transformation applied anywhere upstream or downstream in the data-feeding pipeline when using hdf5. Could you explain why train_transform is not applied when using the hdf5 dataset format? Or, if it is applied, could you point out exactly where? Thanks very much!

Some guidelines for better gpu perfomance.

Hi,

I am trying to train on a custom dataset using your algorithm.
I increase the batch size up to the largest value at which my script doesn't break.
I am running a script very similar to launch_BigGAN_ch64_bs256x8.sh.
I see that the memory being allocated is twice as much as is actually used (~5GB is utilised while ~10GB is allocated).
Also, after step 1000 the run broke when it tried to allocate more memory (I guess for sampling/evaluation, but why is that?).

I use 4 Titans with 11GB each, and although you say the opposite, I would appreciate any suggestions on how I can use this already quite powerful system to train your model.

Thanks a lot for your code and in advance for your time!

Are SAGAN and SNGAN scripts tested?

I have tried to train SAGAN and SNGAN using the provided scripts, and I find that performance grows very slowly; e.g., at 10k iterations they both have an IS of only around 1.00 and an FID of around 300, and sometimes the FID becomes nan.
I wonder whether these two scripts have been tested, since the README says the BigGAN-deep script is not tested but these two are not mentioned.
