My dataset is about 1,000 128×128 images (1,456 per the config printout below), and I'm training BigGAN-Tensorflow on a single Tesla K80 (~11 GiB). Training dies with the OOM errors shown in the full log below. How can I reduce the GPU memory load?
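For reference, the in-code equivalent of the TF_FORCE_GPU_ALLOW_GROWTH override the log reports is the allow_growth session option. A minimal TF 1.x sketch follows; it stops TensorFlow from pre-reserving the whole card but cannot shrink the model's actual working set:

```python
import tensorflow as tf

# Minimal sketch: same effect as TF_FORCE_GPU_ALLOW_GROWTH in the log below.
# The allocator grows on demand instead of grabbing all ~11 GiB up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
```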
Using TensorFlow backend.
2019-01-15 05:44:59.725488: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-01-15 05:44:59.725958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2019-01-15 05:44:59.725999: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-01-15 05:45:00.090022: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-01-15 05:45:00.090102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-01-15 05:45:00.090124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-01-15 05:45:00.090416: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2019-01-15 05:45:00.090487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10758 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
##### Information #####
# BigGAN 128
# gan type : hinge
# dataset : faces_resized
# dataset number : 1456
# batch_size : 2048
# epoch : 50
# iteration per epoch : 10000
##### Generator #####
# spectral normalization : True
# learning rate : 5e-05
##### Discriminator #####
# the number of critic : 2
# spectral normalization : True
# learning rate : 0.0002
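The batch_size of 2048 above is almost certainly the culprit: activation memory scales linearly with batch size, and with only 1,456 images a batch that large is not needed. Assuming the repo exposes the printed settings as command-line flags (the exact flag names are my assumption), a re-run might look like:

```
# Hypothetical flags, assumed to mirror the "Information" block above.
python main.py --dataset faces_resized --img_size 128 --batch_size 16 --epoch 50
```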
WARNING:tensorflow:From /content/BigGAN-Tensorflow/BigGAN_128.py:215: shuffle_and_repeat (from tensorflow.contrib.data.python.ops.shuffle_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.shuffle_and_repeat(...)`.
WARNING:tensorflow:From /content/BigGAN-Tensorflow/BigGAN_128.py:216: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
WARNING:tensorflow:From /content/BigGAN-Tensorflow/BigGAN_128.py:217: prefetch_to_device (from tensorflow.contrib.data.python.ops.prefetching_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.prefetch_to_device(...)`.
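The three warnings above each name their drop-in replacement. A sketch of the updated pipeline, where image_paths, parse_fn, and the sizes are placeholders rather than the repo's actual names:

```python
import tensorflow as tf

# Non-deprecated tf.data.experimental equivalents of the warned-about ops.
dataset = tf.data.Dataset.from_tensor_slices(image_paths)
dataset = dataset.apply(tf.data.experimental.shuffle_and_repeat(buffer_size=1456))
dataset = dataset.apply(tf.data.experimental.map_and_batch(parse_fn, batch_size=16))
dataset = dataset.apply(tf.data.experimental.prefetch_to_device('/gpu:0'))
```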
---------
Variables: name (type shape) [size]
---------
discriminator/resblock_down_1/res1/batch_norm/gamma:0 (float32_ref 3) [3, bytes: 12]
discriminator/resblock_down_1/res1/batch_norm/beta:0 (float32_ref 3) [3, bytes: 12]
discriminator/resblock_down_1/res1/conv_0/kernel:0 (float32_ref 3x3x3x96) [2592, bytes: 10368]
...
generator/resblock_up_2/res2/batch_norm/gamma/dense/kernel:0 (float32_ref 20x192) [3840, bytes: 15360]
generator/resblock_up_2/res2/batch_norm/gamma/dense/bias:0 (float32_ref 192) [192, bytes: 768]
generator/resblock_up_2/res2/deconv_0/kernel:0 (float32_ref 3x3x192x192) [331776, bytes: 1327104]
...
generator/resblock_up_1/res2/deconv_0/kernel:0 (float32_ref 3x3x96x96) [82944, bytes: 331776]
generator/resblock_up_1/skip/deconv_0/kernel:0 (float32_ref 3x3x96x192) [165888, bytes: 663552]
generator/batch_norm/gamma:0 (float32_ref 96) [96, bytes: 384]
generator/batch_norm/beta:0 (float32_ref 96) [96, bytes: 384]
generator/G_logit/kernel:0 (float32_ref 3x3x96x3) [2592, bytes: 10368]
Total size of variables: 198818145
Total bytes of variables: 795272580
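A quick sanity check on those totals shows the weights themselves are not the problem:

```python
# float32 = 4 bytes, so the two printed totals are consistent:
params = 198818145
assert params * 4 == 795272580   # "Total bytes of variables"
print(params * 4 / 2**30)        # ~0.74 GiB of weights on an ~11 GiB card
```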
[*] Reading checkpoints...
[*] Failed to find a checkpoint
[!] Load failed...
2019-01-15 05:46:08.449964: W tensorflow/core/common_runtime/bfc_allocator.cc:211] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.69GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2019-01-15 05:46:18.455636: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 396.09MiB. Current allocation summary follows.
2019-01-15 05:46:18.455791: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256): Total Chunks: 180, Chunks in use: 178. 45.0KiB allocated for chunks. 44.5KiB in use in bin. 3.9KiB client-requested in use in bin.
2019-01-15 05:46:18.455820: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512): Total Chunks: 150, Chunks in use: 150. 92.8KiB allocated for chunks. 92.8KiB in use in bin. 82.9KiB client-requested in use in bin.
...
2019-01-15 05:46:18.456213: I tensorflow/core/common_runtime/bfc_allocator.cc:613] Bin for 396.09MiB was 256.00MiB, Chunk State:
2019-01-15 05:46:18.456242: I tensorflow/core/common_runtime/bfc_allocator.cc:619] Size: 371.91MiB | Requested Size: 81.00MiB | in_use: 0, prev: Size: 396.09MiB | Requested Size: 396.09MiB | in_use: 1, next: Size: 768.00MiB | Requested Size: 768.00MiB | in_use: 1
2019-01-15 05:46:18.456313: I tensorflow/core/common_runtime/bfc_allocator.cc:619] Size: 384.00MiB | Requested Size: 384.00MiB | in_use: 0, prev: Size: 768.00MiB | Requested Size: 768.00MiB | in_use: 1, next: Size: 384.00MiB | Requested Size: 384.00MiB | in_use: 1
2019-01-15 05:46:18.456333: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x703f20000 of size 256
2019-01-15 05:46:18.456366: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x703f20100 of size 256
2019-01-15 05:46:18.456381: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 0x703f20200 of size 256
2019-01-15 05:46:18.456395: I tensorflow/core/common_runtime/bfc_allocator.cc:632]
...
2019-01-15 05:46:18.484870: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 1818230784 totalling 1.69GiB
2019-01-15 05:46:18.484885: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 9.09GiB
2019-01-15 05:46:18.484906: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit: 11281553818
InUse: 9764617728
MaxInUse: 10167270912
NumAllocs: 1328
MaxAllocSize: 1818230784
2019-01-15 05:46:18.485013: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ***********************************__**************************************_****___********___****__
2019-01-15 05:46:18.485060: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at transpose_op.cc:199 : Resource exhausted: OOM when allocating tensor with shape[2048,3,130,130] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
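That failing tensor accounts exactly for the 396.09 MiB request above, and its size is dominated by the batch dimension:

```python
# shape[2048, 3, 130, 130], float32; 130 = 128 px + 1 px of conv padding per side.
size = 2048 * 3 * 130 * 130 * 4
print(size / 2**20)   # 396.09 MiB; at batch_size 16 it would be ~3.1 MiB
```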
2019-01-15 05:46:28.485851: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.00GiB. Current allocation summary follows.
2019-01-15 05:46:28.485972: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256): Total Chunks: 180, Chunks in use: 178. 45.0KiB allocated for chunks. 44.5KiB in use in bin. 3.9KiB client-requested in use in bin.
2019-01-15 05:46:28.486002: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512): Total Chunks: 150, Chunks in use: 150. 92.8KiB allocated for chunks. 92.8KiB in use in bin. 82.9KiB client-requested in use in bin.
2019-01-15 05:46:28.486048: I tensorflow/core/common_runtime/bfc_allocator.cc:597]
...
...
2019-01-15 05:46:39.427827: I tensorflow/core/common_runtime/bfc_allocator.cc:641] 1 Chunks of size 1818230784 totalling 1.69GiB
2019-01-15 05:46:39.427855: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 9.41GiB
2019-01-15 05:46:39.427892: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit: 11281553818
InUse: 10099625472
MaxInUse: 10167270912
NumAllocs: 1342
MaxAllocSize: 1818230784
2019-01-15 05:46:39.428321: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ***********************************__**************************************_****___***************__
2019-01-15 05:46:39.429193: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at conv_grad_input_ops.cc:1050 : Resource exhausted: OOM when allocating tensor with shape[2048,768,16,16] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
2019-01-15 05:46:40.507376: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at iterator_ops.cc:1177 : Not found: Resource localhost/_0_OneShotIterator/N10tensorflow4data16IteratorResourceE does not exist.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2048,3,130,130] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node gradients/discriminator/resblock_down_1/skip/conv_0/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer}} = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](discriminator/resblock_down_1/skip/conv_0/Pad, PermConstNHWCToNCHW-LayoutOptimizer)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node add_2/_5}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7346_add_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "main.py", line 132, in <module>
main()
File "main.py", line 120, in main
gan.train()
File "/content/BigGAN-Tensorflow/BigGAN_128.py", line 302, in train
_, summary_str, d_loss = self.sess.run([self.d_optim, self.d_sum, self.d_loss])
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[2048,3,130,130] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node gradients/discriminator/resblock_down_1/skip/conv_0/Conv2D_grad/Conv2DBackpropFilter-0-TransposeNHWCToNCHW-LayoutOptimizer}} = Transpose[T=DT_FLOAT, Tperm=DT_INT32, _device="/job:localhost/replica:0/task:0/device:GPU:0"](discriminator/resblock_down_1/skip/conv_0/Pad, PermConstNHWCToNCHW-LayoutOptimizer)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[[{{node add_2/_5}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7346_add_2", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
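Following the hint in the error, the allocation report can be enabled with a RunOptions proto. A sketch, where sess and the fetches stand in for the ones at BigGAN_128.py:302 in the traceback:

```python
import tensorflow as tf

# Ask TF to list the live tensors when an OOM fires (per the hint above).
run_opts = tf.RunOptions(report_tensor_allocations_upon_oom=True)
_, summary_str, d_loss = sess.run([d_optim, d_sum, d_loss_tensor],
                                  options=run_opts)
```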