Comments (14)

KichangKim commented on June 10, 2024

That is strange. What is your dataset, and how big is it? Normally it should print the loss, precision, recall, F1 score, and training speed every few iterations (depending on the settings in project.json). If the total dataset size is smaller than the minimum logging interval, it only prints "Saving checkpoint", like yours does.

da3dsoul commented on June 10, 2024

The dataset is the entire Danbooru2020 set. I just filtered out some of the useless tags.

KichangKim commented on June 10, 2024

Can you show your project.json?

da3dsoul commented on June 10, 2024

Yeah, I'll post all of the contents, except the checkpoints folder.

Ugh, it only lets me attach .txt files, not .json. Whatever; categories, tags_log, and project are actually JSON.
tags.txt
tags-character.txt
tags-general.txt
tags_log.txt
categories.txt
project.txt

da3dsoul commented on June 10, 2024

Hmm, I have it set to 200mb per checkpoint, and my I/O can process several times that per second. Could that somehow be related? I expected to be bottlenecked on GPU power.

KichangKim commented on June 10, 2024

With your settings, a checkpoint is saved every 3200 samples (or once per epoch) and a log line is printed every 320 samples. If your dataset has fewer than 320 images, it only prints "Saving checkpoint", but that is normal; wait for it to complete.

If your dataset has many images, check your SQLite database and count how many images actually have a tag_count_general value of at least the minimum_tag_count in your project.json.
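
For reference, a rough back-of-the-envelope check of those intervals, assuming the project.json keys checkpoint_frequency_mb and console_logging_frequency_mb count minibatches and the minibatch size is 16 (the key names and values here are assumptions inferred from the numbers above, not quoted from the posted file):

# Samples between checkpoints / log lines = frequency in minibatches * minibatch size.
minibatch_size = 16                 # assumed value
checkpoint_frequency_mb = 200       # minibatches between checkpoints (assumed meaning of "200mb")
console_logging_frequency_mb = 20   # minibatches between log lines (assumed)

print(checkpoint_frequency_mb * minibatch_size)       # 3200 samples per checkpoint
print(console_logging_frequency_mb * minibatch_size)  # 320 samples per log line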

da3dsoul commented on June 10, 2024
SELECT Count(1) FROM posts WHERE posts.tag_count_general >= 10

3956484

I don't think that's the issue.

Oh, I think I know the issue. I didn't have the database in the directory with the images folder, but one above it. Do I just delete the checkpoints folder to restart?

da3dsoul commented on June 10, 2024
da3dsoul@THE-THRONE:/media/da3dsoul/Golias/DeepDanbooru$ deepdanbooru train-project /media/da3dsoul/Golias/DeepDanbooru/unbooru_model/
Using Adam optimizer ... 
Loading tags ... 
Creating model (resnet_custom_v2) ... 
Model : (None, 299, 299, 3) -> (None, 14176)
Loading database ... 
No checkpoint. Starting new training ... (2021-08-31 00:47:17.383431)
Shuffling samples (epoch 0) ... 
Trying to change learning rate to 0.001 ...
Learning rate is changed to <tf.Variable 'learning_rate:0' shape=() dtype=float32, numpy=0.001> ...
2021-08-31 00:47:31.829958: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:592] remapper failed: Not found: Op type not registered '_CopyFromGpuToHost' in binary running on THE-THRONE. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
MIOpen(HIP): Warning [ParseAndLoadDb] File is unreadable: /opt/rocm-4.3.0/miopen/share/miopen/db/gfx803_32.HIP.fdb.txt
Traceback (most recent call last):
  File "/usr/local/bin/deepdanbooru", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/__main__.py", line 52, in train_project
    dd.commands.train_project(project_path, source_model)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/commands/train_project.py", line 196, in train_project
    step_result = model.train_on_batch(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1727, in train_on_batch
    logs = self.train_function(iterator)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
    outputs = execute.execute(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node resnet_custom_v2/batch_normalization_63/FusedBatchNormV3 (defined at home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/commands/train_project.py:196) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[Func/assert_greater_equal/Assert/AssertGuard/else/_1/input_control_node/_62/_83]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node resnet_custom_v2/batch_normalization_63/FusedBatchNormV3 (defined at home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/commands/train_project.py:196) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_52220]

Function call stack:
train_function -> train_function

That definitely did something different, but that looks like a me problem with my ROCm setup.

Can you add a simple check with an error and exit when the database isn't in the same folder as the images folder? It's a little thing that would help save a headache.

KichangKim commented on June 10, 2024

Can you add a simple check with an error and exit when the database isn't in the same folder as the images folder? It's a little thing that would help save a headache.

DeepDanbooru simply ignores image file I/O errors, because an extremely large dataset inevitably contains a number of incorrect or broken files, and those should not disrupt the entire training process.

Anyway, I recommend reducing the minibatch size, because your log shows an OOM (out-of-memory) error.
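
A minimal sketch of that change, assuming the batch size is controlled by a minibatch_size key in project.json (the key name and the path below are assumptions based on this thread; editing the file by hand works just as well):

import json

# Hypothetical helper: halve the minibatch size in project.json to work around the GPU OOM.
project_file = "unbooru_model/project.json"  # example path from this thread

with open(project_file) as f:
    project = json.load(f)

# e.g. 16 -> 8; this sketch falls back to 16 if the key is missing.
project["minibatch_size"] = max(1, project.get("minibatch_size", 16) // 2)

with open(project_file, "w") as f:
    json.dump(project, f, indent=4)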

da3dsoul commented on June 10, 2024

I can try, thanks. It's an 8 GB GPU, which is reasonable, but this isn't exactly a reasonable workload, so fair.

da3dsoul commented on June 10, 2024

I see, it eats RAM by the gig and asks for seconds. I had to turn the batch size down to 8 to keep it from crashing, but it's still over 10x faster on the RX 570 than on the Ryzen 3700X. Thanks very much for all of the help.

When I said run a check, I meant something as simple as

import os

image_dir = ...  # pull it from the database dir
if not os.path.isdir(image_dir):
    raise Exception("Missing images folder!")

It's been a while since I've done python. The syntax isn't important there...
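
For what it's worth, a runnable version of that kind of guard might look something like this (the assumption that the images folder sits next to the SQLite database is taken from this thread; the function name is made up):

import os

def check_images_folder(database_path: str) -> None:
    # Hypothetical guard: expects the images folder to sit in the same
    # directory as the SQLite database, as described above.
    image_dir = os.path.join(os.path.dirname(database_path), "images")
    if not os.path.isdir(image_dir):
        raise FileNotFoundError(f"Missing images folder: {image_dir}")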

da3dsoul commented on June 10, 2024

Out of curiosity, what's an acceptable sample rate? I'm getting about 12 samples/s.

KichangKim commented on June 10, 2024

When I said run a check, I meant something as simple as

Ah, I'll add it.

what's an acceptable sample rate?

In my case, I got ~20 samples/s using a Ryzen 1700X + GeForce RTX 2080 Ti.

da3dsoul commented on June 10, 2024

OK, a 2080 is way faster than an RX 570, so that's fair. Thanks.
