Comments (14)

KichangKim commented on June 10, 2024

That is strange. What is your dataset, and how big is it? Normally it should print the loss, precision, recall, F1 score, and training speed every few iterations (depending on the settings in project.json). If the total dataset size is smaller than the minimum logging interval, it only prints "Saving checkpoint", like yours does.

da3dsoul commented on June 10, 2024

The dataset is the entire Danbooru2020 set. I just filtered out some of the useless tags.

KichangKim commented on June 10, 2024

Can you show your project.json?

da3dsoul commented on June 10, 2024

Yeah, I'll post all of the contents, except the checkpoints folder.

Ugh, it only lets me attach .txt files, not .json. Whatever; categories, tags_log, and project are actually JSON.
tags.txt
tags-character.txt
tags-general.txt
tags_log.txt
categories.txt
project.txt

da3dsoul commented on June 10, 2024

Hmm, I have it set to 200mb per checkpoint, and my I/O can process several times that per second. Could that somehow be related? I expected to be bottlenecked on GPU power.

KichangKim commented on June 10, 2024

With your settings, a checkpoint is saved every 3200 samples (or once per epoch) and a log line is printed every 320 samples. If your dataset has fewer than 320 images, it only prints "Saving checkpoint", but that is normal; wait for it to complete.

If your dataset has many images, check your SQLite database and count how many images actually have a tag_count_general value of at least the minimum_tag_count in your project.json.
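
For reference, a rough back-of-the-envelope check of those intervals, assuming the project.json keys checkpoint_frequency_mb and console_logging_frequency_mb count minibatches and the minibatch size is 16 (the key names and values here are assumptions inferred from the numbers above, not quoted from the posted file):

# Samples between checkpoints / log lines = frequency in minibatches * minibatch size.
minibatch_size = 16                 # assumed value
checkpoint_frequency_mb = 200       # minibatches between checkpoints (assumed meaning of "200mb")
console_logging_frequency_mb = 20   # minibatches between log lines (assumed)

print(checkpoint_frequency_mb * minibatch_size)       # 3200 samples per checkpoint
print(console_logging_frequency_mb * minibatch_size)  # 320 samples per log line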

da3dsoul commented on June 10, 2024
SELECT Count(1) FROM posts WHERE posts.tag_count_general >= 10

3956484

I don't think that's the issue.

Oh, I think I know the issue. I didn't have the database in the directory with the images folder, but one above it. Do I just delete the checkpoints folder to restart?

da3dsoul commented on June 10, 2024
da3dsoul@THE-THRONE:/media/da3dsoul/Golias/DeepDanbooru$ deepdanbooru train-project /media/da3dsoul/Golias/DeepDanbooru/unbooru_model/
Using Adam optimizer ... 
Loading tags ... 
Creating model (resnet_custom_v2) ... 
Model : (None, 299, 299, 3) -> (None, 14176)
Loading database ... 
No checkpoint. Starting new training ... (2021-08-31 00:47:17.383431)
Shuffling samples (epoch 0) ... 
Trying to change learning rate to 0.001 ...
Learning rate is changed to <tf.Variable 'learning_rate:0' shape=() dtype=float32, numpy=0.001> ...
2021-08-31 00:47:31.829958: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:592] remapper failed: Not found: Op type not registered '_CopyFromGpuToHost' in binary running on THE-THRONE. Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
MIOpen(HIP): Warning [ParseAndLoadDb] File is unreadable: /opt/rocm-4.3.0/miopen/share/miopen/db/gfx803_32.HIP.fdb.txt
Traceback (most recent call last):
  File "/usr/local/bin/deepdanbooru", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/__main__.py", line 52, in train_project
    dd.commands.train_project(project_path, source_model)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/commands/train_project.py", line 196, in train_project
    step_result = model.train_on_batch(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 1727, in train_on_batch
    logs = self.train_function(iterator)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 555, in call
    outputs = execute.execute(
  File "/home/da3dsoul/.local/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted:  OOM when allocating tensor with shape[1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node resnet_custom_v2/batch_normalization_63/FusedBatchNormV3 (defined at home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/commands/train_project.py:196) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

         [[Func/assert_greater_equal/Assert/AssertGuard/else/_1/input_control_node/_62/_83]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted:  OOM when allocating tensor with shape[1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[node resnet_custom_v2/batch_normalization_63/FusedBatchNormV3 (defined at home/da3dsoul/.local/lib/python3.8/site-packages/deepdanbooru/commands/train_project.py:196) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_52220]

Function call stack:
train_function -> train_function

That definitely did something different, but that looks like a me problem with my ROCm setup.

Can you add a simple check with an error and exit when the database isn't in the same folder as the images folder? It's a little thing that would help save a headache.

KichangKim commented on June 10, 2024

Can you add a simple check with an error and exit when the database isn't in the same folder as the images folder? It's a little thing that would help save a headache.

DeepDanbooru simply ignores image file I/O errors, because an extremely large dataset inevitably contains a number of incorrect or broken files, and those should not disrupt the entire training process.

Anyway, I recommend reducing the minibatch size, because your log shows an OOM (out-of-memory) error.
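
A minimal sketch of that change, assuming the batch size is controlled by a minibatch_size key in project.json (the key name and the path below are assumptions based on this thread; editing the file by hand works just as well):

import json

# Hypothetical helper: halve the minibatch size in project.json to work around the GPU OOM.
project_file = "unbooru_model/project.json"  # example path from this thread

with open(project_file) as f:
    project = json.load(f)

# e.g. 16 -> 8; this sketch falls back to 16 if the key is missing.
project["minibatch_size"] = max(1, project.get("minibatch_size", 16) // 2)

with open(project_file, "w") as f:
    json.dump(project, f, indent=4)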

da3dsoul commented on June 10, 2024

I can try, thanks. It's an 8 GB GPU, which is reasonable, but this isn't exactly a reasonable workload, so fair.

da3dsoul commented on June 10, 2024

I see, it eats RAM by the gig and asks for seconds. I had to turn the batch size down to 8 to keep it from crashing, but it's still over 10x faster on the RX 570 than on the Ryzen 3700X. Thanks very much for all of the help.

When I said run a check, I meant something as simple as

import os

image_dir = ...  # pull it from the database dir
if not os.path.isdir(image_dir):
    raise Exception("Missing images folder!")

It's been a while since I've done python. The syntax isn't important there...
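
For what it's worth, a runnable version of that kind of guard might look something like this (the assumption that the images folder sits next to the SQLite database is taken from this thread; the function name is made up):

import os

def check_images_folder(database_path: str) -> None:
    # Hypothetical guard: expects the images folder to sit in the same
    # directory as the SQLite database, as described above.
    image_dir = os.path.join(os.path.dirname(database_path), "images")
    if not os.path.isdir(image_dir):
        raise FileNotFoundError(f"Missing images folder: {image_dir}")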

da3dsoul commented on June 10, 2024

Out of curiosity, what's an acceptable sample rate? I'm getting about 12 samples/s.

KichangKim commented on June 10, 2024

When I said run a check, I meant something as simple as

Ah, I'll add it.

what's an acceptable sample rate?

In my case, I got ~20 samples/s using a Ryzen 1700X + GeForce RTX 2080 Ti.

da3dsoul commented on June 10, 2024

OK, a 2080 is way faster than an RX 570, so that's fair. Thanks.
