thund3rpat / kohya_ss-linux

This project forked from bmaltais/kohya_ss

(WIP) A port of bmaltais/kohya_ss for Linux

License: Apache License 2.0

Python 99.94% CSS 0.05% Shell 0.01%

gradio stable-diffusion

kohya_ss-linux's Issues

Error when training model

Hey there, first I wanted to say thank you for porting this over. I am currently trying to train a model and am getting this error:

Traceback (most recent call last):
File "/home/zono50/kohya_ss-linux/train_network.py", line 548, in <module>
train(args)
File "/home/zono50/kohya_ss-linux/train_network.py", line 156, in train
text_encoder, vae, unet, _ = train_util.load_target_model(args, weight_dtype)
File "/home/zono50/kohya_ss-linux/library/train_util.py", line 1584, in load_target_model
text_encoder, vae, unet = model_util.load_models_from_stable_diffusion_checkpoint(args.v2, args.pretrained_model_name_or_path)
File "/home/zono50/kohya_ss-linux/library/model_util.py", line 880, in load_models_from_stable_diffusion_checkpoint
info = unet.load_state_dict(converted_unet_checkpoint)
File "/home/zono50/kohya_ss-linux/venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for UNet2DConditionModel:
size mismatch for conv_in.weight: copying a param with shape torch.Size([320, 9, 3, 3]) from checkpoint, the shape in current model is torch.Size([320, 4, 3, 3]).
Traceback (most recent call last):
File "/home/zono50/kohya_ss-linux/venv/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/zono50/kohya_ss-linux/venv/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/zono50/kohya_ss-linux/venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 1104, in launch_command
simple_launcher(args)
File "/home/zono50/kohya_ss-linux/venv/lib/python3.9/site-packages/accelerate/commands/launch.py", line 567, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/zono50/kohya_ss-linux/venv/bin/python3', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=/home/zono50/stable-diffusion-webui/models/Stable-diffusion/realisticVisionV13_v13-inpainting.ckpt', '--train_data_dir=/home/zono50/Lora_Training_Data/Test/image', '--resolution=512,512', '--output_dir=/home/zono50/Lora_Training_Data/Test/model', '--logging_dir=/home/zono50/Lora_Training_Data/Test/log', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=networks.lora', '--text_encoder_lr=5e-5', '--unet_lr=0.0001', '--network_dim=8', '--output_name=Brando', '--lr_scheduler_num_cycles=1', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=600', '--train_batch_size=1', '--max_train_steps=6000', '--save_every_n_epochs=1', '--mixed_precision=fp16', '--save_precision=fp16', '--seed=1234', '--cache_latents', '--bucket_reso_steps=64', '--xformers', '--use_8bit_adam', '--bucket_no_upscale']' returned non-zero exit status 1.

No matter which settings I use to train the LoRA, I get this error, and I would love to know if there is a fix. Training works if I use the v1.5 pruned EMA-only model as the base, but it fails with this error if I use RealisticVision 1.3.
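A note on the traceback above: the mismatch is on conv_in.weight, where the checkpoint supplies 9 input channels and the model being built expects 4. Nine input channels are characteristic of Stable Diffusion inpainting checkpoints (4 latent + 4 masked-image latent + 1 mask channels), which matches the "-inpainting.ckpt" filename in the command. Below is a minimal sketch of a pre-flight check, assuming the tensor shapes of the checkpoint have already been collected into a dict. The function names and the key list are hypothetical and not part of kohya_ss.

```python
from typing import Mapping, Optional, Sequence

# Names under which the UNet's first convolution may appear. The first comes
# from the traceback above; the second is the usual name in original
# (non-diffusers) Stable Diffusion checkpoints. Both are assumptions here.
CONV_IN_KEYS = (
    "conv_in.weight",
    "model.diffusion_model.input_blocks.0.0.weight",
)


def unet_input_channels(shapes: Mapping[str, Sequence[int]]) -> Optional[int]:
    """Return the UNet input channel count, or None if no known key is present."""
    for key in CONV_IN_KEYS:
        if key in shapes:
            # Conv2d weights are laid out as (out_channels, in_channels, kH, kW).
            return int(shapes[key][1])
    return None


def looks_like_inpainting(shapes: Mapping[str, Sequence[int]]) -> bool:
    """True when the first convolution takes 9 input channels."""
    return unet_input_channels(shapes) == 9


# Shape reported in the traceback for the failing checkpoint.
print(looks_like_inpainting({"conv_in.weight": (320, 9, 3, 3)}))
```

If the check reports an inpainting checkpoint, trying the non-inpainting variant of the same base model is a reasonable first step, since the poster reports that a standard v1.5 checkpoint loads fine.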

Every model I train with this doesn't do anything. Was anyone able to actually get a working model out of it?

I'm running on runpod but getting this error: ModuleNotFoundError: No module named 'tkinter'

Getting this error when typing: python3 ./kohya_gui.py

ERROR: ModuleNotFoundError: No module named 'tkinter'
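The error means the Python interpreter in use was installed without the tkinter module, which kohya_gui.py imports. A small sketch for confirming this before launching the GUI follows; the helper name is hypothetical. On Debian and Ubuntu based images the module is normally provided by the python3-tk system package, but whether that applies to a given runpod image is an assumption.

```python
import importlib.util


def has_module(name: str) -> bool:
    """True if the top-level module `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


if not has_module("tkinter"):
    # Assumption: a Debian/Ubuntu based image, where the package is python3-tk.
    print("tkinter is missing; try installing the system package python3-tk")
```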

dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory

Something went wrong when I tried to run "accelerate config", and I can't train my LoRA because of it.

env:

os:Linux version 6.1.0-kali5-amd64 ([email protected]) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 6.1.12-1kali2 (2023-02-23)
python:3.10.9

The error log is:

2023-03-15 14:54:07.789455: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-15 14:54:07.978864: E tensorflow/stream_executor/cuda/cuda_blas.cc:2981] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-03-15 14:54:10.207545: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-03-15 14:54:10.208065: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-03-15 14:54:10.208079: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
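When reading a log like this one, it helps to separate warnings from errors. TensorFlow prefixes each line with a severity letter (I, W, E or F) after the timestamp, and the libnvinfer lines are W (warning) lines about TensorRT, which TensorFlow itself describes as needed only "if you would like to use Nvidia GPU with TensorRT". The sketch below extracts the severity letter; the function name is hypothetical, and the regular expression only assumes the line format visible in the log above.

```python
import re
from typing import Optional

# Matches "YYYY-MM-DD HH:MM:SS.ffffff: X " where X is the severity letter.
_TF_LINE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+: ([IWEF]) ")


def tf_severity(line: str) -> Optional[str]:
    """Return 'I', 'W', 'E' or 'F' for a TensorFlow log line, else None."""
    match = _TF_LINE.match(line)
    return match.group(1) if match else None


sample = (
    "2023-03-15 14:54:10.207545: W tensorflow/stream_executor/platform/"
    "default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'"
)
print(tf_severity(sample))
```

Since none of the lines shown is fatal, the actual failure of "accelerate config" may be reported elsewhere in the output; that is a guess, as the rest of the output is not included in the report.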

module 'bitsandbytes' has no attribute 'optim'

Traceback (most recent call last):
  File "/home/anon/Git/kohya_ss-linux/train_db.py", line 346, in <module>
    train(args)
  File "/home/anon/Git/kohya_ss-linux/train_db.py", line 126, in train
    optimizer_class = bnb.optim.AdamW8bit
AttributeError: module 'bitsandbytes' has no attribute 'optim'

This error prevents 8-bit Adam from working on my Fedora 37 installation, at least with this repo; 8-bit Adam works fine with my automatic1111 installation.

I can also run lora without 8-bit adam, but then xformers isn't available to speed things up.
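The traceback shows the failure happening at the attribute lookup bnb.optim.AdamW8bit. A defensive lookup with a fallback could look like the sketch below. It does not repair a broken bitsandbytes installation; it only lets training continue with a regular optimizer. The function name is hypothetical, and the stand-in objects exist only so the example runs without bitsandbytes or torch installed. In a real script the fallback could be torch.optim.AdamW.

```python
from types import SimpleNamespace


def pick_adamw_class(bnb_module, fallback):
    """Return (optimizer_class, is_8bit), preferring bnb.optim.AdamW8bit."""
    optim = getattr(bnb_module, "optim", None)
    cls = getattr(optim, "AdamW8bit", None)
    if cls is None:
        # bitsandbytes imported, but its optimizers are not available.
        return fallback, False
    return cls, True


class FallbackAdamW:  # stand-in for a regular optimizer class
    pass


broken_bnb = SimpleNamespace()  # mimics a module without an `optim` attribute
print(pick_adamw_class(broken_bnb, FallbackAdamW))
```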

Doesn't work in runpod

Is there a way to make it work in runpod?

how to run on paperspace

training loss becomes NaN

Hello, thank you for your excellent port to linux.

I got the training working following your instructions. But after about 100 steps of training, the loss becomes NaN and the final result is not usable (won't be loaded by webui when generating images).

the loss display:

I found out that the newest code might be the cause that breaks the training. The issue is here: bmaltais#215

I read the issue and checked out the corresponding commit in this repo (git checkout 72c0cb7f632ab56a786b4c83e01de012ceaff96c), but I still get NaN.

Any ideas on how to deal with this?
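To narrow down where the loss first goes bad, it can help to stop at the first non-finite value instead of training to the end and saving an unusable file. The following is a minimal sketch over a plain list of loss values. The function name is hypothetical, and wiring it into the training loop is left out because that loop is not shown here.

```python
import math
from typing import Iterable, Optional


def first_nonfinite_step(losses: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN or infinite loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Illustrative values only, not taken from the report.
print(first_nonfinite_step([0.21, 0.18, float("nan"), 0.17]))
```

Knowing the exact step makes it easier to compare runs, for example with and without mixed precision, although which setting is responsible here is not known.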

LoRA was trained, but does not work in Stable Diffusion

Hello, I do not understand what the problem is. The training process runs fine, but when I try to generate images, nothing changes: the LoRA has no effect at all on generation.

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
use 8-bit AdamW optimizer | {}
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 2200
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 1100
  num epochs / epoch数: 1
  batch size per device / バッチサイズ: 2
  total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ（並列学習、勾配合計含む）: 2
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 1100
steps:   0%|                                                                                                                                  | 0/1100 [00:00<?, ?it/s]epoch 1/1
steps: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1100/1100 [05:10<00:00,  3.54it/s, loss=0.152]save trained model to LoRA/output/Tqweqwfd.safetensors
model saved.
steps: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1100/1100 [05:10<00:00,  3.54it/s, loss=0.152]
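The step counts in the log above are consistent with each other: 2200 images × repeats at batch size 2 gives 1100 batches per epoch, and one epoch with one gradient accumulation step gives 1100 optimization steps. The arithmetic can be sketched as follows. This is a simplified model with a hypothetical function name; with bucketing enabled the real batch count can differ.

```python
import math


def total_optimization_steps(images_times_repeats, batch_size, epochs, grad_accum_steps=1):
    """Estimate optimizer steps from the quantities printed in the training log."""
    batches_per_epoch = math.ceil(images_times_repeats / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return steps_per_epoch * epochs


# Values from the log: 2200 images x repeats, batch size 2, 1 epoch.
print(total_optimization_steps(2200, 2, 1))  # 1100
```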

The training schedule is as follows

Training parameters:

{
  "pretrained_model_name_or_path": "XpucT/Deliberate",
  "v2": false,
  "v_parameterization": false,
  "logging_dir": "LoRA/log",
  "train_data_dir": "LoRA/input",
  "reg_data_dir": "",
  "output_dir": "LoRA/output",
  "max_resolution": "512,512",
  "learning_rate": "0.0001",
  "lr_scheduler": "constant",
  "lr_warmup": "0",
  "train_batch_size": 2,
  "epoch": "1",
  "save_every_n_epochs": "1",
  "mixed_precision": "fp16",
  "save_precision": "fp16",
  "seed": "385405080",
  "num_cpu_threads_per_process": 2,
  "cache_latents": true,
  "caption_extension": ".txt",
  "enable_bucket": false,
  "gradient_checkpointing": false,
  "full_fp16": false,
  "no_token_padding": false,
  "stop_text_encoder_training": 0,
  "use_8bit_adam": false,
  "xformers": true,
  "save_model_as": "safetensors",
  "shuffle_caption": false,
  "save_state": false,
  "resume": "",
  "prior_loss_weight": 1.0,
  "text_encoder_lr": "5e-5",
  "unet_lr": "0.0001",
  "network_dim": 128,
  "lora_network_weights": "",
  "color_aug": false,
  "flip_aug": false,
  "clip_skip": 2,
  "gradient_accumulation_steps": 1.0,
  "mem_eff_attn": false,
  "output_name": "Tqweqwfd",
  "model_list": "runwayml/stable-diffusion-v1-5",
  "max_token_length": "75",
  "max_train_epochs": "",
  "max_data_loader_n_workers": "1",
  "network_alpha": 128,
  "training_comment": "Trigger word: Tqweqwfd",
  "keep_tokens": "0",
  "lr_scheduler_num_cycles": "",
  "lr_scheduler_power": "",
  "persistent_data_loader_workers": false,
  "bucket_no_upscale": true,
  "random_crop": false,
  "bucket_reso_steps": 64.0,
  "caption_dropout_every_n_epochs": 0.0,
  "caption_dropout_rate": 0,
  "optimizer": "AdamW8bit",
  "optimizer_args": "",
  "noise_offset": "",
  "LoRA_type": "Standard",
  "conv_dim": 1,
  "conv_alpha": 1
}

Generation without LoRA

Generation with LoRA

I am using the runpod service to train the model on Linux. Can you please suggest what the problem is?

bitsandbytes was compiled without GPU support

UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.

bitsandbytes is trying to find libcudart.so in LD_LIBRARY_PATH, but the filename in my environment is libcudart.so.11.0. Creating a softlink worked to resolve it.

sudo ln -s /home/USERNAME/.conda/envs/kohya/lib/python3.10/site-packages/nvidia/cuda_runtime/lib/libcudart.so.11.0 /home/USERNAME/.conda/envs/kohya/lib/python3.10/site-packages/nvidia/cuda_runtime/lib/libcudart.so

and

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/USERNAME/.conda/envs/kohya/lib/python3.10/site-packages/nvidia/cuda_runtime/lib
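Before creating the symlink by hand, it can be useful to confirm that the directory really contains only a versioned libcudart. The sketch below looks for a versioned file when the unversioned name is missing and only reports what it finds; it does not create anything. The function name is hypothetical, and the lexicographic choice among several versions is a simplification.

```python
from pathlib import Path
from typing import Optional


def versioned_lib_to_link(lib_dir: str, name: str = "libcudart.so") -> Optional[Path]:
    """Return a versioned file (e.g. libcudart.so.11.0) that `name` could be
    symlinked to, or None if `name` already exists or nothing matches."""
    directory = Path(lib_dir)
    if (directory / name).exists():
        return None  # the unversioned name is already present
    candidates = sorted(directory.glob(name + ".*"))
    # Simplification: pick the lexicographically last candidate.
    return candidates[-1] if candidates else None
```

Calling it with the cuda_runtime/lib directory from the commands above would return the libcudart.so.11.0 path in the situation described, which is the target the ln -s command points at.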

TypeError: __init__() got an unexpected keyword argument 'pretrained_model_name_or_path'

I don't understand why I get TypeError: __init__() got an unexpected keyword argument 'pretrained_model_name_or_path' after I click on train. The source model I chose was "AbyssOrangeMix2_hard.safetensors"
