nota-netspresso / bk-sdm

A Compressed Stable Diffusion for Efficient Text-to-Image Generation [ICCV'23 Demo] [ICML'23 Workshop]

License: Other

Shell 16.39% Python 83.61%

compression distillation huggingface lightweight pytorch stable-diffusion

bk-sdm's Issues

Is there some way to test Img2Img?

SDXL support?

Hi there!

I'd like to ask whether you have, or plan to add, support for the SDXL model. It's quite heavy, and making it faster and more lightweight would be a huge benefit to the community.

Thanks for your work!

Could the authors share the code for calculating the parameter count (Param.) and computational complexity (MACs) of the pipeline?

Could the authors share the code for calculating the number of model parameters (Param.) and the computational complexity (MACs) of the pipeline? Thank you very much!
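As a rough illustration of what these two numbers measure (this is not the authors' script, which does not appear in this thread), the parameter count and multiply-accumulate count of a single 2D convolution can be worked out by hand. The function name and the example layer shape below are hypothetical.

```python
def conv2d_params_and_macs(c_in, c_out, kernel, h_out, w_out, bias=True):
    """Return (parameter count, MACs) for one standard 2D convolution."""
    weights_per_filter = c_in * kernel * kernel
    params = c_out * (weights_per_filter + (1 if bias else 0))
    # One MAC per weight per output position; bias additions are not counted.
    macs = c_out * h_out * w_out * weights_per_filter
    return params, macs


# Hypothetical layer: 320 -> 320 channels, 3x3 kernel, 64x64 output map.
params, macs = conv2d_params_and_macs(320, 320, 3, 64, 64)
print(f"params = {params / 1e6:.2f}M, MACs = {macs / 1e9:.2f}G")
```

Summing such per-layer figures gives a model total. For a PyTorch module, the parameter total can also be read off directly with sum(p.numel() for p in module.parameters()).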

Question about the lambda

Hi there,
It's me again. I am curious whether you tried different combinations of lambda for feat_loss and out_loss, or perhaps added a lambda for the task_loss?

From my training runs, it seems that feat_loss accounts for most of the total loss.
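To make the observation concrete, here is a minimal sketch that assumes the total loss is a plain weighted sum of the three terms. That combination rule is inferred from the --lambda_sd, --lambda_kd_output and --lambda_kd_feat flags quoted elsewhere on this page, not confirmed from the training script. The numbers are taken from a progress-bar line in another issue on this page.

```python
def total_loss(sd_loss, kd_output_loss, kd_feat_loss,
               lambda_sd=1.0, lambda_kd_output=1.0, lambda_kd_feat=1.0):
    """Weighted sum of the three loss terms (assumed combination rule)."""
    return (lambda_sd * sd_loss
            + lambda_kd_output * kd_output_loss
            + lambda_kd_feat * kd_feat_loss)


# Values copied from a progress-bar line quoted in another issue on this page.
sd, kd_out, kd_feat = 0.185, 0.0447, 58.6
total = total_loss(sd, kd_out, kd_feat)
feat_share = kd_feat / total
print(f"total = {total:.2f}, kd_feat share = {feat_share:.1%}")
```

Under this assumption, lowering lambda_kd_feat rescales only the feature term, so its share of the total can be tuned independently of the other two.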

OSError: Error no file named scheduler_config.json found in directory CompVis/stable-diffusion-v1-4

I downloaded the stable-diffusion-v1-4 ckpt from CompVis but still have this problem. I have tried installing transformers==4.25, 4.27, and so on, but it didn't work. Here are the error details:

bash scripts/kd_train_toy.sh
The following values were not passed to accelerate launch and had defaults used instead:
--num_processes was set to a value of 1
--num_machines was set to a value of 1
--mixed_precision was set to a value of 'no'
--dynamo_backend was set to a value of 'no'
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/accelerate/accelerator.py:249: FutureWarning: logging_dir is deprecated and will be removed in version 0.18.0 of 🤗 Accelerate. Use project_dir instead.
warnings.warn(
./results/toy_bk_small/log_loss.csv
03/11/2024 21:34:33 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

Traceback (most recent call last):
File "src/kd_train_text_to_image.py", line 914, in <module>
main()
File "src/kd_train_text_to_image.py", line 429, in main
noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler")
File "/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/diffusers/schedulers/scheduling_utils.py", line 139, in from_pretrained
config, kwargs, commit_hash = cls.load_config(
File "/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/diffusers/configuration_utils.py", line 331, in load_config
raise EnvironmentError(
OSError: Error no file named scheduler_config.json found in directory CompVis/stable-diffusion-v1-4.
Traceback (most recent call last):
File "/home/lzj/miniconda3/envs/bk-sdm/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 923, in launch_command
simple_launcher(args)
File "/home/lzj/miniconda3/envs/bk-sdm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 579, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/lzj/miniconda3/envs/bk-sdm/bin/python', 'src/kd_train_text_to_image.py', '--pretrained_model_name_or_path', 'CompVis/stable-diffusion-v1-4', '--train_data_dir', '/home/lzj/work/data/preprocessed_11k', '--use_ema', '--resolution', '512', '--center_crop', '--random_flip', '--train_batch_size', '2', '--gradient_checkpointing', '--mixed_precision=fp16', '--learning_rate', '5e-05', '--max_grad_norm', '1', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--report_to=all', '--max_train_steps=20', '--seed', '1234', '--gradient_accumulation_steps', '4', '--checkpointing_steps', '5', '--valid_steps', '5', '--lambda_sd', '1.0', '--lambda_kd_output', '1.0', '--lambda_kd_feat', '1.0', '--use_copy_weight_from_teacher', '--unet_config_path', './src/unet_config', '--unet_config_name', 'bk_small', '--output_dir', './results/toy_bk_small']' returned non-zero exit status 1.

Discussion on preprocessing of LAION data

[Question]

I have another question.

I split the LAION-aesthetic V2 5+ dataset into several subsets, e.g., 5M, 10M, 89M, etc, and I made metadata.csv for each subset.

Then, when I tried to train on multiple GPUs using a subset, I ran into the error below.

I guess that the problem was caused by the data itself.

FYI, I didn't pre-process the data except for resolution (512x512) when I downloaded data.

Did you also face this problem?

Or did you apply any pre-processing to the LAION data?

Steps: 0%| | 283/400000 [35:52<813:24:06, 7.33s/it, kd_feat_loss=58.6, kd_output_loss=0.0447, lr=5e-5, sd_loss=0.185, step_loss=58.9]
Traceback (most recent call last):
File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 1171, in <module>
main()
File "/home/user01/bk-sdm/src/kd_train_text_to_image.py", line 961, in main
for step, batch in enumerate(train_dataloader):
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/accelerate/data_loader.py", line 388, in __iter__
next_batch = next(dataloader_iter)
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
data = self._next_data()
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 671, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 56, in fetch
data = self.dataset.__getitems__(possibly_batched_index)
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in __getitems__
return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <listcomp>
return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
File "/home/user01/anaconda3/envs/kd-sdm/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2715, in <listcomp>
return [{col: array[i] for col, array in batch.items()} for i in range(n_examples)]
IndexError: index 63 is out of bounds for dimension 0 with size 63

Refine generation code

remove use_auth_token=True in StableDiffusionPipeline.from_pretrained [ref]
disable NSFW filter in recent diffusers versions [ref] [ref] for MS-COCO benchmark

Scale of KD-feature loss for SD inpainting 1.5

Hi there,

I am trying to distill the U-Net in SD inpainting 1.5 into a smaller U-Net using your code. (I swapped in the inpainting pipeline and changed the input data.)
I have trained for 130K steps with batch size 64.
Right now the kd_feat_loss is around 20.

I am wondering what kd_feat_loss value you had when you finished distilling the U-Net in your experiments?

Thank you.

improved wandb logger

To incorporate the feature below:

In addition, the base training script src/kd_train_text_to_image.py logs only the total loss to W&B, whereas one may be interested in each individual contribution. I added image logging to W&B as well.

multi-gpu training error

Hi, I'm really impressed by your work and nice code.

When I ran the training code in a multi-GPU setting, I encountered this error.

Traceback (most recent call last):
File "/home/user01/BK-SDM/src/kd_train_text_to_image.py", line 891, in <module>
main()
File "/home/user01/BK-SDM/src/kd_train_text_to_image.py", line 766, in main
a_stu = acts_stu[m_stu]
KeyError: 'up_blocks.0'
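One possible cause, stated here only as an assumption since the hook code is not shown in this thread, is that a multi-GPU wrapper such as DistributedDataParallel exposes the model under a "module." prefix. Names collected from the wrapped model then no longer match keys like 'up_blocks.0'. The sketch below normalizes such keys. The dictionary contents are hypothetical placeholders.

```python
def strip_wrapper_prefix(name, prefix="module."):
    """Drop a leading wrapper prefix such as the one DistributedDataParallel adds."""
    return name[len(prefix):] if name.startswith(prefix) else name


# Hypothetical activation dict as it might look when hooks were registered
# on a wrapped model: every key carries the "module." prefix.
acts_stu = {"module.up_blocks.0": "feat-a", "module.mid_block": "feat-b"}

normalized = {strip_wrapper_prefix(k): v for k, v in acts_stu.items()}
print(normalized["up_blocks.0"])
```

Printing the actual keys of acts_stu in the failing run would confirm or rule out this explanation.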

Could you check this?

Thanks in advance :)

ValueError: Invalid pattern: '**' can only be an entire path component

pip install -U datasets

This solves the issue of loading the data.

About gpu memory

Thanks for your great work. May I ask a question about GPU memory? You write:

A toy script can be used to verify the code executability and find the batch size that matches your GPU. With a batch size of 8 (=4×2), training BK-SDM-Base for 20 iterations takes about 5 minutes and 22GB GPU memory.

With a batch size of 256 (=4×64), training BK-SDM-Base for 50K iterations takes about 300 hours and 53GB GPU memory. With a batch size of 64 (=4×16), it takes 60 hours and 28GB GPU memory.

So the batch size increases about 32x (from 2 to 64), but GPU memory increases by less than 3x (from 22GB to 53GB). Why is the GPU memory usage so low? Is diffusers more GPU-memory-efficient than pytorch-lightning (which SD v1.5 used)?
Thanks very much
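The quoted figures are consistent with a simple model in which memory is a fixed cost plus a per-sample cost. The sketch below fits that line through two of the quoted points and checks it against the third. It reads "(=4×64)" as gradient-accumulation steps times per-step batch size, an interpretation based on the --gradient_accumulation_steps 4 and --train_batch_size 2 flags in the toy command quoted elsewhere on this page. The linear model itself is an assumption.

```python
def fit_line(p1, p2):
    """Return (intercept, slope) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    return y1 - slope * x1, slope


# (per-step batch size, reported GPU memory in GB) from the quoted README text.
fixed_gb, gb_per_sample = fit_line((2, 22), (64, 53))
predicted_16 = fixed_gb + gb_per_sample * 16
print(f"fixed = {fixed_gb:.1f} GB, per sample = {gb_per_sample:.2f} GB")
print(f"predicted at batch 16 = {predicted_16:.1f} GB (README reports 28 GB)")
```

Under this reading, most of the 22GB at batch size 2 is a fixed cost that does not grow with batch size, which is why a 32x larger batch needs far less than 32x the memory.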

Loading preprocessed_212k laion dataset without any response in terminal

Hi @bokyeong1015 , thanks for your great work!

I modified diffusers/train_text_to_image.py and used your fine-tuning strategy on the 212k subset of LAION. But when I run the training code, loading the dataset takes a very long time, and there is still no response in the terminal after 40 minutes. Is it caused by the large number of images or by a bug in my code?

    # In distributed training, the load_dataset function guarantees that only one local process can concurrently
    # download the dataset.
    if args.dataset_name is not None:
        # Downloading and loading a dataset from the hub.
        dataset = load_dataset(
            args.dataset_name,
            args.dataset_config_name,
            cache_dir=args.cache_dir,
            data_dir=args.train_data_dir,
        )
    else:
        data_files = {}
        if args.train_data_dir is not None:
            data_files["train"] = os.path.join(args.train_data_dir, "**")
        print("*** load dataset: start")
        t0 = time.time()
        dataset = load_dataset(
            "imagefolder",
            # data_files=data_files,
            cache_dir=args.cache_dir,
            split="train",
            data_dir=args.train_data_dir,
        )
        print(f"*** load dataset: end --- {time.time()-t0} sec")

        # See more about loading custom images at
        # https://huggingface.co/docs/datasets/v2.4.0/en/image_load#imagefolder

    # Preprocessing the datasets.
    # We need to tokenize inputs and targets.
        
    # column_names = dataset["train"].column_names
    
    ##############################################################################################
    column_names = dataset.column_names
    image_column = column_names[0]
    caption_column = column_names[1]
    ###################################################################################################

This is the dataset-loading code. How long should the load_dataset call take?

Thanks for your great work, looking forward to your reply!

Best wishes,
Qianli

Add training code

issue about training iterations

We note that the README says training BK-SDM-Base needs 50K iterations, while "kd_train.py" sets --max_train_steps=400K. Can we assume that 50K iterations is good enough?

SnapFusion seems to get better results?

Thanks for generously open-sourcing your work. There is an earlier, similar work called SnapFusion that also aims at speeding up Stable Diffusion.

According to the results in their paper, they achieved better results through an efficient U-Net and step distillation, but unfortunately that work is not open source.

Do you have any opinion on this work? https://snap-research.github.io/SnapFusion/

How about KD training without EMA?

Thanks for your paper and code. My question is how the model performs when the EMA option is not used, that is, when I do not pass the "--use_ema" option.

Repo update

Code for SD-V2 applicability
Readme & model card for SD-V2 applicability
- Updated description & results
- Updated package info
Credit BK-SDXL from KOALA

batched image generation

To incorporate the feature below:

The original src/generate.py generates images one by one, which underutilizes the GPU; as a consequence, generating 30k images takes a while. I've added batched image generation to speed it up.
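The batching step itself can be sketched without any GPU code. The helper below is a hypothetical illustration, not the code added to the repository. It splits a prompt list into fixed-size chunks, each of which could then be passed to the pipeline in a single call.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


prompts = ["a dog", "a cat", "a red bus", "a bowl of fruit", "a city at night"]
for batch in batched(prompts, batch_size=2):
    # A diffusers pipeline can take a list of prompts in one call; that call
    # is omitted here so the sketch runs without a GPU.
    print(len(batch), batch)
```

The last chunk may be smaller than batch_size, so any code that saves images by index should count the returned images and not assume a full batch.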

Discussion on experimental settings

[Inquiry]

Hi, I tried this method but found that the performance was very poor. My experimental configuration was to train bk_tiny as the U-Net on the laion_11k data for 10k steps. I also swapped in the inpainting pipeline and changed the input data. Do you have any suggestions? Thanks.

We find that the 2.3M dataset cannot be downloaded. Is the link wrong?

1. We can download the 212K dataset from
https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/data/improved_aesthetics_6.5plus/preprocessed_212k.tar.gz
but not the 2.3M dataset from
https://netspresso-research-code-release.s3.us-east-2.amazonaws.com/data/improved_aesthetics_6.5plus/preprocessed_2256k.tar.gz

2. We also tried the "bash" method:
bash scripts/get_laion_data.sh preprocessed_2256k

Add DreamBooth finetuning

Goal: Efficient personalized generation with lightweight SD backbones
Method: DreamBooth finetuning without and with LoRA

May I ask whether the training time is accurate?

With a batch size of 64 (256=4x64), training BK-SDM-Base on a single A100 for 50K iterations takes about 300 hours.
With a batch size of 16 (64=4x16), training BK-SDM-Base on a single A100 for 50K iterations takes about 60 hours???
In fact, is it 600 hours??
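For reference, the two quoted figures can be converted into seconds per iteration and time per training sample. This is plain arithmetic on the numbers above and says nothing about which figure is correct. The function name is hypothetical.

```python
def seconds_per_iteration(total_hours, iterations):
    """Average wall-clock seconds spent on one training iteration."""
    return total_hours * 3600 / iterations


for total_batch, hours in [(256, 300), (64, 60)]:
    sec_it = seconds_per_iteration(hours, 50_000)
    print(f"batch {total_batch}: {sec_it:.2f} s/it, "
          f"{sec_it / total_batch * 1000:.1f} ms per sample")
```

Comparing the per-sample times shows whether the two quoted configurations are roughly consistent with each other.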

Unhandled exception while generating images that considered NSFW

Hi! I ran this line of code to generate samples to compute FID:

!python3 src/generate.py --model_id nota-ai/bk-sdm-base --save_dir ./results/bk-sdm-base

Then I got this error:

0/30000 | COCO_val2014_000000000042.jpg **A small dog is curled up on top of the shoes** | 25 steps
Total 751.9M (U-Net 579.4M; TextEnc 123.1M; ImageDec 49.5M)
100% 25/25 [00:03<00:00,  8.14it/s]
Traceback (most recent call last):
  File "/content/BK-SDM/src/generate.py", line 53, in <module>
    img = pipeline.generate(prompt = val_prompt,
  File "/content/BK-SDM/src/utils/inference_pipeline.py", line 34, in generate
    out = self.pipe(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 706, in __call__
    do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
TypeError: 'bool' object is not iterable
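The failing line iterates over has_nsfw_concept, so whatever stands in for the safety checker has to return one flag per image, not a single bool. The sketch below assumes that the checker was replaced by a stand-in returning a bare False. The contents of src/utils/inference_pipeline.py are not shown here, so that is a guess, and the function name is hypothetical.

```python
def dummy_safety_checker(images, **kwargs):
    """Stand-in checker that flags nothing: one False per image."""
    return images, [False] * len(images)


images = ["img0", "img1"]  # placeholders for decoded images
_, has_nsfw_concept = dummy_safety_checker(images, clip_input=None)

# Same expression as the line that raised the TypeError in the traceback.
do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]
print(do_denormalize)
```

Another option in diffusers is to pass safety_checker=None to StableDiffusionPipeline.from_pretrained, which disables the checker without a stand-in function.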

Add downloading 2.3M LAION training pairs

Related to #7
Data folder, preprocessed_2256k: 2,256,472 image-text pairs (182GB tar.gz; 204GB data folder)

About the training speed

I found that the total number of training iterations is 400,000. May I ask how many days it takes you to train a distilled model? I use 8×V100 GPUs and can complete only around 3,800 iterations in one night (from 19:55 to 10:00 the next day).
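The overnight figures can be extrapolated to the full schedule with a few lines of arithmetic. The sketch assumes a constant iteration speed. The calendar dates are placeholders, since only the clock times are given, and the function name is hypothetical.

```python
from datetime import datetime


def estimate_total_days(start, end, iterations_done, iterations_total):
    """Extrapolate total training time from an observed time window."""
    elapsed_s = (end - start).total_seconds()
    sec_per_it = elapsed_s / iterations_done
    return sec_per_it, sec_per_it * iterations_total / 86400


# The calendar dates are placeholders; only the clock times come from the post.
start = datetime(2024, 1, 1, 19, 55)
end = datetime(2024, 1, 2, 10, 0)
sec_per_it, days = estimate_total_days(start, end, 3_800, 400_000)
print(f"{sec_per_it:.1f} s/it, about {days:.0f} days for 400,000 iterations")
```

At that pace the full 400,000 iterations would take roughly two months, which puts the question about the authors' training time in context.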

Is it possible to fine-tune it for the inpainting/outpainting task?

Do you guys have any plan to fine-tune it for inpainting?

Thanks.

Queries

@bokyeong1015 Hi, thanks for sharing this wonderful work. I have a few queries and requests:

Can you please share your checkpoint-45000 on OneDrive or Google Drive? I want to test it on my system, as I do not have the resources to train it on a GPU system.
In your paper you mention deploying on NVIDIA Orin. Did you test it on any other platforms such as NVIDIA AGX / NX / Nano? If so, how long does it take there?
When deploying on NVIDIA Orin, did you use Docker or run the Hugging Face models directly?
Can the techniques used in this paper and in SnapFusion be brought into this code, and can we expect further improvements from that?

Thanks in advance

Could the authors share the code for producing the heat map in Figure 8? I really appreciate your nice work and kind help.

Generation with trained unet

response to #10 (comment)

I want to conduct zero-shot MS-COCO evaluation for an intermediate checkpoint trained in a multi-GPU setting, but I'm not sure how to specify my checkpoint.

Could you give me some hints for this?

In your instruction (2), you enter model_id.

Could I change the model_id to my checkpoint path?

However, I don't know which file should be specified.

I guess it is unet_ema/diffusion_pytorch_model.bin. Am I right?

Thanks in advance.

Wonderful work and hi from 🧨 diffusers

Hi folks!

Simply amazing work here 🔥

I am Sayak, one of the maintainers of 🧨 diffusers at HF. I see all the weights of BK-SDM are already diffusers-compatible. This is really amazing!

I wanted to know if there is any plan to also open-source the distillation pre-training code. I think that will be beneficial to the community.

Additionally, any plans on doing for SDXL as well?

Cc: @patrickvonplaten

any plans for more models?

Greetings!

These tiny models are amazing! I love the fp16 versions.
Could you please, in the future, make models that are based on 1.5 and mixed with uncensored models such as lyriel or deliberate for better faces and anatomy?

kind regards

How to replicate this work offline

Hi, thanks for your great work!
I currently have an A100 GPU server that is not connected to the internet. I can configure the environment offline. Can I replicate your work offline? Could you please provide me with some guidance? Thank you.
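One ingredient of an offline setup is telling the Hugging Face libraries not to attempt network access. The sketch below sets the corresponding environment variables. It assumes the models and data have already been copied to the server and that local paths are passed in place of Hub IDs. The helper name is hypothetical.

```python
import os

# Environment variables read by the Hugging Face libraries to disable network access.
OFFLINE_FLAGS = {
    "HF_HUB_OFFLINE": "1",
    "TRANSFORMERS_OFFLINE": "1",
    "HF_DATASETS_OFFLINE": "1",
}


def enable_offline_mode(env=os.environ):
    """Set the offline flags; call this before importing diffusers/transformers."""
    for key, value in OFFLINE_FLAGS.items():
        env[key] = value
    return {key: env[key] for key in OFFLINE_FLAGS}


print(enable_offline_mode())
```

The same variables can be exported in the shell before launching the training script, which avoids touching the code.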

data loading problem with 89M pairs

Hi, thanks to your excellent work, I have conducted many experiments.

When I trained on a subset of LAION-aesthetic-5+ (about 89M pairs), my training process was killed without a specific error message :(

It may have occurred in load_dataset.

I guess that the training set is too big, but I'm not sure.

I think this problem may be caused by Hugging Face's datasets library.

Have you ever faced this problem? And have you tried to train your model on a much bigger training set?