
DisCo: Disentangled Control for Realistic Human Dance Generation

Colab YouTube

Tan Wang*, Linjie Li*, Kevin Lin*, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang

Nanyang Technological University   |   Microsoft Azure AI   |  University at Buffalo

DisCo: Disentangled Control for Referring Human Dance Generation in Real World



🔥 News

  • [2024.07.15] We present IDOL, an enhancement of DisCo that simultaneously generates video and depth, enabling realistic 2.5D video synthesis.
  • [2024.04.08] DisCo has been accepted by CVPR24. Please check the latest version of paper on ArXiv.
  • [2023.12.30] Update slides about introducing DisCo and summarizing recent works.
  • [2023.11.30] Update DisCo w/ temporal module.
  • [2023.10.12] Update the new ArXiv version of DisCo (Add temporal module; Synchronize FVD computation with MCVD; More baselines and visualizations, etc)
  • [2023.07.21] Update the construction guide of the TSV file.
  • [2023.07.08] Update the Colab Demo (make sure our code/demo can be run on any machine)!
  • [2023.07.03] Provide the local demo deployment example code. Now you can try our demo on your own dev machine!
  • [2023.07.03] We update the Pre-training tsv data.
  • [2023.06.28] We have released DisCo Human Attribute Pre-training Code.
  • [2023.06.21] DisCo Human Image Editing Demo is released! Have a try!
  • [2023.06.21] We release the human-specific fine-tuning code for reference. Come and build your own specific dance model!
  • [2023.06.21] Release the code for general fine-tuning.
  • [2023.06.21] We release the human attribute pre-trained checkpoint and the fine-tuning checkpoint.
Other follow-up projects you may find interesting:
Comparison of recent works:



🎨 Gradio Demo

Launch Demo Locally (Video dance generation demo is on the way!)

  1. Download the fine-tuning checkpoint model (our demo uses this checkpoint; you can also use your own model). Download sd-image-variations-diffusers via git clone https://huggingface.co/lambdalabs/sd-image-variations-diffusers.

  2. Run the Jupyter notebook file. All the required code/commands are already set up. Remember to revise the pretrained model paths --pretrained_model and --pretrained_model_path (the sd-image-variations model) in manual_args = [xxx]; see the sketch after this list.

  3. After running, the notebook will automatically launch the demo on your local dev GPU. You can visit the demo via the web link printed at the end of the notebook.

  4. Alternatively, you can refer to our Colab deployment. All the code is deployed from scratch!
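For reference, a minimal sketch of the edit step 2 refers to: the two flag names come from the repo's scripts, but the paths below are placeholders you must replace with your own, and the rest of manual_args should stay as the notebook defines it.

# Hypothetical excerpt of the notebook's argument list; only the two path flags
# need editing, everything else stays as shipped in the notebook.
manual_args = [
    "--pretrained_model", "/path/to/disco_ft_checkpoint/mp_rank_00_model_states.pt",
    "--pretrained_model_path", "/path/to/sd-image-variations-diffusers",
    # ... keep the notebook's remaining arguments unchanged ...
]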





📝 Introduction

In this project, we introduce DisCo as a generalized referring human dance generation toolkit, which supports both human image and video generation with multiple usage cases (pre-training, fine-tuning, and human-specific fine-tuning), and is especially effective in real-world scenarios.

✨Compared to existing works, DisCo achieves:

  • Generalizability to large-scale real-world humans without human-specific fine-tuning (human-specific fine-tuning is also supported). Previous methods only support generation for a specific human domain, e.g., DreamPose only generates fashion models with simple catwalk poses.

  • Current SOTA results for referring human dance generation.

  • Extensive usage cases and applications (see project page for more details).

  • An easy-to-follow framework, supporting efficient training (x-formers, FP16 training, deepspeed, wandb) and a wide range of possible research directions (pre-training -> fine-tuning -> human-specific fine-tuning).

🌟With this project, you can get:

  • [User]: Just try our online demo! Or deploy the model inference locally.
  • [Researcher]: An easy-to-use codebase for re-implementation and development.
  • [Researcher]: A wide range of research directions for further improvement.



🚀 Getting Started

Installation

## after py3.8 env initialization
pip install --user torch==1.12.1+cu113 torchvision==0.13.1+cu113 -f https://download.pytorch.org/whl/torch_stable.html
pip install --user progressbar psutil pymongo simplejson yacs boto3 pyyaml ete3 easydict deprecated future django orderedset python-magic datasets h5py omegaconf einops ipdb
pip install --user --exists-action w -r requirements.txt
pip install git+https://github.com/microsoft/azfuse.git


## for acceleration
pip install --user deepspeed==0.6.3
pip install -v -U git+https://github.com/facebookresearch/xformers.git@main#egg=xformers

## you may need to downgrade protobuf to 3.20.x
pip install protobuf==3.20.0
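Optionally, a quick sanity check (a small sketch, not part of the official setup) that the key packages installed above import correctly and CUDA is visible before launching any job:

# Verify the environment: prints the installed versions and CUDA availability.
import torch
import xformers
import deepspeed

print("torch:", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("xformers:", xformers.__version__, "| deepspeed:", deepspeed.__version__)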

Data Preparation

1. Human Attribute Pre-training

We create a human image subset (700K images) filtered from existing image corpora for human attribute pre-training:

| Dataset | COCO (Single Person) | TikTok Style | DeepFashion2 | SHHQ-1.0 | LAION-Human |
| ------- | -------------------- | ------------ | ------------ | -------- | ----------- |
| Size    | 20K                  | 124K         | 276K         | 40K      | 240K        |

The pre-processed pre-training data with the efficient TSV data format can be downloaded here (Google Drive).

Data Root
├── composite/
│   ├── train_xxx.yaml  # This path needs to be specified in the training args
│   ├── val_xxx.yaml
│   └── ...
├── TikTokDance/
│   ├── xxx_images.tsv
│   ├── xxx_poses.tsv
│   └── ...
└── coco/
    ├── xxx_images.tsv
    └── xxx_poses.tsv
2. Fine-tuning with Disentangled Control

We use the TikTok dataset for fine-tuning.

We have already pre-processed the TikTok data into the efficient TSV format, which can be downloaded here (Google Drive). (Note that we only use the 1st frame of each TikTok video as the reference image.)

The data folder structure should be like:

Data Root
├── composite_offset/
│   ├── train_xxx.yaml  # This path needs to be specified in the training args
│   ├── val_xxx.yaml
│   └── ...
└── TikTokDance/
    ├── xxx_images.tsv
    ├── xxx_poses.tsv
    └── ...

*PS: If you want to use your own data source with our TSV data structure, please follow PREPRO.MD for reference.
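For orientation, below is a minimal sketch of the TSV + .lineidx idea that PREPRO.MD describes: each row is a tab-separated record (e.g., an image key plus the base64-encoded image bytes), and a companion .lineidx file stores the byte offset of every row so rows can be fetched by index without scanning the file. The exact row schema used by this repo may differ; treat the helper names as illustrative, not as the repo's API.

import base64
import os

def write_tsv(rows, tsv_path):
    # Write rows of string fields as a TSV and record each line's byte offset
    # in a companion .lineidx file for later random access.
    lineidx_path = os.path.splitext(tsv_path)[0] + ".lineidx"
    with open(tsv_path, "wb") as f_tsv, open(lineidx_path, "w") as f_idx:
        for fields in rows:
            f_idx.write(str(f_tsv.tell()) + "\n")
            f_tsv.write(("\t".join(fields) + "\n").encode("utf-8"))

def image_row(image_key, image_path):
    # One row: an image key plus the base64-encoded image bytes.
    with open(image_path, "rb") as f:
        return [image_key, base64.b64encode(f.read()).decode("utf-8")]

def read_row(tsv_path, idx):
    # Random access: look up the byte offset, seek, and split the line.
    lineidx_path = os.path.splitext(tsv_path)[0] + ".lineidx"
    with open(lineidx_path) as f:
        offset = int(f.readlines()[idx])
    with open(tsv_path, "rb") as f:
        f.seek(offset)
        return f.readline().decode("utf-8").rstrip("\n").split("\t")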



Human Attribute Pre-training

Training:

AZFUSE_USE_FUSE=0 QD_USE_LINEIDX_8B=0 NCCL_ASYNC_ERROR_HANDLING=0 python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir /home1/wangtan/code/ms_internship2/github_repo/run_test \
--local_train_batch_size 64 --local_eval_batch_size 64 --log_dir exp/tiktok_pretrain \
--epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 \
--learning_rate 1e-3 --fix_dist_seed --loss_target "noise" \
--train_yaml ./blob_dir/debug_output/video_sythesis/dataset/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml ./blob_dir/debug_output/video_sythesis/dataset/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml \
--unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask \
--conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0

Pre-trained Model Checkpoint: OneDrive



Fine-tuning with Disentangled Control




1. Modify the config file

Download sd-image-variations-diffusers from the official Diffusers repo and place it at the location specified by pretrained_model_path in the config file. Alternatively, you can modify pretrained_model_path.



2. w/o Classifier-Free Guidance (CFG)

Training:

[*To enable WANDB, set up the wandb key in utils/lib.py]

[*To run on multiple GPUs, prepend mpirun -np {GPU NUM} to the python command.]

WANDB_ENABLE=0 AZFUSE_USE_FUSE=0 NCCL_ASYNC_ERROR_HANDLING=0 mpirun -np 8 python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet/tiktok_S256L16_xformers_tsv.py \
--do_train --root_dir /home1/wangtan/code/ms_internship2/github_repo/run_test \
--local_train_batch_size 32 \
--local_eval_batch_size 32 \
--log_dir exp/tiktok_ft \
--epochs 20 --deepspeed \
--eval_step 500 --save_step 500 \
--gradient_accumulate_steps 1 \
--learning_rate 2e-4 --fix_dist_seed --loss_target "noise" \
--train_yaml /home/wangtan/data/disco/yaml_file/train_TiktokDance-poses-masks.yaml \
--val_yaml /home/wangtan/data/disco/yaml_file/new10val_TiktokDance-poses-masks.yaml \
--unet_unfreeze_type "all" \
--guidance_scale 3 \
--refer_sdvae \
--ref_null_caption False \
--combine_clip_local --combine_use_mask \
--conds "poses" "masks" \
--stage1_pretrain_path /path/to/pretrained_model_checkpoint/mp_rank_00_model_states.pt 

Visualization:

To run the visualization, just change --do_train to --eval_visu. You can also specify the visualization folder name with --eval_save_filename xxx.

Evaluation:

You first need to run the visualization to generate the predictions. Then use gen_eval.sh to obtain all the evaluation metrics for {exp_dir_path}/{prediction_folder_name} in one step:

bash gen_eval.sh {exp_dir_path} {exp_dir_path}/{prediction_folder_name}

For example,

bash gen_eval.sh /home/kevintw/code/disco/github2/DisCo/save_results/TikTok_cfg_check /home/kevintw/code/disco/github2/DisCo/save_results/TikTok_cfg_check/pred_gs1.5_scale-cond1.0-ref1.0/

You may need to download the pre-trained vision models and revise their paths in gen_eval.sh to compute the FVD metric.



3. w/ Classifier-Free Guidance (CFG) [CFG can bring slightly better results]

Training (add the following args to the w/o CFG training script above; a conceptual sketch of what these two flags do follows the block):

--drop_ref 0.05 # probability of dropping the reference image during training
--guidance_scale 1.5 # the CFG guidance scale
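Conceptually, these two flags implement the standard classifier-free guidance recipe. The sketch below (assuming a diffusers-style UNet; the function names are illustrative and not the repo's code) shows the idea: during training the reference conditioning is dropped with probability drop_ref, and at inference the conditional and unconditional noise predictions are combined with guidance_scale.

import torch

def maybe_drop_reference(ref_emb, null_emb, drop_ref=0.05):
    # Training side: with probability drop_ref, swap the reference-image
    # embedding for a "null" embedding so the unconditional branch is learned.
    keep = (torch.rand(ref_emb.shape[0], device=ref_emb.device) >= drop_ref).float()
    keep = keep.view(-1, *([1] * (ref_emb.dim() - 1)))
    return keep * ref_emb + (1.0 - keep) * null_emb

def cfg_noise_pred(unet, latents, t, ref_emb, null_emb, guidance_scale=1.5):
    # Inference side: run the denoiser with and without the reference, then
    # extrapolate away from the unconditional prediction.
    cond = unet(latents, t, encoder_hidden_states=ref_emb).sample
    uncond = unet(latents, t, encoder_hidden_states=null_emb).sample
    return uncond + guidance_scale * (cond - uncond)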

Visualization:

To run the visualization, just change --do_train to --eval_visu. You can also specify the visualization folder name with --eval_save_filename xxx. (Remember to also specify --guidance_scale.)

You can also check our command bash file config/command_bash/tiktok_cfg.sh for reference.

Evaluation:

Same as above.



4. Temporal module fine-tuning

Training: After training the image DisCo model, we further incorporate temporal convolutional layers and temporal attention layers to improve temporal smoothness (a conceptual sketch of a temporal layer follows the command below). Note that the --pretrained_model argument should point to the image DisCo model checkpoint, not the stage-1 pre-trained checkpoint.

AZFUSE_USE_FUSE=0 NCCL_ASYNC_ERROR_HANDLING=0 python finetune_sdm_yaml.py --cf config/disco_w_tm/yz_tiktok_S256L16_xformers_tsv_temdisco_temp_attn.py \
--do_train --root_dir /home1/wangtan/code/ms_internship2/github_repo/run_test \
--local_train_batch_size 2 \
--local_eval_batch_size 2 \
--log_dir exp/tiktok_ft \
--epochs 20 --deepspeed \
--eval_step 500 --save_step 500 \
--gradient_accumulate_steps 1 \
--learning_rate 1e-4 --fix_dist_seed --loss_target "noise" \
--train_yaml /home/wangtan/data/disco/yaml_file/train_TiktokDance-poses-masks.yaml \
--val_yaml /home/wangtan/data/disco/yaml_file/new10val_TiktokDance-poses-masks.yaml \
--unet_unfreeze_type "all" \
--refer_sdvae \
--ref_null_caption False \
--combine_clip_local --combine_use_mask \
--train_sample_interval 4 \
--nframe 16 \
--frame_interval 1 \
--conds "poses" "masks" \
--pretrained_model /path/to/pretrained_model/mp_rank_00_model_states.pt 
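As a rough illustration of the temporal layers mentioned above (a sketch only; the actual modules live in the config/disco_w_tm model code and may differ), a temporal attention block reshapes the batch-of-frames axis so that self-attention runs over the nframe dimension at each spatial location:

import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    # Self-attention over the frame axis, applied independently per spatial location.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, nframe):
        # x: (B * nframe, C, H, W), the usual feature shape inside an image UNet.
        bt, c, h, w = x.shape
        b = bt // nframe
        # Fold space into the batch and attend over the nframe axis.
        x_ = x.view(b, nframe, c, h * w).permute(0, 3, 1, 2).reshape(b * h * w, nframe, c)
        y = self.norm(x_)
        y, _ = self.attn(y, y, y)
        # Residual connection; real implementations typically zero-init the output
        # projection so the pretrained image model is unchanged at the start.
        x_ = x_ + y
        return x_.reshape(b, h * w, nframe, c).permute(0, 2, 3, 1).reshape(bt, c, h, w)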

Evaluation: Simply replace the previous gen_eval.sh script with the gen_eval_tm.sh script, as follows. The GT folder path will be filled in automatically.

bash gen_eval_tm.sh {exp_dir_path} {exp_dir_path}/{prediction_folder_name}



5. Possible issue for FVD metric reproduction

Please first check the GitHub issue and response here. We have validated the checkpoint results on an A100 GPU. If you still cannot reproduce the results, please open an issue or send me an email.



Human-Specific Fine-tuning


1. Prepare dataset that you want to use for training

  • Prepare a human-specific video or a set of human images

  • Use Grounded-SAM and OpenPose to obtain human mask and human skeleton for each training image (See PREPRO.MD for more details)

  • For human-specific fine-tuning, we recommend directly using raw images/masks/poses for training rather than building TSV files. If you still want to use the TSV file structure to prepare your data, please follow PREPRO.MD for reference.

2. Run the following script for human-specific fine-tuning:

For parameter tuning, we recommend first tuning the learning rate and unet_unfreeze_type.

AZFUSE_USE_FUSE=0 NCCL_ASYNC_ERROR_HANDLING=0 python finetune_sdm_yaml.py \
--cf config/ref_attn_clip_combine_controlnet_imgspecific_ft/webtan_S256L16_xformers_upsquare.py --do_train --root_dir /path/of/saving/root \
--local_train_batch_size 32 --local_eval_batch_size 32 --log_dir exp/human_specific_ft/ \
--epochs 20 --deepspeed --eval_step 500 --save_step 500 --gradient_accumulate_steps 1 \
--learning_rate 1e-3  --fix_dist_seed  --loss_target "noise" \
--unet_unfreeze_type "crossattn" \
--refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask --conds "poses" "masks" \
--freeze_pose True --freeze_background False \
--pretrained_model /path/to/the/ft_model_checkpoint \
--ft_iters 500 --ft_one_ref_image False --ft_idx dataset/folder/name --strong_aug_stage1 True --strong_rand_stage2 True



Notes

1. Possible issue for FVD metric reproduction

Please first check the GitHub issue and response here. We have validated the checkpoint results on an A100 GPU. If you still cannot reproduce the results, please open an issue or send me an email.

2. PSNR metric

We thank @Delicious-Bitter-Melon for highlighting a potential numerical overflow issue in the implementation of the PSNR metric. This issue has been resolved in the codebase, and the updated score is reflected in our latest version. It's important to note that this correction does not alter the trend or affect the overall conclusions.
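For anyone re-checking their numbers, the overflow in question is the classic unsigned-integer pitfall sketched below (illustrative code, not the repo's exact implementation): the difference and its square must be computed in floating point, otherwise uint8 arithmetic wraps around and corrupts the mean squared error.

import numpy as np

def psnr(gt, pred, max_val=255.0):
    # Cast BEFORE subtracting: on uint8 arrays, (gt - pred) and its square wrap
    # around modulo 256, silently distorting the MSE and hence the PSNR.
    gt = gt.astype(np.float64)
    pred = pred.astype(np.float64)
    mse = np.mean((gt - pred) ** 2)
    if mse == 0:
        return float("inf")
    return 20.0 * np.log10(max_val) - 10.0 * np.log10(mse)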

3. DisCo is not limited to upper-body humans.

Please check our latest paper for its generalizability. It can be easily achieved by incorporating extensive image scaling augmentations during the training phase.
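A minimal sketch of the kind of scale augmentation meant here (not the repo's exact transform): randomly shrink the content and pad it back to the original size so the model sees full-body framings as well as close-ups. In practice the same sampled scale should be applied to the target image, the pose map, and the mask so that the conditions stay aligned.

import random
from PIL import Image

def random_scale_pad(img, scale=None, min_scale=0.5):
    # Shrink the content by `scale` and paste it centered on a blank canvas of
    # the original size; sample `scale` once and reuse it for image/pose/mask.
    if scale is None:
        scale = random.uniform(min_scale, 1.0)
    w, h = img.size
    new_w, new_h = max(1, int(w * scale)), max(1, int(h * scale))
    small = img.resize((new_w, new_h), Image.BILINEAR)
    canvas = Image.new(img.mode, (w, h))
    canvas.paste(small, ((w - new_w) // 2, (h - new_h) // 2))
    return canvas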



Release Plan

  • Code for "Fine-tuning with Disentangled Control"
  • Code for "Human-Specific Fine-tuning"
  • Model Checkpoints for Pre-training and Fine-tuning
  • HuggingFace Demo
  • Code for "Human Attribute Pre-training"



Citation

If you use our work in your research, please cite:

@article{wang2023disco,
  title={Disco: Disentangled control for realistic human dance generation},
  author={Wang, Tan and Li, Linjie and Lin, Kevin and Zhai, Yuanhao and Lin, Chung-Ching and Yang, Zhengyuan and Zhang, Hanwang and Liu, Zicheng and Wang, Lijuan},
  journal={arXiv preprint arXiv:2307.00040},
  year={2023}
}

Contributors

kevinlin311tw, wangt-cn, yhzhai


disco's Issues

No module named 'torch._six'

When I ran the Google Colab DisCo_Demo.ipynb, the following error occurred:

ModuleNotFoundError                       Traceback (most recent call last)
[<ipython-input-25-7aa731641df4>](https://localhost:8080/#) in <cell line: 6>()
      4 
      5 from utils.wutils_ldm import *
----> 6 from agent import Agent_LDM, WarmupLinearLR, WarmupLinearConstantLR
      7 import torch
      8 from config import BasicArgs

3 frames
[/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/utils.py](https://localhost:8080/#) in <module>
     16 
     17 import torch
---> 18 from torch._six import inf
     19 import torch.distributed as dist
     20 

ModuleNotFoundError: No module named 'torch._six'

But 'torch._six' is only available in torch==1.7.0 or lower.
Next, the following error occurred.

!pip install pip install torch==1.7.0
Requirement already satisfied: pip in /usr/local/lib/python3.10/dist-packages (23.1.2)
Collecting install
  Downloading install-1.3.5-py3-none-any.whl (3.2 kB)
ERROR: Could not find a version that satisfies the requirement torch==1.7.0 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0)
ERROR: No matching distribution found for torch==1.7.0

Please revise the Google Colab so it works with torch versions 2.0.0 or above.

Multi-GPU failed to run

Hi, thanks for the great work.

My GPU: 2080ti * 10

AZFUSE_USE_FUSE=0 NCCL_ASYNC_ERROR_HANDLING=0 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8,9 mpirun -np 8 python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir run_test \ --local_train_batch_size 8 --local_eval_batch_size 8 --log_dir exp/tiktok_pretrain \ --epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 \ --learning_rate 1e-3 --fix_dist_seed --loss_target "noise" \ --train_yaml /data/mfyan/Human_Attribute_Pretrain/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml /data/mfyan/Human_Attribute_Pretrain/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml \ --unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask \ --conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0

The first error reported is RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:12475 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:12475 (errno: 98 - Address already in use). Then I changed the port number in utils/dist.py to something else and found that the same type of error was still reported, so I changed the port number to random.randint(10000, 20000), and it worked. But I found all 8 processes running only on GPU 0, resulting in RuntimeError: CUDA error: out of memory.
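For reference, a generic way to avoid hard-coding a port (not taken from this repo) is to let the OS pick a free one and pass it to the launcher as MASTER_PORT:

import socket

def find_free_port():
    # Binding to port 0 asks the OS for any unused port; read it back and
    # release the socket before handing the number to the training launcher.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]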

'GIT/{:05d}/labels/{:04d}.txt' How to get this file?

Hello author, I would like to ask how to obtain the files referenced by self.anno_path = 'GIT/{:05d}/labels/{:04d}.txt' for images in the TikTok dataset.

    if 'youtube' in anno_pose_path:
        img_key = self.anno_list[idx % self.num_images]
    else:
        anno = list(open(anno_path))
        img_key = json.loads(anno[0].strip())['image_key']
    """
    example:
    {"num_region": 6, "image_key": "TiktokDance_00001_0002.png", "image_split": "00001", "image_read_error": false}
    {"box_id": 0, "class_name": "aerosol_can", "norm_bbox": [0.5, 0.5, 1.0, 1.0], "conf": 0.0, "region_caption": "a woman with an orange dress with butterflies on her shirt.", "caption_conf": 0.9404542168542169}
    {"box_id": 1, "class_name": "person", "norm_bbox": [0.46692365407943726, 0.4977584183216095, 0.9338473081588745, 0.995516836643219], "conf": 0.912740170955658, "region_caption": "a woman with an orange dress with butterflies on her shirt.", "caption_conf": 0.9404542168542169}
    {"box_id": 2, "class_name": "butterfly", "norm_bbox": [0.2368704378604889, 0.5088028907775879, 0.1444256454706192, 0.04199704900383949], "conf": 0.8738771677017212, "region_caption": "a brown butterfly sitting on an orange background.", "caption_conf": 0.9297735554473283}
    {"box_id": 3, "class_name": "butterfly", "norm_bbox": [0.6688584089279175, 0.5137135982513428, 0.11311062425374985, 0.05455022677779198], "conf": 0.8287128806114197, "region_caption": "a brown butterfly sitting on an orange wall.", "caption_conf": 0.9264783379302365}
    {"box_id": 4, "class_name": "blouse", "norm_bbox": [0.4692786931991577, 0.6465241312980652, 0.9283269643783569, 0.6027728319168091], "conf": 0.6851752400398254, "region_caption": "a woman wearing an orange shirt with butterflies on it.", "caption_conf": 0.9978814544264754}
    {"box_id": 5, "class_name": "short_pants", "norm_bbox": [0.44008955359458923, 0.8769687414169312, 0.8799525499343872, 0.2431662678718567], "conf": 0.6741859316825867, "region_caption": "a person wearing an orange shirt and grey sweatpants.", "caption_conf": 0.9731313580907464}
    """

deepspeed.runtime.zero.utils.ZeRORuntimeException

Hello, thanks for this great work! When I was trying to run through the code

AZFUSE_USE_FUSE=0 QD_USE_LINEIDX_8B=0 NCCL_ASYNC_ERROR_HANDLING=0 python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir /home1/wangtan/code/ms_internship2/github_repo/run_test \
--local_train_batch_size 64 --local_eval_batch_size 64 --log_dir exp/tiktok_pretrain \
--epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 \
--learning_rate 1e-3 --fix_dist_seed --loss_target "noise" \
--train_yaml ./blob_dir/debug_output/video_sythesis/dataset/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml ./blob_dir/debug_output/video_sythesis/dataset/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml \
--unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask \
--conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0

I hit the following exception:

Traceback (most recent call last):
  File "finetune_sdm_yaml.py", line 209, in <module>
    main_worker(parsed_args)
  File "finetune_sdm_yaml.py", line 135, in main_worker
    trainer.setup_model_for_training()
  File "/data1/tao.wu/DisCo/agent.py", line 978, in setup_model_for_training
    self.prepare_dist_model()
  File "/data1/tao.wu/DisCo/agent.py", line 205, in prepare_dist_model
    lr_scheduler=self.scheduler)
  File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/__init__.py", line 181, in initialize
    config_class=config_class)
  File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 310, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/data1/tao.wu/anaconda3/envs/disco/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1196, in _configure_optimizer
    raise ZeRORuntimeException(msg)
deepspeed.runtime.zero.utils.ZeRORuntimeException: You are using ZeRO-Offload with a client provided optimizer (<class 'torch.optim.adamw.AdamW'>) which in most cases will yield poor performance. Please either use deepspeed.ops.adam.DeepSpeedCPUAdam or set an optimizer in your ds-config (https://www.deepspeed.ai/docs/config-json/#optimizer-parameters). If you really want to use a custom optimizer w. ZeRO-Offload and understand the performance impacts you can also set <"zero_force_ds_cpu_optimizer": false> in your configuration file.

I wonder what may cause such an exception; could anyone help me out? Thanks a lot!

Composite Caption LineList Creation?

Hi! Thanks for your DisCo paper and the explanation of the TSV file preparation.

In the composite yaml file, you have a 'caption linelist' file which is used.
caption_linelist: train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.caption.linelist.tsv
Could you explain how you created this file?

Training/Validation Data Split

Hi, thanks for your great work. I checked the TikTok TSV dataset and found that you've already split it into training and validation sets. Since it's not easy to match each image with the original sequence IDs of the dataset, could you please clarify which sequences from the original TikTok dataset (from 000 to 340) are used for training and which for validation? Thanks!

Question about training data structure

Hi, I would like to use a different dataset for the second step of fine-tuning. How should I structure the data in the format you provide? For example, how can I obtain the files train_images.lineidx and train_images.lineidx.8b?
Can you provide a brief tutorial on how to use tsv_file_ops.py and tsv_file.py?

OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like /home1/wangtan/code/ms_internship2/github_repo/run_test/diffusers/sd-image-variations-diffusers is not the path to a directory containing a scheduler_config.json file. Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/diffusers/installation#offline-mode'.

[2023-07-04 16:13:34 <finetune_sdm_yaml.py:89> main_worker] Building models...
[2023-07-04 16:13:34 <finetune_sdm_yaml.py:89> main_worker] Building models...
Traceback (most recent call last):
  File "/root/anaconda3/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 355, in load_config
    config_file = hf_hub_download(
  File "/root/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 112, in _inner_fn
    validate_repo_id(arg_value)
  File "/root/anaconda3/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 160, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/home1/wangtan/code/ms_internship2/github_repo/run_test/diffusers/sd-image-variations-diffusers'. Use repo_type argument if needed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/DisCo/finetune_sdm_yaml.py", line 209, in <module>
    main_worker(parsed_args)
  File "/data/DisCo/finetune_sdm_yaml.py", line 90, in main_worker
    model = Net(args)
  File "/data/DisCo/config/ref_attn_clip_combine_controlnet_attr_pretraining/net.py", line 38, in __init__
    tr_noise_scheduler = DDPMScheduler.from_pretrained(
  File "/root/anaconda3/lib/python3.10/site-packages/diffusers/schedulers/scheduling_utils.py", line 139, in from_pretrained
    config, kwargs, commit_hash = cls.load_config(
  File "/root/anaconda3/lib/python3.10/site-packages/diffusers/configuration_utils.py", line 391, in load_config
    raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this model, couldn't find it in the cached files and it looks like /home1/wangtan/code/ms_internship2/github_repo/run_test/diffusers/sd-image-variations-diffusers is not the path to a directory containing a scheduler_config.json file.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/diffusers/installation#offline-mode'.
How can I fix this? Thanks!

support multi-gpus

hi, I've tried running the code on multiple GPUs, but it seems that it doesn't utilize all available GPU resources. Could you please provide some guidance on how I can modify the code or which commands I should use to enable multi-GPU support? Thank you very much for your help.

Can we use any other controls instead of pose?

Great work @Wangt-CN. Currently the repo uses 2D keypoints to control the pose of the output. Is it possible to replace pose with Canny- or depth-based control? If yes, would it require only replacing the ControlNet model or retraining the complete model? Thanks.

About the training data.

Thanks for your great work!
I am curious whether, in the Human Attribute Pre-training stage, you pre-trained the model with the full-body images in SHHQ or just used the cropped upper-body images (e.g., as shown in the TikTok video results)?

For HAP training, it stopped after step 47999

My log is shown in the attached screenshot.

As for metric.json, it contains:
{"Step 0": {"eval": {"FID": 290.6777284435151, "time": "0:03:31.655149"}}, "Epoch2": {"train": {"loss_total": 0.09805237877084273, "time": "2:00:58.214095"}}, "Step2000": {"eval": {"FID": 39.84386908484396, "time": "0:04:10.178784"}}, "Epoch3": {"train": {"loss_total": 0.08273238215732012, "time": "1:05:12.567647"}}, "Step4000": {"eval": {"FID": 43.90990567108605, "time": "0:04:07.709888"}}, "Epoch4": {"train": {"loss_total": 0.08007360270170316, "time": "0:12:27.958903"}}, "Epoch5": {"train": {"loss_total": 0.07951762963632482, "time": "2:10:26.386287"}}, "Step6000": {"eval": {"FID": 47.38045255087343, "time": "0:04:09.181442"}}, "Epoch6": {"train": {"loss_total": 0.0778031051322654, "time": "1:17:38.705795"}}, "Step8000": {"eval": {"FID": 44.090458160650826, "time": "0:04:09.042723"}}, "Epoch7": {"train": {"loss_total": 0.07623744776395902, "time": "0:24:55.471577"}}, "Epoch8": {"train": {"loss_total": 0.07622077868163159, "time": "2:22:49.398254"}}, "Step10000": {"eval": {"FID": 31.904819727152358, "time": "0:04:09.004639"}}, "Epoch9": {"train": {"loss_total": 0.07504117791188147, "time": "1:30:05.510775"}}, "Step12000": {"eval": {"FID": 27.483082697985367, "time": "0:04:09.374085"}}, "Epoch10": {"train": {"loss_total": 0.07409243798521284, "time": "0:37:20.796301"}}, "Epoch11": {"train": {"loss_total": 0.07395339482515068, "time": "2:35:10.650475"}}, "Step14000": {"eval": {"FID": 31.168737757947156, "time": "0:04:09.885283"}}, "Epoch12": {"train": {"loss_total": 0.07372492651599219, "time": "1:42:29.514715"}}, "Step16000": {"eval": {"FID": 27.21106589500107, "time": "0:04:08.266375"}}, "Epoch13": {"train": {"loss_total": 0.07312383905231748, "time": "0:49:47.800272"}}, "Epoch14": {"train": {"loss_total": 0.07289745142666333, "time": "2:47:37.351242"}}, "Step18000": {"eval": {"FID": 23.106254980103927, "time": "0:04:07.923059"}}, "Epoch15": {"train": {"loss_total": 0.07242165016734459, "time": "1:54:56.220633"}}, "Step20000": {"eval": {"FID": 27.248582831371834, "time": "0:04:08.544414"}}, "Epoch16": {"train": {"loss_total": 0.07194297805632631, "time": "1:02:14.190223"}}, "Step22000": {"eval": {"FID": 24.803106175247194, "time": "0:04:07.721933"}}, "Epoch17": {"train": {"loss_total": 0.07178588963246771, "time": "0:09:32.512436"}}, "Epoch18": {"train": {"loss_total": 0.07133925958989136, "time": "2:07:23.062100"}}, "Step24000": {"eval": {"FID": 24.043788111684535, "time": "0:04:08.888000"}}, "Epoch19": {"train": {"loss_total": 0.07101746586461861, "time": "1:14:44.514627"}}, "Step26000": {"eval": {"FID": 21.995370168790316, "time": "0:04:09.391666"}}, "Epoch20": {"train": {"loss_total": 0.0711026608135349, "time": "0:21:59.766503"}}, "Epoch21": {"train": {"loss_total": 0.07048133608498951, "time": "2:19:49.997902"}}, "Step28000": {"eval": {"FID": 24.72485237611329, "time": "0:04:08.996139"}}, "Epoch22": {"train": {"loss_total": 0.07035766966611788, "time": "1:27:07.072584"}}, "Step30000": {"eval": {"FID": 23.718398035524274, "time": "0:04:08.026119"}}, "Epoch23": {"train": {"loss_total": 0.07033483951472409, "time": "0:34:28.609709"}}, "Epoch24": {"train": {"loss_total": 0.0700808005451255, "time": "2:32:20.204781"}}, "Step32000": {"eval": {"FID": 22.28186004474327, "time": "0:04:09.402551"}}, "Epoch25": {"train": {"loss_total": 0.06941193997643072, "time": "1:39:37.250015"}}, "Step34000": {"eval": {"FID": 22.008737701972393, "time": "0:04:08.751428"}}, "Epoch26": {"train": {"loss_total": 0.06975648029961369, "time": "0:46:56.325861"}}, "Epoch27": {"train": {"loss_total": 
0.06926694908420368, "time": "2:44:55.461831"}}, "Step36000": {"eval": {"FID": 20.646719371436802, "time": "0:04:09.315956"}}, "Epoch28": {"train": {"loss_total": 0.06902978069161715, "time": "1:51:58.898925"}}, "Step38000": {"eval": {"FID": 20.947136795766653, "time": "0:04:08.238444"}}, "Epoch29": {"train": {"loss_total": 0.06883154165577786, "time": "0:59:21.183679"}}, "Step40000": {"eval": {"FID": 21.91535396280665, "time": "0:04:08.135138"}}, "Epoch30": {"train": {"loss_total": 0.0685038694108908, "time": "0:06:39.512788"}}, "Epoch31": {"train": {"loss_total": 0.06833649117448559, "time": "2:04:39.496344"}}, "Step42000": {"eval": {"FID": 21.03997708430046, "time": "0:04:08.954742"}}, "Epoch32": {"train": {"loss_total": 0.06803931710874384, "time": "1:11:49.735317"}}, "Step44000": {"eval": {"FID": 20.869328025712207, "time": "0:04:08.273694"}}, "Epoch33": {"train": {"loss_total": 0.06827863852959126, "time": "0:19:08.113691"}}, "Epoch34": {"train": {"loss_total": 0.06779086924764614, "time": "2:16:59.691355"}}, "Step46000": {"eval": {"FID": 21.228485860542378, "time": "0:04:07.829478"}}, "Epoch35": {"train": {"loss_total": 0.06772437718459348, "time": "1:24:18.001555"}}, "Step48000": {"eval": {"FID": 21.43211396473822, "time": "0:04:08.974074"}}, "Epoch36": {"train": {"loss_total": 0.06746692015109836, "time": "0:31:33.721808"}}}
Does the code have an early-stopping mechanism, or did my run hit an error?
I ran the code with this command:
AZFUSE_USE_FUSE=0 QD_USE_LINEIDX_8B=0 NCCL_ASYNC_ERROR_HANDLING=0 mpirun -np 8 --allow-run-as-root python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir /data/DisCo --local_train_batch_size 64 --local_eval_batch_size 64 --log_dir exp/tiktok_pretrain --epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 --learning_rate 1e-3 --fix_dist_seed --loss_target "noise" --train_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml --unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask --conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0 >> log.txt 2>&1

change sd models

Hello, thank you for the great code. However, I have some slight concerns about the image quality,
so I wanted to ask if it's possible to replace the sd-image-variations-diffusers model with another model.
It seems difficult to make an immediate change due to the image_encoder file.

Thank you, and I hope you have a wonderful day.

About video frame consistency

Thanks for your great work!
I am curious about this model how to process the 'video frame consistency'. The paper seems to not consider this issue.

I try the video pose transfer and result as follows (far from paper shows. Am I missing some steps ? ):

out.mp4

human_img_edit_gradio.ipynb run error

cf = import_filename(args.cf)
Net, inner_collect_fn = cf.Net, cf.inner_collect_fn
cf is config/ref_attn_clip_combine_controlnet/app_demo_image_edit.py, and app_demo_image_edit.py does not have Net and inner_collect_fn

About BatchSize in Fine-tuning with Disentangled Control

Thank you for great work @Wangt-CN.
When Fine-tuning with Disentangled Control in TiktokDance, the paper states that "it is trained on 8 NVIDIA V100 GPUs for 70K steps with an image size of 256 × 256 and a learning rate of 2e−4". I would like to know the value of the local_batch_size in this case.
Thanks a lot.

How can I get the key point information of the skeletons from your pre-processed data?

Hello,
I would like to extract the key point information of the skeletons from your pre-processed data.
After reading the pose images from the TSV files, I found through numerical analysis that the pixel values at the corresponding locations do not match the RGB values of the colors defined in create_custom_dataset_tsvs.py. It looks like the pose images were stored in the TSV files with lossy compression?
Looking forward to your reply, thank you!

About the training data

Thanks for sharing this great work!

I find that you uploaded the training data to Google cloud storage in TSV format. It is inconvenient for me to download the data from there. Could you please upload a copy of the data to another cloud storage service, such as Google Drive, Aliyun, or Baiduyun?

Thanks

Video frame 'expand' when performing FVD

Hi, thank you for your great work! I have a question about the FVD evaluation. I intend to follow this work, but I have some problems when evaluating FVD. (The other quantitative results are consistent with the paper.)

When I check the configuration of the videos generated from the GIFs (in tool/metrics/utils.py, 'DatasetFVDVideoResize'), I find that the video has a size of [128, 112, 112, 3]; however, the GIF has only 16 frames. So I checked the ffmpeg call in tool/metrics/utils.py line 358:

out, _ = (ffmpeg.input(path).output('pipe:', format='rawvideo', pix_fmt='rgb24').run(capture_stdout=True, quiet=False))

It outputs something like the text below, which means it converts the 16-frame GIF into a 128-frame video (and segments it into 8 pieces for the num_seg parameter):

Input #0, gif, from '/root/autodl-tmp/DisCo/run_test/exp/tiktok_ft/outputs//pred_gs1.5_scale-cond1.0-ref1.0_gif/TiktokDance_00337_0010png.gif': Duration: 00:00:05.28, start: 0.000000, bitrate: 866 kb/s Stream #0:0: Video: gif, bgra, 256x256, 3.03 fps, 24.25 tbr, 100 tbn, 100 tbc

Output #0, rawvideo, to 'pipe:': Metadata: encoder : Lavf58.29.100 Stream #0:0: Video: rawvideo (RGB[24] / 0x18424752), rgb24, 256x256, q=2-31, 38141 kb/s, 24.25 fps, 24.25 tbn, 24.25 tbc Metadata: encoder : Lavc58.54.100 rawvideo

And if I set the fps in gen_eval.sh to 25 (so the video has 16 frames), the FVD-3DRN50 becomes 96.15 (using the 'More TikTok-Style Training Data (FID-FVD: 15.7)' checkpoint);
even if I don't change the fps (keeping it at 3), the FVD-3DRN50 is 20.34, which differs from the paper.

So I have 3 questions on this evaluation:

  1. Should we change fps in gen_eval.sh?
  2. Like #25, I evaluate the FVD using FID-VID: resnet-50-kinetics.pth ("https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-50-kinetics.pth", MD5 a044310dff79e2688c342d55a0b202d2) and FVD: i3d_pretrained_400.pt ("https://drive.google.com/file/d/1mQK8KD8G6UWRa5t87SRMm5PVXtlpneJT/edit", MD5 c275f5caff95bea0b712515feedad130). Are these two correct for evaluation?
  3. In #27 , the authors say the evaluation uses 335-340 and 5 OL video as evaluation, but the provided new10val_TiktokDance-poses-masks.yaml outputs 337/338/201/202/203. Maybe the correct yaml will lead to the paper FVD results?

Thank you!

LPIPS evaluation should add `normalize=True`

Hi, thanks for the great work.
I noticed that the LPIPS evaluation does not include normalize=True given that the inputs are in the [0,1] range. Adding this would change the results from 0.292 to 0.339. Despite this increase, the result still remains significantly better than the baseline.

import os
import numpy as np
import lpips
from torchvision import transforms
from tqdm import tqdm
# (imports added for completeness; load_image is a utility from the repo)

def compute_lpips(gen_inst_name_full, gt_inst_name_full):
    gen_inst_name_full = sorted(gen_inst_name_full)
    gt_inst_name_full = sorted(gt_inst_name_full)
    convert_tensor = transforms.ToTensor()
    loss_fn_vgg = lpips.LPIPS(net='vgg')
    scores = []
    for gen_path, gt_path in tqdm(zip(gen_inst_name_full, gt_inst_name_full)):
        gen_filename = os.path.splitext(os.path.basename(gen_path))[0]
        gt_filename = os.path.splitext(os.path.basename(gt_path))[0]
        assert gen_filename == gt_filename, 'file mismatch'
        image1 = convert_tensor(load_image(gen_path)).unsqueeze(0)
        image2 = convert_tensor(load_image(gt_path)).unsqueeze(0)
        score = loss_fn_vgg(image1, image2).item()
        scores.append(score)
    score_ave = np.mean(scores)
    return score_ave

LPIPS:
https://github.com/richzhang/PerceptualSimilarity/blob/31bc1271ae6f13b7e281b9959ac24a5e8f2ed522/lpips/lpips.py#L112-L115
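For reference, a minimal sketch of the corrected call suggested above: lpips.LPIPS accepts a normalize flag that rescales [0, 1] inputs to the [-1, 1] range the network expects.

import lpips
import torch

loss_fn_vgg = lpips.LPIPS(net='vgg')

def lpips_score(image1, image2):
    # image1 / image2 come from transforms.ToTensor(), i.e. they lie in [0, 1];
    # normalize=True maps them to [-1, 1] before computing the distance.
    with torch.no_grad():
        return loss_fn_vgg(image1, image2, normalize=True).item()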

For human-specific fine-tuning, I can't run the training

I set up my dataset like the toy_dataset you provided. However, it fails to run; I encountered this problem:
Original Traceback (most recent call last):
  File "/root/.local/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/.local/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 49, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/data/DisCo/dataset/tiktok_controlnet_t2i_imagevar_combine_specifcimg_web_upsquare.py", line 569, in __getitem__
    raw_data = self.get_img_txt_pair(idx)
  File "/data/DisCo/dataset/tiktok_controlnet_t2i_imagevar_combine_specifcimg_web_upsquare.py", line 512, in get_img_txt_pair
    anno = list(open(anno_path))
FileNotFoundError: [Errno 2] No such file or directory: './719__242.png'


Incorrect FID-VID and FVD

Thanks for the great work, @Wangt-CN.

I tried to reproduce the results using "gen_eval.sh," but I noticed that the FID-VID and FVD do not match the results reported in the paper. Can you help me with this issue? Is it possible that I am using the incorrect checkpoints?

(Screenshot of my evaluation results attached.)

download checkpoints:
pth : TikTok Training Data (FID-FVD: 18.8)

FID-VID:resnet-50-kinetics.pth : "https://github.com/yjh0410/YOWOF/releases/download/yowof-weight/resnet-50-kinetics.pth"

FVD: i3d_pretrained_400.pt : "https://drive.google.com/file/d/1mQK8KD8G6UWRa5t87SRMm5PVXtlpneJT/edit"

Human specific finetune

Hi, Thanks a lot for this great work!

I am trying to run the fine-tuning code with the provided instructions, but the code references a lot of data that I don't have (TikTok data, etc.). Is all this data needed for the fine-tuning? Could you perhaps clarify the structure of the fine-tuning data and how to reference it?

Thanks!

How to try demo with my own dataset?

Hi, I'd like to try the demo with another dataset. I followed PREPRO.md and successfully ran the Grounded-SAM and OpenPose scripts.

The problem is that OpenPose does not produce images like those in demo_data/pose_img/*.png; OpenPose outputs the keypoints in JSON format and draws the skeleton directly on the original RGB images. Is there a script to generate images in that style? Also, the pose image is resized to 256x256; if my dataset images are larger, should I crop the pose area and resize it to 256 first?

e.g., the OpenPose results I got are attached (00001.jpg.json.txt and 00001.jpg).

Hoping for your response, thanks!

There is something wrong with ./annotator/grounded-sam/run.py

I modified the args and ran
python ./annotator/grounded-sam/run.py --dataset_root ./single/ --partition 1.
Under the groundsam_vis folder, I got 001.png.mask.jpg and 001.png.mask.png. The original image size is 540x960; 001.png.mask.png is also 540x960, but it is all black. 001.png.mask.jpg's foreground is yellow and its background is purple,
but its size is 1299 x 2310.

Some issues about deepspeed

Hi, thank you very much for your amazing work. I have successfully run the Gradio Demo using the model you provided.

However, I encountered the following error during the fine-tuning training phase using fp16 deepspeed:
[INFO] [stage_1_and_2.py:1651:step] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0. How should I configure deepspeed.py to solve this problem?

Pre-training dataset

Thank you very much for such an outstanding work, will the pre-training dataset be open sourced?

How can inference be successfully run on multiple GPUs?

Whichever port I use for multi-GPU inference, I always get an 'address already in use' error.

For example, I set "export MASTER_PORT=65530" before inference on multiple GPUs and then I will get an error as follows:

RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:65530 (errno: 98 - Address already in use). The server socket has failed to bind to?UNKNOWN? (errno: 98 - Address already in use).

How should I set the learning rate on 8*v100 32g gpus?

These are the parameters I used:
AZFUSE_USE_FUSE=0 QD_USE_LINEIDX_8B=0 NCCL_ASYNC_ERROR_HANDLING=0 mpirun -np 8 --allow-run-as-root python finetune_sdm_yaml.py --cf config/ref_attn_clip_combine_controlnet_attr_pretraining/coco_S256_xformers_tsv_strongrand.py --do_train --root_dir /data/DisCo --local_train_batch_size 64 --local_eval_batch_size 64 --log_dir exp/tiktok_pretrain --epochs 40 --deepspeed --eval_step 2000 --save_step 2000 --gradient_accumulate_steps 1 --learning_rate 1e-3 --fix_dist_seed --loss_target "noise" --train_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/train_TiktokDance-coco-single_person-Lindsey_0411_youtube-SHHQ-1.0-deepfashion2-laion_human-masks-single_cap.yaml --val_yaml ./TSV_dataset/Human_Attribute_Pretrain/composite/val_TiktokDance-coco-single_person-SHHQ-1.0-masks-single_cap.yaml --unet_unfreeze_type "transblocks" --refer_sdvae --ref_null_caption False --combine_clip_local --combine_use_mask --conds "masks" --max_eval_samples 2000 --strong_aug_stage1 --node_split_sampler 0
And this is the loss I got after Human Attribute Pre-training:
Metering:{'loss_total': '0.0667'}: 100%|██████████| 55280/55280 [81:48:04<00:00, 5.33s/it]

I noticed that in your paper you mentioned that all pre-training experiments are conducted on 4x8 NVIDIA V100 GPUs for 25K steps with image size 256×256 and learning rate 1e−3.
Because I only have one-fourth of the number of GPUs you used, should I reduce the learning rate to one-fourth of 1e-3?
What was your loss after training at this stage? Thank you!

About the extra Tiktok-style data

Hi @Wangt-CN, thanks for the great work!
I have noticed that you've collected an additional 250 TikTok-style short videos from the internet. Will you consider uploading them? This would enable us to make a comparison with the released model trained on them.

Difficulty finding the inference function

Hi, @Wangt-CN, first off, great work!!

I want to run inference through code, not gradio. I tried searching for the function to do that, the closest I found is the Agent_LDM. But this takes a reference fg, bg and skeleton. Is there a function which just takes in an image (with a character in it) and a skeleton, and returns the output?

Additionally, any function for end to end video gen?

Thanks

Just another: Will the code be compatible with PyTorch 2?

unexpected_keys when loading data from mp_rank_00_model_states.pt

Hi, thanks for your great work.
I only changed '--root_dir', '--pretrained_model', and '--pretrained_model_path' according to my local settings. But when I run this cell in human_img_edit_gradio.ipynb:
## prepare the eval
logger.warning("Do eval_visu...")
if getattr(args, 'refer_clip_preprocess', None):
    eval_dataset = BaseDataset(args, args.val_yaml, split='val', preprocesser=model.feature_extractor)
else:
    eval_dataset = BaseDataset(args, args.val_yaml, split='val')
eval_dataloader, eval_info = make_data_loader(
    args, args.local_eval_batch_size,
    eval_dataset)

trainer = Agent_LDM(args=args, model=model)
trainer.eval_demo_pre()
it seems that many ControlNet weights fail to load from mp_rank_00_model_states.pt.
And after the Gradio demo launches, the background and pose controls do not work.
Please help me out, thanks.

Human Specific Finetuning

Hi,

Could you please provide some more information about the human specific finetuning model?

I tried running it and have generated checkpoint files however their dictionary keys are wildly different to the provided checkpoint, 'mp_rank_00_model_states.pt':

My checkpoint: dict_keys(['models', 'optimizer', 'epoch', 'global_step', 'scheduler'])

Checkpoint provided: dict_keys(['module', 'buffer_names', 'optimizer', 'param_shapes', 'lr_scheduler', 'sparse_tensor_module_names', 'skipped_steps', 'global_steps', 'global_samples', 'dp_world_size', 'mp_world_size', 'ds_config', 'ds_version'])

Therefore, when I try to generate new images using my checkpoint it fails at the load_checkpoint_for_deepspeed_diff_gpu function with this message:

Traceback (most recent call last):
  File "/home/emily/DisCo/VideoGenerationModel/run.py", line 645, in <module>
    trainer.eval_demo_pre()
  File "/home/emily/DisCo/agent.py", line 422, in eval_demo_pre
    self.prepare_dist_model()
  File "/home/emily/DisCo/agent.py", line 199, in prepare_dist_model
    self.load_checkpoint_for_deepspeed_diff_gpu(self.pretrained_model)  # load pt model with default pytorch
  File "/home/emily/DisCo/agent.py", line 813, in load_checkpoint_for_deepspeed_diff_gpu
    adaptively_load_state_dict(self.model, checkpoint['module'])
KeyError: 'module'

I'm not really sure what to do about this issue as it seems the new checkpoints are supposed to be made this way, please advise. Many thanks :)

Incorrect parameter name in config scripts

When running the Gradio Demo, this error kept generating when it was loading the pre-trained unet: TypeError: get_down_block() got an unexpected keyword argument 'attn_num_head_channels'

Looking at how the other parameters were named, I tried changing it to 'attention_head_dim', however that then created this error: TypeError: unsupported operand type(s) for //: 'int' and 'NoneType'

Once I expanded the error and viewed it in full, I noticed num_attention_heads was mentioned yet this was not present in any of the scripts. Therefore, I tried changing the parameter name to this and the code ran successfully.

Hence, all instances of attn_num_head_channels in the following scripts need to be changed to num_attention_heads:

  • controlnet_main.py
  • controlnet.py
  • unet_2d_condition.py

About the reference image

What are your criteria for choosing the reference image in 1) pre-training, 2) general fine-tuning, and 3) human-specific fine-tuning, respectively? Are they all the first images of a dataset?

why this effect?

Hi, amazing work!
I tried it today, but the result is quite bad: the face is severely deformed and the clothes are also changed (screenshot attached).

Can you tell me how to solve this problem?

Problem when training with 4090

Hi, thanks for the great work. I'm trying to run the training code, but when I pre-train with multiple 4090 GPUs it always gets stuck with no output. Training with multiple 3090s or a single 4090 works fine. I strongly suspect a deadlock occurs on the 4090s. I narrowed the problem down to deepspeed.initialize in agent.py, but I don't know how to solve it.

Any response will be greatly appreciated.
