animateanyone-unofficial's Introduction

Unofficial Implementation of Animate Anyone

If you find this repository helpful, please consider giving us a star⭐!

We train only on small-scale datasets (such as TikTok and UBC), so it is difficult to match the official results given the limited data scale and quality. Due to time and cost constraints, we do not plan to collect and filter large amounts of high-quality data. If someone has a robust model trained on a large amount of high-quality data and is willing to share it, please open a pull request.

Overview

This repository contains a simple, unofficial implementation of Animate Anyone, built upon magic-animate and AnimateDiff. The implementation was first developed by Qin Guo, with later contributions from Zhenzhi Wang.

Training Guidance

Although we cannot use large-scale data to train the model, we can provide several training suggestions:

  1. In our experiments, the PoseGuider from the original Animate Anyone paper struggles to control pose no matter which activation function we use (e.g., ReLU, SiLU). Enlarging its output to 320 channels and adding it right after conv_in (see model.hack_poseguider; a sketch follows this list) is very effective, and compared with ControlNet this solution is much more lightweight (<1M vs. ~400M parameters). We still think ControlNet is a good choice, though: our PoseGuider depends on a UNet that is fine-tuned at the same time and cannot be used out of the box, whereas ControlNet is plug and play.
  2. On small-scale datasets (fewer than 2,000 videos), stage 1 works very well (including generalization), but stage 2 is data-hungry: with too little data, artifacts and flicker appear easily. Because we retrain the UNet in the first stage, the original AnimateDiff checkpoint no longer applies, so a large amount of high-quality data is needed to retrain the AnimateDiff motion module in this stage.
  3. Freezing the UNet is not a good choice, as it loses the texture information of the reference image.
  4. This is a data-hungry task. We believe that scaling up data quantity and quality is usually more valuable than tweaking small parts of the model architecture. Data quantity and quality are very important!
  5. High-resolution training is very important, as it affects how well details are learned and reconstructed. The training resolution should not be greater than the inference resolution.
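The pose-guider variant mentioned in point 1 can be sketched roughly as follows. This is a minimal illustration, not the repository's exact model.hack_poseguider code: a small convolutional encoder downsamples the pose image to the latent resolution and outputs 320 channels, which are added to the denoising UNet's feature right after conv_in. The layer sizes and the zero-initialized final projection are assumptions.

import torch
import torch.nn as nn

class HackPoseGuider(nn.Module):
    """Minimal sketch: pose image -> 320-channel feature at latent resolution."""
    def __init__(self, in_channels: int = 3, out_channels: int = 320):
        super().__init__()
        # Three stride-2 convs downsample a 512x512 pose image to the 64x64 latent grid.
        self.conv_layers = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv2d(128, out_channels, kernel_size=3, stride=2, padding=1),
        )
        # Zero-initialized projection so training starts from the unmodified UNet.
        self.proj = nn.Conv2d(out_channels, out_channels, kernel_size=1)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, pose_image: torch.Tensor) -> torch.Tensor:
        return self.proj(self.conv_layers(pose_image))

# Usage (illustrative): add the pose feature right after the UNet's conv_in, e.g.
#   hidden = unet.conv_in(noisy_latents) + pose_guider(pose_image)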

Sample of Result on UBC-fashion dataset

Stage 1

The face still shows some artifacts in the current version. This model is trained on the UBC dataset rather than a large-scale dataset.

Stage 2

Stage 2 training is challenging due to artifacts in the background. We show one of our best results here and are still working on it. An important point is to keep the training and inference resolutions consistent.

ToDo

  • Release Training Code.
  • Release Inference Code.
  • Release Unofficial Pre-trained Weights.
  • Release Gradio Demo.

Requirements

bash fast_env.sh

🎬Gradio Demo

python3 -m demo.gradio_animate

For a 13-second pose video, processing at 256 resolution requires 11 GB of VRAM; at 512 resolution, it requires 23.5 GB.

Training

Original AnimateAnyone Architecture (It is difficult to control pose when training on a small dataset.)

First Stage

torchrun --nnodes=8 --nproc_per_node=8 train.py --config configs/training/train_stage_1.yaml

Second Stage

torchrun --nnodes=8 --nproc_per_node=8 train.py --config configs/training/train_stage_2.yaml

Our Method (a denser pose-control scheme whose parameter count remains small; highly recommended)

First Stage

torchrun --nnodes=8 --nproc_per_node=8 train_hack.py --config configs/training/train_stage_1.yaml

Second Stage

torchrun --nnodes=8 --nproc_per_node=8 train_hack.py --config configs/training/train_stage_2.yaml

Acknowledgements

Special thanks to the original authors of the Animate Anyone project and the contributors to the magic-animate and AnimateDiff repositories for their open research and foundational work, which inspired this unofficial implementation.

Email

For academic or business cooperation only: [email protected]

animateanyone-unofficial's People

Contributors

guoqincode, eltociear, zhenzhiwang, dongxuyue


animateanyone-unofficial's Issues

No trainable param for unet in stage 1

unet = DDP(unet, device_ids=[local_rank], output_device=local_rank)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 551, in init
self._log_and_throw(
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 686, in _log_and_throw
raise err_type(err_msg)
RuntimeError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
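A common cause of this error is wrapping a fully frozen module in DDP. A generic guard (not the repository's code, just a hedged workaround) is to wrap only modules that actually have trainable parameters:

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def maybe_ddp(module: nn.Module, local_rank: int) -> nn.Module:
    # Wrap in DDP only if at least one parameter requires a gradient;
    # fully frozen modules can be used directly without DDP.
    if any(p.requires_grad for p in module.parameters()):
        return DDP(module, device_ids=[local_rank], output_device=local_rank)
    return module

# unet = maybe_ddp(unet, local_rank)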

where is spatial attention?

Great work! I have a question about the attention modules (spatial attention, cross-attention, and temporal attention): is the spatial attention that relates the ReferenceNet latent feature to the denoising-UNet latent feature missing? (Quoting the paper: "we replace the self-attention layer with spatial-attention layer. Given a feature map x1 ∈ R^{t×h×w×c} from denoising UNet and x2 ∈ R^{h×w×c} from ReferenceNet, we first copy x2 by t times and concatenate it with x1 along w dimension.")
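For reference, the mechanism quoted above can be written as a rough sketch. The tensor layout and the plain self-attention call are illustrative assumptions, not the repository's implementation:

import torch
import torch.nn.functional as F

def reference_spatial_attention(x1: torch.Tensor, x2: torch.Tensor) -> torch.Tensor:
    # x1: (t, h, w, c) feature from the denoising UNet
    # x2: (h, w, c) feature from ReferenceNet
    t, h, w, c = x1.shape
    x2 = x2.unsqueeze(0).expand(t, -1, -1, -1)               # copy x2 t times -> (t, h, w, c)
    x = torch.cat([x1, x2], dim=2)                           # concatenate along width -> (t, h, 2w, c)
    tokens = x.reshape(t, h * 2 * w, c)                      # flatten spatial positions into tokens
    out = F.scaled_dot_product_attention(tokens, tokens, tokens)  # self-attention over both halves
    return out.reshape(t, h, 2 * w, c)[:, :, :w, :]          # keep only the denoising-UNet half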

SparseCausalAttention2D

While reading the code I saw that the standard BasicTransformerBlock from diffusers has been replaced with a modified version that utilizes a new class called SparseCausalAttention2D for the attn1 layer. Could you specify where this class is defined? Or maybe, were you able to successfully train the model without using this class (replacing it with a different one)?

Running Inference step 2

I have training results for both stages 1 and 2. Stage 1 inference works but produces a one-second video of the same repeated frame; stage 2 inference is not working. I ran python -m pipelines.animation_stage_2 --config configs/prompts/animation_stage_2.yaml with the config values set correctly. It first threw an import error, which I fixed; now I get this error:

  from diffusers.pipeline_utils import DiffusionPipeline
loaded temporal unet's pretrained weights from outputs/train_stage_2-2023-12-22T08-59-53
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/AnimateAnyone-unofficial/pipelines/animation_stage_2.py", line 244, in <module>
    run(args)
  File "/workspace/AnimateAnyone-unofficial/pipelines/animation_stage_2.py", line 233, in run
    main(args)
  File "/workspace/AnimateAnyone-unofficial/pipelines/animation_stage_2.py", line 70, in main
    unet = UNet3DConditionModel.from_pretrained_2d(config.pretrained_motion_unet_path, subfolder=None, unet_additional_kwargs=OmegaConf.to_container(inference_config.unet_additional_kwargs), specific_model=config.specific_motion_unet_model)
  File "/workspace/AnimateAnyone-unofficial/models/unet.py", line 457, in from_pretrained_2d
    raise RuntimeError(f"{config_file} does not exist")
RuntimeError: outputs/train_stage_2-2023-12-22T08-59-53/config.json does not exist

Which clip encoder is this?

Magicanimate doesn't seem to have it in their pretrained directory. Is it the same as "laion/CLIP-ViT-B-32-laion2B-s34B-b79K" ?

Have you noticed any issue during training related to the denoising timesteps?

Per the title I've been a little perplexed to see that what was denoised well at 30 inference timesteps @ 60k training steps, requires 70 steps @ 100k training steps.

My implementation is slightly different than yours so there could be quite a few things going on. Just curious if you noticed any similar behaviors since you're in the middle of training these days.

Thank you

what is the poseguider_checkpoint_path value?

Hello, first of all thanks for your work.

I have some questions about the second stage of training. In the train_stage_2.yaml file there are:
poseguider_checkpoint_path: ""
referencenet_checkpoint_path: ""
What should these two values be? Should referencenet_checkpoint_path point to the model trained in the first stage, or something else? I hope to get your reply.

about the result of the first stage

my config:
train_data:
csv_path: ../TikTok_info.csv
video_folder: ../TikTok_dataset/TikTok_dataset
sample_size: 512
sample_stride: 4
sample_n_frames: 16
clip_model_path: openai/clip-vit-base-patch32

gradient_accumulation_steps: 128
batch_size: 1
use 1 V100, optimizer = torch.optim.SGD(trainable_params, lr=learning_rate / gradient_accumulation_steps, momentum=0.9)
result: show the result of 20000 steps
image

Could it be that my 20,000 steps here are actually only equivalent to a bit more than 300 steps at a batch size of 64, or is there another reason?
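For what it's worth, the arithmetic above holds under the assumption that each of the 20,000 logged steps processes a single sample (the training script's counter may count optimizer updates instead, in which case this does not apply):

# 20,000 steps at batch size 1, compared against an effective batch size of 64
micro_steps = 20_000
samples_seen = micro_steps * 1
equivalent_steps_at_bs64 = samples_seen / 64
print(equivalent_steps_at_bs64)  # 312.5, i.e. "a bit more than 300 steps"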

No module named 'models.hack_poseguider'

I tried to run demo.gradio_animate, but the following error was reported. I could not find hack_poseguider under the models folder.

Traceback (most recent call last):
File "/home/work/diffuser-env/python/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/work/diffuser-env/python/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/work/AnimateAnyone-unofficial/demo/gradio_animate.py", line 8, in
from demo.animate import AnimateAnyone
File "/home/work/AnimateAnyone-unofficial/demo/animate.py", line 21, in
from models.hack_poseguider import Hack_PoseGuider as PoseGuider
ModuleNotFoundError: No module named 'models.hack_poseguider'

RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x1024 and 768x320)

I modified the paths in the configuration file to point to my local directories (UBC Fashion Video dataset) and started the training process. However, an error occurred during the process.

/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
### Train Info: train stage 1: image pretrain ###
Some weights of the model checkpoint were not used when initializing ReferenceNet: 
 ['conv_norm_out.bias, conv_norm_out.weight, conv_out.bias, conv_out.weight, up_blocks.3.attentions.2.proj_out.bias, up_blocks.3.attentions.2.proj_out.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm2.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm3.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm3.weight']
12/20/2023 01:40:44 - INFO - root - ***** Running training *****
12/20/2023 01:40:44 - INFO - root -   Num examples = 500
12/20/2023 01:40:44 - INFO - root -   Num Epochs = 480
12/20/2023 01:40:44 - INFO - root -   Instantaneous batch size per device = 4
12/20/2023 01:40:44 - INFO - root -   Total train batch size (w. parallel, distributed & accumulation) = 4
12/20/2023 01:40:44 - INFO - root -   Gradient Accumulation steps = 1
12/20/2023 01:40:44 - INFO - root -   Total optimization steps = 60000

  0%|          | 0/60000 [00:00<?, ?it/s]
Steps:   0%|          | 0/60000 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 629, in <module>
    main(name=name, launcher=args.launcher, use_wandb=args.wandb, **config)
  File "train.py", line 492, in main
    referencenet(latents_ref_img, ref_timesteps, encoder_hidden_states)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/AnimateAnyone-unofficial/models/ReferenceNet.py", line 1005, in forward
    sample, res_samples = downsample_block(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/unet_2d_blocks.py", line 1086, in forward
    hidden_states = attn(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/transformer_2d.py", line 315, in forward
    hidden_states = block(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/AnimateAnyone-unofficial/models/ReferenceNet_attention.py", line 199, in hacked_basic_transformer_inner_forward
    attn_output = self.attn2(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 417, in forward
    return self.processor(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/attention_processor.py", line 1023, in __call__
    key = attn.to_k(encoder_hidden_states, scale=scale)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/diffusers/models/lora.py", line 224, in forward
    out = super().forward(hidden_states)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x1024 and 768x320)

Steps:   0%|          | 0/60000 [00:05<?, ?it/s]
[2023-12-20 01:40:55,416] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 412807) of binary: /home/user/miniconda3/envs/animateanyone-unofficial/bin/python
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/animateanyone-unofficial/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/user/miniconda3/envs/animateanyone-unofficial/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-12-20_01:40:55
  host      : gpuserver
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 412807)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

About time embedding in ReferenceNet

In the official paper, the authors say

While ReferenceNet introduces a comparable number of parameters to the denoising UNet, in diffusion-based video generation, all video frames undergo denoising multiple times, whereas ReferenceNet only needs to extract features once throughout the entire process

But in your inference implementation, the forward pass of ReferenceNet is performed multiple times.

Would you consider fixing the timestep of the ReferenceNet?
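One way to realize this suggestion, sketched under the assumption that ReferenceNet writes its attention features into a bank that the denoising UNet later reads (names are illustrative, not the repository's actual API), is to run ReferenceNet once with a fixed timestep before the denoising loop and reuse the cached features:

import torch

@torch.no_grad()
def precompute_reference_features(referencenet, latents_ref, encoder_hidden_states):
    # A single forward pass with a fixed timestep (e.g. 0); the attention features
    # it produces can then be reused for every denoising step instead of
    # re-running ReferenceNet at each step.
    fixed_t = torch.zeros(latents_ref.shape[0], dtype=torch.long, device=latents_ref.device)
    referencenet(latents_ref, fixed_t, encoder_hidden_states)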

stage2 training error

Thank you for your work.

During the second stage of training, I keep getting out-of-memory errors. I have 80 GB of GPU memory, and the same error appears whether I use a single card or multiple cards, even with --train_batch_size set to 1. What went wrong?

error message:
Traceback (most recent call last):
File "/home/work/animate-anyone/train_2nd_stage.py", line 919, in
main(args)
File "/home/work/animate-anyone/train_2nd_stage.py", line 823, in main
model_pred = unet(
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 632, in forward
return model_forward(*args, **kwargs)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/accelerate/utils/operations.py", line 620, in call
return convert_to_fp32(self.model_forward(*args, **kwargs))
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
return func(*args, **kwargs)
File "/home/work/animate-anyone/animate_anyone/models/unet_3d_condition.py", line 1011, in forward
sample = upsample_block(
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/animate-anyone/animate_anyone/models/unet_3d_blocks.py", line 901, in forward
hidden_states = resnet(hidden_states, temb, scale=lora_scale)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/animate-anyone/animate_anyone/models/resnet.py", line 340, in forward
hidden_states = self.norm1(hidden_states)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/modules/normalization.py", line 273, in forward
return F.group_norm(
File "/home/work/AnimateAnyone-unofficial/animateanyone_env/lib/python3.10/site-packages/torch/nn/functional.py", line 2530, in group_norm
return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 810.00 MiB (GPU 0; 79.35 GiB total capacity; 76.87 GiB already allocated; 64.19 MiB free; 77.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
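A few generic mitigations worth trying for this kind of OOM (illustrative suggestions, not verified against this repository's stage 2 script): enable gradient checkpointing on the UNet, use mixed precision, and set the allocator hint that the error message itself suggests.

import os

# Reduce allocator fragmentation (must be set before CUDA is initialized).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

# In the training script, diffusers-style models expose gradient checkpointing:
#   unet.enable_gradient_checkpointing()
# Mixed precision (fp16/bf16) via accelerate or torch.cuda.amp also reduces memory.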

model saving

Hi, it seems that you train the 2D UNet, ReferenceNet, and PoseGuider during the first stage,
but you don't save the parameters of the 2D UNet.

About multi-GPU training

Thank you for your contributions! There are two questions below:

  1. I have observed that the training duration with two RTX 6000 Ada GPUs exceeds the time it takes with a single GPU. Is this expected?
  2. I encountered gradient explosion during the training process (a generic mitigation is sketched below).
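On point 2, a standard mitigation (not specific to this repository) is to clip the global gradient norm before each optimizer step. The stand-in model and learning rate below are purely illustrative:

import torch
import torch.nn as nn

model = nn.Linear(320, 320)  # stand-in for the trainable modules
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

loss = model(torch.randn(4, 320)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the global gradient norm
optimizer.step()
optimizer.zero_grad()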

Results

Hi, @guoqincode, thanks for your effort in reimplementing this! Could you show some video results as demonstration?

about training memory optimization

In the README, you mentioned that you would optimize the training code using DeepSpeed and Accelerate. However, as far as I know, the DeepSpeed functionality integrated into the Accelerate library does not support multi-model training. Do you have any suggestions?

About masks?

In the TikTok dataset there is a masks file. Maybe the foreground could be trained separately; have you taken this into account?

about training memory optimization

In the README, you mentioned that you would optimize the training code using DeepSpeed and Accelerate. However, as far as I know, the DeepSpeed functionality integrated into the Accelerate library does not support multi-model training. Do you have any suggestions on using DeepSpeed to optimize memory?

about loss

image
Why is my loss curve so strange?
Today I tried the new code, and my loss became NaN:
image

about training optimization

I tried the 8-bit Adam optimizer and can train stage 1 on a 40 GB A100. I think it helps reduce VRAM usage, but I don't know whether it will degrade model performance. What do you think? Did you try 8-bit Adam?
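For reference, a minimal way to swap in 8-bit Adam with bitsandbytes (assuming bitsandbytes is installed; the stand-in model and hyperparameters are illustrative, not this repository's settings):

import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(320, 320)  # stand-in for the UNet / ReferenceNet / PoseGuider parameters
# AdamW8bit stores optimizer state in 8 bits, cutting optimizer VRAM roughly 4x.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-5, weight_decay=1e-2)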

one of the variables needed for gradient computation has been modified by an inplace operation [torch.cuda.FloatTensor [128]] is at version 3; expected version 2 instead

File "train_th.py", line 637, in
main(name=name, launcher=args.launcher, use_wandb=args.wandb, **config)
File "train_th.py", line 460, in main
latents_pose = poseguider(mask_image)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1040, in forward
output = self._run_ddp_forward(*inputs, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1000, in _run_ddp_forward
return module_to_run(*inputs[0], **kwargs[0])
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/cto_labs/hongfating/workspace/src/AnimateAnyone-unofficial/models/PoseGuider.py", line 78, in forward
x = self.conv_layers(x)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
return F.batch_norm(
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/nn/functional.py", line 2450, in batch_norm
return torch.batch_norm(
File "/home/hongfating/miniconda3/envs/animate/lib/python3.8/site-packages/torch/fx/traceback.py", line 57, in format_stack
return traceback.format_stack()

It seems to come from here: File "/cto_labs/hongfating/workspace/src/AnimateAnyone-unofficial/models/PoseGuider.py", line 78, in forward
x = self.conv_layers(x)

I have no idea what causes this.
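A generic first debugging step for this class of error (illustrative, not a fix) is to enable autograd anomaly detection, so the backward pass reports which forward operation produced the tensor that was later modified in place:

import torch

# Slows training noticeably, so only enable it while debugging.
torch.autograd.set_detect_anomaly(True)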

Tensor size mismatch in using clip-vit-large-patch14

Hi,

Thanks for sharing your implementation; it really helps the community reproduce Animate Anyone. When I try to train the network with your code, I find that in referencenet_attention the hidden-state size of the Stable Diffusion UNet is 768, while the CLIP image feature extracted with clip-vit-large-patch14 is 1024, which causes a size mismatch in the forward pass (the hidden size of clip-vit-base-patch32, by contrast, is 768). Since your config YAML originally used clip-vit-base-patch32 and was recently changed to clip-vit-large-patch14, and you mentioned in another issue that you use clip-vit-large-patch14, could you elaborate on how your code works with clip-vit-large-patch14? I get errors when I run your training code directly with it.

Looking forward to your reply! Thanks again for your effort.
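The mismatch can be verified directly from the model configs (requires transformers and network access to the Hugging Face hub; the model names are taken from the discussion above):

from transformers import CLIPVisionConfig

for name in ["openai/clip-vit-base-patch32", "openai/clip-vit-large-patch14"]:
    cfg = CLIPVisionConfig.from_pretrained(name)
    # hidden_size is the width of the per-patch hidden states fed to cross-attention.
    print(name, cfg.hidden_size)  # 768 for ViT-B/32, 1024 for ViT-L/14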

Any results?

I saw you added an inference cmd to the readme.
Do you have any preliminary results?

Loss not decreasing in stage 1

After training stage 1 for 30,000 steps on the TikTok dataset, I'm getting the following loss curve and validation_pipeline images. Is this correct?

image

referencenet initializing warning ?

Some weights of the model checkpoint were not used when initializing ReferenceNet:
['conv_norm_out.bias, conv_norm_out.weight, conv_out.bias, conv_out.weight, up_blocks.3.attentions.2.proj_out.bias, up_blocks.3.attentions.2.proj_out.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn1.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_k.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.bias, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_q.weight, up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_v.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.0.proj.weight, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.bias, up_blocks.3.attentions.2.transformer_blocks.0.ff.net.2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm2.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm2.weight, up_blocks.3.attentions.2.transformer_blocks.0.norm3.bias, up_blocks.3.attentions.2.transformer_blocks.0.norm3.weight']

Is this correct? The training loss is not decreasing. Result:
grid

The pose condition seems to have no effect.

batch size in training stage1.

I am training the first stage with 8×A800 80 GB GPUs. However, the maximum batch size on each GPU can only be set to 1. Is that normal?

About "beta_schedule"

I noticed that you changed beta_schedule from linear to scaled_linear. Is it because the training results are better when using the latter?
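For context, the two options differ only in how the betas are spaced; "scaled_linear" spaces them linearly in square-root space and is the schedule Stable Diffusion checkpoints were trained with. A minimal diffusers example (the beta range shown is the common SD setting, used here as an assumption about this repository's config):

from diffusers import DDPMScheduler

scheduler = DDPMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",  # betas = linspace(sqrt(start), sqrt(end)) ** 2
)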
