
followyourpose's Introduction

🕺🕺🕺 Follow-Your-Pose 💃💃💃
Pose-Guided Text-to-Video Generation using Pose-Free Videos (AAAI 2024)

Yue Ma*, Yingqing He*, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan, Xiu Li, and Qifeng Chen

Open In Colab Hugging Face Spaces Open in OpenXLab visitors GitHub

"The man is sitting on chair, on the park" "The Iron man, on the street "
"The stormtrooper, in the gym " "The astronaut, earth background, Cartoon Style "

💃💃💃 Demo Video

demo3.mp4

💃💃💃 Abstract

TL;DR: We tune a text-to-image model (e.g., Stable Diffusion) to generate character videos from a pose sequence and a text description.

Full abstract

Generating text-editable and pose-controllable character videos is in high demand for creating various digital humans. Nevertheless, this task has been restricted by the absence of a comprehensive dataset with paired video-pose captions and of generative prior models for videos. In this work, we design a novel two-stage training scheme that uses easily obtained datasets (i.e., image-pose pairs and pose-free videos) and a pre-trained text-to-image (T2I) model to obtain pose-controllable character videos. Specifically, in the first stage, only keypoint-image pairs are used for controllable text-to-image generation: we learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we fine-tune the motion of the above network on a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeping the editing and concept-composition abilities of the pre-trained T2I model. The code and models will be made publicly available.
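
To make the first-stage design concrete, below is a minimal PyTorch sketch of a zero-initialized convolutional pose encoder; the layer widths, depth, and names are illustrative assumptions, not the repository's exact architecture.

import torch
import torch.nn as nn

class PoseEncoderSketch(nn.Module):
    """Illustrative pose encoder: a small conv stack whose output projection is
    zero-initialized, so at step 0 the pre-trained T2I model is left untouched."""

    def __init__(self, in_channels: int = 3, hidden: int = 64, out_channels: int = 320):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
        )
        self.proj = nn.Conv2d(hidden, out_channels, kernel_size=1)
        nn.init.zeros_(self.proj.weight)   # zero init: the pose branch contributes nothing at first
        nn.init.zeros_(self.proj.bias)

    def forward(self, pose_map: torch.Tensor) -> torch.Tensor:
        # pose_map: (B, 3, H, W) rendered skeleton image; the output is added to UNet features
        return self.proj(self.backbone(pose_map))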

🕺🕺🕺 Changelog

  • [2024.03.15] 🔥🔥🔥 We release our second follower, Follow-Your-Click, the first framework to achieve regional image animation. Try it now, and please give us a star! ⭐️⭐️⭐️ 😄
  • [2023.12.09] 🔥 The paper is accepted by AAAI 2024!
  • [2023.08.30] 🔥 Release some new results!
  • [2023.07.06] 🔥 Release a new version of the demo on OpenXLab (浦源内容平台)! Thanks to Shanghai AI Lab for their support!
  • [2023.04.12] 🔥 Release the local Gradio demo; you can run it locally with just an A100 or a 3090.
  • [2023.04.11] 🔥 Release some example cases in the Hugging Face demo.
  • [2023.04.10] 🔥 Release a new version of the Hugging Face Spaces demo, which supports both raw video and skeleton video as input. Enjoy it!
  • [2023.04.07] Release the first version of the Hugging Face demo. Enjoy the fun of following your pose! You need to download a skeleton video or make your own skeleton video with mmpose. A second version that accepts raw video as input is coming.
  • [2023.04.07] Release a Colab notebook and update the installation requirements!
  • [2023.04.06] Release code, configs, and checkpoints!
  • [2023.04.03] Release the paper and project page!

💃💃💃 HuggingFace Demo

🎤🎤🎤 Todo

  • Release the code, config and checkpoints for teaser
  • Colab
  • Hugging Face Gradio demo
  • Release more applications

๐Ÿป๐Ÿป๐Ÿป Setup Environment

Our method is trained with CUDA 11, accelerate, and xformers on 8 A100 GPUs.

conda create -n fupose python=3.8
conda activate fupose

pip install -r requirements.txt

xformers is recommended on A100 GPUs to save memory and running time.

xformers installation

We find its installation can be unstable. You may try the following wheel:

wget https://github.com/ShivamShrirao/xformers-wheels/releases/download/4c06c79/xformers-0.0.15.dev0+4c06c79.d20221201-cp38-cp38-linux_x86_64.whl
pip install xformers-0.0.15.dev0+4c06c79.d20221201-cp38-cp38-linux_x86_64.whl

Our environment is similar to Tune-A-Video (official, unofficial). You may check those repos for more details.
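
As a quick post-install sanity check, the short Python snippet below (not part of the repo; adjust to your setup) verifies that CUDA is visible and xformers is importable:

# Quick environment sanity check.
import torch

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
try:
    import xformers
    print("xformers:", xformers.__version__)
except ImportError:
    print("xformers not installed; memory-efficient attention will be unavailable")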

💃💃💃 Training

We fix a bug in Tune-A-Video and fine-tune Stable Diffusion 1.4 on 8 A100 GPUs. To fine-tune the text-to-image diffusion model for text-to-video generation, run this command (a sketch of the cross-frame attention used in this stage follows the command):

TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch \
    --multi_gpu --num_processes=8 --gpu_ids '0,1,2,3,4,5,6,7' \
    train_followyourpose.py \
    --config="configs/pose_train.yaml" 

🕺🕺🕺 Inference

Once the training is done, run inference:

TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch \
    --gpu_ids '0' \
    txt2video.py \
    --config="configs/pose_sample.yaml" \
    --skeleton_path="./pose_example/vis_ikun_pose2.mov"

You can make the pose video with mmpose; we detect the skeleton with HRNet. Just run the mmpose video demo to obtain the pose video, and remember to replace the background with black (a hedged sketch of this step is shown below).
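
The following is a sketch of one way to render such a black-background skeleton video with the mmpose 0.x top-down API; the HRNet config/checkpoint paths, file names, and API details are assumptions, so check your installed mmpose version and the official video demo.

import cv2
import numpy as np
from mmpose.apis import init_pose_model, inference_top_down_pose_model, vis_pose_result

# Assumed HRNet config/checkpoint from the mmpose model zoo; replace with your local paths.
pose_model = init_pose_model(
    'configs/body/2d_kpt_sview_rgb_img/topdown_heatmap/coco/hrnet_w48_coco_256x192.py',
    'hrnet_w48_coco_256x192.pth',
    device='cuda:0')

cap = cv2.VideoCapture('dance.mp4')          # your raw input video (assumed name)
fps = cap.get(cv2.CAP_PROP_FPS) or 25
writer = None
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # With person_results=None, the whole frame is treated as a single person box.
    pose_results, _ = inference_top_down_pose_model(pose_model, frame, person_results=None)
    canvas = np.zeros_like(frame)            # draw the skeleton on a black background
    vis = vis_pose_result(pose_model, canvas, pose_results, kpt_score_thr=0.3)
    if writer is None:
        h, w = vis.shape[:2]
        writer = cv2.VideoWriter('skeleton.mp4', cv2.VideoWriter_fourcc(*'mp4v'), fps, (w, h))
    writer.write(vis)
cap.release()
writer.release()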

💃💃💃 Local Gradio Demo

You can run the Gradio demo locally; it only needs an A100 or a 3090.

python app.py

The demo then runs on a local URL: http://0.0.0.0:<port> (see the sketch below for choosing the port).
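
If the default port is busy or you want to expose the demo on a specific port, Gradio's launch() accepts an explicit host and port. A minimal sketch follows; the demo object here is a placeholder, so adapt it to the Blocks app built in app.py.

# Hedged sketch: launching a Gradio Blocks app on an explicit host/port.
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("Follow-Your-Pose local demo placeholder")

demo.launch(server_name="0.0.0.0", server_port=7860)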

🕺🕺🕺 Weights

[Stable Diffusion] Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input. The pre-trained Stable Diffusion models can be downloaded from Hugging Face (e.g., Stable Diffusion v1-4).

[FollowYourPose] We also provide our pretrained checkpoints on Hugging Face. You can download them and put them into the checkpoints folder to run inference with our models; a download sketch follows the directory layout below.

FollowYourPose
├── checkpoints
│   ├── followyourpose_checkpoint-1000
│   │   ├── ...
│   ├── stable-diffusion-v1-4
│   │   ├── ...
│   └── pose_encoder.pth
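
One hedged way to fetch the public Stable Diffusion weights into this layout is with huggingface_hub; the snippet below is a sketch (it assumes a recent huggingface_hub release with local_dir support), and the FollowYourPose checkpoints themselves should be downloaded from the Hugging Face page linked above.

# Download Stable Diffusion v1-4 into the expected checkpoints layout.
# Requires `pip install huggingface_hub` and, if the model is gated, a prior
# `huggingface-cli login` after accepting the license on Hugging Face.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="CompVis/stable-diffusion-v1-4",
    local_dir="checkpoints/stable-diffusion-v1-4",
)
# Place the FollowYourPose checkpoint folder and pose_encoder.pth (from the project's
# Hugging Face page) next to it, matching the tree above.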

💃💃💃 Results

We show our results regarding various pose sequences and text prompts.

Note: the mp4 and GIF files on this GitHub page are compressed. Please check our Project Page for the original mp4 video results.

"Trump, on the mountain " "man, on the mountain " "astronaut, on mountain"
"girl, simple background" "A Iron man, on the beach" "A Hulk, on the mountain"
"A policeman, on the street" "A girl, in the forest" "A Iron man, on the street"
"A Robot, in Sahara desert" "A Iron man, on the beach" "A panda, son the sea"
"A man in the park, Van Gogh style" "The fireman in the beach" "Batman, brown background"
"A Hulk, on the sea" "A superman, in the forest" "A Iron man, in the snow"
"A man in the forest, Minecraft." "A man in the sea, at sunset" "James Bond, grey simple background"
"A Panda on the sea." "A Stormtrooper on the sea" "A astronaut on the moon"
"A astronaut on the moon." "A Robot in Antarctica." "A Iron man on the beach."
"The Obama in the desert" "Astronaut on the beach." "Iron man on the snow"
"A Stormtrooper on the sea" "A Iron man on the beach." "A astronaut on the moon."
"Astronaut on the beach" "Superman on the forest" "Iron man on the beach"
"Astronaut on the beach" "Robot in Antarctica" "The Stormtroopers, on the beach"

🎼🎼🎼 Citation

If you find this project helpful, please feel free to leave a star ⭐️⭐️⭐️ and cite our paper:

@article{ma2023follow,
  title={Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos},
  author={Ma, Yue and He, Yingqing and Cun, Xiaodong and Wang, Xintao and Shan, Ying and Li, Xiu and Chen, Qifeng},
  journal={arXiv preprint arXiv:2304.01186},
  year={2023}
}

👯👯👯 Acknowledgements

This repository borrows heavily from Tune-A-Video and FateZero. Thanks to the authors for sharing their code and models.

🕺🕺🕺 Maintenance

This is the codebase for our research work. We are still working hard to update this repo, and more details are coming in the next few days. If you have any questions or ideas to discuss, feel free to contact Yue Ma, Yingqing He, or Xiaodong Cun.

โญ๏ธโญ๏ธโญ๏ธ Star History

Star History Chart


followyourpose's Issues

SD AUTOMATIC1111 UI extension?

The project looks really promising so far!
But sadly I'm not able to make it work in the Hugging Face demo. Are you planning on releasing an extension for AUTOMATIC1111's UI?

No such file or directory: 'Your dataset path/caption_rm2048_train.csv'

In the training command below, it seems the script cannot find a file named "caption_rm2048_train.csv", which hdvila.py tries to load. Would you please provide the file or guide me through getting it? Thank you.

TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch train_followyourpose.py --config="configs/pose_train.yaml"

A portion of the error log is below:

File "/home/cc/FollowYourPose/followyourpose/data/hdvila.py", line 109, in _load_metadata
with open(caption_path, 'r',encoding="utf-8") as csvfile: #41s
FileNotFoundError: [Errno 2] No such file or directory: 'Your dataset path/caption_rm2048_train.csv'

What parts of Cross-Frame Attention have been reformed in your project relative to Tune-A-Video?

As the authors mention in the abstract: "In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding the learnable temporal self-attention and reformed cross-frame self-attention blocks."

Can I understand the cross-frame attention mentioned in your paper to be the SparseCausalAttention class in your open-source code, which is the same as the SparseCausalAttention class written in Tune-A-Video? If so, how is the cross-frame attention reformed in your project, and which part of the code embodies it?

Hand + face Pose Guide to generate

Hi,
Is it possible to generate a single, consistent character from a pose for about 5 seconds?

I have a pose video (OpenPose + hands + face) and I was wondering if it is possible to generate an output video of about 5 seconds with a consistent character/avatar that dances, etc., following the controlled (pose) input.

I want to generate a human-like animation (no matter what, just a consistent character/avatar).
Sample Video

Thanks
Best regards

Cannot connect to the server

[W socket.cpp:697] [c10d] The client socket has failed to connect to [DESKTOP-SB3DEO9]:29500 (system error: 10049 - The requested address is not valid in its context.).

The web paper shows examples of poses, but they all seem to be about "dancing"?

Hello,
I wanted to know how many types of poses there are, please? And how much control do we have?

I actually tried to read the prompts, and the prompts were never used to describe the pose animations; am I wrong?
I did not install it, but I wanted to learn more before starting to use it. Can we actually have some realistic control over the poses, please? Thanks.

Dataset

Hi,
Great work! Are there any plans to release the LAION-Pose dataset?

Is Tesla T4 usable in Colab?

I noticed in quick_demo.ipynb that the GPU used is a Tesla T4 and the whole process seems to be fine. But when I run it myself, I get a CUDA out-of-memory error. Your homepage says that it needs an A100/3090, so I want to know whether a Tesla T4 is usable in Colab, and how to fix the CUDA out-of-memory error in Colab without changing the GPU. Thanks a lot!

dataset

Excuse me, can you explain how to get the dataset needed for training? I did not figure it out. Thanks!

error colab

/content/FollowYourPose
/content/FollowYourPose
The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_processes` was set to a value of `1`
	`--num_machines` was set to a value of `1`
	`--mixed_precision` was set to a value of `'no'`
	`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: module 'triton.language' has no attribute 'constexpr'
/usr/local/lib/python3.8/dist-packages/torchvision/transforms/_functional_video.py:6: UserWarning: The 'torchvision.transforms._functional_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms.functional' module instead.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/torchvision/transforms/_transforms_video.py:22: UserWarning: The 'torchvision.transforms._transforms_video' module is deprecated since 0.12 and will be removed in the future. Please use the 'torchvision.transforms' module instead.
  warnings.warn(
Traceback (most recent call last):
  File "txt2video.py", line 28, in <module>
    from followyourpose.pipelines.pipeline_followyourpose import FollowYourPosePipeline
  File "/content/FollowYourPose/followyourpose/pipelines/pipeline_followyourpose.py", line 43, in <module>
    class FollowYourPosePipeline(DiffusionPipeline):
  File "/content/FollowYourPose/followyourpose/pipelines/pipeline_followyourpose.py", line 333, in FollowYourPosePipeline
    **kwargs,
NameError: name 'kwargs' is not defined
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 552, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3.8', 'txt2video.py', '--config=configs/pose_sample.yaml', '--skeleton_path=./pose_example/vis_ikun_pose2.mov']' returned non-zero exit status 1.

pose video

Hi, can you elaborate on the production process of the pose video? Why do the videos I make with mmpose always come out with key points? Thank you.

Could not load the Space: fffiloni/mmpose-estimation

Fetching Space from: https://huggingface.co/spaces/fffiloni/mmpose-estimation
Traceback (most recent call last):
File "/root/miniconda3/envs/env39/lib/python3.9/site-packages/gradio/external.py", line 436, in from_spaces
config = json.loads(result.group(1)) # type: ignore
AttributeError: 'NoneType' object has no attribute 'group'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/root/autodl-tmp/FollowYourPose/app.py", line 28, in
pipe = merge_config_then_run()
File "/root/autodl-tmp/FollowYourPose/inference_followyourpose.py", line 25, in init
self.mmpose = gr.load(name="spaces/fffiloni/mmpose-estimation")
File "/root/miniconda3/envs/env39/lib/python3.9/site-packages/gradio/external.py", line 68, in load
return load_blocks_from_repo(
File "/root/miniconda3/envs/env39/lib/python3.9/site-packages/gradio/external.py", line 107, in load_blocks_from_repo
blocks: gradio.Blocks = factory_methods[src](name, api_key, alias, **kwargs)
File "/root/miniconda3/envs/env39/lib/python3.9/site-packages/gradio/external.py", line 438, in from_spaces
raise ValueError("Could not load the Space: {}".format(space_name))
ValueError: Could not load the Space: fffiloni/mmpose-estimation

This used to work for me, but today it suddenly stopped working. Has anyone had the same problem?

Training Code

Is the output config file from running the training command:

TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch \
    --multi_gpu --num_processes=8 --gpu_ids '0,1,2,3,4,5,6,7' \
    train_followyourpose.py \
    --config="configs/pose_train.yaml" 

supposed to be used in the Inference code?:

TORCH_DISTRIBUTED_DEBUG=DETAIL accelerate launch \
    --gpu_ids '0' \
    txt2video.py \
    --config="configs/pose_sample.yaml" \
    --skeleton_path="./pose_example/vis_ikun_pose2.mov"

The output config.yaml file doesn't seem to look like the pose_sample.yaml file that is supposed to be used in the inference command.

Encoder path hardcoded.

Hey, thank you for the repo! Really cool.

While trying to get the sample running, I realized that the encoder path is hardcoded in followyourpose/models/unet.py, line 215:

adapter_weight = torch.load('./checkpoints/pose_encoder.pth')

I think I can open a PR so that the path can be passed in through the OmegaConf config, but before I do that, I just want to ask whether this is some kind of workaround that I'm not aware of.

Cheers!

Code for stage 1 training

Hello,

Can you share the code for the first stage of training your model (pose-controllable text-to-image generation)?

Thank you in advance.

mmpose TypeError: wrapper() got an unexpected keyword argument 'fn_index'

Traceback (most recent call last):
File "/FollowYourPose/app.py", line 177, in
gr.Examples(examples=examples,
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/gradio/helpers.py", line 75, in create_examples
examples_obj.create()
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/gradio/helpers.py", line 301, in create
client_utils.synchronize_async(self.cache)
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/gradio_client/utils.py", line 808, in synchronize_async
return fsspec.asyn.sync(fsspec.asyn.get_loop(), func, *args, **kwargs) # type: ignore
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/fsspec/asyn.py", line 103, in sync
raise return_result
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/fsspec/asyn.py", line 56, in _runner
result[0] = await coro
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/gradio/helpers.py", line 362, in cache
prediction = await Context.root_block.process_api(
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/gradio/blocks.py", line 1561, in process_api
result = await self.call_function(
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/gradio/blocks.py", line 1179, in call_function
prediction = await anyio.to_thread.run_sync(
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/anyio/to_thread.py", line 56, in run_sync
return await get_async_backend().run_sync_in_worker_thread(
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 2134, in run_sync_in_worker_thread
return await future
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 851, in run
result = context.run(func, *args)
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/gradio/utils.py", line 695, in wrapper
response = f(*args, **kwargs)
File "/FollowYourPose/inference_followyourpose.py", line 50, in run
infer_skeleton(self.mmpose, data_path)
File "/FollowYourPose/inference_mmpose.py", line 97, in infer_skeleton
mmpose_frame = get_mmpose_filter(mmpose, i)
File "/FollowYourPose/inference_mmpose.py", line 63, in get_mmpose_filter
image = mmpose(i, fn_index=0)[1]
File "/root/miniconda3/envs/python-app/lib/python3.9/site-packages/gradio/events.py", line 74, in call
return self.fn(*args, **kwargs)
TypeError: wrapper() got an unexpected keyword argument 'fn_index'

how to enhance the generation quality?

Hello, I've been using your project for action video generation. I conducted experiments on a V100-32G GPU, inputting a skeleton video of around seven seconds and trying various prompts. However, the generated results didn't quite match the showcased results. I'm wondering, without shortening the video, which parameters can be modified to enhance the generation quality?

Plan to support new version of diffusers?

Hello, I would like to personalize the original SD 1.4 model with DreamBooth and integrate it with your pipeline for inference. However, I use the latest version of diffusers to train DreamBooth, so when loading the model I encounter this error:

ValueError: unknown mid_block_type : UNetMidBlock2DCrossAttn

Would you please help me with this error?

Stage 1 code

Hi

I am looking for the training code for stage 1 (the pose encoder) in this repo, but didn't find it. Will this code be released, or do you have any suggestions for training on my own pose or other condition datasets?

Thanks!

LAION-Pose Link

Hi,

I was looking for links to the LAION-Pose dataset from the paper. I want to help a team in the Hugging Face JAX sprint train a MediaPipe hand-tracking annotator for ControlNet.

Is the dataset publicly available?

What will be needed if we want to generate a higher-fps video sequence?

The current demo from this repo seems to generate 4-8 fps videos pretty decently. However, to make the video really useful, I imagine we would want smoother video, and 30 fps would be a better target.
Is it easy to configure the current code to generate a 30 fps video? The computation will increase for sure, but is there anything else we should be mindful of? For example, background flickering might become more obvious as we increase the fps; what is the best way to achieve the best quality when increasing the fps with the current code?
