vchitect / latte Goto Github PK

View Code? Open in Web Editor NEW

1.5K 1.5K 153.0 36.28 MB

Latte: Latent Diffusion Transformer for Video Generation.

License: Apache License 2.0

Python 99.21% Shell 0.79%

latte's People

Contributors

Stargazers

Watchers

Forkers

skbl5694 cv-synthesis qinhao0519 spacewalkingninja liubo0902 jags111 maxin-cn navezjt eltociear orefaleoluwayinka princerumi camenduru vinthony wangbo-zhao aapostoliadis mr-harry sfidea zhaoxin151 xiechengmude codeaudit blackmatrix2007 theonetrueguy geronimi73 wjd6910502 golegen coderlsf jonnyquan sanying123 xiaoyangsharon uloveqian2021 dstarepoch rong9797 kellhuang oldbirdgo susmall2 amanda-barbara kioco onlyfish79 techthiyanes faruba valencebond ingeniousfrog 1530426574 kenvindev chenin-wang gussailraat shiyukonghui zhaopufeng jackzhousz songfang senwang98 eaglecoders haorand alexyu0814 nashihikari yuxuan2015 otherbackup jangkyung etherfurnace harleywang qiufeng1981 kackbob yumianhuli1 xwyangjshb yangwang92 ccczone heatingma mowenyii zdlant ineedgrowup adambear zuonet1988 droshown acproject zzhao-98 nelsonjiao zippynetworks guanghuisong frankleeeee xszheng2020 leecheedoo kabachuha ego tqlwodege zhuxindong devbox10 binzhu-ece call-center-together uniqzheng fangwudi cnfjb333 mbrukman svorwerk-flextg ftgreat xggnet tanjingme qingshui will001 vincezh2000 stupidzz

latte's Issues

FVD values of PVDM are strange

Hi, I am the first author of PVDM, and I just checked the FVD values of PVDM are much worse than the values that I reported in the paper. Could you tell me why such differences exist?

Many people tried (and succeeded) to reproduce the values, so it is weird to me.

如何复现主页展示的t2v效果？

非常棒的工作！请问复现主页展示的视频需要设置seed为多少或者还有别的设置项吗？每次跑出来的结果都不一样

Pipelines loaded with `dtype=torch.float16` cannot run with `cpu` device

HI I have successfully loaded t2v model using
bash sample/t2v.sh
but it shows the model is running on cpu, how to set it to run on GPU, thanks

Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:01<00:00, 2.74it/s]
Pipelines loaded with dtype=torch.float16 cannot run with cpu device. It is not recommended to move them to cpu as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support forfloat16 operations on this device in PyTorch. Please, remove the torch_dtype=torch.float16 argument, or use another device for inference.
Pipelines loaded with dtype=torch.float16 cannot run with cpu device. It is not recommended to move them to cpu as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support forfloat16 operations on this device in PyTorch. Please, remove the torch_dtype=torch.float16 argument, or use another device for inference.
Pipelines loaded with dtype=torch.float16 cannot run with cpu device. It is not recommended to move them to cpu as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support forfloat16 operations on this device in PyTorch. Please, remove the torch_dtype=torch.float16 argument, or use another device for inference.
Processing the (Yellow and black tropical fish dart through the sea.) prompt

Does Latte support multiple GPUs

Does Latte support multiple GPUs, and if so, how should it be set up?

Issue about "LayerNormKernelImpl" not implemented for 'Half'

Hi, I would like to run bash sample/ucf101.sh for sampling but I met this "LayerNormKernelImpl" not implemented for 'Half'. Could you give some insights about how to resolve it? Many thanks.

Latte的实时微信讨论组

我刚试跑了一下t2v demo，效果还是有一定的差距的
在想几个事情，
1、是不是加大训练量能够很大改善，极大缩小跟sora 的差距
2、素材怎么搞得更好质量，比如类似dalle e 视频生成文字。

有没有实时的讨论组啊，比如微信群啥的，要不我建个

About evaluation

HI, could you please share the script for evaluation like FVD?

t2i代码疑问

你们的t2i的代码是不是没有加位置编码

Preprocess of UCF101

thanks for your great work!
i want to know how to generate the /path/to/datasets/UCF101/train_256_list.txt for the UCF101 training。
After downloading the UCF101 videos, according to the paper "We extract 16-frame video clips from these datasets", are there any process scripts we can follow?

Inference code

Can you provide me with the inference code for text to video?

Is there any bug in text2video generation mode?

When using 'args.extras=78', that is, text2video generation mode, I noticed this line https://github.com/maxin-cn/Latte/blob/c4df091565fa6675f39d2fd1f8292295e202a43a/train.py#L221 using pooled-text-embeddings([batch, 768]) instead of text-embeddings([batch, 77, 768]) , which is not compatible with this line https://github.com/maxin-cn/Latte/blob/c4df091565fa6675f39d2fd1f8292295e202a43a/models/latte.py#L241

As a result, I got this error RuntimeError: mat1 and mat2 shapes cannot be multiplied (5x768 and 59136x1152)

Some errors when running the LatteT2v

Hi, I tried to sample from the pre-trained LatteT2V model by running on CPU. But I have several errors during running the code.

Steps to reproduce the error

modifying enviroment.yml to follow the requirement
download the t2v.pt and whole folder from https://huggingface.co/maxin-cn/Latte/tree/main/t2v_required_models and keep the same structure, name this folder as t2v, so we have t2v/scheduler .... t2v/model_index.json and t2v/t2v.pt
In t2v.sample.yaml, let ckpt = "t2v/t2v.pt" and pretrained_model_path = "t2v"
change the name of file transformer_config.json in t2v/transformer to config.json. Because I got RuntimeError: t2v\transformer\config.json does not exist in line 982 in latte_t2v.py, in from_pretrained_2d.
I got RuntimeError: None does not exist in line 1000 in latte_t2v.py, in from_pretrained_2d. Since we store our diffusion_pytorch_model.safetensors in t2v/vae and there is no .bin file in t2v, there are no .safetensors and .bin files in t2v/transformer folder.

Should I move .safetensors file to t2v/transformer? Could you please review this part?

Difference between the training result of train.py and train_with_img.py?

What are the difference between the model of those two？

Excellent work, will there be an official support of images to vedio (like sora) ?

And the t2v model cannot recognize Chinese prompts

run bash sample/t2v.sh error

运行run bash sample/t2v.sh 出现报错Pipelines loaded with dtype=torch.float16 cannot run with cpu device. It is not recommended to move them to cpu as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support forfloat16 operations on this device in PyTorch. Please, remove the torch_dtype=torch.float16 argument, or use another device for inference. 我应该在哪里设置让它运行的时候使用GPU

Translate the following sentence into English: When running run bash sample/t2v.sh, an error occurred saying "Pipelines loaded with dtype=torch.float16 cannot run with cpu device. It is not recommended to move them to cpu as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support for float16 operations on this device in PyTorch. Please, remove the torch_dtype=torch.float16 argument, or use another device for inference." Where should I set it to use GPU when it runs?

anyone meet zero grad?零梯度？

when I train latte on UCF101, the grad of linear layer are all zero. I think it is strange, 零梯度？

Question

I have a question, after making the setting for t2v, run t2v.sh, the generation is all noise,

Can you provide the code for DDIM sampler

I try to change the ‘sample_method’ hyperparamter to 'DDIM' in Latte/configs/t2v/t2v_sample.yaml, which make the bad performance of the output image. can you provide some scripts for DDIM sampler, or is that model can not work well when using DDIM sampler?

Sicnerely

diffusion noise modify

Hi, it's a great job, hope you have time to answer my simple question, where can I modify the Gaussian noise parameter in the inference stage or sampling stage and if I change input to a image or video, dose the model have ability to generate a video through this image or generate a awesome video though the existing poor quality video? Thanks

Bug

Where is the transformer/config.json when sh t2v.sh

T2V with >16 vedio_length output random noises

As the title.

Does it mean the current t2v model is not trained on other frame lengths and cannot generalize to other frame length?

sh sample/t2v.sh error,

sh sample/t2v.sh
Using model!
Traceback (most recent call last):
File "/data/zhangmaolin/code/Latte/sample/sample_t2v.py", line 160, in
main(OmegaConf.load(args.config))
File "/data/zhangmaolin/code/Latte/sample/sample_t2v.py", line 34, in main
vae = AutoencoderKL.from_pretrained(args.pretrained_model_path, subfolder="vae", torch_dtype=torch.float16).to(device)
File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/diffusers/models/modeling_utils.py", line 812, in from_pretrained
unexpected_keys = load_model_dict_into_meta(
File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/diffusers/models/modeling_utils.py", line 155, in load_model_dict_into_meta
raise ValueError(
ValueError: Cannot load /data/zhangmaolin/code/Lattle_file/Latte/t2v_required_models because decoder.conv_in.bias expected shape tensor(..., device='meta', size=(64,)), but got torch.Size([512]). If you want to instead overwrite randomly initialized weights, please make sure to pass both low_cpu_mem_usage=False and ignore_mismatched_sizes=True. For more information, see also: huggingface/diffusers#1619 (comment) as an example.

您好，能帮忙看下这个问题怎么解决吗，我跑t2v.sh的时候报错，期待您的回复

how to get 2048 videos for computing FVD?

Hi author,
Thanks for your great work.
I would like to know how to get 2048 videos for computing FVD? Thanks

torchrun --nnodes=1 --nproc_per_node=2 train_with_img.py --config ./configs/sky/sky_img_train.yaml error

A very impressive job. There are several issues when using SkyTimelapse data for image and video pre training

From utils import (clip_gradnorm, create'logger, update_ema,

Requires_grad, cleanup, create_tensorboard,

Write_tensorboard, setup_distributed, fetch files by numbers,

Fetch files by numbers and separation content motion were not found in utils in get_experiment_dir, separation content motion,)

"ImportError: Unable to import the name 'fetch files by num bers' from' utils' (Last/utils. py). After commenting out the corresponding file, it is sufficient. Can you ask what this mainly does? Does it directly use the original video?" Commenting out directly is not a problem, right? "”

If args. dataset=='webvideo2mlaion ':

Traceback (most recent call last):

File "/data/zqzx/latte/latte_main/latte/train_with_img. py", line 361, in

Main (OmegaConf. load (args. config))

File "/data/zqzx/latte/latte_main/latte/train_with_img. py", line 221, in main

Logger. info (f "Dataset contains {len (dataset):,} videos ({args. webvideo_data_path})")

File "/data/miniconde3/envs/yxl/lib/python3.9/site packages/omegaconf/docconfiguration. py", line 355, in getattr_

Self_ Formad_and_raise()

This can be directly solved by adding the corresponding solution to the sky_img_train.yaml corresponding to the actual video, which is. mp4? Or can we think of our own video dataset through this path?

If args. test_run: After commenting it out directly, it can be run now

Thank you.

Non-consecutive added token '<extra_id_99>' found.

When the code was executed to

text_encoder = T5EncoderModel.from_pretrained(
    args.pretrained_model_path, subfolder="text_encoder", 
    torch_dtype=torch.float16
).to(device)

the following error occurred

ValueError: 
Non-consecutive added token '<extra_id_99>' found.
Should have index 32100 but has index 32000 in saved vocabulary.

What is the reason for this? Is it because the t2v_required_models/tokenizer/spiece.model file on the hugging face is outdated?

为什么会这样报错呢在运行sample.py模块的时候Traceback (most recent call last): File "C:\Users\Dell\Desktop\Project\Latte-main\sample\sample.py", line 29, in <module> from models import get_models File "C:\Users\Dell\Desktop\Project\Latte-main\models\init.py", line 7, in <module> from .latte_t2v import LatteT2V File "C:\Users\Dell\Desktop\Project\Latte-main\models\latte_t2v.py", line 11, in <module> from diffusers.models.embeddings import get_1d_sincos_pos_embed_from_grid, ImagePositionalEmbeddings, CaptionProjection, PatchEmbed, CombinedTimestepSizeEmbeddings ImportError: cannot import name 'CaptionProjection' from 'diffusers.models.embeddings'

Traceback (most recent call last):
File "C:\Users\Dell\Desktop\Project\Latte-main\sample\sample.py", line 29, in
from models import get_models
File "C:\Users\Dell\Desktop\Project\Latte-main\models_init_.py", line 7, in
from .latte_t2v import LatteT2V
File "C:\Users\Dell\Desktop\Project\Latte-main\models\latte_t2v.py", line 11, in
from diffusers.models.embeddings import get_1d_sincos_pos_embed_from_grid, ImagePositionalEmbeddings, CaptionProjection, PatchEmbed, CombinedTimestepSizeEmbeddings
ImportError: cannot import name 'CaptionProjection' from 'diffusers.models.embeddings'

Where is the paper?

Sorry to bother you again, but where can I find the paper . The paper link in the project seems invalid.

Import error

ImportError: cannot import name 'CaptionProjection' from 'diffusers.models.embeddings'
can anyone please help with this errror?

Does sora copy from this idea?

Train code of t2v？

Do you have any plans to make the training t2v part of the code public? And the best model of T2V.

run bash sample/t2v.sh,but why?

try generating videos from text，why?

t2vA_dog_in_astronaut_suit_and_sunglasses_floating_in_space._0000webv-imageio.mp4

Cannot find model：LatteT2V.from_pretrained_2d

~/Latte# bash sample/t2v.sh
/root/miniconda3/envs/latte/lib/python3.12/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
/root/miniconda3/envs/latte/lib/python3.12/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  torch.utils._pytree._register_pytree_node(
Traceback (most recent call last):
  File "/root/Latte/sample/sample_t2v.py", line 160, in <module>
    main(OmegaConf.load(args.config))
  File "/root/Latte/sample/sample_t2v.py", line 30, in main
    transformer_model = get_models(args).to(device, dtype=torch.float16)
                        ^^^^^^^^^^^^^^^^
  File "/root/Latte/models/__init__.py", line 42, in get_models
    return LatteT2V.from_pretrained_2d(pretrained_model_path, subfolder="transformer")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Latte/models/latte_t2v.py", line 992, in from_pretrained_2d
    raise RuntimeError(f"{model_file} does not exist")
RuntimeError: None does not exist

Code:

        model = cls.from_config(config)
        
        model_files = [
            os.path.join(pretrained_model_path, 'diffusion_pytorch_model.bin'),
            os.path.join(pretrained_model_path, 'diffusion_pytorch_model.safetensors')
        ]

        model_file = None

        for fp in model_files:
            if os.path.exists(fp):
                model_file = fp

        if not model_file:
            raise RuntimeError(f"{model_file} does not exist")

the transformer does not have these two models：

The provided pre-trained model is invalid

Bug

ModuleNotFoundError: No module named 'petrel_client'，how can i solve this problem, pip seems to do not work

some error to save output text to video

i have bug in line save video i have bug
PyAVPlugin.write() got an unexpected keyword argument 'quality'
I cant save output

Implementation of compression frame patch embedding (Fig. 3b)

Hi,
Thanks for the great work. I have a few questions:

By default which "patch embedding" is used? Fig.3(a) or (b)?
Is there a parameter to switch between (a) and (b) in a config file?
I'd like to take a look at the implementation of (b) -- compression frame patch embedding. I see PatchEmbed several places and they are from different libs: sometimes from diffuser sometimes from timm. Do you have a pointer to the code where Fig.3(b) is implemented?

Training BatchSize

Hi :)! First thank you for your excellent work!

I am trying to reproduce the results of Latte, and I wonder the total batch size for each dataset (local_batch_size * num_gpus), can you share more information on the experiment setups?

I tried the 1e-r lr with 32 total batchsize with the small version Latte-S, but can't generate good results. And so I wonder is the batch size / model size highly relevant to the final results? Thank you!

cannot import name 'CaptionProjection' from 'diffusers.models.embeddings'

I create conda environment using conda env create -f environment.yml && conda activate latte, but also
encounter this error when run
bash sample/ffs.sh

diffuser lib is 0.26.3. AND, I have tried to install diffusers from source, but I found no information about CaptionProjection in diffusers lib. Which version you have used?

preprocess dataset & t2v training

I saw in the README that it can be used to train two models, class-conditional and unconditional Latte, using the FaceForensics dataset. Do I need to do any additional preprocessing on the FaceForensics dataset? In what form should I organize the data in the FaceForensics dataset?
In addition, how to train the t2v model?

What is

I'm asking for the lowest amount of GPU video memory (VRAM) necessary to run latte video generation effectively? for both training and inference.

Asking for training code for t2v

I was trying to train for the text to video generation can you please provide any code base as in the train.py it is said that T2V training is not supported at this moment so how can I do that ?

Discriminative tasks

Have you tested your arch on discriminative tasks like Video/Panoptic segmentation?
There was some promising effort recently but on images:
https://github.com/cp3wan/DFormer

Re-implementation err on ffs experiment

Good job, but I have some questions on ffs ckpt inference experiment.
1）I set "ckpt" in ffs.sh to the folder related to https://huggingface.co/maxin-cn/Latte/blob/main/ffs.pt", set "pretrained_model_path" to the folder related to https://huggingface.co/maxin-cn/Latte/tree/main/vae.
But the performance of video generation is bad. Is there anything wrong with my process?

sample.mp4

2）Besides, I edit the code in sample.py. If I keep the code "samples = vae.decode(samples / 0.18215).sample", I will get "Segmentation fault". Therefore, I replace the code with the following. Is there anything wrong with my process?

TypeError: PatchEmbed.init() got an unexpected keyword argument 'bias'

When I use the
bash sample/ffs.sh
meets this error.

Traceback (most recent call last):
File "/app/alpaca-lora/voice/clip_proj/Latte/sample/sample.py", line 138, in
main(omega_conf)
File "/app/alpaca-lora/voice/clip_proj/Latte/sample/sample.py", line 56, in main
model = get_models(args).to(device)
File "/app/alpaca-lora/voice/clip_proj/Latte/models/init.py", line 44, in get_models
return Latte_models[args.model](
File "/app/alpaca-lora/voice/clip_proj/Latte/models/latte.py", line 465, in Latte_XL_2
return Latte(depth=28, hidden_size=1152, patch_size=2, num_heads=16, **kwargs)
File "/app/alpaca-lora/voice/clip_proj/Latte/models/latte.py", line 233, in init
self.x_embedder = PatchEmbed(input_size, patch_size, in_channels, hidden_size, bias=True)
TypeError: PatchEmbed.init() got an unexpected keyword argument 'bias'

How can I turn on the autoregressive mode to generate >16 frame videos?

          > As the title.

Does it mean the current t2v model is not trained on other frame lengths and cannot generalize to other frame length?

Hi, producing videos directly with more than 16 frames can lead to low-quality output. To generate videos longer than 16 frames, you might consider using the autoregressive mode for better results.

Originally posted by @maxin-cn in #25 (comment)

please: one step take all 大神们，一步到位啊。

one step take all:

way 1: offline/explicit
1. 4d (time+stereo) strong physical stereo-consistent, any camera pose, offline render, interactive semantic highly-controllable，world.

way 2: online/implicit
binocular stereo-consistent generate, observe pose in-place change, online realtime, interactive semantic highly-controllable，world.
and: the technical path is: train by binocular-video in Unreal Engine or Physical World.

你们组很强，有能力做到这种效果。
视觉游戏的终局。