vchitect / latte
Latte: Latent Diffusion Transformer for Video Generation.
License: Apache License 2.0
Hi, I am the first author of PVDM, and I just noticed that the FVD values you report for PVDM are much worse than the values I reported in the paper. Could you tell me why such differences exist?
Many people have tried (and succeeded) to reproduce the values, so this is strange to me.
Excellent work! To reproduce the videos shown on the project page, what seed (or other settings) should be used? The results are different on every run.
Hi, I have successfully loaded the t2v model using
bash sample/t2v.sh
but it shows the model is running on CPU. How can I set it to run on the GPU? Thanks.
Loading checkpoint shards: 100%|██████████| 4/4 [00:01<00:00, 2.74it/s]
Pipelines loaded with `dtype=torch.float16` cannot run with `cpu` device. It is not recommended to move them to `cpu` as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support for `float16` operations on this device in PyTorch. Please, remove the `torch_dtype=torch.float16` argument, or use another device for inference.
Processing the (Yellow and black tropical fish dart through the sea.) prompt
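A minimal sketch for pinning things to the GPU, assuming the sample script builds a diffusers-style object (the `pipeline` name below is illustrative): the float16/CPU warning above only fires when CUDA is not visible, so the first thing to check is whether PyTorch can see a GPU at all.

```python
import torch

# If this prints False, fix drivers / CUDA_VISIBLE_DEVICES first; the script
# will otherwise fall back to CPU regardless of any .to(...) calls.
print(torch.cuda.is_available())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# pipeline = pipeline.to(device)  # hypothetical pipeline object from sample_t2v.py
```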
Does Latte support multiple GPUs, and if so, how should it be set up?
Hi, I would like to run bash sample/ucf101.sh for sampling, but I hit "LayerNormKernelImpl" not implemented for 'Half'. Could you give some insight into how to resolve it? Many thanks.
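A minimal repro/workaround sketch (not from the repo): PyTorch has no CPU kernel for float16 LayerNorm, so the options are to keep float32 when running on CPU or to run float16 on a GPU.

```python
import torch

ln = torch.nn.LayerNorm(8)
x = torch.randn(2, 8)

# The failing case: float16 LayerNorm on CPU has no kernel.
# ln.half()(x.half())  # RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'

print(ln(x).shape)  # float32 on CPU works
if torch.cuda.is_available():
    print(ln.half().cuda()(x.half().cuda()).shape)  # float16 on GPU works
```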
Hi, could you please share the evaluation script, e.g. for FVD?
Does your t2i code not add positional encoding?
Thanks for your great work!
I want to know how to generate /path/to/datasets/UCF101/train_256_list.txt for UCF101 training.
After downloading the UCF101 videos, and given the paper's statement that "We extract 16-frame video clips from these datasets", are there any preprocessing scripts we can follow?
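A minimal sketch (not the authors' script) for producing the list file; the format below, one relative video path per line, is an assumption based on the file name in the config, not something the repo documents.

```python
import os

root = "/path/to/datasets/UCF101/train_256"         # preprocessed 256px clips
out = "/path/to/datasets/UCF101/train_256_list.txt"

with open(out, "w") as f:
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.endswith((".avi", ".mp4")):
                # one relative video path per line (assumed format)
                f.write(os.path.relpath(os.path.join(dirpath, name), root) + "\n")
```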
Can you provide me with the inference code for text to video?
When using args.extras=78, i.e. text2video generation mode, I noticed this line https://github.com/maxin-cn/Latte/blob/c4df091565fa6675f39d2fd1f8292295e202a43a/train.py#L221 uses pooled text embeddings ([batch, 768]) instead of the full text embeddings ([batch, 77, 768]), which is not compatible with this line https://github.com/maxin-cn/Latte/blob/c4df091565fa6675f39d2fd1f8292295e202a43a/models/latte.py#L241
As a result, I got this error: RuntimeError: mat1 and mat2 shapes cannot be multiplied (5x768 and 59136x1152)
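A minimal sketch of why those shapes collide: 59136 = 77 × 768, so the projection expects the flattened per-token embeddings, not the pooled vector (the Linear below is a stand-in for the layer in latte.py, not the repo's exact module).

```python
import torch

proj = torch.nn.Linear(77 * 768, 1152)     # mirrors the 59136 -> 1152 projection

tokens = torch.randn(5, 77, 768)           # full text embeddings: works
print(proj(tokens.flatten(1)).shape)       # torch.Size([5, 1152])

pooled = torch.randn(5, 768)               # pooled embeddings: the reported error
# proj(pooled)  # RuntimeError: mat1 and mat2 shapes cannot be multiplied (5x768 and 59136x1152)
```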
Hi, I tried to sample from the pre-trained LatteT2V model by running on CPU, but I hit several errors while running the code.
Steps to reproduce the error
Should I move the .safetensors file to t2v/transformer? Could you please review this part?
What is the difference between those two models?
Also, the t2v model cannot recognize Chinese prompts.
When running bash sample/t2v.sh, I get the error: Pipelines loaded with `dtype=torch.float16` cannot run with `cpu` device. It is not recommended to move them to `cpu` as running them will fail. Please make sure to use an accelerator to run the pipeline in inference, due to the lack of support for `float16` operations on this device in PyTorch. Please, remove the `torch_dtype=torch.float16` argument, or use another device for inference. Where should I set it to use the GPU when it runs?
When I train Latte on UCF101, the gradients of the linear layers are all zero. I find this strange. Zero gradients?
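A minimal diagnostic sketch (standalone; swap in the real model and loss from the training script) for checking which parameters actually receive non-zero gradients after backward():

```python
import torch

model = torch.nn.Linear(4, 2)              # stand-in for the Latte model
loss = model(torch.randn(3, 4)).sum()      # stand-in for the training loss
loss.backward()

for name, p in model.named_parameters():
    g = None if p.grad is None else p.grad.abs().max().item()
    print(f"{name}: max|grad| = {g}")
```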
I have a question: after completing the t2v setup and running t2v.sh, the generated output is all noise.
I tried changing the 'sample_method' hyperparameter to 'DDIM' in Latte/configs/t2v/t2v_sample.yaml, which makes the output quality worse. Can you provide a script for the DDIM sampler, or does the model simply not work well with a DDIM sampler?
Sincerely
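A minimal sketch, assuming the t2v pipeline carries a diffusers-style scheduler (the `pipe` name is a placeholder; whether the released checkpoint actually samples well under DDIM is exactly the open question above):

```python
from diffusers import DDIMScheduler

# Reuse the pipeline's existing noise-schedule config so only the sampler changes.
# pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

# Or construct one explicitly with the common linear schedule:
scheduler = DDIMScheduler(beta_start=0.0001, beta_end=0.02,
                          beta_schedule="linear", clip_sample=False)
```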
Hi, this is great work; I hope you have time to answer my simple question: where can I modify the Gaussian noise parameters at the inference/sampling stage? Also, if I change the input to an image or a video, can the model generate a video from that image, or turn an existing poor-quality video into a good one? Thanks.
Where is transformer/config.json when running sh t2v.sh?
As the title says.
Does this mean the current t2v model was not trained on other frame lengths and cannot generalize to them?
sh sample/t2v.sh
Using model!
Traceback (most recent call last):
  File "/data/zhangmaolin/code/Latte/sample/sample_t2v.py", line 160, in <module>
    main(OmegaConf.load(args.config))
  File "/data/zhangmaolin/code/Latte/sample/sample_t2v.py", line 34, in main
    vae = AutoencoderKL.from_pretrained(args.pretrained_model_path, subfolder="vae", torch_dtype=torch.float16).to(device)
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/diffusers/models/modeling_utils.py", line 812, in from_pretrained
    unexpected_keys = load_model_dict_into_meta(
  File "/home/user/anaconda3/envs/py39/lib/python3.9/site-packages/diffusers/models/modeling_utils.py", line 155, in load_model_dict_into_meta
    raise ValueError(
ValueError: Cannot load /data/zhangmaolin/code/Lattle_file/Latte/t2v_required_models because decoder.conv_in.bias expected shape tensor(..., device='meta', size=(64,)), but got torch.Size([512]). If you want to instead overwrite randomly initialized weights, please make sure to pass both low_cpu_mem_usage=False and ignore_mismatched_sizes=True. For more information, see also: huggingface/diffusers#1619 (comment) as an example.
Hello, could you help me work out how to solve this problem? I get this error when running t2v.sh. Looking forward to your reply.
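For reference, a sketch of what the error message itself suggests (this only bypasses the size check; a decoder.conv_in.bias of [512] vs [64] really means the checkpoint and config don't match, so re-downloading the matching `vae` folder is the proper fix):

```python
import torch
from diffusers.models import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "/path/to/t2v_required_models",   # placeholder for the local model dir
    subfolder="vae",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=False,          # flags named in the ValueError above
    ignore_mismatched_sizes=True,
)
```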
Hi author,
Thanks for your great work.
I would like to know how you obtain the 2048 videos used for computing FVD. Thanks!
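A minimal sketch (not the authors' evaluation code) of one common recipe: loop the released sampler over distinct seeds until 2048 clips are saved, then feed the generated and real clips to an FVD implementation. The sampling and saving calls below are placeholders.

```python
import torch

num_videos, batch_size = 2048, 8
for i in range(num_videos // batch_size):
    torch.manual_seed(i)                        # distinct seed per batch
    # videos = sample_batch(batch_size)         # placeholder for the repo's sampler
    # save_batch(videos, f"samples/{i:04d}")    # placeholder saver
```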
A very impressive piece of work. I hit several issues when using the SkyTimelapse data for image and video pre-training:
requires_grad, cleanup, create_tensorboard, write_tensorboard, setup_distributed, get_experiment_dir, fetch_files_by_numbers, and separation_content_motion were not found in utils:
ImportError: cannot import name 'fetch_files_by_numbers' from 'utils' (Latte/utils.py)
After commenting out the corresponding imports, it runs. What do these functions mainly do? Do they operate directly on the original videos? Is commenting them out directly safe?
Traceback (most recent call last):
  File "/data/zqzx/latte/latte_main/latte/train_with_img.py", line 361, in <module>
    main(OmegaConf.load(args.config))
  File "/data/zqzx/latte/latte_main/latte/train_with_img.py", line 221, in main
    logger.info(f"Dataset contains {len(dataset):,} videos ({args.webvideo_data_path})")
  File "/data/miniconda3/envs/yxl/lib/python3.9/site-packages/omegaconf/dictconfig.py", line 355, in __getattr__
    self._format_and_raise()
Can this be solved by pointing the corresponding path in sky_img_train.yaml at the actual videos (.mp4 files)? Or can we use our own video dataset through this path?
Thank you.
When the code was executed to
text_encoder = T5EncoderModel.from_pretrained(
args.pretrained_model_path, subfolder="text_encoder",
torch_dtype=torch.float16
).to(device)
the following error occurred
ValueError: Non-consecutive added token '<extra_id_99>' found. Should have index 32100 but has index 32000 in saved vocabulary.
What is the reason for this? Is it because the t2v_required_models/tokenizer/spiece.model file on Hugging Face is outdated?
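One hedged workaround, assuming the local tokenizer folder is stale or partially downloaded: load the tokenizer straight from the upstream T5 checkpoint that PixArt-style t2v models use. The repo id below is my assumption, not something the authors confirmed.

```python
from transformers import T5Tokenizer

# A fresh download avoids a mismatched spiece.model / added-tokens pair
# (assumed upstream checkpoint; requires the sentencepiece package).
tokenizer = T5Tokenizer.from_pretrained("DeepFloyd/t5-v1.1-xxl")
```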
Traceback (most recent call last):
  File "C:\Users\Dell\Desktop\Project\Latte-main\sample\sample.py", line 29, in <module>
    from models import get_models
  File "C:\Users\Dell\Desktop\Project\Latte-main\models\__init__.py", line 7, in <module>
    from .latte_t2v import LatteT2V
  File "C:\Users\Dell\Desktop\Project\Latte-main\models\latte_t2v.py", line 11, in <module>
    from diffusers.models.embeddings import get_1d_sincos_pos_embed_from_grid, ImagePositionalEmbeddings, CaptionProjection, PatchEmbed, CombinedTimestepSizeEmbeddings
ImportError: cannot import name 'CaptionProjection' from 'diffusers.models.embeddings'
Sorry to bother you again, but where can I find the paper? The paper link in the project seems to be invalid.
ImportError: cannot import name 'CaptionProjection' from 'diffusers.models.embeddings'
Can anyone please help with this error?
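A hedged compatibility shim: diffusers renamed CaptionProjection to PixArtAlphaTextProjection around v0.26, so either pin an older diffusers (the repo's environment.yml should state the exact version) or alias the new name:

```python
try:
    from diffusers.models.embeddings import CaptionProjection
except ImportError:  # diffusers >= 0.26 renamed the class
    from diffusers.models.embeddings import PixArtAlphaTextProjection as CaptionProjection
```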
Do you have any plans to make the t2v training part of the code public? And to release the best T2V model?
~/Latte# bash sample/t2v.sh
/root/miniconda3/envs/latte/lib/python3.12/site-packages/diffusers/utils/outputs.py:63: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
Traceback (most recent call last):
File "/root/Latte/sample/sample_t2v.py", line 160, in <module>
main(OmegaConf.load(args.config))
File "/root/Latte/sample/sample_t2v.py", line 30, in main
transformer_model = get_models(args).to(device, dtype=torch.float16)
^^^^^^^^^^^^^^^^
File "/root/Latte/models/__init__.py", line 42, in get_models
return LatteT2V.from_pretrained_2d(pretrained_model_path, subfolder="transformer")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/Latte/models/latte_t2v.py", line 992, in from_pretrained_2d
raise RuntimeError(f"{model_file} does not exist")
RuntimeError: None does not exist
Code:
model = cls.from_config(config)
model_files = [
    os.path.join(pretrained_model_path, 'diffusion_pytorch_model.bin'),
    os.path.join(pretrained_model_path, 'diffusion_pytorch_model.safetensors')
]
model_file = None
for fp in model_files:
    if os.path.exists(fp):
        model_file = fp
        break
if model_file is None:
    # model_file is still None here, which is why the error above reads
    # "None does not exist"; reporting the searched paths is more useful.
    raise RuntimeError(f"none of {model_files} exist")
ModuleNotFoundError: No module named 'petrel_client'. How can I solve this problem? pip doesn't seem to work.
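A hedged note: petrel_client is an internal storage client that isn't published on PyPI, which is why pip fails. If you aren't reading data from that storage backend, guarding the import is usually sufficient:

```python
try:
    from petrel_client.client import Client  # only needed for Petrel/Ceph storage paths
except ImportError:
    Client = None  # fall back to plain filesystem access
```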
I have a bug at the line that saves the video:
PyAVPlugin.write() got an unexpected keyword argument 'quality'
I can't save the output.
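A hedged sketch of the usual cause: newer imageio routes .mp4 writes through the PyAV plugin, which rejects the `quality` kwarg; forcing the legacy FFMPEG plugin (or dropping `quality`) lets the save succeed. `video_frames` below is a placeholder for the sampled clip.

```python
import numpy as np
import imageio

video_frames = np.zeros((16, 256, 256, 3), dtype=np.uint8)  # placeholder clip

# Force the FFMPEG plugin, which understands 'quality'
# (requires the imageio-ffmpeg package):
imageio.mimwrite("sample.mp4", video_frames, fps=8, format="FFMPEG", quality=9)
```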
Hi,
Thanks for the great work. I have a few questions:
PatchEmbed appears in several places and comes from different libs: sometimes from diffusers, sometimes from timm. Do you have a pointer to the code where Fig. 3(b) is implemented?

Hi :)! First, thank you for your excellent work!
I am trying to reproduce the results of Latte, and I wonder about the total batch size for each dataset (local_batch_size * num_gpus); can you share more details of the experiment setups?
I tried a 1e-4 learning rate with a total batch size of 32 on the small version Latte-S, but couldn't generate good results. So I wonder: are the batch size and model size highly relevant to the final results? Thank you!
I created the conda environment using conda env create -f environment.yml && conda activate latte, but I also encounter this error when running
bash sample/ffs.sh
My diffusers version is 0.26.3. I have also tried installing diffusers from source, but found no trace of CaptionProjection in the diffusers lib. Which version did you use?
I saw in the README that the FaceForensics dataset can be used to train two models, class-conditional and unconditional Latte. Do I need to do any additional preprocessing of the FaceForensics dataset? In what form should I organize its data?
In addition, how do I train the t2v model?
What is the lowest amount of GPU video memory (VRAM) necessary to run Latte video generation effectively, for both training and inference?
I was trying to train text-to-video generation. Can you please provide a code base? train.py says T2V training is not supported at the moment, so how can I do that?
Have you tested your architecture on discriminative tasks like video/panoptic segmentation?
There has been some promising recent effort, but on images:
https://github.com/cp3wan/DFormer
Good job, but I have some questions about the ffs checkpoint inference experiment.
1) I set "ckpt" in ffs.sh to the folder holding https://huggingface.co/maxin-cn/Latte/blob/main/ffs.pt, and set "pretrained_model_path" to the folder holding https://huggingface.co/maxin-cn/Latte/tree/main/vae.
But the quality of the generated video is bad. Is there anything wrong with my process?
2) Besides, I edited the code in sample.py. If I keep the line samples = vae.decode(samples / 0.18215).sample, I get a segmentation fault. Therefore, I replaced it with the following. Is there anything wrong with my process?
When I use
bash sample/ffs.sh
I hit this error:
Traceback (most recent call last):
  File "/app/alpaca-lora/voice/clip_proj/Latte/sample/sample.py", line 138, in <module>
    main(omega_conf)
  File "/app/alpaca-lora/voice/clip_proj/Latte/sample/sample.py", line 56, in main
    model = get_models(args).to(device)
  File "/app/alpaca-lora/voice/clip_proj/Latte/models/__init__.py", line 44, in get_models
    return Latte_models[args.model](
  File "/app/alpaca-lora/voice/clip_proj/Latte/models/latte.py", line 465, in Latte_XL_2
    return Latte(depth=28, hidden_size=1152, patch_size=2, num_heads=16, **kwargs)
  File "/app/alpaca-lora/voice/clip_proj/Latte/models/latte.py", line 233, in __init__
    self.x_embedder = PatchEmbed(input_size, patch_size, in_channels, hidden_size, bias=True)
TypeError: PatchEmbed.__init__() got an unexpected keyword argument 'bias'
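A hedged guess at the cause: latte.py takes PatchEmbed from timm, and older timm versions don't accept a `bias` argument (it was added around timm 0.6.x), so upgrading timm (e.g. pip install -U "timm>=0.6.12") or dropping the kwarg should both unblock this:

```python
from timm.models.vision_transformer import PatchEmbed

# With a recent timm this mirrors the call in latte.py line 233:
embed = PatchEmbed(img_size=32, patch_size=2, in_chans=4, embed_dim=1152, bias=True)
# On older timm, omit bias=True (the projection conv is biased by default there).
```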
> As the title.
> Does it mean the current t2v model is not trained on other frame lengths and cannot generalize to other frame lengths?
Hi, producing videos directly with more than 16 frames can lead to low-quality output. To generate videos longer than 16 frames, you might consider using the autoregressive mode for better results.
Originally posted by @maxin-cn in #25 (comment)
One step takes it all:
Way 1: offline/explicit
1. 4D (time + stereo), strongly physically stereo-consistent, any camera pose, offline rendering, interactive, semantically highly controllable world.
OR
Way 2: online/implicit
Binocular stereo-consistent generation, in-place observer pose changes, online real-time, interactive, semantically highly controllable world.
And the technical path is: train on binocular video from Unreal Engine or the physical world.
Your group is very strong and has the ability to achieve this.
The endgame of visual gaming.
Hi, in the training stage, what do the 'args.extras' parameter values (78/2/1) mean, respectively?
What are the minimum requirements for GPU memory for training and inference?
Is this a model limitation, or something else? Is there any way to make it generate longer videos?
Hi, thanks for your great work, your code and ablation experiments have inspired us a lot. Is it possible for me to make modifications based on your code to adapt it to Open-Sora Plan? Thank you.