microsoft / nuwa

A unified 3D Transformer Pipeline for visual synthesis

nuwa's Introduction

This is the official repo for the following papers:

NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion. (ECCV 2022)
NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis. (NeurIPS 2022)
NUWA-LIP: Language Guided Image Inpainting with Defect-free VQGAN. (CVPR 2023)
Learning 3D Photography Videos via Self-supervised Diffusion on Single Images. (IJCAI 2023)
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation. (ACL 2023)

Update 2022/7/13: NUWA-Infinity

NUWA-Infinity is a generative model for infinite visual synthesis, which is defined as the task of generating arbitrarily-sized high-resolution images or long-duration videos.

Update 2021/11/26: NÜWA

NÜWA is a unified multimodal pre-trained model that can generate new or manipulate existing visual data (i.e., images and videos) for 8 visual synthesis tasks.

nuwa's Issues

PeppaPig Dataset

Is there any plan to release the PeppaPig dataset? Hope to reproduce some works on this dataset.

Thanks for this project, it's simply amazing.
Any plans to share the pre-trained model(s)? That would be super helpful for comparing it against CLIP, DALL-E, VQGAN, and ensembles of these models.

Thanks a lot

You can find the codes here

As mentioned in #2, #4, #5, #6, #7, #9, #14, #15, #16, #19, they aren't going to give you a fuck, as if they care. Especially they are in China, a place disconnected from the world.

The following repositories provide code for the corresponding tasks (many of them also achieve superior quality).
And remember to cite them (instead of NUWA of course) to support their efforts in making their code publicly available.

Text to image:
- https://github.com/CompVis/stable-diffusion
Text to video:
- https://github.com/THUDM/CogVideo
- https://github.com/lucidrains/imagen-pytorch
Infinite-visual synthesis:

code, please

code

When will the code of [NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation] be released?

Missing performance numbers in the paper

First off, congratulations on this amazing work. I think you managed to close the gap that keeps generative deep learning from being relevant for real-world applications, rather than just a nice toy as in previous work in this area.

However, to truly judge the performance of your approach, I have to say I was a bit disappointed after reading your paper: there is not a single note on execution time, either for training or, more crucially, for sampling a single final image.

Would you be able to provide some numbers on how long sample generation takes for a 4k×1k image with a 256^2 patch size, and on which setup?

Also, if possible, could you shed some light on training times and which setup was used?

Thank you!
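For a sense of scale in the question above, here is a minimal sketch of the patch count involved. It assumes that "4k×1k" means 4096×1024 pixels and that patches are non-overlapping 256×256 tiles; neither is stated in the issue, and the function name `patch_grid` is hypothetical.

```python
def patch_grid(width, height, patch):
    """Return (columns, rows, total) of patches needed to cover an image.

    Assumes non-overlapping square patches; a partial patch at the
    border counts as a full one (ceiling division).
    """
    cols = -(-width // patch)   # ceiling division without importing math
    rows = -(-height // patch)
    return cols, rows, cols * rows


# Assumption: "4kx1k" is read as 4096 x 1024 pixels.
cols, rows, total = patch_grid(4096, 1024, 256)
print(cols, rows, total)
```

Under these assumptions the image splits into a 16 by 4 grid, so any per-patch sampling time would be multiplied by 64 patches if they are generated one after another.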

For T2V, are the 10 frames evenly sampled from the video or the first 10 frames in the video?

Thank you for your excellent work! From the paper, I know that you sample 10 frames from a 2.5 FPS video. How many frames are there per video in the dataset you use? Are the 10 frames evenly sampled from the video, or are they the first 10 frames in the video?
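To make the two options in this question concrete, here is a minimal sketch of both sampling strategies. It is only an illustration: the function names are hypothetical, and the 25-frame clip length is an assumed example (10 seconds at 2.5 FPS), not a figure taken from the paper.

```python
def first_n_indices(num_frames, n=10):
    """Indices of the first n frames (or all frames if the clip is shorter)."""
    return list(range(min(n, num_frames)))


def evenly_spaced_indices(num_frames, n=10):
    """Indices of n frames spread evenly from the first to the last frame."""
    if num_frames <= n:
        return list(range(num_frames))
    if n == 1:
        return [0]
    step = (num_frames - 1) / (n - 1)  # fractional stride between samples
    return [round(i * step) for i in range(n)]


# Assumed example: a 25-frame clip (10 s at 2.5 FPS).
first = first_n_indices(25)
even = evenly_spaced_indices(25)
print(first)
print(even)
```

The first strategy covers only the opening 4 seconds of such a clip, while the second spans the whole clip at a coarser effective frame rate, which is why the distinction matters for reproducing the T2V setup.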

When will this project be freely available?

I need to use the software! Is it possible?

I would like to use the software for image development for my university architecture thesis (Argentina).

The idea is to animate several images.

Code

The code for this paper would be extremely helpful for evaluating it against other techniques and for fine-tuning as well.

This work is simply too great!

A very polite question: when will the code and model be released?
I would appreciate it if you could release the code and the model.

Code

Can the source code be made public? I want to reproduce your work because the task sounds interesting :)

This repo is missing a license file

This repository is currently missing a LICENSE.MD file outlining its license. A license helps users understand how to use your project in a compliant manner. You can find the standard MIT license text at the Microsoft repo templates LICENSE file: https://github.com/microsoft/repo-templates/blob/main/shared/LICENSE.
If you would like to learn more about open source licenses, please visit the document at https://aka.ms/license.

Question about paper

As the paper says in the appendix:
"For example, for long videos or high-resolution frames with large h, w, s, usually (e^h)(e^w)(e^s)< (h + w + s)"
Is there any situation in which (e^h)(e^w)(e^s) < (h + w + s)?
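The answer depends on how the notation is read, and the quoted sentence does not settle that. The sketch below checks two readings, both of which are assumptions: (a) e^h, e^w, e^s are exponentials of h, w, s, and (b) e^h, e^w, e^s are separate quantities labelled by a superscript (for example, fixed local window extents). The function names and the sample values are hypothetical.

```python
import math


def holds_if_exponentials(h, w, s):
    """Reading (a): e^h * e^w * e^s = exp(h + w + s).

    Compared in the log domain to avoid overflow:
    exp(x) < x  is equivalent to  x < ln(x)  for x > 0,
    which is never true because ln(x) <= x - 1.
    """
    x = h + w + s
    return x < math.log(x)


def holds_if_labelled_extents(h, w, s, eh, ew, es):
    """Reading (b): eh, ew, es are independent numbers, not powers of e."""
    return eh * ew * es < h + w + s


# Reading (a) fails for every positive size tried here.
print([holds_if_exponentials(h, w, s)
       for (h, w, s) in [(1, 1, 1), (16, 16, 10), (64, 64, 32)]])

# Reading (b) with hypothetical extents of 3 in each dimension.
print(holds_if_labelled_extents(16, 16, 10, 3, 3, 3))  # 27 < 42
print(holds_if_labelled_extents(4, 4, 4, 3, 3, 3))     # 27 < 12
```

Under reading (a) the inequality can never hold for positive h, w, s. Under reading (b) it holds once h + w + s exceeds the product of the three extents, which is consistent with the "large h, w, s" wording in the quote.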

Paper - Possible (minor) error

In this paper, we show that simply using 2D VQ-GAN to encode each frame of a video can also generate temporal consistency videos and at the same time benefit from both image and video data.

In the paper, I believe you mean "temporally consistent" here. Subtle change in wording.

[Documentation] Video Prediction Labeled as a V2V process, despite taking only 1 frame

Judging by the results, the transformer is taking in a single frame, and would be considered an Image to Video process.
Something like video inpainting or camera FOV extrapolation (as in FGVC) would be input video -> output video.
Am I missing something in the documentation that maybe shows it as some sort of sparse video interpolation where it can input more than a (D1, D2, single frame); or was it called V2V in order to match the I2I label on the inpainting/image completion counterparts?

Additionally, there isn't a direct link to the paper, which documents that the V2V model only takes in a single image.
https://arxiv.org/abs/2111.12417

Merge this pull request