
mvdream's Introduction

MVDream

Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, Xiao Yang

| Project Page | 3D Generation | Paper | HuggingFace Demo (Coming) |

multiview diffusion

3D Generation

  • This repository only includes the diffusion model and 2D image generation code of the MVDream paper.
  • For 3D Generation, please check MVDream-threestudio.

Installation

You can use the same environment as Stable Diffusion for this repo, or set up the environment by installing the given requirements:

pip install -r requirements.txt

To use MVDream as a Python module, you can install it with pip install -e . or:

pip install git+https://github.com/bytedance/MVDream

Model Card

Our models are provided on the Hugging Face model page under the OpenRAIL license.

Model                Base Model                 Resolution
sd-v2.1-base-4view   Stable Diffusion 2.1 Base  4x256x256
sd-v1.5-4view        Stable Diffusion 1.5       4x256x256

By default, we use the SD-2.1-base model in our experiments.

Note that you don't have to manually download the checkpoints for the following scripts.

Text-to-Image

You can simply generate multi-view images by running the following command:

python scripts/t2i.py --text "an astronaut riding a horse"

We also provide a Gradio script for trying it out with a GUI:

python scripts/gradio_app.py

Usage

Load the Model

We provide two ways to load the models of MVDream:

  • Automatic: load the model config by model name, with the weights downloaded from Hugging Face.

    from mvdream.model_zoo import build_model
    model = build_model("sd-v2.1-base-4view")

  • Manual: load the model from a config file and a checkpoint file.

    import torch
    from omegaconf import OmegaConf
    from mvdream.ldm.util import instantiate_from_config
    config = OmegaConf.load("mvdream/configs/sd-v2-base.yaml")
    model = instantiate_from_config(config.model)
    model.load_state_dict(torch.load("path/to/sd-v2.1-base-4view.th", map_location="cpu"))
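
If you take the manual route, the checkpoint file still has to come from somewhere. Below is a minimal sketch of fetching it with huggingface_hub; the repo id "MVDream/MVDream" and filename "sd-v2.1-base-4view.pt" are assumptions inferred from the Hugging Face cache path that appears in an issue log further down, so verify them against the model page before relying on them.

# Hypothetical helper for the manual loading path above.
# Assumption: the checkpoint lives in repo "MVDream/MVDream" as "sd-v2.1-base-4view.pt",
# inferred from the cache path in the CUDA issue log below -- verify on the model page.
import torch
from huggingface_hub import hf_hub_download
from omegaconf import OmegaConf
from mvdream.ldm.util import instantiate_from_config

ckpt_path = hf_hub_download(repo_id="MVDream/MVDream", filename="sd-v2.1-base-4view.pt")
config = OmegaConf.load("mvdream/configs/sd-v2-base.yaml")
model = instantiate_from_config(config.model)
model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))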

Inference

Here is a simple example for model inference:

import torch
from mvdream.camera_utils import get_camera
model.eval()
model.cuda()
with torch.no_grad():
    noise = torch.randn(4,4,32,32, device="cuda") # batch of 4x for 4 views, latent size 32=256/8
    t = torch.tensor([999]*4, dtype=torch.long, device="cuda") # same timestep for 4 views
    cond = {
        "context": model.get_learned_conditioning([""]*4).cuda(), # text embeddings
        "camera": get_camera(4).cuda(),
        "num_frames": 4,
    }
    eps = model.apply_model(noise, t, cond=cond)
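
The snippet above only queries a single noise prediction. To sample images end to end, the latents have to be run through a sampler and decoded. Below is a minimal sketch assembled from this example and the user code in the "Does camera conditioning affect style" issue further down; the 50 DDIM steps, elevation of 15, and guidance scale of 10 are illustrative defaults, not necessarily the exact settings of scripts/t2i.py.

import numpy as np
import torch
from PIL import Image

from mvdream.camera_utils import get_camera
from mvdream.ldm.models.diffusion.ddim import DDIMSampler
from mvdream.model_zoo import build_model

device = "cuda"
model = build_model("sd-v2.1-base-4view")
model.device = device
model.to(device)
model.eval()

sampler = DDIMSampler(model)
prompt = "an astronaut riding a horse"
num_frames = 4  # the released checkpoints generate 4 views jointly

with torch.no_grad():
    # conditional and unconditional inputs share the same cameras
    cond = {
        "context": model.get_learned_conditioning([prompt] * num_frames).to(device),
        "camera": get_camera(num_frames, elevation=15).to(device),
        "num_frames": num_frames,
    }
    uncond = {
        "context": model.get_learned_conditioning([""] * num_frames).to(device),
        "camera": cond["camera"],
        "num_frames": num_frames,
    }
    # 256x256 images correspond to 4x32x32 latents (256 / 8)
    latents, _ = sampler.sample(S=50, conditioning=cond, batch_size=num_frames,
                                shape=[4, 32, 32], verbose=False,
                                unconditional_guidance_scale=10.0,
                                unconditional_conditioning=uncond, eta=0.0)
    images = model.decode_first_stage(latents)
    images = torch.clamp((images + 1.0) / 2.0, min=0.0, max=1.0)
    images = (255.0 * images.permute(0, 2, 3, 1).cpu().numpy()).astype(np.uint8)

# tile the 4 views horizontally and save
Image.fromarray(np.concatenate(list(images), axis=1)).save("astronaut_4view.png")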

Acknowledgement

This repository is heavily based on Stable Diffusion. We would like to thank the authors of these works for publicly releasing their code.

Citation

@article{shi2023MVDream,
  author = {Shi, Yichun and Wang, Peng and Ye, Jianglong and Mai, Long and Li, Kejie and Yang, Xiao},
  title = {MVDream: Multi-view Diffusion for 3D Generation},
  journal = {arXiv:2308.16512},
  year = {2023},
}

mvdream's People

Contributors

ildar-idrisov, seasonsh

mvdream's Issues

Question about the camera matrix

Thanks for sharing this exciting work :)
I have a question about the camera matrix.
In Zero-1-to-3, the camera matrix's dimension is 3 x 4, while in your case it is 16 (presumably 4 x 4). Why?
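
(Not an official answer, but for anyone who wants to check locally: the camera conditioning used in this README comes from get_camera, so the per-view dimensionality can be inspected directly. A minimal sketch; the printed shape and the 4 x 4 interpretation are assumptions consistent with the question above, not something confirmed here.)

from mvdream.camera_utils import get_camera

cam = get_camera(4)            # same call as in the README inference example
print(cam.shape)               # expected: torch.Size([4, 16]) -- one flattened matrix per view (assumption)
print(cam[0].reshape(4, 4))    # interpret the 16 values as a 4x4 extrinsic matrix (assumption)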

training code

Thanks for your excellent work. Would you mind releasing the training code?

camera distance in training

I noticed that the camera distance is 1 in your image generation script, and in the 3D generation code the camera distances are also normalized to 1. So, are the camera distances in training all set to 1, or are they in [0.9, 1.1] with 1 used only for generation?

Is it hard for MVDream to generate objects with little extent along the z-axis?

As in the title: in my trials on objects with a small z extent, such as "an empty plate", MVDream does not produce plausible results, but instead generates four views showing the same face of the plate. Is this a universal issue with MVDream? I suspect it might be due to the training dataset, since images of such flat objects are seldom rendered and included when training multi-view diffusion models.

Generating from images

Hi,
Thanks for open-sourcing this interesting repo and for your hard work.

If I want to generate, say, a chair, not from a text prompt but from an input image
(the motivation is that I want to generate the same chair from different angles),
is there any technique to do it? Maybe by changing the diffusion model a little?
Or by feeding my image in as the noise and putting "chair" in the text prompt?

AssertionError: Torch not compiled with CUDA enabled

Tried to install fresh in a conda venv

Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Loading model from cache file: C:\Users\Duckers\.cache\huggingface\hub\models--MVDream--MVDream\snapshots\d14ac9d78c48c266005729f2d5633f6c265da467\sd-v2.1-base-4view.pt
Traceback (most recent call last):
File "H:\Stable3D\MVDream-main\scripts\t2i.py", line 79, in
model.to(device)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 989, in to
return self._apply(convert)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 641, in _apply
module._apply(fn)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 641, in _apply
module._apply(fn)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 641, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 664, in apply
param_applied = fn(param)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 987, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\cuda_init
.py", line 221, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

(NVDream) H:\Stable3D\MVDream-main>

A question about Section A.5.1 in your paper (data preparation).

In section A.5.1 DATA PREPARATION AND DIFFUSION MODEL TRAINING:
The authors mentioned, "32 uniformly distributed azimuth angles are used for rendering, starting from the front view."
Does this mean that the rendered view must be frontal when the azimuth angle is 0?
I am curious how to implement that, or how to determine which view is the front view.
Is this constraint ensured manually by humans?

Also, will the training data used be released?
I sincerely hope these data can be released for research.

loading error when using SD1.5

RuntimeError: Error(s) in loading state_dict for LatentDiffusionInterface:
Unexpected key(s) in state_dict: "cond_stage_model.transformer.text_model.embeddings.position_ids".

LoRA compatibility question

Is there currently LoRA compatibility built into the codebase? And if not, would it be possible to add it in line with the goals of this repo?

Curious about the guidance scale used in your practice

Hi, I have a question about the guidance scale.

I have tested the 2.1 model on my machine, and I find that the generation quality is very poor when the guidance scale is relatively small (<20); the quality gets better and better as I increase it toward 60. Below are some of my test results.

guidance scale = 10: (image)

guidance scale = 15: (image)

guidance scale = 20: (image)

guidance scale = 40: (image)

guidance scale = 60: (image)

Since I notice that your T2I script sets the guidance scale to 10 by default, I'm wondering whether I did something wrong.

Also, the backgrounds in the images above are all the same brown color; is that expected?

question about applying the model

Can I use this model with other text-to-3D methods instead of Stable Diffusion, and if so, what needs to be changed and how?

More camera views?

Hi, and thanks for this amazing project.
Question: is it possible to generate more camera views, say 6, 8, or 12 instead of 4?

sd webui or diffusers

This is really good work; I like it very much.
I would like to know whether it could be used in sd-webui or diffusers as a base model?
Wishing you a pleasant life and smooth work.

Does camera conditioning affect style of generated images?

I was doing a small experiment on MVDream to evaluate consistency across generations with this piece of code:

from PIL import Image
import numpy as np
import torch 

from mvdream.camera_utils import get_camera
from mvdream.ldm.models.diffusion.ddim import DDIMSampler
from mvdream.model_zoo import build_model


def t2i(model, image_size, prompt, uc, sampler, step=20, scale=7.5, batch_size=8, ddim_eta=0., dtype=torch.float32, device="cuda", camera=None, num_frames=1, x_T=None):
    if type(prompt)!=list:
        prompt = [prompt]
    with torch.no_grad(), torch.autocast(device_type=device, dtype=dtype):
        c = model.get_learned_conditioning(prompt).to(device)
        c_ = {"context": c.repeat(batch_size,1,1)}
        uc_ = {"context": uc.repeat(batch_size,1,1)}
        if camera is not None:
            c_["camera"] = uc_["camera"] = camera
            c_["num_frames"] = uc_["num_frames"] = num_frames

        shape = [4, image_size // 8, image_size // 8]
        samples_ddim, _ = sampler.sample(S=step, conditioning=c_,
                                        batch_size=batch_size, shape=shape,
                                        verbose=False, 
                                        unconditional_guidance_scale=scale,
                                        unconditional_conditioning=uc_,
                                        eta=ddim_eta, x_T=x_T)
        x_sample = model.decode_first_stage(samples_ddim)
        x_sample = torch.clamp((x_sample + 1.0) / 2.0, min=0.0, max=1.0)
        x_sample = 255. * x_sample.permute(0,2,3,1).cpu().numpy()

    return list(x_sample.astype(np.uint8))

device = "cuda"
model = build_model("sd-v2.1-base-4view", ckpt_path=None)
model.device = device
model.to(device)
model.eval()

sampler = DDIMSampler(model)
uc = model.get_learned_conditioning( [""] ).to(device)

torch.manual_seed(12345)
torch.cuda.manual_seed_all(12345)

fixed_noise = torch.randn([8,4,32,32], device=device)

cameras = []

for azimuth_start in [90, 60]:
    camera = get_camera(4, elevation=15, azimuth_start=azimuth_start, azimuth_span=360)
    camera = camera.repeat(2,1).to(device)
    cameras.append(camera)

images = []

prompt = "gandalf smiling, 3D asset"

for camera in cameras:
    img = t2i(model, 256, prompt, uc, sampler, step=50, scale=10., batch_size=8, ddim_eta=0.0,
              dtype=torch.float16, device=device, camera=camera, num_frames=4, x_T=fixed_noise)
    img = np.concatenate(img, 1)
    images.append(img)

images = np.concatenate(images, 0)

Image.fromarray(images).save("gandalf.png")

TL;DR: the code freezes the noise for two generations with different sets of camera angles.

The output looks like this:

(output image: gandalf.png)

Although the styles for the two different sets of camera angles are similar, they are not the same. So I would not be able to create different views for one (prompt, starting seed) pair in a separate, independent generation unless I used the exact same set of camera positions.

Is this expected? Does the MVDream training regime introduce camera-dependent styles?

Generation speed quite different

From my observation, the generation speed differs quite a lot between prompts.

Some only take maybe 1 hour (on a 3090 GPU), but some cost 20+ hours.

Is it expected that some prompts take far more time than others, or should most generations take about the same time regardless of the prompt?

MVDreambooth

Hi,

Congrats on the great work!
Do you plan to release the code for training MVDreambooth? Or perhaps some pretrained LoRA checkpoints that I could load?
Thank you in advance for your reply!

evaluation code

Thank you for your excellent work. Can you provide evaluation code, such as for FID, IS, and CLIP score?

Does the code support image based prompting?

Can I supply one (or more) images to the model to learn a concept which can subsequently be used to generate a 3D model? Basically MV DreamBooth as mentioned in the paper.
Thanks!
