
mvdream's Introduction

MVDream

Yichun Shi, Peng Wang, Jianglong Ye, Long Mai, Kejie Li, Xiao Yang

| Project Page | 3D Generation | Paper | HuggingFace Demo (Coming) |

multiview diffusion

3D Generation

  • This repository only includes the diffusion model and 2D image generation code of the MVDream paper.
  • For 3D Generation, please check MVDream-threestudio.

Installation

You can use the same environment as Stable Diffusion for this repo, or set up the environment by installing the given requirements:

pip install -r requirements.txt

To use MVDream as a Python module, you can install it with pip install -e . or:

pip install git+https://github.com/bytedance/MVDream

Model Card

Our models are provided on the Hugging Face model page under the OpenRAIL license.

Model                Base Model                 Resolution
sd-v2.1-base-4view   Stable Diffusion 2.1 Base  4x256x256
sd-v1.5-4view        Stable Diffusion 1.5       4x256x256

By default, we use the SD-2.1-base model in our experiments.

Note that you don't have to manually download the checkpoints for the following scripts.

Text-to-Image

You can simply generate multi-view images by running the following command:

python scripts/t2i.py --text "an astronaut riding a horse"

We also provide a Gradio script for trying it out with a GUI:

python scripts/gradio_app.py

Usage

Load the Model

We provide two ways to load the models of MVDream:

  • Automatic: load the model config by model name, with the weights downloaded from Hugging Face.

    from mvdream.model_zoo import build_model
    model = build_model("sd-v2.1-base-4view")

  • Manual: load the model from a config file and a checkpoint file.

    import torch
    from omegaconf import OmegaConf
    from mvdream.ldm.util import instantiate_from_config
    config = OmegaConf.load("mvdream/configs/sd-v2-base.yaml")
    model = instantiate_from_config(config.model)
    model.load_state_dict(torch.load("path/to/sd-v2.1-base-4view.th", map_location="cpu"))
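
If you take the manual route, the checkpoint file still has to come from somewhere. Below is a minimal sketch of fetching it with huggingface_hub; the repo id "MVDream/MVDream" and filename "sd-v2.1-base-4view.pt" are assumptions inferred from the Hugging Face cache path that appears in an issue log further down, so verify them against the model page before relying on them.

# Hypothetical helper for the manual loading path above.
# Assumption: the checkpoint lives in repo "MVDream/MVDream" as "sd-v2.1-base-4view.pt",
# inferred from the cache path in the CUDA issue log below -- verify on the model page.
import torch
from huggingface_hub import hf_hub_download
from omegaconf import OmegaConf
from mvdream.ldm.util import instantiate_from_config

ckpt_path = hf_hub_download(repo_id="MVDream/MVDream", filename="sd-v2.1-base-4view.pt")
config = OmegaConf.load("mvdream/configs/sd-v2-base.yaml")
model = instantiate_from_config(config.model)
model.load_state_dict(torch.load(ckpt_path, map_location="cpu"))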

Inference

Here is a simple example for model inference:

import torch
from mvdream.camera_utils import get_camera
model.eval()
model.cuda()
with torch.no_grad():
    noise = torch.randn(4,4,32,32, device="cuda") # batch of 4x for 4 views, latent size 32=256/8
    t = torch.tensor([999]*4, dtype=torch.long, device="cuda") # same timestep for 4 views
    cond = {
        "context": model.get_learned_conditioning([""]*4).cuda(), # text embeddings
        "camera": get_camera(4).cuda(),
        "num_frames": 4,
    }
    eps = model.apply_model(noise, t, cond=cond)
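
The snippet above only queries a single noise prediction. To sample images end to end, the latents have to be run through a sampler and decoded. Below is a minimal sketch assembled from this example and the user code in the "Does camera conditioning affect style" issue further down; the 50 DDIM steps, elevation of 15, and guidance scale of 10 are illustrative defaults, not necessarily the exact settings of scripts/t2i.py.

import numpy as np
import torch
from PIL import Image

from mvdream.camera_utils import get_camera
from mvdream.ldm.models.diffusion.ddim import DDIMSampler
from mvdream.model_zoo import build_model

device = "cuda"
model = build_model("sd-v2.1-base-4view")
model.device = device
model.to(device)
model.eval()

sampler = DDIMSampler(model)
prompt = "an astronaut riding a horse"
num_frames = 4  # the released checkpoints generate 4 views jointly

with torch.no_grad():
    # conditional and unconditional inputs share the same cameras
    cond = {
        "context": model.get_learned_conditioning([prompt] * num_frames).to(device),
        "camera": get_camera(num_frames, elevation=15).to(device),
        "num_frames": num_frames,
    }
    uncond = {
        "context": model.get_learned_conditioning([""] * num_frames).to(device),
        "camera": cond["camera"],
        "num_frames": num_frames,
    }
    # 256x256 images correspond to 4x32x32 latents (256 / 8)
    latents, _ = sampler.sample(S=50, conditioning=cond, batch_size=num_frames,
                                shape=[4, 32, 32], verbose=False,
                                unconditional_guidance_scale=10.0,
                                unconditional_conditioning=uncond, eta=0.0)
    images = model.decode_first_stage(latents)
    images = torch.clamp((images + 1.0) / 2.0, min=0.0, max=1.0)
    images = (255.0 * images.permute(0, 2, 3, 1).cpu().numpy()).astype(np.uint8)

# tile the 4 views horizontally and save
Image.fromarray(np.concatenate(list(images), axis=1)).save("astronaut_4view.png")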

Acknowledgement

This repository is heavily based on Stable Diffusion. We would like to thank the authors of these works for publicly releasing their code.

Citation

@article{shi2023MVDream,
  author = {Shi, Yichun and Wang, Peng and Ye, Jianglong and Mai, Long and Li, Kejie and Yang, Xiao},
  title = {MVDream: Multi-view Diffusion for 3D Generation},
  journal = {arXiv:2308.16512},
  year = {2023},
}

mvdream's People

Contributors

ildar-idrisov, seasonsh

mvdream's Issues

Question about the camera matrix

Thanks for sharing this exciting work :)
I have a question about the camera matrix.
In Zero-1-to-3, the camera matrix's dimension is 3 x 4, while in your case it is 16 (presumably 4 x 4). Why?
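
(Not an official answer, but for anyone who wants to check locally: the camera conditioning used in this README comes from get_camera, so the per-view dimensionality can be inspected directly. A minimal sketch; the printed shape and the 4 x 4 interpretation are assumptions consistent with the question above, not something confirmed here.)

from mvdream.camera_utils import get_camera

cam = get_camera(4)            # same call as in the README inference example
print(cam.shape)               # expected: torch.Size([4, 16]) -- one flattened matrix per view (assumption)
print(cam[0].reshape(4, 4))    # interpret the 16 values as a 4x4 extrinsic matrix (assumption)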

training code

Thanks for your excellent work. Would you mind releasing the training code?

camera distance in training

I noticed that the camera distance is 1 in your image generation script, and in the 3D generation code the camera distances are also normalized to 1. So, are the camera distances in training all set to 1, or are they in [0.9, 1.1] with 1 used only for generation?

Is it hard for MVDream to generate objects with little extent along the z-axis?

As in the title: in my trials on objects with a small z extent, such as "an empty plate", MVDream does not produce plausible results, but instead generates four views showing the same face of the plate. Is this a universal issue with MVDream? I suspect it might be due to the training dataset, since images of such flat objects are seldom rendered and included when training multi-view diffusion models.

Generating from images

Hi,
Thanks for open-sourcing this interesting repo and for your hard work.

If I want to generate, say, a chair, not from a text prompt but from an input image
(the motivation is that I want to generate the same chair from different angles),
is there any technique to do it? Maybe by changing the diffusion model a little?
Or by feeding my image in as the noise and putting "chair" in the text prompt?

AssertionError: Torch not compiled with CUDA enabled

Tried to install fresh in a conda venv

Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is None and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 640, context_dim is 1024 and using 10 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is None and using 5 heads.
Setting up MemoryEfficientCrossAttention. Query dim is 320, context_dim is 1024 and using 5 heads.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla-xformers' with 512 in_channels
building MemoryEfficientAttnBlock with 512 in_channels...
Loading model from cache file: C:\Users\Duckers\.cache\huggingface\hub\models--MVDream--MVDream\snapshots\d14ac9d78c48c266005729f2d5633f6c265da467\sd-v2.1-base-4view.pt
Traceback (most recent call last):
File "H:\Stable3D\MVDream-main\scripts\t2i.py", line 79, in
model.to(device)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 989, in to
return self._apply(convert)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 641, in _apply
module._apply(fn)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 641, in _apply
module._apply(fn)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 641, in _apply
module._apply(fn)
[Previous line repeated 1 more time]
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 664, in apply
param_applied = fn(param)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\nn\modules\module.py", line 987, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
File "C:\Users\Duckers\anaconda3\envs\NVDream\lib\site-packages\torch\cuda_init
.py", line 221, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

(NVDream) H:\Stable3D\MVDream-main>

A question about Section A.5.1 in your paper (data preparation).

In section A.5.1 DATA PREPARATION AND DIFFUSION MODEL TRAINING:
The authors mentioned, "32 uniformly distributed azimuth angles are used for rendering, starting from the front view."
Does this mean that the rendered view must be frontal when the azimuth angle is 0?
I am curious how to implement that, or how to determine which view is the front view.
Is this constraint ensured manually by humans?

Also, will the training data used be released?
I sincerely hope these data can be released for research.

loading error when using SD1.5

RuntimeError: Error(s) in loading state_dict for LatentDiffusionInterface:
Unexpected key(s) in state_dict: "cond_stage_model.transformer.text_model.embeddings.position_ids".

LoRA compatibility question

Is there currently LoRA compatibility built into the codebase? And if not, would it be possible to add it in line with the goals of this repo?

Curious about the guidance scale used in your practice

Hi, I have a question about the guidance scale.

I have tested the 2.1 model on my machine, and I find that the generation quality is very poor when the guidance scale is relatively small (<20); the quality gets better and better as I increase it toward 60. Below are some of my test results.

guidance scale = 10: (image)

guidance scale = 15: (image)

guidance scale = 20: (image)

guidance scale = 40: (image)

guidance scale = 60: (image)

Since I notice that your T2I script sets the guidance scale to 10 by default, I'm wondering whether I did something wrong.

Also, the backgrounds in the images above are all the same brown color; is that expected?

question about applying the model

Can I use this model with other text-to-3D methods instead of Stable Diffusion, and if so, what needs to be changed and how?

More camera views?

Hi, and thanks for this amazing project.
Question: is it possible to generate more camera views, say 6, 8, or 12 instead of 4?

sd webui or diffusers

This is really good work; I like it very much.
I would like to know whether it could be used in sd-webui or diffusers as a base model?
Wishing you a pleasant life and smooth work.

Does camera conditioning affect style of generated images?

I was doing a small experiment on MVDream to evaluate consistency across generations with this piece of code:

from PIL import Image
import numpy as np
import torch 

from mvdream.camera_utils import get_camera
from mvdream.ldm.models.diffusion.ddim import DDIMSampler
from mvdream.model_zoo import build_model


def t2i(model, image_size, prompt, uc, sampler, step=20, scale=7.5, batch_size=8, ddim_eta=0., dtype=torch.float32, device="cuda", camera=None, num_frames=1, x_T=None):
    if type(prompt)!=list:
        prompt = [prompt]
    with torch.no_grad(), torch.autocast(device_type=device, dtype=dtype):
        c = model.get_learned_conditioning(prompt).to(device)
        c_ = {"context": c.repeat(batch_size,1,1)}
        uc_ = {"context": uc.repeat(batch_size,1,1)}
        if camera is not None:
            c_["camera"] = uc_["camera"] = camera
            c_["num_frames"] = uc_["num_frames"] = num_frames

        shape = [4, image_size // 8, image_size // 8]
        samples_ddim, _ = sampler.sample(S=step, conditioning=c_,
                                        batch_size=batch_size, shape=shape,
                                        verbose=False, 
                                        unconditional_guidance_scale=scale,
                                        unconditional_conditioning=uc_,
                                        eta=ddim_eta, x_T=x_T)
        x_sample = model.decode_first_stage(samples_ddim)
        x_sample = torch.clamp((x_sample + 1.0) / 2.0, min=0.0, max=1.0)
        x_sample = 255. * x_sample.permute(0,2,3,1).cpu().numpy()

    return list(x_sample.astype(np.uint8))

device = "cuda"
model = build_model("sd-v2.1-base-4view", ckpt_path=None)
model.device = device
model.to(device)
model.eval()

sampler = DDIMSampler(model)
uc = model.get_learned_conditioning( [""] ).to(device)

torch.manual_seed(12345)
torch.cuda.manual_seed_all(12345)

fixed_noise = torch.randn([8,4,32,32], device=device)

cameras = []

for azimuth_start in [90, 60]:
    camera = get_camera(4, elevation=15, azimuth_start=azimuth_start, azimuth_span=360)
    camera = camera.repeat(2,1).to(device)
    cameras.append(camera)

images = []

prompt = "gandalf smiling, 3D asset"

for camera in cameras:
    img = t2i(model, 256, prompt, uc, sampler, step=50, scale=10., batch_size=8, ddim_eta=0.0,
              dtype=torch.float16, device=device, camera=camera, num_frames=4, x_T=fixed_noise)
    img = np.concatenate(img, 1)
    images.append(img)

images = np.concatenate(images, 0)

Image.fromarray(images).save("gandalf.png")

TL;DR: the code freezes the noise for two generations with different sets of camera angles.

The output looks like this:

(output image: gandalf.png)

Although the styles for the two different sets of camera angles are similar, they are not the same. So I would not be able to create different views for one (prompt, starting seed) pair in a separate, independent generation unless I used the exact same set of camera positions.

Is this expected? Does the MVDream training regime introduce camera-dependent styles?

Generation speed quite different

From my observation, the generation speed differs quite a lot between prompts.

Some only take maybe 1 hour (on a 3090 GPU), but some cost 20+ hours.

Is it expected that some prompts take far more time than others, or should most generations take about the same time regardless of the prompt?

MVDreambooth

Hi,

Congrats on the great work!
Do you plan to release the code for training MVDreambooth? Or perhaps some pretrained LoRA checkpoints that I could load?
Thank you in advance for your reply!

evaluation code

Thank you for your excellent work. Can you provide evaluation code, such as for FID, IS, and CLIP score?

Does the code support image based prompting?

Can I supply one (or more) images to the model to learn a concept which can subsequently be used to generate a 3D model? Basically MV DreamBooth as mentioned in the paper.
Thanks!
