magicdrive's Issues

How long will the training take?

Thanks for your impressive work!

I trained the base-resolution model with 8 A100 GPUs, but the progress bar shows a total training time of over a week.
Is this the normal training time?

Steps:  26%|██▌       | 106314/410550 [58:15:26<149:16:23,  1.77s/it, loss=0.0122, lr0=8e-5]
Steps:  26%|██▌       | 106372/410550 [58:16:59<145:08:15,  1.72s/it, loss=0.16, lr0=8e-5]  
Steps:  26%|██▌       | 106430/410550 [58:19:27<166:34:50,  1.97s/it, loss=0.151, lr0=8e-5]
Steps:  26%|██▌       | 106488/410550 [58:20:48<151:37:45,  1.80s/it, loss=0.0792, lr0=8e-5]
Steps:  26%|██▌       | 106546/410550 [58:22:49<158:58:54,  1.88s/it, loss=0.121, lr0=8e-5] 
Steps:  26%|██▌       | 106604/410550 [58:24:31<156:07:56,  1.85s/it, loss=0.0271, lr0=8e-5]
Steps:  26%|██▌       | 106662/410550 [58:26:46<167:58:01,  1.99s/it, loss=0.0966, lr0=8e-5]
Steps:  26%|██▌       | 106720/410550 [58:28:07<152:58:54,  1.81s/it, loss=0.0423, lr0=8e-5]
Steps:  26%|██▌       | 106778/410550 [58:30:36<171:55:15,  2.04s/it, loss=0.0431, lr0=8e-5]
Steps:  26%|██▌       | 106836/410550 [58:32:36<172:49:27,  2.05s/it, loss=0.0184, lr0=8e-5]
Steps:  26%|██▌       | 106894/410550 [58:34:31<171:09:10,  2.03s/it, loss=0.189, lr0=8e-5] 
Steps:  26%|██▌       | 106952/410550 [58:36:03<159:48:38,  1.90s/it, loss=0.148, lr0=8e-5]
Steps:  26%|██▌       | 107010/410550 [58:38:01<163:17:47,  1.94s/it, loss=0.126, lr0=8e-5]
Steps:  26%|██▌       | 107068/410550 [58:39:42<158:40:36,  1.88s/it, loss=0.423, lr0=8e-5]
Steps:  26%|██▌       | 107126/410550 [58:41:24<155:23:04,  1.84s/it, loss=0.0905, lr0=8e-5]
Steps:  26%|██▌       | 107184/410550 [58:43:09<154:28:57,  1.83s/it, loss=0.144, lr0=8e-5] 
Steps:  26%|██▌       | 107242/410550 [58:45:27<168:08:17,  2.00s/it, loss=0.0132, lr0=8e-5]
Steps:  26%|██▌       | 107300/410550 [58:47:20<166:58:52,  1.98s/it, loss=0.082, lr0=8e-5] 
Steps:  26%|██▌       | 107358/410550 [58:49:02<161:19:43,  1.92s/it, loss=0.0878, lr0=8e-5]
Steps:  26%|██▌       | 107416/410550 [58:50:58<163:25:30,  1.94s/it, loss=0.0937, lr0=8e-5]
Steps:  26%|██▌       | 107474/410550 [58:53:03<168:44:02,  2.00s/it, loss=0.213, lr0=8e-5]
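
For reference, a rough back-of-the-envelope estimate from the progress bar above, assuming the ~1.9 s/it rate holds for the remaining steps:

    # Rough estimate of remaining/total training time from the tqdm readout above.
    # Assumes the ~1.9 s/it rate stays roughly constant; actual throughput may vary.
    total_steps = 410_550
    done_steps = 107_474
    sec_per_it = 1.9

    remaining_h = (total_steps - done_steps) * sec_per_it / 3600
    total_h = total_steps * sec_per_it / 3600
    print(f"remaining ~{remaining_h:.0f} h ({remaining_h / 24:.1f} days), "
          f"total ~{total_h:.0f} h ({total_h / 24:.1f} days)")
    # -> remaining ~160 h (6.7 days), total ~217 h (9.0 days), consistent with the ETA shown above.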

How to calculate the FID?

Hello author, I saw inception.py under magicdrive/misc, but I could not find the actual FID computation between generated and real images. How did you calculate the FID in the paper?
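
For reference, a minimal, generic FID sketch using torchmetrics; this is an assumption, not the authors' pipeline, and their inception.py, preprocessing, and sample counts may differ, all of which affect the absolute FID value:

    # Generic FID sketch with torchmetrics (assumption: NOT the authors' exact pipeline).
    # Both folders must use the same preprocessing; the paths below are hypothetical placeholders.
    import numpy as np
    import torch
    from pathlib import Path
    from PIL import Image
    from torchmetrics.image.fid import FrechetInceptionDistance

    def load_images(folder, size=(400, 224)):
        # Returns a uint8 tensor of shape (N, 3, H, W), as FrechetInceptionDistance expects by default.
        imgs = []
        for p in sorted(Path(folder).glob("*.png")):
            img = Image.open(p).convert("RGB").resize(size, Image.BICUBIC)
            imgs.append(torch.from_numpy(np.array(img)).permute(2, 0, 1))
        return torch.stack(imgs)

    fid = FrechetInceptionDistance(feature=2048)
    fid.update(load_images("path/to/real"), real=True)
    fid.update(load_images("path/to/generated"), real=False)
    print(float(fid.compute()))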

Training Time

Dear Authors,

How long does this method take to train?

About downloading .pkl file of dataset

Excuse me, in the 'Prepare Data' section you say:
"You can download the .pkl files from OneDrive. They should be enough for training and testing."
However, the OneDrive link does not work. Could you please provide a valid link for the .pkl files of the nuScenes dataset, similar to bevfusion's instructions?

xformers running error: NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs

I am sorry to bother you, but I am getting the following error when I try to train the video generation model after successfully installing xformers from the repo. I followed the installation instructions for A-series GPUs strictly, and I can run the program successfully with xformers disabled.

NotImplementedError: No operator found for memory_efficient_attention_forward with inputs:
query : shape=(1, 2, 1, 40) (torch.float32)
key : shape=(1, 2, 1, 40) (torch.float32)
value : shape=(1, 2, 1, 40) (torch.float32)
attn_bias : <class 'NoneType'>
p : 0.0
flshattF is not supported because:
xFormers wasn't build with CUDA support
dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
Operator wasn't built - see python -m xformers.info for more info
tritonflashattF is not supported because:
xFormers wasn't build with CUDA support
dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
Operator wasn't built - see python -m xformers.info for more info
triton is not available
cutlassF is not supported because:
xFormers wasn't build with CUDA support
smallkF is not supported because:
xFormers wasn't build with CUDA support
max(query.shape[-1] != value.shape[-1]) > 32
unsupported embed per head: 40

Thank you for your consistently helpful responses.
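
For anyone hitting this, a minimal check (a sketch, not part of MagicDrive) of whether the installed xformers build actually supports memory-efficient attention with fp16 inputs on the GPU, in addition to the python -m xformers.info command mentioned in the message above:

    # Quick sanity check: does this xformers build run memory_efficient_attention on CUDA with fp16?
    # If xformers was built without CUDA support, this raises the same NotImplementedError as above.
    import torch
    import xformers.ops as xops

    q = torch.randn(1, 2, 8, 40, dtype=torch.float16, device="cuda")  # (batch, seq, heads, head_dim)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = xops.memory_efficient_attention(q, k, v)
    print(out.shape, out.dtype)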

torch version and xformers install

Thanks for sharing this work! I would like to ask whether the torch version needs to be 1.12.0+ for training, since xformers requires 1.12.0+. However, the torch version in the README is 1.10.2.

Can nuScenes mini be used for training?

Thank you for your great work!!
I tried to use nuScenes mini to run a training demo, but it raises the error "No such file or directory: 'data/nuscenes/maps/expansion/singapore-onenorth.json'", so I wonder whether this version can only be trained on the full nuScenes dataset.

Weights for video generation

Hi,
Thank you so much for your wonderful work. I have a simple question: are the weights you posted for image generation the same as those used for video generation? I saw that you posted the training process for video generation; however, the nuScenes dataset is quite large and training takes time, so I am wondering whether I could just use your weights to test video generation. I really appreciate your help!

Best,

The inputs of view-conditioned generation.

Hi, congrats on your excellent work! I have a question regarding view-conditioned generation in Fig. 6:
[Figure 6 from the paper]
I am wondering how the condition view image is provided to the denoising process. Is it generated by DDIM inversion?

The generated images are abnormal when I use the nuScenes dataset

Following your README, I can run the demo successfully. But when I train and test with the nuScenes mini dataset, the generated images are abnormal. Can you help us see where the problem lies?

  1. use mini nuScenes dataset
    python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes_mmdet3d_2 --extra-tag nuscenes --version v1.0-mini

  2. train model in debug config with 1xV100
    accelerate launch --mixed_precision fp16 --gpu_ids all --num_processes 1 tools/train.py +exp=224x400 runner=debug runner.validation_before_run=true --version

  3. test
    python tools/test.py resume_from_checkpoint=magicdrive-log/debug/SDv1.5mv-rawbox_2024-02-27_09-57_224x400

[attached: abnormal generated image samples]

Bug in nuscenes_t_dataset

There is a bug in the build_clips function on line 59 of nuscenes_t_dataset.py in the video branch of the current repository. The line currently reads:
len(scene[start] >= 33)
It should probably be:
len(scene[start]) >= 33
Please investigate and correct this issue. Thank you!

Question about demo/run.py show_box = true

When show_box = true in test_config.yaml

Error executing job with overrides: ['resume_from_checkpoint=/home/magicdrive-files/SDv1.5mv-rawbox_2023-09-07_18-39_224x400', '++runner.enable_xformers_memory_efficient_attention=false']
Traceback (most recent call last):
  File "demo/run.py", line 69, in main
    map_size=target_map_size)
  File "./magicdrive/misc/test_utils.py", line 301, in run_one_batch
    for bi in range(bs)
  File "./magicdrive/misc/test_utils.py", line 301, in <listcomp>
    for bi in range(bs)
  File "./magicdrive/misc/test_utils.py", line 56, in draw_box_on_imgs
    val_input['meta_data']['img_aug_matrix'][idx].data.numpy(),
  File "./magicdrive/runner/box_visualizer.py", line 111, in show_box_on_views
    img_out = np.asarray(Image.open(temp_path))  # ensure image is loaded
  File "/opt/conda/envs/magicdrivepy37/lib/python3.7/site-packages/PIL/Image.py", line 3236, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '.tmp/tmprxi40ph_.png'

When show_box = false in test_config.yaml
There is a problem with the loop iteration over gen_imgs_wb_list. This loop does not execute because there are no elements in gen_imgs_wb_list.

for map_img, ori_imgs, ori_imgs_wb, gen_imgs_list, gen_imgs_wb_list in zip(*return_tuples):

I deleted gen_imgs_wb_list from that line and commented out the code related to show_box=true, and the script (demo/run.py) then runs correctly, so I think there is no problem with the environment.
I'm using the nuScenes v1.0 mini dataset, and I suspect this happened because the files related to the 3D box labels in the ground-truth data are not being read.

About Todo list updates

I have observed two pending tasks on your to-do list that align with my interests. May I inquire whether it would be convenient for you to share the associated configurations, weights, and code at this time?

  • config and pretrained weight for high resolution
  • train and test code for CVT and BEVFusion

FID metric

Thank you very much for your work, but I would like to ask a question about FID. I used the weights you published to generate validation-set images at 224×400, processed the real validation set the same way, and saved everything as PNG images. However, the FID I computed is much higher than the number in the paper. How did you calculate it? Thank you.

'BasicTransformerBlock' object has no attribute '_args'

I can't run the full demo or the gradio demo because of the following error:

AttributeError: 'BasicTransformerBlock' object has no attribute '_args'

if crossview_attn_type == "basic":
    _set_module(self, name, BasicMultiviewTransformerBlock(
        **mod._args,  # problem line
        neighboring_view_pair=neighboring_view_pair,
        neighboring_attn_type=neighboring_attn_type,
        zero_module_type=zero_module_type,
    ))

I understand that the code is attempting to retrieve the original constructor arguments from the BasicTransformerBlock, but in my environment the _args attribute does not exist.

I downloaded and unzipped the pretrained model to /pretrained and have an otherwise functional environment.

Error when installing xformers in third_party

thanks for your work!

When I install xformers in third_party using "pip install .", I get the following error:

[screenshot of the pip install error]

If I modify the code in /opt/conda/lib/python3.7/subprocess.py, xformers still does not work:
packaging.version.InvalidVersion: Invalid version: '0.0.19xxx'


How to generate some specific frames?

I noticed that you specify the index of the generated frame in configs/runner/default.yaml. Now I want to generate a certain frame given its sample token. How do I determine its index? Thanks!
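
For reference, a minimal sketch (an assumption, not the repository's own logic) that looks up the position of a sample token within the nuScenes validation split using the devkit. Whether this position matches the index used in configs/runner/default.yaml depends on how MagicDrive's dataset orders its samples, so it may need adjustment:

    # Hedged sketch: position of a sample token within the nuScenes validation split.
    # Assumes samples are ordered as the devkit lists them; MagicDrive may order them differently.
    from nuscenes.nuscenes import NuScenes
    from nuscenes.utils.splits import create_splits_scenes

    nusc = NuScenes(version="v1.0-trainval", dataroot="data/nuscenes", verbose=False)
    val_scene_names = set(create_splits_scenes()["val"])
    scene_name = {s["token"]: s["name"] for s in nusc.scene}

    val_samples = [s for s in nusc.sample if scene_name[s["scene_token"]] in val_scene_names]

    target_token = "<your-sample-token>"  # hypothetical placeholder
    index = next(i for i, s in enumerate(val_samples) if s["token"] == target_token)
    print(index)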

How to replace the UNet with a DiT architecture

Dear author, thank you very much for presenting such a valuable project and open-sourcing it. However, on the project introduction page you mentioned the possibility of replacing the UNet architecture with a DiT architecture. Could you kindly point me to where this switch is performed?

Some questions about model training

I followed the training script, and the format of the resulting checkpoint is as follows.
[screenshot: checkpoint format produced by my training run]

However, the format of the officially provided pretrained weights is:
[screenshot: format of the official pretrained weights]

Save the generated images at a size of 1600×900

> Yes, we kept the original data processing pipeline. Actually, we upscale and pad each generated image to 1600x900 and save them to disk, so that we can run the original code without any change.

Thanks for your reply!

May I ask if you could explain the specific upsampling and padding operations here?
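
For what it's worth, a minimal sketch of one way to do this; the exact interpolation and padding placement are assumptions, not the authors' code:

    # Hedged sketch: upscale a 400x224 generated view to 1600 px wide, then pad to 1600x900.
    from PIL import Image

    def to_nuscenes_size(img: Image.Image, target=(1600, 900)) -> Image.Image:
        tw, th = target
        scale = tw / img.width                                                 # 1600 / 400 = 4.0
        resized = img.resize((tw, round(img.height * scale)), Image.BICUBIC)   # -> 1600 x 896
        canvas = Image.new("RGB", target, (0, 0, 0))
        canvas.paste(resized, (0, (th - resized.height) // 2))                 # pad the remaining 4 px
        return canvas

    out = to_nuscenes_size(Image.open("gen_view.png"))  # hypothetical file name
    out.save("gen_view_1600x900.png")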

How to generate videos in MagicDrive-t using pretrained models.

I would like to directly obtain visualizations in MagicDrive-t using the pretrained weights for rawbox_mv2.0t_0.4.3.yaml from OneDrive. Could you please provide specific execution scripts, preferably for generating frames at different time intervals? Thank you very much.

How to run the codebase with a higher CUDA version

Thank you for your excellent work. I have noticed that the codebase relies on multiple specific versions of the mm-series repositories, and the highest GPU compute capability supported by PyTorch with CUDA 10.2 is sm_70. Although a V100 can be used, many newer GPUs such as the A series or L40S cannot run with sm_70. Is it possible to make the code run on these GPUs?

About video quality

Dear authors,

Thanks for your great work!

I have a question regarding the quality of the generated video. I loaded the pretrained video unet and got videos like this:
[attached: generated video GIF]
There seem to be some artifacts in the video, and some views are blurrier than others.

I am wondering if this is a normal phenomenon or if I did something wrong.

Thanks!

question about temp_attn_type

"ts_first_ff", "s_first_t_ff", "s_ff_t_last", "t_first_ff", "t_ff", "_ts_first_ff", "_s_first_t_ff", "_s_ff_t_last"
What do these mean? Is there any documentation?

Trying to generate without any conditions

I am having a hard time installing and running some of the libraries required for the demo and I want to perform inference with no conditioning (generation from pretrained weights with no map or boxes whatsoever). So I wrote this script:

import torch
from typing import List
from PIL import Image
from diffusers import UniPCMultistepScheduler
from magicdrive.pipeline.pipeline_bev_controlnet import (
    StableDiffusionBEVControlNetPipeline,
    BEVStableDiffusionPipelineOutput,
)
from magicdrive.networks.unet_addon_rawbox import BEVControlNetModel  # ControlNet class
from magicdrive.networks.unet_2d_condition_multiview import UNet2DConditionModelMultiview

pipe_param = {}
controlnet = BEVControlNetModel.from_pretrained('magicdrive_weights/SDv1.5mv-rawbox_2023-09-07_18-39_224x400/controlnet', bbox_embedder_cls= "magicdrive.networks.bbox_embedder.ContinuousBBoxWithTextEmbedding")
controlnet.eval()  # from_pretrained will set to eval mode by default
pipe_param["controlnet"] = controlnet

unet = UNet2DConditionModelMultiview.from_pretrained('magicdrive_weights/SDv1.5mv-rawbox_2023-09-07_18-39_224x400/unet')
unet.eval()
pipe_param["unet"] = unet

pipe = StableDiffusionBEVControlNetPipeline.from_pretrained('magicdrive_weights/runwayml:stable-diffusion-v1-5', **pipe_param, safety_checker=None, feature_extractor=None)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

pipe.enable_xformers_memory_efficient_attention()

generator = None

pipeline_param = {
    "guidance_scale": 2,  # if > 1, enable classifier-free guidance
    "num_inference_steps": 20,
    "eta": 0.0,
    "controlnet_conditioning_scale": 1.0,
    "guess_mode": False,
    "use_zero_map_as_unconditional": True,
    "bbox_max_length": None,
}

image: BEVStableDiffusionPipelineOutput = pipe(
    prompt='a vehicle driving in the rain in bright daylight',
    image=torch.zeros(1,8,200,200),
    camera_param=torch.zeros(1,6,3,7),
    height=224,
    width=400,
    generator=generator,
    bev_controlnet_kwargs={"bboxes_3d_data":None},
    **pipeline_param,
)
image: List[List[Image.Image]] = image.images

Are these the right steps to reproduce inference? I am getting pretty much the same image for all views.
[attached: generated output]

The approximate training time for the video generation model

Hello,

Thank you for your excellent work! I would like to ask about the approximate training time for the video generation model. I followed the instructions from this link and used the command scripts/dist_train.sh 8 runner=8gpus_t +exp=rawbox_mv2.0t_0.4.3 for training. However, it took me over 30 hours to train for 5000 steps. I would like to know if this is normal because it was mentioned that approximately 80,000 steps are needed for training, which would take a considerable amount of time.
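
Extrapolating from these numbers (assuming the rate stays constant):

    # Rough extrapolation from the figures above: 30 hours for 5,000 steps.
    hours_per_step = 30 / 5000
    print(hours_per_step * 80_000)  # ~480 hours, i.e. roughly 20 days at this rate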

Thank you very much for your help!

No module named 'xformers._flash_attn'

Hello, thanks for your excellent work. I am getting the error "No module named 'xformers._flash_attn'" when I run the demo command you gave.
Environment:
CUDA 10.2
PyTorch 1.10.2
gcc 7.4.0
NVIDIA A100

Can the generated videos improve MOT performance?

I saw in your paper that the generated images can improve 3D object detection KPIs. I checked your generated videos, and the consistency of the generated cars and other objects across the image sequence is not very good. So I want to know: would this inconsistency hurt MOT KPIs if I use these generated videos to augment the training dataset?

[video-generation-ego-movement] Question about video generation

Good afternoon,

For video generation, the MagicDrive paper states that only the first and the last frames have bounding boxes (section 5.4). I have the following question:

  • How do you encode the movement of the car in the last frame, relative to its initial position in the first frame?

The ego pose of the car changes over these 7 frames (around 4 seconds), but if I understand correctly, both the bounding boxes and the camera poses use the ego point of the car as reference. Therefore, they do not inject any information about the change in the car's ego pose from its starting position to the end. In my mind, this is important information to inject into the video generation model, but I may be missing something.

Thank you in advance for your feedback.

[MagicDrive3D code]

Thank you for your great work! When do you plan to release the code for MagicDrive3D? I find the MagicDrive series to be driving world models that could genuinely be put to practical use.

Conditioning on Segmentation Maps

Hello, great work!

I was wondering whether you have thought about conditioning on image segmentation maps (generated with some algorithm on the nuScenes images).
I am not quite sure, but I would expect segmentation maps to carry more information about the shapes of the scene/objects than 3D bboxes + class labels; though maybe that is not the case if you already considered and discarded the idea.

Thank you!!

Question about view consistency

Does the term "hidden state of neighbor" refer to the latent representation of the neighboring image or the scene-level encoding embedding?
Additionally, when addressing view consistency, does the model output multiple-view results simultaneously or only one view?

Your Loss is NaN

Steps: 0%| | 4680/1172500 [3:15:04<781:36:43, 2.41s/it, loss=0.0968, lr0=4e-5]Error executing job with overrides: ['+exp=224x400', 'runner=4gpus']
Traceback (most recent call last):
File "tools/train.py", line 110, in main
runner.run()
File "/data/******************./magicdrive/runner/base_runner.py", line 352, in run
raise RuntimeError('Your loss is NaN.')
RuntimeError: Your loss is NaN.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Steps: 0%| | 4734/1172500 [3:18:08<814:35:08, 2.51s/it, loss=0.294, lr0=4e-5]

Hey, I train the model on 4 V100 GPUs with batch_size=3 and learning rate 4e-5, but I got this error. Has anyone encountered it before? Thanks a lot.
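
For reference, a minimal debugging sketch (an assumption, not part of the repo) for narrowing down where the NaN first appears; common mitigations include lowering the learning rate or clipping gradients, but locating the offending step first helps:

    # Hedged debugging sketch: locate the first non-finite value during training.
    # set_detect_anomaly slows training noticeably, so enable it only to reproduce the failure.
    import torch

    torch.autograd.set_detect_anomaly(True)  # raises at the backward op that produced NaN/Inf

    def check_finite(name, tensor):
        # Call this on the loss (and optionally intermediate activations) every step.
        if not torch.isfinite(tensor).all():
            raise RuntimeError(f"{name} became non-finite at this step")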

Some confusion about BEVFusion detection scores.

Hi, I would like to express my appreciation for your outstanding work. I have some further questions regarding the BEVFusion detection scores.

I noticed that in the first and third rows of Table 2 in your paper, the BEVFusion detection scores you report without data augmentation are much lower than those officially reported by BEVFusion (https://github.com/mit-han-lab/bevfusion/). This is very confusing.

Thank you, and I look forward to your response!
Best regards.

Where can I find the forked version of mmdet3d required by the video branch?

Hi,

the README.md file in the video branch contains this:

Datasets

Please prepare the nuScenes dataset as [bevfusion's instructions](https://github.com/mit-han-lab/bevfusion#data-preparation). 

Note:
Run with our forked version of mmdet3d.

however I don't know where I can find that forked version of mmdet3d. Could the URL be added in the file?

I'm asking because I'm getting weird exceptions when using the official version of mmdetection3d to generate the .json files required by scripts/ann_generator.sh in ASAP. I am not sure whether not using the forked version is the problem.

Thanks in advance.
