magicdrive's Issues

How long will the training take?

Thanks for your impressive work!

I trained the base-resolution model with 8 A100 GPUs, but the progress bar shows a total training time of over a week.
Is this the normal training time?

Steps:  26%|██▌       | 106314/410550 [58:15:26<149:16:23,  1.77s/it, loss=0.0122, lr0=8e-5]
Steps:  26%|██▌       | 106372/410550 [58:16:59<145:08:15,  1.72s/it, loss=0.16, lr0=8e-5]  
Steps:  26%|██▌       | 106430/410550 [58:19:27<166:34:50,  1.97s/it, loss=0.151, lr0=8e-5]
Steps:  26%|██▌       | 106488/410550 [58:20:48<151:37:45,  1.80s/it, loss=0.0792, lr0=8e-5]
Steps:  26%|██▌       | 106546/410550 [58:22:49<158:58:54,  1.88s/it, loss=0.121, lr0=8e-5] 
Steps:  26%|██▌       | 106604/410550 [58:24:31<156:07:56,  1.85s/it, loss=0.0271, lr0=8e-5]
Steps:  26%|██▌       | 106662/410550 [58:26:46<167:58:01,  1.99s/it, loss=0.0966, lr0=8e-5]
Steps:  26%|██▌       | 106720/410550 [58:28:07<152:58:54,  1.81s/it, loss=0.0423, lr0=8e-5]
Steps:  26%|██▌       | 106778/410550 [58:30:36<171:55:15,  2.04s/it, loss=0.0431, lr0=8e-5]
Steps:  26%|██▌       | 106836/410550 [58:32:36<172:49:27,  2.05s/it, loss=0.0184, lr0=8e-5]
Steps:  26%|██▌       | 106894/410550 [58:34:31<171:09:10,  2.03s/it, loss=0.189, lr0=8e-5] 
Steps:  26%|██▌       | 106952/410550 [58:36:03<159:48:38,  1.90s/it, loss=0.148, lr0=8e-5]
Steps:  26%|██▌       | 107010/410550 [58:38:01<163:17:47,  1.94s/it, loss=0.126, lr0=8e-5]
Steps:  26%|██▌       | 107068/410550 [58:39:42<158:40:36,  1.88s/it, loss=0.423, lr0=8e-5]
Steps:  26%|██▌       | 107126/410550 [58:41:24<155:23:04,  1.84s/it, loss=0.0905, lr0=8e-5]
Steps:  26%|██▌       | 107184/410550 [58:43:09<154:28:57,  1.83s/it, loss=0.144, lr0=8e-5] 
Steps:  26%|██▌       | 107242/410550 [58:45:27<168:08:17,  2.00s/it, loss=0.0132, lr0=8e-5]
Steps:  26%|██▌       | 107300/410550 [58:47:20<166:58:52,  1.98s/it, loss=0.082, lr0=8e-5] 
Steps:  26%|██▌       | 107358/410550 [58:49:02<161:19:43,  1.92s/it, loss=0.0878, lr0=8e-5]
Steps:  26%|██▌       | 107416/410550 [58:50:58<163:25:30,  1.94s/it, loss=0.0937, lr0=8e-5]
Steps:  26%|██▌       | 107474/410550 [58:53:03<168:44:02,  2.00s/it, loss=0.213, lr0=8e-5]
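
For reference, a rough back-of-the-envelope estimate from the progress bar above, assuming the ~1.9 s/it rate holds for the remaining steps:

    # Rough estimate of remaining/total training time from the tqdm readout above.
    # Assumes the ~1.9 s/it rate stays roughly constant; actual throughput may vary.
    total_steps = 410_550
    done_steps = 107_474
    sec_per_it = 1.9

    remaining_h = (total_steps - done_steps) * sec_per_it / 3600
    total_h = total_steps * sec_per_it / 3600
    print(f"remaining ~{remaining_h:.0f} h ({remaining_h / 24:.1f} days), "
          f"total ~{total_h:.0f} h ({total_h / 24:.1f} days)")
    # -> remaining ~160 h (6.7 days), total ~217 h (9.0 days), consistent with the ETA shown above.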

How to calculate the FID?

Hello author, I saw inception.py under magicdrive/misc, but I could not find the actual FID computation between generated and real images. How did you calculate the FID in the paper?
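
For reference, a minimal, generic FID sketch using torchmetrics; this is an assumption, not the authors' pipeline, and their inception.py, preprocessing, and sample counts may differ, all of which affect the absolute FID value:

    # Generic FID sketch with torchmetrics (assumption: NOT the authors' exact pipeline).
    # Both folders must use the same preprocessing; the paths below are hypothetical placeholders.
    import numpy as np
    import torch
    from pathlib import Path
    from PIL import Image
    from torchmetrics.image.fid import FrechetInceptionDistance

    def load_images(folder, size=(400, 224)):
        # Returns a uint8 tensor of shape (N, 3, H, W), as FrechetInceptionDistance expects by default.
        imgs = []
        for p in sorted(Path(folder).glob("*.png")):
            img = Image.open(p).convert("RGB").resize(size, Image.BICUBIC)
            imgs.append(torch.from_numpy(np.array(img)).permute(2, 0, 1))
        return torch.stack(imgs)

    fid = FrechetInceptionDistance(feature=2048)
    fid.update(load_images("path/to/real"), real=True)
    fid.update(load_images("path/to/generated"), real=False)
    print(float(fid.compute()))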

Training Time

Dear Authors,

How long does this method take to train?

About downloading .pkl file of dataset

Excuse me, in the 'Prepare Data' section you say:
"You can download the .pkl files from OneDrive. They should be enough for training and testing."
However, the OneDrive link does not work. Could you please provide a valid link for the .pkl files of the nuScenes dataset, similar to bevfusion's instructions?

xformers running error: NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs

I am sorry to bother you, but I am getting the following error when I try to train the video generation model after successfully installing xformers from the repo. I followed the installation instructions for A-series GPUs strictly, and I can run the program successfully with xformers disabled.

NotImplementedError: No operator found for memory_efficient_attention_forward with inputs:
query : shape=(1, 2, 1, 40) (torch.float32)
key : shape=(1, 2, 1, 40) (torch.float32)
value : shape=(1, 2, 1, 40) (torch.float32)
attn_bias : <class 'NoneType'>
p : 0.0
flshattF is not supported because:
xFormers wasn't build with CUDA support
dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
Operator wasn't built - see python -m xformers.info for more info
tritonflashattF is not supported because:
xFormers wasn't build with CUDA support
dtype=torch.float32 (supported: {torch.float16, torch.bfloat16})
Operator wasn't built - see python -m xformers.info for more info
triton is not available
cutlassF is not supported because:
xFormers wasn't build with CUDA support
smallkF is not supported because:
xFormers wasn't build with CUDA support
max(query.shape[-1] != value.shape[-1]) > 32
unsupported embed per head: 40

Thank you for your consistently helpful responses.
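
For anyone hitting this, a minimal check (a sketch, not part of MagicDrive) of whether the installed xformers build actually supports memory-efficient attention with fp16 inputs on the GPU, in addition to the python -m xformers.info command mentioned in the message above:

    # Quick sanity check: does this xformers build run memory_efficient_attention on CUDA with fp16?
    # If xformers was built without CUDA support, this raises the same NotImplementedError as above.
    import torch
    import xformers.ops as xops

    q = torch.randn(1, 2, 8, 40, dtype=torch.float16, device="cuda")  # (batch, seq, heads, head_dim)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = xops.memory_efficient_attention(q, k, v)
    print(out.shape, out.dtype)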

torch version and xformers install

Thanks for sharing this work! I would like to ask whether the torch version needs to be 1.12.0+ for training, since xformers requires 1.12.0+. However, the torch version in the README is 1.10.2.

Can nuScenes mini be used for training?

Thank you for your great work!!
I tried to use nuScenes mini to run a training demo, but it raises the error "No such file or directory: 'data/nuscenes/maps/expansion/singapore-onenorth.json'", so I wonder whether this version can only be trained on the full nuScenes dataset.

Weights for video generation

Hi,
Thank you so much for your wonderful work. I have a simple question: are the weights you posted for image generation the same as those used for video generation? I saw that you posted the training process for video generation; however, the nuScenes dataset is quite large and training takes time, so I am wondering whether I could just use your weights to test video generation. I really appreciate your help!

Best,

The inputs of view-conditioned generation.

Hi, congrats on your excellent work! I have a question regarding view-conditioned generation in Fig. 6:
[Figure 6 from the paper]
I am wondering how the condition view image is provided to the denoising process. Is it generated by DDIM inversion?

The generated images are abnormal when I use the nuScenes dataset

Following your README, I can run the demo successfully. But when I train and test with the nuScenes mini dataset, the generated images are abnormal. Can you help us see where the problem lies?

  1. use mini nuScenes dataset
    python tools/create_data.py nuscenes --root-path ./data/nuscenes --out-dir ./data/nuscenes_mmdet3d_2 --extra-tag nuscenes --version v1.0-mini

  2. train model in debug config with 1xV100
    accelerate launch --mixed_precision fp16 --gpu_ids all --num_processes 1 tools/train.py +exp=224x400 runner=debug runner.validation_before_run=true --version

  3. test
    python tools/test.py resume_from_checkpoint=magicdrive-log/debug/SDv1.5mv-rawbox_2024-02-27_09-57_224x400

[attached: abnormal generated image samples]

Bug in nuscenes_t_dataset

There is a bug in the build_clips function on line 59 of nuscenes_t_dataset.py in the video branch of the current repository. The line currently reads:
len(scene[start] >= 33)
It should probably be:
len(scene[start]) >= 33
Please investigate and correct this issue. Thank you!

Question about demo/run.py show_box = true

When show_box = true in test_config.yaml

Error executing job with overrides: ['resume_from_checkpoint=/home/magicdrive-files/SDv1.5mv-rawbox_2023-09-07_18-39_224x400', '++runner.enable_xformers_memory_efficient_attention=false']
Traceback (most recent call last):
  File "demo/run.py", line 69, in main
    map_size=target_map_size)
  File "./magicdrive/misc/test_utils.py", line 301, in run_one_batch
    for bi in range(bs)
  File "./magicdrive/misc/test_utils.py", line 301, in <listcomp>
    for bi in range(bs)
  File "./magicdrive/misc/test_utils.py", line 56, in draw_box_on_imgs
    val_input['meta_data']['img_aug_matrix'][idx].data.numpy(),
  File "./magicdrive/runner/box_visualizer.py", line 111, in show_box_on_views
    img_out = np.asarray(Image.open(temp_path))  # ensure image is loaded
  File "/opt/conda/envs/magicdrivepy37/lib/python3.7/site-packages/PIL/Image.py", line 3236, in open
    fp = builtins.open(filename, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '.tmp/tmprxi40ph_.png'

When show_box = false in test_config.yaml
There is a problem with the loop iteration over gen_imgs_wb_list. This loop does not execute because there are no elements in gen_imgs_wb_list.

for map_img, ori_imgs, ori_imgs_wb, gen_imgs_list, gen_imgs_wb_list in zip(*return_tuples):

I deleted gen_imgs_wb_list from that line and commented out the code related to show_box=true, and the script (demo/run.py) then runs correctly, so I think there is no problem with the environment.
I'm using the nuScenes v1.0 mini dataset, and I suspect this happened because the files related to the 3D box labels in the ground-truth data are not being read.

About Todo list updates

I have observed two pending tasks on your to-do list that align with my interests. May I inquire whether it would be convenient for you to share the associated configurations, weights, and code at this time?

  • config and pretrained weight for high resolution
  • train and test code for CVT and BEVFusion

FID metric

Thank you very much for your work, but I would like to ask a question about FID. I used the weights you published to generate validation-set images at 224×400, processed the real validation set the same way, and saved everything as PNG images. However, the FID I computed is much higher than the number in the paper. How did you calculate it? Thank you.

'BasicTransformerBlock' object has no attribute '_args'

I can't run the full demo or the gradio demo because of the following error:

AttributeError: 'BasicTransformerBlock' object has no attribute '_args'

if crossview_attn_type == "basic":
    _set_module(self, name, BasicMultiviewTransformerBlock(
        **mod._args,  # problem line
        neighboring_view_pair=neighboring_view_pair,
        neighboring_attn_type=neighboring_attn_type,
        zero_module_type=zero_module_type,
    ))

I understand that the code is attempting to retrieve the original constructor arguments from the BasicTransformerBlock, but in my environment the _args attribute does not exist.

I downloaded and unzipped the pretrained model to /pretrained and have an otherwise functional environment.

Error when installing xformers in third_party

thanks for your work!

When I install xformers in third_party using "pip install .", I get the following error:

[screenshot of the pip install error]

If I modify the code in /opt/conda/lib/python3.7/subprocess.py, xformers still does not work:
packaging.version.InvalidVersion: Invalid version: '0.0.19xxx'


How to generate some specific frames?

I noticed that you specify the index of the generated frame in configs/runner/default.yaml. Now I want to generate a certain frame given its sample token. How do I determine its index? Thanks!
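
For reference, a minimal sketch (an assumption, not the repository's own logic) that looks up the position of a sample token within the nuScenes validation split using the devkit. Whether this position matches the index used in configs/runner/default.yaml depends on how MagicDrive's dataset orders its samples, so it may need adjustment:

    # Hedged sketch: position of a sample token within the nuScenes validation split.
    # Assumes samples are ordered as the devkit lists them; MagicDrive may order them differently.
    from nuscenes.nuscenes import NuScenes
    from nuscenes.utils.splits import create_splits_scenes

    nusc = NuScenes(version="v1.0-trainval", dataroot="data/nuscenes", verbose=False)
    val_scene_names = set(create_splits_scenes()["val"])
    scene_name = {s["token"]: s["name"] for s in nusc.scene}

    val_samples = [s for s in nusc.sample if scene_name[s["scene_token"]] in val_scene_names]

    target_token = "<your-sample-token>"  # hypothetical placeholder
    index = next(i for i, s in enumerate(val_samples) if s["token"] == target_token)
    print(index)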

How to replace the UNet with a DiT architecture

Dear author, thank you very much for presenting such a valuable project and open-sourcing it. However, on the project introduction page you mentioned the possibility of replacing the UNet architecture with a DiT architecture. Could you kindly point me to where this switch is performed?

Some questions about model training

I followed the training script, and the format of the resulting checkpoint is as follows.
[screenshot: checkpoint format produced by my training run]

However, the format of the officially provided pretrained weights is:
[screenshot: format of the official pretrained weights]

Save the generated images at a size of 1600×900

> Yes, we kept the original data processing pipeline. Actually, we upscale and pad each generated image to 1600x900 and save them to disk, so that we can run the original code without any change.

Thanks for your reply!

May I ask if you could explain the specific upsampling and padding operations here?
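
For what it's worth, a minimal sketch of one way to do this; the exact interpolation and padding placement are assumptions, not the authors' code:

    # Hedged sketch: upscale a 400x224 generated view to 1600 px wide, then pad to 1600x900.
    from PIL import Image

    def to_nuscenes_size(img: Image.Image, target=(1600, 900)) -> Image.Image:
        tw, th = target
        scale = tw / img.width                                                 # 1600 / 400 = 4.0
        resized = img.resize((tw, round(img.height * scale)), Image.BICUBIC)   # -> 1600 x 896
        canvas = Image.new("RGB", target, (0, 0, 0))
        canvas.paste(resized, (0, (th - resized.height) // 2))                 # pad the remaining 4 px
        return canvas

    out = to_nuscenes_size(Image.open("gen_view.png"))  # hypothetical file name
    out.save("gen_view_1600x900.png")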

How to generate videos in MagicDrive-t using pretrained models.

I would like to directly obtain visualizations in MagicDrive-t using the pretrained weights for rawbox_mv2.0t_0.4.3.yaml from OneDrive. Could you please provide specific execution scripts, preferably for generating frames at different time intervals? Thank you very much.

How to run the codebase with a higher CUDA version

Thank you for your excellent work. I have noticed that the codebase relies on multiple specific versions of the mm-series repositories, and the highest GPU compute capability supported by PyTorch with CUDA 10.2 is sm_70. Although a V100 can be used, many newer GPUs such as the A series or L40S cannot run with sm_70. Is it possible to make the code run on these GPUs?

About video quality

Dear authors,

Thanks for your great work!

I have a question regarding the quality of the generated video. I loaded the pretrained video unet and got videos like this:
[attached: generated video GIF]
There seem to be some artifacts in the video, and some views are blurrier than others.

I am wondering if this is a normal phenomenon or if I did something wrong.

Thanks!

question about temp_attn_type

"ts_first_ff", "s_first_t_ff", "s_ff_t_last", "t_first_ff", "t_ff", "_ts_first_ff", "_s_first_t_ff", "_s_ff_t_last"
What do these mean? Is there any documentation?

Trying to generate without any conditions

I am having a hard time installing and running some of the libraries required for the demo and I want to perform inference with no conditioning (generation from pretrained weights with no map or boxes whatsoever). So I wrote this script:

import torch
from typing import List
from PIL import Image
from diffusers import UniPCMultistepScheduler
from magicdrive.pipeline.pipeline_bev_controlnet import (
    StableDiffusionBEVControlNetPipeline,
    BEVStableDiffusionPipelineOutput,
)
from magicdrive.networks.unet_addon_rawbox import BEVControlNetModel  # ControlNet class
from magicdrive.networks.unet_2d_condition_multiview import UNet2DConditionModelMultiview

pipe_param = {}
controlnet = BEVControlNetModel.from_pretrained('magicdrive_weights/SDv1.5mv-rawbox_2023-09-07_18-39_224x400/controlnet', bbox_embedder_cls= "magicdrive.networks.bbox_embedder.ContinuousBBoxWithTextEmbedding")
controlnet.eval()  # from_pretrained will set to eval mode by default
pipe_param["controlnet"] = controlnet

unet = UNet2DConditionModelMultiview.from_pretrained('magicdrive_weights/SDv1.5mv-rawbox_2023-09-07_18-39_224x400/unet')
unet.eval()
pipe_param["unet"] = unet

pipe = StableDiffusionBEVControlNetPipeline.from_pretrained('magicdrive_weights/runwayml:stable-diffusion-v1-5', **pipe_param, safety_checker=None, feature_extractor=None)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)

pipe.enable_xformers_memory_efficient_attention()

generator = None

pipeline_param = {
    "guidance_scale": 2,  # if > 1, enable classifier-free guidance
    "num_inference_steps": 20,
    "eta": 0.0,
    "controlnet_conditioning_scale": 1.0,
    "guess_mode": False,
    "use_zero_map_as_unconditional": True,
    "bbox_max_length": None,
}

image: BEVStableDiffusionPipelineOutput = pipe(
    prompt='a vehicle driving in the rain in bright daylight',
    image=torch.zeros(1,8,200,200),
    camera_param=torch.zeros(1,6,3,7),
    height=224,
    width=400,
    generator=generator,
    bev_controlnet_kwargs={"bboxes_3d_data":None},
    **pipeline_param,
)
image: List[List[Image.Image]] = image.images

Are these the right steps to reproduce inference? I am getting pretty much the same image for all views.
[attached: generated output]

The approximate training time for the video generation model

Hello,

Thank you for your excellent work! I would like to ask about the approximate training time for the video generation model. I followed the instructions from this link and used the command scripts/dist_train.sh 8 runner=8gpus_t +exp=rawbox_mv2.0t_0.4.3 for training. However, it took me over 30 hours to train for 5000 steps. I would like to know if this is normal because it was mentioned that approximately 80,000 steps are needed for training, which would take a considerable amount of time.
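
Extrapolating from these numbers (assuming the rate stays constant):

    # Rough extrapolation from the figures above: 30 hours for 5,000 steps.
    hours_per_step = 30 / 5000
    print(hours_per_step * 80_000)  # ~480 hours, i.e. roughly 20 days at this rate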

Thank you very much for your help!

No module named 'xformers._flash_attn'

Hello, thanks for your excellent work. I am getting the error "No module named 'xformers._flash_attn'" when I run the demo command you gave.
Environment:
CUDA 10.2
PyTorch 1.10.2
gcc 7.4.0
NVIDIA A100

Can the generated videos improve MOT performance?

I saw in your paper that the generated images can improve 3D object detection KPIs. I checked your generated videos, and the consistency of the generated cars and other objects across the image sequence is not very good. So I want to know: would this inconsistency hurt MOT KPIs if I use these generated videos to augment the training dataset?

[video-generation-ego-movement] Question about video generation

Good afternoon,

For video generation, the MagicDrive paper states that only the first and the last frames have bounding boxes (section 5.4). I have the following question:

  • How do you encode the movement of the car in the last frame, relative to its initial position in the first frame?

The ego pose of the car changes over these 7 frames (around 4 seconds), but if I understand correctly, both the bounding boxes and the camera poses use the ego point of the car as reference. Therefore, they do not inject any information about the change in the car's ego pose from its starting position to the end. In my mind, this is important information to inject into the video generation model, but I may be missing something.

Thank you in advance for your feedback.

[MagicDrive3D code]

Thank you for your great work! When do you plan to release the code for MagicDrive3D? I find the MagicDrive series to be driving world models that could genuinely be put to practical use.

Conditioning on Segmentation Maps

Hello, great work!

I was wondering whether you have thought about conditioning on image segmentation maps (generated with some algorithm on the nuScenes images).
I am not quite sure, but I would expect segmentation maps to carry more information about the shapes of the scene/objects than 3D bboxes + class labels; though maybe that is not the case if you already considered and discarded the idea.

Thank you!!

Question about view consistency

Does the term "hidden state of neighbor" refer to the latent representation of the neighboring image or the scene-level encoding embedding?
Additionally, when addressing view consistency, does the model output multiple-view results simultaneously or only one view?

Your Loss is NaN

Steps: 0%| | 4680/1172500 [3:15:04<781:36:43, 2.41s/it, loss=0.0968, lr0=4e-5]Error executing job with overrides: ['+exp=224x400', 'runner=4gpus']
Traceback (most recent call last):
File "tools/train.py", line 110, in main
runner.run()
File "/data/******************./magicdrive/runner/base_runner.py", line 352, in run
raise RuntimeError('Your loss is NaN.')
RuntimeError: Your loss is NaN.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

Steps: 0%| | 4734/1172500 [3:18:08<814:35:08, 2.51s/it, loss=0.294, lr0=4e-5]

Hey, I train the model on 4 V100 GPUs with batch_size=3 and learning rate 4e-5, but I got this error. Has anyone encountered it before? Thanks a lot.
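
For reference, a minimal debugging sketch (an assumption, not part of the repo) for narrowing down where the NaN first appears; common mitigations include lowering the learning rate or clipping gradients, but locating the offending step first helps:

    # Hedged debugging sketch: locate the first non-finite value during training.
    # set_detect_anomaly slows training noticeably, so enable it only to reproduce the failure.
    import torch

    torch.autograd.set_detect_anomaly(True)  # raises at the backward op that produced NaN/Inf

    def check_finite(name, tensor):
        # Call this on the loss (and optionally intermediate activations) every step.
        if not torch.isfinite(tensor).all():
            raise RuntimeError(f"{name} became non-finite at this step")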

Some confusion about BEVFusion detection scores.

Hi, I would like to express my appreciation for your outstanding work. I have some further questions regarding the BEVFusion detection scores.

I noticed that in the first and third rows of Table 2 in your paper, the BEVFusion detection scores you report without data augmentation are much lower than those officially reported by BEVFusion (https://github.com/mit-han-lab/bevfusion/). This is very confusing.

Thank you, and I look forward to your response!
Best regards.

Where can I find the forked version of mmdet3d required by the video branch?

Hi,

the README.md file in the video branch contains this:

Datasets

Please prepare the nuScenes dataset as [bevfusion's instructions](https://github.com/mit-han-lab/bevfusion#data-preparation). 

Note:
Run with our forked version of mmdet3d.

however I don't know where I can find that forked version of mmdet3d. Could the URL be added in the file?

I'm asking because I'm getting weird exceptions when using the official version of mmdetection3d to generate the .json files required by scripts/ann_generator.sh in ASAP. I am not sure whether not using the forked version is the problem.

Thanks in advance.
