umass-foundation-model / 3d-llm

Code for 3D-LLM: Injecting the 3D World into Large Language Models

License: MIT License

Python 96.94% Shell 0.03% C++ 0.30% Cuda 2.73%

3d-llm's Issues

How to evaluate the model?

Hi, are there any scripts that can be used to calculate the BLEU, METEOR, ROUGE-L, CIDEr, and EM of the model?
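
For the EM part of the question, a minimal sketch is shown here. It assumes predictions are a mapping from question id to answer string and references are a mapping from question id to a list of accepted answers; the function names and the normalization rules (lowercasing, dropping punctuation and articles) are illustrative choices, not the repository's official evaluation, so scores may differ from published numbers.

```python
import re
import string
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: List[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


def em_score(predictions: Dict[int, str], references: Dict[int, List[str]]) -> float:
    """Percentage of questions whose prediction exactly matches a reference."""
    if not references:
        return 0.0
    hits = sum(
        exact_match(predictions.get(qid, ""), refs)
        for qid, refs in references.items()
    )
    return 100.0 * hits / len(references)
```

A missing prediction is treated as an empty string, so it counts as a miss rather than raising an error.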

Train 3D-LLM from scratch

Hi,
Thanks for the significant work.

I see that 3D-LLM discards the pre-trained vision encoder, replaces it with features generated offline, and trains the QFormer with a frozen LLM.

I'm not sure whether the QFormer is trained from scratch or whether the authors loaded a pretrained QFormer and fine-tuned it, because the point cloud features are processed by positional encoding before being forwarded into the QFormer.

Looking forward to the response.
Thank you

How to map the 'val_1_vqa_result.json' with 'all_questions.json' for val set in pretrain.yaml?

'val_1_vqa_result.json' is the file produced after evaluation during training, and each item in it contains only 'question_id' and 'answer'. How can I map each item to the entries in 'all_questions.json' for the val set in pretrain.yaml, so that I can check the val performance?

I also find that there are 21092 question-answer pairs in 'val_1_vqa_result.json' but 67578 question-answer pairs in 'all_questions.json' for the val set in pretrain.yaml, so the counts do not match!
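
One way to check the mapping is to join the two files on 'question_id' and list the ids that have no partner. The sketch below assumes that entries in 'all_questions.json' also carry a 'question_id' field, which the issue does not confirm; `join_results` is a hypothetical helper name.

```python
def join_results(results, questions, key="question_id"):
    """Attach each predicted answer to its question record.

    results:   list of {"question_id": ..., "answer": ...} items
    questions: list of question records (assumed to contain the same key)
    Returns (joined_records, unmatched_ids).
    """
    by_id = {q[key]: q for q in questions}
    joined, unmatched = [], []
    for r in results:
        q = by_id.get(r[key])
        if q is None:
            unmatched.append(r[key])
        else:
            # Copy the question record and add the model's answer.
            joined.append({**q, "predicted": r["answer"]})
    return joined, unmatched
```

With both files loaded via `json.load`, a long `unmatched` list would suggest that the ids in the result file are not the ids stored in the question file.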

How to process scannet dataset?

I have downloaded the ScanNet dataset (scannet_frames_25k) from https://kaldir.vc.in.tum.de/scannet_benchmark/documentation. When I run python direct_3d.py in the three-step pipeline for scenes, it looks like a lot of json files are missing, such as,

How do I render the original ScanNet dataset? Could you share your own script?

Thanks very much.

What if the point cloud features are not 1408 dimensional

Hi, thank you so much for your wonderful work and for giving me a better understanding of VLM.

I would like to ask, when running the code 3DLLM_BLIP2-base/inference.py I found that if I input my own point cloud into it (the number of features per point is not 1408), it will not work. Upon inspection, the code 3DLLM_BLIP2-base/lavis/models/blip2_models/blip2_t5.py makes heavy use of the number 1407 for parameter determination.

I'm wondering what I should do if I want to pass in my own point cloud and run it. (I'm worried that 1407 is also hard-coded in many other places in the code that I won't be able to track down.)
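
A possible explanation for the 1407, offered as an assumption and not checked against blip2_t5.py: if a positional encoding is built separately for the x, y and z axes, each axis gets `feat_dim // 3` channels and the three together cover `3 * (feat_dim // 3)` channels, which is 1407 for 1408-dimensional features. The hypothetical helper below only does that arithmetic, to show which numbers would have to change together for another feature width.

```python
def positional_width(feat_dim, axes=3):
    """Return (channels per axis, total channels covered by the encoding).

    Assumes the encoding splits the feature dimension evenly across axes
    and leaves any remainder channels without positional information.
    """
    per_axis = feat_dim // axes
    return per_axis, per_axis * axes


print(positional_width(1408))  # (469, 1407)
```

Even if this reading is right, the pretrained weights were trained on 1408-dimensional features, so features of another width would likely need a projection layer and finetuning, not just changed constants.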

For the json files used in pretraining, did you split them to train, val and test?

For the json files used in pretraining, like 'data_part2_scene_v2.json', 'data_part1_all_objaverse.json', did you use them all for training? Or used some part of them for val and test? If so, which parts are for val and test?

ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

Dear author:

Thanks for your interesting work.

When I run finetuning with finetune_scanqa.yaml on ScanNet features, an error occurred:

2023-10-12 21:41:15,670 [INFO] load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xl.pth

url_or_filename: /data/xxx/code/3d-llm/ckpts/pretrain_blip2_sam_flant5xl_v2.pth

2023-10-12 21:41:22,086 [INFO] number of trainable parameters: 372436480
Traceback (most recent call last):
  File "train.py", line 112, in <module>
    main()
  File "train.py", line 108, in main
    runner.train()
  File "/data/renruilong/code/3d-llm/3DLLM_BLIP2-base/lavis/runners/runner_base.py", line 354, in train
    self._load_checkpoint(self.resume_ckpt_path)
  File "/data/renruilong/code/3d-llm/3DLLM_BLIP2-base/lavis/runners/runner_base.py", line 587, in _load_checkpoint
    self.optimizer.load_state_dict(checkpoint["optimizer"])
  File "/data/renruilong/miniconda3/envs/3dllm/lib/python3.8/site-packages/torch/optim/optimizer.py", line 390, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

It seems that pretrain_blip2_sam_flant5xl_v2.pth does not match the model defined in the code. How can I solve this error?

Best!
Xiaolong
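
The ValueError in the traceback is raised when a parameter group in the saved optimizer state holds a different number of parameters than the matching group in the current optimizer. The sketch below reproduces that comparison on plain dictionaries shaped like the output of `optimizer.state_dict()`; `param_groups_match` is a hypothetical helper, not part of the repository.

```python
def param_groups_match(saved_state, current_state):
    """Check whether two optimizer state dicts have compatible groups.

    Both arguments are dicts with a "param_groups" list, where each group
    has a "params" list (the layout produced by optimizer.state_dict()).
    """
    saved = saved_state["param_groups"]
    current = current_state["param_groups"]
    if len(saved) != len(current):
        return False
    # Every group must hold the same number of parameters.
    return all(
        len(s["params"]) == len(c["params"]) for s, c in zip(saved, current)
    )
```

If this returns False for `checkpoint["optimizer"]` and `optimizer.state_dict()`, one option (an assumption about what is acceptable for finetuning, not the authors' fix) is to load only the model weights from the checkpoint and start with a fresh optimizer.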

license

Very nice work! Could you add a license file to the repo? Thanks!

Do I need to provide some external datasets like COCO dataset in order to run the models?

How to generate the data for 3D Grounding?

The insights gained from 3D-LLM, ranging from model design to data collection, have been truly enlightening for me. I'm incredibly grateful to the authors for open-sourcing this remarkable work. I'm particularly intrigued by how the authors managed to generate the 3D Grounding data. Would you be able to share some details with me?

In the paper, it's mentioned that there are three types of prompt methods. Could you please specify which prompt method was used for generating the 3D Grounding data?
Figure 6 showcases the prompts used for generating data for tasks like task decomposition and 3D-assisted dialog. Could you kindly provide the prompts used for creating 3D Grounding data, along with some in-context examples?

How many obj files in all in your Object Dataset?

I see 69 objects in your objaverse_feat_subset and wonder how many obj files there are in total in your object dataset from Objaverse. Did you use all files from Objaverse?

Implementation of location tokens

I'm wondering if the code for location tokens implementation could be shared?

This is important to understand how ScanRefer task is handled by the proposed method.

Thanks,
Chao

Code for multiple tasks in the project page

Thanks for the insightful work! Will the training code for the tasks mentioned on the project page be released? I am interested in the research regarding EQA and navigation 😊.

About point cloud data

Hi,
Can this model handle LIDAR point cloud data? Thanks!

Difference between scannet features for finetuning and for pretraining?

I downloaded two different pre-computed features for scannet:

One is inside voxelized_features_sam_nonzero_preprocess and, according to the documentation, is for ScanQA finetuning
The other is inside 3dllm_final_scene_data_v2/features and, according to the documentation, is for pretraining

The two folders contain features for overlapping ScanNet scenes, but the features themselves differ. What is the difference, and is one of them better to use?

`final_scene_map_dict_scan_v2` and `v3` have conflicting key: value pairs

The newly uploaded final_scene_map_dict_scan_v3 conflicts with the old v2 mapping file. For example:

In v2, "1013": "scene0411_00"
In v3, "1013": "scene0261_01"

Is this intended? Are we supposed to use the v3 mapping file together with the v2 language data data_part2_scene_v2.json?
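
To see how widespread the conflicts are, the two mapping files can be compared key by key. This is a minimal sketch that assumes both files load as flat dictionaries from id string to scene name, as in the example above; `mapping_conflicts` is a hypothetical helper.

```python
def mapping_conflicts(old, new):
    """Return {key: (old_value, new_value)} for keys whose values differ."""
    return {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }


v2 = {"1013": "scene0411_00", "7": "scene0000_00"}
v3 = {"1013": "scene0261_01", "7": "scene0000_00"}
print(mapping_conflicts(v2, v3))  # {'1013': ('scene0411_00', 'scene0261_01')}
```

The second entry in `v2` and `v3` is made up for the demonstration; only the "1013" pair comes from the files.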

The position of the object

Hi, in the room inference mode, how can I get an object's position (x, y, z) with a prompt?

Training fails after 'blip2_t5.py' was updated for optimizer parameter loading

When I run the training command 'python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/pretrain.yaml' with the latest version, it fails with the error below. I think the cause lies in lines 54-61 of 'blip2_t5.py', the newly added code.
Traceback (most recent call last):
  File "train.py", line 111, in <module>
    main()
  File "train.py", line 107, in main
    runner.train()
  File "/root/3DLLM/3D-LLM-main-10.16/3DLLM_BLIP2-base/lavis/runners/runner_base.py", line 360, in train
    train_stats = self.train_epoch(cur_epoch)
  File "/root/3DLLM/3D-LLM-main-10.16/3DLLM_BLIP2-base/lavis/runners/runner_base.py", line 396, in train_epoch
    return self.task.train_epoch(
  File "/root/3DLLM/3D-LLM-main-10.16/3DLLM_BLIP2-base/lavis/tasks/base_task.py", line 110, in train_epoch
    return self._train_inner_loop(
  File "/root/3DLLM/3D-LLM-main-10.16/3DLLM_BLIP2-base/lavis/tasks/base_task.py", line 211, in _train_inner_loop
    loss = self.train_step(model=model, samples=samples)
  File "/root/3DLLM/3D-LLM-main-10.16/3DLLM_BLIP2-base/lavis/tasks/base_task.py", line 64, in train_step
    loss = model(samples)["loss"]
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 1 2
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
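
The error message itself suggests setting TORCH_DISTRIBUTED_DEBUG to find out which parameters received no gradient. The sketch below only assembles the command from this issue together with that environment variable; `build_debug_launch` is a hypothetical helper and nothing is executed until the result is passed to `subprocess.run`.

```python
import os
import sys


def build_debug_launch(cfg_path, nproc_per_node=8, level="DETAIL"):
    """Build (command, environment) for a debug run of train.py.

    level must be "INFO" or "DETAIL", the two values named in the error.
    """
    if level not in ("INFO", "DETAIL"):
        raise ValueError("level must be 'INFO' or 'DETAIL'")
    env = dict(os.environ)  # copy, so the current process is not modified
    env["TORCH_DISTRIBUTED_DEBUG"] = level
    cmd = [
        sys.executable, "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        "train.py", "--cfg-path", cfg_path,
    ]
    return cmd, env
```

Running `subprocess.run(cmd, env=env, check=True)` from 3DLLM_BLIP2-base would then print the names of the unused parameters (indices 1 and 2 in the message above), which shows whether the newly added code registers parameters that never contribute to the loss.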

Some questions

Hi, when I run the code

python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/3dvqa_ft.yaml

there are some questions:

I downloaded the weights locally from "https://huggingface.co/facebook/opt-2.7b" and replaced 'opt_model' in the code with the local weight file, but it shows that the weight and model sizes don't match.
What directory should I place the downloaded dataset in?
I found that the three annotations files in 3dvqa_ft.yaml do not exist. How can I obtain them?

train:
  storage: ./examples/all_refer_questions_train.json
test:
  storage: ./examples/all_refer_questions_val.json
val:
  storage: ./examples/all_refer_questions_val.json

Can you share the script of prompting ChatGPT to generate the description of 3D data?

Nice work!

I want to see how accurate the ChatGPT can generate based on different 3D scenes.

Could you push the related scripts to this github, which may be very helpful.

Xianzheng

Where are the files such as './examples/all_refer_questions_train.json'?

Dear author:

Thanks for your interesting work.

When I run the following command:

python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/3dvqa_ft.yaml

I found that the three annotations files in 3dvqa_ft.yaml do not exist. How can I obtain them? Or do I need to generate them locally?

build_info:
  annotations:
    train:
      storage: ./examples/all_refer_questions_train.json
    test:
      storage: ./examples/all_refer_questions_val.json
    val:
      storage: ./examples/all_refer_questions_val.json

Thanks!!
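
Before launching training it can help to check which of the configured annotation files are actually present. The sketch below assumes the `annotations` section of the config has already been loaded into a dictionary with the layout shown above; `missing_annotation_files` is a hypothetical helper.

```python
import os


def missing_annotation_files(annotations, root="."):
    """Return {split: path} for every configured storage file not found.

    annotations: dict like {"train": {"storage": "./examples/x.json"}, ...}
    root:        directory that relative storage paths are resolved against
    """
    missing = {}
    for split, entry in annotations.items():
        path = os.path.join(root, entry["storage"])
        if not os.path.isfile(path):
            missing[split] = path
    return missing


annotations = {
    "train": {"storage": "./examples/all_refer_questions_train.json"},
    "test": {"storage": "./examples/all_refer_questions_val.json"},
    "val": {"storage": "./examples/all_refer_questions_val.json"},
}
```

Calling `missing_annotation_files(annotations, root="3DLLM_BLIP2-base")` lists the splits whose files still have to be downloaded or generated; the root directory here is a guess at where the relative paths are resolved.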

Get "" answer when run evaluate.py with scanqa.pth as resume_ckpt_path on scannet dataset

Dear author:

Thanks for your interesting work.

When I run the following command:

cd 3DLLM_BLIP2-base
python evaluate.py --cfg-path lavis/projects/blip2/train/finetune_scanqa.yaml

The test_best_vqa_result.json I obtained is as follows:

{"question_id": 0, "answer": ""}, 
{"question_id": 1, "answer": ""}, 
{"question_id": 2, "answer": ""}, 
{"question_id": 3, "answer": ""}, 
{"question_id": 4, "answer": "the lower kitchen cabinets the same color"}, 
{"question_id": 5, "answer": ""}, 
...
{"question_id": 11, "answer": ""}, 
{"question_id": 12, "answer": "a white t"}, 
{"question_id": 13, "answer": ""}, 
{"question_id": 14, "answer": ""}, 
{"question_id": 15, "answer": ""}, 
{"question_id": 16, "answer": "a counter"}, 
{"question_id": 17, "answer": ""}, 
{"question_id": 18, "answer": ""}
...

There are many empty answers here. Is this a normal result? If not, how can I solve it?

Best!
Xiaolong
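
To put a number on "many empty answers", the fraction of blank predictions in the result file can be computed directly. This is a minimal sketch over items shaped like the entries above; `empty_answer_rate` is a hypothetical helper.

```python
def empty_answer_rate(results):
    """Fraction of result items whose "answer" is missing or blank."""
    if not results:
        return 0.0
    empty = sum(1 for r in results if not r.get("answer", "").strip())
    return empty / len(results)


sample = [
    {"question_id": 0, "answer": ""},
    {"question_id": 1, "answer": ""},
    {"question_id": 2, "answer": ""},
    {"question_id": 3, "answer": ""},
    {"question_id": 4, "answer": "the lower kitchen cabinets the same color"},
    {"question_id": 5, "answer": ""},
]
print(empty_answer_rate(sample))
```

Tracking this rate across checkpoints or configuration changes gives a quick signal of whether a change in loading or decoding settings helped.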

HW spec for Inference

Thanks for your awesome work!
Can you tell me the minimum hw spec for your model inference? I'd like to try your model for inference only!

Are there any model checkpoints available to directly load the models? Or do I need to train them.

About hardware resources

I would like to train this model or train a model using your data. I'm interested in the approximate training time and the hardware used. The paper doesn't appear to include this information.

Thanks!

The google cloud links are the same for language annotations on the object data and the scene data.

How to run inference with the model given a 3D scene or object data?

Before inference, do I need to pre-process the 3D scene data following the Three-step 3D Feature Extraction (Scene)?
For object data, follow the Three-step 3D Feature Extraction (Objaverse)?

Voxels to point clouds

From the downloaded ScanNet data, I see that the point clouds are actually voxels, e.g. arrays of [25, 221, 209]...
Could you explain the transformation from the continuous point clouds to these voxels?

Basically, I'm interested in finetuning the pretrained model for ScanRefer, so I need the mapping from voxel to point coordinates. Right?

Thanks for your attention.
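
For a regular voxel grid, the generic mapping between voxel indices and continuous coordinates looks like the sketch below. The grid origin and voxel size used by the repository are not stated in this thread, so `origin` and `voxel_size` are placeholders that would have to come from the authors' preprocessing.

```python
import math


def voxel_to_point(index, origin, voxel_size):
    """Center of a voxel in continuous coordinates."""
    return tuple(o + (i + 0.5) * voxel_size for i, o in zip(index, origin))


def point_to_voxel(point, origin, voxel_size):
    """Index of the voxel that contains a continuous point."""
    return tuple(
        int(math.floor((p - o) / voxel_size)) for p, o in zip(point, origin)
    )
```

Mapping a voxel to its center and back returns the same index, which is a useful sanity check once the real origin and voxel size are known.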

I can't find the data for 3D Grounding and 3D Dense Captioning in file 'data/data_part2_scene.json'

The paper mentions that the authors utilized ChatGPT to generate data for 3D Grounding and 3D Dense Caption. When I opened the 'data_part2_scene.json' file to examine the data corresponding to these two tasks, I noticed that none of the Questions or Answers in the dataset contained coordinates, despite the expectation that these tasks should involve coordinates related to referent objects.
As depicted in Figure 7, the data distribution for the 3D Grounding and 3D Dense Caption tasks is ~25% each. However, theoretically, these tasks could share data since the main distinction lies in one deriving a bounding box from a description, while the other involves generating a description from a given box. I'm curious why the authors use different data for the two tasks.

Visual grounding data format and training process

Hello, and first of all, thank you for sharing your impressive work. I am currently working on fine-tuning a visual grounding process using real-world data from our laboratory. I understand that both the visual grounding and navigation processes are ongoing. Could you please provide any updates regarding the release of the visual grounding data format and the training process?

Question about 3D Localization Mechanism

Dear author:

Thanks for your interesting work.

I have two questions about 3D Localization Mechanism:

Are position embeddings already embedded in the features of 3dllm_final_scene_data_v2
Where in the code is the embedding of location tokens implemented? I can not find it.

By the way, I wonder when you will release the detailed data and code for grounding? I am really looking forward to it~

Best!
Xiaolong

The scene data can no longer be accessed on Google Drive because too many people have downloaded it

When I try to download 3dllm_final_scene_data_blip_v1.zip, it returns:

(base) lhj@mac:/data/pointcloud/3dllm$ gdown https://drive.google.com/uc?id=118JSjS1nl-1v87wC2oTxEmQzSSyCIBXM
Access denied with the following error:

        Too many users have viewed or downloaded this file recently. Please
        try accessing the file again later. If the file you are trying to
        access is particularly large or is shared with many people, it may
        take up to 24 hours to be able to view or download the file. If you
        still can't access a file after 24 hours, contact your domain
        administrator. 

You may still be able to access the file from the browser:

         https://drive.google.com/uc?id=118JSjS1nl-1v87wC2oTxEmQzSSyCIBXM

Could you please upload this big file to Baiduyun or OneDrive?

Validation loss vs. metrics?

Dear authors, @evelinehong,

Thanks for the interesting paper and for releasing the code and models :) I was able to reproduce the ScanQA results on the validation set. I also added computation of the validation loss, similar to the training step by calling forward in addition to predict_answers, in VQATask.valid_step. However I noticed while the validation loss goes up, the validation metrics also go up.

Did you notice something similar to this while training, or do you have any suggestions as to why this could happen? I would expect validation loss and metrics to be correlated, even if the val loss is not from autoregressive generation.

Best,
Chandan
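
A toy illustration of how a mean loss and a thresholded metric can rise together: the metric only asks whether each prediction is right, while the negative log-likelihood also penalizes how confident the wrong predictions are. The numbers below are invented for a binary case and are not measurements from 3D-LLM.

```python
import math


def mean_nll(p_correct):
    """Mean negative log-likelihood given the probability of the true label."""
    return -sum(math.log(p) for p in p_correct) / len(p_correct)


def accuracy(p_correct, threshold=0.5):
    """Share of examples where the true label gets more than the threshold."""
    return sum(p > threshold for p in p_correct) / len(p_correct)


early = [0.4, 0.4, 0.4, 0.4]    # unsure everywhere, all slightly wrong
late = [0.9, 0.9, 0.9, 0.001]   # mostly right, one very confident mistake

print(accuracy(early), accuracy(late))
print(mean_nll(early), mean_nll(late))
```

Here accuracy goes from 0 to 0.75 while the mean loss roughly doubles, because the single confident error dominates the average. Whether this is what happens in the ScanQA runs would need a look at the per-example losses.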

dataset

When will the dataset be released? Do you have any plan to release the code that constructs the dataset?

Categories of Questions

Could you please provide a document that specifies which questions belong to which tasks?

How to ensure the accuracy of the data

Hello, thank you for doing such an inspiring job.
I would like to ask a question, how do you ensure the accuracy of this data when performing 3D-Language Data Generation? Or is there a detection mechanism to detect it?
Hope to receive an answer, thank you!
You guys are really great, I’m so envious!
In Chinese, niubi!

Verification of 3D feature extraction

Hi, I tried to extract 3D features following the three-step pipeline using the blip_sam.py file.
I have a few questions about the details:

In lines 53-54 of second_step/blip_sam.py, you have:
raw_image = cv2.imread(INPUT_IMAGE_PATH)
raw_image = cv2.resize(raw_image, (512, 512))
Is it correct to skip the conversion raw_image = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB)?
Once I have generated my own point cloud features, how can I verify that they are consistent with yours? Could you provide a subset of the features (e.g. ScanNet)? I also tried to compute the similarity between the point cloud features and a text description. However, since the 1408-dim features are a hidden layer output, it is not feasible to compare them with the text features, which are 256-dim. Do you have any suggestions on this?

Sample dataset for 3DMV-VQA

For the three step feature extraction, we need the 3DMV-VQA dataset format as mentioned at
https://github.com/evelinehong/3D-CLR-Official

-data   # multi-view images of single-room scenes
   - 00009-vLpv2VX547B_0    # most rooms contain 1000 views while some contain less. 00009-vLpv2VX547B means house 00009-vLpv2VX547B which is the same as HM3D dataset. _0 means it's the first room of the house
      - 0.png
      - 0_depth.npy
      - 0.json
      - 1.png
      - 1_depth.npy
      - 1.json
      ...
   - 00009-vLpv2VX547B_1
      - 0.png
      - 0_depth.npy
      - 0.json
      - 1.png
      - 1_depth.npy
      - 1.json
      ...
   ... 
   - 00891-cvZr5TUy5C5_9
      - 0.png
      - 0_depth.npy
      - 0.json
      - 1.png
      - 1_depth.npy
      - 1.json
      ...
 data_2  #multi-view images of two-room scenes
   - 00009-vLpv2VX547B_0_1    # most rooms contain 1000 views while some contain less. 00009-vLpv2VX547B means house 00009-vLpv2VX547B which is the same as HM3D dataset. _0 means the first room of the house, _1 means the second room of the house. Meaning that this scene consists of two rooms of house 00009-vLpv2VX547B .
      - 0.png
      - 0_depth.npy
      - 0.json
      - 1.png
      - 1_depth.npy
      - 1.json
 data_3   #multi-view images of three-room scenes
   - 00009-vLpv2VX547B_0_1_2    # most rooms contain 1500 views while some contain less. 00009-vLpv2VX547B means house 00009-vLpv2VX547B which is the same as HM3D dataset. _0 means the first room of the house, _1 means the second room of the house, _2 means the third room of the house. Meaning that this scene consists of three rooms of house 00009-vLpv2VX547B .
      - 0.png
      - 0_depth.npy
      - 0.json
      - 1.png
      - 1_depth.npy
      - 1.json
 questions_train.json #questions and answers of training dataset
 questions_val.json
 questions_test.json
 all_concepts.json #all concepts of the dataset
 objects_bboxes_per_room.zip  #object bounding boxes of each room
 room_bboxes_with_wallsrevised_axis.zip  #room bounding boxes of the houses
 single_room_concepts3_after_bboxes_after_replace.zip #Useful concepts of each room

Could you please let me know how to download a sample of the dataset so that I can understand its format? The original dataset is 250GB. A sample for a single room would be enough. Please help.

Operating system restarts when rendering Objaverse

Dear authors, when I try to use Blender 3.3 to render Objaverse into images, the operating system restarts after I run "{path/to/blender} -b -P render.py -noaudio --disable-crash-handler -- --uid".
Do you know the detailed reason?
Is it the NVIDIA driver or some other problem?

How is scene data rendered from point clouds into images before three steps?

How is scene data rendered from point clouds into images before the three steps? The project only explains the rendering method for Objaverse.

The held-in datasets

Hi, could you please provide the held-in datasets? Thanks!

Training batch size

Hi, thanks for sharing your fantastic work! As indicated in the paper, when using BLIP-2 as backbones, it requires 64 V100s to train the model with batch size 128 on 8 nodes (16 per node), while using Flamingo, it requires 8 A100, with batch size 16. I wonder if it's possible to train with a smaller batch size? Many thanks.
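
One generic way to keep the effective batch size with fewer GPUs is gradient accumulation: several small batches are processed before each optimizer step. The sketch below only does the arithmetic. The per-GPU batch of 2 is derived from the figures quoted above (128 samples over 64 GPUs) and is an assumption, as is the idea that accumulation gives comparable results for this model.

```python
def accumulation_steps(target_batch, per_gpu_batch, num_gpus):
    """Number of accumulation steps needed to reach the target batch size."""
    per_step = per_gpu_batch * num_gpus
    if per_step <= 0 or target_batch % per_step != 0:
        raise ValueError("target batch must be a positive multiple of per_gpu_batch * num_gpus")
    return target_batch // per_step


print(accumulation_steps(128, 2, 64))  # 1: the setup described in the paper
print(accumulation_steps(128, 2, 8))   # 8: same effective batch on 8 GPUs
```

Training with accumulation takes proportionally more wall-clock time per optimizer step, and the learning rate schedule may still need adjusting.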

What is `objaverse_frame_cap3d`?

I was following gen_features in Step3: 3D feature construction from rendered images, but ran into a blocker.

Apparently sam_mask.py expects objaverse_frame_cap3d to contain different image views of a "room". Where/how are those images generated?

Your help would be greatly appreciated!

How to evaluate?

I could not find a released version of the validation set for 3D Captioning, 3D-assisted Dialog, or Task Decomposition. How can we evaluate the model's performance on these tasks from Figure 3?

How to align scene_ids in `data_part2_scene.json` from those in ScanNet?

I notice that the scene_ids in data_part2_scene.json have a dtype of int, while the scene_ids in scannet are something like scene0000_00. So, how do I align these two datasets? Or how can I split the scannet annotations from that file by myself?
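
If a mapping file from integer id to ScanNet scene name is available (another issue on this page quotes entries such as "1013": "scene0411_00" from final_scene_map_dict_scan_v2), the alignment reduces to a dictionary lookup. The sketch assumes each record in data_part2_scene.json has a 'scene_id' field and that ids missing from the mapping are not ScanNet scenes; both are assumptions, and the helper names are hypothetical.

```python
def to_scannet_name(scene_id, scene_map):
    """Look up the ScanNet scene name for an integer id, or None if absent."""
    return scene_map.get(str(scene_id))


def group_by_scannet_scene(records, scene_map, key="scene_id"):
    """Group records by ScanNet scene name, skipping ids not in the mapping."""
    grouped = {}
    for rec in records:
        name = to_scannet_name(rec[key], scene_map)
        if name is not None:
            grouped.setdefault(name, []).append(rec)
    return grouped
```

The mapping keys are strings while the ids in the data are integers, so the lookup converts with `str` first.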

Some questions about the code

Hi, I have some questions:

As there are no descriptions about "task_type" in the annotations, how to split the 'data_part1_all_objaverse.json' and 'data_part2_scene.json' into train/val/test sets in proportion of (8:1:1)?
I found you have uploaded files "voxelized_features_sam_nonzero_preprocess.zip" and "voxelized_voxels_sam_nonzero_preprocess.zip", what's that for?
During the evaluation, the code needs the "coco_fmt_qust_file" annotations for computing accuracy. However there are no descriptions about "coco_fmt_qust_file" in the annotations, so how to calculate the test accuracy?

            if hasattr(dataset[split], "coco_fmt_qust_file") and dataset[split].coco_fmt_qust_file is not None:
                self.ques_files[split] = dataset[split].coco_fmt_qust_file
                self.anno_files[split] = dataset[split].coco_fmt_anno_file
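
For the 8:1:1 question, a reproducible split can be made by shuffling with a fixed seed and slicing. This is a generic sketch, not the authors' split: `split_811` is a hypothetical helper, and passing scene ids instead of individual questions would keep all questions about one scene in the same split.

```python
import random


def split_811(items, seed=0):
    """Shuffle deterministically and split into train/val/test at 8:1:1."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = n * 8 // 10
    n_val = n // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder, so no item is dropped
    return train, val, test
```

Integer arithmetic is used for the split sizes so that rounding never loses or duplicates an item.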

About the annotation of Scannet

Thanks for the great work. I am interested in the data generation. I notice that the QA and dialogue in the released data mention object attributes. How did you get the object attribute annotations for ScanNet?

m@kIoU Metric Results for 3D Dense Captioning

It seems that the appendix provides certain metrics including B-4, M, and R for 3D Dense Captioning.

Since I am doing some comparisons on the Scan2Cap benchmark, could you kindly provide me the m@kIoU evaluation as proposed in https://arxiv.org/pdf/2012.02206.pdf ?

the processing of objaverse feature

When I run blip_oa.py in '3DLanguage_data/ChatCaptioner_based/gen_features/', I get this error:
RuntimeError: Input type (unsigned char) and bias type (c10::Half) should be the same.
I tried revising the code at line 167 to 'output = visual_encoder(image.float())'. Is that OK?

Also, it seems the number of rendered images should be 8, not 4.
