umass-foundation-model / 3d-llm
Code for 3D-LLM: Injecting the 3D World into Large Language Models
License: MIT License
Hi, are there any scripts that can be used to calculate the BLEU, METEOR, ROUGE-L, CIDEr, and EM scores of the model?
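A minimal sketch of the exact-match (EM) part, assuming predictions and references are plain strings keyed by question id. The normalization (lowercasing, stripping punctuation) is an assumption and not necessarily what the repo's evaluation does. BLEU, METEOR, ROUGE-L, and CIDEr are often computed with the pycocoevalcap package, which is not shown here.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(predictions, references):
    """Percentage of questions whose prediction equals any reference answer.

    predictions: dict mapping question_id -> predicted string
    references:  dict mapping question_id -> list of acceptable answers
    """
    if not references:
        return 0.0
    hits = 0
    for qid, answers in references.items():
        pred = normalize(predictions.get(qid, ""))
        if pred in {normalize(a) for a in answers}:
            hits += 1
    return 100.0 * hits / len(references)
```

A missing prediction counts as a wrong answer here, so empty outputs lower the score rather than being skipped.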
Hi,
Thanks for the significant work.
I see that 3D-LLM discards the pre-trained vision encoder, replaces it with features generated offline, and trains the QFormer with a frozen LLM.
Is the QFormer trained from scratch, or did the authors load a pretrained QFormer and fine-tune it? I ask because the point cloud features go through positional encoding before being passed to the QFormer.
Looking forward to the response.
Thank you
'val_1_vqa_result.json' is the file produced after evaluation during training, and there are only 'question_id' and 'answer' for each item in it. How to map each item to those in 'all_questions.json' for val set in pretrain.yaml, so that I could check the val performance?
I also find that there are 21092 question-answer pairs in 'val_1_vqa_result.json', while 67578 question-answer pairs in 'all_questions.json' for val set in pretrain.yaml, the number does not match!
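One way to approach the mapping, sketched under the assumption that every entry in 'all_questions.json' also carries a 'question_id' field with the same meaning as in the result file (this is not confirmed by the repo). The helper also reports result ids that have no counterpart, which may help explain a count mismatch.

```python
import json


def load_json(path):
    """Read a JSON file that holds a list of records."""
    with open(path) as f:
        return json.load(f)


def join_by_question_id(results, questions, key="question_id"):
    """Pair each result with the question that shares its id.

    Returns (matched, missing), where `missing` lists result ids
    that do not appear in `questions`.
    """
    by_id = {q[key]: q for q in questions}
    matched, missing = [], []
    for r in results:
        q = by_id.get(r[key])
        if q is None:
            missing.append(r[key])
        else:
            matched.append({**q, "predicted": r["answer"]})
    return matched, missing
```

Usage would look like `join_by_question_id(load_json("val_1_vqa_result.json"), load_json("all_questions.json"))`, with both paths adjusted to your setup.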
I have downloaded the scannet dataset (scannet_frames_25k) from https://kaldir.vc.in.tum.de/scannet_benchmark/documentation. When I run python direct_3d.py in the three-step pipeline for scenes, it looks like a lot of json files are missing, such as,
How do you render the original scannet dataset? Could you share your own script?
Thanks very much.
Hi, thank you so much for your wonderful work and for giving me a better understanding of VLM.
I would like to ask about running the code 3DLLM_BLIP2-base/inference.py.
I found that if I input my own point cloud into it (the number of features per point is not 1408), it will not work. Upon inspection, the code 3DLLM_BLIP2-base/lavis/models/blip2_models/blip2_t5.py makes heavy use of the number 1407 for parameter determination.
I'm wondering what I should do if I want to pass my own point cloud in and run it. (I'm worried that the rest of the code will also have a lot of 1407 that I won't be able to change all the way around)
For the json files used in pretraining, like 'data_part2_scene_v2.json', 'data_part1_all_objaverse.json', did you use them all for training? Or used some part of them for val and test? If so, which parts are for val and test?
Dear author:
Thanks for your interesting work.
When I run finetuning on finetune_scanqa.yaml with scannet features, an error occurred:
2023-10-12 21:41:15,670 [INFO] load checkpoint from https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/BLIP2/blip2_pretrained_flant5xl.pth
url_or_filename: /data/xxx/code/3d-llm/ckpts/pretrain_blip2_sam_flant5xl_v2.pth
2023-10-12 21:41:22,086 [INFO] number of trainable parameters: 372436480
Traceback (most recent call last):
File "train.py", line 112, in <module>
main()
File "train.py", line 108, in main
runner.train()
File "/data/renruilong/code/3d-llm/3DLLM_BLIP2-base/lavis/runners/runner_base.py", line 354, in train
self._load_checkpoint(self.resume_ckpt_path)
File "/data/renruilong/code/3d-llm/3DLLM_BLIP2-base/lavis/runners/runner_base.py", line 587, in _load_checkpoint
self.optimizer.load_state_dict(checkpoint["optimizer"])
File "/data/renruilong/miniconda3/envs/3dllm/lib/python3.8/site-packages/torch/optim/optimizer.py", line 390, in load_state_dict
raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
It seems like pretrain_blip2_sam_flant5xl_v2.pth does not match the model defined in the code. How can I solve this error?
Best!
Xiaolong
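The traceback above fails inside optimizer.load_state_dict, which rejects a saved state whose parameter groups differ in size from the current optimizer's. A small diagnostic sketch, assuming both states follow the standard PyTorch optimizer state-dict layout (a "param_groups" list whose entries each hold a "params" list); the helper names are hypothetical and not part of the repo.

```python
def param_group_sizes(optimizer_state):
    """Number of parameters in each group of an optimizer state dict."""
    return [len(group["params"]) for group in optimizer_state["param_groups"]]


def compare_optimizer_states(saved_state, current_state):
    """Return (group_index, saved_size, current_size) for every mismatch.

    A size of None means the group exists on one side only.
    """
    saved = param_group_sizes(saved_state)
    current = param_group_sizes(current_state)
    mismatches = []
    for i in range(max(len(saved), len(current))):
        s = saved[i] if i < len(saved) else None
        c = current[i] if i < len(current) else None
        if s != c:
            mismatches.append((i, s, c))
    return mismatches
```

It could be called as `compare_optimizer_states(checkpoint["optimizer"], optimizer.state_dict())` after loading the checkpoint with torch.load. A non-empty result suggests the checkpoint was saved with a different set of trainable parameters than the current run, in which case loading only the model weights rather than resuming the optimizer may be what is wanted.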
Very nice work! Could you add a license file to the repo? Thanks!
The insights gained from 3D-LLM, ranging from model design to data collection, have been truly enlightening for me. I'm incredibly grateful to the authors for open-sourcing this remarkable work. I'm particularly intrigued by how the authors managed to generate the 3D Grounding data. Would you be able to share some details with me?
In the paper, it's mentioned that there are three types of prompt methods. Could you please specify which prompt method was used for generating the 3D Grounding data?
Figure 6 showcases the prompts used for generating data for tasks like task decomposition and 3D-assisted dialog. Could you kindly provide the prompts used for creating 3D Grounding data, along with some in-context examples?
I see 69 objects in your objaverse_feat_subset. How many obj files are there in total in your object dataset from Objaverse? Did you use all files from Objaverse?
I'm wondering if the code for location tokens implementation could be shared?
This is important to understand how ScanRefer task is handled by the proposed method.
Thanks,
Chao
Thanks for the insightful work! Will the training code for the tasks mentioned on the project page be released? I am interested in the research regarding EQA and navigation~😊.
Hi,
Can this model handle LIDAR point cloud data? Thanks!
I downloaded two different pre-computed features for scannet:
voxelized_features_sam_nonzero_preprocess, which according to the documents is for ScanQA finetuning
3dllm_final_scene_data_v2/features, which according to the documents is for pretraining
The two folders have overlapping features for scannet scenes, and the contents differ. What is the difference, and is one of them better to use?
The newly uploaded final_scene_map_dict_scan_v3 conflicts with the old v2 mapping file in some entries. For example:
"1013": "scene0411_00"
"1013": "scene0261_01"
Is this intended? Are we supposed to use the v3 mapping file together with the v2 language data data_part2_scene_v2.json?
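To see how widespread such conflicts are, the two mapping files can be compared key by key. A minimal sketch, assuming both files are flat JSON objects that map a scene id string to a ScanNet scene name, as the two example entries suggest; the function name is hypothetical.

```python
def mapping_conflicts(old_map, new_map):
    """Keys present in both mappings whose values differ.

    Returns {key: (old_value, new_value)}.
    """
    shared = old_map.keys() & new_map.keys()
    return {
        key: (old_map[key], new_map[key])
        for key in sorted(shared)
        if old_map[key] != new_map[key]
    }
```

Loading both files with json.load and passing the resulting dicts in gives the full list of ids that changed between versions.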
Hi, in room inference mode, how can I get the object position (x, y, z) with a prompt?
When I run the training command 'python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/pretrain.yaml' using the latest version, it fails with the error below. I think the cause is the newly added code at lines 54-61 of 'blip2_t5.py'.
Traceback (most recent call last):
File "train.py", line 111, in <module>
main()
File "train.py", line 107, in main
runner.train()
File "/root/3DLLM/3D-LLM-main-10.16/3DLLM_BLIP2-base/lavis/runners/runner_base.py", line 360, in train
train_stats = self.train_epoch(cur_epoch)
File "/root/3DLLM/3D-LLM-main-10.16/3DLLM_BLIP2-base/lavis/runners/runner_base.py", line 396, in train_epoch
return self.task.train_epoch(
File "/root/3DLLM/3D-LLM-main-10.16/3DLLM_BLIP2-base/lavis/tasks/base_task.py", line 110, in train_epoch
return self._train_inner_loop(
File "/root/3DLLM/3D-LLM-main-10.16/3DLLM_BLIP2-base/lavis/tasks/base_task.py", line 211, in _train_inner_loop
loss = self.train_step(model=model, samples=samples)
File "/root/3DLLM/3D-LLM-main-10.16/3DLLM_BLIP2-base/lavis/tasks/base_task.py", line 64, in train_step
loss = model(samples)["loss"]
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1026, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by
making sure all forward function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 1 2
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Hi, when I run the code
python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/3dvqa_ft.yaml
there are some questions:
I downloaded the weights locally from "https://huggingface.co/facebook/opt-2.7b" and replaced 'opt_model' in the code with the local weight file, but it shows that the weight and model sizes don't match.
What directory should I place the downloaded dataset in?
I found that the three annotations files in 3dvqa_ft.yaml do not exist. How can I obtain them?
train:
  storage: ./examples/all_refer_questions_train.json
test:
  storage: ./examples/all_refer_questions_val.json
val:
  storage: ./examples/all_refer_questions_val.json
Nice work!
I want to see how accurately ChatGPT can generate data for different 3D scenes.
Could you push the related scripts to this GitHub repo? That would be very helpful.
Xianzheng
Dear author:
Thanks for your interesting work.
When I run the following command:
python -m torch.distributed.run --nproc_per_node=8 train.py --cfg-path lavis/projects/blip2/train/3dvqa_ft.yaml
I found that the three annotations files in 3dvqa_ft.yaml do not exist. How can I obtain them? Or do I need to generate them locally?
build_info:
  annotations:
    train:
      storage: ./examples/all_refer_questions_train.json
    test:
      storage: ./examples/all_refer_questions_val.json
    val:
      storage: ./examples/all_refer_questions_val.json
Thanks!!
Dear author:
Thanks for your interesting work.
When I run the following command:
cd 3DLLM_BLIP2-base
python evaluate.py --cfg-path lavis/projects/blip2/train/finetune_scanqa.yaml
The test_best_vqa_result.json I obtained is as follows:
{"question_id": 0, "answer": ""},
{"question_id": 1, "answer": ""},
{"question_id": 2, "answer": ""},
{"question_id": 3, "answer": ""},
{"question_id": 4, "answer": "the lower kitchen cabinets the same color"},
{"question_id": 5, "answer": ""},
...
{"question_id": 11, "answer": ""},
{"question_id": 12, "answer": "a white t"},
{"question_id": 13, "answer": ""},
{"question_id": 14, "answer": ""},
{"question_id": 15, "answer": ""},
{"question_id": 16, "answer": "a counter"},
{"question_id": 17, "answer": ""},
{"question_id": 18, "answer": ""}
...
There are many empty answers here. Is this a normal result? If not, how can I solve it?
Best!
Xiaolong
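To quantify how many answers are empty before digging further, a short check like the following may help. It is a sketch that assumes the result file is a JSON list of objects with "question_id" and "answer" fields, as in the excerpt above.

```python
def empty_answer_rate(results):
    """Fraction of result items whose answer is empty or only whitespace."""
    if not results:
        return 0.0
    empty = sum(1 for r in results if not r.get("answer", "").strip())
    return empty / len(results)


def empty_question_ids(results):
    """Ids of the questions that received an empty answer."""
    return [r["question_id"] for r in results if not r.get("answer", "").strip()]
```

Comparing this rate across checkpoints or decoding settings can show whether the empty outputs come from the weights that were loaded or from the generation configuration.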
Thanks for your awesome work!
Can you tell me the minimum hw spec for your model inference? I'd like to try your model for inference only!
I would like to train this model of yours or train a model using your data. I'm interested in knowing the approximate duration for training this model and some hardware information. It seems that the paper doesn't appear to have this information.
Thanks!
Before inference, do I need to pre-process the 3D scene data following the Three-step 3D Feature Extraction (Scene)?
For object data, follow the Three-step 3D Feature Extraction (Objaverse)?
From the downloaded scannet data, I see the point clouds are actually voxels, e.g. arrays of [25, 221, 209]...
Could you explain the transformation from the continuous point clouds to these voxels?
I'm interested in finetuning the pretrained model for ScanRefer, so I need the mapping from voxels to point coordinates. Is that right?
Thanks for your attention.
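For reference, a generic uniform-grid voxelization and its inverse can be sketched as below. This is not the repo's preprocessing: the grid origin and voxel size used for the released features are unknown here and appear only as hypothetical parameters.

```python
import math


def point_to_voxel(point, origin, voxel_size):
    """Integer index of the voxel that contains a continuous 3D point."""
    return tuple(math.floor((p - o) / voxel_size) for p, o in zip(point, origin))


def voxel_to_point(index, origin, voxel_size):
    """Center of a voxel, expressed in continuous coordinates."""
    return tuple(o + (i + 0.5) * voxel_size for i, o in zip(index, origin))
```

The inverse returns the voxel center, so a round trip recovers the original point only to within half a voxel along each axis.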
The paper mentions that the authors utilized ChatGPT to generate data for 3D Grounding and 3D Dense Caption. When I opened the 'data_part2_scene.json' file to examine the data corresponding to these two tasks, I noticed that none of the Questions or Answers in the dataset contained coordinates, despite the expectation that these tasks should involve coordinates related to referent objects.
As depicted in Figure 7, the data distribution for the 3D Grounding and 3D Dense Caption tasks is ~25% each. However, theoretically, these tasks could share data since the main distinction lies in one deriving a bounding box from a description, while the other involves generating a description from a given box. I'm curious why the authors use different data for the two tasks.
Hello, and first of all, thank you for sharing your impressive work. I am currently working on fine-tuning a visual grounding process using real-world data from our laboratory. I understand that both the visual grounding and navigation processes are ongoing. Could you please provide any updates regarding the release of the visual grounding data format and the training process?
Dear author:
Thanks for your interesting work.
I have two questions about 3D Localization Mechanism:
By the way, when will you release the detailed data and code for grounding? I am really looking forward to it~
Best!
Xiaolong
When I try to download 3dllm_final_scene_data_blip_v1.zip, it returns:
(base) lhj@mac:/data/pointcloud/3dllm$ gdown https://drive.google.com/uc?id=118JSjS1nl-1v87wC2oTxEmQzSSyCIBXM
Access denied with the following error:
Too many users have viewed or downloaded this file recently. Please
try accessing the file again later. If the file you are trying to
access is particularly large or is shared with many people, it may
take up to 24 hours to be able to view or download the file. If you
still can't access a file after 24 hours, contact your domain
administrator.
You may still be able to access the file from the browser:
https://drive.google.com/uc?id=118JSjS1nl-1v87wC2oTxEmQzSSyCIBXM
Could you please upload this big file to Baiduyun or OneDrive?
Dear authors, @evelinehong,
Thanks for the interesting paper and for releasing the code and models :) I was able to reproduce the ScanQA results on the validation set. I also added computation of the validation loss, similar to the training step, by calling forward in addition to predict_answers in VQATask.valid_step. However, I noticed that while the validation loss goes up, the validation metrics also go up.
Did you notice something similar to this while training, or do you have any suggestions as to why this could happen? I would expect validation loss and metrics to be correlated, even if the val loss is not from autoregressive generation.
Best,
Chandan
When will the dataset be released? Do you have any plan to release the code that constructs the dataset?
Could you please provide a document that specifies which questions belong to which tasks?
Hello, thank you for doing such an inspiring job.
I would like to ask a question, how do you ensure the accuracy of this data when performing 3D-Language Data Generation? Or is there a detection mechanism to detect it?
Hope to receive an answer, thank you!
You guys are really great, I’m so envious!
In Chinese, niubi!
Hi, I tried to extract 3D features following the three-step pipeline using the blip_sam.py file.
I have a few questions about the details:
In line 53-54 of second_step/blip_sam.py, you have:
raw_image = cv2.imread(INPUT_IMAGE_PATH)
raw_image = cv2.resize(raw_image, (512, 512))
Is it correct to skip the color conversion, i.e. not to call img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)?
Once I generated my own point cloud features, how could I verify if the features are consistent to yours? Could you provide a subset of the features (eg. scannet)? I also tried to compute the similarity between point cloud features with text description. However, since the 1408 dim features are the hidden layer output, it is not feasible to compute the similarity with text features which are 256 dim. Do you have any suggestions on this?
For the three step feature extraction, we need the 3DMV-VQA dataset format as mentioned at
https://github.com/evelinehong/3D-CLR-Official
-data # multi-view images of single-room scenes
- 00009-vLpv2VX547B_0 # most rooms contain 1000 views while some contain less. 00009-vLpv2VX547B means house 00009-vLpv2VX547B which is the same as HM3D dataset. _0 means it's the first room of the house
- 0.png
- 0_depth.npy
- 0.json
- 1.png
- 1_depth.npy
- 1.json
...
- 00009-vLpv2VX547B_1
- 0.png
- 0_depth.npy
- 0.json
- 1.png
- 1_depth.npy
- 1.json
...
...
- 00891-cvZr5TUy5C5_9
- 0.png
- 0_depth.npy
- 0.json
- 1.png
- 1_depth.npy
- 1.json
...
data_2 #multi-view images of two-room scenes
- 00009-vLpv2VX547B_0_1 # most rooms contain 1000 views while some contain less. 00009-vLpv2VX547B means house 00009-vLpv2VX547B which is the same as HM3D dataset. _0 means the first room of the house, _1 means the second rooms of the house. Meaning that this scene consists of two rooms of house 00009-vLpv2VX547B .
- 0.png
- 0_depth.npy
- 0.json
- 1.png
- 1_depth.npy
- 1.json
data_3 #multi-view images of three-room scenes
- 00009-vLpv2VX547B_0_1_2 # most rooms contain 1500 views while some contain less. 00009-vLpv2VX547B means house 00009-vLpv2VX547B which is the same as HM3D dataset. _0 means the first room of the house, _1 means the second room of the house, _2 means the third room of the house. Meaning that this scene consists of three rooms of house 00009-vLpv2VX547B .
- 0.png
- 0_depth.npy
- 0.json
- 1.png
- 1_depth.npy
- 1.json
questions_train.json #questions and answers of training dataset
questions_val.json
questions_test.json
all_concepts.json #all concepts of the dataset
objects_bboxes_per_room.zip #object bounding boxes of each room
room_bboxes_with_wallsrevised_axis.zip #room bounding boxes of the houses
single_room_concepts3_after_bboxes_after_replace.zip #Useful concepts of each room
Can you please let me know a way to download a sample of the dataset so I can understand the format? The original dataset is 250GB; a sample for one room would be enough. Please help.
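The layout quoted above can be explored with a short script once any room directory is available. This sketch relies only on the file naming shown in the listing (N.png, N_depth.npy, N.json, and scene names ending in room indices); the function names are hypothetical.

```python
from pathlib import Path


def list_frames(room_dir):
    """Group the files of one room directory by view index.

    Returns {view_index: {"rgb": Path, "depth": Path, "meta": Path}},
    keeping only views for which all three files exist.
    """
    room = Path(room_dir)
    frames = {}
    for rgb in room.glob("*.png"):
        if not rgb.stem.isdigit():
            continue
        depth = room / f"{rgb.stem}_depth.npy"
        meta = room / f"{rgb.stem}.json"
        if depth.exists() and meta.exists():
            frames[int(rgb.stem)] = {"rgb": rgb, "depth": depth, "meta": meta}
    return dict(sorted(frames.items()))


def parse_scene_name(name):
    """Split '00009-vLpv2VX547B_0_1' into ('00009-vLpv2VX547B', [0, 1])."""
    parts = name.split("_")
    rooms = []
    # Trailing numeric segments are room indices; the rest is the house id.
    while len(parts) > 1 and parts[-1].isdigit():
        rooms.insert(0, int(parts.pop()))
    return "_".join(parts), rooms
```

Views with a missing depth or json file are skipped rather than reported, which is convenient for a quick look but hides incomplete downloads.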
Dear authors, when I try to use Blender 3.3 to render objaverse into images, the operating system restarts after I run "{path/to/blender} -b -P render.py -noaudio --disable-crash-handler -- --uid".
Do you know the reason in detail?
Is it the nvidia driver or some other problem?
How is scene data rendered from point clouds into images before the three steps? The project only explains the rendering method for objaverse.
Hi, could you please provide the held-in datasets? Thanks!
Hi, thanks for sharing your fantastic work! As indicated in the paper, when using BLIP-2 as backbones, it requires 64 V100s to train the model with batch size 128 on 8 nodes (16 per node), while using Flamingo, it requires 8 A100, with batch size 16. I wonder if it's possible to train with a smaller batch size? Many thanks.
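If fewer GPUs are available, one general technique is gradient accumulation, where several small steps are summed before each optimizer update. Whether the repo's runner supports it is not confirmed here; the sketch below only shows the arithmetic, and all parameter names are hypothetical.

```python
import math


def accumulation_steps(target_batch, per_gpu_batch, num_gpus):
    """Accumulation steps needed to reach a target effective batch size.

    effective batch = per_gpu_batch * num_gpus * accumulation steps
    """
    per_step = per_gpu_batch * num_gpus
    if per_step <= 0:
        raise ValueError("per_gpu_batch and num_gpus must be positive")
    return math.ceil(target_batch / per_step)
```

For example, reaching an effective batch of 128 with 8 GPUs at 2 samples each would take 8 accumulation steps. Training dynamics are not guaranteed to match the original setup, since the learning rate schedule and any batch-dependent layers can behave differently.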
I was following gen_features in Step3: 3D feature construction from rendered images, but ran into a blocker.
Apparently sam_mask.py expects objaverse_frame_cap3d to contain different image views of a "room". Where/how are those images generated?
Your help would be greatly appreciated!
I could not find released version of validation set for 3D Captioning, 3D-assisted Dialog, Task Decomposition, how could we evaluate the model's performance for these tasks in figure 3?
I notice that the scene_ids in data_part2_scene.json have a dtype of int, while the scene_ids in scannet are something like scene0000_00. So, how do I align these two datasets? Or how can I split the scannet annotations from that file by myself?
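One possible approach, assuming a mapping file such as the final_scene_map_dict_scan_v3 mentioned in another issue on this page maps the integer ids (stored as JSON string keys) to ScanNet scene names, and that each annotation carries the id in a field named "scene_id". Both points are assumptions rather than confirmed details of the released data.

```python
def split_by_source(items, id_to_name, key="scene_id"):
    """Separate items whose scene id maps to a ScanNet scene from the rest.

    Returns (scannet_items, other_items); matched items gain a
    "scannet_scene" field holding the scene name.
    """
    scannet, other = [], []
    for item in items:
        # JSON object keys are strings, so convert the integer id first.
        name = id_to_name.get(str(item[key]))
        if name is None:
            other.append(item)
        else:
            scannet.append({**item, "scannet_scene": name})
    return scannet, other
```

Items that fall into the second list are either from another source or missing from the mapping, so it is worth inspecting a few before discarding them.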
Hi, I have some questions:
if (
    hasattr(dataset[split], "coco_fmt_qust_file")
    and dataset[split].coco_fmt_qust_file is not None
):
    self.ques_files[split] = dataset[split].coco_fmt_qust_file
    self.anno_files[split] = dataset[split].coco_fmt_anno_file
Thanks for the great work. I am interested in the data generation. I notice that the QA and dialogue samples in the released data mention object attributes. How did you get object attribute annotations on scannet?
It seems that the appendix provides certain metrics including B-4, M, and R for 3D Dense Captioning.
Since I am doing some comparisons on the Scan2Cap benchmark, could you kindly provide me the m@kIoU evaluation as proposed in https://arxiv.org/pdf/2012.02206.pdf ?
Hi
When I run the blip_oa.py in '3DLanguage_data/ChatCaptioner_based/gen_features/'
I get the error: RuntimeError: Input type (unsigned char) and bias type (c10::Half) should be the same.
I tried changing line 167 to 'output = visual_encoder(image.float())'. Is that OK?
Also, it seems the number of rendered images should be 8, not 4.