
uni3d's Introduction

Overview

We present Uni3D, a unified and scalable 3D pretraining framework for large-scale 3D representation learning, and explore its limits at the scale of one billion parameters. Uni3D uses a 2D-initialized ViT, pretrained end-to-end, to align 3D point cloud features with image-text aligned features. Thanks to this simple architecture and pretext task, Uni3D can leverage abundant 2D pretrained models as initialization and image-text aligned models as the target, unlocking the great potential of 2D models and scaling-up strategies for the 3D world. We efficiently scale Uni3D up to one billion parameters and set new records on a broad range of 3D tasks.
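
As a rough illustration of this pretext task, the sketch below shows a CLIP-style contrastive loss that pulls a point cloud embedding toward the image and text embeddings of the same object. It is a minimal sketch assuming pre-extracted, frozen image and text features and in-batch negatives; it is not the exact training code of this repository.

import torch
import torch.nn.functional as F

def alignment_loss(pc_feats, img_feats, text_feats, logit_scale=100.0):
    # CLIP-style contrastive alignment: each point cloud embedding is pulled
    # toward the frozen image and text embeddings of the same object, with the
    # other samples in the batch acting as negatives.
    pc = F.normalize(pc_feats, dim=-1)
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    labels = torch.arange(pc.size(0), device=pc.device)

    def contrastive(a, b):
        logits = logit_scale * a @ b.t()
        # symmetric cross-entropy over both matching directions
        return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    return contrastive(pc, img) + contrastive(pc, txt)

In the actual framework, the point encoder is a ViT initialized from 2D pretrained weights, while the image and text targets come from a frozen image-text aligned model such as EVA-CLIP.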

Schedule

We are committed to open-sourcing Uni3D related materials, including:

  • Extended Uni3D to a 3D metric (Uni3D-score) for enhanced semantic coherence in text-to-3D tasks. For details, see GeoDream.
  • Model weights ranging from 6M to 1B parameters.
  • Evaluation code
  • Evaluation data
  • Pretraining code
  • Pretraining data

We hope to foster the growth of our community through open-sourcing and promoting collaboration👬. Let's step towards multimodal intelligence together🍻.

Installation

Clone this repository and install the required packages:

git clone https://github.com/baaivision/Uni3D.git
cd Uni3D

conda create -n uni3d python=3.8
conda activate uni3d
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

pip install -r requirements.txt

# install pointnet2 extensions from https://github.com/erikwijmans/Pointnet2_PyTorch
pip install "git+git://github.com/erikwijmans/Pointnet2_PyTorch.git#egg=pointnet2_ops&subdirectory=pointnet2_ops_lib"

Core packages are listed in requirements.txt.

Model Zoo

| Model | Training Data | Objaverse-LVIS Top1 (Top5) | ModelNet40 Top1 (Top5) | ScanObjectNN Top1 (Top5) |
|---|---|---|---|---|
| Uni3d-B | Ensembled w/o LVIS | 45.9 (74.8) | 86.1 (98.7) | 61.7 (89.5) |
| Uni3d-B | Ensembled | 51.7 (80.8) | 86.3 (97.9) | 63.8 (90.2) |
| Uni3d-L | Ensembled w/o LVIS | 46.2 (74.7) | 86.6 (97.8) | 58.4 (90.1) |
| Uni3d-L | Ensembled | 53.1 (81.5) | 86.3 (98.3) | 58.2 (89.4) |
| Uni3d-g | Ensembled w/o LVIS | 47.2 (76.1) | 86.8 (98.4) | 66.5 (90.1) |
| Uni3d-g | Ensembled | 53.5 (82.0) | 87.3 (99.2) | 63.9 (91.7) |
| Uni3d-g 🔥 | Ensembled | 55.3 (82.9) | 88.2 (99.3) | 65.3 (92.7) |

Evaluation of Zero-shot 3D classification

We evaluate the zero-shot 3D classification performance on three datasets: Objaverse-LVIS, ModelNet40 and ScanObjectNN.

  1. Please refer to DATASETS.md for evaluation dataset preparation.
  2. [Recommended 🤗] Download the CLIP model and put it in the /path/to/clip_model folder.
  3. Download the model zoo weights and put them in the /path/to/checkpoints folder.
  4. Run bash scripts/inference.sh [scale] to evaluate the model on the above datasets, e.g., bash scripts/inference.sh giant.
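
For intuition, the sketch below shows how zero-shot classification works conceptually: class names are filled into prompt templates, encoded with the CLIP text encoder, and compared with the normalized Uni3D point cloud embedding by cosine similarity. The model name, templates, and the pc_feature tensor here are illustrative placeholders rather than the exact settings used by scripts/inference.sh.

import torch
import torch.nn.functional as F
import open_clip

# Illustrative settings; the repo's actual model name and templates may differ.
clip_model, _, _ = open_clip.create_model_and_transforms(
    'EVA02-E-14-plus', pretrained='/path/to/clip_model.bin')
tokenizer = open_clip.get_tokenizer('EVA02-E-14-plus')
templates = ["a point cloud model of {}.", "a 3D model of {}."]

@torch.no_grad()
def build_text_classifier(class_names):
    # One classifier weight per class: average the embeddings of all templates.
    weights = []
    for name in class_names:
        tokens = tokenizer([t.format(name) for t in templates])
        feats = F.normalize(clip_model.encode_text(tokens), dim=-1)
        weights.append(F.normalize(feats.mean(dim=0), dim=-1))
    return torch.stack(weights)  # (num_classes, embed_dim)

# pc_feature: (1, embed_dim) L2-normalized output of Uni3D's encode_pc
# logits = pc_feature @ build_text_classifier(class_names).t()
# predicted_class = logits.argmax(dim=-1)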

Pre-training

  1. Please refer to DATASETS.md for pre-train dataset preparation.
  2. [Recommended 🤗] Download the CLIP model and put it in the /path/to/clip_model folder.
  3. [Recommended 🤗] Download the initialization model and put it in the /path/to/init_model folder.
  4. Run bash scripts/pretrain.sh to pre-train the model on ensemble datasets.

Visualization

Open-world Understanding

[Figure: open-world scene understanding]

One-shot Part Segmentation

[Figure: one-shot part segmentation]

Point Cloud Painting

[Figure: point cloud painting]

Cross-modal Retrieval

[Figures: cross-modal retrieval results]

Acknowledgement

Uni3D is built using the awesome EVA, OpenCLIP, timm, DeepSpeed, ULIP and OpenShape.

Citation

@inproceedings{zhou2023uni3d,
  title={Uni3d: Exploring unified 3d representation at scale},
  author={Zhou, Junsheng and Wang, Jinsheng and Ma, Baorui and Liu, Yu-Shen and Huang, Tiejun and Wang, Xinlong},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2024}
}

uni3d's People

Contributors

junshengzhou · mabaorui · wolfwjs · wxinlong


uni3d's Issues

This is great work, but I have some questions

1. How do I perform cross-modal retrieval?
2. Do I need to download the complete dataset?
3. How long does it take?
I am a beginner and these are basic questions. I would be very grateful if you could answer them.

open_clip.create_model_and_transforms can't create the model.

Thanks for your work. I have a question about open_clip.create_model_and_transforms:

pretrained = "/path/to/clip_model.bin"
model, _, preprocess = open_clip.create_model_and_transforms('EVA02-E-14-plus', pretrained=pretrained)

Something goes wrong here: creating the model takes a few minutes, and this model alone (I don't create any other model) uses 20 GB of memory. Even with pretrained=None it still takes a long time to create the model and still uses 20 GB.
This has confused me for a long time. I would appreciate your help, thanks.

Question about using colors

Hello to the authors,

First off, thanks so much for sharing your project with the community! It's been really interesting to dive into. I've noticed that in the zero-shot classification, RGB data is used as input for the Objaverse dataset, but not for ModelNet40 and ScanObjectNN.

Just wondering, is there a particular reason why RGB data isn't part of the setup for ModelNet40 and ScanObjectNN? Any insights you could provide would be super helpful for understanding the approach better and might even affect how I tackle similar issues in my own work.

Thanks in advance for any thoughts you might share, and looking forward to hearing from you.

Question about the downstream tasks

This is interesting work!
I'm new to this type of representation learning, and I have a few questions about the downstream tasks. I apologize for any inconvenience my questions may cause.

  1. How are the downstream tasks conducted? I understand that the 3D model is encoded into embeddings that live in the same representation space as text or images. Is this prior knowledge that you expect all readers to have?
  2. Do you have any quantified baseline for the text-3D retrieval task?

Model release

Thanks for your nice work!
Your contribution of designing billion-scale 3D representation models can be seen in Table 5 of the experimental section of the paper, but I only found the best-performing Uni3d-g model in the repo. Could you release the other scales of representation model checkpoints, e.g. Uni3d-L and Uni3d-B, so that we can utilize all of these models for further research?
I appreciate your quick response.

Point Cloud Painting

Thank you for your very interesting work! How should One-shot Part Segmentation and Point Cloud Painting be used?

Evaluation dataset Objaverse-LVIS

Nice work! I have a question about the Objaverse-LVIS evaluation dataset. I have downloaded test_datasets.zip; however, its feature dimension seems to be 1280 rather than 1024, so I cannot use it for zero-shot classification directly. Could you provide the correct Objaverse-LVIS evaluation dataset?

Embedding Dimension of Point Encoder

Dear Author,

Many thanks for the great work and the nice open-source.

While trying your code, I noticed that the embedding dimension of the point cloud encoder is set to 1024 to match the dimension of EVA-CLIP. However, I would like to try some smaller versions of CLIP (as also shown in your paper) that only support a 512-dimensional embedding. Would it be possible to provide a pretrained Uni3D with a 512-dimensional embedding?

Thanks for the reply in advance.

Question about the color input.

This is very meaningful work!

  1. Are all training samples given color as input during the pre-training process?
  2. In the zero-shot classification experiment, color input was not used for ModelNet40. Should these dimensions be removed or replaced with a fixed value (0.4)?
  3. If color is not used as input during pre-training, will there be any impact on downstream tasks?

Deterministic problem

When I use pointnet2_ops_lib for PointNet++, I found that the atomicAdd() calls in the .cu files are not executed in a fixed order, which introduces floating-point summation errors in the gradient computation, and setting the seed in Python does not help.
Since your network is built on this package, how do you deal with this non-determinism?
If determinism is not guaranteed, ablation experiments can become cumbersome.
Thank you!

License

Thank you for open-sourcing this work!
Could you please add a license?

Do you have a plan to pre-train Uni3D on Objaverse-XL, a 10M-scale 3D dataset?

Thanks for sharing the paper and code. It's a great work.

Uni3D has scaled the point encoder to 1B parameters, which is rather large, but Objaverse 1.0 contains only about 800K 3D objects. I think this is still relatively small for pre-training a 1B-scale point encoder, and the generalization is still far behind its counterparts in the image and text fields.

Now that Objaverse-XL has been released, containing 10M+ 3D objects, does your team plan to pre-train on the larger Objaverse-XL? I think BAAI has the computing resources to finish such a task. What do you think?

Templates for zero-shot classification

Dear Authors,

Thanks for the great work!

I noticed that the templates for zero-shot classification are a list of prompts (in the templates.json file). May I ask whether these templates come from previous works, or why you chose these particular templates?

Thank you for your reply in advance.

Best regards

what's the difference between uni3d_g and uni3d_g with `fire` symbol?

Thanks for sharing the paper, code and weights. It's a great work!

One point that confuses me: in the Model Zoo, there are two uni3d_g models pre-trained on the Ensembled dataset.

Their zero-shot accuracy is quite different, and the uni3d_g with the fire symbol is much better than the plain uni3d_g. What is the difference between these two models?

Looking forward to your reply. Thanks.

installing pointnet2_ops_lib

I am trying to run pip install "git+git://github.com/erikwijmans/Pointnet2_PyTorch.git#egg=pointnet2_ops&subdirectory=pointnet2_ops_lib"

but it fails with an error like:

Collecting pointnet2_ops
Cloning git://github.com/erikwijmans/Pointnet2_PyTorch.git to /tmp/pip-install-yeknjq64/pointnet2-ops_0f5c19c3381948d0b75fd75f60b29bcb
Running command git clone --filter=blob:none --quiet git://github.com/erikwijmans/Pointnet2_PyTorch.git /tmp/pip-install-yeknjq64/pointnet2-ops_0f5c19c3381948d0b75fd75f60b29bcb
fatal: unable to connect to github.com:
github.com[0: 20.205.243.166]: errno=Connection timed out

error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet git://github.com/erikwijmans/Pointnet2_PyTorch.git /tmp/pip-install-yeknjq64/pointnet2-ops_0f5c19c3381948d0b75fd75f60b29bcb did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× git clone --filter=blob:none --quiet git://github.com/erikwijmans/Pointnet2_PyTorch.git /tmp/pip-install-yeknjq64/pointnet2-ops_0f5c19c3381948d0b75fd75f60b29bcb did not run successfully.
│ exit code: 128
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

Question about Customized Dataset

Hello, I'm using Uni3D to extract embeddings from 3D models from Objaverse. The models are in .glb format, and I don't know how to convert them into the .npy format that you use for Objaverse-LVIS. Do you know how to do this, or where I can find it?

By the way, the code snippet below is how I extract the embeddings of the 3D models. Am I missing something?

import torch
from tqdm import tqdm

# create_uni3d and device come from the Uni3D repo / the surrounding script.

@torch.no_grad()
def encode(dataloader):
    model = create_uni3d()
    model.eval()
    model.to(device)
    embeddings = []
    for (pc, _, _, rgb) in tqdm(dataloader, desc="Extracting embeddings"):
        pc = pc.to(device=device, non_blocking=True)
        rgb = rgb.to(device=device, non_blocking=True)
        # Concatenate xyz and rgb into a (B, N, 6) input, as encode_pc expects.
        feature = torch.cat((pc, rgb), dim=-1)
        # Encode and L2-normalize the point cloud embeddings.
        pc_features = model.encode_pc(feature)
        pc_features = pc_features / pc_features.norm(dim=-1, keepdim=True)
        embeddings.append(pc_features)
    return torch.cat(embeddings, dim=0)
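
For the .glb-to-.npy part, below is a minimal sketch of one possible conversion. It assumes the trimesh library, uniform surface sampling, normalization to a unit sphere, and the fixed 0.4 placeholder color mentioned elsewhere in these issues; this may not match the preprocessing actually used to build Objaverse-LVIS.

import numpy as np
import trimesh

def glb_to_npy(glb_path, out_path, num_points=10000):
    # Load the .glb as a single mesh (force='mesh' merges scene geometry).
    mesh = trimesh.load(glb_path, force="mesh")
    # Uniformly sample points on the mesh surface.
    points, _ = trimesh.sample.sample_surface(mesh, num_points)
    points = np.asarray(points, dtype=np.float32)
    # Center and scale to a unit sphere (a common point cloud convention).
    points -= points.mean(axis=0)
    points /= np.max(np.linalg.norm(points, axis=1))
    # Placeholder color channel (0.4) when real per-point RGB is unavailable.
    colors = np.full_like(points, 0.4)
    np.save(out_path, np.concatenate([points, colors], axis=1))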

open-world understanding

Thank you for your excellent work. I have a question about applying a point cloud classification model trained on object datasets to 3D scene segmentation tasks like ScanNet. Does the model require fine-tuning to adapt to the complexities of full scene segmentation, or can the original model be directly applied to this task? Your insights would be very helpful. Thank you!

Question about visualization on ScanNet

Thank you for your interesting work! I just wonder how you perform open-world understanding on the ScanNet dataset?
