
metatransformer's Introduction

1 Multimedia Lab, The Chinese University of Hong Kong
2 OpenGVLab, Shanghai AI Laboratory
* Equal Contribution  Corresponding Author  Project Lead 

arXiv website blog-cn Hugging Face Spaces OpenXLab

Meta-Transformer with Large Language Models ✨✨✨

We're thrilled to present OneLLM, which combines the Meta-Transformer framework with Multimodal Large Language Models. It performs multimodal joint training 🚀, supports more modalities including fMRI, depth, and normal maps 🚀, and demonstrates impressive performance on 25 benchmarks 🚀🚀🚀.

🔥🔥 The code, pretrained models, and datasets are publicly available at OneLLM.

🔥🔥 Project Website is at OneLLM.

🌟 Single Foundation Model Supports A Wide Range of Applications

As a foundation model, Meta-Transformer can handle data from 12 modalities, enabling it to support a wide range of applications. As shown in this figure, Meta-Transformer can serve downstream tasks including stock analysis 📈, weather forecasting ☀️ ☔ ☁️ ❄️ ⛄ ⚡, remote sensing 📡, autonomous driving 🚗, social-network analysis 🌍, speech recognition 🔉, and more.

Table 1: Meta-Transformer is capable of handling up to 12 modalities, including natural language, RGB images, point clouds, audio, video, tabular data, graphs, time-series data, hyper-spectral images, IMU, medical images, and infrared images.

🚩🚩🚩 Shared-Encoder, Unpaired Data, More Modalities

This repository is built to explore the potential and extensibility of transformers for multimodal learning. We exploit the ability of Transformers to handle variable-length sequences: we propose a Data-to-Sequence tokenization that follows a meta-scheme, and apply it to 12 modalities including text, image, point cloud, audio, video, infrared, hyper-spectral, X-Ray, tabular, graph, time-series, and Inertial Measurement Unit (IMU) data.

After obtaining the token sequence, we employ a modality-shared encoder to extract representations across different modalities. With task-specific heads, Meta-Transformer can handle various tasks on the different modalities, such as classification, detection, and segmentation.

🌟 News

  • 2023.8.17: Released code to directly obtain embeddings from multiple modalities. We will further release code for applying Meta-Transformer to human-centric vision tasks.
  • 2023.8.2: 🎉🎉🎉 The implementation of Meta-Transformer for image, point cloud, graph, tabular, time-series, X-Ray, hyper-spectral, and LiDAR data has been released. We also release a very powerful foundation model for Autonomous Driving 🚀🚀🚀.
  • 2023.7.22: Pretrained weights and a usage demo for Meta-Transformer have been released. Comprehensive documentation and the implementation of the image modality are underway and will be released soon. Stay tuned for more exciting updates! ⌛⌛⌛
  • 2023.7.21: The paper is released on arXiv, and the code will be released gradually.
  • 2023.7.8: GitHub repository initialization.

🔓 Model Zoo

Open-source Modality-Agnostic Models
Model | Pretraining | Scale | #Param | Download | Download (China mirror)
Meta-Transformer-B16 | LAION-2B | Base | 85M | ckpt | ckpt
Meta-Transformer-L14 | LAION-2B | Large | 302M | ckpt | ckpt
  • Demo of Use for Pretrained Encoder
import torch
import torch.nn as nn
from timm.models.vision_transformer import Block
from Data2Seq import Data2Seq

# Build one tokenizer per modality (all projecting into the shared 768-dim token space).
video_tokenizer = Data2Seq(modality='video', dim=768)
audio_tokenizer = Data2Seq(modality='audio', dim=768)
time_series_tokenizer = Data2Seq(modality='time-series', dim=768)

# Tokenize each input (video, audio, time_data are your raw data tensors)
# and concatenate the token sequences along the sequence dimension.
features = torch.concat([video_tokenizer(video), audio_tokenizer(audio), time_series_tokenizer(time_data)], dim=1)

# For the base-scale encoder:
ckpt = torch.load("Meta-Transformer_base_patch16_encoder.pth")
encoder = nn.Sequential(*[
            Block(
                dim=768,
                num_heads=12,
                mlp_ratio=4.,
                qkv_bias=True,
                norm_layer=nn.LayerNorm,
                act_layer=nn.GELU
            )
            for i in range(12)])
encoder.load_state_dict(ckpt, strict=True)

# For the large-scale encoder:
ckpt = torch.load("Meta-Transformer_large_patch14_encoder.pth")
encoder = nn.Sequential(*[
            Block(
                dim=1024,
                num_heads=16,
                mlp_ratio=4.,
                qkv_bias=True,
                norm_layer=nn.LayerNorm,
                act_layer=nn.GELU
            )
            for i in range(24)])
encoder.load_state_dict(ckpt, strict=True)

encoded_features = encoder(features)
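
A possible next step, shown only as an illustrative sketch: attach a small task-specific head on top of the encoded token sequence. The pooling, head, and class count below are placeholders and are not part of the released checkpoints.

# Hypothetical task-specific classification head (illustrative, not from the repository).
embed_dim = 768        # 768 if you loaded the base encoder above, 1024 for the large one
num_classes = 10       # placeholder for your own task
head = nn.Sequential(
    nn.LayerNorm(embed_dim),
    nn.Linear(embed_dim, num_classes),
)
pooled = encoded_features.mean(dim=1)   # average-pool over the token dimension
logits = head(pooled)                   # (batch, num_classes)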

🕙 ToDo

  • [x] Meta-Transformer with Large Language Models.
  • [x] Multimodal Joint Training with Meta-Transformer.
  • [x] Support More Modalities and More Tasks.

Contact

🚀🚀🚀 We aspire to shape this repository into a formidable foundation for mainstream AI perception tasks across diverse modalities. Your contributions can play a significant role in this endeavor, and we warmly welcome your participation in our project!

To contact us, never hesitate to send an email to [email protected], [email protected], [email protected], or [email protected]!

Citation

If the code and paper help your research, please kindly cite:

@article{zhang2023meta,
  title={Meta-transformer: A unified framework for multimodal learning},
  author={Zhang, Yiyuan and Gong, Kaixiong and Zhang, Kaipeng and Li, Hongsheng and Qiao, Yu and Ouyang, Wanli and Yue, Xiangyu},
  journal={arXiv preprint arXiv:2307.10802},
  year={2023}
}

License

This project is released under the Apache 2.0 license.

Acknowledgement

This code is developed based on excellent open-sourced projects including MMClassification, MMDetection, MMSegmentation, OpenPoints, Time-Series-Library, Graphormer, SpectralFormer, and ViT-Adapter.

metatransformer's People

Contributors

bobrown, eltociear, invictus717, kxgong


metatransformer's Issues

Data2Seq > Hyper_Spectrum.py update from self.cls_tokens to self.cls_token

https://github.com/invictus717/MetaTransformer/blob/d30327826f4c2f158df137568e9557cb715026ec/Data2Seq/Hyper_Spectrum.py#L21C9-L21C69

From:

cls_tokens = repeat(self.cls_tokens, '() n d -> b n d', b=b)

To:

cls_tokens = repeat(self.cls_token, '() n d -> b n d', b=b)

which would be equivalent to

cls_tokens = repeat(self.cls_token, '() n d -> b n d', b=b)  # [b, 1, dim]

Is that correct?

data2seq

Could you give a demo of how to use the Data2Seq code?

Questions about experiments

Hi, thanks for your great contributions.

I have some questions after reading your paper:

  1. In Table 3 (GLUE benchmark), you report a pre-training Size but do not give the corresponding unit. I know that LAION-2B contains 2B image-text pairs, but what do 0.8B, 3.3B, 4,5000B mean for the language models? The number of tokens, or the disk space of the text files?
  2. In Table 3 (GLUE benchmark), the model with the frozen LAION-2B-pretrained backbone (i.e., Meta-Transformer-B16_F) lags behind the SOTA performance by a large margin (similarly in Table 9 - video understanding, and Table 12 - graph data understanding). In light of this, I mentioned in issue #49 that the released Meta-Transformer weights may give inferior representations, especially for modalities like text, video, and graph (because the backbone is pre-trained on LAION-2B, not trained jointly on data across 12 modalities). Do I misunderstand or overlook something? It seems that one would have to fine-tune the backbone to get more discriminative representations (e.g., Table 3, Meta-Transformer-B16_F vs. Meta-Transformer-B16_T).

How can the pretrained models be downloaded from mainland China?

👍 This is a fantastic open-source project, but we in mainland China cannot access Google. Do you plan to also upload the pretrained models to a domestic (China-accessible) mirror?

Error in the audio module code

Line 15 of run_sc.sh in the Audio module reports that the venvast file does not exist. How can this be resolved?

Finetuning the model

First of all, great work by your team; this will be the new breakthrough in AI.
How can we fine-tune the Meta-Transformer model with our task-specific data?
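
One possible fine-tuning recipe, sketched under assumptions: `tokenizer`, `encoder`, `head`, and `train_loader` below are placeholders for a Data2Seq tokenizer for your modality, the pretrained encoder from the demo above, a task-specific head, and your own data loader. The paper reports both frozen (-F) and tuned (-T) backbones; the frozen variant is shown here.

import torch

# Freeze the shared encoder and train only the task head (Meta-Transformer-F style);
# skip the freezing loop to also tune the encoder (Meta-Transformer-T style).
for p in encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

for inputs, labels in train_loader:
    tokens = tokenizer(inputs)            # modality-specific Data-to-Sequence tokenization
    features = encoder(tokens)            # (batch, num_tokens, dim)
    logits = head(features.mean(dim=1))   # pooled features -> task logits
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()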

Is BBOX data supported?

Can you tell me if BBOX data (either 2D or 3D) is currently supported? If not, can you give some guidance on using richer input data types?

video

Your paper mentions video recognition. Which part of the code implements it? (screenshot attached)

Questions about inference

Hello, how do you determine which modality is being input during inference? Is a classification network used before the unimodal expert transformer?

Video

Hello, could you please provide the data preparation code for the K400 dataset shown in the attached screenshot?

video

Hello! I ran into the following problem when running the video run.sh (error screenshot attached). The error occurs when executing:
ckpt = torch.load("Meta-Transformer_base_patch16_encoder.pth")
model.blocks.load_state_dict(ckpt, strict=True)
I suspect it is caused by a mismatch between the model structures. Could you share how you would resolve this? Thanks!
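
Not an official fix, but a common way to debug this kind of mismatch is to load non-strictly and print which keys differ. `model.blocks` below follows the snippet in this issue and is assumed to be the stack of Transformer blocks.

import torch

ckpt = torch.load("Meta-Transformer_base_patch16_encoder.pth", map_location="cpu")
# strict=False copies only the matching weights and reports the rest.
missing, unexpected = model.blocks.load_state_dict(ckpt, strict=False)
print("missing keys:", missing)        # weights the model expects but the checkpoint lacks
print("unexpected keys:", unexpected)  # checkpoint entries the model has no slot for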

Data2Seq Usage/Embedding Dim

Thanks for sharing the code for embedding modalities!

I'd like to use Meta Transformer in my research (I use images and text) and have multiple short questions:

  1. When embedding an image with the Data2Seq code, I get an embedding of shape (batch_size, num_patches, 768). Is this the correct embedding shape for images?

  2. a) When passing text, Data2Seq produces a dict with input_ids (tokens) and attention_masks.
     b) When using get_text_embeddings() to embed text, I get an embedding of shape (batch_size, 768). The encoder as loaded in the demo section does not accept this shape (I need to add unsqueeze() to add another dimension). What is the correct way to embed text, and what input shape does the encoder expect? (A possible reshape is sketched after this issue.)

  3. Input shapes after embedding should be the same across all modalities, correct?

  4. Are the weights for the embedding layers available to download, or would I need to learn them separately?

Thanks in advance for your time!
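
On question 2 b), one possible (unofficial) workaround is to treat the pooled text embedding as a length-1 token sequence before passing it to the encoder; `get_text_embeddings`, `texts`, and `encoder` below refer to the names used in this issue and in the README demo.

texts = ["a photo of a dog"]             # placeholder input
text_emb = get_text_embeddings(texts)    # (batch_size, 768), as described above
text_tokens = text_emb.unsqueeze(1)      # (batch_size, 1, 768): a one-token sequence
encoded_text = encoder(text_tokens)      # same shape as the input token sequence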

How to pretrain Unified Multimodal Model?

I would like to express my appreciation for your exceptional work. I attended your live presentation yesterday and gained valuable insights. I am interested in exploring the Unified Multimodal Model that you proposed within my research domain. As my multimodal data is of fine granularity, I am considering fine-tuning or retraining your model to suit my needs.

I kindly request if it would be possible for you to open-source some of the pretraining procedures for the Unified Multimodal Model. This would greatly assist me in adapting the model to my specific requirements.

Thank you very much for your outstanding contributions.

audio

Can the audio training be resumed after an interruption?

Code for Tokenization?

Thank you for sharing this most exciting work!

I would like to know: Is the code for tokenizing different modalities not released yet or am I failing to read where in the code the tokenization happens?

I would like to use Meta Transformer on a custom Data Set, with image and text inputs.

As far as I understand it, the workflow would be:

token_text, token_image = tokenize(text), tokenize(image)

embedding_text = pretrained_encoder(token_text)  # as described in demo
embedding_image = pretrained_encoder(token_image)  # as described in demo

downstream_task(embedding_text, embedding_image) 

Is this correct on a very high level?

Thanks in advance!

demo use

After reading your paper, I still don't know how to use your model. Could you please provide a complete example? Please include a full 'Demo of Use for Pretrained Encoder' here. Thank you very much.

How to compute similarity score between different modalities?

Hello! The project looks very promising, but I'm having some issues starting up with it. Given that there is a common embedding space for all these modalities, all I would like to do with it is to encode and compare different data types.

Very rudimentary example would be that I have some pictures of animals and some point clouds of animals and I'd like to calculate the similarity matrix to find out which picture matches which point cloud.

As far as I understand it,

  1. this README.md / Model Zoo snippet loads the foundation model
  2. then I'd need to encode all my different file types with some functions that I can't find - I'm stuck here
  3. multiply, maybe softmax, etc. the two feature arrays to get a similarity matrix/heatmap (see the sketch below)
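
A rough sketch of step 3, assuming `image_feats` and `pc_feats` are pooled per-sample embeddings (one vector per image / point cloud) produced by the shared encoder; both names are placeholders.

import torch
import torch.nn.functional as F

# image_feats: (N_img, dim), pc_feats: (N_pc, dim) -- pooled encoder outputs (placeholders)
image_feats = F.normalize(image_feats, dim=-1)
pc_feats = F.normalize(pc_feats, dim=-1)
similarity = image_feats @ pc_feats.T      # (N_img, N_pc) cosine-similarity matrix
best_match = similarity.argmax(dim=1)      # index of the closest point cloud for each image

Whether the released encoder actually aligns these modalities in a shared space well enough for retrieval is a separate question (raised elsewhere in these issues), so treat this only as the mechanical step.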

Replicating training?

How can the training process be replicated? Would you release a detailed guide for replication? What training rig did you use?

Explain please

Hello! I've been trying to figure out Meta-Transformer for two weeks now and I can't get the embeddings I need. Please share the code for the following example: how to get text and image embeddings from the words "dog", "car", "bird" and their pictures. Thanks!

Why does this error occur when training on images?

I used the HTC++ model from Image and downloaded the COCO dataset as required in the README, but training keeps raising the error FileNotFoundError: [Errno 2] No such file or directory: 'data/coco/stuffthingmaps/train2017/000000248242.png'. All the files in my dataset are .jpg, yet after renaming them to .png some of the data still requires .png files. How should I solve this problem?

Question about the "pretrain-finetune" pipeline

Hi, thanks for your great contributions.

I am curious about your "pretrain-finetune" pipeline.

According to the paper and your code, it seems that the pipeline is:

  1. you first carry out pre-training on LAION-2B with a CLIP-style objective to obtain a modality-agnostic encoder
  2. then you integrate a data-to-sequence tokenizer (whose implementation depends on the modality of downstream task) with the pre-trained encoder and fine-tune the model.

Do I understand this correctly?

Here are my concerns:

  1. The core idea of Meta-Transformer is one shared backbone + different tokenizers + different heads. However, I don't see any joint training on data across the 12 modalities. Instead, it seems that you carry out fine-tuning 12 times, where in some cases the so-called "shared" backbone needs to be trained to fit a specific modality to obtain superior performance.
  2. Following the first concern, the demo you give in the README may give inferior representations for modalities other than images, right? This is because the released pre-trained weights are obtained from step 1) pre-training above, not from joint training on 12 modalities.

The tokenizer for time-series data

The paper mentions that Meta-Transformer uses the tokenizer of Autoformer for the time-series forecasting task. However, Autoformer does not have a "tokenizer"; its encoder directly takes the raw time-series data as input. I wonder if you mistook it for PatchTST or something else?

Enhance Codebase with Comprehensive Docstrings

Please consider adding detailed docstrings throughout the code, so that:

  • Each function, class, and module in the codebase is accompanied by a comprehensive docstring.
  • Docstrings follow the established documentation format and conventions.
  • The README is updated to include guidelines for writing effective docstrings.

Multiple modalities

How can multiple modalities be used at the same time for a task, such as text+image, text+audio, or text+point cloud?

For More Setup Instructions

Hello

Thank you for your amazing work, which is quite promising. However, I'm encountering difficulties due to the lack of clear environment setup instructions. Could you please provide more detailed environment requirements and guidance in the documentation? This would greatly help users like me set up the project smoothly and contribute effectively. Additionally, I would like to inquire about the current status of the project's code upload.

Thank you for your consideration.

Best regards,

audio

In line 56 of run_sc.sh in the audio part, what does the CUDA_VISIBLE_DEVICES parameter mean?

how to use it?

Hello! Your transformer is amazing! But I'm a beginner in data science. I have to do research for a university task: we want to predict how negotiations will end. We have various modalities including video, audio, and time-series EEG. Do you have a demo showing how to use the transformer for such tasks? If so, please share it.
Thanks!

paper

The second paragraph of the paper's Introduction is missing a ".".

There is a typo in README file, LOL

After obtaining the token sequence, we employ a modality-shared encoder to extract representation across different modalities. With task-specific heads, Meta-Transformer can hanle various tasks on the different modalities, such as: classification, detection, and segmentation.

The hanle should be handle.

LOL, nice work, folks, and thanks for sharing.

Looking forward to the code.

audio

I ran into a problem while adapting the dataloader.
The TAU audio used in my previous work consists of two-channel (stereo) wav files, while the dataset used in your work consists of single-channel (mono) wav files.
The part of your code that converts audio files into filter-bank features throws an error there, because the two kinds of files have different shapes after loading. I don't know much about audio; my guess is that the problem can be solved by applying the function that currently handles a single channel to both channels of the stereo file?
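
A minimal sketch of a simpler alternative, downmixing the stereo waveform to mono before computing the filter-bank features. The torchaudio calls below are an assumption about the preprocessing pipeline (adapt them to the repository's actual code), and the file name is a placeholder.

import torchaudio

waveform, sr = torchaudio.load("stereo_clip.wav")     # waveform: (num_channels, num_samples)
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)     # average channels -> mono (1, num_samples)
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, sample_frequency=sr, num_mel_bins=128)  # (num_frames, 128) filter-bank features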

Audio question

After running the audio part as in the tutorial with
! bash run_sc.sh
why does the training record in log.txt turn out to be empty? (two screenshots attached)

question about training

I wonder whether modality A can interact with modality B during training.
My guess is that each tokenizer processes its own modality separately, each modality is transformed by the frozen encoder (either by concatenating all the data and setting an attention mask if modality A must not interact with modality B, or simply by forwarding 12 times?), and each modality is then sent to its own head to compute the loss.
Am I right?

On using the text modality

Thank you very much for this outstanding work; it has been a great inspiration to me!
When using Meta-Transformer to process the text modality, I am a bit confused about a few points:

  1. In the paper, the text tokenizer first uses CLIP to split the text into a sequence of subwords and then uses an embedding layer to vectorize them. How should this embedding layer be expressed in code?
  2. In the Data2Seq module, CLIP is used directly to embed the resulting subwords, but this differs somewhat from the tokenization process described in the paper. In that process, shouldn't we only use a very lightweight projection layer rather than CLIP?
  3. Finally, could you provide a small demo of text-modality tokenization? (A rough sketch follows below.)

Thanks again for such creative work!
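
Not the repository's implementation, just one illustrative reading of the paper's description: CLIP's BPE tokenizer produces subword IDs, and a lightweight learnable embedding layer maps them to the shared token dimension. The Hugging Face checkpoint name below is an assumption.

import torch.nn as nn
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch16")  # CLIP BPE tokenizer
embed = nn.Embedding(tokenizer.vocab_size, 768)    # lightweight learnable embedding layer

ids = tokenizer(["a photo of a dog"], return_tensors="pt", padding=True).input_ids
text_tokens = embed(ids)                           # (batch, seq_len, 768) for the shared encoder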

What's the difference between the patch 14 and patch 16 models?

Hi,
Thanks for your great work!
I'm a beginner in LLMs; could you please tell me the difference between the patch 14 and patch 16 models?
Also, how should the pre-trained models be used? For example, if I want to do a text generation task, should I take a model like LLaMA or Vicuna and replace its encoder with this pre-trained encoder?

Data2Seq Weights

In the Data2Seq code for getting embeddings, the Image and Video embedders have a Conv2d and Conv3d, respectively. Do you plan to release the pre-trained weights for these layers?

How to use the larger version of the model

Hello, thank you very much for this excellent work!
I want to use Meta-Transformer for some classification tasks. I first used the base version of the model together with part of your code, and it ran successfully.
However, when I tried switching to the large version, some problems appeared: compared with the base version, it seems to have 24 blocks, and the required embedding dimension becomes 1024.
I tried modifying the model definition to fit the large version but could not get it to work. Could you provide the model definition for the large version and explain how to use it?
Thanks!

audio preprocess

Nice work! Could you please provide the code for the audio preprocessing (Data-to-Sequence tokenization) described in Section 3.2 (Audio Spectrogram) of your paper?
