Yiyuan Zhang¹,²* Kaixiong Gong¹,²* Kaipeng Zhang²,✉
Hongsheng Li¹,² Yu Qiao² Wanli Ouyang² Xiangyu Yue¹,✉

¹Multimedia Lab, The Chinese University of Hong Kong
²OpenGVLab, Shanghai AI Laboratory
* Equal Contribution ✉ Corresponding Author

🚩🚩🚩 Shared-Encoder, Unpaired Data, More Modalities

This repository explores the potential and extensibility of Transformers for multimodal learning. We take advantage of the Transformer's ability to process variable-length sequences and propose Data-to-Sequence tokenization, which follows a common meta-scheme. We apply it to 12 modalities: text, image, point cloud, audio, video, infrared, hyper-spectral, X-Ray, tabular, graph, time-series, and Inertial Measurement Unit (IMU) data.
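To make the Data-to-Sequence idea concrete, here is a toy sketch in plain Python, not the tokenizers used in this repository. It assumes the simplest possible scheme: an image is cut into non-overlapping patches and a time series into non-overlapping windows, so both become a sequence of equal-width tokens. The function names `image_to_tokens` and `series_to_tokens` are hypothetical and exist only for illustration.

```python
from typing import List

Token = List[float]


def image_to_tokens(image: List[List[float]], patch: int) -> List[Token]:
    """Split an H x W grayscale image into flattened patch x patch tokens."""
    h, w = len(image), len(image[0])
    tokens = []
    for top in range(0, h - patch + 1, patch):
        for left in range(0, w - patch + 1, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def series_to_tokens(series: List[float], window: int) -> List[Token]:
    """Split a 1-D series into non-overlapping windows, dropping any remainder."""
    return [series[i:i + window]
            for i in range(0, len(series) - window + 1, window)]


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]  # 4 x 4 image
series = [float(t) for t in range(10)]                            # 10 time steps

image_tokens = image_to_tokens(image, patch=2)      # 4 tokens, each of width 4
series_tokens = series_to_tokens(series, window=4)  # 2 tokens, each of width 4
```

The two inputs produce sequences of different lengths (4 and 2 tokens) but the same token width, which is the property that lets one sequence model consume both. A real tokenizer would typically also project each token into a learned embedding space.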

After obtaining the token sequence, we employ a modality-shared encoder to extract representations across modalities. With task-specific heads, Meta-Transformer can handle various tasks on these modalities, such as classification, detection, and segmentation.
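The shared-encoder-plus-heads layout can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the class name `SharedEncoderModel`, the embedding width of 32, the mean pooling, the task names, and the random tensors standing in for tokenized inputs are all hypothetical.

```python
import torch
from torch import nn

EMBED_DIM = 32  # hypothetical shared token width


class SharedEncoderModel(nn.Module):
    """One Transformer encoder shared by all modalities, one linear head per task."""

    def __init__(self, num_classes: dict):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=EMBED_DIM, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # Task-specific heads (assumed here to be simple classifiers).
        self.heads = nn.ModuleDict(
            {task: nn.Linear(EMBED_DIM, n) for task, n in num_classes.items()}
        )

    def forward(self, tokens: torch.Tensor, task: str) -> torch.Tensor:
        features = self.encoder(tokens)  # (batch, seq_len, EMBED_DIM)
        pooled = features.mean(dim=1)    # average over the token sequence
        return self.heads[task](pooled)


model = SharedEncoderModel({"image_cls": 10, "audio_cls": 5}).eval()

# Random stand-ins for tokenized inputs; sequence lengths differ per modality.
image_tokens = torch.randn(2, 196, EMBED_DIM)
audio_tokens = torch.randn(2, 50, EMBED_DIM)

with torch.no_grad():
    image_logits = model(image_tokens, "image_cls")  # shape (2, 10)
    audio_logits = model(audio_tokens, "audio_cls")  # shape (2, 5)
```

The same `encoder` weights process both inputs; only the final linear layer changes with the task. Detection or segmentation would need richer heads than the single linear layer assumed here.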

🌟 News

2023.7.22: 🌟🌟🌟 Pretrained weights and a usage demo for our Meta-Transformer have been released. Comprehensive documentation and implementation of the image modality are underway and will be released soon. Stay tuned for more exciting updates!⌛⌛⌛
2023.7.21: The paper is released on arXiv, and the code will be released gradually.
2023.7.8: GitHub repository initialized.

🔓 Model Zoo

Open-source Pretrained Models

Model	Pretraining	Scale	#Param	Download
Meta-Transformer-B16	LAION-2B	Base	85M	ckpt
Meta-Transformer-L14	LAION-2B	Large	302M	ckpt

Demo of Use for Pretrained Encoder

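import timm
import torch
# The backbone must match the checkpoint: the L14 encoder needs a ViT-Large/14 model.
img_model = timm.create_model("vit_large_patch14_224", pretrained=False)
ckpt = torch.load("Meta-Transformer_large_patch14_encoder.pth", map_location="cpu")
img_model.blocks.load_state_dict(ckpt, strict=True)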

🕙 ToDo

Meta-Transformer with Large Language Models.
Multimodal Joint Training with Meta-Transformer.
Support More Modalities and More Tasks.

Contact

Contributions to our project are welcome!

To contact us, please do not hesitate to send an email to [email protected], [email protected], [email protected], or [email protected]!

Citation

If the code or paper helps your research, please cite:

@article{zhang2023metatransformer,
        title={Meta-Transformer: A Unified Framework for Multimodal Learning}, 
        author={Zhang, Yiyuan and Gong, Kaixiong and Zhang, Kaipeng and Li, Hongsheng and Qiao, Yu and Ouyang, Wanli and Yue, Xiangyu},
        year={2023},
        journal={arXiv preprint arXiv:2307.10802},
  }

License

This project is released under the Apache 2.0 license.

Acknowledgement

This code is built on excellent open-source projects including MMClassification, MMDetection, MMSegmentation, OpenPoints, Time-Series-Library, Graphormer, SpectralFormer, and ViT-Adapter.
