🏷️ Recognize Anything: A Strong Image Tagging Model & Tag2Text: Guiding Vision-Language Model via Image Tagging
Official PyTorch Implementation of the Recognize Anything Model (RAM) and the Tag2Text Model.
- RAM is a strong image tagging model, which can recognize any common category with high accuracy.
- Tag2Text is an efficient and controllable vision-language model with tagging guidance.
When combined with localization models (Grounded-SAM), Tag2Text and RAM form a strong and general pipeline for visual semantic analysis.
- [Try our RAM & Tag2Text web Demo! 🤗]
- [Access RAM Homepage]
- [Access Tag2Text Homepage]
- [Read RAM arXiv Paper]
- [Read Tag2Text arXiv Paper]
Recognition and localization are two foundational computer vision tasks.
- The Segment Anything Model (SAM) excels at localization, but falls short on recognition tasks.
- The Recognize Anything Model (RAM) and Tag2Text exhibit exceptional recognition abilities, in terms of both accuracy and scope.
Tag2Text for Vision-Language Tasks.
- Tagging. Without manual annotations, Tag2Text achieves superior image tagging across 3,429 commonly used categories.
- Efficient. Tagging guidance effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks.
- Controllable. Tag2Text permits users to input desired tags, providing the flexibility to compose corresponding texts based on the input tags.
Advancements of RAM over Tag2Text.
- Accuracy. RAM utilizes a data engine to generate additional annotations and clean incorrect ones, resulting in higher accuracy compared to Tag2Text.
- Scope. Tag2Text recognizes 3,400+ fixed tags. RAM upgrades the number to 6,400+, covering more valuable categories. With its open-set capability, RAM can recognize any common category.
- Tag2Text/RAM combined with Grounded-SAM forms a strong and general pipeline for visual semantic analysis, which can automatically recognize, detect, and segment objects in an image!
- Ask-Anything is a multifunctional video question answering tool. Tag2Text provides powerful tagging and captioning capabilities as a fundamental component.
- Prompt-can-anything is a Gradio web library that integrates SOTA multimodal large models, including Tag2Text as the core model for image understanding.
- 2023/06/08: We release the Recognize Anything Model (RAM) Tag2Text web demo 🤗, checkpoints, and inference code!
- 2023/06/07: We release the Recognize Anything Model (RAM), a strong image tagging model!
- 2023/06/05: Tag2Text is combined with Prompt-can-anything.
- 2023/05/20: Tag2Text is combined with VideoChat.
- 2023/04/20: We marry Tag2Text with Grounded-SAM.
- 2023/04/10: Code and checkpoints are available now!
- 2023/03/14: Tag2Text web demo 🤗 is available on Hugging Face Space!
- Release Tag2Text demo.
- Release checkpoints.
- Release inference code.
- Release RAM demo and checkpoints.
- Release training code (by July 8th at the latest).
- Release training datasets (by July 15th at the latest).
| | Name | Backbone | Data | Illustration | Checkpoint |
|---|---|---|---|---|---|
| 1 | RAM-Swin | Swin-Large | COCO, VG, SBU, CC-3M, CC-12M | Demo version; can recognize any common category with high accuracy. | Download link |
| 2 | Tag2Text-Swin | Swin-Base | COCO, VG, SBU, CC-3M, CC-12M | Demo version with comprehensive captions. | Download link |
1. Install the dependencies:

```bash
pip install -r requirements.txt
```

2. Download the RAM pretrained checkpoints.

3. Get the English and Chinese outputs of the images:

```bash
python inference_ram.py --image images/1641173_2291260800.jpg \
  --pretrained pretrained/ram_swin_large_14m.pth
```
RAM zero-shot inference is coming!
1. Install the dependencies:

```bash
pip install -r requirements.txt
```

2. Download the Tag2Text pretrained checkpoints.

3. Get the tagging and captioning results:

```bash
python inference_tag2text.py --image images/1641173_2291260800.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth
```

Or get the tagging and specified captioning results (optional):

```bash
python inference_tag2text.py --image images/1641173_2291260800.jpg \
  --pretrained pretrained/tag2text_swin_14m.pth \
  --specified-tags "cloud,sky"
```
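To run the script over a whole folder of images, a small wrapper can build and invoke the same command lines shown above. The `build_command` helper below is a hypothetical convenience, not part of the repository; it only mirrors the flags from the examples:

```python
import subprocess
from pathlib import Path
from typing import Optional

CHECKPOINT = "pretrained/tag2text_swin_14m.pth"

def build_command(image_path: str,
                  specified_tags: Optional[str] = None) -> list:
    """Build the inference_tag2text.py command line for one image.

    Mirrors the README examples; --specified-tags is only appended when
    the caller passes tags, e.g. "cloud,sky".
    """
    cmd = ["python", "inference_tag2text.py",
           "--image", image_path,
           "--pretrained", CHECKPOINT]
    if specified_tags:
        cmd += ["--specified-tags", specified_tags]
    return cmd

def tag_folder(folder: str) -> None:
    """Run tagging and captioning on every .jpg in a folder, one at a time."""
    for image in sorted(Path(folder).glob("*.jpg")):
        subprocess.run(build_command(str(image)), check=True)
```

Each invocation reloads the checkpoint, so for large batches it would be more efficient to load the model once inside Python; this wrapper is only a quick way to reuse the existing script.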
If you find our work to be useful for your research, please consider citing.
```bibtex
@article{zhang2023recognize,
  title={Recognize Anything: A Strong Image Tagging Model},
  author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
  journal={arXiv preprint arXiv:2306.03514},
  year={2023}
}

@article{huang2023tag2text,
  title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
  author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  journal={arXiv preprint arXiv:2303.05657},
  year={2023}
}
```
This work was done with the help of the amazing BLIP code base; many thanks!
We also want to thank @Cheng Rui @Shilong Liu @Ren Tianhe for their help in marrying Tag2Text with Grounded-SAM.