recognize_anything-tag2text's Introduction

🏷️ Recognize Anything: A Strong Image Tagging Model & Tag2Text: Guiding Vision-Language Model via Image Tagging

Official PyTorch Implementation of the Recognize Anything Model (RAM) and the Tag2Text Model.

  • RAM is a strong image tagging model, which can recognize any common category with high accuracy.
  • Tag2Text is an efficient and controllable vision-language model with tagging guidance.

When combined with localization models (Grounded-SAM), Tag2Text and RAM form a strong and general pipeline for visual semantic analysis.

🌞 Helpful Tutorial

💡 Highlight

Recognition and localization are two fundamental computer vision tasks.

  • The Segment Anything Model (SAM) excels at localization but falls short on recognition tasks.
  • The Recognize Anything Model (RAM) and Tag2Text exhibit exceptional recognition abilities, in terms of both accuracy and scope.

Tag2Text for Vision-Language Tasks.
  • Tagging. Without manual annotations, Tag2Text achieves superior image tag recognition across 3,429 categories commonly used by humans.
  • Efficient. Tagging guidance effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks.
  • Controllable. Tag2Text lets users input desired tags, giving them the flexibility to compose corresponding texts based on those tags.

Advancements of RAM over Tag2Text.
  • Accuracy. RAM utilizes a data engine to generate additional annotations and clean incorrect ones, resulting in higher accuracy compared to Tag2Text.
  • Scope. Tag2Text recognizes 3,400+ fixed tags. RAM upgrades the number to 6,400+, covering more valuable categories. With its open-set capability, RAM can recognize any common category.
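To make the idea of a fixed tag vocabulary concrete: image tagging is commonly framed as multi-label classification, where each tag gets an independent score and every tag above a threshold is reported. The sketch below illustrates only that generic technique. It is not code from RAM or Tag2Text; the function names, the example logits, and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def select_tags(logits: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Keep every tag whose probability reaches the threshold.

    `logits` maps a tag name to a raw model score. Both the scores and the
    threshold are hypothetical; a real tagging model may use different
    (for example per-tag) thresholds.
    """
    scored = {tag: sigmoid(value) for tag, value in logits.items()}
    kept = [tag for tag, prob in scored.items() if prob >= threshold]
    # Most confident tags first.
    return sorted(kept, key=lambda tag: scored[tag], reverse=True)


if __name__ == "__main__":
    example = {"sky": 2.0, "cloud": 0.5, "car": -3.0}  # made-up scores
    print(select_tags(example))
```

Because each tag is scored independently, an image can receive any number of tags, which is what distinguishes tagging from single-label classification.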

✨ Highlight Projects with other Models

  • Tag2Text/RAM with Grounded-SAM is a strong and general pipeline for visual semantic analysis, which can automatically recognize, detect, and segment the objects in an image!
  • Ask-Anything is a multifunctional video question answering tool. Tag2Text provides powerful tagging and captioning capabilities as a fundamental component.
  • Prompt-can-anything is a gradio web library that integrates SOTA multimodal large models, including Tag2Text as the core model for graphic understanding.
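The recognize, detect, and segment flow described for the Grounded-SAM combination can be sketched as three stages chained together: tags from the tagging model act as text prompts for a detector, and the resulting boxes act as prompts for a segmenter. Everything in the sketch below is hypothetical: `analyze`, `Region`, and the three callables are placeholder names invented for illustration and do not correspond to the actual APIs of RAM, Tag2Text, or Grounded-SAM.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), an assumed convention


@dataclass
class Region:
    tag: str   # what the object is (recognition)
    box: Box   # where it is (detection)
    mask: Any  # its pixel-level outline (segmentation)


def analyze(
    image: Any,
    recognize: Callable[[Any], Sequence[str]],
    detect: Callable[[Any, Sequence[str]], List[Tuple[str, Box]]],
    segment: Callable[[Any, Box], Any],
) -> List[Region]:
    """Chain the three stages; each callable is a hypothetical stand-in."""
    tags = recognize(image)          # stage 1: image -> tags
    detections = detect(image, tags)  # stage 2: tags -> labelled boxes
    # stage 3: one mask per detected box
    return [Region(tag, box, segment(image, box)) for tag, box in detections]


if __name__ == "__main__":
    # Stub models so the sketch runs without any real checkpoints.
    regions = analyze(
        image="placeholder-image",
        recognize=lambda img: ["dog", "grass"],
        detect=lambda img, tags: [(t, (0.0, 0.0, 1.0, 1.0)) for t in tags],
        segment=lambda img, box: "placeholder-mask",
    )
    print([r.tag for r in regions])
```

Passing the stages in as callables keeps the sketch independent of any particular model, so a different tagger or segmenter could be swapped in without changing the pipeline function.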

🔥 News

✍️ TODO

  • Release Tag2Text demo.
  • Release checkpoints.
  • Release inference code.
  • Release RAM demo and checkpoints.
  • Release training code (by July 8th at the latest).
  • Release training datasets (by July 15th at the latest).

🧰 Checkpoints

#  | Name          | Backbone   | Data                         | Illustration                                                       | Checkpoint
1  | RAM-Swin      | Swin-Large | COCO, VG, SBU, CC-3M, CC-12M | Demo version can recognize any common category with high accuracy. | Download link
2  | Tag2Text-Swin | Swin-Base  | COCO, VG, SBU, CC-3M, CC-12M | Demo version with comprehensive captions.                          | Download link

🏃 Model Inference

RAM Inference

  1. Install the dependencies:

pip install -r requirements.txt

  2. Download RAM pretrained checkpoints.

  3. Get the English and Chinese outputs of the image:

python inference_ram.py  --image images/1641173_2291260800.jpg \
--pretrained pretrained/ram_swin_large_14m.pth
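If you would rather drive the script from Python (for example, to loop over several images), one option is a thin `subprocess` wrapper around the command above. This is a minimal sketch, not part of the repository: `build_ram_command` and `run_ram` are hypothetical helper names, and the sketch assumes it is run from the repository root so that `inference_ram.py` resolves. Only the script name and the `--image` and `--pretrained` flags come from the command shown above.

```python
import subprocess
import sys
from pathlib import Path


def build_ram_command(image: str, checkpoint: str) -> list[str]:
    """Return the argument list equivalent to the shell command above."""
    return [
        sys.executable, "inference_ram.py",
        "--image", image,
        "--pretrained", checkpoint,
    ]


def run_ram(image: str, checkpoint: str) -> str:
    """Run inference_ram.py and return whatever it prints to stdout.

    Checks that both files exist first, so a missing checkpoint fails
    immediately instead of after the interpreter has started.
    """
    for path in (image, checkpoint):
        if not Path(path).is_file():
            raise FileNotFoundError(path)
    result = subprocess.run(
        build_ram_command(image, checkpoint),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

For example, `run_ram("images/1641173_2291260800.jpg", "pretrained/ram_swin_large_14m.pth")` would run the same command as above and return its printed output as a string.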

RAM Zero-Shot Inference is Coming!

Tag2Text Inference

  1. Install the dependencies:

pip install -r requirements.txt

  2. Download Tag2Text pretrained checkpoints.

  3. Get the tagging and captioning results:

python inference_tag2text.py  --image images/1641173_2291260800.jpg \
--pretrained pretrained/tag2text_swin_14m.pth

Or get the tagging and specified captioning results (optional):

python inference_tag2text.py  --image images/1641173_2291260800.jpg \
--pretrained pretrained/tag2text_swin_14m.pth \
--specified-tags "cloud,sky"
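When the tags come from a Python list rather than being typed by hand, they need to be joined into the comma-separated form shown above (`"cloud,sky"`). The sketch below builds the argument list for either variant of the command. The helper names are hypothetical, and stripping whitespace and dropping empty entries is an assumption about what the script expects, not documented behavior.

```python
import sys
from typing import Iterable, List, Optional


def format_specified_tags(tags: Iterable[str]) -> str:
    """Join tags into the comma-separated form used above, e.g. "cloud,sky"."""
    cleaned = [tag.strip() for tag in tags if tag.strip()]
    return ",".join(cleaned)


def build_tag2text_command(
    image: str,
    checkpoint: str,
    specified_tags: Optional[Iterable[str]] = None,
) -> List[str]:
    """Build the inference_tag2text.py argument list.

    --specified-tags is added only when at least one non-empty tag is given,
    mirroring the optional third line of the command above.
    """
    cmd = [
        sys.executable, "inference_tag2text.py",
        "--image", image,
        "--pretrained", checkpoint,
    ]
    joined = format_specified_tags(specified_tags or [])
    if joined:
        cmd += ["--specified-tags", joined]
    return cmd
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues with the tag string.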

✒️ Citation

If you find our work to be useful for your research, please consider citing.

@article{zhang2023recognize,
  title={Recognize Anything: A Strong Image Tagging Model},
  author={Zhang, Youcai and Huang, Xinyu and Ma, Jinyu and Li, Zhaoyang and Luo, Zhaochuan and Xie, Yanchun and Qin, Yuzhuo and Luo, Tong and Li, Yaqian and Liu, Shilong and others},
  journal={arXiv preprint arXiv:2306.03514},
  year={2023}
}

@article{huang2023tag2text,
  title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
  author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  journal={arXiv preprint arXiv:2303.05657},
  year={2023}
}

♥️ Acknowledgements

This work was done with the help of the amazing code base of BLIP. Thanks very much!

We also want to thank @Cheng Rui, @Shilong Liu, and @Ren Tianhe for their help in marrying Tag2Text with Grounded-SAM.

Contributors

xinyu1205, positive666, tuofeilunhifi
