Lenna: Language Enhanced Reasoning Detection Assistant

With the rapid development of multimodal large language models (MLLMs), we can now converse with AI systems in natural language to understand images. However, the reasoning power and world knowledge embedded in large language models have been much less investigated and exploited for image perception tasks. In this work, we propose Lenna, a language-enhanced reasoning detection assistant that uses the robust multimodal feature representation of MLLMs while preserving location information for detection. This is achieved by adding a <DET> token to the MLLM vocabulary. The token carries no explicit semantic context; instead, it serves as a prompt for the detector to identify the corresponding position. To evaluate the reasoning capability of Lenna, we construct the ReasonDet dataset, which measures performance on reasoning-based detection. For more details, please refer to the paper.
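To make the role of the <DET> token more concrete, here is a minimal sketch in plain Python. It is not Lenna's implementation: the toy vocabulary, the helpers `add_special_token` and `det_positions`, and the example sequence are all hypothetical. The sketch assumes only that <DET> is appended to the vocabulary as a new entry and that its position in the output is what the detector is prompted from.

```python
# Toy illustration only: a hypothetical vocabulary, not Lenna's tokenizer.

def add_special_token(vocab: dict[str, int], token: str) -> int:
    """Append `token` to the vocabulary if it is missing and return its id."""
    if token not in vocab:
        vocab[token] = len(vocab)
    return vocab[token]


def det_positions(token_ids: list[int], det_id: int) -> list[int]:
    """Return the indices in a token sequence where the <DET> token occurs."""
    return [i for i, tid in enumerate(token_ids) if tid == det_id]


# Hypothetical base vocabulary and model output.
vocab = {"the": 0, "dog": 1, "is": 2, "here": 3}
det_id = add_special_token(vocab, "<DET>")

output_tokens = ["the", "dog", "is", "here", "<DET>"]
output_ids = [vocab[t] for t in output_tokens]

# Assumption: the detector would be prompted with the model features
# at these positions; this sketch only locates them.
positions = det_positions(output_ids, det_id)
print(det_id, positions)
```

In a real MLLM the same idea would go through the model's own tokenizer and embedding table; the paper describes the actual mechanism.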

Lenna Architecture

Getting Started

1. Installation

  • We use A100 GPUs for training and inference.

  • Clone our repository and create a conda environment:

    git clone https://github.com/Meituan-AutoML/Lenna.git
    cd Lenna
    conda create -n lenna python=3.10
    conda activate lenna
    pip install -r requirements.txt

  • Follow mmdet/get_started to install the mmdetection series of packages.
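After installation, a quick sanity check can confirm that the environment looks as expected. The snippet below is a sketch, not part of the repository: the helper names are hypothetical, and the import names `mmdet` and `mmcv` are assumptions about what the mmdetection install provides. It relies only on the standard library.

```python
import importlib.util
import sys


def missing_packages(names: list[str]) -> list[str]:
    """Return the top-level package names that cannot be found for import."""
    return [n for n in names if importlib.util.find_spec(n) is None]


def python_matches(major: int, minor: int) -> bool:
    """Check whether the running interpreter has the given major.minor version."""
    return sys.version_info[:2] == (major, minor)


# Assumed import names for the mmdetection series; adjust as needed.
required = ["mmdet", "mmcv"]

print("Python 3.10:", python_matches(3, 10))
print("Missing packages:", missing_packages(required))
```

An empty list of missing packages only shows that the packages can be located, not that their versions are compatible with each other.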

2. Prepare Lenna checkpoint

  • Download the Lenna-7B checkpoint from HuggingFace.

3. Inference

  • After preparing the checkpoint and the conda environment, run the following script to perform single-image inference:

    python chat.py \
    --ckpt-path path/to/Lenna-7B \
    --vis_save_path ./vis_output \
    --threshold 0.3 
    
  • When the model has finished loading, you will see the following prompt:

    [Lenna] Please input your caption: {input your caption}
    [Lenna] Input prompt:  Please detect the {your caption} in this image.
    [Lenna] Please input the image path: {input your image path}
    
  • Replace {input your caption} with a description of the object you want to detect, and {input your image path} with the path to your image.
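The interactive session above can be pictured with a small sketch. The prompt template is taken from the session output shown above; everything else is an assumption for illustration. In particular, `build_prompt` and `filter_by_score` are hypothetical helpers rather than functions from `chat.py`, and the sketch assumes that `--threshold` is a confidence cutoff under which detections are discarded.

```python
def build_prompt(caption: str) -> str:
    """Wrap a caption in the prompt template shown in the session above."""
    return f"Please detect the {caption} in this image."


def filter_by_score(detections, threshold=0.3):
    """Keep (box, score) pairs whose score is at least `threshold`.

    Assumption: this mirrors what a confidence threshold typically does;
    the actual behaviour of --threshold is defined by chat.py.
    """
    return [(box, score) for box, score in detections if score >= threshold]


# Hypothetical detections as ((x1, y1, x2, y2), score) pairs.
detections = [
    ((10, 20, 110, 220), 0.82),
    ((5, 5, 50, 50), 0.12),
]

print(build_prompt("dog on the left"))
print(filter_by_score(detections, threshold=0.3))
```

Raising the threshold generally returns fewer, more confident boxes, while lowering it returns more candidates at the cost of more false positives.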

Updates

  • 2023-12-28 Inference code and the Lenna-7B model are released.
  • 2023-12-05 Paper is released on arXiv.

Cite

@article{wei2023lenna,
  title={Lenna: Language enhanced reasoning detection assistant},
  author={Wei, Fei and Zhang, Xinyu and Zhang, Ailing and Zhang, Bo and Chu, Xiangxiang},
  journal={arXiv preprint arXiv:2312.02433},
  year={2023}
}

Acknowledgement

This repo benefits from LISA, GroundingDINO, LLaVA and Vicuna.

License

This repository is released under the Apache 2.0 license as found in the LICENSE file.
