Mountchicken commented on June 10, 2024

Hi @fuweifu-vtoo
Sorry for the late reply. The text encoder in CLIP is essentially also a BERT-style Transformer, so it is not very different from the one in Grounding DINO. The difference lies in how the two models encode the text prompt. Grounding DINO feeds the whole sentence into BERT and then takes the embeddings of the tokens belonging to each phrase as that phrase's representation. In T-Rex2, we feed each phrase into BERT separately and take only the output of the CLS token as its text representation. The reason is that T-Rex2 uses a late-fusion structure: the text embeddings never interact with the image features and are only used to compute similarity with the queries at the final output layer. We therefore want the text representation to be as simple as possible: no matter how long an input phrase is, it is represented by a single embedding.
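To make the phrase-level, late-fusion design concrete, here is a minimal pure-Python sketch. It is not T-Rex2's code: `toy_encode` is a hypothetical stand-in for the CLIP/BERT text encoder whose first output plays the role of the CLS token, and `encode_prompts`, `late_fusion_logits`, `DIM`, and the query vectors are illustrative assumptions. It shows the two properties described above: every phrase becomes exactly one embedding regardless of its length, and the text embeddings meet the detector only as a dot-product similarity with the final-layer queries.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
DIM = 8  # toy embedding width (hypothetical; real encoders are much wider)


def toy_encode(text: str) -> List[Vector]:
    """Hypothetical stand-in for a BERT-style text encoder.

    Returns one vector per position: index 0 plays the role of the CLS
    token, followed by one vector per whitespace-separated word.
    """
    words = text.split()
    if not words:
        raise ValueError("phrase must contain at least one word")
    word_vecs = [
        [math.sin(sum(ord(c) for c in w) * (i + 1)) for i in range(DIM)]
        for w in words
    ]
    # Toy CLS output: the mean of the word vectors (a real encoder learns this).
    cls = [sum(v[i] for v in word_vecs) / len(word_vecs) for i in range(DIM)]
    return [cls] + word_vecs


def encode_prompts(phrases: Sequence[str],
                   encode: Callable[[str], List[Vector]] = toy_encode) -> List[Vector]:
    """Encode each phrase on its own and keep only its CLS output,
    so every phrase maps to exactly one embedding regardless of length."""
    return [encode(p)[0] for p in phrases]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def late_fusion_logits(queries: Sequence[Vector],
                       prompts: Sequence[Vector]) -> List[List[float]]:
    """Late fusion: text embeddings never touch image features; they are
    only compared with the decoder's output queries at the final layer."""
    return [[dot(q, t) for t in prompts] for q in queries]


phrases = ["cat", "red sports car"]
prompt_embs = encode_prompts(phrases)
# Hypothetical output queries from the detector's last decoder layer.
queries = [[0.1 * (i + j) for i in range(DIM)] for j in range(3)]
logits = late_fusion_logits(queries, prompt_embs)
print(len(prompt_embs), [len(e) for e in prompt_embs])  # 2 [8, 8]
print(len(logits), len(logits[0]))  # 3 2
```

The one-word and three-word phrases both end up as a single `DIM`-sized vector, and the output is one score per (query, phrase) pair. In a Grounding DINO-style setup, by contrast, the whole sentence is encoded once and each phrase is represented by its own span of token embeddings.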

from t-rex.

fuweifu-vtoo commented on June 10, 2024

Thank you for your detailed explanation~
One more question: does T-Rex2 freeze the CLIP text encoder during training?

fuweifu-vtoo commented on June 10, 2024

Also, how long does it take to train T-Rex2 with a Swin Transformer Tiny (Swin-T) backbone on 16 NVIDIA A100 GPUs with a total batch size of 128?

Mountchicken commented on June 10, 2024

The CLIP text encoder is not frozen during training. Training a Swin-T model takes around 3 days.

Baboom-l commented on June 10, 2024

@Mountchicken How many iterations was T-Rex2 (Swin-T) trained for? Three days is much shorter than I estimated.

Mountchicken commented on June 10, 2024

T-Rex2 went through multiple rounds of training. We first train with text prompts, then load those weights before training with visual prompts. The last training phase took about 3 days and ran for 100,000 iterations.
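As a rough sanity check on the numbers in this thread, the arithmetic below works out the throughput of the final phase. It assumes, without confirmation from the authors, that this phase used the setup from the earlier question (16 A100 GPUs, total batch size 128), and it treats "about 3 days" as exactly 72 hours; the variable names are illustrative.

```python
# Figures from the thread. The batch size and GPU count come from the
# question and are ASSUMED (not confirmed) to apply to the final phase.
ITERATIONS = 100_000
DAYS = 3  # "about 3 days", treated as exactly 72 hours
TOTAL_BATCH = 128
NUM_GPUS = 16

wall_clock_s = DAYS * 24 * 60 * 60        # 259,200 seconds
sec_per_iter = wall_clock_s / ITERATIONS  # about 2.59 s per iteration
per_gpu_batch = TOTAL_BATCH // NUM_GPUS   # 8 images per GPU per step
samples_seen = ITERATIONS * TOTAL_BATCH   # images processed, counting repeats

print(f"{sec_per_iter:.2f} s/iter, {per_gpu_batch} per GPU, {samples_seen:,} samples")
# prints "2.59 s/iter, 8 per GPU, 12,800,000 samples"
```

Under those assumptions, each iteration takes roughly 2.6 s and the final phase processes about 12.8 million images; the earlier text-prompt rounds add to the total training time.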
