
Comments (7)

ziyuleoliu commented on August 17, 2024

As I understand it, the attention weight matrix is 17×17: the CLS token plus frames 1-16.

If we want to visualize the CLS token's attention, we take the first row of the matrix, columns 1-16; those entries are the attention scores between the CLS token and frames 1-16.

We align them to the raw images 1-16, so each frame gets exactly one attention score.
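For example, a minimal sketch of the extraction I mean (PyTorch, with a random placeholder standing in for the real attention weights):

```python
import torch

# Placeholder for the 17x17 temporal attention matrix:
# index 0 is the CLS token, indices 1..16 are the 16 frame tokens.
attn = torch.softmax(torch.randn(17, 17), dim=-1)

# CLS as query: first row, dropping the CLS-to-CLS entry.
cls_to_frames = attn[0, 1:]  # shape (16,), one attention score per frame
print(cls_to_frames.shape)   # torch.Size([16])
```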

In your paper, however, a single frame in Fig. 7 corresponds to 16 different values and has 16 different highlights.

I am wondering: if we only visualize the CLS token, how could we get 16 different values for a single frame? Shouldn't one frame have only one value, representing its similarity to the CLS token?

It would be great if you could give me some help.

Thanks
Best


bomri commented on August 17, 2024

Hi @ziyuleoliu,
The visualization we suggest shows which frames contributed more to the classification, not which part of the image (pixel-wise) was most relevant.

See issue #17 on how to implement this visualization.
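A rough sketch of that kind of per-frame plot (not the exact code from issue #17; the scores here are random placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder per-frame scores: one CLS attention weight per frame.
frame_scores = np.random.rand(16)
frame_scores /= frame_scores.sum()

plt.bar(np.arange(1, 17), frame_scores)
plt.xlabel("frame index")
plt.ylabel("CLS attention weight")
plt.title("Per-frame contribution to the classification")
plt.show()
```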


ziyuleoliu commented on August 17, 2024

hey

thanks for your reply :)

I'm trying to use the same framework, but with a vanilla Transformer instead of the Longformer. The attention matrix in my case is [17, 17] (1 CLS token + 16 frames). If I visualize the CLS token by taking the first row of the matrix, I only get an array of 17 values. It shows the contribution of each frame to the CLS token, but each frame has only one value, whereas in your paper's Figure 7 each frame has 16 different values.
I'm wondering what causes this difference. Is it because of the Longformer?

Your framework inspires me a lot, and I would love to ask whether it's possible to visualize Grad-CAM for the temporal encoder.
Normally, to visualize a ViT we have to reshape the tokens [bs, 196, dim] to [bs, 14, 14, dim]. In this framework, however, the temporal encoder takes the feature vectors directly as tokens, e.g., ResNet-50 features of shape (bs, 16, 768). Can we still compute Grad-CAM for the temporal encoder and overlay the heatmap on the original pictures?
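To make the question concrete, roughly what I mean in code (random placeholder tensors, shapes as above):

```python
import torch

bs, dim = 2, 768

# Spatial ViT: 196 patch tokens fold back into a 14x14 grid, which
# Grad-CAM can treat like a conv feature map of shape (bs, dim, 14, 14).
spatial_tokens = torch.randn(bs, 196, dim)
spatial_grid = spatial_tokens.reshape(bs, 14, 14, dim).permute(0, 3, 1, 2)

# Temporal encoder: one ResNet-50 feature vector per frame. There is no
# spatial grid left to reshape into (H, W), only a frame axis of length 16.
temporal_tokens = torch.randn(bs, 16, dim)
```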

Thanks again for your time.
Best wishes from noob Leo


ziyuleoliu commented on August 17, 2024

hey @bomri

I would appreciate it if you could give me a bit more help. I've been stuck here for weeks and I'm getting really confused...

Best


bomri commented on August 17, 2024

Hi @ziyuleoliu ,

This figure's visualization uses the weights correlated with the CLS token. If you are not using the Longformer, how do you model the temporal information? We use the "temporal" weight per frame in the clip to illustrate this.

We haven't tried Grad-CAM. For temporal heatmaps we use what is shown in the paper; for spatial heatmaps it can probably work similarly to other uses of Grad-CAM.


ziyuleoliu commented on August 17, 2024

hey @bomri

thanks for your reply.

I use a vanilla Transformer (ViT) instead of the Longformer as the temporal encoder to model the temporal information.

The temporal attention weights form a 17×17 matrix:

[image: attention matrix]

If I understand correctly, I should take only the first row, columns [1:17] (frames 1 to 16). That represents the CLS token as the query and frames 1-16 as the keys, so I can see which frame contributes most to the CLS token, i.e., the classification result, right? Or should I take the first column, rows [1:17]?
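For reference, a toy sketch of the convention I'm assuming (with the softmax taken over the key dimension, each row sums to 1, so row 0 is the CLS-as-query case):

```python
import torch

d = 64
q = torch.randn(17, d)  # queries: CLS token (index 0) + 16 frame tokens
k = torch.randn(17, d)  # keys, same layout

attn = torch.softmax(q @ k.T / d**0.5, dim=-1)  # (17, 17)

row = attn[0, 1:]  # CLS as query attending over the 16 frame keys
col = attn[1:, 0]  # each frame as query attending to the CLS key

print(attn.sum(dim=-1))  # every entry ~1.0: softmax runs over keys, row-wise
```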

But what confuses me is that one frame then has only one weight value correlated with the CLS token, as shown in the attachment:

[image: CLS token visualization, first row]

In your figure, however, a single frame has several different values:

[image: VTN CLS visualization]

I'm worried that I have an incorrect understanding and took the wrong weights.

I appreciate your time, and it would be great if you could give me a bit more help.
Thanks again.
LEO


bomri commented on August 17, 2024

> If I understand correctly, I should take only the first row, columns [1:17] (frames 1 to 16). That represents the CLS token as the query and frames 1-16 as the keys, so I can see which frame contributes most to the CLS token, i.e., the classification result, right?

Right.

> Or should I take the first column, rows [1:17]?

It depends on your implementation, and I don't remember whether we used the row or the column in our setup.

> But what confuses me is that one frame then has only one weight value correlated with the CLS token, as shown in the attachment.

The image you added looks good; it seems the information at the start and end is the most important.

> In your figure, however, a single frame has several different values.

We also have one value per frame (same as in your case). We get one value per frame: a 10-second video sampled at 25 fps gives 250 frames, hence 250 values, which we plot at the top. For the video images we sub-sample down to 16 frames only to make the video easier to visualize.
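For concreteness, a toy sketch of that sub-sampling (not our actual code; the scores are random placeholders):

```python
import numpy as np

# Toy example: 10 s video at 25 fps -> 250 frames, one score per frame.
scores = np.random.rand(250)

# Plot all 250 scores on top; display only 16 evenly spaced frames below.
show_idx = np.linspace(0, len(scores) - 1, num=16).round().astype(int)
shown = scores[show_idx]
print(show_idx)  # indices of the 16 displayed frames
```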

