
Comments (7)

ziyuleoliu commented on August 17, 2024

As I understand it, the attention weight matrix is 17×17: the CLS token plus frames 1-16.

If we want to visualize the CLS token's attention, we take the first row of the matrix, columns 1-16; those entries are the attention scores between the CLS token and frames 1-16.

We align them to the raw images 1-16, so each frame gets exactly one attention score.
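For example, a minimal sketch of the extraction I mean (PyTorch, with a random placeholder standing in for the real attention weights):

```python
import torch

# Placeholder for the 17x17 temporal attention matrix:
# index 0 is the CLS token, indices 1..16 are the 16 frame tokens.
attn = torch.softmax(torch.randn(17, 17), dim=-1)

# CLS as query: first row, dropping the CLS-to-CLS entry.
cls_to_frames = attn[0, 1:]  # shape (16,), one attention score per frame
print(cls_to_frames.shape)   # torch.Size([16])
```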

In your paper, however, a single frame in Fig. 7 corresponds to 16 different values and has 16 different highlights.

I am wondering: if we only visualize the CLS token, how could we get 16 different values for a single frame? Shouldn't one frame have only one value, representing its similarity to the CLS token?

It would be great if you could give me some help.

Thanks
Best


bomri commented on August 17, 2024

Hi @ziyuleoliu,
The visualization we suggest shows which frames contributed more to the classification, not which part of the image (pixel-wise) was most relevant.

See issue #17 on how to implement this visualization.
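A rough sketch of that kind of per-frame plot (not the exact code from issue #17; the scores here are random placeholders):

```python
import matplotlib.pyplot as plt
import numpy as np

# Placeholder per-frame scores: one CLS attention weight per frame.
frame_scores = np.random.rand(16)
frame_scores /= frame_scores.sum()

plt.bar(np.arange(1, 17), frame_scores)
plt.xlabel("frame index")
plt.ylabel("CLS attention weight")
plt.title("Per-frame contribution to the classification")
plt.show()
```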


ziyuleoliu commented on August 17, 2024

hey

thanks for your reply :)

I'm trying to use the same framework, but with a vanilla Transformer instead of the Longformer. The attention matrix in my case is [17, 17] (1 CLS token + 16 frames). If I visualize the CLS token by taking the first row of the matrix, I only get an array of 17 values. It shows the contribution of each frame to the CLS token, but each frame has only one value, whereas in your paper's Figure 7 each frame has 16 different values.
I'm wondering what causes this difference. Is it because of the Longformer?

Your framework inspires me a lot, and I would love to ask whether it's possible to visualize Grad-CAM for the temporal encoder.
Normally, to visualize a ViT we have to reshape the tokens [bs, 196, dim] to [bs, 14, 14, dim]. In this framework, however, the temporal encoder takes the feature vectors directly as tokens, e.g., ResNet-50 features of shape (bs, 16, 768). Can we still compute Grad-CAM for the temporal encoder and overlay the heatmap on the original pictures?
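To make the question concrete, roughly what I mean in code (random placeholder tensors, shapes as above):

```python
import torch

bs, dim = 2, 768

# Spatial ViT: 196 patch tokens fold back into a 14x14 grid, which
# Grad-CAM can treat like a conv feature map of shape (bs, dim, 14, 14).
spatial_tokens = torch.randn(bs, 196, dim)
spatial_grid = spatial_tokens.reshape(bs, 14, 14, dim).permute(0, 3, 1, 2)

# Temporal encoder: one ResNet-50 feature vector per frame. There is no
# spatial grid left to reshape into (H, W), only a frame axis of length 16.
temporal_tokens = torch.randn(bs, 16, dim)
```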

Thanks again for your time.
Best wishes from noob Leo


ziyuleoliu commented on August 17, 2024

hey @bomri

I would appreciate it if you could give me a bit more help. I've been stuck here for weeks and I'm getting really confused...

Best


bomri commented on August 17, 2024

Hi @ziyuleoliu ,

This figure's visualization uses the weights correlated with the CLS token. If you are not using the Longformer, how do you model the temporal information? We use the "temporal" weight per frame in the clip to illustrate this.

We haven't tried Grad-CAM. For temporal heatmaps we use what is shown in the paper; for spatial heatmaps it can probably work similarly to other uses of Grad-CAM.


ziyuleoliu commented on August 17, 2024

hey @bomri

thanks for your reply.

I use a vanilla Transformer (ViT) instead of the Longformer as the temporal encoder to model the temporal information.

The temporal attention weights form a 17×17 matrix:

[image: attention matrix]

If I understand correctly, I should take only the first row, columns [1:17] (frames 1 to 16). That represents the CLS token as the query and frames 1-16 as the keys, so I can see which frame contributes most to the CLS token, i.e., the classification result, right? Or should I take the first column, rows [1:17]?
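For reference, a toy sketch of the convention I'm assuming (with the softmax taken over the key dimension, each row sums to 1, so row 0 is the CLS-as-query case):

```python
import torch

d = 64
q = torch.randn(17, d)  # queries: CLS token (index 0) + 16 frame tokens
k = torch.randn(17, d)  # keys, same layout

attn = torch.softmax(q @ k.T / d**0.5, dim=-1)  # (17, 17)

row = attn[0, 1:]  # CLS as query attending over the 16 frame keys
col = attn[1:, 0]  # each frame as query attending to the CLS key

print(attn.sum(dim=-1))  # every entry ~1.0: softmax runs over keys, row-wise
```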

But what confuses me is that one frame then has only one weight value correlated with the CLS token, as shown in the attachment:

[image: CLS token visualization, first row]

In your figure, however, a single frame has several different values:

[image: VTN CLS visualization]

I'm worried that I have an incorrect understanding and took the wrong weights.

I appreciate your time, and it would be great if you could give me a bit more help.
Thanks again.
LEO


bomri commented on August 17, 2024

> If I understand correctly, I should take only the first row, columns [1:17] (frames 1 to 16). That represents the CLS token as the query and frames 1-16 as the keys, so I can see which frame contributes most to the CLS token, i.e., the classification result, right?

Right.

> Or should I take the first column, rows [1:17]?

It depends on your implementation, and I don't remember whether we used the row or the column in our setup.

> But what confuses me is that one frame then has only one weight value correlated with the CLS token, as shown in the attachment.

The image you added looks good; it seems the information at the start and end is the most important.

> In your figure, however, a single frame has several different values.

We also have one value per frame (same as in your case). We get one value per frame: a 10-second video sampled at 25 fps gives 250 frames, hence 250 values, which we plot at the top. For the video images we sub-sample down to 16 frames only to make the video easier to visualize.
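For concreteness, a toy sketch of that sub-sampling (not our actual code; the scores are random placeholders):

```python
import numpy as np

# Toy example: 10 s video at 25 fps -> 250 frames, one score per frame.
scores = np.random.rand(250)

# Plot all 250 scores on top; display only 16 evenly spaced frames below.
show_idx = np.linspace(0, len(scores) - 1, num=16).round().astype(int)
shown = scores[show_idx]
print(show_idx)  # indices of the 16 displayed frames
```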

