When I pass 1024px size and get intermediate image features from Swin Transform I get

How to get intermediate image features like from Swin Transformers? (dinov2 issue, closed)

facebookresearch commented on August 27, 2024

How to get intermediate image features like from Swin Transformers?

from dinov2.

Comments (3)

ccharest93 commented on August 27, 2024

Swin Transformer uses local attention within non-overlapping windows, which is different from this model. That design was chosen to combat the quadratic cost of attention as the number of patches grows, which is what happens when patches get smaller.

With flash attention, this model can take the 1024-pixel input directly, which somewhat addresses that issue (the patch size is fixed at 14, but higher-resolution images are still allowed).
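To make the quadratic growth concrete, here is a rough back-of-the-envelope sketch. The helper names `num_patches` and `attention_pairs` are made up for illustration, the 224-pixel comparison size is an assumption, and the count ignores the class token and any extra tokens.

```python
def num_patches(size, patch_size=14):
    """Number of whole patches in a square image of the given side length."""
    side = size // patch_size  # leftover border pixels are ignored
    return side * side


def attention_pairs(tokens):
    """Global self-attention compares every token with every token."""
    return tokens * tokens


small = num_patches(224)   # 16 x 16 patches
large = num_patches(1024)  # 73 x 73 patches
print(small, attention_pairs(small))
print(large, attention_pairs(large))
```

Going from 224 to 1024 pixels multiplies the token count by roughly 21 and the number of attention pairs by roughly 430, which is the cost that windowed attention in Swin is meant to avoid.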

Now, if you want local attention within a bigger image, nothing stops you from cropping the image into 4, 9, 16, ... non-overlapping pieces and feeding these into the network. This would result in local attention within each piece.
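A minimal sketch of that cropping, assuming a square grid and rounding each tile down to a multiple of the patch size so it splits into whole patches. The function name `grid_crop_boxes` is hypothetical and not part of dinov2.

```python
def grid_crop_boxes(height, width, grid, patch_size=14):
    """Return (top, left, bottom, right) boxes for a grid x grid tiling.

    Each tile side is rounded down to a multiple of patch_size; pixels left
    over at the bottom and right borders are dropped.
    """
    tile_h = (height // grid) // patch_size * patch_size
    tile_w = (width // grid) // patch_size * patch_size
    if tile_h == 0 or tile_w == 0:
        raise ValueError("image too small for this grid and patch size")
    boxes = []
    for row in range(grid):
        for col in range(grid):
            top, left = row * tile_h, col * tile_w
            boxes.append((top, left, top + tile_h, left + tile_w))
    return boxes


# A 1024 x 1024 image cut into 2 x 2 = 4 non-overlapping pieces.
boxes = grid_crop_boxes(1024, 1024, grid=2)
print(boxes)
```

With an image tensor of shape (B, 3, H, W), each piece would be `img[..., top:bottom, left:right]`, and the pieces can be stacked along the batch dimension so the network processes them independently.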


TimDarcet commented on August 27, 2024

Hi,
As noted by @ccharest93, the model architecture is simply different: you won't get the same feature map shapes as in a Swin.

If you'd like feature maps at different resolutions (e.g. as input to a decoder such as UperNet), you can downsample high-resolution feature maps with average pooling (the general idea in https://arxiv.org/abs/2203.16527).
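A rough sketch of that downsampling, not the exact module from the linked paper. It uses NumPy only to stay dependency-light; with PyTorch tensors, `torch.nn.functional.avg_pool2d` does the same job. The names `avg_pool` and `feature_pyramid`, the factors (1, 2, 4), and the 384 x 72 x 72 map are assumptions for illustration.

```python
import numpy as np


def avg_pool(feat, factor):
    """Average a (C, H, W) map over non-overlapping factor x factor blocks."""
    c, h, w = feat.shape
    h2, w2 = h // factor, w // factor
    # Drop trailing rows/columns that do not fill a whole block.
    trimmed = feat[:, : h2 * factor, : w2 * factor]
    return trimmed.reshape(c, h2, factor, w2, factor).mean(axis=(2, 4))


def feature_pyramid(feat, factors=(1, 2, 4)):
    """Build coarser copies of one high-resolution feature map."""
    return [feat if f == 1 else avg_pool(feat, f) for f in factors]


feat = np.random.rand(384, 72, 72).astype(np.float32)  # stand-in feature map
pyramid = feature_pyramid(feat)
shapes = [level.shape for level in pyramid]
print(shapes)
```

Unlike a Swin backbone, every level here comes from the same layer's output, so the levels differ in spatial resolution but not in channel count.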


woctezuma commented on August 27, 2024

See #2 (comment): get_intermediate_layers()
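For reference, a small sketch of how that call might be wrapped. The helpers `crop_to_patch_multiple` and `intermediate_feature_maps` are hypothetical; only `get_intermediate_layers` (with its `n` and `reshape` arguments) comes from the dinov2 model. The crop is there because the patch embedding expects image sides that are multiples of the patch size, which 1024 is not for a patch size of 14.

```python
def crop_to_patch_multiple(image, patch_size=14):
    """Trim the last two dims (H, W) down to multiples of patch_size."""
    h, w = image.shape[-2:]
    return image[..., : h - h % patch_size, : w - w % patch_size]


def intermediate_feature_maps(model, image, n=4, patch_size=14):
    """Return the outputs of the last n blocks as 2D feature maps.

    With reshape=True the model returns each output as
    (B, C, H // patch_size, W // patch_size) instead of a token sequence.
    """
    image = crop_to_patch_multiple(image, patch_size)
    return model.get_intermediate_layers(image, n=n, reshape=True)


# Usage sketch (needs torch and downloads weights, so it is left commented):
# import torch
# model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
# with torch.no_grad():
#     feats = intermediate_feature_maps(model, torch.rand(1, 3, 1024, 1024))
```

All returned maps share the same spatial size, so this gives features from several depths rather than a multi-scale pyramid like Swin's.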

