Why “patches” are all you need? Patch embedding is Conv7x7 stem, The body is s

What's new about this model? about convmixer HOT 5 CLOSED

locuslab commented on June 28, 2024

What's new about this model?

from convmixer.

Comments (5)

tmp-iclr commented on June 28, 2024

> Thanks for your great work. According to my understanding, the key idea of this paper is to evaluate the power of tokenizing inputs in simple isotropic vision models?

I would say that's a fairly accurate summary. Basically, two things happened simultaneously with the introduction of ViTs, MLP-Mixers, and their variants:
(1) convolution was replaced with new operations such as self-attention or MLPs, and
(2) network designs were changed and vastly simplified, with all downsampling placed at the stem (i.e., patch embeddings) and no further downsampling or resizing anywhere else in the network (i.e., isotropy).
Despite these two things being introduced simultaneously, all of the resulting performance gains have been attributed to (1). Couldn't (2) also be at least partly responsible for the performance gains? By using just convolutions instead of (1), we provide evidence that (2) is itself a powerful template for deep learning.
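
To make point (2) concrete, here is a minimal sketch that only traces the side length of the feature map through an isotropic network and through a pyramid-shaped one. The function names, the 224-pixel input, the patch size of 7, and the ResNet-like stride schedule are illustrative assumptions, not values taken from the paper or the repository.

```python
def isotropic_resolutions(image_size, patch_size, depth):
    """Side length of the feature map after the stem and after each block."""
    # All downsampling happens once, in the patch-embedding stem.
    side = image_size // patch_size
    # Every block keeps the resolution unchanged (isotropy).
    return [side] * (depth + 1)


def pyramid_resolutions(image_size, stem_stride, stage_strides):
    """Side length after the stem and after each stage of a pyramid-shaped CNN."""
    side = image_size // stem_stride
    sides = [side]
    for stride in stage_strides:
        # Each stage may downsample again, so resolution shrinks gradually.
        side //= stride
        sides.append(side)
    return sides


# Hypothetical settings, chosen only for illustration.
iso = isotropic_resolutions(image_size=224, patch_size=7, depth=4)
pyr = pyramid_resolutions(image_size=224, stem_stride=4, stage_strides=[1, 2, 2, 2])
print("isotropic:", iso)
print("pyramid:  ", pyr)
```

Under these assumptions the isotropic list is constant after the stem, while the pyramid list keeps shrinking from stage to stage.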

I'm going to go ahead and close this issue, but feel free to reopen it or open a new issue if you have more questions.


tmp-iclr commented on June 28, 2024

Some models do indeed use a stem with k = 7 convolutions, but this is often with stride = 2. The patch embedding stem sets both the kernel size and the stride equal to the patch size, which reduces the spatial resolution far more than stride = 2 does. That is, all the dimension reduction happens immediately at the stem, in contrast to most CNNs, where it happens gradually throughout the model (i.e., "pyramid shaped").
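
As a rough illustration of the difference, the sketch below applies the standard convolution output-size formula, floor((n + 2p - k) / s) + 1, to both kinds of stem. The 224-pixel input, the padding of 3 for the stride-2 stem, and the helper name `conv_output_size` are assumptions made for this example.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Output side length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


image_size = 224  # assumed input resolution

# Conventional stem: k = 7, stride = 2, padding = 3 (assumed padding).
conventional = conv_output_size(image_size, kernel=7, stride=2, padding=3)

# Patch embedding stem: kernel size = stride = patch size (7 here), no padding.
patch = conv_output_size(image_size, kernel=7, stride=7)

print("stride-2 stem output side:", conventional)
print("patch stem output side:   ", patch)
print("pixel-count reduction, stride-2 stem:", (image_size // conventional) ** 2)
print("pixel-count reduction, patch stem:   ", (image_size // patch) ** 2)
```

With these numbers the stride-2 stem halves each side, whereas the patch stem divides each side by the patch size in a single step.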

It's also unusual that we use k = 9 convolutions at all, as typically stacked small-kernel convolutions are favored.
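
For context on why stacked small kernels are usually preferred, here is a small sketch comparing one k = 9 convolution with four stacked k = 3 convolutions. It assumes stride 1, no dilation, and counts weights per channel as for a depthwise convolution (ignoring biases); the helper names are hypothetical.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1, undilated convolutions."""
    return 1 + layers * (kernel - 1)


def weights_per_channel(kernel, layers):
    """Spatial weights per channel, assuming depthwise convolutions without bias."""
    return layers * kernel * kernel


# One large kernel versus a stack of small kernels with the same receptive field.
print("one 9x9:  field", stacked_receptive_field(9, 1), "weights", weights_per_channel(9, 1))
print("four 3x3: field", stacked_receptive_field(3, 4), "weights", weights_per_channel(3, 4))
```

Both options cover a 9 x 9 window, but the stack does so with fewer weights per channel, which is the usual argument for small kernels.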

Overall, the model is exceedingly simple yet still performs very well in terms of accuracy.


rentainhe commented on June 28, 2024

Thanks for your great work. According to my understanding, the key idea of this paper is to evaluate the power of tokenizing inputs in simple isotropic vision models?


rentainhe commented on June 28, 2024

Thanks for your reply :) I hope this paper gets accepted at ICLR 2022~


vztu commented on June 28, 2024

Thank you for your clarification. Hope your reviews go well!
