Comments (5)
Thanks for your great work. As I understand it, the key idea of this paper is to evaluate the power of tokenizing inputs in simple isotropic vision models?
I would say that's a fairly accurate summary. Basically, two things happened simultaneously with the introduction of ViTs, MLP-Mixers, and their variants:
(1) convolution was replaced with new operations like self-attention or MLPs,
(2) and network designs were changed and vastly simplified, putting all the downsampling at the stem (i.e., using patch embeddings) and otherwise performing no downsampling/resizing throughout the network (i.e., isotropy).
Despite these two things being introduced simultaneously, all of the resulting performance gains have been attributed to (1). Couldn't (2) also be at least partly responsible for the performance gains? By using just convolutions instead of (1), we provide evidence that (2) is itself a powerful template for deep learning.
I'm going to go ahead and close this issue, but feel free to reopen it or open a new issue if you have more questions.
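To make point (2) concrete, here is a minimal sketch of how the spatial resolution evolves in an isotropic design compared with a pyramid-shaped one. The function names, the 224-pixel input, the patch size of 7, and the halve-at-every-stage pyramid are illustrative assumptions, not settings taken from this thread.

```python
def isotropic_sizes(image_size, patch_size, depth):
    """Feature-map side length after the stem and after each of `depth` blocks.

    The patch embedding downsamples once (kernel = stride = patch size);
    every later block keeps the resolution unchanged.
    """
    side = image_size // patch_size
    return [side] * (depth + 1)


def pyramid_sizes(image_size, stem_stride, num_stages):
    """Hypothetical pyramid CNN: a strided stem, then halve at every stage."""
    side = image_size // stem_stride
    sizes = [side]
    for _ in range(num_stages):
        side //= 2  # gradual downsampling throughout the network
        sizes.append(side)
    return sizes


if __name__ == "__main__":
    print("isotropic:", isotropic_sizes(224, patch_size=7, depth=4))
    print("pyramid:  ", pyramid_sizes(224, stem_stride=2, num_stages=4))
```

In the isotropic case every entry of the list is identical, which is the "no downsampling/resizing throughout the network" property; in the pyramid case the side length shrinks stage by stage.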
from convmixer.
Some models do indeed use a stem with k = 7 convolutions, but this is often with stride = 2. The patch embedding stem sets both kernel size and stride equal to the patch size, which reduces the spatial size far more than stride = 2 does. That is, all the dimension reduction happens immediately at the stem, in contrast to most CNNs where it happens gradually throughout the model (i.e., "pyramid shaped").
It's also unusual that we use k = 9 convolutions at all, as typically stacked small-kernel convolutions are favored.
Overall, the model is exceedingly simple yet still performs very well in terms of accuracy.
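The arithmetic behind these two remarks can be sketched as follows, using the standard convolution output-size formula. The 224-pixel input, the padding of 3 on the k = 7 stem, and the patch size of 7 are assumptions chosen for illustration; the comparison of one k = 9 kernel against four stacked k = 3 kernels counts only the spatial weights of a single-channel filter.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output side length of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return layers * (kernel - 1) + 1


def weights_per_channel(kernel, layers=1):
    """Spatial weights of `layers` single-channel kernel x kernel filters."""
    return layers * kernel * kernel


if __name__ == "__main__":
    # Conventional stem: k = 7, stride = 2 (padding 3 is an assumption).
    print("k=7, s=2 stem:  ", conv_out(224, kernel=7, stride=2, padding=3))
    # Patch embedding stem: kernel = stride = patch size (7 is illustrative).
    print("patch stem p=7: ", conv_out(224, kernel=7, stride=7))
    # Four stacked 3x3 convolutions cover the same 9x9 window as one 9x9.
    print("receptive field:", stacked_receptive_field(3, layers=4))
    print("weights 9x9:    ", weights_per_channel(9))
    print("weights 4x(3x3):", weights_per_channel(3, layers=4))
```

Under these assumptions the stacked small kernels reach the same receptive field with fewer weights, which is one common argument for favoring them and part of why a single k = 9 convolution is an unusual choice.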
from convmixer.
Thanks for your reply :) I hope this paper gets accepted at ICLR 2022!
from convmixer.
Thank you for your clarification. Hope your reviews go well!
from convmixer.
Related Issues (16)
- is single gpu training possible? HOT 2
- License HOT 2
- Cifar10 baseline doesn't reach 95% HOT 13
- Training time HOT 1
- Training scheme modifications for small GPUs HOT 4
- Could you please release the training logs of convmixer on imagenet and cifar?
- CIFAR-10 training settings
- something about loss
- weight location HOT 1
- Experiments with full convolutional layers instead of patch embedding? HOT 2
- Hello, can ConvMixer directly extract 2D features from images?
- Do you use this model on other downstream work? (like semantic segmentation ) HOT 3
- padding=same? HOT 1
- Request more experiment results to compare to other architecture. HOT 1
- Segmentation ConvMixer architecture ?