
mlp-mixer-pytorch's Introduction

MLP Mixer - Pytorch

An All-MLP solution for Vision, from Google AI, in Pytorch.

No convolutions nor attention needed!

Yannic Kilcher video

Install

$ pip install mlp-mixer-pytorch

Usage

import torch
from mlp_mixer_pytorch import MLPMixer

model = MLPMixer(
    image_size = 256,
    channels = 3,
    patch_size = 16,
    dim = 512,
    depth = 12,
    num_classes = 1000
)

img = torch.randn(1, 3, 256, 256)
pred = model(img) # (1, 1000)

Rectangular image

import torch
from mlp_mixer_pytorch import MLPMixer

model = MLPMixer(
    image_size = (256, 128),
    channels = 3,
    patch_size = 16,
    dim = 512,
    depth = 12,
    num_classes = 1000
)

img = torch.randn(1, 3, 256, 128)
pred = model(img) # (1, 1000)

Citations

@misc{tolstikhin2021mlpmixer,
    title   = {MLP-Mixer: An all-MLP Architecture for Vision},
    author  = {Ilya Tolstikhin and Neil Houlsby and Alexander Kolesnikov and Lucas Beyer and Xiaohua Zhai and Thomas Unterthiner and Jessica Yung and Daniel Keysers and Jakob Uszkoreit and Mario Lucic and Alexey Dosovitskiy},
    year    = {2021},
    eprint  = {2105.01601},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@misc{hou2021vision,
    title   = {Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition},
    author  = {Qibin Hou and Zihang Jiang and Li Yuan and Ming-Ming Cheng and Shuicheng Yan and Jiashi Feng},
    year    = {2021},
    eprint  = {2106.12368},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

mlp-mixer-pytorch's People

Contributors

arogozhnikov, lucidrains, mirth


mlp-mixer-pytorch's Issues

Question about Parameters

Why are the parameters so different from the one in the paper?

Take Mixer-S/16 as an example:

  • Your model: 5.01 M
  • Paper: 18 M

I compared your model with the official implementation (the Jax version); maybe you made a mistake.

I cannot find any transpose operation in your model, which is necessary for the Mixer layer.

If I have misunderstood your model, please correct me. Thanks a lot!
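A quick way to reproduce the count (a sketch: the Mixer-S/16 settings dim = 512 and depth = 8 are taken from Table 1 of the paper, and the counting itself is standard torch parameter iteration):

import torch
from mlp_mixer_pytorch import MLPMixer

# instantiate with Mixer-S/16-like settings (Table 1 of the paper):
# hidden size 512, 8 mixer layers, 16 x 16 patches on a 224 x 224 input
model = MLPMixer(
    image_size = 224,
    channels = 3,
    patch_size = 16,
    dim = 512,
    depth = 8,
    num_classes = 1000
)

# count trainable parameters and report in millions
num_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'{num_params / 1e6:.2f} M parameters')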


Dall-E implementation

Amazing work! How difficult would it be to bring the MLP-Mixer into DALL-E? Since the whole idea behind DALL-E revolves around attention layers and transformers, I wonder if this simpler model would enable smaller, equally capable models...

Expansion factor choices

Thank you for the clear and well-executed implementation.

Following up on this issue: #11

May I kindly ask why you chose to expand the token-mixing MLP while bottlenecking the channel-mixing MLP? Is there a particular reason behind this design, or is it simply because this setup provides the best performance?

expansion_factor on tokens is actually a bottleneck in original codebase

Thanks for your implementation. In comparing your codebase to the authors' implementation, I discovered that while you have a single expansion factor in your configuration, the authors have separate values: one for tokens and one for channels.

Specifically, their channels expansion factor is 4, but their tokens expansion factor is 0.5. (The hidden_dim is the base projection size). Note that they actually use a feature count, but I'm translating to the mechanism you use in this codebase.

Thus, when executing the MixerBlock, the tokens "expansion" is actually a bottleneck.

The parameters can be verified as well in Table 1 ("Specifications of Mixer Architectures") at the top of page 4 in version 4 (the current version as of Feb 14, 2022) of their paper.

I'm not suggesting that anything necessarily needs to change in your implementation. However, if you wanted to align your codebase to fully replicate the authors' work, you may consider allowing for two separate parameters: token_expansion_factor and channels_expansion_factor.

Thank you again for this work, and for all your contributions generally. You are an incredible asset to the community.
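To make the factor arithmetic concrete, a small sketch (the Mixer-S/16 widths come from Table 1 of the paper; the variable names here are illustrative, not the authors' identifiers):

# Mixer-S/16 widths from Table 1 of the paper (illustrative names)
hidden_dim = 512                         # per-patch projection size C

channels_mlp_dim = 4 * hidden_dim        # channel-mixing MLP: 512 -> 2048 -> 512
tokens_mlp_dim = int(0.5 * hidden_dim)   # token-mixing MLP hidden width: 256

# relative to hidden_dim, the token "expansion" factor of 0.5 is a bottleneck,
# while the channel factor of 4 is a genuine expansion
print(channels_mlp_dim, tokens_mlp_dim)  # 2048 256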

Question: dynamic size?

Can this work with dynamic size?
From reading the paper, it doesn't look like that's possible.
Is that right?
If so, what a shame...
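That reading appears correct: the token-mixing MLP weights are sized by the number of patches fixed at construction, so a different input resolution should fail with a shape mismatch. A hedged demonstration (the exact error message depends on the implementation):

import torch
from mlp_mixer_pytorch import MLPMixer

# model built for 256 x 256 inputs -> (256 / 16) ** 2 = 256 patches
model = MLPMixer(
    image_size = 256,
    channels = 3,
    patch_size = 16,
    dim = 512,
    depth = 12,
    num_classes = 1000
)

model(torch.randn(1, 3, 256, 256))      # works: 256 patches, as configured

try:
    model(torch.randn(1, 3, 224, 224))  # 196 patches, mismatched token-mixing weights
except RuntimeError as e:
    print('shape error:', e)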
