
mae's Introduction

Hi there, I am Xingyuan Zhang 张兴远. 👋

  • 🔭 I am a PhD student at Volkswagen and TUM, working on world models, with a focus on pretraining and transfer.
  • 🤔 I am generally an AI enthusiast and interested in all related topics. I believe AGI can be achieved in my lifetime.
  • ⚡ In my non-research life, I am a board game lover. If you want a match, you can find me as IcarusWizard on BGA.
  • 📫If you want to contact me, please reach out to [email protected].



mae's Issues

Custom masking

Hi, thanks for the code.
You answered that we can modify the PatchShuffle class to create custom masks. However, PatchShuffle takes the output of a Conv2d layer, which makes it hard to know precisely which part of the image we are masking. Is there any reason for this?

Originally posted by @wenhaowang1995 in #14 (comment)
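A minimal sketch of one way to do this (my own illustration, not the repository's code): build the forward/backward index tensors that PatchShuffle normally draws at random from a fixed set of patch positions you want to keep visible. It assumes patches are laid out as (T, B, C) in row-major patch order, and that the encoder keeps the first len(keep) tokens after reordering.

import torch

def fixed_mask_indexes(num_patches: int, keep: list, batch_size: int):
    # patches listed in `keep` stay visible; everything else is treated as masked
    keep_t = torch.tensor(keep, dtype=torch.long)
    masked = torch.tensor([i for i in range(num_patches) if i not in set(keep)], dtype=torch.long)
    forward_indexes = torch.cat([keep_t, masked])      # visible patches first, masked ones last
    backward_indexes = torch.argsort(forward_indexes)  # inverse permutation to restore the original order
    # repeat the same ordering for every sample in the batch (shape: T x B)
    return (forward_indexes[:, None].expand(-1, batch_size),
            backward_indexes[:, None].expand(-1, batch_size))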

Ask some questions

Dear Dr. Zhang, if my training set has no labels, can I remove the cls_token? If I do, will it affect the training process?

extracting embeddings

Hi, how would you recommend extracting (1D) image embeddings from the model on, say, a held-out data set?
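A rough sketch of one option (my assumption, not the author's recommendation): run only the encoder and pool its token features into one vector per image. It assumes the encoder returns (features, backward_indexes) with features shaped (tokens, batch, emb_dim) and the cls token at position 0.

import torch

@torch.no_grad()
def extract_embeddings(encoder, images):
    encoder.eval()
    features, _ = encoder(images)          # (T, B, C): cls token followed by visible patch tokens
    cls_embedding = features[0]            # (B, C): use the cls token as the image embedding
    mean_embedding = features[1:].mean(0)  # (B, C): or average the patch tokens instead
    return cls_embedding, mean_embedding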

Problem with embedding dimension

Hello,

Thank you for sharing such a wonderful code.

I found in my experiments that the emb_dim parameter (the embedding dimension) always has to be a multiple of 192. I do not understand why, since other codebases allow values such as 512 and 256.

Is there a way to enable such dimensions?

Thanks a lot =)
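One common source of this kind of restriction in ViT-style models (a guess on my part, not verified against this repository): multi-head attention requires the embedding dimension to be divisible by the number of heads, and the tiny model uses 3 heads.

# Illustration only: 256 and 512 fail the divisibility check for 3 heads, while multiples of 192 pass.
for emb_dim in (192, 256, 384, 512):
    print(emb_dim, "divisible by 3 heads:", emb_dim % 3 == 0)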

What versions did you work with? I keep getting errors

Hi, I'm getting a lot of errors like:
AttributeError: 'Mlp' object has no attribute 'drop1'
or
AttributeError: 'Block' object has no attribute 'drop_path1'
It's probably an incompatibility between timm and PyTorch; any idea which versions you worked with?
It happens only in train_classifier.py.
Thanks!
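Not a fix, but a quick way to report the versions involved (these attribute names changed between timm releases, so pinning to whatever version the author used should resolve it):

import torch, timm
print("torch:", torch.__version__)
print("timm:", timm.__version__)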

resource requirement

I'd like to ask which GPU you are using to reproduce this work, and roughly how long a full run takes.

Turn off masking

Hi @IcarusWizard,

Great work!!

We are trying to correct segmentation labels using a masked autoencoder. I tried MAE and it works well.
Now, for inference, I want to pass the whole image (without any masking) and check whether the MAE can correct certain regions.
Is there a way to turn off masking and run inference on the whole image?

Let me know if my question is unclear.

Thanks
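A sketch of one possible approach (the attribute names below are hypothetical; check model.py for the real ones): set the masking ratio to 0 at inference so that every patch stays visible and the decoder reconstructs the whole image.

import torch

model = torch.load('vit-t-mae.pt', map_location='cpu')  # pretrained checkpoint from the repo
model.eval()
model.encoder.shuffle.ratio = 0.0                        # hypothetical field holding the mask ratio

dummy = torch.rand(1, 3, 32, 32)                         # CIFAR-sized input, for illustration only
with torch.no_grad():
    reconstruction, mask = model(dummy)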

Need help reproducing your experiments

Hi!
I installed all the requirements with pip, but when I run the MAE code I get the following error:

Traceback (most recent call last):
File "D:\mywork\Projects\PJ1\my_MAE_cifar\main.py", line 87, in <module>
rep, backward_indices = encoder(img0)
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\mywork\Projects\PJ1\my_MAE_cifar\model.py", line 80, in forward
features = self.layer_norm(self.transformer(patches))
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\Python\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
input = module(input)
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\Python\lib\site-packages\timm\models\vision_transformer.py", line 268, in forward
x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x))))
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1185, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Block' object has no attribute 'drop_path1'. Did you mean: 'drop_path'?

This seems to be a problem coming from the timm package.

However, when I look into timm's vision_transformer.py, drop_path1 and drop_path2 seem fine there.

Could you please tell me which version of timm you are using? It would be best if you could also provide your torch and CUDA versions.

Poor reconstruction after loading the pretrained weights

Hello, when I test with the pretrained model mae-t-vit.pt that you provided, using the image below as input, the reconstruction is not good. What could be causing this?

[input image]

My code is as follows:

import torch
from torchvision import transforms
import cv2
from PIL import Image
import random
import numpy as np

def setup_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True

setup_seed()

# img = cv2.imread('benign_image.png')
img = Image.open('10.png')

trans = transforms.Compose([
    transforms.ToTensor()
])

img = trans(img).unsqueeze(0)

model = torch.load('vit-t-mae.pt', map_location='cpu')
outs = model(img)

transforms.ToPILImage()(outs[0].squeeze(0)).save('result.png')

Why are the masks' positions the same across the visualized images?

I ran the pretraining program and looked at the visualized images on TensorBoard. I found that the masks on those 16 images are exactly the same as yours, so I was wondering whether the shuffle is actually random.
The picture with the timestamp is my visualization, and the other one is yours.
[screenshot 1]
[screenshot 2]

TypeError: __init__() got an unexpected keyword argument 'verbose'

Hello, I ran into this problem when running python mae_pretrain.py:
Files already downloaded and verified
Files already downloaded and verified
Traceback (most recent call last):
File "mae_pretrain.py", line 44, in <module>
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=lr_func, verbose=True)
TypeError: __init__() got an unexpected keyword argument 'verbose'
How should I handle this?
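One possible workaround (a sketch, assuming the error comes from an older torch whose LambdaLR does not accept verbose): pass the argument only when it is supported, or simply drop verbose=True.

import inspect
import torch

optim = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)  # placeholder optimizer
lr_func = lambda epoch: 1.0                                               # placeholder schedule

kwargs = {}
if 'verbose' in inspect.signature(torch.optim.lr_scheduler.LambdaLR.__init__).parameters:
    kwargs['verbose'] = True
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=lr_func, **kwargs)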

Unable to reconstruct a distinguishable image

Sorry to bother you. I tried to reconstruct single-channel CT images (512x512), but the MSE loss stopped decreasing and stayed at 0.03 even after 1000 epochs, and the reconstructed image quality is very poor. What could be causing this problem? I'm not familiar with Transformers, so I only tried adjusting the patch size (2 -> 16), the embedding dimension (192 -> 768), and the number of encoder/decoder heads (12).

how to make the masked patches not random

Hi, is there any way to fix the positions of the masked patches? (For example, gathering all masked patches in the middle of the image instead of spreading them randomly across it.)

Thank you for your help!

train_classifier

Hello,
Why don't we drop the decoder and add a new linear layer after the norm layer in train_classifier for fine-tuning? Why do you take each module of the pretrained model individually and rebuild the fine-tuning model, instead of simply dropping the decoder module?

Suspected minor bug

Hi, there is a minor bug in the MAE_Decoder implementation.
The input to the forward function (line 103 in model.py) has features.shape[0] == 1 + t (where 1 refers to the cls_token, and t the number of unmasked tokens).
Then, in line 117: mask[T:] = 1, should actually be mask[T-1:] = 1, since the mask was created without accounting for the cls_token.

What are the rules for setting the parameters of vit-tiny's decoder?

Thanks for your work! I'm pretraining vit-tiny on my own dataset of about 260k images, but I can't decide on the decoder settings (depth/embed_dim/num_heads). Should I keep them consistent with vit-base/large/huge, or choose smaller values for a lightweight decoder? Due to GPU limitations I can't run many trials. Could you give me some suggestions? Thanks a lot. :)

Expected dtype int64 for index

Sorry, maybe it's a stupid question - I'm new to torch. I run into the issue below when trying the first step; please help. Thanks.

C:\Python\MAE>python mae_pretrain.py
Files already downloaded and verified
Files already downloaded and verified
Adjusting learning rate of group 0 to 1.2000e-05.
0%| | 0/98 [00:00<?, ?it/s]
Traceback (most recent call last):
File "C:\Python\MAE\mae_pretrain.py", line 54, in
predicted_img, mask = model(img)
File "C:\Users\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Python\MAE\model.py", line 141, in forward
features, backward_indexes = self.encoder(img)
File "C:\Users\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Python\MAE\model.py", line 70, in forward
patches, forward_indexes, backward_indexes = self.shuffle(patches)
File "C:\Users\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Python\MAE\model.py", line 33, in forward
patches = take_indexes(patches, forward_indexes)
File "C:\Python\MAE\model.py", line 18, in take_indexes
return torch.gather(sequences, 0, repeat(indexes, 't b -> t b c', c=sequences.shape[-1]))
RuntimeError: gather(): Expected dtype int64 for index

Environment:
Windows 11
C:\Python\MAE>python --version
Python 3.9.9
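One possible workaround (a sketch, not a confirmed fix): force the index tensor to int64 before the gather. On Windows, numpy-derived index arrays often default to int32, which torch.gather rejects. This mirrors the take_indexes shown in the traceback above.

import torch
from einops import repeat

def take_indexes(sequences, indexes):
    indexes = indexes.long()  # gather() requires int64 indices
    return torch.gather(sequences, 0, repeat(indexes, 't b -> t b c', c=sequences.shape[-1]))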

Fine-tuning and linear evaluation

Thank you so much for this simple yet effective code!

Excuse me, as I'm still new to this. My questions are about the train_classifier.py code.
1- Does it do fine-tuning or linear evaluation?
2- Assuming it does one of them, how do I toggle it to do the other? What should I change?
3- Does it use the encoder or the decoder to do this?

Excuse my basic questions, but your answer will be really appreciated.
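A generic sketch of the usual distinction (not taken from this repository's train_classifier.py): the difference between the two protocols is whether the pretrained encoder's weights are updated. Assuming a classifier built as encoder plus linear head (the attribute names below are my own, for illustration), linear evaluation freezes the encoder while fine-tuning trains everything.

def set_linear_eval(classifier, linear_eval: bool):
    # `classifier.encoder` and `classifier.head` are assumed names; adapt them to the actual model
    for p in classifier.encoder.parameters():
        p.requires_grad = not linear_eval   # freeze the encoder for linear eval, unfreeze for fine-tuning
    for p in classifier.head.parameters():
        p.requires_grad = True              # the new linear head is always trained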

Query about Persistent Use of Masking during Inference and Finetuning in MAE

Hello,

I have been reviewing the MAE implementation, and I noticed that during both inference and finetuning, the model continues to use the same mask ratio (0.75) as was used during pretraining. Could you clarify why the model does not encode the entire image instead of using a masking approach? I am curious about the advantages or the rationale behind continuing with this masking strategy post-pretraining.

Thank you for your insights!
