
mae's Introduction

Hi there, I am Xingyuan Zhang 张兴远. 👋

  • 🔭 I am a PhD student at Volkswagen and TUM, working on world models, with a focus on pretraining and transfer.
  • 🤔 I am generally an AI enthusiast and interested in all related topics. I believe AGI can be achieved in my lifetime.
  • ⚡ In my non-research life, I am a board game lover. If you want a match, you can find me as IcarusWizard on BGA.
  • 📫If you want to contact me, please reach out to [email protected].



mae's Issues

Custom masking

Hi, thanks for the code.
You answered that we can modify the PatchShuffle class to create custom masks. However, PatchShuffle takes the output of a Conv2d layer, which makes it hard to know precisely which part of the image we are masking. Is there any reason for this?

Originally posted by @wenhaowang1995 in #14 (comment)
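A minimal sketch of one way to do this (my own illustration, not the repository's code): build the forward/backward index tensors that PatchShuffle normally draws at random from a fixed set of patch positions you want to keep visible. It assumes patches are laid out as (T, B, C) in row-major patch order, and that the encoder keeps the first len(keep) tokens after reordering.

import torch

def fixed_mask_indexes(num_patches: int, keep: list, batch_size: int):
    # patches listed in `keep` stay visible; everything else is treated as masked
    keep_t = torch.tensor(keep, dtype=torch.long)
    masked = torch.tensor([i for i in range(num_patches) if i not in set(keep)], dtype=torch.long)
    forward_indexes = torch.cat([keep_t, masked])      # visible patches first, masked ones last
    backward_indexes = torch.argsort(forward_indexes)  # inverse permutation to restore the original order
    # repeat the same ordering for every sample in the batch (shape: T x B)
    return (forward_indexes[:, None].expand(-1, batch_size),
            backward_indexes[:, None].expand(-1, batch_size))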

Ask some questions

Dear Dr. Zhang, if my training set has no labels, can I remove the cls_token? If I do, will it affect the training process?

extracting embeddings

Hi, how would you recommend extracting (1D) image embeddings from the model on, say, a held-out data set?
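A rough sketch of one option (my assumption, not the author's recommendation): run only the encoder and pool its token features into one vector per image. It assumes the encoder returns (features, backward_indexes) with features shaped (tokens, batch, emb_dim) and the cls token at position 0.

import torch

@torch.no_grad()
def extract_embeddings(encoder, images):
    encoder.eval()
    features, _ = encoder(images)          # (T, B, C): cls token followed by visible patch tokens
    cls_embedding = features[0]            # (B, C): use the cls token as the image embedding
    mean_embedding = features[1:].mean(0)  # (B, C): or average the patch tokens instead
    return cls_embedding, mean_embedding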

Problem with embedding dimension

Hello,

Thank you for sharing such a wonderful code.

I found in my experiments that the emb_dim parameter (the embedding dimension) always has to be a multiple of 192. I do not understand why, since other codebases allow values such as 512 and 256.

Is there a way to enable such dimensions?

Thanks a lot =)
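One common source of this kind of restriction in ViT-style models (a guess on my part, not verified against this repository): multi-head attention requires the embedding dimension to be divisible by the number of heads, and the tiny model uses 3 heads.

# Illustration only: 256 and 512 fail the divisibility check for 3 heads, while multiples of 192 pass.
for emb_dim in (192, 256, 384, 512):
    print(emb_dim, "divisible by 3 heads:", emb_dim % 3 == 0)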

What versions did you work with? I keep getting errors

Hi, I'm getting a lot of errors like:
AttributeError: 'Mlp' object has no attribute 'drop1'
or
AttributeError: 'Block' object has no attribute 'drop_path1'
It's probably an incompatibility between timm and PyTorch; any idea which versions you worked with?
It happens only in train_classifier.py.
Thanks!
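Not a fix, but a quick way to report the versions involved (these attribute names changed between timm releases, so pinning to whatever version the author used should resolve it):

import torch, timm
print("torch:", torch.__version__)
print("timm:", timm.__version__)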

resource requirement

I'd like to ask which GPU you are using to reproduce this work, and roughly how long a full run takes.

Turn off masking

Hi @IcarusWizard,

Great work!!

We are trying to correct segmentation labels using a masked autoencoder. I tried MAE and it works well.
Now, for inference, I want to pass the whole image (without any masking) and check whether the MAE can correct certain regions.
Is there a way to turn off masking and run inference on the whole image?

Let me know if my question is unclear.

Thanks
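A sketch of one possible approach (the attribute names below are hypothetical; check model.py for the real ones): set the masking ratio to 0 at inference so that every patch stays visible and the decoder reconstructs the whole image.

import torch

model = torch.load('vit-t-mae.pt', map_location='cpu')  # pretrained checkpoint from the repo
model.eval()
model.encoder.shuffle.ratio = 0.0                        # hypothetical field holding the mask ratio

dummy = torch.rand(1, 3, 32, 32)                         # CIFAR-sized input, for illustration only
with torch.no_grad():
    reconstruction, mask = model(dummy)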

Need help reproducing your experiments

Hi!
I installed all the requirements with pip, but when I run the MAE code I get the following error:

Traceback (most recent call last):
File "D:\mywork\Projects\PJ1\my_MAE_cifar\main.py", line 87, in <module>
rep, backward_indices = encoder(img0)
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\mywork\Projects\PJ1\my_MAE_cifar\model.py", line 80, in forward
features = self.layer_norm(self.transformer(patches))
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\Python\lib\site-packages\torch\nn\modules\container.py", line 141, in forward
input = module(input)
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1110, in _call_impl
return forward_call(*input, **kwargs)
File "D:\Python\lib\site-packages\timm\models\vision_transformer.py", line 268, in forward
x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x))))
File "D:\Python\lib\site-packages\torch\nn\modules\module.py", line 1185, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Block' object has no attribute 'drop_path1'. Did you mean: 'drop_path'?

This seems to be a problem coming from the timm package.

However, when I look into timm's vision_transformer.py, drop_path1 and drop_path2 seem fine there.

Could you please tell me which version of timm you are using? It would be best if you could also provide your torch and CUDA versions.

Poor reconstruction after loading the pretrained weights

Hello, when I test with the pretrained model mae-t-vit.pt that you provided, using the image below as input, the reconstruction is not good. What could be causing this?

[input image]

My code is as follows:

import torch
from torchvision import transforms
import cv2
from PIL import Image
import random
import numpy as np

def setup_seed(seed=42):
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True

setup_seed()

# img = cv2.imread('benign_image.png')
img = Image.open('10.png')

trans = transforms.Compose([
    transforms.ToTensor()
])

img = trans(img).unsqueeze(0)

model = torch.load('vit-t-mae.pt', map_location='cpu')
outs = model(img)

transforms.ToPILImage()(outs[0].squeeze(0)).save('result.png')

Why are the masks' positions the same across the visualized images?

I ran the pretraining program and looked at the visualized images on TensorBoard. I found that the masks on those 16 images are exactly the same as yours, so I was wondering whether the shuffle is actually random.
The picture with the timestamp is my visualization, and the other one is yours.
[screenshot 1]
[screenshot 2]

TypeError: __init__() got an unexpected keyword argument 'verbose'

Hello, I ran into this problem when running python mae_pretrain.py:
Files already downloaded and verified
Files already downloaded and verified
Traceback (most recent call last):
File "mae_pretrain.py", line 44, in <module>
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=lr_func, verbose=True)
TypeError: __init__() got an unexpected keyword argument 'verbose'
How should I handle this?
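One possible workaround (a sketch, assuming the error comes from an older torch whose LambdaLR does not accept verbose): pass the argument only when it is supported, or simply drop verbose=True.

import inspect
import torch

optim = torch.optim.AdamW([torch.nn.Parameter(torch.zeros(1))], lr=1e-3)  # placeholder optimizer
lr_func = lambda epoch: 1.0                                               # placeholder schedule

kwargs = {}
if 'verbose' in inspect.signature(torch.optim.lr_scheduler.LambdaLR.__init__).parameters:
    kwargs['verbose'] = True
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optim, lr_lambda=lr_func, **kwargs)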

Unable to reconstruct a distinguishable image

Sorry to bother you. I tried to reconstruct single-channel CT images (512x512), but the MSE loss stopped decreasing and stayed at 0.03 even after 1000 epochs, and the reconstructed image quality is very poor. What could be causing this problem? I'm not familiar with Transformers, so I only tried adjusting the patch size (2 -> 16), the embedding dimension (192 -> 768), and the number of encoder/decoder heads (12).

how to make the masked patches not random

Hi, is there any way to fix the positions of the masked patches? (For example, gathering all masked patches in the middle of the image instead of spreading them randomly across it.)

Thank you for your help!

train_classifier

Hello,
Why don't we drop the decoder and add a new linear layer after the norm layer in train_classifier for fine-tuning? Why do you take each module of the pretrained model individually and rebuild the fine-tuning model, instead of simply dropping the decoder module?

Suspected minor bug

Hi, there is a minor bug in the MAE_Decoder implementation.
The input to the forward function (line 103 in model.py) has features.shape[0] == 1 + t (where 1 refers to the cls_token, and t the number of unmasked tokens).
Then, in line 117: mask[T:] = 1, should actually be mask[T-1:] = 1, since the mask was created without accounting for the cls_token.

What are the rules for setting the parameters of vit-tiny's decoder?

Thanks for your work! I'm pretraining vit-tiny on my own dataset of about 260k images, but I can't decide on the decoder settings (depth/embed_dim/num_heads). Should I keep them consistent with vit-base/large/huge, or choose smaller values for a lightweight decoder? Due to GPU limitations I can't run many trials. Could you give me some suggestions? Thanks a lot. :)

Expected dtype int64 for index

Sorry, maybe it's a stupid question - I'm new to torch. I run into the issue below when trying the first step; please help. Thanks.

C:\Python\MAE>python mae_pretrain.py
Files already downloaded and verified
Files already downloaded and verified
Adjusting learning rate of group 0 to 1.2000e-05.
0%| | 0/98 [00:00<?, ?it/s]
Traceback (most recent call last):
File "C:\Python\MAE\mae_pretrain.py", line 54, in
predicted_img, mask = model(img)
File "C:\Users\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Python\MAE\model.py", line 141, in forward
features, backward_indexes = self.encoder(img)
File "C:\Users\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Python\MAE\model.py", line 70, in forward
patches, forward_indexes, backward_indexes = self.shuffle(patches)
File "C:\Users\AppData\Roaming\Python\Python39\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "C:\Python\MAE\model.py", line 33, in forward
patches = take_indexes(patches, forward_indexes)
File "C:\Python\MAE\model.py", line 18, in take_indexes
return torch.gather(sequences, 0, repeat(indexes, 't b -> t b c', c=sequences.shape[-1]))
RuntimeError: gather(): Expected dtype int64 for index

Environment:
Windows 11
C:\Python\MAE>python --version
Python 3.9.9
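One possible workaround (a sketch, not a confirmed fix): force the index tensor to int64 before the gather. On Windows, numpy-derived index arrays often default to int32, which torch.gather rejects. This mirrors the take_indexes shown in the traceback above.

import torch
from einops import repeat

def take_indexes(sequences, indexes):
    indexes = indexes.long()  # gather() requires int64 indices
    return torch.gather(sequences, 0, repeat(indexes, 't b -> t b c', c=sequences.shape[-1]))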

Fine-tuning and linear evaluation

Thank you so much for this simple yet effective code!

Excuse me, as I'm still new to this. My questions are about the train_classifier.py code.
1- Does it do fine-tuning or linear evaluation?
2- Assuming it does one of them, how do I toggle it to do the other? What should I change?
3- Does it use the encoder or the decoder to do this?

Excuse my basic questions, but your answer will be really appreciated.
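A generic sketch of the usual distinction (not taken from this repository's train_classifier.py): the difference between the two protocols is whether the pretrained encoder's weights are updated. Assuming a classifier built as encoder plus linear head (the attribute names below are my own, for illustration), linear evaluation freezes the encoder while fine-tuning trains everything.

def set_linear_eval(classifier, linear_eval: bool):
    # `classifier.encoder` and `classifier.head` are assumed names; adapt them to the actual model
    for p in classifier.encoder.parameters():
        p.requires_grad = not linear_eval   # freeze the encoder for linear eval, unfreeze for fine-tuning
    for p in classifier.head.parameters():
        p.requires_grad = True              # the new linear head is always trained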

Query about Persistent Use of Masking during Inference and Finetuning in MAE

Hello,

I have been reviewing the MAE implementation, and I noticed that during both inference and finetuning, the model continues to use the same mask ratio (0.75) as was used during pretraining. Could you clarify why the model does not encode the entire image instead of using a masking approach? I am curious about the advantages or the rationale behind continuing with this masking strategy post-pretraining.

Thank you for your insights!
