jinjingzhu / pmtrans

Python 99.57% Shell 0.43%

pmtrans's Issues

Hello. How do I set up this distributed training?

lack of config.py file

Thank you for your code. Could you share the config.py file from the config folder?
I also want to ask whether the loss computation is complete. I was confused by it.

PMTrans/swin_pm.py model training errors

When I run "bash dist_train.sh" and the program reaches the start of training, swin_pm.py raises the following error. I only changed the dataset path and did not modify any other parameters. Can you help me with this? Thanks.

code complete

Is your code complete now? I want to run it on my own dataset for debugging. Thank you!

A question about the semi-supervised loss

Why is the label similarity denoted by

instead of y^i (y^s)^T?

Results on the DomainNet.

Thank you so much for the interesting work you've done that has inspired us! We would like to follow up on your work and conduct an experiment to compare. However, we found that there seems to be a problem with your calculation of Avg. on the DomainNet dataset, as the averaged result should be 52.4 instead of 62.9. We hope to hear from you, thank you very much!

model training errors

I ran bash dist_train.sh with the default settings (office_home, swin_b) on a single GPU without changing any parameters. Something goes wrong when the code tries to patch_embed the source data, as shown below, and I have no idea how to deal with it. Can you give me some help?

model pretrained

By default, base_swin downloads swin_base_patch4_window7_224_22kto1k.pth. How can I use swin_base_patch4_window7_224.pth as the pretrained checkpoint instead?

How to solve the problem of model preloading

RuntimeError: Error(s) in loading state_dict for Swin:
Missing key(s) in state_dict: "layers.0.blocks.0.attn.relative_position_index", "layers.0.blocks.1.attn_mask", "layers.0.blocks.1.attn.relative_position_index", "layers.0.downsample.reduction.weight", "layers.0.downsample.norm.weight", "layers.0.downsample.norm.bias", "layers.1.blocks.0.attn.relative_position_index", "layers.1.blocks.1.attn_mask", "layers.1.blocks.1.attn.relative_position_index", "layers.2.blocks.0.attn.relative_position_index", "layers.2.blocks.1.attn_mask", "layers.2.blocks.1.attn.relative_position_index", "layers.2.blocks.2.attn.relative_position_index", "layers.2.blocks.3.attn_mask", "layers.2.blocks.3.attn.relative_position_index", "layers.2.blocks.4.attn.relative_position_index", "layers.2.blocks.5.attn_mask", "layers.2.blocks.5.attn.relative_position_index", "layers.2.blocks.6.attn.relative_position_index", "layers.2.blocks.7.attn_mask", "layers.2.blocks.7.attn.relative_position_index", "layers.2.blocks.8.attn.relative_position_index", "layers.2.blocks.9.attn_mask", "layers.2.blocks.9.attn.relative_position_index", "layers.2.blocks.10.attn.relative_position_index", "layers.2.blocks.11.attn_mask", "layers.2.blocks.11.attn.relative_position_index", "layers.2.blocks.12.attn.relative_position_index", "layers.2.blocks.13.attn_mask", "layers.2.blocks.13.attn.relative_position_index", "layers.2.blocks.14.attn.relative_position_index", "layers.2.blocks.15.attn_mask", "layers.2.blocks.15.attn.relative_position_index", "layers.2.blocks.16.attn.relative_position_index", "layers.2.blocks.17.attn_mask", "layers.2.blocks.17.attn.relative_position_index", "layers.3.blocks.0.attn.relative_position_index", "layers.3.blocks.1.attn.relative_position_index", "hidden.weight", "hidden.bias", "my_fc.weight", "my_fc.bias".
Unexpected key(s) in state_dict: "head.fc.weight", "head.fc.bias", "layers.3.downsample.norm.weight", "layers.3.downsample.norm.bias", "layers.3.downsample.reduction.weight".
size mismatch for layers.1.downsample.reduction.weight: copying a param with shape torch.Size([256, 512]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
size mismatch for layers.1.downsample.norm.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for layers.1.downsample.norm.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for layers.2.downsample.reduction.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([1024, 2048]).
size mismatch for layers.2.downsample.norm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for layers.2.downsample.norm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([2048]).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9384) of binary: /usr/bin/python3

This is the error I get when I train the model. How can I deal with this problem?
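The traceback above mixes three kinds of problems: keys the model expects but the checkpoint lacks, keys the checkpoint has but the model does not, and keys present in both with different shapes. A minimal sketch for sorting them out before calling `load_state_dict` follows. It works on plain name-to-shape mappings so it runs without PyTorch; with PyTorch, such a mapping could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`. The function name `diff_state_dicts` is hypothetical and not part of the repository.

```python
from typing import Dict, List, Mapping, Tuple

Shape = Tuple[int, ...]


def diff_state_dicts(
    model_shapes: Mapping[str, Shape],
    ckpt_shapes: Mapping[str, Shape],
) -> Dict[str, List[str]]:
    """Classify parameter names into missing, unexpected, and shape-mismatched.

    model_shapes: parameter name -> shape expected by the model definition.
    ckpt_shapes:  parameter name -> shape stored in the checkpoint file.
    """
    # Keys the model defines but the checkpoint does not provide.
    missing = sorted(k for k in model_shapes if k not in ckpt_shapes)
    # Keys the checkpoint provides but the model does not define.
    unexpected = sorted(k for k in ckpt_shapes if k not in model_shapes)
    # Keys present on both sides whose shapes disagree.
    mismatched = sorted(
        k
        for k in model_shapes
        if k in ckpt_shapes and tuple(model_shapes[k]) != tuple(ckpt_shapes[k])
    )
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


if __name__ == "__main__":
    # Shapes for the layers.1.* entries are copied from the error message above.
    # Shapes for the other entries are illustrative assumptions; only the key
    # names come from the traceback.
    model = {
        "layers.1.downsample.reduction.weight": (512, 1024),
        "layers.1.downsample.norm.weight": (1024,),
        "layers.0.downsample.norm.weight": (512,),  # assumed shape
        "my_fc.bias": (65,),  # assumed shape
    }
    checkpoint = {
        "layers.1.downsample.reduction.weight": (256, 512),
        "layers.1.downsample.norm.weight": (512,),
        "head.fc.bias": (1000,),  # assumed shape
    }
    for kind, names in diff_state_dicts(model, checkpoint).items():
        print(kind, names)
```

One possible reading of the traceback, not confirmed in this thread: the `downsample` keys look shifted by one stage and the classifier is named `head.fc`, which would be consistent with a checkpoint saved under a different Swin layout than the one the model code defines. If so, the fix is to match the checkpoint to the library version the code was written for rather than to load with `strict=False`.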

How to solve the "IndexError: tuple index out of range" problem?

I have tried experimenting with both Python environments 3.7 and 3.9, and I keep getting stuck at the same error. I would appreciate it if you could provide a solution.

Solved!!!

If you have the same problem, please refer to the link below. Thanks.

https://github.com/NVIDIA/apex/pull/1282/files/01802f623c9b54199566871b49f94b2d07c3f047

The environment provided in the README doesn't work

The timm version is too old and doesn't contain Swin.

I can't create 'ds_vit_base_patch16_224' via timm. Can you tell me how to solve this problem?

I use timm==0.5.4, and the other packages are the same as in the README.

DomainNet train-test splits

Did you use the complete train+test .txt files from DomainNet for training? In dataset/domainnet/ I only find one file for each domain and no train-test splits. This matters because prior works mostly adopt a train-test split using the respective files.

PMTrans/torch.distributed.elastic.multiprocessing.api:failed

When training on the "VisDA" dataset, I ran "nohup bash dist_train.sh &", but the process stopped unexpectedly and didn't finish. Since then, whenever I run "bash dist_train.sh", I always hit the following error and the run never gets past "dist.init_process_group()". I tried some solutions I found, but they didn't work. Can you help me with the answer? Thank you!
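The thread does not state the cause, but one common reason for `dist.init_process_group()` failing after an aborted `nohup` run is a leftover process still holding the rendezvous port. A small sketch to check that, using only the standard library, is shown here. The port number 29500 is an assumption (it is the usual default for the PyTorch launchers); the port actually set in dist_train.sh is not shown in this issue, and the helper names are hypothetical.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Binding failed, so something else is already using the port.
            return False
    return True


def find_free_port(host: str = "127.0.0.1") -> int:
    """Ask the OS for an unused TCP port and return its number."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]


if __name__ == "__main__":
    assumed_port = 29500  # assumption: replace with the port used by your launch script
    if port_is_free(assumed_port):
        print(f"port {assumed_port} is free")
    else:
        print(f"port {assumed_port} is busy; a free alternative is {find_free_port()}")
```

If the port is busy, either stop the stale training processes from the earlier run or launch with a different master port.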