pmtrans's Issues
Hello. How do I set up distributed training for this project?
lack of config.py file
PMTrans/swin_pm.py model training errors
code complete
Is your code complete now? I want to swap in my own dataset for debugging. Thank you!
A question about the semi-supervised loss
Results on DomainNet
Thank you so much for the interesting work you've done that has inspired us! We would like to follow up on your work and conduct an experiment to compare. However, we found that there seems to be a problem with your calculation of Avg. on the DomainNet dataset, as the averaged result should be 52.4 instead of 62.9. We hope to hear from you, thank you very much!
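A quick way to check a reported average is to recompute it from the per-task accuracies. The sketch below assumes the table lists one accuracy per source-to-target task and that Avg. is the unweighted mean of those entries. The domain names and numbers are placeholders, not values from the paper, and `table_average` and `matches_reported` are hypothetical helper names.

```python
from statistics import mean


def table_average(task_acc):
    """Return the unweighted mean accuracy over all source->target tasks."""
    return mean(task_acc.values())


def matches_reported(task_acc, reported, tol=0.05):
    """True if the recomputed mean agrees with the reported Avg. within tol."""
    return abs(table_average(task_acc) - reported) <= tol


# Placeholder accuracies for three made-up domains (NOT results from the paper).
example = {
    ("A", "B"): 50.0, ("A", "C"): 60.0,
    ("B", "A"): 40.0, ("B", "C"): 70.0,
    ("C", "A"): 30.0, ("C", "B"): 50.0,
}

if __name__ == "__main__":
    print("recomputed average:", table_average(example))
    print("matches reported value:", matches_reported(example, 62.9))
```

Filling `example` with the published per-task numbers would show whether the reported Avg. is an arithmetic slip or uses a different averaging convention.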
model training errors
model pretrained
By default, base_swin downloads swin_base_patch4_window7_224_22kto1k.pth. How can I use swin_base_patch4_window7_224.pth as the pretrained weights instead?
How to solve the problem of model preloading
RuntimeError: Error(s) in loading state_dict for Swin:
Missing key(s) in state_dict: "layers.0.blocks.0.attn.relative_position_index", "layers.0.blocks.1.attn_mask", "layers.0.blocks.1.attn.relative_position_index", "layers.0.downsample.reduction.weight", "layers.0.downsample.norm.weight", "layers.0.downsample.norm.bias", "layers.1.blocks.0.attn.relative_position_index", "layers.1.blocks.1.attn_mask", "layers.1.blocks.1.attn.relative_position_index", "layers.2.blocks.0.attn.relative_position_index", "layers.2.blocks.1.attn_mask", "layers.2.blocks.1.attn.relative_position_index", "layers.2.blocks.2.attn.relative_position_index", "layers.2.blocks.3.attn_mask", "layers.2.blocks.3.attn.relative_position_index", "layers.2.blocks.4.attn.relative_position_index", "layers.2.blocks.5.attn_mask", "layers.2.blocks.5.attn.relative_position_index", "layers.2.blocks.6.attn.relative_position_index", "layers.2.blocks.7.attn_mask", "layers.2.blocks.7.attn.relative_position_index", "layers.2.blocks.8.attn.relative_position_index", "layers.2.blocks.9.attn_mask", "layers.2.blocks.9.attn.relative_position_index", "layers.2.blocks.10.attn.relative_position_index", "layers.2.blocks.11.attn_mask", "layers.2.blocks.11.attn.relative_position_index", "layers.2.blocks.12.attn.relative_position_index", "layers.2.blocks.13.attn_mask", "layers.2.blocks.13.attn.relative_position_index", "layers.2.blocks.14.attn.relative_position_index", "layers.2.blocks.15.attn_mask", "layers.2.blocks.15.attn.relative_position_index", "layers.2.blocks.16.attn.relative_position_index", "layers.2.blocks.17.attn_mask", "layers.2.blocks.17.attn.relative_position_index", "layers.3.blocks.0.attn.relative_position_index", "layers.3.blocks.1.attn.relative_position_index", "hidden.weight", "hidden.bias", "my_fc.weight", "my_fc.bias".
Unexpected key(s) in state_dict: "head.fc.weight", "head.fc.bias", "layers.3.downsample.norm.weight", "layers.3.downsample.norm.bias", "layers.3.downsample.reduction.weight".
size mismatch for layers.1.downsample.reduction.weight: copying a param with shape torch.Size([256, 512]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
size mismatch for layers.1.downsample.norm.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for layers.1.downsample.norm.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for layers.2.downsample.reduction.weight: copying a param with shape torch.Size([512, 1024]) from checkpoint, the shape in current model is torch.Size([1024, 2048]).
size mismatch for layers.2.downsample.norm.weight: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([2048]).
size mismatch for layers.2.downsample.norm.bias: copying a param with shape torch.Size([1024]) from checkpoint, the shape in current model is torch.Size([2048]).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 9384) of binary: /usr/bin/python3
This is the error I get when I train the model. How can I deal with it?
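The traceback above mixes three kinds of mismatch: keys the model expects but the checkpoint lacks, keys the checkpoint has but the model does not, and shared keys whose shapes differ. A pattern like this often points to the checkpoint and the model definition coming from different versions of the architecture code, though the thread does not confirm the cause. The sketch below separates the three cases so they can be inspected before calling load_state_dict. `diff_state_dicts` is a hypothetical helper, and it works on plain name-to-shape mappings so it runs without PyTorch; shapes marked as placeholders do not come from the traceback.

```python
def diff_state_dicts(ckpt_shapes, model_shapes):
    """Compare two {parameter_name: shape} mappings.

    Returns (missing, unexpected, mismatched), using the same meaning as
    the PyTorch error message: "missing" keys are expected by the model
    but absent from the checkpoint, "unexpected" keys are the reverse.
    """
    ckpt_keys, model_keys = set(ckpt_shapes), set(model_shapes)
    missing = sorted(model_keys - ckpt_keys)
    unexpected = sorted(ckpt_keys - model_keys)
    mismatched = {
        key: (tuple(ckpt_shapes[key]), tuple(model_shapes[key]))
        for key in sorted(ckpt_keys & model_keys)
        if tuple(ckpt_shapes[key]) != tuple(model_shapes[key])
    }
    return missing, unexpected, mismatched


# With PyTorch, the two mappings could be built like this (assumption:
# `state` is the flat state dict; some checkpoints nest it under a key):
#   ckpt_shapes = {k: tuple(v.shape) for k, v in state.items()}
#   model_shapes = {k: tuple(v.shape) for k, v in model.state_dict().items()}

# A few entries taken from the traceback; placeholder shapes are marked.
ckpt_shapes = {
    "layers.1.downsample.reduction.weight": (256, 512),
    "layers.1.downsample.norm.weight": (512,),
    "head.fc.bias": (1000,),  # placeholder shape
}
model_shapes = {
    "layers.1.downsample.reduction.weight": (512, 1024),
    "layers.1.downsample.norm.weight": (1024,),
    "my_fc.bias": (10,),  # placeholder shape
}

if __name__ == "__main__":
    missing, unexpected, mismatched = diff_state_dicts(ckpt_shapes, model_shapes)
    print("missing:", missing)
    print("unexpected:", unexpected)
    for key, (ckpt_shape, model_shape) in mismatched.items():
        print(f"{key}: checkpoint {ckpt_shape} vs model {model_shape}")
```

If many backbone keys land in the mismatched group, loading with strict=False will not help, because load_state_dict still raises on size mismatches; the checkpoint and the model code need to agree on the architecture first.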
how to solve "IndexError: tuple index out of range" problem?
I have tried experimenting with both Python environments 3.7 and 3.9, and I keep getting stuck at the same error. I would appreciate it if you could provide a solution.
Solved!!!
If you have the same problem, please refer to the link below. Thanks!
https://github.com/NVIDIA/apex/pull/1282/files/01802f623c9b54199566871b49f94b2d07c3f047
The environment provided in the README doesn't work
The timm version is too old and doesn't contain Swin.
I can't create 'ds_vit_base_patch16_224' via timm. Can you tell me how to solve this problem?
I use timm==0.5.4, and the other packages are the same as in the README.
DomainNet train-test splits
Did you use the complete train+test .txt files from DomainNet for training? In dataset/domainnet/ I find only one file per domain and no train-test splits. This matters because prior works mostly adopt a train-test split using the respective files.
PMTrans/torch.distributed.elastic.multiprocessing.api:failed
When training on the VisDA dataset, I ran "nohup bash dist_train.sh &", but the process stopped unexpectedly and did not finish. Since then, whenever I run "bash dist_train.sh", I hit the following error and cannot get past "dist.init_process_group()". I tried some solutions I found, but none of them worked. Can you help me? Thank you!
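One common cause of this symptom, which the report does not confirm, is that worker processes from the interrupted nohup run are still alive and still hold the rendezvous port, so a new launch cannot bind to it. The sketch below checks whether a TCP port can be bound locally. `is_port_free` is a hypothetical helper, and 29500 is only an assumed port (the PyTorch launcher's default); the port actually used by dist_train.sh is not shown in the thread.

```python
import socket


def is_port_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds it.
            return False
    return True


if __name__ == "__main__":
    port = 29500  # assumption: replace with the port your launch script uses
    state = "free" if is_port_free(port) else "in use"
    print(f"port {port} is {state}")
```

If the port turns out to be in use, look for leftover Python processes from the earlier run and stop them, or launch with a different port, before retrying.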
apex has too many bugs, try not to use it.
can you tell me the version
Could you please provide the versions of the following: nvidia-smi (NVIDIA driver), CUDA toolkit (CUDA driver), Python, and PyTorch?
There seems to be a version mismatch causing an error with torch.norm.
I'm attempting to replace torch.norm with l2.norm. Would that be fine?
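When asking for or reporting versions, it helps to collect them in one place. The sketch below gathers the Python version and the installed versions of a few packages using only the standard library (importlib.metadata, available from Python 3.8). `collect_versions` is a hypothetical helper, and the package list is an assumption based on the packages mentioned in these issues. The NVIDIA driver and CUDA toolkit versions are not covered here; they are reported by `nvidia-smi` and `nvcc --version`.

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed package list; adjust to match the project's requirements.
    for name, version in collect_versions(["torch", "torchvision", "timm", "apex"]).items():
        print(f"{name}: {version if version is not None else 'not installed'}")
```

Pasting this output into an issue makes version mismatches, such as the one suspected with torch.norm, much easier to pin down.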
Could you please advise how to debug?
Could you please advise how to debug in VS Code or PyCharm?
app's version
Could you list the version of each piece of software you used?
I have configured the software according to the versions listed on GitHub, but the program does not run. The main problem is timm, which does not include the SwinTransformer model at all.
I have also installed Apex successfully following NVIDIA's steps, but errors still occur.
I hope you can reply. I have been configuring the environment for several days and the program still does not run properly.
If I install all the software according to the latest version, there will still be errors.
such as
MakePmTrans is my version of the code with the distributed parts removed.