facebookresearch / swav
PyTorch implementation of SwAV https://arxiv.org/abs/2006.09882
License: Other
Hello,
I'm trying to run main_swav.py with the following command:
python -m torch.distributed.launch --nproc_per_node=1 main_swav.py --images_path=<path to data directory> --train_annotations_path <path to data file> --epochs 400 --base_lr 0.6 --final_lr 0.0006 --warmup_epochs 0 --batch_size 32 --size_crops 224 96 --nmb_crops 2 6 --min_scale_crops 0.14 0.05 --max_scale_crops 1. 0.14 --use_fp16 true --freeze_prototypes_niters 5005 --queue_length 3840 --epoch_queue_starts 15
Some of those parameters have been added to accommodate our data. The only changes I have made to the code are minor changes to the dataset and additional/changed arguments. When I run this command I get the following error:
Traceback (most recent call last):
File "main_swav.py", line 380, in <module>
main()
File "main_swav.py", line 189, in main
model, optimizer = apex.amp.initialize(model, optimizer, opt_level="O1")
File "/opt/conda/lib/python3.6/site-packages/apex/amp/frontend.py", line 358, in initialize
return _initialize(models, optimizers, _amp_state.opt_properties, num_losses, cast_model_outputs)
File "/opt/conda/lib/python3.6/site-packages/apex/amp/_initialize.py", line 158, in _initialize
raise TypeError("optimizers must be either a single optimizer or a list of optimizers.")
TypeError: optimizers must be either a single optimizer or a list of optimizers.
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'main_swav.py', '--local_rank=0', '--images_path=/data/computer_vision_projects/rare_planes/classification_data/images/', '--train_annotations_path', '/data/computer_vision_projects/rare_planes/classification_data/annotations/instances_train_role_mislabel_category_id_033_chipped.json', '--epochs', '400', '--base_lr', '0.6', '--final_lr', '0.0006', '--warmup_epochs', '0', '--batch_size', '32', '--size_crops', '224', '96', '--nmb_crops', '2', '6', '--min_scale_crops', '0.14', '0.05', '--max_scale_crops', '1.', '0.14', '--use_fp16', 'true', '--freeze_prototypes_niters', '5005', '--queue_length', '3840', '--epoch_queue_starts', '15']' returned non-zero exit status 1.
make: *** [Makefile:69: train-rare-planes] Error 1
Immediately before the line that throws the error I placed a couple of print statements:
print("type(OPTIMIZER)", type(optimizer))
print("OPTIMIZER", optimizer)
The output from those is:
type(OPTIMIZER) <class 'apex.parallel.LARC.LARC'>
OPTIMIZER SGD (
Parameter Group 0
    dampening: 0
    lr: 0.6
    momentum: 0.9
    nesterov: False
    weight_decay: 1e-06
)
Here are some version numbers I'm using:
Python 3.6.9 :: Anaconda, Inc.
PyTorch == 1.5.0a0+8f84ded
torchvision == 0.6.0a0
CUDA == 10.2
apex == 0.1
Any ideas why I would be seeing this error? Thanks in advance!
Thanks for your awesome work.
I wonder why the learning rate is so small in linear classification (0.3 in eval_linear.py)?
In the linear classification of MoCo, the initial learning rate is 30 with a two-stage reduction. That is a 100x difference from this repo.
Have you ever run eval_linear.py with MoCo v2 weights, or run SwAV weights with the code from MoCo?
I wonder about the performance impact of the lr.
Hi, how can you evaluate different models on a custom dataset?
Hi!
I wonder if I can use SwAV internally in a commercial company? We do not charge end-users directly, but of course the company is for profit and its profit may increase due to usage of DL models.
As the name of the license suggests, I can't use it, but would like to clarify.
Thanks!
In main_swav.py, the dataset folder is args.data_path, which is the ImageNet dataset root path (containing train, val, test):
train_dataset = MultiCropDataset(
    args.data_path,
    args.size_crops,
    args.nmb_crops,
    args.min_scale_crops,
    args.max_scale_crops,
)
In eval_linear.py:
train_dataset = datasets.ImageFolder(os.path.join(args.data_path, "train"))
So in main_swav.py, if I set args.data_path=/path/to/imagenet, it will use all of train, val, and test for self-supervised pretraining, am I right?
Hi, wonderful work, and thanks for sharing your code! I am now running your code on ImageNet following your settings, but I found that training just one epoch takes about an hour and a half, which is too slow. So I would like to know whether I am missing some key points that are important for speeding up training?
Hi! Thanks for sharing SwAV! I just wonder whether you have any follow-up plans to release pre-trained weights for larger models (like Res50X4, Res152X4). This would be a great help for researchers, as re-training them demands too many computational resources for us :(
Again, thank you very much for your work :)
Hi, I ran into a problem when I tried to load the pretrained resnet-50 model. It seems that the keys in the pre-trained model and keys in the torchvision resnet-50 are not the same. The same problem appears when I tried to load other models listed on the Model Zoo table. Could you please help me with this issue? Thanks.
Here is my code:
import torch, torchvision
model = torchvision.models.resnet50()
checkpoint = torch.load('.user/swav_800ep_pretrain.pth.tar')
model.load_state_dict(checkpoint, strict=False)
When I set strict=False, the model does not load any weights and acts like a randomly initialized model.
When I set strict=True, it raises the following error:
RuntimeError: Error(s) in loading state_dict for ResNet:
Missing key(s) in state_dict:
"conv1.weight", "bn1.weight", "bn1.bias", "bn1.running_mean", "bn1.running_var", "layer1.0.conv1.weight", "layer1.0.bn1.weight", "layer1.0.bn1.bias", "layer1.0.bn1.running_mean", "layer1.0.bn1.running_var",
......
"layer4.2.bn2.running_mean", "layer4.2.bn2.running_var", "layer4.2.conv3.weight", "layer4.2.bn3.weight", "layer4.2.bn3.bias", "layer4.2.bn3.running_mean", "layer4.2.bn3.running_var", "fc.weight", "fc.bias".
Unexpected key(s) in state_dict:
"module.conv1.weight", "module.bn1.weight", "module.bn1.bias", "module.bn1.running_mean", "module.bn1.running_var", "module.bn1.num_batches_tracked", "module.layer1.0.conv1.weight", "module.layer1.0.bn1.weight", "module.layer1.0.bn1.bias", "module.layer1.0.bn1.running_mean",
......
"module.projection_head.0.weight", "module.projection_head.0.bias", "module.projection_head.1.weight", "module.projection_head.1.bias", "module.projection_head.1.running_mean", "module.projection_head.1.running_var", "module.projection_head.1.num_batches_tracked", "module.projection_head.3.weight", "module.projection_head.3.bias", "module.prototypes.weight".
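The "Unexpected key(s)" list above suggests that every key in the checkpoint carries a `module.` prefix (as happens when a model is saved while wrapped for distributed training) and that the checkpoint also holds projection_head and prototypes entries that a plain torchvision ResNet does not have. A minimal sketch of remapping the keys before loading, assuming the checkpoint is a flat mapping from key to tensor; the helper names strip_module_prefix and keep_backbone_only are hypothetical and not part of the repo:

```python
def strip_module_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with a leading prefix removed from each key."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


def keep_backbone_only(state_dict, drop=("projection_head.", "prototypes.")):
    """Drop entries that a plain torchvision ResNet has no slot for."""
    return {k: v for k, v in state_dict.items() if not k.startswith(drop)}


# Toy stand-in for the loaded checkpoint (values would be tensors in practice).
checkpoint = {
    "module.conv1.weight": "w0",
    "module.layer1.0.bn1.bias": "b0",
    "module.projection_head.0.weight": "p0",
    "module.prototypes.weight": "c0",
}
cleaned = keep_backbone_only(strip_module_prefix(checkpoint))
print(sorted(cleaned))   # ['conv1.weight', 'layer1.0.bn1.bias']
```

With a real checkpoint, the cleaned dictionary would then be passed to model.load_state_dict(cleaned, strict=False). Under these assumptions the only keys still reported as missing should be the classifier ones (fc.weight and fc.bias).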
I wonder whether the released pretrained models were trained on uncurated data (1 billion random public non-EU images from Instagram)?
Thanks for open sourcing your codebase. Would it be possible to share the final model corresponding to Imagenet downstream task that gets 75.3% top-1 accuracy? Thanks in advance
Hi, I'm curious how you chose 2048 as the dimension in the projection layer; it seems that the ResNet in this repo outputs a tensor with 512 channels.
If my ResNet has to output a tensor with 256 channels, do you think I need to decrease it from 2048?
Thx
So I was trying to get this working as a prototype on Google Colab. I installed apex, and when I run
python -m torch.distributed.launch main_swav.py \
--data_path /content/data/fer/images \
--epochs 20 \
--base_lr 0.6 \
--final_lr 0.0006 \
--warmup_epochs 0 \
--batch_size 32 \
--size_crops 48 48 \
--use_fp16 true \
--freeze_prototypes_niters 5005 \
--queue_length 0 \
--epoch_queue_starts 15
I get this error:
Traceback (most recent call last):
File "main_swav.py", line 375, in <module>
main()
File "main_swav.py", line 123, in main
init_distributed_mode(args)
File "/content/swav/src/utils.py", line 65, in init_distributed_mode
rank=args.rank,
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 391, in init_process_group
init_method, rank, world_size, timeout=timeout
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/rendezvous.py", line 79, in rendezvous
raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for ://
Traceback (most recent call last):
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/usr/local/lib/python3.6/dist-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'main_swav.py', '--local_rank=0', '--data_path', '/content/data/fer/images', '--epochs', '20', '--base_lr', '0.6', '--final_lr', '0.0006', '--warmup_epochs', '0', '--batch_size', '32', '--size_crops', '48', '48', '--use_fp16', 'true', '--freeze_prototypes_niters', '5005', '--queue_length', '0', '--epoch_queue_starts', '15']' returned non-zero exit status 1.
Thinking this might be related to python -m torch.distributed.launch, because I am obviously not using a distributed computing environment, I tried changing it to torch.launch, which obviously does not work.
Can I get any help? Thanks.
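The last frame of the first traceback formats the URL scheme into the message, so "No rendezvous handler for ://" suggests that the init URL reaching init_process_group had an empty scheme (for example an empty string instead of env:// or tcp://host:port). A small standard-library sketch to check a candidate value before launching; the helper name check_dist_url and the set of accepted schemes are assumptions for illustration, not taken from the repo:

```python
from urllib.parse import urlparse

# Schemes commonly used for torch.distributed init methods (assumed list).
KNOWN_SCHEMES = {"env", "tcp", "file"}


def check_dist_url(dist_url):
    """Return the scheme of dist_url, raising if it is empty or unknown."""
    scheme = urlparse(dist_url).scheme
    if scheme not in KNOWN_SCHEMES:
        raise ValueError("No rendezvous handler for {}://".format(scheme))
    return scheme


print(check_dist_url("env://"))                 # env
print(check_dist_url("tcp://127.0.0.1:23456"))  # tcp
try:
    check_dist_url("")
except ValueError as err:
    print(err)                                  # No rendezvous handler for ://
```

If the check fails for the value the script ends up using, passing an explicit URL on the command line may be worth trying, even for a single process.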
Hi,
Thanks for your nice work!
I notice in your code that
Line 280 in 82bddbb
Hi, thanks for your excellent work! I ran into some problems when running the code.
First, I trained the SwAV model with the command python -m torch.distributed.launch --nproc_per_node=2 main_swav.py ..., and the model parameters were saved in checkpoint.pth.tar. But when I run eval_linear.py with the pretrained SwAV model using the command python -m torch.distributed.launch --nproc_per_node=2 eval_linear.py --pretrained checkpoint.pth.tar, I get some errors. The logs are:
Traceback (most recent call last):
File "/home/yc/codes/swav/src/utils.py", line 144, in restart_from_checkpoint
msg = value.load_state_dict(checkpoint[key], strict=False)
TypeError: load_state_dict() got an unexpected keyword argument 'strict'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "eval_linear.py", line 397, in <module>
main()
File "eval_linear.py", line 201, in main
scheduler=scheduler,
File "/home/yc/codes/swav/src/utils.py", line 147, in restart_from_checkpoint
msg = value.load_state_dict(checkpoint[key])
File "/home/yc/anaconda3/envs/tf2/lib/python3.6/site-packages/torch/optim/optimizer.py", line 123, in load_state_dict
raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
Traceback (most recent call last):
File "/home/yc/anaconda3/envs/tf2/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/yc/anaconda3/envs/tf2/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/yc/anaconda3/envs/tf2/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/home/yc/anaconda3/envs/tf2/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
Does this mean that something goes wrong when the optimizer is restored from the checkpoint? Could you help me? Thanks!
I'd like to ask a few questions.
Due to limited GPU resources, I can only use a single GPU to run SwAV experiments. In this case, what needs to be adjusted in the experimental settings? Will the performance of the pre-trained model decrease significantly?
At least how many instances are needed to get a relatively good pre-training result?
Regarding the hyperparameter args.nmb_prototypes, if the custom dataset has few actual categories (far fewer than 1k), is it necessary to adjust it accordingly?
In line 371 of main_swav.py, why does args.world_size appear in the code but not in the pseudo-code in the paper?
Thanks again for open-sourcing this code. I am looking forward to your reply.
Hi, I love the project a lot, thanks for sharing it!
However, for some reason, a GPU is not available to me. I have downloaded the pretrained model, and my ultimate goal is to get visual embeddings from it, so I am just wondering whether there is an easy way to do the following: if I input a single image into the model, the model outputs the corresponding embedding. Note that I don't expect to re-train or fine-tune the model, so maybe it's still possible to do the job without using a GPU.
Also, if this is not too dumb, could you please specify the input/output size? Thank you very much!
Hi @mathildecaron31
As you said,
In the paper, the prototypes are indeed normalized along the first dimension because the prototypes matrix, C, is of dimension DxK (i.e., 128x3000).
On the contrary, in the code, w is of dimension KxD (i.e., 3000x128). You can easily check that the normalization is done correctly by printing:
print(torch.norm(w, dim=1).shape) # should give 3000
print(torch.norm(w, dim=1)) # should give a vector with 1 everywhere
But the code goes:
self.prototypes = nn.Linear(output_dim, nmb_prototypes, bias=False)
where output_dim = 128 and nmb_prototypes = 3000.
So w is of dimension D * K (128 * 3000), which means there are 3000 prototypes and each prototype is a vector of dimension 128.
Thanks
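For reference, torch.nn.Linear(in_features, out_features) keeps its weight with shape (out_features, in_features) and computes x W^T, so nn.Linear(output_dim, nmb_prototypes) would hold a K x D matrix with one prototype per row. A dependency-free sketch of that layout, with toy sizes standing in for 128 and 3000 and a hypothetical helper linear_forward:

```python
def linear_forward(x, weight):
    """Compute x @ weight.T for x of shape (D,) and weight of shape (K, D)."""
    return [sum(w_d * x_d for w_d, x_d in zip(row, x)) for row in weight]


D, K = 4, 3                      # toy sizes standing in for 128 and 3000
weight = [[float(k + d) for d in range(D)] for k in range(K)]   # shape (K, D)
x = [1.0, 0.0, 0.0, 0.0]         # one D-dimensional embedding

scores = linear_forward(x, weight)
print(len(weight), len(weight[0]))   # 3 4 -> K rows, D columns
print(scores)                        # [0.0, 1.0, 2.0] -> one score per prototype
```

In this layout, normalizing along dim=1 normalizes each prototype (each row) to unit length.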
Hello. Thanks for the work. As far as I can see, the pretrained models end with a fully connected layer with dim = 1000. Shouldn't the projection head (dim = 128) be there? I want to get embeddings; how can I do this? Do you have a suitable pretrained model with the 128-dimensional projection head?
Hi, nice work! I tried to do pretraining with main_swav.py on multiple machines.
Here's the main code for distributed training.
python -m torch.distributed.launch main_swav.py --rank 0 \
--world_size 8 \
--dist_url 'tcp://172.31.11.200:23456' \
I commented out lines 55-59 in src/utils.py in order to set the rank for each machine. It runs fine.
But I found that during training, on each machine, only 1 GPU was used. I think it is caused by
Line 68 in 77f7185
Many thanks!
Can you provide a model / results of training SwAV with bs 256, input size 2x224+6x96 for 100 epochs?
The training time is too long.
SyncBatchNorm.convert_sync_batchnorm() causes ValueError: expected at least 3D input (got 2D input).
How can I solve it?
Thanks for sharing the code for such wonderful work. Have you done experiments using DeepCluster-v2 for fewer training epochs, e.g. 100 or 200 epochs? If so, can you provide the linear evaluation top-1 accuracy for such settings? Many thanks.
I trained a network from scratch with my own dataset and wrote some code that sorts images into different folders according to their cluster assignments. I did this with the following lines of code:
import numpy as np

embedding, output = model(inputs)
# softmax is assumed to be a row-wise softmax, e.g. torch.nn.Softmax(dim=1)
p = softmax(output / args.temperature)
prediction = p.tolist()
# index of the highest-probability prototype for each image
prototyp = [int(np.argmax(row)) for row in prediction]
The problem is that when I save the images in different folders according to their cluster assignment, some folders remain empty. The number of folders is the same as the number of prototypes. I always thought that the images are equally distributed between the different prototypes. What is the problem? Can you help me?
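One thing that may explain the empty folders: taking the argmax of the softmax is a hard assignment with no balancing constraint, so nothing forces every prototype to win at least one image (the equal-partition constraint, as far as the paper describes it, applies to the codes computed per batch during training). A small sketch for counting how many images each prototype receives; the score values are made up for illustration and hard_assignments is a hypothetical helper:

```python
from collections import Counter


def hard_assignments(scores):
    """Index of the highest-scoring prototype for each row of scores."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Made-up prototype scores for 5 images and 4 prototypes.
scores = [
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.0, 0.0],
    [0.1, 0.7, 0.1, 0.1],
    [0.6, 0.3, 0.1, 0.0],
    [0.2, 0.5, 0.2, 0.1],
]
assignments = hard_assignments(scores)
counts = Counter(assignments)
empty = [k for k in range(len(scores[0])) if counts[k] == 0]
print(assignments)   # [0, 0, 1, 0, 1]
print(empty)         # [2, 3]
```

Running the same count on real outputs would show whether a few prototypes dominate the assignments.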
Hi,
When I try to load one of your provided checkpoint models in your main_swav.py file, I always receive the warning:
WARNING - 01/05/21 11:22:46 - 0:00:09 - => failed to load optimizer from checkpoint ...
WARNING - 01/05/21 11:22:46 - 0:00:09 - => failed to load amp from checkpoint ...
WARNING - 01/05/21 11:22:46 - 0:00:09 - => failed to load state_dict from checkpoint ...
What is the reason for the warnings? Isn't it possible to use your provided models for finetuning?
Thank you for sharing this wonderful work.
I conducted several experiments using the released pretrained models. However, the pretrained resnet50w5 fails to load because the batchnorm layer of the projection_head is missing in the pretrained model.
So could I just ignore this batchnorm layer when using the pretrained resnet50w5?
Thanks very much!
Hi Mathilde,
In your swav paper, I understand that the backbone as well as the prototypes are updated.
Therefore, I was wondering why you call embeddings.detach() (https://github.com/facebookresearch/swav/blob/master/main_swav.py#L291) in your script. I thought when detaching a tensor, no gradient will be back-propagated along this variable.
Thanks in advance for your help!
Hi,
In the paper, the pseudo-code shows the prototypes are normalized along the first dimension:
with torch.no_grad():
    C = normalize(C, dim=0, p=2)
But in the source code, the prototypes are normalized along the second dimension:
# normalize the prototypes
with torch.no_grad():
    w = model.module.prototypes.weight.data.clone()
    w = nn.functional.normalize(w, dim=1, p=2)
    model.module.prototypes.weight.copy_(w)
Since each column of the prototypes matrix is regarded as one cluster, shouldn't the prototypes be normalized along the first dimension (dim=0)?
Thanks
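The two snippets can be consistent if the code stores the transpose of the paper's matrix: normalizing the columns of a D x K matrix gives the same unit-norm prototypes as normalizing the rows of its K x D transpose. A dependency-free sketch with toy sizes (the helper names are hypothetical):

```python
import math


def normalize_rows(m):
    """L2-normalize each row, like normalize(m, dim=1, p=2)."""
    return [[v / math.sqrt(sum(u * u for u in row)) for v in row] for row in m]


def normalize_cols(m):
    """L2-normalize each column, like normalize(m, dim=0, p=2)."""
    n_cols = len(m[0])
    norms = [math.sqrt(sum(row[j] * row[j] for row in m)) for j in range(n_cols)]
    return [[row[j] / norms[j] for j in range(n_cols)] for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


# Paper layout: D x K, one prototype per column (D=2, K=3).
C = [[3.0, 0.0, 1.0],
     [4.0, 2.0, 1.0]]
w = transpose(C)                      # code layout: K x D, one prototype per row

print(normalize_rows(w)[0])                                # [0.6, 0.8]
print(transpose(normalize_cols(C)) == normalize_rows(w))   # True
```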
Good job! Thanks for sharing the code. However, I was wondering how much gain Multi-Crop can bring to MoCo. Have you tried it?
I tried to train swav with a small dataset, and I got these generated files:
Once I have the trained model, how can I use it? How can I assign an unseen image to one of those clusters, and how can I retrieve images from the same cluster?
I used this command for training:
python -m torch.distributed.launch --nproc_per_node=1 main_swav.py \
--data_path pics1 \
--epochs 5 \
--base_lr 0.6 \
--final_lr 0.0006 \
--warmup_epochs 0 \
--batch_size 32 \
--size_crops 224 96 \
--nmb_crops 2 6 \
--min_scale_crops 0.14 0.05 \
--max_scale_crops 1. 0.14 \
--use_fp16 true \
--freeze_prototypes_niters 5005 \
--queue_length 3840 \
--epoch_queue_starts 15
I have used main_deepclusterv2.py to train a model on my custom dataset, and I want to see the clustering results. For example, I want to put the images of the same class into the same folder. How can I do that? Thank you.
Hello
I am trying to train on a custom dataset in an environment with a single GPU.
What's the problem?
Also, can you provide a tutorial for testing on a custom dataset?
export NGPU=1; python -m torch.distributed.launch --nproc_per_node=$NGPU main_swav.py --data_path /home/ubuntu/merge/src/swav/data/train --epochs 400 --base_lr 0.6 --final_lr 0.0006 --warmup_epochs 0 --batch_size 32 --size_crops 224 96 --nmb_crops 2 6 --min_scale_crops 0.14 0.05 --max_scale_crops 1. 0.14 --use_fp16 true --freeze_prototypes_niters 5005 --queue_length 3840 --epoch_queue_starts 15
Traceback (most recent call last):
File "main_swav.py", line 374, in <module>
main()
File "main_swav.py", line 122, in main
init_distributed_mode(args)
File "/home/ubuntu/merge/src/swav/src/utils.py", line 65, in init_distributed_mode
rank=args.rank,
File "/home/ubuntu/anaconda3/envs/swav/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 391, in init_process_group
init_method, rank, world_size, timeout=timeout
File "/home/ubuntu/anaconda3/envs/swav/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 79, in rendezvous
raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
RuntimeError: No rendezvous handler for ://
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/swav/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ubuntu/anaconda3/envs/swav/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/anaconda3/envs/swav/lib/python3.6/site-packages/torch/distributed/launch.py", line 263, in <module>
main()
File "/home/ubuntu/anaconda3/envs/swav/lib/python3.6/site-packages/torch/distributed/launch.py", line 259, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ubuntu/anaconda3/envs/swav/bin/python', '-u', 'main_swav.py', '--local_rank=0', '--data_path', '/home/ubuntu/merge/src/swav/data/train', '--epochs', '400', '--base_lr', '0.6', '--final_lr', '0.0006', '--warmup_epochs', '0', '--batch_size', '32', '--size_crops', '224', '96', '--nmb_crops', '2', '6', '--min_scale_crops', '0.14', '0.05', '--max_scale_crops', '1.', '0.14', '--use_fp16', 'true', '--freeze_prototypes_niters', '5005', '--queue_length', '3840', '--epoch_queue_starts', '15']' returned non-zero exit status 1.
Hi,
Thank you so much for sharing your code!
May I know if you have a copy of your loss record?
When I trained your model from scratch, the loss was stuck around 8 for the first 2 epochs. (I am still training the model.)
Is it the same for you?
Thank you.
Hi,
I wanted to benchmark SwAV on CIFAR-10.
Is there any recommended configuration for CIFAR-10? For eg:
Also, do you plan to publish any pretrained model on CIFAR-10?
Hi
The algorithm design: view A --> code x, view B --> code y, then view B predicts code x and view A predicts code y.
But in some experiments (CIFAR-10 dataset), I found that the model learns to cheat by predicting nearly the same embeddings (z in the paper) for all images, including their augmentations. In this way the loss decreases rapidly, but the model learns the wrong thing.
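The swapped prediction described above can be written as a cross-entropy between the code of one view and the softmax prediction from the other view. A dependency-free sketch for a single image with two views, assuming the codes and the prototype scores are already given (all numbers are made up, and the temperature value is an assumption):

```python
import math


def softmax(scores, temperature=0.1):
    """Softmax of prototype scores divided by a temperature."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(q, p):
    """-sum_k q_k log p_k for a code q and a prediction p."""
    return -sum(q_k * math.log(p_k) for q_k, p_k in zip(q, p))


def swapped_loss(q_a, scores_a, q_b, scores_b, temperature=0.1):
    """View B predicts the code of view A, and view A predicts the code of view B."""
    p_a = softmax(scores_a, temperature)
    p_b = softmax(scores_b, temperature)
    return cross_entropy(q_a, p_b) + cross_entropy(q_b, p_a)


q_a, q_b = [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]      # made-up codes over 3 prototypes
agree = swapped_loss(q_a, [0.9, 0.1, 0.0], q_b, [0.8, 0.2, 0.1])
disagree = swapped_loss(q_a, [0.9, 0.1, 0.0], q_b, [0.1, 0.2, 0.8])
print(agree < disagree)   # True
```

This term alone would also be small if every image received the same code and the same scores; in the paper it is the equal-partition constraint on the codes that is meant to rule out that trivial solution.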
Hello,
Thanks for your inspiring paper and code.
I trained SwAV with a batch size of 4096 for 200 epochs and then trained a linear classifier with your default setting (batch size of 256 on 8 GPUs), achieving 74.5% top-1 accuracy. I wanted to speed up the linear classifier training, so I tried to train it with a batch size of 2048 on 64 GPUs and left all the other settings the same. I observed 73.3% top-1, a slight drop from your default setting.
So I am wondering how to train the linear classifier on 64 GPUs and achieve performance similar to training on 8 GPUs, e.g. by tuning some hyper-parameters? Looking forward to your reply.
Thanks
Referring to this section of the paper:
In the code, this part is supposedly handled with crops_for_assign:
for i, crop_id in enumerate(args.crops_for_assign):
    with torch.no_grad():
        out = output[bs * crop_id: bs * (crop_id + 1)]
        # time to use the queue
        if queue is not None:
            if use_the_queue or not torch.all(queue[i, -1, :] == 0):
                use_the_queue = True
                out = torch.cat((torch.mm(
                    queue[i],
                    model.module.prototypes.weight.t()
                ), out))
            # fill the queue
            queue[i, bs:] = queue[i, :-bs].clone()
            queue[i, :bs] = embedding[crop_id * bs: (crop_id + 1) * bs]
        # get assignments
        q = torch.exp(out / args.epsilon).t()
        q = distributed_sinkhorn(q, args.sinkhorn_iterations)[-bs:]
I am not sure how this indexing out = output[bs * crop_id: bs * (crop_id + 1)] ensures we are only operating on full resolution views (224/160)?
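The slice only selects full-resolution views if the model output is ordered crop by crop, with the large crops first. A dependency-free sketch of that bookkeeping, assuming --nmb_crops 2 6 (so crop ids 0 and 1 are the 224-pixel views), bs images per crop, and crops_for_assign = [0, 1]; the rows are placeholder labels rather than real tensors:

```python
bs = 3                               # images per crop in the toy batch
nmb_crops = [2, 6]                   # 2 large crops followed by 6 small crops
sizes = [224, 96]

# Resolution of each crop id, in the order the crops are assumed to be stacked.
crop_sizes = [size for size, n in zip(sizes, nmb_crops) for _ in range(n)]

# Stand-in for `output`: one (crop_id, image_id) label per row.
output = [(crop_id, img) for crop_id in range(sum(nmb_crops)) for img in range(bs)]

crops_for_assign = [0, 1]
for crop_id in crops_for_assign:
    out = output[bs * crop_id: bs * (crop_id + 1)]
    print(crop_sizes[crop_id], out)
# 224 [(0, 0), (0, 1), (0, 2)]
# 224 [(1, 0), (1, 1), (1, 2)]
```

Under that ordering assumption, each slice is the block of bs rows belonging to one crop id, so restricting crop_id to the first entries restricts the assignment step to the large crops.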
I tried to load checkpoint downloaded from resnet50w2 to do some experiments, but an error occurred. It seems the model you published doesn't match the config model in resnet.py for resnet50w2.
size mismatch for module.layer1.0.conv1.weight: copying a param with shape torch.Size([128, 128, 1, 1]) from checkpoint, the shape in current model is torch.Size([256, 128, 1, 1]).
size mismatch for module.layer1.0.bn1.running_var: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module.layer1.0.bn1.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module.layer1.0.bn1.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module.layer1.0.bn1.running_mean: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
size mismatch for module.layer1.0.conv2.weight: copying a param with shape torch.Size([128, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([256, 256, 3, 3]).
size mismatch for module.layer1.0.bn2.running_var: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([256]).
The script is swav_RN50w2_400ep_pretrain.sh and the checkpoint is resnet50w2.
Hi Mathilde,
Thanks for your great work. I enjoyed reading your paper!
When running main_swav.py, I experience no reproducibility of the results (although the seeds are set nicely in utils.fix_random_seeds).
RUN1:
INFO - 12/07/20 09:38:11 - 0:00:06 - Epoch: [0][0] Loss 3.5037 (3.5037)
INFO - 12/07/20 09:38:30 - 0:00:25 - Epoch: [0][50] Loss 2.9354 (3.0861)
RUN2:
INFO - 12/07/20 09:37:31 - 0:00:06 - Epoch: [0][0] Loss 3.5037 (3.5037)
INFO - 12/07/20 09:37:51 - 0:00:25 - Epoch: [0][50] Loss 2.9074 (3.0710)
Do you experience the same? If yes, do you have a clue why that is the case (maybe distributed training)?
Thanks in advance!
Hi,
Thanks for sharing this awesome repo.
I was wondering in the eval script how is the finetuning on 1% and 10% of imagenet done?
Here it looks like the entire folder is used for the dataset:
Line 103 in 139623b
Was 1% and 10% of train imagenet preselected and placed into separate folders?
If I run main_deepclusterv2.py in a non-distributed training mode, what modifications do I need to make?
Traceback (most recent call last):
File "main_deepclusterv2.py", line 426, in <module>
main()
File "main_deepclusterv2.py", line 119, in main
init_distributed_mode(args)
File "/remote_projects/ImageSimilarity/swav/src/utils.py", line 56, in init_distributed_mode
args.rank = int(os.environ["RANK"])
File "/root/software/anaconda3/envs/similarity/lib/python3.6/os.py", line 669, in __getitem__
raise KeyError(key) from None
KeyError: 'RANK'
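The KeyError suggests that init_distributed_mode reads RANK from the environment, which is normally set by a launcher such as python -m torch.distributed.launch. One possible way to run a single process is to provide defaults for the missing variables; a sketch, assuming the script needs only a rank and a world size (the helper name read_dist_env is hypothetical, and the repo's own function may need more than this):

```python
import os


def read_dist_env(environ=os.environ):
    """Read rank and world size, defaulting to a single-process setup."""
    rank = int(environ.get("RANK", 0))
    world_size = int(environ.get("WORLD_SIZE", 1))
    return rank, world_size


print(read_dist_env({}))                                   # (0, 1)
print(read_dist_env({"RANK": "3", "WORLD_SIZE": "8"}))     # (3, 8)
```

Launching through the launcher with --nproc_per_node=1, as in the single-GPU commands quoted elsewhere on this page, is another option that leaves the script unchanged.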
Hi,
How can I display clustering results? When I forward an image through a pretrained network, I get a vector of numbers whose length is the number of prototype vectors. Do I have to pass this vector to the distributed_sinkhorn function?
The distributed_sinkhorn function returns the probabilities for every cluster, is this correct?
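For displaying the clusters of a trained model, taking the argmax over the prototype scores of each image is often enough; Sinkhorn-style normalization is a batch-level operation that rescales a whole score matrix so that prototypes are used roughly equally. A single-process, dependency-free sketch of such an iteration, assuming scores of shape B x K and an epsilon of 0.05 (a simplified illustration, not the repo's distributed_sinkhorn):

```python
import math


def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Turn a B x K score matrix into B soft assignments over K prototypes."""
    B, K = len(scores), len(scores[0])
    # Q has shape K x B, mirroring q = exp(out / epsilon).t()
    Q = [[math.exp(scores[b][k] / epsilon) for b in range(B)] for k in range(K)]
    for _ in range(n_iters):
        # rows: each prototype gets total mass 1 / K
        for k in range(K):
            row_sum = sum(Q[k])
            Q[k] = [v / (row_sum * K) for v in Q[k]]
        # columns: each sample gets total mass 1 / B
        for b in range(B):
            col_sum = sum(Q[k][b] for k in range(K))
            for k in range(K):
                Q[k][b] /= col_sum * B
    # rescale so that each sample's assignment sums to 1, then return B x K
    return [[Q[k][b] * B for k in range(K)] for b in range(B)]


scores = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7], [0.6, 0.5]]   # made-up, B=4, K=2
codes = sinkhorn(scores)
print([round(sum(row), 6) for row in codes])   # [1.0, 1.0, 1.0, 1.0]
```

So the returned rows are soft assignments that each sum to 1, but they are computed jointly for the batch rather than as independent per-image probabilities.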
Hi @mathildecaron31.
I was wondering if you'd be interested in including our (@ayulockin and mine) implementation (in TensorFlow) of SwAV in the README. Many folks might find it helpful.
Hi, thanks for your excellent work! Could you kindly release the model and results pre-trained with a batch size of 256 for 200 epochs without multi-crop? I am asking because this seems to be a commonly used configuration in the literature, but it is missing from both the paper and the repo. Some researchers have raised this issue before, but it seems that it has not been resolved. Given limited computing resources, I think releasing this model would help a lot. Thank you very much!
Hi,
I noticed that you adopted 8 GPUs as a group in SyncBN (https://github.com/facebookresearch/swav/blob/master/main_swav.py#L158) when training with a large batch size of 4096, i.e. 512 training samples in a group for sync batchnorm. I am wondering 1) why you don't use global SyncBN for training and 2) how much this affects the results?
Thanks!
@mathildecaron31 thanks for the help with the lightning implementation.
Would you like us to submit a PR for it?
Great work. When I trained the model, in each epoch, it will print a few lines of "Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to ***". I didn't change anything. Is this normal? Thanks
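That message is what mixed-precision dynamic loss scaling typically prints when a step produces inf/NaN gradients: the step is skipped and the scale is lowered, then raised again after a run of clean steps. A toy simulation of that policy, assuming a halving/doubling factor of 2 and a hypothetical window of 4 clean steps (real libraries use a much longer window); it does not call apex:

```python
class ToyLossScaler:
    """Minimal dynamic loss scaler: back off on overflow, grow after clean steps."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, window=4):
        self.scale = init_scale
        self.factor = factor
        self.window = window
        self.clean_steps = 0

    def update(self, overflow):
        """Return True if the optimizer step should be applied."""
        if overflow:
            self.scale /= self.factor     # reduce the scale and skip this step
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps == self.window:
            self.scale *= self.factor     # try a larger scale again
            self.clean_steps = 0
        return True


scaler = ToyLossScaler()
applied = [scaler.update(o) for o in [True, False, False, False, False, True]]
print(applied)        # [False, True, True, True, True, False]
print(scaler.scale)   # 32768.0
```

Under such a policy, a few skipped steps per epoch would be expected; a scale that keeps shrinking toward zero would be more of a warning sign.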
Hi, I would like to use main_deepclusterv2.py to cluster new images without fine-tuning. How can I use main_deepclusterv2.py to implement the DeepCluster evaluation (eval_voc_classif_fc6_8.sh)? Thank you!
Hi, thanks for your excellent work!
I have a question about num_prototype in DeepCluster-v2. What does num_prototype mean?
Why can num_prototype be bigger than class_num?
Thanks!
Hi, Thanks for your excellent work! Is it possible for you to release pre-trained models and results without 4x96 crops? Thank you so much!