
dcl's Issues

TypeError: can't convert np.ndarray of type numpy.str_.

evaluating val ...
Traceback (most recent call last):
File "train.py", line 228, in
checkpoint=args.check_point)
File "/home/wmj/DCL/utils/train_model.py", line 139, in train
val_acc1, val_acc2, val_acc3 = eval_turn(Config, model, data_loader['val'], 'val', epoch, log_file)
File "/home/wmj/DCL/utils/eval_model.py", line 43, in eval_turn
labels = Variable(torch.from_numpy(np.array(data_val[1])).long().cuda())
TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, int64, int32, int16, int8, uint8, and bool.

How can I solve this issue?
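
This error means data_val[1] contains string labels, and torch.from_numpy cannot convert a numpy.str_ array. A minimal sketch of one possible fix, assuming the labels are class-name strings that can be mapped to integer indices (the class_to_idx mapping below is hypothetical, not from the repo):

    import numpy as np
    import torch

    # Toy stand-in for data_val[1]: labels arriving as strings.
    str_labels = ['001.Black_footed_Albatross', '002.Laysan_Albatross']

    # Map class-name strings to integer indices before building the tensor.
    class_to_idx = {name: i for i, name in enumerate(sorted(set(str_labels)))}
    label_ids = np.array([class_to_idx[s] for s in str_labels], dtype=np.int64)
    labels = torch.from_numpy(label_ids).long()  # add .cuda() on GPU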

One layer of the saved model has a dimension mismatch

I trained on the CUB dataset and saved the model; after loading it, I found that one layer's dimensions differ.
saved_model:
......
model.7.2.bn3.running_var  torch.Size([2048])
model.7.2.bn3.num_batches_tracked  torch.Size([])
classifier.weight  torch.Size([200, 2048])
classifier_swap.weight  torch.Size([400, 2048])
Convmask.weight  torch.Size([1, 2048, 1, 1])
Convmask.bias  torch.Size([1])

origin_model:
......
model.7.2.bn3.running_var  torch.Size([2048])
model.7.2.bn3.num_batches_tracked  torch.Size([])
classifier.weight  torch.Size([200, 2048])
classifier_swap.weight  torch.Size([2, 2048])
Convmask.weight  torch.Size([1, 2048, 1, 1])
Convmask.bias  torch.Size([1])

What is going on?
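
For context: 400 = 2 × 200 CUB classes, the shape of the swap head built with --cls_mul, while [2, 2048] is the binary head built with --cls_2, so the checkpoint and the freshly constructed model appear to have been configured with different flags. If rebuilding the model with the matching flag is not an option, a generic shape-checked load avoids the hard failure (a sketch, not the repo's loading code):

    import torch

    def load_matching(model, ckpt_path):
        """Load only those checkpoint tensors whose shapes match the model."""
        checkpoint = torch.load(ckpt_path, map_location='cpu')
        model_dict = model.state_dict()
        filtered = {k: v for k, v in checkpoint.items()
                    if k in model_dict and v.shape == model_dict[k].shape}
        model_dict.update(filtered)  # mismatched heads keep their fresh init
        model.load_state_dict(model_dict)
        return model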

What are the training hyperparameters to get top-1 acc > 0.87?

What are the training hyperparameters needed to get top-1 acc > 0.87, e.g. batch size and learning-rate schedule? And how many epochs do you run?
I can only get 0.83 on CUB_200_2011. I split the dataset into 70% training and 30% validation.
Please do not close the issues so fast.
Thanks.

About the dataloader

In dataset_DCL.py, __getitem__() returns:
return img_unswap, img_swap, label, label_swap, swap_law1, swap_law2, self.paths[item]
but when I debug in train_model.py, the batch unpacks as:
inputs, labels, labels_swap, swap_law, img_names = data
Why are these different sets of items, and why is the batch of inputs twice batch_size?

Can you help me find out why?
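
One reading of this: __getitem__ returns both the unswapped and the swapped image for every sample, and the training collate step concatenates the two along the batch dimension, which is why the model sees 2 × batch_size inputs while the unpacked tuple has fewer fields. Below is a hypothetical collate function shaped like the unpacking in train_model.py suggests (not copied from the repo):

    import torch

    def collate_train(batch):
        imgs, labels, labels_swap, swap_law, img_names = [], [], [], [], []
        for unswap, swap, label, label_swap, law1, law2, name in batch:
            imgs += [unswap, swap]             # two images per sample
            labels += [label, label]
            labels_swap += [label, label_swap]
            swap_law += [law1, law2]
            img_names.append(name)
        return torch.stack(imgs), labels, labels_swap, swap_law, img_names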

No module named 'models'

Traceback (most recent call last): File "train_rel.py", line 16, in <module> from models.resnet_swap_2loss_add import resnet_swap_2loss_add ModuleNotFoundError: No module named 'models'

from resnet_swap_2loss_add import resnet_swap_2loss_add
AttributeError: 'resnet_swap_2loss_add' object has no attribute 'module'

Maybe you can just delete .module if you use only one GPU.
When I use model = nn.DataParallel(model), my GPU gets stuck and kill -9 pid does not respond.

A question about the labels

In the released code, the image labels in the CUB train.txt and val.txt files run from 1 to 200, but feeding those labels straight into the loss computation raises an error. Shouldn't the label values be shifted to 0-199?
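
For reference, the same off-by-one is diagnosed and patched in the "reduce failed to synchronize" issue further down; the fix quoted there shifts the labels to 0-based where the annotation is read in dataset_DCL.py:

    # Shift the 1-200 labels from train.txt/val.txt to the 0-199 range
    # expected by the cross-entropy loss (fix quoted from the issue below).
    self.labels = [int(x) - 1 for x in anno['label'].tolist()]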

What is the meaning of swap_law1?

Hi, I cannot understand the use of swap_law1, and why 24 is used rather than some other number:
swap_law1 = [(i-24)/49 for i in range(crop_num[0]*crop_num[1])]
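
A plausible reading (inferred from the line itself, not confirmed by the authors): with --swap_num 7 7 the image is cut into 7 × 7 = 49 patches, and 24 = (49 - 1) / 2 is the index of the central patch, so the expression assigns each patch its offset from the center, normalized by the patch count. A generalized sketch:

    # For an n x n grid there are n*n patches; the center patch sits at
    # index (n*n - 1) // 2, which is 24 when n = 7, and the divisor is 49.
    n = 7
    center = (n * n - 1) // 2
    swap_law1 = [(i - center) / (n * n) for i in range(n * n)]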

Using MobileNet as the backbone network

Hello,
I am a newcomer. With resnet50 as the feature-extraction network the model is too large, so I replaced resnet50 with MobileNetV2 and trained on Stanford Cars, but the accuracy is only 0.74. Have you tried lightweight networks? Are they suitable for this kind of task, and are there any optimization measures you would suggest?

RuntimeError: cudaEventSynchronize in future::wait: device-side assert triggered

I can run it only for a few steps.

ljy@scw4750:~/liang-codes/DCL-master$ python train.py --data CUB --epoch 360 --backbone resnet50                     --tb 16 --tnw 16 --vb 512 --vnw 16                     --lr 0.0008 --lr_step 60                     --cls_lr_ratio 10 --start_epoch 0                     --detail training_descibe --size 512                     --crop 448 --cls_mul --swap_num 7 7
Namespace(auto_resume=False, backbone='resnet50', base_lr=0.0008, check_point=5000, cls_2=False, cls_lr_ratio=10.0, cls_mul=True, crop_resolution=448, dataset='CUB', decay_step=60, discribe='training_descibe', epoch=360, resize_resolution=512, resume=None, save_point=5000, start_epoch=0, swap_num=[7, 7], train_batch=16, train_num_workers=16, val_batch=512, val_num_workers=16)
Choose model and train set
resnet50
train from imagenet pretrained models ...
Set cache dir
the num of new layers: 4
step:        1 / 375 loss=ce_loss+swap_loss+law_loss: 11.6468 = 5.2024 + 6.0823 + 0.3620 
step:        2 / 375 loss=ce_loss+swap_loss+law_loss: 11.6235 = 5.2029 + 5.9551 + 0.4655 
step:        3 / 375 loss=ce_loss+swap_loss+law_loss: 11.7090 = 5.1929 + 6.0129 + 0.5033 
step:        4 / 375 loss=ce_loss+swap_loss+law_loss: 11.6939 = 5.3655 + 5.9946 + 0.3338 
step:        5 / 375 loss=ce_loss+swap_loss+law_loss: 11.6287 = 5.3284 + 5.9917 + 0.3086 
step:        6 / 375 loss=ce_loss+swap_loss+law_loss: 12.3495 = 5.6930 + 6.1811 + 0.4753 
step:        7 / 375 loss=ce_loss+swap_loss+law_loss: 12.0914 = 5.4335 + 6.0904 + 0.5675 
step:        8 / 375 loss=ce_loss+swap_loss+law_loss: 12.1460 = 5.6956 + 6.1155 + 0.3348 
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [3,0,0] Assertion `t >= 0 && t < n_classes` failed.
CUDA error after cudaEventDestroy in future dtor: device-side assert triggered
Traceback (most recent call last):
  File "train.py", line 229, in <module>
    checkpoint=args.check_point)
  File "/home/ljy/liang-codes/DCL-master/utils/train_model.py", line 111, in train
    law_loss = add_loss(outputs[2], swap_law) * gamma_
  File "/home/ljy/anaconda3/envs/p36c8ljy/lib/python3.6/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ljy/anaconda3/envs/p36c8ljy/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 85, in forward
    reduce=self.reduce)
  File "/home/ljy/anaconda3/envs/p36c8ljy/lib/python3.6/site-packages/torch/nn/functional.py", line 1558, in l1_loss
    input, target, size_average, reduce)
  File "/home/ljy/anaconda3/envs/p36c8ljy/lib/python3.6/site-packages/torch/nn/functional.py", line 1537, in _pointwise_loss
    return lambd_optimized(input, target, size_average, reduce)
RuntimeError: cudaEventSynchronize in future::wait: device-side assert triggered
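
The `t >= 0 && t < n_classes` assertion printed just before the crash says some target label falls outside [0, n_classes); the same symptom is diagnosed in the "reduce failed to synchronize" issue below, where the annotation labels turned out to be 1-based. A minimal pre-flight check of the label range (the annotation-file path and column layout here are assumptions based on dataset_DCL.py's use of anno['label'], not verified):

    import pandas as pd

    # Hypothetical path and column names; adjust to the actual anno file.
    anno = pd.read_csv('datasets/CUB/train.txt', sep=r'\s+', header=None,
                       names=['path', 'label'])
    labels = anno['label'].astype(int)
    print('label range:', labels.min(), '-', labels.max())
    # For 200 classes the range must be 0-199; a 1-200 range triggers the
    # ClassNLLCriterion device-side assert on the GPU.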

Performance on the FGVC-Aircraft dataset

Hello, I retrained DCL on the Aircraft dataset. All hyperparameters except N followed the default settings. With N set to 7 and 2, I got top accuracies of 90.4% and 92.3% respectively, but the reported results are 92.2% and 93.0%.
Is there anything wrong? Could someone offer suggestions for reproducing the reported results?
Thanks!

Works well

Applied to my own data, the results indeed improve, and it works very well! The drawback is that the code quality is not great.

Excuse me, how do I test with test.py?

When I ran test.py, I encountered a "no attribute 'submit'" error, and 'unswap' is an unexpected keyword argument. Please tell me how to run test.py, thanks.

Inquiry: scaling to large numbers of classes

Hello, author. You write in the README that this algorithm has already been applied to JD.com product recognition. CUB has only 200 classes, while real product catalogs can run to tens of thousands, and your network has two fully connected layers, so the memory usage and parameter count of those layers would grow sharply. How do you solve this?
Results on small-scale data alone do not really reveal a network's true performance. Given large-scale data, how can this be deployed in practice?

test.py #75

got an unexpected keyword argument 'unswap'?

CUDA Error while running the code

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu line=110 error=59 : device-side assert triggered
Traceback (most recent call last):
File "mytrain2.py", line 146, in
save_dir=save_dir)
File "/media/HDD_3TB2/rupali/Code/DCL/utils/train_util_DCL.py", line 60, in train
loss = criterion(outputs[0], labels)
File "/home/rupali/anaconda3/envs/EnvPytorch/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in call
result = self.forward(*input, **kwargs)
File "/home/rupali/anaconda3/envs/EnvPytorch/lib/python3.7/site-packages/torch/nn/modules/loss.py", line 942, in forward
ignore_index=self.ignore_index, reduction=self.reduction)
File "/home/rupali/anaconda3/envs/EnvPytorch/lib/python3.7/site-packages/torch/nn/functional.py", line 2056, in cross_entropy
return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction)
File "/home/rupali/anaconda3/envs/EnvPytorch/lib/python3.7/site-packages/torch/nn/functional.py", line 1871, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/generic/ClassNLLCriterion.cu:110

RuntimeError: reduce failed to synchronize: device-side assert triggered

I cloned the code, downloaded the Aircraft dataset, and used datasets/AIR/train.txt as the annotation. But when I run the code, it raises RuntimeError: reduce failed to synchronize: device-side assert triggered.

/opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [20,0,0] Assertion `t >= 0 && t < n_classes` failed.
/opt/conda/conda-bld/pytorch_1532579245307/work/aten/src/THCUNN/ClassNLLCriterion.cu:105: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [21,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "train.py", line 229, in <module>
    checkpoint=args.check_point)
  File "/home/workspace/git/python/DCL/utils/train_model.py", line 111, in train
    law_loss = add_loss(outputs[2], swap_law) * gamma_
  File "/home/miniconda3/envs/cuda92/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/miniconda3/envs/cuda92/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 87, in forward
    return F.l1_loss(input, target, reduction=self.reduction)
  File "/home/miniconda3/envs/cuda92/lib/python3.6/site-packages/torch/nn/functional.py", line 1702, in l1_loss
    input, target, reduction)
  File "/home/miniconda3/envs/cuda92/lib/python3.6/site-packages/torch/nn/functional.py", line 1674, in _pointwise_loss
    return lambd_optimized(input, target, reduction)
RuntimeError: reduce failed to synchronize: device-side assert triggered

After checking datasets/AIR/train.txt, I found that the class ids lie in [1, 100], but they should be in [0, 99], so the assertion `t >= 0 && t < n_classes` fails.

Solution: subtract 1 from each label, in dataset_DCL.py#L41:

from: self.labels = anno['label'].tolist()
to:   self.labels = [int(x)-1 for x in anno['label'].tolist()]

ct_train/val/test file

Hi,

Can you share the ct_train/val/test.txt files for CUB, Stanford Cars, and Aircraft with us?

Thank you.

Kind regards.

Kiki

Is it possible to get the ground-truth labels of the TEST set of the FGVC product dataset?

Dear authors, thanks for sharing your amazing work. However, I have run into a small problem.

I wonder how to evaluate recognition performance on the FGVC product dataset, since its test labels are not provided at all. I also noticed that the FGVC competition is now closed, so I cannot make a submission to the Kaggle server https://www.kaggle.com/c/imaterialist-product-2019/submit. Is it currently possible to get the ground-truth labels of the TEST set of the FGVC product dataset?

I am looking forward to your kind help. Thanks!

RuntimeError: Error(s) in loading state_dict for MainModel

Dear authors:
I use a single GPU (TITAN Xp, 12 GB) to train and test on the STCAR, CUB, and AIR datasets. No errors appear during training, but at test time I hit the same problem on all three datasets: the weights cannot be loaded. Details are shown below.
/usr/bin/python3.6 /media/duanzd/local/DCL-master/test.py --ver test --acc_report --data STCAR --backbone resnet50 --save /media/duanzd/local/DCL_weights/net_model/DCL_STCAR/weights_358_508_1.0000_1.0000.pth
Namespace(acc_report=True, backbone='resnet50', batch_size=16, crop_resolution=448, dataset='STCAR', num_workers=16, resize_resolution=512, resume='/media/duanzd/local/DCL_weights/net_model/DCL_STCAR/weights_358_508_1.0000_1.0000.pth', save_suffix=None, swap_num=[7, 7], version='test')
resnet50
Traceback (most recent call last):
File "/media/duanzd/local/DCL-master/test.py", line 93, in
model.load_state_dict(model_dict)
File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 721, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for MainModel:
While copying the parameter named "classifier_swap.weight", whose dimensions in the model are torch.Size([2, 2048]) and whose dimensions in the checkpoint are torch.Size([392, 2048]).
I wonder whether some settings were wrong when I trained the weights. Thank you for the help.
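
For what it's worth, 392 = 2 × 196 Stanford Cars classes, which suggests the checkpoint was trained with the 2 × num_classes swap head (--cls_mul / cls_2xmul) while test.py rebuilt the model with the two-way head (--cls_2). A hedged sketch of the idea; the attribute names below are assumptions, not verified against the repo:

    # Hypothetical: rebuild the test-time model with the same swap-head
    # setting used in training, so classifier_swap is created as
    # [2 * num_classes, 2048] rather than [2, 2048].
    Config.cls_2 = False      # assumed attribute name
    Config.cls_2xmul = True   # assumed attribute name
    model = MainModel(Config)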

'MainModel' object has no attribute 'module'

Traceback (most recent call last):
File "train.py", line 194, in
ignored_params1 = list(map(id, model.module.classifier.parameters()))
File "/home/weiyanyan/anaconda3/lib/python3.6/site-packages/torch/nn/modules/module.py", line 518, in __getattr__
type(self).__name__, name))
AttributeError: 'MainModel' object has no attribute 'module'

Some inconsistencies between paper and code

Dear @wyvernbai,
I may have found some inconsistencies between the paper and the code, and I am looking forward to your further explanation.
Firstly, you mention θ_adv ∈ R^(d×2) in the paper, which means the discriminator is a two-way classifier. However, the code makes it a 2 * num_class-way classifier.
Secondly, an adversarial loss is usually optimized via a minimax game, but I did not find any minimax optimization in your code. Is that intended?
Lastly, the paper says "the outputs are handled by an ReLU and an average pooling to get a map with the size of 2×N×N" to produce the location map, but Convmask(2048, 1, 1) + avgpool in the code does not seem to have that effect?

Fix the random seed to get identical results

Hi, thanks for sharing your work. Can you please fix the random seed so that we can get consistent results? That would help further research. (I failed to fix it myself, because some sources of randomness are beyond my knowledge.)

Thanks again.
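
The repo does not guarantee determinism, but a common starting point is to seed every random source and force deterministic cuDNN kernels; note that DataLoader workers and some CUDA ops can still introduce nondeterminism. A generic PyTorch sketch (not code from this repo):

    import random
    import numpy as np
    import torch

    def set_seed(seed=0):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Trade speed for reproducibility in cuDNN.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False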

Training from a different backbone

I want to train on the Stanford Dogs dataset with DCL from a pretrained model. When I start my training job like this:

python train.py --data product --epoch 360 --backbone resnet50 \
                    --tb 16 --tnw 16 --vb 512 --vnw 16 \
                    --lr 0.0008 --lr_step 60 \
                    --cls_lr_ratio 10 --start_epoch 1 \
                    --detail resnet50_zqs --size 512 \
                    --crop 448 --cls_mul --swap_num 7 7

It works just fine; I get 84% accuracy. So I tried to train with a larger backbone like this:

python train.py --data product --epoch 60 --backbone senet154 \
                    --tb 16 --tnw 32 --vb 512 --vnw 32 \
                    --lr 0.01 --lr_step 12 \
                    --cls_lr_ratio 10 --start_epoch 1 \
                    --detail senet154_zqs --size 512 \
                    --crop 448 --cls_2 --swap_num 7 7

However, it ended with 29% accuracy. I don't know whether something is wrong with my hyperparameters; can anyone help?

I have already downloaded the pretrained model, put it in the correct path, and added it in config.py like this:

pretrained_model = {'resnet50': './models/pretrained/resnet50-19c8e357.pth',
                    'senet154': './models/pretrained/senet154-c7b49a05.pth'}

Inquiry: baseline training accuracy

Hello, author.
I would like to ask about the baseline. I trained resnet50 on CUB with the same learning-rate settings as your code, for 300 epochs in total, dividing the learning rate by 10 every 90 epochs. However, the test accuracy rises very slowly early in training and only reaches 52% in the end.
Do you have any tips for training the baseline?
Thank you.

More than one object per image, and feature vectors for image retrieval

Nice work!
I wonder about two things:
1st:
DCL focuses on discriminative detail regions, then extracts features, then classifies. If there is more than one object (e.g. a dog and a cat) in an image, have you tested the impact on DCL?
2nd:
Have you extended DCL's feature vector to image retrieval?
Since the classifier layer is 2048->num_classes and there are many SKUs in e-commerce, I wonder how to bridge the gap between the two.
If there are 1 billion SKUs, using only a small, finite subset of them may not be good enough, e.g. 2048->1000 SKUs or 20480->10000 SKUs.

test.py: size of tensor doesn't match

Hi,
I successfully ran the training code, but when it came to testing I didn't know how to set the parser arguments.
I ran train.py --cls_mul with resnet50.

In test.py, I use the val set for testing: I commented out lines 63-66 (so ct_test.txt is not used), commented out line 73 (undefined) and line 94 (only one device is used), and pointed the resume path on line 89 to my model.

But then line 111 raises:
RuntimeError: the size of tensor a (**) must match the size of tensor b (2) at non-singleton dimension 1

The size of tensor a (**) equals the class number I set in config.py, but output[1] does not seem to match that size.

I'm really interested in your project and would like to know what's wrong with my process.

Training hyperparameters

What is the meaning of the training hyperparameters '--cls_2' and '--cls_mul', and what is the difference between them?

nn.DataParallel problem

When using nn.DataParallel, I get OOM even with train_batch set to 1. After turning nn.DataParallel off, it runs smoothly with train batch 16, using 12 GB on a TITAN Xp. From the issues on the previous code, I gather that training DCL with train batch 16 used 88 GB across four P40s. I can't figure out why it uses so much memory; am I understanding this correctly?
My environment info is below:
OS: Ubuntu 16.04
PyTorch: 0.4.1
nvidia-driver: 384.90
CUDA: 9.0

How to understand the parameters cls_2 and cls_2xmul?

In the paper, cls_2 indicates whether the image is destructed or not, which matches the code in "LoadModel.py".
However, what is the meaning of cls_2xmul in "LoadModel.py"? Its output channel count is 2*num_class, and when cls_2xmul is used, the swap label becomes "label+numcls" in "dataset_DCL.py".

Hoping for an answer, thanks.
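
Piecing together only what this issue itself reports (a reading of the code, not an authoritative spec): with cls_2 the adversarial branch is a binary destructed-or-not classifier, while with cls_2xmul it has 2 × num_class outputs and the destructed copy of an image takes the shifted label. A schematic sketch:

    import torch.nn as nn

    num_cls, feat_dim = 200, 2048

    # cls_2: a binary head -- "is this image destructed or not?"
    classifier_swap_cls2 = nn.Linear(feat_dim, 2)

    # cls_2xmul: a joint head over (class, destructed) pairs,
    # hence 2 * num_cls outputs.
    classifier_swap_cls2xmul = nn.Linear(feat_dim, 2 * num_cls)

    # Per dataset_DCL.py, the swapped image's label is shifted:
    label = 5
    label_swap = label + num_cls   # 205: "class 5, destructed"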

Problem with the st_car dataset

It seems that the train.txt for STCAR provided with the source code is incorrect: filenames in train.txt such as 013178.jpg, 013191.jpg, and 013434.jpg cannot be mapped to the source images.

How can I continue training from a previously saved model?

The following parameter settings don't seem to work; the previously saved model is not loaded.

python train.py --data product --epoch 60 --backbone senet154 \
                    --tb 96 --tnw 32 --vb 512 --vnw 32 \
                    --lr 0.01 --lr_step 12 \
                    --cls_lr_ratio 10 --start_epoch $LAST_EPOCH \
                    --detail training_descibe4checkpoint --size 512 \
                    --crop 448 --cls_2 --swap_num 7 7
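
The argument dump printed in the cudaEventSynchronize issue above shows the parser carries resume and auto_resume fields (resume=None, auto_resume=False). Assuming the command-line flags match those field names, which I have not verified against train.py's parser, resuming might look like this (the checkpoint path is a placeholder):

    python train.py --data product --epoch 60 --backbone senet154 \
                        --tb 96 --tnw 32 --vb 512 --vnw 32 \
                        --lr 0.01 --lr_step 12 \
                        --cls_lr_ratio 10 --start_epoch $LAST_EPOCH \
                        --detail training_descibe4checkpoint --size 512 \
                        --crop 448 --cls_2 --swap_num 7 7 \
                        --resume ./net_model/your_checkpoint.pth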

FileNotFoundError when running python train.py as the Readme.md proposes

After organizing the CUB dataset as the Readme.md suggests, running python train.py to train from scratch reported: FileNotFoundError: [Errno 2] File b'./dataset/CUB_200_2011/anno/ct_train.txt' does not exist. How can I fix it? Thanks a lot.

CUDA memory error

I have two 1080 Ti GPUs, each with 11 GB of memory. When I train DCL on the Kaggle product dataset, I can only set the train batch to 4 and the val batch to 16, and even then GPU utilization exceeds 90%. Could you share your GPU setup and some suggestions?
