yaodongyu / trades

TRADES (TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization)

License: MIT License

Python 100.00%

trades's Issues

trades.py

I have read your article several times and find it well written, but I have a question. Should the samples be normalized to [0, 1]? Can they be normalized to [-1, 1] instead, and why or why not? Thank you very much.
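One way to think about the question above, offered as a sketch rather than the authors' answer: keep the data and the attack in [0, 1] and move any rescaling inside the model, so that epsilon keeps its usual meaning. The `ScaledInput` wrapper and the toy linear model are hypothetical names introduced only for illustration, and epsilon = 8/255 is an assumed budget.

```python
import torch
import torch.nn as nn


class ScaledInput(nn.Module):
    """Hypothetical wrapper: accepts [0, 1] inputs, feeds [-1, 1] to the model."""

    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, x):
        return self.model(2.0 * x - 1.0)


torch.manual_seed(0)
base = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))  # toy stand-in model
wrapped = ScaledInput(base)

x = torch.rand(4, 3, 8, 8)   # pixels in [0, 1]
epsilon = 8 / 255            # assumed budget, expressed on the [0, 1] scale
out_wrapped = wrapped(x)
out_manual = base(2.0 * x - 1.0)

# The same perturbation measured on the [-1, 1] scale is twice as large.
shift_rescaled = ((2.0 * (x + epsilon) - 1.0) - (2.0 * x - 1.0)).abs().max().item()
print(shift_rescaled, 2 * epsilon)
```

If the data itself is rescaled to [-1, 1] instead, the perturbation budget and the clamping range used by the attack would presumably need to be rescaled to match.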

What is self.sub_block1 in models/wideresnet.py?

In wideresnet.py, I found that WideResNet.__init__() defines self.sub_block1, but it never appears in forward().

Random Initialization in Attack

For comparability with other evaluations in the literature, it might be useful to first add random noise to the image and then begin PGD.
https://github.com/yaodongyu/TRADES/blob/master/pgd_attack_cifar10.py#L51
In a quick-and-dirty experiment, adding attack random initialization decreased TRADES model adversarial accuracy by ~1%.

Recently, it has also become apparent that multiple random restarts in evaluation are important, so adding that functionality might be useful as well.
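A minimal sketch of the suggestion above: L-inf PGD with a uniform random start and several restarts, counting an example as broken if any restart fools the model. This is not the repository's evaluation script; `pgd_linf`, the toy model, and the default values of `epsilon`, `step_size`, and `num_steps` are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def pgd_linf(model, x, y, epsilon=0.031, step_size=0.003, num_steps=20,
             random_start=True, num_restarts=1):
    """L-inf PGD; returns a mask of examples fooled by at least one restart."""
    model.eval()
    fooled = torch.zeros(x.size(0), dtype=torch.bool, device=x.device)
    for _ in range(num_restarts):
        x_adv = x.clone().detach()
        if random_start:
            # Uniform noise inside the epsilon ball, then back into [0, 1].
            noise = torch.empty_like(x_adv).uniform_(-epsilon, epsilon)
            x_adv = torch.clamp(x_adv + noise, 0.0, 1.0)
        for _ in range(num_steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(model(x_adv), y)
            grad, = torch.autograd.grad(loss, x_adv)
            # Signed gradient ascent step, then projection onto the ball.
            x_adv = x_adv.detach() + step_size * grad.sign()
            x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
            x_adv = torch.clamp(x_adv, 0.0, 1.0)
        with torch.no_grad():
            fooled |= model(x_adv).argmax(dim=1) != y
    return fooled


torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))  # toy stand-in
x = torch.rand(16, 3, 8, 8)
y = torch.randint(0, 10, (16,))
fooled = pgd_linf(toy_model, x, y, num_restarts=3)
print("fraction fooled:", fooled.float().mean().item())
```

Taking the union over restarts gives a worst-case estimate per example, which is why robust accuracy can only go down as restarts are added.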

Can you share the hyperparameters of the evaluated attacks in the leaderboard?

Where can I get the hyperparameters (such as steps and n_restarts) of the evaluated methods in the leaderboard?

I tested a DeepFool attack on TRADES and got an accuracy of 54%, but in your leaderboard DeepFool (l_inf) only reaches 61.38%. This seems strange; maybe some of the hyperparameters are different. I would like to use some of your results in my paper if you can share more information.

CUDA out of memory when training with TRADES on resnet50

Hi @yaodongyu, I'm very interested in your ICML'19 work, and I am attempting to use it in the competition. I tried to train resnet50 with trades_loss, but I got the error "CUDA out of memory". I wonder whether trades_loss needs more CUDA memory.

I trained resnet50 on an NVIDIA 1080 Ti with the cross-entropy loss, and the batch size could be set to 128. However, when I trained with trades_loss, it raised "CUDA out of memory" even with a batch size of 16.
I'm not sure whether there is a problem with my code or whether trades_loss simply needs more CUDA memory.

Thank you!

About loss_robust implemented in trades.py

I changed loss_robust in trades.py to use a cross-entropy loss, as shown below, but training then fails to converge when beta=6.0. I suspect this is because dCE(f(x), f(x'))/dw and dKL(f(x), f(x'))/dw differ, since f(x') should also be a variable.
The paper says that TRADES should use a classification-calibrated CE so that the method does not reduce to logit pairing, so I wonder whether it is OK to use KL instead of CE for loss_robust. Have you considered any adjustments for the case where loss_robust is implemented with CE, e.g. a different beta?

import torch.nn.functional as F


def _cross_entropy(input, targets, reduction='mean'):
    # Soft-label cross entropy: both arguments are logits of shape (batch, classes).
    targets_prob = F.softmax(targets, dim=1)
    xent = (-targets_prob * F.log_softmax(input, dim=1)).sum(1)
    if reduction == 'sum':
        return xent.sum()
    elif reduction == 'mean':
        return xent.mean()
    elif reduction == 'none':
        return xent
    else:
        raise NotImplementedError()

loss_robust = _cross_entropy(model(x_adv), model(x_natural), reduction='mean')
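To make the CE-versus-KL comparison concrete, here is a small numerical check, a sketch on random logits rather than anything taken from trades.py. It uses the identity CE(p_nat, p_adv) = KL(p_nat || p_adv) + H(p_nat): the two losses have the same gradient with respect to the adversarial logits, but differ with respect to the natural logits because of the entropy term. The tensor names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits_nat = torch.randn(4, 10, requires_grad=True)
logits_adv = torch.randn(4, 10, requires_grad=True)

p_nat = F.softmax(logits_nat, dim=1)
log_p_nat = F.log_softmax(logits_nat, dim=1)
log_p_adv = F.log_softmax(logits_adv, dim=1)

# Soft-label cross entropy, as in the snippet above (summed over the batch).
ce = (-p_nat * log_p_adv).sum()
# KL(p_nat || p_adv), summed over the batch.
kl = F.kl_div(log_p_adv, p_nat, reduction='sum')
# Entropy of the natural prediction.
entropy = (-p_nat * log_p_nat).sum()

# Identity: CE = KL + H, so the gap should be numerically zero.
gap = (ce - (kl + entropy)).abs().item()

# Gradients w.r.t. the adversarial logits agree; those w.r.t. the natural logits do not.
g_ce_adv, g_ce_nat = torch.autograd.grad(ce, [logits_adv, logits_nat], retain_graph=True)
g_kl_adv, g_kl_nat = torch.autograd.grad(kl, [logits_adv, logits_nat], retain_graph=True)
print(gap, (g_ce_nat - g_kl_nat).abs().max().item())
```

Whether this extra entropy gradient explains the failure to converge at beta=6.0 is not established here; it only shows where the two objectives differ.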

Training loss = nan

Thank you for providing the code; it is excellent. However, I ran into a problem: when I run train_trades_cifar10.py with a ResNet, the training loss is always nan.

The only thing I modified is epsilon: I tried epsilon = 2/255, 4/255, 8/255, and 16/255. Do you know the reason? Thank you again.

Robust Accuracy on CIFAR10

Hi, I ran train_trades_cifar10.py directly with '--beta 6.0' twice, but I could not reach the adversarial accuracy shown in your paper. My final result is only about 49%. I wonder whether some other detail, such as the training set partition, is needed to reach that performance. Thank you!

The shuffle of the test loader should be False

Hi,
This is probably a typo: the test loader should use shuffle=False.
shuffle=True → shuffle=False

https://github.com/yaodongyu/TRADES/blob/master/train_trades_mnist.py#L64

Code running slow

Hi,

Thank you for providing the code and cheers to the great work!

I am training the model on CIFAR-10 using an NVIDIA Titan RTX (24 GB) GPU. Unfortunately, the code is prohibitively slow: each iteration takes about 4 seconds. Does it run at the same speed on your machine? The WRN model is several times larger than an ordinary CIFAR-10 classifier. I know that a model for adversarial training should be large, but is it necessary to use such a huge model?

Regards,
Ali

Unused Wide-ResNet Block sub_block1

Hi,

I just tried out the pre-trained models and came across an unused ResNet block in your Wide ResNet: sub_block1. Unfortunately, the pre-trained models include parameters for this block, making it impossible to load them with a Wide ResNet implementation that does not have sub_block1.

As a quick fix, I loaded the models using your Wide ResNet implementation, set sub_block1 to a simple dummy identity layer (or any other layer without parameters), and saved them again. Afterwards, they can be loaded using an implementation without sub_block1.

I thought this might be interesting for others, or worth fixing (and re-uploading the models), as the unused block also incurs an unnecessary memory overhead.
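An alternative to the re-saving workaround above is to filter the checkpoint keys before loading. The following is a sketch on toy modules, not the real WideResNet: `drop_prefixed_keys`, `WithUnusedBlock`, and `WithoutUnusedBlock` are hypothetical names, and the assumption is that the unused parameters are stored under keys starting with "sub_block1." (checkpoints saved from a DataParallel model may carry an extra "module." prefix).

```python
import torch
import torch.nn as nn


def drop_prefixed_keys(state_dict, prefix="sub_block1."):
    """Return a copy of state_dict without entries whose key starts with prefix."""
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}


# Toy stand-ins (hypothetical) for the two Wide ResNet implementations.
class WithUnusedBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Linear(4, 4)
        self.sub_block1 = nn.Linear(4, 4)  # defined but never used in forward()

    def forward(self, x):
        return self.block1(x)


class WithoutUnusedBlock(nn.Module):
    def __init__(self):
        super().__init__()
        self.block1 = nn.Linear(4, 4)

    def forward(self, x):
        return self.block1(x)


source = WithUnusedBlock()
target = WithoutUnusedBlock()
filtered = drop_prefixed_keys(source.state_dict())
target.load_state_dict(filtered)  # strict loading succeeds once the extra keys are gone
print(sorted(filtered.keys()))
```

Passing `strict=False` to `load_state_dict` would also skip the extra keys, but it silently ignores any other mismatch as well, so explicit filtering is the more conservative choice.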

Does f(x') replace the label or the prediction in the regularization term?

Thank you for sharing an implementation of TRADES; it really helps in understanding your paper. However, one thing was unclear to me when comparing the paper and the code. According to the paper (and also the GitHub README), in the regularization term the adversarial prediction f(X') plays the role of the label (i.e. the second argument to $\mathcal{L}$), while f(X) remains in the same place as in the natural loss. In contrast, in the regularization term implemented in trades.py, model(x_natural) plays the role of the label (second argument to criterion_kl), and model(x_adv) forms the prediction.

Which version is the correct one (i.e. the one used to train the publicly available CIFAR-10 model)?

Is there a version implemented with TensorFlow?

The role of optimizer in trades_loss

Hi,
In trades_loss you take an 'optimizer' argument, and in line 77 you call 'optimizer.zero_grad()'.
Is this needed? In which part of the trades_loss computation are model gradients accumulated, so that they have to be zeroed?
Thanks a lot.
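One PyTorch behaviour that is relevant to the question above, shown as a sketch on a toy linear model and not as the authors' explanation: `torch.autograd.grad` returns gradients without touching the parameters' `.grad` fields, whereas `loss.backward()` accumulates into `.grad` for every leaf that requires grad, including model parameters. If an inner attack loop calls `backward()`, zeroing the gradients before the training loss would therefore be needed; whether that is the reason for line 77 is an assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(5, 3)  # toy stand-in model
x = torch.randn(2, 5, requires_grad=True)

# torch.autograd.grad returns the gradient w.r.t. x and leaves .grad untouched.
loss = model(x).sum()
grad_x, = torch.autograd.grad(loss, [x])
grads_after_autograd_grad = [p.grad for p in model.parameters()]

# backward() accumulates into .grad of every leaf that requires grad,
# including the model parameters.
loss = model(x).sum()
loss.backward()
grads_after_backward = [p.grad for p in model.parameters()]

print([g is None for g in grads_after_autograd_grad])
print([g is None for g in grads_after_backward])
```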

Bug report in the loss_trade (l_2 norm)

There is a bug somewhere in loss_trade with the l_2 norm (the l_inf norm is fine). Memory consumption grows with the number of iterations (batches) until the run finally goes out of memory.

How about applying TRADES to a small model?

Thank you for your contribution!
I applied TRADES to a small network with only 2M parameters and trained it on my own dataset, but the results are really bad in both standard accuracy and robust accuracy. Is this normal? If not, what can I modify?

Bug in learning rate adjustment

There seems to be a bug in the adjust_learning_rate function in train_trades_cifar10.py; it only decreases the learning rate once at epoch 75 (the code in the elif clauses is never reached).
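The unreachable-branch problem described above comes from testing the smallest threshold first in an if/elif chain. A minimal sketch of one fix is to test the largest threshold first. `scheduled_lr` is a hypothetical helper; only the epoch-75 step is stated in this issue, so the later thresholds (90 and 100) and the decay factors are assumptions for illustration.

```python
def scheduled_lr(base_lr, epoch):
    """Step schedule: x0.1 from epoch 75, x0.01 from 90, x0.001 from 100 (assumed values)."""
    # Check the largest threshold first so that every branch is reachable.
    if epoch >= 100:
        return base_lr * 0.001
    elif epoch >= 90:
        return base_lr * 0.01
    elif epoch >= 75:
        return base_lr * 0.1
    return base_lr


for epoch in (0, 75, 90, 100):
    print(epoch, scheduled_lr(0.1, epoch))
```

The returned value can then be written into each entry of `optimizer.param_groups` under the `'lr'` key, as the existing function presumably does.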

Error when training with DDP

Hi,

Thanks for your great work! I am trying to run your code with DistributedDataParallel (DDP) in PyTorch, but I ran into an error when using the trades_loss function. Here is the error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 4; expected version 3 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

After setting torch.autograd.set_detect_anomaly(True), I've got this:

[W python_anomaly_mode.cpp:104] Warning: Error detected in CudnnBatchNormBackward. Traceback of forward call that caused the error:

I think BatchNorm is what causes this error. Do you plan to modify the code so that it works with DDP? Training with DP or a single GPU is too slow 😹

A reminder for the submitted results of CAA

Dear Hongyang Zhang and Yaodong Yu,

This is a reminder that I have submitted our results (CAA) to the TRADES white-box leaderboard. I have sent an e-mail to your address. This issue can be closed once you have checked your mailbox and no other problems have come up.
Sorry to interrupt you, and I would be grateful for your help. Thanks.

Xiaofeng Mao

The hyperparameters of DeepFool and C&W

I want to test the method against the DeepFool and C&W attacks, but I don't know how to set the hyperparameters. Could you please tell me which values to use?

pgd_attack_cifar10.py Output

Hi, forgive my ignorance. I'm trying to run pgd_attack_cifar10.py with your WideResNet, and I'm getting the following:

I have a hard time understanding the results. What does each "err pgd" mean?
What do the last two lines mean? I don't mean the names, but the values "1508" and "4281".
Maybe I'm just used to top-1 and top-5...
Thanks!

The loss in trades.py

According to eq. (5) in the paper, the loss L should be the same in both terms (the paper uses the cross-entropy loss). In trades.py, however, two kinds of loss are used. For the max, torch.nn.KLDivLoss is adopted, while for the min, cross entropy is used for f(x), y and torch.nn.KLDivLoss is used for f(x), f(x'). Why are two different losses used here? Is it OK to use the cross-entropy loss in both places, and how does that affect performance?

Submitting my results to the white-box MNIST and CIFAR-10 leaderboards

Dear Yaodong Yu and Hongyang Zhang,
I am very happy to have read such a good paper, and thank you very much for providing the white-box MNIST and CIFAR-10 leaderboards. I recently (2020.8.15) submitted the results of my adversarial attack to you. If you have time, could you check my results and update the MNIST and CIFAR-10 leaderboards?
Thank you very much!
My name is Ye Liu.

Nasty floating-point rounding errors

We were trying to evaluate our attack with the CIFAR-10 model. This is our script to convert saved images to a .npy file: https://github.com/admk/TRADES/blob/master/convert.py

We are using the same xadv = torch.clamp(xadv - x, -epsilon, epsilon) + x as in https://github.com/yaodongyu/TRADES/blob/master/pgd_attack_cifar10.py#L76
to guarantee the boundaries, but it didn't work for us because of floating-point rounding errors:

Do you know how we can reliably torch.clamp the ranges for your checks?

Update: with PyTorch==1.7.0, the CPU and GPU gave rounding errors of different magnitudes.
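One possible mitigation for the rounding problem above, given as a sketch and not as the repository's check: clamp the perturbation to a budget slightly smaller than epsilon, so that float32 rounding in `x + delta` cannot push the result past the boundary. `project_linf` and the `margin` value of 1e-6 are assumptions; the margin only needs to exceed the float32 rounding error for values in [0, 1], and it does not cover errors introduced later, for example when images are saved and reloaded.

```python
import torch


def project_linf(x, x_adv, epsilon, margin=1e-6):
    """Project x_adv into the L-inf ball around x, leaving a small safety margin."""
    delta = torch.clamp(x_adv - x, -(epsilon - margin), epsilon - margin)
    return torch.clamp(x + delta, 0.0, 1.0)


torch.manual_seed(0)
epsilon = 0.031
x = torch.rand(256, 3, 32, 32)
x_adv = x + 0.1 * torch.randn_like(x)  # deliberately far outside the ball
x_proj = project_linf(x, x_adv, epsilon)

# Measure the perturbation in float64 so the check itself adds no rounding.
max_diff = (x_proj.double() - x.double()).abs().max().item()
print(max_diff <= epsilon)
```

The cost is a perturbation budget that is smaller than epsilon by the margin, which is negligible next to 0.031 but worth stating when reporting results.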

Performance difference between repo's models and torchvision models

Hey, thanks for this repo. I came across the following and wanted to check whether it is an issue. There seems to be a significant difference in run time between the models implemented in your repo and the official torchvision models.

Running train_trades_cifar10.py and specifying the model as model = ResNet18() from the TRADES repo, the time per batch is ~ 0.8 seconds.

Running train_trades_cifar10.py and specifying the model as model = torchvision.models.resnet18(), the time per batch is ~ 0.28 seconds.

I checked to make sure that it wasn't due to model size, and the models have 11181642 trainable params each.

Any advice on why the behavior is this way would be greatly appreciated.

Model outputs significantly different depending on batch size and contents

Thank you very much for releasing the model and associated code along with your paper. I'm very grateful that you've put the effort into making it as easy as possible to get everything up and running, and I sincerely hope others involved in the contest follow your lead.

I'm taking a first pass at everything and am getting somewhat confusing results. First, it looks like the model gives very different results at small batch sizes:

from models.small_cnn import SmallCNN
import numpy as np
import torch
from models.wideresnet import WideResNet

from torch.autograd import Variable
import torch.optim as optim
import torch.nn as nn

device = torch.device("cuda")
model = WideResNet().to(device)
model.load_state_dict(torch.load('./checkpoints/model_cifar_wrn.pt'))

X_data = np.load("data_attack/cifar10_X.npy")
Y_data = np.load("data_attack/cifar10_Y.npy")
X_data = np.transpose(X_data, (0, 3, 1, 2))

for bs in [1, 2, 4, 5, 10, 50, 100]:
    predictions = []
    for i in range(0,100,bs):
        logits = model(torch.from_numpy(np.array(X_data[i:i+bs], dtype=np.float32)).to(device)).cpu().detach().numpy()
        predictions.extend(np.argmax(logits,axis=1))
    print("mean accuracy with batch size %d: %f"%(bs,np.mean(predictions == Y_data[:100])))

will output

mean accuracy with batch size 1: 0.140000
mean accuracy with batch size 2: 0.640000
mean accuracy with batch size 4: 0.760000
mean accuracy with batch size 5: 0.790000
mean accuracy with batch size 10: 0.840000
mean accuracy with batch size 50: 0.840000
mean accuracy with batch size 100: 0.830000

It looks like there is also some dependence on the data of the batch to classify each input. I have some batch of 10000 examples I want to process as a [100, 100, 3, 32, 32] matrix, and if I process them in row-major order I get a different accuracy than column-major. I suspect this might have the same underlying cause, so I'll give details for that later if necessary.

As you might imagine, this makes it difficult to evaluate the defense: evaluating the network with a batch of [99 clean examples] + [1 adversarial example] gives a different result than [50 clean examples] + [50 adversarial examples].

Is this intended, am I doing something wrong, or something else?
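One likely contributor to the behaviour above, stated as an assumption that this thread does not confirm: the script never calls `model.eval()`, so BatchNorm layers normalize with the statistics of the current batch and each prediction depends on the rest of the batch. The sketch below reproduces the effect on a toy network (a hypothetical stand-in, not the released WideResNet).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
x = torch.rand(8, 3, 32, 32)


def first_example_logits(model, batch):
    """Logits of the first example in the batch, computed without autograd."""
    with torch.no_grad():
        return model(batch)[0]


net.train()  # default mode after construction: BatchNorm uses batch statistics
train_small = first_example_logits(net, x[:2])
train_large = first_example_logits(net, x)

net.eval()   # BatchNorm uses running statistics: outputs are per-example
eval_small = first_example_logits(net, x[:2])
eval_large = first_example_logits(net, x)

print((train_small - train_large).abs().max().item())
print((eval_small - eval_large).abs().max().item())
```

In the script above, that would correspond to calling `model.eval()` once after `load_state_dict` and before the evaluation loop.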

Why does the first term in loss_robust equal F.log_softmax(model(x_adv))?

Why does the first term in loss_robust equal F.log_softmax(model(x_adv))? I have tried generating adversarial samples with normal PGD and setting the loss to the TRADES loss, but I cannot understand why it is written this way; I would have expected criterion_kl(F.softmax(model(x_adv)), F.softmax(model(x_natural))). Can anyone answer my question? I would appreciate it.
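A short check of the PyTorch convention behind this question: `nn.KLDivLoss` expects log-probabilities as its first argument and probabilities as its second, and computes the sum of target * (log(target) - input). The sketch below uses random logits in place of model outputs, so the tensor names are assumptions; it compares the documented usage with a manual KL computation and with the softmax-softmax call proposed above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits_adv = torch.randn(4, 10)  # stands in for model(x_adv)
logits_nat = torch.randn(4, 10)  # stands in for model(x_natural)

criterion_kl = nn.KLDivLoss(reduction='sum')
log_p_adv = F.log_softmax(logits_adv, dim=1)
p_adv = F.softmax(logits_adv, dim=1)
p_nat = F.softmax(logits_nat, dim=1)

# Documented usage: log-probabilities as input, probabilities as target.
kl_library = criterion_kl(log_p_adv, p_nat)
# Manual KL(p_nat || p_adv) = sum p_nat * (log p_nat - log p_adv).
kl_manual = (p_nat * (p_nat.log() - log_p_adv)).sum()
# Passing probabilities as input yields a value that is not a KL divergence.
not_kl = criterion_kl(p_adv, p_nat)

print(kl_library.item(), kl_manual.item(), not_kl.item())
```

With probabilities as the input, the result equals the negative entropy of p_nat minus an inner product of the two distributions, so it is negative, which a true KL divergence never is.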