bargavj / evaluatingdpml Goto Github PK

View Code? Open in Web Editor NEW

128.0 6.0 49.0 42.84 MB

This project's goal is to evaluate the privacy leakage of differentially private machine learning models.

License: MIT License

Python 75.87% Shell 3.80% Jupyter Notebook 20.34%

evaluatingdpml's Introduction

Analyzing the Leaky Cauldron

The goal of this project is to evaluate the privacy leakage of differential private machine learning algorithms.

The code has been adapted from the code base of membership inference attack work by Shokri et al.

Below we describe the setup and installation instructions. To run the experiments for the following projects, refer to their respective README files (hyperlinked):

Evaluating Differentially Private Machine Learning in Practice (evaluating_dpml\)
Revisiting Membership Inference Under Realistic Assumptions (improved_mi\)
Are Attribute Inference Attacks Just Imputation? (improved_ai\)

Software Requirements

Python 3.8
Tensorflow : To use Tensorflow with GPU, cuda-toolkit-11 and cudnn-8 are also required.
Tensorflow Privacy

Installation Instructions

Assuming the system has Ubuntu 18.04 OS. The easiest way to get Python 3.8 is to install Anaconda 3 followed by installing the dependencies via pip. The following bash code installs the dependencies (including scikit_learn, tensorflow>=2.4.0 and tf-privacy) in a virtual environment:

$ python3 -m venv env
$ source env/bin/activate
$ python3 -m pip install --upgrade pip
$ python3 -m pip install --no-cache-dir -r requirements.txt

Furthermore, to use cuda-compatible nvidia gpus, the following script should be executed (copied from Tensorflow website) to install cuda-toolkit-11 and cudnn-8 as required by tensorflow-gpu:

# Add NVIDIA package repositories
$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
$ sudo mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
$ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
$ sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
$ sudo apt-get update

$ wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb

$ sudo apt install ./nvidia-machine-learning-repo-ubuntu1804_1.0.0-1_amd64.deb
$ sudo apt-get update

$ wget https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64/libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
$ sudo apt install ./libnvinfer7_7.1.3-1+cuda11.0_amd64.deb
$ sudo apt-get update

# Install development and runtime libraries (~4GB)
$ sudo apt-get install --no-install-recommends \
    cuda-11-0 \
    libcudnn8=8.0.4.30-1+cuda11.0  \
    libcudnn8-dev=8.0.4.30-1+cuda11.0

# Reboot. Check that GPUs are visible using the command: nvidia-smi

# Install TensorRT. Requires that libcudnn8 is installed above.
$ sudo apt-get install -y --no-install-recommends libnvinfer7=7.1.3-1+cuda11.0 \
    libnvinfer-dev=7.1.3-1+cuda11.0 \
    libnvinfer-plugin7=7.1.3-1+cuda11.0

Obtaining the Data Sets

Data sets can be obtained using the preprocess_dataset.py script provided in the extra/ folder. The script requires raw files for the respective data sets which can be found online using the following links:

Purchase-100X: The source file transactions.csv can be downloaded from https://www.kaggle.com/c/acquire-valued-shoppers-challenge/data and should be saved in the dataset/ folder.
Census19: The source files can be downloaded from https://www2.census.gov/programs-surveys/acs/data/pums/2019/1-Year/ and should be saved in the dataset/census/ folder. Alternatively, the source files can be obtained by running the crawl_census_data.py script in the extra/ folder:

$ python3 crawl_census_data.py

Texas-100X: PUDF_base1q2006_tab.txt, PUDF_base2q2006_tab.txt, PUDF_base3q2006_tab.txt and PUDF_base4q2006_tab.txt files can be downloaded from https://www.dshs.texas.gov/THCIC/Hospitals/Download.shtm and should be saved in the dataset/texas_100_v2/ folder.

Once the source files for the respective data set are obtained, preprocess_dataset.py script would be able to generate the processed data set files, which are in the form of two pickle files: $DATASET_feature.p and $DATASET_labels.p (where $DATASET is a placeholder for the data set file name). For Purchase-100X, $DATASET = purchase_100. For Texas-100X, $DATASET = texas_100_v2. For Census19, $DATASET = census.

$ python3 preprocess_dataset.py $DATASET --preprocess=1

Alternatively, Census19 data set (as is used in the attribute inference paper) can also be found in the dataset/ folder in zip format.

For pre-processing other data sets, bound the L2 norm of each record to 1 and pickle the features and labels separately into $DATASET_feature.p and $DATASET_labels.p files in the dataset/ folder (where $DATASET is a placeholder for the data set file name, e.g. for Purchase-100 data set, $DATASET will be purchase_100).

evaluatingdpml's People

Contributors

Stargazers

Watchers

Forkers

jonahweissman eryous ashiakerwang jding0 phillipwangaust lishaofeng niklausliu byzhang matchading jhjiangcs vickyqi7 rpplayground ianchen88 milkigit tzq2doc evansuva vu1seek katiekn wanlixue hxsylzpf goodej elaf1200 ijarin allenfeizz zedoul sssupertian kypomon fanqihang sunqiheng chamikara1986 peixj zihangxiang llbbcc mathislu petrusbellmonte leesocool avudzor kcadzow 12-plus-1 ssxxxxxxxxxx hhh476-tian harishgovardhandamodar khangtran2020 ehsanul9511 donaldzc subangkar jackyz137 qq-jiang

evaluatingdpml's Issues

how to calculate sigma given epsilon value

Hi Dr Bargavj,

I looked at your constants.py file and find that the noise multipliers of rdp and gdp are pre-calculated given the specific epsilon.

"rdp_noise_multiplier = {
30: {0.01: 290, 0.05: 70, 0.1: 36, 0.5: 7.6, 1: 3.9, 5: 1.1, 10: 0.79, 50: 0.445, 100: 0.356, 500: 0.206, 1000: 0.157},
100: {0.01: 525, 0.05: 150, 0.1: 70, 0.5: 13.8, 1: 7, 5: 1.669, 10: 1.056, 50: 0.551, 100: 0.445, 500: 0.275, 1000: 0.219}
}"
"gdp_noise_multiplier = {
30: {0.01: 190, 0.05: 45, 0.1: 24, 0.5: 5.5, 1: 3, 5: 0.94, 10: 0.701, 50: 0.481, 100: 0.438},
100: {0.01: 350, 0.05: 82, 0.1: 44, 0.5: 10, 1: 5.4, 5: 1.43, 10: 0.955, 50: 0.564, 100: 0.498}
}"

Just wondering how did you get sigma given an epsilon value?

Thanks a lot.

I can not run the main.py file in improved_ai

FileNotFoundError: [Errno 2] No such file or directory: 'data/census/target_data.npz'

I have census_feature_desc.p ,census_feature.p ,census_feature.csv ,census_labels.p,census_labels.csv
but it need target_data.npz,holdout_data.npz,skewed_data.npz,skewed_2_data.npz

input:
python main.py census
--use_cpu=1
--skew_attribute=0
--skip_corr=1
--skew_outcome=3
--sensitive_outcome=3
--target_test_train_ratio=0.5
--target_data_size=50000
--candidate_size=10000
--target_model='nn'
--target_epochs=50
--target_l2_ratio=1e-6
--target_learning_rate=0.001
--target_batch_size=500
--target_clipping_threshold=4
--attribute=1
--run=1

output:
{'train_dataset': 'census', 'run': 1, 'use_cpu': 1, 'save_model': 0, 'save_data': 0, 'attribute': 1, 'candidate_size': 10000, 'skew_attribute': 0, 'skew_outcome': 3, 'sensitive_outcome': 3, 'banished_records': 0, 'skip_corr': 1, 'n_shadow': 5, 'target_data_size': 50000, 'target_test_train_ratio': 0.5, 'target_model': 'nn', 'target_learning_rate': 0.001, 'target_batch_size': 500, 'target_n_hidden': 256, 'target_epochs': 50, 'target_l2_ratio': 1e-06, 'target_clipping_threshold': 4.0, 'target_privacy': 'no_privacy', 'target_dp': 'gdp', 'target_epsilon': 0.5, 'target_delta': 1e-05, 'attack_model': 'nn', 'attack_learning_rate': 0.01, 'attack_batch_size': 100, 'attack_n_hidden': 64, 'attack_epochs': 100, 'attack_l2_ratio': 1e-06}
Traceback (most recent call last):
File "/Users/zhouyukai/PycharmProjects/EvaluatingDPML-master/improved_ai/main.py", line 332, in
run_experiment(args)
File "/Users/zhouyukai/PycharmProjects/EvaluatingDPML-master/improved_ai/main.py", line 135, in run_experiment
train_x, train_y, test_x, test_y = load_data('target_data.npz', args)
File "/Users/zhouyukai/PycharmProjects/EvaluatingDPML-master/core/data_util.py", line 204, in load_data
with np.load(DATA_PATH + data_name) as f:
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/numpy/lib/npyio.py", line 390, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'data/census/target_data.npz'

FileNotFoundError: [Errno 2] No such file or directory: './data/target_data.npz'

when I run python attack.py cifar_100 --target_model=nn --target_l2_ratio=1e-4,

I met the error

Traceback (most recent call last):
  File "attack.py", line 595, in 
    run_experiment(args)
  File "attack.py", line 481, in run_experiment
    dataset = load_data('target_data.npz', args)
  File "attack.py", line 227, in load_data
    with np.load(DATA_PATH + data_name) as f:
  File "/home/billyluo/anaconda3/envs/tf14/lib/python3.7/site-packages/numpy/lib/npyio.py", line 428, in load
    fid = open(os_fspath(file), "rb")
FileNotFoundError: [Errno 2] No such file or directory: './data/target_data.npz'

How can I solve it?

What does it mean when the privacy leakage metric is negative?

The paper indicates that privacy leakage metric is always between 0 and 1, where the value of 0 indicates that there is no leakage.
When I run the code many times

$python3 main.py cifar_100 --target_model='nn' --target_l2_ratio=1e-4 --target_privacy='grad_pert' --target_dp='rdp' --target_epsilon=0.01 --target_epochs=100 --attack_epoch=100

I got the Advantage = -0.0009、0.0005、-0.0031、-0.0006、0.002
What does it mean when the privacy leakage metric is negative?
The smaller the privacy leakage metric, the more advantageous to the attack model?

How to get non-private graphs

I have run this code and getting all the graphs, such as accuracy loss and privacy leakage at various privacy budgets. I was wondering how can I generate the privacy leakage from the non-private model and plot it along with the perturbed results? I'd greatly appreciate if could help. Thanks.

purchase100

Could you also share the parameters (learning rate, regularization parameter, etc) for purchase100?

And are you using the dataset from here ?

Yeom Colluding adversary

have I understood it correctly that you only implement Adversaries 1 and 2 of Yeom et al. (in yeom_membership_inference)? If so, was there a technical reason the colluding adversary (adversary 3) was not included in your analysis?

Missing method 'plot_layer_outputs' in core.utilities

/core/attack.py imports 'plot_layer_outputs' from /core/utilities.py:

from core.utilities import plot_layer_outputs

This method does not exist and to the best of my knowlegde has never existed (in this repo).
Consequently running any command importing or using /core/attack.py raises an ImportError.

Is this method not supposed to be there? Where can I find it?

Enquire about batch clipping and per-instance clipping

Hi Dr.Bargavj,

I am reading your paper "Evaluating Differentially Private Machine Learning in Practice" and have some questions as follows.

I noticed that you did experiments comparing effects on batch clipping and per-instance clipping, however, I did not find how to set parameters in your code to do this experiment. Does just setting "--target_batch_size=1" in evaluating_dpml.py file means per-instance clipping?

Many thanks if you can kindly reply to it.

Missing args in evaluating_dpml.py

Hi Dr Barga,

I read your paper 'Evaluating Differentially Private Machine Learning in Practice' and it is very interesting work.

when I run your code I find that in evaluating_dpml.py, it seems in line 22,
pred_y, membership, test_classes, classifier, aux = train_target_model(
dataset=dataset,
epochs=args.target_epochs,
batch_size=args.target_batch_size,
learning_rate=args.target_learning_rate,
clipping_threshold=args.target_clipping_threshold,
n_hidden=args.target_n_hidden,
l2_ratio=args.target_l2_ratio,
model=args.target_model,
privacy=args.target_privacy,
dp=args.target_dp,
epsilon=args.target_epsilon,
delta=args.target_delta,
save=args.save_model
),
should add 'args' to target_train_model()