The code in this repository compares the performance of three differentially private mechanisms: the Laplace mechanism, the exponential mechanism, and the method proposed by Johnson & Shmatikov (2013). The data used for demonstration are simulated genome-wide association study (GWAS) data generated by HAP-SAMPLE.
The code was used to produce the results in the following paper:
"Scalable privacy-preserving data sharing methodology for genome-wide association studies." Yu, F, S. Fienberg, S. Slavković, C. Uhler (2014). Journal of Biomedical Informatics (forthcoming).
The paper is available here and here.
The example case/control genotype data (example/case_genotypes.dat and example/anticase_genotypes.dat) were generated by HAP-SAMPLE with the following options:
- Population: CEU
- Source for SNPs: Chrom9_Chrom13_snps.txt
- Disease Model File: AR_chrom9_chrom13_Nov09_additive1_MAF025.txt
- Simulation Type: Case/Control
- Number of Cases: 1000
- Number of Controls: 1000
- Average breaks per cM: 1
- Output Format: SNPs v. Individuals
The disease model file describes a disease with two causative SNPs that have additive effects. For more details about the disease model, see Malaspinas & Uhler (2010).
If you have IPython Notebook installed, you can run the comparison interactively instead of using the command-line steps. In the terminal, run:
cd notebooks
../start_ipynb.sh
and open the compare_all_dp_mechanisms notebook in the browser. By default, the IPython notebook is served at http://localhost:8888/
First, untar the example data:
tar xvfz example.tar.gz
Then convert raw data to genotype tables:
python ./notebooks/raw_to_geno_table.py ./example/case_genotypes.dat ./example/anticase_genotypes.dat ./table.tmp
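For orientation, the chi-square files used by the Laplace and exponential mechanisms hold one score per SNP. The sketch below shows how a Pearson chi-square statistic could be computed from a single SNP's genotype table, assuming the table is a 2x3 layout (cases and controls by three genotype counts). The function name `chisquare_2x3` and the table layout are assumptions for illustration; the format written by raw_to_geno_table.py and the exact statistic computed by write_chisquare.py may differ.

```python
def chisquare_2x3(case_counts, control_counts):
    """Pearson chi-square statistic for a 2x3 case/control genotype table.

    Hypothetical helper: each argument is a sequence of three genotype
    counts (for example, 0, 1, or 2 copies of the minor allele).
    """
    rows = [list(case_counts), list(control_counts)]
    row_totals = [sum(r) for r in rows]
    col_totals = [rows[0][j] + rows[1][j] for j in range(3)]
    total = float(sum(row_totals))

    stat = 0.0
    for i in range(2):
        for j in range(3):
            expected = row_totals[i] * col_totals[j] / total
            # Skip empty genotype columns to avoid dividing by zero.
            if expected > 0:
                stat += (rows[i][j] - expected) ** 2 / expected
    return stat
```

Identical case and control counts give a statistic of 0, and the statistic grows as the genotype distributions of the two groups diverge.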
To get results from the Laplace mechanism, run:
python ./notebooks/write_chisquare.py ./table.tmp ./chisquare.tmp
python ./notebooks/get_laplace_results.py k e n_case n_control ./chisquare.tmp
where "k" is the number of top SNPs to release, "e" is the privacy budget (commonly known as epsilon in epsilon-differential privacy), and "n_case" and "n_control" are the numbers of cases and controls.
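To make the roles of "k" and "e" concrete, here is a minimal sketch of a Laplace-based top-k selection. It is not taken from get_laplace_results.py. The function name `laplace_top_k`, the `sensitivity` argument, and the noise scale 2 * k * sensitivity / epsilon are assumptions for illustration; the script presumably derives the sensitivity of the chi-square statistic from n_case and n_control and may calibrate the noise differently.

```python
import numpy as np

def laplace_top_k(scores, k, epsilon, sensitivity, rng=None):
    """Return the indices of the k largest noise-perturbed scores.

    Hypothetical sketch: `sensitivity` is the assumed sensitivity of the
    score function, and the noise scale below is one common calibration
    when only the selected indices are released.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)

    # A larger k or a smaller epsilon means more noise per score.
    scale = 2.0 * k * sensitivity / epsilon
    noisy = scores + rng.laplace(loc=0.0, scale=scale, size=scores.shape)

    # Indices of the k largest noisy scores, best first.
    return np.argsort(-noisy)[:k]
```

With a small privacy budget the selected SNPs are close to random; as the budget grows, the selection approaches the true top-k ranking.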
Similarly, to get results from the exponential mechanism, run:
python ./notebooks/write_chisquare.py ./table.tmp ./chisquare.tmp
python ./notebooks/get_expo_results.py k e n_case n_control ./chisquare.tmp
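For comparison, the exponential mechanism samples SNPs with probability that increases exponentially with their score instead of adding noise to the scores. The sketch below draws k SNPs without replacement and splits the budget evenly across the draws. It is not taken from get_expo_results.py; the function name `exponential_top_k`, the `sensitivity` argument, and the even budget split are assumptions for illustration.

```python
import numpy as np

def exponential_top_k(scores, k, epsilon, sensitivity, rng=None):
    """Draw k distinct indices, one exponential-mechanism draw at a time.

    Hypothetical sketch: each draw uses budget epsilon / k and picks
    index i with probability proportional to
    exp(eps_per_draw * scores[i] / (2 * sensitivity)).
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    remaining = list(range(len(scores)))
    chosen = []
    eps_per_draw = epsilon / k

    for _ in range(k):
        logits = eps_per_draw * scores[remaining] / (2.0 * sensitivity)
        logits -= logits.max()  # subtract the max for numerical stability
        probs = np.exp(logits)
        probs /= probs.sum()
        pick = int(rng.choice(len(remaining), p=probs))
        # Remove the chosen index so it cannot be drawn again.
        chosen.append(remaining.pop(pick))
    return chosen
```

Here k must not exceed the number of scores. As with the Laplace sketch, a very large budget recovers the true ranking, while a small budget makes every SNP nearly equally likely.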
To get results from the Johnson & Shmatikov method, run:
python ./notebooks/write_JS_distance.py -p pval ./table.tmp ./js_distance.tmp
python ./notebooks/get_JS_results.py k e ./js_distance.tmp
where "pval" is the p-value specified in Johnson & Shmatikov (2013), which can be interpreted as the overall p-value (say 0.05) of the multiple testing problem involving thousands of SNPs.