This is a research prototype built as part of an internship project. See the Mozilla Governance thread and bug 1386569 for context.
This is a Python reimplementation of the analysis part of RAPPOR (paper | original repository | updated repository). Concretely, there are two different notebooks available:
- RAPPOR-Production: Can be used to perform an entire RAPPOR analysis using all the components that we deemed useful. Additionally, test datasets can be dynamically generated using Spark.
- RAPPOR-Prototyping: Contains a lot of explanations for the individual steps in RAPPOR and a lot of experimenting with different ideas. This notebook works with datasets generated from the original repository.
Both notebooks need to be run from within this repository as they require the files located in the clients/ folder. The production notebook is expected to run on the data generated by this SHIELD study.
Generally, only the usual SciPy tech stack is required:
$ pip install scipy numpy matplotlib pandas scikit-learn
To be able to use the same hash functions as the client part, some files from the original repository were already copied into the clients/ folder.
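The shared hashing matters because the analysis must map candidate strings to the same Bloom filter bits that the clients set. As a rough sketch of the hashing scheme the original RAPPOR client uses (the function name and details are paraphrased here for illustration; treat the files in clients/ as authoritative):

```python
import hashlib
import struct

def get_bloom_bits(word, cohort, num_hashes, num_bloombits):
    """Sketch of the RAPPOR client's Bloom filter hashing.

    The cohort number (big-endian) is prepended to the value, the
    result is hashed with MD5, and each of the first num_hashes digest
    bytes selects one bit position in the Bloom filter.
    """
    data = struct.pack(">L", cohort) + word.encode("utf-8")
    digest = hashlib.md5(data).digest()
    return [digest[i] % num_bloombits for i in range(num_hashes)]

# Example: 2 hash functions over a 16-bit Bloom filter.
bits = get_bloom_bits("example.com", cohort=0, num_hashes=2, num_bloombits=16)
```

Because the cohort is part of the hash input, the same value maps to different bits in different cohorts, which is what lets the analysis disambiguate Bloom filter collisions.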
Before we can start running the analysis part, we need some datasets to work on.
To generate these, clone the repository with Alejandro's updates:
$ git clone https://github.com/Alexrs95/rappor
Then, run one of the following commands, depending on which distribution you want to use:
$ ./regtest.sh run-seq 'r-zipf1-small-sim_final'
$ ./regtest.sh run-seq 'r-zipf1.5-small-sim_final'
$ ./regtest.sh run-seq 'r-gauss-small-sim_final'
$ ./regtest.sh run-seq 'r-exp-small-sim_final'
$ ./regtest.sh run-seq 'r-unif-small-sim_final'
Alternatively, all these datasets can be generated with one command:
$ ./regtest.sh run-seq 'r-zipf1-small-sim_final|r-zipf1.5-small-sim_final|r-gauss-small-sim_final|r-exp-small-sim_final|r-unif-small-sim_final'
There are many other datasets; a full list can be generated by running tests/regtest_spec.py.
Generally, we want to use the datasets with the final parameters, as listed here.
Using small means we work on reports from 1,000,000 clients with 100 unique values. This dataset size is still reasonable for running the analysis on a laptop.
For testing, looking at all distributions is useful. The real distribution for our use case is probably most similar to zipf1.5.
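To get a feel for what a zipf1.5 input looks like, here is a sketch that samples 1,000,000 reports over 100 values from a Zipf-like distribution with NumPy (the value names and the exact generator the regtest scripts use are assumptions, not the scripts' actual code):

```python
import numpy as np

rng = np.random.default_rng(42)

num_values = 100          # unique candidate strings ("small" config)
num_clients = 1_000_000   # one report per client

# Zipf with exponent s = 1.5: P(rank k) is proportional to 1 / k^1.5.
s = 1.5
ranks = np.arange(1, num_values + 1)
probs = ranks ** -s
probs /= probs.sum()

values = [f"v{k}" for k in ranks]  # hypothetical value names
reports = rng.choice(values, size=num_clients, p=probs)
```

With s = 1.5 the distribution is heavily skewed: a handful of values account for most reports, while the long tail is rare — which is exactly the regime where decoding the tail becomes hard.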
To see how much the tuned parameters matter, a dataset with different parameters can be generated:
$ ./regtest.sh run-seq 'r-gauss-small-sim_bloom_filter1_1'
$ ./regtest.sh run-seq 'r-gauss-small-sim_bloom_filter1_1|r-zipf1-small-sim_final|r-zipf1.5-small-sim_final|r-gauss-small-sim_final|r-exp-small-sim_final|r-unif-small-sim_final'
At the top of the Jupyter notebook, the path to the generated data needs to be adapted. Afterwards, just run all cells to see the results. Depending on the data and parameters used, this might take a while.
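For orientation, the central step the analysis performs on the collected bit counts is the unbiasing formula from the RAPPOR paper, which inverts the expected effect of the noise parameters f, p, and q. A minimal sketch (not the notebook's actual code):

```python
import numpy as np

def unbias_counts(counts, num_reports, f, p, q):
    """Estimate true Bloom filter bit counts from noisy RAPPOR counts.

    Given the observed number of set bits per position (counts) out of
    num_reports reports, invert the expected effect of the randomized
    response parameters: f (permanent randomized response noise) and
    p, q (instantaneous randomized response probabilities).
    """
    counts = np.asarray(counts, dtype=float)
    # Expected contribution of pure noise to each bit position.
    expected_noise = (p + 0.5 * f * q - 0.5 * f * p) * num_reports
    return (counts - expected_noise) / ((1 - f) * (q - p))
```

For example, with f = 0.5, p = 0.25, q = 0.75 and 1,000 reports of which 300 truly set a bit, the expected observed count is 450, and unbias_counts recovers the 300. The subsequent steps (lasso regression against the candidate Bloom filter map, then significance testing) build on these unbiased counts.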