This is a research prototype built as part of an internship project. See the Mozilla Governance thread and bug 1386569 for context.
This is a Python reimplementation of the analysis part of RAPPOR (paper | original repository | updated repository). Concretely, there are two different notebooks available:
- RAPPOR-Production: Can be used to perform an entire RAPPOR analysis using all the components that we deemed useful. Additionally, test datasets can be dynamically generated using Spark.
- RAPPOR-Prototyping: Contains a lot of explanations for the individual steps in RAPPOR and a lot of experimenting with different ideas. This notebook works with datasets generated from the original repository.
Both notebooks need to be run from within this repository as they require the files located in the clients/ folder. The production notebook is expected to run on the data generated by this SHIELD study.
Generally, only the usual SciPy tech stack is required:
$ pip install scipy numpy matplotlib pandas scikit-learn
To be able to use the same hash functions as the client part, some files from the original repository were already copied into the clients/ folder.
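The shared hashing matters because the analysis must map candidate strings to the same Bloom filter bits that the clients set. As a rough sketch of the hashing scheme the original RAPPOR client uses (the function name and details are paraphrased here for illustration; treat the files in clients/ as authoritative):

```python
import hashlib
import struct

def get_bloom_bits(word, cohort, num_hashes, num_bloombits):
    """Sketch of the RAPPOR client's Bloom filter hashing.

    The cohort number (big-endian) is prepended to the value, the
    result is hashed with MD5, and each of the first num_hashes digest
    bytes selects one bit position in the Bloom filter.
    """
    data = struct.pack(">L", cohort) + word.encode("utf-8")
    digest = hashlib.md5(data).digest()
    return [digest[i] % num_bloombits for i in range(num_hashes)]

# Example: 2 hash functions over a 16-bit Bloom filter.
bits = get_bloom_bits("example.com", cohort=0, num_hashes=2, num_bloombits=16)
```

Because the cohort is part of the hash input, the same value maps to different bits in different cohorts, which is what lets the analysis disambiguate Bloom filter collisions.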
Before we can start running the analysis part, we need some datasets to work on.
To generate these, clone the repository with Alejandro's updates:
$ git clone https://github.com/Alexrs95/rappor
Then, run one of the following commands, depending on which distribution you want to use:
$ ./regtest.sh run-seq 'r-zipf1-small-sim_final'
$ ./regtest.sh run-seq 'r-zipf1.5-small-sim_final'
$ ./regtest.sh run-seq 'r-gauss-small-sim_final'
$ ./regtest.sh run-seq 'r-exp-small-sim_final'
$ ./regtest.sh run-seq 'r-unif-small-sim_final'
Alternatively, all these datasets can be generated with one command:
$ ./regtest.sh run-seq 'r-zipf1-small-sim_final|r-zipf1.5-small-sim_final|r-gauss-small-sim_final|r-exp-small-sim_final|r-unif-small-sim_final'
There are many other datasets; a full list can be generated by running tests/regtest_spec.py.
Generally, we want to use the datasets with the final parameters, as listed here.
Using small means we work on reports from 1,000,000 clients with 100 unique values. This dataset size is still reasonable for running the analysis on a laptop.
For testing, looking at all distributions is useful. The real distribution for our use case is probably most similar to zipf1.5.
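To get a feel for what a zipf1.5 input looks like, here is a sketch that samples 1,000,000 reports over 100 values from a Zipf-like distribution with NumPy (the value names and the exact generator the regtest scripts use are assumptions, not the scripts' actual code):

```python
import numpy as np

rng = np.random.default_rng(42)

num_values = 100          # unique candidate strings ("small" config)
num_clients = 1_000_000   # one report per client

# Zipf with exponent s = 1.5: P(rank k) is proportional to 1 / k^1.5.
s = 1.5
ranks = np.arange(1, num_values + 1)
probs = ranks ** -s
probs /= probs.sum()

values = [f"v{k}" for k in ranks]  # hypothetical value names
reports = rng.choice(values, size=num_clients, p=probs)
```

With s = 1.5 the distribution is heavily skewed: a handful of values account for most reports, while the long tail is rare — which is exactly the regime where decoding the tail becomes hard.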
To see how much the tuned parameters matter, a dataset with different parameters can be generated:
$ ./regtest.sh run-seq 'r-gauss-small-sim_bloom_filter1_1'
$ ./regtest.sh run-seq 'r-gauss-small-sim_bloom_filter1_1|r-zipf1-small-sim_final|r-zipf1.5-small-sim_final|r-gauss-small-sim_final|r-exp-small-sim_final|r-unif-small-sim_final'
At the top of the Jupyter notebook, the path to the generated data needs to be adapted. Afterwards, just run all cells to see the results. Depending on the data and parameters used, this might take a while.
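For orientation, the central step the analysis performs on the collected bit counts is the unbiasing formula from the RAPPOR paper, which inverts the expected effect of the noise parameters f, p, and q. A minimal sketch (not the notebook's actual code):

```python
import numpy as np

def unbias_counts(counts, num_reports, f, p, q):
    """Estimate true Bloom filter bit counts from noisy RAPPOR counts.

    Given the observed number of set bits per position (counts) out of
    num_reports reports, invert the expected effect of the randomized
    response parameters: f (permanent randomized response noise) and
    p, q (instantaneous randomized response probabilities).
    """
    counts = np.asarray(counts, dtype=float)
    # Expected contribution of pure noise to each bit position.
    expected_noise = (p + 0.5 * f * q - 0.5 * f * p) * num_reports
    return (counts - expected_noise) / ((1 - f) * (q - p))
```

For example, with f = 0.5, p = 0.25, q = 0.75 and 1,000 reports of which 300 truly set a bit, the expected observed count is 450, and unbias_counts recovers the 300. The subsequent steps (lasso regression against the candidate Bloom filter map, then significance testing) build on these unbiased counts.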