RAPPOR one-off aggregator

This is a research prototype built as part of an internship project. See the Mozilla Governance thread and bug 1386569 for context.

Analysis

This is a Python reimplementation of the analysis part of RAPPOR (paper | original repository | updated repository). Concretely, there are two different notebooks available:

  • RAPPOR-Production: Can be used to perform an entire RAPPOR analysis using all the components that we deemed useful. Additionally, test datasets can be dynamically generated using Spark.
  • RAPPOR-Prototyping: Contains a lot of explanations for the individual steps in RAPPOR and a lot of experimenting with different ideas. This notebook works with datasets generated from the original repository.

Both notebooks need to be run from within this repository as they require the files located in the clients/ folder. The production notebook is expected to run on the data generated by this SHIELD study.

Installing dependencies

Generally, only the usual SciPy tech stack is required.

$ pip install scipy numpy matplotlib pandas scikit-learn

To use the same hash functions as the client part, some files were copied from the other repository into the clients/ folder.
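
As a quick sanity check before opening the notebooks, the following sketch reports which of the packages above cannot be found in the current environment. The helper name `find_missing` is made up for this illustration and is not part of the repository. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the packages installed above.
# scikit-learn is imported as "sklearn".
REQUIRED_MODULES = ["scipy", "numpy", "matplotlib", "pandas", "sklearn"]


def find_missing(modules):
    """Return the modules from `modules` that cannot be located for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All required modules are available.")
```

If anything is reported as missing, rerun the pip command in the same environment that Jupyter uses.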


Prototyping notebook

Generating data

Before we can start running the analysis part, we need some datasets to work on.

To generate these, clone the repository with Alejandro's updates:

$ git clone https://github.com/Alexrs95/rappor

Then, run one of the following commands, depending on which distribution you want to use:

$ ./regtest.sh run-seq 'r-zipf1-small-sim_final'
$ ./regtest.sh run-seq 'r-zipf1.5-small-sim_final'
$ ./regtest.sh run-seq 'r-gauss-small-sim_final'
$ ./regtest.sh run-seq 'r-exp-small-sim_final'
$ ./regtest.sh run-seq 'r-unif-small-sim_final'

Alternatively, all these datasets can be generated with one command:

$ ./regtest.sh run-seq 'r-zipf1-small-sim_final|r-zipf1.5-small-sim_final|r-gauss-small-sim_final|r-exp-small-sim_final|r-unif-small-sim_final'
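
The combined argument is simply the individual test case names joined with `|`. The sketch below builds that string in Python, which can be handy when choosing a different subset of distributions. It assumes the names follow the pattern `r-<distribution>-<size>-<parameters>`, which is inferred only from the examples above; the helper names `case_name` and `combined_pattern` are made up for this illustration.

```python
# Distribution names as they appear in the test case names above.
DISTRIBUTIONS = ["zipf1", "zipf1.5", "gauss", "exp", "unif"]


def case_name(distribution, size="small", params="sim_final"):
    """Build one test case name, e.g. 'r-gauss-small-sim_final'."""
    return f"r-{distribution}-{size}-{params}"


def combined_pattern(distributions):
    """Join the case names with '|' to form a single argument for run-seq."""
    return "|".join(case_name(d) for d in distributions)


if __name__ == "__main__":
    # The single quotes keep the shell from treating '|' as a pipe.
    print(f"./regtest.sh run-seq '{combined_pattern(DISTRIBUTIONS)}'")
```

Running the script prints a command equivalent to the one above.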

There are many other datasets; a full list can be generated by running tests/regtest_spec.py.

Generally, we want to use the datasets with the final parameters, as listed here. The small size means we work on reports from 1,000,000 clients with 100 unique values, which is still small enough to run the analysis on a laptop. For testing, it is useful to look at all distributions. The real distribution for our use case is probably most similar to zipf1.5.
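
To get a feeling for how skewed such a distribution is, the following sketch draws client values from a truncated power law over 100 values. This is not how regtest.sh generates its data. It assumes that zipf1.5 refers to a Zipf-like distribution with exponent 1.5, and the function names are made up for this illustration.

```python
import random
from collections import Counter


def zipf_weights(num_values, exponent):
    """Normalized weights proportional to 1 / rank**exponent for ranks 1..num_values."""
    raw = [1.0 / rank ** exponent for rank in range(1, num_values + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_values(num_clients, num_values=100, exponent=1.5, seed=0):
    """Draw one value index per client and count how often each value occurs."""
    rng = random.Random(seed)
    weights = zipf_weights(num_values, exponent)
    draws = rng.choices(range(num_values), weights=weights, k=num_clients)
    return Counter(draws)


if __name__ == "__main__":
    # Fewer clients than in the small datasets, to keep the example fast.
    counts = sample_values(num_clients=10_000)
    for value, count in counts.most_common(5):
        print(f"value {value}: {count} clients")
```

Under these assumptions a handful of values account for most of the reports, while the remaining values are each reported by only a few clients. The rare values are the hard ones to recover from noisy reports.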

To see how important the tuned parameters are, a dataset with different parameters can be generated, either on its own or together with the datasets above:

$ ./regtest.sh run-seq 'r-gauss-small-sim_bloom_filter1_1'
$ ./regtest.sh run-seq 'r-gauss-small-sim_bloom_filter1_1|r-zipf1-small-sim_final|r-zipf1.5-small-sim_final|r-gauss-small-sim_final|r-exp-small-sim_final|r-unif-small-sim_final'

Running the analysis

At the top of the Jupyter notebook, the path to the generated data needs to be adapted. Afterwards, just run all cells to see the results. Depending on the data and parameters used, this might take a while.
