
Diffix for Desktop

Desktop application for anonymizing data using Open Diffix Elm.

To use

Download and run the setup package. Documentation can be found in the application itself, as well as in the docs folder of this repo.

Sample CSV files for playing with the app can be found in the sample_data folder in this repo.

Development

Run asdf install to install Node via asdf. Run npm install to install the dependencies. Run git submodule update --init && npm run build to fetch and build the anonymizer. (Use npm run build-{win|linux|osx} if the plain build doesn't work for you.)

Following the setup, run npm start to start the development environment with hot code reloading.

Before committing, make sure to lint and format your code with npm run lint and npm run format.

Making a release

  1. Make sure there is a new section titled "### Next version" in the changelog, listing the most recent changes.

  2. Execute npm version [major|minor|patch] in a clean working folder. This will bump the app version, create a commit for the new release, tag it, and push it to GitHub. A draft release will then be created automatically on GitHub.

  3. After making sure the generated assets are OK, publish the release manually in the GitHub UI.


Issues

CSS class naming convention

If you have a particular naming scheme in mind, please speak up. It will be hard to migrate once we've written a bunch of components.

If there are no objections we can go with BEM.

Show anonymized vs real results

The result produced should show:

  • the anonymized aggregate
  • the real aggregate, and maybe(?) something indicating how large the error is
  • rows that have been suppressed

I wonder if it shouldn't also be possible to hide this view, so you only see the anonymized results?
By default suppressed rows and unanonymized aggregates should not make it into the exports!(?)
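
To make the discussion concrete, here is one possible shape for a combined result row; the names and fields are illustrative, not an agreed design:

// Illustrative TypeScript sketch of a combined result row.
interface ResultRow {
  bucket: (string | number)[]; // values of the grouping columns
  anonCount: number;           // anonymized aggregate
  realCount: number;           // real aggregate, shown only in the diff view
  suppressed: boolean;         // true if the bucket was suppressed
}

// Exports would keep only the anonymized data by default.
function toExportRow(row: ResultRow): (string | number)[] | null {
  return row.suppressed ? null : [...row.bucket, row.anonCount];
}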

Try out Maui

  • How mature is it?
  • Does it provide useful primitives for us even at this stage?
  • Does it seem like a good choice moving forward?

Make sure nobody needs to do an extra .NET install

Our goal is a very easy installation of Easy Diffix.

It was mentioned that .NET, or at least the crucial files, can be bundled with the installation. I think we should really look into that.

Having said that, for a beta/alpha version it's not necessary. But it might be a nice piece of work to tackle in between?
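
For reference, .NET supports self-contained deployments that bundle the runtime with the app; something along these lines (the runtime identifier is illustrative) should produce a build that needs no separate .NET install:

dotnet publish -c Release -r win-x64 --self-contained true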

npm run make issues for linux

For .deb artifacts we need to add the executableName prop in package.json (config.forge.packagerConfig.executableName).

For .rpm artifacts we need a license field in package.json.
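
A sketch of the corresponding package.json additions; the executable name and license value below are placeholders:

{
  "license": "MIT",
  "config": {
    "forge": {
      "packagerConfig": {
        "executableName": "diffix-for-desktop"
      }
    }
  }
}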

Try out Chromely.

Chromely is an alternative to Electron.NET; it seems less popular, but is more actively maintained (0/42 vs 10/120 open PRs).

Frontend <> Backend communication

We need to agree on the way the Frontend communicates with the Backend.

Since transpiling the reference code to JS resulted in poor performance, the anonymization code will stay in dotnet.
Furthermore, I don't think it is a good idea to manually build the query AST in JS land. It couples the Frontend and Backend internals too much. Sending a SQL statement feels cleaner.

As input we send: filename, query statement, anonymization settings.
As output we get: query result or an error.

Option 1: anonymize using the CLI.

We pass the input as command-line arguments and get back the query result (as either CSV or JSON) on the stdout stream, or an error on the stderr stream.
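
For illustration, a minimal sketch of shelling out to the CLI from the Electron main process; the binary path and flag names are assumptions, not the actual CLI interface:

import { execFile } from 'child_process';

// Hypothetical binary path and flags -- the real CLI interface may differ.
function anonymize(fileName: string, query: string, settings: object): Promise<string> {
  return new Promise((resolve, reject) => {
    execFile(
      './bin/OpenDiffix.CLI',
      ['--file', fileName, '--query', query, '--settings', JSON.stringify(settings)],
      (error, stdout, stderr) => {
        if (error) reject(new Error(stderr || error.message)); // error on stderr
        else resolve(stdout);                                  // result on stdout
      }
    );
  });
}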

PROs:

  • We don't need to have .NET code in the publisher repository;
  • It keeps the reference code separate from the GUI and free of pollution with Frontend concerns;
  • Allows for easy automation, as all the functionality is easily accessible from the CLI;
  • Makes sure the reference tool works as intended from the CLI (since the Frontend depends on it).

CONs:

  • We won't have live progress reports (unless we get a bit hacky);
  • We pay the CLR startup cost for each anonymization call;
  • Functionality will be limited to what the CLI provides.

Option 2: anonymize using IPC.

We will need an additional .NET project in this repository that loads the core reference library and dispatches anonymization requests to it. We pass the input as a JSON object and we get back a JSON object with the result or error. We need to decide if we use a socket or the process stdio streams for message exchange.
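
A sketch of what the stdio variant could look like on the JS side; the service binary and message shapes are assumptions:

import { spawn } from 'child_process';
import * as readline from 'readline';

// Hypothetical service process speaking one JSON message per line.
const service = spawn('./bin/OpenDiffix.Service');
const lines = readline.createInterface({ input: service.stdout! });

lines.on('line', (line) => {
  // Either { requestId, result } or { requestId, error }.
  console.log(JSON.parse(line));
});

function sendRequest(fileName: string, query: string, settings: object): void {
  service.stdin!.write(JSON.stringify({ fileName, query, settings }) + '\n');
}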

PROs:

  • We can add functionality not supported by the CLI;
  • JSON messages are more expressive than invoking a CLI application;
  • Lower latency, since the CLR is kept loaded.

CONs:

  • Additional .NET code added to this repository;
  • CLI might become stale, since it will be rarely used;
  • Tighter coupling between the publisher and reference repositories;
  • Reference code will get polluted with Frontend concerns (like progress reports).

I am slightly in favor of Option 1 (I don't consider its drawbacks too big).

You should be able to see the effects of the anonymization

I.e., show the difference between the anonymized count and the real count, and see the rows that were anonymized away.
I suggest making it possible to toggle between just the anonymized view and the view that shows the diff.

Handle logging for prod build

I installed a deployment and it's failing without much information:


Any way to access main process logs? We definitely need this to track issues in the wild.
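
One option, assuming we add the electron-log package (not currently a dependency): it writes main-process logs to a file in a standard per-OS location, which would give us something to ask users for:

// In the main process; electron-log persists to a log file by default.
import log from 'electron-log';

log.info('App starting');

process.on('uncaughtException', (error) => {
  log.error('Uncaught exception in main process:', error);
});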

Ability to export results

It should be possible to export the results as a CSV so you can take the anonymized results out of the anonymizer and use them in your reports or whatnot.
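
A minimal sketch of the CSV serialization, with standard quoting of delimiters, quotes, and newlines:

// Quote a field only when it contains a comma, quote, or newline.
function csvField(value: string | number): string {
  const s = String(value);
  return /[",\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

function toCsv(headers: string[], rows: (string | number)[][]): string {
  return [headers, ...rows]
    .map((row) => row.map(csvField).join(','))
    .join('\n');
}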

How to display the distortion information?

One option would be to show the distortion information (actual or magnitude) per-bucket.

Another option would be to only show aggregate information, like:

  • number of suppressed buckets;
  • max relative distortion;
  • avg relative distortion.

I think the latter is easier to digest.
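
A sketch of how the aggregate view could be computed from per-bucket counts; the bucket shape is assumed:

interface Bucket {
  realCount: number;
  anonCount: number;
  suppressed: boolean;
}

// Aggregate distortion across all buckets instead of per bucket.
function distortionSummary(buckets: Bucket[]) {
  const kept = buckets.filter((b) => !b.suppressed);
  const relative = kept.map((b) => Math.abs(b.anonCount - b.realCount) / b.realCount);
  return {
    suppressedBuckets: buckets.length - kept.length,
    maxRelativeDistortion: relative.length ? Math.max(...relative) : 0,
    avgRelativeDistortion: relative.length
      ? relative.reduce((sum, x) => sum + x, 0) / relative.length
      : 0,
  };
}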

Concurrent anonymizations have unexpected effects

Pre-requisites:

  • have a dataset of decent size, so anonymization takes a while
  • the dataset should have multiple columns

To reproduce:

  • Select one column
  • Immediately select a second

The app will issue two anonymization requests that are run one after the other.
As soon as the first one returns a result, the frontend considers the request finished and stops showing the "processing" animation. This makes it appear to the analyst as if the second column added for anonymization was dropped/ignored. A while later, once the second anonymization ends, the result is updated and everything is as expected.

It would be ideal if we could terminate ongoing and queued anonymization requests when making a new one. That way the time to useful results is kept short. This would also solve this bug, as only the desired anonymization request would ever be returned.

Alternatively, we could send a request ID with each anonymization request, and only mark a request as complete once the result for the desired request has been returned. This might be generally useful, as we otherwise risk race conditions...
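
A sketch of the request-ID variant; the backend call and UI update are placeholders:

// Placeholders for the real backend call and UI update.
declare function anonymize(query: string): Promise<unknown>;
declare function showResult(result: unknown): void;

let latestRequestId = 0;

// Tag each request; drop results arriving for superseded requests.
async function runAnonymization(query: string): Promise<void> {
  const requestId = ++latestRequestId;
  const result = await anonymize(query);
  if (requestId !== latestRequestId) return; // a newer request exists: stale result
  showResult(result);
}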

npm run build no longer works

The build-{win|linux|osx} versions are slow. I want to be able to run the fast, basic npm run build for development purposes. Since we added the trimming flag to the fsproj, it no longer works.

CSV import step

During CSV import we should either get really good at auto-detecting the delimiter used in the CSV and auto-guessing the types of columns, or we should allow the user to specify them. Maybe we should guess, and then allow the user to fine-tune?
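
As a starting point, a sketch of naive delimiter guessing that could pre-fill a user-adjustable setting; the candidate set and heuristic are assumptions:

// Guess the delimiter by counting candidate characters in the first line.
function guessDelimiter(firstLine: string): string {
  const candidates = [',', ';', '\t', '|'];
  let best = ',';
  let bestCount = -1;
  for (const candidate of candidates) {
    const count = firstLine.split(candidate).length - 1;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}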

Sorting by column

It should be possible to sort the exported data by a column (ascending or descending).
This makes it easier to consume and inspect the results of the tool.
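
A sketch of a column comparator; the row shape (an array of cell values) is assumed:

type Row = (string | number)[];

// Sort rows by one column, ascending or descending.
function sortByColumn(rows: Row[], column: number, ascending: boolean): Row[] {
  const direction = ascending ? 1 : -1;
  return [...rows].sort((a, b) => {
    const x = a[column];
    const y = b[column];
    if (typeof x === 'number' && typeof y === 'number') return (x - y) * direction;
    return String(x).localeCompare(String(y)) * direction;
  });
}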

Optimize data loading.

Can we hook the CSV reader so that it emits rows to another parallel thread, which stores them in an SQLite DB (or another fast-to-read format)? This could happen in the background while a query is doing a full scan.

The file can be hidden in some directory and we can use the hash (which we already have) as an identifier.

Not sure how much we would gain by this. Maybe benchmark a CSV scan against an SQLite scan?
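
As a sketch of the caching step, assuming better-sqlite3 as a dependency (it is not one today) and rows already parsed from the CSV:

import Database from 'better-sqlite3';

// Store parsed CSV rows in an SQLite file named after the file hash.
function cacheRows(hash: string, columns: string[], rows: string[][]): void {
  const db = new Database(`cache/${hash}.sqlite`);
  const cols = columns.map((c) => `"${c}"`).join(', ');
  const params = columns.map(() => '?').join(', ');
  db.exec(`CREATE TABLE IF NOT EXISTS data (${cols})`);
  const insert = db.prepare(`INSERT INTO data VALUES (${params})`);
  // A single transaction keeps the bulk insert fast.
  db.transaction((rs: string[][]) => {
    for (const r of rs) insert.run(...r);
  })(rows);
  db.close();
}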

Anonymization is not stable

Re-running the anonymization tends to produce a new and different set of values.
It seems the anonymizer is not stable, despite the seed being set and fixed.

  • How is the JS PRNG seeded/created?
  • Can we fix this? Or if not, maybe we should fake a stable result by caching previously calculated values?
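
For reference, a deterministic seeded PRNG in JS (mulberry32, a common public-domain choice); whether the anonymizer's seed actually flows into something like this is the open question:

// mulberry32: the same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const rand = mulberry32(42);
console.log(rand(), rand()); // identical output on every run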

How do we compute the combined data view?

If we compute the raw and anonymized data separately, we will need to first do two passes through the dataset, and then an additional pass through the results to combine the two sets of buckets. This might be pretty slow for larger inputs.

Alternatively, we could create a custom SQL statement for the reference tool that computes everything we need in a single pass:

SELECT
  column1,
  column2,
  count(*) AS real_count,
  diffix_count(aid) AS anon_count,
  diffix_lcf(aid) AS suppressed
FROM table
GROUP BY 1, 2
