
Diffix for Desktop

Desktop application for anonymizing data using Open Diffix Elm.

To use

Download and run the setup package. Documentation can be found in the application itself, as well as in the docs folder of this repo.

Sample CSV files for playing with the app can be found in the sample_data folder in this repo.

Development

Run asdf install to install Node via asdf. Run npm install to install the dependencies. Run git submodule update --init && npm run build to fetch and build the anonymizer. (Use npm run build-{win|linux|osx} if the plain build doesn't work for you.)

Following the setup, run npm start to start the development environment with hot code reloading.

Before committing, make sure to lint and format your code with npm run lint and npm run format.

Making a release

  1. Make sure there is a new section titled "### Next version" in the changelog, listing the most recent changes.

  2. Execute npm version [major|minor|patch] in a clean working folder. This will bump the app version, create a commit for the new release, tag it, and push it to GitHub. A draft release will then be created automatically on GitHub.

  3. After making sure the generated assets are OK, publish the release manually in the GitHub UI.


Issues

CSS class naming convention

If you have a particular naming scheme in mind, please speak up. It will be hard to migrate once we've written a bunch of components.

If there are no objections we can go with BEM.

Show anonymized vs real results

The result produced should show:

  • the anonymized aggregate
  • the real aggregate, and maybe(?) something indicating how large the error is
  • rows that have been suppressed

I wonder if it shouldn't also be possible to hide this view, so you only see the anonymized results?
By default suppressed rows and unanonymized aggregates should not make it into the exports!(?)
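
To make the discussion concrete, here is one possible shape for a combined result row; the names and fields are illustrative, not an agreed design:

// Illustrative TypeScript sketch of a combined result row.
interface ResultRow {
  bucket: (string | number)[]; // values of the grouping columns
  anonCount: number;           // anonymized aggregate
  realCount: number;           // real aggregate, shown only in the diff view
  suppressed: boolean;         // true if the bucket was suppressed
}

// Exports would keep only the anonymized data by default.
function toExportRow(row: ResultRow): (string | number)[] | null {
  return row.suppressed ? null : [...row.bucket, row.anonCount];
}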

Try out Maui

  • How mature is it?
  • Does it provide useful primitives for us even at this stage?
  • Does it seem like a good choice moving forward?

Make sure nobody needs to do an extra .NET install

Our goal is a very easy installation of Easy Diffix.

It was mentioned that .NET, or at least the crucial files, can be bundled with the installation. I think we should really look into that.

Having said that, for a beta/alpha version it's not necessary. But it might be a nice piece of work to tackle in between?
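
For reference, .NET supports self-contained deployments that bundle the runtime with the app; something along these lines (the runtime identifier is illustrative) should produce a build that needs no separate .NET install:

dotnet publish -c Release -r win-x64 --self-contained true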

npm run make issues for linux

For .deb artifacts we need to add the executableName prop in package.json (config.forge.packagerConfig.executableName).

For .rpm artifacts we need a license field in package.json.
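
A sketch of the corresponding package.json additions; the executable name and license value below are placeholders:

{
  "license": "MIT",
  "config": {
    "forge": {
      "packagerConfig": {
        "executableName": "diffix-for-desktop"
      }
    }
  }
}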

Try out Chromely.

Chromely is an alternative to Electron.NET; it seems less popular, but is more actively maintained (0/42 vs 10/120 open PRs).

Frontend <> Backend communication

We need to agree on the way the Frontend communicates with the Backend.

Since transpiling the reference code to JS resulted in poor performance, the anonymization code will stay in dotnet.
Furthermore, I don't think it is a good idea to manually build the query AST in JS land. It couples the Frontend and Backend internals too much. Sending a SQL statement feels cleaner.

As input we send: filename, query statement, anonymization settings.
As output we get: query result or an error.

Option 1: anonymize using the CLI.

We pass the input as command-line arguments and get back the query result (as either CSV or JSON) on the stdout stream, or an error on the stderr stream.
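
For illustration, a minimal sketch of shelling out to the CLI from the Electron main process; the binary path and flag names are assumptions, not the actual CLI interface:

import { execFile } from 'child_process';

// Hypothetical binary path and flags -- the real CLI interface may differ.
function anonymize(fileName: string, query: string, settings: object): Promise<string> {
  return new Promise((resolve, reject) => {
    execFile(
      './bin/OpenDiffix.CLI',
      ['--file', fileName, '--query', query, '--settings', JSON.stringify(settings)],
      (error, stdout, stderr) => {
        if (error) reject(new Error(stderr || error.message)); // error on stderr
        else resolve(stdout);                                  // result on stdout
      }
    );
  });
}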

PROs:

  • We don't need to have .NET code in the publisher repository;
  • It keeps the reference code separate from the GUI and free of pollution with Frontend concerns;
  • Allows for easy automation, as all the functionality is easily accessible from the CLI;
  • Makes sure the reference tool works as intended from the CLI (since the Frontend depends on it).

CONs:

  • We won't have live progress reports (unless we get a bit hacky);
  • We pay the CLR startup cost for each anonymization call;
  • Functionality will be limited to what the CLI provides.

Option 2: anonymize using IPC.

We will need an additional .NET project in this repository that loads the core reference library and dispatches anonymization requests to it. We pass the input as a JSON object and we get back a JSON object with the result or error. We need to decide if we use a socket or the process stdio streams for message exchange.
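
A sketch of what the stdio variant could look like on the JS side; the service binary and message shapes are assumptions:

import { spawn } from 'child_process';
import * as readline from 'readline';

// Hypothetical service process speaking one JSON message per line.
const service = spawn('./bin/OpenDiffix.Service');
const lines = readline.createInterface({ input: service.stdout! });

lines.on('line', (line) => {
  // Either { requestId, result } or { requestId, error }.
  console.log(JSON.parse(line));
});

function sendRequest(fileName: string, query: string, settings: object): void {
  service.stdin!.write(JSON.stringify({ fileName, query, settings }) + '\n');
}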

PROs:

  • We can add functionality not supported by the CLI;
  • JSON messages are more expressive than invoking a CLI application;
  • Lower latency, since the CLR is kept loaded.

CONs:

  • Additional .NET code added to this repository;
  • CLI might become stale, since it will be rarely used;
  • Tighter coupling between the publisher and reference repositories;
  • Reference code will get polluted with Frontend concerns (like progress reports).

I am slightly in favor of Option 1 (I don't consider its drawbacks too big).

You should be able to see the effects of the anonymization

I.e., show the difference between the anonymized count and the real count, and see the rows that were anonymized away.
I suggest making it possible to toggle between just the anonymized view and the view that shows the diff.

Handle logging for prod build

I installed a deployment and it's failing without much information:


Any way to access main process logs? We definitely need this to track issues in the wild.
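
One option, assuming we add the electron-log package (not currently a dependency): it writes main-process logs to a file in a standard per-OS location, which would give us something to ask users for:

// In the main process; electron-log persists to a log file by default.
import log from 'electron-log';

log.info('App starting');

process.on('uncaughtException', (error) => {
  log.error('Uncaught exception in main process:', error);
});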

Ability to export results

It should be possible to export the results as a CSV so you can take the anonymized results out of the anonymizer and use them in your reports or whatnot.
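
A minimal sketch of the CSV serialization, with standard quoting of delimiters, quotes, and newlines:

// Quote a field only when it contains a comma, quote, or newline.
function csvField(value: string | number): string {
  const s = String(value);
  return /[",\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
}

function toCsv(headers: string[], rows: (string | number)[][]): string {
  return [headers, ...rows]
    .map((row) => row.map(csvField).join(','))
    .join('\n');
}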

How to display the distortion information?

One option would be to show the distortion information (actual or magnitude) per-bucket.

Another option would be to only show aggregate information, like:

  • number of suppressed buckets;
  • max relative distortion;
  • avg relative distortion.

I think the latter is easier to digest.
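
A sketch of how the aggregate view could be computed from per-bucket counts; the bucket shape is assumed:

interface Bucket {
  realCount: number;
  anonCount: number;
  suppressed: boolean;
}

// Aggregate distortion across all buckets instead of per bucket.
function distortionSummary(buckets: Bucket[]) {
  const kept = buckets.filter((b) => !b.suppressed);
  const relative = kept.map((b) => Math.abs(b.anonCount - b.realCount) / b.realCount);
  return {
    suppressedBuckets: buckets.length - kept.length,
    maxRelativeDistortion: relative.length ? Math.max(...relative) : 0,
    avgRelativeDistortion: relative.length
      ? relative.reduce((sum, x) => sum + x, 0) / relative.length
      : 0,
  };
}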

Concurrent anonymizations have unexpected effects

Pre-requisites:

  • have a dataset of decent size, so anonymization takes a while
  • the dataset should have multiple columns

To reproduce:

  • Select one column
  • Immediately select a second

The app will issue two anonymization requests that are run one after the other.
As soon as the first one returns a result, the frontend considers the request finished and stops showing the "processing" animation. This makes it appear to the analyst as if the second column added for anonymization was dropped/ignored. A while later, once the second anonymization ends, the result is updated and everything is as expected.

It would be ideal if we could terminate ongoing and queued anonymization requests when making a new one. That way the time to useful results is kept short. This would also solve this bug, as only the desired anonymization request would ever be returned.

Alternatively, we could send a request ID with each anonymization request, and only mark a request as complete once the result for the desired request has been returned. This might be generally useful, as we otherwise risk race conditions...
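
A sketch of the request-ID variant; the backend call and UI update are placeholders:

// Placeholders for the real backend call and UI update.
declare function anonymize(query: string): Promise<unknown>;
declare function showResult(result: unknown): void;

let latestRequestId = 0;

// Tag each request; drop results arriving for superseded requests.
async function runAnonymization(query: string): Promise<void> {
  const requestId = ++latestRequestId;
  const result = await anonymize(query);
  if (requestId !== latestRequestId) return; // a newer request exists: stale result
  showResult(result);
}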

npm run build no longer works

The build-{win|linux|osx} versions are slow. I want to be able to run the fast, basic npm run build for development purposes. Since we added the trimming flag to the fsproj, it no longer works.

CSV import step

During CSV import we should either get really good at auto-detecting the delimiter used in the CSV and auto-guessing the types of columns, or we should allow the user to specify them. Maybe we should guess, and then allow the user to fine-tune?
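
As a starting point, a sketch of naive delimiter guessing that could pre-fill a user-adjustable setting; the candidate set and heuristic are assumptions:

// Guess the delimiter by counting candidate characters in the first line.
function guessDelimiter(firstLine: string): string {
  const candidates = [',', ';', '\t', '|'];
  let best = ',';
  let bestCount = -1;
  for (const candidate of candidates) {
    const count = firstLine.split(candidate).length - 1;
    if (count > bestCount) {
      best = candidate;
      bestCount = count;
    }
  }
  return best;
}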

Sorting by column

It should be possible to sort the exported data by a column (ascending or descending).
This makes it easier to consume and inspect the results of the tool.
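
A sketch of a column comparator; the row shape (an array of cell values) is assumed:

type Row = (string | number)[];

// Sort rows by one column, ascending or descending.
function sortByColumn(rows: Row[], column: number, ascending: boolean): Row[] {
  const direction = ascending ? 1 : -1;
  return [...rows].sort((a, b) => {
    const x = a[column];
    const y = b[column];
    if (typeof x === 'number' && typeof y === 'number') return (x - y) * direction;
    return String(x).localeCompare(String(y)) * direction;
  });
}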

Optimize data loading.

Can we hook the CSV reader so that it emits rows to another parallel thread, which stores them in an SQLite DB (or another fast-to-read format)? This could happen in the background while a query is doing a full scan.

The file can be hidden in some directory and we can use the hash (which we already have) as an identifier.

Not sure how much we would gain by this. Maybe benchmark a CSV scan against an SQLite scan?
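
As a sketch of the caching step, assuming better-sqlite3 as a dependency (it is not one today) and rows already parsed from the CSV:

import Database from 'better-sqlite3';

// Store parsed CSV rows in an SQLite file named after the file hash.
function cacheRows(hash: string, columns: string[], rows: string[][]): void {
  const db = new Database(`cache/${hash}.sqlite`);
  const cols = columns.map((c) => `"${c}"`).join(', ');
  const params = columns.map(() => '?').join(', ');
  db.exec(`CREATE TABLE IF NOT EXISTS data (${cols})`);
  const insert = db.prepare(`INSERT INTO data VALUES (${params})`);
  // A single transaction keeps the bulk insert fast.
  db.transaction((rs: string[][]) => {
    for (const r of rs) insert.run(...r);
  })(rows);
  db.close();
}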

Anonymization is not stable

Re-running the anonymization tends to produce a new and different set of values.
It seems the anonymizer is not stable, despite the seed being set and fixed.

  • How is the JS PRNG seeded/created?
  • Can we fix this? Or if not, maybe we should fake a stable result by caching previously calculated values?
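
For reference, a deterministic seeded PRNG in JS (mulberry32, a common public-domain choice); whether the anonymizer's seed actually flows into something like this is the open question:

// mulberry32: the same seed always yields the same sequence.
function mulberry32(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

const rand = mulberry32(42);
console.log(rand(), rand()); // identical output on every run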

How do we compute the combined data view?

If we compute the raw and anonymized data separately, we will need to first do two passes through the dataset, and then an additional pass through the results to combine the two sets of buckets. This might be pretty slow for larger inputs.

Alternatively, we could create a custom SQL statement for the reference tool that computes everything we need in a single pass:

SELECT
  column1,
  column2,
  count(*) AS real_count,
  diffix_count(aid) AS anon_count,
  diffix_lcf(aid) AS suppressed
FROM table
GROUP BY 1, 2
