
dapla-dlp-pseudo-service's People

Contributors

andilun, bjornandre, dapla-bot[bot], dependabot[bot], insulaventus, kschulst, mallport, mmwinther, nicolst, rupinderkaurssb, skykanin, snyk-bot


dapla-dlp-pseudo-service's Issues

Exporting does not preserve field order

When exporting a dataset using CSV, the resulting fields are not ordered in the same way as the original dataset.

Example:

Adresse;Lykketall;Postnr;Type;Skurk;Fødselsdato;Poststed;Id;Navn;charset;contentKey;createdDate;topic;description;contentLength;position;source;tag;contentType;dataset;resourceType;ulid;position;timestamp
Andedammen 13;13;3158;And;false;1934-06-09;ANDEBU;26913712456;oaÅ¢B-W9¢:%;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;001;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1BXJNMTA3T06D6V77T;001;1619628998699
Musestien 5;5;8723;Mus;false;1928-10-01;HUSBY;79288479608;P¿J±5ý]èý;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;002;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1H056GRD1R2KG09QW8;002;1619628998705
Langsvingen 11;11;1405;Hund;false;1932-05-25;LANGHUS;92239844400;Omß×Ry°Ö;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;003;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1MZTXXBAKKAZY9PA78;003;1619628998708
Svartvika 42;42;3158;Katt;true;1930-04-24;ANDEBU;10370827580;T0G%ÖÏ-Î0o(eª;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;004;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1NMDYBBVM89ZT375WZ;004;1619628998709

Update README

The REST API has been changed a bit lately. The README with examples should be updated to reflect these changes.

Authorize users based on DEPSEUDO privilege

Right now, authorization to depseudonymize a dataset is granted by matching the user's username against a static list of trusted users configured in the dapla-pseudo-service.

Instead, the dapla-pseudo-service should check if the user has a special DEPSEUDO privilege associated with the dataset that is requested to be depseudonymized.
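To illustrate the difference, here is a minimal sketch (Python, with entirely hypothetical data structures; the real service code is not shown in this issue) contrasting the current allow-list check with the proposed privilege check:

```python
# Current approach (illustrative): a static list of trusted users.
TRUSTED_USERS = {"alice", "bob"}

def may_depseudonymize_current(username: str) -> bool:
    """Authorization today: username must appear in a static config list."""
    return username in TRUSTED_USERS

# Proposed approach: check for a DEPSEUDO privilege tied to the dataset.
# The privilege lookup structure below is an assumption for illustration.
def may_depseudonymize_proposed(user_privileges: dict, dataset_path: str) -> bool:
    """Authorization as proposed: the user must hold the DEPSEUDO
    privilege for the specific dataset being depseudonymized."""
    return "DEPSEUDO" in user_privileges.get(dataset_path, set())
```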

Support "short format" when specifying dataset paths with timestamps

In the API, when specifying a dataset path, we need a separate property to denote the timestamp. Instead of requiring two different properties, we could represent the same information with a short format, such as:

/path/to/dataset <-- which translates to /path/to/dataset/<current timestamp>
/path/to/dataset@<timestamp> <-- which would translate to /path/to/dataset/<timestamp>
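A minimal sketch of how such a short format could be parsed (Python; the millisecond-epoch timestamp format is an assumption for illustration, the issue does not pin down an exact format):

```python
import re
import time

def resolve_dataset_path(path: str) -> str:
    """Expand the proposed short format into an explicit timestamped path.

    /path/to/dataset@1619628998699 -> /path/to/dataset/1619628998699
    /path/to/dataset               -> /path/to/dataset/<current timestamp>
    """
    match = re.fullmatch(r"(?P<base>.+)@(?P<ts>\d+)", path)
    if match:
        return f"{match.group('base')}/{match.group('ts')}"
    # No '@' suffix: fall back to the current time (millis since epoch).
    return f"{path}/{int(time.time() * 1000)}"
```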

Liveness always unknown

Liveness probes are configured in HelmRelease:
https://github.com/statisticsnorway/platform-dev/blame/f34599c16a067cf6e98d4264a3dc36149bdf8c39/flux/staging-bip-app/dapla/dapla-pseudo-service/dapla-pseudo-service.yaml#L64

But the liveness endpoint is not configured and always returns UNKNOWN:

GET health/liveness from jupyter.dapla-staging.ssb.no

curl http://dapla-pseudo-service.dapla.svc.cluster.local/health/liveness
{"status":"UNKNOWN"}

Without a working liveness probe, k8s can't accurately determine the health of the app.
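For illustration, a probe check should only treat an explicit UP status as healthy; the current UNKNOWN response gives Kubernetes nothing to act on. A minimal sketch of interpreting the health payload (Python, not part of the service itself):

```python
import json

def is_live(health_payload: str) -> bool:
    """Interpret a health JSON payload: only an explicit "UP" counts
    as live; "UNKNOWN" (the current response) does not."""
    return json.loads(health_payload).get("status") == "UP"
```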

Provide endpoint to monitor progress of an export-job

Users should be able to query the API and get back information about ongoing export jobs, such as:

  • status (in progress, done, failed, ...)
  • target path
  • username that triggered the job
  • some metrics (number of processed records, etc.)
  • any error messages

It should be possible to filter the list of jobs by status (e.g. to only see failed jobs).
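The record and filter semantics could be sketched like this (Python; all field names are illustrative assumptions, not an existing API contract):

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical shape of an export-job status record.
@dataclass
class ExportJob:
    job_id: str
    status: str                  # e.g. "IN_PROGRESS", "DONE", "FAILED"
    target_path: str
    username: str
    processed_records: int = 0
    error: Optional[str] = None

def filter_jobs(jobs: List[ExportJob], status: str) -> List[ExportJob]:
    """Server-side filtering by status, e.g. ?status=FAILED."""
    return [job for job in jobs if job.status == status]
```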

Export: deduce export destination path from dataset source path

Right now, we can specify an explicit target path for an exported dataset. Instead, we should derive the target path from the dataset source path.

Thus, when exporting /path/to/dataset, the final exported archive would be available at something like:

gs://bucketname/export/path/to/dataset/<timestamp-of-export>
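A sketch of the path derivation (Python; bucket name and timestamp format are placeholders):

```python
def export_target_path(bucket: str, source_path: str, export_ts: str) -> str:
    """Derive the export destination from the dataset source path,
    mirroring the gs://bucketname/export/... layout suggested above."""
    # source_path is expected to start with '/', e.g. /path/to/dataset
    return f"gs://{bucket}/export{source_path}/{export_ts}"
```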

Add version endpoint

We should have a version endpoint that could be used to see version information about the running pseudo service. This would be useful information to be displayed by the dapla-cli doctor command.
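As a sketch, the payload of such an endpoint could look like the following (every field name here is an assumption, not an agreed contract):

```python
# Hypothetical version payload; these keys only illustrate the kind of
# information the endpoint is meant to expose.
version_info = {
    "version": "1.2.3",
    "gitSha": "abc1234",
    "buildTime": "2021-04-28T13:33:42Z",
}
```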

Fix bug in OpenAPI docs for pseudonymize endpoints

Some endpoints show the wrong json return type in the OpenAPI documentation. Specifically they're missing the datadoc and metadata fields.

  • Check all endpoints for which ones have incorrect docs
  • Ideally derive the documentation from classes that are serialized to JSON so that OpenAPI documentation doesn't go out of sync

Export: support retrieving pseudo rules from another dataset path

It has been observed that pseudonymization rules are not present in the metadata of all datasets. There seems to be an issue that leads to pseudo rules not being copied from source datasets.

To mitigate this, we should optionally support deducing pseudonymization rules from another path (e.g. some "originating" source) - instead of the one we are exporting.

Export: support partial depseudonymization when deducing pseudo rules from dataset metadata

Right now, if we specify that depseudonymization should happen during export, it is applied for all provided pseudo rules. This is done using the pseudoRules parameter, which accepts a list of pseudo rules (name, pattern, func), each of which can match multiple fields.

In the export endpoint, if we don't explicitly specify which pseudo rules to use, we try to retrieve them from the dataset metadata. Deducing pseudo rules from the dataset metadata is presumably going to be the main use case. However, in these cases:

  • we have no mechanism to specify that only a subset of the rules should be applied
  • we have no mechanism to specify that only a subset of the fields should be depseudonymized

Thus, the suggestion is to introduce two new parameters: pseudoRulesFilter and pseudoFieldsFilter.

To summarize, depseudonymization during export would be specified by the following parameters:

  • pseudoRules - if not present, then deduce these from the dataset path
  • pseudoRulesPath - optional explicit path to deduce pseudo rules from (#2)
  • pseudoRulesFilter - a list of named pseudo rules that should be considered
  • pseudoFieldsFilter - a list of globs that address the fields that should be considered. This gives the user more control over which fields get depseudonymized, since a pseudo rule might match multiple fields
  • depseudo - whether or not the export should depseudonymize. Only required if pseudo rules should be deduced from the dataset path and no pseudo filters have been specified. If any of the above parameters is present, the export should assume this property to be true.
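The filtering semantics of the two proposed parameters could be sketched like this (Python; the rule shape and helper names are illustrative assumptions):

```python
from fnmatch import fnmatch
from typing import Dict, List

# Illustrative pseudo rule shape (name, pattern, func), per the issue.
Rule = Dict[str, str]

def apply_rules_filter(rules: List[Rule], rules_filter: List[str]) -> List[Rule]:
    """pseudoRulesFilter semantics: keep only the named rules."""
    wanted = set(rules_filter)
    return [rule for rule in rules if rule["name"] in wanted]

def apply_fields_filter(fields: List[str], fields_filter: List[str]) -> List[str]:
    """pseudoFieldsFilter semantics: keep only fields matching at least
    one glob, giving finer control than a rule's own pattern."""
    return [f for f in fields if any(fnmatch(f, glob) for glob in fields_filter)]
```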

Export: upload "export report" along with exported archives

When exporting a dataset, it would be useful to upload an export report that could be stored alongside the encrypted archive.

This report could include information such as:

  • Name of the user that exported the dataset
  • Export request parameters, such as the pseudo rules that were applied
  • Dataset metadata
  • Runtime properties, such as time elapsed
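A sketch of assembling such a report (Python; every key name is an illustrative assumption, not an agreed format):

```python
from datetime import datetime, timezone

def build_export_report(username, request_params, dataset_metadata, elapsed_seconds):
    """Assemble the suggested export report as a plain dict, ready to be
    serialized and uploaded alongside the encrypted archive."""
    return {
        "exportedBy": username,
        "requestParameters": request_params,
        "datasetMetadata": dataset_metadata,
        "runtime": {
            "elapsedSeconds": elapsed_seconds,
            "finishedAt": datetime.now(timezone.utc).isoformat(),
        },
    }
```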

Provide endpoint to pseudonymize files (csv, json, parquet, ...) from GCS to GCS

We currently have an export endpoint that depseudonymizes data from the product bucket and moves it to synk-ned. To become independent of Linux more quickly, it would be nice if we could also pseudonymize, e.g., from the source bucket and move the result to the product bucket. This would let us move files up from on-prem so that we can, for example, run table jobs on Dapla.
