statisticsnorway / dapla-dlp-pseudo-service
(De/)pseudonymization endpoints
License: MIT License
When exporting a dataset as CSV, the resulting fields are not ordered in the same way as in the original dataset.
Example:
Adresse;Lykketall;Postnr;Type;Skurk;Fødselsdato;Poststed;Id;Navn;charset;contentKey;createdDate;topic;description;contentLength;position;source;tag;contentType;dataset;resourceType;ulid;position;timestamp
Andedammen 13;13;3158;And;false;1934-06-09;ANDEBU;26913712456;oaÅ¢B-W9¢:%;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;001;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1BXJNMTA3T06D6V77T;001;1619628998699
Musestien 5;5;8723;Mus;false;1928-10-01;HUSBY;79288479608;P¿J±5ý]èý;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;002;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1H056GRD1R2KG09QW8;002;1619628998705
Langsvingen 11;11;1405;Hund;false;1932-05-25;LANGHUS;92239844400;Omß×Ry°Ö;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;003;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1MZTXXBAKKAZY9PA78;003;1619628998708
Svartvika 42;42;3158;Katt;true;1930-04-24;ANDEBU;10370827580;T0G%ÖÏ-Î0o(eª;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;004;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1NMDYBBVM89ZT375WZ;004;1619628998709
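One way to avoid scrambled columns is to always write CSV against an explicit, schema-ordered field list rather than relying on whatever order the record objects happen to carry. The sketch below is illustrative only (the service itself is not Python); record and field names are assumptions.

```python
import csv
import io

def export_csv(records, fieldnames):
    """Write records to CSV, preserving the caller-supplied column order.

    `fieldnames` should be the schema order of the original dataset;
    relying on each record's own key order is what can scramble columns
    when records are reconstructed independently during export.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter=";")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buf.getvalue()

# Columns come out in schema order even if the dicts were built differently:
rows = [{"Postnr": "3158", "Navn": "Donald", "Adresse": "Andedammen 13"}]
print(export_csv(rows, ["Adresse", "Postnr", "Navn"]))
```

The key design point: the field order is an input to the writer, taken from the source dataset's schema, not an emergent property of the export pipeline.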
The REST API has changed somewhat recently. The README examples should be updated to reflect these changes.
Right now, authorization to depseudonymize a dataset is granted by matching the user's username against a static list of trusted users configured in the dapla-pseudo-service. Instead, the dapla-pseudo-service should check whether the user has a special DEPSEUDO privilege associated with the dataset that is requested to be depseudonymized.
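A minimal sketch of the proposed check, assuming a hypothetical privilege store keyed on (user, privilege) that maps to the dataset path prefixes a grant covers; none of these names come from the actual service.

```python
def can_depseudonymize(user, dataset_path, privileges):
    """Check a dataset-scoped DEPSEUDO privilege instead of a static
    trusted-users list.

    `privileges` is a hypothetical mapping of
    (user, privilege) -> set of dataset path prefixes the grant covers.
    """
    granted_paths = privileges.get((user, "DEPSEUDO"), set())
    return any(
        dataset_path == p or dataset_path.startswith(p.rstrip("/") + "/")
        for p in granted_paths
    )

privs = {("alice", "DEPSEUDO"): {"/path/to/dataset"}}
print(can_depseudonymize("alice", "/path/to/dataset", privs))  # True
print(can_depseudonymize("bob", "/path/to/dataset", privs))    # False
```

The important shift is that the grant is scoped to the dataset being requested, not to the user globally.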
In the API, when specifying a dataset path, we currently need a separate property to denote the timestamp. Instead of requiring two different properties, we could represent both with a short format, such as:

/path/to/dataset              <-- which translates to /path/to/dataset/<current timestamp>
/path/to/dataset@<timestamp>  <-- which translates to /path/to/dataset/<timestamp>
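The translation rule above can be sketched in a few lines; this is an illustrative parser under the stated convention, not the service's actual implementation.

```python
def resolve_dataset_path(path, now_timestamp):
    """Expand the short '@' notation into an explicit versioned path.

    /path/to/dataset              -> /path/to/dataset/<now_timestamp>
    /path/to/dataset@<timestamp>  -> /path/to/dataset/<timestamp>
    """
    if "@" in path:
        base, _, timestamp = path.rpartition("@")
        return f"{base}/{timestamp}"
    return f"{path}/{now_timestamp}"

print(resolve_dataset_path("/path/to/dataset", "1619628998699"))
# /path/to/dataset/1619628998699
print(resolve_dataset_path("/path/to/dataset@1577836800000", "ignored"))
# /path/to/dataset/1577836800000
```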
The export service is in many cases used to debug pseudonymization join issues. Some datasets are BIG, and in these cases it would be nice to export only a subset of the complete dataset (e.g. the first n records).
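Limiting to the first n records can be done by streaming rather than materializing the dataset; a small sketch, with an assumed `limit` parameter:

```python
from itertools import islice

def export_subset(records, limit=None):
    """Yield at most `limit` records; None means export everything.

    Streaming with islice avoids loading a big dataset into memory
    just to inspect its first few records.
    """
    return islice(records, limit) if limit is not None else iter(records)

print(list(export_subset(range(1_000_000), limit=3)))  # [0, 1, 2]
```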
Liveness probes are configured in HelmRelease:
https://github.com/statisticsnorway/platform-dev/blame/f34599c16a067cf6e98d4264a3dc36149bdf8c39/flux/staging-bip-app/dapla/dapla-pseudo-service/dapla-pseudo-service.yaml#L64
But the liveness endpoint is not configured and always returns UNKNOWN:

# from jupyter.dapla-staging.ssb.no
$ curl http://dapla-pseudo-service.dapla.svc.cluster.local/health/liveness
{"status":"UNKNOWN"}
Without the liveness probe k8s can't accurately determine the health of the app.
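For reference, a standard Kubernetes liveness probe stanza looks like the fragment below; the path and port here are assumptions, not the actual HelmRelease values.

```yaml
# Illustrative probe config; the probe is only useful once the
# application wires up a liveness indicator so the endpoint reports
# UP/DOWN instead of UNKNOWN.
livenessProbe:
  httpGet:
    path: /health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
```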
Users should be able to query the API and get back information about ongoing export jobs. It should also be possible to filter the list of jobs by status (e.g. to see only failed jobs).
Right now, we can specify an explicit target path for an exported dataset. Instead, the target path should be derived from the dataset's source path. Thus, when exporting /path/to/dataset, the final exported archive would be available at something like:

gs://bucketname/export/path/to/dataset/<timestamp-of-export>
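The derivation rule above amounts to simple string composition; a sketch, with bucket name and path layout assumed from the example:

```python
def export_target_path(bucket, source_path, export_timestamp):
    """Derive the archive location from the dataset's source path
    instead of accepting an arbitrary, user-supplied target path.
    """
    return f"gs://{bucket}/export{source_path}/{export_timestamp}"

print(export_target_path("bucketname", "/path/to/dataset", "1619628998699"))
# gs://bucketname/export/path/to/dataset/1619628998699
```

Deriving the target from the source also makes exports discoverable: given a dataset path, the archive location is predictable.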
We should have a version endpoint that could be used to see version information about the running pseudo service. This would be useful information to be displayed by the dapla-cli doctor command.
Some endpoints show the wrong JSON return type in the OpenAPI documentation. Specifically, they're missing the datadoc and metadata fields.
It has been observed that pseudonymization rules are not present in the metadata of all datasets. There seems to be an issue that leads to pseudo rules not being copied from source datasets.
To mitigate this, we should optionally support deducing pseudonymization rules from another path (e.g. some "originating" source) - instead of the one we are exporting.
Right now, if we specify that we should depseudonymize during export, then depseudonymization is applied for all pseudo rules that are provided. This is done using the pseudoRules parameter, which accepts a list of pseudo rules (name, pattern, func), each of which potentially matches multiple fields.
In the export endpoint, if we don't explicitly specify which pseudo rules to use, we try to retrieve them from the dataset metadata. Deducing pseudo rules from the dataset metadata is presumably going to be the main use case. However, in some cases more fine-grained control is needed.
Thus, the suggestion is to introduce two new parameters: pseudoRulesFilter and pseudoFieldsFilter.

To summarize, depseudonymization during export would be specified by the following parameters:

pseudoRules - if not present, then deduce these from the dataset path
pseudoRulesPath - optional explicit path to deduce pseudo rules from (#2)
pseudoRulesFilter - a list of named pseudo rules that should be considered
pseudoFieldsFilter - a list of globs addressing the fields that should be considered. This allows the user more control over which fields get depseudonymized, since a pseudo rule might match multiple fields
depseudo - whether or not the export should depseudonymize. Only required if pseudo rules should be deduced from the dataset path and no pseudo filters have been specified. If any of the above parameters are present, the export should assume this property to be true.

When exporting a dataset, it would be useful to upload an export report that could be stored alongside the encrypted archive. This report could include information about the export.
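The parameter resolution described above (implied depseudo flag, glob-based field filtering) can be sketched as follows; the parameter names come from the list above, but the resolution logic itself is an assumption about intended behavior, not the service's implementation.

```python
from fnmatch import fnmatch

def should_depseudo(params):
    """The depseudo flag is implied whenever any pseudo parameter is given."""
    implied = any(params.get(k) for k in
                  ("pseudoRules", "pseudoRulesPath",
                   "pseudoRulesFilter", "pseudoFieldsFilter"))
    return bool(params.get("depseudo")) or implied

def select_fields(fields, field_globs):
    """Keep only fields matched by at least one glob in pseudoFieldsFilter;
    an empty filter means all fields are candidates."""
    if not field_globs:
        return list(fields)
    return [f for f in fields if any(fnmatch(f, g) for g in field_globs)]

print(should_depseudo({"pseudoRulesFilter": ["fodselsnummer"]}))  # True
print(select_fields(["Id", "Navn", "Adresse"], ["Nav*", "Id"]))   # ['Id', 'Navn']
```

The glob filter is what gives users per-field control when a single pseudo rule's pattern matches more fields than they want to depseudonymize.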
We currently have an export endpoint that depseudonymizes data from the product bucket and moves it to synk-ned. To become independent of Linux faster, it would be nice if we could also pseudonymize, e.g. from the source bucket, and move the result to the product bucket. This would let us move files up from the ground so that we could, for example, run table assignments (tabelloppdrag) on Dapla.