statisticsnorway / dapla-dlp-pseudo-service
(De/)pseudonymization endpoints
License: MIT License
When exporting a dataset as CSV, the resulting fields are not ordered in the same way as in the original dataset.
Example:
Adresse;Lykketall;Postnr;Type;Skurk;Fødselsdato;Poststed;Id;Navn;charset;contentKey;createdDate;topic;description;contentLength;position;source;tag;contentType;dataset;resourceType;ulid;position;timestamp
Andedammen 13;13;3158;And;false;1934-06-09;ANDEBU;26913712456;oaÅ¢B-W9¢:%;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;001;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1BXJNMTA3T06D6V77T;001;1619628998699
Musestien 5;5;8723;Mus;false;1928-10-01;HUSBY;79288479608;P¿J±5ý]èý;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;002;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1H056GRD1R2KG09QW8;002;1619628998705
Langsvingen 11;11;1405;Hund;false;1932-05-25;LANGHUS;92239844400;Omß×Ry°Ö;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;003;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1MZTXXBAKKAZY9PA78;003;1619628998708
Svartvika 42;42;3158;Katt;true;1930-04-24;ANDEBU;10370827580;T0G%ÖÏ-Î0o(eª;UTF-8;entry;2021-04-28T13:33:42.567860227Z;disney;Karakterer i Disney-universet;42;004;andeby.fandom.com;test;text/csv;disney;entry;01F4CPJG1NMDYBBVM89ZT375WZ;004;1619628998709
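One way to avoid scrambled columns is to always write CSV against an explicit, schema-ordered field list rather than relying on whatever order the record objects happen to carry. The sketch below is illustrative only (the service itself is not Python); record and field names are assumptions.

```python
import csv
import io

def export_csv(records, fieldnames):
    """Write records to CSV, preserving the caller-supplied column order.

    `fieldnames` should be the schema order of the original dataset;
    relying on each record's own key order is what can scramble columns
    when records are reconstructed independently during export.
    """
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, delimiter=";")
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buf.getvalue()

# Columns come out in schema order even if the dicts were built differently:
rows = [{"Postnr": "3158", "Navn": "Donald", "Adresse": "Andedammen 13"}]
print(export_csv(rows, ["Adresse", "Postnr", "Navn"]))
```

The key design point: the field order is an input to the writer, taken from the source dataset's schema, not an emergent property of the export pipeline.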
The REST API has changed somewhat recently. The README examples should be updated to reflect these changes.
Right now, authorization to depseudonymize a dataset is granted by matching the user's username against a static list of trusted users configured in the dapla-pseudo-service. Instead, the dapla-pseudo-service should check whether the user has a special DEPSEUDO privilege associated with the dataset that is requested to be depseudonymized.
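A minimal sketch of the proposed check, assuming a hypothetical privilege store keyed on (user, privilege) that maps to the dataset path prefixes a grant covers; none of these names come from the actual service.

```python
def can_depseudonymize(user, dataset_path, privileges):
    """Check a dataset-scoped DEPSEUDO privilege instead of a static
    trusted-users list.

    `privileges` is a hypothetical mapping of
    (user, privilege) -> set of dataset path prefixes the grant covers.
    """
    granted_paths = privileges.get((user, "DEPSEUDO"), set())
    return any(
        dataset_path == p or dataset_path.startswith(p.rstrip("/") + "/")
        for p in granted_paths
    )

privs = {("alice", "DEPSEUDO"): {"/path/to/dataset"}}
print(can_depseudonymize("alice", "/path/to/dataset", privs))  # True
print(can_depseudonymize("bob", "/path/to/dataset", privs))    # False
```

The important shift is that the grant is scoped to the dataset being requested, not to the user globally.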
In the API, when specifying a dataset path, we currently need a separate property to denote the timestamp. Instead of requiring two different properties, we could represent both with a short format, such as:

/path/to/dataset              <-- which translates to /path/to/dataset/<current timestamp>
/path/to/dataset@<timestamp>  <-- which translates to /path/to/dataset/<timestamp>
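The translation rule above can be sketched in a few lines; this is an illustrative parser under the stated convention, not the service's actual implementation.

```python
def resolve_dataset_path(path, now_timestamp):
    """Expand the short '@' notation into an explicit versioned path.

    /path/to/dataset              -> /path/to/dataset/<now_timestamp>
    /path/to/dataset@<timestamp>  -> /path/to/dataset/<timestamp>
    """
    if "@" in path:
        base, _, timestamp = path.rpartition("@")
        return f"{base}/{timestamp}"
    return f"{path}/{now_timestamp}"

print(resolve_dataset_path("/path/to/dataset", "1619628998699"))
# /path/to/dataset/1619628998699
print(resolve_dataset_path("/path/to/dataset@1577836800000", "ignored"))
# /path/to/dataset/1577836800000
```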
The export service is in many cases used to debug pseudonymization join issues. Some datasets are BIG, and in these cases it would be nice to export only a subset of the complete dataset (e.g. the first n records).
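Limiting to the first n records can be done by streaming rather than materializing the dataset; a small sketch, with an assumed `limit` parameter:

```python
from itertools import islice

def export_subset(records, limit=None):
    """Yield at most `limit` records; None means export everything.

    Streaming with islice avoids loading a big dataset into memory
    just to inspect its first few records.
    """
    return islice(records, limit) if limit is not None else iter(records)

print(list(export_subset(range(1_000_000), limit=3)))  # [0, 1, 2]
```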
Liveness probes are configured in HelmRelease:
https://github.com/statisticsnorway/platform-dev/blame/f34599c16a067cf6e98d4264a3dc36149bdf8c39/flux/staging-bip-app/dapla/dapla-pseudo-service/dapla-pseudo-service.yaml#L64
But the liveness endpoint is not configured and always returns UNKNOWN:

# from jupyter.dapla-staging.ssb.no
$ curl http://dapla-pseudo-service.dapla.svc.cluster.local/health/liveness
{"status":"UNKNOWN"}
Without the liveness probe k8s can't accurately determine the health of the app.
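For reference, a standard Kubernetes liveness probe stanza looks like the fragment below; the path and port here are assumptions, not the actual HelmRelease values.

```yaml
# Illustrative probe config; the probe is only useful once the
# application wires up a liveness indicator so the endpoint reports
# UP/DOWN instead of UNKNOWN.
livenessProbe:
  httpGet:
    path: /health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
```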
Users should be able to query the API and get back information about ongoing export jobs. It should also be possible to filter the list of jobs by status (e.g. to see only failed jobs).
Right now, we can specify an explicit target path for an exported dataset. Instead, the target path should be derived from the dataset's source path. Thus, when exporting /path/to/dataset, the final exported archive would be available at something like:

gs://bucketname/export/path/to/dataset/<timestamp-of-export>
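The derivation rule above amounts to simple string composition; a sketch, with bucket name and path layout assumed from the example:

```python
def export_target_path(bucket, source_path, export_timestamp):
    """Derive the archive location from the dataset's source path
    instead of accepting an arbitrary, user-supplied target path.
    """
    return f"gs://{bucket}/export{source_path}/{export_timestamp}"

print(export_target_path("bucketname", "/path/to/dataset", "1619628998699"))
# gs://bucketname/export/path/to/dataset/1619628998699
```

Deriving the target from the source also makes exports discoverable: given a dataset path, the archive location is predictable.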
We should have a version endpoint that could be used to see version information about the running pseudo service. This would be useful information to be displayed by the dapla-cli doctor command.
Some endpoints show the wrong JSON return type in the OpenAPI documentation. Specifically, they're missing the datadoc and metadata fields.
It has been observed that pseudonymization rules are not present in the metadata of all datasets. There seems to be an issue that leads to pseudo rules not being copied from source datasets.
To mitigate this, we should optionally support deducing pseudonymization rules from another path (e.g. some "originating" source) - instead of the one we are exporting.
Right now, if we specify that we should depseudonymize during export, then depseudonymization is applied for all pseudo rules that are provided. This is done using the pseudoRules parameter, which accepts a list of pseudo rules (name, pattern, func), each of which potentially matches multiple fields.
In the export endpoint, if we don't explicitly specify which pseudo rules to use, we try to retrieve them from the dataset metadata. Deducing pseudo rules from the dataset metadata is presumably going to be the main use case. However, in some cases more fine-grained control is needed.
Thus, the suggestion is to introduce two new parameters: pseudoRulesFilter and pseudoFieldsFilter.

To summarize, depseudonymization during export would be specified by the following parameters:

pseudoRules - if not present, then deduce these from the dataset path
pseudoRulesPath - optional explicit path to deduce pseudo rules from (#2)
pseudoRulesFilter - a list of named pseudo rules that should be considered
pseudoFieldsFilter - a list of globs addressing the fields that should be considered. This allows the user more control over which fields get depseudonymized, since a pseudo rule might match multiple fields
depseudo - whether or not the export should depseudonymize. Only required if pseudo rules should be deduced from the dataset path and no pseudo filters have been specified. If any of the above parameters are present, the export should assume this property to be true.

When exporting a dataset, it would be useful to upload an export report that could be stored alongside the encrypted archive. This report could include information about the export.
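The parameter resolution described above (implied depseudo flag, glob-based field filtering) can be sketched as follows; the parameter names come from the list above, but the resolution logic itself is an assumption about intended behavior, not the service's implementation.

```python
from fnmatch import fnmatch

def should_depseudo(params):
    """The depseudo flag is implied whenever any pseudo parameter is given."""
    implied = any(params.get(k) for k in
                  ("pseudoRules", "pseudoRulesPath",
                   "pseudoRulesFilter", "pseudoFieldsFilter"))
    return bool(params.get("depseudo")) or implied

def select_fields(fields, field_globs):
    """Keep only fields matched by at least one glob in pseudoFieldsFilter;
    an empty filter means all fields are candidates."""
    if not field_globs:
        return list(fields)
    return [f for f in fields if any(fnmatch(f, g) for g in field_globs)]

print(should_depseudo({"pseudoRulesFilter": ["fodselsnummer"]}))  # True
print(select_fields(["Id", "Navn", "Adresse"], ["Nav*", "Id"]))   # ['Id', 'Navn']
```

The glob filter is what gives users per-field control when a single pseudo rule's pattern matches more fields than they want to depseudonymize.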
We currently have an export endpoint that depseudonymizes data from the product bucket and moves it to synk-ned. To become independent of Linux faster, it would be nice if we could also pseudonymize, e.g. from the source bucket, and move the result to the product bucket. This would let us move files up from the ground so that we could, for example, run table assignments (tabelloppdrag) on Dapla.