getstrm / pace

33 stars · 4 watchers · 1 fork · 13.4 MB

Data policy IN, dynamic view OUT: PACE is the Policy As Code Engine. It lets you programmatically create and apply a data policy to a processing platform such as Databricks, Snowflake, or BigQuery (or plain ol' Postgres, even!), with definitions imported from Collibra, DataHub, ODD, and the like.

Home Page: https://pace.getstrm.com

License: Apache License 2.0

Shell 0.23% Makefile 1.94% Dockerfile 0.13% Kotlin 97.28% Python 0.41%
data-catalog data-processing policy-enforcement bigquery data-contracts data-governance databricks snowflake

pace's People

Contributors

astronomous, bobvandenhoogen, bvdeenen, getstrmbot, ivan-p92, renovate[bot], trietsch


Forkers

ledbutter

pace's Issues

[PACE-2] Consider using template methods for transforms

In other words, let specific SQL converters implement (or override) transforms defined in the abstract SQL generator. This way, we could support a hashing function for Postgres and/or BigQuery.

Note: the built-in BigQuery hash function is limited to string or byte-array input, and byte-array output. The output can be cast back to a string, but that means a principal seeing the hashed value and a principal seeing the original value would get different data types. So this could be trickier than expected.

For Postgres, it looks like there are different hashing functions depending on the type. This is perhaps also the case for Snowflake and Databricks; maybe we just haven't run into it with our sample policies.

I say "consider" in the title of this ticket as you could argue that users could use the SQL statement transform to do this instead (although the principal data type issue could happen there too).

PACE-2

[documentation] provide generated view definitions in examples/tutorials

Problem description
While the provided tutorials show the configuration of policies quite well, it's not very clear how the resulting view is implemented on each processing platform. E.g., does PACE use the Databricks UC built-in masking feature, or is the view implemented in pure SQL?

Proposed solution
For all tutorials, include the generated view definitions (next to the result table shown per principal, which IS cool).

Additional context
PACE looks really promising; looking forward to seeing it evolve!

[PACE-6] Better hierarchical entity support for data catalogs

Collibra uses a classic hierarchical database-schema-table model, whereas other data catalogs, such as DataHub and Open Data Discovery, have a flat model. We should ensure that the latter fit the model better, and that a schema returns all tables of the data catalog (instead of only a single one, as it currently does).
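
A hedged Kotlin sketch of one way to adapt flat catalogs (the entity types are hypothetical; the real ones live in PACE's proto definitions): wrap all datasets in a single synthetic database and schema, so that listing tables returns every dataset.

data class Table(val id: String, val name: String)
data class Schema(val id: String, val name: String, val tables: List<Table>)
data class Database(val id: String, val name: String, val schemas: List<Schema>)

// Flat catalogs (DataHub, ODD) expose datasets without a database/schema
// hierarchy: wrap them all in one synthetic database and schema so that
// the schema returns all tables instead of a single one.
fun fromFlatCatalog(catalogId: String, datasets: List<Table>): Database {
    val schema = Schema(id = catalogId, name = "default", tables = datasets)
    return Database(id = catalogId, name = catalogId, schemas = listOf(schema))
}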

PACE-6

Handle Databricks Pending responses

The current client translates Databricks' pending response into an error. This happens, for instance, when getting a data policy:

pace get data-policy --processing-platform dbr-pace pace.alpha_test.gddemo
Error code = Internal
Details = An error occurred in call ...ProcessingPlatformsService/GetBlueprintPolicy.

Do we loop? Or pass it on to the CLI ("try again later")?

This response might mean that the compute is busy (or absent!), so some retry logic needs to be defined.
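
One possible shape for that logic, as a Kotlin sketch using kotlinx-coroutines (the result type and fetch function are hypothetical): poll a bounded number of times with exponential backoff before surfacing the error.

import kotlinx.coroutines.delay

// Hypothetical result wrapper: Databricks can answer with a pending
// status while the SQL warehouse is still starting up (or absent).
sealed interface PolicyResult {
    data class Ready(val policy: String) : PolicyResult
    object Pending : PolicyResult
}

// Retry with exponential backoff instead of translating a pending status
// straight into an Internal error; after maxAttempts the caller (e.g. the
// CLI) decides what to tell the user ("try again later").
suspend fun getPolicyWithRetry(
    maxAttempts: Int = 5,
    initialDelayMs: Long = 1_000,
    fetch: suspend () -> PolicyResult,
): PolicyResult {
    var delayMs = initialDelayMs
    repeat(maxAttempts - 1) {
        val result = fetch()
        if (result is PolicyResult.Ready) return result
        delay(delayMs)
        delayMs *= 2
    }
    return fetch()
}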

From SyncLinear.com | PACE-35

[PACE-23] Determine strategy for overlapping transforms in GlobalTransforms / Data Policies

The global transforms belonging to various tags (or a Data Policy's transforms combined with those defined in the Global Transforms) can overlap or collide. We should decide which strategy to apply in this situation. Multiple options:

  1. First one wins (determined by the order in which the tags are returned by the processing platform / data catalog) -> possibly leads to unexpected / unwanted policies.
  2. Merge (how?)
  3. Strict mode -> refuse to generate a bare policy with global transforms and let the end user first remove the overlap.

The UI will be very helpful here: showing whether the GlobalTransform is compatible with the Data Policy, and if not, what the conflicts are.
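
For illustration, a Kotlin sketch of option 3 (strict mode), with a deliberately simplified transform model: detect transforms that target the same field and refuse to merge.

// Deliberately simplified: a transform targets exactly one field.
data class FieldTransform(val field: String, val source: String)

// Strict mode: refuse to combine transform sets when two transforms
// target the same field, and report the conflicting fields.
fun mergeStrict(
    policyTransforms: List<FieldTransform>,
    globalTransforms: List<FieldTransform>,
): List<FieldTransform> {
    val combined = policyTransforms + globalTransforms
    val conflicts = combined.groupBy { it.field }.filterValues { it.size > 1 }
    require(conflicts.isEmpty()) {
        "Conflicting transforms for field(s): ${conflicts.keys.joinToString()}"
    }
    return combined
}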

From SyncLinear.com | PACE-23

[PACE-57] Better hierarchical entity support for OpendataDiscovery

The current code essentially uses a list-all-tables query and creates one PACE database entry per table. This makes no sense and does not scale. I've looked at the ODD RPC results, and database-level information is available.

# essentially shows every dataset in ODD as a database
pace list databases --catalog odd -o json | jq -r '.databases[]|[.id,.display_name,.type]|@csv'
"111","CATALOG_RETURNS","Snowflake Sample Data"
"110","PARTSUPP","Snowflake Sample Data"
"109","CUSTOMER","Snowflake Sample Data"
"108","STORE_RETURNS","Snowflake Sample Data"
"107","WEB_SITE","Snowflake Sample Data"
"106","WEB_RETURNS","Snowflake Sample Data"
"105","PROMOTION","Snowflake Sample Data"
"104","CUSTOMER","Snowflake Sample Data"
"103","REGION","Snowflake Sample Data"
"102","CATALOG_RETURNS","Snowflake Sample Data"
"101","CATALOG_SALES","Snowflake Sample Data"
"100","HOURLY_16_TOTAL","Snowflake Sample Data"
"99","CALL_CENTER","Snowflake Sample Data"
"98","WEB_SITE","Snowflake Sample Data"
"97","REGION","Snowflake Sample Data"
"96","CUSTOMER_DEMOGRAPHICS","Snowflake Sample Data"

...

# shows one schema identical to the table name
pace list schemas --catalog odd --database 4
schemas:
- database:
    catalog:
      id: odd
      type: ODD
    display_name: sales_denorm
    id: "4"
    type: BookShop Data Lake
  id: "4"
  name: BookShop Data Lake

# And only shows one table
pace list tables --catalog odd --database 4 --schema 4
tables:
- id: "4"
  name: BookShop Data Lake
  schema:
    database:
      catalog:
        id: odd
        type: ODD
      display_name: sales_denorm
      id: "4"
      type: BookShop Data Lake
    id: "4"
    name: BookShop Data Lake

From SyncLinear.com | PACE-57

Improve consistency of GetBlueprintPolicy requests

Problem description

The GetBlueprintPolicyRequest messages are no longer consistent with ResourceUrn.

Processing platforms:

message GetBlueprintPolicyRequest {
  string platform_id = 1 [(buf.validate.field).string = {min_len: 1}];
  entities.v1alpha.Table table = 2;
  optional string fqn = 3;
}

Data catalogs:

message GetBlueprintPolicyRequest {
  string catalog_id = 1 [(buf.validate.field).string = {min_len: 1}];
  optional string database_id = 2;
  optional string schema_id = 3;
  string table_id = 4 [(buf.validate.field).string = {min_len: 1}];
  string fqn = 5;
}

Proposed solution
Consider refactoring the requests; one possible direction is sketched below.
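
A hedged Kotlin-side sketch of that direction (names hypothetical): both request types carry a single resource reference instead of diverging field sets.

// Hypothetical unified reference, replacing the diverging
// platform_id/table vs. catalog_id/database_id/schema_id/table_id fields.
data class ResourceUrn(
    val integrationId: String,        // platform_id or catalog_id
    val resourcePath: List<String>,   // e.g. [database, schema, table]
    val fqn: String? = null,          // optional fully qualified name
)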

[PACE-66] Run the standalone quickstart in PR builds

In order to test whether modifications to the PACE app still result in a working standalone quickstart, we should:

  1. Start all relevant containers (a PACE Postgres DB and a Postgres processing platform, but not the PACE app; that should be the app built in an earlier step of the CI)
  2. Perform all quickstart actions, i.e.
    1. Create the Data Policy
    2. Apply the Data Policy
  3. Connect to the Postgres processing platform as each of the users mentioned in the quickstart:
    1. Query the view
    2. Assert the result set per user to see if the contents are correct

From SyncLinear.com | PACE-66

[PACE-55] Improve hash transform

The current hash transform is implemented in the platform-agnostic DefaultProcessingPlatformTransformer. There are a couple of issues with it:

  • The hashing functions differ across platforms, e.g. in their return types.
  • Hashing functions aren't type-stable: the output is typically a number or bytes, regardless of the input type. This means that if a policy returns the original value to some principals but the hashed value to others, a data type mismatch occurs.

The first issue can be resolved by creating platform-specific implementations, which is straightforward with the recent ProcessingPlatformTransformer interface.

The second issue requires additional thought; the hashed value could perhaps be cast back to the original data type. In the meantime, platform-specific hashing can be implemented using the SQL Statement transform. A sketch of the platform-specific direction follows below.
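
A Kotlin sketch of per-platform hash rendering (function and platform names illustrative only): every variant normalizes its output to a string, sidestepping the differing return types. Casting back to numeric originals would still need extra handling.

// Hypothetical per-platform hash rendering; all variants yield a string.
fun hashExpression(platform: String, field: String): String =
    when (platform) {
        "BIGQUERY" -> "TO_HEX(SHA256(CAST($field AS STRING)))" // SHA256 returns BYTES
        "SNOWFLAKE" -> "SHA2($field)"                          // returns a hex VARCHAR
        "POSTGRES" -> "md5($field::text)"                      // returns text
        else -> error("no hash transform implemented for $platform")
    }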

PACE-55

[PACE-76] Change Data Policy primary key

A Data Policy is currently unique on its id and platform_id (and version), but this is only enforced in the database.

To ensure consistency across the API and database, @bvdeenen and @trietsch propose the following change:

  • Remove platform_id from the data policy primary key in the database migration
  • A Data Policy will be unique on the id and the version
  • If the version is omitted, the latest version is assumed
  • In RPCs, id and version should be included
  • A Data Policy reference message (including id and version) or a URN should be introduced to easily refer to a single unique Data Policy (see the sketch below).
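
A minimal Kotlin sketch of the proposed reference (names hypothetical), where an omitted version resolves to the latest:

// Hypothetical reference message: id plus optional version.
data class DataPolicyRef(val id: String, val version: Int? = null)

// Resolution sketch: a missing version means "latest".
fun resolve(ref: DataPolicyRef, versions: Map<Int, String>): String? =
    if (ref.version == null) versions.maxByOrNull { it.key }?.value
    else versions[ref.version]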

From SyncLinear.com | PACE-76

[PACE-13] Regex capturing group number escaping with `\` is very error prone

Since the regex replacement string (e.g. ***\1; many platforms use a backslash to refer to capturing group numbers) passes through many different layers, escaping often goes wrong and takes many attempts ("how many backslashes do I need this time?").

I propose using our own replacement character, one that is less sensitive to escaping issues, such as $. This is what would be used in the YAML / proto message representations, and it would be translated to the correct number of \ characters when creating a view on the processing platform; a sketch follows below.
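
A Kotlin sketch of the translation step (assuming backslash-style group references on the target platform; the function name is hypothetical):

// Translate user-facing "$1"-style group references to the backslash
// style many platforms expect, e.g. "***$1" -> "***\1".
// A "$" preceded by a backslash is treated as escaped and left alone.
fun toPlatformReplacement(replacement: String): String =
    Regex("(?<!\\\\)\\$(\\d+)").replace(replacement) { m -> "\\" + m.groupValues[1] }

// toPlatformReplacement("***$1") == "***\1"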

PACE-13

[PACE-15] Add apiVersion and kind to all YAML representations of Proto messages

An example of a Data Policy that includes version and kind:

apiVersion: pace.getstrm.com/v1alpha
kind: DataPolicy
metadata:
  name: "alpha_test.demo"
platform:
...

This would enable us to operate multiple API versions simultaneously. It is a client-side feature, as the fields are currently not embedded in the protos themselves; the client should read them to determine which RPC to call and whether the version is still supported (a sketch follows below).
Another option would be to include the fields in the protos and enforce them server-side.
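
A client-side sketch in Kotlin (hypothetical; a real implementation would use a YAML parser): read the two top-level fields before deciding which RPC to call.

// Hypothetical client-side header check; scans only top-level keys.
data class TypedResource(val apiVersion: String, val kind: String)

fun readHeader(yaml: String): TypedResource {
    val topLevel = yaml.lineSequence()
        .filterNot { it.startsWith(" ") || it.startsWith("\t") }
        .mapNotNull { line ->
            val parts = line.split(":", limit = 2)
            if (parts.size == 2) parts[0].trim() to parts[1].trim() else null
        }
        .toMap()
    return TypedResource(
        apiVersion = topLevel["apiVersion"] ?: error("missing apiVersion"),
        kind = topLevel["kind"] ?: error("missing kind"),
    )
}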

PACE-15

[PACE-37] Move all `toCase` ViewGeneratorTests to TransformerTests

Currently, the ViewGeneratorTests contain tests both for full CREATE VIEW SQL statements and for individual toCase statements. Now that we've introduced ProcessingPlatformTransformers, we can move the toCase tests to Transformer test classes. This also means that toCase could be made private in the ProcessingPlatformViewGenerator.

PACE-37

[core]: ensure it's possible to extend an existing rule set with a global transform

Problem description
At the moment, global tag transforms result in a new (blueprint) data policy. We should make it possible to extend existing rule sets; that way, one can specify both tags and additional filters/transforms in, for example, DBT model metadata.

Proposed solution
Add a function that takes a rule set or data policy and adds the tag-based transforms to the existing rule set(s). Some validation/merge logic may be needed; a sketch follows below.
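
A hedged Kotlin sketch of such a function (deliberately simplified types): keep the rule set's own transforms and add tag-derived ones only for fields not yet covered, one possible merge strategy among several.

// Deliberately simplified model.
data class FieldTransform(val field: String, val expression: String)
data class RuleSet(val transforms: List<FieldTransform>)

// Extend an existing rule set with tag-derived transforms; here the rule
// set's own transforms win on conflicts (one possible strategy; see
// PACE-23 for the broader discussion of overlapping transforms).
fun extendWithGlobalTransforms(
    ruleSet: RuleSet,
    tagTransforms: List<FieldTransform>,
): RuleSet {
    val covered = ruleSet.transforms.map { it.field }.toSet()
    return ruleSet.copy(
        transforms = ruleSet.transforms + tagTransforms.filterNot { it.field in covered },
    )
}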

[PACE-58] Better hierarchical entity support for Datahub

pace list databases --catalog datahub-on-dev -o json | jq '.databases[]|.display_name,.id'
null
"urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:s3,project/root/events/logging_events_bckp,PROD)"

From SyncLinear.com | PACE-58

[PACE-19] Add support for data retention

When implementing a data policy, I would like to specify a retention period, after which data should be omitted from the created view.

Details:

  • In the rule sets, add support for retention
  • Take a specified retention period (in days), indicated by a tag
  • When building the dynamic view, check whether a row is inside or outside the retention period
  • Design choice: checks should be conducted against the creation (not update) timestamp of each row
  • If outside the retention period, filter out the row (see the sketch below)
  • Question: would it be possible to delete the source data altogether, instead of only filtering it out of the view?
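
A sketch of the retention filter rendering (Kotlin generating Postgres-flavoured SQL; the column name and interval syntax are assumptions and vary per platform):

// Rows older than the tag-specified retention period (in days) are
// filtered out of the generated view, based on a creation timestamp.
fun retentionFilter(createdAtColumn: String, retentionDays: Int): String =
    "$createdAtColumn > CURRENT_TIMESTAMP - INTERVAL '$retentionDays days'"

// retentionFilter("created_at", 30)
//   -> "created_at > CURRENT_TIMESTAMP - INTERVAL '30 days'"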

PACE-19

[PACE-5] Collibra databases don't show full name

pace list databases --catalog COLLIBRA-testdrive
databases:
- catalog:
    id: COLLIBRA-testdrive
    type: COLLIBRA
  id: 8665f375-e08a-4810-add6-7af490f748ad
  type: Snowflake
- catalog:
    id: COLLIBRA-testdrive
    type: COLLIBRA
  id: 99379294-6e87-4e26-9f09-21c6bf86d415
  type: CData JDBC Driver for Google BigQuery 2021
- catalog:
    id: COLLIBRA-testdrive
    type: COLLIBRA
  id: b6e043a7-88f1-42ee-8e81-0fdc1c96f471
  type: Snowflake

The field displayName is not filled in; it should contain something useful.

From SyncLinear.com | PACE-5
