getstrm / pace

33 stars · 4 watchers · 1 fork · 13.4 MB

Data policy IN, dynamic view OUT: PACE is the Policy As Code Engine. It lets you programmatically create and apply a data policy to a processing platform such as Databricks, Snowflake, or BigQuery (or plain ol' Postgres, even!), with definitions imported from Collibra, DataHub, ODD, and the like.

Home Page: https://pace.getstrm.com

License: Apache License 2.0

Shell 0.23% Makefile 1.94% Dockerfile 0.13% Kotlin 97.28% Python 0.41%
data-catalog data-processing policy-enforcement bigquery data-contracts data-governance databricks snowflake

pace's People

Contributors

astronomous, bobvandenhoogen, bvdeenen, getstrmbot, ivan-p92, renovate[bot], trietsch


Forkers

ledbutter

pace's Issues

[PACE-2] Consider using template methods for transforms

In other words, let specific SQL converters implement (or override) transforms defined in the abstract SQL generator. This way, we could support a hashing function for Postgres and/or BigQuery.

Note: the built-in BigQuery hash function is limited to string or byte-array input, and byte-array output. The output can be cast back to a string, but that means a principal seeing the hashed value and a principal seeing the original value would get different data types. So this could be trickier than expected.

For Postgres, it looks like there are different hashing functions depending on the type. This is perhaps also the case for Snowflake and Databricks; maybe we just haven't run into it with our sample policies.

I say "consider" in the title of this ticket as you could argue that users could use the SQL statement transform to do this instead (although the principal data type issue could happen there too).

PACE-2

[documentation] provide generated view definitions in examples/tutorials

Problem description
While the provided tutorials show the configuration of policies quite well, it's not very clear how the resulting view is implemented on each processing platform. E.g., does PACE use the Databricks UC built-in masking feature, or is the view implemented in pure SQL?

Proposed solution
For all tutorials, include the generated view definitions (next to the result table shown per principal, which IS cool).

Additional context
PACE looks really promising; looking forward to seeing it evolve!

[PACE-6] Better hierarchical entity support for data catalogs

Collibra uses a classic hierarchical database-schema-table model, whereas other data catalogs, such as DataHub and Open Data Discovery, have a flat model. We should ensure that the latter fit the model better, and that a schema returns all tables of the data catalog (instead of only a single one, as it currently does).
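
A hedged Kotlin sketch of one way to adapt flat catalogs (the entity types are hypothetical; the real ones live in PACE's proto definitions): wrap all datasets in a single synthetic database and schema, so that listing tables returns every dataset.

data class Table(val id: String, val name: String)
data class Schema(val id: String, val name: String, val tables: List<Table>)
data class Database(val id: String, val name: String, val schemas: List<Schema>)

// Flat catalogs (DataHub, ODD) expose datasets without a database/schema
// hierarchy: wrap them all in one synthetic database and schema so that
// the schema returns all tables instead of a single one.
fun fromFlatCatalog(catalogId: String, datasets: List<Table>): Database {
    val schema = Schema(id = catalogId, name = "default", tables = datasets)
    return Database(id = catalogId, name = catalogId, schemas = listOf(schema))
}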

PACE-6

Handle Databricks Pending responses

The current client translates Databricks' pending response into an error. This happens, for instance, when getting a data policy:

pace get data-policy --processing-platform dbr-pace pace.alpha_test.gddemo
Error code = Internal
Details = An error occurred in call ...ProcessingPlatformsService/GetBlueprintPolicy.

Do we loop? Or pass it on to the CLI ("try again later")?

This response might mean that the compute is busy (or absent!), so some retry logic needs to be defined.
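
One possible shape for that logic, as a Kotlin sketch using kotlinx-coroutines (the result type and fetch function are hypothetical): poll a bounded number of times with exponential backoff before surfacing the error.

import kotlinx.coroutines.delay

// Hypothetical result wrapper: Databricks can answer with a pending
// status while the SQL warehouse is still starting up (or absent).
sealed interface PolicyResult {
    data class Ready(val policy: String) : PolicyResult
    object Pending : PolicyResult
}

// Retry with exponential backoff instead of translating a pending status
// straight into an Internal error; after maxAttempts the caller (e.g. the
// CLI) decides what to tell the user ("try again later").
suspend fun getPolicyWithRetry(
    maxAttempts: Int = 5,
    initialDelayMs: Long = 1_000,
    fetch: suspend () -> PolicyResult,
): PolicyResult {
    var delayMs = initialDelayMs
    repeat(maxAttempts - 1) {
        val result = fetch()
        if (result is PolicyResult.Ready) return result
        delay(delayMs)
        delayMs *= 2
    }
    return fetch()
}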

From SyncLinear.com | PACE-35

[PACE-23] Determine strategy for overlapping transforms in GlobalTransforms / Data Policies

The global transforms belonging to various tags (or a Data Policy's transforms combined with those defined in the Global Transforms) can overlap or collide. We should decide which strategy to apply in this situation. Multiple options:

  1. First one wins (determined by the order in which the tags are returned by the processing platform / data catalog) -> possibly leads to unexpected / unwanted policies.
  2. Merge (how?)
  3. Strict mode -> refuse to generate a bare policy with global transforms and let the end user first remove the overlap.

The UI will be very helpful here: showing whether the GlobalTransform is compatible with the Data Policy, and if not, what the conflicts are.
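
For illustration, a Kotlin sketch of option 3 (strict mode), with a deliberately simplified transform model: detect transforms that target the same field and refuse to merge.

// Deliberately simplified: a transform targets exactly one field.
data class FieldTransform(val field: String, val source: String)

// Strict mode: refuse to combine transform sets when two transforms
// target the same field, and report the conflicting fields.
fun mergeStrict(
    policyTransforms: List<FieldTransform>,
    globalTransforms: List<FieldTransform>,
): List<FieldTransform> {
    val combined = policyTransforms + globalTransforms
    val conflicts = combined.groupBy { it.field }.filterValues { it.size > 1 }
    require(conflicts.isEmpty()) {
        "Conflicting transforms for field(s): ${conflicts.keys.joinToString()}"
    }
    return combined
}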

From SyncLinear.com | PACE-23

[PACE-57] Better hierarchical entity support for OpendataDiscovery

The current code essentially uses a list-all-tables query and creates one PACE database entry per table. This makes no sense and does not scale. I've looked at the ODD RPC results, and database-level information is available.

# essentially shows every dataset in ODD as a database
pace list databases --catalog odd -o json | jq -r '.databases[]|[.id,.display_name,.type]|@csv'
"111","CATALOG_RETURNS","Snowflake Sample Data"
"110","PARTSUPP","Snowflake Sample Data"
"109","CUSTOMER","Snowflake Sample Data"
"108","STORE_RETURNS","Snowflake Sample Data"
"107","WEB_SITE","Snowflake Sample Data"
"106","WEB_RETURNS","Snowflake Sample Data"
"105","PROMOTION","Snowflake Sample Data"
"104","CUSTOMER","Snowflake Sample Data"
"103","REGION","Snowflake Sample Data"
"102","CATALOG_RETURNS","Snowflake Sample Data"
"101","CATALOG_SALES","Snowflake Sample Data"
"100","HOURLY_16_TOTAL","Snowflake Sample Data"
"99","CALL_CENTER","Snowflake Sample Data"
"98","WEB_SITE","Snowflake Sample Data"
"97","REGION","Snowflake Sample Data"
"96","CUSTOMER_DEMOGRAPHICS","Snowflake Sample Data"

...

# shows one schema identical to the table name
pace list schemas --catalog odd --database 4
schemas:
- database:
    catalog:
      id: odd
      type: ODD
    display_name: sales_denorm
    id: "4"
    type: BookShop Data Lake
  id: "4"
  name: BookShop Data Lake

# And only shows one table
pace list tables --catalog odd --database 4 --schema 4
tables:
- id: "4"
  name: BookShop Data Lake
  schema:
    database:
      catalog:
        id: odd
        type: ODD
      display_name: sales_denorm
      id: "4"
      type: BookShop Data Lake
    id: "4"
    name: BookShop Data Lake

From SyncLinear.com | PACE-57

Improve consistency of GetBlueprintPolicy requests

Problem description

The GetBlueprintPolicyRequest messages are no longer consistent with ResourceUrn.

Processing platforms:

message GetBlueprintPolicyRequest {
  string platform_id = 1 [(buf.validate.field).string = {min_len: 1}];
  entities.v1alpha.Table table = 2;
  optional string fqn = 3;
}

Data catalogs:

message GetBlueprintPolicyRequest {
  string catalog_id = 1 [(buf.validate.field).string = {min_len: 1}];
  optional string database_id = 2;
  optional string schema_id = 3;
  string table_id = 4 [(buf.validate.field).string = {min_len: 1}];
  string fqn = 5;
}

Proposed solution
Consider refactoring the requests; one possible direction is sketched below.
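
A hedged Kotlin-side sketch of that direction (names hypothetical): both request types carry a single resource reference instead of diverging field sets.

// Hypothetical unified reference, replacing the diverging
// platform_id/table vs. catalog_id/database_id/schema_id/table_id fields.
data class ResourceUrn(
    val integrationId: String,        // platform_id or catalog_id
    val resourcePath: List<String>,   // e.g. [database, schema, table]
    val fqn: String? = null,          // optional fully qualified name
)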

[PACE-66] Run the standalone quickstart in PR builds

In order to test whether modifications to the PACE app still result in a working standalone quickstart, we should:

  1. Start all relevant containers (a PACE Postgres DB and a Postgres processing platform, but not the PACE app; that should be the app built in an earlier step of the CI)
  2. Perform all quickstart actions, i.e.
    1. Create the Data Policy
    2. Apply the Data Policy
  3. Connect to the Postgres processing platform as each of the users mentioned in the quickstart:
    1. Query the view
    2. Assert the result set per user to see if the contents are correct

From SyncLinear.com | PACE-66

[PACE-55] Improve hash transform

The current hash transform is implemented in the platform-agnostic DefaultProcessingPlatformTransformer. There are a couple of issues with it:

  • The hashing functions differ across platforms, e.g. in their return types.
  • Hashing functions aren't type-stable: the output is typically a number or bytes, regardless of the input type. This means that if a policy returns the original value to some principals but the hashed value to others, a data type mismatch occurs.

The first issue can be resolved by creating platform-specific implementations, which is straightforward with the recent ProcessingPlatformTransformer interface.

The second issue requires additional thought; the hashed value could perhaps be cast back to the original data type. In the meantime, platform-specific hashing can be implemented using the SQL Statement transform. A sketch of the platform-specific direction follows below.
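
A Kotlin sketch of per-platform hash rendering (function and platform names illustrative only): every variant normalizes its output to a string, sidestepping the differing return types. Casting back to numeric originals would still need extra handling.

// Hypothetical per-platform hash rendering; all variants yield a string.
fun hashExpression(platform: String, field: String): String =
    when (platform) {
        "BIGQUERY" -> "TO_HEX(SHA256(CAST($field AS STRING)))" // SHA256 returns BYTES
        "SNOWFLAKE" -> "SHA2($field)"                          // returns a hex VARCHAR
        "POSTGRES" -> "md5($field::text)"                      // returns text
        else -> error("no hash transform implemented for $platform")
    }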

PACE-55

[PACE-76] Change Data Policy primary key

A Data Policy is currently unique on its id and platform_id (and version), but this is only enforced in the database.

To ensure consistency across the API and database, @bvdeenen and @trietsch propose the following change:

  • Remove platform_id from the data policy primary key in the database migration
  • A Data Policy will be unique on the id and the version
  • If the version is omitted, the latest version is assumed
  • In RPCs, id and version should be included
  • A Data Policy reference message (including id and version) or a URN should be introduced to easily refer to a single unique Data Policy (see the sketch below).
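
A minimal Kotlin sketch of the proposed reference (names hypothetical), where an omitted version resolves to the latest:

// Hypothetical reference message: id plus optional version.
data class DataPolicyRef(val id: String, val version: Int? = null)

// Resolution sketch: a missing version means "latest".
fun resolve(ref: DataPolicyRef, versions: Map<Int, String>): String? =
    if (ref.version == null) versions.maxByOrNull { it.key }?.value
    else versions[ref.version]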

From SyncLinear.com | PACE-76

[PACE-13] Regex capturing group number escaping with `\` is very error prone

Since the regex replacement string (e.g. ***\1; many platforms use a backslash to refer to capturing group numbers) passes through many different layers, escaping often goes wrong and takes many attempts ("how many backslashes do I need this time?").

I propose using our own replacement character, one that is less sensitive to escaping issues, such as $. This is what would be used in the YAML / proto message representations, and it would be translated to the correct number of \ characters when creating a view on the processing platform; a sketch follows below.
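
A Kotlin sketch of the translation step (assuming backslash-style group references on the target platform; the function name is hypothetical):

// Translate user-facing "$1"-style group references to the backslash
// style many platforms expect, e.g. "***$1" -> "***\1".
// A "$" preceded by a backslash is treated as escaped and left alone.
fun toPlatformReplacement(replacement: String): String =
    Regex("(?<!\\\\)\\$(\\d+)").replace(replacement) { m -> "\\" + m.groupValues[1] }

// toPlatformReplacement("***$1") == "***\1"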

PACE-13

[PACE-15] Add apiVersion and kind to all YAML representations of Proto messages

An example of a Data Policy that includes version and kind:

apiVersion: pace.getstrm.com/v1alpha
kind: DataPolicy
metadata:
  name: "alpha_test.demo"
platform:
...

This would enable us to operate multiple API versions simultaneously. It is a client-side feature, as the fields are currently not embedded in the protos themselves; the client should read them to determine which RPC to call and whether the version is still supported (a sketch follows below).
Another option would be to include the fields in the protos and enforce them server-side.
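
A client-side sketch in Kotlin (hypothetical; a real implementation would use a YAML parser): read the two top-level fields before deciding which RPC to call.

// Hypothetical client-side header check; scans only top-level keys.
data class TypedResource(val apiVersion: String, val kind: String)

fun readHeader(yaml: String): TypedResource {
    val topLevel = yaml.lineSequence()
        .filterNot { it.startsWith(" ") || it.startsWith("\t") }
        .mapNotNull { line ->
            val parts = line.split(":", limit = 2)
            if (parts.size == 2) parts[0].trim() to parts[1].trim() else null
        }
        .toMap()
    return TypedResource(
        apiVersion = topLevel["apiVersion"] ?: error("missing apiVersion"),
        kind = topLevel["kind"] ?: error("missing kind"),
    )
}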

PACE-15

[PACE-37] Move all `toCase` ViewGeneratorTests to TransformerTests

Currently, the ViewGeneratorTests contain tests both for full CREATE VIEW SQL statements and for individual toCase statements. Now that we've introduced ProcessingPlatformTransformers, we can move the toCase tests to Transformer test classes. This also means that toCase could be made private in the ProcessingPlatformViewGenerator.

PACE-37

[core]: ensure it's possible to extend an existing rule set with a global transform

Problem description
At the moment, global tag transforms result in a new (blueprint) data policy. We should make it possible to extend existing rule sets; that way, one can specify both tags and additional filters/transforms in, for example, DBT model metadata.

Proposed solution
Add a function that takes a rule set or data policy and adds the tag-based transforms to the existing rule set(s). Some validation/merge logic may be needed; a sketch follows below.
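
A hedged Kotlin sketch of such a function (deliberately simplified types): keep the rule set's own transforms and add tag-derived ones only for fields not yet covered, one possible merge strategy among several.

// Deliberately simplified model.
data class FieldTransform(val field: String, val expression: String)
data class RuleSet(val transforms: List<FieldTransform>)

// Extend an existing rule set with tag-derived transforms; here the rule
// set's own transforms win on conflicts (one possible strategy; see
// PACE-23 for the broader discussion of overlapping transforms).
fun extendWithGlobalTransforms(
    ruleSet: RuleSet,
    tagTransforms: List<FieldTransform>,
): RuleSet {
    val covered = ruleSet.transforms.map { it.field }.toSet()
    return ruleSet.copy(
        transforms = ruleSet.transforms + tagTransforms.filterNot { it.field in covered },
    )
}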

[PACE-58] Better hierarchical entity support for Datahub

pace list databases --catalog datahub-on-dev -o json | jq '.databases[]|.display_name,.id'
null
"urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_created,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:hive,fct_users_deleted,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:hive,logging_events,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:kafka,SampleKafkaDataset,PROD)"
null
"urn:li:dataset:(urn:li:dataPlatform:s3,project/root/events/logging_events_bckp,PROD)"

From SyncLinear.com | PACE-58

[PACE-19] Add support for data retention

When implementing a data policy, I would like to specify a retention period, after which data should be omitted from the created view.

Details:

  • In the rule sets, add support for retention
  • Take a specified retention period (in days), indicated by a tag
  • When building the dynamic view, check whether a row is inside or outside the retention period
  • Design choice: checks should be conducted against the creation (not update) timestamp of each row
  • If outside the retention period, filter out the row (see the sketch below)
  • Question: would it be possible to delete the source data altogether, instead of only filtering it out of the view?
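
A sketch of the retention filter rendering (Kotlin generating Postgres-flavoured SQL; the column name and interval syntax are assumptions and vary per platform):

// Rows older than the tag-specified retention period (in days) are
// filtered out of the generated view, based on a creation timestamp.
fun retentionFilter(createdAtColumn: String, retentionDays: Int): String =
    "$createdAtColumn > CURRENT_TIMESTAMP - INTERVAL '$retentionDays days'"

// retentionFilter("created_at", 30)
//   -> "created_at > CURRENT_TIMESTAMP - INTERVAL '30 days'"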

PACE-19

[PACE-5] Collibra databases don't show full name

pace list databases --catalog COLLIBRA-testdrive
databases:
- catalog:
    id: COLLIBRA-testdrive
    type: COLLIBRA
  id: 8665f375-e08a-4810-add6-7af490f748ad
  type: Snowflake
- catalog:
    id: COLLIBRA-testdrive
    type: COLLIBRA
  id: 99379294-6e87-4e26-9f09-21c6bf86d415
  type: CData JDBC Driver for Google BigQuery 2021
- catalog:
    id: COLLIBRA-testdrive
    type: COLLIBRA
  id: b6e043a7-88f1-42ee-8e81-0fdc1c96f471
  type: Snowflake

The field displayName is not filled in; it should contain something useful.

From SyncLinear.com | PACE-5
