
datamesh-platform's Introduction

Data Mesh Pattern

This repository is a mono-repository that provides a community pattern for a distributed data mesh architecture based on open source components. As a prerequisite, we recommend reading the following foundational articles to understand the concept and approach towards building a data mesh:

How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh -- Zhamak Dehghani, Thoughtworks

Data Mesh Principles and Logical Architecture -- Zhamak Dehghani, Thoughtworks

The community pattern implementation is based on 4 foundational principles:

images/data-mesh-components.png

  1. Domain Ownership: The domain ownership principle mandates the domain teams to take responsibility for their data. According to this principle, analytical data should be composed around domains, similar to the team boundaries aligning with the system’s bounded context. Following the domain-driven distributed architecture, analytical and operational data ownership is moved to the domain teams, away from the central data team.

  2. Data-as-a-Product: The data-as-a-product principle projects product thinking onto analytical data: there are consumers for the data beyond the domain, and the domain team is responsible for satisfying the needs of other domains by providing high-quality data. In essence, domain data should be treated like any other public API. In a data mesh architecture pattern, data ownership is decentralized and domain data product owners are responsible for all capabilities within a given domain, including discoverability, understandability, quality and security of the data. In effect we achieve this by having agile, autonomous, distributed teams build data pipelines as code and data product interfaces on standard tooling and shared infrastructure services. These teams own the code for the data pipelines loading, transforming and serving the data, as well as the related metadata, and drive the development cycle for their own data domains. In the process, data handling logic is always treated as code, applying the same versioning and change management practices we use for software to data and metadata pipelines.

  3. Self-service data infrastructure as a platform: The idea behind the self-serve data infrastructure platform is to adopt platform thinking to data infrastructure. A dedicated data platform team provides domain-agnostic functionality, tools, and systems to build, execute, and maintain interoperable data products for all domains. With its platform, the data platform team enables domain teams to seamlessly consume and create data products.

  4. Federated governance: The platform requires a layer providing a federated view of the data domains while being able to support the establishment of common operating standards around data / metadata / data lineage management, quality assurance, security and compliance policies, and by extension any cross-cutting supervision concern across all data products. The federated governance principle achieves interoperability of all data products through standardization, which is promoted through the whole data mesh by the governance group.

The platform is deployed on top of OpenShift Container Platform and OpenShift AI based on the following component architecture:

images/data-mesh-components.png

The list of open source components integrated into the pattern and their respective role is summarized below.

  • Object Storage (Ceph): Responsible for secure storage of raw data and object-level data access for data ingestion / loading. This is a system layer where data transactions are system-based, considered privileged activity and typically handled via automated processes. Data security at this layer is governed through secure code management validating the logic of the intended data transaction, and secret management to authenticate all access requests for privileged credentials and then enforce data security policies.
  • Data Serving (Apache Iceberg): Open standard, high-performance table format responsible for managing the availability, integrity and consistency of data transactions and data schema for huge analytic datasets. This includes the management of schema evolution (supporting add, drop, update, rename), transparent management of hidden partitioning / partitioning layout for data objects, support of ACID data transactions (particularly multiple concurrent writer processes) on eventually-consistent cloud object stores, and time travel / version rollback for data. This layer is critical to our data-as-code approach as it enables reproducible queries based on any past table snapshot as well as examination of historical data changes.
  • Distributed SQL Query Engine (Trino): Centralized data access and analytics layer with query federation capability. This component ensures that authentication and data access query management is standardized and centralized for all data clients consuming data, by providing management capabilities for role-based access control at the data element level in Data Commons. Being a federation layer, it can also query across multiple external distributed data sources for cross-querying requirements (a short query sketch follows this list).
  • Data Versioning and Lineage Management (Pachyderm): Automates data transformations with data versioning and lineage. This component helps us provide an immutable lineage with data versioning of any data type, for data being processed in pipelines. It is data-agnostic, supporting both unstructured data such as documents, video and images as well as tabular data from data warehouses. It can automatically detect changes to data, trigger data processing pipelines on change, and ensure both source and processed data are version controlled.
  • Data Transformation (DBT): SQL-first transformation workflow tooling that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation for data pipelines. This component supports our data-as-code approach by providing Git-enabled version control, which enables full traceability of code related to data changes and allows a return to previous states. It also supports programmatic and repeatable data validation by automating data pipeline testing, and shares dynamically generated documentation with all data stakeholders, including dependency graphs and dynamic data dictionaries, to promote trust and transparency for data consumers.
  • Data Validation (Great Expectations): Open-source library for validating, documenting, and profiling data. This component provides automated testing based on version-controlled assertions, which are essentially unit tests for data quality. It also helps speed up the development of data quality controls by profiling source data to get basic statistics and automatically generating a suite of expectations based on what is observed in the data. Finally, it provides clean, human-readable documentation, which is integrated with the metadata management system to facilitate data discovery.
  • Metadata Management (OpenMetadata): Component responsible for data pipeline and metadata management. By tying together data pipeline code and execution, it provides file-based automated data and metadata versioning across all stages of our data pipelines (including intermediate results), as well as immutable data lineage tracking for every version of code, models and data to ensure full reproducibility of data and code for transparency and compliance purposes.
  • Workflow Management (Airflow): Workflow management platform for all data engineering pipelines and associated metadata processing. It provides dynamic pipeline generation leveraging standard Python features to create workflows, including date/time formats for scheduling and loops to dynamically generate the tasks required to integrate data ingestion and validation with various ingestion methods. It also provides a robust and modern web front-end to monitor, schedule and manage workflows.
  • Data Security Management (Open Policy Agent & Fybrik): Framework responsible for the monitoring and management of comprehensive data security across the entire Data Commons as a multi-tenant environment. It provides centralized security administration for all security-related tasks in a management UI, and fine-grained authorization for specific actions / operations on data at the data set / data element level, including metadata-based authorization and controls. It also centralizes the auditing of user / process-level access for all data transactions. Ultimately, we want this layer to provide data access management and visibility directly to the data / data product owners as a self-service capability.
  • Analytics Dashboard (Apache Superset): Software application for data exploration and data visualization able to handle data at petabyte scale. Superset connects to any data source in Data Commons through Trino, so it allows integrating data from cloud-native databases and engines to build rich visualizations and dashboards. The visualization plug-in architecture makes it easy to build custom visualizations that can be integrated into a reactive web application front-end if required.
  • Data Analysis (pandas): Software library for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series that may be required for complex data manipulation in line with the requirements of machine learning pipelines.
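As an illustration of how data consumers interact with the centralized access layer, the short sketch below queries an Iceberg-backed table through the Trino Python client. The host, credentials, catalog, schema and table names are placeholders, and the time-travel clause assumes a recent Trino release with the Iceberg connector; treat this as a sketch rather than the platform's reference client code.

# Sketch: querying the data mesh through the Trino Python client.
# Host, credentials, catalog, schema and table names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",      # hypothetical Trino coordinator endpoint
    port=443,
    http_scheme="https",
    user="data-engineer",
    auth=trino.auth.BasicAuthentication("data-engineer", "changeme"),
    catalog="iceberg",
    schema="demo",
)
cur = conn.cursor()

# Standard federated query against an Iceberg-backed table.
cur.execute("SELECT count(*) FROM demo_table")
print(cur.fetchone())

# Iceberg time travel (snapshot id is illustrative); supported by recent
# Trino releases for the Iceberg connector.
cur.execute("SELECT * FROM demo_table FOR VERSION AS OF 4884546291003875129 LIMIT 10")
for row in cur.fetchall():
    print(row)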

datamesh-platform's People

Contributors

jpaulrajredhat, mukesh-zaga, sharanyajune7, caldeirav, pujathacker2210, taneem-ibrahim

Stargazers

Michael Tiemann and others

Watchers

Clyde Tedrick, Guillaume Moutier, Bryon Baker and others

datamesh-platform's Issues

Deployment for Data Mesh MVP components

  1. Airflow
     • Postgres DB
     • Single-node Airflow cluster (Web UI, Scheduler and Worker all run on a single pod)
  2. Trino
     • Trino cluster with 1 coordinator and 2 worker nodes
     • Basic authentication
     • Hive Metastore service
     • Postgres DB for metadata
  3. OpenMetadata
     • OpenMetadata Web UI
     • Elasticsearch
     • Postgres database
     • Airflow integration for data ingestion
  4. MinIO
     • Deployment

Build helm chart / Kustomize for deployment

DBT metadata ingestion through OpenMetadata / Airflow

Set up OpenMetadata to integrate with our DBT pipelines (as code, contained in the data product). List of integrations to be tested:
https://docs.open-metadata.org/v1.4.x/connectors/ingestion/workflows/dbt

The DBT ingestion can be configured manually from the OpenMetadata UI as per the following instructions:
https://docs.open-metadata.org/v1.4.x/connectors/ingestion/workflows/dbt/configure-dbt-workflow-from-ui

In our case, the manifest.json will be retrieved from the GitHub repository for the completed data pipeline, via the File Server method; a quick reachability check is sketched below.
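As a sanity check of that retrieval path, something like the following can confirm the manifest is reachable over HTTP before the OpenMetadata workflow is configured; the raw GitHub URL is a hypothetical placeholder, not the actual data product repository.

# Hedged sketch: verify the dbt manifest.json is reachable before pointing the
# OpenMetadata dbt ingestion (File Server method) at it. The URL is a placeholder.
import requests

MANIFEST_URL = "https://raw.githubusercontent.com/example-org/example-data-product/main/target/manifest.json"

resp = requests.get(MANIFEST_URL, timeout=30)
resp.raise_for_status()
manifest = resp.json()

# A valid dbt manifest exposes metadata plus the compiled nodes.
print(manifest["metadata"]["dbt_version"])
print(len(manifest["nodes"]), "dbt nodes found")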

Test first MVP deployment

@sharanyajune7 The initial MVP deployment code has been checked in and the instructions in the README file have been updated. Please try it on your local server. Once it is successfully tested on the ZAGA server, deploy it to the Red Hat sandbox.

Install a cloud database management tool

As a data engineer, I want to be able to browse existing catalogs and data ingested in Trino.

For this, it is suggested we install a cloud database management tool such as CloudBeaver Community (https://github.com/dbeaver/cloudbeaver) or similar. This tool has been used successfully before by OS-Climate. Ideally it should be deployed with an authentication provider in line with our management of access in Trino.

Define high level structure for datamesh-platform documentation

This issue aims at identifying how the documentation for this repository should be structured so that it is easy for contributors to follow along.
The README should be structured and detailed, while remaining easy to follow for installation and usage of the platform.
It could have sections like:

  • How the platform works
  • How to install the platform
  • How to build a data pipeline
  • others..

Build infrastructure for MVP deployment

Currently we are using ZAGA infrastructure. We need a stable instance for deployment, and we need to discuss whether to continue using ZAGA infrastructure or move to Red Hat.

Remove keycloak integration with Trino, OpenMetadata and Airflow

Looking at the difference in implementation mechanism and usability aspects of the applications, we have agreed on NOT integrating keycloak for SSO authentication.

This issue is meant to remove Keycloak-based authentication from Trino, Airflow, OpenMetadata and other applications in the platform.

We also need to set up (or ensure it is already set up) local users and document how users can be added/removed using the same mechanism.

Explore PyIceberg for Data Writes

PyIceberg is a Python implementation for accessing Iceberg tables without the need for a JVM. Using this library could enable us to write data tables directly from PyArrow dataframes. PyIceberg currently has native support for REST, SQL, Hive, Glue and DynamoDB-based catalogs, so the suggestion is to use the current Hive catalog for testing. This could provide an alternative to the use of Spark (simpler but less mature).

https://py.iceberg.apache.org/
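A minimal sketch of what the write path could look like, assuming PyIceberg >= 0.6 (which introduced write support) and the existing Hive catalog; the catalog properties, S3 endpoint and table identifier below are placeholders:

# Minimal sketch: writing a PyArrow table to Iceberg via PyIceberg, assuming
# PyIceberg >= 0.6 and the existing Hive catalog. All endpoints are placeholders.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "hive",
    **{
        "type": "hive",
        "uri": "thrift://hive-metastore:9083",
        "s3.endpoint": "http://minio:9000",
    },
)

df = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Create the table from the Arrow schema (assumes it does not exist yet), then append.
table = catalog.create_table("demo.pyiceberg_test", schema=df.schema)
table.append(df)

# Read back through a table scan to confirm the write.
print(table.scan().to_arrow().num_rows)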

Default configuration for Spark standing cluster

As a data engineer I want to be able to initialize my Spark environment across multiple jobs or sessions without a complicated series of commands.

In Spark, we can use conf/spark-defaults.conf to set common configuration across runs. This approach is typically used for configuration that doesn’t change often, like defining catalogs. Each invocation will include the defaults, which can be overridden by settings passed through the command-line. The file format is a Java properties file.

When configuring an environment for the data mesh, it is best to include the Iceberg dependencies in the jars directory of our Spark distribution so they are available when Spark is deployed and not fetched each time a session is started.

Reference: https://tabular.io/apache-iceberg-cookbook/getting-started-spark-configuration/
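For reference, the same defaults can also be expressed programmatically. The sketch below mirrors typical conf/spark-defaults.conf entries for an Iceberg catalog backed by the Hive Metastore; the catalog name, metastore URI and warehouse location are placeholders, and it assumes the iceberg-spark-runtime jar is already in the jars directory as described above.

# Sketch: Iceberg catalog defaults expressed via SparkSession config. In practice
# the same key/value pairs would live in conf/spark-defaults.conf. Catalog name,
# metastore URI and warehouse location are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("datamesh")
    # Enable Iceberg's SQL extensions.
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Define a named Iceberg catalog backed by the Hive Metastore.
    .config("spark.sql.catalog.datamesh", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.datamesh.type", "hive")
    .config("spark.sql.catalog.datamesh.uri", "thrift://hive-metastore:9083")
    .config("spark.sql.catalog.datamesh.warehouse", "s3a://warehouse/")
    .getOrCreate()
)

spark.sql("SHOW TABLES IN datamesh.default").show()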

Notebook server integration with GitHub

We want data scientists / developers to have standalone notebook environments. The preferred way is to enable multi-user access through GitHub authentication, with each user able to spawn their own notebook instance.
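One possible way to wire this up, assuming a JupyterHub-based notebook spawner with the oauthenticator package (the platform may settle on a different mechanism, e.g. OpenShift AI workbenches), is sketched in the jupyterhub_config.py fragment below; the OAuth app credentials, callback URL and organisation are placeholders.

# Hedged sketch of a jupyterhub_config.py fragment enabling GitHub login,
# assuming JupyterHub plus the `oauthenticator` package. The `c` object is the
# config API JupyterHub injects when loading this file; client id/secret,
# callback URL and organisation are placeholders for a GitHub OAuth app.
c.JupyterHub.authenticator_class = "github"

c.GitHubOAuthenticator.oauth_callback_url = "https://notebooks.example.com/hub/oauth_callback"
c.GitHubOAuthenticator.client_id = "<github-oauth-app-client-id>"
c.GitHubOAuthenticator.client_secret = "<github-oauth-app-client-secret>"

# Optionally restrict access to members of a GitHub organisation.
c.GitHubOAuthenticator.allowed_organizations = {"example-org"}
c.GitHubOAuthenticator.scope = ["read:org"]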

Can CI/CD test notebooks that need Trino/AWS access?

I don't think there's a way we can make notebooks run in CI/CD and safely protect AWS secrets or JWT tokens. An attacker can easily introspect their environment and steal keys from the service account by creating a pull request and examining outputs. I think therefore we need to be doubly careful that there are no credentials within the service account, or they are sufficiently well hidden that they cannot be extracted by mere mortals. It's too easy to look at environment variables (where keys are currently stashed).

This code is how we do stuff in our own environments:

# Load environment variables from credentials.env
osc.load_credentials_dotenv()

This loads a file not visible to github (credentials.env) that is chock-full of secrets and injects them into the python environment. Since the person running the scripts owns and controls their own credentials.env file, that's not technically a security leak. But if the CI/CD account were to have its own credentials.env file, then a user could dump the environment and secrets would leak from CI/CD to the malicious user.

It's one thing to safely pass the secret from a GitHub Actions context to a program you want to run, with the guarantee that nobody can see the secret in GitHub. It's another thing to protect the key from a malicious program that wants to leak it. How can SUPER_SECRET be protected from something like raise ValueError("attention hackers: S U P E R _ S E C R E T = " + os.environ["SUPER_SECRET"])?

The best way to limit the AWS blast radius would be to use public data that doesn't require key-signing to access (accessor pays), and top up the CI/CD account with the trivial costs of data access (which it would bear anyway under provider pays). Our test data is public data.

But there's still the Trino JWT token. And to efficiently write data to Trino we need a writable AWS data bucket where Parquet files we write to Hive can be teleported to Iceberg Parquet. (It's MIND BLOWING that Trino's import capabilities are so primitive and non-performant: the DERA-DATA ingestion takes 3-4 hours using "normal" Trino ingestion and 1-2 minutes using Hive->Trino, but that's another problem.) So unless and until we can solve the credentials leak problem, I don't think we should try to make actual ingestion pipelines work with CI/CD. Rather, we need to write scripts that users can run to validate their expectations (things they need to read are readable; if they are writing, what they wrote is sensible) and whose receipts they can commit to "prove" their commits are in working order.

Taking a page from the pytest book, we can have our CI/CD notebook system ignore all files that don't start with test_ unless you ask nicely. If we change the GitHub Actions to only test notebooks that match **/test_*.ipynb, then the writers of notebooks can decide whether they are writing "must run to the end" notebooks (which are pytest friendly) or name them something that won't match.
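A hedged sketch of that discovery-and-execute step, assuming notebooks are run headlessly with jupyter nbconvert (the actual GitHub Actions runner may differ):

# Sketch: mirror the pytest convention by discovering only notebooks named
# test_*.ipynb and executing them headlessly. Assumes the `jupyter nbconvert`
# CLI is available on the CI runner.
import pathlib
import subprocess
import sys

failures = []
for nb in sorted(pathlib.Path(".").glob("**/test_*.ipynb")):
    print(f"Executing {nb} ...")
    result = subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--output", nb.stem + ".executed", str(nb)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        failures.append((nb, result.stderr))

for nb, err in failures:
    print(f"FAILED: {nb}\n{err}")
sys.exit(1 if failures else 0)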

But ultimately we need to provide clear guidance as to how we (1) protect secrets in our CI/CD system, and (2) limit the blast radius of inevitable failures.

@ModeSevenIndustrialSolutions

Apache Iceberg Sink Connector

Explore the integration of the Apache Iceberg Sink Connector for Kafka Connect, a sink connector for writing data from Kafka into Iceberg tables (a registration sketch follows the feature list below):

https://github.com/tabular-io/iceberg-kafka-connect

Features:

  • Commit coordination for centralized Iceberg commits
  • Exactly-once delivery semantics
  • Multi-table fan-out
  • Row mutations (update/delete rows), upsert mode
  • Automatic table creation and schema evolution
  • Field name mapping via Iceberg’s column mapping functionality
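If we go down this route, registering the connector would look roughly like the sketch below, which posts a connector definition to the Kafka Connect REST API. The connector class and iceberg.* property names are taken from the tabular-io project documentation and should be re-checked against the version actually deployed; topic, table and endpoints are placeholders.

# Rough sketch: register the Iceberg sink via the Kafka Connect REST API.
# Connector class and iceberg.* keys should be verified against the deployed
# connector version; topic, table names and endpoints are placeholders.
import requests

connector = {
    "name": "iceberg-sink-demo",
    "config": {
        "connector.class": "io.tabular.iceberg.connect.IcebergSinkConnector",
        "topics": "demo-events",
        "iceberg.tables": "demo.events",
        "iceberg.catalog.type": "hive",
        "iceberg.catalog.uri": "thrift://hive-metastore:9083",
        "iceberg.tables.auto-create-enabled": "true",
    },
}

resp = requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())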
