
datamesh-platform's Introduction

Data Mesh Pattern

This repository is a mono-repository that provides a community pattern for a distributed data mesh architecture based on open source components. As a prerequisite, we recommend reading the following foundational articles to understand the concept and approach towards building a data mesh:

How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh -- Zhamak Dehghani, Thoughtworks

Data Mesh Principles and Logical Architecture -- Zhamak Dehghani, Thoughtworks

The community pattern implementation is based on 4 foundational principles:

images/data-mesh-components.png

  1. Domain Ownership: The domain ownership principle mandates the domain teams to take responsibility for their data. According to this principle, analytical data should be composed around domains, similar to the team boundaries aligning with the system’s bounded context. Following the domain-driven distributed architecture, analytical and operational data ownership is moved to the domain teams, away from the central data team.

  2. Data-as-a-Product: The data-as-a-product principle projects product thinking onto analytical data: there are consumers for the data beyond the domain, and the domain team is responsible for satisfying the needs of other domains by providing high-quality data. In essence, domain data should be treated like any other public API. In a data mesh architecture pattern, data ownership is decentralized and domain data product owners are responsible for all capabilities within a given domain, including discoverability, understandability, quality and security of the data. In effect we achieve this by having agile, autonomous, distributed teams build data pipelines as code and data product interfaces on standard tooling and shared infrastructure services. These teams own the code for the data pipelines loading, transforming and serving the data, as well as the related metadata, and drive the development cycle for their own data domains. In the process, data handling logic is always treated as code, applying the same versioning and change management practices we use for software to data and metadata pipelines.

  3. Self-service data infrastructure as a platform: The idea behind the self-serve data infrastructure platform is to adopt platform thinking to data infrastructure. A dedicated data platform team provides domain-agnostic functionality, tools, and systems to build, execute, and maintain interoperable data products for all domains. With its platform, the data platform team enables domain teams to seamlessly consume and create data products.

  4. Federated governance: The platform requires a layer providing a federated view of the data domains while being able to support the establishment of common operating standards around data / metadata / data lineage management, quality assurance, security and compliance policies, and by extension any cross-cutting supervision concern across all data products. The federated governance principle achieves interoperability of all data products through standardization, which is promoted through the whole data mesh by the governance group.

The platform is deployed on top of OpenShift Container Platform and OpenShift AI based on the following component architecture:

images/data-mesh-components.png

The list of open source components integrated into the pattern and their respective role is summarized below.

  • Object Storage (Ceph): Responsible for secure storage of raw data and object-level data access for data ingestion / loading. This is a system layer where data transactions are system-based, considered privileged activity and typically handled via automated processes. Data security at this layer is governed through secure code management validating the logic of the intended data transaction, and secret management to authenticate all access requests for privileged credentials and then enforce data security policies.
  • Data Serving (Apache Iceberg): Open standard, high-performance table format responsible for managing the availability, integrity and consistency of data transactions and data schema for huge analytic datasets. This includes the management of schema evolution (supporting add, drop, update, rename), transparent management of hidden partitioning / partitioning layout for data objects, support of ACID data transactions (particularly multiple concurrent writer processes) on eventually-consistent cloud object stores, and time travel / version rollback for data. This layer is critical to our data-as-code approach as it enables reproducible queries based on any past table snapshot as well as examination of historical data changes.
  • Distributed SQL Query Engine (Trino): Centralized data access and analytics layer with query federation capability. This component ensures that authentication and data access query management is standardized and centralized for all data clients consuming data, by providing management capabilities for role-based access control at the data element level in Data Commons. Being a federation layer, it can also query across multiple external distributed data sources for cross-querying requirements (a short query sketch follows this list).
  • Data Versioning and Lineage Management (Pachyderm): Automates data transformations with data versioning and lineage. This component helps us provide an immutable lineage with data versioning of any data type, for data being processed in pipelines. It is data-agnostic, supporting both unstructured data such as documents, video and images as well as tabular data from data warehouses. It can automatically detect changes to data, trigger data processing pipelines on change, and ensure both source and processed data are version controlled.
  • Data Transformation (DBT): SQL-first transformation workflow tooling that lets teams quickly and collaboratively deploy analytics code following software engineering best practices like modularity, portability, CI/CD, and documentation for data pipelines. This component supports our data-as-code approach by providing Git-enabled version control, which enables full traceability of code related to data changes and allows a return to previous states. It also supports programmatic and repeatable data validation by automating data pipeline testing, and shares dynamically generated documentation with all data stakeholders, including dependency graphs and dynamic data dictionaries, to promote trust and transparency for data consumers.
  • Data Validation (Great Expectations): Open-source library for validating, documenting, and profiling data. This component provides automated testing based on version-controlled assertions, which are essentially unit tests for data quality. It also helps speed up the development of data quality controls by profiling source data to get basic statistics and automatically generating a suite of expectations based on what is observed in the data. Finally, it provides clean, human-readable documentation, which is integrated with the metadata management system to facilitate data discovery.
  • Metadata Management (OpenMetadata): Component responsible for data pipeline and metadata management. By tying together data pipeline code and execution, it provides file-based automated data and metadata versioning across all stages of our data pipelines (including intermediate results), as well as immutable data lineage tracking for every version of code, models and data to ensure full reproducibility of data and code for transparency and compliance purposes.
  • Workflow Management (Airflow): Workflow management platform for all data engineering pipelines and associated metadata processing. It provides dynamic pipeline generation leveraging standard Python features to create workflows, including date/time formats for scheduling and loops to dynamically generate the tasks required to integrate data ingestion and validation with various ingestion methods. It also provides a robust and modern web front-end to monitor, schedule and manage workflows.
  • Data Security Management (Open Policy Agent & Fybrik): Framework responsible for the monitoring and management of comprehensive data security across the entire Data Commons as a multi-tenant environment. It provides centralized security administration for all security-related tasks in a management UI, and fine-grained authorization for specific actions / operations on data at the data set / data element level, including metadata-based authorization and controls. It also centralizes the auditing of user / process-level access for all data transactions. Ultimately, we want this layer to provide data access management and visibility directly to the data / data product owners as a self-service capability.
  • Analytics Dashboard (Apache Superset): Software application for data exploration and data visualization able to handle data at petabyte scale. Superset connects to any data source in Data Commons through Trino, so it allows integrating data from cloud-native databases and engines to build rich visualizations and dashboards. The visualization plug-in architecture makes it easy to build custom visualizations that can be integrated into a reactive web application front-end if required.
  • Data Analysis (pandas): Software library for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series that may be required for complex data manipulation in line with the requirements of machine learning pipelines.
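As an illustration of how data consumers interact with the centralized access layer, the short sketch below queries an Iceberg-backed table through the Trino Python client. The host, credentials, catalog, schema and table names are placeholders, and the time-travel clause assumes a recent Trino release with the Iceberg connector; treat this as a sketch rather than the platform's reference client code.

# Sketch: querying the data mesh through the Trino Python client.
# Host, credentials, catalog, schema and table names are placeholders.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",      # hypothetical Trino coordinator endpoint
    port=443,
    http_scheme="https",
    user="data-engineer",
    auth=trino.auth.BasicAuthentication("data-engineer", "changeme"),
    catalog="iceberg",
    schema="demo",
)
cur = conn.cursor()

# Standard federated query against an Iceberg-backed table.
cur.execute("SELECT count(*) FROM demo_table")
print(cur.fetchone())

# Iceberg time travel (snapshot id is illustrative); supported by recent
# Trino releases for the Iceberg connector.
cur.execute("SELECT * FROM demo_table FOR VERSION AS OF 4884546291003875129 LIMIT 10")
for row in cur.fetchall():
    print(row)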

datamesh-platform's People

Contributors

jpaulrajredhat, mukesh-zaga, sharanyajune7, caldeirav, pujathacker2210, taneem-ibrahim

Stargazers

Michael Tiemann and others

Watchers

Clyde Tedrick, Guillaume Moutier, Bryon Baker and others

datamesh-platform's Issues

Deployment for Data Mesh MVP components

  1. Airflow
     • Postgres DB
     • Single-node Airflow cluster (Web UI, Scheduler and Worker all run on a single pod)
  2. Trino
     • Trino cluster with 1 coordinator and 2 worker nodes
     • Basic authentication
     • Hive Metastore service
     • Postgres DB for metadata
  3. OpenMetadata
     • OpenMetadata Web UI
     • Elasticsearch
     • Postgres database
     • Airflow integration for data ingestion
  4. MinIO
     • Deployment

Build helm chart / Kustomize for deployment

DBT metadata ingestion through OpenMetadata / Airflow

Set up OpenMetadata to integrate with our DBT pipelines (as code, contained in the data product). List of integrations to be tested:
https://docs.open-metadata.org/v1.4.x/connectors/ingestion/workflows/dbt

The DBT ingestion can be configured manually from the OpenMetadata UI as per the following instructions:
https://docs.open-metadata.org/v1.4.x/connectors/ingestion/workflows/dbt/configure-dbt-workflow-from-ui

In our case, the manifest.json will be retrieved from the GitHub repository for the completed data pipeline, via the File Server method; a quick reachability check is sketched below.
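As a sanity check of that retrieval path, something like the following can confirm the manifest is reachable over HTTP before the OpenMetadata workflow is configured; the raw GitHub URL is a hypothetical placeholder, not the actual data product repository.

# Hedged sketch: verify the dbt manifest.json is reachable before pointing the
# OpenMetadata dbt ingestion (File Server method) at it. The URL is a placeholder.
import requests

MANIFEST_URL = "https://raw.githubusercontent.com/example-org/example-data-product/main/target/manifest.json"

resp = requests.get(MANIFEST_URL, timeout=30)
resp.raise_for_status()
manifest = resp.json()

# A valid dbt manifest exposes metadata plus the compiled nodes.
print(manifest["metadata"]["dbt_version"])
print(len(manifest["nodes"]), "dbt nodes found")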

Test first MVP deployment

@sharanyajune7 The initial MVP deployment code has been checked in and the instructions in the README file have been updated. Please try it on your local server. Once it is successfully tested on the ZAGA server, deploy it to the Red Hat sandbox.

Install a cloud database management tool

As a data engineer, I want to be able to browse existing catalogs and data ingested in Trino.

For this, it is suggested we install a cloud database management tool such as CloudBeaver Community (https://github.com/dbeaver/cloudbeaver) or similar. This tool has been used successfully before by OS-Climate. Ideally it should be deployed with an authentication provider in line with our management of access in Trino.

Define high level structure for datamesh-platform documentation

This issue aims at identifying how the documentation for this repository should be structured so that it is easy for contributors to follow along.
The README should be structured and detailed, while remaining easy to follow for installation and usage of the platform.
It could have sections like:

  • How the platform works
  • How to install the platform
  • How to build a data pipeline
  • others..

Build infrastructure for MVP deployment

Currently we are using ZAGA infrastructure. We need a stable instance for deployment, and we need to discuss whether to continue using ZAGA infrastructure or move to Red Hat.

Remove keycloak integration with Trino, OpenMetadata and Airflow

Looking at the difference in implementation mechanism and usability aspects of the applications, we have agreed on NOT integrating keycloak for SSO authentication.

This issue is meant to remove Keycloak-based authentication from Trino, Airflow, OpenMetadata and other applications in the platform.

We also need to set up (or ensure it is already set up) local users and document how users can be added/removed using the same mechanism.

Explore PyIceberg for Data Writes

PyIceberg is a Python implementation for accessing Iceberg tables without the need for a JVM. Using this library could enable us to write data tables directly from PyArrow dataframes. PyIceberg currently has native support for REST, SQL, Hive, Glue and DynamoDB-based catalogs, so the suggestion is to use the current Hive catalog for testing. This could provide an alternative to the use of Spark (simpler but less mature).

https://py.iceberg.apache.org/
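A minimal sketch of what the write path could look like, assuming PyIceberg >= 0.6 (which introduced write support) and the existing Hive catalog; the catalog properties, S3 endpoint and table identifier below are placeholders:

# Minimal sketch: writing a PyArrow table to Iceberg via PyIceberg, assuming
# PyIceberg >= 0.6 and the existing Hive catalog. All endpoints are placeholders.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "hive",
    **{
        "type": "hive",
        "uri": "thrift://hive-metastore:9083",
        "s3.endpoint": "http://minio:9000",
    },
)

df = pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]})

# Create the table from the Arrow schema (assumes it does not exist yet), then append.
table = catalog.create_table("demo.pyiceberg_test", schema=df.schema)
table.append(df)

# Read back through a table scan to confirm the write.
print(table.scan().to_arrow().num_rows)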

Default configuration for Spark standing cluster

As a data engineer I want to be able to initialize my Spark environment across multiple jobs or sessions without a complicated series of commands.

In Spark, we can use conf/spark-defaults.conf to set common configuration across runs. This approach is typically used for configuration that doesn’t change often, like defining catalogs. Each invocation will include the defaults, which can be overridden by settings passed through the command-line. The file format is a Java properties file.

When configuring an environment for the data mesh, it is best to include the Iceberg dependencies in the jars directory of our Spark distribution so they are available when Spark is deployed and not fetched each time a session is started.

Reference: https://tabular.io/apache-iceberg-cookbook/getting-started-spark-configuration/
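For reference, the same defaults can also be expressed programmatically. The sketch below mirrors typical conf/spark-defaults.conf entries for an Iceberg catalog backed by the Hive Metastore; the catalog name, metastore URI and warehouse location are placeholders, and it assumes the iceberg-spark-runtime jar is already in the jars directory as described above.

# Sketch: Iceberg catalog defaults expressed via SparkSession config. In practice
# the same key/value pairs would live in conf/spark-defaults.conf. Catalog name,
# metastore URI and warehouse location are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("datamesh")
    # Enable Iceberg's SQL extensions.
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Define a named Iceberg catalog backed by the Hive Metastore.
    .config("spark.sql.catalog.datamesh", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.datamesh.type", "hive")
    .config("spark.sql.catalog.datamesh.uri", "thrift://hive-metastore:9083")
    .config("spark.sql.catalog.datamesh.warehouse", "s3a://warehouse/")
    .getOrCreate()
)

spark.sql("SHOW TABLES IN datamesh.default").show()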

Notebook server integration with GitHub

We want data scientists / developers to have standalone notebook environments. The preferred way is to enable multi-user access through GitHub authentication, with each user able to spawn their own notebook instance.
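One possible way to wire this up, assuming a JupyterHub-based notebook spawner with the oauthenticator package (the platform may settle on a different mechanism, e.g. OpenShift AI workbenches), is sketched in the jupyterhub_config.py fragment below; the OAuth app credentials, callback URL and organisation are placeholders.

# Hedged sketch of a jupyterhub_config.py fragment enabling GitHub login,
# assuming JupyterHub plus the `oauthenticator` package. The `c` object is the
# config API JupyterHub injects when loading this file; client id/secret,
# callback URL and organisation are placeholders for a GitHub OAuth app.
c.JupyterHub.authenticator_class = "github"

c.GitHubOAuthenticator.oauth_callback_url = "https://notebooks.example.com/hub/oauth_callback"
c.GitHubOAuthenticator.client_id = "<github-oauth-app-client-id>"
c.GitHubOAuthenticator.client_secret = "<github-oauth-app-client-secret>"

# Optionally restrict access to members of a GitHub organisation.
c.GitHubOAuthenticator.allowed_organizations = {"example-org"}
c.GitHubOAuthenticator.scope = ["read:org"]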

Can CI/CD test notebooks that need Trino/AWS access?

I don't think there's a way we can make notebooks run in CI/CD and safely protect AWS secrets or JWT tokens. An attacker can easily introspect their environment and steal keys from the service account by creating a pull request and examining outputs. I think therefore we need to be doubly careful that there are no credentials within the service account, or they are sufficiently well hidden that they cannot be extracted by mere mortals. It's too easy to look at environment variables (where keys are currently stashed).

This code is how we do stuff in our own environments:

# Load environment variables from credentials.env
osc.load_credentials_dotenv()

This loads a file not visible to github (credentials.env) that is chock-full of secrets and injects them into the python environment. Since the person running the scripts owns and controls their own credentials.env file, that's not technically a security leak. But if the CI/CD account were to have its own credentials.env file, then a user could dump the environment and secrets would leak from CI/CD to the malicious user.

It's one thing to safely pass the secret from a GitHub Actions context to a program you want to run, with the guarantee that nobody can see the secret in GitHub. It's another thing to protect the key from a malicious program that wants to leak it. How can SUPER_SECRET be protected from something like raise ValueError("attention hackers: S U P E R _ S E C R E T = " + os.environ["SUPER_SECRET"])?

The best way to limit the AWS blast radius would be to use public data that doesn't require key-signing to access (accessor pays), and top up the CI/CD account with the trivial costs of data access (which it would bear anyway under provider pays). Our test data is public data.

But there's still the Trino JWT token. And to efficiently write data to Trino we need a writable AWS data bucket where Parquet files we write to Hive can be teleported to Iceberg Parquet. (It's MIND BLOWING that Trino's import capabilities are so primitive and non-performant: the DERA-DATA ingestion takes 3-4 hours using "normal" Trino ingestion and 1-2 minutes using Hive->Trino, but that's another problem.) So unless and until we can solve the credentials leak problem, I don't think we should try to make actual ingestion pipelines work with CI/CD. Rather, we need to write scripts that users can run to validate their expectations (things they need to read are readable; if they are writing, what they wrote is sensible) and whose receipts they can commit to "prove" their commits are in working order.

Taking a page from the pytest book, we can have our CI/CD notebook system ignore all files that don't start with test_ unless you ask nicely. If we change the GitHub Actions to only test notebooks that match **/test_*.ipynb, then the writers of notebooks can decide whether they are writing "must run to the end" notebooks (which are pytest friendly) or name them something that won't match.
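A hedged sketch of that discovery-and-execute step, assuming notebooks are run headlessly with jupyter nbconvert (the actual GitHub Actions runner may differ):

# Sketch: mirror the pytest convention by discovering only notebooks named
# test_*.ipynb and executing them headlessly. Assumes the `jupyter nbconvert`
# CLI is available on the CI runner.
import pathlib
import subprocess
import sys

failures = []
for nb in sorted(pathlib.Path(".").glob("**/test_*.ipynb")):
    print(f"Executing {nb} ...")
    result = subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute",
         "--output", nb.stem + ".executed", str(nb)],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        failures.append((nb, result.stderr))

for nb, err in failures:
    print(f"FAILED: {nb}\n{err}")
sys.exit(1 if failures else 0)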

But ultimately we need to provide clear guidance as to how we (1) protect secrets in our CI/CD system, and (2) limit the blast radius of inevitable failures.

@ModeSevenIndustrialSolutions

Apache Iceberg Sink Connector

Explore the integration of the Apache Iceberg Sink Connector for Kafka Connect, a sink connector for writing data from Kafka into Iceberg tables (a registration sketch follows the feature list below):

https://github.com/tabular-io/iceberg-kafka-connect

Features:

  • Commit coordination for centralized Iceberg commits
  • Exactly-once delivery semantics
  • Multi-table fan-out
  • Row mutations (update/delete rows), upsert mode
  • Automatic table creation and schema evolution
  • Field name mapping via Iceberg’s column mapping functionality
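If we go down this route, registering the connector would look roughly like the sketch below, which posts a connector definition to the Kafka Connect REST API. The connector class and iceberg.* property names are taken from the tabular-io project documentation and should be re-checked against the version actually deployed; topic, table and endpoints are placeholders.

# Rough sketch: register the Iceberg sink via the Kafka Connect REST API.
# Connector class and iceberg.* keys should be verified against the deployed
# connector version; topic, table names and endpoints are placeholders.
import requests

connector = {
    "name": "iceberg-sink-demo",
    "config": {
        "connector.class": "io.tabular.iceberg.connect.IcebergSinkConnector",
        "topics": "demo-events",
        "iceberg.tables": "demo.events",
        "iceberg.catalog.type": "hive",
        "iceberg.catalog.uri": "thrift://hive-metastore:9083",
        "iceberg.tables.auto-create-enabled": "true",
    },
}

resp = requests.post("http://kafka-connect:8083/connectors", json=connector, timeout=30)
resp.raise_for_status()
print(resp.json())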
