opendatahub-io-contrib / data-mesh-pattern
Data Mesh Pattern
Home Page: https://opendatahub-io-contrib.github.io/data-mesh-pattern
License: Apache License 2.0
Once a DBT pipeline has been set up, we want to encapsulate its execution in a DAG on Airflow, running on OpenShift. Look at implementing this from a pipeline-as-code perspective (everything included in the pipeline repo).
Good reference article:
https://itnext.io/the-way-to-integrate-trino-etl-jobs-using-dbt-trino-with-airflow-on-kubernetes-51cc851a366
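As a rough illustration, a minimal pipeline-as-code DAG could look like the sketch below. It assumes the dbt project is baked into (or git-synced into) the Airflow image under a hypothetical /opt/dbt/data_product path, and that Trino credentials are injected as environment variables at execution time:

```python
# Hypothetical sketch: run a dbt build for a data product on a daily schedule.
# The dag_id, paths and target name are assumptions, not part of the pattern yet.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="data_product_dbt",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_deps = BashOperator(
        task_id="dbt_deps",
        bash_command="cd /opt/dbt/data_product && dbt deps",
    )
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/dbt/data_product && dbt build --target trino",
    )
    dbt_deps >> dbt_build
```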
dbt elementary is a dbt package that provides observability and monitoring capabilities around dbt pipelines:
https://leo-godin.medium.com/are-you-using-elementary-for-dbt-f9a56ecbef42
Evaluate whether this adds capabilities that complement the OpenMetadata integration.
The end-to-end execution of data pipelines should happen with a DAG maintained in the Data Product Template repository, with secrets injected at execution time.
We need an LDAP Trino group provider. I don't think there is one in upstream Trino (only the file group provider).
Examples of Java-based ones:
https://github.com/eformat/trino-group-provider-ldap-ad
https://github.com/arghya18/trino-group-provider-ldap-ad
Starburst has this out of the box in the product - https://docs.starburst.io/latest/security/ldap-group-provider.html
Perhaps we should write a Quarkus version ourselves?
ArgoCD is deprecating ConfigMaps for plugin configuration in favor of the more secure sidecar container pod model.
This is supported with the ArgoCD Vault Plugin:
Example config is here:
https://github.com/eformat/argocd-vault-sidecar
This is being worked on in this branch / DRAFT PR:
During the Supply Chain Builds step it has us create our ArgoCD app of apps.
The problem is that when we run it, we get an error with the security context not matching any constraints:
```pods "opa-5867777fb9-" is forbidden: unable to validate against any
security context constraint: [provider "anyuid": Forbidden: not usable
by user or serviceaccount, provider "pipelines-scc": Forbidden: not
usable by user or serviceaccount,
spec.initContainers[0].securityContext.runAsUser: Invalid value:
1000810000: must be in the ranges: [1000860000, 1000869999],
spec.containers[0].securityContext.runAsUser: Invalid value: 1000810000:
must be in the ranges: [1000860000, 1000869999],
spec.containers[1].securityContext.runAsUser: Invalid value: 1000810000:
must be in the ranges: [1000860000, 1000869999], provider "restricted":
Forbidden: not usable by user or serviceaccount, provider
"container-build": Forbidden: not usable by user or serviceaccount,
provider "nonroot-v2": Forbidden: not usable by user or serviceaccount,
provider "nonroot": Forbidden: not usable by user or serviceaccount,
provider "hostmount-anyuid": Forbidden: not usable by user or
serviceaccount, provider "machine-api-termination-handler": Forbidden:
not usable by user or serviceaccount, provider "hostnetwork-v2":
Forbidden: not usable by user or serviceaccount, provider "hostnetwork":
Forbidden: not usable by user or serviceaccount, provider "hostaccess":
Forbidden: not usable by user or serviceaccount, provider
"node-exporter": Forbidden: not usable by user or serviceaccount,
provider "privileged": Forbidden: not usable by user or serviceaccount]```
The JHub single-user profiles need integrating with ODH/RHODS config.
Choosing a Spark-based JHub image does not spin up a cluster for the user.
The code exists for an ODH/custom JupyterHub deployment; need to see if this can be made to work with RHODS using config.
https://github.com/opendatahub-io-contrib/jupyterhub-singleuser-profiles
Cloning "https://gitlab-ce.apps.osc-cl4.apps.os-climate.org/osclimate-datamesh/data-mesh-pattern" ...
Commit: efb9821cee326adb0256eaa715d14ab17deb4bae (UPDATE - project rename)
Author: Derek Dinosaur [email protected]
Date: Tue Jun 13 07:40:45 2023 +0000
time="2023-06-21T15:57:03Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
I0621 15:57:03.856489 1 defaults.go:102] Defaulting to storage driver "overlay" with options [mountopt=metacopy=on].
Caching blobs under "/var/cache/blobs".
Pulling image registry.access.redhat.com/ubi8/ubi:8.7-1112 ...
Trying to pull registry.access.redhat.com/ubi8/ubi:8.7-1112...
Getting image source signatures
Copying blob sha256:6208c5a2e205726f3a2cd42a392c5e4f05256850d13197a711000c4021ede87b
Copying config sha256:768688a189716f9aef8d33a9eef4209f57dc2e66e9cb5fc3b8862940f314b9bc
Writing manifest to image destination
Storing signatures
Adding transient rw bind mount for /run/secrets/rhsm
[1/2] STEP 1/11: FROM registry.access.redhat.com/ubi8/go-toolset:1.17.12-3 AS builder
Trying to pull registry.access.redhat.com/ubi8/go-toolset:1.17.12-3...
Getting image source signatures
Copying blob sha256:7e3624512448126fd29504b9af9bc034538918c54f0988fb08c03ff7a3a9a4cb
Copying blob sha256:e0dc1b5a4801cf6fec23830d5fcea4b3fac076b9680999c49935e5b50a17e63b
Copying blob sha256:db0f4cd412505c5cc2f31cf3c65db80f84d8656c4bfa9ef627a6f532c0459fc4
Copying blob sha256:354c079828fae509c4f8e4ccb59199d275f17b0f26b1d7223fd64733788edf32
Copying blob sha256:26f52032c311fbc800e08f09294173c94c35c8fcd36ed2d43ee3255bda598373
Copying config sha256:068b656b38eb7ca9715019ba440d0cd2dade3154390e13b6397d4601a8bdce66
Writing manifest to image destination
Storing signatures
[1/2] STEP 2/11: ARG ARG_OS=linux
--> ef8de5d13a9
[1/2] STEP 3/11: ARG ARG_ARCH=amd64
--> f7bf97ebc3e
[1/2] STEP 4/11: ARG ARG_BIN=git-sync
--> 010315e264e
[1/2] STEP 5/11: ARG TARGETOS=linux
--> d643f9978f3
[1/2] STEP 6/11: ARG TARGETARCH=amd64
--> decd079af01
[1/2] STEP 7/11: WORKDIR /workspace
--> 5fe2777e29d
[1/2] STEP 8/11: RUN git clone https://github.com/kubernetes/git-sync.git /workspace
Cloning into '/workspace'...
/workspace/.git: Permission denied
error: build error: error building at STEP "RUN git clone https://github.com/kubernetes/git-sync.git /workspace": error while running runtime: exit status 1
Integration for Keycloak based login was added in #49
Two issues need some more work:
(1) No backchannel logout mechanism in OpenMetadata for KC. The frontchannel logout configured in the example did not seem to work (i.e. the session remains in KC post logout):
https://github.com/open-metadata/openmetadata-demo/blob/main/keycloak-sso/config/data-sec.json
Likely this needs fixing in OMD itself.
(2) The Team / Roles seem to be managed in the app - i.e. Admin is set using the env var AUTHORIZER_ADMIN_PRINCIPALS, and the default roles in the KC client have no effect:
```
roles:
  - name: DataConsumer
    composite: false
    clientRole: true
  - name: Admin
    composite: false
    clientRole: true
  - name: DataSteward
    composite: false
    clientRole: true
```
Would be nice if these Roles could be managed in KC instead.
Pachyderm provides limited capabilities for us to manage data versioning, in particular:
This issue is to review other open source projects such as lakeFS and DVC for potential replacement.
Integrate Kepler (https://sustainable-computing.io/) with the data mesh pattern to generate power consumption data at pod level and leverage it as optimization data for AIOps use-cases.
This will include:
Note: ideally we filter to workload data only (no persistence of control-plane data) given how much storage this will create, and provide a way to start / stop the collection.
First off, the Elyra Dockerfile likely needs osc-ingest-tools to do anything interesting when it comes to building data pipelines. Alas, os-climate/osc-ingest-tools#46 notes that osc-ingest-tools uses code that's deprecated in SQLAlchemy 2.0. @HeatherAck
When there is a Data Mesh pattern available in one of the OS-Climate clusters, I'll create a recipe for reproduction. This issue is just book-keeping at this point. @redmikhail
We need an updated osc-ingest-tools library (@erikerlandson) and an updated Elyra Dockerfile referencing that updated library.
In order to enable safe secrets retrieval for data pipelines, we want to use Vault as a secrets backend for Airflow. Relevant article: https://airflow.apache.org/docs/apache-airflow-providers-hashicorp/stable/secrets-backends/hashicorp-vault.html
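Once the backend is configured (a sketch of the env-var wiring is in the comments below), DAG code does not change; connections and variables resolve from Vault transparently. The connection id and Vault paths here are assumptions for illustration:

```python
# Assumed wiring (per the Airflow Hashicorp provider docs), e.g. via env vars:
#   AIRFLOW__SECRETS__BACKEND=airflow.providers.hashicorp.secrets.vault.VaultBackend
#   AIRFLOW__SECRETS__BACKEND_KWARGS='{"url": "http://vault:8200", "mount_point": "airflow",
#     "connections_path": "connections", "variables_path": "variables"}'
# With that in place, a connection stored at airflow/connections/trino_default in Vault
# is picked up like any other Airflow connection:
from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection("trino_default")  # hypothetical connection id
print(conn.host, conn.login)
```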
There should be a Data Engineer guide looking at how to build an end-to-end data product based on the template. Include development setup, and step-by-step pipeline development and execution.
Some reference of a step-by-step lakehouse build:
https://blog.min.io/lakehouse-architecture-iceberg-minio/
OS-Climate Developer guide
Initial documentation for deployment of pattern including prerequisites (with links), and deployment steps, similar to other patterns e.g. https://hybrid-cloud-patterns.io/patterns/multicloud-gitops/getting-started/
STAC (SpatioTemporal Asset Catalogs) specification is a common language to describe geospatial information, so it can more easily be worked with, indexed, and discovered.
Review how this should be integrated with the data mesh pattern in terms of data storage, metadata management, etc...
Reference link: https://stacspec.org/en
Deployed Airflow data ingestion with the new version of OpenMetadata. Able to create a pipeline, but not able to deploy the data ingestion pipeline to Airflow.
There is an issue with the latest version of OpenMetadata. Opened a support ticket in the OpenMetadata Slack; waiting for resolution from the OpenMetadata community.
https://app.slack.com/client/T02BVTLN3G8
The pattern should take care of how downstream deployments evolve / get updated over time, allow evolution without breaking changes, and safeguard downstream-specific configurations (such as connectors).
For a start, we want to test in the context of OS-Climate how specific Trino configuration can be driven over time, and use this as a way to start exploring the relationship between upstream and downstream.
Integrate with OpenTelemetry to export metrics, logs, and traces from the platform (as well as potentially Kepler) into data mesh ingestion. For this we focus on technical stacks to be used long term by our engineering team for metrics / logs / traces collection in the platform.
Metrics: Prometheus / Thanos
Logs: Loki / Vector
Traces: Jaeger / OpenTelemetry
The proposed approach would create a single layer of data delivery for metrics, logs and traces for the data collected and stored (potentially via ingestion through Trino / Iceberg).
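To make the collection side concrete, below is a minimal sketch of exporting a custom metric from a data-mesh workload through OpenTelemetry; the collector endpoint, metric name and attributes are assumptions, and the Prometheus / Loki / Jaeger backends listed above would sit behind the collector:

```python
# Hypothetical sketch: emit a custom metric over OTLP from a pipeline workload.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector.observability.svc:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("data-mesh-pipeline")

rows_ingested = meter.create_counter(
    "rows_ingested", description="Rows written to the Iceberg bucket"
)
rows_ingested.add(1000, {"data_product": "power_plants"})
```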
This issue is to explore a systematic approach to generating complex ontologies / metadata and data links by first ingesting a standard ontology and using a large language model like GPT-3 to create a script to generate and populate a metadata dictionary or a graph database.
Reference article: https://venturebeat.com/ai/how-to-use-large-language-models-and-knowledge-graphs-to-manage-enterprise-data/
Ingestion of data in batch via dataframes is slow, and we are looking at leveraging Iceberg and MinIO for direct ingest into Trino from partitioned ORC or Parquet tables written under the iceberg bucket.
Reference:
https://blog.min.io/lakehouse-architecture-iceberg-minio/
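A rough sketch of what this could look like, assuming Parquet files already sit under the iceberg bucket in MinIO and both a Hive and an Iceberg catalog are configured in Trino; catalog, schema, table and column names below are made up for illustration:

```python
# Hypothetical sketch: map Parquet files in MinIO to an external Hive table,
# then ingest them into Iceberg with a single CTAS executed by Trino.
from sqlalchemy import create_engine, text

engine = create_engine(
    "trino://user@trino-service:8443/hive",
    connect_args={"http_scheme": "https", "verify": "ca.crt"},
)

with engine.connect() as conn:
    # external table over the Parquet files written by the upstream job
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS hive.staging.power_plants (
            name VARCHAR, country VARCHAR, capacity_mw DOUBLE
        )
        WITH (external_location = 's3a://iceberg/staging/power_plants',
              format = 'PARQUET')
    """))
    # direct ingest: Trino reads the Parquet files and writes Iceberg metadata
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS iceberg.demo.power_plants
        WITH (format = 'PARQUET')
        AS SELECT * FROM hive.staging.power_plants
    """))
```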
OpenMetadata needs integrating with Keycloak login
BigChainDB (https://www.bigchaindb.com/) allows developers and enterprise to deploy blockchain proof-of-concepts, platforms and applications with a blockchain database, supporting a wide range of industries and use cases. In particular, it is used in GAIA-X and CatenaX for building decentralized data exchange secured by tokenization.
This issue is to support a PoC for deploying BigChainDB on our cluster, create a digital record (https://www.bigchaindb.com/developers/guide/tutorial-piece-of-art/) and then query the data via the MongoDB connector (https://docs.bigchaindb.com/en/latest/query.html)
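For the query part of the PoC, a minimal sketch of reading BigchainDB's backing MongoDB directly (per the query docs linked above) could look like this; the service host, database and collection names are assumptions to be confirmed against the deployment:

```python
# Hypothetical sketch: query assets stored by BigchainDB straight from MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://bigchaindb-mongodb:27017")
db = client["bigchain"]  # assumed default BigchainDB database name
for asset in db["assets"].find({"data.type": "artwork"}).limit(5):
    print(asset)
```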
Maintenance of taxonomies for data should ideally be done in some kind of standard format with the ability to build rules for data equivalence between different data formats. This would be useful in particular in the case of ESG taxonomies mapping. Without such an ability to have mappings maintained in a one dimensional format, a lot of maintenance is required for cross-mappings for example:
https://github.com/OS-SFT/Taxonomy-Mappings-Library
This issue is to investigate a better way to maintain mappings in order to support the taxonomy equivalence project run within OS-Climate.
Pare down the tools to those in the pattern, e.g.
remove mlflow - we intend to use ModelMesh
spark - handy for demoing, but we will likely align with ray.io in ODH for example
Leverage the fybrik trino module (https://github.com/fybrik/trino-module) to set up the integration with the Trino cluster.
Grafana is potentially a good tool for dynamic visualization of geographical data (https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/geomap/). Using the Trino plugin (https://github.com/trinodb/grafana-trino), it would be interesting to see if we can visualize data retrieved from Trino (such as power plants) overlaid on a world map.
Trino users may utilize various clients such as the CLI, JupyterHub, SQL editors, custom reporting tools, and other JDBC-based apps to connect to Trino from a wide range of locations over an enterprise network. Therefore, implementations of data mesh patterns at scale will typically have multiple Trino clusters to avoid a single point of failure, support scaling, and potentially optimize network routing closer to the query location. This can be achieved with dynamic query routing, and Goldman Sachs has implemented a solution using Envoy proxies to support this type of distributed Trino deployment.
https://developer.gs.com/blog/posts/enabling-highly-available-trino-clusters-at-goldman-sachs
We should review this architecture and determine if and how we could support similar deployment models with our pattern, in order to provide an out-of-the-box high availability approach.
Airflow needs a DAG directory with an .airflowignore set up - this was in the old docs as a step. Need to add it back in so Airflow deploys OK.
In addition to the simple quality checks done with DBT, we want to use great_expectations for more complex business checks, and integrate them with the data lineage automatically produced by OpenMetadata.
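As a flavour of the kind of business check this enables, here is a minimal sketch using the older pandas-dataset API of great_expectations; the dataframe and column name are made up, and the OpenMetadata lineage hookup is not shown:

```python
# Hypothetical sketch: a simple business expectation on a pandas dataframe.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"capacity_mw": [120.0, 300.5, None]})
gdf = ge.from_pandas(df)

# result.success is False because of the missing value - which is the point of the check
result = gdf.expect_column_values_to_not_be_null("capacity_mw")
print(result.success)
```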
Google has been leveraging Atlas visualizer for embeddings:
https://atlas.nomic.ai/
Look at the possibility of embedding visualization for distributed data sets as a way to explore / search for data. This can be a complement to an Elastic Search type of discovery.
The following URLs, as mentioned in https://github.com/opendatahub-io-contrib/data-mesh-pattern#data-mesh-pattern, point to incorrect endpoints:
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh -- Zhamak Dehghani, Thoughtworks
Data Mesh Principles and Logical Architecture -- Zhamak Dehghani, Thoughtworks
Click on any of the following two URLs in https://github.com/opendatahub-io-contrib/data-mesh-pattern#data-mesh-pattern:
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh -- Zhamak Dehghani, Thoughtworks
Data Mesh Principles and Logical Architecture -- Zhamak Dehghani, Thoughtworks
[How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh] should (most probably) be linked to https://martinfowler.com/articles/data-monolith-to-mesh.html
[Data Mesh Principles and Logical Architecture] should (most probably) be linked to https://martinfowler.com/articles/data-mesh-principles.html
Currently the helm chart supports separate Hive/S3 catalog deployments. You can define multiple Hive catalogs in the catalogs list:
```
catalogs:
  # Hive Demo Catalog
  - name: demo
    enabled: true
    replicaCountHive: 1
    replicaCountDb: 1
```
and have different connection secrets, e.g. DEMO_*:
```
database-password: <{{ $cat.name | upper }}_HIVE_DB_PASSWORD>
database-host: {{ $cat.name }}-hive-db
database-port: "5432"
database-name: <{{ $cat.name | upper }}_HIVE_DB_NAME>
database-user: <{{ $cat.name | upper }}_HIVE_DB_USERNAME>
```
Different approaches to this problem exist.
For example, in the upstream chart a simple ConfigMap is used:
https://github.com/trinodb/charts/blob/main/charts/trino/templates/configmap-catalog.yaml#L12-L17
In Operate-First / OS-Climate this is done using kustomize overlays.
We would like to document and extend a mechanism to support a broader range of connectors.
RHODS places all of the users and their notebook pods in one namespace, rhods-notebooks.
In a single OpenShift cluster it would be nice to be able to multi-tenant teams, so that users' notebooks are not visible to everyone who has access to the rhods-notebooks project.
In the original code base we could deploy an instance of upstream ODH JupyterHub per team, i.e. multiple JupyterHub instances, thus allowing this type of separation.
The rainforest demo examples need removing and/or changing to target data mesh instead.
See the docs/4-aiml-demos folder for user demos and examples.
The SAMEPATH datasets (https://samepath.shinyapps.io/samepath/#dataAccess) consist of many tables from NGFS, UNIPRI, GECO, and other public sources related to sustainable finance. We want to demonstrate the ease with which we can federate this data from primary sources, maintain the data as it is updated (usually annually), and serve as the future data source for the SAMEPATH visualization (R-Shiny) tools.
... Not able to access MinIO using OCP user ids user1 and admin.
We forked the reloader and git-sync code for off-line building based on UBI.
We should revert to the upstream code git repos, but keep the builds, as UBI-based images are advantageous.
We have a certificate issue when running a query against Trino, passing the self-signed certificate at https://github.com/opendatahub-io-contrib/data-mesh-pattern/blob/main/supply-chain/trino/trino-certs/ca.crt:
Code to reproduce:

```python
import os

from sqlalchemy import create_engine, text

# ingest_catalog is defined earlier in the notebook
certificate_path = '../../ca.crt'
engine = create_engine(
    'trino://' + os.environ['TRINO_USER'] + ':' + os.environ['TRINO_PASSWD']
    + '@' + os.environ['TRINO_HOST'] + ':' + os.environ['TRINO_PORT'] + '/'
    + ingest_catalog,
    connect_args={'verify': certificate_path},
)
with engine.connect() as connection:
    result = connection.execute(text('show catalogs'))
    for row in result:
        print(row)
```
Error:

```
SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1131)

During handling of the above exception, another exception occurred:

MaxRetryError: HTTPSConnectionPool(host='trino-service.daintree-dev.svc.cluster.local', port=8443): Max retries exceeded with url: /v1/statement (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1131)')))
```
The data mesh pattern should provide a way for data product owners to contribute curated data for LLM training. A good approach and reference is the datalake approach for gpt4all:
Similar to #79, pachyderm containers are unable to start, as it appears to be looking for an operator group that isn't there:
```
failed to populate resolver cache from source
operatorgroup-unavailable/pachyderm: found 0 operatorgroups in namespace
pachyderm: expected 1
```
There should be some documentation on design guidance for data domains / products, which can be referenced then in examples and development doc.
A good checklist for this can be found at: https://towardsdatascience.com/data-domains-and-data-products-64cc9d28283e
During the build of all the containers the elyra-tflow container fails to build:
Cloning "https://gitlab-ce.apps.osc-cl4.apps.os-climate.org/osclimate-datamesh/data-mesh-pattern" ...
Commit: efb9821cee326adb0256eaa715d14ab17deb4bae (UPDATE - project rename)
Author: Derek Dinosaur <[email protected]>
Date: Tue Jun 13 07:40:45 2023 +0000
Replaced Dockerfile FROM image elyra-base:0.2.1
time="2023-06-26T20:43:48Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
I0626 20:43:48.530450 1 defaults.go:102] Defaulting to storage driver "overlay" with options [mountopt=metacopy=on].
Caching blobs under "/var/cache/blobs".
Pulling image image-registry.openshift-image-registry.svc:5000/osclimate-datamesh-ci-cd/elyra-base@sha256:c03982d4db4e361d5302a4d8c0632bc07bda9bc8b6ebb1b9029bcd8393bcb3ea ...
Trying to pull image-registry.openshift-image-registry.svc:5000/osclimate-datamesh-ci-cd/elyra-base@sha256:c03982d4db4e361d5302a4d8c0632bc07bda9bc8b6ebb1b9029bcd8393bcb3ea...
Getting image source signatures
Copying blob sha256:028bdc977650c08fcf7a2bb4a7abefaead71ff8a84a55ed5940b6dbc7e466045
Copying blob sha256:af38327575d72c979478aaddf6a33ad9cf561844588f5db47e85c4ee721012ec
Copying blob sha256:819ccd5eb87778d75c516f3a542ae6a3d2367498bd7062a701cb2237995f6cb5
Copying blob sha256:a439c75b0a4f2699983da35fc5e15fd9809bc37f694f54717020886cffc0548b
Copying blob sha256:0c673eb68f88b60abc0cba5ef8ddb9c256eaf627bfd49eb7e09a2369bb2e5db0
Copying blob sha256:c37fd7de0840b4031b29e532b9c694c59a63983ae93162a2e6476882cd075b21
Copying blob sha256:bf105214519e48fd5c21e598563e367f6f3b7c30996d1745a99428752c0ad1ae
Copying blob sha256:0cdbf2b404cc6f9f91c9f46d490f467080c4b5d8ee3b5d4c925e02a340e8d10b
Copying blob sha256:f2316205fe7bc7979d3019254716646bf2f786c1825faa1c1ed39f7420174b25
Copying blob sha256:68057c5053360a1a580bb505ba567d6f4c771d07fe959a30c547d4e276bc0467
Copying blob sha256:988a562fbd90b733eb253c56d63a830afed36df0e609418700caccd23a245fdc
Copying blob sha256:90cf9451d289c16ed981d2a646cfc979874f0eff05ea2e86edfefac87ff0b2e6
Copying blob sha256:ebb3898343c60b4a8d79aed8a93654dc73a0f980ea1bf7e30018bd449d4f611b
Copying blob sha256:059ceb835a667820ab78d7d6fb48b9e7fbb769ce612281ba189bed25ce0a99db
Copying blob sha256:f31e46de923b1250ab065453646dcff2466749a2e9549ea289b038cfa3fefe36
Copying blob sha256:9ff9b64097f0280c8b0ecd3a2a801bf474d0aa3fc160350fd699c1d929e0241b
Copying blob sha256:90c508cf12e1e5825e29e1eec796188af045440ffa6d697f35279a813b004b9b
Copying blob sha256:acab339ca1e8ed7aefa2b4c271176a7787663685bf8759f5ce69b40e4bd7ef86
Copying blob sha256:acab339ca1e8ed7aefa2b4c271176a7787663685bf8759f5ce69b40e4bd7ef86
Copying blob sha256:8fde022b6648ce49357f0c7620a96ba04104c5ad0e9029078ff878cfc37021bb
Copying blob sha256:90367ac5959ea0a29369bb20aff6c90903326a1fa703befc629d1cdf024fc99a
Copying blob sha256:ae97caea9fa3345a096a09d1df0fa8b68a31cdd398c4402748e0b548fe2f25ff
Copying blob sha256:acab339ca1e8ed7aefa2b4c271176a7787663685bf8759f5ce69b40e4bd7ef86
Copying blob sha256:acab339ca1e8ed7aefa2b4c271176a7787663685bf8759f5ce69b40e4bd7ef86
Copying blob sha256:8a441ef86887ecc2a66703d73d2b86538a75edfa38a90cc19d73bb7aaa4aa8cd
Copying blob sha256:23946ae671303d0c6cc4870accc51fe43463e7993e122f7b08082dc2a9726a0f
Copying blob sha256:073c6e194011062bd49b1ccda1819f15aa368590829afae3e0263759cf4dacba
Copying blob sha256:46e601ccae7c5a32545b6b6db733b3d2db5b6581b915520edbcbf262a2b79110
Copying blob sha256:4a269fdc289ca6b7833584bede177c80aa91f2706dea33bc2b94398a3e83d9d0
Copying blob sha256:12ef46b74f05917750404a8de7565168740216fa44be5b19d5d75273a3ec0c86
Copying blob sha256:b2ec48efaf35963d699ae8446e20120869f9fb1ca34ee70f64b82a6050e627f7
Copying blob sha256:c48957ebd2d09b52f4d564cbd5914b1b9e94939f21142f6041db41d0e62fab74
Copying blob sha256:08c9d67bcd774940f73a67eb036be8a756d8eab9b2e4c43bc4e0bcdf17cdaea3
Copying blob sha256:c23f73eb778d14742f04e1238227b8efc4fd1ce51d17a98100744e912e752901
Copying blob sha256:5e0654a3c30dd59ab31f6531ae1a8ad9a8368c5cb6368550e0de2e7c66f9b3b9
Copying blob sha256:237741efa6248120129716d660cc7fece732ea172110784949b97a96e681cb62
Copying blob sha256:58787dd3cb793f5983c0aaa6b70341c30a41a1bb60fc1a5f6f1cd9061ee2edc0
Copying blob sha256:4f5aa417a25f646d2d39642577d4580eedd0fe809c857932aeabd3bb22587bb9
Copying blob sha256:f009e2fceca5421f4769b12a3dd42777940ee1e6e8f17c8c5b77b5e248b9b7d2
Copying blob sha256:2fb528adb3814ee51b07a0165956060c4d0703d454a18f08c6430ed667ed5853
Copying blob sha256:cef676ff822d33c5bdc8cc17a6af24ce425f2353463b189c7ee1a637c2d012ea
Copying blob sha256:f7ea4b46629aacb0aacbf8fe8197fb924a48c9e8875d9f9721565b4a7374549a
Copying blob sha256:d1473e2d5c4be6a885eba43606bbe79229239b92427436391a8cf9edb977e357
Copying blob sha256:d3cda3d33521c0cd44da393733605297f341d7e36a42850e945d122578533ded
Copying config sha256:4a7596a0ebbeb7ba5f97a2ca3d310d6ec4b0842fa024310ec3e235517d45d4dd
Writing manifest to image destination
Storing signatures
Adding transient rw bind mount for /run/secrets/rhsm
STEP 1/8: FROM image-registry.openshift-image-registry.svc:5000/osclimate-datamesh-ci-cd/elyra-base@sha256:c03982d4db4e361d5302a4d8c0632bc07bda9bc8b6ebb1b9029bcd8393bcb3ea
STEP 2/8: USER root
--> 8ab6947207f
STEP 3/8: RUN /opt/app-root/bin/pip3 install jinja2==3.1.2
Looking in indexes: https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/simple
Requirement already satisfied: jinja2==3.1.2 in /opt/app-root/lib/python3.8/site-packages (3.1.2)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/app-root/lib/python3.8/site-packages (from jinja2==3.1.2) (2.1.1)
WARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.
You should consider upgrading via the '/opt/app-root/bin/python3.8 -m pip install --upgrade pip' command.
--> fbfab4021e8
STEP 4/8: RUN /opt/app-root/bin/pip3 install certifi
Looking in indexes: https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/simple
Requirement already satisfied: certifi in /opt/app-root/lib/python3.8/site-packages (2022.9.24)
WARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.
You should consider upgrading via the '/opt/app-root/bin/python3.8 -m pip install --upgrade pip' command.
--> 1c77c833b0e
STEP 5/8: RUN /opt/app-root/bin/pip3 install matplotlib numpy pandas scipy scikit-learn tensorflow minio
Looking in indexes: https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/simple
Collecting matplotlib
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/matplotlib/3.7.1/matplotlib-3.7.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (9.2 MB)
Collecting numpy
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/numpy/1.24.4/numpy-1.24.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting pandas
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/pandas/2.0.2/pandas-2.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting scipy
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/scipy/1.10.1/scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
Collecting scikit-learn
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/scikit-learn/1.2.2/scikit_learn-1.2.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.8 MB)
Collecting tensorflow
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/tensorflow/2.12.0/tensorflow-2.12.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (585.9 MB)
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
tensorflow from https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/tensorflow/2.12.0/tensorflow-2.12.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=23850332f1f9f778d697c9dba63ca52be72cb73363e75ad358f07ddafef63c01:
Expected sha256 23850332f1f9f778d697c9dba63ca52be72cb73363e75ad358f07ddafef63c01
Got 2ecfc624220e0e36c414dc6889ab365f02f50a9edc3f230dcebbd4955cbf62fa
WARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.
You should consider upgrading via the '/opt/app-root/bin/python3.8 -m pip install --upgrade pip' command.
error: build error: error building at STEP "RUN /opt/app-root/bin/pip3 install matplotlib numpy pandas scipy scikit-learn tensorflow minio": error while running runtime: exit status 1
The Google Data Commons (https://datacommons.org/) has over 1 trillion datapoints of all kinds, organized in a knowledge graph and available via BigQuery. Some of this data is directly useful to climate and sustainable finance analysis, and some of this data could be useful when linked to corporate ownership (via entity matching).
Here are datasets federated by Google's Data Commons that relate to the topic Environment: https://docs.datacommons.org/datasets/Environment.html
Here is a narrowing of that data that relates to the topic Emissions within the US (based on EPA GHGRP): https://datacommons.org/tools/map#%26sv%3DAnnual_Emissions_CarbonDioxide_NonBiogenic%26pc%3D0%26denom%3DCount_Person%26pd%3Dcountry%2FUSA%26ept%3DState%26ppt%3DEpaReportingFacility
The goal of this exercise is to demonstrate our ability to federate a tiny but meaningful slice of Google's Data Commons data into the Data Mesh and to expose that data within OS-Climate's Data Exchange. The data should be chosen so that a meaningful "so what?" question can be answered, but the overall point of the exercise is to assess the ease with which the Data Mesh can enable data analysts to be maximally productive and effective when asking and answering climate and sustainable finance questions.
Please feel free to flesh out and/or ask further questions.
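As a starting point, here is a minimal sketch of pulling one of the emissions statistical variables from the public Data Commons Python API; the place DCID and the choice of variable are assumptions for illustration, and landing the result in the mesh (MinIO / Trino) is not shown:

```python
# Hypothetical sketch: fetch one Data Commons statistical variable for a place.
import datacommons as dc

# geoId/06 is California; the variable matches the Emissions map linked above.
value = dc.get_stat_value("geoId/06", "Annual_Emissions_CarbonDioxide_NonBiogenic")
print(value)
```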