opendatahub-io-contrib / data-mesh-pattern
Data Mesh Pattern
Home Page: https://opendatahub-io-contrib.github.io/data-mesh-pattern
License: Apache License 2.0
Once a DBT pipeline has been set up, we want to encapsulate its execution in a DAG on Airflow, running on OpenShift. Look at implementing this from a pipeline-as-code perspective (everything included in the pipeline repo).
Good reference article:
https://itnext.io/the-way-to-integrate-trino-etl-jobs-using-dbt-trino-with-airflow-on-kubernetes-51cc851a366
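As a rough illustration, a minimal pipeline-as-code DAG could look like the sketch below. It assumes the dbt project is baked into (or git-synced into) the Airflow image under a hypothetical /opt/dbt/data_product path, and that Trino credentials are injected as environment variables at execution time:

```python
# Hypothetical sketch: run a dbt build for a data product on a daily schedule.
# The dag_id, paths and target name are assumptions, not part of the pattern yet.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="data_product_dbt",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    dbt_deps = BashOperator(
        task_id="dbt_deps",
        bash_command="cd /opt/dbt/data_product && dbt deps",
    )
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command="cd /opt/dbt/data_product && dbt build --target trino",
    )
    dbt_deps >> dbt_build
```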
dbt elementary is a dbt package that provides observability and monitoring capabilities around dbt pipelines:
https://leo-godin.medium.com/are-you-using-elementary-for-dbt-f9a56ecbef42
Evaluate whether this adds capabilities that complement the OpenMetadata integration.
The end-to-end execution of data pipelines should happen with a DAG maintained in the Data Product Template repository, with secrets injected at execution time.
We need an LDAP Trino group provider. I don't think there is one in upstream Trino (only the file group provider).
Examples of Java-based ones:
https://github.com/eformat/trino-group-provider-ldap-ad
https://github.com/arghya18/trino-group-provider-ldap-ad
Starburst has this out of the box in the product - https://docs.starburst.io/latest/security/ldap-group-provider.html
Perhaps we should write a Quarkus version ourselves?
ArgoCD is deprecating ConfigMaps for plugin configuration in favor of the more secure sidecar container pod model.
This is supported with the ArgoCD Vault Plugin:
Example config is here:
https://github.com/eformat/argocd-vault-sidecar
This is being worked on in this branch / DRAFT PR:
During the Supply Chain Builds step it has us create our ArgoCD app of apps.
The problem is that when we run it, we get an error with the security context not matching any constraints:
```pods "opa-5867777fb9-" is forbidden: unable to validate against any
security context constraint: [provider "anyuid": Forbidden: not usable
by user or serviceaccount, provider "pipelines-scc": Forbidden: not
usable by user or serviceaccount,
spec.initContainers[0].securityContext.runAsUser: Invalid value:
1000810000: must be in the ranges: [1000860000, 1000869999],
spec.containers[0].securityContext.runAsUser: Invalid value: 1000810000:
must be in the ranges: [1000860000, 1000869999],
spec.containers[1].securityContext.runAsUser: Invalid value: 1000810000:
must be in the ranges: [1000860000, 1000869999], provider "restricted":
Forbidden: not usable by user or serviceaccount, provider
"container-build": Forbidden: not usable by user or serviceaccount,
provider "nonroot-v2": Forbidden: not usable by user or serviceaccount,
provider "nonroot": Forbidden: not usable by user or serviceaccount,
provider "hostmount-anyuid": Forbidden: not usable by user or
serviceaccount, provider "machine-api-termination-handler": Forbidden:
not usable by user or serviceaccount, provider "hostnetwork-v2":
Forbidden: not usable by user or serviceaccount, provider "hostnetwork":
Forbidden: not usable by user or serviceaccount, provider "hostaccess":
Forbidden: not usable by user or serviceaccount, provider
"node-exporter": Forbidden: not usable by user or serviceaccount,
provider "privileged": Forbidden: not usable by user or serviceaccount]```
The JHub single-user profiles need integrating with ODH/RHODS config.
Choosing a Spark-based JHub image does not spin up a cluster for the user.
The code exists for an ODH/custom JupyterHub deployment; need to see if this can be made to work with RHODS using config.
https://github.com/opendatahub-io-contrib/jupyterhub-singleuser-profiles
Cloning "https://gitlab-ce.apps.osc-cl4.apps.os-climate.org/osclimate-datamesh/data-mesh-pattern" ...
Commit: efb9821cee326adb0256eaa715d14ab17deb4bae (UPDATE - project rename)
Author: Derek Dinosaur [email protected]
Date: Tue Jun 13 07:40:45 2023 +0000
time="2023-06-21T15:57:03Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
I0621 15:57:03.856489 1 defaults.go:102] Defaulting to storage driver "overlay" with options [mountopt=metacopy=on].
Caching blobs under "/var/cache/blobs".
Pulling image registry.access.redhat.com/ubi8/ubi:8.7-1112 ...
Trying to pull registry.access.redhat.com/ubi8/ubi:8.7-1112...
Getting image source signatures
Copying blob sha256:6208c5a2e205726f3a2cd42a392c5e4f05256850d13197a711000c4021ede87b
Copying config sha256:768688a189716f9aef8d33a9eef4209f57dc2e66e9cb5fc3b8862940f314b9bc
Writing manifest to image destination
Storing signatures
Adding transient rw bind mount for /run/secrets/rhsm
[1/2] STEP 1/11: FROM registry.access.redhat.com/ubi8/go-toolset:1.17.12-3 AS builder
Trying to pull registry.access.redhat.com/ubi8/go-toolset:1.17.12-3...
Getting image source signatures
Copying blob sha256:7e3624512448126fd29504b9af9bc034538918c54f0988fb08c03ff7a3a9a4cb
Copying blob sha256:e0dc1b5a4801cf6fec23830d5fcea4b3fac076b9680999c49935e5b50a17e63b
Copying blob sha256:db0f4cd412505c5cc2f31cf3c65db80f84d8656c4bfa9ef627a6f532c0459fc4
Copying blob sha256:354c079828fae509c4f8e4ccb59199d275f17b0f26b1d7223fd64733788edf32
Copying blob sha256:26f52032c311fbc800e08f09294173c94c35c8fcd36ed2d43ee3255bda598373
Copying config sha256:068b656b38eb7ca9715019ba440d0cd2dade3154390e13b6397d4601a8bdce66
Writing manifest to image destination
Storing signatures
[1/2] STEP 2/11: ARG ARG_OS=linux
--> ef8de5d13a9
[1/2] STEP 3/11: ARG ARG_ARCH=amd64
--> f7bf97ebc3e
[1/2] STEP 4/11: ARG ARG_BIN=git-sync
--> 010315e264e
[1/2] STEP 5/11: ARG TARGETOS=linux
--> d643f9978f3
[1/2] STEP 6/11: ARG TARGETARCH=amd64
--> decd079af01
[1/2] STEP 7/11: WORKDIR /workspace
--> 5fe2777e29d
[1/2] STEP 8/11: RUN git clone https://github.com/kubernetes/git-sync.git /workspace
Cloning into '/workspace'...
/workspace/.git: Permission denied
error: build error: error building at STEP "RUN git clone https://github.com/kubernetes/git-sync.git /workspace": error while running runtime: exit status 1
Integration for Keycloak based login was added in #49
Two issues need some more work:
(1) No backchannel logout mechanism in OpenMetadata for KC. The frontchannel logout configured in the example did not seem to work (i.e. the session remains in KC post logout):
https://github.com/open-metadata/openmetadata-demo/blob/main/keycloak-sso/config/data-sec.json
Likely this needs fixing in OMD itself.
(2) The Team / Roles seem to be managed in the app - i.e. Admin is set using the env var AUTHORIZER_ADMIN_PRINCIPALS, and the default roles in the KC client have no effect:
```
roles:
  - name: DataConsumer
    composite: false
    clientRole: true
  - name: Admin
    composite: false
    clientRole: true
  - name: DataSteward
    composite: false
    clientRole: true
```
Would be nice if these Roles could be managed in KC instead.
Pachyderm provides limited capabilities for us to manage data versioning, in particular:
This issue is to review other open source projects such as lakeFS and DVC for potential replacement.
Integrate Kepler (https://sustainable-computing.io/) with the data mesh pattern to generate power consumption data at pod level and leverage it as optimization data for AIOps use-cases.
This will include:
Note: ideally we filter to workload data only (no persistence of control-plane data) given how much storage this will create, and provide a way to start / stop the collection.
First off, the Elyra Dockerfile likely needs osc-ingest-tools to do anything interesting when it comes to building data pipelines. Alas, os-climate/osc-ingest-tools#46 notes that osc-ingest-tools uses code that's deprecated in SQLAlchemy 2.0. @HeatherAck
When there is a Data Mesh pattern available in one of the OS-Climate clusters, I'll create a recipe for reproduction. This issue is just book-keeping at this point. @redmikhail
We need an updated osc-ingest-tools library (@erikerlandson) and an updated Elyra Dockerfile referencing that updated library.
In order to enable safe secrets retrieval for data pipelines, we want to use Vault as a secrets backend for Airflow. Relevant article: https://airflow.apache.org/docs/apache-airflow-providers-hashicorp/stable/secrets-backends/hashicorp-vault.html
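Once the backend is configured (a sketch of the env-var wiring is in the comments below), DAG code does not change; connections and variables resolve from Vault transparently. The connection id and Vault paths here are assumptions for illustration:

```python
# Assumed wiring (per the Airflow Hashicorp provider docs), e.g. via env vars:
#   AIRFLOW__SECRETS__BACKEND=airflow.providers.hashicorp.secrets.vault.VaultBackend
#   AIRFLOW__SECRETS__BACKEND_KWARGS='{"url": "http://vault:8200", "mount_point": "airflow",
#     "connections_path": "connections", "variables_path": "variables"}'
# With that in place, a connection stored at airflow/connections/trino_default in Vault
# is picked up like any other Airflow connection:
from airflow.hooks.base import BaseHook

conn = BaseHook.get_connection("trino_default")  # hypothetical connection id
print(conn.host, conn.login)
```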
There should be a Data Engineer guide looking at how to build an end-to-end data product based on the template. Include development setup, and step-by-step pipeline development and execution.
Some reference of a step-by-step lakehouse build:
https://blog.min.io/lakehouse-architecture-iceberg-minio/
OS-Climate Developer guide
Initial documentation for deployment of pattern including prerequisites (with links), and deployment steps, similar to other patterns e.g. https://hybrid-cloud-patterns.io/patterns/multicloud-gitops/getting-started/
STAC (SpatioTemporal Asset Catalogs) specification is a common language to describe geospatial information, so it can more easily be worked with, indexed, and discovered.
Review how this should be integrated with the data mesh pattern in terms of data storage, metadata management, etc...
Reference link: https://stacspec.org/en
Deployed Airflow data ingestion with the new version of OpenMetadata. Able to create a pipeline, but not able to deploy the data ingestion pipeline to Airflow.
There is an issue with the latest version of OpenMetadata. Opened a support ticket in the OpenMetadata Slack; waiting for resolution from the OpenMetadata community.
https://app.slack.com/client/T02BVTLN3G8
The pattern should take care of how downstream deployments evolve / get updated over time, allow evolution without breaking changes, and safeguard downstream-specific configurations (such as connectors).
For a start, we want to test in the context of OS-Climate how specific Trino configuration can be driven over time, and use this as a way to start exploring the relationship between upstream and downstream.
Integrate with OpenTelemetry to export metrics, logs, and traces from the platform (as well as potentially Kepler) into data mesh ingestion. For this we focus on technical stacks to be used long term by our engineering team for metrics / logs / traces collection in the platform.
Metrics: Prometheus / Thanos
Logs: Loki / Vector
Traces: Jaeger / OpenTelemetry
The proposed approach would create a single layer of data delivery for metrics, logs and traces for the data collected and stored (potentially via ingestion through Trino / Iceberg).
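To make the collection side concrete, below is a minimal sketch of exporting a custom metric from a data-mesh workload through OpenTelemetry; the collector endpoint, metric name and attributes are assumptions, and the Prometheus / Loki / Jaeger backends listed above would sit behind the collector:

```python
# Hypothetical sketch: emit a custom metric over OTLP from a pipeline workload.
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector.observability.svc:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("data-mesh-pipeline")

rows_ingested = meter.create_counter(
    "rows_ingested", description="Rows written to the Iceberg bucket"
)
rows_ingested.add(1000, {"data_product": "power_plants"})
```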
This issue is to explore a systematic approach to generating complex ontologies / metadata and data links by first ingesting a standard ontology and using a large language model like GPT-3 to create a script to generate and populate a metadata dictionary or a graph database.
Reference article: https://venturebeat.com/ai/how-to-use-large-language-models-and-knowledge-graphs-to-manage-enterprise-data/
Ingestion of data in batch via dataframes is slow, and we are looking at leveraging Iceberg and MinIO for direct ingest into Trino from partitioned ORC or Parquet tables written under the iceberg bucket.
Reference:
https://blog.min.io/lakehouse-architecture-iceberg-minio/
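A rough sketch of what this could look like, assuming Parquet files already sit under the iceberg bucket in MinIO and both a Hive and an Iceberg catalog are configured in Trino; catalog, schema, table and column names below are made up for illustration:

```python
# Hypothetical sketch: map Parquet files in MinIO to an external Hive table,
# then ingest them into Iceberg with a single CTAS executed by Trino.
from sqlalchemy import create_engine, text

engine = create_engine(
    "trino://user@trino-service:8443/hive",
    connect_args={"http_scheme": "https", "verify": "ca.crt"},
)

with engine.connect() as conn:
    # external table over the Parquet files written by the upstream job
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS hive.staging.power_plants (
            name VARCHAR, country VARCHAR, capacity_mw DOUBLE
        )
        WITH (external_location = 's3a://iceberg/staging/power_plants',
              format = 'PARQUET')
    """))
    # direct ingest: Trino reads the Parquet files and writes Iceberg metadata
    conn.execute(text("""
        CREATE TABLE IF NOT EXISTS iceberg.demo.power_plants
        WITH (format = 'PARQUET')
        AS SELECT * FROM hive.staging.power_plants
    """))
```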
OpenMetadata needs integrating with Keycloak login
BigChainDB (https://www.bigchaindb.com/) allows developers and enterprise to deploy blockchain proof-of-concepts, platforms and applications with a blockchain database, supporting a wide range of industries and use cases. In particular, it is used in GAIA-X and CatenaX for building decentralized data exchange secured by tokenization.
This issue is to support a PoC for deploying BigChainDB on our cluster, create a digital record (https://www.bigchaindb.com/developers/guide/tutorial-piece-of-art/) and then query the data via the MongoDB connector (https://docs.bigchaindb.com/en/latest/query.html)
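For the query part of the PoC, a minimal sketch of reading BigchainDB's backing MongoDB directly (per the query docs linked above) could look like this; the service host, database and collection names are assumptions to be confirmed against the deployment:

```python
# Hypothetical sketch: query assets stored by BigchainDB straight from MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://bigchaindb-mongodb:27017")
db = client["bigchain"]  # assumed default BigchainDB database name
for asset in db["assets"].find({"data.type": "artwork"}).limit(5):
    print(asset)
```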
Maintenance of taxonomies for data should ideally be done in some kind of standard format with the ability to build rules for data equivalence between different data formats. This would be useful in particular in the case of ESG taxonomies mapping. Without such an ability to have mappings maintained in a one dimensional format, a lot of maintenance is required for cross-mappings for example:
https://github.com/OS-SFT/Taxonomy-Mappings-Library
This issue is to investigate a better way to maintain mappings in order to support the taxonomy equivalence project run within OS-Climate.
Pare down the tools to those in the pattern, e.g.
remove mlflow - we intend to use ModelMesh
spark - handy for demoing, but we will likely align with ray.io in ODH for example
Leverage the fybrik trino module (https://github.com/fybrik/trino-module) to set up the integration with the Trino cluster.
Grafana is potentially a good tool for dynamic visualization of geographical data (https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/geomap/). Using the Trino plugin (https://github.com/trinodb/grafana-trino), it would be interesting to see if we can visualize data retrieved from Trino (such as power plants) overlaid on a world map.
Trino users may utilize various clients such as the CLI, JupyterHub, SQL editors, custom reporting tools, and other JDBC-based apps to connect to Trino from a wide range of locations over an enterprise network. Therefore, implementations of data mesh patterns at scale will typically have multiple Trino clusters to avoid a single point of failure, support scaling, and potentially optimize network routing closer to the query location. This can be achieved with dynamic query routing, and Goldman Sachs has implemented a solution using Envoy proxies to support this type of distributed Trino deployment.
https://developer.gs.com/blog/posts/enabling-highly-available-trino-clusters-at-goldman-sachs
We should review this architecture and determine if and how we could support similar deployment models with our pattern, in order to provide an out-of-the-box high availability approach.
Airflow needs a DAG directory with an .airflowignore set up - this was in the old docs as a step. Need to add it back in so Airflow deploys OK.
In addition to the simple quality checks done with DBT, we want to use great_expectations for more complex business checks, and integrate them with the data lineage automatically produced by OpenMetadata.
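As a flavour of the kind of business check this enables, here is a minimal sketch using the older pandas-dataset API of great_expectations; the dataframe and column name are made up, and the OpenMetadata lineage hookup is not shown:

```python
# Hypothetical sketch: a simple business expectation on a pandas dataframe.
import great_expectations as ge
import pandas as pd

df = pd.DataFrame({"capacity_mw": [120.0, 300.5, None]})
gdf = ge.from_pandas(df)

# result.success is False because of the missing value - which is the point of the check
result = gdf.expect_column_values_to_not_be_null("capacity_mw")
print(result.success)
```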
Google has been leveraging Atlas visualizer for embeddings:
https://atlas.nomic.ai/
Look at the possibility of embedding visualization for distributed data sets as a way to explore / search for data. This can be a complement to an Elastic Search type of discovery.
The following URLs, as mentioned in https://github.com/opendatahub-io-contrib/data-mesh-pattern#data-mesh-pattern, point to incorrect endpoints:
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh -- Zhamak Dehghani, Thoughtworks
Data Mesh Principles and Logical Architecture -- Zhamak Dehghani, Thoughtworks
Click on any of the following two URLs in https://github.com/opendatahub-io-contrib/data-mesh-pattern#data-mesh-pattern:
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh -- Zhamak Dehghani, Thoughtworks
Data Mesh Principles and Logical Architecture -- Zhamak Dehghani, Thoughtworks
[How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh] should (most probably) be linked to https://martinfowler.com/articles/data-monolith-to-mesh.html
[Data Mesh Principles and Logical Architecture] should (most probably) be linked to https://martinfowler.com/articles/data-mesh-principles.html
Currently the helm chart supports separate Hive/S3 catalog deployments. You can define multiple Hive catalogs in the catalogs list:
```
catalogs:
  # Hive Demo Catalog
  - name: demo
    enabled: true
    replicaCountHive: 1
    replicaCountDb: 1
```
and have different connection secrets, e.g. DEMO_*:
```
database-password: <{{ $cat.name | upper }}_HIVE_DB_PASSWORD>
database-host: {{ $cat.name }}-hive-db
database-port: "5432"
database-name: <{{ $cat.name | upper }}_HIVE_DB_NAME>
database-user: <{{ $cat.name | upper }}_HIVE_DB_USERNAME>
```
Different approaches to this problem exist.
For example, in the upstream chart a simple ConfigMap is used:
https://github.com/trinodb/charts/blob/main/charts/trino/templates/configmap-catalog.yaml#L12-L17
In Operate-First / OS-Climate this is done using kustomize overlays.
We would like to document and extend a mechanism to support a broader range of connectors.
RHODS places all of the users and their notebook pods in one namespace, rhods-notebooks.
In a single OpenShift cluster it would be nice to be able to multi-tenant teams, so that users' notebooks are not visible to everyone who has access to the rhods-notebooks project.
In the original code base we could deploy an instance of upstream ODH JupyterHub per team, i.e. multiple JupyterHub instances, thus allowing this type of separation.
The rainforest demo examples need removing and/or changing to target data mesh instead.
See the docs/4-aiml-demos folder for user demos and examples.
The SAMEPATH datasets (https://samepath.shinyapps.io/samepath/#dataAccess) consist of many tables from NGFS, UNIPRI, GECO, and other public sources related to sustainable finance. We want to demonstrate the ease with which we can federate this data from primary sources, maintain the data as it is updated (usually annually), and serve as the future data source for the SAMEPATH visualization (R-Shiny) tools.
... Not able to access MinIO using OCP user ids user1 and admin.
We forked the reloader and git-sync code for off-line building based on UBI.
We should revert to the upstream code git repos, but keep the builds, as UBI-based images are advantageous.
We have a certificate issue when running a query against Trino, passing the self-signed certificate at https://github.com/opendatahub-io-contrib/data-mesh-pattern/blob/main/supply-chain/trino/trino-certs/ca.crt:
Code to reproduce:

```python
import os

from sqlalchemy import create_engine, text

# ingest_catalog is defined earlier in the notebook
certificate_path = '../../ca.crt'
engine = create_engine(
    'trino://' + os.environ['TRINO_USER'] + ':' + os.environ['TRINO_PASSWD']
    + '@' + os.environ['TRINO_HOST'] + ':' + os.environ['TRINO_PORT'] + '/'
    + ingest_catalog,
    connect_args={'verify': certificate_path},
)
with engine.connect() as connection:
    result = connection.execute(text('show catalogs'))
    for row in result:
        print(row)
```
Error:

```
SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1131)

During handling of the above exception, another exception occurred:

MaxRetryError: HTTPSConnectionPool(host='trino-service.daintree-dev.svc.cluster.local', port=8443): Max retries exceeded with url: /v1/statement (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate (_ssl.c:1131)')))
```
The data mesh pattern should provide a way for data product owners to contribute curated data for LLM training. A good approach and reference is the datalake approach for gpt4all:
Similar to #79, pachyderm containers are unable to start, as it appears to be looking for an operator group that isn't there:
```
failed to populate resolver cache from source
operatorgroup-unavailable/pachyderm: found 0 operatorgroups in namespace
pachyderm: expected 1
```
There should be some documentation on design guidance for data domains / products, which can be referenced then in examples and development doc.
A good checklist for this can be found at: https://towardsdatascience.com/data-domains-and-data-products-64cc9d28283e
During the build of all the containers the elyra-tflow container fails to build:
Cloning "https://gitlab-ce.apps.osc-cl4.apps.os-climate.org/osclimate-datamesh/data-mesh-pattern" ...
Commit: efb9821cee326adb0256eaa715d14ab17deb4bae (UPDATE - project rename)
Author: Derek Dinosaur <[email protected]>
Date: Tue Jun 13 07:40:45 2023 +0000
Replaced Dockerfile FROM image elyra-base:0.2.1
time="2023-06-26T20:43:48Z" level=info msg="Not using native diff for overlay, this may cause degraded performance for building images: kernel has CONFIG_OVERLAY_FS_REDIRECT_DIR enabled"
I0626 20:43:48.530450 1 defaults.go:102] Defaulting to storage driver "overlay" with options [mountopt=metacopy=on].
Caching blobs under "/var/cache/blobs".
Pulling image image-registry.openshift-image-registry.svc:5000/osclimate-datamesh-ci-cd/elyra-base@sha256:c03982d4db4e361d5302a4d8c0632bc07bda9bc8b6ebb1b9029bcd8393bcb3ea ...
Trying to pull image-registry.openshift-image-registry.svc:5000/osclimate-datamesh-ci-cd/elyra-base@sha256:c03982d4db4e361d5302a4d8c0632bc07bda9bc8b6ebb1b9029bcd8393bcb3ea...
Getting image source signatures
Copying blob sha256:028bdc977650c08fcf7a2bb4a7abefaead71ff8a84a55ed5940b6dbc7e466045
Copying blob sha256:af38327575d72c979478aaddf6a33ad9cf561844588f5db47e85c4ee721012ec
Copying blob sha256:819ccd5eb87778d75c516f3a542ae6a3d2367498bd7062a701cb2237995f6cb5
Copying blob sha256:a439c75b0a4f2699983da35fc5e15fd9809bc37f694f54717020886cffc0548b
Copying blob sha256:0c673eb68f88b60abc0cba5ef8ddb9c256eaf627bfd49eb7e09a2369bb2e5db0
Copying blob sha256:c37fd7de0840b4031b29e532b9c694c59a63983ae93162a2e6476882cd075b21
Copying blob sha256:bf105214519e48fd5c21e598563e367f6f3b7c30996d1745a99428752c0ad1ae
Copying blob sha256:0cdbf2b404cc6f9f91c9f46d490f467080c4b5d8ee3b5d4c925e02a340e8d10b
Copying blob sha256:f2316205fe7bc7979d3019254716646bf2f786c1825faa1c1ed39f7420174b25
Copying blob sha256:68057c5053360a1a580bb505ba567d6f4c771d07fe959a30c547d4e276bc0467
Copying blob sha256:988a562fbd90b733eb253c56d63a830afed36df0e609418700caccd23a245fdc
Copying blob sha256:90cf9451d289c16ed981d2a646cfc979874f0eff05ea2e86edfefac87ff0b2e6
Copying blob sha256:ebb3898343c60b4a8d79aed8a93654dc73a0f980ea1bf7e30018bd449d4f611b
Copying blob sha256:059ceb835a667820ab78d7d6fb48b9e7fbb769ce612281ba189bed25ce0a99db
Copying blob sha256:f31e46de923b1250ab065453646dcff2466749a2e9549ea289b038cfa3fefe36
Copying blob sha256:9ff9b64097f0280c8b0ecd3a2a801bf474d0aa3fc160350fd699c1d929e0241b
Copying blob sha256:90c508cf12e1e5825e29e1eec796188af045440ffa6d697f35279a813b004b9b
Copying blob sha256:acab339ca1e8ed7aefa2b4c271176a7787663685bf8759f5ce69b40e4bd7ef86
Copying blob sha256:acab339ca1e8ed7aefa2b4c271176a7787663685bf8759f5ce69b40e4bd7ef86
Copying blob sha256:8fde022b6648ce49357f0c7620a96ba04104c5ad0e9029078ff878cfc37021bb
Copying blob sha256:90367ac5959ea0a29369bb20aff6c90903326a1fa703befc629d1cdf024fc99a
Copying blob sha256:ae97caea9fa3345a096a09d1df0fa8b68a31cdd398c4402748e0b548fe2f25ff
Copying blob sha256:acab339ca1e8ed7aefa2b4c271176a7787663685bf8759f5ce69b40e4bd7ef86
Copying blob sha256:acab339ca1e8ed7aefa2b4c271176a7787663685bf8759f5ce69b40e4bd7ef86
Copying blob sha256:8a441ef86887ecc2a66703d73d2b86538a75edfa38a90cc19d73bb7aaa4aa8cd
Copying blob sha256:23946ae671303d0c6cc4870accc51fe43463e7993e122f7b08082dc2a9726a0f
Copying blob sha256:073c6e194011062bd49b1ccda1819f15aa368590829afae3e0263759cf4dacba
Copying blob sha256:46e601ccae7c5a32545b6b6db733b3d2db5b6581b915520edbcbf262a2b79110
Copying blob sha256:4a269fdc289ca6b7833584bede177c80aa91f2706dea33bc2b94398a3e83d9d0
Copying blob sha256:12ef46b74f05917750404a8de7565168740216fa44be5b19d5d75273a3ec0c86
Copying blob sha256:b2ec48efaf35963d699ae8446e20120869f9fb1ca34ee70f64b82a6050e627f7
Copying blob sha256:c48957ebd2d09b52f4d564cbd5914b1b9e94939f21142f6041db41d0e62fab74
Copying blob sha256:08c9d67bcd774940f73a67eb036be8a756d8eab9b2e4c43bc4e0bcdf17cdaea3
Copying blob sha256:c23f73eb778d14742f04e1238227b8efc4fd1ce51d17a98100744e912e752901
Copying blob sha256:5e0654a3c30dd59ab31f6531ae1a8ad9a8368c5cb6368550e0de2e7c66f9b3b9
Copying blob sha256:237741efa6248120129716d660cc7fece732ea172110784949b97a96e681cb62
Copying blob sha256:58787dd3cb793f5983c0aaa6b70341c30a41a1bb60fc1a5f6f1cd9061ee2edc0
Copying blob sha256:4f5aa417a25f646d2d39642577d4580eedd0fe809c857932aeabd3bb22587bb9
Copying blob sha256:f009e2fceca5421f4769b12a3dd42777940ee1e6e8f17c8c5b77b5e248b9b7d2
Copying blob sha256:2fb528adb3814ee51b07a0165956060c4d0703d454a18f08c6430ed667ed5853
Copying blob sha256:cef676ff822d33c5bdc8cc17a6af24ce425f2353463b189c7ee1a637c2d012ea
Copying blob sha256:f7ea4b46629aacb0aacbf8fe8197fb924a48c9e8875d9f9721565b4a7374549a
Copying blob sha256:d1473e2d5c4be6a885eba43606bbe79229239b92427436391a8cf9edb977e357
Copying blob sha256:d3cda3d33521c0cd44da393733605297f341d7e36a42850e945d122578533ded
Copying config sha256:4a7596a0ebbeb7ba5f97a2ca3d310d6ec4b0842fa024310ec3e235517d45d4dd
Writing manifest to image destination
Storing signatures
Adding transient rw bind mount for /run/secrets/rhsm
STEP 1/8: FROM image-registry.openshift-image-registry.svc:5000/osclimate-datamesh-ci-cd/elyra-base@sha256:c03982d4db4e361d5302a4d8c0632bc07bda9bc8b6ebb1b9029bcd8393bcb3ea
STEP 2/8: USER root
--> 8ab6947207f
STEP 3/8: RUN /opt/app-root/bin/pip3 install jinja2==3.1.2
Looking in indexes: https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/simple
Requirement already satisfied: jinja2==3.1.2 in /opt/app-root/lib/python3.8/site-packages (3.1.2)
Requirement already satisfied: MarkupSafe>=2.0 in /opt/app-root/lib/python3.8/site-packages (from jinja2==3.1.2) (2.1.1)
WARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.
You should consider upgrading via the '/opt/app-root/bin/python3.8 -m pip install --upgrade pip' command.
--> fbfab4021e8
STEP 4/8: RUN /opt/app-root/bin/pip3 install certifi
Looking in indexes: https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/simple
Requirement already satisfied: certifi in /opt/app-root/lib/python3.8/site-packages (2022.9.24)
WARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.
You should consider upgrading via the '/opt/app-root/bin/python3.8 -m pip install --upgrade pip' command.
--> 1c77c833b0e
STEP 5/8: RUN /opt/app-root/bin/pip3 install matplotlib numpy pandas scipy scikit-learn tensorflow minio
Looking in indexes: https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/simple
Collecting matplotlib
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/matplotlib/3.7.1/matplotlib-3.7.1-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (9.2 MB)
Collecting numpy
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/numpy/1.24.4/numpy-1.24.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
Collecting pandas
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/pandas/2.0.2/pandas-2.0.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.3 MB)
Collecting scipy
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/scipy/1.10.1/scipy-1.10.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
Collecting scikit-learn
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/scikit-learn/1.2.2/scikit_learn-1.2.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (9.8 MB)
Collecting tensorflow
Downloading https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/tensorflow/2.12.0/tensorflow-2.12.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (585.9 MB)
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
tensorflow from https://nexus-osclimate-datamesh-ci-cd.apps.osc-cl4.apps.os-climate.org/repository/pypi/packages/tensorflow/2.12.0/tensorflow-2.12.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl#sha256=23850332f1f9f778d697c9dba63ca52be72cb73363e75ad358f07ddafef63c01:
Expected sha256 23850332f1f9f778d697c9dba63ca52be72cb73363e75ad358f07ddafef63c01
Got 2ecfc624220e0e36c414dc6889ab365f02f50a9edc3f230dcebbd4955cbf62fa
WARNING: You are using pip version 21.2.3; however, version 23.1.2 is available.
You should consider upgrading via the '/opt/app-root/bin/python3.8 -m pip install --upgrade pip' command.
error: build error: error building at STEP "RUN /opt/app-root/bin/pip3 install matplotlib numpy pandas scipy scikit-learn tensorflow minio": error while running runtime: exit status 1
The Google Data Commons (https://datacommons.org/) has over 1 trillion datapoints of all kinds, organized in a knowledge graph and available via BigQuery. Some of this data is directly useful to climate and sustainable finance analysis, and some of this data could be useful when linked to corporate ownership (via entity matching).
Here are datasets federated by Google's Data Commons that relate to the topic Environment: https://docs.datacommons.org/datasets/Environment.html
Here is a narrowing of that data that relates to the topic Emissions within the US (based on EPA GHGRP): https://datacommons.org/tools/map#%26sv%3DAnnual_Emissions_CarbonDioxide_NonBiogenic%26pc%3D0%26denom%3DCount_Person%26pd%3Dcountry%2FUSA%26ept%3DState%26ppt%3DEpaReportingFacility
The goal of this exercise is to demonstrate our ability to federate a tiny but meaningful slice of Google's Data Commons data into the Data Mesh and to expose that data within OS-Climate's Data Exchange. The data should be chosen so that a meaningful "so what?" question can be answered, but the overall point of the exercise is to assess the ease with which the Data Mesh can enable data analysts to be maximally productive and effective when asking and answering climate and sustainable finance questions.
Please feel free to flesh out and/or ask further questions.
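As a starting point, here is a minimal sketch of pulling one of the emissions statistical variables from the public Data Commons Python API; the place DCID and the choice of variable are assumptions for illustration, and landing the result in the mesh (MinIO / Trino) is not shown:

```python
# Hypothetical sketch: fetch one Data Commons statistical variable for a place.
import datacommons as dc

# geoId/06 is California; the variable matches the Emissions map linked above.
value = dc.get_stat_value("geoId/06", "Annual_Emissions_CarbonDioxide_NonBiogenic")
print(value)
```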