CueLake's Introduction




With CueLake, you can use SQL to build ELT (Extract, Load, Transform) pipelines on a data lakehouse.

You write Spark SQL statements in Zeppelin notebooks. You then schedule these notebooks using workflows (DAGs).

To extract and load incremental data, you write simple select statements. CueLake executes these statements against your databases and then merges incremental data into your data lakehouse (powered by Apache Iceberg).
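
For illustration, here is a minimal sketch of that pattern in Spark SQL. The source, catalog, table, and column names are placeholders, not CueLake defaults; the MERGE INTO form is Iceberg's standard Spark syntax.

```sql
-- Hypothetical incremental extract: pull only rows changed since the last
-- run (the watermark column and value are placeholders).
CREATE OR REPLACE TEMPORARY VIEW orders_increment AS
SELECT * FROM source_db.orders
WHERE updated_at > TIMESTAMP '2021-06-01 00:00:00';

-- Merge the increment into the Iceberg table: update rows that already
-- exist, insert the rest.
MERGE INTO lakehouse.db.orders t
USING orders_increment s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```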

To transform data, you write SQL statements to create views and tables in your data lakehouse.
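
A transform step is just more Spark SQL. For example, a sketch with placeholder names:

```sql
-- A view over the Iceberg table loaded above.
CREATE OR REPLACE VIEW daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM lakehouse.db.orders
GROUP BY order_date;

-- Or materialize the same transform as an Iceberg table.
CREATE TABLE lakehouse.db.daily_revenue_tbl USING iceberg AS
SELECT order_date, SUM(amount) AS revenue
FROM lakehouse.db.orders
GROUP BY order_date;
```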

CueLake uses Celery as the executor and celery-beat as the scheduler. Celery jobs trigger Zeppelin notebooks. Zeppelin auto-starts and stops the Spark cluster for every scheduled run of notebooks.

To learn why we are building CueLake, read our viewpoint.


Getting started

CueLake is installed on Kubernetes using kubectl. Create a namespace and then install using the cuelake.yaml file. Creating a namespace is optional; you can install in the default namespace or in any existing namespace.

In the commands below, we use cuelake as the namespace.

kubectl create namespace cuelake
kubectl apply -f https://raw.githubusercontent.com/cuebook/cuelake/main/cuelake.yaml -n cuelake
kubectl port-forward services/lakehouse 8080:80 -n cuelake

Now visit http://localhost:8080 in your browser.

If you don’t want to use Kubernetes and instead want to try it out on your local machine first, we’ll soon have a docker-compose version. Let us know if you’d like that sooner.

Features

  • Upsert incremental data. CueLake uses Iceberg’s MERGE INTO query to automatically merge incremental data.
  • Create views in the data lakehouse. CueLake enables you to create views over Iceberg tables.
  • Create DAGs. Group notebooks into workflows and create DAGs of these workflows.
  • Elastically scale cloud infrastructure. CueLake uses Zeppelin to auto-create and delete the Kubernetes resources required to run data pipelines.
  • In-built scheduler to schedule your pipelines.
  • Automated maintenance of Iceberg tables. CueLake expires snapshots, removes old metadata and orphan files, and compacts data files (see the sketch after this list).
  • Monitoring. Get Slack alerts when a pipeline fails. CueLake maintains detailed logs.
  • Versioning in GitHub. Commit and maintain versions of your Zeppelin notebooks in GitHub.
  • Data security. Your data always stays within your cloud account.
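
These maintenance tasks map onto Iceberg's standard Spark procedures. A hedged sketch of what an automated maintenance run might execute (the catalog and table names are placeholders; the exact calls CueLake makes are not documented here):

```sql
-- Expire old snapshots and drop the metadata files they reference.
CALL lakehouse.system.expire_snapshots(table => 'db.orders', older_than => TIMESTAMP '2021-06-01 00:00:00');

-- Delete files in the table location that no snapshot references.
CALL lakehouse.system.remove_orphan_files(table => 'db.orders');

-- Compact small data files into fewer, larger ones.
CALL lakehouse.system.rewrite_data_files(table => 'db.orders');
```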

Current Limitations

  • Supports AWS S3 as a destination. Support for ADLS and GCS is on the roadmap.
  • Uses Apache Iceberg as the open table format. Delta support is on the roadmap.
  • Uses Celery for scheduling jobs. Support for Airflow is on the roadmap.

Support

For general help using CueLake, read the documentation or go to GitHub Discussions.

To report a bug or request a feature, open an issue.

Contributing

We'd love contributions to CueLake. Before you contribute, please first discuss the change you wish to make via an issue or a discussion. Contributors are expected to adhere to our code of conduct.

CueLake's People

Contributors

ankitkpandey, prabhu31, praveencuebook, sachinkbansal, vikrantcue, vincue


CueLake's Issues

Syntax error in interpreter.json

There is a syntax error (missing comma) on line 1275 in https://raw.githubusercontent.com/cuebook/cuelake/main/zeppelinConf/interpreter.json

Also, there is a \t on lines 1271 and 1272 that I suspect is incorrect.

And finally, if you use less to view the content, the C in the word Comma on line 201 is displayed as a multi-byte character (it is a Cyrillic С rather than a Latin C, as the diff below shows).

Below is a diff of the changes that I made to the file.

201c201
<           "description": "Сomma separated schema (schema \u003d catalog \u003d database) filters to get metadata for completions. Supports \u0027%\u0027 symbol is equivalent to any set of characters. (ex. prod_v_%,public%,info)"
---
>           "description": "Comma separated schema (schema \u003d catalog \u003d database) filters to get metadata for completions. Supports \u0027%\u0027 symbol is equivalent to any set of characters. (ex. prod_v_%,public%,info)"
1271,1272c1271,1272
<         "spark.executor.extraJavaOptions\t": {
<           "name": "spark.executor.extraJavaOptions\t",
---
>         "spark.executor.extraJavaOptions": {
>           "name": "spark.executor.extraJavaOptions",
1275c1275
<         }
---
>         },

Fix hive metastore issues

Test the following scenarios on the hive metastore for Iceberg, Delta, and Parquet tables:

  • Data should be created in the warehouse directory given as an env variable
  • When a table is dropped, its data should be deleted

Test and fix the behaviour of the hive metastore on both S3 and GCS; a minimal smoke test is sketched below.

Use the latest versions of the Iceberg and Delta jars, and upgrade the Spark version if required.
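
A Spark SQL smoke test for both scenarios on an Iceberg table might look like this sketch; the catalog, database, and table names are placeholders, not anything CueLake defines.

```sql
-- On CREATE/INSERT, data files should appear under the warehouse
-- directory configured through the env variable.
CREATE TABLE hive_catalog.test_db.metastore_check (id BIGINT, val STRING)
USING iceberg;
INSERT INTO hive_catalog.test_db.metastore_check VALUES (1, 'x');

-- On DROP, the table's data files should be deleted as well
-- (the behaviour under test).
DROP TABLE hive_catalog.test_db.metastore_check;
```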

Workspaces v1

Ask for the following info while creating a workspace:

  • Name & Description
  • Storage (S3, GCS, AZFS, PV)
  • Storage credentials if required
  • Inactivity Timeout to shut down resources
  • Spark and Interpreter docker images (show CueLake's default values and a link for creating custom images)

Can we use MinIO as S3-compatible storage for Apache Iceberg?

Is your feature request related to a problem? Please describe.
Can we use MinIO as S3-compatible storage for Apache Iceberg?

Describe the solution you'd like
Can we use MinIO as S3-compatible storage for Apache Iceberg?

Describe alternatives you've considered
If we can use MinIO, we need the steps to configure MinIO with CueLake.

Additional context
Can we use MinIO as S3-compatible storage for Apache Iceberg?

Improve logs UI

Currently, logs are just JSON dumps. Copy the parser code from Zeppelin and implement it in CueLake so that the logs look the same as they do in Zeppelin.

Support for Jupyter Notebook

Is your feature request related to a problem? Please describe.
Your current system supports Zeppelin notebooks. We have a lot of notebooks designed with Jupyter, and tons of tooling built around them. It's a tremendous effort to shift these. Requesting support for Jupyter notebooks besides Zeppelin.

Describe the solution you'd like
Ability to run Jupyter notebooks.

Describe alternatives you've considered
Tools to convert from Jupyter to Zeppelin, but that's a lot of work internally.

Dashboard V1

The dashboard will show all the workspaces and their resources.

CueLake will start with 0 workspaces.

Users can add a workspace from the dashboard.

For each workspace, the following info will be shown:

  • Resources currently running for the workspace (Zeppelin server, all interpreters)
  • Restart button for the Zeppelin server
  • Name & Description of the workspace

The default RBAC role is missing pods as a resource

Describe the bug
The default RBAC role is missing pods as a resource, which causes exceptions in the lakehouse service, as shown below.

```
127.0.0.1 - - [27/May/2021:06:14:14 +0000] "GET /api/genie/notebooks/0 HTTP/1.1" 200 68 "http://127.0.0.1:8080/notebooks" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36"
Internal Server Error: /api/genie/driverAndExecutorStatus/
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/site-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.7/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/django/views/generic/base.py", line 70, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/rest_framework/views.py", line 509, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.7/site-packages/rest_framework/views.py", line 469, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.7/site-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
    raise exc
  File "/usr/local/lib/python3.7/site-packages/rest_framework/views.py", line 506, in dispatch
    response = handler(request, *args, **kwargs)
  File "/code/genie/views.py", line 243, in get
    res = KubernetesServices.getDriversCount()
  File "/code/genie/services/services.py", line 657, in getDriversCount
    ret = v1.list_namespaced_pod(POD_NAMESPACE, watch=False)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 15302, in list_namespaced_pod
    return self.list_namespaced_pod_with_http_info(namespace, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/core_v1_api.py", line 15427, in list_namespaced_pod_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 243, in GET
    query_params=query_params)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (403)
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '96c45951-281d-41d5-908d-b6429974a4dd', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Thu, 27 May 2021 06:14:14 GMT', 'Content-Length': '282'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"pods is forbidden: User \"system:serviceaccount:cuelake:default\" cannot list resource \"pods\" in API group \"\" in the namespace \"cuelake\"","reason":"Forbidden","details":{"kind":"pods"},"code":403}
```

***Workaround***

A workaround is to add "pods" as a resource in the default-role in cuelake.yaml.

```
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: default-role
rules:
- apiGroups: [""]
  resources: ["pods", "configmaps"]
  verbs: ["create", "get", "update", "patch", "list", "delete", "watch"]
- apiGroups: ["rbac.authorization.k8s.io"]
  resources: ["roles", "rolebindings"]
  verbs: ["bind", "create", "get", "update", "patch", "list", "delete", "watch"]
```

Rename models

Some model names are not apt. Change the following model names:
RunStatus -> NotebookRunLogs
WorkflowRuns -> WorkflowRunLogs
