ballista-compute / ballista

Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Home Page: https://ballistacompute.org
License: Apache License 2.0
These are turning out to be a pain because of the dependency on the Arrow GitHub repo.
I would like the ability to run applications in "client mode" (see the Apache Spark documentation [0] for a description of local, client, and cluster modes). Ballista currently only supports cluster mode, which means we have to package the example in a Docker image every time we want to run it.
This means we need a way to connect to a Ballista container from outside of the cluster, so we need to either enable ingress directly on the executor services, or add a new "master" pod with ingress that can then submit to the cluster.
[0] https://spark.apache.org/docs/latest/submitting-applications.html
I wrote the following terrible code to get the response from the future that completes when the server returns results. I feel like this would be a good place to use async/await when it is available, or at least to use channels?
let results: Arc<Mutex<Vec<proto::ExecuteResponse>>> = Arc::new(Mutex::new(vec![]));
tokio::run(execute); // `execute` is the future that pushes the server's response into `results`
let lock = results.lock().unwrap();
Ok(lock[0].clone())
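As a minimal sketch of the channel-based alternative mentioned above: instead of sharing an Arc<Mutex<Vec<_>>> between the task and the caller, the task can send its single response over a channel and the caller can block on the receiving end. The `ExecuteResponse` struct and `execute_and_wait` function below are stand-ins invented for illustration (the real response type comes from the generated protobuf code, and the real work would be a gRPC call rather than a spawned thread).

```rust
use std::sync::mpsc;
use std::thread;

// Placeholder for proto::ExecuteResponse; the real type is generated from the proto file.
#[derive(Clone, Debug)]
struct ExecuteResponse {
    message: String,
}

// Run the "server call" on another thread and send the result back over a channel,
// avoiding the shared Arc<Mutex<Vec<_>>> and the lock/clone dance entirely.
fn execute_and_wait() -> ExecuteResponse {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // ... perform the gRPC call here, then send the response when it arrives ...
        let response = ExecuteResponse { message: "ok".to_string() };
        tx.send(response).expect("receiver dropped");
    });
    // Block until the single response arrives.
    rx.recv().expect("sender dropped")
}

fn main() {
    let response = execute_and_wait();
    println!("{:?}", response);
}
```

The same shape translates directly to async/await later: the channel becomes a oneshot future and `recv()` becomes an `.await`.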
Every build of the two Docker images downloads and compiles all dependencies from scratch.
Given a simple aggregate query such as the following one, Ballista should be able to create a plan to execute this query against each partition and then combine the results and run a secondary aggregate query on those.
This will also require some partition-aware metadata about files.
SELECT passenger_count, MAX(fare_amount) FROM tripdata GROUP BY passenger_count
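The two-phase split described above can be sketched in plain Rust. This is an illustration only, with made-up in-memory partitions of `(passenger_count, fare_amount)` pairs rather than real Arrow record batches: phase one computes a partial MAX per group within each partition, and phase two runs a secondary MAX over the partial results.

```rust
use std::collections::HashMap;

// Phase 1: compute MAX(fare_amount) per passenger_count within a single partition.
fn partial_max(partition: &[(u32, f64)]) -> HashMap<u32, f64> {
    let mut acc: HashMap<u32, f64> = HashMap::new();
    for &(passenger_count, fare_amount) in partition {
        let entry = acc.entry(passenger_count).or_insert(f64::MIN);
        if fare_amount > *entry {
            *entry = fare_amount;
        }
    }
    acc
}

// Phase 2: combine partial results from all partitions with a secondary MAX.
fn final_max(partials: Vec<HashMap<u32, f64>>) -> HashMap<u32, f64> {
    let mut acc: HashMap<u32, f64> = HashMap::new();
    for partial in partials {
        for (group, max) in partial {
            let entry = acc.entry(group).or_insert(f64::MIN);
            if max > *entry {
                *entry = max;
            }
        }
    }
    acc
}

fn main() {
    let p1 = partial_max(&[(1, 10.0), (2, 7.5)]);
    let p2 = partial_max(&[(1, 12.0), (3, 4.0)]);
    println!("{:?}", final_max(vec![p1, p2]));
}
```

MAX is trivially decomposable; COUNT and AVG would need different partial states (a count, or a sum plus a count), which is part of what the distributed planner has to know per aggregate function.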
Rather than directly using the format required by a specific machine scheduler, we allow app/executor descriptors to be defined in a simpler JSON/TOML format that gets translated into the target scheduler's format.
I want to be able to connect to a cluster and run SQL and see results. There is a REPL in DataFusion that could be used as a starting point.
Kubernetes is a bit of a beast for a lot of people to step into. Perhaps supporting a simpler scheduler could benefit this project.
Leaning towards using Gitter
I'm considering CircleCI or GitHub Actions. It would be good to set up code coverage for Rust and Kotlin as well, perhaps using codecov.io.
As a step towards having a distributed query planner, I want to update the current example to build a distributed query plan manually and test out that flow. This will help drive requirements for the distributed query planner.
Data nerds love their jupyter notebooks
https://github.com/google/evcxr
I've run into this issue when trying to run the repo example. I'm still investigating what could be going wrong, but thought it worth the time to throw up some info while I investigate.
I'm running this:
cargo run --bin ballista -- create-cluster --name nyctaxi --num-executors 12 --template examples/nyctaxi/templates/executor.yaml
Output from the program is:
Executed subcommand create-cluster in 1.566 seconds
Expected: a number of ballista-nyctaxi-* K8s resources when you run kubectl get pods
Actual: No resources found.
minikube version
output: minikube version: v1.2.0
minikube status
output:
host: Running
kubelet: Running
apiserver: Running
kubectl: Correctly Configured: pointing to minikube-vm at 192.168.99.101
kubectl cluster-info
output:
Kubernetes master is running at https://192.168.99.101:8443
KubeDNS is running at https://192.168.99.101:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
kubectl version
output: Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:38:32Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Given a simple query consisting of any combination of TableScan, Projection, Selection (and no LIMIT/OFFSET/SORT/UNION) then Ballista should be able to execute these queries in parallel across partitions and simply combine the results without further processing.
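A rough std-only sketch of that execution strategy, with invented types (tuples of `(row_id, value)` standing in for record batches, and a hard-coded `value > threshold` selection standing in for the plan): each partition runs the same projection/selection on its own thread, and because the query has no aggregate, sort, or limit, the driver just concatenates the per-partition outputs.

```rust
use std::thread;

// Apply a selection (value > threshold) and a projection (keep only the value)
// to one partition; this stands in for running the per-partition plan on an executor.
fn run_partition(partition: Vec<(u64, i64)>, threshold: i64) -> Vec<i64> {
    partition
        .into_iter()
        .filter(|&(_, value)| value > threshold)
        .map(|(_, value)| value)
        .collect()
}

// Fan out one task per partition, then simply concatenate the results;
// no further processing is needed for TableScan/Projection/Selection queries.
fn run_parallel(partitions: Vec<Vec<(u64, i64)>>, threshold: i64) -> Vec<i64> {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|p| thread::spawn(move || run_partition(p, threshold)))
        .collect();
    let mut results = Vec::new();
    for h in handles {
        results.extend(h.join().expect("partition task panicked"));
    }
    results
}

fn main() {
    let partitions = vec![vec![(1, 5), (2, 15)], vec![(3, 25), (4, 2)]];
    println!("{:?}", run_parallel(partitions, 10));
}
```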
What's the status of this project? I was curious about it but don't want to invest time if it's abandoned. The commit log looks like it has slowed down.
Currently, the Kubernetes URL is hard-coded and it needs to be configurable.
We need to think about how we want configuration to work in general. I am a fan of the twelve-factor methodology [0] in general, and would encourage anyone working on this issue to at least read the configuration section there.
In a similar vein of reducing complexity: maybe some documentation on how to run Ballista on k3s from scratch would allow people to get started more simply.
In the PoC, one pod and one service is created for each executor, but how should this really work? StatefulSet? Service with ReplicaSet?
I think the distributed query planner needs the ability to target specific executors (for things like data locality later on). Let's use this issue for the initial discussion.
As a user, I want to be able to connect to a Ballista cluster from JDBC and run queries. Note that there is already PoC code for this under jvm/jdbc.
I'm based in Colorado which is Mountain Time (https://time.is/MT) and early morning is a good time for me so I'd like to start by proposing that. This is end of day in UK so hopefully works for UK and Europe.
I am thinking every two weeks?
How about 8am MT every other Friday?
There are so many sources of errors and we'll need to provide context to the user for each one. snafu seems to provide a nice way of doing this in a single place, so could be worth investigating.
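To make the "context for each error source" idea concrete, here is a hand-rolled sketch of the pattern using only std: one enum variant per error source, each carrying the context the user needs. The variant names and the `create_cluster` function are hypothetical; the point of snafu is that it generates most of this boilerplate (Display impls, context selectors) via derive macros.

```rust
use std::fmt;

// One variant per error source, each carrying user-facing context.
#[derive(Debug)]
enum BallistaError {
    KubernetesError { operation: String, detail: String },
    PlanError { detail: String },
}

impl fmt::Display for BallistaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BallistaError::KubernetesError { operation, detail } => {
                write!(f, "Kubernetes operation '{}' failed: {}", operation, detail)
            }
            BallistaError::PlanError { detail } => {
                write!(f, "invalid query plan: {}", detail)
            }
        }
    }
}

impl std::error::Error for BallistaError {}

// Hypothetical fallible operation returning a contextualized error.
fn create_cluster(name: &str) -> Result<(), BallistaError> {
    if name.is_empty() {
        return Err(BallistaError::KubernetesError {
            operation: "create-cluster".to_string(),
            detail: "cluster name must not be empty".to_string(),
        });
    }
    Ok(())
}

fn main() {
    if let Err(e) = create_cluster("") {
        println!("{}", e);
    }
}
```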
If other tools in this space are anything to go by, we'll end up with a lot of options in the CLI, so having arg parsing handled automatically would be great.
The example currently relies on data being located at /mnt/ssd/nyc_taxis/csv, and we at least need to document this so others can run the example.
The code currently shells out to kubectl for certain operations, instead of just posting YAML directly to the k8s API.
Currently the project uses a git submodule to clone a copy of tower-grpc because I ran into problems early on with adding a dependency on the published release.
I would like to be able to use Spark within Ballista. This will involve writing an executor in Scala that translates a Ballista query plan into a Spark query plan.
It will be very helpful if we can easily run comparable jobs on both Ballista and Spark and compare query plans and performance.
As a user of Ballista, I would like the ability to execute arbitrary code as part of my distributed job. I want the ability to use multiple languages depending on my requirements (perhaps there are third party libraries I want to use).
WASM seems like a good potential choice for this?
It should be possible to execute queries using binary expressions, e.g. some_column > 4. DataFusion already supports this, and the proto file has definitions for it, but the code that translates between Ballista and DataFusion plans does not yet support it.
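For illustration, a toy expression tree and evaluator for exactly the some_column > 4 case. The `Expr` enum here is invented and much smaller than the real plan types (which live in the Ballista protobuf definitions and DataFusion's logical plan); it just shows the shape of evaluating a binary comparison over a column to produce a selection mask.

```rust
// A tiny expression language: a column reference or an integer literal.
enum Expr {
    Column,
    Literal(i64),
}

// Resolve an expression to a scalar for one row of the column.
fn scalar(e: &Expr, row_value: i64) -> i64 {
    match e {
        Expr::Column => row_value,
        Expr::Literal(n) => *n,
    }
}

// Evaluate `left > right` over every value in a column, producing a boolean mask
// that a Selection operator would use to filter rows.
fn eval_gt(left: &Expr, right: &Expr, column: &[i64]) -> Vec<bool> {
    column
        .iter()
        .map(|&v| scalar(left, v) > scalar(right, v))
        .collect()
}

fn main() {
    // some_column > 4 evaluated against the column [2, 5, 7]
    let mask = eval_gt(&Expr::Column, &Expr::Literal(4), &[2, 5, 7]);
    println!("{:?}", mask);
}
```

The translation work tracked by this issue is essentially mapping protobuf nodes of this shape onto DataFusion's existing expression types rather than evaluating them directly.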
We should also look at the Gandiva proto which has rich support for expressions and see if we can conform with their approach since the plan is to combine Ballista and Gandiva proto at some point.
Given that query plans are defined in protobuf, we should be able to pass a query plan from JVM to Rust via FFI and then receive the results back in Arrow IPC format.
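A minimal sketch of the Rust side of that FFI boundary, under the assumption that the JVM (e.g. via JNI) hands over the serialized protobuf plan as a raw pointer plus length. The function name and return value are invented: here it just validates the input and reports the byte count, where the real implementation would decode the plan, execute it, and hand back Arrow IPC bytes. In the real library the function would also carry `#[no_mangle]` so the JVM can resolve the symbol.

```rust
use std::slice;

// FFI entry point sketch: receive a serialized protobuf plan from the JVM side.
// Returns the number of plan bytes received (a stand-in for real execution).
pub extern "C" fn ballista_execute_plan(plan_ptr: *const u8, plan_len: usize) -> usize {
    if plan_ptr.is_null() {
        return 0; // defensive: never dereference a null pointer from the caller
    }
    // SAFETY: the caller guarantees plan_ptr points to plan_len valid bytes.
    let plan_bytes = unsafe { slice::from_raw_parts(plan_ptr, plan_len) };
    // ... decode the protobuf plan, execute it, serialize results as Arrow IPC ...
    plan_bytes.len()
}

fn main() {
    let plan = vec![1u8, 2, 3];
    println!("{}", ballista_execute_plan(plan.as_ptr(), plan.len()));
}
```

Because both the plan (protobuf) and the results (Arrow IPC) are language-neutral byte formats, the FFI surface can stay as small as a single pointer-plus-length call in each direction.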
The spark benchmark fails with:
Caused by: java.net.UnknownHostException: io-andygrove-ballista-spark-main-1564066103519-driver-svc.default.svc
This is probably some minikube DNS issue that needs to be figured out and documented.
Hi andy! This project looks awesome, I was curious if you had any ideas about supporting other clustering techniques that are not based around kube?
As discussed on Gitter, it would be great if Ballista could read from and write to HDFS.
Ideally Ballista would have a common way of accessing data from various sources such as S3 or other object stores.
What would the timeline for support here look like? I guess the initial steps would be to design the storage API, which arbitrary sources could access?
In terms of Rust support for HDFS, I haven't seen any particularly mature implementations. Do you see this (or a child project) as the place for that work to go? Or would that be more in the realm of Arrow?
Sorry if I've missed anything here or overlooked something obvious; I'm still in the grokking phase of Ballista but like what it does for distributed compute with Rust. HDFS support would really unblock things for me with respect to using Rust in a professional capacity.
We want to use cargo fmt to keep the code formatted consistently. The usual approach to this is to have the build fail if the code isn't formatted correctly.
Here is some example travis yaml I found that should be pretty close to what we need:
language: rust
before_script:
- rustup toolchain install nightly
- rustup component add --toolchain nightly rustfmt-preview
- which rustfmt || cargo install --force rustfmt-nightly
script:
- cargo +nightly fmt --all -- --check
- cargo build
- cargo test
In order to demonstrate how K8s would help manage client interaction, create example CRD schemas and a controller that has stub functionality for query submission and execution planning.
Deliverables:
BallistaQuery CRD
BallistaPlan CRD
BallistaOperator controller implementation and deployment.
As a JVM (Java/Kotlin/Scala) user, I would like to be able to build a query plan, execute it on a Ballista cluster, and receive the results.
I plan on implementing this in Kotlin.
I think this will be needed so that anyone writing Ballista applications can guarantee they're using the correct versions of arrow and datafusion. Eventually it may be desirable to hide away the arrow/datafusion implementation, but for now this is a fairly easy solution.
As a Spark user, I want to be able to delegate some processing to Ballista executors, such as executing a Rust UDF against data.
The current PoC code returns a single ExecuteResponse containing the entire data set. We will want to stream this instead.
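A std-only sketch of the batching side of that change, with invented names: instead of one response holding all rows, the server yields fixed-size batches through an iterator, which is the same shape a gRPC server-streaming response or an async Stream of Arrow record batches would take.

```rust
// Split one large result set into fixed-size batches so the client can
// consume results incrementally instead of one giant ExecuteResponse.
struct BatchStream {
    rows: Vec<i64>,
    batch_size: usize,
    offset: usize,
}

impl Iterator for BatchStream {
    type Item = Vec<i64>;

    fn next(&mut self) -> Option<Vec<i64>> {
        if self.offset >= self.rows.len() {
            return None; // all batches delivered
        }
        let end = (self.offset + self.batch_size).min(self.rows.len());
        let batch = self.rows[self.offset..end].to_vec();
        self.offset = end;
        Some(batch)
    }
}

fn stream_results(rows: Vec<i64>, batch_size: usize) -> BatchStream {
    BatchStream { rows, batch_size, offset: 0 }
}

fn main() {
    for batch in stream_results((0..10).collect(), 4) {
        println!("{:?}", batch);
    }
}
```

Streaming also bounds memory on both sides: neither the executor nor the client ever has to hold the full result set at once.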