
ballista's Issues

Add support for running applications in client mode

I would like the ability to run applications in "client mode" (see the Apache Spark documentation [0] for a description of local, client, and cluster modes). Ballista currently only supports cluster mode, which means we have to package the example in a Docker image every time we want to run it.

This means we need a way to connect to a Ballista container from outside of the cluster, so we need to either enable ingress directly on the executor services, or add a new "master" pod with ingress that can then submit to the cluster.

[0] https://spark.apache.org/docs/latest/submitting-applications.html

Remove ugly hacks in client.rs

I wrote the following terrible code to get the response from the future that completes when the server returns results. This feels like a good place to use async/await once it is available, or at least to use channels.

        let mut results: Arc<Mutex<Vec<proto::ExecuteResponse>>> = Arc::new(Mutex::new(vec![]));

        tokio::run(execute);

        let lock = final_result.borrow_mut().lock().unwrap();
        Ok(lock[0].clone())
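A channel-based sketch of the suggested alternative, using only std::sync::mpsc, with a plain thread standing in for the tokio task and a hypothetical ExecuteResponse struct standing in for proto::ExecuteResponse:

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for proto::ExecuteResponse.
#[derive(Debug)]
struct ExecuteResponse {
    message: String,
}

// Instead of sharing an Arc<Mutex<Vec<_>>> with the future, the task that
// receives the server response sends it through a channel, and the caller
// blocks on the receiving end.
fn execute_and_wait() -> Result<ExecuteResponse, mpsc::RecvError> {
    let (tx, rx) = mpsc::channel();

    // Simulates the task that completes when the server responds.
    thread::spawn(move || {
        let response = ExecuteResponse {
            message: "results".to_string(),
        };
        // Ignore the error case where the receiver was dropped.
        let _ = tx.send(response);
    });

    // Block until the response arrives; no Arc, Mutex, or borrow_mut needed.
    rx.recv()
}
```

The same shape carries over to a tokio oneshot channel once async/await lands.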

Implement distributed query planner for simple aggregate queries

Given a simple aggregate query such as the one below, Ballista should be able to create a plan that executes the query against each partition, combines the results, and runs a secondary aggregate query on those partial results.

This will also require some partition-aware metadata about files.

SELECT passenger_count, MAX(fare_amount) FROM tripdata GROUP BY passenger_count
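The two-phase execution described above can be sketched in plain Rust. This is a toy model, not DataFusion code: each row is a hypothetical (passenger_count, fare_amount) pair, and a partition is just a slice of rows.

```rust
use std::collections::HashMap;

// One (passenger_count, fare_amount) row; a partition is a slice of rows.
type Row = (u32, f64);

// Phase 1: run MAX(fare_amount) GROUP BY passenger_count on one partition.
fn partial_max(partition: &[Row]) -> HashMap<u32, f64> {
    let mut acc = HashMap::new();
    for &(key, fare) in partition {
        let entry = acc.entry(key).or_insert(f64::MIN);
        if fare > *entry {
            *entry = fare;
        }
    }
    acc
}

// Phase 2: combine the per-partition results with a secondary MAX aggregate.
fn final_max(partials: Vec<HashMap<u32, f64>>) -> HashMap<u32, f64> {
    let mut acc = HashMap::new();
    for partial in partials {
        for (key, fare) in partial {
            let entry = acc.entry(key).or_insert(f64::MIN);
            if fare > *entry {
                *entry = fare;
            }
        }
    }
    acc
}
```

MAX works here because it is decomposable (max of maxes); aggregates like AVG would need the partial phase to carry (sum, count) pairs instead.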

Implement REPL / SQL Console

I want to be able to connect to a cluster and run SQL and see results. There is a REPL in DataFusion that could be used as a starting point.

Idea: support Nomad?

Kubernetes is a bit of a beast for a lot of people to step into. Perhaps supporting a simpler scheduler could benefit this project.

Set up CI for Rust and JVM builds

I'm considering CircleCI or GitHub Actions. It would also be good to set up code coverage for Rust and Kotlin, perhaps using codecov.io.

Update example to use hand-built distributed query plan

As a step towards having a distributed query planner, I want to update the current example to build a distributed query plan manually and test out that flow. This will help drive requirements for the distributed query planner.

Example in README not creating K8s resources

Description

I've run into this issue when trying to run the repo example. I'm still investigating what could be going wrong, but thought it worth the time to throw up some info while I investigate.

I'm running this:

cargo run --bin ballista -- create-cluster --name nyctaxi --num-executors 12 --template examples/nyctaxi/templates/executor.yaml

Output from the program is:

Executed subcommand create-cluster in 1.566 seconds

Expected Output

A number of ballista-nyctaxi-* K8s resources should be listed when you run kubectl get pods

Actual Output

No resources found.

Further Information

minikube version output: minikube version: v1.2.0
minikube status output:

host: Running
kubelet: Running
apiserver: Running
kubectl: Correctly Configured: pointing to minikube-vm at 192.168.99.101

kubectl cluster-info output:

Kubernetes master is running at https://192.168.99.101:8443
KubeDNS is running at https://192.168.99.101:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

kubectl version output: Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:38:32Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}

Status of project?

What's the status of this project? I was curious about it but don't want to invest time if it's abandoned. The commit log looks like it has slowed down.

Kubernetes URL should be configurable

Currently, the Kubernetes URL is hard-coded; it needs to be configurable.

We need to think about how we want configuration to work in general. I am a fan of the twelve-factor app methodology [0], and would encourage anyone working on this issue to at least read the configuration section there.

[0] https://12factor.net/

What kubernetes constructs make most sense for the cluster?

In the PoC, one pod and one service are created for each executor, but how should this really work? A StatefulSet? A Service with a ReplicaSet?

I think the distributed query planner needs the ability to target specific executors (for things like data locality later on). Let's use this issue for the initial discussion.

Implement JDBC driver

As a user, I want to be able to connect to a Ballista cluster from JDBC and run queries. Note that there is already PoC code for this under jvm/jdbc.

Set up regular call to discuss the project

I'm based in Colorado, which is Mountain Time (https://time.is/MT), and early morning is a good time for me, so I'd like to start by proposing that. This is the end of the day in the UK, so it hopefully works for the UK and Europe.

I am thinking every two weeks?

How about 8am MT every other Friday?

Remove dependency on kubectl

The code currently shells out to kubectl for certain operations, instead of just posting YAML directly to the k8s API.
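As a sketch of the direction, the client could POST manifests directly to the Kubernetes core/v1 REST endpoints instead. The helper names below are illustrative and only build the resource paths; the HTTP call itself, and authentication against the API server, are omitted.

```rust
// Hypothetical helpers: rather than shelling out to `kubectl apply`, the
// client could POST a Pod or Service manifest to the Kubernetes API server
// at these core/v1 paths.
fn pod_create_path(namespace: &str) -> String {
    format!("/api/v1/namespaces/{}/pods", namespace)
}

fn service_create_path(namespace: &str) -> String {
    format!("/api/v1/namespaces/{}/services", namespace)
}
```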

Implement Spark executor

I would like to be able to use Spark within Ballista. This will involve writing an executor in Scala that translates a Ballista query plan into a Spark query plan.

Set up Apache Spark benchmark

It would be very helpful if we could easily run comparable jobs on both Ballista and Spark and compare query plans and performance.

Research supporting WASM for custom code as part of a physical plan

As a user of Ballista, I would like the ability to execute arbitrary code as part of my distributed job. I want the ability to use multiple languages depending on my requirements (perhaps there are third party libraries I want to use).

WASM seems like a good potential choice for this?

Add support for binary expressions

It should be possible to execute queries using binary expressions e.g. some_column > 4. DataFusion already supports this and the proto file has definitions for this but the code to translate between Ballista and DataFusion plans does not yet support it.

We should also look at the Gandiva proto, which has rich support for expressions, and see if we can conform to their approach, since the plan is to combine the Ballista and Gandiva protos at some point.
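A toy sketch of the kind of expression tree involved in translating something like some_column > 4. The types here are hypothetical stand-ins, not the actual Ballista or DataFusion definitions:

```rust
// Hypothetical expression tree; the real plan types differ.
#[derive(Debug)]
enum Expr {
    Column(String),
    Literal(i64),
    Gt(Box<Expr>, Box<Expr>),
}

// Evaluate against a single row of (column name, value) pairs.
// Comparison results are encoded as 1 (true) or 0 (false).
fn evaluate(expr: &Expr, row: &[(&str, i64)]) -> i64 {
    match expr {
        Expr::Column(name) => row
            .iter()
            .find(|(n, _)| *n == name.as_str())
            .map(|(_, v)| *v)
            .expect("unknown column"),
        Expr::Literal(v) => *v,
        Expr::Gt(l, r) => (evaluate(l, row) > evaluate(r, row)) as i64,
    }
}
```

The translation work in this issue is essentially walking such a tree in the proto representation and emitting the corresponding DataFusion expression nodes.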

Support calling Rust executor from JVM

Given that query plans are defined in protobuf, we should be able to pass a query plan from JVM to Rust via FFI and then receive the results back in Arrow IPC format.
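A minimal sketch of the Rust side of such an FFI boundary, assuming the JVM hands over the serialized protobuf plan as a byte buffer. The function name and behavior are hypothetical; it just measures the buffer as a stand-in for decoding and executing the plan and returning Arrow IPC bytes.

```rust
use std::slice;

// Hypothetical FFI entry point. The JVM side would pass a pointer to the
// serialized protobuf query plan plus its length (e.g. via JNI).
#[no_mangle]
pub extern "C" fn ballista_execute_plan(plan_ptr: *const u8, plan_len: usize) -> usize {
    // Safety: the caller must pass a valid buffer of `plan_len` bytes that
    // stays alive for the duration of this call.
    let plan_bytes = unsafe { slice::from_raw_parts(plan_ptr, plan_len) };

    // A real implementation would decode the plan here, execute it, and
    // return the results in Arrow IPC format; we just return the byte count.
    plan_bytes.len()
}
```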

Non-kube clusters

Hi Andy! This project looks awesome. I was curious whether you had any ideas about supporting other clustering techniques that are not based around kube?

Add support for HDFS

As discussed on Gitter, it would be great if Ballista could read from and write to HDFS.

Ideally Ballista would have a common way of accessing data from various sources such as S3 or other object stores.

What would the timeline for support here look like? I guess the initial step would be to design the storage API through which arbitrary sources could be accessed?

In terms of Rust support for HDFS, I haven't seen any particularly mature implementations. Do you see this (or a child project) as the place for that work to go? Or would that be more in the realm of Arrow?

Sorry if I've missed anything here or overlooked something obvious; I'm still in the grokking phase of Ballista, but I like what it does for distributed compute with Rust. HDFS support would really unblock things for me with regard to using Rust in a professional capacity.

Add cargo fmt check to Travis CI

We want to use cargo fmt to keep the code formatted consistently. The usual approach to this is to have the build fail if the code isn't formatted correctly.

Here is some example Travis YAML I found that should be pretty close to what we need:

language: rust
before_script:
  - rustup toolchain install nightly
  - rustup component add --toolchain nightly rustfmt-preview
  - which rustfmt || cargo install --force rustfmt-nightly
script:
  - cargo +nightly fmt --all -- --write-mode=diff
  - cargo build
  - cargo test

Create example CRD-based workflow

In order to demonstrate how K8s would help manage client interaction, create example CRD schemas and a controller that has stub functionality for query submission and execution planning.

Deliverables:
BallistaQuery CRD
BallistaPlan CRD
BallistaOperator controller implementation and deployment.

Create Java bindings

As a JVM (Java/Kotlin/Scala) user, I would like to be able to build a query plan, execute it on a Ballista cluster, and receive the results.

I plan on implementing this in Kotlin.

Re-export arrow and datafusion crates from ballista

I think this will be needed so that anyone writing Ballista applications can guarantee they're using the correct versions of arrow and datafusion. Eventually it may be desirable to hide away the arrow/datafusion implementation, but for now this is a fairly easy solution.

Implement Apache Spark Connector

As a Spark user, I want to be able to delegate some processing to Ballista executors, such as executing a Rust UDF against data.
