ballista-compute / ballista

Distributed compute platform implemented in Rust, and powered by Apache Arrow.
Home Page: https://ballistacompute.org
License: Apache License 2.0
These are turning out to be a pain because of the dependency on the Arrow GitHub repo.
I would like the ability to run applications in "client mode" (see the Apache Spark documentation [0] for a description of local, client, and cluster modes). Ballista currently only supports cluster mode, which means we have to package the example in a Docker image every time we want to run it.
This means we need a way to connect to a Ballista container from outside of the cluster, so we need to either enable ingress directly on the executor services, or add a new "master" pod with ingress that can then submit to the cluster.
[0] https://spark.apache.org/docs/latest/submitting-applications.html
I wrote the following terrible code to get the response from the future that completes when the server returns results. I feel like this would be a good place to use async/await when it is available, or at least to use channels?
let results: Arc<Mutex<Vec<proto::ExecuteResponse>>> = Arc::new(Mutex::new(vec![]));
tokio::run(execute); // `execute` is the future that pushes the server's response into `results`
let lock = results.lock().unwrap();
Ok(lock[0].clone())
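As a minimal sketch of the channel-based alternative mentioned above: instead of sharing an Arc<Mutex<Vec<_>>> between the task and the caller, the task can send its single response over a channel and the caller can block on the receiving end. The `ExecuteResponse` struct and `execute_and_wait` function below are stand-ins invented for illustration (the real response type comes from the generated protobuf code, and the real work would be a gRPC call rather than a spawned thread).

```rust
use std::sync::mpsc;
use std::thread;

// Placeholder for proto::ExecuteResponse; the real type is generated from the proto file.
#[derive(Clone, Debug)]
struct ExecuteResponse {
    message: String,
}

// Run the "server call" on another thread and send the result back over a channel,
// avoiding the shared Arc<Mutex<Vec<_>>> and the lock/clone dance entirely.
fn execute_and_wait() -> ExecuteResponse {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // ... perform the gRPC call here, then send the response when it arrives ...
        let response = ExecuteResponse { message: "ok".to_string() };
        tx.send(response).expect("receiver dropped");
    });
    // Block until the single response arrives.
    rx.recv().expect("sender dropped")
}

fn main() {
    let response = execute_and_wait();
    println!("{:?}", response);
}
```

The same shape translates directly to async/await later: the channel becomes a oneshot future and `recv()` becomes an `.await`.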
Every build of the two Docker images downloads and compiles all dependencies from scratch.
Given a simple aggregate query such as the following one, Ballista should be able to create a plan to execute this query against each partition and then combine the results and run a secondary aggregate query on those.
This will also require some partition-aware metadata about files.
SELECT passenger_count, MAX(fare_amount) FROM tripdata GROUP BY passenger_count
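The two-phase split described above can be sketched in plain Rust. This is an illustration only, with made-up in-memory partitions of `(passenger_count, fare_amount)` pairs rather than real Arrow record batches: phase one computes a partial MAX per group within each partition, and phase two runs a secondary MAX over the partial results.

```rust
use std::collections::HashMap;

// Phase 1: compute MAX(fare_amount) per passenger_count within a single partition.
fn partial_max(partition: &[(u32, f64)]) -> HashMap<u32, f64> {
    let mut acc: HashMap<u32, f64> = HashMap::new();
    for &(passenger_count, fare_amount) in partition {
        let entry = acc.entry(passenger_count).or_insert(f64::MIN);
        if fare_amount > *entry {
            *entry = fare_amount;
        }
    }
    acc
}

// Phase 2: combine partial results from all partitions with a secondary MAX.
fn final_max(partials: Vec<HashMap<u32, f64>>) -> HashMap<u32, f64> {
    let mut acc: HashMap<u32, f64> = HashMap::new();
    for partial in partials {
        for (group, max) in partial {
            let entry = acc.entry(group).or_insert(f64::MIN);
            if max > *entry {
                *entry = max;
            }
        }
    }
    acc
}

fn main() {
    let p1 = partial_max(&[(1, 10.0), (2, 7.5)]);
    let p2 = partial_max(&[(1, 12.0), (3, 4.0)]);
    println!("{:?}", final_max(vec![p1, p2]));
}
```

MAX is trivially decomposable; COUNT and AVG would need different partial states (a count, or a sum plus a count), which is part of what the distributed planner has to know per aggregate function.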
Rather than directly using the format required by a specific machine scheduler, we allow app/executor descriptors to be defined in a simpler JSON/TOML format that gets translated into the target scheduler's format.
I want to be able to connect to a cluster and run SQL and see results. There is a REPL in DataFusion that could be used as a starting point.
Kubernetes is a bit of a beast for a lot of people to step into. Perhaps supporting a simpler scheduler could benefit this project.
Leaning towards using Gitter
I'm considering CircleCI or GitHub Actions. It would be good to set up code coverage for Rust and Kotlin as well, perhaps using codecov.io.
As a step towards having a distributed query planner, I want to update the current example to build a distributed query plan manually and test out that flow. This will help drive requirements for the distributed query planner.
Data nerds love their jupyter notebooks
https://github.com/google/evcxr
I've run into this issue when trying to run the repo example. I'm still investigating what could be going wrong, but thought it worth the time to throw up some info while I investigate.
I'm running this:
cargo run --bin ballista -- create-cluster --name nyctaxi --num-executors 12 --template examples/nyctaxi/templates/executor.yaml
Output from the program is:
Executed subcommand create-cluster in 1.566 seconds
Expected: a number of ballista-nyctaxi-* K8s resources when you run kubectl get pods
Actual: No resources found.
minikube version
output: minikube version: v1.2.0
minikube status
output:
host: Running
kubelet: Running
apiserver: Running
kubectl: Correctly Configured: pointing to minikube-vm at 192.168.99.101
kubectl cluster-info
output:
Kubernetes master is running at https://192.168.99.101:8443
KubeDNS is running at https://192.168.99.101:8443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
kubectl version
output: Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.11", GitCommit:"637c7e288581ee40ab4ca210618a89a555b6e7e9", GitTreeState:"clean", BuildDate:"2018-11-26T14:38:32Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"darwin/amd64"} Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.0", GitCommit:"e8462b5b5dc2584fdcd18e6bcfe9f1e4d970a529", GitTreeState:"clean", BuildDate:"2019-06-19T16:32:14Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"linux/amd64"}
Given a simple query consisting of any combination of TableScan, Projection, Selection (and no LIMIT/OFFSET/SORT/UNION) then Ballista should be able to execute these queries in parallel across partitions and simply combine the results without further processing.
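A rough std-only sketch of that execution strategy, with invented types (tuples of `(row_id, value)` standing in for record batches, and a hard-coded `value > threshold` selection standing in for the plan): each partition runs the same projection/selection on its own thread, and because the query has no aggregate, sort, or limit, the driver just concatenates the per-partition outputs.

```rust
use std::thread;

// Apply a selection (value > threshold) and a projection (keep only the value)
// to one partition; this stands in for running the per-partition plan on an executor.
fn run_partition(partition: Vec<(u64, i64)>, threshold: i64) -> Vec<i64> {
    partition
        .into_iter()
        .filter(|&(_, value)| value > threshold)
        .map(|(_, value)| value)
        .collect()
}

// Fan out one task per partition, then simply concatenate the results;
// no further processing is needed for TableScan/Projection/Selection queries.
fn run_parallel(partitions: Vec<Vec<(u64, i64)>>, threshold: i64) -> Vec<i64> {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|p| thread::spawn(move || run_partition(p, threshold)))
        .collect();
    let mut results = Vec::new();
    for h in handles {
        results.extend(h.join().expect("partition task panicked"));
    }
    results
}

fn main() {
    let partitions = vec![vec![(1, 5), (2, 15)], vec![(3, 25), (4, 2)]];
    println!("{:?}", run_parallel(partitions, 10));
}
```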
What's the status of this project? I was curious about it but don't want to invest time if it's abandoned. The commit log looks like it has slowed down.
Currently, the Kubernetes URL is hard-coded and it needs to be configurable.
We need to think about how we want configuration to work in general. I am a fan of the twelve-factor methodology [0] in general, and would encourage anyone working on this issue to at least read the configuration section there.
In a similar vein of reducing complexity: maybe some documentation on how to run Ballista on k3s from scratch would allow people to get started more simply.
In the PoC, one pod and one service is created for each executor, but how should this really work? StatefulSet? Service with ReplicaSet?
I think the distributed query planner needs the ability to target specific executors (for things like data locality later on). Let's use this issue for the initial discussion.
As a user, I want to be able to connect to a Ballista cluster from JDBC and run queries. Note that there is already PoC code for this under jvm/jdbc.
I'm based in Colorado which is Mountain Time (https://time.is/MT) and early morning is a good time for me so I'd like to start by proposing that. This is end of day in UK so hopefully works for UK and Europe.
I am thinking every two weeks?
How about 8am MT every other Friday?
There are so many sources of errors and we'll need to provide context to the user for each one. snafu seems to provide a nice way of doing this in a single place, so could be worth investigating.
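To make the "context for each error source" idea concrete, here is a hand-rolled sketch of the pattern using only std: one enum variant per error source, each carrying the context the user needs. The variant names and the `create_cluster` function are hypothetical; the point of snafu is that it generates most of this boilerplate (Display impls, context selectors) via derive macros.

```rust
use std::fmt;

// One variant per error source, each carrying user-facing context.
#[derive(Debug)]
enum BallistaError {
    KubernetesError { operation: String, detail: String },
    PlanError { detail: String },
}

impl fmt::Display for BallistaError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BallistaError::KubernetesError { operation, detail } => {
                write!(f, "Kubernetes operation '{}' failed: {}", operation, detail)
            }
            BallistaError::PlanError { detail } => {
                write!(f, "invalid query plan: {}", detail)
            }
        }
    }
}

impl std::error::Error for BallistaError {}

// Hypothetical fallible operation returning a contextualized error.
fn create_cluster(name: &str) -> Result<(), BallistaError> {
    if name.is_empty() {
        return Err(BallistaError::KubernetesError {
            operation: "create-cluster".to_string(),
            detail: "cluster name must not be empty".to_string(),
        });
    }
    Ok(())
}

fn main() {
    if let Err(e) = create_cluster("") {
        println!("{}", e);
    }
}
```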
If other tools in this space are anything to go by, we'll end up with a lot of options in the CLI, so having arg parsing handled automatically would be great.
The example currently relies on data being located at /mnt/ssd/nyc_taxis/csv, and we at least need to document this so others can run the example.
The code currently shells out to kubectl for certain operations, instead of just posting YAML directly to the k8s API.
Currently the project uses a git submodule to clone a copy of tower-grpc because I ran into problems early on with adding a dependency on the published release.
I would like to be able to use Spark within Ballista. This will involve writing an executor in Scala that translates a Ballista query plan into a Spark query plan.
It will be very helpful if we can easily run comparable jobs on both Ballista and Spark and compare query plans and performance.
As a user of Ballista, I would like the ability to execute arbitrary code as part of my distributed job. I want the ability to use multiple languages depending on my requirements (perhaps there are third party libraries I want to use).
WASM seems like a good potential choice for this?
It should be possible to execute queries using binary expressions, e.g. some_column > 4. DataFusion already supports this, and the proto file has definitions for it, but the code that translates between Ballista and DataFusion plans does not yet support it.
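For illustration, a toy expression tree and evaluator for exactly the some_column > 4 case. The `Expr` enum here is invented and much smaller than the real plan types (which live in the Ballista protobuf definitions and DataFusion's logical plan); it just shows the shape of evaluating a binary comparison over a column to produce a selection mask.

```rust
// A tiny expression language: a column reference or an integer literal.
enum Expr {
    Column,
    Literal(i64),
}

// Resolve an expression to a scalar for one row of the column.
fn scalar(e: &Expr, row_value: i64) -> i64 {
    match e {
        Expr::Column => row_value,
        Expr::Literal(n) => *n,
    }
}

// Evaluate `left > right` over every value in a column, producing a boolean mask
// that a Selection operator would use to filter rows.
fn eval_gt(left: &Expr, right: &Expr, column: &[i64]) -> Vec<bool> {
    column
        .iter()
        .map(|&v| scalar(left, v) > scalar(right, v))
        .collect()
}

fn main() {
    // some_column > 4 evaluated against the column [2, 5, 7]
    let mask = eval_gt(&Expr::Column, &Expr::Literal(4), &[2, 5, 7]);
    println!("{:?}", mask);
}
```

The translation work tracked by this issue is essentially mapping protobuf nodes of this shape onto DataFusion's existing expression types rather than evaluating them directly.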
We should also look at the Gandiva proto which has rich support for expressions and see if we can conform with their approach since the plan is to combine Ballista and Gandiva proto at some point.
Given that query plans are defined in protobuf, we should be able to pass a query plan from JVM to Rust via FFI and then receive the results back in Arrow IPC format.
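A minimal sketch of the Rust side of that FFI boundary, under the assumption that the JVM (e.g. via JNI) hands over the serialized protobuf plan as a raw pointer plus length. The function name and return value are invented: here it just validates the input and reports the byte count, where the real implementation would decode the plan, execute it, and hand back Arrow IPC bytes. In the real library the function would also carry `#[no_mangle]` so the JVM can resolve the symbol.

```rust
use std::slice;

// FFI entry point sketch: receive a serialized protobuf plan from the JVM side.
// Returns the number of plan bytes received (a stand-in for real execution).
pub extern "C" fn ballista_execute_plan(plan_ptr: *const u8, plan_len: usize) -> usize {
    if plan_ptr.is_null() {
        return 0; // defensive: never dereference a null pointer from the caller
    }
    // SAFETY: the caller guarantees plan_ptr points to plan_len valid bytes.
    let plan_bytes = unsafe { slice::from_raw_parts(plan_ptr, plan_len) };
    // ... decode the protobuf plan, execute it, serialize results as Arrow IPC ...
    plan_bytes.len()
}

fn main() {
    let plan = vec![1u8, 2, 3];
    println!("{}", ballista_execute_plan(plan.as_ptr(), plan.len()));
}
```

Because both the plan (protobuf) and the results (Arrow IPC) are language-neutral byte formats, the FFI surface can stay as small as a single pointer-plus-length call in each direction.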
The spark benchmark fails with:
Caused by: java.net.UnknownHostException: io-andygrove-ballista-spark-main-1564066103519-driver-svc.default.svc
This is probably some minikube DNS issue that needs to be figured out and documented.
Hi andy! This project looks awesome, I was curious if you had any ideas about supporting other clustering techniques that are not based around kube?
As discussed on Gitter, it would be great if Ballista could read from and write to HDFS.
Ideally Ballista would have a common way of accessing data from various sources such as S3 or other object stores.
What would the timeline for support here look like? I guess the initial steps would be to design the storage API, which arbitrary sources could access?
In terms of Rust support for HDFS, I haven't seen any particularly mature implementations. Do you see this (or a child project) as the place for that work to go? Or would that be more in the realm of Arrow?
Sorry if I've missed anything here or overlooked something obvious; I'm still in the grokking phase of Ballista but like what it does for distributed compute with Rust. HDFS support would really unblock things for me with respect to using Rust in a professional capacity.
We want to use cargo fmt to keep the code formatted consistently. The usual approach to this is to have the build fail if the code isn't formatted correctly.
Here is some example travis yaml I found that should be pretty close to what we need:
language: rust
before_script:
- rustup toolchain install nightly
- rustup component add --toolchain nightly rustfmt-preview
- which rustfmt || cargo install --force rustfmt-nightly
script:
- cargo +nightly fmt --all -- --check
- cargo build
- cargo test
In order to demonstrate how K8s would help manage client interaction, create example CRD schemas and a controller that has stub functionality for query submission and execution planning.
Deliverables:
BallistaQuery CRD
BallistaPlan CRD
BallistaOperator controller implementation and deployment.
As a JVM (Java/Kotlin/Scala) user, I would like to be able to build a query plan, execute it on a Ballista cluster, and receive the results.
I plan on implementing this in Kotlin.
I think this will be needed so that anyone writing Ballista applications can guarantee they're using the correct versions of arrow and datafusion. Eventually it may be desirable to hide away the arrow/datafusion implementation, but for now this is a fairly easy solution.
As a Spark user, I want to be able to delegate some processing to Ballista executors, such as executing a Rust UDF against data.
The current PoC code returns a single ExecuteResponse containing the entire data set. We will want to stream this instead.
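A std-only sketch of the batching side of that change, with invented names: instead of one response holding all rows, the server yields fixed-size batches through an iterator, which is the same shape a gRPC server-streaming response or an async Stream of Arrow record batches would take.

```rust
// Split one large result set into fixed-size batches so the client can
// consume results incrementally instead of one giant ExecuteResponse.
struct BatchStream {
    rows: Vec<i64>,
    batch_size: usize,
    offset: usize,
}

impl Iterator for BatchStream {
    type Item = Vec<i64>;

    fn next(&mut self) -> Option<Vec<i64>> {
        if self.offset >= self.rows.len() {
            return None; // all batches delivered
        }
        let end = (self.offset + self.batch_size).min(self.rows.len());
        let batch = self.rows[self.offset..end].to_vec();
        self.offset = end;
        Some(batch)
    }
}

fn stream_results(rows: Vec<i64>, batch_size: usize) -> BatchStream {
    BatchStream { rows, batch_size, offset: 0 }
}

fn main() {
    for batch in stream_results((0..10).collect(), 4) {
        println!("{:?}", batch);
    }
}
```

Streaming also bounds memory on both sides: neither the executor nor the client ever has to hold the full result set at once.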