splitgraph / seafowl
Analytical database for data-driven Web applications 🪶
Home Page: https://seafowl.io
License: Apache License 2.0
(followup to #20)
The current default is to not send any Cache-Control headers at all, which means that browsers/CDNs will lean towards more caching (from my basic testing, it seems like Cloudflare caches for at least 2h: https://developers.cloudflare.com/cache/about/edge-browser-cache-ttl/)
Possible settings:
Some reference: https://jakearchibald.com/2016/caching-best-practices/
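Whatever value we settle on, attaching an explicit default is cheap if we stay on warp; a minimal sketch (the route and the max-age value are placeholders, not Seafowl's actual handler):

use warp::Filter;

#[tokio::main]
async fn main() {
    // Hypothetical stand-in for the real query route: attach an explicit
    // Cache-Control policy instead of leaving it up to each browser/CDN.
    let query = warp::path!("q" / String)
        .map(|_hash: String| "query results")
        .map(|body| warp::reply::with_header(body, "Cache-Control", "max-age=60"));

    warp::serve(query).run(([127, 0, 0, 1], 8080)).await;
}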
Add ability to cache query results in the same object storage that we use for actual Parquet files. This might not be crucial if we implement #20 (in which case we can deploy an HTTP cache in front of the executor and get the same effect)
Currently, we just ignore it:
Line 139 in f00efc4
Use the index in order to speed up queries with filters. Also TODO: figure out if it's even worth storing min-max values in the metadata database, given that we can also cheaply (?) read them from the actual Parquet files by hitting the object storage with Range requests.
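For context, the pruning itself is cheap once min-max values are available from either source; a sketch with a hypothetical PartitionStats struct (not Seafowl's actual catalog types):

// Hypothetical partition statistics, whether loaded from the metadata DB or
// decoded from the Parquet footer fetched with an HTTP Range request.
struct PartitionStats {
    object_storage_id: String,
    min: i64,
    max: i64,
}

// Keep only the partitions whose [min, max] range can contain rows matching
// a `col BETWEEN low AND high` filter; the rest never need to be fetched.
fn prune(partitions: &[PartitionStats], low: i64, high: i64) -> Vec<&PartitionStats> {
    partitions
        .iter()
        .filter(|p| p.max >= low && p.min <= high)
        .collect()
}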
Currently we don't handle loading local Parquet files well: it seems the plan loads the entire file into memory prior to re-partitioning and uploading to the object store.
For example (using area1.parquet from here, 2.45GB size):
CREATE EXTERNAL TABLE area1 STORED AS PARQUET LOCATION '/Users/markogrujic/Downloads/area1.parquet';
CREATE TABLE area1 AS SELECT * FROM staging.area1;
leads to the following memory profile (the first plateau corresponds to reading from the file itself)
Release on a tag, e.g. https://github.com/taiki-e/upload-rust-binary-action
Or build / upload a binary on every merge to master
Get a big-ish Parquet file (e.g. https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet):
~/seafowl-test $ du -h trase-supply-chains.parquet
146M trase-supply-chains.parquet
~/seafowl-test $ curl -iH "Authorization: Bearer BDCANn8w7ZnOzAS1HkFyBpmU3RF2Q25M" -F "data=@trase-supply-chains.parquet" http://localhost:8080/upload/public/supply_chains
HTTP/1.1 413 Payload Too Large
content-type: text/plain; charset=utf-8
content-length: 32
date: Mon, 22 Aug 2022 15:39:40 GMT
The request payload is too large
This might be related to Warp not supporting streaming multipart uploads.
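For what it's worth, warp's multipart filter also has a fairly small default length cap, so the 413 may just be that; a sketch of raising it (the 2 GiB figure is an arbitrary placeholder):

use warp::multipart::FormData;
use warp::Filter;

// Sketch: the upload route with an explicit (larger) multipart length cap;
// warp otherwise rejects bigger bodies with 413 Payload Too Large.
fn upload_route() -> impl Filter<Extract = (&'static str,), Error = warp::Rejection> + Clone {
    warp::path!("upload" / String / String)
        .and(warp::multipart::form().max_length(2 * 1024 * 1024 * 1024)) // 2 GiB
        .map(|_schema: String, _table: String, _form: FormData| "done")
}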
Currently, we just use to_string(): https://github.com/splitgraph/seafowl/blob/main/src/context.rs#L338
Consider using protobufs?
More discussion here:
Line 773 in f00efc4
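For reference, the datafusion-proto crate already round-trips logical plans through protobuf; a sketch (API names per recent datafusion-proto docs; exact signatures vary across DataFusion versions):

use datafusion::error::Result;
use datafusion::prelude::SessionContext;
use datafusion_proto::bytes::{logical_plan_from_bytes, logical_plan_to_bytes};

// Sketch: serialize a logical plan to protobuf bytes and read it back,
// instead of round-tripping through to_string().
async fn roundtrip() -> Result<()> {
    let ctx = SessionContext::new();
    let plan = ctx.sql("SELECT 1").await?.into_optimized_plan()?;
    let bytes = logical_plan_to_bytes(&plan)?;
    let decoded = logical_plan_from_bytes(&bytes, &ctx)?;
    assert_eq!(format!("{:?}", plan), format!("{:?}", decoded));
    Ok(())
}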
Add a Catalog / Repository implementation for SQLite:
https://github.com/splitgraph/seafowl/blob/main/src/catalog.rs#L26
https://github.com/splitgraph/seafowl/blob/f00efc451aaa80a818b42e5d0be72efe39f3f50c/src/repository.rs
We might be able to use very similar SQL statements / migration files. The SQLite database should be configurable to be in-memory or on-disk.
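A sketch of the connection setup with sqlx's SQLite driver (DSN strings per sqlx's docs):

use sqlx::sqlite::{SqlitePool, SqlitePoolOptions};

// Sketch: build an on-disk or in-memory SQLite catalog pool from config.
// dsn is e.g. "sqlite://seafowl.sqlite" (on-disk) or "sqlite::memory:";
// in-memory needs a single connection, since each new connection would
// otherwise get its own empty database.
async fn connect_catalog(dsn: &str) -> Result<SqlitePool, sqlx::Error> {
    SqlitePoolOptions::new().max_connections(1).connect(dsn).await
}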
Store the function settings / WASM code in the database:
Lines 835 to 847 in f00efc4
Non-tutorial docs:
seafowl.toml
Tutorial:
When using an S3-compatible object store, DataFusion / Seafowl currently download the object every time they need to scan through it. We should let users dedicate some scratch space on a Seafowl instance as object cache. Potential options:
An ObjectStore trait implementation that proxies an upstream object store and uses a local FS as cache.

Currently, on INSERT, one partition of the original DF plan produces one Parquet file. This is an issue because:
RepartitionExec with round-robin partitioning

Find a way to, when persisting the results of a query to object storage, chunk them up into partitions with a consistent length (e.g. 100k rows).
Follow-up to #20
Currently, we compute the ETag based on all versions of Seafowl tables in a query. This disregards:
Running e.g. ./seafowl --one-off "CREATE TABLE ..."
should, instead of starting a server, just execute a single command against the context and exit. This is going to be used when baking a dataset into a Docker image, e.g.:
RUN \
wget https://some-csv.com/file.csv -O file.csv && \
seafowl --one-off \
"CREATE EXTERNAL TABLE t LOCATION 'file://file.csv'; \
CREATE TABLE t_frozen AS SELECT * FROM t;" && \
rm file.csv
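A sketch of the flag itself, using clap's v4 derive API (flag and field names are illustrative):

use clap::Parser;

// Hypothetical CLI shape: with --one-off, run the statements and exit
// instead of starting the server.
#[derive(Parser)]
struct Args {
    /// Semicolon-separated statements to execute against the context before exiting.
    #[arg(long = "one-off")]
    one_off: Option<String>,
}

fn main() {
    let args = Args::parse();
    match args.one_off {
        Some(_sql) => { /* build the context, run each statement, exit */ }
        None => { /* start the HTTP server as usual */ }
    }
}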
Something like:
POST /upload/[schema]/[table]
receives a standard multi-part upload so that a user can upload with curl. Considerations:
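Independent of those considerations, a sketch of draining the uploaded parts with warp's multipart API (buffering in memory for brevity; a real implementation would stream to a temporary file):

use bytes::BufMut;
use futures::TryStreamExt;
use warp::multipart::{FormData, Part};

// Sketch: collect the uploaded part(s) into a single buffer that we can
// then hand to the Parquet/CSV ingestion path.
async fn read_upload(form: FormData) -> Result<Vec<u8>, warp::Error> {
    let parts: Vec<Part> = form.try_collect().await?;
    let mut file = Vec::new();
    for part in parts {
        part.stream()
            .try_fold(&mut file, |acc, buf| {
                acc.put(buf);
                async move { Ok(acc) }
            })
            .await?;
    }
    Ok(file)
}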
Version 1: single-node "Web" version
- object_store (s3 feature)
- convergence

Version 2: Full version
- object_store (also supports GCP/Azure but we can keep those as compile-time crate features for very advanced users)

Also see how much the WASI UDF support weighs.
If the GET API receives an unhashed SQL query in the URL (e.g. just urlencoded), hash it and redirect the user to the hashed endpoint, as per #20 (comment).
Need to consider the fact that the redirect only makes sense if the client ends up sending the actual query in the GET body/header correctly. So we need to use a redirect that preserves the headers/body, and the client needs to send the headers/body in the original request as well, potentially limiting the friendliness of this.
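A sketch of the redirect half (a 308 Permanent Redirect, which, unlike 301/302, requires clients to replay the method, headers and body against the new location):

use sha2::{Digest, Sha256};
use warp::http::StatusCode;

// Sketch: given the raw (urlencoded) query from the URL, compute its hash
// and point the client at the canonical /q/[hash] endpoint.
fn hashed_redirect(query: &str) -> impl warp::Reply {
    let hash = format!("{:x}", Sha256::digest(query.as_bytes()));
    warp::reply::with_header(
        warp::reply::with_status(warp::reply::reply(), StatusCode::PERMANENT_REDIRECT),
        "Location",
        format!("/q/{}", hash),
    )
}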
More discussion here (very similar to UPDATE):
Line 773 in f00efc4
Some kind of a SELECT * FROM [table_name] AS OF [timestamp] syntax.
A TOML file that lets the user configure:
etc
Currently, when writing data, we create a new region with an auto-incrementing ID (a row in the physical_region table in the database). The ID of the region isn't content-addressable (it always increases), but the object storage ID is. This means that if we're about to write the same Parquet file (same hash), we'll create a new row in the physical_region table (doesn't consume much space) and overwrite the same file in the object storage (doesn't consume space, but consumes time uploading the file).
Lines 340 to 356 in f00efc4
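Since the object storage ID is content-addressable, we could skip the re-upload with a cheap existence check; a sketch against the object_store crate's ObjectStore trait (2022-era put/head signatures):

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Sketch: only upload a Parquet file if an object with the same content
// hash isn't already in the bucket, saving the upload time.
async fn upload_if_missing(
    store: &dyn ObjectStore,
    hash: &str,
    data: Bytes,
) -> object_store::Result<()> {
    let path = Path::from(format!("{}.parquet", hash));
    match store.head(&path).await {
        Ok(_) => Ok(()), // same content already uploaded: skip
        Err(object_store::Error::NotFound { .. }) => store.put(&path, data).await,
        Err(e) => Err(e),
    }
}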
Figure out:
Commandline option and/or daemon to delete:
~/seafowl/examples/multinode $ curl -i -H "Content-Type: application/json" localhost:8080/q -d '{"query": "CREATE TABLE tripdata AS SELECT * FROM staging.tripdata"}'
HTTP/1.1 400 Bad Request
Server: nginx/1.23.1
Date: Thu, 15 Sep 2022 09:51:36 GMT
Content-Length: 803
Connection: keep-alive
vary: Content-Type, Origin, X-Seafowl-Query
Object Store error: Generic S3 error: Error performing create multipart request: response error "<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>SignatureDoesNotMatch</Code><Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message><Key>9bf2215e727bf8d194973f5346daa1c7ce4ed17c05dd4e590ab365dc995fa997.parquet</Key><BucketName>seafowl</BucketName><Resource>/seafowl/9bf2215e727bf8d194973f5346daa1c7ce4ed17c05dd4e590ab365dc995fa997.parquet</Resource><RequestId>1714FFDF2B463C6E</RequestId><HostId>b0ac6bdc-1f16-40f2-9fe6-c2025e3e134f</H
tcpdump of Seafowl <> Minio comms (note the empty region in the credential scope below, Credential=minioadmin/20220915//s3/aws4_request, which suggests the signature may be computed without a region set):
Frame 92: 533 bytes on wire (4264 bits), 533 bytes captured (4264 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 172.25.0.4, Dst: 172.25.0.2
Transmission Control Protocol, Src Port: 35334, Dst Port: 9000, Seq: 1, Ack: 1, Len: 465
Hypertext Transfer Protocol
POST /seafowl/0c509e2d1a1b69006c40072c9ee40c525fcd8eb8443060058ab44c73ad6c54bf.parquet?uploads HTTP/1.1\r\n
x-amz-date: 20220915T095136Z\r\n
x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\n
authorization: AWS4-HMAC-SHA256 Credential=minioadmin/20220915//s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=0aeb8d2eed42e74260c91724a349823dc659f6dfbdde0670ac91f5869ace4b2e\r\n
accept: */*\r\n
host: minio:9000\r\n
\r\n
[Full request URI: http://minio:9000/seafowl/0c509e2d1a1b69006c40072c9ee40c525fcd8eb8443060058ab44c73ad6c54bf.parquet?uploads]
[HTTP request 1/3]
[Response in frame: 94]
[Next request in frame: 96]
Frame 94: 1002 bytes on wire (8016 bits), 1002 bytes captured (8016 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 172.25.0.2, Dst: 172.25.0.4
Transmission Control Protocol, Src Port: 9000, Dst Port: 35334, Seq: 1, Ack: 466, Len: 934
Hypertext Transfer Protocol
HTTP/1.1 403 Forbidden\r\n
Accept-Ranges: bytes\r\n
Content-Length: 529\r\n
Content-Security-Policy: block-all-mixed-content\r\n
Content-Type: application/xml\r\n
Server: MinIO\r\n
Strict-Transport-Security: max-age=31536000; includeSubDomains\r\n
Vary: Origin\r\n
Vary: Accept-Encoding\r\n
X-Amz-Request-Id: 1714FFDF2AEB6157\r\n
X-Content-Type-Options: nosniff\r\n
X-Xss-Protection: 1; mode=block\r\n
Date: Thu, 15 Sep 2022 09:51:36 GMT\r\n
\r\n
[HTTP response 1/3]
[Time since request: 0.000320000 seconds]
[Request in frame: 92]
[Next request in frame: 96]
[Next response in frame: 98]
[Request URI: http://minio:9000/seafowl/0c509e2d1a1b69006c40072c9ee40c525fcd8eb8443060058ab44c73ad6c54bf.parquet?uploads]
File Data: 529 bytes
eXtensible Markup Language
This can be done without changing the Parquet files (the query engine right now can handle Parquet files with missing columns and pads them with NULLs)
Consider this SQL:
SELECT value
-- some comment
FROM some_table
HTTP headers don't support newlines, so our current Observable client replaces newlines with spaces. In this case, that also changes the semantic meaning of the query (the FROM some_table becomes part of the comment). It also doesn't deal with Unicode and non-printable ASCII characters. We should figure out a way to safely pass queries with any character as the header.
Headers only support printable ASCII characters. Even though it's not a direct authority on what headers proxies and fetch() will accept, CF's Page Rules docs are a good approximation. So, to save on space, we don't need to URL-encode all characters that would normally be URL-encoded (e.g. spaces).
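One option along those lines: percent-encode only what headers can't carry (control characters, non-ASCII and % itself) and leave printable ASCII like spaces untouched. A sketch with the percent-encoding crate:

use percent_encoding::{percent_decode_str, utf8_percent_encode, AsciiSet, CONTROLS};

// Encode control characters plus '%' itself (so decoding is unambiguous);
// utf8_percent_encode always encodes non-ASCII bytes. Printable ASCII like
// spaces passes through untouched, keeping headers compact.
const HEADER_UNSAFE: &AsciiSet = &CONTROLS.add(b'%');

fn encode_query_header(query: &str) -> String {
    utf8_percent_encode(query, HEADER_UNSAFE).to_string()
}

fn decode_query_header(header: &str) -> Option<String> {
    percent_decode_str(header)
        .decode_utf8()
        .ok()
        .map(|s| s.into_owned())
}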
Several examples in the docs (e.g. https://www.splitgraph.com/docs/seafowl/guides/baking-dataset-docker-image#building-a-docker-image that uses #39) require being able to pass multiple queries in a single command. We currently don't allow that (copied from DF): https://github.com/splitgraph/seafowl/blob/main/src/context.rs#L551-L558
The user could always send their queries one-by-one (which is why this is a P2, if we don't have time for it, we can change the docs to mention that), but we can use this as a basis to create implicit mini-transactions, i.e.:
CREATE TABLE staging AS (...);
DROP TABLE production;
ALTER TABLE staging RENAME TO production
which would get executed by Seafowl inside of a transaction on the catalog. The effect would be that other queries will always see the production table, even while it's being recreated.
The semantics would be:
e.g. https://github.com/splitgraph/seafowl/blob/main/src/catalog.rs#L177
Could use https://github.com/dtolnay/thiserror, though a lot of these error handling libraries are designed for libraries rather than applications -- we can get away with returning DataFusionError objects and logging them properly.
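If we do reach for thiserror, the application-level wrapper can stay small; a sketch with illustrative variants:

use datafusion::error::DataFusionError;
use thiserror::Error;

// Sketch: one application-level error type; #[from] gives us `?` conversions
// from the underlying libraries.
#[derive(Debug, Error)]
enum SeafowlError {
    #[error("query execution error: {0}")]
    Execution(#[from] DataFusionError),
    #[error("catalog error: {0}")]
    Catalog(#[from] sqlx::Error),
}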
Alluded to in #48.
Start a transaction before planning a batch of Seafowl statements, roll it back on error and commit on success (before returning a result): https://docs.rs/sqlx/latest/sqlx/struct.Transaction.html . Useful for:
CREATE staging + DELETE current + ALTER staging RENAME TO current (other readers will either see the old or the new version of the table when querying current)
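Mechanically, this maps onto sqlx's transaction API around the catalog writes; a sketch (sqlx 0.6-era Executor usage; the table/column names are illustrative):

use sqlx::{Pool, Postgres};

// Sketch: wrap a batch of catalog writes in one transaction so concurrent
// readers see either all of the batch's effects or none of them.
async fn run_batch(pool: &Pool<Postgres>) -> Result<(), sqlx::Error> {
    let mut tx = pool.begin().await?;
    // ... plan each statement in the batch, issuing catalog writes via &mut tx;
    // on any error, returning early drops `tx`, which rolls the batch back.
    sqlx::query("DELETE FROM \"table\" WHERE name = 'production'")
        .execute(&mut tx)
        .await?;
    sqlx::query("UPDATE \"table\" SET name = 'production' WHERE name = 'staging'")
        .execute(&mut tx)
        .await?;
    tx.commit().await?;
    Ok(())
}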
We currently do not support UDAFs (user-defined aggregation functions), even though DataFusion does (https://docs.rs/datafusion/latest/datafusion/physical_plan/udaf/struct.AggregateUDF.html).
The most basic implementation would be expecting the WASM function to be an "accumulator" (which we'd use to implement https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html) that takes in the current state and the new value, and returns the new state (or stores the state in the WASM memory).
Inspiration: https://www.postgresql.org/docs/current/xaggr.html / https://www.postgresql.org/docs/current/sql-createaggregate.html
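For illustration, a plain-Rust sketch of that accumulator contract (not DataFusion's actual trait), using a running average as the example:

// Sketch: the state-passing contract we'd expect an exported WASM
// aggregation function to implement.
#[derive(Default)]
struct AvgState {
    sum: f64,
    count: u64,
}

// Corresponds to a WASM export like: fn update(state_ptr, value) -> state_ptr.
fn update(state: AvgState, value: f64) -> AvgState {
    AvgState { sum: state.sum + value, count: state.count + 1 }
}

// Called once at the end to produce the aggregate result.
fn evaluate(state: &AvgState) -> f64 {
    state.sum / state.count as f64
}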
Currently, our WASM functions only support passing basic types like ints and floats. In order to be able to pass something more complex like strings or datetimes, we want to put them in the WASM memory and point the UDF to it.
We need to figure out the most ergonomic way for the function writer to do this. For reference, something like this:
#include <emscripten.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly-allocated copy of the input with the last two characters dropped. */
EMSCRIPTEN_KEEPALIVE char* test_string(char* input) {
    int len = strlen(input);
    char* out = malloc(len - 2 + 1);
    strncpy(out, input, len - 2);
    out[len - 2] = '\0'; /* strncpy doesn't null-terminate here */
    return out;
}
compiles to:
(type (;0;) (func (param i32) (result i32)))
...
(func (;3;) (type 0) (param i32) (result i32)
(local i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)
block ;; label = @1
local.get 0
local.tee 9
i32.const 3
i32.and
if ;; label = @2
loop ;; label = @3
local.get 0
...
This should work out of the box, without having to write a wrapper that converts some binary representation into a C string.
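On the host side, the handshake would be roughly as follows (a wasmtime sketch; it assumes the module exports its linear memory and a malloc, as Emscripten-style builds can be made to do, and that test_string is the export above):

use wasmtime::{Instance, Store};

// Sketch: copy a host string into the guest's linear memory via its own
// allocator, call the UDF, and read the NUL-terminated result back.
fn call_string_udf(
    store: &mut Store<()>,
    instance: &Instance,
    input: &str,
) -> anyhow::Result<String> {
    let memory = instance
        .get_memory(&mut *store, "memory")
        .ok_or_else(|| anyhow::anyhow!("no exported memory"))?;
    let malloc = instance.get_typed_func::<i32, i32>(&mut *store, "malloc")?;
    let func = instance.get_typed_func::<i32, i32>(&mut *store, "test_string")?;

    // Allocate guest memory and write the input string plus a NUL terminator.
    let ptr = malloc.call(&mut *store, input.len() as i32 + 1)?;
    memory.write(&mut *store, ptr as usize, input.as_bytes())?;
    memory.write(&mut *store, ptr as usize + input.len(), &[0])?;

    // Call the UDF and scan the returned pointer for the terminating NUL.
    let out_ptr = func.call(&mut *store, ptr)? as usize;
    let data = memory.data(&*store);
    let len = data[out_ptr..].iter().position(|&b| b == 0).unwrap_or(0);
    Ok(String::from_utf8_lossy(&data[out_ptr..out_ptr + len]).into_owned())
}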
Done in #71:
- MemoryManagerConfig: max runtime memory usage for plan execution (rough, since it doesn't track basic process data structures): https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/memory_manager.rs#L35-L55 / https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/runtime_env.rs#L141-L145
- DiskManagerConfig: using the OS temp directory / other path to spill data to disk during execution: https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/disk_manager.rs#L50-L65 / https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/runtime_env.rs#L135-L139
Misc:
Implement a read-only HTTP API that obeys HTTP cache semantics and can benefit from any CDN / cache like Varnish:
Client sends a query like:
GET /q/[SHA hash of the query text]
X-Seafowl-Query: [actual query text]
The server receives the query, checks it's a SELECT query, checks that the hash matches, executes it. It sends it back with an ETag that is a function of the versions of all tables that participate in this query:
200 OK
ETag: [hash of the version of the table(s)]
Cache-Control: max-age=3600 [optional]

[query results]
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control#directives
When the client's browser queries the data again, it can then pass the ETag to see if the query results have changed:
GET /q/[SHA hash of the query text]
X-Seafowl-Query: [actual query text]
If-None-Match: [etag]
Intermediate caches/CDNs might not even forward the query to the origin, instead serving it from cache if it's not stale. If it reaches the server, the server can cheaply revalidate the entry by recomputing the etag and responding with a 304 Not Modified
if the tables in the query haven't changed their versions (without having to execute the query).
This only works for SELECT queries. INSERT and other writes should come in via POST requests.
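From a client's perspective, the whole flow is a hash, a conditional GET and 304 handling; a reqwest sketch against the endpoints described above:

use sha2::{Digest, Sha256};

// Sketch: issue a cache-aware Seafowl query, revalidating with If-None-Match
// when we have an ETag from a previous response.
async fn cached_query(
    base: &str,
    query: &str,
    etag: Option<&str>,
) -> Result<reqwest::Response, reqwest::Error> {
    let hash = format!("{:x}", Sha256::digest(query.as_bytes()));
    let mut req = reqwest::Client::new()
        .get(format!("{}/q/{}", base, hash))
        .header("X-Seafowl-Query", query);
    if let Some(etag) = etag {
        req = req.header("If-None-Match", etag); // 304 if table versions unchanged
    }
    req.send().await
}

A 304 response carries no body, so on that path the caller just reuses its previously cached result.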
https://github.com/splitgraph/seafowl/blob/main/src/context.rs#L1090-L1091
Currently, the Parquet files don't seem to store min-max values for UTF-8 and probably other (e.g. timestamp) columns.
Make sure that, when we need to, we can migrate:
Framework: https://diataxis.fr/
Good doc structure examples:
Build a PoC of this working: https://www.splitgraph.com/docs/seafowl/guides/scaling-multiple-nodes#one-writer-many-readers-sqlite-and-litestream