splitgraph / seafowl
Analytical database for data-driven Web applications 🪶
Home Page: https://seafowl.io
License: Apache License 2.0
(followup to #20)
The current default is to not send any Cache-Control headers at all, which means that browsers/CDNs will lean towards more caching (from my basic testing, it seems like Cloudflare caches for at least 2h: https://developers.cloudflare.com/cache/about/edge-browser-cache-ttl/)
Possible settings:
Some reference: https://jakearchibald.com/2016/caching-best-practices/
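Whatever value we settle on, attaching an explicit default is cheap if we stay on warp; a minimal sketch (the route and the max-age value are placeholders, not Seafowl's actual handler):

use warp::Filter;

#[tokio::main]
async fn main() {
    // Hypothetical stand-in for the real query route: attach an explicit
    // Cache-Control policy instead of leaving it up to each browser/CDN.
    let query = warp::path!("q" / String)
        .map(|_hash: String| "query results")
        .map(|body| warp::reply::with_header(body, "Cache-Control", "max-age=60"));

    warp::serve(query).run(([127, 0, 0, 1], 8080)).await;
}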
Add ability to cache query results in the same object storage that we use for actual Parquet files. This might not be crucial if we implement #20 (in which case we can deploy an HTTP cache in front of the executor and get the same effect)
Currently, we just ignore it:
Line 139 in f00efc4
Use the index in order to speed up queries with filters. Also TODO: figure out if it's even worth storing min-max values in the metadata database, given that we can also cheaply (?) read them from the actual Parquet files by hitting the object storage with Range requests.
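For context, the pruning itself is cheap once min-max values are available from either source; a sketch with a hypothetical PartitionStats struct (not Seafowl's actual catalog types):

// Hypothetical partition statistics, whether loaded from the metadata DB or
// decoded from the Parquet footer fetched with an HTTP Range request.
struct PartitionStats {
    object_storage_id: String,
    min: i64,
    max: i64,
}

// Keep only the partitions whose [min, max] range can contain rows matching
// a `col BETWEEN low AND high` filter; the rest never need to be fetched.
fn prune(partitions: &[PartitionStats], low: i64, high: i64) -> Vec<&PartitionStats> {
    partitions
        .iter()
        .filter(|p| p.max >= low && p.min <= high)
        .collect()
}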
Currently we don't handle loading local Parquet files well: it seems the plan loads the entire file into memory prior to re-partitioning and uploading to the object store.
For example (using area1.parquet from here, 2.45GB size):
CREATE EXTERNAL TABLE area1 STORED AS PARQUET LOCATION '/Users/markogrujic/Downloads/area1.parquet';
CREATE TABLE area1 AS SELECT * FROM staging.area1;
leads to the following memory profile (the first plateau corresponds to reading from the file itself)
Release on a tag, e.g. https://github.com/taiki-e/upload-rust-binary-action
Or build / upload a binary on every merge to master
Get a big-ish Parquet file (e.g. https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet):
~/seafowl-test $ du -h trase-supply-chains.parquet
146M trase-supply-chains.parquet
~/seafowl-test $ curl -iH "Authorization: Bearer BDCANn8w7ZnOzAS1HkFyBpmU3RF2Q25M" -F "data=@trase-supply-chains.parquet" http://localhost:8080/upload/public/supply_chains
HTTP/1.1 413 Payload Too Large
content-type: text/plain; charset=utf-8
content-length: 32
date: Mon, 22 Aug 2022 15:39:40 GMT
The request payload is too large
This might be related to Warp not supporting streaming multipart uploads.
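For what it's worth, warp's multipart filter also has a fairly small default length cap, so the 413 may just be that; a sketch of raising it (the 2 GiB figure is an arbitrary placeholder):

use warp::multipart::FormData;
use warp::Filter;

// Sketch: the upload route with an explicit (larger) multipart length cap;
// warp otherwise rejects bigger bodies with 413 Payload Too Large.
fn upload_route() -> impl Filter<Extract = (&'static str,), Error = warp::Rejection> + Clone {
    warp::path!("upload" / String / String)
        .and(warp::multipart::form().max_length(2 * 1024 * 1024 * 1024)) // 2 GiB
        .map(|_schema: String, _table: String, _form: FormData| "done")
}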
Currently, we just use to_string(): https://github.com/splitgraph/seafowl/blob/main/src/context.rs#L338
Consider using protobufs?
More discussion here:
Line 773 in f00efc4
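For reference, the datafusion-proto crate already round-trips logical plans through protobuf; a sketch (API names per recent datafusion-proto docs; exact signatures vary across DataFusion versions):

use datafusion::error::Result;
use datafusion::prelude::SessionContext;
use datafusion_proto::bytes::{logical_plan_from_bytes, logical_plan_to_bytes};

// Sketch: serialize a logical plan to protobuf bytes and read it back,
// instead of round-tripping through to_string().
async fn roundtrip() -> Result<()> {
    let ctx = SessionContext::new();
    let plan = ctx.sql("SELECT 1").await?.into_optimized_plan()?;
    let bytes = logical_plan_to_bytes(&plan)?;
    let decoded = logical_plan_from_bytes(&bytes, &ctx)?;
    assert_eq!(format!("{:?}", plan), format!("{:?}", decoded));
    Ok(())
}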
Add a Catalog / Repository implementation for SQLite:
https://github.com/splitgraph/seafowl/blob/main/src/catalog.rs#L26
https://github.com/splitgraph/seafowl/blob/f00efc451aaa80a818b42e5d0be72efe39f3f50c/src/repository.rs
We might be able to use very similar SQL statements / migration files. The SQLite database should be configurable to be in-memory or on-disk.
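A sketch of the connection setup with sqlx's SQLite driver (DSN strings per sqlx's docs):

use sqlx::sqlite::{SqlitePool, SqlitePoolOptions};

// Sketch: build an on-disk or in-memory SQLite catalog pool from config.
// dsn is e.g. "sqlite://seafowl.sqlite" (on-disk) or "sqlite::memory:";
// in-memory needs a single connection, since each new connection would
// otherwise get its own empty database.
async fn connect_catalog(dsn: &str) -> Result<SqlitePool, sqlx::Error> {
    SqlitePoolOptions::new().max_connections(1).connect(dsn).await
}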
Store the function settings / WASM code in the database:
Lines 835 to 847 in f00efc4
Non-tutorial docs:
seafowl.toml
Tutorial:
When using an S3-compatible object store, DataFusion / Seafowl currently download the object every time they need to scan through it. We should let users dedicate some scratch space on a Seafowl instance as object cache. Potential options:
An ObjectStore trait implementation that proxies an upstream object store and uses a local FS as cache.

Currently, on INSERT, one partition of the original DF plan produces one Parquet file. This is an issue because:
RepartitionExec with round-robin partitioning

Find a way to, when persisting the results of a query to object storage, chunk them up into partitions with a consistent length (e.g. 100k rows).
Follow-up to #20
Currently, we compute the ETag based on all versions of Seafowl tables in a query. This disregards:
Running e.g. ./seafowl --one-off "CREATE TABLE ..."
should, instead of starting a server, just execute a single command against the context and exit. This is going to be used when baking a dataset into a Docker image, e.g.:
RUN \
wget https://some-csv.com/file.csv -O file.csv && \
seafowl --one-off \
"CREATE EXTERNAL TABLE t LOCATION 'file://file.csv'; \
CREATE TABLE t_frozen AS SELECT * FROM t;" && \
rm file.csv
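A sketch of the flag itself, using clap's v4 derive API (flag and field names are illustrative):

use clap::Parser;

// Hypothetical CLI shape: with --one-off, run the statements and exit
// instead of starting the server.
#[derive(Parser)]
struct Args {
    /// Semicolon-separated statements to execute against the context before exiting.
    #[arg(long = "one-off")]
    one_off: Option<String>,
}

fn main() {
    let args = Args::parse();
    match args.one_off {
        Some(_sql) => { /* build the context, run each statement, exit */ }
        None => { /* start the HTTP server as usual */ }
    }
}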
Something like:
POST /upload/[schema]/[table]
receives a standard multi-part upload so that a user can upload with curl. Considerations:
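Independent of those considerations, a sketch of draining the uploaded parts with warp's multipart API (buffering in memory for brevity; a real implementation would stream to a temporary file):

use bytes::BufMut;
use futures::TryStreamExt;
use warp::multipart::{FormData, Part};

// Sketch: collect the uploaded part(s) into a single buffer that we can
// then hand to the Parquet/CSV ingestion path.
async fn read_upload(form: FormData) -> Result<Vec<u8>, warp::Error> {
    let parts: Vec<Part> = form.try_collect().await?;
    let mut file = Vec::new();
    for part in parts {
        part.stream()
            .try_fold(&mut file, |acc, buf| {
                acc.put(buf);
                async move { Ok(acc) }
            })
            .await?;
    }
    Ok(file)
}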
Version 1: single-node "Web" version
- object_store (s3 feature)
- convergence

Version 2: Full version
- object_store (also supports GCP/Azure but we can keep those as compile-time crate features for very advanced users)

Also see how much the WASI UDF support weighs.
If the GET API receives an unhashed SQL query in the URL (e.g. just urlencoded), hash it and redirect the user to the hashed endpoint, as per #20 (comment).
Need to consider the fact that the redirect only makes sense if the client ends up sending the actual query in the GET body/header correctly. So we need to use a redirect that preserves the headers/body, and the client needs to send the headers/body in the original request as well, potentially limiting the friendliness of this.
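A sketch of the redirect half (a 308 Permanent Redirect, which, unlike 301/302, requires clients to replay the method, headers and body against the new location):

use sha2::{Digest, Sha256};
use warp::http::StatusCode;

// Sketch: given the raw (urlencoded) query from the URL, compute its hash
// and point the client at the canonical /q/[hash] endpoint.
fn hashed_redirect(query: &str) -> impl warp::Reply {
    let hash = format!("{:x}", Sha256::digest(query.as_bytes()));
    warp::reply::with_header(
        warp::reply::with_status(warp::reply::reply(), StatusCode::PERMANENT_REDIRECT),
        "Location",
        format!("/q/{}", hash),
    )
}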
More discussion here (very similar to UPDATE):
Line 773 in f00efc4
Some kind of a SELECT * FROM [table_name] AS OF [timestamp] syntax.
A TOML file that lets the user configure:
etc
Currently, when writing data, we create a new region with an auto-incrementing ID (a row in the physical_region table in the database). The ID of the region isn't content-addressable (it always increases), but the object storage ID is. This means that if we're about to write the same Parquet file (same hash), we'll create a new row in the physical_region table (doesn't consume much space) and overwrite the same file in the object storage (doesn't consume space, but consumes time uploading the file).
Lines 340 to 356 in f00efc4
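Since the object storage ID is content-addressable, we could skip the re-upload with a cheap existence check; a sketch against the object_store crate's ObjectStore trait (2022-era put/head signatures):

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Sketch: only upload a Parquet file if an object with the same content
// hash isn't already in the bucket, saving the upload time.
async fn upload_if_missing(
    store: &dyn ObjectStore,
    hash: &str,
    data: Bytes,
) -> object_store::Result<()> {
    let path = Path::from(format!("{}.parquet", hash));
    match store.head(&path).await {
        Ok(_) => Ok(()), // same content already uploaded: skip
        Err(object_store::Error::NotFound { .. }) => store.put(&path, data).await,
        Err(e) => Err(e),
    }
}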
Figure out:
Commandline option and/or daemon to delete:
~/seafowl/examples/multinode $ curl -i -H "Content-Type: application/json" localhost:8080/q -d '{"query": "CREATE TABLE tripdata AS SELECT * FROM staging.tripdata"}'
HTTP/1.1 400 Bad Request
Server: nginx/1.23.1
Date: Thu, 15 Sep 2022 09:51:36 GMT
Content-Length: 803
Connection: keep-alive
vary: Content-Type, Origin, X-Seafowl-Query
Object Store error: Generic S3 error: Error performing create multipart request: response error "<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>SignatureDoesNotMatch</Code><Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message><Key>9bf2215e727bf8d194973f5346daa1c7ce4ed17c05dd4e590ab365dc995fa997.parquet</Key><BucketName>seafowl</BucketName><Resource>/seafowl/9bf2215e727bf8d194973f5346daa1c7ce4ed17c05dd4e590ab365dc995fa997.parquet</Resource><RequestId>1714FFDF2B463C6E</RequestId><HostId>b0ac6bdc-1f16-40f2-9fe6-c2025e3e134f</H
tcpdump of Seafowl <> Minio comms (note the empty region in the credential scope below, Credential=minioadmin/20220915//s3/aws4_request, which suggests the signature may be computed without a region set):
Frame 92: 533 bytes on wire (4264 bits), 533 bytes captured (4264 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 172.25.0.4, Dst: 172.25.0.2
Transmission Control Protocol, Src Port: 35334, Dst Port: 9000, Seq: 1, Ack: 1, Len: 465
Hypertext Transfer Protocol
POST /seafowl/0c509e2d1a1b69006c40072c9ee40c525fcd8eb8443060058ab44c73ad6c54bf.parquet?uploads HTTP/1.1\r\n
x-amz-date: 20220915T095136Z\r\n
x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\n
authorization: AWS4-HMAC-SHA256 Credential=minioadmin/20220915//s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=0aeb8d2eed42e74260c91724a349823dc659f6dfbdde0670ac91f5869ace4b2e\r\n
accept: */*\r\n
host: minio:9000\r\n
\r\n
[Full request URI: http://minio:9000/seafowl/0c509e2d1a1b69006c40072c9ee40c525fcd8eb8443060058ab44c73ad6c54bf.parquet?uploads]
[HTTP request 1/3]
[Response in frame: 94]
[Next request in frame: 96]
Frame 94: 1002 bytes on wire (8016 bits), 1002 bytes captured (8016 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 172.25.0.2, Dst: 172.25.0.4
Transmission Control Protocol, Src Port: 9000, Dst Port: 35334, Seq: 1, Ack: 466, Len: 934
Hypertext Transfer Protocol
HTTP/1.1 403 Forbidden\r\n
Accept-Ranges: bytes\r\n
Content-Length: 529\r\n
Content-Security-Policy: block-all-mixed-content\r\n
Content-Type: application/xml\r\n
Server: MinIO\r\n
Strict-Transport-Security: max-age=31536000; includeSubDomains\r\n
Vary: Origin\r\n
Vary: Accept-Encoding\r\n
X-Amz-Request-Id: 1714FFDF2AEB6157\r\n
X-Content-Type-Options: nosniff\r\n
X-Xss-Protection: 1; mode=block\r\n
Date: Thu, 15 Sep 2022 09:51:36 GMT\r\n
\r\n
[HTTP response 1/3]
[Time since request: 0.000320000 seconds]
[Request in frame: 92]
[Next request in frame: 96]
[Next response in frame: 98]
[Request URI: http://minio:9000/seafowl/0c509e2d1a1b69006c40072c9ee40c525fcd8eb8443060058ab44c73ad6c54bf.parquet?uploads]
File Data: 529 bytes
eXtensible Markup Language
This can be done without changing the Parquet files (the query engine right now can handle Parquet files with missing columns and pads them with NULLs)
Consider this SQL:
SELECT value
-- some comment
FROM some_table
HTTP headers don't support newlines, so our current Observable client replaces newlines with spaces. In this case, that also changes the semantic meaning of the query (the FROM some_table becomes part of the comment). It also doesn't deal with Unicode and non-printable ASCII characters. We should figure out a way to safely pass queries with any character as the header.
Headers only support printable ASCII characters. Even though it's not a direct authority on what headers proxies and fetch() will accept, CF's Page Rules docs are a good approximation. So, to save on space, we don't need to URL-encode all characters that would normally be URL-encoded (e.g. spaces).
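One option along those lines: percent-encode only what headers can't carry (control characters, non-ASCII and % itself) and leave printable ASCII like spaces untouched. A sketch with the percent-encoding crate:

use percent_encoding::{percent_decode_str, utf8_percent_encode, AsciiSet, CONTROLS};

// Encode control characters plus '%' itself (so decoding is unambiguous);
// utf8_percent_encode always encodes non-ASCII bytes. Printable ASCII like
// spaces passes through untouched, keeping headers compact.
const HEADER_UNSAFE: &AsciiSet = &CONTROLS.add(b'%');

fn encode_query_header(query: &str) -> String {
    utf8_percent_encode(query, HEADER_UNSAFE).to_string()
}

fn decode_query_header(header: &str) -> Option<String> {
    percent_decode_str(header)
        .decode_utf8()
        .ok()
        .map(|s| s.into_owned())
}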
Several examples in the docs (e.g. https://www.splitgraph.com/docs/seafowl/guides/baking-dataset-docker-image#building-a-docker-image that uses #39) require being able to pass multiple queries in a single command. We currently don't allow that (copied from DF): https://github.com/splitgraph/seafowl/blob/main/src/context.rs#L551-L558
The user could always send their queries one-by-one (which is why this is a P2, if we don't have time for it, we can change the docs to mention that), but we can use this as a basis to create implicit mini-transactions, i.e.:
CREATE TABLE staging AS (...);
DROP TABLE production;
ALTER TABLE staging RENAME TO production
which would get executed by Seafowl inside of a transaction on the catalog. The effect would be that other queries will always see the production table, even while it's being recreated.
The semantics would be:
e.g. https://github.com/splitgraph/seafowl/blob/main/src/catalog.rs#L177
Could use https://github.com/dtolnay/thiserror, though a lot of these error handling libraries are designed for libraries rather than applications -- we can get away with returning DataFusionError objects and logging them properly.
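If we do reach for thiserror, the application-level wrapper can stay small; a sketch with illustrative variants:

use datafusion::error::DataFusionError;
use thiserror::Error;

// Sketch: one application-level error type; #[from] gives us `?` conversions
// from the underlying libraries.
#[derive(Debug, Error)]
enum SeafowlError {
    #[error("query execution error: {0}")]
    Execution(#[from] DataFusionError),
    #[error("catalog error: {0}")]
    Catalog(#[from] sqlx::Error),
}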
Alluded to in #48.
Start a transaction before planning a batch of Seafowl statements, roll it back on error and commit on success (before returning a result): https://docs.rs/sqlx/latest/sqlx/struct.Transaction.html . Useful for:
CREATE staging + DELETE current + ALTER staging RENAME TO current (other readers will either see the old or the new version of the table when querying current)
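Mechanically, this maps onto sqlx's transaction API around the catalog writes; a sketch (sqlx 0.6-era Executor usage; the table/column names are illustrative):

use sqlx::{Pool, Postgres};

// Sketch: wrap a batch of catalog writes in one transaction so concurrent
// readers see either all of the batch's effects or none of them.
async fn run_batch(pool: &Pool<Postgres>) -> Result<(), sqlx::Error> {
    let mut tx = pool.begin().await?;
    // ... plan each statement in the batch, issuing catalog writes via &mut tx;
    // on any error, returning early drops `tx`, which rolls the batch back.
    sqlx::query("DELETE FROM \"table\" WHERE name = 'production'")
        .execute(&mut tx)
        .await?;
    sqlx::query("UPDATE \"table\" SET name = 'production' WHERE name = 'staging'")
        .execute(&mut tx)
        .await?;
    tx.commit().await?;
    Ok(())
}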
We currently do not support UDAFs (user-defined aggregation functions), even though DataFusion does (https://docs.rs/datafusion/latest/datafusion/physical_plan/udaf/struct.AggregateUDF.html).
The most basic implementation would be expecting the WASM function to be an "accumulator" (which we'd use to implement https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html) that takes in the current state and the new value, and returns the new state (or stores the state in the WASM memory).
Inspiration: https://www.postgresql.org/docs/current/xaggr.html / https://www.postgresql.org/docs/current/sql-createaggregate.html
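For illustration, a plain-Rust sketch of that accumulator contract (not DataFusion's actual trait), using a running average as the example:

// Sketch: the state-passing contract we'd expect an exported WASM
// aggregation function to implement.
#[derive(Default)]
struct AvgState {
    sum: f64,
    count: u64,
}

// Corresponds to a WASM export like: fn update(state_ptr, value) -> state_ptr.
fn update(state: AvgState, value: f64) -> AvgState {
    AvgState { sum: state.sum + value, count: state.count + 1 }
}

// Called once at the end to produce the aggregate result.
fn evaluate(state: &AvgState) -> f64 {
    state.sum / state.count as f64
}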
Currently, our WASM functions only support passing basic types like ints and floats. In order to be able to pass something more complex like strings or datetimes, we want to put them in the WASM memory and point the UDF to it.
We need to figure out the most ergonomic way for the function writer to do this. For reference, something like this:
#include <emscripten.h>
#include <stdlib.h>
#include <string.h>

/* Returns a newly-allocated copy of the input with the last two characters dropped. */
EMSCRIPTEN_KEEPALIVE char* test_string(char* input) {
    int len = strlen(input);
    char* out = malloc(len - 2 + 1);
    strncpy(out, input, len - 2);
    out[len - 2] = '\0'; /* strncpy doesn't null-terminate here */
    return out;
}
compiles to:
(type (;0;) (func (param i32) (result i32)))
...
(func (;3;) (type 0) (param i32) (result i32)
(local i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)
block ;; label = @1
local.get 0
local.tee 9
i32.const 3
i32.and
if ;; label = @2
loop ;; label = @3
local.get 0
...
This should work out of the box, without having to write a wrapper that converts some binary representation into a C string.
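On the host side, the handshake would be roughly as follows (a wasmtime sketch; it assumes the module exports its linear memory and a malloc, as Emscripten-style builds can be made to do, and that test_string is the export above):

use wasmtime::{Instance, Store};

// Sketch: copy a host string into the guest's linear memory via its own
// allocator, call the UDF, and read the NUL-terminated result back.
fn call_string_udf(
    store: &mut Store<()>,
    instance: &Instance,
    input: &str,
) -> anyhow::Result<String> {
    let memory = instance
        .get_memory(&mut *store, "memory")
        .ok_or_else(|| anyhow::anyhow!("no exported memory"))?;
    let malloc = instance.get_typed_func::<i32, i32>(&mut *store, "malloc")?;
    let func = instance.get_typed_func::<i32, i32>(&mut *store, "test_string")?;

    // Allocate guest memory and write the input string plus a NUL terminator.
    let ptr = malloc.call(&mut *store, input.len() as i32 + 1)?;
    memory.write(&mut *store, ptr as usize, input.as_bytes())?;
    memory.write(&mut *store, ptr as usize + input.len(), &[0])?;

    // Call the UDF and scan the returned pointer for the terminating NUL.
    let out_ptr = func.call(&mut *store, ptr)? as usize;
    let data = memory.data(&*store);
    let len = data[out_ptr..].iter().position(|&b| b == 0).unwrap_or(0);
    Ok(String::from_utf8_lossy(&data[out_ptr..out_ptr + len]).into_owned())
}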
Done in #71:
- MemoryManagerConfig: max runtime memory usage for plan execution (rough, since it doesn't track basic process data structures): https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/memory_manager.rs#L35-L55 / https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/runtime_env.rs#L141-L145
- DiskManagerConfig: using the OS temp directory / other path to spill data to disk during execution: https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/disk_manager.rs#L50-L65 / https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/runtime_env.rs#L135-L139
Misc:
Implement a read-only HTTP API that obeys HTTP cache semantics and can benefit from any CDN / cache like Varnish:
Client sends a query like:
GET /q/[SHA hash of the query text]
X-Seafowl-Query: [actual query text]
The server receives the query, checks it's a SELECT query, checks that the hash matches, executes it. It sends it back with an ETag that is a function of the versions of all tables that participate in this query:
200 OK
ETag: [hash of the version of the table(s)]
Cache-Control: max-age=3600 [optional]

[query results]
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control#directives
When the client's browser queries the data again, it can then pass the ETag to see if the query results have changed:
GET /q/[SHA hash of the query text]
X-Seafowl-Query: [actual query text]
If-None-Match: [etag]
Intermediate caches/CDNs might not even forward the query to the origin, instead serving it from cache if it's not stale. If it reaches the server, the server can cheaply revalidate the entry by recomputing the etag and responding with a 304 Not Modified
if the tables in the query haven't changed their versions (without having to execute the query).
This only works for SELECT queries. INSERT and other writes should come in via POST requests.
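From a client's perspective, the whole flow is a hash, a conditional GET and 304 handling; a reqwest sketch against the endpoints described above:

use sha2::{Digest, Sha256};

// Sketch: issue a cache-aware Seafowl query, revalidating with If-None-Match
// when we have an ETag from a previous response.
async fn cached_query(
    base: &str,
    query: &str,
    etag: Option<&str>,
) -> Result<reqwest::Response, reqwest::Error> {
    let hash = format!("{:x}", Sha256::digest(query.as_bytes()));
    let mut req = reqwest::Client::new()
        .get(format!("{}/q/{}", base, hash))
        .header("X-Seafowl-Query", query);
    if let Some(etag) = etag {
        req = req.header("If-None-Match", etag); // 304 if table versions unchanged
    }
    req.send().await
}

A 304 response carries no body, so on that path the caller just reuses its previously cached result.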
https://github.com/splitgraph/seafowl/blob/main/src/context.rs#L1090-L1091
Currently, the Parquet files don't seem to store min-max values for UTF-8 and probably other (e.g. timestamp) columns.
Make sure that, when we need to, we can migrate:
Framework: https://diataxis.fr/
Good doc structure examples:
Build a PoC of this working: https://www.splitgraph.com/docs/seafowl/guides/scaling-multiple-nodes#one-writer-many-readers-sqlite-and-litestream