
seafowl's Introduction

sgr

Build status Coverage Status PyPI version Discord chat room Follow

Overview

sgr is the CLI for Splitgraph, a serverless API for data-driven Web applications.

With addition of the optional sgr Engine component, sgr can become a stand-alone tool for building, versioning and querying reproducible datasets. We use it as the storage engine for Splitgraph. It's inspired by Docker and Git, so it feels familiar. And it's powered by PostgreSQL, so it works seamlessly with existing tools in the Postgres ecosystem. Use sgr to package your data into self-contained Splitgraph data images that you can share with other sgr instances.

To install the sgr CLI or a local sgr Engine, see the Installation section of this readme.

Build and Query Versioned, Reproducible Datasets

Splitfiles give you a declarative language, inspired by Dockerfiles, for expressing data transformations in ordinary SQL familiar to any researcher or business analyst. You can reference other images, or even other databases, with a simple JOIN.

When you build data images with Splitfiles, you get provenance tracking of the resulting data: it's possible to find out what sources went into every dataset and know when to rebuild it if the sources ever change. You can easily integrate sgr into your existing CI pipelines to keep your data up to date and stay on top of changes to upstream sources.

Splitgraph images are also version-controlled, and you can manipulate them with Git-like operations through a CLI. You can check out any image into a PostgreSQL schema and interact with it using any PostgreSQL client. sgr will capture your changes to the data, and then you can commit them as delta-compressed changesets that you can package into new images.

sgr supports PostgreSQL foreign data wrappers. We call this feature mounting. With mounting, you can query other databases (like PostgreSQL/MongoDB/MySQL) or open data providers (like Socrata) from your sgr instance with plain SQL. You can even snapshot the results or use them in Splitfiles.

Components

The code in this repository contains:

  • sgr CLI: sgr is the main command line tool used to work with Splitgraph "images" (data snapshots). Use it to ingest data, work with Splitfiles, and push data to Splitgraph.
  • sgr Engine: a Docker image of the latest Postgres with sgr and other required extensions pre-installed.
  • Splitgraph Python library: All sgr functionality is available in the Python API, offering first-class support for data science workflows including Jupyter notebooks and Pandas dataframes.

Docs

We also recommend reading our Blog, including some of our favorite posts.

Installation

Pre-requisites:

  • Docker is required to run the sgr Engine. sgr must have access to Docker. You either need to install Docker locally or have access to a remote Docker socket.

You can get the sgr single binary from the releases page. Optionally, you can run sgr engine add to create an engine.

For Linux and OSX, once Docker is running, install sgr with a single script:

$ bash -c "$(curl -sL https://github.com/splitgraph/sgr/releases/latest/download/install.sh)"

This will download the sgr binary and set up the sgr Engine Docker container.

See the installation guide for more installation methods.

Quick start guide

You can follow the quick start guide that will guide you through the basics of using sgr with Splitgraph or standalone.

Alternatively, sgr comes with plenty of examples to get you started.

If you're stuck or have any questions, check out the documentation or join our Discord channel!

Contributing

Setting up a development environment

  • sgr requires Python 3.7 or later.
  • Install Poetry (used to manage dependencies): curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
  • Install pre-commit hooks (we use Black to format code)
  • git clone --recurse-submodules https://github.com/splitgraph/sgr.git
  • poetry install
  • To build the engine Docker image: cd engine && make

Running tests

The test suite requires docker-compose. You will also need to add these lines to your /etc/hosts or equivalent:

127.0.0.1       local_engine
127.0.0.1       remote_engine
127.0.0.1       objectstorage

To run the core test suite, do

docker-compose -f test/architecture/docker-compose.core.yml up -d
poetry run pytest -m "not mounting and not example"

To run the test suite related to "mounting" and importing data from other databases (PostgreSQL, MySQL, Mongo), do

docker-compose -f test/architecture/docker-compose.core.yml -f test/architecture/docker-compose.mounting.yml up -d
poetry run pytest -m mounting

Finally, to test the example projects, do

# Example projects spin up their own engines
docker-compose -f test/architecture/docker-compose.core.yml -f test/architecture/docker-compose.mounting.yml down -v
poetry run pytest -m example

All of these tests run in CI.

seafowl's People

Contributors

dependabot[bot], gruuya, mildbyte, neumark, onpaws, sergeipatiakin, striezel


seafowl's Issues

Use a catalog DB-level transaction when running a Seafowl statement

Alluded to in #48.

Start a transaction before planning a batch of Seafowl statements, roll it back on error and commit on success (before returning a result): https://docs.rs/sqlx/latest/sqlx/struct.Transaction.html . Useful for:

  • automatic rollback e.g. an error during execution won't leave the DB in an inconsistent state (create a table version -> fail -> now we have a new empty table version)
  • being able to use table locks to e.g. disallow concurrent writes
  • atomicity (when someone runs CREATE staging + DELETE current + ALTER staging RENAME to current, other readers will either see the old or the new version of the table when querying current)
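A minimal sketch of the begin/commit/rollback flow with sqlx (the helper name and the idea of passing a list of raw statements are assumptions, not the actual Seafowl code):

use sqlx::{Pool, Postgres};

// Hypothetical helper: run a batch of catalog statements inside one transaction,
// rolling everything back if any statement fails.
async fn run_in_catalog_transaction(
    pool: &Pool<Postgres>,
    statements: &[&str],
) -> Result<(), sqlx::Error> {
    let mut tx = pool.begin().await?;

    for &stmt in statements {
        if let Err(e) = sqlx::query(stmt).execute(&mut *tx).await {
            // Explicit rollback; dropping the transaction would also roll it back.
            tx.rollback().await?;
            return Err(e);
        }
    }

    // Only commit once every statement has succeeded.
    tx.commit().await
}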

Support for changing the Cache-Control header sent by the GET API

(followup to #20)

The current default is not sending any cache-control headers at all, which means that the browser/CDNs will lean towards more caching. (from my basic testing, it seems like Cloudflare caches for 2h at least: https://developers.cloudflare.com/cache/about/edge-browser-cache-ttl/)

Possible settings:

  • how soon to revalidate with an If-None-Match (max-age?)
  • sending s-maxage as well as max-age (the former seems to influence CDNs)

Some reference: https://jakearchibald.com/2016/caching-best-practices/
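For illustration, attaching such headers to a warp reply could look roughly like this (the directive values are placeholders that would come from the proposed settings):

use warp::Reply;

// Hypothetical: wrap a query response in Cache-Control directives.
// `max-age` controls browser revalidation, `s-maxage` targets shared caches/CDNs.
fn with_cache_control(reply: impl Reply) -> impl Reply {
    warp::reply::with_header(reply, "Cache-Control", "max-age=60, s-maxage=3600")
}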

Use the min-max index when planning the query (partition pruning)

Currently, we just ignore it:

// TODO: use filters and apply them to regions here (grab the code from list_files_for_scan)

Use the index in order to speed up queries with filters. Also TODO: figure out if it's even worth storing min-max values in the metadata database, given that we can also cheaply (?) read them from the actual Parquet files by hitting the object storage with Range requests.
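A self-contained sketch of the pruning idea, assuming the metadata stores per-partition min/max values for the filtered column (the struct and field names are hypothetical):

// Hypothetical partition metadata with a min-max index for one column.
struct PartitionStats {
    object_storage_id: String,
    min: Option<i64>,
    max: Option<i64>,
}

// Keep only partitions whose [min, max] range can overlap `col BETWEEN lo AND hi`.
// Partitions with unknown stats must be kept to stay correct.
fn prune_partitions(partitions: &[PartitionStats], lo: i64, hi: i64) -> Vec<&PartitionStats> {
    partitions
        .iter()
        .filter(|p| match (p.min, p.max) {
            (Some(min), Some(max)) => max >= lo && min <= hi,
            _ => true,
        })
        .collect()
}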

Support for ALTER TABLE ... ADD COLUMN

This can be done without changing the Parquet files (the query engine can already handle Parquet files with missing columns and pads them with NULLs).
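A rough illustration of the padding trick with the 2022-era arrow crate API (a sketch, not the actual Seafowl code path): when a scanned Parquet file lacks a newly added column, a NULL array of the right length is appended to the batch.

use std::sync::Arc;
use arrow::array::new_null_array;
use arrow::datatypes::{Field, Schema};
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Hypothetical: extend a batch read from an older Parquet file with a
// NULL-filled column so it matches the post-ALTER schema.
fn pad_with_null_column(batch: &RecordBatch, new_field: Field) -> Result<RecordBatch, ArrowError> {
    let mut columns = batch.columns().to_vec();
    columns.push(new_null_array(new_field.data_type(), batch.num_rows()));

    let mut fields = batch.schema().fields().clone();
    fields.push(new_field);

    RecordBatch::try_new(Arc::new(Schema::new(fields)), columns)
}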

Workaround for having to load Parquet files in-memory before uploading them

See this TODO:

seafowl/src/context.rs

Lines 319 to 325 in f00efc4

// TODO: the object_store crate doesn't support multi-part uploads / uploading a file
// from a local path. This means we have to read the file back into memory in full.
// https://github.com/influxdata/object_store_rs/issues/9
//
// Another implication is that we could just keep everything in memory (point ArrowWriter to a byte buffer,
// call get_parquet_file_statistics on that, upload the file) and run the output routine for each partition
// sequentially.

The implications are:

  • we should have smaller regions (#4)
  • we could limit the amount of concurrent writes (since each region has to be in RAM)
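A sketch of the second implication, bounding the number of partitions being uploaded (and therefore held in RAM) at once with a tokio semaphore; the limit and the upload function are placeholders:

use std::sync::Arc;
use tokio::sync::Semaphore;

// Hypothetical: allow at most 4 in-memory partition uploads at a time.
async fn upload_partitions(partitions: Vec<Vec<u8>>) {
    let semaphore = Arc::new(Semaphore::new(4));
    let mut handles = Vec::new();

    for data in partitions {
        let permit = semaphore.clone().acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            // Placeholder for the actual ObjectStore::put call.
            upload_one(data).await;
            drop(permit); // frees a slot for the next partition
        }));
    }

    for handle in handles {
        handle.await.unwrap();
    }
}

async fn upload_one(_data: Vec<u8>) { /* placeholder */ }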

Query result cache

Add ability to cache query results in the same object storage that we use for actual Parquet files. This might not be crucial if we implement #20 (in which case we can deploy an HTTP cache in front of the executor and get the same effect)

More sophisticated ETag calculation

Follow-up to #20

Currently, we compute the ETag based on all versions of Seafowl tables in a query. This disregards:

Add configuration loading

A TOML file that lets the user configure:

  • metadata backend (SQLite in memory / on a path; PostgreSQL)
  • object storage backend (in memory / local path / S3 / GCS / Azure)
  • basic authorization (hash of a secret key used for writing)

etc
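For illustration only, the shape of such a config could be modelled with serde like this; all field and variant names here are hypothetical, not the final seafowl.toml format:

use serde::Deserialize;

// Hypothetical config shape covering the options listed above.
#[derive(Deserialize)]
struct SeafowlConfig {
    catalog: Catalog,
    object_store: ObjectStoreConfig,
    /// SHA-256 hash of the secret key required for writes.
    write_access_sha256: Option<String>,
}

#[derive(Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
enum Catalog {
    Sqlite { path: Option<String> }, // None = in-memory
    Postgres { dsn: String },
}

#[derive(Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
enum ObjectStoreConfig {
    Memory,
    Local { path: String },
    S3 { bucket: String, region: String },
}

fn load(path: &str) -> SeafowlConfig {
    let text = std::fs::read_to_string(path).expect("can't read config");
    toml::from_str(&text).expect("invalid config")
}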

Rechunk the table on INSERT instead of doing 1 Parquet file per partition

https://github.com/splitgraph/seafowl/blob/f00efc451aaa80a818b42e5d0be72efe39f3f50c/src/context.rs#310

Currently, on INSERT, one partition of the original DF plan produces one Parquet file. This is an issue because:

  • some plans might only ever have one partition (e.g. SELECT *)
  • the partitions might be unbalanced or without an obvious ordering (e.g. if passed through a RepartitionExec with round-robin partitioning)

Find a way to chunk the results of a query into partitions with a consistent length (e.g. 100k rows) when persisting them to object storage.
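A rough sketch of the rechunking step over Arrow record batches, assuming a fixed target row count (standalone, not tied to the actual plan execution code):

use arrow::record_batch::RecordBatch;

// Hypothetical: re-slice a stream of arbitrarily-sized batches into chunks of
// at most `target_rows` rows, so each output Parquet file has a consistent size.
fn rechunk(batches: Vec<RecordBatch>, target_rows: usize) -> Vec<Vec<RecordBatch>> {
    assert!(target_rows > 0);
    let mut chunks = Vec::new();
    let mut current: Vec<RecordBatch> = Vec::new();
    let mut current_rows = 0;

    for batch in batches {
        let mut offset = 0;
        while offset < batch.num_rows() {
            let take = (target_rows - current_rows).min(batch.num_rows() - offset);
            current.push(batch.slice(offset, take));
            current_rows += take;
            offset += take;

            if current_rows == target_rows {
                chunks.push(std::mem::take(&mut current));
                current_rows = 0;
            }
        }
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}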

Optimise reading from local files

Currently we don't handle loading local Parquet files well: it seems the plan loads the entire file into memory prior to re-partitioning and uploading to object storage.

For example (using area1.parquet from here, 2.45GB size):

CREATE EXTERNAL TABLE area1 STORED AS PARQUET LOCATION '/Users/markogrujic/Downloads/area1.parquet';
CREATE TABLE area1 AS SELECT * FROM staging.area1;

leads to the following memory profile (the first plateau corresponds to reading from the file itself):

[memory profile screenshot]

Add V0 authn/z

  • unauthenticated requests are read-only
  • authenticated requests (with a pre-shared secret) can write to the database
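A minimal sketch of the write-path check, assuming a pre-shared secret whose hex-encoded SHA-256 hash is stored in the config (the function and parameter names are hypothetical):

use sha2::{Digest, Sha256};

// Hypothetical: writes require `Authorization: Bearer <secret>`; reads don't.
// The config only stores the hex-encoded SHA-256 of the secret.
fn can_write(authorization: Option<&str>, expected_sha256_hex: &str) -> bool {
    let token = match authorization.and_then(|h| h.strip_prefix("Bearer ")) {
        Some(t) => t,
        None => return false,
    };
    let hash = hex::encode(Sha256::digest(token.as_bytes()));
    // A constant-time comparison would be preferable in a real implementation.
    hash == expected_sha256_hex
}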

Friendlier cached GET API

If the GET API receives an unhashed SQL query in the URL (e.g. just urlencoded), hash it and redirect the user to the hashed endpoint, as per #20 (comment).

Note that the redirect only makes sense if the client ends up sending the actual query in the GET body/header correctly: we need a redirect that preserves the headers/body, and the client needs to send the headers/body in the original request as well, which potentially limits the friendliness of this.

Persistence and execution for CREATE FUNCTION

Store the function settings / WASM code in the database:

seafowl/src/context.rs

Lines 835 to 847 in f00efc4

let _function = create_udf_from_wasm(
    &name.to_string(),
    &function_code,
    &details.entrypoint,
    details.input_types.iter().map(get_wasm_type).collect(),
    get_wasm_type(&details.return_type),
    get_volatility(&details.volatility),
)?;
// TODO we don't persist the function here in the database, so it'll get
// deleted every time we recreate the context
// also this requires &mut self
// self.inner.register_udf(function);
Ok(make_dummy_exec())

Research if the size difference makes it worth shipping 2 binary versions

Version 1: single-node "Web" version

  • Only SQLite backend
  • Only local storage (no object_store s3 feature)
  • Only HTTP frontend (no convergence)

Version 2: Full version

  • SQLite and PostgreSQL backend
  • S3 storage (object_store also supports GCP/Azure but we can keep those as compile-time crate features for very advanced users)
  • HTTP/PG frontend

Also see how much the WASI UDF support weighs.

Fix object_store 0.5.0 SignatureDoesNotMatch with Minio

 ~/seafowl/examples/multinode $ curl -i -H "Content-Type: application/json" localhost:8080/q   -d '{"query": "CREATE TABLE tripdata AS SELECT * FROM staging.tripdata"}'
HTTP/1.1 400 Bad Request
Server: nginx/1.23.1
Date: Thu, 15 Sep 2022 09:51:36 GMT
Content-Length: 803
Connection: keep-alive
vary: Content-Type, Origin, X-Seafowl-Query

Object Store error: Generic S3 error: Error performing create multipart request: response error "<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>SignatureDoesNotMatch</Code><Message>The request signature we calculated does not match the signature you provided. Check your key and signing method.</Message><Key>9bf2215e727bf8d194973f5346daa1c7ce4ed17c05dd4e590ab365dc995fa997.parquet</Key><BucketName>seafowl</BucketName><Resource>/seafowl/9bf2215e727bf8d194973f5346daa1c7ce4ed17c05dd4e590ab365dc995fa997.parquet</Resource><RequestId>1714FFDF2B463C6E</RequestId><HostId>b0ac6bdc-1f16-40f2-9fe6-c2025e3e134f</H

tcpdump of Seafowl <> Minio comms:

Frame 92: 533 bytes on wire (4264 bits), 533 bytes captured (4264 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 172.25.0.4, Dst: 172.25.0.2
Transmission Control Protocol, Src Port: 35334, Dst Port: 9000, Seq: 1, Ack: 1, Len: 465
Hypertext Transfer Protocol
    POST /seafowl/0c509e2d1a1b69006c40072c9ee40c525fcd8eb8443060058ab44c73ad6c54bf.parquet?uploads HTTP/1.1\r\n
    x-amz-date: 20220915T095136Z\r\n
    x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855\r\n
    authorization: AWS4-HMAC-SHA256 Credential=minioadmin/20220915//s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=0aeb8d2eed42e74260c91724a349823dc659f6dfbdde0670ac91f5869ace4b2e\r\n
    accept: */*\r\n
    host: minio:9000\r\n
    \r\n
    [Full request URI: http://minio:9000/seafowl/0c509e2d1a1b69006c40072c9ee40c525fcd8eb8443060058ab44c73ad6c54bf.parquet?uploads]
    [HTTP request 1/3]
    [Response in frame: 94]
    [Next request in frame: 96]

Frame 94: 1002 bytes on wire (8016 bits), 1002 bytes captured (8016 bits)
Linux cooked capture
Internet Protocol Version 4, Src: 172.25.0.2, Dst: 172.25.0.4
Transmission Control Protocol, Src Port: 9000, Dst Port: 35334, Seq: 1, Ack: 466, Len: 934
Hypertext Transfer Protocol
    HTTP/1.1 403 Forbidden\r\n
    Accept-Ranges: bytes\r\n
    Content-Length: 529\r\n
    Content-Security-Policy: block-all-mixed-content\r\n
    Content-Type: application/xml\r\n
    Server: MinIO\r\n
    Strict-Transport-Security: max-age=31536000; includeSubDomains\r\n
    Vary: Origin\r\n
    Vary: Accept-Encoding\r\n
    X-Amz-Request-Id: 1714FFDF2AEB6157\r\n
    X-Content-Type-Options: nosniff\r\n
    X-Xss-Protection: 1; mode=block\r\n
    Date: Thu, 15 Sep 2022 09:51:36 GMT\r\n
    \r\n
    [HTTP response 1/3]
    [Time since request: 0.000320000 seconds]
    [Request in frame: 92]
    [Next request in frame: 96]
    [Next response in frame: 98]
    [Request URI: http://minio:9000/seafowl/0c509e2d1a1b69006c40072c9ee40c525fcd8eb8443060058ab44c73ad6c54bf.parquet?uploads]
    File Data: 529 bytes
eXtensible Markup Language

Expose some DataFusion runtime execution settings in the config

Support for multiple statements in a single query

Several examples in the docs (e.g. https://www.splitgraph.com/docs/seafowl/guides/baking-dataset-docker-image#building-a-docker-image that uses #39) require being able to pass multiple queries in a single command. We currently don't allow that (copied from DF): https://github.com/splitgraph/seafowl/blob/main/src/context.rs#L551-L558

The user could always send their queries one by one (which is why this is a P2; if we don't have time for it, we can change the docs to mention that), but we can use this as a basis for implicit mini-transactions, i.e.:

CREATE TABLE staging AS (...);
DROP TABLE production;
ALTER TABLE staging RENAME TO production

which would get executed by Seafowl inside of a transaction on the catalog. The effect would be that other queries will always see the production table, even while it's being recreated.

The semantics would be:

  • return the output of the last query
  • for queries that aren't the last one, only execute writing queries (no point in executing RO queries if we won't see their results)
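A sketch of the splitting part using the sqlparser crate (which DataFusion uses underneath); the execution and transaction sides are elided:

use sqlparser::ast::Statement;
use sqlparser::dialect::GenericDialect;
use sqlparser::parser::{Parser, ParserError};

// Hypothetical: parse a semicolon-separated batch into individual statements,
// so they can later be executed inside one catalog transaction, returning only
// the result of the final statement.
fn split_statements(sql: &str) -> Result<Vec<Statement>, ParserError> {
    Parser::parse_sql(&GenericDialect {}, sql)
}

fn main() -> Result<(), ParserError> {
    let batch = "CREATE TABLE staging AS SELECT 1; \
                 DROP TABLE production; \
                 ALTER TABLE staging RENAME TO production";
    let statements = split_statements(batch)?;
    assert_eq!(statements.len(), 3);
    Ok(())
}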

Allow passing urlencoded (or otherwise escaped) values in X-Seafowl-Query

Consider this SQL:

SELECT value
-- some comment
FROM some_table

HTTP headers don't support newlines, so our current Observable client replaces newlines with spaces. In this case, this also changes the semantic meaning of the query (the FROM some_table becomes a comment). This also doesn't deal with Unicode and non-printable ASCII characters. We should figure out a way to safely pass queries with any character as the header.

Headers only support printable ASCII characters. Even though they're not a direct authority on what headers proxies and fetch() will accept, CF's Page Rules docs are a good approximation. So, to save on space, we don't need to URL-encode all characters that would normally be urlencoded (e.g. spaces).
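One possible approach with the percent-encoding crate: escape only control characters and '%' itself (non-ASCII bytes are always percent-encoded by the crate), so the header stays valid while common characters like spaces pass through unescaped. This is a sketch of the idea, not an agreed-upon encoding:

use percent_encoding::{percent_decode_str, utf8_percent_encode, AsciiSet, CONTROLS};

// Escape control characters (incl. newlines) and '%' itself; non-ASCII bytes are
// always percent-encoded. Printable ASCII such as spaces is left as-is.
const HEADER_ESCAPE: &AsciiSet = &CONTROLS.add(b'%');

fn encode_query_header(query: &str) -> String {
    utf8_percent_encode(query, HEADER_ESCAPE).to_string()
}

fn decode_query_header(value: &str) -> Result<String, std::str::Utf8Error> {
    Ok(percent_decode_str(value).decode_utf8()?.into_owned())
}

fn main() {
    let query = "SELECT value\n-- some comment\nFROM some_table";
    let header = encode_query_header(query);
    // Newlines become %0A, so the comment no longer swallows the FROM clause.
    assert_eq!(decode_query_header(&header).unwrap(), query);
}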

Investigate WASM "calling conventions" and passing non-scalar datatypes like strings

Currently, our WASM functions only support passing basic types like ints and floats. In order to be able to pass something more complex like strings or datetimes, we want to put them in the WASM memory and point the UDF to it.

We need to figure out what is the most ergonomic way for the function writer to do this. For reference, something like this:

#include <emscripten.h>
#include <stdlib.h>
#include <string.h>

// Returns the input string with the last two characters dropped.
EMSCRIPTEN_KEEPALIVE char* test_string(char* input) {
    int len = strlen(input);

    char *out = malloc(len - 2 + 1);

    strncpy(out, input, len - 2);
    out[len - 2] = '\0'; /* strncpy doesn't null-terminate here */
    return out;
}

compiles to:

(type (;0;) (func (param i32) (result i32)))
...
  (func (;3;) (type 0) (param i32) (result i32)
    (local i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32 i32)
    block  ;; label = @1
      local.get 0
      local.tee 9
      i32.const 3
      i32.and
      if  ;; label = @2
        loop  ;; label = @3
          local.get 0
...

This should work out of the box, without having to write a wrapper that converts some binary representation into a C string.

"The request payload is too large" when uploading a 150MB Parquet file

Get a big-ish Parquet file (e.g. https://seafowl-public.s3.eu-west-1.amazonaws.com/tutorial/trase-supply-chains.parquet):

~/seafowl-test $ du -h trase-supply-chains.parquet 
146M    trase-supply-chains.parquet
~/seafowl-test $ curl     -iH "Authorization: Bearer BDCANn8w7ZnOzAS1HkFyBpmU3RF2Q25M"     -F "[email protected]"     http://localhost:8080/upload/public/supply_chains
HTTP/1.1 413 Payload Too Large
content-type: text/plain; charset=utf-8
content-length: 32
date: Mon, 22 Aug 2022 15:39:40 GMT

The request payload is too large

This might be related to Warp not supporting streaming multipart uploads.

HTTP API: uploading data

Something like:

POST /upload/[schema]/[table]

receives a standard multi-part upload so that a user can upload with curl. Considerations:

  • Find a way to specify a schema for CSV files (GET params?)
  • Support uploading Parquet files
  • (stretch goal) support uploading SQLite?

CDN-friendly HTTP API

Implement a read-only HTTP API that obeys HTTP cache semantics and can benefit from any CDN / cache like Varnish:

Client sends a query like:

GET /q/[SHA hash of the query text]
X-Seafowl-Query: [actual query text]

The server receives the query, checks it's a SELECT query, checks that the hash matches, executes it. It sends it back with an ETag that is a function of the versions of all tables that participate in this query:

200 OK
[query results]
ETag: [hash of the version of the table(s)]
Cache-Control: max-age=3600 [optional]

https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Cache-Control#directives

When the client's browser queries the data again, it can then pass the ETag to see if the query results have changed:

GET /q/[SHA hash of the query text]
X-Seafowl-Query: [actual query text]
If-None-Match: [etag]

Intermediate caches/CDNs might not even forward the query to the origin, instead serving it from cache if it's not stale. If it reaches the server, the server can cheaply revalidate the entry by recomputing the etag and responding with a 304 Not Modified if the tables in the query haven't changed their versions (without having to execute the query).

This only works for SELECT queries. INSERT and other writes should come in via POST requests.
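For illustration, the ETag could be derived from the (table name, version) pairs that participate in the query; this is a hypothetical helper using the sha2 crate already used elsewhere in the codebase:

use sha2::{Digest, Sha256};

// Hypothetical: build an ETag from the versions of all tables used by the query.
// If none of the versions change, the ETag (and thus the cached response) stays valid.
fn compute_etag(mut tables: Vec<(String, u64)>) -> String {
    tables.sort(); // make the hash independent of the order tables appear in the plan
    let mut hasher = Sha256::new();
    for (name, version) in tables {
        hasher.update(name.as_bytes());
        hasher.update(version.to_le_bytes());
    }
    hex::encode(hasher.finalize())
}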

Write the tutorial

Non-tutorial docs:

  • Introduction
  • Guides:
    • Installation, single-node
    • Scaling to multiple nodes
    • Querying over HTTP
    • Querying over the cached HTTP API
    • Querying over PostgreSQL
    • Baking a dataset into a Docker image
    • Uploading a CSV/Parquet file
    • Writing data with SQL
    • Writing custom UDFs
  • Learning:
    • Architecture overview
    • Storage structure
    • Partition pruning
  • Reference:
    • Query syntax (SELECT, DDL, DML)
    • Types
    • HTTP API reference
    • PostgreSQL endpoint reference
    • seafowl.toml

Tutorial:

  • Part 1: running Seafowl locally
  • Part 2: deploying to fly.io (free tier)
  • Part 3, option 1: caching with Varnish
  • Part 3, option 2: CDN with Cloudflare
  • Part 4: visualization with Observable

Add a `--one-off` / `--command` CLI option

Running e.g. ./seafowl --one-off "CREATE TABLE ..." should, instead of starting a server, just execute a single command against the context and exit. This is going to be used when baking a dataset into a Docker image, e.g.:

RUN \
	wget https://some-csv.com/file.csv -O file.csv && \
	seafowl --one-off \
		"CREATE EXTERNAL TABLE t LOCATION 'file://file.csv'; \
		CREATE TABLE t_frozen AS SELECT * FROM t;" && \
	rm file.csv

Local object cache for S3

When using an S3-compatible object store, DataFusion / Seafowl currently download the object every time they need to scan through it. We should let users dedicate some scratch space on a Seafowl instance as object cache. Potential options:

  • write a wrapper around the ObjectStore trait that proxies an upstream object store and uses a local FS as cache
  • use goofys (https://github.com/kahing/goofys) + catfs, or s3fs?
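A sketch of the first option as a wrapper struct (not a full ObjectStore trait implementation), assuming the object_store crate's get()/bytes() API; cache eviction is out of scope here:

use std::{path::PathBuf, sync::Arc};

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Hypothetical wrapper: check a local scratch directory before hitting the
// upstream (S3-compatible) store, and populate it on a miss.
struct CachingStore {
    inner: Arc<dyn ObjectStore>,
    cache_dir: PathBuf,
}

impl CachingStore {
    async fn get_cached(&self, location: &Path) -> object_store::Result<Bytes> {
        let local = self.cache_dir.join(location.to_string());

        if let Ok(data) = tokio::fs::read(&local).await {
            return Ok(data.into()); // cache hit
        }

        // Cache miss: fetch from the upstream store and write it to scratch space.
        let data = self.inner.get(location).await?.bytes().await?;
        if let Some(parent) = local.parent() {
            let _ = tokio::fs::create_dir_all(parent).await;
        }
        let _ = tokio::fs::write(&local, &data).await;
        Ok(data)
    }
}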

Content-addressable object IDs

Currently, when writing data:

  • we create a region for every partition of the original plan (a row in the physical_region table in the database)
  • the region has a unique ID (just a bigint)
  • the region has an "object storage ID" (path to the physical file)

The ID of the region isn't content-addressable (it always increases), but the object storage ID is. This means that if we're about to write the same Parquet file (same hash), we'll create a new row in the physical_region table (doesn't consume much space) and overwrite the same file in the object storage (doesn't consume space but consumes time uploading the file)

seafowl/src/context.rs

Lines 340 to 356 in f00efc4

let mut hasher = Sha256::new();
hasher.update(&data);
let hash_str = encode(hasher.finalize());
let object_storage_id = hash_str + ".parquet";
store
    .put(&Path::from(object_storage_id.clone()), data)
    .await?;
let region = SeafowlRegion {
    object_storage_id: Arc::from(object_storage_id),
    row_count: region_stats
        .num_rows
        .expect("Error counting rows in the written file")
        .try_into()
        .expect("row count greater than 2147483647"),
    columns: Arc::new(columns),
};

Figure out:

  • if we want to have a separate "region ID" and "object storage ID"
  • how to skip uploading regions that already exist
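For the second point, a sketch of skipping the upload when an object with the same content hash already exists, using the object_store head() call (error handling is simplified and the helper name is hypothetical):

use bytes::Bytes;
use object_store::{path::Path, ObjectStore};

// Hypothetical: since the object storage ID is derived from the content hash,
// an object that already exists must contain identical data and can be skipped.
async fn put_if_absent(
    store: &dyn ObjectStore,
    object_storage_id: &str,
    data: Bytes,
) -> object_store::Result<()> {
    let path = Path::from(object_storage_id);
    if store.head(&path).await.is_ok() {
        return Ok(()); // already uploaded by a previous write
    }
    store.put(&path, data).await
}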

Support for WASM UDAFs

We currently do not support UDAFs (user defined aggregation functions), even though DataFusion does (https://docs.rs/datafusion/latest/datafusion/physical_plan/udaf/struct.AggregateUDF.html).

The most basic implementation would be expecting the WASM function to be an "accumulator" (which we'll use to implement https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.Accumulator.html) that takes in the current state, the new value, and returns the new state (or stores the state in the WASM memory)?

Inspiration: https://www.postgresql.org/docs/current/xaggr.html / https://www.postgresql.org/docs/current/sql-createaggregate.html
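As a concept sketch (plain Rust, not the actual DataFusion Accumulator trait or the WASM ABI), the accumulator contract described above would look roughly like this for a SUM aggregate:

// Hypothetical shape of the state-passing contract we'd expect from a WASM UDAF:
// the host repeatedly feeds (current state, new value) and gets the new state back,
// then asks for the final result at the end.
struct WasmSumAccumulator {
    state: i64,
}

impl WasmSumAccumulator {
    fn new() -> Self {
        Self { state: 0 }
    }

    // Corresponds to the "take the current state and a new value" step.
    fn update(&mut self, value: i64) {
        self.state += value;
    }

    // Needed to combine partial states from parallel partitions.
    fn merge(&mut self, other_state: i64) {
        self.state += other_state;
    }

    fn evaluate(&self) -> i64 {
        self.state
    }
}

fn main() {
    let mut acc = WasmSumAccumulator::new();
    for v in [1, 2, 3] {
        acc.update(v);
    }
    acc.merge(10); // partial state from another partition
    assert_eq!(acc.evaluate(), 16);
}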
