samtecspg / mason
Data Operator Framework
License: Apache License 2.0
Think about potential backends.
To avoid confusion, table-spanning operations should be moved out of the table namespace and into the database namespace. For example: merge, join, or list.
Currently only supports Athena and Dask:
https://github.com/samtecspg/mason/blob/master/mason/examples/operators/table/query/operator.yaml
As per this issue, there is still a gap between S3 and Airflow.
We can address this for now with documentation in mason explaining what we are doing (syncing local DAGs to S3), but that seems kind of unsavory. This would at least give users a plugin to use.
Table as a protocol would allow for expressions of things like S3Table vs. GlueTable, and similarly for path.
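A minimal sketch of the idea using typing.Protocol; the class shapes and method names here are assumptions, not mason's actual interfaces.

from typing import Protocol

class Table(Protocol):
    # Anything table-like exposes a fully qualified name (an assumed interface).
    def qualified_name(self) -> str: ...

class GlueTable:
    def __init__(self, database: str, table: str):
        self.database, self.table = database, table

    def qualified_name(self) -> str:
        return f"{self.database}.{self.table}"

class S3Table:
    def __init__(self, bucket: str, path: str):
        self.bucket, self.path = bucket, path

    def qualified_name(self) -> str:
        return f"s3://{self.bucket}/{self.path}"

def describe(t: Table) -> str:
    # Structural typing: both classes satisfy Table without inheriting from it.
    return t.qualified_name()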
For example, all table namespace operators should have required parameters
table_name
database_name
by default. Related to #46
Including company-specific buckets.
Make the mason-sample-data repo public (it only has NYC taxi data right now), and reference it in demos.
Currently you would have to run the REST equivalent of
mason config -s 4
mason operator table format ...
to run an operator by itself with a particular config_id. This creates issues with statefulness. I want to make it so that you can pass config_id as a parameter to the REST endpoint to avoid having to do this, as sketched below. Setting the config_id would then just set the "default" config_id. I will rename the REST API endpoint appropriately.
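A hedged sketch of what the per-request call could look like; the endpoint path and parameter names are assumptions, not the final API.

import requests

# Pass config_id per request rather than mutating server-side state first.
resp = requests.post(
    "http://localhost:5000/api/operator/table/format",  # hypothetical endpoint
    json={
        "config_id": "4",             # selects the config for this run only
        "database_name": "my_database",
        "table_name": "my_table",
    },
)
print(resp.json())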
Build demo shell file for 1.6.0
Dask support for the table query operator
Make Table a shell of what it is now, and delegate schema population to the execution engine (table.populate, or something to that extent).
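A hedged sketch of the delegation; infer_schema is an assumed engine method, not mason's current API.

class Table:
    # Table becomes a thin shell; the execution engine owns schema population.
    def __init__(self, name, database):
        self.name = name
        self.database = database
        self.schema = None

    def populate(self, engine):
        # engine is any execution engine exposing infer_schema(table) -- an assumption.
        self.schema = engine.infer_schema(self)
        return self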
Right now errors are collected into a single-line string. Either add line breaks, or collect them into an array so that the message resembles a log or a stack trace more than one continuous error string.
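A minimal sketch of the array approach; the class and method names are illustrative.

class ErrorCollector:
    def __init__(self):
        self.errors = []

    def add(self, message: str) -> None:
        self.errors.append(message)

    def message(self) -> str:
        # Joining with newlines makes the output read like a log, not one long string.
        return "\n".join(self.errors)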
Would like it if NotebookEnvironment read the boto config (~/.aws/credentials) when no config is specified. It would also be awesome if a profile within the credentials file could be specified.
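A sketch of the requested fallback using boto3's documented Session API; the NotebookEnvironment wiring around it is assumed.

import boto3

def default_session(profile=None):
    # boto3 reads ~/.aws/credentials on its own; profile_name picks a named section.
    if profile:
        return boto3.Session(profile_name=profile)
    return boto3.Session()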
Build the mason-dask repo and leverage it.
Make it so that run and run_async are the same again in operator definitions.
Make async an implicit property of a particular execution engine; maybe merge OperatorResponse and DelayedOperatorResponse.
Move things such as sampling files in infer into the local execution engine.
Start to work on actual asynchronous execution for operator runs
Would be nice to figure out how the mason->papermill part of the story will work. Still need to consider the various options here.
Suppose you wanted to have a workflow that did:
table list
followed by
join (for each item listed)
followed by
summarize (for the joined table)
Figure out how this would be handled in terms of the fan-out of the mason job scheduling and workflow handling.
+------------+     +-------------+     +-----------------+
|            |---->| table query |---->| table summarize |-----+
|            |     +-------------+     +-----------------+     |
|            |     +-------------+     +-----------------+     |
|            |---->| table query |---->| table summarize |-----+
| table list |     +-------------+     +-----------------+     |     +------------+     +--------------+
|            |                                                 +---->| table join |---->| table dedupe |
|            |     +-------------+     +-----------------+     |     +------------+     +--------------+
|            |---->| table query |---->| table summarize |-----+
|            |     +-------------+     +-----------------+     |
|            |     +-------------+     +-----------------+     |
|            |---->| table query |---->| table summarize |-----+
+------------+     +-------------+     +-----------------+
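A rough sketch of the fan-out in plain Python; run() is a hypothetical stand-in for whatever the scheduler ends up providing, not mason's actual API.

def run(command, **params):
    # Hypothetical stand-in: submit one operator run and return its result handle.
    print(f"scheduling: {command} {params}")
    return f"result of {command}"

tables = ["table_a", "table_b", "table_c"]  # pretend output of: table list
queries = [run("table query", table_name=t) for t in tables]          # fan out
summaries = [run("table summarize", table_name=q) for q in queries]   # per-branch step
joined = run("table join", tables=queries)                            # fan in
deduped = run("table dedupe", table_name=joined)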
A lot of code can be cleaned up if stdout is specified as an output path and set as the default for local synchronous operator runs (see: #60).
I want a way for engine models to be serialized and passed between various programming languages (Python -> Scala, for example) or libraries without having to depend on transpiling (like Jython).
Something like an Avro or Protobuf representation with client interpreters in each language.
Python -> Python is easy (just libraries on PyPI), but it handicaps the ability to scale beyond the Python ecosystem later on.
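For illustration, a language-neutral Avro record for one engine model, written as a Python dict; the record and field names are assumptions, not an agreed schema.

# A schema that both a Python producer and a Scala consumer could generate clients against.
EXECUTION_JOB_SCHEMA = {
    "type": "record",
    "name": "ExecutionJob",
    "namespace": "mason.engines",  # hypothetical namespace
    "fields": [
        {"name": "job_type", "type": "string"},
        {"name": "parameters", "type": {"type": "map", "values": "string"}},
    ],
}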
Details to come; basically, dedupe by a single key on a metastore table.
In the metadatabase namespace, the query operator would allow queries that span multiple distinct sources, i.e. federated queries. For example:
mason operator metadatabase query SELECT * from source1.table join source2.table on ID
where source1 is a database in Presto and source2 is an S3 bucket.
Would require doing: #15
Example of operators:
mason operator database query
This would allow you to do queries that span multiple tables, for example "join" queries.
Right now the 4 base engines (metastore, storage, execution, scheduler) are baked into the code; would like to generalize them.
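A minimal sketch of one way to generalize this: a registry keyed by engine type instead of four hard-coded slots. All names here are illustrative.

ENGINES = {}

def register_engine(engine_type: str):
    # Decorator: registers a client class under an arbitrary engine type.
    def wrap(cls):
        ENGINES.setdefault(engine_type, []).append(cls)
        return cls
    return wrap

@register_engine("metastore")
class GlueClient:
    pass

@register_engine("execution")
class DaskClient:
    pass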
Spark support for the table format operator
Baseline code for the format operator, irrespective of client implementation.
Need the ability to reference operators in a location without installing them, by specifying the operator home and just using them implicitly.
Spark support for the table query operator
Some aspects of the swagger.yml definitions for workflows and operators can be generated from the structure of the workflows and their parameters.
This includes the 200 status, etc.
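A hedged sketch of generating one path entry from an operator's declared parameters; the dict shape follows Swagger 2.0, but the function and route layout are assumptions.

def swagger_path(namespace: str, command: str, required_params: list) -> dict:
    # Builds a Swagger 2.0 path entry, including the boilerplate 200 response.
    return {
        f"/operator/{namespace}/{command}": {
            "post": {
                "parameters": [
                    {"name": p, "in": "query", "required": True, "type": "string"}
                    for p in required_params
                ],
                "responses": {"200": {"description": "Operator ran successfully"}},
            }
        }
    }

print(swagger_path("table", "query", ["database_name", "table_name", "query_string"]))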
Operator that executes the contents of a Jupyter notebook with a papermill configuration.
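A minimal sketch of what the operator would wrap, using papermill's documented execute_notebook call; the paths and parameters are placeholders.

import papermill as pm

pm.execute_notebook(
    "notebooks/input.ipynb",   # placeholder source notebook
    "notebooks/output.ipynb",  # executed copy with outputs
    parameters={"database_name": "my_database", "table_name": "my_table"},
)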
As the first async scheduler. Involves building a mason Airflow operator.
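A hedged sketch of what a mason Airflow operator could look like; everything below other than BaseOperator itself is an assumption.

from airflow.models import BaseOperator

class MasonOperator(BaseOperator):
    # Hypothetical: wraps one mason operator run as an Airflow task.
    def __init__(self, namespace, command, parameters=None, **kwargs):
        super().__init__(**kwargs)
        self.namespace = namespace
        self.command = command
        self.parameters = parameters or {}

    def execute(self, context):
        # Placeholder: would invoke mason's run for this operator and config.
        self.log.info("running mason %s %s %s", self.namespace, self.command, self.parameters)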
The registry is currently just local files in ~/.mason. Want to add:
(1) SHA-style versioning
(2) more backend options than local file systems, for example a distributed registry
Right now some operators reference table_name and database_name.
I want to consolidate these into a SQLAlchemy-like connection string:
<SOURCE>://<DB_NAME>/<TABLE_NAME>
for example:
athena://database/table
s3://bucket/path...
glue://database/table
etc.
Also think about how to reference multiple such objects (lists for joins, for example).
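The proposed strings parse cleanly with the standard library; a sketch, assuming the SOURCE://DB_NAME/TABLE_NAME layout above:

from urllib.parse import urlparse

ref = urlparse("glue://my_database/my_table")
source = ref.scheme           # "glue"
database = ref.netloc         # "my_database"
table = ref.path.lstrip("/")  # "my_table"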
Now that we have more than just myself in the code, it seems appropriate to add tests using GitHub Actions.
Right now we refer to the S3 bucket as "database_name" and the S3 path as "table_name" when using S3 as a metastore, which is consistent but awkward. Would like to have parameter aliases so that they could be called "bucket", "path", etc.
These would likely be client-specific overrides.
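A minimal sketch of client-specific aliases; the mapping and helper function are assumptions.

# S3-specific aliases mapped back onto the canonical parameter names.
S3_ALIASES = {"bucket": "database_name", "path": "table_name"}

def resolve_aliases(params: dict, aliases: dict) -> dict:
    return {aliases.get(key, key): value for key, value in params.items()}

print(resolve_aliases({"bucket": "my-bucket", "path": "data/table"}, S3_ALIASES))
# {'database_name': 'my-bucket', 'table_name': 'data/table'}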
With Dask and Spark support initially.
Export Workflow ==
Query Operator
followed by
Format Operator
Need to think about asynchronicity of Query operator. Callback mechanism?
There is a lot of code in both mason and mason-dask focused on validating typed inputs from the API as well as from JSON schemas. It might be worth streamlining that code or separating it into its own repo (mason-validations).
This is the first step in the following very ambitious 3-step process:
Step 1.
Re-express as many mason operators in Calcite as possible. In terms of execution, this would mean those jobs all become QueryJob. Validate the Calcite SQL, so that when a job is serialized it sends across Calcite SQL (as opposed to Spark SQL, Hive, or PrestoSQL).
Step 2.
In mason-spark, use Coral to translate the Calcite to Spark SQL (RelNode -> Spark Catalyst). Use this to build mason-hive and mason-presto (and get a lot of operator support for free using Coral).
Step 3.
Mason operators and workflows are now a curated collection of Calcite SQL pipelines with some additional connective tissue that goes outside of what SQL (really should) express. Look into the effort to add Logica (a Datalog-based query language) as a view language for Coral. This could allow constraints to address the additional connective tissue (types for governance, authentication, IO formatting, workflow specification?). Then mason operators/workflows would be expressed completely within Datalog. This could be particularly interesting for expressing things like job fan-out and aggregation.
With all of the requirements being pinned (==), it can be quite hard to load mason into any type of shared env. Are all of the dependencies known to only work at the locked-in versions, or could the version requirements be opened up?
The ASCII DAG library is mostly abandoned, so we will have to maintain it. It's not a main area of concern, so we would like to pull it out of mason resources and publish it on PyPI.
Currently each operator supports just one engine of each type. Would like to have multiple engines per type, for example:
metastores: [hive, s3]
execution: [spark, dask]
and the ability to use multiple of them within an operator.