samtecspg / mason
Data Operator Framework
License: Apache License 2.0
Think about potential backends.
To avoid confusion, table-spanning operations should be moved out of the table namespace and into the database namespace. For example: merge, join, or list.
Currently only supports Athena and Dask:
https://github.com/samtecspg/mason/blob/master/mason/examples/operators/table/query/operator.yaml
As per this issue, there is still a gap between S3 and Airflow.
We can address this for now with documentation in mason explaining what we are doing (syncing local DAGs to S3), but that seems kind of unsavory. This would at least give users a plugin to use.
Table as a protocol would allow for expressions of things like S3Table vs. GlueTable, and similarly for path.
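A minimal sketch of the idea using typing.Protocol; the class shapes and method names here are assumptions, not mason's actual interfaces.

from typing import Protocol

class Table(Protocol):
    # Anything table-like exposes a fully qualified name (an assumed interface).
    def qualified_name(self) -> str: ...

class GlueTable:
    def __init__(self, database: str, table: str):
        self.database, self.table = database, table

    def qualified_name(self) -> str:
        return f"{self.database}.{self.table}"

class S3Table:
    def __init__(self, bucket: str, path: str):
        self.bucket, self.path = bucket, path

    def qualified_name(self) -> str:
        return f"s3://{self.bucket}/{self.path}"

def describe(t: Table) -> str:
    # Structural typing: both classes satisfy Table without inheriting from it.
    return t.qualified_name()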
For example, all table namespace operators should have required parameters
table_name
database_name
by default. Related to #46
Including company-specific buckets.
Make the mason-sample-data repo public (it only has NYC taxi data right now), and reference it in demos.
Currently you would have to run the REST equivalent of
mason config -s 4
mason operator table format ...
to run an operator by itself with a particular config_id. This creates issues with statefulness. I want to make it so that you can pass config_id as a parameter to the REST endpoint to avoid having to do this, as sketched below. Setting the config_id would then just set the "default" config_id. I will rename the REST API endpoint appropriately.
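A hedged sketch of what the per-request call could look like; the endpoint path and parameter names are assumptions, not the final API.

import requests

# Pass config_id per request rather than mutating server-side state first.
resp = requests.post(
    "http://localhost:5000/api/operator/table/format",  # hypothetical endpoint
    json={
        "config_id": "4",             # selects the config for this run only
        "database_name": "my_database",
        "table_name": "my_table",
    },
)
print(resp.json())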
Build demo shell file for 1.6.0
Dask support for the table query operator
Make Table a shell of what it is now, and delegate schema population to the execution engine (table.populate, or something to that extent).
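A hedged sketch of the delegation; infer_schema is an assumed engine method, not mason's current API.

class Table:
    # Table becomes a thin shell; the execution engine owns schema population.
    def __init__(self, name, database):
        self.name = name
        self.database = database
        self.schema = None

    def populate(self, engine):
        # engine is any execution engine exposing infer_schema(table) -- an assumption.
        self.schema = engine.infer_schema(self)
        return self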
Right now errors are collected into a single-line string. Either add line breaks, or collect them into an array so that the message resembles a log or a stack trace more than one continuous error string.
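A minimal sketch of the array approach; the class and method names are illustrative.

class ErrorCollector:
    def __init__(self):
        self.errors = []

    def add(self, message: str) -> None:
        self.errors.append(message)

    def message(self) -> str:
        # Joining with newlines makes the output read like a log, not one long string.
        return "\n".join(self.errors)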
Would like it if NotebookEnvironment read the boto config (~/.aws/credentials) when no config is specified. It would also be awesome if a profile within the credentials file could be specified.
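A sketch of the requested fallback using boto3's documented Session API; the NotebookEnvironment wiring around it is assumed.

import boto3

def default_session(profile=None):
    # boto3 reads ~/.aws/credentials on its own; profile_name picks a named section.
    if profile:
        return boto3.Session(profile_name=profile)
    return boto3.Session()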
Build the mason-dask repo and leverage it.
Make it so that run and run_async are the same again in operator definitions.
Make async an implicit property of a particular execution engine; maybe merge OperatorResponse and DelayedOperatorResponse.
Move things such as sampling files in infer into the local execution engine.
Start to work on actual asynchronous execution for operator runs
Would be nice to figure out how the mason->papermill part of the story will work. Still need to consider the various options here.
Suppose you wanted to have a workflow that did:
table list
followed by
join (for each item listed)
followed by
summarize (for the joined table)
Figure out how this would be handled in terms of the fan-out of the mason job scheduling and workflow handling.
+------------+     +-------------+     +-----------------+
|            |---->| table query |---->| table summarize |-----+
|            |     +-------------+     +-----------------+     |
|            |     +-------------+     +-----------------+     |
|            |---->| table query |---->| table summarize |-----+
| table list |     +-------------+     +-----------------+     |     +------------+     +--------------+
|            |                                                 +---->| table join |---->| table dedupe |
|            |     +-------------+     +-----------------+     |     +------------+     +--------------+
|            |---->| table query |---->| table summarize |-----+
|            |     +-------------+     +-----------------+     |
|            |     +-------------+     +-----------------+     |
|            |---->| table query |---->| table summarize |-----+
+------------+     +-------------+     +-----------------+
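A rough sketch of the fan-out in plain Python; run() is a hypothetical stand-in for whatever the scheduler ends up providing, not mason's actual API.

def run(command, **params):
    # Hypothetical stand-in: submit one operator run and return its result handle.
    print(f"scheduling: {command} {params}")
    return f"result of {command}"

tables = ["table_a", "table_b", "table_c"]  # pretend output of: table list
queries = [run("table query", table_name=t) for t in tables]          # fan out
summaries = [run("table summarize", table_name=q) for q in queries]   # per-branch step
joined = run("table join", tables=queries)                            # fan in
deduped = run("table dedupe", table_name=joined)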
A lot of code can be cleaned up if stdout is specified as an output path and set as the default for local synchronous operator runs (see: #60).
I want a way for engine models to be serialized and passed between various programming languages (Python -> Scala, for example) or libraries without having to depend on transpiling (like Jython).
Something like an Avro or Protobuf representation with client interpreters in each language.
Python -> Python is easy (just libraries on PyPI), but it handicaps the ability to scale beyond the Python ecosystem later on.
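For illustration, a language-neutral Avro record for one engine model, written as a Python dict; the record and field names are assumptions, not an agreed schema.

# A schema that both a Python producer and a Scala consumer could generate clients against.
EXECUTION_JOB_SCHEMA = {
    "type": "record",
    "name": "ExecutionJob",
    "namespace": "mason.engines",  # hypothetical namespace
    "fields": [
        {"name": "job_type", "type": "string"},
        {"name": "parameters", "type": {"type": "map", "values": "string"}},
    ],
}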
Details to come; basically, dedupe by a single key on a metastore table.
In the metadatabase namespace, the query operator would allow queries that span multiple distinct sources, i.e. federated queries. For example:
mason operator metadatabase query SELECT * from source1.table join source2.table on ID
where source1 is a database in Presto and source2 is an S3 bucket.
Would require doing: #15
Example of operators:
mason operator database query
This would allow you to do queries that span multiple tables, for example "join" queries.
Right now the 4 base engines (metastore, storage, execution, scheduler) are baked into the code; would like to generalize them.
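A minimal sketch of one way to generalize this: a registry keyed by engine type instead of four hard-coded slots. All names here are illustrative.

ENGINES = {}

def register_engine(engine_type: str):
    # Decorator: registers a client class under an arbitrary engine type.
    def wrap(cls):
        ENGINES.setdefault(engine_type, []).append(cls)
        return cls
    return wrap

@register_engine("metastore")
class GlueClient:
    pass

@register_engine("execution")
class DaskClient:
    pass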
Spark support for the table format operator
Baseline code for the format operator, irrespective of client implementation.
Need the ability to reference operators in a location without installing them, by specifying the operator home and just using them implicitly.
Spark support for the table query operator
Some aspects of the swagger.yml definitions for workflows and operators can be generated from the structure of the workflows and their parameters.
This includes the 200 status, etc.
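A hedged sketch of generating one path entry from an operator's declared parameters; the dict shape follows Swagger 2.0, but the function and route layout are assumptions.

def swagger_path(namespace: str, command: str, required_params: list) -> dict:
    # Builds a Swagger 2.0 path entry, including the boilerplate 200 response.
    return {
        f"/operator/{namespace}/{command}": {
            "post": {
                "parameters": [
                    {"name": p, "in": "query", "required": True, "type": "string"}
                    for p in required_params
                ],
                "responses": {"200": {"description": "Operator ran successfully"}},
            }
        }
    }

print(swagger_path("table", "query", ["database_name", "table_name", "query_string"]))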
Operator that executes the contents of a Jupyter notebook with a papermill configuration.
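A minimal sketch of what the operator would wrap, using papermill's documented execute_notebook call; the paths and parameters are placeholders.

import papermill as pm

pm.execute_notebook(
    "notebooks/input.ipynb",   # placeholder source notebook
    "notebooks/output.ipynb",  # executed copy with outputs
    parameters={"database_name": "my_database", "table_name": "my_table"},
)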
As the first async scheduler. Involves building a mason Airflow operator.
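A hedged sketch of what a mason Airflow operator could look like; everything below other than BaseOperator itself is an assumption.

from airflow.models import BaseOperator

class MasonOperator(BaseOperator):
    # Hypothetical: wraps one mason operator run as an Airflow task.
    def __init__(self, namespace, command, parameters=None, **kwargs):
        super().__init__(**kwargs)
        self.namespace = namespace
        self.command = command
        self.parameters = parameters or {}

    def execute(self, context):
        # Placeholder: would invoke mason's run for this operator and config.
        self.log.info("running mason %s %s %s", self.namespace, self.command, self.parameters)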
The registry is currently just local files in ~/.mason. Want to add:
(1) SHA-style versioning
(2) more backend options than local file systems, for example a distributed registry
Right now some operators reference table_name and database_name.
I want to consolidate these into a SQLAlchemy-like connection string:
<SOURCE>://<DB_NAME>/<TABLE_NAME>
for example:
athena://database/table
s3://bucket/path...
glue://database/table
etc.
Also think about how to reference multiple such objects (lists for joins, for example).
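The proposed strings parse cleanly with the standard library; a sketch, assuming the SOURCE://DB_NAME/TABLE_NAME layout above:

from urllib.parse import urlparse

ref = urlparse("glue://my_database/my_table")
source = ref.scheme           # "glue"
database = ref.netloc         # "my_database"
table = ref.path.lstrip("/")  # "my_table"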
Now that we have more than just myself in the code, it seems appropriate to add tests using GitHub Actions.
Right now we refer to the S3 bucket as "database_name" and the S3 path as "table_name" when using S3 as a metastore, which is consistent but awkward. Would like to have parameter aliases so that they could be called "bucket", "path", etc.
These would likely be client-specific overrides.
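A minimal sketch of client-specific aliases; the mapping and helper function are assumptions.

# S3-specific aliases mapped back onto the canonical parameter names.
S3_ALIASES = {"bucket": "database_name", "path": "table_name"}

def resolve_aliases(params: dict, aliases: dict) -> dict:
    return {aliases.get(key, key): value for key, value in params.items()}

print(resolve_aliases({"bucket": "my-bucket", "path": "data/table"}, S3_ALIASES))
# {'database_name': 'my-bucket', 'table_name': 'data/table'}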
With Dask and Spark support initially.
Export Workflow ==
Query Operator
followed by
Format Operator
Need to think about asynchronicity of Query operator. Callback mechanism?
There is a lot of code in both mason and mason-dask focused on validating typed inputs from the API as well as from JSON schemas. It might be worth streamlining that code or separating it into its own repo (mason-validations).
This is the first step in the following very ambitious 3-step process:
Step 1.
Re-express as many mason operators in Calcite as possible. In terms of execution, this would mean those jobs all become QueryJob. Validate the Calcite SQL, so that when a job is serialized it sends across Calcite SQL (as opposed to Spark SQL, Hive, or PrestoSQL).
Step 2.
In mason-spark, use Coral to translate the Calcite to Spark SQL (RelNode -> Spark Catalyst). Use this to build mason-hive and mason-presto (and get a lot of operator support for free using Coral).
Step 3.
Mason operators and workflows are now a curated collection of Calcite SQL pipelines with some additional connective tissue that goes outside of what SQL (really should) express. Look into the effort to add Logica (a Datalog-based query language) as a view language for Coral. This could allow constraints to address the additional connective tissue (types for governance, authentication, IO formatting, workflow specification?). Then mason operators/workflows would be expressed completely within Datalog. This could be particularly interesting for expressing things like job fan-out and aggregation.
With all of the requirements being pinned (==), it can be quite hard to load mason into any type of shared env. Are all of the dependencies known to only work at the locked-in versions, or could the version requirements be opened up?
The ASCII DAG library is mostly abandoned, so we will have to maintain it. It's not a main area of concern, so we would like to pull it out of mason resources and publish it on PyPI.
Currently each operator supports just one engine of each type. Would like to have multiple engines per type, for example:
metastores: [hive, s3]
execution: [spark, dask]
and the ability to use multiple of them within an operator.