slaterb1 / rettle
Rust based ETL inspired by "keras"
License: Apache License 2.0
Some brew jobs will be better implemented as a running schedule. Currently, when the Brewery goes out of scope, it sends the Terminate command to all workers. Need to evaluate what happens if the Fill operation loops on a schedule: will the Brewery still go out of scope?
Test ideas in htttea
To be able to batch process events and benefit from Rust's zero-cost abstractions, the operations should occur on arrays. This will enable the usage of .iter().map().collect().
The onus would still be on the developer, and the Fill operation (or the Brewery?) would specify the batch sizes.
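A minimal sketch of the batching idea, assuming a hypothetical Tea struct and a per-item step passed as a plain fn reference. The names steep_batch and process_in_batches are made up for illustration and are not part of rettle:

```rust
/// Hypothetical stand-in for a user-defined Tea struct.
#[derive(Debug, Clone, PartialEq)]
struct Tea {
    id: i32,
    value: f32,
}

/// A per-item transformation, comparable to what a Steep step might do.
fn double_value(tea: &Tea) -> Tea {
    Tea { id: tea.id, value: tea.value * 2.0 }
}

/// Applies `step` to every Tea in one batch via an iterator chain.
fn steep_batch(batch: &[Tea], step: fn(&Tea) -> Tea) -> Vec<Tea> {
    batch.iter().map(step).collect()
}

/// Splits the input into batches of `batch_size` (must be non-zero,
/// since `chunks(0)` panics) and processes each batch in turn.
fn process_in_batches(all: &[Tea], batch_size: usize, step: fn(&Tea) -> Tea) -> Vec<Tea> {
    all.chunks(batch_size)
        .flat_map(|batch| steep_batch(batch, step))
        .collect()
}

fn main() {
    let teas: Vec<Tea> = (1..=5).map(|id| Tea { id, value: id as f32 }).collect();
    let out = process_in_batches(&teas, 2, double_value);
    println!("{:?}", out);
}
```

Here the batch size is a plain parameter, so either the Fill operation or the Brewery could own it.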
Currently Steep only accepts &Tea and a parameter. Need to investigate the best way to pass any arguments (i.e. additional instructions, file paths, field to target for update).
There are a few mechanisms that could work:
- Option<> values for the params, but specifies all params in wrapper: https://stackoverflow.com/questions/28951503/how-can-i-create-a-function-with-a-variable-number-of-arguments
- Argument trait that can be implemented on any struct of "arguments" to be passed to function (Box<dyn Argument>)
*Leaning towards the Argument trait method, and making the exec functions accept args as Option<>.
- (as_any?)
- exec function to selectively pass arguments
Currently brew() is called with a fixed brewer that processes the tea with make_tea(). Going forwards, a channel needs to be set up so that a brewer that is a member of the brewer pool can pick up the request.
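For the Steep argument question above, the Argument trait option could look roughly like the sketch below. The Tea struct, RenameArgs, and the rename function are hypothetical, and using as_any for downcasting is an assumption about how a step would recover its concrete argument type:

```rust
use std::any::Any;

/// Hypothetical marker trait for any struct of "arguments" handed to a step.
trait Argument {
    /// Exposes the concrete type so a step can downcast to the args it expects.
    fn as_any(&self) -> &dyn Any;
}

/// Hypothetical Tea struct used only for this example.
#[derive(Debug, Clone, PartialEq)]
struct Tea {
    id: i32,
    name: String,
}

/// Example argument struct carrying one extra instruction for the step.
struct RenameArgs {
    suffix: String,
}

impl Argument for RenameArgs {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

/// Steep-style function: args are optional, and ignored if they are
/// missing or not of the type this step understands.
fn rename(tea: &Tea, args: &Option<Box<dyn Argument>>) -> Tea {
    let mut new_tea = tea.clone();
    if let Some(boxed) = args {
        if let Some(rename_args) = boxed.as_any().downcast_ref::<RenameArgs>() {
            new_tea.name.push_str(&rename_args.suffix);
        }
    }
    new_tea
}

fn main() {
    let tea = Tea { id: 1, name: "earl".to_string() };
    let args: Option<Box<dyn Argument>> =
        Some(Box::new(RenameArgs { suffix: "_grey".to_string() }));
    println!("{:?}", rename(&tea, &args));
    println!("{:?}", rename(&tea, &None));
}
```

Every step keeps the same signature, and each one decides which argument struct it is willing to downcast to.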
Currently all test examples are built and run from bin/main.rs. The project on release will not have this file. The code here needs to be moved to examples/main.rs. In addition, some minor cleanup needs to happen to wrap up this project.
- bin/main.rs to examples/main.rs
- Fn based traits to fn function references
Skim Ingredient represents a job that only removes fields on a Tea struct, or removes the entire Tea struct if conditions are met. The logic for implementation would be the same as Steep (other crates would have to create the logic for matching the fields or data to remove).
- ingredient.rs
- make_tea in brewery.rs
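A rough sketch of what a Skim step could look like, under the assumption that "removing a field" is modelled by setting an Option field to None (a Rust struct cannot lose a field at runtime). The Tea struct, the skim function, and the negative-id condition are all hypothetical:

```rust
/// Hypothetical Tea struct; removable fields are modelled as Option.
#[derive(Debug, Clone, PartialEq)]
struct Tea {
    id: i32,
    name: Option<String>,
    score: Option<f32>,
}

/// Skim-style step: returns None to drop the whole Tea,
/// or Some(tea) with selected fields cleared.
fn skim(tea: &Tea) -> Option<Tea> {
    // Example condition for removing the entire Tea.
    if tea.id < 0 {
        return None;
    }
    let mut skimmed = tea.clone();
    // Example of removing a single field.
    skimmed.score = None;
    Some(skimmed)
}

/// Applies the skim step to a batch, keeping only the surviving teas.
fn skim_all(batch: &[Tea]) -> Vec<Tea> {
    batch.iter().filter_map(skim).collect()
}

fn main() {
    let batch = vec![
        Tea { id: 1, name: Some("a".to_string()), score: Some(1.0) },
        Tea { id: -1, name: None, score: None },
    ];
    println!("{:?}", skim_all(&batch));
}
```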
The current implementation of the Brewery is based on channels, similar to the Rust book Chapter 20 multithreaded server example. It might be more efficient to use tokio's work-stealing architecture for processing the jobs, as there are fewer blocking interactions (i.e. locking the Mutex around rx to receive a message and run the passed closure for iterating over the shared recipe).
Note: This ticket has been archived because the current channel-based implementation achieves ~750ms avg on a MacBook and ~400ms avg on Ubuntu. Switching everything over to Futures may be more complex than it is worth.
tokio
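For reference, the channel-based design described above can be sketched with std only, in the style of the Rust book's thread pool. This is not rettle's actual code: Message, Brewer, Brewery, and take_order are illustrative names, and the jobs here are plain closures rather than recipe steps:

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

type Job = Box<dyn FnOnce() + Send + 'static>;

enum Message {
    NewJob(Job),
    Terminate,
}

struct Brewer {
    thread: Option<thread::JoinHandle<()>>,
}

impl Brewer {
    fn new(rx: Arc<Mutex<mpsc::Receiver<Message>>>) -> Brewer {
        let thread = thread::spawn(move || loop {
            // The lock is held only while receiving; the guard is dropped
            // at the end of this statement, before the job runs.
            let message = rx.lock().unwrap().recv().unwrap();
            match message {
                Message::NewJob(job) => job(),
                Message::Terminate => break,
            }
        });
        Brewer { thread: Some(thread) }
    }
}

struct Brewery {
    brewers: Vec<Brewer>,
    tx: mpsc::Sender<Message>,
}

impl Brewery {
    fn new(size: usize) -> Brewery {
        let (tx, rx) = mpsc::channel();
        let rx = Arc::new(Mutex::new(rx));
        let brewers = (0..size).map(|_| Brewer::new(Arc::clone(&rx))).collect();
        Brewery { brewers, tx }
    }

    /// Queues a job; whichever brewer is free picks it up.
    fn take_order<F: FnOnce() + Send + 'static>(&self, f: F) {
        self.tx.send(Message::NewJob(Box::new(f))).unwrap();
    }
}

impl Drop for Brewery {
    fn drop(&mut self) {
        // One Terminate per brewer, queued behind any pending jobs.
        for _ in &self.brewers {
            self.tx.send(Message::Terminate).unwrap();
        }
        for brewer in &mut self.brewers {
            if let Some(handle) = brewer.thread.take() {
                handle.join().unwrap();
            }
        }
    }
}

fn main() {
    let (done_tx, done_rx) = mpsc::channel();
    {
        let brewery = Brewery::new(4);
        for i in 0..8 {
            let done_tx = done_tx.clone();
            brewery.take_order(move || done_tx.send(i * 2).unwrap());
        }
    } // Brewery goes out of scope here: Terminate is sent and threads are joined.
    drop(done_tx);
    let mut results: Vec<i32> = done_rx.iter().collect();
    results.sort();
    println!("{:?}", results);
}
```

The blocking point mentioned in the ticket is the rx.lock() call: only one brewer at a time can wait on the receiver.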
Data struct management is difficult. The tea objects need to be defined every step of the way. This is not a problem after the Fill step, but if the incoming data from an http request or db table is large (multiple fields), it would be easier to evaluate what the data types are and output the struct to a file (could use serde_json Value struct to help), to be used with the actual running of the pipeline.
{
"id": 1,
"name": "test",
"field1": 23.5,
"field2": "2019-06-13",
...
"field100": "some text"
}
use std::time::SystemTime;
struct Tea_aAEjifo4ji3850FS {
    id: i32,
    name: String,
    field1: f32,
    field2: SystemTime,
    ...
    field100: String
}
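A small sketch of how such a struct definition could be generated from a sample record. To stay std-only, the SampleValue enum below is a hypothetical stand-in for serde_json's Value, and generate_struct and rust_type are made-up helper names. Unlike the example above, this sketch does not try to detect dates: every string maps to String.

```rust
/// Hypothetical stand-in for a parsed JSON value (serde_json::Value in practice).
enum SampleValue {
    Int(i64),
    Float(f64),
    Text(String),
}

/// Picks a Rust type name for a sampled value.
fn rust_type(value: &SampleValue) -> &'static str {
    match value {
        SampleValue::Int(_) => "i64",
        SampleValue::Float(_) => "f64",
        SampleValue::Text(_) => "String",
    }
}

/// Renders a struct definition as source text, one field per sampled key.
fn generate_struct(name: &str, fields: &[(&str, SampleValue)]) -> String {
    let mut out = format!("struct {} {{\n", name);
    for (field, value) in fields {
        out.push_str(&format!("    {}: {},\n", field, rust_type(value)));
    }
    out.push_str("}\n");
    out
}

fn main() {
    let sample = vec![
        ("id", SampleValue::Int(1)),
        ("name", SampleValue::Text("test".to_string())),
        ("field1", SampleValue::Float(23.5)),
    ];
    // The generated text could be written to a file and used by the pipeline.
    print!("{}", generate_struct("TeaGenerated", &sample));
}
```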
Transfuse combines data from other sources. The original logic was that it would be placed after a series of Fill source ingredients and combine those objects, but the complexity is too high and alignment of data is not guaranteed. This Ingredient should instead use fields on the Tea struct to look up and fetch data from another system (or always pull in additional data that is stamped on all data structs) and add it to the current Tea.
- ingredient.rs
- make_tea in brewery.rs
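A minimal sketch of the lookup-based Transfuse described above, assuming the "other system" is represented by an in-memory HashMap. The Tea fields, the transfuse function, and the region data are hypothetical:

```rust
use std::collections::HashMap;

/// Hypothetical Tea struct: region_code is the lookup key,
/// region_name is the field filled in by the Transfuse step.
#[derive(Debug, Clone, PartialEq)]
struct Tea {
    id: i32,
    region_code: String,
    region_name: Option<String>,
}

/// Looks up extra data using a field on the Tea and adds it to a copy.
/// A missing key leaves the enriched field as None.
fn transfuse(tea: &Tea, regions: &HashMap<String, String>) -> Tea {
    let mut enriched = tea.clone();
    enriched.region_name = regions.get(&tea.region_code).cloned();
    enriched
}

fn main() {
    let mut regions = HashMap::new();
    regions.insert("eu".to_string(), "Europe".to_string());
    let tea = Tea { id: 1, region_code: "eu".to_string(), region_name: None };
    println!("{:?}", transfuse(&tea, &regions));
}
```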
Tea structure needs to be simplified and only include the fields necessary. Need to consider creating Tea as a trait which can be pulled into other libraries that manipulate data (or keep it a struct but give it a trait definition).
Successfully created Tea as a trait, but need to investigate the following:
- (Trait: Tea)
- Add new() to the Tea trait; the error is that it cannot return Box<dyn Tea>
Box lifetimes: Did some research on Box lifetimes. Values such as 5, "some string", and owned heap memory satisfy the 'static lifetime, meaning they are allowed to exist for the entire length of the program. BUT when the variables that contain these values go out of scope, the memory is reclaimed via the Drop trait. Therefore, the memory concerns that I had surrounding 'static are not actually a problem.
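A small sketch illustrating both points. The GreenTea type and the DROPS counter are hypothetical, and it is an assumption that the new() error came from object safety: a constructor returning Self normally has to be marked `where Self: Sized` before the trait can be used as Box<dyn Tea>.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Counts how many GreenTea values have been dropped.
static DROPS: AtomicUsize = AtomicUsize::new(0);

trait Tea {
    fn id(&self) -> i32;
    /// `where Self: Sized` keeps this constructor out of the trait object,
    /// so `dyn Tea` remains usable.
    fn new(id: i32) -> Self
    where
        Self: Sized;
}

struct GreenTea {
    id: i32,
}

impl Tea for GreenTea {
    fn id(&self) -> i32 {
        self.id
    }
    fn new(id: i32) -> Self {
        GreenTea { id }
    }
}

impl Drop for GreenTea {
    fn drop(&mut self) {
        DROPS.fetch_add(1, Ordering::SeqCst);
    }
}

/// `Box<dyn Tea>` here is shorthand for `Box<dyn Tea + 'static>`.
fn boxed_tea(id: i32) -> Box<dyn Tea> {
    Box::new(GreenTea::new(id))
}

fn main() {
    {
        let tea = boxed_tea(5);
        println!("tea id: {}", tea.id());
    } // `tea` goes out of scope: Drop runs and the heap memory is freed.
    println!("drops so far: {}", DROPS.load(Ordering::SeqCst));
}
```

The 'static bound only says the boxed value holds no short-lived borrows; it is still dropped as soon as its owner goes out of scope.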
The current test implementation only uses the first source in the Pot.sources Vec. This needs to be updated before release of the library.
The metadata piece is more complicated... I need to reconsider this in the future.
The Steep operation needs to manipulate a copy of the tea reference passed in, so that it can pass back an edited Tea to be updated by the brewer. Currently Steep has to create an entirely new Tea object and define all fields, which will not scale going forwards.
Using clone gets around the issue of which fields to make copies of, but more thought and work needs to go into defining the field structure of future Tea trait objects being manipulated (will open a ticket).
- [ ] create copy trait on Tea to make exact copy of reference tea object passed
Deriving the Clone trait handled making a mutable copy of the tea object passed, which makes it possible to manipulate specific fields.
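A minimal sketch of the Clone-based approach, with a hypothetical Tea struct and a made-up steep_heat step. Only the field being changed is named; everything else is carried over by clone():

```rust
/// Hypothetical Tea struct; deriving Clone gives an exact copy on demand.
#[derive(Debug, Clone, PartialEq)]
struct Tea {
    id: i32,
    name: String,
    temperature: f32,
}

/// Steep-style step: clones the referenced tea, edits one field,
/// and passes the edited copy back.
fn steep_heat(tea: &Tea) -> Tea {
    let mut new_tea = tea.clone();
    new_tea.temperature += 10.0;
    new_tea
}

fn main() {
    let original = Tea { id: 1, name: "oolong".to_string(), temperature: 80.0 };
    let steeped = steep_heat(&original);
    println!("{:?} -> {:?}", original, steeped);
}
```

Adding a field to Tea later does not require touching steep_heat, which is the scaling problem the ticket describes.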
Before release, I would like to create the following crates to be used with the project:
One limitation of having a strongly typed language that needs to know the data structures at compile time is that every transformation of the original data object that mutates a field type, or adds/removes a field on the struct, needs to be defined as a separate struct. The intent of this ticket is to look into alternatives to get around this handicap or manage it sanely...
- Tea structs (add Trait to serde_json value struct)