slaterb1 / rettle

24.0 24.0 1.0 185 KB

Rust based ETL inspired by "keras"

License: Apache License 2.0

Rust 100.00%

rettle's Issues

Evaluate feasibility of running Fill operations on a schedule

Overview

Some brew jobs would be better implemented as a recurring schedule. Currently, when the Brewery goes out of scope, it sends the Terminate command to all workers. Need to evaluate what happens if a Fill operation loops on a schedule: will the Brewery still go out of scope?
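
To make the question concrete, here is a minimal sketch of the scenario, not rettle's actual code. The `Message` enum, the single-worker `Brewery`, and `scheduled_fill` are all hypothetical stand-ins. It assumes the Terminate command is sent from a `Drop` implementation, as the description above suggests.

```rust
use std::sync::mpsc::{channel, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

// Hypothetical message type; the real enum in rettle may differ.
enum Message {
    Job(u32),
    Terminate,
}

// Hypothetical stand-in for the Brewery, reduced to a single worker.
struct Brewery {
    tx: Sender<Message>,
    worker: Option<JoinHandle<u32>>,
}

impl Brewery {
    fn new() -> Brewery {
        let (tx, rx) = channel();
        let worker = thread::spawn(move || {
            let mut processed = 0;
            // Handle jobs until Terminate arrives or the channel closes.
            while let Ok(Message::Job(_)) = rx.recv() {
                processed += 1;
            }
            processed
        });
        Brewery { tx, worker: Some(worker) }
    }
}

impl Drop for Brewery {
    fn drop(&mut self) {
        // Assumed behaviour: leaving scope sends Terminate and waits for the worker.
        let _ = self.tx.send(Message::Terminate);
        if let Some(handle) = self.worker.take() {
            let _ = handle.join();
        }
    }
}

// A Fill that runs `runs` times on a fixed interval. Because it borrows the
// Brewery, the Brewery cannot be dropped before this loop returns.
fn scheduled_fill(brewery: &Brewery, runs: u32, interval: Duration) -> u32 {
    let mut sent = 0;
    for i in 0..runs {
        if brewery.tx.send(Message::Job(i)).is_ok() {
            sent += 1;
        }
        thread::sleep(interval);
    }
    sent
}
```

In this sketch the scheduled loop blocks the scope that owns the Brewery, so Terminate is only sent after the schedule finishes. A Fill that spawns its own thread and returns immediately would not have that guarantee.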

Test ideas in htttea

Update components to take Vec<Tea> instead of Tea

Overview

To be able to batch process events and benefit from Rust's "zero-cost abstractions", the operations should work on arrays (Vec<Tea>). This will enable the use of .iter().map().collect().

The onus would still be on the developer, and the Fill operation (or the Brewery?) would specify the batch sizes.

Tasks

update all components to take Vec
investigate batches on Fill or Brewery.
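
A rough sketch of what batch-oriented components could look like. The `Tea` fields, `steep_batch`, and `fill_in_batches` are hypothetical names chosen for illustration. Here the batch size is decided on the Fill side, which is only one of the two options under investigation.

```rust
// Hypothetical Tea type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Tea {
    id: i32,
    name: String,
}

// A Steep-style step applied to a whole batch with iterator adapters.
fn steep_batch(batch: &[Tea]) -> Vec<Tea> {
    batch
        .iter()
        .map(|tea| Tea { id: tea.id, name: tea.name.to_uppercase() })
        .collect()
}

// A Fill-style step that splits its source into batches of `batch_size`.
// Note: `chunks` panics if `batch_size` is 0.
fn fill_in_batches(source: Vec<Tea>, batch_size: usize) -> Vec<Vec<Tea>> {
    source
        .chunks(batch_size)
        .map(|chunk| chunk.to_vec())
        .collect()
}
```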

Update Steep exec function to work with optional passed parameters

Overview

Currently Steep only accepts &Tea and a parameter. Need to investigate the best way to pass arbitrary arguments (e.g. additional instructions, file paths, the field to target for an update).

Research

generic function structure / wrapper

Results

There are a few mechanisms that could work:

create a wrapper that declares every possible param as an Option<> value, so callers only fill in the ones they need https://stackoverflow.com/questions/28951503/how-can-i-create-a-function-with-a-variable-number-of-arguments
create a new Argument trait that can be implemented on any struct of "arguments" to be passed to the function (Box<dyn Argument>)

Leaning towards the Argument trait method, with the exec functions accepting args as Option<>.

Tasks

research generic function structure
create Argument trait (could be empty, aside from as_any?)
update exec function to selectively pass arguments
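
A minimal sketch of the Argument trait approach under discussion. The `SteepArgs` struct, the `Tea` field, and the exact `exec` signature are assumptions for illustration, not the library's API. `as_any` lets the exec function downcast the boxed arguments back to the concrete struct it expects.

```rust
use std::any::Any;

// Hypothetical trait: empty apart from `as_any`, as proposed in the tasks.
trait Argument {
    fn as_any(&self) -> &dyn Any;
}

// Hypothetical arguments struct for one particular Steep step.
struct SteepArgs {
    increment: i32,
}

impl Argument for SteepArgs {
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Hypothetical Tea type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Tea {
    x: i32,
}

// The exec function accepts optional arguments and only uses them when
// they are present and of the expected concrete type.
fn exec(tea: &Tea, args: &Option<Box<dyn Argument>>) -> Tea {
    let mut new_tea = tea.clone();
    if let Some(boxed) = args {
        if let Some(steep_args) = boxed.as_any().downcast_ref::<SteepArgs>() {
            new_tea.x += steep_args.increment;
        }
    }
    new_tea
}
```

Passing `&None` leaves the Tea unchanged, so steps that need no arguments can share the same signature.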

Setup input to pull data and push to Brewer working pool

Overview

Currently brew() is called with a fixed brewer that processes the tea with make_tea(). Going forward, a channel needs to be set up so that any brewer in the brewer pool can pick up the request.

Tasks

setup channel to send input data to queue
setup brewer pool to pull and process requests
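
The two tasks above could be sketched roughly as follows, using only the standard library. `brew_with_pool`, the `Tea` field, and the doubling `make_tea` are hypothetical. The shared receiver behind a Mutex follows the usual `std::sync::mpsc` worker-pool pattern rather than anything confirmed about rettle's internals.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical Tea type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Tea {
    x: i32,
}

// Placeholder for the real processing step.
fn make_tea(tea: Tea) -> Tea {
    Tea { x: tea.x * 2 }
}

fn brew_with_pool(input: Vec<Tea>, workers: usize) -> Vec<Tea> {
    // Queue of incoming requests, shared by all brewers.
    let (job_tx, job_rx): (Sender<Tea>, Receiver<Tea>) = channel();
    // Channel used by brewers to hand back results.
    let (out_tx, out_rx) = channel();
    let job_rx = Arc::new(Mutex::new(job_rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let job_rx = Arc::clone(&job_rx);
            let out_tx = out_tx.clone();
            thread::spawn(move || loop {
                // Hold the lock only long enough to pull one request.
                let job = job_rx.lock().unwrap().recv();
                match job {
                    Ok(tea) => out_tx.send(make_tea(tea)).unwrap(),
                    Err(_) => break, // queue closed: no more input
                }
            })
        })
        .collect();

    for tea in input {
        job_tx.send(tea).unwrap();
    }
    // Closing the queue lets the brewers exit their loops.
    drop(job_tx);
    drop(out_tx);

    for handle in handles {
        handle.join().unwrap();
    }
    // Results arrive in completion order, so sort for a stable output.
    let mut results: Vec<Tea> = out_rx.iter().collect();
    results.sort_by_key(|tea| tea.x);
    results
}
```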

Create examples/main.rs file and minor cleanup

Overview

Currently all test examples are built and run from bin/main.rs. The released project will not have this file, so the code needs to be moved to examples/main.rs. In addition, some minor cleanup is needed to wrap up this project.

Tasks

move bin/main.rs to examples/main.rs
add documentation to all modules in project
update documentation to include links to other crates (placeholder for now until everything is open source + CI)
add LICENSE
add Contributing.md
check on changing Fn based traits to fn function references

Add Ingredient Skim

Overview

The Skim Ingredient represents a job that only removes fields from a Tea struct, or removes the entire Tea struct if certain conditions are met. The implementation logic would be the same as for Steep (other crates would have to supply the logic for matching the fields or data to remove).

Tasks

Add Ingredient to ingredient.rs
Add logic to make_tea in brewery.rs
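
As an illustration of the two Skim behaviours described above, here is a small sketch. It assumes a hypothetical `Tea` whose removable field is modelled as an `Option`, and a made-up drop condition (negative `id`). The real Ingredient would receive this logic from the calling crate.

```rust
// Hypothetical Tea type; the removable field is modelled as an Option.
#[derive(Debug, Clone, PartialEq)]
struct Tea {
    id: i32,
    name: Option<String>,
}

// Returns None to remove the whole Tea, or a copy with the field cleared.
fn skim(tea: &Tea) -> Option<Tea> {
    // Assumed condition for dropping the entire record.
    if tea.id < 0 {
        return None;
    }
    let mut skimmed = tea.clone();
    // "Removing" the field here means clearing it.
    skimmed.name = None;
    Some(skimmed)
}

// Applies skim to a batch, keeping only the surviving Teas.
fn skim_batch(batch: &[Tea]) -> Vec<Tea> {
    batch.iter().filter_map(skim).collect()
}
```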

[R&D] Investigate tokio Futures instead of channels for concurrency

Overview

The current implementation of the Brewery is based on channels, similar to the multithreaded server example in Chapter 20 of the Rust Book. It might be more efficient to use tokio's work-stealing architecture for processing the jobs, as there are fewer blocking interactions (i.e. locking the Mutex around rx to receive a message and run the passed closure for iterating over the shared recipe).

Note: This ticket has been archived because the current channel-based implementation achieves ~750ms avg on macbook and ~400ms avg on ubuntu. Switching everything over to Futures may add more complexity than its advantages are worth.

Research

complications with implementing Futures with tokio
benchmarking processing 1,000,000 Tea objects

Create Pour Crate that writes current data object to file as struct

Overview

Data struct management is difficult. The tea objects need to be defined every step of the way. This is not a problem after the Fill step, but if the incoming data from an HTTP request or DB table is large (many fields), it would be easier to evaluate what the data types are and output the struct to a file (serde_json's Value type could help), to be used when actually running the pipeline.

Example data from table

{
  "id": 1,
  "name": "test",
  "field1": 23.5,
  "field2": "2019-06-13",
  ...
  "field100": "some text"
}

struct written to file

use std::time::SystemTime;

// Generated from the example table row above
struct Tea_aAEjifo4ji3850FS {
  id: i32,
  name: String,
  field1: f32,
  field2: SystemTime,
  ...
  field100: String
}

Tasks

experiment with serde_json
create Pour Ingredient to accomplish above to be included with the library
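
To show the general shape of the idea without pulling in serde_json, here is a standard-library-only sketch. `FieldValue` is a hypothetical stand-in for a parsed JSON value, and `pour_struct` is a made-up helper name. It maps numbers to `i64`/`f64` and leaves date-like text as `String`, which is simpler than the `i32`/`f32`/`SystemTime` output shown in the example above.

```rust
// Hypothetical stand-in for a parsed JSON value (not serde_json's type).
enum FieldValue {
    Int(i64),
    Float(f64),
    Text(String),
}

// Picks a Rust type name for a sampled value.
fn rust_type(value: &FieldValue) -> &'static str {
    match value {
        FieldValue::Int(_) => "i64",
        FieldValue::Float(_) => "f64",
        FieldValue::Text(_) => "String",
    }
}

// Renders a struct definition as source text, one field per sampled column.
fn pour_struct(name: &str, fields: &[(&str, FieldValue)]) -> String {
    let mut out = format!("struct {} {{\n", name);
    for (field, value) in fields {
        out.push_str(&format!("    {}: {},\n", field, rust_type(value)));
    }
    out.push_str("}\n");
    out
}
```

The returned string could then be written to a file with `std::fs::write` and included in the pipeline crate.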

Add Transfuse Ingredient

Overview

Transfuse combines data from other sources. The original idea was to place it after a series of Fill source ingredients and combine those objects, but the complexity is too high and alignment of the data is not guaranteed. This Ingredient should instead use fields on the Tea struct to look up and fetch data from another system (or always pull in additional data that is stamped on all data structs) and add it to the current Tea.

Tasks

Add ingredient to ingredient.rs
Add logic to make_tea in brewery.rs
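
A minimal sketch of the lookup-style Transfuse described above. The `Tea` fields and the in-memory `HashMap` standing in for "another system" are assumptions for illustration only.

```rust
use std::collections::HashMap;

// Hypothetical Tea type: `region` is the field filled in by Transfuse.
#[derive(Debug, Clone, PartialEq)]
struct Tea {
    customer_id: i32,
    region: Option<String>,
}

// Uses a field on the Tea as the lookup key and adds the fetched data to a
// copy of the Tea. The HashMap stands in for an external system.
fn transfuse(tea: &Tea, lookup: &HashMap<i32, String>) -> Tea {
    let mut enriched = tea.clone();
    enriched.region = lookup.get(&tea.customer_id).cloned();
    enriched
}
```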

Update docs to explain that Tea needs to be overwritten when inherited

Overview

Tea structure needs to be simplified and only include the fields necessary. Need to consider creating Tea as a Trait which can be pulled into other libraries that manipulate data (or keep it a struct but give it a trait definition).

Successfully created Tea as trait, but need to investigate the following:

lifetimes of Box objects (Tea is now a Box holding teas with Trait: Tea)
adding new() to Tea Trait, error is that it cannot return Box<dyn Tea>
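
One possible way around the new() problem, sketched under assumptions: the `TextTea` struct and the `boxed_tea` helper are hypothetical. Marking the constructor `where Self: Sized` keeps it out of the trait object, so the trait can still be used as `Box<dyn Tea>`, and boxing happens in a separate generic helper.

```rust
use std::any::Any;

trait Tea {
    // `where Self: Sized` excludes `new` from the trait object, so the
    // trait stays usable as `dyn Tea`.
    fn new() -> Self
    where
        Self: Sized;
    fn as_any(&self) -> &dyn Any;
}

// Hypothetical concrete Tea for illustration only.
#[derive(Debug, Default, PartialEq)]
struct TextTea {
    x: i32,
}

impl Tea for TextTea {
    fn new() -> Self {
        TextTea::default()
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// Builds a concrete Tea and erases its type behind a Box.
fn boxed_tea<T: Tea + 'static>() -> Box<dyn Tea> {
    Box::new(T::new())
}
```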

Tasks

R&D

Box lifetimes: Did some research on Box lifetimes. Values such as 5 and "some string" have 'static lifetimes, and Box<dyn Tea> defaults to a 'static bound. That bound only means the boxed value holds no borrowed references that could expire, not that the heap memory lives for the entire length of the program. As the variables that own these values go out of scope, the memory is reclaimed via the Drop trait. Therefore, the memory concerns that I had surrounding 'static are not actually a problem.
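
The conclusion can be checked with a small experiment. This is a sketch with hypothetical names (`TrackedTea`, `brew_and_discard`): a counter records when the boxed value's Drop implementation runs.

```rust
use std::cell::Cell;
use std::rc::Rc;

trait Tea {
    fn id(&self) -> i32;
}

// Hypothetical Tea that counts how many times it has been dropped.
struct TrackedTea {
    id: i32,
    drops: Rc<Cell<u32>>,
}

impl Tea for TrackedTea {
    fn id(&self) -> i32 {
        self.id
    }
}

impl Drop for TrackedTea {
    fn drop(&mut self) {
        self.drops.set(self.drops.get() + 1);
    }
}

fn brew_and_discard(drops: &Rc<Cell<u32>>) -> i32 {
    // `Box<dyn Tea>` is shorthand for `Box<dyn Tea + 'static>`.
    let tea: Box<dyn Tea> = Box::new(TrackedTea { id: 7, drops: Rc::clone(drops) });
    tea.id()
    // `tea` goes out of scope here, so its heap allocation is freed.
}
```

After `brew_and_discard` returns, the counter reads 1, which shows the boxed Tea was dropped at the end of the function despite the 'static bound.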

[Tech Debt] Update brew method to take in data from all sources

Overview

The current test implementation only uses the first source in the Pot.sources Vec. This needs to be updated before release of library.

Tasks

update brew to handle multiple sources
~~add metadata object to Tea to know when and where the data was collected~~

The metadata piece is more complicated... I need to reconsider this in the future

Add `copy` function to tea to create exact copy of tea

Overview

The Steep operation needs to manipulate a copy of the &Tea reference passed in, so that it can pass back an edited Tea to be updated by the brewer. Currently Steep has to create an entirely new Tea object and define all fields, which will not scale going forward.

Future Considerations

evaluate changing tea objects that gain or lose fields (not the original Tea object copy)

Using clone gets around the issue of what fields to make copies of but more thought and work needs to go into defining the field structure of future Tea trait objects being manipulated (will open a ticket).

Tasks

~~- [ ] create copy trait on Tea to make exact copy of reference tea object passed~~

Deriving the Clone trait handled making a mutable copy of the tea object passed to be able to manipulate specific fields.
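
A short sketch of the resolution described above. The `Tea` fields and the body of `steep` are hypothetical; the point is that deriving Clone removes the need to list every field when only one changes.

```rust
// Hypothetical Tea type for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Tea {
    x: i32,
    str_val: String,
    y: bool,
}

// Steep clones the borrowed Tea and edits only the field it cares about.
fn steep(tea: &Tea) -> Tea {
    let mut new_tea = tea.clone();
    new_tea.x += 1;
    new_tea
}
```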

Create Ingredient crates

Overview

Before release, I would like to create the following crates to be used with the project:

Crates

[R&D] Investigate impact of changing Tea trait data object

Overview

One limitation of a strongly typed language, which needs to know the data structures at compile time, is that every transformation of the original data object that changes a field type or adds/removes a field on the struct needs to be defined as a separate struct. The intent of this ticket is to look into alternatives that get around this handicap or manage it sanely.

Research

possibility of creating a util tool that initializes all intermediary structs
methods for organizing location of all data struct transforms
look into other alternatives and data handling projects
investigate weakly typed Tea structs (add Trait to serde_json's Value type)
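
To illustrate the limitation itself, here is a sketch of a transformation that changes a field's type and therefore needs a second struct. `RawTea` and `PricedTea` are hypothetical. Using a `From` implementation is just one conventional way to keep each transform in a predictable place, not a decision recorded in this ticket.

```rust
// Hypothetical struct as it arrives from the Fill step.
#[derive(Debug, PartialEq)]
struct RawTea {
    id: i32,
    price: String,
}

// Hypothetical struct after a step that changes the type of `price`.
#[derive(Debug, PartialEq)]
struct PricedTea {
    id: i32,
    price_cents: Option<i64>,
}

// Each intermediate struct gets an explicit conversion from its predecessor.
impl From<RawTea> for PricedTea {
    fn from(raw: RawTea) -> Self {
        // Unparseable prices become None instead of failing the pipeline.
        let price_cents = raw
            .price
            .trim()
            .parse::<f64>()
            .ok()
            .map(|price| (price * 100.0).round() as i64);
        PricedTea { id: raw.id, price_cents }
    }
}
```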
