dagworks-inc / hamilton Goto Github PK

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows, that encode lineage/tracing and metadata. Runs and scales everywhere python does.

Home Page: https://hamilton.dagworks.io/en/latest/

License: BSD 3-Clause Clear License

Shell 0.06% Python 30.17% HTML 0.04% JavaScript 0.18% CSS 0.03% TypeScript 6.68% Jupyter Notebook 62.80% Dockerfile 0.01% Jinja 0.04%

data-science python dag data-engineering dataframe etl etl-framework etl-pipeline feature-engineering featurization

hamilton's People

Contributors

Stargazers

Watchers

Forkers

charitykithaka blaisemuhirwa dataders jrcribb alejandrosuarez nitin-mane jdonaldson nanderoo richardsonjf ankur612 julienze joshcolts18 saksham294 charlpcronje styczynski st0v janhurst geezercodes franchiseboyz hal2001 elshize cortezitos williamlisci lmmx sweepai-dev nishanthvasudevan lucifercr07 raynorchavez anupjoseph azflin7 mhassan1a alti-tude kaffeegangster rajveer43 isdkz roelbertens anurag9455 flatflax saumzzz bryangalindo jojo10smith murali-pixel wentao-lu brunoscaglione amnairshadsipra 149189 flaviassantos scorpil swapdewalkar benhhack ishaan-jaff hvsesha vpraharsha03 mvandermeulen subhamc1 hurricanejin dorucioclea molcrafts jerod-estapa emekaokoli19 mfkiwl algoskynet alechewitt tyapochkin mahadih534 sahinasli ninoseki ego presteddy56 roy-kid nhuray mbrukman lawrenceelee weaviate-git-bot buggtb flavour jmanhype wmoreiraa dagworks-inc adaynu jrycw bustosalex1 id-2 viki-vki tonyshan jmbuhr chronojon adzcai munichpavel chilijung jamesarruda m4292007 renu2516 kemaleren jernejfrank noahridge ahmedshifa anush008 datechgeek summitsg-llc

hamilton's Issues

Auto generate pytest unit test stubs for hamilton functions

Issue by skrawcz
Wednesday Jul 20, 2022 at 21:19 GMT
Originally opened as stitchfix/hamilton#160

Is your feature request related to a problem? Please describe.
We should be able to bootstrap a unit test suite given a hamilton function module.

Most DS probably write functions first, and then think about tests. We should make bootstrapping unit tests easy.

Describe the solution you'd like

User writes Hamilton functions.
User uses a command line utility to generate pytest unit test stubs for a module.
feature_logic.py --> test_feature_logic.py
Within that module, then we should be able to determine which functions have tests and which do not, creating the test for them appropriately
def my_feature() --> def test_my_feature().
To start we probably don't want too many options -- perhaps a --dry-run argument that would list what would be created.

Describe alternatives you've considered
Not doing this.

Additional context
This would help the software engineering best practices story.
We could similarly use this approach to bulk add check_output decorators.

SQL adapters

Issue by elijahbenizzy
Thursday Sep 15, 2022 at 03:45 GMT
Originally opened as stitchfix/hamilton#197

See design here: https://docs.google.com/document/d/1-xVPX7PqyENOk31bXdcAVVPYSHaGMokeFiR85d_zoOo/edit.

Expose tags and function metadata to decorators

Issue by skrawcz
Wednesday Jun 22, 2022 at 21:42 GMT
Originally opened as stitchfix/hamilton#138

Is your feature request related to a problem? Please describe.
With tagging, we can annotate functions with metadata. It would be useful to allow decorators access to this, and other metadata accumulated.

E.g. tags could be used as a means to help inform what should happen if a check_output decorator runs a test and it fails. That is, if we standardize on tag keys, then decorators could assume them and make use of them.

Describe the solution you'd like
Enable decorators access to a context or some variable that would allow them to get at this information.

Describe alternatives you've considered
N/A

Additional context
Taken from the discussion with whylabs folks on what would be useful.

enable connecting model node to metric node

Issue by shellyjang
Wednesday Sep 01, 2021 at 00:30 GMT
Originally opened as stitchfix/hamilton#11

slack conversation

Current issue:

model_p_something (“model” node) and prob_something (“metric” node) being a complement pair,
create_database driver (and therefore its crawler) contain both nodes
simulate driver (and therefore its crawler) only contains the “metric” node and not the “model” node.
the simulate driver’s crawler can find the feature dependency of the metric node (presumably through the model coefficient configs)
the crawlers (and therefore the drivers) are unable to find the complementary connection between the model and metric nodes;

item 5 specifically means that a person needs to know the complementary pairs (= domain knowledge; or hard-coded somewhere?) instead of DAG containing this info. A complementary pair is identified via @model decorator.

@model(GLM, 'model_p_demand_manual_by_formerautoship')
def prob_demand_manual_existing_former_autoship() -> pd.Series:
    pass

We would like there to be a systematic mapping between the complementary pairs.

Explore Ibis Integration

Issue by elijahbenizzy
Friday Mar 18, 2022 at 21:49 GMT
Originally opened as stitchfix/hamilton#88

Is your feature request related to a problem? Please describe.
Ibis could happily replace pandas, and its more flexible (scalable, etc...)

Describe the solution you'd like
Ibis dataframes instead of pandas dataframes. Perhaps a plugin framework.

hamilton --init to get started

Issue by elijahbenizzy
Thursday Nov 24, 2022 at 00:11 GMT
Originally opened as stitchfix/hamilton#235

Is your feature request related to a problem? Please describe.
New folks might want to get started in an existing repo. New DS/college students could use hamilton to get started on a simple modeling project...

Describe the solution you'd like

hamilton init
# Creates a basic project structure with some functions + hamilton files

hamilton init --project=hello_world 
# Creates the hello_world example

hamilton init --project=recomendations_stack
# Creates the scaffolding for a rec-stack example

hamilton init --project=web-service
# Create[s the scaffolding for a flask app

hamilton init kaggle --kaggle-competition=...
# Maybe we could create a template from a kaggle competition?

Additional context
Messing around with dbt and they have this

Add ResultMixin implementations for Dask native types

Issue by skrawcz
Friday Feb 11, 2022 at 21:50 GMT
Originally opened as stitchfix/hamilton#75

Is your feature request related to a problem? Please describe.

We should implement useful implementations of:

class ResultMixin(object):
    """Base class housing the static function.

    Why a static function? That's because certain frameworks can only pickle a static function, not an entire
    object.
    """
    @staticmethod
    @abc.abstractmethod
    def build_result(**outputs: typing.Dict[str, typing.Any]) -> typing.Any:
        """This function builds the result given the computed values."""
        pass

for use with Dask. E.g. returning a Dask native array, dataframe, bag, etc. Currently the default is to return a pandas dataframe.

See the build_result function in DaskGraphAdapter for a reference point on how it could be used.

Describe the solution you'd like
These should probably be placed in the h_dask.py module for now. Otherwise open to naming.

Alternatively, we could include more options in DaskGraphAdapter. Open to thinking what way is the most user friendly solution going forward.

Additional context
The addition of these ResultMixins should enable a user who is using Dask, to not have to implement their own version,
instead they can use the ones that come with Hamilton.

Show pyspark dataframe support

Issue by skrawcz
Thursday Mar 10, 2022 at 07:53 GMT
Originally opened as stitchfix/hamilton#84

Is your feature request related to a problem? Please describe.
A common question we get, is does Hamilton support spark dataframes? The answer is yes, but it's not ideal at the moment, and we don't have a vanilla example to point to.

It's not ideal because joins are a bit of a pain -- you need to know the index to join on. In the pandas world, we got away with
this because everything had an index associated with it. In spark, you need to provide it, and know when to provide it.

Describe the solution you'd like
(1) Provide a vanilla pyspark example.
(2) Provide a pattern to show how to handle multiple spark data sources. Perhaps implement a graph adapter to do so.

Describe alternatives you've considered
N/A

Ability to data profile node outputs for creating data quality checks

Issue by skrawcz
Tuesday Aug 02, 2022 at 17:00 GMT
Originally opened as stitchfix/hamilton#165

Is your feature request related to a problem? Please describe.
Data profiling is a way to help bootstrap creating data quality checks.
Data profiling is also a way to facilitate data exploration, by providing summary statistics over data.

Describe the solution you'd like
A user should be able to profile their DAG, or a set of nodes, and get out some summary statistics.
Those statistics could then be used to bootstrap data quality, i.e. check_output(), decorators, but the output should be standalone.

Describe alternatives you've considered
Haven't considered many options. But there are a few libraries that do data profiling already.

Additional context
Systems like whylogs, great expectations, use profiling to help with the user experience.
Standalone libraries like https://github.com/capitalone/DataProfiler also exist.

stitchfix/hamilton#149 does a little to prototype in this area too.

Help bootstrap `check_output()` decorator.

Issue by skrawcz
Tuesday Aug 02, 2022 at 17:08 GMT
Originally opened as stitchfix/hamilton#166

Is your feature request related to a problem? Please describe.
Can we help users bootstrap the check_output() decorator?

Describe the solution you'd like
Setting up data quality is possible to do manually, but with some knowledge of the data, could be automated, or at least partially automated.

Idea:

give a data profile, can we generate the check_output() decorator to add?
could we also automatically update the code with it, rather than having the user cut and paste it?

Describe alternatives you've considered
Not doing this. But I think having a way to speed up adding it to a code base is a good idea.

Additional context
Related to #164 and #165.

functools.lru_cache() makes hamilton think a function is not a node

Issue by elijahbenizzy
Thursday Aug 11, 2022 at 21:34 GMT
Originally opened as stitchfix/hamilton#178

IMO this is actually just sloppiness on the implementation of lru_cache, not layerable. Need to verify 100% that this is the cause, but we should fix.

Current behavior

Stack Traces

(If applicable)

Screenshots

(If applicable)

Steps to replicate behavior

@functools.lru_cache(maxsize=None)
def config() -> Dict[str, Any]:
    return _load_config()

def foo(config: Dict[str, Any]):
    return config['foo']

  File "/Users/elijahbenizzy/dev/hamilton-os/hamilton/hamilton/driver.py", line 203, in visualize_execution
    self.validate_inputs(user_nodes, inputs)
  File "/Users/elijahbenizzy/dev/hamilton-os/hamilton/hamilton/driver.py", line 99, in validate_inputs
    raise ValueError(error_str)
ValueError: 2 errors encountered:
  Error: Required input config not provided for nodes: ['foo'].

Library & System Information

E.g. python version, hamilton library version, linux, etc.

Expected behavior

Additional context

Add any other context about the problem here.

Add documentation showing %%writefile in a notebook

Issue by skrawcz
Thursday Jul 14, 2022 at 17:01 GMT
Originally opened as stitchfix/hamilton#153

This is another way someone could develop hamilton in a notebook -
https://ipython.readthedocs.io/en/stable/interactive/magics.html#cellmagic-writefile

We should add some documentation/example around this.

[idea] Node fusing for speeding up execution on systems like Ray, Dask

Issue by skrawcz
Wednesday Aug 24, 2022 at 20:45 GMT
Originally opened as stitchfix/hamilton#188

Is your feature request related to a problem? Please describe.
For delegating to systems like Ray, it could make sense to "fuse" nodes together to reduce serialization costs.

Describe the solution you'd like
ideas:

We need some concept to augment the DAG.
We then need some pluggable way to change this logic. E.g. heuristics, vs some multi-pass logic.

Describe alternatives you've considered
Ideas:

Make people write larger functions. But we wouldn't do this, since it goes against Hamilton's ideals.
Have people tag functions that could be grouped -- seems like a good backdoor capability to have -- could work well with some more automated solution to override whatever it tries to do.

Additional context
I thought of this idea because people were complaining about Hamilton on Ray being slow.

Prototype integration with an LLVM or similar tech.

Issue by skrawcz
Tuesday Feb 01, 2022 at 00:05 GMT
Originally opened as stitchfix/hamilton#40

Here the following assumes "numba", but really we could replace "numba" with "jax", or any other framework that could optimize python code to execute faster.

Is your feature request related to a problem? Please describe.
Numba is a way to accelerate python functions. To use it, you annotate your python functions to be "compiled" with the jit. It then creates faster code from it.

Currently the speed up only materializes on the second invocation of a function -- the first time it compiles it. So to work with Hamilton, we'd have to compile ahead of time (AOT) if people only run a DAG once. Otherwise we could use the jit for DAGs that people execute over and over again.

Describe the solution you'd like
Two solutions:

Prototype the ability to compile a hamilton graph ahead of time with Numba. You could use how we get Hamilton to run on Dask as a starting point (TODO: link to code). See these numba docs for ahead of time compilation.
Prototype the ability to use the jit compiler with Numba. That way the first time someone runs execute things are compiled (no speed up), but the second time, things are lightning quick! See these docs.

Things to think about with prototype (1):

Since compiling a head of time requires types -- we might need some better way to specify them? Or perhaps we can have numba infer it?
The output of compilation is another set of python module(s) -- this is what we'd then want to use for computation.
What is therefore the correct order of operations? Build the function graph, compile it, then somehow build the graph again with the new functions (?), and use that for execution?
What are the limitations of this approach in terms of use cases, etc. We could limit to numpy and python primitive code only for instance.

Things to think about with prototype (2):

What use cases does this make sense for?
What are the limitations of this approach?

Describe alternatives you've considered
Haven't.

Additional context

[documentation] show how to get hamilton running on snowpark

Issue by skrawcz
Friday Dec 09, 2022 at 20:45 GMT
Originally opened as stitchfix/hamilton#242

We need to help people get up and running with Hamilton on snowpark.

Two artifacts to produce:

hamilton + dbt + snowpark
hamilton + snowpark

Enable a DAG to define a recommendation call

Issue by skrawcz
Tuesday Oct 11, 2022 at 01:05 GMT
Originally opened as stitchfix/hamilton#207

Is your feature request related to a problem? Please describe.
Hamilton is general purpose - could we use it to model a recommendation stack call?

If you'd like to discuss ideas please try them here - stitchfix/hamilton#206.

Requirements

User should be able to define a recommendation stack dataflow.
We should be able to optimize the dataflow/resolve placement of logic.
The driver should then know how to walk the DAG and resolve calling the right services, etc.

Things to think about
What is framework first, and what is an optional add-on, versus custom code a user must write.

Additional context
Twitter thread with @jakemannix that started this train of thought.

Better querying for available nodes

Issue by elijahbenizzy
Tuesday Jul 19, 2022 at 20:19 GMT
Originally opened as stitchfix/hamilton#159

We make it easy to attach metadata to nodes, but don't yet have a natural API to make it easy to query these. This could be useful if:

(1) You want to look up a set of nodes with specific tags for reporting purposes
(2) You want to look up nodes used in data quality operators (see the motivating use-case below)
(3) You want to run some subset of the DAG that relates to the way nodes are tagged.

Is your feature request related to a problem? Please describe.
When using DQ we have to query by tags, this is really ugly. E.G.

all_validator_variables = [
    var.name for var in dr.list_available_variables() if
    var.tags.get('hamilton.data_quality.contains_dq_results')]

We should be able to have some utility functions here.

Describe the solution you'd like
Some combo of the following:

dr.query(tag_match={...}, name_match=r"...", module_match=r"...")
hamilton_utils.get_dq_validators(...)

or something like that. Note this would be valuable for more than just data quality -- E.G. tagging by nodes in general.

Describe alternatives you've considered
See above

Additional context
Writing out gitbook docs...

Do we want to support iterators for data loading?

Issue by skrawcz
Monday Feb 07, 2022 at 17:25 GMT
Originally opened as stitchfix/hamilton#68

What?

If we want to chunk over data, a natural way to do that is via an iterator.

Example: Enable "input" functions to be iterators

def my_loading_funct(...) -> Iterator[pd.Series]:
     ...
     yield some_chunk

This is fraught with some edge cases. But could be a more natural way to chunk over large data sets? This perhaps requires a new driver -- as we'd want some next() type semantic logic on the output of execute...

Originally posted by @skrawcz in stitchfix/hamilton#43 (comment)

Things to think through whether this is something useful to have:

Where would we allow this? Only on "input" nodes?
How would we exercise them in a deterministic fashion? i.e. does execute() care? and we iterate over them until they're exhausted? Or does execute() only do one iteration?
How do we coordinate multiple inputs that are iterators? What if they're of different lengths?
How would we ensure people don't create a mess that's hard to debug?
Would this work for the distributed graph adapters?

Modifications to enable decorating functions from another module

Issue by skrawcz
Thursday Oct 27, 2022 at 17:06 GMT
Originally opened as stitchfix/hamilton#217

If the user wants to reuse hamilton functions without hamilton, this commit shows one way to do it -- and what would be required on Hamilton's side to support it.

Basically we'd need a convention. Not sure it's a good idea. Probably easier to instead have boilerplate code people use to try/except the import if hamilton does not exist and have the decorators be identity functions...

skrawcz included the following code: https://github.com/stitchfix/hamilton/pull/217/commits

Move away from using setup.py for packaging hamilton

Issue by skrawcz
Wednesday Oct 13, 2021 at 00:28 GMT
Originally opened as stitchfix/hamilton#12

what?

Python is moving away from setup.py and to setup.cfg and pyproject.toml.
See https://packaging.python.org/tutorials/packaging-projects/.

impact of not doing this

Maintenance burden/becomes harder to maintain.

impact of doing this change

We're ready for future python changes more easily -- and can build a more complex package if necessary(?).

Openlineage Adapter

Issue by skrawcz
Tuesday Jun 21, 2022 at 21:57 GMT
Originally opened as stitchfix/hamilton#137

Is your feature request related to a problem? Please describe.
Hamilton encodes a lot of metadata that lives in code. It also creates some at execution time. There are projects such as https://datahubproject.io/, https://openlineage.io/ that capture this metadata across a wide array of tooling to create a central view in a heterogenous environment. Hamilton should be able to emit metadata/executions information to them.

Describe the solution you'd like
A user should be able to specify whether their Hamilton DAG should emit metadata.
This should play nicely with graph adapters, e.g. spark, ray, dask.

This should use the post graph execution hook to emit open lineage information.

use case:

creating a data set and writing it somewhere. The adapter can then emit openlineage information about what was executed.

Prototype Lineage Analysis Tooling

Issue by skrawcz
Tuesday Feb 01, 2022 at 00:41 GMT
Originally opened as stitchfix/hamilton#42

Is your feature request related to a problem? Please describe.
Currently, when given a Hamilton DAG, we don't expose ways to ask questions about it.

E.g. For GDPR, Data providence, etc.

E.g.

What if I remove this input, what function(s) will I impact?
What uses some PII data and what is the surface area?
If someone requests to be forgotten, what data do I need to delete?
Who should I talk to when I want to make this change that impacts these functions ? (e.g. use git blame to surface function owner?)
What has changed about the DAG since these two commits?
Are there any cycles?
Are there clusters of disjoint nodes? If so, what are they, maybe I can delete them?
etc

Describe the solution you'd like
This could be a specific "driver class", or something added to the base driver.

Without an end user workflow in mind, it's a bit hard to specify the API.

Also, perhaps this would work well with #4 -- e.g. tagging what is PII, and what isn't?

Describe alternatives you've considered
N/A

Additional context
There are a lot of start ups and organizations trying to get a handle on their data and where it is used. Hamilton can help provide a way to get at this easily...

Provide a loose type checking adapter

Issue by skrawcz
Monday Aug 15, 2022 at 20:45 GMT
Originally opened as stitchfix/hamilton#181

Is your feature request related to a problem? Please describe.
Following on from conversation in stitchfix/hamilton#170 a user should be able to augment Hamilton's type checking to allow:

def bar_union(x: t.Union[int, pd.Series]) -> t.Union[int, pd.Series]:
    return x
def foo_bar(bar_union: int) -> int:
    return bar + 1

as well as

def foo_int(x: int) -> int:
    """foo int"""
    return x + 1


def foo_union(foo_int: AnyType) -> AnyType:
    """foo union"""
    return foo_int + 1


def foo_int2(foo_union: int) -> int:
    """foo int, taking foo_union input"""
    return foo_union + 5

Describe the solution you'd like
A user should use a framework supported graph adapter that enables this sort of looser type checking and provide that to the driver at DAG construction time.

Describe alternatives you've considered
User writes custom graph adapter that is not framework supported.

E.g. something like:

class LooseDAGTypeCheckPythonDataFrameGraphAdapter(base.SimplePythonDataFrameGraphAdapter):

    @staticmethod
    def check_node_type_equivalence(node_type: typing.Type, input_type: typing.Type) -> bool:
        """Essentially allows super set of inputs to go through to node."""
        if input_type == typing.Any:
            # assume it will work
            return True
        # if input is superset of what node expects, that's okay
        elif type_utils.custom_subclass_check(input_type, node_type):
            return True
        return False

Additional context
See stitchfix/hamilton#170

Prototype Compile Hamilton on to an Orchestration Framework

Issue by skrawcz
Wednesday Feb 02, 2022 at 01:35 GMT
Originally opened as stitchfix/hamilton#44

Is your feature request related to a problem? Please describe.
Another way to scale a Hamilton DAG is to break it up into stages and have some other orchestrator handle execution. Hamilton need not implement these functions itself -- it could just compile and delegate execution to these frameworks.

E.g. I have a Hamilton DAG, but I want to use my in house Metaflow system -- the user should be able to generate code to run on Metaflow.

Describe the solution you'd like
A prototype to show how you could go from a Hamilton DAG to a DAG/Pipeline of some orchestration framework.

E.g.:

You'd have to think through the flow to do this:
e.g. define Hamilton DAG -> Compile to X Framework -> Commit code -> Run code on Framework X

We should prototype at least two implementations and see how we'd need to structure the code to make it manageable to maintain.

Describe alternatives you've considered
Hamilton could implement something like these other orchestration frameworks do, but that seems like a heavy lift. Better to try compiling to an existing framework.

Additional context
N/A

Running Hamilton on Flyte

Issue by ramannanda9
Thursday Nov 17, 2022 at 00:25 GMT
Originally opened as stitchfix/hamilton#233

[Short description explaining the high-level reason for the pull request]

Changes

How I tested this

Notes

Checklist

PR has an informative and human-readable title (this will be pulled into the release notes)
Changes are limited to a single goal (no scope creep)
Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
Any change in functionality is tested
New functions are documented (with a description, list of inputs, and expected output)
Placeholder code is flagged / future TODOs are captured in comments
Project documentation has been updated if adding/changing functionality.

ramannanda9 included the following code: https://github.com/stitchfix/hamilton/pull/233/commits

Better documentation for available default validators

Issue by elijahbenizzy
Thursday Jul 14, 2022 at 23:18 GMT
Originally opened as stitchfix/hamilton#156

Is your feature request related to a problem? Please describe.
Currently you have to look at the code to figure out the arguments, and they won't be auto-completed by an IDE.

Describe the solution you'd like
Some sort of auto-generated docs, or at least a configurable list.

Describe alternatives you've considered
Could do it manually.

Some scaffolding

Issue by elijahbenizzy
Friday Jun 17, 2022 at 22:48 GMT
Originally opened as stitchfix/hamilton#135

[Short description explaining the high-level reason for the pull request]

Changes

Testing

Notes

Checklist

Testing checklist

Python - local testing

python 3.6
python 3.7

elijahbenizzy included the following code: https://github.com/stitchfix/hamilton/pull/135/commits

Add caching for hamilton

Issue by elijahbenizzy
Monday Oct 18, 2021 at 17:54 GMT
Originally opened as stitchfix/hamilton#17

The problem

We want to enable caching of functions and their downstream results.

Say we want to alter a function and rerun the entire DAG. The function that we want to alter runs late enough that we'd be redoing a significant amount of computation. While iterating could often be solved by executing individual nodes, its completely reasonable to iterate on the entire DAG.

In our internal use of Hamilton, we actually have a decorator called @cache that runs entirely separate from Hamilton -- this allows us to cache the results of individual functions. This decorator uses (a) the code a function runs and (b) the hash of the parameters. That said, its not foolproof -- changes in external libraries referenced within functions can get ignored, its not DAG-aware (it doesn't care about downstream functions you might also not want to rerun), and it depends on hashability of parameters.

I envision this as useful for:

Rerunning/iterating on a DAG locally
Running expensive DAGs in production ETLs that all use the same cache but change minimal parts

Some options

Automatically cache functions, have a clear_cache or use_cache in the execute function
Use the @cache decorator we have internally
Manage the cache externally -- pass in as an override to the driver's execute function. Then have a method on the driver to manipulate the cache as needed.

I'm partial to (1) although we need to make it visible to the user and easy to override. E.G. to mark things as changed. We could also have decorators that say dont_cache if needed.

Probably a few other things we can do -- welcome feedback! Might want to think about making it pluggable -- saving to disk is nice, but saving it to a backing store could be even nicer. Shouldn't get locked in.

Slightly leaky abstraction -- figure out the best way to extend this

Issue by elijahbenizzy
Wednesday May 12, 2021 at 18:29 GMT
Originally opened as stitchfix/hamilton#7

This is for an extension of hamilton called hamiltime. Currently it's perfectly abstracted away except for this enum. Let's figure out how to make it a completely separate product.

https://github.com/stitchfix/hamilton/blob/c86ddd2adcc6d5812c8d1c769e76e72c1b06a580/hamilton/node.py#L20

Some ideas:

Release hamilton with hamiltime
Have this be an extensible class that hamiltime can overwrite
Have the function_graph, not the node type know about the node sources
...

[prototype] NVMTabular support

Issue by skrawcz
Friday Jul 08, 2022 at 05:07 GMT
Originally opened as stitchfix/hamilton#150

Is your feature request related to a problem? Please describe.
NVMTabular is a way to write ETL feature code for nvidia GPUs. Is there a Hamilton way to help organize the workflow required to run NVMTabular?

Describe the solution you'd like
The goal of the prototype is to figure out how Hamilton could be best used.

Describe alternatives you've considered
N/A

Additional context
E.g. what would the Hamilton version of https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/getting-started-movielens/02-ETL-with-NVTabular.ipynb be?

Reusable subDAG components

Issue by elijahbenizzy
Saturday Mar 12, 2022 at 00:24 GMT
Originally opened as stitchfix/hamilton#86

Is your feature request related to a problem? Please describe.
Reusing functions, helpers, etc... are all nice and good. However, sometimes you want to be able to reuse large subcomponents in the DAG.

Describe the solution you'd like

Two idea:

Use the Driver to stich bigger DAGs together
Two decorator (still need to fully think this through)

This uses prefixes, but some actual namespace notion could be nice here.

@hamilton.subdag_split(
    inputs={
        'subdag_1' : {'source' : 'source_for_dag_1'}, # subdag 1 gets its source from a different place than subdag 1
        'subdag_2 : {'source' : 'source_for_dag_2'}}} 
def data(source: str) -> pd.DataFrame:
    return _load_data(source)

def foo(data) -> pd.Series:
    return _do_something(data)

@hamilton.subdag_join(subdag='subdag_1')
def process_subdag_1(foo) -> pd.Series:
    return _process_some_way(foo)

The framework would then compile this in a fancy way -- ensuring that every node between the splits and the joins is turned into one for each subdag, under a separate namespace. TBD on how to access it.

Describe alternatives you've considered
Not allowing this -- I don't have a concrete use-case blocking for anyone but we have observed one at Stitch Fix.

Additional context
Thinking ahead here.

Add testing for thread safety of driver

Issue by elijahbenizzy
Friday Nov 18, 2022 at 21:56 GMT
Originally opened as stitchfix/hamilton#234

Is your feature request related to a problem? Please describe.
The driver should be thread-safe, but we want a test to ensure

Describe the solution you'd like
A unit/integration test that runs a bunch in parallel with the same drier.

Additional context
Add any other context or screenshots about the feature request here.
https://hamilton-opensource.slack.com/archives/C03M33QB4M8/p1668807849626519

Lazy config evaluation

Issue by elijahbenizzy
Monday Feb 07, 2022 at 03:51 GMT
Originally opened as stitchfix/hamilton#64

[Short description explaining the high-level reason for the pull request]

Additions

Removals

Changes

Testing

Screenshots

If applicable

Notes

Todos

Checklist

Testing checklist

Python

python 3.6
python 3.7

elijahbenizzy included the following code: https://github.com/stitchfix/hamilton/pull/64/commits

Modin integration

Issue by skrawcz
Thursday Mar 10, 2022 at 19:09 GMT
Originally opened as stitchfix/hamilton#85

Is your feature request related to a problem? Please describe.
Modin - https://github.com/modin-project/modin - also enables scaling pandas computation. Since we have ray, dask, and koalas, why not add Modin?

Describe the solution you'd like
Modin requires a replacement of the pandas import in user code to work.
We would need to think how to do this:

Do we get people to import "pandas" from hamilton, and we can then control which pandas is actually imported?
Do we require users then to assume modin, by changing the pandas import themselves when defining their hamilton python functions?
Or is there some other way to integrate? E.g. a graph adapter

Additional context
N/A

[Internal] Decouple node function from inputs

Issue by elijahbenizzy
Wednesday Sep 28, 2022 at 00:22 GMT
Originally opened as stitchfix/hamilton#201

Currently a node takes in a set of inputs -- these correspond both to (a) the names of the dependencies and (b) the parameters in the function.

To illustrate:

In [1]: def test(a: int, b: int) -> int:
   ...:     return a + b
   ...:

In [2]: from hamilton import node

In [3]: node.Node.from_fn(test).input_types
Out[3]:
{'a': (int, <DependencyType.REQUIRED: 1>),
 'b': (int, <DependencyType.REQUIRED: 1>)}

This works well in the standard case. However, with parameterizing sources it kinda breaks down:

In [4]: from hamilton.function_modifiers import source, value
function_modifiers.parameterize(foo={'a' : source('c'), 'b': source('c')})(test)

In [5]: from hamilton.function_modifiers.base import resolve_nodes
In [6]: resolve_nodes(test, {})[0].input_types
Out[6]: {'c': (int, <DependencyType.REQUIRED: 1>)}

In [18]: resolve_nodes(test, {})[0](c=1)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [18], in <cell line: 1>()
----> 1 resolve_nodes(test, {})[0](c=1)

File ~/dev/dagworks/hamilton/hamilton/node.py:174, in Node.__call__(self, *args, **kwargs)
    172 def __call__(self, *args, **kwargs):
    173     """Call just delegates to the callable, purely for clean syntactic sugar"""
--> 174     return self.callable(*args, **kwargs)

File ~/dev/dagworks/hamilton/hamilton/function_modifiers/expanders.py:94, in parameterize.expand_node.<locals>.replacement_function(upstream_dependencies, literal_dependencies, *args, **kwargs)
     92 kwargs = kwargs.copy()
     93 for dependency, replacement in upstream_dependencies.items():
---> 94     kwargs[dependency] = kwargs.pop(replacement.source)
     95 for dependency, replacement in literal_dependencies.items():
     96     kwargs[dependency] = replacement.value

KeyError: 'c'

The obvious solution (that I'm taking in reuse_subdag is to create an identity node. But, let's face it, that's fairly clunky. Instead I propose we specify a mapping inside node. So, we have a field param_mapping in Node that gives the mapping of internal parameters to external ones. This defaults to an identity, but we can then use it to simplify a lot of more complex decorators.

Data quality next plans POC

Issue by elijahbenizzy
Monday Jul 04, 2022 at 22:38 GMT
Originally opened as stitchfix/hamilton#149

OK so this is a pure proof of concept. Not necessarily the right way to do things, and not tested. That said, I wanted to prove the following:

That we could build a two-step data quality pass (E.G. with a profiler and a validator). This will quickly be a whylogs blocker.
That we can use config to enable/disable items at run/compile time.
That we can add an applies_to keyword to narrow focus of data quality.

(1) is useful for integrations with complex stuff -- E.G. an expensive profiling step with lots of validations.
(2) is useful for disabling -- this will probably be the first we release.
(3) is useful for extract_columns -- it now makes it clear what it applies to.

While some of this code still has placeholders and isn't tested, it demonstrates feasible solutions, and de-risks the release of data quality enough to make me comfortable.

Look through commits for more explanations.

Changes

Testing

Notes

Checklist

Testing checklist

Python - local testing

python 3.6
python 3.7

elijahbenizzy included the following code: https://github.com/stitchfix/hamilton/pull/149/commits

Configurable data quality checks

Issue by skrawcz
Tuesday Aug 02, 2022 at 16:54 GMT
Originally opened as stitchfix/hamilton#164

Is your feature request related to a problem? Please describe.
We should be able to configure whether we want data quality to run or not at DAG build/run time.

Describe the solution you'd like
Allowing for fine-grained control over the behavior of specific checks. This will enable the user to turn on and off certain checks at runtime.

i.e. some configuration that a user can adjust/modify via JSON or YAML or both...

Describe alternatives you've considered
N/A

Additional context
stitchfix/hamilton#149 prototypes some approaches here.

Clarify behavior of decorator ordering

Issue by skrawcz
Sunday Dec 18, 2022 at 01:28 GMT
Originally opened as stitchfix/hamilton#249

We need to make clear our philosophy and resolution method for functions such as:

@extract_fields({'out_value1': int, 'out_value2': str})
@tag(test_key="test-value")
@check_output(data_type=dict, importance="fail")
@does(_dummy)
def uber_decorated_function(in_value1: int, in_value2: str) -> dict:
    pass

Right now it is not clear, nor obvious.

Current behavior

This is what the graph looks like:

So it would be unexpected to see check_output over the output of extract_fields.

Steps to replicate behavior

Function code:

def _dummy(**values) -> dict:
    return {f"out_{k.split('_')[1]}": v for k, v in values.items()}


@extract_fields({'out_value1': int, 'out_value2': str})
@tag(test_key="test-value")
@check_output(data_type=dict, importance="fail")
@does(_dummy)
def uber_decorated_function(in_value1: int, in_value2: str) -> dict:
    pass

Expected behavior

check_output should probably operate over what's directly underneath that.
tag similarly should apply to all? or just what's underneath?
does should apply to uber_decorated_function
extract_fields is the last thing that's applied?

Additional context

Thoughts: can we create a linter that reorders decorators?

Runtime-based DAG structure

Issue by elijahbenizzy
Tuesday Oct 11, 2022 at 16:54 GMT
Originally opened as stitchfix/hamilton#208

Is your feature request related to a problem? Please describe.
Based on a conversation with @bmritz, more for non-bulk computations.

The problem is this: say you have a computation that depends on the result of an upstream node. E.G. whether that value was supplied/exists, or some property on top of that value (if odd do x, if even do y). Currently under Hamilton this might look like:

def foo(bar: int=None, baz: int=None) -> int:
    if bar is None:
        return _x(baz)
    if baz is None:
        return _y(bar)
    raise ValueError("must supply either bar *or* baz")

and in the next case:

def foo(bar: int):
    if bar % 2 == 0:
        return _x(bar)
    return _y(bar)

This, however, is fairly ugly -- as the dependencies get more complex you have a bunch of chains. and you end up dealing with nodes that may or may not exist -- Hamilton has no notion of this, so we just chain through Nones. While this works in simpler cases just fine, it is far from ideal.

Describe the solution you'd like
Two ideas:

Having a compile-time or-gate
Wrapping the approaches above in syntactic sugar

Either way, the API could be somewhat similar:

@dynamic.when_exists('bar')
def foo(bar: int) -> int:
    return _x(bar)

@dynamic.when_exists('baz')
def foo(baz: int) -> int:
    return _y(baz)

Describe alternatives you've considered
The above two are reasonable options. Some other ideas:

Treat the subdag as a node and dynamically execute/configure it on a per-run basis
Run it in a separate driver and not have it supported
Continue with the available

See datajet for inspiration.

Additional context
From a slack conversation with @bmritz.

Dask map_partitions node

Issue by elshize
Monday Jun 27, 2022 at 16:19 GMT
Originally opened as stitchfix/hamilton#143

ℹ️ This is in response to a discussion with @elijahbenizzy on discord.

This is an example of a use case that could be supported with some additional decorators.

The usecase is that a node takes one Dask data frame and possibly some other arguments (either Pandas data frames or scalars). Then, the node simply executes map_partitions on the dask input, broadcasting the remaining arguments.

For example:

def node(a: dd.DataFrame, b: pd.DataFrame, c: int):
    return a.map_partitions(_node, b, c, align_dataframes=False)

def _node(a: pd.DataFrame, b: pd.DataFrame, c:int):
    # actual logic for each partition `a`

could be:

# this probably should be designed better, I just want to give you an idea
@dask_partition("a")
def node(a: pd.DataFrame, b: pd.DataFrame, c: int):
    # logic

So that it could run that map_partitions automatically. It might need (at least optional) parameter to add meta to the function call.

Not entirely sure if there are some disadvantages of that, and what the value of it would be in general, but I wanted to document what we discussed.

Capturing wide to long transformations of entire dataframes using Hamilton

Issue by latlan1
Thursday Feb 03, 2022 at 01:17 GMT
Originally opened as stitchfix/hamilton#46

Is your feature request related to a problem? Please describe.
Is there a way to transform a dataframe from wide to long (or vice versa) using Hamilton to track this transformation? I concluded no since all of the input/output columns would need to be specified, which could be a lot of typing.

Describe the solution you'd like
It would nice if I could define a function that accepts df_wide and outputs df_long with pd.melt.

Describe alternatives you've considered
I performed the melt operation outside of Hamilton so this operation is not directly captured through the DAG.

Create Hamilton converter for pandas code

Issue by skrawcz
Friday May 13, 2022 at 22:54 GMT
Originally opened as stitchfix/hamilton#132

Is your feature request related to a problem? Please describe.
With Hamilton you need need to restructure your code. This can be too much of a friction point for someone. Wouldn't it be nice if we had a way to help automate this step?

Describe the solution you'd like
We should be able to write some python code that parses the AST to covert code like:

df['a'] = df['b'] + df['c']

into

def a(b: pd.Series, c: pd.Series) -> pd.Series:
      return b + c

Core to this problem, is building code to parse python code and output/print hamilton functions. Once we have that, we can think about the places we could provide this, e.g. CLI, a website, some other means...

Describe alternatives you've considered
Not doing this.

Additional context
It would enable people to get up and running with Hamilton faster. E.g. if they provided a script, and we "walked" the script and guessed what should be output...

Support frozenset[...] generic

Issue by elijahbenizzy
Wednesday Aug 10, 2022 at 20:26 GMT
Originally opened as stitchfix/hamilton#175

Short description explaining the high-level reason for the new issue.

Current behavior

This breaks:

def foo() -> frozenset[str]:
    ...

def bar(foo: Set[str]) -> ...:
    ...

But it should work. That said, the reverse isn't true -- we can't pass a set when expecting a frozenset :/

Stack Traces

I've gotten a few:

    raise e
../hamilton/hamilton/driver.py:57: in __init__
    self.graph = graph.FunctionGraph(*modules, config=config, adapter=adapter)
../hamilton/hamilton/graph.py:194: in __init__
    self.nodes = create_function_graph(*modules, config=self._config, adapter=adapter)
../hamilton/hamilton/graph.py:124: in create_function_graph
    add_dependency(n, node_name, nodes, param_name, param_type, adapter)
../hamilton/hamilton/graph.py:89: in add_dependency
    if not types_match(adapter, param_type, required_node.type):
../hamilton/hamilton/graph.py:50: in types_match
    elif custom_subclass_check(required_node_type, param_type):
../hamilton/hamilton/type_utils.py:45: in custom_subclass_check
    return issubclass(requested_type, param_type)```

## Screenshots
(If applicable)


## Steps to replicate behavior
1.

## Library & System Information
E.g. python version, hamilton library version, linux, etc.


# Expected behavior


# Additional context
Add any other context about the problem here.

Tag-based dependencies

Issue by elijahbenizzy
Wednesday Nov 02, 2022 at 17:24 GMT
Originally opened as stitchfix/hamilton#226

From slack:

James Marvin
2:31 AM
Hi folks - wanted to discuss with you the merits of enabling a method of referring to nodes by their tags as opposed to by node name.
There are some instances in which we may want to process all nodes of a certain 'type' - for example, all metadata columns, or all engineered 'features'/derived columns. In the latter case in particular, it could be that there are dozens of columns of this 'type'.
My understanding is that to create a new node which accepts all nodes of a given type as input, we have to provide each input node name as a parameter to the new function:

@tag(type='metadata')
def create_some_metadata(input:pd.Series) -> pd.Series:
    return helpers._get_some_metadata(input)

@tag(type='metadata')
def create_some_more_metadata(current_time:datetime) -> pd.Series:
    return pd.Series(current_time)

def get_metadata_table(create_some_metadata:pd.Series, create_some_more_metadata:pd.Series) -> pd.DataFrame:
    return pd.DataFrame([create_some_metadata, create_some_more_metadata])

It could be useful - especially where we are creating a new function accepting a high number of nodes of the same type as input - to have some feature enabling us to refer to nodes by type, as opposed to by name.
Hopefully this example shows in principle what I mean:

@tag(type='metadata')
def create_some_metadata(input:pd.Series) -> pd.Series:
    return helpers._get_some_metadata(input)

@tag(type='metadata')
def create_some_more_metadata(current_time:datetime) -> pd.Series:
    return pd.Series(current_time)

@nodes_by_tag(type='metadata')
def get_metadata_table(**kwargs) -> pd.DataFrame:
    return pd.DataFrame(**kwargs)

In this example:
All nodes have been assigned the same 'type' using the @tag feature
Some method is supplied (in this case, a new decorator @nodes_by_tag) by which we're able to use to refer to all nodes of a given type in definition of a new node
The new node is able to perform some action based on the assumption that all nodes of a given type have been provided as an input - without having to refer to each node by name
What do you think? (edited)

Add pandas result builder that converts to long format

Issue by skrawcz
Monday Apr 25, 2022 at 21:53 GMT
Originally opened as stitchfix/hamilton#121

Is your feature request related to a problem? Please describe.
Hamilton works on "wide" columns -- not "long ones". However the "tidy" data ethos thinks data should be in a long format -- it does make some things easier to do.

Describe the solution you'd like
Add a ResultBuilder variant that takes in how you'd want to collapse the resulting pandas dataframe.

Describe alternatives you've considered
People do this manually -- but perhaps in the result builder makes more sense.

Additional context
Prerequisites for someone picking this up:

know Pandas.
know python.
can write the pandas code to go from wide to long.
can read the Hamilton code base to figure out where to add it.

Adds first pass example using movie example from metaflow

Issue by skrawcz
Thursday Dec 23, 2021 at 02:10 GMT
Originally opened as stitchfix/hamilton#32

Goal of this PR is to provide more extensive examples.

Inspiration is from:
https://github.com/Netflix/metaflow/blob/master/metaflow/tutorials/01-playlist/playlist.py

Two interesting things to note here:

We could have the driver do the loading of the CSV, or we could have a function to do it.
Filter operations are easiest to happen in the driver I think. Not to say you can't do it,
we'd just need better logic around creating the dataframe, and not trying to combine series
with various index lengths. Hmm.

Additions

Examples

Removals

Changes

Testing

The example works. Tested locally.

Todos

[] add more examples and see if this makes sense.

Checklist

Testing checklist

Python

python 3.6
python 3.7

skrawcz included the following code: https://github.com/stitchfix/hamilton/pull/32/commits

Create HOW TOs for integration with popular ETL frameworks/methodology

Issue by skrawcz
Thursday Nov 04, 2021 at 23:19 GMT
Originally opened as stitchfix/hamilton#22

Is your feature request related to a problem? Please describe.
Hamilton has a small footprint. It can be run inside existing ETLs very easily. We should have documentation to reflect
that fact to help users understand how simple it is to get it working.

Describe the solution you'd like
We should have examples/documentation to cover:

Metaflow
Airflow
Dagster
Flyte
Your custom scheduler

Describe alternatives you've considered
N/A

Additional context
We want people to be able to cut and paste code easily. Also having examples/documentation would help people size what it would look like to get Hamilton into their ETL.

Slightly leaky abstraction -- figure out the best way to extend this

Issue by elijahbenizzy
Wednesday May 12, 2021 at 18:29 GMT
Originally opened as stitchfix/hamilton#7

This is for an extension of hamilton called hamiltime. Currently it's perfectly abstracted away except for this enum. Let's figure out how to make it a completely separate product.

https://github.com/stitchfix/hamilton/blob/c86ddd2adcc6d5812c8d1c769e76e72c1b06a580/hamilton/node.py#L20

Some ideas:

Release hamilton with hamiltime
Have this be an extensible class that hamiltime can overwrite
Have the function_graph, not the node type know about the node sources
...

Ability to Profile nodes

Issue by chrisaddy
Monday May 02, 2022 at 18:36 GMT
Originally opened as stitchfix/hamilton#128

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Have found it useful to be able to profile code execution for purposes of debugging and profiling code for enhancing performance.

Describe the solution you'd like
A clear and concise description of what you want to happen.

A solution that I have implemented in the past has been to create a stateful decorator so that you can capture execution context. This would capture a list of calls with their function names and simple stats; while this was done with dataframes in mind, it can also be used on arbitrary functions and class methods, so extending to any sort of hamilton (callable) node should be straightforward. The original would look something like:

my_data.csv

profile = Tracer(**tracer_options)

@profile
def load_data(input_csv: str) -> pd.DataFrame:
    return pd.read_csv(input_csv)
    
@profile
def half_everything(load_data: pd.DataFrame) -> pd.DataFrame:
    return load_data / 2
    
if __name__ == "__main__":
    data = load_data("my_data.csv")
    transformed_data = half_everything(data)

print(profile)

[
    Trace(
         func=<function load_data at 0x7f9fc7ff8d30>,
         args=('my_data.csv', ),
         kwargs={},
         profile=Profile(
             cpu=Resource(start=3.9, end=4.4),
             memory=Resource(start=75.7, end=75.7)
          )
    ),
    Trace(
        func=<function transform at 0x7f9fd4782dd0>,
        args=(   a  b
             0  1  3
             1  2  4
             2  3  5, ),
        kwargs={},
        profile=Profile(
            cpu=Resource(start=0.0, end=3.8), 
            memory=Resource(start=75.7, end=75.7)
        )
    )
]

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Additional context
Add any other context or screenshots about the feature request here.

chatted with @elijahbenizzy , think it could be done either as a node extender similar to data quality, or as an adapter that wraps current adapter types

Utility for generating temporary, unique node names

Issue by elijahbenizzy
Tuesday Nov 15, 2022 at 15:41 GMT
Originally opened as stitchfix/hamilton#230

We have a lot of cases (coming up) in which we generate unique/temporary nodes in decorator/DAG construction.

E.G.

generating a node in the new parameterized and extract_columns combo decorator
generating static/pass-through nodes for the new reuse_functions decorator

And a few more that we already do but I honestly can't remember right now... Currently these have the potential of clashing with each other, but I think we can do this in a much cleaner way. Properties we want:

(1) unique
(2) readable
(3) stable between runs
(4) stable between DAG changes

TBD on implementation -- but I think a stable(ish) hash with a prefix and a low-digit number for collisions. If we toss readability a hash/uuid is fine.

dagworks-inc / hamilton Goto Github PK

hamilton's People

Contributors

Stargazers

Watchers

Forkers

hamilton's Issues

Current behavior

Stack Traces

Screenshots

Steps to replicate behavior

Library & System Information

Expected behavior

Additional context

What?

Example: Enable "input" functions to be iterators

Things to think through whether this is something useful to have:

what?

impact of not doing this

impact of doing this change

use case:

Changes

How I tested this

Notes

Checklist

Changes

Testing

Notes

Checklist

Testing checklist

Python - local testing

Additions

Removals

Changes

Testing

Screenshots

Notes

Todos

Checklist

Testing checklist

Python

Changes

Testing

Notes

Checklist

Testing checklist

Python - local testing

Current behavior

Steps to replicate behavior

Expected behavior

Additional context

Current behavior

Stack Traces

Additions

Removals

Changes

Testing

Todos

Checklist

Testing checklist

Python

Recommend Projects

Recommend Topics

Recommend Org