dagworks-inc / hamilton

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage and metadata. It runs and scales everywhere python does.

Home Page: https://hamilton.dagworks.io/en/latest/

License: Other

data-science python dag data-engineering dataframe etl etl-framework etl-pipeline feature-engineering featurization

hamilton's Introduction

Welcome to the official Hamilton Github Repository


Hamilton

The general-purpose micro-orchestration framework for building dataflows from python functions. Express data, ML, and LLM pipelines/workflows, as well as web requests, in a simple declarative manner.

Hamilton is a novel paradigm for specifying a flow of delayed execution in python. It works on python objects of any type and dataflows of any complexity. Core to the design of Hamilton is a clear mapping of function name to artifact, allowing you to quickly grok the relationship between the code you write and the data you produce.

This paradigm makes modifications easy to build and track, ensures code is self-documenting, and makes it natural to unit test your data transformations. When connected together, these functions form a Directed Acyclic Graph (DAG), which the Hamilton framework can execute, optimize, and report on.
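The core mechanic -- resolving each function's dependencies from its parameter names -- can be sketched in a few lines of plain python. This is a hypothetical illustration of the idea only, not Hamilton's actual implementation; `resolve` and the tiny example function are invented here:

```python
import inspect

def resolve(name, funcs, inputs, cache=None):
    """Resolve `name` by recursively computing its dependencies."""
    cache = {} if cache is None else cache
    if name in cache:
        return cache[name]
    if name in inputs:
        return inputs[name]
    fn = funcs[name]
    # Parameter names declare the upstream dependencies (the DAG edges).
    deps = inspect.signature(fn).parameters
    kwargs = {d: resolve(d, funcs, inputs, cache) for d in deps}
    cache[name] = fn(**kwargs)
    return cache[name]

def col_c(col_a: int, col_b: int) -> int:
    """Named after the artifact it produces: c = a + b."""
    return col_a + col_b

result = resolve("col_c", {"col_c": col_c}, {"col_a": 1, "col_b": 2})
print(result)  # 3
```

The function name is the node name, and the parameter names are the incoming edges; everything else falls out of a depth-first walk.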

Note: Hamilton describes DAGs. If you're looking for something to handle loops or conditional edges (say, for a human-in-the-loop application like a chatbot or agent), you might appreciate Burr -- it integrates well with any python library (including Hamilton!).

Problems Hamilton Solves

✅ Model a dataflow -- If you can model your problem as a DAG in python, Hamilton is the cleanest way to build it.
✅ Unmaintainable spaghetti code -- Hamilton dataflows are unit testable, self-documenting, and provide lineage.
✅ Long iteration/experimentation cycles -- Hamilton provides a clear, quick, and methodical path to debugging/modifying/extending your code.
✅ Reusing code across contexts -- Hamilton encourages code that is independent of infrastructure and can run regardless of execution setting.

Problems Hamilton Does not Solve

❌ Provisioning infrastructure -- you want a macro-orchestration system (see airflow, kubeflow, sagemaker, etc...).
❌ Doing your ML for you -- we organize your code, BYOL (bring your own libraries).
❌ Tracking execution + associated artifacts -- Hamilton is lightweight, but if this is important to you, see the DAGWorks product.

See the table below for more specifics/how it compares to other common tooling.

Full Feature Comparison

Here are common things that Hamilton is compared to, and how Hamilton compares to them.

Hamilton is compared along these dimensions against macro orchestration systems (e.g. Airflow), Feast, dbt, and Dask:

  • Python 3.8+ support
  • Helps you structure your code base
  • Code is always unit testable
  • Documentation friendly
  • Can visualize lineage easily
  • Is just a library
  • Runs anywhere python runs
  • Built for managing python transformations
  • Can model GenerativeAI/LLM based workflows
  • Replaces macro orchestration systems
  • Is a feature store
  • Can model transforms at row/column/object/dataset level

Getting Started

If you don't want to install anything to try Hamilton, we recommend trying www.tryhamilton.dev. Otherwise, here's a quick getting started guide to get you up and running in less than 15 minutes. If you need help, join our slack community to chat/ask Qs/etc. For the latest updates, follow us on twitter!

Installation

Requirements:

  • Python 3.8+

To get started, first install hamilton. It is published to PyPI under sf-hamilton:

pip install sf-hamilton

Note: to use the DAG visualization functionality, you should instead do:

pip install "sf-hamilton[visualization]"

While it is installing we encourage you to start on the next section.

Note: the content (i.e. names, function bodies) of our example code snippets are for illustrative purposes only, and don't reflect what we actually do internally.

Hamilton in <15 minutes

Hamilton is a new paradigm when it comes to creating, um, dataframes (let's use dataframes as an example, otherwise you can create ANY python object). Rather than thinking about manipulating a central dataframe, as is normal in some data engineering/data science work, you instead think about the column(s) you want to create, and what inputs are required. There is no need for you to think about maintaining this dataframe, meaning you do not need to think about any "glue" code; this is all taken care of by the Hamilton framework.

For example, rather than writing the following to manipulate a central dataframe object df:

df['col_c'] = df['col_a'] + df['col_b']

you write

def col_c(col_a: pd.Series, col_b: pd.Series) -> pd.Series:
    """Creating column c from summing column a and column b."""
    return col_a + col_b

In diagram form, the Hamilton framework will then be able to build a DAG from this function definition.

So let's create a "Hello World" and start using Hamilton!

Your first hello world.

By now, you should have installed Hamilton, so let's write some code.

  1. Create a file my_functions.py and add the following functions:
import pandas as pd

def avg_3wk_spend(spend: pd.Series) -> pd.Series:
    """Rolling 3 week average spend."""
    return spend.rolling(3).mean()

def spend_per_signup(spend: pd.Series, signups: pd.Series) -> pd.Series:
    """The cost per signup in relation to spend."""
    return spend / signups

The astute observer will notice we have not defined spend or signups as functions. That is okay; it just means these need to be provided as inputs when we actually want to create a dataframe.

Note: functions can take or create scalar values, in addition to any python object type.

  2. Create a my_script.py, which is where the code telling Hamilton what to do will live:
import sys
import logging
import importlib

import pandas as pd
from hamilton import driver

logging.basicConfig(stream=sys.stdout)
initial_columns = {  # load from actuals or wherever -- this is our initial data we use as input.
    # Note: these do not have to be all series, they could be scalar inputs.
    'signups': pd.Series([1, 10, 50, 100, 200, 400]),
    'spend': pd.Series([10, 10, 20, 40, 40, 50]),
}
# we need to tell hamilton where to load function definitions from
module_name = 'my_functions'
module = importlib.import_module(module_name) # or we could just do `import my_functions`
dr = driver.Driver(initial_columns, module)  # can pass in multiple modules
# we need to specify what we want in the final dataframe.
output_columns = [
    'spend',  # or module.spend
    'signups',  # or module.signups
    'avg_3wk_spend',  # or module.avg_3wk_spend
    'spend_per_signup',  # or module.spend_per_signup
]
# let's create the dataframe!
# if you only did `pip install sf-hamilton` earlier:
df = dr.execute(output_columns)
# else if you did `pip install "sf-hamilton[visualization]"` earlier:
# dr.visualize_execution(output_columns, './my-dag.dot', {})
print(df)
  3. Run my_script.py

python my_script.py

You should see the following output:

   spend  signups  avg_3wk_spend  spend_per_signup
0     10        1            NaN            10.000
1     10       10            NaN             1.000
2     20       50      13.333333             0.400
3     40      100      23.333333             0.400
4     40      200      33.333333             0.200
5     50      400      43.333333             0.125

You should see the following image if you ran dr.visualize_execution(output_columns, './my-dag.dot', {"format": "png"}, orient="TB"):

Note: we treat displaying Inputs in a special manner for readability in our visualizations, so you'll likely see input nodes repeated.

Congratulations -- you just created your first Hamilton dataflow and used it to produce a dataframe!

Example Hamilton Dataflows

We have a growing list of examples showcasing how one might use Hamilton. You currently have two places to find them:

  1. The Hamilton Dataflow Hub -- which makes it easy to pull and then modify code.
  2. The examples/ folder in this repository.

The Hub hosts user-contributed dataflows, e.g. text_summarization, forecasting, data processing, and is continually added to.

For the examples/ directory, you'll have to copy/fork the repository to run them.

We also have a docker container that contains some of these examples so you can pull that and run them locally. See the examples folder README for details.

We forked and lost some stars

This repository is maintained by the original creators of Hamilton, who have since founded DAGWorks Inc., a company largely dedicated to building and maintaining the Hamilton library. We decided to fork the original because Stitch Fix did not want to transfer ownership to us; we had grown the star count in the original repository to 893 before forking.

For the backstory on how Hamilton came about, see the original Stitch Fix blog post!

Slack Community

We have a small but active community on slack. Come join us!

License

Hamilton is released under the BSD 3-Clause Clear License.

Used internally by:

To add your company, make a pull request to add it here.

Contributing

We take contributions, large and small. We operate via a Code of Conduct and expect anyone contributing to do the same.

To see how you can contribute, please read our contributing guidelines and then our developer setup guide.

Blog Posts

Videos of talks

Watch the video

Citing Hamilton

We'd appreciate citing Hamilton by referencing one of the following:

@inproceedings{DBLP:conf/vldb/KrawczykI22,
  author    = {Stefan Krawczyk and Elijah ben Izzy},
  editor    = {Satyanarayana R. Valluri and Mohamed Za{\"{\i}}t},
  title     = {Hamilton: a modular open source declarative paradigm for high level
               modeling of dataflows},
  booktitle = {1st International Workshop on Composable Data Management Systems,
               CDMS@VLDB 2022, Sydney, Australia, September 9, 2022},
  year      = {2022},
  url       = {https://cdmsworkshop.github.io/2022/Proceedings/ShortPapers/Paper6\_StefanKrawczyk.pdf},
  timestamp = {Wed, 19 Oct 2022 16:20:48 +0200},
  biburl    = {https://dblp.org/rec/conf/vldb/KrawczykI22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

@inproceedings{CEURWS:conf/vldb/KrawczykIQ22,
  author    = {Stefan Krawczyk and Elijah ben Izzy and Danielle Quinn},
  editor    = {Cinzia Cappiello and Sandra Geisler and Maria-Esther Vidal},
  title     = {Hamilton: enabling software engineering best practices for data transformations via generalized dataflow graphs},
  booktitle = {1st International Workshop on Data Ecosystems co-located with 48th International Conference on Very Large Databases (VLDB 2022)},
  pages     = {41--50},
  url       = {https://ceur-ws.org/Vol-3306/paper5.pdf},
  year      = {2022}
}

🛣🗺 Roadmap / Things you can do with Hamilton

Hamilton is an ambitious project to provide a unified way to describe any dataflow, independent of where it runs. You can find currently supported integrations and the high-level roadmap below. Please reach out via slack or email (stefan / elijah at dagworks.io) to contribute or share feedback!

Object types:

  • Any python object type! E.g. Pandas, Spark dataframes, Dask dataframes, Ray datasets, Polars, dicts, lists, primitives, your custom objects, etc.

Workflows:

  • data processing
  • feature engineering
  • model training
  • LLM application workflows
  • all of them together

Data Quality

See the data quality docs.

  • Ability to define data quality checks on an object.
  • Pandera schema integration.
  • Custom object type validators.
  • Integration with other data quality libraries (e.g. Great Expectations, Deequ, whylogs, etc.)
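As a conceptual illustration of what an output check looks like, here is a hand-rolled sketch of a range-check decorator. Note this is hypothetical code invented for illustration (`check_range`, `conversion_rate`), not Hamilton's actual check_output API:

```python
import functools
import pandas as pd

def check_range(lower, upper, importance="warn"):
    """Sketch of an output-range data quality check (hypothetical,
    not Hamilton's actual check_output decorator)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            result = fn(*args, **kwargs)
            bad = result[(result < lower) | (result > upper)]
            if len(bad) > 0:
                msg = f"{fn.__name__}: {len(bad)} values outside [{lower}, {upper}]"
                if importance == "fail":
                    raise ValueError(msg)
                print(f"WARNING: {msg}")
            return result  # warn-level checks still pass the data through
        return wrapper
    return decorator

@check_range(0.0, 1.0, importance="warn")
def conversion_rate(signups: pd.Series, visits: pd.Series) -> pd.Series:
    return signups / visits

rates = conversion_rate(pd.Series([1, 5]), pd.Series([10, 2]))  # warns on 2.5
```

The "warn" vs "fail" importance split mirrors the general pattern: profile-style checks warn during exploration, then harden into hard failures in production.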

Online Monitoring

  • Open telemetry/tracing plugin.

Caching:

  • Checkpoint caching (e.g. save a function's result to disk, independent of input) - WIP.
  • Finer-grained caching (e.g. save a function's result to disk, dependent on input).
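The finer-grained idea -- recompute a function only when its inputs change -- can be sketched with an input-hash memoizer. This is a hypothetical illustration (`cache_on_inputs` is invented here), not Hamilton's caching implementation:

```python
import functools
import hashlib
import pickle

def cache_on_inputs(fn):
    """Sketch: cache a function's result keyed on a hash of its inputs,
    so it only recomputes when the inputs actually change."""
    store = {}
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        key = hashlib.sha256(
            pickle.dumps((args, sorted(kwargs.items())))
        ).hexdigest()
        if key not in store:
            store[key] = fn(*args, **kwargs)
        return store[key]
    return wrapper

calls = []

@cache_on_inputs
def expensive(x: int) -> int:
    calls.append(x)  # track real invocations to show the cache working
    return x * x

expensive(3); expensive(3); expensive(4)
print(calls)  # [3, 4] -- the repeated input was served from cache
```

A checkpoint-style cache is the degenerate case: key on the function name alone and persist `store` to disk.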

Execution:

  • Runs anywhere python runs. E.g. airflow, prefect, dagster, kubeflow, sagemaker, jupyter, fastAPI, snowpark, etc.

Backend integrations:

Specific integrations with other systems where we help you write code that runs on those systems.

Ray

  • Delegate function execution to Ray.
  • Function grouping (e.g. fuse multiple functions into a single Ray task)

Dask

  • Delegate function execution to Dask.
  • Function grouping (e.g. fuse multiple functions into a single Dask task)

Spark

  • Pandas on spark integration (via GraphAdapter)
  • PySpark native UDF map function integration (via GraphAdapter)
  • PySpark native aggregation function integration
  • PySpark join, filter, groupby, etc. integration

Snowpark

  • Packaging functions for Snowpark

LLVM & related

  • Numba integration

Custom Backends

  • Generate code to execute on a custom topology, e.g. microservices, etc.

Integrations with other systems/tools:

  • Generating Airflow | Prefect | Metaflow | Dagster | Kubeflow Pipelines | Sagemaker Pipelines | etc from Hamilton.
  • Plugins for common MLOps/DataOps tools: MLFlow, DBT, etc.

Dataflow/DAG Walking:

  • Depth first search traversal
  • Async function support via AsyncDriver
  • Parallel walk over a generator
  • Python multiprocessing execution (still in beta)
  • Python threading support
  • Grouping of nodes into tasks for efficient parallel computation
  • Breadth first search traversal
  • Sequential walk over a generator

DAG/Dataflow resolution:

  • At Driver instantiation time, using configuration/modules and @config.when.
  • With @resolve during Driver instantiation time.
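The idea behind config-driven resolution can be sketched in plain python: tag each candidate implementation with the configuration it applies to, then keep only the matching candidates at graph-construction time. This is a hypothetical sketch that mimics the spirit of @config.when, not Hamilton's implementation (`config_when`, `resolve_functions`, and the `tax_rate__*` functions are invented here):

```python
def config_when(**conditions):
    """Record the config conditions under which a function is included."""
    def decorator(fn):
        fn._config_when = conditions
        return fn
    return decorator

@config_when(region="EU")
def tax_rate__eu(price: float) -> float:
    return price * 0.2

@config_when(region="US")
def tax_rate__us(price: float) -> float:
    return price * 0.07

def resolve_functions(candidates, config):
    """Keep only functions whose conditions match the given config."""
    return [
        f for f in candidates
        if all(config.get(k) == v for k, v in f._config_when.items())
    ]

chosen = resolve_functions([tax_rate__eu, tax_rate__us], {"region": "EU"})
print(chosen[0].__name__)  # tax_rate__eu
```

Because resolution happens before execution, the resulting DAG contains exactly one implementation per node and stays acyclic.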

Prescribed Development Workflow

In general we prescribe the following:

  1. Ensure you understand Hamilton Basics.
  2. Familiarize yourself with some of the Hamilton decorators. They will help keep your code DRY.
  3. Start creating Hamilton Functions that represent your work. We suggest grouping them in modules where it makes sense.
  4. Write a simple script so that you can easily run things end to end.
  5. Join our Slack community to chat/ask Qs/etc.

For the backstory on Hamilton, we invite you to watch a roughly 9-minute lightning talk we gave at the apply conference: video, slides.

PyCharm Tips

If you're using Hamilton, it's likely that you'll need to migrate some code. Here are some useful tricks we found to speed up that process.

Live templates

Live templates are a cool feature and allow you to type in a name which expands into some code.

For example, we wrote one to make it quick to stub out Hamilton functions: typing graphfunc would expand into ->

def _(_: pd.Series) -> pd.Series:
   """"""
   return _

Where the blanks are where you can tab with the cursor and fill things in. See your pycharm preferences for setting this up.

Multiple Cursors

If you are doing a lot of repetitive work, consider multiple cursors. Multiple cursors allow you to edit multiple lines at once.

To use them, hit option + mouse click to create multiple cursors, and Esc to revert back to normal mode.

Usage analytics & data privacy

By default, Hamilton collects anonymous usage data to help us improve the library and decide where to apply development efforts.

We capture three types of events: one when the Driver object is instantiated, one when the execute() call on the Driver object completes, and one for most Driver object function invocations. No user data or potentially sensitive information is or ever will be collected. The captured data is limited to:

  • Operating System and Python version
  • A persistent UUID to identify the session, stored in ~/.hamilton.conf.
  • Error stack trace limited to Hamilton code, if one occurs.
  • Information on what features you're using from Hamilton: decorators, adapters, result builders.
  • How Hamilton is being used: number of final nodes in DAG, number of modules, size of objects passed to execute(), the name of the Driver function being invoked.

If you're worried, see telemetry.py for details.

If you do not wish to participate, you can opt out with one of the following methods:

  1. Set it to false programmatically in your code before creating a Hamilton driver:
    from hamilton import telemetry
    telemetry.disable_telemetry()
  2. Set the key telemetry_enabled to false in ~/.hamilton.conf under the DEFAULT section:
    [DEFAULT]
    telemetry_enabled = False
    
  3. Set HAMILTON_TELEMETRY_ENABLED=false as an environment variable. Either setting it for your shell session:
    export HAMILTON_TELEMETRY_ENABLED=false
    or passing it as part of the run command:
    HAMILTON_TELEMETRY_ENABLED=false python NAME_OF_MY_DRIVER.py

For the Hamilton UI, you must use the environment variable method prior to running docker compose.

Contributors

Code Contributors

  • Stefan Krawczyk (@skrawcz)
  • Elijah ben Izzy (@elijahbenizzy)
  • Danielle Quinn (@danfisher-sf)
  • Rachel Insoft (@rinsoft-sf)
  • Shelly Jang (@shellyjang)
  • Vincent Chu (@vslchusf)
  • Christopher Prohm (@chmp)
  • James Lamb (@jameslamb)
  • Avnish Pal (@bovem)
  • Sarah Haskins (@frenchfrywpepper)
  • Thierry Jean (@zilto)
  • Michał Siedlaczek (@elshize)
  • Benjamin Hack (@benhhack)
  • Bryan Galindo (@bryangalindo)
  • Jordan Smith (@JoJo10Smith)
  • Roel Bertens (@roelbertens)
  • Swapnil Delwalkar (@swapdewalkar)
  • Fran Boon (@flavour)
  • Tom Barber (@buggtb)
  • Konstantin Tyapochkin (@tyapochkin)

Bug Hunters/Special Mentions

  • Nils Olsson (@nilsso)
  • Michał Siedlaczek (@elshize)
  • Alaa Abedrabbo (@AAbedrabbo)
  • Shreya Datar (@datarshreya)
  • Baldo Faieta (@baldofaieta)
  • Anwar Brini (@AnwarBrini)
  • Gourav Kumar (@gms101)
  • Amos Aikman (@amosaikman)
  • Ankush Kundaliya (@akundaliya)
  • David Weselowski (@j7zAhU)
  • Peter Robinson (@Peter4137)
  • Seth Stokes (@sT0v)
  • Louis Maddox (@lmmx)
  • Stephen Bias (@s-ducks)
  • Anup Joseph (@AnupJoseph)
  • Jan Hurst (@janhurst)
  • Flavia Santos (@flaviassantos)
  • Nicolas Huray (@nhuray)
  • Manabu Niseki (@ninoseki)


hamilton's Issues

[Internal] Decouple node function from inputs

Issue by elijahbenizzy
Wednesday Sep 28, 2022 at 00:22 GMT
Originally opened as stitchfix/hamilton#201


Currently a node takes in a set of inputs -- these correspond both to (a) the names of the dependencies and (b) the parameters in the function.

To illustrate:

In [1]: def test(a: int, b: int) -> int:
   ...:     return a + b
   ...:

In [2]: from hamilton import node

In [3]: node.Node.from_fn(test).input_types
Out[3]:
{'a': (int, <DependencyType.REQUIRED: 1>),
 'b': (int, <DependencyType.REQUIRED: 1>)}

This works well in the standard case. However, with parameterizing sources it kinda breaks down:

In [4]: from hamilton.function_modifiers import source, value, parameterize
   ...: parameterize(foo={'a': source('c'), 'b': source('c')})(test)

In [5]: from hamilton.function_modifiers.base import resolve_nodes
In [6]: resolve_nodes(test, {})[0].input_types
Out[6]: {'c': (int, <DependencyType.REQUIRED: 1>)}

In [18]: resolve_nodes(test, {})[0](c=1)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [18], in <cell line: 1>()
----> 1 resolve_nodes(test, {})[0](c=1)

File ~/dev/dagworks/hamilton/hamilton/node.py:174, in Node.__call__(self, *args, **kwargs)
    172 def __call__(self, *args, **kwargs):
    173     """Call just delegates to the callable, purely for clean syntactic sugar"""
--> 174     return self.callable(*args, **kwargs)

File ~/dev/dagworks/hamilton/hamilton/function_modifiers/expanders.py:94, in parameterize.expand_node.<locals>.replacement_function(upstream_dependencies, literal_dependencies, *args, **kwargs)
     92 kwargs = kwargs.copy()
     93 for dependency, replacement in upstream_dependencies.items():
---> 94     kwargs[dependency] = kwargs.pop(replacement.source)
     95 for dependency, replacement in literal_dependencies.items():
     96     kwargs[dependency] = replacement.value

KeyError: 'c'

The obvious solution (that I'm taking in reuse_subdag) is to create an identity node. But, let's face it, that's fairly clunky. Instead I propose we specify a mapping inside the node: a field param_mapping in Node that gives the mapping of internal parameters to external ones. This defaults to the identity mapping, but we can then use it to simplify a lot of the more complex decorators.

Prototype integration with an LLVM or similar tech.

Issue by skrawcz
Tuesday Feb 01, 2022 at 00:05 GMT
Originally opened as stitchfix/hamilton#40


Here the following assumes "numba", but really we could replace "numba" with "jax", or any other framework that could optimize python code to execute faster.

Is your feature request related to a problem? Please describe.
Numba is a way to accelerate python functions. To use it, you annotate your python functions to be "compiled" with the jit. It then creates faster code from it.

Currently the speed up only materializes on the second invocation of a function -- the first time it compiles it. So to work with Hamilton, we'd have to compile ahead of time (AOT) if people only run a DAG once. Otherwise we could use the jit for DAGs that people execute over and over again.

Describe the solution you'd like
Two solutions:

  1. Prototype the ability to compile a hamilton graph ahead of time with Numba. You could use how we get Hamilton to run on Dask as a starting point (TODO: link to code). See these numba docs for ahead of time compilation.
  2. Prototype the ability to use the jit compiler with Numba. That way the first time someone runs execute things are compiled (no speed up), but the second time, things are lightning quick! See these docs.

Things to think about with prototype (1):

  1. Since compiling ahead of time requires types -- we might need some better way to specify them? Or perhaps we can have numba infer them?
  2. The output of compilation is another set of python module(s) -- this is what we'd then want to use for computation.
  3. What is therefore the correct order of operations? Build the function graph, compile it, then somehow build the graph again with the new functions (?), and use that for execution?
  4. What are the limitations of this approach in terms of use cases, etc. We could limit to numpy and python primitive code only for instance.

Things to think about with prototype (2):

  1. What use cases does this make sense for?
  2. What are the limitations of this approach?

Describe alternatives you've considered
Haven't.

Additional context

Some scaffolding

Issue by elijahbenizzy
Friday Jun 17, 2022 at 22:48 GMT
Originally opened as stitchfix/hamilton#135


[Short description explaining the high-level reason for the pull request]

Changes

Testing

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Python - local testing

  • python 3.6
  • python 3.7

elijahbenizzy included the following code: https://github.com/stitchfix/hamilton/pull/135/commits

Modin integration

Issue by skrawcz
Thursday Mar 10, 2022 at 19:09 GMT
Originally opened as stitchfix/hamilton#85


Is your feature request related to a problem? Please describe.
Modin - https://github.com/modin-project/modin - also enables scaling pandas computation. Since we have ray, dask, and koalas, why not add Modin?

Describe the solution you'd like
Modin requires a replacement of the pandas import in user code to work.
We would need to think about how to do this:

  1. Do we get people to import "pandas" from hamilton, and we can then control which pandas is actually imported?
  2. Do we require users then to assume modin, by changing the pandas import themselves when defining their hamilton python functions?
  3. Or is there some other way to integrate? E.g. a graph adapter
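Option (1) could be sketched as a thin indirection module that user code imports pandas from; this is a hypothetical shim (the `USE_MODIN` variable is invented here) that falls back to plain pandas when modin isn't installed:

```python
# Sketch of option (1): user code does `from pandas_shim import pd`
# and the backend is chosen once, here, rather than in every module.
import os

if os.environ.get("USE_MODIN", "").lower() == "true":
    import modin.pandas as pd  # requires modin to be installed
else:
    import pandas as pd

# Downstream Hamilton functions use `pd` without knowing which backend it is.
df = pd.DataFrame({"a": [1, 2, 3]})
print(int(df["a"].sum()))  # 6
```

This keeps user function definitions unchanged between backends, at the cost of a slightly unusual import convention.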

Additional context
N/A

Add testing for thread safety of driver

Issue by elijahbenizzy
Friday Nov 18, 2022 at 21:56 GMT
Originally opened as stitchfix/hamilton#234


Is your feature request related to a problem? Please describe.
The driver should be thread-safe, but we want a test to ensure it.

Describe the solution you'd like
A unit/integration test that runs a bunch of executions in parallel with the same driver.

Additional context
https://hamilton-opensource.slack.com/archives/C03M33QB4M8/p1668807849626519

Support frozenset[...] generic

Issue by elijahbenizzy
Wednesday Aug 10, 2022 at 20:26 GMT
Originally opened as stitchfix/hamilton#175



Current behavior

This breaks:

def foo() -> frozenset[str]:
    ...

def bar(foo: Set[str]) -> ...:
    ...

But it should work. That said, the reverse isn't true -- we can't pass a set when expecting a frozenset :/
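One possible fix is to strip the generic parameters with typing.get_origin before the issubclass call, since issubclass() rejects parameterized generics like frozenset[str] outright. This is a hypothetical sketch (`generic_aware_subclass_check` is invented here), not Hamilton's actual custom_subclass_check:

```python
import collections.abc
from typing import get_origin

def generic_aware_subclass_check(requested_type, expected_type):
    """Sketch: reduce parameterized generics (frozenset[str] -> frozenset)
    to their origin class before issubclass, which would otherwise raise
    TypeError on a types.GenericAlias argument."""
    requested = get_origin(requested_type) or requested_type
    expected = get_origin(expected_type) or expected_type
    return issubclass(requested, expected)

# frozenset satisfies the abstract Set interface...
print(generic_aware_subclass_check(frozenset[str], collections.abc.Set))  # True
# ...but the reverse substitution should still be rejected:
print(generic_aware_subclass_check(set, frozenset))  # False
```

This ignores the type arguments themselves (str vs int), which a fuller fix would also compare via typing.get_args.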

Stack Traces

I've gotten a few:

    raise e
../hamilton/hamilton/driver.py:57: in __init__
    self.graph = graph.FunctionGraph(*modules, config=config, adapter=adapter)
../hamilton/hamilton/graph.py:194: in __init__
    self.nodes = create_function_graph(*modules, config=self._config, adapter=adapter)
../hamilton/hamilton/graph.py:124: in create_function_graph
    add_dependency(n, node_name, nodes, param_name, param_type, adapter)
../hamilton/hamilton/graph.py:89: in add_dependency
    if not types_match(adapter, param_type, required_node.type):
../hamilton/hamilton/graph.py:50: in types_match
    elif custom_subclass_check(required_node_type, param_type):
../hamilton/hamilton/type_utils.py:45: in custom_subclass_check
    return issubclass(requested_type, param_type)

Ability to data profile node outputs for creating data quality checks

Issue by skrawcz
Tuesday Aug 02, 2022 at 17:00 GMT
Originally opened as stitchfix/hamilton#165


Is your feature request related to a problem? Please describe.
Data profiling is a way to help bootstrap creating data quality checks.
Data profiling is also a way to facilitate data exploration, by providing summary statistics over data.

Describe the solution you'd like
A user should be able to profile their DAG, or a set of nodes, and get out some summary statistics.
Those statistics could then be used to bootstrap data quality, i.e. check_output() decorators, but the output should be standalone.

Describe alternatives you've considered
Haven't considered many options. But there are a few libraries that do data profiling already.

Additional context
Systems like whylogs, great expectations, use profiling to help with the user experience.
Standalone libraries like https://github.com/capitalone/DataProfiler also exist.

stitchfix/hamilton#149 does a little to prototype in this area too.
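A minimal sketch of what such profiling output could look like, using pandas summary statistics; this is a hypothetical helper (`profile_outputs` is invented here), not part of Hamilton:

```python
import pandas as pd

def profile_outputs(outputs):
    """Sketch: summarize node outputs into stats that could seed
    data quality checks (e.g. a check_output range)."""
    profiles = {}
    for name, value in outputs.items():
        if isinstance(value, pd.Series):
            profiles[name] = {
                "min": float(value.min()),
                "max": float(value.max()),
                "null_fraction": float(value.isna().mean()),
            }
    return profiles

stats = profile_outputs({"spend": pd.Series([10, 10, 20, 40])})
print(stats["spend"])  # {'min': 10.0, 'max': 40.0, 'null_fraction': 0.0}
```

The observed min/max here would translate directly into a bootstrapped range check on the `spend` node.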

Metadata emission

Issue by skrawcz
Tuesday Jun 21, 2022 at 21:57 GMT
Originally opened as stitchfix/hamilton#137


Is your feature request related to a problem? Please describe.
Hamilton encodes a lot of metadata that lives in code. It also creates some at execution time. There are projects such as https://datahubproject.io/, https://openlineage.io/ that capture this metadata across a wide array of tooling to create a central view in a heterogeneous environment. Hamilton should be able to emit metadata/execution information to them.

Describe the solution you'd like
A user should be able to specify whether their Hamilton DAG should emit metadata.
This should play nicely with graph adapters, e.g. spark, ray, dask.

UX questions:

  1. Should this be something in the graph adapter universe? E.g. a mixin?
  2. Or should this be on the driver side, so you change drivers for functionality, but change graph adapters for scale...

TODO:

  • find motivating use case to develop for

Reusable subDAG components

Issue by elijahbenizzy
Saturday Mar 12, 2022 at 00:24 GMT
Originally opened as stitchfix/hamilton#86


Is your feature request related to a problem? Please describe.
Reusing functions, helpers, etc... are all nice and good. However, sometimes you want to be able to reuse large subcomponents in the DAG.

Describe the solution you'd like

Two ideas:

  1. Use the Driver to stitch bigger DAGs together
  2. Two decorators (still need to fully think this through)

This uses prefixes, but some actual namespace notion could be nice here.

@hamilton.subdag_split(
    inputs={
        'subdag_1' : {'source' : 'source_for_dag_1'}, # subdag 1 gets its source from a different place than subdag 2
        'subdag_2' : {'source' : 'source_for_dag_2'}})
def data(source: str) -> pd.DataFrame:
    return _load_data(source)

def foo(data) -> pd.Series:
    return _do_something(data)

@hamilton.subdag_join(subdag='subdag_1')
def process_subdag_1(foo) -> pd.Series:
    return _process_some_way(foo)

The framework would then compile this in a fancy way -- ensuring that every node between the splits and the joins is turned into one for each subdag, under a separate namespace. TBD on how to access it.

Describe alternatives you've considered
Not allowing this -- I don't have a concrete use-case blocking anyone, but we have observed one at Stitch Fix.

Additional context
Thinking ahead here.

Expose tags and function metadata to decorators

Issue by skrawcz
Wednesday Jun 22, 2022 at 21:42 GMT
Originally opened as stitchfix/hamilton#138


Is your feature request related to a problem? Please describe.
With tagging, we can annotate functions with metadata. It would be useful to allow decorators access to this, and other metadata accumulated.

E.g. tags could be used as a means to help inform what should happen if a check_output decorator runs a test and it fails. That is, if we standardize on tag keys, then decorators could assume them and make use of them.

Describe the solution you'd like
Enable decorators access to a context or some variable that would allow them to get at this information.
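One lightweight way to do this is to stash tags on the function object so downstream decorators can read them. This is a sketch, not Hamilton's implementation -- the `__hamilton_tags__` attribute, the `on_failure` tag key, and this `check_output` behavior are all assumptions for illustration:

```python
from functools import wraps
from typing import Any, Callable


def tag(**tags: str) -> Callable:
    """Attach tag metadata to a function (loosely mirrors Hamilton's @tag)."""
    def decorator(fn: Callable) -> Callable:
        fn.__hamilton_tags__ = {**getattr(fn, "__hamilton_tags__", {}), **tags}
        return fn
    return decorator


def check_output(fn: Callable) -> Callable:
    """Hypothetical validator that consults tags to decide failure behavior."""
    @wraps(fn)  # wraps copies __dict__, so the tags survive wrapping
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        result = fn(*args, **kwargs)
        tags = getattr(fn, "__hamilton_tags__", {})
        if tags.get("on_failure") == "warn" and result is None:
            print(f"warning: {fn.__name__} returned None")
        return result
    return wrapper


@check_output
@tag(on_failure="warn", owner="data-team")
def my_node() -> int:
    return 7
```

If tag keys were standardized, any decorator in the stack could make decisions this way.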

Describe alternatives you've considered
N/A

Additional context
Taken from the discussion with whylabs folks on what would be useful.

Ability to Profile nodes

Issue by chrisaddy
Monday May 02, 2022 at 18:36 GMT
Originally opened as stitchfix/hamilton#128


Is your feature request related to a problem? Please describe.

Have found it useful to be able to profile code execution for purposes of debugging and profiling code for enhancing performance.

Describe the solution you'd like

A solution I have implemented in the past is a stateful decorator that captures execution context: a list of calls with their function names and simple stats. While this was built with dataframes in mind, it works on arbitrary functions and class methods, so extending it to any (callable) Hamilton node should be straightforward. The original looked something like:

my_data.csv

a  b
1  3
2  4
3  5

profile = Tracer(**tracer_options)

@profile
def load_data(input_csv: str) -> pd.DataFrame:
    return pd.read_csv(input_csv)
    
@profile
def half_everything(load_data: pd.DataFrame) -> pd.DataFrame:
    return load_data / 2
    
if __name__ == "__main__":
    data = load_data("my_data.csv")
    transformed_data = half_everything(data)

print(profile)
[
    Trace(
         func=<function load_data at 0x7f9fc7ff8d30>,
         args=('my_data.csv', ),
         kwargs={},
         profile=Profile(
             cpu=Resource(start=3.9, end=4.4),
             memory=Resource(start=75.7, end=75.7)
          )
    ),
    Trace(
        func=<function half_everything at 0x7f9fd4782dd0>,
        args=(   a  b
             0  1  3
             1  2  4
             2  3  5, ),
        kwargs={},
        profile=Profile(
            cpu=Resource(start=0.0, end=3.8), 
            memory=Resource(start=75.7, end=75.7)
        )
    )
]

Describe alternatives you've considered

Additional context

chatted with @elijahbenizzy , think it could be done either as a node extender similar to data quality, or as an adapter that wraps current adapter types
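A minimal runnable version of the stateful-decorator idea, using wall-clock timing only. The `Trace`/`Tracer` shapes below are simplified from the issue's sketch -- a fuller version would also sample CPU and memory (e.g. via psutil), which is omitted here:

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Trace:
    """One recorded call: function name, positional args, elapsed seconds."""
    func_name: str
    args: Tuple[Any, ...]
    seconds: float


class Tracer:
    """Stateful profiling decorator that accumulates a list of Traces."""
    def __init__(self) -> None:
        self.traces: List[Trace] = []

    def __call__(self, fn: Callable) -> Callable:
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            self.traces.append(Trace(fn.__name__, args, time.perf_counter() - start))
            return result
        return wrapper


profile = Tracer()


@profile
def half_everything(values: list) -> list:
    return [v / 2 for v in values]


halved = half_everything([2, 4, 6])
```

Because the decorator is a plain callable wrapping a plain function, either integration route mentioned above (node extender or adapter wrapper) could apply it transparently.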

Tag-based dependencies

Issue by elijahbenizzy
Wednesday Nov 02, 2022 at 17:24 GMT
Originally opened as stitchfix/hamilton#226


From slack:

James Marvin:
Hi folks - wanted to discuss with you the merits of enabling a method of referring to nodes by their tags as opposed to by node name.
There are some instances in which we may want to process all nodes of a certain 'type' - for example, all metadata columns, or all engineered 'features'/derived columns. In the latter case in particular, it could be that there are dozens of columns of this 'type'.
My understanding is that to create a new node which accepts all nodes of a given type as input, we have to provide each input node name as a parameter to the new function:

@tag(type='metadata')
def create_some_metadata(input:pd.Series) -> pd.Series:
    return helpers._get_some_metadata(input)

@tag(type='metadata')
def create_some_more_metadata(current_time:datetime) -> pd.Series:
    return pd.Series(current_time)

def get_metadata_table(create_some_metadata:pd.Series, create_some_more_metadata:pd.Series) -> pd.DataFrame:
    return pd.DataFrame([create_some_metadata, create_some_more_metadata])

It could be useful - especially where we are creating a new function accepting a high number of nodes of the same type as input - to have some feature enabling us to refer to nodes by type, as opposed to by name.
Hopefully this example shows in principle what I mean:

@tag(type='metadata')
def create_some_metadata(input:pd.Series) -> pd.Series:
    return helpers._get_some_metadata(input)

@tag(type='metadata')
def create_some_more_metadata(current_time:datetime) -> pd.Series:
    return pd.Series(current_time)

@nodes_by_tag(type='metadata')
def get_metadata_table(**kwargs) -> pd.DataFrame:
    return pd.DataFrame(**kwargs)

In this example:

  • All nodes have been assigned the same 'type' using the @tag feature
  • Some method is supplied (in this case, a new decorator @nodes_by_tag) by which we can refer to all nodes of a given type when defining a new node
  • The new node can act on the assumption that all nodes of the given type have been provided as input, without referring to each node by name
What do you think?
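The resolution step a hypothetical @nodes_by_tag decorator would perform at graph-build time can be sketched without any Hamilton internals. The `__tags__` attribute and `nodes_matching` helper below are assumptions for illustration:

```python
from typing import Callable, List


def tag(**tags: str) -> Callable:
    """Attach tags to a function, mimicking Hamilton's @tag."""
    def decorator(fn: Callable) -> Callable:
        fn.__tags__ = tags
        return fn
    return decorator


def nodes_matching(functions: List[Callable], **wanted: str) -> List[str]:
    """Return names of functions whose tags match all requested key/values."""
    return [
        fn.__name__ for fn in functions
        if all(getattr(fn, "__tags__", {}).get(k) == v for k, v in wanted.items())
    ]


@tag(type="metadata")
def create_some_metadata() -> str:
    return "m1"


@tag(type="metadata")
def create_some_more_metadata() -> str:
    return "m2"


@tag(type="feature")
def engineered_feature() -> str:
    return "f1"


matched = nodes_matching(
    [create_some_metadata, create_some_more_metadata, engineered_feature],
    type="metadata",
)
```

A @nodes_by_tag decorator would run this match over the function graph and wire the resulting node names in as the **kwargs inputs.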

Help bootstrap `check_output()` decorator.

Issue by skrawcz
Tuesday Aug 02, 2022 at 17:08 GMT
Originally opened as stitchfix/hamilton#166


Is your feature request related to a problem? Please describe.
Can we help users bootstrap the check_output() decorator?

Describe the solution you'd like
Setting up data quality is possible to do manually, but with some knowledge of the data, could be automated, or at least partially automated.

Idea:

  1. given a data profile, can we generate the check_output() decorator to add?
  2. could we also automatically update the code with it, rather than having the user cut and paste?

Describe alternatives you've considered
Not doing this. But I think having a way to speed up adding it to a code base is a good idea.

Additional context
Related to #164 and #165.

Better documentation for available default validators

Issue by elijahbenizzy
Thursday Jul 14, 2022 at 23:18 GMT
Originally opened as stitchfix/hamilton#156


Is your feature request related to a problem? Please describe.
Currently you have to look at the code to figure out the arguments, and they won't be auto-completed by an IDE.

Describe the solution you'd like
Some sort of auto-generated docs, or at least a configurable list.

Describe alternatives you've considered
Could do it manually.

Better querying for available nodes

Issue by elijahbenizzy
Tuesday Jul 19, 2022 at 20:19 GMT
Originally opened as stitchfix/hamilton#159


We make it easy to attach metadata to nodes, but don't yet have a natural API to make it easy to query these. This could be useful if:

(1) You want to look up a set of nodes with specific tags for reporting purposes
(2) You want to look up nodes used in data quality operators (see the motivating use-case below)
(3) You want to run some subset of the DAG that relates to the way nodes are tagged.

Is your feature request related to a problem? Please describe.
When using data quality (DQ) we have to query by tags, which is really ugly, e.g.:

all_validator_variables = [
    var.name for var in dr.list_available_variables() if
    var.tags.get('hamilton.data_quality.contains_dq_results')]

We should be able to have some utility functions here.

Describe the solution you'd like
Some combo of the following:

dr.query(tag_match={...}, name_match=r"...", module_match=r"...")
hamilton_utils.get_dq_validators(...)

or something like that. Note this would be valuable for more than just data quality -- e.g. for querying tagged nodes in general.
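A standalone sketch of what the `dr.query(...)` combo might do, filtering by tags and a name regex. `Variable` here is a stand-in for what `dr.list_available_variables()` returns; the `query` signature is an assumption, not a committed API:

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Variable:
    """Stand-in for the objects dr.list_available_variables() yields."""
    name: str
    tags: Dict[str, str] = field(default_factory=dict)


def query(variables: List[Variable],
          tag_match: Optional[Dict[str, str]] = None,
          name_match: Optional[str] = None) -> List[Variable]:
    """Filter variables by exact tag values and/or a name regex."""
    out = []
    for var in variables:
        if tag_match and any(var.tags.get(k) != v for k, v in tag_match.items()):
            continue
        if name_match and not re.search(name_match, var.name):
            continue
        out.append(var)
    return out


variables = [
    Variable("age_validator", {"hamilton.data_quality.contains_dq_results": "true"}),
    Variable("age"),
]
dq_nodes = query(variables, tag_match={"hamilton.data_quality.contains_dq_results": "true"})
```

The ugly list comprehension from the problem statement then collapses into a single `query(...)` call.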

Describe alternatives you've considered
See above

Additional context
Writing out gitbook docs...

Do we want to support iterators for data loading?

Issue by skrawcz
Monday Feb 07, 2022 at 17:25 GMT
Originally opened as stitchfix/hamilton#68


What?

If we want to chunk over data, a natural way to do that is via an iterator.

Example: Enable "input" functions to be iterators

def my_loading_funct(...) -> Iterator[pd.Series]:
     ...
     yield some_chunk 

This is fraught with edge cases, but could be a more natural way to chunk over large data sets. It perhaps requires a new driver, since we'd want some next()-type semantics on the output of execute()...

Originally posted by @skrawcz in stitchfix/hamilton#43 (comment)

Things to think through whether this is something useful to have:

  1. Where would we allow this? Only on "input" nodes?
  2. How would we exercise them in a deterministic fashion? i.e. does execute() care? and we iterate over them until they're exhausted? Or does execute() only do one iteration?
  3. How do we coordinate multiple inputs that are iterators? What if they're of different lengths?
  4. How would we ensure people don't create a mess that's hard to debug?
  5. Would this work for the distributed graph adapters?
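One possible answer to question 3 -- zip the chunked inputs together, run once per chunk, and fail loudly on mismatched lengths. This is a sketch of the semantics only (a single function stands in for the DAG); nothing here is a proposed Hamilton API:

```python
from typing import Callable, Dict, Iterator, List


def execute_chunked(fn: Callable[..., int],
                    inputs: Dict[str, Iterator[int]]) -> List[int]:
    """Run `fn` once per chunk, drawing one item from every input iterator.

    Raises if the iterators are of different lengths (question 3 above)."""
    results: List[int] = []
    iterators = {name: iter(it) for name, it in inputs.items()}
    while True:
        chunk = {}
        for name, it in iterators.items():
            try:
                chunk[name] = next(it)
            except StopIteration:
                if chunk:  # some iterators yielded, this one is exhausted
                    raise ValueError("input iterators have different lengths")
                return results
        results.append(fn(**chunk))


def total(a: int, b: int) -> int:
    return a + b


sums = execute_chunked(total, {"a": iter([1, 2, 3]), "b": iter([10, 20, 30])})
```

Under this design, execute() iterates until exhaustion (question 2); the alternative -- one iteration per execute() call -- would push the loop onto the caller.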

Configurable data quality checks

Issue by skrawcz
Tuesday Aug 02, 2022 at 16:54 GMT
Originally opened as stitchfix/hamilton#164


Is your feature request related to a problem? Please describe.
We should be able to configure whether we want data quality to run or not at DAG build/run time.

Describe the solution you'd like
Allowing for fine-grained control over the behavior of specific checks. This will enable the user to turn on and off certain checks at runtime.

i.e. some configuration that a user can adjust/modify via JSON or YAML or both...
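A sketch of that configuration surface: a dict (e.g. parsed from JSON/YAML) consulted at run time to decide whether a named check executes. The `check_output(check_name, config)` signature and the `non_negative_check` name are assumptions for illustration, not Hamilton's API:

```python
from typing import Any, Callable, Dict


def check_output(check_name: str, config: Dict[str, bool]) -> Callable:
    """Runtime-configurable validator: the config dict decides if the check runs."""
    def decorator(fn: Callable) -> Callable:
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            result = fn(*args, **kwargs)
            if config.get(check_name, True):  # checks default to on
                if result < 0:
                    raise ValueError(f"{check_name} failed: negative output")
            return result
        return wrapper
    return decorator


config = {"non_negative_check": False}  # e.g. loaded from YAML at DAG build time


@check_output("non_negative_check", config)
def delta() -> int:
    return -5


value = delta()  # the check is disabled, so the negative output passes through
```

Because the decorator holds a reference to the config dict, flipping the flag re-enables the check without rebuilding the DAG.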

Describe alternatives you've considered
N/A

Additional context
stitchfix/hamilton#149 prototypes some approaches here.

Dask map_partitions node

Issue by elshize
Monday Jun 27, 2022 at 16:19 GMT
Originally opened as stitchfix/hamilton#143


ℹ️ This is in response to a discussion with @elijahbenizzy on discord.

This is an example of a use case that could be supported with some additional decorators.

The use case is that a node takes one Dask data frame and possibly some other arguments (either Pandas data frames or scalars). The node then simply executes map_partitions on the Dask input, broadcasting the remaining arguments.

For example:

def node(a: dd.DataFrame, b: pd.DataFrame, c: int):
    return a.map_partitions(_node, b, c, align_dataframes=False)

def _node(a: pd.DataFrame, b: pd.DataFrame, c: int):
    ...  # actual logic for each partition `a`

could be:

# this probably should be designed better, I just want to give you an idea
@dask_partition("a")
def node(a: pd.DataFrame, b: pd.DataFrame, c: int):
    # logic

So that it could run that map_partitions automatically. It might need an (at least optional) parameter to pass meta to the function call.

Not entirely sure if there are some disadvantages of that, and what the value of it would be in general, but I wanted to document what we discussed.
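The proposed decorator is mostly argument plumbing, which can be shown without dask installed. `dask_partition` below is a sketch of the idea, and `FakeDaskFrame` is a test stand-in for `dd.DataFrame` -- neither is real Hamilton or dask API:

```python
from typing import Any, Callable


def dask_partition(param: str) -> Callable:
    """Wrap a per-partition function so the named argument is fed through
    .map_partitions, broadcasting the remaining keyword arguments."""
    def decorator(fn: Callable) -> Callable:
        def wrapper(**kwargs: Any) -> Any:
            partitioned = kwargs.pop(param)
            return partitioned.map_partitions(fn, **kwargs, align_dataframes=False)
        return wrapper
    return decorator


class FakeDaskFrame:
    """Stand-in for dd.DataFrame so the sketch runs without dask installed."""
    def __init__(self, partitions):
        self.partitions = partitions

    def map_partitions(self, fn, **kwargs):
        kwargs.pop("align_dataframes", None)
        return [fn(p, **kwargs) for p in self.partitions]


@dask_partition("a")
def node(a, c: int):
    # per-partition logic: `a` is one partition here, not the whole frame
    return [value + c for value in a]


result = node(a=FakeDaskFrame([[1, 2], [3, 4]]), c=10)
```

With real dask, the same wrapper would hand `fn` to `dd.DataFrame.map_partitions`, optionally forwarding a `meta` argument as noted above.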

Auto generate pytest unit test stubs for hamilton functions

Issue by skrawcz
Wednesday Jul 20, 2022 at 21:19 GMT
Originally opened as stitchfix/hamilton#160


Is your feature request related to a problem? Please describe.
We should be able to bootstrap a unit test suite given a hamilton function module.

Most DS probably write functions first, and then think about tests. We should make bootstrapping unit tests easy.

Describe the solution you'd like

  1. User writes Hamilton functions.
  2. User uses a command line utility to generate pytest unit test stubs for a module.
    feature_logic.py --> test_feature_logic.py
  3. Within that module, then we should be able to determine which functions have tests and which do not, creating the test for them appropriately
    def my_feature() --> def test_my_feature().
  4. To start we probably don't want too many options -- perhaps a --dry-run argument that would list what would be created.

Describe alternatives you've considered
Not doing this.

Additional context
This would help the software engineering best practices story.
We could similarly use this approach to bulk add check_output decorators.

Create HOW TOs for integration with popular ETL frameworks/methodology

Issue by skrawcz
Thursday Nov 04, 2021 at 23:19 GMT
Originally opened as stitchfix/hamilton#22


Is your feature request related to a problem? Please describe.
Hamilton has a small footprint. It can be run inside existing ETLs very easily. We should have documentation reflecting that fact, to help users understand how simple it is to get it working.

Describe the solution you'd like
We should have examples/documentation to cover:

  1. Metaflow
  2. Airflow
  3. Dagster
  4. Flyte
  5. Your custom scheduler

Describe alternatives you've considered
N/A

Additional context
We want people to be able to cut and paste code easily. Also having examples/documentation would help people size what it would look like to get Hamilton into their ETL.

Add caching for hamilton

Issue by elijahbenizzy
Monday Oct 18, 2021 at 17:54 GMT
Originally opened as stitchfix/hamilton#17


The problem

We want to enable caching of functions and their downstream results.

Say we want to alter a function and rerun the entire DAG. The function that we want to alter runs late enough that we'd be redoing a significant amount of computation. While iterating could often be handled by executing individual nodes, it's completely reasonable to iterate on the entire DAG.

In our internal use of Hamilton, we actually have a decorator called @cache that runs entirely separately from Hamilton -- this allows us to cache the results of individual functions. This decorator uses (a) the code a function runs and (b) the hash of the parameters. That said, it's not foolproof -- changes in external libraries referenced within functions can get ignored, it's not DAG-aware (it doesn't care about downstream functions you might also not want to rerun), and it depends on the hashability of parameters.

I envision this as useful for:

  1. Rerunning/iterating on a DAG locally
  2. Running expensive DAGs in production ETLs that all use the same cache but change minimal parts

Some options

  1. Automatically cache functions, have a clear_cache or use_cache in the execute function
  2. Use the @cache decorator we have internally
  3. Manage the cache externally -- pass in as an override to the driver's execute function. Then have a method on the driver to manipulate the cache as needed.

I'm partial to (1), although we need to make it visible to the user and easy to override -- e.g. to mark things as changed. We could also have a dont_cache decorator if needed.

Probably a few other things we can do -- welcome feedback! Might want to think about making it pluggable -- saving to disk is nice, but saving it to a backing store could be even nicer. Shouldn't get locked in.
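A minimal sketch of the internal @cache described above, keying on the function's compiled bytecode (a proxy for "the code it runs") plus a hash of the arguments. It shares the stated caveats: it ignores changes in external libraries and requires picklable arguments. The dict-backed store stands in for a pluggable backend:

```python
import hashlib
import pickle
from typing import Any, Callable, Dict, List


def cache(store: Dict[str, Any], log: List[str]) -> Callable:
    """Decorator factory: memoize by (function bytecode, argument hash)."""
    def decorator(fn: Callable) -> Callable:
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            key = hashlib.sha256(
                fn.__code__.co_code
                + pickle.dumps((fn.__name__, args, sorted(kwargs.items())))
            ).hexdigest()
            if key not in store:
                log.append(fn.__name__)  # record a cache miss
                store[key] = fn(*args, **kwargs)
            return store[key]
        return wrapper
    return decorator


store: Dict[str, Any] = {}
misses: List[str] = []


@cache(store, misses)
def expensive(x: int) -> int:
    return x * x


first = expensive(4)
second = expensive(4)  # served from the cache, no second miss
```

A DAG-aware version would additionally fold the cache keys of upstream nodes into each key, so editing one function invalidates everything downstream.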

Slightly leaky abstraction -- figure out the best way to extend this

Issue by elijahbenizzy
Wednesday May 12, 2021 at 18:29 GMT
Originally opened as stitchfix/hamilton#7


This is for an extension of hamilton called hamiltime. Currently it's perfectly abstracted away except for this enum. Let's figure out how to make it a completely separate product.

https://github.com/stitchfix/hamilton/blob/c86ddd2adcc6d5812c8d1c769e76e72c1b06a580/hamilton/node.py#L20

Some ideas:

  1. Release hamilton with hamiltime
  2. Have this be an extensible class that hamiltime can overwrite
  3. Have the function_graph, not the node type know about the node sources
  4. ...

Running Hamilton on Flyte

Issue by ramannanda9
Thursday Nov 17, 2022 at 00:25 GMT
Originally opened as stitchfix/hamilton#233



Changes

How I tested this

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.

ramannanda9 included the following code: https://github.com/stitchfix/hamilton/pull/233/commits

Runtime-based DAG structure

Issue by elijahbenizzy
Tuesday Oct 11, 2022 at 16:54 GMT
Originally opened as stitchfix/hamilton#208


Is your feature request related to a problem? Please describe.
Based on a conversation with @bmritz, more for non-bulk computations.

The problem is this: say you have a computation that depends on the result of an upstream node -- e.g. whether that value was supplied/exists, or some property of that value (if odd do x, if even do y). Currently under Hamilton this might look like:

def foo(bar: int = None, baz: int = None) -> int:
    if bar is None:
        return _x(baz)
    if baz is None:
        return _y(bar)
    raise ValueError("must supply either bar *or* baz")

and in the next case:

def foo(bar: int) -> int:
    if bar % 2 == 0:
        return _x(bar)
    return _y(bar)

This, however, is fairly ugly -- as the dependencies get more complex you end up with a bunch of chains, and you end up dealing with nodes that may or may not exist. Hamilton has no notion of this, so we just chain through Nones. While this works fine in simpler cases, it is far from ideal.

Describe the solution you'd like
Two ideas:

  1. Having a compile-time or-gate
  2. Wrapping the approaches above in syntactic sugar

Either way, the API could be somewhat similar:

@dynamic.when_exists('bar')
def foo(bar: int) -> int:
    return _x(bar)
@dynamic.when_exists('baz')
def foo(baz: int) -> int:
    return _y(baz)
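The when_exists sugar could be implemented as a small dispatcher that registers one implementation per possible input and picks the first whose input was actually supplied. This is a sketch of the semantics only -- `WhenExistsDispatcher` and its `resolve` method are invented names, not a proposed Hamilton API:

```python
from typing import Any, Callable, Dict, List, Tuple


class WhenExistsDispatcher:
    """Register one implementation per possible input; pick at run time."""
    def __init__(self) -> None:
        self.impls: List[Tuple[str, Callable]] = []

    def when_exists(self, param: str) -> Callable:
        def decorator(fn: Callable) -> Callable:
            self.impls.append((param, fn))
            return fn
        return decorator

    def resolve(self, inputs: Dict[str, Any]) -> Any:
        for param, fn in self.impls:
            if inputs.get(param) is not None:
                return fn(inputs[param])
        raise ValueError(f"must supply one of: {[p for p, _ in self.impls]}")


dynamic = WhenExistsDispatcher()


@dynamic.when_exists("bar")
def foo_from_bar(bar: int) -> int:
    return bar * 10


@dynamic.when_exists("baz")
def foo_from_baz(baz: int) -> int:
    return baz + 1


result = dynamic.resolve({"baz": 5})
```

This mirrors idea (2), syntactic sugar over the None-chaining; a compile-time or-gate (idea 1) would instead resolve the choice when the DAG is built.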

Describe alternatives you've considered
The above two are reasonable options. Some other ideas:

  • Treat the subdag as a node and dynamically execute/configure it on a per-run basis
  • Run it in a separate driver and not have it supported
  • Continue with the available approach

See datajet for inspiration.

Additional context
From a slack conversation with @bmritz.

hamilton --init to get started

Issue by elijahbenizzy
Thursday Nov 24, 2022 at 00:11 GMT
Originally opened as stitchfix/hamilton#235


Is your feature request related to a problem? Please describe.
New folks might want to get started in an existing repo. New DS/college students could use hamilton to get started on a simple modeling project...

Describe the solution you'd like

hamilton init
# Creates a basic project structure with some functions + hamilton files

hamilton init --project=hello_world 
# Creates the hello_world example

hamilton init --project=recommendations_stack
# Creates the scaffolding for a rec-stack example

hamilton init --project=web-service
# Creates the scaffolding for a flask app

hamilton init kaggle --kaggle-competition=...
# Maybe we could create a template from a kaggle competition?

Additional context
Messing around with dbt; they have this.

Provide a loose type checking adapter

Issue by skrawcz
Monday Aug 15, 2022 at 20:45 GMT
Originally opened as stitchfix/hamilton#181


Is your feature request related to a problem? Please describe.
Following on from conversation in stitchfix/hamilton#170 a user should be able to augment Hamilton's type checking to allow:

def bar_union(x: t.Union[int, pd.Series]) -> t.Union[int, pd.Series]:
    return x

def foo_bar(bar_union: int) -> int:
    return bar_union + 1

as well as

def foo_int(x: int) -> int:
    """foo int"""
    return x + 1


def foo_union(foo_int: AnyType) -> AnyType:
    """foo union"""
    return foo_int + 1


def foo_int2(foo_union: int) -> int:
    """foo int, taking foo_union input"""
    return foo_union + 5

Describe the solution you'd like
A user should use a framework supported graph adapter that enables this sort of looser type checking and provide that to the driver at DAG construction time.

Describe alternatives you've considered
User writes custom graph adapter that is not framework supported.

E.g. something like:

class LooseDAGTypeCheckPythonDataFrameGraphAdapter(base.SimplePythonDataFrameGraphAdapter):

    @staticmethod
    def check_node_type_equivalence(node_type: typing.Type, input_type: typing.Type) -> bool:
        """Essentially allows super set of inputs to go through to node."""
        if input_type == typing.Any:
            # assume it will work
            return True
        # if input is superset of what node expects, that's okay
        elif type_utils.custom_subclass_check(input_type, node_type):
            return True
        return False

Additional context
See stitchfix/hamilton#170

[prototype] NVTabular support

Issue by skrawcz
Friday Jul 08, 2022 at 05:07 GMT
Originally opened as stitchfix/hamilton#150


Is your feature request related to a problem? Please describe.
NVTabular is a way to write ETL feature code for NVIDIA GPUs. Is there a Hamilton way to help organize the workflow required to run NVTabular?

Describe the solution you'd like
The goal of the prototype is to figure out how Hamilton could best be used.

Describe alternatives you've considered
N/A

Additional context
E.g. what would the Hamilton version of https://github.com/NVIDIA-Merlin/NVTabular/blob/main/examples/getting-started-movielens/02-ETL-with-NVTabular.ipynb be?

Data quality next plans POC

Issue by elijahbenizzy
Monday Jul 04, 2022 at 22:38 GMT
Originally opened as stitchfix/hamilton#149


OK so this is a pure proof of concept. Not necessarily the right way to do things, and not tested. That said, I wanted to prove the following:

  1. That we could build a two-step data quality pass (e.g. with a profiler and a validator) -- this will quickly become a blocker for the whylogs integration.
  2. That we can use config to enable/disable items at run/compile time.
  3. That we can add an applies_to keyword to narrow focus of data quality.

(1) is useful for integrations with complex tooling -- e.g. an expensive profiling step with lots of validations.
(2) is useful for disabling checks -- this will probably be the first thing we release.
(3) is useful for extract_columns -- it now makes clear what the check applies to.

While some of this code still has placeholders and isn't tested, it demonstrates feasible solutions, and de-risks the release of data quality enough to make me comfortable.

Look through commits for more explanations.

Changes

Testing

Notes

Checklist

  • PR has an informative and human-readable title (this will be pulled into the release notes)
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code passed the pre-commit check & code is left cleaner/nicer than when first encountered.
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future TODOs are captured in comments
  • Project documentation has been updated if adding/changing functionality.
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Python - local testing

  • python 3.6
  • python 3.7

elijahbenizzy included the following code: https://github.com/stitchfix/hamilton/pull/149/commits

Prototype Lineage Analysis Tooling

Issue by skrawcz
Tuesday Feb 01, 2022 at 00:41 GMT
Originally opened as stitchfix/hamilton#42


Is your feature request related to a problem? Please describe.
Currently, when given a Hamilton DAG, we don't expose ways to ask questions about it.

E.g. for GDPR, data provenance, etc.

E.g.

  1. What if I remove this input, what function(s) will I impact?
  2. What uses some PII data and what is the surface area?
  3. If someone requests to be forgotten, what data do I need to delete?
  4. Who should I talk to when I want to make a change that impacts these functions? (e.g. use git blame to surface the function owner?)
  5. What has changed about the DAG since these two commits?
  6. Are there any cycles?
  7. Are there clusters of disjoint nodes? If so, what are they, maybe I can delete them?
  8. etc

Describe the solution you'd like
This could be a specific "driver class", or something added to the base driver.

Without an end user workflow in mind, it's a bit hard to specify the API.

Also, perhaps this would work well with #4 -- e.g. tagging what is PII, and what isn't?

Describe alternatives you've considered
N/A

Additional context
There are a lot of startups and organizations trying to get a handle on their data and where it is used. Hamilton can help provide a way to get at this easily...

Adds first pass example using movie example from metaflow

Issue by skrawcz
Thursday Dec 23, 2021 at 02:10 GMT
Originally opened as stitchfix/hamilton#32


Goal of this PR is to provide more extensive examples.

Inspiration is from:
https://github.com/Netflix/metaflow/blob/master/metaflow/tutorials/01-playlist/playlist.py

Two interesting things to note here:

  1. We could have the driver do the loading of the CSV, or we could have a function to do it.
  2. Filter operations are easiest to do in the driver, I think. Not to say you can't do them in functions; we'd just need better logic around creating the dataframe, so we don't try to combine series with different index lengths. Hmm.

Additions

  • Examples

Removals

  • N/A

Changes

  • N/A

Testing

  1. The example works. Tested locally.

Todos

  • [ ] add more examples and see if this makes sense.

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows the standards laid out in the dev standards.
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Python

  • python 3.6
  • python 3.7

skrawcz included the following code: https://github.com/stitchfix/hamilton/pull/32/commits

Prototype Compile Hamilton on to an Orchestration Framework

Issue by skrawcz
Wednesday Feb 02, 2022 at 01:35 GMT
Originally opened as stitchfix/hamilton#44


Is your feature request related to a problem? Please describe.
Another way to scale a Hamilton DAG is to break it up into stages and have some other orchestrator handle execution. Hamilton need not implement these functions itself -- it could just compile and delegate execution to these frameworks.

E.g. I have a Hamilton DAG, but I want to use my in house Metaflow system -- the user should be able to generate code to run on Metaflow.

Describe the solution you'd like
A prototype to show how you could go from a Hamilton DAG to a DAG/Pipeline of some orchestration framework.

E.g.:

You'd have to think through the flow to do this:
e.g. define Hamilton DAG -> Compile to X Framework -> Commit code -> Run code on Framework X

We should prototype at least two implementations and see how we'd need to structure the code to make it manageable to maintain.

Describe alternatives you've considered
Hamilton could implement something like these other orchestration frameworks do, but that seems like a heavy lift. Better to try compiling to an existing framework.

Additional context
N/A

enable connecting model node to metric node

Issue by shellyjang
Wednesday Sep 01, 2021 at 00:30 GMT
Originally opened as stitchfix/hamilton#11


slack conversation

Current issue:

  1. model_p_something (“model” node) and prob_something (“metric” node) being a complementary pair,
  2. create_database driver (and therefore its crawler) contain both nodes
  3. simulate driver (and therefore its crawler) only contains the “metric” node and not the “model” node.
  4. the simulate driver’s crawler can find the feature dependency of the metric node (presumably through the model coefficient configs)
  5. the crawlers (and therefore the drivers) are unable to find the complementary connection between the model and metric nodes;

item 5 specifically means that a person needs to know the complementary pairs (= domain knowledge; or hard-coded somewhere?) instead of the DAG containing this info. A complementary pair is identified via the @model decorator.

@model(GLM, 'model_p_demand_manual_by_formerautoship')
def prob_demand_manual_existing_former_autoship() -> pd.Series:
    pass

We would like there to be a systematic mapping between the complementary pairs.

Clarify behavior of decorator ordering

Issue by skrawcz
Sunday Dec 18, 2022 at 01:28 GMT
Originally opened as stitchfix/hamilton#249


We need to make clear our philosophy and resolution method for functions such as:

@extract_fields({'out_value1': int, 'out_value2': str})
@tag(test_key="test-value")
@check_output(data_type=dict, importance="fail")
@does(_dummy)
def uber_decorated_function(in_value1: int, in_value2: str) -> dict:
    pass

Right now it is neither clear nor obvious.

Current behavior

This is what the graph looks like:

[screenshot of the resulting graph]

So it would be unexpected to see check_output over the output of extract_fields.

Steps to replicate behavior

Function code:

def _dummy(**values) -> dict:
    return {f"out_{k.split('_')[1]}": v for k, v in values.items()}


@extract_fields({'out_value1': int, 'out_value2': str})
@tag(test_key="test-value")
@check_output(data_type=dict, importance="fail")
@does(_dummy)
def uber_decorated_function(in_value1: int, in_value2: str) -> dict:
    pass

Expected behavior

check_output should probably operate over what's directly beneath it.
tag -- should it apply to everything, or just what's directly beneath it?
does should apply to uber_decorated_function.
extract_fields should be the last thing applied?
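For reference, plain Python applies stacked decorators bottom-up: the one closest to the function wraps first, the topmost wraps last. This minimal demonstration (the `named` recorder is purely illustrative) shows the order the resolution philosophy has to work with:

```python
from typing import Callable, List

applied: List[str] = []


def named(label: str) -> Callable:
    """Record the order in which decorators are applied to the function."""
    def decorator(fn: Callable) -> Callable:
        applied.append(label)
        return fn
    return decorator


@named("extract_fields")   # applied last (outermost)
@named("tag")
@named("check_output")
@named("does")             # applied first (closest to the function)
def uber_decorated_function() -> dict:
    return {}
```

So if Hamilton's node-level decorators followed raw Python semantics, does would transform the function first and extract_fields would operate on the final node -- which is one candidate philosophy to document.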

Additional context

Thoughts: can we create a linter that reorders decorators?

Modifications to enable decorating functions from another module

Issue by skrawcz
Thursday Oct 27, 2022 at 17:06 GMT
Originally opened as stitchfix/hamilton#217


If the user wants to reuse hamilton functions without hamilton, this commit shows one way to do it -- and what would be required on Hamilton's side to support it.

Basically we'd need a convention. Not sure it's a good idea. It's probably easier to instead provide boilerplate code that try/excepts the hamilton import and falls back to identity-function decorators when hamilton isn't installed...


skrawcz included the following code: https://github.com/stitchfix/hamilton/pull/217/commits

Capturing wide to long transformations of entire dataframes using Hamilton

Issue by latlan1
Thursday Feb 03, 2022 at 01:17 GMT
Originally opened as stitchfix/hamilton#46


Is your feature request related to a problem? Please describe.
Is there a way to transform a dataframe from wide to long (or vice versa) using Hamilton to track this transformation? I concluded no since all of the input/output columns would need to be specified, which could be a lot of typing.

Describe the solution you'd like
It would be nice if I could define a function that accepts df_wide and outputs df_long via pd.melt.

Describe alternatives you've considered
I performed the melt operation outside of Hamilton so this operation is not directly captured through the DAG.

Enable a DAG to define a recommendation call

Issue by skrawcz
Tuesday Oct 11, 2022 at 01:05 GMT
Originally opened as stitchfix/hamilton#207


Is your feature request related to a problem? Please describe.
Hamilton is general purpose - could we use it to model a recommendation stack call?

If you'd like to discuss ideas please try them here - stitchfix/hamilton#206.

Requirements

  1. User should be able to define a recommendation stack dataflow.
  2. We should be able to optimize the dataflow/resolve placement of logic.
  3. The driver should then know how to walk the DAG and resolve calling the right services, etc.

Things to think about
What should be framework-first, what should be an optional add-on, and what should remain custom code a user must write?

Additional context
Twitter thread with @jakemannix that started this train of thought.

Show pyspark dataframe support

Issue by skrawcz
Thursday Mar 10, 2022 at 07:53 GMT
Originally opened as stitchfix/hamilton#84


Is your feature request related to a problem? Please describe.
A common question we get is: does Hamilton support Spark dataframes? The answer is yes, but it's not ideal at the moment, and we don't have a vanilla example to point to.

It's not ideal because joins are a bit of a pain -- you need to know the index to join on. In the pandas world, we got away with this because everything had an index associated with it. In Spark, you need to provide it, and know when to provide it.

Describe the solution you'd like
(1) Provide a vanilla pyspark example.
(2) Provide a pattern to show how to handle multiple spark data sources. Perhaps implement a graph adapter to do so.

Describe alternatives you've considered
N/A

Slightly leaky abstraction -- figure out the best way to extend this

Issue by elijahbenizzy
Wednesday May 12, 2021 at 18:29 GMT
Originally opened as stitchfix/hamilton#7


This is for an extension of hamilton called hamiltime. Currently it's perfectly abstracted away except for this enum. Let's figure out how to make it a completely separate product.

https://github.com/stitchfix/hamilton/blob/c86ddd2adcc6d5812c8d1c769e76e72c1b06a580/hamilton/node.py#L20

Some ideas:

  1. Release hamilton with hamiltime
  2. Have this be an extensible class that hamiltime can overwrite
  3. Have the function_graph, not the node type, know about the node sources
  4. ...

Add pandas result builder that converts to long format

Issue by skrawcz
Monday Apr 25, 2022 at 21:53 GMT
Originally opened as stitchfix/hamilton#121


Is your feature request related to a problem? Please describe.
Hamilton works on "wide" columns -- not "long" ones. However, the "tidy" data ethos holds that data should be in a long format -- it does make some things easier to do.

Describe the solution you'd like
Add a ResultBuilder variant that takes in how you'd want to collapse the resulting pandas dataframe.

Describe alternatives you've considered
People do this manually -- but perhaps doing it in the result builder makes more sense.

Additional context
Prerequisites for someone picking this up:

  • know Pandas.
  • know python.
  • can write the pandas code to go from wide to long.
  • can read the Hamilton code base to figure out where to add it.
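A minimal sketch of such a variant (the class name is hypothetical; assumes pandas and that every output is a Series of equal length):

```python
import pandas as pd

# Hypothetical sketch: assemble the computed outputs into a wide DataFrame,
# then melt it to the long/"tidy" format.
class LongFormatResultBuilder:
    @staticmethod
    def build_result(**outputs) -> pd.DataFrame:
        wide = pd.DataFrame(outputs)
        return wide.melt(var_name="column", value_name="value")
```

A real implementation would likely also take `id_vars`/naming options at construction time.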

Add ResultMixin implementations for Dask native types

Issue by skrawcz
Friday Feb 11, 2022 at 21:50 GMT
Originally opened as stitchfix/hamilton#75


Is your feature request related to a problem? Please describe.

We should provide useful implementations of:

import abc
import typing

class ResultMixin(object):
    """Base class housing the static function.

    Why a static function? That's because certain frameworks can only pickle a static function, not an entire
    object.
    """
    @staticmethod
    @abc.abstractmethod
    def build_result(**outputs: typing.Dict[str, typing.Any]) -> typing.Any:
        """This function builds the result given the computed values."""
        pass

for use with Dask. E.g. returning a Dask native array, dataframe, bag, etc. Currently the default is to return a pandas dataframe.

See the build_result function in DaskGraphAdapter for a reference point on how it could be used.

Describe the solution you'd like
These should probably be placed in the h_dask.py module for now. Otherwise open to naming.

Alternatively, we could include more options in DaskGraphAdapter. Open to thinking what way is the most user friendly solution going forward.

Additional context
The addition of these ResultMixins should enable a user who is using Dask to use the ones that come with Hamilton instead of having to implement their own version.

[idea] Node fusing for speeding up execution on systems like Ray, Dask

Issue by skrawcz
Wednesday Aug 24, 2022 at 20:45 GMT
Originally opened as stitchfix/hamilton#188


Is your feature request related to a problem? Please describe.
For delegating to systems like Ray, it could make sense to "fuse" nodes together to reduce serialization costs.

Describe the solution you'd like
ideas:

  1. We need some concept to augment the DAG.
  2. We then need some pluggable way to change this logic. E.g. heuristics, vs some multi-pass logic.

Describe alternatives you've considered
Ideas:

  1. Make people write larger functions. But we wouldn't do this, since it goes against Hamilton's ideals.
  2. Have people tag functions that could be grouped -- seems like a good backdoor capability to have -- could work well with some more automated solution to override whatever it tries to do.
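The tagging alternative could be sketched like this (the `fuse_group` tag and helper are hypothetical; real fusing would also have to respect DAG edges, not just tags):

```python
from collections import defaultdict

# Hypothetical sketch: bucket nodes that share a "fuse_group" tag, so a
# delegating executor could submit each bucket as a single Ray/Dask task
# and pay serialization costs once per group instead of once per node.
def fuse_by_tag(nodes):
    """nodes: iterable of (name, tags) pairs -> list of fused name groups."""
    groups = defaultdict(list)
    for name, tags in nodes:
        groups[tags.get("fuse_group", name)].append(name)
    return list(groups.values())
```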

Additional context
I thought of this idea because people were complaining about Hamilton on Ray being slow.

Explore Ibis Integration

Issue by elijahbenizzy
Friday Mar 18, 2022 at 21:49 GMT
Originally opened as stitchfix/hamilton#88


Is your feature request related to a problem? Please describe.
Ibis could happily replace pandas, and it's more flexible (scalable, etc...)

Describe the solution you'd like
Ibis dataframes instead of pandas dataframes. Perhaps a plugin framework.

functools.lru_cache() makes hamilton think a function is not a node

Issue by elijahbenizzy
Thursday Aug 11, 2022 at 21:34 GMT
Originally opened as stitchfix/hamilton#178


IMO this is actually just sloppiness in the implementation of lru_cache -- it isn't layerable. Need to verify 100% that this is the cause, but we should fix it.

Current behavior

Hamilton treats the lru_cache-decorated function as a required input rather than a node.

Stack Traces

  File "/Users/elijahbenizzy/dev/hamilton-os/hamilton/hamilton/driver.py", line 203, in visualize_execution
    self.validate_inputs(user_nodes, inputs)
  File "/Users/elijahbenizzy/dev/hamilton-os/hamilton/hamilton/driver.py", line 99, in validate_inputs
    raise ValueError(error_str)
ValueError: 2 errors encountered:
  Error: Required input config not provided for nodes: ['foo'].

Steps to replicate behavior

@functools.lru_cache(maxsize=None)
def config() -> Dict[str, Any]:
    return _load_config()

def foo(config: Dict[str, Any]):
    return config['foo']
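A plausible mechanism (my assumption, not verified against the Hamilton source): on CPython, `lru_cache` returns a C-level `_lru_cache_wrapper` rather than a plain function, so any node discovery based on `inspect.isfunction` silently skips it:

```python
import functools
import inspect

def plain() -> int:
    return 1

@functools.lru_cache(maxsize=None)
def cached() -> int:
    return 1

# On CPython, the lru_cache wrapper is not a types.FunctionType instance,
# even though inspect.signature still resolves it via __wrapped__.
print(inspect.isfunction(plain))   # True
print(inspect.isfunction(cached))  # False
```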

Library & System Information

E.g. python version, hamilton library version, linux, etc.

Expected behavior

Additional context

Add any other context about the problem here.

Utility for generating temporary, unique node names

Issue by elijahbenizzy
Tuesday Nov 15, 2022 at 15:41 GMT
Originally opened as stitchfix/hamilton#230


We have a lot of cases (coming up) in which we generate unique/temporary nodes in decorator/DAG construction.

E.G.

  • generating a node in the new parameterized and extract_columns combo decorator
  • generating static/pass-through nodes for the new reuse_functions decorator

And a few more that we already do but I honestly can't remember right now... Currently these have the potential of clashing with each other, but I think we can do this in a much cleaner way. Properties we want:

(1) unique
(2) readable
(3) stable between runs
(4) stable between DAG changes

TBD on implementation -- but I think a stable(ish) hash with a prefix and a low-digit number for collisions. If we toss readability, a hash/UUID is fine.
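One possible shape for such a utility (the function name and digit count are illustrative): a readable prefix plus a short, stable hash of the identifying context, which satisfies uniqueness, readability, and stability across runs and unrelated DAG changes.

```python
import hashlib

# Hypothetical sketch: readable prefix + short stable digest of identifying
# context (e.g. "module.function.column"). Deterministic across runs;
# collisions are limited to the truncated digest width.
def temp_node_name(prefix: str, context: str, digits: int = 6) -> str:
    digest = hashlib.sha256(context.encode("utf-8")).hexdigest()[:digits]
    return f"{prefix}__{digest}"
```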

Lazy config evaluation

Issue by elijahbenizzy
Monday Feb 07, 2022 at 03:51 GMT
Originally opened as stitchfix/hamilton#64


[Short description explaining the high-level reason for the pull request]

Additions

Removals

Changes

Testing

Screenshots

If applicable

Notes

Todos

Checklist

  • PR has an informative and human-readable title
  • Changes are limited to a single goal (no scope creep)
  • Code can be automatically merged (no conflicts)
  • Code follows the standards laid out in the TODO link to standards
  • Passes all existing automated tests
  • Any change in functionality is tested
  • New functions are documented (with a description, list of inputs, and expected output)
  • Placeholder code is flagged / future todos are captured in comments
  • Project documentation has been updated (including the "Unreleased" section of the CHANGELOG)
  • Reviewers requested with the Reviewers tool ➡️

Testing checklist

Python

  • python 3.6
  • python 3.7

elijahbenizzy included the following code: https://github.com/stitchfix/hamilton/pull/64/commits

Create Hamilton converter for pandas code

Issue by skrawcz
Friday May 13, 2022 at 22:54 GMT
Originally opened as stitchfix/hamilton#132


Is your feature request related to a problem? Please describe.
With Hamilton you need to restructure your code. This can be too much of a friction point for someone. Wouldn't it be nice if we had a way to help automate this step?

Describe the solution you'd like
We should be able to write some python code that parses the AST to convert code like:

df['a'] = df['b'] + df['c']

into

def a(b: pd.Series, c: pd.Series) -> pd.Series:
    return b + c

Core to this problem is building code to parse python code and output/print Hamilton functions. Once we have that, we can think about the places we could provide this, e.g. CLI, a website, some other means...
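A narrow proof-of-concept sketch (requires Python 3.9+ for `ast.unparse`; the function name is hypothetical, and it only handles single assignments of the exact `df['x'] = ...` shape shown above):

```python
import ast

# Hypothetical sketch: convert one `df['a'] = df['b'] + df['c']` style
# assignment into the source of an equivalent Hamilton function.
def convert_assignment(src: str) -> str:
    stmt = ast.parse(src).body[0]
    assert isinstance(stmt, ast.Assign)
    target = stmt.targets[0].slice.value  # column being assigned, e.g. 'a'
    expr = ast.unparse(stmt.value)        # e.g. "df['b'] + df['c']"
    # collect the columns referenced on the right-hand side
    cols = sorted({n.slice.value for n in ast.walk(stmt.value)
                   if isinstance(n, ast.Subscript)})
    # naive rewrite: turn df['x'] references into bare parameter names
    for col in cols:
        expr = expr.replace(f"df['{col}']", col)
    params = ", ".join(f"{c}: pd.Series" for c in cols)
    return f"def {target}({params}) -> pd.Series:\n    return {expr}"
```

A real converter would need to handle chained assignments, non-Series expressions, and name collisions; this only demonstrates the AST round trip.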

Describe alternatives you've considered
Not doing this.

Additional context
It would enable people to get up and running with Hamilton faster. E.g. if they provided a script, and we "walked" the script and guessed what should be output...
