cedricleroy / pyungo

Function dependencies resolution and execution

Home Page: https://cedricleroy.github.io/pyungo/

License: MIT License

Python 100.00%

dag python workflow

pyungo's Issues

Function names instead of node indices when printing

It would be nice to be able to display the function names instead of the node indices when printing the graph dependencies.

RFE: auto-populate graph from function name & args

I propose auto-populating the Graph.add_node() calls from the function and argument names (using the inspect module from Python's standard library), and allowing some form of string filtering on those names.
Admittedly, this would work for single outputs only.

The proposal is easier to explain with sample client code:

def funcname_chopper(funcname):
    # Strip a known verb prefix, so that `make_c` yields the output name `c`.
    for prefix in ['calc_', 'compute_', 'make_']:
        if funcname.startswith(prefix):
            return funcname[len(prefix):]
    return funcname

graph = pyungo.Graph(
    outname_converter=funcname_chopper)

# equivalent to: register(inputs=['a', 'b'], outputs=['c'])
@graph.register
def make_c(a, b):
    return a+b


# equivalent to: register(inputs=['a', 'b', 'c'], outputs=['calc_d'])
@graph.register(inpname_converter=lambda n: n[5:], outname_converter=None)
def calc_d(some_a, look_b, stop_c):
    return some_a + look_b
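
The inference step itself can be sketched with inspect.signature alone. This is only an illustration of the proposal, not pyungo code: infer_node_spec and strip_prefix are hypothetical helpers, and the converter parameters simply mirror the names used in the proposal above.

```python
import inspect


def infer_node_spec(func, inpname_converter=None, outname_converter=None):
    """Return (inputs, outputs) inferred from a function's signature and name."""
    params = inspect.signature(func).parameters
    # Parameter names become input names, optionally rewritten by a converter.
    inputs = [inpname_converter(p) if inpname_converter else p for p in params]
    # The function name becomes the single output name.
    name = func.__name__
    output = outname_converter(name) if outname_converter else name
    return inputs, [output]


def strip_prefix(funcname):
    """Hypothetical converter: drop a leading verb such as `make_`."""
    for prefix in ("calc_", "compute_", "make_"):
        if funcname.startswith(prefix):
            return funcname[len(prefix):]
    return funcname


def make_c(a, b):
    return a + b


def calc_d(some_a, look_b, stop_c):
    return some_a + look_b


spec_c = infer_node_spec(make_c, outname_converter=strip_prefix)
spec_d = infer_node_spec(calc_d, inpname_converter=lambda n: n[5:])
```

Here spec_c describes inputs a, b and output c, while spec_d keeps the function name as its output because no output converter is given.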

Issue with single output being an array

Reference: https://github.com/cedricleroy/pyungo/blob/master/pyungo/core.py#L146-L152

I ran into an issue for a specific type of node. The node returns 1 output that is a list of multiple datetime objects (like a timestamps vector). Because of the lines referenced above, the graph only saves the 1st item of that returned list, because it thinks that multiple outputs will be returned (since iter(res) doesn't fail), but there is only one output_name in the node (like "timestamps" for instance).
Essentially, the for loop goes through the timestamps list, and returns the first element of that list as data to be saved...
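
A minimal sketch of the behavior described above and of one possible fix. map_outputs_naive is an assumed reconstruction based only on the description in this issue, not the actual code in core.py, and map_outputs is a hypothetical alternative that looks at the number of declared output names instead of testing whether the result is iterable.

```python
from datetime import datetime


def map_outputs_naive(output_names, res):
    """Assumed reconstruction: iterate over the result whenever that is possible."""
    try:
        iter(res)
    except TypeError:
        res = [res]
    # With one output name, zip() stops after the first element of the list.
    return dict(zip(output_names, res))


def map_outputs(output_names, res):
    """Treat the result as one object when only one output is declared."""
    if len(output_names) == 1:
        return {output_names[0]: res}
    return dict(zip(output_names, res))


timestamps = [datetime(2020, 1, 1), datetime(2020, 1, 2)]
naive = map_outputs_naive(["timestamps"], timestamps)
fixed = map_outputs(["timestamps"], timestamps)
```

With the naive mapping, only the first datetime is stored under "timestamps"; the second variant stores the whole list.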

Is schema enforced on outputs and internal data-nodes?

Adapting the quickstart example:

schema = {
    "type": "object",
    "properties": {
        "a": {"type": "number"},
        "b": {"type": "number"}
    }
}

graph = Graph(schema=schema)

@graph.register(inputs=['a'], outputs=['b'])
def f1(a):
    return "Hey!"

@graph.register(inputs=['b'], outputs=['c'])
def f2(b):
    return 2 * b

graph.calculate(data={'a': 1})

I was expecting an error, but the pipeline ran through and returned Hey!Hey!.
Am I doing something wrong?
This feature is particularly important for data on the internal nodes, because it is not as easy to test them as inputs/outputs.
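
To make the request concrete, here is a rough sketch of what checking a node's output against the schema could look like. It is not how pyungo validates data: check_value and JSON_TYPES are hypothetical, and only the "type" keyword of a small subset of JSON Schema is handled.

```python
# Map JSON Schema type names to Python types (small subset only).
JSON_TYPES = {
    "number": (int, float),
    "string": str,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_value(schema, name, value):
    """Raise TypeError if `value` violates the declared type of property `name`."""
    prop = schema.get("properties", {}).get(name)
    if prop is None or "type" not in prop:
        return  # nothing declared for this name, so nothing to enforce
    expected = JSON_TYPES[prop["type"]]
    # bool is a subclass of int in Python, so reject it explicitly for "number".
    is_bool_as_number = prop["type"] == "number" and isinstance(value, bool)
    if is_bool_as_number or not isinstance(value, expected):
        raise TypeError(
            f"{name!r} should be {prop['type']}, got {type(value).__name__}"
        )


schema = {
    "type": "object",
    "properties": {
        "a": {"type": "number"},
        "b": {"type": "number"},
    },
}


def f1(a):
    return "Hey!"


check_value(schema, "a", 1)  # the input passes
try:
    check_value(schema, "b", f1(1))  # the intermediate value does not
    error = None
except TypeError as exc:
    error = str(exc)
```

Running such a check right after each node would surface the bad value of b before f2 consumes it.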

Memory usage

Hi,
I'm not sure if this is a bug or a feature request.

I have a workflow that is very memory intensive but also very well suited to decomposition to a DAG.
My problem is that if I keep any intermediate outputs in memory, I will quickly exceed the RAM of a single computer.
I had hoped pyungo would be clever enough to allow intermediate states to be garbage collected, but it doesn't seem so.

See a sample program:
from pyungo import Graph
import numpy as np
import gc

@profile
def main():
    graph = Graph()

    @graph.register()
    def calc_a():
        a = np.random.rand(8192,8192)
        return a

    @graph.register()
    def calc_b():
        b = np.random.rand(8192,8192)
        return b

    @graph.register()
    def calc_c(a,b):
        gc.collect()
        c = a * b
        print("c")
        return c

    @graph.register()
    def calc_d():
        gc.collect()
        d = np.random.rand(8192,8192)
        print("d")
        return d

    @graph.register()
    def calc_pfd(c,d):
        gc.collect()
        e = c * d
        return e

    gc.collect()
    res = graph.calculate(data={})
    gc.collect()
    print(res)
    del res
    gc.collect()
    del graph
    gc.collect()

main()

Output:
(venv) zenbook% python -m memory_profiler memtest.py
INFO:root:Starting calculation...
INFO:root:Ran Node(08f958eb-84ff-49ad-a2fb-a2ada5788705, <calc_a>, [], ['a']) in 0:00:02.127759
d
INFO:root:Ran Node(9cd9ce4e-16d7-4a43-83b7-a8e01e8bd8ba, <calc_d>, [], ['d']) in 0:00:01.618884
INFO:root:Ran Node(ea8b8c8f-d8fc-4967-a7c5-c3ba0dbcd550, <calc_b>, [], ['b']) in 0:00:01.519026
c
INFO:root:Ran Node(ac6d7004-7cb1-41ff-8476-a2a1ce9e64d6, <calc_c>, ['a', 'b'], ['c']) in 0:00:01.029356
INFO:root:Ran Node(7786d30e-7796-4d19-a692-07f2904ea6c8, <calc_pfd>, ['c', 'd'], ['e']) in 0:00:00.853072
INFO:root:Calculation finished in 0:00:07.152394
[[0.32979496 0.00617538 0.01675385 ... 0.08284045 0.03303956 0.09351132]
[0.00268712 0.20226707 0.06033366 ... 0.07918911 0.01333745 0.15655172]
[0.0007408 0.01337496 0.17597583 ... 0.19520472 0.0274126 0.07911974]
...
[0.00958562 0.00919059 0.10846052 ... 0.01235475 0.02207799 0.26674223]
[0.06822633 0.03539608 0.08139489 ... 0.08097827 0.10901089 0.02113664]
[0.01915152 0.00518849 0.34347554 ... 0.04939359 0.48837681 0.11771939]]
Filename: memtest.py

Line # Mem usage Increment Line Contents

 5   29.688 MiB   29.688 MiB   @profile
 6                             def main():
 7   29.688 MiB    0.000 MiB       graph = Graph()
 8                             
 9   29.691 MiB    0.000 MiB       @graph.register()
10   29.691 MiB    0.004 MiB       def calc_a():
11  541.562 MiB  511.871 MiB           a = np.random.rand(8192,8192)
12  541.562 MiB    0.000 MiB           return a
13                             
14 1053.578 MiB    0.000 MiB       @graph.register()
15   29.691 MiB    0.000 MiB       def calc_b():
16 1565.590 MiB  512.012 MiB           b = np.random.rand(8192,8192)
17 1565.590 MiB    0.000 MiB           return b
18                             
19 1565.590 MiB    0.000 MiB       @graph.register()
20   29.691 MiB    0.000 MiB       def calc_c(a,b):
21 1565.590 MiB    0.000 MiB           gc.collect()
22 2077.605 MiB  512.016 MiB           c = a * b 
23 2077.605 MiB    0.000 MiB           print("c")
24 2077.605 MiB    0.000 MiB           return c
25                             
26  541.562 MiB    0.000 MiB       @graph.register()
27   29.691 MiB    0.000 MiB       def calc_d():
28  541.562 MiB    0.000 MiB           gc.collect()
29 1053.578 MiB  512.016 MiB           d = np.random.rand(8192,8192)
30 1053.578 MiB    0.000 MiB           print("d")
31 1053.578 MiB    0.000 MiB           return d
32                             
33 2077.605 MiB    0.000 MiB       @graph.register()
34   29.691 MiB    0.000 MiB       def calc_pfd(c,d):
35 2077.605 MiB    0.000 MiB           gc.collect()
36 2589.621 MiB  512.016 MiB           e = c * d
37 2589.621 MiB    0.000 MiB           return e
38                             
39   29.691 MiB    0.000 MiB       gc.collect()
40 2589.621 MiB    0.000 MiB       res = graph.calculate(data={})
41 2589.621 MiB    0.000 MiB       gc.collect()
42 2589.621 MiB    0.000 MiB       print(res)
43 2589.621 MiB    0.000 MiB       del res
44 2589.621 MiB    0.000 MiB       gc.collect()
45   29.730 MiB    0.000 MiB       del graph
46   29.730 MiB    0.000 MiB       gc.collect()

After calc_c has run, a and b should be able to be garbage collected, but it seems a reference is held by graph to every output.
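
One way an executor could release intermediates is to count the remaining consumers of each value and drop the value once that count reaches zero. The sketch below illustrates the idea with a toy executor; run_with_release and its node format are hypothetical and unrelated to pyungo's internals.

```python
from collections import Counter


def run_with_release(nodes, wanted):
    """Run nodes in order, dropping each value after its last consumer has run.

    `nodes` is a topologically sorted list of (func, input_names, output_name).
    `wanted` names the outputs to keep; other consumed values are released.
    Returns the remaining data and the peak number of values held at once.
    """
    # Count how many nodes still need each value.
    pending = Counter(name for _, inputs, _ in nodes for name in inputs)
    data = {}
    peak = 0
    for func, inputs, output in nodes:
        data[output] = func(*(data[name] for name in inputs))
        peak = max(peak, len(data))
        for name in inputs:
            pending[name] -= 1
            if pending[name] == 0 and name not in wanted:
                del data[name]  # last consumer has run; allow garbage collection
    return data, peak


# Same shape as the sample program, with small lists standing in for arrays.
nodes = [
    (lambda: [1.0] * 4, [], "a"),
    (lambda: [2.0] * 4, [], "b"),
    (lambda a, b: [x * y for x, y in zip(a, b)], ["a", "b"], "c"),
    (lambda: [3.0] * 4, [], "d"),
    (lambda c, d: [x * y for x, y in zip(c, d)], ["c", "d"], "e"),
]
result, peak = run_with_release(nodes, wanted={"e"})
```

In this toy run at most three values are alive at any time, instead of all five.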

Create a node without decorator

Add a method in Graph to register a new node without using a decorator:

graph.add_node(inputs=['a', 'b'], outputs=['c'], function=f_my_function)

Why does __init__.py not import anything?

Hi. I understand this is a style choice, but why use an empty __init__.py instead of filling it with from .core import *?
If you do it the second way, you can import with from pyungo import Graph instead of from pyungo.core import Graph, which is nice because I had no idea core.py existed inside of this package until I looked. Thank you!

RFE: support sub-graphs

It would be nice to add a method like:

bigger_graph = Graph.add_subgraph(some_graph)

and port all nodes from some_graph into bigger_graph.

Comparison to alternatives in README

To help a prospective user, perhaps a short comparison to similar packages could be made in the README, such as to Dask and GraphKit.

RFE: allow for optional kwargs for input data

Python's kwargs are optional (since defaults are given in the function declaration).
In this library, however, kwargs are not optional: each one must be present in the input data, or an error is raised.

These two facts cause a mismatch when converting traditional code into a graph-pipeline.

If it is feasible, it would really help to add another optional keyword in Graph.add_node().

Steps to reproduce

The following code:

graph = pyungo.Graph()

@graph.register(inputs=['a'], kwargs=['b'], outputs=['c'])
def f(a, b=2):
    return a + b

graph.calculate({'a': 1})

... raises PyungoError: The following inputs are needed: ['b']
while the function is fully capable of working without b.

Proposal

This should work:

@graph.register(inputs=['a'], optional=['b'], outputs=['c'])
def f(a, b=2):
    return a + b

graph.calculate({'a': 1})

and produce 3.
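
The requested semantics can be sketched outside the library as follows. call_with_optional is a hypothetical helper, not part of pyungo: required inputs must be present, while optional ones are passed only when the data contains them, so the function's own defaults apply otherwise.

```python
def call_with_optional(func, data, inputs, optional=()):
    """Call `func` with required inputs and whichever optional inputs exist."""
    missing = [name for name in inputs if name not in data]
    if missing:
        raise KeyError(f"The following inputs are needed: {missing}")
    args = [data[name] for name in inputs]
    # Optional names are forwarded as keywords only when present in the data.
    kwargs = {name: data[name] for name in optional if name in data}
    return func(*args, **kwargs)


def f(a, b=2):
    return a + b


default_result = call_with_optional(f, {"a": 1}, inputs=["a"], optional=["b"])
override_result = call_with_optional(
    f, {"a": 1, "b": 10}, inputs=["a"], optional=["b"]
)
```

The first call falls back to b=2 and the second uses the supplied value of b.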

Need to be able to pass static values at node definition

Identify function kwargs from other arguments

Need to identify optional keyword arguments when using imported functions.
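
For plain Python functions, inspect.signature already exposes which parameters have defaults. A possible sketch follows; split_parameters and imported_function are hypothetical names, and some built-in or C-implemented callables do not provide a signature at all.

```python
import inspect


def split_parameters(func):
    """Return (required_names, {optional_name: default}) for a function."""
    required, optional = [], {}
    for name, param in inspect.signature(func).parameters.items():
        # *args and **kwargs carry no fixed names, so they are skipped here.
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            optional[name] = param.default
    return required, optional


def imported_function(a, b=2, *, scale=1.0):
    """Stand-in for a function imported from another module."""
    return (a + b) * scale


required, optional = split_parameters(imported_function)
```

Here a is reported as required, while b and scale are reported as optional together with their defaults.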

RFE: allow to have nodes producing the same output

There is a use case for having multiple nodes produce the same output,
and deciding only at calculation time which path to use.

Example: convert units, and have multiple input-units convert to the same output.

It would even be useful to have a flag, set at calculation time, that controls whether to raise when duplicate outputs are detected or just issue a warning and choose an arbitrary node, for cases where the inputs are duplicated and all paths produce the same result. Further down the road, the flag could become a tri-state, with a third mode that calculates all paths, compares the results, and raises only if they differ.
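
A rough sketch of such a tri-state flag, using the unit-conversion example. OnDuplicate and resolve_output are hypothetical names, and the producer format (a function plus its input names) is an assumption made for illustration only.

```python
import warnings
from enum import Enum


class OnDuplicate(Enum):
    RAISE = "raise"      # fail when several nodes can produce the output
    WARN = "warn"        # warn and use the first runnable node
    COMPARE = "compare"  # run all paths, fail only if results differ


def resolve_output(name, producers, data, policy=OnDuplicate.RAISE):
    """Compute `name` from `producers`, a list of (func, input_names)."""
    # Only nodes whose inputs are all available can run.
    runnable = [(f, ins) for f, ins in producers if all(i in data for i in ins)]
    if not runnable:
        raise KeyError(f"no node can produce {name!r} from the given inputs")
    if len(runnable) == 1 or policy is OnDuplicate.WARN:
        if len(runnable) > 1:
            warnings.warn(f"{len(runnable)} nodes produce {name!r}; using the first")
        func, ins = runnable[0]
        return func(*(data[i] for i in ins))
    if policy is OnDuplicate.RAISE:
        raise ValueError(f"{len(runnable)} nodes produce {name!r}")
    # COMPARE: run every path and check that the results agree.
    results = [f(*(data[i] for i in ins)) for f, ins in runnable]
    if any(r != results[0] for r in results[1:]):
        raise ValueError(f"paths to {name!r} disagree: {results}")
    return results[0]


# Two nodes converting different input units to metres.
producers = [
    (lambda km: km * 1000, ["km"]),
    (lambda cm: cm / 100, ["cm"]),
]
from_km = resolve_output("m", producers, {"km": 2})
both = resolve_output("m", producers, {"km": 2, "cm": 200000}, OnDuplicate.COMPARE)
```

The comparison here uses exact equality; floating-point results would probably need a tolerance.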