cedricleroy / pyungo
Function dependencies resolution and execution
Home Page: https://cedricleroy.github.io/pyungo/
License: MIT License
It would be nice to be able to display the function names instead of the node indices when printing the graph dependencies.
I propose to auto-populate Graph.add_node() calls based on function and argument names (using the inspect module from Python's standard library), and to allow some form of string filtering on the function and argument names.
Admittedly, this would work for single outputs only.
The proposal is easier to explain with sample client code:
def funcname_chopper(funcname):
    # Strip a known prefix so that e.g. make_c yields the output name 'c'.
    for prefix in ['calc_', 'compute_', 'make_']:
        if funcname.startswith(prefix):
            return funcname[len(prefix):]
    return funcname

graph = pyungo.Graph(
    outname_converter=funcname_chopper)

# equivalent to: register(inputs=['a', 'b'], outputs=['c'])
@graph.register
def make_c(a, b):
    return a + b

# equivalent to: register(inputs=['a', 'b', 'c'], outputs=['calc_d'])
@graph.register(inpname_converter=lambda n: n[5:], outname_converter=None)
def calc_d(some_a, look_b, stop_c):
    return some_a + look_b
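The name derivation described in this proposal can be sketched with inspect.signature. This is only an illustration under assumptions: infer_names and strip_prefix are hypothetical helpers, not part of pyungo, and the sketch handles a single output per function.

```python
import inspect


def strip_prefix(funcname, prefixes=("calc_", "compute_", "make_")):
    """Drop the first matching prefix; return the name unchanged otherwise."""
    for prefix in prefixes:
        if funcname.startswith(prefix):
            return funcname[len(prefix):]
    return funcname


def infer_names(func, inpname_converter=None, outname_converter=None):
    """Hypothetical helper: derive (inputs, outputs) from a function signature."""
    params = inspect.signature(func).parameters
    inputs = [inpname_converter(n) if inpname_converter else n for n in params]
    out = outname_converter(func.__name__) if outname_converter else func.__name__
    return inputs, [out]  # one output only, as noted in the proposal


def make_c(a, b):
    return a + b


def calc_d(some_a, look_b, stop_c):
    return some_a + look_b


names_c = infer_names(make_c, outname_converter=strip_prefix)
names_d = infer_names(calc_d, inpname_converter=lambda n: n[5:])
print(names_c)
print(names_d)
```

A register decorator could call such a helper whenever inputs and outputs are not given explicitly.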
Reference: https://github.com/cedricleroy/pyungo/blob/master/pyungo/core.py#L146-L152
I ran into an issue for a specific type of node. The node returns one output that is a list of multiple datetime objects (like a timestamps vector). Because of the lines referenced above, the graph saves only the first item of that returned list: it assumes that multiple outputs will be returned (since iter(res) doesn't fail), but there is only one output_name in the node (like "timestamps" for instance).
Essentially, the for loop goes through the timestamps list and returns the first element of that list as the data to be saved...
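The pitfall can be reproduced outside pyungo. The sketch below is an assumption about the shape of the logic, not a copy of the referenced lines: store_outputs_naive and store_outputs are hypothetical names, and the second function shows one possible fix, which is to unpack only when the node declares more than one output.

```python
from datetime import datetime, timedelta


def store_outputs_naive(output_names, res):
    """Mimics the reported behaviour: any iterable result gets unpacked."""
    try:
        iter(res)
    except TypeError:
        return {output_names[0]: res}
    # zip stops at the shortest sequence, so a single name gets the first item
    return dict(zip(output_names, res))


def store_outputs(output_names, res):
    """Unpack only when the node declares more than one output."""
    if len(output_names) == 1:
        return {output_names[0]: res}
    return dict(zip(output_names, res))


start = datetime(2020, 1, 1)
timestamps = [start + timedelta(hours=h) for h in range(3)]

naive = store_outputs_naive(["timestamps"], timestamps)
fixed = store_outputs(["timestamps"], timestamps)
print(naive["timestamps"])  # a single datetime: the rest of the list is lost
print(fixed["timestamps"])  # the whole list
```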
Adapting the quickstart example:
from pyungo import Graph

schema = {
    "type": "object",
    "properties": {
        "a": {"type": "number"},
        "b": {"type": "number"}
    }
}

graph = Graph(schema=schema)

@graph.register(inputs=['a'], outputs=['b'])
def f1(a):
    return "Hey!"

@graph.register(inputs=['b'], outputs=['c'])
def f2(b):
    return 2 * b

graph.calculate(data={'a': 1})
I was expecting an error, but the pipeline went through and returned Hey!Hey!.
Am I doing something wrong?
This feature is particularly important for data on the internal nodes, because it is not as easy to test them as inputs/outputs.
Hi,
I'm not sure if this is a bug or a feature request.
I have a workflow that is very memory intensive but also very well suited to decomposition to a DAG.
My problem is that if I keep any intermediate outputs in memory I will quickly exceed the capacity of one computer to hold the data in RAM.
I had hoped pyungo would be clever enough to allow intermediate states to be garbage collected, but it doesn't seem so.
See a sample program:
`
from pyungo import Graph
import numpy as np
import gc

@profile
def main():
    graph = Graph()

    @graph.register()
    def calc_a():
        a = np.random.rand(8192,8192)
        return a

    @graph.register()
    def calc_b():
        b = np.random.rand(8192,8192)
        return b

    @graph.register()
    def calc_c(a,b):
        gc.collect()
        c = a * b
        print("c")
        return c

    @graph.register()
    def calc_d():
        gc.collect()
        d = np.random.rand(8192,8192)
        print("d")
        return d

    @graph.register()
    def calc_pfd(c,d):
        gc.collect()
        e = c * d
        return e

    gc.collect()
    res = graph.calculate(data={})
    gc.collect()
    print(res)
    del res
    gc.collect()
    del graph
    gc.collect()
main()
`
Output:
`
(venv) zenbook% python -m memory_profiler memtest.py
INFO:root:Starting calculation...
INFO:root:Ran Node(08f958eb-84ff-49ad-a2fb-a2ada5788705, <calc_a>, [], ['a']) in 0:00:02.127759
d
INFO:root:Ran Node(9cd9ce4e-16d7-4a43-83b7-a8e01e8bd8ba, <calc_d>, [], ['d']) in 0:00:01.618884
INFO:root:Ran Node(ea8b8c8f-d8fc-4967-a7c5-c3ba0dbcd550, <calc_b>, [], ['b']) in 0:00:01.519026
c
INFO:root:Ran Node(ac6d7004-7cb1-41ff-8476-a2a1ce9e64d6, <calc_c>, ['a', 'b'], ['c']) in 0:00:01.029356
INFO:root:Ran Node(7786d30e-7796-4d19-a692-07f2904ea6c8, <calc_pfd>, ['c', 'd'], ['e']) in 0:00:00.853072
INFO:root:Calculation finished in 0:00:07.152394
[[0.32979496 0.00617538 0.01675385 ... 0.08284045 0.03303956 0.09351132]
[0.00268712 0.20226707 0.06033366 ... 0.07918911 0.01333745 0.15655172]
[0.0007408 0.01337496 0.17597583 ... 0.19520472 0.0274126 0.07911974]
...
[0.00958562 0.00919059 0.10846052 ... 0.01235475 0.02207799 0.26674223]
[0.06822633 0.03539608 0.08139489 ... 0.08097827 0.10901089 0.02113664]
[0.01915152 0.00518849 0.34347554 ... 0.04939359 0.48837681 0.11771939]]
Filename: memtest.py
Line # Mem usage Increment Line Contents
5 29.688 MiB 29.688 MiB @profile
6 def main():
7 29.688 MiB 0.000 MiB graph = Graph()
8
9 29.691 MiB 0.000 MiB @graph.register()
10 29.691 MiB 0.004 MiB def calc_a():
11 541.562 MiB 511.871 MiB a = np.random.rand(8192,8192)
12 541.562 MiB 0.000 MiB return a
13
14 1053.578 MiB 0.000 MiB @graph.register()
15 29.691 MiB 0.000 MiB def calc_b():
16 1565.590 MiB 512.012 MiB b = np.random.rand(8192,8192)
17 1565.590 MiB 0.000 MiB return b
18
19 1565.590 MiB 0.000 MiB @graph.register()
20 29.691 MiB 0.000 MiB def calc_c(a,b):
21 1565.590 MiB 0.000 MiB gc.collect()
22 2077.605 MiB 512.016 MiB c = a * b
23 2077.605 MiB 0.000 MiB print("c")
24 2077.605 MiB 0.000 MiB return c
25
26 541.562 MiB 0.000 MiB @graph.register()
27 29.691 MiB 0.000 MiB def calc_d():
28 541.562 MiB 0.000 MiB gc.collect()
29 1053.578 MiB 512.016 MiB d = np.random.rand(8192,8192)
30 1053.578 MiB 0.000 MiB print("d")
31 1053.578 MiB 0.000 MiB return d
32
33 2077.605 MiB 0.000 MiB @graph.register()
34 29.691 MiB 0.000 MiB def calc_pfd(c,d):
35 2077.605 MiB 0.000 MiB gc.collect()
36 2589.621 MiB 512.016 MiB e = c * d
37 2589.621 MiB 0.000 MiB return e
38
39 29.691 MiB 0.000 MiB gc.collect()
40 2589.621 MiB 0.000 MiB res = graph.calculate(data={})
41 2589.621 MiB 0.000 MiB gc.collect()
42 2589.621 MiB 0.000 MiB print(res)
43 2589.621 MiB 0.000 MiB del res
44 2589.621 MiB 0.000 MiB gc.collect()
45 29.730 MiB 0.000 MiB del graph
46 29.730 MiB 0.000 MiB gc.collect()
`
After calc_c has run, a and b should be able to be garbage collected, but it seems a reference is held by graph to every output.
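One way an executor could let intermediates go is to count how many downstream nodes still need each value and drop the value once that count reaches zero. The sketch below is a toy executor written for illustration, not pyungo's implementation; run, its (func, inputs, output) node tuples, and the targets argument are all assumptions. It uses small numbers in place of the large arrays.

```python
def run(nodes, targets):
    """Run nodes given in topological order as (func, inputs, output) tuples.

    Values not listed in `targets` are dropped as soon as their last
    consumer has run. Returns the remaining data and the peak number of
    values held at the same time.
    """
    remaining = {}  # name -> number of consumers that have not run yet
    for _, inputs, _ in nodes:
        for name in inputs:
            remaining[name] = remaining.get(name, 0) + 1

    data, peak = {}, 0
    for func, inputs, output in nodes:
        data[output] = func(*(data[name] for name in inputs))
        peak = max(peak, len(data))
        for name in inputs:
            remaining[name] -= 1
            if remaining[name] == 0 and name not in targets:
                del data[name]  # drop the executor's own reference
    return data, peak


# Same shape and run order as the graph above, with scalars instead of arrays.
nodes = [
    (lambda: 2, [], "a"),
    (lambda: 4, [], "d"),
    (lambda: 3, [], "b"),
    (lambda a, b: a * b, ["a", "b"], "c"),
    (lambda c, d: c * d, ["c", "d"], "e"),
]
result, peak = run(nodes, targets={"e"})
print(result, peak)
```

Dropping the dictionary entry only frees memory if nothing else still references the object, so the node functions themselves must not keep the arrays alive either.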
Add a method in Graph to register a new node without using a decorator:
graph.add_node(inputs=['a', 'b'], outputs=['c'], function=f_my_function)
Hi. I understand this is a style choice, but why use an empty __init__.py instead of filling it with from .core import *?
If you do it the second way, you can import with from pyungo import Graph instead of from pyungo.core import Graph, which is nice because I had no idea core.py existed inside of this package until I looked. Thank you!
It would be nice to add a method like:
bigger_graph = Graph.add_subgraph(some_graph)
and port all nodes from some_graph into bigger_graph.
These two facts cause a mismatch when converting traditional code into a graph-pipeline.
If it is feasible, it would really help to add another optional keyword in Graph.add_node().
The following code:
graph = pyungo.Graph()

@graph.register(inputs=['a'], kwargs=['b'], outputs=['c'])
def f(a, b=2):
    return a + b

graph.calculate({'a': 1})
... raises PyungoError: The following inputs are needed: ['b'] while the function is fully capable of working without b.
This should work:
@graph.register(inputs=['a'], optional=['b'], outputs=['c'])
def f(a, b=2):
    return a + b

graph.calculate({'a': 1})
and produce 3.
Need to identify optional keyword arguments when using imported functions.
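For functions that cannot be decorated by hand, the parameters with default values can be discovered through inspect.signature. This is a minimal sketch under assumptions: split_parameters is a hypothetical helper, not a pyungo function, and inspect.signature can raise ValueError for some built-in callables that expose no signature.

```python
import inspect


def split_parameters(func):
    """Return (required names, {optional name: default}) for a callable."""
    required, optional = [], {}
    for name, param in inspect.signature(func).parameters.items():
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue  # *args / **kwargs carry no fixed names
        if param.default is inspect.Parameter.empty:
            required.append(name)
        else:
            optional[name] = param.default
    return required, optional


def f(a, b=2):
    return a + b


required, optional = split_parameters(f)

# Pass an optional argument only when the caller actually supplied it.
data = {'a': 1}
kwargs = {name: data[name] for name in optional if name in data}
result = f(*(data[name] for name in required), **kwargs)
print(required, optional, result)
```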
There is a use case for having multiple nodes produce the same output, and deciding only at calculation time which path to use.
Example: unit conversion, where multiple input units convert to the same output.
It would be even more useful to have a flag, set at calculation time, that controls whether to raise when duplicate outputs are detected or to issue a warning and choose an arbitrary node, for cases where the inputs are duplicated and all paths produce the same result. Further down the road, the flag could become a tri-state that also allows calculating all paths, comparing the results, and raising only if they differ.
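The tri-state flag can be sketched as follows. Everything here is hypothetical: resolve, the on_duplicate values "raise", "warn" and "compare", and the provider list are illustrative names, not an existing pyungo interface.

```python
import warnings


def resolve(output, providers, data, on_duplicate="raise"):
    """Compute `output` from whichever providers have all inputs available.

    providers: list of (func, input_names) pairs that all produce `output`.
    """
    runnable = [(f, ins) for f, ins in providers if all(n in data for n in ins)]
    if not runnable:
        raise KeyError("no provider can compute %r" % output)
    if len(runnable) > 1:
        if on_duplicate == "raise":
            raise ValueError("several nodes can produce %r" % output)
        if on_duplicate == "warn":
            warnings.warn("several nodes can produce %r; using the first" % output)
            runnable = runnable[:1]
        elif on_duplicate != "compare":
            raise ValueError("unknown policy %r" % on_duplicate)
    results = [f(*(data[n] for n in ins)) for f, ins in runnable]
    # Only the "compare" policy leaves more than one result to check here.
    if any(r != results[0] for r in results[1:]):
        raise ValueError("paths to %r disagree: %r" % (output, results))
    return results[0]


# Unit conversion: two input units lead to the same output.
providers = [
    (lambda cm: cm / 100, ["length_cm"]),
    (lambda km: km * 1000, ["length_km"]),
]
single = resolve("length_m", providers, {"length_km": 1.5})
both = resolve("length_m", providers,
               {"length_cm": 150000, "length_km": 1.5}, on_duplicate="compare")
print(single, both)
```

Exact equality is used for the comparison to keep the sketch short; floating-point results would normally need a tolerance.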