pythonflow's Introduction

Pythonflow: Dataflow programming for Python.

Pythonflow is a simple implementation of dataflow programming for Python. Users of TensorFlow will immediately be familiar with the syntax.

At Spotify, we use Pythonflow in data preprocessing pipelines for machine learning models because

  • it automatically caches computationally expensive operations,
  • any part of the computational graph can be easily evaluated for debugging purposes,
  • it allows us to distribute data preprocessing across multiple machines.

See the documentation for details.

pythonflow's People

Contributors

kant, novitk, tillahoffmann

pythonflow's Issues

Can you use Pythonflow in OOP?

Hi there, thanks for open-sourcing this library. I am wondering whether Pythonflow can support object-based, stateful DAG relationships.

I have seen a similar library before, and a similar question was posted here:
man-group/mdf#23

I really think it would be cool for Python to have this kind of dataflow toolkit. It would be great to hear the case for Pythonflow.

Project inactive - any reason?

Hi - looks like this project isn't active any longer (even though it accumulated 215 stars in short order ;-) ).

Any reason why? The ideas behind it look quite powerful.

Best,
Michael

Nondeterministic string hashing in Python (>=3.3)

I was running into some weird issues with incorrect file caching of a function applied to a string.

This is because Python (3.3 and later) salts its hash function by default (for strings at least).
Specifically:

> python -c "print(hash('asdf'))"
-8690208562067163084
> python -c "print(hash('asdf'))"
-4220296486527231708

The fix for this is to pass in PYTHONHASHSEED=1.
The 'proper' fix would be to substitute the internal hash function for something more suitable, however I couldn't immediately see the right place to inject that.

PYTHONHASHSEED=1 python -c "print(hash('asdf'))"
-5132432945605986887
PYTHONHASHSEED=1 python -c "print(hash('asdf'))"
-5132432945605986887
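
As a rough sketch of what a "more suitable" hash could look like, the snippet below derives a digest with hashlib instead of the built-in hash(), so the result does not depend on PYTHONHASHSEED. The name stable_hash is hypothetical and is not part of Pythonflow; where such a function would be plugged into the library is left open, as in the report above.

```python
import hashlib
import pickle


def stable_hash(value):
    """Return a hex digest that is identical across interpreter runs.

    Hypothetical helper, not a Pythonflow API. Strings and bytes are hashed
    directly; other values fall back to pickle, which is only stable for
    objects whose pickled form does not depend on set or hash ordering.
    """
    if isinstance(value, str):
        payload = value.encode("utf-8")
    elif isinstance(value, bytes):
        payload = value
    else:
        payload = pickle.dumps(value, protocol=4)
    # Prefix the type name so that "a" and b"a" do not share a digest.
    tagged = type(value).__name__.encode("utf-8") + b":" + payload
    return hashlib.sha256(tagged).hexdigest()


digest = stable_hash("asdf")
print(digest)  # same value in every process, regardless of PYTHONHASHSEED
```

A cryptographic digest is slower than hash(), which may or may not matter for a file cache keyed on function arguments.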

Handling tuple of operations

The following code raises TypeError: not all arguments converted during string formatting

def f(z):
    return z, z

with pf.Graph() as g:
    x = pf.placeholder('x')
    y = f(x)
    
g([y], x=1)

However this works:

def f(z):
    return z, z

def tuple_op(x, y):
    return x, y

with pf.Graph() as g:
    x = pf.placeholder('x')
    y = pf.func_op(tuple_op, *f(x))
    
g([y], x=1)

and it outputs ((1, 1),)

It seems that Pythonflow does not properly handle tuples of operations. It would be useful to be able to evaluate a tuple of operations easily, for example when one wants to define parts of the graph in an auxiliary function as above.
Adding a @pf.opmethod() decorator does not work if there are Pythonflow operations in the function, since it encapsulates an operation inside another operation. It raises the error ValueError: 'graph' must be given explicitly or a default graph must be set.

Evaluating operations outside the `with` block in Jupyter Notebook grows the `graph.operations` dict

This seems like a bug, and is probably related to the fact that I am running this in a Jupyter notebook. Given this example:

import pythonflow as pf

with pf.Graph() as graph:
    a = pf.constant(4, name='a')
    b = pf.constant(38, name='b')
    x = (a + b).set_name('x')

graph.operations looks fine initially:

{'a': <pf.func_op 'a' target=<function identity at 0x106500310> args=<1 items> kwargs=<0 items>>,
 'b': <pf.func_op 'b' target=<function identity at 0x106500310> args=<1 items> kwargs=<0 items>>,
 'x': <pf.func_op 'x' target=<built-in function add> args=<2 items> kwargs=<0 items>>}

However, if I evaluate any of the Operation instances in a notebook cell (e.g. a or b at the end of the cell), the graph.operations dictionary grows with additional items, and it keeps growing by a dozen new getattr operations every time I evaluate any of the operations.

Here is a self-explanatory screenshot:

[image]

Is there a way to prevent this behavior?
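
One possible explanation, offered as an assumption rather than a confirmed diagnosis: when a cell ends with an expression, the notebook front end looks up rich-display hooks such as _repr_html_ on the result, and an object whose __getattr__ turns every unknown attribute into a new operation would register one operation per lookup. The toy classes below are hypothetical stand-ins, not Pythonflow code; they only illustrate the mechanism and one way to guard against it.

```python
class RecordingOp:
    """Toy stand-in for an operation whose attribute access creates new nodes."""

    def __init__(self, registry, name):
        self.registry = registry
        self.name = name
        registry[name] = self

    def __getattr__(self, attr):
        # Only reached when normal lookup fails, e.g. for `_repr_html_`.
        new_name = "getattr(%s, %s)" % (self.name, attr)
        return type(self)(self.registry, new_name)


class GuardedOp(RecordingOp):
    """Same toy class, but underscore-prefixed lookups fail as usual."""

    def __getattr__(self, attr):
        if attr.startswith("_"):
            raise AttributeError(attr)
        return super().__getattr__(attr)


def probe_like_a_frontend(obj):
    # A rich-display front end checks for hooks like these on the cell result.
    for hook in ("_repr_html_", "_repr_png_", "_repr_json_"):
        getattr(obj, hook, None)


plain = {}
probe_like_a_frontend(RecordingOp(plain, "a"))

guarded = {}
probe_like_a_frontend(GuardedOp(guarded, "a"))

print(len(plain), len(guarded))  # 4 1
```

If this is indeed the cause, ending the cell with print(a) instead of a bare a would avoid the rich-display lookups without changing the library.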

graph function call error

Hi, I used the example code as follows and got an error.

>>> import pythonflow as pf
>>>
>>> with pf.Graph() as graph:
...     a = pf.constant(4)
...     b = pf.constant(38)
...     x = a + b
...
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/Users/leepand/Downloads/BigRLab_APIs/flask_lab/pythonflow/pythonflow/core.py", line 329, in __add__
    return add(self, other, graph=self.graph)
  File "/Users/leepand/Downloads/BigRLab_APIs/flask_lab/pythonflow/pythonflow/core.py", line 480, in _wrapper
    return func_op(target, *args, **kwargs_inner, **kwargs)
  File "/Users/leepand/Downloads/BigRLab_APIs/flask_lab/pythonflow/pythonflow/core.py", line 458, in __init__
    super(func_op, self).__init__(*args, **kwargs)
TypeError: __init__() got multiple values for argument 'graph'

re-compute parts of graph that are invalidated by changed placeholders?

I'm trying to do the following sequence:

  • (a) define some context C (by setting values of some placeholders)
  • (b) compute some op P in a graph G with context C
  • (c) now of course context C contains the state of the graph G, including all intermediate ops
  • (d) update some placeholder(s) in C to new values
  • (e) compute some op Q (could be same as P above, but needn't be) in graph G, in updated context C

Now when computing Q, I want the graph to re-compute any ops that were invalidated by the updated placeholders in step (d).

But this doesn't happen because the context C is fully respected, i.e. all ops whose values are in C are taken as-is, even if they are invalid given the new values of the updated placeholders. Here's a toy example to show this:

import pythonflow as pf
import random
random.seed(1)

gr = pf.Graph()
with gr as graph:
  b = pf.placeholder(name='b')
  uniform = pf.func_op(random.uniform, 0, b, name = 'uniform')  # depends on b
  scaled_uniform = uniform*10

context = dict(b=1.0)
graph(scaled_uniform, context)

context[b] = 2.0     # update b to new value

# below I'd like 'uniform' to be re-calculated since the 'uniform' op  
# depends on "b", which has changed

graph(scaled_uniform, context)

# but it does NOT recompute it since "uniform" is 
# already in the context, and it uses its value

Is there some way to get the behavior I want?
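
One way to approximate this from the outside, sketched here under stated assumptions: before re-evaluating, drop from the context every cached value that lies downstream of a changed placeholder, so that those ops have to be computed again. The invalidate helper and the dependents map are hypothetical and use plain string names; Pythonflow does not provide them, and a real context may be keyed by operation objects instead.

```python
def invalidate(context, dependents, changed):
    """Remove cached values of all ops downstream of the `changed` names.

    `dependents` maps a name to the names that consume it directly.
    Returns the set of names that were dropped from `context`.
    """
    stack = list(changed)
    dropped = set()
    while stack:
        name = stack.pop()
        for child in dependents.get(name, ()):
            if child not in dropped:
                dropped.add(child)
                context.pop(child, None)  # forget the stale value, if any
                stack.append(child)
    return dropped


# Hand-written dependency map mirroring the toy graph above:
# b -> uniform -> scaled_uniform
dependents = {"b": ["uniform"], "uniform": ["scaled_uniform"]}

# State after the first evaluation (the cached numbers are made up).
context = {"b": 1.0, "uniform": 0.13, "scaled_uniform": 1.3}

context["b"] = 2.0  # update the placeholder
dropped = invalidate(context, dependents, ["b"])

print(context)          # {'b': 2.0}
print(sorted(dropped))  # ['scaled_uniform', 'uniform']
```

A simpler alternative with the same effect is to keep the placeholder values in their own dict and pass a fresh copy of it to each evaluation, at the cost of recomputing everything.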

Pip install: Syntax Error

Steps to reproduce

  1. Install pythonflow with pip.
  2. Run first example in docs:
import pythonflow as pf

with pf.Graph() as graph:
    a = pf.constant(4)
    b = pf.constant(38)
    x = a + b

Traceback (most recent call last):
  File "C:/Users/admin/Desktop/pythonflow_testing.py", line 1, in <module>
    import pythonflow as pf
  File "C:\Users\admin\miniconda\envs\pythonflow\lib\site-packages\pythonflow\__init__.py", line 18, in <module>
    from .core import *
  File "C:\Users\admin\miniconda\envs\pythonflow\lib\site-packages\pythonflow\core.py", line 121
    def __call__(self, fetches, context=None, *, callback=None, **kwargs):
                                               ^
SyntaxError: invalid syntax
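
A plausible reading of this traceback, stated as an assumption since the report does not give the interpreter version: the caret points at the bare * in the signature, which is keyword-only argument syntax and only parses on Python 3. The sketch below uses a hypothetical function with the same signature shape to show what that syntax does; on Python 2 the file would fail at compile time, exactly as above.

```python
import sys

print(sys.version_info[:2])  # worth checking inside the failing environment


def call(fetches, context=None, *, callback=None, **kwargs):
    """Hypothetical function; everything after the bare `*` is keyword-only."""
    return fetches, context, callback, kwargs


# Passing `callback` by keyword works.
result = call("x", None, callback=print)

# Passing it positionally is rejected at call time.
try:
    call("x", None, print)
except TypeError as error:
    positional_error = error
else:
    positional_error = None

print(type(positional_error).__name__)  # TypeError
```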

Unexpected(?) behaviour in `Graph.apply`

I ran into an issue today while attempting repeated evaluations of a graph. My rough usage pattern looks like this:

context = {'a': 1, 'b': 2}
for df in frames:
	out = graph(outputs, context=context, frame=df)

This worked fine for the first call but failed with the following error on the second call:

ValueError: duplicate value for operation '<pf.placeholder 'first_input'>'

After some digging, I noticed that the Graph.normalize_context method modifies the context in-place. This is not mentioned in the documentation, and given the semantics of Graph.apply it does not seem like an expected (or desired?) side-effect.

Is this something that should be updated in the documentation? Should Graph.normalize_context be modified to copy the context and not modify it in-place? Or is there a better pattern I should be using if I need to repeatedly call the graph object with some constant context and a handful of varying placeholders? (It probably wouldn't hurt to update the documentation with that pattern if there is one).
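
Until that is settled, one caller-side pattern is to hand each call its own shallow copy of the constant context. The sketch below is self-contained: evaluate is a hypothetical stand-in that merges keyword arguments into the context in place and rejects duplicates, mimicking the behaviour described above; it is not the Pythonflow implementation.

```python
def evaluate(context, **kwargs):
    """Hypothetical stand-in for a call that mutates `context` in place."""
    for key, value in kwargs.items():
        if key in context:
            raise ValueError("duplicate value for operation %r" % key)
        context[key] = value
    return sum(context.values())


constant_context = {"a": 1, "b": 2}
frames = [10, 20, 30]

# dict(...) makes a fresh shallow copy per call, so the in-place merge
# never touches the shared mapping.
outputs = [evaluate(dict(constant_context), frame=frame) for frame in frames]

print(outputs)           # [13, 23, 33]
print(constant_context)  # {'a': 1, 'b': 2}
```

The copy is shallow: the mapping is duplicated, but the values themselves are shared, which is usually what is wanted for large inputs such as data frames.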

ZeroMQ dependency

I was wondering whether it would be possible to make ZeroMQ an optional dependency that is only needed for distributed processing?

In the industry where I'm interested in using Pythonflow (VFX and animation), ZeroMQ is hard to distribute because of the tools we use.
Furthermore, we would probably look at other systems for distributing the processing, such as a render manager like Thinkbox.
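
For illustration, an optional dependency is commonly handled with a guarded import, so that the package imports cleanly without pyzmq and only the distributed code path requires it. This is a generic sketch, not a description of how Pythonflow is organised; HAS_ZMQ and require_zmq are hypothetical names.

```python
try:
    import zmq  # provided by pyzmq; only needed for distributed processing
except ImportError:
    zmq = None

HAS_ZMQ = zmq is not None


def require_zmq():
    """Return the zmq module, or fail with an actionable message."""
    if zmq is None:
        raise ImportError(
            "distributed processing requires pyzmq; "
            "install it with `pip install pyzmq`"
        )
    return zmq


print("ZeroMQ available:", HAS_ZMQ)
```

On the packaging side, the same idea is usually expressed by listing pyzmq under setuptools' extras_require rather than install_requires.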
