entilzha / pyfunctional
Python library for creating data pipelines with chain functional programming
Home Page: http://pyfunctional.pedro.ai
License: MIT License
While working to add aggregate, I noticed a major bug which affects fold_left and fold_right. Referencing the Scala documentation for foldLeft shows that given a sequence of type A, the passed function should have the type func(x: B, y: A) => B. This means that x should be the current folded value and y should be the next value to fold. Currently, fold_left and fold_right behave with these arguments reversed, which is inconsistent with both Scala and the similar aggregate function defined in the LINQ documentation.
To confirm this behavior:
Scala REPL
List("a", "b", "c").foldLeft("")((current, next) => current + ":" + next)
res3: String = :a:b:c
Python Terminal
In [1]: seq('a', 'b', 'c').fold_left("", lambda current, next: current + ":" + next)
Out[1]: 'c:b:a:'
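For reference, a minimal standalone sketch (not the library's implementation) of fold_left with the corrected argument order, matching Scala's foldLeft:

# Sketch: the passed function receives (current_accumulated_value, next_element)
def fold_left(iterable, zero_value, func):
    result = zero_value
    for element in iterable:
        result = func(result, element)
    return result

# With the corrected order, the example above produces ':a:b:c', matching Scala
print(fold_left(['a', 'b', 'c'], "", lambda current, next: current + ":" + next))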
Correcting this bug introduces a breaking change with all previous versions of ScalaFunctional which contain fold_left, namely 0.2.0, 0.3.0, and 0.3.1. Since this fix is a breaking change, it will not be backported to those versions as patches (third number in the version), but will be introduced in 0.4.0.
As part of #38, add an alias method for select
Add documentation, change summary, and pypi keywords to improve discoverability for users looking for LINQ-like features.
Child to #38
So far the only way to ingest data into ScalaFunctional is to read from Python-defined data structures. It would be helpful to be able to read directly from data formats such as json/sql/csv.
The target milestone for everything completed will be 0.4.0.
This issue will serve as a parent issue for implementing each specific function.
Child issues:
#34 seq.open
#35 seq.range
#36 seq.csv
#37 seq.jsonl
#29 seq.json
#30 to_json
#31 to_csv
#32 to_file
#33 to_jsonl
Looking into this. I suspect it's that the wheel is built using Python 2, which means that, unlike the source distribution, the code in setup.py to handle the correct version is not functioning correctly. This didn't appear to be a problem before because part of this release fixed using the correct requirements list.
Doing my best to get a good fix out tonight and bump to 0.4.1 to avoid breaking things on pip.
The general bug or undesirable behavior comes from using a generator from ScalaFunctional twice. This will, of course, make it keep returning nothing since the generator is exhausted. While in general this is expected behavior, there are some functions where this could be prevented.
Specifically, for to_list, list, to_set, set, to_dict, and dict, since the sequence is getting expanded, it should also get stored for future calls. This increases memory use, but only by a constant factor.
In general, need to look at the library and consider where it makes sense to do this (sparingly).
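As a rough sketch of the caching idea, using a hypothetical wrapper class rather than the library's actual Sequence: once an expanding call such as to_list runs, replace the backing iterable with the materialized list so later calls see the same data.

class CachingSequence(object):
    # Hypothetical wrapper illustrating caching on expansion.
    def __init__(self, iterable):
        self._iterable = iterable

    def to_list(self):
        # Expanding the sequence also stores it, so repeated calls do not
        # hit an exhausted generator; memory grows by one stored copy.
        self._iterable = list(self._iterable)
        return self._iterable

    def to_set(self):
        return set(self.to_list())

gen = (x * x for x in range(5))
s = CachingSequence(gen)
print(s.to_list())  # [0, 1, 4, 9, 16]
print(s.to_set())   # still populated, because the generator was cached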
From python set manipulation, implement intersection, difference, and union.
Because Python 3 no longer has a cmp argument in sorting/ordering functions.
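For context, the standard-library workaround when a cmp-style comparison is still wanted is functools.cmp_to_key, for example:

from functools import cmp_to_key

def compare(a, b):
    # Old-style cmp: negative, zero, or positive
    return (a > b) - (a < b)

print(sorted([3, 1, 2], key=cmp_to_key(compare)))  # [1, 2, 3]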
Already implemented, created for bookkeeping
Child of #19
Child to #54 for implementing file open/writing for compressed file formats. This issue is specific to implementing this for xz files using https://docs.python.org/3/library/lzma.html
Via Twitter it was suggested to check out toolz. I think it's worth looking into as powering part of the backend. It may help clean up code, make certain things easier to extend, and improve performance.
I've posted to their mailing list expressing interest in collaboration.
Child to #54 for implementing file open/writing for compressed file formats. This issue is specific to implementing this for bz2 files using https://docs.python.org/3.5/library/bz2.html
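One possible shape for the compression handling across these child issues (a sketch only, not the project's implementation) is to dispatch on the file extension to the matching standard-library module:

import bz2
import gzip
import lzma

def open_maybe_compressed(path, mode='rt'):
    # Hypothetical helper: pick an opener based on the file extension.
    if path.endswith('.gz'):
        return gzip.open(path, mode)
    if path.endswith('.bz2'):
        return bz2.open(path, mode)
    if path.endswith('.xz'):
        return lzma.open(path, mode)
    return open(path, mode)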
In this ticket, implement the seq.json function. This should be styled similarly to other functions in functional.streams. The primary decision points are:
If the root of the json is a list, seq.json will make each element in the json list an element in the sequence.
If the root of the json is a dictionary, seq.json will return a list of (Key, Value) pairs where the keys are the root dictionary keys and the values are the corresponding values.
The behavior of the second is consistent with the fact that Sequence is storing a list, not other collection types, and that in the context of functional, a dictionary is best represented as a list of (Key, Value) pairs.
Child of #19
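A sketch of how both decision points might behave, written as a hypothetical standalone helper using the standard json module:

import json

def read_json(path):
    with open(path) as f:
        data = json.load(f)
    if isinstance(data, list):
        # json list at the root: each element becomes a sequence element
        return data
    # json dictionary at the root: represent it as (Key, Value) pairs
    return list(data.items())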
Adding a sliding window transformation with API matching http://www.scala-lang.org/api/current/index.html#scala.collection.Seq@sliding(size:Int,step:Int):Iterator[Repr]
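A minimal sketch of the sliding semantics (size and step, as in the Scala API; simplified, not the eventual implementation):

def sliding(sequence, size, step=1):
    # Yield consecutive windows of `size` elements, advancing by `step`.
    sequence = list(sequence)
    for i in range(0, len(sequence) - size + 1, step):
        yield sequence[i:i + size]

print(list(sliding([1, 2, 3, 4, 5], size=3, step=1)))
# [[1, 2, 3], [2, 3, 4], [3, 4, 5]]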
Implement aggregate as defined in the Scala docs.
Issue for bookkeeping, already implemented. seq has been modified to support this behavior:
>>> # Already supported
>>> seq([1, 2, 3])
[1, 2, 3]
>>> # Newly added
>>> seq(1, 2, 3)
[1, 2, 3]
>>> seq(1)
[1]
>>> # Behavior changed, used to expand string
>>> seq("abc")
["abc"]
With this code, I traced all function calls for a simple operation.
import sys
from functional import seq

def tracefunc(frame, event, arg, indent=[0]):
    if event == "call":
        indent[0] += 2
        print(" " * indent[0] + "|", frame.f_code.co_name)
    elif event == "return":
        indent[0] -= 2
    return tracefunc

sys.settrace(tracefunc)

def dummyPredicate(line):
    return True

list(seq([1, 2, 3, 4, 5]).filter(dummyPredicate))
Here are the results for the master and lineage-rewrite branches: https://gist.github.com/adrian17/5daa0db38fb4340c9f6e
As you can see, dummyPredicate is called twice as often as it should be - it looks like the base collection is iterated twice.
Add a function to_sqlite3 that can write to sqlite3 databases.
Potential API
Writing to sqlite won't be complex if we can supply insertion SQL like below:
the_seq.to_sqlite3("db_path", "insert into test_table (id, name) values (?, ?);")
However, an API without a query, like pandas' to_sql, needs some work:
the_seq.to_sqlite3("db_path", "table_name")
Potential API Description
For inserting, the first example seems fine to me. The reason pandas can do the second one is that it works with structured data for which it knows the types/names. The second query would get translated to something like
insert into table_name (col1, col2, ...) values (?, ...);
where the columns come from the DataFrame's columns.
The second call could be fairly useful and not too difficult to write. Since we don't keep track of columns, in order to do something like this we would have to enforce that the sequence is a sequence of Tuple/namedtuple/List/dict of the same length/form (for dict this would require a scan to determine all dict fields, since that is more friendly than requiring every dict to have every column; for list it would require getting the max-length list). Following that, we could do our best to infer the names of the columns for
insert into test_table (id, name) values (?, ?)
(from namedtuple._fields or by scanning for dict fields) or give up and use
insert into test_table values (?, ?)
The second requirement would be to check the input string against a table name regex to determine what should be done.
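To make the first (explicit SQL) variant concrete, a rough sketch of what to_sqlite3 could do with the standard sqlite3 module; the function name and signature here are assumptions, not the final API:

import sqlite3

def to_sqlite3(rows, db_path, insert_sql):
    # Hypothetical writer: run the user-supplied INSERT for every row.
    with sqlite3.connect(db_path) as conn:
        conn.executemany(insert_sql, rows)

# e.g. to_sqlite3([(1, 'pedro'), (2, 'fritz')], 'test.db',
#                 'insert into test_table (id, name) values (?, ?)')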
Already implemented, created for bookkeeping
Child of #19
As part of #38, add alias method for order_by
I have long thought that the name ScalaFunctional is not that great, but so far haven't done anything about it. I think that this might be a good time to come up with a better name that suits what the project does and where it is going. To be clear, the name change is for the distribution name (repository, website, PyPI); the import name will not change, because too many things would break.
I would like to detail where the name came from, why it's not that good, and what would be desired in a new name by explaining the overall goals for the project. I plan on posting some name ideas later this week after more thought, but would like to get ideas from others as well.
At the end of the issue, I will explain the logistics of the plan so that, TL;DR, nothing breaks.
The import name is functional. functional on PyPI had/has been dead for a long time, but cannot be reclaimed. Since it is dead, a name conflict with it from import functional is unlikely, but it is not possible to claim the dead project's distribution name.
Currently, the project is doing very well in supporting the first two goals. The streams/actions API is very complete, and more or less all common data formats/sources are supported (file compression coming in the next release). The next possible targets would be SQL DBs (with SQLAlchemy or similar), to_pandas, and providing a way to make an SKLearn node (auto-generating a class that satisfies the node API).
I am not quite happy with progress on the third goal, namely making lambda calls more succinct. This is my motivation to at some point natively support something like _ from fn.py. This is paired with the exploratory work I have been doing on a SQL parser/compiler. With the code/understanding I have right now, something like the below is looking pretty easy:
User = namedtuple('User', 'id name city')
seq([User(1, 'pedro', 'boulder'), User(2, 'fritz', 'seattle')]).filter('id < 1').select('name,city')
I am fairly confident that as time goes on, the fourth goal will be better and better met.
The last goal has a few things wrapped in it:
Since seq forces an expansion of its input, I would like to provide a family of seq operations that behave slightly differently for particular use cases. seq will stay the default, sseq (stream sequence) will not force-expand its input, and pseq (parallel sequence) will provide a parallel execution engine. sseq is fairly low hanging fruit of these.
Logistically, the plan is to release the renamed ScalaFunctional as 0.6.0, and eventually 1.0, whenever that might be. The import name will not change, only the distribution/repository name. Open to comment on this or any part of the plan.
Hopefully I didn't forget anything; open to comments on anything at all (including that the name change is not a good idea).
Implemented in: 0ec00cf
zip_with_index behavior is inconsistent with how it is defined in Spark/Scala, and redundant with enumerate. Specifically, it zips with the index on the left hand side of the tuple, instead of the right hand side of the tuple.
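For illustration, the Spark/Scala convention puts the index on the right-hand side, so after the change one would expect something like:

>>> seq(['a', 'b', 'c']).zip_with_index()
[('a', 0), ('b', 1), ('c', 2)]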
There are quite a few common operators which are passed into functions such as map/filter/reduce. It might be a good idea to compile a library of common operators in functional.ops.
In the Spark docs, count returns the number of elements from all partitions without using a predicate. In Scala, count returns the number of elements which satisfy some predicate. In general I think it's better to go with the Scala definitions over Spark (although things like group_by_key are inspired from there). Additionally, len and size already do what the Spark count does.
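Under the Scala-style definition being proposed, count would take a predicate, for example:

>>> seq([1, 2, 3, 4]).count(lambda x: x % 2 == 0)
2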
As part of #38, add an aggregate method.
As described in Pull Request #55, add seq.sqlite3(arg, sql_statement) to the input streams API. arg can be any of:
The input stream comes from the sqlite3 execute(sql_statement) function, which returns an iterable of tuple rows.
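A sketch of the reading side with the standard sqlite3 module (hypothetical helper, not the final API):

import sqlite3

def read_sqlite3(db_path, sql_statement):
    # Rows come back as an iterable of tuples, which is what the stream wraps.
    with sqlite3.connect(db_path) as conn:
        for row in conn.execute(sql_statement):
            yield row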
Thanks to @adrian17 for finding this.
Using the file here: http://norvig.com/ngrams/enable1.txt, and the output/images below, it's easy to see that LazyFile is creating a 2x overhead. Currently, the culprit seems to be a combination of additional call overhead to next and that next in builtins.open seems to be implemented in C.
$ python3 -m timeit -s "from functional import seq;" "lines = list(open('/Users/pedro/Desktop/enable1.txt')); seq(lines).select(str.rstrip).count(lambda line: len(line) > 20)"
10 loops, best of 3: 101 msec per loop
$ python3 -m timeit -s "from functional import seq" "seq.open('/Users/pedro/Desktop/enable1.txt').select(str.rstrip).count(lambda line: len(line) > 20)"
10 loops, best of 3: 195 msec per loop
Putting these in files and using pygraphviz gives these call graphs (look at the far right, the rest is not relevant):
In Scala, .product() on an empty list returns 1, 1.0, etc., depending on the type of the List's values.
Here, currently seq([]).product() throws.
I think product should take an optional initializer parameter (in case someone uses classes with overloaded multiplication) with a default value of 1 or 1.0; I don't know which, though.
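For reference, the suggested behavior can be expressed with functools.reduce, where the initializer determines what an empty sequence returns:

from functools import reduce
import operator

def product(iterable, initial=1):
    # Empty input returns the initializer instead of raising.
    return reduce(operator.mul, iterable, initial)

print(product([]))         # 1
print(product([2, 3, 4]))  # 24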
Creating to discuss possibly implementing and better integrating something similar to _ in fn.py. From the 0.5.0 milestone:
Another idea is to implement the _ operator from fn.py. It is quite useful, but it's overkill to require the library as a dependency and gimmicky to check if it exists just to import. This might open doors to integrate it more deeply as well.
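A very small sketch of how a native _-style placeholder could work via operator overloading; this is greatly simplified compared to fn.py and the names are illustrative only:

class Underscore(object):
    # Builds single-argument functions from operator expressions.
    def __init__(self, func=lambda x: x):
        self._func = func

    def __call__(self, value):
        return self._func(value)

    def __add__(self, other):
        return Underscore(lambda x: self._func(x) + other)

    def __mul__(self, other):
        return Underscore(lambda x: self._func(x) * other)

    def __gt__(self, other):
        return Underscore(lambda x: self._func(x) > other)

_ = Underscore()
print(list(map(_ * 2, [1, 2, 3])))     # [2, 4, 6]
print(list(filter(_ > 2, [1, 2, 3])))  # [3]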
As part of #38, add alias method for where
Child to #54 for implementing file open/writing for compressed file formats. This issue is specific to implementing this for gzip files using https://docs.python.org/3.5/library/gzip.html
The functionality from https://github.com/jagill/python-chainz#errors would be useful for certain use cases. Wrapping this into Lineage seems like a fairly clean way to accomplish it.
On a side note, this might be a good time to look at making the exceptions raised from evaluating an incorrect user function in a PyFunctional pipeline cleaner. Currently, there is quite a bit of noise when it is very unlikely that the core issue is coming from PyFunctional.
Re-implement/change all the functions in the library to be compatible with generators. Currently, sequential calls to transformations produce a new list between each transformation, even if it is only used for the next transformation and not in the result. This is wasteful and could be eliminated using generators.
Targeting this to be the large (potentially breaking, hopefully not though) feature of 0.2.0, while 0.1.7 will be used to add more utility functions.
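The idea in miniature: chain transformations as generators so no intermediate list is materialized until the result is actually needed (a simplified sketch, not the library's design):

def lazy_map(func, iterable):
    for item in iterable:
        yield func(item)

def lazy_filter(predicate, iterable):
    for item in iterable:
        if predicate(item):
            yield item

# No intermediate lists are built; elements flow through one at a time.
pipeline = lazy_filter(lambda x: x > 4, lazy_map(lambda x: x * 2, range(10)))
print(list(pipeline))  # [6, 8, 10, 12, 14, 16, 18]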
Creating an issue to discuss the potential of implementing a parallel execution engine. From the 0.5.0 milestone, this might include:
The first possibility is to abstract the execution engine away so that ScalaFunctional can use either a sequential or parallel execution engine. This would need to be done through a combination of multiprocessing and determining where it could be used without creating massive code duplication. Additionally, this would require writing completely new tests and infrastructure since order is not guaranteed, but is expected in the current sequential tests.
Implement to_json. This should give the option to write the values as an array at the json root, or, if the sequence is a list of (Key, Value) pairs, to write it as a dictionary at the json root.
Child of #19
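A rough sketch of the two output modes, as a hypothetical helper using the standard json module:

import json

def write_json(elements, path, root_is_dict=False):
    # Either dump the sequence as a json array, or treat (Key, Value)
    # pairs as a dictionary at the json root.
    data = dict(elements) if root_is_dict else list(elements)
    with open(path, 'w') as f:
        json.dump(data, f)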
First reported here: #24
The core issue is when running code like this in pypy
>>> l = seq([1, 2, 3]).union([4, 5])
>>> set(l)
set([])
The result in standard python is different:
>>> l = seq([1, 2, 3]).union([4, 5])
>>> set(l)
set([1, 2, 3, 4, 5])
Looking into this further, I had a suspicion that at the heart of the issue is that on master, union and many other operators return iterators. If they are iterated over once, they will subsequently return nothing. So it seemed like something was iterating over them before set() got to them. A common culprit for this type of problem is len, so I stuck in some debugging statements and confirmed this is the problem.
For demonstration purposes, below is minimalistic code to replicate the same behavior, followed by the terminal sessions for that and for scalafunctional with print statements on iter, getitem and len.
from collections import Iterable

class A(object):
    def __init__(self, seq):
        self.l = seq

    def __getitem__(self, item):
        print "DEBUG:getitem called"
        return self.l[item]

    def __iter__(self):
        print "DEBUG:iter called"
        return iter(self.l)

    def __len__(self):
        print "DEBUG:len called"
        if isinstance(self.l, Iterable):
            self.l = list(self.l)
        return len(self.l)

class B(object):
    def __init__(self, seq):
        self.l = seq

    def __iter__(self):
        print "DEBUG:iter called"
        return iter(self.l)

print "Calling set(A([1, 2]))"
a = A([1, 2])
print set(a)

print "Calling set(B([1, 2]))"
b = B([1, 2])
print set(b)

print "Calling union"
s = set([1, 2, 3]).union([4, 5])
c = A(iter(s))
print set(c)
Output
$ pypy iterable.py
Calling set(A([1, 2]))
DEBUG:iter called
DEBUG:len called
set([1, 2])
Calling set(B([1, 2]))
DEBUG:iter called
set([1, 2])
Calling union
DEBUG:iter called
DEBUG:len called
set([])
$ python iterable.py
Calling set(A([1, 2]))
DEBUG:iter called
set([1, 2])
Calling set(B([1, 2]))
DEBUG:iter called
set([1, 2])
Calling union
DEBUG:iter called
set([1, 2, 3, 4, 5])
Terminal sessions for scalafunctional
$ python
Python 2.7.9 (default, Jan 7 2015, 11:49:12)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.56)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from functional import seq
>>> l = seq([1, 1, 2, 3, 3]).union([1, 4, 5])
>>> set(l)
DEBUG:iter
set([1, 2, 3, 4, 5])
$ pypy
Python 2.7.9 (9c4588d731b7fe0b08669bd732c2b676cb0a8233, Mar 31 2015, 07:55:22)
[PyPy 2.5.1 with GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>> from functional import seq
>>>> l = seq([1, 1, 2, 3, 3]).union([1, 4, 5])
>>>> set(l)
DEBUG:iter
DEBUG:len
DEBUG:iterable expanded in len via list()
set([])
Basically what is happening in both examples is:
set() results in a call to iter. Since the return type of union is an iterator, the final return value of that call is like iter(iter(resultOfUnion)).
On pypy, len also gets called on iter(resultOfUnion). When len gets called, it evaluates iter(resultOfUnion) and saves it to l. scalafunctional does this in order to reduce repeated evaluations of a generator, which can cause problems.
When iter is finally called though, there are no elements left to call it on.
I am unsure of the best way to fix this, but some considerations:
Is pypy correct to be calling len when standard python doesn't? Is there a good reason for this (probably)? Moreover, even if I disagree, it seems unlikely that this would be changed.
lineage-rewrite, #20, and #17 will fix this, I think, without any specific attention to it.
Given that, I am inclined to follow up with the pypy devs to see whether this is expected "correct" behavior or something needing fixing. I will also finish up the work on the lineage rewrite, then include the tests using set() and dict(). I am presuming the problems are due to similar issues with iterators. If it is still a problem, then I will have to think more about what to do.
Based on a reddit thread, this package would be helpful for users looking for features of LINQ (from .NET) in Python. This is a parent issue for editing documentation to talk about this use case and adding new/alias methods (like where and select) common in LINQ.
In using seq.to_file I have found a common case is to write a collection to a file as a string. I think the right way to expose this is through a delimiter option in to_file. If it is None, the default is to write str(self.to_list()); if it is defined, then it will do self.make_string(separator) and write that.
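A sketch of the proposed option, with a hypothetical standalone signature:

def to_file(sequence, path, delimiter=None):
    # No delimiter writes the repr of the expanded list; a delimiter
    # joins the elements into one string before writing.
    with open(path, 'w') as f:
        if delimiter is None:
            f.write(str(list(sequence)))
        else:
            f.write(delimiter.join(str(e) for e in sequence))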
Creating to discuss potential better LINQ integration for 0.6.0. From the milestone summary:
Another possible focus is on LINQ. This could take the form of implementing a limited SQL parser and optimizer using pyparsing. This might also mean giving select, where, and related methods more definition. For example, if the LINQ functions are used with calls like select("atr").filter("atr == 1"), be smarter about how they are executed. This is a wide open door, looking for thoughts and suggestions on what is of value. The basic concept is to start working on smarter ways of reading data, although this might tread into the territory of much more mature libraries like pandas and its dataframes.
Already implemented, created for bookkeeping
Child of #19
From the documentation: "Selects all elements except the first."
Your version: "get last element".
Is there any good reason behind it? I can change it to the former, and also implement stuff like inits and tails.
It could be useful if stream functions like seq.open, seq.csv, etc. could read compressed files, like Spark's sc.textFile does.
Also, writing a compressed file with to_file, to_csv, etc. would be great.
Implement to_csv with a similar interface to the python module csv.writer
Child of #19
Implement to_file with similar options to builtins.open in write mode
Child of #19
Issue to track implementing a join function. The implementation should take two sequences with tuples (K, V) and (K, W). The return value is the sequence joined on K, returning a sequence of (K, (V, W)) tuples.
Additionally, it should implement join_on, which creates the keys via the result of a passed function.
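A minimal standalone sketch of the described inner-join semantics (not the eventual implementation):

def join(left, right):
    # Join sequences of (K, V) and (K, W) on K, yielding (K, (V, W)).
    right_by_key = {}
    for k, w in right:
        right_by_key.setdefault(k, []).append(w)
    for k, v in left:
        for w in right_by_key.get(k, []):
            yield (k, (v, w))

print(list(join([(1, 'a'), (2, 'b')], [(1, 'x'), (3, 'y')])))
# [(1, ('a', 'x'))]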
Already implemented, created for bookkeeping
Child of #19
I'm guessing one of your design goals is beauty and readability. Would it create a nicer first impression to see:
result = (seq(1, 2, 3, 4)
.map(lambda x: x * 2)
.filter(lambda x: x > 4)
.reduce(lambda x, y: x + y)
)
I understand that this is a personal preference so feel free to disagree :)
Implement to_jsonl, which matches the implementation of functional.streams.jsonl
Child of #19