mp_pandas's Introduction

MP_Pandas

Pandas' group-by/apply with multiprocessing

Pandas is a Python library used for data manipulation. The main Pandas data structure is the data frame which is very similar to the R data frame. One of the main processing paradigms on data frames is the group-by/apply which is essentially map/reduce. The syntax for group-by/apply is:

                data_frame.groupby(column_list).apply(apply_func, *args, **kwargs)

When the computation to be performed by the apply function is CPU intensive and/or the number of data frame groups is large, a group-by/apply computation can easily become a processing bottleneck.

The Pandas library is very fast and in general one is better off leveraging its capabilities. The first and best bottleneck removal step is to make sure that your apply function is optimized. Similarly, often structuring the data differently may lead to a reduction in the number of groups generated by the group-by step. If after having implemented these two steps your group-by/apply is still a processing bottleneck, one may want to consider using multiprocessing.

The group-by/apply operation is, in general but not always, embarrassingly parallel. Here we assume that the group-by/apply is embarrassingly parallel. The basic idea in a multiprocessor group-by/apply is to assign groups resulting from the group-by step to different CPUs. Enabling multiprocessing in a Pandas group-by/apply is interesting if it is generic in the sense that we preserve the syntax and we do not need to rewrite a special apply function code. Our multiprocessing syntax is as follows:

            mp_groupby(data_frame, column_list, apply_func, *args, **kwargs, **mp_args)

The arguments to mp_groupby() are the same as in the Pandas groupby/apply except for the additional mp_arg argument, which contains multiprocessing information such as the number of CPUs to use and load balancing information.

mp_pandas's People

Contributors

Stargazers

Watchers

mp_pandas's Issues

Reshuffling dataframes

The output dataframe is completely reshuffled [On very large DF's]

Is there any way your'e handling it?

Incompatibility of df_in and df_grouper

Thanks for this, it should help my processes a lot.

However I did find a bug that was causing me problems (I was running it under Python3 with the required minor syntax changes) the df_grouper() function sorts the DataFrame by the desired columns to create the dictionary, but func_by_groups() then tries to apply the indexes on the unsorted df_in. This means the subsets it uses are incorrect.

Once I tracked down this it was easy enough to fix, by moving the sort from df_grouper() into mp_groupby() prior to the former being called.

Most of the time I'd have sorted the data before passing it to groupby(), so I assume that's probably why no-one had spotted it yet.

Thanks again.

Recommend Projects

josepm / mp_pandas Goto Github PK

mp_pandas's Introduction

MP_Pandas

Pandas' group-by/apply with multiprocessing

mp_pandas's People

Contributors

Stargazers

Watchers

Forkers

mp_pandas's Issues

Reshuffling dataframes

Incompatibility of df_in and df_grouper

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent