Code Monkey home page Code Monkey logo

mp_pandas's Introduction

MP_Pandas

Pandas' group-by/apply with multiprocessing

Pandas is a Python library used for data manipulation. The main Pandas data structure is the data frame which is very similar to the R data frame. One of the main processing paradigms on data frames is the group-by/apply which is essentially map/reduce. The syntax for group-by/apply is:

                data_frame.groupby(column_list).apply(apply_func, *args, **kwargs)

When the computation to be performed by the apply function is CPU intensive and/or the number of data frame groups is large, a group-by/apply computation can easily become a processing bottleneck.

The Pandas library is very fast and in general one is better off leveraging its capabilities. The first and best bottleneck removal step is to make sure that your apply function is optimized. Similarly, often structuring the data differently may lead to a reduction in the number of groups generated by the group-by step. If after having implemented these two steps your group-by/apply is still a processing bottleneck, one may want to consider using multiprocessing.

The group-by/apply operation is, in general but not always, embarrassingly parallel. Here we assume that the group-by/apply is embarrassingly parallel. The basic idea in a multiprocessor group-by/apply is to assign groups resulting from the group-by step to different CPUs. Enabling multiprocessing in a Pandas group-by/apply is interesting if it is generic in the sense that we preserve the syntax and we do not need to rewrite a special apply function code. Our multiprocessing syntax is as follows:

            mp_groupby(data_frame, column_list, apply_func, *args, **kwargs, **mp_args)

The arguments to mp_groupby() are the same as in the Pandas groupby/apply except for the additional mp_arg argument, which contains multiprocessing information such as the number of CPUs to use and load balancing information.

mp_pandas's People

Contributors

josepm avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar

mp_pandas's Issues

Reshuffling dataframes

The output dataframe is completely reshuffled [On very large DF's]

Is there any way your'e handling it?

Incompatibility of df_in and df_grouper

Hi

Thanks for this, it should help my processes a lot.

However I did find a bug that was causing me problems (I was running it under Python3 with the required minor syntax changes) the df_grouper() function sorts the DataFrame by the desired columns to create the dictionary, but func_by_groups() then tries to apply the indexes on the unsorted df_in. This means the subsets it uses are incorrect.

Once I tracked down this it was easy enough to fix, by moving the sort from df_grouper() into mp_groupby() prior to the former being called.

Most of the time I'd have sorted the data before passing it to groupby(), so I assume that's probably why no-one had spotted it yet.

Thanks again.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.