
pandarallel's Introduction

Pandaral·lel


[Animated demo: a pandas operation without Pandarallel, running on a single core]
[Animated demo: the same operation with Pandarallel, running on all cores]

Pandaral·lel provides a simple way to parallelize your pandas operations across all available CPUs by changing only one line of code. It also displays progress bars.

⚠️ Pandaral·lel is looking for a maintainer! ⚠️

If you are interested, please contact me or open a GitHub issue.

Maintainers

Former maintainers (thanks a lot for your work!)

Installation

pip install pandarallel [--upgrade] [--user]

Quickstart

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

# df.apply(func)
df.parallel_apply(func)
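
A slightly fuller, self-contained variant of the quickstart (the DataFrame and function below are illustrative, not from the project):

import math

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

df = pd.DataFrame({"a": range(10000), "b": range(10000)})

def work(row):
    return math.sqrt(row.a ** 2 + row.b ** 2)

# Drop-in replacement for df.apply(work, axis=1)
result = df.parallel_apply(work, axis=1)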

Usage

Be sure to check out the documentation.

Examples

An example for each supported pandas API is available in the documentation.

pandarallel's People

Contributors

alvail, apahl, bartbroere, bjvanderweij, bra-fsn, chris-boson, cschan1828, dice89, gcamargo1, gield, guyrosin, haim0n, itssimon, jorenham, masguit42, mithil467, nalepae, quancore, rbeilvert, sagarkar10, synapticarbors, till-m, tobiasedwards, tom-andersson, vaaaaanquish, yashsinha1996


pandarallel's Issues

Incompatible with Python 3.5

When I install this package into my Python 3.5 project, I find that this piece of code:
msg = f"The pandarallel shared memory is too small to allow \ ...
throws a
SyntaxError: invalid syntax on import.

I suppose the package uses f-strings for its error messages; f-strings were introduced in Python 3.6. I am wondering whether they are strictly necessary, and whether they could be removed to make the package compatible with 3.5.

Thanks.
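
A minimal sketch of the Python 3.5-compatible rewrite being asked for, replacing the f-string with str.format (the variable names and message tail are illustrative, since the original line is truncated above):

# Illustrative values; in the library these would come from its config.
shm_size_mb, needed_mb = 2000, 4096

msg = ("The pandarallel shared memory is too small to allow parallel "
       "computation. Current size: {} MB, required: {} MB."
       ).format(shm_size_mb, needed_mb)
print(msg)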

Compatibility with Cell Magic

I was trying this out in a Jupyter notebook and noticed that running the following inside a notebook cell:

%%time
result = series.parallel_apply(...)

prevents the result of the computation from being assigned to the result variable.

`parallel_apply` does not work on second run

When I try to run `parallel_apply` twice, it does not work the second time and just hangs. Perhaps add a reinitialize or terminate option to clear the current instance.

Python 2.7 Compatibility

As is indirectly mentioned here, 2.7 compatibility would be useful, although admittedly this is only relevant for another 5½ months, until Python 2.7 is officially deprecated.

Different result from groupby().parallel_apply() compared to pandas

Here is the code; pandarallel raises an error when it starts to assemble the final result:

import pandas as pd

df = pd.DataFrame([['r', 1],['r', -1], ['r', 3], ['e', 10], ['r', 5], ['e', 4], ['t', 6], ['y', 8],
                   ['re', 11],['re', -1], ['rv', 3], ['ev', 10], ['rg', 5], ['eb', 4], ['tt', 3], ['tt', 7]], columns=['a','b'])
print(df)


def process_data(b_df):
    print(b_df.shape[0])
    df_list = []

    max_b = -1
    i = 0
    for index, row in b_df.iterrows():
        max_b = max(max_b, row['b'])
        if i==0:
            df_list.append(row['b'])
        i+=1

    df_list.append(max_b)

    df = pd.DataFrame([df_list], columns=['bb', 'max_b'])
    print(df)
    print(df_list)
    return df


train_dataframe = df.groupby(df['a']).apply(process_data)
print(' pandas process result ....')
print(train_dataframe)

from pandarallel import pandarallel
pandarallel.initialize(shm_size_mb=int(1e3), nb_workers=4)
train_dataframe0 = df.groupby('a').parallel_apply(process_data)
print(' pandarallel process result ....')
print(train_dataframe0)
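
For reference, here is a plain-pandas rewrite of process_data using named aggregation (pandas >= 0.25 syntax; an editorial sketch, not from the thread). It produces the same values (bb is the first b of each group, max_b the maximum) without building a DataFrame per group:

# Equivalent result via named aggregation, without per-group DataFrames.
expected = df.groupby('a').agg(bb=('b', 'first'), max_b=('b', 'max'))
print(expected)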

Issue with more groupby levels

Adding more groupby levels raises:
TypeError: Series.name must be a hashable type

import math

import numpy as np
import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize()

df_size = int(3e3)
df = pd.DataFrame(dict(a=np.random.randint(1, 1000, df_size),
                       b=np.random.randint(1, 50, df_size),
                       c=np.random.rand(df_size)))

def func(df):
    dum = 0
    for item in df.c:
        dum += math.log10(math.sqrt(math.exp(item**2)))
    return dum / len(df.b)

res_parallel = df.groupby(["b"], as_index=False).parallel_apply(func)

TypeError                                 Traceback (most recent call last)
<ipython-input-...> in <module>
     10
     11     return dum / len(df.b)
---> 12 res_parallel = df.groupby(["b"], as_index=False).parallel_apply(func)

~/Programs/anaconda3/envs/nand-reports-p3/lib/python3.7/site-packages/pandarallel/_pandarallel.py in wrapper(*args, **kwargs)
     72             """Please see the docstring of this method without parallel"""
     73             try:
---> 74                 return func(*args, **kwargs)
     75
     76             except _PlasmaStoreFull:

~/Programs/anaconda3/envs/nand-reports-p3/lib/python3.7/site-packages/pandarallel/_pandarallel.py in closure(df_grouped, func, *args, **kwargs)
    217                              ])),
    218                 index=_pd.Series(list(df_grouped.grouper),
--> 219                                  name=df_grouped.keys)
    220             ).squeeze()
    221             return result

~/Programs/anaconda3/envs/nand-reports-p3/lib/python3.7/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    266             generic.NDFrame.__init__(self, data, fastpath=True)
    267
--> 268             self.name = name
    269             self._set_axis(0, index, fastpath=True)
    270

~/Programs/anaconda3/envs/nand-reports-p3/lib/python3.7/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   5087             object.__setattr__(self, name, value)
   5088         elif name in self._metadata:
-> 5089             object.__setattr__(self, name, value)
   5090         else:
   5091             try:

~/Programs/anaconda3/envs/nand-reports-p3/lib/python3.7/site-packages/pandas/core/series.py in name(self, value)
    400     def name(self, value):
    401         if value is not None and not is_hashable(value):
--> 402             raise TypeError('Series.name must be a hashable type')
    403         object.__setattr__(self, '_name', value)
    404

TypeError: Series.name must be a hashable type

Permission denied: '/usr/local/lib/python3.6/dist-packages/pyarrow-0.15.0-py3.6-linux-x86_64.egg/pyarrow/plasma-store-server'

As root (Ubuntu 18.04, Python 3.6), trying to run this code:

from pandarallel import pandarallel

pandarallel.initialize()

I get "Permission denied" right at the beginning:

New pandarallel memory created - Size: 2000 MB
Pandarallel will run on 56 workers
Traceback (most recent call last):
  File "pokusy_only_linux.py", line 3, in <module>
    pandarallel.initialize()
  File "/usr/local/lib/python3.6/dist-packages/pandarallel-1.3.3-py3.6.egg/pandarallel/pandarallel.py", line 75, in initialize
    verbose=verbose_store)
  File "/usr/local/lib/python3.6/dist-packages/pandarallel-1.3.3-py3.6.egg/pandarallel/plasma_store.py", line 29, in start_plasma_store
    proc = subprocess.Popen(command, stdout=stdout, stderr=stderr)
  File "/usr/lib/python3.6/subprocess.py", line 729, in __init__
    restore_signals, start_new_session)
  File "/usr/lib/python3.6/subprocess.py", line 1364, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: '/usr/local/lib/python3.6/dist-packages/pyarrow-0.15.0-py3.6-linux-x86_64.egg/pyarrow/plasma-store-server'
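
The traceback suggests the bundled plasma-store-server binary is not executable. A hedged workaround sketch, assuming the problem really is a missing execute bit (the path is copied from the traceback; adjust it to your installation):

import os
import stat

path = ("/usr/local/lib/python3.6/dist-packages/"
        "pyarrow-0.15.0-py3.6-linux-x86_64.egg/pyarrow/plasma-store-server")

# Add the owner execute bit (equivalent to `chmod u+x` on the file).
os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)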

While `python_requires` is specified in setup.py, it is still possible to install pandarallel with an unsupported Python version through pip.

For example, with Python 2.7, the command
$ pip install pandarallel
won't throw any error, although it should, because python_requires='>=3.5' is specified in setup.py.

Additional information:
If you go into the pandarallel repository with Python 2.7 and run
$ pip install .
then the following error appears (which is expected):

pandarallel requires Python '>=3.5' but the running Python is 2.7.16

ModuleNotFoundError: No module named 'ipywidgets'

This seems to be an installation dependency problem.
Reproduced on two machines with fresh conda environments; pandarallel was installed via pip.
Solved by manually installing ipywidgets.

During import pandarallel:
from .pandarallel import pandarallel
from .dataframe import DataFrame
from .utils import (parallel, chunk, ProgressBarsConsole, ProgressBarsNotebookLab)
from ipywidgets import HBox, VBox, IntProgress, Label
ModuleNotFoundError: No module named 'ipywidgets'
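
The manual fix mentioned above is:

$ pip install ipywidgets

(At the time of this report, ipywidgets was apparently not declared as a dependency even though the notebook progress bars import it.)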

Try dill to support lambda functions?

Greetings.

I know pickle doesn't support lambda serialization, but another serialization library, dill, does. There is also a multiprocessing library, multiprocess, which uses dill in place of pickle.

I'm new here. There may be reasons lambda functions aren't supported, because of an upstream dependency or something else. I just wanted to mention these in case they hadn't been noticed before.

Regards
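
A quick standalone demonstration of the difference described above (requires dill to be installed; unrelated to pandarallel itself):

import pickle
import dill

square = lambda x: x * x

# Standard pickle cannot serialize a lambda...
try:
    pickle.dumps(square)
except Exception as exc:
    print("pickle failed:", exc)

# ...but dill can, and it round-trips correctly.
restored = dill.loads(dill.dumps(square))
print(restored(4))  # 16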

Maximum size exceeded

Hi,
I set pandarallel.initialize(shm_size_mb=10000), and after applying parallel_apply to my column I get the error Maximum size exceeded (2GB).

Why do I get this message when I set more than 2 GB?

Connection to IPC socket failed for pathname

Hello. When I call pandarallel.initialize(), I get the following warning. What does it mean? Thank you:

WARNING: Logging before InitGoogleLogging() is written to STDERR
E0812 19:11:57.484051 2409853824 io.cc:168] Connection to IPC socket failed for pathname /var/folders/sp/vz74h1tx3jlb3jqrq__bjwh00000gp/T/pandarallel-32ts0h6r/plasma_sock, retrying 20 more times

Please help me, thanks.

Can not kill the program on pycharm

Hi,

I use a single line to call df.parallel_apply(), as follows:
label['content_segmented'] = label.parallel_apply(content_segementation.segmentation_content, axis=1)

It is supposed to show 4 progress bars because my laptop has a 4-core CPU, but it only showed one. That is fine, since I can see the program running on all 4 cores (8 threads). But when I tried to kill the program, I found I could not. Even after closing the tab in PyCharm, it was still running on all cores. Why is that? How can I kill the program partway through?


Comparison?

Are there known differences/advantages/disadvantages of pandarallel vs., for example, Modin on Ray? https://github.com/modin-project/modin

(Other things like Dask and Arrow have clearly different scopes, intentions, and approaches, but pandarallel and Modin do seem to have similar intents and approaches.)

Clear Pool/Add Option to Clear Pool

With current pool use (I have only tested with series.parallel_apply, but I'd assume this happens anywhere there is a pool.<x>map call), the child processes are kept in memory after results have been returned. I believe this is intended behavior from Pathos, as it allows pools to be reused if called again; however, it can cause memory issues if a script continues to execute after any pandarallel functions are called (i.e., the script later needs memory that is now occupied by the zombie child processes).

You can get around this by calling pool.clear() once you have your results; the child processes will then be cleared and release their hold on memory. I'm not sure how much of an impact this has on the performance of later pool usage (I didn't notice any difference in testing, but I'd assume there is at least some), so this could be added as an optional input if hard-coding doesn't seem like the right route here.

Open to other thoughts/ideas on how to better handle these as well.
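
A standalone illustration of the suggested workaround, using pathos directly (pool.clear() is the pathos API the reporter refers to; this sketch is independent of pandarallel's internals):

from pathos.multiprocessing import ProcessingPool

def square(x):
    return x * x

pool = ProcessingPool(4)
results = pool.map(square, range(100))

# Release the cached worker processes so they don't linger in memory
# after the results have been returned.
pool.clear()
print(results[:5])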

Disabling console messages?

Is there a way to disable console messages?

WARNING: Logging before InitGoogleLogging() is written to STDERR
I0529 15:34:22.434839  4958 store.cc:1100] Allowing the Plasma store to use up to 2GB of memory.
I0529 15:34:22.434947  4958 store.cc:1127] Starting object store with directory /dev/shm and huge page support disabled

quiet mode execution

Currently, a ton of messages are printed. Is there a way to mute all or some of them?
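
For what it's worth, versions of pandarallel that support it accept a verbose argument on initialize (0 = no logs, 1 = warnings only, 2 = all logs); whether this also silences the Plasma/glog output shown in the previous issue is a separate question:

from pandarallel import pandarallel

# 0 -> display no logs, 1 -> warnings only, 2 -> all logs (default)
pandarallel.initialize(verbose=0)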

AttributeError: Can't pickle local object 'FeatureExtracter.<locals>.feature_extracter_fwd'

I can't seem to get pandarallel to work with one of my functions.

The function uses a lot of other modules, like spaCy, NLTK, and NumPy, and is quite complex, which is why I want to run it in parallel.

This is the line causing the error in the title:
data.body = data.body.parallel_apply(super().remove_noise, progress_bar=False)

and this is the full traceback:

Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'FeatureExtracter.<locals>.feature_extracter_fwd'
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'FeatureExtracter.<locals>.feature_extracter_fwd'
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'FeatureExtracter.<locals>.feature_extracter_fwd'
Traceback (most recent call last):
  File "/usr/lib64/python3.6/multiprocessing/queues.py", line 234, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/usr/lib64/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'FeatureExtracter.<locals>.feature_extracter_fwd'

And then the program just hangs.
Any help would be much appreciated.
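
The error means the function shipped to the workers is defined inside another function (a "local object"), which the standard pickler cannot serialize. A hedged sketch of the usual fix, with illustrative names modeled on the traceback:

# Unpicklable: a function defined inside another function.
def make_extracter():
    def feature_extracter_fwd(text):  # local object -> pickle fails
        return len(str(text))
    return feature_extracter_fwd

# Picklable: the same logic hoisted to module level.
def feature_extracter_fwd(text):
    return len(str(text))

Depending on how the enclosing class is defined, methods reached through super(), as in the parallel_apply call above, can run into similar serialization limits.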

AttributeError: 'ProgressBarsConsole' object has no attribute 'set_error'

Hi.

The progress bar may not be displayed.

$ pip list
tqdm                     4.36.1
pandas                   0.25.3
pandarallel              1.4.1
   0.00%                                          |        0 /     6164 |
   0.00%                                          |        0 /     6164 |
   0.00%                                          |        0 /     6163 |
   0.00%                                          |        0 /     6163 |                                                                                                                                                                    
  File "/home/ubuntu/test.py", line 60, in _run
    df['result'] = df.parallel_apply(_func, axis=1)
  File "/home/ubuntu/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 384, in closure
    map_result,
  File "/home/ubuntu/.pyenv/versions/3.6.8/lib/python3.6/site-packages/pandarallel/pandarallel.py", line 327, in get_workers_result
    progress_bars.set_error(worker_index)
AttributeError: 'ProgressBarsConsole' object has no attribute 'set_error'

set_error seems to be defined only on ProgressBarsNotebookLab:
https://github.com/nalepae/pandarallel/blob/master/pandarallel/utils/progress_bars.py#L91

But it is called regardless of whether the bars are ProgressBarsNotebookLab or ProgressBarsConsole:
https://github.com/nalepae/pandarallel/blob/master/pandarallel/pandarallel.py#L322

It seems this dates from the refactoring in f297b75.
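
A hedged user-side shim until this is fixed upstream: give ProgressBarsConsole a no-op set_error so the error path doesn't raise a second, unrelated AttributeError (the import path is taken from the links above):

from pandarallel.utils.progress_bars import ProgressBarsConsole

# Console bars have no error styling, so a no-op is good enough here.
if not hasattr(ProgressBarsConsole, "set_error"):
    ProgressBarsConsole.set_error = lambda self, index: None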

issues of df.groupby(args).parallel_apply(func)

It looks like this error is raised when the program combines the indices of the per-group dataframes.

InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

File "/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2740, in get_indexer
raise InvalidIndexError('Reindexing only valid with uniquely'

InvalidIndexError: Reindexing only valid with uniquely valued Index objects

When an error occurs, the user has to wait for the end of all tasks

In the following case, an error occurs at 16% of the first task.


Observed:
The user has to wait for the end of all tasks before being able to see the failure reason.

Expected:
As soon as an error occurs on one task, all other tasks should be stopped, and the error cause should be displayed.

is there any deinitialize?

When I run pandarallel.initialize(), my system spawns a process:
eversec 79659 1 0 11:02 pts/1 00:00:00 /home/everdc/anaconda3/lib/python3.6/site-packages/pyarrow/plasma_store_server -s /tmp/test_plasma-2680t0g9/plasma.sock -m 2000000000

The more the code runs, the more such processes accumulate.
So, is there a deinitialize function to kill these processes?
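
No public deinitialize appears in this thread. A blunt, hedged workaround sketch for cleaning up leftover stores by hand (assumes psutil is installed; tighten the match to your own setup before using):

import psutil

for proc in psutil.process_iter(["cmdline"]):
    cmdline = proc.info["cmdline"] or []
    if any("plasma_store_server" in part for part in cmdline):
        proc.terminate()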

cannot import name PlasmaStoreFull from pyarrow.lib

I am running Python 3.7 in a virtual environment and hit 'cannot import name PlasmaStoreFull from pyarrow.lib':

  File "/home/jeremyr/p37/lib/python3.7/site-packages/pandarallel/utils.py", line 10, in <module>
    from pyarrow.lib import PlasmaStoreFull as _PlasmaStoreFull
ImportError: cannot import name 'PlasmaStoreFull' from 'pyarrow.lib' (/home/jeremyr/p37/lib/python3.7/site-packages/pyarrow/lib.cpython-37m-x86_64-linux-gnu.so)
(p37) jeremyr@bolt88:~/sdt-address$ 

Imports of pyarrow and pyarrow.lib work fine, but pyarrow.lib doesn't seem to have PlasmaStoreFull. This is with pyarrow 0.15.0.
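
An editorial note, not from the thread: the import evidently breaks starting with pyarrow 0.15.0, so until pandarallel is updated for it, one hedged workaround is to pin the previous pyarrow series:

$ pip install "pyarrow<0.15"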

Use accessor

I think it would make things easier and more standard if you use a pandas accessor instead.

import pandas

@pandas.api.extensions.register_dataframe_accessor('parallel')
class ClassWithYourMethods:
    def __init__(self, pandas_obj):
        # The accessor receives the DataFrame it is attached to.
        self._obj = pandas_obj

Then, the user can use:

import pandas_parallel

df = DataFrame(some_data)

df.parallel.apply(some_function)

I think we haven't done a great job of letting users/developers know about accessors, but I think it's good for pandas users to have a unified way of using pandas "plugins". Improving the contributing documentation is on the pandas roadmap; hopefully that makes things easier for extension developers and users.

Doc here: http://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.api.extensions.register_dataframe_accessor.html

Also, feel free to send a PR to add this package to the pandas ecosystem page if you haven't: http://pandas.pydata.org/pandas-docs/stable/ecosystem.html
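
To make the suggestion concrete, here is a self-contained sketch of a registered accessor that parallelizes apply with the standard library's multiprocessing. It is purely illustrative and not pandarallel's actual implementation:

import multiprocessing as mp

import numpy as np
import pandas as pd

class _ApplyChunk:
    """Picklable callable: a top-level class instead of a lambda/closure."""
    def __init__(self, func, axis):
        self.func, self.axis = func, axis

    def __call__(self, chunk):
        return chunk.apply(self.func, axis=self.axis)

@pd.api.extensions.register_dataframe_accessor("parallel")
class ParallelAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def apply(self, func, axis=1):
        # One chunk per CPU; apply in workers; reassemble in order.
        n_chunks = max(1, min(mp.cpu_count(), len(self._obj)))
        chunks = np.array_split(self._obj, n_chunks)
        with mp.Pool() as pool:
            parts = pool.map(_ApplyChunk(func, axis), chunks)
        return pd.concat(parts)

def row_sum(row):
    return row["a"] + row["b"]

if __name__ == "__main__":
    df = pd.DataFrame({"a": range(100), "b": range(100)})
    print(df.parallel.apply(row_sum))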

ArrowIOError: Broken pipe

I encountered ArrowIOError: Broken pipe while applying parallel_apply to the first 50 values of a pd.Series.

import requests

def make_request(uid):
    uid = int(uid)
    url = some_endpoint
    response = requests.post(url).json()
    return response['total']
---------------------------------------------------------------------------
ArrowIOError                              Traceback (most recent call last)
/usr/local/lib/python3.7/site-packages/pandarallel/utils.py in wrapper(*args, **kwargs)
     74             try:
---> 75                 return func(*args, **kwargs)
     76 

/usr/local/lib/python3.7/site-packages/pandarallel/series.py in closure(series, func, *args, **kwargs)
    148 
--> 149             object_id = plasma_client.put(series)
    150 

/usr/local/lib/python3.7/site-packages/pyarrow/_plasma.pyx in pyarrow._plasma.PlasmaClient.put()

/usr/local/lib/python3.7/site-packages/pyarrow/_plasma.pyx in pyarrow._plasma.PlasmaClient.create()

/usr/local/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowIOError: Broken pipe

During handling of the above exception, another exception occurred:

ArrowIOError                              Traceback (most recent call last)
<ipython-input-17-95842d7aa72c> in <module>
      1 start = datetime.now()
      2 print(start)
----> 3 data[1].iloc[:50].parallel_apply(make_request, axis=1)
      4 print('time elapsed', datetime.now() - start)

/usr/local/lib/python3.7/site-packages/pandarallel/utils.py in wrapper(*args, **kwargs)
     84 
     85             finally:
---> 86                 client.delete(client.list().keys())
     87 
     88         return wrapper

/usr/local/lib/python3.7/site-packages/pyarrow/_plasma.pyx in pyarrow._plasma.PlasmaClient.list()

/usr/local/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowIOError: Broken pipe

Logging of python error

First off, great package! The fact that I just had to add two lines of code to get a 4x speedup is amazing.
I was wondering where I should look for errors thrown by my Python code. E.g., I had failed to declare a variable in a subroutine, and the progress bar just froze at that step instead of surfacing the error. Is there a logfile that errors are written to?
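
No pandarallel logfile is mentioned in this thread. One hedged pattern is to wrap the applied function so worker-side exceptions are logged to a file before being re-raised; this sketch assumes fork-based workers (which inherit the logging configuration) and a serializer, such as the dill-based one pathos uses, that can ship the closure:

import functools
import logging

logging.basicConfig(filename="worker_errors.log", level=logging.ERROR)

def logged(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            logging.exception("error in %s", func.__name__)
            raise
    return wrapper

# df.parallel_apply(logged(my_func), axis=1)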

One process does all jobs by itself

First off, I love the ease of use of this project!

I am trying to multiprocess the reading of an image (using rasterio).
Each row in my dataframe reads a window from the image source.
The image path is distributed to each process, which then opens it.

When I run parallel_apply(), all four processes seem to start, but only the first one continues. The other three stop at job 1, and the first one performs all the jobs (152/38). The result is a dataframe concatenated from all processes, with four times as many rows as the input and NaN values in 3/4 of the rows. See the screen capture below.

Do you have any input on why this is happening?

[Screenshot, 2019-10-23 15:18]

groupby and apply does not work

def cumulate_asset_scores(dataset):
    dataset['counts'] = list(range(len(dataset)))
    dataset['corrects'] = dataset['score'].cumsum()
    return dataset
    

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, shm_size_mb=30000)
df = dataframe.groupby('user_id').parallel_apply(cumulate_asset_scores)

yields

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/multiprocess/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.7/site-packages/multiprocess/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.7/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/usr/local/lib/python3.7/site-packages/pandarallel/dataframe_groupby.py", line 15, in worker
    df = client.get(object_id)
  File "pyarrow/_plasma.pyx", line 579, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/_plasma.pyx", line 572, in pyarrow._plasma.PlasmaClient.get
  File "pyarrow/serialization.pxi", line 470, in pyarrow.lib.deserialize
  File "pyarrow/serialization.pxi", line 433, in pyarrow.lib.deserialize_from
  File "pyarrow/serialization.pxi", line 275, in pyarrow.lib.SerializedPyObject.deserialize
  File "pyarrow/serialization.pxi", line 183, in pyarrow.lib.SerializationContext._deserialize_callback
  File "/usr/local/lib/python3.7/site-packages/pyarrow/serialization.py", line 175, in _deserialize_pandas_dataframe
    return pdcompat.serialized_dict_to_dataframe(data)
  File "/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 640, in serialized_dict_to_dataframe
    for block in data['blocks']]
  File "/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 640, in <listcomp>
    for block in data['blocks']]
  File "/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 661, in _reconstruct_block
    dtype = make_datetimetz(item['timezone'])
  File "/usr/local/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 676, in make_datetimetz
    return _pandas_api.datetimetz_type('ns', tz=tz)
TypeError: 'NoneType' object is not callable
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-12-6a762afbe0c2> in <module>
      2 
      3 pandarallel.initialize(shm_size_mb=30000)
----> 4 df = dataframe.groupby('user_id').parallel_apply(cumulate_asset_scores)
      5 #res = []
      6 #for _, g in tqdm.tqdm(list(dataframe.groupby('user_id'))):

/usr/local/lib/python3.7/site-packages/pandarallel/utils.py in wrapper(*args, **kwargs)
     78             """Please see the docstring of this method without `parallel`"""
     79             try:
---> 80                 return func(*args, **kwargs)
     81 
     82             except _PlasmaStoreFull:

/usr/local/lib/python3.7/site-packages/pandarallel/dataframe_groupby.py in closure(df_grouped, func, *args, **kwargs)
     36             with ProcessingPool(nb_workers) as pool:
     37                 result_workers = pool.map(
---> 38                     DataFrameGroupBy.worker, workers_args)
     39 
     40             if len(df_grouped.grouper.shape) == 1:

/usr/local/lib/python3.7/site-packages/pathos/multiprocessing.py in map(self, f, *args, **kwds)
    135         AbstractWorkerPool._AbstractWorkerPool__map(self, f, *args, **kwds)
    136         _pool = self._serve()
--> 137         return _pool.map(star(f), zip(*args)) # chunksize
    138     map.__doc__ = AbstractWorkerPool.map.__doc__
    139     def imap(self, f, *args, **kwds):

/usr/local/lib/python3.7/site-packages/multiprocess/pool.py in map(self, func, iterable, chunksize)
    266         in a list that is returned.
    267         '''
--> 268         return self._map_async(func, iterable, mapstar, chunksize).get()
    269 
    270     def starmap(self, func, iterable, chunksize=None):

/usr/local/lib/python3.7/site-packages/multiprocess/pool.py in get(self, timeout)
    655             return self._value
    656         else:
--> 657             raise self._value
    658 
    659     def _set(self, i, obj):

TypeError: 'NoneType' object is not callable

This is in a Jupyter notebook.
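
As an aside, the grouped function above can be expressed with vectorized groupby operations, which avoids shipping per-group DataFrames through Plasma altogether (an editorial sketch producing the same columns, not a fix for the underlying pyarrow error):

dataframe['counts'] = dataframe.groupby('user_id').cumcount()
dataframe['corrects'] = dataframe.groupby('user_id')['score'].cumsum()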

Indefinite run

Hello,

I found your library on Medium and wanted to try it out.
I did the basic preparation:

from pandarallel import pandarallel
# Initialization
pandarallel.initialize(progress_bar=True)

and I use it on a single-column dataframe:

def make_request(uid):
    uid = int(uid)
    url = some_endpoint
    response = requests.post(url).json()
    return response['total']
    
start = datetime.now()
print(start)
data[1].iloc[:50].parallel_apply(make_request, axis=1)
print('time elapsed', datetime.now() - start)

I see the following string under the Jupyter Lab cell: VBox(children=(HBox(children=(IntProgress(value=0, description='0.00%', max=5750), Label(value='0 / 5750'))), … but the cell keeps running indefinitely.
The regular pandas apply finishes in about 25 seconds.

What did I do wrong?

E0731 Error reading from socket.

Hi,
I tried pandarallel today, and it sped up my processing script quite a bit.
However, after a few minutes of processing, the following error is spammed to the screen every few milliseconds.
No further exception is thrown, and the process seems to be stuck on this error:

E0731 19:23:55.184354 7351 io.cc:215] Error reading from socket.

It is quite cryptic, and I could not find anything else on this particular error, which is why I am opening this issue. I am running this on a Google Cloud Compute instance: n1-highcpu-32 (32 vCPUs, 28.8 GB memory). I still believe it is related to the pandarallel package, since I end up with this exception when I Ctrl+C out of the error loop:

Traceback (most recent call last):
  File "shapesplit.py", line 290, in <module>
    main()
  File "shapesplit.py", line 58, in main
    distribute_raster_series(region, year, collection, parcels, overwrite, stop_early=stop_early)
  File "shapesplit.py", line 270, in distribute_raster_series
    distribute_raster(raster, parcels, collection, overwrite=overwrite)
  File "shapesplit.py", line 235, in distribute_raster
    result = parcels.parallel_apply(mapping_function, axis=1, args=(offset, rastername, local_filename, collection))
  File "/home/marc/miniconda3/lib/python3.7/site-packages/pandarallel/utils.py", line 74, in wrapper
    client.delete(client.list().keys())
  File "pyarrow/_plasma.pyx", line 741, in pyarrow._plasma.PlasmaClient.list
  File "pyarrow/error.pxi", line 87, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: Connection reset by peer

Waiting up to 5 seconds.
Sent all pending logs.
Process ForkPoolWorker-15:
Traceback (most recent call last):
  File "/home/marc/miniconda3/lib/python3.7/site-packages/multiprocess/process.py", line 297, in _bootstrap
    self.run()
  File "/home/marc/miniconda3/lib/python3.7/site-packages/multiprocess/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/home/marc/miniconda3/lib/python3.7/site-packages/multiprocess/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/marc/miniconda3/lib/python3.7/site-packages/multiprocess/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/home/marc/miniconda3/lib/python3.7/site-packages/pathos/helpers/mp_helper.py", line 15, in <lambda>
    func = lambda args: f(*args)
  File "/home/marc/miniconda3/lib/python3.7/site-packages/pandarallel/dataframe.py", line 14, in worker_apply
    client = plasma.connect(plasma_store_name)
  File "pyarrow/_plasma.pyx", line 805, in pyarrow._plasma.connect
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
  File "/home/marc/miniconda3/lib/python3.7/site-packages/pyarrow/compat.py", line 115, in frombytes
    def frombytes(o):
KeyboardInterrupt

Can you help me with that issue?

PS: [Screenshot of the instance details, 2019-07-31 21:52]
