
datar's Introduction

datar

A Grammar of Data Manipulation in python


Documentation | Reference Maps | Notebook Examples | API

datar is a re-imagining of APIs for data manipulation in python with multiple backends supported. Those APIs are aligned with tidyverse packages in R as much as possible.

Installation

pip install -U datar

# install with a backend
pip install -U datar[pandas]

# More backends support coming soon

Backends

Repo
datar-numpy
datar-pandas
datar-arrow

Example usage

# with pandas backend
from datar import f
from datar.dplyr import mutate, filter_, if_else
from datar.tibble import tibble
# or
# from datar.all import f, mutate, filter_, if_else, tibble

df = tibble(
    x=range(4),  # or c[:4]  (from datar.base import c)
    y=['zero', 'one', 'two', 'three']
)
df >> mutate(z=f.x)
"""# output
        x        y       z
  <int64> <object> <int64>
0       0     zero       0
1       1      one       1
2       2      two       2
3       3    three       3
"""

df >> mutate(z=if_else(f.x>1, 1, 0))
"""# output:
        x        y       z
  <int64> <object> <int64>
0       0     zero       0
1       1      one       0
2       2      two       1
3       3    three       1
"""

df >> filter_(f.x>1)
"""# output:
        x        y
  <int64> <object>
0       2      two
1       3    three
"""

df >> mutate(z=if_else(f.x>1, 1, 0)) >> filter_(f.z==1)
"""# output:
        x        y       z
  <int64> <object> <int64>
0       2      two       1
1       3    three       1
"""
# works with plotnine
# example grabbed from https://github.com/has2k1/plydata
import numpy
from datar import f
from datar.base import sin, pi
from datar.tibble import tibble
from datar.dplyr import mutate, if_else
from plotnine import ggplot, aes, geom_line, theme_classic

df = tibble(x=numpy.linspace(0, 2 * pi, 500))
(
    df
    >> mutate(y=sin(f.x), sign=if_else(f.y >= 0, "positive", "negative"))
    >> ggplot(aes(x="x", y="y"))
    + theme_classic()
    + geom_line(aes(color="sign"), size=1.2)
)

[plot: sine curve colored by sign]

# very easy to integrate with other libraries
# for example: klib
import klib
from pipda import register_verb
from datar import f
from datar.data import iris
from datar.dplyr import pull

dist_plot = register_verb(func=klib.dist_plot)
iris >> pull(f.Sepal_Length) >> dist_plot()

[plot: distribution of iris Sepal_Length]

Testimonials

@coforfe:

Thanks for your excellent package to port R (dplyr) flow of processing to Python. I have been using other alternatives, and yours is the one that offers the most extensive and equivalent to what is possible now with dplyr.

datar's People

Contributors

josiahparry, osdaf, pdwaggoner, pwwang, rleyvasal


datar's Issues

`anti_join()` with columns not working `KeyError: {'mpg': 'mpg', 'cyl': 'cyl', 'disp': 'disp', 'hp': 'hp'}`

Issue

When using anti_join() on multiple columns, a KeyError appears.

Steps to reproduce

from datar.all import f, filter, anti_join
from datar.datasets import mtcars

mtcars2 = mtcars.reset_index() >> filter(f.mpg > 18)
mtcars.reset_index() >> anti_join(mtcars2, by={'mpg': 'mpg', 'cyl': 'cyl', 'disp': 'disp', 'hp': 'hp'})

[screenshot: anti_join KeyError traceback]

Expected result

anti_join() should return the observations in mtcars that have no match in mtcars2 based on the multiple key columns mpg, cyl, disp, and hp.

Note: left_join() with multiple columns works as expected.
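
A possible workaround, sketched here as an assumption rather than a confirmed fix: since the key columns share names in both frames, passing them as a plain list (the analogue of dplyr's by = c(...)) may sidestep the dict handling that raises the KeyError.

from datar.all import f, filter, anti_join
from datar.datasets import mtcars

mtcars2 = mtcars.reset_index() >> filter(f.mpg > 18)
# hypothetical list-style `by`; the key columns share names on both sides
mtcars.reset_index() >> anti_join(mtcars2, by=["mpg", "cyl", "disp", "hp"])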

`dplyr.filter()` restructures `group_data` incorrectly

P1. It seems group_by() >> mutate() >> filter() >> mutate() has some conflicts.

I want to add a column with a cumulative sum, filter the added column by some condition, and then rank that column.

from datar.all import * 
from datar.datasets import mtcars 

df = (
    mtcars
    >> arrange(f.gear)
    >> group_by(f.cyl)
    >> mutate(cum=f.drat.cumsum())
    >> filter(f.cum >= 5)
    >> mutate(ranking=f.cum.rank())
)
df
# ValueError: Length mismatch: Expected axis has 13 elements, new values have 6 elements

But if I select only some rows first, the code above works.

df = (
    mtcars
    >> head(20)  # Not working when the number is above 25.
    >> arrange(f.gear)
    >> group_by(f.cyl)
    >> mutate(cum=f.drat.cumsum())
    >> filter(f.cum >= 5)
    >> mutate(ranking=f.cum.rank())
)
df
# No error occurs.

If I comment out the arrange, the error occurs as with the first code block.

df = (
    mtcars
    >> head(20)
    # >> arrange(f.gear)
    >> group_by(f.cyl)
    >> mutate(cum=f.drat.cumsum())
    >> filter(f.cum >= 5)
    >> mutate(ranking=f.cum.rank())
)
df
# ValueError: Length mismatch: Expected axis has 13 elements, new values have 6 elements

P2. Filtering after group_by() and then mutating columns, i.e. group_by() >> filter() >> mutate(), also raises errors.

df = (
    mtcars
    >> arrange(f.gear)
    >> group_by(f.cyl)
    >> filter(f.drat >= 4)
    >> mutate(ranking=f.drat.rank())
)
df
# KeyError: 2

With the same code, if I change the column being filtered, no error occurs.

df = (
    mtcars
    >> arrange(f.gear)
    >> group_by(f.cyl)
    >> filter(f.qsec >= 18)
    >> mutate(ranking=f.drat.rank())
)
df
# No errors

`register_verb()` for cufflinks

Is there a way to register_verb() for the cufflinks package and pipe data to it, in order to do the following with datar?

from datar.datasets import mtcars
from datar.all import *
import cufflinks as cf
mtcars.reset_index().iplot(kind='bar', x= 'index', y ='mpg')

[screenshot: cufflinks bar chart]
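
One possible approach, sketched under the assumption that cufflinks has patched DataFrame.iplot and that wrapping it mirrors the klib example in the README above:

import cufflinks as cf
from pipda import register_verb
from datar.datasets import mtcars

# hypothetical wrapper that forwards keyword arguments to the patched DataFrame.iplot
def _iplot(data, **kwargs):
    return data.iplot(**kwargs)

iplot = register_verb(func=_iplot)
mtcars.reset_index() >> iplot(kind='bar', x='index', y='mpg')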

`mutate()` error `AttributeError: REGULAR`

Steps to reproduce

from datar.all import *
from datar.datasets import mtcars
mtcars >> mutate(mpl = f.mpg/4)

Expected output

A new column mpl should be created, with every value calculated as mpg / 4.

ImportError: cannot import name 'VarnameException'

Issue

I just upgraded datar from 0.4.3 to 0.4.4 with pip install -U datar and got ImportError: cannot import name 'VarnameException' when importing datar with from datar.all import *.

Below is the error message:

ImportError: cannot import name 'VarnameException' from 'varname' (C:\Anaconda3\lib\site-packages\varname\__init__.py)
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-2-268be173473d> in <module>
----> 1 from datar.all import *
      2 from datar.datasets import mtcars
      3 mtcars  >> mutate(mpl = f.mpg/4)

C:\Anaconda3\lib\site-packages\datar\all.py in <module>
      7 from .base import *
      8 from .base import _warn as _
----> 9 from .datar import *
     10 from .dplyr import _no_warn as _
     11 from .dplyr import _builtin_names as _dplyr_builtin_names

C:\Anaconda3\lib\site-packages\datar\datar\__init__.py in <module>
      1 """Specific verbs/funcs from this package"""
      2 
----> 3 from .verbs import get, flatten, drop_index
      4 from .funcs import itemgetter

C:\Anaconda3\lib\site-packages\datar\datar\verbs.py in <module>
      9 from ..core.contexts import Context
     10 from ..core.grouped import DataFrameGroupBy
---> 11 from ..dplyr import select, slice_
     12 
     13 

C:\Anaconda3\lib\site-packages\datar\dplyr\__init__.py in <module>
      4 from .across import across, c_across, if_all, if_any
      5 from .arrange import arrange
----> 6 from .bind import bind_cols, bind_rows
      7 from .context import (
      8     cur_column,

C:\Anaconda3\lib\site-packages\datar\dplyr\bind.py in <module>
     15 from ..core.names import repair_names
     16 from ..core.grouped import DataFrameGroupBy
---> 17 from ..tibble import tibble
     18 
     19 

C:\Anaconda3\lib\site-packages\datar\tibble\__init__.py in <module>
      1 """APIs for R-tibble"""
----> 2 from .tibble import tibble, tibble_row, tribble, zibble
      3 from .verbs import (
      4     enframe,
      5     deframe,

C:\Anaconda3\lib\site-packages\datar\tibble\tibble.py in <module>
      4 
      5 from pandas import DataFrame
----> 6 from varname import argname, varname, VarnameException
      7 
      8 import pipda

ImportError: cannot import name 'VarnameException' from 'varname' (C:\Anaconda3\lib\site-packages\varname\__init__.py)

Expected

datar should work without issues after upgrading to a new version.

`[2022-03-11 20:14:31][datar][ INFO] Adding missing grouping variables: ['efficiency']` warning message

What is the following warning message about?

[2022-03-11 20:14:31][datar][ INFO] Adding missing grouping variables: ['efficiency']

I am using a for loop to iterate over thousands of columns, and the warning message keeps showing in the output. I created the code below to illustrate it.

Code to replicate

mtcars_vars = mtcars.columns.to_list()
mtcars = mtcars >> mutate(efficiency=case_when(
    f.mpg.between(0, 16, inclusive='left'), "bad",
    f.mpg.between(16, 20, inclusive='left'), "ok",
    f.mpg.between(20, 22, inclusive='left'), "better",
    f.mpg >= 22, "best",
    True, "other",
))

col_name = mtcars_vars[1]
all_cars = (
    mtcars
    >> select(f.efficiency, f[col_name])
    >> filter(f[col_name] > 0)
    >> group_by(f.efficiency)
    >> count()
    >> rename(**{col_name: f.n})
)

[Discussion] Can the pipe operator be used together with other packages?

For example, the siuba and dfply packages both adopt >> as their pipe operator.
Wouldn't calling datar and siuba at the same time lead to conflicts?
Can the pipe operations here be used together with those packages?

semi_join returns duplicated rows

First of all, I would like to thank you for the amazing effort you have put into this work.

I would like to point out an issue with the current implementation of the semi_join function: if a row in the left tibble corresponds to multiple rows in the right tibble, duplicated rows are returned because of the use of a left join.

Here is a minimal example.

import pandas as pd
from datar.all import *

df1 = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'x': [1, 2, 2, 3]})

semi_join(df1, df2, by='x')

This gives the output

        x
  <int64>
0       1
1       2
2       2
3       3

In case it helps, I have used the following approach for semi_join and anti_join before (I don't remember whether I got this from StackOverflow).

import pandas as pd

def anti_join(left_df, right_df, on):
    return left_df.loc[~left_df.loc[:, on].apply(tuple, axis=1).isin(right_df.loc[:, on].apply(tuple, axis=1))]

def semi_join(left_df, right_df, on):
    return left_df.loc[left_df.loc[:, on].apply(tuple, axis=1).isin(right_df.loc[:, on].apply(tuple, axis=1))]

df1 = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'x': [1, 2, 2, 3]})

semi_join(df1, df2, on=['x'])
anti_join(df1, df2, on=['x'])

Again, thank you very much for this work.

Lubridate commands

Hi @pwwang, do you have plans to add lubridate commands to datar?

I am trying to convert the Date column of stock time-series data to datetime with datar's mutate.

Data from yahoo finance

import pandas as pd
from datar.all import *

aapl = pd.read_csv("AAPL.csv")

aapl.Date = pd.to_datetime(aapl.Date.astype('str'))  # with pandas this works to change the data type to datetime

aapl = aapl >> mutate(Date = as_datetime(f.Date))  # this does not work and shows an error message

aapl = aapl >> mutate(Date = as_date(f.Date))  # this does not work, but shows no error message

pipe operator doesn't work in plain python prompt

It seems that the pipe operator doesn't work when using datar in virtual Anaconda environments.
Here's a snippet of the example code run in the Anaconda prompt:

from datar.all import f, mutate, filter, if_else, tibble

[2021-08-03 19:57:12][datar][WARNING] Builtin name "filter" has been overriden by datar.

df = tibble(
    x=range(4),
    y=['zero', 'one', 'two', 'three']
)

C:\ProgramData\Anaconda3\envs\conda_start\lib\site-packages\pipda\utils.py:159: UserWarning: Failed to fetch the node calling the function, call it with the original function.
warnings.warn(

df >> mutate(z=f.x)

C:\ProgramData\Anaconda3\envs\conda_start\lib\site-packages\pipda\utils.py:159: UserWarning: Failed to fetch the node calling the function, call it with the original function.
warnings.warn(
Traceback (most recent call last):
File "", line 1, in
File "C:\ProgramData\Anaconda3\envs\conda_start\lib\site-packages\pipda\register.py", line 396, in wrapper
return calling_rule(generic, args, kwargs, envdata)
File "C:\ProgramData\Anaconda3\envs\conda_start\lib\site-packages\pipda_calling.py", line 93, in verb_calling_rule3
return generic(*args, **kwargs)
File "C:\ProgramData\Anaconda3\envs\conda_start\lib\functools.py", line 872, in wrapper
raise TypeError(f'{funcname} requires at least '
TypeError: _not_implemented requires at least 1 positional argument

mutate(df, z=f.x)

        x        y       z
  <int64> <object> <int64>
0       0     zero       0
1       1      one       1
2       2      two       2
3       3    three       3

pandas 1.2.3
python 3.8.1

PS: thanks a lot for the package, hopefully the issue can be closed soon :)

`f.a.mean()` not applied to grouped data

>>> from datar.all import *
>>> df = tibble(g=[1,1,2,2], a=[3,4,5,6])
>>> df >> group_by(f.g) >> mutate(b=f.a.mean())
        g       a         b
  <int64> <int64> <float64>
0       1       3       3.5
1       1       4       3.5
2       2       5       3.5
3       2       6       3.5

[Groups: g (n=2)]
>>> # expected
>>> df >> group_by(f.g) >> mutate(b=mean(f.a))
        g       a         b
  <int64> <int64> <float64>
0       1       3       3.5
1       1       4       3.5
2       2       5       5.5
3       2       6       5.5

[Groups: g (n=2)]

from datar.all import * doesn't import trimws

Hello pwwang

I tried to call:
from datar.all import *
and it worked fine, but when I needed trimws I had to import using
from datar.base.string import trimws

If this is intended functionality, please just close the issue.

`Collection` object as indexers failed in pandas 1.3.0

>>> from datar.all import c
>>> from datar.datasets import mtcars
>>> 
>>> mtcars[c("cyl", "disp", "am", "drat")]
Traceback (most recent call last):
  File "<input>", line 1, in <module>
    mtcars[c("cyl", "disp", "am", "drat")]
  File "/.../python3.7/site-packages/pandas/core/frame.py", line 3445, in __getitem__
    if com.is_bool_indexer(key):
  File "/.../python3.7/site-packages/pandas/core/common.py", line 146, in is_bool_indexer
    return len(key) > 0 and lib.is_bool_list(key)
TypeError: Argument 'obj' has incorrect type (expected list, got Collection)

This works in pandas v1.2.*

This is a pandas bug in v1.3.0: pandas-dev/pandas#42433

Nested dataframe not behaving like dplyr nested dataframe

Code -

import pandas as pd
from datar.all import *

df1 = pd.DataFrame(
    {"x": [1, 2, 3, 4], "y": [11, 12, 13, 14], "z": [21, 22, 23, 24]}
)

df1 = df1 >> nest(data1=~f.x)


df2 = pd.DataFrame(
    {"x": [1, 2, 2, 6], "y": [11, 12, 10, 14], "l": [21, 22, 23, 24]}
)

df2 = df2 >> nest(data2=~f.x)


df = (
    df1
    >> nest_join(df2)
    >> rename(data2=f._y_joined)
    >> group_by(f.x)
    >> mutate(ct=f.data2.size)
    >> ungroup()
)


df

Result -

x data1 data2 ct
1 <DF 1x2> <DF 1x1> <bound method GroupBy.size of <pandas.core.gro...
2 <DF 1x2> <DF 1x1> <bound method GroupBy.size of <pandas.core.gro...
3 <DF 1x2> <DF 0x1> <bound method GroupBy.size of <pandas.core.gro...
4 <DF 1x2> <DF 0x1> <bound method GroupBy.size of <pandas.core.gro...

Expected -

x data1 data2 ct
1 <DF 1x2> <DF 1x1> 1
2 <DF 1x2> <DF 1x1> 1
3 <DF 1x2> <DF 0x1> 0
4 <DF 1x2> <DF 0x1> 0

f.duplicated() not working in filter

Sometimes I want to keep all the duplicated rows. In pandas, this is done like this:
mtcars[mtcars.duplicated(keep=False)]
In datar, it does not work.

from datar.all import * 
from datar.datasets import mtcars

mtcars >> select('cyl', 'hp', 'gear', 'disp') >> filter(f.duplicated(keep=False))

But in the following two ways, it works.

# 1. f.series

mtcars >> select('cyl', 'hp', 'gear', 'disp') >> filter(f.cyl.duplicated(keep=False))

# 2. select all the columns

mtcars >> select('cyl', 'hp', 'gear', 'disp') >> filter(f['cyl'].duplicated(keep=False))

It seems that only a Series can be passed to filter.

dataframe to tibble

Hello

I'm trying to read a csv file with pandas and pass it to a tibble to work with it. I couldn't find any documentation for this.

What I want to do is:

  1. Read csv file (currently using pandas for this) and converting it to a dataframe
  2. Take that dataframe do
df >> group_by(f.col1, f.col2) >> mutate(newCol1 = min(f.col-value), newCol2 = max(f.col-value))

When I try to do it with a pandas dataframe, I get this error:

/python/lib/python3.8/site-packages/pipda/utils.py:161: UserWarning: Failed to fetch the node calling the function, call it with the original function.
  warnings.warn(
NotImplementedError: 'group_by' is not registered for type: <class 'pipda.symbolic.DirectRefAttr'>.

and then just a traceback of the most recent calls.

How should I properly load my CSV file to use datar?
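
A sketch of one way this is commonly done, under two assumptions: datar verbs accept plain pandas DataFrames directly (no tibble conversion is needed), and the aggregating min/max should be datar's vectorized versions rather than Python's builtins. The file and column names below are placeholders.

import pandas as pd
from datar import f
from datar.dplyr import group_by, mutate
from datar.base import min as dmin, max as dmax  # assumed datar equivalents of R's min/max

df = pd.read_csv("my_file.csv")  # hypothetical file; the DataFrame can be piped as-is
out = (
    df
    >> group_by(f.col1, f.col2)
    >> mutate(newCol1=dmin(f.col_value), newCol2=dmax(f.col_value))
)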

[Proposal] Allow a "plugin" system for other packages to be ported independently but imported from `datar`

Example:

An independent python package is porting, say, multidplyr. The package can be named datar_multidplyr. Users can do:

import datar_multidplyr

Or, if the developer wants to bundle it with datar, they could register it:

datar.register('multidplyr') # could be an alias

So that users can do:

from datar.multidplyr import ...
# and all verbs/functions can be imported by
from datar.all import *

See also #25

Piping syntax not running in raw python REPL

@GitHunter0

I'm just having an issue with multi-line execution of datar code in VScode.

If I run this line by line, it works smoothly.

from datar.all import (f, mutate, tibble, fct_infreq, fct_inorder, pull)
df = tibble(var=['b','b','b','c','a','a'])
df = df >> mutate(fct_var = f['var'].astype("category"))

However, if I select all the lines and execute them, it returns:

C:\Users\user_name\miniconda3\envs\py38\lib\site-packages\pipda\utils.py:161: UserWarning: Failed to fetch the node calling the function, call it with the original function.

>>> df = df >> mutate(fct_var = f['var'].astype("category"))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\user_name\miniconda3\envs\py38\lib\site-packages\pipda\register.py", line 396, in wrapper
    return calling_rule(generic, args, kwargs, envdata)
  File "C:\Users\user_name\miniconda3\envs\py38\lib\site-packages\pipda\_calling.py", line 93, in verb_calling_rule3
    return generic(*args, **kwargs)
  File "C:\Users\user_name\miniconda3\envs\py38\lib\functools.py", line 872, in wrapper
    raise TypeError(f'{funcname} requires at least '
TypeError: _not_implemented requires at least 1 positional argument

Originally posted by @GitHunter0 in #48 (reply in thread)

Incompatible with pandas 1.2.0; pandas 1.2.5 works

pandas 1.2.0 does not have the function pandas.io.formats.format._trim_zeros_single_float, only pandas.io.formats.format._trim_zeros_float, so importing datar raises an error.

I tried pandas 1.2.5 and no error is raised; the minimum required pandas version probably needs to be raised.

Function to write dataframe to CSV like readr `write_csv()`

Hi @pwwang,

Happy New Year!, I hope you had a great holiday break.

I want to save the output dataframe of a pipe into a CSV file the way write_csv() does in readr/dplyr, but I did not find such a function in the documentation.

Is there a write_csv() function in datar?

This is the way it works in an R notebook:

```{r}
library(tidyverse)
```

```{r}
mtcars |> group_by(mpg) |> count(cyl) |> write_csv('mpg.csv')
```

This is what I tried with datar, and it did not work:

from datar import *
from datar.datasets import mtcars
mtcars >> group_by(f.mpg) >> count(f.cyl) >> write_csv('mpg.csv')
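
A possible workaround, sketched under the assumption that the pipeline result behaves like a pandas DataFrame (so pandas' own to_csv applies) and registered as a verb the same way the klib example in the README does:

from pipda import register_verb
from datar.all import f, group_by, count
from datar.datasets import mtcars

def _write_csv(data, path):
    # delegate to pandas; return the data so the pipe can continue
    data.to_csv(path, index=False)
    return data

write_csv = register_verb(func=_write_csv)
mtcars >> group_by(f.mpg) >> count(f.cyl) >> write_csv('mpg.csv')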

Porting more related `tidyverse` packages

I believe your great initiative could be stronger if you join forces with siuba, which has the same philosophy, instead of building a new project from the ground up. What do you think about that?

Any way to stop the re package being overwritten?

re is needed for regular expressions.
Then re has to be imported after datar:

from datar.all import *
import re

I always import re first. Then it doesn't work after being overwritten. Sometimes I use re in a function like this:
Sometimes I use re in function like this :

def test(x, y):
    re.sub(...)
    re.replace(...)
    return ...
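
One common workaround, sketched here as an assumption (whether datar.all actually exports a name re depends on the datar version): keep the standard-library module under an alias, or import only the datar names you need instead of the wildcard.

import re as _re             # the alias survives any later wildcard import
from datar.all import *      # may shadow the plain name `re`

def test(x, y):
    return _re.sub(r"\s+", " ", x) + y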

Cannot `rename()` column to name with spaces

I want to change a column name to a name containing spaces, as shown below; however, the code below does not work.

is there a way to do this with datar?

mtcars >> rename(f['miles per gallon'] = f.mpg)
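
A sketch of a likely workaround (assuming rename() accepts keyword arguments of the form new_name=old_column, as seen with rename(**{...}) in an earlier issue above): names that are not valid Python identifiers can be passed via dict unpacking.

from datar.all import f, rename
from datar.datasets import mtcars

# '**' unpacking allows a new column name that contains spaces
mtcars >> rename(**{"miles per gallon": f.mpg})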

forcats

Hi!

Are you planning to add forcats function to datar?

Have a good day!

mean() with option `na_rm=False` does not work

Please, consider the MWE below:

from datar.all import *
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': ['A']*2 + ['B']*2,
    'date':['2020-01-01','2020-02-01']*2,
    'value': [2,np.nan,3,3]
}) 
df

df_mean = (df 
    >> group_by(f.id)
    >> summarize(
        # value_np_nanmean = np.nanmean(f.value),
        value_np_mean = np.mean(f.value),
        value_datar_mean = mean(f.value, na_rm=False)
    )
)
df_mean 

[screenshot: df_mean output]

In df_mean, the first value of value_np_mean and value_datar_mean should be NaN instead of 2.
This is the same issue found in pandas, which discards NaN/None observations automatically during calculations.
The only workaround I found is this: https://stackoverflow.com/questions/54106112/pandas-groupby-mean-not-ignoring-nans/54106520

`datar` not working on RStudio notebooks

I am trying to run datar code in RStudio R notebooks, but the code does not run. I would like to use R notebooks to highlight parts of the datar code and run them separately.

@pwwang, what is the issue with running datar on RStudio?

[screenshot: RStudio notebook error]

`TibbleGrouped` object is not expandable in VSCode jupyter data viewer

When I create grouped data with datar's group_by(), I get a DataFrameGroupBy object instead of a DataFrame. This is undesirable in VSCode because a DataFrameGroupBy cannot be clicked in the Variables window to see the entire dataframe, whereas mtcars can be clicked to expose the full dataset because it is a DataFrame.

The code below creates grouped data in datar and grouped data in pandas; however, datar creates a DataFrameGroupBy instead of a DataFrame.

from datar.all import *
from datar.datasets import mtcars
datar_group = mtcars >> group_by(f.hp) >> count()
pandas_group = mtcars.groupby('hp').size().reset_index().rename(columns = {0:"n"})

datar_group
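
A possible workaround, sketched here on the assumption that dropping the grouping metadata is acceptable: ungroup() returns a plain frame that the data viewer can expand.

from datar.all import f, group_by, count, ungroup
from datar.datasets import mtcars

# ungroup() so the result is a plain DataFrame the VSCode viewer can open
datar_group = mtcars >> group_by(f.hp) >> count() >> ungroup()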

left_join with by.x and by.y

Hello

So I have some R code that looks like this:

new_df = df1 %>% merge(df2, by.x = "col", by.y = "col2", all.x = TRUE)
I'm trying to do a left merge on two columns. Both frames have the by.y column (col2), but only df1 has the col1 column.
When I try to do it the datar way like this:
new_df = df1 >> left_join(df2, by =["col1", "col2"])

I get the error:
KeyError: 'col1'

Am I doing something wrong? Or is it not possible to do by.x, by.y like in R?

When doing it the pandas way like:

new_df = pd.merge(df1, df2, left_on = "col1", right_on = "col2", how="left")

It returns col2_y and col2_x, which I'm not interested in. This is not a problem in the R code.
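
A sketch of what may be the datar equivalent, stated as an assumption (based on dplyr's by = c("col1" = "col2") and on the dict-style by used in the anti_join issue above): map the left column name to the right column name with a dict. df1 and df2 are the frames from the report.

from datar.all import f, left_join

# hypothetical dict-style `by`: left column name -> right column name
new_df = df1 >> left_join(df2, by={"col1": "col2"})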

`group_by >> mutate` return error with multiple grouping variables and different sizes

I found that the mutate function returns an error after group_by if there is more than one grouping variable and the group sizes differ.

from datar import f
from datar.dplyr import mutate, group_by
from datar.tibble import tibble

d = tibble(
    g1=['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c'],
    g2=['a', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'b']
) >> mutate(x=range(9))

print(d.groupby(['g1', 'g2']).size())

The sizes of the groups are the following:

g1  g2
a   a     3
b   a     1
    b     2
c   b     1
    c     2

Then, when I try to use mutate after group_by, the following error occurs:

print(d >> group_by(f.g1, f.g2) >> mutate(x=f.x))
ValueError: Length mismatch: Expected axis has 2 elements, new values have 1 elements
The full log

ValueError                                Traceback (most recent call last)
/mnt/g/User/Games/Blade-and-Sorcery/translation-jp/test.py in 
     10 print(d.groupby(['g1', 'g2']).size())
     11
---> 12 print(d >> group_by(f.g1, f.g2) >> mutate(x=f.x))

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/pipda/function.py in _pipda_eval(self, data, context)
     94             # leave args/kwargs for the child
     95             # verb/function/operator to evaluate
---> 96             return func(*bondargs.args, **bondargs.kwargs)  # type: ignore
     97
     98         args = evaluate_expr(

~/.pyenv/versions/3.8.10/lib/python3.8/functools.py in wrapper(*args, **kw)
    873                             '1 positional argument')
    874
--> 875         return dispatch(args[0].__class__)(*args, **kw)
    876
    877     funcname = getattr(func, '__name__', 'singledispatch function')

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/datar/dplyr/mutate.py in _(_data, _keep, _before, _after, base0_, *args, **kwargs)
    196         return ret
    197
--> 198     out = _data._datar_apply(apply_func, _drop_index=False).sort_index()
    199     if out.shape[0] > 0:
    200         # keep the original row order

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/datar/core/grouped.py in _datar_apply(self, _func, _mappings, _method, _groupdata, _drop_index, *args, **kwargs)
    214
    215             # keep the order
--> 216             out = self._grouped_df.apply(_applied).sort_index(level=-1)
    217
    218         if not _groupdata:

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
   1270         with option_context("mode.chained_assignment", None):
   1271             try:
-> 1272                 result = self._python_apply_general(f, self._selected_obj)
   1273             except TypeError:
   1274                 # gh-20949

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f, data)
   1304             data after applying f
   1305         """
-> 1306         keys, values, mutated = self.grouper.apply(f, data, self.axis)
   1307
   1308         return self._wrap_applied_output(

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
    781             try:
    782                 sdata = splitter.sorted_data
--> 783                 result_values, mutated = splitter.fast_apply(f, sdata, group_keys)
    784
    785             except IndexError:

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/pandas/core/groupby/ops.py in fast_apply(self, f, sdata, names)
   1326         # must return keys::list, values::list, mutated::bool
   1327         starts, ends = lib.generate_slices(self.slabels, self.ngroups)
-> 1328         return libreduction.apply_frame_axis0(sdata, f, names, starts, ends)
   1329
   1330     def _chop(self, sdata: DataFrame, slice_obj: slice) -> DataFrame:

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/pandas/_libs/reduction.pyx in pandas._libs.reduction.apply_frame_axis0()

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/datar/core/grouped.py in _applied(subdf)
    210                 subdf.attrs["_group_index"] = group_index
    211                 subdf.attrs["_group_data"] = self._group_data
--> 212                 ret = _func(subdf, *args, **kwargs)
    213                 return None if ret is None else ret
    214

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/datar/dplyr/mutate.py in apply_func(df)
    193             **kwargs,
    194         )
--> 195         ret.index = rows
    196         return ret
    197

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
   5498         try:
   5499             object.__getattribute__(self, name)
-> 5500             return object.__setattr__(self, name, value)
   5501         except AttributeError:
   5502             pass

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/pandas/_libs/properties.pyx in pandas._libs.properties.AxisProperty.__set__()

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/pandas/core/generic.py in _set_axis(self, axis, labels)
    764     def _set_axis(self, axis: int, labels: Index) -> None:
    765         labels = ensure_index(labels)
--> 766         self._mgr.set_axis(axis, labels)
    767         self._clear_item_cache()
    768

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/pandas/core/internals/managers.py in set_axis(self, axis, new_labels)
    214     def set_axis(self, axis: int, new_labels: Index) -> None:
    215         # Caller is responsible for ensuring we have an Index object.
--> 216         self._validate_set_axis(axis, new_labels)
    217         self.axes[axis] = new_labels
    218

~/.pyenv/versions/3.8.10/lib/python3.8/site-packages/pandas/core/internals/base.py in _validate_set_axis(self, axis, new_labels)
     55
     56         elif new_len != old_len:
---> 57             raise ValueError(
     58                 f"Length mismatch: Expected axis has {old_len} elements, new "
     59                 f"values have {new_len} elements"

ValueError: Length mismatch: Expected axis has 2 elements, new values have 1 elements

Note:

  • I confirmed this problem in Python 3.9.5 and 3.8.10 + IPython 7.28.0 with the latest datar package (73c58da).
  • This error didn't occur with equal group sizes.
  • This error didn't occur with a single grouping variable (e.g. d >> group_by(f.g2) >> mutate(x=f.x)).

P.S. Thank you for the attention to my post.

Optimize `DataFrameGroupBy.apply()`

Refer to:

https://github.com/pandas-dev/pandas/blob/57bb1657a3530a2576a5d05541510840ebe9fd91/pandas/core/apply.py#L1072-L1104

    def apply_standard(self) -> DataFrame | Series:
        f = self.f
        obj = self.obj

        with np.errstate(all="ignore"):
            if isinstance(f, np.ufunc):
                return f(obj)

            # row-wise access
            if is_extension_array_dtype(obj.dtype) and hasattr(obj._values, "map"):
                # GH#23179 some EAs do not have `map`
                mapped = obj._values.map(f)
            else:
                values = obj.astype(object)._values
                # error: Argument 2 to "map_infer" has incompatible type
                # "Union[Callable[..., Any], str, List[Union[Callable[..., Any], str]],
                # Dict[Hashable, Union[Union[Callable[..., Any], str],
                # List[Union[Callable[..., Any], str]]]]]"; expected
                # "Callable[[Any], Any]"
                mapped = lib.map_infer(
                    values,
                    f,  # type: ignore[arg-type]
                    convert=self.convert_dtype,
                )

        if len(mapped) and isinstance(mapped[0], ABCSeries):
            # GH 25959 use pd.array instead of tolist
            # so extension arrays can be used
            return obj._constructor_expanddim(pd_array(mapped), index=obj.index)
        else:
            return obj._constructor(mapped, index=obj.index).__finalize__(
                obj, method="apply"
            )

`rowwise() >> mutate()` takes a long time to execute

rowwise() >> mutate() takes a very long time to execute.
My original test data had 4000 columns and 2000 rows (mostly zeroes) and took about 24 minutes to execute, while sum(axis=1) took a few seconds. I tried to replicate the issue with the mtcars dataset below.

Also, I would like to keep the index so I know which values the sum belongs to, but rowwise() >> mutate() takes it away, whereas sum(axis=1) keeps the index.

Lines of code to replicate:

mt2 = mtcars.append(mtcars).append(mtcars).append(mtcars).append(mtcars)

mt2 >> rowwise() >> mutate(total = sum(c_across(everything())))

#compare to 
mt2.sum(axis=1)

In the example below, rowwise() >> mutate() takes 844 ms and sum(axis=1) 26 ms.
[screenshot: timing comparison]

Filter multiple objects using `in` with `filter` not working

Issue

Filtering on multiple values with filter is not working, and this is not covered in the documentation.

Expected results

Filtering on multiple values should work with the filter verb.

Steps to replicate

from datar.all import f, filter
from datar.datasets import mtcars

mtcars = mtcars.reset_index()
mercedes = ["Merc 230", "Merc 240D", "Merc 280"]
mtcars >> filter(f.index in mercedes)

System info

Windows 10
Jupyterlab version 3.0.14

datar 0.3.1
pipda 0.4.0
[screenshot: filter error]
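
A hedged sketch of the likely fix: Python's in operator cannot return an element-wise result, so the pandas-style isin is assumed to be the way to express membership inside a datar expression.

from datar.all import f, filter
from datar.datasets import mtcars

mtcars = mtcars.reset_index()
mercedes = ["Merc 230", "Merc 240D", "Merc 280"]
# element-wise membership test instead of Python's `in`
mtcars >> filter(f.index.isin(mercedes))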

`group_by()` very slow

Issue

group_by() is very slow when working with large datasets.

Expected result

group_by() should be close to pandas groupby in processing time.

Example

datar group_by() takes 2.16s

pandas groupby() takes 23ms

[screenshot: timing comparison]

Cannot import datar on VScode

Hey @pwwang , I reinstalled VScode and a new strange issue appeared.

In the Interactive Window, import pandas for example is running smoothly, however import datar throws this error:

SyntaxError: unexpected EOF while parsing (datar.py, line 17)
Traceback (most recent call last):

  File "C:\Users\flavio\miniconda3\envs\py38\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "C:\Users\flavio\AppData\Local\Temp/ipykernel_8104/1528111572.py", line 1, in <module>
    import datar

  File "d:\temp_download\datar.py", line 17
    
    ^
SyntaxError: unexpected EOF while parsing

The same error also appears when running the line from a .py file. Nonetheless, running directly in Terminal works as expected.

VScode 1.60.1
Windows 10 Pro 64 bits
Miniconda 3

asttokens                 2.0.5                    pypi_0    pypi
backcall                  0.2.0              pyhd3eb1b0_0
blas                      1.0                         mkl
bottleneck                1.3.2            py38h2a96729_1
ca-certificates           2021.7.5             haa95532_1
certifi                   2021.5.30        py38haa95532_0
colorama                  0.4.4              pyhd3eb1b0_0
datar                     0.5.0                    pypi_0    pypi
debugpy                   1.4.1            py38hd77b12b_0
decorator                 5.0.9              pyhd3eb1b0_0
diot                      0.1.1                    pypi_0    pypi
entrypoints               0.3                      py38_0
executing                 0.8.1                    pypi_0    pypi
inflection                0.5.1                    pypi_0    pypi
intel-openmp              2021.3.0          haa95532_3372
ipykernel                 6.2.0            py38haa95532_1
ipython                   7.26.0           py38hd4e2768_0
ipython_genutils          0.2.0              pyhd3eb1b0_1
jedi                      0.18.0           py38haa95532_1
jupyter_client            7.0.1              pyhd3eb1b0_0
jupyter_core              4.7.1            py38haa95532_0
matplotlib-inline         0.1.2              pyhd3eb1b0_2
mkl                       2021.3.0           haa95532_524
mkl-service               2.4.0            py38h2bbff1b_0
mkl_fft                   1.3.0            py38h277e83a_2
mkl_random                1.2.2            py38hf11a4ad_0
nest-asyncio              1.5.1              pyhd3eb1b0_0
numexpr                   2.7.3            py38hb80d3ca_1
numpy                     1.20.3           py38ha4e8547_0
numpy-base                1.20.3           py38hc2deb75_0
openssl                   1.1.1l               h2bbff1b_0
pandas                    1.3.2            py38h6214cd6_0
parso                     0.8.2              pyhd3eb1b0_0
pickleshare               0.7.5           pyhd3eb1b0_1003
pip                       21.2.2           py38haa95532_0
pipda                     0.4.5                    pypi_0    pypi
plotly                    5.3.1                    pypi_0    pypi
prompt-toolkit            3.0.17             pyhca03da5_0
pure-eval                 0.2.1                    pypi_0    pypi
pygments                  2.10.0             pyhd3eb1b0_0
python                    3.8.10               hdbf39b2_7
python-dateutil           2.8.2              pyhd3eb1b0_0
python-slugify            5.0.2                    pypi_0    pypi
pytz                      2021.1             pyhd3eb1b0_0
pywin32                   228              py38hbaba5e8_1
pyzmq                     22.2.1           py38hd77b12b_1
scipy                     1.7.1                    pypi_0    pypi
setuptools                52.0.0           py38haa95532_0
six                       1.16.0             pyhd3eb1b0_0
sqlite                    3.36.0               h2bbff1b_0
tenacity                  8.0.1                    pypi_0    pypi
text-unidecode            1.3                      pypi_0    pypi
tornado                   6.1              py38h2bbff1b_0
traitlets                 5.0.5              pyhd3eb1b0_0
varname                   0.8.1                    pypi_0    pypi
vc                        14.2                 h21ff451_1
vs2015_runtime            14.27.29016          h5e58377_2
wcwidth                   0.2.5              pyhd3eb1b0_0
wheel                     0.37.0             pyhd3eb1b0_1
wincertstore              0.2                      py38_0

ImportError

Successfully installed datar-0.5.6 python-slugify-5.0.2

[autoreload of datar.base.arithmetic failed: Traceback (most recent call last):
File "/usr/local/Caskroom/miniconda/base/lib/python3.9/site-packages/IPython/extensions/autoreload.py", line 245, in check
superreload(m, reload, self.old_objects)
File "/usr/local/Caskroom/miniconda/base/lib/python3.9/site-packages/IPython/extensions/autoreload.py", line 394, in superreload
module = reload(module)
File "/usr/local/Caskroom/miniconda/base/lib/python3.9/imp.py", line 314, in reload
return importlib.reload(module)
File "/usr/local/Caskroom/miniconda/base/lib/python3.9/importlib/__init__.py", line 159, in reload
raise ImportError(msg.format(parent_name),
ImportError: parent 'datar.base' not in sys.modules
]

The same autoreload failure, ending in "ImportError: parent 'datar.base' not in sys.modules", repeats for datar.base.bessel, na, casting, testing, complex, constants, cum, date, factor, funs, logical, null, random, seq, special, string, table, and trig_hb.
Any plans to add new functions like fast_mutate in siuba

With big data, mutate after group_by always takes a long time.
I found that fast_mutate and fast_summarize in siuba shorten the time a lot.
So, do you have any plans to add similar functions to optimize operations on grouped data?

cannot import name 'argname2' from 'varname' error

Issue

When importing datar, only two of the lines of code below run without error; all the others give the same error message: cannot import name 'argname2' from 'varname'.

from datar import f
from datar.datasets import mtcars
from datar.dplyr import mutate, filter, if_else
from datar.tibble import tibble
from datar.all import *

Expected result

Any of the datar imports should work without errors.

System information:

Windows 10
Jupyterlab version 3.0.14

pip list shows packages installed
datar 0.3.1
pipda 0.4.0

Steps taken to resolve issue

pip uninstall datar
pip uninstall pipda

restarted kernel
pip install -U datar
[screenshot: import error]

Add CLI interface to process data on the command line

The proposed feature would be (on command line):

> datar import iris | \
    datar group_by f.Species | \
    datar summarise --sum_Sepal_Length="sum(f.Sepal_Length)" | \
    datar to_csv sum_iris.csv
# load data from file
> datar read_csv --sep="\t" --index_col=0 | datar mutate --double_Sepal_Length="f.Sepal_Length*2"

This module will be optional and installed only when:

pip install -U datar-cli

`filter` slows down on large datasets

Issue

When using filter on small datasets there is minimal difference in processing time between datar's filter and pandas' query; however, when working with large datasets, datar's filter slows down dramatically.

Steps to replicate

import pandas as pd
from datar.all import f, mutate, filter

df_test = pd.read_excel("https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx", dtype={"Description": str})
df_test >>= mutate(country2=f.Country, country3=f.Country, country4=f.Country, custid2=f.CustomerID, custid3=f.CustomerID, custid4=f.CustomerID, invoice2=f.InvoiceDate, invoice3=f.InvoiceDate, invoice4=f.InvoiceDate, stock2=f.StockCode, stock3=f.StockCode)
filtered_df = df_test >> filter((f.Description == "WHITE HANGING HEART T-LIGHT HOLDER") | (f.Description == "REGENCY CAKESTAND 3 TIER"))

Note: I artificially created more columns to show the issue.

Snapshot comparing small dataset vs large dataset using datar's filter and pandas query

[screenshot: timing comparison of small vs. large datasets]

Operator `&` losing index

When case_when is used, the output is not as expected.

Code to replicate

mtcars >> mutate(gas_milage = case_when(
    f.mpg > 21 and f.mpg <= 22, "ok",
    f.mpg > 22, "best",
    True, "other"
))

Issue: the last row in the output does not meet the f.mpg > 21 and f.mpg <= 22 condition, but the "ok" label is still applied.

[screenshot: case_when output]

Expected result

Only rows meeting the f.mpg > 21 and f.mpg <= 22 condition should be labeled "ok"; all rows not meeting any condition should be labeled "other".
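
A hedged note on the likely cause (an interpretation, not part of the original report): Python's and short-circuits on whole-Series truthiness instead of combining conditions element-wise, so conditions inside expressions are usually written with & and parentheses, as sketched below.

from datar.all import f, mutate, case_when
from datar.datasets import mtcars

mtcars >> mutate(gas_milage=case_when(
    (f.mpg > 21) & (f.mpg <= 22), "ok",   # element-wise AND instead of `and`
    f.mpg > 22, "best",
    True, "other",
))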
