
Comments (6)

orlp commented on June 25, 2024

As mentioned by others, this is not a fair comparison because the input/output formats are different: we don't manipulate the data in place but generate a copy. Also, Polars has proper nulls, which means it has to read a separate memory location that records which values are null, whereas pandas only has to look at the values themselves since it uses NaNs.
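To make the NaN-versus-null distinction concrete, here is a minimal pure-Python sketch of the two representations. It is only a model of the idea, not how either library implements the operation; the function names `bfill_nan_sentinel` and `bfill_validity_mask` are hypothetical.

```python
import math


def bfill_nan_sentinel(values):
    """Backward fill where a missing entry is a NaN stored in the data itself."""
    out = list(values)
    next_valid = math.nan
    # Walk from the end so each gap can take the next valid value after it.
    for i in range(len(out) - 1, -1, -1):
        if math.isnan(out[i]):
            out[i] = next_valid
        else:
            next_valid = out[i]
    return out


def bfill_validity_mask(values, validity):
    """Backward fill where missing entries are flagged in a separate mask.

    Returns a new values list and a new validity list; the inputs are untouched.
    """
    out_values = list(values)
    out_validity = list(validity)
    have_next = False
    next_valid = 0.0
    for i in range(len(values) - 1, -1, -1):
        # The mask, not the value, decides whether the slot is missing.
        if validity[i]:
            next_valid = values[i]
            have_next = True
        elif have_next:
            out_values[i] = next_valid
            out_validity[i] = True
    return out_values, out_validity
```

The second version has to read two inputs and produce two outputs per element, which is the extra work the comment refers to.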

Finally, the original test of 280,000 rows is way too small - at that size you're almost benchmarking the Polars DSL parsing/optimizer more than the data manipulation itself.

Repeating the above experiment with 100M rows I get the following results on my Apple M2 machine:

378 ms ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
742 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I'm currently finishing a PR that would reduce the gap to this:

379 ms ± 5.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
544 ms ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

More improvement with branchless filling is still possible, but it's low priority at the moment, as it's rather labour-intensive to write.

from polars.

d-reynol commented on June 25, 2024

I'm not seeing the same behavior:

242 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
239 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Chuck321123 commented on June 25, 2024

Reopening the issue, as I still get faster results for the pandas counterpart.

2.21 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.86 ms ± 114 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Let me know if anyone gets something else.


deanm0000 commented on June 25, 2024

I don't think your test is apples to apples.

Calling df["random_value"].bfill() doesn't return a DataFrame; it returns a Series.

A more apples-to-apples test would compare two function calls that each return a DataFrame, something like:

%%timeit
df2.with_columns(pl.col("random_value").backward_fill())
%%timeit
df.assign(random_value=df["random_value"].bfill())

When I run that comparison with 100M rows and 20% nulls, Polars takes 795 ms and pandas takes 1.44 s.


Chuck321123 commented on June 25, 2024

@deanm0000 I see. The whole idea is to create a new column in a DataFrame where I do backward filling. Using with_columns instead of select, I get the following results, where Polars is the second line:

2.33 ms ± 709 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.25 ms ± 603 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The usual pandas way of adding a new column or modifying an existing one is df["New_Col"] = ..., so it would be somewhat unfair to compare against assign, which "nobody" uses.


deanm0000 commented on June 25, 2024

I see your point, but it's not a bug that pandas is faster for this operation.

Someone should correct me if I have this wrong, but I think the difference is that NumPy arrays are mutable whereas Arrow arrays are immutable. That means when you just want to change a subset of values, pandas/NumPy can do that in place, whereas with Arrow arrays the same operation has to rewrite all the values.
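The mutable-versus-immutable argument can be illustrated with a small pure-Python sketch that counts element writes. This only models the hypothesis above and says nothing about what pandas or Polars actually do internally; `bfill_in_place` and `bfill_copy` are hypothetical names.

```python
def bfill_in_place(values):
    """Backward fill a mutable list; return how many slots were written."""
    writes = 0
    next_valid = None
    for i in range(len(values) - 1, -1, -1):
        if values[i] is None:
            if next_valid is not None:
                # Only the gaps are touched.
                values[i] = next_valid
                writes += 1
        else:
            next_valid = values[i]
    return writes


def bfill_copy(values):
    """Backward fill into a fresh list; return the list and the write count."""
    out = [None] * len(values)
    writes = 0
    next_valid = None
    for i in range(len(values) - 1, -1, -1):
        if values[i] is not None:
            next_valid = values[i]
        # Every slot of the output has to be written, gap or not.
        out[i] = next_valid
        writes += 1
    return out, writes
```

With 20% nulls, the in-place version writes roughly a fifth of the slots, while the copying version writes all of them.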

