
Comments (2)

Gmousse commented on May 11, 2024

Hi @dragoljub,

Your question is really interesting.
As you said, the DataFrame object is not really a column-based data structure; it is actually row-based.

To write this codebase I had two options:

  • Row-based: Define a DataFrame as a collection of rows sharing a schema. This is the approach used by Spark (and other functional, immutable, big data structures). The API should be functional, based on maps and reductions.
  • Column-based: Define a DataFrame as a collection of columns, each having a type. This is the approach used by R and pandas (in Python). The API should be based on vectorized computations and is usually mutable.
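
To make the two options concrete, here is a minimal sketch of both layouts in plain JavaScript. It is not how dataframe-js stores data internally; the names `rowBased` and `columnBased` and the sample values are made up for illustration.

```javascript
// Row-based: an ordered schema plus a list of row records.
const rowBased = {
  schema: ["id", "price", "qty"],
  rows: [
    { id: 1, price: 10, qty: 2 },
    { id: 2, price: 5, qty: 0 },
    { id: 3, price: 8, qty: 4 },
  ],
};

// Column-based: one array per column, all of the same length.
const columnBased = {
  id: [1, 2, 3],
  price: [10, 5, 8],
  qty: [2, 0, 4],
};

// Reading one row: direct in the row layout, one index lookup per column otherwise.
const secondRow = rowBased.rows[1];
const secondRowFromColumns = Object.fromEntries(
  Object.entries(columnBased).map(([name, values]) => [name, values[1]])
);

// Reading one column: a map over rows versus a direct property access.
const pricesFromRows = rowBased.rows.map((row) => row.price);
const pricesFromColumns = columnBased.price;
```

Both layouts hold the same information; what differs is which access pattern is cheap.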

I have chosen Row-based because:

  • I think the functional paradigm (and the Spark implementation and API) fits better with the evolution of JavaScript (and its future?), and also with the way I work.
  • Rows make it easier to maintain the dataset structure: we don't have to track indexes into column vectors to manipulate row data (which can be a pain in a group by, for example).
  • It's better (in terms of computation and usage) for reductions that take multiple columns into account to filter rows, sort them, or modify them.
  • It's easier to make rows immutable.
  • (As you said) it's perfect for front-end applications (visualizations, lists of data, filtering...).
  • Rows are the best format for parallelism (it should become a reality in JavaScript in 1 or 2 years, http://www.2ality.com/2017/01/shared-array-buffer.html). I plan to test parallelism with DataFrame-js on the backend with https://github.com/turbo/js. Column-based can't be used for this purpose.
  • Rows have many other advantages that I can't think of right now...
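
A few of these points can be sketched in plain JavaScript, assuming rows are simple frozen objects (the sample data and variable names are hypothetical, and this is not the dataframe-js API):

```javascript
// Hypothetical sample rows, frozen so that they cannot be mutated in place.
const rows = [
  { city: "Paris", price: 10, qty: 2 },
  { city: "Lyon", price: 5, qty: 0 },
  { city: "Paris", price: 8, qty: 4 },
  { city: "Lyon", price: 9, qty: 1 },
].map((row) => Object.freeze(row));

// Filter on two columns at once: each row already carries both values.
const sold = rows.filter((row) => row.qty > 0 && row.price >= 8);

// Derive a new column by building new rows; the originals are left untouched.
const withTotal = sold.map((row) => ({ ...row, total: row.price * row.qty }));

// Group by: a reduction over rows, with no per-column index bookkeeping.
const totalByCity = withTotal.reduce((acc, row) => {
  acc[row.city] = (acc[row.city] ?? 0) + row.total;
  return acc;
}, {});

console.log(totalByCity); // { Paris: 52, Lyon: 9 }
```

In a column layout the same filter would have to compute the surviving indexes first and then apply them to every column array.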

However Row-based has some disadvantages:

  • It's slower in some simple column manipulation cases where column-based can be better: modifying one column, adding one column, sorting columns, casting one column... (but this disadvantage disappears when you need to work on multiple columns at the same time).
  • The API can be painful for simple column manipulations (like df["mycolumn"] = df["mycolumn"] ** 2 in pandas).
  • Declaring a large number of Row objects is heavy and leads to slow computations...
  • And a lot of other issues...
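
The single-column case can be illustrated with a small plain JavaScript sketch that squares `mycolumn` in both layouts. The data and names are hypothetical and this is not the dataframe-js API; it only shows where the extra work comes from.

```javascript
// Hypothetical sample data in both layouts.
const rows = [
  { id: 1, mycolumn: 2 },
  { id: 2, mycolumn: 3 },
  { id: 3, mycolumn: 4 },
];
const columns = { id: [1, 2, 3], mycolumn: [2, 3, 4] };

// Row-based and immutable: every row is copied, so N new row objects are allocated
// even though only one field changes.
const squaredRows = rows.map((row) => ({ ...row, mycolumn: row.mycolumn ** 2 }));

// Column-based: only one array is rebuilt; the other columns are reused as-is.
const squaredColumns = {
  ...columns,
  mycolumn: columns.mycolumn.map((value) => value ** 2),
};
```

The row version touches every field of every row, while the column version never reads the `id` array at all.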

To conclude, both column-based and row-based have advantages (and disadvantages). I have chosen the row-based solution, but it could be interesting to improve column manipulations or to add new features. Why not create a MutableDataFrame (as Scala does for some data structures) that would offer an API and column-based operations similar to those of R or pandas? It could be interesting, but it's not among my short-term goals.

Indeed, DataFrame can be slower in some column manipulations, but it's also faster in map and reduction tasks (which I use about 70% of the time).
I'm (slowly) working on a new DataFrame version that includes significant performance optimizations (speed and memory consumption); I hope to make it 10x faster. I will try to improve column operations and maybe create some bridges between rows and columns (like a better .transpose() method, as you said).
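
As a rough idea of what such a bridge between rows and columns could look like, here is a sketch in plain JavaScript. The helpers `rowsToColumns` and `columnsToRows` are hypothetical; they are not part of dataframe-js and are not its .transpose() method.

```javascript
// Rows -> columns: collect each field of each row into one array per column.
function rowsToColumns(rows, schema) {
  const columns = Object.fromEntries(schema.map((name) => [name, []]));
  for (const row of rows) {
    for (const name of schema) columns[name].push(row[name]);
  }
  return columns;
}

// Columns -> rows: rebuild one record per index (assumes equal-length columns).
function columnsToRows(columns) {
  const names = Object.keys(columns);
  const length = names.length > 0 ? columns[names[0]].length : 0;
  return Array.from({ length }, (_, i) =>
    Object.fromEntries(names.map((name) => [name, columns[name][i]]))
  );
}

// Usage: run a column-heavy step on the column form, then go back to rows.
const rows = [
  { id: 1, value: 2 },
  { id: 2, value: 3 },
];
const columns = rowsToColumns(rows, ["id", "value"]);
columns.value = columns.value.map((v) => v * 10);
const result = columnsToRows(columns);
```

The conversion itself costs one full pass over the data in each direction, so it would only pay off when several column operations are applied in between.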

I hope I have answered your question. If you have any ideas for improving column-based operations (without breaking the code base or the API), feel free to open a PR.

from dataframe-js.

dragoljub commented on May 11, 2024

Thanks for the detailed response! I'll play with the code some more to better understand the usage patterns.

from dataframe-js.
