Comments (2)
Hi @dragoljub,
Your question is really interesting.
As you said, the DataFrame object is not really a column-based data structure; it is actually row-based.
To write this codebase I had two options:
- Row-based: define a DataFrame as a collection of rows sharing a schema. This is the approach used by Spark (and other functional, immutable big data structures). The API is functional, built on maps and reductions.
- Column-based: define a DataFrame as a collection of typed columns. This is the approach used by R and pandas (in Python). The API is built on vectorized computations and is usually mutable.
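A minimal sketch of the two layouts, using hypothetical toy data and plain JavaScript objects rather than the actual dataframe-js internals:

```javascript
// Hypothetical toy data; the names below are illustrative only.

// Row-based: a list of column names (the schema) plus a collection of rows.
const rowBased = {
  columns: ["id", "price"],
  rows: [
    { id: 1, price: 10 },
    { id: 2, price: 25 },
  ],
};

// Column-based: one vector of values per column.
const columnBased = {
  id: [1, 2],
  price: [10, 25],
};

// Reading one record is direct in the row layout,
// but needs the same index in every vector in the column layout.
const secondRow = rowBased.rows[1];
const secondRowFromColumns = {
  id: columnBased.id[1],
  price: columnBased.price[1],
};

// Reading one column is direct in the column layout,
// but needs a map over every row in the row layout.
const prices = columnBased.price;
const pricesFromRows = rowBased.rows.map((row) => row.price);
```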
I chose row-based because:
- I think the functional paradigm (and the Spark implementation and API) fits better with the way JavaScript is evolving (and its future?), and also with the way I work.
- Rows make it easier to maintain the dataset structure. We don't have to work with indexes into column vectors to manipulate row data (which can be a pain in a group by, for example).
- It's better (in terms of computation and usage) for reductions that take multiple columns into account to filter, sort, or modify rows.
- It's easier to make rows immutable.
- (As you said) it's perfect for front-end applications (visualizations, lists of data, filtering...).
- Rows are the best format for parallelism (it should become a reality in JavaScript in 1 or 2 years, see http://www.2ality.com/2017/01/shared-array-buffer.html). I plan to test parallelism with DataFrame-js on the backend with https://github.com/turbo/js. Column-based can't be used for this purpose.
- Rows have many other advantages that I don't have in mind right now...
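To illustrate the multi-column and immutability points, here is a rough sketch in plain JavaScript. The data and variable names are hypothetical, and this is not the dataframe-js API:

```javascript
// Hypothetical immutable rows: both the collection and each row are frozen.
const orders = Object.freeze([
  Object.freeze({ name: "a", qty: 2, price: 5 }),
  Object.freeze({ name: "b", qty: 1, price: 30 }),
  Object.freeze({ name: "c", qty: 4, price: 10 }),
]);

// Filter using two columns at once: each row already carries both values.
const bigOrders = orders.filter((row) => row.qty * row.price >= 30);

// Sort by a value derived from two columns (copy first, the input stays untouched).
const byTotalDesc = [...orders].sort(
  (a, b) => b.qty * b.price - a.qty * a.price
);

// "Modify" rows by returning new ones instead of mutating the originals.
const withTotal = orders.map((row) => ({ ...row, total: row.qty * row.price }));
```

Every operation returns a new collection, so the original rows can be shared safely between steps.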
However, row-based has some disadvantages:
- It's slower in some simple column manipulations where column-based does better: modifying one column, adding one column, sorting columns, casting one column... (but this disadvantage disappears when you need to work on multiple columns at the same time).
- The API can be painful for simple column manipulations (like df["mycolumn"] = df["mycolumn"] ** 2 in pandas).
- Declaring a large number of Row objects is heavy and leads to slow computations...
- And a lot of other issues...
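As a rough illustration of the single-column case, here is the same "square one column" operation on both layouts. This is plain JavaScript with hypothetical data, not the dataframe-js API:

```javascript
// Row layout: squaring one column means visiting and rebuilding every row object.
const pointRows = [
  { x: 1, y: 2 },
  { x: 3, y: 4 },
];
const squaredRows = pointRows.map((row) => ({ ...row, x: row.x ** 2 }));

// Column layout: only the vector for "x" is touched; "y" is reused as-is.
const pointColumns = { x: [1, 3], y: [2, 4] };
const squaredColumns = {
  ...pointColumns,
  x: pointColumns.x.map((value) => value ** 2),
};
```

In the row layout one new object is allocated per row, which is the overhead mentioned above; in the column layout a single new array is created.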
To conclude, both column-based and row-based have advantages (and disadvantages). I chose the row-based solution, but it could be interesting to improve column manipulations or to add new features. Why not create a MutableDataFrame (as Scala does for some data structures) that could offer an API and column-based operations similar to R or pandas? It could be interesting, but it's not among my short-term goals.
Indeed, DataFrame can be slower in some column manipulations, but it's also faster in map and reduction tasks (which I use about 70% of the time).
I am working (slowly) on a new DataFrame version that includes important performance optimizations (speed and memory consumption); I hope to make it 10x faster. I will try to improve column operations and maybe create some bridges between rows and columns (like a better .transpose() method, as you said).
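One possible shape for such a bridge, sketched in plain JavaScript. The helper names rowsToColumns and columnsToRows are hypothetical and do not exist in dataframe-js; the sketch assumes every row has the same keys and every column has the same length:

```javascript
// Hypothetical helper: turn an array of row objects into one array per column.
function rowsToColumns(rows) {
  const columns = {};
  for (const row of rows) {
    for (const [name, value] of Object.entries(row)) {
      if (!(name in columns)) columns[name] = [];
      columns[name].push(value);
    }
  }
  return columns;
}

// Hypothetical helper: turn per-column arrays back into an array of row objects.
function columnsToRows(columns) {
  const names = Object.keys(columns);
  const length = names.length === 0 ? 0 : columns[names[0]].length;
  const rows = [];
  for (let i = 0; i < length; i++) {
    const row = {};
    for (const name of names) row[name] = columns[name][i];
    rows.push(row);
  }
  return rows;
}

// Round trip: rows -> columns -> rows.
const sampleRows = [
  { id: 1, price: 10 },
  { id: 2, price: 25 },
];
const sampleColumns = rowsToColumns(sampleRows);
const roundTrip = columnsToRows(sampleColumns);
```

With a pair like this, column-heavy work could run on the column form and the result could be converted back to rows for map and reduce steps.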
I hope I have answered your question. If you have any ideas for improving the column-based side (without breaking the code base and the API), feel free to open a PR.
from dataframe-js.
Thanks for the detailed response! I'll play with the code some more to better understand the usage patterns.