gmousse / dataframe-js
A JavaScript library providing a new data structure for data scientists and developers
Home Page: https://gmousse.gitbooks.io/dataframe-js/
License: MIT License
> dataFrame = new DataFrame.DataFrame([[-1], [-2], [-3]], ['value']);
> dataFrame.stat.max('value');
0
> dataFrame.stat.min('value');
-3
It appears that the max() accumulator is initialized to 0, so it will never return a negative value: https://github.com/Gmousse/dataframe-js/blob/master/src/modules/stat.js#L36-L40.
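A hedged sketch of a fix over a plain array of values (not the library's actual internals): seed the reduction with `-Infinity` instead of `0`, so the first real value always wins the first comparison.

```javascript
// Sketch of a max() that handles all-negative columns correctly.
// Seeding the reduction with -Infinity (instead of 0) means the
// first real value always replaces the seed.
function max(values) {
    return values.reduce(
        (acc, value) => (value > acc ? value : acc),
        -Infinity
    );
}

console.log(max([-1, -2, -3])); // -1, not 0
```

Seeding with the first element of the column would work equally well.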
Add an optional columnName parameter that allows returning a single column as an Array.
When dataframe.count() === 1, shuffle() returns an empty DataFrame.
When a user chains methods on a DataFrame, each call creates a new DataFrame.
const newDF = df.map(___) // New DataFrame
.filter(____) // New DataFrame
.withColumn(____) // New DataFrame
.map(____) // New DataFrame
It's slow.
It could be interesting to create a computation stack (with some optimizations) in order to only compute things when necessary. Computation should happen only when the user REALLY wants a new DataFrame, or when they want to collect its results in a different form (toCollection or other methods).
const newDF = df.map(___) // Add map in the computation stack
.filter(____) // Add filter in the computation stack
.withColumn(____) // Add withColumn in the computation stack
.map(____) // Add map in the computation stack
.toCollection() // Really compute map, filter, ....
Of course we have to memoize the DataFrame when the computation stack is consumed (in order to avoid useless re-computations).
Moreover, it should not break the immutability (each method should return a new DataFrame with a growing computation stack without mutations).
const newDF = df.map(_____);
const newDF2 = newDF.filter(____);
// newDF computation stack: [map]
// newDF2 computation stack: [map, filter]
const newDF2Results = newDF2.toCollection(); // Trigger computation. Doesn't trigger newDF.
newDF2.show(); // Instant results. newDF2 was already computed;
newDF.show(); // Compute its own stack.
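Such a stack could be sketched over plain arrays like this (LazyFrame is a hypothetical illustration, not part of dataframe-js): every method returns a new instance with a longer stack, and collect() consumes the stack once and memoizes the result.

```javascript
// Hypothetical LazyFrame: methods only push operations onto an
// immutable stack; the stack is consumed (and the result memoized)
// on the first collect().
class LazyFrame {
    constructor(rows, stack = []) {
        this.rows = rows;
        this.stack = stack;  // pending operations
        this.cache = null;   // memoized result
    }
    map(fn) {
        return new LazyFrame(this.rows, [...this.stack, rows => rows.map(fn)]);
    }
    filter(fn) {
        return new LazyFrame(this.rows, [...this.stack, rows => rows.filter(fn)]);
    }
    collect() {
        if (this.cache === null) {
            this.cache = this.stack.reduce((rows, op) => op(rows), this.rows);
        }
        return this.cache;
    }
}

const base = new LazyFrame([1, 2, 3, 4]);
const lazy = base.map(x => x * 2).filter(x => x > 4); // nothing computed yet
console.log(lazy.collect()); // [6, 8] -- computed once, then memoized
```

Because each method returns a fresh instance with its own stack and cache, immutability is preserved: collecting `newDF2` never triggers `newDF`'s stack.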
A new module to query DataFrames with SQL syntax.
Is it possible to filter out duplicates on only a subset of columns, keeping for instance the first encountered value for the other columns? (Since .dropDuplicates() does not take any argument, I guess it is not.)
Alternatively, would it be easier to add a mult="first" argument, like in R's data.table, to left joins, so they join only with the first matching row and discard the other matching rows?
Thanks,
Martin
Related to one bug of the issue #21
Indeed, with the minified build, the type checks on classes are simply broken and block some methods.
keep-fnames will apply to all classes. If a developer is working in a large code-base, this could add a ton of bloat to their minified output.
I haven't been able to look more into this issue yet, I hope to when I get time. One strategy I wonder about is simply adding a string property to each class identifying it rather than relying on the constructor name.
As always, I'm open to other thoughts.
For reference, the reason this is important for me is that when I apply nested groups to a dataframe, I need to recursively traverse it, meaning I must distinguish a DataFrame from a GroupedDataFrame. While I'm handling that via on: string[], it's inelegant, and something within the lib itself is failing.
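The string-tag strategy could be sketched like this (the `_type` property name and `isType` helper are assumptions, not existing dataframe-js fields): each class carries an explicit tag that survives minification, and checks compare tags instead of `constructor.name`.

```javascript
// Minification can rename classes, so constructor.name is unreliable.
// An explicit static tag survives name mangling.
class DataFrame {
    static get _type() { return "DataFrame"; }
}
class GroupedDataFrame {
    static get _type() { return "GroupedDataFrame"; }
}

// Hypothetical check used instead of instance.constructor.name:
function isType(instance, expected) {
    return instance.constructor._type === expected;
}

const df = new DataFrame();
console.log(isType(df, "DataFrame"));        // true
console.log(isType(df, "GroupedDataFrame")); // false
```

This avoids needing keep-fnames entirely, so other classes in a large code-base stay minified.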
linked with #3
I'm familiar with data frames through R and Python, so I was looking for something similar in JS and ran into this module. The problem I'm having is that when I try to read in a local JSON file, it returns a Promise, not an actual DF. If I run the following, a DF doesn't really seem to be created.
var DataFrame = dfjs.Dataframe;
const df = new DataFrame.fromJSON('http://localhost:8000/data/maps3.json').then(df => df);
df.show() // returns df.show is not a function
If I run this the DF is displayed so I know it's not a read error but something about how the DF is being created.
var DataFrame = dfjs.Dataframe;
// Actually shows the DF
const df = new DataFrame.fromJSON('http://localhost:8000/data/maps3.json').then( df => {df.show()} );
As best I can tell this has to do with a promise being created not the actual DF. If I use the following it functions as I would expect.
// From a collection (easier)
const df = new DataFrame([
{c1: 1, c2: 6}, // <------- A row
{c4: 1, c3: 2}
], ['c1', 'c2', 'c3', 'c4']);
df.show()
Request 1: When reading in the JSON file like above, how do I return the DF object? Can you update the basic usage documentation to help illustrate how to handle reading the file in and returning a DF? Maybe I'm missing something simple?
Request/Question 2: When I use df.select('column'), it returns an object, when I would have expected it to return a list or vector of the values. Is there a simple method for doing that? If so, can you provide an example in your documentation?
FYI, I'm looking to take JSONs, convert them to DFs, manipulate and subset the data, and then feed it into D3 to make some interactive dashboards. Between some basic JS on the page, DFs and what they can do, and D3, I think there are some great opportunities; I just need to get the DF portion ironed out.
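For what it's worth, the usual Promise pattern is to await the value before using it; the sketch below mocks the async loader (loadRows stands in for DataFrame.fromJSON) and adds a hypothetical column() helper for request 2.

```javascript
// Mock of an async loader such as DataFrame.fromJSON: it resolves
// later, so the value is only available via await / .then().
const loadRows = () =>
    Promise.resolve([{ c1: 1, c2: 6 }, { c1: 2, c2: 7 }]);

// Request 2: pull one column out as a plain array of values.
function column(rows, name) {
    return rows.map(row => row[name]);
}

async function main() {
    const rows = await loadRows();   // without await, rows is a pending Promise
    console.log(column(rows, "c2")); // [6, 7]
}
main();
```

The key point: `const df = loader().then(df => df)` assigns the Promise itself, not the DataFrame, which is why `df.show` is not a function; all work on the DF has to happen after the await (or inside .then()).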
Hello,
It could be useful if .listColumns() were updated to give the actual column names, including those added with .map(row => row.set("test", 1)) for instance (I would like to use it with dynamic column names not known in advance).
Instead, it gives only the column names given to the constructor.
A quick and dirty fix is to convert the DF to collection and back to DF, but it is not really satisfactory.
I am not sure what would be the most efficient way to do it, but I can try to file a PR if I find something.
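The round-trip workaround essentially recomputes the key union over the collected rows; sketched in plain JS (allColumns is a hypothetical helper, not a dataframe-js API):

```javascript
// Union of the keys actually present on each row, in first-seen
// order -- what an updated listColumns() could report.
function allColumns(rows) {
    const seen = new Set();
    for (const row of rows) {
        for (const key of Object.keys(row)) seen.add(key);
    }
    return [...seen];
}

const collected = [{ a: 1 }, { a: 2, test: 1 }];
console.log(allColumns(collected)); // ["a", "test"]
```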
This is a very useful package btw, thank you @Gmousse!
Martin
Hi,
I needed some more performant handling of large CSV files for a project I'm working on. So I'm giving dataframe-js a spin, but I can't seem to even load my csv files. I'm just trying to load a local CSV file, but keep getting an error that seems to suggest I need to be running a server to serve the files. Which seems kind of ridiculous, and also not what the docs say. Here's some output:
> fs.existsSync('./raw_data/pro_record_db.csv')
true
> var df; DataFrame.fromCSV('./raw_data/pro_record_db.csv').then(function(theDf) { df = theDf });
Promise { <pending> }
> (node:54415) UnhandledPromiseRejectionWarning: Unhandled promise rejection (rejection id: 6): FileNotFoundError: ./raw_data/pro_record_db.csv not found. You maybe use a wrong path or url. Be sure you use absolute path, relative one being not supported.
Then looking at the source of addFileProtocol I changed it to not have a ./
at the beginning. So I changed my path a bit, but no luck.
> var df; DataFrame.fromCSV('~/code/bettor/js/raw_data/pro_record_db.csv').then(function(theDf) { df = theDf });
Promise { <pending> }
> df.show()
| Error:... |
------------
| at... |
| at... |
| at... |
'| Error:... |\n------------\n| at... |\n| at... |\n| at... |'
> df.listColumns()
[ 'Error: connect ECONNREFUSED 127.0.0.1:80' ]
The same thing happens when I use a fully absolute path like 'Users/blakewest/code/project1/raw_data/pro_record_db.csv'.
Clearly, it's trying to load the file from a server, but it's a local file. How do I get it to load the local file correctly? Thanks! - Blake
This method returns the DataFrame as a collection of dictionaries (Objects):
[
{'column1':1, 'column2': 2 },
{'column1':2, 'column2': 4 },
]
Thanks!
Rob
Hi,
When I try to do a group by this error shows
TypeError: Cannot read property 'apply' of undefined
at \node_modules\dataframe-js\lib\modules\sql\sqlEngine.js:224:30
at sqlParser (\node_modules\dataframe-js\lib\modules\sql\sqlEngine.js:244:12)
at Function.request (\node_modules\dataframe-js\lib\modules\sql\index.js:34:47)
at app.js:263:39
at Layer.handle [as handle_request] (\node_modules\express\lib\router\layer.js:95:5)
at next (\node_modules\express\lib\router\route.js:137:13)
at Route.dispatch (\node_modules\express\lib\router\route.js:112:3)
at Layer.handle [as handle_request] (\node_modules\express\lib\router\layer.js:95:5)
at \node_modules\express\lib\router\index.js:281:22
at Function.process_params (\node_modules\express\lib\router\index.js:335:12)
I'm sure I have data to do the group by, and this is the query I'm trying to run:
select longitude,latitude,sum(amount),idfraudpoint from tmp0 where fraud=1 GROUP BY idfraudpoint
How can I achieve the following structure? (Nested group-by).
Color | State | Name | Qty | Value |
---|---|---|---|---|
Red | ||||
VA | ||||
Jeff | 2 | $10 | ||
John | 1 | $5 | ||
PA | ||||
Rachel | 4 | $20 | ||
Blue | ||||
VA | ||||
Robert | 5 | $100 | ||
PA | ||||
Rebecca | 3 | $40 |
I have now something like the following:
private recurseDataFrame(dataFrame: DataFrame | IGroupedDataFrame, group: IGroupByField, agg: IAggregateField): DataFrame {
    if (dataFrame instanceof DataFrame) {
        if (agg) {
            return dataFrame
                .groupBy(...<string[]> group.name)
                // Aggregate over each group, not the whole frame; `g` also
                // avoids shadowing the outer `group` parameter.
                .aggregate((g) => g.stat[mathMap[agg.type]](agg.name));
        }
        return dataFrame.groupBy(...<string[]> group.name);
    } else {
        for (const dfGroup of dataFrame) {
            dfGroup.group = this.recurseDataFrame(dfGroup.group, group, agg);
        }
        return dataFrame;
    }
}
One issue I have is that for (const dfGroup of dataFrame) is not normally supported by TypeScript when compiling to ES5. By default, TypeScript only supports Array types in for...of loops.
Enabling down-compilation of iterators causes other issues with the Angular framework, so it isn't a good solution for me.
Edit:
Is it as simple as:
for (const dfGroup of dataFrame.toCollection()) {
dfGroup.group = this.recurseDataFrame(dfGroup.group, group, agg);
}
Is there a way to establish indexes on columns to speed up access queries based on where conditions?
I'm trying to use join, and I want to specify the columns to join on between DFs.
df.join(df2, 'column1', 'full')
I assume that if I do the above join, column1 has to exist in both DFs, right?
Is it possible to join specifying the columns for each DF?
For example, I want to join df with df2 on df.column1 = df2.column3.
Thanks!
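Until per-side join keys are supported, one possible workaround (sketched over plain row objects; renameKey is a hypothetical helper, not a dataframe-js API) is to rename the key column on one side so both DFs share the join column name before a normal single-key join:

```javascript
// Rename one column in every row so df2's "column3" matches
// df1's "column1" before a normal single-key join.
function renameKey(rows, from, to) {
    return rows.map(({ [from]: value, ...rest }) => ({ ...rest, [to]: value }));
}

const df2Rows = [{ column3: 1, x: "a" }];
console.log(renameKey(df2Rows, "column3", "column1"));
// [{ x: "a", column1: 1 }]
```

If the library's own rename method supports this, renaming the DataFrame column directly before calling join would achieve the same effect without leaving the DF abstraction.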
SortBy should take multiple columns, ordered left to right by decreasing "priority". When a column to the left has the same value, the column to the right should break the tie.
I hope to have time to investigate and create a PR for this soon. If you already have an implementation in mind, so much the better!
Example: myDf.sortBy('Test A Score', 'Test B Score', 'Test C Score');
Student | Test A Score | Test B Score | Test C Score |
---|---|---|---|
Henry | 95 | 90 | 76 |
Jess | 95 | 90 | 75 |
William | 95 | 89 | 76 |
Clair | 95 | 89 | 76 |
Barbara | 94 | 99 | 99 |
John | 94 | 98 | 77 |
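The tie-breaking rule above can be sketched as a chained comparator over plain row objects (compareBy is a hypothetical helper; it sorts descending here to match the example table):

```javascript
// Compare two rows column by column; the first column where they
// differ decides the order (descending, to match the example).
function compareBy(...columns) {
    return (a, b) => {
        for (const col of columns) {
            if (a[col] !== b[col]) return b[col] - a[col];
        }
        return 0;
    };
}

const students = [
    { name: "Jess",  a: 95, b: 90, c: 75 },
    { name: "Henry", a: 95, b: 90, c: 76 },
];
students.sort(compareBy("a", "b", "c"));
console.log(students.map(s => s.name)); // ["Henry", "Jess"]
```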
Trying to use it in my AngularCLI project with TypeScript.
I tried importing with yarn and even adding the CDN version but it's either asking for the types or not working at all. Maybe a types file can be created for https://github.com/DefinitelyTyped/DefinitelyTyped
Perhaps there is another way to import it?
Would love it if someone could point me in the right direction :)
Currently when we use a DataFrame method, it creates a deep copy of the previous instance.
But it results in slow computations.
We have to see how a lazier immutable system could improve speed of the library.
Add a sum() method to the Stat module.
As of 1.2.7, inferring the DataFrame schema (when columns are not given by the user) is pretty slow.
We should infer the schema from a sample of rows with a fast table scan.
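Sample-based inference could look like this over a plain collection (inferSchema and the default sample size are illustrative assumptions, not the library's implementation):

```javascript
// Infer each column's type from the first `sampleSize` rows only,
// instead of scanning the full table.
function inferSchema(rows, sampleSize = 10) {
    const schema = {};
    for (const row of rows.slice(0, sampleSize)) {
        for (const [key, value] of Object.entries(row)) {
            schema[key] = schema[key] || typeof value; // first non-empty wins
        }
    }
    return schema;
}

console.log(inferSchema([{ a: 1, b: "x" }, { a: 2, b: "y" }]));
// { a: "number", b: "string" }
```

The trade-off is that rows beyond the sample could contain a different type; a follow-up pass or a configurable sample size would mitigate that.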
Hello Gmousse, I may have found a bug when importing dataframe.js.
When I import dataframe.js and join two DataFrames, performance is very slow and the browser crashes (core dump).
When I import dataframe-min.js instead, the program throws the error "TypeError: t expected as one of [DataFrame], not as a object | t".
I think this may be a bug.
Thank you, Gmousse, for a powerful JS package supporting DataFrames.
Hello,
I want to use the file dataframe.js in the dist folder, but I get the following error:
TypeError: DataFrame is not a constructor
My code is the following
DataFrame = require('./dataframe-js/dist/dataframe.js').DataFrame;
test = new DataFrame([]);
test.show();
(I have no problem if I replace the first line with this one:
DataFrame = require('./dataframe-js/lib').DataFrame;
but then I need to install babel-runtime with npm.)
Do you know what the problem is?
Thanks
Returning multiple values from a reduce function is very straightforward:
// df being Titanic.csv
df.reduce((p, n) => (
{ sum: +n.get('Freq') + p.sum,
max: +n.get('Freq') > p.max ? +n.get('Freq') : p.max
}),
{sum: 0, max: -999}
).show();
Output:
{sum: 2201, max: 670}
If I try to use a similar reduce function on a GroupedDataFrame, like below:
df.groupBy('Class').aggregate(group => group.reduce((p, n) =>
({ sum: +n.get('Freq') + p.sum,
max: +n.get('Freq') > p.max ? +n.get('Freq') : p.max
}),
{sum: 0, max: -999} )
, 'new col').show();
As expected, the output is a JSON object stored in a column, which can then be mapped into separate columns.
| Class | new col |
------------------------
| 1st | [objec... |
| 2nd | [objec... |
| 3rd | [objec... |
| Crew | [objec... |
However, it would be better if each of the variables in the JSON object could be expanded into a separate column, like below:
| Class | sum | max |
-----------------------------------
| 1st | 325 | 140 |
| 2nd | 285 | 154 |
| 3rd | 706 | 387 |
| Crew | 885 | 670 |
Is there a way to achieve a similar result in dataframe-js? Such a feature is available in R's data.table package; it would be great to have it in dataframe-js too.
Thanks!
Kunal
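Until such a feature exists, the object column can be expanded in plain JS after aggregation (a sketch over the collected rows; expand is a hypothetical helper, not a dataframe-js API):

```javascript
// Spread the aggregated object stored under `col` into separate
// top-level columns on each row.
function expand(rows, col) {
    return rows.map(({ [col]: obj, ...rest }) => ({ ...rest, ...obj }));
}

const aggregated = [{ Class: "Crew", "new col": { sum: 885, max: 670 } }];
console.log(expand(aggregated, "new col"));
// [{ Class: "Crew", sum: 885, max: 670 }]
```

Rebuilding a DataFrame from the expanded collection then yields the flat | Class | sum | max | layout shown above.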
dataframe-js can't be imported when Babel is not available.
Hi, very nice project!
I'm using this API in a web application (vue.js) where I need to filter some rows, select some columns and iterate over the results. When I call toDict() on the dataframe and iterate over the results, it returns the columns in order. I expected it to return rows, one by one.
Edit: I noticed that df.toArray() does return rows as arrays of values. I expected toDict() to also iterate over rows, where each row is a named object whose names come from the column names. This way, I can store data efficiently in a dataframe, without having names attached to each row, but when I want to iterate, names are added dynamically at run-time.
Edit: df.toCollection() does what I want. Should have read the docs better!
Hi, I tried to query using sql but the result is always undefined.
var DataFrame = require('dataframe-js').DataFrame;
const df = new DataFrame(pls, columns);
df.sql.register('tmp2', true);
const test = DataFrame.sql.request("SELECT * FROM tmp2");
DataFrame.rename() will now allow renaming only one column.
DataFrame.renameAll() will take over the old .rename() functionality.
const DataFrame = require('dataframe-js').DataFrame
const df = new DataFrame([
{ c1: 1, c2: 6 },
{ c1: 1, c3: 2 }
], ['c1', 'c2', 'c3', 'c4'])
df.filter(row => row.get('c1') > 1).show()
Now it returns an empty DF with no columns. Would it be better to return a DF with the original columns but no rows?
The no-column DF creates problems when it is later joined with another DataFrame.
Thanks for such a useful library. I am using it to process DataFrames of over 150 thousand rows and 10 thousand groups, and I am having performance problems with the current implementation of groupBy.
I am sending a Pull request that changes the GroupedDataFrame._groupBy function to loop over the source DataFrame just once, instead of once per group. This change makes my jobs complete in seconds instead of minutes. The PR passes the tests and lint, I hope it is worth including upstream.
A new API for groupBy will be coming soon.
.groupBy() now returns a GroupedDataFrame object.
GroupedDataFrame provides a groupBy on multiple columns ( #4 ) and a method .aggregate() used to apply a function on each group.
example: df.groupBy('column1', 'column2').aggregate(group => group.count())
The aggregation returns a DataFrame with columns: ['column1', 'column2', 'aggregation']
How would you get a row by its index, or perform an action on an individual row? The only way I can see is to give each row an ID, filter by it, then map over it.
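Over the plain collection returned by toCollection(), positional access is direct (a sketch; getRow is a hypothetical helper, not a dataframe-js API):

```javascript
// Positional access over a plain collection of rows, as returned
// by df.toCollection().
function getRow(rows, index) {
    return rows[index];
}

const rows = [{ id: 0, v: "a" }, { id: 1, v: "b" }];
console.log(getRow(rows, 1)); // { id: 1, v: "b" }
```

The cost is one collection conversion, but it avoids the ID-plus-filter round trip for one-off lookups.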
Hi. This code looks very interesting and exciting.
Are there any plans to implement a column-based data store as the underlying data container (i.e. Col.js vs Row.js)? I'm curious what motivated you to store row-based data structures rather than column-based ones, which are generally fixed-type and could offer more compression if needed, via categorical columns etc.
Probably for UI tasks you are interested in fast full-row access to the data, for tooltips etc.
Maybe all we need is a transpose function.
First of all, thanks for open sourcing this library. I was just building one of my own and saw that this one existed, so I thought I'd give it a try.
I see that dataframe-js has 5 join types: inner, outer, full, left, and right. Usually full and outer mean the same thing... so I'm confused why there are two different implementations? See below for more details and samples from Python Pandas.
I see from your Twitter bio that you also like Python, so here is an example from Pandas, which only has 4 join types...
# Created: 4/5/17
__author__ = 'Paul Mestemaker <[email protected]>'
import pandas as pd
def show_join_types():
    df1 = pd.DataFrame({
        'key': [1, 2, 3, 4],
        'value1': [1.01, 2.01, 3.01, 4.01],
    })
    df2 = pd.DataFrame({
        'key': [1, 2, 3, 5],
        'value2': [1.02, 2.02, 3.02, 5.02],
    })
    print('df1')
    print(df1)
    print('\ndf2')
    print(df2)
    join_types = ['inner', 'outer', 'right', 'left']
    for join_type in join_types:
        df3 = pd.merge(left=df1, right=df2, how=join_type)
        print('\n')
        print(join_type)
        print(df3)

if __name__ == '__main__':
    show_join_types()
df1
key value1
0 1 1.01
1 2 2.01
2 3 3.01
3 4 4.01
df2
key value2
0 1 1.02
1 2 2.02
2 3 3.02
3 5 5.02
inner
key value1 value2
0 1 1.01 1.02
1 2 2.01 2.02
2 3 3.01 3.02
outer
key value1 value2
0 1 1.01 1.02
1 2 2.01 2.02
2 3 3.01 3.02
3 4 4.01 NaN
4 5 NaN 5.02
right
key value1 value2
0 1 1.01 1.02
1 2 2.01 2.02
2 3 3.01 3.02
3 5 NaN 5.02
left
key value1 value2
0 1 1.01 1.02
1 2 2.01 2.02
2 3 3.01 3.02
3 4 4.01 NaN
const df1 = new DataFrame({
key: [1, 2, 3, 4], // <------ A column
value1: [1.01, 2.01, 3.01, 4.01],
}, ['key', 'value1']);
df1.show();
const df2 = new DataFrame({
key: [1, 2, 3, 5], // <------ A column
value2: [1.02, 2.02, 3.02, 5.02],
}, ['key', 'value2']);
df2.show();
const joinTypes = ['inner', 'outer', 'full', 'left', 'right'];
for (const joinType of joinTypes) {
console.log(`\n${joinType}`);
df1.join(df2, 'key', joinType).show();
}
| key | value1 |
------------------------
| 1 | 1.01 |
| 2 | 2.01 |
| 3 | 3.01 |
| 4 | 4.01 |
| key | value2 |
------------------------
| 1 | 1.02 |
| 2 | 2.02 |
| 3 | 3.02 |
| 5 | 5.02 |
inner
| key | value1 | value2 |
------------------------------------
| 1 | 1.01 | 1.02 |
| 2 | 2.01 | 2.02 |
| 3 | 3.01 | 3.02 |
outer
| key | value1 | value2 |
------------------------------------
| 4 | 4.01 | undefined |
| 5 | undefined | 5.02 |
full
| key | value1 | value2 |
------------------------------------
| 1 | 1.01 | 1.02 |
| 2 | 2.01 | 2.02 |
| 3 | 3.01 | 3.02 |
| 4 | 4.01 | undefined |
| 1 | 1.01 | 1.02 |
| 2 | 2.01 | 2.02 |
| 3 | 3.01 | 3.02 |
| 5 | undefined | 5.02 |
left
| key | value1 | value2 |
------------------------------------
| 1 | 1.01 | 1.02 |
| 2 | 2.01 | 2.02 |
| 3 | 3.01 | 3.02 |
| 4 | 4.01 | undefined |
| 1 | 1.01 | 1.02 |
| 2 | 2.01 | 2.02 |
| 3 | 3.01 | 3.02 |
right
| key | value1 | value2 |
------------------------------------
| 1 | 1.01 | 1.02 |
| 2 | 2.01 | 2.02 |
| 3 | 3.01 | 3.02 |
| 1 | 1.01 | 1.02 |
| 2 | 2.01 | 2.02 |
| 3 | 3.01 | 3.02 |
| 5 | undefined | 5.02 |
When a.union(b) is called and a and b have distinct column names, I'd expect the concat to still work; currently, it triggers an exception.
In the interim, I'm wrapping with:
function unionDFs(a, b) {
    const aCols = a.listColumns();
    const bCols = b.listColumns();
    const aNeeds = bCols.filter((v) => aCols.indexOf(v) === -1);
    const bNeeds = aCols.filter((v) => bCols.indexOf(v) === -1);
    const a2 = aNeeds.reduce((df, name) => df.withColumn(name, () => 'n/a'), a);
    const b2 = bNeeds.reduce((df, name) => df.withColumn(name, () => 'n/a'), b);
    return a2.union(b2);
}
Thanks for this excellent repo,
I am trying to add a new module, and want to keep the existing default modules.
I know we can do this by reimporting the defaults:
DataFrame.setDefaultModules(SQL, Stat, FakeModule)
but it seems the Stat, SQL, and Matrix modules are not exported. How can I get them and re-import them when setting the default modules?
Currently, the first whitespace in a column name is removed. I think this is a minor bug in that it was intended to be a global replace. However, there are two deeper design issues here:
Can whitespace removal be dropped? It makes reuse with externally defined column names error-prone. AFAICT, only the constructor needs that modification.
If column renaming is not dropped, can we expose the renaming in some way so that external code can track how to map + reverse map names?
Happy to contribute code here, lmk.
When joining with an empty dataframe the result is an array.
I'd expect it to be an empty dataframe.
Example:
let df1 = new dfjs.DataFrame([]);
let df2 = new dfjs.DataFrame([]);
df1.join(df2, 'col1'); // empty array
The empty array is quite unhandy in situations where filter() is followed by join().
What are your thoughts on this?
In the API documentation, the example you give for groupBy is this:
groupedDF.aggregate(group => group.sql.sum('column1'));
When I try to do this it throws the error: "Uncaught (in promise) TypeError: group.sql.sum is not a function(...)".
I experimented with using "group.stat.sum" instead, and that seems to work.
It could be that your documentation is wrong and you need to change group.sql.sum to group.stat.sum. Or is it the case that I should actually use group.sql.sum, and I'm lacking something that would bring the group.sql.sum function into scope?
DataFrame JS is massive. I'm interested in getting a smaller bundle by excluding features I don't use. In my project, I only need groups, sorts, filters, and aggregates - an in-memory grid, in other words.
Your library is one of the better grids, but I dislike getting all the module baggage with it, such as SQL, and Matrix.
Can you look into using inheritance/mixins/whatever to produce different classes for different sets of DataFrame features? That should help keep things slim.
If you have other suggestions, I'm happy to hear them.
Hi team,
I'm using dataframe-js v1.3.2 and facing an error when trying to sort a dataframe by a column that contains null. Please see the following example:
let result = new DataFrame([
{Name: 'Peter', Age: 16},
{Name: 'Denis', Age: null}
], ['Name', 'Age']);
result.sortBy('Age')
Please let me know if you are going to fix this in an upcoming release, or please share some thoughts on a workaround.
PS: in this project the value really needs to stay null, so please do not propose changing null to zero :)
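One possible workaround until the library handles it (a sketch over the collected rows; values stay null, they are only ordered last):

```javascript
// Sort rows by `col` ascending, placing null/undefined values last
// without replacing them.
function sortByNullsLast(rows, col) {
    return [...rows].sort((a, b) => {
        if (a[col] == null) return b[col] == null ? 0 : 1;
        if (b[col] == null) return -1;
        return a[col] < b[col] ? -1 : a[col] > b[col] ? 1 : 0;
    });
}

const people = [
    { Name: "Peter", Age: 16 },
    { Name: "Denis", Age: null },
];
console.log(sortByNullsLast(people, "Age").map(p => p.Name));
// ["Peter", "Denis"]
```

Rebuilding the DataFrame from the sorted collection preserves the null values exactly as stored.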