Code Monkey home page Code Monkey logo

arquero's Introduction

Arquero

Arquero is a JavaScript library for query processing and transformation of array-backed data tables. Following the relational algebra and inspired by the design of dplyr, Arquero provides a fluent API for manipulating column-oriented data frames. Arquero supports a range of data transformation tasks, including filter, sample, aggregation, window, join, and reshaping operations.

  • Fast: process data tables with million+ rows.
  • Flexible: query over arrays, typed arrays, array-like objects, or Apache Arrow columns.
  • Full-Featured: perform a variety of wrangling and analysis tasks.
  • Extensible: add new column types or functions, including aggregate & window operations.
  • Lightweight: small size, minimal dependencies.

To get up and running, start with the Introducing Arquero tutorial, part of the Arquero notebook collection.

Have a question or need help? Post to the Arquero GitHub Discussions board.

Arquero is Spanish for "archer": if datasets are arrows, Arquero helps their aim stay true. ๐Ÿน Arquero also refers to a goalkeeper: safeguard your data from analytic "own goals"! ๐Ÿฅ… โœ‹ โšฝ

API Documentation

  • Top-Level API - All methods in the top-level Arquero namespace.
  • Table - Table access and output methods.
  • Verbs - Table transformation verbs.
  • Op Functions - All functions, including aggregate and window functions.
  • Expressions - Parsing and generation of table expressions.
  • Extensibility - Extend Arquero with new expression functions or table verbs.

Example

The core abstractions in Arquero are data tables, which model each column as an array of values, and verbs that transform data and return new tables. Verbs are table methods, allowing method chaining for multi-step transformations. Though each table is unique, many verbs reuse the underlying columns to limit duplication.

import { all, desc, op, table } from 'arquero';

// Average hours of sunshine per month, from https://usclimatedata.com/.
const dt = table({
  'Seattle': [69,108,178,207,253,268,312,281,221,142,72,52],
  'Chicago': [135,136,187,215,281,311,318,283,226,193,113,106],
  'San Francisco': [165,182,251,281,314,330,300,272,267,243,189,156]
});

// Sorted differences between Seattle and Chicago.
// Table expressions use arrow function syntax.
dt.derive({
    month: d => op.row_number(),
    diff:  d => d.Seattle - d.Chicago
  })
  .select('month', 'diff')
  .orderby(desc('diff'))
  .print();

// Is Seattle more correlated with San Francisco or Chicago?
// Operations accept column name strings outside a function context.
dt.rollup({
    corr_sf:  op.corr('Seattle', 'San Francisco'),
    corr_chi: op.corr('Seattle', 'Chicago')
  })
  .print();

// Aggregate statistics per city, as output objects.
// Reshape (fold) the data to a two column layout: city, sun.
dt.fold(all(), { as: ['city', 'sun'] })
  .groupby('city')
  .rollup({
    min:  d => op.min(d.sun), // functional form of op.min('sun')
    max:  d => op.max(d.sun),
    avg:  d => op.average(d.sun),
    med:  d => op.median(d.sun),
    // functional forms permit flexible table expressions
    skew: ({sun: s}) => (op.mean(s) - op.median(s)) / op.stdev(s) || 0
  })
  .objects()

Usage

In Browser

To use in the browser, you can load Arquero from a content delivery network:

<script src="https://cdn.jsdelivr.net/npm/arquero@latest"></script>

Arquero will be imported into the aq global object. The default browser bundle does not include the Apache Arrow library. To perform Arrow encoding using toArrow() or binary file loading using loadArrow(), import Apache Arrow first:

<script src="https://cdn.jsdelivr.net/npm/apache-arrow@latest"></script>
<script src="https://cdn.jsdelivr.net/npm/arquero@latest"></script>

Alternatively, you can build and import arquero.min.js from the dist directory, or build your own application bundle. When building custom application bundles for the browser, the module bundler should draw from the browser property of Arquero's package.json file. For example, if using rollup, pass the browser: true option to the node-resolve plugin.

Arquero uses modern JavaScript features, and so will not work with some outdated browsers. To use Arquero with older browsers including Internet Explorer, set up your project with a transpiler such as Babel.

In Node.js or Application Bundles

First install arquero as a dependency, for example via npm install arquero --save. Arquero assumes Node version 12 or higher.

Import using CommonJS module syntax:

const aq = require('arquero');

Import using ES module syntax, import all exports into a single object:

import * as aq from 'arquero';

Import using ES module syntax, with targeted imports:

import { op, table } from 'arquero';

Build Instructions

To build and develop Arquero locally:

  • Clone https://github.com/uwdata/arquero.
  • Run npm i to install dependencies.
  • Run npm test to run test cases, npm run perf to run performance benchmarks, and npm run build to build output files.

arquero's People

Contributors

bmesuere avatar calvinfo avatar domoritz avatar dworthen avatar ee2dev avatar gordonje avatar jannisch avatar jheer avatar john-guerra avatar lsh avatar mbostock avatar natlownes avatar natoverse avatar pgte avatar stuartlynn avatar sujeetpillai avatar t829702 avatar wpercy avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

arquero's Issues

Add an op.recode function.

Maybe it could be interesting to add an operation to recode the values of a categorical variable, similar to the fct_recode function of the forcats R package. I don't think this already exists in Arquero, but I may have missed something.

Such a function could be something like this :

recode = function(value, recoding, other) {
  const keys = [...recoding.keys()];
  if (keys.includes(value)) return recoding.get(value);
  if (other !== undefined) return other;
  return value;
}

The idea would be to pass a value, a Map as recoding argument, and an optional other argument which, if defined, would be returned instead of value if value is not found in the Map. So for example :

let recoding = new Map([["foo", "bar"],[1,2]])
recode("foo", recoding)  // returns "bar"
recode(1, recoding)  // returns 2
recode("hello", recoding)  // returns "hello"
recode("hello", recoding, "other")  // returns "other"

I'm not a good JavaScript developer so there may be a much better implementation, but I think such an operation could be very useful for dealing with categorical variabels with Arquero.

Thanks !

undefined values in joined table after a filter and a derive

Hi,

This is quite a corner case, but if I try to join the following two tables :

dt1 = aq
  .from([
    { key: 1, value1: 1 },
    { key: 2, value1: 2 },
    { key: 3, value1: 3 },
    { key: 4, value1: 4 }
  ])
  .filter(d => d.key < 3)
  .derive({ key: d => d.key })

dt2 = aq.from([
  { key: 1, value2: 1 },
  { key: 2, value2: 2 },
  { key: 5, value2: 5 }
])

The result has 4 rows, with undefined values in the key column for the rows filtered from dt1 :

dt1.join_left(dt2).view()

Here is a small Observable notebook reproducing the issue : https://observablehq.com/d/d85031b66c92460b

The problem is solved if I put the derive before the filter, or if I call reify on dt1 before the join.

Maybe the good practice is just to systematically call reify between a filter and a derive ?

Thanks !

Support Arquero to SQL query translation.

It would be awesome if one could convert a query to SQL so that it can run in a database. Would that be feasible or are there some obvious operations that are not easily translated to SQL?

Column-level metadata/functional arguments to aq.matches

This is an alternate solution related to the question at #33 about selecting certain subsets of columns based on types.

Arquero could have a metadata slot on each column that borrows from Apache Arrow's support of table and column-level metadata. If my arrow column is of type arrow.utf8 with a metadata field saying "language": "English", if would be useful to have an arquero table derived from it at some point declare table.metadata = {"language": "English", "arrow_type": "utf8"}. The pyarrow feather export functions do something similar with pandas frames: the feather metadata includes a description of the pandas dtypes.

If a function as argument to aq.matches worked with reference to the full column (not just the name as for strings and regexes) @ericemc3's case in #33 could be expressed something like this:

table.select(aq.matches(col => col.metadata.arrow_type.match(/int|float/)))

col.metadata would also conceivably be a useful place to expose information about what autotype inference in fromCSV and fromJSON decided to do.

Could try to find time for a pull request, but obviously this hinges on questions about how/whether you want to use slots on the column object other than data.

Add sample_frac verb

pandas has df.sample(frac=1) and df.sample(n=1); dplyr has t %>% sample_frac(1) and t %>% sample_n(1).

Currently the recommended way to do this is either

  • t.sample(t.numRows()) (only on ungrouped tables).
  • t.sample(() => op.count())

The first only works on ungrouped tables, and the second is a little wonky--I only found it in the docs by searching if there was an example of bootstrap sampling.

Build failures with Ember.js

Hi,
I was trying to use the library in the context of the Ember.js application, however, the module won't load with the default setup. Here is the error message. Do you have any advice on how to avoid it?

ERROR in ./node_modules/arquero/src/arrow/encode/profiler.js 99:40
Module parse failed: Identifier directly after number (99:40)
You may need an appropriate loader to handle this file type.
|     return Type.Float64;
|   } else if (p.bigints === valid) {
>     var _v = -p.min > p.max ? -p.min - 1n : p.max;
| 
|     return p.min < 0 ? _v < Math.pow(2, 63) ? Type.Int64 : error("BigInt exceeds 64 bits: ".concat(_v)) : p.max < Math.pow(2, 64) ? Type.Uint64 : error("BigInt exceeds 64 bits: ".concat(p.max));
 @ ./node_modules/arquero/src/arrow/encode/data-from-objects.js 2:0-37 9:12-19
 @ ./node_modules/arquero/src/arrow/encode/index.js
 @ ./node_modules/arquero/src/index.js

result of table.antijoin is incoherent

I think I found a problem with table.antijoin(), which is evidenced if you try to place this file on the arquero's test directory and run it with node:

const aq = require('../')
const assert = require('assert')

const tb1 = aq.table({
  id: [1, 2, 3]
})

const tb2 = aq.table({
  id: [1, 2]
})

const tb3 = tb1.antijoin(tb2)

tb3.print()

assert.equal(tb3.numRows(), 1)
assert.equal(tb3.get('id', 0), 3)

This fails with the following error:

node:assert:119
  throw new AssertionError(obj);
  ^

AssertionError [ERR_ASSERTION]: 1 == 3
    at Object.<anonymous> (/Users/pedroteixeira/projects/deci/arquero/test/antijoin.js:17:8)
    at Module._compile (node:internal/modules/cjs/loader:1108:14)
    at Object.Module._extensions..js (node:internal/modules/cjs/loader:1137:10)
    at Module.load (node:internal/modules/cjs/loader:973:32)
    at Function.Module._load (node:internal/modules/cjs/loader:813:14)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:76:12)
    at node:internal/main/run_main_module:17:47 {
  generatedMessage: true,
  code: 'ERR_ASSERTION',
  actual: 1,
  expected: 3,
  operator: '=='
}

The output table is:

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”
โ”‚ (index) โ”‚ id โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”ค
โ”‚    0    โ”‚ 3  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”˜

But the assertion fails.

The problem goes away if I reify the result of the antijoin call:

const tb3 = tb1.antijoin(tb2).reify()

If you point me to where the problem lays in the code I can try to fix it.

More problems related to typings

Here are some assorted problems I encountered when using arquero:

  • ColumnTable (see dist/types/table/column-table.d.ts)

    • In methods toCSV, toHTML, toJSON, toMarkdown
      • The parameter options has been generated as any and not as specified in the jsdoc.
        /**
        * Format this table as a JavaScript Object Notation (JSON) string.
        * @param {JSONFormatOptions} options The formatting options.
        * @return {string} A JSON string.
        */
        toJSON(options) {
        return toJSON(this, options);
        }
      • the options parameter should be optional, now it is required
  • Table.create: the definition of config parameter should be refactored as a separate type or interface

    * @param {object} [config] Configuration settings for the new table:
    * - data: The data payload to use.
    * - names: An ordered list of column names.
    * - filter: An additional filter bitset to apply.
    * - groups: The groupby specification to use (null for no groups).
    * - order: The orderby comparator to use (null for no order).
    * - params: Table expression parameters.

  • Table.comparator: should be more specific than just a Function, e.g. type Comparator<T> = (a:T,b:T) => number or something...

    arquero/src/table/table.js

    Lines 113 to 119 in 55c7c13

    /**
    * Returns the row order comparator function, if specified.
    * @return {Function} The row order comparator function.
    */
    comparator() {
    return this._order;
    }

New statistical transforms

How about adding those (sorted by descending interest for me):

  • sample (table.sample(n))
  • density (op.density(field, bandwidth)),
  • regression,
  • loess...

I have actually spotted them in vega-lite transforms collection.

Specify insertion point for derived columns.

With R, it is very convenient to create a column and specify where to insert it:
mutate( newvar = ..., .after = var1 )
hence avoiding to respecify the desired column order with a full select.

Could derive encompass a new parameter similar to .after or .before?

params in expr containing op.lag raises error

While trying to figure out how to set a lag programatically, I hit some unexpected behavior.

I wanted to do:

aq.table({"n": [1,2,3,4]})
  .params({lag: 1})
  .derive({prod: (d, $) => d.n - op.lag(d.n, $.lag)})
  .view()

But it raises: TypeError: t.param is not a function

Probably I'm just not finding the right place to set params to an op. (Or need to write a new one?).

But even the following raises an error, although the op.lag function does not ask for access to .params,
which seems less than ideal.

aq.table({"n": [1,2,3,4]})
  .params({lag: 1})
  .derive({prod: (d, $) => d.n * $.lag - op.lag(d.n, 1)})
  .view()

Fix ordering of integer-named columns.

I love pivot and fold fonctions in Arquero, they are so powerful !

I just ran into few small difficulties.
Here is a use case:

t = aq.table({x:['01','01','01','02','02','02'], y:['1','2','3','1','2','3'], z:[123,256,854, 652,734,222]})

x y z
01 1 123
01 2 256
01 3 854
02 1 652
02 2 734
02 3 222

I pivot that table:

t.groupby(0).pivot(1,2)

The problem is that x is now located on the right, i'd like it to stay in the first position.

1 2 3 x
123 256 854 01
652 734 222 02

I am trying all these variants, but in vain, x doesn't want to move ;)

t.groupby(0).pivot(1,2).relocate('x', {before:0}).view() ;
t.groupby(0).pivot(1,2).relocate('x', {before:1}).view() ;
t.groupby(0).pivot(1,2).relocate('x', {before:"1"}).view() ;

Maybe i am missing something there. Of course i can write a select.
The fact that new column names are not by default prefixed probably does not help.

My humble suggestions would be:

  1. after a pivot, make the groupby columns appear at first
  2. add a prefix option in pivot (and fold as well)

For instance:
p = t.groupby(0).pivot(1,2, {prefix:'age_'})

could produce:

x age_1 age_2 age_3
01 123 256 854
02 652 734 222

and conversely:
p.fold(aq.range(1,3),{as:['y','z'], prefix:'age_'})

would take me back to the original table.

For the moment i use for pivot something like:
p = t.groupby(0).pivot({ d: d => 'age_' + d.y },2, {prefix:'age_'})

But after a fold i have to write something like this this, which i find not very elegant:
.derive({y: d => op.substring(d.y,4)})

cf. for inspiration (and old habits):
https://tidyr.tidyverse.org/reference/pivot_longer.html (aka fold) and names_prefix option
https://tidyr.tidyverse.org/reference/pivot_wider.html (aka pivot) and names_prefix option

Update table getters to respect filter and order criteria?

I am troubled with the fact that get() ignores current filter, making it (a little bit) difficult to access data in a filtered/ordered table.

And i am also having that strange result here (undefined when i expect 4 or ideally 5):

const dt = aq.table({ a: [1, 2, 3], b: ["4", "5", "6"] })

dt.filter(d => d.a > 1)
  .derive({b: d => +d.b})
  .column('b').get(0) 

Easier moving average in windowed operations?

Recently I'm doing some COVID-19 stats trying to utilize Arquero lib.
Most of my operations are perfectly ported to Arquero APIs and worked well.
One of my key complaint is the lack of easy moving average operations.

My current version is like:

.derive({ moving_death_7: d => (op.lag(d.death, 1) + op.lag(d.death, 2) + op.lag(d.death, 3) + op.lag(d.death, 4) + op.lag(d.death, 5) + op.lag(d.death, 6) + op.lag(d.death, 0)) / 7 })

It's currently unable to use for loops, and it looks awkward a bit. If I want to change to a 14d moving average, it will look even worse. Am I doing it wrong?

Get indices do not seem to respect table order

First of all: Thanks for this great project. Itโ€™s so great to finally see a good data wrangling framework in JavaScript land!

I have only worked with Arquero a couple of hours and stumbled upon an unexpected behavior that you might explain somewhere in the docs. Or it might be a bug.

Accessing individual cell values using .get(..., index) does not seem to respect re-ordered tables. Hereโ€™s an example:

const dt = aq.table({a: [1, 2, 3], b: [4, 5, 6]});

a	b
1	4
2	5
3	6

dt.get("a", 0); // 1

const dto = dt.orderby(aq.desc("a"));

a	b
3	6
2	5
1	4

dto.get("a", 0); // still 1 but expected 3

Any idea why that may be the case?

op.is_nan() returns false for 'ND', etc.

I want to check what should be a numeric column in my table (loaded from a CSV file), but it happens that it contains several non-numeric values : 'NA', 'ND', 'XXX'...

So i am trying a test such as:

isNaN(x)

With Arquero, i will use:

dt.filter(d => op.is_nan(d.mycol))

but it actually does not retain these 'NA', 'ND', 'XXX'...

beacause op.is_nan() follows Number.isNaN(), which behaves differently than isNaN().

Is this by design?

As an alternative, i am currently using:
dt.filter(d => +d.mycol != d.mycol)

Typings should be fixed

Problem

I tried this:

import {table} from "arquero";

const myTable = table({
  x: [1610109901000, 1610109902000, 1610109903000, 1610109904000],
  y: [600, 610, 590, 602]
});

I got a warning from IntelliJ (both in javascript and in typescript): Invalid number of arguments, expected 2

This is because the names parameter of the table function is not marked as optional.

/**
* Create a new table for a set of named columns.
* @param {object} columns
* The set of named column arrays.
* Object keys are the column names.
* The enumeration order of the keys determines the column indices.
* Object values must be arrays (or array-like values) of identical length.
* @return {ColumnTable} the instantiated table
* @example table({ colA: ['a', 'b', 'c'], colB: [3, 4, 5] })
*/
export function table(columns, names) {
return new ColumnTable(mapObject(columns, x => x), names);
}

How to fix

Add the following code to your jsdoc: @param {string[]?} names

I would also suggest to review the typings of all functions in your code ๐Ÿ˜„

Add join_natural method

This issue captures a missing feature that's
bothered me a little too; a quick and dirty way to join tables on shared keys as in tidyverse's inner_join
or MySQL's NATURAL JOIN. Especially without named arguments in JS, it's hard to
remember how to do this with existing joins.

In observable I've started adding a join_natural function that mimics their behavior:
I wonder if you'd consider adding something similar as a method.

  aq.internal.Table.prototype.join_natural = function(right) {
    const names = this.columnNames()
    return this.join(right, undefined, [names,right.columnNames().filter(d => names.indexOf(d) == -1)])
  }

Would probably be best if it threw an error if there are no shared keys. (?)

Thanks again for your work--still greatly enjoying this library's contribution.

How to left pad a string for a new column?

For example, in the case when dealing with U.S. Census FIPS values that are stored as numbers in a dataset, how would I convert the numeric values back to string values that are left padded with the correct number of zeros?

When I try doing something like:

table.derive({
  "fips_state_str": d => `${d.fips_state}`.padStart(2, "0")
})

... Arquero throws an error. Looking at the API documentation and docs on Observable I don't see an example of how to accomplish this, but apologies if I missed or overlooked something. Thanks!

Duplicated keys column after join

Hi,

First of all, many thanks for this awesome library that makes working with tabular data much more natural. As an Observable newbie coming from R and dplyr, I really appreciate it.

A small question : when joining two tables, the key column(s) are duplicated in the resulting joined table. For example, if both tables are merged on the same key column, the resulting table will have both a key_1 and a key_2 columns. In other languages or libraries I'm accustomed to the fact that the result only has a key column, as the original one.

Please excuse me if it is the intended behavior or if I am misunderstanding something.

Thanks.

Name collisions for op min and max functions.

It's only OK to use 'op.max' for aggregations, like in d=>op.max(d.foo),

But if I try to use it for common math calculations, like d=>op.max(1,2,3) or d=>op.max([1,2,3]), none of them will output expected values. It's still common usage for processing data like correcting negative values using like op.max(d.foo, 0).

I think Arquero should better separate the 'max' function with different keywords, like 'op.math_max' or 'op.agg_max'

Add Arrow output serialization.

@jheer and others, this is already looking like a great library for us at OmniSci, where we just added the ability to return SQL query results over the wire as Arrow buffers ๐Ÿ‘ ๐Ÿ‘

Wondering if it is possible to serialize the results of an Arquero query to Arrow - it would make for a composable pipeline e.g. for use cases where you retrieve a dataset from OmniSci, wrangle it client-side with Arquero, and then either visualize it with Vega-based charts or things like deck.gl for 3d charts, for example.

We're happy to help in any way we can!

Support explicit column types.

First congratulations for that impressive work, which i consider, being a R user and a D3 fan, as a huge step forward for live and sexy datavisualisation and dataflows!

It looks like Arquero, from CSV for instance, is able to infer column types (Date, Numeric, String...), as we can see with the view() display (columns right or left-alignement), or by testing values type with typeof. Could that information be exposed in the table object, allowing for instance to test for numeric columns only?

A cool feature, starting from this, could be for instance:
select(1, aq.isNumeric())
or groupby(v1).rollup(*...here sum all numeric variables keeping same name...*)

I am used to this convenient R/dplyr syntax : summarise_if(is_numeric, sum) or summarise( across(where(is_numeric), sum) )

Aggregates should return null when all values are null

Right now, if I do op.sum(colName) and all the values of colName are null within a group, the aggregated value for that group is 0.0. IMO it would be more sensible to return null in this case.

Example:

dt = aq.from([{ foo: 1, bar: 1 }, { foo: 1, bar: 3 }, { foo: 2, bar: null }])
dt.groupby('foo').derive({bar: op.sum('bar')})

filtered arrow tables and conversion to arquero table: filter is lost

Arquero does not seem to take into account a filter applied to an arrow table.

Here is my use case:

I have quite a big arrow table:

arrow_tb.count()

=> 6159296

I filter it with arrow functions, which is very fast, especially with dictionary-encoded columns:

arrow_filtered_tb = arrow_tb.filter(arrow.predicate.col('NIVGEO').eq('COM') 
                            .and(arrow.predicate.col('CS1_8').eq('2'))
                            .and(arrow.predicate.col('AGEQ65').eq('60'))
                            .and(arrow.predicate.col('SEXE').eq('2') )
               )

arrow_filtered_tb.count()

=> 34951

I load it as an Arquero table and compute its number of rows:
aq.fromArrow(arrow_filtered_tb).numRows()

=> 6159296, which is not what i expect

Apparently, the arrow method column.toArray() exports the whole unfiltered column.

The arrow scan method looks more effective, i'd like to avoid here the conversion to objects before converting to arquero table :

nbRowsFilteredAfterScan = {
  let codgeo, nb, result = [] ;

  arrow_filtered_tb.scan((idx) => {
        result.push({'codgeo': codgeo(idx), 'nb': nb(idx)});
    }, (batch) => {
        codgeo = arrow.predicate.col('CODGEO').bind(batch);
        nb     = arrow.predicate.col('NB').bind(batch);
    });
  
  return aq.from(result).numRows() ;
}

=> 34951

Support file loading from URL directly?

The API doc exposes as an example:

// create table from an input CSV loaded from 'url'
aq.fromCSV(await fetch(url).text())

But it does not seem to work.
Here are few variants:

  let url ='https://raw.githubusercontent.com/open-numbers/ddf--gapminder--systema_globalis/master/countries-etc-datapoints/ddf--datapoints--children_per_woman_total_fertility--by--geo--time.csv' ;
  
  return aq.fromCSV( await fetch(url).text() )  // KO
  return (await fetch(url)).text() // OK
  return aq.fromCSV( (await fetch(url)).text() )  // KO
  return aq.fromCSV( await ( (await fetch(url)).text() ) ) // OK
  return fetch(url).then(d => d.text()).then(d => aq.fromCSV(d)) // OK

Actually i wish i could use simply:
aq.fromCSV(url)

like
d3.csv(url)

Add head, tail, and/or slice verbs

It is useful to be able to select individual rows by index, including respect for sorting and grouping.

dplyr uses head(n), tail(n), and slice(indices).

R does not respect negative indices in slice, but it would be nice to also be to say
table.slice([-1, 0]) to get the last and first elements, in that order, for each group.

The pandas function nth has a better name than slice, but only selects a single row per group.

Bind tables by columns?

Is there a way to bind tables by columns, in the same way as concat bind them by rows ?

For the moment the only way I found to do it is with something like :

let tmp1 = dt1.derive({ tmp: op.row_number() });
let tmp2 = dt2.derive({ tmp: op.row_number() });
tmp1
    .join(tmp2)
    .select(aq.not("tmp"));

But maybe there is a simpler way to do it ?

Thanks !

Best (simple) way to get distinct values from a column ?

I have a table t1 with a risk nominal column
and i want all distinct values in an array this is quite a common need i am having.

I really appreciate the new columnArray() Table method.

I am actually hesitating between different paths, having difficulties to select and remind which would be the best:

[...new Set(t1.columnArray('risk'))]          // short but not ideal for JS debutants...
t1.groupby('risk').count().columnArray('risk')  // tradeoff, but i don't really need the count
t1.rollup({risk: op.array_agg_distinct('risk')}).get('risk', 0)  // complex... could we avoid get here or use a get(0,0) ?

Maybe i am missing here something simplier.

Wish i could use something very simple such as t1.distinct('risk')

Date typing information from arrow lost after sample.

Thanks again for this great library.

If you load a table using aq.fromArrow, date type information is preserved;
but after some operations, the dates revert to being represented as some kind of integers, not as datetime.

Apologies for not providing an example inline, because I'm not confident about generating
an Arrow date field in Javascript. In the notebook at https://observablehq.com/@bmschmidt/a-files,
you can see that date information is preserved after a filter

table.filter(d => d.country=="Mexico").select("birth date").view()

gives

birth date
--
1915-06-21T00:00:00.000Z
1900-04-14T00:00:00.000Z
1899-04-15T00:00:00.000Z
1886-04-12T00:00:00.000Z

But after a sample or slice, you get ints:

table.sample(10).select("birth date").view()

gives

birth date
--
-1841614848
1234978816
-275112960
689860608


Alternative to a fluent API

Hi,

I'm trying to derive many keys with different params:

table
  .params({date: 'Jan'})
  .derive({Jan: (e,$) => concat($.date, e) })
  .params({date: 'Feb'})
  .derive({Feb: (e,$) => concat($.date, e) })

However the months can be 3, 6, or 12. So I need to calculate the derive function using a object that looks like

params = [{date: 'Jan'}];
derive = { 
  Jan : f()...,
  Feb: f()...,
};

Without a fluent api I can imagine doing something like

table.
  notfluent([
    {params: params},
    {derive: derive}
  ])

Is there a way to achieve this in Arquero? Alternatively I'll have to append the array to the existing Arquero table, do a spread, and rename the columns? Is that the solution that is advisable? This means I would do the calculations outside arquero, which might defeat the point?

Reliable way to test if an object is an Arquero table

I hope that my question is not too silly, but I wondered what would be a reliable way to test if a given object is an Arquero Table (for example in the case of a function that would apply both to a table and to a d3 array).

I tried to use obj.constructor.name but I believe this is not reliable because the result can change from a version to another due to code minimization.

Is there a "good" way to test for this ?

Many thanks !

Optimize for dictionary-encoded Arrow columns.

Arrow JS's compute library has the ability to optimize filter expressions that operate on a dictionary-encoded column. For example if foo is a dictionary encoded string column and you're evaluating foo == "bar" we first lookup the dictionary index for "bar" and then search for instances of that index. This can make a huge performance improvement, particularly for string columns where UTF-8 decoding is very expensive.

In Arrow I believe we only do it for equality expressions but you can imagine there are many other expressions that could benefit from such an optimization (foo in ("bar", "baz"), lt, gt, ...). I never got a chance to build that out in Arrow, but now that arquero is here it seems like a much better place for it :)

fromCSV autotype: keep leading 0

aq.fromCSV, with option autotype = true, removes leading 0 (as does d3.autotype, unfortunately).

"06075" => 6075
"01" => 1

Could a test on leading 0 be added to avoid string to number conversion here?

I currently fix this with a specific parse option:
parse: {zipcode: String}
but it would be simplier not having to care about this, it would allow for writing more generic function to parse CSV files.

Access a column programmatically on a table expression

Apologies if this shouldn't be an issue, but I have been searching everywhere and haven't been able to understand what can I do with table expressions. Is this documented?

I want to derive columns programmatically. I have a dataset with a bunch of dates per each row, and I want to compute their relative position compared to a reference one. I know I can do:

dt.derive({
  date1_rel: d => d.date1 - d.date_ref,
  date2_rel: d => d.date2 - d.date_ref,
 ...
})

But I don't want to write that by hand. I tried then these cells on Observable:

columnFechas = [
  "fis",
  "fecha_de_muerte",
  "fecha_reporte_web"
  "fecha_recuperado",
  "fecha_de_notificaci_n",
  "fecha_diagnostico"
]
deriveFechas = Object.fromEntries(
  columnFechas.map(f => [
    `${f}_res`,
    d =>  d[f] - d.fecha_reporte_web 
  ])
)
dt.derive(deriveFechas)

But It throws an error saying Invalid column "f": "d[f]". I tried debugging the code and noticed that the function is a table expression, so for a reason I don't understand I cannot even console.log the attribute d. How do I get the column that is in the variable f in a table expression. Documenting this could be important whenever you have column names with spaces

Here is a notebook replicating the issue

https://observablehq.com/d/2783d075d8079564#fechas

Seems related to #5, is there a way of doing op.get(column, d)?

rollup() no longer allows empty inputs since 1.3.0

In previous versions, by using someTable.groupby('colA','colB').rollup(), it is possible to return a result table with all unique values with valid groups without actually rollup a value. However in 1.3.0, this is broken since the 'Object.entries are not allowed for undefined or null' error(depends on the browser). I have to manually add anything like op.count() to get it work, which I actually don't need.

I don't know if it is by purpose or by accident. But I do think the previous usage is no big deal for rollup() function.

Thematic mapping utilities

An extension to op.ntile() could prove useful to encode numeric values to categories from manual breaks.
Something similar to the R cut function:
dens_code = cut( pop_density, breaks = c(0, 1000, 5000,20000,100000, Inf)...)
or d3.scaleThreshold()

jenks() and kmeans() are also useful clustering methods, we can borrow them from the simple statistics library, but of course if they were in Arquero it would be convenient.

Optimize groupby to use dictionary indices

Thanks so much for releasing this project--I've been hoping for precisely this kind of engine for a while. Look forward to using it.

There's some relatively low-hanging fruit for a performance bump on dictionary-encoded columns. (Probably string columns, too, although it's a little trickier.) Currently, groupby on a dictionary(utf-8 entries with int32 keys) column is quite slow because--if I'm tracing the code correctly--each individual row is being decoded from utf-8.

Here's a benchmark counting 2.5 million rows; in this case, it's about 4.7 seconds to count 1,000 distinct fields on character keys and 0.2 seconds on their integer equivalent.

Similar optimizations should be possible for 'filter' and 'join' on dictionary columns, although there's probably more overhead involved.

Update expression parser to support indirect column name references?

The following code (within an observable notebook cell) runs just fine:

{
  let df = aq.from([{ a: 1 }, { a: 2 }]);
  return df.derive({ b: row => row['a'] + 1 }).view(10);
}

But if I try and make the column name a variable, it fails:

{
  let df = aq.from([{ a: 1 }, { a: 2 }]);
  let col = 'a';
  return df.derive({ b: row => row[col] + 1 }).view(10);
}

I suppose I'm just not understanding the scope in which the lambda gets evaluated, or something. What is the right way to use variable names within the lambdas that get passed to derive?

Notebook: https://observablehq.com/@pwills/arquero-confusions

DataFrame Structure

First, this looks amazing and thank you for creating it.

I am could not find a built in method in the api docs to understand the dataframe/table structure that is created. In R:dplyr the underlying dataframe/tibble provides a str function to see type of each column vector. In Python:Pandas, there is dtypes method. In JS:arquero, the underlying array is 'read as-is' and there is no-requirement/enforcement of a uniform type for each column vector. So there is no built in-method to see what is the current column type being worked on?

I can extend it using the built-in extension. Just wanted to confirm that that is the current case.

Add additional selection helpers.

These are suggestions from a R practice, with handy functionalities i frequently use:

  1. select
    current possibilities such as selecting by name, indices, not, range are excellent. I'd like also: starts_with, contains, and the possibility to test on column types (for instance select all numeric columns)

  2. rename
    of course select in Arquero allows to rename, i have noticed that.
    Within R, janitor::clean_names is very useful for tidying and fixing complex column_names that you can find, sometimes, in CSV files (with blanks, accented characters, caps) => DB compliant column names (one single world, lowercase, without accented or special chars...)

Support for flatMap or arrayAgg

Really loving Arquero so far! One thing I constantly run into is the need for a flatMap or equivalently SQL array_agg operation - Often grouping by a key, I want to collect the values into an array-valued output. I can use Javascript flatMap, but it would be great to have Arquero support this in its native ops

Working with date columns

FIrst off, thanks for building this - it's making my life a lot easier when I work with data in Observable ๐Ÿ˜„

I'm having some trouble working with dates. I build a data table via

df = aq.from([
  { date: '2020-01-01 00:00:00.000' },
  { date: '2020-01-02 00:00:00.000' }
])

I then add a custom function to shift the date forward 30 days, using moment:

let shiftDateFunc = d => moment(d).add(30, 'days').format()
aq.addFunction("shiftDate", shiftDateFunc)

However, when I try to do

df.derive({shiftedDate: op.shiftDate('date')})

I get an error:

Expression parse error: Invalid date

Which is surprising, since I can map this function over the objects just fine:

df.objects().map(d => shiftDateFunc(d.date)) // succeeds, returns an array of strings

Why can I map this function over the object arrays successfully, but it fails when I register it via addFunction? In my mental model the success of the former should imply that the latter will also work, but obviously I'm confused about something.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.