CompassQL Query Language for visualization recommendation.

License: Other

CompassQL

CompassQL is a visualization query language that powers chart specifications and recommendations in Voyager 2.

As described in our vision paper and Voyager 2 paper, a CompassQL query is a JSON object that contains the following components:

Specification (spec) for describing a collection of queried visualizations. This spec's syntax follows a structure similar to Vega-Lite's single view specification. However, spec in CompassQL can have enumeration specifiers (or wildcards) describing properties that can be enumerated.
Grouping/Nesting method names (groupBy and nest) for grouping queried visualizations into groups or hierarchical groups.
Ranking method names (orderBy and chooseBy) for ordering queried visualizations or choosing a top visualization from the collection.
Config (config) for customizing query parameters.

Internally, the CompassQL engine contains a collection of constraints for enumerating a set of candidate visualizations based on the input specification, and methods for grouping and ranking visualizations.

For example, the following CompassQL query has one wildcard for the mark property. The system will automatically generate different marks and choose the top visual encodings based on the effectiveness score.

{
  "spec": {
    "data": {"url": "data/cars.json"},
    "mark": "?",
    "encodings": [
      {
        "channel": "x",
        "aggregate": "mean",
        "field": "Horsepower",
        "type": "quantitative"
      },{
        "channel": "y",
        "field": "Cylinders",
        "type": "ordinal"
      }
    ]
  },
  "chooseBy": "effectiveness"
}

The examples/specs directory contains a number of example CompassQL queries.

To understand more about the structure of a CompassQL Query, look at the Query interface declaration.

A query's spec property implements the SpecQuery interface, which follows the same structure as Vega-Lite's UnitSpec (single view specification). Most of SpecQuery's properties have a -Query suffix to hint that an instance is a query that can contain wildcards to describe a collection of specifications.
Since multiple encoding channels can be wildcards, the encoding object in Vega-Lite is flattened into encodings, which is an array of Encoding in CompassQL's spec.
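To make the flattening concrete, here is a minimal sketch of turning a Vega-Lite style encoding object into a CompassQL style encodings array. The helper flattenEncoding and the types FieldDefSketch and EncodingSketch are hypothetical names used only for illustration; they are not part of the CompassQL API. In an object, each channel is a unique key, so two channels could not both be "?"; in an array, channel is an ordinary property and any number of entries can carry a wildcard.

```typescript
// Hypothetical minimal types for illustration; the real interfaces live in the CompassQL source.
type FieldDefSketch = { field?: string; type?: string; aggregate?: string; bin?: boolean };
type EncodingSketch = FieldDefSketch & { channel: string };

// Turn a Vega-Lite style encoding object ({x: {...}, y: {...}}) into
// a CompassQL style array ([{channel: "x", ...}, {channel: "y", ...}]).
function flattenEncoding(encoding: Record<string, FieldDefSketch>): EncodingSketch[] {
  return Object.keys(encoding).map((channel) => ({ channel, ...encoding[channel] }));
}

const flattened = flattenEncoding({
  x: { aggregate: "mean", field: "Horsepower", type: "quantitative" },
  y: { field: "Cylinders", type: "ordinal" },
});
```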

Usage

Given a row-based array of data objects, here are the steps to use CompassQL:

Specify a query config (or use an empty object to use the default configs).

var opt = {}; // Use all default query configs

For all query configuration properties, see src/config.ts.

Build a data schema.

var schema = cql.schema.build(data);

The data property is a row-based array of data objects where each object represents a row in the data table (e.g., [{"a": 1, "b":2}, {"a": 2, "b": 3}]).

You can reuse the same schema for querying the same dataset multiple times.

Specify a query. For example, this is a query for automatically selecting a mark:

var query = {
  spec: {
    data: { url: "node_modules/vega-datasets/data/cars.json" },
    mark: "?",
    encodings: [
      {
        channel: "x",
        aggregate: "mean",
        field: "Horsepower",
        type: "quantitative",
      },
      {
        channel: "y",
        field: "Cylinders",
        type: "ordinal",
      },
    ],
  },
  chooseBy: "effectiveness",
};

Execute a CompassQL query.

var output = cql.recommend(query, schema);
var result = output.result; // recommendation result

The result object is an instance of SpecQueryModelGroup (ResultGroup<SpecQueryModel>), which is the root of the ordered output tree. Its items property can be either an array of SpecQueryModel or an array of SpecQueryModelGroup (for hierarchical groupings).

The SpecQueryModel is a class instance of a SpecQuery with helper methods. Note that, in the result, all spec query models are completely enumerated, so no wildcards are left.

Convert instances of SpecQueryModel in the tree into Vega-Lite specifications, using SpecQueryModel's toSpec() method and the mapLeaves method.

var vlTree = cql.result.mapLeaves(result, function (item) {
  return item.toSpec();
});

Now you can use the result. In this case, the tree has only 2 levels (the root and leaves). We can just get the top visualization by accessing the 0-th item.

var topVlSpec = vlTree.items[0];

For the full source code, please see index.html.
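When the query uses hierarchical groupings, the items of a group can themselves be groups. The following is a minimal sketch of how a leaf-mapping traversal over such a tree could work. GroupSketch, isGroup, and mapLeavesSketch are hypothetical names for illustration and are not the library's implementation of cql.result.mapLeaves; the name property on groups is also an assumption.

```typescript
// Hypothetical minimal shape of the output tree, for illustration only.
interface GroupSketch<T> {
  name: string;
  items: Array<T | GroupSketch<T>>;
}

// Treat a node as a group if it carries its own items array.
// (This assumes leaves never have an "items" array themselves.)
function isGroup(node: unknown): boolean {
  return typeof node === "object" && node !== null && Array.isArray((node as { items?: unknown }).items);
}

// Apply fn to every leaf while preserving the grouping structure.
function mapLeavesSketch<T, U>(group: GroupSketch<T>, fn: (leaf: T) => U): GroupSketch<U> {
  return {
    name: group.name,
    items: group.items.map((item) =>
      isGroup(item) ? mapLeavesSketch<T, U>(item as GroupSketch<T>, fn) : fn(item as T)
    ),
  };
}

const sketchTree: GroupSketch<number> = {
  name: "root",
  items: [1, { name: "child", items: [2, 3] }],
};
const doubledTree = mapLeavesSketch(sketchTree, (n) => n * 2);
```

With a two-level tree, as in the usage above, the traversal reduces to a plain map over the root's items.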

Note for Developers

The root file of our project is src/cql.ts, which defines the top-level namespace cql for the compiled files. Other files under src/ reflect namespace structure. All methods for cql.xxx will be in either src/xxx.ts or src/xxx/xxx.ts. For example, cql.util.* methods are in src/util.ts, cql.query is in src/query/query.ts.
TODO: constraints
- List in Vy2 paper supplement.

Development Instructions

You can install dependencies with:

yarn install

You can use npm commands such as:

npm run build
npm run lint
npm run test
npm run cover       # see test coverage (see coverage/lcov-report/index.html)
npm run watch       # watcher that builds, lints, and tests
npm run test-debug  # useful for debugging unit tests with vscode
npm run clean       # useful for wiping out js files that were created from another branch

(See package.json for the full list of commands.)

To play with the latest CompassQL in the vega-editor, use branch cql-vl3 in kanitw's fork, which has been updated to use Vega-Lite 3, Vega 5, and CompassQL ^0.21.1. (For CompassQL 0.7 or older, use branch compassql, which uses Vega-Lite 1.x.)

Make sure to link CompassQL to the editor:

cd COMPASSQL_DIR
npm link

cd VEGA_EDITOR_DIR
npm run vendor -- -l compassql

(You might want to link your local version of Vega-Lite as well.)

Main API

The main method is cql.recommend, which is in src/recommend.ts.

Directory Structure

examples - Example CompassQL queries
- examples/specs – All JSON files for CompassQL queries
- examples/cql-examples.json - A JSON file listing all CompassQL examples that should be shown in the Vega editor.
src/ - Main source code directory.
- src/cql.ts is the root file for the CompassQL codebase that exports the global cql object. Other files under src/ reflect the namespace structure.
- All interfaces for CompassQL syntax should be declared at the top level of the src/ folder.
test/ - Code for unit testing. test's structure reflects src's directory structure. For example, test/constraint/ tests the files inside src/constraint/.
typings/ - TypeScript typing declaration for dependencies. Some of them are downloaded from the TypeStrong community.

Pro-Tip

When you add a new source file to the project, don't forget to add the file to files in tsconfig.json.

compassql's Issues

MVP for Enumerate

enumerate answers based on input CompassQL query
- check if the constraint is enabled (in the option)
- generate fields -- read from schema
support two types of constraints
- encoding constraint (constraint for one encoding mapping)
- spec constraint (constraint that involves multiple encoding mappings or involves the relationship between mark and encoding)
determine the order in a way that automatically adding count still works
- noRepeatedField --> '*'
Remember which field we assign for later reference

Missing Constraints

channelsSupportRoles
omitShapeWithBin (channel supports role?)
omitShapeWithTimeDimension (channel supports role?)
omitBarWithSize
omitRawBar/Area

Transform: Filter

Adjust interface to be similar to vega/vega-lite#1461
Find //TODO: transform and implement each relevant part.

Revise old compass constraints

Not sure if we should add the following

~~maxCardinalityForAutoAddOrdinal~~ #70
alwaysAddHistogram
~~consistentAutoQ -- if aggregate for all Q are "*" -- give all of them same level of aggregation.~~ (already have omitRawContinuousFieldForAggregatePlot)

Add missing core tests

enumerator.test.ts

For each of these properties:

aggregate
timeUnit
field
type

Write a test that enumerates all valid values

aggregate
timeUnit
field
type

hint: turn config.verbose = true

Write a test that enumerates both valid and invalid values (and test that the output contains only valid values)

aggregate
- To see relevant constraints, look at constraints/{spec|encoding}.ts
  - look at properties of each constraint
  - look at a few ones that contain Property.AGGREGATE

(LATER)

Write a test that enumerates all valid values

bin -- bin is the most complicated -- ping me to explain it

Write a test that enumerates both valid and invalid values (and test that the output contains only valid values)

To see relevant constraints, look at constraints/{spec|encoding}.ts
timeUnit
field
type

Other Files

Run npm run cover and see coverage report -- add more tests for uncovered constraints

constraint/encoding.test.ts @FelixCodes
constraint/spec.test.ts – @FelixCodes
nest.ts – @RileyChang
query.ts – @RileyChang

Criteria for enumerating timeUnit

We need good criteria for determining which timeUnits we should enumerate by default for a particular dataset

Blocked #95 -- need stats first

Constraint: AggregateOnly / RawOnly

AggregateOnly
RawOnly

Make debugger stop in mocha

Sometimes adding a debugger breakpoint in mocha doesn't work.

http://stackoverflow.com/questions/30023736/mocha-breakpoints-using-visual-studio-code

Expand only the top 5 at the top level

Refactor Bin to Support Bin Parameter

Currently in EncodingQuery, it's

bin?: boolean | EnumSpec<boolean> | ShortEnumSpec;

However, bin can have parameters too, and I don't want to mix up boolean and object here.

So I'm thinking

bin?: BinQuery

with the following interface

interface BinQuery {
  enable: boolean | EnumSpec<boolean> | ShortEnumSpec;
  maxbins: number | EnumSpec<number> | ShortEnumSpec;
  // ... other params
}

Any thoughts? @domoritz
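To illustrate how the proposed object form could coexist with the current boolean form during a migration, here is a minimal sketch. The shapes of EnumSpecSketch and ShortEnumSpecSketch are assumptions (the real EnumSpec and ShortEnumSpec are defined in the CompassQL source), maxbins is made optional here for brevity, and toBinQuery is a hypothetical helper.

```typescript
// Assumed shapes for illustration only.
type ShortEnumSpecSketch = "?";
interface EnumSpecSketch<T> {
  values: T[];
}

interface BinQuerySketch {
  enable: boolean | EnumSpecSketch<boolean> | ShortEnumSpecSketch;
  maxbins?: number | EnumSpecSketch<number> | ShortEnumSpecSketch;
}

type LegacyBin = boolean | EnumSpecSketch<boolean> | ShortEnumSpecSketch;

// Accept either the current boolean/wildcard form or the proposed object form,
// and always return the proposed object form.
function toBinQuery(bin: LegacyBin | BinQuerySketch): BinQuerySketch {
  if (typeof bin === "object" && "enable" in bin) {
    return bin; // already in the proposed form
  }
  return { enable: bin };
}

const fromBoolean = toBinQuery(true);
const fromWildcard = toBinQuery("?");
const alreadyObject = toBinQuery({ enable: true, maxbins: 10 });
```

Keeping enable as a separate property means a wildcard on whether to bin and a wildcard on maxbins can be expressed independently.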

Reduce redundancy for checking type of properties

Spec constraint's satisfy can become more efficient when requireAllProperties is true.
Currently we repeatedly check whether we have all the required properties.

Data-driven occlusion test

Right now we just say aggregate has no occlusion, while raw has occlusion -- that's not always correct.

Enumerate Stack

Stack
Stack constraint (don't enumerate non-summing aggregate for stack)

Improve OxQxQ ranking

x:Q, y:Q, color: O should be > y:O, x:Q, size: Q

Make the item title in the editor display the score

Make EnumSpec support exclude in addition to enum

Modify the model build logic to merge exclude with values.
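One possible reading of "merge exclude with values" is to filter the candidate values by the exclusion list before enumeration. The sketch below assumes an enum spec with optional values and exclude arrays; the interface EnumSpecWithExclude and the helper resolveValues are hypothetical names, not existing CompassQL code.

```typescript
// Assumed shape for illustration: both properties are optional.
interface EnumSpecWithExclude<T> {
  values?: T[];
  exclude?: T[];
}

// Start from the explicit values (or the defaults when none are given),
// then drop everything listed in exclude.
function resolveValues<T>(spec: EnumSpecWithExclude<T>, defaults: T[]): T[] {
  const candidates = spec.values ?? defaults;
  const excluded = spec.exclude ?? [];
  return candidates.filter((value) => excluded.indexOf(value) === -1);
}

const defaultMarks = ["point", "bar", "line"];
const withoutBar = resolveValues({ exclude: ["bar"] }, defaultMarks);
const explicitOnly = resolveValues({ values: ["bar", "area"], exclude: ["area"] }, defaultMarks);
```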

Headless Mode

Supports top N query / or at least make vega-editor shows top N

MVP Ranking

Allow outputting pruned results that do not satisfy non-strict constraints

The rationale is that sometimes users might expect to see some visualizations but do not see them because they are pruned.

Outputting them with a label saying that they do not satisfy a constraint could be useful both for debugging and for helping users understand.

Revise how we group config for preferred axis based on field type

preferredTemporalAxis?: Channel;
preferredOrdinalAxis?: Channel;
preferredNominalAxis?: Channel;

Should ordinal and nominal be grouped together?
Should we group this by the output scale type? (If so, time field with timeUnit would be ordinal scale.)

Add JSON schema

Generate JSON Schema for CompassQL schema

Look at this line in Vega-Lite
https://github.com/vega/vega-lite/blob/master/package.json#L35

Do the same for Query.

Add Tests to validate all examples

In Vega-Lite, we have a test that validates all example specs so that both the input and the output validate against the JSON schema.

Validate each input CompassQL query (each example JSON file).
For each example query, run the query method in query.ts and check the output. For each SpecQueryModel in the output, convert it into a Vega-Lite spec (call .toSpec()) and validate the Vega-Lite output.

Make sure that the example test is excluded from test coverage.
(See Vega-Lite's package.json)

Better name for enumJob -- maybe enumSpecIndex?

This spec generates duplicated output

{
  "mark": {
    "mode": "pick/enum",
    "values": [""]
  },
  "encodings": [
    {
      "channel": "x",
      "field": "Cylinders",
      "type": "quantitative"
    },{
      "autoCount": true
    }
  ]
}

Rank Encoding Based on transform

For example, should bin x bin x count be ranked better than avg x avg?

Support varying number of fields mapped

Deal with text table.

In older Compass, we add a few hacks for recommending text table.

With the new label and tile, we need to revise how we deal with this.

Add statistical profiling

1D
2D
Need to think what to add

Refactor constraints

Specs

hasAppropriateGraphicTypeForMark
omitRawBarLineArea
omitRawTable

Distinguish high-cardinality strings from nominal fields

Fields with too high cardinality take up a lot of space and can be slow to render.

add a flag isKeyLike (or some better name) to schema

We might want to consider a few options:

distinguish between categories (low cardinality) and text (high cardinality), as they serve different purposes in data analysis anyway.
- Check if the cardinality is above X% (50%?) of the overall data count and above a minimum threshold (e.g., 40)

Maybe check whether the cardinality is above ~80% of the overall data count, or some similar criterion

Add a constraint that excludes fields with too high cardinality from being added automatically.
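The threshold check discussed above could be sketched as follows. The function name isKeyLike comes from the proposed flag; its signature and the default values (50% and 40, taken from the tentative numbers in this issue) are assumptions, not settled configuration.

```typescript
// A field is treated as key-like (high-cardinality text) when its number of
// distinct values exceeds both a fixed minimum and a fraction of the row count.
function isKeyLike(
  cardinality: number,
  rowCount: number,
  ratio: number = 0.5,
  minCardinality: number = 40
): boolean {
  return cardinality > minCardinality && cardinality > ratio * rowCount;
}

const nameLikeField = isKeyLike(300, 406); // many distinct values relative to rows
const categoryField = isKeyLike(5, 406); // few distinct values
const smallDataset = isKeyLike(30, 40); // high ratio, but below the minimum threshold
```

The minimum threshold keeps small datasets from having every nominal field flagged just because the ratio is high.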

Split generate.ts into two files

Right now the enumerator code is in generate.ts.
However, this makes generate.test.ts unduly long.

Therefore, we should extract enumerator.ts from generate.ts

Normalize flat version to nested version

(for demo)

Turning line overlay on only for ordinal scale

(For temporal line, there might be too many points?)

Set max size for each cell in vega-editor

MVP Grouping

Constraint propertyPrecedence

Prevent duplicate output if autoCount comes after channel in propertyPrecedence

Basically, whenever autoCount is false, we shouldn't even assign it to a channel.

We have to either add logic to prevent autoCount from coming after channel in the propertyPrecedence
or make answerSet in generate really a set to prevent duplication
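The second option, making the answer set behave like a real set, could be sketched as below. dedupeSpecs is a hypothetical helper; using JSON.stringify as the default key is an assumption that only works when equal specs serialize their properties in the same order, so a dedicated key function may be preferable in practice.

```typescript
// Keep the first occurrence of each spec, where equality is decided by a string key.
function dedupeSpecs<T>(
  specs: T[],
  keyOf: (spec: T) => string = (spec) => JSON.stringify(spec)
): T[] {
  const seen = new Set<string>();
  const unique: T[] = [];
  for (const spec of specs) {
    const key = keyOf(spec);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(spec);
    }
  }
  return unique;
}

const uniqueAnswers = dedupeSpecs([
  { mark: "bar", x: "Cylinders" },
  { mark: "bar", x: "Cylinders" }, // duplicate produced by a redundant enumeration path
  { mark: "point", x: "Cylinders" },
]);
```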

Prevent nested property output from coming before its parent

Systematically Test Constraints / Ranking

Constraints

Distinguish between grouping by "fields" and by "fields and transforms"

Syntax for nested grouping

Nested grouping is very important for understanding structure / debugging output results.
(I'm currently flooded by transposes of the visualizations.)

Therefore we need a good syntax for nested grouping.

Suppose I want a hierarchical grouping that first groups by dataQueryKey and then by encodingKey.

For each subgroup (by encodingKey), I want to order the subgroup's items by rankingFn1.
For each group (by dataQueryKey), I want to order the group's items (which are subgroups based on encodingKey) by rankingFn2.
Finally, I want to order groups by rankingFn3.

For example, rankingFn1 = rankingFn2 = "effectiveness". rankingFn3 can be some data enumeration order. The ranking function will rank groups by calculating the score for the top item in each list.

Suppose

spec = {
    "data": {"url": "data/cars.json"},
    "mark": "?",
    "encodings": [
      {
        "channel": "?",
        "field": "Cylinders",
        "type": "ordinal"
      },{
        "channel": "?",
        "bin": "?",
        "aggregate": "?",
        "field": "Horsepower",
        "type": "quantitative"
      }
    ]
  }

Here are a few alternative queries:

a) Nested version

{
  spec: spec, 
  group/groupings: { 
    // This case, definitely start with top-level grouping key. 
    by: 'dataKey',
    // if we want one output for each group, we can replace this orderItemBy with chooseBy
    orderItemBy: 'rankingFn2',
    subgroup/subgroupings: {
      by: 'encodingKey',
      orderItemBy: 'rankingFn1'     
    }
  },
  orderBy: 'rankingFn3'   
}

b) Array-based

{
  spec: spec, 
  // should the first one be the top-level one or the subgroup one -- currently it's the subgroup one
  group/groupings: [{ 

     groupBy: 'encodingKey',
     // if we want one output for each group, we can replace this orderItemBy with chooseBy
     orderItemBy: 'rankingFn1'  
  },{
     groupBy: 'dataKey',
     orderItemBy: 'rankingFn2'  
  }],
  orderBy: 'rankingFn3'   // or orderGroupBy?
}

@jheer @domoritz any preference for a. or b. (or other options) / minor wordings?

I am not married to either of these yet. Other ideas are welcome.
I'm leaning toward the nested version because it seems clearer which one is the top-level grouping.
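Independent of which syntax is chosen, the intended semantics can be sketched in code. This is a minimal illustration under assumptions: VizSketch with precomputed dataKey, encodingKey, and score fields is hypothetical, a numeric score stands in for "effectiveness", and the insertion order of the outer Map stands in for the data enumeration order of rankingFn3.

```typescript
interface VizSketch { dataKey: string; encodingKey: string; score: number; }
interface SubgroupSketch { encodingKey: string; items: VizSketch[]; }
interface TopGroupSketch { dataKey: string; subgroups: SubgroupSketch[]; }

// Partition items by a string key, preserving first-seen order of the keys.
function bucketBy<T>(items: T[], keyOf: (item: T) => string): Map<string, T[]> {
  const buckets = new Map<string, T[]>();
  for (const item of items) {
    const key = keyOf(item);
    const bucket = buckets.get(key);
    if (bucket) {
      bucket.push(item);
    } else {
      buckets.set(key, [item]);
    }
  }
  return buckets;
}

// A subgroup is scored by its top (first) item.
function topScore(sub: SubgroupSketch): number {
  return sub.items[0]?.score ?? Number.NEGATIVE_INFINITY;
}

function nestSketch(vizs: VizSketch[]): TopGroupSketch[] {
  const groups: TopGroupSketch[] = [];
  bucketBy(vizs, (v) => v.dataKey).forEach((inData, dataKey) => {
    const subgroups: SubgroupSketch[] = [];
    bucketBy(inData, (v) => v.encodingKey).forEach((inEncoding, encodingKey) => {
      // rankingFn1: order items inside each subgroup by descending score.
      subgroups.push({ encodingKey, items: inEncoding.slice().sort((a, b) => b.score - a.score) });
    });
    // rankingFn2: order subgroups inside a group by the score of their top item.
    subgroups.sort((a, b) => topScore(b) - topScore(a));
    groups.push({ dataKey, subgroups });
  });
  // rankingFn3: groups stay in first-seen order (a stand-in for data enumeration order).
  return groups;
}

const nestedSketch = nestSketch([
  { dataKey: "d1", encodingKey: "e1", score: 1 },
  { dataKey: "d1", encodingKey: "e2", score: 5 },
  { dataKey: "d1", encodingKey: "e1", score: 3 },
  { dataKey: "d2", encodingKey: "e1", score: 2 },
]);
```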

Improve Ranking

Channel, Cardinality
Penalize over encoding

Test

TxT
TxQ
QxT > Q

Scale

Background

Look at the description and changes of #27 to see the infrastructure for adding a nested property (bin.maxbins) -- note that I might have missed something in the description, but if that's the case, you'll notice the problem as you debug.

1st step Scale.type

Scale.*

Repeat the process for other scale properties (one PR for each)

add ones that are required by other tasks
- type
  - clamp: Q, T
  - exponent: pow
  - round: Q, T
    - accept types of values depending on scale type
- zero --> zero doesn't play well with [ ScaleType.ORDINAL, LOG, TIME, UTC]. I don't think I'm missing anything else...
  - #105
- bandSize
  - #93
  - ~~bandSize must be at least 0~~
- range
  - #101
  - ~~values must contain two or more values.~~
- domain
- round
- clamp
  - must have continuous domain (quantitative and time types only)
- nice
  - similar to clamp: quantitative and time.
- exponent
- useRawDomain
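Two of the rules in this list can be written as simple predicates over the scale type. This is a sketch only: ScaleTypeSketch is an assumed subset of scale type names written as lowercase strings, and supportsZero and supportsExponent are hypothetical helper names rather than existing constraints.

```typescript
// Assumed subset of scale types, for illustration.
type ScaleTypeSketch = "linear" | "log" | "pow" | "sqrt" | "ordinal" | "time" | "utc";

// Per the note above, zero does not play well with ordinal, log, time, and utc scales.
const ZERO_UNSUPPORTED: ScaleTypeSketch[] = ["ordinal", "log", "time", "utc"];

function supportsZero(scaleType: ScaleTypeSketch): boolean {
  return ZERO_UNSUPPORTED.indexOf(scaleType) === -1;
}

// Per the note above, exponent only applies to pow scales.
function supportsExponent(scaleType: ScaleTypeSketch): boolean {
  return scaleType === "pow";
}

const zeroOnLinear = supportsZero("linear");
const zeroOnLog = supportsZero("log");
```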

--- LATER ---

padding
- works with channel.x, channel.y --> uses pixels
- ??? padding (0, 1) for rangeBands ??? -- LATER

Score based on scale type: Tick Score should be much worse for year(T) than month(T)

http://localhost:1234/?mode=compassql&spec=1d-T

Refactor / Additional Test

Extract and test hasRequiredPropertyAsEnumSpec in satisfy of EncodingConstraintModel and SpecConstraintModel

Replicating Compass

Gen

aggregate.test.ts
encodings.test.ts

Run npm run cover and see coverage report -- add more tests for uncovered constraints

constraint/encoding.test.ts @FelixCodes
constraint/spec.test.ts – @FelixCodes

Don't bin a Q-field and add autoCount if there is already a dimension in the spec

For example,

{
  "spec": {
    "data": {"url": "data/cars.json"},
    "mark": "?",
    "encodings": [
      {
        "channel": "?",
        "field": "Cylinders",
        "type": "nominal"
      },{
        "channel": "?",
        "field": "Origin",
        "type": "ordinal"
      },{
        "channel": "?",
        "bin": "?",
        "aggregate": "?",
        "field": "Acceleration",
        "type": "quantitative"
      }
    ]
  },
  "groupBy": "data",
  "config": {
    "autoAddCount": true
  }
}

has the group Cylinders,n|Origin,o|bin(Acceleration,q)|count(*,q), which contains a visualization like this one:

{
  "data": {
    "url": "data/cars.json"
  },
  "mark": "point",
  "encoding": {
    "y": {
      "field": "Cylinders",
      "type": "nominal"
    },
    "x": {
      "field": "Origin",
      "type": "ordinal"
    },
    "row": {
      "bin": true,
      "field": "Acceleration",
      "type": "quantitative"
    },
    "size": {
      "aggregate": "count",
      "field": "*",
      "type": "quantitative"
    }
  }
}

Support automatically adding count

Additional Constraints

(for completeness)

For raw plots, don't put a field on detail (originally vega/compass#98)

Cardinality Based Constraints

determine input format for cardinality in the schema
maxCardinalityForFacets
maxCardinalityForColor
maxCardinalityForShape
minCardinalityForBin
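As a rough illustration of how such constraints might consume cardinality information, the sketch below counts distinct values per field from row-based data and checks the count against per-channel limits. The constraint names come from the list above, but the numeric limits are placeholders, and countDistinct and withinCardinalityLimit are hypothetical helpers; the actual input format in the schema is still to be determined.

```typescript
type DataRow = Record<string, unknown>;

// Count the distinct values of one field across all rows.
function countDistinct(rows: DataRow[], field: string): number {
  const seen = new Set<unknown>();
  for (const row of rows) {
    seen.add(row[field]);
  }
  return seen.size;
}

// Placeholder limits; the real defaults are not specified here.
const cardinalityLimits = {
  maxCardinalityForFacets: 10,
  maxCardinalityForColor: 20,
  maxCardinalityForShape: 6,
};

// Check whether a field with the given cardinality may be placed on a channel.
function withinCardinalityLimit(channel: string, cardinality: number): boolean {
  switch (channel) {
    case "row":
    case "column":
      return cardinality <= cardinalityLimits.maxCardinalityForFacets;
    case "color":
      return cardinality <= cardinalityLimits.maxCardinalityForColor;
    case "shape":
      return cardinality <= cardinalityLimits.maxCardinalityForShape;
    default:
      return true; // no cardinality limit for other channels in this sketch
  }
}

const sampleRows: DataRow[] = [{ a: 1, b: 2 }, { a: 2, b: 3 }, { a: 2, b: 4 }];
const cardinalityOfA = countDistinct(sampleRows, "a");
```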

X/Y, Row/Column Preference

Refactor

Consistent Variable Name
- encodingQ => encQ
- property => prop
EnumSpecIndex.timeunit => timeUnit

cc: @ZeningQu

Aggregate plot with facet as the only group-by should be rated worse

e.g.,

{
  "data": {
    "url": "data/cars.json"
  },
  "mark": "point",
  "encoding": {
    "row": {
      "field": "Cylinders",
      "type": "nominal"
    },
    "x": {
      "aggregate": "mean",
      "bin": false,
      "field": "Horsepower",
      "type": "quantitative"
    },
    "y": {
      "aggregate": "mean",
      "bin": false,
      "field": "Acceleration",
      "type": "quantitative"
    }
  }
}