Code Monkey home page Code Monkey logo

soda-carto's Introduction

soda-carto Build Status

Query Carto datasets with a Socrata-style SODA2 API. Depends heavily on node-soda2-parser, which is a light wrapper around the node-sqlparser AST parser.

How it works

soda-carto is a reverse proxy meant to simulate a Socrata-style SODA2 API in front of a Carto API. To do that, it has to (1) translate the inbound SODA request into a Carto request, and (2) translate the Carto response into a response format you'd expect from a SODA API. 90% of the work necessary was on translating the inbound request.

Carto expects a pure PostgreSQL/PostGIS query, and SODA requests are almost SQL to begin with: the SODA API consists of querystring parameters like $select, $where, $group, $limit, etc. By combining those values, you can easily construct a SQL query. The SODA API also allows "simple filters," such as ?district=northinstead of$where=district = 'north'. It also supports a $query` parameter where you can write pure SQL, but that isn't supported by soda-carto under the assumption that it's not commonly used.

In order to translate these combinations of querystrings into a Carto request (basically a PostgreSQL/PostGIS query), soda-carto first delegates to node-soda2-parser, which constructs a standard SQL query from the querystring parts, then parses it to an abstract syntax tree (AST) using node-sqlparser. This logic resides in its own package so that other layers can sit on top of it (such as soda-geoservices), but that may be more trouble than it's worth.

soda-carto applies a few changes to the parsed AST:

  • Insert the table name as the FROM clause
  • Add a default limit of 1000 if none is specified
  • Select the geometry as GeoJSON if the output format is json, and as Well-known Text if the output format is csv
  • Rename the geometry field, since Socrata datasets support arbitrary geometry field names and in carto, they're always called the_geom
  • Convert Socrata's convex_hull, within_box, within_circle, and within_polygon functions to standard PostGIS

soda-carto then uses node-soda2-parser's stringify function to convert the touched-up AST back to SQL. This logic is all wrapped by handler.js, which handles an inbound HTTP request, translates it to a Carto/SQL query (as described above), then executes that query on a Carto API and passes on the response. It also does a few other minor things:

  • If the desired output format is json, it returns only the rows property of the Carto response, to match the response format you'd expect from a SODA API
  • It uses a lookup file (datasets.yml) to match Socrata dataset IDs (ie. sd09-aox1) to the corresponding Carto table (this file must be populated manually)

The last layer is serverless, which uses a simple YAML configuration file to deploy AWS Lambda functions. serverless.yml defines the routes and delegates the handling to handler.js.

Usage

To run this yourself, git clone the repository and npm install the dependencies. You'll also need to install serverless using npm install --global serverless.

You can change the CARTO_DOMAIN inside serverless.yml, then use serverless deploy to deploy the function to AWS Lambda. To view the logs, check CloudWatch or use serverless logs -f soda.

There is also a local web server provided by server.js, but there isn't full parity between that and what AWS Lambda would show; we need to rewire server.js to simply delegate to handler.js.

Feature support

The below table outlines the specific features identified in the SODA2 docs. Features designated auto are already PostgreSQL-compliant, so their support can be taken for granted. Features marked supported have custom "translation" code written to make them PostgreSQL/Carto-compliant. Features marked "todo" should be possible, but require translation code to be written. Those marked "not supported" are likely due to a shortcoming of the node-sqlparser library. Each feature should have a test written.

category feature example support
$select all fields $select=* auto
$select comma-separated fields $select=foo, bar auto
$select aliases $select=foo AS bar auto
$select operators $select=foo * 2 AS double_foo auto
$select count $select=count(*) AS count auto
$select sum $select=sum(foo) AS total auto
$select average $select=avg(foo) auto
$select minimum $select=min(foo) auto
$select maximum $select=max(foo) auto
$select date truncation (y/ym/ymd) $select=date_trunc_ym(datetime) AS month todo
$select convex_hull $select=convex_hull(location) supported
$select case $select=case(type = 'A', 'Full', type = 'B', 'Partial') supported
$select extent $select=extent(location) todo
$select simplify $select=simplify(location, 0.001) todo
$select number of vertices $select=num_points(location) todo
$select distance in meters $select=distance_in_meters(location, 'POINT(-122.334540 47.59815)') AS range todo
$select concatenate strings $select=foo
$where equality expression $where=foo = 'bar' auto
$where AND operator $where=foo = 'bar' AND baz = 2 auto
$where parentheses $where=foo = 'bar' AND (baz = 2 OR baz = 3) auto
$where operators $where=end - start < 3 auto
$where simple filters foo=bar&baz=1 supported
$where simple filters with $where clause foo=bar&$where=baz = 1` supported
$where quotes within strings $where=foo = 'bob''s burgers' not supported
$where quotes within simple filters $foo=bob's burgers supported
$where boolean fields $where=foo = true auto
$where boolean fields short-hand $where=foo auto
$where in function $where=foo in ('bar', 'baz') auto
$where not in function $where=foo not in ('bar', 'baz') auto
$where between $where=foo between '100' and '200' auto
$where not between $where=foo not between '100' and '200' not supported
$where intersects $where=intersects(location, 'POINT (-12.3, 45.6)') supported
$where starts with $where=starts_with(title, 'chief') todo
$where within box $where=within_box(location, 47.5, -122.3, 47.5, -122.3) supported
$where within circle $where=within_circle(location, 47.59815, -122.33454, 500) supported
$where within polygon $where=within_polygon(location, 'MULTIPOLYGON (((-87.637714 41.887275, -87.613681 41.886892, -87.625526 41.871555, -87.637714 41.887275)))') supported
$where record updated $where=:updated_at > '2017-02-19' not supported
$order order by $order=foo auto
$order order by with direction $order=foo DESC auto
$group group by $group=foo auto
$group group by truncated date $group=date_trunc_ym(datetime) todo
$having having $select=count(*) AS count&$group=bar&$having=count > 20 todo
$limit limit $limit=10 auto
$limit default limit of 1000 $select=foo supported
$offset offset $offset=5 auto
$q full-text search $q=foo todo
$q stemming $q=users not supported
$q multiple words are ANDed $q=test user todo
$query soql query $query=SELECT * WHERE foo = 'bar' todo
$query sub-query $query=SELECT city_feature, COUNT(*) AS count GROUP BY city_feature | > SELECT COUNT(city_feature) AS num_types, SUM(count) AS total_features not supported
$jsonp jsonp callback $jsonp=callback todo

See also

soda-carto's People

Contributors

andrewbt avatar awm33 avatar nwebz avatar timwis avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Forkers

andrewbt sabman

soda-carto's Issues

Research & document how to deploy to production

Once we're finished, we'll need to tell serverless to deploy to production (it defaults to dev). Pretty sure this is just a command line flag; just want to leave a task for it.

  • Enable CloudWatch Logs (INFO level)

Geometry column name

Socrata allows geometry fields to be named anything. Carto requires geometry columns be named the_geom. Perhaps the dataset lookup table should optionally let you specify the name of the geometry column?

Geometry field not renamed in SELECT or WHERE

#7 allowed us to rename the geometry field from the_geom to whatever the socrata dataset called it. This affects the names of the fields that are returned, but SELECT clauses like SELECT convex_hull(shape) and WHERE clauses like WHERE within_circle(shape, 39, -75, 50) will fail because shape is an unknown field.

Roadmap: merge node-soda2-parser into this library

This is definitely not an MVP need. But I think node-soda2-parser may be overkill. It makes more sense when trying to convert Socrata/SQL to something like Esri's API, but considering how similar Socrata's API is to vanilla postgres, I think we could probably do the job in like..6 regexes...and combining it into this library simply makes maintenance easier (especially by folks who aren't familiar with the 2 libraries and where one leaves off and the other begins).

Use a Socrata API key

Socrata will rate limit you at a certain point unless you include an API key, so it may be worth registering one.

Process response

Carto returns { rows: [] } but socrata users expect [ ] (not nested). Unfortunately this requires loading the entire response object into memory, parsing it, then serializing it :-/ But I suppose there are larger problems in this world.

Account for all Socrata API endpoints

and ensure we have Lambda functions for handling them.

  • /resource/xxxx-xxxx.json (primary API endpoint)
  • /resource/xxxx-xxxx.geojson
  • /api/views/xxxx-xxxx/rows.csv?accessType=DOWNLOAD (all CSV endpoints are like
    this)
  • /views/xxxx-xxxx/rows.json (only 6 endpoints advertised this way)
  • /views/xxxx-xxxx/rows.geojson
  • /view/yyyy-yyyy (DataLens page - note 4x4 is different from dataset id)

5 different endpoints for Socrata URLS

  • 2 for a CSV
  • some other ways for SoQL
  • need to make sure that we are accomodating these API endpoints in the proxy

Make test suite compliant with PostGIS

Verify CORS is enabled

The response headers look like CORS is enabled, but serverless docs say you need to explicitly enable it. Would be nice to test it in a simple JS app. Although I suppose it can't hurt to explicitly enable it anyway.

Log inbound requests

Cloudwatch should log the inbound request as well as what it was converted to for debugging and future regression testing.

Support generic field renaming

Currently you can rename the geometry field in the datasets.yml file. We should make this logic more generic, letting you rename any number of fields that way. This will resolve the issue of the crime data's hour field vs hour_ field in GeoDB2/carto.

Verify CSV encoding

I downloaded a CSV and got an error about its encoding when trying to open it in csvlook:

'ascii' codec can't encode character u'\u2026' in position 247: ordinal not in range(128)

Need to try the same thing directly against the carto data to find out if the proxy is changing the character encoding or something

EDIT: verified that same thing happens when downloading from carto directly (example query)

I'm lost on this one. Not sure it's carto's fault, maybe it's csvkit's fault?? Looks like a common python error. @andrewbt maybe you're more familiar with it? I'm using it via

csvlook incidents.csv | less -S

API Gateway has maximum 30 second timeout

Per: http://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html

Integration timeout 30 seconds for all integration types, including Lambda, Lambda proxy, HTTP, HTTP proxy, and AWS integrations.

For very large "SELECT *" queries on large CARTO files, we could hit this timeout. Also, for extremely complex SQL.

@timwis what are your largest datasets? For the crimes one and others, I'm thinking links through the proxy like data.phila.gov/api/v2/sql?q=select * from crimes as the "official export links" on opendataphilly.org or elsewhere might not actually work because of this timeout.

GeoJSON doesn't need to convert geometry format

We don't need to do ST_AsGeoJSON(the_geom)::json AS the_geom_geojson when the format being requested is geojson, because carto does that for us under the hood. Technically the geojson response includes the geometry as geojson twice at the moment. This is not that bad (and not an MVP issue); more of an optimization.

Consider renaming "socrata-carto"

I called it SODA because that's the name of Socrata's API (Socrata Open Data API), but I think "socrata-carto" would be a little more discoverable and clear. Any thoughts?

Dataset lookup table

A way to map Socrata dataset IDs (ie s98w-oiup) to carto table names (ie. phl.crime_incidents) so that incoming queries can reference the socrata ID but pull from a carto table.

I'm thinking a YAML file may be the best option for this.

GeoJSON download

Socrata supports exporting a full dataset in GeoJSON format by specifying .geojson in the URL.

ex. https://data.phila.gov/views/sspu-uyfa/rows.geojson

CSV download

Socrata supports downloading a full dataset in CSV form by specifying .csv in the url.

ex. https://data.phila.gov/api/views/sspu-uyfa/rows.csv?accessType=DOWNLOAD

The geometry columns are downloaded in WKT.

Discussion: What format should carto be queried in?

In the standard API JSON response, we want geometry represented as GeoJSON. By default it's returned as WKB. To get it in GeoJSON from Postgres, you have to use ST_AsGeoJSON(the_geom) in your SELECT clause. This requires explicitly naming all the fields instead of using SELECT *, or using SELECT *, ST_AsGeoJSON(the_geom) and getting the geometry value twice. It also requires knowing that there is a geometry field, and what the field name is (at least carto makes this guaranteed).

The way this project currently does the job is by requesting the carto response in GeoJSON format entirely via (&format=geojson). This does the job much easier, and maybe that's the easiest solution. But in the CSV download, we want geometry represented as WKT.

Options seem to be:

A. Explicitly specify format in the SELECT clause (may result in duplicate geometry columns or a second http request to get the list of fields so they can be selected explicitly)
B. Rely on the format parameter, and maybe add a field to the SELECT clause for the CSV download (ST_AsWKT(the_geom))
C. Convert formats client-side (between WKB and GeoJSON or WKT)

/cc @andrewbt

full text search $q

Socrata supports full text search via $q=foo bar. Postgres and carto support it in the following format:

SELECT * FROM crimes_2015_to_oct_2016 WHERE (crimes_2015_to_oct_2016::text ILIKE '%foo%')

Interesting that this one requires the table name in the where clause.

Skipped test added.

Endpoint env var should only be the domain

It shouldn't include /api/v2/, as that's closely tied to the query being generated by this code and will always be the same anyway. Need to change this in handler.js and server.js.

`is not null` fails

The underlying parser fails on where foo is not null. It works on is null but the not throws it off. We knew about this for is not between and deemed it not important enough. Overlooked is not null.

Ex: https://data.phila.gov/resource/4t9v-rppq.json?$where=requested_datetime is not null

Limit syntax

I think node-soda2-parser is stringifying the LIMIT clause as LIMIT 0, 10 instead of LIMIT 10 OFFSET 0 because carto's throwing an error about that.

EDIT: found the source

Adding skipped test.

$query parameter

Socrata supports writing full SoQL queries via the $query parameter, similar to carto's ?q parameter but without the FROM.

https://dev.socrata.com/docs/queries/query.html

We might just call this one 'not supported', though I imagine we could add support for it if needed.

EDIT: Although, interestingly, this feature supports "sub queries" (see docs)

within_circle function not working as expected against carto

https://data.phila.gov/resource/sspu-uyfa.json?$limit=5&$where=within_circle(shape,39.9642259,-75.1476617,100)

Maps to

https://phl.carto.com/api/v2/sql?q=SELECT *, ST_AsGeoJSON(the_geom)::json AS shape FROM "incidents_part1_part2" WHERE ST_Point_Inside_Circle(the_geom, 39.9642259, -75.1476617, 0.001) LIMIT 5

But the carto version doesn't return any results. It is also much slower for some reason. @andrewbt any idea what's wrong here? Maybe I'm doing a radius search incorrectly?

Improve docs

  • Add more detail generally (goal of project, how it works)
  • Add serverless info

Spatial radius query not working

Add tests for handler.js

There's some logic and decision-making in handler.js that we'd have a few tests for, ideally. It's more challenging because we have to mock request using something like proxyquire though.

This definitely isn't MVP though.

Basic filters may misinterpret numeric types, which will fail in postgres

Example: incidents_part1_part2?dc_dist=24

The table's dc_dist column is a string, but node-soda2-parser treats this basic filter's value as a number (code). The result is a postgres query WHERE dc_dist = 24, which throws an error.

@andrewbt do you think it would be wise to always treat these values as strings? For instance, if dc_dist where actually an integer field in the table, and we queried as if it were a string (WHERE dc_dist = '24'), it seems like that would work fine. The only other types I can think of are booleans, floats, and date/times. Can we query those as strings too?

?limit=10 throws error

I queried ?limit=10 by accident (it should be ?$limit=10 and noticed it threw a lambda/api gateway error. In theory, it should generate WHERE limit = 10, which should just throw a sql error. Indicates something failed and the error's being lost.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.