cityofphiladelphia / soda-carto Goto Github PK
View Code? Open in Web Editor NEWQuery Carto datasets with a Socrata-style SODA2 API
Query Carto datasets with a Socrata-style SODA2 API
I think node-soda2-parser is stringifying the LIMIT
clause as LIMIT 0, 10
instead of LIMIT 10 OFFSET 0
because carto's throwing an error about that.
EDIT: found the source
Adding skipped test.
Looks like lambda runs Node v4.3.2
, which doesn't let you have much fun.
Socrata supports writing full SoQL queries via the $query
parameter, similar to carto's ?q
parameter but without the FROM
.
https://dev.socrata.com/docs/queries/query.html
We might just call this one 'not supported', though I imagine we could add support for it if needed.
EDIT: Although, interestingly, this feature supports "sub queries" (see docs)
Once we're finished, we'll need to tell serverless to deploy to production (it defaults to dev
). Pretty sure this is just a command line flag; just want to leave a task for it.
This is just notes of the issues I saw for now, I will clean this up later / split into separate issues:
http://postgis.net/docs/ST_NPoints.html (wrong function)
https://github.com/CityOfPhiladelphia/soda-carto/blob/master/test/index.js#L122
http://postgis.net/docs/ST_Distance.html (cast to geography)
https://github.com/CityOfPhiladelphia/soda-carto/blob/master/test/index.js#L130
https://github.com/CityOfPhiladelphia/soda-carto/blob/master/test/index.js#L143 (could be replaced by the star variable above?)
https://github.com/CityOfPhiladelphia/soda-carto/blob/master/test/index.js#L154 (should this be "location" and not the_geom??)
http://postgis.net/docs/ST_Intersects.html (the WKT needs to be casted to geometry)
https://github.com/CityOfPhiladelphia/soda-carto/blob/master/test/index.js#L274
http://postgis.net/docs/ST_MakeEnvelope.html (need to swap order here, socrata within_box is different ordering of xmin, ymax, xmax, ymin than what postgis expects)
https://github.com/CityOfPhiladelphia/soda-carto/blob/master/test/index.js#L290
http://postgis.net/docs/ST_Within.html (the WKT needs casting to geometry maybe? or use st_geomfromtext?)
https://github.com/CityOfPhiladelphia/soda-carto/blob/master/test/index.js#L306
https://github.com/CityOfPhiladelphia/soda-carto/blob/master/test/index.js#L314 (missing - but you could easily do carto's updated_at maybe?)
https://github.com/CityOfPhiladelphia/soda-carto/blob/master/test/index.js#L394 (missing?)
This is definitely not an MVP need. But I think node-soda2-parser may be overkill. It makes more sense when trying to convert Socrata/SQL to something like Esri's API, but considering how similar Socrata's API is to vanilla postgres, I think we could probably do the job in like..6 regexes...and combining it into this library simply makes maintenance easier (especially by folks who aren't familiar with the 2 libraries and where one leaves off and the other begins).
I queried ?limit=10
by accident (it should be ?$limit=10
and noticed it threw a lambda/api gateway error. In theory, it should generate WHERE limit = 10
, which should just throw a sql error. Indicates something failed and the error's being lost.
https://data.phila.gov/resource/sspu-uyfa.json?$limit=5&$where=within_circle(shape,39.9642259,-75.1476617,100)
Maps to
https://phl.carto.com/api/v2/sql?q=SELECT *, ST_AsGeoJSON(the_geom)::json AS shape FROM "incidents_part1_part2" WHERE ST_Point_Inside_Circle(the_geom, 39.9642259, -75.1476617, 0.001) LIMIT 5
But the carto version doesn't return any results. It is also much slower for some reason. @andrewbt any idea what's wrong here? Maybe I'm doing a radius search incorrectly?
Socrata uses date_trunc_ym
, date_trunc_ymd
, date_trunc_y
. Postgres expects date_trunc(fieldname, 'month')
Reference: https://dev.socrata.com/docs/functions/#,
Note this is commonly used in $group
Socrata will rate limit you at a certain point unless you include an API key, so it may be worth registering one.
We should probably make things more fault-tolerant
I downloaded a CSV and got an error about its encoding when trying to open it in csvlook
:
'ascii' codec can't encode character u'\u2026' in position 247: ordinal not in range(128)
Need to try the same thing directly against the carto data to find out if the proxy is changing the character encoding or something
EDIT: verified that same thing happens when downloading from carto directly (example query)
I'm lost on this one. Not sure it's carto's fault, maybe it's csvkit's fault?? Looks like a common python error. @andrewbt maybe you're more familiar with it? I'm using it via
csvlook incidents.csv | less -S
Carto returns { rows: [] }
but socrata users expect [ ]
(not nested). Unfortunately this requires loading the entire response object into memory, parsing it, then serializing it :-/ But I suppose there are larger problems in this world.
Socrata supports downloading a full dataset in CSV form by specifying .csv
in the url.
ex. https://data.phila.gov/api/views/sspu-uyfa/rows.csv?accessType=DOWNLOAD
The geometry columns are downloaded in WKT.
It shouldn't include /api/v2/
, as that's closely tied to the query being generated by this code and will always be the same anyway. Need to change this in handler.js
and server.js
.
I was previously using a query like this to fetch rows from Socrata based on a spatial radius search (500 feet or 152.4 meters).
That now gives an error: column \"shape\" does not exist
Per a suggestion by @awm33 I tried replacing the field name shape
with location
, which appears to be the soda-carto
default for spatial columns.
I'm now getting an error column \"location\" does not exist
.
@andrewbt , I was told you might have some insights into how this works. Would appreciate any tips, thanks!
Example: incidents_part1_part2?dc_dist=24
The table's dc_dist
column is a string, but node-soda2-parser
treats this basic filter's value as a number (code). The result is a postgres query WHERE dc_dist = 24
, which throws an error.
@andrewbt do you think it would be wise to always treat these values as strings? For instance, if dc_dist
where actually an integer field in the table, and we queried as if it were a string (WHERE dc_dist = '24'
), it seems like that would work fine. The only other types I can think of are booleans, floats, and date/times. Can we query those as strings too?
Currently you can rename the geometry field in the datasets.yml
file. We should make this logic more generic, letting you rename any number of fields that way. This will resolve the issue of the crime data's hour
field vs hour_
field in GeoDB2/carto.
Socrata supports full text search via $q=foo bar
. Postgres and carto support it in the following format:
SELECT * FROM crimes_2015_to_oct_2016 WHERE (crimes_2015_to_oct_2016::text ILIKE '%foo%')
Interesting that this one requires the table name in the where clause.
Skipped test added.
#7 allowed us to rename the geometry field from the_geom
to whatever the socrata dataset called it. This affects the names of the fields that are returned, but SELECT
clauses like SELECT convex_hull(shape)
and WHERE
clauses like WHERE within_circle(shape, 39, -75, 50)
will fail because shape
is an unknown field.
The underlying parser fails on where foo is not null
. It works on is null
but the not
throws it off. We knew about this for is not between
and deemed it not important enough. Overlooked is not null
.
Ex: https://data.phila.gov/resource/4t9v-rppq.json?$where=requested_datetime is not null
I called it SODA because that's the name of Socrata's API (Socrata Open Data API), but I think "socrata-carto" would be a little more discoverable and clear. Any thoughts?
Just to make debugging easier, definitely not necessary.
@andrewbt discovered that carto gzips large payloads, and AWS' API Gateway doesn't seem to support that in its http proxy / pass-through. Thought I'd create a separate issue since it's a bit different from simply standing up the pass-through API via cloudformation code.
"Go live" involves flipping data.phila.gov
to this API gateway
Socrata allows geometry fields to be named anything. Carto requires geometry columns be named the_geom
. Perhaps the dataset lookup table should optionally let you specify the name of the geometry column?
We don't need to do ST_AsGeoJSON(the_geom)::json AS the_geom_geojson
when the format being requested is geojson, because carto does that for us under the hood. Technically the geojson response includes the geometry as geojson twice at the moment. This is not that bad (and not an MVP issue); more of an optimization.
There's some logic and decision-making in handler.js
that we'd have a few tests for, ideally. It's more challenging because we have to mock request
using something like proxyquire though.
This definitely isn't MVP though.
In the standard API JSON response, we want geometry represented as GeoJSON. By default it's returned as WKB. To get it in GeoJSON from Postgres, you have to use ST_AsGeoJSON(the_geom)
in your SELECT
clause. This requires explicitly naming all the fields instead of using SELECT *
, or using SELECT *, ST_AsGeoJSON(the_geom)
and getting the geometry value twice. It also requires knowing that there is a geometry field, and what the field name is (at least carto makes this guaranteed).
The way this project currently does the job is by requesting the carto response in GeoJSON format entirely via (&format=geojson). This does the job much easier, and maybe that's the easiest solution. But in the CSV download, we want geometry represented as WKT.
Options seem to be:
A. Explicitly specify format in the SELECT
clause (may result in duplicate geometry columns or a second http request to get the list of fields so they can be selected explicitly)
B. Rely on the format
parameter, and maybe add a field to the SELECT
clause for the CSV download (ST_AsWKT(the_geom)
)
C. Convert formats client-side (between WKB and GeoJSON or WKT)
/cc @andrewbt
Cloudwatch should log the inbound request as well as what it was converted to for debugging and future regression testing.
Per: http://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html
Integration timeout 30 seconds for all integration types, including Lambda, Lambda proxy, HTTP, HTTP proxy, and AWS integrations.
For very large "SELECT *" queries on large CARTO files, we could hit this timeout. Also, for extremely complex SQL.
@timwis what are your largest datasets? For the crimes one and others, I'm thinking links through the proxy like data.phila.gov/api/v2/sql?q=select * from crimes
as the "official export links" on opendataphilly.org or elsewhere might not actually work because of this timeout.
https://dev.socrata.com/docs/queries/limit.html
Otherwise querying an endpoint with no specified limit takes a really long while (and may even time out lambda)
Socrata supports exporting a full dataset in GeoJSON format by specifying .geojson
in the URL.
ex. https://data.phila.gov/views/sspu-uyfa/rows.geojson
The response headers look like CORS is enabled, but serverless docs say you need to explicitly enable it. Would be nice to test it in a simple JS app. Although I suppose it can't hurt to explicitly enable it anyway.
Identified by @andrewbt in #43
Socrata: https://dev.socrata.com/docs/functions/within_box.html
PostGIS: http://postgis.net/docs/ST_MakeEnvelope.html
Not sure what the correct order should be though. Socrata docs don't use the terms xmin/ymax/etc. Any suggestions?
EDIT: Test is addressed by #48
A way to map Socrata dataset IDs (ie s98w-oiup
) to carto table names (ie. phl.crime_incidents
) so that incoming queries can reference the socrata ID but pull from a carto table.
I'm thinking a YAML file may be the best option for this.
ie. $where=date between '2015-01-10T12:00:00' and '2015-01-10T14:00:00'
https://dev.socrata.com/docs/functions/between.html
How is this written in postgres?
and ensure we have Lambda functions for handling them.
/resource/xxxx-xxxx.json
(primary API endpoint)/resource/xxxx-xxxx.geojson
/api/views/xxxx-xxxx/rows.csv?accessType=DOWNLOAD
(all CSV endpoints are like/views/xxxx-xxxx/rows.json
(only 6 endpoints advertised this way)/views/xxxx-xxxx/rows.geojson
/view/yyyy-yyyy
(DataLens page - note 4x4 is different from dataset id)5 different endpoints for Socrata URLS
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.