Code Monkey home page Code Monkey logo

soda-carto's Issues

Limit syntax

I think node-soda2-parser is stringifying the LIMIT clause as LIMIT 0, 10 instead of LIMIT 10 OFFSET 0 because carto's throwing an error about that.

EDIT: found the source

Adding skipped test.

$query parameter

Socrata supports writing full SoQL queries via the $query parameter, similar to carto's ?q parameter but without the FROM.

https://dev.socrata.com/docs/queries/query.html

We might just call this one 'not supported', though I imagine we could add support for it if needed.

EDIT: Although, interestingly, this feature supports "sub queries" (see docs)

Research & document how to deploy to production

Once we're finished, we'll need to tell serverless to deploy to production (it defaults to dev). Pretty sure this is just a command line flag; just want to leave a task for it.

  • Enable CloudWatch Logs (INFO level)

Make test suite compliant with PostGIS

Roadmap: merge node-soda2-parser into this library

This is definitely not an MVP need. But I think node-soda2-parser may be overkill. It makes more sense when trying to convert Socrata/SQL to something like Esri's API, but considering how similar Socrata's API is to vanilla postgres, I think we could probably do the job in like..6 regexes...and combining it into this library simply makes maintenance easier (especially by folks who aren't familiar with the 2 libraries and where one leaves off and the other begins).

?limit=10 throws error

I queried ?limit=10 by accident (it should be ?$limit=10 and noticed it threw a lambda/api gateway error. In theory, it should generate WHERE limit = 10, which should just throw a sql error. Indicates something failed and the error's being lost.

within_circle function not working as expected against carto

https://data.phila.gov/resource/sspu-uyfa.json?$limit=5&$where=within_circle(shape,39.9642259,-75.1476617,100)

Maps to

https://phl.carto.com/api/v2/sql?q=SELECT *, ST_AsGeoJSON(the_geom)::json AS shape FROM "incidents_part1_part2" WHERE ST_Point_Inside_Circle(the_geom, 39.9642259, -75.1476617, 0.001) LIMIT 5

But the carto version doesn't return any results. It is also much slower for some reason. @andrewbt any idea what's wrong here? Maybe I'm doing a radius search incorrectly?

Use a Socrata API key

Socrata will rate limit you at a certain point unless you include an API key, so it may be worth registering one.

Improve docs

  • Add more detail generally (goal of project, how it works)
  • Add serverless info

Verify CSV encoding

I downloaded a CSV and got an error about its encoding when trying to open it in csvlook:

'ascii' codec can't encode character u'\u2026' in position 247: ordinal not in range(128)

Need to try the same thing directly against the carto data to find out if the proxy is changing the character encoding or something

EDIT: verified that same thing happens when downloading from carto directly (example query)

I'm lost on this one. Not sure it's carto's fault, maybe it's csvkit's fault?? Looks like a common python error. @andrewbt maybe you're more familiar with it? I'm using it via

csvlook incidents.csv | less -S

Process response

Carto returns { rows: [] } but socrata users expect [ ] (not nested). Unfortunately this requires loading the entire response object into memory, parsing it, then serializing it :-/ But I suppose there are larger problems in this world.

CSV download

Socrata supports downloading a full dataset in CSV form by specifying .csv in the url.

ex. https://data.phila.gov/api/views/sspu-uyfa/rows.csv?accessType=DOWNLOAD

The geometry columns are downloaded in WKT.

Endpoint env var should only be the domain

It shouldn't include /api/v2/, as that's closely tied to the query being generated by this code and will always be the same anyway. Need to change this in handler.js and server.js.

Spatial radius query not working

Basic filters may misinterpret numeric types, which will fail in postgres

Example: incidents_part1_part2?dc_dist=24

The table's dc_dist column is a string, but node-soda2-parser treats this basic filter's value as a number (code). The result is a postgres query WHERE dc_dist = 24, which throws an error.

@andrewbt do you think it would be wise to always treat these values as strings? For instance, if dc_dist where actually an integer field in the table, and we queried as if it were a string (WHERE dc_dist = '24'), it seems like that would work fine. The only other types I can think of are booleans, floats, and date/times. Can we query those as strings too?

Support generic field renaming

Currently you can rename the geometry field in the datasets.yml file. We should make this logic more generic, letting you rename any number of fields that way. This will resolve the issue of the crime data's hour field vs hour_ field in GeoDB2/carto.

full text search $q

Socrata supports full text search via $q=foo bar. Postgres and carto support it in the following format:

SELECT * FROM crimes_2015_to_oct_2016 WHERE (crimes_2015_to_oct_2016::text ILIKE '%foo%')

Interesting that this one requires the table name in the where clause.

Skipped test added.

Geometry field not renamed in SELECT or WHERE

#7 allowed us to rename the geometry field from the_geom to whatever the socrata dataset called it. This affects the names of the fields that are returned, but SELECT clauses like SELECT convex_hull(shape) and WHERE clauses like WHERE within_circle(shape, 39, -75, 50) will fail because shape is an unknown field.

`is not null` fails

The underlying parser fails on where foo is not null. It works on is null but the not throws it off. We knew about this for is not between and deemed it not important enough. Overlooked is not null.

Ex: https://data.phila.gov/resource/4t9v-rppq.json?$where=requested_datetime is not null

Consider renaming "socrata-carto"

I called it SODA because that's the name of Socrata's API (Socrata Open Data API), but I think "socrata-carto" would be a little more discoverable and clear. Any thoughts?

Geometry column name

Socrata allows geometry fields to be named anything. Carto requires geometry columns be named the_geom. Perhaps the dataset lookup table should optionally let you specify the name of the geometry column?

GeoJSON doesn't need to convert geometry format

We don't need to do ST_AsGeoJSON(the_geom)::json AS the_geom_geojson when the format being requested is geojson, because carto does that for us under the hood. Technically the geojson response includes the geometry as geojson twice at the moment. This is not that bad (and not an MVP issue); more of an optimization.

Add tests for handler.js

There's some logic and decision-making in handler.js that we'd have a few tests for, ideally. It's more challenging because we have to mock request using something like proxyquire though.

This definitely isn't MVP though.

Discussion: What format should carto be queried in?

In the standard API JSON response, we want geometry represented as GeoJSON. By default it's returned as WKB. To get it in GeoJSON from Postgres, you have to use ST_AsGeoJSON(the_geom) in your SELECT clause. This requires explicitly naming all the fields instead of using SELECT *, or using SELECT *, ST_AsGeoJSON(the_geom) and getting the geometry value twice. It also requires knowing that there is a geometry field, and what the field name is (at least carto makes this guaranteed).

The way this project currently does the job is by requesting the carto response in GeoJSON format entirely via (&format=geojson). This does the job much easier, and maybe that's the easiest solution. But in the CSV download, we want geometry represented as WKT.

Options seem to be:

A. Explicitly specify format in the SELECT clause (may result in duplicate geometry columns or a second http request to get the list of fields so they can be selected explicitly)
B. Rely on the format parameter, and maybe add a field to the SELECT clause for the CSV download (ST_AsWKT(the_geom))
C. Convert formats client-side (between WKB and GeoJSON or WKT)

/cc @andrewbt

Log inbound requests

Cloudwatch should log the inbound request as well as what it was converted to for debugging and future regression testing.

API Gateway has maximum 30 second timeout

Per: http://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html

Integration timeout 30 seconds for all integration types, including Lambda, Lambda proxy, HTTP, HTTP proxy, and AWS integrations.

For very large "SELECT *" queries on large CARTO files, we could hit this timeout. Also, for extremely complex SQL.

@timwis what are your largest datasets? For the crimes one and others, I'm thinking links through the proxy like data.phila.gov/api/v2/sql?q=select * from crimes as the "official export links" on opendataphilly.org or elsewhere might not actually work because of this timeout.

GeoJSON download

Socrata supports exporting a full dataset in GeoJSON format by specifying .geojson in the URL.

ex. https://data.phila.gov/views/sspu-uyfa/rows.geojson

Verify CORS is enabled

The response headers look like CORS is enabled, but serverless docs say you need to explicitly enable it. Would be nice to test it in a simple JS app. Although I suppose it can't hurt to explicitly enable it anyway.

Dataset lookup table

A way to map Socrata dataset IDs (ie s98w-oiup) to carto table names (ie. phl.crime_incidents) so that incoming queries can reference the socrata ID but pull from a carto table.

I'm thinking a YAML file may be the best option for this.

Account for all Socrata API endpoints

and ensure we have Lambda functions for handling them.

  • /resource/xxxx-xxxx.json (primary API endpoint)
  • /resource/xxxx-xxxx.geojson
  • /api/views/xxxx-xxxx/rows.csv?accessType=DOWNLOAD (all CSV endpoints are like
    this)
  • /views/xxxx-xxxx/rows.json (only 6 endpoints advertised this way)
  • /views/xxxx-xxxx/rows.geojson
  • /view/yyyy-yyyy (DataLens page - note 4x4 is different from dataset id)

5 different endpoints for Socrata URLS

  • 2 for a CSV
  • some other ways for SoQL
  • need to make sure that we are accomodating these API endpoints in the proxy

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.