urbanccd-uchicago / plenario
API for geospatial and time aggregation across multiple open datasets.
Home Page: http://plenar.io
License: MIT License
It seems a user can download two different things on the /explore page after running a query: point-level data for individual datasets, and a data matrix with time-aggregated data for all datasets the query returned. Two issues with this:
We may want to broaden these options by letting a user check off which datasets they want and then download either the point-level or time-aggregated data, the former as a .zip of CSVs, and the latter as a data matrix containing data from the selected datasets.
I don't think it's supposed to be there, and it is not in the documentation
Possible spatial resolutions:
Example
http://base_url/api/dataset-id/?spatial_resolution=city-block
While working on the SF version, I've noticed that if there is no datapoint in an aggregation unit (say, no crimes on a specific day), the corresponding object is simply missing from the response.
This can be OK if the client expects a "sparse" representation of the response and takes care of filling in the missing values with zeros, but in general we might want to return a response that already contains the zero values.
Moreover, the missing values mean something different for types of data that have an interval lifetime, such as a shape representing a park, where each returned date-value pair could represent a point at which "something has changed."
I wrote some code to fill in the missing values with zeros, and I can wrap it up in a pull request if needed. If I'm overlooking something, or you believe this is not a backend problem (e.g. the client will take care of it), feel free to close this issue.
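For what it's worth, a minimal sketch of the zero-filling approach described above; the row format ("date"/"count" keys) is an assumption, not Plenario's actual response schema:

from datetime import date, timedelta

def fill_missing_days(sparse_rows, start, end):
    # Densify a sparse daily aggregation: any day in [start, end]
    # with no record gets an explicit zero count.
    counts = {row["date"]: row["count"] for row in sparse_rows}
    out = []
    day = start
    while day <= end:
        out.append({"date": day, "count": counts.get(day, 0)})
        day += timedelta(days=1)
    return out

# Two observed days inside a five-day window; the gaps come back as zeros.
rows = [{"date": date(2014, 6, 1), "count": 3},
        {"date": date(2014, 6, 4), "count": 1}]
print(fill_missing_days(rows, date(2014, 6, 1), date(2014, 6, 5)))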
From Brett: we need to think about how to log API calls we receive, if there isn't yet anything in place. Initial thoughts are to log the actual call, but not information about the submitter (IP, location, etc). This would enable us to see what people are accessing, how much data (and how frequently) they're accessing, and whether there are issues that crop up from API calls.
No rush on this, but I imagine we should have something in place by end of summer.
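As a starting point, a minimal sketch using Flask's after_request hook, assuming we log to a local file; the log format and filename are placeholders:

import logging
from flask import Flask, request

app = Flask(__name__)
logger = logging.getLogger("api_calls")
logging.basicConfig(filename="api_calls.log", level=logging.INFO)

@app.after_request
def log_api_call(response):
    # Log only the call itself (method, path, query string) and the
    # outcome, deliberately omitting submitter details such as IP.
    logger.info("%s %s -> %s", request.method, request.full_path,
                response.status_code)
    return response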
Requests to the /api/master/ endpoint time out or return 502 or 404. Requests to specific dataset endpoints, like /api/chicago_business_licenses/, also time out. This occurs whether we add query parameters or not. The sample queries at http://wopr.datamade.us/ therefore do not work.
This will make it easier to:
Does the API support multipolygons in a single request? I know @apanella's shapefile code can import multipolygons.
Hitting /api/master/ with no query parameters returns all records in the system, which hangs the server. We should respond with an error suggesting query parameters, or limit the response length.
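A sketch of what that guard could look like in Flask; MAX_ROWS and run_query are hypothetical stand-ins for whatever cap and query builder we settle on:

from flask import Flask, jsonify, request

app = Flask(__name__)
MAX_ROWS = 1000  # hypothetical cap on response length

def run_query(args, limit):
    # Stand-in for the real query builder; returns at most `limit` rows.
    return {"objects": [], "limit": limit, "filters": dict(args)}

@app.route("/api/master/")
def master():
    if not request.args:
        resp = jsonify(status="error",
                       message="Please supply at least one query parameter, "
                               "e.g. date__gte, date__lte, or a spatial filter.")
        resp.status_code = 400
        return resp
    return jsonify(run_query(request.args, limit=MAX_ROWS))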
If anything does, it should be U of C alone as that is named in the grant. We can discuss more on Tuesday
It would seem that I have not yet mastered the art of parsing latitude and longitude information out of a socrata "location" field. Should probably fix this.
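A best-effort sketch; it assumes the location field arrives either as a dict with latitude/longitude keys or as a "(lat, lon)" string, which should be verified against actual Socrata payloads:

def parse_socrata_location(loc):
    # Dict form: {"latitude": "41.87", "longitude": "-87.62", ...}
    if isinstance(loc, dict):
        return float(loc["latitude"]), float(loc["longitude"])
    # String form: "(41.87, -87.62)"
    lat, lon = loc.strip("() ").split(",")
    return float(lat), float(lon)

print(parse_socrata_location({"latitude": "41.878114", "longitude": "-87.629798"}))
print(parse_socrata_location("(41.878114, -87.629798)"))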
Should take you back to /explore without any query parameters.
It would be great to add tags to datasets, e.g. "Chicago", "crime", "infrastructure" for when there are hundreds or thousands of datasets and the user doesn't care about most of them. Not sure if this should be only an admin permission - probably login is enough.
@derekeder and I were talking about making the queries against the API a bit more sensible by ensuring that certain field names in the database only exist in the master table. The only three that I can think of right now would be obs_date, geom, and dataset_name. This would make it so that I would not need to prefix the field names with something in order to ensure that the query was built in a sane way.
@svetlozarn Does this seem like an achievable goal?
Specify:
lat: latitude
lon: longitude
radius: in meters
Example
http://base_url/api/dataset-id/?lat=41.878114&lon=-87.629798&radius=100
Specify a GeoJSON polygon:
{
"coordinates": [
[
[
-87.66865611076355,
42.00809838577665
],
[
-87.66855955123901,
42.004662333308616
],
[
-87.66045928001404,
42.004869617835695
],
[
-87.66071677207947,
42.00953334115145
],
[
-87.6644504070282,
42.01010731423809
],
[
-87.66865611076355,
42.00809838577665
]
]
],
"type": "Polygon"
}
which gets stringified and appended as a query parameter:
http://base_url/api/dataset-id/?location__geoWithin=%7B%22type%22%3A%22Polygon%22%2C%22coordinates%22%3A%5B%5B%5B-87.66865611076355%2C42.00809838577665%5D%2C%5B-87.66855955123901%2C42.004662333308616%5D%2C%5B-87.66045928001404%2C42.004869617835695%5D%2C%5B-87.66071677207947%2C42.00953334115145%5D%2C%5B-87.6644504070282%2C42.01010731423809%5D%2C%5B-87.66865611076355%2C42.00809838577665%5D%5D%5D%7D&date__lte=1369285199&date__gte=1368594000&type=violent%2Cproperty&_=1369866788554
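For reference, a client-side sketch of producing that parameter (Python 3 standard library; base_url is a placeholder):

import json
from urllib.parse import urlencode

polygon = {
    "type": "Polygon",
    "coordinates": [[[-87.66865611076355, 42.00809838577665],
                     [-87.66855955123901, 42.004662333308616],
                     [-87.66045928001404, 42.004869617835695],
                     [-87.66071677207947, 42.00953334115145],
                     [-87.6644504070282, 42.01010731423809],
                     [-87.66865611076355, 42.00809838577665]]],
}

# json.dumps stringifies the polygon; urlencode percent-encodes it.
params = {"location__geoWithin": json.dumps(polygon, separators=(",", ":"))}
print("http://base_url/api/dataset-id/?" + urlencode(params))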
Now that we're updating datasets automatically with SQLAlchemy, can we populate the obs_from and obs_to fields in the metadata table?
It would seem that this is a bit more nuanced than it used to be. This seems to be the best reference on how to get that going:
Specify a time range for the data.
For example, to find all crimes reported between May 23, 2012 and June 25, 2012, each date would be formatted as a UNIX timestamp:
http://base_url/api/dataset-id/?date__lte=1340582400&date__gte=1337731200
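A sketch of producing those timestamps (midnight UTC on each boundary date; adjust if the API expects local time):

import calendar
from datetime import datetime

start = calendar.timegm(datetime(2012, 5, 23).timetuple())  # 1337731200
end = calendar.timegm(datetime(2012, 6, 25).timetuple())    # 1340582400
print("http://base_url/api/dataset-id/?date__lte=%d&date__gte=%d" % (end, start))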
I checked Tree Debris and Tree Trims and both had only one observation in this query.
Might be worth revisiting #31 - in addition to issues like this, some of the sparklines can be misleading when the data is sparse.
Each geospatial area of the city should include metadata about the available data. This data should use the schema of http://schema.org/Dataset
The metadata must include the following fields:
name
description
url: endpoint to the data for this area
isBasedOnUrl: source url, e.g. https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
temporal: the range of temporal applicability of the dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)
spatial: the range of spatial applicability of the dataset, e.g. for a dataset of New York weather, the state of New York
frequency: frequency with which the dataset is published (schema.org accrualPeriodicity)
In addition to these schema.org standards, the metadata must also include:
temporal-resolutions: list of available temporal resolutions {continuous, minute, hour, day, week, month, year}
spatial-resolutions: list of available spatial resolutions {point, street segment, grid, building, parcel, city-block, census-block, tract, beat, community area, etc.}
A hypothetical example record is sketched after the API-key note below.
Pretty standard - don't let someone make a bazillion calls or download a bazillion rows without an API key so we know who they are. We should also give them an API key after registration, when we have a /register page up.
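To make the metadata schema above concrete, here is a hypothetical example record; all field values are illustrative, not taken from a real dataset:

{
  "name": "Crimes - 2001 to present",
  "description": "Reported incidents of crime in the City of Chicago.",
  "url": "http://base_url/api/chicago-crimes/",
  "isBasedOnUrl": "https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2",
  "temporal": "2001-01-01/2014-06-30",
  "spatial": "Chicago, IL",
  "frequency": "daily",
  "temporal-resolutions": ["hour", "day", "week", "month", "year"],
  "spatial-resolutions": ["point", "grid", "census-block", "tract", "community area"]
}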
Add datatables.net sorting to the view datasets page.
For now, we don't really have much code. But we should probably start testing it once we do.
Relevant: http://flask.pocoo.org/docs/testing/
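A minimal starting point using Flask's built-in test client; the import path and the expected 400 are assumptions about where the app object lives and how the /api/master/ guard should behave:

import unittest
from app import app  # hypothetical import path for the Flask app

class ApiTestCase(unittest.TestCase):
    def setUp(self):
        app.config["TESTING"] = True
        self.client = app.test_client()

    def test_master_requires_query_params(self):
        # Per the /api/master/ issue above: bare requests should get a 400.
        resp = self.client.get("/api/master/")
        self.assertEqual(resp.status_code, 400)

if __name__ == "__main__":
    unittest.main()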
@derekeder and I have been talking a lot about what qualities a useful data portal might have. One thing that keeps coming up is bulk downloads. While it is nice to have beautiful visualizations that give you a way to explore what's available, in the end you're probably going to just want a csv file that you can use for your own analysis or visualizations. APIs are great but I'm thinking a bare minimum you should have before getting to that point is bulk downloads.
But since in our case that would basically mean redirecting people to Socrata, what unique thing can we offer? Well, maybe an obvious thing to build would be a way to select a box in space and time and get a zip file full of CSVs containing all the stuff we have in that space and time. To me, that would already be a super useful and powerful tool that researchers, journalists, and developers would be pretty happy with. This is mainly based upon discussions I've had with developers and journalists in the open government space who have a few skills when it comes to exploring a spreadsheet and are interested in getting as close to the source data as possible.
Any thoughts?
Some time spans like Graffiti Removal span back to 1926 only because of apparently unclean data. There are 19 entries on 2/5/1926 that were all closed 6/18/2014 and 8 on 1/1/1927 that were all closed 6/27/2014. The next entry is in 2004. This means Graffiti Removal has a misleading "89 years" of data. If we can't clean the data, is it possible to restrict the timespan only to periods when the data came in relatively consistently, like at least one record every 2 years?
Tree Trims has similar issues.
Another option would be to run all 311 data through a script that removes records whose completion_date is very long after their creation_date, and then reinsert.
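A sketch of that cleanup pass with pandas; the column names and the five-year threshold are guesses to be checked against the actual 311 extracts:

from datetime import timedelta
import pandas as pd

MAX_LAG = timedelta(days=365 * 5)  # hypothetical threshold, tune as needed

df = pd.read_csv("graffiti_removal.csv",
                 parse_dates=["creation_date", "completion_date"])
# Keep rows whose completion follows creation within a plausible window.
clean = df[(df["completion_date"] - df["creation_date"]) <= MAX_LAG]
clean.to_csv("graffiti_removal_clean.csv", index=False)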
Similar to #32, we might want to do something other than return all data from all datasets over the time period specified. I don't see the point of a visual query without a spatial parameter, especially given the load it could cause. If the user wants the whole world they can zoom out and select the whole world.
This is especially important because a new user might load the page, skip the instructions, and not realize there are tools to draw on the map. We could even just open the "Instructions" box when they try to submit such a query, as a gentle reminder.
Here is some detail about how @meemking and I envision using weather data in Plenario.
Perhaps the most important use case is the ability to use weather data to filter observations in other datasets - e.g., if the user only cares about murders that occurred when the sun was shining and it was at least 95 degrees out. One way to implement this would be to designate which fields in a dataset can be filtered on, perhaps using a flag on the column when the dataset is first imported, and to give the user the ability to add filters on the /explore page.
For instance, imagine that under the "Aggregate by:" select field there was a button to add a filter, which then displays a select field to select a database, which then displays a select field to choose an attribute from that database, among a pre-selected list of attributes. Then it displays buttons or fields to let the user say things like "=cloudy", ">=95", "!=0". The user could then add more filters in the same way, which would be joined together under a single WHERE ... AND ... clause. The final query would filter out points whose nearest weather observation (in space and time) does not meet those conditions.
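A sketch of how those filters might be assembled with SQLAlchemy (0.9-era syntax); the tables, columns, and the nearest_weather_id linkage are all hypothetical:

from sqlalchemy import (MetaData, Table, Column, Integer, Float, String,
                        select, and_)

meta = MetaData()
weather = Table("weather_observations", meta,
                Column("id", Integer, primary_key=True),
                Column("sky_cover", String),
                Column("temp_f", Float))
points = Table("dataset_points", meta,
               Column("id", Integer, primary_key=True),
               Column("nearest_weather_id", Integer))

# User-added filters, joined under a single WHERE ... AND ... clause.
filters = [weather.c.sky_cover == "cloudy", weather.c.temp_f >= 95]
query = (select([points])
         .select_from(points.join(
             weather, points.c.nearest_weather_id == weather.c.id))
         .where(and_(*filters)))
print(query)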
Obviously, this can only be done with datasets that are guaranteed to have complete spatial and temporal coverage, which is where Matt's imputation script can be particularly helpful.
It would also be great to have a "day/night" filter which could even use dawn/dusk time information from weather.
Further functions and use cases below, which may be things we can start implementing sooner without major overhauls.
Feel free to use this as a starting point, let us know what will be easy/difficult to implement, and what questions you have. This is only meant to get us to think more broadly about how to incorporate sensor data and how to perform queries on multiple tables.
Admins may want to change the update frequency, title, columns, etc. Add a view to do this.
perhaps one that links to the API docs and data explorer, if we don't know what the user did wrong
Supported temporal resolutions:
Example
http://base_url/api/dataset-id/?temporal_resolution=minute
Partly because I apparently mistyped mine.
Right now, I'm using the type inference stuff in csvkit to figure out the data types for the columns. This is great for smaller datasets but when we get into the larger ones, it gets a bit hairy.
Maybe there is a way to leverage the type inference inside pandas, as described here: http://stackoverflow.com/questions/15555005/get-inferred-dataframe-types-iteratively-using-chunksize, to make sure that we're not ending up with processes getting killed by the OS because they are eating too much RAM.
Another approach would be to more cleverly iterate the incoming csv file so that it's not all read into memory at once and then use an approach such as the one in csvkit.
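A sketch of the chunked pandas approach; the chunk size and the dtype-merging rule are assumptions:

import numpy as np
import pandas as pd

def infer_types(path, chunksize=50000):
    # Infer dtypes without loading the whole CSV into memory, by folding
    # together the dtypes pandas infers for each chunk. On disagreement
    # between chunks we fall back to object; a finer merge could promote
    # int64 + float64 to float64 instead.
    dtypes = None
    for chunk in pd.read_csv(path, chunksize=chunksize):
        if dtypes is None:
            dtypes = chunk.dtypes
        else:
            dtypes = dtypes.combine(
                chunk.dtypes,
                lambda a, b: a if a == b else np.dtype("object"))
    return dtypes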
This is a discussion we should all be part of - what ideas do we have for plenario's branding?
For example, take the icon - I put a working one up as a placeholder but it may be a good candidate. It matches our current color (red?) and brings in a few good connotations:
Not tied to this at all, it took me 10 minutes to make and is intended as a placeholder.
What do you all think, regarding both logo but also the other branding elements mentioned above? The U Chicago team has zero branding expertise (see "WOPR") and we're happy to listen to any and all ideas.
Looks like this is a feature, not a bug - feel free to close if so.
I started playing around with how this might work today. I implemented a line string drawing tool on both the wopr.datamade.us/map/ and wopr.datamade.us/grid/ views.
For the moment on the /map/ view, it defaults to using a 100m buffer on either side of the line that you draw. On the grid view, you're able to pick a buffer (once you draw a line). The spatial aggregate is a 50m box for any buffer (for the moment) but it could be anything (figuring out how to make it work is mainly a UI problem). Here's what that kinda looks like:
This shows crime between June of last year and May of this year within a 300m buffer on either side of a corridor along Fullerton, Western and North aggregated to 50m boxes.
Try it out: http://wopr.datamade.us/grid/
Combine the start/end dates into one field and use the awesome date range picker
I created a new repo for the wopr-etl code:
Unauthenticated users can navigate to this page.
Define a set of finer nested grid resolutions, similar in spirit to tile maps. See Tile Map Service specification.
relates to #4
One of the limitations for having high spatial resolution is that, with a uniform grid, higher resolutions dramatically increase the number of polygons the client has to render.
One possible solution is to use quad trees, to have adaptively higher resolution where the action is and low resolution where there's little going on.
We'd set a minimum cell size, and a maximum cell count.
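A minimal sketch of that adaptive subdivision; max_count and min_size are placeholder thresholds, not agreed-upon values:

def quadtree_cells(points, x0, y0, x1, y1, max_count=50, min_size=0.001):
    # Recursively split any cell holding more than max_count points,
    # stopping once cells shrink to min_size. Returns (bbox, count) pairs.
    inside = [(x, y) for (x, y) in points if x0 <= x < x1 and y0 <= y < y1]
    too_small = (x1 - x0) <= min_size or (y1 - y0) <= min_size
    if len(inside) <= max_count or too_small:
        return [((x0, y0, x1, y1), len(inside))]
    mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    cells = []
    for (a, b, c, d) in [(x0, y0, mx, my), (mx, y0, x1, my),
                         (x0, my, mx, y1), (mx, my, x1, y1)]:
        cells.extend(quadtree_cells(inside, a, b, c, d, max_count, min_size))
    return cells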