urbanccd-uchicago / plenario
API for geospatial and time aggregation across multiple open datasets.
Home Page: http://plenar.io
License: MIT License
It seems a user can download two different things on the /explore page after running a query: point-level data for individual datasets, and a data matrix with time-aggregated data for all datasets the query returned. Two issues with this:
We may want to broaden these options by letting a user check off which datasets they want and then download either the point-level or time-aggregated data, the former as a .zip of CSVs, and the latter as a data matrix containing data from the selected datasets.
I don't think it's supposed to be there, and it is not in the documentation
Possible spatial resolutions:
Example
http://base_url/api/dataset-id/?spatial_resolution=city-block
While working on the SF version, I've noticed that if there is no datapoint in an aggregation unit (say, no crimes on a specific day), the corresponding object is simply missing from the response.
This can be OK if the client expects a "sparse" representation of the response and takes care of filling in the missing values with zeros, but in general we might want to return a response that already contains the zero values.
Moreover, the missing values mean something different for types of data that have an interval lifetime, such as a shape representing a park, where each returned date-value pair could represent a point at which "something has changed."
I wrote some code to fill in the missing values with zeros, and I can wrap it up in a pull request if needed. If I'm overlooking something, or you believe this is not a backend problem (e.g. the client will take care of it), feel free to close this issue.
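For what it's worth, a minimal sketch of the zero-filling approach described above; the row format ("date"/"count" keys) is an assumption, not Plenario's actual response schema:

from datetime import date, timedelta

def fill_missing_days(sparse_rows, start, end):
    # Densify a sparse daily aggregation: any day in [start, end]
    # with no record gets an explicit zero count.
    counts = {row["date"]: row["count"] for row in sparse_rows}
    out = []
    day = start
    while day <= end:
        out.append({"date": day, "count": counts.get(day, 0)})
        day += timedelta(days=1)
    return out

# Two observed days inside a five-day window; the gaps come back as zeros.
rows = [{"date": date(2014, 6, 1), "count": 3},
        {"date": date(2014, 6, 4), "count": 1}]
print(fill_missing_days(rows, date(2014, 6, 1), date(2014, 6, 5)))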
From Brett: we need to think about how to log API calls we receive, if there isn't yet anything in place. Initial thoughts are to log the actual call, but not information about the submitter (IP, location, etc). This would enable us to see what people are accessing, how much data (and how frequently) they're accessing, and whether there are issues that crop up from API calls.
No rush on this, but I imagine we should have something in place by end of summer.
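As a starting point, a minimal sketch using Flask's after_request hook, assuming we log to a local file; the log format and filename are placeholders:

import logging
from flask import Flask, request

app = Flask(__name__)
logger = logging.getLogger("api_calls")
logging.basicConfig(filename="api_calls.log", level=logging.INFO)

@app.after_request
def log_api_call(response):
    # Log only the call itself (method, path, query string) and the
    # outcome, deliberately omitting submitter details such as IP.
    logger.info("%s %s -> %s", request.method, request.full_path,
                response.status_code)
    return response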
Requests to the /api/master/ endpoint time out or return 502 or 404. Requests to specific dataset endpoints, like /api/chicago_business_licenses/, also time out. This occurs whether we add query parameters or not. The sample queries at http://wopr.datamade.us/ therefore do not work.
This will make it easier to:
Does the API support multipolygons in a single request? I know @apanella's shapefile code can import multipolygons.
Hitting /api/master/ with no query parameters returns all records in the system, which hangs the server. We should respond with an error suggesting query parameters, or limit the response length.
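A sketch of what that guard could look like in Flask; MAX_ROWS and run_query are hypothetical stand-ins for whatever cap and query builder we settle on:

from flask import Flask, jsonify, request

app = Flask(__name__)
MAX_ROWS = 1000  # hypothetical cap on response length

def run_query(args, limit):
    # Stand-in for the real query builder; returns at most `limit` rows.
    return {"objects": [], "limit": limit, "filters": dict(args)}

@app.route("/api/master/")
def master():
    if not request.args:
        resp = jsonify(status="error",
                       message="Please supply at least one query parameter, "
                               "e.g. date__gte, date__lte, or a spatial filter.")
        resp.status_code = 400
        return resp
    return jsonify(run_query(request.args, limit=MAX_ROWS))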
If anything does, it should be U of C alone as that is named in the grant. We can discuss more on Tuesday
It would seem that I have not yet mastered the art of parsing latitude and longitude information out of a socrata "location" field. Should probably fix this.
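A best-effort sketch; it assumes the location field arrives either as a dict with latitude/longitude keys or as a "(lat, lon)" string, which should be verified against actual Socrata payloads:

def parse_socrata_location(loc):
    # Dict form: {"latitude": "41.87", "longitude": "-87.62", ...}
    if isinstance(loc, dict):
        return float(loc["latitude"]), float(loc["longitude"])
    # String form: "(41.87, -87.62)"
    lat, lon = loc.strip("() ").split(",")
    return float(lat), float(lon)

print(parse_socrata_location({"latitude": "41.878114", "longitude": "-87.629798"}))
print(parse_socrata_location("(41.878114, -87.629798)"))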
Should take you back to /explore without any query parameters.
It would be great to add tags to datasets, e.g. "Chicago", "crime", "infrastructure" for when there are hundreds or thousands of datasets and the user doesn't care about most of them. Not sure if this should be only an admin permission - probably login is enough.
@derekeder and I were talking about making the queries against the API a bit more sensible by ensuring that certain field names in the database only exist in the master table. The only three that I can think of right now would be obs_date, geom, and dataset_name. This would make it so that I would not need to prefix the field names with something in order to ensure that the query was built in a sane way.
@svetlozarn Does this seem like an achievable goal?
Specify:
lat: latitude
lon: longitude
radius: in meters
Example
http://base_url/api/dataset-id/?lat=41.878114&lon=-87.629798&radius=100
Specify a GeoJSON polygon:
{
"coordinates": [
[
[
-87.66865611076355,
42.00809838577665
],
[
-87.66855955123901,
42.004662333308616
],
[
-87.66045928001404,
42.004869617835695
],
[
-87.66071677207947,
42.00953334115145
],
[
-87.6644504070282,
42.01010731423809
],
[
-87.66865611076355,
42.00809838577665
]
]
],
"type": "Polygon"
}
which gets stringified and appended as a query parameter:
http://base_url/api/dataset-id/?location__geoWithin=%7B%22type%22%3A%22Polygon%22%2C%22coordinates%22%3A%5B%5B%5B-87.66865611076355%2C42.00809838577665%5D%2C%5B-87.66855955123901%2C42.004662333308616%5D%2C%5B-87.66045928001404%2C42.004869617835695%5D%2C%5B-87.66071677207947%2C42.00953334115145%5D%2C%5B-87.6644504070282%2C42.01010731423809%5D%2C%5B-87.66865611076355%2C42.00809838577665%5D%5D%5D%7D&date__lte=1369285199&date__gte=1368594000&type=violent%2Cproperty&_=1369866788554
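For reference, a client-side sketch of producing that parameter (Python 3 standard library; base_url is a placeholder):

import json
from urllib.parse import urlencode

polygon = {
    "type": "Polygon",
    "coordinates": [[[-87.66865611076355, 42.00809838577665],
                     [-87.66855955123901, 42.004662333308616],
                     [-87.66045928001404, 42.004869617835695],
                     [-87.66071677207947, 42.00953334115145],
                     [-87.6644504070282, 42.01010731423809],
                     [-87.66865611076355, 42.00809838577665]]],
}

# json.dumps stringifies the polygon; urlencode percent-encodes it.
params = {"location__geoWithin": json.dumps(polygon, separators=(",", ":"))}
print("http://base_url/api/dataset-id/?" + urlencode(params))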
Now that we're updating datasets automatically with SQLAlchemy, can we populate the obs_from and obs_to fields in the metadata table?
It would seem that this is a bit more nuanced than it used to be. This seems to be the best reference on how to get that going:
Specify a time range for the data.
For example, to find all crimes reported between May 23, 2012 and June 25, 2012, each date would be formatted as a UNIX timestamp:
http://base_url/api/dataset-id/?date__lte=1340582400&date__gte=1337731200
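A sketch of producing those timestamps (midnight UTC on each boundary date; adjust if the API expects local time):

import calendar
from datetime import datetime

start = calendar.timegm(datetime(2012, 5, 23).timetuple())  # 1337731200
end = calendar.timegm(datetime(2012, 6, 25).timetuple())    # 1340582400
print("http://base_url/api/dataset-id/?date__lte=%d&date__gte=%d" % (end, start))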
I checked Tree Debris and Tree Trims and both had only one observation in this query.
Might be worth revisiting #31 - in addition to issues like this, some of the sparklines can be misleading when the data is sparse.
Each geospatial area of the city should include metadata about the available data. This data should use the schema of http://schema.org/Dataset
The metadata must include the following fields:
name
description
url: endpoint to the data for this area
isBasedOnUrl: source url, e.g. https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
temporal: the range of temporal applicability of the dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)
spatial: the range of spatial applicability of the dataset, e.g. for a dataset of New York weather, the state of New York
frequency: frequency with which the dataset is published (schema.org accrualPeriodicity)
In addition to these schema.org standards, the metadata must also include:
temporal-resolutions: list of available temporal resolutions {continuous, minute, hour, day, week, month, year}
spatial-resolutions: list of available spatial resolutions {point, street segment, grid, building, parcel, city-block, census-block, tract, beat, community area, etc.}
A hypothetical example record is sketched after the API-key note below.
Pretty standard - don't let someone make a bazillion calls or download a bazillion rows without an API key so we know who they are. We should also give them an API key after registration, when we have a /register page up.
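To make the metadata schema above concrete, here is a hypothetical example record; all field values are illustrative, not taken from a real dataset:

{
  "name": "Crimes - 2001 to present",
  "description": "Reported incidents of crime in the City of Chicago.",
  "url": "http://base_url/api/chicago-crimes/",
  "isBasedOnUrl": "https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2",
  "temporal": "2001-01-01/2014-06-30",
  "spatial": "Chicago, IL",
  "frequency": "daily",
  "temporal-resolutions": ["hour", "day", "week", "month", "year"],
  "spatial-resolutions": ["point", "grid", "census-block", "tract", "community area"]
}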
Add datatables.net sorting to the view datasets page.
For now, we don't really have much code. But we should probably start testing it once we do.
Relevant: http://flask.pocoo.org/docs/testing/
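A minimal starting point using Flask's built-in test client; the import path and the expected 400 are assumptions about where the app object lives and how the /api/master/ guard should behave:

import unittest
from app import app  # hypothetical import path for the Flask app

class ApiTestCase(unittest.TestCase):
    def setUp(self):
        app.config["TESTING"] = True
        self.client = app.test_client()

    def test_master_requires_query_params(self):
        # Per the /api/master/ issue above: bare requests should get a 400.
        resp = self.client.get("/api/master/")
        self.assertEqual(resp.status_code, 400)

if __name__ == "__main__":
    unittest.main()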
@derekeder and I have been talking a lot about what qualities a useful data portal might have. One thing that keeps coming up is bulk downloads. While it is nice to have beautiful visualizations that give you a way to explore what's available, in the end you're probably going to just want a csv file that you can use for your own analysis or visualizations. APIs are great but I'm thinking a bare minimum you should have before getting to that point is bulk downloads.
But since in our case that would basically mean redirecting people to Socrata, what unique thing can we offer? Well, maybe an obvious thing to build would be a way to select a box in space and time and get a zip file full of CSVs containing all the stuff we have in that space and time. To me, that would already be a super useful and powerful tool that researchers, journalists, and developers would be pretty happy with. This is mainly based upon discussions I've had with developers and journalists in the open government space who have a few skills when it comes to exploring a spreadsheet and are interested in getting as close to the source data as possible.
Any thoughts?
Some time spans like Graffiti Removal span back to 1926 only because of apparently unclean data. There are 19 entries on 2/5/1926 that were all closed 6/18/2014 and 8 on 1/1/1927 that were all closed 6/27/2014. The next entry is in 2004. This means Graffiti Removal has a misleading "89 years" of data. If we can't clean the data, is it possible to restrict the timespan only to periods when the data came in relatively consistently, like at least one record every 2 years?
Tree Trims has similar issues.
Another option would be to run all 311 data through a script that removes records whose completion_date is very long after their creation_date, and then reinsert.
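A sketch of that cleanup pass with pandas; the column names and the five-year threshold are guesses to be checked against the actual 311 extracts:

from datetime import timedelta
import pandas as pd

MAX_LAG = timedelta(days=365 * 5)  # hypothetical threshold, tune as needed

df = pd.read_csv("graffiti_removal.csv",
                 parse_dates=["creation_date", "completion_date"])
# Keep rows whose completion follows creation within a plausible window.
clean = df[(df["completion_date"] - df["creation_date"]) <= MAX_LAG]
clean.to_csv("graffiti_removal_clean.csv", index=False)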
Similar to #32, we might want to do something other than return all data from all datasets over the time period specified. I don't see the point of a visual query without a spatial parameter, especially given the load it could cause. If the user wants the whole world they can zoom out and select the whole world.
This is especially important because a new user might load the page, skip the instructions, and not realize there are tools to draw on the map. We could even just open the "Instructions" box when they try to submit such a query, as a gentle reminder.
Here is some detail about how @meemking and I envision using weather data in Plenario.
Perhaps the most important use case is the ability to use weather data to filter observations in other datasets - e.g., if the user only cares about murders that occurred when the sun was shining and it was at least 95 degrees out. One way to implement this would be to designate which fields in a dataset can be filtered on, perhaps using a flag on the column when the dataset is first imported, and to give the user the ability to add filters on the /explore page.
For instance, imagine that under the "Aggregate by:" select field there was a button to add a filter, which then displays a select field to select a database, which then displays a select field to choose an attribute from that database, among a pre-selected list of attributes. Then it displays buttons or fields to let the user say things like "=cloudy", ">=95", "!=0". The user could then add more filters in the same way, which would be joined together under a single WHERE ... AND ... clause. The final query would filter out points whose nearest weather observation (in space and time) does not meet those conditions.
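A sketch of how those filters might be assembled with SQLAlchemy (0.9-era syntax); the tables, columns, and the nearest_weather_id linkage are all hypothetical:

from sqlalchemy import (MetaData, Table, Column, Integer, Float, String,
                        select, and_)

meta = MetaData()
weather = Table("weather_observations", meta,
                Column("id", Integer, primary_key=True),
                Column("sky_cover", String),
                Column("temp_f", Float))
points = Table("dataset_points", meta,
               Column("id", Integer, primary_key=True),
               Column("nearest_weather_id", Integer))

# User-added filters, joined under a single WHERE ... AND ... clause.
filters = [weather.c.sky_cover == "cloudy", weather.c.temp_f >= 95]
query = (select([points])
         .select_from(points.join(
             weather, points.c.nearest_weather_id == weather.c.id))
         .where(and_(*filters)))
print(query)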
Obviously, this can only be done with datasets that are guaranteed to have complete spatial and temporal coverage, which is where Matt's imputation script can be particularly helpful.
It would also be great to have a "day/night" filter which could even use dawn/dusk time information from weather.
Further functions and use cases below, which may be things we can start implementing sooner without major overhauls.
Feel free to use this as a starting point, let us know what will be easy/difficult to implement, and what questions you have. This is only meant to get us to think more broadly about how to incorporate sensor data and how to perform queries on multiple tables.
Admins may want to change the update frequency, title, columns, etc. Add a view to do this.
perhaps one that links to the API docs and data explorer, if we don't know what the user did wrong
Supported temporal resolutions:
Example
http://base_url/api/dataset-id/?temporal_resolution=minute
Partly because I apparently mistyped mine.
Right now, I'm using the type inference stuff in csvkit to figure out the data types for the columns. This is great for smaller datasets but when we get into the larger ones, it gets a bit hairy.
Maybe there is a way to leverage the type inference inside pandas, as described here: http://stackoverflow.com/questions/15555005/get-inferred-dataframe-types-iteratively-using-chunksize, to make sure that we're not ending up with processes getting killed by the OS because they are eating too much RAM.
Another approach would be to more cleverly iterate the incoming csv file so that it's not all read into memory at once and then use an approach such as the one in csvkit.
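A sketch of the chunked pandas approach; the chunk size and the dtype-merging rule are assumptions:

import numpy as np
import pandas as pd

def infer_types(path, chunksize=50000):
    # Infer dtypes without loading the whole CSV into memory, by folding
    # together the dtypes pandas infers for each chunk. On disagreement
    # between chunks we fall back to object; a finer merge could promote
    # int64 + float64 to float64 instead.
    dtypes = None
    for chunk in pd.read_csv(path, chunksize=chunksize):
        if dtypes is None:
            dtypes = chunk.dtypes
        else:
            dtypes = dtypes.combine(
                chunk.dtypes,
                lambda a, b: a if a == b else np.dtype("object"))
    return dtypes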
This is a discussion we should all be part of - what ideas do we have for plenario's branding?
For example, take the icon - I put a working one up as a placeholder but it may be a good candidate. It matches our current color (red?) and brings in a few good connotations:
Not tied to this at all, it took me 10 minutes to make and is intended as a placeholder.
What do you all think, regarding both logo but also the other branding elements mentioned above? The U Chicago team has zero branding expertise (see "WOPR") and we're happy to listen to any and all ideas.
Looks like this is a feature, not a bug - feel free to close if so.
I started playing around with how this might work today. I implemented a line string drawing tool on both the wopr.datamade.us/map/ and wopr.datamade.us/grid/ views.
For the moment on the /map/ view, it defaults to using a 100m buffer on either side of the line that you draw. On the grid view, you're able to pick a buffer (once you draw a line). The spatial aggregate is a 50m box for any buffer (for the moment) but it could be anything (figuring out how to make it work is mainly a UI problem). Here's what that kinda looks like:
This shows crime between June of last year and May of this year within a 300m buffer on either side of a corridor along Fullerton, Western and North aggregated to 50m boxes.
Try it out: http://wopr.datamade.us/grid/
Combine the start/end dates into one field and use the awesome date range picker
I created a new repo for the wopr-etl code:
Unauthenticated users can navigate to this page.
Define a set of finer nested grid resolutions, similar in spirit to tile maps. See Tile Map Service specification.
relates to #4
One of the limitations for having high spatial resolution is that, with a uniform grid, higher resolutions dramatically increase the number of polygons the client has to render.
One possible solution is to use quad trees, to have adaptively higher resolution where the action is and low resolution where there's little going on.
We'd set a minimum cell size, and a maximum cell count.
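A minimal sketch of that adaptive subdivision; max_count and min_size are placeholder thresholds, not agreed-upon values:

def quadtree_cells(points, x0, y0, x1, y1, max_count=50, min_size=0.001):
    # Recursively split any cell holding more than max_count points,
    # stopping once cells shrink to min_size. Returns (bbox, count) pairs.
    inside = [(x, y) for (x, y) in points if x0 <= x < x1 and y0 <= y < y1]
    too_small = (x1 - x0) <= min_size or (y1 - y0) <= min_size
    if len(inside) <= max_count or too_small:
        return [((x0, y0, x1, y1), len(inside))]
    mx, my = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    cells = []
    for (a, b, c, d) in [(x0, y0, mx, my), (mx, y0, x1, my),
                         (x0, my, mx, y1), (mx, my, x1, y1)]:
        cells.extend(quadtree_cells(inside, a, b, c, d, max_count, min_size))
    return cells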