plenario's Issues

indicate when dataset was last updated

It would be good to know when each dataset was last updated and to display it on the view datasets page.

[screenshot: view datasets page, 2014-08-04]

If something goes wrong with the update, that should be indicated on this page too.

Distinguish between two kinds of downloads

It seems a user can download two different things on the /explore page after running a query: point-level data for individual datasets, and a data matrix with time-aggregated data for all datasets the query returned. Two issues with this:

  • The user may want time-aggregated data for only a few of the datasets
  • The user may want point-level data for more than one dataset

We may want to broaden these options by letting a user check off which datasets they want and then download either the point-level or time-aggregated data, the former as a .zip of CSVs, and the latter as a data matrix containing data from the selected datasets.

Grid is rectangular, not square

Minor issue, but especially prominent at the 5km level. This isn't just the projection, is it? It's unclear whether the spatial resolution is the width or height of the cells.

Spatial resolution

Possible spatial resolutions:

  • point
  • street segment
  • grid
  • building
  • parcel
  • city block
  • census block
  • tract
  • beat
  • community area
  • PUMAs
  • congressional districts

Example

http://base_url/api/dataset-id/?spatial_resolution=city-block

Zero-values in aggregates

While working on the SF version, I've noticed that if there is no datapoint in an aggregation unit (say, no crimes on a specific day), the corresponding object is simply missing from the response.
This can be OK if the client expects a "sparse" representation of the response and takes care of filling in the missing values with zeros, but in general we might want to return a response that already contains the zero values.
Moreover, a "missing value" means something different for types of data that have an interval lifetime, such as a shape representing a park, where each returned date-value pair could represent a point at which "something has changed."

I wrote some code to fill in the "missing values" with zeros and I can wrap it up in a pull request, if needed. If I'm overlooking something or you believe this is not a backend-related problem (e.g. the client will take care of it), feel free to close this issue.
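
A minimal sketch of the zero-filling I have in mind, assuming a daily aggregation keyed by date (function and variable names are illustrative, not the actual code):

from datetime import date, timedelta

def zero_fill(counts, start, end):
    # Densify a sparse {date: count} mapping so every day in
    # [start, end] appears, with missing days filled in as zero.
    filled = {}
    day = start
    while day <= end:
        filled[day] = counts.get(day, 0)
        day += timedelta(days=1)
    return filled

# zero_fill({date(2014, 8, 1): 3}, date(2014, 8, 1), date(2014, 8, 3))
# -> {date(2014, 8, 1): 3, date(2014, 8, 2): 0, date(2014, 8, 3): 0}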

Logging API calls

From Brett: we need to think about how to log API calls we receive, if there isn't yet anything in place. Initial thoughts are to log the actual call, but not information about the submitter (IP, location, etc). This would enable us to see what people are accessing, how much data (and how frequently) they're accessing, and whether there are issues that crop up from API calls.

No rush on this, but I imagine we should have something in place by end of summer.
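
One possible shape for this, assuming the app is Flask-based (the logger name and the /api/ prefix are illustrative):

import logging
from flask import Flask, request

app = Flask(__name__)
api_log = logging.getLogger("plenario.api")

@app.before_request
def log_api_call():
    # Log the call itself (method, path, query string) but nothing
    # about the submitter: no IP address, no location.
    if request.path.startswith("/api/"):
        api_log.info("%s %s", request.method, request.full_path)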

/master and /<dataset> unresponsive

Requests to the /api/master/ endpoint time out or return 502 or 404. Requests to specific datasets, like /api/chicago_business_licenses/, also time out. This happens with or without query parameters, so the sample queries at http://wopr.datamade.us/ do not work.

Allow user to draw multiple shapes on map

This will make it easier to:

  • compare two highways
  • look at Hyde Park and Lakeview
  • draw complicated shapes or shapes with holes in them

Does the API support multipolygons in a single request? I know @apanella's shapefile code can import multipolygons.
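
For reference, GeoJSON can express this as a single MultiPolygon geometry, which wraps any number of polygons in one object (shown here as a Python dict; the coordinates are made-up placeholders):

# A MultiPolygon holds several rings in one geometry, so one request
# could cover, say, two separate highways. Coordinates are illustrative.
multipolygon = {
    "type": "MultiPolygon",
    "coordinates": [
        [[[-87.66, 42.00], [-87.65, 42.00], [-87.65, 42.01], [-87.66, 42.00]]],
        [[[-87.70, 41.95], [-87.69, 41.95], [-87.69, 41.96], [-87.70, 41.95]]],
    ],
}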

feedback when manually updating data

Clicking the Update button on the view datasets page doesn't seem to do anything. It should respond with a message containing details on the status of the manual update.

[screenshot: Update button on the view datasets page, 2014-08-04]

Better location parsing

It would seem that I have not yet mastered the art of parsing latitude and longitude information out of a socrata "location" field. Should probably fix this.
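
For what it's worth, Socrata location columns in CSV exports often end with a "(lat, lon)" pair, sometimes preceded by an address. A sketch of pulling coordinates out under that assumption (the field layout should be checked against the actual export):

import re

# Matches a trailing "(lat, lon)" pair, e.g.
# "123 Main St\nChicago, IL\n(41.878114, -87.629798)".
LOCATION_RE = re.compile(r"\((-?\d+\.?\d*),\s*(-?\d+\.?\d*)\)\s*$")

def parse_location(value):
    match = LOCATION_RE.search(value or "")
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))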

tags for datasets

It would be great to add tags to datasets, e.g. "Chicago", "crime", "infrastructure", for when there are hundreds or thousands of datasets and the user doesn't care about most of them. Not sure if tagging should be admin-only - probably requiring login is enough.

Reserved query parameters

@derekeder and I were talking about making the queries against the API a bit more sensible by ensuring that certain field names in the database only exist in the master table. The only three that I can think of right now would be obs_date, geom and dataset_name. This would make it so that I would not need to prefix the field names with something in order to ensure that the query is built in a sane way.

@svetlozarn Does this seem like an achievable goal?

Location queries

Within a radius of a point.

Specify:

  • lat: latitude
  • lon: longitude
  • radius: radius in meters

Example

http://base_url/api/dataset-id/?lat=41.878114&lon=-87.629798&radius=100

Within a given polygon

Specify a GeoJSON polygon:

{
    "coordinates": [
        [
            [
                -87.66865611076355, 
                42.00809838577665
            ], 
            [
                -87.66855955123901, 
                42.004662333308616
            ], 
            [
                -87.66045928001404, 
                42.004869617835695
            ], 
            [
                -87.66071677207947, 
                42.00953334115145
            ], 
            [
                -87.6644504070282, 
                42.01010731423809
            ], 
            [
                -87.66865611076355, 
                42.00809838577665
            ]
        ]
    ], 
    "type": "Polygon"
}

which gets stringified and appended as a query parameter:

http://base_url/api/dataset-id/?location__geoWithin=%7B%22type%22%3A%22Polygon%22%2C%22coordinates%22%3A%5B%5B%5B-87.66865611076355%2C42.00809838577665%5D%2C%5B-87.66855955123901%2C42.004662333308616%5D%2C%5B-87.66045928001404%2C42.004869617835695%5D%2C%5B-87.66071677207947%2C42.00953334115145%5D%2C%5B-87.6644504070282%2C42.01010731423809%5D%2C%5B-87.66865611076355%2C42.00809838577665%5D%5D%5D%7D&date__lte=1369285199&date__gte=1368594000&type=violent%2Cproperty&_=1369866788554
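
In code, the stringify-and-encode step might look like this (Python sketch; the tiny polygon stands in for the full GeoJSON dict above):

import json
from urllib.parse import quote

# A small stand-in for the polygon shown above.
polygon = {"type": "Polygon", "coordinates": [[[-87.668, 42.008],
                                               [-87.660, 42.004],
                                               [-87.668, 42.008]]]}
encoded = quote(json.dumps(polygon, separators=(",", ":")))
url = "http://base_url/api/dataset-id/?location__geoWithin=" + encoded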

Time range

Specify a time range for the data.

For example, to find all crimes reported between May 23, 2012 and June 25, 2012, the dates would be formatted as UNIX timestamps:

http://base_url/api/dataset-id/?date__lte=1340582400&date__gte=1337731200
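
The timestamps in the example can be reproduced like so (Python sketch, treating the dates as UTC midnights):

import calendar
from datetime import datetime

start = calendar.timegm(datetime(2012, 5, 23).timetuple())  # 1337731200
end = calendar.timegm(datetime(2012, 6, 25).timetuple())    # 1340582400
url = "http://base_url/api/dataset-id/?date__lte=%d&date__gte=%d" % (end, start)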

sparkline is empty when there's only one observation

Metadata endpoint

Each geospatial area of the city should include metadata about the available data. This metadata should use the schema of http://schema.org/Dataset.

The metadata must include the following fields (a sketch of a full record follows the lists):

  • name
  • description
  • url: endpoint to the data for this area
  • isBasedOnUrl: source URL, e.g. https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
  • temporal: the range of temporal applicability of the dataset, e.g. for a 2011 census dataset, the year 2011 (in ISO 8601 time interval format)
  • spatial: the range of spatial applicability of the dataset, e.g. for a dataset of New York weather, the state of New York
  • frequency: frequency with which the dataset is published (accrualPeriodicity)

In addition to these schema.org standards, the metadata must also include:

  • temporal-resolutions: list of available temporal resolutions {continuous, minute, hour, day, week, month, year}
  • spatial-resolutions: list of available spatial resolutions {point, street segment, grid, building, parcel, city-block, census-block, tract, beat, community area, etc.}
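
A sketch of what one area's metadata record could look like under this scheme (all values are illustrative placeholders, not real endpoint output):

metadata = {
    "name": "Crimes - 2001 to present",
    "description": "Reported incidents of crime in the City of Chicago.",
    "url": "http://base_url/api/chicago_crimes/?area=tract-17031839100",
    "isBasedOnUrl": "https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2",
    "temporal": "2001-01-01/2014-08-04",  # ISO 8601 interval
    "spatial": "Chicago, IL",
    "frequency": "R/P1D",  # accrualPeriodicity: republished daily
    "temporal-resolutions": ["continuous", "hour", "day", "week", "month", "year"],
    "spatial-resolutions": ["point", "grid", "census-block", "tract", "community area"],
}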

Restrict API usage without API keys

Pretty standard - don't let someone make a bazillion calls or download a bazillion rows without an API key so we know who they are. We should also give them an API key after registration, when we have a /register page up.

Extract downloads

@derekeder and I have been talking a lot about what qualities a useful data portal might have. One thing that keeps coming up is bulk downloads. While it is nice to have beautiful visualizations that give you a way to explore what's available, in the end you're probably going to just want a csv file that you can use for your own analysis or visualizations. APIs are great but I'm thinking a bare minimum you should have before getting to that point is bulk downloads.

But since in our case that would basically mean redirecting people to Socrata, what unique thing can we offer? Well, maybe an obvious thing to build would be a way to select a box in space and time and get a zip file full of CSVs containing all the stuff we have in that space and time. To me, that would already be a super useful and powerful tool that researchers, journalists and developers would be pretty happy with. This is mainly based upon discussions I've had with developers and journalists in the open government space who have a few skills when it comes to exploring a spreadsheet and are interested in getting as close to the source data as possible.

Any thoughts?

Misleading time span for datasets b/c of unclean data

Some time spans like Graffiti Removal span back to 1926 only because of apparently unclean data. There are 19 entries on 2/5/1926 that were all closed 6/18/2014 and 8 on 1/1/1927 that were all closed 6/27/2014. The next entry is in 2004. This means Graffiti Removal has a misleading "89 years" of data. If we can't clean the data, is it possible to restrict the timespan only to periods when the data came in relatively consistently, like at least one record every 2 years?

Tree Trims has similar issues.

Another option would be to run all 311 data through a script that removes records whose completion_date is very long after their creation_date, and then reinsert.
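
That script could be as simple as this pandas sketch (column names follow the issue; the five-year threshold is a guess to tune against the data):

import pandas as pd

MAX_OPEN_DAYS = 365 * 5  # drop anything "open" longer than this

df = pd.read_csv("graffiti_removal.csv",
                 parse_dates=["creation_date", "completion_date"])
open_for = df["completion_date"] - df["creation_date"]
clean = df[open_for <= pd.Timedelta(days=MAX_OPEN_DAYS)]
clean.to_csv("graffiti_removal_clean.csv", index=False)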

Handle /explore query with no spatial parameter

Similar to #32, we might want to do something other than return all data from all datasets over the time period specified. I don't see the point of a visual query without a spatial parameter, especially given the load it could cause. If the user wants the whole world they can zoom out and select the whole world.

This is especially important because a new user might load the page, skip the instructions, and not realize there are tools to draw on the map. We could even just open the "Instructions" box when they try to submit such a query, as a gentle reminder.

Use cases for weather data

Here is some detail about how @meemking and I envision using weather data in Plenario.

Filter by other datasets

Perhaps the most important use case is the ability to use weather data to filter observations in other datasets - e.g., if the user only cares about murders that occurred when the sun was shining and it was at least 95 degrees out. One way this could be implemented could be to designate which fields in a dataset could be filtered on, perhaps using a flag on the column when the dataset is first imported, and give the user the ability to add filters on the /explore page.

For instance, imagine that under the "Aggregate by:" select field there was a button to add a filter, which displays a select field to choose a dataset, then another to choose an attribute from that dataset, among a pre-selected list of attributes. It would then display buttons or fields letting the user say things like "=cloudy", ">=95", "!=0". The user could add more filters in the same way, which would be joined together under a single WHERE ... AND ... clause. The final query would filter out points whose nearest weather observation (in space and time) does not meet those conditions.

Obviously, this can only be done with datasets that are guaranteed to have complete spatial and temporal coverage, which is where Matt's imputation script can be particularly helpful.

It would also be great to have a "day/night" filter which could even use dawn/dusk time information from weather.

Further functions and use cases below, which may be things we can start implementing sooner without major overhauls.

Functionality

  • Show me the count of [days, hours] when it was raining anywhere inside this polygon
    • SQL query: select count(days) where weather.precipitation > 0
    • this could be shown as a number, or as a heatmap, as long as the polygon is a certain minimum size (e.g. 50 mi across)

Use cases

  • Where were people murdered when it was at least 95 degrees out?
    • query might look like SELECT * from Crime LEFT JOIN Weather ON Crime.Date NEAR Weather.Date AND Crime.Location NEAR Weather.Location WHERE weather.temp > 95 (a more concrete sketch follows this list)
  • Show a heatmap which is hotter when the correlation between temperature and count of crimes (by type of crime) is closer to 1
    • In each cell of the grid, find correlation of temperature (nearest weather station) and count of crimes, using a given temporal aggregation, and color the heatmap using those correlations as inputs
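
A more concrete version of that NEAR join, assuming PostGIS and a made-up weather_observations table (schema and column names are assumptions, not our actual tables):

# For each crime, pick the observation from the nearest station that is
# closest in time, then filter on temperature. Requires Postgres 9.3+
# for LATERAL and PostGIS for the <-> distance operator.
NEAREST_WEATHER_SQL = """
SELECT c.*
FROM crime c
JOIN LATERAL (
    SELECT w.temp
    FROM weather_observations w
    ORDER BY c.geom <-> w.geom,
             ABS(EXTRACT(EPOCH FROM c.date - w.observed_at))
    LIMIT 1
) nearest ON TRUE
WHERE nearest.temp > 95;
"""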

Download

  • Allow user to choose which attributes to download
  • Allow user to download results of filtered queries (e.g. crime data where temperature was at least 95 degrees) - could be simply the original crime data, filtered by row, with temperature appended on
  • Add weather attributes onto the crime dataset for download - what the weather was like at each crime

Feel free to use this as a starting point, let us know what will be easy/difficult to implement, and what questions you have. This is only meant to get us to think more broadly about how to incorporate sensor data and how to perform queries on multiple tables.

template for 404 page

perhaps one that links to the API docs and data explorer, if we don't know what the user did wrong

Temporal resolution

Supported temporal resolutions:

  • continuous (default, no aggregation)
  • minute
  • hour
  • day
  • week
  • month
  • year

Example

http://base_url/api/dataset-id/?temporal_resolution=minute

More efficient type inference

Right now, I'm using the type inference stuff in csvkit to figure out the data types for the columns. This is great for smaller datasets but when we get into the larger ones, it gets a bit hairy.

Maybe there is a way to leverage the type inference inside pandas (see http://stackoverflow.com/questions/15555005/get-inferred-dataframe-types-iteratively-using-chunksize) to make sure we don't end up with processes getting killed by the OS because they're eating too much RAM.

Another approach would be to more cleverly iterate the incoming csv file so that it's not all read into memory at once and then use an approach such as the one in csvkit.
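
The chunked-pandas idea from that Stack Overflow link might look roughly like this (a sketch, not tested against our loader):

import numpy as np
import pandas as pd

def infer_csv_dtypes(path, chunksize=50000):
    # Scan the CSV in chunks so the whole file is never in memory,
    # widening each column's dtype whenever new chunks disagree.
    dtypes = {}
    for chunk in pd.read_csv(path, chunksize=chunksize):
        for col, dtype in chunk.dtypes.items():
            if col in dtypes:
                dtypes[col] = np.promote_types(dtypes[col], dtype)
            else:
                dtypes[col] = dtype
    return dtypes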

plenario icon

This is a discussion we should all be part of - what ideas do we have for plenario's branding?

  • color
  • logo/icon
  • tagline
  • feel of the site (flat, round, skeuomorphic, what have you)

For example, take the icon - I put a working one up as a placeholder but it may be a good candidate. It matches our current color (red?) and brings in a few good connotations:

[icon image]

  • looks like (in fact is derived from) the seating chart for a sitting body, i.e. "plenary session"
  • lots of distinct elements come together into a whole image
  • looks like an archway, which hearkens back to the original branding discussion in which we talked about the Plenario platform being like a physical structure
  • is also the intersection ∩ sign, which is fitting for the sort of queries we're running
  • with a few tweaks, could also resemble the Ω symbol for the other meaning of plenary="full, whole"

I'm not tied to this at all; it took me 10 minutes to make and is intended as a placeholder.

What do you all think, regarding both the logo and the other branding elements mentioned above? The U Chicago team has zero branding expertise (see "WOPR") and we're happy to listen to any and all ideas.

Allow for spatial queries along a path

I started playing around with how this might work today. I implemented a line string drawing tool on both the wopr.datamade.us/map/ and wopr.datamade.us/grid/ views.

For the moment on the /map/ view, it defaults to using a 100m buffer on either side of the line that you draw. On the grid view, you're able to pick a buffer (once you draw a line). The spatial aggregate is a 50m box for any buffer (for the moment) but it could be anything (figuring out how to make it work is mainly a UI problem). Here's what that kinda looks like:

[screenshot: buffered corridor query on the grid view, 2014-06-10]

This shows crime between June of last year and May of this year within a 300m buffer on either side of a corridor along Fullerton, Western and North aggregated to 50m boxes.
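
Under the hood this amounts to buffering the drawn line and intersecting, something like this shapely sketch (coordinates are illustrative, and real code would project to a meter-based CRS before buffering by 300):

from shapely.geometry import LineString, Point

# Hypothetical corridor; shapely buffers in coordinate units, so 0.003
# degrees stands in for roughly 300m at Chicago's latitude.
corridor = LineString([(-87.687, 41.925), (-87.670, 41.925), (-87.655, 41.918)])
buffer_zone = corridor.buffer(0.003)

print(buffer_zone.contains(Point(-87.672, 41.926)))  # True: inside the buffer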

Try it out: http://wopr.datamade.us/grid/

Extending spatial resolution

One of the limitations for having high spatial resolution is that, with a uniform grid, higher resolutions dramatically increase the number of polygons the client has to render.

One possible solution is to use quad trees, to have adaptively higher resolution where the action is and low resolution where there's little going on.

[figure: quad tree subdivision example]

We'd set a minimum cell size, and a maximum cell count.
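
A minimal sketch of that subdivision rule (illustrative only; real code would also enforce the maximum cell count):

def subdivide(cell, points, min_size, max_points=16):
    # Recursively split a square cell (x, y, size) into quadrants
    # wherever it holds too many points, stopping at min_size.
    x, y, size = cell
    inside = [p for p in points if x <= p[0] < x + size and y <= p[1] < y + size]
    if len(inside) <= max_points or size <= min_size:
        return [(cell, len(inside))]
    half = size / 2
    cells = []
    for qx in (x, x + half):
        for qy in (y, y + half):
            cells.extend(subdivide((qx, qy, half), inside, min_size, max_points))
    return cells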
