
insight-lane / crash-model


Build a crash prediction modeling application that leverages multiple data sources to generate a set of dynamic predictions we can use to identify potential trouble spots and direct timely safety interventions.

Home Page: https://insightlane.org

License: MIT License

Jupyter Notebook 87.07% Python 11.26% CSS 0.16% HTML 0.36% JavaScript 1.01% R 0.08% Shell 0.01% Dockerfile 0.06%

crash-model's People

Contributors

alchenist, alicefeng, andhint, anooparoor, azkajavaid, bpben, catmurd0ck, christophercahill, cooperdata, errchuang, j-t-t, jeffrimko, joshuakim1011, learningsomethingnew, nsteins, piyushm08, shreyapandit, spbail, terryf82, therriault


crash-model's Issues

Presentation/interaction layer for the model

It's important for both model evaluation (against subject matter expertise) and presentation to the public/planners/engineers to have some way to interact with model predictions. This could be anything from a barebones map to a webmap with more interactivity. This could happen concurrently with model development—we could generate predictions from the benchmark model or use dummy data.

Utility to deal with bad csvs.

For instance, the DC crash data CSV looks corrupted in some way. I was able to fix it manually, but we need to automate this.

Generate transit station proximity features

The MBTA has a comprehensive GTFS feed available at:

http://realtime.mbta.com/Portal/Content/Documents/MBTA_GTFS_Documentation.html

It could be useful to generate a variety of features for each segment:

  • whether there's a bus stop on the segment
  • whether the segment is on one or more bus routes
  • whether there's a subway station on the segment
  • distance to nearest bus stop
  • distance to nearest subway station

The thinking here is both to capture actual vehicle activity (buses stopping or driving by) that may affect crash likelihood, and to serve as an indicator of pedestrian activity (since there will be more of that in places close to traffic).

Also note that because the GTFS standard is a common one, this same tool could be readily ported over to other cities by just swapping the feed source, so this task would be a useful one beyond just for Boston.
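A minimal sketch of how the distance-based features might be computed from a GTFS feed's stops.txt, assuming the segments already exist as a GeoDataFrame and reprojecting everything to a metre-based CRS; the file paths, column names and 20 m threshold are illustrative, not part of the project.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Load GTFS stop locations (stops.txt is part of every GTFS feed).
stops = pd.read_csv("data/gtfs/stops.txt")
stops_gdf = gpd.GeoDataFrame(
    stops,
    geometry=[Point(xy) for xy in zip(stops.stop_lon, stops.stop_lat)],
    crs="EPSG:4326",
).to_crs(epsg=26986)  # Massachusetts mainland state plane, metres

# Road segments are assumed to already exist as a GeoDataFrame (path is a placeholder).
segments = gpd.read_file("data/processed/segments.geojson").to_crs(epsg=26986)

# Distance (metres) from each segment to its nearest stop.
segments["dist_nearest_stop"] = segments.geometry.apply(
    lambda geom: stops_gdf.distance(geom).min()
)

# Treat a stop within 20 m of the segment as being "on" it (illustrative cutoff).
segments["has_stop"] = segments["dist_nearest_stop"] < 20
```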

Finish unified pipeline script

We need to finish the pipeline script to allow an interested user to test the demo release with their own city's data. The steps that this process needs to run are:

  • data transformation
  • making the canonical dataset
  • running model training & prediction generation
  • outputting data for visualization

@j-t-t I think you've been working on this, are there specific items that need to be completed still that you'd like some help with?
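For reference, a rough sketch of how a runner could chain those four steps; the script names and config flag below are placeholders, not the repo's actual filenames.

```python
import argparse
import subprocess
import sys

# Hypothetical step scripts -- the real pipeline's filenames may differ.
STEPS = [
    "src/data/transform_data.py",              # data transformation
    "src/data/make_canonical_dataset.py",      # making the canonical dataset
    "src/models/train_model.py",               # model training & prediction generation
    "src/visualization/prepare_viz_data.py",   # outputting data for visualization
]

def main():
    parser = argparse.ArgumentParser(description="Run the full crash-model pipeline")
    parser.add_argument("-c", "--config", required=True, help="path to city config file")
    args = parser.parse_args()

    for step in STEPS:
        print(f"Running {step} ...")
        result = subprocess.run([sys.executable, step, "-c", args.config])
        if result.returncode != 0:
            sys.exit(f"Step failed: {step}")

if __name__ == "__main__":
    main()
```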

Discrepancies between README and Makefile for creating environment

Hello!

The README advises creating the development environment using the environment.yml file. Meanwhile, the Makefile provides create_environment and requirement targets that use a generic conda env and a pip install -r requirements.txt. Should the Makefile even be used, or is it unchanged from the cookiecutter template?

Trying make data with a fresh clone results in an error

ImportError: No module named fiona

that seems to result from not using the environment.yml to define the environment and install fiona.

(If it should be supported, I could take a stab at updating the Makefile)

Discuss opportunities for collaboration on data standards with Vision Zero Network

I've been speaking to Vision Zero Network (visionzeronetwork.org), who are looking to expand the data they have for Vision Zero cities and are interested in our ideas around data standard development.

I have responded to their request for more details; our next step will likely be to talk with them directly to see what opportunities for collaboration exist.

I'll update this card when I hear more.

Crash severity

The current prediction dataset is based only on accidents in which someone required transport away from the crash. It would be interesting to develop some kind of model or visualization of crash severity. This would require additional data that can be obtained from the City of Boston.

Features to add to the model

This is the issue where we track features we want to consider adding. If we decide to work on these, we can spin them off into tasks.

  • streetlights
  • block size (smaller blocks significantly decrease crash rates per person)
  • street trees

Define standard: predictions

We need to define an initial standard for predictions the project generates, that resolves:

  • required / optional attributes of a prediction, including data typing

  • temporal aspect: previously we've talked about converting the dates for weekly predictions from a 1-53 week identifier to an epoch timestamp (e.g. the first second of the week), to ensure that we build in the capability to handle more than one year (a small conversion sketch follows this list)
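A small illustration of that conversion, assuming weeks are identified by ISO year and week number and the "first second of the week" is taken as Monday 00:00 UTC (requires Python 3.8+ for fromisocalendar); this is one possible convention, not a settled standard.

```python
from datetime import datetime, timezone

def week_to_epoch(year: int, iso_week: int) -> int:
    """Return the Unix timestamp of the first second (Monday 00:00 UTC) of an ISO week."""
    monday = datetime.fromisocalendar(year, iso_week, 1).replace(tzinfo=timezone.utc)
    return int(monday.timestamp())

print(week_to_epoch(2018, 1))  # 1514764800 -> 2018-01-01T00:00:00Z
```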

Better handling of time-based features

We're having some issues with joining the concern data to the crash data, and with the isocalendar formatting. We need to come up with a design for handling spatial_temporal features.
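One possible approach is to derive the same ISO (year, week) key from each dataset's timestamp column before merging; a sketch, with placeholder file paths and column names.

```python
import pandas as pd

def add_iso_week_key(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    """Add ISO year/week columns derived from a datetime column."""
    iso = pd.to_datetime(df[date_col]).dt.isocalendar()
    df["iso_year"] = iso["year"]
    df["iso_week"] = iso["week"]
    return df

# Placeholder inputs -- column names are assumptions, not the project's schema.
crashes = pd.read_csv("data/crashes.csv")     # assumed: segment_id, crash_date
concerns = pd.read_csv("data/concerns.csv")   # assumed: segment_id, requested_datetime

crashes = add_iso_week_key(crashes, "crash_date")
concerns = add_iso_week_key(concerns, "requested_datetime")

# Join concerns onto crashes by segment and ISO week.
joined = crashes.merge(concerns, on=["segment_id", "iso_year", "iso_week"], how="left")
```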

Clean and parse traffic count/turning movement count data

Necessary for #9 -- traffic count data is currently stored away in individual .xls files. Includes both automated traffic recordings (ATRs) and turning movement counts (TMCs). Major components:

  • Clean the data and get it into a tidy format (a starting sketch follows this list)
  • Make the data geospatial (i.e. geocode ATR locations)
  • Scrape data from TMCs and bind to segments
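A possible starting point for the cleaning step, assuming each ATR file is a single-sheet .xls export; the directory layout and column handling here are illustrative only.

```python
import glob
import pandas as pd

frames = []
for path in glob.glob("data/raw/atr/*.xls"):
    # Hypothetical layout: one sheet per file with location, date and hourly counts.
    df = pd.read_excel(path)
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df["source_file"] = path
    frames.append(df)

# Concatenate everything into one tidy table for downstream geocoding.
atr = pd.concat(frames, ignore_index=True)
atr.to_csv("data/processed/atr_tidy.csv", index=False)
```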

Incorporate traffic count data

Utilize vehicle speed/count and pedestrian count data to develop informative visuals and features for incorporation into the model or use in future projects.

Incorporate weather data

It would be insightful to find daily weather data for Boston in 2016 and map it along with the crash data to see how fluctuations in weather can affect crash frequency.

Hooking train/prediction scripts into visualization

Part of the pipeline requires the model to read in processed data and output predictions by week, which can then be passed to the visualization tool to create a dynamic visual. Right now this will take the form of train_model.py outputting csvs of segment-year-week-prediction, but eventually could be streamlined.
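A minimal illustration of that hand-off format (one row per segment-year-week with a prediction), using a made-up DataFrame; the column names and output path are assumptions.

```python
import pandas as pd

# Hypothetical predictions, one row per segment-week.
preds = pd.DataFrame({
    "segment_id": ["001", "001", "002"],
    "year":       [2018, 2018, 2018],
    "week":       [1, 2, 1],
    "prediction": [0.12, 0.08, 0.31],
})

# train_model.py would write this out for the visualization step to pick up.
preds.to_csv("data/processed/seg_week_predictions.csv", index=False)
```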

Define standard: segments & intersections

Segments & intersections are now to be based off OSM data (nodes, ways and relations) to allow any city with OSM coverage to onboard into the project easily.

We need to define an initial standard of how the segments & intersections will be stored once built. Some questions that have been raised, which may still be open (a sketch of one possible record shape follows the list):

  • what are the required & optional characteristics of a segment/intersection, including data typing?

  • are segments/intersections stored separately, or together? If stored together, how do we differentiate?

  • are we storing foreign key data for segments/intersections that connects them back to the OSM nodes/ways/relations from which they were built, so that we can detect & respond to changes in the source and possibly pass information back to OSM at some point?
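To make the third question concrete, here is one possible shape for a stored segment record that keeps foreign keys back to OSM; everything in it is a suggestion rather than a settled standard.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Segment:
    segment_id: str                  # project-level identifier
    is_intersection: bool            # differentiates the two types if stored together
    geometry: str                    # e.g. WKT or GeoJSON linestring/polygon
    osm_way_ids: List[int] = field(default_factory=list)   # foreign keys to OSM ways
    osm_node_ids: List[int] = field(default_factory=list)  # foreign keys to OSM nodes
    # optional, typed characteristics (lanes, speed limit, ...) would be added here
```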

March 07 2018

  1. Basic demo of Docker workflow
  • How to build the image, or download the image
  • How to run the image in a container, with optional volume mounting of external directory
  • How to make changes to the image's environment via the Dockerfile

Add medians as a feature

OpenStreetMap (and Boston's data before it) represents roads with medians as two separate roads. We want to identify medians and add them as a feature, but otherwise keep the representation the same.

This also requires fixing the case where there is a t-intersection on one side of the median but no intersection on the other. Currently, the stretch on the other side of the median is part of the intersection, but we don't want that to be the case.

Concern data analysis

Analyze Boston's and DC's concern data, as well as Cambridge's SeeClickFix data. Write up the findings.

Look at hosting & deployment options

With development now taking place under Docker, we have lots of options for deploying the project (e.g. AWS or any other cloud services provider). Domino Datalabs have also shown interest in helping us out with hosting support.

We need to determine the level of resources the pipeline version will require, and come up with a deployment strategy.

Data storage will also need to be factored into this.

Provide "explanatory features" from model

One of the goals of the project is to provide actionable insight to stakeholders. One way we can do that, even with the model we have now, is listing some of the high-importance / high-magnitude features. Some thinking needs to be done about the best way to do this and what it might produce.

Plot all segments on a map

We currently have around 24,000 individual road and intersection segments in our data, which is too many for Leaflet to handle. It would be nice to find a way to plot all of the segments on one map so that we can create some descriptive visualizations of the features in our dataset.
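One option worth testing is shrinking the GeoJSON before it reaches Leaflet by simplifying the geometries; a sketch with geopandas, where the tolerance is arbitrary and would need tuning.

```python
import geopandas as gpd

# Placeholder input path.
segments = gpd.read_file("data/processed/segments.geojson")

# Simplify geometries (tolerance is in the layer's CRS units) to cut the payload
# size before handing the file to the map.
segments["geometry"] = segments.geometry.simplify(tolerance=0.0001)
segments.to_file("data/processed/segments_simplified.geojson", driver="GeoJSON")
```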

Create prediction evaluation tool

Potential enhancement: add a tool that will take multiple sets of predictions and output (ideally with a graphical component) the relative performance of each on a test set compared to one another and/or a baseline. Flexible in terms of design, would be good to get a basic framework in place that we could refine later.
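A bare-bones version of what that framework might look like, comparing any number of prediction sets against the same test labels; it assumes binary crash/no-crash outcomes and uses scikit-learn and matplotlib.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score, roc_curve

def compare_predictions(y_true, prediction_sets):
    """prediction_sets: dict mapping a label to an array of predicted risk scores."""
    for label, y_score in prediction_sets.items():
        fpr, tpr, _ = roc_curve(y_true, y_score)
        auc = roc_auc_score(y_true, y_score)
        plt.plot(fpr, tpr, label=f"{label} (AUC={auc:.3f})")
    plt.plot([0, 1], [0, 1], linestyle="--", label="chance baseline")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```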

Resolve storage location of input/output data files

The current path to the data files

/boston-crash-modeling/osm-data

may be problematic from a Docker perspective. The container (executing image) is built by copying in the entire project repo at the point the image is created. This lets us build the image with a specific state of /boston-crash-modeling, which is good, but we of course want to vary the data we're running against (for different cities). It also becomes problematic when you want to use the container to develop locally by mounting over the project code directory: the current setup doesn't allow for overriding just the code; you'd be overriding the code and data together.

More broadly though, should we not be looking to automatically pull input data into the project separately from the code anyway, e.g from GoogleDrive, S3 etc? If this is to be developed as a pipeline, I think we want a certain degree of 'self-serve' to be included (get your city-specific files stored online, specify the urls and execute the app to generate predictions).
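A very small sketch of that "self-serve" idea: given a per-city listing of file URLs, the app downloads the inputs into a data directory at run time rather than baking them into the image. The URLs and file names below are invented for illustration.

```python
import os
import requests

# Hypothetical per-city config: name of each input file and where it lives online.
city_files = {
    "crashes.csv": "https://example.org/mycity/crashes.csv",
    "concerns.csv": "https://example.org/mycity/concerns.csv",
}

os.makedirs("data/raw", exist_ok=True)
for filename, url in city_files.items():
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with open(os.path.join("data/raw", filename), "wb") as f:
        f.write(response.content)
```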

Please add any thoughts / opinions / questions, thanks.

Create prediction dashboard

Potential enhancement: given a set of predictions, show a dashboard that describes the predictions and provides useful information (as a complement to a map of predictions); a sketch of how these figures might be computed follows the list. For example:

  • What are the top 5 or 10 most dangerous street segments?
  • How many predictions are above a given threshold?
  • How has the average risk level across all segments changed over time?
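A few lines showing how those figures could be derived from the segment-week prediction CSV described in the pipeline issue; the threshold and file path are placeholders.

```python
import pandas as pd

preds = pd.read_csv("data/processed/seg_week_predictions.csv")

# Top 10 most dangerous segments by average predicted risk.
top10 = (preds.groupby("segment_id")["prediction"]
              .mean()
              .sort_values(ascending=False)
              .head(10))

# Count of segment-weeks above an arbitrary risk threshold.
n_high_risk = (preds["prediction"] > 0.5).sum()

# Average risk across all segments over time.
risk_over_time = preds.groupby(["year", "week"])["prediction"].mean()

print(top10, n_high_risk, risk_over_time, sep="\n\n")
```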

Define standard: crashes

Define a standard for the format that crash data needs to be supplied in to be usable by the project:

  • required / optional attributes, including data typing

This may not be the current standard that is employed by Boston data if there are problems with that. We should define what we want our ideal standard to look like first, and then if necessary look at middleware to translate any city's data standard into ours.

Define standard: concerns

Define a standard for the format that concerns data needs to be supplied in to be usable by the project:

  • required / optional attributes, including data typing

Similar to the issue for defining a standard on crash data -
this may not be the current standard that is employed by Boston data. We should define what we want our ideal standard to look like first, and then if necessary look at middleware to translate any city's data standard into ours.

Vision Zero Network may be interested in the outcome of this issue, so we should ensure that we include all attributes that might be important.

Link bike lanes / routes to segments

Geospatial data on the City's network of bike lanes / routes:

https://data.boston.gov/dataset/existing-bike-network (Data Dictionary)

Should be pretty easy to import with a segment mapping because it uses the MassDOT road inventory that the segments are built from. One important detail: the install year should be included in the mapping, so that we don't include bike lanes installed after a given crash incident (which may be endogenous!). So I'd suggest the output dataset format (columns) be something like:

  • segment ID
  • year (only include bike lanes installed before that year - this should be the year of the outcome data on crashes that are being mapped to these features, so basically think of this as "as of January 1, 20XX...")
  • segment-specific characteristics (though these aren't exactly time-invariant, our crash training data is recent enough that we can treat them as if they are, so we'd repeat these for each year)

Note that the dataset has a bunch of other characteristics about roadway features (width, lanes, whether it's a key MBTA bus route, etc.) that would also be useful to include as features.
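A hedged sketch of the mapping step with geopandas: spatially join the bike-lane geometries to segments and carry the install year through so later filtering can respect it. The buffer distance and the 'InstallDate' column name are guesses that would need checking against the actual dataset.

```python
import geopandas as gpd

# Placeholder paths; reproject to a metre-based CRS for buffering.
segments = gpd.read_file("data/processed/segments.geojson").to_crs(epsg=26986)
bike_lanes = gpd.read_file("data/raw/existing_bike_network.geojson").to_crs(epsg=26986)

# Buffer segments slightly so nearly-coincident lane geometries still match.
buffered = segments.copy()
buffered["geometry"] = buffered.geometry.buffer(5)  # metres; arbitrary tolerance

# Spatial join: which bike lanes intersect which segments.
joined = gpd.sjoin(buffered, bike_lanes, how="left", predicate="intersects")

# Carry the install year through so features can exclude lanes built after a crash year.
mapping = joined[["segment_id"]].copy()          # assumes a segment_id column
mapping["install_year"] = joined["InstallDate"]  # placeholder column name
```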

New features based on vision zero data

Sort of a placeholder while more analysis is done

  • Whether or not there are at least 3 concerns at this segment
  • Whether it's a high-crash type of concern

Not vision zero data, but probably a good idea: whether or not it's an intersection

Search map by street name

The current map setup allows risk scores to be retrieved for individual segments, but you need to know where on the map the street is you're interested in.

We need a feature that allows a street name to be typed in, and the map can shift to that specific segment to return the risk score, as well as other available data.

Create standard input / output standards between feature gen / modeling / viz tracks

Early in the project, we identified three distinct project tracks:

  1. Feature Generation - taking raw data sources and turning them into features for modeling (also technically includes generating modeling outcomes, but that's a relatively minor part of it that's mostly solved)
  2. Modeling - taking prepped outcomes & features and turning them into crash predictions
  3. Visualization - taking models' predictions and translating them into a more useful format (maps, dashboards, reports, etc.)

The three tracks have been going along well, but having a standard format to pass data in between them (passing features/outcomes from #1 to #2 and predictions from #2 to #3) would greatly simplify things, making it so the tracks could operate truly independently.

So the request here is for someone to define exactly what the format should be at the end of stages #1 and #2 with an eye toward how they'll subsequently be used. Some of the questions to answer:

  • Where are they stored and in what format?
  • How are dates and times handled (both for outcomes and for the features that apply to them, so that we can generate features that are time-variant and readily apply them to specific outcome events)?
  • Do we want to put outcomes and features in model-ready format (e.g., # of crashes per week across all permutations of week & segment) or in simplified format (e.g., just the list of crashes, to be reshaped before modeling)?
  • How do we arrange the conditions of predictions (e.g., predictions for a given timespan/location)?

We might begin discussion here and/or just start writing up standards in a text file or README---someone should take the first cut and then note here how to edit. Thanks!

Add comments on existing concerns

Currently, we're only using entries on the Vision Zero Concerns map; however, there are also [comments on existing entries](https://data.boston.gov/dataset/vision-zero-comment), which are in a slightly different format. We should include these in concerns-based features.

Look at incorporating "transition features" into segments

Right now our segments taken from OSM ways are linestrings between two termination points (e.g. A and B) which are either endpoints (dead-ends) or intersections. These are extracted using "strict" sampling in OSMNX.

OSM may actually have multiple ways between two termination points, for example the road between A and B may have two ways: one part that has 2 lanes, the other which has 3.

For now we are going to ignore these situations and stick with strict extraction, but in the future it may be worth picking up this type of change and adding it to our segment as a transition feature.

Allow viz to dynamically display new cities

The cities that are currently available in the viz (Boston, Cambridge, DC) are hardcoded. The code needs to be changed to be able to read in new cities from a config file and automatically display maps for those cities. Specific areas that need to be changed are the list of cities that appear in the dropdown menu and the coordinates that the map is centered on. Both of these things should be sourced from the config file.
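A sketch of the config-driven approach, assuming a YAML file with one entry per city; the file path and key names are placeholders.

```python
import yaml

# Hypothetical cities.yml:
# - name: Boston
#   center: [42.3601, -71.0589]
# - name: Cambridge
#   center: [42.3736, -71.1097]

with open("src/visualization/cities.yml") as f:
    cities = yaml.safe_load(f)

# Source both the dropdown options and the map centers from the config.
dropdown_options = [c["name"] for c in cities]
map_centers = {c["name"]: c["center"] for c in cities}
```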

Define standard: Turning Movement Counts

This is probably better described as an issue and moved to a later version. TMCs are definitely helpful in prediction, but some amount of thinking would have to be devoted to making a standard for this. Different cities are likely to measure this differently.

Link Traffic Signals to Segments

The latest available dataset of traffic signals in Boston is available here:

https://data.boston.gov/dataset/traffic-signals

Could be useful as features to note whether a given street segment has a signal at both ends, one end, or neither end (since this probably has a significant impact on both speeds and driving behavior).

So we'd want to link these to segments, and there are a few options for how to do that comprehensively (a sketch of the coordinate-based fallback appears after the list):

  • there's an intersection ID, not sure if that matches to the intersection dataset we already have for this or if it's from a separate system
  • the location field gives the intersecting streets which could be mapped to either the intersection dataset or the segments themselves
  • if all else fails, we could approximately match by using the geocoordinates of the signals to map to the intersection locations and assign them to intersections that way. (Hopefully it doesn't come to that, but it'll be close enough for use as features.)
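If it does come down to that fallback, the coordinate-based match could look roughly like this with geopandas (sjoin_nearest needs geopandas 0.10+); the dataset paths, columns and 30 m cutoff are assumptions.

```python
import geopandas as gpd

# Placeholder paths; reproject to a metre-based CRS so distances are in metres.
intersections = gpd.read_file("data/processed/intersections.geojson").to_crs(epsg=26986)
signals = gpd.read_file("data/raw/traffic_signals.geojson").to_crs(epsg=26986)

# Assign each signal to its nearest intersection, keeping the match distance.
matched = gpd.sjoin_nearest(signals, intersections, how="left", distance_col="match_dist_m")

# Flag intersections that have at least one signal within 30 m (arbitrary cutoff).
signalized = set(matched.loc[matched["match_dist_m"] < 30, "index_right"])
intersections["has_signal"] = intersections.index.isin(signalized)
```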

Storyboard the new city onboarding process

We need to decide what the process for a new city onboarding into the project looks like.

To make this project scalable and attract interest, the process should be as self-serve as possible. An interested city needing to send an enquiry, wait for documentation on data standards, send across data files, wait for those to be loaded and then finally receive notice that something is available will likely push the project beyond the interest of most (not to mention introduce a lot of administration on our part).

Could a process be developed that basically allows a new city to:

  1. enter their city name into a web form, that kicks off the process of building segments & intersections from OSM
  2. view our data standards on crashes & concerns, format their available data accordingly and upload it to a public storage service (probably their own initially, as long as the URLs are accessible)
  3. indicate via another form that they have additional data ready to go, and submit the paths to it
  4. have the app pick up their data and integrate it into the predictions

POC predictive model

Use the canonical dataset to predict accidents.

We'd want to focus on how closely the "risk score" maps to actual risk. The score wouldn't necessarily be the raw output of the model, but whatever transformation we use would have to map closely to the actual risk.

For example, in the binary case, the probability density on the positive class could be used as a risk score. In that case, we'd want to compare this risk score to the actual percentage of segments with accidents at each score.
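A rough way to check that mapping: bin segments by risk score and compare each bin's mean score to the observed crash rate. The input file and column names here are placeholders.

```python
import pandas as pd

# Hypothetical frame: one row per segment with predicted risk and an observed crash flag.
df = pd.read_csv("data/processed/segment_risk_vs_outcome.csv")  # columns: risk_score, had_crash

df["score_bin"] = pd.cut(df["risk_score"], bins=10)
calibration = df.groupby("score_bin").agg(
    mean_score=("risk_score", "mean"),
    crash_rate=("had_crash", "mean"),
    n_segments=("had_crash", "size"),
)
print(calibration)
```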

Output riskiest segments

Adapt the model training script to output the top risky segments according to all available data. This will be combined with issue #103 to provide insight from the model.
