
nyc-crash-mapper-etl-script's Introduction

ETL SCRIPT FOR NYC CRASH MAPPER

Extract, Transform, and Load script for fetching new data from the NYC Open Data Portal's vehicle collision data and loading into the NYC Crash Mapper table on CARTO.

Python

This script is written for Python 3.8. The python and pip commands below reflect this.

Setup

Set the following environment variables in your shell. Copy in the values from the Heroku panel, or from heroku config -a nyc-crash-mapper-etl if you use the Heroku CLI.

export CARTO_API_KEY='<redacted>'
export CARTO_MASTER_KEY='<redacted>'
export SOCRATA_APP_TOKEN_SECRET='<redacted>'
export SOCRATA_APP_TOKEN_PUBLIC='<redacted>'
export SENDGRID_API_KEY='<redacted>'
export SENDGRID_USERNAME='<redacted>'
export SENDGRID_TO_EMAIL="<redacted>"

You may find it useful to create a file called .env which contains these commands, then to use source .env to load those variables into your shell.

Double check that the variable was set and is in your environment:

echo $SENDGRID_USERNAME

Install Python requirements:

pip3.8 install -r requirements.txt

Running Locally

Run the script using Python 3.8 by doing:

python3.8 main.py

Running via a Heroku Scheduler

To run on Heroku, fill in the values and send them to Heroku via commands such as these. Include all of the variables in that environment variable list described above.

heroku git:remote -a nyc-crash-mapper-etl

heroku config:set CARTO_API_KEY=<redacted>
heroku config:set CARTO_MASTER_KEY=<redacted>
heroku config:set SOCRATA_APP_TOKEN_SECRET=<redacted>
heroku config:set SOCRATA_APP_TOKEN_PUBLIC=<redacted>
heroku config:set SENDGRID_API_KEY='<redacted>'
heroku config:set SENDGRID_USERNAME='<redacted>'
heroku config:set SENDGRID_TO_EMAIL="<redacted>"

Then provision the Heroku Scheduler, and add a job simply with the following command:

python3.8 main.py

Deploying the Scheduled Task

After making changes to the script, you will want to push these changes to Heroku scheduler so the script is used the next day.

To deploy the site to the Heroku scheduler, push the code to the Heroku remote:

heroku git:remote -a nyc-crash-mapper-etl
git push heroku master

Note about qgtunnel

In 2023, we needed a static IP for this service so that it could be safelisted for use with a MySQL database the client uses for the Walkmapper project. Heroku does not offer static IPs itself, but there is an add-on for it. The .qgtunnel file in the root of this repo is the config for that add-on. Settings and docs are reachable via the add-on section of the control panel on heroku.com.

nyc-crash-mapper-etl-script's People

Contributors

clhenrick, danrademacher, fahadkirmani, gregallensworth


nyc-crash-mapper-etl-script's Issues

ETL adjustment or offline task, to find & rectify altered crash records?

Investigations in #12 have confirmed that records at SODA are being modified after the fact, sometimes as much as 3 months later. In particular, a crash having an injury "converted" into a fatality is causing variances in injuries (2-3 out of several thousand per month; acceptable) and in fatalities (2-3 out of 10-15; significant).

Can a mechanism exist to detect crashes which have been altered, potentially 3 months after the fact, and update the CARTO record?

The sheer volume of data records exceeds what can feasibly be done with SODA and Socrata using a pure brute-force mechanism. The :updated_at and :created_at hidden system fields at Socrata may provide some mechanism for filtering for altered records (:created_at does not equal :updated_at) within a certain timeframe (last 3 months).
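For reference, a minimal sketch of what such a filter could look like against the SODA API, assuming the h9gi-nx95 dataset, the requests library, and that the dataset exposes the :created_at/:updated_at system fields in $where; this is an idea, not an existing part of main.py:

# Sketch: fetch SODA records modified in the last 3 months via the hidden
# :updated_at system field, keeping only rows that were edited after creation.
import os
from datetime import datetime, timedelta
import requests

SODA_URL = 'https://data.cityofnewyork.us/resource/h9gi-nx95.json'
cutoff = (datetime.utcnow() - timedelta(days=90)).strftime('%Y-%m-%dT%H:%M:%S')
params = {
    '$where': ":updated_at > '{0}' AND :updated_at != :created_at".format(cutoff),
    '$limit': 50000,
    '$$app_token': os.environ['SOCRATA_APP_TOKEN_PUBLIC'],
}
resp = requests.get(SODA_URL, params=params)
resp.raise_for_status()
altered = resp.json()
print('{0} records altered in the last 90 days'.format(len(altered)))

Those altered rows could then be matched to CARTO by socrata_id and updated in place.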

Missing crashes since 11/26

Christine reported we had no data since 11/26, though the script was running fine as recently as 12/1 (though given Thanksgiving, I was not surprised that there was no new data yet).

But now the script is failing.

Data import failing, again

We have no data for August:
[screenshot]

Looking at the Heroku logs, it appears we're having timeout issues with SQL queries again:

2020-09-01T17:23:00.358575+00:00 app[scheduler.3180]: 05:23:00 PM - INFO - {"error":["You are over platform's limits: SQL query timeout error. Refactor your query before running again or contact CARTO support for more details."],"context":"limit","detail":"datasource"}
2020-09-01T17:23:00.358814+00:00 app[scheduler.3180]: 05:23:00 PM - INFO - Insert chunk of up to 100 crash records
2020-09-01T17:23:25.713549+00:00 app[scheduler.3180]: 05:23:25 PM - INFO - {"error":["You are over platform's limits: SQL query timeout error. Refactor your query before running again or contact CARTO support for more details."],"context":"limit","detail":"datasource"}
2020-09-01T17:23:25.713681+00:00 app[scheduler.3180]: 05:23:25 PM - INFO - Insert chunk of up to 100 crash records
[... the same "SQL query timeout error" / "Insert chunk of up to 100 crash records" pair repeats roughly every 25 seconds through 05:38 PM ...]
2020-09-01T17:38:12.950562+00:00 app[scheduler.3180]: 05:38:12 PM - INFO - {"error":["You are over platform's limits: SQL query timeout error. Refactor your query before running again or contact CARTO support for more details."],"context":"limit","detail":"datasource"}

And then looking in CARTO, it appears we have no crashes since Jul 7:
[screenshot]

Over in Socrata, there are 16,014 crashes since then:
https://data.cityofnewyork.us/Public-Safety/Cashes-Since-7-8-2020/3mj4-ptnv/edit

[screenshot]

So something happened on Jul 7, and now we have an issue where we have too much backlog to easily manage...

Crash counts mismatch between CARTO & SODA

Dan has put together a spreadsheet investigating the crash counts from Socrata and from CARTO, and come up with some figures indicating that in many cases the injury count in CARTO (and CrashMapper) is higher than indicated by the Socrata systems.

https://docs.google.com/spreadsheets/d/14JNTMS4gRszWSVXY0kOGXkYCWWp440Las_PE40nAAT0/edit#gid=248679534

Note especially the bottom rows, which list the most-incorrect months.

Year	Month	Injured_NYC	Injured_CARTO	Killed_NYC	Killed_CARTO	Notes
2017	2	3907	4061	13	13	CARTO extra 100 injuries
2016	12	4884	5580	24	26	CARTO extra 700 injuries
2016	8	5509	11318	29	27	CARTO nearly 2X the injuries

Crashes Missing from Crashmapper - CB107

Forwarded by Christine:

Greetings -

I have been reviewing crash data for Manhattan CB7 and notice that a large number of crashes aren't showing up when I filter for Community Board 107.

Of the crashes I downloaded from the Socrata NYC open data portal (https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/data), filtered for CB7 by Lat/Long (https://data.cityofnewyork.us/NYC-BigApps/UWS-Latitudes-and-Longitudes/ujp2-3ypa) and then manually deleting crashes that are outside the district, 20,766 of 25,312 are in Crashmapper and 4,546 are not. This includes 3 fatalities that are not in Crashmapper (COLLISION_ID 3959009, 3562386, and 4354569).

Here is my data in Google Sheets https://docs.google.com/spreadsheets/d/1T8I6XXOfrRo_xG6kCW-XTtUOAz4gc06bmOU8Xh1QEgI/edit#gid=503430291 with tabs for the data I downloaded from Crashmapper (one download with all categories except no injuries, and another with only no injuries), and from Socrata. Note that I used vlookup in column B of the "UWS_Latitudes_and_Longitudes" tab to see if the COLLISION_ID number in column Z of that tab appeared in column C (socrata_id) of the "Crashmapper CB7 Data" tab. (The data for "Crashmapper CB7 Data" was from an export of Crashmapper filtered for CB7 https://crashmapper.org/#/?cfat=false&cinj=false&endDate=2020-09&geo=community_board&identifier=107&lat=40.78535069903749&lng=-73.95360946655273&lngLats=%255B%255D&mfat=false&minj=false&noInjFat=true&pfat=false&pinj=false&startDate=2020-09&vbicycle=true&vbusvan=true&vcar=true&vmotorcycle=true&vother=true&vscooter=true&vsuv=true&vtruck=true&zoom=13.) I then changed any crash that showed a match to "In Crashmapper" and the others to "Not in Crashmapper." As the pivot table shows:

Year In Crashmapper Not in Crashmapper
2012 1411 280
2013 2592 546
2014 2523 562
2015 2725 540
2016 1724 1140
2017 2862 441
2018 2749 388
2019 2341 318
2020 1015 149
2021 824 182
Grand Total 20766 4546

Would you know why these crashes aren't appearing?

┆Issue is synchronized with this Asana task
┆Attachments: image.png | image.png | image.png
┆Due Date: 2021-11-30

Handle new data ID and auth requirements at Socrata

As noted between @fahadkirmani, Christine, and me on Upwork,

It sounds like we need to change this line: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/master/main.py#L21 to point at the new data id h9gi-nx95. That seems very simple and either Fahad or GreenInfo has credentials needed to do it. But we also need to investigate and rewrite our script to pass an App Token.

I made one under a GreenInfo Socrata account and added a new ENV variable in Heroku called SOCRATA_APP_TOKEN_SECRET, tied to a public app token of 32fNIbFNcIWEhZJfm1q6ypTNA, but now we need to figure out the details of how we pass that to Socrata. Ideally we could just add it to our query URL, though I assume we then need to use the secret in our code to validate. I am sure it's not hard, but it takes time and a little sleuthing.
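For what it's worth, a minimal sketch of how the token could be passed, assuming the requests library. Socrata accepts the public app token either as an X-App-Token header or a $$app_token query parameter, and as far as I know the secret is not needed for read-only requests:

# Sketch (not current main.py code): pass the public Socrata app token on each request.
import os
import requests

SODA_URL = 'https://data.cityofnewyork.us/resource/h9gi-nx95.json'
headers = {'X-App-Token': os.environ['SOCRATA_APP_TOKEN_PUBLIC']}
resp = requests.get(SODA_URL, params={'$limit': 1000}, headers=headers)
resp.raise_for_status()
rows = resp.json()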

@fahadkirmani can you take the first pass at looking into this? You should have all the credentials you need to log into Heroku, and push code here and there, or at least submit a PR here with a fix.

Assess impact of new API key system

https://carto.com/developers/fundamentals/authorization/

@gregallensworth related to discoveries in #eastcoastgreenway, it appears that CARTO rolled out an improved API key system where one can have multiple keys with different levels of access to different things.

Documentation is here: https://carto.com/developers/fundamentals/authorization/

Looks good, actually. But it appears possible that the single key we have by default is read-only.

So we need to enable write access, and maybe make a second key just for the ETL.

Vehicle Type Crosswalk examination tool

A followup to issue #22 in which hasvehicle_XXX fields are created and populated.

The vehicletype_crosswalk_prod table contains NYC vehicle type codes, which come in a wide variety of typographical variants, to be remapped onto a standardized set of 8 vehicle codes.

It is expected that new variants will pop up from time to time, e.g. a new misspelling of the word "BICYCLE" or a wholly-new alias such as "Delivery Van". As a result, newly incoming records would fail to be tagged because the alias was not known.

Thus, a tool to detect such discrepancies would have two parts:

  • A query which would unnest all vehicle_type array values found in crashes_all_prod but not found in the nyc_vehicletype field of the vehicletype_crosswalk_prod table. That is, all vehicle types found in Socrata data but which lack an alias for hasvehicle_XXX tagging.
  • For each mismatch, a query to list what records are using this domain value, so those records could be evaluated for potential manual correction.
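A rough sketch of what part 1 might look like through CARTO's SQL API, assuming vehicle_type is stored as a Postgres text array (if it is a delimited string, a string_to_array step would be needed first); table and column names are the ones named in this issue, and this is not an existing script:

# Sketch: list vehicle_type values present in crash data but missing from the crosswalk.
import os
import requests

CARTO_SQL_URL = 'https://chekpeds.carto.com/api/v2/sql'
missing_sql = """
    SELECT DISTINCT unnest(vehicle_type) AS nyc_vehicletype
    FROM crashes_all_prod
    EXCEPT
    SELECT nyc_vehicletype FROM vehicletype_crosswalk_prod
"""
resp = requests.get(CARTO_SQL_URL, params={'q': missing_sql, 'api_key': os.environ['CARTO_API_KEY']})
for row in resp.json()['rows']:
    # part 2 would follow up per value, e.g. SELECT ... WHERE '<value>' = ANY(vehicle_type)
    print(row['nyc_vehicletype'])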

Vehicle Type apportionment

Over in #22 we assigned booleans for the 8 canonical Vehicle Types.

Another part of that task, which is properly a separate issue, is this:

Portioning out the injured and the killed

#22 should be sufficient to allow for basic filtering of crashes by involved vehicles -- "Show me crashes involving an SUV or CAR or TRUCK" like the push-button Crash Type filters we have on the current application.

There's a further wrinkle in that we'll want to also know what proportion of injuries or fatalities were attributable to each vehicle type, but without double or triple-counting. If we simply add up all fatalities, say, where CAR and SUV are involved, a crash of an SUV and car with a single fatality could create 1+1, when what we want to do is say that they were each equally responsible for a single fatality, so 0.5+0.5. With the further challenge that a BICYCLE is the cause of a fatality only when no other vehicles are involved.

So I think what we would need to do here is:

  1. get the count of the number of distinct crashmapper_vehicletypes involved
  2. if that's >1, check for bike, if bike=true, then subtract 1 from total vehicle types
  3. divide the number of fatalities and injuries by the number from 2.
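A sketch of steps 1-3 in Python, for a single crash record; the field names are illustrative and this is the proposed logic, not code that exists in main.py:

def allocation_divisor(vehicle_types):
    """How many canonical vehicle types to split the injury/fatality tallies across."""
    distinct = set(vehicle_types)            # step 1: distinct crashmapper_vehicletypes
    if len(distinct) > 1 and 'BICYCLE' in distinct:
        distinct.discard('BICYCLE')          # step 2: a bike only counts when it is alone
    return max(len(distinct), 1)

# step 3: divide the tallies by that count
crash = {'vehicle_types': ['SUV', 'CAR'], 'number_of_persons_killed': 1}
share = crash['number_of_persons_killed'] / float(allocation_divisor(crash['vehicle_types']))
# SUV and CAR are each allocated 0.5 of the fatality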

But then it gets even more gnarly, in that we also need to retain whether the injured or killed were walking, biking, or driving.

Though this makes for a large schema, it seems like a clear option would be to create a set of 48 (!) fields that directly allocate to the 8 canonical types:

  • e-bike-scoot boolean?
    • e-bike-scoot_pedInjury
    • e-bike-scoot_bikeInjury
    • e-bike-scoot_motoristInjury
    • e-bike-scoot_pedFatality
    • e-bike-scoot_bikeFatality
    • e-bike-scoot_motoristFatality

... and so on for all 8 crashmapper_vehicletypes

Then we could always query the same fields for the same mode-related injury/death data, and rows without that data simply wouldn't contribute to the total.

Visualizing that data will be a challenge, but first we have to get the data (and I do have a plan for visualization within current application structure).

Missing Feb 2020 crashes, exceeding API limits

We have a reasonable number for January -- 13545 as a result of SELECT count(*) FROM crashes_all_prod where date_val >= '2020-01-01' and date_val <= '2020-01-31'

But in February we have only 358 as a result of SELECT count(*) FROM crashes_all_prod where date_val >= '2020-02-01' and date_val <= '2020-02-29'

And the map looks like it:
[screenshot]

We made changes to the ETL script in early January, but then it ran fine for 20+ days.

Based on https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/data, there should be about 13,300 crashes in Feb.

Aha, damn, here it is in Heroku logs:

2020-03-03T16:30:21.225483+00:00 app[scheduler.6563]: 04:30:21 PM - INFO - Found 10233 new rows to insert into CARTO
2020-03-03T16:30:21.225538+00:00 app[scheduler.6563]: 04:30:21 PM - INFO - Creating CARTO SQL insert for 10233 new rows
2020-03-03T16:30:27.693163+00:00 app[scheduler.6563]: 04:30:27 PM - INFO - {"error":["You are over platform's limits: SQL query timeout error. Refactor your query before running again or contact CARTO support for more details."],"context":"limit","detail":"datasource"}
2020-03-03T16:30:28.020869+00:00 app[scheduler.6563]: 04:30:28 PM - INFO - {"error":["Quota exceeded by 565604KB"]}

We refactored the ETL to use their batch API specifically to avoid this, but

@fahadkirmani you set up emails so you should get alerts about things like this. Is that system not working? The goal of the notifications to you is that we discover these issues before we're 12,000 crashes behind.

@gregallensworth maybe we need to break the batches into even smaller payloads?

find_updated_killcounts() bailing with JSON error

Discovered while debugging for #15

The find_updated_killcounts() function looks back through existing records, so as to find those which are in CARTO but whose kill/injure counts may have changed since then (typically an injured person who later died after the crash was logged).

During the Fetching CARTO entries step:

  File "main.py", line 566, in find_updated_killcounts
    'q': "SELECT * FROM {0} WHERE socrata_id IN ({1})".format(CARTO_CRASHES_TABLE, crashidlist),

ValueError: No JSON object could be decoded
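My guess is the CARTO SQL API returned a non-JSON body (an error page or empty response on timeout) which then blew up during parsing. A defensive sketch of how the response could be checked before decoding -- an idea for a fix, not the current implementation:

# Sketch: log the raw body when CARTO's SQL API returns something that isn't JSON.
import logging
import requests

logger = logging.getLogger(__name__)

def carto_sql(url, query, api_key):
    resp = requests.post(url, data={'q': query, 'api_key': api_key})
    try:
        return resp.json()
    except ValueError:
        logger.error('CARTO returned non-JSON (HTTP %s): %s', resp.status_code, resp.text[:500])
        raise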

Suddenly low CARTO quota is killing import

Client reports no new data since Sept 1.

I logged into CARTO and I see:
[screenshot]

Heroku log:

2018-11-12T17:00:25.965375+00:00 app[api]: Starting process with command `python main.py` by user [email protected]
2018-11-12T17:00:28.665051+00:00 heroku[scheduler.2390]: State changed from starting to up
2018-11-12T17:00:30.920187+00:00 app[scheduler.2390]:  05:00:30 PM - INFO - Getting data from Socrata SODA API as of 2018-09-12
2018-11-12T17:00:30.975810+00:00 app[scheduler.2390]: /app/.heroku/python/lib/python2.7/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
2018-11-12T17:00:30.975815+00:00 app[scheduler.2390]:   InsecureRequestWarning)
2018-11-12T17:00:33.900943+00:00 app[scheduler.2390]:  05:00:33 PM - INFO - Got 36452 SODA entries OK
2018-11-12T17:00:33.901019+00:00 app[scheduler.2390]:  05:00:33 PM - INFO - Getting socrata_id list from CARTO as of 2018-09-12
2018-11-12T17:00:34.408454+00:00 app[scheduler.2390]:  05:00:34 PM - ERROR - No socrata_id rows: {"time": 0.295, "fields": {"socrata_id": {"type": "number"}}, "total_rows": 0, "rows": []}
2018-11-12T17:00:34.619079+00:00 heroku[scheduler.2390]: Process exited with status 1
2018-11-12T17:00:34.660615+00:00 heroku[scheduler.2390]: State changed from up to complete

That is also when Christine had problems with her grant, and sure enough they appear to have downgraded her to the FREE plan on Sept 6:
[screenshot]

I have informed her about this and she'll need to work out with CARTO what the terms of the grant really are.

Add Business Improvement Districts to ETL process.

As a first step on this new Walk Mapper project, we need to add Business Improvement Districts to CARTO and then assign Business Improvement Districts to our data, similar to

def update_borough():
    """
    SQL query to update the borough column in the crashes table
    """
    logger.info('Cleanup update_borough()')
    sql = '''
    UPDATE {0}
    SET borough = a.borough
    FROM nyc_borough a
    WHERE {0}.the_geom IS NOT NULL AND ST_Within({0}.the_geom, a.the_geom)
    AND ({0}.borough IS NULL OR {0}.borough='')
    '''.format(CARTO_CRASHES_TABLE)
    return sql
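A sketch of the analogous cleanup step for Business Improvement Districts, following the same pattern; the nyc_business_improvement_district table name and the bid column are placeholders until the data is actually loaded into CARTO:

def update_business_improvement_district():
    """
    SQL query to update the (hypothetical) bid column in the crashes table
    """
    logger.info('Cleanup update_business_improvement_district()')
    sql = '''
    UPDATE {0}
    SET bid = a.identifier
    FROM nyc_business_improvement_district a
    WHERE {0}.the_geom IS NOT NULL AND ST_Within({0}.the_geom, a.the_geom)
    AND ({0}.bid IS NULL)
    '''.format(CARTO_CRASHES_TABLE)
    return sql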

The data is here: https://drive.google.com/open?id=1qoUAdSabPUMcoSmkeV9NWIYPORXc3kvA

This will be part of the new Walk Mapper data system, and I think it is prudent to add it to Crashmapper, though I'm not sure yet how or when it will get added to the front end there. So it might be the best use of effort to add the new tagging going forward but not yet tackle applying it backwards to ~2 million crashes, which seems like a rabbit hole.

Bills to CHEKPEDS:Walk Mapper Phase 1

Update Heroku stack

I was looking at Heroku for ECG and noticed this alert for our ETL scripts for Crashmapper:
[screenshot]

Past updates have been quick -- reading some docs to see how to do it, running the commands, seeing that nothing is broken, and then pushing the update. More full-featured apps might be harder, but fingers crossed these scripts will be smooth.

┆Issue is synchronized with this Asana task

Data vetting and checkup: non-unique socrata_id and missing socrata_id

As long as I've been put in charge of crossing the Ts and dotting the Is, I checked whether the socrata_id field is in fact unique. However, it appears not to be.

At this point, the socrata_id is not required to be unique, only to be present. It is used only by the ETL script (#6) to determine that a SODA record has already been seen, and not to import it. Under these conditions, the non-unique or null values are not problematic. But they should be investigated to determine whether there exists some deeper issue in ETL, or just a QC issue during initial data loading which is not likely to recur.

Query:

SELECT * FROM (
  SELECT socrata_id, COUNT(*) AS howmany FROM crashes_all_prod GROUP BY socrata_id
  ) chex
WHERE howmany > 1
ORDER BY howmany DESC

Results:

105076 records with a null value in socrata_id

398 socrata_ids which are associated to 2 records.

No records with a non-null socrata_id had more than 2 such instances (duplicates existed, but no triplicates, quadruplicates, ...).

Create new FixCoords script, or expand FixTallies

Over in #34 I diagnosed an issue where lots of crashes in Socrata now have different coordinates than what we have in CARTO. We suspect that NYC might sometimes regeocode crashes with poor location info as they improve their geocoder, but ultimately, we're not sure why this happens.

As a way to patch the problem, we need to develop a script that updates XY coordinates from Socrata into CARTO.

We already have a couple of methods we have used to deal with changes in tallies of injured/killed:

  1. Every day as part of main.py we look back 90 days and update tallies. That's here: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/master/main.py#L814. This runs daily so we can be sure it's correct to current schema and approaches, etc.
  2. We also "FixTallies" script that checks more broadly for changes in injured/killed data and updates CARTO. That's here:
    https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/tree/master/fixtallies. This runs manually and we haven't touched it in 2 years. Not sure when we last ran it, so it may or may not work. As I recall, we used this to catch up on wrong injured/killed tallies so we could then do the 90 day update in 1.

Right now, we could have wrong coordinates going all the way back to the beginning, so it seems like we'll need some version of 2, and then include the XY update in our main.py going forward as well.
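As a sketch of what the per-record fix could look like (CDB_LatLng is CARTO's lat/lng helper; the update statement itself is a proposal, not an existing script):

# Sketch: given a Socrata collision_id and its current lat/long, build an UPDATE
# that corrects the CARTO row only when the stored coordinates differ.
def build_coord_update_sql(collision_id, lat, lng):
    return """
        UPDATE crashes_all_prod
        SET latitude = {1}, longitude = {2},
            the_geom = CDB_LatLng({1}, {2})
        WHERE socrata_id = {0}
          AND (latitude IS DISTINCT FROM {1} OR longitude IS DISTINCT FROM {2})
    """.format(int(collision_id), float(lat), float(lng))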

This spreadsheet tab has a list of known mismatched records, where COLLISION_ID is socrata_id.

This is interesting, drawing on a sample of just things around CB107, showing that most of the problem is in 2016:

Year  Mismatched Count
2021 58
2020 15
2019 13
2018 24
2017 89
2016 1032
2015 0
2014 1
2013 0
2012 1

Bills to CHEKPEDS:Crashmapper Ongoing.

┆Issue is synchronized with this Asana task

New Vehicle Type crosswalk and allocations

Warning that this is not simple! But I hope I've lined it up in a way that's reasonably well thought out.

This issue is the first step in a series of changes we'll be making to allow for filtering and inspection of crash data based on "vehicle type". This issue is confined to the ETL needs. We'll have separate issues for the actual filter and display requirements.

The data as given from NYC open data (as shown here on their Socrata site) is very messy, with almost a thousand variants, including a lot of misspellings and partials. The API source data includes up to 5 "vehicle type" fields per crash, to account for multivehicle crashes, though CHEKPEDS has reviewed the data and 99%+ of crashes are accounted for by the first three.

We currently store the data in a JSON array in a single field, vehicle_type. The goal here is to store cleaned up domain values for vehicle type and also do some new allocation of injuries and fatalities to account for some special cases, like bikes. This issue describes this as fully as I can, and then we can see if anything is missing:

From 900+ variants to 8 values.

First, we need to translate those messy values into a set of canonical values:

E-BIKE-SCOOT
MOTORCYCLE-MOPED
SUV
CAR
OTHER
TRUCK
BICYCLE
BUS-VAN

Using this crosswalk provided by CHEKPEDS:
https://chekpeds.carto.com/tables/vehicletype_crosswalk_prod/table

Where we look up nyc_vehicletype and write into a new field the matching value from crashmapper_vehicletype, or fall back to OTHER if nothing is found. The goal is to keep this crosswalk as a table in CARTO so that as new weird values turn up in the source data, we can update the crosswalk using CARTO's existing web UI and then translate those new values into one of our eight domain values. We really need only distinct types here. We won't use crashmapper_vehicletype to count the number of total vehicles involved, so a CAR,CAR,CAR would be the same as CAR.
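A small sketch of that lookup over already-fetched rows, with an illustrative crosswalk dict (the real crosswalk lives in the CARTO table linked above):

def translate_vehicle_types(raw_types, crosswalk):
    """raw_types: messy NYC values; crosswalk: dict of nyc_vehicletype -> crashmapper_vehicletype."""
    canonical = set()
    for raw in raw_types:
        key = (raw or '').strip().upper()
        canonical.add(crosswalk.get(key, 'OTHER'))   # fall back to OTHER for unknown variants
    return sorted(canonical)                         # distinct only: CAR,CAR,CAR collapses to CAR

translate_vehicle_types(
    ['Sedan', 'SPORT UTILITY / STATION WAGON', 'Bycicle'],
    {'SEDAN': 'CAR', 'SPORT UTILITY / STATION WAGON': 'SUV', 'BICYCLE': 'BICYCLE'})
# -> ['CAR', 'OTHER', 'SUV']   ("Bycicle" is an unmapped misspelling, so it lands in OTHER)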

(Note that I assume any changes to the 8 domain values would require code changes in filtering elements we implement in future tasks in other repos.)

UI will be a separate task, but this first step should enable a filter like this:
[screenshot]

Pending the next section of this issue, I'm pausing on a determination whether the translated vehicle types should be stored as an array or in some other fashion. I do think we should retain the existing vehicle_type array as is, to make this translation nondestructive.

Portioning out the injured and the killed

The above translation would be sufficient to allow for basic filtering of crashes by involved vehicles -- "Show me crashes involving an SUV or CAR or TRUCK" like the push button Crash Type filters we have on the current application.

There's a further wrinkle in that we'll want to also know what proportion of injuries or fatalities were attributable to each vehicle type, but without double or triple-counting. If we simply add up all fatalities, say, where CAR and SUV are involved, a crash of an SUV and car with a single fatality could create 1+1, when what we want to do is say that they were each equally responsible for a single fatality, so 0.5+0.5. With the further challenge that a BICYCLE is the cause of a fatality only when no other vehicles are involved.

So I think what we would need to do here is:

  1. get the count of the number of distinct crashmapper_vehicletypes involved
  2. if that's >1, check for bike, if bike=true, then subtract 1 from total vehicle types
  3. divide the number of fatalities and injuries by the number from 2.

But then it gets even more gnarly, in that we also need to retain whether the injured or killed were walking, biking, or driving.

Though this makes for a large schema, it seems like a clear option would be to create a set of 48 (!) fields that directly allocate to the 8 canonical types:

  • e-bike-scoot boolean?
    • e-bike-scoot_pedInjury
    • e-bike-scoot_bikeInjury
    • e-bike-scoot_motoristInjury
    • e-bike-scoot_pedFatality
    • e-bike-scoot_bikeFatality
    • e-bike-scoot_motoristFatality

... and so on for all 8 crashmapper_vehicletypes

Then we could always query the same fields for the same mode-related injury/death data, and rows without that data simply wouldn't contribute to the total.

Visualizing that data will be a challenge, but first we have to get the data (and I do have a plan for visualization within current application structure).

I suspect we should work with a small subset of data, like the past month or two, before we try to run this against the whole dataset, which is going to be its own challenge.

Scope out import of 1 year of 311 data to WalkMapper

Client question:

What would it take to load the past year of 311 data ?

My previous findings for ALL 311 data from scoping period:

The 311 data from Socrata would be a whole parallel project with its own variant of the current ETL script that would be more similar to the current crash script, but with new effort to figure out all the ways that data can be broken, like we have seen with the crash data.

I see that 311 is an order of magnitude more data — 25 million rows! I did an initial filter in Socrata down to what seemed like the most relevant complaint types, and we get down to 5.8 million with those 12 types. I saved that here so you can take a look: https://data.cityofnewyork.us/Social-Services/Curb-and-Street-311-Calls/inpy-irdd

That is still 2.5X the volume of data that we have in Crashmapper. On the one hand, that shouldn't matter since we have figured out all of the chunk-processing and incremental loading, etc, on the crash data. On the other hand, I would expect lots of new data problems given the amount of data that needs to be pulled in.

At the least, we would need to replicate the "Type Assignment" system like we did for Vehicle Types, since the Complaint Types data is very messy. Speaking of which, I was just looking at the Google Sheet we made for you to maintain the Vehicle Type data and it looks like it's been a while since those were updated. Is that system still useful for you?

Filtering the previous all time view to just last 365 days, we have about 533,000 rows:
[screenshot]

So we could import all those, but we need to crosswalk first with our existing schema for everything but Type and then wire in the crosswalk for Type.

Create monthly database backup task

Python script to do the following:

  • Make a copy of crashes_all_prod with some name like crashes_all_prod_archive_YYYYMMDD
  • Delete previous month's backup to preserve disk space

Then

  • set on monthly schedule using Heroku Scheduler
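A minimal sketch of that job, assuming it runs through CARTO's SQL API with the master key and follows the archive naming above; this is a proposal, not an existing script:

# Sketch: copy crashes_all_prod to a dated archive table, then drop older archives.
import os
from datetime import datetime
import requests

CARTO_SQL_URL = 'https://chekpeds.carto.com/api/v2/sql'
API_KEY = os.environ['CARTO_MASTER_KEY']

def run_sql(query):
    resp = requests.post(CARTO_SQL_URL, data={'q': query, 'api_key': API_KEY})
    resp.raise_for_status()
    return resp.json()

today = datetime.utcnow().strftime('%Y%m%d')
run_sql('CREATE TABLE crashes_all_prod_archive_{0} AS SELECT * FROM crashes_all_prod'.format(today))

# keep only the newest copy (assumes pg_tables is visible through the SQL API)
old = run_sql("SELECT tablename FROM pg_tables WHERE tablename LIKE 'crashes_all_prod_archive_%' "
              "AND tablename <> 'crashes_all_prod_archive_{0}'".format(today))
for row in old['rows']:
    run_sql('DROP TABLE {0}'.format(row['tablename']))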

┆Issue is synchronized with this Asana task

Update requirements and README for local running with sendgrid

In the README, we say that given a .env file, we can run main.py locally like this:

Install Python requirements:

pip install -r requirements.txt

Running Locally

Run the script using Python 2.7 by doing:

python main.py

I have done that successfully in the past. But now, it fails on:

  File "main.py", line 15, in <module>
    from sendgrid import SendGridAPIClient
ImportError: No module named sendgrid

That's after a fresh run of pip install. I also tried adding sendgrid to requirements.txt and running again, same error.

Can you update the README, requirements.txt or anything else we need to get local running working again on a fresh install of this repo?

Past records not in any assembly district

Discovered while updating polygon data in #38

The crashes_all_prod table has 2,071,732 crashes in it today, of which 205,560 (about 10%) have NULL for assembly

That's about 10%, which "feels" like a lot to fall outside of any assembly district while (presumably) falling within the five boroughs.

I should check into this:

  • see how this many fall outside any Assembly District
  • and see to what degree this also means City Council, Neighborhood, Community Board, ...

┆Issue is synchronized with this Asana task

Review data divergence in Crashmapper

Tasks:

  • Run backlog fill to catch missing records
  • Run fixtallies to catch changes in injury or fatality counts
  • Run fixnullgeom on last 3 years to make sure we catch any updated coordinates

┆Issue is synchronized with this Asana task
┆Due Date: 2023-02-06

Check SODA back 1 month, filter against already-existing records

Relevant to GreenInfo-Network/nyc-crash-mapper#67

The current approach of filtering by the highest date already fetched, would miss any record "backlogged" after its actual crash date.

  • Having already loaded a January 7 crash, the filter used at Socrata would explicitly be "date >= '2018-01-07'"
  • A crash dated January 6 being entered into Socrata on January 8 would effectively cause that record to be ignored because of the January 7 filter.

Realistically, this is observed to happen. Between Dec 12 2017 and Jan 8 2018, 214 crash records were not seen (out of 16872 records, so 1.27%).

A smarter but more complex system could be introduced:

  • get a list of all socrata_id values already in CARTO
  • query SODA back 1 month, not back to max-date (typically 1 day)
  • filter the SODA results, to skip over records already seen in Socrata

More specific code-level changes:

  • get_max_date_from_carto() will be removed
  • the date filter in get_soda_data() will use the current date minus a reasonable time period, e.g. get_date_monthsago_from_carto(1) to get crash records dated within the last month (about 20,000 records). Reference: https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/blob/gda-intersections_crashcount/main.py#L28
  • get_soda_data() will also collect a list of socrata_id values from CARTO within this same time period
  • as format_soda_response() loops over rows, it would skip records which are already present (SODA record's unique_key is found in CARTO's socrata_id list)
  • update_carto_table() would proceed as usual, with rows not previously present in CARTO
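A sketch of the filtering step described above (get_soda_rows/get_carto_ids stand in for the existing fetch helpers; this mirrors the plan rather than current code):

from datetime import datetime, timedelta

def filter_new_rows(soda_rows, carto_socrata_ids):
    """Drop SODA rows whose unique_key already exists in CARTO's socrata_id list."""
    seen = set(int(i) for i in carto_socrata_ids)
    return [row for row in soda_rows if int(row['unique_key']) not in seen]

month_ago = (datetime.utcnow() - timedelta(days=31)).strftime('%Y-%m-%d')
# soda_rows = get_soda_rows(since=month_ago)      # hypothetical helper names
# carto_ids = get_carto_ids(since=month_ago)
# new_rows  = filter_new_rows(soda_rows, carto_ids)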

Custom SODA run for 2015-01-01 to present?

Relevant to GreenInfo-Network/nyc-crash-mapper#67

The solution described in #6 should prevent future data loading from skipping over late-entered crash records. Now the question is back-loading these missing crashes.

At 500-600 records per month (gross assumption that Dec 2017 was average), the last 36 months may be missing 20,000 crashes (out of the 1.2 million, about 2%).

A custom utility could be created to repeatedly download Socrata records and merge them, fetch the list of socrata_id entries in CARTO, and generate a JSON file of the SODA records which are missing from CARTO. This JSON file could then be loaded following techniques outlined in the existing ETL script.

Improve error reporting

We have a system in place that sends emails if the ETL script hits an exception.

We confirmed it worked back in March:
[screenshot]

But then as described in #28 and #25, we have had ongoing import failures that don't cause emails to be sent. I think what we need to do is add more conditions that send emails.

Like if we see an error in the API response from CARTO, we should trigger an alert:
2020-09-01T17:23:25.713549+00:00 app[scheduler.3180]: 05:23:25 PM - INFO - {"error":["You are over platform's limits: SQL query timeout error. Refactor your query before running again or contact CARTO support for more details."],"context":"limit","detail":"datasource"}
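Something like this check, applied wherever we parse a CARTO response (send_email stands in for the existing SendGrid helper; a sketch, not current code):

def check_carto_response(response_json):
    """Raise and email an alert when CARTO's SQL API reports an error in its JSON body."""
    if isinstance(response_json, dict) and response_json.get('error'):
        message = 'CARTO API returned an error: {0}'.format(response_json['error'])
        logger.error(message)
        send_email(message)   # hypothetical wrapper around the SendGrid client
        raise RuntimeError(message)
    return response_json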

┆Issue is synchronized with this Asana task

Puzzling through getting a static IP for the ETL script process on heroku

Christine asked me a couple of weeks ago about getting a static IP assigned to the ETL script process on Heroku, since they have changed the Walkmapper database and need to safelist specific IPs for database access.

This is for the scripts here:
https://github.com/GreenInfo-Network/nyc-crash-mapper-etl-script/tree/master/walkmapper

Not surprisingly, Heroku does not provide static IPs. Christine found a Heroku add-on that does, accessible from the app on Heroku here:
[screenshot]

I updated the ENV vars on that app to be the correct ones for the new database, and also updated the record in 1P.

Then I asked the client to also safelist our office IP address. Loading the ENV vars locally and running python walkmapper/mysql2carto.py works as expected. I can access the DB and data is updated in CARTO.

However, the QuotaGuard proxy is not configured properly.

When I run from Heroku, I get this:

heroku run bin/qgtunnel python walkmapper/mysql2carto.py
Running bin/qgtunnel python walkmapper/mysql2carto.py on ⬢ nyc-crash-mapper-etl... up, run.5677 (Basic)
Traceback (most recent call last):
  File "/app/walkmapper/mysql2carto.py", line 404, in <module>
    ObstructionMyqlToCartoLoader().run()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/walkmapper/mysql2carto.py", line 64, in __init__
    self.db = MySQLdb.connect(host=DB_HOST, port=int(DB_PORT), user=DB_USER, password=DB_PASS, database=DB_NAME, cursorclass=MySQLdb.cursors.DictCursor)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/MySQLdb/__init__.py", line 123, in Connect
    return Connection(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/MySQLdb/connections.py", line 185, in __init__
    super().__init__(*args, **kwargs2)
MySQLdb.OperationalError: (1130, "Host 'ec2-34-207-80-161.compute-1.amazonaws.com' is not allowed to connect to this MySQL server")

Seems clear the IP is just not configured.

I tried this test they recommend and it seems to return an expected result:

heroku run 'curl -x http://94hgudyixsxiqy:[email protected]:9293 -L ip.quotaguard.com'
Running curl -x http://94hgudyixsxiqy:[email protected]:9293 -L ip.quotaguard.com on ⬢ nyc-crash-mapper-etl... up, run.4035 (Basic)
{"ip":"54.88.136.216"}%

So I filed this support request:

I need some help setting up QuotaGuard for a simple app on Heroku that runs Python scripts to harvest data from a few client databases.

I can access those databases locally with my own IP that is safelisted, but I can't get the Heroku app to access the databases, I believe because I have not properly configured QuotaGuard.

I tried following the instructions here, https://www.quotaguard.com/docs/language-platform/heroku/heroku-shield/, but I am not able to find the value for QUOTAGUARDSHIELD_URL in my Heroku config.

When I attempt to connect to the remote database, I get "MySQLdb.OperationalError: (1130, "Host 'ec2-18-212-22-136.compute-1.amazonaws.com' is not allowed to connect to this MySQL server")" -- which suggests to me the IP connection to QuotaGuard is not actually happening.

┆Issue is synchronized with this Asana task

Walkmapper import running but not importing

I just noticed that our daily task on this project is running and seems not to be reporting errors. Here's Heroku Logs:

2021-07-30T22:31:15.945781+00:00 app[api]: Starting process with command `python walkmapper/mysql2carto.py` by user [email protected]
2021-07-30T22:31:19.739117+00:00 heroku[scheduler.6975]: Starting process with command `python walkmapper/mysql2carto.py`
2021-07-30T22:31:20.346862+00:00 heroku[scheduler.6975]: State changed from starting to up
2021-07-30T22:31:22.039966+00:00 app[scheduler.6975]: 10:31:22 PM - INFO - Found 20 obstructions in CARTO
2021-07-30T22:31:22.233539+00:00 app[scheduler.6975]: 10:31:22 PM - INFO - Found 137 obstructions in MySQL
2021-07-30T22:31:30.861763+00:00 app[scheduler.6975]: 10:31:30 PM - INFO - Sorted: 15 to skip
2021-07-30T22:31:30.861787+00:00 app[scheduler.6975]: 10:31:30 PM - INFO - Sorted: 118 to insert
2021-07-30T22:31:30.861830+00:00 app[scheduler.6975]: 10:31:30 PM - INFO - Sorted: 4 to update
2021-07-30T22:31:30.861880+00:00 app[scheduler.6975]: 10:31:30 PM - INFO - Sorted: 0 to delete
2021-07-30T22:31:30.862203+00:00 app[scheduler.6975]: 10:31:30 PM - INFO - ALL DONE
2021-07-30T22:31:30.925363+00:00 heroku[scheduler.6975]: Process exited with status 0
2021-07-30T22:31:30.991896+00:00 heroku[scheduler.6975]: State changed from up to complete

With 20 in the system and 118 to insert, we'd expect to see 138 in the map. But we have only 20 still:
https://chekpeds.carto.com/viz/a71e66b8-cb4b-4c0a-aee7-3ef79a3f4b1a

[screenshot]

I wondered if it might be a permissions issue, but I'd expect that to throw errors and I also verified that the key we have has insert and update permissions.

I wonder if parts of the script simply are not running?

no data since 11/16

Hmm, client reports and I just confirmed in https://chekpeds.carto.com/tables/crashes_all_prod that we have no crashes since Nov 16, 2018.

Looking at the logs at https://dashboard.heroku.com/apps/nyc-crash-mapper-etl/logs, I can see that the scheduled task is running.

Ah, here's a new one: NYC data has nothing since 11/16.
https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95/data

Their view of all data, sorted descending by date:
[screenshot]

Will report this back to client. Leaving open for now.

Filter out badly geocoded data after updating crashes table in Carto

It appears new NYC Open Data contains lat/lons way outside of NYC, in the Atlantic Ocean and at Null Island. These should be filtered out via a SQL UPDATE query after new data has been inserted.

Something like:

UPDATE crashes_all_prod
SET the_geom = NULL
WHERE cartodb_id IN (
  SELECT
    a.cartodb_id
  FROM
    crashes_all_prod AS a LEFT JOIN
    nyc_borough AS b ON
    ST_Intersects(b.the_geom, a.the_geom)
  WHERE b.cartodb_id IS NULL
  AND a.the_geom IS NOT NULL
)

Experiment with NYC geocoder for missing geoms

After the most recent round of fixes for null geometries, we still have about 10% of records with no geometries:

SELECT count(cartodb_id) FROM crashes_all_prod 

1,950,062 crashes in the whole universe.

SELECT count(cartodb_id) FROM crashes_all_prod 
where 
latitude is  null

187,456 without geometry. That's 9.6%

Of those, 40,557 have neither on_street nor off_street, so they could never be located. That's 2% of total.

SELECT count(cartodb_id) FROM crashes_all_prod 
where 
latitude is  null and 
length(on_street_name)=0 and 
length(off_street_name)=0

35,483 have both cross streets and could be candidates for geocoding. That's 1.8% of the total.

SELECT count(cartodb_id) FROM crashes_all_prod 
where 
latitude is  null and 
length(on_street_name)>0 and 
length(off_street_name)>0

111,416 have either on_street or off_street but not both. That's 5.7% of total.

SELECT count(cartodb_id) FROM crashes_all_prod 
where 
latitude is  null and 
(length(on_street_name)=0 OR 
length(off_street_name)=0) and not
(length(on_street_name)=0 AND 
length(off_street_name)=0)

That's the hardest set. In some cases, one could get to borough and maybe even smaller, but geocoding these would put a point at a location on the map that is almost certainly inaccurate. Most geocoders would pick the geospatial center of the length of a street or avenue. This point would then be included in all area calculations and fall into whatever boundaries happen to overlap with that center. This is a common problem where, for example, the center of the US, somewhere in Kansas, gets assigned all kinds of stuff that's clearly not there.

I recommended to Christine we not pursue this, but if she wants the 1.8% increase, the next step is a manual try at geocoding those to see how the results look.

┆Issue is synchronized with this Asana task

New ETL script, similar to main.py, but for Walk Mapper

We need to develop a script to consume data from a new database, and then run it through a similar set of intersections as we do for crashes in this code:

def update_borough():
    """
    SQL query to update the borough column in the crashes table
    """
    logger.info('Cleanup update_borough()')
    sql = '''
    UPDATE {0}
    SET borough = a.borough
    FROM nyc_borough a
    WHERE {0}.the_geom IS NOT NULL AND ST_Within({0}.the_geom, a.the_geom)
    AND ({0}.borough IS NULL OR {0}.borough='')
    '''.format(CARTO_CRASHES_TABLE)
    return sql

def update_city_council():
    """
    SQL query to update the city_council column in the crashes table
    """
    logger.info('Cleanup update_city_council()')
    sql = '''
    UPDATE {0}
    SET city_council = a.identifier
    FROM nyc_city_council a
    WHERE {0}.the_geom IS NOT NULL AND ST_Within({0}.the_geom, a.the_geom)
    AND ({0}.city_council IS NULL)
    '''.format(CARTO_CRASHES_TABLE)
    return sql

def update_senate():
    """
    SQL query to update the senate column in the crashes table
    """
    logger.info('Cleanup update_senate()')
    sql = '''
    UPDATE {0}
    SET senate = a.identifier
    FROM nyc_senate a
    WHERE {0}.the_geom IS NOT NULL AND ST_Within({0}.the_geom, a.the_geom)
    AND ({0}.senate IS NULL)
    '''.format(CARTO_CRASHES_TABLE)
    return sql

def update_assembly():
    """
    SQL query to update the assembly column in the crashes table
    """
    logger.info('Cleanup update_assembly()')
    sql = '''
    UPDATE {0}
    SET assembly = a.identifier
    FROM nyc_assembly a
    WHERE {0}.the_geom IS NOT NULL AND ST_Within({0}.the_geom, a.the_geom)
    AND ({0}.assembly IS NULL)
    '''.format(CARTO_CRASHES_TABLE)
    return sql

def update_community_board():
    """
    SQL query to update the community_board column in the crashes table
    """
    logger.info('Cleanup update_community_board()')
    sql = '''
    UPDATE {0}
    SET community_board = a.identifier
    FROM nyc_community_board a
    WHERE {0}.the_geom IS NOT NULL AND ST_Within({0}.the_geom, a.the_geom)
    AND ({0}.community_board IS NULL)
    '''.format(CARTO_CRASHES_TABLE)
    return sql

def update_neighborhood():
    """
    SQL query to update the neighborhood column in the crashes table
    """
    logger.info('Cleanup update_neighborhood()')
    sql = '''
    UPDATE {0}
    SET neighborhood = a.identifier
    FROM nyc_neighborhood a
    WHERE {0}.the_geom IS NOT NULL AND ST_Within({0}.the_geom, a.the_geom)
    AND ({0}.neighborhood IS NULL OR {0}.neighborhood='')
    '''.format(CARTO_CRASHES_TABLE)
    return sql

def update_nypd_precinct():
    """
    SQL query to update the nypd_precinct column in the crashes table
    """
    logger.info('Cleanup update_nypd_precinct()')
    sql = '''
    UPDATE {0}
    SET nypd_precinct = a.identifier::int
    FROM nyc_nypd_precinct a
    WHERE {0}.the_geom IS NOT NULL AND ST_Within({0}.the_geom, a.the_geom)
    AND ({0}.nypd_precinct IS NULL)
    '''.format(CARTO_CRASHES_TABLE)
    return sql
Plus a new Business Improvement Districts assignment, as described in #30.

The new application is called "WalkMapper" and it is a crowdsourcing tool that will allow New Yorkers to log all kinds of sidewalk problems and obstructions, and then it will auto-file complaints for them. Our remit is to get the data into CARTO for later mapping. In this phase, the only end-user map will be built with CARTO's GUI tools.

There is very little data in this system as of now. Unlike Crashmapper, which draws from a massive public dataset, this is a new app that will build up data over time.

Steps as I understand them:

  1. Confirm programmatic access to the database using new ENV vars based on the info in Walk Mapper CHEKPEDS app DB credentials. I have tested access via a desktop GUI called Sequel Pro and was able to inspect the data. My hope is we can use config vars and Python scripts to connect directly to the database. There's really no API, though. My assumption is we can connect directly and run SQL queries, but that might take some R+D on how to do that in a Heroku-friendly way.
  2. Access the core tables: obstructionDetails and obstructionImageDetails and categoryMaster
  3. Spec out the fields we'll need in CARTO -- this will likely need to be a separate issue once you get the basic data connection figured out.

For the CARTO GUI map, we'd want to be able to show Obstruction location, date reported, Category/type, geometry intersections (borough, city council, etc), and one or more images. We can ignore any assignments of borough or other boundaries in the application data since we'll get better data through intersection. And then we also need a smart way to pull in the file names of related images. If we can do this in an array, then we can include multiple image names in the same field and not have to recreate the whole relational table structure on our side.
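A sketch of what pulling those image names in one pass might look like, using the same MySQLdb connection style as mysql2carto.py and the DB_* values from the ENV vars in step 1; the column names beyond the table names listed above are guesses for illustration:

# Sketch: one row per obstruction, with its image file names aggregated into one field.
import MySQLdb
import MySQLdb.cursors

db = MySQLdb.connect(host=DB_HOST, port=int(DB_PORT), user=DB_USER,
                     password=DB_PASS, database=DB_NAME,
                     cursorclass=MySQLdb.cursors.DictCursor)
cur = db.cursor()
cur.execute("""
    SELECT d.obstructionId, d.latitude, d.longitude, d.reportedDate,
           GROUP_CONCAT(i.imageFileName) AS image_files
    FROM obstructionDetails d
    LEFT JOIN obstructionImageDetails i ON i.obstructionId = d.obstructionId
    GROUP BY d.obstructionId
""")
rows = cur.fetchall()   # image_files comes back as a comma-separated string per row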

For now, take a look around and make sure you can access the database and see the way forward to consume data and write to CARTO.

This project bills to: CHEKPEDS:Walk Mapper Phase 1

How to deal with historic updates to NYC Open Data?

It appears that the vehicle collision data on the NYC Open Data Portal is edited after it is published. For example, previously published rows may be removed and values for columns in other rows may be updated, even for rows that date back over a year from the current date.

How do we know when previously published data is updated?

What is the best way to keep the table in CARTO in sync with the updated data?

  • Updates aren't as big of a deal, as an UPSERT query can be used

  • Removing rows gets trickier, we can use the unique_key from Socrata, but will need to get values for every row in Socrata each time the database is updated. Could then do a DELETE with a WHERE unique_key NOT IN socrata_id.

What if a mistake is made by someone at Socrata (such as incorrect rows being deleted) and it should not be synced to the table at CARTO?
