cmaq-exposure-api's Introduction

cmaq-exposure-api

The CMAQ Exposure API is a RESTful data service, implemented with Swagger using the OpenAPI 2.0 specification, that provides environmental CMAQ exposure data for a given geocode (latitude, longitude) and date range.

TL;DR

cd cmaq-exposure-api
./run-cmaq-api.sh

Open http://localhost:5000/v1/ui/#/default in your browser when the script completes.
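Behind the UI is a plain HTTP interface; a values request URL can be composed as below. The /values path and the parameter names are assumptions drawn from the issue discussion in this document and should be checked against the served spec:

```python
from urllib.parse import urlencode

BASE = "http://localhost:5000/v1"  # local dev server started by run-cmaq-api.sh

def values_url(variable, lat, lon, start, end, **extra):
    """Compose a /values request URL; parameter names are illustrative."""
    params = {"exposure_type": variable, "latitude": lat, "longitude": lon,
              "start_date": start, "end_date": end}
    params.update(extra)
    return BASE + "/values?" + urlencode(params)

print(values_url("o3", 35.9132, -79.0558, "2011-01-01", "2011-01-07"))
```

The resulting URL can be pasted into a browser or fetched with curl once the server is running.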

Development Environment

Preliminary assumptions

  • Docker and docker-compose are available on the host

    • Generally executed using a bash script that performs docker or docker-compose calls
  • Python 3 is available on the host

    • Generally executed using virtualenv as follows:

       $ virtualenv -p /PATH_TO/python3 venv
       $ source venv/bin/activate
       (venv)$ pip install -r requirements.txt
       (venv)$ python SOMETHING.py [/PATH_TO/SOME_FILE]
      

Repository structure

The repository is divided into multiple sections based on the infrastructure, application, or task being addressed. Each section is briefly described here, with a more detailed overview in a README.md file at each primary directory level.

PostgreSQL 9.6 / PostGIS 2.3:

  • Docker-compose based development database
  • See README.md in postgres96/

Sample Data:

  • Initialization scripts for PostgreSQL cmaq database and tables
  • Representative CMAQ data in SQL format
  • See README.md in data-sample/

Data Tools:

  • pre-ingest: checks to validate CMAQ source data against the PostgreSQL database schema
  • ingest: scripts for reading the CMAQ source data into the PostgreSQL database
  • postgres-functions: indexes and function generation tools
  • post-ingest: scripts for updating the aggregate values of newly ingested data
  • See README.md in data-tools/

Server

  • Python3/Flask based API server
  • Docker implementation of the API server
  • See README.md in server/

Client

  • TODO

Swagger Editor

  • TODO

See INSTALL.md for full details.

About CMAQ / CMAS

CMAQ is an active open-source development project of the U.S. EPA that consists of a suite of programs for conducting air quality model simulations. CMAQ is supported and distributed by the Community Modeling and Analysis System (CMAS) Center.

CMAQ combines current knowledge in atmospheric science and air quality modeling with multi-processor computing techniques in an open-source framework to deliver fast, technically sound estimates of ozone, particulates, toxics, and acid deposition.

cmaq-exposure-api's People

Contributors

lstillwe, mjstealey

cmaq-exposure-api's Issues

provide bravado client example

Though standalone clients can be generated using Swagger Codegen, Python users find the bravado package useful for generating the API client on the fly.
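A minimal bravado sketch might look like the following. The spec URL assumes the server publishes its Swagger 2.0 spec at /v1/swagger.json, which should be verified against the running server; `make_client` is shown but not invoked here since it requires a live API:

```python
def spec_url(host="localhost", port=5000):
    """Location of the Swagger 2.0 spec served by the API (assumed path)."""
    return "http://%s:%d/v1/swagger.json" % (host, port)

def make_client(url=None):
    """Build a client on the fly from the live spec (requires `pip install bravado`)."""
    from bravado.client import SwaggerClient
    return SwaggerClient.from_url(url or spec_url())
```

With a server running, `make_client()` returns a client whose operations are named after the operationId fields in the spec.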

script for updating common_name attribute of exposure_list table

Create a script to update the common_name attribute for CMAQ exposure variables.

  • The update-cmaq-tables.py script populates the exposure_list table based on attributes discovered in the CMAQ source data files.
  • It does not, however, have the ability to update the common_name attribute, as that value will be defined manually by @arunacs for the variables of interest.

Example of existing exposure_list table:

 id |   type    |           description           | units | common_name |  utc_min_date_time  |  utc_max_date_time  |     resolution      | aggregation
----+-----------+---------------------------------+-------+-------------+---------------------+---------------------+---------------------+-------------
  1 | ald2      | 1000.0*ALD2[1]                  | ppbV  |             | 2011-01-01 01:00:00 | 2011-02-01 01:00:00 | hour;day;7day;14day | max;avg
  2 | aldx      | 1000.0*ALDX[1]                  | ppbV  |             | 2011-01-01 01:00:00 | 2011-02-01 01:00:00 | hour;day;7day;14day | max;avg
  3 | benzene   | 1000.0*BENZENE[1]               | ppbV  |             | 2011-01-01 01:00:00 | 2011-02-01 01:00:00 | hour;day;7day;14day | max;avg
  4 | co        | 1000.0*CO[1]                    | ppbV  |             | 2011-01-01 01:00:00 | 2011-02-01 01:00:00 | hour;day;7day;14day | max;avg
  5 | eth       | 1000.0*ETH[1]                   | ppbV  |             | 2011-01-01 01:00:00 | 2011-02-01 01:00:00 | hour;day;7day;14day | max;avg
  6 | etha      | 1000.0*ETHA[1]                  | ppbV  |             | 2011-01-01 01:00:00 | 2011-02-01 01:00:00 | hour;day;7day;14day | max;avg
  7 | form      | 1000.0*FORM[1]                  | ppbV  |             | 2011-01-01 01:00:00 | 2011-02-01 01:00:00 | hour;day;7day;14day | max;avg
  8 | h2o2      | 1000.0*H2O2[1]                  | ppbV  |             | 2011-01-01 01:00:00 | 2011-02-01 01:00:00 | hour;day;7day;14day | max;avg
  9 | hno3      | 1000.0*HNO3[1]                  | ppbV  |             | 2011-01-01 01:00:00 | 2011-02-01 01:00:00 | hour;day;7day;14day | max;avg
 10 | hno3_ugm3 | 1000.0*(HNO3[1]*2.1756*DENS[3]) | ug/m3 |             | 2011-01-01 01:00:00 | 2011-02-01 01:00:00 | hour;day;7day;14day | max;avg
(10 rows)
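A sketch of such a script: a hand-maintained mapping drives parameterized UPDATE statements against exposure_list. The common names below are illustrative placeholders (the real values come from @arunacs), and `apply_common_names` is defined but not executed since it needs a reachable cmaq database:

```python
# Illustrative mapping from exposure_list.type to a human-readable common_name.
COMMON_NAMES = {
    "co": "carbon monoxide",
    "benzene": "benzene",
    "form": "formaldehyde",
}

UPDATE_SQL = "UPDATE exposure_list SET common_name = %s WHERE type = %s"

def update_params(mapping):
    """Turn the mapping into (common_name, type) pairs for executemany."""
    return [(name, var_type) for var_type, name in mapping.items()]

def apply_common_names(mapping, dsn="dbname=cmaq"):
    """Apply the mapping with psycopg2 (requires a reachable cmaq database)."""
    import psycopg2
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.executemany(UPDATE_SQL, update_params(mapping))
```

Keeping the mapping in one dict makes the manual curation step a simple edit-and-rerun.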

quality metrics retrieval flag - default to false

Allow the user to choose whether to retrieve quality metrics for the CMAQ variables that have them.

  • Add a new parameter named include_quality_metric to the /values path, defaulting to false

Per discussion with @hyi re: hackathon experience with @arunacs

  • There was interest in retrieving CMAQ data in yearly increments for ozone; however, the existing quality metric implementation slows the query to the point of being ineffective at that scale.
  • Issue #18 has been created to address speeding up quality metrics queries
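In a Flask/connexion handler, the new flag could be surfaced as an optional boolean defaulting to false. The parameter name comes from this issue; the handler shape and the `parse_flag` helper are assumptions:

```python
def parse_flag(value, default=False):
    """Interpret a query-string boolean such as include_quality_metric."""
    if value is None:
        return default
    return str(value).strip().lower() in ("1", "true", "yes")

def get_values(latitude, longitude, include_quality_metric="false", **kwargs):
    """Sketch of a /values handler: quality metrics are skipped unless requested."""
    want_qm = parse_flag(include_quality_metric)
    # ... query CMAQ values; join in quality metrics only when want_qm is True ...
    return {"include_quality_metric": want_qm}
```

Defaulting to false means the slow quality-metric path is only taken when a caller opts in.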

refactor CMAQ ingest to be more efficient / faster

  • The existing implementation works well for small data sets, but takes weeks (or longer) to ingest data formatted at the 299 x 459 grid size

Example:

Processes started to ingest the 2011 CMAQ data:

25124 Sat Dec 23 20:24:15 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_
25268 Sat Dec 23 20:40:27 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_
25269 Sat Dec 23 20:40:27 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_
25270 Sat Dec 23 20:40:27 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_
25271 Sat Dec 23 20:40:27 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_
25272 Sat Dec 23 20:40:27 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_
25273 Sat Dec 23 20:40:27 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_
25274 Sat Dec 23 20:40:27 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_
25275 Sat Dec 23 20:40:27 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_
25276 Sat Dec 23 20:40:27 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_
25277 Sat Dec 23 20:40:27 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_
25278 Sat Dec 23 20:40:27 2017 ./venv/bin/python ./ingest-cmaq-file.py /projects/datatrans/CMAQ/2011/raw/CCTM_CMAQ_v51_Release_Oct23_NoDust_

As of 2018-01-16, the most populated set is January 2011, with 90 of 459 columns completed; the remaining months have between 21 and 24 of 459 columns completed.

psql (9.6.6)
Type "help" for help.

cmaq=# select utc_date_time::date, max(row), max(col) from exposure_data where utc_date_time::date >= '2011-01-01' group by utc_date_time::date order by utc_date_time::date;
 utc_date_time | max | max
---------------+-----+-----
 2011-01-01    | 299 |  90
 2011-01-02    | 299 |  90
 2011-01-03    | 299 |  90
 2011-01-04    | 299 |  90
 2011-01-05    | 299 |  90
 2011-01-06    | 299 |  90
 2011-01-07    | 299 |  90
 2011-01-08    | 299 |  90
 2011-01-09    | 299 |  90
 2011-01-10    | 299 |  90
 2011-01-11    | 299 |  90
 2011-01-12    | 299 |  90
 2011-01-13    | 299 |  90
 2011-01-14    | 299 |  90
 2011-01-15    | 299 |  90
 2011-01-16    | 299 |  90
 2011-01-17    | 299 |  90
 2011-01-18    | 299 |  90
 2011-01-19    | 299 |  90
 2011-01-20    | 299 |  90
 2011-01-21    | 299 |  90
 2011-01-22    | 299 |  90
 2011-01-23    | 299 |  90
 2011-01-24    | 299 |  90
 2011-01-25    | 299 |  90
 2011-01-26    | 299 |  90
 2011-01-27    | 299 |  90
 2011-01-28    | 299 |  90
 2011-01-29    | 299 |  90
 2011-01-30    | 299 |  90
 2011-01-31    | 299 |  90
 2011-02-01    | 299 |  90
 2011-02-02    | 299 |  24
 2011-02-03    | 299 |  24
 2011-02-04    | 299 |  24
 2011-02-05    | 299 |  24
 2011-02-06    | 299 |  24
 2011-02-07    | 299 |  24
...
 2011-12-22    | 299 |  21
 2011-12-23    | 299 |  21
 2011-12-24    | 299 |  21
 2011-12-25    | 299 |  21
 2011-12-26    | 299 |  21
 2011-12-27    | 299 |  21
 2011-12-28    | 299 |  21
 2011-12-29    | 299 |  21
 2011-12-30    | 299 |  21
 2011-12-31    | 299 |  21
 2012-01-01    | 299 |  21
(366 rows)
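One common way to make ingest dramatically faster is to stream rows through PostgreSQL's COPY protocol instead of issuing per-row INSERTs. The helper below serializes row tuples into an in-memory CSV buffer; `copy_rows` is shown but not executed, and the target column list is illustrative:

```python
import csv
import io

def rows_to_csv(rows):
    """Serialize an iterable of row tuples into a CSV buffer suitable for COPY."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    buf.seek(0)
    return buf

# Illustrative target columns; only utc_date_time, row, col are confirmed above.
COPY_SQL = ("COPY exposure_data (utc_date_time, row, col, o3) "
            "FROM STDIN WITH (FORMAT csv)")

def copy_rows(rows, dsn="dbname=cmaq"):
    """Bulk-load rows via COPY (requires a reachable cmaq database)."""
    import psycopg2
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.copy_expert(COPY_SQL, rows_to_csv(rows))
```

COPY moves the parsing work into the server and typically runs orders of magnitude faster than looped INSERTs.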

Use outer join to add quality metrics data instead of loop

Currently, the quality metrics data are queried separately and merged into the CMAQ values data via a utc_date_time comparison in a loop.

This is inefficient; it should instead be done as an outer join within the CMAQ values query itself.

Per discussion with @hyi regarding the hackathon experience with @arunacs: the existing implementation is too slow for queries over a year-long range.

generate NetCDF to PostgreSQL scripts

Bypass the intermediate CSV stage and go directly from the CMAQ NetCDF source files to PostgreSQL database tables.

  • option for batch operation
  • python based
  • should not attempt to modify schema (use pre-ingest scripts instead)
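A sketch of the direct path, assuming the netCDF4 package: read each variable as an array and hand fixed-size batches to the database writer. Only the batching helper is generic; `ingest_file` (file name, variable name, dimension order) is illustrative and not executed here:

```python
def batches(iterable, size):
    """Yield lists of at most `size` items, keeping each DB transaction small."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def ingest_file(path, writer, batch_size=10000):
    """Stream one variable from a NetCDF file to `writer` (requires netCDF4)."""
    from netCDF4 import Dataset
    ds = Dataset(path)
    o3 = ds.variables["O3"][:]  # assumed (time, layer, row, col) layout
    cells = ((t, r, c, float(o3[t, 0, r, c]))
             for t in range(o3.shape[0])
             for r in range(o3.shape[2])
             for c in range(o3.shape[3]))
    for batch in batches(cells, batch_size):
        writer(batch)  # e.g. a COPY-based bulk loader
```

Because the schema is untouched, the pre-ingest scripts remain the single place where tables are defined.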

update server documentation

  • document server section under server/
  • add server refs to README.md at main level
  • add full workflow doc using existing references
    • local development
    • production deployment

generate PostgreSQL functions for pre-calculating exposure aggregates

The majority of queries will request daily, 7-day, or 14-day aggregates of the hourly CMAQ data.

  • create a function that looks for non-calculated aggregates and generates values for them
  • the function should scope the size of each transaction so that it is runnable on a modest system
  • PostgreSQL functions do not commit until the entire transaction completes; any new functions should keep this in mind
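The hour-to-day roll-up at the heart of such a function is a grouped query over the hourly rows. The sketch below uses an in-memory SQLite stand-in with a single `value` column for illustration; the real implementation would be a PostgreSQL function over exposure_data, committed in bounded chunks:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exposure_data (utc_date_time TEXT, row INT, col INT, value REAL);
    INSERT INTO exposure_data VALUES
        ('2011-01-01 01:00:00', 1, 1, 30.0),
        ('2011-01-01 13:00:00', 1, 1, 40.0),
        ('2011-01-02 01:00:00', 1, 1, 20.0);
""")

# Pre-computed daily max/avg per grid cell -- the "day" aggregation with
# "max;avg" shown in the exposure_list table above.
daily = conn.execute("""
    SELECT date(utc_date_time) AS day, row, col,
           MAX(value) AS value_max, AVG(value) AS value_avg
    FROM exposure_data
    GROUP BY day, row, col
    ORDER BY day
""").fetchall()
print(daily)
```

The 7-day and 14-day aggregates follow the same pattern over wider date windows.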
