odm2 / odm2-performance-optimization

A repo to share and discuss methods for optimizing performance of ODM2 databases and software.

odm2-performance-optimization's Introduction

ODM2

The next version of the Observations Data Model.

For more information about the ODM2 development project, visit the wiki.

Have a look at the ODM2 paper in Environmental Modelling & Software. It's open access!

Horsburgh, J. S., Aufdenkampe, A. K., Mayorga, E., Lehnert, K. A., Hsu, L., Song, L., Spackman Jones, A., Damiano, S. G., Tarboton, D. G., Valentine, D., Zaslavsky, I., Whitenack, T. (2016). Observations Data Model 2: A community information model for spatially discrete Earth observations, Environmental Modelling & Software, 79, 55-74, http://dx.doi.org/10.1016/j.envsoft.2016.01.010

If you are interested in learning more about how ODM2 supports different use cases, have a look at our recent paper in the Data Science Journal.

Hsu, L., Mayorga, E., Horsburgh, J. S., Carter, M. R., Lehnert, K. A., Brantley, S. L. (2017), Enhancing Interoperability and Capabilities of Earth Science Data using the Observations Data Model 2 (ODM2), Data Science Journal, 16(4), 1-16, http://dx.doi.org/10.5334/dsj-2017-004.

Getting Started with ODM2

SQL scripts for generating blank ODM2 databases can be found at the following locations:

View Documentation of ODM2 Concepts

For more information on ODM2 concepts, examples, best practices, the ODM2 software ecosystem, etc., visit the Documentation page on the wiki.

View Diagrams and Documentation of the ODM2 Schema

Schema diagrams for the current version of the ODM2 schema are at:

Entity Relationship Diagrams

Data Use Cases

The following data use cases are available. We have focused on designing ODM2 to support these data use cases. Available code and documentation show how these data use cases were mapped to ODM2.

Little Bear River - Hydrologic time series and water quality samples from an ODM 1.1.1 database. Implements an ODM2 database in Microsoft SQL Server.
PRISM-XAN - Water quality depth profiles and samples from Puget Sound. Implements an ODM2 database in PostgreSQL.

Our Goal with ODM2

We are working to develop a community information model to extend interoperability of spatially discrete, feature-based Earth observations derived from sensors and samples and to improve the capture, sharing, and archival of these data. This information model, called ODM2, is being designed from a general perspective, with extensibility for achieving interoperability across multiple disciplines and systems that support publication of Earth observations.

Credits

This work was supported by National Science Foundation Grant EAR-1224638. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

ODM2 draws heavily from our prior work with the CUAHSI Hydrologic Information System and ODM 1.1.1 (Horsburgh et al., 2008; Horsburgh and Tarboton, 2008), our experiences working on the Critical Zone Observatory Integrated Data Management System (CZOData), and our experiences with the EarthChem systems (e.g., Lehnert et al., 2007; Lehnert et al., 2009). It also extensively uses concepts from the Open Geospatial Consortium's Observations & Measurements standard (Cox, 2007a; Cox, 2007b; Cox, 2011a; Cox, 2011b; ISO, 2011).

References

See a full list of ODM2 related references

Cox, S.J.D. (2007a). Observations and Measurements - Part 1 - Observation schema, OGC Implementation Specification, OGC 07-022r1. 73 + xi. http://portal.opengeospatial.org/files/22466.

Cox, S.J.D. (2007b). Observations and Measurements – Part 2 - Sampling Features, OGC Implementation Specification, OGC 07-002r3. 36 + ix. http://portal.opengeospatial.org/files/22467.

Cox, S.J.D. (2011a). Geographic Information - Observations and Measurements, OGC Abstract Specification Topic 20 (same as ISO 19156:2011), OGC 10-004r3. 54. http://dx.doi.org/10.13140/2.1.1142.3042.

Cox, S.J.D. (2011b). Observations and Measurements - XML Implementation, OGC Implementation Standard, OGC 10-025r1. 66 + x. http://portal.opengeospatial.org/files/41510 (accessed September 16, 2014).

Horsburgh, J.S., D.G. Tarboton, D.R. Maidment, and I. Zaslavsky (2008). A relational model for environmental and water resources data, Water Resources Research, 44, W05406, http://dx.doi.org/10.1029/2007WR006392.

Horsburgh, J.S., D.G. Tarboton (2008). CUAHSI Community Observations Data Model (ODM) Version 1.1.1 Design Specifications, CUAHSI Open Source Software Tools, http://www.codeplex.com/Download?ProjectName=HydroServer&DownloadId=349176.

ISO 19156:2011 - Geographic information -- Observations and Measurements, International Standard (2011), International Organization for Standardization, Geneva. http://dx.doi.org/10.13140/2.1.1142.3042.

Lehnert, K.A., Walker, D., Vinay, S., Djapic, B., Ash, J., Falk, B. (2007). Community-Based Development of Standards for Geochemical and Geochronological Data, Eos Trans. AGU, 88(52), Fall Meet. Suppl., Abstract IN52A-09.

Lehnert, K.A., Walker, D., Block, K.A., Ash, J.M., Chan, C. (2009). EarthChem: Next developments to meet new demands, American Geophysical Union, Fall Meeting 2009, Abstract #V12C-01.

odm2-performance-optimization's Issues

view takes 20 times as long with annotations

This view https://github.com/ODM2/ODM2-performance-optimization/blob/master/timeseriesresultvaluesextwannotations.sql

takes over 20 times longer than this view https://github.com/ODM2/ODM2-performance-optimization/blob/master/timeseriesresultvaluesext.sql

I'm trying to create a view for exporting data that includes time series result annotations for data quality, but the view with annotations takes over 20 times as long as the one without. Any suggestions on a better approach? @smrgeoinfo maybe?

The only thing I can think to do right now is to drop the left joins and deliver the time series values with annotations as a separate file, but I don't really want to do that.
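One approach that may be worth testing is to collapse the annotations to one row per value in a subquery and left join that result, with an index on the bridge table's join column. The sketch below is not the view from this repository: it uses SQLite through Python's standard library, heavily simplified versions of the ODM2 tables timeseriesresultvalues, timeseriesresultvalueannotations, and annotations, and made-up sample data. Whether it helps in PostgreSQL would need to be checked against the real query plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for the ODM2 tables; real ODM2 tables have more columns.
CREATE TABLE timeseriesresultvalues (
    valueid INTEGER PRIMARY KEY,
    resultid INTEGER NOT NULL,
    datavalue REAL NOT NULL,
    valuedatetime TEXT NOT NULL
);
CREATE TABLE annotations (
    annotationid INTEGER PRIMARY KEY,
    annotationcode TEXT NOT NULL
);
CREATE TABLE timeseriesresultvalueannotations (
    bridgeid INTEGER PRIMARY KEY,
    valueid INTEGER NOT NULL REFERENCES timeseriesresultvalues (valueid),
    annotationid INTEGER NOT NULL REFERENCES annotations (annotationid)
);
-- Index the bridge table on the join column so the lookup avoids a full scan.
CREATE INDEX idx_tsrva_valueid ON timeseriesresultvalueannotations (valueid);

-- Made-up sample data: only value 3 carries annotations.
INSERT INTO timeseriesresultvalues VALUES
    (1, 10, 4.2, '2017-01-01 00:00'),
    (2, 10, 4.5, '2017-01-01 00:15'),
    (3, 10, -9999.0, '2017-01-01 00:30');
INSERT INTO annotations VALUES (1, 'suspect'), (2, 'sensor_fouling');
INSERT INTO timeseriesresultvalueannotations VALUES (1, 3, 1), (2, 3, 2);
""")

# Aggregate annotations to one row per value first, then left join once.
# Every value appears exactly once, annotated or not.
rows = conn.execute("""
    SELECT v.valueid, v.datavalue, a.codes
    FROM timeseriesresultvalues AS v
    LEFT JOIN (
        SELECT b.valueid, group_concat(ann.annotationcode, ';') AS codes
        FROM timeseriesresultvalueannotations AS b
        JOIN annotations AS ann ON ann.annotationid = b.annotationid
        GROUP BY b.valueid
    ) AS a ON a.valueid = v.valueid
    ORDER BY v.valueid
""").fetchall()

for row in rows:
    print(row)
```

Because the annotations are collapsed before the join, the export keeps one row per data value, which avoids splitting annotated values into a separate file.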

flattened time series views

I added two PostgreSQL views for the flattened time series I'm using in ODM2 Admin. I converted one of these views to be compatible with SQLite and tried it out. The names of the views are probably not ideal, but the table names are already so long that adding more words to timeseriesresultvalues seems excessive. Is there a better way to name these?

Maybe someone could add versions of similar views for MSSQL and MySQL? Who is using MySQL?

https://github.com/ODM2/ODM2-performance-optimization/tree/master/PostgreSQL/time%20series%20flattened%20views

https://github.com/ODM2/ODM2-performance-optimization/tree/master/SQLite/time%20series%20flattened%20views

Try HDF5 / PyTables / Pandas integration to speed time series I/O

One of the ideas I had back in 2012-2013 when we were developing ODM2 was to use the HDF5 file format in certain cases to improve performance, because of the benefits of HDF5:

High performance read/write of files, especially for very large files (much faster than text formats, such as CSV, JSON or XML).
Compressed binary format for portable files that is very space efficient, on disk and for exchange.
Supports data slicing of files that are bigger than memory.
Hierarchical Data Format (HDF) can contain simple dataset structures, has self-describing metadata, and can support heterogeneous data types.
Note that NetCDF-4 uses an HDF5 container.

The two use cases I had in mind were:

High performance web services, to exchange or deliver ODM2 "Datasets" via specialized web services (because read/write is fast and the format is compact, so it doesn't take as much I/O bandwidth). I think a YODA file, in a tabular array format, could be put in an HDF5 container.
High performance database functionality, via a hybrid of a standard RDBMS ODM2 instance that stores TimeSeriesResults in HDF5 files. At the time, it seemed that a lot of people doing nuclear physics were using similar approaches. Also, Aquatic Informatics does something like this, storing all their time series data in a "proprietary" binary file that their MS SQL Server database points to. Roelof Versteeg has also done this for very large datasets.

Given that the ODM2PythonAPI uses the Pandas library, I think we could tap into HDF5 very easily for one or more of these uses. Here are a few links to information:

In writing this, I have come across some recent posts from people who are not happy with HDF5 (such as this: http://cyrille.rossant.net/moving-away-hdf5/ or https://www.rustprooflabs.com/2014/11/data-processing-in-python). I read the comments on the first of these two articles, and it sounds like most of the issues have been with improper direct use of the C library (and the lack of a JavaScript library).

People who use the Python libraries (h5py and PyTables) seem to have very positive experiences. The facts that PyTables is actively supported by ContinuumIO and NumFocus, that it is an important package in the SciPy ecosystem, and that HDF5 underlies Matlab's v7.3 MAT-file format all suggest that HDF5 is still well loved and useful to many.

If we pursued the use of HDF5, we would want to use refined libraries (such as the Pandas/PyTables integration) and to proceed in small steps, such as in the time-series data caching work that Jeff was just describing to me related to improving the CSV delivery of EnviroDIY datasets.

I'm interested in your thoughts!

cc: @horsburgh, @sreeder, @emiliom, @lsetiawan, @miguelcleon

NVMe SSD improves database scan times exponentially

I recently got a new desktop with an NVMe SSD and I'm getting huge performance increases in query times. A query for a time series with about 140,000 values takes about 11.6 seconds on a server, while it takes 0.334 seconds on the desktop with the NVMe SSD. This appears to be due to a data throughput of about 2.7 GB/s on the desktop vs 95 to 100 MB/s on the server. This results in a massive decrease in the amount of time it takes to scan the database for the requested data.
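As a rough consistency check on the figures quoted above, the query speedup can be compared with the throughput ratio. The sketch below uses only the numbers from this post. It assumes, purely for illustration, that the query is I/O bound, so that elapsed time multiplied by throughput approximates the amount of data scanned; caching, CPU time, and differences between the two database instances are ignored.

```python
# Figures quoted in the post.
server_seconds = 11.6
desktop_seconds = 0.334
nvme_mb_per_s = 2700.0            # "about 2.7 GB/s", taking 1 GB = 1000 MB
server_mb_per_s = (95.0, 100.0)   # "95 to 100 MB/s"

# How much faster the query ran, and how much faster the disk is.
speedup = server_seconds / desktop_seconds
throughput_ratio = tuple(nvme_mb_per_s / s for s in server_mb_per_s)

# Assumption: purely I/O-bound query, so time * throughput ~ data scanned.
server_mb_scanned = tuple(server_seconds * s for s in server_mb_per_s)
desktop_mb_scanned = desktop_seconds * nvme_mb_per_s

print(f"query speedup:          {speedup:.1f}x")
print(f"throughput ratio:       {min(throughput_ratio):.1f}x to {max(throughput_ratio):.1f}x")
print(f"implied scan (server):  {min(server_mb_scanned):.0f} to {max(server_mb_scanned):.0f} MB")
print(f"implied scan (desktop): {desktop_mb_scanned:.0f} MB")
```

The speedup (about 35x) is of the same order as the throughput ratio (about 27x to 28x), and both machines imply a scan on the order of 1 GB, which is consistent with the explanation that disk throughput dominates the query time.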

Try materialized views to speed time series I/O

Roelof Versteeg (@roelofversteeg) gave the ODM2 team a presentation on Aug. 29 that described how he massively improved performance for fetching time series data by creating materialized views in ODM2 (using MySQL on 3-year-old hardware). These are the notes from that meeting:

ODM2 Performance Optimization

Time series
- 5 or 15 min intervals, via CampbellSci LoggerNet
- Lots of sensors per station
Enters DB in very simple flatfile structure, similar to LoggerNet files
Subsequent mapping to get proper metadata
- Into ODM2?
- Continuously runs Python script
Create materialized view of auto-QAQCed data (via Python scripts)
- Optimized in MySQL
- Regular Views didn’t perform
- Jeff: how do you create/manage these in MySQL?
  - Continuously runs Python script
- Database is bigger, but all running well on 3-year-old hardware
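MySQL has no native materialized views, so the pattern in these notes amounts to an ordinary table that a script rebuilds from the base tables. The sketch below illustrates that general pattern only; it is not Roelof's implementation. It uses SQLite through Python's standard library as a stand-in for MySQL, simplified versions of the ODM2 tables, a hypothetical table name (tsrv_materialized), and a simple filter on qualitycodecv with made-up values standing in for the auto-QAQC step.

```python
import sqlite3


def refresh_materialized(conn: sqlite3.Connection) -> int:
    """Rebuild the flattened table from the base tables; return its row count."""
    with conn:  # single transaction: the rebuild is committed all at once
        conn.execute("DELETE FROM tsrv_materialized")
        conn.execute("""
            INSERT INTO tsrv_materialized
                (valueid, valuedatetime, datavalue, variablecode)
            SELECT v.valueid, v.valuedatetime, v.datavalue, var.variablecode
            FROM timeseriesresultvalues AS v
            JOIN results AS r ON r.resultid = v.resultid
            JOIN variables AS var ON var.variableid = r.variableid
            WHERE v.qualitycodecv = 'good'  -- illustrative QA/QC filter
        """)
    return conn.execute("SELECT count(*) FROM tsrv_materialized").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for the ODM2 tables.
CREATE TABLE variables (variableid INTEGER PRIMARY KEY, variablecode TEXT NOT NULL);
CREATE TABLE results (
    resultid INTEGER PRIMARY KEY,
    variableid INTEGER NOT NULL REFERENCES variables (variableid)
);
CREATE TABLE timeseriesresultvalues (
    valueid INTEGER PRIMARY KEY,
    resultid INTEGER NOT NULL REFERENCES results (resultid),
    datavalue REAL NOT NULL,
    valuedatetime TEXT NOT NULL,
    qualitycodecv TEXT NOT NULL
);
-- Hypothetical name: a plain table that plays the role of the materialized view.
CREATE TABLE tsrv_materialized (
    valueid INTEGER PRIMARY KEY,
    valuedatetime TEXT NOT NULL,
    datavalue REAL NOT NULL,
    variablecode TEXT NOT NULL
);
-- Index chosen for the typical read: one variable over a time range.
CREATE INDEX idx_tsrv_mat_code_time ON tsrv_materialized (variablecode, valuedatetime);

INSERT INTO variables VALUES (1, 'WaterTemp');
INSERT INTO results VALUES (10, 1);
INSERT INTO timeseriesresultvalues VALUES
    (1, 10, 4.2, '2017-01-01 00:00', 'good'),
    (2, 10, -9999.0, '2017-01-01 00:15', 'bad'),
    (3, 10, 4.5, '2017-01-01 00:30', 'good');
""")

first_count = refresh_materialized(conn)

# New data arrives; the next scheduled refresh picks it up.
conn.execute(
    "INSERT INTO timeseriesresultvalues VALUES (4, 10, 4.6, '2017-01-01 00:45', 'good')"
)
conn.commit()
second_count = refresh_materialized(conn)

print(first_count, second_count)
```

A full rebuild is the simplest refresh strategy; a continuously running script on a large database would more likely append only the rows newer than the last refresh. The trade-off noted above still applies: the database is bigger because the flattened data is stored twice.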

Try TimeScaleDB extension to PostgreSQL

@emiliom suggested that we might explore the use of the new open-source TimeScaleDB extension to PostgreSQL to massively improve the performance of streaming time series data.

See: