iquod / autoqc Goto Github PK

View Code? Open in Web Editor NEW

28.0 28.0 16.0 2.19 MB

A testing suite for automatic quality control checks of subsurface ocean temperature observations

License: MIT License

Python 99.70% Shell 0.30%

autoqc's People

Stargazers

Watchers

Forkers

s-good bkatiemills arnaldorusso beccowley kristenvogt mhidas anaymalpani sanakel zqtzt castelao xifengbishu uk-gov-mirror ci-cmg yuanhf

autoqc's Issues

Database changes

Hi @BillMills, just been looking over the database code. Looks really good and will be a big step forward by releasing us from the memory issues! One thing that occurred to me was I was wonder if the database needs to contain the data? We could just store the unique ID, the file the profile came from and location in the file. Then, this information can be used to read the data from file just prior to the QC. That way there wouldn't need to be any modification of the QC tests (apart from maybe the ones that need to use other profiles other than the one being tested). This would avoid the database becoming too large (which might cause performance issues?) and another reason is that a QC test we add in the future may need to use extra information that is not in the database.

If you think this is a good idea I could try to make some changes over the next few days to implement this.

Profiles pre 1900

Hi @castelao - CoTeDe tests are currently producing errors on profiles from years earlier than 1900:

ValueError: year=1773 is before 1900; the datetime strftime() methods require year >= 1900

before I go hacking on these, have you solved this previously for CoTeDe? Let me know if you have a preferred solution.

Bug on DummyCNV. Only affects CoTeDe tests

It was missing a round to define the seconds at DummyCNV. Approximately 50% of the cases would have a 1 second off.

This only affects CoTeDe tests, and it is just a conceptual error. It shouldn't effectively change any result.

Goals for Hamburg

Hi @s-good @BecCowley et al - while we've still got some time, it would be good to set a goal for results to present in Hamburg. Finishing the outstanding tests (#54, #56 & #64) would be a good start, but also having something to present in terms of decision-making performance would be a great thing to be able to report. A machine learning strategy is described in #60, which I'm interested to execute on, but as I mention in #48, we need a larger dataset to train and test on (several thousand profiles).

Let me know what you'd like to show in Hamburg, and we'll try to make it happen.

wod decoding: missing data edge case

When some data is missing from a profile's primary header, the corresponding key is never inserted into the primary header dictionary by _interpret_data.

As a result, functions like

    def time(self):
        """ Returns the time. """
        return self.primary_header['Time']

return a KeyError when they go looking for them; reproduce this by inserting a call to p.time() in any of the qc tests and running on the testing data.

We could solve this by putting exception handling into each downstream function, but I would recommend keeping the keys for every primary header dictionary always exactly the same, and assigning a value of None when something is missing. That way everything remains as consistent as possible - thoughts, @s-good?

historical flag integration testing

The historical flags mentioned in #115 also present a convenient opportunity to do a full integration test on the final CSV produced by AutoQC.

Identify a body of data to analyze

Since a usable version 1 of this project is on the horizon, it would be good to identify a body of data we'd like to analyze in early studies.

kristenvogt: new volunteer via Mozilla Science Lab Collaborate

It sounds neat!

This issue was created by @kristenvogt via Mozilla Science Lab Collaborate

Data unpacker validation

#17 introduced a functioning data unpacker, but now we need a set of tests to ensure it is indeed functioning as it should. Page 137 of http://data.nodc.noaa.gov/woa/WOD/DOC/wodreadme.pdf has an example of some unpacked data we can use for this. So:

replace XBT01966 with the appropriate data file used in the example above
use unpacker.py to unpack this file, and write some tests to ensure the result matches the example.

Probably simplest to continue using unittest to build these tests, but no need to convolve them in the main data testing suite; these software tests should be clearly separate.

recurring depth 0 problems

I think the issue of profiles with all depths listed as 0 has come up before - one other place we need to protect against this is in how we check for missing data. Currently we rely on the numpy mask, and in the postgres branch I'm developing something where we simply check to see if there is a sensible number present; both these methods will be fooled by depths of all zero.

While it's not a problem to check for all zeroes, I want to make a note of this here as this will affect validation of the postgres-based technique, as it will look like it's giving different answers than previous, when in fact the difference is that it's doing its basic validation more correctly (maybe).

database wrangling

I'm beginning to tinker with a proper database to sit behind AutoQC - I'll keep some notes updated in this issue if anyone wants to follow along (also so I don't forget)...

engineering choices

This article seems to suggest that postgres is the better choice for lots of concurrent database interactions, which we will certainly have in a parallelized environment.
the current parallelization scheme is a bit obtuse in its use of map; a different suggestion could look something like the following, in AutoQC.py instead of the calls to parallel_function:

  ...
  pool = Pool(processes=int(sys.argv[2]))
  parallel_result = []
  def log_result(result):
    parallel_result.append(result)

  for i in range(len(filenames)):
    pool.apply_async(processFile, (filenames[i],), callback = log_result)
  pool.close()
  pool.join()
  ...

while this does the same thing as the present implementation, the combination of the for loop and apply_asyncmight be nice when we switch to postgres, since instead of parallelizing over a pre-defined set of files, we can just keep throwing chunks of the database at new parallel processes in the for loop, without loading the whole thing or even having any prior knowledge of how many processes we're going to spawn, like we would need to to form an array to map onto.

postgres notes

On our current docker image, I got postgres installed and running with a database owned by root via the following (which does not exactly match any of the tutorials I found, but rather combines a bunch of them...):

apt-get update
apt-get install -y postgresql
/etc/init.d/postgresql start
su - postgres
createuser -s root
createdb root

to do this from a Dockerfile, the apt-gets are the usual, but the context-specific commands that follow need to be chained together:

RUN /etc/init.d/postgresql start && su postgres -c 'createuser -s root' && su postgres -c 'createdb root'

postgres via python

psycopg2 is a popular way to interact with postgres from python, but it needs a dependency installed first (already integrated into the docker image, reflected in the Dockerfile on branch postgres):

apt-get install -y libpq-dev python-dev
pip install psycopg2

That's all that's needed for the example in postgres/build-db.py on the postgres branch to work properly.

Code up EN spike and step check

As part of an intercomparison of 'spike tests' (#21) code up the spike and step check described at http://www.metoffice.gov.uk/hadobs/en3/OQCpaper.pdf pages 20-21. The program file should be called EN_spike_and_step_check.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py.

Code up Argo regional range test

Code up the regional range test described at http://w3.jcommops.org/FTPRoot/Argo/Doc/argo-quality-control-manual.pdf on page 6. The program file should be called Argo_regional_range_test.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py.

correct temperature conversions between t48 and t68

There's a note in obs_utils.t48tot68() questioning whether the conversion implemented is correct; the same equation is reported here and here; are there other conflicting references?

Implement WOD range check

Code up the range test described at http://data.nodc.noaa.gov/woa/WOD/DOC/wodreadme.pdf on page 47. The program file should be called WOD_range_check.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py.

This requires reading in a data file that is used each time the quality control check is run. It would therefore be most efficient to read this in once in the main program and pass the data to the quality control check.

Code up Argo impossible location test

Code up the impossible location test described at http://w3.jcommops.org/FTPRoot/Argo/Doc/argo-quality-control-manual.pdf on page 6. The program file should be called Argo_impossible_location_test.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py.

EN_increasing_depth shortcut

@s-good, would it make sense to immediately flag all levels in EN_increasing_depth for profiles where every level has depth = 0? This is a practical performance concern, since there are profiles with thousands of levels in them and no depth info; looking at the test as written, I believe that in this case it will construct an NxN matrix, reject the last level, construct an (N-1)x(N-1) matrix.... which gets pretty painful for N O(1000).

running AutoQC on AWS

Pushing all of quota through AutoQC on my home machine takes ~20 hours; it may behoove us to be prepared to run full data reduction runs on AWS. Here are some notes on how I got AutoQC up and running there:

make a free account at AWS
follow the instructions to set up access to EC2 - I followed these exactly, except for the warnings about not allowing ssh from anywhere; there's no security concerns here, easiest to just allow ssh.
After signing in as your IAM user from the last step, go to the 'Services' link in the top navigation bar, and choose:
- EC2
- launch instance
- amazon linux
- t2.micro
- launch

After launching this compute cluster, a link is presented to go back to the list of instances; go there, wait for this new instance to finish initializing, click on it, click connect at the top, and follow the instructions to ssh into your new node.

Once there, set up AutoQC and all its dependencies via the following:

wget https://3230d63b5fc54e62148e-c95ac804525aac4b6dba79b00b39d1d3.ssl.cf1.rackcdn.com/Anaconda-2.3.0-Linux-x86_64.sh
bash Anaconda-2.3.0-Linux-x86_64.sh

(disconnect / reconnect)

sudo yum -y install geos
conda install netcdf4
sudo yum -y install git
git clone https://github.com/IQuOD/AutoQC.git
sudo yum -y install gcc-c++ python27-devel atlas-sse3-devel lapack-devel
pip install -r requirements.txt

After which AutoQC runs happily on the demo dataset it ships with. Next up will be sticking quota on S3, and parallelizing. Should be simple enough to slice quota up into pieces and run in parallel - nothing too fancy should be required.

Note this service is not free - the setup demo I described above should (should) run on free-tier resources, but a large cluster to run on all of quota will cost a few dollars; hopefully this will be in the single digits and will only have to be run very rarely to reduce the full raw dataset.

Code up WOD gradient check

As part of an intercomparison of 'spike tests' (#21) code up the excessive gradient test described at http://data.nodc.noaa.gov/woa/WOD/DOC/wodreadme.pdf on page 47. The program file should be called WOD_gradient_check.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py.

Code up the EN stability check

Code up the stability check described at http://www.metoffice.gov.uk/hadobs/en3/OQCpaper.pdf page 8-9. The program file should be called EN_stability_check.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py. There is code in the util/obs_utils.py module that can be used to calculate density.

EN standard level check - add tests using real profiles

The testing of the EN standard level check would benefit from some real world examples to confirm it is working as expected.

woa_normbias bug

Profile 542348 throws an error in one of the CoTeDe tests:

Traceback (most recent call last):
  File "AutoQC.py", line 105, in <module>
    parallel_result = processFile.parallel(filenames)
  File "/AutoQC/util/main.py", line 117, in easy_parallize
    result = pool.map(f, sequence) # for i in sequence: result[i] = f(i)
  File "/opt/conda/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/opt/conda/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
KeyError: 'woa_normbias'

Profile content:

C32306542348996900447199412 433278044222454426806150 21101022501021311 9NAVOCE 9
42421511110022911022541101271110129977020161270110000331999101100001105003312660
0110500110800331272001108002204700220270022047002205900331265002205900

QC test tests

As discussed in #41 and implemented in #45, we now have a mechanism and first example to create fake profiles, designed to pass or fail a give test, allowing us to build a test suite to ensure the correct performance of the qc tests. Currently, only one trivial example is implemented; behavior of all qc tests should be validated in qcvalidation.py.

Establishing main test runner

Now that most of the pieces are in place, we'd like to establish a minimum working example of running the tests found in qctests on the datafiles listed in datafiles.json.

Mostly this can be achieved by directly copying what we worked out in demo/demo.py to our main executable, AutoQC.py. The main task then is to replace the use of a json file (demo/data/demo.json) as the direct data input to the tests, and instead use dataio/wod.py to build a list of sets of profiles (ie, one set of profiles per input file) from the files found in datafiles.json, and pass that list to ddt to run the tests over.

Code up Argo spike test

As part of an intercomparison of 'spike tests' (#21) code up the spike test described at http://w3.jcommops.org/FTPRoot/Argo/Doc/argo-quality-control-manual.pdf on page 8. Where this refers to pressure in db, this can be read as depth in m. The program file should be called Argo_spike_test.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py.

Code up EN constant value check

Code up the constant value check described at http://www.metoffice.gov.uk/hadobs/en3/OQCpaper.pdf page 7. The program file should be called EN_contant_value_check.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py.

CSIRO unit tests

all CSIRO qc checks still need unit tests (see #111 , #113 )

EN standard level check - code tidying and testing

An initial attempt at the EN standard level background and buddy check has been made (#131, #132). The code is complex and requires some tidying (separating code into functions etc.) Although the code appears to be doing something sensible, all the parts of the code need to have tests implemented to ensure that it is working correctly.

Code up EN track check

Code up the constant value check described at http://www.metoffice.gov.uk/hadobs/en3/OQCpaper.pdf page 7. The program file should be called EN_track_check.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py.

This is quite a challenging check to code up as the check needs to use information about other profiles rather than just the profile itself. Therefore the whole list of profiles will need to be passed in as an additional argument.

How are we doing?

Since Hamburg, the number of qctests implemented has more than doubled (thanks @s-good and @castelao!) - it might be interesting to run the complete stack on quota again, to see how much of an impact we've made on our 55% true positive rate since then. I'm happy to do this, but it also might be good for someone other than me to go through the procedure of deploying and running on AWS (or wherever), so more than one person is familiar with how to do it, and we can double check the instructions are complete and clear.

Processing testing output

Currently, framework/demo.py successfully runs tests over the data in data/demo.json, but the results aren't returned in a useful format. After the tests execute, we'd like to have a list or dictionary available to pass to our own summarization and visualization tools that contains information on which tests passed on which inputs; so for example, each test could return a list of booleans corresponding by index to the list of inputs, indicating pass or fail; the entire suite could return a dictionary of these lists keyed by test name.

Code up Argo gradient test

As part of an intercomparison of 'spike tests' (#21) code up the gradient test described at http://w3.jcommops.org/FTPRoot/Argo/Doc/argo-quality-control-manual.pdf on page 8. Where this refers to pressure in db, this can be read as depth in m. The program file should be called Argo_gradient_test.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py.

factor testing guts out of main testing routine

In production, we won't want tests to be hard coded in the main testing routine (currently framework/demo.py) - they should be declared as separate modules in /tests, and included in demo.py automatically.

Code up the EN background check

Code up the background check on reported levels described at http://www.metoffice.gov.uk/hadobs/en3/OQCpaper.pdf page 9 onwards. The program file should be called EN_background_check.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py.

As described in the docs for the EN_spike_and_step_check #28, the results from part of that code should be used by this check, so the relevant code needs to be extracted and put into the background check.

Note that this check requires a lot of auxiliary information in the form of climatology information and error statistics.

Mozilla Sprint Challenge #2 - Data Unpacker

Oceanographic data is packed in a dense format that needs to be parsed into meaningful variables. A small zipped example file can be grabbed from
http://data.nodc.noaa.gov/woa/WOD/YEARLY/XBT/OBS/XBTO1966.gz

A description of how to read the contents of this file is given in section II of
http://data.nodc.noaa.gov/woa/WOD/DOC/wodreadme.pdf

Task: given a file (unzipped) like the example above, use the description in the documentation to unpack the primary header data into a list of dictionaries, one dictionary for each. Dictionaries should have the following keys, where <table row n> denotes the value described in row n of table 10.1 for the primary header info.

{
'version': <table row 1>,
'uniqueCast': <table row 5>,
'countryCode': <table row 6>,
'cruiseNumber': <table row 8>,
'year': <table row 9>,
'month'<table row 10>,
'day': <table row 11>,
'time': <table row 12>,
'latitude': <table row 13>,
'longitude': <table row 14>,
'nDepths': <table row 16>,
'profileType': <table row 17>,
'variables': [
                      {
                      'varCode': <table row 20>,
                      'QC': <table row 21>,
                      'metadata':[
                                            [
                                            <table row 25>, ... 
                                            ], ...
                                       ]
                      }, ...
                  ]
}

Where , ... denotes repeated similar elements; for example, the 'variables' key contains a list of N elements, each a dictionary containing the keys 'varCode', 'QC' and 'metadata', and where N is defined in the table referenced.

Travis CI now monitoring builds

Howdy all,

FYI since #72 travis CI is checking our test suite and reporting the build status in the badge at the top of the readme:

Please keep this badge green! You can check before you send a PR by running the current tests:

nosetests tests/qcvalidation.py tests/util_tests.py tests/wod_tests.py

If there are no complaints, you're good to go.

Last blockers before 1.0

So - our May deadline is almost upon us! Before we can make a final AutoQC decision, a few questions that have arisen in #146 and elsewhere need to be addressed:

Shall we calculate depth from pressure data, for profiles where depth is not reported?
Shall we calculate pressure from depth data for the ARGO suite of tests? Yes, done
Shall we immediately flag any classes of profiles (such as profiles with only one level) without further consideration? No
What shall our final flag definition be - temperature only, temperature and depth, or otherwise?
What will our final test dataset(s) be for measuring and validating the performance of this iteration of the AutoQC procedure? QuOTA (Jan, Feb, Mar and Jun only) and Argo delayed mode data

Once we make decisions for all of these points (and make the datasets from the 5th point available), I think we'll be able to produce a credible first iteration. Let me know what we decide and how we want to go about closing out 1.0.

Mozilla Sprint Challenge #1 - testing pipeline

To start, we'd like to build a simple testing package using Python's unittest module (https://docs.python.org/2/library/unittest.html). It should run the tests described below over the list of dummy inputs provided, and return a list for each test describing whether each input passed or failed.

Tests:

i % 2 == 0
i > 5
i % 3 == 1
i == 7
i < 0

Dummy inputs for i in the tests above:
[0,1,2,3,4,5,6,7,8,9]

So for the first test, the summary would be the list (where 1 == pass and 0 == fail): [1,0,1,0,1,0,1,0,1,0]

Similarly for the second test: [0,0,0,0,0,0,1,1,1,1]

Etc.

Result processing

Now that useful pass / fail statistics are being generated by the testing pipeline, we can start building summary statistics. In our previous conversations, @s-good described a summary table that compares test results to an expectation for each dataset, one row in which could be generated by the following:

Given the vector of pass / fail information generated for a test by the testing suite, and a list of the same length and format containing the expected pass / fail performance, write a function that returns a 4-element list that counts, in order,

number of datasets that passed when they were expected to pass
number of datasets that failed when they were expected to fail
number of datasets that passed when they were expected to fail
number of datasets that failed when they were expected to pass

Standalone wod.py

It occurs to me that we could do a service to the ocean science community with very little extra work, by splitting wod.py and related docs + tests out into their own package and listing it on PyPI - that way, anyone who wants to read WOD files would have a tool to do so.

I'm happy to take care of the details personally - as long as there are no objections from @s-good ?

Code Argo impossible date test

Code up the impossible date test described at http://w3.jcommops.org/FTPRoot/Argo/Doc/argo-quality-control-manual.pdf on page 6. The program file should be called Argo_impossible_date_test.py and be put into the qctests directory. It should follow the structure shown in EN_range_check.py.

CSIRO wire breaks, spikes and depth

@s-good recommended breaking CSIRO's wire breaks, spikes and depth tests out into individual tests in #111.

Wrap CoTeDe tests so that they can be evaluated in AutoQC

CoTeDe includes a lot of QC tests that we don't have implemented yet in AutoQC. Wrapper functions will be written that call the CoTeDe tests from inside AutoQC so that they can be evaluated alongside the other tests in AutoQC.

https://github.com/castelao/CoTeDe

Machine learning strategy (was 'Combinatorics intelligence')

The brute-force combinatorics examinations implemented in #38 are acceptable for small numbers of tests, but compute time will diverge very badly as the number of tests grows. We need a stronger strategy.

The numbers reported in #59 make the individual tests seem much too permissive on their own; one idea could be to look for tests that flag disjoint sets of profiles, and OR them all together.

Logging

As suggested by @s-good in #39, we need to rethink logging to accommodate the large number of possible combinations of tests.

benchmarks.plot_roc should have its text logging separated from plot drawing, and this text logging should be combined with the verbose logging in generateLogFile.

historical flag unit tests

Ann Thresher mentioned that which qc tests flagged which profiles was recorded in quota. We should use these to do further unit testing where possible. There will be complications as tests may have subtle but intentional differences in implementation (specifically how AutoQC bails out at the first hint of missing data in most cases, for example).

`wod_profile` docs

The wod_profile class from the new unpacker.py needs some easily digested docs on usage and what member data / methods are available on the object after creation. Lowest barrier to entry here would probably be as a README.md file in the same directory as the unpacker.

Fake profile generator

It'd be nice to have a convenient way to generate fake profiles, engineered to intentionally pass or fail different qc-tests; this would help us sanity-check new implementations of qc-tests.

Real tests

Now that we have a data unpacker and a testing pipeline, it's time to write some real tests! @s-good , you'll have to take the lead on describing just what tests you'd like to start with.

EN_track math domain error

EN_track appears bugged on the latest quota set from Tim; for example on his IQUOD_Quota_20.dat, EN_track throws

Traceback (most recent call last):
  File "AutoQC.py", line 96, in <module>
    parallel_result = processFile.parallel(filenames)
  File "/AutoQC/util/main.py", line 117, in easy_parallize
    result = pool.map(f, sequence) # for i in sequence: result[i] = f(i)
  File "/opt/conda/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/opt/conda/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
ValueError: math domain error

within the first 10 profiles. I will look into this more deeply in future, but since I'd like to focus on the new database backend which will dramatically change (and improve) how EN_track works, I'm just going to leave this as a todo until then.