udst / orca_test
Data assertions for the Orca task orchestrator
License: BSD 3-Clause "New" or "Revised" License
To start with, this is my first time looking at the code, so take this as the first impression that it is. I found the bare except: clause a code smell. It reminds me of the anti-pattern Overbroad Except Clauses. Can we be more specific about what we are catching? And where we are using it to catch an assert, would an if be clearer?
As always just starting a discussion, and willing to help implement any fix.
First off, this library is very cool @smmaurer thanks for doing this!
Has anyone else noticed that orca_test is pretty slow? I mean, our simulation is about 73 minutes and when I added the UAL code it slowed down dramatically. At this point I've found the two causes of the problem and the actual new code is fairly quick.
Basically the orca_test code adds 25 minutes to the simulation, and that's only verifying schemas in a few places.
My guess at the cause: I see merge_tables called with all columns in the code. Maybe it should be called with only the columns that are being verified in the specific orca_test spec. We have lots of computed columns, and it's well known that asking for all of them is an expensive operation.
If using all the columns is necessary, perhaps it's not necessary to use all the rows to verify the schemas? For verification purposes, I imagine we only need a few hundred rows from each table?
Barring all that, a simple on/off switch would seem essential, so that merging all the tables isn't required when not in debug mode...
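If restricting columns isn't possible, the row-sampling idea above can be sketched in plain pandas. This is a minimal sketch only; sample_for_checks is a hypothetical helper, not part of orca_test:

```python
import pandas as pd

def sample_for_checks(df, n=500, seed=0):
    """Hypothetical helper: return a small row sample for schema checks."""
    if len(df) <= n:
        return df
    return df.sample(n=n, random_state=seed)

# Invented data standing in for a merged simulation table
df = pd.DataFrame({"price": range(10_000)})
checked = sample_for_checks(df, n=500)

# Schema-level properties (dtypes, column names) survive sampling
assert checked["price"].dtype == df["price"].dtype
print(len(checked))  # 500
```

Value-level assertions (min/max, missing portions) would of course only be approximate on a sample, so this trade-off would need to be opt-in.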
We should go through and make the orca_test code cross-compatible with Python 3.3/3.4/3.5 as well as Python 2.7.
It looks like those are the versions that Orca supports, and this would be best practice anyway.
We might need to rethink the exceptions a bit, but otherwise this shouldn't be too much work. One outstanding question related to that is how to handle passing backtraces along with OrcaAssertionErrors for some of the broader tests (like whether a table/column can be successfully generated). See discussion in #6.
First, this project is exactly what we are looking for to help our model input development. We actually developed a similar tool in-house during our last forecast, and we planned to improve it with additional functionality in the coming months. But I can see this tool is a well-suited and better-structured substitute for our old tool, so I will definitely try to integrate it into our work this time.
Here are my thoughts about the features so far.
I would like to see the YAML syntax implemented at your earliest convenience :) Our old checklist of tables and columns is quite extensive, so I used a CSV file to store the tables, columns, and expected 'assertions'. I can see that converting that to YAML will be much less time-consuming (and probably less error-prone) than converting to the current dictionary-based syntax.
I would also like to see more assertion tests provided. A couple of things:
The 'foreign key' test works well in my testing, but what if the target table is multi-indexed? Will the tool automatically take the proper level of the index? Also, what if the target 'foreign key' column is not indexed? We do find situations where we just want to verify the consistency of two columns regardless of whether they are indexed. So, put simply, an expansion of the 'foreign key' test to unindexed columns would be useful.
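For the multi-index case, here is a pandas-level sketch of what the test could do internally: pull the relevant level out of the target's MultiIndex by name before comparing. The table and column names are invented for illustration:

```python
import pandas as pd

# Hypothetical parent table indexed by (county_id, zone_id)
parent = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [(1, 10), (1, 11), (2, 12)], names=["county_id", "zone_id"]
    ),
)
child = pd.DataFrame({"zone_id": [10, 12, 99]})

# Select the matching level of the MultiIndex by name
valid_ids = parent.index.get_level_values("zone_id")
bad = child.loc[~child["zone_id"].isin(valid_ids), "zone_id"]
print(sorted(bad))  # [99]
```

The same isin-based comparison works whether or not the parent column is an index, which would also cover the unindexed-column case.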
Also, are there any assertions on an expected value or list of values? For example, we have large_area_ids in our parcels table. Can I test that those ids match a predefined list?
Thanks.
When looking at results it can be hard to tell which table/column is having trouble. Some errors start with the table, others end with it, and some don't mention it at all. It would be much easier to investigate if every message started with the table name, or table.column where available.
At some point we need to write unit tests and set up continuous integration.
Right now all we have is an informal test script: development_tests.py
To start with, this is my first time looking at the code, so take this as the first impression that it is. Does it make sense to split the project into multiple files? The ################### sections seem to build on each other; that may be a place to start the splitting.
As always just starting a discussion, and willing to help implement any fix.
Currently orca_test uses an assertion model, where a failing check raises an exception and causes code execution to stop. @sablanchard has a use case where it would be better to print a report of all the failures at once. This is similar to issue #5.
(Raising an exception makes sense if you're using orca_test to check data dynamically within a simulation; printing a report makes sense for things like one-off data validation.)
We realized you can get this behavior without changing orca_test, though, using an approach like this:
import orca_test
from orca_test import OrcaAssertionError

specs = [...]  # list OrcaSpecs here
problems = False

for spec in specs:
    try:
        orca_test.assert_orca_spec(spec)
    except OrcaAssertionError as e:
        problems = True
        print(str(e))

if problems:
    raise OrcaAssertionError("Problems found")
Here's a self-contained, fully-functional script demonstrating this usage: orca_test_demo.py
And this is what the output looks like:
Table 'households' is not registered
Table 'buildings' is already registered
Table 'badtable' is registered but cannot be generated
Column 'index' is not registered in table 'buildings'
Column 'price1' is already registered in table 'buildings'
Column 'badcol' is registered but cannot be generated
Column 'price1' is not set as the index of table 'buildings'
Column 'strings' has type 'object' (not numeric)
Column 'price2' is 20% missing, above limit of 0%
Column 'price1' has maximum value of 50, not 25
Column 'price1' has minimum value of -1, not 0
Column 'price2' is 20% missing, above limit of 10%
Column 'fkey_bad' has values that are not in 'zone_id'
Injectable 'nonexistent' is not registered
Injectable 'rate' is already registered
Injectable 'bad_inj' is registered but cannot be evaluated
Injectable 'dict' has type 'dict' (not numeric)
Injectable 'rate' has value of 0.64, less than 5
Injectable 'rate' has value of 0.56, greater than -5
Injectable 'rate' is not a dict
Injectable 'dict' does not have key 'Oakland'
Traceback (most recent call last):
File "orca_test_demo.py", line 124, in <module>
raise OrcaAssertionError("Problems found")
OrcaAssertionError: Problems found
Ultimately, though, we probably want to add multiple modes to orca_test, perhaps as global settings.

# Potential future functionality
orca_test.mode = 'warning'
orca_test.mode = 'report'
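A minimal sketch of how such a mode switch might behave. Everything here is hypothetical: MODE, handle_failure, and the 'report' semantics are not current orca_test API:

```python
import warnings

# Hypothetical module-level setting; orca_test has no such switch today
MODE = "report"  # alternatives: "raise", "warning"
failures = []

def handle_failure(msg):
    """Dispatch a failed assertion according to the current mode."""
    if MODE == "raise":
        raise AssertionError(msg)
    elif MODE == "warning":
        warnings.warn(msg)
    else:  # "report": collect everything and keep going
        failures.append(msg)

handle_failure("Column 'price1' has maximum value of 50, not 25")
handle_failure("Injectable 'rate' is not a dict")
print(len(failures))  # 2
```

The design question is mainly where the collected failures live (a module-level list, as sketched here, or a return value from assert_orca_spec).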
Is there a way to check whether a column's values are all members of a set of allowed values? For instance, I just changed the tenure column from the values 1 and 2 to 'rent' and 'own', and was hoping there was a way to make sure the values are either rent/own or null. Not an urgent thing, but it would be nice to have. I'm sure it would apply to any categorical column.
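A plain-pandas sketch of that check, treating nulls as acceptable; the tenure data here is invented:

```python
import pandas as pd

tenure = pd.Series(["rent", "own", None, "rent", "ownn"])  # 'ownn' is a bad value
allowed = {"rent", "own"}

# Nulls are allowed; every non-null value must be in the allowed set
non_null = tenure.dropna()
bad = non_null[~non_null.isin(allowed)]
print(bad.tolist())  # ['ownn']
```

An assertion along these lines would generalize to any categorical column by parameterizing the allowed set.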
The foreign key test reports errors but not the values causing them. It would be much more helpful to print out those mismatching values as well. Thanks.
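One way the message could include the offending values, sketched in plain pandas with made-up data (the message wording is illustrative, not orca_test's):

```python
import pandas as pd

child = pd.Series([1, 2, 2, 7, 9], name="zone_id")
parent_ids = pd.Index([1, 2, 3], name="zone_id")

# Set difference gives exactly the values that broke the foreign key check
missing = sorted(set(child) - set(parent_ids))
if missing:
    print("Column 'zone_id' has values not found in target index: %s" % missing)
```

For very dirty columns the list could be truncated (e.g. first ten values plus a count) to keep messages readable.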
I have converted my old test spreadsheet to orca_test specs and it looks pretty verbose. There are a lot of repetitive items such as numeric=True and registered=True in ColumnSpec. Any way to make those defaults, so the spec list or YAML could come out much simpler and cleaner?
The message when a max_portion_missing assertion fails close to the threshold is not as informative as it could be. For example:
Column 'year' is 0% missing, above limit of 0%
Maybe add a decimal, or include a count of missing values.
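A sketch of a more informative message, with invented data; the exact wording is a suggestion, not orca_test's current output:

```python
import pandas as pd

# 1 missing value out of 1000 rows: rounds down to 0% in the current message
col = pd.Series([2000, 2001, None] + [2002] * 997, name="year")

n_missing = int(col.isnull().sum())
portion = n_missing / len(col)
print("Column 'year' is %.2f%% missing (%d of %d rows), above limit of 0%%"
      % (100 * portion, n_missing, len(col)))
```

Showing both the percentage with decimals and the raw count makes a near-threshold failure immediately interpretable.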
Right now any assertion error breaks the checking routine. I think it is preferable to have a complete run that produces a diagnostic report of all the errors, so we can focus on fixing issues all together before running another test.
Also, in the data development stage we typically have some known issues in our data. For minor issues, we may choose to leave them in place temporarily but still get alerts for other problems. A full test run would help in that situation.
Thanks.
Another task is to standardize the docstrings and put together Sphinx documentation!
These should align with the Orca repo: https://github.com/udst/orca
We should raise an error when an undefined key is included in a spec. This probably represents a bug in the user's code, and by not raising an error we give the impression that the unrecognized assertion is passing.
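A minimal sketch of that validation: check spec keywords against the assertions a ColumnSpec understands instead of silently ignoring extras. The set of names and the check_spec_keys helper are illustrative assumptions, not orca_test's actual internals:

```python
# Hypothetical list of recognized assertion names
KNOWN_ASSERTIONS = {"registered", "numeric", "missing_val_coding",
                    "max_portion_missing", "min", "max", "foreign_key"}

def check_spec_keys(**kwargs):
    """Raise on any keyword that is not a recognized assertion."""
    unknown = set(kwargs) - KNOWN_ASSERTIONS
    if unknown:
        raise KeyError("Unrecognized assertion(s): %s" % sorted(unknown))

check_spec_keys(numeric=True, min=0)  # fine, no error
try:
    check_spec_keys(numerc=True)      # typo: should raise, not silently pass
except KeyError as e:
    print(e)
```

With this in place, a misspelled assertion surfaces immediately instead of looking like a passing check.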
Just noticed that in my test script, there's an assertion that should fail but does not. development_tests.py#L96
Probably minor, but I don't have time to diagnose it right now and wanted to make a note.
This is not an issue; it is more a source of ideas.
What can we learn from engarde?
Can we assert the value for a specific cell of a table?
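For cell-level assertions, plain pandas already gets close; a sketch with invented data, addressing one cell by index label and column name:

```python
import pandas as pd

parcels = pd.DataFrame({"large_area_id": [3, 5]},
                       index=pd.Index([101, 102], name="parcel_id"))

# Assert the value of a single cell via .at[row_label, column_name]
assert parcels.at[101, "large_area_id"] == 3
print("cell assertion passed")
```

A spec-level version would just need a (table, row label, column, expected value) tuple.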