udst / orca_test
Data assertions for the Orca task orchestrator
License: BSD 3-Clause "New" or "Revised" License
To start with, this is my first time looking at the code, so take this as the first impression that it is. I found the bare except: clause a code smell. It reminds me of the anti-pattern Overbroad Except Clauses. Can we be more specific about what we are catching? And where we are using it to catch an assert, would an if be clearer?
As always just starting a discussion, and willing to help implement any fix.
First off, this library is very cool @smmaurer thanks for doing this!
Has anyone else noticed that orca_test is pretty slow? I mean, our simulation is about 73 minutes and when I added the UAL code it slowed down dramatically. At this point I've found the two causes of the problem and the actual new code is fairly quick.
Basically the orca_test code adds 25 minutes to the simulation, and that's only verifying schemas in a few places.
My guess at the cause: I see merge_tables called with all columns in the code. Maybe it should be called with only the columns that are being verified in the specific orca_test spec. We have lots of computed columns, and it's well known that asking for all of them is an expensive operation.
If using all the columns is necessary, perhaps it's not necessary to use all the rows to verify the schemas? For verification purposes, I imagine we only need a few hundred rows from each table?
Barring all that, a simple on/off switch would seem essential, so that merging all the tables isn't required when not in debug mode...
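If restricting columns isn't possible, the row-sampling idea above can be sketched in plain pandas. This is a minimal sketch only; sample_for_checks is a hypothetical helper, not part of orca_test:

```python
import pandas as pd

def sample_for_checks(df, n=500, seed=0):
    """Hypothetical helper: return a small row sample for schema checks."""
    if len(df) <= n:
        return df
    return df.sample(n=n, random_state=seed)

# Invented data standing in for a merged simulation table
df = pd.DataFrame({"price": range(10_000)})
checked = sample_for_checks(df, n=500)

# Schema-level properties (dtypes, column names) survive sampling
assert checked["price"].dtype == df["price"].dtype
print(len(checked))  # 500
```

Value-level assertions (min/max, missing portions) would of course only be approximate on a sample, so this trade-off would need to be opt-in.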
We should go through and make the orca_test code cross-compatible with Python 3.3/3.4/3.5 as well as Python 2.7.
It looks like those are the versions that Orca supports, and this would be best practice anyway.
We might need to rethink the exceptions a bit, but otherwise this shouldn't be too much work. One outstanding question related to that is how to handle passing backtraces along with OrcaAssertionErrors for some of the broader tests (like whether a table/column can be successfully generated). See discussion in #6.
First, this project is exactly what we are looking for to help our model input development. We actually developed a similar tool in-house during our last forecast, and we planned to improve it with additional functionality in the coming months. But I can see this tool is a well-suited and better-structured substitute for our old tool, so I will definitely try to integrate it into our work this time.
Here are my thoughts about the features so far.
I would like to see the YAML syntax implemented at your earliest convenience :) Our old checklist of tables and columns is quite extensive, so I used a CSV file to store the tables, columns, and expected 'assertions'. I can see that converting that to YAML will be much less time-consuming (and probably less error-prone) than converting to the current dictionary-based syntax.
I would also like to see more assertion tests provided. A couple of things:
The 'foreign key' test works well in my testing, but what if the target table is multi-indexed? Will the tool automatically take the proper level of the index? Also, what if the target 'foreign key' column is not indexed? We do find situations where we just want to verify the consistency of two columns regardless of whether they are indexed. So, put simply, an expansion of the 'foreign key' test to unindexed columns would be useful.
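For the multi-index case, here is a pandas-level sketch of what the test could do internally: pull the relevant level out of the target's MultiIndex by name before comparing. The table and column names are invented for illustration:

```python
import pandas as pd

# Hypothetical parent table indexed by (county_id, zone_id)
parent = pd.DataFrame(
    {"value": [1, 2, 3]},
    index=pd.MultiIndex.from_tuples(
        [(1, 10), (1, 11), (2, 12)], names=["county_id", "zone_id"]
    ),
)
child = pd.DataFrame({"zone_id": [10, 12, 99]})

# Select the matching level of the MultiIndex by name
valid_ids = parent.index.get_level_values("zone_id")
bad = child.loc[~child["zone_id"].isin(valid_ids), "zone_id"]
print(sorted(bad))  # [99]
```

The same isin-based comparison works whether or not the parent column is an index, which would also cover the unindexed-column case.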
Also, are there any assertions on an expected value or list of values? For example, we have large_area_ids in our parcels table. Can I test that those ids match a predefined list?
Thanks.
When looking at results it can be hard to tell which table/column is having trouble. Some errors start with the table, others end with it, and some don't mention it at all. It would be much easier to investigate if every message started with the table name, or table.column where available.
At some point we need to write unit tests and set up continuous integration.
Right now all we have is an informal test script: development_tests.py
To start with, this is my first time looking at the code, so take this as the first impression that it is. Does it make sense to split the project into multiple files? The ################### sections seem to build on each other; that may be a place to start the splitting.
As always just starting a discussion, and willing to help implement any fix.
Currently orca_test uses an assertion model, where a failing check raises an exception and causes code execution to stop. @sablanchard has a use case where it would be better to print a report of all the failures at once. This is similar to issue #5.
(Raising an exception makes sense if you're using orca_test to check data dynamically within a simulation; printing a report makes sense for things like one-off data validation.)
We realized you can get this behavior without changing orca_test, though, using an approach like this:
import orca_test
from orca_test import OrcaAssertionError

specs = [...]  # list OrcaSpecs here
problems = False

for spec in specs:
    try:
        orca_test.assert_orca_spec(spec)
    except OrcaAssertionError as e:
        problems = True
        print(str(e))

if problems:
    raise OrcaAssertionError("Problems found")
Here's a self-contained, fully-functional script demonstrating this usage: orca_test_demo.py
And this is what the output looks like:
Table 'households' is not registered
Table 'buildings' is already registered
Table 'badtable' is registered but cannot be generated
Column 'index' is not registered in table 'buildings'
Column 'price1' is already registered in table 'buildings'
Column 'badcol' is registered but cannot be generated
Column 'price1' is not set as the index of table 'buildings'
Column 'strings' has type 'object' (not numeric)
Column 'price2' is 20% missing, above limit of 0%
Column 'price1' has maximum value of 50, not 25
Column 'price1' has minimum value of -1, not 0
Column 'price2' is 20% missing, above limit of 10%
Column 'fkey_bad' has values that are not in 'zone_id'
Injectable 'nonexistent' is not registered
Injectable 'rate' is already registered
Injectable 'bad_inj' is registered but cannot be evaluated
Injectable 'dict' has type 'dict' (not numeric)
Injectable 'rate' has value of 0.64, less than 5
Injectable 'rate' has value of 0.56, greater than -5
Injectable 'rate' is not a dict
Injectable 'dict' does not have key 'Oakland'
Traceback (most recent call last):
File "orca_test_demo.py", line 124, in <module>
raise OrcaAssertionError("Problems found")
OrcaAssertionError: Problems found
Ultimately, though, we probably want to add multiple modes to orca_test, perhaps as global settings.

# Potential future functionality
orca_test.mode = 'warning'
orca_test.mode = 'report'
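A minimal sketch of how such a mode switch might behave. Everything here is hypothetical: MODE, handle_failure, and the 'report' semantics are not current orca_test API:

```python
import warnings

# Hypothetical module-level setting; orca_test has no such switch today
MODE = "report"  # alternatives: "raise", "warning"
failures = []

def handle_failure(msg):
    """Dispatch a failed assertion according to the current mode."""
    if MODE == "raise":
        raise AssertionError(msg)
    elif MODE == "warning":
        warnings.warn(msg)
    else:  # "report": collect everything and keep going
        failures.append(msg)

handle_failure("Column 'price1' has maximum value of 50, not 25")
handle_failure("Injectable 'rate' is not a dict")
print(len(failures))  # 2
```

The design question is mainly where the collected failures live (a module-level list, as sketched here, or a return value from assert_orca_spec).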
Is there a way to check whether a column's values are all members of a set of allowed values? For instance, I just changed the tenure column from the values 1 and 2 to 'rent' and 'own', and was hoping there was a way to make sure the values are either rent/own or null. Not an urgent thing, but it would be nice to have. I'm sure it would apply to any categorical column.
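A plain-pandas sketch of that check, treating nulls as acceptable; the tenure data here is invented:

```python
import pandas as pd

tenure = pd.Series(["rent", "own", None, "rent", "ownn"])  # 'ownn' is a bad value
allowed = {"rent", "own"}

# Nulls are allowed; every non-null value must be in the allowed set
non_null = tenure.dropna()
bad = non_null[~non_null.isin(allowed)]
print(bad.tolist())  # ['ownn']
```

An assertion along these lines would generalize to any categorical column by parameterizing the allowed set.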
The foreign key test reports errors but not the values causing them. It would be much more helpful to print out those mismatching values as well. Thanks.
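One way the message could include the offending values, sketched in plain pandas with made-up data (the message wording is illustrative, not orca_test's):

```python
import pandas as pd

child = pd.Series([1, 2, 2, 7, 9], name="zone_id")
parent_ids = pd.Index([1, 2, 3], name="zone_id")

# Set difference gives exactly the values that broke the foreign key check
missing = sorted(set(child) - set(parent_ids))
if missing:
    print("Column 'zone_id' has values not found in target index: %s" % missing)
```

For very dirty columns the list could be truncated (e.g. first ten values plus a count) to keep messages readable.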
I have converted my old test spreadsheet to orca_test specs and it looks pretty verbose. There are a lot of repetitive items such as numeric=True and registered=True in ColumnSpec. Any way to make those defaults, so the spec list or YAML could come out much simpler and cleaner?
The message when a max_portion_missing assertion fails close to the threshold is not as informative as it could be. For example:
Column 'year' is 0% missing, above limit of 0%
Maybe add a decimal, or include a count of missing values.
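A sketch of a more informative message, with invented data; the exact wording is a suggestion, not orca_test's current output:

```python
import pandas as pd

# 1 missing value out of 1000 rows: rounds down to 0% in the current message
col = pd.Series([2000, 2001, None] + [2002] * 997, name="year")

n_missing = int(col.isnull().sum())
portion = n_missing / len(col)
print("Column 'year' is %.2f%% missing (%d of %d rows), above limit of 0%%"
      % (100 * portion, n_missing, len(col)))
```

Showing both the percentage with decimals and the raw count makes a near-threshold failure immediately interpretable.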
Right now any assertion error breaks the checking routine. I think it is preferable to have a complete run that produces a diagnostic report of all the errors, so we can focus on fixing issues all together before running another test.
Also, in the data development stage we typically have some known issues in our data. For minor issues, we may choose to leave them in place temporarily but still get alerts for other problems. A full test run would help in that situation.
Thanks.
Another task is to standardize the docstrings and put together Sphinx documentation!
These should align with the Orca repo: https://github.com/udst/orca
We should raise an error when an undefined key is included in a spec. This probably represents a bug in the user's code, and by not raising an error we give the impression that the unrecognized assertion is passing.
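A minimal sketch of that validation: check spec keywords against the assertions a ColumnSpec understands instead of silently ignoring extras. The set of names and the check_spec_keys helper are illustrative assumptions, not orca_test's actual internals:

```python
# Hypothetical list of recognized assertion names
KNOWN_ASSERTIONS = {"registered", "numeric", "missing_val_coding",
                    "max_portion_missing", "min", "max", "foreign_key"}

def check_spec_keys(**kwargs):
    """Raise on any keyword that is not a recognized assertion."""
    unknown = set(kwargs) - KNOWN_ASSERTIONS
    if unknown:
        raise KeyError("Unrecognized assertion(s): %s" % sorted(unknown))

check_spec_keys(numeric=True, min=0)  # fine, no error
try:
    check_spec_keys(numerc=True)      # typo: should raise, not silently pass
except KeyError as e:
    print(e)
```

With this in place, a misspelled assertion surfaces immediately instead of looking like a passing check.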
Just noticed that in my test script, there's an assertion that should fail but does not. development_tests.py#L96
Probably minor, but I don't have time to diagnose it right now and wanted to make a note.
This is not an issue; it is more a source of ideas.
What can we learn from engarde?
Can we assert the value for a specific cell of a table?
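For cell-level assertions, plain pandas already gets close; a sketch with invented data, addressing one cell by index label and column name:

```python
import pandas as pd

parcels = pd.DataFrame({"large_area_id": [3, 5]},
                       index=pd.Index([101, 102], name="parcel_id"))

# Assert the value of a single cell via .at[row_label, column_name]
assert parcels.at[101, "large_area_id"] == 3
print("cell assertion passed")
```

A spec-level version would just need a (table, row label, column, expected value) tuple.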