
Test-Driven Data Analysis (Python TDDA library)

What is it?

The TDDA Python module provides command-line and Python API support for the overall process of data analysis, through the following tools:

  • Reference Testing: extensions to unittest and pytest for managing testing of data analysis pipelines, where the results are typically much larger, and more complex, than single numerical values.

  • Constraints: tools (and API) for discovery of constraints from data, for validation of constraints on new data, and for anomaly detection.

  • Finding Regular Expressions (Rexpy): tools (and API) for automatically inferring regular expressions from text data.

  • Automatic Test Generation (Gentest): TDDA can generate tests for more-or-less any command that can be run from a command line, whether it be Python code, R code, a shell script, a shell command, a Makefile or a multi-language pipeline involving compiled code. "Gentest writes tests, so you don't have to."™

Documentation

http://tdda.readthedocs.io

Installation

The simplest way to install all of the TDDA Python modules is using pip:

pip install tdda

The full set of sources, including all examples, is downloadable from PyPI with:

pip download --no-binary :all: tdda

The sources are also publicly available from GitHub:

git clone git@github.com:tdda/tdda.git

Documentation is available at http://tdda.readthedocs.io.

If you clone the GitHub repo, use

python setup.py install

afterwards to install the command-line tools (tdda and rexpy).

Reference Tests

The tdda.referencetest library is used to support the creation of reference tests, based on either unittest or pytest.

These are like other tests except:

  1. They have special support for comparing strings to files and files to files.
  2. That support includes the ability to provide exclusion patterns (for things like dates and versions that might be in the output).
  3. When a string/file assertion fails, they print the command you need to diff the output.
  4. If there were exclusion patterns, they also write modified versions of both the actual and expected output, and print the diff command needed to compare those.
  5. They have special support for handling CSV files.
  6. They support flags (-w and -W) to rewrite the reference (expected) results once you have confirmed that the new actuals are correct.
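
For illustration, here is a minimal unittest-style sketch of such a test. The pipeline function, file name and data location are invented for the example, so treat it as an approximation rather than a definitive recipe:

from tdda.referencetest import ReferenceTestCase


def run_my_analysis():
    # Stand-in for a real pipeline step; returns the output to be checked.
    return 'some analysis output\n'


class TestMyAnalysis(ReferenceTestCase):
    def testExampleOutput(self):
        actual = run_my_analysis()
        # Compares the string against the reference file; on failure, the
        # library prints the diff command needed to inspect the differences,
        # and re-running with -w rewrites the reference once verified.
        self.assertStringCorrect(actual, 'expected_output.txt')


# Hypothetical directory holding the reference ('expected') files.
TestMyAnalysis.set_default_data_location('testdata')

if __name__ == '__main__':
    ReferenceTestCase.main()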

For more details from a source distribution or checkout, see the README.md file and examples in the referencetest subdirectory.

Constraints

The tdda.constraints library is used to 'discover' constraints from a (Pandas) DataFrame, to write them out as JSON, and to verify that datasets meet the constraints in a constraints file.
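
A minimal sketch of typical usage, with invented data and file names (an approximation of the API; see the documentation for details):

import pandas as pd
from tdda.constraints import discover_df, verify_df

df = pd.DataFrame({'a': [1, 2, 9], 'b': ['one', 'two', 'three']})

# Discover constraints from the data and save them as JSON.
constraints = discover_df(df)
with open('example_constraints.tdda', 'w') as f:
    f.write(constraints.to_json())

# Verify a (possibly new) dataset against the saved constraints.
verification = verify_df(df, 'example_constraints.tdda')
print(verification.to_dataframe())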

For more details from a source distribution or checkout, see the README.md file and examples in the constraints subdirectory.

Finding Regular Expressions

The tdda repository also includes rexpy, a tool for automatically inferring regular expressions from a single field of data examples.
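
A minimal sketch of usage, with invented example strings:

from tdda import rexpy

examples = ['EH1 6JP', 'G1 1AA', 'EC1V 9LB']
patterns = rexpy.extract(examples)
for pattern in patterns:
    print(pattern)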


All examples, tests and code run under Python 2.7, Python 3.5 and Python 3.6.


tdda's Issues

Allow DataFrame reference tests against feather files

The method ReferenceTest.assertDataFrameCorrect currently expects the reference DataFrame to be a CSV file, which is suboptimal.

It would be good if it could also be a feather file, especially if it had extended metadata through the pmmif extensions.

gentest: support other encodings

  • Add encoding-guessing code, at least for a few common encodings
  • Add encoding parameter to checkfiles
  • Set PDFs to be read as Latin-1

Unable to copy examples for constraint generation

Slide 21 of the London PyData tutorial says:

python -m tdda.constraints.examples

When I run this on my Mac after doing pip install tdda, I get:

/tmp: python -m tdda.constraints.examples
/Users/gvwilson/anaconda3/bin/python: No module named tdda.constraints.examples.__main__; 'tdda.constraints.examples' is a package and cannot be directly executed

Note that python -m tdda.referencetest.examples works as expected, and that tdda examples constraints /tmp does what it should.

discover_df fails on centos docker due to getpass not finding a system user

discover_df fails with an error on an attempt to set self.user:
File "/usr/lib/python3.6/site-packages/tdda/constraints/pd/constraints.py", line 1113, in discover_df constraints.set_dates_user_host_creator() File "/usr/lib/python3.6/site-packages/tdda/constraints/base.py", line 216, in set_dates_user_host_creator self.user = getpass.getuser() File "/usr/lib64/python3.6/getpass.py", line 169, in getuser return pwd.getpwuid(os.getuid())[0] KeyError: 'getpwuid(): uid not found: 1000'

Note, 'self.user' does not seem to be used anywhere.
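
A possible user-side workaround, not an official fix: getpass.getuser() consults the LOGNAME, USER, LNAME and USERNAME environment variables before falling back to the passwd database, so setting one of them in the container sidesteps the failing lookup:

import os

# getpass.getuser() checks LOGNAME/USER/LNAME/USERNAME before calling
# pwd.getpwuid(), so defining one avoids the KeyError when the container's
# uid has no passwd entry. The value here is illustrative.
os.environ.setdefault('USER', 'analysis')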

Ability for testrunner to tag failing tests

  • It would be great if you could run with -tf or something and have it add tags to tests that fail (or give errors)
  • It would also be great if -00 or something removed all those tags.

Type hint stubs?

This is a question.

Are there type hint stubs for tdda?

I ran mypy on one of my modules that uses tdda, and it gives the following:

Skipping analyzing "tdda": module is installed, but missing library stubs or py.typed marker  [import]

gentest: ability to generate test_all.py

  • gentest --all / -a to generate test_all.py for all test files in the directory
  • gentest -a [testfile1 [testfile2 ...]] for specific ones
  • If there's a test_all.py there already, only clobber it if it was recognizably generated by gentest. Probably just checking for tdda gentest in the first few lines of the file would be enough.

rexpy failing on parsing list of user agents

I am unsure if this is expected behavior, but I got a failure when I tried to parse a list of user agents. It's quite possible that this kind of list is out of scope, but I wonder if rexpy could fail gracefully or inform the user of its limits.

head  agents.txt
Mozilla/5.0 (compatible; U; ABrowse 0.6; Syllable) AppleWebKit/420+ (KHTML, like Gecko)
Mozilla/5.0 (compatible; U; ABrowse 0.6;  Syllable) AppleWebKit/420+ (KHTML, like Gecko)
Mozilla/5.0 (compatible; ABrowse 0.4; Syllable)
Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)
Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR   3.5.30729)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0;   Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;   SV1) ; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; Acoo Browser; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; Avant Browser)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1;   .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; GTB5; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; Maxthon; InfoPath.1; .NET CLR 3.5.30729; .NET CLR 3.0.30618)
Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; GTB5;
head  agents.txt | python tdda/rexpy/rexpy.py
Traceback (most recent call last):
  File "tdda/rexpy/rexpy.py", line 995, in <module>
    main(**params)
  File "tdda/rexpy/rexpy.py", line 954, in main
    patterns = extract(strings)
  File "tdda/rexpy/rexpy.py", line 760, in extract
    r = Extractor(examples, tag=tag, verbose=verbose)
  File "tdda/rexpy/rexpy.py", line 170, in __init__
    self.extract()                  # Stores results
  File "tdda/rexpy/rexpy.py", line 180, in extract
    self.results = self.batch_extract(self.example_freqs.keys())
  File "tdda/rexpy/rexpy.py", line 241, in batch_extract
    grouped = refine_groups(r, self.example_freqs)
  File "tdda/rexpy/rexpy.py", line 624, in refine_groups
    regex = cre(vrle2re(pattern, tagged=True))
  File "tdda/rexpy/rexpy.py", line 65, in cre
    memo[rex] = c =re.compile(rex)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 194, in compile
    return _compile(pattern, flags)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/re.py", line 249, in _compile
    p = sre_compile.compile(pattern, flags)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/sre_compile.py", line 583, in compile
    "sorry, but this version only supports 100 named groups"
AssertionError: sorry, but this version only supports 100 named groups

pandas.np module is deprecated

When using tdda.constraints.pd.constraints.detect_df in an environment with pandas >= 1.0, pandas issues the following warning:

D:\anaconda3\envs\myenv\lib\site-packages\tdda\constraints\pd\constraints.py:225: FutureWarning:

  The pandas.np module is deprecated and will be removed from pandas in a future version. Import numpy directly instead

Issue observed with tdda 1.0.31 and pandas 1.0.1.

pandas.np appears to be used in different places: https://github.com/tdda/tdda/search?q=%22pd.np%22&unscoped_q=%22pd.np%22.
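
The fix is presumably mechanical: import numpy directly rather than reaching it through pandas. A sketch with illustrative values:

import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.5, float('nan')])

# Deprecated under pandas >= 1.0 (triggers the FutureWarning):
#   mask = pd.np.isnan(values)

# Equivalent, importing numpy directly as the warning recommends:
mask = np.isnan(values)
print(mask)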

tdda - v1.0.1 and 1.0.6 - test failing due to error - Windows 10, Windows 7

I am attempting to run tdda test after pip installing the program in an anaconda environment currently running python 3.6 (also attempted with python 3.5) on a Windows 10 (v1803 build 17134.48) machine. The test fails with 1 error on v1.0.0 and 1 error/1 fail on v1.0.1. The error is the same for both versions.

ERROR (from CLI)

(py36test) E:\analysis>tdda test > tdda_out.txt
........................................................................sssss..F....E...................................sssss.........................

ERROR: setUpClass (tdda.constraints.pd.testpdconstraints.TestPandasCommandLine)

Traceback (most recent call last):
  File "e:\anaconda2\envs\py36test\lib\site-packages\tdda\constraints\pd\testpdconstraints.py", line 1372, in setUpClass
    cls.setUpHelper()
  File "e:\anaconda2\envs\py36test\lib\site-packages\tdda\constraints\pd\testpdconstraints.py", line 1234, in setUpHelper
    cls.execute_command(argv)
  File "e:\anaconda2\envs\py36test\lib\site-packages\tdda\constraints\pd\testpdconstraints.py", line 1381, in execute_command
    return check_shell_output(argv)
  File "e:\anaconda2\envs\py36test\lib\site-packages\tdda\constraints\pd\testpdconstraints.py", line 1396, in check_shell_output
    result = subprocess.check_output(UTF8DefiniteObject(args))
  File "e:\anaconda2\envs\py36test\lib\subprocess.py", line 336, in check_output
    **kwargs).stdout
  File "e:\anaconda2\envs\py36test\lib\subprocess.py", line 403, in run
    with Popen(*popenargs, **kwargs) as process:
  File "e:\anaconda2\envs\py36test\lib\subprocess.py", line 709, in __init__
    restore_signals, start_new_session)
  File "e:\anaconda2\envs\py36test\lib\subprocess.py", line 971, in _execute_child
    args = list2cmdline(args)
  File "e:\anaconda2\envs\py36test\lib\subprocess.py", line 461, in list2cmdline
    needquote = (" " in arg) or ("\t" in arg) or not arg
TypeError: a bytes-like object is required, not 'str'

======================================================================
FAIL: testDiscoverCmd (tdda.constraints.pd.testpdconstraints.TestPandasCommandAPI)

Traceback (most recent call last):
  File "e:\anaconda2\envs\py36test\lib\site-packages\tdda\constraints\pd\testpdconstraints.py", line 1248, in testDiscoverCmd
    '"tddafile":',
  File "e:\anaconda2\envs\py36test\lib\site-packages\tdda\referencetest\referencetest.py", line 847, in assertFileCorrect
    self._check_failures(failures, msgs)
  File "e:\anaconda2\envs\py36test\lib\site-packages\tdda\referencetest\referencetest.py", line 1000, in _check_failures
    self.assert_fn(failures == 0, '\n'.join(msgs))
AssertionError: False is not true : 5 lines are different, starting at line 6
Compare with:
fc C:\Users\solom\AppData\Local\Temp\elements92.tdda e:\anaconda2\envs\py36test\lib\site-packages\tdda\constraints\testdata\elements92_pandas.tdda

Note exclusions:
"as_at":
"local_time":
"utc_time":
"source":
"host":
"user":
"tddafile":


Ran 149 tests in 2.063s

FAILED (failures=1, errors=1, skipped=10)

Python info

Python 3.6.5 | packaged by conda-forge | (default, Apr 6 2018, 16:13:55) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

Environment Info

name: py36test
channels:
  - conda-forge
  - defaults
dependencies:
  - arrow-cpp=0.9.0=py36_vc14_7
  - atomicwrites=1.1.5=py36_0
  - attrs=18.1.0=py_0
  - boost-cpp=1.66.0=vc14_1
  - brotli=1.0.2=vc14_0
  - ca-certificates=2018.4.16=0
  - certifi=2018.4.16=py36_0
  - cmake=3.11.1=0
  - colorama=0.3.9=py36_0
  - cython=0.28.2=py36_0
  - feather-format=0.4.0=py36_vc14_2
  - flatbuffers=1.9.0=vc14_0
  - gflags=2.2.1=vc14_0
  - libflang=5.0.0=vc14_20180208
  - llvm-meta=5.0.0=0
  - lz4-c=1.8.1=vc14_0
  - mkl_fft=1.0.2=py36_0
  - mkl_random=1.0.1=py36_0
  - more-itertools=4.1.0=py_0
  - openblas=0.2.20=vc14_7
  - openmp=5.0.0=vc14_1
  - openssl=1.0.2o=vc14_0
  - pandas=0.23.0=py36_1
  - parquet-cpp=1.4.0=vc14_0
  - pip=9.0.3=py36_0
  - pluggy=0.6.0=py_0
  - py=1.5.3=py_0
  - pyarrow=0.9.0=py36_vc14_1
  - pytest=3.6.0=py36_1
  - python=3.6.5=1
  - python-dateutil=2.7.3=py_0
  - pytz=2018.4=py_0
  - rapidjson=1.1.0=0
  - setuptools=39.2.0=py36_0
  - six=1.11.0=py36_1
  - snappy=1.1.7=vc14_1
  - thrift-cpp=0.11.0=vc14_2
  - vc=14=0
  - vs2015_runtime=14.0.25420=0
  - wheel=0.31.0=py36_0
  - wincertstore=0.2=py36_0
  - zlib=1.2.11=vc14_0
  - zstd=1.3.3=vc14_0
  - blas=1.0=mkl
  - icc_rt=2017.0.4=h97af966_0
  - intel-openmp=2018.0.0=8
  - mkl=2018.0.2=1
  - numpy=1.14.3=py36h9fa60d3_1
  - numpy-base=1.14.3=py36h555522e_1
  - pip:
    - tdda==1.0.1
prefix: E:\Anaconda2\envs\py36test

Pip Freeze (if needed)

atomicwrites==1.1.5
attrs==18.1.0
certifi==2018.4.16
colorama==0.3.9
Cython==0.28.2
feather-format==0.4.0
mkl-fft==1.0.2
mkl-random==1.0.1
more-itertools==4.1.0
numpy==1.14.3
pandas==0.23.0
pluggy==0.6.0
py==1.5.3
pyarrow==0.9.0
pytest==3.6.0
python-dateutil==2.7.3
pytz==2018.4
six==1.11.0
tdda==1.0.1
wincertstore==0.2
Note:

  1. I was able to install tdda in a python 3.5 anaconda environment on Ubuntu 18.04 without this issue.
  2. After research, I attempted to install tdda in a python 2.x environment; however, the requirements for pyarrow only allow for a python 3.x installation.
  3. I usually try to avoid pip installing in a Conda environment; so, when it is needed, I do pip install tdda --no-deps and then install the dependencies via conda install (as many as I can). This hopefully minimizes any issues with pip installing in Conda.
  4. Version 1.0.0 of tdda had the same error, occurring at the interface between UTF8DefiniteObject(s) and subprocess.check_output().

Any help would be appreciated. Please let me know if you need more information from me.

Get one example from each Regex generated

I'm currently using extract to get the regex rules for a set of examples, as in the code below. What I would also like is some sort of dictionary where I can see what regex was assigned to each example. Is there such a thing?

from tdda.rexpy import extract

extractor = extract(examples, as_object=True)   # examples: list of strings
result = extractor.results.rex                  # the inferred regexes
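
There doesn't appear to be a built-in mapping, but one can be constructed by matching each example against the extracted patterns using the standard re module (a sketch, with invented example strings):

import re

from tdda import rexpy

examples = ['2019-01-01', 'AB-123', 'CD-456']
patterns = rexpy.extract(examples)

# Map each example to the first extracted regex that fully matches it.
# The patterns come anchored (^...$), which re.fullmatch handles fine.
assignment = {
    example: next((p for p in patterns if re.fullmatch(p, example)), None)
    for example in examples
}
print(assignment)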

DataFrame 'int' columns fail to verify as "real" type

Pandas DataFrame columns of type 'int' fail to verify as "real" type with tdda.constraints.verify_df, even though the real numbers include the integers (https://en.wikipedia.org/wiki/Real_number).

Consider the following two csv files:

csv1.csv

col1
0
0.1

csv2.csv

col1
0
0

and the following constraints file:

constraints.tdda

{
    "fields": {
        "col1": {
            "type": "real"
        }
    }
}

In the case of csv1.csv, pandas will infer the column type to be 'float64' and in the case of csv2.csv, the column type will be 'int64'.

Running tdda.constraints.verify_df(pd.read_csv("csv1.csv"), "constraints.tdda").to_dataframe() reports no failures:

  field  failures  passes  type
0  col1         0       1  True

However tdda.constraints.verify_df(pd.read_csv("csv2.csv"), "constraints.tdda").to_dataframe() reports failures on the type of 'col1':

  field  failures  passes   type
0  col1         1       0  False

Given that the type is named "real" rather than "float", I would expect 'int' to be accepted as "real".
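
Until that changes, one user-side workaround is to cast integer columns to float before verification, so that pandas reports them as 'float64' (a sketch, reusing the files above):

import pandas as pd
from tdda.constraints import verify_df

df = pd.read_csv('csv2.csv')

# Cast integer columns to float so they verify against "real" constraints.
int_cols = df.select_dtypes(include='int').columns
df[int_cols] = df[int_cols].astype(float)

print(verify_df(df, 'constraints.tdda').to_dataframe())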

Constraint specification with field pattern matching

I have in my dataset about 100 columns that should have the exact same constraint specification. If I use tdda.constraints.pd.constraints.discover_df, it'll generate slightly different constraints because a few of those columns have too few data points.

It would be very convenient if I could have something like:

{
    "i18n.*.names": {
        field-name: field-constraints,
        ...
    }
}

so it would apply to i18n.en.names, i18n.es.names and so on.

Is this a reasonable feature request?
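
As an interim user-side workaround, a wildcard spec along those lines could be expanded into concrete per-field constraints before verification, e.g. with fnmatch (a sketch; the spec and constraints are the hypothetical ones above):

import fnmatch
import json

# Hypothetical wildcard spec, as proposed above (constraints illustrative).
wildcard_spec = {'i18n.*.names': {'type': 'string', 'max_nulls': 0}}

columns = ['i18n.en.names', 'i18n.es.names', 'other_field']

# Expand each pattern into one concrete entry per matching column.
fields = {
    col: constraints
    for pattern, constraints in wildcard_spec.items()
    for col in fnmatch.filter(columns, pattern)
}
print(json.dumps({'fields': fields}, indent=4))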

Request: Ability to get DataFrame containing constraints found by discover

While at PyDataLondon, Giles Weaver suggested it would be useful to be able to turn the DatasetConstraints object returned by constraint discovery into a DataFrame for structured access.

This is possible (at least with Pandas), but would be a little weird, as the natural structure (one row per field, one column per constraint) would lead to rather heterogeneous types in at least some columns: for example, the min of a string field vs. the min of an integer vs. the min of a timestamp. Although Pandas does support heterogeneous ("object") columns, many/most column stores and databases don't. Another option would be to turn everything into strings, but clearly that loses type information etc.

For now, I've simply added a to_dict method to DatasetConstraints so that you can say

constraints = discover_constraints(df)
d = constraints.to_dict()

(It was doing this before generating JSON output anyway, so it was a trivial change.)

I suspect this will be at least as useful as getting a DataFrame out, and perhaps more useful.

This is there in 0.3.6; I haven't pushed it to PyPI yet.

tdda verify ticks and crosses don't show in terminals with weird encodings

At the PyData London 2017 TDDA tutorial, an attendee was using a terminal with an (unknown) "weird" encoding. As a result, the ticks and crosses used by tdda verify did not show: there were just hex codes.

Suggestion is to have an option to generate pure ASCII output (perhaps pass/fail). (It doesn't seem like this would be a great default, if only for reasons of space.)

Add gentest examples

  • tdda examples should include Gentest examples
  • tdda examples gentest should generate Gentest examples (only)

Support for py3.9+

I see that python 3.7+ is the latest supported version. Is there a plan to support more recent versions of python like 3.9+?

The reason I ask is that I tried it in 3.9: the tdda test failed and nothing else ran.

Epsilon doesn't affect integer columns

I have a dataframe with a few integer columns. One of these slightly violates the min constraint I generated from my original dataset, so I tried adjusting the epsilon, but it made no difference no matter how much I changed the value.

I only took a quick look at the code, and my question is: might it be this check, which prevents fuzzy_greater_than from being called (and thus epsilon from being applied) on integer columns?

Support for JSON datasets

I'd really like to use tdda to find anomalies in a dataset of mine, but it is stored as JSON objects (one per row).
I don't think it's a far-fetched request: it's a fairly common format in the Big Data community, used frequently in the AWS environment (e.g. Events > Kinesis > S3 > Athena).

It turns out this format is readily readable by pandas>=0.19.0 with pd.read_json:

pd.read_json(path_to_json, lines=True)
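
Pending native support, the two can already be combined by hand (a sketch; the file names are invented):

import pandas as pd
from tdda.constraints import verify_df

# Read newline-delimited JSON (one object per line) into a DataFrame,
# then verify it against an existing constraints file.
df = pd.read_json('events.json', lines=True)
print(verify_df(df, 'events_constraints.tdda').to_dataframe())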

Allow generation of reference data for a specific module in pytest

TL;DR:

I couldn't generate reference data when running pytest for a specific module

pytest /mnt/c/Users/RowanM/Code/roughwork/tdda_tests/testing.py -s --write-all

But could do so when running all tests

pytest -s --write-all

I could generate tests for a specific function in my test module, which is a big plus and makes the above functionality less essential!

pytest -s --write-all -k specific_module

Thanks for creating & maintaining tdda; I really needed some way to automate the generation of reference tests for my data pipelines, and this is fitting in extremely well! 😃


The problem

To reproduce this bug create the following directories & files in a directory of choice

├── conftest.py
├── ref_data
└── testing.py

conftest.py

import pytest
from tdda.referencetest import referencepytest


def pytest_addoption(parser):
    referencepytest.addoption(parser)


def pytest_collection_modifyitems(session, config, items):
    referencepytest.tagged(config, items)


@pytest.fixture(scope="module")
def ref(request):
    return referencepytest.ref(request)


referencepytest.set_default_data_location("ref_data")

testing.py

import pandas as pd
from tdda.referencetest import referencepytest, tag

def test_specific_module(ref) -> None:

    input = pd.DataFrame([1,2,3])
    ref.assertDataFrameCorrect(input, "CorrectDataFrame.csv")

& finally an empty ref_data folder

and run:

โฏ pytest /mnt/c/Users/RowanM/Code/roughwork/tdda_tests/testing.py -s --write-all
======================================================================================= test session starts =======================================================================================
platform linux -- Python 3.8.2, pytest-5.4.3, py-1.9.0, pluggy-0.13.1
rootdir: /mnt/c/Users/RowanM/Code/roughwork/tdda_tests
plugins: dash-1.13.4
collected 1 item                                                                                                                                                                                  

testing.py F

============================================================================================ FAILURES =============================================================================================
______________________________________________________________________________________ test_specific_module _______________________________________________________________________________________

ref = <tdda.referencetest.referencetest.ReferenceTest object at 0x7f13883526a0>

    def test_specific_module(ref) -> None:
    
        input = pd.DataFrame([1,2,3])
>       ref.assertDataFrameCorrect(input, "CorrectDataFrame.csv")

testing.py:7: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/wsl-rowanm/miniconda3/envs/mplan/lib/python3.8/site-packages/tdda/referencetest/referencetest.py:386: in assertDataFrameCorrect
    self._write_reference_dataset(df, expected_path)
/home/wsl-rowanm/miniconda3/envs/mplan/lib/python3.8/site-packages/tdda/referencetest/referencetest.py:794: in _write_reference_dataset
    self.pandas.write_csv(df, reference_path)
/home/wsl-rowanm/miniconda3/envs/mplan/lib/python3.8/site-packages/tdda/referencetest/checkpandas.py:392: in write_csv
    writer(df, csvfile, **kwargs)
/home/wsl-rowanm/miniconda3/envs/mplan/lib/python3.8/site-packages/tdda/referencetest/checkpandas.py:469: in default_csv_writer
    return df.to_csv(csvfile, **options)
/home/wsl-rowanm/miniconda3/envs/mplan/lib/python3.8/site-packages/pandas/core/generic.py:3204: in to_csv
    formatter.save()
/home/wsl-rowanm/miniconda3/envs/mplan/lib/python3.8/site-packages/pandas/io/formats/csvs.py:184: in save
    f, handles = get_handle(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

path_or_buf = 'ref_data/CorrectDataFrame.csv', mode = 'w', encoding = 'utf-8', compression = None, memory_map = False, is_text = True

    def get_handle(
        path_or_buf,
        mode: str,
        encoding=None,
        compression: Optional[Union[str, Mapping[str, Any]]] = None,
        memory_map: bool = False,
        is_text: bool = True,
    ):
        """
        Get file handle for given path/buffer and mode.
    
        Parameters
        ----------
        path_or_buf : str or file handle
            File path or object.
        mode : str
            Mode to open path_or_buf with.
        encoding : str or None
            Encoding to use.
        compression : str or dict, default None
            If string, specifies compression mode. If dict, value at key 'method'
            specifies compression mode. Compression mode must be one of {'infer',
            'gzip', 'bz2', 'zip', 'xz', None}. If compression mode is 'infer'
            and `filepath_or_buffer` is path-like, then detect compression from
            the following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise
            no compression). If dict and compression mode is 'zip' or inferred as
            'zip', other entries passed as additional compression options.
    
            .. versionchanged:: 1.0.0
    
               May now be a dict with key 'method' as compression mode
               and other keys as compression options if compression
               mode is 'zip'.
    
        memory_map : boolean, default False
            See parsers._parser_params for more information.
        is_text : boolean, default True
            whether file/buffer is in text format (csv, json, etc.), or in binary
            mode (pickle, etc.).
    
        Returns
        -------
        f : file-like
            A file-like object.
        handles : list of file-like objects
            A list of file-like object that were opened in this function.
        """
        try:
            from s3fs import S3File
    
            need_text_wrapping = (BufferedIOBase, RawIOBase, S3File)
        except ImportError:
            need_text_wrapping = (BufferedIOBase, RawIOBase)  # type: ignore
    
        handles: List[IO] = list()
        f = path_or_buf
    
        # Convert pathlib.Path/py.path.local or string
        path_or_buf = stringify_path(path_or_buf)
        is_path = isinstance(path_or_buf, str)
    
        compression, compression_args = get_compression_method(compression)
        if is_path:
            compression = infer_compression(path_or_buf, compression)
    
        if compression:
    
            # GZ Compression
            if compression == "gzip":
                if is_path:
                    f = gzip.open(path_or_buf, mode)
                else:
                    f = gzip.GzipFile(fileobj=path_or_buf)
    
            # BZ Compression
            elif compression == "bz2":
                if is_path:
                    f = bz2.BZ2File(path_or_buf, mode)
                else:
                    f = bz2.BZ2File(path_or_buf)
    
            # ZIP Compression
            elif compression == "zip":
                zf = _BytesZipFile(path_or_buf, mode, **compression_args)
                # Ensure the container is closed as well.
                handles.append(zf)
                if zf.mode == "w":
                    f = zf
                elif zf.mode == "r":
                    zip_names = zf.namelist()
                    if len(zip_names) == 1:
                        f = zf.open(zip_names.pop())
                    elif len(zip_names) == 0:
                        raise ValueError(f"Zero files found in ZIP file {path_or_buf}")
                    else:
                        raise ValueError(
                            "Multiple files found in ZIP file."
                            f" Only one file per ZIP: {zip_names}"
                        )
    
            # XZ Compression
            elif compression == "xz":
                f = _get_lzma_file(lzma)(path_or_buf, mode)
    
            # Unrecognized Compression
            else:
                msg = f"Unrecognized compression type: {compression}"
                raise ValueError(msg)
    
            handles.append(f)
    
        elif is_path:
            if encoding:
                # Encoding
>               f = open(path_or_buf, mode, encoding=encoding, newline="")
E               FileNotFoundError: [Errno 2] No such file or directory: 'ref_data/CorrectDataFrame.csv'

/home/wsl-rowanm/miniconda3/envs/mplan/lib/python3.8/site-packages/pandas/io/common.py:428: FileNotFoundError

Ideas

  • Files with variable names. Start by trying Miró output.
  • Clever comparison of CSVs
  • Handling parquet files (better)
  • Database support
  • When you get a test failure, option to accommodate it by (e.g.) generalizing the regular expression.
