alan-turing-institute / clevercsv

CleverCSV is a Python package for handling messy CSV files. It provides a drop-in replacement for the builtin CSV module with improved dialect detection, and comes with a handy command line application for working with CSV files.

Home Page: https://clevercsv.readthedocs.io

License: MIT License


clevercsv's Introduction



CleverCSV provides a drop-in replacement for the Python csv package with improved dialect detection for messy CSV files. It also provides a handy command line tool that can standardize a messy file or generate Python code to import it.


Contents: Quick Start | Introduction | Installation | Usage | Python Library | Command-Line Tool | Version Control Integration | Contributing | Notes


Quick Start

See the Introduction section below for more details about CleverCSV. If you're in a hurry, here is a quick overview of how to get started with the CleverCSV Python package and the command line interface.

For the Python package:

# Import the package
>>> import clevercsv

# Load the file as a list of rows
# This uses the imdb.csv file in the examples directory
>>> rows = clevercsv.read_table('./imdb.csv')

# Load the file as a Pandas Dataframe
# Note that df = pd.read_csv('./imdb.csv') would fail here
>>> df = clevercsv.read_dataframe('./imdb.csv')

# Use CleverCSV as drop-in replacement for the Python CSV module
# This follows the Sniffer example: https://docs.python.org/3/library/csv.html#csv.Sniffer
# Note that csv.Sniffer would fail here
>>> with open('./imdb.csv', newline='') as csvfile:
...     dialect = clevercsv.Sniffer().sniff(csvfile.read())
...     csvfile.seek(0)
...     reader = clevercsv.reader(csvfile, dialect)
...     rows = list(reader)

And for the command line interface:

# Install the full version of CleverCSV (this includes the command line interface)
$ pip install clevercsv[full]

# Detect the dialect
$ clevercsv detect ./imdb.csv
Detected: SimpleDialect(',', '', '\\')

# Generate code to import the file
$ clevercsv code ./imdb.csv

import clevercsv

with open("./imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)

# Explore the CSV file as a Pandas dataframe
$ clevercsv explore -p imdb.csv
Dropping you into an interactive shell.
CleverCSV has loaded the data into the variable: df
>>> df

Introduction

  • CSV files are awesome! They are lightweight, easy to share, human-readable, version-controllable, and supported by many systems and tools!
  • CSV files are terrible! They can have many different formats, multiple tables, headers or no headers, escape characters, and there's no support for recording metadata!

CleverCSV is a Python package that aims to solve some of the pain points of CSV files, while maintaining many of the good things. The package automatically detects (with high accuracy) the format (dialect) of CSV files, thus making it easier to simply point to a CSV file and load it, without the need for human inspection. In the future, we hope to solve some of the other issues of CSV files too.

CleverCSV is based on science. We investigated thousands of real-world CSV files to find a robust way to automatically detect the dialect of a file. This may seem like an easy problem, but to a computer a CSV file is simply a long string, and every dialect will give you some table. In CleverCSV we use a technique based on the patterns of row lengths of the parsed file and the data type of the resulting cells. With our method we achieve 97% accuracy for dialect detection, with a 21% improvement on non-standard (messy) CSV files compared to the Python standard library.
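The intuition behind the row-pattern part of the method can be illustrated with a toy score (this is only a simplified sketch, not CleverCSV's actual consistency measure, which also weighs the data types of the parsed cells): parse the file under each candidate delimiter and prefer the one that yields many fields per row with consistent row lengths.

```python
# Toy illustration of pattern-based dialect scoring. This is NOT
# CleverCSV's actual measure, just a sketch of the intuition:
# reward delimiters that produce many fields AND consistent rows.
def toy_pattern_score(text: str, delimiter: str) -> float:
    rows = [line.split(delimiter) for line in text.splitlines() if line]
    if not rows:
        return 0.0
    lengths = [len(r) for r in rows]
    mode = max(set(lengths), key=lengths.count)       # most common row length
    agreement = lengths.count(mode) / len(lengths)    # row-length consistency
    return agreement * (mode - 1)

data = "a;b;c\n1;2;3\n4;5;6\n"
best = max(",;|\t", key=lambda d: toy_pattern_score(data, d))  # -> ";"
```

Splitting on the wrong delimiter gives one field per row (score 0), while the true delimiter gives three consistent fields per row, so `;` wins here.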

We think this kind of work can be very valuable for working data scientists and programmers, and we hope that you find CleverCSV useful (if there's a problem, please open an issue!). Since the academic world counts citations, please cite CleverCSV if you use the package. Here's a BibTeX entry you can use:

@article{van2019wrangling,
        title = {Wrangling Messy {CSV} Files by Detecting Row and Type Patterns},
        author = {{van den Burg}, G. J. J. and Naz{\'a}bal, A. and Sutton, C.},
        journal = {Data Mining and Knowledge Discovery},
        year = {2019},
        volume = {33},
        number = {6},
        pages = {1799--1820},
        issn = {1573-756X},
        doi = {10.1007/s10618-019-00646-y},
}

And of course, if you like the package please spread the word! You can do this by Tweeting about it (#CleverCSV) or clicking the ⭐️ on GitHub!

Installation

CleverCSV is available on PyPI. You can install either the full version, which includes the command line interface and all optional dependencies, using

$ pip install clevercsv[full]

or you can install a lighter, core version of CleverCSV with

$ pip install clevercsv

Usage

CleverCSV consists of a Python library and a command line tool called clevercsv.

Python Library

We designed CleverCSV to provide a drop-in replacement for the built-in csv module, with some useful functionality added to it. So if you simply want to replace the built-in csv module with CleverCSV, you can import CleverCSV as follows and use it as you would use the built-in csv module.

import clevercsv

CleverCSV provides an improved version of the dialect sniffer in the CSV module, but it also adds some useful wrapper functions. These functions automatically detect the dialect and aim to make working with CSV files easier. We currently have the following helper functions:

  • detect_dialect: takes a path to a CSV file and returns the detected dialect
  • read_table: automatically detects the dialect and encoding of the file and returns the data as a list of rows. A version that returns a generator is also available: stream_table
  • read_dataframe: detects the dialect and encoding of the file and then uses Pandas to read the CSV into a DataFrame. Note that this function requires Pandas to be installed.
  • read_dicts: detects the dialect and returns the rows of the file as dictionaries, assuming the first row contains the headers. A streaming version called stream_dicts is also available.
  • write_table: writes a table (a list of lists) to a file using the RFC-4180 dialect.
  • write_dicts: writes a list of dictionaries to a file using the RFC-4180 dialect.
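The row- and dict-based helpers relate to each other roughly as in this stdlib-only sketch of the dict conversion (the real CleverCSV helpers additionally detect the dialect and encoding for you):

```python
# Stdlib-only sketch of what read_dicts adds on top of read_table:
# zip each data row with the header row (the first row) into a dict.
def rows_to_dicts(rows):
    header, *data = rows
    return [dict(zip(header, row)) for row in data]

table = [["name", "year"], ["Metropolis", "1927"], ["M", "1931"]]
records = rows_to_dicts(table)
# -> [{"name": "Metropolis", "year": "1927"}, {"name": "M", "year": "1931"}]
```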

Of course, you can also use the traditional way of loading a CSV file, as in the Python CSV module:

import clevercsv

with open("data.csv", "r", newline="") as fp:
  # you can use verbose=True to see what CleverCSV does
  dialect = clevercsv.Sniffer().sniff(fp.read(), verbose=False)
  fp.seek(0)
  reader = clevercsv.reader(fp, dialect)
  rows = list(reader)

Since CleverCSV v0.8.0, dialect detection is a lot faster than in previous versions. However, for large files, you can speed up detection even more by supplying a sample of the document to the sniffer instead of the whole file, for example:

dialect = clevercsv.Sniffer().sniff(fp.read(10000))

You can also speed up encoding detection by installing cChardet; it will be used automatically when it is available on the system.

That's the basics! If you want more details, you can look at the code of the package, the test suite, or the API documentation. If you run into any issues or have comments or suggestions, please open an issue on GitHub.

Command-Line Tool

To use the command line tool, make sure that you install the full version of CleverCSV (see above).

The clevercsv command line application has a number of handy features to make working with CSV files easier. For instance, it can be used to view a CSV file on the command line while automatically detecting the dialect. It can also generate Python code for importing data from a file with the correct dialect. The full help text is as follows:

usage: clevercsv [-h] [-V] [-v] command ...

Available commands:
  help         Display help information
  detect       Detect the dialect of a CSV file
  view         View the CSV file on the command line using TabView
  standardize  Convert a CSV file to one that conforms to RFC-4180
  code         Generate Python code to import a CSV file
  explore      Explore the CSV file in an interactive Python shell

Each of the commands has further options (for instance, the code and explore commands have support for importing the CSV file as a Pandas DataFrame). Use clevercsv help <command> or man clevercsv <command> for more information. Below are some examples for each command.

Note that each command accepts the -n or --num-chars flag to set the number of characters used to detect the dialect. This can be especially helpful to speed up dialect detection on large files.

Code

Code generation is useful when you don't want to detect the dialect of the same file over and over again. You simply run the following command and copy the generated code to a Python script!

$ clevercsv code imdb.csv

# Code generated with CleverCSV

import clevercsv

with open("imdb.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter=",", quotechar="", escapechar="\\")
    rows = list(reader)

We also have a version that loads the file into a Pandas dataframe:

$ clevercsv code --pandas imdb.csv

# Code generated with CleverCSV

import clevercsv

df = clevercsv.read_dataframe("imdb.csv", delimiter=",", quotechar="", escapechar="\\")

Detect

Detection is useful when you only want to know the dialect.

$ clevercsv detect imdb.csv
Detected: SimpleDialect(',', '', '\\')

The --plain flag gives the components of the dialect on separate lines, which makes combining it with grep easier.

$ clevercsv detect --plain imdb.csv
delimiter = ,
quotechar =
escapechar = \

Explore

The explore command is great for a command-line based workflow, or when you quickly want to start working with a CSV file in Python. This command detects the dialect of a CSV file and starts an interactive Python shell with the file already loaded! You can either have the file loaded as a list of lists:

$ clevercsv explore milk.csv
Dropping you into an interactive shell.

CleverCSV has loaded the data into the variable: rows
>>>
>>> len(rows)
381

or you can load the file as a Pandas dataframe:

$ clevercsv explore -p imdb.csv
Dropping you into an interactive shell.

CleverCSV has loaded the data into the variable: df
>>>
>>> df.head()
                   fn        tid  ... War Western
0  titles01/tt0012349  tt0012349  ...   0       0
1  titles01/tt0015864  tt0015864  ...   0       0
2  titles01/tt0017136  tt0017136  ...   0       0
3  titles01/tt0017925  tt0017925  ...   0       0
4  titles01/tt0021749  tt0021749  ...   0       0

[5 rows x 44 columns]

Standardize

Use the standardize command when you want to rewrite a file using the RFC-4180 standard:

$ clevercsv standardize --output imdb_standard.csv imdb.csv

In this particular example, the use of the escape character is replaced by quoting.
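The following stdlib sketch shows what that rewrite amounts to: a delimiter that was escaped with a backslash in the messy file ends up inside a quoted field in the RFC-4180 output.

```python
import csv
import io

# Read a messy line that escapes the delimiter with a backslash...
messy = "id,title\\, with comma\n"
row = next(csv.reader(io.StringIO(messy), escapechar="\\"))
# row == ["id", "title, with comma"]

# ...and write it back out: RFC-4180-style quoting replaces the escape.
buf = io.StringIO()
csv.writer(buf, dialect="unix", quoting=csv.QUOTE_MINIMAL).writerow(row)
# buf.getvalue() == 'id,"title, with comma"\n'
```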

View

This command allows you to view the file in the terminal. The dialect is of course detected using CleverCSV! Both this command and the standardize command support the --transpose flag, if you want to transpose the file before viewing or saving:

$ clevercsv view --transpose imdb.csv

Version Control Integration

If you'd like to make sure that you never commit a messy (non-standard) CSV file to your repository, you can install a pre-commit hook. First, install pre-commit using the installation instructions. Next, add the following configuration to the .pre-commit-config.yaml file in your repository:

repos:
  - repo: https://github.com/alan-turing-institute/CleverCSV-pre-commit
    rev: v0.6.6   # or any later version
    hooks:
      - id: clevercsv-standardize

Finally, run pre-commit install to set up the git hook. Pre-commit will now use CleverCSV to standardize your CSV files following RFC-4180 whenever you commit a CSV file to your repository.

Contributing

If you want to encourage development of CleverCSV, the best thing to do now is to spread the word!

If you encounter an issue in CleverCSV, please open an issue or submit a pull request. Don't hesitate, you're helping to make this project better for everyone! If GitHub's not your thing but you still want to contact us, you can send an email to gertjanvandenburg at gmail dot com instead. You can also ask questions on Gitter.

Note that all contributions to the project must adhere to the Code of Conduct.

The CleverCSV package was originally written by Gertjan van den Burg and came out of scientific research on wrangling messy CSV files by Gertjan van den Burg, Alfredo Nazabal, and Charles Sutton.

Notes

CleverCSV is licensed under the MIT license. Please cite our research if you use CleverCSV in your work.

Copyright (c) 2018-2021 The Alan Turing Institute.


clevercsv's Issues

Suggestion: Make Pandas dependency optional

Hello,
First of all thanks for writing this cool library.
Is it possible to make Pandas an optional dependency? From looking at the code briefly, it seems it should work without Pandas if you are just reading CSV files. There are projects that don't need Pandas but could use this library.
Thanks
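A common way to implement such an optional dependency, sketched here (not necessarily how CleverCSV does or would do it), is a lazy import with a helpful error message:

```python
import importlib

def optional_import(name: str, hint: str):
    """Import a module only when needed, with a friendly error if absent."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(f"'{name}' is required for this feature. {hint}") from exc

# A dataframe-returning function would then start with something like:
# pd = optional_import("pandas", "Install it with 'pip install pandas'.")
```

This way, users who never call the Pandas-based helpers never pay the dependency cost.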

A Possible Typo in the Readthedocs docs of Clevercsv package.

In the description of clevercsv.escape module,

Potential Escape Character is defined as A character is considered a potential escape character if it is in the “Punctuation, Other” Unicode category and in the list of blocked characters.

Block Char is defined as Characters that are in the Punctuation Other category but that should not be considered as escape character.

These two definitions are mutually contradictory. I think the definition of Potential Escape Character should be:
"A character is considered a potential escape character if it is in the “Punctuation, Other” Unicode category and not in the list of blocked characters."
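The corrected definition can be sketched directly with the standard library (the blocked set below is a made-up illustration, not the actual list used by clevercsv.escape):

```python
import unicodedata

# Illustrative only -- NOT the actual blocked list from clevercsv.escape.
BLOCKED = set(".,;:!?\"'")

def is_potential_escape(char: str) -> bool:
    """In the "Punctuation, Other" (Po) category AND not blocked."""
    return unicodedata.category(char) == "Po" and char not in BLOCKED

# Backslash is Po and not blocked, so it qualifies; a comma is Po but
# blocked; a letter is not Po at all.
```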

0.7.4: pep517 build fails

Looks like something is wrong

+ /usr/bin/python3 -sBm build -w --no-isolation
* Getting build dependencies for wheel...
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/pep517/in_process/_in_process.py", line 351, in <module>
    main()
  File "/usr/lib/python3.8/site-packages/pep517/in_process/_in_process.py", line 333, in main
    json_out['return_val'] = hook(**hook_input['kwargs'])
  File "/usr/lib/python3.8/site-packages/pep517/in_process/_in_process.py", line 118, in get_requires_for_build_wheel
    return hook(config_settings)
  File "/usr/lib/python3.8/site-packages/setuptools/build_meta.py", line 338, in get_requires_for_build_wheel
    return self._get_build_requires(config_settings, requirements=['wheel'])
  File "/usr/lib/python3.8/site-packages/setuptools/build_meta.py", line 320, in _get_build_requires
    self.run_setup()
  File "/usr/lib/python3.8/site-packages/setuptools/build_meta.py", line 484, in run_setup
    super(_BuildMetaLegacyBackend,
  File "/usr/lib/python3.8/site-packages/setuptools/build_meta.py", line 335, in run_setup
    exec(code, locals())
  File "<string>", line 99, in <module>
  File "/usr/lib/python3.8/site-packages/setuptools/__init__.py", line 86, in setup
    _install_setup_requires(attrs)
  File "/usr/lib/python3.8/site-packages/setuptools/__init__.py", line 75, in _install_setup_requires
    dist = MinimalDistribution(attrs)
  File "/usr/lib/python3.8/site-packages/setuptools/__init__.py", line 59, in __init__
    self.set_defaults._disable()
AttributeError: 'MinimalDistribution' object has no attribute 'set_defaults'

ERROR Backend subprocess exited when trying to invoke get_requires_for_build_wheel

Here is list of installed modules in build env

Package         Version
--------------- --------------
appdirs         1.4.4
attrs           22.1.0
Brlapi          0.8.3
build           0.9.0
contourpy       1.0.6
cssselect       1.1.0
cycler          0.11.0
distro          1.8.0
dnspython       2.2.1
exceptiongroup  1.0.0
extras          1.0.0
fixtures        4.0.0
fonttools       4.38.0
gpg             1.18.0-unknown
iniconfig       1.1.1
kiwisolver      1.4.4
libcomps        0.1.19
louis           3.24.0
lxml            4.9.1
matplotlib      3.6.2
numpy           1.23.1
olefile         0.46
packaging       21.3
pbr             5.9.0
pep517          0.13.0
Pillow          9.3.0
pip             22.3.1
pluggy          1.0.0
ply             3.11
PyGObject       3.42.2
pyparsing       3.0.9
pytest          7.2.0
python-dateutil 2.8.2
rpm             4.17.0
scour           0.38.2
setuptools      65.6.3
six             1.16.0
testtools       2.5.0
tomli           2.0.1
wheel           0.38.4

clevercsv sniffer slows to a crawl on large-ish files (e.g. FEC data)

Hello,

This is a very neat project! I was thinking "I should collect a bunch of CSV files from the web and do statistics to see what the dialects are, and their predominance, to be able to better detect them" and then I found your paper and Python package! Congrats on this very nice contribution.

I am trying to see how clevercsv performs on FEC data. For instance, let's consider this file:

https://www.fec.gov/files/bulk-downloads/1980/indiv80.zip

$ head -5 fec-indiv-1979–1980.csv
C00078279|A|M11|P|80031492155|22Y||MCKENNON, K R|MIDLAND|MI|00000|||10031979|400|||||CONTRIBUTION REF TO INDIVIDUAL|3062020110011466469
C00078279|A|M11||79031415137|15||OREFFICE, P|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1500||||||3061920110000382948
C00078279|A|M11||79031415137|15||DOWNEY, J|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|300||||||3061920110000382949
C00078279|A|M11||79031415137|15||BLAIR, E|MIDLAND|MI|00000|DOW CHEMICAL CO||10261979|1000||||||3061920110000382950
C00078287|A|Q1||79031231889|15||BLANCHARD, JOHN A|CHICAGO|IL|60685|||03201979|200||||||3061920110000383914

When I try to open the file with clevercsv, it takes an inordinate amount of time and seems to hang. So I tried to use the sniffer as suggested in your example Binder.

# downloaded, unzipped and renamed to a CSV file from:
# https://www.fec.gov/files/bulk-downloads/1980/indiv80.zip
content = open("fec-indiv-1979–1980.csv").read()
clevercsv.Sniffer().sniff(content, verbose=True)

It prints out this:

Running normal form detection ...
Not normal, has potential escapechar.
Running data consistency measure ...

and then a while later (a few minutes later) starts printing:

Considering 92 dialects.
SimpleDialect(',', '', ''):	P =    22104.867952	T =        0.003101	Q =       68.546737
SimpleDialect(',', '"', ''):	P =    13927.762095	T =        0.003668	Q =       51.090510
SimpleDialect(',', '"', '/'):	P =    13839.682333	T =        0.002461	Q =       34.060128
SimpleDialect(',', "'", ''):	P =    12072.093333	T =        0.003278	Q =       39.571560
SimpleDialect(';', '', ''):	P =      106.613556	T =        0.000003	Q =        0.000345
SimpleDialect(';', '"', ''):	P =       99.261000	T =        0.000000	Q =        0.000000
SimpleDialect(';', '"', '/'):	P =       50.238917	skip.
SimpleDialect(';', "'", ''):	P =       49.981222	skip.
SimpleDialect('', '', ''):	P =      308.696000	T =        0.000000	Q =        0.000000
SimpleDialect('', '"', ''):	P =      194.530000	T =        0.000000	Q =        0.000000
SimpleDialect('', '"', '/'):	P =       96.652000	T =        0.000000	Q =        0.000000
SimpleDialect('', "'", ''):	P =      144.787000	T =        0.000000	Q =        0.000000
SimpleDialect(' ', '', ''):	P =    17818.683137	T =        0.346978	Q =     6182.686103
SimpleDialect(' ', '', '/'):	P =    17818.565863	T =        0.346984	Q =     6182.762051
SimpleDialect(' ', '"', ''):	P =    11300.749933	T =        0.353544	Q =     3995.309179
SimpleDialect(' ', '"', '/'):	P =    10372.973520	T =        0.355343	Q =     3685.960429
SimpleDialect(' ', "'", ''):	P =     7231.699311	T =        0.343354	Q =     2483.032090
SimpleDialect(' ', "'", '/'):	P =     7231.658120	T =        0.343362	Q =     2483.075319
SimpleDialect('#', '', ''):	P =      163.330000	skip.
SimpleDialect('#', '"', ''):	P =      103.253000	skip.
SimpleDialect('#', '"', '/'):	P =       67.761333	skip.
SimpleDialect('#', "'", ''):	P =       78.132000	skip.
SimpleDialect('$', '', ''):	P =      155.096500	skip.
SimpleDialect('$', '"', ''):	P =       97.764000	skip.
SimpleDialect('$', '"', '/'):	P =       64.601000	skip.
SimpleDialect('$', "'", ''):	P =       72.892500	skip.
SimpleDialect('%', '', ''):	P =      104.950222	skip.
SimpleDialect('%', '', '\\'):	P =      104.783889	skip.
SimpleDialect('%', '"', ''):	P =       65.896889	skip.
SimpleDialect('%', '"', '/'):	P =       65.765333	skip.
SimpleDialect('%', '"', '\\'):	P =       65.730556	skip.
SimpleDialect('%', "'", ''):	P =       49.648556	skip.
SimpleDialect('%', "'", '\\'):	P =       49.482222	skip.
SimpleDialect('&', '', ''):	P =     2940.570750	skip.
SimpleDialect('&', '', '/'):	P =     2940.446000	skip.
SimpleDialect('&', '"', ''):	P =     1936.209667	skip.
SimpleDialect('&', '"', '/'):	P =     1441.305200	skip.
SimpleDialect('&', "'", ''):	P =     1340.900250	skip.
SimpleDialect('&', "'", '/'):	P =     1340.775500	skip.
SimpleDialect('*', '', ''):	P =      156.344000	skip.
SimpleDialect('*', '"', ''):	P =       97.514500	skip.
SimpleDialect('*', '"', '/'):	P =       65.599000	skip.
SimpleDialect('*', "'", ''):	P =       73.142000	skip.
SimpleDialect('+', '', ''):	P =      156.344000	skip.
SimpleDialect('+', '', '\\'):	P =      156.094500	skip.
SimpleDialect('+', '"', ''):	P =       99.011500	skip.
SimpleDialect('+', '"', '/'):	P =       65.266333	skip.
SimpleDialect('+', '"', '\\'):	P =       99.011500	skip.
SimpleDialect('+', "'", ''):	P =       73.890500	skip.
SimpleDialect('+', "'", '\\'):	P =       73.890500	skip.
SimpleDialect('-', '', ''):	P =     1456.570500	skip.
SimpleDialect('-', '"', ''):	P =      914.921750	skip.
SimpleDialect('-', '"', '/'):	P =      700.916267	skip.
SimpleDialect('-', "'", ''):	P =      687.513667	skip.
SimpleDialect(':', '', ''):	P =      155.096500	skip.
SimpleDialect(':', '"', ''):	P =       97.514500	skip.
SimpleDialect(':', '"', '/'):	P =       64.933667	skip.
SimpleDialect('<', '', ''):	P =      155.845000	skip.
SimpleDialect('<', '"', ''):	P =       97.764000	skip.
SimpleDialect('<', '"', '/'):	P =       65.100000	skip.
SimpleDialect('<', "'", ''):	P =       73.391500	skip.
SimpleDialect('?', '', ''):	P =      155.595500	skip.
SimpleDialect('?', '"', ''):	P =       98.512500	skip.
SimpleDialect('?', '"', '/'):	P =       96.652000	skip.
SimpleDialect('@', '', ''):	P =      156.344000	skip.
SimpleDialect('@', '"', ''):	P =       98.762000	skip.
SimpleDialect('@', '"', '/'):	P =       64.767333	skip.
SimpleDialect('@', "'", ''):	P =       73.391500	skip.
SimpleDialect('\\', '', ''):	P =      105.282889	skip.
SimpleDialect('\\', '"', ''):	P =       66.063222	skip.
SimpleDialect('\\', '"', '/'):	P =       66.098000	skip.
SimpleDialect('\\', "'", ''):	P =       74.140000	skip.
SimpleDialect('^', '', ''):	P =      154.597500	skip.
SimpleDialect('^', '"', ''):	P =       97.514500	skip.
SimpleDialect('^', '"', '/'):	P =       64.601000	skip.
SimpleDialect('^', "'", ''):	P =       72.643000	skip.
SimpleDialect('_', '', ''):	P =      156.094500	skip.
SimpleDialect('_', '"', ''):	P =       98.013500	skip.
SimpleDialect('_', '"', '/'):	P =       65.100000	skip.
SimpleDialect('_', "'", ''):	P =       73.391500	skip.
SimpleDialect('|', '', ''):	P =   293996.190476	T =        0.946519	Q =   278273.106576
SimpleDialect('|', '', '/'):	P =   146998.094048	skip.
SimpleDialect('|', '', '@'):	P =   146998.094048	skip.
SimpleDialect('|', '', '\\'):	P =   146998.092857	skip.
SimpleDialect('|', '"', ''):	P =   185266.666667	skip.
SimpleDialect('|', '"', '/'):	P =    46024.763214	skip.
SimpleDialect('|', '"', '@'):	P =    92633.332143	skip.
SimpleDialect('|', '"', '\\'):	P =    92633.330952	skip.
SimpleDialect('|', "'", ''):	P =    12535.572981	skip.
SimpleDialect('|', "'", '/'):	P =    12535.572981	skip.
SimpleDialect('|', "'", '@'):	P =    12535.572765	skip.
SimpleDialect('|', "'", '\\'):	P =    12535.572981	skip.

I would say this takes about 30 minutes, and finally it concludes:

SimpleDialect('|', '', '')

I think I understand what's going on: you designed this for small-ish datasets, and so you reprocess the whole file for every dialect to determine which makes the most sense.

I would be tempted to think this is because I feed the data as a variable (content), following your example, rather than providing the filename directly. However, when I tried the read_csv method directly with the filename, it was also really, very, very slow. So I think clevercsv currently trips on this file in all situations, and more generally on this type of file.

When I take the initiative to truncate the data arbitrarily, clevercsv works beautifully. But shouldn't the truncating be something the library, rather than the user, does?

clevercsv.Sniffer().sniff(content[0:1000], verbose=True)

provides in a few seconds:

Running normal form detection ...
Didn't match any normal forms.
Running data consistency measure ...
Considering 4 dialects.
SimpleDialect(',', '', ''):	P =        4.500000	T =        0.000000	Q =        0.000000
SimpleDialect('', '', ''):	P =        0.009000	T =        0.000000	Q =        0.000000
SimpleDialect(' ', '', ''):	P =        1.562500	T =        0.312500	Q =        0.488281
SimpleDialect('|', '', ''):	P =        8.571429	T =        0.952381	Q =        8.163265
SimpleDialect('|', '', '')

@GjjvdBurg If this is not a known problem, may I suggest using some variation of "infinite binary search"?

  • We start with a small size, and we truncate the file to that small size.
  • The detected dialect may be wrong because we are only working on a small subset of the file, so we double the amount of content that we provide the sniffer, and see if it answers the same thing.
  • We repeat a predetermined number of times (for instance 4 times), until we've asserted the sniffer has detected the same dialect for larger portions of the file.
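The doubling strategy described above can be sketched as follows (using the standard library's csv.Sniffer as a stand-in for clevercsv's sniffer; the probe_sniff name and defaults are illustrative):

```python
import csv

def probe_sniff(content: str, start: int = 256, rounds: int = 3):
    """Sniff progressively larger prefixes until the answer stabilizes.

    Sketch of the doubling strategy: detect on a small prefix, double
    the prefix, and stop once the same dialect is seen `rounds` times.
    """
    size, prev, agreed = start, None, 0
    while agreed < rounds:
        dialect = csv.Sniffer().sniff(content[:size])
        key = (dialect.delimiter, dialect.quotechar)
        if key == prev:
            agreed += 1          # same answer on a bigger sample
        else:
            prev, agreed = key, 1  # answer changed: start counting over
        if size >= len(content):
            break                # whole file seen; nothing bigger to try
        size *= 2
    return prev
```

Because each probe only reads a prefix, the cost stays proportional to the sample that was actually needed, not to the file size.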

I have implemented this algorithm here:
https://gist.github.com/jlumbroso/c123a30a2380b58989c7b12fe4b4f49e

When I run it on the above mentioned file, it immediately (without futzing) produces the correct answer:

In[3]: probe_sniff(content)
Out[3]: SimpleDialect('|', '', '')

And on the off-chance you would like me to add this algorithm to the codebase, where would it go?

Testsuite failure with Python 3.11

Hello,

Debian is currently migrating to Python 3.11 and the clevercsv testsuite fails with the following error:

I: pybuild base:240: cd /<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build; python3.11 -m unittest discover -v -s tests/test_unit
test_get_best_set_1 (test_consistency.ConsistencyTestCase.test_get_best_set_1) ... ok
test_get_best_set_2 (test_consistency.ConsistencyTestCase.test_get_best_set_2) ... ok
test_code_1 (test_console.ConsoleTestCase.test_code_1) ... ERROR
/usr/lib/python3.11/unittest/case.py:622: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
  with outcome.testPartExecutor(self):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
test_code_2 (test_console.ConsoleTestCase.test_code_2) ... ERROR
test_code_3 (test_console.ConsoleTestCase.test_code_3) ... ERROR
/usr/lib/python3.11/unittest/case.py:622: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='ISO-8859-1'>
  with outcome.testPartExecutor(self):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
test_code_4 (test_console.ConsoleTestCase.test_code_4) ... ERROR
test_code_5 (test_console.ConsoleTestCase.test_code_5) ... ERROR
test_detect_base (test_console.ConsoleTestCase.test_detect_base) ... 
  test_detect_base (test_console.ConsoleTestCase.test_detect_base) (name='simple') ... ERROR
/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py:50: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
  with self.subTest(name="simple"):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
  test_detect_base (test_console.ConsoleTestCase.test_detect_base) (name='escaped') ... ERROR
/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py:55: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
  with self.subTest(name="escaped"):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
test_detect_opts_1 (test_console.ConsoleTestCase.test_detect_opts_1) ... ERROR
/usr/lib/python3.11/unittest/case.py:622: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='windows-1252'>
  with outcome.testPartExecutor(self):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
test_detect_opts_2 (test_console.ConsoleTestCase.test_detect_opts_2) ... ERROR
test_detect_opts_3 (test_console.ConsoleTestCase.test_detect_opts_3) ... ERROR
test_standardize_1 (test_console.ConsoleTestCase.test_standardize_1) ... ERROR
test_standardize_2 (test_console.ConsoleTestCase.test_standardize_2) ... ERROR
test_standardize_3 (test_console.ConsoleTestCase.test_standardize_3) ... ERROR
test_standardize_in_place (test_console.ConsoleTestCase.test_standardize_in_place) ... ERROR
test_standardize_in_place_multi (test_console.ConsoleTestCase.test_standardize_in_place_multi) ... ok
test_standardize_in_place_multi_noop (test_console.ConsoleTestCase.test_standardize_in_place_multi_noop) ... ok
test_standardize_in_place_noop (test_console.ConsoleTestCase.test_standardize_in_place_noop) ... ok
test_standardize_multi (test_console.ConsoleTestCase.test_standardize_multi) ... ok
test_standardize_multi_encoding (test_console.ConsoleTestCase.test_standardize_multi_encoding) ... ok
test_standardize_multi_errors (test_console.ConsoleTestCase.test_standardize_multi_errors) ... ok
test_parse_builtin_1 (test_cparser.ParserTestCase.test_parse_builtin_1) ... ok
test_parse_builtin_10 (test_cparser.ParserTestCase.test_parse_builtin_10) ... ok
test_parse_builtin_11 (test_cparser.ParserTestCase.test_parse_builtin_11) ... ok
test_parse_builtin_12 (test_cparser.ParserTestCase.test_parse_builtin_12) ... ok
test_parse_builtin_13 (test_cparser.ParserTestCase.test_parse_builtin_13) ... ok
test_parse_builtin_14 (test_cparser.ParserTestCase.test_parse_builtin_14) ... ok
test_parse_builtin_15 (test_cparser.ParserTestCase.test_parse_builtin_15) ... ok
test_parse_builtin_2 (test_cparser.ParserTestCase.test_parse_builtin_2) ... ok
test_parse_builtin_3 (test_cparser.ParserTestCase.test_parse_builtin_3) ... ok
test_parse_builtin_4 (test_cparser.ParserTestCase.test_parse_builtin_4) ... ok
test_parse_builtin_5 (test_cparser.ParserTestCase.test_parse_builtin_5) ... ok
test_parse_builtin_6 (test_cparser.ParserTestCase.test_parse_builtin_6) ... ok
test_parse_builtin_7 (test_cparser.ParserTestCase.test_parse_builtin_7) ... ok
test_parse_builtin_8 (test_cparser.ParserTestCase.test_parse_builtin_8) ... ok
test_parse_builtin_9 (test_cparser.ParserTestCase.test_parse_builtin_9) ... ok
test_parse_differ_1 (test_cparser.ParserTestCase.test_parse_differ_1) ... ok
test_parse_differ_2 (test_cparser.ParserTestCase.test_parse_differ_2) ... ok
test_parse_dq_1 (test_cparser.ParserTestCase.test_parse_dq_1) ... ok
test_parse_dq_2 (test_cparser.ParserTestCase.test_parse_dq_2) ... ok
test_parse_escape_1 (test_cparser.ParserTestCase.test_parse_escape_1) ... ok
test_parse_escape_2 (test_cparser.ParserTestCase.test_parse_escape_2) ... ok
test_parse_mix_double_escape_1 (test_cparser.ParserTestCase.test_parse_mix_double_escape_1) ... ok
test_parse_no_delim_1 (test_cparser.ParserTestCase.test_parse_no_delim_1) ... ok
test_parse_no_delim_2 (test_cparser.ParserTestCase.test_parse_no_delim_2) ... ok
test_parse_no_delim_3 (test_cparser.ParserTestCase.test_parse_no_delim_3) ... ok
test_parse_no_delim_4 (test_cparser.ParserTestCase.test_parse_no_delim_4) ... ok
test_parse_no_delim_5 (test_cparser.ParserTestCase.test_parse_no_delim_5) ... ok
test_parse_no_delim_6 (test_cparser.ParserTestCase.test_parse_no_delim_6) ... ok
test_parse_other_1 (test_cparser.ParserTestCase.test_parse_other_1) ... ok
test_parse_other_2 (test_cparser.ParserTestCase.test_parse_other_2) ... ok
test_parse_other_3 (test_cparser.ParserTestCase.test_parse_other_3) ... ok
test_parse_other_4 (test_cparser.ParserTestCase.test_parse_other_4) ... ok
test_parse_other_5 (test_cparser.ParserTestCase.test_parse_other_5) ... ok
test_parse_other_6 (test_cparser.ParserTestCase.test_parse_other_6) ... ok
test_parse_quote_mismatch_1 (test_cparser.ParserTestCase.test_parse_quote_mismatch_1) ... ok
test_parse_quote_mismatch_2 (test_cparser.ParserTestCase.test_parse_quote_mismatch_2) ... ok
test_parse_quote_mismatch_3 (test_cparser.ParserTestCase.test_parse_quote_mismatch_3) ... ok
test_parse_quote_mismatch_4 (test_cparser.ParserTestCase.test_parse_quote_mismatch_4) ... ok
test_parse_return_quoted_1 (test_cparser.ParserTestCase.test_parse_return_quoted_1) ... ok
test_parse_return_quoted_2 (test_cparser.ParserTestCase.test_parse_return_quoted_2) ... ok
test_parse_return_quoted_3 (test_cparser.ParserTestCase.test_parse_return_quoted_3) ... ok
test_parse_simple_1 (test_cparser.ParserTestCase.test_parse_simple_1) ... ok
test_parse_simple_2 (test_cparser.ParserTestCase.test_parse_simple_2) ... ok
test_parse_simple_3 (test_cparser.ParserTestCase.test_parse_simple_3) ... ok
test_parse_simple_4 (test_cparser.ParserTestCase.test_parse_simple_4) ... ok
test_parse_simple_5 (test_cparser.ParserTestCase.test_parse_simple_5) ... ok
test_parse_simple_6 (test_cparser.ParserTestCase.test_parse_simple_6) ... ok
test_parse_simple_7 (test_cparser.ParserTestCase.test_parse_simple_7) ... ok
test_parse_single_1 (test_cparser.ParserTestCase.test_parse_single_1) ... ok
test_delimiters (test_detect.DetectorTestCase.test_delimiters) ... ok
test_detect (test_detect.DetectorTestCase.test_detect) ... ok
test_has_header (test_detect.DetectorTestCase.test_has_header) ... ok
test_has_header_regex_special_delimiter (test_detect.DetectorTestCase.test_has_header_regex_special_delimiter) ... ok
test_abstraction_1 (test_detect_pattern.PatternTestCase.test_abstraction_1) ... ok
test_abstraction_10 (test_detect_pattern.PatternTestCase.test_abstraction_10) ... ok
test_abstraction_11 (test_detect_pattern.PatternTestCase.test_abstraction_11) ... ok
test_abstraction_12 (test_detect_pattern.PatternTestCase.test_abstraction_12) ... ok
test_abstraction_13 (test_detect_pattern.PatternTestCase.test_abstraction_13) ... ok
test_abstraction_14 (test_detect_pattern.PatternTestCase.test_abstraction_14) ... ok
test_abstraction_15 (test_detect_pattern.PatternTestCase.test_abstraction_15) ... ok
test_abstraction_16 (test_detect_pattern.PatternTestCase.test_abstraction_16) ... ok
test_abstraction_2 (test_detect_pattern.PatternTestCase.test_abstraction_2) ... ok
test_abstraction_3 (test_detect_pattern.PatternTestCase.test_abstraction_3) ... ok
test_abstraction_4 (test_detect_pattern.PatternTestCase.test_abstraction_4) ... ok
test_abstraction_5 (test_detect_pattern.PatternTestCase.test_abstraction_5) ... ok
test_abstraction_6 (test_detect_pattern.PatternTestCase.test_abstraction_6) ... ok
test_abstraction_7 (test_detect_pattern.PatternTestCase.test_abstraction_7) ... ok
test_abstraction_8 (test_detect_pattern.PatternTestCase.test_abstraction_8) ... ok
test_abstraction_9 (test_detect_pattern.PatternTestCase.test_abstraction_9) ... ok
test_fill_empties_1 (test_detect_pattern.PatternTestCase.test_fill_empties_1) ... ok
test_pattern_score_1 (test_detect_pattern.PatternTestCase.test_pattern_score_1) ... ok
test_pattern_score_2 (test_detect_pattern.PatternTestCase.test_pattern_score_2) ... ok
test_pattern_score_3 (test_detect_pattern.PatternTestCase.test_pattern_score_3) ... ok
test_bytearray (test_detect_type.TypeDetectorTestCase.test_bytearray) ... ok
test_date (test_detect_type.TypeDetectorTestCase.test_date) ... ok
test_datetime (test_detect_type.TypeDetectorTestCase.test_datetime) ... ok
test_number (test_detect_type.TypeDetectorTestCase.test_number) ... ok
test_type_score_1 (test_detect_type.TypeDetectorTestCase.test_type_score_1) ... ok
test_type_score_2 (test_detect_type.TypeDetectorTestCase.test_type_score_2) ... ok
test_type_score_3 (test_detect_type.TypeDetectorTestCase.test_type_score_3) ... ok
test_unicode_alphanum (test_detect_type.TypeDetectorTestCase.test_unicode_alphanum) ... ok
test_unix_path (test_detect_type.TypeDetectorTestCase.test_unix_path) ... ok
test_url (test_detect_type.TypeDetectorTestCase.test_url) ... ok
test_read_dict_fieldnames_chain (test_dict.DictTestCase.test_read_dict_fieldnames_chain) ... ok
test_read_dict_fieldnames_from_file (test_dict.DictTestCase.test_read_dict_fieldnames_from_file) ... ok
test_read_dict_fields (test_dict.DictTestCase.test_read_dict_fields) ... ok
test_read_dict_no_fieldnames (test_dict.DictTestCase.test_read_dict_no_fieldnames) ... ok
test_read_duplicate_fieldnames (test_dict.DictTestCase.test_read_duplicate_fieldnames) ... ok
test_read_long (test_dict.DictTestCase.test_read_long) ... ok
test_read_long_with_rest (test_dict.DictTestCase.test_read_long_with_rest) ... ok
test_read_long_with_rest_no_fieldnames (test_dict.DictTestCase.test_read_long_with_rest_no_fieldnames) ... ok
test_read_multi (test_dict.DictTestCase.test_read_multi) ... ok
test_read_semi_sep (test_dict.DictTestCase.test_read_semi_sep) ... ok
test_read_short (test_dict.DictTestCase.test_read_short) ... ok
test_read_with_blanks (test_dict.DictTestCase.test_read_with_blanks) ... ok
test_typo_in_extrasaction_raises_error (test_dict.DictTestCase.test_typo_in_extrasaction_raises_error) ... ok
test_write_field_not_in_field_names_ignore (test_dict.DictTestCase.test_write_field_not_in_field_names_ignore) ... ok
test_write_field_not_in_field_names_raise (test_dict.DictTestCase.test_write_field_not_in_field_names_raise) ... ok
test_write_fields_not_in_fieldnames (test_dict.DictTestCase.test_write_fields_not_in_fieldnames) ... ok
test_write_multiple_dict_rows (test_dict.DictTestCase.test_write_multiple_dict_rows) ... ok
test_write_no_fields (test_dict.DictTestCase.test_write_no_fields) ... ok
test_write_simple_dict (test_dict.DictTestCase.test_write_simple_dict) ... ok
test_writeheader_return_value (test_dict.DictTestCase.test_writeheader_return_value) ... ok
test_encoding_1 (test_encoding.EncodingTestCase.test_encoding_1) ... ok
test_encoding_2 (test_encoding.EncodingTestCase.test_encoding_2) ... ok
test_encoding_3 (test_encoding.EncodingTestCase.test_encoding_3) ... ok
test_sniffer_fuzzing (test_fuzzing.FuzzingTestCase.test_sniffer_fuzzing) ... ok
test_form_1 (test_normal_forms.NormalFormTestCase.test_form_1) ... ok
test_form_2 (test_normal_forms.NormalFormTestCase.test_form_2) ... ok
test_form_3 (test_normal_forms.NormalFormTestCase.test_form_3) ... ok
test_form_4 (test_normal_forms.NormalFormTestCase.test_form_4) ... ok
test_form_5 (test_normal_forms.NormalFormTestCase.test_form_5) ... ok
test_filter_urls (test_potential_dialects.PotentialDialectTestCase.test_filter_urls) ... ok
test_get_delimiters (test_potential_dialects.PotentialDialectTestCase.test_get_delimiters) ... ok
test_get_quotechars (test_potential_dialects.PotentialDialectTestCase.test_get_quotechars) ... ok
test_masked_by_quotechar (test_potential_dialects.PotentialDialectTestCase.test_masked_by_quotechar) ... ok
test_no_delim (test_reader.ReaderTestCase.test_no_delim) ... ok
test_read_bigfield (test_reader.ReaderTestCase.test_read_bigfield) ... ok
test_read_eof (test_reader.ReaderTestCase.test_read_eof) ... ok
test_read_eol (test_reader.ReaderTestCase.test_read_eol) ... ok
test_read_escape (test_reader.ReaderTestCase.test_read_escape) ... ok
test_read_linenum (test_reader.ReaderTestCase.test_read_linenum) ... ok
test_read_oddinputs (test_reader.ReaderTestCase.test_read_oddinputs) ... ok
test_simple (test_reader.ReaderTestCase.test_simple) ... ok
test_with_gen (test_reader.ReaderTestCase.test_with_gen) ... ok
test_read_dataframe (test_wrappers.WrappersTestCase.test_read_dataframe) ... 
  test_read_dataframe (test_wrappers.WrappersTestCase.test_read_dataframe) (name='simple') ... ERROR
/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py:95: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
  with self.subTest(name="simple"):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
  test_read_dataframe (test_wrappers.WrappersTestCase.test_read_dataframe) (name='escaped') ... ERROR
/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py:100: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
  with self.subTest(name="escaped"):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
  test_read_dataframe (test_wrappers.WrappersTestCase.test_read_dataframe) (name='simple_nchar') ... ERROR
/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py:115: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
  with self.subTest(name="simple_nchar"):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
  test_read_dataframe (test_wrappers.WrappersTestCase.test_read_dataframe) (name='simple_encoding') ... ERROR
/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py:120: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='latin1'>
  with self.subTest(name="simple_encoding"):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
test_read_table (test_wrappers.WrappersTestCase.test_read_table) ... 
  test_read_table (test_wrappers.WrappersTestCase.test_read_table) (name='simple') ... ERROR
/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py:126: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
  with self.subTest(name="simple"):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
  test_read_table (test_wrappers.WrappersTestCase.test_read_table) (name='escaped') ... ERROR
/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py:131: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
  with self.subTest(name="escaped"):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
test_stream_table (test_wrappers.WrappersTestCase.test_stream_table) ... 
  test_stream_table (test_wrappers.WrappersTestCase.test_stream_table) (name='simple') ... ERROR
/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py:161: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
  with self.subTest(name="simple"):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
  test_stream_table (test_wrappers.WrappersTestCase.test_stream_table) (name='escaped') ... ERROR
/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py:166: ResourceWarning: unclosed file <_io.TextIOWrapper name=3 mode='w' encoding='UTF-8'>
  with self.subTest(name="escaped"):
ResourceWarning: Enable tracemalloc to get the object allocation traceback
test_write_dicts (test_wrappers.WrappersTestCase.test_write_dicts) ... 
  test_write_dicts (test_wrappers.WrappersTestCase.test_write_dicts) (name='dialect') ... ERROR
test_write_table (test_wrappers.WrappersTestCase.test_write_table) ... 
  test_write_table (test_wrappers.WrappersTestCase.test_write_table) (name='dialect') ... ERROR
  test_write_table (test_wrappers.WrappersTestCase.test_write_table) (name='transposed') ... ERROR
test_write_arg_valid (test_write.WriterTestCase.test_write_arg_valid) ... ok
test_write_bigfield (test_write.WriterTestCase.test_write_bigfield) ... ok
test_write_csv_dialect (test_write.WriterTestCase.test_write_csv_dialect) ... ok
test_write_quoting (test_write.WriterTestCase.test_write_quoting) ... ok
test_write_simpledialect (test_write.WriterTestCase.test_write_simpledialect) ... ok
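The `ResourceWarning` lines interleaved with the results above come from file handles that are opened but never closed; the log's hint ("Enable tracemalloc to get the object allocation traceback") refers to Python recording allocation tracebacks for such objects. A minimal sketch, independent of CleverCSV, that reproduces this class of warning:

```python
import gc
import os
import tempfile
import tracemalloc
import warnings

# Record allocation tracebacks, as the log's hint suggests.
tracemalloc.start(10)

def leak(path):
    # Opened but never closed: CPython emits a ResourceWarning when the
    # file object is garbage-collected.
    f = open(path, "w")
    f.write("a,b,c\n")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always", ResourceWarning)
    fd, tmp = tempfile.mkstemp()
    os.close(fd)
    leak(tmp)   # the leaked handle is collected when leak() returns
    gc.collect()

print(any(issubclass(w.category, ResourceWarning) for w in caught))
os.unlink(tmp)
```

With tracemalloc enabled, the warning's traceback points at the `open()` call in `leak`, which is how the `test_wrappers.py` line numbers in the log were obtained.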

======================================================================
ERROR: test_code_1 (test_console.ConsoleTestCase.test_code_1)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 126, in test_code_1
    tmpfname = self._build_file(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string
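The failure repeated throughout this log can be isolated from the test harness. A minimal sketch using only the standard `csv` module (the two-character quotechar here is a hypothetical stand-in for whatever invalid value the dialect conversion produced) showing the validation that raises:

```python
import csv
import io

buf = io.StringIO()
try:
    # csv.writer validates dialect attributes up front; a quotechar that
    # is not exactly one character is rejected with the TypeError seen
    # in the log above.
    csv.writer(buf, quotechar="ab")
    message = ""
except TypeError as exc:
    message = str(exc)

print(message)  # "quotechar" must be a 1-character string
```

Since every traceback funnels through `clevercsv/write.py` line 32, the invalid quotechar reaches `csv.writer` via the dialect built in the tests' `_build_file` helper rather than from the files themselves.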

======================================================================
ERROR: test_code_2 (test_console.ConsoleTestCase.test_code_2)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 166, in test_code_2
    tmpfname = self._build_file(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_code_3 (test_console.ConsoleTestCase.test_code_3)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 191, in test_code_3
    tmpfname = self._build_file(table, dialect, encoding=encoding)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_code_4 (test_console.ConsoleTestCase.test_code_4)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 232, in test_code_4
    tmpfname = self._build_file(table, dialect, encoding=encoding)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_code_5 (test_console.ConsoleTestCase.test_code_5)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 273, in test_code_5
    tmpfname = self._build_file(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_detect_base (test_console.ConsoleTestCase.test_detect_base) (name='simple')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 51, in test_detect_base
    self._detect_test_wrap(table, dialect)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 34, in _detect_test_wrap
    tmpfname = self._build_file(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_detect_base (test_console.ConsoleTestCase.test_detect_base) (name='escaped')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 56, in test_detect_base
    self._detect_test_wrap(table, dialect)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 34, in _detect_test_wrap
    tmpfname = self._build_file(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_detect_opts_1 (test_console.ConsoleTestCase.test_detect_opts_1)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 72, in test_detect_opts_1
    tmpfname = self._build_file(table, dialect, encoding=encoding)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_detect_opts_2 (test_console.ConsoleTestCase.test_detect_opts_2)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 89, in test_detect_opts_2
    tmpfname = self._build_file(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_detect_opts_3 (test_console.ConsoleTestCase.test_detect_opts_3)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 106, in test_detect_opts_3
    tmpfname = self._build_file(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_standardize_1 (test_console.ConsoleTestCase.test_standardize_1)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 314, in test_standardize_1
    tmpfname = self._build_file(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_standardize_2 (test_console.ConsoleTestCase.test_standardize_2)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 334, in test_standardize_2
    tmpfname = self._build_file(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_standardize_3 (test_console.ConsoleTestCase.test_standardize_3)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 358, in test_standardize_3
    tmpfname = self._build_file(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_standardize_in_place (test_console.ConsoleTestCase.test_standardize_in_place)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 379, in test_standardize_in_place
    tmpfname = self._build_file(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_console.py", line 28, in _build_file
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_read_dataframe (test_wrappers.WrappersTestCase.test_read_dataframe) (name='simple')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 96, in test_read_dataframe
    self._df_test(table, dialect)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 27, in _df_test
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_read_dataframe (test_wrappers.WrappersTestCase.test_read_dataframe) (name='escaped')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 101, in test_read_dataframe
    self._df_test(table, dialect)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 27, in _df_test
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_read_dataframe (test_wrappers.WrappersTestCase.test_read_dataframe) (name='simple_nchar')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 116, in test_read_dataframe
    self._df_test(table, dialect, num_char=10)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 27, in _df_test
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_read_dataframe (test_wrappers.WrappersTestCase.test_read_dataframe) (name='simple_encoding')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 121, in test_read_dataframe
    self._df_test(table, dialect, num_char=10, encoding="latin1")
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 27, in _df_test
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_read_table (test_wrappers.WrappersTestCase.test_read_table) (name='simple')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 127, in test_read_table
    self._read_test(table, dialect)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 49, in _read_test
    tmpfname = self._write_tmpfile(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 43, in _write_tmpfile
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_read_table (test_wrappers.WrappersTestCase.test_read_table) (name='escaped')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 132, in test_read_table
    self._read_test(table, dialect)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 49, in _read_test
    tmpfname = self._write_tmpfile(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 43, in _write_tmpfile
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_stream_table (test_wrappers.WrappersTestCase.test_stream_table) (name='simple')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 162, in test_stream_table
    self._stream_test(table, dialect)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 57, in _stream_test
    tmpfname = self._write_tmpfile(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 43, in _write_tmpfile
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_stream_table (test_wrappers.WrappersTestCase.test_stream_table) (name='escaped')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 167, in test_stream_table
    self._stream_test(table, dialect)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 57, in _stream_test
    tmpfname = self._write_tmpfile(table, dialect)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 43, in _write_tmpfile
    w = writer(tmpid, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_write_dicts (test_wrappers.WrappersTestCase.test_write_dicts) (name='dialect')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 258, in test_write_dicts
    self._write_test_dicts(items, exp, dialect=dialect)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 238, in _write_test_dicts
    wrappers.write_dicts(items, tmpfname, **kwargs)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/wrappers.py", line 445, in write_dicts
    w = DictWriter(fp, fieldnames=fieldnames, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/dict_read_write.py", line 105, in __init__
    self.writer = writer(f, dialect, *args, **kwds)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_write_table (test_wrappers.WrappersTestCase.test_write_table) (name='dialect')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 215, in test_write_table
    self._write_test_table(table, exp, dialect=dialect)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 195, in _write_test_table
    wrappers.write_table(table, tmpfname, **kwargs)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/wrappers.py", line 409, in write_table
    w = writer(fp, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

======================================================================
ERROR: test_write_table (test_wrappers.WrappersTestCase.test_write_table) (name='transposed')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 219, in test_write_table
    self._write_test_table(table, exp, dialect=dialect, transpose=True)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/tests/test_unit/test_wrappers.py", line 195, in _write_test_table
    wrappers.write_table(table, tmpfname, **kwargs)
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/wrappers.py", line 409, in write_table
    w = writer(fp, dialect=dialect)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<<PKGBUILDDIR>>/.pybuild/cpython3_3.11/build/clevercsv/write.py", line 32, in __init__
    self._writer = csv.writer(csvfile, dialect=self.dialect)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: "quotechar" must be a 1-character string

----------------------------------------------------------------------
Ran 156 tests in 0.057s

FAILED (errors=25)

As you can see, it's the same error that's repeated. I can't promise anything, but I might try to submit a patch later.

Detection breaks on good file

The following file is from Google Sheets. One column contains Markdown-formatted multiline text. The problem is that CleverCSV detects this file's dialect incorrectly, breaking it.

Essentially it selects a super weird star (*) based delimiter, which then breaks parsing of the whole file.

Running normal form detection ...
Not normal, has potential escapechar.
Running data consistency measure ...
SimpleDialect(',', '', ''):	P =       14.309419	T =        0.672613	Q =        9.624698
SimpleDialect(',', '', '/'):	P =       14.268794	T =        0.615974	Q =        8.789203
SimpleDialect(',', '"', ''):	P =       37.647059	T =        0.942647	Q =       35.487889
SimpleDialect(',', '"', '/'):	P =       18.751838	skip.
SimpleDialect('', '', ''):	P =        0.313000	skip.
SimpleDialect('', '"', ''):	P =        0.040000	skip.
SimpleDialect(' ', '', ''):	P =       45.500250	T =        0.332927	Q =       15.148254
SimpleDialect(' ', '"', ''):	P =       13.000500	skip.
SimpleDialect('#', '', ''):	P =       26.065333	skip.
SimpleDialect('#', '"', ''):	P =        0.040000	skip.
SimpleDialect('*', '', ''):	P =       93.639500	T =        0.843074	Q =       78.945071
SimpleDialect('*', '"', ''):	P =        0.040000	skip.
SimpleDialect('-', '', ''):	P =       39.078500	skip.
SimpleDialect('-', '"', ''):	P =        0.040000	skip.
SimpleDialect(':', '', ''):	P =       21.732000	skip.
SimpleDialect(':', '"', ''):	P =        9.750500	skip.
SimpleDialect('_', '', ''):	P =        0.406000	skip.
SimpleDialect('_', '"', ''):	P =        0.269500	skip.

CSV file attached.
csv_good_dialect_star.csv

Link to Google Sheets: https://docs.google.com/spreadsheets/d/1pbU8Fe0h-NvCc5Cxxbg_nonJgYZB4mHdHNsrmva57CE/edit?usp=sharing
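
One possible workaround for cases like this, assuming the real delimiter is roughly known, is to restrict the candidate delimiters. CleverCSV's Sniffer is modeled on the standard library's csv.Sniffer, whose sniff() accepts a delimiters argument; the sketch below uses the stdlib sniffer to illustrate the idea on star-containing data:

```python
import csv

sample = 'name,notes\n"Alice","*bold* text with , inside"\n"Bob","more *markdown*"\n'

# Restricting the candidate delimiters prevents the sniffer from
# latching onto characters like '*' that merely appear inside the data.
dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
print(dialect.delimiter)  # ','
```

If clevercsv's Sniffer accepts the same argument (an assumption based on its drop-in design), the same call should steer its detection too.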

Unexpected exceptions on many inputs

I ran pythonfuzz on clevercsv's detection engine, and got many input samples resulting in unexpected exceptions, e.g. a file just containing three double-quotes results in a SystemError:

clevercsv detect doublequote-doublequote-doublequote.csv 

[SystemError]
PyEval_EvalFrameEx returned a result with an error set

I'm using clevercsv 0.5.3 from pypi with Python 3.7 running on Ubuntu 19.10.

I've attached a ZIP file containing the fuzzing script and some sample input files leading to unexpected exceptions.

clevercsv-fuzzing.zip

Allow avoidance of Pandas dependency

Pandas is a large dependency, which makes it impractical to use CleverCSV as a pre-commit hook in CI. I have scanned the source code, and nothing there changes my intuition that Pandas isn't strictly needed for CleverCSV. Alternatively or additionally, could you transition to Polars instead of Pandas? It has subpackages.

Is there a way to distinguish between ,, and ,"", in the reader?

PostgreSQL RDBMS has CSV export/import functionality (https://www.postgresql.org/docs/9.6/sql-copy.html) that has a quirk in its CSV format: they treat ,, as SQL NULL and ,"", as a zero length string value. These are different values, and it is important to be able to distinguish between them.

Is there a way to parse a line like foo,,bar,"",baz as ['foo', None, 'bar', '', 'baz']? Without forking and playing with the state machine in C, of course. :-)
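
For what it's worth, the ambiguity exists at the csv-module interface level that CleverCSV mirrors: both forms come back as an empty string, so the distinction is lost by the time rows reach user code. A quick stdlib demonstration:

```python
import csv
import io

# Both an unquoted empty field (,,) and a quoted empty field (,"",)
# are returned as '' by the reader, so NULL vs empty string is lost.
row = next(csv.reader(io.StringIO('foo,,bar,"",baz\n')))
print(row)  # ['foo', '', 'bar', '', 'baz']
```

The traceback elsewhere on this page mentions parse_string(data, dialect, return_quoted=True), so the low-level parser apparently does track quoting; exposing that flag through the reader might be one avenue.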

Confidence score

Is there a way to use the sniffer with a confidence score threshold? I am noticing that while the library works well for many types of CSV, I have a couple of control cases that aren't CSV at all (fixed-width files, actually) where the sniffer still returns a dialect. I'd like access to the sniffer's confidence score so I can base my decision to use the returned delimiter on it.

As a matter of fact, I have run quite a few files through the sniffer and I haven't gotten a None response yet, which makes me believe the logic is a little too eager to produce a dialect, even at low confidence.

Below I show each file on the left alongside the detected delimiter on the right.

(screenshot attached)
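
Until a confidence score is exposed publicly, one hypothetical workaround is to threshold the per-dialect Q scores that the verbose detection output prints. The sketch below works on a plain dict mapping (delimiter, quotechar, escapechar) tuples to Q values; the names and the cutoff are illustrative, not part of CleverCSV's API:

```python
# Hypothetical thresholding over consistency scores (Q values, in the
# shape shown by the verbose detection output). All numbers are made up.
scores = {
    ("|", '"', ""): 78.9,
    (" ", "", ""): 15.1,
    ("*", "", ""): 9.6,
}

Q_THRESHOLD = 50.0  # made-up cutoff; tune on known-good and known-bad files

best_dialect, best_q = max(scores.items(), key=lambda kv: kv[1])
result = best_dialect if best_q >= Q_THRESHOLD else None
print(result)  # ('|', '"', '') here; None when even the best Q is low
```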

Unicode characters cause UnicodeEncodeError from clevercsv.wrappers.write_table on Windows 10

Hello and thank you for your work on this excellent library! I'm running on a Windows 10 machine and encountering a UnicodeEncodeError when attempting to write data that includes Unicode using clevercsv.wrappers.write_table.

It appears that adding an optional encoding argument to clevercsv.wrappers.write_table would fix this, as it works when I use the clevercsv.writer without the wrapper as a workaround (below).

Workaround:

import clevercsv

# Opening the file with an explicit encoding avoids the cp1252 default on Windows
with open("outfile.csv", "w", newline="", encoding="utf-8") as fp:
    w = clevercsv.writer(fp)
    w.writerows(data_list)

Stack Trace:

Traceback (most recent call last):
  File "<REDACTED>", line 143, in <module>
    report.create_csv_report()
  File "<REDACTED>", line 42, in create_csv_report
  File "<REDACTED>\lib\site-packages\clevercsv\wrappers.py", line 441, in write_table
    w.writerows(table)
  File "<REDACTED>\lib\site-packages\clevercsv\write.py", line 60, in writerows
    return self._writer.writerows(rows)
  File "<REDACTED>\local\programs\python\python37-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u2033' in position 250: character maps to <undefined>

Invalid abstract representation of the file with repeating newline

I found that the make_abstraction method does not correctly construct the file abstraction when there are repeated newlines.

Some examples:

# +--------------------------------+
# | \n                             |
# | \n                             |
# | word, digit, bool, float\n     |
# | dolorum,7539,true,88292972.3\n |
# +--------------------------------+
>>> make_abstraction(data, dialect).split("R")
# ['', 'CDCDCDC', 'CDCDCDC']

Another example:

# +--------------------------------+
# | Lorem Ipsum\n                  |
# | Lorem Ipsum\n                  |
# | \n                             |
# | word, digit, bool, float\n     |
# | dolorum,7539,true,88292972.3\n |
# +--------------------------------+
>>> make_abstraction(data, dialect).split("R")
# ['C', 'C', 'CDCDCDC', 'CDCDCDC']

As a workaround for my use case, during pre-processing we need to put some character (e.g. a space) between consecutive newline characters, which for the second example works as follows:

def correct_empty_rows(data):
    def recursive_replace(s, sub, new):
        while sub in s:
            s = s.replace(sub, new)
        return s

    escape_sequences = ["\n", "\r\n", "\r"]  # LF, CRLF, CR
    separator = " "
    for seq in escape_sequences:
        data = recursive_replace(data, 2 * seq, seq + separator + seq)
    return data

# +--------------------------------+
# | Lorem Ipsum\n                  |
# | Lorem Ipsum\n                  |
# | <SPACE>\n                      |
# | word, digit, bool, float\n     |
# | dolorum,7539,true,88292972.3\n |
# +--------------------------------+
>>> make_abstraction(correct_empty_rows(data), dialect).split("R")
# ['C', 'C', 'C', 'CDCDCDC', 'CDCDCDC']

I guess the method should return row patterns, so I assume that repeated newlines should be reflected in its output as well.

Please migrate away from setup.py

Hello!

setup.py has been deprecated for a while now (although support hasn't yet been removed). It would be nice if this project moved away from it before something actually breaks :)

If you want to stick with setuptools, newer versions do support building via a PEP 517-style pyproject.toml file:

You can see an example of a migration to pyproject.toml with setuptools that I did here: jaseg/python-mpv#241
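
For reference, a minimal setuptools-backed pyproject.toml could look like the sketch below; the metadata values are placeholders, not CleverCSV's actual settings, and a project with a C extension would typically keep a small setup.py for ext_modules alongside it.

```toml
[build-system]
requires = ["setuptools>=61", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "clevercsv"
version = "0.0.0"             # placeholder
description = "CSV handling"  # placeholder
requires-python = ">=3.8"
```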

Cheers,

a standardize that fixes?

I was hoping that clevercsv would help me with this. I get these files from the California Secretary of State's office. The data is full of junk. The fields are tab-separated, but people can apparently insert tabs into their data, where they are not escaped. There are sometimes newlines put into the data, so that you get what looks like most of a line and then a small fragment of another, and other such noise.

An example file is here:

https://opencalaccess.org/misc/CVR_CAMPAIGN_DISCLOSURE_CD.TSV (600k lines)

and just the top of it:

https://opencalaccess.org/misc/CVR_CAMPAIGN_DISCLOSURE_top1k_CD.TSV

I get this:

 $ clevercsv standardize --output CVR_CAMPAIGN_DISCLOSURE_std_CD.TSV CVR_CAMPAIGN_DISCLOSURE_CD.TSV
 Traceback (most recent call last):
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/cparser_util.py", line 67, in _parse_data
     for row in parser:
 cparser.Error: line contains NULL byte
 
 During handling of the above exception, another exception occurred:
 
 Traceback (most recent call last):
   File "/home/ray/.local/bin/clevercsv", line 8, in <module>
     sys.exit(main())
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/__main__.py", line 20, in main
     sys.exit(realmain())
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/console/__init__.py", line 8, in main
     return app.run()
   File "/home/ray/.local/lib/python3.10/site-packages/wilderness/application.py", line 383, in run
     return self.run_command(command)
   File "/home/ray/.local/lib/python3.10/site-packages/wilderness/application.py", line 400, in run_command
     return command.handle()
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/console/commands/standardize.py", line 151, in handle
     retval = self.handle_path(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/console/commands/standardize.py", line 173, in handle_path
     dialect = detect_dialect(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/wrappers.py", line 398, in detect_dialect
     dialect = Detector().detect(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/detect.py", line 127, in detect
     return consistency_detector.detect(sample, delimiters=delimiters)
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/consistency.py", line 121, in detect
     scores = self.compute_consistency_scores(data, dialects)
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/consistency.py", line 170, in compute_consistency_scores
     T = self.compute_type_score(data, dialect)
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/consistency.py", line 199, in compute_type_score
     for row in parse_string(data, dialect, return_quoted=True):
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/cparser_util.py", line 132, in parse_data
     yield from _parse_data(
   File "/home/ray/.local/lib/python3.10/site-packages/clevercsv/cparser_util.py", line 70, in _parse_data
     raise Error(str(e))
 clevercsv.exceptions.Error: line contains NULL byte

An example of the kind of fix I have to do is this. Almost all of the rows in this file have 86 fields. But:

 line_num: 47045, fields # 81
 line_num: 47046, fields # 6

So these two lines need to be joined.
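
The immediate "line contains NULL byte" failure can often be sidestepped, assuming the NUL bytes are junk rather than meaningful data, by stripping them before handing the text to the detector:

```python
# Remove NUL bytes, which the parser rejects with
# "line contains NULL byte", before any dialect detection.
raw = "NAME\tCITY\nACME\x00 CORP\tSACRAMENTO\n"  # junk NUL inside a field
cleaned = raw.replace("\x00", "")
print(cleaned.splitlines()[1])  # 'ACME CORP\tSACRAMENTO'
```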

Add support to release aarch64 wheels

Problem

On aarch64, pip install CleverCSV builds the wheels from source code and then installs them. This requires the user to have a development environment installed on their system. Also, it takes more time to build the wheels than to download and extract them from PyPI.

Resolution

On aarch64, pip install CleverCSV should download prebuilt wheels from PyPI.

@GjjvdBurg , Please let me know your interest on releasing aarch64 wheels. I can help in this.

Understanding the licensing

Hi CleverCSV folk,

I was wondering how to consider the licensing of CleverCSV.

The LICENSE file says the project is MIT, but the src/abstraction.csv file says GPL-2.0. Is one an error?

Error: field larger than field limit (131072)

Hi, when running the following clevercsv code:

import clevercsv

with open("./sample_train_postings.csv", "r", newline="", encoding="utf-8") as fp:
    reader = clevercsv.reader(fp, delimiter="\t", quotechar="`", escapechar="")
    rows = list(reader)

Which in turn gives me this error:

Error: field larger than field limit (131072)

Any idea how to handle this?
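
The 131072 default matches the standard library's csv.field_size_limit(), which CleverCSV appears to mirror. Raising the limit may resolve this; the stdlib call is shown below, and clevercsv exposing the same function is an assumption based on its drop-in design.

```python
import csv
import sys

# field_size_limit(new) returns the previous limit; the default
# matches the 131072 in the error message above.
old_limit = csv.field_size_limit(sys.maxsize)
print(old_limit)  # 131072 on a fresh interpreter
```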

header detect error

On the latest version, with this CSV content:
{"fake": "json", "fake2":"json2"}

header detection returns True, but it should return False.

detection is really slow in some cases

Hey there, first of all, great project!

The following commands take a significant amount of time:

> python3 -m timeit -n 1 -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')" 
1 loop, best of 5: 13.2 sec per loop
python3 -m timeit "from clevercsv import Detector; Detector().detect('a'*18)" 
1 loop, best of 5: 8.24 sec per loop

After benchmarking a little bit, the apparent cause is that the unix_path and url regexes in the detector are susceptible to ReDoS (regular-expression denial of service).

These changes, which replace the regexes with (hopefully) equivalent ones, fix the most obvious issues:

-    "url": "((https?|ftp):\/\/(?!\-))?(((([\p{L}\p{N}]*\-?[\p{L}\p{N}]+)+\.)+([a-z]{2,}|local)(\.[a-z]{2,3})?)|localhost|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(\:\d{1,5})?))(\/[\p{L}\p{N}_\/()~?=&%\-\#\.:]*)?(\.[a-z]+)?",
-    "unix_path": "(\/|~\/|\.\/)(?:[a-zA-Z0-9\.\-\_]+\/?)+",
+    "url": "((https?|ftp):\/\/(?!\-))?(((?:[\p{L}\p{N}-]+\.)+([a-z]{2,}|local)(\.[a-z]{2,3})?)|localhost|(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}(\:\d{1,5})?))(\/[\p{L}\p{N}_\/()~?=&%\-\#\.:]*)?(\.[a-z]+)?",
+    "unix_path": "[~.]?(?:\/[a-zA-Z0-9\.\-\_]+)+\/?",

New results:

> python3 -m timeit -n 1 -- "from clevercsv import Detector; Detector().detect('fileurl="file://$PROJECT_DIR$/../aaaaaa_aaaaaaa_aaaaa/.aaa/." filepath=$')" 
1 loop, best of 5: 4.17 msec per loop
:0: UserWarning: The test results are likely unreliable. The worst time (347 msec) was more than four times slower than the best time (4.17 msec).
> python3 -m timeit "from clevercsv import Detector; Detector().detect('a'*18)" 
1 loop, best of 5: 217 usec per loop

Python version: 3.8

help for data type detection

CleverCSV looks very promising!
I'm trying to use this package to extract CSV data into a database. We need to detect the data type of each column. Is there any way this package can help?
The other question: is it feasible (performance/memory-wise) to handle multiple GB of data with this package?
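
CleverCSV uses type detection internally to score dialects, but for database loading a standalone pass over the parsed rows may already be enough. A rough stdlib sketch (not part of CleverCSV's API):

```python
def infer_column_type(values):
    """Return 'int', 'float', or 'str' for a column of string values."""
    def castable(cast):
        # Ignore empty fields; a column is castable if every non-empty
        # value survives the conversion.
        return all(_try(cast, v) for v in values if v != "")

    def _try(cast, v):
        try:
            cast(v)
            return True
        except ValueError:
            return False

    if castable(int):
        return "int"
    if castable(float):
        return "float"
    return "str"

rows = [["7539", "88292972.3", "dolorum"], ["12", "1.5", "ipsum"]]
types = [infer_column_type(col) for col in zip(*rows)]
print(types)  # ['int', 'float', 'str']
```

For multi-GB files, the stream_table wrapper visible in the test output earlier on this page suggests rows can be processed incrementally rather than loaded all at once, though memory behavior on that scale would need measuring.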

Thanks for help.
hong

Code generation UnicodeDecodeError

UnicodeDecodeError

'charmap' codec can't decode byte 0x9e in position 1563: character maps to <undefined>

at /usr/lib/python3.9/encodings/cp1254.py:23 in decode
19│ return codecs.charmap_encode(input,self.errors,encoding_table)[0]
20│
21│ class IncrementalDecoder(codecs.IncrementalDecoder):
22│ def decode(self, input, final=False):
→ 23│ return codecs.charmap_decode(input,self.errors,decoding_table)[0]
24│
25│ class StreamWriter(Codec,codecs.StreamWriter):
26│ pass
27│

Built-in support for cChardet

I noticed that in my applications most of the runtime of CleverCSV is used up by chardet. Since cChardet is much faster, it might be worth supporting it by default. (I know that it's possible to determine the encoding first with cChardet and then pass it to CleverCSV. I'm just suggesting this as a possible enhancement, maybe also for people not aware of cChardet.)

0.7.5: pytest is failing in `tests/test_unit/test_encoding.py::EncodingTestCase::test_encoding_cchardet` unit

I'm packaging your module as an rpm package, so I'm using the typical PEP 517-based build, install, and test cycle used for building packages from a non-root account.

  • python3 -sBm build -w --no-isolation
  • because I'm calling build with --no-isolation I'm using during all processes only locally installed modules
  • install .whl file in </install/prefix>
  • run pytest with $PYTHONPATH pointing to sitearch and sitelib inside </install/prefix>
  • build is performed in env which is cut off from access to the public network (pytest is executed with -m "not network")

Here is pytest output:

+ PYTHONPATH=/home/tkloczko/rpmbuild/BUILDROOT/python-clevercsv-0.7.5-2.fc35.x86_64/usr/lib64/python3.8/site-packages:/home/tkloczko/rpmbuild/BUILDROOT/python-clevercsv-0.7.5-2.fc35.x86_64/usr/lib/python3.8/site-packages
+ /usr/bin/pytest -ra -m 'not network'
==================================================================================== test session starts ====================================================================================
platform linux -- Python 3.8.16, pytest-7.2.1, pluggy-1.0.0
rootdir: /home/tkloczko/rpmbuild/BUILD/CleverCSV-0.7.5
collected 156 items

tests/test_unit/test_consistency.py ..                                                                                                                                                [  1%]
tests/test_unit/test_console.py ....................                                                                                                                                  [ 14%]
tests/test_unit/test_cparser.py .................................................                                                                                                     [ 45%]
tests/test_unit/test_detect.py ....                                                                                                                                                   [ 48%]
tests/test_unit/test_detect_pattern.py ....................                                                                                                                           [ 60%]
tests/test_unit/test_detect_type.py ..........                                                                                                                                        [ 67%]
tests/test_unit/test_dict.py ....................                                                                                                                                     [ 80%]
tests/test_unit/test_encoding.py F.                                                                                                                                                   [ 81%]
tests/test_unit/test_fuzzing.py .                                                                                                                                                     [ 82%]
tests/test_unit/test_normal_forms.py .....                                                                                                                                            [ 85%]
tests/test_unit/test_potential_dialects.py ....                                                                                                                                       [ 87%]
tests/test_unit/test_reader.py .........                                                                                                                                              [ 93%]
tests/test_unit/test_wrappers.py .....                                                                                                                                                [ 96%]
tests/test_unit/test_write.py .....                                                                                                                                                   [100%]

========================================================================================= FAILURES ==========================================================================================
__________________________________________________________________________ EncodingTestCase.test_encoding_cchardet __________________________________________________________________________

self = <test_encoding.EncodingTestCase testMethod=test_encoding_cchardet>

    @unittest.skipIf(
        platform.system() == "Windows",
        reason="No faust-cchardet wheels for Windows (yet)",
    )
    def test_encoding_cchardet(self):
        for case in self.cases:
            table = case["table"]
            encoding = case["encoding"]
            with self.subTest(encoding=encoding):
                out_encoding = case["cchardet_encoding"]
                tmpfname = self._build_file(table, encoding)
                detected = get_encoding(tmpfname, try_cchardet=True)
>               self.assertEqual(out_encoding, detected)
E               AssertionError: 'WINDOWS-1252' != 'ISO-8859-1'
E               - WINDOWS-1252
E               + ISO-8859-1

tests/test_unit/test_encoding.py:83: AssertionError
===================================================================================== warnings summary ======================================================================================
../../../../../usr/lib/python3.8/site-packages/wilderness/tester.py:23
  /usr/lib/python3.8/site-packages/wilderness/tester.py:23: PytestCollectionWarning: cannot collect test class 'Tester' because it has a __init__ constructor (from: tests/test_unit/test_console.py)
    class Tester:

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================================================================================== short test summary info ==================================================================================
FAILED tests/test_unit/test_encoding.py::EncodingTestCase::test_encoding_cchardet - AssertionError: 'WINDOWS-1252' != 'ISO-8859-1'
========================================================================= 1 failed, 155 passed, 1 warning in 0.86s ==========================================================================

Here is list of installed modules in build env

Package                       Version
----------------------------- -----------------
alabaster                     0.7.12
appdirs                       1.4.4
asn1crypto                    1.5.1
attrs                         22.2.0
Babel                         2.11.0
bcrypt                        3.2.2
build                         0.9.0
cffi                          1.15.1
chardet                       5.1.0
charset-normalizer            3.0.1
contourpy                     1.0.6
cryptography                  38.0.4
cssselect                     1.1.0
cycler                        0.11.0
Cython                        0.29.33
distro                        1.8.0
docutils                      0.19
exceptiongroup                1.0.0
extras                        1.0.0
fixtures                      4.0.0
fonttools                     4.38.0
gpg                           1.18.0-unknown
idna                          3.4
imagesize                     1.4.1
importlib-metadata            5.1.0
iniconfig                     1.1.1
Jinja2                        3.1.2
kiwisolver                    1.4.4
libcomps                      0.1.19
lxml                          4.9.1
MarkupSafe                    2.1.1
matplotlib                    3.6.3
numpy                         1.24.1
olefile                       0.46
packaging                     21.3
pandas                        1.5.2
pbr                           5.9.0
pep517                        0.13.0
Pillow                        9.4.0
pip                           22.3.1
pluggy                        1.0.0
ply                           3.11
pyasn1                        0.4.8
pyasn1-modules                0.2.8
pycparser                     2.21
Pygments                      2.14.0
PyGObject                     3.42.2
pyparsing                     3.0.9
pytest                        7.2.1
python-dateutil               2.8.2
pytz                          2022.4
PyYAML                        6.0
regex                         2022.10.31
requests                      2.28.1
rpm                           4.17.0
scour                         0.38.2
setuptools                    65.6.3
six                           1.16.0
snowballstemmer               2.2.0
Sphinx                        5.3.0
sphinxcontrib-applehelp       1.0.2.dev20221204
sphinxcontrib-devhelp         1.0.2.dev20221204
sphinxcontrib-htmlhelp        2.0.0
sphinxcontrib-jsmath          1.0.1.dev20221204
sphinxcontrib-qthelp          1.0.3.dev20221204
sphinxcontrib-serializinghtml 1.1.5
termcolor                     1.1.0
testtools                     2.5.0
tomli                         2.0.1
tpm2-pkcs11-tools             1.33.7
tpm2-pytss                    1.1.0
urllib3                       1.26.12
wheel                         0.38.4
wilderness                    0.1.9
zipp                          3.11.0

delimiter detection error

For a pipe-delimited file, the delimiter is detected as space when I feed in around 1000 rows (however, the csv lib correctly detects pipe); if I only feed in 100 rows then clevercsv is also able to detect pipe.

I.e. like the data below, but 55 columns wide:
"Private Company"|"Point Rd"

Header Detection Error

CleverCSV's Sniffer assumes the following .csv file does not have a header. More specifically, it recognizes "1" (an inferred int) and "1.2" (an inferred float) as incompatible types. As the third column has "incompatible" types, Detector.has_header() will return False.

col1,col2,col3
hello,"hello world", 1.2
world,"hello world", 1.2
test,"hello world 您", 1

Relevant code:

import io
from typing import IO

import clevercsv

def clevercsv_guess_if_headers_exist(input: IO) -> bool:
    input_data = input
    if not isinstance(input, io.TextIOBase):
        input_data = io.StringIO(input.read(SAMPLE_LENGTH).decode())
    stream_pos = input_data.tell()
    sniffer = clevercsv.Sniffer()
    input_data.seek(0)
    has_header = sniffer.has_header(input_data.read(SAMPLE_LENGTH))
    input_data.seek(stream_pos)
    return has_header

Add convenient support for stdin

Hi,

I just discovered this great project, thanks a lot for this amazing work 😃

Since CSV processing usually occurs in data flow process, it would be great to improve conveniency as reading CSV data through stdin.

Writing to stdout is easy already, because sys.stdout can be passed directly to csv.writer, but reading is a bit trickier.

import io
from sys import stdout, stdin

import clevercsv
import chardet

# read
input_data = stdin.buffer.read()  # read stdin as binary
detected_encoding = chardet.detect(input_data)['encoding']  # guess the encoding

csvfile = io.StringIO(input_data.decode(detected_encoding))

dialect = clevercsv.Sniffer().sniff(csvfile.read())
csvfile.seek(0)

reader = clevercsv.reader(csvfile, dialect)
rows = list(reader)


# write
writer = clevercsv.writer(stdout)
writer.writerows(rows)

consistency.get_best_set has unexpected results

Summary

In short, if one of the scores has Q = nan, then the computed maximum score can be nan, which is weird.

Details

I was trying to get the dialect for a CSV file and was getting dialects = None. I dug around through the code and found two functions which may be part of the issue: get_best_set and consistency_scores.

I got some dialects using clevercsv.potential_dialects.get_dialects, and then some scores using clevercsv.consistency.consistency_scores.

I had a set of scores that looked like this:

{SimpleDialect(',', '', ''): {'Q': 52.58875739644971,
  'pattern': 61.53846153846154,
  'type': 0.8545673076923077},
 SimpleDialect(',', '', '/'): {'Q': nan,
  'pattern': 30.76846153846154,
  'type': nan},
 SimpleDialect('', '', ''): {'Q': nan, 'pattern': 0.064, 'type': nan},
...

There were many other scores; I have just grabbed the first three here.

I would expect the first dialect to be the "best" one, but that is not the output :(

Passing those scores in get_best_set returns an empty set.

get_best_set currently looks like this:

def get_best_set(scores):
    H = set()
    Qmax = max((score["Q"] for score in scores.values()))
    H = set([d for d, score in scores.items() if score["Q"] == Qmax])
    return H

It just picks out the item which has the best Q score.
The line Qmax = max((score["Q"] for score in scores.values())) depends on the builtin max function. That function produces unexpected results when some of the values it is checking include nan. See: https://stackoverflow.com/questions/4237914/python-max-min-builtin-functions-depend-on-parameter-order

Because of that, Qmax can equal nan, and then the output of get_best_set will be the empty set. (It will not even return the entries that do have Q = nan, because float("nan") == float("nan") is False.)
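This pitfall is easy to reproduce in plain Python; math is only used here for the nan-safe variant:

```python
import math

scores = [float("nan"), 52.59, 30.77]

# max() compares with `>`; every comparison against nan is False,
# so when nan comes first it is never replaced and "wins".
print(max(scores))  # nan

# A nan-safe maximum: filter nan values out before comparing.
qmax = max(s for s in scores if not math.isnan(s))
print(qmax)  # 52.59
```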

Why are there nan values?

consistency_scores has nan as default values.

def consistency_scores(data, dialects, skip=True, logger=print):
    scores = {}

    Qmax = -float("inf")
    for dialect in sorted(dialects):
        P = pattern_score(data, dialect)
        if P < Qmax and skip:
            scores[dialect] = {
                "pattern": P,
                "type": float("nan"),
                "Q": float("nan"),
            }
            logger("%15r:\tP = %15.6f\tskip." % (dialect, P))
            continue
        T = type_score(data, dialect)
        Q = P * T
        Qmax = max(Q, Qmax)
        scores[dialect] = {"pattern": P, "type": T, "Q": Q}
        logger(
            "%15r:\tP = %15.6f\tT = %15.6f\tQ = %15.6f" % (dialect, P, T, Q)
        )
    return scores

The defaults are set here:

            scores[dialect] = {
                "pattern": P,
                "type": float("nan"),
                "Q": float("nan"),
            }

So if scores contains even one entry with nan, then nan can come out as the maximum, which surely is not correct.

I think the defaults should be None, and checks like if Q is None could then be made elsewhere.

Thoughts?
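For illustration, a None-based variant of get_best_set might look like this (a sketch of the suggested fix, not the library's code):

```python
def get_best_set(scores):
    # Skipped dialects carry Q = None and are excluded from the comparison.
    valid = {d: s for d, s in scores.items() if s["Q"] is not None}
    if not valid:
        return set()
    qmax = max(s["Q"] for s in valid.values())
    return {d for d, s in valid.items() if s["Q"] == qmax}

scores = {
    "comma": {"Q": 52.59},
    "skipped": {"Q": None},
}
print(get_best_set(scores))  # {'comma'}
```

Unlike nan, None fails loudly if it accidentally reaches a comparison, so a forgotten check raises a TypeError instead of silently producing an empty result.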

Full set of scores:

SimpleDialect('_', '', '') {'Q': nan, 'type': nan, 'pattern': 0.48983333333333334}
SimpleDialect(':', '', '') {'Q': nan, 'type': nan, 'pattern': 0.2815}
SimpleDialect('-', '', '') {'Q': nan, 'type': nan, 'pattern': 6.31279292929293}
SimpleDialect('#', '', '') {'Q': nan, 'type': nan, 'pattern': 4.823358974358975}
SimpleDialect(' ', '', '') {'Q': nan, 'type': nan, 'pattern': 4.8963066685493155}
SimpleDialect('&', '', '') {'Q': nan, 'type': nan, 'pattern': 4.3835}
SimpleDialect('!', '', '') {'Q': nan, 'type': nan, 'pattern': 3.525}
SimpleDialect('', '', '') {'Q': nan, 'type': nan, 'pattern': 0.064}
SimpleDialect('*', '', '') {'Q': nan, 'type': nan, 'pattern': 0.48604545454545456}
SimpleDialect('¨', '', '') {'Q': nan, 'type': nan, 'pattern': 3.514666666666667}
SimpleDialect('?', '', '') {'Q': nan, 'type': nan, 'pattern': 4.627111111111111}
SimpleDialect('*', '', '/') {'Q': nan, 'type': nan, 'pattern': 0.48150000000000004}
SimpleDialect(',', '', '/') {'Q': nan, 'type': nan, 'pattern': 30.76846153846154}
SimpleDialect('+', '', '') {'Q': nan, 'type': nan, 'pattern': 0.2815}
SimpleDialect(',', '', '') {'Q': 52.58875739644971, 'type': 0.8545673076923077, 'pattern': 61.53846153846154}

Date delimiter prevails over column delimiter

Hi everyone,

I have a lot of files with long-format dates (with timestamps). The files are delimited by "," but the timestamps separate hours, minutes, and seconds with ":".
Example:
item1,item2,2022-03-28 17:24:10,item4,2022-01-02 22:43:59,item6

I should have :
Column 1 | Column 2 | Column 3 | Column 4 | Column 5 | Column 6
item1 | item 2 | 22-03...:10| item4 | 22-01...:59| item6
but it gives the following result :
C 1 |C2| C3 |C4| C5
item1,item 2,2022-03-28 17|24|10,item4,2022-01-02 22|43|59,item6

Same thing with some very surprising characters recognised as delimiters when the delimiter is really obvious and more frequent :

4819736,lilly354128,fr,Alin,lilly354128,,,,Lilly,DARS,,2 rue des ezda,,4581350,SAINT LA FORET,,FR,+330000000,,,,,,,,,,,,644365,"""LE TEMPS DE DÉCOUVRIR...""","* COFFRET DÉCOUVERTE SEPT PARFUMS
On this line, CleverCSV has found a delimiter... and the delimiter is the star "*"! I really don't understand the logic.

I have worked around this issue with a pre-analysis algorithm that checks, when there are two potential delimiters, whether one of them matches a common format such as dates, times, addresses, or floating-point numbers (the comma case). The annoying thing is that I need to parse twice: once with my pre-analysis and then with CleverCSV.
It would be very nice to have a Clever version of this too.
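A minimal version of such a pre-analysis could mask timestamps before counting candidate delimiters (an illustrative sketch, not part of CleverCSV):

```python
import re

# Matches "YYYY-MM-DD HH:MM:SS" timestamps.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def count_candidates(line: str, candidates: str = ",;|\t:") -> dict:
    # Mask timestamps so their ':' separators are not counted as delimiters.
    masked = TIMESTAMP.sub("<ts>", line)
    return {c: masked.count(c) for c in candidates}

line = "item1,item2,2022-03-28 17:24:10,item4,2022-01-02 22:43:59,item6"
print(count_candidates(line))  # ',' occurs 5 times, ':' 0 times
```

After masking, the comma is the only frequent candidate left, which matches the expected six-column split.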

Have a nice day !

Make CleverCSV a pre-commit hook

Would you consider making CleverCSV a pre-commit hook, by adding a .pre-commit-hooks.yaml file to your repository? This would allow people with CSV files in their repositories to check them automatically whenever they commit.

The only thing missing to make it work is, as I understand:

  • An option to overwrite the input file instead of printing to STDOUT (I think this could be a flag, like --in-place)
  • The possibility for clevercsv to accept a list of file arguments to process.
  • (Probably implemented already) A different exit status depending on whether the input file was altered (!= 0) or not (== 0).

I can file a PR for the .pre-commit-hooks.yaml, if you agree (after the above points are implemented).
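For reference, a hypothetical .pre-commit-hooks.yaml could look like the following; the hook id and the --in-place flag are assumptions (the flag does not exist yet, per the list above):

```yaml
- id: clevercsv-standardize
  name: Standardize CSV files with CleverCSV
  entry: clevercsv standardize --in-place
  language: python
  types: [csv]
```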

Thanks for your consideration!

allowed delimiters?

Is there an option to only detect delimiters from a given list, e.g. (`,`, `|`, `\t`, etc.)?
Right now, for a fixed-width file, it picks `+` as the delimiter.
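For comparison, the stdlib csv.Sniffer supports restricting candidates via a delimiters argument; since CleverCSV aims to be a drop-in replacement, its Sniffer may accept the same argument (an assumption worth checking against the CleverCSV docs):

```python
import csv

sample = "a|b|c\n1|2|3\n4|5|6\n"

# Restrict detection to the given candidate characters.
dialect = csv.Sniffer().sniff(sample, delimiters="|,\t")
print(dialect.delimiter)  # |
```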

delimiter is not correct for some csv file

hi,

We have a sample CSV file:

bytearray(b'fake data'),20:53:06,2019-09-01T19:28:21
bytearray(b'fake data'),19:33:15,2005-02-15T19:10:31
bytearray(b'fake data'),10:43:05,1992-10-12T14:49:24
bytearray(b'fake data'),10:36:49,1999-07-18T17:27:55
bytearray(b'fake data'),03:33:35,1982-04-24T17:38:45
bytearray(b'fake data'),14:49:47,1983-01-05T22:17:42
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45
bytearray(b'fake data'),10:35:30,2006-10-27T02:30:45

The delimiter guessed by CleverCSV is ":"; it should be ",".

Thanks.

delimiter error for json type data

In the attached file, CleverCSV guesses the delimiter ':' instead of ','.
The three data types are JSON, time without time zone, and time with time zone.

"{""fake"": ""json"", ""fake2"":""json2""}",13:31:38,06:00:04+01:00
"{""fake"": ""json"", ""fake2"":""json2""}",22:13:29,14:20:11+02:00
"{""fake"": ""json"", ""fake2"":""json2""}",04:37:27,22:04:28+03:00
"{""fake"": ""json"", ""fake2"":""json2""}",04:25:28,23:12:53+01:00
"{""fake"": ""json"", ""fake2"":""json2""}",21:04:15,08:23:58+02:00
"{""fake"": ""json"", ""fake2"":""json2""}",10:37:03,11:06:42+05:30
"{""fake"": ""json"", ""fake2"":""json2""}",10:17:24,23:38:47+06:00
"{""fake"": ""json"", ""fake2"":""json2""}",00:02:51,20:04:45-06:00

Would you please help to take a look ?

Thanks
hong
