
dataclass-csv's Introduction


Dataclass CSV

Dataclass CSV makes working with CSV files easier than working with dicts. It uses Python's dataclasses to store the data of each row in the CSV file, and uses type annotations to enable type checking and validation.

Main features

  • Use dataclasses instead of dictionaries to represent the rows in the CSV file.
  • Take advantage of the type annotations on dataclass properties. DataclassReader uses the type annotations to validate the data in the CSV file.
  • Automatic type conversion. DataclassReader supports str, int, float, complex, datetime and bool, as well as any type whose constructor accepts a string as its single argument.
  • Helps you troubleshoot issues with the data in the CSV file. DataclassReader reports exactly which line of the CSV file contains an error.
  • Extract only the data you need. It will only parse the properties defined in the dataclass.
  • Familiar syntax. The DataclassReader is used almost the same way as the DictReader in the standard library.
  • It uses dataclass features that let you define metadata properties so the data can be parsed exactly the way you want.
  • Makes the code cleaner. No more extra loops to convert data to the correct type, perform validation, or set default values; the DataclassReader does all this for you.
  • In addition to the DataclassReader, the library also provides a DataclassWriter, which creates a CSV file from a list of dataclass instances.

Installation

pipenv install dataclass-csv

Getting started

Using the DataclassReader

First, add the necessary imports:

from dataclasses import dataclass

from dataclass_csv import DataclassReader

Assuming that we have a CSV file with the contents below:

firstname,email,age
Elsa,[email protected], 11
Astor,[email protected], 7
Edit,[email protected], 3
Ella,[email protected], 2

Let's create a dataclass that will represent a row in the CSV file above:

@dataclass
class User:
    firstname: str
    email: str
    age: int

The dataclass User has three properties: firstname and email are of type str, and age is of type int.

Loading the CSV file works the same way as with the DictReader from the csv module in Python's standard library. After opening the file, we create an instance of the DataclassReader with two arguments: the file, and the dataclass that represents each row of the CSV file. Like so:

with open(filename) as users_csv:
    reader = DataclassReader(users_csv, User)
    for row in reader:
        print(row)

The DataclassReader internally uses the DictReader from the csv module to read the CSV file, which means that you can pass the same arguments that you would pass to the DictReader. The complete argument list is shown below:

dataclass_csv.DataclassReader(
    f,
    cls,
    fieldnames=None,
    restkey=None,
    restval=None,
    dialect='excel',
    *args,
    **kwds
)

All keyword arguments supported by DictReader are supported by the DataclassReader, with the addition of:

validate_header - The DataclassReader will raise a ValueError if the CSV file contains columns with the same name. This validation is performed to avoid data being overwritten. To skip this validation, set validate_header=False when creating an instance of the DataclassReader, as in the example below:

reader = DataclassReader(f, User, validate_header=False)

If you run the reading loop shown earlier against the sample CSV file, you should see output like this:

User(firstname='Elsa', email='[email protected]', age=11)
User(firstname='Astor', email='[email protected]', age=7)
User(firstname='Edit', email='[email protected]', age=3)
User(firstname='Ella', email='[email protected]', age=2)
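
To make the mechanism concrete, the sketch below reproduces the core idea with only the standard library: csv.DictReader yields each row as a dict, and the type annotation of each dataclass field is called on the raw string. It is an illustration under simplifying assumptions (no defaults, no error reporting) and not the DataclassReader implementation; the helper read_rows and the sample address are hypothetical.

```python
import csv
import io
from dataclasses import dataclass, fields


@dataclass
class User:
    firstname: str
    email: str
    age: int


def read_rows(f, cls):
    """Yield one `cls` instance per CSV row (hypothetical helper)."""
    for row in csv.DictReader(f):
        # Call each field's annotated type on the stripped cell text.
        yield cls(**{fld.name: fld.type(row[fld.name].strip()) for fld in fields(cls)})


sample = io.StringIO("firstname,email,age\nElsa,elsa@example.com, 11\n")
users = list(read_rows(sample, User))
print(users)
```

Note that fld.type is only a callable class when annotations are not stored as strings, which is one reason the real library has to do more work than this.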

Error handling

One of the advantages of using the DataclassReader is that it makes it easy to detect when the type of data in the CSV file is not what your application's model expects. The DataclassReader also reports errors that help identify the problematic rows in your CSV file.

For example, say we change the contents of the CSV file shown in the Getting started section and replace the age of the user Astor with a string value:

Astor, [email protected], test

Remember that in the dataclass User the age property is annotated with int. If we run the code again an exception will be raised with the message below:

dataclass_csv.exceptions.CsvValueError: The field `age` is defined as <class 'int'> but
received a value of type <class 'str'>. [CSV Line number: 3]

Note that, apart from describing the error, the DataclassReader also shows which line of the CSV file contains the invalid data.
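
As a rough illustration of how a line number can be attached to a conversion error, the sketch below uses the line_num attribute of csv.DictReader from the standard library. It assumes nothing about how dataclass-csv builds its own message; read_ages and the sample rows are made up for the example.

```python
import csv
import io


def read_ages(f):
    """Return the `age` column as ints, reporting the CSV line on failure."""
    reader = csv.DictReader(f)
    ages = []
    for row in reader:
        try:
            ages.append(int(row["age"]))
        except ValueError as exc:
            # reader.line_num counts the lines read so far, header included.
            raise ValueError(
                f"The field `age` expects int but received {row['age']!r}. "
                f"[CSV Line number: {reader.line_num}]"
            ) from exc
    return ages


bad = io.StringIO(
    "firstname,email,age\n"
    "Elsa,elsa@example.com, 11\n"
    "Astor,astor@example.com, test\n"
)
try:
    read_ages(bad)
except ValueError as exc:
    message = str(exc)
    print(message)
```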

Default values

The DataclassReader also handles properties with default values. Let's modify the dataclass User and add a default value for the field email. Because dataclasses require fields with default values to come after fields without them, email moves to the end of the class:

from dataclasses import dataclass


@dataclass
class User:
    firstname: str
    age: int
    email: str = 'Not specified'

And we modify the CSV file and remove the email for the user Astor:

Astor,, 7

If we run the code we should see the output below:

User(firstname='Elsa', age=11, email='[email protected]')
User(firstname='Astor', age=7, email='Not specified')
User(firstname='Edit', age=3, email='[email protected]')
User(firstname='Ella', age=2, email='[email protected]')

Note that the object for the user Astor now has the default value Not specified assigned to the email property.

Default values can also be set using dataclasses.field like so:

from dataclasses import dataclass, field


@dataclass
class User:
    firstname: str
    age: int
    email: str = field(default='Not specified')
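
For readers curious how a default can be picked up from the dataclass itself, here is a minimal standard-library sketch. It assumes an empty cell should fall back to the field's default, used unchanged; the helper build is hypothetical and is not how dataclass-csv is necessarily written. Fields with defaults are declared last, as dataclasses require.

```python
from dataclasses import MISSING, dataclass, fields


@dataclass
class User:
    firstname: str
    age: int
    email: str = "Not specified"  # defaulted fields must come last


def build(cls, row):
    """Build `cls` from a dict of raw strings (hypothetical helper)."""
    kwargs = {}
    for fld in fields(cls):
        raw = row.get(fld.name, "").strip()
        if raw:
            kwargs[fld.name] = fld.type(raw)
        elif fld.default is not MISSING:
            kwargs[fld.name] = fld.default  # empty cell: use the default unchanged
        else:
            raise ValueError(f"The field `{fld.name}` is required.")
    return cls(**kwargs)


astor = build(User, {"firstname": "Astor", "email": "", "age": " 7"})
print(astor)
```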

Mapping dataclass fields to columns

The mapping between a dataclass property and a column in the CSV file is done automatically if the names match. There are situations, however, where the header of a column has a different name. We can tell the DataclassReader how the mapping should be done using the method map. Assuming that we have a CSV file with the contents below:

First Name,email,age
Elsa,[email protected], 11

Note that the column is now called First Name and not firstname.

And we can use the method map, like so:

reader = DataclassReader(users_csv, User)
reader.map('First Name').to('firstname')

Now the DataclassReader knows how to extract the data from the column First Name and assign it to the dataclass property firstname.
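
Conceptually, a mapping is just a lookup from the column name in the file to the field name in the dataclass. The sketch below shows that idea with the standard library only; the dict column_to_field and the sample row are illustrative assumptions, not part of the dataclass-csv API. The lookup is case-sensitive, so the column name has to match the header exactly.

```python
import csv
import io

# Column name in the file -> dataclass field name (illustrative mapping).
column_to_field = {"First Name": "firstname"}

reader = csv.DictReader(io.StringIO("First Name,email,age\nElsa,elsa@example.com, 11\n"))
rows = [
    # Rename mapped columns; leave every other column name untouched.
    {column_to_field.get(column, column): value for column, value in row.items()}
    for row in reader
]
print(rows)
```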

Supported type annotation

At the moment the DataclassReader supports int, str, float, complex, datetime, and bool. When defining a datetime property, it is necessary to use the dateformat decorator, for example:

from dataclasses import dataclass
from datetime import datetime

from dataclass_csv import DataclassReader, dateformat


@dataclass
@dateformat('%Y/%m/%d')
class User:
    name: str
    email: str
    birthday: datetime


if __name__ == '__main__':

    with open('users.csv') as f:
        reader = DataclassReader(f, User)
        for row in reader:
            print(row)

Assuming that the CSV file has the following contents:

name,email,birthday
Edit,[email protected],2018/11/23

The output would look like this:

User(name='Edit', email='[email protected]', birthday=datetime.datetime(2018, 11, 23, 0, 0))

Fields metadata

It is important to note that the dateformat decorator defines the date format used to parse every date property in the class. There are situations, however, where a CSV file contains two or more columns with date values in different formats. In that case, a format can be set for an individual property using dataclasses.field. Let's say that we now have a CSV file with the following contents:

name,email,birthday,create_date
Edit,[email protected],2018/11/23,2018/11/23 10:43

As you can see the create_date contains time information as well.

The dataclass User can be defined like this:

from dataclasses import dataclass, field
from datetime import datetime

from dataclass_csv import DataclassReader, dateformat


@dataclass
@dateformat('%Y/%m/%d')
class User:
    name: str
    email: str
    birthday: datetime
    create_date: datetime = field(metadata={'dateformat': '%Y/%m/%d %H:%M'})

Note that the format for the birthday field was not specified using the field metadata. In this case the format specified in the dateformat decorator will be used.
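
The fallback rule described above can be sketched with the standard library alone: field metadata is available through dataclasses.fields, and a per-field format wins over a class-wide one. In this sketch the constant CLASS_FORMAT stands in for the dateformat decorator and parse_dates is a hypothetical helper, so treat it as an illustration rather than the library's code.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime

CLASS_FORMAT = "%Y/%m/%d"  # stands in for the class-wide dateformat decorator


@dataclass
class User:
    name: str
    birthday: datetime
    create_date: datetime = field(metadata={"dateformat": "%Y/%m/%d %H:%M"})


def parse_dates(cls, row):
    """Parse every datetime field, preferring the per-field format."""
    parsed = {}
    for fld in fields(cls):
        if fld.type is datetime:
            fmt = fld.metadata.get("dateformat", CLASS_FORMAT)
            parsed[fld.name] = datetime.strptime(row[fld.name], fmt)
    return parsed


dates = parse_dates(
    User,
    {"name": "Edit", "birthday": "2018/11/23", "create_date": "2018/11/23 10:43"},
)
print(dates)
```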

Handling values with empty spaces

When defining a property of type str in the dataclass, the DataclassReader treats values containing only whitespace as invalid. To change this behavior, there is a decorator called @accept_whitespaces. When the class is decorated with @accept_whitespaces, all the properties in the class accept values containing only whitespace.

For example:

from dataclasses import dataclass
from datetime import datetime

from dataclass_csv import DataclassReader, accept_whitespaces

@accept_whitespaces
@dataclass
class User:
    name: str
    email: str
    birthday: datetime
    created_at: datetime

If you need a specific field to accept whitespace, you can set the property accept_whitespaces in the field's metadata, like so:

from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class User:
    name: str
    email: str = field(metadata={'accept_whitespaces': True})
    birthday: datetime
    created_at: datetime

User-defined types

You can use any type for a field as long as its constructor accepts a string:

import dataclasses
import re


class SSN:
    def __init__(self, val):
        if re.match(r"\d{9}", val):
            self.val = f"{val[0:3]}-{val[3:5]}-{val[5:9]}"
        elif re.match(r"\d{3}-\d{2}-\d{4}", val):
            self.val = val
        else:
            raise ValueError(f"Invalid SSN: {val!r}")


@dataclasses.dataclass
class User:
    name: str
    ssn: SSN

Using the DataclassWriter

Reading a CSV file using the DataclassReader gives us the type safety of Python's dataclasses and type annotations. There are also situations where we would like to use dataclasses to create CSV files; that's where the DataclassWriter comes in handy.

Using the DataclassWriter is quite simple. Given that we have a dataclass User:

from dataclasses import dataclass


@dataclass
class User:
    firstname: str
    lastname: str
    age: int

And in our program we have a list of users:

users = [
    User(firstname="John", lastname="Smith", age=40),
    User(firstname="Daniel", lastname="Nilsson", age=10),
    User(firstname="Ella", lastname="Fralla", age=4)
]

To create a CSV file using the DataclassWriter, import it from dataclass_csv:

from dataclass_csv import DataclassWriter

Initialize it with the required arguments and call the method write:

with open("users.csv", "w") as f:
    w = DataclassWriter(f, users, User)
    w.write()

That's it! Let's break down the snippet above.

First, we open a file called users.csv for writing. After that, an instance of the DataclassWriter is created. To create a DataclassWriter we need to pass the file, the list of User instances, and lastly, the type, which in this case is User.

The type is required since the writer uses it to determine the CSV header. By default, it uses the names of the properties defined in the dataclass. In the case of the dataclass User, the column titles will be firstname, lastname and age.

See below the CSV created out of a list of User:

firstname,lastname,age
John,Smith,40
Daniel,Nilsson,10
Ella,Fralla,4

The DataclassWriter also takes **fmtparams, which accepts the same parameters as csv.writer. For more information see: https://docs.python.org/3/library/csv.html#csv-fmt-params

Now, there are situations where we don't want to write the CSV header. In this case, the method write of the DataclassWriter accepts an extra argument, called skip_header. The default value is False and when set to True it will skip the header.
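
To show what the header, skip_header and **fmtparams amount to, here is a small standard-library sketch of a writer with the same shape. The function write_rows is a hypothetical stand-in written for this example; it only mirrors the behaviour described above and is not the DataclassWriter itself.

```python
import csv
import io
from dataclasses import astuple, dataclass, fields


@dataclass
class User:
    firstname: str
    lastname: str
    age: int


def write_rows(f, items, cls, skip_header=False, **fmtparams):
    """Write dataclass instances as CSV rows (hypothetical helper)."""
    writer = csv.writer(f, **fmtparams)  # fmtparams are forwarded to csv.writer
    if not skip_header:
        writer.writerow([fld.name for fld in fields(cls)])
    for item in items:
        writer.writerow(astuple(item))


users = [User("John", "Smith", 40), User("Ella", "Fralla", 4)]

with_header = io.StringIO()
write_rows(with_header, users, User)

without_header = io.StringIO()
write_rows(without_header, users, User, skip_header=True, delimiter=";")
print(without_header.getvalue())
```

With the real library, the equivalent call described above would be w.write(skip_header=True).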

Modifying the CSV header

As previously mentioned, the DataclassWriter uses the names of the properties defined in the dataclass as the CSV header titles. Depending on your use case, it can make sense to change them, and the DataclassWriter has a map method just for this purpose.

Using the User dataclass with the properties firstname, lastname and age, the snippet below shows how to change firstname to First name and lastname to Last name:

with open("users.csv", "w") as f:
    w = DataclassWriter(f, users, User)

    # Add mappings for firstname and lastname
    w.map("firstname").to("First name")
    w.map("lastname").to("Last name")

    w.write()

The CSV output of the snippet above will be:

First name,Last name,age
John,Smith,40
Daniel,Nilsson,10
Ella,Fralla,4

Copyright and License

Copyright (c) 2018 Daniel Furtado. Code released under the BSD 3-clause license.

Credits

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

dataclass-csv's People

Contributors

alecbz, carltonyeung, dfurtado, johnthagen, millar, spamaps, zivanfi

dataclass-csv's Issues

TypeError: 'module' object is not callable referring to "field_type(value)"

I am hoping this is just a naive user question:

Refer to this line of code in the DataclassReader:

transformed_value = field_type(value)

Task:

Using classes that define a logical data model, read a CSV file whose column names differ from the field names in the data model, and use the DataclassReader's .map method to map the column names in the CSV file to the field names in the sales class. The class 'sales' inherits the 'markets' and 'products' master data classes.

code and traceback:

from dataclasses import dataclass, field
import dataclass_csv
import datetime

#define dataclasses to align with a logical dimensional data model
@dataclass
class markets:
    market_id : str = field(metadata={'accept_whitespaces': True},init=False)
    market : str = field(metadata={'accept_whitespaces': True},init=True)

@dataclass
class products:
    upc : str = field(metadata={'accept_whitespaces': True},init=True)
    product : str = field(metadata={'accept_whitespaces': True},init=True)

@dataclass
@dataclass_csv.dateformat('%Y-%m-%d')
class sales(markets, products):
    weekend : datetime = field(metadata={'accept_whitespaces': False},init=True)
    total_units : float = field(metadata={'accept_whitespaces': False},init=True,default=0.00000)
    total_revenue : float = field(metadata={'accept_whitespaces': False},init=True,default=0.00000)

#assume all data is provided in a single file.

with open('sales.csv','r') as infile:
    reader = dataclass_csv.DataclassReader(infile, sales)
    reader.map('Week').to('weekend')
    reader.map('Market').to('market')
    reader.map('EANUPC').to('upc')
    reader.map('Product Description').to('product')
    reader.map('Sales Units').to('total_units')
    reader.map('Sales Dollars').to('total_revenue')
    for row in reader:
        print(row)

At this point I am expecting to see records like:
sales(weekend = ..., ..., total_revenue = 100.00)
...

However I get the following traceback on the first iteration of the for loop:

Traceback (most recent call last):

  File "", line 44, in
    for row in reader:

  File "/Users/user/opt/anaconda3/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 211, in __next__
    return self._process_row(row)

  File "/Users/user/opt/anaconda3/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 195, in _process_row
    transformed_value = field_type(value)

TypeError: 'module' object is not callable


Looking at the dataclass_reader.py I see "field_type = field.type", but not defined as a callable function anywhere.

Python 3.7.0 64 bit
Spyder IDE 3.3.6
Mac OS
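
A likely cause, judging only from the snippet above: import datetime binds the name datetime to the module, so the annotation weekend : datetime refers to the module rather than to the datetime.datetime class, and a module cannot be called. The sketch below shows the difference using only the standard library; it does not involve dataclass-csv itself, and the alias names are chosen just for this example.

```python
import datetime as datetime_module
from datetime import datetime as datetime_class

# The module is not callable, which matches the reported TypeError.
print(callable(datetime_module))
print(callable(datetime_class))

try:
    datetime_module("2020-01-04")
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

If this diagnosis is right, writing from datetime import datetime (or annotating the field as datetime.datetime) would make the annotation a class.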

Default values are passed to their constructor

This generalises #20.

I completed the example from the README. The following works.

import dataclasses
import io
import re

from dataclass_csv import DataclassReader


class SSN:
    def __init__(self, val):
        if re.match(r"\d{9}", val):
            self.val = f"{val[0:3]}-{val[3:5]}-{val[5:9]}"
        elif re.match(r"\d{3}-\d{2}-\d{4}", val):
            self.val = val
        else:
            raise ValueError(f"Invalid SSN: {val!r}")


@dataclasses.dataclass
class User:
    name: str
    ssn: SSN


print(list(DataclassReader(io.StringIO("name,ssn\nAlice,123456789"), User)))

After replacing the SSN in the data with a default value, the error below occurs.

class User:
    name: str
    ssn: SSN = SSN("123456789")


print(list(DataclassReader(io.StringIO("name,ssn\nAlice,"), User)))
  File "readme.py", line 10, in __init__
    if re.match(r"\d{9}", val):
TypeError: expected string or bytes-like object

The reason being that the SSN constructor is called with the default value, which is an object. The next snippet works around that.

class User:
    name: str
    ssn: SSN = "123456789"

But now this class can no longer be instantiated directly without creating broken objects, and type checkers will complain.

I think if a default value exists it should be used as is, and not passed to the constructor of its class.

Feature Request: Support custom field types

It'd be nice to have some way of allowing users to specify custom field types. E.g.:

  • a metadata field that maps raw strings to the desired type
  • an ABC that users can subclass and override some parse method that the library will call (similar to how Go's csv lib works)

Add support for boolean fields

Description

Support various string representations of bool values.

Use Case

I'd like to use dataclass-csv as part of a data pipeline which imports data from a CSV intermediate format into PostgreSQL. Currently any non-null boolean value is parsed as True, which prevents me from using this nice package.

Proposed solution

    def _parse_bool_value(self, bool_value):
        """Parses `str` representation of a boolean value to a `bool`. Case-insensitive.
        Values parsed as True: 'yes', 'true', 't', 'y', '1'
        Values parsed as False: 'no', 'false', 'f', 'n', '0'
        All other values (including empty/NULL) parsed as None.
        """
        return str2bool(bool_value)

We can use this in _process_row(). This leverages a very lightweight package called str2bool and I put a little wrapper around it just so we can have a unit test around this parsing in case we decide to change the implementation later.

If you'd rather not add this dependency, we could re-implement str2bool pretty easily.

I've got this ready to roll locally, so feel free to assign this one to me and I can get a PR submitted shortly.

Btw, I'm really liking this package and dataclasses in general and would be glad to help out more in the future. Thanks!
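
For reference, a dependency-free version of the parsing rules in the docstring above could look like the sketch below. The name parse_bool and the two sets are assumptions made for this example; it is not the str2bool package and not the code that ended up in the library.

```python
from typing import Optional

TRUE_VALUES = {"yes", "true", "t", "y", "1"}
FALSE_VALUES = {"no", "false", "f", "n", "0"}


def parse_bool(text: Optional[str]) -> Optional[bool]:
    """Map a string to True/False, or None when it is not recognised."""
    if text is None:
        return None
    normalized = text.strip().lower()  # case-insensitive, ignores padding
    if normalized in TRUE_VALUES:
        return True
    if normalized in FALSE_VALUES:
        return False
    return None


results = [parse_bool(value) for value in ("Yes", "0", "", "maybe")]
print(results)
```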

AttributeError: 'str' object has no attribute '__dict__'

Hi, not sure if I'm missing something, I'm new to the lib. I am getting the reported error on these lines of the lib:

[image: screenshot of the library source]

Is it my fault? My code is like this:

def deserializeFromCsv(self, base_folder: str) -> MondrianAccount:
    with open(base_folder + "FakeAccount.csv") as users_csv:
        reader = DataclassReader(users_csv, MondrianAccount)
        print(reader)
        for row in reader:
            print(row)

    return reader[0]

I'm not using Union, if that helps

Populate dataclass fields in their order with row data

As a convenient alternative to explicitly mapping column names to field names when they don't match directly, it would be handy if dataclass-csv could populate the dataclass's fields in the order they are declared. This would be particularly useful for CSV files that have many columns with cryptic headers (whose generation is not under the user's control).

E.g. given the following CSV file

firstname, age
Jean, 77
Jim, 22
...

and dataclass

@dataclass
class Person:
  forename: str
  age: int

I would expect the following Person instances to be created:

Person(forename='Jean', age=77)
Person(forename='Jim', age=22)
...

That is, the fields from top to bottom are initialised with the row data from left to right.
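
The requested behaviour can be sketched with the standard library as follows. The helper read_by_position is hypothetical and only illustrates the idea of pairing declared fields with columns by position; it is not an existing dataclass-csv feature.

```python
import csv
import io
from dataclasses import dataclass, fields


@dataclass
class Person:
    forename: str
    age: int


def read_by_position(f, cls):
    """Fill fields in declaration order from columns left to right."""
    reader = csv.reader(f, skipinitialspace=True)
    next(reader)  # discard the header row; column names are ignored on purpose
    flds = fields(cls)
    return [cls(*(fld.type(cell) for fld, cell in zip(flds, row))) for row in reader]


people = read_by_position(io.StringIO("firstname, age\nJean, 77\nJim, 22\n"), Person)
print(people)
```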

Report mapping non-existent column

The following example from the README works.

import dataclasses
import io

from dataclass_csv import DataclassReader


@dataclasses.dataclass
class User:
    firstname: str
    email: str = "Not specified"


reader = DataclassReader(io.StringIO("firstname,e-mail\nElsa,[email protected]"), User)
reader.map("e-mail").to("email")
print(list(reader))  # [User(firstname='Elsa', email='[email protected]')]

Unfortunately introducing a typo leads to silent data loss.

reader.map("e_mail").to("email")
print(list(reader))  # [User(firstname='Elsa', email='Not specified')]

I think raising when an explicitly mapped column does not occur in the header of the data would help prevent errors.
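
A check of this kind could be as small as the sketch below, which compares the mapped column names against the header that csv.DictReader exposes through fieldnames. The function check_mapping and the choice of KeyError are assumptions for illustration, not behaviour of the library.

```python
import csv
import io


def check_mapping(reader, column_to_field):
    """Raise if a mapped column name is absent from the CSV header."""
    header = reader.fieldnames or []
    missing = [column for column in column_to_field if column not in header]
    if missing:
        raise KeyError(f"Mapped columns not found in header {header}: {missing}")


reader = csv.DictReader(io.StringIO("firstname,e-mail\nElsa,elsa@example.com\n"))
check_mapping(reader, {"e-mail": "email"})  # passes silently

try:
    check_mapping(reader, {"e_mail": "email"})  # typo in the column name
except KeyError as exc:
    typo_error = str(exc)
    print(typo_error)
```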

Support to split object properties to separate columns.

Say I have the following classes for serialization to CSV:

@dataclass
class Link:
    title: str
    url: str

@dataclass
class SearchResult:
    paper_name: Link
    authors: list[Link]
    publication: Link

I want to split paper_name into paper_name.title and paper_name.url columns in the generated csv file.

See also #33 (comment).

Maybe some annotation can be used on the classes of the property type to indicate the purpose.
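
One way to picture the requested output is a flattening step that turns nested dataclass instances into dotted column names. The sketch below uses only the standard library; Paper is a reduced stand-in for SearchResult without the list field, and flatten is a hypothetical helper, so list-valued properties remain an open question.

```python
from dataclasses import dataclass, fields, is_dataclass


@dataclass
class Link:
    title: str
    url: str


@dataclass
class Paper:  # reduced stand-in for SearchResult, without the list field
    paper_name: Link
    publication: Link


def flatten(obj, prefix=""):
    """Return {'field.subfield': value} for nested dataclass instances."""
    flat = {}
    for fld in fields(obj):
        value = getattr(obj, fld.name)
        key = f"{prefix}{fld.name}"
        if is_dataclass(value):
            flat.update(flatten(value, prefix=f"{key}."))
        else:
            flat[key] = value
    return flat


row = flatten(
    Paper(
        Link("A Title", "https://example.com/a"),
        Link("A Journal", "https://example.com/j"),
    )
)
print(row)
```

A dict shaped like this could then be handed to csv.DictWriter, with its keys as the header.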

Override constructor with transformation function

This generalises #17.

Currently any class not special-cased by the library must have a constructor that takes a single string.

Even for built-in types this leads to workarounds like the following.

class Time(datetime.time):
    def __new__(cls, value: str):
        hour, minute = value.split(":")
        return super().__new__(cls, int(hour), int(minute))

I think supporting transformation functions would add much-needed flexibility.

One possible syntax might be this.

def strptime(value: str):
    hour, minute = value.split(":")
    return datetime.time(int(hour), int(minute))

reader.map('time').using(strptime)  # With or without .to()
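
Independently of the exact syntax, the proposal boils down to a per-field table of callables. The sketch below shows that idea in plain Python; the converters dict and convert_row are hypothetical names for this example, and .using() above remains a proposed API rather than an existing one.

```python
import datetime


def parse_time(value: str) -> datetime.time:
    hour, minute = value.split(":")
    return datetime.time(int(hour), int(minute))


# Field name -> callable that turns the raw string into a value.
converters = {"time": parse_time, "count": int}


def convert_row(row, converters):
    """Apply the registered converter to each cell; leave others as text."""
    return {name: converters.get(name, str)(value) for name, value in row.items()}


converted = convert_row({"time": "09:30", "count": "3", "note": "ok"}, converters)
print(converted)
```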

Allow NULL values more conveniently, and fix README TypeError: non-default argument 'age' follows default argument

CSV files and other databases commonly have fields that are NULL (missing / empty). As an easy example, a string field in a row might be of zero length. Handling such situations is currently complicated and confusing with dataclass-csv.

The current code allows a workaround by specifying a default value.

The example given for default values is this:

@dataclass
class User:
    firstname: str
    email: str = 'Not specified'
    age: int

But that generates this error:

Traceback (most recent call last):
  File "example.py", line 5, in <module>
    class User:
  File "/usr/lib/python3.8/dataclasses.py", line 1019, in dataclass
    return wrap(cls)
  File "/usr/lib/python3.8/dataclasses.py", line 1011, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
  File "/usr/lib/python3.8/dataclasses.py", line 925, in _process_class
    _init_fn(flds,
  File "/usr/lib/python3.8/dataclasses.py", line 502, in _init_fn
    raise TypeError(f'non-default argument {f.name!r} '
TypeError: non-default argument 'age' follows default argument

It actually works to simply put all the fields which have default values at the end of the class, since the order of class members doesn't matter:

@dataclass
class User:
    firstname: str
    age: int
    email: str = ''

A quick, partial solution would seem to be fixing the example, and noting the need to put the fields with default values at the end.

But requiring that the fields not appear in order reduces the clarity of the code and is more convoluted than necessary.

I think it would be good to also automatically move the fields needing default values to the end of the generated class.

It is perhaps also worthwhile to provide a way to specify that NULL values are ok for all fields, though that requires figuring out what value to use as a default default, I guess....

DataclassWriter

It would be great if, given a List[SomeDataclass] it were possible to write back out a series of dataclasses into their corresponding CSV representation.

This would allow someone to read a CSV into dataclasses, modify the data in a type safe way, and then write it back out to CSV.

Type annotations for *args and **kwds

The type annotations for the DataclassReader are incorrect:

class DataclassReader:
    def __init__(
        self,
        f: Any,
        cls: Type[object],
        fieldnames: Optional[Sequence[str]] = None,
        restkey: Optional[str] = None,
        restval: Optional[Any] = None,
        dialect: str = "excel",
        *args: List[Any],
        **kwds: Dict[str, Any],
    ):

class DataclassReader:
    def __init__(
        self,
        f: Any,
        cls: Type[object],
        fieldnames: Optional[Sequence[str]] = ...,
        restkey: Optional[str] = ...,
        restval: Optional[Any] = ...,
        dialect: str = ...,
        *args: List[Any],
        **kwds: Dict[str, Any],
    ) -> None: ...

Even though *args and **kwds are technically a List and a Dict respectively, they are not typed the same way. See mypy docs or PEP-484. Changing both to Any solves this problem.

Mypy output when calling the DataclassReader Ctor:

reader = DataclassReader(
    lines,
    MyDataclass,
    delimiter="+",
#             ^ Argument "delimiter" to "DataclassReader" has incompatible type "str"; expected "Dict[str, Any]"
)
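
The rule from PEP 484 is that the annotation on *args and **kwds describes each individual argument, not the collected tuple or dict. A minimal sketch of the suggested signature, using a hypothetical stand-in function rather than the real constructor:

```python
from typing import Any, Dict, Tuple


def make_reader(
    f: Any, cls: type, *args: Any, **kwds: Any
) -> Tuple[Tuple[Any, ...], Dict[str, Any]]:
    """Stand-in for the constructor signature; returns what it received."""
    # Inside the function, args is Tuple[Any, ...] and kwds is Dict[str, Any].
    return args, kwds


received_args, received_kwds = make_reader(None, object, delimiter="+")
print(received_kwds)
```

With each keyword value typed as Any, a call such as delimiter="+" no longer conflicts with the annotation.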

Docstrings need to be added

Description

At the moment the code does not have any docstrings. Docstrings are helpful when developing since the developers can get help right in their text editor, IDE or the python REPL.

Support Optional fields

It would be useful if, rather than having to add a default (potentially meaningless) value to a field, instead you could mark it as Optional.

from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    firstname: str
    email: Optional[str]
    age: int

If the field is not provided in the row, the field is set to None. This will allow for a type-safe way to detect and handle missing values.
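
Detecting an Optional annotation is possible with typing.get_origin and typing.get_args (Python 3.8+). The sketch below is one way to do it and assumes only typing.Optional[...] needs to be recognised; unwrap_optional and convert are hypothetical helpers, not part of dataclass-csv.

```python
from typing import Optional, Union, get_args, get_origin


def unwrap_optional(tp):
    """Return (inner, True) for Optional[inner], otherwise (tp, False)."""
    args = get_args(tp)
    if get_origin(tp) is Union and len(args) == 2 and type(None) in args:
        inner = args[0] if args[1] is type(None) else args[1]
        return inner, True
    return tp, False


def convert(tp, raw):
    """Convert a raw cell, mapping empty text to None for Optional types."""
    inner, optional = unwrap_optional(tp)
    if optional and not raw.strip():
        return None
    return inner(raw)


print(convert(Optional[str], ""), convert(Optional[int], "7"), convert(int, "7"))
```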

datetime: support a formatter function in addition to the existing dateformat string

While the existing datetime support is nice, some additional flexibility would be nice. When reading date strings, for example, the ability to use a well known parser such as dateutil.parser or, the ability to manipulate the datetime object as it's created (such as making sure the conversion to UTC happens correctly, if needed), or even to instantiate a datetime from a non-human format, such as a unix timestamp.

To that end, in addition to the dateformat metadata that exists today, maybe some sort of datefactory or dateformatter metadata, which would be a callable, passing in the value, and returning a well-formed datetime object would be useful.

Raise an error when using field(init=False)

Description

The issue occurs when defining a field in the dataclass with the property init=False. When the init property is set to False, the auto-generated dataclass initializer does not include the field in its list of arguments.

How to reproduce

  1. Create a dataclass and set init=False in one of the fields, for example:
from dataclasses import dataclass, field

from dataclass_csv import DataclassReader


@dataclass
class User:
    firstname: str
    lastname: str
    age: int = field(init=False)
  2. Try reading a csv file containing the expected data
def main():

    with open('users.csv') as f:
        reader = DataclassReader(f, User)

        for row in reader:
            print(row)


if __name__ == '__main__':
    main()
  3. You will see the error on the terminal:
Traceback (most recent call last):
  File "app.py", line 25, in <module>
    main()
  File "app.py", line 20, in main
    for row in reader:
  File "/home/daniel/Envs/test-oGAe6Cfg/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 161, in __next__
    return self._process_row(row)
  File "/home/daniel/Envs/test-oGAe6Cfg/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 157, in _process_row
    return self.cls(*values)
TypeError: __init__() takes 3 positional arguments but 4 were given

Expected behavior

When setting init=False in a dataclass field, the field should not be passed to the dataclass initializer.

Other information

OS: Debian 4.9.130-2
Python version: 3.7.1
dataclass-csv version: 1.0.1
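
The expected behaviour can be expressed with the init attribute that dataclasses.fields exposes for every field. The sketch below is a standard-library illustration with a hypothetical build helper; a default is added to age here only so that the instance can be printed.

```python
from dataclasses import dataclass, field, fields


@dataclass
class User:
    firstname: str
    lastname: str
    age: int = field(init=False, default=0)  # default added so repr works


def build(cls, row):
    """Pass only the fields that the generated __init__ accepts."""
    kwargs = {fld.name: fld.type(row[fld.name]) for fld in fields(cls) if fld.init}
    return cls(**kwargs)


user = build(User, {"firstname": "John", "lastname": "Test", "age": "41"})
print(user)
```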

DeprecationWarning for distutils package

It seems that distutils will soon be obsolete.

/usr/local/lib/python3.10/site-packages/dataclass_csv/dataclass_reader.py:5
  /usr/local/lib/python3.10/site-packages/dataclass_csv/dataclass_reader.py:5: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
    from distutils.util import strtobool

Support 'Path' columns

Hi,

thanks for this very nice package. Do you have plans to support more custom fields as well (e.g. by passing in conversion functions?).

For me, the one type I very often use is "Path" as a lot of csv-files I have reference paths somewhere. Of course I can easily convert later, but if I define a dataclass, would be nice to define it correctly from the start!

Thanks

Raise an error when the csv file has header items with same name

Description

At the moment the DataclassReader behaves exactly the same as the DictReader when the csv file contains duplicated headers. By default, it will get the value of the last column, for example, assuming that we have a csv file with the following content:

firstname,lastname,age,age
John,Test,41,23
Edit,Test,3,34

Using the DictReader, we get the values:

[
    OrderedDict([('firstname', 'John'), ('lastname', 'Test'), ('age', '23')]), 
    OrderedDict([('firstname', 'Edit'), ('lastname', 'Test'), ('age', '34')])
]

Note that the values for the field age are the values of the second age column. The DataclassReader has the same behaviour: if we had a dataclass containing the field age, it would get the values of the second age column (23 and 34).

This can easily lead to mistakes and unpredictable results when reading the csv file.

Proposed behavior

It would be nice if, instead of just getting the values of the last column with the same name, the DataclassReader could raise an error saying "The CSV file {filename} contains duplicated column names: {column names}".

The DataclassReader could also get a boolean argument called duplicate_columns_error (we can find a better name) which would disable this verification and have the same behavior as the DictReader.

Need also to discuss if it would be better to have this feature disabled or enabled by default.

Other information

OS: Debian 4.9.130-2
Python version: 3.7.1
dataclass-csv version: 1.0.1
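
Detecting the duplicates themselves only needs the header row. A minimal standard-library sketch, with find_duplicate_columns as a hypothetical helper name:

```python
import csv
import io
from collections import Counter


def find_duplicate_columns(f):
    """Return header names that occur more than once, in first-seen order."""
    header = next(csv.reader(f), [])
    counts = Counter(header)
    return [name for name in counts if counts[name] > 1]


duplicates = find_duplicate_columns(
    io.StringIO("firstname,lastname,age,age\nJohn,Test,41,23\n")
)
print(duplicates)
```

A reader could raise an error whenever this list is non-empty, unless validation has been switched off.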

Support frozen dataclasses

Since CSV data is often used without changing it, it would be great if dataclass-csv supported frozen dataclasses. This way, users could enjoy the convenience of dataclass-csv and the additional safety of immutable data.

Automatically generate candidate dataclass

When just exploring lots of different csv files, it would be very handy to have an optional way to scan a csv file and generate a dataclass that would work with it.

It could start off just reading row headers and using them to come up with legal attribute names.

Next, automatically generating attribute types by analyzing the contents of each column would also be handy.

If that functionality were provided as module functions that can be called from other software, it would eventually be possible to start playing with any csv file you run across: the module could automatically read a csv in and provide access to the rows as objects which could be introspected.
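The idea above could be sketched with dataclasses.make_dataclass from the standard library. Everything here (to_identifier, infer_type, candidate_dataclass) is a hypothetical illustration, not an existing dataclass-csv API. It assumes the file has a header row, that the cleaned-up column names are unique, and it only distinguishes int, float and str.

```python
import csv
import io
import keyword
import re
from dataclasses import fields, make_dataclass


def to_identifier(header):
    """Turn a column header into a legal Python attribute name."""
    name = re.sub(r"\W+", "_", header.strip()).strip("_").lower() or "column"
    if name[0].isdigit() or keyword.iskeyword(name):
        name = f"field_{name}"
    return name


def infer_type(values):
    """Pick the first of int, float, str that parses every value in the column."""
    if not values:
        return str
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value.strip())
            return candidate
        except ValueError:
            continue
    return str


def candidate_dataclass(csv_file, class_name="Row"):
    """Build a dataclass whose fields mirror the columns of the csv file."""
    rows = list(csv.reader(csv_file))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data)) if data else [()] * len(header)
    spec = [(to_identifier(h), infer_type(col)) for h, col in zip(header, columns)]
    return make_dataclass(class_name, spec)


sample = io.StringIO(
    "First Name,email,age\nElsa,elsa@example.com, 11\nAstor,astor@example.com, 7\n"
)
Row = candidate_dataclass(sample)
print([(f.name, f.type) for f in fields(Row)])
```

The generated class can then be introspected with dataclasses.fields, as in the last line.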

datetime fields with default values trigger an error

I ran into a problem with dataclasses with fields that have default values. I managed to reproduce the issue with a small change in one of the tests: Modify UserWithDateFormatDecorator in tests/mocks.py to contain a datetime with a default value:

@dateformat('%Y-%m-%d')
@dataclasses.dataclass
class UserWithDateFormatDecorator:
    name: str
    create_date: datetime
    date_with_default: datetime = datetime.now()

Executing the test shows the error:

TypeError: strptime() argument 1 must be str, not datetime.datetime

Apparently, dataclass-csv tries to parse the default value as if it were an input string from the CSV, although in reality it is already a datetime (and can't be changed to a str either).
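One possible guard, sketched as a standalone helper. The name parse_datetime is hypothetical and the library's actual parsing code is not shown in this report; the point is only that a value that is already a datetime can be returned unchanged instead of being passed to strptime.

```python
from datetime import datetime


def parse_datetime(value, date_format):
    """Hypothetical helper: only parse values that are still strings."""
    if isinstance(value, datetime):
        # A default taken from the dataclass is already a datetime.
        return value
    return datetime.strptime(value, date_format)


parsed = parse_datetime("2019-01-31", "%Y-%m-%d")
default = datetime(2020, 5, 17)
unchanged = parse_datetime(default, "%Y-%m-%d")
```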

Does not support __future__.annotations

This code:

#!/usr/bin/env python3

from dataclasses import dataclass

from dataclass_csv import DataclassReader


@dataclass
class User:
    firstname: str
    email: str
    age: int


def main() -> None:
    with open('test.csv') as users_csv:
        reader = DataclassReader(users_csv, User)
        for row in reader:
            print(row)


if __name__ == '__main__':
    main()

Runs correctly:

User(firstname='Elsa', email='[email protected]', age=11)
User(firstname='Astor', email='[email protected]', age=7)
User(firstname='Edit', email='[email protected]', age=3)
User(firstname='Ella', email='[email protected]', age=2)

With the following test.csv:

firstname,email,age
Elsa,[email protected], 11
Astor,[email protected], 7
Edit,[email protected], 3
Ella,[email protected], 2

But using __future__.annotations, which PEP 563 planned to make the default in Python 4.0 and which is useful in Python 3.7+, breaks DataclassReader:

#!/usr/bin/env python3

from __future__ import annotations

from dataclasses import dataclass

from dataclass_csv import DataclassReader


@dataclass
class User:
    firstname: str
    email: str
    age: int


def main() -> None:
    with open('test.csv') as users_csv:
        reader = DataclassReader(users_csv, User)
        for row in reader:
            print(row)


if __name__ == '__main__':
    main()

Traceback (most recent call last):
  File "test.py", line 25, in <module>
    main()
  File "test.py", line 20, in main
    for row in reader:
  File "venv/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 211, in __next__
    return self._process_row(row)
  File "venv/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 159, in _process_row
    or '__origin__' in field_type.__dict__
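The traceback above is cut off, but its last frame reads field_type.__dict__, which suggests that field_type is no longer a class. With postponed evaluation, dataclasses stores each annotation as a string, and typing.get_type_hints can turn those strings back into real types. The sketch below writes the annotations as strings explicitly, which gives the same field types as the __future__ import; whether the library should resolve them this way is an assumption.

```python
import dataclasses
import typing


@dataclasses.dataclass
class User:
    # String annotations, as produced for every field by
    # `from __future__ import annotations`.
    firstname: "str"
    email: "str"
    age: "int"


# field.type holds the annotation exactly as written: a plain string.
raw_types = {f.name: f.type for f in dataclasses.fields(User)}

# typing.get_type_hints evaluates those strings back into real types.
resolved_types = typing.get_type_hints(User)

print(raw_types)
print(resolved_types)
```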

Cannot read a nested dataclass

Nested dataclasses are not handled by the DataclassReader. While the DataclassWriter serializes a nested dataclass as a tuple (thanks to the recursive behavior of dataclasses.astuple()), the DataclassReader tries to initialize the nested dataclass with:

transformed_value = field_type(value)

This cannot work because value is a string representing a tuple, whereas the constructor of field_type (which in this case is a dataclass) expects the fields necessary to build the dataclass.

The solution is pretty simple:

if dataclasses.is_dataclass(field_type):
    import ast
    fields = ast.literal_eval(value)
    values[field.name] = field_type(*fields)
    continue

This if statement could be placed in dataclass_reader.py, line 230. This is just a draft, but if you think it would be a good improvement I can propose a PR with error handling and tests as well.
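The round trip behind this draft can be checked with the standard library alone. Address and Person are hypothetical classes used only for illustration. The sketch relies on the csv module writing non-string values with str(), and it only covers one level of nesting with values that ast.literal_eval can parse.

```python
import ast
import dataclasses


@dataclasses.dataclass
class Address:
    street: str
    number: int


@dataclasses.dataclass
class Person:
    name: str
    address: Address


person = Person("John", Address("Main St", 7))

# astuple() recurses, so the nested dataclass becomes a nested tuple;
# a csv writer would store str() of that tuple in the cell.
cell = str(dataclasses.astuple(person)[1])

# Reading side: parse the literal and unpack it into the nested dataclass.
restored = Address(*ast.literal_eval(cell))
print(cell, restored)
```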

remapping of csv field -> dataclass field has issues when csv field has name match to dataclass field

I came across this problem when trying to work around #17.

I have a data file with a column 'date', which contains an integer (a unix timestamp), and a dataclass with a field date of type datetime.datetime. Since there isn't a way to convert that in dataclass_csv today, I munged the dataclass a bit: I made date an init=False field, introduced an InitVar of unix_ts, and used __post_init__ to convert the unix timestamp to a datetime field like so:

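import datetime
from dataclasses import InitVar, dataclass, field


@dataclass
class SomeData:
    date: datetime.datetime = field(init=False)
    unix_ts: InitVar[int]

    def __post_init__(self, unix_ts: int):
        # Convert the unix timestamp passed to __init__ into the date field.
        self.date = datetime.datetime.utcfromtimestamp(unix_ts)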

This works exactly as expected: the init signature is looking for unix_ts instead of date, date gets set with the value of unix_ts, all my consumers of the dataclass object don't have to change.
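The behaviour described above can be checked without dataclass-csv. This sketch repeats the dataclass so that it runs on its own; the one deliberate difference is that it uses the timezone-aware datetime.fromtimestamp(..., tz=timezone.utc) instead of utcfromtimestamp, which newer Python versions deprecate.

```python
import datetime
import inspect
from dataclasses import InitVar, dataclass, field


@dataclass
class SomeData:
    date: datetime.datetime = field(init=False)
    unix_ts: InitVar[int]

    def __post_init__(self, unix_ts: int):
        self.date = datetime.datetime.fromtimestamp(unix_ts, tz=datetime.timezone.utc)


# The generated __init__ only takes unix_ts; date is filled in afterwards.
init_parameters = list(inspect.signature(SomeData).parameters)
record = SomeData(unix_ts=0)
print(init_parameters, record.date)
```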

... until I try to load the data from dataclass_csv, at any rate.

reader = DataclassReader(inp, SomeData)
reader.map('date').to('unix_ts')
for record in reader:
   ...

This generates the standard 'hey, you're telling me to map onto a datetime field, and didn't tell me how to convert it' error:

AttributeError: Unable to parse the datetime string value. Date format not specified. To specify a date format for all datetime fields in the class, use the @dateformat decorator. To define a date format specifically for this field, change its definition to: `date: datetime = field(metadata={'dateformat': <date_format>})`

Okay, fine, I figure I'll open an issue (here it is!), and I'll rename the date field on my dataclass to something else. Then I run into #9 because despite having init=False, it's still trying to build it into the construction of my class.

So, between #9 and #17, I'm actually stuck at the moment, but while working both of those out, I encountered this issue and thought I'd raise it.
