Hi, I'm @dfurtado
I'm interested in programming, retro videogames, skateboarding, buddhism and dogs!
Always trying to improve myself as a human being
Map CSV to Data Classes
License: Other
Hi, I'm not sure if I'm missing something; I'm new to the lib. I am getting the reported error on these lines of the lib.
Is it my fault? My code is like this:
def deserializeFromCsv(self, base_folder: str) -> MondrianAccount:
    with open(base_folder + "FakeAccount.csv") as users_csv:
        reader = DataclassReader(users_csv, MondrianAccount)
        print(reader)
        for row in reader:
            print(row)
        return reader[0]
I'm not using Union, if that helps
At the moment the `DataclassReader` behaves exactly the same as the `DictReader` when the csv file contains duplicated headers. By default, it will get the value of the last column. For example, assuming that we have a csv file with the following content:
firstname,lastname,age,age
John,Test,41,23
Edit,Test,3,34
Using the `DictReader`, we get the values:
[
OrderedDict([('firstname', 'John'), ('lastname', 'Test'), ('age', '23')]),
OrderedDict([('firstname', 'Edit'), ('lastname', 'Test'), ('age', '34')])
]
Note that the values for the field `age` are the values of the second `age` column. The `DataclassReader` has the same behaviour: if we had a data class containing the field `age`, it would get the values of the second `age` column (23 and 34).
This can easily lead to mistakes and unpredictable results when reading the csv file.
It would be nice if, instead of just getting the values of the last column with the same name, the `DataclassReader` raised an error such as "The CSV file (unknown) contains duplicated column names: {column names}". The `DataclassReader` could also accept a boolean argument called `duplicate_columns_error` (we can find a better name) which would disable this verification and keep the same behavior as the `DictReader`.
We also need to discuss whether it would be better to have this feature enabled or disabled by default.
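A sketch of the proposed check, using `collections.Counter` over the header row; `check_duplicate_headers` is a hypothetical helper name, not part of the library:

```python
import collections
import csv
import io

def check_duplicate_headers(csv_file):
    """Raise ValueError when the CSV header row contains duplicated column names."""
    header = next(csv.reader(csv_file))
    duplicates = [name for name, count in collections.Counter(header).items() if count > 1]
    if duplicates:
        raise ValueError(f"The CSV file contains duplicated column names: {duplicates}")
    csv_file.seek(0)  # rewind so a reader can consume the file from the start

data = io.StringIO("firstname,lastname,age,age\nJohn,Test,41,23\n")
try:
    check_duplicate_headers(data)
except ValueError as exc:
    print(exc)  # The CSV file contains duplicated column names: ['age']
```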
OS: Debian 4.9.130-2
Python version: 3.7.1
dataclass-csv version: 1.0.1
I ran into a problem with dataclasses whose fields have default values. I managed to reproduce the issue with a small change in one of the tests: modify `UserWithDateFormatDecorator` in `tests/mocks.py` to contain a `datetime` field with a default value:
@dateformat('%Y-%m-%d')
@dataclasses.dataclass
class UserWithDateFormatDecorator:
    name: str
    create_date: datetime
    date_with_default: datetime = datetime.now()
Executing the test shows the error:
TypeError: strptime() argument 1 must be str, not datetime.datetime
Apparently, dataclass-csv tries to parse the default value as if it were an input string from the CSV, although in reality it is already a `datetime` (and can't be changed to a `str` either).
CSV files and other databases commonly have fields that are NULL (missing / empty). As an easy example, a string field in a row might be of zero length. Handling such situations is currently complicated and confusing with dataclass-csv.
The current code allows a workaround by specifying a default value.
The example given for default values is this:
@dataclass
class User:
    firstname: str
    email: str = 'Not specified'
    age: int
But that generates this error:
Traceback (most recent call last):
  File "example.py", line 5, in <module>
    class User:
  File "/usr/lib/python3.8/dataclasses.py", line 1019, in dataclass
    return wrap(cls)
  File "/usr/lib/python3.8/dataclasses.py", line 1011, in wrap
    return _process_class(cls, init, repr, eq, order, unsafe_hash, frozen)
  File "/usr/lib/python3.8/dataclasses.py", line 925, in _process_class
    _init_fn(flds,
  File "/usr/lib/python3.8/dataclasses.py", line 502, in _init_fn
    raise TypeError(f'non-default argument {f.name!r} '
TypeError: non-default argument 'age' follows default argument
It actually works to simply put all the fields which have default values at the end of the class, since the order of class members doesn't matter:
@dataclass
class User:
    firstname: str
    age: int
    email: str = ''
A quick, partial solution would seem to be fixing the example, and noting the need to put the fields with default values at the end.
But requiring that the fields not appear in order reduces the clarity of the code and is more convoluted than necessary.
I think it would be good to also automatically move the fields needing default values to the end of the generated class.
It is perhaps also worthwhile to provide a way to specify that NULL values are ok for all fields, though that requires figuring out what value to use as a default default, I guess....
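The proposed reordering could be sketched with `dataclasses.make_dataclass`, which lets the library place the fields with defaults last regardless of declaration order; `reordered_dataclass` is a hypothetical helper, not part of dataclass-csv:

```python
import dataclasses

def reordered_dataclass(name, annotations, defaults):
    """Build a dataclass with the fields that have defaults moved to the end,
    so 'non-default argument follows default argument' can never be raised."""
    plain = [(n, t) for n, t in annotations.items() if n not in defaults]
    defaulted = [
        (n, t, dataclasses.field(default=defaults[n]))
        for n, t in annotations.items() if n in defaults
    ]
    return dataclasses.make_dataclass(name, plain + defaulted)

User = reordered_dataclass(
    "User",
    {"firstname": str, "email": str, "age": int},
    {"email": "Not specified"},
)
print([f.name for f in dataclasses.fields(User)])  # ['firstname', 'age', 'email']
print(User("Elsa", 11))  # User(firstname='Elsa', age=11, email='Not specified')
```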
Say I have the following classes for serialization to csv.
@dataclass
class Link:
    title: str
    url: str

@dataclass
class SearchResult:
    paper_name: Link
    authors: list[Link]
    publication: Link
I want to split `paper_name` into `paper_name.title` and `paper_name.url` columns in the generated csv file.
See also #33 (comment).
Maybe some annotation can be used on the classes of the property type to indicate the purpose.
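One possible shape for the flattening, sketched outside the library: recursively walk `dataclasses.fields()` and emit dotted column names. `flatten` is a hypothetical helper, and the `authors: list[Link]` field is omitted here since variable-length lists would need a separate naming scheme:

```python
import dataclasses

@dataclasses.dataclass
class Link:
    title: str
    url: str

@dataclasses.dataclass
class SearchResult:
    paper_name: Link
    publication: Link

def flatten(obj, prefix=""):
    """Recursively flatten a dataclass instance into a {dotted_name: value} dict."""
    out = {}
    for f in dataclasses.fields(obj):
        value = getattr(obj, f.name)
        name = prefix + f.name
        if dataclasses.is_dataclass(value):
            out.update(flatten(value, prefix=name + "."))
        else:
            out[name] = value
    return out

result = SearchResult(Link("Some paper", "http://example.com/paper"),
                      Link("Some venue", "http://example.com/venue"))
print(list(flatten(result)))
# ['paper_name.title', 'paper_name.url', 'publication.title', 'publication.url']
```

The resulting dict keys could serve directly as the CSV header row.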
While the existing datetime support is nice, some additional flexibility would be nice. When reading date strings, for example, the ability to use a well known parser such as dateutil.parser or, the ability to manipulate the datetime object as it's created (such as making sure the conversion to UTC happens correctly, if needed), or even to instantiate a datetime from a non-human format, such as a unix timestamp.
To that end, in addition to the `dateformat` metadata that exists today, maybe some sort of `datefactory` or `dateformatter` metadata would be useful: a callable that takes the value and returns a well-formed datetime object.
I came across this problem when trying to work around #17.
Given a data file with a column 'date' which contains an integer (a unix timestamp), and a dataclass with a field `date` of type datetime.datetime. Since there isn't a way to convert that in dataclass_csv today, I munged the dataclass a bit: I made `date` an `init=False` field, introduced an InitVar `unix_ts`, and used `__post_init__` to convert the unix timestamp to a datetime field like so:
@dataclass
class SomeData:
    date: datetime.datetime = field(init=False)
    unix_ts: InitVar[int]

    def __post_init__(self, unix_ts: int):
        self.date = datetime.datetime.utcfromtimestamp(unix_ts)
This works exactly as expected: the init signature is looking for unix_ts instead of date, date gets set with the value of unix_ts, all my consumers of the dataclass object don't have to change.
... until I try to load the data from dataclass_csv, at any rate.
reader = DataclassReader(inp, SomeData)
reader.map('date').to('unix_ts')

for record in reader:
    ...
This generates the standard 'hey, you're telling me to map onto a datetime field, and didn't tell me how to convert' error:
AttributeError: Unable to parse the datetime string value. Date format not specified. To specify a date format for all datetime fields in the class, use the @dateformat decorator. To define a date format specifically for this field, change its definition to: `date: datetime = field(metadata={'dateformat': <date_format>})`
Okay, fine, I figure I'll open an issue (here it is!), and I'll rename the `date` field on my dataclass to something else. Then I run into #9 because despite having init=False, it's still trying to build it into the construction of my class.
So, between #9 and #17, I'm actually stuck at the moment, but while working both of those out, I encountered this issue and thought I'd raise it.
The following example from the README works.
import dataclasses
import io

from dataclass_csv import DataclassReader

@dataclasses.dataclass
class User:
    firstname: str
    email: str = "Not specified"

reader = DataclassReader(io.StringIO("firstname,e-mail\nElsa,[email protected]"), User)
reader.map("e-mail").to("email")
print(list(reader))  # [User(firstname='Elsa', email='[email protected]')]
Unfortunately introducing a typo leads to silent data loss.
reader.map("e_mail").to("email")
print(list(reader)) # [User(firstname='Elsa', email='Not specified')]
I think raising when an explicitly mapped column does not occur in the header of the data would help prevent errors.
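The proposed validation could be sketched as a header check before iteration; `validate_mappings` is a hypothetical helper, not part of the library:

```python
import csv
import io

def validate_mappings(csv_file, mappings):
    """Raise KeyError when an explicitly mapped column is absent from the header."""
    header = next(csv.reader(csv_file))
    csv_file.seek(0)  # rewind so the real reader still sees the header
    missing = [column for column in mappings if column not in header]
    if missing:
        raise KeyError(f"Mapped column(s) not found in the CSV header: {missing}")

data = io.StringIO("firstname,e-mail\nElsa,elsa@example.com\n")
try:
    validate_mappings(data, {"e_mail": "email"})  # note the typo: e_mail vs e-mail
except KeyError as exc:
    print(exc)
```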
It seems that `distutils` will soon be obsolete.
/usr/local/lib/python3.10/site-packages/dataclass_csv/dataclass_reader.py:5
/usr/local/lib/python3.10/site-packages/dataclass_csv/dataclass_reader.py:5: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
from distutils.util import strtobool
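Since `strtobool` is the only thing imported from `distutils` here, a small local replacement would remove the dependency; a minimal sketch (note that `distutils.util.strtobool` returned the ints 1/0, while this sketch returns a bool):

```python
def strtobool(value: str) -> bool:
    """Local replacement for distutils.util.strtobool (see PEP 632).

    True values: 'y', 'yes', 't', 'true', 'on', '1'.
    False values: 'n', 'no', 'f', 'false', 'off', '0'.
    """
    normalized = value.strip().lower()
    if normalized in ("y", "yes", "t", "true", "on", "1"):
        return True
    if normalized in ("n", "no", "f", "false", "off", "0"):
        return False
    raise ValueError(f"Invalid truth value: {value!r}")

print(strtobool("Yes"), strtobool("0"))  # True False
```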
I am hoping this is just a naive user question. Refer to this line of code in the DataclassReader.
Task:
Using subclasses to define a logical data model, read a csv file whose column names differ from those in the data model. In the DataclassReader, use the .map method to map column names in the csv file to field names in the sales class. The class 'sales' includes 'markets' and 'products' master data as subclasses.
from dataclasses import dataclass, field
import dataclass_csv
import datetime

# define dataclasses to align with a logical dimensional data model
@dataclass
class markets:
    market_id: str = field(metadata={'accept_whitespaces': True}, init=False)
    market: str = field(metadata={'accept_whitespaces': True}, init=True)

@dataclass
class products:
    upc: str = field(metadata={'accept_whitespaces': True}, init=True)
    product: str = field(metadata={'accept_whitespaces': True}, init=True)

@dataclass
@dataclass_csv.dateformat('%Y-%m-%d')
class sales(markets, products):
    weekend: datetime = field(metadata={'accept_whitespaces': False}, init=True)
    total_units: float = field(metadata={'accept_whitespaces': False}, init=True, default=0.00000)
    total_revenue: float = field(metadata={'accept_whitespaces': False}, init=True, default=0.00000)

# assume all data is provided in a single file
with open('sales.csv', 'r') as infile:
    reader = dataclass_csv.DataclassReader(infile, sales)
    reader.map('Week').to('weekend')
    reader.map('Market').to('market')
    reader.map('EANUPC').to('upc')
    reader.map('Product Description').to('product')
    reader.map('Sales Units').to('total_units')
    reader.map('Sales Dollars').to('total_revenue')

    for row in reader:
        print(row)
At this point I am expecting the see records like:
sales(weekend = ..., ..., total_revenue = 100.00)
...
Traceback (most recent call last):
  File "", line 44, in
    for row in reader:
  File "/Users/user/opt/anaconda3/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 211, in __next__
    return self._process_row(row)
  File "/Users/user/opt/anaconda3/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 195, in _process_row
    transformed_value = field_type(value)
TypeError: 'module' object is not callable
Looking at dataclass_reader.py I see `field_type = field.type`, but it is not defined as a callable function anywhere.
Python 3.7.0 64 bit
Spyder IDE 3.3.6
Mac OS
Hi,
thanks for this very nice package. Do you have plans to support more custom fields as well (e.g. by passing in conversion functions?).
For me, the one type I use very often is `Path`, as a lot of the csv files I have reference paths somewhere. Of course I can easily convert later, but if I define a dataclass, it would be nice to define it correctly from the start!
Thanks
At the moment the code does not have any docstrings. Docstrings are helpful when developing since the developers can get help right in their text editor, IDE or the python REPL.
When reading a csv file, it would be nice if when iterating through DataclassReader the type checker could recognise the objects as being the type that the DataclassReader was instantiated with. At the moment it just returns `Any`.
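A sketch of how the reader could be made generic so type checkers infer the element type; `TypedDataclassReader` is a hypothetical stand-in for illustration, not the library's actual class:

```python
from dataclasses import dataclass
from typing import Generic, Iterator, List, TypeVar

T = TypeVar("T")

class TypedDataclassReader(Generic[T]):
    """Hypothetical generic reader: iterating yields T rather than Any."""

    def __init__(self, rows: List[T]) -> None:
        self._rows = rows

    def __iter__(self) -> Iterator[T]:
        return iter(self._rows)

@dataclass
class User:
    firstname: str

reader = TypedDataclassReader([User("Elsa"), User("Astor")])
for row in reader:
    print(row.firstname)  # a type checker now knows row is a User
```

The real implementation would parameterize `DataclassReader.__init__` on `Type[T]` and annotate `__next__` as returning `T`.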
This generalises #20.
I completed the example from the README. The following works.
import dataclasses
import io
import re

from dataclass_csv import DataclassReader

class SSN:
    def __init__(self, val):
        if re.match(r"\d{9}", val):
            self.val = f"{val[0:3]}-{val[3:5]}-{val[5:9]}"
        elif re.match(r"\d{3}-\d{2}-\d{4}", val):
            self.val = val
        else:
            raise ValueError(f"Invalid SSN: {val!r}")

@dataclasses.dataclass
class User:
    name: str
    ssn: SSN

print(list(DataclassReader(io.StringIO("name,ssn\nAlice,123456789"), User)))
After replacing the SSN in the data with a default value, the error below occurs.
class User:
    name: str
    ssn: SSN = SSN("123456789")

print(list(DataclassReader(io.StringIO("name,ssn\nAlice,"), User)))
  File "readme.py", line 10, in __init__
    if re.match(r"\d{9}", val):
TypeError: expected string or bytes-like object
The reason being that the SSN constructor is called with the default value, which is an object. The next snippet works around that.
class User:
    name: str
    ssn: SSN = "123456789"
But now this class can no longer be instantiated directly without creating broken objects, and type checkers will complain.
I think if a default value exists it should be used as is, and not passed to the constructor of its class.
As a convenient alternative to explicitly mapping column to field names, in case they don't directly match, it would be handy if dataclass-csv could populate dataclass fields in the order they are declared. This would be particularly handy in the presence of CSV files that have many columns with cryptic headers (whose generation is not under the user's control).
E.g. given the following CSV file
firstname, age
Jean, 77
Jim, 22
...
and dataclass
@dataclass
class Person:
    forename: str
    age: int
I would expect the following `Person` instances to be created:
Person(forename='Jean', age=77)
Person(forename='Jim', age=22)
...
That is, the fields from top to bottom are initialised with the row data from left to right.
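A sketch of positional mapping using only the stdlib: skip the header, then zip each row against the dataclass fields in declaration order (`read_positional` is a hypothetical helper, not the library's API):

```python
import csv
import dataclasses
import io

@dataclasses.dataclass
class Person:
    forename: str
    age: int

def read_positional(csv_file, cls):
    """Map CSV columns to dataclass fields by position, ignoring header names."""
    rows = csv.reader(csv_file)
    next(rows)  # the header is skipped entirely; its names do not matter
    field_types = [f.type for f in dataclasses.fields(cls)]
    for row in rows:
        yield cls(*(tp(cell.strip()) for tp, cell in zip(field_types, row)))

data = io.StringIO("firstname, age\nJean, 77\nJim, 22\n")
print(list(read_positional(data, Person)))
# [Person(forename='Jean', age=77), Person(forename='Jim', age=22)]
```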
It'd be nice to have some way of allowing users to specify custom field types, e.g. a `parse` method that the library will call (similar to how Go's csv lib works).

When reading datetime fields from a file, the @dateformat() decorator works well, but when writing the same dataclass with DataclassWriter it does not write them to the file in this format.
It would be useful if, rather than having to add a default (potentially meaningless) value to a field, instead you could mark it as Optional
.
from dataclasses import dataclass
from typing import Optional

@dataclass
class User:
    firstname: str
    email: Optional[str]
    age: int
If the field is not provided in the row, the field is set to `None`. This will allow for a type-safe way to detect and handle missing values.
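A sketch of how empty cells could map to `None` for `Optional` fields, using `typing.get_origin`/`get_args` to detect `Optional[X]`; `build` is a hypothetical helper, and this sketch skips the library's usual type conversion (the `age` value stays a string):

```python
import dataclasses
import typing

@dataclasses.dataclass
class User:
    firstname: str
    email: typing.Optional[str]
    age: int

def is_optional(annotation) -> bool:
    """True when the annotation is Optional[X], i.e. Union[X, None]."""
    return (typing.get_origin(annotation) is typing.Union
            and type(None) in typing.get_args(annotation))

def build(cls, row):
    """Instantiate cls from a row dict, turning empty Optional cells into None."""
    values = {}
    for f in dataclasses.fields(cls):
        raw = row.get(f.name, "")
        values[f.name] = None if raw == "" and is_optional(f.type) else raw
    return cls(**values)

print(build(User, {"firstname": "Elsa", "email": "", "age": "11"}))
# User(firstname='Elsa', email=None, age='11')
```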
This code:
#!/usr/bin/env python3
from dataclasses import dataclass
from dataclass_csv import DataclassReader

@dataclass
class User:
    firstname: str
    email: str
    age: int

def main() -> None:
    with open('test.csv') as users_csv:
        reader = DataclassReader(users_csv, User)
        for row in reader:
            print(row)

if __name__ == '__main__':
    main()
Runs correctly:
User(firstname='Elsa', email='[email protected]', age=11)
User(firstname='Astor', email='[email protected]', age=7)
User(firstname='Edit', email='[email protected]', age=3)
User(firstname='Ella', email='[email protected]', age=2)
With the following test.csv:
firstname,email,age
Elsa,[email protected], 11
Astor,[email protected], 7
Edit,[email protected], 3
Ella,[email protected], 2
But using `__future__.annotations`, which will be mandatory in Python 4.0 and is useful in Python 3.7+, breaks `DataclassReader`:
#!/usr/bin/env python3
from __future__ import annotations

from dataclasses import dataclass
from dataclass_csv import DataclassReader

@dataclass
class User:
    firstname: str
    email: str
    age: int

def main() -> None:
    with open('test.csv') as users_csv:
        reader = DataclassReader(users_csv, User)
        for row in reader:
            print(row)

if __name__ == '__main__':
    main()
Traceback (most recent call last):
  File "test.py", line 25, in <module>
    main()
  File "test.py", line 20, in main
    for row in reader:
  File "venv/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 211, in __next__
    return self._process_row(row)
  File "venv/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 159, in _process_row
    or '__origin__' in field_type.__dict__
Nested dataclasses are not handled by the `DataclassReader`. While the `DataclassWriter` serializes a nested dataclass as a tuple (thanks to the recursive behavior of `dataclasses.astuple()`), the `DataclassReader` tries to initialize the nested dataclass with:
This cannot work because `value` is a string representing a tuple, whereas the constructor of `field_type` (in this case a dataclass) expects the fields necessary to build the dataclass.
The solution is pretty simple:
if dataclasses.is_dataclass(field_type):
    import ast
    fields = ast.literal_eval(value)
    values[field.name] = field_type(*fields)
    continue
This if-statement could be placed in `dataclass_reader.py`, line 230. This is just a draft, but if you think it would be a good improvement I can propose a PR with error handling and tests as well.
The issue occurs when defining a field in the `dataclass` with the property init=False. When the init property is set to false, it means that the auto-generated `dataclass` initializer won't add the field to the list of arguments.
Create a `dataclass` and set init=False in one of the fields, for example:
from dataclasses import dataclass, field
from dataclass_csv import DataclassReader

@dataclass
class User:
    firstname: str
    lastname: str
    age: int = field(init=False)

def main():
    with open('users.csv') as f:
        reader = DataclassReader(f, User)
        for row in reader:
            print(row)

if __name__ == '__main__':
    main()
Traceback (most recent call last):
  File "app.py", line 25, in <module>
    main()
  File "app.py", line 20, in main
    for row in reader:
  File "/home/daniel/Envs/test-oGAe6Cfg/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 161, in __next__
    return self._process_row(row)
  File "/home/daniel/Envs/test-oGAe6Cfg/lib/python3.7/site-packages/dataclass_csv/dataclass_reader.py", line 157, in _process_row
    return self.cls(*values)
TypeError: __init__() takes 3 positional arguments but 4 were given
When setting init=False in a `dataclass` field, the field should not be passed to the `dataclass` initializer.
OS: Debian 4.9.130-2
Python version: 3.7.1
dataclass-csv version: 1.0.1
It would be great if, given a `List[SomeDataclass]`, it were possible to write back out a series of dataclasses into their corresponding CSV representation. This would allow someone to read a CSV into dataclasses, modify the data in a type-safe way, and then write it back out to CSV.
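A minimal sketch of such a writer built on the stdlib `csv` module and `dataclasses.astuple()`; `write_dataclasses` and its signature are assumptions for illustration, not the library's API:

```python
import csv
import dataclasses
import io

@dataclasses.dataclass
class User:
    firstname: str
    age: int

def write_dataclasses(csv_file, items, cls):
    """Write dataclass instances as CSV: a header row built from the field
    names, then one row per instance via dataclasses.astuple()."""
    writer = csv.writer(csv_file)
    writer.writerow([f.name for f in dataclasses.fields(cls)])
    for item in items:
        writer.writerow(dataclasses.astuple(item))

buf = io.StringIO()
write_dataclasses(buf, [User("Elsa", 11), User("Astor", 7)], User)
print(buf.getvalue())
```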
This generalises #17.
Currently any class not special-cased by the library must have a constructor that takes a single string.
Even for built-in types this leads to workarounds like the following.
class Time(datetime.time):
    def __new__(cls, value: str):
        hour, minute = value.split(":")
        return super().__new__(cls, int(hour), int(minute))
I think supporting transformation functions would add much-needed flexibility.
One possible syntax might be this.
def strptime(value: str):
    hour, minute = value.split(":")
    return datetime.time(int(hour), int(minute))

reader.map('time').using(strptime)  # With or without .to()
dataclass-csv 1.1.2 .docx
Example for French documentation
Support various string representations of `bool` values.
I'd like to use `dataclass-csv` as part of a data pipeline which imports data from a CSV intermediate format into PostgreSQL. Currently any non-null boolean value is parsed as `True`, which prevents me from using this nice package.
def _parse_bool_value(self, bool_value):
    """Parses a `str` representation of a boolean value to a `bool`. Case-insensitive.

    Values parsed as True: 'yes', 'true', 't', 'y', '1'
    Values parsed as False: 'no', 'false', 'f', 'n', '0'
    All other values (including empty/NULL) are parsed as None.
    """
    return str2bool(bool_value)
We can use this in `_process_row()`. This leverages a very lightweight package called `str2bool`, and I put a little wrapper around it just so we can have a unit test around this parsing in case we decide to change the implementation later.
If you'd rather not add this dependency, we could re-implement `str2bool` pretty easily.
I've got this ready to roll locally, so feel free to assign this one to me and I can get a PR submitted shortly.
Btw, I'm really liking this package and dataclasses
in general and would be glad to help out more in the future. Thanks!
Since CSV data is often used without changing it, it would be great if dataclass-csv supported frozen dataclasses. This way, users could enjoy the convenience of dataclass-csv and the additional safety of immutable data.
When just exploring lots of different csv files, it would be very handy to have an optional way to scan a csv file and generate a dataclass that would work with it.
It could start off just reading row headers and using them to come up with legal attribute names.
Next, automatically generating attribute types by analyzing the contents of each column would also be handy.
If that functionality could be provided as module functions callable from other software, it could eventually be possible to start playing with csv files that you run across and have the module automatically read a csv in and provide access to the rows as objects which could be introspected.
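A sketch of such a generator: sanitize headers into legal attribute names, guess column types from a sample of rows, and build the class with `dataclasses.make_dataclass`. `dataclass_from_csv` and `infer_type` are hypothetical helper names:

```python
import csv
import dataclasses
import io
import keyword
import re

def infer_type(values):
    """Guess a column type from sample values: try int, then float, else str."""
    for cast in (int, float):
        try:
            for v in values:
                cast(v)
            return cast
        except ValueError:
            continue
    return str

def dataclass_from_csv(name, csv_file, sample=100):
    """Generate a dataclass from a CSV header plus a sample of its rows."""
    rows = list(csv.reader(csv_file))
    header, body = rows[0], rows[1:sample + 1]
    fields = []
    for i, column in enumerate(header):
        attr = re.sub(r"\W+", "_", column.strip().lower())
        if not attr.isidentifier() or keyword.iskeyword(attr):
            attr = f"col_{i}"  # fall back to a positional name
        fields.append((attr, infer_type([row[i] for row in body])))
    return dataclasses.make_dataclass(name, fields)

data = io.StringIO("First Name,Age\nJean,77\nJim,22\n")
Person = dataclass_from_csv("Person", data)
print([(f.name, f.type.__name__) for f in dataclasses.fields(Person)])
# [('first_name', 'str'), ('age', 'int')]
```

This only handles flat, well-formed files; duplicate headers and ragged rows would need extra care.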
The type annotations for the DataclassReader are incorrect:
dataclass-csv/dataclass_csv/dataclass_reader.py
Lines 48 to 59 in 872ce02
dataclass-csv/dataclass_csv/dataclass_reader.pyi
Lines 5 to 16 in 872ce02
Even though `*args` and `**kwds` are technically a `List` and a `Dict` respectively, they are not typed the same way. See the mypy docs or PEP 484. Changing both to `Any` solves this problem.
Mypy output when calling the DataclassReader Ctor:
reader = DataclassReader(
    lines,
    MyDataclass,
    delimiter="+",
    # ^ Argument "delimiter" to "DataclassReader" has incompatible type "str"; expected "Dict[str, Any]"
)