Hi. I just noticed that cantools will default to UTF-8 if the user d

Encoding defaults to UTF-8 for dbc. Vector uses CP1252. about cantools HOT 12 CLOSED

cantools commented on June 7, 2024

Encoding defaults to UTF-8 for dbc. Vector uses CP1252.

from cantools.

Comments (12)

eerimoq commented on June 7, 2024

Sounds like a good idea, but how should it be implemented? Choose encoding based on the file extension?

juleq commented on June 7, 2024

I would say:
Encoding defaults to None.

If encoding is None, select the default encoding based on the type argument. If that is None too, select the encoding based on the file name extension. If that does not help either, fall back to UTF-8.

The selections are in a dict [type, default_encoding].

The described selection process could be a local function named e.g. guess_encoding(filename, type).
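
A minimal sketch of the selection order described above. The names `DEFAULT_ENCODINGS`, `database_format` and the exact signature are assumptions for illustration, not the code that ended up in cantools, and the table only lists the two defaults mentioned in this thread.

```python
import os

# Hypothetical lookup table: database type -> default encoding.
# Only the formats discussed in this thread are listed.
DEFAULT_ENCODINGS = {
    'dbc': 'cp1252',  # CANdb++ (Vector)
    'kcd': 'utf-8',
}


def guess_encoding(filename, database_format=None, encoding=None):
    """Pick an encoding: explicit argument, then type, then extension."""
    # 1. An explicitly given encoding always wins.
    if encoding is not None:
        return encoding

    # 2. Otherwise use the default for the given database type, if known.
    if database_format is not None:
        by_type = DEFAULT_ENCODINGS.get(database_format.lower())
        if by_type is not None:
            return by_type

    # 3. Otherwise use the file name extension, falling back to UTF-8.
    extension = os.path.splitext(filename)[1].lstrip('.').lower()
    return DEFAULT_ENCODINGS.get(extension, 'utf-8')


print(guess_encoding('vehicle.dbc'))
print(guess_encoding('vehicle.txt', database_format='dbc'))
print(guess_encoding('vehicle.unknown'))
```

Keeping the defaults in one dict means a new format only needs one extra entry.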

eerimoq commented on June 7, 2024

What if the user wants to pass encoding as None to open() to use the platform-dependent encoding?

https://docs.python.org/3/library/functions.html#open

Maybe add a special default encoding, like 'auto', which does what you suggested.

juleq commented on June 7, 2024

I would suggest using a special option 'platform', because I would argue that selecting the encoding based on the platform is the edgiest of cases. The file formats are associated with specific tools that use one specific encoding. If I deviate from that, I should know what I am doing and make that explicit by passing the respective arguments to cantools.

I know that this strategy is different from the one that open() uses, but open() is general purpose. That's not necessarily the most convenient, even though it is the most consistent with Python. But still, this is cantools :-).

eerimoq commented on June 7, 2024

All I know is that file encoding is much harder to get right than one can imagine. There is always some use case you don't think about. When I'm in the unknown I tend to implement as few restrictions as possible in the API. An additional platform argument might work, but it would be nice if it's not needed.

I totally agree that DBC-files should have the same default encoding as CANdb++, we just have to figure out how to do it in a good way =)

eerimoq commented on June 7, 2024

Btw, how do you know that CP1252 is the default encoding?

juleq commented on June 7, 2024

At first I assumed it. Then I noticed that the canmatrix project uses iso-8859-1 in one of its examples. So I verified it by creating a dbc with a € char in it. I read that with an editor in 1252 mode and it came out fine.

juleq commented on June 7, 2024

8859 and 1252 are basically the same, but M$ replaced a few control chars with printables like €.
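
The difference can be checked with Python's standard codecs. This is only an illustration of the point above; the helper `can_encode` is a made-up name.

```python
def can_encode(text, encoding):
    """Return True if text can be represented in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# In CP1252 the euro sign is the single byte 0x80.
euro_cp1252 = '€'.encode('cp1252')

# ISO 8859-1 maps the same byte to U+0080, a C1 control character.
euro_as_latin1 = euro_cp1252.decode('iso-8859-1')

print(euro_cp1252)
print(repr(euro_as_latin1))
print(can_encode('€', 'cp1252'), can_encode('€', 'iso-8859-1'))
```

Bytes outside the 0x80 to 0x9F range decode identically in both encodings, which is why reading a dbc as iso-8859-1 often appears to work.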

eerimoq commented on June 7, 2024

Let's implement it as you first suggested. If someone wants to use the platform encoding they can always use load() instead of load_file().

eerimoq commented on June 7, 2024

I implemented the suggested behavior on master, not yet released. Please give it a try. Consult the documentation for details.

juleq commented on June 7, 2024

I have updated to master and removed all the arguments to load_file. Works like a charm. It even fixes an issue I had in my early days with cantools: some clever customer worked around regular quotes not being allowed in the comment field of CANdb++ by using fancy quotes, which also happen to be in the character range that CP1252 adds to the ISO charset. Since I did not get the encoding right back then, I got broken dbc files when saving with cantools (CANdb++ would refuse to open them).

By the way, the last potential hurdle for this use case is that a cantools user needs to pass the correct encoding to write() when saving the db string. A wrapper save_file that uses the appropriate encoding could work around that. Otherwise I would expect to find an increasing amount of dbc files written with the wrong encoding in the wild.
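
Such a wrapper could look roughly like this. `save_file` and its signature are hypothetical, nothing like it is confirmed to exist in cantools, and the CP1252 default is assumed from the dbc discussion above.

```python
import os
import tempfile


def save_file(db_string, filename, encoding='cp1252'):
    """Write a database string to disk with an explicit encoding.

    Hypothetical helper. newline='' writes the line endings exactly
    as they appear in db_string, without platform translation.
    """
    with open(filename, 'w', encoding=encoding, newline='') as fout:
        fout.write(db_string)


# Small demo with a made-up comment line containing non-ASCII characters.
demo_text = 'CM_ "Limit 20 °C, price 5 €, “fancy quotes”";\r\n'
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.dbc')
save_file(demo_text, demo_path)

with open(demo_path, 'rb') as fin:
    demo_bytes = fin.read()

print(demo_bytes)
```

With a loaded database this would presumably be called as `save_file(db.as_dbc_string(), 'out.dbc')`.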

I also verified that e.g. a degC survives dbc to kcd translation (the latter being written in UTF-8).

The sym default encoding seems also fine, I have checked one of the Peak tools and that uses UTF8 with BOM.
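
Both checks can be reproduced at the byte level. In this sketch plain byte strings stand in for real dbc, kcd and sym files, and the sample text is made up.

```python
import codecs

sample = 'Unit: °C'

# dbc side: CP1252 stores the degree sign as the single byte 0xB0.
dbc_bytes = sample.encode('cp1252')

# Decode with the right encoding, then re-encode as UTF-8 for kcd.
kcd_bytes = dbc_bytes.decode('cp1252').encode('utf-8')

# sym side: UTF-8 with a leading byte order mark (BOM).
sym_bytes = codecs.BOM_UTF8 + sample.encode('utf-8')

# 'utf-8-sig' strips the BOM, plain 'utf-8' keeps it as U+FEFF.
sym_text = sym_bytes.decode('utf-8-sig')
sym_text_with_bom = sym_bytes.decode('utf-8')

print(dbc_bytes, kcd_bytes)
print(repr(sym_text), repr(sym_text_with_bom))
```

Decoding a BOM-prefixed file with plain `utf-8` leaves an invisible U+FEFF at the start of the text, which can confuse a parser that expects a keyword first.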

Great. Thanks.

eerimoq commented on June 7, 2024

Great that it works!

Yeah, feel free to add a dump (or write) function.
