
List index out of range + unparseable UTF8 chars about toml HOT 3 CLOSED

Warchant commented on June 12, 2024

List index out of range + unparseable UTF8 chars


Comments (3)

JamesParrott commented on June 12, 2024

A question though - if a TOML file contains non-standard (UTF-8) spaces (such as a zero-width space), should TOML parsing succeed or fail?

Now it fails.

If I understand it and recall it correctly:

String values must always be quoted, so the file ought to be parsed as long as non-standard white space is quoted (all else being well).

In TOML 1.0.0, no white space of any kind is allowed in a bare (unquoted) key.

unquoted-key = 1*( ALPHA / DIGIT / %x2D / %x5F ) ; A-Z / a-z / 0-9 / - / _

So if the file contains unquoted non-standard whitespace, the correct behaviour of a "strict-mode" TOML 1.0.0 parser is to raise an error. But I think one of the test suites lets the tester choose to allow things like this to still be parsed.

toml = expression *( newline expression )

expression =  ws [ comment ]
expression =/ ws keyval ws [ comment ]
expression =/ ws table ws [ comment ]

;; Whitespace

ws = *wschar
wschar =  %x20  ; Space
wschar =/ %x09  ; Horizontal tab

https://github.com/toml-lang/toml/blob/8eae5e1c005bc5836098505f85a7aa06568999dd/toml.abnf#L18C1-L28C33
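Since the grammar only treats space and tab as whitespace, a quick pre-scan can show where anything else is hiding in a file. The sketch below is just one way to do that: `find_suspect_whitespace` and the category list are my own choices, not part of any TOML library, and it deliberately does not parse TOML, so it will also flag characters inside quoted strings where they are legal.

```python
import unicodedata

ALLOWED_WS = {" ", "\t"}  # wschar in the TOML 1.0.0 grammar
# Assumption: separators (Zs, Zl, Zp) and invisible format characters (Cf)
# are the ones worth flagging; adjust the set to taste.
SUSPECT_CATEGORIES = {"Zs", "Zl", "Zp", "Cf"}


def find_suspect_whitespace(text: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character name) for each suspicious character, 1-based."""
    hits = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ch in ALLOWED_WS:
                continue
            if unicodedata.category(ch) in SUSPECT_CATEGORIES:
                name = unicodedata.name(ch, f"U+{ord(ch):04X}")
                hits.append((line_no, col_no, name))
    return hits


# Made-up sample: a zero-width space in a bare key and a no-break space before "=".
sample = "title = 1\nna\u200bme = 2\nkey\u00a0= 3\n"
for line_no, col_no, name in find_suspect_whitespace(sample):
    print(f"line {line_no}, column {col_no}: {name}")
```

A report like this at least tells you which line to look at before the parser gets involved.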

But TOML is still a language under active development. In the latest WIP, even emoji could be legal in bare keys. I'm not familiar enough with ABNF notation or Unicode ranges to say for sure what the ranges below contain, but I believe the intention was still to exclude any type of white space from bare keys.

;; Unquoted key

unquoted-key = 1*unquoted-key-char
unquoted-key-char = ALPHA / DIGIT / %x2D / %x5F         ; a-z A-Z 0-9 - _
unquoted-key-char =/ %xB2 / %xB3 / %xB9 / %xBC-BE       ; superscript digits, fractions
unquoted-key-char =/ %xC0-D6 / %xD8-F6 / %xF8-37D       ; non-symbol chars in Latin block
unquoted-key-char =/ %x37F-1FFF                         ; exclude GREEK QUESTION MARK, which is basically a semi-colon
unquoted-key-char =/ %x200C-200D / %x203F-2040          ; from General Punctuation Block, include the two tie symbols and ZWNJ, ZWJ
unquoted-key-char =/ %x2070-218F / %x2460-24FF          ; include super-/subscripts, letterlike/numberlike forms, enclosed alphanumerics
unquoted-key-char =/ %x2C00-2FEF / %x3001-D7FF          ; skip arrows, math, box drawing etc, skip 2FF0-3000 ideographic up/down markers and spaces
unquoted-key-char =/ %xF900-FDCF / %xFDF0-FFFD          ; skip D800-DFFF surrogate block, E000-F8FF Private Use area, FDD0-FDEF intended for process-internal use (unicode)
unquoted-key-char =/ %x10000-EFFFF                      ; all chars outside BMP range, excluding Private Use planes (F0000-10FFFF)

toml-lang/toml#891
https://github.com/toml-lang/toml/blob/23c3fb79f3f54ebc01110b963d7119006d91facc/toml.abnf#L55
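For anyone else unsure what those ranges contain, they can be checked mechanically. This is a rough sketch that transcribes the WIP ranges quoted above into Python; the function names are mine, and the result is only as good as my transcription and as current as that WIP commit.

```python
# Inclusive code point ranges transcribed from the WIP ABNF quoted above.
# ALPHA and DIGIT are expanded to their ASCII ranges.
WIP_BARE_KEY_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0x30, 0x39), (0x2D, 0x2D), (0x5F, 0x5F),
    (0xB2, 0xB3), (0xB9, 0xB9), (0xBC, 0xBE),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x37D),
    (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x203F, 0x2040),
    (0x2070, 0x218F), (0x2460, 0x24FF),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]


def allowed_in_wip_bare_key(ch: str) -> bool:
    """True if the single character falls inside one of the WIP ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in WIP_BARE_KEY_RANGES)


def whitespace_admitted_by_wip_ranges() -> list[str]:
    """List code points that Python's str.isspace() treats as whitespace
    but that the ranges above would still admit in a bare key."""
    return [
        f"U+{cp:04X}"
        for lo, hi in WIP_BARE_KEY_RANGES
        for cp in range(lo, hi + 1)
        if chr(cp).isspace()
    ]


for label, ch in [("ZERO WIDTH SPACE", "\u200b"),
                  ("ZERO WIDTH NON-JOINER", "\u200c"),
                  ("GRINNING FACE emoji", "\U0001F600")]:
    print(f"{label}: allowed = {allowed_in_wip_bare_key(ch)}")

print("whitespace admitted:", whitespace_admitted_by_wip_ranges())
```

The last line is a direct way to test the belief that no whitespace is admitted; what it reports depends on the Unicode data shipped with your Python version.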


JamesParrott commented on June 12, 2024

Well done enumerating all the possibilities. I hope the devs deem fit to address it and give this one (and all the others) the attention they deserve.

In the meantime, while you wait for a fix, there are plenty of other great options. Don't feel that you too need to fork your own TOML reader and writer library like I did.

Why don't you formalise your findings, and add a test to: https://github.com/uiri/toml/blob/master/tests/test_api.py as a PR?

You'll face the same problem I did: you'll submit a PR containing a test that fails, so the CI pipeline fails.

However, that is because of two underlying problems: firstly, the code in toml is broken, and secondly, the existing test masks this problem, so it is also broken.

When code is broken, it should fail a test. Writing a test that fails is step 1 in Test Driven Development.

Just as a lack of tests does not imply the code is correct, passing every test does not imply it either when the only test covering that area is itself broken.
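As a rough illustration of what such a test could check, the sketch below treats "rejects bad input with the library's own decode error" as the pass condition, so a leaked `IndexError` counts as a failure. Everything in it is hypothetical: `rejects_cleanly`, `DecodeError`, `buggy_loads` and `fixed_loads` are toy stand-ins, not code from toml. For the real library you would pass `toml.loads` and `toml.TomlDecodeError`, together with the input that actually triggers the bug.

```python
from typing import Callable


def rejects_cleanly(loads: Callable[[str], object],
                    decode_error: type[BaseException],
                    document: str) -> bool:
    """Return True only if loads(document) raises decode_error.

    Accepting the document, or leaking any other exception such as
    IndexError, counts as a failure.
    """
    try:
        loads(document)
    except decode_error:
        return True
    except Exception:
        return False
    return False


# Hypothetical stand-ins, used only to show that the check bites.
class DecodeError(ValueError):
    pass


def buggy_loads(document: str) -> dict:
    parts = document.split("=")
    return {parts[0].strip(): parts[1].strip()}  # IndexError when "=" is missing


def fixed_loads(document: str) -> dict:
    parts = document.split("=")
    if len(parts) != 2:
        raise DecodeError("expected exactly one '='")
    return {parts[0].strip(): parts[1].strip()}


bad_input = "no equals sign here"
print("buggy parser rejects cleanly:", rejects_cleanly(buggy_loads, DecodeError, bad_input))
print("fixed parser rejects cleanly:", rejects_cleanly(fixed_loads, DecodeError, bad_input))
```

A test built on a check like this fails against the broken code and passes once it is fixed, which is the order Test Driven Development asks for.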


Warchant commented on June 12, 2024

A question though - if a TOML file contains non-standard (UTF-8) spaces (such as a zero-width space), should TOML parsing succeed or fail?

Now it fails.
