parsers.jl's People

Contributors

amgad-naiem, ararslan, donm, drvi, juliatagbot, klausc, kristofferc, liozou, nalimilan, nickrobinson251, nsajko, oscardssmith, quinnj, rafaqz, timholy

parsers.jl's Issues

BoundsError for test that parses incorrect UUID

      From worker 3:    │ === EXCEPTION SUMMARY ===
      From worker 3:    │ 
      From worker 3:    │ BoundsError: attempt to access 37-element view(::Vector{UInt8}, 122:158) with eltype UInt8 at index [38]
      From worker 3:    │  [1] throw_boundserror(A::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int32}}, true}, I::Tuple{Int64})
      From worker 3:    │    @ Base ./abstractarray.jl:703
      From worker 3:    │ 
      From worker 3:    │ ===========================
      From worker 3:    │ 
      From worker 3:    │ Original Error message:
      From worker 3:    │ 
      From worker 3:    │ ERROR: BoundsError: attempt to access 37-element view(::Vector{UInt8}, 122:158) with eltype UInt8 at index [38]
      From worker 3:    │ Stacktrace:
      From worker 3:    │   [1] throw_boundserror(A::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int32}}, true}, I::Tuple{Int64})
      From worker 3:    │     @ Base ./abstractarray.jl:703
      From worker 3:    │   [2] checkbounds
      From worker 3:    │     @ ./abstractarray.jl:668 [inlined]
      From worker 3:    │   [3] getindex
      From worker 3:    │     @ ./subarray.jl:314 [inlined]
      From worker 3:    │   [4] peekbyte
      From worker 3:    │     @ ~/.julia/packages/Parsers/mPY37/src/utils.jl:356 [inlined]
      From worker 3:    │   [5] typeparser(#unused#::Type{Base.UUID}, source::SubArray{UInt8, 1, Vector{UInt8}, Tuple{UnitRange{Int32}}, true}, pos::Int64, len::Int64, b::UInt8, code::Int16, pl::Parsers.PosLen, options::Parsers.Options)
      From worker 3:    │     @ Parsers ~/.julia/packages/Parsers/mPY37/src/hexadecimal.jl:94
      From worker 3:    │   [6] #23
      From worker 3:    │     @ ~/.julia/packages/Parsers/mPY37/src/components.jl:382 [inlined]
      From worker 3:    │   [7] checkforsentinel
      From worker 3:    │     @ ~/.julia/packages/Parsers/mPY37/src/components.jl:244 [inlined]
      From worker 3:    │   [8] stripwhitespace
      From worker 3:    │     @ ~/.julia/packages/Parsers/mPY37/src/components.jl:89 [inlined]
      From worker 3:    │   [9] findquoted
      From worker 3:    │     @ ~/.julia/packages/Parsers/mPY37/src/components.jl:220 [inlined]
      From worker 3:    │  [10] stripwhitespace
      From worker 3:    │     @ ~/.julia/packages/Parsers/mPY37/src/components.jl:89 [inlined]
      From worker 3:    │  [11] _finddelimiter
      From worker 3:    │     @ ~/.julia/packages/Parsers/mPY37/src/components.jl:368 [inlined]
      From worker 3:    │  [12] checkemptysentinel
      From worker 3:    │     @ ~/.julia/packages/Parsers/mPY37/src/components.jl:40 [inlined]
      From worker 3:    │  [13] #6
      From worker 3:    │     @ ~/.julia/packages/Parsers/mPY37/src/components.jl:13 [inlined]
      From worker 3:    │  [14] _xparse
      From worker 3:    │     @ ~/.julia/packages/Parsers/mPY37/src/Parsers.jl:343 [inlined]

"." is still parsed as 0.0

Looks like #50 didn't fix this completely

julia> CSV.read("/Users/andreasnoack/Desktop/testdot.csv")
1×2 DataFrames.DataFrame
│ Row │ a       │ b      │
│     │ Float64 │ String │
├─────┼─────────┼────────┤
│ 1   │ 0.0     │ .      │

julia> readlines("/Users/andreasnoack/Desktop/testdot.csv")
2-element Array{String,1}:
 "a, b"
 "., ."

Order of arguments

The order of arguments is the opposite of Base.parse. Types are usually the first argument to a function. Is there a reason for choosing the non-standard order here?

unsupported or misplaced expression error from match!

I am investigating issue JuliaData/CSV.jl#359, which I think comes from Parsers.jl. When running tests on both the released version and master of Parsers.jl on Julia master (currently v"1.1.0-DEV.794"), I get many errors of the form:

Misc: Error During Test at /home/tamas/.julia/dev/Parsers/test/runtests.jl:596
  Got exception outside of a @test
  unsupported or misplaced expression error
  Stacktrace:
   [1] top-level scope
   [2] match! at /home/tamas/.julia/dev/Parsers/src/tries.jl:137 [inlined]
   [3] #parse!#25 at /home/tamas/.julia/dev/Parsers/src/Parsers.jl:588 [inlined]
   [4] parse! at /home/tamas/.julia/dev/Parsers/src/Parsers.jl:578 [inlined]
   [5] #parse!#23 at /home/tamas/.julia/dev/Parsers/src/Parsers.jl:508 [inlined]
   [6] parse! at /home/tamas/.julia/dev/Parsers/src/Parsers.jl:483 [inlined]
   [7] #parse!#22 at /home/tamas/.julia/dev/Parsers/src/Parsers.jl:384 [inlined]
   [8] parse! at /home/tamas/.julia/dev/Parsers/src/Parsers.jl:384 [inlined]
   [9] #parse#11 at /home/tamas/.julia/dev/Parsers/src/Parsers.jl:305 [inlined]
   [10] parse(::Parsers.Delimited{false,false,Parsers.Quoted{Parsers.Sentinel{typeof(Parsers.defaultparser),Parsers.Trie{0x00,false,missing,2,Tuple{Parsers.Trie{0x6e,false,missing,2,Tuple{Parsers.Trie{0x75,false,missing,2,Tuple{Parsers.Trie{0x6c,false,missing,2,Tuple{Parsers.Trie{0x6c,true,missing,2,Tuple{}}}}}}}}}}}},Parsers.Trie{0x00,false,missing,8,Tuple{Parsers.Trie{0x2c,true,missing,8,Tuple{}}}}}, ::Base.GenericIOBuffer{Array{UInt8,1}}, ::Type{Int64}) at /home/tamas/.julia/dev/Parsers/src/Parsers.jl:305
   [11] top-level scope at /home/tamas/.julia/dev/Parsers/test/runtests.jl:634
   [12] top-level scope at /home/tamas/src/julia-git/usr/share/julia/stdlib/v1.1/Test/src/Test.jl:1083
   [13] top-level scope at /home/tamas/.julia/dev/Parsers/test/runtests.jl:598
   [14] top-level scope at /home/tamas/src/julia-git/usr/share/julia/stdlib/v1.1/Test/src/Test.jl:1083
   [15] top-level scope at /home/tamas/.julia/dev/Parsers/test/runtests.jl:7
   [16] include at ./boot.jl:317 [inlined]
   [17] include_relative(::Module, ::String) at ./loading.jl:1038
   [18] include(::Module, ::String) at ./sysimg.jl:29
   [19] include(::String) at ./client.jl:403
   [20] top-level scope at none:0
   [21] eval(::Module, ::Any) at ./boot.jl:319
   [22] exec_options(::Base.JLOptions) at ./client.jl:243
   [23] _start() at ./client.jl:436

Julia 1.0.2 works fine. Manifest.toml below.

crash when parsing subnormal floats with compiled package

Please let me know if this issue belongs in PackageCompiler.jl instead of here.

This issue seems to be related to @n8xm's #83 issue, so I'll be using test values from there.

I originally had this crash in a compiled package using JSON.jl, but the issue seems to come from Parsers.jl

Consider the following module Foo:

module Foo

using Parsers

julia_main()::Cint = main()

function main()
    try
        str_arg = ARGS[1]
        @info "Parsing \"$str_arg\" -> $(Parsers.parse(Float64,str_arg))"
    catch
        Base.invokelatest(Base.display_error, Base.catch_stack())
        return 1
    end
    return 0
end

end

With

  • Ubuntu 20.04.2 LTS
  • Julia v1.6.1
  • Parsers v2.0.3
  • PackageCompiler v1.3.0
Test ✅
$ ./Foo_compiled/bin/Foo "9.3494547075363499E-311"
[ Info: Parsing "9.3494547075363499E-311" -> 0.0
Test ❌
$ ./Foo_compiled/bin/Foo "9.349454707536349E-311"
realloc(): invalid pointer

signal (6): Aborted
in expression starting at none:0
gsignal at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7f01ef4ce3ed)
unknown function (ip: 0x7f01ef4d647b)
realloc at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
__gmpz_realloc at /home/user/.julia/dev/Foo_compiled/bin/../lib/julia/libgmp.so (unknown line)
__gmpz_import at /home/user/.julia/dev/Foo_compiled/bin/../lib/julia/libgmp.so (unknown line)
_widen at /home/user/.julia/packages/Parsers/wEs2o/src/floats.jl:13 [inlined]
_scale at /home/user/.julia/packages/Parsers/wEs2o/src/floats.jl:445 [inlined]
_scale at /home/user/.julia/packages/Parsers/wEs2o/src/floats.jl:431 [inlined]
scale at /home/user/.julia/packages/Parsers/wEs2o/src/floats.jl:392 [inlined]
parseexp at /home/user/.julia/packages/Parsers/wEs2o/src/floats.jl:334 [inlined]
parsefrac at /home/user/.julia/packages/Parsers/wEs2o/src/floats.jl:309 [inlined]
parsedigits at /home/user/.julia/packages/Parsers/wEs2o/src/floats.jl:242 [inlined]
typeparser at /home/user/.julia/packages/Parsers/wEs2o/src/floats.jl:170 [inlined]
xparse2 at /home/user/.julia/packages/Parsers/wEs2o/src/Parsers.jl:691 [inlined]
xparse2 at /home/user/.julia/packages/Parsers/wEs2o/src/Parsers.jl:671 [inlined]
parse at /home/user/.julia/packages/Parsers/wEs2o/src/Parsers.jl:170
parse at /home/user/.julia/packages/Parsers/wEs2o/src/Parsers.jl:170 [inlined]
macro expansion at ./logging.jl:340 [inlined]
main at /home/user/.julia/dev/Foo/src/Foo.jl:11
julia_main at /home/user/.julia/dev/Foo/src/Foo.jl:5 [inlined]
julia_main at ./none:36
unknown function (ip: 0x7f01e2949b5c)
_jl_invoke at /buildworker/worker/package_linux64/build/src/gf.c:2237 [inlined]
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2419
julia_main at /home/user/.julia/dev/Foo_compiled/bin/Foo.so (unknown line)
main at ./Foo_compiled/bin/Foo (unknown line)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_start at ./Foo_compiled/bin/Foo (unknown line)
Allocations: 2285413 (Pool: 2284512; Big: 901); GC: 3
Aborted (core dumped)

Please let me know if you need additional details.

Unclear documentation for `getstring`

Here is a part of the docs for getstring

If the actual parsed String is needed, however, you can pass your source and the res.val::PosLen to Parsers.getstring to get the actual parsed String value.

But getstring takes three arguments, the last of which is called e. I don't see any description of e in the docstring or of how it should be used. Looking at the tests, it seems that 0x00 is passed in.

Parsing floating point numbers starting with a point

Parsing floating point numbers with a leading point fails:

julia> Parsers.parse(".24409E+03", Float64)
ERROR: Parsers.Error (INVALID: EOF):
initial value parsing failed, reached EOF
failed to parse Float64, encountered: '\0'

At the same time, Base.parse has no problem with this format:
julia> Base.parse(Float64, ".24409E+03")
244.09

`stripquoted` doesn't know to strip quoted whitespace when space is the delimiter

julia> str = "{hey, there }"
"{hey, there }"

expected behaviour, with delim=',' (and default wh1, wh2):

julia> res = Parsers.xparse(String, str; openquotechar='{', closequotechar='}', stripquoted=true, delim=',')
Parsers.Result{PosLen}(37, 13, PosLen(0x000000000020000a))

julia> Parsers.getstring(str, res.val, 0x22)
"hey, there"

unexpected behaviour, with delim=' ' (and wh1 changed, since it's required):

julia> res = Parsers.xparse(String, str; openquotechar='{', closequotechar='}', stripquoted=true, delim=' ', wh1=0x00)
Parsers.Result{PosLen}(37, 13, PosLen(0x000000000020000b))

julia> Parsers.getstring(str, res.val, 0x22)  # has trailing whitespace
"hey, there " 

The trailing whitespace in the quoted string isn't stripped in the second case, because we explicitly set wh1 to a value other than ' ' (since ' ' is the delimiter). But there's no way to tell Parsers "inside quotes, treat ' ' as whitespace and not as the delimiter, just like you treat ',' as a regular character inside quotes even when comma is the delimiter".

One way to fix this would be to hardcode certain characters as always being whitespace when quoted e.g.

-                if options.stripquoted && b != options.wh1 && b != options.wh2
+                if options.stripquoted && b != options.wh1 && b != options.wh2 && b != UInt8(' ') && b != UInt8('\t')
                     lastnonwhitespacepos = pos
                 end

Parsers.jl/src/strings.jl

Lines 100 to 102 in 462fb55

if options.stripquoted && b != options.wh1 && b != options.wh2
lastnonwhitespacepos = pos
end

Bug: csv file with a char column

Trying to read a CSV file with a Char column throws an error:

df = DataFrame(a = [1,2,3], b = ['q','w','e'])
CSV.write("test.csv", df)
CSV.read("test.csv", types = [Int, Char]) # ERROR: MethodError: no method matching zero(::Type{Char})

Parsing subnormal float with too many sig figs returns zero, even if it is larger than smallest subnormal

If I use Base.parse, I get different results than if I use Parsers.parse:

julia> parse(Float64,"9.3494547075363499E-311")
9.3494547075363e-311

julia> parse(Float64,"9.349454707536349E-311")
9.3494547075363e-311

julia> using Parsers

julia> Parsers.parse(Float64,"9.3494547075363499E-311")
0.0

julia> Parsers.parse(Float64,"9.349454707536349E-311")
9.3494547075363e-311

I believe what is happening is that:

  1. We are dealing with subnormal floats
  2. If the subnormal float has "too many" sig figs

It is my understanding that subnormal floats sacrifice precision in order to allow representations that are "close" to zero. However, because 9.3494547075363499E-311 is larger than the smallest subnormal float, the behavior of Base.parse is what I would expect. I would only expect a parser to return 0.0 if the number being parsed is smaller than the smallest subnormal float.

By the way, Python's built-in parser behaves like Base.parse and not Parsers.parse. Using Python 3.9.6:

>>> float("9.3494547075363499E-311")
9.3494547075363e-311
>>> float("9.349454707536349E-311")
9.3494547075363e-311

Similarly, C's atof behaves like Base.parse and not Parsers.parse:

printf("%e\n", atof("9.3494547075363499E-311"));
printf("%e\n", atof("9.349454707536349E-311"));

The above C source produces the output:

9.349455e-311
9.349455e-311

Is there a good reason why Parsers.parse behaves differently from Base.parse in this case?
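The claim that the input is subnormal yet still well above the smallest representable value can be checked with Base alone. The snippet below is a minimal sketch that uses only Base functions (`parse`, `nextfloat`, `floatmin`, `issubnormal`); it does not call Parsers.jl and says nothing about where the discrepancy originates.

```julia
# Parse the value from the report with Base.parse (not Parsers.parse).
x = parse(Float64, "9.3494547075363499E-311")

# Smallest positive subnormal Float64 and smallest positive normal Float64.
smallest_subnormal = nextfloat(0.0)
smallest_normal = floatmin(Float64)

println("smallest subnormal: ", smallest_subnormal)
println("smallest normal:    ", smallest_normal)

# The parsed value lies strictly between the two, so it is subnormal
# and should not be rounded to zero.
println("is subnormal:       ", issubnormal(x))
println("above the minimum:  ", smallest_subnormal < x < smallest_normal)
```

A parser that returns 0.0 for this input is therefore discarding a representable, nonzero value.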

Parsers.jl version 2.5 appears to break OpenML.jl

OpenML depends on ARFFFiles.jl, which in turn depends on Parsers.jl

No error is thrown below if I pin Parsers to 2.4:

julia> OpenML.load(42638)
ERROR: Invalid nominal "\"C85\"" in column 'cabin' of row 2, expecting one of "B45", "E31", "B57 B59 B63 B66", "B36", "A21", "C78", "D34", "D19", "A9", "D15", "D56", "C103", "C123", "C31", "C23 C25 C27", "F G63", "B61", "C53", "D43", "C130", "C132", "C101", "C55 C57", "B71", "C46", "C116", "F", "A29", "G6", "C6", "C28", "C51", "E46", "C54", "C97", "D22", "B10", "F4", "E45", "E52", "D30", "B58 B60", "E34", "C62 C64", "A11", "B11", "C80", "F33", "C85", "D37", "C86", "D21", "C89", "F E46", "A34", "D", "B26", "C22 C26", "B69", "C32", "B78", "F E57", "F2", "A18", "C106", "B51 B53 B55", "D10 D12", "E60", "E50", "E39 E41", "B52 B54 B56", "C39", "B24", "D28", "B41", "C7", "D40", "D38", "C105", "A6", "D33", "B30", "C52", "B28", "C83", "F G73", "A5", "D26", "C110", "E101", "F E69", "D47", "B86", "C2", "E33", "B19", "A7", "C49", "A32", "B4", "B80", "A31", "D36", "C93", "D35", "C87", "B77", "E67", "B94", "C125", "C99", "C118", "D7", "A19", "B49", "C65", "E36", "B18", "C124", "C91", "E40", "T", "C128", "B35", "C82", "B96 B98", "E10", "E44", "C104", "C111", "C92", "E38", "E12", "E63", "A14", "B37", "C30", "D20", "B79", "E25", "D46", "B73", "C95", "B38", "B39", "B22", "C70", "A16", "C68", "A10", "E68", "A20", "D50", "D9", "A23", "B50", "A26", "D48", "E58", "C126", "D49", "B5", "B20", "E24", "C90", "C45", "E8", "B101", "D45", "E121", "D11", "E77", "F38", "B3", "D6", "B82 B84", "D17", "A36", "B102", "E49", "C47", "E17", "A24", "C50", "B42" or "C148"
Stacktrace:
  [1] error(s::String)
    @ Base ./error.jl:35
  [2] _readcolumns_readdatum
    @ ~/.julia/packages/ARFFFiles/o3ClW/src/ARFFFiles.jl:965 [inlined]
  [3] _readcolumns_readdatum
    @ ~/.julia/packages/ARFFFiles/o3ClW/src/ARFFFiles.jl:900 [inlined]
  [4] readcolumns(r::ARFFFiles.ARFFReader{IOStream}; opts_sq::Parsers.Options, opts_dq::Parsers.Options, date_opts_sq::Vector{Parsers.Options}, date_opts_dq::Vector{Parsers.Options}, maxbytes::Nothing, chunkbytes::Int64)
    @ ARFFFiles ~/.julia/packages/ARFFFiles/o3ClW/src/ARFFFiles.jl:847
  [5] #3
    @ ~/.julia/packages/OpenML/dTbTl/src/data.jl:87 [inlined]
  [6] load(::OpenML.var"#3#5"{Nothing}, ::String; opts::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ ARFFFiles ~/.julia/packages/ARFFFiles/o3ClW/src/ARFFFiles.jl:577
  [7] load
    @ ~/.julia/packages/ARFFFiles/o3ClW/src/ARFFFiles.jl:574 [inlined]
  [8] load(id::Int64; maxbytes::Nothing)
    @ OpenML ~/.julia/packages/OpenML/dTbTl/src/data.jl:87
  [9] load(id::Int64)
    @ OpenML ~/.julia/packages/OpenML/dTbTl/src/data.jl:69
 [10] top-level scope
    @ REPL[7]:1

Here I'm running in an env that has only OpenML 0.3.0 in it.

In both my working and failing environments, the version of ARFFFiles.jl is the same, namely 1.4.1.

Parsers reads out of bounds in `checkdelim!`

❯ julia --check-bounds=yes --compiled-modules=no
julia> using Parsers

julia> Parsers.checkdelim!(codeunits("::::"), 1, 4, Parsers.Options(delim = "::", ignorerepeated = true)) == 5
ERROR: BoundsError: attempt to access 4-codeunit String at index [5]
Stacktrace:
 [1] checkbounds
   @ ./strings/basic.jl:216 [inlined]
 [2] codeunit
   @ ./strings/string.jl:117 [inlined]
 [3] getindex
   @ ./strings/basic.jl:756 [inlined]
 [4] peekbyte
   @ ~/.julia/packages/Parsers/gi2J3/src/utils.jl:345 [inlined]
 [5] checkdelim!(source::Base.CodeUnits{UInt8, String}, pos::Int64, len::Int64, options::Parsers.Options)
   @ Parsers ~/.julia/packages/Parsers/gi2J3/src/Parsers.jl:375
 [6] top-level scope
   @ REPL[2]:1

(running with --compiled-modules=no to make sure no precompile file with bounds checking turned off is cached)
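For comparison, here is a minimal sketch of a bounds-safe loop for skipping repeated multi-byte delimiters. `skip_repeated_delims` is a hypothetical helper written for illustration, not the `checkdelim!` implementation in Parsers.jl; it only shows the idea of verifying that a whole delimiter fits within `len` before any byte is read.

```julia
# Hypothetical helper: advance `pos` past every consecutive occurrence of
# `delim`, never reading beyond index `len` of `source`.
function skip_repeated_delims(source::AbstractVector{UInt8}, pos::Int, len::Int,
                              delim::AbstractVector{UInt8})
    n = length(delim)
    n == 0 && return pos
    # Check that the full delimiter fits before comparing any bytes.
    while pos + n - 1 <= len && view(source, pos:pos+n-1) == delim
        pos += n
    end
    return pos
end

# The input from the report: two "::" delimiters back to back.
println(skip_repeated_delims(codeunits("::::"), 1, 4, codeunits("::")))
```

With the input from the report, the loop consumes both delimiters and stops at position 5 without touching index 5.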

Int64 parsing error

julia> using Parsers
julia> Parsers.parse(Int64, "-9223372036854775807")
9223372036854775807
julia> Parsers.parse(Int64, "-9223372036854775807\t")
9223372036854775807

Obviously the sign is missing. This was described in slack.

I will submit a fix in a new PR immediately.
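For reference, one common way to keep the sign correct near the Int64 limits is to accumulate digits as a negative number and negate at the end. The sketch below assumes nothing about the actual fix or about Parsers.jl internals; `parse_int64_sketch` is a hypothetical helper that uses only `Base.Checked` arithmetic.

```julia
# Hypothetical helper: parse a decimal Int64, accumulating negatively so that
# typemin(Int64) is representable and overflow is detected rather than wrapped.
function parse_int64_sketch(s::AbstractString)
    bytes = codeunits(s)
    isempty(bytes) && throw(ArgumentError("empty input"))
    i = 1
    neg = bytes[1] == UInt8('-')
    if neg || bytes[1] == UInt8('+')
        i += 1
    end
    i > length(bytes) && throw(ArgumentError("no digits in input"))
    acc = Int64(0)
    while i <= length(bytes)
        b = bytes[i]
        UInt8('0') <= b <= UInt8('9') || throw(ArgumentError("invalid digit"))
        digit = Int64(b - UInt8('0'))
        # acc stays <= 0 throughout; both steps throw OverflowError on overflow.
        acc = Base.Checked.checked_sub(Base.Checked.checked_mul(acc, Int64(10)), digit)
        i += 1
    end
    # Negating typemin(Int64) for a positive input throws OverflowError.
    return neg ? acc : Base.Checked.checked_neg(acc)
end

println(parse_int64_sketch("-9223372036854775807"))
println(parse_int64_sketch("-9223372036854775807") == Base.parse(Int64, "-9223372036854775807"))
```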

Use quotes to disambiguate empty and missing strings

For quoted strings, I'd expect that two consecutive quotes represent an empty string (a known value), but I get a missing value instead:

julia> Parsers.xparse(String, "\"\",", 1, 3, Parsers.Options(quoted=true, sentinel=missing)).val.missingvalue
true

Having this would be helpful for CSV parsers that want to differentiate between unknown/missing strings and empty strings (as Postgres does, for example):

"a","","c"   # middle field holds an empty string
"a",,"c"     # missing value in the middle field
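The requested distinction can be sketched in plain Julia. `classify_field` is a hypothetical helper, not part of Parsers.jl, and the naive `split` on commas used here ignores delimiters inside quotes and escaped quotes; it only illustrates the intended semantics for the two lines above.

```julia
# Hypothetical helper: an empty unquoted field is `missing`, while a quoted
# field (even an empty one) is a known String value.
function classify_field(field::AbstractString)
    isempty(field) && return missing
    if length(field) >= 2 && startswith(field, "\"") && endswith(field, "\"")
        return String(chop(field; head=1, tail=1))  # drop the surrounding quotes
    end
    return String(field)
end

# Naive split: assumes no commas inside quoted fields.
parse_line(line::AbstractString) = [classify_field(f) for f in split(line, ',')]

println(parse_line("\"a\",\"\",\"c\""))  # middle field is an empty string
println(parse_line("\"a\",,\"c\""))      # middle field is missing
```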

parse(Date, io::IO) always returns Date(1)

When the input is an IO object, the Date parser always returns Date(1).

julia> using Dates

julia> Parsers.parse(Date, "2020-07-30")
2020-07-30

julia> Parsers.parse(Date, IOBuffer("2020-07-30"))
0001-01-01

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

Float64 parsing error

From JuliaData/CSV.jl#397

Base.parse works correctly but Parsers.parse fails:

julia> Base.parse(Float64, "74810199.033988851037472901827191090834")
7.481019903398885e7

julia> Parsers.parse(Float64, "74810199.033988851037472901827191090834")
ERROR: InexactError: check_top_bit(Int64, -3)
Stacktrace:
 [1] throw_inexacterror(::Symbol, ::Any, ::Int64) at ./boot.jl:583
 [2] check_top_bit at ./boot.jl:597 [inlined]
 [3] toUInt64 at ./boot.jl:708 [inlined]
 [4] Type at ./boot.jl:738 [inlined]
 [5] convert at ./number.jl:7 [inlined]
 [6] cconvert at ./essentials.jl:355 [inlined]
 [7] mul_2exp! at ./gmp.jl:146 [inlined]
 [8] mul_2exp! at ./gmp.jl:148 [inlined]
 [9] scale at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/floats.jl:96 [inlined]
 [10] scale at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/floats.jl:114 [inlined]
 [11] #_defaultparser#46(::UInt8, ::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Parsers.StringBuffer, ::Parsers.Result{Float64}, ::Type{Int128}) at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/floats.jl:196
 [12] (::getfield(Parsers, Symbol("#kw##_defaultparser")))(::NamedTuple{(:decimal,),Tuple{UInt8}}, ::typeof(Parsers._defaultparser), ::Parsers.StringBuffer, ::Parsers.Result{Float64}, ::Type{Int128}) at ./none:0
 [13] #_defaultparser#46 at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/floats.jl:194 [inlined]
 [14] _defaultparser at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/floats.jl:126 [inlined]
 [15] #defaultparser#45 at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/floats.jl:123 [inlined]
 [16] defaultparser at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/floats.jl:123 [inlined]
 [17] #parse!#13 at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/Parsers.jl:349 [inlined]
 [18] parse! at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/Parsers.jl:349 [inlined]
 [19] #parse#15 at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/Parsers.jl:351 [inlined]
 [20] parse at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/Parsers.jl:351 [inlined]
 [21] #parse#16(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::Function, ::Type{Float64}, ::String) at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/Parsers.jl:362
 [22] parse(::Type{Float64}, ::String) at /Users/tomkwong/.julia/packages/Parsers/v5u2B/src/Parsers.jl:361

Type-piracy of `Tuple{Ptr{UInt8},Int}`

function Base.hash(x::Tuple{Ptr{UInt8},Int}, h::UInt)
h += Base.memhash_seed
ccall(Base.memhash, UInt, (Ptr{UInt8}, Csize_t, UInt32), x[1], x[2], h % UInt32) + h
end
Base.isequal(x::Tuple{Ptr{UInt8}, Int}, y::String) = hash(x) === hash(y)
Base.convert(::Type{String}, x::Tuple{Ptr{UInt8}, Int}) = unsafe_string(x[1], x[2])

Maybe you want to try and upstream those definitions into Base?
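Another way to avoid the piracy, short of upstreaming, is to wrap the pointer and length in a package-owned type and define the methods on that type instead. The sketch below assumes a hypothetical `PtrLen` struct; it is not what Parsers.jl does, and for simplicity it hashes by materializing a String rather than calling `Base.memhash` directly.

```julia
# Hypothetical package-owned wrapper, so the methods below extend Base
# functions only for a type this code defines (no type piracy).
struct PtrLen
    ptr::Ptr{UInt8}
    len::Int
end

Base.String(x::PtrLen) = unsafe_string(x.ptr, x.len)
Base.convert(::Type{String}, x::PtrLen) = String(x)
# Hash like the equivalent String so both can share a Dict or Set.
Base.hash(x::PtrLen, h::UInt) = hash(String(x), h)
Base.isequal(x::PtrLen, y::String) = String(x) == y
Base.isequal(x::String, y::PtrLen) = isequal(y, x)

s = "hello"
# Keep `s` alive while its raw pointer is in use.
GC.@preserve s begin
    p = PtrLen(pointer(s), sizeof(s))
    println(convert(String, p))
    println(isequal(p, "hello"))
    println(hash(p) == hash("hello"))
end
```

The allocation in `hash` is a deliberate simplification; a real implementation would likely hash the bytes in place.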

Parse no longer works with \t

I am new to Julia and still figuring things out but it seems that the recent changes have broken parse for tab delimited text (under Julia 1.1):

julia> Parsers.parse(Float64,"12.34\t5.0",Parsers.Options(delim='\t'))

ERROR: Parsers.Error (INVALID: OK | EOF | INVALID_DELIMITER ):
initial value parsing succeeded, reached EOF, invalid delimiter
attempted to parse Float64 from: "12.34\t5.0"
Stacktrace:
[1] #parse#4(::Int64, ::Int64, ::Function, ::Type{Float64}, ::String, ::Parsers.Options{false,false,false,Nothing,UInt8,Nothing}) at C:\Users\smury.julia\packages\Parsers\bkn21\src\Parsers.jl:109
[2] parse(::Type{Float64}, ::String, ::Parsers.Options{false,false,false,Nothing,UInt8,Nothing}) at C:\Users\smury.julia\packages\Parsers\bkn21\src\Parsers.jl:108
[3] top-level scope at none:0

but
julia> Parsers.parse(Float64,"12.34,5.0",Parsers.Options(delim=','))

12.34

Error parsing Date

Hi there,

I'm parsing some data that has been written out from SAS.

Date("25JUL1985", "dduuuyyyy") works, but
Parsers.parse(Date, "25JUL1985", Parsers.Options(dateformat="dduuuyyyy")) doesn't.

Any ideas?

Relying on `Parsers.xparse` behavior when a parse fails

I am working with strings interspersed with letters. I'd like to parse numbers up to either the next letter or the end of the string. I can do this with xparse:

julia> Parsers.xparse(Int, "1234Q", 1, 5)
(1234, -32607, 1, 5, 5)

If a parse fails, the first tuple element x holds the result I'm looking for. However, it is not documented whether this is intentional behavior (that the x is correct up to the moment of the parse failing).

I'm curious – is this a sanctioned use of Parsers.xparse? The docs state that "x is a value of type T, even if parsing does not succeed" but do not say that the answer is guaranteed to be correct.

If it is guaranteed, I'd be happy to submit a docs PR. Thanks for your work on this package!
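The second tuple element in the output above is a bit-flag return code. As a small aid for reading it, the sketch below decomposes the value -32607 into its set bits using Base only. Which flag name each bit corresponds to is defined by Parsers.jl and is deliberately not assumed here.

```julia
# Return code copied from the xparse output above.
code = Int16(-32607)

# View the same 16 bits as unsigned to inspect individual flags.
bits = reinterpret(UInt16, code)
set_bits = [i for i in 0:15 if (bits >> i) & 0x0001 == 0x0001]

println("code as hex:   ", string(bits, base=16))
println("set bits:      ", set_bits)
println("negative code: ", code < 0)
```

The top bit (bit 15) is set, which is why the code prints as a negative Int16.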

option to ignore whitespace stripping within quotes

Hey @quinnj,

I'm rewriting our data loading at the moment, migrating to Parsers.jl.

My request is more or less the opposite of #106: for CSV parsing, it would be great to have an option that lets us strip whitespace around unquoted fields but leave it untouched within quotes.

For example, a CSV

A, B   ,  C,D     
      "hello", "good day"     ,      "   same same     "   ,   whatever         

should ideally parse into ["A", "B", "C", "D"] for the first line and

["hello", "good day", "   same same     ", "whatever"]

for the second.

Would it be straightforward to add that as an option?
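The requested behavior can be sketched in plain Julia. `strip_outside_quotes` and `parse_fields` are hypothetical helpers, not Parsers.jl options, and the naive `split` on commas assumes no delimiters or escaped quotes inside quoted fields.

```julia
# Hypothetical helper: strip whitespace around a field, but keep whatever
# whitespace sits inside a pair of surrounding double quotes.
function strip_outside_quotes(field::AbstractString)
    trimmed = strip(field)
    if length(trimmed) >= 2 && startswith(trimmed, "\"") && endswith(trimmed, "\"")
        return String(chop(trimmed; head=1, tail=1))  # interior left untouched
    end
    return String(trimmed)
end

# Naive split: assumes no commas inside quoted fields.
parse_fields(line::AbstractString) = [strip_outside_quotes(f) for f in split(line, ',')]

header = "A, B   ,  C,D     "
row = "      \"hello\", \"good day\"     ,      \"   same same     \"   ,   whatever         "

println(parse_fields(header))
println(parse_fields(row))
```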

Re-enable JET

In #153, we removed JET from the tests because it was erroring even on supported Julia versions, and because it was generally difficult to keep the nightly tests green: JET relies on various implementation details that sometimes change between Julia nightly versions. We should re-enable it in some form in the future, perhaps as an independent CI action.

Parsing custom types is more geared towards numeric types

This originated from rofinn/FilePathsBase.jl#100

I'm using CSV.File to read a csv file where some of the columns contain file paths. So I figured I'd do this:

types = Dict(:name => typeof(Path()))
CSV.File(file, types = types)

But I'm getting this error:

ERROR: MethodError: no method matching zero(::Type{FilePathsBase.PosixPath})
Closest candidates are:
  zero(::Type{LibGit2.GitHash}) at /build/julia/src/julia-1.5.0/usr/share/julia/stdlib/v1.5/LibGit2/src/oid.jl:220
  zero(::Type{Missing}) at missing.jl:103
  zero(::Type{Dates.Date}) at /build/julia/src/julia-1.5.0/usr/share/julia/stdlib/v1.5/Dates/src/types.jl:405
  ...
Stacktrace:
 [1] xparse at /home/yakir/.julia/packages/Parsers/DAskp/src/Parsers.jl:752 [inlined]
 [2] parsevalue!(::Type{FilePathsBase.PosixPath}, ::UInt8, ::SentinelArrays.SentinelArray{FilePathsBase.PosixPath,1,UndefInitializer,Missing,Array{FilePathsBase.PosixPath,1}}, ::Array{AbstractArray{T,1} where T,1}, ::Array{UInt8,1}, ::Int64, ::Int64, ::Parsers.Options{false,false,true,false,Missing,UInt8,Nothing}, ::Int64, ::Int64, ::Int64, ::Array{Type,1}, ::Array{UInt8,1}) at /home/yakir/.julia/packages/CSV/MKemC/src/file.jl:914
 [3] macro expansion at /home/yakir/.julia/packages/CSV/MKemC/src/file.jl:634 [inlined]
 [4] parsecustom! at /home/yakir/.julia/packages/CSV/MKemC/src/file.jl:624 [inlined]
 [5] parserow at /home/yakir/.julia/packages/CSV/MKemC/src/file.jl:683 [inlined]
 [6] parsefilechunk!(::Val{false}, ::Int64, ::Dict{Type,Type}, ::Array{AbstractArray{T,1} where T,1}, ::Array{UInt8,1}, ::Int64, ::Int64, ::Int64, ::Array{Int64,1}, ::Float64, ::Array{CSV.RefPool,1}, ::Int64, ::Int64, ::Array{Type,1}, ::Array{UInt8,1}, ::Bool, ::Parsers.Options{false,false,true,false,Missing,UInt8,Nothing}, ::Nothing, ::Type{Tuple{Tuple{SentinelArrays.SentinelArray{FilePathsBase.PosixPath,1,UndefInitializer,Missing,Array{FilePathsBase.PosixPath,1}},FilePathsBase.PosixPath},Tuple{SentinelArrays.SentinelArray{FilePathsBase.PosixPath,1,UndefInitializer,Missing,Array{FilePathsBase.PosixPath,1}},FilePathsBase.PosixPath},Tuple{SentinelArrays.SentinelArray{FilePathsBase.PosixPath,1,UndefInitializer,Missing,Array{FilePathsBase.PosixPath,1}},FilePathsBase.PosixPath}}}) at /home/yakir/.julia/packages/CSV/MKemC/src/file.jl:557
 [7] CSV.File(::CSV.Header{false,Parsers.Options{false,false,true,false,Missing,UInt8,Nothing},Array{UInt8,1}}; startingbyteposition::Nothing, endingbyteposition::Nothing, limit::Nothing, threaded::Nothing, typemap::Dict{Type,Type}, tasks::Int64, debug::Bool) at /home/yakir/.julia/packages/CSV/MKemC/src/file.jl:265
 [8] CSV.File(::String; header::Int64, normalizenames::Bool, datarow::Int64, skipto::Nothing, footerskip::Int64, transpose::Bool, comment::Nothing, use_mmap::Nothing, ignoreemptylines::Bool, select::Nothing, drop::Nothing, missingstrings::Array{String,1}, missingstring::String, delim::Nothing, ignorerepeated::Bool, quotechar::Char, openquotechar::Nothing, closequotechar::Nothing, escapechar::Char, dateformat::Nothing, dateformats::Nothing, decimal::UInt8, truestrings::Array{String,1}, falsestrings::Array{String,1}, type::Nothing, types::Dict{Symbol,DataType}, typemap::Dict{Type,Type}, categorical::Nothing, pool::Float64, lazystrings::Bool, strict::Bool, silencewarnings::Bool, debug::Bool, parsingdebug::Bool, kw::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/yakir/.julia/packages/CSV/MKemC/src/file.jl:217
 [9] loadcsv(::String) at /home/yakir/MAT2db.jl/src/MAT2db.jl:31
 [10] process_csv(::String) at /home/yakir/MAT2db.jl/src/MAT2db.jl:43
 [11] top-level scope at ./timing.jl:174 [inlined]
 [12] top-level scope at ./REPL[4]:0
 [13] run_repl(::REPL.AbstractREPL, ::Any) at /build/julia/src/julia-1.5.0/usr/share/julia/stdlib/v1.5/REPL/src/REPL.jl:288

Resolve precompilation issues

Parsers.jl is a large source of TTFX (time to first X) in the Julia ecosystem.

The previous precompilation PR (#108), intended to resolve this, was reverted due to bugs in Base precompilation on Windows (JuliaData/CSV.jl#994), so the problem remains.

This issue is to track the problem explicitly. My main questions are:

  1. What are the potential fixes to Base? Is there an issue for tracking these precompilation segfaults?
  2. Can we precompile here for Linux/macOS, as the segfaults are only on Windows? (This also has the lowest effort-to-reward ratio.)
  3. Can we rewrite methods here to remove the compilation costs?
  4. What can we do to help this along?

Two major sources seem to be the Union type fields in Options, and the size of the Vector constants being inlined everywhere to set float precision. Moving these to non-inlined methods for the first saves a lot of compilation time (easy); using a stable struct for options with boolean check values instead of runtime isa helps the second.

InlineStrings.jl tests fail on Parsers.jl `main`

Using the unreleased version of Parsers.jl post-#127 the following test cases from InlineStrings.jl fail (these are cut down from the InlineStrings.jl tests, to see the failures in the context of the full testset see https://github.com/JuliaData/Parsers.jl/actions/runs/3300589688/jobs/5445241946).

I think the first of these might be considered a bug in Parsers.jl; the rest I really don't know. They may all be fine, but I think they are worth reviewing before we make a new release (to decide whether it should be marked breaking and/or how to update InlineStrings.jl).

# test/fails.jl

using InlineStrings, Test, Parsers
using Parsers: SENTINEL, OK, EOF, OVERFLOW, QUOTED, DELIMITED, INVALID_DELIMITER, INVALID_QUOTED_FIELD, ESCAPED_STRING, NEWLINE, SUCCESS

@testset begin
    testcases = [
       # Failure due to parsing to a different value!
      ("\"a", InlineString7(), NamedTuple(), OK | QUOTED | INVALID_QUOTED_FIELD | EOF), # invalid quoted
      
      # Failure due to added ESCAPED_STRING code
      ("\"\\", InlineString7(), (; escapechar=UInt8('\\')), OK | QUOTED | INVALID_QUOTED_FIELD | EOF), # \\ e, invalid quoted
      
      # Failure due to added OK code
      ("NA", InlineString7(), (; sentinel=["NA"]), EOF | SENTINEL), # sentinel
            
      # Failures due to no EOF code
      ("\"\",", InlineString7(), NamedTuple(), OK | QUOTED | EOF | DELIMITED), # same e & cq
      ("\"a\",", InlineString7("a"), NamedTuple(), OK | QUOTED | EOF | DELIMITED), # quoted
      ("a,", InlineString7("a"), NamedTuple(), OK | EOF | DELIMITED),
      ("a__", InlineString7("a"), (; delim="__"), OK | EOF | DELIMITED),
      ("a,", InlineString7("a"), (; ignorerepeated=true), OK | EOF | DELIMITED),
      ("a__", InlineString7("a"), (; delim="__", ignorerepeated=true), OK | EOF | DELIMITED),
  ]
    for (i, case) in enumerate(testcases)
        println("\n---")
        println("testing case = $i")
        buf, check, opts, checkcode = case
        res = Parsers.xparse(InlineString7, buf; opts...)
        @show buf

        if !(check== res.val)
            @show check
            @show res.val
        end
        @test check === res.val

        if !(checkcode == res.code)
            @show Parsers.codes(checkcode)
            @show Parsers.codes(res.code)
        end
        @test checkcode == res.code
    end
end

julia> include("test/fails.jl")

---
testing case = 1
buf = "\"a"
check = ""
res.val = "a"
test set: Test Failed at /Users/nickr/repos/InlineStrings.jl/test/fails.jl:34
  Expression: check === res.val
   Evaluated: "" === "a"
Stacktrace:
 [1] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
 [2] macro expansion
   @ ~/repos/InlineStrings.jl/test/fails.jl:34 [inlined]
 [3] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
 [4] top-level scope
   @ ~/repos/InlineStrings.jl/test/fails.jl:7

---
testing case = 2
buf = "\"\\"
Parsers.codes(checkcode) = "INVALID: OK | QUOTED | EOF | INVALID_QUOTED_FIELD "
Parsers.codes(res.code) = "INVALID: OK | QUOTED | ESCAPED_STRING | EOF | INVALID_QUOTED_FIELD "
test set: Test Failed at /Users/nickr/repos/InlineStrings.jl/test/fails.jl:40
  Expression: checkcode == res.code
   Evaluated: -32667 == -32155
Stacktrace:
 [1] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
 [2] macro expansion
   @ ~/repos/InlineStrings.jl/test/fails.jl:40 [inlined]
 [3] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
 [4] top-level scope
   @ ~/repos/InlineStrings.jl/test/fails.jl:7

---
testing case = 3
buf = "NA"
Parsers.codes(checkcode) = "SUCCESS: SENTINEL | EOF "
Parsers.codes(res.code) = "SUCCESS: OK | SENTINEL | EOF "
test set: Test Failed at /Users/nickr/repos/InlineStrings.jl/test/fails.jl:40
  Expression: checkcode == res.code
   Evaluated: 34 == 35
Stacktrace:
 [1] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
 [2] macro expansion
   @ ~/repos/InlineStrings.jl/test/fails.jl:40 [inlined]
 [3] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
 [4] top-level scope
   @ ~/repos/InlineStrings.jl/test/fails.jl:7

---
testing case = 4
buf = "\"\","
Parsers.codes(checkcode) = "SUCCESS: OK | QUOTED | DELIMITED | EOF "
Parsers.codes(res.code) = "SUCCESS: OK | QUOTED | DELIMITED "
test set: Test Failed at /Users/nickr/repos/InlineStrings.jl/test/fails.jl:40
  Expression: checkcode == res.code
   Evaluated: 45 == 13
Stacktrace:
 [1] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
 [2] macro expansion
   @ ~/repos/InlineStrings.jl/test/fails.jl:40 [inlined]
 [3] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
 [4] top-level scope
   @ ~/repos/InlineStrings.jl/test/fails.jl:7

---
testing case = 5
buf = "\"a\","
Parsers.codes(checkcode) = "SUCCESS: OK | QUOTED | DELIMITED | EOF "
Parsers.codes(res.code) = "SUCCESS: OK | QUOTED | DELIMITED "
test set: Test Failed at /Users/nickr/repos/InlineStrings.jl/test/fails.jl:40
  Expression: checkcode == res.code
   Evaluated: 45 == 13
Stacktrace:
 [1] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
 [2] macro expansion
   @ ~/repos/InlineStrings.jl/test/fails.jl:40 [inlined]
 [3] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
 [4] top-level scope
   @ ~/repos/InlineStrings.jl/test/fails.jl:7

---
testing case = 6
buf = "a,"
Parsers.codes(checkcode) = "SUCCESS: OK | DELIMITED | EOF "
Parsers.codes(res.code) = "SUCCESS: OK | DELIMITED "
test set: Test Failed at /Users/nickr/repos/InlineStrings.jl/test/fails.jl:40
  Expression: checkcode == res.code
   Evaluated: 41 == 9
Stacktrace:
 [1] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
 [2] macro expansion
   @ ~/repos/InlineStrings.jl/test/fails.jl:40 [inlined]
 [3] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
 [4] top-level scope
   @ ~/repos/InlineStrings.jl/test/fails.jl:7

---
testing case = 7
buf = "a__"
Parsers.codes(checkcode) = "SUCCESS: OK | DELIMITED | EOF "
Parsers.codes(res.code) = "SUCCESS: OK | DELIMITED "
test set: Test Failed at /Users/nickr/repos/InlineStrings.jl/test/fails.jl:40
  Expression: checkcode == res.code
   Evaluated: 41 == 9
Stacktrace:
 [1] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
 [2] macro expansion
   @ ~/repos/InlineStrings.jl/test/fails.jl:40 [inlined]
 [3] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
 [4] top-level scope
   @ ~/repos/InlineStrings.jl/test/fails.jl:7

---
testing case = 8
buf = "a,"
Parsers.codes(checkcode) = "SUCCESS: OK | DELIMITED | EOF "
Parsers.codes(res.code) = "SUCCESS: OK | DELIMITED "
test set: Test Failed at /Users/nickr/repos/InlineStrings.jl/test/fails.jl:40
  Expression: checkcode == res.code
   Evaluated: 41 == 9
Stacktrace:
 [1] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
 [2] macro expansion
   @ ~/repos/InlineStrings.jl/test/fails.jl:40 [inlined]
 [3] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
 [4] top-level scope
   @ ~/repos/InlineStrings.jl/test/fails.jl:7

---
testing case = 9
buf = "a__"
Parsers.codes(checkcode) = "SUCCESS: OK | DELIMITED | EOF "
Parsers.codes(res.code) = "SUCCESS: OK | DELIMITED "
test set: Test Failed at /Users/nickr/repos/InlineStrings.jl/test/fails.jl:40
  Expression: checkcode == res.code
   Evaluated: 41 == 9
Stacktrace:
 [1] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:464 [inlined]
 [2] macro expansion
   @ ~/repos/InlineStrings.jl/test/fails.jl:40 [inlined]
 [3] macro expansion
   @ /Applications/Julia-1.8.app/Contents/Resources/julia/share/julia/stdlib/v1.8/Test/src/Test.jl:1357 [inlined]
 [4] top-level scope
   @ ~/repos/InlineStrings.jl/test/fails.jl:7
Test Summary: | Pass  Fail  Total  Time
test set      |    9     9     18  0.0s
ERROR: LoadError: Some tests did not pass: 9 passed, 9 failed, 0 errored, 0 broken.
in expression starting at /Users/nickr/repos/InlineStrings.jl/test/fails.jl:6
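
The integer comparisons in the failures above are easier to read once the codes are split into flags. The sketch below is an assumption-laden helper, not part of Parsers.jl: the bit values are inferred only from the integers printed in this output (for example, 35 versus 34 differ by OK, and 45 versus 13 differ by EOF), and the names `FLAGS`, `decode`, and `flagdiff` are hypothetical.

```julia
# Bit values inferred from the integers printed in the output above;
# treat them as an assumption, not as Parsers.jl's definitions.
const FLAGS = [
    "OK" => Int16(1),
    "SENTINEL" => Int16(2),
    "QUOTED" => Int16(4),
    "DELIMITED" => Int16(8),
    "EOF" => Int16(32),
    "INVALID_QUOTED_FIELD" => Int16(64),
    "ESCAPED_STRING" => Int16(512),
]

# Names of the flags set in `code`, in the order listed in FLAGS.
decode(code::Integer) = [name for (name, bit) in FLAGS if (Int16(code) & bit) != 0]

# Flags that are set in exactly one of the two codes.
flagdiff(expected::Integer, actual::Integer) = decode(xor(Int16(expected), Int16(actual)))
```

With these values, `flagdiff(-32667, -32155)` isolates ESCAPED_STRING (case 2), `flagdiff(34, 35)` isolates OK (case 3), and `flagdiff(45, 13)` isolates EOF (cases 4 and 5).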

Is it slower to parse smaller Int types?

I tried changing some code so that it parses integer data into the smallest Integer type that will fit it, rather than Int64 (I know the integers will be small, e.g. if I know they'll be single-digit integers 0-9, I set it to use Int8, and so on).

And I saw the parsing time increase compared to parsing them as Int64...

Is it expected that parsing integer data into Int will be faster than into smaller Integer types?

This tiny benchmark seems to match what I see:

julia> using Parsers, BenchmarkTools

julia> buf = Vector{UInt8}("123");

julia> pos, len =  1, 3;

julia> opts = Parsers.Options()
Parsers.Options([""], nothing, false, false, 0x20, 0x09, false, 0x22, 0x22, 0x22, nothing, 0x2e, nothing, nothing, nothing, nothing)

julia> @btime Parsers.xparse(Int64, buf, pos, len, opts);
  149.694 ns (1 allocation: 32 bytes)

julia> @btime Parsers.xparse(Int32, buf, pos, len, opts);
  155.153 ns (1 allocation: 32 bytes)

julia> @btime Parsers.xparse(Int16, buf, pos, len, opts);
  150.374 ns (1 allocation: 32 bytes)

julia> @btime Parsers.xparse(Int8, buf, pos, len, opts);
  152.745 ns (1 allocation: 32 bytes)
  
(jl_41hlU2) pkg> st
      Status `/private/var/folders/hx/1h0bbkfd18d4n1qrnwmrl4j00000gn/T/jl_41hlU2/Project.toml`
  [6e4b80f9] BenchmarkTools v1.2.0
  [69de0a69] Parsers v2.0.3

And it seems to scale up similarly, e.g. if using CSV.jl to read in some data that's all integers:

julia> using CSV, BenchmarkTools

julia> @btime CSV.File("ints.csv",  types=Int64, delim=',');
  160.968 μs (378 allocations: 22.48 KiB)

julia> @btime CSV.File("ints.csv",  types=Int32, delim=',');
  178.616 μs (342 allocations: 20.89 KiB)

julia> @btime CSV.File("ints.csv",  types=Int16, delim=',');
  173.254 μs (342 allocations: 19.62 KiB)

julia> @btime CSV.File("ints.csv",  types=Int8, delim=',');
  170.279 μs (342 allocations: 18.92 KiB)

(jl_zrE6yO) pkg> st
      Status `/private/var/folders/hx/1h0bbkfd18d4n1qrnwmrl4j00000gn/T/jl_zrE6yO/Project.toml`
  [6e4b80f9] BenchmarkTools v1.2.0
  [336ed68f] CSV v0.9.3

ints.csv


julia> versioninfo()
Julia Version 1.6.2
Commit 1b93d53fc4 (2021-07-14 15:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.7.0)
  CPU: Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
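
For intuition only, here is a toy digit-accumulation loop that is generic over the target integer type. It is not Parsers.jl's implementation and `parse_digits` is a hypothetical name. The loop does one multiply, one add, and one overflow check per input byte for every `T`, so a narrower target type does not reduce the amount of work; whether that explains the timings above is not established here.

```julia
# Toy parser, not Parsers.jl's implementation: accumulate ASCII digits into T
# using checked arithmetic, returning `nothing` on overflow or a non-digit byte.
function parse_digits(::Type{T}, buf::AbstractVector{UInt8}) where {T<:Integer}
    isempty(buf) && return nothing
    x = zero(T)
    for b in buf
        UInt8('0') <= b <= UInt8('9') || return nothing
        x, f1 = Base.Checked.mul_with_overflow(x, T(10))
        x, f2 = Base.Checked.add_with_overflow(x, T(b - UInt8('0')))
        (f1 || f2) && return nothing
    end
    return x
end
```

For example, `parse_digits(Int8, codeunits("123"))` fits, while `"128"` overflows Int8 and yields `nothing`.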

Delete the master branch?

I think 3 times recently I have gotten into a confused state where I have added the package master branch (instead of the current default branch) or checked it out locally in git. Then things are super weird for a while until I realize I am on the wrong branch. Perhaps the old default branch should just be removed?

`stripwhitespace` can leave trailing whitespace in quoted string

With stripwhitespace = true (xref #105), I would expect quoted strings to have both leading and trailing whitespace stripped, but when the quoted string is followed by irrelevant characters, the trailing whitespace is left:

Setup:

julia> using Parsers  # v2.2.0

julia> using InlineStrings  # InlineStrings here just so easier to see the result than with String

julia> opts = Parsers.Options(
         stripwhitespace=true,
         quoted=true,
         openquotechar='\'',
         closequotechar='\'',
         sentinel=missing,
         delim=',',
       );

Works as expected (gives ABC):

julia> buf = b"'ABC '";

julia> res = Parsers.xparse(InlineString7, buf, 1, length(buf), opts)
Parsers.Result{String7}(37, 6, "ABC")

With arbitrary trailing characters after the quoted string, the trailing whitespace is left (gives "ABC "):

julia> buf = b"'ABC ' **";

julia> res = Parsers.xparse(InlineString7, buf, 1, length(buf), opts)
Parsers.Result{String7}(37, 9, "ABC ")
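
The behaviour expected in both cases can be written down as a toy model. This is a sketch of the expectation only, not of how Parsers.jl parses quoted fields, and `quoted_stripped` is a hypothetical helper: it takes the text between the first pair of quote bytes and strips it, ignoring whatever follows the closing quote.

```julia
# Toy model of the expected behaviour, not Parsers.jl's implementation.
# Returns the text between the first pair of `q` quote bytes with surrounding
# whitespace removed, ignoring anything after the closing quote.
function quoted_stripped(buf::AbstractVector{UInt8}; q::UInt8=UInt8('\''))
    i = findfirst(==(q), buf)
    i === nothing && return nothing
    j = findnext(==(q), buf, i + 1)
    j === nothing && return nothing
    return String(strip(String(buf[i+1:j-1])))
end
```

Under this model, `b"'ABC '"` and `b"'ABC ' **"` both give "ABC".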

--

context is nickrobinson251/PowerFlowData.jl#44 / nickrobinson251/PowerFlowData.jl#62

quarto build broken on Parsers v2.5.7

Building a website with quarto failed under Parsers v2.5.7 with the following error message:

An error occurred while executing the following cell:
------------------
using CSV
using DataFrames
csv_path = joinpath("working_directory", "data", "scope_0.csv")
df = CSV.read(csv_path, DataFrame, header = [1,2])
------------------
MethodError: no method matching iterate(::Parsers.Token)
Closest candidates are:
  iterate(::Union{LinRange, StepRangeLen}) at /usr/local/julia/share/julia/base/range.jl:826
  iterate(::Union{LinRange, StepRangeLen}, ::Integer) at /usr/local/julia/share/julia/base/range.jl:826
  iterate(::T) where T<:Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}} at /usr/local/julia/share/julia/base/dict.jl:695
  ...
Stacktrace:
  [1] indexed_iterate(I::Parsers.Token, i::Int64)
    @ Base ./tuple.jl:92
  [2] checkcommentandemptyline(buf::Vector{UInt8}, pos::Int64, len::Int64, cmt::Any, ignoreemptyrows::Bool, nlines::Base.RefValue{Int64})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/detection.jl:276
  [3] skiptorow(buf::Vector{UInt8}, pos::Int64, len::Int64, oq::Parsers.Token, eq::UInt8, cq::Parsers.Token, cmt::Any, ignoreemptyrows::Bool, cur::Int64, dest::Int64)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/detection.jl:191
  [4] detectcolumnnames(buf::Vector{UInt8}, headerpos::Int64, datapos::Int64, len::Int64, options::Parsers.Options, header::Any, normalizenames::Bool)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/detection.jl:178
  [5] CSV.Context(source::CSV.Arg, header::CSV.Arg, normalizenames::CSV.Arg, datarow::CSV.Arg, skipto::CSV.Arg, footerskip::CSV.Arg, transpose::CSV.Arg, comment::CSV.Arg, ignoreemptyrows::CSV.Arg, ignoreemptylines::CSV.Arg, select::CSV.Arg, drop::CSV.Arg, limit::CSV.Arg, buffer_in_memory::CSV.Arg, threaded::CSV.Arg, ntasks::CSV.Arg, tasks::CSV.Arg, rows_to_check::CSV.Arg, lines_to_check::CSV.Arg, missingstrings::CSV.Arg, missingstring::CSV.Arg, delim::CSV.Arg, ignorerepeated::CSV.Arg, quoted::CSV.Arg, quotechar::CSV.Arg, openquotechar::CSV.Arg, closequotechar::CSV.Arg, escapechar::CSV.Arg, dateformat::CSV.Arg, dateformats::CSV.Arg, decimal::CSV.Arg, truestrings::CSV.Arg, falsestrings::CSV.Arg, stripwhitespace::CSV.Arg, type::CSV.Arg, types::CSV.Arg, typemap::CSV.Arg, pool::CSV.Arg, downcast::CSV.Arg, lazystrings::CSV.Arg, stringtype::CSV.Arg, strict::CSV.Arg, silencewarnings::CSV.Arg, maxwarnings::CSV.Arg, debug::CSV.Arg, parsingdebug::CSV.Arg, validate::CSV.Arg, streaming::CSV.Arg)
    @ CSV ~/.julia/packages/CSV/jFiCn/src/context.jl:392
  [6] #File#25
    @ ~/.julia/packages/CSV/jFiCn/src/file.jl:221 [inlined]
  [7] read(source::String, sink::Type; copycols::Bool, kwargs::Base.Pairs{Symbol, Vector{Int64}, Tuple{Symbol}, NamedTuple{(:header,), Tuple{Vector{Int64}}}})
    @ CSV ~/.julia/packages/CSV/jFiCn/src/CSV.jl:91
  [8] top-level scope
    @ In[4]:5
  [9] eval
    @ ./boot.jl:373 [inlined]
 [10] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1196
LoadError: MethodError: no method matching iterate(::Parsers.Token)
Closest candidates are:
  iterate(::Union{LinRange, StepRangeLen}) at /usr/local/julia/share/julia/base/range.jl:826
  iterate(::Union{LinRange, StepRangeLen}, ::Integer) at /usr/local/julia/share/julia/base/range.jl:826
  iterate(::T) where T<:Union{Base.KeySet{<:Any, <:Dict}, Base.ValueIterator{<:Dict}} at /usr/local/julia/share/julia/base/dict.jl:695

Installing Parsers v2.3.2 fixed the failure. Pipeline failure

Reading a large CSV file with BigFloats takes a long time

Hi there, I have been redirected from Slack to post an issue here.

I have had some trouble loading a large CSV file with large values, e.g. a CSV file with 2 million observations and 4 columns of large Float64 values. BigFloat is much worse.

julia> @time df = DataFrames.DataFrame(CSV.File("/Users/jakeireland/Desktop/hamming_bound_integers_2000000.csv"))
344.206327 seconds (1.97 G allocations: 38.920 GiB, 21.82% gc time)

julia> @time df = DataFrames.DataFrame(CSV.read("/Users/jakeireland/Desktop/hamming_bound_integers_2000000.csv"))
545.099823 seconds (1.97 G allocations: 38.958 GiB, 22.85% gc time)

This is much faster in base R, and faster again using readr.

Behaviour of `xparse` when encountering invalid delimiters

I am trying to rely on xparse to correctly parse a value when I know the input contains invalid characters (i.e. an invalid delimiter). I am hoping/expecting to get the correct value and an INVALID_DELIMITER return code.

(I gather we do want it to be possible to rely on xparse in the presence of invalid delimiters, given #78).

But xparse doesn't always return the correct value (using Parsers.jl v2.0.6).
For example, when trying to parse a Float64 when there are special characters like /:

julia> using Parsers

julia> buf = codeunits("1.0 /");

julia> res = Parsers.xparse(Float64, buf, 1, length(buf), Parsers.XOPTIONS)
Parsers.Result{Float64}(-32607, 5, 2.3255508133e-314)

julia> res.val, Parsers.codes(res.code)
(2.3255508133e-314, "INVALID: OK | EOF | INVALID_DELIMITER ")

Here xparse returned the expected code (INVALID_DELIMITER), but not the correct value (expected is res.val === 1.0).


Looking at what might be happening

The internal xparse2 gives the correct value, suggesting the typeparser actually does extract the correct value (and the "incorrect" code is due to simplifications in xparse2 and doesn't matter here):

julia> res = Parsers.xparse2(Float64, str, 1, length(str), Parsers.XOPTIONS)
Parsers.Result{Float64}(-32735, 5, 1.0)

julia> res.val, Parsers.codes(res.code)
(1.0, "INVALID: OK | EOF ")

And calling typeparser directly, I see the correct value (as expected):

julia> b, code = buf[1], Parsers.SUCCESS;

julia> Parsers.typeparser(Float64, buf, 1, length(buf), b, code, Parsers.XOPTIONS)
(1.0, 1, 4)

This isn't specific to the / character or to Float64, e.g. parsing Int64s:

julia> buf = codeunits("2 _");

julia> res = Parsers.xparse(Int64, buf, 1, length(buf), Parsers.XOPTIONS)
Parsers.Result{Int64}(-32607, 3, 4738866224)

julia> res.val, Parsers.codes(res.code)
(4738866224, "INVALID: OK | EOF | INVALID_DELIMITER ")

julia> Parsers.typeparser(Int64, buf, 1, length(buf), buf[1], code, Parsers.XOPTIONS)
(2, 1, 2)

julia> buf = codeunits("3 *");

julia> res = Parsers.xparse(Int64, buf, 1, length(buf), Parsers.XOPTIONS)
Parsers.Result{Int64}(-32607, 3, 4738866224)

julia> res.val, Parsers.codes(res.code)
(4738866224, "INVALID: OK | EOF | INVALID_DELIMITER ")

julia> Parsers.typeparser(Int64, buf, 1, length(buf), buf[1], code, Parsers.XOPTIONS)
(3, 1, 2)

So I suspect this isn't to do with the typeparsers, but with the logic for handling invalid cases in xparse.

In particular, I think it's because xparse doesn't populate the value when the code is not ok:

  • first typeparser returns the correct value
  • then xparse correctly sets the code to INVALID_DELIMITER and sends us to donedone

    Parsers.jl/src/Parsers.jl

    Lines 532 to 540 in 6b560d4

    # didn't find delimiter or newline, so we're invalid, keep parsing until we find delimiter, newline, or len
    code |= INVALID_DELIMITER
    while true
        pos += 1
        incr!(source)
        if eof(source, pos, len)
            code |= EOF
            @goto donedone
        end
  • but then donedone checks whether ok(code) holds (it does not) and so doesn't pass the value to Result

    Parsers.jl/src/Parsers.jl

    Lines 659 to 666 in 6b560d4

    @label donedone
        tlen = pos - startpos
        if ok(code)
            y::T = x
            return Result{S}(code, tlen, y)
        else
            return Result{S}(code, tlen)
        end

So we have everything we need... but we're not using it.


Possible solution?

I think donedone might be doing this to handle the cases where we get sent to donedone before we've even called typeparser (e.g. because we hit "end of file" before hitting non-whitespace characters).

If this diagnosis is correct, I wonder if we should just handle that explicitly rather than checking ok(code), e.g. via a different goto-label:

+@label earlydone
+    # earlydone means parsing finished before calling `typeparser(T, ...)` to parse a `value::T`
+    tlen = pos - startpos
+    return Result{S}(code, tlen)
+
 @label donedone
     tlen = pos - startpos
-    if ok(code)
-        y::T = x
-        return Result{S}(code, tlen, y)
-    else
-        return Result{S}(code, tlen)
-    end
+    y::T = x
+    return Result{S}(code, tlen, y)
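
The intent of the diff can be modelled with a toy parser. This is a sketch under stated assumptions and not Parsers.jl code: `ToyResult`, `toyparse`, and the flag constants are hypothetical, with bit values chosen to agree with the codes printed earlier in this issue (33 and -32607). It separates the "no value was ever parsed" exit from the "value was parsed, but the code is invalid" exit.

```julia
# Toy stand-ins; names are hypothetical and this is not Parsers.jl's code.
const OK = Int16(1)
const EOF_FLAG = Int16(32)
const INVALID_DELIMITER = Int16(128)
const INVALID = typemin(Int16)   # sign bit marks an invalid parse

struct ToyResult{T}
    code::Int16
    tlen::Int
    val::Union{T, Nothing}       # `nothing` stands in for "value never parsed"
end

# Parse leading ASCII digits of `buf`, then require end of input.
function toyparse(buf::AbstractVector{UInt8})
    pos, len = 1, length(buf)
    # "earlydone": nothing to parse, so there is no value to return
    len == 0 && return ToyResult{Int}(INVALID | EOF_FLAG, 0, nothing)
    x, code = 0, Int16(0)
    while pos <= len && UInt8('0') <= buf[pos] <= UInt8('9')
        x = 10x + Int(buf[pos] - UInt8('0'))
        code |= OK
        pos += 1
    end
    if pos <= len
        # trailing bytes that are not a delimiter: flag them and skip to the end
        code |= INVALID | INVALID_DELIMITER
        pos = len + 1
    end
    code |= EOF_FLAG
    # "donedone": the value was parsed, so return it whatever the code says
    return ToyResult{Int}(code, pos - 1, x)
end
```

In this model, `toyparse(codeunits("2 _"))` keeps the value 2 alongside an invalid code, while an empty buffer takes the early exit and carries no value.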

Parsers.jl v2.4.1 breaks InlineStrings.jl

Specifically #130 breaks the tests for InlineStrings.jl (JuliaStrings/InlineStrings.jl#48)

On v2.4.0

julia> res = Parsers.xparse(InlineString7, "")
Parsers.Result{String7}(33, 0, "")

julia> res.val
""

On v2.4.1

julia> res = Parsers.xparse(InlineString7, "")
Parsers.Result{PosLen}(33, 0, PosLen(0x0000000000100000))

julia> res.val
PosLen(0x0000000000100000)

Support AbstractString with tryparse function

Looks like tryparse just needs to be defined to accept AbstractString. Is there any reason not to?

julia> Parsers.tryparse(split("1,2", ",")[1], Float64)
ERROR: MethodError: no method matching tryparse(::SubString{String}, ::Type{Float64})
Closest candidates are:
  tryparse(::String, ::Type{T}; kwargs...) where T at /Users/tomkwong/.julia/packages/Parsers/oDXb6/src/Parsers.jl:262
  tryparse(::IO, ::Type{T}; kwargs...) where T at /Users/tomkwong/.julia/packages/Parsers/oDXb6/src/Parsers.jl:266
Stacktrace:
 [1] top-level scope at none:0
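
One way such a method could look is a fallback that converts to String and forwards. The sketch below is not Parsers.jl code: `tryparse_any` is a hypothetical name, and its String method merely stands in for an existing String-only method by delegating to `Base.tryparse`.

```julia
# Hypothetical wrapper; `tryparse_any` is not part of Parsers.jl.
# The String method stands in for an existing String-only method.
tryparse_any(s::String, ::Type{T}) where {T} = Base.tryparse(T, s)

# Fallback: materialise any other AbstractString (e.g. SubString) as a String.
tryparse_any(s::AbstractString, ::Type{T}) where {T} = tryparse_any(String(s), T)
```

The conversion copies the substring, which may matter in hot loops; a method that works on the substring's code units directly would avoid the copy.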

Streamline `xparse` interface

I think we have too many different xparse methods that set different defaults.

julia> Parsers.xparse(String, str)  # == Parsers.xparse(String, str; quoted=true)
options.quoted = true
Parsers.Result{PosLen}(-32603, 15, PosLen(0x0000000000200003))

julia> Parsers.xparse(String, str, 1, sizeof(str))
options.quoted = true
Parsers.Result{PosLen}(-32603, 15, PosLen(0x0000000000200003))

julia> Parsers.xparse(String, str, 1, sizeof(str), Parsers.Options())
options.quoted = false
Parsers.Result{PosLen}(33, 15, PosLen(0x000000000010000f))

julia> Parsers.xparse(String, str, 1, sizeof(str), Parsers.Options(; quoted=true))
options.quoted = true
Parsers.Result{PosLen}(5, 6, PosLen(0x0000000000200003))

  1. The first hits this, which passes quoted::Bool=true:

    Parsers.jl/src/Parsers.jl

    Lines 211 to 212 in e2259a6

    # for testing purposes only, it's much too slow to dynamically create Options for every xparse call
    function xparse(::Type{T}, buf::Union{AbstractVector{UInt8}, AbstractString, IO}; pos::Integer=1, len::Integer=buf isa IO ? 0 : sizeof(buf), sentinel=nothing, wh1::Union{UInt8, Char}=UInt8(' '), wh2::Union{UInt8, Char}=UInt8('\t'), quoted::Bool=true, openquotechar::Union{UInt8, Char}=UInt8('"'), closequotechar::Union{UInt8, Char}=UInt8('"'), escapechar::Union{UInt8, Char}=UInt8('"'), ignorerepeated::Bool=false, ignoreemptylines::Bool=false, delim::Union{UInt8, Char, PtrLen, AbstractString, Nothing}=UInt8(','), decimal::Union{UInt8, Char}=UInt8('.'), comment=nothing, trues=nothing, falses=nothing, dateformat::Union{Nothing, String, Dates.DateFormat}=nothing, debug::Bool=false, stripwhitespace::Bool=false, stripquoted::Bool=false) where {T}

  2. The second hits this, which uses Parsers.XOPTIONS

    Parsers.jl/src/Parsers.jl

    Lines 217 to 218 in e2259a6

    xparse(::Type{T}, buf::AbstractString, pos, len, options=Parsers.XOPTIONS) where {T} =
    xparse(T, codeunits(buf), pos, len, options)

    • and XOPTIONS has quoted=true
      const XOPTIONS = Options(missing, UInt8(' '), UInt8('\t'), UInt8('"'), UInt8('"'), UInt8('"'), UInt8(','), UInt8('.'), nothing, nothing, nothing, false, false, nothing, true, false, false)
    • (aside: Why does XOPTIONS exist?)
  3. The third hits the same method as 2, but passing in Parsers.Options() which has quoted=false:

    quoted::Bool=false,

  4. The fourth hits the same method as 2/3, but passes in Parsers.Options(; quoted=true) to set that explicitly ...but returns a different answer to 1 (xparse(String, str; quoted=true)), because Options() defaults to delim=nothing whereas 1 sets delim=UInt8(',')

    delim::Union{Nothing, UInt8, Char, String}=nothing,

This is now very off-topic from your original issue (sorry), so I can move it to a new issue, but I think perhaps we could simplify the xparse interface to make this whole thing a little less confusing / more explicit.

I think I'd be in favour of requiring a user-given ::Options argument.

cc @quinnj

Originally posted by @nickrobinson251 in #119 (comment)

`realloc(): invalid pointer`

When updating from Parsers 0.2.18 to 0.2.20 I get the following when parsing a CSV file:

realloc(): invalid pointer

signal (6): Aborted
in expression starting at no file:0
gsignal at /lib64/libc.so.6 (unknown line)
abort at /lib64/libc.so.6 (unknown line)
__libc_message at /lib64/libc.so.6 (unknown line)
malloc_printerr at /lib64/libc.so.6 (unknown line)
realloc at /lib64/libc.so.6 (unknown line)
jl_gc_counted_realloc_with_old_size at /buildworker/worker/package_linux64/build/src/gc.c:2777
__gmpz_realloc at /usr/local/julia/bin/../lib/julia/libgmp.so (unknown line)
__gmpz_mul_2exp at /usr/local/julia/bin/../lib/julia/libgmp.so (unknown line)
mul_2exp! at ./gmp.jl:146 [inlined]
mul_2exp! at ./gmp.jl:148 [inlined]
scale at /root/.julia/packages/Parsers/v5u2B/src/floats.jl:96 [inlined]
scale at /root/.julia/packages/Parsers/v5u2B/src/floats.jl:114 [inlined]
#_defaultparser#46 at /root/.julia/packages/Parsers/v5u2B/src/floats.jl:255 [inlined]
_defaultparser at /root/.julia/packages/Parsers/v5u2B/src/floats.jl:126 [inlined]
#defaultparser#45 at /root/.julia/packages/Parsers/v5u2B/src/floats.jl:123 [inlined]
defaultparser at /root/.julia/packages/Parsers/v5u2B/src/floats.jl:123 [inlined]
#parse!#13 at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:349 [inlined]
parse! at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:349 [inlined]
#parse!#29 at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:644 [inlined]
parse! at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:642 [inlined]
#parse!#28 at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:618 [inlined]
parse! at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:616 [inlined]
#parse!#27 at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:572 [inlined]
parse! at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:547 [inlined]
#parse!#26 at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:448 [inlined]
parse! at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:448 [inlined]
#parse#15 at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:351 [inlined]
parse at /root/.julia/packages/Parsers/v5u2B/src/Parsers.jl:351 [inlined]
parsefield at /root/.julia/packages/CSV/eWuJV/src/tables.jl:88 [inlined]
getproperty at /root/.julia/packages/CSV/eWuJV/src/tables.jl:182
getproperty at /root/.julia/packages/CSV/eWuJV/src/tables.jl:148 [inlined]
macro expansion at /root/.julia/packages/Tables/qIlOP/src/utils.jl:55 [inlined]
eachcolumn at /root/.julia/packages/Tables/qIlOP/src/utils.jl:47 [inlined]
buildcolumns at /root/.julia/packages/Tables/qIlOP/src/fallbacks.jl:95
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1831
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
columns at /root/.julia/packages/Tables/qIlOP/src/fallbacks.jl:149 [inlined]
Type at /root/.julia/packages/DataFrames/IKMvt/src/other/tables.jl:21
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1831
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
|> at ./operators.jl:813
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1831
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
#read#105 at /root/.julia/packages/CSV/eWuJV/src/CSV.jl:315
unknown function (ip: 0x7fa269275a4b)
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1831
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
#read at ./none:0
jl_fptr_trampoline at /buildworker/worker/package_linux64/build/src/gf.c:1831
jl_apply_generic at /buildworker/worker/package_linux64/build/src/gf.c:2184
...

I'll note that this error is occurring in a fairly deep application. I can try to make a minimal reproducible test if requested.
