Comments (8)

quinnj avatar quinnj commented on June 30, 2024

Hmmmm.....we should actually be accounting for quoted strings/other types during type detection. Can you post a specific case (file) where this isn't happening? Definitely a bug.

from csv.jl.

alyst avatar alyst commented on June 30, 2024

It's attenu.csv.gz
I get

CSV.CSVError("error parsing a `Int64` value on column 3, row 170; encountered 'c'")

alyst avatar alyst commented on June 30, 2024

I reported this issue and fixed it in #65, but on master the issue is present once again (the "quoted numbers detected as string column" test that I provided was also modified to expect the 2nd (quoted) column to be Int).
Is it because the logic was changed to ignore the quotes when detecting column types?

A somewhat similar issue is that, during type detection, quoted nulls ("NA") are detected as nulls.

quinnj avatar quinnj commented on June 30, 2024

@alyst, sorry for what's happened here. With the port to Nulls, I took the time to do quite a number of large refactorings, involving initial type detection, parsing, streaming, etc. In the process, there were some two dozen other issues closed, and I tried to make sure all existing issues stayed resolved, but I was worried that something would regress.

In this case, the behavior that has changed, and that I'd like to support, is that quoted fields are not automatically treated as Strings. The reasoning here is that I've personally encountered several different csv sources where, for some reason or another, a system chooses to quote all fields, regardless of whether they are strings or contain characters that need escaping.

In the case of the attenu.csv.gz file, the correct way to read that file would now be

julia> df = CSV.read(joinpath(dir, "attenu.csv"); null="NA", types=Dict(3=>Union{Null, String}))

i.e. it's pretty easy to manually specify that the 3rd column should be Strings with null values as "NA".

Does that all make sense? Sorry again if this has messed anything up at all.

alyst avatar alyst commented on June 30, 2024

The mode that doesn't automatically treat quoted columns as strings definitely makes sense. But to me, quoting everything is rather an indication of a problem with the .csv file.
Maybe it's possible to add an option (say, quoted_values=:string/:detect) specifying whether to always treat quoted columns as strings or to ignore the quotes and try to detect the type of the value.
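
To make the proposed option concrete, here is a minimal sketch in plain Julia. It is not CSV.jl code: `detect_type` and `isquoted` are hypothetical helpers, the `quoted_values` keyword only mirrors the name suggested above, and detection is reduced to choosing between Int and String.

```julia
# Hypothetical sketch, not part of CSV.jl: column type detection with a
# `quoted_values` switch, reduced to an Int-versus-String decision.

# A field counts as quoted when it is wrapped in double quotes.
isquoted(field) = length(field) >= 2 && first(field) == '"' && last(field) == '"'

function detect_type(fields; quoted_values::Symbol = :detect)
    quoted_values in (:string, :detect) ||
        throw(ArgumentError("quoted_values must be :string or :detect"))
    all_int = true
    for field in fields
        if isquoted(field)
            # :string policy: a single quoted field makes the whole column String.
            quoted_values === :string && return String
            # :detect policy: drop the quotes and inspect the value itself.
            field = chop(field; head = 1, tail = 1)
        end
        all_int &= (tryparse(Int, field) !== nothing)
    end
    return all_int ? Int : String
end
```

With `quoted_values = :detect`, a column such as `1`, `"2"`, `3` is detected as Int; with `quoted_values = :string`, the same column is detected as String because of the quoted field.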

alyst avatar alyst commented on June 30, 2024

Re attenu.csv.gz, the problem is that the column is inferred as non-null, but then null occurs during actual parsing.
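
A sketch of how this kind of mismatch can arise, assuming that detection only looks at a leading sample of rows (an assumption for illustration; this thread does not confirm how CSV.jl samples). `infer_type`, `parse_field` and `parse_column` are hypothetical helpers, not CSV.jl functions.

```julia
# Hypothetical sketch, not CSV.jl internals: infer a column type from the
# first `nrows` fields, then parse every field with the inferred type.
function infer_type(fields, nrows; null = "NA")
    sample = fields[1:min(nrows, length(fields))]
    nonnull = filter(f -> f != null, sample)
    base = all(f -> tryparse(Int, f) !== nothing, nonnull) ? Int : String
    # The column is only nullable if a null showed up inside the sample.
    return length(nonnull) < length(sample) ? Union{Missing, base} : base
end

function parse_field(field, T, null)
    if field == null
        Missing <: T || error("null value in a column inferred as non-null")
        return missing
    end
    S = nonmissingtype(T)  # available in Base since Julia 1.3
    return S === String ? String(field) : parse(S, field)
end

parse_column(fields, T; null = "NA") = T[parse_field(f, T, null) for f in fields]
```

For the fields `1`, `2`, `3`, `NA`, `5`, sampling only the first three rows infers Int, and parsing then fails on the fourth row; sampling all five rows infers Union{Missing, Int} and parsing succeeds.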

alyst avatar alyst commented on June 30, 2024

FWIW, in R, when we remove from attenu the few rows with non-digits in the 3rd column, read.csv(stringsAsFactors=FALSE) imports the 3rd column as int and treats quoted NAs as NA.
readr::read_csv() also imports it as int. However, read_csv() has a quoted_na option (TRUE by default), which specifies whether to treat "NA" as NA or as a string (for attenu.csv.gz, though, this option doesn't seem to have any effect: the 3rd column at the 79th row is always NA).
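
The semantics of such a switch can be sketched in plain Julia. This is a hypothetical helper modeled on the readr option described above, not an existing CSV.jl keyword; `parse_field` and `quoted_null` are made-up names, and every non-null value is simply returned as a String.

```julia
# Hypothetical sketch of a `quoted_null` switch, modeled on readr's `quoted_na`.
function parse_field(field::AbstractString; null = "NA", quoted_null::Bool = true)
    quoted = length(field) >= 2 && first(field) == '"' && last(field) == '"'
    value = quoted ? chop(field; head = 1, tail = 1) : field
    # An unquoted null marker is always a null; a quoted one only if allowed.
    if value == null && (!quoted || quoted_null)
        return missing
    end
    return String(value)
end
```

With the default `quoted_null = true`, both `NA` and `"NA"` become `missing`; with `quoted_null = false`, the quoted form is kept as the two-character string NA.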

quinnj avatar quinnj commented on June 30, 2024

This should be fixed on master with the switch to CSV.File (CSV.read still relies on the old CSV.Source, but there are plans to switch it over).

Note that, for now, you can get a NamedTuple of Vectors on master by doing

julia> using Tables; table = CSV.File(file; kwargs...) |> columntable
