
Comments (31)

quinnj avatar quinnj commented on August 17, 2024

yeah, it's on my todo list, after I clean up SQLite & ODBC

from csv.jl.

JeffBezanson avatar JeffBezanson commented on August 17, 2024

Feature request: be able to specify a custom parsing function for a certain column. I give you a function, and the type it will return, and you can pass the function a string to parse. For example, the dateformat=fmt option could then be implemented by passing e.g. parsers = (2 => s->Date(s,fmt),), where column 2 has type Date. Could be spelled many ways.
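A minimal sketch of how such an option could be wired up, assuming a `parsers` mapping from column index to parse function (the `parse_row` helper and `parsers` layout are illustrative, not an actual CSV.jl API):

```julia
using Dates

# Hypothetical helper: `parsers` maps column index => custom parse function;
# columns without an entry are left as strings.
function parse_row(fields, parsers)
    [haskey(parsers, i) ? parsers[i](s) : s for (i, s) in enumerate(fields)]
end

fmt = dateformat"yyyy-mm-dd"
parsers = Dict(2 => s -> Date(s, fmt))   # column 2 has type Date
row = parse_row(["a", "2024-08-17", "c"], parsers)
```

Knowing the declared return type up front (here `Date`) would let the reader allocate a correctly typed column instead of a `Vector{Any}`.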

ScottPJones avatar ScottPJones commented on August 17, 2024

First off, make sure you handle RFC 4180 correctly (which means CRLF must be handled; that is part of the standard).

ScottPJones avatar ScottPJones commented on August 17, 2024

Able to accept/detect compressed files as input; also able to tell CSV what compression to expect
Accept URLs as input (this may require some upstream work as Requests.jl sucks hardcore right now for downloading, but maybe we can just use Base.download)

Aren't these two things really something that could be done outside of the CSV parser?

Ability to specify an arbitrary ASCII delimiter

Tabs are very frequently used; also, M$ uses ; when , is used as a decimal separator, AFAIK.

Ability to specify an arbitrary ASCII newline character; not sure what to do about CRLF (\r\n)

Is this really necessary? I think all you really need to do is treat \r, \n, or \r\n as newlines,
and not complicate things by allowing an arbitrary character.
Have you seen anything other than those three (old Mac OS, Unix/Linux, DOS/Windows, etc. formats)?
The one thing you might want to be able to deal with is an array of strings, but that should be a separate entry point into the parser, IMO.
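Treating all three endings as row breaks can be as simple as a normalizing split; a sketch (`split_lines` is an illustrative name, not a real API):

```julia
# Split on \r\n first so it is not counted as two breaks,
# then on bare \r (old Mac OS) or \n (Unix).
split_lines(s::AbstractString) = split(s, r"\r\n|\r|\n")

split_lines("a,b\r\nc,d\ne,f\rg,h")
```

A real parser would scan bytes rather than regex-split, but the alternation order (`\r\n` before `\r`) is the key detail either way.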

Ability to specify a quote character that quotes field values and allows delimiters/newlines in field values
Ability to specify an escape character that allows for the quote character inside a quoted field value

Are those necessary? RFC 4180 handles those situations just fine: "" is used to escape a double quote,
and a value enclosed in " is allowed to span multiple lines.
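The RFC 4180 convention amounts to dropping the outer quotes and collapsing doubled quotes; a sketch (`unquote_field` is an illustrative name):

```julia
# A field wrapped in double quotes has "" collapsed to a single ";
# unquoted fields pass through unchanged.
function unquote_field(s::AbstractString)
    if length(s) >= 2 && startswith(s, '"') && endswith(s, '"')
        return replace(chop(s; head=1, tail=1), "\"\"" => "\"")
    end
    return s
end
```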

Ability to provide a custom header of column names
Ability to specify a custom line where the column names can be found in the file; the data must start on the line following the column names;

Sounds fine.

not sure what to do about headerless CSV files (if there is such a thing)

Actually, "headerless" CSV files are a very frequent case (more frequent in my experience than
CSV files with headers). RFC 4180 uses the MIME type to determine that, if you have a MIME type.

Ability to specify the types CSV should expect for each column individually or for every column
Ability to specify the formats of Date/DateTime columns for parsing
Ability to tell CSV to use 'x' number of rows for type inference

All nice features.

Ability to specify a thousands separator character
Ability to specify a decimal character (e.g. ',' for Europe); not sure how to handle the implementation here

Those can frequently be auto-detected: either the numbers will be enclosed in double quotes if they have , in them, or the M$ way (which I haven't actually seen in the wild) of using ; as the separator.
If you are trying to infer types from the data, you have to be careful here, because something being within quotes doesn't mean it isn't a numeric field.

Ability to specify a custom NULL value to expect (e.g. "NA", "\N", "NULL", etc.)

It may be simply nothing between the commas, or a comma at the end of the line; that's what I've usually seen. Remember, strings might not be within quotes, just as numbers might be in quotes,
so you might have foo,null,bar,"123.456,30"
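A sketch of combining the empty-field case with user-supplied sentinels (`isnullfield` and the `sentinels` default are illustrative names, not CSV.jl options):

```julia
# Treat the empty string plus any configured sentinel token as a null.
isnullfield(s; sentinels = ("", "NA", "\\N", "NULL", "null")) = s in sentinels

# A naive comma split (quoted fields would need the real parser).
vals = [isnullfield(f) ? missing : String(f) for f in split("foo,,bar,NULL", ',')]
```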

Ability to skip blank lines
Ability to do "SKIP, LIMIT" type SQL functionality when parsing (i.e. read only a chunk of a file)
Ability to not parse specified columns (ignore them)

Those 3 are nice.

Ability to specify a # of lines to skip at the end of a file

What would that be for? That seems like it would be hard to figure out; you'd have to load the data and then throw out the last # lines, right?

StefanKarpinski avatar StefanKarpinski commented on August 17, 2024

I don't think that plain strings should be treated as URLs that will be downloaded. I think that only a URL type should work that way. Otherwise it's way too easy for someone to trick you into making arbitrary HTTP / whatever protocol requests.

StefanKarpinski avatar StefanKarpinski commented on August 17, 2024

This also plays well with multiple dispatch – instead of adding URL handling logic to the main body, you can just add methods for the URL type and then call the normal method from there.

ScottPJones avatar ScottPJones commented on August 17, 2024

This also plays well with multiple dispatch

Ah, the awesome beauty of Julia! 😀

jiahao avatar jiahao commented on August 17, 2024

I've also encountered some CSVs where the header is split across two rows, and this confuses the heck out of all the other CSV readers I've found. It would be good to support multirow headers.

quinnj avatar quinnj commented on August 17, 2024

Really, multi-row headers, huh? Never heard of that before. Can you share an example? It's not immediately obvious how we would handle that in the general case (a 100-row header?).

jiahao avatar jiahao commented on August 17, 2024

Here is a small snippet of a medical data set, called MIMIC II. A lot of the data are specified in plaintext format, like so:

     Time          MCL1       I
(hh:mm:ss.mmm)     (mV)    (mV)
[21:49:14.504]   -0.069  -0.072
[21:49:14.512]   -0.138  -0.040
[21:49:14.520]   -0.207   0.136
...

Not really a CSV per se, but the only reader I've found that can handle these data well is Excel...

quinnj avatar quinnj commented on August 17, 2024

Ah.....I see. No, I've definitely seen these before. The general thing I want to do in these cases is vcat the "header rows" into a single column name, i.e.:

  • Time_(hh:mm:ss.mmm)
  • MCL1_(mV)
  • I_(mV)
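For the MIMIC-style header above, the vcat idea amounts to joining each column's header cells; a sketch:

```julia
header_rows = [["Time", "MCL1", "I"],
               ["(hh:mm:ss.mmm)", "(mV)", "(mV)"]]
# Join the i-th cell of every header row with "_" to form one name per column.
colnames = [join(getindex.(header_rows, i), "_") for i in 1:length(first(header_rows))]
```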

jiahao avatar jiahao commented on August 17, 2024

Concatenating the header rows would be fine.

jiahao avatar jiahao commented on August 17, 2024

It would also be nice to be able to handle fields with \0 characters. For some reason this seems to be rather common in CSV files in the wild. pandas and fread both choke in the presence of \0s.

quinnj avatar quinnj commented on August 17, 2024

The parser as is in this package handles \0 embedded in strings. Do you think there are cases where they're embedded in other types?

jiahao avatar jiahao commented on August 17, 2024

Not that I've encountered. I've only seen \0 in string fields.

ScottPJones avatar ScottPJones commented on August 17, 2024

Yes, you really need to handle \0 in quoted strings at the very least (within quoted strings, you really need to be able to handle absolutely any character).
Have you looked at handling common Excel weirdnesses, like ="000123"?
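The Excel quirk mentioned (wrapping a value as ="000123" to force text treatment and preserve leading zeros) can be unwrapped with a pattern match; a sketch (`strip_excel_eq` is an illustrative name):

```julia
# Unwrap Excel's ="..." text-forcing syntax; anything else passes through.
function strip_excel_eq(s::AbstractString)
    m = match(r"^=\"(.*)\"$", s)
    m === nothing ? s : String(m.captures[1])
end
```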

johnmyleswhite avatar johnmyleswhite commented on August 17, 2024

I think you'll ultimately need to support multicharacter end-of-column and end-of-row delimiters. Pandas does this: they often demonstrate parsing the MovieLens data sets, where :: is the end-of-column delimiter.
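Base's `split` already handles multicharacter delimiters, so the MovieLens case is easy to sketch (the real parser would need this in its byte-scanning loop):

```julia
# MovieLens rows use "::" as the column delimiter.
split("1::Toy Story (1995)::Animation|Children's|Comedy", "::")
```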

ScottPJones avatar ScottPJones commented on August 17, 2024

@johnmyleswhite That's a strange one! What other oddities have you come across that should be handled?
Thinking of the discussion in #3, couldn't all of that be handled efficiently by generating a reader for that, so that handling multicharacter eoc and eor delimiters would not slow down the 99% case of single character delimiters?

felipenoris avatar felipenoris commented on August 17, 2024

How about exporting variables to csv files?

quinnj avatar quinnj commented on August 17, 2024

what do you mean exactly @felipenoris ?

felipenoris avatar felipenoris commented on August 17, 2024

@quinnj , I mean save vector or matrix to file. Equivalent to writetable("output.csv", df) on DataFrames.

AndyGreenwell avatar AndyGreenwell commented on August 17, 2024

From the first entry in this thread:

not sure what to do about headerless CSV files (if there is such a thing)

Attempting to load a couple of large CSV files now for which there was no header row included.

With DataFrames.jl, providing header=false just treats all rows as data and provides generic column names of x1, x2, x3, etc.

Any possibility of the same behavior here for "headerless" CSV files?

quinnj avatar quinnj commented on August 17, 2024

Yep, this is supported. Just provide the header manually and set the datarow, like CSV.read(file; header=["col1","col2","col3"], datarow=1)

JeffBezanson avatar JeffBezanson commented on August 17, 2024

Refactoring request: remove the dependency on DataFrames and (less importantly) NullableArrays.

The coupling to DataFrames already seems to be very light, which is great. AFAICT, the core functionality already just inserts values into a vector of column vectors. The only issue is that DataFrames is a pretty large dependency.

I also think missing values should only be handled for columns of type Nullable{T}. If Nullable is requested or inferred as a column type then a NullableArray can be used for that column, I'm fine with that. Ideally you could also stream! into e.g. an Array{Nullable} if you wanted though.

quinnj avatar quinnj commented on August 17, 2024

Like current master? :)

The newest updates (project DECOUPLE) make CSV.jl basically oblivious to where it's sending data. It provides a Data.getfield{T}(source::CSV.Source, ::Type{T}, row, col) => Nullable{T} method that can be used by any Sink type. The DataFrames code that ingests a CSV.Source is now at https://github.com/JuliaData/DataStreams.jl/blob/a7246389c07df6ee22ebb10c6fd221743bd68b89/src/DataStreams.jl#L260.

The idea is to make it as easy as possible for anyone to write their own "Sink" type that could ingest from a CSV.Source (and still use the convenience functions like CSV.read(source, sink)).

davidagold avatar davidagold commented on August 17, 2024

using DataStreams, DataFrames, WeakRefStrings
Isn't this an implicit dependency on DataFrames? Anyway DataStreams requires DataFrames, and CSV requires DataStreams...

EDIT: DataFrames => DataStreams. Too much Data.

JeffBezanson avatar JeffBezanson commented on August 17, 2024

@quinnj That's great! I'll have to try to write a Sink.

CSV does still do using DataFrames though, so it should probably be added back to REQUIRE.

Anyway this isn't super urgent since you can get the data from a DataFrame without copying.

quinnj avatar quinnj commented on August 17, 2024

@davidagold is right, DataStreams now has the explicit dependency on DataFrames.

davidagold avatar davidagold commented on August 17, 2024

@quinnj Would you support moving the DataFrame sink methods to DataFrames?

quinnj avatar quinnj commented on August 17, 2024

BTW @JeffBezanson, the new docs on the Source/Sink interface are here, still trying to figure out why they're not hosting correctly.

quinnj avatar quinnj commented on August 17, 2024

Closing in favor of specific issues going forward.
