juliadata / csv.jl
Utility library for working with CSV and other delimited files in the Julia programming language
Home Page: https://csv.juliadata.org/
License: Other
I'd like to decide on the Julia structure that CSV.read() returns. Speak now or forever hold your peace (or write your own parser, I don't care). The current candidates are:
I'm leaning towards Dict{String,NullableArray{T}} as it's the most straightforward.
@johnmyleswhite @davidagold @StefanKarpinski @jiahao @RaviMohan
Currently, CSV.write is a pretty naive implementation. There's probably a lot that could be done to improve performance; see this post for some ideas to speed it up: http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/.
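One idea from that post, sketched below in Julia: accumulate rows in an in-memory buffer and flush it in large chunks instead of issuing many small writes. This is purely illustrative, not the actual CSV.write internals:
# Illustrative sketch, not the actual CSV.write implementation:
# buffer rows in memory and flush roughly every 1 MiB.
function writerows(io::IO, rows; delim=',')
    buf = IOBuffer()
    for row in rows
        join(buf, row, delim)   # write one row into the in-memory buffer
        write(buf, '\n')
        position(buf) > 2^20 && write(io, take!(buf))
    end
    write(io, take!(buf))       # flush the remainder
    return io
end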
Does CSV.jl have a way to write DataFrames to file? I couldn't find a way to construct a Sink from a DataFrame. I plan on writing 2 separate DataFrames to the same file, and I couldn't do this with DataFrame's writetable method.
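For reference, the shape of what I'm after, assuming CSV.write (or a Sink) can accept an already-open IO and append/header keywords (those keywords are an assumption on my part, not confirmed API):
# Hypothetical usage sketch; the append/header keywords are assumptions.
open("combined.csv", "w") do io
    CSV.write(io, df1)                             # first table, with its header
    CSV.write(io, df2; append=true, header=false)  # second table, no header row
end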
I'm trying to read the following file:
Name,Age,Children
John, 38., 3
Sally, 23., 1
Kirk, 64., 5
and get this:
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.0-rc2+0 (2016-08-12 11:25 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-w64-mingw32
julia> using CSV
julia> x = CSV.Source("data.csv")
CSV.Source: data.csv
CSV.Options:
delim: ','
quotechar: '"'
escapechar: '\\'
null: ""
dateformat: Base.Dates.DateFormat(Base.Dates.Slot[],"","english")
Data.Schema:
rows: 3 cols: 3
Columns:
"Name" WeakRefString{UInt8}
"Age" Float64
"Children" Int64
julia> Data.getfield(x,Float64,1,2)
ERROR: CSV.CSVError("error parsing a `Float64` value on column 2, row 1; encountered 'J'")
in checknullend at C:\Users\anthoff\.julia\v0.5\CSV\src\parsefields.jl:52 [inlined]
in parsefield(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Type{Float64}, ::CSV.Options, ::Int64, ::Int64, ::Base.RefValue{CSV.ParsingState}) at C:\Users\anthoff\.julia\v0.5\CSV\src\parsefields.jl:126
in getfield(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::Type{Float64}, ::Int64, ::Int64) at C:\Users\anthoff\.julia\v0.5\CSV\src\Source.jl:195
I'm on master for CSV and DataStreams.
Am I using the API in the wrong way, or is this a bug?
The appveyor link on the homepage points to https://ci.appveyor.com/project/JuliaData/documenter-jl/branch/master, which is clearly incorrect. I don't know what it should be, since https://ci.appveyor.com/project/JuliaData/CSV-jl seems to lead nowhere. Is this set up on appveyor at all?
When I read a csv file into a DataFrame like this:
df = DataFrame(CSV.csv("filename.csv"))
and that file has string columns, they end up showing as "#undef" when I show the DataFrame in the REPL, probably because the type of a string column in the DataFrame ends up as DataArray{DataStreams.Data.PointerString{T},1}? Could those actually end up as a normal UTF8String in the DataFrame?
using CSV
CSV.read("data.txt")
appears to leave the file open on Windows 7. If I try to delete the file data.txt in Windows Explorer, I get the dialog "File In Use; The action can't be completed because the file is open in Julia Programming Language; Close the file and try again". If I close Julia, I can delete the file. However, I can delete the file in Git Bash without closing Julia.
On Mac it seems to work fine.
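If the held handle comes from the file being memory-mapped (which would explain the Windows-only behavior, since Windows locks mapped files), disabling mmap may be a workaround; use_mmap appears as a keyword elsewhere in these reports, so something like:
julia> CSV.read("data.txt"; use_mmap=false)  # avoid mapping the file, if supported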
Versions
I wonder about the potential use/worth of a "lazily" read CSV file. My idea here is:
- CSV.File would be parsed as usual (with all properties set), storing a reference to the file's mmap and going through and figuring out all the field offsets.
- CSV.Table would have various indexing operations defined to "use" it, but all the actual value parsing would be delayed until physical access of those values. If a column was never accessed, its values would never actually be parsed.
My hope is that this would potentially allow an extremely fast "parsing" experience with some of the cost deferred until actual values are needed. The actual implementation may be tricky, in sorting out exactly how to tell whether a value has been parsed or not, but that's a separate concern; a rough sketch follows.
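A minimal sketch of the lazy-column idea, with all names hypothetical: store the byte range of each cell and parse a value only on first access, caching the result.
# Hypothetical sketch of a lazily-parsed column: byte offsets into an
# mmapped buffer, with values parsed (and cached) only on first access.
struct LazyColumn{T}
    buf::Vector{UInt8}               # the file's mmap
    offsets::Vector{UnitRange{Int}}  # byte range of each cell
    cache::Vector{Union{T,Nothing}}  # nothing = not parsed yet
end

LazyColumn{T}(buf, offsets) where {T} =
    LazyColumn{T}(buf, offsets, Vector{Union{T,Nothing}}(nothing, length(offsets)))

function Base.getindex(c::LazyColumn{T}, i::Int) where {T}
    v = c.cache[i]
    v === nothing || return v
    parsed = parse(T, String(c.buf[c.offsets[i]]))  # parse on demand
    c.cache[i] = parsed
    return parsed
end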
It comes up quite often: e.g. a dataset I am now working with encodes gender as M and F; another uses E, U, O for employment, unemployment, out of the labor force, etc. Using String is not optimal. I could define a method for parse(Char, str; raise=true) as suggested by the manual, or one for parsefield(io, Char, ...).
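The sketch I have in mind, as my own extension (not something CSV provides today):
# My own sketch: treat a single-character field as a Char.
function Base.parse(::Type{Char}, s::AbstractString; raise::Bool=true)
    length(s) == 1 && return first(s)
    raise && throw(ArgumentError("expected a single-character field, got $(repr(s))"))
    return nothing
end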
A typical CSV file from the Netflix Prize data set looks like:
3884821:
249897724,2,2001-11-26
483,3,2000-01-13
4875839,5,2059-07-27
and so on.
Reading this snippet into CSV throws an error:
julia> CSV.csv("test.txt")
ERROR: CSV.CSVError("error parsing a `Int64` value on column 1, row 3; encountered '-'")
in parsefield at /Users/test/.julia/v0.4/CSV/src/getfields.jl:86
in parsefield! at /Users/test/.julia/v0.4/CSV/src/Source.jl:216
in stream! at /Users/test/.julia/v0.4/CSV/src/Source.jl:230
in stream! at /Users/test/.julia/v0.4/DataStreams/src/DataStreams.jl:243
in csv at /Users/test/.julia/v0.4/CSV/src/Source.jl:284
I just did a test on a 1000-row dataset. I needed to set rows_for_type_detect to 1000, which may be slightly "unfair" compared to readcsv, which has no types. Still, CSV.read is far too slow: it takes 13 seconds instead of 0.04s. I note that the functions were already compiled in the example below. I was hoping that this works better now, as you indicated here:
https://groups.google.com/forum/#!searchin/julia-users/csv/julia-users/IFkPso4JUac/lNLgLoCqAwAJ
Any hints?
julia> f="T:\temp\julia1k.csv"
"T:\temp\julia1k.csv"
julia> @time f1=readcsv(f);
0.043854 seconds (239.86 k allocations: 8.536 MB)
julia> @time df=readtable(f);
0.039639 seconds (221.93 k allocations: 10.359 MB, 15.51% gc time)
julia> @time f2=CSV.read(f,rows_for_type_detect=1000);
13.760476 seconds (1.79 M allocations: 73.616 MB, 0.12% gc time)
julia> @show size(f1),size(f2),size(df)
(size(f1),size(f2),size(df)) = ((1000,77),(999,77),(999,77))
((1000,77),(999,77),(999,77))
julia> versioninfo(true)
Julia Version 0.4.1
Commit cbe1bee* (2015-11-08 10:33 UTC)
Platform Info:
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
WORD_SIZE: 64
Microsoft Windows [Version 6.1.7601]
uname: MSYS_NT-6.1 2.3.0(0.290/5/3) 2015-09-29 10:48 x86_64 unknown
Memory: 31.694698333740234 GB (26403.6875 MB free)
Uptime: 1.1864766877332e6 sec
Load Avg: 0.0 0.0 0.0
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz:
speed user nice sys idle irq ticks
#1 3410 MHz 4675942 0 2720907 1179080330 152865 ticks
#2 3410 MHz 609105 0 854667 1185013080 87454 ticks
#3 3410 MHz 6070357 0 9145699 1171260702 124348 ticks
#4 3410 MHz 786603 0 1347911 1184342104 18033 ticks
#5 3410 MHz 6533228 0 11563262 1168380019 145923 ticks
#6 3410 MHz 106033 0 37487 1186332833 1404 ticks
#7 3410 MHz 5059143 0 8723326 1172693743 114894 ticks
#8 3410 MHz 2913786 0 1522086 1182040247 29094 ticks
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
Environment:
.CLASSPATH = C:\Users\workstation\Documents\mongojdbcdriver
CLASSPATH = C:\Users\workstation\Documents\mongojdbcdriver
GROOVY_HOME = C:\Program Files (x86)\Groovy\Groovy-2.2.2
HOMEDRIVE = C:
HOMEPATH = \Users\workstation
JAVA_HOME = C:\Program Files\Java\jre8
JULIA_HOME = C:\Program Files\Juno\resources\app\julia\bin
PATHEXT = .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC;.groovy;.gy
Package Directory: C:\Users\workstation\.julia
27 required packages:
julia>
I have no idea why, but when I try to read a matrix with CSV.read it appears to be much slower than the classic readdlm.
@time s = CSV.read("file",delim= ' ',rows_for_type_detect=1,header=false);
89.093956 seconds (796.97 M allocations: 18.175 GB, 5.53% gc time)
@time a = readdlm("file");
19.948536 seconds (466.06 k allocations: 4.317 GB, 8.12% gc time)
I wonder... what the hell am I doing wrong? The file is a matrix with 4845 rows and 24348 columns.
The FileIO package allows various file extensions to be registered so that other packages' read routines are invoked when those extensions are encountered. Would it make sense to register CSV.read for the .csv extension, and possibly others like .tsv? That way
load("myfile.csv")
could be used instead of remembering whether the call is CSV.read or read_csv or readtable or ...
https://juliadata.github.io/CSV.jl/stable yields a 404 error.
Right now the optional nullable argument to CSV.read makes either all of the columns NullableVector types or none of them. Could an optional argument be added so that, when nullable = true, any columns that do not contain nulls are "unnullified" after being read? The code could be as simple as
unnullify(x::NullableArray) = anynull(x) ? x : Array(x)
although I haven't looked inside the package to see exactly where this could be done.
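Usage would then be a single cleanup pass after the read, along these lines (sketch; assumes NullableArrays' anynull and a freshly-read DataFrame df):
# Sketch of the post-read pass over every column.
unnullify(x::NullableArray) = anynull(x) ? x : Array(x)
for name in names(df)
    df[name] = unnullify(df[name])
end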
If I run
@elapsed pvs = CSV.read("chunky.csv")
the outcome is about 30 seconds, and the result fits comfortably into memory. However, if I run
@elapsed pvs = CSV.read("chunky.csv", nullable=false, types=correct_types)
I end up having to kill the REPL as the read consumes my entire memory and swap. The csv is perfectly formed with no null values. Is there a good reason for this happening? Otherwise I find it counter-intuitive that a bunch of Nullable arrays containing data would be smaller than the data itself. Is something fancy like mmap going on?
EDIT: I think the damage is being done by supplying types, but I'm still not sure why. I am also on the current master branch.
When doing a CSV.read I'm getting an undefined variable error. It looks like the last commit changed the parameter name from "f" to "io" but didn't update the variables in the function "countlines".
julia> CSV.read("../ndsparse_use_cases/relations/market_to_bu_relation.csv", types = [String,String])
ERROR: UndefVarError: f not defined
in countlines(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::UInt8, ::UInt8) at /home/jnelson/.julia/v0.5/CSV/src/io.jl:79
in #Source#3(::String, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}) at /home/jnelson/.julia/v0.5/CSV/src/Source.jl:76
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
in #Source#2(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::String) at /home/jnelson/.julia/v0.5/CSV/src/Source.jl:39
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}, ::String) at ./<missing>:0
in #read#4(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::CSV.#read, ::String, ::Type{T}) at /home/jnelson/.julia/v0.5/CSV/src/Source.jl:274
in (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{T}) at ./<missing>:0 (repeats 2 times)
in eval(::Module, ::Any) at ./boot.jl:234
in macro expansion at ./REPL.jl:92 [inlined]
in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46
This would involve chunking up the input file so that the basic @threads for ... multi-threading interface can be used; a rough sketch follows.
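Roughly what I have in mind: snap chunk boundaries to row starts so no row is split across threads (find_row_start and parse_rows are hypothetical helpers):
# Sketch only; find_row_start and parse_rows are hypothetical helpers.
using Base.Threads

function parse_chunks(buf::Vector{UInt8}, nchunks::Int)
    step = div(length(buf), nchunks)
    # move each boundary forward to the start of the next row
    bounds = [1; [find_row_start(buf, i * step) for i in 1:nchunks-1]; length(buf) + 1]
    results = Vector{Any}(undef, nchunks)
    @threads for i in 1:nchunks
        results[i] = parse_rows(buf, bounds[i], bounds[i+1] - 1)
    end
    return results
end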
WARNING: Method definition read(Base.AbstractIOBuffer, Type{UInt8}) in module Base
overwritten in module CSV at C:\Users\amellnik\.julia\v0.4\CSV\src\Source.jl:13.
It looks like the base method was recently added: JuliaLang/julia@bb744fd
This used to work. Could use some help here:
julia> CSV.read("file.csv",types = [String,String,String],nullable=false)
ERROR: TypeError: streamto!: in typeassert, expected String, got WeakRefStrings.WeakRefString{UInt8}
in streamto!(::DataFrames.DataFrame, ::Type{DataStreams.Data.Field}, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::Type{String}, ::Type{String}, ::Int64, ::Int64, ::DataStreams.Data.Schema{true}, ::Base.#identity) at /home/jeff/.julia/v0.5/DataStreams/src/DataStreams.jl:172
in stream!(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrame, ::DataStreams.Data.Schema{true}, ::DataStreams.Data.Schema{true}, ::Array{Function,1}) at /home/jeff/.julia/v0.5/DataStreams/src/DataStreams.jl:186
in #stream!#5(::Array{Any,1}, ::Function, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::Type{DataFrames.DataFrame}, ::Bool, ::Dict{Int64,Function}) at /home/jeff/.julia/v0.5/DataStreams/src/DataStreams.jl:150
in stream!(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::Type{DataFrames.DataFrame}, ::Bool, ::Dict{Int64,Function}) at /home/jeff/.julia/v0.5/DataStreams/src/DataStreams.jl:144
in #read#21(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::Type{DataFrames.DataFrame}) at /home/jeff/.julia/v0.5/CSV/src/Source.jl:248
in (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{DataFrames.DataFrame}) at ./<missing>:0 (repeats 2 times)
Prior to this I had just reinstalled Julia from source from the master branch and ran the following
Pkg.add("DataFrames")
Pkg.add("CSV")
using DataFrames
using CSV
and this is the truncated output, skipping over the deprecation warnings
while loading /home/cjprybol/.julia/v0.5/Docile/src/Extensions/Extensions.jl, in expression starting on line 16
ERROR: LoadError: `UTF16String` has been moved to the package LegacyStrings.jl:
Run Pkg.add("LegacyStrings") to install LegacyStrings on Julia v0.5-;
Then do `using LegacyStrings` to get `UTF16String`.
in include_from_node1(::String) at ./loading.jl:426
in macro expansion; at ./none:2 [inlined]
in anonymous at ./<missing>:?
in eval(::Module, ::Any) at ./boot.jl:234
in process_options(::Base.JLOptions) at ./client.jl:239
in _start() at ./client.jl:318
while loading /home/cjprybol/.julia/v0.5/WeakRefStrings/src/WeakRefStrings.jl, in expression starting on line 30
ERROR: LoadError: Failed to precompile WeakRefStrings to /home/cjprybol/.julia/lib/v0.5/WeakRefStrings.ji
in compilecache(::String) at ./loading.jl:505
in require(::Symbol) at ./loading.jl:337
in include_from_node1(::String) at ./loading.jl:426
in macro expansion; at ./none:2 [inlined]
in anonymous at ./<missing>:?
in eval(::Module, ::Any) at ./boot.jl:234
in process_options(::Base.JLOptions) at ./client.jl:239
in _start() at ./client.jl:318
while loading /home/cjprybol/.julia/v0.5/CSV/src/CSV.jl, in expression starting on line 4
ERROR: Failed to precompile CSV to /home/cjprybol/.julia/lib/v0.5/CSV.ji
in compilecache(::String) at ./loading.jl:505
in require(::Symbol) at ./loading.jl:364
julia> Pkg.add("LegacyStrings")
INFO: Cloning cache of LegacyStrings from https://github.com/JuliaArchive/LegacyStrings.jl.git
INFO: Installing LegacyStrings v0.1.1
INFO: Package database updated
julia> using LegacyStrings
WARNING: could not import Base.lastidx into LegacyStrings
WARNING: using LegacyStrings.ascii in module Main conflicts with an existing identifier.
WARNING: using LegacyStrings.utf8 in module Main conflicts with an existing identifier.
julia> using CSV
WARNING: both LegacyStrings and Base export "ASCIIString"; uses of it in module Main must be qualified
WARNING: both LegacyStrings and Base export "ByteString"; uses of it in module Main must be qualified
WARNING: both LegacyStrings and Base export "UTF8String"; uses of it in module Main must be qualified
julia> using CSV
INFO: Precompiling module CSV...
ERROR: LoadError: `UTF16String` has been moved to the package LegacyStrings.jl:
Run Pkg.add("LegacyStrings") to install LegacyStrings on Julia v0.5-;
Then do `using LegacyStrings` to get `UTF16String`.
in include_from_node1(::String) at ./loading.jl:426
in macro expansion; at ./none:2 [inlined]
in anonymous at ./<missing>:?
in eval(::Module, ::Any) at ./boot.jl:234
in process_options(::Base.JLOptions) at ./client.jl:239
in _start() at ./client.jl:318
while loading /home/cjprybol/.julia/v0.5/WeakRefStrings/src/WeakRefStrings.jl, in expression starting on line 30
ERROR: LoadError: Failed to precompile WeakRefStrings to /home/cjprybol/.julia/lib/v0.5/WeakRefStrings.ji
in compilecache(::String) at ./loading.jl:505
in require(::Symbol) at ./loading.jl:337
in include_from_node1(::String) at ./loading.jl:426
in macro expansion; at ./none:2 [inlined]
in anonymous at ./<missing>:?
in eval(::Module, ::Any) at ./boot.jl:234
in process_options(::Base.JLOptions) at ./client.jl:239
in _start() at ./client.jl:318
while loading /home/cjprybol/.julia/v0.5/CSV/src/CSV.jl, in expression starting on line 4
ERROR: Failed to precompile CSV to /home/cjprybol/.julia/lib/v0.5/CSV.ji
in compilecache(::String) at ./loading.jl:505
in require(::Symbol) at ./loading.jl:364
This issue is for tracking progress towards the next major release, corresponding to the broader release of Julia 0.5:
- CSV.read(io_or_file, sink, args...; kwargs...)
- CSV.write(io_or_file, source, args...; kwargs...)
Pkg.test("CSV") fails on Windows. If the line endings of the following files are converted to unix style, then the tests pass:
baseball.csv
stocks.csv
test_utf8.csv
test_single_column.csv
test_empty_file_newlines.csv
It would be useful if quotes did not always have to open and close with the same character. I have a delimited format that looks like
{reddish}, {red, ish}
{darkish}, {dark, ish}
{greenish}, {green, ish}
....
{greyish}, {grey, ish}
where every cell is quoted using { and }.
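What I'd imagine the API looking like, purely hypothetically (openquotechar/closequotechar do not exist in CSV today, which currently supports only a single quotechar):
# Hypothetical keywords for asymmetric quoting.
df = CSV.read("colors.csv"; openquotechar='{', closequotechar='}', header=false)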
Following this SO post, I'd like to do some processing on a CSV, along the lines of:
infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"
data = readcsv(infile; header=true)
map!(replace_nulls, data[1])
writecsv(outfile, data; header=true)
Except that this doesn't work AFAICT, since writecsv doesn't support a header keyword. (The workaround is probably fine, but I'm personally having some silly issue with it.) Is this something supported by CSV.jl? Would I need to read/write a DataFrame, or could I just iterate over lines?
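For reference, the shape of the workaround I'm fumbling with (sketch; replace_nulls is my own function, and the header row is written back manually since writedlm has no header keyword):
# Round-trip sketch using Base's delimited-file functions.
using DelimitedFiles   # needed on newer Julia; in Base on 0.x

infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"
data, header = readdlm(infile, ','; header=true)  # header comes back as a 1×n matrix
data = map(replace_nulls, data)                   # replace_nulls is user-defined
open(outfile, "w") do io
    writedlm(io, header, ',')  # write the header row back first
    writedlm(io, data, ',')
end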
Right now, this is handled by the quotefields::Bool keyword in various methods/constructors, but I think it would be better to just detect when fields need quoting and take care of it automatically; see the sketch below.
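Detection itself is cheap; something along these lines (a sketch, not the actual writer code):
# Sketch: quote a field only when it actually needs it.
function writefield(io::IO, s::AbstractString, delim::Char=',', quotechar::Char='"', escapechar::Char='\\')
    if any(c -> c == delim || c == quotechar || c == '\n' || c == '\r', s)
        print(io, quotechar)
        for c in s
            c == quotechar && print(io, escapechar)  # escape embedded quotes
            print(io, c)
        end
        print(io, quotechar)
    else
        print(io, s)
    end
end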
It would be nice/helpful/awesome if CSV.write("foo.csv", some_dataframe, append = true, header = true) would append the rows to an existing file, writing the header only when the file doesn't already contain one.
I have a CSV file with 1156 rows and 3 columns. Most of the entries in column 3 are NAs and some of them contain Int64s. When reading with CSV.read, the type of column 3 comes out as Nullable{WeakRefString{UInt8}} when it should ideally have been Nullable{Int64}. When I manually put the first entry of the third column in the 101st row, it is read as Nullable{Int64}, but the column is considered to be of type Nullable{WeakRefString{UInt8}} if the first entry is in row 102 or later. So I presume the first 101 rows are being used to infer the type of the column. How can we correctly infer the type if the first non-null element is beyond row 101?
This is the link to the sample csv file. The function to read the file is CSV.read("sample.csv").
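Two workarounds that should help in the meantime, sketched (whether types accepts a Dict from column index to type is an assumption here; rows_for_type_detect appears as a keyword elsewhere in these reports):
# Widen the detection window past the leading all-null rows...
df = CSV.read("sample.csv"; rows_for_type_detect=200)
# ...or declare the column's type up front (Dict form assumed).
df = CSV.read("sample.csv"; types=Dict(3 => Nullable{Int64}))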
On Julia 0.5:
julia> using Requests
julia> using CSV
julia> stream = Requests.get_streaming("https://raw.githubusercontent.com/JuliaData/CSV.jl/master/test/test_files/test_utf8.csv")
ResponseStream(Request(https://raw.githubusercontent.com/JuliaData/CSV.jl/master/test/test_files/test_utf8.csv, 3 headers, 0 bytes in body))
julia> CSV.read(stream)
ERROR: StackOverflowError:
in #Source#7(::Requests.ResponseStream{MbedTLS.SSLContext}, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}) at /root/.julia/v0.5/CSV/src/Source.jl:0
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
in #Source#6(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::Requests.ResponseStream{MbedTLS.SSLContext}) at /root/.julia/v0.5/CSV/src/Source.jl:25
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::Requests.ResponseStream{MbedTLS.SSLContext}, ::Type{DataFrames.DataFrame}) at /root/.julia/v0.5/CSV/src/Source.jl:294
in #Source#7(::Requests.ResponseStream{MbedTLS.SSLContext}, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}) at /root/.julia/v0.5/CSV/src/Source.jl:57
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
in #Source#6(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::Requests.ResponseStream{MbedTLS.SSLContext}) at /root/.julia/v0.5/CSV/src/Source.jl:25
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::Requests.ResponseStream{MbedTLS.SSLContext}, ::Type{DataFrames.DataFrame}) at /root/.julia/v0.5/CSV/src/Source.jl:294
in #Source#7(::Requests.ResponseStream{MbedTLS.SSLContext}, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}) at /root/.julia/v0.5/CSV/src/Source.jl:57
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
in #Source#6(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::Requests.ResponseStream{MbedTLS.SSLContext}) at /root/.julia/v0.5/CSV/src/Source.jl:25
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::Requests.ResponseStream{MbedTLS.SSLContext}, ::Type{DataFrames.DataFrame}) at /root/.julia/v0.5/CSV/src/Source.jl:294
...
in #Source#7(::Requests.ResponseStream{MbedTLS.SSLContext}, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}) at /root/.julia/v0.5/CSV/src/Source.jl:57
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
in #Source#6(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::Requests.ResponseStream{MbedTLS.SSLContext}) at /root/.julia/v0.5/CSV/src/Source.jl:25
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::Requests.ResponseStream{MbedTLS.SSLContext}, ::Type{DataFrames.DataFrame}) at /root/.julia/v0.5/CSV/src/Source.jl:294
in read(::Requests.ResponseStream{MbedTLS.SSLContext}) at /root/.julia/v0.5/CSV/src/Source.jl:287
With the latest commit I get the following error if I load SQLite (and so CSV):
ERROR: LoadError: LoadError: error in method definition: function Core.getfield must be explicitly imported to be extended
in include(::ASCIIString) at ./boot.jl:264
in include_from_node1(::ASCIIString) at ./loading.jl:417
in include(::ASCIIString) at ./boot.jl:264
in include_from_node1(::ASCIIString) at ./loading.jl:417
in eval(::Module, ::Any) at ./boot.jl:267
[inlined code] from ./sysimg.jl:14
in require(::Symbol) at ./loading.jl:348
in eval(::Module, ::Any) at ./boot.jl:267
while loading /home/martin/.julia/v0.5/CSV/src/getfields.jl, in expression starting on line 458
while loading /home/martin/.julia/v0.5/CSV/src/CSV.jl, in expression starting on line 89
Versioninfo:
julia> versioninfo()
Julia Version 0.5.0-dev+3123
Commit 01dd5ec (2016-03-12 05:08 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
I'm having trouble parsing tab-separated files with non-string columns (as inferred from the data or explicitly declared via the types keyword) containing null values (denoted by adjacent delimiters). The parser grabs the following cell, as can be seen in the following example (Julia 0.5, CSV.jl 0.1.1):
$ cat /tmp/foo.txt
A B C D
1 2016 x 100
2 2014 200
julia> CSV.read("/tmp/foo.txt"; delim='\t')
2×4 DataFrames.DataFrame
│ Row │ A │ B │ C │ D │
│ 1 │ 1 │ 2016 │ "x" │ 100 │
│ 2 │ 2 │ 2014 │ "200" │ #NULL │
whereas the expected result is:
│ Row │ A │ B │ C │ D │
│ 1 │ 1 │ 2016 │ "x" │ 100 │
│ 2 │ 2 │ #NULL │ "2014" │ 200 │
If column C in row 2 did not contain a value of the same inferred type as column B of row 1 (Int64) then the parser would fail, unable to parse the cell as an integer, e.g.
$ cat /tmp/bar.txt
A B C D
1 2016 x 100
2 y 200
julia> CSV.read("/tmp/bar.txt"; delim='\t')
ERROR: CSV.CSVError("error parsing a `Int64` value on column 2, row 2; encountered 'y'")
in checknullend at /Users/josh/.julia/v0.5/CSV/src/parsefields.jl:56 [inlined]
in parsefield at /Users/josh/.julia/v0.5/CSV/src/parsefields.jl:127 [inlined]
in parsefield at /Users/josh/.julia/v0.5/CSV/src/parsefields.jl:107 [inlined]
in streamfrom(::CSV.Source, ::Type{DataStreams.Data.Field}, ::Type{Nullable{Int64}}, ::Int64, ::Int64) at /Users/josh/.julia/v0.5/CSV/src/Source.jl:185
in streamto!(::DataFrames.DataFrame, ::Type{DataStreams.Data.Field}, ::CSV.Source, ::Type{Nullable{Int64}}, ::Type{Nullable{Int64}}, ::Int64, ::Int64, ::Data
Streams.Data.Schema{true}, ::Base.#identity) at /Users/josh/.julia/v0.5/DataStreams/src/DataStreams.jl:171
in stream!(::CSV.Source, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrame, ::DataStreams.Data.Schema{true}, ::DataStreams.Data.Schema{true}, ::Array{Fu
nction,1}) at /Users/josh/.julia/v0.5/DataStreams/src/DataStreams.jl:185
in #stream!#5(::Array{Any,1}, ::Function, ::CSV.Source, ::Type{DataFrames.DataFrame}, ::Bool, ::Dict{Int64,Function}) at /Users/josh/.julia/v0.5/DataStreams/
src/DataStreams.jl:149
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::Type{DataFrames.DataFrame}) at /Users/josh/.julia/v0.5/CSV/src/Source.jl:
289
in (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{DataFrames.DataFrame}) at ./<missing>:0 (repeats 2 times)
The expected result would be:
│ Row │ A │ B │ C │ D │
│ 1 │ 1 │ 2016 │ "x" │ 100 │
│ 2 │ 2 │ #NULL │ "y" │ 200 │
Interestingly, if the delimiter is changed to a comma in the file, it loads as expected, which makes me think the tab is being eaten as whitespace somewhere.
Also, string fields do not appear to have this problem.
$ cat /tmp/baz.txt
A B C D
1 P x 100
2 y 200
julia> CSV.read("/tmp/baz.txt"; delim='\t')
2×4 DataFrames.DataFrame
│ Row │ A │ B │ C │ D │
│ 1 │ 1 │ "P" │ "x" │ 100 │
│ 2 │ 2 │ #NULL │ "y" │ 200 │
which matches the expected result.
Hello,
I have a CSV file named test.csv like this:
ticker,dt,bid,ask
EUR/USD,20140101 21:55:34.378,1.37622,1.37693
EUR/USD,20140101 21:55:40.410,1.37624,1.37698
EUR/USD,20140101 21:55:47.210,1.37619,1.37696
EUR/USD,20140101 21:55:57.963,1.37616,1.37696
EUR/USD,20140101 21:56:03.117,1.37616,1.37694
EUR/USD,20140101 21:56:07.254,1.37616,1.37692
EUR/USD,20140101 21:56:16.911,1.3762,1.37695
EUR/USD,20140101 21:56:19.433,1.37615,1.37692
EUR/USD,20140101 21:56:24.971,1.37615,1.37691
EUR/USD,20140101 21:56:24.972,1.37615,1.37689
I'd like to load it using CSV.jl, but it raises an error:
julia> CSV.read("test.csv")
ERROR: argument is an abstract type; size is indeterminate
in call at /Users/femto/.julia/v0.4/DataStreams/src/DataStreams.jl:189
in stream! at /Users/femto/.julia/v0.4/DataStreams/src/DataStreams.jl:196
in read at /Users/femto/.julia/v0.4/CSV/src/Source.jl:294
I have no idea how to fix this.
Kind regards
I'm trying to read a big dataset using CSV.read(), but it is very slow for my purpose. The dataset has 1 billion rows and 19 columns; after reading, conversions are made to extract the data type from the Nullable type. Any suggestions on how to read it faster? Is there any method to perform parallel reading?
Hi,
I am interested in giving this package a try, but am having some difficulty reading my first CSV file.
I start by defining a Source:
f = CSV.Source("my.csv")
Then I try to stream! this to a Data.Table:
julia> ds = Data.stream!(f,Data.Table)
ERROR: CSV.CSVError("error parsing a `Int64` value on column 1, row 3; encountered '.'")
Here is a snapshot of the first data values (below the headers) when opening the CSV file in Notepad (screenshot not reproduced).
When I inspect f.schema.header, it read the headers correctly, and when I inspect f.schema.types, it inferred the types correctly, i.e. 2 columns of Int64 and 20 columns of Float64.
Subsequent calls to stream! give similar errors, but referring to different columns/rows:
julia> ds = Data.stream!(f,Data.Table)
ERROR: CSV.CSVError("error parsing a `Int64` value on column 1, row 2; encountered '.'")
julia> ds = Data.stream!(f,Data.Table)
ERROR: CSV.CSVError("error parsing a `Int64` value on column 2, row 1; encountered '.'")
Any ideas? The CSV file seems OK to me, so I suspect I need to set some option/schema property.
Thanks.
The following does not work:
fh = GZip.open("my_file.gz")
source = CSV.Source(fh;
delim = ';', dateformat = "yyyymmdd",
header = true,
rows_for_type_detect = false,
use_mmap = false)
and fails with the error message
ERROR: TypeError: Type: in typeassert, expected Int64, got Bool
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}, ::GZip.GZipStream) at ./<
missing>:0
which is somewhat cryptic. Using the first few lines of the same file as an IOBuffer(string) works fine. Does this library handle GZip streams? If yes, an example would be useful. Using latest tagged GZip, master CSV. (Sorry for so many questions today, and thanks for your patience.)
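Incidentally, the typeassert error may simply come from rows_for_type_detect=false, since that keyword presumably expects an integer row count. As for gzipped input, one workaround sketch is to decompress into memory first and hand CSV a plain buffer:
# Workaround sketch: decompress fully, then give CSV an in-memory buffer.
using GZip, CSV
fh = GZip.open("my_file.gz")
buf = IOBuffer(read(fh))   # read the whole decompressed stream into memory
close(fh)
source = CSV.Source(buf; delim=';', dateformat="yyyymmdd", header=true)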
I was playing around with the NYC taxicab data and reading it in produces one particular column that is entirely empty:
using CSV
filename = "green_tripdata_2015-09.csv"
isfile(filename) || download("https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv", filename)
df = CSV.read(filename)
julia> df[:Ehail_fee]
1494926-element NullableArrays.NullableArray{WeakRefString{UInt8},1}:
#NULL
⋮
It looks like the field is missing for every row of the CSV file. However, I think it would be better to return something like NullableArray{Void} rather than the current NullableArray{WeakRefString}, since the latter produces a data frame that is hard to work with. For example, it cannot be saved with JLD:
julia> JLD.save("df.jld", "t", df[:Ehail_fee], compress=true)
ERROR: cannot write a pointer to JLD file
...
Below I describe three behaviors of CSV.read on malformed CSV files that I found unexpected.
I have the following file:
A;B;C
1,1,10
2,0,16
which is malformed: by mistake, ; is used in the header instead of ,.
The behavior of three standard utilities for reading such a file in Julia is:
- readcsv from Base loads the whole file and replaces missing column names with empty strings;
- readtable from DataFrames throws an error;
- CSV.read reads only a single column of data into a data frame.
Additionally, similar inconsistencies appear for a file in which a row is too short (behavior differs between readcsv and readtable; in at least one case readtable throws an error):
A,B,C
1,1,10
6,1
In the documentation of CSV.read I have not found these behaviors described, so I am not sure what the intended functionality is. I would recommend at least giving a warning in those cases.
I can understand the desire to return string columns as Vector{WeakRefString{UInt8}}, but I don't see any documentation of how long those references can be assumed to be valid. I assume that in the default case of a memory-mapped file the weak references point to memory locations in that memory-mapped file. Does the file handle persist, or is the user at the mercy of the garbage collector? Either approach seems rather risky to me.
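Until the lifetime is documented, the defensive option is to copy the strings out of the mapped buffer right after reading; a sketch (the exact element types depend on the container, e.g. Nullable wrappers would need unwrapping first):
# Defensive sketch: materialize weak references into owned Strings.
df = CSV.read("file.csv")
for name in names(df)
    col = df[name]
    if eltype(col) <: WeakRefString
        df[name] = map(String, col)  # String(::WeakRefString) copies the bytes out
    end
end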
This issue is for listing out and refining the desired functionality of working with CSV files. The options so far, more or less implemented:
- CSV.getfield(io::IOBuffer, ::Type{T}), which would allow for fairly seamless streaming code
- detect newlines as \r, \n, or \r\n and handle those three automatically
- specify the null strings CSV should expect, for each column individually or for every column
This is of course less feature-rich than pandas or data.table's fread, but I also had an epiphany of sorts the other day with regard to bazillion-feature CSV readers: they have to provide so many features because their languages suck. Think about it: pandas needs to provide all these crazy options and parsing-function capabilities because otherwise you'd have to do additional processing in Python, which defeats the purpose of using a nice C pandas implementation. Same with R, to some extent.
For CSV, I want to take the approach that if a certain feature can be done post-parsing as efficiently as we'd be able to do it while parsing, then we shouldn't support it. Julia is great and fast; don't be afraid of processing your ugly, misshapen CSV files. We want this implementation to be fast and simple, no need to clutter it with extraneous features. Sure, we can provide stuff that is convenient for this or that, but I really don't think we need to go overboard.
@johnmyleswhite @davidagold @jiahao @RaviMohan @StefanKarpinski
I'm trying to read a CSV export of my depot. However, that fails with the following error when running CSV.read("depot.csv"; delim=";").
ERROR: MethodError: no method matching rem(::String, ::Type{UInt8})
Closest candidates are:
rem(::Bool, ::Type{UInt8}) at int.jl:221
rem(::Int16, ::Type{UInt8}) at int.jl:208
rem(::Int32, ::Type{UInt8}) at int.jl:208
...
in #Source#6(::String, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::String) at /Users/paul/.julia/v0.5/CSV/src/Source.jl:25
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}, ::String) at ./<missing>:0
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::Type{DataFrames.DataFrame}) at /Users/paul/.julia/v0.5/CSV/src/Source.jl:294
in (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{DataFrames.DataFrame}) at ./<missing>:0 (repeats 2 times)
The first two lines look like this:
"Stück/Nom.";"Bezeichnung";"Notizen";"WKN";"Typ";"Währung";"Akt. Kurs";"Diff. abs";"Diff. %";"Wert in EUR";"Diff. abs";"Diff. %";"Datum";"Zeit";"Börse";"Kaufkurs in EUR";"Kaufwert in EUR";"Diff. zum Jahreshoch";"Vola";
"1";"ISHSIII-CORE MSCI WLD DLA";;"A0RPWH";"ETF";"EUR";"100,00";"-0,04";"-0,10%";"1000,00";"+100,00";"+10,00%";"28.12.2016";"14:06:06";"XETRA";"39,3525";"1000,00";"-1,00%";"16,26";
AFAIU automatic type detection does not take into account whether the extracted string was quoted or not, which leads to incorrect predictions for a column that contains quoted numbers as its first N elements. Checking for quotes should also be done when checking for null strings: a quoted string should automatically be regarded as non-null.
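To make the intent concrete, a hypothetical sketch of detection that takes quoting into account (detecttype and the wasquoted flag are not CSV's actual internals):
# Hypothetical sketch: a quoted token is text, and never the null sentinel.
function detecttype(token::AbstractString, wasquoted::Bool, nullstr::AbstractString)
    wasquoted && return String              # quoted => treat as text, non-null
    token == nullstr && return Missing      # unquoted null sentinel
    tryparse(Int, token)     === nothing || return Int
    tryparse(Float64, token) === nothing || return Float64
    return String
end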
Actually, I encountered this while trying to convert RDatasets into using DataTables (datasets/attenu.csv.gz). RDatasets has a nice test set of .csv files.
We're currently doing all sorts of inefficient stuff, including not even using our own efficient CSV.getfield type parsers for type detection.
Hi,
from what I read on julia-users and the CSV issue, you seem to be onto a very fast pattern for parsing files!
Do you think, that your approach will be applicable for parsing other file formats?
If so, it would be great to share your file and string types/methods with other IO libraries, probably by putting them into FileIO.jl or some generic file parser library.
It seems like what you came up with is also great for just parsing parts of the file, and editing the file while updating the parsed julia type (or the other way around).
Would be great to have this speed and functionality as a default for all Julia IO libraries.
@sjkelly maybe we can learn a thing or two for MeshIO.
Best,
Simon
I am processing a CSV file by rows. The file is huge, only certain rows are kept according to a filter function, and constructing a DataFrame would be prohibitively expensive and also unnecessary. So I would prefer to read into a tuple/vector, from which I would construct a composite type. Some fields need to be skipped.
Currently using something like this:
"Read a line of given `types` from `source`. A `Void` type skips the field."
function parseline(source::CSV.Source, types; row=0)
values = []
for (col,T) in enumerate(types)
if T == Void
# FIXME: could we just skip?
CSV.parsefield(source, Nullable{String}, row, col)
else
push!(values, CSV.parsefield(source, T, row, col))
end
end
# FIXME: test for end of line, raise error if not reached
values
end
Then I would do something like (pseudocode):
a = []
while !eof(source)
v = MyType(parseline(source, (Int, Nullable{Date}, Void, Int, Date, Void))...)
if keep_this_row(v)
push!(a, v)
end
end
Questions:
- Does eof(s::CSV.Source) = eof(s.io) make sense, or should I use something else?
- Is this the intended way to use parsefield?
- Or should I just use readsplitline and post-process that? But that way I can't make use of the missing value framework of CSV.