juliadata / csv.jl
Utility library for working with CSV and other delimited files in the Julia programming language
Home Page: https://csv.juliadata.org/
License: Other
I'd like to decide on the Julia structure that CSV.read() returns. Speak now or forever hold your peace (or write your own parser, I don't care). The current candidates are:
I'm leaning towards Dict{String,NullableArray{T}} as it's the most straightforward.
@johnmyleswhite @davidagold @StefanKarpinski @jiahao @RaviMohan
Currently, CSV.write is a pretty naive implementation. There's probably a lot that could be done to improve performance; see this post for some ideas to speed it up: http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/.
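One idea from that post, sketched below in Julia: accumulate rows in an in-memory buffer and flush it in large chunks instead of issuing many small writes. This is purely illustrative, not the actual CSV.write internals:
# Illustrative sketch, not the actual CSV.write implementation:
# buffer rows in memory and flush roughly every 1 MiB.
function writerows(io::IO, rows; delim=',')
    buf = IOBuffer()
    for row in rows
        join(buf, row, delim)   # write one row into the in-memory buffer
        write(buf, '\n')
        position(buf) > 2^20 && write(io, take!(buf))
    end
    write(io, take!(buf))       # flush the remainder
    return io
end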
Does CSV.jl have a way to write DataFrames to file? I couldn't find a way to construct a Sink from a DataFrame. I plan on writing 2 separate DataFrames to the same file, and I couldn't do this with DataFrame's writetable method.
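For reference, the shape of what I'm after, assuming CSV.write (or a Sink) can accept an already-open IO and append/header keywords (those keywords are an assumption on my part, not confirmed API):
# Hypothetical usage sketch; the append/header keywords are assumptions.
open("combined.csv", "w") do io
    CSV.write(io, df1)                             # first table, with its header
    CSV.write(io, df2; append=true, header=false)  # second table, no header row
end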
I'm trying to read the following file:
Name,Age,Children
John, 38., 3
Sally, 23., 1
Kirk, 64., 5
and get this:
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.5.0-rc2+0 (2016-08-12 11:25 UTC)
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org/ release
|__/                   |  x86_64-w64-mingw32
julia> using CSV
julia> x = CSV.Source("data.csv")
CSV.Source: data.csv
CSV.Options:
delim: ','
quotechar: '"'
escapechar: '\\'
null: ""
dateformat: Base.Dates.DateFormat(Base.Dates.Slot[],"","english")
Data.Schema:
rows: 3 cols: 3
Columns:
"Name" WeakRefString{UInt8}
"Age" Float64
"Children" Int64
julia> Data.getfield(x,Float64,1,2)
ERROR: CSV.CSVError("error parsing a `Float64` value on column 2, row 1; encountered 'J'")
in checknullend at C:\Users\anthoff\.julia\v0.5\CSV\src\parsefields.jl:52 [inlined]
in parsefield(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::Type{Float64}, ::CSV.Options, ::Int64, ::Int64, ::Base.RefValue{CSV.ParsingState}) at C:\Users\anthoff\.julia\v0.5\CSV\src\parsefields.jl:126
in getfield(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::Type{Float64}, ::Int64, ::Int64) at C:\Users\anthoff\.julia\v0.5\CSV\src\Source.jl:195
I'm on master for CSV and DataStreams.
Am I using the API in the wrong way, or is this a bug?
The appveyor link on the homepage points to https://ci.appveyor.com/project/JuliaData/documenter-jl/branch/master, which is clearly incorrect. I don't know what it should be, since https://ci.appveyor.com/project/JuliaData/CSV-jl seems to lead nowhere. Is this set up on appveyor at all?
When I read a csv file into a DataFrame like this:
df = DataFrame(CSV.csv("filename.csv"))
and that file has string columns, they end up showing as "#undef" when I show the DataFrame in the REPL, probably because the type of a string column in the DataFrame ends up as DataArray{DataStreams.Data.PointerString{T},1}? Could those actually end up as a normal UTF8String in the DataFrame?
using CSV
CSV.read("data.txt")
appears to leave the file open on Windows 7. If I try to delete the file data.txt in Windows Explorer, I get the dialog "File In Use; The action can't be completed because the file is open in Julia Programming Language; Close the file and try again". If I close Julia, I can delete the file. However, I can delete the file in Git Bash without closing Julia.
On Mac it seems to work fine.
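If the held handle comes from the file being memory-mapped (which would explain the Windows-only behavior, since Windows locks mapped files), disabling mmap may be a workaround; use_mmap appears as a keyword elsewhere in these reports, so something like:
julia> CSV.read("data.txt"; use_mmap=false)  # avoid mapping the file, if supported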
Versions
I wonder about the potential use/worth of a "lazily" read CSV file. My idea here is:
- CSV.File would be parsed as usual (with all properties set), storing a reference to the file's mmap and going through and figuring out all the field offsets.
- CSV.Table would have various indexing operations defined to "use" it, but all the actual value parsing would be delayed until physical access of those values. If a column was never accessed, its values would never actually be parsed.
My hope is that this would potentially allow an extremely fast "parsing" experience with some of the cost deferred until actual values are needed. The actual implementation may be tricky, in sorting out exactly how to tell whether a value has been parsed or not, but that's a separate concern; a rough sketch follows.
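A minimal sketch of the lazy-column idea, with all names hypothetical: store the byte range of each cell and parse a value only on first access, caching the result.
# Hypothetical sketch of a lazily-parsed column: byte offsets into an
# mmapped buffer, with values parsed (and cached) only on first access.
struct LazyColumn{T}
    buf::Vector{UInt8}               # the file's mmap
    offsets::Vector{UnitRange{Int}}  # byte range of each cell
    cache::Vector{Union{T,Nothing}}  # nothing = not parsed yet
end

LazyColumn{T}(buf, offsets) where {T} =
    LazyColumn{T}(buf, offsets, Vector{Union{T,Nothing}}(nothing, length(offsets)))

function Base.getindex(c::LazyColumn{T}, i::Int) where {T}
    v = c.cache[i]
    v === nothing || return v
    parsed = parse(T, String(c.buf[c.offsets[i]]))  # parse on demand
    c.cache[i] = parsed
    return parsed
end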
It comes up quite often: e.g. a dataset I am now working with encodes gender as M and F; another uses E, U, O for employment, unemployment, out of the labor force, etc. Using String is not optimal. I could define a method for parse(Char, str; raise=true) as suggested by the manual, or one for parsefield(io, Char, ...).
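The sketch I have in mind, as my own extension (not something CSV provides today):
# My own sketch: treat a single-character field as a Char.
function Base.parse(::Type{Char}, s::AbstractString; raise::Bool=true)
    length(s) == 1 && return first(s)
    raise && throw(ArgumentError("expected a single-character field, got $(repr(s))"))
    return nothing
end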
A typical CSV file from the Netflix Prize data set looks like:
3884821:
249897724,2,2001-11-26
483,3,2000-01-13
4875839,5,2059-07-27
and so on.
Reading this snippet into CSV throws an error:
julia> CSV.csv("test.txt")
ERROR: CSV.CSVError("error parsing a `Int64` value on column 1, row 3; encountered '-'")
in parsefield at /Users/test/.julia/v0.4/CSV/src/getfields.jl:86
in parsefield! at /Users/test/.julia/v0.4/CSV/src/Source.jl:216
in stream! at /Users/test/.julia/v0.4/CSV/src/Source.jl:230
in stream! at /Users/test/.julia/v0.4/DataStreams/src/DataStreams.jl:243
in csv at /Users/test/.julia/v0.4/CSV/src/Source.jl:284
I just did a test on a 1000-row dataset. I needed to set rows_for_type_detect to 1000, which may be slightly "unfair" compared to readcsv, which has no types. Still, CSV.read is far too slow: it takes 13 seconds instead of 0.04s. I note that the functions were already compiled in the example below. I was hoping that this works better now, as you indicated here:
https://groups.google.com/forum/#!searchin/julia-users/csv/julia-users/IFkPso4JUac/lNLgLoCqAwAJ
Any hints?
julia> f="T:\temp\julia1k.csv"
"T:\temp\julia1k.csv"
julia> @time f1=readcsv(f);
0.043854 seconds (239.86 k allocations: 8.536 MB)
julia> @time df=readtable(f);
0.039639 seconds (221.93 k allocations: 10.359 MB, 15.51% gc time)
julia> @time f2=CSV.read(f,rows_for_type_detect=1000);
13.760476 seconds (1.79 M allocations: 73.616 MB, 0.12% gc time)
julia> @show size(f1),size(f2),size(df)
(size(f1),size(f2),size(df)) = ((1000,77),(999,77),(999,77))
((1000,77),(999,77),(999,77))
julia> versioninfo(true)
Julia Version 0.4.1
Commit cbe1bee* (2015-11-08 10:33 UTC)
Platform Info:
System: Windows (x86_64-w64-mingw32)
CPU: Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
WORD_SIZE: 64
Microsoft Windows [Version 6.1.7601]
uname: MSYS_NT-6.1 2.3.0(0.290/5/3) 2015-09-29 10:48 x86_64 unknown
Memory: 31.694698333740234 GB (26403.6875 MB free)
Uptime: 1.1864766877332e6 sec
Load Avg: 0.0 0.0 0.0
Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz:
speed user nice sys idle irq ticks
#1 3410 MHz 4675942 0 2720907 1179080330 152865 ticks
#2 3410 MHz 609105 0 854667 1185013080 87454 ticks
#3 3410 MHz 6070357 0 9145699 1171260702 124348 ticks
#4 3410 MHz 786603 0 1347911 1184342104 18033 ticks
#5 3410 MHz 6533228 0 11563262 1168380019 145923 ticks
#6 3410 MHz 106033 0 37487 1186332833 1404 ticks
#7 3410 MHz 5059143 0 8723326 1172693743 114894 ticks
#8 3410 MHz 2913786 0 1522086 1182040247 29094 ticks
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.3
Environment:
.CLASSPATH = C:\Users\workstation\Documents\mongojdbcdriver
CLASSPATH = C:\Users\workstation\Documents\mongojdbcdriver
GROOVY_HOME = C:\Program Files (x86)\Groovy\Groovy-2.2.2
HOMEDRIVE = C:
HOMEPATH = \Users\workstation
JAVA_HOME = C:\Program Files\Java\jre8
JULIA_HOME = C:\Program Files\Juno\resources\app\julia\bin
PATHEXT = .COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC;.groovy;.gy
Package Directory: C:\Users\workstation\.julia
27 required packages:
julia>
I have no idea why, but when I try to read a matrix with CSV.read it appears to be much slower than the classic readdlm.
@time s = CSV.read("file",delim= ' ',rows_for_type_detect=1,header=false);
89.093956 seconds (796.97 M allocations: 18.175 GB, 5.53% gc time)
@time a = readdlm("file");
19.948536 seconds (466.06 k allocations: 4.317 GB, 8.12% gc time)
I wonder... what the hell am I doing wrong? The file is a matrix with 4845 rows and 24348 columns.
The FileIO package allows various file extensions to be registered so that other packages' read routines are invoked when those extensions are encountered. Would it make sense to register CSV.read for the .csv extension, and possibly others like .tsv? That way
load("myfile.csv")
could be used instead of remembering whether the call is CSV.read or read_csv or readtable or ...
https://juliadata.github.io/CSV.jl/stable yields a 404 error.
Right now the optional nullable argument to CSV.read makes either all of the columns NullableVector types or none of them. Could an optional argument be added so that, when nullable = true, any columns that do not contain nulls are "unnullified" after being read? The code could be as simple as
unnullify(x::NullableArray) = anynull(x) ? x : Array(x)
although I haven't looked inside the package to see exactly where this could be done.
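Usage would then be a single cleanup pass after the read, along these lines (sketch; assumes NullableArrays' anynull and a freshly-read DataFrame df):
# Sketch of the post-read pass over every column.
unnullify(x::NullableArray) = anynull(x) ? x : Array(x)
for name in names(df)
    df[name] = unnullify(df[name])
end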
If I run
@elapsed pvs = CSV.read("chunky.csv")
the outcome is about 30 seconds, and the result fits comfortably into memory. However, if I run
@elapsed pvs = CSV.read("chunky.csv", nullable=false, types=correct_types)
I end up having to kill the REPL as the read consumes my entire memory and swap. The csv is perfectly formed with no null values. Is there a good reason for this happening? Otherwise I find it counter-intuitive that a bunch of Nullable arrays containing data would be smaller than the data itself. Is something fancy like mmap going on?
EDIT: I think the damage is being done by supplying types, but I'm still not sure why. I am also on the current master branch.
When doing a CSV.read I'm getting an undefined variable error. It looks like the last commit changed the parameter name from "f" to "io" but didn't update the variables in the function "countlines".
julia> CSV.read("../ndsparse_use_cases/relations/market_to_bu_relation.csv", types = [String,String])
ERROR: UndefVarError: f not defined
in countlines(::Base.AbstractIOBuffer{Array{UInt8,1}}, ::UInt8, ::UInt8) at /home/jnelson/.julia/v0.5/CSV/src/io.jl:79
in #Source#3(::String, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}) at /home/jnelson/.julia/v0.5/CSV/src/Source.jl:76
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
in #Source#2(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::String) at /home/jnelson/.julia/v0.5/CSV/src/Source.jl:39
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}, ::String) at ./<missing>:0
in #read#4(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::CSV.#read, ::String, ::Type{T}) at /home/jnelson/.julia/v0.5/CSV/src/Source.jl:274
in (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{T}) at ./<missing>:0 (repeats 2 times)
in eval(::Module, ::Any) at ./boot.jl:234
in macro expansion at ./REPL.jl:92 [inlined]
in (::Base.REPL.##1#2{Base.REPL.REPLBackend})() at ./event.jl:46
This would involve chunking up the input file so that the basic @threads for ... multi-threading interface can be used; a rough sketch follows.
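Roughly what I have in mind: snap chunk boundaries to row starts so no row is split across threads (find_row_start and parse_rows are hypothetical helpers):
# Sketch only; find_row_start and parse_rows are hypothetical helpers.
using Base.Threads

function parse_chunks(buf::Vector{UInt8}, nchunks::Int)
    step = div(length(buf), nchunks)
    # move each boundary forward to the start of the next row
    bounds = [1; [find_row_start(buf, i * step) for i in 1:nchunks-1]; length(buf) + 1]
    results = Vector{Any}(undef, nchunks)
    @threads for i in 1:nchunks
        results[i] = parse_rows(buf, bounds[i], bounds[i+1] - 1)
    end
    return results
end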
WARNING: Method definition read(Base.AbstractIOBuffer, Type{UInt8}) in module Base
overwritten in module CSV at C:\Users\amellnik\.julia\v0.4\CSV\src\Source.jl:13.
It looks like the base method was recently added: JuliaLang/julia@bb744fd
This used to work. Could use some help here:
julia> CSV.read("file.csv",types = [String,String,String],nullable=false)
ERROR: TypeError: streamto!: in typeassert, expected String, got WeakRefStrings.WeakRefString{UInt8}
in streamto!(::DataFrames.DataFrame, ::Type{DataStreams.Data.Field}, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::Type{String}, ::Type{String}, ::Int64, ::Int64, ::DataStreams.Data.Schema{true}, ::Base.#identity) at /home/jeff/.julia/v0.5/DataStreams/src/DataStreams.jl:172
in stream!(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrame, ::DataStreams.Data.Schema{true}, ::DataStreams.Data.Schema{true}, ::Array{Function,1}) at /home/jeff/.julia/v0.5/DataStreams/src/DataStreams.jl:186
in #stream!#5(::Array{Any,1}, ::Function, ::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::Type{DataFrames.DataFrame}, ::Bool, ::Dict{Int64,Function}) at /home/jeff/.julia/v0.5/DataStreams/src/DataStreams.jl:150
in stream!(::CSV.Source{Base.AbstractIOBuffer{Array{UInt8,1}}}, ::Type{DataFrames.DataFrame}, ::Bool, ::Dict{Int64,Function}) at /home/jeff/.julia/v0.5/DataStreams/src/DataStreams.jl:144
in #read#21(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::Type{DataFrames.DataFrame}) at /home/jeff/.julia/v0.5/CSV/src/Source.jl:248
in (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{DataFrames.DataFrame}) at ./<missing>:0 (repeats 2 times)
Prior to this I had just reinstalled Julia from source from the master branch and ran the following
Pkg.add("DataFrames")
Pkg.add("CSV")
using DataFrames
using CSV
and this is the truncated output, skipping over the deprecation warnings
while loading /home/cjprybol/.julia/v0.5/Docile/src/Extensions/Extensions.jl, in expression starting on line 16
ERROR: LoadError: `UTF16String` has been moved to the package LegacyStrings.jl:
Run Pkg.add("LegacyStrings") to install LegacyStrings on Julia v0.5-;
Then do `using LegacyStrings` to get `UTF16String`.
in include_from_node1(::String) at ./loading.jl:426
in macro expansion; at ./none:2 [inlined]
in anonymous at ./<missing>:?
in eval(::Module, ::Any) at ./boot.jl:234
in process_options(::Base.JLOptions) at ./client.jl:239
in _start() at ./client.jl:318
while loading /home/cjprybol/.julia/v0.5/WeakRefStrings/src/WeakRefStrings.jl, in expression starting on line 30
ERROR: LoadError: Failed to precompile WeakRefStrings to /home/cjprybol/.julia/lib/v0.5/WeakRefStrings.ji
in compilecache(::String) at ./loading.jl:505
in require(::Symbol) at ./loading.jl:337
in include_from_node1(::String) at ./loading.jl:426
in macro expansion; at ./none:2 [inlined]
in anonymous at ./<missing>:?
in eval(::Module, ::Any) at ./boot.jl:234
in process_options(::Base.JLOptions) at ./client.jl:239
in _start() at ./client.jl:318
while loading /home/cjprybol/.julia/v0.5/CSV/src/CSV.jl, in expression starting on line 4
ERROR: Failed to precompile CSV to /home/cjprybol/.julia/lib/v0.5/CSV.ji
in compilecache(::String) at ./loading.jl:505
in require(::Symbol) at ./loading.jl:364
julia> Pkg.add("LegacyStrings")
INFO: Cloning cache of LegacyStrings from https://github.com/JuliaArchive/LegacyStrings.jl.git
INFO: Installing LegacyStrings v0.1.1
INFO: Package database updated
julia> using LegacyStrings
WARNING: could not import Base.lastidx into LegacyStrings
WARNING: using LegacyStrings.ascii in module Main conflicts with an existing identifier.
WARNING: using LegacyStrings.utf8 in module Main conflicts with an existing identifier.
julia> using CSV
WARNING: both LegacyStrings and Base export "ASCIIString"; uses of it in module Main must be qualified
WARNING: both LegacyStrings and Base export "ByteString"; uses of it in module Main must be qualified
WARNING: both LegacyStrings and Base export "UTF8String"; uses of it in module Main must be qualified
julia> using CSV
INFO: Precompiling module CSV...
ERROR: LoadError: `UTF16String` has been moved to the package LegacyStrings.jl:
Run Pkg.add("LegacyStrings") to install LegacyStrings on Julia v0.5-;
Then do `using LegacyStrings` to get `UTF16String`.
in include_from_node1(::String) at ./loading.jl:426
in macro expansion; at ./none:2 [inlined]
in anonymous at ./<missing>:?
in eval(::Module, ::Any) at ./boot.jl:234
in process_options(::Base.JLOptions) at ./client.jl:239
in _start() at ./client.jl:318
while loading /home/cjprybol/.julia/v0.5/WeakRefStrings/src/WeakRefStrings.jl, in expression starting on line 30
ERROR: LoadError: Failed to precompile WeakRefStrings to /home/cjprybol/.julia/lib/v0.5/WeakRefStrings.ji
in compilecache(::String) at ./loading.jl:505
in require(::Symbol) at ./loading.jl:337
in include_from_node1(::String) at ./loading.jl:426
in macro expansion; at ./none:2 [inlined]
in anonymous at ./<missing>:?
in eval(::Module, ::Any) at ./boot.jl:234
in process_options(::Base.JLOptions) at ./client.jl:239
in _start() at ./client.jl:318
while loading /home/cjprybol/.julia/v0.5/CSV/src/CSV.jl, in expression starting on line 4
ERROR: Failed to precompile CSV to /home/cjprybol/.julia/lib/v0.5/CSV.ji
in compilecache(::String) at ./loading.jl:505
in require(::Symbol) at ./loading.jl:364
This issue is for tracking progress towards the next major release, corresponding to the broader release of Julia 0.5:
- CSV.read(io_or_file, sink, args...; kwargs...)
- CSV.write(io_or_file, source, args...; kwargs...)
Pkg.test("CSV") fails on Windows. If the line endings of the following files are converted to unix style, then the tests pass:
baseball.csv
stocks.csv
test_utf8.csv
test_single_column.csv
test_empty_file_newlines.csv
It would be useful if quotes did not always have to open and close with the same character. I have a delimited format that looks like
{reddish}, {red, ish}
{darkish}, {dark, ish}
{greenish}, {green, ish}
....
{greyish}, {grey, ish}
where every cell is quoted using { and }.
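What I'd imagine the API looking like, purely hypothetically (openquotechar/closequotechar do not exist in CSV today, which currently supports only a single quotechar):
# Hypothetical keywords for asymmetric quoting.
df = CSV.read("colors.csv"; openquotechar='{', closequotechar='}', header=false)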
Following this SO post, I'd like to do some processing on a CSV, along the lines of:
infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"
data = readcsv(infile; header=true)
map!(replace_nulls, data[1])
writecsv(outfile, data; header=true)
Except that this doesn't work AFAICT, since writecsv doesn't support a header keyword. (The workaround is probably fine, but I'm personally having some silly issue with it.) Is this something supported by CSV.jl? Would I need to read/write a DataFrame, or could I just iterate over lines?
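For reference, the shape of the workaround I'm fumbling with (sketch; replace_nulls is my own function, and the header row is written back manually since writedlm has no header keyword):
# Round-trip sketch using Base's delimited-file functions.
using DelimitedFiles   # needed on newer Julia; in Base on 0.x

infile = "/path/to/input.csv"
outfile = "/path/to/output.csv"
data, header = readdlm(infile, ','; header=true)  # header comes back as a 1×n matrix
data = map(replace_nulls, data)                   # replace_nulls is user-defined
open(outfile, "w") do io
    writedlm(io, header, ',')  # write the header row back first
    writedlm(io, data, ',')
end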
Right now, this is handled by the quotefields::Bool keyword in various methods/constructors, but I think it would be better to just detect when fields need quoting and take care of it automatically; see the sketch below.
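Detection itself is cheap; something along these lines (a sketch, not the actual writer code):
# Sketch: quote a field only when it actually needs it.
function writefield(io::IO, s::AbstractString, delim::Char=',', quotechar::Char='"', escapechar::Char='\\')
    if any(c -> c == delim || c == quotechar || c == '\n' || c == '\r', s)
        print(io, quotechar)
        for c in s
            c == quotechar && print(io, escapechar)  # escape embedded quotes
            print(io, c)
        end
        print(io, quotechar)
    else
        print(io, s)
    end
end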
It would be nice/helpful/awesome if CSV.write("foo.csv", some_dataframe, append = true, header = true) would append the rows to an existing file, writing the header only when the file doesn't already contain one.
I have a CSV file with 1156 rows and 3 columns. Most of the entries in column 3 are NAs and some of them contain Int64s. When reading with CSV.read, the type of column 3 comes out as Nullable{WeakRefString{UInt8}} when it should ideally have been Nullable{Int64}. When I manually put the first entry of the third column in the 101st row, it is read as Nullable{Int64}, but the column is considered to be of type Nullable{WeakRefString{UInt8}} if the first entry is in row 102 or later. So I presume the first 101 rows are being used to infer the type of the column. How can we correctly infer the type if the first non-null element is beyond row 101?
This is the link to the sample csv file. The function to read the file is CSV.read("sample.csv").
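Two workarounds that should help in the meantime, sketched (whether types accepts a Dict from column index to type is an assumption here; rows_for_type_detect appears as a keyword elsewhere in these reports):
# Widen the detection window past the leading all-null rows...
df = CSV.read("sample.csv"; rows_for_type_detect=200)
# ...or declare the column's type up front (Dict form assumed).
df = CSV.read("sample.csv"; types=Dict(3 => Nullable{Int64}))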
On Julia 0.5:
julia> using Requests
julia> using CSV
julia> stream = Requests.get_streaming("https://raw.githubusercontent.com/JuliaData/CSV.jl/master/test/test_files/test_utf8.csv")
ResponseStream(Request(https://raw.githubusercontent.com/JuliaData/CSV.jl/master/test/test_files/test_utf8.csv, 3 headers, 0 bytes in body))
julia> CSV.read(stream)
ERROR: StackOverflowError:
in #Source#7(::Requests.ResponseStream{MbedTLS.SSLContext}, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}) at /root/.julia/v0.5/CSV/src/Source.jl:0
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
in #Source#6(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::Requests.ResponseStream{MbedTLS.SSLContext}) at /root/.julia/v0.5/CSV/src/Source.jl:25
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::Requests.ResponseStream{MbedTLS.SSLContext}, ::Type{DataFrames.DataFrame}) at /root/.julia/v0.5/CSV/src/Source.jl:294
in #Source#7(::Requests.ResponseStream{MbedTLS.SSLContext}, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}) at /root/.julia/v0.5/CSV/src/Source.jl:57
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
in #Source#6(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::Requests.ResponseStream{MbedTLS.SSLContext}) at /root/.julia/v0.5/CSV/src/Source.jl:25
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::Requests.ResponseStream{MbedTLS.SSLContext}, ::Type{DataFrames.DataFrame}) at /root/.julia/v0.5/CSV/src/Source.jl:294
in #Source#7(::Requests.ResponseStream{MbedTLS.SSLContext}, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}) at /root/.julia/v0.5/CSV/src/Source.jl:57
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
in #Source#6(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::Requests.ResponseStream{MbedTLS.SSLContext}) at /root/.julia/v0.5/CSV/src/Source.jl:25
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::Requests.ResponseStream{MbedTLS.SSLContext}, ::Type{DataFrames.DataFrame}) at /root/.julia/v0.5/CSV/src/Source.jl:294
...
in #Source#7(::Requests.ResponseStream{MbedTLS.SSLContext}, ::CSV.Options, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}) at /root/.julia/v0.5/CSV/src/Source.jl:57
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}) at ./<missing>:0
in #Source#6(::UInt8, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::Requests.ResponseStream{MbedTLS.SSLContext}) at /root/.julia/v0.5/CSV/src/Source.jl:25
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::Requests.ResponseStream{MbedTLS.SSLContext}, ::Type{DataFrames.DataFrame}) at /root/.julia/v0.5/CSV/src/Source.jl:294
in read(::Requests.ResponseStream{MbedTLS.SSLContext}) at /root/.julia/v0.5/CSV/src/Source.jl:287
With the latest commit I get the following error if I load SQLite (and so CSV):
ERROR: LoadError: LoadError: error in method definition: function Core.getfield must be explicitly imported to be extended
in include(::ASCIIString) at ./boot.jl:264
in include_from_node1(::ASCIIString) at ./loading.jl:417
in include(::ASCIIString) at ./boot.jl:264
in include_from_node1(::ASCIIString) at ./loading.jl:417
in eval(::Module, ::Any) at ./boot.jl:267
[inlined code] from ./sysimg.jl:14
in require(::Symbol) at ./loading.jl:348
in eval(::Module, ::Any) at ./boot.jl:267
while loading /home/martin/.julia/v0.5/CSV/src/getfields.jl, in expression starting on line 458
while loading /home/martin/.julia/v0.5/CSV/src/CSV.jl, in expression starting on line 89
Versioninfo:
julia> versioninfo()
Julia Version 0.5.0-dev+3123
Commit 01dd5ec (2016-03-12 05:08 UTC)
Platform Info:
System: Linux (x86_64-linux-gnu)
CPU: Intel(R) Core(TM) i3-4010U CPU @ 1.70GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.7.1 (ORCJIT, haswell)
I'm having trouble parsing tab-separated files with non-string columns (as inferred from the data or explicitly declared via the types keyword) containing null values (denoted by adjacent delimiters). The parser grabs the following cell, as can be seen in the following example (Julia 0.5, CSV.jl 0.1.1):
$ cat /tmp/foo.txt
A B C D
1 2016 x 100
2 2014 200
julia> CSV.read("/tmp/foo.txt"; delim='\t')
2×4 DataFrames.DataFrame
│ Row │ A │ B │ C │ D │
│ 1 │ 1 │ 2016 │ "x" │ 100 │
│ 2 │ 2 │ 2014 │ "200" │ #NULL │
whereas the expected result is:
│ Row │ A │ B │ C │ D │
│ 1 │ 1 │ 2016 │ "x" │ 100 │
│ 2 │ 2 │ #NULL │ "2014" │ 200 │
If column C in row 2 did not contain a value of the same inferred type as column B of row 1 (Int64) then the parser would fail, unable to parse the cell as an integer, e.g.
$ cat /tmp/bar.txt
A B C D
1 2016 x 100
2 y 200
julia> CSV.read("/tmp/bar.txt"; delim='\t')
ERROR: CSV.CSVError("error parsing a `Int64` value on column 2, row 2; encountered 'y'")
in checknullend at /Users/josh/.julia/v0.5/CSV/src/parsefields.jl:56 [inlined]
in parsefield at /Users/josh/.julia/v0.5/CSV/src/parsefields.jl:127 [inlined]
in parsefield at /Users/josh/.julia/v0.5/CSV/src/parsefields.jl:107 [inlined]
in streamfrom(::CSV.Source, ::Type{DataStreams.Data.Field}, ::Type{Nullable{Int64}}, ::Int64, ::Int64) at /Users/josh/.julia/v0.5/CSV/src/Source.jl:185
in streamto!(::DataFrames.DataFrame, ::Type{DataStreams.Data.Field}, ::CSV.Source, ::Type{Nullable{Int64}}, ::Type{Nullable{Int64}}, ::Int64, ::Int64, ::Data
Streams.Data.Schema{true}, ::Base.#identity) at /Users/josh/.julia/v0.5/DataStreams/src/DataStreams.jl:171
in stream!(::CSV.Source, ::Type{DataStreams.Data.Field}, ::DataFrames.DataFrame, ::DataStreams.Data.Schema{true}, ::DataStreams.Data.Schema{true}, ::Array{Fu
nction,1}) at /Users/josh/.julia/v0.5/DataStreams/src/DataStreams.jl:185
in #stream!#5(::Array{Any,1}, ::Function, ::CSV.Source, ::Type{DataFrames.DataFrame}, ::Bool, ::Dict{Int64,Function}) at /Users/josh/.julia/v0.5/DataStreams/
src/DataStreams.jl:149
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::Type{DataFrames.DataFrame}) at /Users/josh/.julia/v0.5/CSV/src/Source.jl:
289
in (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{DataFrames.DataFrame}) at ./<missing>:0 (repeats 2 times)
The expected result would be:
│ Row │ A │ B │ C │ D │
│ 1 │ 1 │ 2016 │ "x" │ 100 │
│ 2 │ 2 │ #NULL │ "y" │ 200 │
Interestingly, if the delimiter is changed to a comma in the file, it loads as expected, which makes me think the tab is being eaten as whitespace somewhere.
Also, string fields do not appear to have this problem.
$ cat /tmp/baz.txt
A B C D
1 P x 100
2 y 200
julia> CSV.read("/tmp/baz.txt"; delim='\t')
2×4 DataFrames.DataFrame
│ Row │ A │ B │ C │ D │
│ 1 │ 1 │ "P" │ "x" │ 100 │
│ 2 │ 2 │ #NULL │ "y" │ 200 │
which matches the expected result.
Hello,
I have a CSV file named test.csv like this:
ticker,dt,bid,ask
EUR/USD,20140101 21:55:34.378,1.37622,1.37693
EUR/USD,20140101 21:55:40.410,1.37624,1.37698
EUR/USD,20140101 21:55:47.210,1.37619,1.37696
EUR/USD,20140101 21:55:57.963,1.37616,1.37696
EUR/USD,20140101 21:56:03.117,1.37616,1.37694
EUR/USD,20140101 21:56:07.254,1.37616,1.37692
EUR/USD,20140101 21:56:16.911,1.3762,1.37695
EUR/USD,20140101 21:56:19.433,1.37615,1.37692
EUR/USD,20140101 21:56:24.971,1.37615,1.37691
EUR/USD,20140101 21:56:24.972,1.37615,1.37689
I'd like to load it using CSV.jl, but it raises an error:
julia> CSV.read("test.csv")
ERROR: argument is an abstract type; size is indeterminate
in call at /Users/femto/.julia/v0.4/DataStreams/src/DataStreams.jl:189
in stream! at /Users/femto/.julia/v0.4/DataStreams/src/DataStreams.jl:196
in read at /Users/femto/.julia/v0.4/CSV/src/Source.jl:294
I have no idea how to fix this.
Kind regards
I'm trying to read a big dataset using CSV.read(), but it is very slow for my purpose. The dataset has 1 billion rows and 19 columns; after reading, conversions are made to extract the data type from the Nullable type. Any suggestions on how to read it faster? Is there any method to perform parallel reading?
Hi,
I am interested in giving this package a try, but am having some difficulty reading my first CSV file.
I start by defining a Source:
f = CSV.Source("my.csv")
Then I try to stream! this to a Data.Table:
julia> ds = Data.stream!(f,Data.Table)
ERROR: CSV.CSVError("error parsing a `Int64` value on column 1, row 3; encountered '.'")
Here is a snapshot of the first data values (below the headers) when opening the CSV file in Notepad (screenshot not reproduced).
When I inspect f.schema.header, it read the headers correctly, and when I inspect f.schema.types, it inferred the types correctly, i.e. 2 columns of Int64 and 20 columns of Float64.
Subsequent calls to stream! give similar errors, but referring to different columns/rows:
julia> ds = Data.stream!(f,Data.Table)
ERROR: CSV.CSVError("error parsing a `Int64` value on column 1, row 2; encountered '.'")
julia> ds = Data.stream!(f,Data.Table)
ERROR: CSV.CSVError("error parsing a `Int64` value on column 2, row 1; encountered '.'")
Any ideas? The CSV file seems OK to me, so I suspect I need to set some option/schema property.
Thanks.
The following does not work:
fh = GZip.open("my_file.gz")
source = CSV.Source(fh;
delim = ';', dateformat = "yyyymmdd",
header = true,
rows_for_type_detect = false,
use_mmap = false)
and fails with the error message
ERROR: TypeError: Type: in typeassert, expected Int64, got Bool
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}, ::GZip.GZipStream) at ./<
missing>:0
which is somewhat cryptic. Using the first few lines of the same file as an IOBuffer(string) works fine. Does this library handle GZip streams? If yes, an example would be useful. Using latest tagged GZip, master CSV. (Sorry for so many questions today, and thanks for your patience.)
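Incidentally, the typeassert error may simply come from rows_for_type_detect=false, since that keyword presumably expects an integer row count. As for gzipped input, one workaround sketch is to decompress into memory first and hand CSV a plain buffer:
# Workaround sketch: decompress fully, then give CSV an in-memory buffer.
using GZip, CSV
fh = GZip.open("my_file.gz")
buf = IOBuffer(read(fh))   # read the whole decompressed stream into memory
close(fh)
source = CSV.Source(buf; delim=';', dateformat="yyyymmdd", header=true)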
I was playing around with the NYC taxicab data and reading it in produces one particular column that is entirely empty:
using CSV
filename = "green_tripdata_2015-09.csv"
isfile(filename) || download("https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv", filename)
df = CSV.read(filename)
julia> df[:Ehail_fee]
1494926-element NullableArrays.NullableArray{WeakRefString{UInt8},1}:
#NULL
⋮
It looks like the field is missing for every row of the CSV file. However, I think it would be better to return something like NullableArray{Void} rather than the current NullableArray{WeakRefString}, since the latter produces a data frame that is hard to work with. For example, it cannot be saved with JLD:
julia> JLD.save("df.jld", "t", df[:Ehail_fee], compress=true)
ERROR: cannot write a pointer to JLD file
...
Below I describe three behaviors of CSV.read on malformed CSV files that I found unexpected.
I have the following file:
A;B;C
1,1,10
2,0,16
which is malformed: by mistake, ; is used in the header instead of ,.
The behavior of three standard utilities for reading such a file in Julia is:
- readcsv from Base loads the whole file and replaces missing column names with empty strings;
- readtable from DataFrames throws an error;
- CSV.read reads only a single column of data into a data frame.
Additionally, similar inconsistencies appear for a file in which a row is too short (behavior differs between readcsv and readtable; in at least one case readtable throws an error):
A,B,C
1,1,10
6,1
In the documentation of CSV.read I have not found these behaviors described, so I am not sure what the intended functionality is. I would recommend at least giving a warning in those cases.
I can understand the desire to return string columns as Vector{WeakRefString{UInt8}}, but I don't see any documentation of how long those references can be assumed to be valid. I assume that in the default case of a memory-mapped file the weak references point to memory locations in that memory-mapped file. Does the file handle persist, or is the user at the mercy of the garbage collector? Either approach seems rather risky to me.
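Until the lifetime is documented, the defensive option is to copy the strings out of the mapped buffer right after reading; a sketch (the exact element types depend on the container, e.g. Nullable wrappers would need unwrapping first):
# Defensive sketch: materialize weak references into owned Strings.
df = CSV.read("file.csv")
for name in names(df)
    col = df[name]
    if eltype(col) <: WeakRefString
        df[name] = map(String, col)  # String(::WeakRefString) copies the bytes out
    end
end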
This issue is for listing out and refining the desired functionality of working with CSV files. The options so far, more or less implemented:
- CSV.getfield(io::IOBuffer, ::Type{T}), which would allow for fairly seamless streaming code
- detect newlines as \r, \n, or \r\n and handle those three automatically
- specify the null strings CSV should expect, for each column individually or for every column
This is of course less feature-rich than pandas or data.table's fread, but I also had an epiphany of sorts the other day with regard to bazillion-feature CSV readers: they have to provide so many features because their languages suck. Think about it: pandas needs to provide all these crazy options and parsing-function capabilities because otherwise you'd have to do additional processing in Python, which defeats the purpose of using a nice C pandas implementation. Same with R, to some extent.
For CSV, I want to take the approach that if a certain feature can be done post-parsing as efficiently as we'd be able to do it while parsing, then we shouldn't support it. Julia is great and fast; don't be afraid of processing your ugly, misshapen CSV files. We want this implementation to be fast and simple, no need to clutter it with extraneous features. Sure, we can provide stuff that is convenient for this or that, but I really don't think we need to go overboard.
@johnmyleswhite @davidagold @jiahao @RaviMohan @StefanKarpinski
I'm trying to read a CSV export of my depot. However, that fails with the following error when running CSV.read("depot.csv"; delim=";").
ERROR: MethodError: no method matching rem(::String, ::Type{UInt8})
Closest candidates are:
rem(::Bool, ::Type{UInt8}) at int.jl:221
rem(::Int16, ::Type{UInt8}) at int.jl:208
rem(::Int32, ::Type{UInt8}) at int.jl:208
...
in #Source#6(::String, ::UInt8, ::UInt8, ::String, ::Int64, ::Int64, ::Array{DataType,1}, ::Bool, ::Bool, ::Base.Dates.DateFormat, ::Int64, ::Int64, ::Int64, ::Bool, ::Type{T}, ::String) at /Users/paul/.julia/v0.5/CSV/src/Source.jl:25
in (::Core.#kw#Type)(::Array{Any,1}, ::Type{CSV.Source}, ::String) at ./<missing>:0
in #read#23(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::Type{DataFrames.DataFrame}) at /Users/paul/.julia/v0.5/CSV/src/Source.jl:294
in (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{DataFrames.DataFrame}) at ./<missing>:0 (repeats 2 times)
The first two lines look like this:
"Stück/Nom.";"Bezeichnung";"Notizen";"WKN";"Typ";"Währung";"Akt. Kurs";"Diff. abs";"Diff. %";"Wert in EUR";"Diff. abs";"Diff. %";"Datum";"Zeit";"Börse";"Kaufkurs in EUR";"Kaufwert in EUR";"Diff. zum Jahreshoch";"Vola";
"1";"ISHSIII-CORE MSCI WLD DLA";;"A0RPWH";"ETF";"EUR";"100,00";"-0,04";"-0,10%";"1000,00";"+100,00";"+10,00%";"28.12.2016";"14:06:06";"XETRA";"39,3525";"1000,00";"-1,00%";"16,26";
AFAIU automatic type detection does not take into account whether the extracted string was quoted or not, which leads to incorrect predictions for a column that contains quoted numbers as its first N elements. Checking for quotes should also be done when checking for null strings: a quoted string should automatically be regarded as non-null.
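To make the intent concrete, a hypothetical sketch of detection that takes quoting into account (detecttype and the wasquoted flag are not CSV's actual internals):
# Hypothetical sketch: a quoted token is text, and never the null sentinel.
function detecttype(token::AbstractString, wasquoted::Bool, nullstr::AbstractString)
    wasquoted && return String              # quoted => treat as text, non-null
    token == nullstr && return Missing      # unquoted null sentinel
    tryparse(Int, token)     === nothing || return Int
    tryparse(Float64, token) === nothing || return Float64
    return String
end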
Actually, I encountered this while trying to convert RDatasets into using DataTables (datasets/attenu.csv.gz). RDatasets has a nice test set of .csv files.
We're currently doing all sorts of inefficient stuff, including not even using our own efficient CSV.getfield type parsers for type detection.
Hi,
from what I read on julia-users and the CSV issue, you seem to be onto a very fast pattern for parsing files!
Do you think, that your approach will be applicable for parsing other file formats?
If so, it would be great to share your file and string types/methods with other IO libraries, probably by putting them into FileIO.jl or some generic file parser library.
It seems like what you came up with is also great for just parsing parts of the file, and editing the file while updating the parsed julia type (or the other way around).
Would be great to have this speed and functionality as a default for all Julia IO libraries.
@sjkelly maybe we can learn a thing or two for MeshIO.
Best,
Simon
I am processing a CSV file by rows. The file is huge, only certain rows are kept according to a filter function, and constructing a DataFrame would be prohibitively expensive and also unnecessary. So I would prefer to read into a tuple/vector, from which I would construct a composite type. Some fields need to be skipped.
Currently using something like this:
"Read a line of given `types` from `source`. A `Void` type skips the field."
function parseline(source::CSV.Source, types; row=0)
values = []
for (col,T) in enumerate(types)
if T == Void
# FIXME: could we just skip?
CSV.parsefield(source, Nullable{String}, row, col)
else
push!(values, CSV.parsefield(source, T, row, col))
end
end
# FIXME: test for end of line, raise error if not reached
values
end
Then I would do something like (pseudocode):
a = []
while !eof(source)
v = MyType(parseline(source, (Int, Nullable{Date}, Void, Int, Date, Void))...)
if keep_this_row(v)
push!(a, v)
end
end
Questions:
- Does eof(s::CSV.Source) = eof(s.io) make sense, or should I use something else?
- Is this the intended way to use parsefield?
- Or should I just use readsplitline and post-process that? But that way I can't make use of the missing value framework of CSV.