Comments (6)
Great question. There's actually new functionality that should make this very easy. It would look something like:
num_columns = [input # of columns in the file here]
CSV.read(infile, CSV.Sink, outfile; transforms=Dict(i=>replace_nulls for i = 1:num_columns))
Basically, this will open the input CSV file, read it field by field, and apply the replace_nulls function to each parsed field before writing it out to the CSV output file, with no materializing of a DataFrame in between. Note, however, that your replace_nulls function will need to handle every input type it may encounter in your input file, but you probably already accounted for that.
Let me know if you run into any issues; the transforms=Dict() functionality is brand new, but I've tried to put it through some good testing before the release.
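To make the "field by field, no DataFrame in between" idea concrete, here is a minimal sketch in plain Julia. It is not how CSV.jl or DataStreams implement it: stream_transform is a hypothetical helper, the code targets current Julia 1.x rather than the 0.5 used in this thread, and it assumes comma-delimited input with no quoting or embedded newlines.

```julia
# Naive streaming transform: NOT CSV.jl's implementation.
# Assumes comma-delimited fields with no quoting or embedded newlines.

# Hypothetical per-field transform, mirroring the replace_nulls idea above.
replace_nulls(field::AbstractString) = field == "None" ? "" : field

# Read `input` line by line, transform each field, and write it to `output`.
# Only one line is held in memory at a time.
function stream_transform(transform, input::IO, output::IO)
    for line in eachline(input)
        fields = split(line, ','; keepempty=true)
        println(output, join(map(transform, fields), ','))
    end
    return output
end

# In-memory buffers stand in for the input and output files.
src = IOBuffer("a,b,c\n1,None,x\nNone,,3\n")
dst = IOBuffer()
stream_transform(replace_nulls, src, dst)
result = String(take!(dst))
```

With real files, the same function could be called inside nested open(...) do blocks; a real CSV parser is still needed as soon as quoted fields are involved.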
from csv.jl.
Cool!
I got this error while trying it out:
import CSV
in_csv = CSV.Source(joinpath(file_path, in_file))
num_columns = length(CSV.readsplitline(in_csv))
CSV.read(joinpath(file_path, in_file), CSV.Sink, joinpath(file_path, out_file);
transforms=Dict(i=>replace_nulls for i = 1:num_columns))
MethodError: no method matching transform(::DataStreams.Data.Schema{true}, ::Dict{Int64,#replace_nulls})
Closest candidates are:
transform(::DataStreams.Data.Schema{RowsAreKnown}, !Matched::Dict{Int64,Function}) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:70
transform(::DataStreams.Data.Schema{RowsAreKnown}, !Matched::Dict{String,Function}) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:80
in #stream!#5(::Array{Any,1}, ::Function, ::CSV.Source, ::Type{CSV.Sink}, ::Bool, ::Dict{Int64,#replace_nulls}, ::String, ::Vararg{String,N}) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:147
in stream!(::CSV.Source, ::Type{CSV.Sink}, ::Bool, ::Dict{Int64,#replace_nulls}, ::String) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:143
in #read#22(::Bool, ::Dict{Int64,#replace_nulls}, ::Array{Any,1}, ::Function, ::String, ::Type{T}, ::String, ::Vararg{String,N}) at /Users/olevinkr/.julia/v0.5/CSV/src/Source.jl:282
in (::CSV.#kw##read...
This is after Pkg.checkout() of both "CSV" and "DataStreams".
Oh drat, that's something I need to fix in DataStreams. For now, you should be able to do
CSV.read(joinpath(file_path, in_file), CSV.Sink, joinpath(file_path, out_file);
transforms=Dict{Int,Function}(i=>replace_nulls for i = 1:num_columns))
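The MethodError above lists Dict{Int64,Function} as the expected argument while the call passed Dict{Int64,#replace_nulls}. The sketch below, written for current Julia 1.x with a placeholder replace_nulls, shows why the explicit Dict{Int,Function} constructor changes the outcome: type parameters in Julia are invariant, so a Dict whose value type is one specific function's type is not a subtype of Dict{Int,Function}.

```julia
replace_nulls(x) = x  # placeholder function for illustration only

# Without type parameters, the value type is the function's own singleton type.
narrow = Dict(i => replace_nulls for i in 1:3)

# With explicit parameters, the value type is the abstract type Function.
wide = Dict{Int,Function}(i => replace_nulls for i in 1:3)

valtype(narrow) == typeof(replace_nulls)  # true
valtype(wide) == Function                 # true
typeof(replace_nulls) <: Function         # true: the function itself is a Function
typeof(narrow) <: Dict{Int,Function}      # false: type parameters are invariant
narrow isa Dict{Int,<:Function}           # true: a covariant signature accepts both
```

A method signature written with Dict{Int,<:Function} would accept either form, which is one plausible shape for a fix on the library side; the thread does not say how DataStreams actually resolved it.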
Now I get:
# originally written for strings, but I thought it should be fine for other types as well
function replace_nulls(elem)
const null_value = "None"
return elem == null_value ? "" : elem
end
CSV.read(joinpath(file_path, in_file), CSV.Sink, joinpath(file_path, out_file);
transforms=Dict{Int,Function}(i=>replace_nulls for i = 1:num_columns))
MethodError: Cannot `convert` an object of type Type{Union{Nullable{WeakRefString{UInt8}},String}} to an object of type DataType
This may have arisen from a call to the constructor DataType(...),
since type constructors fall back to convert methods.
in transform(::DataStreams.Data.Schema{true}, ::Dict{Int64,Function}) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:75
in #stream!#5(::Array{Any,1}, ::Function, ::CSV.Source, ::Type{CSV.Sink}, ::Bool, ::Dict{Int64,Function}, ::String, ::Vararg{String,N}) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:147
in stream!(::CSV.Source, ::Type{CSV.Sink}, ::Bool, ::Dict{Int64,Function}, ::String) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:143
in #read#22(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::Type{T}, ::String, ::Vararg{String,N}) at /Users/olevinkr/.julia/v0.5/CSV/src/Source.jl:282
in (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{T}, ::String, ...
In the meantime, I also have the alternate implementation from that SO post (as well as a working Python version I coded in a few minutes to solve the real-world problem, TBH), but I thought this would be good to know in case performance issues come up.
No worries, this is part of the evolution of Julia, its libraries, and users learning a slightly new paradigm. The problem here is that your replace_nulls function isn't "type stable", meaning its return type can't be reliably predicted from the types of its inputs. This matters because type stability is what allows Julia, unlike Python or JavaScript, to achieve near-C performance.
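As a generic illustration of the term (not a diagnosis of the exact replace_nulls failure in this thread), the sketch below contrasts two hypothetical functions in current Julia 1.x. In the first, the return type depends on the value of the argument; in the second, it depends only on the argument's type.

```julia
# Type-unstable: for the same input type (Int), the return type depends on the value.
unstable(x::Int) = x > 0 ? x : "none"

# Type-stable: an Int input always yields an Int.
stable(x::Int) = x > 0 ? x : 0

typeof(unstable(1))                      # an integer type
typeof(unstable(-1))                     # String
typeof(stable(1)) == typeof(stable(-1))  # true
```

Running @code_warntype unstable(1) in the REPL is one way to inspect what the compiler infers for such a function.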
In your case, if your null values are all represented by "None", then you could just do
CSV.read(in_file; null="None")
and "None" will then be treated as a special value representing NULL in your dataset. No transformation or data manipulation is necessary.
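The idea behind a null sentinel can be sketched in a few lines of plain Julia. This is only an illustration: parse_field is a hypothetical helper, not part of CSV.jl, and it uses the missing value from current Julia 1.x rather than the Nullable type that appears in the stack traces in this thread.

```julia
# Hypothetical helper, not part of CSV.jl: treat one sentinel string as NULL.
parse_field(raw::AbstractString; null::AbstractString="None") =
    raw == null ? missing : raw

row = split("1,None,x", ',')
parsed = [parse_field(f) for f in row]

ismissing(parsed[2])      # true: the sentinel became a missing value
count(ismissing, parsed)  # 1
```

Handling the sentinel at parse time means downstream code only ever sees a dedicated null value, so no separate cleanup pass over the data is needed.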
Awesome! I added a function for the non-string case and it seems to work.
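# `const` belongs at top level; current Julia rejects it inside a function body
const null_string = "None"
# fallback: pass non-string values through unchanged
function replace_nulls(elem)
    return elem
end
# strings: map the "None" sentinel to an empty field
function replace_nulls(elem::String)
    return elem == null_string ? "" : elem
end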
The use case here is a transform on a CSV before loading it into a DB. The original file represents nulls as empty values (,,), but it sometimes also contains "None" string values, which I decided should be entered as nulls too.
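The two-method approach can be checked on a small row of mixed values. The sketch below is written for current Julia 1.x and makes two assumptions beyond the thread: the sentinel is a top-level constant named NULL_STRING, and the string method accepts any AbstractString rather than only String.

```julia
const NULL_STRING = "None"  # assumed name for the sentinel

# Fallback method: non-string values pass through unchanged.
replace_nulls(elem) = elem

# String method: the sentinel becomes an empty field, everything else is kept.
replace_nulls(elem::AbstractString) = elem == NULL_STRING ? "" : elem

row = Any[1, "None", 2.5, "", "abc"]
cleaned = map(replace_nulls, row)
```

Dispatch picks the method from each element's type, so numbers come back untouched while both "None" and already-empty strings end up as empty fields.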