Comments (6)

quinnj commented on July 1, 2024

Great question. There's actually new functionality that should make this very easy. It would look something like:

num_columns = [input # of columns in the file here]
CSV.read(infile, CSV.Sink, outfile; transforms=Dict(i=>replace_nulls for i = 1:num_columns))

Basically, this will open the input CSV file, read it field by field, and apply the replace_nulls function to each parsed field before writing it to the output CSV file. No DataFrame is materialized in between. Note, however, that your replace_nulls function will need to handle every input type it may encounter in your input file, but you probably already accounted for that.

Let me know if you run into any issues; the transforms=Dict() functionality is brand new, but I've tried to put it through some good testing before the release.

from csv.jl.

ohadle commented on July 1, 2024

Cool!
I got this error while trying it out:

import CSV
in_csv = CSV.Source(joinpath(file_path, in_file))
num_columns = length(CSV.readsplitline(in_csv))
CSV.read(joinpath(file_path, in_file), CSV.Sink, joinpath(file_path, out_file);
         transforms=Dict(i=>replace_nulls for i = 1:num_columns))
MethodError: no method matching transform(::DataStreams.Data.Schema{true}, ::Dict{Int64,#replace_nulls})
Closest candidates are:
  transform(::DataStreams.Data.Schema{RowsAreKnown}, !Matched::Dict{Int64,Function}) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:70
  transform(::DataStreams.Data.Schema{RowsAreKnown}, !Matched::Dict{String,Function}) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:80
 in #stream!#5(::Array{Any,1}, ::Function, ::CSV.Source, ::Type{CSV.Sink}, ::Bool, ::Dict{Int64,#replace_nulls}, ::String, ::Vararg{String,N}) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:147
 in stream!(::CSV.Source, ::Type{CSV.Sink}, ::Bool, ::Dict{Int64,#replace_nulls}, ::String) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:143
 in #read#22(::Bool, ::Dict{Int64,#replace_nulls}, ::Array{Any,1}, ::Function, ::String, ::Type{T}, ::String, ::Vararg{String,N}) at /Users/olevinkr/.julia/v0.5/CSV/src/Source.jl:282
 in (::CSV.#kw##read...

This is on Pkg.checkout() of "CSV" and "DataStreams".

quinnj commented on July 1, 2024

Oh drat, that's something I need to fix in DataStreams. For now, you should be able to do

CSV.read(joinpath(file_path, in_file), CSV.Sink, joinpath(file_path, out_file);
         transforms=Dict{Int,Function}(i=>replace_nulls for i = 1:num_columns))

ohadle commented on July 1, 2024

Now I get:

# originally written for strings, but I thought it should be fine for other types as well
function replace_nulls(elem)
  const null_value = "None"
  return elem == null_value ? "" : elem
end

CSV.read(joinpath(file_path, in_file), CSV.Sink, joinpath(file_path, out_file);
         transforms=Dict{Int,Function}(i=>replace_nulls for i = 1:num_columns))
MethodError: Cannot `convert` an object of type Type{Union{Nullable{WeakRefString{UInt8}},String}} to an object of type DataType
This may have arisen from a call to the constructor DataType(...),
since type constructors fall back to convert methods.
 in transform(::DataStreams.Data.Schema{true}, ::Dict{Int64,Function}) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:75
 in #stream!#5(::Array{Any,1}, ::Function, ::CSV.Source, ::Type{CSV.Sink}, ::Bool, ::Dict{Int64,Function}, ::String, ::Vararg{String,N}) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:147
 in stream!(::CSV.Source, ::Type{CSV.Sink}, ::Bool, ::Dict{Int64,Function}, ::String) at /Users/olevinkr/.julia/v0.5/DataStreams/src/DataStreams.jl:143
 in #read#22(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::Type{T}, ::String, ::Vararg{String,N}) at /Users/olevinkr/.julia/v0.5/CSV/src/Source.jl:282
 in (::CSV.#kw##read)(::Array{Any,1}, ::CSV.#read, ::String, ::Type{T}, ::String, ...

In the meantime, I also have the alternate implementation from that SO post (as well as a working Python version I coded in a few minutes to solve the real-world problem, TBH), but I thought this would be good to know in case perf issues come up.

quinnj commented on July 1, 2024

No worries, this is part of the evolution of Julia, its libraries, and users learning a slightly new paradigm. The problem here is that your replace_nulls function isn't "type stable": for a given input type it doesn't have a single, predictable return type, since it returns either the String "" or the original element depending on the value (hence the Union type in the error). This is important because, unlike in Python or JavaScript, type stability is what allows Julia to achieve near-C performance.
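
To make the type-stability point concrete, here is a minimal sketch in current Julia (1.x) syntax that uses only Base and no CSV.jl calls. It assumes SubString as a stand-in for the WeakRefString fields that appear in the error above, and the name replace_nulls_generic is hypothetical. It only illustrates how one input type can yield two different output types; it is not the code path DataStreams actually runs.

```julia
# Sketch only: Base Julia 1.x, no CSV.jl. SubString is an assumed stand-in for
# the WeakRefString fields mentioned in the error message.
replace_nulls_generic(elem) = elem == "None" ? "" : elem

line = "None,42"
a = SubString(line, 1, 4)   # the field "None"
b = SubString(line, 6, 7)   # the field "42"

# Same input type, two different output types, depending on the value:
ta = typeof(replace_nulls_generic(a))   # String, because "" is returned
tb = typeof(replace_nulls_generic(b))   # SubString{String}, the input itself

println((typeof(a), ta, tb))
```

Because the result type depends on the value rather than on the input type alone, the return type of such a function is a Union of both possibilities, much like the Union in the error message.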

In your case, if your null values are all represented by "None", then you could just do

CSV.read(in_file; null="None")

and "None" will then be treated as a special value representing NULL in your dataset. No transforming or data manipulation necessary.

ohadle commented on July 1, 2024

Awesome! I added a function for the non-string case and it seems to work.

function replace_nulls(elem)
  return elem
end

function replace_nulls(elem::String)
  const null_string = "None"
  return elem == null_string ? "" : elem
end

The use case here is a transform on a CSV before loading it into a DB. The original file represented nulls as empty values (,,), but sometimes also contained "None" strings, which I decided to load as nulls too.
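
For readers on current Julia (1.x), the two-method approach above can be sketched roughly as follows, using only Base. The constant NULL_STRING and the sample row are hypothetical, introduced for illustration. The null marker is a global constant here because Julia 1.x rejects `const` on a local variable.

```julia
# Sketch only: Base Julia 1.x. NULL_STRING and `row` are illustrative names.
const NULL_STRING = "None"

# Fallback method: anything that is not a String passes through unchanged.
replace_nulls(elem) = elem

# String method: String in, String out, so the return type is predictable.
replace_nulls(elem::String) = elem == NULL_STRING ? "" : elem

row = Any["None", 42, 3.5, "abc"]   # hypothetical mixed-type row
cleaned = [replace_nulls(x) for x in row]

# Each value comes back with the same type it went in with.
types_preserved = all(typeof(replace_nulls(x)) == typeof(x) for x in row)

println(cleaned)
println(types_preserved)
```

Note that with these two methods, a "None" held in a string type other than String would pass through the fallback untouched, so the String method may need widening (for example to AbstractString) depending on what the parser hands over.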
