Comments (11)

quinnj commented on June 19, 2024

For now, you can specify the type of a specific column, referenced by number or name: T results in a Vector{T} and Nullable{T} results in a NullableVector{T}. So a call would look like:

CSV.read(file; nullable=true, types=Dict("non_null_column"=>Int64))

so all columns would be returned as NullableVector{T}, except non_null_column, which would be a regular Vector{Int64}.
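
As a rough sketch of what you'd expect back (column names other than non_null_column are just for illustration):

df = CSV.read(file; nullable=true, types=Dict("non_null_column"=>Int64))
# df[:some_other_column] isa NullableVector
# df[:non_null_column]   isa Vector{Int64}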

One thing that I come back to with "auto unwrapping" NullableVectors is the analogy to dealing with type instability in Julia: a function's types shouldn't depend on specific input values. In a similar vein, I think it might be bad if the types of columns depended on the actual values parsed, as opposed to what they're set up to be. A lot of my own personal uses for CSV.jl involve reading batches and batches of CSV files, with various ranges of data. It would be annoying not to know whether I would get a Vector{T} or a NullableVector{T}, depending on whether there happened to be nulls in that range of the data.
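
To make the type-instability analogy concrete (just a sketch, nothing CSV-specific):

# type-unstable: the return type depends on the runtime value of x
f(x) = x > 0 ? 1 : 1.0
# type-stable: the return type depends only on the type of x
g(x) = x > 0 ? 1.0 : -1.0

The worry is the same at the column level: two files with the same layout could come back as Vector{Int} in one case and NullableVector{Int} in the other, purely because one of them happened to contain a missing value.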

Definitely something to consider, but I think even just having the nullable=true keyword and being able to specify individual column type nullability gets us most of the way there.

nalimilan commented on June 19, 2024

One thing that I come back to with "auto unwrapping" NullableVectors is the analogy to dealing with type instability in Julia: a function's types shouldn't depend on specific input values. In a similar vein, I think it might be bad if the types of columns depended on the actual values parsed, as opposed to what they're set up to be.

Isn't this unavoidable when parsing CSV? A column of numbers won't give the same array type as a column of strings. I'd say it's reasonable to have some type instability when reading the file (unless you specified column types of course), as long as operations after that are type-stable.

dmbates commented on June 19, 2024

As @nalimilan says, type instability is inevitable unless the column types are pre-specified. Also, I suggest making this an optional argument, for which the default could be false.

quinnj commented on June 19, 2024

I guess the real issue is the current architecture of the entire parsing process: currently, it's based on the fact that all CSV.jl needs to do is provide Data.getfield(source::CSV.Source, ::Type{T}, row, col) and Data.getfield(source, ::Type{Nullable{T}}, row, col). In the CSV.Source construction process, we take a peek at the first n rows to determine the types, but in terms of nullability we assume, by default, that there may be missing values in any column. This is why the nullable keyword argument was introduced: it lets the user declare up front that the columns should be assumed to not have missing values.

I guess my point is that right now the architecture requires us to decide up front, during Source construction, whether a column will allow missing values or not. I.e. there's no place in the current parsing process where you could check after the fact whether there were missing values and unwrap if not.
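
Very roughly, the flow looks something like this (a sketch of the idea, not the actual DataStreams code; coltypes, nrows, and ncols are stand-in names):

# column types, including nullability, are fixed at Source-construction time,
# e.g. coltypes = [Int64, Nullable{Float64}, String]
for row in 1:nrows, col in 1:ncols
    T = coltypes[col]
    val = Data.getfield(source, T, row, col)   # cell-by-cell streaming into the sink
end

so by the time any value is parsed, the Vector{T}-vs-NullableVector{T} decision has already been made.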

nalimilan commented on June 19, 2024

Maybe you could apply the same strategy as is currently used to choose the column type? I.e. check whether the first n rows contain a missing value. I don't see why it would be OK to do that to choose the type, but not to decide whether null values should be allowed.
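
Something along these lines, say (a sketch; first_n_rows and coltypes are stand-in names, not CSV.jl internals, and a missing cell is assumed to show up as an empty string):

# only mark a column nullable if a missing cell was actually seen in the peeked rows
for col in 1:ncols
    saw_missing = any(row -> isempty(row[col]), first_n_rows)
    coltypes[col] = saw_missing ? Nullable{coltypes[col]} : coltypes[col]
end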

dmbates commented on June 19, 2024

@nalimilan It seems to me that you really need the whole column to be able to check if you have any nulls. If the first 100 elements of a column are Ints, I would be willing to guess that they all are and pay the price if I was wrong. But seeing that the first 100 elements of a column don't contain any nulls would not make me confident that the whole column was free of nulls.

I guess there are still aspects of this package and how it ties into DataStreams that I don't understand. To me, if I were producing a DataFrame I would need whole columns before returning the result, although I may be thinking too much like an R programmer here. Operations like push! and append! would be far too expensive to use in R, but not in Julia. If it is the case that entire columns are not available until CSV.read returns, then I can just do the unnullify operation afterwards.
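
In that case the post-hoc step could be as simple as something like this (a sketch using the Base Nullable API; unnullify is just a made-up helper name):

# unwrap a NullableVector to a plain Vector when it contains no nulls
unnullify(v) = any(isnull, v) ? v : [get(x) for x in v]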

davidanthoff commented on June 19, 2024

I think the fact that CSV.jl streams the data as it reads it is one of the best aspects of the package, and it integrates really nicely with e.g. Query.jl. For example, a query like this one:

q = @from i in CSV.Source("data.csv") begin
    @where i.Children > 2
    @select i
    @collect DataFrame
end

will actually never materialize the whole, unfiltered data from the CSV file in memory in the form of a vector or something like that. Instead, the filter is applied as part of the data stream, and only the values that actually pass the filter end up in the arrays that constitute the DataFrame. At no point does this allocate an array that holds all the values from the file. At least in theory, that should enable one to read a file that would not fit into memory unfiltered, but where the subset left after the @where filter is small enough to fit into memory in a DataFrame. I say in theory, because I haven't tried it ;)

But as @quinnj pointed out, that essentially means that you need to make a call about the column type before you have read all the rows, otherwise this kind of streaming design can't work.

quinnj commented on June 19, 2024

@dmbates, to follow up on your last comment: we can avoid providing entire columns at once and still avoid using push! or append!. Currently CSV.jl will, by default, try to determine the total # of rows in a file and pass that upfront to the DataFrame so that it can pre-allocate the correct # of rows. But counting the # of rows in a file is much, much cheaper than doing any kind of type inference or null detection over the entire file.

dmbates commented on June 19, 2024

At least on Linux, the shell command wc -l <filename> is an extremely fast way to get the number of rows, assuming that records correspond to lines. It may only give an upper bound on the number of records, but that is still okay for use with sizehint!.

In another discussion a person said that the Julia countlines function should be as fast as wc in the shell, but that hasn't been my experience.
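
For reference, feeding such an (upper-bound) line count into a pre-allocation would look roughly like this (a sketch, not CSV.jl internals):

n = countlines("data.csv")      # or parse the output of `wc -l data.csv`
col = sizehint!(Int64[], n)     # pre-size a column so subsequent push! calls are cheap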

quinnj commented on June 19, 2024

^ This is a good point, and a big part of the reason that I added the functionality to do CSV.read(file; rows=num_rows), where the user can provide num_rows. The reason I have to use my own CSV.countlines is the possibility of embedded newlines in a CSV field, which neither wc -l nor Base.countlines accounts for; e.g. a cell in my CSV file might be "lots of string text \n with a newline in the middle". But for cases where the # of rows is known, or you can safely and quickly pre-compute it beforehand, it's certainly an optimization to pass it in as the rows=num_rows argument.
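
For example, a file like

a,b
1,"lots of string text
with a newline in the middle"

has a header plus a single record (two logical rows), but wc -l and Base.countlines see three lines. When the record count is known, passing it through is straightforward:

CSV.read(file; rows=num_rows)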

quinnj commented on June 19, 2024

Closing as "nothing can currently be done without major re-architecting". I understand the case, but it's also easy enough to request non-null columns up front, or unwrap them afterwards, if the user really wants to.
