For now, you can specify the type of a specific column, referencing the column by number or name, so that `T` will result in a `Vector{T}` and `Nullable{T}` will result in a `NullableVector{T}`. So a call would look like:

```julia
CSV.read(file; nullable=true, types=Dict("non_null_column"=>Int64))
```

so all columns would be returned as `NullableVector{T}`, except `non_null_column`, which would be a regular `Vector{T}`.
One thing that I come back to with "auto unwrapping" NullableVectors is the analogy to dealing with type instability in Julia: a function's types shouldn't depend on specific input values. In a similar vein, I think it might be bad if we were to have the types of columns be dependent on the actual values parsed, as opposed to what they're set up to be. A lot of my own personal uses for CSV.jl involve reading batches and batches of CSV files, with various ranges of data. It would be annoying to not know whether I would get a `Vector{T}` or a `NullableVector{T}`, depending on whether there were nulls in that range of the data or not.

Definitely something to consider, but I think even just having the `nullable=true` keyword and being able to specify individual column type nullability gets us most of the way there.
from csv.jl.
> One thing that I come back to with "auto unwrapping" NullableVectors is the analogy to dealing with type instability in Julia: a function's types shouldn't depend on specific input values. In a similar vein, I think it might be bad if we were to have the types of columns be dependent on the actual values parsed, as opposed to what they're set up to be.
Isn't this unavoidable when parsing CSV? A column of numbers won't give the same array type as a column of strings. I'd say it's reasonable to have some type instability when reading the file (unless you specified column types of course), as long as operations after that are type-stable.
As @nalimilan says, type instability is inevitable unless the column types are pre-specified. Also, I suggest making this an optional argument, for which the default could be `false`.
I guess the real issue is the current architecture of the entire parsing process: currently, it's based on the fact that all CSV.jl needs to do is provide `Data.getfield(source::CSV.Source, ::Type{T}, row, col)` and `Data.getfield(source, ::Type{Nullable{T}}, row, col)`. In the `CSV.Source` construction process, we take a peek at the first `n` rows to determine the types, but in terms of nullability, we assume, by default, that there may be missing values in any column. This is how the `nullable` keyword argument was introduced: by passing `nullable=false`, the user can declare upfront that the columns have no missing values.

I guess my point is that right now the architecture requires us to decide upfront, during Source construction, whether a column will allow missing values or not. I.e. there's no place in the current parsing process where you could check after the fact if there were missing values and unwrap if not.
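To illustrate the two access paths described above, here is a toy sketch. The `ToySource` type and `readcell` function are hypothetical stand-ins, not CSV.jl's actual `Data.getfield` implementation, and it uses `Union{T,Missing}` where the code of that era used `Nullable{T}`:

```julia
# Toy stand-in for CSV.Source: a matrix of raw cell strings.
struct ToySource
    cells::Matrix{String}
end

# Non-nullable path: the caller requested a plain Int, so an empty or
# malformed cell is a parse error.
readcell(s::ToySource, ::Type{Int}, row, col) = parse(Int, s.cells[row, col])

# Nullable path: an empty cell becomes `missing` instead of throwing.
function readcell(s::ToySource, ::Type{Union{Int,Missing}}, row, col)
    cell = s.cells[row, col]
    return isempty(cell) ? missing : parse(Int, cell)
end

src = ToySource(["1" ""; "3" "4"])
readcell(src, Int, 1, 1)                  # parses to 1
readcell(src, Union{Int,Missing}, 1, 2)   # missing
```

The nullability decision is baked into which method gets called per cell, which is exactly why it has to be fixed before parsing starts.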
Maybe you could apply the same strategy as what is currently used to choose the column type? I.e. check whether the first `n` rows contain a missing value. I don't see why it would be OK to do that to choose the type, but not for deciding whether null values should be allowed.
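A rough sketch of that sampling strategy (`guess_column_type` is a hypothetical helper, not CSV.jl's actual inference code; empty strings stand in for missing cells):

```julia
# Guess a column's type and nullability from its first n raw cells.
function guess_column_type(raw::Vector{String}, n::Int)
    sample = raw[1:min(n, length(raw))]
    hasnull = any(isempty, sample)                       # null detection
    isint = all(c -> isempty(c) || tryparse(Int, c) !== nothing, sample)
    T = isint ? Int : String                             # type detection
    return hasnull ? Union{T,Missing} : T
end

guess_column_type(["1", "2", "", "4"], 4)   # Union{Int, Missing}
guess_column_type(["1", "2", "3"], 10)      # Int
```

Both decisions are equally guesses: a null (or a string) could always show up after row `n`, which is the objection raised in the next comment.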
@nalimilan It seems to me that you really need the whole column to be able to check if you have any nulls. If the first 100 elements of a column are `Int`s, I would be willing to guess that they all are and pay the price if I was wrong. But seeing that the first 100 elements of a column don't contain any nulls would not make me confident that the whole column was free of nulls.

I guess there are still aspects of this package and how it ties into `DataStreams` that I don't understand. To me, if I were producing a `DataFrame`, I would need whole columns before returning the result, although I may be thinking too much like an R programmer here. Operations like `push!` and `append!` would be far too expensive to use in R, but not in Julia. If it is the case that entire columns are not available until `CSV.read` returns, then I can just do the unnullify operation afterwards.
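The after-the-fact "unnullify" operation mentioned here could look something like this sketch (`unnullify` is a hypothetical name, and the sketch uses modern `Union{T,Missing}` columns rather than `NullableVector`):

```julia
# Narrow a column's type after parsing, if it turned out to contain no missings.
function unnullify(v::AbstractVector)
    any(ismissing, v) && return v   # real missings present: keep as-is
    # No missings: drop Missing from the element type.
    return convert(Vector{nonmissingtype(eltype(v))}, v)
end

unnullify(Union{Int,Missing}[1, 2, 3])        # narrows to Vector{Int}
unnullify(Union{Int,Missing}[1, missing, 3])  # eltype stays Union{Int, Missing}
```

This requires the whole column in memory, which is precisely what the streaming design discussed next tries to avoid.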
I think the fact that CSV is streaming the data as it reads it is one of the best aspects of the package, and it integrates really nicely with e.g. Query.jl. For example, a query like this one:

```julia
q = @from i in CSV.Source("data.csv") begin
    @where i.Children > 2
    @select i
    @collect DataFrame
end
```

will actually never materialize the whole, unfiltered data from the CSV file in memory in the form of a vector or something like that. Instead, the filter is added into the data stream, and only the values that actually pass the filter will eventually end up in the arrays that constitute the `DataFrame`. At no point will this allocate an array that holds all the values from the file. At least in theory, that should for example enable one to read a file that would not fit into memory unfiltered, but where the subset that is left after the `@where` filter is small enough to fit into memory in a `DataFrame`. I say in theory, because I haven't tried it ;)

But as @quinnj pointed out, that essentially means that you need to make a call about the column type before you have read all the rows, otherwise this kind of streaming design can't work.
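The streaming idea can be sketched without Query.jl: process the input line by line and keep only the rows that pass the filter, so the unfiltered data is never held in memory all at once. This is a toy sketch with a hypothetical two-column layout, using an in-memory buffer in place of a large file:

```julia
# A small buffer standing in for a big CSV file on disk.
io = IOBuffer("name,children\nann,1\nbob,3\ncal,4\n")
readline(io)   # skip the header row

# Only rows passing the filter are ever stored.
kept = NamedTuple{(:name, :children),Tuple{String,Int}}[]
for line in eachline(io)
    name, children = split(line, ',')
    c = parse(Int, children)
    c > 2 && push!(kept, (name = String(name), children = c))
end
length(kept)   # 2 rows survive the `children > 2` filter
```

Each row is parsed, tested, and either kept or discarded before the next row is read, so peak memory scales with the filtered subset, not the file.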
@dmbates, to follow up on your last comment: we can avoid providing entire columns at once and still avoid using `push!` or `append!`. Currently CSV.jl will, by default, try to determine the total number of rows in a file and will pass that upfront to a DataFrame so that it can pre-allocate the correct number of rows. But the operation to count the number of rows in a file is much, much cheaper than trying to do any kind of type inference or null detection on the entire file as well.
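A sketch of that two-pass strategy, assuming a trivial one-column file of integers: a cheap counting pass, then a single allocation and a fill pass, with no `push!`/`append!`:

```julia
data = "1\n2\n3\n4\n"                 # stand-in for a file on disk

nrows = countlines(IOBuffer(data))   # pass 1: just count newlines (cheap)

col = Vector{Int}(undef, nrows)      # allocate the full column upfront
for (i, line) in enumerate(eachline(IOBuffer(data)))
    col[i] = parse(Int, line)        # pass 2: parse into pre-sized storage
end
col   # [1, 2, 3, 4]
```

The counting pass touches every byte but does no parsing or type inference, which is why it stays much cheaper than a full inference pass over the file.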
At least on Linux, the shell command `wc -l <filename>` is an extremely fast way to get the number of rows, assuming that records correspond to lines. It may give an upper bound on the number of records, but that is still okay for use with `sizehint!`.

In another discussion, someone said that the Julia `countlines` function should be as fast as `wc` in the shell, but that hasn't been my experience.
^ This is a good point, and a big part of the reason that I added the functionality to do `CSV.read(file; rows=num_rows)`, where the user can provide `num_rows`. The reason I have to use my own `CSV.countlines` is the possibility of embedded newlines in a CSV field, which neither `wc -l` nor `Base.countlines` account for; i.e. I might have a cell in my CSV file like `"lot's of string text \n with a newline in the middle"`. But for cases where the number of rows is known, or can safely and quickly be pre-computed beforehand, it's certainly an optimization to pass it in as the `rows=num_rows` argument.
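A minimal sketch of why a quote-aware counter is needed (illustrative only, not `CSV.countlines`; among other things it ignores escaped quotes):

```julia
# Count logical CSV rows: a newline inside a quoted field is not a row boundary.
function count_csv_rows(s::AbstractString; quotechar='"')
    rows, inquotes = 0, false
    for c in s
        c == quotechar && (inquotes = !inquotes)   # toggle quoted state
        c == '\n' && !inquotes && (rows += 1)      # only count unquoted newlines
    end
    return rows
end

csv = "a,b\n1,\"two\nlines\"\n3,4\n"
countlines(IOBuffer(csv))   # 4 physical lines
count_csv_rows(csv)         # 3 logical rows (header + 2 records)
```

A plain line count (`wc -l` or `Base.countlines`) over-counts by one here, because the embedded newline inside the quoted field splits one record across two physical lines.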
Closing as "nothing can currently be done without major re-architecting". I understand the case, but it's also easy enough to request non-null columns upfront or unwrap them afterwards if the user really wants.