
Parquet.jl's Introduction

Parquet


Reader

A parquet file or dataset can be loaded using the read_parquet function. A parquet dataset is a directory with multiple parquet files, each of which is a partition belonging to the dataset.

read_parquet(path; kwargs...) returns a Parquet.Table or Parquet.Dataset, which is the table contained in the parquet file or dataset in a Tables.jl-compatible format.

Options:

  • rows: The row range to iterate through, all rows by default. Applicable only when reading a single file.
  • filter: Filter function to apply while loading only a subset of partitions from a dataset. The path to the partition is provided as a parameter (see the sketch after this list).
  • batchsize: Maximum number of rows to read in each batch (default: row count of first row group). Applied only when reading a single file, and to each file when reading a dataset.
  • use_threads: Whether to use threads while reading the file; applicable only on Julia v1.3 and later, and enabled by default if the Julia process is started with multiple threads.
  • column_generator: Function to generate a partitioned column when not found in the partitioned table. Parameters provided to the function: table, column index, length of column to generate. Default implementation determines column values from the table path.
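
For illustration, here is a minimal sketch of reading a partitioned dataset with these options. The directory name and the "year=2020" partition layout are hypothetical:

using Parquet

# Hypothetical dataset directory partitioned by year; load only the
# partitions whose path mentions year=2020, in batches of 100k rows.
ds = read_parquet("sales_dataset";
                  filter = path -> occursin("year=2020", path),
                  batchsize = 100_000)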

The returned object is a Tables.jl-compatible table and can be converted to other forms, e.g. a DataFrames.DataFrame via

using Parquet, DataFrames
df = DataFrame(read_parquet(path))

Partitions in a parquet file or dataset can also be iterated over using an iterator returned by the Tables.partitions method.

using Parquet, DataFrames
for partition in Tables.partitions(read_parquet(path))
    df = DataFrame(partition)
    ...
end

Lower Level Reader

Load a parquet file. Only metadata is read initially; data is loaded in chunks on demand. (Note: ParquetFiles.jl also provides load support for Parquet files under the FileIO.jl package.)

Parquet.File represents a Parquet file at path open for reading.

Parquet.File(path) => Parquet.File

Parquet.File keeps a handle to the open file and the file metadata and also holds a weakly referenced cache of page data read. If the parquet file references other files in its metadata, they will be opened as and when required for reading and closed when they are not needed anymore.

The close method closes the reader, releases open files and makes cached internal data structures available for GC. A Parquet.File instance must not be used once closed.
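
A minimal lifecycle sketch (the try/finally is plain Julia; only Parquet.File and close come from the package):

using Parquet

parquetfile = Parquet.File("customer.impala.parquet")
try
    # ... read columns or records here ...
finally
    close(parquetfile)   # releases open files and cached data
end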

julia> using Parquet

julia> filename = "customer.impala.parquet";

julia> parquetfile = Parquet.File(filename)
Parquet file: customer.impala.parquet
    version: 1
    nrows: 150000
    created by: impala version 1.2-INTERNAL (build a462ec42e550c75fccbff98c720f37f3ee9d55a3)
    cached: 0 column chunks

Examine the schema.

julia> nrows(parquetfile)
150000

julia> ncols(parquetfile)
8

julia> colnames(parquetfile)
8-element Array{Array{String,1},1}:
 ["c_custkey"]
 ["c_name"]
 ["c_address"]
 ["c_nationkey"]
 ["c_phone"]
 ["c_acctbal"]
 ["c_mktsegment"]
 ["c_comment"]

julia> schema(parquetfile)
Schema:
    schema {
      optional INT64 c_custkey
      optional BYTE_ARRAY c_name
      optional BYTE_ARRAY c_address
      optional INT32 c_nationkey
      optional BYTE_ARRAY c_phone
      optional DOUBLE c_acctbal
      optional BYTE_ARRAY c_mktsegment
      optional BYTE_ARRAY c_comment
    }

The reader performs logical type conversions automatically for String (from byte arrays), decimals (from fixed length byte arrays) and DateTime (from Int96). It depends on the converted type being populated correctly in the file metadata to detect such conversions. For files where such metadata is not populated, an optional map_logical_types argument can be provided while opening the parquet file. The map_logical_types value must map column names to a tuple of return type and converter function. Return types of String and DateTime are supported as of now, and default implementations for them are included in the package.

julia> mapping = Dict(["column_name"] => (String, Parquet.logical_string));

julia> parquetfile = Parquet.File("filename"; map_logical_types=mapping);

The reader will interpret logical types based on the map_logical_types provided. The following logical type mapping methods are available in the Parquet package.

  • logical_timestamp(v; offset=Dates.Second(0)): Applicable for timestamps that are INT96 values. This converts the data read as Int128 types to DateTime types.
  • logical_string(v): Applicable for strings that are BYTE_ARRAY values. Without this, they are represented in a Vector{UInt8} type. With this they are converted to String types.
  • logical_decimal(v, precision, scale; use_float=true): Applicable for reading decimals from FIXED_LEN_BYTE_ARRAY, INT64, or INT32 values. This converts the data read as those types to Integer, Float64 or Decimal of the given precision and scale, depending on the options provided.

Variants of these methods, or custom methods, can also be applied by the caller.
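
For example, a sketch of a caller-supplied variant; the column name "event_time" is hypothetical and assumed to be stored as INT96:

using Parquet, Dates

# Convert the hypothetical INT96 column "event_time" to DateTime,
# shifting the decoded timestamps forward by one hour.
mapping = Dict(
    ["event_time"] => (DateTime, v -> Parquet.logical_timestamp(v; offset=Dates.Second(3600)))
)
parquetfile = Parquet.File("filename"; map_logical_types=mapping)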

BatchedColumnsCursor

Create a cursor to iterate over batches of column values. Each iteration returns a named tuple mapping column names to a batch of column values. Files with nested schemas cannot be read with this cursor.

BatchedColumnsCursor(parquetfile::Parquet.File; kwargs...)

Cursor options:

  • rows: the row range to iterate through, all rows by default.
  • batchsize: maximum number of rows to read in each batch (default: row count of first row group).
  • reusebuffer: whether to reuse buffers across iterations; if each iteration processes the batch without needing to refer to the same data buffer again, setting this to true reduces GC pressure and can help significantly when processing large files (see the streaming sketch after the example below).
  • use_threads: whether to use threads while reading the file; applicable only on Julia v1.3 and later, and enabled by default if the Julia process is started with multiple threads.

Example:

julia> typemap = Dict(["c_name"]=>(String,Parquet.logical_string), ["c_address"]=>(String,Parquet.logical_string));

julia> parquetfile = Parquet.File("customer.impala.parquet"; map_logical_types=typemap);

julia> cc = BatchedColumnsCursor(parquetfile)
Batched Columns Cursor on customer.impala.parquet
    rows: 1:150000
    batches: 1
    cols: c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment

julia> batchvals, state = iterate(cc);

julia> propertynames(batchvals)
(:c_custkey, :c_name, :c_address, :c_nationkey, :c_phone, :c_acctbal, :c_mktsegment, :c_comment)

julia> length(batchvals.c_name)
150000

julia> batchvals.c_name[1:5]
5-element Array{Union{Missing, String},1}:
 "Customer#000000001"
 "Customer#000000002"
 "Customer#000000003"
 "Customer#000000004"
 "Customer#000000005"

RecordCursor

Create a cursor to iterate over records. In parallel mode, multiple remote cursors can be created and iterated over in parallel.

RecordCursor(parquetfile::Parquet.File; kwargs...)

Cursor options:

  • rows: the row range to iterate through, all rows by default.
  • colnames: the column names to retrieve; all by default

Example:

julia> typemap = Dict(["c_name"]=>(String,Parquet.logical_string), ["c_address"]=>(String,Parquet.logical_string));

julia> parquetfile = Parquet.File("customer.impala.parquet"; map_logical_types=typemap);

julia> rc = RecordCursor(parquetfile)
Record Cursor on customer.impala.parquet
    rows: 1:150000
    cols: c_custkey, c_name, c_address, c_nationkey, c_phone, c_acctbal, c_mktsegment, c_comment

julia> records = collect(rc);

julia> length(records)
150000

julia> first_record = first(records);

julia> isa(first_record, NamedTuple)
true

julia> propertynames(first_record)
(:c_custkey, :c_name, :c_address, :c_nationkey, :c_phone, :c_acctbal, :c_mktsegment, :c_comment)

julia> first_record.c_custkey
1

julia> first_record.c_name
"Customer#000000001"

julia> first_record.c_address
"IVhzIApeRb ot,c,E"

Writer

You can write any Tables.jl column-accessible table that contains columns of these types and their unions with Missing: Int32, Int64, String, Bool, Float32, Float64.

However, CategoricalArrays are not yet supported, nor are these types: Int96, Int128, Date, and DateTime.

Writer Example

using Parquet
using Random: randstring

tbl = (
    int32 = Int32.(1:1000),
    int64 = Int64.(1:1000),
    float32 = Float32.(1:1000),
    float64 = Float64.(1:1000),
    bool = rand(Bool, 1000),
    string = [randstring(8) for i in 1:1000],
    int32m = rand([missing, 1:100...], 1000),
    int64m = rand([missing, 1:100...], 1000),
    float32m = rand([missing, Float32.(1:100)...], 1000),
    float64m = rand([missing, Float64.(1:100)...], 1000),
    boolm = rand([missing, true, false], 1000),
    stringm = rand([missing, "abc", "def", "ghi"], 1000)
)

file = tempname()*".parquet"
write_parquet(file, tbl)
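
As a quick sanity check, the file can be read back with the read_parquet API described above (a sketch, assuming DataFrames.jl is available):

using DataFrames

df = DataFrame(read_parquet(file))
@assert df.int32 == tbl.int32   # the round trip preserves the column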

Parquet.jl's People

Contributors

altre, aviks, davidanthoff, dfdx, dm3, femtotrader, giordano, juliatagbot, l1x, nalimilan, poke1024, quinnj, ranocha, sa-, sglyon, ssikdar1, tanmaykm, viralbshah, xiaodaigh, zaneli


Parquet.jl's Issues

DateTime writer support

I know that you mention in the README that a few types, including date-like types, are not yet supported by the writer. I didn't see any issues referencing this, so I wanted to add one to track its status.

Is this planned to be supported soon? If not, what needs to happen in order to support this? I might be able to create a PR at some point.

Error reading parquet files created using pandas to_parquet

I have a simple Python pandas DataFrame with just one column, Dt, of type int and 6004090 rows. I saved it to a parquet file using df.to_parquet, which uses the pyarrow engine to write the file. When I try to read it in Julia using Parquet I get the BoundsError below. Strangely, if I save only the first 10000 rows, I can read the file in Julia without any error. See the Python and Julia code below. Is this a bug?

#~~~~ Python code ~~~~#

In [1]: import pandas as pd

In [2]: df = pd.read_parquet("D://tmp//df.parquet")

In [3]: df.head()
Out[3]:
Dt
0 20060102
1 20060102
2 20060102
3 20060102
4 20060102

In [4]: df.shape
Out[4]: (6004090, 1)

In [5]: df.dtypes
Out[5]:
Dt int64
dtype: object

In [6]: df.to_parquet("D://tmp//test1.parquet", index=False)

In [7]: df.iloc[:10000].to_parquet("D://tmp//test2.parquet", index=False)

#~~~~ Julia code ~~~~#

julia> using Parquet

julia> # reading file with only first 10000 rows

julia> p = ParFile("D://tmp//test2.parquet")
Parquet file: D://tmp//test2.parquet
version: 1
nrows: 10000
created by: parquet-cpp version 1.5.1-SNAPSHOT
cached: 0 column chunks

julia> schema(JuliaConverter(Main), p, :Dt)

julia> rc = RecCursor(p, 1:5, colnames(p), JuliaBuilder(p, Dt))
Record Cursor on D://tmp//test2.parquet
rows: 1:5
cols: Dt

julia> # reading file with all rows

julia> p = ParFile("D://tmp//test1.parquet")
Parquet file: D://tmp//test1.parquet
version: 1
nrows: 6004090
created by: parquet-cpp version 1.5.1-SNAPSHOT
cached: 0 column chunks

julia> schema(JuliaConverter(Main), p, :Dt)

julia> rc = RecCursor(p, 1:5, colnames(p), JuliaBuilder(p, Dt))
ERROR: BoundsError: attempt to access 3415-element Array{Int64,1} at index [3543]
Stacktrace:
[1] getindex at .\array.jl:728 [inlined]
[2] #7 at .\none:0 [inlined]
[3] iterate at .\generator.jl:47 [inlined]
[4] collect_to! at .\array.jl:651 [inlined]
[5] collect_to_with_first!(::Array{Int64,1}, ::Int64, ::Base.Generator{Array{Int32,1},getfield(Parquet, Symbol("##7#8")){Array{Int64,1}}}, ::Int64) at .\array.jl:630
[6] collect at .\array.jl:611 [inlined]
[7] map_dict_vals(::Array{Int64,1}, ::Array{Int32,1}) at C:\Users\XXXX.julia\packages\Parquet\qSvbc\src\reader.jl:160
[8] values(::ParFile, ::Parquet.PAR2.ColumnChunk) at C:\Users\XXXX.julia\packages\Parquet\qSvbc\src\reader.jl:184
[9] setrow(::ColCursor{Int64}, ::Int64) at C:\Users\XXXX.julia\packages\Parquet\qSvbc\src\cursor.jl:144
[10] ColCursor(::ParFile, ::UnitRange{Int64}, ::String, ::Int64) at C:\Users\XXXX.julia\packages\Parquet\qSvbc\src\cursor.jl:115
[11] (::getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64})(::String) at .\none:0
[12] collect(::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64}}) at .\generator.jl:47
[13] RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{Dt}, ::Int64) at C:\Users\XXXX.julia\packages\Parquet\qSvbc\src\cursor.jl:269 (repeats 2 times)
[14] top-level scope at REPL[7]:1

Example in README

I have quite a bit of trouble understanding the code below. What is Customer and how is it defined?

julia> rc = RecCursor(p, 1:5, colnames(p), JuliaBuilder(p, Customer))
Record Cursor on /home/tan/Work/julia/packages/Parquet/test/parquet-compatibility/parquet-testdata/impala/1.1.1-SNAPPY/customer.impala.parquet
    rows: 1:5
    cols: c_acctbal.c_mktsegment.c_nationkey.c_name.c_address.c_custkey.c_phone.c_comment

KeyError: key 5 not found

I have a parquet file which is essentially a table of word embeddings. I can open it using pandas.

But when I try to open it using Parquet.jl, it shows the following error:

KeyError: key 5 not found

Stacktrace:
 [1] getindex(::Dict{Int64,Thrift.ThriftMetaAttribs}, ::Int64) at ./dict.jl:478
 [2] read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.Statistics) at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:181
 [3] read_container(::Thrift.TCompactProtocol, ::Type{Parquet.PAR2.Statistics}) at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:168
 [4] read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.ColumnMetaData) at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:190
 [5] read_container(::Thrift.TCompactProtocol, ::Type{Parquet.PAR2.ColumnMetaData}) at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:168
 [6] read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.ColumnChunk) at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:190
 [7] read at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:169 [inlined]
 [8] read at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:167 [inlined]
 [9] read_container(::Thrift.TCompactProtocol, ::Array{Parquet.PAR2.ColumnChunk,1}) at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:369
 [10] read_container(::Thrift.TCompactProtocol, ::Type{Array{Parquet.PAR2.ColumnChunk,1}}) at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:168
 [11] read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.RowGroup) at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:190
 [12] read at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:169 [inlined]
 [13] read at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:167 [inlined]
 [14] read_container(::Thrift.TCompactProtocol, ::Array{Parquet.PAR2.RowGroup,1}) at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:369
 [15] read_container(::Thrift.TCompactProtocol, ::Type{Array{Parquet.PAR2.RowGroup,1}}) at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:168
 [16] read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.FileMetaData) at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:190
 [17] read at /Users/catethos/.julia/packages/Thrift/hqiAN/src/base.jl:169 [inlined]
 [18] read_thrift(::IOStream, ::Type{Parquet.PAR2.FileMetaData}) at /Users/catethos/.julia/packages/Parquet/HxyMJ/src/reader.jl:324
 [19] metadata(::IOStream, ::String, ::Int32) at /Users/catethos/.julia/packages/Parquet/HxyMJ/src/reader.jl:339
 [20] #ParFile#4(::Int64, ::Type, ::String, ::IOStream) at /Users/catethos/.julia/packages/Parquet/HxyMJ/src/reader.jl:57
 [21] Type at /Users/catethos/.julia/packages/Parquet/HxyMJ/src/reader.jl:55 [inlined]
 [22] ParFile(::String) at /Users/catethos/.julia/packages/Parquet/HxyMJ/src/reader.jl:46
 [23] top-level scope at In[7]:1

Any idea what the problem is?

Integration with Tables.jl

Tables.jl is the de facto standard for tabular structured data. We can try to integrate Parquet.jl better with it, e.g. Tables.schema(par::ParFile) should return the schema in the format expected by Tables.jl.
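
For reference, a sketch of the kind of value such a method could return for the customer file shown earlier; the names and element types here are illustrative, not the package's actual API:

using Tables

# Proposed: Tables.schema on a ParFile would return something like
sch = Tables.Schema((:c_custkey, :c_name), (Union{Missing,Int64}, Union{Missing,String}))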

Data Page read is correct?

Parquet.jl/src/reader.jl

Lines 253 to 263 in b68e6af

defn_levels = isrequired(par.schema, cname) ? Int[] : read_levels(io, max_definition_level(par.schema, cname), defn_enc, num_values)
#@debug("before reading repn levels bytesavailable in page: $(bytesavailable(io))")
# read repetition levels. skipped if all columns are at 1st level
repn_levels = ('.' in cname) ? read_levels(io, max_repetition_level(par.schema, cname), rep_enc, num_values) : Int[]
#@debug("before reading values bytesavailable in page: $(bytesavailable(io))")
# read values
vals = read_values(io, enc, ctype, num_values)
vals, defn_levels, repn_levels

In the code referenced above, the data is read in this order: definition levels, then repetition levels, then encoded values.

This differs from the documentation, which orders them: repetition levels, then definition levels, then encoded values.


Which is correct?

Support reading/writing categorical

Need

  • For write, Arrow.jl needs to be able to generate an Arrow schema from the table schema.

  • For read, the Arrow schema's flatbuffer definitions need to be readable. Maybe Arrow.jl will do that, so Parquet.jl can piggyback on it.

Update:
Another possible approach is to guess what is needed, e.g. if every page in the column uses dictionary encoding, then we can return that column as categorical.

Doesn't seem to work well with 0.6 format for structs

The Julia converter in this package doesn't seem up to date with the 0.6 struct format:

julia> schema(JuliaConverter(STDOUT), p, :Interaction)             
type Vector{key_valueType}                                         
    Vector{key_valueType}() = new()                                
    key::Vector{UInt8}                                             
    value::Vector{UInt8}                                           
end                                                                
                                                                   
type snapshot_input_typesType                                      
    snapshot_input_typesType() = new()                             
    key_value::Vector{key_valueType}                               
    year::Int32                                                    
end                                                                
                                                                   
type Vector{key_valueType}                                         
    Vector{key_valueType}() = new()                                
    key::Vector{UInt8}                                             
    value::Vector{UInt8}                                           
end                                                                
                                                                   
type snapshotType                                                  
    snapshotType() = new()                                         
    key_value::Vector{key_valueType}                               
    snapshot_input_types::snapshot_input_typesType                 
    month::Int32                                                   
end                                                                
                                                                   
type Vector{key_valueType}                                         
    Vector{key_valueType}() = new()                                
    key::Vector{UInt8}                                             
    value::Float64                                                 
end                                                                
                                                                   
type fired_trap_state_ids_scoresType                               
    fired_trap_state_ids_scoresType() = new()                      
    key_value::Vector{key_valueType}                               
    snapshot::snapshotType                                         
    day::Int32                                                     
end                                                                
                                                                   
type Vector{listType}                                              
    Vector{listType}() = new()                                     
    element::Vector{UInt8}                                         
end                                                                
                                                                   
type fired_trap_state_idsType                                      
    fired_trap_state_idsType() = new()                             
    list::Vector{listType}                                         
    fired_trap_state_ids_scores::fired_trap_state_ids_scoresType   
    week_of_week_year::Int32                                       
end                                                                
                                                                   
type Interaction                                                   
    Interaction() = new()                                          
    interaction_id::Vector{UInt8}                                  
    cohort_id::Vector{UInt8}                                       
    deployment_id::Vector{UInt8}                                   
    lesson_id::Vector{UInt8}                                       
    lesson_dom_id::Vector{UInt8}                                   
    lesson_finished::Bool                                          
    student_id::Vector{UInt8}                                      
    student_account_id::Vector{UInt8}                              
    instructor_id::Vector{UInt8}                                   
    lesson_score::Float64                                          
    lesson_score_max::Float64                                      
    lesson_attempt::Int32                                          
    scoring_method::Vector{UInt8}                                  
    screen_id::Vector{UInt8}                                       
    screen_attempt::Int32                                          
    screen_finished::Bool                                          
    screen_score::Float64                                          
    screen_score_max::Float64                                      
    screen_bank_id::Vector{UInt8}                                  
    screen_bank_path::Vector{UInt8}                                
    fired_trap_state_ids::fired_trap_state_idsType                 
    half_of_year::Int32                                            
    quarter_of_year::Int32                                         
end          
julia> type Vector{key_valueType}
           Vector{key_valueType}() = new()
           key::Vector{UInt8}
           value::Vector{UInt8}
       end

WARNING: deprecated syntax "inner constructor Vector(...) around REPL[15]:2".
Use "Vector{#s27}(...) where #s27" instead.
WARNING: static parameter key_valueType does not occur in signature for Type at REPL[15]:2.
The method will not be callable.

julia>

julia> type snapshot_input_typesType
           snapshot_input_typesType() = new()
           key_value::Vector{key_valueType}
           year::Int32
       end
ERROR: UndefVarError: key_valueType not defined           

Method overwrite warning

Whenever I load Parquet with using Parquet, I get this warning:

WARNING: Method definition copy!(T, T) in module Thrift at C:\Users\anthoff\.julia\v0.6\Thrift\src\base.jl:591 overwritten in module ProtoBuf at C:\Users\anthoff\.julia\v0.6\ProtoBuf\src\utils.jl:15.

Looking at those definitions, they look like type piracy to me, so maybe the right solution would be to just give them different names in these packages and not add these methods to Base.copy! in the first place?

Error reading a file

To reproduce, use this file: https://github.com/xiaodaigh/parquet-data-collection/blob/master/synthetic_data.parquet

Here is what I get:

julia> a = "synthetic_data.parquet"
"synthetic_data.parquet"

julia> p = ParFile(a)
Parquet file: synthetic_data.parquet
    version: 1
    nrows: 14
    created by: parquet-cpp version 1.5.1-SNAPSHOT
    cached: 0 column chunks


julia> schema(JuliaConverter(Main), p, :Customer)

julia> rc = RecCursor(p, 1:5, colnames(p), JuliaBuilder(p, Customer))
ERROR: EOFError: read end of file
Stacktrace:
 [1] read at .\iobuffer.jl:212 [inlined]
 [2] _read_varint at C:\Users\david\.julia\dev\Parquet\src\codec.jl:40 [inlined]
 [3] read_hybrid(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::UInt8, ::Int64, ::Type{Int32}, ::Array{Int32,1}; read_len::Bool) at C:\Users\david\.julia\dev\Parquet\src\codec.jl:129
 [4] read_rle_dict(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32) at C:\Users\david\.julia\dev\Parquet\src\codec.jl:118
 [5] read_values(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Int32, ::Int32, ::Int32) at C:\Users\david\.julia\dev\Parquet\src\reader.jl:222
 [6] read_levels_and_values(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Tuple{Int32,Int32,Int32}, ::Int32, ::Int32, ::ParFile, ::Parquet.Page) at C:\Users\david\.julia\dev\Parquet\src\reader.jl:261
 [7] values(::ParFile, ::Parquet.Page) at C:\Users\david\.julia\dev\Parquet\src\reader.jl:239
 [8] values(::ParFile, ::Parquet.PAR2.ColumnChunk) at C:\Users\david\.julia\dev\Parquet\src\reader.jl:178
 [9] setrow(::ColCursor{Array{UInt8,1}}, ::Int64) at C:\Users\david\.julia\dev\Parquet\src\cursor.jl:144
 [10] ColCursor(::ParFile, ::UnitRange{Int64}, ::String, ::Int64) at C:\Users\david\.julia\dev\Parquet\src\cursor.jl:115
 [11] (::Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64})(::String) at .\none:0
 [12] iterate at .\generator.jl:47 [inlined]
 [13] collect(::Base.Generator{Array{AbstractString,1},Parquet.var"#11#12"{ParFile,UnitRange{Int64},Int64}}) at .\array.jl:665
 [14] RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{Customer}, ::Int64) at C:\Users\david\.julia\dev\Parquet\src\cursor.jl:269 (repeats 2 times)
 [15] top-level scope at REPL[42]:1

Originally reported by @xiaodaigh over at queryverse/ParquetFiles.jl#21.

String columns read back as `Vector{UInt8}`

String columns are still read back as Vector{UInt8}. As discussed, they can be identified as strings by checking metadata.schema[i].converted_type for the relevant i: if the type is BYTE_ARRAY, check the converted_type parameter; if it is filled and equals UTF8 (which is 0), the byte array should be cast to a String.

using Parquet
using Random: randstring
tbl = (
    int32 = rand(Int32, 1000),
    int64 = rand(Int64, 1000),
    float32 = rand(Float32, 1000),
    float64 = rand(Float64, 1000),
    bool = rand(Bool, 1000),
    string = [randstring(8) for i in 1:1000],
    int32m = rand([missing, rand(Int32, 10)...], 1000),
    int64m = rand([missing, rand(Int64, 10)...], 1000),
    float32m = rand([missing, rand(Float32, 10)...], 1000),
    float64m = rand([missing, rand(Float64, 10)...], 1000),
    boolm = rand([missing, true, false], 1000),
    stringm = rand([missing, "abc", "def", "ghi"], 1000)
);

tmpfile = tempname()*".parquet"

@time write_parquet(tmpfile, tbl);

ok2(path) = begin
    par = ParFile(path);
    cc = Parquet.BatchedColumnsCursor(par)
    res = [adf for adf in cc]
    close(par)
    res
end

path = tmpfile

aa  = ok2(path)

aa[1][6] # this should be a String column
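
Until that lands, here is a user-side workaround sketch; the helper name is hypothetical, and the package's map_logical_types / logical_string route described above is the supported fix:

# Hypothetical helper: cast raw BYTE_ARRAY values to Strings.
as_strings(col) = [ismissing(v) ? missing : String(copy(v)) for v in col]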

Can't read a specific file

This data file can't be read: https://github.com/xiaodaigh/parquet-data-collection/blob/master/dsd50p.parquet

I get:

julia> a = "dsd50p.parquet"
"dsd50p.parquet"

julia> b = ParFile(a)
ERROR: KeyError: key 5 not found
Stacktrace:
 [1] getindex(::Dict{Int64,Thrift.ThriftMetaAttribs}, ::Int64) at .\dict.jl:477
 [2] read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.RowGroup) at C:\Users\david\.julia\packages\Thrift\Xjowa\src\base.jl:181
 [3] read at C:\Users\david\.julia\packages\Thrift\Xjowa\src\base.jl:169 [inlined]
 [4] read at C:\Users\david\.julia\packages\Thrift\Xjowa\src\base.jl:167 [inlined]
 [5] read_container(::Thrift.TCompactProtocol, ::Array{Parquet.PAR2.RowGroup,1}) at C:\Users\david\.julia\packages\Thrift\Xjowa\src\base.jl:369
 [6] read_container(::Thrift.TCompactProtocol, ::Type{Array{Parquet.PAR2.RowGroup,1}}) at C:\Users\david\.julia\packages\Thrift\Xjowa\src\base.jl:168
 [7] read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.FileMetaData) at C:\Users\david\.julia\packages\Thrift\Xjowa\src\base.jl:190
 [8] read at C:\Users\david\.julia\packages\Thrift\Xjowa\src\base.jl:169 [inlined]
 [9] read at C:\Users\david\.julia\packages\Thrift\Xjowa\src\base.jl:167 [inlined]
 [10] read_thrift(::IOStream, ::Type{Parquet.PAR2.FileMetaData}) at C:\Users\david\.julia\dev\Parquet\src\reader.jl:324
 [11] metadata(::IOStream, ::String, ::Int32) at C:\Users\david\.julia\dev\Parquet\src\reader.jl:339
 [12] ParFile(::String, ::IOStream; maxcache::Int64) at C:\Users\david\.julia\dev\Parquet\src\reader.jl:57
 [13] ParFile at C:\Users\david\.julia\dev\Parquet\src\reader.jl:55 [inlined]
 [14] ParFile(::String) at C:\Users\david\.julia\dev\Parquet\src\reader.jl:46
 [15] top-level scope at REPL[14]:1

Reported by @xiaodaigh over at queryverse/ParquetFiles.jl#20.

Integrate with FileIO.jl

That and queryverse/IterableTables.jl#49 would enable a super smooth workflow for parquet files and full interop with all the things that IterableTables.jl interops with (including Query.jl etc.).

I'd be happy to handle the FileIO.jl integration (I just did that for lots of other file formats, so not a lot of extra learning for me at this point), but someone would have to fix this package first and make things work on julia 0.6 on all platforms.

ParFile creation prints ~1000 lines of Thrift messages

After the #42 fix, ParFile creation on the master branch generates ~1000 lines of Thrift messages. Not an error, but it blows up a REPL session. I'm not sure where the printing occurs.

MWE:

(v1.3) pkg> st
    Status `~/.julia/environments/v1.3/Project.toml`
  [626c502c] Parquet v0.3.0 #master (https://github.com/JuliaIO/Parquet.jl.git)
  [46a55296] ParquetFiles v0.2.1-DEV #master (https://github.com/queryverse/ParquetFiles.jl.git)

Load parfile:

using Parquet
parfile = "index.parquet"
p = ParFile(parfile)
read TSTRUCT Parquet.PAR2.FileMetaData
struct meta: ThriftMeta for Parquet.PAR2.FileMetaData
Thrift.ThriftMetaAttribs[Thrift.ThriftMetaAttribs(1, :version, 8, true, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(2, :schema, 15, true, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.SchemaElement
Thrift.ThriftMetaAttribs[Thrift.ThriftMetaAttribs(1, :_type, 8, false, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(2, :type_length, 8, false, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(3, :repetition_type, 8, false, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(4, :name, 11, true, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(5, :num_children, 8, false, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(6, :converted_type, 8, false, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(7, :scale, 8, false, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(8, :precision, 8, false, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(9, :field_id, 8, false, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(10, :logicalType, 12, false, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.LogicalType
Thrift.ThriftMetaAttribs[Thrift.ThriftMetaAttribs(1, :STRING, 12, false, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.StringType
Thrift.ThriftMetaAttribs[]
]), Thrift.ThriftMetaAttribs(2, :MAP, 12, false, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.MapType
Thrift.ThriftMetaAttribs[]
]), Thrift.ThriftMetaAttribs(3, :LIST, 12, false, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.ListType
Thrift.ThriftMetaAttribs[]
]), Thrift.ThriftMetaAttribs(4, :ENUM, 12, false, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.EnumType
Thrift.ThriftMetaAttribs[]
]), Thrift.ThriftMetaAttribs(5, :DECIMAL, 12, false, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.DecimalType
Thrift.ThriftMetaAttribs[Thrift.ThriftMetaAttribs(1, :scale, 8, true, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(2, :precision, 8, true, Any[], Thrift.ThriftMeta[])]
]), Thrift.ThriftMetaAttribs(6, :DATE, 12, false, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.DateType
Thrift.ThriftMetaAttribs[]
]), Thrift.ThriftMetaAttribs(7, :TIME, 12, false, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.TimeType
Thrift.ThriftMetaAttribs[Thrift.ThriftMetaAttribs(1, :isAdjustedToUTC, 2, true, Any[], Thrift.ThriftMeta[]), Thrift.ThriftMetaAttribs(2, :unit, 12, true, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.TimeUnit
Thrift.ThriftMetaAttribs[Thrift.ThriftMetaAttribs(1, :MILLIS, 12, false, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.MilliSeconds
Thrift.ThriftMetaAttribs[]
]), Thrift.ThriftMetaAttribs(2, :MICROS, 12, false, Any[], Thrift.ThriftMeta[ThriftMeta for Parquet.PAR2.MicroSeconds
Thrift.ThriftMetaAttribs[]
])]
])]
......

There's about another 1000 lines of Thrift output after this.

Out of Bounds Error on a Parquet file containing Int96

In trying to work through your example with one of my own parquet files,

using Parquet
PQ = ParFile("MyFile.parquet")
schema(JuliaConverter(Main), PQ, :T_TREND)
rc = RecCursor(PQ, 1:5, colnames(PQ), JuliaBuilder(PQ, T_TREND))

I am running into this error (which I also run into when I use ParquetFiles.jl)

ERROR: LoadError: BoundsError: attempt to access 78497-element Array{Int128,1} at index [86164]
Stacktrace:
[1] getindex at .\array.jl:731 [inlined]
[2] #7 at .\none:0 [inlined]
[3] iterate at .\generator.jl:47 [inlined]
[4] collect_to! at .\array.jl:656 [inlined]
[5] collect_to_with_first!(::Array{Int128,1}, ::Int128, ::Base.Generator{Array{Int32,1},getfield(Parquet, Symbol("##7#8")){Array{Int128,1}}}, ::Int64) at .\array.jl:643
[6] collect at .\array.jl:624 [inlined]
[7] map_dict_vals(::Array{Int128,1}, ::Array{Int32,1}) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\reader.jl:160
[8] macro expansion at .\logging.jl:310 [inlined]
[9] values(::ParFile, ::Parquet.PAR2.ColumnChunk) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\reader.jl:180
[10] setrow(::ColCursor{Int128}, ::Int64) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\cursor.jl:144
[11] ColCursor(::ParFile, ::UnitRange{Int64}, ::String, ::Int64) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\cursor.jl:115
[12] (::getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64})(::String) at .\none:0
[13] iterate at .\generator.jl:47 [inlined]
[14] collect_to!(::Array{ColCursor{Array{UInt8,1}},1}, ::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64}}, ::Int64, ::Int64) at .\array.jl:656
[15] collect_to_with_first!(::Array{ColCursor{Array{UInt8,1}},1}, ::ColCursor{Array{UInt8,1}}, ::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64}}, ::Int64) at .\array.jl:643
[16] collect(::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){ParFile,UnitRange{Int64},Int64}}) at .\array.jl:624
[17] RecCursor(::ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::JuliaBuilder{T_TREND}, ::Int64) at C:\Users\XXX.julia\packages\Parquet\HxyMJ\src\cursor.jl:269 (repeats 2 times)
[18] top-level scope at none:0
in expression starting at C:\Users\XXX\Desktop\Analysis\Main.jl:15

Now I believe this might have something to do with the fact that one of the columns is formatted as an Int96. Using schema I found this:

spark_schema {
  optional INT96 time
  optional DOUBLE smh
  optional BYTE_ARRAY name# (from UTF8)
  optional DOUBLE value
  optional BYTE_ARRAY unit# (from UTF8)
  optional BYTE_ARRAY condition# (from UTF8)
  optional BYTE_ARRAY type# (from UTF8)
  optional BYTE_ARRAY source# (from UTF8)
  optional BYTE_ARRAY serialnumber# (from UTF8)
}

In addition, I'm not sure of the dataset sizes this tool was tested on. My parquet file has 20+ million rows.

Timestamp columns are not supported.

Hi, it is us again.

So, we are on to the next data type, and this time it is timestamps. Yeah, I know these will be a pain; the type is already in the same test files as the boolean test.

I'm sorry for bugging you again this quickly; we VERY much appreciate the help you have already given. It has made a big difference to us.

Improve read efficiency

To return a complete column from the parquet file, we need two things:

  1. Read the dictionary page (if it exists).
  2. Use the values function to obtain values (or indices into the dictionary), definition levels, and repetition levels (usually empty for non-nested columns). A sketch of this read path follows below.

Step 2 is quite inefficient as it materialises the dictionary indices and the definition levels, which can take up quite a lot of RAM if the column is large (think hundreds of millions of rows). It's actually much more efficient to simply materialise the final column without allocating vectors for the dictionary indices and definition levels; we can just read them into two reusable variables.
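
For reference, a minimal sketch of this two-step read path using the low-level calls that appear elsewhere on this page (assumes an open ParFile with at least one row group; the tuple layout matches the values function shown in the "Data Page read" excerpt above):

using Parquet

pf = ParFile("customer.impala.parquet")
# columns(pf, 1) returns the column chunks of row group 1; values()
# materialises (values or dictionary indices, definition levels,
# repetition levels) for a chunk.
col_chunks = columns(pf, 1)
vals, defn_levels, repn_levels = values(pf, col_chunks[1])
close(pf)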

Zstd for Parquet reader

I would love to see zstandard implemented for parquet reading here. I wrote a short blog post about the benefits. The TL;DR is that zstandard is just as fast as snappy but takes about 2/3 of the space snappy does. I think it would be useful to have it implemented in this package.

Edit: Upon further digging around:

I've looked at the code and, although I don't understand much of it, I think just adding the Julia implementation of the Zstd codec might be enough to make ParFile work for Zstd-compressed files; unfortunately I don't really know what to do from there. I've looked around and I cannot tell where the codec to use is determined. If I figure it out I may try to create a pull request.
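
For reference, a sketch of the Julia Zstd implementation mentioned above, via CodecZstd.jl; how it would be wired into the page-decompression path is the open question:

using CodecZstd: ZstdCompressor, ZstdDecompressor
using TranscodingStreams: transcode

# Round-trip demo: a Zstd-compressed page payload would be decoded the same way.
payload = Vector{UInt8}("example page bytes")
compressed = transcode(ZstdCompressor, payload)
@assert transcode(ZstdDecompressor, compressed) == payload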

tag new version?

The current tag is quite old, and I've gotten some errors on the old tag that I don't get on master. Is it about time to tag this? There are 8 untagged commits.

overloading of Base.eltype

Hello!

Thank you for developing Parquet.jl!

Trying to incorporate parquet files into a package, I realized that you overload Base.eltype on instances instead of types.

i.e. you do
eltype(cursor::RecordCursor{T}) where {T} = T

instead of
eltype(::Type{RecordCursor{T}}) where {T} = T

as stipulated in the Julia documentation.

best,
Florian

`Batched` concept. Propose dropping `Batched` from names

I finally understand why the name includes Batched: the cursor returns more than one value at a time, hence "batched". Returning record by record (i.e. row by row) yields one value per column at a time, which is not batched.

But in Julia, Python, and R, most tabular structures like DataFrames are stored in columnar form, so it is natural to read multiple values at a time. Also, "batched" is not used anywhere else in the data ecosystem. Hence I suggest removing Batched from the name, as reading in batch mode is the norm.

Enable AppVeyor and add a badge

Hello,

I wonder whether Parquet is tested against Julia 0.6 and Windows.
If that's the case, maybe you should add an AppVeyor badge.

Kind regards

`values(....)[1]` is outputting too many values for column with missing?

In Python, I do

import pandas as pd
import numpy as np

pd.DataFrame({"a":[1, np.nan, -1]}).to_parquet("somewhere.parquet")

and I read it using #master

pf = ParFile("somewhere")

# the file is very small so only one rowgroup
col_chunks = columns(pf, 1)

colnum = 1
col_chunk=col_chunks[colnum]

correct_vals = tbl[colnum]
coltype = eltype(correct_vals)
vals_from_file = values(pf, col_chunk)

And vals_from_file[1] is [1, -1, 1.95e-315]. So it's outputting a vector of length 3 even though only the first two values are valid.

Is this by design? Or is it misaligned, or just outputting too many values?

String columns with missing are read back as `Int32`

In Python

import pandas as pd
import numpy as np

pd.DataFrame({"a":["abc", np.nan, "def"]}).to_parquet("somewhere.parquet")

in Julia on the master branch

pf = ParFile("somewhere")

# the file is very small so only one rowgroup
col_chunks = columns(pf, 1)

colnum = 1
col_chunk=col_chunks[colnum]

correct_vals = tbl[colnum]
coltype = eltype(correct_vals)
vals_from_file = values(pf, col_chunk)

and you will see that the values in vals_from_file[1] are Int32 instead of Vector{UInt8}.

The same data can be read in R and Python.

Internal error. Incorrect state 8. Expected: (0, 6, 7)

When loading a file

using DataFrames
using ParquetFiles

Initially I saw issue #29. After cloning the latest master I see the following error:

Internal error. Incorrect state 8. Expected: (0, 6, 7)

Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] chkstate at /home/gsc/.julia/packages/Thrift/hqiAN/src/protocols.jl:223 [inlined]
 [3] readStructBegin(::Thrift.TCompactProtocol) at /home/gsc/.julia/packages/Thrift/hqiAN/src/protocols.jl:395
 [4] read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.DictionaryPageHeader) at /home/gsc/.julia/packages/Thrift/hqiAN/src/base.jl:172
 [5] read at /home/gsc/.julia/packages/Thrift/hqiAN/src/base.jl:169 [inlined]
 [6] read(::Thrift.TCompactProtocol, ::Type{Parquet.PAR2.DictionaryPageHeader}) at /home/gsc/.julia/packages/Thrift/hqiAN/src/base.jl:167
 [7] read_container(::Thrift.TCompactProtocol, ::Parquet.PAR2.PageHeader) at /home/gsc/.julia/packages/Thrift/hqiAN/src/base.jl:194
 [8] read at /home/gsc/.julia/packages/Thrift/hqiAN/src/base.jl:169 [inlined]
 [9] read_thrift(::IOStream, ::Type{Parquet.PAR2.PageHeader}) at /home/gsc/.julia/dev/Parquet/src/reader.jl:324
 [10] _pagevec(::Parquet.ParFile, ::Parquet.PAR2.ColumnChunk) at /home/gsc/.julia/dev/Parquet/src/reader.jl:124
 [11] #5 at /home/gsc/.julia/dev/Parquet/src/reader.jl:137 [inlined]
 [12] cacheget(::Parquet.PageLRU, ::Parquet.PAR2.ColumnChunk, ::getfield(Parquet, Symbol("##5#6")){Parquet.ParFile}) at /home/gsc/.julia/dev/Parquet/src/reader.jl:26
 [13] pages at /home/gsc/.julia/dev/Parquet/src/reader.jl:137 [inlined]
 [14] values(::Parquet.ParFile, ::Parquet.PAR2.ColumnChunk) at /home/gsc/.julia/dev/Parquet/src/reader.jl:166
 [15] setrow(::Parquet.ColCursor{Float64}, ::Int64) at /home/gsc/.julia/dev/Parquet/src/cursor.jl:144
 [16] Parquet.ColCursor(::Parquet.ParFile, ::UnitRange{Int64}, ::String, ::Int64) at /home/gsc/.julia/dev/Parquet/src/cursor.jl:115
 [17] (::getfield(Parquet, Symbol("##11#12")){Parquet.ParFile,UnitRange{Int64},Int64})(::String) at ./none:0
 [18] iterate at ./generator.jl:47 [inlined]
 [19] collect(::Base.Generator{Array{AbstractString,1},getfield(Parquet, Symbol("##11#12")){Parquet.ParFile,UnitRange{Int64},Int64}}) at ./array.jl:606
 [20] Parquet.RecCursor(::Parquet.ParFile, ::UnitRange{Int64}, ::Array{AbstractString,1}, ::Parquet.JuliaBuilder{ParquetFiles.RCType363}, ::Int64) at /home/gsc/.julia/dev/Parquet/src/cursor.jl:269 (repeats 2 times)
 [21] getiterator(::ParquetFiles.ParquetFile) at /home/gsc/.julia/packages/ParquetFiles/cLLFb/src/ParquetFiles.jl:74
 [22] columns at /home/gsc/.julia/packages/Tables/8f4rT/src/fallbacks.jl:153 [inlined]
 [23] DataFrame(::ParquetFiles.ParquetFile) at /home/gsc/.julia/packages/DataFrames/IKMvt/src/other/tables.jl:21
 [24] top-level scope at In[3]:1

I don't know what to do. Please give me some advice.

Corrupt reads with large data files

I try to give as much information as possible. Unfortunately I am not able to supply data file which produces this issue (proprietary and confidential), but I will try in the coming time to produce a garbage file which can reproduce this issue.

Julia version = 1.4.1
Parquet.jl version = v0.6.1

I have used a program like this to assist with understanding the issue:

import PyCall
import Parquet
import DataFrames

np = PyCall.pyimport("numpy")
pandas = PyCall.pyimport("pandas")
parquet = PyCall.pyimport("pyarrow.parquet")

function pd_to_df(df_pd)
    df= DataFrames.DataFrame()
    for col in df_pd.columns
        df[!, Symbol(col)] = getproperty(df_pd, col).values
    end
    df
end

function raw_parquet(path::String; logical_map=true)::DataFrames.DataFrame
    p = Parquet.ParFile(path; map_logical_types = logical_map)
    colcursor = Parquet.BatchedColumnsCursor(p)
    # Create a DataFrame per batch and bring the DataFrames together
    df = reduce(vcat, DataFrames.DataFrame.(colcursor))
    return df
end

pqdata = raw_parquet("mydata.parquet")
refdata = pandas.read_parquet("mydata.parquet")
jrefdata = pd_to_df(refdata)


Core.Float64(x::Nothing) = NaN

# empties are in the same place
isnan.(Float64.(coalesce.(Matrix(pqdata), nothing))) == isnan.(Float64.(Matrix(jrefdata)))

z = Float64.(coalesce.(Matrix(pqdata), nothing)) - Float64.(Matrix(jrefdata))
z[isnan.(z)] .= 0.0

To summarise, I found that NaNs/missing data (Pandas' doing) were in the same places, but the actual values started to skew off.
I can give statistics on the data (which might help you build an understanding, and will help me later to build a reproducing case):

julia> sum(z .!= 0.0; dims = 1)
1×281 Array{Int64,2}:
 0  0  80072  80023  79068  311913  313816  309042  6590  6240  5987  73731  73474  71966  2620  2615  2590  …  433321  436259  428708  28396  27692  25520  59024  58439  58015  177502  177212  175058  18251  18206  18107
julia> size(z)
(1443915, 281)
julia> sum(sum(z .!= 0.0; dims = 1) .> 0)
279
shell> du -sh mydata.parquet
110M    mydata.parquet
julia> sum(.!isnan.(Float64.(Matrix(jrefdata)))) / prod(size(z))
0.0876967834447427
julia> sum(z .!= 0.0) / prod(size(z))
0.0658725672220012
julia>

I want to note that floating point precision issues aren't a possible cause here (firstly because the encoded/decoded outputs should match exactly, and secondly because the underlying parquet file is all Int32s rather than actual floats, so the upcasting and subtractions should yield exact zeros).

Just to demonstrate this:

julia> zz = copy(z);
julia> zz[zz .== 0] .= 1e6; # set a big number to not affect minimums
julia> minimum(abs.(zz); dims=1)
1×281 Array{Float64,2}:
 1.0e6  1.0e6  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  1.0  1.0  2.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
julia> minimum(minimum(abs.(zz); dims=1))
1.0
julia>

This also demonstrates in particular that every column encoded after the second is wrong in all its nonzero values.

I am not sure what more information I can provide at this point to help, but I would like to help as much as I can, so if you would have any questions do let me know and I will try to respond quickly.

Bug with `BatchedColumnCursor` reader

using Parquet
using Random: randstring
tbl = (
    int32 = rand(Int32, 1000),
    int64 = rand(Int64, 1000),
    float32 = rand(Float32, 1000),
    float64 = rand(Float64, 1000),
    bool = rand(Bool, 1000),
    string = [randstring(8) for i in 1:1000],
    int32m = rand([missing, rand(Int32, 10)...], 1000),
    int64m = rand([missing, rand(Int64, 10)...], 1000),
    float32m = rand([missing, rand(Float32, 10)...], 1000),
    float64m = rand([missing, rand(Float64, 10)...], 1000),
    boolm = rand([missing, true, false], 1000),
    stringm = rand([missing, "abc", "def", "ghi"], 1000)
);

tmpfile = tempname()*".parquet"

write_parquet(tmpfile, tbl);

ok(path) = begin
    cc = Parquet.BatchedColumnsCursor(ParFile(path))
    vals, _ = iterate(cc)
    vals
end

ok(tmpfile) # errors

See above, where I write a parquet file and read it back using the BatchedColumnsCursor, but it fails.

The written file can be read by R and Python's parquet reader.

Update to 1.0

It would be good to ensure Julia 1.0 support and drop older versions.

Register 0.5.3?

Is there anything that remains to be done before releasing 0.5.3?

Error with show on a parquet file

julia> p = ParFile(parfile)
Parquet file: data/Interactions/part-00000-ef53f9ac-98ec-4120-93ec-01601fd7f917-c000.snappy.parquet
    version: 1
    nrows: 95828
    created by: parquet-mr version 1.8.2 (build c6522788629e590a53eb79874b95f6c3ff11f16c)
Error showing value of type Parquet.ParFile:
ERROR: MethodError: no method matching length(::Parquet.PageLRU)
Closest candidates are:
  length(::SimpleVector) at essentials.jl:256
  length(::Base.MethodList) at reflection.jl:558
  length(::MethodTable) at reflection.jl:634
  ...
