When I try to write a large parquet file (for example, ~1 billion rows). What is


Best practice on writing a large parquet file using JSONWriter (parquet-go, closed)

xitongsys commented on September 27, 2024

Best practice on writing a large parquet file using JSONWriter

from parquet-go.

Comments (3)

chris920820 commented on September 27, 2024

To be honest, when I try to write a large test file in my pipeline, I often get a panic (I have hit both IndexOutOfBound and nil pointer panics). It could be either a user error or a bug in the code. Either way, a panic makes debugging extremely hard, and it kills the main process (unless I use recover, which I don't think is good practice here). I hope we could have something like

err := pw.Write(rec)
// and similarly
err = pw.Flush(true)

I know it may take a lot of refactoring and testing to cover the panics, but I believe it would make the code more reliable.
I will keep testing that pipeline to find out whether it is a user error or a code bug.
Thanks for your help :)
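
Until writes return errors, the recover workaround mentioned above could look roughly like the sketch below. The `recordWriter` interface, `safeWrite`, and `panickyWriter` are hypothetical names made up for illustration. They are not part of parquet-go, and the fake writer only simulates an index-out-of-range panic.

```go
package main

import "fmt"

// recordWriter is a hypothetical stand-in for any writer whose Write may panic.
// It is an assumption for this sketch, not a parquet-go type.
type recordWriter interface {
	Write(rec interface{}) error
}

// safeWrite calls w.Write and converts a panic into an ordinary error,
// so that one bad record does not kill the whole process.
func safeWrite(w recordWriter, rec interface{}) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("write panicked: %v", r)
		}
	}()
	return w.Write(rec)
}

// panickyWriter simulates a writer that panics with an index out of range.
type panickyWriter struct{}

func (panickyWriter) Write(rec interface{}) error {
	var cols []int
	_ = cols[3] // indexing an empty slice panics at run time
	return nil
}

func main() {
	if err := safeWrite(panickyWriter{}, `{"id": 1}`); err != nil {
		fmt.Println("recovered:", err)
	}
}
```

This only contains the failure. The record that triggered the panic is still lost, so logging it alongside the error is probably worthwhile while debugging.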


xitongsys commented on September 27, 2024

hi, @chris920820

You should always call Flush(true) once in your code before you stop writing. The parameter is a little confusing. There are two buffers in the writer: one stores records, the other stores pages. When the size of the buffered records exceeds the page size, the record buffer is emptied and its contents are written to a page. When the size of the page buffer exceeds the row group size, the pages are written to a row group, which means they are written to the file. That is why Flush takes a bool parameter: when it is false, only the record buffer is flushed; when it is true, both buffers are flushed.
The whole process is controlled by PageSize (default 8K = 8*1024) and RowGroupSize (default 128M = 128*1024*1024), so users don't need to manage it themselves. However, you can't write too many records at once, because you are limited by your available memory.
If you want to change the default RowGroupSize, you can set

pw.RowGroupSize = 256 * 1024 * 1024
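
To make the two-buffer behaviour easier to picture, here is a toy model of it. This is only a sketch built from the description above, not parquet-go's actual implementation: `toyWriter` and its counters are hypothetical, sizes are counted as raw record bytes, and the thresholds are tiny so the effect is visible.

```go
package main

import "fmt"

// toyWriter is a hypothetical model of the record buffer and page buffer.
// It only counts bytes; it does not encode or write anything.
type toyWriter struct {
	PageSize     int // record buffer is turned into a page above this size
	RowGroupSize int // page buffer is written as a row group above this size

	recBytes  int // bytes currently held in the record buffer
	pageBytes int // bytes currently held in the page buffer
	Pages     int // pages built so far
	RowGroups int // row groups "written to file" so far
}

// Write buffers one record and flushes the record buffer once it exceeds PageSize.
func (w *toyWriter) Write(rec []byte) {
	w.recBytes += len(rec)
	if w.recBytes > w.PageSize {
		w.Flush(false)
	}
}

// Flush moves buffered records into a page. The page buffer is emptied into a
// row group when it exceeds RowGroupSize, or unconditionally when all is true.
func (w *toyWriter) Flush(all bool) {
	if w.recBytes > 0 {
		w.pageBytes += w.recBytes
		w.recBytes = 0
		w.Pages++
	}
	if w.pageBytes > 0 && (all || w.pageBytes > w.RowGroupSize) {
		w.pageBytes = 0
		w.RowGroups++
	}
}

func main() {
	w := &toyWriter{PageSize: 8, RowGroupSize: 32}
	for i := 0; i < 11; i++ {
		w.Write(make([]byte, 5))
	}
	fmt.Printf("before Flush(true): pages=%d rowGroups=%d buffered=%d bytes\n",
		w.Pages, w.RowGroups, w.recBytes+w.pageBytes)

	w.Flush(true) // without this, the buffered bytes would never reach a row group
	fmt.Printf("after Flush(true):  pages=%d rowGroups=%d buffered=%d bytes\n",
		w.Pages, w.RowGroups, w.recBytes+w.pageBytes)
}
```

In this model some data is still sitting in the buffers after the last Write, which is why the final Flush(true) matters.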

PS: 1) I'm also considering calling Flush inside WriteStop, so users won't need to call it themselves.
2) I will add more error handling in a new version; sorry for the inconvenience.


chris920820 commented on September 27, 2024

@xitongsys
Hey!
Thanks for your kind reply! It was very clear.
Adding some graceful error handling will certainly make the code more robust. I have now integrated this code for production use, and it is in the testing phase. Hopefully everything will work great :)
