Comments (3)
To be honest, when I try to write a large testing file in my pipeline, I often get a panic (I have encountered both IndexOutOfBound and nil pointer panics). It could be either a user error or a code bug. But in either case, a panic makes debugging extremely hard, and it will kill the main process (unless I use recover, but I don't think that is good practice either). I hope maybe we can do something like
err := pw.Write(rec)
// and similarly
err := pw.Flush(true)
I know it may take a lot of refactoring and testing to cover every panic, but I believe it would make the code more reliable.
I will keep testing that pipeline to find out whether it is a user error or a code bug.
Thanks for your help :)
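As a stopgap for the recover approach mentioned above, a panic can be turned into an error at the call site. This is only a minimal sketch: safely is a hypothetical helper, not part of parquet-go, and the two panics in main are stand-ins for the IndexOutOfBound and nil pointer cases.

```go
package main

import "fmt"

// safely is a hypothetical helper (not part of parquet-go). It runs fn and
// converts a panic raised inside it into an ordinary error.
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	var rows []int // empty slice, so rows[3] panics with index out of range

	err := safely(func() { _ = rows[3] })
	fmt.Println("index error:", err)

	var p *struct{ n int } // nil pointer, so reading p.n panics
	err = safely(func() { _ = p.n })
	fmt.Println("nil pointer error:", err)

	err = safely(func() {}) // no panic, so err stays nil
	fmt.Println("no panic:", err)
}
```

In a pipeline, the closure passed to safely would wrap the call that may panic. It keeps the process alive but does not say whether the writer is still in a usable state, so returning errors from the library itself would still be the better fix.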
from parquet-go.
Hi, @chris920820
1) You should always call Flush(true) once in your code before WriteStop. The parameter is a little confusing. There are two buffers in the writer: one stores records, the other stores pages. When the size of the buffered records is larger than the page size, the record buffer is emptied and its contents are written to a page. When the size of the page buffer is larger than the row group size, the pages are written to a row group, which means they are written to the file. That is why Flush takes a bool parameter. When it is false, the flush only operates on the record buffer. Otherwise both buffers are flushed.
The whole process is controlled by PageSize (default 8K = 8*1024) and RowGroupSize (default 128M = 128*1024*1024). Users don't need to manage it themselves. But you can't write too many records at a time, because of your memory limit.
If you want to change the default RowGroupSize, you can set
pw.RowGroupSize=256*1024*1024
PS: I'm also considering calling Flush inside WriteStop, so users won't need to call it anymore.
2) I will add more error handling in the new version. Sorry about the inconvenience.
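The two-buffer behaviour described in point 1 can be illustrated with a toy model. This is a sketch under stated assumptions, not parquet-go's implementation: toyWriter and its fields are hypothetical, sizes are counted as raw record bytes, and "writing a row group" is reduced to incrementing a counter.

```go
package main

import "fmt"

// toyWriter is a hypothetical model of the two buffers described above.
// It is NOT parquet-go's code; it only mimics the flush thresholds.
type toyWriter struct {
	PageSize     int // threshold for the record buffer, in bytes
	RowGroupSize int // threshold for the page buffer, in bytes

	records     [][]byte // records not yet turned into a page
	recordBytes int
	pages       [][]byte // pages not yet written as a row group
	pageBytes   int
	RowGroups   int // row groups "written to the file" so far
}

// Write buffers one record and flushes the record buffer once it grows
// larger than PageSize.
func (w *toyWriter) Write(rec []byte) {
	w.records = append(w.records, rec)
	w.recordBytes += len(rec)
	if w.recordBytes > w.PageSize {
		w.Flush(false)
	}
}

// Flush always moves buffered records into a page. The page buffer is
// written out as a row group when all is true or when it is larger than
// RowGroupSize.
func (w *toyWriter) Flush(all bool) {
	if w.recordBytes > 0 {
		page := make([]byte, 0, w.recordBytes)
		for _, r := range w.records {
			page = append(page, r...)
		}
		w.pages = append(w.pages, page)
		w.pageBytes += len(page)
		w.records, w.recordBytes = nil, 0
	}
	if w.pageBytes > 0 && (all || w.pageBytes > w.RowGroupSize) {
		w.RowGroups++
		w.pages, w.pageBytes = nil, 0
	}
}

func main() {
	// Tiny thresholds so the behaviour is visible with a few records.
	w := &toyWriter{PageSize: 8, RowGroupSize: 16}
	for i := 0; i < 7; i++ {
		w.Write([]byte("abcd"))
	}
	fmt.Println("row groups before final flush:", w.RowGroups)
	fmt.Println("bytes still buffered:", w.recordBytes+w.pageBytes)

	w.Flush(true) // the final Flush(true) pushes out whatever is left
	fmt.Println("row groups after final flush:", w.RowGroups)
}
```

In this model, skipping the final Flush(true) leaves the last record sitting in the buffers, which is the reason given above for calling it once before WriteStop.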
@xitongsys
Hey!
Thanks for your kind reply! It was very clear.
Adding some graceful error handling will certainly make the code more robust. I have now integrated this code into production, and it is in the testing phase. Hopefully everything will work great :)