Comments (8)
Hi mukunku,
thanks for your feedback and the reminder. Sorry, I have been quite busy. Please find a sample parquet file attached.
It has 100 rows and 5,000 columns and is 7,391 KB in size. In Python, I can read it in under 0.36 s.
In ParquetViewer, the analyzing stage took 10 s and at times caused my system to freeze. The loading stage took another 2 minutes. Memory usage was low, but a single CPU core was at 100% usage.
As I just wanted a quick file preview in Windows by clicking on the parquet file, I have now found another efficient solution for my case.
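- A simple Python script to read a parquet file and print the essentials:
import sys
import pandas as pd

# Path of the parquet file, passed by Windows as the first argument
file = sys.argv[1]
df = pd.read_parquet(file)
print(df)
# info() prints its summary itself and returns None, so it needs no print()
df.info()
# Keep the console window open until Enter is pressed
input()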
- A simple file association in the cmd prompt (as admin):
assoc .parquet=parquetfile
ftype parquetfile="<pathtoyourpython>\python.exe" "<pathtoyourscriptfile>\parquet_win_preview.py" "%1" "%2" %*
from parquetviewer.
Please find a test file attached.
Dimensions: 10,000 rows x 40,000 columns. Reading it takes 3.25 s in Python. In ParquetViewer, loading the fields takes 20 s and loading the data takes more than 5 min (I did not wait for it to finish).
Test2.zip
Hi there,
Because Parquet is columnar storage and we're converting the data into a row-based structure, performance is going to get worse the more columns there are. I don't think there's any way to avoid that.
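To illustrate the point with a toy model (this is not ParquetViewer's actual code, which is a .NET application): if the data is held as one list per column, building a single row means touching every column once, so the work per row grows with the column count. The helper name `columns_to_rows` and the sample data are made up for this sketch.

```python
from typing import Any, Dict, List


def columns_to_rows(columns: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Pivot column-oriented data into a list of row dictionaries."""
    names = list(columns)
    row_count = len(columns[names[0]]) if names else 0
    rows = []
    for i in range(row_count):
        # Each row visits every column once: O(number of columns) per row.
        rows.append({name: columns[name][i] for name in names})
    return rows


# Toy data: 3 rows, 2 columns (hypothetical values).
table = {"id": [1, 2, 3], "name": ["a", "b", "c"]}
rows = columns_to_rows(table)
print(rows[0])  # {'id': 1, 'name': 'a'}
```

With 40,000 columns instead of 2, the inner step runs 40,000 times for every row, which is why wide files are the slow case.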
A few things you can try:
- Clicking "Select All" in the field dialog chooser twice will deselect all the columns. You can then search for the columns you care about and just check those.
- You can try using the multi-threaded parquet engine. This increases CPU usage, but it splits the column reads across multiple threads, with each thread reading a subset of the columns. This should increase the read speed a bit.
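The idea behind both suggestions (read fewer columns, or read subsets of columns in parallel) can be sketched in Python. This is only an illustration of the technique, not how ParquetViewer's engine is implemented. `split_columns`, `read_columns_parallel` and `fake_read_subset` are hypothetical names, and the fake reader stands in for a real one.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Sequence


def split_columns(columns: Sequence[str], parts: int) -> List[List[str]]:
    """Split column names into up to `parts` contiguous, near-equal subsets."""
    if not columns:
        return []
    parts = max(1, min(parts, len(columns)))
    size, extra = divmod(len(columns), parts)
    subsets, start = [], 0
    for i in range(parts):
        # The first `extra` subsets take one additional column each.
        end = start + size + (1 if i < extra else 0)
        subsets.append(list(columns[start:end]))
        start = end
    return subsets


def read_columns_parallel(
    columns: Sequence[str],
    read_subset: Callable[[List[str]], Dict[str, List[Any]]],
    workers: int = 4,
) -> Dict[str, List[Any]]:
    """Read each column subset on its own thread and merge the results."""
    workers = max(1, workers)
    merged: Dict[str, List[Any]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(read_subset, split_columns(columns, workers)):
            merged.update(partial)
    return merged


def fake_read_subset(names: List[str]) -> Dict[str, List[Any]]:
    # Stand-in for a real reader; returns two dummy values per column.
    return {name: [0, 1] for name in names}


data = read_columns_parallel([f"col{i}" for i in range(10)], fake_read_subset, workers=3)
print(len(data))  # 10
```

In a pandas-based script, a real reader could be something like `lambda names: pd.read_parquet(path, columns=names).to_dict("list")`, since `pd.read_parquet` accepts a `columns` argument. Passing only the columns you care about is also the Python counterpart of the first suggestion.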
I can take a look at other kinds of optimizations, but could you share a sample file or two so there's something to work with?
Hey @mucmch, any chance you could share a sample file? I could take a look to see if there's any way I can improve the performance further.
@mucmch Can you try the latest v2.6.0 beta release please? https://github.com/mukunku/ParquetViewer/releases/tag/v2.6.0.1
I tried loading that test file with this version and it loads the entire file in less than 5 seconds. I can see this by hovering my cursor over the "Loaded:" text in the bottom right corner.
I see comparable performance with the new version on my side.
I also tested some other files, and the results are decent for most files < 100 MB.
Thanks a lot for your effort, @mukunku!
But I still shouldn't push it too far.
For example, a highly compressed 30 MB parquet file that expands to 3 GB when fully loaded (an 8,650 x 46,250 table) is still loading after 20 min.
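As a rough sanity check on the 3 GB figure, the in-memory size can be estimated from the table dimensions. This sketch assumes every value takes 8 bytes (for example a 64-bit float). That is an assumption for illustration, not something stated about the actual file, and `estimate_table_bytes` is a made-up helper name.

```python
def estimate_table_bytes(rows: int, columns: int, bytes_per_value: int = 8) -> int:
    """Estimate the uncompressed size of a table of fixed-width values."""
    return rows * columns * bytes_per_value


cells = 8_650 * 46_250
size = estimate_table_bytes(8_650, 46_250)  # assumes 8 bytes per value
print(cells)                    # 400062500
print(f"{size / 1e9:.2f} GB")   # 3.20 GB
```

Roughly 400 million cells at 8 bytes each come to about 3.2 GB, which is in line with the reported size and about 100 times the 30 MB on disk.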
Thanks for testing out the latest release, @mucmch. Glad it's working at least a little bit better.
Any chance you could share a file like you described? It's really hard for me to optimize without having a sample file.
Unfortunately, getting the application to support data with tens of thousands of columns, like the sample file you shared with 40k columns, isn't going to be possible. I tried to see if I could somehow get it to work, but WinForms simply can't handle it.
I did add some other improvements to handle files like these more gracefully, so I'm going to mark this ticket as "Won't fix" and close it out. I will detail the changes I made below, along with my rationale for why I think this is good enough:
- I don't think a lot of people are going to be using files with tens of thousands of columns.
- Considering Parquet is columnar storage, I think it's reasonable for performance to degrade when accessing data on a row-by-row basis.
- I automated the ParquetEngine setting in v2.6.0 so that files with many columns automatically use the multi-threaded engine.
- As of v2.7.0, the field selection dialog no longer crashes or hangs when opening files with a lot of columns. Unfortunately you won't be able to filter the columns, but at least the UI is responsive.
- As of v2.6.0, the loading screen shows a loading bar, giving users some idea of when a file load will complete.
- As of v2.7.0, the metadata viewer no longer hangs for a long time when loading files with many columns.