Comments (8)

mucmch commented on July 3, 2024

Hi mukunku,

Thanks for your feedback and the reminder. Sorry, I have been quite busy. Please find a sample Parquet file attached.

It has 100 rows and 5,000 columns and is 7,391 KB in size. In Python, I can read it in under 0.36 seconds.
In ParquetViewer, the analyzing stage took 10 seconds and at times even froze my system. The loading stage took another 2 minutes. Memory usage was low, but a single CPU core was at 100% usage.

Since I just wanted a quick file preview in Windows by clicking on the Parquet file, I have now found another solution that works efficiently for my case.

  1. A simple Python script to read the Parquet file and print the essentials:

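import sys

import pandas as pd

# Path of the clicked file, passed in by the Windows file association.
file = sys.argv[1]
df = pd.read_parquet(file)
print(df)
# info() prints its summary itself and returns None, so it is not wrapped in print().
df.info()
# Keep the console window open until Enter is pressed.
input()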

  2. A simple file association in the cmd prompt (as admin):

assoc .parquet=parquetfile
ftype parquetfile="<pathtoyourpython>\python.exe" "<pathtoyourscriptfile>\parquet_win_preview.py" "%1" "%2" %*

Test.zip

from parquetviewer.

mucmch commented on July 3, 2024

Please find a test file attached.
Dimensions: 10,000 x 40,000. Reading time in Python is 3.25 seconds, whereas ParquetViewer takes 20 seconds to load the fields and more than 5 minutes to load the data (I did not wait for it to finish).
Test2.zip

mukunku commented on July 3, 2024

Hi there,

Because Parquet files use columnar storage and we're converting the data into a row-based structure, performance is going to get worse the more columns there are. I don't think there's any way to avoid that.
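
To illustrate the point about columnar versus row-based layouts, here is a minimal Python sketch. It is not ParquetViewer's actual code; it assumes a column store can be modeled as a dict of equal-length lists, and `columns_to_rows` is a hypothetical helper named for this example. Each row it builds has to touch every column, so the work grows with rows times columns.

```python
from typing import Any


def columns_to_rows(columns: dict[str, list[Any]]) -> list[dict[str, Any]]:
    """Pivot a column-oriented table into a list of row dicts."""
    names = list(columns)
    if not names:
        return []
    n_rows = len(columns[names[0]])
    # All columns must have the same length for the pivot to make sense.
    if any(len(columns[name]) != n_rows for name in names):
        raise ValueError("columns have differing lengths")
    # Every output row reads one value from every column: O(rows * columns).
    return [{name: columns[name][i] for name in names} for i in range(n_rows)]


table = {"id": [1, 2, 3], "price": [9.5, 7.0, 3.25]}
rows = columns_to_rows(table)
print(rows[0])
```

With 100 rows and 5,000 columns, the inner loop already runs 500,000 times, which is why a wide file is costly to show as a grid even when it is quick to read column by column.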

A few things you can try:

  • Clicking "Select All" twice in the field chooser dialog will deselect all the columns. You can then search for the columns you care about and check just those.
  • You can try using the multi-threaded parquet engine. This will increase CPU usage, but it splits the column reads across multiple threads, with each thread reading a subset of the columns. This should increase the read speed a bit.
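
The two suggestions above (load only the columns you need, and spread the column reads over several threads) can be sketched in Python as follows. This is an illustration only, not ParquetViewer's implementation: `read_column` is a hypothetical stand-in for whatever decodes one column from the file, and `FAKE_FILE`, `load_columns` and `workers` are names made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-in for a file with three columns.
FAKE_FILE = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}


def read_column(name: str) -> list:
    """Pretend to decode a single column (hypothetical helper)."""
    return list(FAKE_FILE[name])


def load_columns(selected: list[str], workers: int = 4) -> dict[str, list]:
    """Read only the selected columns, spreading the reads over threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so results line up with `selected`.
        results = list(pool.map(read_column, selected))
    return dict(zip(selected, results))


loaded = load_columns(["a", "c"])
print(loaded)
```

In this sketch the pool hands out one column at a time rather than fixed subsets per thread, which is a simplification. Column "b" is never read at all, which is the saving that deselecting columns provides.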

For other kinds of optimizations I can take a look, but could you share a sample file or two so there's something to work with?

mukunku commented on July 3, 2024

Hey @mucmch, any chance you could share a sample file? I could take a look to see if there's any way I can improve the performance further.

mukunku commented on July 3, 2024

@mucmch Can you try the latest v2.6.0 beta release please? https://github.com/mukunku/ParquetViewer/releases/tag/v2.6.0.1

I tried loading that test file with this version and it loads the entire file in less than 5 seconds. I can see this by hovering my cursor over the "Loaded:" text in the bottom right corner.

mucmch commented on July 3, 2024

I see comparable performance with the new version on my side.
I also tested some other files, and the results are decent for most files under 100 MB.
Thanks a lot for your effort, @mukunku!

But I still should not push it too far.
For example, a highly compressed 30 MB Parquet file that expands to 3 GB when fully loaded (an 8,650 x 46,250 table) is still loading after 20 minutes.
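
As a rough sanity check on those numbers, here is a back-of-the-envelope estimate in Python. It assumes every cell is held as an 8-byte value (for example float64) once loaded; the actual column types of that file are not stated, so treat the result as an order-of-magnitude figure.

```python
rows, cols = 8_650, 46_250
bytes_per_cell = 8  # assumption: 8-byte values such as float64

cells = rows * cols
total_bytes = cells * bytes_per_cell

print(f"{cells:,} cells")
print(f"{total_bytes / 1e9:.2f} GB")
```

Under that assumption the table holds about 400 million cells and needs roughly 3.2 GB in memory, which is in line with the 3 GB reported above.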

mukunku commented on July 3, 2024

Thanks for testing out the latest release, @mucmch. Glad it's working at least a little bit better.

Any chance you could share a file like you described? It's really hard for me to optimize without having a sample file.

mukunku commented on July 3, 2024

Unfortunately, getting the application to support data with tens of thousands of columns, like the sample file you shared with 40k columns, isn't going to be possible. I tried to see if I could somehow get it to work, but WinForms simply can't handle it.

I did add some other improvements to handle files like these more gracefully, so I'm going to mark this ticket as "Won't fix" and close it out. I will detail the changes I made below, along with my rationale for why I think this is good enough:

  1. I don't think a lot of people are going to be using files with tens of thousands of columns.
  2. Considering Parquet is columnar storage, I think it's reasonable for performance to degrade when accessing data on a row-by-row basis.
  3. I automated the ParquetEngine setting in v2.6.0 so that files with many columns automatically use the multi-threaded engine.
  4. As of v2.7.0, the field selection dialog will no longer crash or hang when opening files with a lot of columns. Unfortunately you won't be able to filter the columns, but at least the UI is responsive.
  5. As of v2.6.0, the loading screen shows a loading bar, giving users some idea of when a file load will complete.
  6. As of v2.7.0, the metadata viewer won't hang for a long time when loading files with many columns.
