A user should be able to specify how many rows to get back on a case-by-case basis.

Allow user to specify how many rows/what method to use when doing a SQL query (jupyter-incubator/sparkmagic)

Comments (9)

msftristew commented on May 23, 2024

I am starting to think about this. How do we want to deal with the fact that the sample method on dataframes takes a fraction rather than a flat number? Would the user simply provide a fraction to sample on if they want to sample their data rather than perform a simple take()?

from sparkmagic.

msftristew commented on May 23, 2024

It would be nice if the user could say "give me 10 random items from this dataframe" but that might not be possible or practical.

aggFTW commented on May 23, 2024

For us to provide the functionality to give 10 random items from a df, we would have to do a query on behalf of the user to get the count of rows and then compute the fraction. That may or may not be acceptable based on the size of the dataframe.

For now, I'm hesitant to go that way and would instead just expose what Spark exposes, which is the fraction. We still need to investigate how livy/magics behave with a really large number of rows. If the behavior is not acceptable, then maybe the only way to expose this is to enforce a maximum number of rows you can get back, which would indeed require that we get the count of rows for the dataframe when sampling.
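A rough sketch of what the count-then-sample approach could look like, for reference. The helper names `fraction_for` and `take_random` and the `oversample` factor are hypothetical, not part of sparkmagic. The only calls assumed on `df` are the PySpark DataFrame methods `count()`, `sample(withReplacement, fraction)` and `take(n)`. Since `sample` returns only approximately `fraction` of the rows, the sketch oversamples slightly and then truncates with `take`.

```python
def fraction_for(n, total_rows, oversample=1.2):
    """Turn a flat row count into the fraction that DataFrame.sample expects.

    `oversample` is a hypothetical safety margin: sample() is approximate,
    so asking for exactly n / total_rows can return fewer than n rows.
    """
    if n <= 0:
        raise ValueError("n must be a positive integer")
    if total_rows <= 0:
        return 0.0
    # A fraction can never exceed 1.0, so clamp it.
    return min(1.0, oversample * n / total_rows)


def take_random(df, n, oversample=1.2):
    """Return up to n randomly sampled rows from a PySpark DataFrame."""
    total_rows = df.count()  # this is the extra query run on the user's behalf
    fraction = fraction_for(n, total_rows, oversample)
    # Sample without replacement, then cap the result at n rows.
    return df.sample(False, fraction).take(n)
```

The `df.count()` call is the cost being weighed here: it is a full Spark job of its own, which may or may not be acceptable on a very large dataframe.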

msftristew commented on May 23, 2024

How about this:

There are three options, --maxRows, --sampleMethod, and --sampleFraction (abbreviated as -n, -m, -f)
--sampleMethod can be either "take" or "sample"
--maxRows can either be a positive integer N or some sentinel value (like None).
--sampleFraction is equal to some float F where 0 <= F <= 1
When --sampleMethod == "take" and --maxRows=N, that's equivalent to calling sql(...).take(N)
When --sampleMethod == "sample" and --maxRows=N and --sampleFraction=F, that's equivalent to calling sql(...).sample(False, F).take(N)
When --sampleMethod == "take" and --maxRows=None, that's equivalent to calling sql(...).collect()
When --sampleMethod == "sample" and --maxRows=None and --sampleFraction=F, that's equivalent to calling sql(...).sample(False, F).collect()

We would also add 3 configurations for the default values of these options.
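To make the four cases concrete, here is a minimal sketch that parses the three options with `argparse` and prints the equivalent call as a string. It does not touch Spark or sparkmagic internals. The function names are hypothetical, the `"None"` string stands in for the sentinel value, and the defaults are assumptions: 2500 rows is the cap mentioned in this thread, while `take` and `0.1` are placeholders for whatever the configurations would hold.

```python
import argparse


def max_rows_type(value):
    """Parse --maxRows: a positive integer, or the sentinel string 'None'."""
    if value == "None":
        return None
    rows = int(value)
    if rows <= 0:
        raise argparse.ArgumentTypeError("maxRows must be a positive integer or None")
    return rows


def fraction_type(value):
    """Parse --sampleFraction: a float F with 0 <= F <= 1."""
    fraction = float(value)
    if not 0.0 <= fraction <= 1.0:
        raise argparse.ArgumentTypeError("sampleFraction must be between 0 and 1")
    return fraction


def build_parser():
    # Defaults here are assumptions standing in for the 3 configurations.
    parser = argparse.ArgumentParser(prog="sql", add_help=False)
    parser.add_argument("-n", "--maxRows", type=max_rows_type, default=2500)
    parser.add_argument("-m", "--sampleMethod", choices=["take", "sample"], default="take")
    parser.add_argument("-f", "--sampleFraction", type=fraction_type, default=0.1)
    return parser


def equivalent_call(query, args):
    """Return the call the proposal says the options are equivalent to."""
    call = "sql({!r})".format(query)
    if args.sampleMethod == "sample":
        call += ".sample(False, {})".format(args.sampleFraction)
    if args.maxRows is None:
        return call + ".collect()"  # un-truncated result, opt-in only
    return call + ".take({})".format(args.maxRows)


if __name__ == "__main__":
    args = build_parser().parse_args(["-m", "sample", "-n", "10", "-f", "0.25"])
    print(equivalent_call("SELECT * FROM t", args))
```

Note that `--sampleFraction` is simply ignored when the method is `take`, which matches the four cases listed above.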

aggFTW commented on May 23, 2024

That looks good, but I am wary of the last two options, since Python could crash due to memory even after paging through results.

msftristew commented on May 23, 2024

I think it's important to have some way to capture the whole un-truncated dataframe in a SQL query, with the understanding that the user could use too much memory and crash the kernel. Obviously we would still default to a maximum of only 2500 rows (so this is opt-in).

aggFTW commented on May 23, 2024

How can we make that understanding obvious to the user? We would be sacrificing reliability for a feature.

msftristew commented on May 23, 2024

You need to understand the fact that you can't pull the contents of a large dataframe onto the Spark driver anyway; this is a prerequisite for using Spark effectively. The feature is no less reliable than Spark normally is in that regard (not to mention that with large enough records, capping to 2500 rows also doesn't prevent the user from getting OOM errors).

aggFTW commented on May 23, 2024

Agreed. I know you like black or white guarantees, but I'm thinking of minimizing the probability of a customer getting OOM errors. Maybe we shouldn't and we should let Spark evolve to handle those. Let's go with exposing sample and fraction the way you outlined and see how it behaves for many rows.
