Comments (9)
I am starting to think about this. How do we want to deal with the fact that the `sample` method on dataframes takes a fraction rather than an absolute row count? Would the user simply provide a fraction to sample on if they want to sample their data rather than perform a simple `take()`?
from sparkmagic.
It would be nice if the user could say "give me 10 random items from this dataframe" but that might not be possible or practical.
For us to provide the functionality to give 10 random items from a df, we would have to do a query on behalf of the user to get the count of rows and then compute the fraction. That may or may not be acceptable based on the size of the dataframe.
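A minimal sketch of that count-then-fraction approach (the helper name `fraction_for` and the oversampling factor are assumptions for illustration, not anything sparkmagic ships):

```python
# Hypothetical helper: given a target row count n and the dataframe's
# total row count (obtained via an extra df.count() job, which is the
# cost discussed above), compute the fraction to pass to sample().
def fraction_for(n, row_count, oversample=1.2):
    if row_count <= 0:
        return 0.0
    # Oversample slightly because DataFrame.sample() is only
    # approximate, then clamp to the valid [0, 1] range.
    return min(1.0, (n / row_count) * oversample)

# Usage against a real dataframe would look like:
#   fraction = fraction_for(10, df.count())
#   rows = df.sample(False, fraction).take(10)
print(round(fraction_for(10, 1000), 6))  # prints 0.012
```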
For now, I'm hesitant to go that way and would instead just expose what Spark exposes, which is the fraction. One thing to investigate: how livy/magics behave with a really large number of rows. If the behavior is not acceptable, then maybe the only way to expose this is to enforce a maximum number of rows you can get back, which would indeed require getting the row count for the dataframe when sampling.
How about this:
- There are three options, `--maxRows`, `--sampleMethod`, and `--sampleFraction` (abbreviated as `-n`, `-m`, and `-f`).
- `--sampleMethod` can be either `"take"` or `"sample"`.
- `--maxRows` can either be a positive integer N or some sentinel value (like `None`).
- `--sampleFraction` is equal to some float F where 0 <= F <= 1.
- When `--sampleMethod == "take"` and `--maxRows=N`, that's equivalent to calling `sql(...).take(N)`.
- When `--sampleMethod == "sample"` and `--maxRows=N` and `--sampleFraction=F`, that's equivalent to calling `sql(...).sample(False, F).take(N)`.
- When `--sampleMethod == "take"` and `--maxRows=None`, that's equivalent to calling `sql(...).collect()`.
- When `--sampleMethod == "sample"` and `--maxRows=None` and `--sampleFraction=F`, that's equivalent to calling `sql(...).sample(False, F).collect()`.

We would also add 3 configurations for the default values of these options.
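The four cases above can be sketched as a single dispatch function. This is a sketch only: `fetch_rows` and its defaults are hypothetical stand-ins, not sparkmagic's actual implementation, and `df` stands for the result of `sql(...)`:

```python
# Hypothetical dispatch mirroring the proposal above. Parameter names
# follow the proposed --sampleMethod/--maxRows/--sampleFraction options;
# the defaults are illustrative stand-ins for the 3 configurations.
def fetch_rows(df, sample_method="take", max_rows=2500, sample_fraction=0.1):
    if sample_method == "sample":
        # sql(...).sample(False, F): sample without replacement first.
        df = df.sample(False, sample_fraction)
    elif sample_method != "take":
        raise ValueError("sample_method must be 'take' or 'sample'")
    if max_rows is None:
        # The sentinel case: no cap, i.e. collect(); may exhaust memory.
        return df.collect()
    return df.take(max_rows)
```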
That looks good, but I am wary of the last two options, since Python could crash due to memory even after paging through results.
I think it's important to have some way to capture the whole un-truncated dataframe in a SQL query, with the understanding that the user could use too much memory and crash the kernel. Obviously we would still default to a maximum of only 2500 rows (so this is opt-in).
How can we make that understanding obvious to the user? We would be sacrificing reliability for a feature.
Keep in mind that you can't pull the contents of a large dataframe onto the Spark driver anyway; that's a prerequisite for using Spark effectively. The feature is no less reliable than Spark normally is in that regard (not to mention that with large enough records, capping to 2500 rows doesn't prevent the user from getting OOM errors either).
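Rough back-of-envelope arithmetic behind that last point (the per-record size is an assumed illustration, not a measured figure):

```python
# A 2500-row cap still allows a large transfer to the driver/kernel
# when individual records are big. Assume 1 MiB per row (e.g. rows
# containing wide JSON blobs) -- an assumption for illustration.
rows = 2500
bytes_per_row = 1 * 1024 * 1024          # assumed record size: 1 MiB
total_mib = rows * bytes_per_row / (1024 * 1024)
print(total_mib)  # prints 2500.0 (MiB, i.e. roughly 2.4 GiB)
```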
Agreed. I know you like black-and-white guarantees, but I'm thinking of minimizing the probability of a customer hitting OOM errors. Maybe we shouldn't, and we should let Spark evolve to handle those. Let's go with exposing sample and fraction the way you outlined and see how it behaves with many rows.