Comments (9)

msftristew commented on May 23, 2024

I am starting to think about this. How do we want to deal with the fact that the sample method on dataframes takes a fraction rather than a flat number? Would the user simply provide a fraction to sample on if they want to sample their data rather than perform a simple take()?

from sparkmagic.

msftristew commented on May 23, 2024

It would be nice if the user could say "give me 10 random items from this dataframe" but that might not be possible or practical.

aggFTW commented on May 23, 2024

For us to provide the functionality to give 10 random items from a df, we would have to do a query on behalf of the user to get the count of rows and then compute the fraction. That may or may not be acceptable based on the size of the dataframe.

For now, I'm hesitant to go that way and would rather just expose what Spark exposes, which is the fraction. We still need to investigate how Livy and the magics behave with a very large number of rows. If that behavior is not acceptable, then the only way to expose this may be to enforce a maximum number of rows you can get back, which would indeed require getting the row count for the dataframe when sampling.
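
As a rough illustration of the count-then-fraction approach described above, here is a minimal sketch. The helper name `fraction_for_rows` is hypothetical and not part of sparkmagic or Spark; it only shows the arithmetic that would follow the extra count query.

```python
def fraction_for_rows(target_rows, total_rows):
    """Return the sample fraction expected to yield about target_rows rows.

    Hypothetical helper: total_rows would come from a separate count query
    run on the user's behalf, which is the cost discussed above.
    """
    if target_rows < 0:
        raise ValueError("target_rows must be non-negative")
    if total_rows <= 0:
        # An empty dataframe has nothing to sample.
        return 0.0
    # Cap at 1.0 so asking for more rows than exist is still a valid fraction.
    return min(1.0, target_rows / total_rows)


# Assumed usage against a PySpark DataFrame named df (not executed here):
#   total = df.count()
#   rows = df.sample(False, fraction_for_rows(10, total)).take(10)
```

Spark's `sample` treats the fraction as a per-row probability, so the result is not guaranteed to contain exactly the requested number of rows. A real implementation would probably need to oversample slightly and then `take(N)`.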

msftristew commented on May 23, 2024

How about this:

  • There are three options, --maxRows, --sampleMethod, and --sampleFraction (abbreviated as -n, -m, -f)
  • --sampleMethod can be either "take" or "sample"
  • --maxRows can either be a positive integer N or some sentinel value (like None).
  • --sampleFraction is equal to some float F where 0 <= F <= 1
  • When --sampleMethod == "take" and --maxRows=N, that's equivalent to calling sql(...).take(N)
  • When --sampleMethod == "sample" and --maxRows=N and --sampleFraction=F, that's equivalent to calling sql(...).sample(False, F).take(N)
  • When --sampleMethod == "take" and --maxRows=None, that's equivalent to calling sql(...).collect()
  • When --sampleMethod == "sample" and --maxRows=None and --sampleFraction=F, that's equivalent to calling sql(...).sample(False, F).collect()

We would also add 3 configurations for the default values of these options.
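
The option combinations listed above could be dispatched roughly as follows. This is only a sketch under assumptions: `fetch_rows` and its parameter names are hypothetical, and `ListFrame` is an in-memory stand-in that mimics the `sample`, `take`, and `collect` methods of a Spark DataFrame so the example runs without a cluster.

```python
import random


def fetch_rows(df, sample_method="take", max_rows=None, sample_fraction=1.0):
    """Apply the proposed option combinations to a DataFrame-like object."""
    if sample_method not in ("take", "sample"):
        raise ValueError("sample_method must be 'take' or 'sample'")
    if max_rows is not None and max_rows <= 0:
        raise ValueError("max_rows must be a positive integer or None")
    if not 0.0 <= sample_fraction <= 1.0:
        raise ValueError("sample_fraction must be between 0 and 1")

    if sample_method == "sample":
        # sample(withReplacement, fraction): False means without replacement.
        df = df.sample(False, sample_fraction)
    # None is the sentinel for "no limit": everything is collected.
    return df.collect() if max_rows is None else df.take(max_rows)


class ListFrame:
    """Hypothetical stand-in for a DataFrame, backed by a Python list."""

    def __init__(self, rows, seed=0):
        self._rows = list(rows)
        self._rng = random.Random(seed)

    def sample(self, with_replacement, fraction):
        # Keep each row independently with probability `fraction`.
        kept = [r for r in self._rows if self._rng.random() < fraction]
        return ListFrame(kept)

    def take(self, n):
        return self._rows[:n]

    def collect(self):
        return list(self._rows)


frame = ListFrame(range(100))
first_five = fetch_rows(frame, "take", max_rows=5)  # [0, 1, 2, 3, 4]
sampled = fetch_rows(frame, "sample", max_rows=5, sample_fraction=0.5)
```

Treating `max_rows=None` as "collect everything" keeps the unbounded case explicit, so a caller has to opt in to it rather than reach it by accident.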

aggFTW commented on May 23, 2024

That looks good, but I am wary of the last two options, since Python could crash due to memory even after paging through results.

msftristew commented on May 23, 2024

I think it's important to have some way to capture the whole un-truncated dataframe in a SQL query, with the understanding that the user could use too much memory and crash the kernel. Obviously we would still default to a maximum of only 2500 rows (so this is opt-in).

aggFTW commented on May 23, 2024

How can we make that understanding obvious to the user? We would be sacrificing reliability for a feature.

msftristew commented on May 23, 2024

You need to understand the fact that you can't pull the contents of a large dataframe onto the Spark driver anyway; this is a prerequisite for using Spark effectively. The feature is no less reliable than Spark normally is in that regard (not to mention that with large enough records, capping to 2500 rows also doesn't prevent the user from getting OOM errors).
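
To make the point about record size concrete, here is a back-of-the-envelope sketch. The row sizes are made-up illustrative numbers, not measurements, and `estimated_result_bytes` is a hypothetical helper.

```python
MIB = 1024 ** 2


def estimated_result_bytes(row_count, avg_row_bytes):
    """Lower bound on the raw payload size of a result set, in bytes."""
    return row_count * avg_row_bytes


# 2500 small rows (assumed ~200 bytes each) are harmless.
small = estimated_result_bytes(2500, 200)

# 2500 large rows (assumed ~4 MiB each) are about 10000 MiB of raw data,
# so a row-count cap alone does not bound memory use.
large = estimated_result_bytes(2500, 4 * MIB)
```

The real footprint in the kernel would likely be higher, since deserialized Python objects carry overhead beyond the raw payload.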

aggFTW commented on May 23, 2024

Agreed. I know you like black-and-white guarantees, but I'm thinking of minimizing the probability of a customer getting OOM errors. Maybe we shouldn't, and we should let Spark evolve to handle those. Let's go with exposing sample and fraction the way you outlined and see how it behaves for many rows.
