Code Monkey home page Code Monkey logo

Comments (16)

ggevay avatar ggevay commented on September 28, 2024

returns an array of DataBags

This part will be problematic. I think we assume throughout Emma that DataBags are not inside other data structures, but are just directly assigned to variables (or passed between functions).

I think the "Emma-idiomatic" way of doing this would be returning a DataBag of DataBags, but we will need efficient handling of nested DataBags for that.

from emma.

aalexandrov avatar aalexandrov commented on September 28, 2024

@fschueler this is a good idea, but I suggest to this as an emma-lib method for the time being.

@ggevay is right - wrapping bags in containers is problematic, as the compiler still does not understand the semantics of these containers, so this in fact introduces an optimization barrier.

To illustrate, consider this scenario.

// read a bunch of labeled points, type is DataBag[LDPoint[Long, String]]
val points = DataBag.readParquet[LDPoint[Long, String]](???)
// group by label, type is DataBag[Group[String, LDPoint[Long, String]]]
val groups = points.groupBy(_.label)
// split groups randomly, type is Array[DataBag[Group[String, LDPoint[Long, String]]]]
val splits = randomSplit(0.5, 0.3, 0.2)(groups)

// Emma cannot apply fold-group-fusion here because 
// it can't associate `split` with the `groupBy` application
for (split <- splits)
  println(split.values.count)

That being said, this is the only class of lost optimizations that I can think about at the moment, and on the other side I think that prohibiting the use of DataBag instances in containers might be a too restrictive constraint on the programming model, so I'm not so sure how to proceed.

For the concrete task, I can think of a solution which enforces clients to apply the projection as part of the splitting primitive, e.g.

def randomSplit(fractions: Double*)(xs: DataBag[A])(seed: Long): Array[DataBag[A]]

becomes

def randomSplit(fractions: Seq[Double], i: Int)(xs: DataBag[A])(seed: Long): DataBag[A]

and the problem above is solved as follows.

// split fractions for the following `randomSplit` application
val sfracs = Seq(0.5, 0.3, 0.2)

// Easier to apply fold-group-fusion upon 
// inlining the `randomSplit` method
for (i <- (0 until fracs))
  println(randomSplit(sfracs, i)(groups).values.count)

from emma.

fschueler avatar fschueler commented on September 28, 2024

We could also think about doing it as a simple key-assignment based on the index of the fraction array where we assign keys based on the associated probability and then group the data based on those indices.

val data = DataBag.readParquet[LDPoint[Long, Double]](???)
val fractions = Array(0.3, 0.3, 0.4)

val samples = for (instance <- data) yield {
    val index = /* sample based on normalized fractions over the values {0, 1, 2}*/
    (index, instance)
}

val splits = groupBy(_.1)

val train = splits.filter(_.1 == 1).map(_.2)
val test  = splits.filter(_.1 == 2).map(_.2)
val eval  = splits.filter(_.1 == 3).map(_.2)

Not sure how efficient this is, e.g. how many passes over the whole dataset will really be necessary.

from emma.

aalexandrov avatar aalexandrov commented on September 28, 2024

The problem is that this might cause an additional shuffle. I suggest to stick to the original idea with Array[DataBag[A]] as result type.

from emma.

aalexandrov avatar aalexandrov commented on September 28, 2024

@fschueler I suggest to proceed with your original suggestion. Let's add a method with the signature

def split(fractions: Double*)(seed: Long = 631431513L): Array[DataBag[A]]

to the DataBag[A] trait and implement it once per implementation (similar to sample).

from emma.

fschueler avatar fschueler commented on September 28, 2024

Alright, I'll work on it!

from emma.

fschueler avatar fschueler commented on September 28, 2024

As I see it, there are two (semantically different) options here:

  1. Imprecise splits

Go over the whole dataset once and assign split IDs based on the given weights (uniform sampling from the CDF). For large data this will converge to split(i).size = fraction(i) * N where N is the size of the whole dataset. For smaller data this could lead to empty splits even though they were assigned a small but nonzero weight.

  1. Precise splits

Basically do xs.sample(fraction(i) * N) for every split i. This would produce results where the final size actually corresponds to the given proportions but probably requires multiple traversals (if there is not a good algorithm to do this in one pass). Additionally, we can not just use the implemented sample because we would need sampling without replacement (should probably be implemented anyways).

I suggest to start with option 1. and add 2. later.

from emma.

aalexandrov avatar aalexandrov commented on September 28, 2024

👍 for imprecise splits.

from emma.

aalexandrov avatar aalexandrov commented on September 28, 2024

The second option does produce a split in my understanding, as we additionally have to ensure that the splits are non-overlapping.

For the Imprecise splits implementation, use something along the following lines (this is for Spark, similar for the other two).

val ts = rep.zipWithIndex()
      .mapPartitionsWithIndex({ (pid, it) =>
        for ((e, i) <- it) yield {
            val r = util.RanHash(seed).at(i)
            val j = r.nextLong(i + 1) // TODO: filter through inverse CDF
            (j, e)
        }
      })
for (f <- fractions.toArray) 
  yield for ((i, e) <- ts if i == f) yield e

from emma.

fschueler avatar fschueler commented on September 28, 2024

The second option does produce a split in my understanding, as we additionally have to ensure that the splits are non-overlapping.

Yes, that's true. It's like sampling n times without replacement. So it's basically a chain of

val s1 = xs.sample(f1 * N, withReplacement = false)
val s2 = xs.diff(s1).sample(f2 * N, withReplacement = false)
...
val sn = xs.diff(s1.union(s2)....union(sn-1)) // no sample here, it's just the rest

if I understand it correctly.

from emma.

aalexandrov avatar aalexandrov commented on September 28, 2024
xs.diff(s1).sample(f2 * N, withReplacement = false)

This will produce f2 * N with respect to xs - s1, not with respect to xs.

from emma.

fschueler avatar fschueler commented on September 28, 2024

Shouldn't it be with respect to xs if N = xs.size and f1...fn are fixed?

from emma.

aalexandrov avatar aalexandrov commented on September 28, 2024

Yes, sorry, my bad.

from emma.

aalexandrov avatar aalexandrov commented on September 28, 2024

Either way this will be way too expensive for such a basic operation.

from emma.

fschueler avatar fschueler commented on September 28, 2024

Yes, I agree! Going with the imprecise splits should be good enough for most of the cases where this might be used (especially train/test/predict splits).

from emma.

aalexandrov avatar aalexandrov commented on September 28, 2024

Merged in emma-lib in emmalanguage/emma-lib@48739a7 instead. See the discussion at the end of #353 for details.

from emma.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.