For things like train-test splits or n-fold cross-validation we should implement a <co

returns an array of DataBag s </block

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

<g-emoji class="g-emoji" alias="+1" fallback-src="https://github.githubassets.com/imag

[API] Add `randomSplit` DataBag method about emma HOT 16 CLOSED

fschueler commented on September 28, 2024

[API] Add `randomSplit` DataBag method

from emma.

Comments (16)

ggevay commented on September 28, 2024

returns an array of DataBags

This part will be problematic. I think we assume throughout Emma that DataBags are not inside other data structures, but are just directly assigned to variables (or passed between functions).

I think the "Emma-idiomatic" way of doing this would be returning a DataBag of DataBags, but we will need efficient handling of nested DataBags for that.

from emma.

aalexandrov commented on September 28, 2024

@fschueler this is a good idea, but I suggest to this as an emma-lib method for the time being.

@ggevay is right - wrapping bags in containers is problematic, as the compiler still does not understand the semantics of these containers, so this in fact introduces an optimization barrier.

To illustrate, consider this scenario.

// read a bunch of labeled points, type is DataBag[LDPoint[Long, String]]
val points = DataBag.readParquet[LDPoint[Long, String]](???)
// group by label, type is DataBag[Group[String, LDPoint[Long, String]]]
val groups = points.groupBy(_.label)
// split groups randomly, type is Array[DataBag[Group[String, LDPoint[Long, String]]]]
val splits = randomSplit(0.5, 0.3, 0.2)(groups)

// Emma cannot apply fold-group-fusion here because 
// it can't associate `split` with the `groupBy` application
for (split <- splits)
  println(split.values.count)

That being said, this is the only class of lost optimizations that I can think about at the moment, and on the other side I think that prohibiting the use of DataBag instances in containers might be a too restrictive constraint on the programming model, so I'm not so sure how to proceed.

For the concrete task, I can think of a solution which enforces clients to apply the projection as part of the splitting primitive, e.g.

def randomSplit(fractions: Double*)(xs: DataBag[A])(seed: Long): Array[DataBag[A]]

becomes

def randomSplit(fractions: Seq[Double], i: Int)(xs: DataBag[A])(seed: Long): DataBag[A]

and the problem above is solved as follows.

// split fractions for the following `randomSplit` application
val sfracs = Seq(0.5, 0.3, 0.2)

// Easier to apply fold-group-fusion upon 
// inlining the `randomSplit` method
for (i <- (0 until fracs))
  println(randomSplit(sfracs, i)(groups).values.count)

from emma.

fschueler commented on September 28, 2024

We could also think about doing it as a simple key-assignment based on the index of the fraction array where we assign keys based on the associated probability and then group the data based on those indices.

val data = DataBag.readParquet[LDPoint[Long, Double]](???)
val fractions = Array(0.3, 0.3, 0.4)

val samples = for (instance <- data) yield {
    val index = /* sample based on normalized fractions over the values {0, 1, 2}*/
    (index, instance)
}

val splits = groupBy(_.1)

val train = splits.filter(_.1 == 1).map(_.2)
val test  = splits.filter(_.1 == 2).map(_.2)
val eval  = splits.filter(_.1 == 3).map(_.2)

Not sure how efficient this is, e.g. how many passes over the whole dataset will really be necessary.

from emma.

aalexandrov commented on September 28, 2024

The problem is that this might cause an additional shuffle. I suggest to stick to the original idea with Array[DataBag[A]] as result type.

from emma.

aalexandrov commented on September 28, 2024

@fschueler I suggest to proceed with your original suggestion. Let's add a method with the signature

def split(fractions: Double*)(seed: Long = 631431513L): Array[DataBag[A]]

to the DataBag[A] trait and implement it once per implementation (similar to sample).

from emma.

fschueler commented on September 28, 2024

Alright, I'll work on it!

from emma.

fschueler commented on September 28, 2024

As I see it, there are two (semantically different) options here:

Imprecise splits

Go over the whole dataset once and assign split IDs based on the given weights (uniform sampling from the CDF). For large data this will converge to split(i).size = fraction(i) * N where N is the size of the whole dataset. For smaller data this could lead to empty splits even though they were assigned a small but nonzero weight.

Precise splits

Basically do xs.sample(fraction(i) * N) for every split i. This would produce results where the final size actually corresponds to the given proportions but probably requires multiple traversals (if there is not a good algorithm to do this in one pass). Additionally, we can not just use the implemented sample because we would need sampling without replacement (should probably be implemented anyways).

I suggest to start with option 1. and add 2. later.

from emma.

aalexandrov commented on September 28, 2024

👍 for imprecise splits.

from emma.

aalexandrov commented on September 28, 2024

The second option does produce a split in my understanding, as we additionally have to ensure that the splits are non-overlapping.

For the Imprecise splits implementation, use something along the following lines (this is for Spark, similar for the other two).

val ts = rep.zipWithIndex()
      .mapPartitionsWithIndex({ (pid, it) =>
        for ((e, i) <- it) yield {
            val r = util.RanHash(seed).at(i)
            val j = r.nextLong(i + 1) // TODO: filter through inverse CDF
            (j, e)
        }
      })
for (f <- fractions.toArray) 
  yield for ((i, e) <- ts if i == f) yield e

from emma.

fschueler commented on September 28, 2024

The second option does produce a split in my understanding, as we additionally have to ensure that the splits are non-overlapping.

Yes, that's true. It's like sampling n times without replacement. So it's basically a chain of

val s1 = xs.sample(f1 * N, withReplacement = false)
val s2 = xs.diff(s1).sample(f2 * N, withReplacement = false)
...
val sn = xs.diff(s1.union(s2)....union(sn-1)) // no sample here, it's just the rest

if I understand it correctly.

from emma.

aalexandrov commented on September 28, 2024

xs.diff(s1).sample(f2 * N, withReplacement = false)

This will produce f2 * N with respect to xs - s1, not with respect to xs.

from emma.

fschueler commented on September 28, 2024

Shouldn't it be with respect to xs if N = xs.size and f1...fn are fixed?

from emma.

aalexandrov commented on September 28, 2024

Yes, sorry, my bad.

from emma.

aalexandrov commented on September 28, 2024

Either way this will be way too expensive for such a basic operation.

from emma.

fschueler commented on September 28, 2024

Yes, I agree! Going with the imprecise splits should be good enough for most of the cases where this might be used (especially train/test/predict splits).

from emma.

aalexandrov commented on September 28, 2024

Merged in emma-lib in emmalanguage/emma-lib@48739a7 instead. See the discussion at the end of #353 for details.

from emma.

[API] Add `randomSplit` DataBag method about emma HOT 16 CLOSED

Comments (16)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent