Comments (16)
returns an array of
DataBag
s
This part will be problematic. I think we assume throughout Emma that DataBag
s are not inside other data structures, but are just directly assigned to variables (or passed between functions).
I think the "Emma-idiomatic" way of doing this would be returning a DataBag
of DataBag
s, but we will need efficient handling of nested DataBag
s for that.
from emma.
@fschueler this is a good idea, but I suggest to this as an emma-lib
method for the time being.
@ggevay is right - wrapping bags in containers is problematic, as the compiler still does not understand the semantics of these containers, so this in fact introduces an optimization barrier.
To illustrate, consider this scenario.
// read a bunch of labeled points, type is DataBag[LDPoint[Long, String]]
val points = DataBag.readParquet[LDPoint[Long, String]](???)
// group by label, type is DataBag[Group[String, LDPoint[Long, String]]]
val groups = points.groupBy(_.label)
// split groups randomly, type is Array[DataBag[Group[String, LDPoint[Long, String]]]]
val splits = randomSplit(0.5, 0.3, 0.2)(groups)
// Emma cannot apply fold-group-fusion here because
// it can't associate `split` with the `groupBy` application
for (split <- splits)
println(split.values.count)
That being said, this is the only class of lost optimizations that I can think about at the moment, and on the other side I think that prohibiting the use of DataBag
instances in containers might be a too restrictive constraint on the programming model, so I'm not so sure how to proceed.
For the concrete task, I can think of a solution which enforces clients to apply the projection as part of the splitting primitive, e.g.
def randomSplit(fractions: Double*)(xs: DataBag[A])(seed: Long): Array[DataBag[A]]
becomes
def randomSplit(fractions: Seq[Double], i: Int)(xs: DataBag[A])(seed: Long): DataBag[A]
and the problem above is solved as follows.
// split fractions for the following `randomSplit` application
val sfracs = Seq(0.5, 0.3, 0.2)
// Easier to apply fold-group-fusion upon
// inlining the `randomSplit` method
for (i <- (0 until fracs))
println(randomSplit(sfracs, i)(groups).values.count)
from emma.
We could also think about doing it as a simple key-assignment based on the index of the fraction array where we assign keys based on the associated probability and then group the data based on those indices.
val data = DataBag.readParquet[LDPoint[Long, Double]](???)
val fractions = Array(0.3, 0.3, 0.4)
val samples = for (instance <- data) yield {
val index = /* sample based on normalized fractions over the values {0, 1, 2}*/
(index, instance)
}
val splits = groupBy(_.1)
val train = splits.filter(_.1 == 1).map(_.2)
val test = splits.filter(_.1 == 2).map(_.2)
val eval = splits.filter(_.1 == 3).map(_.2)
Not sure how efficient this is, e.g. how many passes over the whole dataset will really be necessary.
from emma.
The problem is that this might cause an additional shuffle. I suggest to stick to the original idea with Array[DataBag[A]]
as result type.
from emma.
@fschueler I suggest to proceed with your original suggestion. Let's add a method with the signature
def split(fractions: Double*)(seed: Long = 631431513L): Array[DataBag[A]]
to the DataBag[A]
trait and implement it once per implementation (similar to sample
).
from emma.
Alright, I'll work on it!
from emma.
As I see it, there are two (semantically different) options here:
- Imprecise splits
Go over the whole dataset once and assign split IDs based on the given weights (uniform sampling from the CDF). For large data this will converge to split(i).size = fraction(i) * N
where N
is the size of the whole dataset. For smaller data this could lead to empty splits even though they were assigned a small but nonzero weight.
- Precise splits
Basically do xs.sample(fraction(i) * N)
for every split i
. This would produce results where the final size actually corresponds to the given proportions but probably requires multiple traversals (if there is not a good algorithm to do this in one pass). Additionally, we can not just use the implemented sample because we would need sampling without replacement (should probably be implemented anyways).
I suggest to start with option 1. and add 2. later.
from emma.
from emma.
The second option does produce a split in my understanding, as we additionally have to ensure that the splits are non-overlapping.
For the Imprecise splits implementation, use something along the following lines (this is for Spark, similar for the other two).
val ts = rep.zipWithIndex()
.mapPartitionsWithIndex({ (pid, it) =>
for ((e, i) <- it) yield {
val r = util.RanHash(seed).at(i)
val j = r.nextLong(i + 1) // TODO: filter through inverse CDF
(j, e)
}
})
for (f <- fractions.toArray)
yield for ((i, e) <- ts if i == f) yield e
from emma.
The second option does produce a split in my understanding, as we additionally have to ensure that the splits are non-overlapping.
Yes, that's true. It's like sampling n
times without replacement. So it's basically a chain of
val s1 = xs.sample(f1 * N, withReplacement = false)
val s2 = xs.diff(s1).sample(f2 * N, withReplacement = false)
...
val sn = xs.diff(s1.union(s2)....union(sn-1)) // no sample here, it's just the rest
if I understand it correctly.
from emma.
xs.diff(s1).sample(f2 * N, withReplacement = false)
This will produce f2 * N
with respect to xs - s1
, not with respect to xs
.
from emma.
Shouldn't it be with respect to xs
if N = xs.size
and f1...fn
are fixed?
from emma.
Yes, sorry, my bad.
from emma.
Either way this will be way too expensive for such a basic operation.
from emma.
Yes, I agree! Going with the imprecise splits should be good enough for most of the cases where this might be used (especially train/test/predict splits).
from emma.
Merged in emma-lib
in emmalanguage/emma-lib@48739a7 instead. See the discussion at the end of #353 for details.
from emma.
Related Issues (20)
- Macro-based compiler pipelines break in the presence of unbound values in the Spark REPL HOT 16
- Add conversion methods to backend-specific distributed collections. HOT 1
- [COGADB] Add a basic README.md to emma-cogadb. HOT 1
- Implement the static Fold-Group-Fusion optimization HOT 1
- Create the CoGaDBTable which extends DataBag. It should base upon ScalaSeq. HOT 1
- [TEST] Create micro benchmarks for compiler transformations HOT 3
- [INFRA] Migrate to container-based Travis CI
- [IR] LibSupport.expand match error on GraphPreprocessing example
- [IR] Enable `foldForestFusion` on Emma Core
- Add `addCacheCalls` in `BaseCodegenIntegrationSpec` HOT 6
- Constructed references have too narrow type HOT 2
- Size of fat jars HOT 5
- Eliminate `CommonAST.universe` member HOT 1
- Named tree transformations HOT 1
- Improve the pretty-printing facilities for Emma Core / Emma Source(ANF)
- Cache-Call Insertion breaks Flink iterations HOT 1
- Source.valid says 'invalid' for valid trees HOT 2
- Integrate Squid HOT 3
- Broken emma-examples link in README HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from emma.