Can we add an enhancement to merge update_theta_sketch? (datasketches-cpp, CLOSED)

apache commented on August 14, 2024

Can we add an enhancement to merge update_theta_sketch?

from datasketches-cpp.

Comments (5)

AlexanderSaydakov commented on August 14, 2024

Why? I would suggest using union.

ravindra-wagh commented on August 14, 2024

Let us say we have data for two columns, X and Y, and we want to perform the three standard set operations on them: union, intersection, and difference.
The column data is large and is distributed among the nodes for performance. Say we have 4 nodes, and each node creates its own update_theta_sketch for the data it receives. Finally, the leader node combines all the sketches into one sketch.

Column X computation:

             X
             |
 --------------------------
 |       |       |        |
 sk1    sk2     sk3      sk4   ==> All are update_theta_sketch
 |       |       |        |
  -------------------------
            |
          merge()
            |
   update_theta_sketch_X

Column Y computation:

             Y
             |
 --------------------------
 |       |       |        |
 sk1    sk2     sk3      sk4   ==> All are update_theta_sketch
 |       |       |        |
  -------------------------
            |
          merge()
            |
   update_theta_sketch_Y

Now, with the update_theta_sketch_X and update_theta_sketch_Y sketches, we can easily perform union, intersection, and difference on them.
If we use theta_union instead of merge() to combine them, we get theta_union_X and theta_union_Y. With these, we can directly perform only union operations, not intersection or difference. To perform intersection and difference, we first have to convert theta_union_X and theta_union_Y to compact_theta_sketch_X and compact_theta_sketch_Y respectively, and then apply intersection and difference to those.
If update_theta_sketch had a merge() function, we could easily perform all the operations using the base class only.

AlexanderSaydakov commented on August 14, 2024

I would suggest looking at distributed processing as a two-phase process: building sketches and merging sketches (or "map" and "reduce" in terms of the well-known map-reduce paradigm). These two phases can run on different sets of nodes, with a so-called "shuffle" (or network transfer) between the phases. Input data is partitioned and sent to a number of "map" nodes. Those nodes build sketches from raw data, so they would use update_theta_sketch. When they finish processing, they can finalize the sketches, serialize them, and send them to the second phase. It makes perfect sense to convert the sketches to compact_theta_sketch form at this point, before serialization, because compact sketches are smaller and do not need to be updated anymore. The second phase (or "reduce") would use a union to merge the sketches from the first phase, and the result of a union is a compact sketch as well.

If I understand you correctly, you want to have two columns, X and Y (say, metrics in a hypercube), that represent different sets of some distinct identifiers. So both X and Y are built in the manner described above. Now you can do set operations: X union Y, X intersection Y, or X not Y.

This is how sketches work in Druid, Pig, Hive and PostgreSQL.
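The two-phase flow described above can be illustrated with a small self-contained sketch. It does not use datasketches-cpp: an exact std::set of IDs stands in for a compact sketch, and ToySketch, build_on_node, union_all, build_column, intersect and a_not_b are hypothetical names made up for this illustration. The comments note which library class each step would roughly correspond to, as far as I understand the library.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical stand-in for a compact sketch: an exact, ordered set of IDs.
// A real theta sketch keeps only a bounded sample of hashes and estimates.
using ToySketch = std::set<std::uint64_t>;

// Phase 1 ("map"): one node builds a sketch from its partition of raw data.
// With the library this would be an update_theta_sketch, compacted before
// it is serialized and sent on.
ToySketch build_on_node(const std::vector<std::uint64_t>& partition) {
    return ToySketch(partition.begin(), partition.end());
}

// Phase 2 ("reduce"): union the per-node sketches of one column.
// With the library this would be a theta_union, whose result is compact.
ToySketch union_all(const std::vector<ToySketch>& parts) {
    ToySketch result;
    for (const auto& p : parts) result.insert(p.begin(), p.end());
    return result;
}

// Both phases for one column (X or Y in the thread).
ToySketch build_column(const std::vector<std::vector<std::uint64_t>>& partitions) {
    std::vector<ToySketch> per_node;
    for (const auto& part : partitions) per_node.push_back(build_on_node(part));
    return union_all(per_node);
}

// Set operations on the per-column results; with the library these would be
// applied to the compact sketches produced by the unions.
ToySketch intersect(const ToySketch& a, const ToySketch& b) {
    ToySketch out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::inserter(out, out.end()));
    return out;
}

ToySketch a_not_b(const ToySketch& a, const ToySketch& b) {
    ToySketch out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::inserter(out, out.end()));
    return out;
}
```

Calling build_column once for X and once for Y, each with four partitions, gives two results on which union_all, intersect and a_not_b can all be applied directly. That is the point made in the comment above: no merge() on the updatable sketch is needed, because the union result is already in the form the set operations accept.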

ravindra-wagh commented on August 14, 2024

Great, thanks for all these details, I will make the changes as suggested above. Thanks once again!

AlexanderSaydakov commented on August 14, 2024

I think we can close this
