Comments (5)

AlexanderSaydakov commented on August 14, 2024

Why? I would suggest using union.

from datasketches-cpp.

ravindra-wagh commented on August 14, 2024

Let us say we have data for two columns, X and Y, and we want to perform the three standard set operations on them: union, intersection, and difference.
The column data is large and is distributed among nodes for performance. Say we have 4 nodes, and each node creates its own update_theta_sketch for the data it receives. Finally, the leader node combines all the sketches into one sketch.

Column X computation:

             X
             |
 --------------------------
 |       |       |        |
 sk1    sk2     sk3      sk4   ==> All are update_theta_sketch
 |       |       |        |
 --------------------------
            |
          merge()
            |
   update_theta_sketch_X

Column Y computation:

             Y
             |
 --------------------------
 |       |       |        |
 sk1    sk2     sk3      sk4   ==> All are update_theta_sketch
 |       |       |        |
 --------------------------
            |
          merge()
            |
   update_theta_sketch_Y

Now, with the update_theta_sketch_X and update_theta_sketch_Y sketches, we can easily perform union, intersection, and difference on them.
If we use theta_union instead of merge() to combine them, we get theta_union_X and theta_union_Y. With these, we can directly perform only union operations, not intersection or difference. To perform intersection and difference, we first have to convert theta_union_X and theta_union_Y to compact_theta_sketch_X and compact_theta_sketch_Y, respectively, and then apply intersection and difference to them.
If the update_theta_sketch class had a merge() function, we could easily perform all the operations using the base class only.

AlexanderSaydakov commented on August 14, 2024

I would suggest looking at distributed processing as a two-phase process: building sketches and merging sketches ("map" and "reduce" in terms of the well-known map-reduce paradigm). These two phases can run on different sets of nodes, with a so-called "shuffle" (network transfer) between them. Input data is partitioned and sent to a number of "map" nodes. Those nodes build sketches from raw data, so they use update_theta_sketch. When they finish processing, they can finalize the sketches, serialize them, and send them to the second phase. It makes perfect sense to convert the sketches to compact_theta_sketch form at this point, before serialization, because compact sketches are smaller and do not need to be updated anymore. The second phase ("reduce") uses a union to merge the sketches from the first phase, and the result of a union is a compact sketch as well.

If I understand you correctly, you want to have two columns, X and Y (say, metrics in a hypercube), that represent different sets of distinct identifiers. Both X and Y are built in the manner described above. Now you can do set operations: X union Y, X intersection Y, or X not Y.

This is how sketches work in Druid, Pig, Hive and PostgreSQL.
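
To make the two-phase flow concrete, here is a minimal toy sketch of it in C++ using only the standard library. It is not the datasketches-cpp API: every name in it (ToyUpdateSketch, ToyCompactSketch, ToyUnion, intersect, a_not_b, build_column, toy_hash) is hypothetical, the hash is a stand-in, and the 4-node partitioning is simulated in one process. It only illustrates the idea that a compact sketch is a threshold (theta) plus the retained hashes below it, and that this is all a union, intersection, or difference needs.

```cpp
// Toy theta (KMV-style) sketch. NOT the datasketches-cpp API; all names are hypothetical.
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <limits>
#include <set>
#include <vector>

constexpr std::uint64_t kMaxTheta = std::numeric_limits<std::uint64_t>::max();

// splitmix64 finalizer, used here only as a stand-in hash function.
inline std::uint64_t toy_hash(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Read-only form: a threshold plus the sorted, unique hashes below it.
struct ToyCompactSketch {
    std::uint64_t theta = kMaxTheta;
    std::vector<std::uint64_t> hashes;
    double estimate() const {
        if (theta == kMaxTheta) return static_cast<double>(hashes.size());  // exact mode
        const double fraction = static_cast<double>(theta) / 18446744073709551616.0;  // theta / 2^64
        return static_cast<double>(hashes.size()) / fraction;
    }
};

// Keep at most k hashes: lower theta to the (k+1)-th smallest hash and drop the rest.
inline void trim(std::set<std::uint64_t>& hashes, std::uint64_t& theta, std::size_t k) {
    if (hashes.size() <= k) return;
    auto cut = std::next(hashes.begin(), static_cast<std::ptrdiff_t>(k));
    theta = *cut;
    hashes.erase(cut, hashes.end());
}

// Updatable form, used in the "map" phase to ingest raw items.
class ToyUpdateSketch {
public:
    explicit ToyUpdateSketch(std::size_t k) : k_(k) { assert(k > 0); }
    void update(std::uint64_t item) {
        const std::uint64_t h = toy_hash(item);
        if (h >= theta_) return;  // above the threshold: ignored
        hashes_.insert(h);
        trim(hashes_, theta_, k_);
    }
    ToyCompactSketch compact() const {
        return {theta_, std::vector<std::uint64_t>(hashes_.begin(), hashes_.end())};
    }
private:
    std::size_t k_;
    std::uint64_t theta_ = kMaxTheta;
    std::set<std::uint64_t> hashes_;
};

// "Reduce" phase: merges compact sketches; the result is again a compact sketch.
class ToyUnion {
public:
    explicit ToyUnion(std::size_t k) : k_(k) { assert(k > 0); }
    void update(const ToyCompactSketch& s) {
        theta_ = std::min(theta_, s.theta);
        hashes_.insert(s.hashes.begin(), s.hashes.end());
    }
    ToyCompactSketch result() const {
        // Keep only hashes below the smallest theta seen, then cap at k entries.
        std::set<std::uint64_t> kept(hashes_.begin(), hashes_.lower_bound(theta_));
        std::uint64_t theta = theta_;
        trim(kept, theta, k_);
        return {theta, std::vector<std::uint64_t>(kept.begin(), kept.end())};
    }
private:
    std::size_t k_;
    std::uint64_t theta_ = kMaxTheta;
    std::set<std::uint64_t> hashes_;
};

inline ToyCompactSketch intersect(const ToyCompactSketch& a, const ToyCompactSketch& b) {
    ToyCompactSketch out;
    out.theta = std::min(a.theta, b.theta);
    // Hashes present in both inputs are already below both thetas.
    std::set_intersection(a.hashes.begin(), a.hashes.end(), b.hashes.begin(), b.hashes.end(),
                          std::back_inserter(out.hashes));
    return out;
}

inline ToyCompactSketch a_not_b(const ToyCompactSketch& a, const ToyCompactSketch& b) {
    ToyCompactSketch out;
    out.theta = std::min(a.theta, b.theta);
    std::set_difference(a.hashes.begin(), a.hashes.end(), b.hashes.begin(), b.hashes.end(),
                        std::back_inserter(out.hashes));
    // Hashes of a at or above the common theta cannot be compared with b, so drop them.
    out.hashes.erase(std::lower_bound(out.hashes.begin(), out.hashes.end(), out.theta),
                     out.hashes.end());
    return out;
}

// One column: ids in [first, last) are spread over 4 simulated "map" nodes,
// each node compacts its sketch, and a union plays the role of the "reduce" node.
inline ToyCompactSketch build_column(std::uint64_t first, std::uint64_t last, std::size_t k) {
    const std::uint64_t nodes = 4;
    ToyUnion reducer(k);
    for (std::uint64_t node = 0; node < nodes; ++node) {
        ToyUpdateSketch sk(k);
        for (std::uint64_t id = first + node; id < last; id += nodes) sk.update(id);
        reducer.update(sk.compact());  // compact before "sending" to the reducer
    }
    return reducer.result();
}
```

With this toy, build_column(0, 100, 4096) and build_column(50, 150, 4096) produce compact sketches for two overlapping columns X and Y. Feeding both into a ToyUnion, or passing them to intersect or a_not_b, gives the three set operations directly on the union results, with no separate merge() on the updatable sketch. Because k is larger than the number of distinct ids here, both sketches stay in exact mode, so the counts are exact: 150 for the union, 50 for the intersection, and 50 for X not Y.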

ravindra-wagh commented on August 14, 2024

Great, thanks for all these details. I will make the changes as suggested above. Thanks once again!

AlexanderSaydakov commented on August 14, 2024

I think we can close this.
