Comments (5)
Why? I would suggest using union.
from datasketches-cpp.
Let us say we have data for two columns, X and Y, and we want to perform the three standard set operations on them: union, intersection and difference.
The column data is large and is distributed among nodes for performance. Say we have 4 nodes, and each node creates its own update_theta_sketch for the data it receives. Finally, the leader node combines all the sketches into one sketch.
Column X computation:
                 X
                 |
   ----------------------------
   |        |        |        |
  sk1      sk2      sk3      sk4   ==> All are update_theta_sketch
   |        |        |        |
   ----------------------------
                 |
              merge()
                 |
       update_theta_sketch_X
Column Y computation:
                 Y
                 |
   ----------------------------
   |        |        |        |
  sk1      sk2      sk3      sk4   ==> All are update_theta_sketch
   |        |        |        |
   ----------------------------
                 |
              merge()
                 |
       update_theta_sketch_Y
Now with update_theta_sketch_X and update_theta_sketch_Y sketches, we can easily perform union, intersection and difference on them.
If we use theta_union instead of merge() to combine them, we get theta_union_X and theta_union_Y. With these we can perform only union operations, not intersection or difference directly. To perform intersection and difference, we first have to convert theta_union_X and theta_union_Y to compact_theta_sketch_X and compact_theta_sketch_Y respectively, and then apply intersection and difference to them.
If the update_theta_sketch class had a merge() function, we could easily perform all the operations using the base class only.
from datasketches-cpp.
I would suggest looking at distributed processing as a two-phase process: building sketches and merging sketches (or "map" and "reduce" in terms of the well-known map-reduce paradigm). These two phases can run on different sets of nodes, with a so-called "shuffle" (or network transfer) between the phases. Input data is partitioned and sent to a number of "map" nodes. Those nodes build sketches from raw data, so they would use update_theta_sketch. When they finish processing, they can finalize the sketches, serialize them and send them to the second phase. It makes perfect sense to convert the sketches to compact_theta_sketch form at this point, before serialization, because compact sketches are smaller and do not need to be updated anymore. The second phase (or "reduce") would use a union to merge the sketches from the first phase, and the result of a union is a compact sketch as well.
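To make the two phases concrete, here is a minimal, stdlib-only sketch of the idea. It is a toy model, not the datasketches-cpp API: `ToySketch`, `hash64`, `insert_hash`, `unite`, `estimate` and `two_phase_estimate` are all hypothetical names introduced for illustration. In the real library the corresponding pieces would be update_theta_sketch on the "map" side and theta_union on the "reduce" side.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <limits>
#include <set>
#include <vector>

// Toy stand-in for a theta sketch (hypothetical, NOT a datasketches-cpp class).
// It retains every hash below the threshold theta, keeping at most k of them.
struct ToySketch {
    std::uint64_t theta = std::numeric_limits<std::uint64_t>::max();
    std::set<std::uint64_t> hashes;
};

// splitmix64 finalizer, used here as a simple deterministic 64-bit hash.
std::uint64_t hash64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Insert a hash if it is below theta. While more than k hashes are retained,
// lower theta to the largest retained hash and drop that hash.
void insert_hash(ToySketch& s, std::uint64_t h, std::size_t k) {
    assert(k > 0);
    if (h >= s.theta) return;
    s.hashes.insert(h);
    while (s.hashes.size() > k) {
        auto largest = std::prev(s.hashes.end());
        s.theta = *largest;
        s.hashes.erase(largest);
    }
}

// Phase 2 ("reduce"): the union starts from the smallest theta of all inputs
// and keeps only hashes below it, so the inputs never need further updates.
ToySketch unite(const std::vector<ToySketch>& parts, std::size_t k) {
    ToySketch out;
    for (const auto& p : parts) out.theta = std::min(out.theta, p.theta);
    for (const auto& p : parts)
        for (std::uint64_t h : p.hashes) insert_hash(out, h, k);
    return out;
}

// Distinct-count estimate: retained hashes divided by the sampled fraction.
double estimate(const ToySketch& s) {
    const double fraction =
        static_cast<double>(s.theta) / 18446744073709551616.0;  // theta / 2^64
    return static_cast<double>(s.hashes.size()) / fraction;
}

// Phase 1 ("map") followed by phase 2: n distinct ids are spread over the
// nodes, and every id is deliberately seen by two nodes to create duplicates.
double two_phase_estimate(std::uint64_t n, std::size_t nodes, std::size_t k) {
    assert(nodes > 0);
    std::vector<ToySketch> parts(nodes);
    for (std::uint64_t id = 0; id < n; ++id) {
        insert_hash(parts[id % nodes], hash64(id), k);
        insert_hash(parts[(id + 1) % nodes], hash64(id), k);
    }
    return estimate(unite(parts, k));
}
```

Calling `two_phase_estimate(1000, 4, 4096)` keeps every sketch below its capacity, so the union counts the 1000 distinct ids exactly even though each id was seen on two nodes. With a smaller k the result becomes an estimate rather than an exact count.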
If I understand you correctly, you want to have two columns X and Y (say, metrics in a hypercube) that represent different sets of some distinct identifiers. So both X and Y are built in the manner described above. Now you can do set operations: X union Y, X intersection Y, or X not Y.
This is how sketches work in Druid, Pig, Hive and PostgreSQL.
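The set operations on X and Y mentioned above can be illustrated with the same kind of toy model. Again, this is a stdlib-only sketch under stated assumptions, not the library API: `ToyCompact`, `hash64`, `build`, `SetOp`, `combine` and `estimate` are hypothetical names. The key point it shows is that all three operations only read the two finished sketches, so compact (read-only) inputs are sufficient.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <limits>
#include <set>

// Toy read-only sketch (hypothetical, NOT a datasketches-cpp class):
// all hashes of the input that fall below theta.
struct ToyCompact {
    std::uint64_t theta = std::numeric_limits<std::uint64_t>::max();
    std::set<std::uint64_t> hashes;
};

// splitmix64 finalizer, used here as a simple deterministic 64-bit hash.
std::uint64_t hash64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Build a sketch of the ids in [first, last), retaining at most k hashes.
ToyCompact build(std::uint64_t first, std::uint64_t last, std::size_t k) {
    ToyCompact s;
    for (std::uint64_t id = first; id < last; ++id) {
        const std::uint64_t h = hash64(id);
        if (h >= s.theta) continue;
        s.hashes.insert(h);
        while (s.hashes.size() > k) {  // over capacity: lower theta
            auto largest = std::prev(s.hashes.end());
            s.theta = *largest;
            s.hashes.erase(largest);
        }
    }
    return s;
}

enum class SetOp { Union, Intersection, ANotB };

// All three operations share one rule: the result's theta is the smaller of
// the two input thetas, and only hashes below it are considered.
// (A production union would also cap the result at k entries; omitted here.)
ToyCompact combine(const ToyCompact& a, const ToyCompact& b, SetOp op) {
    ToyCompact out;
    out.theta = std::min(a.theta, b.theta);
    for (std::uint64_t h : a.hashes) {
        if (h >= out.theta) continue;
        const bool in_b = b.hashes.count(h) > 0;
        if (op == SetOp::Union ||
            (op == SetOp::Intersection && in_b) ||
            (op == SetOp::ANotB && !in_b)) {
            out.hashes.insert(h);
        }
    }
    if (op == SetOp::Union) {
        for (std::uint64_t h : b.hashes)
            if (h < out.theta) out.hashes.insert(h);
    }
    return out;
}

// Distinct-count estimate: retained hashes divided by the sampled fraction.
double estimate(const ToyCompact& s) {
    const double fraction =
        static_cast<double>(s.theta) / 18446744073709551616.0;  // theta / 2^64
    return static_cast<double>(s.hashes.size()) / fraction;
}
```

For example, with `x = build(0, 1000, 4096)` and `y = build(500, 1500, 4096)` both sketches stay below capacity, so `combine` gives exact counts: 1500 for the union, 500 for the intersection and 500 for X not Y.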
from datasketches-cpp.
Great, thanks for all these details, I will make the changes as suggested above. Thanks once again!
from datasketches-cpp.
I think we can close this
from datasketches-cpp.