Comments (9)
Yes there is. Once a KLL sketch reaches a certain size it grows very slowly, but it does grow. We have provided the function getMaxSerializedSizeBytes(k, n) for exactly this purpose. The only catch is you need to estimate the largest n that you anticipate. If you plot this static function you will discover that for large enough n, it only grows by 32 bytes for every factor of 2 increase in n. So overestimating n by a factor of 2 or 10 won't impact the overall size that much.
from datasketches-cpp.
Oops! I was thinking about the function in the Java version. That function must have been omitted by mistake in the C++ version. We will correct that pronto! :)
from datasketches-cpp.
Thanks @leerho I'll make sure to close this once the function is merged, for now I'll just use my own based on the Java version.
For future reference the relevant Java functions are getMaxSerializedBytes and getSerializedSizeBytes.
from datasketches-cpp.
It was not a mistake to omit getMaxSerializedSizeBytes(k, n) in the C++ implementation, but rather a result of an incomplete process of reconsidering the interface in light of generic implementation.
In Java, KllFloatsSketch is a specialized implementation using a primitive array of floats. There is no generic implementation of KllSketch<T> (yet), and we were not inclined to provide one since the ItemsSketch<T> exists, and the main point of the new KLL sketch was compactness.
In C++, kll_sketch is a template (not optimized for non-fixed sizes yet making potentially avoidable copies of objects). We were not sure what to do with the methods, which provide the serialized size before the serialization is done. They would work for fixed-size types, but not for a generic case of arbitrary custom objects. We need to think about it more and decide whether to have such methods at all, and if so, how to do that better.
from datasketches-cpp.
Thanks for the clarification @AlexanderSaydakov. Would you consider having specialized code paths for fixed-size data types?
I saw references of non-fixed data types in the code, could you provide an example where we might have a non-static size data type that works with kll_sketch? I was not aware of any scalar data types that allocate their own memory.
from datasketches-cpp.
The only requirement for a kll_sketch is that the items be comparable, not that they be primitives. So you're correct with scalar types, but the sketch isn't limited to those and we've generally tried to avoid precluding these sketches from being used in novel ways.
Strings have a variable size, or one could imagine (certainly a contrived example) a sketch of, say, images where they're compared on compressed size but where it stores the images, too. Then you could query to see what images look like at different percentiles in case you want to see what things compress better than others.
Those might not be useful ideas, but it's possible to do them!
from datasketches-cpp.
I implemented get_max_serialized_size_bytes(k, n) for fixed-size types.
I mentioned in the comments that it may not make sense if sizeof(T) is not the same as serialized size of an item, and may need to be specialized (or just not used)
from datasketches-cpp.
could you provide an example where we might have a non-static size data type that works with kll_sketch?
There is a toy example of serialization of kll_sketch<std::string> In kll_sketch_test.cpp
Note that the KLL algorithm does not assume numeric types. The only requirement is to define the order (operator<) between the items.
from datasketches-cpp.
Closed by 1770317
from datasketches-cpp.
Related Issues (20)
- UndefinedBehaviorSanitizer failed, when serializing after using theta_a_not_b HOT 4
- The Python package for ARM MacOS has an x86_64 datasketches.so in it HOT 4
- AttributeError: type object 'datasketches.theta_sketch' has no attribute 'deserialize' HOT 3
- std::iterator is deprecated; replace it HOT 8
- .whl for Linux ARM64 error HOT 11
- random_utils is not thread-safe HOT 3
- the theta_union estimate value will change dramatically with the order of merge HOT 2
- Use non-colliding family id in count-min
- Determinism HOT 11
- One more instance of std::iterator deprecation warning in 4.1.0 HOT 3
- the getEstimate result for union of HllSketch is not stable HOT 2
- question to serialization HOT 1
- Serialize "non-compact" python theta sketch HOT 1
- Tuple: Update array_of_doubles_intersection have poor performance HOT 8
- How to serialize `frequent_items_sketch` with mixed data types? HOT 5
- Implement t-Digest HOT 40
- Workflow to check for memory leaks? HOT 2
- Study to compare t-Digest and REQ sketch HOT 13
- Reorganization proposal HOT 5
- Python scripts. HOT 2
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from datasketches-cpp.