Code Monkey home page Code Monkey logo

Comments (9)

leerho avatar leerho commented on August 14, 2024

Yes there is. Once a KLL sketch reaches a certain size it grows very slowly, but it does grow. We have provided the function getMaxSerializedSizeBytes(k, n) for exactly this purpose. The only catch is you need to estimate the largest n that you anticipate. If you plot this static function you will discover that for large enough n, it only grows by 32 bytes for every factor of 2 increase in n. So overestimating n by a factor of 2 or 10 won't impact the overall size that much.

from datasketches-cpp.

leerho avatar leerho commented on August 14, 2024

Oops! I was thinking about the function in the Java version. That function must have been omitted by mistake in the C++ version. We will correct that pronto! :)

from datasketches-cpp.

thvasilo avatar thvasilo commented on August 14, 2024

Thanks @leerho I'll make sure to close this once the function is merged, for now I'll just use my own based on the Java version.

For future reference the relevant Java functions are getMaxSerializedBytes and getSerializedSizeBytes.

from datasketches-cpp.

AlexanderSaydakov avatar AlexanderSaydakov commented on August 14, 2024

It was not a mistake to omit getMaxSerializedSizeBytes(k, n) in the C++ implementation, but rather a result of an incomplete process of reconsidering the interface in light of generic implementation.

In Java, KllFloatsSketch is a specialized implementation using a primitive array of floats. There is no generic implementation of KllSketch<T> (yet), and we were not inclined to provide one since the ItemsSketch<T> exists, and the main point of the new KLL sketch was compactness.

In C++, kll_sketch is a template (not optimized for non-fixed sizes yet making potentially avoidable copies of objects). We were not sure what to do with the methods, which provide the serialized size before the serialization is done. They would work for fixed-size types, but not for a generic case of arbitrary custom objects. We need to think about it more and decide whether to have such methods at all, and if so, how to do that better.

from datasketches-cpp.

thvasilo avatar thvasilo commented on August 14, 2024

Thanks for the clarification @AlexanderSaydakov. Would you consider having specialized code paths for fixed-size data types?

I saw references of non-fixed data types in the code, could you provide an example where we might have a non-static size data type that works with kll_sketch? I was not aware of any scalar data types that allocate their own memory.

from datasketches-cpp.

jmalkin avatar jmalkin commented on August 14, 2024

The only requirement for a kll_sketch is that the items be comparable, not that they be primitives. So you're correct with scalar types, but the sketch isn't limited to those and we've generally tried to avoid precluding these sketches from being used in novel ways.

Strings have a variable size, or one could imagine (certainly a contrived example) a sketch of, say, images where they're compared on compressed size but where it stores the images, too. Then you could query to see what images look like at different percentiles in case you want to see what things compress better than others.

Those might not be useful ideas, but it's possible to do them!

from datasketches-cpp.

AlexanderSaydakov avatar AlexanderSaydakov commented on August 14, 2024

I implemented get_max_serialized_size_bytes(k, n) for fixed-size types.
I mentioned in the comments that it may not make sense if sizeof(T) is not the same as serialized size of an item, and may need to be specialized (or just not used)

from datasketches-cpp.

AlexanderSaydakov avatar AlexanderSaydakov commented on August 14, 2024

could you provide an example where we might have a non-static size data type that works with kll_sketch?

There is a toy example of serialization of kll_sketch<std::string> In kll_sketch_test.cpp

Note that the KLL algorithm does not assume numeric types. The only requirement is to define the order (operator<) between the items.

from datasketches-cpp.

thvasilo avatar thvasilo commented on August 14, 2024

Closed by 1770317

from datasketches-cpp.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.