
Comments (4)

nemequ commented on July 17, 2024

I'm familiar with fsbench. It's a great project, but its focus is very different from Squash, and I'm not sure merging the two is feasible. fsbench is a benchmarking tool for compression and hash algorithms. Squash is really about creating a generic abstraction layer for compression, and benchmarks are one (small) use case—the primary targets are programming languages (for example, I'm currently working on a Node.js addon) and other programs (the original reason I wrote Squash is for a database I'm working on).
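Squash's real API is C; purely as an illustration of what a generic abstraction layer buys you, here is a hypothetical Python sketch using stdlib codecs (the names and structure are mine, not Squash's actual API):

```python
# Hypothetical sketch of a codec abstraction layer in the spirit of Squash,
# using only Python stdlib codecs. None of these names are Squash's real API.
import bz2
import lzma
import zlib

# Each codec is exposed through the same (compress, decompress) interface,
# so callers never need codec-specific code.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}

def compress(codec: str, data: bytes) -> bytes:
    return CODECS[codec][0](data)

def decompress(codec: str, data: bytes) -> bytes:
    return CODECS[codec][1](data)

payload = b"hello hello hello hello"
for name in CODECS:
    # Round-trip through every codec via the uniform interface.
    assert decompress(name, compress(name, payload)) == payload
```

The point of the sketch is that switching algorithms is a one-string change for the caller, which is exactly what a benchmark, a database, or a language binding wants.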

Unfortunately, I don't think Squash can really re-use any of the fsbench code. AFAIK they have a single executable which links (mostly, if not completely, statically) every library they use. That's great for a benchmarking tool, but it sucks if you just want to use one or two algorithms. The actual code to integrate a compression library is only a couple of lines for most algorithms for both Squash and fsbench, but for Squash there is a decent amount of boilerplate (to make them dynamically loadable at runtime), documentation (because we have users other than ourselves—or at least we hope to eventually), build system (because we want to link to system libraries when possible), etc.

fsbench could, theoretically, use Squash for its compression algorithm support, but it also benchmarks hash algorithms, so obviously Squash isn't a complete solution. You would have to talk to m^2 about that, though I would be happy to help with any changes necessary on the Squash side. To be honest, though, I don't see a whole lot of benefit to using Squash for fsbench's use case. Integrating an algorithm is quite easy if you don't mind dumping the library into your source tree and linking to it statically, and that's really all they need to do. Doing it fsbench's way also simplifies the dependencies, since you don't have to make sure the devel packages for various libraries are installed.

The one place we could probably share would be the output. fsbench could generate JSON (probably optionally, in addition to what they use now [CSV?]) and use the HTML/JS distributed with Squash to display the data. I already plan to make some significant changes to Squash's benchmarking output in order to support multiple benchmarks per codec (i.e., different options), so if anyone else has any thoughts on what they would like displayed and how, now is the time to speak up. I just created a "benchmark" label for the issue tracker.
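For reference, the kind of per-run record such JSON output might contain can be sketched like this; the field names are illustrative guesses, not a format either project has committed to:

```python
# Sketch of a hypothetical per-run JSON benchmark record.
# Field names are illustrative only; neither Squash nor fsbench
# has agreed on this schema.
import json
import time
import zlib

data = b"example data " * 1000

start = time.perf_counter()
compressed = zlib.compress(data, 6)
elapsed = time.perf_counter() - start

record = {
    "codec": "zlib",
    "options": {"level": 6},        # supports "multiple benchmarks per codec"
    "input_size": len(data),
    "compressed_size": len(compressed),
    "ratio": len(data) / len(compressed),
    "compress_time_s": elapsed,
}
print(json.dumps(record, indent=2))
```

Keeping the options dictionary separate from the codec name is what would allow several entries per codec, as mentioned above.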

I'm going to go ahead and close this, since I don't think there is really anything to do on the Squash side. If fsbench would like to use Squash and needs things changed, then we can re-open this as a meta-bug to track the progress of those issues. I'll send m^2 a note informing him/her of this issue.

from squash.

nemequ commented on July 17, 2024

From m^2:

> I've been also thinking about some duplication between our projects and whether we could work together. My conclusion was similar to yours, in that the only feasible way would be to have fsbench use Squash as one of its backends. If that happened, whatever new algorithms I ported, I could port to Squash. However, so far I'm not convinced that it's worthwhile. As you said, integrating new codecs is usually easy, so the benefit isn't very large. And the costs? The minimum would be to:
>
>   • benchmark Squash itself to make sure its internal overhead is negligible
>   • create and maintain a CMake build system for it
>   • disable Squash's use of system libraries (I didn't get far enough at first to notice that Squash uses them); for fsbench this is unacceptable, since I want full control over library versions

FWIW, there are copies of many libraries either in-tree or as git submodules; I just prefer the system versions wherever available. That's better for system integration, though not as good for benchmarking. I'd probably be willing to add a way to prefer the bundled versions for everything, but only if you decide you want to explore this, or if someone else who wants to use Squash has a similar requirement.

> I even started to work on it, but not for long: autogen fails to detect libnl, and after looking into it I ended up in a black hole with no idea what was going on, so I decided to stop; it was too much for me.

Could it be because you were looking for libnl instead of libltdl? The only real hard dependencies for Squash are libltdl and pthreads (currently, though I would love to add support for Windows threads). glib is required for building the Vala bindings and for unit testing, though I haven't actually tested without it. libnl isn't a dependency at all.

> I noted it down on the ideas list, and that's all for now.

> As to JSON, I may do this, though not without someone saying that they would significantly benefit from it. I have too little time to work on fsbench (again) and don't want to spread myself too thin. And even then it won't be a high priority...

Fair enough. I'll try to remember to have a closer look at fsbench's output and make sure it's possible to present most of the information (except for the hash algorithms, of course).

> I don't have a GitHub account that I'd like to link directly with my 'm^2' identity, so I can't reply there myself. I would be grateful if you quoted my message above.


Intensity commented on July 17, 2024

Thanks for looking into the possibility of collaborating with fsbench. I wasn't sure which parts would be most useful, but I wanted to open a dialog about it. The two projects have different aims but also overlap. When someone uses a compression library, they may be indifferent about which particular algorithm and options are used, as long as the compression ratio and other tradeoffs work out. So having access to a variety of algorithms is helpful, and both projects have taken steps in that direction. Some compression algorithms (namely LZHAM) are not as easy to build or integrate, so if one of the two projects has taken the time to do that, checking in with developments on the other side can save effort. For what it's worth, I'd value a similar interface to hash algorithms, but that's a bit tangential.

It does seem that most compression algorithms have a low integration overhead, but I appreciate the extra work done in Squash to enable dynamic loading and present a universal interface. I'd say that benchmarking is a use case of squash that is important to me, because I can take an asymmetric approach: try a few algorithms, see which one performs optimally according to some criteria and bounds, and then choose the desired one. Such trial and error could even happen in multiple threads, as long as the additional memory usage during compression could be tolerated. I could also envision using more than one compressor (for example, using one as a preprocessor) in a kind of chain.
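The try-a-few-and-pick-one approach described above can be sketched with stdlib codecs standing in for a Squash-style codec list (the helper name is mine, purely illustrative):

```python
# Sketch of the "asymmetric" approach: try several codecs on the same
# input and keep whichever produces the smallest output.
import bz2
import lzma
import zlib

def best_codec(data: bytes, codecs: dict) -> tuple:
    """Compress with every codec and return (name, output) for the smallest."""
    results = {name: fn(data) for name, fn in codecs.items()}
    name = min(results, key=lambda n: len(results[n]))
    return name, results[name]

codecs = {
    "zlib": zlib.compress,
    "bzip2": bz2.compress,
    "xz": lzma.compress,
}

data = b"abc " * 5000
name, out = best_codec(data, codecs)
print(name, len(out))
```

As noted above, each candidate could just as well run in its own thread, at the cost of holding several compression contexts in memory at once.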

I'm bringing this "chain" idea up because the JSON output might support a series of steps as a legitimate meta-compressor. If there is first a dictionary preprocessing step and then an LZ compression, that multi-step journey (with flags) can be articulated in JSON. Similarly, if one wanted to try various combinations and output the best one (similar to dact - http://www.rkeene.org/docs/oss/dact/dact_man.txt), a meta-compressor could encode the steps needed to reverse the algorithm in some kind of verbose, universal JSON format. I'm happy to break any ideas you find interesting into separate tickets and elaborate as needed.

The final crossover I see between the projects has to do with identifying resource use, either as an initial estimated constraint or as an information-gathering device after compression or decompression has been tried. For benchmarking, or for the meta-compressor idea, I would value being able to specify the memory or time limitations for (de)compression; then, as the algorithms are tried in sequence, I can take into account exactly how much CPU is used and how much memory is needed for compression or decompression. Again, I'm happy to break these ideas into separate tickets. Thanks for looking this over; hopefully this has been helpful.
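A minimal sketch of the chain idea, assuming a purely hypothetical container format: the compressor records its steps as JSON in a header, so the decompressor can reverse them without any out-of-band knowledge.

```python
# Sketch of a meta-compressor that records its chain of steps as JSON.
# The header layout (4-byte length + JSON + payload) is invented for
# illustration; it is not a format from Squash, fsbench, or dact.
import bz2
import json
import zlib

STEPS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
}

def chain_compress(data: bytes, chain: list) -> bytes:
    for step in chain:
        data = STEPS[step][0](data)
    header = json.dumps({"chain": chain}).encode()
    # Prefix the JSON header with its length so it can be split off later.
    return len(header).to_bytes(4, "big") + header + data

def chain_decompress(blob: bytes) -> bytes:
    hlen = int.from_bytes(blob[:4], "big")
    chain = json.loads(blob[4:4 + hlen])["chain"]
    data = blob[4 + hlen:]
    # Undo the recorded steps in reverse order.
    for step in reversed(chain):
        data = STEPS[step][1](data)
    return data

payload = b"chained " * 1000
blob = chain_compress(payload, ["zlib", "bzip2"])
assert chain_decompress(blob) == payload
```

The same header could carry flags or option dictionaries per step, which is all a "try various combinations and keep the best" driver would need to emit.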


nemequ commented on July 17, 2024

> Some compression algorithms (namely LZHAM) are not as easy to build or integrate

I'm planning to do something about LZHAM. The hold-up right now is that the unreleased version is incompatible with the most recent release. I should probably send an e-mail to the author of LZHAM to try to figure out what is going on; I've just been putting it off because I have plenty of other things to do and no particular interest in LZHAM. If you're interested in it, that bumps up the priority a bit.

> I'd say that benchmarking is a use case of squash that is important to me, because I can take an asymmetric approach: try a few algorithms, see which one performs optimally according to some criteria and bounds, and then choose the desired one.

Yes, that is something Squash was designed for. However, I think it is more valuable to be able to create your own benchmarks easily. When I was talking about benchmarks not being a primary use case, I was really referring to fsbench-style benchmarks, which try to benchmark the underlying algorithm—that's why m^2 has to be in control of which version of each library is in use and doesn't want to use system libraries. In practice, I think most people will want to use system libraries when they're available (just as I don't expect people to keep a copy of Squash in their tree), at least on Linux. IIRC that's the major reason getting Chromium into Fedora has been so difficult. I'd like to get Squash as close as possible to the theoretical maximum performance of the various codecs, but to be frank, that has to take a back seat to creating a universal API and playing nice with the OS. For fsbench, I think the reverse is true.

The benchmark distributed with Squash is largely a demo, and hopefully a framework, meant to inspire people to create their own. I even allude to the idea in the benchmarks section of the Squash web page. I'd like for the benchmark shipped with Squash to be accurate, but there are simply too many variables.

As an example, the reason I created Squash initially was to compress rows in a database I'm working on (QuixDB), which I intend to use for a project I'm developing at work. I'm planning to create a benchmark based on real-world usage and data specific to my project and run it on the hardware I plan to use. I'll choose a reasonable default, but I don't even want to hard-code the compression algorithm in QuixDB—that decision should be available to the end user.
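That kind of application-specific benchmark can be as simple as timing each candidate codec on a sample of your own data, with stdlib codecs here standing in for Squash plugins:

```python
# Sketch of a per-application micro-benchmark: measure ratio and time for
# each codec on a sample of your own data rather than a generic corpus.
import bz2
import lzma
import time
import zlib

# Stand-in for rows the application actually stores.
sample = b"row data that resembles what the database actually stores " * 500

for name, fn in [("zlib", zlib.compress),
                 ("bzip2", bz2.compress),
                 ("xz", lzma.compress)]:
    start = time.perf_counter()
    out = fn(sample)
    dt = time.perf_counter() - start
    print(f"{name}: {len(sample) / len(out):.1f}x in {dt * 1000:.2f} ms")
```

Running this on the target hardware, against representative data, is what makes the numbers meaningful in a way a generic shipped benchmark cannot be.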

> I'm bringing this "chain" idea up because the JSON output might support a series of steps as a legitimate meta-compressor. […]

This is definitely something that should be relatively easy to write with Squash. Much more so than trying to do it without Squash. If you want to tackle it I think it would at least make an interesting example to distribute with Squash, and I'd certainly be interested in seeing the result, but to be honest I don't see myself taking the time to write such a benchmark myself any time soon.

