Comments (12)
That is indeed a very good idea; I had the same thought myself some years ago. It should detect hidden patterns. But gzip is not really good enough; a neural-net compressor would be better. I forget exactly what I tried back then, the state-of-the-art compressor at the time, with a JIT-compiled NN.
from smhasher.
LZMA should be good: it's a freely available open-source library and can be integrated without much hassle. LZMA is also known as one of the best available compression algorithms; it's very hard to do much better, be it with a NN or not.
https://7-zip.org/sdk.html
Just a note - it looks like neural network-based compressors work great for text compression, but I do not think they'll handle "random data" compression well.
One advantage of these tests is that they're pretty fast. May be tricky to integrate, though; I'm not sure how simple the various libs are.
For example, discohash, which passes all quality tests here, when given to the test above produces an output stream that is ~20% compressible using gzip -9. This indicates that the tests above miss something about statistical regularity that is nevertheless picked up quickly by the compressibility test.
Good idea!
For reference, here's the code I used to create the infinite stream:
#include <iostream>
#include <vector>
#include <fstream>
#include <inttypes.h>
#include <stdexcept>
#include <string>
#include <cstring>
#include "discohash.h"

#ifdef _WIN32
#include <io.h>
#include <fcntl.h>
#define SET_BINARY_MODE(handle) _setmode(_fileno(handle), _O_BINARY)
#else
#define SET_BINARY_MODE(handle) ((void)0)
#endif

void readFileToBuffer(const std::string& filename, std::vector<uint8_t>& buffer) {
    std::ifstream file(filename, std::ios::binary | std::ios::ate);
    if (file.fail()) {
        throw std::runtime_error("Unable to open the file: " + filename);
    }
    std::streamsize size = file.tellg();
    file.seekg(0, std::ios::beg);
    buffer.resize(size, 0);
    if (!file.read(reinterpret_cast<char*>(buffer.data()), size)) {
        throw std::runtime_error("Failed to read the file: " + filename);
    }
}

void readStdinToBuffer(std::vector<uint8_t>& buffer) {
    std::istreambuf_iterator<char> begin(std::cin), end;
    std::vector<char> inputChars(begin, end);
    buffer.assign(inputChars.begin(), inputChars.end());
}

int main(int argc, char** argv) {
    std::vector<uint8_t> buffer;
    std::string filename;
    std::string outputFilename;
    bool infiniteMode = false;
    FILE* outputFile = stdout; // default to stdout
    int outputWords = 4;

    // Handle flags and arguments
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--infinite") == 0) {
            infiniteMode = true;
        } else if (strcmp(argv[i], "--outfile") == 0) {
            if (i + 1 < argc) {
                outputFilename = argv[++i];
                outputFile = fopen(outputFilename.c_str(), "wb");
                if (!outputFile) {
                    std::cerr << "Error: Unable to open output file: " << outputFilename << std::endl;
                    return EXIT_FAILURE;
                }
            } else {
                std::cerr << "Error: --outfile option requires a filename argument." << std::endl;
                return EXIT_FAILURE;
            }
        } else if (strcmp(argv[i], "--words") == 0) {
            if (i + 1 < argc) {
                outputWords = std::stoi(argv[++i]);
                if (outputWords < 1 || outputWords > 4) {
                    std::cerr << "Error: --words option requires an integer between 1 and 4." << std::endl;
                    return EXIT_FAILURE;
                }
            } else {
                std::cerr << "Error: --words option requires an integer argument." << std::endl;
                return EXIT_FAILURE;
            }
        } else {
            filename = argv[i];
        }
    }

    if (infiniteMode && outputFile == stdout) {
        SET_BINARY_MODE(stdout);
    }

    bool readFromFile = !filename.empty() && filename != "-";
    if (readFromFile) {
        readFileToBuffer(filename, buffer);
    } else {
        readStdinToBuffer(buffer);
    }

    // Buffer to store the hash output (4 x 64-bit words = 32 bytes)
    std::vector<uint64_t> hash(4);

    // Sponge construction: feed each digest back in as the next input
    if (infiniteMode) {
        std::vector<uint8_t> input = buffer;
        // Ensure the buffer can hold the 32-byte digest fed back below
        if (input.size() < sizeof(uint64_t) * 4) {
            input.resize(sizeof(uint64_t) * 4, 0);
        }
        while (true) {
            BEBB4185_64(input.data(), input.size(), 0, hash.data());
            std::fwrite(hash.data(), sizeof(uint64_t), 4, outputFile);
            std::fflush(outputFile); // make sure it's written
            // Reuse the same memory buffer as input for the next iteration
            std::memcpy(input.data(), hash.data(), sizeof(uint64_t) * 4);
        }
    } else {
        BEBB4185_64(buffer.data(), buffer.size(), 0, hash.data());
        for (int i = 0; i < outputWords; ++i) {
            fprintf(outputFile, "%016" PRIx64, hash[i]);
        }
        fprintf(outputFile, " %s\n", filename.c_str());
    }

    // Close the output file if it's not stdout
    if (outputFile != stdout) {
        fclose(outputFile);
    }
    return EXIT_SUCCESS;
}
for illustration of exactly what i meant
Then I'd run:
$ gzip -vv -9 <outfile>
or
$ lzma -9 -z -v <outfile>
@rurban would it be a bad idea to integrate a "simpler to integrate" compression library (like @avaneev's suggested lzma) at first to figure things out, and move on to a more involved NN compression thing later if needed?
Anyway man I know you're super busy so no worries if you don't get to this, just thank you for reading this far! :)
lzma compresses/measures only linear repeatability, which is much less than simple linear transformations, not to speak of polynomial patterns, the simplest of which are multi-dimensional rotations, translations, and scalings. Only a NN (besides visual inspection) can detect proper patterns, never a primitive LZ compressor.
I've tried zpaq and paq8px (http://mattmahoney.net/dc/) a few years ago, but better NNs are needed. Maybe https://github.com/mohit1997/Dzip-torch
NN compressors are mainly language-based models. LZMA, on the other hand, works with bit-level patterns and is Markov-chain based; it's not mere "linear repeatability".