Code Monkey home page Code Monkey logo

givati-compression's Introduction

GIVATI Compression

Live Demo

The algorithm selects two separators from an alphabet of around 90 ascii characters. The segment characters cannot appear in the uncompressed text. It then looks at every substring, so in the text "hello", it would look at "h", "he", "hel", "hell", "hello", "e", "el"... and so on and adds potentially viable substrings to a register. It calculates each substring's priority based on the number of bytes it would save if added to the dictionary where [ priority = (num matches) * length - (length + (num matches * 2)) ] and puts them in a catalog sorted by priority. It then moves through the the substrings from most effective to least recalculating priorities as it simulates removing the substrings, thus pruning and resorting the catalog. It then looks for conflicts between the letters of the alphabet that will actually be used based on the size of the pruned catalog and resolves any conflicts by adding each conflicting character to the catalog with a high priority so as to achieve 1 byte to 1 byte matching and not impact size of the compressed data. It then creates a hash table ie: the dictionary using the alphabet as a base(alphabet.length) number system. Should the size of the dictionary exceed the size of the alphabet it goes back and fills in the "leading zeros" for consistancy. It then uses regular expressions to apply the hashtable to the data compressing it and then amends the dictionary along with header information to the compressed data.

Decompression works by reading the header information and parsing the dictionary. It then inserts the 1st separator every other character after reading the order of magnitude of compressed data, ie: if each hash is one byte or more. Then using regular expressions it replaces hashes changing surrounding separators of a replacement to the second separator so as not to replace a substring of a substring in following passes, and finally removes all separators revealing the uncompressed data.

The compressed data structure is as follows


compressed data structure

1 byte 																	| separator 1
1 byte																	| separator 2
n bytes separated by 1st separator										| dictionary of keys/values assumed to altenate
1 byte 2nd separator													| end of dictionary
m bytes of 1st separator												| where m is order of magnitude
n bytes of data represented with fixed magnitude of m “leading zeros” 	| compressed data

The data can then be decompressed without pre-agreeing on a dictionary or any other information. This is optimal for sending compressed data to arbitrary recipients, or long term data storage. Because it's smart and prioritizes substrings to create the dictionary before any actual compression is done it typically achieves better compression than algorithms that step through the data and build their dictionary linearly. In high entropy or short data where compression would result in increased size it opts not to compress anything, this again is in contrast to linear algorithms like lzw.

TODO

  • Ability to manually provide a static or seed dictionary, for cases where pre-determining a dictionary can save additional bytes and allow more customization
  • port to Python and PHP
  • optimize code and algorithm

givati-compression's People

Contributors

yoavgivati avatar

Stargazers

gregory nicholas avatar Jack avatar Clarisa Leu avatar Malcolm White avatar Dennis Kalaygian avatar Michael Anthony avatar  avatar  avatar Sjoerd de Jong avatar Edgar Hipp avatar Julian Bilcke avatar  avatar

Watchers

Michael Anthony avatar

Forkers

bittorf

givati-compression's Issues

License file???

Your work is interesting. But, I don't find any LICENSE file for this project, because i am unsure whether I'm permitted to fully utilize this work (algorithm including source code) for my independent research study.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.