Huffman Coding

Note

Please feel free to make pull requests and edit this project to your will.

Introduction

Huffman Coding is a widely used method of lossless data compression. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and it was published in his seminal paper in 1952. The algorithm is used to compress data by assigning variable-length codes to characters based on their frequencies. Characters that occur more frequently are given shorter codes, while less frequent characters are assigned longer codes. This approach reduces the overall size of the encoded data.

The process of Huffman Coding involves:

Frequency Analysis: Calculate the frequency of each character in the input data.
Building a Huffman Tree: Create a binary tree where each leaf node represents a character and its frequency. The tree is built by repeatedly merging the two nodes with the lowest frequencies until only one node remains (the root of the tree).
Generating Codes: Assign binary codes to characters by traversing the Huffman Tree. Each left edge represents 0, and each right edge represents a 1.

Huffman Coding is optimal for symbol-by-symbol encoding and is widely used in various applications such as file compression (e.g., ZIP files), image compression (e.g., JPEG), and multimedia codecs (e.g., MP3).

Introduction to David A. Huffman

David A. Huffman (August 9, 1925 - October 7, 1999) was an American computer scientist and professor, known for his pioneering work in the field of information theory and data compression. Huffman was in Ohio and went on to study at the Massachussets Institute of Technology (MIT), where he completed his Ph.D. in electrical engineering.

While still a graduate student, Huffman devised the Huffman Coding algorithm as part of a term paper for a course on information theory taught by Robert M. Fano. His algorithm quickly became a fundamental technique in the field of data compression due to its efficiency and simplicity.

Throughout his career, Huffman made significant contributions to various areas of computer science and engineering. He was a professor at MIT and later at the University of California, Santa Cruz, where he helped establish the Computer Science Department.

David Huffman's work has had a lasting impact on how data is efficiently stored and quickly transmitted, and his legacy continues to influence modern computing and telecommunications.

Methodology

Step 1: Frequency Table of English Letters

First, we need a frequency table for the English letters. Here's a commonly used table based on large corpora of English texts:

# Frequency table of English letters (including space)
frequency_table = {
    'A': 8.167, 'B': 1.492, 'C': 2.782, 'D': 4.253, 'E': 12.702,
    'F': 2.228, 'G': 2.015, 'H': 6.094, 'I': 6.966, 'J': 0.153,
    'K': 0.772, 'L': 4.025, 'M': 2.406, 'N': 6.749, 'O': 7.507,
    'P': 1.929, 'Q': 0.095, 'R': 5.987, 'S': 6.327, 'T': 9.056,
    'U': 2.758, 'V': 0.978, 'W': 2.360, 'X': 0.150, 'Y': 1.974,
    'Z': 0.074, ' ': 18.29
}

Letter	Frequency (%)	Letter	Frequency (%)
A	8.167	B	1.492
C	2.782	D	4.253
E	12.702	F	2.228
G	2.015	H	6.094
I	6.966	J	0.153
K	0.772	L	4.025
M	2.406	N	6.749
O	7.507	P	1.929
Q	0.095	R	5.987
S	6.327	T	9.056
U	2.758	V	0.978
W	2.36	X	0.15
Y	1.974	Z	0.074
SPACE	18.29

Step 2: Build a Huffman Code for English

We still use the frequency table to build a Huffman Tree and derive the Huffman codes for each character. Here's a simplified version of the process:

Initialize: Create a leaf node for each character and add it to a priority queue based on frequency.

# Count frequency of each character in the sample text
sample_freq = Counter(sample_text.upper())

Build the Tree:

Extract two nodes with the lowest frequency.
Create a new internal node with these two nodes as children and the sum of their frequencies as the new frequency.
Insert the new node back into the priority queue.
Repeat until only one node (the root) remains.

# Build Huffman Tree
heap = [[weight, [char, ""]] for char, weight in frequency_table.items()]
heapq.heapify(heap)

while len(heap) > 1:
    lo = heapq.heappop(heap)
    hi = heapq.heappop(heap)
    for pair in lo[1:]:
        pair[1] = '0' + pair[1]
    for pair in hi[1:]:
        pair[1] = '1' + pair[1]
    heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])

Generate Codes: Traverse the tree to assign codes to each character.

# Generate Huffman codes
huffman_codes = sorted(heap[0][1:], key=lambda p: (len(p[-1]), p))

# Calculate the number of bits needed using Huffman coding
huffman_bits = sum(sample_freq[char] * len(code) for char, code in huffman_codes if char in sample_freq)

To run the final program please just run the main file. I also wrote a sample encoder that encodes any input with arbitrary bits just for you, my dear reader, to compare their inherit weights.

Conclusion

David A. Huffman, though his groundbreaking development of Huffman Coding, revolutionized the field of data compression. His algorithm, created during his time as a Ph.D. student at MIT, stands as a testament to the power of innovative thinking in solving practical problems. Huffman Coding's efficient method of variable-length encoding based on character frequency not only optimizes data storage and transmission but also underpins many modern compression techniques used in everyday digital media. Huffman's work exemplifies the profound impact that a single, elegant solution can have on technology, shaping the way we manage and process information in the digital age.

License

MIT

gyakobo / huffman-coding Goto Github PK

huffman-coding's Introduction

Huffman Coding

Introduction

The process of Huffman Coding involves:

Introduction to David A. Huffman

Methodology

Step 1: Frequency Table of English Letters

Step 2: Build a Huffman Code for English

Conclusion

License

huffman-coding's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent