Code Monkey home page Code Monkey logo

huffman-coding's Introduction

Huffman Coding

image image

Author: Andrew Gyakobo

Note

Please feel free to make pull requests and edit this project to your will.

Introduction

Huffman Coding is a widely used method of lossless data compression. It was developed by David A. Huffman while he was a Ph.D. student at MIT, and it was published in his seminal paper in 1952. The algorithm is used to compress data by assigning variable-length codes to characters based on their frequencies. Characters that occur more frequently are given shorter codes, while less frequent characters are assigned longer codes. This approach reduces the overall size of the encoded data.

The process of Huffman Coding involves:

  1. Frequency Analysis: Calculate the frequency of each character in the input data.

  2. Building a Huffman Tree: Create a binary tree where each leaf node represents a character and its frequency. The tree is built by repeatedly merging the two nodes with the lowest frequencies until only one node remains (the root of the tree).

  3. Generating Codes: Assign binary codes to characters by traversing the Huffman Tree. Each left edge represents 0, and each right edge represents a 1.

Huffman Coding is optimal for symbol-by-symbol encoding and is widely used in various applications such as file compression (e.g., ZIP files), image compression (e.g., JPEG), and multimedia codecs (e.g., MP3).

Introduction to David A. Huffman

David A. Huffman (August 9, 1925 - October 7, 1999) was an American computer scientist and professor, known for his pioneering work in the field of information theory and data compression. Huffman was in Ohio and went on to study at the Massachussets Institute of Technology (MIT), where he completed his Ph.D. in electrical engineering.

While still a graduate student, Huffman devised the Huffman Coding algorithm as part of a term paper for a course on information theory taught by Robert M. Fano. His algorithm quickly became a fundamental technique in the field of data compression due to its efficiency and simplicity.

Throughout his career, Huffman made significant contributions to various areas of computer science and engineering. He was a professor at MIT and later at the University of California, Santa Cruz, where he helped establish the Computer Science Department.

David Huffman's work has had a lasting impact on how data is efficiently stored and quickly transmitted, and his legacy continues to influence modern computing and telecommunications.

Methodology

Step 1: Frequency Table of English Letters

First, we need a frequency table for the English letters. Here's a commonly used table based on large corpora of English texts:

# Frequency table of English letters (including space)
frequency_table = {
    'A': 8.167, 'B': 1.492, 'C': 2.782, 'D': 4.253, 'E': 12.702,
    'F': 2.228, 'G': 2.015, 'H': 6.094, 'I': 6.966, 'J': 0.153,
    'K': 0.772, 'L': 4.025, 'M': 2.406, 'N': 6.749, 'O': 7.507,
    'P': 1.929, 'Q': 0.095, 'R': 5.987, 'S': 6.327, 'T': 9.056,
    'U': 2.758, 'V': 0.978, 'W': 2.360, 'X': 0.150, 'Y': 1.974,
    'Z': 0.074, ' ': 18.29
}
Letter Frequency (%) Letter Frequency (%)
A8.167 B1.492
C2.782 D4.253
E12.702 F2.228
G2.015 H6.094
I6.966 J0.153
K0.772 L4.025
M2.406 N6.749
O7.507 P1.929
Q0.095 R5.987
S6.327 T9.056
U2.758 V0.978
W2.36 X0.15
Y1.974 Z0.074
SPACE18.29

Step 2: Build a Huffman Code for English

We still use the frequency table to build a Huffman Tree and derive the Huffman codes for each character. Here's a simplified version of the process:

  1. Initialize: Create a leaf node for each character and add it to a priority queue based on frequency.
# Count frequency of each character in the sample text
sample_freq = Counter(sample_text.upper())
  1. Build the Tree:
  • Extract two nodes with the lowest frequency.
  • Create a new internal node with these two nodes as children and the sum of their frequencies as the new frequency.
  • Insert the new node back into the priority queue.
  • Repeat until only one node (the root) remains.
# Build Huffman Tree
heap = [[weight, [char, ""]] for char, weight in frequency_table.items()]
heapq.heapify(heap)

while len(heap) > 1:
    lo = heapq.heappop(heap)
    hi = heapq.heappop(heap)
    for pair in lo[1:]:
        pair[1] = '0' + pair[1]
    for pair in hi[1:]:
        pair[1] = '1' + pair[1]
    heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
  1. Generate Codes: Traverse the tree to assign codes to each character.
# Generate Huffman codes
huffman_codes = sorted(heap[0][1:], key=lambda p: (len(p[-1]), p))

# Calculate the number of bits needed using Huffman coding
huffman_bits = sum(sample_freq[char] * len(code) for char, code in huffman_codes if char in sample_freq)

To run the final program please just run the main file. I also wrote a sample encoder that encodes any input with arbitrary bits just for you, my dear reader, to compare their inherit weights.

Conclusion

David A. Huffman, though his groundbreaking development of Huffman Coding, revolutionized the field of data compression. His algorithm, created during his time as a Ph.D. student at MIT, stands as a testament to the power of innovative thinking in solving practical problems. Huffman Coding's efficient method of variable-length encoding based on character frequency not only optimizes data storage and transmission but also underpins many modern compression techniques used in everyday digital media. Huffman's work exemplifies the profound impact that a single, elegant solution can have on technology, shaping the way we manage and process information in the digital age.

License

MIT

huffman-coding's People

Contributors

gyakobo avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.