Greedy Algorithms/Huffman Codes

Lesson 7.31,599 words

Huffman Codes

Huffman coding is the greedy method's most beautiful application: it builds a provably optimal prefix-free binary code by repeatedly merging the two least frequent symbols. We develop prefix-free codes as binary trees, give the algorithm with a priority queue, build a Huffman tree from example frequencies, prove optimality with the same greedy-choice-plus-substructure argument, and pin the running time at O(nlogn)O(n\log n).

╌╌╌╌

Suppose we want to store or transmit a file of text drawn from some alphabet of symbols, letters being the obvious case. A fixed-length code assigns every symbol a bit string of the same length: with symbols we would spend bits each (), regardless of how often each symbol appears. But real text is lopsided. In English, e and t are everywhere while q and z are rare. It is wasteful to spend as many bits on z as on e.

A variable-length code exploits this skew: give the frequent symbols short codewords and the rare symbols long ones, so the total bit count drops. Huffman's 1952 algorithm finds the variable-length code that compresses a given file as much as any such code possibly can, and it does so with a beautifully simple greedy rule.1 This lesson is the payoff of the greedy method: a non-obvious algorithm, an elegant correctness proof, and a result used billions of times a day inside JPEG, MP3, gzip, and PNG.

Prefix-free codes

Variable-length codes carry a hazard. If e is 0 and t is 01, then the stream 001 is ambiguous (is it e e t? e ??) because the codeword for e is a prefix of the codeword for t. To decode unambiguously without separator symbols, we insist on a prefix-free code (also called a prefix code): no codeword is a prefix of any other.2

Prefix-freeness buys us instant, unambiguous decoding: read bits left to right, and the moment the bits so far match a codeword, that codeword is the only possible symbol, so emit it and start fresh. Remarkably, restricting to prefix-free codes costs nothing in compression: for any uniquely decodable code there is a prefix-free code at least as good, so we lose no optimality by considering only these.

Every prefix-free code corresponds to a binary tree. Symbols sit at the leaves; the path from the root to a leaf spells its codeword, taking 0 for a left edge and 1 for a right edge. Because symbols are only at leaves, no codeword can be a prefix of another: a prefix would mean one symbol's leaf lies on the path to another's, impossible when both are leaves. The depth of a leaf is its codeword's length.

A prefix-free code as a binary tree. Symbols sit only at leaves; the root-to-leaf path spells the codeword ( left, right). No codeword is a prefix of another because no leaf lies on the path to another.

The compression problem

Let the alphabet be , and let symbol occur with frequency (its count, or its probability) in the file. In a code tree , let be the depth of 's leaf, the number of bits in its codeword. The total number of bits to encode the whole file is the cost of the tree:

An optimal code's tree is always full — every internal node has exactly two children. (A one-child node wastes a bit: promote its subtree and every codeword beneath it shortens.) So with symbols, an optimal tree has leaves and exactly internal nodes.

Huffman's algorithm

Huffman's greedy insight is to think about the tree bottom-up, and to ask: which two symbols belong deepest in the tree? Surely the two least frequent ones, since they are the cheapest to bury: multiplying their long codewords by small frequencies costs little.3 So make the two rarest symbols siblings at the bottom, merge them into a single super-symbol whose frequency is their sum, and repeat on the smaller alphabet. Each merge fuses two nodes into one, so after merges a single tree remains.

A min-priority queue keyed on frequency makes the two least frequent cheap to extract.

Algorithm 1:Huffman(C)\textsc{Huffman}(C) — build an optimal prefix-free code tree
  1. 1
    nCn \gets \abs{C}
  2. 2
    QQ \gets a min-priority queue holding all symbols of CC, keyed on freq\mathit{freq}
  3. 3
    for i1i \gets 1 to n1n - 1 do
  4. 4
    allocate a new internal node zz
  5. 5
    z.leftxz.\mathit{left} \gets x \gets Extract-Min(Q)\textsc{Extract-Min}(Q)
    rarest remaining
  6. 6
    z.rightyz.\mathit{right} \gets y \gets Extract-Min(Q)\textsc{Extract-Min}(Q)
    next rarest
  7. 7
    z.freqx.freq+y.freqz.\mathit{freq} \gets x.\mathit{freq} + y.\mathit{freq}
  8. 8
    call Insert(Q,z)\textsc{Insert}(Q, z)
    re-insert merged super-symbol
  9. 9
    return Extract-Min(Q)\textsc{Extract-Min}(Q)
    last node is the root

Each iteration removes two nodes and inserts one, shrinking the queue by one; the single survivor after rounds is the root of the finished tree. Reading the tree from the root gives every symbol's codeword.

Building a Huffman tree by hand

Take a six-symbol alphabet with these frequencies (in thousands of occurrences), the classic CLRS example:

Symbolabcdef
Frequency4513121695

We repeatedly merge the two smallest frequencies:

  1. Merge f and e → node .
  2. Merge c and b → node .
  3. Merge and d → node .
  4. Merge and → node .
  5. Merge a and → root .

Watching the priority queue evolve makes the bottom-up construction concrete: each step extracts the two smallest weights and re-inserts their sum, so the queue loses one element per round until a single root survives.

The five priority-queue merges that build the Huffman tree. Each step extracts the two least-frequent nodes and inserts their sum; the queue shrinks by one until a single root of weight remains.

The resulting tree, with left edges labeled 0 and right edges 1, is:

Huffman code tree for the six-symbol example with edges labeled and .

Reading root-to-leaf gives the codewords:

Symbolabcdef
Codeword010110011111011100

The frequent a gets a single bit; the rare e and f get four. The cost is

thousand bits. A fixed-length -bit code would spend thousand bits, so Huffman saves about , and no prefix-free code does better.

Why Huffman is optimal

Huffman is a greedy algorithm, so its proof follows the template from the previous lesson exactly: a greedy-choice property proved by an exchange argument, then optimal substructure to close the induction.4

The greedy choice is safe

The intuition is the exchange argument in one sentence: the rarest symbols belong deepest, so pushing them down and pulling frequent symbols up can never cost more.

The two swaps are easiest to see side by side. In any optimal tree the deepest sibling pair holds some leaves ; the globally rarest symbols may sit higher up. Exchanging with and with sends the rare symbols to the bottom and lifts the more frequent ones — and the cost only drops, because each moved-down leaf is rarer and each moved-up leaf is more frequent.

The greedy-choice exchange for Huffman. Left, an optimal tree with deepest siblings and the rarest symbols sitting higher. Right, after swapping and the rarest symbols are deepest; cost cannot rise since rarer leaves moved down and more frequent leaves moved up.

Optimal substructure

Together the two lemmas give the theorem by induction on : the base case of one symbol is trivial, and each merge is both safe (greedy choice) and composable (substructure). Huffman's algorithm produces an optimal prefix-free code.

Running time

The cost is dominated by the priority-queue operations. Building the initial min-heap from symbols takes . The loop runs times, and each iteration does two s and one Insert, each on a binary heap. Hence

If the frequencies arrive already sorted, two simple FIFO queues replace the heap (one of original leaves, one of merged nodes, both kept in nondecreasing frequency), and each extract two smallest is , giving an algorithm. The bound, like activity selection's, is really the cost of getting the symbols into sorted order.

Where Huffman lives and where it doesn't

Huffman coding is optimal among codes that assign each symbol a whole number of bits independently. That last clause hides its one weakness: when a symbol's ideal codeword length is fractional (for a symbol of probability , ideally bits), Huffman must round up to bit and loses ground. Arithmetic coding and range coding sidestep the whole-bit floor and can beat Huffman on highly skewed sources, which is why modern compressors often prefer them. Still, Huffman's simplicity, speed, and provable optimality within its class keep it embedded in DEFLATE (gzip, PNG, ZIP), JPEG, and MP3 to this day. It remains the textbook proof that a greedy algorithm, properly justified, can be exactly optimal, not merely a good heuristic.

Takeaways

  • A prefix-free code lets a stream decode unambiguously; it is exactly a binary tree with symbols at the leaves, codeword length = leaf depth.
  • The goal is to minimize ; optimal trees are full.
  • Huffman's algorithm greedily merges the two least-frequent nodes via a min-priority queue, times, building the tree bottom-up.
  • Optimality follows the greedy template: an exchange argument shows the rarest symbols belong deepest (greedy choice), and optimal substructure closes the induction.
  • Running time is , the cost of the heap operations, dropping to when frequencies are pre-sorted.

Footnotes

  1. Skiena, §5 — Data Compression: Huffman's greedy construction of the optimal variable-length code for a given file.
  2. CLRS, Ch. 16 — Greedy Algorithms (§16.3): prefix-free codes and their representation as binary trees with symbols at the leaves.
  3. CLRS, Ch. 16 — Greedy Algorithms (§16.3): the greedy rule of repeatedly merging the two least-frequent symbols, implemented with a min-priority queue.
  4. Erickson, Ch. 4 — Greedy Algorithms (Huffman Codes): the optimality proof via greedy-choice exchange plus optimal substructure.
Practice

╌╌ END ╌╌