Data Structures/Hash Tables

Lesson 4.21,679 words

Hash Tables

A hash table implements the dictionary — insert, search, delete — in expected O(1)O(1) time by scattering keys across an array with a hash function. We build up from direct addressing, handle collisions by chaining and by open addressing, analyze the load factor α\alpha, and see how universal hashing earns its expected-time guarantee against every input.

╌╌╌╌

Many problems need only three operations on a set of records, each identified by a key: insert a record, search for the record with a given key, and delete a record. This is the dictionary abstract data type (also called an associative array or map), and it is one of the most heavily used data structures in all of computing: symbol tables in compilers, routing tables in networks, the dict in your favorite scripting language. A balanced search tree does all three in time. A hash table does them in expected time: constant, independent of how many keys are stored.1 The price is that it gives up the ordering a tree provides, with no efficient next larger key or "all keys in ", in exchange for raw speed on the point operations.

From direct addressing to hashing

Start with the easy case. Suppose every key is drawn from a small universe. Then we can keep an array , a direct-address table, and store the record with key in slot . Insert, search, and delete are each a single array access: worst-case , unbeatable.

Algorithm:Direct-address dictionary operations on table TT
  1. 1
    Direct-Address-Insert(T, x):
  2. 2
    T[key(x)]xT[key(x)] \gets x
  3. 3
    Direct-Address-Search(T, k):
  4. 4
    return T[k]T[k]
  5. 5
    Direct-Address-Delete(T, x):
  6. 6
    T[key(x)]nilT[key(x)] \gets \text{nil}

Direct addressing fails the moment the universe is large. To store -bit integers we would need an array of slots, impossible, even though we may hold only a few thousand keys. The waste is glaring: is almost entirely empty. The remedy is to use a table that is only as big as the number of keys we expect, and to compute a slot from the key with a hash function

The key lives in slot . We say hashes to slot , and is the hash value. Because , the function cannot be injective: two distinct keys can map to the same slot. That event is a collision, and the entire theory of hash tables is the theory of coping with collisions gracefully.

Collision resolution by chaining

The most natural fix is chaining: each slot holds a linked list of all the keys that hash to . To insert, prepend to the list at ; to search, scan that one list; to delete, splice the record out of its list.

Chained hash table where colliding keys share a linked list per slot

Slots , , and hold chains; the rest are empty. Keys , , all collided at slot , so they share a list.

Algorithm:Chained-hash dictionary operations on table TT
  1. 1
    Chained-Hash-Insert(T, x):
  2. 2
    insert xx at the head of list T[h(key(x))]T[h(key(x))]
  3. 3
    Chained-Hash-Search(T, k):
  4. 4
    search the list T[h(k)]T[h(k)] for an element with key kk
  5. 5
    Chained-Hash-Delete(T, x):
  6. 6
    delete xx from the list T[h(key(x))]T[h(key(x))]

Insertion is (prepend, assuming the key is not already present). Deletion is given a pointer to the record in a doubly linked list. Search costs time proportional to the length of the chain it scans, and that is what we must analyze.

The load factor and expected search time

Let be the number of keys stored in a table of slots. The ratio

is the load factor, the average number of keys per slot. With chaining may exceed ; it is the average chain length.

To say anything about expected chain length we need an assumption about how keys spread out. The standard one is simple uniform hashing: each key is equally likely to hash to any of the slots, independently of the others.2 Under this assumption a chain has expected length , and a search examines on average keys plus the cost of computing and indexing the table:

Sketch. For an unsuccessful search on key , we scan the entire chain , whose expected length is ; add to compute . For a successful search, the expected number of elements examined before finding is by averaging over the order in which keys were inserted. Either way the bound is .

The consequence is the headline result. If we keep the table size proportional to the number of keys, so , then every dictionary operation runs in expected time. Keeping bounded is the job of dynamic resizing: when grows past a threshold (say ), allocate a table of double the size and rehash every key into it. A single resize costs , but it is triggered only after cheap operations, so the amortized cost per operation stays , the same doubling argument that makes a dynamic array's append amortized .

Collision resolution by open addressing

Chaining stores keys outside the table. Open addressing stores every key inside the array itself: there are no lists and no pointers, so always. When a key's preferred slot is occupied, we probe a deterministic sequence of alternative slots until we find an empty one. The probe sequence is defined by extending the hash function with a probe number :

so the slots tried for key are , which must form a permutation of all slots so that probing can examine every slot.

Algorithm:Hash-Insert(T,k)\textsc{Hash-Insert}(T, k) — open addressing, returns the slot used
  1. 1
    i0i \gets 0
  2. 2
    repeat
  3. 3
    jh(k,i)j \gets h(k, i)
  4. 4
    if T[j]=nilT[j] = \text{nil} then
  5. 5
    T[j]kT[j] \gets k
  6. 6
    return jj
  7. 7
    ii+1i \gets i + 1
  8. 8
    until i=mi = m
  9. 9
    error "hash table overflow"

Search follows the same probe sequence, stopping when it finds (success) or an empty slot (failure, since is not present, because insertion would have used that empty slot). Deletion is the awkward case: simply emptying a slot would break the probe chains of other keys, so deleted slots are marked with a special deleted sentinel that search skips over but insertion may reuse. Heavy deletion is the classic reason to prefer chaining.

Why deletion needs a tombstone. Keys probed into a run ending at slot . Erasing to (top) would make a later search for stop early at the gap; marking it (bottom) lets search probe past while insertion may reuse the slot.

Three probing schemes are standard:

  • Linear probing. for an ordinary hash function . Simple and cache-friendly, but it suffers primary clustering: long runs of occupied slots build up and grow ever faster, since any key hashing anywhere into a run must walk to its end.
Primary clustering. A run of four occupied slots (shaded) absorbs any new key whose hash lands on slot , , , or — four of the eleven slots, each forcing a walk to the run's right end at slot (red). Every such insertion lengthens the run, so clusters snowball: the bigger the run, the likelier the next key extends it.
  • Quadratic probing. . The quadratic step spreads probes out, eliminating primary clustering, but two keys with the same initial slot follow the same sequence, a milder secondary clustering. The constants and must be chosen so the sequence hits every slot.
  • Double hashing. , using a second hash function to set the step size. Different keys with the same start get different step sizes, so probe sequences rarely coincide. Double hashing comes closest to the ideal of uniform hashing (every key's probe sequence equally likely to be any of the permutations), and is the strongest of the three.
Three probe sequences from in a table of size with slots occupied (shaded). Labels give probe order; the boxed slot is where the key lands. Linear walks (lands ); quadratic adds (lands ); double hashing steps by (lands ).

Cost of open addressing

Under the uniform-hashing assumption, with load factor , the expected number of probes is

The shape of is the whole story: at we expect probes; at , ten probes; as the cost explodes. Open addressing is fast only when the table is kept comfortably below full; a practical rule of thumb is to resize once exceeds about .

Expected probes for an unsuccessful search, , against load factor . Flat and cheap while the table is half-empty, the curve turns sharply upward near (the resize threshold, dashed) and diverges as — the reason open addressing must never run near full.

What makes a hash function good

Simple uniform hashing is an assumption; a real hash function must approximate it on real data. A good should scatter keys so that any regularities in the input, such as sequential integers, common prefixes, or similar strings, do not pile up in the same slots. Two classic constructions:

  • The division method. . Fast, but sensitive to : choosing a power of makes depend only on the low bits of , and values near a power of are bad for decimal data. A prime not close to a power of is the safe choice.
  • The multiplication method. for a constant ( is a good choice). It is insensitive to the value of , so can be a power of for fast bit shifts.

For string keys, treat the string as a base- number and fold it down, e.g. Horner's rule, over the characters , so that every character and its position influence the result.3 Skiena stresses the engineer's view: a hash function turns an arbitrary key into a pseudo-random slot, and the quality of that pseudo-randomness is exactly what protects the bound.

Universal hashing: a guarantee against every input

Any fixed hash function has an Achilles' heel: there exists a set of keys that all collide, and an adversary (or merely unlucky data) can hand it to us, degrading every operation to . Universal hashing defeats this by choosing at random from a carefully designed family of hash functions at runtime, so no single input is bad for all choices.

A family of functions from to is universal if, for every pair of distinct keys ,

where the probability is over the random choice of . That is, a randomly chosen collides any fixed pair no more often than picking two random slots would. This single property suffices to prove that, for any input set of keys, the expected length of the chain holding a given key is at most , recovering the bound without assuming anything about the data.4 The randomness lives in our coin flips, not in an assumption about the world.

A concrete universal family: pick a prime , draw random and , and set

The collection over all valid is universal. Universal hashing is the rigorous foundation under the everyday claim that hashing is : it is in expectation, on every input, precisely because we randomize the hash function.

Takeaways

  • A hash table implements the dictionary ADT — insert, search, delete — in expected time by mapping keys into an array with a hash function, trading away the ordered queries a search tree supports.
  • Direct addressing is perfect but needs one slot per possible key; hashing shrinks the table to and resolves the resulting collisions.
  • Chaining keeps a list per slot (search cost ); open addressing stores keys in the array and probes (linear, quadratic, or double hashing), with cost governed by .
  • Keeping the load factor bounded, via resizing, keeps every operation expected .
  • A good hash function scatters structured keys; universal hashing randomizes the choice of so the expectation holds against every input, not just under an assumption.

Footnotes

  1. CLRS, Ch. 11 — Hash Tables (§11.1–11.2): the dictionary ADT and the expected guarantee from hashing.
  2. Erickson, Ch. 5 — Hash Tables: the simple uniform hashing assumption and expected chain length.
  3. Skiena, §3.7 — Hashing and Strings: hashing string keys via Horner's-rule polynomial evaluation.
  4. CLRS, Ch. 11 — Hash Tables (§11.3.3): universal families and the bound without distributional assumptions.
Practice

╌╌ END ╌╌