Hash Tables
A hash table implements the dictionary — insert, search, delete — in expected time by scattering keys across an array with a hash function. We build up from direct addressing, handle collisions by chaining and by open addressing, analyze the load factor , and see how universal hashing earns its expected-time guarantee against every input.
╌╌╌╌
Many problems need only three operations on a set of records, each identified by
a key: insert a record, search for the record with a given key, and delete a
record. This is the dictionary abstract data type (also called an
associative array or map), and it is one of the most heavily used data
structures in all of computing: symbol tables in compilers, routing tables in
networks, the dict in your favorite scripting language. A balanced search tree
does all three in time. A hash table does them in expected time: constant, independent of how many keys are stored.1 The price is
that it gives up the ordering a tree provides, with no efficient next larger key
or "all keys in ", in exchange for raw speed on the point
operations.
From direct addressing to hashing
Start with the easy case. Suppose every key is drawn from a small universe. Then we can keep an array , a direct-address table, and store the record with key in slot . Insert, search, and delete are each a single array access: worst-case , unbeatable.
- 1Direct-Address-Insert(T, x):
- 2
- 3Direct-Address-Search(T, k):
- 4return
- 5Direct-Address-Delete(T, x):
- 6
Direct addressing fails the moment the universe is large. To store -bit integers we would need an array of slots, impossible, even though we may hold only a few thousand keys. The waste is glaring: is almost entirely empty. The remedy is to use a table that is only as big as the number of keys we expect, and to compute a slot from the key with a hash function
The key lives in slot . We say hashes to slot , and is the hash value. Because , the function cannot be injective: two distinct keys can map to the same slot. That event is a collision, and the entire theory of hash tables is the theory of coping with collisions gracefully.
Collision resolution by chaining
The most natural fix is chaining: each slot holds a linked list of all the keys that hash to . To insert, prepend to the list at ; to search, scan that one list; to delete, splice the record out of its list.
Slots , , and hold chains; the rest are empty. Keys , , all collided at slot , so they share a list.
- 1Chained-Hash-Insert(T, x):
- 2insert at the head of list
- 3Chained-Hash-Search(T, k):
- 4search the list for an element with key
- 5Chained-Hash-Delete(T, x):
- 6delete from the list
Insertion is (prepend, assuming the key is not already present). Deletion is given a pointer to the record in a doubly linked list. Search costs time proportional to the length of the chain it scans, and that is what we must analyze.
The load factor and expected search time
Let be the number of keys stored in a table of slots. The ratio
is the load factor, the average number of keys per slot. With chaining may exceed ; it is the average chain length.
To say anything about expected chain length we need an assumption about how keys spread out. The standard one is simple uniform hashing: each key is equally likely to hash to any of the slots, independently of the others.2 Under this assumption a chain has expected length , and a search examines on average keys plus the cost of computing and indexing the table:
Sketch. For an unsuccessful search on key , we scan the entire chain , whose expected length is ; add to compute . For a successful search, the expected number of elements examined before finding is by averaging over the order in which keys were inserted. Either way the bound is .
The consequence is the headline result. If we keep the table size proportional to the number of keys, so , then every dictionary operation runs in expected time. Keeping bounded is the job of dynamic resizing: when grows past a threshold (say ), allocate a table of double the size and rehash every key into it. A single resize costs , but it is triggered only after cheap operations, so the amortized cost per operation stays , the same doubling argument that makes a dynamic array's append amortized .
Collision resolution by open addressing
Chaining stores keys outside the table. Open addressing stores every key inside the array itself: there are no lists and no pointers, so always. When a key's preferred slot is occupied, we probe a deterministic sequence of alternative slots until we find an empty one. The probe sequence is defined by extending the hash function with a probe number :
so the slots tried for key are , which must form a permutation of all slots so that probing can examine every slot.
- 1
- 2repeat
- 3
- 4if then
- 5
- 6return
- 7
- 8until
- 9error "hash table overflow"
Search follows the same probe sequence, stopping when it finds (success)
or an empty slot (failure, since is not present, because insertion would have
used that empty slot). Deletion is the awkward case: simply emptying a slot would
break the probe chains of other keys, so deleted slots are marked with a special
deleted sentinel that search skips over but insertion may reuse. Heavy
deletion is the classic reason to prefer chaining.
Three probing schemes are standard:
- Linear probing. for an ordinary hash function . Simple and cache-friendly, but it suffers primary clustering: long runs of occupied slots build up and grow ever faster, since any key hashing anywhere into a run must walk to its end.
- Quadratic probing. . The quadratic step spreads probes out, eliminating primary clustering, but two keys with the same initial slot follow the same sequence, a milder secondary clustering. The constants and must be chosen so the sequence hits every slot.
- Double hashing. , using a second hash function to set the step size. Different keys with the same start get different step sizes, so probe sequences rarely coincide. Double hashing comes closest to the ideal of uniform hashing (every key's probe sequence equally likely to be any of the permutations), and is the strongest of the three.
Cost of open addressing
Under the uniform-hashing assumption, with load factor , the expected number of probes is
The shape of is the whole story: at we expect probes; at , ten probes; as the cost explodes. Open addressing is fast only when the table is kept comfortably below full; a practical rule of thumb is to resize once exceeds about .
What makes a hash function good
Simple uniform hashing is an assumption; a real hash function must approximate it on real data. A good should scatter keys so that any regularities in the input, such as sequential integers, common prefixes, or similar strings, do not pile up in the same slots. Two classic constructions:
- The division method. . Fast, but sensitive to : choosing a power of makes depend only on the low bits of , and values near a power of are bad for decimal data. A prime not close to a power of is the safe choice.
- The multiplication method. for a constant ( is a good choice). It is insensitive to the value of , so can be a power of for fast bit shifts.
For string keys, treat the string as a base- number and fold it down, e.g. Horner's rule, over the characters , so that every character and its position influence the result.3 Skiena stresses the engineer's view: a hash function turns an arbitrary key into a pseudo-random slot, and the quality of that pseudo-randomness is exactly what protects the bound.
Universal hashing: a guarantee against every input
Any fixed hash function has an Achilles' heel: there exists a set of keys that all collide, and an adversary (or merely unlucky data) can hand it to us, degrading every operation to . Universal hashing defeats this by choosing at random from a carefully designed family of hash functions at runtime, so no single input is bad for all choices.
A family of functions from to is universal if, for every pair of distinct keys ,
where the probability is over the random choice of . That is, a randomly chosen collides any fixed pair no more often than picking two random slots would. This single property suffices to prove that, for any input set of keys, the expected length of the chain holding a given key is at most , recovering the bound without assuming anything about the data.4 The randomness lives in our coin flips, not in an assumption about the world.
A concrete universal family: pick a prime , draw random and , and set
The collection over all valid is universal. Universal
hashing is the rigorous foundation under the everyday claim that hashing is
: it is in expectation, on every input, precisely because we
randomize the hash function.
Takeaways
- A hash table implements the dictionary ADT — insert, search, delete — in expected time by mapping keys into an array with a hash function, trading away the ordered queries a search tree supports.
- Direct addressing is perfect but needs one slot per possible key; hashing shrinks the table to and resolves the resulting collisions.
- Chaining keeps a list per slot (search cost ); open addressing stores keys in the array and probes (linear, quadratic, or double hashing), with cost governed by .
- Keeping the load factor bounded, via resizing, keeps every operation expected .
- A good hash function scatters structured keys; universal hashing randomizes the choice of so the expectation holds against every input, not just under an assumption.
Footnotes
- CLRS, Ch. 11 — Hash Tables (§11.1–11.2): the dictionary ADT and the expected guarantee from hashing. ↩
- Erickson, Ch. 5 — Hash Tables: the simple uniform hashing assumption and expected chain length. ↩
- Skiena, §3.7 — Hashing and Strings: hashing string keys via Horner's-rule polynomial evaluation. ↩
- CLRS, Ch. 11 — Hash Tables (§11.3.3): universal families and the bound without distributional assumptions. ↩
╌╌ END ╌╌