Hash Tables · study.

Many problems need only three operations on a set of records, each identified by a key: insert a record, search for the record with a given key, and delete a record. This is the dictionary abstract data type (also called an associative array or map), and it is one of the most heavily used data structures in all of computing: symbol tables in compilers, routing tables in networks, the dict in your favorite scripting language. A balanced search tree does all three in $O (\log n)$ time. A hash table does them in expected $O (1)$ time: constant, independent of how many keys are stored.¹ The cost is that it gives up the ordering a tree provides: there is no efficient "next larger key" or "all keys in $[a, b]$ " query, only fast point operations.

From direct addressing to hashing

Start with the easy case. Suppose every key is drawn from a small universe $U = \{0, 1, \dots, m - 1\}$ . Then we can keep an array $T [0.. m - 1]$ , a direct-address table, and store the record with key $k$ in slot $T [k]$ . Insert, search, and delete are each a single array access: worst-case $O (1)$ , unbeatable.

Algorithm: Direct-address dictionary operations on table $T$

Direct-Address-Insert(T, x):
    $T[key(x)] \gets x$

Direct-Address-Search(T, k):
    return $T[k]$

Direct-Address-Delete(T, x):
    $T[key(x)] \gets \text{nil}$

direct_address_table.py (Python)

from typing import Generic, Optional, TypeVar

Value = TypeVar("Value")

class DirectAddressTable(Generic[Value]):
  """
    A map from integer keys in {0, ..., size-1} to values, backed by one\n
    array slot per key. Absent keys hold None.\n
  """

  def __init__(self, size: int) -> None:
    """
      Build a table addressing keys 0 through size-1.\n
    """
    if size < 0:
      raise ValueError("size must be non-negative")

    # one slot per addressable key, all empty to start.
    self.size: int = size
    self._slots: list[Optional[Value]] = [None for _ in range(size)]
    self.count: int = 0

  def _check_key(self, key: int) -> None:
    """
      Reject keys outside the addressable universe.\n
    """
    if not 0 <= key < self.size:
      raise KeyError(key)

  def insert(self, key: int, value: Value) -> None:
    """
      Store `value` under `key`, overwriting any existing entry.\n
    """
    self._check_key(key)

    # count the key only when it newly occupies an empty slot.
    if self._slots[key] is None:
      self.count += 1
    self._slots[key] = value

  def search(self, key: int) -> Optional[Value]:
    """
      The value stored under `key`, or None if the slot is empty.\n
    """
    self._check_key(key)
    return self._slots[key]

  def delete(self, key: int) -> None:
    """
      Empty the slot for `key`. A no-op if it was already empty.\n
    """
    self._check_key(key)

    # drop the count only if a value was actually present.
    if self._slots[key] is not None:
      self.count -= 1
    self._slots[key] = None

  def __contains__(self, key: int) -> bool:
    return 0 <= key < self.size and self._slots[key] is not None

  def __len__(self) -> int:
    return self.count

Direct addressing fails the moment the universe is large. To store $64$ -bit integer keys we would need an array of $2^{64}$ slots, which is impossible, and wasteful besides: we may hold only a few thousand keys, leaving $T$ almost entirely empty. The fix is a table $T [0.. m - 1]$ that is only as big as the number of keys we expect, with the slot computed from the key by a hash function

$$h : U \to \{0, 1, \dots, m - 1\} .$$

The key $k$ lives in slot $h (k)$ . We say $k$ hashes to slot $h (k)$ , and $h (k)$ is the hash value. Because $|U| > m$ , the function $h$ cannot be injective: by the pigeonhole principle, at least two distinct keys must map to the same slot. When two stored keys do so, that event is a collision, and hash-table design is largely the design of collision handling.

Collision resolution by chaining

The most natural fix is chaining: each slot $T [j]$ holds a linked list of all the keys that hash to $j$ . To insert, prepend to the list at $T [h (k)]$ ; to search, scan that one list; to delete, splice the record out of its list.

Chained hash table where colliding keys share a linked list per slot

Slots $1$ , $3$ , and $4$ hold chains; the rest are empty. Keys $k_{1}$ , $k_{5}$ , $k_{9}$ all collided at slot $4$ , so they share a list.

Algorithm: Chained-hash dictionary operations on table $T$

Chained-Hash-Insert(T, x):
    insert $x$ at the head of list $T[h(key(x))]$

Chained-Hash-Search(T, k):
    search the list $T[h(k)]$ for an element with key $k$

Chained-Hash-Delete(T, x):
    delete $x$ from the list $T[h(key(x))]$

Insertion is $O (1)$ (prepend, assuming the key is not already present). Deletion is $O (1)$ given a pointer to the record in a doubly linked list. Search costs time proportional to the length of the chain it scans, and that is what we must analyze.

For a worked trace, take $m = 7$ with the division hash $h (k) = k mod 7$ , and insert the keys $19, 26, 13, 48, 5$ in that order:

insert	$h (k)$	action
$19$	$19 mod 7 = 5$	slot $5$ empty: chain becomes $19$
$26$	$26 mod 7 = 5$	collision: prepend, chain $26 \to 19$
$13$	$13 mod 7 = 6$	slot $6$ empty: chain becomes $13$
$48$	$48 mod 7 = 6$	collision: prepend, chain $48 \to 13$
$5$	$5 mod 7 = 5$	collision: prepend, chain $5 \to 26 \to 19$

Five keys, two occupied slots, chains of length $3$ and $2$ . A successful search for $26$ computes $h (26) = 5$ and scans that chain: compare against $5$ (no), then $26$ (yes), two comparisons total. An unsuccessful search for $12$ also lands on slot $5$ ( $12 mod 7 = 5$ ), scans all three keys, hits the end of the list, and reports absence. Keys hashing to the five empty slots are rejected after zero comparisons. The spread between these cases, and how it grows with the table's fullness, is what the next section quantifies.
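The trace above can be replayed in a few lines of Python. This is a minimal sketch rather than a full dictionary: each chain is modelled as a plain list with its head at index 0, and the helper names `chained_insert` and `chained_search` are illustrative, not part of any library.

```python
def chained_insert(table: list[list[int]], key: int) -> None:
    """Prepend `key` to the chain at slot key mod m (assumes key is not already stored)."""
    table[key % len(table)].insert(0, key)


def chained_search(table: list[list[int]], key: int) -> tuple[bool, int]:
    """Scan one chain; return (found, number of key comparisons made)."""
    comparisons = 0
    for stored in table[key % len(table)]:
        comparisons += 1
        if stored == key:
            return True, comparisons
    return False, comparisons


# m = 7 empty chains, then the five keys in the order used by the trace.
table: list[list[int]] = [[] for _ in range(7)]
for key in (19, 26, 13, 48, 5):
    chained_insert(table, key)

print(table[5])                   # [5, 26, 19]
print(table[6])                   # [48, 13]
print(chained_search(table, 26))  # (True, 2)
print(chained_search(table, 12))  # (False, 3)
```

A search for any key whose slot is empty returns `(False, 0)`, the zero-comparison rejection described above.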

The load factor and expected search time

Let $n$ be the number of keys stored in a table of $m$ slots. The ratio

$$α = \frac{n}{m}$$

is the load factor, the average number of keys per slot. With chaining $α$ may exceed $1$ ; it is the average chain length.

To say anything about expected chain length we need an assumption about how keys spread out. The standard one is simple uniform hashing: each key is equally likely to hash to any of the $m$ slots, independently of the others.² Under this assumption a chain has expected length $α$ , and a search examines on average about $α$ keys plus the $O (1)$ cost of computing $h$ and indexing the table:

$$\text{unsuccessful search: } Θ (1 + α), \qquad \text{successful search: } Θ (1 + α) .$$

The two bounds differ in their constants: an unsuccessful search scans a whole chain (expected $α$ keys, or $1 + α$ steps counting the slot access), while a successful one scans on average about half a chain (about $1 + α /2$ ), since the sought key is equally likely to be any of the stored keys. At $α = 1$ , a plausible operating point for a chained table, that is about $1.5$ expected comparisons for a hit and $2$ steps for a miss: constants small enough that the hash computation itself is often the dominant cost.
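A quick simulation makes these constants concrete. The sketch below assumes ideal uniform hashing by drawing each key's slot from a seeded pseudo-random generator; the function name `simulate_chain_costs` and the seed are illustrative choices.

```python
import random


def simulate_chain_costs(n: int, m: int, seed: int = 0) -> tuple[float, float]:
    """Return (average keys scanned by a miss, average comparisons for a hit)."""
    rng = random.Random(seed)

    # Throw n keys into m slots uniformly at random; only chain lengths matter.
    lengths = [0] * m
    for _ in range(n):
        lengths[rng.randrange(m)] += 1

    # A miss lands on a uniformly random slot and scans its whole chain.
    miss = sum(lengths) / m

    # A hit for the key at position p in its chain costs p comparisons, so a
    # chain of length L contributes 1 + 2 + ... + L = L(L+1)/2 in total.
    hit = sum(length * (length + 1) // 2 for length in lengths) / n
    return miss, hit


miss, hit = simulate_chain_costs(n=10_000, m=10_000)
print(f"alpha = 1: miss scans {miss:.2f} keys, hit makes {hit:.2f} comparisons")
```

The miss figure equals $α$ exactly, because the chain lengths always sum to $n$ ; the hit figure should come out close to $1 + α /2$ .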

If we keep the table size proportional to the number of keys, $m = Θ (n)$ so $α = O (1)$ , then every dictionary operation runs in expected $Θ (1)$ time. Keeping $α$ bounded is the job of dynamic resizing: when $α$ grows past a threshold (say $1$ ), allocate a table of double the size and rehash every key into it. A single resize costs $Θ (n)$ , but it is triggered only after $Θ (n)$ cheap operations, so the amortized cost per operation stays $O (1)$ , the same doubling argument that makes a dynamic array's append amortized $O (1)$ .
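The amortized claim can be checked by counting how many keys the resizes move in total. This is a sketch under the policy stated above (double when $α$ exceeds the threshold); `total_rehash_moves` is a hypothetical helper that tracks only counts, not actual keys.

```python
def total_rehash_moves(
    num_inserts: int,
    initial_capacity: int = 8,
    max_load_factor: float = 1.0,
) -> int:
    """Count keys rehashed by all resizes during `num_inserts` insertions."""
    capacity = initial_capacity
    count = 0
    moves = 0
    for _ in range(num_inserts):
        count += 1
        if count / capacity > max_load_factor:
            moves += count   # a resize rehashes every stored key
            capacity *= 2
    return moves


for n in (1_000, 100_000):
    moves = total_rehash_moves(n)
    print(f"{n} inserts: {moves} keys moved, {moves / n:.2f} per insert")
```

The moves per insert stay below a small constant however large `n` grows, because the resize costs form a geometric series dominated by the last resize.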

chained_hash_table.py (Python)

from __future__ import annotations

from collections.abc import Hashable, Iterator
from typing import Generic, Optional, TypeVar

Key = TypeVar("Key", bound=Hashable)
Value = TypeVar("Value")

class EntryNode(Generic[Key, Value]):
  """
    One key/value record in a bucket's chain, linked to the next record.\n
  """

  def __init__(
    self,
    key: Key,
    value: Value,
    next_node: Optional[EntryNode[Key, Value]] = None,
  ) -> None:
    self.key: Key = key
    self.value: Value = value
    self.next: Optional[EntryNode[Key, Value]] = next_node

  def __repr__(self) -> str:
    return f"EntryNode({self.key!r}, {self.value!r})"

class ChainedHashTable(Generic[Key, Value]):
  """
    A dictionary backed by an array of linked-list buckets.\n
    Grows (doubling) when the load factor exceeds `max_load_factor` and\n
    shrinks (halving) when it falls below a quarter of that, so the chains\n
    stay short.\n
  """

  def __init__(
    self,
    initial_capacity: int = 8,
    max_load_factor: float = 1.0,
  ) -> None:
    if initial_capacity < 1:
      raise ValueError("initial_capacity must be positive")
    if max_load_factor <= 0:
      raise ValueError("max_load_factor must be positive")

    # start with empty chains in every slot.
    self._capacity: int = initial_capacity
    self._max_load_factor: float = max_load_factor
    self._buckets: list[Optional[EntryNode[Key, Value]]] = [None for _ in range(self._capacity)]
    self.count: int = 0

  @property
  def load_factor(self) -> float:
    """
      The current average chain length, alpha = n / m.\n
    """
    return self.count / self._capacity

  def _slot(self, key: Key, capacity: int) -> int:
    """
      The bucket index for `key` under a table of `capacity` slots.\n
    """
    return hash(key) % capacity

  def _find_node(self, key: Key) -> Optional[EntryNode[Key, Value]]:
    """
      The node holding `key`, or None if it is absent.\n
    """
    # walk the chain at the key's slot until a match or the end.
    node = self._buckets[self._slot(key, self._capacity)]
    while node is not None:
      if node.key == key:
        return node
      node = node.next

    return None

  def insert(self, key: Key, value: Value) -> None:
    """
      Map `key` to `value`, overwriting an existing entry. Resizes up when\n
      the load factor would exceed the threshold.\n
    """
    # overwrite in place when the key already lives in its chain.
    existing: Optional[EntryNode[Key, Value]] = self._find_node(key)
    if existing is not None:
      existing.value = value
      return

    # new key: prepend to its chain, then grow if we are too loaded.
    index: int = self._slot(key, self._capacity)
    self._buckets[index] = EntryNode(key, value, self._buckets[index])
    self.count += 1
    if self.load_factor > self._max_load_factor:
      self._resize(self._capacity * 2)

  def search(self, key: Key) -> Optional[Value]:
    """
      The value mapped to `key`, or None if it is not present.\n
    """
    node: Optional[EntryNode[Key, Value]] = self._find_node(key)
    return node.value if node is not None else None

  def delete(self, key: Key) -> bool:
    """
      Remove `key` from the table. Returns False if it was absent.\n
      Shrinks the table when it grows sparse.\n
    """
    # walk the chain, tracking the predecessor so we can unsplice.
    index: int = self._slot(key, self._capacity)
    node = self._buckets[index]
    previous: Optional[EntryNode[Key, Value]] = None

    while node is not None:
      if node.key != key:
        previous, node = node, node.next
        continue

      # splice the matched node out of its chain.
      if previous is None:
        self._buckets[index] = node.next
      else:
        previous.next = node.next
      self.count -= 1

      # shrink the table when the chains have grown too sparse.
      if self._capacity > 1 and self.load_factor < self._max_load_factor / 4:
        self._resize(max(1, self._capacity // 2))
      return True

    return False

  def _resize(self, new_capacity: int) -> None:
    """
      Rehash every entry into a fresh table of `new_capacity` slots.\n
    """
    # swap in an empty table of the new size, keeping the old chains.
    old_buckets: list[Optional[EntryNode[Key, Value]]] = self._buckets
    self._capacity = new_capacity
    self._buckets = [None for _ in range(new_capacity)]

    for head in old_buckets:
      node = head
      while node is not None:

        # reattach into the new bucket; stash next before we overwrite it.
        following: Optional[EntryNode[Key, Value]] = node.next
        index: int = self._slot(node.key, new_capacity)
        node.next = self._buckets[index]
        self._buckets[index] = node
        node = following

  def items(self) -> Iterator[tuple[Key, Value]]:
    """
      Yield every stored (key, value) pair in unspecified order.\n
    """
    for head in self._buckets:
      node = head
      while node is not None:
        yield node.key, node.value
        node = node.next

  def __contains__(self, key: Key) -> bool:
    return self._find_node(key) is not None

  def __getitem__(self, key: Key) -> Value:
    node: Optional[EntryNode[Key, Value]] = self._find_node(key)
    if node is None:
      raise KeyError(key)
    return node.value

  def __setitem__(self, key: Key, value: Value) -> None:
    self.insert(key, value)

  def __len__(self) -> int:
    return self.count

Collision resolution by open addressing

Chaining stores keys outside the table. Open addressing stores every key inside the array itself: there are no lists and no pointers, so $α \leq 1$ always. When a key's preferred slot is occupied, we probe a deterministic sequence of alternative slots until we find an empty one. The probe sequence is defined by extending the hash function with a probe number $i$ :

$$h : U \times \{0, 1, \dots, m - 1\} \to \{0, 1, \dots, m - 1\},$$

so the slots tried for key $k$ are $h (k, 0), h (k, 1), h (k, 2), \dots$ , which must form a permutation of all $m$ slots so that probing can examine every slot.

Algorithm: Hash-Insert(T, k), open addressing; returns the slot used

$i \gets 0$
repeat
    $j \gets h(k, i)$
    if $T[j] = \text{nil}$ then
        $T[j] \gets k$
        return $j$
    $i \gets i + 1$
until $i = m$
error "hash table overflow"

Search follows the same probe sequence, stopping when it finds $k$ (success) or an empty slot (failure: had $k$ been inserted, it would have been placed in that slot or an earlier one in the sequence). Deletion is the awkward case: simply emptying a slot would break the probe sequences of other keys that were inserted past it, so deleted slots are marked with a special deleted sentinel (a tombstone) that search skips over but insertion may reuse. Heavy deletion is the classic reason to prefer chaining.

Why deletion needs a tombstone. Keys $k_{1}, k_{2}, k_{3}$ probed into a run ending at slot $5$ . Erasing $k_{2}$ to nil (top) would make a later search for $k_{3}$ stop early at the gap; marking it deleted (bottom) lets search probe past while insertion may reuse the slot.
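The failure is easy to reproduce. The sketch below uses linear probing with $h^{'} (k) = k \bmod 7$ and three keys that all hash to slot $3$ , so they form a run ending at slot $5$ ; the keys, the `DELETED` sentinel object, and the helper names are assumptions made for illustration.

```python
EMPTY = None
DELETED = object()  # tombstone sentinel: "once occupied, now deleted"


def probe_insert(table: list, key: int) -> int:
    """Linear-probe insert; reuses tombstones (assumes key is not already stored)."""
    m = len(table)
    for i in range(m):
        j = (key % m + i) % m
        if table[j] is EMPTY or table[j] is DELETED:
            table[j] = key
            return j
    raise RuntimeError("hash table overflow")


def probe_search(table: list, key: int):
    """Return the slot holding `key`, or None; stops at the first EMPTY slot."""
    m = len(table)
    for i in range(m):
        j = (key % m + i) % m
        if table[j] is EMPTY:
            return None
        if table[j] is not DELETED and table[j] == key:
            return j
    return None


def build() -> list:
    table = [EMPTY] * 7
    for key in (3, 10, 17):      # all hash to slot 3, filling slots 3, 4, 5
        probe_insert(table, key)
    return table


naive = build()
naive[4] = EMPTY                 # erase the middle key outright
print(probe_search(naive, 17))   # None: the search stops at the gap

marked = build()
marked[4] = DELETED              # tombstone the middle key instead
print(probe_search(marked, 17))  # 5: the search probes past the tombstone
```

A later `probe_insert(marked, 24)` lands in slot 4, showing that insertion reclaims the tombstone.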

Three probing schemes are standard:

Linear probing. $h (k, i) = (h^{'} (k) + i) mod m$ for an ordinary hash function $h^{'}$ . Simple and cache-friendly, but it suffers primary clustering: long runs of occupied slots build up and grow ever faster, since any key hashing anywhere into a run must walk to its end.

Primary clustering. A run of four occupied slots (shaded) absorbs any new key whose hash $h^{'} (k)$ lands on slot $3$ , $4$ , $5$ , or $6$ : four of the eleven slots, each forcing a walk to the run's right end at slot $7$ (accented arc). Every such insertion lengthens the run, so clusters snowball: the bigger the run, the likelier the next key extends it.

Quadratic probing. $h (k, i) = (h^{'} (k) + c_{1} i + c_{2} i^{2}) \bmod m$ . The quadratic step spreads probes out, eliminating primary clustering, but two keys with the same initial slot follow the same sequence, a milder secondary clustering. The constants and $m$ must be chosen so the sequence hits every slot.
Double hashing. $h (k, i) = (h_{1} (k) + i \cdot h_{2} (k)) \bmod m$ , using a second hash function to set the step size; $h_{2} (k)$ must be nonzero and relatively prime to $m$ so the sequence reaches every slot. Different keys with the same start usually get different step sizes, so probe sequences rarely coincide. Of the three, double hashing comes closest to the ideal of uniform hashing (every key's probe sequence equally likely to be any of the $m!$ permutations).

Three probe sequences from $h^{'} (k) = 7$ in a table of size $m = 11$ with slots $\{0, 7, 8, 10\}$ occupied (shaded). Labels $0, 1, 2, \dots$ give probe order; the boxed slot is where the key lands. Linear walks $+ 1$ (lands $9$ ); quadratic adds $i^{2}$ (lands $5$ ); double hashing steps by $h_{2} (k) = 3$ (lands $2$ ).
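The three landing slots in this example can be verified directly. The sketch below takes the same table ( $m = 11$ , home slot $7$ , occupied slots $\{0, 7, 8, 10\}$ ), uses $c_{1} = 0, c_{2} = 1$ for the quadratic scheme and a fixed step of $3$ for double hashing; `first_free_slot` is an illustrative helper name.

```python
from typing import Callable


def first_free_slot(
    occupied: set[int],
    m: int,
    offset: Callable[[int], int],
    home: int,
) -> tuple[int, list[int]]:
    """Follow (home + offset(i)) mod m until a free slot; return it with the path."""
    path: list[int] = []
    for i in range(m):
        j = (home + offset(i)) % m
        path.append(j)
        if j not in occupied:
            return j, path
    raise RuntimeError("no free slot on this probe sequence")


m, home, occupied = 11, 7, {0, 7, 8, 10}

print(first_free_slot(occupied, m, lambda i: i, home))      # (9, [7, 8, 9])
print(first_free_slot(occupied, m, lambda i: i * i, home))  # (5, [7, 8, 0, 5])
print(first_free_slot(occupied, m, lambda i: 3 * i, home))  # (2, [7, 10, 2])
```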

A full insertion trace

Linear probing on the same five keys used for chaining shows the displacement mechanics end to end. Table size $m = 7$ , $h^{'} (k) = k mod 7$ , keys $19, 26, 13, 48, 5$ in order:

insert	$h^{'} (k)$	probe sequence	lands in	probes
$19$	$5$	$5$	$5$	$1$
$26$	$5$	$5, 6$	$6$	$2$
$13$	$6$	$6, 0$	$0$	$2$
$48$	$6$	$6, 0, 1$	$1$	$3$
$5$	$5$	$5, 6, 0, 1, 2$	$2$	$5$

Two things happen that chaining never shows. First, the probe for $13$ steps off the right end of the table and wraps to slot $0$ , the same modular arithmetic as a circular buffer. Second, the cluster snowballs: after four inserts the run spans slots $5, 6, 0, 1$ , so the fifth key, whose home slot is the head of that run, must walk its entire length before finding an empty slot at $2$ . The table was only $57\%$ full ( $4/7$ ) when that insert began, yet it cost five probes; with all five keys stored ( $α \approx 0.71$ ), a search for $5$ retraces the same five slots.
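The table of probe sequences above can be regenerated mechanically. A minimal sketch, assuming keys are small non-negative integers and using the illustrative helper name `linear_probe_insert`:

```python
from typing import Optional


def linear_probe_insert(table: list[Optional[int]], key: int) -> list[int]:
    """Insert `key` by linear probing; return the list of slots probed."""
    m = len(table)
    probed: list[int] = []
    for i in range(m):
        j = (key % m + i) % m      # wraps past the end like a circular buffer
        probed.append(j)
        if table[j] is None:
            table[j] = key
            return probed
    raise RuntimeError("hash table overflow")


table: list[Optional[int]] = [None] * 7
for key in (19, 26, 13, 48, 5):
    probed = linear_probe_insert(table, key)
    print(f"insert {key:2d}: probes {probed}, lands in {probed[-1]}")

print(table)  # [13, 48, 5, None, None, 19, 26]
```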

The cluster snowballing, one insert at a time ( $m = 7$ , $h^{'} (k) = k \bmod 7$ ). Small accent numerals give each insert's probe order; the boxed slot is where the key lands. Insert $13$ wraps past the table end to slot $0$ ; insert $5$ walks the whole run $5, 6, 0, 1$ before landing at $2$ .

The final states of the two strategies, side by side on identical input, make the structural difference plain: chaining grows lists and leaves the table sparse, while open addressing keeps everything in the array at the cost of displaced keys sitting far from home.

The same five keys under both strategies. Chaining (left) leaves five of seven slots empty and grows two lists. Linear probing (right) packs all keys into the array; muted annotations mark each displaced key's home slot, and four of five keys sit away from home.

Cost of open addressing

Under the uniform-hashing assumption, with load factor $α = n / m < 1$ , the expected number of probes is at most

$$\text{unsuccessful search: } \frac{1}{1 - α}, \qquad \text{successful search: } \frac{1}{α} \ln \frac{1}{1 - α} .$$

Both bounds are worth deriving, because the derivations expose why the costs behave so differently.³
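One standard route to both bounds is sketched here, assuming uniform hashing and no deletions. The symbol $X$ (the number of probes in an unsuccessful search) is notation introduced only for this sketch.

```latex
% Unsuccessful search: probe i happens only if the first i-1 probes all hit
% occupied slots, and each factor below is at most n/m because n < m.
\Pr[X \ge i]
  = \frac{n}{m} \cdot \frac{n-1}{m-1} \cdots \frac{n-i+2}{m-i+2}
  \le \left(\frac{n}{m}\right)^{i-1} = \alpha^{i-1},
\qquad
\mathbb{E}[X] = \sum_{i \ge 1} \Pr[X \ge i]
  \le \sum_{i \ge 1} \alpha^{i-1} = \frac{1}{1-\alpha}.

% Successful search: the (i+1)-st key inserted met a table of load i/m, so
% finding it again costs at most 1/(1 - i/m) = m/(m-i) probes. Average over
% the n stored keys and bound the harmonic sum by an integral.
\frac{1}{n} \sum_{i=0}^{n-1} \frac{m}{m-i}
  = \frac{1}{\alpha} \sum_{k=m-n+1}^{m} \frac{1}{k}
  \le \frac{1}{\alpha} \int_{m-n}^{m} \frac{dx}{x}
  = \frac{1}{\alpha} \ln \frac{m}{m-n}
  = \frac{1}{\alpha} \ln \frac{1}{1-\alpha}.
```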

The bound has a clean reading: with probability $1 - α$ each probe is the last, so the probe count is dominated by a geometric random variable with success probability $1 - α$ . Inserting a key costs the same, since insertion is an unsuccessful search that writes into the empty slot it finds.

The asymmetry between the two results is the practical takeaway. Plug in numbers: at $α = 0.5$ , an unsuccessful search expects $2$ probes and a successful one $2 ln 2 \approx 1.39$ ; at $α = 0.9$ , an unsuccessful search expects $10$ probes while a successful one expects only $\frac{1}{0.9} ln 10 \approx 2.56$ . Hits stay cheap even in a crowded table, because most keys were inserted while the table was still relatively empty and therefore sit early in their probe sequences. Misses (and inserts) pay the full $1/ (1 - α)$ , which explodes as $α \to 1$ . Open addressing is fast only when the table is kept comfortably below full; a practical rule of thumb is to resize once $α$ exceeds about $0.7$ .

These formulas assume ideal uniform hashing, which double hashing approximates well. Linear probing is measurably worse because of primary clustering: its expected probe counts are roughly $\frac{1}{2} (1 + \frac{1}{( 1 - α ) ^{2}})$ for an unsuccessful search and $\frac{1}{2} (1 + \frac{1}{1 - α})$ for a successful one. At $α = 0.9$ that is about $50$ probes per miss instead of $10$ , a $5 \times$ penalty for the same load. Its saving grace is the cache: the probed slots are adjacent, so those $50$ probes may touch only a handful of cache lines while double hashing's $10$ probes take $10$ misses. On modern hardware, linear probing at moderate load ( $α \leq 0.5$ , where the formulas give $\leq 2.5$ probes) is often the fastest scheme in practice.
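The figures quoted in the last two paragraphs come straight from the four formulas. A small sketch that evaluates them (the function names are illustrative):

```python
import math


def uniform_miss(alpha: float) -> float:
    """Expected probes for a miss under uniform hashing: 1 / (1 - alpha)."""
    return 1 / (1 - alpha)


def uniform_hit(alpha: float) -> float:
    """Expected probes for a hit under uniform hashing: (1/alpha) ln(1/(1 - alpha))."""
    return math.log(1 / (1 - alpha)) / alpha


def linear_miss(alpha: float) -> float:
    """Approximate probes for a miss under linear probing."""
    return 0.5 * (1 + 1 / (1 - alpha) ** 2)


def linear_hit(alpha: float) -> float:
    """Approximate probes for a hit under linear probing."""
    return 0.5 * (1 + 1 / (1 - alpha))


print("alpha  uniform miss/hit   linear miss/hit")
for alpha in (0.5, 0.7, 0.9):
    print(
        f"{alpha:.1f}    "
        f"{uniform_miss(alpha):6.2f} {uniform_hit(alpha):5.2f}     "
        f"{linear_miss(alpha):6.2f} {linear_hit(alpha):5.2f}"
    )
```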

The three strategies, summarized at a glance:

	chaining	linear probing	double hashing
expected miss cost	$1 + α$	$\frac{1}{2} (1 + \frac{1}{( 1 - α ) ^{2}})$	$\frac{1}{1 - α}$
expected hit cost	$1 + \frac{α}{2}$	$\frac{1}{2} (1 + \frac{1}{1 - α})$	$\frac{1}{α} ln \frac{1}{1 - α}$
load factor range	any ( $α > 1$ fine)	$α < 1$ , keep $\leq 0.5$	$α < 1$ , keep $\leq 0.7$
deletion	$O (1)$ splice	tombstones	tombstones
cache behavior	poor (pointer chasing)	excellent	moderate
space overhead	one pointer per key	none	none

Chaining degrades gracefully (linearly in $α$ ) and deletes cleanly; open addressing wins on memory and locality but demands headroom and careful deletion. That tension is why both families appear in standard libraries.

Expected probes for an unsuccessful search against load factor $α$ . Chaining's $1 + α$ (muted) stays linear; open addressing's $1/ (1 - α)$ (accent) is flat while the table is half-empty, turns sharply upward near the resize threshold $α = 0.7$ (dashed), and diverges as $α \to 1$ , the reason open addressing must never run near full.

open_addressing_hash_table.py (Python)

from collections.abc import Hashable, Iterator
from enum import Enum
from typing import Generic, Optional, TypeVar

Key = TypeVar("Key", bound=Hashable)
Value = TypeVar("Value")

class ProbeScheme(Enum):
  """
    Which probe sequence an open-addressing table walks on collision.\n
  """
  LINEAR = "linear"
  QUADRATIC = "quadratic"
  DOUBLE_HASHING = "double_hashing"

class Slot(Generic[Key, Value]):
  """
    One array cell: a stored key/value pair, plus a flag marking it as a\n
    tombstone (a once-occupied slot whose entry was deleted).\n
  """

  def __init__(self, key: Key, value: Value) -> None:
    self.key: Key = key
    self.value: Value = value
    self.is_tombstone: bool = False

  def __repr__(self) -> str:
    state: str = "deleted" if self.is_tombstone else "live"
    return f"Slot({self.key!r}, {self.value!r}, {state})"

class OpenAddressingHashTable(Generic[Key, Value]):
  """
    A dictionary that stores every key inside its own array, probing on\n
    collision per the chosen scheme. Capacity is always a power of two.\n
  """

  def __init__(
    self,
    initial_capacity: int = 8,
    max_load_factor: float = 0.7,
    scheme: ProbeScheme = ProbeScheme.LINEAR,
  ) -> None:
    if not 0.0 < max_load_factor < 1.0:
      raise ValueError("max_load_factor must lie strictly between 0 and 1")

    # capacity is always a power of two so probe arithmetic stays simple.
    self._capacity: int = self._round_up_to_power_of_two(initial_capacity)
    self._max_load_factor: float = max_load_factor
    self._scheme: ProbeScheme = scheme

    # the backing array starts empty.
    self._slots: list[Optional[Slot[Key, Value]]] = [None for _ in range(self._capacity)]
    self.count: int = 0

  @staticmethod
  def _round_up_to_power_of_two(value: int) -> int:
    """
      The smallest power of two greater than or equal to `value`, at least 1.\n
    """
    power: int = 1
    while power < value:
      power *= 2
    return power

  @property
  def load_factor(self) -> float:
    """
      The current occupancy, alpha = n / m (tombstones not counted).\n
    """
    return self.count / self._capacity

  def _step(self, key: Key, probe: int) -> int:
    """
      The offset added to the base slot on probe number `probe`.\n
      For double hashing the step uses a second hash forced odd, which is\n
      coprime to the power-of-two capacity, so the sequence visits every\n
      slot.\n
    """
    # linear: consecutive slots.
    if self._scheme is ProbeScheme.LINEAR:
      return probe

    # quadratic: triangular numbers (i^2 + i)/2 permute a power-of-two table.
    if self._scheme is ProbeScheme.QUADRATIC:
      return (probe * probe + probe) // 2

    # double hashing: a second, odd step size derived from the key.
    second_hash: int = (hash(key) // self._capacity) % self._capacity
    return probe * (second_hash | 1)

  def _slot_index(self, key: Key, probe: int) -> int:
    """
      The array index probed for `key` on attempt `probe`.\n
    """
    # base slot plus this probe's offset, wrapped into the array.
    base: int = hash(key) % self._capacity
    return (base + self._step(key, probe)) % self._capacity

  def _find_index(self, key: Key) -> Optional[int]:
    """
      The index of the live slot holding `key`, or None if absent.\n
    """
    # walk the probe chain looking for a live slot holding the key.
    for probe in range(self._capacity):
      index: int = self._slot_index(key, probe)
      slot: Optional[Slot[Key, Value]] = self._slots[index]

      # an empty (never-used) slot ends the chain: the key is absent.
      if slot is None:
        return None
      if not slot.is_tombstone and slot.key == key:
        return index

    return None

  def insert(self, key: Key, value: Value) -> None:
    """
      Map `key` to `value`, overwriting an existing entry. Resizes up before\n
      the load factor would cross the threshold.\n
    """
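    # grow before alpha crosses the threshold, where probing degrades.
    # note: load_factor ignores tombstones, so delete-heavy churn can clog
    # the array with tombstones without ever triggering this rebuild.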
    if self.load_factor >= self._max_load_factor:
      self._resize(self._capacity * 2)
    self._insert_into_slots(self._slots, self._capacity, key, value)

  def _insert_into_slots(
    self,
    slots: list[Optional[Slot[Key, Value]]],
    capacity: int,
    key: Key,
    value: Value,
  ) -> None:
    """
      Place `key`/`value` into `slots`, reusing the first tombstone seen but\n
      still scanning ahead to overwrite an existing live copy of the key.\n
    """
    # remember the first tombstone so we can reuse it once the key is ruled out.
    base: int = hash(key) % capacity
    first_tombstone: Optional[int] = None

    for probe in range(capacity):
      index: int = (base + self._step(key, probe)) % capacity
      slot: Optional[Slot[Key, Value]] = slots[index]

      # empty slot ends the chain: land in the earliest tombstone, else here.
      if slot is None:
        target: int = first_tombstone if first_tombstone is not None else index
        slots[target] = Slot(key, value)
        self.count += 1
        return

      # note a tombstone for reuse, or overwrite a live copy of the key.
      if slot.is_tombstone:
        if first_tombstone is None:
          first_tombstone = index
      elif slot.key == key:
        slot.value = value
        return

    # chain was full of live/tombstone slots: reuse a tombstone if one exists.
    if first_tombstone is not None:
      slots[first_tombstone] = Slot(key, value)
      self.count += 1
      return
    raise RuntimeError("hash table overflow")

  def search(self, key: Key) -> Optional[Value]:
    """
      The value mapped to `key`, or None if it is not present.\n
    """
    index: Optional[int] = self._find_index(key)
    if index is None:
      return None

    slot: Optional[Slot[Key, Value]] = self._slots[index]
    return slot.value if slot is not None else None

  def delete(self, key: Key) -> bool:
    """
      Remove `key` by marking its slot a tombstone. Returns False if absent.\n
    """
    index: Optional[int] = self._find_index(key)
    if index is None:
      return False

    # tombstone the slot so probe chains through it stay intact.
    slot: Optional[Slot[Key, Value]] = self._slots[index]
    if slot is not None:
      slot.is_tombstone = True
    self.count -= 1
    return True

  def _resize(self, new_capacity: int) -> None:
    """
      Rehash every live entry into a fresh table, discarding tombstones.\n
    """
    # swap in a fresh, larger table and reset the running count.
    new_capacity = self._round_up_to_power_of_two(new_capacity)
    old_slots: list[Optional[Slot[Key, Value]]] = self._slots
    self._slots = [None for _ in range(new_capacity)]
    self._capacity = new_capacity
    self.count = 0

    # re-probe every live entry into the new table, dropping tombstones.
    for slot in old_slots:
      if slot is not None and not slot.is_tombstone:
        self._insert_into_slots(self._slots, new_capacity, slot.key, slot.value)

  def items(self) -> Iterator[tuple[Key, Value]]:
    """
      Yield every live (key, value) pair in unspecified order.\n
    """
    for slot in self._slots:
      if slot is not None and not slot.is_tombstone:
        yield slot.key, slot.value

  def __contains__(self, key: Key) -> bool:
    return self._find_index(key) is not None

  def __getitem__(self, key: Key) -> Value:
    index: Optional[int] = self._find_index(key)
    if index is None:
      raise KeyError(key)

    slot: Optional[Slot[Key, Value]] = self._slots[index]
    assert slot is not None
    return slot.value

  def __setitem__(self, key: Key, value: Value) -> None:
    self.insert(key, value)

  def __len__(self) -> int:
    return self.count

Resizing and rehashing

Every bound in this lesson conditions on $α$ staying moderate, and resizing is the mechanism that enforces it. When the load factor crosses its threshold ( $1$ is a common trigger for chaining, around $0.5$ for linear probing, $0.7$ for double hashing), allocate a new table roughly twice the size (for the division method, the next prime past $2 m$ ) and re-insert every key.

The keys cannot simply be copied across: $h (k)$ depends on $m$ , so growing the table changes every key's home slot. Each key is rehashed, its hash recomputed against the new modulus, and inserted fresh. Continuing the running example, growing from $m = 7$ to the prime $m = 17$ sends the five keys to entirely new homes:

$$19 \bmod 17 = 2, \quad 26 \bmod 17 = 9, \quad 13 \bmod 17 = 13, \quad 48 \bmod 17 = 14, \quad 5 \bmod 17 = 5 .$$

All five now land in distinct slots, so the probe-displaced keys of the crowded table return to their home positions and every chain (or run) dissolves.
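The growth step can be sketched in code: pick the next prime past $2 m$ , then recompute every key's home slot against it. The helpers `is_prime` and `next_prime_after` are illustrative and use plain trial division, which is adequate for table sizes but is only one of several possible choices.

```python
import math


def is_prime(n: int) -> bool:
    """Trial-division primality test, fine for table-sized integers."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, math.isqrt(n) + 1))


def next_prime_after(n: int) -> int:
    """The smallest prime strictly greater than n."""
    candidate = n + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


keys = (19, 26, 13, 48, 5)
old_m = 7
new_m = next_prime_after(2 * old_m)

old_homes = {key: key % old_m for key in keys}
new_homes = {key: key % new_m for key in keys}

print(new_m)      # 17
print(old_homes)  # {19: 5, 26: 5, 13: 6, 48: 6, 5: 5}
print(new_homes)  # {19: 2, 26: 9, 13: 13, 48: 14, 5: 5}
```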

Rehashing on growth. The crowded $m = 7$ open-addressed table (top, $α \approx 0.71$ , four keys displaced from home) is rebuilt into $m = 17$ (bottom, $α \approx 0.29$ ): every key's hash is recomputed against the new modulus, and all five land in distinct home slots.

A resize costs $Θ (n + m)$ : touch every key, plus scan the old table. It is still cheap on average, by the same argument that gives a dynamic array its amortized $O (1)$ append: after doubling, the table holds $n$ keys with capacity for about $2 n$ , so at least $Θ (n)$ cheap inserts must occur before the next resize, and the $Θ (n)$ rebuild spreads to $O (1)$ per insert. Shrinking mirrors growth with hysteresis, rebuilding smaller only when $α$ falls to something like $1/4$ of the threshold, so a workload oscillating at the boundary cannot trigger a rebuild per operation.

Rehashing also serves a second purpose for open-addressed tables: it is the one reliable way to clear tombstones in bulk, since insertion reuses them only opportunistically. A deleted marker still lengthens probe sequences, because searches must walk past it, so the cost of operations is governed by the effective load: occupied slots plus tombstones, over $m$ . A long-lived table under heavy insert/delete churn can have few live keys yet terrible searches, its array full of tombstones. The standard policy tracks both counts and rebuilds, at the same size or smaller, once tombstones exceed a fixed fraction of the table, restoring the true load factor. When deletions dominate the workload and rebuilds are unwelcome, chaining, whose deletes are genuine $O (1)$ splices with nothing left behind, is the safer default.

What makes a hash function good

Simple uniform hashing is an assumption; a real hash function must approximate it on real data. A good $h$ should scatter keys so that any regularities in the input, such as sequential integers, common prefixes, or similar strings, do not pile up in the same slots. Two classic constructions:

The division method. $h (k) = k mod m$ . Fast, but sensitive to $m$ : choosing $m$ a power of $2$ makes $h$ depend only on the low bits of $k$ , and values near a power of $10$ are bad for decimal data. A prime $m$ not close to a power of $2$ is the safe choice.
The multiplication method. $h (k) = ⌊ m \cdot (k A mod 1) ⌋$ for a constant $0 < A < 1$ ( $A = (\sqrt{5} - 1) /2$ is a good choice). It is insensitive to the value of $m$ , so $m$ can be a power of $2$ for fast bit shifts.
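A small experiment shows why the division method is sensitive to $m$ . The key set (multiples of 16) and the helper name `slot_counts` are illustrative assumptions; any keys sharing their low bits would behave the same way.

```python
from collections import Counter

def slot_counts(keys: list[int], table_size: int) -> Counter:
  """Tally how many keys the division method sends to each slot."""
  return Counter(key % table_size for key in keys)

# keys that all share the same low four bits.
keys: list[int] = [16 * i for i in range(100)]

# m = 16, a power of two: the hash sees only the low four bits.
power_of_two_table = slot_counts(keys, 16)

# m = 17, a prime: the same keys spread over every slot.
prime_table = slot_counts(keys, 17)

slots_used = (len(power_of_two_table), len(prime_table))
```

With $m = 16$ every key lands in slot 0, while with $m = 17$ all 17 slots are used and no slot holds more than 6 of the 100 keys.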

For string keys, treat the string as a base- $b$ number and fold it down, e.g. Horner's rule, $h = (h \cdot b + c) mod m$ over the characters $c$ , so that every character and its position influence the result.⁴ Skiena stresses the engineer's view: a hash function turns an arbitrary key into a pseudo-random slot, and the quality of that pseudo-randomness is what protects the $O (1)$ bound.

hash_functions.pypython

import math

# the reciprocal golden ratio, knuth's recommended multiplier for the
# multiplication method: (sqrt(5) - 1) / 2.
GOLDEN_RATIO_FRACTION: float = (math.sqrt(5) - 1) / 2

def division_hash(key: int, table_size: int) -> int:
  """
    The division method: h(key) = key mod table_size.\n
    Fast, but sensitive to table_size — a prime not near a power of two is\n
    the safe choice. Works for negative keys via Python's floored modulo.\n
  """
  if table_size <= 0:
    raise ValueError("table_size must be positive")
  return key % table_size

def multiplication_hash(
  key: int,
  table_size: int,
  multiplier: float = GOLDEN_RATIO_FRACTION,
) -> int:
  """
    The multiplication method: h(key) = floor(table_size * frac(key * A)),\n
    where frac(x) is the fractional part of x and 0 < A < 1. Insensitive to\n
    table_size, so it may be a power of two.\n
  """
  if table_size <= 0:
    raise ValueError("table_size must be positive")
  if not 0.0 < multiplier < 1.0:
    raise ValueError("multiplier must lie strictly between 0 and 1")

  # take the fractional part of key * A, then scale into a slot.
  fractional_part: float = (key * multiplier) % 1.0
  slot: int = int(table_size * fractional_part)

  # guard the boundary: rounding can nudge the product to exactly
  # table_size, one past the last valid slot.
  return min(slot, table_size - 1)

def string_hash(text: str, table_size: int, radix: int = 256) -> int:
  """
    Horner's-rule string hash: fold the characters into a base-`radix`\n
    number modulo table_size, so every character and its position counts.\n
    h = (h * radix + ord(character)) mod table_size over the string.\n
  """
  if table_size <= 0:
    raise ValueError("table_size must be positive")

  # fold each character in via horner's rule, reducing modulo table_size.
  hash_value: int = 0
  for character in text:
    hash_value = (hash_value * radix + ord(character)) % table_size
  return hash_value

Universal hashing: a guarantee against every input

Any fixed hash function has a weakness: there exists a set of keys that all collide, and an adversary (or merely unlucky data) can hand it to us, degrading every operation to $Θ (n)$ . Universal hashing avoids this by choosing $h$ at random from a carefully designed family $H$ of hash functions at runtime, so no single input is bad for all choices.

A family $H$ of hash functions from $U$ to ${0, \dots, m - 1}$ is universal if, for every pair of distinct keys $k \neq ℓ$ , $Pr_{h \in H} [h (k) = h (ℓ)] ≤ 1/ m$ . That is, a randomly chosen $h$ collides any fixed pair no more often than picking two random slots would. This single property suffices to prove that, for any input set of keys, the expected length of the chain holding a given key is at most $1 + α$ , recovering the $Θ (1 + α)$ bound without assuming anything about the data.⁵ The randomness lives in our coin flips, not in an assumption about the world.

The proof is two lines of linearity of expectation, which is the point: universality is the weakest property that makes the chaining analysis go through, so it is the right definition. An adversary who must fix the input without seeing our random draw of $h$ cannot defeat it, because a bad input would have to be tailored to $h$ after the draw.
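The argument can be written out as follows. This is a sketch assuming the standard definition of universality, $Pr_{h \in H} [h (k) = h (ℓ)] ≤ 1/ m$ for distinct keys, with $S$ the set of $n$ stored keys, $k \in S$ , and $X_{k ℓ}$ (notation introduced here) the indicator that $k$ and $ℓ$ collide:

```latex
\begin{align*}
\mathbb{E}\bigl[\text{length of the chain holding } k\bigr]
  &= 1 + \sum_{\ell \in S,\ \ell \neq k} \mathbb{E}[X_{k\ell}]
   = 1 + \sum_{\ell \in S,\ \ell \neq k} \Pr_{h \in H}\bigl[h(k) = h(\ell)\bigr] \\
  &\le 1 + \frac{n - 1}{m} \;<\; 1 + \alpha .
\end{align*}
```

The leading $1$ counts $k$ itself; each of the other $n - 1$ keys joins its chain with probability at most $1/ m$ , and nothing about the keys is assumed beyond their being distinct.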

A concrete universal family: pick a prime $p > ∣ U ∣$ , draw random $a \in {1, \dots, p - 1}$ and $b \in {0, \dots, p - 1}$ , and set

$h_{a, b} (k) = ((a k + b) mod p) mod m$ .

The collection ${h_{a, b}}$ over all valid $a, b$ is universal.

To use the family, pick $a$ and $b$ once, at table-creation time, and keep them for the table's lifetime (rehashing on resize is a natural moment to redraw them). A tiny instance: with $p = 17$ and $m = 7$ , the keys $k = 1$ and $ℓ = 8$ collide under $h_{1, 0}$ (since $1 mod 7 = 8 mod 7 = 1$ ) but not under $h_{3, 4}$ (which sends them to $(7 mod 17) mod 7 = 0$ and $(28 mod 17) mod 7 = 4$ ). No fixed pair is unlucky for more than a $1/ m$ fraction of the draws, so an adversary who knows the family, but not the draw, cannot manufacture collisions. Universal hashing is the rigorous foundation under the everyday claim that hashing is $O (1)$ : it is $O (1)$ in expectation, on every input, precisely because we randomize the hash function.
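The tiny instance above can be checked by brute force. The sketch below hard-codes $p = 17$ and $m = 7$ from the example; the function name `h` and the exhaustive enumeration of every $(a, b)$ draw are illustrative choices, feasible only because the family is so small.

```python
def h(a: int, b: int, key: int, prime: int = 17, table_size: int = 7) -> int:
  """One member h_{a,b} of the family: ((a*key + b) mod prime) mod table_size."""
  return ((a * key + b) % prime) % table_size

# the two draws discussed in the text.
collide_under_1_0: bool = h(1, 0, 1) == h(1, 0, 8)
slots_under_3_4: tuple[int, int] = (h(3, 4, 1), h(3, 4, 8))

# enumerate the whole family: a in 1..16, b in 0..16.
draws = [(a, b) for a in range(1, 17) for b in range(17)]

# the fraction of draws under which the fixed pair (1, 8) collides.
colliding_draws = sum(1 for a, b in draws if h(a, b, 1) == h(a, b, 8))
collision_fraction: float = colliding_draws / len(draws)
```

Counting over all 272 draws, the colliding fraction for this pair should come out at or below $1/ m = 1/7$ , as universality promises.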

universal_hashing.pypython

import random
from typing import Optional

def _next_prime(at_least: int) -> int:
  """
    The smallest prime greater than or equal to `at_least`.\n
  """
  # scan upward from `at_least` until a prime turns up.
  candidate: int = max(at_least, 2)
  while not _is_prime(candidate):
    candidate += 1
  return candidate

def _is_prime(number: int) -> bool:
  """
    Trial-division primality test — enough for the modest primes we pick.\n
  """
  if number < 2:
    return False
  if number < 4:
    return True
  if number % 2 == 0:
    return False
  divisor: int = 3
  while divisor * divisor <= number:
    if number % divisor == 0:
      return False
    divisor += 2
  return True

class UniversalHashFunction:
  """
    One hash function h_{a,b}(k) = ((a*k + b) mod prime) mod table_size,\n
    drawn from the universal family by sampling a and b at random.\n
  """

  def __init__(
    self,
    table_size: int,
    universe_size: int,
    seed: Optional[int] = None,
  ) -> None:
    """
      Pick a prime above `universe_size`, then sample the coefficients\n
      a in 1..prime-1 and b in 0..prime-1. An optional `seed` makes the\n
      draw reproducible.\n
    """
    if table_size <= 0:
      raise ValueError("table_size must be positive")
    if universe_size <= 0:
      raise ValueError("universe_size must be positive")
    self.table_size: int = table_size
    self.prime: int = _next_prime(universe_size + 1)
    generator: random.Random = random.Random(seed)
    self.multiplier: int = generator.randrange(1, self.prime)
    self.offset: int = generator.randrange(0, self.prime)

  def __call__(self, key: int) -> int:
    """
      The slot for `key` under this function.\n
    """
    return ((self.multiplier * key + self.offset) % self.prime) % self.table_size

class UniversalHashFamily:
  """
    The family { h_{a,b} } for a fixed table size and key universe.\n
    `draw` samples a fresh member; this is the structure you randomize over\n
    at runtime to guarantee expected-O(1) behavior against every input.\n
  """

  def __init__(self, table_size: int, universe_size: int) -> None:
    if table_size <= 0:
      raise ValueError("table_size must be positive")
    if universe_size <= 0:
      raise ValueError("universe_size must be positive")
    self.table_size: int = table_size
    self.universe_size: int = universe_size

  def draw(self, seed: Optional[int] = None) -> UniversalHashFunction:
    """
      Sample a random member of the family.\n
    """
    return UniversalHashFunction(self.table_size, self.universe_size, seed)

Modern hashing

Two developments past the textbook show up throughout real systems.

Worst-case constant lookups. Chaining and open addressing give expected $O (1)$ , but a long chain or probe sequence can still slow an individual query. Cuckoo hashing (Pagh and Rodler, 2001) uses two hash functions and guarantees that each key sits in one of its two candidate slots, so lookup is worst-case $O (1)$ ; inserts may relocate resident keys, but the read path is unconditionally fast. Robin Hood hashing and Swiss Tables (absl::flat_hash_map) attack the same problem from the open-addressing side, without a worst-case guarantee: Robin Hood evens out probe lengths across keys, and Swiss Tables filter candidate slots through compact per-slot metadata, so both stay fast at high load.
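A minimal sketch of the cuckoo idea follows, for non-negative integer keys. The class name `CuckooSet`, the eviction limit `max_kicks`, the grow-and-redraw rebuild policy, and the $(a k + b) mod p$ hash functions are assumptions made for illustration, not details from Pagh and Rodler's paper.

```python
import random
from typing import Optional

_PRIME: int = (1 << 61) - 1  # a Mersenne prime; keys are assumed smaller than this

class CuckooSet:
  """Two tables, two hash functions; each key lives in one of its two candidate slots."""

  def __init__(self, capacity: int = 11, max_kicks: int = 32, seed: int = 0) -> None:
    self.capacity: int = capacity
    self.max_kicks: int = max_kicks
    self._rng: random.Random = random.Random(seed)
    self._tables: list[list[Optional[int]]] = [[None] * capacity, [None] * capacity]
    self._draw_hashes()

  def _draw_hashes(self) -> None:
    # one random (a, b) pair per table.
    self._params = [
      (self._rng.randrange(1, _PRIME), self._rng.randrange(_PRIME)) for _ in range(2)
    ]

  def _slot(self, which: int, key: int) -> int:
    a, b = self._params[which]
    return ((a * key + b) % _PRIME) % self.capacity

  def __contains__(self, key: int) -> bool:
    # exactly two probes, however full the tables are.
    return any(self._tables[i][self._slot(i, key)] == key for i in range(2))

  def add(self, key: int) -> None:
    if key in self:
      return
    which: int = 0
    for _ in range(self.max_kicks):
      slot = self._slot(which, key)
      evicted = self._tables[which][slot]
      self._tables[which][slot] = key
      if evicted is None:
        return
      # the evicted resident must move to its slot in the other table.
      key = evicted
      which = 1 - which
    # too many evictions: grow, redraw the hash functions, reinsert everything.
    self._rebuild(key)

  def _rebuild(self, pending: int) -> None:
    keys = [k for table in self._tables for k in table if k is not None]
    keys.append(pending)
    self.capacity = 2 * self.capacity + 1
    self._tables = [[None] * self.capacity, [None] * self.capacity]
    self._draw_hashes()
    for k in keys:
      self.add(k)

members = CuckooSet()
for value in range(0, 600, 3):
  members.add(value)
```

The asymmetry is visible in the code: `__contains__` never loops beyond two slots, while `add` carries all the variable cost in its eviction chain and occasional rebuild.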

Consistent hashing. When the table is a cluster of servers, ordinary $h (k) mod m$ is catastrophic: changing $m$ rehashes nearly every key. Consistent hashing (Karger et al., 1997) maps keys and servers onto a circle and assigns each key to the next server clockwise, so adding or removing a server moves only about a $1/ m$ fraction of the keys in expectation. It is the same collision-management problem, lifted from one array to a fleet of machines.⁶
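The ring can be sketched with a sorted list and binary search. This is a bare-bones illustration: the class name `ConsistentHashRing`, the use of SHA-256 to place labels on the circle, and the server names are assumptions, and production systems usually add many virtual nodes per server to even out the arcs.

```python
import bisect
import hashlib

def ring_position(label: str) -> int:
  """Place a label on the ring using the first 8 bytes of its SHA-256 digest."""
  digest = hashlib.sha256(label.encode("utf-8")).digest()
  return int.from_bytes(digest[:8], "big")

class ConsistentHashRing:
  """Servers sit at points on a circle; a key belongs to the next server clockwise."""

  def __init__(self, servers: list[str]) -> None:
    if not servers:
      raise ValueError("the ring needs at least one server")
    self._points: list[tuple[int, str]] = sorted(
      (ring_position(server), server) for server in servers
    )

  def add_server(self, server: str) -> None:
    bisect.insort(self._points, (ring_position(server), server))

  def server_for(self, key: str) -> str:
    # first server at or after the key's position, wrapping past the end.
    index = bisect.bisect_left(self._points, (ring_position(key), ""))
    return self._points[index % len(self._points)][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
keys = [f"key-{i}" for i in range(1000)]
before = {key: ring.server_for(key) for key in keys}

ring.add_server("server-d")
after = {key: ring.server_for(key) for key in keys}

# only keys on the new server's arc change owner.
moved = [key for key in keys if before[key] != after[key]]
```

Every key in `moved` now belongs to `server-d`; keys on the other arcs keep their owner, which is exactly what $h (k) mod m$ fails to provide.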

Takeaways

A hash table implements the dictionary ADT — insert, search, delete — in expected $O (1)$ time by mapping keys into an array with a hash function, trading away the ordered queries a search tree supports.
Direct addressing is perfect but needs one slot per possible key; hashing shrinks the table to $Θ (n)$ and resolves the resulting collisions.
Chaining keeps a list per slot (search cost $Θ (1 + α)$ ); open addressing stores keys in the array and probes (linear, quadratic, or double hashing), with cost governed by $1/ (1 - α)$ .
Keeping the load factor $α = n / m$ bounded, via resizing, keeps every operation expected $O (1)$ .
A good hash function scatters structured keys; universal hashing randomizes the choice of $h$ so the $O (1)$ expectation holds against every input, not just under an assumption.

Footnotes

¹ CLRS, Ch. 11, Hash Tables (§11.1–11.2): the dictionary ADT and the expected $O (1)$ guarantee from hashing.
² Erickson, Ch. 5, Hash Tables: the simple uniform hashing assumption and expected chain length.
³ CLRS, Ch. 11, Hash Tables (§11.4): open addressing and the expected-probe bounds $1/ (1 - α)$ (unsuccessful) and $\frac{1}{α} ln \frac{1}{1 - α}$ (successful) under uniform hashing.
⁴ Skiena, §3.7, Hashing and Strings: hashing string keys via Horner's-rule polynomial evaluation.
⁵ CLRS, Ch. 11, Hash Tables (§11.3.3): universal families and the $Θ (1 + α)$ bound without distributional assumptions.
⁶ Pagh & Rodler, Cuckoo hashing (2001); Celis, Robin Hood hashing (1986); Karger, Lehman, Leighton, Panigrahy, Levine & Lewin, Consistent hashing and random trees (1997).
