String Matching: Naive & Rabin–Karp

We have spent the previous lessons treating a sequence as a bag of comparable keys. A string is more rigid: its characters come in a fixed order and that order is the information. The fundamental question is string matching: given a text $T [0 .. n - 1]$ and a pattern $P [0 .. m - 1]$ over some alphabet $Σ$ , report every shift $s$ such that $P$ occurs in $T$ starting at position $s$ , i.e. $T [s .. s + m - 1] = P$ . There are $n - m + 1$ candidate shifts, and the art is to test them without paying $m$ comparisons apiece.

Correctness of a matcher has the two standard halves of any search procedure: it must be sound — every reported shift is a genuine occurrence — and complete — every occurrence is reported. The naive scan is trivially both (it returns a shift only after verifying all $m$ characters, and it tries every shift); the interest in what follows is keeping both guarantees while doing far less work.

The naive answer does pay that price, and there are two distinct redundancies to attack. The first is the cost per alignment: naive spends up to $m$ comparisons to test one shift. The second is re-reading: after a partial match fails, naive slides the pattern by one and re-examines text characters it already knows. This lesson removes the first with Rabin–Karp, which replaces a length- $m$ comparison by a length- $1$ hash update. The companion lesson removes the second with KMP and the Z-function, which precompute the pattern's self-overlap so no text character is ever re-derived.

The naive scan, and why it wastes work

Try every alignment; at each, compare characters until one disagrees or the whole pattern matches.

Algorithm: \textsc{Naive-Match}(T, P): test all $n - m + 1$ alignments

for $s \gets 0$ to $n - m$ do
    $j \gets 0$
    while $j < m$ and $T[s + j] = P[j]$ do
        $j \gets j + 1$
    if $j = m$ then
        report occurrence at shift $s$

On random or low-repetition text the inner loop almost always dies on the first character, so naive matching averages $O(n)$ and is the right tool when $m$ is tiny or the text is unstructured.¹ The worst case is $O((n - m + 1)\,m) = O(nm)$, and the pathology is repetitive input: long partial matches that fail late, shift after shift. The fix is to stop discarding what a partial match revealed. The asymptotic gap between the worst and average cases is what Rabin–Karp, and the KMP and Z-function of the companion lesson, close.

naive_match.py (Python)

def naive_match(text: str, pattern: str) -> list[int]:
  """
    Every shift `s` where `pattern` occurs in `text`, in increasing order.
    An empty pattern matches at every position `0 .. len(text)`.
  """
  text_length: int = len(text)
  pattern_length: int = len(pattern)
  shifts: list[int] = []

  # only shifts that leave room for the whole pattern can match.
  for shift in range(text_length - pattern_length + 1):
    matched: int = 0
    while matched < pattern_length and text[shift + matched] == pattern[matched]:
      matched += 1
    if matched == pattern_length:
      shifts.append(shift)
  return shifts

Figure: Why the naive scan is $O(nm)$. On $T = aaaaab$, $P = aab$, every shift but the last re-reads the same a's, matching $m - 1$ of them before failing on the final character (red); the last shift matches. Each row is one alignment.
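To see the quadratic cost concretely, one can count character comparisons instead of timing them. The following is a minimal sketch: `count_comparisons` is a hypothetical helper introduced here for illustration, not part of the lesson's code, and it mirrors the loop structure of the naive scan.

```python
def count_comparisons(text: str, pattern: str) -> int:
    """Number of character comparisons the naive scan performs."""
    comparisons = 0
    m = len(pattern)
    for shift in range(len(text) - m + 1):
        for offset in range(m):
            comparisons += 1
            if text[shift + offset] != pattern[offset]:
                break  # first disagreement ends this alignment
    return comparisons


# The caption's example: 4 alignments, 3 comparisons each.
print(count_comparisons("aaaaab", "aab"))                    # 12
# Same shape, scaled up: every alignment pays all m = 10 comparisons.
print(count_comparisons("a" * 1000 + "b", "a" * 9 + "b"))    # 9920
# Text that never matches the pattern's first character: one comparison per shift.
print(count_comparisons("xyz" * 333, "aab"))                 # 997
```

The first two counts equal $(n - m + 1) \cdot m$ exactly, while the third is just $n - m + 1$, which is the gap between the worst and typical cases described above.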

Rabin–Karp: matching by rolling hash

Rabin–Karp turns "are these $m$ characters equal?" into "are these two numbers equal?" by hashing each window. Interpret each length-$m$ block of text as an $m$-digit number in base $b = |Σ|$ (mapping characters to digits), reduced modulo a prime $q$ to keep it machine-word-sized. Precompute the pattern's hash $p = h(P)$ and the first window's hash $t_{0} = h(T[0 .. m - 1])$. Slide the window one step at a time; if $t_{s} = p$, the block might match, so verify it character by character to rule out a hash collision (a spurious hit).

This verification is what makes the matcher sound. The hash test alone is complete — equal blocks always hash equal, so a true occurrence never escapes the $t_{s} = p$ filter — but it is not sound on its own: a collision ( $t_{s} = p$ on differing blocks) would report a phantom match. The character-by-character recheck discharges that, so soundness lives in the verification and completeness lives in the hash filter; reporting a hash match without rechecking would be an unsound algorithm.

The key step is the rolling hash: when the window slides from position $s$ to $s + 1$ , we do not recompute the hash from scratch. We drop the contribution of the departing high-order digit $T [s]$ , shift the remaining digits up by one place (multiply by $b$ ), and add the incoming low-order digit $T [s + m]$ :

$$h' = \bigl(b\,(h - T[s]\, b^{m-1}) + T[s+m]\bigr) \bmod q.$$

The factor $b^{m - 1} mod q$ is precomputed once. Each slide is $O (1)$ arithmetic, so building all $n - m + 1$ window hashes costs $O (n)$ in total.

Figure: The rolling hash slides the length-$m$ window by one: drop the departing high digit $T[s]\, b^{m-1}$, multiply the rest by $b$, add the incoming low digit $T[s+m]$. One $O(1)$ update gives $h'$.
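The claim that one roll step reproduces the hash of the next window can be checked mechanically. The sketch below is illustrative only: `block_hash` and `roll` are hypothetical helper names, characters are mapped to digits with `ord`, and the base and modulus are arbitrary choices for the demonstration.

```python
def block_hash(block: str, base: int, modulus: int) -> int:
    """Horner evaluation of the block as a base-`base` number, mod `modulus`."""
    value = 0
    for character in block:
        value = (base * value + ord(character)) % modulus
    return value


def roll(window_hash: int, departing: str, incoming: str,
         high_place: int, base: int, modulus: int) -> int:
    """One O(1) slide: drop the high digit, shift up by one place, add the low digit."""
    return (base * (window_hash - ord(departing) * high_place) + ord(incoming)) % modulus


text, m, base, modulus = "abracadabra", 4, 256, 1_000_000_007
high_place = pow(base, m - 1, modulus)  # b^(m-1) mod q, computed once

rolled = block_hash(text[:m], base, modulus)
for s in range(len(text) - m):
    rolled = roll(rolled, text[s], text[s + m], high_place, base, modulus)
    # The rolled value must agree with hashing window s + 1 from scratch.
    assert rolled == block_hash(text[s + 1:s + 1 + m], base, modulus)
print("rolling update agrees with recomputation at every shift")
```

Recomputing each window from scratch costs $O(m)$ per shift; the rolled value is the same number obtained with a constant amount of arithmetic.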

Algorithm: \textsc{Rabin-Karp}(T, P, b, q): hash, slide, verify

$p \gets 0;\ t \gets 0;\ \rho \gets b^{\,m-1} \bmod q$
for $j \gets 0$ to $m - 1$ do    // hash $P$ and window 0
    $p \gets (b \cdot p + P[j]) \bmod q$
    $t \gets (b \cdot t + T[j]) \bmod q$
for $s \gets 0$ to $n - m$ do
    if $t = p$ then    // verify, kill collisions
        if $T[s \mathinner{\ldotp\ldotp} s+m-1] = P$ then report occurrence at shift $s$
    if $s < n - m$ then    // roll window forward
        $t \gets (b\,(t - T[s]\cdot\rho) + T[s+m]) \bmod q$

The arithmetic, worked

Decimal strings make the digits literal. Take $b = 10$ , prime $q = 13$ , pattern $P = 31415$ (so $m = 5$ ), and text $T = 2359023141526739921$ (so $n = 19$ ).² Two precomputations:

$$p = 31415 \bmod 13 = 7 \quad (\text{since } 31415 = 2416 \cdot 13 + 7),$$

$$\rho = b^{m-1} \bmod q = 10^{4} \bmod 13 = 3 \quad (\text{since } 10000 = 769 \cdot 13 + 3).$$

The first window is $T [0 .. 4] = 23590$ , and $23590 = 1814 \cdot 13 + 8$ , so $t_{0} = 8$ . Now roll, one digit out and one digit in each time:

$s = 0 \to 1$ (window $23590 \to 35902$ ): drop the leading $2$ , whose place value mod $q$ is $ρ = 3$ ; append the new low digit $T [5] = 2$ . $t_{1} = (10 (8 - 2 \cdot 3) + 2) mod 13 = 22 mod 13 = 9.$ Check directly: $35902 = 2761 \cdot 13 + 9$ , as claimed.
$s = 1 \to 2$ (window $35902 \to 59023$ ): drop the $3$ , append $T [6] = 3$ . $t_{2} = (10 (9 - 3 \cdot 3) + 3) mod 13 = 3.$
$s = 2 \to 3$ (window $59023 \to 90231$ ): drop the $5$ , append $T [7] = 1$ . $t_{3} = (10 (3 - 5 \cdot 3) + 1) mod 13 = - 119 mod 13 = 11,$ because $- 119 + 10 \cdot 13 = 11$ . The intermediate value went negative — the subtraction removed more than the running hash held — and the final mod folds it back into $[0, q)$ . Implementations add a multiple of $q$ before reducing (or use a language whose mod is already non-negative); forgetting this is the classic Rabin–Karp bug.
$s = 3 \to 4$ (window $90231 \to 02314$ ): drop the $9$ , append $T [8] = 4$ . $t_{4} = (10 (11 - 9 \cdot 3) + 4) mod 13 = - 156 mod 13 = 0,$ since $156 = 12 \cdot 13$ exactly.

None of $8, 9, 3, 11, 0$ equals $p = 7$, so these five shifts are dismissed with no character comparisons at all. Continuing the scan, the window at $s = 6$ is $T[6 .. 10] = 31415$ with hash $7$: the filter fires, verification compares all five characters, and a genuine occurrence is reported. But the window at $s = 12$ is $T[12 .. 16] = 67399$, and $67399 = 5184 \cdot 13 + 7$, so its hash is $7$ again. The filter fires on a block that is not the pattern, verification finds $6 \neq 3$ at the first character and rejects, and the scan moves on. That is a spurious hit: two different 5-digit numbers that happen to agree mod $13$. A matcher that skipped verification would have reported a phantom occurrence at shift $12$.
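The worked example can be replayed in a few lines. This is a sketch under the example's own assumptions ($b = 10$, $q = 13$, digit characters mapped to their numeric values); `window_hashes` is a hypothetical helper name, not part of the lesson's code.

```python
def window_hashes(text: str, m: int, base: int, modulus: int) -> list[int]:
    """Hash of every length-m window of a digit string, via the rolling update."""
    high_place = pow(base, m - 1, modulus)
    current = 0
    for digit in text[:m]:
        current = (base * current + int(digit)) % modulus
    hashes = [current]
    for s in range(len(text) - m):
        current = (base * (current - int(text[s]) * high_place) + int(text[s + m])) % modulus
        hashes.append(current)
    return hashes


text, pattern = "2359023141526739921", "31415"
hashes = window_hashes(text, len(pattern), base=10, modulus=13)
pattern_hash = int(pattern) % 13

# Hash filter alone: complete, but not sound.
candidates = [s for s, h in enumerate(hashes) if h == pattern_hash]
# Verification restores soundness by discarding spurious hits.
verified = [s for s in candidates if text[s:s + len(pattern)] == pattern]

print(hashes[:5])   # [8, 9, 3, 11, 0]
print(candidates)   # [6, 12]
print(verified)     # [6]
```

The difference between `candidates` and `verified` is exactly the spurious hit at shift $12$.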

Figure: One roll step with the numbers of the worked example ($b = 10$, $q = 13$, $\rho = 10^{4} \bmod 13 = 3$). The window slides from $23590$ ($t_{0} = 8$) to $35902$ ($t_{1} = 9$): subtract the departing digit times $\rho$, multiply by $10$, add the arriving digit, reduce mod $13$.
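The negative intermediate value in the $s = 2 \to 3$ step is where implementations typically go wrong. Python's `%` already returns a value in $[0, q)$ for a positive modulus, so the sketch below emulates the truncating remainder that C, Java and Go use for `%`. The helper `truncated_mod` is hypothetical and relies on `math.fmod`, which is exact only for small integers such as these.

```python
import math


def truncated_mod(value: int, modulus: int) -> int:
    """Remainder carrying the sign of `value` (truncating division semantics).

    Illustration only: math.fmod works in floating point, so this is
    reliable for small integers, not for arbitrary-precision ones.
    """
    return int(math.fmod(value, modulus))


raw = 10 * (3 - 5 * 3) + 1   # the s = 2 -> 3 roll before reduction: -119

print(truncated_mod(raw, 13))                # -2, outside [0, q)
print((truncated_mod(raw, 13) + 13) % 13)    # 11, after adding q and reducing again
print(raw % 13)                              # 11, Python's % is already non-negative
```

A hash of $-2$ would never compare equal to a pattern hash stored in $[0, q)$, so a matcher with this bug silently misses genuine occurrences: it loses completeness, not just speed.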

Heuristically, if the window hashes behave like uniform residues mod $q$, a non-matching window collides with the pattern with probability about $1/q$. This $1/q$ estimate can be made precise. A spurious hit at shift $s$ means $q$ divides the nonzero difference $D_{s} = val(T[s .. s + m - 1]) - val(P)$, where $val$ reads a block as a base-$b$ number. Since $|D_{s}| < b^{m}$, the number $D_{s}$ has fewer than $m \log_{2} b$ distinct prime factors (each prime factor is at least $2$, so $k$ distinct factors force $|D_{s}| \geq 2^{k}$). If $q$ is drawn uniformly from the primes below some bound $M$, of which there are $\pi(M) \approx M / \ln M$, the chance that $q$ happens to divide $D_{s}$ is at most

$$\frac{m \log_{2} b}{\pi(M)} \approx \frac{m \ln M \,\log_{2} b}{M},$$

and summing over all $n - m + 1$ shifts bounds the expected number of spurious hits. Choosing $M \approx m n^{2}$ makes the per-shift bound $O(\log n / n^{2})$ for a fixed alphabet (since $m \leq n$ gives $\ln M \leq 3 \ln n$), so the expected total is $O(\log n / n)$: with probability tending to $1$, no shift collides spuriously, and the whole run is $O(n + m)$ apart from the $O(m)$ verification of each genuine occurrence. The randomness lives in the choice of $q$, not in the input; an adversary who sees $q$ before choosing $T$ can still manufacture collisions at every shift, which is why a fresh random prime per run (or per process) is the correct way to deploy the algorithm. With a fixed $q$, the $1/q$ heuristic treats the hash values as uniform, which is accurate for typical data but void as a worst-case guarantee.
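Drawing a fresh random prime below $M \approx m n^{2}$ can be sketched with rejection sampling. The names `is_prime` and `random_prime_below` are hypothetical; the primality test is Miller–Rabin with the first twelve primes as fixed witnesses, which is known to be deterministic for inputs below roughly $3.3 \times 10^{24}$, comfortably above the bounds used here. This is one reasonable way to pick $q$ under those assumptions, not the only one.

```python
import random

_WITNESSES = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)


def is_prime(n: int) -> bool:
    """Miller-Rabin with fixed witnesses (assumed range: n below about 3.3e24)."""
    if n < 2:
        return False
    for p in _WITNESSES:
        if n % p == 0:
            return n == p
    # Write n - 1 = d * 2^r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for a in _WITNESSES:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = x * x % n
            if x == n - 1:
                break
        else:
            return False  # a witnesses that n is composite
    return True


def random_prime_below(bound: int, rng: random.Random) -> int:
    """Rejection-sample a uniformly random prime in [2, bound); needs bound > 2."""
    while True:
        candidate = rng.randrange(2, bound)
        if is_prime(candidate):
            return candidate


n, m = 10_000, 20
bound = m * n * n   # M ~ m n^2, as in the analysis above
prime = random_prime_below(bound, random.Random())
```

Because every candidate is drawn uniformly and non-primes are discarded, the accepted prime is uniform over the primes below the bound, which is what the analysis assumes. Where the input may be adversarial, passing `random.SystemRandom()` as `rng` would be the more cautious choice, since the default generator is not designed to be unpredictable.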

Rabin–Karp is most useful when matching many patterns of the same length at once (hash them all, look each window up in a set) or when the comparison is naturally numeric. The rolling-hash idea reappears in deduplication, plagiarism detection, and content-defined chunking.³

rabin_karp.py (Python)

def rabin_karp(
  text: str,
  pattern: str,
  base: int = 256,
  prime: int = 1_000_000_007,
) -> list[int]:
  """
    Every shift `s` where `pattern` occurs in `text`, in increasing order.
    `base` is the alphabet radix and `prime` the modulus; the verification
    step keeps the result correct regardless of hash collisions.
  """
  text_length: int = len(text)
  pattern_length: int = len(pattern)

  # an empty pattern matches at every position; a too-long one never does.
  if pattern_length == 0:
    return list(range(text_length + 1))
  if pattern_length > text_length:
    return []

  # high-order place value of the window, b^(m-1) mod prime.
  high_place: int = pow(base, pattern_length - 1, prime)
  pattern_hash: int = 0
  window_hash: int = 0
  for index in range(pattern_length):
    pattern_hash = (base * pattern_hash + ord(pattern[index])) % prime
    window_hash = (base * window_hash + ord(text[index])) % prime

  shifts: list[int] = []
  for shift in range(text_length - pattern_length + 1):

    # equal hashes are only a candidate; verify to kill spurious hits.
    if window_hash == pattern_hash and text[shift:shift + pattern_length] == pattern:
      shifts.append(shift)

    # roll the window one step forward, unless we just hit the last shift.
    if shift < text_length - pattern_length:
      departing: int = ord(text[shift]) * high_place
      incoming: int = ord(text[shift + pattern_length])
      window_hash = (base * (window_hash - departing) + incoming) % prime
  return shifts

When Rabin–Karp is the right tool

Rabin–Karp's guarantee is expected-time, not worst-case, and its soundness depends on the verify step. In exchange, it handles two settings the worst-case matchers of the companion lesson cannot handle cheaply.

Many patterns at once. Hash all $k$ same-length patterns into a set; each window costs one $O(1)$ hash update plus one set lookup, so searching for all $k$ patterns together is $O(n + km)$ expected, where a per-pattern rerun of a single-pattern matcher would pay $O(kn)$. This is the natural tool for "does the text contain any of these forbidden words of length $m$?"
Numeric or higher-dimensional comparison. When the characters are already numbers, or the objects are 2-D blocks, the fingerprint idea extends directly: a 2-D rolling hash matches an $m \times m$ pattern in an $n \times n$ grid, and the same fingerprint answers "find any repeated length-$L$ block", the core of duplicate detection.
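The many-patterns case can be sketched as a small variation on the single-pattern matcher. The function name `rabin_karp_multi` and its return shape are assumptions made for this illustration; it requires non-empty patterns of one common length and keeps the verification step, here expressed as an exact string lookup inside the bucket of patterns that share the window's hash.

```python
def rabin_karp_multi(
    text: str,
    patterns: list[str],
    base: int = 256,
    prime: int = 1_000_000_007,
) -> list[tuple[int, str]]:
    """All (shift, pattern) occurrences, for non-empty patterns of equal length."""
    if not patterns:
        return []
    m = len(patterns[0])
    if m == 0 or any(len(p) != m for p in patterns):
        raise ValueError("patterns must be non-empty and share one length")

    def block_hash(block: str) -> int:
        value = 0
        for character in block:
            value = (base * value + ord(character)) % prime
        return value

    # Group patterns by hash so that patterns colliding with each other coexist.
    by_hash: dict[int, set[str]] = {}
    for p in patterns:
        by_hash.setdefault(block_hash(p), set()).add(p)

    high_place = pow(base, m - 1, prime)
    window_hash = block_hash(text[:m])
    hits: list[tuple[int, str]] = []
    for shift in range(len(text) - m + 1):
        bucket = by_hash.get(window_hash)
        if bucket is not None:
            window = text[shift:shift + m]
            if window in bucket:  # verification: exact comparison, not just the hash
                hits.append((shift, window))
        if shift < len(text) - m:
            window_hash = (base * (window_hash - ord(text[shift]) * high_place)
                           + ord(text[shift + m])) % prime
    return hits


print(rabin_karp_multi("the cat sat on the mat", ["cat", "mat", "dog"]))
# [(4, 'cat'), (19, 'mat')]
```

Each window still costs one roll plus one dictionary lookup regardless of $k$; only windows whose hash lands in a bucket pay for a string comparison.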

Reject Rabin–Karp when you need a hard worst-case bound with no randomness, or when a single fixed $q$ could be attacked: an adversary who sees $q$ before choosing $T$ can force a collision at every shift, degrading the run to $O (nm)$ . A fresh random prime per run avoids this.

Fingerprints, deduplication, and anti-hash attacks

The rolling hash is one instance of a fingerprint: a short, cheaply updated summary of a long object such that equal objects always share a fingerprint and unequal ones rarely do. Karp and Rabin's original paper (Karp & Rabin, Efficient Randomized Pattern-Matching Algorithms, IBM J. Res. Dev., 1987) framed it exactly this way, and the same idea now runs far beyond substring search. The rsync algorithm (Tridgell & Mackerras, 1996) slides a rolling checksum across a file to find blocks the receiver already holds. Content-defined chunking, used by modern deduplicating backup and version-control systems, slides a rolling hash across a file and cuts a new chunk boundary whenever the hash hits a distinguished value, so that inserting a byte near the front changes only one chunk instead of realigning the entire file. Rabin fingerprinting over polynomials (Rabin, Fingerprinting by Random Polynomials, 1981) is the same construction with xor-based arithmetic (polynomials over GF(2)), chosen for provably low collision probability, and underlies network deduplication and similarity detection.

The polynomial hash also connects to a subtle practical failure. A single fixed modulus $q$ is vulnerable to anti-hash tests: adversarial inputs built to collide. This is why competitive-programming folklore uses a random base and a 64-bit modulus, or two independent hashes combined, to make a collision astronomically unlikely without having to draw a fresh random prime for each run. The general lesson, that a hash's worst case is only as good as the attacker's ignorance of its parameters, is the same one that pushed hash tables toward universal hashing, and it is why security-sensitive code uses cryptographic hashes rather than the fast polynomial ones.

Takeaways

String matching seeks all shifts where pattern $P$ ($m$ chars) occurs in text $T$ ($n$ chars). The naive scan tries all $n - m + 1$ alignments in $O(nm)$ worst case, but is fine for tiny patterns or unstructured text, and hardware-tuned variants of it appear in production strstr/memmem implementations, particularly for short patterns.
Rabin–Karp compares a rolling hash of each length- $m$ window, updated in $O (1)$ via $h^{'} = (b (h - T [s] b^{m - 1}) + T [s + m]) mod q$ , then verifies on hash matches to kill collisions: expected $O (n + m)$ , worst case $O (nm)$ .
A hash match that is not the pattern is a spurious hit; the character recheck is what keeps the matcher sound, and choosing a random prime $q$ makes spurious hits rare in expectation over the randomness of $q$ , not the input.
Rabin–Karp is the matcher of choice for many patterns at once and for numeric or 2-D fingerprinting; its guarantee is expected-time only.

This continues in String Matching: KMP & the Z-Function, which removes the re-reading entirely and delivers a worst-case $O (n + m)$ bound with no randomness.

CLRS, Ch. 32 — String Matching (§32.1): the naive matcher and its $O ((n - m + 1) m)$ bound. ↩
CLRS, Ch. 32 — String Matching (§32.2): the Rabin–Karp algorithm; the $31415 mod 13$ text is CLRS's own worked example, spurious hit included. ↩
Skiena, § — String Algorithms: the Rabin–Karp rolling hash and its use for substring search and fingerprinting. ↩