Sequences & Strings/String Matching: Rabin–Karp, KMP & Z

Lesson 5.41,791 words

String Matching: Rabin–Karp, KMP & Z

Given a text TT of length nn and a pattern PP of length mm, find every occurrence of PP in TT. The naive scan costs O(nm)O(nm); we beat it three ways.

╌╌╌╌

We have spent the previous lessons treating a sequence as a bag of comparable keys. A string is more rigid: its characters come in a fixed order and that order is the information. The fundamental question is string matching: given a text and a pattern over some alphabet , report every shift such that occurs in starting at position , i.e. . There are candidate shifts, and the art is to test them without paying comparisons apiece.

The naive answer does pay that price. The three algorithms in this lesson each remove the redundancy in a different way: Rabin–Karp replaces a length- comparison by a length- hash update, KMP remembers what a partial match already told us so it never backs up over the text, and the Z-function computes a global longest prefix-match array that subsumes both.

The naive scan, and why it wastes work

Try every alignment; at each, compare characters until one disagrees or the whole pattern matches.

Algorithm:Naive-Match(T,P)\textsc{Naive-Match}(T, P) — test all nm+1n-m+1 alignments
  1. 1
    for s0s \gets 0 to nmn - m do
  2. 2
    j0j \gets 0
  3. 3
    while j<mj < m and T[s+j]=P[j]T[s + j] = P[j] do
  4. 4
    jj+1j \gets j + 1
  5. 5
    if j=mj = m then
  6. 6
    report occurrence at shift ss

The outer loop runs times and the inner loop up to , so the worst case is , realized by and , where every alignment matches characters before failing. On random or low-repetition text the inner loop almost always dies on the first character, so naive matching averages and is the right tool when is tiny or the text is unstructured.1 The pathology is repetitive patterns, and the cure is to stop discarding what a partial match revealed. The asymptotic gap between the worst and average cases is exactly what the next three algorithms close.

Why the naive scan is : on , every shift re-reads the same as, matching of them before failing on the final character (red). Each row is one alignment

Rabin–Karp: matching by rolling hash

Rabin–Karp turns are these characters equal? into are these two numbers equal? by hashing each window. Interpret each length- block of text as an -digit number in base (mapping characters to digits), reduced modulo a prime to keep it machine-word-sized. Precompute the pattern's hash and the first window's hash . Slide the window one step at a time; if , the block might match, so verify it character-by-character to rule out a hash collision (a spurious hit).

The trick is the rolling hash: when the window slides from position to , we do not recompute the hash from scratch. We drop the contribution of the departing high-order digit , shift the remaining digits up by one place (multiply by ), and add the incoming low-order digit :

The factor is precomputed once. Each slide is arithmetic, so building all window hashes costs in total.

The rolling hash slides the length- window by one: drop the departing high digit , multiply the rest by , add the incoming low digit — one update giving
Algorithm:Rabin-Karp(T,P,b,q)\textsc{Rabin-Karp}(T, P, b, q) — hash, slide, verify
  1. 1
    p0; t0; ρbm1modqp \gets 0;\ t \gets 0;\ \rho \gets b^{\,m-1} \bmod q
  2. 2
    for j0j \gets 0 to m1m - 1 do
    hash PP and window 0
  3. 3
    p(bp+P[j])modqp \gets (b \cdot p + P[j]) \bmod q
  4. 4
    t(bt+T[j])modqt \gets (b \cdot t + T[j]) \bmod q
  5. 5
    for s0s \gets 0 to nmn - m do
  6. 6
    if t=pt = p then
    verify, kill collisions
  7. 7
    if T[s..s+m1]=PT[s \mathinner{\ldotp\ldotp} s+m-1] = P then report occurrence at shift ss
  8. 8
    if s<nms < n - m then
    roll window forward
  9. 9
    t(b(tT[s]ρ)+T[s+m])modqt \gets (b\,(t - T[s]\cdot\rho) + T[s+m]) \bmod q

Rabin–Karp shines when you need to match many patterns of the same length at once (hash them all, look each window up in a set) or when the comparison is naturally numeric. The rolling-hash idea reappears in deduplication, plagiarism detection, and content-defined chunking.2

Knuth–Morris–Pratt: never re-read the text

KMP attacks the redundancy directly. Suppose we are matching against and have just matched characters, , when mismatches . The naive scan would slide by one and recompare from the start. But the matched text equals , which we know completely — so we can compute, in advance and from the pattern alone, how far to slide so the next comparison resumes correctly without ever moving the text pointer backward.

The information we need is the prefix function (or failure function):

For we have : after matching the prefix ABABA (length , index ), its longest border is ABA of length . So on a mismatch having matched characters, instead of restarting we keep the already-matched border ABA and resume comparing at . The pattern effectively slides right by positions, and the text pointer does not move.

A KMP shift. After matching ABABA, P fails on C vs the text (highlighted). The longest border ABA realigns, sliding P right by with no rescanning of the text.

Computing uses the very same self-matching idea, run on against itself. Maintain the length of the current longest border; to extend to index , follow failure links downward until or falls to .

Algorithm:Prefix-Function(P)\textsc{Prefix-Function}(P) — compute the failure array π\pi in O(m)O(m)
  1. 1
    π[0]0; k0\pi[0] \gets 0;\ k \gets 0
  2. 2
    for i1i \gets 1 to m1m - 1 do
  3. 3
    while k>0k > 0 and P[k]P[i]P[k] \ne P[i] do
  4. 4
    kπ[k1]k \gets \pi[k - 1]
    fall back
  5. 5
    if P[k]=P[i]P[k] = P[i] then
  6. 6
    kk+1k \gets k + 1
    extend border
  7. 7
    π[i]k\pi[i] \gets k
  8. 8
    return π\pi
Algorithm:KMP-Match(T,P)\textsc{KMP-Match}(T, P) — scan once, never moving the text pointer back
  1. 1
    πPrefix-Function(P)\pi \gets \textsc{Prefix-Function}(P)
  2. 2
    q0q \gets 0
    chars matched so far
  3. 3
    for i0i \gets 0 to n1n - 1 do
  4. 4
    while q>0q > 0 and P[q]T[i]P[q] \ne T[i] do
  5. 5
    qπ[q1]q \gets \pi[q - 1]
    mismatch: fall back
  6. 6
    if P[q]=T[i]P[q] = T[i] then
  7. 7
    qq+1q \gets q + 1
  8. 8
    if q=mq = m then
  9. 9
    report occurrence at shift im+1i - m + 1
  10. 10
    qπ[q1]q \gets \pi[q - 1]
    allow overlapping matches

The key is that is a potential that pays for the fallback: each unit of slide was funded by an earlier successful character match.3 No text character is ever examined more than a constant number of times, the central property that the naive scan lacks.

The prefix table, concretely

Here is the full table for . Read it left to right: is the length of the longest border of the prefix ending at column .

The prefix function for . Top row: the pattern. Bottom row: , the length of the longest proper prefix of that is also a suffix. E.g. because ABABA has border ABA.

The Z-function: longest prefix-match at every position

The Z-function repackages the same self-similarity into a single array, and is often easier to implement bug-free.

The linear-time computation keeps a half-open window , the Z-box: the rightmost interval known to equal a prefix of (so ). For a new index :

  • if , position lies inside a known prefix-match, so its mirror gives a free lower bound ;
  • then extend the match character-by-character past if possible, and if the match runs beyond , slide the Z-box to the new .
Z-box copy on at . The box already matches a prefix, so the mirror gives a free bound ; here caps the copy
Algorithm:Z-Function(S)\textsc{Z-Function}(S) — longest prefix-match at each position in O()O(\ell)
  1. 1
    Z[0]; l0; r0Z[0] \gets \ell;\ l \gets 0;\ r \gets 0
  2. 2
    for i1i \gets 1 to 1\ell - 1 do
  3. 3
    if i<ri < r then
  4. 4
    Z[i]min(ri, Z[il])Z[i] \gets \min(r - i,\ Z[i - l])
    copy from mirror
  5. 5
    while i+Z[i]<i + Z[i] < \ell and S[Z[i]]=S[i+Z[i]]S[Z[i]] = S[i + Z[i]] do
  6. 6
    Z[i]Z[i]+1Z[i] \gets Z[i] + 1
    extend past the box
  7. 7
    if i+Z[i]>ri + Z[i] > r then
  8. 8
    li; ri+Z[i]l \gets i;\ r \gets i + Z[i]
    advance Z-box
  9. 9
    return ZZ

Each character of is examined a constant number of times because the inner while only ever advances , which moves monotonically from to . This is the same amortised argument as KMP, giving .

To match in , run the Z-function on the concatenation , where # is a sentinel in neither nor . Wherever inside the portion, the prefix matches in full at that spot, so occurs in at shift . The sentinel prevents a match from straddling the boundary and caps . Total length is , so matching is .

Takeaways

  • String matching seeks all shifts where pattern ( chars) occurs in text ( chars). The naive scan tries all alignments in worst case, but is fine for tiny patterns or unstructured text.
  • Rabin–Karp compares a rolling hash of each length- window, updated in via , then verifies on hash matches to kill collisions: expected , worst case . Best for many-pattern search.
  • KMP precomputes the prefix/failure function , so on a mismatch the pattern slides by and the text pointer never backs up: worst-case , justified by an amortised potential argument on .
  • The Z-function computes the longest prefix-match at every position in via the Z-box; running it on and finding matches in .
  • and are interconvertible in linear time — two encodings of a string's self-similarity. KMP/Z give worst-case guarantees; Rabin–Karp trades that for simplicity and multi-pattern flexibility.

Footnotes

  1. CLRS, Ch. 32 — String Matching (§32.1): the naive matcher and its bound.
  2. Skiena, § — String Algorithms: the Rabin–Karp rolling hash and its use for substring search and fingerprinting.
  3. CLRS, Ch. 32 — String Matching (§32.4): the prefix function and the amortised KMP analysis.
  4. Erickson, Ch. — String Matching: the Z-function / failure-function duality and the Z-box linear-time computation.
Practice

╌╌ END ╌╌