String Matching: Rabin–Karp, KMP & Z
Given a text of length and a pattern of length , find every occurrence of in . The naive scan costs ; we beat it three ways.
╌╌╌╌
We have spent the previous lessons treating a sequence as a bag of comparable keys. A string is more rigid: its characters come in a fixed order and that order is the information. The fundamental question is string matching: given a text and a pattern over some alphabet , report every shift such that occurs in starting at position , i.e. . There are candidate shifts, and the art is to test them without paying comparisons apiece.
The naive answer does pay that price. The three algorithms in this lesson each
remove the redundancy in a different way: Rabin–Karp replaces a length-
comparison by a length- hash update, KMP remembers what a partial match already
told us so it never backs up over the text, and the Z-function computes a global
longest prefix-match
array that subsumes both.
The naive scan, and why it wastes work
Try every alignment; at each, compare characters until one disagrees or the whole pattern matches.
- 1for to do
- 2
- 3while and do
- 4
- 5if then
- 6report occurrence at shift
The outer loop runs times and the inner loop up to , so the worst case is , realized by and , where every alignment matches characters before failing. On random or low-repetition text the inner loop almost always dies on the first character, so naive matching averages and is the right tool when is tiny or the text is unstructured.1 The pathology is repetitive patterns, and the cure is to stop discarding what a partial match revealed. The asymptotic gap between the worst and average cases is exactly what the next three algorithms close.
as, matching of them before failing on the final character (red). Each row is one alignmentRabin–Karp: matching by rolling hash
Rabin–Karp turns are these characters equal?
into are these two numbers equal?
by hashing each window.
Interpret each length- block of text as an -digit number in base
(mapping characters to digits), reduced modulo a prime to keep
it machine-word-sized. Precompute the pattern's hash and the first
window's hash . Slide the window one step at a time; if
, the block might match, so verify it character-by-character to rule out
a hash collision (a spurious hit).
The trick is the rolling hash: when the window slides from position to , we do not recompute the hash from scratch. We drop the contribution of the departing high-order digit , shift the remaining digits up by one place (multiply by ), and add the incoming low-order digit :
The factor is precomputed once. Each slide is arithmetic, so building all window hashes costs in total.
- 1
- 2for to dohash and window 0
- 3
- 4
- 5for to do
- 6if thenverify, kill collisions
- 7if then report occurrence at shift
- 8if thenroll window forward
- 9
Rabin–Karp shines when you need to match many patterns of the same length at
once (hash them all, look each window up in a set) or when the comparison
is
naturally numeric. The rolling-hash idea reappears in deduplication,
plagiarism detection, and content-defined chunking.2
Knuth–Morris–Pratt: never re-read the text
KMP attacks the redundancy directly. Suppose we are matching against and have just matched characters, , when mismatches . The naive scan would slide by one and recompare from the start. But the matched text equals , which we know completely — so we can compute, in advance and from the pattern alone, how far to slide so the next comparison resumes correctly without ever moving the text pointer backward.
The information we need is the prefix function (or failure function):
For we have : after matching the prefix
ABABA (length , index ), its longest border is ABA of length . So on a
mismatch having matched characters, instead of restarting we keep the
already-matched border ABA and resume comparing at . The
pattern effectively slides right by positions, and the
text pointer does not move.
C vs the text (highlighted). The longest border ABA realigns, sliding P right by with no rescanning of the text.Computing uses the very same self-matching idea, run on against itself. Maintain the length of the current longest border; to extend to index , follow failure links downward until or falls to .
- 1
- 2for to do
- 3while and do
- 4fall back
- 5if then
- 6extend border
- 7
- 8return
- 1
- 2chars matched so far
- 3for to do
- 4while and do
- 5mismatch: fall back
- 6if then
- 7
- 8if then
- 9report occurrence at shift
- 10allow overlapping matches
The key is that is a potential
that pays for the fallback: each unit of slide
was funded by an earlier successful character match.3 No text character
is ever examined more than a constant number of times, the central property that
the naive scan lacks.
The prefix table, concretely
Here is the full table for . Read it left to right: is the length of the longest border of the prefix ending at column .
ABABA has border ABA.The Z-function: longest prefix-match at every position
The Z-function repackages the same self-similarity into a single array, and is often easier to implement bug-free.
The linear-time computation keeps a half-open window , the Z-box: the rightmost interval known to equal a prefix of (so ). For a new index :
- if , position lies inside a known prefix-match, so its mirror gives a free lower bound ;
- then extend the match character-by-character past if possible, and if the match runs beyond , slide the Z-box to the new .
- 1
- 2for to do
- 3if then
- 4copy from mirror
- 5while and do
- 6extend past the box
- 7if then
- 8advance Z-box
- 9return
Each character of is examined a constant number of times because the inner
while only ever advances , which moves monotonically from to . This
is the same amortised argument as KMP, giving .
To match in , run the Z-function on the concatenation ,
where # is a sentinel in neither nor . Wherever inside the
portion, the prefix matches in full at that spot, so occurs in at
shift . The sentinel prevents a match from straddling the boundary
and caps . Total length is , so matching is .
Takeaways
- String matching seeks all shifts where pattern ( chars) occurs in text ( chars). The naive scan tries all alignments in worst case, but is fine for tiny patterns or unstructured text.
- Rabin–Karp compares a rolling hash of each length- window, updated in via , then verifies on hash matches to kill collisions: expected , worst case . Best for many-pattern search.
- KMP precomputes the prefix/failure function , so on a mismatch the pattern slides by and the text pointer never backs up: worst-case , justified by an amortised potential argument on .
- The Z-function computes the longest prefix-match at every position in via the Z-box; running it on and finding matches in .
- and are interconvertible in linear time — two encodings of a string's self-similarity. KMP/Z give worst-case guarantees; Rabin–Karp trades that for simplicity and multi-pattern flexibility.
Footnotes
- CLRS, Ch. 32 — String Matching (§32.1): the naive matcher and its bound. ↩
- Skiena, § — String Algorithms: the Rabin–Karp rolling hash and its use for substring search and fingerprinting. ↩
- CLRS, Ch. 32 — String Matching (§32.4): the prefix function and the amortised KMP analysis. ↩
- Erickson, Ch. — String Matching: the Z-function / failure-function duality and the Z-box linear-time computation. ↩
╌╌ END ╌╌