Sequences & Strings/Tries & Prefix Trees

Lesson 5.51,481 words

Tries & Prefix Trees

A trie stores a set of strings in a tree keyed by characters, so that insert, search, and prefix-test all run in O(L)O(L) time — the length of the key, independent of how many keys are stored. Shared prefixes are stored once, which makes tries the natural structure for autocomplete, wildcard dictionaries, board word-search, and — over the alphabet {0,1}\{0,1\} — the maximum-XOR-pair problem.

╌╌╌╌

A balanced search tree gives us lookups by comparing whole keys to each other. But when the keys are strings, a single comparison is not : deciding whether "international" precedes "internet" already costs five character comparisons, so a BST of strings of length really spends per operation. Worse, the BST throws away an obvious source of structure, namely that "intern", "internet", and "international" all share a prefix, and re-examines those shared characters on every descent.

A trie (from re_trie_val; usually pronounced try) exploits exactly that structure. Instead of comparing keys against each other, it routes each key one character at a time down a tree whose edges are labeled by characters. A root-to-node path spells a prefix; following the characters of a key leads to the unique node that key reaches. Lookup costs , where the work depends only on the key being searched for, never on , the number of stored keys.1

The structure

The root represents the empty prefix. Note the two-level bookkeeping: a node can exist purely as an interior waypoint (a prefix of some longer key) yet not be a key itself. In the word set the node at path t exists but is not a stored word, so its is false; the nodes at to, tea, and ted have .

Each node needs a way to find a child by character. Two standard choices:

  • Array children. A fixed array of pointers per node (e.g. 26 for lowercase letters, or children[2] for a binary trie). Child lookup is a single index, at the cost of slots per node whether used or not.
  • Hash-map children. A char → node map per node, so a node stores only the children it actually has. Smaller for sparse, large alphabets (Unicode), with a small constant-factor hashing overhead.

Operations: all

Insertion walks down from the root, creating a child node whenever the needed edge is missing, and sets on the final node. Search walks the same path but never creates; it fails the moment a needed edge is absent.

Algorithm:Insert(T,w)\textsc{Insert}(T, w) — add word w=w1w2wLw = w_1 w_2 \dots w_L
  1. 1
    xroot(T)x \gets root(T)
  2. 2
    for i1i \gets 1 to LL do
  3. 3
    cwic \gets w_i
  4. 4
    if child(x,c)=nilchild(x, c) = \text{nil} then
  5. 5
    child(x,c)NewNode()child(x, c) \gets \textsc{NewNode}()
  6. 6
    xchild(x,c)x \gets child(x, c)
  7. 7
    isEnd(x)trueisEnd(x) \gets \text{true}
Algorithm:Search(T,w)\textsc{Search}(T, w) — is ww a stored key?
  1. 1
    xroot(T)x \gets root(T)
  2. 2
    for i1i \gets 1 to LL do
  3. 3
    cwic \gets w_i
  4. 4
    if child(x,c)=nilchild(x, c) = \text{nil} then
  5. 5
    return false
  6. 6
    xchild(x,c)x \gets child(x, c)
  7. 7
    return isEnd(x)isEnd(x)

Each loop runs times and does work per character (one indexed slot or one hash probe), so insert, search, and the prefix test all run in . startsWith(p) is identical to except it returns true as soon as the path for exists, ignoring , because any node on a valid path witnesses that is a prefix of some stored key.

Cost of the three string-dictionary structures per operation on keys of length . The trie alone drops the factor and answers prefix queries directly; the hash set is but unordered
A trie over {to, tea, ted, ten, in, inn}; shared prefixes stored once, lookup is

The blue nodes are the six stored words; the white interior nodes (t, te, i) are prefixes shared among them and stored exactly once. Looking up ten visits three edges regardless of whether the trie holds six words or six million.

Space and trade-offs

With array children the worst case is pointers: keys, up to nodes each, slots per node. That bound is pessimistic, and tries are far better than it suggests precisely when prefixes are shared: every common prefix collapses to a single path, so a dictionary of English words (densely overlapping) stores far fewer than nodes. Hash-map children replace the factor with the actual child count, trading a constant for the array's indexing.

Applications

Autocomplete and prefix search. To offer completions for what a user has typed, walk to the node for the typed prefix in , then DFS the subtree beneath it to enumerate every stored key with that prefix, since the trie has already grouped them.

Autocomplete for prefix \texttt{te}: walk to the prefix node in (accent path ), then DFS the subtree to emit every completion — \texttt{tea}, \texttt{ted}, \texttt{ten} — already grouped under that node

Wildcard dictionary (the . problem). Design Add and Search Words asks for a dictionary where a query may contain . matching any single character. Plain search no longer follows one path: at a . we must branch into all children and recurse. Concrete characters keep the search ; each . multiplies the branching, but the trie still prunes any path that cannot match.

Algorithm:WildSearch(x,w,i)\textsc{WildSearch}(x, w, i) — match ww from node xx, wiw_i may be .\texttt{.}
  1. 1
    if i>Li > L then
  2. 2
    return isEnd(x)isEnd(x)
  3. 3
    if wi="."w_i = \texttt{"."} then
  4. 4
    for each child cc of xx do
  5. 5
    if WildSearch(c,w,i+1)\textsc{WildSearch}(c, w, i+1) then return true
  6. 6
    return false
  7. 7
    else
  8. 8
    if child(x,wi)=nilchild(x, w_i) = \text{nil} then return false
  9. 9
    return WildSearch(child(x,wi),w,i+1)\textsc{WildSearch}(child(x, w_i), w, i+1)
Wildcard search for \texttt{t.n}: the concrete \texttt{t} follows one edge, the \texttt{.} branches into every child (red, both \texttt{o} and \texttt{e}), and only the \texttt{e} branch survives to spell \texttt{ten}. A \texttt{.} fans the search; concrete characters keep it on one path

Word search on a board. Word Search II hunts for many dictionary words in a grid simultaneously. Building a trie of all target words lets one DFS over the board carry a trie pointer alongside the grid position: the instant the current board path spells a string that is not a prefix of any target, the missing trie edge prunes the entire branch. One traversal finds all words, and the shared prefixes mean overlapping targets share work.

The binary trie: maximum XOR pair

A beautiful non-string use treats a fixed-width integer as a string of bits over . Maximum XOR of Two Numbers asks for . Brute force is ; a binary trie solves it in for -bit numbers.

Insert every number bit-by-bit from the high bit down, so each root-to-leaf path of length is one number. To maximize the XOR of a query against the stored set, walk down from the root and at each bit greedily steer toward the opposite bit of : a differing bit contributes a at that (high) position, worth more than every lower bit combined. If the opposite child exists, take it; otherwise follow the only child available. The path traced spells the stored number that maximizes .

Binary trie on {010, 011, 110}; greedy walk for a query maximizes XOR by taking opposite bits

The trie holds . For the query the greedy walk wants the opposite bit at each level: . The high bit of is , so it takes the -child; the remaining two bits of are , so at each step the wanted opposite bit is available and the walk follows it. It lands on the stored number , giving , the maximum.

Multi-pattern and compressed variants

A trie of patterns augmented with failure links, pointers that, on a mismatch, jump to the longest proper suffix of the current match that is also a prefix in the trie, is the Aho–Corasick automaton: it scans a text once and reports every occurrence of every pattern in linear time, the multi-string generalization of KMP (which is single-pattern failure-link matching).2 For storage, a compressed trie (a radix tree or Patricia trie) contracts each chain of single-child nodes into one edge labeled by a whole substring, shrinking the node count dramatically; suffix trees and suffix arrays push the idea further, indexing all suffixes of a text for fast substring search, the subject of later lessons.3

Takeaways

  • A trie is a rooted tree whose edges are labeled by characters; a root-to-node path spells a prefix, and an flag marks nodes that complete a stored key. Children are an array of size or a per-node map.
  • Insert, search, and startsWith all run in , the key length, and are independent of , beating a BST's .
  • Space is up to with array children, but shared prefixes are stored once, so prefix-heavy sets compress well; tries beat hash sets by giving ordered traversal, prefix queries, and no collisions.
  • Tries power autocomplete, wildcard . matching, and board word-search pruning; over a binary trie solves maximum-XOR pair by greedily walking toward the opposite bit from the high bit down.
  • Aho–Corasick = trie + failure links = multi-pattern KMP; Patricia / radix trees compress single-child chains, and suffix trees / arrays index all suffixes of a text.

Footnotes

  1. Skiena, § — String Data Structures: tries route keys character-by-character, giving search independent of the number of stored strings.
  2. Erickson, Ch. — Data Structures: tries as a string dictionary; failure links extend a trie into the Aho–Corasick multi-pattern matcher.
  3. Sedgewick, §5.2 — Tries: -way and ternary tries, prefix operations, and compressed (Patricia) variants for space.
Practice

╌╌ END ╌╌