Sequence Alignment & LCS
Two strings can be compared by asking how much of one survives inside the other. The longest common subsequence (LCS) and edit distance are the two classic answers, and they are the same dynamic program wearing different costs.
╌╌╌╌
How similar are two strings? algorithm and altruistic share the letters
a, l, t, i, c in order, and that shared spine, the longest common
subsequence, is one of the most useful measures of similarity in computing. It
underlies the diff utility, version-control merges, and (with a change of cost
function) the alignment of DNA and protein sequences in computational biology.1
This lesson develops the LCS dynamic program in full, then shows that edit
distance is the very same machine with the costs rearranged.
The problem
A subsequence of a string is what remains after deleting zero or more
characters, keeping the rest in their original order. It need not be
contiguous: ace is a subsequence of abcde, but aec is not. Given two
strings and , a common
subsequence is a string that is a subsequence of both, and we want a longest
one.
A common subsequence is a set of order-preserving matches threading the two strings: the matched letters appear in both, left to right, though the gaps between them differ.
A brute-force search is hopeless: has subsequences, and checking each against gives . We need the recipe, and it pays to follow it literally. The DP recipe runs through fixed steps: (0) simplify the goal, (1) define the subproblems and notation, (2) write the DP equations, (3) prove them correct by induction, (4) turn them into iterative pseudocode, (5) return to the original goal, then analyze. We walk LCS through those steps verbatim.
Step 0–1: Simplify the goal, define the subproblem
Simplify first. Computing the string drags around bookkeeping; computing its length is cleaner. So we first solve for the LCS length, then recover an actual subsequence in a cheap second pass (Step 5). With that simplification, the decisive move, the one worth internalizing because it recurs across all sequence DPs, is to index subproblems by prefixes of the two strings. Write and for the two inputs.
The answer we want is . There are subproblems, one per pair of prefix lengths and .
Step 2: The DP equations
Now apply optimal substructure by looking at the last characters, and .2 Write the recurrence as a single over three cases, plus a base case:
Read the three branches as moves on the prefixes:
- Case 1 drops : an LCS of and is a common subsequence of and , so .
- Case 2 drops symmetrically, so .
- Case 3 fires only when : the shared character can end an optimal LCS, so we append it to the best LCS of the strictly shorter prefixes, giving .
Why exactly these three? Because every common subsequence of and falls into one of three shapes: it ignores (case 1), ignores (case 2), or uses both as a final match (case 3, possible only if ). The over the applicable cases is the longest of them. Note that when , only cases 1 and 2 apply, and the recurrence collapses to .
Each entry depends only on its left, upper, and upper-left neighbors, so filling the table row by row, left to right (or column by column) respects every dependency. The same three-neighbour stencil drives every sequence DP below: only the labels on the arrows change.
Step 3: Correctness by induction on
We claim for all , and prove it by induction on . The recurrence is itself a , so the cleanest proof shows two inequalities: that the recurrence is neither too small () nor too large ().
Base case (, i.e. or ). One prefix is empty, so the only common subsequence is empty and the length is , exactly case 0. Trivial.
Inductive step. Fix and assume the claim for all with .
The direction (the recurrence is achievable). We exhibit a common subsequence of length at least the right-hand side.
- If : by the induction hypothesis there is a common subsequence of and of length . Appending yields a common subsequence of and , so .
- Cases 1 and 2 are even simpler: any common subsequence of a shorter prefix pair is still common for the longer pair, giving and . So the true length is the .
The direction (the recurrence is not exceeded). Take any common subsequence of and ; we show is bounded by one of the three branches.
- If does not use : then is a common subsequence of and , so by the IH.
- If does not use : symmetrically .
- If uses both and : then they must match as 's last character, so , and dropping it leaves a common subsequence of and ; hence , i.e. .
One of these three cases always applies, so over the applicable branches. Taking to be a longest common subsequence, . Both directions together give equality, completing the induction.
Step 4: Iterative pseudocode
The equations are non-circular, since every entry reads strictly smaller prefixes, so they convert directly into a bottom-up table fill. Allocate , zero the border, then sweep:
- 1for to do
- 2empty prefix
- 3for to do
- 4empty prefix
- 5for to do
- 6for to do
- 7cases 1, 2
- 8if then
- 9case 3
- 10return
Every cell costs , so the fill is .
The DP table, filled
Take and . We build the table of values. Row and column are all zero (empty prefix); every other cell is filled by the recurrence. The shaded diagonal steps mark the matches that build the answer, and the red arrows trace the reconstruction walk (Step 5) backwards from the corner.
The bottom-right entry reads : the longest common
subsequence of BDCAB and ABCB has length , namely BCB (the only one here,
though in general there may be ties). The arrows enter each shaded match cell
diagonally (emitting B, then C, then B, read from the corner upward) and step
straight up or left through non-match cells, exactly as the reconstruction below
prescribes.
Step 5: Reconstructing the subsequence
The table gives the length; this is where the Step 0 simplification is paid back. In a second pass we walk backwards from , undoing the recurrence. At cell : if , that character belongs to the LCS — emit it and step diagonally to (case 3); otherwise move to whichever neighbor, up or left, holds the larger value (the one the of cases 1 and 2 chose).
- 1if or then
- 2return the empty stringempty prefix
- 3if then
- 4return followed bycase 3 match
- 5else if then
- 6returnfrom above (case 1)
- 7else
- 8returnfrom left (case 2)
The walk takes one step toward the origin each call, so it runs in time, cheap compared with building the table.
Running time and space
The table has entries, and each is computed in : a comparison and a . Therefore LCS runs in
time, with space for the full table. If only the length is wanted, note that each row depends only on the row above it, so two rows, giving space, suffice. The catch, common to all such DPs: this space trick discards the information needed to reconstruct the subsequence. (Hirschberg's divide-and-conquer refinement recovers the actual LCS in time and only linear space, but that is a topic for later.)
The same machine: edit distance
Edit distance (the Levenshtein distance) asks the closely related
question: what is the minimum number of single-character insertions,
deletions, and substitutions that transform into ?3 It is the cost
model behind spell-checkers and diff, and it is structurally identical to LCS.
Again we look at the last characters. If , they need no edit and we align them for free. Otherwise we make one of three moves (delete , insert , or substitute ), each at cost , and recurse on the correspondingly shorter prefixes:
The base cases say it: turning a length- prefix into the empty string costs deletions, and building a length- prefix from nothing costs insertions.
Filled on a small pair, the table looks just like the LCS one but now minimizes. Take and : the shaded diagonal marks the free matches, and the corner reports the answer.
Reading the corner, : align the shared prefix CA for free,
substitute T R, then insert S. The match cells on the diagonal
( and ) copy the upper-left value unchanged, exactly the
LCS diagonal step but counting saved edits instead of matched characters.
Compare this against LCS line by line. Both index subproblems by prefix pairs;
both branch on whether the last characters match; both fill an
table where each entry reads its left, upper, and upper-left
neighbors; both run in . The only differences are the costs and
the optimization direction: LCS maximizes matched characters, edit distance
minimizes edits. Sequence DPs are a single template, parameterized by
what a match
earns and a mismatch
costs. Recognize the template and a whole
family of problems (LCS, edit distance, sequence alignment, longest common
substring, and string matching) falls
to the same code.
Takeaways
- Index sequence subproblems by prefixes: is the key that unlocks LCS, after the Step 0 move of solving for length first.
- The recurrence is a over three cases (drop in case 1, drop in case 2, or extend the diagonal by when in case 3) over a base case of for an empty prefix.
- Prove it by induction on in two directions: (build a witness subsequence) and (every common subsequence fits one of the three cases).
- Fill the table in ; reconstruct in a second pass, walking backwards from and emitting a character on every diagonal match.
- Only the length is needed? Two rows give space, at the cost of losing the reconstruction.
- Edit distance is the same dynamic program: same prefix subproblems, same table shape, same , minimizing edits instead of maximizing matches. Sequence DP is one reusable template.
Footnotes
- Skiena, §10 — Dynamic Programming: the longest common subsequence as a similarity measure underlying
diffand sequence alignment. ↩ - CLRS, Ch. 15 — Dynamic Programming: the LCS recurrence obtained by examining the last characters of each prefix. ↩
- Erickson, Ch. 3 — Dynamic Programming: edit (Levenshtein) distance as the minimum-cost insert/delete/substitute alignment filling an table. ↩
╌╌ END ╌╌