Divide & Conquer/Linear-Time Selection

Lesson 2.32,048 words

Linear-Time Selection

Finding the kk-th smallest element looks like it should require sorting, but it does not. Quickselect adapts quicksort's partition to recurse on just one side, achieving expected O(n)O(n).

╌╌╌╌

How do you find the median of numbers? The obvious answer, sorting them and reading off the middle, costs . But the median, and more generally the -th smallest element, can be found in linear time.1 The insight is that selection asks for less than sorting: we want one element's value, not the full ordering, and we can stop as soon as we have it.

The selection problem

Special cases: is the minimum, the maximum, and the median. The minimum and maximum are easy in comparisons. The median is the interesting case, and the algorithms below solve it as a byproduct of solving general selection.

Quickselect: partition, then recurse on one side

Quicksort partitions around a pivot and recurses on both halves. But for selection we only care about one of them. After partitioning around a pivot that lands at index , the pivot is the -th smallest element of the subarray. Compare that rank to :

  • if it equals , the pivot is the answer;
  • if is smaller, the answer lies in the left part, so recurse there;
  • if is larger, the answer lies in the right part; recurse there, adjusting to skip the elements we discarded.

Throwing away the side that cannot contain the answer is what turns into .2

Quickselect partitions once around , then recurses into only the side holding rank (here , the left part) and discards the other.
Algorithm 1:Quickselect(A,p,r,k)\textsc{Quickselect}(A, p, r, k)return the kk-th smallest of A[p..r]A[p..r]
  1. 1
    if p=rp = r then
  2. 2
    return A[p]A[p]
    only element: the answer
  3. 3
    qq \gets call Randomized-Partition(A,p,r)\textsc{Randomized-Partition}(A, p, r)
  4. 4
    iqp+1i \gets q - p + 1
    pivot rank in A[p..r]
  5. 5
    if k=ik = i then
  6. 6
    return A[q]A[q]
    pivot is the k-th smallest
  7. 7
    else if k<ik < i then
  8. 8
    return call Quickselect(A,p,q1,k)\textsc{Quickselect}(A, p, q - 1, k)
    recurse left
  9. 9
    else
  10. 10
    return call Quickselect(A,q+1,r,ki)\textsc{Quickselect}(A, q + 1, r, k - i)
    recurse right, shift k

Expected linear time

Partition costs . The win over quicksort is that we recurse on only one side. With a randomized pivot, the partition splits the array at a uniformly random rank, so on average we discard a constant fraction each time. Intuitively, a random pivot lands in the middle half of the array with probability , in which case the surviving side has at most elements. The expected work satisfies, roughly,

a geometric (not branching) recurrence. Unrolling it,

The geometric series converges to a constant, so the total is linear. (A careful CLRS-style analysis with indicator variables gives the same .)

A single-side recurrence shrinks the subproblem by a constant factor each level, so the bars sum to a geometric .

The catch is the same as quicksort's: the worst case, with every pivot maximally unbalanced, is

Randomization makes that astronomically unlikely, but it is still possible. Can we guarantee linear time?

Median of medians: a guaranteed-good pivot

The deterministic algorithm of Blum, Floyd, Pratt, Rivest, and Tarjan (1973) achieves worst-case by spending a little effort to choose a pivot that is provably not too extreme.3 The trick is to pick the pivot as a median of medians.

Algorithm 2:Select(A,k)\textsc{Select}(A, k) — deterministic kk-th smallest, worst-case O(n)O(n)
  1. 1
    if AA has at most 55 elements then
  2. 2
    return the kk-th smallest of AA by direct sorting
  3. 3
    divide AA into n/5\ceil{n/5} groups of 55 elements (last group may be smaller)
  4. 4
    foreach group do
  5. 5
    find that group's median by sorting its 5\le 5 elements
  6. 6
    let MM be the array of the n/5\ceil{n/5} group medians
  7. 7
    xx \gets call Select(M,M/2)\textsc{Select}(M, \ceil{|M|/2})
    median of medians
  8. 8
    qq \gets partition AA around the pivot xx, returning its rank ii
  9. 9
    if k=ik = i then
  10. 10
    return xx
  11. 11
    else if k<ik < i then
  12. 12
    return call Select(A[left part],k)\textsc{Select}(A[\,\text{left part}\,], k)
  13. 13
    else
  14. 14
    return call Select(A[right part],ki)\textsc{Select}(A[\,\text{right part}\,], k - i)

Why groups of five give a good pivot

Here is the core of it. Picture the groups as columns, each sorted top (large) to bottom (small), and now imagine the columns reordered left to right by their medians. The pivot is the median of that middle row, so it sits dead-center in the grid below. (We assume distinct elements for simplicity.)

The pivot is built in stages: chop the array into groups of five, sort each group to expose its median, collect those medians, and recurse to find their median — the median of medians.

Building the pivot: split into groups of , sort each to surface its median (accent cell), gather the medians, then recurse on that smaller array to get the median of medians .
Grid of groups of five with the median of medians , showing the regions guaranteed at least or at most .

Consider the columns whose median is : that is, 's own column and the roughly half of the columns to its right. In each such column, the median and the two elements above it (the two larger ones) are all . That is 3 elements per column, across about half of the columns:

So at least elements are , which forces at most to be . By the symmetric argument (the dashed block) at least are , so at most are . Whichever side we recurse into has at most elements, so the pivot is provably never too lopsided. (Five is the smallest odd group size that makes the recurrence below close; groups of fail because their fractions sum to exactly .)

Solving the recurrence

Tallying the work: splitting into groups and finding their medians is (each group is sorted in ). Partitioning is . There are two recursive calls:

  • finding the median of the medians, a subproblem of size ;
  • recursing into the surviving side, a subproblem of size at most .

The two fractions are what make this work. Observe that : the two subproblems together are strictly smaller than the input. That is shrinkage — the total work contracts by a constant factor at every level.

The two recursive calls span of the input; the leftover gap is the shrinkage that forces .

For inspiration, first consider the single-call cousin . The master theorem gives ; unrolling shows why directly,

The constant-factor-shrinkage plus linear cleanup work sums to . The median-of-medians recurrence has two calls instead of one, but the same shrinkage idea carries it through. State it as a clean general lemma:

Plugging in , (so ) gives , that is, worst-case linear time. Had the fractions summed to or more (as for groups of , where ), the lemma's hypothesis fails: the per-level savings vanish, the recursion tree carries levels of work, and the bound degrades to .

Which to use

The median-of-medians algorithm is a landmark: it proves selection is possible in worst-case linear time. But its constant factor is large (all that grouping and recursive median-finding), so in practice randomized quickselect is faster and is the algorithm of choice.4 The deterministic version shines when a worst-case guarantee is mandatory, or, most importantly, as a pivot-selection subroutine that can be bolted onto quicksort to give it a worst-case bound.

Bonus: Closest Pair of Points

This is often called the most beautiful divide-and-conquer algorithm, and also the scariest to get right. It is worth seeing because, like median-of-medians, its speed hinges on a geometric counting argument that caps how much work the combine step can possibly do.

We treat one distance computation and one distance comparison as . The naïve algorithm, two nested loops over all pairs, runs in . Can divide-and-conquer beat it?

The idea, and the trap

Split by the median -coordinate into a left half and a right half, recurse on each to get the closest pair within each side, and let be the smaller of the two returned distances. The danger is the combine step: the true closest pair might straddle the dividing line, so we still have to look across it.

The trap is doing this naïvely. If left points and right points lie near the line, checking every cross pair costs , which can be , no better than brute force. The whole game is to show that only cross-comparisons are ever needed.

The -strip and the 7-neighbor bound

Any cross pair closer than must have both points within horizontal distance of the dividing line, i.e. inside a vertical strip of width . So discard everything outside the strip. Then sort the strip's points by -coordinate and walk up it; the magic is that for each point you only need to compare against the next 7 points in -order.

The width- strip around the mid line with the 8 cells bounding the 7 neighbors above point at .

Why 7? Fix a strip point and look at the points above it (larger ) that could beat . Any such point lies in the box sitting just above inside the strip. Tile that box with a grid of cells — there are exactly 8 of them.

So the box holds at most points, one of which is itself: at most candidates. Hence in -sorted order it suffices to compare with — a constant number of checks per point. The combine step is , not .

The algorithm

The whole thing rests on having the points pre-sorted both ways. Sort once by (call it ) and once by () up front; the recursion then threads these sorted orders down without re-sorting.

Algorithm 3:Closest-Pair(P)\textsc{Closest-Pair}(P) — closest pair in O(nlogn)O(n\log n)
  1. 1
    AXPAX \gets P sorted by xx-coordinate
  2. 2
    AYPAY \gets P sorted by yy-coordinate
  3. 3
    return call Closest-Pair-Rec(AX,AY)\textsc{Closest-Pair-Rec}(AX, AY)
Algorithm 4:Closest-Pair-Rec(AX,AY)\textsc{Closest-Pair-Rec}(AX, AY)AXAX sorted by xx, AYAY by yy
  1. 1
    nAXn \gets |AX|
  2. 2
    if n<2n < 2 then
  3. 3
    return \bot
    no pair exists
  4. 4
    if n=2n = 2 then
  5. 5
    return (AX[1],AX[2])(AX[1], AX[2])
  6. 6
    midAX[n/2]\textit{mid} \gets AX[\ceil{n/2}]
    line through median x
  7. 7
    LXAX[1..n/2]LX \gets AX[1 \,..\, \ceil{n/2}]; RXAX[n/2+1..n]\quad RX \gets AX[\ceil{n/2}+1 \,..\, n]
    split by index (ties safe)
  8. 8
    LY,RYLY, RY \gets the same split applied to AYAY
    keeps halves y-sorted, O(n)
  9. 9
    (uL,vL)(uL, vL) \gets call Closest-Pair-Rec(LX,LY)\textsc{Closest-Pair-Rec}(LX, LY)
    best pair on left
  10. 10
    (uR,vR)(uR, vR) \gets call Closest-Pair-Rec(RX,RY)\textsc{Closest-Pair-Rec}(RX, RY)
    best pair on right
  11. 11
    δmin{dist(uL,vL), dist(uR,vR)}\delta \gets \min\set{\operatorname{dist}(uL, vL),\ \operatorname{dist}(uR, vR)}
  12. 12
    B[pAY:midxδpxmidx+δ]B \gets [\,p \in AY : \textit{mid}_x - \delta \le p_x \le \textit{mid}_x + \delta\,]
    strip, still y-sorted
  13. 13
    (uM,vM)(uM, vM) \gets call Best-Pair(B,δ)\textsc{Best-Pair}(B, \delta)
  14. 14
    return the closest of (uL,vL)(uL, vL), (uR,vR)(uR, vR), (uM,vM)(uM, vM)
Algorithm 5:Best-Pair(B,δ)\textsc{Best-Pair}(B, \delta) — best pair inside the strip; BB sorted by yy
  1. 1
    (um,vm)(,)(u_m, v_m) \gets (\bot, \bot); dmδ\quad d_m \gets \delta
    best pair so far + its dist
  2. 2
    for i1i \gets 1 to B|B| do
  3. 3
    for ji+1j \gets i + 1 to min{i+7, B}\min\set{i + 7,\ |B|} do
    only 7 neighbors each!
  4. 4
    if dist(B[i],B[j])<dm\operatorname{dist}(B[i], B[j]) < d_m then
  5. 5
    (um,vm)(B[i],B[j])(u_m, v_m) \gets (B[i], B[j]); dmdist(B[i],B[j])\quad d_m \gets \operatorname{dist}(B[i], B[j])
  6. 6
    return (um,vm)(u_m, v_m)

Running time and correctness

The split, the strip extraction, and are each , and there are two half-size recursive calls:

by the master theorem, the same shape as mergesort. The initial sorts add a one-time , so the total stays .

Correctness is an easy induction on , given that finds any closest cross-strip pair. That assumption is exactly the 7-neighbor claim above: for each , every later strip point that could improve on lies in one of the cells of 's grid (the cell holding aside), and each cell holds at most one point — so scanning cannot miss it.

  • Selection finds the -th smallest element; it needs less than sorting, so it can run in .
  • = quicksort's partition, but recurse into only the side that holds the answer; expected via a geometric (non-branching) recurrence, worst case .
  • Median of medians chooses a provably balanced pivot using groups of five, guaranteeing the recursion drops at least elements per side.
  • That balance yields , and because , the recurrence solves to worst-case .
  • In practice randomized quickselect wins on constants; median-of-medians matters for guarantees and as a quicksort pivot rule.
  • Closest pair of points is divide-and-conquer with the same flavor: split by median , recurse on each half, then combine over a width- strip. A geometric argument caps the strip work at comparisons per point, giving .

Footnotes

  1. CLRS, Ch. 9 — Medians and Order Statistics: selecting the -th order statistic in linear time without fully sorting.
  2. Erickson, Algorithms, Ch. 1 — Recursion: quickselect adapting quicksort's partition to recurse into only the side that holds the answer.
  3. CLRS, Ch. 9 — Medians and Order Statistics: the Blum–Floyd–Pratt–Rivest–Tarjan median-of-medians algorithm achieving worst-case via groups of five.
  4. Skiena, The Algorithm Design Manual, §4 — Sorting and Searching: randomized quickselect as the practical choice over deterministic median-of-medians.
Practice

╌╌ END ╌╌