Linear-Time Selection
Finding the -th smallest element looks like it should require sorting, but it does not. Quickselect adapts quicksort's partition to recurse on just one side, achieving expected .
╌╌╌╌
How do you find the median of numbers? The obvious answer, sorting them and reading off the middle, costs . But the median, and more generally the -th smallest element, can be found in linear time.1 The insight is that selection asks for less than sorting: we want one element's value, not the full ordering, and we can stop as soon as we have it.
The selection problem
Special cases: is the minimum, the maximum, and the median. The minimum and maximum are easy in comparisons. The median is the interesting case, and the algorithms below solve it as a byproduct of solving general selection.
Quickselect: partition, then recurse on one side
Quicksort partitions around a pivot and recurses on both halves. But for selection we only care about one of them. After partitioning around a pivot that lands at index , the pivot is the -th smallest element of the subarray. Compare that rank to :
- if it equals , the pivot is the answer;
- if is smaller, the answer lies in the left part, so recurse there;
- if is larger, the answer lies in the right part; recurse there, adjusting to skip the elements we discarded.
Throwing away the side that cannot contain the answer is what turns into .2
- 1if then
- 2returnonly element: the answer
- 3call
- 4pivot rank in A[p..r]
- 5if then
- 6returnpivot is the k-th smallest
- 7else if then
- 8return callrecurse left
- 9else
- 10return callrecurse right, shift k
Expected linear time
Partition costs . The win over quicksort is that we recurse on only one side. With a randomized pivot, the partition splits the array at a uniformly random rank, so on average we discard a constant fraction each time. Intuitively, a random pivot lands in the middle half of the array with probability , in which case the surviving side has at most elements. The expected work satisfies, roughly,
a geometric (not branching) recurrence. Unrolling it,
The geometric series converges to a constant, so the total is linear. (A careful CLRS-style analysis with indicator variables gives the same .)
The catch is the same as quicksort's: the worst case, with every pivot maximally unbalanced, is
Randomization makes that astronomically unlikely, but it is still possible. Can we guarantee linear time?
Median of medians: a guaranteed-good pivot
The deterministic algorithm of Blum, Floyd, Pratt, Rivest, and Tarjan (1973) achieves worst-case by spending a little effort to choose a pivot that is provably not too extreme.3 The trick is to pick the pivot as a median of medians.
- 1if has at most elements then
- 2return the -th smallest of by direct sorting
- 3divide into groups of elements (last group may be smaller)
- 4foreach group do
- 5find that group's median by sorting its elements
- 6let be the array of the group medians
- 7callmedian of medians
- 8partition around the pivot , returning its rank
- 9if then
- 10return
- 11else if then
- 12return call
- 13else
- 14return call
Why groups of five give a good pivot
Here is the core of it. Picture the groups as columns, each sorted top (large) to bottom (small), and now imagine the columns reordered left to right by their medians. The pivot is the median of that middle row, so it sits dead-center in the grid below. (We assume distinct elements for simplicity.)
The pivot is built in stages: chop the array into groups of five, sort each group to expose its median, collect those medians, and recurse to find their median — the median of medians.
Consider the columns whose median is : that is, 's own column and the roughly half of the columns to its right. In each such column, the median and the two elements above it (the two larger ones) are all . That is 3 elements per column, across about half of the columns:
So at least elements are , which forces at most to be . By the symmetric argument (the dashed block) at least are , so at most are . Whichever side we recurse into has at most elements, so the pivot is provably never too lopsided. (Five is the smallest odd group size that makes the recurrence below close; groups of fail because their fractions sum to exactly .)
Solving the recurrence
Tallying the work: splitting into groups and finding their medians is (each group is sorted in ). Partitioning is . There are two recursive calls:
- finding the median of the medians, a subproblem of size ;
- recursing into the surviving side, a subproblem of size at most .
The two fractions are what make this work. Observe that : the two subproblems together are strictly smaller than the input. That is shrinkage — the total work contracts by a constant factor at every level.
For inspiration, first consider the single-call cousin . The master theorem gives ; unrolling shows why directly,
The constant-factor-shrinkage plus linear cleanup work sums to . The median-of-medians recurrence has two calls instead of one, but the same shrinkage idea carries it through. State it as a clean general lemma:
Plugging in , (so ) gives , that is, worst-case linear time. Had the fractions summed to or more (as for groups of , where ), the lemma's hypothesis fails: the per-level savings vanish, the recursion tree carries levels of work, and the bound degrades to .
Which to use
The median-of-medians algorithm is a landmark: it proves selection is possible in worst-case linear time. But its constant factor is large (all that grouping and recursive median-finding), so in practice randomized quickselect is faster and is the algorithm of choice.4 The deterministic version shines when a worst-case guarantee is mandatory, or, most importantly, as a pivot-selection subroutine that can be bolted onto quicksort to give it a worst-case bound.
Bonus: Closest Pair of Points
This is often called the most beautiful divide-and-conquer algorithm,
and
also the scariest to get right. It is worth seeing because, like
median-of-medians, its speed hinges on a geometric counting argument that
caps how much work the combine step can possibly do.
We treat one distance computation and one distance comparison as . The naïve algorithm, two nested loops over all pairs, runs in . Can divide-and-conquer beat it?
The idea, and the trap
Split by the median -coordinate into a left half and a right half, recurse on each to get the closest pair within each side, and let be the smaller of the two returned distances. The danger is the combine step: the true closest pair might straddle the dividing line, so we still have to look across it.
The trap is doing this naïvely. If left points and right points lie near the line, checking every cross pair costs , which can be , no better than brute force. The whole game is to show that only cross-comparisons are ever needed.
The -strip and the 7-neighbor bound
Any cross pair closer than must have both points within horizontal distance of the dividing line, i.e. inside a vertical strip of width . So discard everything outside the strip. Then sort the strip's points by -coordinate and walk up it; the magic is that for each point you only need to compare against the next 7 points in -order.
Why 7? Fix a strip point and look at the points above it (larger ) that could beat . Any such point lies in the box sitting just above inside the strip. Tile that box with a grid of cells — there are exactly 8 of them.
So the box holds at most points, one of which is itself: at most candidates. Hence in -sorted order it suffices to compare with — a constant number of checks per point. The combine step is , not .
The algorithm
The whole thing rests on having the points pre-sorted both ways. Sort once by (call it ) and once by () up front; the recursion then threads these sorted orders down without re-sorting.
- 1sorted by -coordinate
- 2sorted by -coordinate
- 3return call
- 1
- 2if then
- 3returnno pair exists
- 4if then
- 5return
- 6line through median x
- 7;split by index (ties safe)
- 8the same split applied tokeeps halves y-sorted, O(n)
- 9callbest pair on left
- 10callbest pair on right
- 11
- 12strip, still y-sorted
- 13call
- 14return the closest of , ,
- 1;best pair so far + its dist
- 2for to do
- 3for to doonly 7 neighbors each!
- 4if then
- 5;
- 6return
Running time and correctness
The split, the strip extraction, and are each , and there are two half-size recursive calls:
by the master theorem, the same shape as mergesort. The initial sorts add a one-time , so the total stays .
Correctness is an easy induction on , given that finds any closest cross-strip pair. That assumption is exactly the 7-neighbor claim above: for each , every later strip point that could improve on lies in one of the cells of 's grid (the cell holding aside), and each cell holds at most one point — so scanning cannot miss it.
- Selection finds the -th smallest element; it needs less than sorting, so it can run in .
- = quicksort's partition, but recurse into only the side that holds the answer; expected via a geometric (non-branching) recurrence, worst case .
- Median of medians chooses a provably balanced pivot using groups of five, guaranteeing the recursion drops at least elements per side.
- That balance yields , and because , the recurrence solves to worst-case .
- In practice randomized quickselect wins on constants; median-of-medians matters for guarantees and as a quicksort pivot rule.
- Closest pair of points is divide-and-conquer with the same flavor: split by median , recurse on each half, then combine over a width- strip. A geometric argument caps the strip work at comparisons per point, giving .
Footnotes
- CLRS, Ch. 9 — Medians and Order Statistics: selecting the -th order statistic in linear time without fully sorting. ↩
- Erickson, Algorithms, Ch. 1 — Recursion: quickselect adapting quicksort's partition to recurse into only the side that holds the answer. ↩
- CLRS, Ch. 9 — Medians and Order Statistics: the Blum–Floyd–Pratt–Rivest–Tarjan median-of-medians algorithm achieving worst-case via groups of five. ↩
- Skiena, The Algorithm Design Manual, §4 — Sorting and Searching: randomized quickselect as the practical choice over deterministic median-of-medians. ↩
╌╌ END ╌╌