Sorting in Linear Time

The previous lesson proved that any sort that learns only by comparing elements needs $Ω (n log n)$ comparisons. That proof assumes the algorithm extracts information one comparison at a time. If instead we treat keys as data we can read, using a key directly as an array index or splitting it into digits, the decision-tree argument no longer applies, and we can sort in linear time.¹ The tradeoff is generality: these algorithms need keys drawn from a small or structured universe, not arbitrary comparables.

How the lower bound is escaped

The $Ω (n log n)$ bound counts the branchings of a decision tree whose only moves are comparisons: with $n!$ possible orderings and two outcomes per comparison, at least $\log_{2}(n!) = Θ(n \log n)$ comparisons are needed to distinguish them. A key used as an index makes a move the tree cannot: it routes an element to one of $k$ slots in a single operation, a $k$-way branch that no binary comparison tree models. The moment an algorithm reads a key's value directly, rather than only its order relative to another key, the decision-tree argument stops applying, and the floor it imposes is no longer binding.

This works only when the key universe is small or structured enough to index into: integers in a bounded range (counting sort), integers split into a bounded number of digits (radix sort), or reals whose distribution is known (bucket sort). On arbitrary comparable objects there is nothing to index on; the only remaining move is to compare, and the $n log n$ bound applies again.
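To make the floor concrete, the sketch below computes $\lceil \log_{2}(n!) \rceil$, the fewest comparisons any comparison sort needs in the worst case, and prints it beside $n \log_{2} n$. The function name `min_comparisons` is ours, not the text's; it uses exact integer arithmetic so that huge factorials introduce no rounding error.

```python
import math


def min_comparisons(n: int) -> int:
    """Worst-case comparison floor ceil(log2(n!)) for sorting n distinct keys.

    A binary decision tree with at least n! leaves has height >= log2(n!).
    For an integer m >= 1, ceil(log2(m)) == (m - 1).bit_length(), which
    stays exact even when n! is astronomically large.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    return (math.factorial(n) - 1).bit_length()


if __name__ == "__main__":
    for n in (4, 10, 100, 1000):
        floor = min_comparisons(n)
        estimate = n * math.log2(n)
        print(f"n={n:5d}  ceil(log2 n!)={floor:6d}  n*log2(n)={estimate:9.1f}")
```

For $n = 10$ the floor is $22$ comparisons, since $10! = 3{,}628{,}800$ lies between $2^{21}$ and $2^{22}$; the two columns grow at the same $Θ(n \log n)$ rate.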

Counting sort

Suppose every key is an integer drawn from $\{0, 1, \dots, k\}$. Counting sort never compares two elements. Instead it counts, for each value $v$, how many keys are $\leq v$; that count gives the final position of the last key equal to $v$. Reading the input back-to-front and decrementing as we place, we drop each element straight into its sorted slot.

Algorithm 1: COUNTING-SORT(A, B, k), sorting A[1..n] with keys in [0, k] into B

    let C[0..k] be a new array
    for v ← 0 to k
        C[v] ← 0
    for j ← 1 to A.length
        C[A[j]] ← C[A[j]] + 1        // now C[v] = number of keys equal to v
    for v ← 1 to k
        C[v] ← C[v] + C[v - 1]       // now C[v] = number of keys ≤ v
    for j ← A.length downto 1
        B[C[A[j]]] ← A[j]
        C[A[j]] ← C[A[j]] - 1        // the next equal key goes just before it

The first count loop tallies occurrences; the prefix-sum loop turns counts into ranks (how many keys land at or before each value); the final loop scatters each element into its slot in the output array $B$ . Walking the input from $n$ down to $1$ is what makes the sort stable. Equal keys are emitted in their original relative order, because the last such key claims the highest of the slots reserved for that value, and earlier ones fill in below it.

Figure: Counting sort on $A = ⟨2, 5, 3, 0, 2, 3, 0, 3⟩$: tally counts, prefix-sum into ranks, then scatter into the output $B$.

Trace it on $A = ⟨2, 5, 3, 0, 2, 3, 0, 3⟩$. The count loop tallies each value: two $0$s, no $1$s, two $2$s, three $3$s, no $4$s, one $5$, giving $C = ⟨2, 0, 2, 3, 0, 1⟩$. The prefix-sum loop replaces each entry with the running total, $C = ⟨2, 2, 4, 7, 7, 8⟩$. Read this as ranks: $C[3] = 7$ says seven keys are $\leq 3$, so the last $3$ belongs in output slot $7$. Each value's rank now gives the exact output index of its last occurrence in the input, computed without a single comparison between two keys.
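The trace can be checked mechanically. This sketch (the helper name `count_and_rank` is hypothetical) runs only the first two loops of Algorithm 1, using `itertools.accumulate` for the prefix sums, and assumes every key already lies in $[0, k]$.

```python
from itertools import accumulate


def count_and_rank(keys: list[int], k: int) -> tuple[list[int], list[int]]:
    """Return (tallies, ranks) for integer keys in [0, k].

    tallies[v] is the number of keys equal to v (the count loop);
    ranks[v] is the number of keys <= v (the prefix-sum loop).
    """
    tallies = [0] * (k + 1)
    for key in keys:
        tallies[key] += 1
    # an inclusive running sum turns per-value counts into ranks
    ranks = list(accumulate(tallies))
    return tallies, ranks


if __name__ == "__main__":
    A = [2, 5, 3, 0, 2, 3, 0, 3]
    tallies, ranks = count_and_rank(A, k=5)
    print(tallies)  # [2, 0, 2, 3, 0, 1]
    print(ranks)    # [2, 2, 4, 7, 7, 8]
```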

One step of the scatter loop shows both the mechanism and the stability. Reading $A$ from the back, the last key $A [8] = 3$ looks up its rank $C [3] = 7$ and drops straight into $B [7]$ ; we then decrement $C [3]$ to $6$ , so the next $3$ we meet (an earlier one in $A$ ) lands in $B [6]$ , just before it, preserving input order.

Figure: One step of the scatter loop. The last key $A[8] = 3$ reads its rank $C[3] = 7$, is written to $B[7]$, and then $C[3]$ is decremented to $6$ so the previous $3$ falls just before it: the source of stability.
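The scatter step and its stability can also be traced in code. The sketch below (hypothetical name `scatter_trace`) follows Algorithm 1 line for line, keeps the text's 1-indexed positions in its log, and records where each input position lands, so the order of the three $3$s can be read off directly.

```python
from itertools import accumulate


def scatter_trace(keys: list[int], k: int) -> tuple[list[int], list[tuple[int, int, int]]]:
    """Run Algorithm 1 on integer keys in [0, k], logging every scatter step.

    Each log entry is (j, A[j], slot) in the text's 1-indexed convention,
    meaning A[j] was written to B[slot]. The returned output is an
    ordinary 0-indexed Python list.
    """
    tallies = [0] * (k + 1)
    for key in keys:
        tallies[key] += 1
    ranks = list(accumulate(tallies))      # ranks[v] = number of keys <= v
    output = [0] * len(keys)
    log: list[tuple[int, int, int]] = []
    for j in range(len(keys), 0, -1):      # for j <- A.length downto 1
        key = keys[j - 1]
        slot = ranks[key]                  # B[C[A[j]]] <- A[j]
        output[slot - 1] = key
        ranks[key] -= 1                    # next equal key goes just before it
        log.append((j, key, slot))
    return output, log


if __name__ == "__main__":
    A = [2, 5, 3, 0, 2, 3, 0, 3]
    B, log = scatter_trace(A, k=5)
    print(B)        # [0, 0, 2, 2, 3, 3, 3, 5]
    print(log[0])   # (8, 3, 7): A[8] = 3 is written to B[7]
    print(log[2])   # (6, 3, 6): the previous 3 lands just before it
    # the 3s at input positions 3, 6, 8 occupy output slots 5, 6, 7 in that order
    print([slot for j, key, slot in sorted(log) if key == 3])  # [5, 6, 7]
```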

Analysis. The four loops run $Θ(k)$, $Θ(n)$, $Θ(k)$, and $Θ(n)$ times, so counting sort takes $Θ(n + k)$ in both time and space. As long as $k = O(n)$ this is $Θ(n)$, genuinely linear, beating the comparison bound because no comparisons happen.² The catch is the dependence on $k$. If the keys range over, say, $32$-bit integers, then $k \approx 4 \times 10^{9}$ dwarfs any realistic $n$, the count array $C$ becomes enormous, and the method is impractical. Counting sort works best when the key universe is small.
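A quick size estimate shows how fast $C$ grows with the key universe. This sketch assumes 4-byte counters, as in a language with fixed-width integers (Python list slots cost considerably more), and the function name `count_array_bytes` is hypothetical.

```python
def count_array_bytes(k: int, bytes_per_counter: int = 4) -> int:
    """Bytes needed for counting sort's array C[0..k], i.e. k + 1 counters.

    bytes_per_counter = 4 is an assumption modeling 32-bit counters.
    """
    return (k + 1) * bytes_per_counter


if __name__ == "__main__":
    byte_keys = count_array_bytes(k=2**8 - 1)    # keys are byte values
    word_keys = count_array_bytes(k=2**32 - 1)   # keys are 32-bit integers
    print(byte_keys)            # 1024
    print(word_keys / 2**30)    # 16.0 (GiB)
```

Byte-valued keys need a kilobyte of counters; 32-bit keys need $16$ GiB before a single element is placed.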

counting_sort.py

from collections.abc import Callable, Sequence
from typing import TypeVar

Item = TypeVar("Item")

def counting_sort(values: Sequence[int]) -> list[int]:
  """
  Sort non-negative integers in O(n + k) time, where k is the maximum key.
  Returns a new sorted list; the input is left untouched.
  """
  if not values:
    return []
  return counting_sort_by_key(values, key=lambda value: value)

def counting_sort_by_key(
  items: Sequence[Item],
  key: Callable[[Item], int],
) -> list[Item]:
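  """
  Stable counting sort of `items` by an integer `key` in [0, k].
  Equal keys keep their input order, so this is the stable subsort that
  radix sort layers digit by digit. Raises ValueError on a negative key.
  """
  if not items:
    return []

  # project each item to its integer key once.
  keys: list[int] = [key(item) for item in items]
  # a negative key would silently index counts from the end, so reject it.
  if min(keys) < 0:
    raise ValueError("counting sort needs keys in [0, k]; got a negative key")
  max_key: int = max(keys)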

  # counts[value] = how many items carry that key.
  counts: list[int] = [0 for _ in range(max_key + 1)]
  for current_key in keys:
    counts[current_key] += 1

  # prefix-sum turns counts into ranks: counts[value] = number of keys <= value.
  for value in range(1, max_key + 1):
    counts[value] += counts[value - 1]

  # scatter back-to-front so equal keys keep input order (stable).
  output: list[Item] = [items[0]] * len(items)
  for position in range(len(items) - 1, -1, -1):
    current_key = keys[position]
    counts[current_key] -= 1
    output[counts[current_key]] = items[position]
  return output

Radix sort

What if the keys are larger, say $d$-digit numbers, so that a single counting pass is infeasible? Radix sort decomposes each key into $d$ digits and sorts one digit at a time. The counterintuitive rule, known since the days of punched-card machines, is to sort by the least significant digit first (LSD), working up to the most significant.

Algorithm 2: RADIX-SORT(A, d), sorting d-digit keys least significant digit first

    for i ← 1 to d                          // digit 1 is the least significant
        use a stable sort to sort A on digit i

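The correctness rests entirely on stability. By induction, after pass $i$ the keys are sorted on their low $i$ digits: two keys that differ on digit $i$ are ordered correctly by that pass, and two keys that tie on digit $i$ leave the pass in the order they entered it, which the earlier passes already made correct on digits $1$ through $i - 1$.

An unstable per-digit sort could reorder those ties arbitrarily and destroy the work of every earlier pass. This is why the inner sort must be stable, and counting sort is the natural choice.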
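A small experiment shows the failure concretely. The sketch below (hypothetical names `lsd_radix_sort` and `digit_at`; base 10 assumed) runs the same LSD loop twice: once with Python's `sorted()`, which is guaranteed stable, and once with a per-digit sort that deliberately reverses ties, standing in for an unstable sort.

```python
def digit_at(value: int, place: int) -> int:
    """Decimal digit of value at place (1, 10, 100, ...)."""
    return (value // place) % 10


def lsd_radix_sort(values: list[int], digits: int, stable: bool = True) -> list[int]:
    """LSD radix sort of non-negative integers on `digits` decimal digits.

    stable=True sorts each pass with sorted(), which is stable.
    stable=False reverses the incoming order of keys that tie on the
    current digit, a stand-in for an unstable per-digit sort.
    """
    result = list(values)
    place = 1
    for _ in range(digits):
        if stable:
            result = sorted(result, key=lambda v: digit_at(v, place))
        else:
            # key (digit, -position): ties come out in reverse incoming order
            indexed = sorted(
                enumerate(result),
                key=lambda pair: (digit_at(pair[1], place), -pair[0]),
            )
            result = [value for _, value in indexed]
        place *= 10
    return result


if __name__ == "__main__":
    keys = [23, 15, 21, 12, 30]
    print(lsd_radix_sort(keys, digits=2))                # [12, 15, 21, 23, 30]
    print(lsd_radix_sort(keys, digits=2, stable=False))  # [15, 12, 23, 21, 30]
```

In the second run every pass sorts its own digit correctly, yet the tens pass reverses the units order that the first pass established for $12$/$15$ and $21$/$23$, so the final output is wrong.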

LSD radix sort over three digits; each pass stably sorts on one digit, and ties keep the previous pass's order until the array is sorted.

A single pass shows why stability is required. Suppose the array is already ordered on the low digit, and we now sort on the next one. Keys that tie on the new digit must keep their incoming order, since that order already encodes the lower digit; only keys that differ on the new digit may be reordered.

Figure: Why stability is essential. Sorting on the tens digit, keys that tie there (both have tens digit $3$) keep their incoming order, preserving the units sort; only keys that differ on the tens digit cross.

Analysis. With counting sort on each of $d$ digits, each drawn from a range of size $k$, every pass costs $Θ(n + k)$, for a total of

$$\Theta(d(n + k)).$$

When $d$ is a constant and $k = O(n)$, for example fixed-width integers split into a constant number of digits in a base of size $Θ(n)$, radix sort runs in $Θ(n)$. Choosing the digit size is an engineering tradeoff: larger digits mean fewer passes ($d$ shrinks) but a larger $k$ per pass. For $b$-bit keys, the best choice is typically digits of about $\log_{2} n$ bits, so $k \approx n$ and $d \approx b / \log_{2} n$.

Consider $32$-bit keys with $n \approx 2^{16}$ elements. Splitting into $8$-bit digits gives $d = 4$ passes over a count array of size $k = 256$; each pass is $Θ(n + 256) = Θ(n)$, about $4(n + 256) \approx 4n$ steps in total. Splitting into $16$-bit digits gives $d = 2$ passes but a count array of size $k = 65{,}536$, comparable to $n$ itself; the total is $2(n + n) = 4n$ again, but the larger $C$ strains the cache. Going the other way, $4$-bit digits double $d$ to $8$ passes with a tiny $k = 16$, for about $8n$ steps. The product $d(n + k)$ is what to minimize: widening the digit pays off only while $k$ stays at or below roughly $n$, because beyond that point the count array dominates each pass.

Figure: The radix digit-size tradeoff for $32$-bit keys. Wider digits cut the pass count $d$ but enlarge the per-pass count array $k$; total work $d(n + k)$ is minimized when $k$ sits near $n$ (the middle rung).
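The arithmetic above fits a one-line cost model. This sketch (hypothetical `radix_cost`) evaluates $d(n + k)$ with $d = \lceil b / r \rceil$ passes and $k = 2^{r}$ slots for $r$-bit digits, for the three digit widths discussed; it ignores constant factors and the cache effects mentioned in the text.

```python
import math


def radix_cost(key_bits: int, digit_bits: int, n: int) -> int:
    """Cost model d * (n + k) for LSD radix sort of key_bits-bit keys.

    d = ceil(key_bits / digit_bits) passes, each touching n keys and a
    count array of k = 2**digit_bits slots. Constants and caches ignored.
    """
    passes = math.ceil(key_bits / digit_bits)
    return passes * (n + 2**digit_bits)


if __name__ == "__main__":
    n = 2**16
    for digit_bits in (4, 8, 16):
        cost = radix_cost(32, digit_bits, n)
        print(f"{digit_bits:2d}-bit digits: d(n + k) = {cost} = {cost / n:.3f} n")
```

The model reproduces the text's figures: about $8n$ for $4$-bit digits and about $4n$ for both $8$-bit and $16$-bit digits.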

radix_sort.py

from collections.abc import Sequence

def _digit_count(largest: int, radix: int) -> int:
  """Number of base-`radix` digits needed to represent `largest`."""
  if largest == 0:
    return 1

  # strip one digit per division until nothing is left.
  digits: int = 0
  while largest > 0:
    largest //= radix
    digits += 1
  return digits

def _counting_sort_on_digit(
  values: Sequence[int],
  place: int,
  radix: int,
) -> list[int]:
  """
  One stable counting-sort pass keyed on the digit at `place`
  (`place` is a power of `radix`: 1, radix, radix**2, ...).
  """
  # tally how many keys carry each digit at this place.
  counts: list[int] = [0 for _ in range(radix)]
  for value in values:
    digit: int = (value // place) % radix
    counts[digit] += 1

  # prefix-sum into ranks within this digit.
  for digit in range(1, radix):
    counts[digit] += counts[digit - 1]

  # scatter back-to-front to keep the pass stable.
  output: list[int] = [0 for _ in range(len(values))]
  for position in range(len(values) - 1, -1, -1):
    digit = (values[position] // place) % radix
    counts[digit] -= 1
    output[counts[digit]] = values[position]
  return output

def radix_sort(values: Sequence[int], radix: int = 10) -> list[int]:
  """
  Sort non-negative integers least significant digit first.
  `radix` sets the digit base; larger bases mean fewer passes but a
  larger count array per pass. Returns a new sorted list.
  """
  if radix < 2:
    raise ValueError("radix must be at least 2")
  if not values:
    return []

  if any(value < 0 for value in values):
    raise ValueError("radix_sort handles non-negative integers only")

  # one pass per digit of the largest key.
  result: list[int] = list(values)
  passes: int = _digit_count(max(result), radix)

  # sort on each digit position, least significant first.
  place: int = 1
  for _ in range(passes):
    result = _counting_sort_on_digit(result, place, radix)
    place *= radix
  return result

Bucket sort

Counting and radix sort exploit integer keys. Bucket sort instead exploits a distributional assumption: that the keys are drawn (roughly) uniformly at random from an interval, say $[0, 1)$ . It scatters the $n$ keys into $n$ equal sub-intervals, the buckets, sorts each bucket with a simple sort like insertion sort, then concatenates the buckets in order.

Figure: Bucket sort on $n = 10$ keys in $[0, 1)$. Key $x$ drops into bucket $⌊10x⌋$; each bucket is sorted and the buckets concatenated left to right. Uniform keys give about one key per bucket.

Algorithm 3: BUCKET-SORT(A), sorting n keys drawn uniformly from [0, 1)

    n ← A.length
    let B[0..n - 1] be an array of empty lists
    for i ← 1 to n
        insert A[i] into list B[⌊n · A[i]⌋]    // bucket by value
    for i ← 0 to n - 1
        sort list B[i] with insertion sort
    concatenate B[0], B[1], …, B[n - 1] in order

Scattering is $Θ (n)$ , and concatenation is $Θ (n)$ . The only variable cost is sorting the buckets. If the input is spread uniformly, each bucket holds about one element on average, so the insertion sorts cost $O (1)$ each in expectation.⁴

Analysis. Let $n_{i} = |B[i]|$. Insertion sort on bucket $i$ costs $O(n_{i}^{2})$, so the expected total bucket-sorting cost is $\sum_{i} E[O(n_{i}^{2})] = \sum_{i} O(E[n_{i}^{2}])$. Each key lands in bucket $i$ independently with probability $1/n$, so $n_{i}$ is Binomial$(n, 1/n)$, which has

$$E[n_{i}^{2}] = \operatorname{Var}(n_{i}) + E[n_{i}]^{2} = \left(1 - \frac{1}{n}\right) + 1^{2} = 2 - \frac{1}{n} < 2.$$

Summing over the $n$ buckets gives $\sum_{i} O (E [n_{i}^{2}]) = O (n)$ , so the total expected running time is

$$\Theta(n) + n \cdot O(1) = \Theta(n).$$
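The variance step can be checked exactly rather than trusted. This sketch (hypothetical function name) sums $k^{2} \Pr[n_{i} = k]$ over the Binomial$(n, 1/n)$ distribution in rational arithmetic, using `fractions.Fraction` and `math.comb`, and compares the result with $2 - 1/n$.

```python
from fractions import Fraction
from math import comb


def expected_square_bucket_size(n: int) -> Fraction:
    """Exact E[n_i^2] when n keys fall independently into n equally likely buckets.

    n_i ~ Binomial(n, 1/n), so E[n_i^2] = sum over k of k^2 * C(n, k) p^k (1 - p)^(n - k).
    """
    p = Fraction(1, n)
    return sum(
        (Fraction(k * k) * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)),
        Fraction(0),
    )


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, expected_square_bucket_size(n), 2 - Fraction(1, n))
```

Because the arithmetic is exact, each printed pair is identical, not merely close.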

This is an average-case result: it assumes the inputs are uniformly distributed. Adversarial input, with every key landing in the same bucket, degrades bucket sort to the $Θ (n^{2})$ of a single insertion sort. Bucket sort is the right tool when you know your data is spread evenly (or can cheaply map it so), as with fractional parts of well-mixed values.
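The gap between the two regimes shows up in the bucket occupancies. This sketch (hypothetical names; keys assumed to lie in $[0, 1)$) counts how many keys each bucket receives and sums $n_{i}^{2}$, which tracks the insertion-sort work, for evenly spread keys and for keys all crowded below $1/n$.

```python
def bucket_sizes(values: list[float]) -> list[int]:
    """Occupancy of each of the n buckets floor(n * x) used by Bucket-Sort."""
    n = len(values)
    sizes = [0] * n
    for x in values:
        sizes[int(n * x)] += 1
    return sizes


def insertion_cost_proxy(sizes: list[int]) -> int:
    """Sum of n_i^2; worst-case insertion-sort work over all buckets grows in proportion."""
    return sum(size * size for size in sizes)


if __name__ == "__main__":
    n = 10
    spread = [(i + 0.5) / n for i in range(n)]      # one key per bucket
    clustered = [i / (100 * n) for i in range(n)]   # every key below 1/n
    print(bucket_sizes(spread), insertion_cost_proxy(bucket_sizes(spread)))
    print(bucket_sizes(clustered), insertion_cost_proxy(bucket_sizes(clustered)))
```

Spread keys give a sum of $n$; clustered keys put everything in bucket $0$ and give $n^{2}$, the quadratic degradation described above.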

A worked bucket sort

Take the $n = 10$ keys $A = ⟨ 0.78, 0.17, 0.39, 0.26, 0.72, 0.94, 0.21, 0.12, 0.23, 0.68 ⟩$ , uniform-looking values in $[0, 1)$ . Each key $x$ lands in bucket $⌊ 10 x ⌋$ , so $0.78 \to B [7]$ , $0.17 \to B [1]$ , $0.39 \to B [3]$ , and so on. Scattering costs one pass:

bucket $i$	keys placed (in arrival order)
$0$	—
$1$	$0.17, 0.12$
$2$	$0.26, 0.21, 0.23$
$3$	$0.39$
$6$	$0.68$
$7$	$0.78, 0.72$
$9$	$0.94$

Buckets $0$ , $4$ , $5$ , and $8$ stay empty. Insertion sort now orders each bucket's short list — bucket $1$ becomes $⟨ 0.12, 0.17 ⟩$ , bucket $2$ becomes $⟨ 0.21, 0.23, 0.26 ⟩$ , bucket $7$ becomes $⟨ 0.72, 0.78 ⟩$ — and reading the buckets left to right concatenates them into the sorted output. No bucket held more than three keys, so every insertion sort was $O (1)$ work, and the whole sort touched each key a constant number of times.

Figure: Bucket sort on the $10$ keys. Each key scatters to bucket $⌊10x⌋$, buckets are insertion-sorted in place (most hold $0$ or $1$ key), then read left to right into the sorted run.
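The worked example can be replayed to confirm the table. This sketch (hypothetical `scatter_into_buckets`) performs only the scatter step, keeping arrival order inside each bucket; for brevity it then orders each bucket with the built-in `sorted()` in place of insertion sort, which is harmless here because no bucket holds more than three keys.

```python
def scatter_into_buckets(keys: list[float]) -> list[list[float]]:
    """Scatter n keys from [0, 1) into n buckets by floor(n * x), keeping arrival order."""
    n = len(keys)
    buckets: list[list[float]] = [[] for _ in range(n)]
    for x in keys:
        buckets[int(n * x)].append(x)
    return buckets


if __name__ == "__main__":
    A = [0.78, 0.17, 0.39, 0.26, 0.72, 0.94, 0.21, 0.12, 0.23, 0.68]
    buckets = scatter_into_buckets(A)
    for i, bucket in enumerate(buckets):
        if bucket:
            print(i, bucket)          # e.g. 2 [0.26, 0.21, 0.23]
    # order each short bucket, then concatenate in bucket order
    result = [x for bucket in buckets for x in sorted(bucket)]
    print(result == sorted(A))        # True
```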

bucket_sort.py

from collections.abc import Sequence

def insertion_sort(values: list[float]) -> None:
  """
  In-place insertion sort, the cheap per-bucket subsort.
  Stable, and O(length) on the near-sorted, near-singleton buckets that
  uniform input produces.
  """
  for position in range(1, len(values)):
    current: float = values[position]

    # shift larger neighbors right until current's slot opens up.
    scan: int = position - 1
    while scan >= 0 and values[scan] > current:
      values[scan + 1] = values[scan]
      scan -= 1
    values[scan + 1] = current

def bucket_sort(values: Sequence[float]) -> list[float]:
  """
  Sort reals drawn (roughly) uniformly from [0, 1) in expected O(n) time.
  Returns a new sorted list; raises ValueError if any key falls outside [0, 1).
  """
  count: int = len(values)
  if count == 0:
    return []

  # reject keys outside the half-open interval the scatter step assumes.
  if any(value < 0.0 or value >= 1.0 for value in values):
    raise ValueError("bucket_sort expects keys in the half-open interval [0, 1)")

  # scatter each key into bucket floor(count * value) in [0, count).
  buckets: list[list[float]] = [[] for _ in range(count)]
  for value in values:
    index: int = int(count * value)
    buckets[index].append(value)

  # subsort each bucket and concatenate in bucket order.
  result: list[float] = []
  for bucket in buckets:
    insertion_sort(bucket)
    result.extend(bucket)
  return result

Choosing among them

None of these linear-time sorts is a drop-in replacement for a comparison sort like mergesort or heapsort. Each rests on a structural assumption about the keys, so the choice comes down to matching the algorithm to what you know about your data.⁵

Algorithm	Assumption on keys	Time	Stable?	Extra space
Counting sort	integers in a small range $[0, k]$	$Θ (n + k)$	yes	$Θ (n + k)$
Radix sort	$d$ digits, each in a small range	$Θ (d (n + k))$	yes	$Θ (n + k)$
Bucket sort	reals spread uniformly over an interval	$Θ (n)$ expected	yes	$Θ (n)$

Practical guidance:

Use counting sort when keys are integers over a range comparable to $n$ (grades, small ages, byte values). It is also the standard stable subsort inside radix sort.
Use radix sort for fixed-width keys with a larger range, such as $32$- or $64$-bit integers or fixed-length strings, where a single counting pass would need an impossibly large count array.
Use bucket sort when keys are real numbers believed to be uniformly (or near-uniformly) distributed, and linear expected time suffices.

These methods beat $Ω (n log n)$ precisely because they are not comparison sorts: they compute with the keys rather than comparing them. On arbitrary comparable objects with no exploitable integer or distributional structure, the linear-time guarantee is gone, and a comparison sort with its $n log n$ bound is the only option.

Radix sort in practice

The textbook radix sort copies every key into an auxiliary output (an array, or $k$ separate lists) on each pass, paying $Θ(n + k)$ extra space. In production, that copying and the poor cache behavior of scattered writes are the bottleneck, and two refinements address them.

MSD radix, in place: American flag sort. Sorting most-significant digit first lets a radix sort partition the array in place, the way quicksort does, rather than into external buckets. American flag sort (McIlroy, Bostic, and McIlroy, 1993) makes two passes over the array per digit: the first counts how many keys fall in each of the $k$ digit values, turning the counts into bucket boundaries; the second permutes elements into place by following a cycle of swaps, so each key is moved directly to its bucket with no auxiliary array. It then recurses on each bucket for the next digit. The in-place permutation trades counting sort's $Θ (n)$ scratch space for a swap-heavy inner loop, and because it is MSD it can stop early on distinguishing prefixes — the standard choice for sorting large string sets where keys share long common prefixes.
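The two passes per digit can be sketched as follows, assuming non-negative integer keys and a byte-sized radix of $256$ by default. This is a minimal illustration of the count-then-permute idea, not McIlroy, Bostic, and McIlroy's tuned code: it works on integers rather than strings, recurses all the way down instead of handing small buckets to a simpler sort, and, as an in-place permutation, is not stable. The helper names are hypothetical.

```python
def _digit(value: int, place: int, radix: int) -> int:
    return (value // place) % radix


def american_flag_sort(values: list[int], radix: int = 256) -> None:
    """Sort non-negative integers in place, most significant digit first."""
    if any(value < 0 for value in values):
        raise ValueError("this sketch handles non-negative integers only")
    if len(values) < 2:
        return
    # place value of the most significant digit of the largest key
    place = 1
    largest = max(values)
    while place * radix <= largest:
        place *= radix
    _sort_range(values, 0, len(values), place, radix)


def _sort_range(values: list[int], lo: int, hi: int, place: int, radix: int) -> None:
    """Permute values[lo:hi] into buckets by the digit at `place`, then recurse."""
    if hi - lo < 2 or place == 0:
        return
    # pass 1: count keys per digit value and turn counts into bucket boundaries
    counts = [0] * radix
    for i in range(lo, hi):
        counts[_digit(values[i], place, radix)] += 1
    starts = [0] * radix
    total = lo
    for d in range(radix):
        starts[d] = total
        total += counts[d]
    ends = [starts[d] + counts[d] for d in range(radix)]
    # pass 2: cycle of swaps; nxt[d] is the first slot of bucket d not yet settled
    nxt = list(starts)
    for d in range(radix):
        while nxt[d] < ends[d]:
            value = values[nxt[d]]
            target = _digit(value, place, radix)
            if target == d:
                nxt[d] += 1
            else:
                # send value to its own bucket; the displaced key comes back to nxt[d]
                slot = nxt[target]
                values[nxt[d]], values[slot] = values[slot], value
                nxt[target] += 1
    # recurse into each bucket on the next lower digit
    for d in range(radix):
        _sort_range(values, starts[d], ends[d], place // radix, radix)
```

No auxiliary copy of the keys is made: the only extra storage per level is the three arrays of $k$ counters and boundaries.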

Adaptive bucketing: spreadsort. Bucket sort's fragility is its fixed uniform partition; real data is rarely uniform. Spreadsort (Ross, 2002; shipped in the Boost C++ libraries) is a hybrid that inspects the actual range of the keys, sizes its buckets to that range rather than assuming $[0, 1)$ , and recursively spreads or falls back to a comparison sort when a bucket is small enough that partitioning no longer pays. It interpolates between radix sort's digit-splitting and quicksort's divide-and-conquer, achieving close to linear time on real numeric data without bucket sort's uniform-distribution assumption or radix sort's fixed digit width.
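A minimal sketch of the range-adaptive idea follows. It is not Boost's spreadsort, which chooses bucket counts and fallback points with its own tuned heuristics; here the hypothetical parameters `small` and `max_depth` stand in for those decisions. It assumes finite numeric keys whose range $\max - \min$ does not overflow.

```python
from collections.abc import Sequence


def adaptive_bucket_sort(
    values: Sequence[float],
    small: int = 16,
    max_depth: int = 8,
    depth: int = 0,
) -> list[float]:
    """Range-adaptive bucket sort with a comparison-sort fallback.

    Buckets are sized to each sublist's observed [min, max] instead of a
    fixed interval such as [0, 1).
    """
    # small or deeply recursed sublists: partitioning no longer pays, so compare
    if len(values) <= small or depth >= max_depth:
        return sorted(values)
    low, high = min(values), max(values)
    if low == high:
        return list(values)
    count = len(values)
    span = high - low
    buckets: list[list[float]] = [[] for _ in range(count)]
    for value in values:
        # map [low, high] monotonically onto bucket indices 0..count-1
        index = int((value - low) / span * (count - 1))
        buckets[min(index, count - 1)].append(value)  # clamp guards float rounding
    result: list[float] = []
    for bucket in buckets:
        result.extend(adaptive_bucket_sort(bucket, small, max_depth, depth + 1))
    return result
```

Because the bucket index never decreases as the key grows, concatenating the recursively sorted buckets yields a sorted list whatever the distribution; skewed data only costs extra levels, and `max_depth` caps those before handing off to `sorted()`.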

Where linear sorts actually run. Radix sort is the standard high-throughput sort on GPUs: a GPU has thousands of lanes but suffers from the branch divergence of a comparison sort's data-dependent control flow, whereas a radix pass is a fixed sequence of counts and scatters that maps cleanly onto parallel prefix-sums (Merrill and Grimshaw, 2011). Column-store databases likewise radix-sort fixed-width integer and date columns, and MapReduce-style systems partition keys by a radix-like hash to route them to reducers. The common pattern: when the keys have exploitable structure, computing with them beats comparing them, and the advantage is largest on wide parallel hardware and data too large to shuffle randomly.⁵
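To see why a radix pass suits wide parallel hardware, the sketch below restructures one stable counting-sort pass into three phases: independent per-block digit histograms, a single exclusive prefix sum, and independent per-block scatters. The phase layout and names are illustrative assumptions, not Merrill and Grimshaw's implementation, and the "blocks" simply run one after another in Python.

```python
from itertools import accumulate


def blocked_radix_pass(values: list[int], place: int, radix: int, block_size: int) -> list[int]:
    """One stable LSD pass organized as histogram, scan, and scatter phases."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    # phase 1 (independent per block): histogram of the digit at `place`
    histograms = [[0] * radix for _ in blocks]
    for b, block in enumerate(blocks):
        for value in block:
            histograms[b][(value // place) % radix] += 1
    # phase 2 (global): exclusive scan in digit-major, block-minor order, so
    # each block's run of digit d starts after smaller digits and earlier blocks
    flat = [histograms[b][d] for d in range(radix) for b in range(len(blocks))]
    offsets = list(accumulate(flat, initial=0))[:-1]
    # phase 3 (independent per block): scatter from the block's own offsets
    output = [0] * len(values)
    for b, block in enumerate(blocks):
        cursor = [offsets[d * len(blocks) + b] for d in range(radix)]
        for value in block:
            digit = (value // place) % radix
            output[cursor[digit]] = value
            cursor[digit] += 1
    return output


def blocked_radix_sort(values: list[int], radix: int = 16, block_size: int = 4) -> list[int]:
    """LSD radix sort of non-negative integers built from blocked passes."""
    if any(value < 0 for value in values):
        raise ValueError("non-negative integers only")
    result = list(values)
    largest = max(result, default=0)
    place = 1
    while True:
        result = blocked_radix_pass(result, place, radix, block_size)
        place *= radix
        if place > largest:
            return result
```

Every phase is a fixed sequence of counts, one scan, and writes, with no data-dependent branching on comparisons, which is the property the text credits for radix sort's fit to GPUs.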

Takeaways

The $Ω (n log n)$ bound binds only comparison sorts; using keys as array indices or digit sequences sidesteps it entirely.
Counting sort ranks keys by prefix-summing their counts: $Θ (n + k)$ , stable, linear when $k = O (n)$ but impractical when $k$ is large.
Radix sort stably sorts digit by digit, least significant first; stability is what preserves earlier passes, giving $Θ (d (n + k))$ .
Bucket sort scatters uniform keys into $n$ buckets and sorts each; expected $Θ (n)$ , but $Θ (n^{2})$ if the distribution is adversarial.
Each linear sort trades generality for a structural assumption on the keys, so choose by what you actually know about your data.

Footnotes

1. Erickson, Algorithms, Ch. "Sorting Beyond Comparisons": treating keys as readable data sidesteps the decision-tree argument and permits linear-time sorting.
2. CLRS, §8.2, "Counting Sort": counting sort runs in $Θ(n + k)$, is stable, and is linear when $k = O(n)$.
3. CLRS, §8.3, "Radix Sort": sorting least-significant digit first with a stable subsort yields a correct sort in $Θ(d(n + k))$.
4. CLRS, §8.4, "Bucket Sort": scattering uniformly distributed keys into $n$ buckets gives expected $Θ(n)$ running time.
5. Skiena, The Algorithm Design Manual, §4, "Sorting and Searching": choosing the right sort by matching the algorithm to the structure of the keys.
