DP Optimizations

The previous lessons built DP recurrences and trusted their dimensions to give the running time: a table of $S$ states, each filled by a transition that scans $T$ predecessors, costs $O (S \cdot T)$ . Often $T$ is itself $Θ (n)$ , a min over all earlier states or a split point ranging over an interval, and the honest recurrence runs in $O (n^{2})$ or $O (n^{3})$ . The techniques in this lesson all share one move: they observe that the transition is not an arbitrary min over predecessors but one with structure, and they maintain an auxiliary object (a deque, a hull of lines, a monotone split pointer) that answers each transition faster than a fresh scan.¹ None of them changes what the DP computes, only how fast it computes it.

A useful test runs through the whole lesson: look at the shape of the inner min/max and ask what stays constant and what slides as the outer index advances. The answer names the technique.

The cost model here is the state-count times transition-work product; these techniques attack the second factor.

Monotonic-queue optimization

Consider a transition of the form

$$dp[i] = \min_{i-k \le j < i} dp[j] + cost(i),$$

where each state takes a min (or max) of the previous states within a sliding window of width $k$ , then adds a term depending only on $i$ . Evaluated directly this is $O (nk)$ : every state rescans its window. But the window's left edge only ever moves right, and its right edge only ever moves right, so this is the sliding-window minimum problem, which a monotonic deque solves in $O (1)$ amortized per step.²

Algorithm: $\textsc{MonoQueue-DP}$: evaluate $dp[i]=\min_{i-k\le j<i}dp[j]+\text{cost}(i)$ in $O(n)$

  $Q \gets$ empty deque of indices  // $dp$ increasing front $\to$ back
  $dp[0] \gets \text{base}$ ; push $0$ onto $Q$
  for $i \gets 1$ to $n$ do
    while $Q$ nonempty and $front(Q) < i-k$ do
      pop front of $Q$  // left the window
    $dp[i] \gets dp[front(Q)] + \text{cost}(i)$
    while $Q$ nonempty and $dp[back(Q)] \ge dp[i]$ do
      pop back of $Q$  // $i$ dominates older candidate
    push $i$ onto $Q$

Each index is pushed and popped at most once, so the total work is $O (n)$ , down from $O (nk)$ . Jump Game VI is the canonical instance: $d p [i]$ is the best score reachable at index $i$ , equal to $max$ of $d p [j]$ for $j$ in the last $k$ positions, plus $nums [i]$ : a sliding-window max, the same deque with the inequality flipped. Constrained Subsequence Sum is the same recurrence with $cost (i) = nums [i]$ and a $max (\cdot, 0)$ reset. This is the monotonic-stack idea from the sequences module, extended to a deque so that both ends move.

For a full trace, run Jump Game VI on $nums = [1, - 1, - 2, 4, - 7, 3]$ with jump width $k = 2$ , so $d p [i] = max_{i - 2 \leq j < i} d p [j] + nums [i]$ and $d p [0] = 1$ . The deque stores indices whose $d p$ values decrease from front to back; the front is always the window maximum. Each row shows the state after processing index $i$ :

$i$	window $[i-2, i-1]$	front $dp$	$dp[i]$	deque (index : $dp$)
$0$	(none)	(none)	$1$	$[0 : 1]$
$1$	$\{0\}$	$dp[0] = 1$	$1 + (-1) = 0$	$[0 : 1, 1 : 0]$
$2$	$\{0, 1\}$	$dp[0] = 1$	$1 + (-2) = -1$	$[0 : 1, 1 : 0, 2 : -1]$
$3$	$\{1, 2\}$	$dp[1] = 0$	$0 + 4 = 4$	$[3 : 4]$
$4$	$\{2, 3\}$	$dp[3] = 4$	$4 + (-7) = -3$	$[3 : 4, 4 : -3]$
$5$	$\{3, 4\}$	$dp[3] = 4$	$4 + 3 = 7$	$[5 : 7]$

Two evictions do the real work. At $i = 3$ index $0$ has slid out of the window ( $0 < 3 - 2$ ), so it is popped from the front; then $d p [3] = 4$ dominates the stale $1 : 0$ and $2 : - 1$ at the back, clearing them. At $i = 5$ the same back-eviction wipes $3 : 4$ and $4 : - 3$ , leaving only the fresh maximum. The answer $d p [5] = 7$ is read straight off, and across all six steps no index is touched more than twice.
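The trace above can be replayed with a short sketch. The function name `jump_game_scores` is a hypothetical helper, not part of the lesson's files, and it assumes a non-empty `nums` and `k >= 1`.

```python
from collections import deque


def jump_game_scores(nums: list[int], k: int) -> list[int]:
    """dp[i] = nums[i] + max(dp[j] for j in [i-k, i-1]), with dp[0] = nums[0].

    Hypothetical helper for illustration; assumes nums is non-empty and k >= 1.
    """
    dp = [0] * len(nums)
    dp[0] = nums[0]

    # Indices whose dp values decrease from front to back; front = window max.
    window: deque[int] = deque([0])
    for i in range(1, len(nums)):
        # Front eviction: the index has slid out of [i-k, i-1].
        while window and window[0] < i - k:
            window.popleft()
        dp[i] = dp[window[0]] + nums[i]

        # Back eviction: dp[i] ties or beats these, and i outlives them.
        while window and dp[window[-1]] <= dp[i]:
            window.pop()
        window.append(i)
    return dp


print(jump_game_scores([1, -1, -2, 4, -7, 3], 2))  # [1, 0, -1, 4, -3, 7]
```

The printed list matches the $dp[i]$ column of the table, and its last entry is the answer $7$.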

Figure: a sliding window $[i-k, i-1]$ over the $dp$ row; the deque front holds the window optimum.

monotonic_queue_dp.py (Python)

from collections import deque
from typing import Callable

def windowed_min_dp(
  length: int,
  base: float,
  cost: Callable[[int], float],
  window: int,
) -> list[float]:
  """
    Fill dp[0..length] where dp[0] = base and, for every later index,\n
    dp[index] = min(dp[j] for j in [index-window, index-1]) + cost(index).\n
    The candidate indices ride a deque whose front is always the window\n
    minimum; a new index evicts every back entry whose dp value it ties or\n
    beats, since that older candidate can never win again. Runs in O(length).\n
  """
  dp: list[float] = [0.0 for _ in range(length + 1)]
  dp[0] = base

  # candidate indices, dp-increasing from front to back.
  candidates: deque[int] = deque([0])
  for index in range(1, length + 1):
    # drop indices that have slid out of the left edge of the window.
    while candidates and candidates[0] < index - window:
      candidates.popleft()
    dp[index] = dp[candidates[0]] + cost(index)

    # evict back candidates this index now dominates, then enqueue it.
    while candidates and dp[candidates[-1]] >= dp[index]:
      candidates.pop()
    candidates.append(index)

  return dp

def constrained_subsequence_sum(numbers: list[int], window: int) -> int:
  """
    Maximum sum of a non-empty subsequence of `numbers` in which every two\n
    consecutive chosen indices are at most `window` apart. Here best[index] is the best\n
    sum of such a subsequence ending exactly at `index`:\n
      best[index] = numbers[index] + max(0, best[j] for j in window before).\n
    A monotone-decreasing deque supplies the window maximum in O(n) overall.\n
  """
  if not numbers:
    return 0
  best: list[int] = [0 for _ in range(len(numbers))]
  candidates: deque[int] = deque()
  answer: int = numbers[0]
  for index in range(len(numbers)):
    # drop indices that have slid out of the left edge of the window.
    while candidates and candidates[0] < index - window:
      candidates.popleft()

    # extend by this element, never paying for a negative window prefix.
    window_best: int = best[candidates[0]] if candidates else 0
    best[index] = numbers[index] + max(window_best, 0)
    answer = max(answer, best[index])

    # keep the deque decreasing in best value, front = window maximum.
    while candidates and best[candidates[-1]] <= best[index]:
      candidates.pop()
    candidates.append(index)

  return answer

Convex hull trick

Now suppose each previous state contributes a line and the transition queries the best line at a point:

$$dp[i] = \min_{j < i} (m_{j} \cdot x_{i} + b_{j}),$$

where the slope $m_{j}$ and intercept $b_{j}$ depend only on $j$ (typically $m_{j}$ comes from the problem data and $b_{j}$ folds in $dp[j]$), and $x_{i}$ depends only on $i$. The naive evaluation is $O(n^{2})$. But $\min_{j} (m_{j} x + b_{j})$ over a fixed set of lines, as a function of $x$, is the lower envelope of those lines, a concave, piecewise-linear curve. Only the lines on that lower hull can ever be optimal; the rest are dominated everywhere. Maintaining that hull and querying it is the Convex Hull Trick.¹
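As a hedged illustration of where such lines come from, assume a hypothetical squared-distance cost $(x_i - x_j)^2$ between states (an assumption for this sketch, not one of the lesson's problems). Expanding the square separates what depends on $j$ from the query point $x_i$:

```latex
\begin{aligned}
dp[i] &= \min_{j<i}\bigl(dp[j] + (x_i - x_j)^2\bigr) \\
      &= x_i^2 + \min_{j<i}\bigl(\underbrace{-2x_j}_{m_j}\,x_i
         + \underbrace{dp[j] + x_j^2}_{b_j}\bigr).
\end{aligned}
```

Under that assumption state $j$ contributes the line with slope $-2x_j$ and intercept $dp[j] + x_j^2$, and the leftover $x_i^2$ depends only on $i$, so it is added after the hull query. If the $x_j$ are non-decreasing in $j$, the slopes arrive in non-increasing order.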

Figure: each state is a line; the optimal $dp[i]$ is the lower hull queried at $x_{i}$.

If lines are inserted in monotone slope order and queries $x_{i}$ are also monotone, both operations are $O(1)$ amortized: push lines onto a stack-like hull, popping any that the newcomer makes redundant, and advance a pointer for queries. If only the queries lose monotonicity, store the hull and binary-search for the optimal line at $x_{i}$ in $O(\log n)$; if the slopes arrive in arbitrary order, use a Li Chao tree. The DP drops from $O(n^{2})$ to $O(n)$ in the fully monotone case and to $O(n \log n)$ otherwise.
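The binary-search variant can be sketched on a small hand-picked hull. The lines, the `(slope, intercept)` tuple representation, and the helper names are assumptions for illustration. The hull is taken as already built, with strictly decreasing slopes and no redundant line.

```python
from bisect import bisect_left
from fractions import Fraction

# (slope, intercept); a hypothetical representation chosen for this sketch.
Line = tuple[int, int]


def hull_breakpoints(hull: list[Line]) -> list[Fraction]:
    """x-coordinates where each hull line hands over to the next one."""
    return [
        Fraction(b_next - b_here, m_here - m_next)
        for (m_here, b_here), (m_next, b_next) in zip(hull, hull[1:])
    ]


def query_min(hull: list[Line], breakpoints: list[Fraction], x: int) -> int:
    """Minimum over the hull at x, for queries in any order, in O(log n)."""
    # Line p is optimal between breakpoints[p-1] and breakpoints[p].
    position = bisect_left(breakpoints, x)
    slope, intercept = hull[position]
    return slope * x + intercept


HULL: list[Line] = [(2, 0), (0, 3), (-2, 8)]
BREAKS = hull_breakpoints(HULL)  # crossings at 3/2 and 5/2

# Queries deliberately out of order; each must match the brute-force scan.
for x in (4, 0, 2, -3, 10):
    assert query_min(HULL, BREAKS, x) == min(m * x + b for m, b in HULL)
```

Exact `Fraction` breakpoints avoid floating-point ties at the crossings. A production version might compare cross-multiplied integers instead.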

For example, suppose three previous states have contributed the lines $ℓ_{0}(x) = x + 3$, $ℓ_{1}(x) = 4$, and $ℓ_{2}(x) = -x + 6$ (slopes $+1$, $0$, $-1$). Their lower envelope, the pointwise minimum, is a concave cap: it rises along $ℓ_{0}$, runs flat along $ℓ_{1}$, and falls along $ℓ_{2}$. Checking crossings, $ℓ_{0} = ℓ_{2}$ at $x + 3 = -x + 6 \Rightarrow x = \frac{3}{2}$, where both equal $4.5$. But the flat line $ℓ_{1} = 4$ already sits below that meeting point, so $ℓ_{0}$ and $ℓ_{2}$ never touch the envelope near their crossing: $ℓ_{1}$ undercuts both. Solving $ℓ_{0} = ℓ_{1}$ gives $x = 1$ and $ℓ_{1} = ℓ_{2}$ gives $x = 2$, so the envelope is $ℓ_{0}$ on $(-\infty, 1]$, then $ℓ_{1}$ on $[1, 2]$, then $ℓ_{2}$ on $[2, \infty)$; all three lines survive on the hull. For a query $x_{i} = 1.5$ the naive scan evaluates all three ($ℓ_{0} = 4.5$, $ℓ_{1} = 4$, $ℓ_{2} = 4.5$) and takes the min, $4$. With monotone queries the hull instead advances a pointer to the $ℓ_{1}$ segment and reads $4$ in $O(1)$ amortized, never returning to $ℓ_{0}$. Had $ℓ_{1}$ been $ℓ_{1} = 5$ instead, the two crossings would invert ($ℓ_{0} = ℓ_{1}$ at $x = 2$, $ℓ_{1} = ℓ_{2}$ at $x = 1$) and $ℓ_{1}$ would be dominated everywhere; inserting $ℓ_{2}$ would then find $ℓ_{1}$ redundant and pop it, leaving the two-line hull $\{ℓ_{0}, ℓ_{2}\}$ meeting at $x = \frac{3}{2}$.
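The arithmetic in this example can be checked mechanically. The sketch below uses hypothetical helper names and exact fractions. It recomputes the crossings and applies the usual redundancy test: the middle line is dropped when the newcomer overtakes the older line no later than the middle line does.

```python
from fractions import Fraction

# The three example lines as (slope, intercept), in decreasing slope order.
LINE_0, LINE_1, LINE_2 = (1, 3), (0, 4), (-1, 6)


def crossing(first: tuple[int, int], second: tuple[int, int]) -> Fraction:
    """x-coordinate where two non-parallel lines meet."""
    (m1, b1), (m2, b2) = first, second
    return Fraction(b2 - b1, m1 - m2)


def middle_is_redundant(last, middle, incoming) -> bool:
    """True when `incoming` overtakes `last` no later than `middle` does."""
    return crossing(last, incoming) <= crossing(last, middle)


def envelope_min(lines, x) -> Fraction:
    """Brute-force lower envelope: the smallest line value at x."""
    return min(slope * x + intercept for slope, intercept in lines)


print(crossing(LINE_0, LINE_1), crossing(LINE_1, LINE_2))      # 1 2
print(envelope_min([LINE_0, LINE_1, LINE_2], Fraction(3, 2)))  # 4
print(middle_is_redundant(LINE_0, LINE_1, LINE_2))             # False
print(middle_is_redundant(LINE_0, (0, 5), LINE_2))             # True
```

With $ℓ_{1} = 4$ the middle line stays on the hull; raised to $5$, the same test reports it redundant.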

convex_hull_trick.py (Python)

class Line:
  """
    A line y = slope * x + intercept, one candidate in the lower envelope.\n
  """

  def __init__(self, slope: float, intercept: float) -> None:
    self.slope: float = slope
    self.intercept: float = intercept

  def value_at(self, x: float) -> float:
    """
      The line's height at coordinate `x`.\n
    """
    return self.slope * x + self.intercept

  def __repr__(self) -> str:
    return f"Line(slope={self.slope}, intercept={self.intercept})"

class MinLineContainer:
  """
    A lower hull of lines for minimum queries, with monotone insertion.\n
    Lines must arrive in non-increasing slope order; queries must arrive in\n
    non-decreasing x order. Both then run in O(1) amortized.\n
  """

  def __init__(self) -> None:
    self._hull: list[Line] = []

    # pointer into the hull for the next monotone query.
    self._query_index: int = 0

  def _is_redundant(self, last: Line, middle: Line, incoming: Line) -> bool:
    """
      Whether `middle` is never minimal once `incoming` joins `last`.\n
      True exactly when the intersection of `last` and `incoming` lies at or\n
      before the intersection of `last` and `middle` — the cross-multiplied\n
      form, which avoids division.\n
    """
    left = (incoming.intercept - last.intercept) * (last.slope - middle.slope)
    right = (middle.intercept - last.intercept) * (last.slope - incoming.slope)
    return left <= right

  def add_line(self, slope: float, intercept: float) -> None:
    """
      Insert a line, dropping any hull line the newcomer makes redundant.\n
      Slopes must be supplied in non-increasing order.\n
    """
    incoming = Line(slope, intercept)

    # a duplicate slope only survives if it lies strictly lower.
    if self._hull and self._hull[-1].slope == slope:
      if self._hull[-1].intercept <= intercept:
        return
      self._hull.pop()

    # pop hull lines the newcomer renders redundant, then append it.
    while len(self._hull) >= 2 and self._is_redundant(
      self._hull[-2], self._hull[-1], incoming
    ):
      self._hull.pop()
    self._hull.append(incoming)

    # keep the query pointer inside the (possibly shrunk) hull.
    if self._query_index >= len(self._hull):
      self._query_index = len(self._hull) - 1

  def query(self, x: float) -> float:
    """
      Minimum line value at `x`, for non-decreasing `x` across calls.\n
      Advances a pointer past lines that the rising query has outgrown.\n
    """
    if self._query_index >= len(self._hull):
      self._query_index = len(self._hull) - 1
    while (
      self._query_index + 1 < len(self._hull)
      and self._hull[self._query_index + 1].value_at(x)
      <= self._hull[self._query_index].value_at(x)
    ):
      self._query_index += 1
    return self._hull[self._query_index].value_at(x)

def query_lines_minimum(lines: list[tuple[float, float]], x: float) -> float:
  """
    Brute-force minimum of a set of (slope, intercept) lines at `x`.\n
    The reference the hull container must match for every query.\n
  """
  return min(slope * x + intercept for slope, intercept in lines)

Divide-and-conquer optimization

For a layered transition

$$dp[i][j] = \min_{k < j} (dp[i-1][k] + C(k, j)),$$

let $opt(i, j)$ be the smallest $k$ achieving that minimum. If $opt$ is monotone in $j$ (that is, $opt(i, j) \le opt(i, j+1)$ for every fixed layer $i$), then the search range for column $j$ is bounded by the answers of its neighbors, and we can solve a whole layer by divide and conquer:

Algorithm: $\textsc{DC-Opt}(i, j_{lo}, j_{hi}, k_{lo}, k_{hi})$: fill layer $i$, columns $[j_{lo},j_{hi}]$

  if $j_{lo} > j_{hi}$ then return
  $j_{mid} \gets \lfloor (j_{lo}+j_{hi})/2 \rfloor$
  $best \gets \infty$ ; $opt \gets k_{lo}$
  for $k \gets k_{lo}$ to $\min(j_{mid}-1,\,k_{hi})$ do
    if $dp[i-1][k] + C(k,j_{mid}) < best$ then
      $best \gets dp[i-1][k] + C(k,j_{mid})$ ; $opt \gets k$
  $dp[i][j_{mid}] \gets best$
  $\textsc{DC-Opt}(i,\ j_{lo},\ j_{mid}-1,\ k_{lo},\ opt)$
  $\textsc{DC-Opt}(i,\ j_{mid}+1,\ j_{hi},\ opt,\ k_{hi})$

Solve the middle column $j_{mid}$ first by scanning its full allowed $k$-range; its optimum $opt$ then caps the left half's search and floors the right half's, so the two recursive calls split both the columns and the candidate range:

Figure: divide-and-conquer optimization. Solving $j_{mid}$ finds $opt(j_{mid})$; by monotonicity the left columns search only $[k_{lo}, opt]$ and the right only $[opt, k_{hi}]$, halving the column span while the $k$-ranges overlap only at $opt$.

At each recursion depth the $k$-ranges across all sub-calls overlap by at most their endpoints, so one depth costs $O(n)$; there are $O(\log n)$ depths per layer and $O(k)$ layers (here $k$ counts the layers, not the split index), giving $O(kn \log n)$ instead of $O(kn^{2})$. The monotonicity of $opt$ is the hypothesis you must verify; it holds whenever $C$ satisfies the quadrangle inequality (below), but is sometimes provable directly from the problem.
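Before trusting the speedup, it can help to test the monotonicity hypothesis by brute force on small inputs. The sketch below is a sanity check under assumed names and toy data: passing it is evidence, not a proof. It computes the smallest optimal split for every column of one layer and checks that the splits never decrease.

```python
from typing import Callable, Optional


def smallest_optimal_splits(
    previous: list[float],
    cost: Callable[[int, int], float],
    columns: int,
) -> list[Optional[int]]:
    """opt[j] = smallest k < j minimising previous[k] + cost(k, j), else None."""
    splits: list[Optional[int]] = []
    for column in range(columns):
        best_value = float("inf")
        best_split: Optional[int] = None
        for split in range(column):
            candidate = previous[split] + cost(split, column)
            if candidate < best_value:  # strict: keeps the smallest split
                best_value, best_split = candidate, split
        splits.append(best_split)
    return splits


def is_monotone(splits: list[Optional[int]]) -> bool:
    """True when the defined splits never decrease from left to right."""
    known = [split for split in splits if split is not None]
    return all(left <= right for left, right in zip(known, known[1:]))


# Toy instance (assumed data): squared group sums over prefix sums.
PREFIX = [0, 3, 4, 8, 9, 14]
previous_layer = [p * p for p in PREFIX]
squared_sum = lambda k, j: (PREFIX[j] - PREFIX[k]) ** 2
good = smallest_optimal_splits(previous_layer, squared_sum, len(PREFIX))
print(good, is_monotone(good))  # [None, 0, 1, 2, 2, 3] True

# A contrived cost whose favourite split moves left as the column grows.
contrived = lambda k, j: 0 if k == 3 - j else 1
bad = smallest_optimal_splits([0] * 5, contrived, 5)
print(bad, is_monotone(bad))    # [None, 0, 1, 0, 0] False
```

For the contrived cost the divide-and-conquer recursion would search the wrong ranges, which is why the check belongs before the optimization rather than after it.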

divide_and_conquer_dp.py (Python)

from typing import Callable

INFINITY: float = float("inf")

def divide_and_conquer_dp(
  layers: int,
  columns: int,
  cost: Callable[[int, int], float],
  base_layer: list[float],
) -> list[list[float]]:
  """
    Fill a `layers` x `columns` DP table where row 0 is `base_layer` and\n
      dp[layer][column] = min over split in [0, column) of\n
        dp[layer-1][split] + cost(split, column).\n
    `cost` must induce a split-optimum monotone in `column`. Returns the full\n
    table; dp[layers-1] is the final layer.\n
  """
  table: list[list[float]] = [
    [INFINITY for _ in range(columns)] for _ in range(layers)
  ]
  table[0] = list(base_layer)

  def solve_layer(
    layer: int,
    column_low: int,
    column_high: int,
    split_low: int,
    split_high: int,
  ) -> None:
    """
      Fill columns [column_low, column_high] of `layer`, knowing each one's\n
      optimal split lies in [split_low, split_high].\n
    """
    if column_low > column_high:
      return

    # solve the middle column first to pin down the split for both halves.
    column_mid: int = (column_low + column_high) // 2
    best: float = INFINITY
    best_split: int = split_low
    previous: list[float] = table[layer - 1]

    # the split must come strictly before the column being filled.
    upper: int = min(column_mid - 1, split_high)
    for split in range(split_low, upper + 1):
      candidate: float = previous[split] + cost(split, column_mid)
      if candidate < best:
        best = candidate
        best_split = split
    table[layer][column_mid] = best

    # monotonicity: left columns split no later, right columns no earlier.
    solve_layer(layer, column_low, column_mid - 1, split_low, best_split)
    solve_layer(layer, column_mid + 1, column_high, best_split, split_high)

  for layer in range(1, layers):
    solve_layer(layer, 0, columns - 1, 0, columns - 1)
  return table

def naive_layered_dp(
  layers: int,
  columns: int,
  cost: Callable[[int, int], float],
  base_layer: list[float],
) -> list[list[float]]:
  """
    Straightforward O(layers * columns^2) evaluation of the same recurrence,\n
    the reference the optimized version must reproduce exactly.\n
  """
  table: list[list[float]] = [
    [INFINITY for _ in range(columns)] for _ in range(layers)
  ]
  table[0] = list(base_layer)

  # for each cell, scan every earlier split with no monotonicity shortcut.
  for layer in range(1, layers):
    for column in range(columns):
      best: float = INFINITY
      for split in range(column):
        candidate: float = table[layer - 1][split] + cost(split, column)
        if candidate < best:
          best = candidate
      table[layer][column] = best

  return table

Knuth's optimization

Interval DPs have the shape

$$dp[i][j] = \min_{i \le k < j} (dp[i][k] + dp[k+1][j]) + C(i, j),$$

and naively cost $O(n^{3})$: $O(n^{2})$ intervals, each scanning $O(n)$ split points. Knuth's optimization applies when $C$ satisfies the quadrangle inequality (QI) and is monotone on intervals, that is, for all $a \le b \le c \le d$:

$$C(a, c) + C(b, d) \le C(a, d) + C(b, c) \quad \text{(QI)}, \qquad C(b, c) \le C(a, d) \quad \text{(monotonicity)}.$$

When both conditions hold, the optimal split point is monotone in both arguments:

$$opt[i][j-1] \le opt[i][j] \le opt[i+1][j].$$

So when filling $dp[i][j]$ we only scan split points $k$ in $[opt[i][j-1], opt[i+1][j]]$ rather than all of $[i, j)$. Summed over a fixed interval length, those ranges telescope, and the total work collapses to $O(n^{2})$. This is the optimization behind optimal binary search trees and merge-cost interval DPs such as optimal adjacent merging: their cost functions are interval weight sums, which satisfy QI, so Knuth's optimization applies and each runs in $O(n^{2})$. Matrix-chain multiplication from the interval-DP lesson does not fit the template as written, because its merge cost depends on the split point $k$ as well as on the interval endpoints.
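The two hypotheses can be brute-force checked on small instances before relying on the narrowed scan. The sketch below assumes the standard statements, for all $a \le b \le c \le d$: $C(a,c) + C(b,d) \le C(a,d) + C(b,c)$ for the quadrangle inequality and $C(b,c) \le C(a,d)$ for monotonicity. The helper names and sample weights are hypothetical.

```python
from itertools import combinations_with_replacement
from typing import Callable

Cost = Callable[[int, int], float]


def satisfies_quadrangle_inequality(cost: Cost, count: int) -> bool:
    """Check C(a,c) + C(b,d) <= C(a,d) + C(b,c) for all a <= b <= c <= d."""
    # combinations_with_replacement yields each tuple in sorted order.
    for a, b, c, d in combinations_with_replacement(range(count), 4):
        if cost(a, c) + cost(b, d) > cost(a, d) + cost(b, c):
            return False
    return True


def is_monotone_on_intervals(cost: Cost, count: int) -> bool:
    """Check C(b,c) <= C(a,d) whenever [b,c] sits inside [a,d]."""
    for a, b, c, d in combinations_with_replacement(range(count), 4):
        if cost(b, c) > cost(a, d):
            return False
    return True


# Assumed sample data: the cost of an interval is the sum of its weights.
WEIGHTS = [4, 2, 6, 3]
PREFIX = [0]
for weight in WEIGHTS:
    PREFIX.append(PREFIX[-1] + weight)
interval_sum = lambda i, j: PREFIX[j + 1] - PREFIX[i]

print(satisfies_quadrangle_inequality(interval_sum, len(WEIGHTS)))  # True
print(is_monotone_on_intervals(interval_sum, len(WEIGHTS)))         # True

# A cost that rewards long intervals breaks the inequality.
negative_square = lambda i, j: -((j - i) ** 2)
print(satisfies_quadrangle_inequality(negative_square, 4))          # False
```

An exhaustive check is $O(n^{4})$, so it suits small randomized tests only. For interval sums the inequality holds with equality, which is the easy case.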

Figure: Knuth's optimization. Filling $dp[i][j]$ scans only $k \in [opt[i][j-1], opt[i+1][j]]$, a window pinned by two already-known optimal splits, not all of $[i, j)$.

knuth_optimization.py (Python)

from typing import Callable

INFINITY: float = float("inf")

def knuth_interval_dp(
  count: int,
  cost: Callable[[int, int], float],
) -> float:
  """
    Minimum total cost of fully merging items 0..count-1 under\n
      dp[i][j] = min over k in [i, j) of (dp[i][k] + dp[k+1][j]) + cost(i, j),\n
    with cost(i, i) = 0. `cost` must satisfy the quadrangle inequality so the\n
    optimal split is monotone. Returns dp[0][count-1]; count == 0 gives 0.\n
  """
  if count <= 1:
    return 0.0

  # dp[i][j] = best merge cost; opt[i][j] = the split that achieves it.
  dp: list[list[float]] = [[0.0 for _ in range(count)] for _ in range(count)]
  opt: list[list[int]] = [[0 for _ in range(count)] for _ in range(count)]

  # length-1 intervals are free and split trivially at themselves.
  for start in range(count):
    opt[start][start] = start

  # widen the interval; each width reuses the narrower optima as bounds.
  for width in range(1, count):
    for start in range(count - width):
      end: int = start + width

      # monotonicity confines the split to [opt[i][j-1], opt[i+1][j]].
      lower_split: int = opt[start][end - 1]
      upper_split: int = opt[start + 1][end] if start + 1 <= end else end - 1

      # scan only that confined range for the cheapest split.
      best: float = INFINITY
      best_split: int = lower_split
      for split in range(lower_split, min(upper_split, end - 1) + 1):
        candidate: float = dp[start][split] + dp[split + 1][end]
        if candidate < best:
          best = candidate
          best_split = split

      # record this interval's cost and the split that produced it.
      dp[start][end] = best + cost(start, end)
      opt[start][end] = best_split

  return dp[0][count - 1]

def naive_interval_dp(
  count: int,
  cost: Callable[[int, int], float],
) -> float:
  """
    Plain O(n^3) interval DP over every split, the reference Knuth must match.\n
  """
  if count <= 1:
    return 0.0

  dp: list[list[float]] = [[0.0 for _ in range(count)] for _ in range(count)]

  # widen the interval, trying every split for each one.
  for width in range(1, count):
    for start in range(count - width):
      end: int = start + width

      # take the cheapest split across the whole interval.
      best: float = INFINITY
      for split in range(start, end):
        candidate: float = dp[start][split] + dp[split + 1][end]
        if candidate < best:
          best = candidate
      dp[start][end] = best + cost(start, end)

  return dp[0][count - 1]

def optimal_bst_cost(frequencies: list[float]) -> float:
  """
    Minimum total of frequency * depth over binary trees whose leaves are the\n
    keys in order: the leaf-keyed (adjacent-merge) relative of the optimal-BST\n
    DP. Unlike the classic BST recurrence, no root key is removed at a split.\n
    The merge cost of an interval is the sum of its frequencies (every key\n
    gains one level when nested), which satisfies the quadrangle inequality,\n
    so Knuth's optimization applies.\n
  """
  count: int = len(frequencies)
  if count == 0:
    return 0.0
  prefix: list[float] = [0.0 for _ in range(count + 1)]
  for index in range(count):
    prefix[index + 1] = prefix[index] + frequencies[index]

  def interval_weight(start: int, end: int) -> float:
    return prefix[end + 1] - prefix[start]

  return knuth_interval_dp(count, interval_weight)

SOS DP (sum over subsets)

The last technique is combinatorial rather than geometric. Given a value $f [m]$ for every bitmask $m$ over $n$ bits, we want, for each mask $m$ , an aggregate over all of its submasks:

$$g[m] = \sum_{s \subseteq m} f[s].$$

Enumerating every submask of every mask costs $\sum_{m} 2^{popcount(m)} = 3^{n}$ steps (the classic submask-enumeration bound). Sum over subsets does it in $O(n 2^{n})$ by adding one bit-dimension at a time: process bits $0..n-1$, and when processing bit $b$, fold the version without bit $b$ into each mask that has bit $b$ set. It is a multidimensional prefix sum over the hypercube $\{0, 1\}^{n}$.
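The $3^{n}$ bound is easy to confirm by counting. This small sketch (the function name is a hypothetical helper) walks every submask of every mask and compares the total with the popcount formula and with $3^{n}$.

```python
def count_submask_pairs(bits: int) -> int:
    """Number of (mask, submask) pairs over `bits` bits, by direct enumeration."""
    total = 0
    for mask in range(1 << bits):
        submask = mask
        while True:
            total += 1
            if submask == 0:
                break
            # Step down to the next smaller submask of `mask`.
            submask = (submask - 1) & mask
    return total


for bits in range(9):
    by_popcount = sum(2 ** bin(mask).count("1") for mask in range(1 << bits))
    assert count_submask_pairs(bits) == by_popcount == 3 ** bits

print(count_submask_pairs(3))  # 27
```

Each bit is independently absent from the mask, in the mask only, or in both mask and submask, which is where the three choices per bit come from.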

Figure: SOS DP folds one bit at a time over the hypercube $\{0, 1\}^{3}$: when processing bit $b$, each mask with bit $b$ set absorbs $g[m \oplus (1 \ll b)]$ along that axis.

Algorithm: $\textsc{SOS}$: submask sums for all masks in $O(n\,2^n)$

  for $m \gets 0$ to $2^n-1$ do
    $g[m] \gets f[m]$
  for $b \gets 0$ to $n-1$ do
    for $m \gets 0$ to $2^n-1$ do
      if $m \mathbin{\&} (1 \ll b) \ne 0$ then
        $g[m] \gets g[m] + g[m \oplus (1 \ll b)]$

After processing bit $b$ , $g [m]$ holds the sum of $f$ over all submasks of $m$ that differ from $m$ only in bits $0.. b$ ; after all $n$ bits, it is the full submask sum. To see the fold in motion, take $n = 3$ with $f = [1, 2, 3, 4, 5, 6, 7, 8]$ indexed by masks $000 \dots 111$ . The array $g$ starts as a copy of $f$ and absorbs one axis per pass:

after	$000$	$001$	$010$	$011$	$100$	$101$	$110$	$111$
init	$1$	$2$	$3$	$4$	$5$	$6$	$7$	$8$
bit $0$	$1$	$3$	$3$	$7$	$5$	$11$	$7$	$15$
bit $1$	$1$	$3$	$4$	$10$	$5$	$11$	$12$	$26$
bit $2$	$1$	$3$	$4$	$10$	$6$	$14$	$16$	$36$

The final row is the submask sum. Check $g[111] = 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 = 36$ (all eight submasks) and $g[101] = f[000] + f[001] + f[100] + f[101] = 1 + 2 + 5 + 6 = 14$, both matching the table. The fold takes $3 \cdot 2^{3} = 24$ inner-loop steps (12 of which perform an addition) rather than the $3^{3} = 27$ of naive submask enumeration, a gap that widens fast: at $n = 20$ it is about $2 \times 10^{7}$ against $3.5 \times 10^{9}$.
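The rows of the table can be regenerated by snapshotting the array after each pass. The helper name `sos_passes` is an assumption for this sketch; it returns the initial copy followed by one row per processed bit.

```python
def sos_passes(values: list[int], bits: int) -> list[list[int]]:
    """Snapshots of the SOS array: the initial copy, then one per bit pass."""
    aggregate = list(values)
    snapshots = [list(aggregate)]
    for bit in range(bits):
        selector = 1 << bit
        for mask in range(1 << bits):
            # A mask with this bit set absorbs its bit-cleared twin.
            if mask & selector:
                aggregate[mask] += aggregate[mask ^ selector]
        snapshots.append(list(aggregate))
    return snapshots


for label, row in zip(("init", "bit 0", "bit 1", "bit 2"),
                      sos_passes([1, 2, 3, 4, 5, 6, 7, 8], 3)):
    print(label, row)
# init  [1, 2, 3, 4, 5, 6, 7, 8]
# bit 0 [1, 3, 3, 7, 5, 11, 7, 15]
# bit 1 [1, 3, 4, 10, 5, 11, 12, 26]
# bit 2 [1, 3, 4, 10, 6, 14, 16, 36]
```

List position equals the mask value, so index 5 is mask $101$ and index 7 is mask $111$.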

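Flipping the bit test, so that each mask with bit $b$ clear absorbs its bit-set twin, gives sums over supersets instead. The loop nesting (bits outside, masks inside) must stay as it is: with masks outside, a mask would absorb neighbours that are already fully summed and double-count their shared submasks. This is what drives the bitmask-DP lesson's harder counting problems: anything that asks you to aggregate over all subsets of every state at once.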

sos_dp.py (Python)

def sum_over_subsets(values: list[int], bits: int) -> list[int]:
  """
    For every mask over `bits` bits, the sum of `values` across all of its\n
    submasks. `values` must have length 2**bits. Returns a fresh list; the\n
    input is left untouched. Runs in O(bits * 2**bits).\n
  """
  size: int = 1 << bits
  aggregate: list[int] = list(values)

  # fold one bit-axis per pass: bit-set masks absorb their bit-cleared twin.
  for bit in range(bits):
    selector: int = 1 << bit
    for mask in range(size):
      if mask & selector:
        aggregate[mask] += aggregate[mask ^ selector]

  return aggregate

def sum_over_supersets(values: list[int], bits: int) -> list[int]:
  """
    For every mask, the sum of `values` across all of its supersets.\n
    The mirror of `sum_over_subsets`: fold the bit-set neighbour into every\n
    mask that has the bit cleared. Runs in O(bits * 2**bits).\n
  """
  size: int = 1 << bits
  aggregate: list[int] = list(values)

  # mirror image: bit-cleared masks absorb their bit-set twin.
  for bit in range(bits):
    selector: int = 1 << bit
    for mask in range(size):
      if not (mask & selector):
        aggregate[mask] += aggregate[mask ^ selector]

  return aggregate

def naive_subset_sums(values: list[int], bits: int) -> list[int]:
  """
    Direct O(3^bits) submask enumeration, the reference for the fast version.\n
  """
  size: int = 1 << bits
  result: list[int] = [0 for _ in range(size)]

  # walk every submask of each mask via the (s-1) & mask descent.
  for mask in range(size):
    submask: int = mask
    while True:
      result[mask] += values[submask]
      if submask == 0:
        break
      submask = (submask - 1) & mask

  return result

Choosing the technique

Technique	Transition shape	Complexity win
Monotonic queue	$dp[i] = \min_{i-k \le j < i} dp[j] + cost(i)$ (sliding window)	$O(nk) \to O(n)$
Convex hull trick	$dp[i] = \min_{j} (m_{j} x_{i} + b_{j})$ (line per state)	$O(n^{2}) \to O(n \log n)$
Divide & conquer	$dp[i][j] = \min_{k < j} (dp[i-1][k] + C(k, j))$, $opt$ monotone	$O(kn^{2}) \to O(kn \log n)$
Knuth	$dp[i][j] = \min_{i \le k < j} (dp[i][k] + dp[k+1][j]) + C(i, j)$, QI	$O(n^{3}) \to O(n^{2})$
SOS DP	$g[m] = \bigoplus_{s \subseteq m} f[s]$ (submask aggregate)	$O(3^{n}) \to O(n 2^{n})$

The origins of the speedups

Each technique here has a traceable pedigree, and the pedigrees explain why the conditions are what they are. Knuth's optimization is the oldest: Donald Knuth's 1971 paper Optimum binary search trees (Acta Informatica 1) showed that the $O(n^{3})$ dynamic program for optimal BSTs runs in $O(n^{2})$ because the optimal root of the interval $[i, j]$ lies between the optimal roots of $[i, j-1]$ and $[i+1, j]$. F. Frances Yao generalized the mechanism in Efficient dynamic programming using quadrangle inequalities (1980, STOC) and Speed-up in dynamic programming (1982, SIAM J. Algebraic Discrete Methods), isolating the quadrangle inequality as the exact structural hypothesis, which is why the condition carries her name (the Knuth/Yao QI) and covers optimal BSTs together with other interval DPs whose merge cost depends only on the interval.³

The convex hull trick grew out of computational geometry's lower-envelope machinery rather than a single DP paper; the general offline/online line-container that supports arbitrary insertion order is the Li Chao tree (attributed to the competitive-programming author Li Chao), a segment tree over $x$ -coordinates that stores at each node the line best there, answering point queries in $O (log n)$ without any slope-monotonicity assumption. Divide-and-conquer optimization is the algorithmic cousin of the same monotone-optimum idea; it needs only that $o pt (i, j)$ be monotone in $j$ , a strictly weaker condition than the full QI, which is why it applies to layered (exactly $k$ groups) partition DPs where Knuth does not.

Sum-over-subsets is a special case of the fast zeta / Möbius transform over the subset lattice, the combinatorial analog of the fast Fourier transform for the Boolean hypercube. Björklund, Husfeldt, Kaski, and Koivisto's work on subset convolution (Fourier meets Möbius: fast subset convolution, 2007, STOC) built on exactly this $O(n 2^{n})$ zeta transform to compute the full subset convolution in $O(n^{2} 2^{n})$, which in turn cracked several $\#P$-flavored counting problems and the graph-coloring polynomial.⁴ The monotonic-deque idea, meanwhile, is the sliding-window minimum, folklore since at least the 1980s and standard in streaming and signal processing (it is the linear-time morphological erosion of a 1-D signal). Modern competitive programming (see the open cp-algorithms reference) collects all five together because they answer one question (what structure does the inner loop have?) with the same discipline the asymptotic-analysis lesson applies to loops in general.

Takeaways

These are not new DPs but faster evaluations of an existing recurrence; the trigger is always structure in the inner min/max, not its mere size.
Monotonic-queue optimization: a sliding-window min/max transition runs in $O (n)$ via a monotonic deque whose front is the window optimum — the natural tool for Jump Game VI and Constrained Subsequence Sum.
Convex hull trick: when each state is a line and the transition queries the lower envelope at $x_{i}$ , maintain the hull and query in $O (log n)$ (or $O (1)$ amortized when slopes and queries are monotone), turning $O (n^{2})$ into $O (n log n)$ .
Divide-and-conquer optimization needs a monotone optimal split $opt(i, j)$; Knuth's optimization gets that monotonicity for interval DPs from the quadrangle inequality, dropping $O(n^{3})$ to $O(n^{2})$ on problems like optimal BST and optimal adjacent merging.
SOS DP aggregates over every submask of every mask in $O (n 2^{n})$ by summing one bit-dimension at a time — a prefix sum over the subset lattice.
Verify the applicability condition (window monotonicity, linear cost, monotone split, QI) before reaching for the speedup; the optimization is only correct when its structural hypothesis holds.

Footnotes

1. Erickson, Ch. — Dynamic Programming: DP optimizations (the convex hull trick, divide-and-conquer, and Knuth's optimization) treated as transition-acceleration techniques over a fixed recurrence; CLRS §15.2/§15.5 for the interval-DP instances (matrix-chain, optimal BST), of which Knuth's optimization accelerates optimal BST.
2. Skiena, § — Dynamic Programming: recognizing that a DP's cost is the product of state count and per-state transition work, and attacking the transition.
3. Knuth, Optimum binary search trees, Acta Informatica 1 (1971), and F. F. Yao, Speed-up in dynamic programming, SIAM J. Algebraic Discrete Methods 3 (1982): the $O(n^{2})$ optimal-BST algorithm and the quadrangle-inequality generalization behind Knuth's optimization.
4. Björklund, Husfeldt, Kaski, Koivisto, Fourier meets Möbius: fast subset convolution, STOC 2007: the fast zeta/Möbius transform over the subset lattice ($O(n 2^{n})$) that SOS DP computes, and its use in $O(n^{2} 2^{n})$ subset convolution.
