Data Structures/Disjoint Sets (Union-Find)

Lesson 4.61,971 words

Disjoint Sets (Union-Find)

The disjoint-set data structure tracks a partition of elements into groups, answering "are these two in the same group? " and merging groups on demand.

╌╌╌╌

Some problems keep a collection of items partitioned into disjoint groups that only ever merge, never split, and repeatedly ask whether two items currently share a group. Are these two cities on the same electrical grid? Do these two pixels belong to the same connected region? Does adding this edge to a graph create a cycle? The disjoint-set (or union-find) data structure answers exactly these questions, and does so in near-constant amortized time per operation1, slow-growing enough that for every input you will ever see, it is effectively .

There is a recurring lesson in the design of efficient algorithms: the right data structure is what makes an algorithm fast. Dijkstra's and Prim's shortest- path and MST algorithms are correct with any priority queue, but their speed hinges on the queue: a binary heap gives , while a Fibonacci heap (with amortized ) shaves it toward . Kruskal's MST tells the same story through a different structure. The algorithm is one line of logic, and every bit of its efficiency comes from the disjoint-set structure beneath it. We will build that structure from the ground up and watch it collapse from per query down to inverse Ackermann.

The disjoint-set ADT

We maintain a collection of disjoint sets that together partition a universe of elements. Each set is named by a representative, some fixed member of the set chosen by the structure. The ADT has three operations:

  • creates a new set whose only member is (so is its own representative). must not already be in any set.
  • returns the representative of the set containing . Two elements are in the same set iff they return the same representative.
  • merges the sets containing and into one, picking a representative for the combined set. The two old sets are destroyed.

The query are and together? is just the test . After operations there can be at most Union operations, since each union reduces the number of sets by one.

The forest representation

The fast implementation represents each set as a rooted tree, and the whole collection as a forest.2 Every element points only to its parent; the root of each tree is the set's representative and points to itself. There are no child pointers and no key ordering. This is not a search tree, just a tangle of upward pointers whose only job is to lead to a root.

Two disjoint-set trees of parent pointers, each root looping to itself

Two sets: with representative , and with representative . Each node's single arrow points at its parent; the roots loop to themselves. follows parent pointers up to the root; Union makes one tree's root a child of the other's.

Algorithm:Naive disjoint-set forest operations
  1. 1
    Make-Set(x):
  2. 2
    parent(x)xparent(x) \gets x
  3. 3
    Find-Set(x):
  4. 4
    while xparent(x)x \ne parent(x) do
  5. 5
    xparent(x)x \gets parent(x)
    walk to the root
  6. 6
    return xx
  7. 7
    Union(x, y):
  8. 8
    parent(Find-Set(x))Find-Set(y)parent(\textbf{Find-Set}(x)) \gets \textbf{Find-Set}(y)

So far this is correct but not fast: a careless sequence of unions can build a tall, skinny tree, a path of nodes, making cost . Two heuristics, used together, flatten the forest and make the structure extraordinarily fast.

A warm-up: labels, and always relabel the smaller side

Before the forest, consider the most naive possible implementation, along with the one idea that already makes it respectable. Keep an array that stores, for each element, a label naming its current set. Then is just , a single array lookup, and the same-set test is instant. The whole cost lives in : merging two sets means walking through one of them and rewriting every member's label to match the other.

The question is which set to rewrite. If we are careless and always relabel, say, the set containing , an adversary can force work on every union. The fix is a single disciplined rule:

To do this efficiently, alongside keep a list of the elements currently carrying label , plus each set's size; the union then splices the smaller list into the larger and relabels only the short side.

Why does this help so much? Charge the cost to the elements that get relabeled. An element's label changes only when it sits in the smaller of two merging sets, and after that merge, the set it belongs to is at least twice as big as before. A set can double in size at most times before it swallows the whole universe, so each element is relabeled at most times over the entire run. Summed over all elements, all of the union work together costs , even though a single union might still touch elements.

This is the whole game in miniature: a structurally trivial idea (relabel the smaller side) plus an amortized doubling argument turns a quadratic-looking cost into . The forest representation below keeps exactly this intuition, the smaller thing yields to the larger, but replaces the explicit relabeling with a single pointer move, so a union becomes instead of .

Everything now hinges on which root we hang beneath the other. Get it wrong and the forest degenerates into exactly the chain we feared; get it right and the trees stay flat. The figure contrasts the two outcomes of the same four merges.

The same four unions, two ways. Careless linking builds a height- chain; union by rank keeps height — every is then one hop

Heuristic 1: union by rank

The trouble is unions that make a tall tree a child of a short one, deepening it. Union by rank prevents this. Each root carries a rank, an upper bound on the height of its tree. When uniting two trees, we attach the root of smaller rank beneath the root of larger rank, so the taller tree's height never grows. Only when the two ranks are equal does the height increase, and then by exactly one (and we bump the surviving root's rank).

This single rule already guarantees that every tree of rank contains at least nodes, so rank, and hence height, is at most . With union by rank alone, every operation is .

Why rank forces nodes. A root's rank rises only when two equal-rank trees merge, and that merge at least doubles the node count: rank is a single node, and each rank bump joins two trees of the previous rank. So a rank- tree holds nodes, capping rank (and height) at .

Notice this is the same doubling argument from the warm-up, read from the other direction: there, a set doubled each time an element was relabeled; here, a root's rank rises only when two equal-rank trees merge, which doubles the node count. Either way, is the ceiling, because nothing can double more than that many times.

The figure shows a under this rule. The left tree has rank , the right rank ; since their ranks differ, the smaller-rank root is hung beneath the larger-rank root and no rank changes. The result still has rank , exactly as the warm-up's smaller side yields to the larger demands, but now it costs a single pointer move rather than relabeling every member.

Union by rank hangs the lower-rank root beneath the higher-rank root

Heuristic 2: path compression

Path compression attacks the cost from the other side. Each time walks up to the root, it makes a second pass and points every node it visited directly at the root. The path is paid for once; every future on those nodes is then a single hop.

Path compression points every node on a Find-Set path straight at the root

Before, sits at the bottom of a chain ; after , the nodes , , all point straight at . The that pays for the walk leaves the tree dramatically flatter for everyone behind it.

Algorithm:Find-Set(x)\textsc{Find-Set}(x) — with path compression (recursive)
  1. 1
    if xparent(x)x \ne parent(x) then
  2. 2
    parent(x)parent(x) \gets call Find-Set(parent(x))\textsc{Find-Set}(parent(x))
    point x at the root
  3. 3
    return parent(x)parent(x)

The recursion bottoms out at the root, and as it unwinds it reassigns every node's parent to that root. With path compression in use, the rank of a root is only an upper bound on its height (compression can make the tree shorter than its rank suggests), which is why the heuristic is called union by rank rather than by height.

The near-constant amortized bound

Used together, union by rank and path compression make the disjoint-set structure almost unbelievably fast.

The function grows so slowly it is practically constant: for every up to roughly , a number far larger than the count of atoms in the universe. So for any conceivable input, each operation costs amortized .3

The two heuristics earn this by attacking complementary failure modes: union by rank keeps trees from getting tall in the first place (height ), while path compression ensures that any depth a tree does accumulate gets paid down and reused, so the expensive walks cannot recur. Neither alone gives (union by rank alone is amortized, path compression alone is amortized), but their combination collapses to inverse Ackermann. The full proof uses a subtle potential-function (amortized) argument, charging each node's cost against the steady growth of the ranks above it. The intuition to keep is that a node can be lifted closer to the root only so many times before it is the root's child, and ranks climb too slowly for that to happen often.

Application: connectivity and minimum spanning trees

Two applications make the structure indispensable.

Connectivity. Given a graph, call on every vertex, then for every edge . Afterward, holds iff and lie in the same connected component. The structure also processes online edge insertions: each new edge is one Union, and connectivity queries between insertions are each one pair of calls, both amortized .

Kruskal's minimum spanning tree. Kruskal's algorithm builds a minimum spanning tree by scanning edges in increasing weight order and adding each edge unless it would form a cycle. An edge forms a cycle exactly when and are already connected (i.e. already in the same set), which is the disjoint-set query verbatim.4

Kruskal as union-find. Scanning edges by weight, each is accepted iff its endpoints have different roots; the rejected edge closes a cycle since already share a set
Algorithm:MST-Kruskal(G,w)\textsc{MST-Kruskal}(G, w) — minimum spanning tree via union-find
  1. 1
    AA \gets \emptyset
  2. 2
    foreach vertex vv in V[G]V[G] do
  3. 3
    call Make-Set(v)\textsc{Make-Set}(v)
  4. 4
    sort the edges of GG into nondecreasing order by weight ww
  5. 5
    foreach edge {u,v}\set{u, v} in that order do
  6. 6
    if call Find-Set(u)\textsc{Find-Set}(u) \ne call Find-Set(v)(v) then
  7. 7
    AA{{u,v}}A \gets A \cup \set{\,\set{u, v}\,}
    joins two components
  8. 8
    call Union(u,v)\textsc{Union}(u, v)
  9. 9
    return AA

Now read off the running time. Sorting the edges costs , and since a simple graph has edges we have , so the sort is , and this dominates. The disjoint-set work spans tests and s; even with only the warm-up's relabel-the-smaller scheme this is , already cheaper than the sort, and with the rank/compression forest it is , effectively linear. Either way:

The lesson here is worth stating plainly: Kruskal's logic is one acyclicity test per edge, but its efficiency is entirely a property of the structure answering that test. The cycle test is the same-set query, and a good disjoint-set structure is exactly what makes Kruskal both simple and fast.

Takeaways

  • The disjoint-set ADT — , , Union — maintains a partition under merges and answers same group? by comparing representatives.
  • A labels + relabel-the-smaller warm-up already costs only total: each element is relabeled times because its set doubles whenever it moves. This doubling argument is the seed of union by rank.
  • The forest representation stores each set as a tree of parent pointers whose root is the representative; walks to the root, Union links two roots, turning the warm-up's relabel into one pointer move.
  • Union by rank keeps trees short (attach shorter under taller), and path compression flattens each path to point straight at the root.
  • Together they give amortized time per operation — inverse Ackermann, so for any realistic , i.e. effectively constant.
  • It powers connectivity queries and Kruskal's MST, where the same-set test is exactly the cycle test. Kruskal runs in , with the sort, not the union-find, as the bottleneck.
  • The unifying theme: clever data structures are what make algorithms fast. Kruskal's logic is one line; all of its speed lives in the disjoint-set structure underneath.

Footnotes

  1. CLRS, Ch. 21 — Data Structures for Disjoint Sets (§21.1): the Make-Set/Find-Set/Union ADT and its near-constant amortized cost.
  2. Erickson, Ch. — Disjoint Sets: the parent-pointer forest representation of a partition.
  3. CLRS, Ch. 21 — Data Structures for Disjoint Sets (§21.4): Tarjan's inverse-Ackermann amortized bound.
  4. Skiena, §6.1 — Union-Find: the cycle test in Kruskal's MST is exactly the same-set query.
Practice

╌╌ END ╌╌