CMSort: A Beginner’s Guide to Fast, Stable Sorting

Implementing CMSort — Code Examples in Python and C++

CMSort is a hybrid sorting algorithm designed to combine the predictable performance of merge-based approaches with targeted optimizations for cache locality and multi-core execution. It aims to deliver stable results and consistent behavior across diverse input distributions while keeping memory overhead reasonable and making parallelization straightforward. This article explains CMSort’s core ideas, provides clear Python and C++ implementations, discusses complexity, memory use, and parallel variants, and gives practical tips for tuning and debugging.


Overview: what CMSort does and why it matters

CMSort stands for “Cache- and Merge-optimized Sort.” Conceptually it blends ideas from merge sort, timsort’s run detection and galloping, and multi-way merge techniques. Key goals:

  • Good worst-case and average-case performance through divide-and-conquer merging.
  • Improved cache locality by working on contiguous blocks and limiting random access.
  • Practical parallelism by splitting work into independent subproblems.
  • Reduced extra memory relative to naive merge implementations by using in-place techniques where helpful and allocating temporary buffers sized to fit L2/L3 cache for merging.

Typical CMSort behavior:

  • Detects natural runs (monotonic sequences) in the input, as timsort does (a hand-traced example follows this list).
  • Consolidates short runs into larger sorted blocks using an insertion-sort-optimized base case.
  • Uses a k-way merging strategy (commonly k=2 or k=4) with small buffer windows to reduce cache misses.
  • Optionally parallelizes the top levels of recursion when subarray sizes exceed a threshold.
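
As a small, hand-traced illustration of the run scan on a hypothetical input (ignoring the min_run extension so the raw runs stay visible):

# Worked example: how the run scan would split a small input.
# [4, 7, 9, 6, 3, 5, 5, 8]
# run 1: [4, 7, 9]  ascending, kept as-is
# run 2: [6, 3]     strictly descending, reversed in place to [3, 6]
# run 3: [5, 5, 8]  ascending (ties allowed), kept as-is
# In the full algorithm, runs shorter than min_run are then extended
# with insertion sort before merging begins.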

When to choose CMSort

  • Datasets with partly ordered data or many small runs benefit from run detection.
  • Large arrays on multicore machines where cache-aware merging reduces real-world time.
  • Applications needing stable results or predictable resource usage.
  • Memory-constrained environments, where CMSort can be configured to trade time for less extra memory.

Algorithm details and design choices

  1. Run detection

    • Scan the array to find increasing or decreasing runs.
    • Reverse decreasing runs to make them increasing.
    • Minimum run length: extend short runs with insertion sort to a tuned size (commonly 32–64).
  2. Merge policy

    • Maintain a stack of runs and apply merging rules (similar to timsort) to avoid highly unbalanced merges; a sketch of these rules follows this list.
    • Merging can be 2-way or small k-way; k=4 is a good trade-off between pointer overhead and fewer passes.
  3. Buffering and cache optimization

    • Allocate a temporary buffer at each merge sized to min(run_length, cache_target).
    • Use the buffer as local scratch space for one of the runs so the merge performs fewer random writes.
  4. Parallelization

    • Parallelize at recursion levels where subarray sizes exceed a threshold and available threads remain.
    • Use work-stealing or task queues to balance load.
  5. Stability

    • Preserve stability by always selecting from the left (earlier) run when equal keys are compared during merging.
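
The pairwise merging used in the implementations below is the simplest policy. A stack-based policy in the spirit of timsort can be sketched as follows; the invariants are the standard timsort ones, and violates_invariants is an illustrative helper rather than part of the reference code.

# Sketch of a timsort-style merge policy. run_lengths is a stack of the
# lengths of pending runs, newest on top; merging whenever an invariant is
# violated keeps successive merges roughly balanced.
def violates_invariants(run_lengths):
    # For top-of-stack lengths ..., A, B, C the policy requires A > B + C and B > C.
    if len(run_lengths) >= 3 and run_lengths[-3] <= run_lengths[-2] + run_lengths[-1]:
        return True
    if len(run_lengths) >= 2 and run_lengths[-2] <= run_lengths[-1]:
        return True
    return False

After pushing each newly detected run, the sorter merges the top runs while violates_invariants() is true. Because the invariants force run lengths to grow at least geometrically, the stack stays O(log n) deep and the highly unbalanced merges mentioned above are avoided.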

Complexity

  • Time: O(n log n) worst case; average case close to O(n log n), with gains on partially ordered input (an already-sorted array is a single run, so the scan alone finishes in O(n)).
  • Space: auxiliary memory is typically O(n) in the worst case for simple merges; it can be reduced to sublinear extra space with in-place merging techniques, at the cost of larger constant factors in time.

Python implementation (clear, single-threaded)

The Python example prioritizes clarity and stability over micro-optimizations. It implements run detection, extension of short runs to a minimum run length with insertion sort, and a 2-way merge using a temporary buffer.

from typing import List, Callable, TypeVar

T = TypeVar("T")

def insertion_sort(arr: List[T], left: int, right: int, key: Callable[[T], object] = lambda x: x):
    for i in range(left + 1, right + 1):
        cur = arr[i]
        kcur = key(cur)
        j = i - 1
        while j >= left and key(arr[j]) > kcur:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = cur

def merge(arr: List[T], left: int, mid: int, right: int, key: Callable[[T], object] = lambda x: x):
    # Merge arr[left:mid+1] and arr[mid+1:right+1]
    left_part = arr[left:mid + 1]
    i = 0
    j = mid + 1
    k = left
    while i < len(left_part) and j <= right:
        if key(left_part[i]) <= key(arr[j]):
            arr[k] = left_part[i]
            i += 1
        else:
            arr[k] = arr[j]
            j += 1
        k += 1
    while i < len(left_part):
        arr[k] = left_part[i]
        i += 1
        k += 1

def cmsort(arr: List[T], key: Callable[[T], object] = lambda x: x, min_run: int = 32):
    n = len(arr)
    if n <= 1:
        return
    # 1) find runs
    runs = []
    i = 0
    while i < n:
        run_start = i
        i += 1
        if i == n:
            runs.append((run_start, i - 1))
            break
        # detect direction
        if key(arr[i - 1]) <= key(arr[i]):
            # ascending
            while i < n and key(arr[i - 1]) <= key(arr[i]):
                i += 1
        else:
            # descending
            while i < n and key(arr[i - 1]) > key(arr[i]):
                i += 1
            # reverse to make ascending
            arr[run_start:i] = reversed(arr[run_start:i])
        run_end = i - 1
        # extend short runs to min_run
        run_len = run_end - run_start + 1
        if run_len < min_run:
            extend_to = min(n - 1, run_start + min_run - 1)
            insertion_sort(arr, run_start, extend_to, key=key)
            run_end = extend_to
            i = run_end + 1
        runs.append((run_start, run_end))
    # 2) merge runs pairwise until one run remains
    while len(runs) > 1:
        new_runs = []
        for j in range(0, len(runs), 2):
            if j + 1 >= len(runs):
                new_runs.append(runs[j])
            else:
                l0, r0 = runs[j]
                l1, r1 = runs[j + 1]
                merge(arr, l0, r0, r1, key=key)
                new_runs.append((l0, r1))
        runs = new_runs

Usage example:

if __name__ == "__main__":
    import random
    a = [random.randint(0, 1000) for _ in range(2000)]
    cmsort(a)

C++ implementation (efficient, single-threaded)

The C++ implementation emphasizes performance: in-place run detection, insertion sort for small runs, and an efficient merge that uses a temporary vector allocated once.

#include <vector>
#include <algorithm>
#include <functional>
#include <iterator>

template<typename T, typename Key = std::identity>
void insertion_sort(std::vector<T>& arr, size_t left, size_t right, Key key = Key()) {
    for (size_t i = left + 1; i <= right; ++i) {
        T cur = std::move(arr[i]);
        auto kcur = key(cur);
        size_t j = i;
        while (j > left && key(arr[j - 1]) > kcur) {
            arr[j] = std::move(arr[j - 1]);
            --j;
        }
        arr[j] = std::move(cur);
    }
}

template<typename T, typename Key = std::identity>
void merge(std::vector<T>& arr, size_t left, size_t mid, size_t right, std::vector<T>& buffer, Key key = Key()) {
    size_t n1 = mid - left + 1;
    // copy left part into buffer
    buffer.clear();
    buffer.reserve(n1);
    for (size_t i = 0; i < n1; ++i) buffer.push_back(std::move(arr[left + i]));
    size_t i = 0;
    size_t j = mid + 1;
    size_t k = left;
    while (i < n1 && j <= right) {
        if (key(buffer[i]) <= key(arr[j])) {
            arr[k++] = std::move(buffer[i++]);
        } else {
            arr[k++] = std::move(arr[j++]);
        }
    }
    while (i < n1) {
        arr[k++] = std::move(buffer[i++]);
    }
}

template<typename T, typename Key = std::identity>
void cmsort(std::vector<T>& arr, Key key = Key(), size_t min_run = 32) {
    size_t n = arr.size();
    if (n <= 1) return;
    std::vector<std::pair<size_t, size_t>> runs;
    runs.reserve(64);
    size_t i = 0;
    while (i < n) {
        size_t run_start = i++;
        if (i == n) {
            runs.emplace_back(run_start, i - 1);
            break;
        }
        if (key(arr[i - 1]) <= key(arr[i])) {
            while (i < n && key(arr[i - 1]) <= key(arr[i])) ++i;
        } else {
            while (i < n && key(arr[i - 1]) > key(arr[i])) ++i;
            std::reverse(arr.begin() + run_start, arr.begin() + i);
        }
        size_t run_end = i - 1;
        size_t run_len = run_end - run_start + 1;
        if (run_len < min_run) {
            size_t extend_to = std::min(n - 1, run_start + min_run - 1);
            insertion_sort(arr, run_start, extend_to, key);
            run_end = extend_to;
            i = run_end + 1;
        }
        runs.emplace_back(run_start, run_end);
    }
    std::vector<T> buffer;
    while (runs.size() > 1) {
        std::vector<std::pair<size_t, size_t>> new_runs;
        new_runs.reserve((runs.size() + 1) / 2);
        for (size_t j = 0; j < runs.size(); j += 2) {
            if (j + 1 >= runs.size()) {
                new_runs.push_back(runs[j]);
            } else {
                size_t l0 = runs[j].first;
                size_t r0 = runs[j].second;
                size_t r1 = runs[j + 1].second;
                merge(arr, l0, r0, r1, buffer, key);
                new_runs.emplace_back(l0, r1);
            }
        }
        runs.swap(new_runs);
    }
}

Example usage:

#include <iostream>
#include <random>

int main() {
    std::vector<int> a(200000);
    std::mt19937 rng(123);
    std::uniform_int_distribution<int> d(0, 1000000);
    for (auto &x : a) x = d(rng);
    cmsort(a);
    std::cout << "First 10: ";
    for (int i = 0; i < 10; i++) std::cout << a[i] << ' ';
    std::cout << '\n';
    return 0;
}
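
Because the templates default their key functor to std::identity, the code requires a C++20 compiler; building with g++ or clang++ and the -std=c++20 flag (plus -O2 for realistic timings) is sufficient when the snippets live in a single translation unit.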

Parallel variants

Parallelization strategies:

  • Parallel run detection: scan blocks in parallel, then stitch runs at block boundaries.
  • Parallel merging: perform merges for independent run pairs in parallel (fork-join style).
  • Task-stealing runtime: submit merge tasks sized above threshold; worker threads steal smaller tasks.

A simple approach: at top recursion levels, spawn threads for left/right sub-sorts and join. Use thread pools or OpenMP/TBB for production code.
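
For the Python version, a minimal fork-join sketch might look like the following. It is only an outline under stated assumptions: CPython is assumed (so worker processes, not threads, provide CPU-bound speedups), cmsort is the function from the Python example above, and parallel_sort, _sort_chunk, workers, and threshold are illustrative names, not part of CMSort itself.

from concurrent.futures import ProcessPoolExecutor
from heapq import merge as kway_merge

def _sort_chunk(chunk):
    cmsort(chunk)                    # sort the chunk inside the worker process
    return chunk

def parallel_sort(arr, workers=4, threshold=100_000):
    # Below the threshold, process start-up and pickling dominate; sort serially.
    if len(arr) < threshold:
        cmsort(arr)
        return arr
    step = (len(arr) + workers - 1) // workers
    chunks = [arr[i:i + step] for i in range(0, len(arr), step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(_sort_chunk, chunks))
    # k-way merge of the independently sorted chunks into a new list
    return list(kway_merge(*sorted_chunks))

Unlike the in-place serial version this returns a new list, and on platforms that spawn worker processes the call should sit under an if __name__ == "__main__": guard. A C++ equivalent would typically split at the top merge levels and hand the halves to std::async or a thread pool.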


Tuning notes

  • min_run: 32–64 works well across many inputs; increase it when natural runs are short.
  • Buffer sizing: allocate buffers sized to fit L2 or L3 cache per thread to reduce misses; see the sizing sketch after this list.
  • Parallel threshold: choose subarray size so thread overhead is amortized (commonly 1e5–1e6 elements).
  • For small types (ints), prefer moves and memcpy-like operations in C++ merges.
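
As a rough guide to the buffer-sizing rule (the min(run_length, cache_target) bound from the algorithm description), here is a small helper. The cache and element sizes are assumptions to be tuned per machine, and scratch_buffer_len is an illustrative name rather than part of the implementations above.

def scratch_buffer_len(run_length, cache_bytes=1 << 20, elem_bytes=8):
    # Cap the per-merge scratch buffer so it stays resident in the (assumed) L2 cache.
    return min(run_length, cache_bytes // elem_bytes)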

Debugging and testing

  • Verify stability by sorting (key, original_index) pairs and confirming the original order is preserved for equal keys; see the check after this list.
  • Test with edge cases: already sorted, reverse-sorted, many duplicates, small arrays.
  • Compare outputs with std::stable_sort (C++) or sorted(list) (Python) for correctness.
  • Profile with real data and CPU counters to verify cache miss reductions.
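
A minimal stability check along the lines of the first bullet, assuming the cmsort() from the Python example above (check_stability is an illustrative name):

import random

def check_stability(n=5000, key_range=50):
    # Many duplicate keys; the second tuple element records the original position.
    data = [(random.randrange(key_range), idx) for idx in range(n)]
    expected = sorted(data, key=lambda p: p[0])   # Python's sorted() is stable
    got = list(data)
    cmsort(got, key=lambda p: p[0])
    assert got == expected, "equal keys were reordered"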

Limitations and trade-offs

  • CMSort is more complex to implement than simple quicksort or mergesort.
  • Extra memory for buffers can still be O(n) depending on merge strategy.
  • For tiny arrays, insertion sort or std::sort may be faster due to lower overhead.

Conclusion

CMSort is a practical hybrid sorting algorithm that leverages run detection, cache-aware merging, and optional parallelism to deliver robust, stable, and performant sorts in real-world scenarios. The Python and C++ implementations above provide a clear, adaptable baseline you can extend for parallel execution, custom key extractors, or memory-constrained environments.
