CMSort: A Beginner’s Guide to Fast, Stable Sorting

Implementing CMSort — Code Examples in Python and C++

CMSort is a hybrid sorting algorithm designed to combine the predictable performance of merge-based approaches with targeted optimizations for cache locality and multi-core execution. It aims to deliver stable results and consistent behavior across diverse input distributions while keeping memory overhead reasonable and making parallelization straightforward. This article explains CMSort’s core ideas, provides clear Python and C++ implementations, discusses complexity, memory use, and parallel variants, and gives practical tips for tuning and debugging.


Overview: what CMSort does and why it matters

CMSort stands for “Cache- and Merge-optimized Sort.” Conceptually it blends ideas from merge sort, timsort’s run detection and galloping, and multi-way merge techniques. Key goals:

  • Good worst-case and average-case performance through divide-and-conquer merging.
  • Improved cache locality by working on contiguous blocks and limiting random access.
  • Practical parallelism by splitting work into independent subproblems.
  • Reduced extra memory relative to naive merge implementations by using in-place techniques where helpful and allocating temporary buffers sized to fit L2/L3 cache for merging.

Typical CMSort behavior:

  • Detects natural runs (monotonic sequences) in the input, as timsort does (a hand-traced example follows this list).
  • Consolidates short runs into larger sorted blocks using an insertion-sort-optimized base case.
  • Uses a k-way merging strategy (commonly k=2 or k=4) with small buffer windows to reduce cache misses.
  • Optionally parallelizes the top levels of recursion when subarray sizes exceed a threshold.
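
As a small, hand-traced illustration of the run scan on a hypothetical input (ignoring the min_run extension so the raw runs stay visible):

# Worked example: how the run scan would split a small input.
# [4, 7, 9, 6, 3, 5, 5, 8]
# run 1: [4, 7, 9]  ascending, kept as-is
# run 2: [6, 3]     strictly descending, reversed in place to [3, 6]
# run 3: [5, 5, 8]  ascending (ties allowed), kept as-is
# In the full algorithm, runs shorter than min_run are then extended
# with insertion sort before merging begins.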

When to choose CMSort

  • Datasets with partly ordered data or many small runs benefit from run detection.
  • Large arrays on multicore machines where cache-aware merging reduces real-world time.
  • Applications needing stable results or predictable resource usage.
  • Memory-constrained environments, where CMSort can be configured to trade time for less extra memory.

Algorithm details and design choices

  1. Run detection

    • Scan the array to find increasing or decreasing runs.
    • Reverse decreasing runs to make them increasing.
    • Minimum run length: extend short runs with insertion sort to a tuned size (commonly 32–64).
  2. Merge policy

    • Maintain a stack of runs and apply merging rules (similar to timsort) to avoid highly unbalanced merges; a sketch of these rules follows this list.
    • Merging can be 2-way or small k-way; k=4 is a good trade-off between pointer overhead and fewer passes.
  3. Buffering and cache optimization

    • Allocate a temporary buffer at each merge sized to min(run_length, cache_target).
    • Use the buffer as local scratch space for one of the runs so the merge performs fewer random writes.
  4. Parallelization

    • Parallelize at recursion levels where subarray sizes exceed a threshold and available threads remain.
    • Use work-stealing or task queues to balance load.
  5. Stability

    • Preserve stability by always selecting from the left (earlier) run when equal keys are compared during merging.
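
The pairwise merging used in the implementations below is the simplest policy. A stack-based policy in the spirit of timsort can be sketched as follows; the invariants are the standard timsort ones, and violates_invariants is an illustrative helper rather than part of the reference code.

# Sketch of a timsort-style merge policy. run_lengths is a stack of the
# lengths of pending runs, newest on top; merging whenever an invariant is
# violated keeps successive merges roughly balanced.
def violates_invariants(run_lengths):
    # For top-of-stack lengths ..., A, B, C the policy requires A > B + C and B > C.
    if len(run_lengths) >= 3 and run_lengths[-3] <= run_lengths[-2] + run_lengths[-1]:
        return True
    if len(run_lengths) >= 2 and run_lengths[-2] <= run_lengths[-1]:
        return True
    return False

After pushing each newly detected run, the sorter merges the top runs while violates_invariants() is true. Because the invariants force run lengths to grow at least geometrically, the stack stays O(log n) deep and the highly unbalanced merges mentioned above are avoided.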

Complexity

  • Time: O(n log n) worst case; average case close to O(n log n), with gains on partially ordered input (an already-sorted array is a single run, so the scan alone finishes in O(n)).
  • Space: auxiliary memory is typically O(n) in the worst case for simple merges; it can be reduced to sublinear extra space with in-place merging techniques, at the cost of larger constant factors in time.

Python implementation (clear, single-threaded)

The Python example prioritizes clarity and stability over micro-optimizations. It implements run detection, extension of short runs to a minimum run length with insertion sort, and a 2-way merge using a temporary buffer.

from typing import List, Callable, TypeVar

T = TypeVar("T")

def insertion_sort(arr: List[T], left: int, right: int, key: Callable[[T], object] = lambda x: x):
    for i in range(left + 1, right + 1):
        cur = arr[i]
        kcur = key(cur)
        j = i - 1
        while j >= left and key(arr[j]) > kcur:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = cur

def merge(arr: List[T], left: int, mid: int, right: int, key: Callable[[T], object] = lambda x: x):
    # Merge arr[left:mid+1] and arr[mid+1:right+1]
    left_part = arr[left:mid + 1]
    i = 0
    j = mid + 1
    k = left
    while i < len(left_part) and j <= right:
        if key(left_part[i]) <= key(arr[j]):
            arr[k] = left_part[i]
            i += 1
        else:
            arr[k] = arr[j]
            j += 1
        k += 1
    while i < len(left_part):
        arr[k] = left_part[i]
        i += 1
        k += 1

def cmsort(arr: List[T], key: Callable[[T], object] = lambda x: x, min_run: int = 32):
    n = len(arr)
    if n <= 1:
        return
    # 1) find runs
    runs = []
    i = 0
    while i < n:
        run_start = i
        i += 1
        if i == n:
            runs.append((run_start, i - 1))
            break
        # detect direction
        if key(arr[i - 1]) <= key(arr[i]):
            # ascending
            while i < n and key(arr[i - 1]) <= key(arr[i]):
                i += 1
        else:
            # descending
            while i < n and key(arr[i - 1]) > key(arr[i]):
                i += 1
            # reverse to make ascending
            arr[run_start:i] = reversed(arr[run_start:i])
        run_end = i - 1
        # extend short runs to min_run
        run_len = run_end - run_start + 1
        if run_len < min_run:
            extend_to = min(n - 1, run_start + min_run - 1)
            insertion_sort(arr, run_start, extend_to, key=key)
            run_end = extend_to
            i = run_end + 1
        runs.append((run_start, run_end))
    # 2) merge runs pairwise until one run remains
    while len(runs) > 1:
        new_runs = []
        for j in range(0, len(runs), 2):
            if j + 1 >= len(runs):
                new_runs.append(runs[j])
            else:
                l0, r0 = runs[j]
                l1, r1 = runs[j + 1]
                merge(arr, l0, r0, r1, key=key)
                new_runs.append((l0, r1))
        runs = new_runs

Usage example:

if __name__ == "__main__":
    import random
    a = [random.randint(0, 1000) for _ in range(2000)]
    cmsort(a)

C++ implementation (efficient, single-threaded)

The C++ implementation emphasizes performance: in-place run detection, insertion sort for small runs, and an efficient merge that uses a temporary vector allocated once.

#include <vector>
#include <algorithm>
#include <functional>
#include <iterator>

template<typename T, typename Key = std::identity>
void insertion_sort(std::vector<T>& arr, size_t left, size_t right, Key key = Key()) {
    for (size_t i = left + 1; i <= right; ++i) {
        T cur = std::move(arr[i]);
        auto kcur = key(cur);
        size_t j = i;
        while (j > left && key(arr[j - 1]) > kcur) {
            arr[j] = std::move(arr[j - 1]);
            --j;
        }
        arr[j] = std::move(cur);
    }
}

template<typename T, typename Key = std::identity>
void merge(std::vector<T>& arr, size_t left, size_t mid, size_t right, std::vector<T>& buffer, Key key = Key()) {
    size_t n1 = mid - left + 1;
    // copy left part into buffer
    buffer.clear();
    buffer.reserve(n1);
    for (size_t i = 0; i < n1; ++i) buffer.push_back(std::move(arr[left + i]));
    size_t i = 0;
    size_t j = mid + 1;
    size_t k = left;
    while (i < n1 && j <= right) {
        if (key(buffer[i]) <= key(arr[j])) {
            arr[k++] = std::move(buffer[i++]);
        } else {
            arr[k++] = std::move(arr[j++]);
        }
    }
    while (i < n1) {
        arr[k++] = std::move(buffer[i++]);
    }
}

template<typename T, typename Key = std::identity>
void cmsort(std::vector<T>& arr, Key key = Key(), size_t min_run = 32) {
    size_t n = arr.size();
    if (n <= 1) return;
    std::vector<std::pair<size_t, size_t>> runs;
    runs.reserve(64);
    size_t i = 0;
    while (i < n) {
        size_t run_start = i++;
        if (i == n) {
            runs.emplace_back(run_start, i - 1);
            break;
        }
        if (key(arr[i - 1]) <= key(arr[i])) {
            while (i < n && key(arr[i - 1]) <= key(arr[i])) ++i;
        } else {
            while (i < n && key(arr[i - 1]) > key(arr[i])) ++i;
            std::reverse(arr.begin() + run_start, arr.begin() + i);
        }
        size_t run_end = i - 1;
        size_t run_len = run_end - run_start + 1;
        if (run_len < min_run) {
            size_t extend_to = std::min(n - 1, run_start + min_run - 1);
            insertion_sort(arr, run_start, extend_to, key);
            run_end = extend_to;
            i = run_end + 1;
        }
        runs.emplace_back(run_start, run_end);
    }
    std::vector<T> buffer;
    while (runs.size() > 1) {
        std::vector<std::pair<size_t, size_t>> new_runs;
        new_runs.reserve((runs.size() + 1) / 2);
        for (size_t j = 0; j < runs.size(); j += 2) {
            if (j + 1 >= runs.size()) {
                new_runs.push_back(runs[j]);
            } else {
                size_t l0 = runs[j].first;
                size_t r0 = runs[j].second;
                size_t r1 = runs[j + 1].second;
                merge(arr, l0, r0, r1, buffer, key);
                new_runs.emplace_back(l0, r1);
            }
        }
        runs.swap(new_runs);
    }
}

Example usage:

#include <iostream>
#include <random>

int main() {
    std::vector<int> a(200000);
    std::mt19937 rng(123);
    std::uniform_int_distribution<int> d(0, 1000000);
    for (auto &x : a) x = d(rng);
    cmsort(a);
    std::cout << "First 10: ";
    for (int i = 0; i < 10; i++) std::cout << a[i] << ' ';
    std::cout << '\n';
    return 0;
}
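
Because the templates default their key functor to std::identity, the code requires a C++20 compiler; building with g++ or clang++ and the -std=c++20 flag (plus -O2 for realistic timings) is sufficient when the snippets live in a single translation unit.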

Parallel variants

Parallelization strategies:

  • Parallel run detection: scan blocks in parallel, then stitch runs at block boundaries.
  • Parallel merging: perform merges for independent run pairs in parallel (fork-join style).
  • Task-stealing runtime: submit merge tasks sized above threshold; worker threads steal smaller tasks.

A simple approach: at top recursion levels, spawn threads for left/right sub-sorts and join. Use thread pools or OpenMP/TBB for production code.
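
For the Python version, a minimal fork-join sketch might look like the following. It is only an outline under stated assumptions: CPython is assumed (so worker processes, not threads, provide CPU-bound speedups), cmsort is the function from the Python example above, and parallel_sort, _sort_chunk, workers, and threshold are illustrative names, not part of CMSort itself.

from concurrent.futures import ProcessPoolExecutor
from heapq import merge as kway_merge

def _sort_chunk(chunk):
    cmsort(chunk)                    # sort the chunk inside the worker process
    return chunk

def parallel_sort(arr, workers=4, threshold=100_000):
    # Below the threshold, process start-up and pickling dominate; sort serially.
    if len(arr) < threshold:
        cmsort(arr)
        return arr
    step = (len(arr) + workers - 1) // workers
    chunks = [arr[i:i + step] for i in range(0, len(arr), step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        sorted_chunks = list(pool.map(_sort_chunk, chunks))
    # k-way merge of the independently sorted chunks into a new list
    return list(kway_merge(*sorted_chunks))

Unlike the in-place serial version this returns a new list, and on platforms that spawn worker processes the call should sit under an if __name__ == "__main__": guard. A C++ equivalent would typically split at the top merge levels and hand the halves to std::async or a thread pool.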


Tuning notes

  • min_run: 32–64 works well across many inputs; increase it when natural runs are short.
  • Buffer sizing: allocate buffers sized to fit L2 or L3 cache per thread to reduce misses; see the sizing sketch after this list.
  • Parallel threshold: choose subarray size so thread overhead is amortized (commonly 1e5–1e6 elements).
  • For small types (ints), prefer moves and memcpy-like operations in C++ merges.
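
As a rough guide to the buffer-sizing rule (the min(run_length, cache_target) bound from the algorithm description), here is a small helper. The cache and element sizes are assumptions to be tuned per machine, and scratch_buffer_len is an illustrative name rather than part of the implementations above.

def scratch_buffer_len(run_length, cache_bytes=1 << 20, elem_bytes=8):
    # Cap the per-merge scratch buffer so it stays resident in the (assumed) L2 cache.
    return min(run_length, cache_bytes // elem_bytes)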

Debugging and testing

  • Verify stability by sorting (key, original_index) pairs and confirming the original order is preserved for equal keys; see the check after this list.
  • Test with edge cases: already sorted, reverse-sorted, many duplicates, small arrays.
  • Compare outputs with std::stable_sort (C++) or sorted(list) (Python) for correctness.
  • Profile with real data and CPU counters to verify cache miss reductions.
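
A minimal stability check along the lines of the first bullet, assuming the cmsort() from the Python example above (check_stability is an illustrative name):

import random

def check_stability(n=5000, key_range=50):
    # Many duplicate keys; the second tuple element records the original position.
    data = [(random.randrange(key_range), idx) for idx in range(n)]
    expected = sorted(data, key=lambda p: p[0])   # Python's sorted() is stable
    got = list(data)
    cmsort(got, key=lambda p: p[0])
    assert got == expected, "equal keys were reordered"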

Limitations and trade-offs

  • CMSort is more complex to implement than simple quicksort or mergesort.
  • Extra memory for buffers can still be O(n) depending on merge strategy.
  • For tiny arrays, insertion sort or std::sort may be faster due to lower overhead.

Conclusion

CMSort is a practical hybrid sorting algorithm that leverages run detection, cache-aware merging, and optional parallelism to deliver robust, stable, and performant sorts in real-world scenarios. The Python and C++ implementations above provide a clear, adaptable baseline you can extend for parallel execution, custom key extractors, or memory-constrained environments.
