Implementing CMSort — Code Examples in Python and C++
CMSort is a hybrid sorting algorithm that combines the predictable performance of merge-based approaches with targeted optimizations for cache locality and multi-core execution. It aims to provide stable, predictable behavior across diverse input distributions while keeping memory overhead reasonable and remaining easy to parallelize. This article explains CMSort's core ideas, provides clear Python and C++ implementations, discusses complexity, memory use, and parallel variants, and gives practical tips for tuning and debugging.
Overview: what CMSort does and why it matters
CMSort stands for “Cache- and Merge-optimized Sort.” Conceptually it blends ideas from merge sort, timsort’s run detection and galloping, and multi-way merge techniques. Key goals:
- Good worst-case and average-case performance through divide-and-conquer merging.
- Improved cache locality by working on contiguous blocks and limiting random access.
- Practical parallelism by splitting work into independent subproblems.
- Reduced extra memory relative to naive merge implementations by using in-place techniques where helpful and allocating temporary buffers sized to fit L2/L3 cache for merging.
Typical CMSort behavior:
- Detects natural runs (monotonic sequences) in the input, as timsort does.
- Consolidates short runs into larger sorted blocks using an insertion-sort-optimized base case.
- Uses a k-way merging strategy (commonly k=2 or k=4) with small buffer windows to reduce cache misses.
- Optionally parallelizes the top levels of recursion when subarray sizes exceed a threshold.
When to choose CMSort
- Datasets with partly ordered data or many small runs benefit from run detection.
- Large arrays on multicore machines where cache-aware merging reduces real-world time.
- Applications needing stable results or predictable resource usage.
- If extreme memory constraints exist, CMSort can be configured to trade time for less extra memory.
Algorithm details and design choices
1. Run detection
- Scan the array to find increasing or decreasing runs.
- Reverse decreasing runs to make them increasing.
- Minimum run length: extend short runs with insertion sort to a tuned size (commonly 32–64).
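The run-detection step can be isolated for testing. Below is a minimal standalone sketch (the helper name find_runs is ours; the full cmsort implementations later in this article fold the same logic inline):

from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")

def find_runs(arr: List[T], key: Callable[[T], object] = lambda x: x) -> List[Tuple[int, int]]:
    # Return (start, end) index pairs of maximal ascending runs, reversing
    # strictly descending runs in place so every run ends up ascending.
    runs = []
    n = len(arr)
    i = 0
    while i < n:
        start = i
        i += 1
        if i < n and key(arr[i - 1]) > key(arr[i]):
            # strictly descending; strictness keeps equal keys in input order
            while i < n and key(arr[i - 1]) > key(arr[i]):
                i += 1
            arr[start:i] = reversed(arr[start:i])
        else:
            while i < n and key(arr[i - 1]) <= key(arr[i]):
                i += 1
        runs.append((start, i - 1))
    return runs

For example, find_runs on [1, 2, 5, 3, 2, 8] returns [(0, 2), (3, 4), (5, 5)] and leaves the list as [1, 2, 5, 2, 3, 8].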
2. Merge policy
- Maintain a stack of runs and apply merging rules (similar to timsort) to avoid highly unbalanced merges.
- Merging can be 2-way or small k-way; k=4 is a good trade-off between pointer overhead and fewer passes.
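The implementations in this article use 2-way merges for simplicity. As a sketch of the k-way alternative mentioned above, the helper below (kway_merge is our name, not a library function) merges k already-sorted runs with a small heap; ties are broken by run index so the result stays stable. A production CMSort would merge into a preallocated buffer instead of building a new list.

import heapq
from typing import Callable, List, TypeVar

T = TypeVar("T")

def kway_merge(runs: List[List[T]], key: Callable[[T], object] = lambda x: x) -> List[T]:
    # Each heap entry is (key, run_index, position); run_index breaks ties,
    # so equal keys come out in run order and the merge stays stable.
    heap = []
    for ridx, run in enumerate(runs):
        if run:
            heap.append((key(run[0]), ridx, 0))
    heapq.heapify(heap)
    out = []
    while heap:
        _, ridx, pos = heapq.heappop(heap)
        out.append(runs[ridx][pos])
        if pos + 1 < len(runs[ridx]):
            heapq.heappush(heap, (key(runs[ridx][pos + 1]), ridx, pos + 1))
    return out

# kway_merge([[1, 4, 9], [2, 3, 8], [0, 7]]) returns [0, 1, 2, 3, 4, 7, 8, 9]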
3. Buffering and cache optimization
- Allocate a temporary buffer at each merge sized to min(run_length, cache_target).
- Use buffer as a local scratch space for one of the runs to perform fewer random writes.
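To make the "scratch space for one run" idea concrete, here is a sketch of a 2-way merge that buffers only the smaller run (the plain merge() in the Python implementation below always buffers the left run). When the right run is smaller it is buffered instead and the merge proceeds from the high end; the comparison directions are chosen so equal keys keep their input order. The name merge_adaptive is ours and it is not used by the implementations below.

from typing import Callable, List, TypeVar

T = TypeVar("T")

def merge_adaptive(arr: List[T], left: int, mid: int, right: int,
                   key: Callable[[T], object] = lambda x: x) -> None:
    len_left = mid - left + 1
    len_right = right - mid
    if len_left <= len_right:
        # buffer the left run and merge forwards (same as the plain merge below)
        buf = arr[left:mid + 1]
        i, j, k = 0, mid + 1, left
        while i < len(buf) and j <= right:
            if key(buf[i]) <= key(arr[j]):   # <= keeps the merge stable
                arr[k] = buf[i]; i += 1
            else:
                arr[k] = arr[j]; j += 1
            k += 1
        while i < len(buf):
            arr[k] = buf[i]; i += 1; k += 1
    else:
        # buffer the right run and merge backwards from the high end
        buf = arr[mid + 1:right + 1]
        i, j, k = len(buf) - 1, mid, right
        while i >= 0 and j >= left:
            if key(arr[j]) > key(buf[i]):    # strict > keeps equal keys stable
                arr[k] = arr[j]; j -= 1
            else:
                arr[k] = buf[i]; i -= 1
            k -= 1
        while i >= 0:
            arr[k] = buf[i]; i -= 1; k -= 1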
4. Parallelization
- Parallelize at recursion levels where subarray sizes exceed a threshold and available threads remain.
- Use work-stealing or task queues to balance load.
5. Stability
- Maintain stability by always taking from the earlier (left) run when equal keys are compared during merging.
Complexity
- Time: O(n log n) in the worst case; partially ordered inputs need fewer merges thanks to run detection, down to O(n) for already-sorted data.
- Space: auxiliary memory is typically O(n) in the worst case for simple merges; it can be reduced to sublinear extra space with in-place merging techniques, at the cost of larger constant factors in time.
Python implementation (clear, single-threaded)
The Python example prioritizes clarity and stability over micro-optimizations. It implements run detection, extension of short runs to min_run with insertion sort, and a 2-way merge that uses a temporary buffer for the left run.
from typing import List, Callable, TypeVar

T = TypeVar("T")

def insertion_sort(arr: List[T], left: int, right: int,
                   key: Callable[[T], object] = lambda x: x) -> None:
    # Sort arr[left:right+1] in place; stable because equal keys never move past each other.
    for i in range(left + 1, right + 1):
        cur = arr[i]
        kcur = key(cur)
        j = i - 1
        while j >= left and key(arr[j]) > kcur:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = cur

def merge(arr: List[T], left: int, mid: int, right: int,
          key: Callable[[T], object] = lambda x: x) -> None:
    # Merge arr[left:mid+1] and arr[mid+1:right+1]; only the left run is buffered.
    left_part = arr[left:mid + 1]
    i = 0
    j = mid + 1
    k = left
    while i < len(left_part) and j <= right:
        if key(left_part[i]) <= key(arr[j]):   # <= keeps the merge stable
            arr[k] = left_part[i]
            i += 1
        else:
            arr[k] = arr[j]
            j += 1
        k += 1
    while i < len(left_part):
        arr[k] = left_part[i]
        i += 1
        k += 1
    # Any leftover right-run elements are already in place.

def cmsort(arr: List[T], key: Callable[[T], object] = lambda x: x,
           min_run: int = 32) -> None:
    n = len(arr)
    if n <= 1:
        return
    # 1) find runs
    runs = []
    i = 0
    while i < n:
        run_start = i
        i += 1
        if i == n:
            runs.append((run_start, i - 1))
            break
        # detect direction
        if key(arr[i - 1]) <= key(arr[i]):
            # ascending
            while i < n and key(arr[i - 1]) <= key(arr[i]):
                i += 1
        else:
            # strictly descending (strictness preserves stability)
            while i < n and key(arr[i - 1]) > key(arr[i]):
                i += 1
            # reverse to make ascending
            arr[run_start:i] = reversed(arr[run_start:i])
        run_end = i - 1
        # extend short runs to min_run
        run_len = run_end - run_start + 1
        if run_len < min_run:
            extend_to = min(n - 1, run_start + min_run - 1)
            insertion_sort(arr, run_start, extend_to, key=key)
            run_end = extend_to
            i = run_end + 1
        runs.append((run_start, run_end))
    # 2) merge adjacent runs pairwise until one run remains
    while len(runs) > 1:
        new_runs = []
        for j in range(0, len(runs), 2):
            if j + 1 >= len(runs):
                new_runs.append(runs[j])
            else:
                l0, r0 = runs[j]
                l1, r1 = runs[j + 1]
                merge(arr, l0, r0, r1, key=key)
                new_runs.append((l0, r1))
        runs = new_runs
Usage example:
if __name__ == "__main__":
    import random

    a = [random.randint(0, 1000) for _ in range(2000)]
    cmsort(a)
C++ implementation (efficient, single-threaded)
The C++ implementation emphasizes performance: in-place run detection, insertion sort for small runs, and a merge that reuses a single temporary vector across merges. The default key extractor is std::identity, so it requires C++20; on older standards, substitute your own identity functor.
#include <vector>
#include <algorithm>
#include <functional>
#include <iterator>
#include <utility>

// std::identity requires C++20.
template<typename T, typename Key = std::identity>
void insertion_sort(std::vector<T>& arr, size_t left, size_t right, Key key = Key()) {
    for (size_t i = left + 1; i <= right; ++i) {
        T cur = std::move(arr[i]);
        auto kcur = key(cur);
        size_t j = i;
        while (j > left && key(arr[j - 1]) > kcur) {
            arr[j] = std::move(arr[j - 1]);
            --j;
        }
        arr[j] = std::move(cur);
    }
}

template<typename T, typename Key = std::identity>
void merge(std::vector<T>& arr, size_t left, size_t mid, size_t right,
           std::vector<T>& buffer, Key key = Key()) {
    size_t n1 = mid - left + 1;
    // copy left part into buffer
    buffer.clear();
    buffer.reserve(n1);
    for (size_t i = 0; i < n1; ++i) buffer.push_back(std::move(arr[left + i]));
    size_t i = 0;
    size_t j = mid + 1;
    size_t k = left;
    while (i < n1 && j <= right) {
        if (key(buffer[i]) <= key(arr[j])) {   // <= keeps the merge stable
            arr[k++] = std::move(buffer[i++]);
        } else {
            arr[k++] = std::move(arr[j++]);
        }
    }
    while (i < n1) {
        arr[k++] = std::move(buffer[i++]);
    }
    // leftover right-run elements are already in place
}

template<typename T, typename Key = std::identity>
void cmsort(std::vector<T>& arr, Key key = Key(), size_t min_run = 32) {
    size_t n = arr.size();
    if (n <= 1) return;
    std::vector<std::pair<size_t, size_t>> runs;
    runs.reserve(64);
    size_t i = 0;
    while (i < n) {
        size_t run_start = i++;
        if (i == n) { runs.emplace_back(run_start, i - 1); break; }
        if (key(arr[i - 1]) <= key(arr[i])) {
            while (i < n && key(arr[i - 1]) <= key(arr[i])) ++i;   // ascending run
        } else {
            while (i < n && key(arr[i - 1]) > key(arr[i])) ++i;    // strictly descending run
            std::reverse(arr.begin() + run_start, arr.begin() + i);
        }
        size_t run_end = i - 1;
        size_t run_len = run_end - run_start + 1;
        if (run_len < min_run) {
            size_t extend_to = std::min(n - 1, run_start + min_run - 1);
            insertion_sort(arr, run_start, extend_to, key);
            run_end = extend_to;
            i = run_end + 1;
        }
        runs.emplace_back(run_start, run_end);
    }
    std::vector<T> buffer;
    while (runs.size() > 1) {
        std::vector<std::pair<size_t, size_t>> new_runs;
        new_runs.reserve((runs.size() + 1) / 2);
        for (size_t j = 0; j < runs.size(); j += 2) {
            if (j + 1 >= runs.size()) {
                new_runs.push_back(runs[j]);
            } else {
                size_t l0 = runs[j].first;
                size_t r0 = runs[j].second;
                size_t r1 = runs[j + 1].second;
                merge(arr, l0, r0, r1, buffer, key);
                new_runs.emplace_back(l0, r1);
            }
        }
        runs.swap(new_runs);
    }
}
Example usage:
#include <iostream>
#include <random>

int main() {
    std::vector<int> a(200000);
    std::mt19937 rng(123);
    std::uniform_int_distribution<int> d(0, 1000000);
    for (auto& x : a) x = d(rng);

    cmsort(a);

    std::cout << "First 10: ";
    for (int i = 0; i < 10; i++) std::cout << a[i] << ' ';
    std::cout << '\n';
    return 0;
}
Parallel variants
Parallelization strategies:
- Parallel run detection: scan blocks in parallel, then stitch runs at block boundaries.
- Parallel merging: perform merges for independent run pairs in parallel (fork-join style).
- Task-stealing runtime: submit merge tasks sized above threshold; worker threads steal smaller tasks.
A simple approach: at top recursion levels, spawn threads for left/right sub-sorts and join. Use thread pools or OpenMP/TBB for production code.
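As a sketch of that fork-join structure (not a production parallel sort), the function below sorts copies of the two halves concurrently and then combines them with the 2-way merge from the Python implementation above. In CPython the GIL limits the speedup of thread-based, pure-Python comparison sorting, so the point here is the structure, which maps directly onto std::async, OpenMP tasks, or TBB in C++. The name cmsort_parallel and the threshold default are our own illustrative choices.

from concurrent.futures import ThreadPoolExecutor

def cmsort_parallel(arr, key=lambda x: x, threshold=100_000, max_workers=4):
    # Fork-join sketch: sort the two halves concurrently, then merge.
    n = len(arr)
    if n < threshold:
        cmsort(arr, key=key)
        return
    mid = n // 2
    left, right = arr[:mid], arr[mid:]        # work on copies of each half
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        f1 = pool.submit(cmsort, left, key)
        f2 = pool.submit(cmsort, right, key)
        f1.result()
        f2.result()
    # copy the sorted halves back and merge them in place with the 2-way merge above
    arr[:mid] = left
    arr[mid:] = right
    merge(arr, 0, mid - 1, n - 1, key=key)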
Tuning notes
- min_run: 32–64 works well across many inputs; increase it if natural runs in your data tend to be short.
- Buffer sizing: allocate buffers sized to L2 or L3 cache per thread to reduce misses (a rough sizing helper is sketched after this list).
- Parallel threshold: choose subarray size so thread overhead is amortized (commonly 1e5–1e6 elements).
- For small types (ints), prefer moves and memcpy-like operations in C++ merges.
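As a concrete, if rough, starting point for the buffer-sizing rule above, the helper below picks a scratch-buffer length from an assumed per-thread cache budget. The name scratch_buffer_len, the 256 KiB default, and the 8-byte element size are placeholder assumptions, not measured values; profile on the target machine.

def scratch_buffer_len(run_len: int, cache_budget_bytes: int = 256 * 1024,
                       elem_bytes: int = 8) -> int:
    # Buffer the whole run if it fits the cache budget, otherwise cap the
    # buffer so merging proceeds in cache-friendly windows.
    return max(1, min(run_len, cache_budget_bytes // elem_bytes))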
Debugging and testing
- Verify stability by sorting (key, original_index) pairs and confirming the original order is preserved for equal keys, as in the snippet after this list.
- Test with edge cases: already sorted, reverse-sorted, many duplicates, small arrays.
- Compare outputs with std::stable_sort (C++) or sorted(list) (Python) for correctness.
- Profile with real data and CPU counters to verify cache miss reductions.
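A minimal check along those lines, using the Python cmsort above (the helper name is ours): sort (key, original_index) pairs by the key only and compare against Python's sorted(), which is stable, so any reordering of equal keys shows up as a mismatch.

import random

def check_correct_and_stable(data, key=lambda x: x):
    # Pair every element with its original index; sort by key only.
    tagged = [(key(x), i) for i, x in enumerate(data)]
    expect = sorted(tagged, key=lambda p: p[0])      # sorted() is stable
    got = list(tagged)
    cmsort(got, key=lambda p: p[0])
    assert got == expect, "cmsort output differs from the stable reference"

check_correct_and_stable([random.randrange(50) for _ in range(5000)])  # many duplicates
check_correct_and_stable(list(range(1000)))                            # already sorted
check_correct_and_stable(list(range(1000, 0, -1)))                     # reverse sorted
check_correct_and_stable([])                                           # empty input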
Limitations and trade-offs
- CMSort is more complex to implement than simple quicksort or mergesort.
- Extra memory for buffers can still be O(n) depending on merge strategy.
- For tiny arrays, insertion sort or std::sort may be faster due to lower overhead.
Conclusion
CMSort is a practical hybrid sorting algorithm that leverages run detection, cache-aware merging, and optional parallelism to deliver robust, stable, and performant sorts in real-world scenarios. The Python and C++ implementations above provide a clear, adaptable baseline you can extend for parallel execution, custom key extractors, or memory-constrained environments.