patch: Change default sorting algorithm in sort(1)

Cyril Zhang cyril at freebsdfoundation.org
Mon May 17 20:26:06 UTC 2021


I am looking to change the default algorithm used in sort from
heapsort to mergesort. This would significantly improve the runtime of
sort, even after this patch: https://reviews.freebsd.org/D30170.

I couldn't find any rationale for why heapsort was chosen as the
default. The original commit from 2012, that introduced heapsort as
the default, says "Change default sort method to mergesort, which has
a better worst case performance than qsort", seemingly indicating that
mergesort should be the default.

Here is some data. Sorting a 9999999 line file of integers between 1 and 100:

$ time sort --heapsort -n -o /dev/null rand.txt
      22.99 real        22.69 user         0.31 sys

$ time sort --mergesort -n -o /dev/null rand.txt
      10.76 real        10.47 user         0.29 sys

$ time gsort --parallel=1 -o /dev/null rand.txt
       6.39 real         5.77 user         0.62 sys

$ time sort --heapsort -o /dev/null rand.txt
      26.42 real        26.16 user         0.26 sys

$ time sort --mergesort -o /dev/null rand.txt
       9.97 real         9.72 user         0.25 sys

$ time gsort --parallel=1 -o /dev/null rand.txt
       8.25 real         7.58 user         0.66 sys

Same file, but pre-sorted numerically (a particular trait of libc's
mergesort implementation, which sort uses, is that it is linear on
sorted data):

$ time sort --heapsort -n -o /dev/null sorted.txt
      15.97 real        15.64 user         0.33 sys

$ time sort --mergesort -n -o /dev/null sorted.txt
       3.97 real         3.69 user         0.27 sys

$ time gsort --parallel=1 -o /dev/null rand.txt
       3.75 real         3.07 user         0.68 sys

A real-world example, sorting a file with 10403181 lines of text:

$ time sort --heapsort -o /dev/null cscope.1
      38.52 real        37.86 user         0.66 sys

$ time sort --mergesort -o /dev/null cscope.1
      18.57 real        17.82 user         0.75 sys

$ time gsort --parallel=1 -o /dev/null cscope.1
      15.75 real        12.19 user         3.56 sys

The main consideration is memory usage. The mergesort implementation
in libc allocates extra memory, and in the sort program this amounts
to about n * sizeof(void *) extra memory where n is the number of
lines in the input. In practice, this doesn't seem to be much, and is
dwarfed by the memory allocations that occur in the pre-processing
stage of sort, even if each line contains only a single character.
Using Valgrind's massif tool, I found that mergesort appears to
allocate only about 7% of the program's total memory usage.
Equivalently, this means that mergesort allocates about 7.5% extra
memory.

Last, this changes the default sorting algorithm from a non-stable
sort to a stable one, which could potentially affect existing programs
that may have been incorrectly relying on the output sorting order
even when -s is not specified.

Here is the Phabricator review for the patch:
https://reviews.freebsd.org/D30319. Does anyone see any problems with
this change?


More information about the freebsd-hackers mailing list