Re: git: 8dcf3a82c54c - main - libc: Implement bsort(3) a bitonic type of sorting algorithm.

From: Hans Petter Selasky <hselasky_at_freebsd.org>
Date: Wed, 19 Apr 2023 17:24:20 UTC
On 4/19/23 17:46, Brooks Davis wrote:
> On Wed, Apr 19, 2023 at 12:06:26PM +0000, Hans Petter Selasky wrote:
>> The branch main has been updated by hselasky:
>>
>> URL: https://cgit.FreeBSD.org/src/commit/?id=8dcf3a82c54cb216df3213a013047907636a01da
>>
>> commit 8dcf3a82c54cb216df3213a013047907636a01da
>> Author:     Hans Petter Selasky <hselasky@FreeBSD.org>
>> AuthorDate: 2022-09-08 10:16:43 +0000
>> Commit:     Hans Petter Selasky <hselasky@FreeBSD.org>
>> CommitDate: 2023-04-19 12:04:22 +0000
>>
>>      libc: Implement bsort(3) a bitonic type of sorting algorithm.
>>      
>>      The bsort(3) algorithm works by swapping objects, similarly to qsort(3),
>>      and does not require any significant amount of additional memory.
>>      
>>      The bsort(3) algorithm doesn't suffer from the processing time issues
>>      known the plague the qsort(3) family of algorithms, and is bounded by
>>      a complexity of O(log2(N) * log2(N) * N), where N is the number of
>>      elements in the sorting array. The additional complexity compared to
>>      mergesort(3) is a fair tradeoff in situations where no memory may
>>      be allocated.
>>      
>>      The bsort(3) APIs are identical to those of qsort(3), allowing for
>>      easy drop-in and testing.
>>      
>>      The design of the bsort(3) algorithm allows for future parallell CPU
>>      execution when sorting arrays. The current version of the bsort(3)
>>      algorithm is single threaded. This is possible because fixed areas
>>      of the sorting data is compared at a time, and can easily be divided
>>      among different CPU's to sort large arrays faster.
>>      
>>      Reviewed by:    gbe@, delphij@, pauamma_gundo.com (manpages)
>>      Sponsored by:   NVIDIA Networking
>>      Differential Revision:  https://reviews.freebsd.org/D36493
> 
> Why was this needed?  I'm not really a fan to adding new, non-standard
> sorts without a clear use case.  I understand it has some specific
> advantages vs qsort, but are we going to use it?  Doing so will make our
> code less portable so we almost certainly can't s/qsort/bsort/.

Hi Brooks,

My long term plan is to get bsort() to replace qsort(), but because the 
two algorithms have different characteristics, then some people may 
complain it is good for me, but not for you. I want there to be an 
option besides from qsort(), which does not use malloc() as an integral 
part of sorting. And is faster than O(N*N) sorting (still the worst case 
for qsort in FreeBSD).

qsort() is frequently used to do all kinds of sorting, and some pointed 
out that qsort() can technically be any kind of sorting algorithm, but 
looking around it is not.

When I build applications of my own, I always use mergesort(), which 
depend on malloc(). Having a dependency on a certain memory allocator to 
get performance is not good.

I want to distinguish from qsort() by calling it bsort(). If people use 
bsort() they know what they get cross-platform. If people use qsort() 
the output is random basically. It helps very little my application 
works on FreeBSD, but not Linux, for example.

In FreeBSD qsort() is typically used for sorting files up to 16K entries 
per directory. Even if qsort() can explode in time, probably it's not 
that important. But when using qsort() for sorting millions of 
mathematical expressions, for example, doing number analysis, this is 
unacceptable.

I think "C.A.R. Hoare's quicksort" technique is designed for single CPU 
systemsf only. Even if the best-case average is "N*log2(N)", that amount 
of processing cannot be split by multiple CPUs. The algorithm is serial 
as such.

The bsort() algorithm is much more NCPU friendly, because it can split 
the work into fixed size smaller and independent work loads. Otherwise 
the work load doubles every time you merge two sorted lists. And when 
the number of lists to merge is fewer than the number of CPUs available, 
your system runs out of guts basically.

> 
> I also note that the swap code is pointlessly slow for size > 256 and
> should almost certainly use aliasing with matching words like memcpy
> implementations do.  Doing so would make it easier to port this code to
> CHERI where that is required.

I totally agree about the swap code being pointless. And I tried to look 
where is memswap(), but it was not there. This kind of swapping is done 
many places, and we could benefit from having a compiler supported 
memswap() function. What do you think?

--HPS