sort is broken
per at hedeland.org
Sun Nov 3 01:23:26 UTC 2019
On 2019-11-02 23:29, Dr. Nikolaus Klepp wrote:
> Anno domini 2019 Sat, 02 Nov 15:11:37 -0700
> Ronald F. Guilmette scripsit:
>> In message <eec0b13b-b5d6-7e51-6241-8e1898150315 at queldor.net>, you wrote:
>>> On 11/2/19 5:14 PM, Ronald F. Guilmette wrote:
>>>> Not a question, just an expression of grief and deep dismay.
>>>> It is a sad day when even very fundamental tools, used in billions
>>>> of scripts, such as /usr/bin/sort turn up broken.
>>> root at q4:/ # sort a
>>> root at q4:/ # sort < a
>>> root at q4:/ # uname -a
>>> FreeBSD q4.queldor.net 12.0-RELEASE-p3 FreeBSD 12.0-RELEASE-p3 GENERIC
>>> root at q4:/ # cat a
>>> root at q4:/ #
>>> Seems to be fine on my 12.0
>> Well, I guess it's just me then...
>> % uname -a
>> FreeBSD segfault.tristatelogic.com 12.0-RELEASE FreeBSD 12.0-RELEASE r341666 GENERIC amd64
>> % sort --version
>> What version of sort do you have?
> I remember that this sort of thing has been around since at least 11.0. The problem occurs when you have UTF-8 encoding set as default but the input data is ISO 8859-1. Some characters of ISO 8859-1 (äöü...) are not valid in UTF-8.
This is exactly the problem - in fact, by definition (see RFC 3629),
*no* byte with a value outside the range 0x00 to 0x7f is valid on its
own in UTF-8 - and that covers every character in the upper half of
8859-1, i.e. all 96 of them at 0xa0 to 0xff (ü is 0xfc).
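
To make the encoding point concrete, here is a minimal sketch in Python
(my choice of language for illustration, nothing to do with how sort(1)
is implemented). The helper name is_valid_utf8 is hypothetical; it just
asks Python's own UTF-8 decoder whether a byte string is well-formed.

```python
# Hypothetical helper for illustration only; not part of sort(1).
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+00FC is LATIN SMALL LETTER U WITH DIAERESIS.
latin1_u_umlaut = "\u00fc".encode("iso-8859-1")  # single byte 0xfc
utf8_u_umlaut = "\u00fc".encode("utf-8")         # two bytes 0xc3 0xbc

print(latin1_u_umlaut.hex(), is_valid_utf8(latin1_u_umlaut))  # fc False
print(utf8_u_umlaut.hex(" "), is_valid_utf8(utf8_u_umlaut))   # c3 bc True

# No byte in 0x80..0xff is valid UTF-8 when it stands alone.
lone_high_bytes_valid = [is_valid_utf8(bytes([b])) for b in range(0x80, 0x100)]
print(any(lone_high_bytes_valid))  # False
```

So the 8859-1 byte 0xfc is rejected by a UTF-8 decoder, while the same
character encoded as c3 bc is accepted.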
$ uname -a
FreeBSD pluto.hedeland.org 12.0-RELEASE FreeBSD 12.0-RELEASE GENERIC amd64
$ env LANG=C sort < /tmp/test
$ env LANG=en_US.UTF-8 sort < /tmp/test
sort: Illegal byte sequence
And the "success" case:
$ env LANG=en_US.UTF-8 sort /tmp/test
Not sure if it survives the e-mail encoding, but the output here has
actually been *converted* to the correct UTF-8 representation - if my
terminal was set up for UTF-8, I would actually see "ü" there.
$ od -t x1 /tmp/test
0000000 7a fc 72 69 63 68 2e 65 6d 61 69 6c 0a
$ env LANG=en_US.UTF-8 sort /tmp/test | od -t x1
0000000 7a c3 bc 72 69 63 68 2e 65 6d 61 69 6c 0a
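
The two dumps above differ exactly the way they would if the input had
been read as ISO 8859-1 and written back out as UTF-8. A small Python
check of that, as a sketch only: the byte strings are copied from the
od output, and I am not claiming this is what sort does internally,
only that the result matches.

```python
# Bytes copied from the two od dumps above.
original = bytes.fromhex("7a fc 72 69 63 68 2e 65 6d 61 69 6c 0a")
observed = bytes.fromhex("7a c3 bc 72 69 63 68 2e 65 6d 61 69 6c 0a")

# Interpret the input as ISO 8859-1, then re-encode it as UTF-8.
reencoded = original.decode("iso-8859-1").encode("utf-8")

print(reencoded.hex(" "))     # 7a c3 bc 72 69 63 68 2e 65 6d 61 69 6c 0a
print(reencoded == observed)  # True
```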
I wouldn't consider the "Illegal byte sequence" case the bug; if
anything, the "success" case is - why is the content converted, and why
does reading the file by name behave differently from reading it on
stdin?