sort is broken

Mon Nov 4 14:06:30 UTC 2019

On 2019-11-04 09:47, Ronald F. Guilmette wrote:
> 
> Thank you, I understand now. You would prefer both cases to yield the
> -consistent- behavior of "Illegal byte sequence".  Although I am not
> fully persuaded that this is the Right outcome, it certainly would be,
> by definition, more consistent than what I myself was seeing.

Yes, I have a preference for "Illegal byte sequence" for both - but
far more important is, as you say, that the two cases behave the
*same* way.

> My trusty HP 16C Programmer's calculator says that the strange byte shown
> above (\374) is actually equivalent to \xfc which is rather clearly
> different than what you are saying UTF-8 prescribes should be the code
> for a lower case letter "u" with an umlaut.
> 
> I've largely managed to maintain an arguably blissful... up until now...
> ignorance of all things non-ASCII7 so I spent at least a couple of
> minutes trying to deduce for myself why the single byte value of
> (octal) \374 (decimal 252, hex FC) gets displayed by many tools as a
> lower case u with an umlaut.  It would appear that this is what is called
> for by ISO/IEC 8859:
> 
>     https://en.wikipedia.org/wiki/ISO/IEC_8859

Yes, the nice tables there describe the character sets that are
typically referred to as ISO-8859-<N>, with ISO-8859-1 probably being
the most widely used one.
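
For anyone who wants to check the arithmetic without an HP 16C, here is a
small Python sketch (Python is my choice for illustration only, and the
variable names are made up). It shows the octal, decimal and hex spellings
of the byte, what it means when read as ISO-8859-1, and the different
two-byte sequence that UTF-8 uses for the same character:

```python
# The byte under discussion, written in octal, decimal and hex.
BYTE = 0o374
assert BYTE == 252 == 0xFC

raw = bytes([BYTE])

# Read as ISO-8859-1, the single byte is U+00FC (u with diaeresis/umlaut).
as_latin1 = raw.decode("iso8859-1")

# UTF-8 encodes that same character as two bytes, not as the byte 0xFC.
as_utf8 = as_latin1.encode("utf-8")

# ascii() keeps the output printable regardless of the terminal's charset.
print(oct(BYTE), hex(BYTE), ascii(as_latin1), as_utf8)
```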

> I think that we are ending on a note of agreement, which is good.
> 
> It would appear that one of these two, in my specific case, is tolerating
> a byte value of \374 while the other is not:
> 
>      sort file
>      sort < file

Yes - when you have a locale setting that specifies UTF-8. With the
default/C locale, or a locale that specifies ISO-8859-1 such as
en_US.ISO8859-1, both work fine (as they should).

> I am more than a little dismayed that no one else has been able to reproduce
> this, but I'll get over it, and it wouldn't be the first time.

I can reproduce it! :-) I assume that those who can't don't have a
locale setting that specifies UTF-8 (I don't), and didn't try with one
that does (I did:-).

> If I were enterprising, and if I had all kinds of time, I would sally
> forth now and try to find the place in the sort sources where control
> goes either left or right, depending on how input data is supplied,
> but I don't so I won't.  Wherever it is, it is clearly wrong, no matter
> which is the more desired treatment of the "funny" input data.

I had a quick look at the source, but it isn't trivial to follow. And
the behaviors are probably not due "immediately" to the 'sort' source,
but due to the functions it uses, such as mbtowc(3), mbstowcs(3),
wcscoll(3), and others. The error message is actually from (according
to 'man errno'):

   86 EILSEQ Illegal byte sequence. While decoding a multibyte character the
           function came along an invalid or an incomplete sequence of bytes
           or the given wide character is invalid.

- an errno value that (according to their man pages) can be returned
by most of the mb* and wc* functions. (Hm, "came along"?:-)
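
As a rough illustration of why a lone \374 is an "illegal byte sequence"
when the input is taken to be UTF-8, here is a sketch using Python's own
strict UTF-8 decoder. This is not the libc mb*/wc* code that 'sort' actually
goes through, and check_utf8 is a hypothetical helper name; it only mimics
the "0 or EILSEQ" convention. The numeric value of EILSEQ differs between
systems, so the sketch uses the symbolic constant:

```python
import errno


def check_utf8(data: bytes) -> int:
    """Return 0 if data is valid UTF-8, else errno.EILSEQ (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict decoding: any invalid byte raises
    except UnicodeDecodeError:
        return errno.EILSEQ
    return 0


lone_byte = b"\xfc"       # the byte from the file, on its own
proper_utf8 = b"\xc3\xbc"  # the UTF-8 encoding of u-umlaut

print(check_utf8(lone_byte) == errno.EILSEQ)  # the lone byte is rejected
print(check_utf8(proper_utf8) == 0)           # the two-byte form is accepted
```

The byte 0xFC has five leading 1 bits, and no byte in the range 0xF8-0xFF
may appear anywhere in UTF-8 as currently defined, so a strict decoder has
no choice but to reject it.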

> One certainly does (and must) wonder what the sort order of \374 is in
> cases where sort has only the following envar value to go on, as in my
> case:
> 
>      LANG=en_US.UTF-8

Indeed... - and at least the 'sort file' invocation apparently
*assumed* that it was an ISO-8859-1 character, since it was changed to
the corresponding UTF-8 encoding - even though the same value
represents other characters, with other UTF-8 encodings, in ISO-8859-5
and ISO-8859-7.
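
To make the ambiguity concrete, here is a sketch (again Python, purely for
illustration, using the codec names from its standard library) that decodes
the same byte under the three character sets mentioned and prints the
resulting code point and its UTF-8 encoding:

```python
RAW = b"\xfc"
CHARSETS = ("iso8859-1", "iso8859-5", "iso8859-7")

# Map each charset name to the character that the byte 0xFC stands for in it.
results = {charset: RAW.decode(charset) for charset in CHARSETS}

for charset, ch in results.items():
    # Show the Unicode code point and the bytes UTF-8 would use for it.
    print(charset, f"U+{ord(ch):04X}", ch.encode("utf-8"))
```

One byte, three different characters, three different UTF-8 encodings:
without knowing the original character set there is no single correct
conversion for a tool to pick.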

--Per