sort is broken

Mon Nov 4 08:47:31 UTC 2019

In message <07d3de09-b778-fb67-66d3-6a1c2900c7a4 at hedeland.org>,
Per Hedeland <per at hedeland.org> wrote:

>> While the above may perhaps *explain* the behavior I've reported, I do
>> not feel that it excuses it.  Not even marginally.  I say that for
>> three reasons.
>
>I never claimed otherwise...

My apologies for misconstruing.

>>            sort file
>>            sort < file
>>
>> Any difference in results between the above two commands, by definition,
>> violates the design principle of least surprise and is thus wholly
>> inappropriate, in my opinion, regardless of environmental circumstances.
>
>In the message above, I wrote:
>>
>> I wouldn't consider the "Illegal byte sequence" case a bug, but rather
>> the "success" case - why is the content converted, and why is it
>> different from stdin?
>
>So, yes, agreed.

Thank you, I understand now.  You would prefer both cases to yield the
-consistent- behavior of "Illegal byte sequence".  Although I am not
fully persuaded that this is the Right outcome, it certainly would be,
by definition, more consistent than what I myself was seeing.

>> 2)  The data I attempted to sort does *not* as far as I am able to determine
>> contain anything which is in any sense "illegal" or even invalid UTF-8.
...
>This is not conclusive, many environments can correctly display
>ISO-8859-1 in addition to UTF-8. Of course I don't know for a fact
>what is in your file, but it is trivial and unambiguous to determine
>by means of 'od' or 'hd' -

Yes.  Sorry. I should have posted that for the sake of completeness and
clarity.  Here is what "od -c" has to say is in the file in question:

0000000    z 374   r   i   c   h   .   e   m   a   i   l  \n

>I.e. the ISO-8859-1 character "ü" (hex fc) is encoded as hex c3 bc in
>UTF-8. If you doubt this, please read the definition of UTF-8 in
>https://tools.ietf.org/html/rfc3629 - or at least one of the
>properties that it enumerates:

I don't need to.  I'll take your word for it.

My trusty HP 16C Programmer's calculator says that the strange byte shown
above (\374) is actually equivalent to \xfc, which is rather clearly
different than what you are saying UTF-8 prescribes should be the code
for a lower case letter "u" with an umlaut.

I've largely managed to maintain an arguably blissful... up until now... 
ignorance of all things non-ASCII7 so I spent at least a couple of
minutes trying to deduce for myself why the single byte value of
(octal) \374 (decimal 252, hex FC) gets displayed by many tools as a
lower case u with an umlaut.  It would appear that this is what is called
for by ISO/IEC 8859:

   https://en.wikipedia.org/wiki/ISO/IEC_8859

so at least that explains that, to my satisfaction anyway.
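
For anyone without a programmer's calculator to hand, the same arithmetic
can be checked from the shell.  What follows is only a sketch: the file
name sample.latin1 is made up, the 13 bytes are retyped from the od -c
dump above, and it assumes an iconv that knows the names ISO-8859-1 and
UTF-8.

```shell
# Recreate the 13 bytes from the od -c dump (the file name is made up).
printf 'z\374rich.email\n' > sample.latin1

# Octal 374 expressed in decimal and in hex.
printf 'octal 374 = decimal %d = hex %x\n' 0374 0374

# Hex value of the second byte as it is stored in the file.
second_byte=$(od -An -tx1 -j1 -N1 sample.latin1 | tr -d ' \n')
echo "byte in file: $second_byte"

# The same character after conversion from ISO-8859-1 to UTF-8.
utf8_hex=$(printf '\374' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "bytes in UTF-8: $utf8_hex"
```

If the tools behave as assumed, the stored byte comes out as fc and the
converted form as c3bc, i.e. one byte in ISO-8859-1 and two in UTF-8.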

>    o  The octet values C0, C1, F5 to FF never appear.
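
Since FC falls in the F5 to FF range, a file containing it cannot be
valid UTF-8.  One way to test a whole file, sketched here under
assumptions, is to ask iconv to read it as UTF-8 and see whether it
objects.  The helper name is_valid_utf8 and both file names are invented
for illustration, and the sketch assumes that iconv exits non-zero on
malformed input, as the common implementations do.

```shell
# Two sample files (names are made up): the raw ISO-8859-1 byte,
# and the same text with the character encoded as UTF-8 (c3 bc).
printf 'z\374rich.email\n' > sample.latin1
printf 'z\303\274rich.email\n' > sample.utf8

# Illustrative helper, not part of any standard tool: succeeds only
# if iconv can read the whole file as UTF-8 without complaint.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

for f in sample.latin1 sample.utf8; do
    if is_valid_utf8 "$f"; then
        echo "$f: valid UTF-8"
    else
        echo "$f: not valid UTF-8"
    fi
done
```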

>> 3)  EVEN IF the line in question had in fact contained some invalid byte
>> sequence, even when construed in accordance with UTF-8, the response of
>> /usr/bin/sort in this instance is inconsistent, as noted in (1) above, and
>> even if that were not the case, the response of /usr/bin/sort is clearly
>> sub-optimal.  When faced with a "bad" byte sequence, sort could have, and
>> arguably should have fallen back and simply treated the bytes as bytes,
>> without interpretation, possibly issuing a non-fatal *warning* rather than
>> issuing a hard error and totally abandoning the task at hand, which is what
>> sort did in fact do in this case.

>This is clearly a matter of opinion...

I am forced to agree.  Many would prefer an immediate fatal error, rather
than issuing a warning which could easily be missed, thus allowing
improperly handled cases to "slip by", which itself in turn could possibly
result in undiagnosed issues making it into some embedded software
slated for delivery to one of the outer planets or into some component
affecting nuclear launches.  (I'm being serious.  I *do* worry about
such things.)

Anyway, it is certainly a matter of taste.

>- to sort a file with contents that is *impossible* to sort "according
>to the current locale's collating rules", I think I would prefer a
>hard error.

Reasonable people can have reasonably different views on this arcane question.
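
For those who do want the bytes-as-bytes treatment, it can at least be
requested explicitly by overriding the locale for a single command.  The
sketch below assumes only that the POSIX "C" locale collates by byte
value.  The file names and the two extra input lines are made up.  It
does not explain where the two code paths diverge under en_US.UTF-8; it
merely sidesteps the question.

```shell
# Sample input (made-up names); the first line holds the raw \374 byte.
printf 'z\374rich.email\nzebra.email\nzurich.email\n' > sample.latin1

# In the C locale, lines are compared byte by byte, so no byte is "illegal".
LC_ALL=C sort sample.latin1 > by_name.out
LC_ALL=C sort < sample.latin1 > by_stdin.out

# Both ways of supplying the input should now give the same result.
if cmp -s by_name.out by_stdin.out; then
    echo "identical output"
else
    echo "outputs differ"
fi

# \374 is larger than any ASCII byte, so that line should sort last.
last_byte=$(tail -n 1 by_name.out | od -An -tx1 -j1 -N1 | tr -d ' \n')
echo "second byte of last line: $last_byte"
```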

>>> If you convert your file to UTF-8, e.g. using the strange behavior of
>>> 'sort':
>>>
>>> $ sort test > test.utf8
>>> ...
>> 
>> I was not aware, until now, that /usr/bin/sort was, in addition to its
>> primary function, also a data conversion utility.  More to the point,
>> I would argue that the UNIX philosophy of having a large number of tools,
>> each of which performs one, and only one job, is violated if sort is now
>> also performing an additional (and unrequested) data conversion function.
>
>Sorry, it was just a joke (missing the smiley), followed by the proper
>invocation of 'iconv' for the purpose - as you can see above, I
>pointed out this broken behavior of 'sort' already in my original
>message, and describe it again as "strange behavior" in the message
>you quote now. And arguably this silent modification of the file
>contents is the most serious of the bugs uncovered here.

I think that we are ending on a note of agreement, which is good.

It would appear that one of these two, in my specific case, is tolerating
a byte value of \374 while the other is not:

    sort file
    sort < file

I am more than a little dismayed that no one else has been able to reproduce
this, but I'll get over it, and it wouldn't be the first time.

If I were enterprising, and if I had all kinds of time, I would sally
forth now and try to find the place in the sort sources where control
goes either left or right, depending on how input data is supplied,
but I don't so I won't.  Wherever it is, it is clearly wrong, no matter
which is the more desired treatment of the "funny" input data.

One certainly does (and must) wonder what the sort order of \374 is in
cases where sort has only the following envar value to go on, as in my
case:

    LANG=en_US.UTF-8

Regards,
rfg