sort is broken

Ronald F. Guilmette rfg at tristatelogic.com
Sun Nov 3 22:20:57 UTC 2019


In message <f416a932-7084-bec3-8a7a-8efaaebc2952 at hedeland.org>, 
Per Hedeland <per at hedeland.org> wrote:

>> In my env, LC_ALL is not set at all.
>> 
>> I do have these, but not sure if they make any difference:
>> 
>> LANG=en_US.UTF-8
>
>This, in combination with trying to sort a file with contents that
>*isn't* valid UTF-8, is the reason for the behavior you observe - see
>my previous post.

While the above may perhaps *explain* the behavior I've reported, I do
not feel that it excuses it.  Not even marginally.  I say that for
three reasons.

1)  There are -zero- circumstances in which it makes any sense whatsoever
to have the results of the following two commands be in the least bit
different:

          sort file
          sort < file

Any difference in results between the above two commands, by definition,
violates the design principle of least surprise and is thus wholly
inappropriate, in my opinion, regardless of environmental circumstances.

2)  The data I attempted to sort does *not*, as far as I am able to determine,
contain anything which is in any sense "illegal" or even invalid UTF-8.
Quite the contrary, in fact.  I am able to view the line in question with
no problems by simply cat'ing it to my UTF-8 enabled xterm window, and I
was also able to upload it to Pastebin, where it displays in a manner that
was exactly as intended, I think, with an umlaut over the "u" in zuruich,
and lastly I also pasted it into my Bugzilla bug report on this issue
where it also displays in a quite reasonable and expected fashion.  Given
these facts, I am favorably inclined to believe that the string in question,
which certainly contains a byte sequence that falls outside of the confines
of 7-bit ASCII, does not contain any improper UTF-8 sequences.
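
Displaying correctly in a terminal or a web page is suggestive, but a strict
decode is a more direct test of whether a byte string is valid UTF-8.  A
minimal Python sketch of such a check follows.  The function name
is_valid_utf8 and the sample string are assumptions made for illustration;
the sample is not the actual line from the bug report.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8, False otherwise."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample, NOT the real data from the bug report.
sample = "z\u00fcrich"  # "zurich" with an umlaut over the "u"

# The same text in two encodings: only the first is valid UTF-8.
print(is_valid_utf8(sample.encode("utf-8")))
print(is_valid_utf8(sample.encode("latin-1")))
```

Running the same check over the raw bytes of the file in question would settle
whether the offending line is or is not well-formed UTF-8.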

3)  EVEN IF the line in question had in fact contained some invalid byte
sequence, even when construed in accordance with UTF-8, the response of
/usr/bin/sort in this instance is inconsistent, as noted in (1) above, and
even if that were not the case, the response of /usr/bin/sort is clearly
sub-optimal.  When faced with a "bad" byte sequence, sort could have, and
arguably should have, fallen back and simply treated the bytes as bytes,
without interpretation, possibly issuing a non-fatal *warning* rather than
issuing a hard error and totally abandoning the task at hand, which is what
sort did in fact do in this case.
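
The fallback described in (3) can be sketched in a few lines of Python.  This
is only an illustration of the idea, not how /usr/bin/sort is implemented:
the function name sort_lines is hypothetical, and for simplicity the "text"
path orders lines by Unicode code point rather than by real locale collation.

```python
import sys


def sort_lines(data: bytes) -> list:
    """Sort lines as UTF-8 text; on invalid input, warn and sort raw bytes."""
    lines = data.splitlines()
    try:
        # Text path: decode each line strictly and order by the decoded string.
        return sorted(lines, key=lambda line: line.decode("utf-8"))
    except UnicodeDecodeError as exc:
        # Fallback path: non-fatal warning, then plain byte-wise ordering.
        print(f"warning: invalid UTF-8 ({exc.reason}); "
              "sorting as raw bytes", file=sys.stderr)
        return sorted(lines)


# Hypothetical input containing a byte (0xfc) that is not valid UTF-8.
for line in sort_lines(b"b\nZ\xfcrich\na\n"):
    sys.stdout.buffer.write(line + b"\n")
```

The point of the design is that a bad byte sequence degrades the ordering
rules but never prevents the sort from completing.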


>If you convert your file to UTF-8, e.g. using the strange behavior of
>'sort':
>
>$ sort test > test.utf8
>...

I was not aware, until now, that /usr/bin/sort was, in addition to its
primary function, also a data conversion utility.  More to the point,
I would argue that the UNIX philosophy of having a large number of tools,
each of which performs one, and only one job, is violated if sort is now
also performing an additional (and unrequested) data conversion function.


Regards,
rfg


More information about the freebsd-questions mailing list