tr(1) buggy with de_DE.ISO8859-1(5) locale?

Tue Feb 7 07:48:08 PST 2006

Martin Krzysiak <cinek at gmx.de> wrote:
 > Oliver Fromme wrote:
 > > It's not a bug.  It's perfectly POSIX-compatible.
 > 
 > I think this behavior is "undefined" in POSIX,

That's correct.  Which means that FreeBSD's tr(1) is
POSIX-compatible.  And any script which assumes that
"tr a-z A-Z" works in any locale is _not_ POSIX-
compatible.

Specifically, SUSv3 (a.k.a. POSIX-2001) says:

    LC_COLLATE
        Determine the locale for the behavior of range
        expressions and equivalence classes.

And it also specifically mentions the following as an
example that must be used for case conversions:

    tr -s '[:upper:]' '[:lower:]'
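
A minimal sketch of that form, wrapped in a helper whose
name (to_lower) is made up for this illustration.  It
leaves out -s, because -s additionally squeezes runs of
repeated characters, which is usually not wanted in a
plain case conversion:

```shell
# to_lower is a hypothetical helper, not a standard utility.
# Character classes follow LC_CTYPE, so no assumption is made
# about how the locale orders the letters.
to_lower() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]'
}

to_lower 'Hello WORLD'    # prints "hello world"
```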

 > It's not only upper-lowercase conversion that is weird.
 > Try "echo wxyz | tr w-z a-d". Ranges are broken generally
 > in ISO-locales, in my opinion.

Ranges are not broken, they just work as defined by the
locale.  It's an error to assume that "a-d" always means
the four letters a, b, c, d.  That's only true in the
US-ASCII locale (a.k.a. "C" or POSIX locale).
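
To see the range behave the way the quoted example
expects, the collation order can be pinned to the C
locale for a single command.  This is only a sketch of
one way to do that; LC_ALL=C overrides every other locale
setting, but only for that one tr call:

```shell
# In the C locale, w-z is exactly w, x, y, z and a-d is a, b, c, d.
printf 'wxyz\n' | LC_ALL=C tr 'w-z' 'a-d'    # prints "abcd"
```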

When you're browsing in an index of German words, you
_do_ want them to be ordered correctly, don't you?
That is, you expect words starting with a-umlaut ("ä")
to be ordered along with "a", not after "z" or anywhere
else.  Therefore, the collation definitions are correct,
not broken.

 > > By the way:  Do not set LANG or LC_ALL, especially for
 > > the root user, and especially when compiling things.
 > 
 > One thing I like about FreeBSD is that I have my German
 > environment.

What do you mean by "German environment"?  I also have a
German environment, but I only set LC_CTYPE, not LC_ALL,
LANG or LC_COLLATE.
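
As a sketch, such a setup could look like this in a
Bourne-style login script.  The file (~/.profile) is an
assumption, and the locale name is simply the one from
this thread; adjust both to your system:

```shell
# Assumed to live in ~/.profile (sh syntax).
# Make sure nothing selects a collation order other than the default.
unset LC_ALL LANG LC_COLLATE

# Only character classification follows the German locale.
# The locale name is the one discussed in this thread.
LC_CTYPE=de_DE.ISO8859-1
export LC_CTYPE
```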

 > But you are right. The only locale that is
 > expected to work correctly is "C".

I think that all locales work correctly, as far as I can
tell.  At least the German ones that I use work correctly.

The only problem is that script authors who use tr(1)
make invalid assumptions about the behaviour of ranges.

 > How many times did you use tr(1) to convert your texts
 > to upper/lower case? Do you expect that it works correctly?

I don't have LC_COLLATE set (or LANG or LC_ALL), so I
expect that "tr a-z A-Z" works in the usual way when
used for English texts.

I never need to convert German texts from lower case to
upper case.  But if I had to do that, the following way
that you mentioned would work fine for me, too (except
that I have to convert sharp-s ("ß") to "SS" manually):

 > I would prefer to use it like: "tr a-zäöü A-ZÄÖÜ",

When writing scripts, I either use the correct tr syntax
with [:lower:] and [:upper:], or -- if I know that locale
support is not required -- put "unset LC_ALL LC_COLLATE
LANG" at the beginning.
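
A sketch of the second variant, for a script that only
deals with ASCII text.  The function name is made up, and
after the unset the implementation's default locale
applies, which I assume here to be "C":

```shell
# Remove every variable that could select a non-default collation order.
unset LC_ALL LC_COLLATE LANG

# to_upper_ascii is a hypothetical helper for ASCII-only input.
to_upper_ascii() {
    printf '%s\n' "$1" | tr 'a-z' 'A-Z'
}

to_upper_ascii 'freebsd'    # prints "FREEBSD" when the default locale is C
```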

Note that tr(1) is not appropriate to perform non-English
case conversions in general.  For example, it never
handles the German sharp-s ("ß") correctly, no matter how
you set your locale, and no matter what syntax you use
with tr.  This is a limitation which cannot be easily
solved, unfortunately.  And German is easy ...  There are
languages with more complicated rules.  For example, in
Turkish, the letter "I" is not the upper-case of "i".
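
For the German case, the manual step can be sketched as a
sed substitution in front of tr.  The helper name is made
up, and the sketch assumes that the script and its input
use the same encoding.  Whether umlauts are converted as
well depends on the tr implementation and on LC_CTYPE:

```shell
# to_upper_de is a hypothetical helper.
# Step 1: expand sharp-s to "SS", which tr cannot do because
#         it maps one character to exactly one character.
# Step 2: let the character classes handle the remaining letters.
to_upper_de() {
    printf '%s\n' "$1" | sed 's/ß/SS/g' | tr '[:lower:]' '[:upper:]'
}

to_upper_de 'straße'    # prints "STRASSE"
```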

 > For people who are interested in a simple workaround.
 > Don't use de_DE.ISO8859-1(5). Instead use de_DE.UTF-8.
 > tr(1)'s ranges work as expected there.

tr's ranges _always_ work as expected, given how locales
work (especially LC_COLLATE).  Using UTF-8 encoding
doesn't guarantee that 'a-z' works for case conversions
either.  The _only_ reliable way is to use character
classes, as mentioned several times.

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'