tr(1) buggy with de_DE.ISO8859-1(5) locale?
Oliver Fromme
olli at lurza.secnetix.de
Tue Feb 7 07:48:08 PST 2006
Martin Krzysiak <cinek at gmx.de> wrote:
> Oliver Fromme wrote:
> > It's not a bug. It's perfectly POSIX-compatible.
>
> I think this behavior is "undefined" in POSIX,
That's correct. Which means that FreeBSD's tr(1) is
POSIX-compatible. And any script which assumes that
"tr a-z A-Z" works in any locale is _not_ POSIX-
compatible.
Specifically, SUSv3 (a.k.a. POSIX-2001) says:
LC_COLLATE
Determine the locale for the behavior of range
expressions and equivalence classes.
And it also specifically gives the following as an
example of what must be used for case conversions:
tr -s '[:upper:]' '[:lower:]'
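As a minimal sketch of that form (the sample string and the
variable name are only illustrations, not part of the standard's
text), converting to lower case with character classes looks
like this:

```shell
# Lower-case a string using character classes instead of ranges.
# The sample text and the variable name "lowered" are illustrative.
lowered=$(printf 'Hello World\n' | tr '[:upper:]' '[:lower:]')
printf '%s\n' "$lowered"
```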
> It's not only upper-lowercase conversion that is weird.
> Try "echo wxyz | tr w-z a-d". Ranges are broken generally
> in ISO-locales, in my opinion.
Ranges are not broken, they just work as defined by the
locale. It's an error to assume that "a-d" always means
the four letters a, b, c, d. That's only true in the
US-ASCII locale (a.k.a. "C" or POSIX locale).
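To illustrate, forcing the "C" locale for a single command
gives the plain US-ASCII meaning of the ranges from the quoted
example. This is only a sketch, and the variable name is made up:

```shell
# With LC_ALL=C, "w-z" and "a-d" are the ASCII ranges,
# so w, x, y, z map to a, b, c, d.
mapped=$(echo wxyz | LC_ALL=C tr w-z a-d)
echo "$mapped"
```

In a locale with a different LC_COLLATE definition, the same
ranges may cover other characters.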
When you're browsing in an index of German words, you
_do_ want them to be ordered correctly, don't you?
That is, you expect words starting with a-umlaut ("ä")
to be ordered along with "a", not after "z" or anywhere
else. Therefore, the collation definitions are correct,
not broken.
> > By the way: Do not set LANG or LC_ALL, especially for
> > the root user, and especially when compiling things.
>
> One thing I like about FreeBSD is that I have my German
> environment.
What do you mean by "German environment"? I also have a
German environment, but I only set LC_CTYPE, not LC_ALL,
LANG or LC_COLLATE.
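Such a setup might look roughly like the following in a shell
startup file. The locale name is just an example, and it is
assumed to be installed on the system:

```shell
# Localize character classification only; leave collation alone.
# "de_DE.ISO8859-1" is an example name and must exist on the system
# to have any effect.
unset LC_ALL LANG LC_COLLATE
LC_CTYPE=de_DE.ISO8859-1
export LC_CTYPE
```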
> But you are right. The only locale that is
> expected to work correctly is "C".
I think that all locales work correctly, as far as I can
tell. At least the German ones that I use work correctly.
The only problem is that script authors who use tr(1)
make invalid assumptions about the behaviour of ranges.
> How many times did you use tr(1) to convert your texts
> to upper/lower case? Do you expect that it works correctly?
I don't have LC_COLLATE set (or LANG or LC_ALL), so I
expect that "tr a-z A-Z" works in the usual way when
used for English texts.
I never need to convert German texts from lower case to
upper case. But if I had to do that, the following way
that you mentioned would work fine for me, too (except
that I have to convert sharp-s ("ß") to "SS" manually):
> I would prefer to use it like: "tr a-zäöü A-ZÄÖÜ",
When writing scripts, I either use the correct tr syntax
with [:lower:] [:upper:], or, if I know that locale
support is not required, put "unset LC_ALL LC_COLLATE
LANG" at the beginning.
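The second variant could be sketched like this, assuming the
script really does not need locale support (the sample input and
the variable name are illustrative):

```shell
# Locale support is assumed to be unnecessary in this script, so drop
# the variables that would change how ranges are interpreted.
unset LC_ALL LC_COLLATE LANG

# The range form is now used for plain ASCII text only.
upper=$(printf 'hello\n' | tr a-z A-Z)
printf '%s\n' "$upper"
```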
Note that tr(1) is not appropriate to perform non-English
case conversions in general. For example, it never
handles the German sharp-s ("ß") correctly, no matter how
you set your locale, and no matter what syntax you use
with tr. This is a limitation which cannot be easily
solved, unfortunately. And German is easy ... There are
languages with more complicated rules. For example, in
Turkish, the letter "I" is not the upper-case of "i".
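For the sharp-s case, the manual step can be sketched as a
substitution that runs before the tr call. This assumes UTF-8
input and a UTF-8 script file, handles only ASCII letters plus
"ß" (umlauts are left as they are), and uses sed, which is my
choice for the illustration and not something tr provides:

```shell
# Replace sharp-s with "SS" first, then upper-case the ASCII letters.
# LC_ALL=C makes both tools work byte-wise, so the result does not
# depend on the caller's locale. Umlauts are left untouched.
shout=$(printf 'straße\n' | LC_ALL=C sed 's/ß/SS/g' | LC_ALL=C tr '[:lower:]' '[:upper:]')
printf '%s\n' "$shout"
```

This does nothing for languages such as Turkish, where the
mapping itself differs.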
> For people who are interested in a simple workaround.
> Don't use de_DE.ISO8859-1(5). Instead use de_DE.UTF-8.
> tr(1)'s ranges work like expected there.
tr's ranges _always_ work as expected, given how locales
work (especially LC_COLLATE). Using UTF-8 encoding
doesn't guarantee that 'a-z' works for case conversions
either. The _only_ reliable way is to use character
classes, as mentioned several times.
Best regards
Oliver
--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.
'Instead of asking why a piece of software is using "1970s technology,"
start asking why software is ignoring 30 years of accumulated wisdom.'