Shouldn't cat(1) use the C locale?

Oliver Fromme olli at lurza.secnetix.de
Wed May 6 08:32:06 UTC 2009


Juli Mallett wrote:
 > The cat manpage suggests that the infamous, non-standard -v extension
 > is ASCII-oriented but cat(1) these days uses isprint and pals and
 > calls setlocale(LC_CTYPE, ""), which for those of us with dodgy
 > environments (mine includes LC_ALL=en_US.UTF-8), means that "cat -v"
 > behaves radically-differently to the manual page describes.
 > 
 > Does anyone see any reason for our extensions, etc., to work with
 > LC_CTYPE != C?  It doesn't make a lot of sense to me.  I'd like to
 > change it if there's not a good reason to keep it broken this way,
 > like:
 > 
 > -       setlocale(LC_CTYPE, "");
 > +       setlocale(LC_CTYPE, "C");
 > 
 > Thoughts, etc.?

This is a difficult matter.  I guess when you ask n people,
you will get n different opinions.  Well, here's mine ...

I think this is a bug in the manual page.  When cat(1) is
using the current locale, that's perfectly correct behaviour
in a world that is clearly moving away from ASCII, towards
Unicode.  "Fixing" it by always using the ASCII-only "C"
locale would be a step backwards.  Instead it is better to
work on bringing all of the tools into compliance with
multibyte character encodings in general, and with UTF-8 in
particular, which seems to be the most important Unicode
encoding these days (and probably UTF-16, too).

So I think the manual page should be fixed so it says that
the -v option handles non-printing characters according to
the current locale, and cat needs to be fixed to handle
multibyte chars correctly if the -v option is used with a
UTF-8 locale.
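
To make the difference concrete, here is a small shell sketch
of how cat -v renders a byte when LC_CTYPE resolves to "C".
The helper name render_in_c_locale is made up for this example,
and it sets LC_ALL so that nothing in the caller's environment
can interfere.

```shell
#!/bin/sh
# Hypothetical helper: feed a printf format string through cat -v
# with the locale forced to "C".  LC_ALL is used because it takes
# precedence over LC_CTYPE and LANG.
render_in_c_locale() {
    # $1: printf format string; \344 is octal for byte 0xE4,
    # which is "a umlaut" in ISO8859-15.
    printf "$1" | LC_ALL=C cat -v
}

render_in_c_locale 'a\344b\n'    # prints: aM-db
render_in_c_locale 'x\001y\n'    # prints: x^Ay
```

In the "C" locale the byte 0xE4 is not printable, so it shows
up as M-d.  With a cat that honours LC_CTYPE and an ISO8859-15
locale, the same byte would be passed through unchanged, and
only the real control character would be flagged.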

By the way, your patch would probably be a POLA violation.
I currently have LC_CTYPE=de_DE.ISO8859-15 on most of my
machines (because FreeBSD's UTF-8 support is too incomplete
at the moment), and I'm occasionally using "cat -v" to
look for non-printable characters in that locale.  In fact
I have a zsh function:  "diff -u =(cat $1) =(cat -v $1)"
Your patch would break that.
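
The =(...) syntax is zsh-specific.  For other shells, a rough
equivalent might look like the sketch below.  The function name
show_nonprint is an assumption of this example, and it relies
on diff accepting "-" for standard input instead of making
temporary copies.

```shell
#!/bin/sh
# Hypothetical helper: show the lines of a file that contain
# characters which cat -v considers non-printable in the
# current locale.
show_nonprint() {
    # Compare the file with its cat -v rendering; diff reads the
    # rendering from standard input ("-").  The exit status is
    # diff's: 0 if nothing was flagged, 1 if something was.
    cat -v "$1" | diff -u "$1" -
}
```

Called as "show_nonprint somefile", it prints nothing when
every character is printable, and a unified diff of the
affected lines otherwise.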

I'm already somewhat annoyed that locale support was broken
in strings(1).  Some time ago, it used the current locale
so I could use it on German texts with my LC_CTYPE setting.
At some point in time, they probably introduced a patch
similar to yours and instead provided the -e option, which
does not work as expected ("-e S" is completely useless
because it prints characters that are non-printable in
ISO8859 locales).  Since then I have been forced to use cat -v
for that purpose.  Now you're proposing to break that, too.
I hope that explains a little bit why I'm against that
change.  ;-)

Best regards
   Oliver

PS:  If you set LC_* to a UTF-8 locale, but your environment
(i.e. tools and data) is not UTF-8-compliant, breakage is
expected.  If you still want to keep that LC_* setting,
a workaround would be to make aliases cat='LC_CTYPE=C cat'
or similar for tools that seem to be broken.

I also recommend *not* setting LC_ALL, but setting LANG
instead.  The difference is that you can override LANG, like
in the above example ("LC_CTYPE=C cat").  You cannot override
LC_ALL, because LC_ALL overrides everything else.  See the
environ(7) manual page for details.
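
The lookup order can be written down as a small shell sketch.
The function effective_ctype is made up for illustration; it
takes the three values as arguments rather than reading the
environment, and the final fallback to "C" is an assumption
(the default is up to the implementation).

```shell
#!/bin/sh
# Hypothetical helper: report which locale name LC_CTYPE ends up
# with, given the values of LC_ALL, LC_CTYPE and LANG.
# An empty string is treated the same as an unset variable.
effective_ctype() {
    if [ -n "$1" ]; then
        echo "$1"        # LC_ALL wins over everything else
    elif [ -n "$2" ]; then
        echo "$2"        # LC_CTYPE is used if LC_ALL is not set
    elif [ -n "$3" ]; then
        echo "$3"        # LANG is only the fallback
    else
        echo "C"         # assumed default when nothing is set
    fi
}

# LANG set, LC_CTYPE overridden for one command: the override works.
effective_ctype "" "C" "de_DE.ISO8859-15"     # prints: C
# LC_ALL set: the same override is silently ignored.
effective_ctype "en_US.UTF-8" "C" ""          # prints: en_US.UTF-8
```

The second call is the reason why "LC_CTYPE=C cat" has no
effect as long as LC_ALL is set in the environment.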

-- 
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606,  Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758,  Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart

FreeBSD-Dienstleistungen, -Produkte und mehr:  http://www.secnetix.de/bsd

"Perl will consistently give you what you want,
unless what you want is consistency."
        -- Larry Wall

