Shouldn't cat(1) use the C locale?

Garrett Wollman wollman at csail.mit.edu
Wed May 6 17:43:15 UTC 2009


<<On Wed, 6 May 2009 19:07:45 +0200 (CEST), Oliver Fromme <olli at lurza.secnetix.de> said:

> Normally cat is agnostic of the encoding of its input data,
> because it is handled like binary data.  But if the -v
> option is used, it has to actually look at the data in
> order to decide what is printable and what is not.
> This has two consequences:  First, it has to know the
> encoding of the input, and second, it has to know what
> is considered "printable".

I think that should be fairly obvious: the input is a stream of bytes,
which may or may not encode characters in any locale.

> The same is true for binary files.  For example, if you have
> a binary with embedded ISO8859 strings that you want to display
> on a UTF8 terminal, then the following works:
> LC_CTYPE=en_US.ISO8859-1 cat -v file | recode iso8859-1..utf8
> It correctly displays German Umlauts and some other characters,
> but escapes 8bit characters that are non-printable in the
> ISO8859-1 locale.

Now try the same thing on a binary with UTF-8 strings in it.

(UTF-8 at least gives you a validity constraint on possible multibyte
characters, which arbitrary multibyte encodings do not necessarily
provide.  This mitigates the "reading frame" problem: the first byte
of a UTF-8 character can never appear as the second or later byte of
any UTF-8 character, so a decoder dropped into the middle of a stream
can always resynchronize at the next lead byte.)

-GAWollman


More information about the freebsd-standards mailing list