"Unprintable" 8-bit characters

Conrad J. Sabatier conrads at cox.net
Wed Nov 9 00:42:48 UTC 2011


Pardon me if this may seem like a stupid question, but this is
something that's been bugging me for a long time, and none of my
research has turned up anything useful yet.

I've been trying to understand what the deal is with regard to
displaying the "extended" 8-bit character set, i.e., 8-bit characters
with the MSB set.

More specifically, I'm trying to figure out how to get the "ls" command
to properly display filenames containing characters in this extended
set.  I have some MP3 files, for instance, whose names contain certain
European characters, such as the lowercase "u" with umlaut (code 0xfc
in the Latin set, according to gucharmap), that I just can't get ls to
display properly.  ls seems to consider these characters
"unprintable", and the best I've been able to get in the ls
output is backslash escapes for them, using either the -B or the
-b option; otherwise the default "?" is displayed in their place.
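
To make the symptom concrete, here is a rough sketch of the substitution
I'm describing.  It is not the actual ls source: it simply assumes that
only 7-bit ASCII (0x20 through 0x7e) counts as printable, which is what
I would expect under the plain "C" locale.  The filename and the function
names are made up for illustration.

```python
def is_printable_ascii(byte: int) -> bool:
    # Assumption: with no usable locale, only 7-bit ASCII
    # 0x20..0x7e is treated as printable.
    return 0x20 <= byte <= 0x7E


def show_default(name: bytes) -> str:
    # Mimics the default behaviour I see: "?" for each unprintable byte.
    return "".join(chr(b) if is_printable_ascii(b) else "?" for b in name)


def show_octal(name: bytes) -> str:
    # Mimics the -B style output: backslash plus a 3-digit octal value.
    return "".join(
        chr(b) if is_printable_ascii(b) else "\\%03o" % b for b in name
    )


if __name__ == "__main__":
    # Hypothetical filename with a Latin-1 u-umlaut (byte 0xfc).
    name = b"M\xfcnchen.mp3"
    print(show_default(name))  # M?nchen.mp3
    print(show_octal(name))    # M\374nchen.mp3
```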

The strange thing is that these characters will display just fine in
xterm, gnome-terminal, etc.  I can copy and paste them from the
gucharmap utility into a shell command line or other application, and
they appear as they should, but ls simply refuses to display them.  I
can print them using the printf command, and even bash's builtin echo
seems to have no problem with them.  Only ls appears to have this problem.

I've experimented with using various locales, using the LC_*
variables, as well as the LANG variable (as documented in the
environment section of the ls man page), all to no avail.

Is this an inherent limitation of ls, or is there some workaround or
other solution?  Do we need a new en_*.UTF-16 locale?  Should we
consider extending the ls command to handle these characters?  Or is
there something about all of this that I'm just not "getting"?

As an additional note, I notice that in the text console, this same
character code (0xfc) produces an entirely different character (a
lowercase n in a raised position, as for the exponent in a mathematical
expression).  Is there, in fact, no standardization re: the
representation of these "high bit" characters?
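
For what it's worth, the difference can be reproduced outside of ls.
The sketch below decodes the single byte 0xfc under a few encodings.
That the text console font follows code page 437 is only my guess,
based on the raised "n" I see there; the helper name is made up for
illustration.

```python
def decode_byte(byte: int, encoding: str) -> str:
    # Interpret one raw byte under the given character encoding.
    return bytes([byte]).decode(encoding)


if __name__ == "__main__":
    # ISO 8859-1 (Latin-1): lowercase u with umlaut.
    print(decode_byte(0xFC, "latin-1"))
    # Code page 437 (my guess for the console font): superscript n.
    print(decode_byte(0xFC, "cp437"))
    # In UTF-8, a lone 0xfc byte is not a valid character at all.
    try:
        decode_byte(0xFC, "utf-8")
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err.reason)
```

So the same byte value means different things depending on which
character set the displaying program assumes.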

Thanks to anyone who can help clear up this long-standing mystery for
me.

-- 
Conrad J. Sabatier
conrads at cox.net

