"Unprintable" 8-bit characters

Wed Nov 9 01:58:18 UTC 2011

On Tue, 8 Nov 2011 19:17:27 -0600 (CST)
Robert Bonomi <bonomi at mail.r-bonomi.com> wrote:

> 
> On Tue, 8 Nov 2011 18:42:36 -0600, "Conrad J. Sabatier" wrote:
> >
> > I've been trying to understand what the deal is with regard to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit
> > characters with the MSB set.
> 
> Quite simply, Unix dates from the days when the 8th bit was used as a
> 'parity' bit, allowing detection of *all* single-bit errors --
> especially over the notoriously unreliable connections known as
> 'serial ports'.

Ah, yes!  The "good old days".  :-)
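
For anyone following along, here is a minimal sketch of how a parity bit
in the 8th position catches a single flipped bit.  It assumes even parity
over 7-bit ASCII, and the function names are my own, purely for
illustration:

```python
def add_even_parity(ch7: int) -> int:
    """Return an 8-bit byte: 7 data bits plus an even-parity bit in the MSB."""
    if not 0 <= ch7 < 0x80:
        raise ValueError("expected a 7-bit value")
    # parity is 1 when the data bits hold an odd number of ones
    parity = bin(ch7).count("1") & 1
    return ch7 | (parity << 7)


def parity_ok(byte: int) -> bool:
    """True if the whole byte contains an even number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2 == 0


sent = add_even_parity(ord("u"))   # 'u' is 0x75, five 1 bits
received = sent ^ 0x04             # simulate one bit flipped on the line
print(hex(sent), parity_ok(sent), parity_ok(received))
# prints: 0xf5 True False
```

With the MSB spent on parity like this, there is nothing left over for
an "extended" character set.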

> > More specifically, I'm trying to figure out how to get the "ls"
> > command to properly display filenames containing characters in this
> > extended set.  I have some MP3 files, for instance, whose names
> > contain certain European characters, such as the lowercase "u" with
> > umlaut (code 0xfc in the Latin set, according to gucharmap), that I
> > just can't get ls to display properly.  These characters seem to be
> > considered by ls as "unprintable", and the best I've been able to
> > produce in the ls output is backslash interpretations of the
> > characters using either the -B or -b options, otherwise the default
> > "?" is displayed in their place.
> >
> > The strange thing is that these characters will display just fine in
> > xterm, gnome-terminal, etc.  I can copy and paste them from the
> > gucharmap utility into a shell command line or other application,
> > and they appear as they should, but ls simply refuses to display
> > them.  I can print them using the printf command; even bash's
> > builtin echo seems to have no problem with them.  Only ls appears
> > to have this problem.
> >
> > I've experimented with using various locales, using the LC_*
> > variables, as well as the LANG variable (as documented in the
> > environment section of the ls man page), all to no avail.
> 
> Obviously you never read as far as the '-w' switch.  <grin>

Yes, somehow that one went right past me.  Haste makes waste!  :-)

> > Is this an inherent limitation of ls, 
> 
> It is -not- a limitation; rather it is a _desired_ behavior -- so
> that one can _tell_ where there is an 'unprintable' character (like
> \r or \b) in a filename.  There are *good*reasons*(TM) why -q is the
> default behavior for 'terminal' output.

OK, I can see that.  :-)
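
To convince myself, a rough sketch of what the "?" and backslash-octal
renderings amount to.  This is not how ls is actually implemented; it
simply assumes that anything outside printable ASCII (0x20-0x7e) counts
as unprintable, and the helper names are hypothetical:

```python
def q_style(name: bytes) -> str:
    """Replace every byte outside printable ASCII with '?'."""
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "?" for b in name)


def octal_style(name: bytes) -> str:
    """Replace every such byte with a backslash and three octal digits."""
    return "".join(
        chr(b) if 0x20 <= b <= 0x7E else "\\%03o" % b for b in name
    )


name = b"M\xfcller.mp3"        # 0xfc is u-umlaut in Latin-1
print(q_style(name))           # prints: M?ller.mp3
print(octal_style(name))       # prints: M\374ller.mp3
print(q_style(b"a\rb"))        # prints: a?b
```

The last line is the case Robert describes: a stray \r in a filename
shows up as a visible "?" instead of silently moving the cursor.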

> > or is there some workaround or
> > other solution?  Do we need a new en_*.UTF-16 locale?  Should we
> > consider extending the ls command to handle these characters?
> 
> There _are_ "improved" versions of ls that do understand the 'locale'
> environment variables -- but those programs introduce a whole bunch of
> *other* 'not necessarily desired' behaviors -- like sorting
> upper-case and lower-case letters as 'equals', rather than regarding
> any upper-case as sorting before any lowercase.

Well, *that* certainly won't do!  That should be the exception, not the
rule.
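
A quick illustration of the two orderings, as a sketch only: plain
byte-order sorting versus a case-folded sort.  Real locale collation
rules are more involved than a simple casefold, and the filenames are
made up:

```python
names = ["README", "Makefile", "main.c", "abc.txt", "Zebra.txt"]

# Byte order: every upper-case letter sorts before any lower-case one.
byte_order = sorted(names)

# Case-folded: upper and lower case are treated as equals.
folded = sorted(names, key=str.casefold)

print(byte_order)
# prints: ['Makefile', 'README', 'Zebra.txt', 'abc.txt', 'main.c']
print(folded)
# prints: ['abc.txt', 'main.c', 'Makefile', 'README', 'Zebra.txt']
```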

> > Or is
> > there just something about all of this that I'm just not "getting"?
> >
> > As an additional note, I notice that in the text console, this same
> > character code (0xfc) produces an entirely different character (a
> > lowercase n in a raised position, as for the exponent in a
> > mathematical expression).  Is there, in fact, no standardization
> > re: the representation of these "high bit" characters?
> 
> "The nice thing about standards is that there are so many to choose
> from" applies.  WITH A VENGEANCE!!
> 
> There are at least FIFTEEN different sets of glyphs for the 'high bit
> set' byte codes *JUST* for the 'iso-8859' base charset.  Plus
> 'utf-8'.  And that's not counting the various bastardizations (e.g.
> 'CP-1252', etc.) that Microsoft has introduced.
> 
> > Thanks to anyone who can help clear up this long-standing mystery
> > for me.
> 
> <R>eading <t>he <f>ine <m>anpage -- with particular attention to the
> '-q' and '-w' options should provide some enlightenment.

Thank you very much.  Some of this matched the suspicions I already had
re: this matter.

Don't know how I completely missed the -w switch. Mea culpa.  :-)

So, what would be the safest bet as far as the most "universal"
representation for these characters?  Something I've long wondered
about when I've e-mailed people and copied/pasted these characters (are
they really seeing what I'm seeing?).  :-)
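
To make the question concrete, here is a small sketch that decodes the
same byte, 0xfc, under a few charsets.  The choice of codecs is my own
assumption (cp437 standing in for the text console's font), and I print
code points and Unicode names rather than the glyphs themselves so the
output survives any terminal:

```python
import unicodedata

RAW = b"\xfc"  # the byte from the filename


def decode_as(codec: str) -> str:
    """Decode the single byte 0xfc under the given 8-bit charset."""
    return RAW.decode(codec)


for codec in ("latin-1", "cp1252", "cp437"):
    ch = decode_as(codec)
    print(f"{codec:8} U+{ord(ch):04X} {unicodedata.name(ch)}")

# UTF-8 spells the same u-umlaut as two bytes, and rejects a bare 0xfc.
print("\u00fc".encode("utf-8"))
try:
    RAW.decode("utf-8")
except UnicodeDecodeError as err:
    print("not valid UTF-8:", err.reason)
```

Latin-1 and CP-1252 both give U+00FC (the u-umlaut), while cp437 gives
U+207F, the raised "n" I see on the console.  So the same byte really is
a different character depending on which table the other end assumes.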

-- 
Conrad J. Sabatier
conrads at cox.net