"Unprintable" 8-bit characters

Wed Nov 9 02:24:30 UTC 2011

On Wed, 09 Nov 2011 02:51:31 +0100
"Michael Ross" <gmx at ross.cx> wrote:

> Am 09.11.2011, 01:42 Uhr, schrieb Conrad J. Sabatier
> <conrads at cox.net>:
> 
> > Pardon me if this may seem like a stupid question, but this is
> > something that's been bugging me for a long time, and none of my
> > research has turned up anything useful yet.
> >
> > I've been trying to understand what the deal is with regards to the
> > displaying of the "extended" 8-bit character set, i.e., 8-bit
> > characters with the MSB set.
> >
> > More specifically, I'm trying to figure out how to get the "ls"
> > command to properly display filenames containing characters in this
> > extended set.  I have some MP3 files, for instance, whose names
> > contain certain European characters, such as the lowercase "u" with
> > umlaut (code 0xfc in the Latin set, according to gucharmap), that I
> > just can't get ls to display properly.  These characters seem to be
> > considered by ls as "unprintable", and the best I've been able to
> > produce in the ls output is backslash interpretations of the
> > characters using either the -B or -b options, otherwise the default
> > "?" is displayed in their place.
> 
> Unsure if I understand you correctly.
> ("extended" 8-bit character set with MSB? utf-16?)
> I'm confused by this charset stuff in general.

That is to say, "8-bit characters with the most significant bit set",
or "characters greater than 0x7f".

I can certainly appreciate your confusion; this is definitely a
confusing area.  In gucharmap, selecting the umlauted "u" in the Latin
set, the "Character Details" tab reveals the following:

U+00FC LATIN SMALL LETTER U WITH DIAERESIS

General Character Properties

In Unicode since: 1.1
Unicode category: Letter, Lowercase
Canonical decomposition: U+0075 LATIN SMALL LETTER U + U+0308 COMBINING
DIAERESIS

Various Useful Representations

UTF-8: 0xC3 0xBC
UTF-16: 0x00FC

C octal escaped UTF-8: \303\274
XML decimal entity: &#252;

So apparently, it's a multibyte character in UTF-8 (two bytes, 0xC3
0xBC, rather than the single byte 0xFC), which really throws a monkey
wrench into the works in certain situations (for example, one of the
little scripts I've written to process MP3 files uses the "cut"
command, which complains about an "illegal byte sequence").
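
To make the difference concrete, here is a small Python 3 sketch (not
one of my actual scripts; the helper name is_valid_utf8 is made up for
illustration).  It encodes U+00FC both ways and shows that the lone
Latin-1 byte 0xFC is not valid UTF-8, which is presumably the sort of
thing a tool running in a UTF-8 locale trips over:

```python
# U+00FC, LATIN SMALL LETTER U WITH DIAERESIS
ch = "\u00fc"

latin1_bytes = ch.encode("latin-1")  # one byte: 0xFC
utf8_bytes = ch.encode("utf-8")      # two bytes: 0xC3 0xBC

print(latin1_bytes.hex(), utf8_bytes.hex())


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The single Latin-1 byte is rejected; the two-byte form is accepted.
print(is_valid_utf8(latin1_bytes), is_valid_utf8(utf8_bytes))
```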

Even more confusing, selecting the character and copying it to the
clipboard, the UTF-16 representation (0xfc) is what actually gets
used.  Pasting this single-byte version into an X terminal (any of
them: xterm, gnome-terminal, etc.) does display the correct character,
an umlauted "u", even if using a multibyte locale, such as UTF-8.
Majorly confusing!
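
For what it's worth, one way to see which encoding a given filename
is actually stored in is to look at the raw bytes.  The following is
only a sketch under my own assumptions (names are either UTF-8 or
ISO-8859-1; decode_name and list_names are made-up helper names), not
something ls itself does:

```python
import os


def decode_name(raw: bytes) -> str:
    """Decode a raw filename, trying UTF-8 first, then ISO-8859-1.

    Assumption: anything that is not valid UTF-8 is Latin-1.  Latin-1
    maps every byte to a character, so the fallback never fails, but
    it may guess wrong for names stored in some other 8-bit encoding.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def list_names(path: bytes = b"."):
    """Yield (raw bytes, decoded text) for each entry in path."""
    # Passing a bytes path makes os.listdir() return undecoded bytes.
    for raw in os.listdir(path):
        yield raw, decode_name(raw)
```

Something like "for raw, text in list_names(): print(raw, text)" then
shows each name both as bytes and as decoded text.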

> Assuming you want \0xfc displayed as "ü",

Yes, exactly.

> > cat test.py && python test.py && ls -l
> 
> #!/usr/local/bin/python
> # -*- coding: utf-8 -*-
> 
> f=open('\xfc','w')
> f.close()
> total 2
> 
> -rw-r--r--  1 michael  wheel  29  9 Nov 02:43 test.py
> -rw-r--r--  1 michael  wheel   0  9 Nov 02:44 ü
> 
> 
> here is what works for me:
> 
> in my login class in /etc/login.conf:
> 
>          :charset=ISO-8859-1:\
>          :lang=de_DE.ISO8859-1:\
> 
> ``cap_mkdb /etc/login.conf'' after changes
> 
> 
> in /etc/rc.conf:
> 
> 	scrnmap="iso-8859-1_to_cp437"
> 	font8x8="cp850-8x8"
> 	font8x14="cp850-8x14"
> 	font8x16="cp850-8x16"
> 
> 
> and in /etc/ttys, console type is set to ``cons25l1''

Thanks, I hadn't considered making those sorts of changes for the
console.  I work so seldom nowadays in the console, I'd forgotten all
about that stuff (use it or lose it, as they say!).  I'll certainly give
that a try.

Much appreciation for both your and Robert's replies.

-- 
Conrad J. Sabatier
conrads at cox.net