"Unprintable" 8-bit characters

Wed Nov 9 03:00:01 UTC 2011

On Wed, 9 Nov 2011 03:10:24 +0100
Polytropon <freebsd at edvax.de> wrote:

> On Wed, 09 Nov 2011 02:51:31 +0100, Michael Ross wrote:
> > Am 09.11.2011, 01:42 Uhr, schrieb Conrad J. Sabatier
> > <conrads at cox.net>:

[snip]

> > > I've been trying to understand what the deal is with regards to
> > > the displaying of the "extended" 8-bit character set, i.e., 8-bit
> > > characters with the MSB set.

[snip] 

> > Unsure if I understand you correctly.
> > ("extended" 8-bit character set with MSB? utf-16?)
> > I'm confused by this charset stuff in general.
> > 
> > Assuming you want the byte 0xFC displayed as "ü",

[snip]

> > here is what works for me:
> > 
> > in my login class in /etc/login.conf:
> > 
> >          :charset=ISO-8859-1:\
> >          :lang=de_DE.ISO8859-1:\
> > 
> > ``cap_mkdb /etc/login.conf'' after changes
> 
> Ah, thanks - that seems to be the proper way to have
> the environment variables set - instead of my (ab)use
> of setenv's in the csh config file. :-)

Same here.  I've been "guilty" as well of neglecting to properly adjust
my console configuration.

> Note the "precedence" of $LANG vs. $LC_* (as they can
> be used to configure things more precisely, e. g.
> regarding system messages or date formats; see example
> following).
> 
> 
> 
> > in /etc/rc.conf:
> > 
> > 	scrnmap="iso-8859-1_to_cp437"
> 
> Hm? CP437? Codepage? Isn't that some MS-DOS thing?
> I've never needed a screenmap to make "extended
> characters" (everything beyond US-ASCII) work.
> 
> 
> 
> > 	font8x8="cp850-8x8"
> > 	font8x14="cp850-8x14"
> > 	font8x16="cp850-8x16"
> > 
> > 
> > and in /etc/ttys, console type is set to ``cons25l1''
> 
> I have a similar setting here, but that does _not_ work
> with UTF-8 encoded characters. If I want to use them, I
> have to change some environment variables, from
> 
> 	#-------GERMAN/ENGLISH------------------------ <=== DEFAULT
> 	setenv	LC_ALL		en_US.ISO8859-1
> 	setenv	LC_MESSAGES	en_US.ISO8859-1
> 	setenv	LC_COLLATE	de_DE.ISO8859-1
> 	setenv	LC_CTYPE	de_DE.ISO8859-1
> 	setenv	LC_MONETARY	de_DE.ISO8859-1
> 	setenv	LC_NUMERIC	de_DE.ISO8859-1
> 	setenv	LC_TIME		de_DE.ISO8859-1
> 	unsetenv LANG
> 
> to
> 
> 	#-------INTERNATIONAL-------------------------
> 	setenv	LC_ALL		en_US.UTF-8
> 	setenv	LC_MESSAGES	en_US.UTF-8
> 	setenv	LC_COLLATE	de_DE.UTF-8
> 	setenv	LC_CTYPE	de_DE.UTF-8
> 	setenv	LC_MONETARY	de_DE.UTF-8
> 	setenv	LC_NUMERIC	de_DE.UTF-8
> 	setenv	LC_TIME		de_DE.UTF-8
> 	setenv	LANG		de_DE.UTF-8

Doesn't using "LC_ALL" obviate the need to set any of the other LC_*
variables?  At least, that's always been my understanding of it.

But, getting back to something you said earlier, what did you mean
exactly about the precedence of LANG vs. LC_*?
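
For reference, here is a minimal sketch of the lookup order as I
understand it from POSIX: LC_ALL first, then the category's own LC_*
variable, then LANG.  The function name effective_locale and the dict
standing in for the environment are my own inventions for
illustration; this is not how any libc actually implements it.

```python
def effective_locale(category, env):
    """Return the locale value that applies to one category, e.g. "LC_TIME"."""
    # Lookup order as described by POSIX: LC_ALL, then the category's own
    # variable, then LANG.  Empty values are treated as unset.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    # Nothing set: the default is implementation-defined; "C" is assumed here.
    return "C"


# The "DEFAULT" block quoted above, as a plain dict (LANG is unset there).
quoted = {
    "LC_ALL": "en_US.ISO8859-1",
    "LC_MESSAGES": "en_US.ISO8859-1",
    "LC_COLLATE": "de_DE.ISO8859-1",
    "LC_CTYPE": "de_DE.ISO8859-1",
    "LC_MONETARY": "de_DE.ISO8859-1",
    "LC_NUMERIC": "de_DE.ISO8859-1",
    "LC_TIME": "de_DE.ISO8859-1",
}

# With LC_ALL present, the per-category value is never consulted.
print(effective_locale("LC_TIME", quoted))

# Drop LC_ALL and the per-category setting takes effect.
without_all = {k: v for k, v in quoted.items() if k != "LC_ALL"}
print(effective_locale("LC_TIME", without_all))
```

If that reading is right, the de_DE values in the quoted block only
take effect once LC_ALL is removed.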

> Then I can use UTF-8 characters inside rxvt-unicode. Of
> course, text mode console is limited to the first set
> of configuration, using the ISO 8859-1 character set.
> 
> This worked long before UTF-8 arrived with the glorious
> idea that I should have 2 bytes where one is sufficient,
> to describe our (german) 6 umlauts and the Eszett ligature. :-)

<grin>

Yes, and this is one area where the labels are more than a little
misleading as well.  My natural inclination is to think of UTF-8 as
being a single-byte representation for each character in the set,
whereas UTF-16, as the name implies, would be the "wide", 2-byte
version.  Nonetheless, as I posted earlier in this thread, according to
the info in gucharmap, the umlauted "u" takes two bytes in UTF-8 as
well:

UTF-8: 0xC3 0xBC
UTF-16: 0x00FC

Go figure, huh?  :-)
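
The byte counts are easy to check outside of gucharmap.  A small
sketch using nothing but Python's built-in codecs (the helper name
hex_bytes is made up for this example): the "8" and "16" refer to the
size of the code unit, not to a fixed size per character, so UTF-8
only uses a single byte for the plain ASCII range.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated hex, e.g. '0xC3 0xBC'."""
    return " ".join(f"0x{b:02X}" for b in data)


u_umlaut = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS

# "utf-16-be" is big-endian UTF-16 without a byte order mark.
for encoding in ("iso-8859-1", "utf-8", "utf-16-be"):
    encoded = u_umlaut.encode(encoding)
    print(f"{encoding:10s} {len(encoded)} byte(s): {hex_bytes(encoded)}")

# A plain ASCII letter needs one byte in UTF-8 but still two in UTF-16.
print(len("u".encode("utf-8")), len("u".encode("utf-16-be")))
```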

> Improper settings will result in [][] or A-tilde three
> quarters upside-down question mark, depending on editor
> or terminal used.

Yes, I will definitely have to try using the recommendations that have
come up in this thread re: the console.
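
The garbage itself is easy to reproduce, for what it's worth.  A rough
sketch, assuming the mismatch is simply bytes written in one encoding
and read back in the other; what a given terminal or editor actually
draws for the results will still depend on its font.

```python
u_umlaut = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS

# UTF-8 bytes read by something that expects ISO 8859-1:
# each of the two bytes turns into its own Latin-1 character.
utf8_bytes = u_umlaut.encode("utf-8")
as_latin1 = utf8_bytes.decode("iso-8859-1")
print(ascii(as_latin1))

# ISO 8859-1 byte read by something that expects UTF-8:
# 0xFC is not valid UTF-8 on its own, so the decoder substitutes
# U+FFFD (REPLACEMENT CHARACTER), often drawn as a box or "?".
latin1_bytes = u_umlaut.encode("iso-8859-1")
as_utf8 = latin1_bytes.decode("utf-8", errors="replace")
print(ascii(as_utf8))
```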

> But returning to the original question, I think Robert
> did explain it very well: There is no real consensus
> about what the different codings should mean. They
> were meant to unify the representation of a very large
> set of characters, but basically there are many
> interpretations now, and how they show up to the user
> depends on the font in use, _if_ it has this mapping or
> that, or none.

This seems rather unfortunate to me.  You would think that, by now,
some "standard" character set might have emerged that would allow one
to use, at the very least, the "Western" characters (as opposed to
the "Eastern" or "Oriental" or "Asian", if you will) with a reasonable
expectation that others will see what was intended.

> For running ls, -w is the right option to use - but IN
> COMBINATION with correct settings for the terminal
> emulation AND the presence of a font that will do.

Yes.  I'm still a little embarrassed for having completely overlooked
that option earlier.  Hasty (impatient) reading of man pages.  :-)

> Again a fine demonstration why file names should be
> limited to printable ASCII and no spaces if you want
> them to work everywhere. :-)

Well, for myself, I'm a bit of a stickler for "language authenticity",
you might call it.  Having studied both German and French rather
extensively in my younger days, I'm quite fond of both languages, and
rather keen on seeing them represented accurately.  I especially wince
at the use of the plain, unaccented vowel followed by an "e" in place
of the umlaut and, to a lesser degree, the use of "ss" in place of the
Eszett.  Getting these characters to display properly has caused me no
small amount of confusion, aggravation and frustration over the years,
to be sure!  :-)

-- 
Conrad J. Sabatier
conrads at cox.net