"Unprintable" 8-bit characters

Wed Nov 9 17:25:46 UTC 2011

On Tue, 8 Nov 2011 20:59:48 -0600, Conrad J. Sabatier wrote:
> Same here.  I've been "guilty" as well of neglecting to properly adjust
> my console configuration.

Sometimes "just works" in combination with laziness beats
all proper concepts of doing things. :-)

> Doesn't using "LC_ALL" obviate the need to set any of the other LC_*
> variables?  At least, that's always been my understanding of it.

I have to admit that I haven't fully understood everything
in that relation, but it seems that the $LC_* (!ALL) can
modify "subsets" of what $LANG defines, while $LC_ALL
overrides all of them. Languages and character sets can be
assigned independently (e. g. English program messages,
but German file names properly displayed).

> But, getting back to something you said earlier, what did you mean
> exactly about the precedence of LANG vs. LC_*?

There is, if I remember correctly, the idea that $LC_ALL,
_if_ set, overrides everything else, then come the
individual $LC_*, and $LANG is only the fallback for
whatever is still unset.

http://www.freebsd.org/doc/handbook/using-localization.html
See 24.3.4.1.1.1 and 24.3.4.1.2.
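
A minimal sketch of the lookup order as POSIX describes it for
setlocale(3) called with an empty locale string: LC_ALL first, then
the category's own variable, then LANG. The function name
effective_locale is made up for illustration, and the final "C"
fallback is an assumption (POSIX leaves that default to the
implementation):

```python
import os


def effective_locale(category, env=None):
    """Return the locale name that applies to one category, e.g. 'LC_CTYPE'."""
    env = os.environ if env is None else env
    # Lookup order: LC_ALL, then the category's own variable, then LANG.
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:  # unset and empty both count as "not set"
            return value
    return "C"  # assumed default; the real one is implementation-defined


if __name__ == "__main__":
    for cat in ("LC_CTYPE", "LC_MESSAGES", "LC_COLLATE"):
        print(cat, "->", effective_locale(cat))
```

With LANG=en_US.UTF-8 and LC_CTYPE=de_DE.UTF-8 this should yield
German character handling and English for everything else; adding
LC_ALL=C on top would override both.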

> Yes, and this is one area where the labels are more than a little
> misleading as well.  My natural inclination is to think of UTF-8 as being a
> single-byte representation for each character in the set, whereas
> UTF-16, as the name implies, would be the "wide", 2-byte version.
> Nonetheless, as I posted earlier in this thread, according to the info
> in gucharmap, the representations of the umlauted "u" are just the
> opposite of this:
> 
> UTF-8: 0xC3 0xBC
> UTF-16: 0x00FC
> 
> Go figure, huh?  :-)

I think Robert did explain it very well: While UTF-16
works in 2-byte units (one unit for most characters, two
for the rest), UTF-8 is "variable width" in single bytes
(1 byte for US-ASCII, 2 for "ü", up to 4 for others).
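
To see the byte counts for yourself, here is a small sketch using
nothing but Python's built-in codecs. The helper name hex_bytes and
the choice of sample characters are mine; "utf-16-be" is picked so
that no byte order mark gets in the way:

```python
def hex_bytes(text, encoding):
    """Encode text and return its bytes formatted as '0xNN 0xNN ...'."""
    return " ".join(f"0x{b:02X}" for b in text.encode(encoding))


# "utf-16-be" is big-endian without a byte order mark, so the bytes
# read in the same order as the code point itself.
samples = ("A", "\u00fc", "\u7fc1", "\U0001d11e")

for ch in samples:
    utf8 = hex_bytes(ch, "utf-8")
    utf16 = hex_bytes(ch, "utf-16-be")
    print(f"U+{ord(ch):04X}  UTF-8: {utf8:<20} UTF-16: {utf16}")
```

For the "ü" (U+00FC) this gives 0xC3 0xBC in UTF-8 and 0x00 0xFC in
UTF-16, which matches what gucharmap reports. The last sample lies
outside the 16-bit range and needs four bytes in both encodings.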

> > But returning to the original question, I think Robert
> > did explain it very well: There is no real consensus
> > about what the different codings should mean. They
> > were meant to unify the representation of a very large
> > set of characters, but basically there are many inter-
> > pretations now, and how they show up to the user depends
> > on the font in use, _if_ it has this mapping or that,
> > or none.
> 
> This seems rather unfortunate to me.  You would think that, by now,
> some "standard" character set might have emerged that would allow one
> to use, at the very least, the "Western" characters (as opposed to
> the "Eastern" or "Oriental" or "Asian", if you will) with a reasonable
> expectation that others will see what was intended.

Assumptions, wishes, conclusions and hopes do differ from
reality. :-)

For example, in October I had to assist working on a
document containing German text and Chinese characters.
Decision: We use UTF-8 so the Chinese characters can appear
in the input. A name: Weng Tonghe [][][]. The brackets
stand for the three characters of that name.
They did show up properly in the editor, but on the
printed page... Weng Tonghe [][]. What? Two? But there
were three on input! As we found out, the "he" used
in input was the wrong one (there are several "he"s),
and the font used to render the text did not have that
particular "he". When we found the correct one, finally
three characters appeared, as intended and correct.

This should show: You _never_ know where things are
wrong when something is missing - settings, fonts,
who knows. In relation to file names, this is not a
problem of the file system, as it will store any name
you want, but whether you can actually SEE or USE that
file name - that's a completely different thing.
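
A rough illustration of "stored fine, displayed wrong": the same
bytes of a file name, decoded once with the charset they were written
in and once as UTF-8. The file name and the helper as_seen are
hypothetical, and ascii() is used for output so the example itself
does not depend on your terminal settings:

```python
# Hypothetical name as stored on disk by someone using ISO 8859-1.
raw_name = "f\u00fcr.txt".encode("latin-1")  # the bytes b'f\xfcr.txt'


def as_seen(raw, encoding):
    """Decode the stored bytes the way a reader using that charset would."""
    # errors="replace" puts U+FFFD where a byte is invalid in the encoding.
    return raw.decode(encoding, errors="replace")


print(ascii(as_seen(raw_name, "latin-1")))  # the intended u-umlaut survives
print(ascii(as_seen(raw_name, "utf-8")))    # 0xFC is not valid UTF-8 here
```

The file system never complained about either reading; only the
second reader gets a replacement character instead of the "ü".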

> > Again a fine demonstration why file names should be
> > limited to printable ASCII and no spaces if you want
> > them to work everywhere. :-)
> 
> Well, for myself, personally, I'm a bit of a stickler for "language
> authenticity", you might call it.  Having studied both German and
> French rather extensively in my younger days, I'm quite fond of both
> languages, and rather keen on seeing them represented accurately (I
> especially wince at the use of the plain, unaccented vowel followed by
> an "e" in place of the umlaut, and to a lesser degree, the use of "ss"
> in place of Esszett), which has caused me no small amount of confusion,
> aggravation and frustration over the years, to be sure!  :-)

Make sure to call it "Eszett" ("Es" = S and "Zett" = Z).
The teletyping conventions suggest dissolving "ß" to "sz",
since recombining "sz" to "ß" is likely to be correct,
whereas recombining "ss" to "ß" is often wrong, as there
are too many correct "ss" in texts.

Example:
Mißwirtschaft -> Miszwirtschaft -> Mißwirtschaft  ===> good.
Messer -> Meßer  ===> wrong.
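
The same thing as a naive sketch with plain string replacement. The
function names are made up, and real text would need a dictionary
rather than blind substitution; this only shows why "sz" is the
safer of the two conventions, and that even "sz" is not bulletproof:

```python
def to_teletype(word):
    """Dissolve Eszett into 'sz', as the teletyping convention suggests."""
    return word.replace("ß", "sz")


def recombine_sz(word):
    """Blindly turn every 'sz' back into Eszett."""
    return word.replace("sz", "ß")


def recombine_ss(word):
    """Blindly turn every 'ss' into Eszett (the risky direction)."""
    return word.replace("ss", "ß")


# Round trip through "sz" restores the original word.
assert recombine_sz(to_teletype("Mißwirtschaft")) == "Mißwirtschaft"

# "ss" is far too common to recombine blindly.
assert recombine_ss("Messer") == "Meßer"        # wrong

# A genuine "sz" at a compound joint (Reis + Zange) breaks as well.
assert recombine_sz("Reiszange") == "Reißange"  # wrong
```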

In names (e. g. of towns): Staßfurt (right) != Stassfurt (wrong).

Note that !("sz" <-> "ß") in all cases, and !("ss" <-> "ß")
as well, as the rule states that only a non-truncatable "ss"
is to be set as Eszett. There are only a few "sz" that are
"real 'sz'", typically at word joints, e. g. Reiszange. :-)

The "funny" things start when diacritic marks and other
non-US-ASCII representable elements change the meaning
of a word. In such cases, it's often justified to use
the proper localized representation. However, this is
also the point where problems may start if you're doing
it wrong (which means: others do not conform to the
language settings or fonts you're using).

The (limited) US-ASCII set of characters is the only
easy way to avoid that. It may not _always_ look pretty,
but even in the worst case, it works - and you can RELY
on that.
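
If you want to check names against that rule, a minimal sketch could
look like this. The function name is_portable_name is made up, and
"printable US-ASCII without spaces" is taken literally as the range
0x21 to 0x7E:

```python
def is_portable_name(name):
    """True if name is non-empty and uses only printable US-ASCII, no space."""
    # 0x21..0x7E is '!' through '~'. That leaves out space (0x20),
    # DEL (0x7F), control characters and everything 8-bit or beyond.
    # '/' passes the test, but cannot occur in a single file name anyway.
    return bool(name) and all(0x21 <= ord(c) <= 0x7E for c in name)


if __name__ == "__main__":
    for candidate in ("Stassfurt.txt", "Sta\u00dffurt.txt", "my file.txt"):
        print(ascii(candidate), is_portable_name(candidate))
```

Only the first of the three candidates passes: the second contains
an Eszett, the third a space.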

-- 
Polytropon
Magdeburg, Germany
Happy FreeBSD user since 4.0
Andra moi ennepe, Mousa, ...