Locale problem updating 10.3 to 11.1

Brandon Allbery allbery.b at gmail.com
Wed Feb 21 12:16:51 UTC 2018


A locale mapping is basically a lookup table (with complications for things
like ß). A single-byte lookup table will be 256 entries, each holding one
or more (because of combining characters) Unicode codepoints representing
the mapping from the locale character set to the underlying common
character set (Unicode). (There may also be a reverse lookup table for
mapping Unicode codepoints to locale codepoints.)
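
To make the lookup-table idea concrete, here is a minimal sketch in Python, assuming ISO 8859-1 as the locale character set. The names `forward`, `reverse`, `to_unicode` and `from_unicode` are made up for illustration; a real libc stores its tables quite differently.

```python
# Forward table: 256 entries, one per possible byte value, each holding the
# Unicode string that byte maps to. Python's own "latin-1" codec is used only
# to fill the table; the lookups below do not call the codec again.
forward = [bytes([b]).decode("latin-1") for b in range(256)]

# Reverse table: Unicode string -> byte value in the locale character set.
reverse = {ch: b for b, ch in enumerate(forward)}


def to_unicode(raw: bytes) -> str:
    """Map locale-encoded bytes to the common (Unicode) representation."""
    return "".join(forward[b] for b in raw)


def from_unicode(text: str) -> bytes:
    """Map Unicode text back to locale bytes; raises KeyError if unmappable."""
    return bytes(reverse[ch] for ch in text)
```

The reverse direction can fail: any character outside the 256 that the locale character set covers has no entry, which is why the reverse table is a separate structure rather than just the forward table read backwards.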

Without this, every program would have to deal directly with every possible
character set. With it, code can use Unicode internally and let the locale
system map that to what gets displayed, or, in the other direction, map what
was read into the common representation.

(Complications include things like: depending on encoding/locale details,
German lowercase ß will uppercase to either SS or ẞ. And that's one of the
simpler ones; for some locales, things can get *really* weird. Not to
mention fun stuff like Arabic having up to 4 representations of most letters:
initial, medial, final, standalone.)
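
Both complications can be seen from Python, which applies Unicode's default, locale-independent rules. This is only a sketch of the data involved, not of how any particular locale implementation handles it; the variable names are illustrative.

```python
import unicodedata

# Case mapping is not one-to-one: under the default Unicode rules,
# lowercase sharp s uppercases to two characters.
sharp_s = "\u00df"                  # ß
upper_default = sharp_s.upper()     # two-character result, "SS"

# The single-character capital form exists as its own codepoint and
# lowercases back to ß.
capital_sharp_s = "\u1e9e"          # ẞ
round_trip = capital_sharp_s.lower()

# Arabic BEH: one abstract letter, four positional presentation forms
# (isolated, final, initial, medial) encoded as compatibility characters.
BEH = "\u0628"
beh_forms = ["\ufe8f", "\ufe90", "\ufe91", "\ufe92"]
names = [unicodedata.name(c) for c in beh_forms]

# Compatibility normalization folds all four forms back to the base letter.
folded = {unicodedata.normalize("NFKC", c) for c in beh_forms}
```

Note that the uppercase result is longer than the input, so code that assumes case conversion preserves string length will be wrong for ß.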

Locale handling is seriously *nasty*. Unicode is also pretty nasty... but
it mostly manages the superset of individual locale nastinesses in about as
logical a way as possible given that locales are fundamentally illogical:
very few of them were designed, most grew organically and without regard
for rules or logic. (Esperanto locales being an exception... but even
Esperanto has developed some organic extensions with actual usage. It's how
humans work.)

On Wed, Feb 21, 2018 at 7:08 AM, Eivind Nicolay Evensen <
eivinde at terraplane.org> wrote:

> On Wed, Feb 21, 2018 at 01:03:01AM -0500, Brandon Allbery wrote:
> > On Tue, Feb 20, 2018 at 6:08 PM, Eivind Nicolay Evensen <
> > eivinde at terraplane.org> wrote:
> >
> > > However, since it was mentioned in a note starting with
> > > "Add support for unicode collation" I most likely didn't even read it
> > > since I'll never touch unicode.
> > >
> >
> > If you ever use anything other than LANG=C, you *are* touching Unicode.
>
> Well, I don't see multibyte characters with 8859-1, and
> multibyte is what I don't tolerate. I didn't even know
> that unicode could be single-byte character only sets.
>
>
>
>
> --
> Eivind
>



-- 
brandon s allbery kf8nh                               sine nomine associates
allbery.b at gmail.com                                  ballbery at sinenomine.net
unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net

