"Unprintable" 8-bit characters

Conrad J. Sabatier conrads at cox.net
Fri Nov 11 01:12:48 UTC 2011


On Tue, 8 Nov 2011 23:04:25 -0600 (CST)
Robert Bonomi <bonomi at mail.r-bonomi.com> wrote:

> 
> "Conrad J. Sabatier" <conrads at cox.net> wrote:
> >
> > <grin>
> >
> > Yes, and this is one area where the labels are more than a little
> > misleading as well.  My natural inclination is to think of UTF-8 as
> > being a single-byte representation for each character in the set,
> > whereas UTF-16, as the name implies, would be the "wide", 2-byte
> > version.
> 
> "Not exactly."
> 
> > Nonetheless, as I posted earlier in this thread, according to the
> > info in gucharmap, the representations of the umlauted "u" are just
> > the opposite of this:
> 
> "not exactly." Again.
> 
> > UTF-8: 0xC3 0xBC
> > UTF-16: 0x00FC
> >  
> > Go figure, huh?  :-)
> 
> In UTF-16, everything _is_ a 16-bit entity.  Notice that 0x00FC has
> -four- nybbles after the '0x.'  Every character boundary is on a
> multiple of 16 bits.

Ah yes!  I hadn't noticed that.
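
For anyone who wants to check the bytes directly, here is a minimal
Python sketch (Python is just a convenient choice; nothing in this
thread depends on it).  It encodes the umlauted "u" both ways.  The
helper name encoded_hex is made up for illustration, and the
"utf-16-be" codec is used on the assumption that we want big-endian
output with no byte-order mark prepended.

```python
def encoded_hex(ch, codec):
    """Return the bytes of ch in the given codec as space-separated hex."""
    return " ".join(f"0x{b:02X}" for b in ch.encode(codec))

u_umlaut = "\u00fc"  # LATIN SMALL LETTER U WITH DIAERESIS

# UTF-8 needs two 8-bit units for this character.
print("UTF-8: ", encoded_hex(u_umlaut, "utf-8"))      # 0xC3 0xBC
# UTF-16 needs one 16-bit unit, shown here as its two bytes.
print("UTF-16:", encoded_hex(u_umlaut, "utf-16-be"))  # 0x00 0xFC
# A plain ASCII letter stays a single byte in UTF-8.
print("ASCII: ", encoded_hex("u", "utf-8"))           # 0x75
```

So the "8" and "16" in the names describe the size of the unit being
counted, not the size of every character.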

What's really weird, as I mentioned in a later private email to
Polytropon, is that last night the copy-and-paste in gucharmap suddenly
started copying the UTF-8 code instead of the UTF-16.  I have no idea
why that changed.

> In UTF-8, the 'base' charset -- 7-bit ASCII, including the 'C0'
> control group -- is represented by a single byte.  'Extended'
> characters are represented by two to four bytes.  Thus, 'characters'
> have a *variable*length* representation -- one to four bytes.  A
> character, however many bytes it takes, can begin on -any- byte
> boundary within a data stream, depending on 'what came before it'.
> UTF-8 multi-byte representations are designed such that one can jump
> to any _byte_ offset within the file, and determine -- by looking
> *only* at the value of that byte -- whether it is (a) a single-byte
> character, (b) the first byte of a multi-byte sequence, or (c) a
> continuation byte of a multi-byte sequence.
> 
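> With UTF-16 you can position directly to any -character-, by jumping
> to a _byte_ offset that is twice the index of the character you want
> (as long as the text stays within the Basic Multilingual Plane;
> characters beyond it take two 16-bit units).  Given a byte offset,
> you always know the 'equivalent' _character_ offset.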
> 
> With UTF-8, you have to read the character stream, counting
> 'characters' as you go, to get to the desired point.  You can seek to
> an arbitrary _byte_ offset, but you do not know how many 'characters'
> into the file that offset is.

I see.  Yes, that could certainly complicate things.
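
The difference can be sketched in a few lines of Python.  This is only
an illustration under stated assumptions: the function names are
invented here, the input is assumed to be well-formed UTF-8, and the
UTF-16 arithmetic assumes text with no surrogate pairs.  Note that
UTF-8 sequences can run up to four bytes, and the same high-bit test
covers all of them.

```python
def classify_byte(b):
    """Classify one UTF-8 byte by its high bits alone (well-formed input)."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"  # 10xxxxxx: inside a multi-byte sequence
    return "lead"              # 11xxxxxx: first byte of a multi-byte sequence

def utf8_byte_offset(data, char_index):
    """Scan from the start, counting characters, to find where one begins."""
    seen = 0
    for offset, b in enumerate(data):
        if classify_byte(b) != "continuation":
            if seen == char_index:
                return offset
            seen += 1
    raise IndexError("character index out of range")

def utf16_byte_offset(char_index):
    """Direct arithmetic; assumes no surrogate pairs in the text."""
    return 2 * char_index

text = "f\u00fcr"            # 'f', u-umlaut, 'r'
data = text.encode("utf-8")  # four bytes: 66 C3 BC 72

print([classify_byte(b) for b in data])
# ['single', 'lead', 'continuation', 'single']
print(utf8_byte_offset(data, 2))  # 3: found only by scanning
print(utf16_byte_offset(2))       # 4: computed without reading anything
```

The third character sits at byte 3 in UTF-8 only because the one
before it happened to take two bytes; in UTF-16 its position is known
in advance.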

> UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and
> simplicity of addressing/representation (UTF-16).
> 
> > This seems rather unfortunate to me.  You would think that, by now,
> > some "standard" character set might have emerged that would allow
> > one to use, at the very least, the "Western" characters (as opposed
> > to the "Eastern" or "Oriental" or "Asian", if you will) with a
> > reasonable expectation that others will see what was intended.
> 
> Heh. 
> 
> How many 'character' codes are you willing to devote to national
> 'currency symbols', just for starters?  Probable minimum of two per
> currency -- one for the minimum coinage unit (cent, pence, pfennig,
> etc.) and one for the denomination unit (dollar, pound, mark, kroner,
> etc.)
> 
> Now, one (obviously) has to have the basic 'Roman' alphabet. 
> 
> Then there are all the diacritical markings (accent, accent grave,
> dot, umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels.  And
> cedilla, tilde, etc., for select consonants.  Plus language-specific
> symbols like ess-zett, 'thorn', etc.
> 
> How about phonetic symbols, like 'schwa' ?
> 
> And Greek for all sorts of scientific use?
> 
> What about Cyrillic characters, for many Eastern European languages?
> 
> Now, consider punctuation marks:
>    the 'typewriter' basics,
>    How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen'
>    are needed?
>    How many of 'accent, accent grave, apostrophe,
>    opening/closing single-quote' are needed?
>    opening/closing double-quotes, and/or a 'position neutral'
>    double-quote?
> 
> "Other symbols", like --
>    digits,
>    common fractions,
>    'Trademark','Registered trademark','copyright' 
>    'paragraph','section', 
>    superscripts  -- exponents, footnotes, etc.
>    subscripts -- chemical formulae, etc.
>    "Simple line-drawing graphics"
> 
> Diphthongs??  Ligatures??
> 
> Start counting things up. 
> 
> An 8-bit 'address space' gets used up _really_ quick.
> 
> <wry grin>

I certainly get the point.  :-)  Thanks for that very thorough
elucidation.  :-)

Now I just have to figure out what the heck's going on here, why
suddenly I'm seeing the exact opposite of what I was seeing yesterday.
Thought I had everything straightened out for a while there.  :-(

Oh, this is madness!  :-)

-- 
Conrad J. Sabatier
conrads at cox.net

