"Unprintable" 8-bit characters
Conrad J. Sabatier
conrads at cox.net
Fri Nov 11 01:12:48 UTC 2011
On Tue, 8 Nov 2011 23:04:25 -0600 (CST)
Robert Bonomi <bonomi at mail.r-bonomi.com> wrote:
>
> "Conrad J. Sabatier" <conrads at cox.net> wrote:
> >
> > <grin>
> >
> > Yes, and this is one area where the labels are more than a little
> > misleading as well. My natural inclination is to think of UTF-8 as
> > being a single-byte representation for each character in the set,
> > whereas UTF-16, as the name implies, would be the "wide", 2-byte
> > version.
>
> "Not exactly."
>
> > Nonetheless, as I posted earlier in this thread, according to the
> > info in gucharmap, the representations of the umlauted "u" are just
> > the opposite of this:
>
> "not exactly." Again.
>
> > UTF-8: 0xC3 0xBC
> > UTF-16: 0x00FC
> >
> > Go figure, huh? :-)
>
> In UTF-16, everything _is_ a 16-bit entity. Notice that 0x00FC has
> -four- nybbles after the '0x.' Every character boundary is on a
> multiple of 16 bits.
Ah yes! I hadn't noticed that.
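
For anyone who wants to check the two representations quoted above without gucharmap, here is a minimal Python sketch. It assumes big-endian UTF-16 with no byte-order mark (the "utf-16-be" codec), which is what the 0x00FC notation suggests; the variable names are only for illustration.

```python
# U+00FC is LATIN SMALL LETTER U WITH DIAERESIS (the umlauted "u").
ch = "\u00fc"

utf8 = ch.encode("utf-8")        # two bytes in UTF-8
utf16 = ch.encode("utf-16-be")   # one 16-bit unit, big-endian, no BOM (assumed)

# Show each encoding as space-separated hex bytes.
print("UTF-8: ", " ".join(f"0x{b:02X}" for b in utf8))
print("UTF-16:", " ".join(f"0x{b:02X}" for b in utf16))
```

Both encodings come out as two bytes for this character; the difference is that in UTF-16 those two bytes form a single 16-bit unit.
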
What's really weird, as I mentioned in a later private email to
Polytropon, last night, the copy-and-paste in gucharmap suddenly
decided to start copying the UTF-8 code instead of the UTF-16. I have
no idea why that changed.
> In UTF-8, the 'base' charset -- 7-bit ASCII, which includes the 'C0'
> control group -- is represented by a single byte. 'extended'
> characters, including the 'C1' group, are represented by two or more
> bytes (up to four). Thus, 'characters' have a *variable*length*
> representation. A character, however many bytes it takes, can begin
> on -any- byte boundary within a data stream, depending on 'what came
> before it'. UTF-8 multi-byte representations are designed such that
> one can jump to any _byte_ offset within the file, and determine -- by
> looking *only* at the value of that byte -- whether it is (a) a
> single-byte character, (b) the first byte of a multi-byte sequence, or
> (c) a later byte of a multi-byte sequence.
>
> With UTF-16 you can position directly to any -character-, by jumping
> to a _byte_ offset that is twice the index of the character you want.
> Given a byte offset, you always know the 'equivalent' _character_
> offset.
>
> With UTF-8, you have to read the character stream, counting
> 'characters' as you go, to get to the desired point. You can seek to
> an arbitrary _byte_ offset, but you do not know how many 'characters'
> into the file that offset is.
I see. Yes, that could certainly complicate things.
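
The byte classification and the two ways of finding a character described above can be sketched in a few lines of Python. This is only an illustration, not anyone's actual implementation: the helper names classify_byte and count_chars_utf8 are made up here, the input is assumed to be valid UTF-8, and the direct UTF-16 indexing shown at the end holds only for characters that fit in a single 16-bit unit (those outside the Basic Multilingual Plane take two units).

```python
def classify_byte(b: int) -> str:
    """Classify one byte of a UTF-8 stream by looking only at its high bits."""
    if b < 0x80:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if (b & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx: a later byte of a multi-byte sequence
    return "lead"              # 11xxxxxx: the first byte of a multi-byte sequence


def count_chars_utf8(data: bytes) -> int:
    """Count characters by counting every byte that is not a continuation byte."""
    return sum(1 for b in data if classify_byte(b) != "continuation")


text = "f\u00fcr"                  # 'f', umlauted 'u', 'r'
utf8 = text.encode("utf-8")        # 4 bytes: 66 c3 bc 72
utf16 = text.encode("utf-16-be")   # 6 bytes: 00 66 00 fc 00 72

# UTF-8: any byte can be classified in isolation, but finding the Nth
# character still means scanning from the start and counting.
print([classify_byte(b) for b in utf8])
print(count_chars_utf8(utf8))

# UTF-16: character i starts at byte offset 2 * i (single-unit characters only).
i = 1
print(utf16[2 * i : 2 * i + 2].decode("utf-16-be"))
```

The scan in count_chars_utf8 is the "read the stream, counting as you go" cost mentioned above; the UTF-16 lookup is a single slice.
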
> UTF-8 vs. UTF-16 is a trade-off between 'compactness' (UTF-8), and
> simplicity of addressing/representation (UTF-16).
>
> > This seems rather unfortunate to me. You would think that, by now,
> > some "standard" character set might have emerged that would allow
> > one to use, at the very least, the "Western" characters (as opposed
> > to the "Eastern" or "Oriental" or "Asian", if you will) with a
> > reasonable expectation that others will see what was intended.
>
> Heh.
>
> How many 'character' codes are you willing to devote to national
> 'currency symbols', just for starters? Probable minimum of two per
> currency -- one for the minimum coinage unit (cent, pence, pfennig,
> etc.) and one for the denomination unit (dollar, pound, mark, kroner,
> etc.)
>
> Now, one (obviously) has to have the basic 'Roman' alphabet.
>
> Then there are all the diacritical markings (accent, accent grave, dot,
> umlaut, ring, bar, 'hat', inverted hat, etc.) for vowels. And
> cedilla, tilde, etc., for select consonants. Plus language-specific
> symbols like ess-zett, 'thorn', etc.
>
> How about phonetic symbols, like 'schwa' ?
>
> And Greek for all sorts of scientific use?
>
> What about Cyrillic characters, for many Eastern European languages?
>
> Now, consider punctuation marks:
> the 'typewriter' basics,
> How many of 'minus-sign, hyphen, em-dash, en-dash, soft-hyphen'
> are needed? How many of 'accent, accent grave, apostrophe,
> opening/closing single-quote' are needed?
> opening/closing double-quotes, and/or a 'position neutral'
> double-quote?
>
> "Other symbols", like --
> digits,
> common fractions,
> 'Trademark','Registered trademark','copyright'
> 'paragraph','section',
> superscripts -- exponents, footnotes, etc.
> subscripts -- chemical formulae, etc.
> "Simple line-drawing graphics"
>
> Diphthongs?? Ligatures??
>
> Start counting things up.
>
> An 8-bit 'address space' gets used up _really_ quick.
>
> <wry grin>
I certainly get the point. :-) Thanks for that very thorough
elucidation. :-)
Now I just have to figure out what the heck's going on here, why
suddenly I'm seeing the exact opposite of what I was seeing yesterday.
Thought I had everything straightened out for a while there. :-(
Oh, this is madness! :-)
--
Conrad J. Sabatier
conrads at cox.net