converting UTF-8 to HTML
    Matthew Seaman 
    m.seaman at infracaninophile.co.uk
       
    Sun Apr 22 10:45:59 UTC 2012
    
    
  
On 22/04/2012 10:17, Erik Nørgaard wrote:
> UTF-8 is variable width, ascii characters are stored as single bytes (not
> sure about iso-8859-1) while other characters are stored as two byte chars.

ASCII uses the low 128 values that you can store in an unsigned char,
i.e. those where the high-order bit is zero.

ISO-8859-1 and the various other ISO-8859-X character sets use the
upper 128 values (in practice the 96 positions from 0xA0 to 0xFF) for
other glyphs useful in Latin alphabets, so it's still one char per
glyph.  Other alphabets (Greek, Cyrillic, etc.) have similar
one-byte-per-glyph encodings.  But you have to know what the encoding
is to display the content correctly, and it is difficult to mix chunks
of text in different encodings in the same document.
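
To make the "you have to know what the encoding is" point concrete,
here is a minimal Python sketch.  The byte value 0xE9 and the three
encodings are picked purely for illustration, and the variable names
are made up for this example.

```python
# One and the same byte, read through three single-byte encodings.
raw = bytes([0xE9])

# Map each encoding name to the character that byte decodes to.
interpretations = {
    encoding: raw.decode(encoding)
    for encoding in ("iso-8859-1", "iso-8859-7", "iso-8859-5")
}

for encoding, char in interpretations.items():
    # Print the Unicode code point next to the character itself.
    print(f"{encoding}: U+{ord(char):04X} {char}")
```

The stored byte never changes; only the table used to look it up does,
which is why text in an unlabelled 8-bit encoding so easily turns into
the wrong letters.
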

UTF has various different forms, based on different word sizes, but the
commonly used UTF-8 works in units of 1-byte chars.  However, glyphs may
be represented by sequences of from 1 to 4 bytes.  The 1-byte glyphs are
identical to ASCII.  Any byte with the high-order bit set is part of a
multibyte glyph: the first byte has its top two bits set, and its bit
pattern (110xxxxx, 1110xxxx or 11110xxx) indicates the number of bytes,
while all the other bytes of that glyph have the form 10xxxxxx.  All
million-plus glyphs available through Unicode can be expressed this way,
so the encoding is universal and suitable for all languages and
alphabets or non-alphabetic languages.
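
The bit patterns can be checked with a few lines of Python.  This is
only a sketch: sequence_length() is a hypothetical helper written for
this example, and it looks at the bit pattern alone, so it does not
reject first bytes such as 0xC0, 0xC1 or 0xF5 and above, which never
occur in valid UTF-8.

```python
def sequence_length(first_byte: int) -> int:
    """Length announced by the first byte of a UTF-8 sequence, or 0."""
    if (first_byte & 0x80) == 0x00:   # 0xxxxxxx: plain ASCII
        return 1
    if (first_byte & 0xE0) == 0xC0:   # 110xxxxx
        return 2
    if (first_byte & 0xF0) == 0xE0:   # 1110xxxx
        return 3
    if (first_byte & 0xF8) == 0xF0:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx is never used.
    return 0


# Sample characters needing 1, 2, 3 and 4 bytes respectively.
for ch in ("A", "\u00f8", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{b:08b}" for b in encoded)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s): {bits}")
    # The first byte announces the length of the whole sequence.
    assert sequence_length(encoded[0]) == len(encoded)
```
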

Not all possible byte sequences are valid UTF-8 text, but the design of
the encoding means that an interpreter can skip over an invalid sequence
of bytes and find the beginning of the next valid sequence easily.
Whoever it was upthread who had the misfortune to run into a text editor
that just gave up and truncated their document at an invalid sequence
needs to vent their ire on the lazy and stupid programmers of whatever
app it was, rather than on the concept of UTF-8 itself.
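
Skipping over garbage takes very little code.  The sketch below is one
simple way to do it, not how any particular editor works:
lenient_decode() is a hypothetical helper that cuts the input at every
byte that is not a continuation byte, decodes each chunk, and replaces
a chunk that fails to decode with U+FFFD.  It is coarser than a real
decoder, since a valid character followed by a stray continuation byte
is lost along with it.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def is_continuation(byte: int) -> bool:
    """True for bytes of the form 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def lenient_decode(data: bytes) -> str:
    out = []
    i = 0
    while i < len(data):
        # A chunk runs up to the next byte that can start a sequence.
        j = i + 1
        while j < len(data) and is_continuation(data[j]):
            j += 1
        try:
            out.append(data[i:j].decode("utf-8"))
        except UnicodeDecodeError:
            # Invalid chunk: mark it and carry on with the next one.
            out.append(REPLACEMENT)
        i = j
    return "".join(out)


damaged = b"abc\xff\xc3\xb8def"   # 0xFF can never appear in UTF-8
print(lenient_decode(damaged))
```

Python's own decoder does the same job when asked nicely:
damaged.decode("utf-8", errors="replace") keeps everything after the
bad byte as well.
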

Yes, with UTF-8 encoded text, you can no longer equate the number of
glyphs[*] in a piece of text (and hence the space required to display
the text) with the memory required to store that text.  There's a lot of
legacy code out there which makes this assumption, but this is
overshadowed by the amount of legacy code out there which can only
handle ASCII text.  Fixing all that code is pretty long-winded, but not
conceptually too difficult.  Programming a text-only display to assume
everything is UTF-8 would be quite viable, and backward compatible
with ASCII-only displays.  The hard part is creating a font with a
more-or-less complete set of Unicode glyphs.
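
The gap between "how many characters" and "how many bytes" is easy to
see in Python, where len() on a str counts code points and len() on
the encoded bytes counts storage.  The sample strings and the measure()
helper are made up for this sketch; the last sample uses a combining
accent, so even the code point count overstates what ends up on screen.

```python
import unicodedata


def measure(text: str) -> tuple[int, int]:
    """Return (code points, bytes when stored as UTF-8)."""
    return len(text), len(text.encode("utf-8"))


samples = ["plain ascii", "N\u00f8rgaard", "\u20ac100", "e\u0301"]
for text in samples:
    points, size = measure(text)
    print(f"{text!r}: {points} code points, {size} bytes")

# 'e' followed by a combining acute accent displays as a single glyph.
# NFC normalisation folds the pair into the precomposed U+00E9.
composed = unicodedata.normalize("NFC", "e\u0301")
print(measure(composed))
```
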

	Cheers,

	Matthew

[*] Let's not even mention the concept of 'combining characters' here.

-- 
Dr Matthew J Seaman MA, D.Phil.                   7 Priory Courtyard
                                                  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey     Ramsgate
JID: matthew at infracaninophile.co.uk               Kent, CT11 9PW