converting UTF-8 to HTML

Robert Bonomi bonomi at mail.r-bonomi.com
Sun Apr 22 00:09:57 UTC 2012


Polytropon <freebsd at edvax.de> wrote:
> On Sat, 21 Apr 2012 09:10:03 -0500 (CDT), Lars Eighner wrote:
> > On Sat, 21 Apr 2012, Erik Nurgaard wrote:
> > 
> > > When characters show up wrong in the user's browser, it's usually
> > > because the browser is set to use a non-UTF-8 charset by default,
> > > such as windows-1252, the web server sends charset=ascii in
> > > the HTTP header, and there is no (or an incorrect) meta tag to
> > > resolve the problem. Non-UTF-8 charsets are a leftover from the
> > > last millennium that we sometimes still choke on .. sorry for the rant ;)
> > 
> > UTF-8 is a waste of storage for most people and is incompatible with
> > text-mode tools: it's simply another bid to make it impossible to run
> > without a GUI.
>
> Regarding the fun of encodings, endianness, representation,
> use ("fi" the two letters vs. "fi" the ligature, or "a"
> the 1-byte sequence vs. "a" the two-byte sequence), see
> the following document:
>
> Matt Mayer: Love Hotels and Unicode
> http://www.reigndesign.com/blog/love-hotels-and-unicode/
>
> And finally it offers an interesting attack vector, given
> the fact that several Unicode characters "look" the same
> but are in fact different. "Two files with the 'same'
> name" is thus a technique that malware authors can use
> to mislead users.
>
> Short example from MICROS~1 land here:
> http://blogs.technet.com/b/mmpc/archive/2011/08/10/can-we-believe-our-eyes.aspx
>
> But none of this negates the usefulness of Unicode / UTF-8
> in general. Especially in collaborative settings with
> multi-language document processing requirements it is a
> helpful thing, as mixing "normal" (ASCII) letters, Cyrillic
> ones, Chinese and Japanese symbols, and Arabic writing is
> no big deal -- as long as all the tools properly support
> it in the _same_ way.
>
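
To make the "lookalike" point above concrete, here's a small C sketch
(the filename is invented purely for illustration).  Both strings render
identically on a UTF-8 terminal, yet they name two *different* files:

    #include <stdio.h>
    #include <string.h>

    int
    main(void)
    {
        /* "raid.exe" with a Latin small 'i' (U+0069, one byte)... */
        const char *a = "ra\x69" "d.exe";
        /* ...and with a Cyrillic 'i' (U+0456, two bytes: 0xD1 0x96). */
        const char *b = "ra\xD1\x96" "d.exe";

        printf("a = %s  (%zu bytes)\n", a, strlen(a));
        printf("b = %s  (%zu bytes)\n", b, strlen(b));
        printf("strcmp: %d\n", strcmp(a, b));  /* nonzero: different names */
        return (0);
    }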

Sorry, but UTF-8 is a *botch*, to put it charitably.

Correction -- UTF-8 is one particular implementation of the botch that is
'variable-width encoding': representing the glyphs of printed text with
symbols that occupy a varying number of bytes.

"Variable-width ecoding" destroys the concept of addressibility -within-
a text.  And, therefore, 'random access'/'direct access' is impossible.

Ditto for concepts like 'read backwards'. 
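
To make the cost concrete, here's a minimal sketch (my own illustration,
not any library's API) of what "find the Nth character" turns into under
UTF-8 -- a linear scan, because each character may occupy 1 to 4 bytes:

    #include <stddef.h>

    /*
     * Return a pointer to the n-th code point (0-based) of a UTF-8
     * string, or NULL if the string is shorter than that.  With a
     * fixed-width encoding this would be a single pointer addition;
     * with UTF-8 every lookup is a byte-by-byte scan.
     */
    const char *
    utf8_index(const char *s, size_t n)
    {
        for (; *s != '\0'; s++) {
            /* Bytes of the form 10xxxxxx are continuations; any
             * other byte starts a new code point. */
            if (((unsigned char)*s & 0xC0) != 0x80 && n-- == 0)
                return (s);
        }
        return (NULL);
    }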

Not to mention the inevitable, UNAVOIDABLE problems that occur when the
'encoding' used for a particular set of data is not recorded *IN* the
dataset (or in inextricably-coupled 'metadata'), and one has to 'guess'
what the encoding of a particular file is.

'Assume' -- with all that -that- word implies -- a particular encoding
when the data is actually encoded as something 'different', and you can
encounter byte sequences that are 'illegal' in the 'assumed' encoding,
with *NO* means of recovery: since the 'interpreter' can't tell how long
the 'illegal' code is, it can't tell where the 'next' symbol should
start, and it just _stops_cold_ ... an apparent 'end of file'.
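
Here's roughly what that failure mode looks like through the standard
mbrtowc(3) decoder (a sketch; the byte values and the locale name are
assumptions for illustration):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>

    int
    main(void)
    {
        /* Assumes a UTF-8 locale is installed under this name. */
        setlocale(LC_CTYPE, "en_US.UTF-8");

        /*
         * "ab", then a stray Latin-1 e-acute (0xE9) -- NOT valid
         * UTF-8 -- then "cd".  0xE9 announces a 3-byte sequence,
         * but 'c' is not a continuation byte.
         */
        const char buf[] = "ab\xE9" "cd";
        const char *p = buf;
        size_t left = sizeof(buf) - 1;
        mbstate_t st;

        memset(&st, 0, sizeof(st));
        while (left > 0) {
            wchar_t wc;
            size_t n = mbrtowc(&wc, p, left, &st);

            if (n == (size_t)-1 || n == (size_t)-2) {
                /*
                 * Invalid or incomplete sequence: the decoder cannot
                 * tell how long the bad symbol is.  A naive tool that
                 * treats this as end-of-input silently drops "cd".
                 */
                fprintf(stderr, "decode error at offset %td\n", p - buf);
                return (1);
            }
            printf("U+%04lX\n", (unsigned long)wc);
            p += n;
            left -= n;
        }
        return (0);
    }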

I have had _that_ particular unfortunate experience with an
'encoding-aware' text editor (on a Debian Linux system, if it matters)
which, on exit, _SILENTLY_ *truncated* the original file at the point of
the 'illegal' symbol.

The -correct- solution -- if you are in an environment where you need
more glyphs than can be represented by a single byte -- is to use
*fixed-width* multi-byte symbols for _everything_.  This is "relatively
easy" to implement within a single 'system' (be it a single machine or
'corporate-wide'), but makes for major difficulties when 'external'
communication is involved.  There is, unfortunately, simply -no- simple
solution for that problem. :((
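
For illustration, here's a sketch of the fixed-width approach using the
standard wide-character routines (it assumes a UTF-8 locale and the
32-bit wchar_t that FreeBSD and most Unix systems provide):

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int
    main(void)
    {
        setlocale(LC_CTYPE, "en_US.UTF-8");  /* assumed locale name */

        /* "naive" with i-diaeresis: 6 bytes, but only 5 symbols. */
        const char *utf8 = "na\xC3\xAFve";
        wchar_t wide[16];

        /* Decode once into fixed-width symbols... */
        size_t n = mbstowcs(wide, utf8, 16);
        if (n == (size_t)-1) {
            perror("mbstowcs");
            return (1);
        }

        /* ...then every position is directly addressable. */
        printf("%zu symbols\n", n);                    /* 5 */
        printf("symbol 2 is U+%04lX\n",
            (unsigned long)wide[2]);                   /* U+00EF */
        return (0);
    }

The price, as noted above, is that the fixed-width form only works as an
internal representation; the moment the data leaves the 'system', you
are back to agreeing on an interchange encoding.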



