On Sat, 21 Apr 2012 09:10:03 -0500 (CDT), Lars Eighner wrote:
> On Sat, 21 Apr 2012, Erik Nørgaard wrote:
> > When characters show up wrong in the user's browser it's usually because the
> > browser is set to use a non-UTF-8 charset by default such as windows-1252,
> > the web server sends charset=ascii in the HTTP header and there is no or an
> > incorrect meta tag to resolve the problem. Non-UTF-8 charsets are a leftover
> > from the last millennium that we sometimes still choke on .. sorry for the rant ;)
> UTF-8 is a waste of storage for most people [...]

Disks and RAM are huge and cheap. Plenty of space that is
going to be used. Nobody cares.
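
For scale, a rough sketch of what the overhead actually looks
like (Python 3, standard library only; the sample strings and
the helper name encoded_size are made up for illustration):

```python
# Compare how many bytes the same text needs in different encodings.
# The sample strings are arbitrary illustrations, not measurements.

def encoded_size(text, encoding):
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

samples = {
    "ascii": "plain old text",   # 14 characters, all ASCII
    "german": "Stra\u00dfe",     # "Straße", 6 characters
    "cyrillic": "\u043f\u0440\u0438\u0432\u0435\u0442",  # "привет", 6 characters
}

for label, text in samples.items():
    print(label,
          len(text),                        # characters (code points)
          encoded_size(text, "utf-8"),      # bytes in UTF-8
          encoded_size(text, "utf-16-le"))  # bytes in UTF-16 (no BOM)
```

Pure ASCII text costs exactly the same in UTF-8 as in ASCII;
only the non-ASCII characters take two or more bytes each.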

> [...] and is incompatible with
> text-mode tools: it's simply another bid to make it impossible to run
> without a GUI.

Again, nobody cares - until, of course, it's too late and you
need to do some recovery or analytic tasks in a limited
environment or via a connection with limited means.

Regarding the fun of encodings, endianness, representation,
use ("fi" the two letters vs. "ﬁ" the ligature, or "ß"
the 1-byte sequence vs. "ß" the two-byte sequence), see
the following document:

Matt Mayer: Love Hotels and Unicode
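
To make the ligature and the sharp-s cases concrete, here is a
minimal sketch in Python 3 using only the standard library. It
assumes that the "1-byte sequence" means ISO-8859-1 (Latin-1)
and the "two-byte sequence" means UTF-8; the helper name
code_points is made up for this example:

```python
import unicodedata

def code_points(text):
    """Return the Unicode names of the characters in `text`."""
    return [unicodedata.name(ch) for ch in text]

two_letters = "fi"      # U+0066 U+0069, two separate letters
ligature = "\ufb01"     # U+FB01, a single ligature character

# They may render alike, but they are different strings.
print(two_letters == ligature)
print(code_points(two_letters))
print(code_points(ligature))

# Compatibility normalization (NFKC) folds the ligature into two letters.
folded = unicodedata.normalize("NFKC", ligature)
print(folded == two_letters)

# The same character, two different byte representations.
sharp_s = "\u00df"                  # LATIN SMALL LETTER SHARP S
print(sharp_s.encode("latin-1"))    # one byte
print(sharp_s.encode("utf-8"))      # two bytes
```

A tool that compares raw bytes, or compares strings without
normalizing first, will treat each of these pairs as different.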

And finally, Unicode offers an interesting attack vector:
several Unicode characters "look" the same but are in fact
different. So "two files with the 'same' name" is a possible
means that malware implementers can use to mislead users.
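
A small sketch of the look-alike problem in Python 3 (standard
library only). The file names and the helper non_ascii_report
are hypothetical, chosen just to illustrate the point:

```python
import unicodedata

def non_ascii_report(name):
    """List (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(name)
        if ord(ch) > 127
    ]

latin_name = "paypal.txt"                # all ASCII
spoofed_name = "p\u0430yp\u0430l.txt"    # both "a" are CYRILLIC SMALL LETTER A

# Same length, similar on screen, but different strings,
# so a file system can hold both names side by side.
print(latin_name == spoofed_name)
print(non_ascii_report(latin_name))
print(non_ascii_report(spoofed_name))

# Normalization does not help here: the Cyrillic letter is not a
# compatibility variant of the Latin one.
print(unicodedata.normalize("NFKC", spoofed_name) == latin_name)
```

Flagging non-ASCII characters like this is only a crude check
and says nothing about legitimate non-Latin names.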

Short example from MICROS~1 land here:

But none of this negates the usefulness of Unicode / UTF-8
in general. Especially in collaborative settings with
multi-language document processing requirements, it is a
helpful thing: working with "normal" (ASCII) letters,
Cyrillic ones, Chinese and Japanese symbols, or Arabic writing
is no big deal as long as all the tools properly support
it the _same_ way.

