Re: Grep with non-ascii

From: Tomoaki AOKI <junchoon_at_dec.sakura.ne.jp>
Date: Sat, 04 Feb 2023 04:16:37 UTC
On Fri, 3 Feb 2023 12:36:47 -0500
George Mitchell <george+freebsd@m5p.com> wrote:

> On 2/3/23 11:06, Tomoaki AOKI wrote:
> > [...]
> > If this is the case like above, the only solution is to move to
> > character set containing ALL characters all over the world.
> > 
> > AFAIK, the only candidates are only two, TRON code [1] and Unicode (UCS,
> > ISO/IEC 10646) [2]. And TRON code is very rarely used, actual candidate
> > would be Unicode only.
> > Note that Unicode is usually encoded to any of UTF-8, UTF-16 or UTF-32
> > for data transfer (sometimes raw UCS-2?).
> > [...]
> 
> The one positive development in the world of computing that I would
> credit to Java is the earliest big push toward the adoption of UTF-8.
> I strongly hope UTF-8 becomes universally used sooner rather than
> later.                                                     -- George

And FreeBSD already has UTF-8. ;-)

Drawbacks of UTF-8 are...
  *Han unification. Similar-looking but not identical characters in
   Japanese, Chinese and Korean are fatally and mistakenly unified.

  *Lack of proper support for variant forms of characters.
   Maybe Unicode should have another 2 dimensions, one for classifying
   wrongly unified CJK characters and another one for variants.

  *Font sets. Very few fonts cover all of the Unicode code points
   that are assigned to actual characters.

  *FreeBSD base does not have a full Unicode font for vt yet.
   (Input methods are a different problem, though.)
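
To make the Han unification point concrete, here is a minimal Python sketch. It assumes U+76F4 as the example, a character often cited as one whose preferred glyph differs between Japanese and Chinese typography. The helper name describe() is made up for illustration. Unification happens at the code point level, so the same holds for UTF-16 and UTF-32 as well as UTF-8.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return the code point, Unicode name and UTF-8 bytes of one character."""
    return {
        "codepoint": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "utf8": ch.encode("utf-8"),
    }


# Assumption: U+76F4 is used here only as an illustrative unified ideograph.
info = describe("\u76f4")
print(info)

# The string is a single code point with a single byte sequence. Nothing in
# the text says whether a Japanese or a Chinese glyph is intended; that
# choice is left to the font or to out-of-band language information.
print(len("\u76f4"), info["utf8"].hex())
```

Whichever glyph appears on screen depends on the font in use, not on the bytes, so tools such as grep cannot tell the two forms apart.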

-- 
Tomoaki AOKI    <junchoon@dec.sakura.ne.jp>