Problems in base iconv conversion

Tomoaki AOKI junchoon at dec.sakura.ne.jp
Sun Dec 8 03:13:28 UTC 2013


Hi

In base iconv, some character sets have problems, mostly related to
single-byte JIS X 0201 kana (a.k.a. half-width kana) and multi-byte JIS X
0213 (the 2004 version is the newest standard for now).

The problems can be classified into three error patterns.

1. Illegal byte sequence (in GNU iconv, "cannot convert")
2. Invalid argument (in GNU iconv, "unsupported")
3. Invalid characters


What I tried is:
  * Select the Japanese codes from `iconv -l`. (I may have missed some.)

  * Classify them as single-byte or multi-byte.

  * Convert a simple test string (meaningless as a sentence) from UTF-8
    to the target code using `iconv -f UTF-8 -t (target) (TestString)`,
    convert the result back, and compare the round-tripped string with
    the original test string. If an error occurred, classify it and
    record the output in hex form.
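
The procedure above can be sketched in Python roughly as follows. This
is only an illustration, not the script used for the attached results:
the helper names `run_iconv` and `check_target` are hypothetical, and it
assumes an `iconv` binary on PATH that reads its input from stdin.

```python
import subprocess

def run_iconv(data, src, dst):
    """Run `iconv -f src -t dst`, feeding data on stdin (assumed setup)."""
    return subprocess.run(["iconv", "-f", src, "-t", dst],
                          input=data, capture_output=True)

def check_target(original, target, run=run_iconv):
    """Round-trip UTF-8 bytes through target; return (verdict, hex output)."""
    fwd = run(original, "UTF-8", target)
    if fwd.returncode != 0:
        msg = fwd.stderr.decode(errors="replace").strip()
        return ("forward error: " + msg, fwd.stdout.hex())
    back = run(fwd.stdout, target, "UTF-8")
    if back.returncode != 0:
        msg = back.stderr.decode(errors="replace").strip()
        return ("reverse error: " + msg, fwd.stdout.hex())
    verdict = "ok" if back.stdout == original else "mismatch"
    return (verdict, fwd.stdout.hex())

# Example call (the result depends on the installed iconv):
#   check_target("\uff71\uff72\uff73".encode("utf-8"), "EUC-JP")
```

The `run` parameter exists only so that the conversion step can be
replaced, for instance to compare base iconv with GNU iconv.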

Please see the attached PDF for details (the notes are basically about
base iconv). I tested base iconv in stable/10 r258701, with GNU iconv
from ports in stable/9 for reference.

Strangely, although every target is listed in `iconv -l` (for base
iconv; not all of them are listed by GNU iconv from ports), some targets
fail with "Invalid argument" and write nothing to stdout. This should
not happen: such targets should either be properly supported or dropped
from the list.

In the other error patterns, the output strings are wrongly converted.
In some cases characters are dropped, or converted to the substitute
character used for errors (GETA MARK).

I should mention, though, that mapping an unsupported character to GETA
MARK is the normal treatment in the multi-byte case, because not every
UTF-8 character is supported in every code. Dropping unsupported
characters, on the other hand, is considered really abnormal in most
cases.
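
To illustrate the difference between substituting and dropping, here is
a small sketch using Python's own codecs (not base iconv, whose behavior
is the subject of this report). The handler name "geta" is made up for
this example; GETA MARK is U+3013.

```python
import codecs

GETA = "\u3013"  # GETA MARK, used as the substitute character

def geta_handler(exc):
    """Replace each unencodable character with GETA MARK."""
    if isinstance(exc, UnicodeEncodeError):
        return (GETA * (exc.end - exc.start), exc.end)
    raise exc

# "geta" is a hypothetical handler name registered for this sketch only.
codecs.register_error("geta", geta_handler)

text = "a\u2171b"  # U+2171 is not in plain JIS X 0208

substituted = text.encode("shift_jis", errors="geta")   # keeps a placeholder
dropped = text.encode("shift_jis", errors="ignore")     # silently loses it

print(substituted.hex(), dropped.hex())
```

With substitution the reader can still see that something was lost at
that position; with dropping, the loss is invisible.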

Can someone confirm and fix this? Looking in the src tree, the
corresponding csmapper sources seem to exist. But my knowledge of iconv
internals is insufficient, so I can't figure out why these errors occur.
I have no fix, sorry. (It's beyond my ability.)


Some technical notes:

In JIS X 0201 and its variants, half-width katakana characters are
supported directly in the 8-bit encoding and via shift-out/shift-in in
the 7-bit encoding.
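
As a rough illustration of the two forms, the sketch below hand-encodes
half-width katakana (U+FF61..U+FF9F, which map to 0xA1..0xDF) in both
ways. The function names are hypothetical, and for simplicity it passes
ASCII through unchanged, ignoring that the JIS X 0201 Roman set differs
from ASCII at 0x5C and 0x7E.

```python
SO, SI = 0x0E, 0x0F  # shift-out / shift-in control codes

def kana_byte(ch):
    """Return the 8-bit JIS X 0201 byte for a half-width kana, else None."""
    cp = ord(ch)
    if 0xFF61 <= cp <= 0xFF9F:
        return cp - 0xFF61 + 0xA1
    return None

def encode_8bit(text):
    """Kana occupy 0xA1..0xDF directly; ASCII is passed through."""
    out = bytearray()
    for ch in text:
        b = kana_byte(ch)
        if b is None and ord(ch) >= 0x80:
            raise ValueError("not representable: %r" % ch)
        out.append(ord(ch) if b is None else b)
    return bytes(out)

def encode_7bit(text):
    """Kana are sent as 0x21..0x5F between SO and SI."""
    out = bytearray()
    shifted = False
    for ch in text:
        b = kana_byte(ch)
        if b is None and ord(ch) >= 0x80:
            raise ValueError("not representable: %r" % ch)
        if (b is not None) != shifted:      # need to switch sets
            out.append(SO if b is not None else SI)
            shifted = b is not None
        out.append(ord(ch) if b is None else b & 0x7F)
    if shifted:
        out.append(SI)                      # return to the initial state
    return bytes(out)

print(encode_8bit("a\uff71b").hex(), encode_7bit("a\uff71b").hex())
```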

JIS X 0212 is an extension to JIS X 0208, not a superset of it.

On the other hand, JIS X 0213 is a modified superset of JIS X 0208. (It
includes almost all of JIS X 0208 but is not fully compatible, as some
code points were changed, subsumed or split. In addition, many of the MS
extended characters are included.)

In strict EUC-JP (equivalent to the EUC variant of ISO2022-JP),
half-width katakana characters are intentionally unsupported, but EUC-JP
itself can support them in a 2-byte form led by 0x8E followed by the
JIS X 0201 code. Likewise, in strict EUC-JP the JIS X 0212 extended
characters are not supported, but EUC-JP itself can support them in a
3-byte form led by 0x8F.
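
The byte forms can be inspected with Python's `euc_jp` codec, which
accepts both the 0x8E and 0x8F forms. This is a sketch for illustration
only; `euc_form` is a hypothetical helper, and U+4E02 is used on the
assumption that it is a JIS X 0212 character.

```python
def euc_form(ch):
    """Classify one character by the shape of its EUC-JP byte sequence."""
    data = ch.encode("euc_jp")
    if data[0] == 0x8E:
        kind = "JIS X 0201 kana"   # 2 bytes: 0x8E + kana byte
    elif data[0] == 0x8F:
        kind = "JIS X 0212"        # 3 bytes: 0x8F + two bytes
    elif len(data) == 2:
        kind = "JIS X 0208"        # 2 bytes, both with the high bit set
    else:
        kind = "ASCII"
    return kind, data.hex()

for ch in ("A", "\u3042", "\uff71", "\u4e02"):
    print("U+%04X" % ord(ch), *euc_form(ch))
```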

In my multi-byte test string, the character encoded as 0xE2 0x85 0xB1
in UTF-8 (U+2171) is vendor-specific with respect to JIS X 0208 and
JIS X 0208 + 0212 (equivalent to ISO2022-JP-1 excluding half-width
katakana characters), including their SHIFT_JIS and EUC variants. Some
of these vendor-specific characters were introduced into the standard
in JIS X 0213. Vendor-specific variants such as CP932 had them before
JIS X 0213.
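
For comparison, the following sketch checks which of Python's Japanese
codecs accept that character. It says nothing about base iconv or GNU
iconv; the codec list is just a selection for this example, and
`can_encode` and `support_table` are hypothetical helpers.

```python
def can_encode(ch, codec):
    """True if the codec can represent the character without error."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

def support_table(ch, names=("shift_jis", "euc_jp", "iso2022_jp_1",
                             "cp932", "shift_jis_2004", "euc_jis_2004")):
    """Map each codec name to whether it accepts the character."""
    return {name: can_encode(ch, name) for name in names}

# The test character, decoded from the UTF-8 bytes quoted above.
ch = b"\xe2\x85\xb1".decode("utf-8")

for name, ok in support_table(ch).items():
    print("%-16s %s" % (name, "supported" if ok else "unsupported"))
```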

Regards.

-- 
Tomoaki AOKI    junchoon at dec.sakura.ne.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test-iconv-results.pdf
Type: application/pdf
Size: 28806 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-stable/attachments/20131208/85102db3/attachment.pdf>

