UTF-8 <-> UTF-16BE Converter in Kernel Needs Test

Sun Aug 13 12:27:15 UTC 2006

Hi, Intron,

- iconv(9), aka kiconv, is not an implementation of POSIX iconv(3).
- UDF is another kiconv user.
- kiconv is not a present for Microsoft.
- UCS-2 is not enough to represent all of GB18030.
   I'd like to know how Microsoft handles GB18030.
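
To illustrate the UCS-2 point: GB18030 covers code points beyond the BMP, and those need a surrogate pair (4 bytes) in UTF-16BE, which a single 16-bit unit cannot hold. Below is a minimal userland sketch; the function name utf16be_encode is hypothetical and is not part of kiconv.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of kiconv): encode one Unicode code
 * point as UTF-16BE into out[], which must hold at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp is not encodable.
 */
static size_t
utf16be_encode(uint32_t cp, unsigned char *out)
{
	uint32_t hi, lo;

	/* Surrogate code points and values past U+10FFFF are invalid. */
	if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
		return (0);

	if (cp < 0x10000) {
		/* BMP: one 16-bit unit, the only case UCS-2 can express. */
		out[0] = (unsigned char)(cp >> 8);
		out[1] = (unsigned char)(cp & 0xFF);
		return (2);
	}

	/* Supplementary planes: split 20 bits across a surrogate pair. */
	cp -= 0x10000;
	hi = 0xD800 | (cp >> 10);
	lo = 0xDC00 | (cp & 0x3FF);
	out[0] = (unsigned char)(hi >> 8);
	out[1] = (unsigned char)(hi & 0xFF);
	out[2] = (unsigned char)(lo >> 8);
	out[3] = (unsigned char)(lo & 0xFF);
	return (4);
}
```

With this sketch, U+4E2D fits in 2 bytes (4E 2D), while U+20000 takes 4 bytes (D8 40 DC 00).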

 > 1. I don't know why the author takes the concept of Microsoft's 16-bit
 >     wchar_t as UTF-16BE (the macro ENCODING_UNICODE in /sys/sys/iconv.h).

You can see why it's UTF-16BE in the CVS logs.

- R. Imura

Intron wrote:
> Yoshihiro Ota wrote:
> 
>> You may try these patches, first.
>> http://people.freebsd.org/~imura/kiconv/
>>
>> It sounds like these patches implement better support.
>>
>> Hiro
>>
>> On Sun, 13 Aug 2006 01:28:17 +0800
>> "Intron" <mag at intron.ac> wrote:
>>
>>> I'm sorry to send my experimental patch set here to call for testing.
>>> But if I sent it to freebsd-i18n@, I am afraid no one would respond.
>>>
>>> Download: http://ftp.intron.ac/tmp/kiconv_utf8_20060813.tar.bz2
>>>
>>> My patch set implements a UTF-8 <-> UTF-16BE converter for iconv in the
>>> kernel. It doesn't need kiconv(3) to send unnecessary UTF-8 <-> UTF-16BE
>>> conversion tables to the kernel, and it doesn't require the help of GNU
>>> libiconv, which kiconv(3) depends on.
>>>
>>> With my patch set, mounting a FAT/NTFS/ISO9660 file system occupies
>>> fewer resources than before:
>>>
>>> mount_msdosfs -L ll_NN.UTF-8 /dev/md0s1 /mnt
>>>
>>> See my "readme.txt" for the installation guide.
>>>
>>>                 ************  ATTENTION !!!  ************
>>>
>>> 1. Do NOT test my patch set upon your CRITICAL FAT/NTFS partition !!!
>>>
>>> 2. Limited by BUGGY FreeBSD modules msdosfs/ntfs/cd9660, whether you
>>>     use my patch set or not, only 1- and 2-byte UTF-8 characters (up to
>>>     0x7ff) are supported, which means only a few languages are supported.
>>>
>>>     I will try to patch those modules step by step to support all
>>>     languages (up to 6-byte UTF-8 characters) included in current Unicode.
>>>
>>> ------------------------------------------------------------------------
>>>                                                  From Beijing, China
>>>
>>> _______________________________________________
>>> freebsd-hackers at freebsd.org mailing list
>>> http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
>>> To unsubscribe, send any mail to "freebsd-hackers-unsubscribe at freebsd.org"
> 
> I have looked at his patch set. It has some essential problems:
> 
> 1. I don't know why the author takes the concept of Microsoft's 16-bit
>     wchar_t as UTF-16BE (the macro ENCODING_UNICODE in /sys/sys/iconv.h).
>     A 16-bit wchar_t is only enough for UCS-2 BE/LE (Unicode BMP), while
>     real UTF-16 also includes 4-byte forms (surrogate pairs).
> 
> 2. Actually, kernel iconv has so far been used only for Microsoft formats
>     (FAT32, NTFS, Joliet extension to ISO 9660, SambaFS). It should be a
>     minimal function set that just fits Microsoft's needs. Above all, it is
>     not a complete implementation of UNIX98 iconv and should be as simple
>     as possible.
> 
> 3. In fact, UNIX98 iconv(3) handles any character set as a char array.
>     Using wchar_t in the modules msdosfs/cd9660/ntfs is not good style.
>     Byte-oriented functions such as memcpy() should be used instead.
>     Even if 5/6-byte UTF-8 sequences (Annex D of ISO/IEC 10646-1:2000) or
>     other special encodings are allowed, handling them as char arrays
>     will still be robust.
> 
> ------------------------------------------------------------------------
>                                                 From Beijing, China
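
For reference, the byte-array approach from point 3 can be sketched in userland C as follows. This is only an illustration under stated assumptions: the names utf8_decode and utf8_to_utf16be are hypothetical, none of this code comes from either patch set, and it accepts only the 1- to 4-byte UTF-8 forms (it rejects the 5/6-byte forms, overlong sequences and encoded surrogates).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: decode one UTF-8 sequence from src[0..len).
 * Returns the number of bytes consumed, or 0 on malformed input.
 */
static size_t
utf8_decode(const unsigned char *src, size_t len, uint32_t *cp)
{
	/* Smallest code point allowed for each sequence length. */
	static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
	size_t n, i;
	uint32_t c;

	if (len == 0)
		return (0);
	if (src[0] < 0x80) {
		n = 1; c = src[0];
	} else if ((src[0] & 0xE0) == 0xC0) {
		n = 2; c = src[0] & 0x1F;
	} else if ((src[0] & 0xF0) == 0xE0) {
		n = 3; c = src[0] & 0x0F;
	} else if ((src[0] & 0xF8) == 0xF0) {
		n = 4; c = src[0] & 0x07;
	} else
		return (0);		/* stray continuation or 5/6-byte lead */
	if (len < n)
		return (0);		/* truncated sequence */
	for (i = 1; i < n; i++) {
		if ((src[i] & 0xC0) != 0x80)
			return (0);
		c = (c << 6) | (src[i] & 0x3F);
	}
	/* Reject overlong forms, out-of-range values and surrogates. */
	if (c < min_cp[n] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
		return (0);
	*cp = c;
	return (n);
}

/*
 * Convert UTF-8 bytes to UTF-16BE bytes using plain byte arrays.
 * Returns bytes written, or (size_t)-1 on bad input or a short buffer.
 */
static size_t
utf8_to_utf16be(const unsigned char *src, size_t srclen,
    unsigned char *dst, size_t dstlen)
{
	size_t in = 0, out = 0, n;
	uint32_t cp, hi, lo;

	while (in < srclen) {
		n = utf8_decode(src + in, srclen - in, &cp);
		if (n == 0)
			return ((size_t)-1);
		in += n;
		if (cp < 0x10000) {
			if (dstlen - out < 2)
				return ((size_t)-1);
			dst[out++] = (unsigned char)(cp >> 8);
			dst[out++] = (unsigned char)(cp & 0xFF);
		} else {
			/* Beyond the BMP: emit a surrogate pair. */
			if (dstlen - out < 4)
				return ((size_t)-1);
			cp -= 0x10000;
			hi = 0xD800 | (cp >> 10);
			lo = 0xDC00 | (cp & 0x3FF);
			dst[out++] = (unsigned char)(hi >> 8);
			dst[out++] = (unsigned char)(hi & 0xFF);
			dst[out++] = (unsigned char)(lo >> 8);
			dst[out++] = (unsigned char)(lo & 0xFF);
		}
	}
	return (out);
}
```

Because both sides are handled as unsigned char arrays, the same loop covers 2-byte and 4-byte UTF-16BE output without depending on the width of wchar_t.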