converters/libiconv change request for net/samba3

Thu Jul 8 18:33:11 PDT 2004

Mr. Iijima advised me again on ports-jp at jp.freebsd.org, but he is too shy
to post to gnome at freebsd.org.

<cite>
> If Microsoft called some hacked Shift_JIS version Shift_JIS 
> it doesn't make it valid for the rest of the world.

Absolutely right, in a context in which JIS-Unicode mapping does matter.
In such a context, we cannot, and will not, call CP932 Shift_JIS.

But in practice, we have traditionally treated official Shift_JIS,
Microsoft CP932, Apple Japanese set, Sun Java's "SJIS" encoding, etc.
as identical to each other, whenever we do not use extra characters
added by each vendor (for example, codepoints circa 0x85??-0x87??).

Historically, JIS X0208 was born in 1978 and soon 'shift' encoding
(what we now call Shift_JIS) and 8th-bit-on encoding (what we now call
EUC) were invented. Of course there was no Unicode at that time and
JIS did not describe precisely how each symbol is used.

For instance, Shift_JIS 0x8166 (which all vendors now map to U+2019 RIGHT
SINGLE QUOTATION MARK) played two roles: right quotation mark and
apostrophe. That is despite the fact that we now have U+FF07 FULLWIDTH
APOSTROPHE!

It was only in 1997 that JIS X0208 was revised to identify each symbol by
English name and Unicode codepoint, and by then it was too late. I heard
that a draft was proposed in 1994, but virtually nobody knew about it.

So each vendor designed its own JIS-Unicode mapping in the way it thought
best. Microsoft CP932 and the Apple Japanese set are examples of such
mappings, each with extra characters added.

# Apple Japanese set is available at:
# http://www.unicode.org/Public/MAPPINGS/VENDORS/APPLE/JAPANESE.TXT

See Shift_JIS codepoint 0x81CA. Most tables map this symbol to U+00AC
NOT SIGN, but Microsoft maps it to U+FFE2 FULLWIDTH NOT SIGN, to avoid a
conflict with the OS/2 single-byte NOT SIGN.

> For instance you can not have 1:1 mapping between cp932 and eucJP.

You are right, if you apply the rule strictly.

So we do not go through Unicode; instead we map CP932 or Shift_JIS
characters to EUC-JP by this simple calculation, with no conversion table:

	Either MS, Apple, or Shift_JIS    EUC-JP
	-----------------------------------------------
	0x00-0x7F (ASCII)              -> 0x00-0x7F
	0xA1-0xDF (JIS X0201 Katakana) -> 0x8EA1-0x8EDF

	0x8140-0x819E (except 0x817F)  -> 0xA1A1-0xA1FE
	0x819F-0x81FC                  -> 0xA2A1-0xA2FE
	0x8240-0x829E (ditto)          -> 0xA3A1-0xA3FE
	0x829F-0x82FC                  -> 0xA4A1-0xA4FE
		:
	0x9F9F-0x9FFC                  -> 0xDEA1-0xDEFE

	0xE040-0xE09E (ditto)          -> 0xDFA1-0xDFFE
		:
	0xEF9F-0xEFFC                  -> 0xFEA1-0xFEFE

	0x80 and 0xFD-0xFF (Apple only)-> not supported

	(CP932 only)
	0xF040-0xF9FC (private area)   -> not supported
	0xFA40-0xFC4B (IBM extension)  -> few converters support them,
			but there are two ways to handle them:
			(a) every character here has a duplicate within
			the range of 0x8140-0xEFFC (namely 0x87?? and
			0xED40-0xEEFC) for historical reasons, so it can
			be unified with its counterpart, though this breaks
			round-trip conversion.
			(b) most characters here (perhaps all) are included
			in JIS X0212 (specified by 'ESC $ ( D' in 7-bit
			encodings and prefixed by 0x8F in EUC-JP), so
			you can convert them to X0212 characters if your
			applications support X0212.

Extra characters added by Microsoft or Apple are mapped to undefined EUC-JP
codepoints, but we either use a font that supports such extras or eliminate
these characters altogether.
</cite>
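
As a minimal sketch of the table-free calculation quoted above, the
following Python function converts the common part of Shift_JIS, CP932 and
the Apple set to EUC-JP by arithmetic alone. The name sjis_to_eucjp is
hypothetical, and the sketch assumes that the vendor-only ranges (0x80,
lead bytes 0xF0 and above) are simply rejected. It illustrates the
calculation only; it is not what Samba or libiconv actually does.

```python
def sjis_to_eucjp(data: bytes) -> bytes:
    """Convert Shift_JIS bytes to EUC-JP by arithmetic, with no table.

    Hypothetical helper for illustration; vendor-only ranges raise ValueError.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if b <= 0x7F:
            # ASCII passes through unchanged.
            out.append(b)
            i += 1
        elif 0xA1 <= b <= 0xDF:
            # JIS X0201 Katakana: prefix with 0x8E.
            out += bytes([0x8E, b])
            i += 1
        elif 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
            if i + 1 >= len(data):
                raise ValueError("truncated double-byte character")
            t = data[i + 1]
            if not 0x40 <= t <= 0xFC or t == 0x7F:
                raise ValueError(f"invalid trail byte 0x{t:02X}")
            # Each lead byte covers two JIS rows; 0xE0 continues after 0x9F.
            base = 0x81 if b <= 0x9F else 0xC1
            row = (b - base) * 2 + 1
            if t >= 0x9F:
                # Upper half of the trail range is the even row.
                row += 1
                cell = t - 0x9F + 1
            else:
                # Lower half is the odd row; skip the unused 0x7F.
                cell = t - 0x40 + 1
                if t > 0x7F:
                    cell -= 1
            out += bytes([0xA0 + row, 0xA0 + cell])
            i += 2
        else:
            # 0x80, 0xA0, 0xF0-0xFF: vendor extensions, not supported here.
            raise ValueError(f"unsupported byte 0x{b:02X}")
    return bytes(out)


if __name__ == "__main__":
    print(sjis_to_eucjp(b"\x93\xfa").hex())  # c6fc
```

For characters within JIS X0208 the result should agree with a table-based
converter, which makes it easy to compare against whatever codec you have
at hand.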

>>>>>	Alexander Nedotsukov <bland at FreeBSD.org> wrote:

> Btw, are you guys pretty sure your problem comes from libiconv? I have
> a few Japanese Windows workstations here and, if you like, can check what's
> wrong with them. Just give me simple instructions on how to reproduce the
> problem in this case. I am asking because I have already seen false reports
> about libiconv problems when people tried to convert the Windows client
> encoding to Samba's host encoding, and this is not always possible. For
> instance you can not have 1:1 mapping between cp932 and eucJP.

MORIYAMA Masayuki at MIRACLE LINUX CORPORATION also showed me an example
using vim6. He sent it to me in Japanese and I translated it, so there may
be mistakes in my translation.

<cite>
Steps to reproduce the wrong mapping:

1. Install libiconv-1.9.1 with the patch to add modified cp932 and
   eucJP-ms.
   http://www2d.biglobe.ne.jp/~msyk/software/libiconv/libiconv-1.9.1-cp932.patch.gz

2. Install vim6 according to the explanation in
   http://pcmania.jp/~moraz/howto/install.html (written in Japanese)

3. Configure your ~/.vimrc

-----------
set encoding=japan
if has('iconv')
  set fileencodings+=iso-2022-jp
  set fileencodings+=utf-8,ucs-2le,ucs-2
  if &encoding ==# 'euc-jp'
    set fileencodings+=cp932
  else
    set fileencodings+=euc-jp
  endif
endif
-----------

4. Run vim and open the Shift_JIS file, tmp.txt:

-----------
日本語〜
\~
-----------

5. You will see that "〜" is not displayed correctly. This is the expected
   result of the correct, modified cp932 mapping in libiconv, so you have
   to change your ~/.vimrc to use sjis instead of cp932:

-----------
set encoding=japan
if has('iconv')
  set fileencodings+=iso-2022-jp
  set fileencodings+=utf-8,ucs-2le,ucs-2
  if &encoding ==# 'euc-jp'
    set fileencodings+=sjis
  else
    set fileencodings+=euc-jp
  endif
endif
-----------

6. Open tmp.txt again; this time the contents are displayed correctly.

7. After executing the ex command ":w!" in vim, you will get an error:

-----------
"tmp.txt"
"tmp.txt" E513: write error, conversion failed
Hit ENTER or type command to continue
-----------

   Note: this is because converting 0x5C and 0x7E in euc-jp to 0x5C and
   0x7E in Shift_JIS respectively is impossible with the original
   libiconv implementation. For example,

   $ echo '\~' | /usr/local/bin/iconv -f euc-jp -t sjis
   iconv: (stdin): cannot convert

About the "0x5C and 0x7E in Shift_JIS" problem, this page is helpful:
http://www.debian.or.jp/~kubota/unicode-symbols-yen.html.en

The following patch was made to solve the problem described above:
http://www2d.biglobe.ne.jp/~msyk/software/libiconv/libiconv-1.9.1-cp932-jis.patch.gz
</cite>
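
To see how the same Shift_JIS bytes can land on different Unicode
codepoints, it may help to decode them with two codecs side by side. The
sketch below uses Python's built-in shift_jis and cp932 codecs as a
stand-in. They are not libiconv, and it is only an assumption that their
tables correspond to the "official" and Microsoft mappings discussed in
this thread. The helper name decode_both is hypothetical; 0x8160 is the
Shift_JIS code of "〜" and 0x81CA is the NOT SIGN case mentioned earlier.

```python
import unicodedata


def decode_both(raw: bytes) -> tuple[str, str]:
    """Decode the same bytes as Shift_JIS and as CP932 (hypothetical helper)."""
    return raw.decode("shift_jis"), raw.decode("cp932")


def describe(ch: str) -> str:
    """Format one character as 'U+XXXX NAME'."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    # Two double-byte codes whose Unicode mapping differs between vendors.
    for raw in (b"\x81\x60", b"\x81\xca"):
        std, ms = decode_both(raw)
        print(f"0x{raw.hex().upper()}: shift_jis -> {describe(std)}")
        print(f"        cp932     -> {describe(ms)}")
```

If the two columns differ for a given code, text that passes through
Unicode under one name cannot always be written back under the other name.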

Additional information from MORIYAMA-san:

<cite>
Because the libiconv + vim6 problem is not related to Samba 3.0, Samba 3.0
works without problems using
http://www2d.biglobe.ne.jp/~msyk/software/libiconv/libiconv-1.9.1-cp932.patch.gz
which does not include the JIS hack.

The libiconv-1.9.1-cp932-jis.patch.gz must have the same conversion rules
as "The conversion table of characters which has less harm in JIS-Unicode
conversion": http://hp.vector.co.jp/authors/VA010341/unicode/ (Japanese)
</cite>

Finally, in my experience:

1. pkg_delete -f ja-samba-2.2.9-ja-1.0
2. (cd /usr/ports/net/samba3; make install ALWAYS_BUILD_DEPENDS=yes)
3. Configure /usr/local/etc/smb.conf

---------
[global]
        dos charset = CP932
        unix charset = EUCJP-MS
        display charset = CP932
        netbios name = SAMBA3
        ; and so on
[homes]
        comment = %U's Home Directories
        valid users = %S
        read only = No
        nt acl support = No
        browseable = No
---------

4. When browsing "\\SAMBA3\nakaji", Japanese filenames stored in euc-jp are
   not displayed correctly.
   http://www.rc.tutrp.tut.ac.jp/~nakaji/tmp/libiconv-NG.JPG

5. With the patched libiconv and the same smb.conf, they are displayed
   correctly.
   http://www.rc.tutrp.tut.ac.jp/~nakaji/tmp/libiconv-OK.JPG

Thanks.
-- 
NAKAJI Hiroyuki