kern/151845: smbfs should be upgraded to support Unicode

Michael Meelis m.meelis at easybow.com
Sun Oct 31 13:40:10 UTC 2010


>Number:         151845
>Category:       kern
>Synopsis:       smbfs should be upgraded to support Unicode
>Confidential:   no
>Severity:       serious
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          update
>Submitter-Id:   current-users
>Arrival-Date:   Sun Oct 31 13:40:09 UTC 2010
>Closed-Date:
>Last-Modified:
>Originator:     Michael Meelis
>Release:        8.1-RELEASE
>Organization:
EasyBOW
>Environment:
8.1-RELEASE FreeBSD 8.1-RELEASE
>Description:
Windows stores all file names in UTF-16. When you copy files from Windows to FreeBSD through a Samba server, Samba converts the file names from UTF-16 to UTF-8. When you fetch the files back through Samba, the reverse conversion occurs. This bidirectional conversion is correct and lossless. It is possible because the Samba server uses a modern protocol with UTF-16 support. In this direction everything works.
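
A minimal sketch of why this direction is lossless. Python is used here only for illustration and is not part of Samba or smbfs; the file name is the example used later in this report.

```python
# Round-trip a file name between UTF-16 (Windows side) and UTF-8 (FreeBSD side).
name = "Fr\u00d8ya.html"  # "FrØya.html"

utf16_on_windows = name.encode("utf-16-le")
utf8_on_freebsd = utf16_on_windows.decode("utf-16-le").encode("utf-8")
back_on_windows = utf8_on_freebsd.decode("utf-8").encode("utf-16-le")

# Both encodings cover all of Unicode, so nothing is lost in either direction.
roundtrip_ok = back_on_windows == utf16_on_windows
```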

When you want to copy files from FreeBSD to Windows, you first mount the Windows share using mount_smbfs and smbfs.ko. But smbfs.ko (which does the main work) supports only the old DOS-style protocol without Unicode support. It uses a simple single-byte encoding. On the Windows side, the server component converts the byte-coded characters to UTF-16 using a conversion table. By default Windows (a beautiful "I know better" solution) uses CP437. But in most cases the ISO8859-1 table is used to represent the wide range of file names. I checked this by analyzing your test archive. And that is not all: I found many characters that do not fit into ISO8859-1 because they come from the WINDOWS-1252 table (I checked this too).
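
To illustrate why the choice of table matters, the sketch below decodes one byte under the three tables named above and checks whether the euro sign fits each table. It uses Python's built-in codecs as stand-ins for the Windows tables, and the euro sign is only an assumed example of a WINDOWS-1252-only character, not one taken from the test archive.

```python
# The same byte means different characters depending on the table in use.
raw = bytes([0xD8])
as_cp437 = raw.decode("cp437")
as_latin1 = raw.decode("iso8859-1")
as_cp1252 = raw.decode("cp1252")

def fits(text, encoding):
    """Return True if every character of text exists in the given table."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# The euro sign exists in WINDOWS-1252 but not in ISO8859-1.
euro_in_latin1 = fits("\u20ac", "iso8859-1")
euro_in_cp1252 = fits("\u20ac", "cp1252")
```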

So even if we use a UTF-8 to CP437 conversion on FreeBSD, we lose most of the additional characters on the FreeBSD side. If we use a UTF-8 to WINDOWS-1252 conversion on FreeBSD, we lose nothing on the FreeBSD side, but we lose the same characters as in the previous case on the Windows side.
We MUST change the conversion on the Windows side to the correct one: the WINDOWS-1252 table must be used.
After that we can use the UTF-8 to WINDOWS-1252 conversion on FreeBSD and get a perfect result.
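
The two cases can be sketched as follows. This is a simplified model that assumes the name travels as raw single-byte data and that the server decodes it with exactly one table; Python's codecs stand in for the real conversion tables.

```python
name = "Fr\u00d8ya.html"  # "FrØya.html"

def can_encode(text, encoding):
    """Return True if text converts to the given table without loss."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# Case 1: UTF-8 -> CP437 on FreeBSD. The character is already lost locally.
lost_on_freebsd = not can_encode(name, "cp437")

# Case 2: UTF-8 -> WINDOWS-1252 on FreeBSD. Nothing is lost locally, and the
# result then depends on which table the Windows side uses to decode the bytes.
wire = name.encode("cp1252")
seen_if_server_uses_cp437 = wire.decode("cp437")
seen_if_server_uses_cp1252 = wire.decode("cp1252")
```

Only the last variable equals the original name, which is the reason for changing the table on the Windows side.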

Additionally, I found that smbfs has a faulty implementation of the conversion from variable-length characters (UTF-8) to single-byte characters (such as WINDOWS-1252). This cannot be fixed without significant effort and would take a long time to debug. But that is not a problem: we can use the "iconv" option in rsync. Libiconv works perfectly with rsync.

Continuing: everything looks fine so far. But Windows can put (and does, I checked) several control characters that are not defined in WINDOWS-1252 into file names. These characters come from UTF-16 and convert to UTF-8 correctly, but the conversion from UTF-8 to WINDOWS-1252 fails. So we need a patch for iconv and libiconv that lets the conversion in libiconv complete without errors (otherwise rsync fails with "can't convert name" or a similar error).
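
A small sketch of this failure mode. The character U+0081 and the file name are assumptions chosen for illustration (the report does not list the actual characters), and Python's strict cp1252 codec stands in for libiconv here.

```python
# Hypothetical file name containing a control character (U+0081).
name = "report\u0081.txt"

# UTF-8 can represent every Unicode character, so this step always succeeds.
utf8 = name.encode("utf-8")

# A strict conversion to WINDOWS-1252 fails because the table has no such entry.
failed_char = None
try:
    name.encode("cp1252")
    strict_ok = True
except UnicodeEncodeError as exc:
    strict_ok = False
    failed_char = exc.object[exc.start:exc.end]
```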

I nearly lost my mind with smbfs and rsync. I made a new smbfs patch that replaces unconvertible characters with "_". Then rsync went crazy: it copies the same files to the Windows share again and again when run several times with the same parameters. Funny, but bad.
This problem is connected with the complete conversion sequences:
on the first and every later run, while copying files to the Windows share: localfs->rsync->smbfs->iconv->patch->windowsfs
on later runs, while finding the files that need to be synced: windowsfs->smbfs->iconv->rsync->localfs
Here a file named, for example, "FrØya.html" is converted to "Fr_ya.html" on the Windows share, and the first rsync run finishes without errors. But when rsync runs a second time, it looks up "FrØya.html" on the Windows share and finds only "Fr_ya.html" (rsync does not know about the lossy conversion inside smbfs), so it copies the file again and again. Bug.
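
The loop can be reproduced with a toy model. This is not rsync's real algorithm: sync_once and to_share are hypothetical helpers that only compare names, and CP437 is assumed as the target table.

```python
def to_share(name, encoding="cp437"):
    """Model of the patched smbfs: unconvertible characters become '_'."""
    out = []
    for ch in name:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append("_")
    return "".join(out)

def sync_once(local, share):
    """Copy every local name that is not found on the share; return those names."""
    copied = [n for n in sorted(local) if n not in share]
    for n in copied:
        share.add(to_share(n))  # the lossy step is hidden from the caller
    return copied

local = {"Fr\u00d8ya.html"}
share = set()
first = sync_once(local, share)
second = sync_once(local, share)  # same file is selected again
```

Both runs select the same file, because the name stored on the share never matches the local name.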

To fix this we need to leave the smbfs module untouched, add a new conversion table to libiconv (so that the "_" replacement happens inside rsync), and use rsync with the "iconv" option.

I added a new encoding, "CP437FIXED", which always converts successfully by mapping unconvertible symbols to '_'.
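
The CP437FIXED.TXT table itself is not part of the attached patch, so its exact mappings are not shown here. As a rough approximation of the idea only, the sketch below registers a Python error handler (the name "underscore" and the helper encode_fixed are made up for this example) that turns every unconvertible character into "_" on top of the plain cp437 codec.

```python
import codecs

def _underscore(exc):
    """Replace each unencodable character with '_' instead of failing."""
    if isinstance(exc, UnicodeEncodeError):
        return ("_" * (exc.end - exc.start), exc.end)
    raise exc

codecs.register_error("underscore", _underscore)

def encode_fixed(name):
    """Approximation of a never-failing CP437 conversion."""
    return name.encode("cp437", errors="underscore")

local_name = "Fr\u00d8ya.html"
encoded = encode_fixed(local_name)

# When the caller itself performs the conversion, the converted name can be
# compared directly with what is already stored on the share.
share = {encoded}
needs_copy = encode_fixed(local_name) not in share
```

Because the replacement now happens in the tool doing the comparison, a second run sees a matching name and has nothing to copy.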

>How-To-Repeat:

>Fix:
smbfs should be upgraded to support Unicode. Until then, work with the attached libiconv patch and the new CP437FIXED encoding. (The full patch and test do not fit within the 100 KB and .txt extension limits.)

Patch attached with submission follows:

--- libcharset/tools/all-charsets.orig	2009-06-21 11:17:33.000000000 +0000
+++ libcharset/tools/all-charsets	2010-06-29 00:11:59.000000000 +0000
@@ -21,7 +21,7 @@
     ISO-8859-7 | ISO-8859-8 | ISO-8859-9 | ISO-8859-13 | ISO-8859-14 | ISO-8859-15 | \
     KOI8-R | KOI8-U | KOI8-T | \
     CP437 | CP775 | CP850 | CP852 | CP855 | CP856 | CP857 | CP861 | CP862 | CP864 | CP865 | CP866 | CP869 | CP874 | CP922 | CP932 | CP943 | CP949 | CP950 | CP1046 | CP1124 | CP1125 | CP1129 | CP1131 | \
-    CP1250 | CP1251 | CP1252 | CP1253 | CP1254 | CP1255 | CP1256 | CP1257 | \
+    CP1250 | CP1251 | CP1252 | CP437FIXED | CP1253 | CP1254 | CP1255 | CP1256 | CP1257 | \
     GB2312 | EUC-JP | EUC-KR | EUC-TW | BIG5 | BIG5-HKSCS | GBK | GB18030 | SHIFT_JIS | JOHAB | \
     TIS-620 | VISCII | TCVN5712-1 | ARMSCII-8 | GEORGIAN-PS | PT154 | \
     HP-ROMAN8 | HP-ARABIC8 | HP-GREEK8 | HP-HEBREW8 | HP-TURKISH8 | HP-KANA8 | \
--- lib/flags.h.orig	2009-06-30 20:52:08.000000000 +0000
+++ lib/flags.h	2010-06-29 00:12:55.000000000 +0000
@@ -54,6 +54,7 @@
 #define ei_cp1250_oflags (HAVE_ACCENTS | HAVE_QUOTATION_MARKS)
 #define ei_cp1251_oflags (HAVE_QUOTATION_MARKS)
 #define ei_cp1252_oflags (HAVE_ACCENTS | HAVE_QUOTATION_MARKS)
+#define ei_cp437fixed_oflags (HAVE_ACCENTS | HAVE_QUOTATION_MARKS)
 #define ei_cp1253_oflags (HAVE_QUOTATION_MARKS)
 #define ei_cp1254_oflags (HAVE_ACCENTS | HAVE_QUOTATION_MARKS)
 #define ei_cp1255_oflags (HAVE_ACCENTS | HAVE_QUOTATION_MARKS)
--- lib/encodings.def.orig	2009-06-21 11:17:33.000000000 +0000
+++ lib/encodings.def	2010-06-29 00:14:55.000000000 +0000
@@ -459,6 +459,11 @@
             cp1252)
 #endif
 
+DEFENCODING(( "CP437FIXED",                 /* JDK 1.1 */
+            ),
+            cp437fixed,
+            { cp437fixed_mbtowc, NULL },      { cp437fixed_wctomb, NULL })
+
 DEFENCODING(( "CP1253",                 /* JDK 1.1 */
               "WINDOWS-1253",           /* IANA */
               "MS-GREEK",
--- lib/aliases.h.orig	2009-06-30 20:51:58.000000000 +0000
+++ lib/aliases.h	2010-06-29 14:42:30.000000000 +0000
@@ -32,11 +32,11 @@
 #line 1 "lib/aliases.gperf"
 struct alias { int name; unsigned int encoding_index; };
 
-#define TOTAL_KEYWORDS 346
+#define TOTAL_KEYWORDS 347
 #define MIN_WORD_LENGTH 2
 #define MAX_WORD_LENGTH 45
 #define MIN_HASH_VALUE 7
-#define MAX_HASH_VALUE 935
+#define MAX_HASH_VALUE 936
 /* maximum key range = 929, duplicates = 0 */
 
 #ifdef __GNUC__
@@ -46,24 +46,25 @@
 inline
 #endif
 #endif
+
 static unsigned int
 aliases_hash (register const char *str, register unsigned int len)
 {
   static const unsigned short asso_values[] =
     {
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936,  16,  62, 936,  73,   0,
-        5,   2,  47,   4,   1, 168,   8,  12, 357, 936,
-      936, 936, 936, 936, 936, 112, 123,   3,  14,  34,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937,  16,  62, 937,  73,   0,
+        5,   2,  47,   4,   1, 168,   8,  12, 357, 937,
+      937, 937, 937, 937, 937, 112, 123,   3,  14,  34,
        71, 142, 147,   0, 258,  79,  39, 122,   4,   0,
-      109, 936,  76,   1,  54, 147, 114, 180, 102,   3,
-       10, 936, 936, 936, 936,  34, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936, 936, 936,
-      936, 936, 936, 936, 936, 936, 936, 936
+      109, 937,  76,   1,  54, 147, 114, 180, 102,   3,
+       10, 937, 937, 937, 937,  34, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937, 937, 937,
+      937, 937, 937, 937, 937, 937, 937, 937
     };
   register int hval = len;
 
@@ -452,6 +453,7 @@
     char stringpool_str900[sizeof("BIG5-HKSCS:1999")];
     char stringpool_str908[sizeof("MACHEBREW")];
     char stringpool_str935[sizeof("BIG5-HKSCS:2004")];
+    char stringpool_str936[sizeof("CP437FIXED")];
   };
 static const struct stringpool_t stringpool_contents =
   {
@@ -800,7 +802,8 @@
     "CSHALFWIDTHKATAKANA",
     "BIG5-HKSCS:1999",
     "MACHEBREW",
-    "BIG5-HKSCS:2004"
+    "BIG5-HKSCS:2004",
+    "CP437FIXED"
   };
 #define stringpool ((const char *) &stringpool_contents)
 
@@ -1449,7 +1452,8 @@
     {(int)(long)&((struct stringpool_t *)0)->stringpool_str463, ei_cp1254},
 #line 73 "lib/aliases.gperf"
     {(int)(long)&((struct stringpool_t *)0)->stringpool_str464, ei_iso8859_3},
-    {-1},
+#line 358 "lib/aliases.gperf"
+    {(int)(long)&((struct stringpool_t *)0)->stringpool_str936, ei_cp437fixed},
 #line 89 "lib/aliases.gperf"
     {(int)(long)&((struct stringpool_t *)0)->stringpool_str466, ei_iso8859_5},
 #line 20 "lib/aliases.gperf"
--- lib/aliases.gperf.orig	2009-06-30 20:51:58.000000000 +0000
+++ lib/aliases.gperf	2010-06-29 01:12:13.000000000 +0000
@@ -355,3 +355,4 @@
 CSISO2022KR, ei_iso2022_kr
 CHAR, ei_local_char
 WCHAR_T, ei_local_wchar_t
+CP437FIXED, ei_cp437fixed
--- lib/converters.h.orig	2009-06-21 11:17:33.000000000 +0000
+++ lib/converters.h	2010-06-29 00:20:00.000000000 +0000
@@ -160,6 +160,7 @@
 #include "cp1250.h"
 #include "cp1251.h"
 #include "cp1252.h"
+#include "cp437fixed.h"
 #include "cp1253.h"
 #include "cp1254.h"
 #include "cp1255.h"
--- tests/Makefile.in.orig	2010-06-29 00:20:34.000000000 +0000
+++ tests/Makefile.in	2010-06-29 00:20:27.000000000 +0000
@@ -68,6 +68,7 @@
 	$(srcdir)/check-stateless $(srcdir) CP1250
 	$(srcdir)/check-stateless $(srcdir) CP1251
 	$(srcdir)/check-stateless $(srcdir) CP1252
+	$(srcdir)/check-stateless $(srcdir) CP437FIXED
 	$(srcdir)/check-stateless $(srcdir) CP1253
 	$(srcdir)/check-stateless $(srcdir) CP1254
 	$(srcdir)/check-stateless $(srcdir) CP1255
--- tools/Makefile.orig	2010-06-27 18:09:49.000000000 +0000
+++ tools/Makefile	2010-06-29 00:38:07.000000000 +0000
@@ -26,6 +26,7 @@
  cp1250.h \
  cp1251.h \
  cp1252.h \
+ cp437fixed.h \
  cp1253.h \
  cp1254.h \
  cp1255.h \
@@ -191,6 +192,9 @@
 cp1252.h : $(TABLESDIR)/unicode.org-mappings/VENDORS/MICSFT/WINDOWS/CP1252.TXT 8bit_tab_to_h
 	./8bit_tab_to_h CP1252 cp1252 < $<
 
+cp437fixed.h : $(TABLESDIR)/unicode.org-mappings/VENDORS/MICSFT/WINDOWS/CP437FIXED.TXT 8bit_tab_to_h
+	./8bit_tab_to_h CP437FIXED cp437fixed < $<
+
 cp1253.h : $(TABLESDIR)/unicode.org-mappings/VENDORS/MICSFT/WINDOWS/CP1253.TXT 8bit_tab_to_h
 	./8bit_tab_to_h CP1253 cp1253 < $<
 


>Release-Note:
>Audit-Trail:
>Unformatted:

