From wolfgang at lyxys.ka.sub.org Sun May 27 23:07:46 2007
From: wolfgang at lyxys.ka.sub.org (Wolfgang Zenker)
Date: Sun May 27 23:07:50 2007
Subject: Why no non-latin TODIGIT mappings in UTF-8.src ?
Message-ID: <200705272241.l4RMfg07051300@juno.lyxys.ka.sub.org>

Hello all,

I'm a bit surprised there are no TODIGIT mappings for non-latin scripts
in src/share/mklocale/UTF-8. Is there a technical reason why this would
be a bad idea or is it simply because no one has gotten around to
defining the mappings yet?

Looking at am_ET.UTF-8.src, the mappings are defined using the UTF-8
encoding for the digit signs in their respective script and mapping them
to their numeric value. So, e.g. for Arabic the TODIGIT mappings would be

/* Arabic-Indic digits 0 - 9 */
TODIGIT <0xd9a0 - 0xd9a9 : 0>
/* Extended Arabic-Indic digits 0 - 9 */
TODIGIT <0xdbb0 - 0xdbb9 : 0>

By the way, the TODIGIT mapping in am_ET.UTF-8.src appears to be off by
one, as the Ethiopic digit 1 is 0x1369 in UCS-2, which maps to 0xe18da9
in UTF-8 while in am_ET.UTF-8.src it says 0xe18da8.

Wolfgang

From ache at freebsd.org Mon May 28 07:40:16 2007
From: ache at freebsd.org (Andrey Chernov)
Date: Mon May 28 07:40:19 2007
Subject: Why no non-latin TODIGIT mappings in UTF-8.src ?
In-Reply-To: <200705272241.l4RMfg07051300@juno.lyxys.ka.sub.org>
References: <200705272241.l4RMfg07051300@juno.lyxys.ka.sub.org>
Message-ID: <20070528072847.GA18850@nagual.pp.ru>

On Mon, May 28, 2007 at 12:41:42AM +0200, Wolfgang Zenker wrote:
> Hello all,
>
> I'm a bit surprised there are no TODIGIT mappings for non-latin scripts
> in src/share/mklocale/UTF-8. Is there a technical reason why this would
> be a bad idea or is it simply because no one has gotten around to
> defining the mappings yet?

Because of POSIX isdigit():

digit

Define the characters to be classified as numeric digits.

In the POSIX locale, only:

0 1 2 3 4 5 6 7 8 9

shall be included.
In a locale definition file, only the digits <zero>, <one>, <two>,
<three>, <four>, <five>, <six>, <seven>, <eight>, and <nine> shall be
specified, and in contiguous ascending sequence by numerical value. The
digits <zero> to <nine> of the portable character set are automatically
included in this class.

--
http://ache.pp.ru/

From wolfgang at lyxys.ka.sub.org Mon May 28 08:47:01 2007
From: wolfgang at lyxys.ka.sub.org (Wolfgang Zenker)
Date: Mon May 28 08:47:04 2007
Subject: Why no non-latin TODIGIT mappings in UTF-8.src ?
In-Reply-To: <20070528072847.GA18850@nagual.pp.ru>
References: <200705272241.l4RMfg07051300@juno.lyxys.ka.sub.org>
 <20070528072847.GA18850@nagual.pp.ru>
Message-ID: <20070528084659.GA77240@lyxys.ka.sub.org>

* Andrey Chernov [070528 09:28]:
> On Mon, May 28, 2007 at 12:41:42AM +0200, Wolfgang Zenker wrote:
>> I'm a bit surprised there are no TODIGIT mappings for non-latin scripts
>> in src/share/mklocale/UTF-8. Is there a technical reason why this would
>> be a bad idea or is it simply because no one has gotten around to
>> defining the mappings yet?

> Because of POSIX isdigit():

> digit
> Define the characters to be classified as numeric digits.
> In the POSIX locale, only:
> 0 1 2 3 4 5 6 7 8 9
> shall be included.
> In a locale definition file, only the digits <zero>, <one>, <two>,
> <three>, <four>, <five>, <six>, <seven>, <eight>, and <nine> shall be
> specified, and in contiguous ascending sequence by numerical value. The
> digits <zero> to <nine> of the portable character set are automatically
> included in this class.

Looking at our UTF-8.src, I see

$ grep DIGIT UTF-8.src
DIGIT '0' - '9'
XDIGIT '0' - '9' 'A' - 'F' 'a' - 'f'
TODIGIT < '0' - '9' : 0x0000 >
TODIGIT < 'A' - 'F' : 10 > < 'a' - 'f' : 10 >

It appears to me that isdigit() behaviour is controlled by the DIGIT
keyword, not TODIGIT. However, I do admit that I don't completely
understand how locale files are supposed to work. So where does e.g.
iswdigit() get its character class information from? Should that not be
in the locale information somewhere as well?
Wolfgang

From ache at freebsd.org Mon May 28 11:52:54 2007
From: ache at freebsd.org (Andrey Chernov)
Date: Mon May 28 11:52:56 2007
Subject: Why no non-latin TODIGIT mappings in UTF-8.src ?
In-Reply-To: <20070528084659.GA77240@lyxys.ka.sub.org>
References: <200705272241.l4RMfg07051300@juno.lyxys.ka.sub.org>
 <20070528072847.GA18850@nagual.pp.ru>
 <20070528084659.GA77240@lyxys.ka.sub.org>
Message-ID: <20070528115250.GA24812@nagual.pp.ru>

On Mon, May 28, 2007 at 10:46:59AM +0200, Wolfgang Zenker wrote:
> Looking at our UTF-8.src, I see
>
> $ grep DIGIT UTF-8.src
> DIGIT '0' - '9'
> XDIGIT '0' - '9' 'A' - 'F' 'a' - 'f'
> TODIGIT < '0' - '9' : 0x0000 >
> TODIGIT < 'A' - 'F' : 10 > < 'a' - 'f' : 10 >
>
> It appears to me that isdigit() behaviour is controlled by the DIGIT
> keyword, not TODIGIT. However, I do admit that I don't completely
> understand how locale files are supposed to work. So where does e.g.
> iswdigit() get its character class information from? Should that not be
> in the locale information somewhere as well?

There is no POSIX function to extract TODIGIT info, so it is useless for
now.

todigit() is an SCO extension and its manpage says:

The macro todigit returns the digit character corresponding to its integer
argument. The argument must be in the range 0-9, otherwise the behavior is
undefined.

iswdigit() has the same 0-9 restriction as isdigit(); it just accepts
wint_t.

--
http://ache.pp.ru/

From wolfgang at lyxys.ka.sub.org Mon May 28 12:34:59 2007
From: wolfgang at lyxys.ka.sub.org (Wolfgang Zenker)
Date: Mon May 28 12:35:03 2007
Subject: Why no non-latin TODIGIT mappings in UTF-8.src ?
In-Reply-To: <20070528115250.GA24812@nagual.pp.ru>
References: <200705272241.l4RMfg07051300@juno.lyxys.ka.sub.org>
 <20070528072847.GA18850@nagual.pp.ru>
 <20070528084659.GA77240@lyxys.ka.sub.org>
 <20070528115250.GA24812@nagual.pp.ru>
Message-ID: <20070528123456.GA12679@lyxys.ka.sub.org>

* Andrey Chernov [070528 13:52]:
> On Mon, May 28, 2007 at 10:46:59AM +0200, Wolfgang Zenker wrote:
>> Looking at our UTF-8.src, I see

>> $ grep DIGIT UTF-8.src
>> DIGIT '0' - '9'
>> XDIGIT '0' - '9' 'A' - 'F' 'a' - 'f'
>> TODIGIT < '0' - '9' : 0x0000 >
>> TODIGIT < 'A' - 'F' : 10 > < 'a' - 'f' : 10 >

>> It appears to me that isdigit() behaviour is controlled by the DIGIT
>> keyword, not TODIGIT. However, I do admit that I don't completely
>> understand how locale files are supposed to work. So where does e.g.
>> iswdigit() get its character class information from? Should that not be
>> in the locale information somewhere as well?

> There is no POSIX function to extract TODIGIT info, so it is useless for
> now.

Ok, so the mklocale src files that DO provide additional TODIGIT mappings
(e.g. am_ET.UTF-8.src or ja_JP.SJIS.src) do so just to be prepared
for the day we can use them?

> todigit() is an SCO extension and its manpage says:

> The macro todigit returns the digit character corresponding to its integer
> argument. The argument must be in the range 0-9, otherwise the behavior is
> undefined.

> iswdigit() has the same 0-9 restriction as isdigit(); it just accepts
> wint_t.

I had imagined that TODIGIT would be used for a locale-aware version of
digittoint(3) or something like that. What would be a good place to read
up on how much can be localised with locales, and how much of it we
currently (and maybe in the near future) support?

Wolfgang

From ache at freebsd.org Mon May 28 12:49:46 2007
From: ache at freebsd.org (Andrey Chernov)
Date: Mon May 28 12:49:49 2007
Subject: Why no non-latin TODIGIT mappings in UTF-8.src ?
In-Reply-To: <20070528123456.GA12679@lyxys.ka.sub.org>
References: <200705272241.l4RMfg07051300@juno.lyxys.ka.sub.org>
 <20070528072847.GA18850@nagual.pp.ru>
 <20070528084659.GA77240@lyxys.ka.sub.org>
 <20070528115250.GA24812@nagual.pp.ru>
 <20070528123456.GA12679@lyxys.ka.sub.org>
Message-ID: <20070528124944.GA26009@nagual.pp.ru>

On Mon, May 28, 2007 at 02:34:56PM +0200, Wolfgang Zenker wrote:
> > There is no POSIX function to extract TODIGIT info, so it is useless for
> > now.
>
> Ok, so the mklocale src files that DO provide additional TODIGIT mappings
> (e.g. am_ET.UTF-8.src or ja_JP.SJIS.src) do so just to be prepared
> for the day we can use them?

That depends on which way POSIX moves.

> I had imagined that TODIGIT would be used for a locale-aware version of
> digittoint(3) or something like that.

The manpage for our extension digittoint(3) says:

If the given character was not a digit as defined by isxdigit(3), the
function will return 0.

So, the same 0-9 restriction, plus A-F.

> What would be a good place to read
> up on how much can be localised with locales, and how much of it we
> currently (and maybe in the near future) support?

The Open Group Base Specs Issue 6
http://www.opengroup.org/onlinepubs/009695399/toc.htm

--
http://ache.pp.ru/

From wolfgang at lyxys.ka.sub.org Mon May 28 18:17:36 2007
From: wolfgang at lyxys.ka.sub.org (Wolfgang Zenker)
Date: Mon May 28 18:17:38 2007
Subject: Why no non-latin TODIGIT mappings in UTF-8.src ?
In-Reply-To: <20070528124944.GA26009@nagual.pp.ru>
References: <200705272241.l4RMfg07051300@juno.lyxys.ka.sub.org>
 <20070528072847.GA18850@nagual.pp.ru>
 <20070528084659.GA77240@lyxys.ka.sub.org>
 <20070528115250.GA24812@nagual.pp.ru>
 <20070528123456.GA12679@lyxys.ka.sub.org>
 <20070528124944.GA26009@nagual.pp.ru>
Message-ID: <20070528181829.GA18332@lyxys.ka.sub.org>

* Andrey Chernov [070528 14:49]:
> On Mon, May 28, 2007 at 02:34:56PM +0200, Wolfgang Zenker wrote:
>> What would be a good place to read
>> up on how much can be localised with locales, and how much of it we
>> currently (and maybe in the near future) support?

> The Open Group Base Specs Issue 6
> http://www.opengroup.org/onlinepubs/009695399/toc.htm

So, as 7.3.1 says, in the "POSIX locale", which appears to be otherwise
known as the "C" locale, only '0' to '9' can be defined as being in class
digit. Because we use UTF-8.src as source for the "C" locale, we cannot
add definitions for digits in other scripts, right?

In "a locale", which appears to be the generic case now, we are only
allowed to define the digits <zero> to <nine> in the digit class. The
digits '0' to '9' from the "portable character set" (= ASCII?) would be
automatically included in the class.

So if we have a locale using a non-latin script that happens to have its
own "digit" characters, we cannot use the UTF-8.src for the LC_CTYPE
definitions but would best work with a copy and add DIGIT mappings for
the digit characters in the script used? Or are <zero> to <nine>
again fixed to be the ASCII codes '0' to '9'?

Wolfgang

From wolfgang at lyxys.ka.sub.org Mon May 28 18:39:36 2007
From: wolfgang at lyxys.ka.sub.org (Wolfgang Zenker)
Date: Mon May 28 18:39:39 2007
Subject: Why no non-latin TODIGIT mappings in UTF-8.src ?
In-Reply-To: <20070528181829.GA18332@lyxys.ka.sub.org>
References: <200705272241.l4RMfg07051300@juno.lyxys.ka.sub.org>
 <20070528072847.GA18850@nagual.pp.ru>
 <20070528084659.GA77240@lyxys.ka.sub.org>
 <20070528115250.GA24812@nagual.pp.ru>
 <20070528123456.GA12679@lyxys.ka.sub.org>
 <20070528124944.GA26009@nagual.pp.ru>
 <20070528181829.GA18332@lyxys.ka.sub.org>
Message-ID: <20070528184028.GA19098@lyxys.ka.sub.org>

* Wolfgang Zenker [070528 20:18]:
> * Andrey Chernov [070528 14:49]:
>> On Mon, May 28, 2007 at 02:34:56PM +0200, Wolfgang Zenker wrote:
>>> What would be a good place to read
>>> up on how much can be localised with locales, and how much of it we
>>> currently (and maybe in the near future) support?

>> The Open Group Base Specs Issue 6
>> http://www.opengroup.org/onlinepubs/009695399/toc.htm

> So, as 7.3.1 says, in the "POSIX locale", which appears to be otherwise
> known as the "C" locale, only '0' to '9' can be defined as being in class
> digit. Because we use UTF-8.src as source for the "C" locale, we cannot
> add definitions for digits in other scripts, right?

> In "a locale", which appears to be the generic case now, we are only
> allowed to define the digits <zero> to <nine> in the digit class. The
> digits '0' to '9' from the "portable character set" (= ASCII?) would be
> automatically included in the class.

> So if we have a locale using a non-latin script that happens to have its
> own "digit" characters, we cannot use the UTF-8.src for the LC_CTYPE
> definitions but would best work with a copy and add DIGIT mappings for
> the digit characters in the script used? Or are <zero> to <nine>
> again fixed to be the ASCII codes '0' to '9'?

Found the answer in chapter 6. So, <zero> to <nine> are defined as the
respective digits in the portable character set. This leaves no
possibility to define digits for other scripts, AFAICS.

So, can anyone clue me in why this has been handled this way?
It appears to me that the possibilities of localization are quite limited
as soon as languages in non-latin scripts come into play. Are these
problems usually handled in individual applications then?

Wolfgang

From ache at freebsd.org Mon May 28 19:55:17 2007
From: ache at freebsd.org (Andrey Chernov)
Date: Mon May 28 19:55:20 2007
Subject: Why no non-latin TODIGIT mappings in UTF-8.src ?
In-Reply-To: <20070528184028.GA19098@lyxys.ka.sub.org>
References: <200705272241.l4RMfg07051300@juno.lyxys.ka.sub.org>
 <20070528072847.GA18850@nagual.pp.ru>
 <20070528084659.GA77240@lyxys.ka.sub.org>
 <20070528115250.GA24812@nagual.pp.ru>
 <20070528123456.GA12679@lyxys.ka.sub.org>
 <20070528124944.GA26009@nagual.pp.ru>
 <20070528181829.GA18332@lyxys.ka.sub.org>
 <20070528184028.GA19098@lyxys.ka.sub.org>
Message-ID: <20070528195515.GA32109@nagual.pp.ru>

On Mon, May 28, 2007 at 08:40:28PM +0200, Wolfgang Zenker wrote:
>
> So, can anyone clue me in why this has been handled this way? It appears
> to me that the possibilities of localization are quite limited as soon
> as languages in non-latin scripts come into play. Are these problems
> usually handled in individual applications then?

IMHO this is because of the historic practice of converting between
digit and char using - '0' and + '0'. The monotonic sequence requirement
is another confirmation.

BTW, this originally comes from the ISO C standard; POSIX only inherits
it. The possibilities of localization are limited in a lot of ways
besides that one.

--
http://ache.pp.ru/
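
The - '0' / + '0' practice mentioned above, and the application-level
handling of non-latin digits asked about in this thread, could look
roughly like the following C sketch. The helper names (ascii_digit_value,
ascii_digit_char, script_digit_value) and the range table are
illustrative assumptions and not part of libc or of any mklocale source;
the code points are the Unicode values for the scripts named in the
thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * ISO C guarantees that '0'..'9' are contiguous and ascending, so the
 * numeric value of an ASCII digit is simply c - '0'.
 * Returns -1 if c is not one of '0'..'9'.
 */
int ascii_digit_value(int c)
{
    return (c >= '0' && c <= '9') ? c - '0' : -1;
}

/* Inverse direction: a value in 0..9 to its ASCII digit character. */
int ascii_digit_char(int v)
{
    assert(v >= 0 && v <= 9);
    return '0' + v;
}

/*
 * Hypothetical application-level table (an assumption for illustration):
 * contiguous ranges of Unicode code points that act as decimal digits,
 * together with the numeric value of the first code point in the range.
 */
struct digit_range {
    uint32_t first;       /* first code point of the range */
    uint32_t last;        /* last code point of the range */
    int      first_value; /* numeric value of `first` */
};

static const struct digit_range digit_ranges[] = {
    { 0x0030, 0x0039, 0 }, /* ASCII digits 0-9 */
    { 0x0660, 0x0669, 0 }, /* Arabic-Indic digits 0-9 */
    { 0x06F0, 0x06F9, 0 }, /* Extended Arabic-Indic digits 0-9 */
    { 0x1369, 0x1371, 1 }, /* Ethiopic digits 1-9 (no zero digit) */
};

/*
 * Numeric value of the code point cp if it falls into one of the ranges
 * above, or -1 otherwise. The same offset arithmetic as c - '0' works
 * here because each range is contiguous and ascending.
 */
int script_digit_value(uint32_t cp)
{
    size_t i;

    for (i = 0; i < sizeof(digit_ranges) / sizeof(digit_ranges[0]); i++) {
        if (cp >= digit_ranges[i].first && cp <= digit_ranges[i].last)
            return (int)(cp - digit_ranges[i].first)
                + digit_ranges[i].first_value;
    }
    return -1;
}
```

With these helpers, ascii_digit_value('7') gives 7 and
script_digit_value(0x1369) gives 1 for the Ethiopic digit one, while
isdigit() and iswdigit() stay restricted to '0' to '9' as described in
the thread.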