Report #5

Sat Aug 2 01:18:34 UTC 2014

Hello everyone!

Here is my report on the progress made during this period. I've
implemented the actual Unicode Collation Algorithm for DUCET (Default
Unicode Collation Element Table). I had to rewrite the entire
implementation: I wasn't satisfied with its quality or with the way I
had organized the source code, so I reverted my code and started again.
My previous implementation was full of hard-coded parts, which made it
hard to reuse anything from it in any other project. Now the entire
implementation is available in include/unicode.h and lib/libc/unicode.
If the macro _UNICODE_SOURCE is defined, then wcscoll() will use the new
collation algorithm. struct _xlocale was modified to carry two new
members, colltable and collsize, which are simply passed through to
__ucscoll(). If an element is not found in the given table, or the table
is NULL, then __ucscoll() tries to find the element in DUCET; if it is
not found there either, __ucscoll() generates the collation for it.
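
To make the lookup order concrete, here is a minimal sketch of the
fallback chain described above. It is not the code from
lib/libc/unicode: struct coll_elem, toy_ducet, table_find() and
coll_lookup() are hypothetical names, the weights in the toy table are
invented, and the generated case keeps only the first of the implicit
elements, using the base + (code >> 15) form with the 0xFBC0 base that
UTS #10 gives for otherwise unclassified code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical element layout; the real one may well differ. */
struct coll_elem {
	uint32_t code;		/* code point */
	uint16_t weight[2];	/* primary and secondary weights */
};

/* Toy stand-in for DUCET, sorted by code point; weights are invented. */
static const struct coll_elem toy_ducet[] = {
	{ 0x0061, { 0x1000, 0x0020 } },
	{ 0x0062, { 0x1010, 0x0020 } },
};

/* Binary search in a table sorted by code point; a NULL table is empty. */
const struct coll_elem *
table_find(const struct coll_elem *table, size_t size, uint32_t code)
{
	size_t lo = 0, hi = (table != NULL) ? size : 0;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (table[mid].code == code)
			return (&table[mid]);
		if (table[mid].code < code)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (NULL);
}

/*
 * Fill *out for the given code point.  Returns 0 if the element came
 * from the caller's table, 1 if from the default table, 2 if generated.
 */
int
coll_lookup(const struct coll_elem *colltable, size_t collsize,
    uint32_t code, struct coll_elem *out)
{
	const struct coll_elem *e;

	assert(out != NULL);
	if ((e = table_find(colltable, collsize, code)) != NULL) {
		*out = *e;
		return (0);
	}
	e = table_find(toy_ducet, sizeof(toy_ducet) / sizeof(toy_ducet[0]),
	    code);
	if (e != NULL) {
		*out = *e;
		return (1);
	}
	/* Not found anywhere: generate a simplified implicit weight. */
	out->code = code;
	out->weight[0] = (uint16_t)(0xFBC0 + (code >> 15));
	out->weight[1] = 0x0020;
	return (2);
}
```

With a one-entry custom table that overrides U+0061, coll_lookup()
returns 0 for U+0061, 1 for U+0062 and 2 for any code point that is in
neither table.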

I couldn't understand how the alternate should be used, though; it
seems that it can be dropped, since wcscoll() doesn't have any version
that supports tailoring. I've left it in for now, but I'm pretty sure
that we can omit it.

I haven't had time to test wcscoll() more thoroughly (especially using
the files provided by the Unicode Character Database), so that is the
task I will do right now. :-) There are still several ways to improve
the speed of the algorithm, but I feel that the time for that hasn't
come yet. style(9) issues (if any) will also be handled; I'm just too
tired to do it right now.

__ucscoll() just uses __ucsxfrm(), then compares the transformed strings
using wcscmp() (this is the only platform-dependent part of the code; I
was too lazy to write __ucslen(), so I left it as it is). This collation
algorithm supports three levels; the last, IIRC, is usually the
character itself if not defined, so I decided to omit it (especially
since I'm not sure how variable weights should be handled). Any
thoughts?
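
For readers who want to see the "transform, then wcscmp()" idea in
isolation, here is a small two-level sketch. It is only an illustration
under stated assumptions, not the actual __ucsxfrm()/__ucscoll():
toy_weights(), toy_ucsxfrm(), toy_ucscoll(), the weights themselves and
the 0x0001 level separator are all made up here (a zero separator would
end the key early for wcscmp()).

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

#define TOY_LEVEL_SEP	((wchar_t)0x0001)	/* below every toy weight */
#define TOY_KEY_MAX	64			/* fixed key buffer size */

/* Invented weights: U+00E9 shares the primary weight of 'e'. */
void
toy_weights(wchar_t c, wchar_t *primary, wchar_t *secondary)
{
	if (c == 0x00E9) {
		*primary = 0x0100 + L'e';
		*secondary = 0x0024;
	} else {
		*primary = 0x0100 + c;
		*secondary = 0x0020;
	}
}

/*
 * Build the key "primaries, separator, secondaries" in dst.  Like
 * wcsxfrm(), return the key length and write nothing if it won't fit.
 */
size_t
toy_ucsxfrm(wchar_t *dst, const wchar_t *src, size_t n)
{
	size_t i, len, need;
	wchar_t p, s;

	assert(src != NULL);
	len = wcslen(src);
	need = 2 * len + 1;
	if (need >= n)
		return (need);
	for (i = 0; i < len; i++) {
		toy_weights(src[i], &p, &s);
		dst[i] = p;
		dst[len + 1 + i] = s;
	}
	dst[len] = TOY_LEVEL_SEP;
	dst[need] = L'\0';
	return (need);
}

/* Compare two strings by comparing their keys with wcscmp(). */
int
toy_ucscoll(const wchar_t *a, const wchar_t *b)
{
	wchar_t ka[TOY_KEY_MAX], kb[TOY_KEY_MAX];
	size_t na, nb;

	na = toy_ucsxfrm(ka, a, TOY_KEY_MAX);
	nb = toy_ucsxfrm(kb, b, TOY_KEY_MAX);
	assert(na < TOY_KEY_MAX && nb < TOY_KEY_MAX);
	return (wcscmp(ka, kb));
}
```

In this toy, L"r\u00e9sum\u00e9" sorts before L"rez" because the
primary weights are compared first, whereas a plain wcscmp() on the raw
strings puts it after. L"resume" and L"r\u00e9sum\u00e9" differ only at
the secondary level.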

-- 
With best regards,
Dmitry Selyutin