Report #9: Unicode support

Wed Aug 27 10:51:24 UTC 2014

I've just seen EuroBSDCon's calendar page and it seems that it is
impossible to join it (i.e. I missed the application deadline).[0]
Well, maybe next year? :-)

2014-08-27 14:48 GMT+04:00 Dmitry Selyutin <ghostman.sd at gmail.com>:
> Hi, Pedro, Baptiste,
>
> first of all, thanks for your congratulations and kind words! The
> project was really harder than anything I've ever faced in my life,
> but at the same time it was the most interesting one. :-) And it
> still is! ;-)
>
>> That is not really uncommon :)
> Well, so I can leave it as it is. :-)
>
>> The project does have access to sparc64 machines so if you have some
>> self-contained test we can run it for you or we can test it as a routine libc
>> test after committing.
> Hopefully I can finish it today or in the next two days.
>
>> You never answered my question concerning the fallback options.
> Really? I thought that I answered. :-D Well, I'll try to explain
> again. DUCET seems to be a somewhat obsolete collation table, which
> can be used more or less successfully with real languages. However,
> in the real world it is completely unusable, so ICU and others use
> the CLDR collation table, which supports more levels. I started with
> DUCET since there was much more information about it, but then I
> found that it doesn't fit well, so I switched to CLDR. We still have
> the DUCET table somewhere in our revisions, though; it may still be
> useful as a fallback option, so I can restore it if you want.
>
>> Changing it to use the NetBSD's cdb support[2] shouldn't be difficult.
> Well, I think I'll do it right after exams. AFAIK bdb is deprecated
> on Linux (though it can still be used as bdb46 or something similar).
> I don't know why they did that; it would be great if we could use a
> tool that works on different platforms without modifications and
> tons of conditional defines and undefs.
>
>> It has a simple API that is way easier to use, the db format is endian
>> safe, and the final file is smaller than the equivalent in bdb format.
> It sounds great!
>
>> I do want to encourage you to go to EuroBSDCon 2014 in Sofia. The
>> FreeBSD Foundation will be allocating funds for students that want to go.
>> I won’t be there (I am a bit far away) but David and other developers will
>> likely be.
> Well, that depends on whether I pass my exams for the postgraduate
> course or not. I'd really like to listen to more experienced
> developers and maybe even talk to other people about the work I did,
> to better understand the community's opinions.
>
> 2014-08-27 3:17 GMT+04:00 Pedro Giffuni <pfg at freebsd.org>:
>> Hi Baptiste;
>>
>>
>> On 08/26/14 17:16, Baptiste Daroussin wrote:
>>>
>>> On Wed, Aug 27, 2014 at 01:08:58AM +0400, Dmitry Selyutin wrote:
>>>>
>>>> Hello everyone!
>>>>
>>>> Here are the last news about the Unicode support project[0].
>>>> You can always check my repository[1].
>>>>
>>>> Over the last few days I had hardware problems (my HDD peacefully
>>>> died), so development didn't progress as much as before. However,
>>>> I've sorted these problems out, so I tried to fix bugs and
>>>> reorganize the code as much as possible. Now everything should
>>>> compile.
>>>>
>>>> I decided to use __attribute__((constructor)) and
>>>> __attribute__((destructor)), since I don't know whether there is a
>>>> better way to open a file once at startup and close it when all
>>>> routines have finished. I've found one or two occurrences of this
>>>> construction in FreeBSD code; AFAICT it is rather common in clang
>>>> and gcc, so I decided to use it. Hopefully it will also allow us to
>>>> use the root collation database in embedded systems (if any such
>>>> system really needs a collation algorithm).
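
A minimal sketch of the constructor/destructor approach described above, assuming GCC or Clang. The names colldb_startup, colldb_shutdown and colldb_root_handle, and the COLLDB_ROOT_PATH placeholder, are hypothetical and are not taken from the project repository:

```c
#include <stdio.h>

/* Hypothetical placeholder: the real database path is not given in this
 * thread. /dev/null is used only so that the sketch runs on any Unix. */
#define COLLDB_ROOT_PATH "/dev/null"

static FILE *colldb_root;      /* shared handle, NULL if the open failed */
static int colldb_init_calls;  /* counts constructor runs, for illustration */

/* Runs once before main() is entered (GCC/Clang extension). */
__attribute__((constructor))
static void colldb_startup(void)
{
    colldb_init_calls++;
    colldb_root = fopen(COLLDB_ROOT_PATH, "rb");
}

/* Runs after main() returns or exit() is called; it is not run on
 * _exit() or abnormal termination. */
__attribute__((destructor))
static void colldb_shutdown(void)
{
    if (colldb_root != NULL) {
        fclose(colldb_root);
        colldb_root = NULL;
    }
}

/* Accessor that library routines would call instead of reopening the file. */
FILE *colldb_root_handle(void)
{
    return colldb_root;
}

/* Reports how many times the constructor has run. */
int colldb_startup_count(void)
{
    return colldb_init_calls;
}
```

Callers would fetch the shared handle through colldb_root_handle() and must cope with NULL, since a constructor has no way to report a failed open to the program.
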
>>>>
>>>> As you may know, we need a tool that can convert collation text
>>>> files obtained from unicode.org to the new collation database
>>>> (colldb) format. There is a version of this tool written in Python
>>>> (share/examples/colldb/colldb.py). IIRC we can't use Python in the
>>>> base system, though, so it seems that we need to write such a tool
>>>> in C. I was thinking of a lex/yacc combo; I've never tried it, but
>>>> I think it shouldn't be too hard to write a tool using it. I'd like
>>>> to know your opinions about this task.
>>>> I've already written a man page (bin/colldb/colldb.1). The only
>>>> thing which seems dubious is that I decided to use the same name as
>>>> for the library itself (well, it seems I lack imagination). So we
>>>> have both colldb.1 and colldb.3 man pages.
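
For the conversion tool discussed above, here is a rough sketch of reading one entry from a DUCET-style text file in plain C. It uses sscanf rather than the lex/yacc combination the post considers, handles only a single code point with its first collation element, and assumes three weights per element. The struct, the function name and the sample weights in the usage note are illustrative assumptions, not the project's code:

```c
#include <stdio.h>
#include <stdint.h>

/* One parsed entry: a single code point and its first collation element. */
struct coll_entry {
    uint32_t codepoint;
    int      variable;      /* 1 if the element is marked '*', 0 if '.' */
    unsigned weights[3];    /* primary, secondary, tertiary */
};

/*
 * Parse a line such as "0041 ; [.1C47.0020.0008] # comment".
 * Returns 0 on success, -1 for comments, blank lines, directives and
 * anything else this sketch does not handle (e.g. contractions).
 */
int coll_parse_line(const char *line, struct coll_entry *out)
{
    unsigned cp, w1, w2, w3;
    char marker;

    if (sscanf(line, "%x ; [%c%x.%x.%x", &cp, &marker, &w1, &w2, &w3) != 5)
        return -1;
    if (marker != '.' && marker != '*')
        return -1;

    out->codepoint  = cp;
    out->variable   = (marker == '*');
    out->weights[0] = w1;
    out->weights[1] = w2;
    out->weights[2] = w3;
    return 0;
}
```

Calling coll_parse_line("0041 ; [.1C47.0020.0008]", &e) fills e.codepoint with 0x41 and e.weights[0] with 0x1C47; the weights here are made up for the example. A full tool would also need to handle contractions, expansions and file directives, which is where a generated lexer and parser may pay off.
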
>>>>
>>>> The other thing I'd really like to do is to enforce network byte
>>>> order in the collation database format (I'm sure I've seen a way to
>>>> do it in Berkeley databases). It's a pity that I have no platform
>>>> with big-endian (or even PDP!) byte order. Any help here is highly
>>>> appreciated (as well as your thoughts about lex/yacc, i.e. whether
>>>> it fits my task well).
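
One way to fix the on-disk byte order regardless of the host is to serialize integers byte by byte. The sketch below shows this for 32-bit values; the function names are hypothetical, and nothing here is taken from the colldb sources. Because it relies only on shifts, the layout it produces is the same on little-endian, big-endian and PDP-endian hosts:

```c
#include <stdint.h>

/* Store a 32-bit value in network (big-endian) byte order. */
void colldb_put_u32(unsigned char *buf, uint32_t value)
{
    buf[0] = (unsigned char)(value >> 24);
    buf[1] = (unsigned char)(value >> 16);
    buf[2] = (unsigned char)(value >> 8);
    buf[3] = (unsigned char)value;
}

/* Read back a 32-bit value stored in network byte order. */
uint32_t colldb_get_u32(const unsigned char *buf)
{
    return ((uint32_t)buf[0] << 24) |
           ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |
            (uint32_t)buf[3];
}
```

The standard htonl() and ntohl() functions from <arpa/inet.h> are an alternative when values are written as whole 32-bit words. Either way, a run on real big-endian hardware is still a useful check.
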
>>>>
>>>> Since the Google Summer of Code period is over, I'd like to thank
>>>> both my mentors, Pedro and David, who gave me a helping hand during
>>>> this project, and especially Konrad Jankowski, who found time to
>>>> answer my questions and help me too. Though GSoC is closed, I'd
>>>> like to stay with the FreeBSD project. First of all, I want to
>>>> finish and polish this project: I don't think it's really finished,
>>>> especially its testing part, though it seems that the new collation
>>>> algorithm can already be used. Then I'd like to work on other parts
>>>> of the project, especially the internationalization parts. I'd also
>>>> like to improve my own library, qc, to provide a rich API for *BSD
>>>> and POSIX systems, since I acutely feel the lack of such an API. If
>>>> it is possible to stay with the project, I'd be very happy to do
>>>> it. :-)
>>>>
>>>> P.S. Does anyone know how to get a diff for my branch only (i.e.
>>>> for my part of the repository)? svn diff -r $FIRST:$LAST seems to
>>>> give everything done across all of FreeBSD's GSoC projects, so I
>>>> need some other command. Thanks for your help!
>>>>
>>>> [0] https://wiki.freebsd.org/SummerOfCode2014/Unicode
>>>> [1] https://socsvn.freebsd.org/socsvn/soc2014/ghostmansd
>>>>
>>> First, thank you very much for your work on this subject; it is
>>> highly needed.
>>>
>>> Concerning the db format, have you thought about using the new NetBSD
>>> constant database format?
>>>
>>> It has a simple API that is way easier to use, the db format is endian
>>> safe, and the final file is smaller than the equivalent in bdb format.
>>>
>>> Lots of areas of FreeBSD could benefit from using this cdb format as
>>> well, imho.
>>
>>
>> While here, let me congratulate Dmitry. The Unicode Collation Algorithm is
>> not something easy/fun to work with.
>>
>> Indeed, both David and Konrad suggested it (or tinycdb). The reason for
>> going with bdb was that we had time constraints and bdb is already in libc.
>>
>> FWIW, Nexenta kindly re-licensed localedef [1] and their collation support
>> in Illumos which basically implements their own very efficient format. We
>> ended up re-using the tools that libc already has to better focus on the
>> collation part.
>>
>> Changing it to use the NetBSD's cdb support[2] shouldn't be difficult.
>>
>> As Dmitry noted, there are still details to work out, and we have to run
>> tests and get the code reviewed, but all in all I am very satisfied with
>> the progress in this GSoC.
>>
>> Best regards,
>>
>> Pedro.
>>
>> [1] https://github.com/Nexenta/illumos-nexenta/tree/republish-localedef
>> [2] http://cvsweb.netbsd.org/bsdweb.cgi/src/lib/libc/cdb/
>>
>
>
>
> --
> With best regards,
> Dmitry Selyutin

-- 
With best regards,
Dmitry Selyutin