Report #3: Unicode support

Dmitry Selyutin ghostman.sd at gmail.com
Sun Jun 22 22:34:34 UTC 2014


Hello everyone!

I'm glad to report that I've finished a first sketch of the Unicode
Normalization Algorithm, and it seems to work. The files were recently
updated to the latest version of Unicode (7.0.0).
Of course, this code needs some tuning. For example, in the worst case
one has to iterate over the whole table to check whether a character can
be normalized. I'm going to fix this with a different structure, in which
each byte covers 8 characters and each bit of that byte is a flag telling
whether the corresponding character can be normalized. Thus we need two
arrays of 139264 bytes each (for composition and decomposition
respectively; 0x110000 code points / 8 = 139264), where the state of each
character can be determined by simple division: the quotient of the code
point by 8 selects the byte, and the remainder selects the bit. That's
just a proposal; everyone is welcome to propose a better way to handle
such things.
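A minimal sketch of the bitmap lookup described above, assuming one bit
per code point in the range U+0000..U+10FFFF. The names (BITMAP_BYTES,
decomposable, bitmap_set, bitmap_test) are hypothetical and chosen only
for illustration; they are not taken from the actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the table covers every code point U+0000..U+10FFFF. */
#define UNICODE_CODE_POINTS 0x110000u
/* One bit per code point: 1114112 / 8 = 139264 bytes. */
#define BITMAP_BYTES (UNICODE_CODE_POINTS / 8)

/* Hypothetical table; a second one of the same size would serve composition. */
static uint8_t decomposable[BITMAP_BYTES];

/* Mark a code point as normalizable (used when building the table). */
static void bitmap_set(uint8_t *map, uint32_t cp)
{
    assert(cp < UNICODE_CODE_POINTS);
    /* Quotient selects the byte, remainder selects the bit. */
    map[cp / 8] |= (uint8_t)(1u << (cp % 8));
}

/* Return 1 if the code point is flagged, 0 otherwise (constant time). */
static int bitmap_test(const uint8_t *map, uint32_t cp)
{
    if (cp >= UNICODE_CODE_POINTS)
        return 0;   /* outside the Unicode range: nothing to normalize */
    return (map[cp / 8] >> (cp % 8)) & 1;
}
```

With such a table, checking a character costs one division, one shift
and one mask instead of a scan over the whole table, at the price of
about 136 KiB per array.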
The other important part is to prepare a test suite, but for that I have
to consult with my mentors, Pedro and David.

-- 
With best regards,
Dmitry Selyutin

