Threaded Mode | Linear Mode

Claudio L. · 04-21-2015, 02:06 PM

(04-17-2015 01:04 PM)Claudio L. Wrote: I think I'll trash the project and go back to my original idea:

Oh, well... I could't resist a good challenge. How to fit a 12 MB library on an embedded system?

(04-17-2015 01:04 PM)Claudio L. Wrote: My code still doesn't include hangul compositions, so it only passes 7000 of the >18000 tests.

So I added hangul, treated a few corner cases here and there, and now I have code passing all 18000+ tests. Ready to begin optimization phase.

(04-17-2015 01:04 PM)Claudio L. Wrote: I still haven't tried to pack the tables in the most compact possible way, merely got a working routine, but doesn't look like it will be fast enough for my purposes.

After analyzing the data from every possible angle, I came up with a way to optimize the speed and compact the data to reasonable levels.
For the interested, We have >27000 symbols in a number space ranging from 0 to 0x10ffff (that's >1 million).
For each character, we need to store several properties:
a) Combining class number (a number from 0-255)
b) NFC Quick check (a yes/no/maybe value)
c) Composition exclusion (yes/no)
d) Canonical decomposition.

This is a lot of information, but pass after pass of analysis revealed that:
a) There's only 55 different classes (as of Unicode 7), and only their order is important, so it can be stored in 6 bits, as a number from 0-54 instead of 0-255.
b) For a quick check, a simple yes/no is enough, no real use for the "maybe" case, so 1 bit will do.
c) This is already 1 bit.

So these 3 properties can be stored packed in a single byte per each symbol. A simple table for fastest possible access would need 1 MByte. Still huge, but a lot less than the 12 MB in the standard library.
Of the 1 million characters, only 27000 exist, and most of those have the same property value, even after packing it.
Looking at ranges of characters that have repeating values revealed that there's a lot of repetition. If we split the million symbols into ranges where the same property number repeats more than 100 times, for example, there's only 116 ranges. That's not much for a table and can be scanned relatively quickly. So the data is stored this way:
Range data:
From A to B, repeated value nn.
From B to C, different values, get from table offset XX.
From C to D, repeated value mm.
From D to E, different values, get from table offset YY.
...

Then there's a table of bytes where every range that has non-repeating values stores its data.
As it turns out, the range data can be compacted in 116 ranges, taking 4 bytes each, and the different bytes take only 4285 bytes.
The total space needed to store the first 3 properties ended up being 4749 bytes (not bad!). By changing the number of repeats, one can have more ranges, with a smaller table of bytes, or less ranges with a larger table of bytes. A value of 100 was chosen to be a good compromise for space, without affecting speed too much. Since ranges have to be scanned in sequence, the number of ranges is directly proportional to the speed. The fastest would be a single range with all data in the table (taking 1 MB), and the slowest would be ranges with a repeat count of 1, and no extra data stored.
Speed is fine for characters in the low ranges (latin alphabets), since to find the properties only a few ranges need to be scanned. For other languages with characters in the higher ranges, up to 116 ranges need to be scanned per character, so it slows down quite a bit. In a system with more ROM space, perhaps a different adjustment would be justified, using more space to double the speed or so.

Now it's time to tackle the fourth property: decompositions.
This one needs to store one or up to two characters for each character. The good news is that there's only 1035 characters that decompose into a single character (just a 1-to-1 replacement, so these characters are basically "repeats" in the Unicode space), and 1020 that decompose into two characters.
Assuming two 32-bit words for the first 1035, and 3 for the other ones, this information can be stored in about 20 kbytes, but it would be painfully slow to access, as the tables would have to be scanned element by element (binary search at best, but then these tables need to be scanned backwards for composition, where binary search won't help).

Time to analyze the patterns to see how to encode this for faster access without increasing the 20k too much.
I'll report back with my findings.

(04-17-2015 01:04 PM)Claudio L. Wrote: I feel I should be spending time and effort on calculations, not text handling.

Yeah, I still feel that, but can't help it.