HP Forums

Full Version: utf-8 for RPL source code
I'm planning (actually half-way executing) to move all string handling to be UTF-8 tolerant in newRPL.
I'm saying tolerant and not conformant because for complexity and space I see no reason to support the full specification. The idea is that all string objects will be utf-8 encoded, and the user can insert arbitrary Unicode text, but the calculator will only have the standard font that's always been in the 48/50 series.
Surprisingly, only 16 characters out of the 256 need to be remapped, all others have a 1-1 equivalence between Unicode and the HP48 character set.
At this stage I'm debating whether it is worth trying to deal with combining marks, or whether we should just ignore all that and only attempt to display the 256 characters that were originally used in the calculator, with any other codes either ignored or shown as a square. Any thoughts?

Also I think it could be a good idea for the compiler to do some aliasing, for example treating << and « as equivalent, and likewise -> and →. Would this be a good idea? Or perhaps we should enforce the use of the proper characters and that's it.
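
As a rough illustration of the aliasing idea above, here is a minimal Python sketch of a pre-compile pass. The function name `apply_aliases`, the alias table, and the inclusion of >> as an alias for » are assumptions made for the example; nothing here is taken from newRPL itself. Text inside double-quoted strings is left untouched so that string contents are not rewritten.

```python
# Hypothetical alias table: ASCII digraph -> proper RPL character.
# ">>" is included as the natural counterpart of "<<" (an assumption).
ALIASES = {"<<": "\u00ab", ">>": "\u00bb", "->": "\u2192"}

def apply_aliases(source: str) -> str:
    """Replace ASCII digraphs with RPL characters outside "quoted strings"."""
    out = []
    # After splitting on '"', odd-numbered pieces are inside string literals.
    for i, piece in enumerate(source.split('"')):
        if i % 2 == 0:
            for ascii_form, rpl_char in ALIASES.items():
                piece = piece.replace(ascii_form, rpl_char)
        out.append(piece)
    return '"'.join(out)
```

A real compiler would more likely do this at the token level, but the sketch shows that both spellings can be accepted at very little cost.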
(04-07-2015 01:06 AM)Claudio L. Wrote: [ -> ]I'm planning (actually half-way executing) to move all string handling to be UTF-8 tolerant in newRPL.
I'm saying tolerant and not conformant because for complexity and space I see no reason to support the full specification.

Support the spec, the whole spec, and everything in the spec; just bite the bullet and do it, you'll be glad you did, even though it's gonna be a royal pain in the butt.

I remember when the Python community went through this... and it WAS a royal pain in the butt... but they persevered and most everyone is very happy with the result. We have a few dissenters, but by and large it was a huge success!

It's just time. Everything collaborative in computer science 'must needs' have Unicode UTF-8 support (all symbols, all languages).

The world is becoming a very small place. We need to keep everyone (all heart languages) in mind, and in code. It's a major responsibility. Just sayin'.

Cheers,
marcus
(04-07-2015 03:14 AM)MarkHaysHarris777 Wrote: [ -> ]Support the spec, the whole spec, and everything in the spec; just bite the bullet and do it, you'll be glad you did, even though it's gonna be a royal pain in the butt.

Well, we have 2 MB of flash in the 50g, of which 1.3 MB is used already. The only "complete" implementation is the reference ICU library. The x86 version binary for Linux is 12.6 MB GZipped, who knows how big it is when expanded.
You see my point now?
Even if I could use the full library, it makes indiscriminate use of malloc/free (since Unicode strings like to expand...), which means it would have to be heavily modified to run on a system where you alloc from TempOb and all routines need to be GC-aware. It would be a nightmare. I could spend the next two years trying to figure out how to implement Unicode, there has to be a reasonable compromise somewhere.
(04-07-2015 01:06 AM)Claudio L. Wrote: [ -> ]At this stage I'm debating whether it is worth trying to deal with combining marks, or whether we should just ignore all that and only attempt to display the 256 characters that were originally used in the calculator, with any other codes either ignored or shown as a square.

To follow up on this, I created a routine that does NFC normalization in streaming mode, so that no temporary memory needs to be allocated at all.
It will take a significant amount of space in tables, and is quite complex (as in: not going to be fast enough at 6 MHz).
Simpler versions simply couldn't handle the complexity (it's painted as easy in the Unicode specification, but there are so many corner cases, all handled with extra tables...).
My code still doesn't include Hangul compositions, so it only passes 7000 of the >18000 tests. Basically it works properly for all languages except the East Asian ones (Chinese, Korean, etc.), which require additional huge tables or extra algorithms (Hangul).
I still haven't tried to pack the tables in the most compact possible way; I merely got a working routine, but it doesn't look like it will be fast enough for my purposes.
I think I'll trash the project and go back to my original idea: All strings would be assumed as already NFC normalized. No normalization attempt will be done by newRPL.
The character subset will use their Unicode code points and all strings will be UTF-8 encoded. Since the existing character mappings have only single characters, they are guaranteed to be already NFC normalized.
Any strange characters will be displayed as a box or question mark, and skipped, not altered by any routines.
So it will be Unicode aware (tolerant) and UTF-8 encoded, but by no means will I try to get into this mess on embedded systems. When porting to larger hardware, it should be easy to add a normalization step using a standards-compliant library (which takes like 12 MB of space).
I feel I should be spending time and effort on calculations, not text handling.
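
The "tolerant" display rule described above (show what the font has, draw a box for everything else, never alter the stored string) can be sketched in a few lines of Python. The function name `displayable`, the `supported` set and the choice of U+25A1 as the box are assumptions for illustration only; the actual newRPL character set is not reproduced here.

```python
BOX = "\u25a1"  # WHITE SQUARE, standing in for the calculator's "unknown glyph" box

def displayable(data: bytes, supported: set) -> str:
    """Return the text as it would be shown; the stored bytes are not modified."""
    # Malformed UTF-8 sequences decode to U+FFFD, which then also becomes a box.
    text = data.decode("utf-8", errors="replace")
    return "".join(ch if ord(ch) in supported else BOX for ch in text)

# Example: printable ASCII plus the RPL arrow (a made-up, tiny "font").
supported = set(range(32, 127)) | {0x2192}
shown = displayable("A\u2192B \u00e9".encode("utf-8"), supported)
```

Note that no normalization happens anywhere: a base letter followed by a combining mark would simply show as the letter followed by a box.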
(04-17-2015 01:04 PM)Claudio L. Wrote: [ -> ]I think I'll trash the project and go back to my original idea:

Oh, well... I couldn't resist a good challenge. How to fit a 12 MB library on an embedded system?


(04-17-2015 01:04 PM)Claudio L. Wrote: [ -> ]My code still doesn't include Hangul compositions, so it only passes 7000 of the >18000 tests.

So I added Hangul, treated a few corner cases here and there, and now I have code passing all 18000+ tests. Ready to begin the optimization phase.
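
For readers wondering why Hangul is an "extra algorithm" rather than an extra table: the Unicode Standard (section 3.12) defines Hangul syllable composition arithmetically. The Python sketch below follows that published arithmetic; it is not the newRPL code, and the function name `compose_hangul` is made up for the example.

```python
# Constants from the Unicode Standard's Hangul syllable algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total

def compose_hangul(a: int, b: int):
    """Return the code point composed from (a, b), or None if they don't compose."""
    # Leading consonant + vowel -> LV syllable
    if L_BASE <= a < L_BASE + L_COUNT and V_BASE <= b < V_BASE + V_COUNT:
        return S_BASE + ((a - L_BASE) * V_COUNT + (b - V_BASE)) * T_COUNT
    # LV syllable + trailing consonant -> LVT syllable
    if (S_BASE <= a < S_BASE + S_COUNT
            and (a - S_BASE) % T_COUNT == 0
            and T_BASE < b < T_BASE + T_COUNT):
        return a + (b - T_BASE)
    return None
```

So all 11172 syllables are covered by a handful of constants instead of table entries.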

(04-17-2015 01:04 PM)Claudio L. Wrote: [ -> ]I still haven't tried to pack the tables in the most compact possible way; I merely got a working routine, but it doesn't look like it will be fast enough for my purposes.

After analyzing the data from every possible angle, I came up with a way to optimize the speed and compact the data to reasonable levels.
For those interested: we have >27000 symbols in a number space ranging from 0 to 0x10ffff (that's >1 million).
For each character, we need to store several properties:
a) Combining class number (a number from 0-255)
b) NFC Quick check (a yes/no/maybe value)
c) Composition exclusion (yes/no)
d) Canonical decomposition.

This is a lot of information, but pass after pass of analysis revealed that:
a) There are only 55 different classes (as of Unicode 7), and only their order is important, so the class can be stored in 6 bits, as a number from 0-54 instead of 0-255.
b) For a quick check, a simple yes/no is enough, no real use for the "maybe" case, so 1 bit will do.
c) This is already 1 bit.

So these 3 properties can be packed into a single byte per symbol. A simple table for the fastest possible access would need 1 MByte. Still huge, but a lot less than the 12 MB in the standard library.
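
To make the packing concrete, here is a small Python sketch. The bit layout (class rank in the low 6 bits, quick-check flag in bit 6, exclusion flag in bit 7) and the names `rank_classes`, `pack_props` and `unpack_props` are assumptions for illustration; the post does not say how the bits are actually arranged.

```python
def rank_classes(ccc_values):
    """Map each distinct combining class to its rank, since only the order matters."""
    return {ccc: rank for rank, ccc in enumerate(sorted(set(ccc_values)))}

def pack_props(class_rank: int, nfc_qc_yes: bool, excluded: bool) -> int:
    """Pack the three properties into one byte (assumed layout, see above)."""
    assert 0 <= class_rank < 64          # 55 classes fit in 6 bits
    return class_rank | (int(nfc_qc_yes) << 6) | (int(excluded) << 7)

def unpack_props(byte: int):
    """Inverse of pack_props: (class_rank, nfc_qc_yes, excluded)."""
    return byte & 0x3F, bool(byte & 0x40), bool(byte & 0x80)
```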
Of the 1 million code points, only 27000 are assigned, and most of those have the same property value, even after packing.
Looking at ranges of characters with repeating values revealed that there's a lot of repetition. If we split the million symbols into ranges where the same property value repeats more than 100 times, for example, there are only 116 ranges. That's not much for a table and can be scanned relatively quickly. So the data is stored this way:
Range data:
From A to B, repeated value nn.
From B to C, different values, get from table offset XX.
From C to D, repeated value mm.
From D to E, different values, get from table offset YY.
...

Then there's a table of bytes where every range that has non-repeating values stores its data.
As it turns out, the range data can be compacted in 116 ranges, taking 4 bytes each, and the different bytes take only 4285 bytes.
The total space needed to store the first 3 properties ended up being 4749 bytes (not bad!). By changing the repeat threshold, one can have more ranges with a smaller table of bytes, or fewer ranges with a larger table of bytes. A value of 100 was chosen as a good compromise for space, without affecting speed too much. Since ranges have to be scanned in sequence, lookup time is directly proportional to the number of ranges. The fastest would be a single range with all data in the table (taking 1 MB), and the slowest would be ranges with a repeat count of 1, and no extra data stored.
Speed is fine for characters in the low ranges (latin alphabets), since to find the properties only a few ranges need to be scanned. For other languages with characters in the higher ranges, up to 116 ranges need to be scanned per character, so it slows down quite a bit. In a system with more ROM space, perhaps a different adjustment would be justified, using more space to double the speed or so.
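
The range scheme described above can be sketched as follows in Python. The tuple layout of a range and the names `build_ranges` and `lookup` are invented for the example (the post only says each range takes 4 bytes, not how they are encoded). With the numbers from the post, 116 ranges of 4 bytes plus 4285 table bytes gives 116 * 4 + 4285 = 4749 bytes.

```python
def build_ranges(props, min_run=100):
    """Split props (one packed byte per code point) into ranges.

    Each range is (start, end, repeated_value, table_offset); exactly one of
    repeated_value / table_offset is not None. Runs of at least min_run equal
    bytes become "repeated" ranges, everything else is copied into the table.
    """
    ranges, table = [], bytearray()
    n, i, literal_start = len(props), 0, 0
    while i < n:
        j = i
        while j < n and props[j] == props[i]:
            j += 1
        if j - i >= min_run:
            if literal_start < i:   # flush the non-repeating stretch before the run
                ranges.append((literal_start, i, None, len(table)))
                table += bytes(props[literal_start:i])
            ranges.append((i, j, props[i], None))
            literal_start = j
        i = j
    if literal_start < n:
        ranges.append((literal_start, n, None, len(table)))
        table += bytes(props[literal_start:n])
    return ranges, bytes(table)

def lookup(cp, ranges, table):
    """Sequential scan, so the cost grows with the position of the matching range."""
    for start, end, value, offset in ranges:
        if start <= cp < end:
            return value if value is not None else table[offset + (cp - start)]
    raise ValueError("code point outside the table")
```

Raising `min_run` gives fewer ranges and a bigger byte table; lowering it does the opposite, which is the space/speed knob mentioned in the post.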

Now it's time to tackle the fourth property: decompositions.
This one needs to store one or two characters for each character. The good news is that there are only 1035 characters that decompose into a single character (just a 1-to-1 replacement, so these characters are basically "repeats" in the Unicode space), and 1020 that decompose into two characters.
Assuming two 32-bit words for the first 1035, and 3 for the other ones, this information can be stored in about 20 kbytes, but it would be painfully slow to access, as the tables would have to be scanned element by element (binary search at best, but then these tables need to be scanned backwards for composition, where binary search won't help).
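
The 20 kbyte estimate and the search problem can be illustrated with a short Python sketch. The four sample entries are real canonical decompositions, but the table layout and the names `decompose` and `compose` are assumptions for the example, and composition exclusions are ignored.

```python
from bisect import bisect_left

# Naive storage estimate from the post: 2 words per singleton, 3 per pair.
WORD = 4  # bytes in a 32-bit word
naive_bytes = 1035 * 2 * WORD + 1020 * 3 * WORD   # 8280 + 12240 bytes

# Tiny sample table, sorted by the character being decomposed.
DECOMP = [
    (0x00C5, (0x0041, 0x030A)),  # A with ring above -> A + combining ring
    (0x00E9, (0x0065, 0x0301)),  # e with acute      -> e + combining acute
    (0x2126, (0x03A9,)),         # ohm sign          -> Greek capital omega
    (0x212B, (0x00C5,)),         # angstrom sign     -> A with ring above
]
KEYS = [cp for cp, _ in DECOMP]

def decompose(cp):
    """Binary search works here because the table is sorted by cp."""
    i = bisect_left(KEYS, cp)
    return DECOMP[i][1] if i < len(KEYS) and KEYS[i] == cp else (cp,)

def compose(first, second):
    """The reverse lookup has no useful ordering, so it degrades to a full scan."""
    for cp, parts in DECOMP:
        if parts == (first, second):
            return cp
    return None
```

This is why a table sorted for decomposition does not help composition: the pairs are not in any order that a binary search could exploit.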

Time to analyze the patterns to see how to encode this for faster access without increasing the 20k too much.
I'll report back with my findings.

(04-17-2015 01:04 PM)Claudio L. Wrote: [ -> ]I feel I should be spending time and effort on calculations, not text handling.

Yeah, I still feel that, but can't help it.
I recently learned of the existence of this project, and I'm very impressed by the huge work that it involves, especially for a single person. I would be very pleased if one day you succeed.

For me, the HP50g is a nearly perfect calculator. Its flexibility and modularity can't be beaten. The sole drawback is that it could be faster, so the newRPL would be a perfect upgrade.
I'm not interested in the HP Prime, for several reasons.

Unfortunately, I don't have the knowledge to help you. I have seen the HPGCC3 solution and thought it would be fantastic, but just reading how to install it told me it was not for me. I don't plan to make severe modifications on my computer, learn Linux, use esoteric and complex tools, just to learn C language and make short programs for my calculator.
I've tried SysRPL, but it requires a lot of effort due to the huge number of instructions, and making any modification requires again a lot of care, contrary to User RPL.
So I see newRPL as an excellent solution, as long as it is not necessary to be a computer engineer to use it.

(04-17-2015 01:04 PM)Claudio L. Wrote: [ -> ]I feel I should be spending time and effort on calculations, not text handling.

In my HP50g, I've installed a pack of fonts named "MathFont", which replaces many uncommon characters in the original font by math symbols and the entire Greek alphabet, both upper case and lower case. They are more important for me than the accented letters, despite the fact I am French. I use my calculator to do math, not literature. :)
I've also slightly modified these fonts, both the standard font and the minifont, to make them more appealing to my taste. Will this feature be present in newRPL? I would prefer that to having 27000 fixed characters…

Maybe my post is totally off-topic, but I don't understand what you wrote above. :)
I just want to tell you good luck!
(04-22-2015 01:06 AM)Helix Wrote: [ -> ]I've also slightly modified these fonts, both the standard font and the minifont, to make them more appealing to my taste. Will this feature be present in newRPL? I would prefer that to having 27000 fixed characters…

Maybe my post is totally off-topic, but I don't understand what you wrote above. :)
I just want to tell you good luck!

The first thing needed after a proper UTF-8 string manipulation is a good font to display the strings!
newRPL uses variable width bitmapped fonts. If you like designing fonts and want to help, that's an area where I'd appreciate help.
All you need to do is create a bitmap in any format (BMP, GIF, etc., anything except JPG or other formats that use "lossy" compression).
Right now we have 5 pixel high fonts, 6 pixels high and 7 pixels high. The original HP large font is 8 pixels high and the minifont is 6 but narrow.
Designing a font is easy, all you have to do is use any bitmap editor (GIMP, Photoshop, MS Paint), and create a black and white bitmap with 1 pixel more in height than what you intend to use for the font, and as wide as you want.
The additional row will help separate the individual characters later.
For example, the letters A and B on a 5-pixels font could be (here goes my ASCII art):
Code:

_XX__XXX__...
X__X_XXX__...
XXXX_X__X_...
X__X_XXX__...  (4 lines for the characters)
__________... (this line separates the rows)
XXXXX_____... (this line has 5 black pixels for the A, then 5 whites for the B, then it will have 4 or 5 blacks for the C, etc.)
So a 5-pixel font should have a 6-pixel bitmap with the lower row alternating black and white wherever a character changes. Characters can be any width, but I think up to 8 pixels max.
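
The marker-row convention just described is easy to process mechanically. The Python sketch below is not the actual converter mentioned in this thread; `split_glyphs` is a hypothetical name, and the bitmap is represented as text rows with X for black and _ for white, as in the ASCII art above.

```python
def split_glyphs(rows):
    """Cut a font bitmap into glyphs using the alternating bottom marker row."""
    *pixels, marker = rows          # the last row only marks glyph boundaries
    glyphs, start = [], 0
    for x in range(1, len(marker) + 1):
        # A glyph ends where the marker row changes colour, or at the right edge.
        if x == len(marker) or marker[x] != marker[x - 1]:
            glyphs.append([row[start:x] for row in pixels])
            start = x
    return glyphs

bitmap = [
    "_XX__XXX__",
    "X__X_XXX__",
    "XXXX_X__X_",
    "X__X_XXX__",
    "__________",
    "XXXXX_____",   # marker row: 5 black pixels for A, 5 white for B
]
letter_a, letter_b = split_glyphs(bitmap)
```

Because the glyph width comes purely from the marker row, each character can have its own width without any extra metadata.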
The bitmap should have the symbols for all characters in the HP48 codepage (see here), and any additional symbols you want, not necessarily in any particular order.
Of course, together with the bitmap, I'll need a list in a text file that tells me which unicode characters you have in the bitmap, from left to right.
Once you have a font I'll take it from there, as I have a program that will convert it automatically to the proper format for newRPL.
Just make sure you don't steal anything from copyrighted fonts, since newRPL (and therefore its fonts) will be released under the BSD license, we don't want to impose any restrictions on the users or have copyright claims later.

And you thought it was off-topic?
So, if I understand correctly, you need 3 images, one for each font size: 5 pixels, 6 pixels and 7 pixels height. And of course, the same characters must be present in each font.

But a font with 5 pixel height is very small. Some characters will be impossible to draw at this scale! Where will these tiny characters be used: only in the soft menus, or also in the stack?

It could be tedious work, and I'm not ready to reinvent the wheel. I have found some examples of fonts here, but I'm not familiar with the copyright rules. For example, is a "creative commons attribution" license OK as a starting point?
(04-22-2015 10:40 PM)Helix Wrote: [ -> ]So, if I understand correctly, you need 3 images, one for each font size: 5 pixels, 6 pixels and 7 pixels height. And of course, the same characters must be present in each font.

But a font with 5 pixel height is very small. Some characters will be impossible to draw at this scale! Where will these tiny characters be used: only in the soft menus, or also in the stack?

You'll be free to select the fonts to use for each area. Any font could be used on any area (when I say area I mean: Stack, Forms, Text editor, Soft menus, Status Area). If you are crazy enough to use a 5-pixel font for the stack, so be it, the system will allow it and work with it.
Other sizes will be allowed as well, so it may make sense to have perhaps two 6-pixel fonts: one wide for the stack and text, and one very narrow (like the minifont), more adequate for soft menus where space is limited to a few letters.
If a character cannot be represented, a box is fine. You don't need to draw many boxes in the bitmap, one is enough.
When you do the list of codes in the bitmap, let's do this:
* Each glyph in the bitmap, from left to right will correspond to one line in the text file.
* Blank lines in the file will be ignored with no consequence.
* You can add comments in the text file: A line starting with // will be considered a comment. If // is found on a line after the numbers, the rest of the line will be considered a comment and disregarded.
* On each line, you put the Unicode code points that are represented by this glyph, in decimal (or hex with 0x prefix) (for example 65 for capital A).
* If there's more than one character represented by the same glyph, add them in the same line, separated by a comma.
* The first symbol, at the left of the bitmap will be the symbol used for any characters that are not listed in the text file, so this should be a generic box or something. This way, you can skip codes and don't need to add them to the list explicitly.

Let's say you start with a box and the capital letters in the bitmap, your text file could look like this:
Code:

0 // GENERIC BOX FOR UNSUPPORTED CHARACTERS
// CAPITAL LETTERS
65 // A
66 // B
67 // C
...

Notice that while I added the 0 code for the generic box, it's not necessary to add 1-31 (control characters), since they will be mapped to the box automatically.
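
The list format described above is simple enough to parse in a few lines. This Python sketch follows the rules as stated in this post (blank lines ignored, // comments, decimal or 0x-prefixed hex, comma-separated code points, first glyph as the fallback); the function names are made up and this is not the actual newRPL conversion tool.

```python
def parse_glyph_list(text):
    """Return one entry per glyph, left to right: the code points it represents."""
    glyphs = []
    for line in text.splitlines():
        line = line.split("//", 1)[0].strip()    # drop comments
        if not line:
            continue                              # blank or comment-only line
        codes = []
        for token in line.split(","):
            token = token.strip()
            if token:
                base = 16 if token.lower().startswith("0x") else 10
                codes.append(int(token, base))
        glyphs.append(codes)
    return glyphs

def build_lookup(glyphs):
    """Map each listed code point to the index of its glyph in the bitmap."""
    return {cp: index for index, codes in enumerate(glyphs) for cp in codes}

def glyph_index(cp, lookup):
    """Unlisted code points fall back to glyph 0, the generic box."""
    return lookup.get(cp, 0)

sample = """0 // GENERIC BOX FOR UNSUPPORTED CHARACTERS
// CAPITAL LETTERS
65 // A
66 // B

0x3A9, 0x2126 // omega and ohm sign share one glyph
"""
lookup = build_lookup(parse_glyph_list(sample))
```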

(04-22-2015 10:40 PM)Helix Wrote: [ -> ]It could be tedious work, and I'm not ready to reinvent the wheel. I have found some examples of fonts here, but I'm not familiar with the copyright rules. For example, is a "creative commons attribution" license OK as a starting point?

I'd say yes, Creative Commons is fine. All you have to do is keep track of which fonts you copy and put their respective credits in a file. We'll pick later which fonts will be in ROM and which will be loadable by the user, but the more fonts we have the better. As long as we don't violate any copyrights, we are OK. Then the ROM distribution will have to carry the credit files.
The tool to convert fonts from a normal bitmap to newRPL format will also be available separately so that users can freely create new fonts as they please. Perhaps I can even try to include one in ROM, so that you can store fonts on an SD card, and they can be opened and converted on-the-fly by the calculator. But this is all up in the air right now; so far the firmware can only do +, -, * and /, so there's a long way to go.
Your help will be much appreciated, so don't forget to add your own name to those credits!

And of course, if anyone else likes to design their own fonts, I invite them to join in and send some bitmaps!
(04-22-2015 10:40 PM)Helix Wrote: [ -> ]So, if I understand correctly, you need 3 images, one for each font size: 5 pixels, 6 pixels and 7 pixels height. And of course, the same characters must be present in each font.

I think this point needs clarification: when you say "the same characters must be present in each font", there's a lot of flexibility in newRPL for that. Much like on modern operating systems, it's up to each font to decide which characters to provide and which ones to leave out. For example, the 5-pixel font doesn't have to provide all symbols if it's nearly impossible to read them anyway.
Ideally, the 256 characters in the HP48 should be there for the font to cover "basic needs".

EDIT: Here's the bitmap of the 5-pixel font from hpgcc3, if it's of any help to anyone out there willing to add more symbols
[attachment=1904]
Your explanations are helpful, thank you.
I think I will begin a font and send you this preliminary work, so if something is wrong you will tell me what to change.
The ability to modify a font with an additional tool (not necessarily included in the ROM), is a feature that will certainly satisfy some picky users like me, who are never happy with this or that letter :). And it means that my possible creations won't be set in stone, which is more relaxing for me.


I still have some questions:

We have already 6 and 7 height fonts in the HP50g. Is it possible to reuse these fonts, or do you prefer entirely different fonts? (I'm afraid of the answer…)

Is it possible that the last line of the font is not entirely blank, and used by some characters like p, q, y? I suppose yes…

You don't plan to include an 8 height font? It's a pity, because I find this size of font very easy to read. I've seen some of your earlier posts, with the description of the three rows of soft menus. They consume a lot of space! I'm not sure I would prefer this layout. I like very much the current display arrangement of the HP50g. I have reassigned the HIST key for recalling the last menu, and it's enough for my (modest) needs. But of course, it's not my project, and I'm not aware of all the possibilities that you envision.
(04-23-2015 11:00 PM)Helix Wrote: [ -> ]I still have some questions:

We have already 6 and 7 height fonts in the HP50g. Is it possible to reuse these fonts, or do you prefer entirely different fonts? (I'm afraid of the answer…)
Yes, the fonts on the 50g are good, but monospaced. Letters like I are too wide, W or M too narrow, etc. Other than that, the style is good and could be copied, but I'm not sure about copyright issues with HP (have these fonts been released with a permissive license by HP?).

(04-23-2015 11:00 PM)Helix Wrote: [ -> ]Is it possible that the last line of the font is not entirely blank, and used by some characters like p, q, y? I suppose yes…
Yes, but remember that it will "touch" the letters below, so it has to be used only when strictly necessary (like p,q, etc.). As long as it looks good anything goes.

(04-23-2015 11:00 PM)Helix Wrote: [ -> ]You don't plan to include an 8 height font? It's a pity, because I find this size of font very easy to read.
Nothing is set in stone, this is an open source project. I said 5, 6 and 7 because I like more information on-screen at once, and considered this as the bare minimum for the system. But newRPL is not for me, it's for everyone, so if you like 8 or 10 pixel fonts, and put the effort to create them, then by all means they can be included, either in ROM (if there's room left) or as a user-loadable font.

(04-23-2015 11:00 PM)Helix Wrote: [ -> ]I've seen some of your earlier posts, with the description of the three rows of soft menus. They consume a lot of space! I'm not sure I would prefer this layout. I like very much the current display arrangement of the HP50g. I have reassigned the HIST key for recalling the last menu, and it's enough for my (modest) needs. But of course, it's not my project, and I'm not aware of all the possibilities that you envision.

It's not as bad as you think. Here, you can see a picture of how it looks on screen. The 2 additional rows don't steal space from your stack, but from the status area, which is now reduced and to the right of the menu (in the picture, the "Memory cleared" message is in the status area).
Lately I've been thinking that the HIST key (or any other) can hide/show the status area (and therefore the second soft menu) upon long-press, to clear even more space if you don't mind not having your variables menu.
Right now this lower area is also used for pop-up error messages. For example, if you press + with only one number in the stack, the 50g opens the error as a pop-up window, it beeps, and you have to press ON to continue working. newRPL shows the pop-up error on top of the second soft-menu and covering the status area (full-width of the screen), for 5 seconds and then disappears. During these 5 seconds, you keep working, no need to press ON or any key, so errors never slow you down (this should be configurable, so errors while the screen is unattended don't just vanish).
So this area takes no more space than on the 50g, and has much more importance, being used simultaneously for status area, error popups and a second soft menu.
(04-21-2015 02:06 PM)Claudio L. Wrote: [ -> ]Now it's time to tackle the fourth property: decompositions.
...
Time to analyze the patterns to see how to encode this for faster access without increasing the 20k too much.
I'll report back with my findings.

Back on topic, similar analysis by ranges revealed that the singletons can be compacted in 4948 bytes, while the doubles need approx. 18 kbytes to be processed with reasonable speed.
While these tables are efficient for decomposition, they won't help with the inverse process, so an additional table is needed for composition. The additional table is expected to require another 12 kbytes.

This brings the total to about 35 kbytes in tables, plus the space taken by the code.
The penalty in speed is not negligible by any means, but we'll have to live with it.
NFC normalization is only intended to be used when importing Unicode text from external sources (SD card/Serial IO). After that, all text will be assumed to be NFC normalized.

End-of-topic.