HP Forums

Full Version: Improve precision of float numbers
Hi all!

I like to build small RPN calculators (ScArY, SCOTT, ARC) with AVR microcontrollers (which are easy to program).

Unfortunately these microcontrollers only support 4-byte single-precision float numbers (per IEEE 754). That means a precision of 6 to 7 decimal digits.
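To make the 6-to-7-digit limit concrete, here is a minimal sketch in plain C. The function name increment_is_lost is only illustrative and not from any library. It checks whether adding 1 to a 4-byte float changes the value at all; above 2^24 = 16777216 it no longer does.

```c
#include <stdint.h>

/* Returns 1 if adding 1.0f to x leaves x unchanged, i.e. the
   increment is below the resolution of a 4-byte float at x.
   (Illustrative helper, not part of any library.) */
static int increment_is_lost(float x) {
    volatile float y = x + 1.0f;   /* volatile: force rounding to float */
    return y == x;
}
```

With this helper, increment_is_lost(1000000.0f) gives 0 (7 digits are still exact), while increment_is_lost(16777216.0f) gives 1 because 16777217 cannot be represented in a 24-bit significand.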

Now I would like to improve this precision to at least 9 decimal digits. The only idea I have is to define a new number format (struct) to separate mantissa and exponent:

Code:
struct real {
  long m;
  int8_t e;
};

But now I have to "reinvent" every mathematical operation, like adding two numbers (the following code works, but is far from being efficient):

Code:
void realadd(real * res, real a, real b) {
  if (a.e >= b.e) {
    b.m /= _pow10(a.e-b.e);
    res->m = a.m + b.m;
    res->e = a.e;
  }
  else {
    a.m /= _pow10(b.e-a.e);
    res->m = a.m + b.m;
    res->e = b.e;
  }
}

Looking ahead to transcendental functions or complex numbers, I feel overwhelmed.

Do you have any other idea how to raise the precision with less effort?

Thanks for any idea.
deetee
There's a couple of links here:

https://stackoverflow.com/questions/6769...g-2-floats

Not much, but it's a start...
Hello Claudio!

Thanks for your link - it led me to a very promising solution from Nick Gammon - BigNumber:

http://www.gammon.com.au/forum/?id=11519

Regards
deetee
I have built a few homemade calculators too and I decided to use a char array to hold BCD bytes. It takes some time to implement all the functionality but you can add as many decimal places as you like (32 for one project!).
Thinking about it some more:

If you can fit something like bignum in your MCU, you have other options (I was thinking smaller, so the tiny routines in that paper were the first choice):

DecNumber is the reference implementation for multi precision decimal, but it abuses malloc/free so it's not too friendly for small hardware.

Mpdecimal is Python's implementation, I used it for newRPL for quite some time, it's more MCU friendly but there's no transcendentals.

And finally, you could also use newRPL's decimal library (and I don't know why I didn't think of it before, being the author...oh, well), which only needs a static scratch area of memory, has no dependencies at all, and is a single file. Transcendental functions are already done and they are tableless (almost), so it's very MCU friendly. It's not documented very well, but since it was designed as a drop-in replacement for mpdecimal, the API is very similar.

You should take a look at all 3. I'm not sure what the limitations of your hardware are, but if it fits, it's way better than starting from scratch.
Thanks for all the hints.

My next calculator has an OLED display (128x64) and a small (16-key) keyboard with a fast menu function.
The MCU is an ATmega32 (Arduino), which offers 28k (with USB) or 32k of flash memory. Until now I tried ATtiny85 projects, but 8k is too tight a limit for RPN calculators with a broad spectrum of functionality.

A first try of BigNumber - which is an excellent library - cost me approx. 8k (for basic math). This seems too much for my resources. I think that is similar to the char array suggestion of Druzyek.

But the newRPL decimal library gives hope. Unfortunately I don't know how to extract this library from the 19MB exe file and how to include it in my C program (Arduino IDE). As I like "all in one" source code files, it would be ideal for me to call C subroutines. Sorry for my inflexibility and lack of experience.

Regards
deetee
(04-11-2019 10:10 AM)deetee Wrote: [ -> ]Hi all!


What I did was to use uint64_t (which gcc does support) and map it to a new float-64 type.
(see post: https://www.hpmuseum.org/forum/thread-12761.html)
Hi agarza!

(04-13-2019 02:03 PM)agarza Wrote: [ -> ]What I did was to use uint64_t (which gcc does support) and map it to a new float-64 type.
(see post: https://www.hpmuseum.org/forum/thread-12761.html)

Wow - thanks for this hint. I tried that before, but did not realize that it is only Serial.print that doesn't support int64_t.

So my first attempt
Code:
struct real {
  long m;
  int8_t e;
};
can be changed to
Code:
struct real {
  int64_t m;
  int8_t e;
};
... and I can calculate with (at least) 18 digits of precision.
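Since Serial.print does not accept int64_t, the mantissa has to be turned into text by hand before it can be shown. A minimal sketch of such a conversion follows; the name i64_to_str and the buffer size are assumptions for illustration, not part of the Arduino core.

```c
#include <stdint.h>

/* Writes the decimal representation of v into buf and returns buf.
   buf must hold at least 21 bytes: sign + 19 digits + terminator.
   (Hypothetical helper, not an Arduino function.) */
static char *i64_to_str(int64_t v, char *buf) {
    char tmp[20];
    uint8_t n = 0;
    /* Work on the magnitude as unsigned so that INT64_MIN does not overflow. */
    uint64_t u = (v < 0) ? (uint64_t)0 - (uint64_t)v : (uint64_t)v;
    do {                              /* collect digits, least significant first */
        tmp[n++] = (char)('0' + (u % 10));
        u /= 10;
    } while (u != 0);
    char *p = buf;
    if (v < 0) *p++ = '-';
    while (n > 0) *p++ = tmp[--n];    /* reverse into the output buffer */
    *p = '\0';
    return buf;
}
```

The resulting char buffer can then be passed to Serial.print like any other C string.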

Can you tell me more about how to "map it to a new float-64 type"? I didn't find a hint about it in your post.
Do I still have to reinvent mathematical functions?
How did you do it with your DIY calculator?

Regards deetee

PS: I like your DIY calculator - the front view, the custom PCB and how you managed to run the LCD display (I know these displays are really tricky to drive).
As an aside:

You could consider some other Arduino related development boards.

The PJRC Teensy 3.2 board will do 64-bit maths out of the box and can be programmed from the Arduino IDE once the Teensy extensions have been added.

The ageing Arduino Due will do the same job, but its form factor is somewhat bulky for home-made calculators.

Both of these options remove the resource ceiling of the 8-bit Arduino family. You are left wondering what code you can add to fill the device up instead of scratching around to save a few bytes of RAM or heap space.
(04-13-2019 09:43 AM)deetee Wrote: [ -> ]Thanks for the lot of hints.


I think none of the libraries I mentioned will fit in your flash or RAM. I thought you had one of the bigger MCUs.

EDIT: By the way, you don't extract the library from the executable. You go to the sources section and look for decimal.c and decimal.h inside the newrpl folder. If you want transcendentals (which are unlikely to fit in your flash, but if you choose a newer AtMega32 with more flash...) look for lighttranscend.c, there's also transcendentals.c but they use tables. Depending on how much precision you want, the tables will become huge. In contrast, lighttranscend.c has no tables (at the expense of performance), and only a few constants (which you should reduce to the number of digits you need to reduce space). The routines are quite low-level, for examples on how to use them (since they are not documented well) you may look at lib-66-transcendentals.c to see what parameters they need.
Have you thought of using a scheme like the one used for DEC64? It has all the speed advantages of binary floating point along with all the accuracy benefits of BCD, plus it supports a larger range of values.

Each DEC64 value is represented as M*10^(e) where M is the 2's complement mantissa and e is the 2's complement exponent. In the DEC64 implementation, M is 56 bits and e is 8 bits. For an MCU, one could use exponents and mantissas that use fewer bits -- it doesn't affect the basic idea. The only semi computationally expensive process involved with DEC64 is the entry and display of the numbers, but, both of those processes can be executed very quickly.
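As a rough sketch of the M*10^(e) layout described above, the following C functions pack a mantissa and an exponent into one 64-bit word and take them apart again. The names dec_pack, dec_mantissa and dec_exponent are made up for this example, and placing the exponent in the low byte is an assumption of the sketch; this is not the DEC64 reference code.

```c
#include <stdint.h>

/* Pack mantissa (upper 56 bits) and exponent (low 8 bits) into one word.
   The shift is done on an unsigned value because left-shifting a negative
   signed value is undefined in C. */
static int64_t dec_pack(int64_t m, int8_t e) {
    return (int64_t)(((uint64_t)m << 8) | (uint8_t)e);
}

/* Extract the signed mantissa without relying on how the compiler
   right-shifts negative numbers. */
static int64_t dec_mantissa(int64_t x) {
    return (x >= 0) ? (x >> 8) : ~((~x) >> 8);
}

/* Extract the signed exponent from the low byte. */
static int8_t dec_exponent(int64_t x) {
    uint8_t b = (uint8_t)(x & 0xFF);
    return (int8_t)((b < 128) ? b : b - 256);
}
```

For example, 1.5 would be stored as dec_pack(15, -1), i.e. 15 * 10^(-1), and -1.5 as dec_pack(-15, -1).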

Regards,

Jonathan
@Claudio: Thanks for guiding me to the source of newRPL. Very impressive and for me a good reference for details - but I agree it doesn't fit in my small MCU.

@Jonathan: Thanks for this hint. Impressive, comprehensive and easily readable code. DEC64 is similar to what I intended. I expected to get rid of bit shifting and & operations by separating mantissa and exponent into different (struct) variables. And I agree with you - implementing the entry and display of numbers is not easy. First attempts cost me approx. 5k ... I worry whether I will find room for the "real" calculator stuff.

Finally I'm close to giving up and doing it with the intrinsic double format - even if it has only 7 decimal digits of precision.

Regards
deetee
(04-17-2019 05:07 AM)deetee Wrote: [ -> ]Finally I'm close to giving up and doing it with the intrinsic double format - even if it has only 7 decimal digits of precision.
Don't give up! You'll get it if you keep working at it. I think you can get really far if you decide to store BCD bytes like I mentioned. It should also make your input code a little simpler and save you some space. As far as input goes, there are two ways I have tried. One is to let the user enter whatever they like then scan the input to make sure it is a valid number at the end. The other is to keep track of what has already been entered and ignore invalid characters. For example, if the user has already entered a decimal point, ignore the input if they press it again. I went with the second way for my last project and it simplifies things a little. Another big thing that might make your code a lot smaller is only storing one BCD digit per byte instead of two. Numbers will take up twice as much memory in RAM, but you will save some code space by not having to pack and unpack the bytes.
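To illustrate the one-BCD-digit-per-byte idea, here is a minimal C sketch of adding two mantissas of equal length. The names bcd_add and BCD_DIGITS and the length of 12 digits are assumptions for this example; exponent alignment and signs are left out.

```c
#include <stdint.h>

#define BCD_DIGITS 12  /* hypothetical mantissa length, pick what you need */

/* Adds two unsigned BCD mantissas stored one digit (0..9) per byte,
   most significant digit first. Writes the result to sum and returns
   the carry out of the top digit (0 or 1). */
static uint8_t bcd_add(const uint8_t *a, const uint8_t *b, uint8_t *sum) {
    uint8_t carry = 0;
    for (int8_t i = BCD_DIGITS - 1; i >= 0; i--) {
        uint8_t d = (uint8_t)(a[i] + b[i] + carry);  /* at most 9 + 9 + 1 */
        carry = (d >= 10);
        sum[i] = carry ? (uint8_t)(d - 10) : d;
    }
    return carry;
}
```

Because every byte already is a single digit, no packing or unpacking is needed, and a digit can be sent to the display by simply adding '0' to it.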

How much flash space do you have on your chip? I have been able to implement BCD math functions on an MSP430 pretty compactly in assembly. The same code might be bigger or smaller on an AVR, so you can't compare directly, but I wouldn't be surprised if you could also implement some pretty decent math functions in just a few K of flash on an AVR too.

I have been keeping tables for each math function to compare different versions and see the size/speed trade-off, in case I need to go with a slower version to fit everything in flash (so far that hasn't been the case).
[attachment=7160]

I'm still working on interface code, so the final version might take up a lot more flash, but the math functions are done. As you can see, they are pretty small. I bet you could do the same on an AVR.
[attachment=7161]
[attachment=7162]

EDIT: Can I make the images full size?
(04-14-2019 05:56 AM)deetee Wrote: [ -> ]Hi agarza!


You can use SoftFloat.

There is a port of SoftFloat for the AVR:
Float64

The gcc floating-point math libraries do not work on 64-bit floats.
Try the 64-bit port for AVR: Math64
@Druzyek: Thanks for motivating me.

Storing floats in a char array seems to me to need a lot of memory (similar to the BigNumber library). Entering an input string and converting it to a float is very convenient (I tried this with one of my first calculators - ARC). But in the end I switched to interpreting every keypress at once and changing the X register immediately.

SoftFloat, Float64 and Math64 work properly - but my first tries needed too much memory. And my to-do list still includes calculating with complex numbers, statistics, matrices and, last but not least, programming ... I'm sure that it's possible to include these capabilities with the intrinsic (7-digit precision) float format.

But I'll try one last thing (and hope this doesn't consume too much memory):
I like the DEC64 format, which saves mantissa and exponent in one int64_t variable. The advantage is that passing numbers to and from subroutines is very handy. Converting mantissa and exponent to DEC64 and vice versa can be done very quickly. And the sign of numbers is handled automatically (i.e. subtraction equals addition of the negated number ... a-b = a+ (-b) ).

The following code demonstrates how this could be done - even if the code was written quickly and is very inefficient. The input and display of those numbers needed approx. 3k, and basic math (+-*/) approx. another 3k.
What I still can't estimate is how much flash will be needed when these functions are called very often (e.g. when calculating exp(x) with a Taylor series).
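As a rough illustration of the Taylor series case, here is a minimal C sketch. It does not use the number format above; as a stand-in it uses plain int64_t fixed point with 9 decimal places, and the names FX_ONE and fx_exp are assumptions for this example. The multiply and divide appear only once, inside the loop, so the code size is set by the number of call sites and not by the number of iterations.

```c
#include <stdint.h>

#define FX_ONE 1000000000LL  /* hypothetical scale: 1.0 == 10^9 */

/* exp(x) for 0 <= x <= FX_ONE (i.e. 0 <= x <= 1.0) in fixed point.
   Each term is derived from the previous one: t_n = t_(n-1) * x / n.
   Larger arguments would need range reduction, which is not shown. */
static int64_t fx_exp(int64_t x) {
    int64_t sum = FX_ONE;    /* term 0 of the series is 1 */
    int64_t term = FX_ONE;
    for (int8_t n = 1; term != 0 && n < 30; n++) {
        /* term and x are both at most 10^9, so the product fits in int64_t */
        term = term * x / FX_ONE / n;
        sum += term;
    }
    return sum;
}
```

The loop stops as soon as a term has shrunk to zero, which for x = 1.0 happens after roughly a dozen iterations.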

Thanks for every hint and help (and motivation) - I love this forum.

Code:

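// Note: _abs, _sign, NMIN, NMAX, NDIGITS and NINF are defined elsewhere in the sketch (not shown here)

static int64_t Nassign(int64_t c, uint8_t e) { // Assign number: mantissa in the upper 56 bits, exponent in the low byte
  return (int64_t)(((uint64_t)c << 8) | e); // Shift as unsigned - left-shifting a negative value is undefined
}

static int64_t Nm(int64_t x) { // Extract mantissa (relies on gcc's arithmetic right shift)
  return (x >> 8);
}

static int8_t Ne(int64_t x) { // Extract exponent (low byte, interpreted as signed)
  return (int8_t)(x & 0xff);
}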

static int64_t Nexpand(int64_t x) {
  return (Nassign(Nm(x) * 10, Ne(x) - 1));
}

static int64_t Nshrink(int64_t x) {
  return (Nassign(Nm(x) / 10, Ne(x) + 1));
}

static int64_t Nresize(int64_t x) { // Resize x to max mantissa (needed for printing)
  if (Nm(x)) { // != 0
    while (_abs(Nm(x)) > NMAX) x = Nshrink(x);
    while (_abs(Nm(x)) < NMIN) x = Nexpand(x);
  }
  else return (Nassign(0, 1 - NDIGITS));
  return (x);
}

static int64_t Nopt(int64_t x) { // Optimize x - reduce trailing zeros
  if (Nm(x) != 0) while (Nm(x) % 10 == 0) x = Nshrink(x); // Keep the result, otherwise the loop never ends
  return (x);
}

static int64_t Nadd(int64_t a, int64_t b) { // Add two numbers
  if (Ne(a) == Ne(b)) return (Nassign(Nm(a) + Nm(b), Ne(a))); // Business numbers (fast addition)
  if (Ne(b) > Ne(a)) { // Swap so that a has the bigger exponent
    int64_t tmp = a; a = b; b = tmp;
  }
  while (Ne(a) > Ne(b) && _abs(Nm(a)) < NMIN) { // Expand a as far as necessary or possible (compare the magnitude so negative a works too)
    a = Nexpand(a);
  }
  while (Ne(a) > Ne(b) && Nm(b) != 0) { // Shrink b as far as necessary (costs significance)
    b = Nshrink(b);
  }
  return (Nassign(Nm(a) + Nm(b), Ne(a))); // Exponents are equal here, or b has shrunk to 0 and only a counts
}

static int64_t Nsub(int64_t a, int64_t b) { // Subtract two numbers
  b = Nassign(-Nm(b), Ne(b));
  return (Nadd(a, b));
}

static int64_t Nmult(int64_t a, int64_t b) { // Multiply two numbers
  a = Nresize(a);
  b = Nresize(b);
  int64_t res = Nassign(0, Ne(a) + Ne(b));
  while (Nm(b)) {
    res = Nassign(Nm(res) / 10 + Nm(a) / 10 * (Nm(b) % 10), Ne(res) + 1);
    b = Nassign(Nm(b) / 10, Ne(b));
  }
  return (res);
}

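static int64_t Ndiv(int64_t a, int64_t b) { // Divide two numbers
  if (Nm(b) == 0) return NINF;
  if (Nm(a) == 0) return (Nassign(0, 0)); // 0 / b = 0 (the loop below would never end)
  a = Nopt(a); b = Nopt(b);
  int8_t sign = _sign(Nm(a)) * _sign(Nm(b)); // Sign of the result
  a = Nassign(_abs(Nm(a)), Ne(a)); b = Nassign(_abs(Nm(b)), Ne(b)); // Divide the magnitudes
  int64_t q = 0; // Mantissa of the quotient
  int8_t e = Ne(a) - Ne(b) + 1; // Exponent of the quotient
  int64_t rest = Nm(a);
  while (q <= NMIN) { // Long division, one decimal digit per pass
    q = q * 10 + rest / Nm(b);
    rest = rest % Nm(b) * 10;
    e--;
  }
  return (Nassign(sign * q, e)); // Apply the sign to the result
}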