HP Forums

Full Version: (42S) matrix permanent
Thanks to:
  • Valentin Albillo's original post, and fantastic 3-line solution for the 71B
  • Albert Chan's different solution and
  • John Keith's improvement
This, then, is an implementation for the 42S of John Keith's improvement of an algorithm posted by Albert Chan solving a problem posted by Valentin Albillo.

usage: with the square matrix in stack register X, XEQ "PER".
The permanent is returned in X, and the original, unchanged matrix is in Y. This is important for low-memory conditions on the 42S.

variables used:

REGS : v (sum of rows)
n : matrix order
i : 2^(n-1)
k : 1..i-1
p : permanent
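For readers who want to follow along off-calculator, here is a rough Python sketch of the Gray-code scheme the program implements (a reconstruction from the descriptions in this thread, not a transcription of the 42S steps; it updates column sums with rows of M, i.e. the transpose of what PER does internally, which gives the same result):

```python
from math import prod

def permanent(M):
    """Permanent of a square integer matrix via the Gray-code scheme
    discussed in this thread: 2^(n-1) signed terms, each the product
    of the current signed column-sum vector v."""
    n = len(M)
    # start with all signs +1: v = column sums, first term is prod(v)
    v = [sum(M[i][c] for i in range(n)) for c in range(n)]
    total = prod(v)
    sign = 1
    for k in range(1, 1 << (n - 1)):
        j = (k & -k).bit_length() - 1        # least-significant set bit of k
        # dj = 2 if the bit above the LSB is set (>>k MOD 4 == 3), else -2
        dj = 2 if (k >> (j + 1)) & 1 else -2
        v = [v[c] + dj * M[j][c] for c in range(n)]
        sign = -sign                         # Gray code flips one bit per step
        total += sign * prod(v)
    return total // (1 << (n - 1))           # division is exact for integer M
```

For the all-ones matrix of order n this returns n!, the test case used further down.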

00 { 147-Byte Prgm }
01▸LBL "PER"
02 ENTER
03 RSUM
04 X=Y?
05 DET
06 REAL?
07 RTN
08 STO "REGS"
09 DIM?
10 R↓
11 STO "n"
12 2
13 X<>Y
14 DSE ST X
15 Y^X
16 STO "i"
17 STO "k"
18 R↓
19 EDIT
20 CLX
21 GTO 14
22▸LBL 10
23 RCL "i"
24 RCL- "k"
25 1
26▸LBL 11
27 RCL ST Y
28 4
29 MOD
30 GTO IND ST X
31▸LBL 01
32▸LBL 03
33 STO+ ST X @ (1,3) -> (-2,2)
34 LASTX
35 -
36 GTO 04
37▸LBL 00
38 X<> ST L
39 STO÷ ST Z @ k := k/4;
40 SQRT @ j := j + 2;
41 +
42 GTO 11
43▸LBL 02
44 SIGN @ j := j + 1;
45 +
46 X<>Y
47 8
48 MOD @ k MOD 8 is 2 or 6
49 4 @ (2,6) -> (-2,2)
50 -
51▸LBL 04
52 RCL "n"
53 1
54 R^
55 STOIJ
56 R↓
57 GETM
58 ×
59 STO+ "REGS"
60 RCL "p"
61▸LBL 14
62 RCL "n"
63 RCL 00
64 DSE ST Y
65▸LBL 13
66 RCL× IND ST Y
67 DSE ST Y
68 GTO 13
69 RCL- ST Z
70 STO "p"
71 DSE "k"
72 GTO 10
73 RCL÷ "i"
74 +/-
75 RCLEL
76 EXITALL
77 X<>Y
78 END


Timing/testing the program is easy, once you realize that the permanent of a matrix of order n, filled with all 1's, is n!.

timing:
order     42S      DM42
  5       0:18     0.05s
  7       1:31     0.22s
  9       7:10     1.03s
 12    1:06:15     9.39s

The timing of order 12 for the 42S is but an estimate, obtained by fitting a logarithmic curve through the three measured data points (x=time/n, y=n), yielding a .99999.. correlation, so it should be pretty good.

Cheers, Werner
[updated with a shorter version... I seem to always find simple ways of shortening programs, moments after I posted them]
Hi, Werner

Perhaps you can add some comments to the code?

Example, why does it use SQRT ?
Does it have something to do with j = bit.ffs(k) ?

On a decimal machine, getting bit.ffs (Find First Set) is expensive.
My guess is it would cost way more than these lines, which supply j = f[1] for the next loop:

f[1] = 1
f[i] = f[i+1]
f[i+1] = i+1
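These three lines are the loopless "focus pointer" recurrence (Knuth, TAOCP 7.2.1.1, Algorithm L). A small Python rendering (a sketch of the idea, not Albert's Lua code) shows it produces exactly the lowest-set-bit sequence without any bit operations:

```python
def ruler(n):
    """Yield the 0-based position of the lowest set bit of k, for
    k = 1 .. 2^n - 1, using the f-array ("focus pointer") recurrence
    quoted above; no bit operations needed."""
    f = list(range(n + 1))      # f[j] = j initially
    while True:
        j = f[0]                # j = f[1] in 1-based notation
        f[0] = 0
        if j == n:
            return              # all 2^n - 1 steps done
        f[j] = f[j + 1]
        f[j + 1] = j + 1
        yield j

# list(ruler(3)) -> [0, 1, 0, 2, 0, 1, 0],
# the lowest-set-bit positions of k = 1..7
```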
Hi, Albert!
The SQRT is a quick way to turn a 4 into a 2 ;-)

The only part, I think, that requires some explanation is lines 23-51.
There, I determine j and dj (+/-2) from k, as John Keith explained.

So, j is the place of the least significant bit in k.
Let >>k be k shifted right till the least significant bit is 1,
then dj=2 if >>k MOD 4 equals 3, else dj=-2 (I *add* dj*Mj, so my signs are swapped).
I also, technically, calculate the permanent of the transposed matrix, which is of course the same. This saves a TRANS(M), which on a 42S would duplicate the matrix; I subsequently fetch the j-th *column* of M instead of the row.

Instead of trying to find the least significant bit of k, successively dividing by 2, I perform k MOD 4 right away, and use GTO IND ST X to branch off:
- half of the time, it is either 1 or 3 and we're done (j=1, with dj=-2 for 1 and dj=2 for 3).
- if it is 0, divide k by 4, add 2 to j and loop
- if it is 2, add 1 to j, and k MOD 8 is now 2 or 6, and dj -2 or 2, respectively
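In Python-like terms (a paraphrase of the branching in steps 23-51, not the actual stack manipulations), the j/dj computation reads:

```python
def j_and_dj(k):
    """Werner's branch-on-(k MOD 4) scheme: j is the 1-based position of
    the lowest set bit of k, and dj = 2 if >>k MOD 4 == 3, else -2."""
    j = 1
    while True:
        r = k % 4
        if r == 1:
            return j, -2            # LSB found, next bit up is 0
        if r == 3:
            return j, 2             # LSB found, next bit up is 1
        if r == 0:                  # two trailing zero bits: k /= 4, j += 2
            k //= 4
            j += 2
        else:                       # r == 2: one trailing zero bit, j += 1;
            j += 1                  # k MOD 8 is now 2 or 6 -> dj = -2 or 2
            return j, 2 if k % 8 == 6 else -2
```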

I had a version using v, f and d vectors (even one using only the stack), but this latest version runs almost twice as fast, partly because I store the vector v in REGS: I don't need to switch between indexing/editing matrices, and the calculation of product(v) is faster.

Cheers, Werner
(01-15-2024 02:45 PM)Werner Wrote: [ -> ]dj=2 if >>k MOD 4 equals 3, else dj=-2 (I *add* dj*Mj, so my signs are swapped).

If the matrix cells are non-negative, the product of the initial v (the column sums) is the biggest term.
This is the reason my code does v[j] = v[j] - d[i]*M[i][j]

Trivia:
If an n×n matrix is all ones, its permanent = n!
The first term alone gives an upper limit:    n^n / 2^(n-1) = 2 * (n/2)^n
For comparison, Stirling's approximation:     n! ~ √(2*pi*n) * (n/e)^n

n=1 --> 2*(n/2)^n = 1 = 1!
n=2 --> 2*(n/2)^n = 2 = 2!
n=3 --> 2*(n/2)^n = 6.75 > (6 = 3!)
n=4 --> 2*(n/2)^n = 32 > (24 = 4!)
...
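The upper-limit claim checks out for every n: going from n to n+1, the ratio (first term)/n! is multiplied by (1+1/n)^n / 2 >= 1, and it equals 1 at n=1. A quick numerical sketch (the helper names here are mine, not from the thread):

```python
from math import factorial, sqrt, pi, e

def first_term(n):
    """First term of the expansion for the all-ones n x n matrix:
    n^n / 2^(n-1) = 2 * (n/2)^n."""
    return 2 * (n / 2) ** n

def stirling(n):
    """Stirling's approximation, which slightly underestimates n!."""
    return sqrt(2 * pi * n) * (n / e) ** n

# upper limit holds, and Stirling sits just below the exact factorial
for n in range(1, 13):
    assert first_term(n) >= factorial(n) > stirling(n)
```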

Quote:Instead of trying to find the least significant bit of k, successively dividing by 2, I perform k MOD 4 right away ...

Nice optimizing trick!

Quote:I had a version using v,f and d vectors (even one using only the stack), but this latest version runs almost twice as fast.

Interestingly, the LuaJIT code is faster (almost twice as fast) using the v, f and d vectors.
This despite the fact that I already use C to code bit.ffs and bit.test.

It is perhaps because LuaJIT has only one kind of number, the IEEE double.
Bit manipulations require converting the double to an integer, then back to a double.

Guessing which way is faster is hard. We have to code it to know it.
(01-15-2024 04:06 PM)Albert Chan Wrote: [ -> ]If matrix cells are non-negative, product of initial v (column sums) is biggest.
This is the reason my code does v[j] = v[j] - d[i]*M[i][j]
? That's what I do as well, but the signs of the d[i]'s are inverted in my code. And by the way, you don't have a choice: you have to loop over all prod(sum(±M[i][j])), and d[j] being -2 means you swap a +M[i][j] for a -M[i][j].

Quote:Guessing which way is faster is hard. We have to code it to know it.

In the case of the 42S, having to access 4 arrays (M, f, v and d) meant a lot of indexing, as you can have only one matrix indexed/edited at a time, unless you store one in REGS and access it with RCL IND ..

Cheers, W.
(01-15-2024 12:33 PM)Werner Wrote: [ -> ]timing:
order     42S      DM42
  5       0:18     0.05s
  7       1:31     0.22s
  9       7:10     1.03s
 12    1:06:15     9.39s

The timing of order 12 for the 42S is but an estimate, fitting a logarithmic curve through the three data points (x=time/n, y=n), yielding a .99999.. correlation, so should be pretty good.

XCas> Digits := 5
XCas> N := float([5,7,9])
XCas> T := float([18,91,430])
XCas> correlation(N, ln(T/N))                      → 0.99999
XCas> c := linear_regression(N, ln(T/N))      → (0.64641,-1.954)

ln(t/n) = c[0]*n + c[1]
t/n = exp(c[1]) * exp(c[0])^n

--> time (sec) ≈ 0.14170 * n * 1.9087^n
--> n=12 time ≈ 3976 sec = 1:06:16
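The same fit can be reproduced in plain Python for anyone without XCas at hand (a sketch; at full precision rather than 5 digits the prediction comes out at about 3975 s, i.e. 1:06:15):

```python
from math import log, exp, sqrt

# Werner's 42S timings (seconds) for orders 5, 7, 9
N = [5, 7, 9]
T = [18, 91, 430]

# linear regression of y = ln(t/n) against n
y = [log(t / n) for t, n in zip(T, N)]
mx, my = sum(N) / 3, sum(y) / 3
cov = sum((a - mx) * (b - my) for a, b in zip(N, y))
vx = sum((a - mx) ** 2 for a in N)
vy = sum((b - my) ** 2 for b in y)
slope, icept = cov / vx, my - (cov / vx) * mx
r = cov / sqrt(vx * vy)              # correlation, ~ 0.99999

# model: time(n) = exp(icept) * n * exp(slope)^n
t12 = 12 * exp(icept + slope * 12)   # ~ 3975 s
```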

The fit is excellent, but the formula does not make sense.
We would expect time proportional to the number of terms, 2^(n-1):
for each term, the time to update the v's and take their product, plus the time to get j and d.

The fit below does not produce as good a correlation, but makes more sense.

XCas> correlation(N, T/2^N)                      → 0.99917
XCas> c := linear_regression(N, T/2^N)      → (0.069336,0.21908)

t/2^n = c[0]*n + c[1] = c[0] * (n + c[1]/c[0])

--> time (sec) ≈ 0.069336 * (n + 3.1596) * 2^n
--> n=12 time ≈ 4305 sec = 1:11:45
(01-16-2024 01:00 AM)Albert Chan Wrote: [ -> ]Fit is excellent, but formula does not make sense.

With only 3 sample points, you can't expect miracles.
I won't burn through a set of batteries to see which estimate proves to be better, probably yours ;-)

Werner
We can use the DM42 timings, without the n=12 number.
Then we predict the n=12 timing and compare it against the actual time (9.39 sec).

N := [5,7,9]
T := [.05, .22, 1.03]

Linear regression, N vs ln(T/N)
--> correlation = 0.99939
--> time (sec) = 0.00046355 * n * 1.8393^n
--> n=12 time ≈ 8.34 sec

Linear regression, N vs T/2^N
--> correlation = 0.98491
--> time (sec) = 0.00011230 * (n + 8.7101) * 2^n
--> n=12 time ≈ 9.53 sec
(01-15-2024 12:33 PM)Werner Wrote: [ -> ]Thanks to:
  • Valentin Albillo's original post, and fantastic 3-line solution for the 71B
  • Albert Chan's different solution and
  • John Keith's improvement
This, then, is an implementation for the 42S of John Keith's improvement of an algorithm posted by Albert Chan solving a problem posted by Valentin Albillo.

... and based on a Python/Numpy program posted by GitHub user "lesshaste", who deserves credit as well.

Nice to see a version for Free42; I will try your optimization on my RPL program when I get the chance.
(01-17-2024 08:17 PM)John Keith Wrote: [ -> ]... I will try your optimization on my RPL program when I get the chance.

Well, I tried it, but it made the program substantially larger and not much faster. RPL doesn't have anything like GTO IND ST X; the closest would be either a CASE statement or nested IF..THEN..ELSE structures. The method might work well on the 71B using ON..GOTO or ON..GOSUB, though.