HP Forums

Full Version: (25C) Sampling without repetition
You're currently viewing a stripped down version of our content. View the full version with proper formatting.
This program generates random sample of k elements out of total set of N elements. Element can be used only once. All such subsets are drawn with equal probabilitites (assuming RNG is fair).

Algorithm considers each integer i from 0 to N-1 and decides on each whether to include it in a sample or not with probability (k-done)/(N-i) where done is the number of already selected items.
The generated elements are not stored, so k and N can be any positive integers, however this algorithm is only practical with relatively small N.

Instruction:
<N> sto 0
<k> sto 1
<0.random_seed> sto 5
f prgm r/s
Output: first element of a sample 0..N-1
r/s
Output: next element of a sample.
When no more elements outputs -1
r/s after that starts a new sample.

Code:

01. 0
02. sto 2
03. sto 3
04. rcl 1
05. rcl 3
06. -
07. x=0
08. gto 30
09. rcl 0
10. rcl 2
11. -
12. /
13. rcl 5
14. 1
15. 1
16. *
17. pi
18. +
19. frac
20. sto 5
21. x>=y
22. gto 27
23. 1
24. sto+3
25. rcl 2
26. stop
27. 1
28. sto+2
29. gto 04
30. 1
31. chs
32. gto 00

Example:
100 sto 0
10 sto 1
0.1234 sto 5
f prgm r/s
[2] r/s [4] r/s [23] r/s [30] r/s [50] r/s
[53] r/s [57] r/s [59] r/s [66] r/s [96] r/s
[-1] -- indicates end of sample

I wonder if it is possible to improve on the algorithm and generate a sample of modest k for arbitrary N, for example, a sample of 30 out of 10000000.
Here's a translation of your program for the HP-42S:
Code:
00 { 41-Byte Prgm }
01▸LBL "SAMPLE"
02 RCL 00
03 STO 02
04 RCL 01
05 STO 03
06▸LBL 00
07 RCL 03
08 RCL 02
09 ÷
10 RAN
11 X≥Y?
12 GTO 01
13 RCL 02
14 STOP
15 DSE 03
16 GTO 01
17 GTO 02
18▸LBL 01
19 DSE 02
20 GTO 00
21▸LBL 02
22 -1
23 END

However, I reversed the order so I could use DSE.
Thus the generated samples appear in descending order.

Example:

100
STO 00
10
STO 01
XEQ "SAMPLE"
85
R/S
64
R/S
62
R/S
59
R/S
55
R/S
53
R/S
39
R/S
32
R/S
30
R/S
11
R/S
-1


(04-27-2017 04:07 AM)nsg Wrote: [ -> ]I wonder if it is possible to improve on the algorithm and generate a sample of modest k for arbitrary N, for example, a sample of 30 out of 10000000.

In this case we might simply generate them at random and hope there's no collision.

Kind regards
Thomas
(10-27-2018 03:52 PM)Thomas Klemm Wrote: [ -> ]Here's a translation of your program for the HP-42S:
Code:
...
15 DSE 03
16 GTO 01
17 GTO 02
18▸LBL 01
...

One more case where something like INV DSZ would be useful (greetings to the TI community ;-)). This can be accomplished quite easily by having the DSE followed by a test that always tests false. Since X holds a positive number at this point, a simple x<0? will do.

Code:
...
15 DSE 03
16 X<0?
17 GTO 02
...

Slightly shorter, slightly faster, saves one label.
Edit: OK, in this special case it's neither shorter nor label-saving, simply because LBL 01 is required anyway. #-)

(10-27-2018 03:52 PM)Thomas Klemm Wrote: [ -> ]
(04-27-2017 04:07 AM)nsg Wrote: [ -> ]I wonder if it is possible to improve on the algorithm and generate a sample of modest k for arbitrary N, for example, a sample of 30 out of 10000000.
In this case we might simply generate them at random and hope there's no collision.

In such a case there is no significant difference between sampling with or without repetition. So it doesn't matter much if a number is drawn twice. Otherwise the idea is: record the sample in a (sorted) list, generate the next number, see if it's part of the list. If yes: get a new number, if not: insert it into the list. Hmmm... somehow this sounds like a job for the HP50. ;-)

Dieter
(04-27-2017 04:07 AM)nsg Wrote: [ -> ]I wonder if it is possible to improve on the algorithm and generate a sample of modest k for arbitrary N,
for example, a sample of 30 out of 10000000.

Code:
import random
def randindex(k, n):
    s = [n] * k             # k samples
    for i in range(k):
        r = random.randrange(n-i)
        j = 0
        while r + j >= s[j]: j = j+1
        s[j+1:i+1] = s[j:i] # make room for s[j]
        s[j] = r + j        # sorted insert
    return s

Above code only need k random numbers, and return sorted samples.

>>> randindex(10,10) # no repeat
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> random.seed(1)
>>> randindex(30,10000000)
[21060, 254459, 283474, 290410, 305901, 938595 ...
(10-28-2018 02:18 PM)Albert Chan Wrote: [ -> ]Above code only need k random numbers, and return sorted samples.

Writing a program for the HP-25 using only 8 registers and 49 programming steps is left as an exercise for the dear reader.

Cheers
Thomas
(10-28-2018 09:18 AM)Dieter Wrote: [ -> ]In such a case there is no significant difference between sampling with or without repetition. So it doesn't matter much if a number is drawn twice. Otherwise the idea is: record the sample in a (sorted) list, generate the next number, see if it's part of the list. If yes: get a new number, if not: insert it into the list. Hmmm... somehow this sounds like a job for the HP50. ;-)

Going further of-topic...
This would be a variation of the Fisher-Yates shuffle, implemented in the ListExt Library as LSHUF.
(10-28-2018 08:03 PM)John Keith Wrote: [ -> ]
(10-28-2018 09:18 AM)Dieter Wrote: [ -> ]In such a case there is no significant difference between sampling with or without repetition. So it doesn't matter much if a number is drawn twice. Otherwise the idea is: record the sample in a (sorted) list, generate the next number, see if it's part of the list. If yes: get a new number, if not: insert it into the list. Hmmm... somehow this sounds like a job for the HP50. ;-)

This would be a variation of the Fisher-Yates shuffle, implemented in the ListExt Library as LSHUF.

I do not see any shuffling from Dieter description.
That seems more like "rinse, and repeat", throwing away (rare) repeated samples.

My Python randindex(...) look more like shuffling, except working thru the samples list.
Reference URL's