HP Forums
(32S, 32SII) Birthday paradox, hash collision probability - Printable Version

+- HP Forums (https://www.hpmuseum.org/forum)
+-- Forum: HP Software Libraries (/forum-10.html)
+--- Forum: General Software Library (/forum-13.html)
+--- Thread: (32S, 32SII) Birthday paradox, hash collision probability (/thread-14890.html)



(32S, 32SII) Birthday paradox, hash collision probability - Dave Britten - 04-25-2020 04:49 PM

The birthday paradox - called a paradox because the results seem so counterintuitive - states that in a group of just 23 people, there is approximately a 50% chance that some will share the same birthday (ignoring year), or stated another way, that the group does not have entirely unique birthdays.

This program allows you to compute the probability that there will be at least one collision when selecting S samples from N possibilities, with repetition allowed (sampling with replacement).

I won't go too deeply into the math and combinatorics behind it, as there are other resources out there that can do it more elegantly, but I'll describe the two formulas the program uses.

The simple formula, which works with smaller inputs such as birthdays, is this:

1-((N-1)/N)^((S^2-S)/2)

With large inputs, like calculating hash collisions, this formula potentially loses all of its significant digits - (N-1)/N will be rounded to 1, giving you an erroneous probability of zero.

Fortunately there's a second approximation which, while giving poor accuracy for small inputs, provides good accuracy for large ones:

1-e^(-S(S-1)/(2N))

(Reference: https://preshing.com/20110504/hash-collision-probabilities/)

Of course, you may notice there's still a problem with this formula: if the x in e^x is very small, the result is 1 + an extremely small number, and the result will still be rounded to 1, giving a probability of 0. Fortunately, there's a simple way to accurately calculate e^x-1 (if you've programmed a TVM solver, you might recognize this):

e^x-1 = 2*sinh(x/2)*e^(x/2)

In order to use this transformation, we have to first restate the problem in the form of e^x-1, which is easy enough:

1-e^(-S(S-1)/(2N))
= -1*(e^(-S(S-1)/(2N))-1)
= -1*2*sinh(-S(S-1)/(2N)/2)*e^(-S(S-1)/(2N)/2)
= -2*sinh(-S(S-1)/(4N))*e^(-S(S-1)/(4N))

So this program first attempts to use the formula for small inputs, and if a result of 0 is obtained, it then applies the second formula to produce a result.

Usage

This program can either be run interactively to compute a probability for a given N and S, or via the solver to calculate N or S given the other two of N, S, and probability.

Interactive use:

XEQ B
Enter value for N, press R/S.
Enter value for S, press R/S.
Enter 0 for P (this variable is used with the solver), press R/S.

Probability will be returned in x. To see the probability expressed as "1 in x chance", press 1/X.

Solver use:

FN= B
Enter initial guesses in the X register and the variable you will be solving for (this is important, as there may be another solution to the equation with a meaningless negative value).
SOLVE, choose N S or P
Wait a couple of seconds, and results will be returned to x.

Examples:

What is the probability that at least two people in a group of 30 will share the same birthday?

XEQ B
"N?" 365 R/S
"S?" 30 R/S
"P?" 0 R/S

.6968, or about 69.68%.


What is the probability of a GUID collision if 5 trillion GUIDs are generated? (A GUID is a 128-bit globally unique identifier, with 2^128 possible values.)

XEQ B
"N?" 2 ENTER 128 y^x R/S
"S?" 5 E 12 R/S
"P?" 0 R/S

3.6734E-14, or about 1 in 27.22 trillion, i.e. not very likely!


How many GUIDs would you need to generate before there is even a 1% chance of a collision?

FN= B
2 ENTER 64 y^x
STO S (initial guess)
SOLVE S
"N?" 2 ENTER 128 y^x R/S
"P?" .01 R/S
"S=2.6153E18"

You would need to generate about 2,615,300,000,000,000,000 GUIDs.

Program Code

Code:
LBL B [CK=EB70 058.5]
INPUT N
INPUT S
INPUT P
RCL N
1/x
LASTx
1
-
*
RCL S
x^2
LASTx
-
2
/
y^x
1
x<>y
-
x!=0?
GTO C
RCL S
+/-
RCL S
1
-
*
4
/
RCL/ N
ENTER
SINH
x<>y
e^x
*
2
*
+/-

LBL C [CK=7EE2 004.5]
RCL- P
RTN

Back story: We had a data copy job at work that was failing, complaining about duplicate values in a GUID column. The table had around 5 million records in it, and I wanted to first assess whether or not there was any possibility that we legitimately had duplicate GUIDs coming from the source system. After determining that was effectively impossible, I traced the issue to some phantom reads that were happening due to a sudden increase in write activity in the source data, and changed the query's concurrency model to prevent the fake duplicate rows from appearing.


RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 04-26-2020 07:12 PM

(04-25-2020 04:49 PM)Dave Britten Wrote:  The simple formula, which works with smaller inputs such as birthdays, is this:

1-((N-1)/N)^((S^2-S)/2)

With large inputs, like calculating hash collisions, this formula potentially loses all of its significant digits - (N-1)/N will be rounded to 1, giving you an erroneous probability of zero.

Fortunately there's a second approximation which, while giving poor accuracy for small inputs, provides good accuracy for large ones:

1-e^(-S(S-1)/(2N))

(Reference: https://preshing.com/20110504/hash-collision-probabilities/)

First formula can be calculated more accurately using log1p:

lua> m = require 'mathx'
lua> function p1(n,s) return -m.expm1(0.5*s*(s-1) * m.log1p(-1/n)) end
lua> function p2(n,s) return -m.expm1(0.5*s*(s-1) / -n) end
lua> function both(n,s) return p1(n,s), p2(n,s) end
lua>
lua> both(365,30)
0.6968162999539532              0.6963200177219244
lua> both(2^128, 5e12)
3.6734198463188465e-014      3.6734198463188465e-014
lua> both(2^128, 2.6153e18)
0.009999839905579181           0.009999839905579181

Probability of all different birthday, for 30 people = perm(365,30) / 365^30 ≈ 29.37%

Probability of "collision" ≈ 1 - 29.37% = 70.63%

Approximation p1() seems better than p2(), at least for test cases.

https://www.hpmuseum.org/forum/thread-11907-post-119843.html#pid119843
https://www.hpmuseum.org/forum/thread-14344-post-126297.html#pid126297


RE: (32S, 32SII) Birthday paradox, hash collision probability - Dave Britten - 04-26-2020 07:58 PM

(04-26-2020 07:12 PM)Albert Chan Wrote:  First formula can be calculated more accurately using log1p:

lua> m = require 'mathx'
lua> function p1(n,s) return -m.expm1(0.5*s*(s-1) * m.log1p(-1/n)) end
lua> function p2(n,s) return -m.expm1(0.5*s*(s-1) / -n) end

Very clever! And also usable on calculators that don't have a built-in ln(1+x) function:

ln(1+x) = ln(1+x)*x/(1+x-1)


RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 04-27-2020 07:54 PM

(04-26-2020 07:58 PM)Dave Britten Wrote:  Very clever! And also usable on calculators that don't have a built-in ln(1+x) function:

ln(1+x) = ln(1+x)*x/(1+x-1)

Thanks. Where do you get it ?

Note: watch out for divide-by-zero, if x is tiny.
Or, use Dieter's log1p and expm1 approximation.

I tried to re-use my formula, for log(probability of no repetition)

ln_P(n,s) := -s/(12*n*(n-s)) - (s + (n-s+0.5)*ln(1-s/n))

30 people birthday "collision" ≈ -expm1(ln_P(365,30)) = 0.706316242724
Using exact formula:           1 - perm(365,30)/365^30 = 0.706316242719

For big n, small x = s/n, we can drop the first term.
Second term has a problem, even if we have accurate log1p

(s + n*ln(1 - s/n)) / n = x + log(1-x) ≈ x - x = 0

There might be a better way. For now, I use:

x + log(1-x) = x - (x + x^2/2 + x^3/3 + x^4/4 + ...) = -(x^2/2 + x^3/3 + x^4/4 + ...)

Code:
m = require 'mathx'
function ln_nr(n,s) -- log(probability of no repetition)
    local x = s/n   -- assumed small s, big n
    s = s - 0.5
    return -((((1/4)*x + 1/3)*x + 1/2)*x*(s-n) + s)*x
end

function p3(n,s) return -m.expm1(ln_nr(n,s)) end

lua> p3(365,30)
0.7063895931587847
lua> p3(2^128, 5e12)
3.6734198463188465e-014
lua> p3(2^128, 2.6153e18)
0.009999839905579181


RE: (32S, 32SII) Birthday paradox, hash collision probability - Dave Britten - 04-27-2020 09:16 PM

(04-27-2020 07:54 PM)Albert Chan Wrote:  
(04-26-2020 07:58 PM)Dave Britten Wrote:  Very clever! And also usable on calculators that don't have a built-in ln(1+x) function:

ln(1+x) = ln(1+x)*x/(1+x-1)

Thanks. Where do you get it ?

Note: watch out for divide-by-zero, if x is tiny.
Or, use Dieter's log1p and expm1 approximation.

Can't remember where it came from exactly, but it was related to TVM calculations. 15C Advanced Functions handbook maybe? And yes, if x+1-1 returns 0, then you just return x, since ln(1+x) approaches x as x gets smaller.


RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 04-27-2020 10:43 PM

(04-27-2020 07:54 PM)Albert Chan Wrote:  x + log(1-x) = x - (x + x^2/2 + x^3/3 + x^4/4 + ...) = -(x^2/2 + x^3/3 + x^4/4 + ...)

If we let y = x/(2-x) → x = 2*y - x*y

x + log(1-x) = x - 2*(y + y^3/3 + y^5/5 + ...) = -x*y - 2*(y^3/3 + y^5/5 + ...)

→ ln(P(n,s)) ≈ -s/(12*n*(n-s)) + 2*(n-s+0.5)(y^3/3 + y^5/5 + ...) - (s-1)y

I find that y^3 term already gives very good approximation.
Drop the first term, this is my revised ln_nr(n, s), shorter and more accurate.

Code:
function ln_nr(n,s)       -- log(probability of no repetition)
    local y = s/(n-s+n)   -- assumed small s, big n
    return (2/3*(n-s+0.5)*y*y - (s-1))*y
end

function P_nr(n,s)        -- return P_nr, 1 - P_nr
    local x = ln_nr(n,s)
    local y = exp(x)
    return y, 1-y+y*(log(y)-x)
end

lua> P_nr(365,30)
0.2936840560221026       0.7063159439778974
lua> P_nr(2^128,5e12)
0.9999999999999633       3.6734198463188465e-014
lua> P_nr(2^128,2.6153e18)
0.9900001600944208       0.009999839905579181

We can even drop y^3 term ...

lua> function ln_nr(n,s) return s*(s-1)/(s-n-n) end

lua> P_nr(365,30)
0.2885585859237581       0.711441414076242
lua> P_nr(2^128,5e12)
0.9999999999999633       3.6734198463188465e-014
lua> P_nr(2^128,2.6153e18)
0.9900001600944208       0.009999839905579181

Update: switched sign of y so x and y has same sign. This make it clear that x + log(1-x) ≤ 0


RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 04-28-2020 03:23 PM

(04-27-2020 07:54 PM)Albert Chan Wrote:  ln_P(n,s) := -s/(12*n*(n-s)) - (s + (n-s+0.5)*ln(1-s/n))
(04-27-2020 10:43 PM)Albert Chan Wrote:  ln(P(n,s)) ≈ -s/(12*n*(n-s)) + 2*(n-s+0.5)(y^3/3 + y^5/5 + ...) - (s-1)y

This version use the 2nd form, to avoid numerical cancellation problem of original formula.
It assumed IEEE double, ln(2^-52) ≈ -36.0, and sum y^n terms in reverse order

With all this work, we might as well add more corrections.

XCas> series(1/(e^D-1),D)
→ 1/D -1/2 +D/12 -D^3/720 +D^5/30240 +D^6*order_size(D)

Let u=n-x, f(x) = ln(n-x) = ln(u) → df = ln(n-x) dx = -ln(u) du
With x limit from 0 to s, we have u = n-x = from n to n-s

XCas> diff(-ln(u),u,3) / -720       → 1/(u^3*360)
XCas> diff(-ln(u),u,5) / 30240    → -1/(u^5*1260)

Code:
function ln_nr(n,s)         -- log(probability of no repetition)
    local c = n-s
    local y = s/(c+n)
    local y2 = y*y
    local k = floor(-36/log(y2))*2 + 3
    local t = 1/k
    for k = k-2,3,-2 do t = y2*t + 1/k end
    y2 = (c+c+1)*t*y2*y     -- = (s-1)*y - (s+(n-s+0.5)*ln(1-s/n))
    k = 1/(c*n)
    t = s*k                 -- =  1/(n-s)   - 1/n
    c = t*t + 3*k           -- = (1/(n-s)^3 - 1/n^3) / t
    k = c*(c-k) - k*k       -- = (1/(n-s)^5 - 1/n^5) / t
    return (k - 3.5*c + 105)*t*(-1/1260) + y2 - (s-1)*y
end

lua> m = require'mathx'
lua> function fac(n) return m.tgamma(n+1) end
lua> fac(69.95)/fac(69)
56.58432038352886

lua> function perm(n,s) return exp(ln_nr(n,s)) * n^s end
lua> perm(69.95, 0.95)
56.58432038352816
lua> 70*perm(69.95,-0.05) -- 69.95!/69! = 70*(69.95!/70!)
56.58432038352817
lua> 70/perm(70,0.05)       -- 69.95!/69! = 70/(70!/69.95!)
56.58432038352817

lua> ((69.95+70)/2)^0.95  -- inspired from Approximating Gamma Ratios
56.58427578530249


RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 05-02-2020 11:15 PM

Assume x ≫ a, and let \(m = (x+{1+a\over2})\), prove \(\large {(x+a)!\over x!} ≈ m^a\)

\(\large {(x+a)!\over x!}
= {m!\,/\,x! \over m!\,/\,(x+a)!}
= {perm(m, {1+a\over2}) \over perm(m, {1-a\over2})}\)

A few lines in XCas proved this ...

XCas> P_nr(n,s) := exp(-s*(s-1)/(2n))            // probability of no repetition, assumed n ≫ s
XCas> my_perm(n,s) := P_nr(n,s) * n^s         // avoid name conflict with XCas perm()
XCas> m := x+(1+a)/2
XCas> simplify(my_perm(m,(1+a)/2) / my_perm(m,(1-a)/2) - m^a)
→ 0

XCas> subst((x+a)! / x!, [x,a]=[69,0.95])
→ 56.5843203835
XCas> subst(m^a, [x,a]=[69,0.95])
→ 56.5842757853
XCas> subst(m^a, [x,a]=[70,-0.05]) * 70       // 69.95!/69! = 69.95!/70! * 70
→ 56.5843440581

Update: this ratio formula is better

XCas> ratio(x,a) := (x*x + (a+1)*(x + (a+2)/6)) ^ (a/2)
XCas> ratio(69.,0.95)
→ 56.5843203845
XCas> ratio(70., -.05) * 70                            // 69.95!/69! = 69.95!/70! * 70
→ 56.5843203829


RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 05-03-2020 01:00 PM

(05-02-2020 11:15 PM)Albert Chan Wrote:  XCas> ratio(x,a) := (x*x + (a+1)*(x + (a+2)/6)) ^ (a/2)
XCas> ratio(69.,0.95)
→ 56.5843203845
XCas> ratio(70., -.05) * 70                            // 69.95!/69! = 69.95!/70! * 70
→ 56.5843203829

We can reverse the logic, and produce a compact, accurate ln_nr(n, s) Smile

P_nr(n, s) = perm(n, s) / n^s = ratio(n-s, s) / n^s = b^(s/2)

XCas> b := expand(ratio(n-s,s)^(2/s) / n^2)   → \(-\frac{s}{n}+\frac{s^{2}}{6\cdot n^{2}}+1+\frac{1}{n}-\frac{s}{2\cdot n^{2}}+\frac{1}{n^{2}\cdot 3}\)
XCas> factor(b-1)                                          → \(\large\frac{(-s+1) (6\cdot n-s+2)}{6\cdot n^{2}}\)

Code:
m = require 'mathx'
function ln_nr(n,s) return 0.5*s* m.log1p((s-1)*(s-2-6*n)/(6*n*n)) end
function P_collision(n,s) return -m.expm1(ln_nr(n,s)) end

lua> n, s = 365, 30
lua> x1 = ln_nr(n,s)
lua> x1, -m.expm1(x1)
-1.2252495190150705       0.706315588664643

We can improve above by halving the "interval"

lua> h = s/2
lua> x2 = ln_nr(n,h) + ln_nr(n-h,h) + h * m.log1p(-h/n)
lua> x2, -m.expm1(x2)
-1.2252516087363756       0.7063162023825731

I noticed halving the interval improved ln_nr(n,s) by about 16 times
We can extrapolate for better estimate

lua> x = x2 + (x2-x1)/15   -- (x - x1) ≈ 16*(x - x2)
lua> x, -m.expm1(x)          -- extrapolated ln_nr(n,s), collision probability
-1.2252517480511294       0.7063162432970561

Update: I found a general formula. However, convergence is not good ...

XCas> ln_nr(n,s,kmax) := ln(1+s/(n-s)) - sum(subst(sum((j/n)^k, j=1..jmax),jmax=s)/k, k=1..kmax)
XCas> map(range(1,11), k -> [k, 1 - exp(ln_nr(365.,30.,k))])

\(\left(\begin{array}{cc}
1 & 0.695232406385 \\
2 & 0.705857478651 \\
3 & 0.706293132999 \\
4 & 0.706314950579 \\
5 & 0.706316165395 \\
6 & 0.706316237864 \\
7 & 0.706316242403 \\
8 & 0.706316242698 \\
9 & 0.706316242718 \\
10 & 0.706316242719
\end{array}\right) \)


RE: (32S, 32SII) Birthday paradox, hash collision probability - Albert Chan - 05-04-2020 05:56 PM

(04-27-2020 09:16 PM)Dave Britten Wrote:  Can't remember where it [ ln(1+x) = ln(x+1)*x/(1+x-1) ] came from exactly, but it was related to TVM calculations.
15C Advanced Functions handbook maybe?

You are right !
It was proved in Theorem 4, What Every Computer Scientist Should Know About Floating-Point Arithmetic

Quote:And yes, if x+1-1 returns 0, then you just return x, since ln(1+x) approaches x as x gets smaller.

I've noticed the formula failed for IEEE double with binary exponent of 53, and odd last bit.

It is impossible to represent odd 54-bits integer x with IEEE double (53-bits mantissa).
Let "⇒" represent round-to-even rules

If x last bit is 0: (x+1)-1 ⇒ (x)-1 ⇒ x
If x last bit is 1: (x+1)-1 ⇒ (x+2)-1 ⇒ x+2

lua> log = math.log
lua> log1p = require('mathx').log1p
lua> b = 2^53
lua> for i=1,1000 do x=b+2*i; print(i, log1p(x) , log(1+x)*x/(x+1-1)) end
1       36.7368005696771       36.736800569677094
2       36.7368005696771       36.7368005696771
3       36.7368005696771       36.73680056967709
4       36.7368005696771       36.7368005696771
5       36.7368005696771       36.736800569677094
...


RE: (32S, 32SII) Birthday paradox, hash collision probability - Werner - 05-05-2020 01:46 PM

(05-03-2020 01:00 PM)Albert Chan Wrote:  
Code:
m = require 'mathx'
function ln_nr(n,s) return 0.5*s* m.log1p((s-1)*(s-2-6*n)/(6*n*n)) end
function P_collision(n,s) return -m.expm1(ln_nr(n,s)) end

Fantastic, Albert!
A 41-compatible routine, input n ENTER s:
Code:
>LBL "PCOL"
 50
 %
 DSE ST Y
 X<> ST Z    
 STO/ ST Y    
 3
 x
 STO/ ST Y
 +/-
 R^
 +
 1
 -
 x
 LN1+X
 x
 E^X-1
 +/-
 END

Cheers, Werner