Threaded Mode | Linear Mode

Dave Britten · 04-25-2020, 04:49 PM

The birthday paradox - called a paradox because the results seem so counterintuitive - states that in a group of just 23 people, there is approximately a 50% chance that some will share the same birthday (ignoring year), or stated another way, that the group does not have entirely unique birthdays.

This program allows you to compute the probability that there will be at least one collision when selecting S samples from N possibilities, with repetition allowed (sampling with replacement).

I won't go too deeply into the math and combinatorics behind it, as there are other resources out there that can do it more elegantly, but I'll describe the two formulas the program uses.

The simple formula, which works with smaller inputs such as birthdays, is this:

1-((N-1)/N)^((S^2-S)/2)

With large inputs, like calculating hash collisions, this formula potentially loses all of its significant digits - (N-1)/N will be rounded to 1, giving you an erroneous probability of zero.

Fortunately there's a second approximation which, while giving poor accuracy for small inputs, provides good accuracy for large ones:

1-e^(-S(S-1)/(2N))

(Reference: https://preshing.com/20110504/hash-colli...abilities/)

Of course, you may notice there's still a problem with this formula: if the x in e^x is very small, the result is 1 + an extremely small number, and the result will still be rounded to 1, giving a probability of 0. Fortunately, there's a simple way to accurately calculate e^x-1 (if you've programmed a TVM solver, you might recognize this):

e^x-1 = 2*sinh(x/2)*e^(x/2)

In order to use this transformation, we have to first restate the problem in the form of e^x-1, which is easy enough:

1-e^(-S(S-1)/(2N))
= -1*(e^(-S(S-1)/(2N))-1)
= -1*2*sinh(-S(S-1)/(2N)/2)*e^(-S(S-1)/(2N)/2)
= -2*sinh(-S(S-1)/(4N))*e^(-S(S-1)/(4N))

So this program first attempts to use the formula for small inputs, and if a result of 0 is obtained, it then applies the second formula to produce a result.

Usage

This program can either be run interactively to compute a probability for a given N and S, or via the solver to calculate N or S given the other two of N, S, and probability.

Interactive use:

XEQ B
Enter value for N, press R/S.
Enter value for S, press R/S.
Enter 0 for P (this variable is used with the solver), press R/S.

Probability will be returned in x. To see the probability expressed as "1 in x chance", press 1/X.

Solver use:

FN= B
Enter initial guesses in the X register and the variable you will be solving for (this is important, as there may be another solution to the equation with a meaningless negative value).
SOLVE, choose N S or P
Wait a couple of seconds, and results will be returned to x.

Examples:

What is the probability that at least two people in a group of 30 will share the same birthday?

XEQ B
"N?" 365 R/S
"S?" 30 R/S
"P?" 0 R/S

.6968, or about 69.68%.

What is the probability of a GUID collision if 5 trillion GUIDs are generated? (A GUID is a 128-bit globally unique identifier, with 2^128 possible values.)

XEQ B
"N?" 2 ENTER 128 y^x R/S
"S?" 5 E 12 R/S
"P?" 0 R/S

3.6734E-14, or about 1 in 27.22 trillion, i.e. not very likely!

How many GUIDs would you need to generate before there is even a 1% chance of a collision?

FN= B
2 ENTER 64 y^x
STO S (initial guess)
SOLVE S
"N?" 2 ENTER 128 y^x R/S
"P?" .01 R/S
"S=2.6153E18"

You would need to generate about 2,615,300,000,000,000,000 GUIDs.

Program Code

Code:

LBL B [CK=EB70 058.5]

INPUT N

INPUT S

INPUT P

RCL N

1/x

LASTx

1

-

*

RCL S

x^2

LASTx

-

2

/

y^x

1

x<>y

-

x!=0?

GTO C

RCL S

+/-

RCL S

1

-

*

4

/

RCL/ N

ENTER

SINH

x<>y

e^x

*

2

*

+/-

LBL C [CK=7EE2 004.5]

RCL- P

RTN

Back story: We had a data copy job at work that was failing, complaining about duplicate values in a GUID column. The table had around 5 million records in it, and I wanted to first assess whether or not there was any possibility that we legitimately had duplicate GUIDs coming from the source system. After determining that was effectively impossible, I traced the issue to some phantom reads that were happening due to a sudden increase in write activity in the source data, and changed the query's concurrency model to prevent the fake duplicate rows from appearing.