(32S, 32SII) Birthday paradox, hash collision probability
04-26-2020, 07:12 PM
Post: #2
 Albert Chan Senior Member Posts: 1,676 Joined: Jul 2018
RE: (32S, 32SII) Birthday paradox, hash collision probability
(04-25-2020 04:49 PM)Dave Britten Wrote:  The simple formula, which works with smaller inputs such as birthdays, is this:

1-((N-1)/N)^((S^2-S)/2)

With large inputs, like calculating hash collisions, this formula potentially loses all of its significant digits - (N-1)/N will be rounded to 1, giving you an erroneous probability of zero.

Fortunately there's a second approximation which, while giving poor accuracy for small inputs, provides good accuracy for large ones:

1-e^(-S(S-1)/(2N))

(Reference: https://preshing.com/20110504/hash-colli...abilities/)

First formula can be calculated more accurately using log1p:

lua> m = require 'mathx'
lua> function p1(n,s) return -m.expm1(0.5*s*(s-1) * m.log1p(-1/n)) end
lua> function p2(n,s) return -m.expm1(0.5*s*(s-1) / -n) end
lua> function both(n,s) return p1(n,s), p2(n,s) end
lua>
lua> both(365,30)
0.6968162999539532 ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿0.6963200177219244
lua> both(2^128, 5e12)
3.6734198463188465e-014 ﻿ ﻿ ﻿ ﻿ ﻿ ﻿3.6734198463188465e-014
lua> both(2^128, 2.6153e18)
0.009999839905579181 ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿ ﻿0.009999839905579181

Probability of all different birthday, for 30 people = perm(365,30) / 365^30 ≈ 29.37%

Probability of "collision" ≈ 1 - 29.37% = 70.63%

Approximation p1() seems better than p2(), at least for test cases.