Re: Statistical Challenge Message #6 Posted by Patrick on 6 Oct 2003, 5:55 p.m., in response to message #5 by Larry Corrado, USA (WI)
Well done, Mr. Csaba. Your answer is right on.
In words, the basic idea of the algorithm is that the program always keeps the "current" selection stored, together with the current number of items entered to date. When the n'th input is made, it replaces the previous current selection with probability 1/n. So, for example, the first item entered is selected as the current value with probability 1. The second one entered is selected with probability 1/2, and so on.
After all the data is entered (or at any other time) the current selection is the value the program will return as the "random pick" of the user's data.
Here is the proof that the program selects each item with equal probability. Suppose that the entire list consists of N items. Consider the k'th item in the list, with 1<=k<=N. What is the probability that it ends up as the final selection? In order for our program to be correct, this probability must be 1/N. Well, in order for this to happen, it must be selected when it is keyed in and then all further keyed data inputs must not be selected. These events are independent and so we multiply the probabilities for them each to happen:
(1/k)*(k/(k+1))*((k+1)/(k+2))* ... *((N1)/N) = 1/N
Techniques like this used to abound in the literature in the days when memory and CPU costs were large. Typically, a large data set was read off of a tape and certain reports were generated from the data. Core memory was not so plentiful that it could maintain all of that data in memory all at the same time. Nevertheless, it was critically important to programs of the day that they not require multiple passes through the data  reading a tape was expensive, in CPU time and in wallclock time. "Single pass" type algorithms were key.
The algorithm at the base of this challenge is a simple example of such a singlepass (stochastic) algorithm. Many such have been lost from general knowledge and practice because of how inexpensive memory (e.g., virtual memory) and CPU power has become.
