Table of Contents
Introduction
An elegantly simple algorithm, it only requires n (n: sample size, N: population size) random numbers in the range (0,1). This minimization of the number of calls to the random number generator mattered more on slower hardware, but I have found that it is still one of the most expensive (time-wise) operations when writing these algorithms.
Theory
While I think you should read the paper (it’s only three pages and written very accessibly), I want to create a more visual proof of the statistics involved in this algorithm.
First, let’s imagine that the n samples we want are already selected, and you want to know the numbers in that set. Because I’m difficult, I will only confirm if a number is in the set, but not just give you the list. You know the set is sequential, so you start at zero and will list numbers until I have confirmed n of them for you. Now, what is the probability that 0 isn’t in the n samples?1
Answer
What about 1 (assuming zero wasn’t in the set)?
You're thinking about it before clicking right?
where both the numerator and denominator have decreased by one reflecting that the pool of unknown numbers not in the set decreased by one (we know 0 is not in the set) and that the total number of samples left has decreased by 1 as well. This means that if we were to select m new samples, the probability that none of those samples can be found in the original n samples is:
This fact is true regardless of whether or not you choose samples in order, but the plot below shows the probability change as a factor of n where N=100 for the first chosen item if you were to go in order.
the curve for n=1 decreases at a constant rate with the probability reaching zero at 100, which should make some intuitive sense as we know it shouldn’t have any probability of being greater than 100. Looking at the plots a little longer, and one might begin to realize that we are looking at 1-CDF (Cumulant Distribution Function). This is useful because we can then form Probability Distributions by taking the negative derivative of the curves.
Luckily, we get the results that we would expect for at least 1, with uniform probability distribution forming a straight line (or equal probability for any number being in the sample set). For numbers larger than 1, we being to see that the probability is heavily localized towards the beginning of the sample numbers, which should also make sense because the n sample numbers must be uniformly distributed across the full set of N possibilities.
One can then ask, what is the distribution for the next number? Well, assuming a was the first number then we know that the remaining numbers not in the set are N-n-a+1 (plus 1 because a was in the set) and that the total number of remaining possibilities are N-a. Therefore for some number b the possibility that it isn’t in the set, with some value a is:
While this is nice, it would be better to get an idea of where the second draw would be weighted by how likely the first draw is. We can use the following equation:
which gives us this plot:
This plot has some really nice takeaways. First let’s look at n = 2, in this case we see that it is heavily weighted towards the latter half of the set, which makes sense because the first draw was likely in the first half of the set according to the first figure. n = 3 shows the maximal probability being essentially in the middle of the set, which also makes sense intuitively, as we expect the next draw to be nearer to the end.
Ultimately, we can create plots like this for every subsequent draw, and we will find that the probability distribution results in a uniform spread across the whole set. Which is proved algebraically in the paper.
Python Implementation
Below is a basic implementation of the algorithm, where n: sample size, N: population size
from random import random
def bissel(n,N):
nlist = [] #list of n random integers
r = N - n
Nt = N
for i in range(n):
rand = random()
p = 1
while p > rand:
p *= r/Nt
r -= 1
Nt -= 1
nlist.append(N - Nt + 1)
Nt -= 1
return nlist
-
If the equations are not rendering, consider using an open source browser like Firefox or updating your browser. ↩︎