Proportion of infected individuals in the random sample - Binomial and hypergeometric distributions

We have already recalled and made use of the binomial distribution, assumed to be well known to the reader. A related problem in probability theory is that of extraction without replacement, which we introduce here for two reasons. The first is that it is little known even by many practitioners (we think, for example, of ourselves and of our fellow physicists). The second is that some care is needed with the parameters used in the literature and in the scientific/statistical libraries of computer languages.

Let us imagine an urn containing $m$ white and $n$ black balls. Let us then imagine that we take out of it, at random, $k$ balls and that we are interested in the number $X$ of white balls that we shall get (for the convenience of the reader, and also of us, who had never worked with such a distribution before, we use the same idealized objects and symbols of the R help page, obtained e.g. by `?dhyper'). The probability distribution of $X$ is known as hypergeometric. In short, referring to the parameters of the probability functions of the R language,

$$X \sim \mbox{HG}(m, n, k)\,,$$

with expected value and variance
$$\mbox{E}(X) = k\cdot \frac{m}{m+n}$$
$$\sigma^2(X) = k\cdot \frac{m}{m+n}\cdot\left(\frac{n}{m+n}\right)\cdot \left(\frac{m+n-k}{m+n-1}\right)\,.$$
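As a minimal cross check of these formulas (the numbers assigned below to $m$, $n$ and $k$ are arbitrary, chosen only for illustration), expected value and variance can be computed directly from the probability function `dhyper()', summing over the possible values of $X$, and compared with the closed expressions:

  m <- 50      # white balls in the urn (illustrative)
  n <- 450     # black balls in the urn
  k <- 100     # balls extracted at random

  x  <- 0:k                     # possible values of X
  px <- dhyper(x, m, n, k)      # hypergeometric probabilities

  E.num   <- sum(x * px)                  # E(X) by direct summation
  var.num <- sum((x - E.num)^2 * px)      # sigma^2(X) by direct summation

  E.formula   <- k * m/(m+n)
  var.formula <- k * (m/(m+n)) * (n/(m+n)) * ((m+n-k)/(m+n-1))

  c(E.num, E.formula)       # the two numbers coincide
  c(var.num, var.formula)   # and so do these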

In terms of the proportion of `objects' having the characteristic of interest (`white'), their fraction in the urn is then assumed to be $p=m/(m+n)$, corresponding, in our problem, to the proportion of infectees. Using the symbol $n_s$ for the sample size $k$, as we have done so far, and $N$ for the total number of individuals in the population, the above equations can be conveniently rewritten as
$$\mbox{E}(X) = p\cdot n_s \qquad (62)$$
$$\sigma^2(X) = n_s\cdot p\cdot (1-p)\cdot \left(\frac{N-n_s}{N-1}\right)\,. \qquad (63)$$

The expression of the expected value is identical to that of a binomial distribution, while that of the variance differs from it by a factor which depends on the difference between the population size and the sample size and which vanishes when $n_s$ is equal to $N$. That is simply because in that case we empty the `urn' and therefore count exactly the number of `white balls'. When, instead, $n_s$ is much smaller than $N$ (and hence $N\gg 1$), we recover the variance of the binomial. In practice this means that whether or not the extracted objects are replaced becomes irrelevant, since the chance of extracting the same object more than once would anyway be negligible.
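For instance, here is a small R sketch (with arbitrary, purely illustrative numbers) showing how the standard deviation of $X$ given by Eq. (63) approaches the binomial one for $n_s\ll N$ and vanishes for $n_s=N$:

  N <- 10000            # population size (illustrative)
  p <- 0.1              # assumed proportion of infectees
  for (ns in c(100, 1000, 5000, N)) {
    sd.hyper <- sqrt(ns * p * (1-p) * (N - ns)/(N - 1))   # Eq. (63)
    sd.binom <- sqrt(ns * p * (1-p))                      # binomial limit
    cat("ns =", ns, "  sd(hypergeometric) =", round(sd.hyper, 2),
        "  sd(binomial) =", round(sd.binom, 2), "\n")
  }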

Moving to our problem, the role of the generic variable $X$ is played by the number of infectees in the sample, indicated by $n_I$ in the previous sections. In terms of their proportion, since $p_s=X/k = n_I/n_s$, we get

$$\mbox{E}(p_s) = \mbox{E}\!\left(\frac{n_I}{n_s}\right) = \frac{m}{m+n} = p\,, \qquad (64)$$

as intuitively expected. As far as the variance is concerned, since $\sigma(p_s) = \sigma(n_I)/n_s$, we get
$$\sigma^2(p_s) = \frac{\sigma^2(n_I)}{n_s^2} = \frac{1}{n_s}\cdot p\cdot\left(1-p\right)\cdot \left(\frac{N-n_s}{N-1}\right) \qquad (65)$$
$$\phantom{\sigma^2(p_s)} \approx \frac{1}{n_s}\cdot p\cdot\left(1-p\right)\cdot \left(1-\frac{n_s}{N}\right)\,, \qquad (66)$$

since $N\gg 1$ in all practical cases of (our) interest.
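Just to get a feeling for the quality of the approximation (again with arbitrary numbers, only for illustration), Eq. (65) and Eq. (66) can be compared for a large population:

  N  <- 60e6          # population size (illustrative, of the order of a country)
  ns <- 10000         # sample size
  p  <- 0.1           # assumed proportion of infectees

  sd.exact  <- sqrt(p*(1-p)/ns * (N - ns)/(N - 1))   # from Eq. (65)
  sd.approx <- sqrt(p*(1-p)/ns * (1 - ns/N))         # from Eq. (66)

  c(sd.exact, sd.approx)    # practically indistinguishable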

Finally, if the sample size is much smaller than the population size, then the last factor approaches unity and can be neglected, so that the variance can be approximated by $p\cdot(1-p)/n_s$, thus yielding

$$\left.\sigma(p_s)\right\vert_{n_s\ll N} \approx \sqrt{\frac{p\cdot (1-p)}{n_s}}\,, \qquad (67)$$

the well-known standard deviation of the fraction of successes in a binomial distribution with $n_s$ trials, each with probability $p$. The reason is that (it is worth repeating it) when the sample size is much smaller than the population size we can neglect the effect of sampling without replacement and consider the trials as (conditionally) independent Bernoulli processes, each with probability of success $p$.
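This limiting behaviour can also be checked by simulation, extracting with `rhyper()' many samples (without replacement) from a large population and comparing the empirical standard deviation of $p_s$ with that of Eq. (67) (once more, the numbers are arbitrary and serve only the purpose of illustration):

  set.seed(1)
  N  <- 1e6                 # population size
  p  <- 0.1                 # assumed proportion of infectees
  m  <- p * N               # infectees ('white balls')
  n  <- N - m               # non-infectees ('black balls')
  ns <- 1000                # sample size, much smaller than N

  nI <- rhyper(100000, m, n, ns)   # many simulated extractions
  ps <- nI / ns                    # sampled proportions

  sd(ps)                    # empirical standard deviation
  sqrt(p*(1-p)/ns)          # binomial approximation of Eq. (67)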