General considerations on the approximated evaluation of $\sigma(n_P)$ by Eq. ([*])

At this point some further remarks on the utility of Eq. ([*]) are in order. Its advantage, within its limits of validity (checked in our case), is that it allows us to disentangle the contributions to the overall uncertainty. In particular we can rewrite it as
$$\sigma(n_P) \;\approx\; \sigma_R(n_P) \oplus \sigma_{\pi_1}(n_P) \oplus \sigma_{\pi_2}(n_P)\,, \qquad\qquad (51)$$

that is a `quadratic sum' (or `quadratic combination', indicated by the symbol `$\oplus$') of three contributions,
$$\begin{array}{rcl}
\sigma_R(n_P) &=& \sqrt{\mbox{E}(\pi_1)\cdot (1-\mbox{E}(\pi_1))\cdot p_s\cdot n_s
+ \mbox{E}(\pi_2)\cdot (1-\mbox{E}(\pi_2))\cdot (1-p_s)\cdot n_s} \\
\sigma_{\pi_1}(n_P) &=& \sigma(\pi_1)\cdot p_s\cdot n_s \\
\sigma_{\pi_2}(n_P) &=& \sigma(\pi_2)\cdot (1-p_s)\cdot n_s\,,
\end{array}$$

due, as indicated by the subscripts, to the binomial fluctuations (`$R$' standing for `random'), to the uncertainty on $\pi_1$ and to that on $\pi_2$.
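For readers who like to play with the numbers, here is a minimal Python sketch of this decomposition (the function name and the way the inputs are passed are of course arbitrary and not part of the analysis of this paper):

from math import sqrt

def sigma_nP_contributions(E_pi1, s_pi1, E_pi2, s_pi2, p_s, n_s):
    # 'random' (binomial) contribution
    sigma_R  = sqrt(E_pi1*(1 - E_pi1)*p_s*n_s + E_pi2*(1 - E_pi2)*(1 - p_s)*n_s)
    # contributions of the uncertainties on pi_1 and pi_2
    sigma_p1 = s_pi1 * p_s * n_s
    sigma_p2 = s_pi2 * (1 - p_s) * n_s
    # quadratic combination ('oplus')
    total = sqrt(sigma_R**2 + sigma_p1**2 + sigma_p2**2)
    return sigma_R, sigma_p1, sigma_p2, total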

This quadratic combination of the contributions can easily be extended, simply dividing by $n_s$, to the uncertainty on the fraction of positives, thus getting

$$\sigma(f_P) \;\approx\; \sigma_R(f_P) \oplus \sigma_{\pi_1}(f_P) \oplus \sigma_{\pi_2}(f_P)\,, \qquad\qquad (52)$$

that is, the quadratic sum of
$$\begin{array}{rcll}
\sigma_R(f_P) &=& \sqrt{\mbox{E}(\pi_1)\cdot (1-\mbox{E}(\pi_1))\cdot p_s
+ \mbox{E}(\pi_2)\cdot (1-\mbox{E}(\pi_2))\cdot (1-p_s)}\,\big/\sqrt{n_s} & \qquad(53)\\
\sigma_{\pi_1}(f_P) &=& \sigma(\pi_1)\cdot p_s & \qquad(54)\\
\sigma_{\pi_2}(f_P) &=& \sigma(\pi_2)\cdot (1-p_s)\,. & \qquad(55)
\end{array}$$
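The corresponding sketch for the fraction of positives is the following (again purely illustrative: the expected values 0.978 and 0.115 plugged in for $\pi_1$ and $\pi_2$ in the usage line are placeholders, to be replaced by the values actually adopted in the rest of the paper):

from math import sqrt

def sigma_fP_contributions(E_pi1, s_pi1, E_pi2, s_pi2, p_s, n_s):
    sigma_R  = sqrt(E_pi1*(1 - E_pi1)*p_s + E_pi2*(1 - E_pi2)*(1 - p_s)) / sqrt(n_s)
    sigma_p1 = s_pi1 * p_s
    sigma_p2 = s_pi2 * (1 - p_s)
    total = sqrt(sigma_R**2 + sigma_p1**2 + sigma_p2**2)
    return sigma_R, sigma_p1, sigma_p2, total

# for example, with p_s = 0.1 and n_s = 10000:
print(sigma_fP_contributions(0.978, 0.007, 0.115, 0.022, 0.1, 10000))
# -> roughly (0.0031, 0.0007, 0.0198, 0.0200)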

We see immediately, for example, that for $p_s$ around 0.1 the contribution due to $\pi_2$ dominates over that due to $\pi_1$ by a factor $0.022/0.007\times 0.9/0.1 \approx 30$. This allows us to evaluate, on the basis of the Monte Carlo results shown in Tab. [*], the contribution due to the systematic effects alone. For example, for our customary values of $p_s=0.1$ and $n_s=10000$, we get $\sigma(f_P)$ equal to 0.003 and 0.020, neglecting and taking into account the uncertainties on $\pi_1$ and $\pi_2$, respectively. Assuming a quadratic combination, the contribution due to systematics is then $\sqrt{0.020^2-0.003^2} = 0.0198$. Besides questions of rounding,$^{32}$ it is clear that the uncertainty is largely dominated by the uncertainty on $\pi_1$ and $\pi_2$. We can check this result by a direct, although approximated, calculation using Eqs. ([*]) and ([*]):
$$\begin{array}{rcl}
\sigma_{\pi_1}(f_P) &=& 0.007\times 0.1 = 0.0007 \\
\sigma_{\pi_2}(f_P) &=& 0.022\times 0.9 = 0.0198 \\
\sigma_{\pi_1}(f_P) \oplus \sigma_{\pi_2}(f_P) &\approx& \sigma_{\pi_2}(f_P) = 0.0198\,,
\end{array}$$

getting the same result.
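The same cross-check in code (a trivial sketch, using the rounded values quoted above):

from math import sqrt

sigma_tot, sigma_R = 0.020, 0.003        # values quoted above for p_s = 0.1, n_s = 10000
print(sqrt(sigma_tot**2 - sigma_R**2))   # 0.0198: systematic contribution by quadrature subtraction
print(0.022 * 0.9)                       # 0.0198: sigma_pi2(f_P) evaluated directly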

Looking at the numbers of Tab. [*], we see that this effect already sets in at $n_s=1000$. For example, for $p_s=0.1$ we get $\sqrt{0.022^2-0.010^2}=0.0196$, twice the standard uncertainty of 0.010 due to the binomials alone. The sample size at which the two contributions have the same weight in the global uncertainty is around 300 (for example, for $p_s=0.1$ we get $\sqrt{0.026^2-0.018^2} = 0.019$). The take-home message is, at this point, rather clear (and well known to physicists and other scientists): unless we are able to make our knowledge about $\pi_1$ and $\pi_2$ more accurate, using sample sizes much larger than 1000 is only a waste of time.

However, there is still another important effect we need to consider, due to the fact that we are indeed sampling a population. This effect unavoidably leads to extra variability and therefore to a new contribution to the uncertainty in prediction (which will somehow be reflected in the uncertainty of the inferential process).

Before moving to this other important effect, let us exploit the approximated evaluation of $\sigma(f_P)$ a bit further. For example, solving with respect to $n_s$ the condition

$$\sigma_R(f_P) \;=\; \sigma_{\pi_1}(f_P) \oplus \sigma_{\pi_2}(f_P)$$

we get from Eqs. ([*])-([*])
$$n_s^* \;\approx\; \frac{\mbox{E}(\pi_1)\cdot (1-\mbox{E}(\pi_1))\cdot p_s
+ \mbox{E}(\pi_2)\cdot (1-\mbox{E}(\pi_2))\cdot (1-p_s)}
{\sigma^2(\pi_1)\cdot p_s^2 + \sigma^2(\pi_2)\cdot (1-p_s)^2}\,, \qquad\qquad (56)$$

which gives a rough idea of the sample size above which the uncertainty due to systematics starts to dominate. For example, for $p_s=0.1$ we get $n_s^*\approx 240$, of the same order of magnitude as the value ($\approx 300$) obtained from the Monte Carlo study. If we require, to be safe, $\sigma_{\pi_1}(f_P) \oplus \sigma_{\pi_2}(f_P) = (2\mbox{--}3)\times \sigma_R(f_P)$, we get $n_s\approx 1000$ and $n_s\approx 2200$, again in reasonable agreement with the results of Tab. [*]. We shall go through a more complete analysis of $n_s^*$ in Sec. [*], in which a further contribution to the uncertainty will also be taken into account.
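In code, Eq. (56) and the `safety factor' argument read, for instance, as follows (once more, the values of $\mbox{E}(\pi_1)$, $\sigma(\pi_1)$, $\mbox{E}(\pi_2)$ and $\sigma(\pi_2)$ are plugged in purely for illustration):

def n_s_star(E_pi1, s_pi1, E_pi2, s_pi2, p_s):
    # sample size at which random and systematic contributions are equal [Eq. (56)]
    num = E_pi1*(1 - E_pi1)*p_s + E_pi2*(1 - E_pi2)*(1 - p_s)
    den = s_pi1**2 * p_s**2 + s_pi2**2 * (1 - p_s)**2
    return num / den

ns = n_s_star(0.978, 0.007, 0.115, 0.022, 0.1)
print(round(ns))                  # about 240
# sigma_R(f_P) scales as 1/sqrt(n_s), so requiring the systematic part to be
# k times sigma_R(f_P) corresponds to n_s = k**2 * n_s_star:
print(round(4*ns), round(9*ns))   # about 960 and 2150, for k = 2 and k = 3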


