The Covid-19 outbreak of these months has raised new interest in data analysis,
especially among lay people, locked down for a long time and flooded
by a tidal wave of numbers whose meaning has often
been rather unclear,
including the body counts, which should in principle
be the easiest to assess.
Like practically anyone with some experience in data
analysis, we too were tempted - we have to confess -
to build models in order to
understand what was going on and, especially, to forecast future numbers.
But we gave up almost immediately, and not only because
we were faced with numbers that were not really meaningful, lacking clear
indications, within reasonable uncertainty, of how they had been obtained.
The basic point, we soon realized, is that
a virus spreading through a human population cannot be treated
like a bacterial colony in a homogeneous medium,
or like a continuous (or discretized) thermodynamic system.
People live - fortunately! -
in far more complex communities (`clusters'), starting from families,
villages and suburbs, and then cities, regions, countries and
continents with different characteristics, population densities
and social behaviors. One would then have to take into account
`osmosis' of various kinds among the clusters,
due to local, intermediate and long-distance movements
of individuals, not to speak of the diffusion properties
of viruses in general and of this one in particular.
A related problem, which would further complicate the model,
was the fact that tests were applied,
at least at the beginning of the pandemic,
mainly to people showing evident symptoms
or at risk for other reasons, such as health-care personnel.
We therefore soon began asking ourselves
why tests were not also performed on
a representative sample of the entire population,
independently of the presence or absence of symptoms.
This would be, in our opinion,
the best way to get an idea of the proportion of the
population affected at a given
`instant' (to be understood as one or a few days)
and to take decisions accordingly.
It is quite obvious that surveys of this kind
would require rather fast and inexpensive tests,
to the detriment of their quality, thus unavoidably yielding
a non-negligible fraction of so-called false positives
and false negatives.
When we read in a newspaper [16]
about a rather cheap
antibody blood test able to tag
individuals who are or have been infected, we decided to make some exercises in order to understand
whether such a `low quality' test would be adequate for the purpose,
and what sample size would be required in order
to get `snapshots' of a population at regular times.
In fact, Ref. [16]
not only reported the relevant `probabilities',
namely 98% to tag an Infected (presently or previously) as Positive
(`sensitivity') and 88% to tag a not-Infected as Negative
(`specificity'), but also the
numbers of tests from which these two figures resulted.
This extra information is important in order to understand how
believable the two numbers are and how to propagate their uncertainty,
together with other sources of uncertainty, into the other numbers of interest.
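To give a rough idea of why those counts matter, consider the following minimal R sketch; the counts used here are hypothetical, not the ones reported in Ref. [16]. Starting from a uniform prior, a sensitivity estimated from x positive results out of n truly infected samples has a Beta(x+1, n-x+1) posterior, whose width quantifies how believable the quoted 98% actually is.

```r
# Illustrative sketch with hypothetical counts (NOT the ones of Ref. [16]):
# uncertainty about a 98% sensitivity estimated from a finite number of tests.
x <- 98; n <- 100   # hypothetical: 98 positives out of 100 infected samples
curve(dbeta(s, x + 1, n - x + 1), xname = "s", from = 0.85, to = 1,
      xlab = "sensitivity", ylab = "probability density")
round(qbeta(c(0.025, 0.975), x + 1, n - x + 1), 3)   # approximate 95% interval
```

Had the same 98% resulted from ten times as many tests, the posterior would be correspondingly narrower: this is precisely the information conveyed by the counts.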
This convinced us to go through the exercise
of understanding how the main uncertainties of the problem
would affect the conclusions:
- uncertainty due to sampling;
- uncertainty due to the fact that the above
probabilities differ from 1;
- uncertainty about the exact values of these
`probabilities'.
Experts might argue that other sources of uncertainty should
be considered, but our point was that clarifying some issues
related to the above contributions would already be of some interest.
From the probabilistic point of view there is another
source of uncertainty to be taken into account,
namely the prior distribution
of the proportion of infectees in the population,
although it is not as important here as it is when we have
to judge from a single test whether an individual is infected or not.
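To fix ideas before entering into the details, here is a minimal R sketch of such a single-individual judgment, using the nominal performances quoted above; the 10% prior proportion of infectees is an arbitrary assumption of ours, used only for illustration.

```r
# Minimal illustrative sketch (not one of the paper's scripts): Bayes' theorem
# applied to a single tested individual, with the nominal test performances.
sensitivity <- 0.98   # P(Positive | Infected)
specificity <- 0.88   # P(Negative | not Infected)
prior       <- 0.10   # assumed P(Infected) before the test (arbitrary)

p_positive <- sensitivity * prior + (1 - specificity) * (1 - prior)
p_inf_pos  <- sensitivity * prior / p_positive                        # P(Inf | Pos)
p_inf_neg  <- (1 - sensitivity) * prior /
              ((1 - sensitivity) * prior + specificity * (1 - prior)) # P(Inf | Neg)
cat(sprintf("P(Inf | Pos) = %.2f   P(Inf | Neg) = %.4f\n", p_inf_pos, p_inf_neg))
```

With these numbers a positive result raises the probability of being (or having been) infected from 10% to about 48%, while a negative result lowers it to about 0.3%; the strong dependence on the assumed prior is exactly the point developed in the following sections.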
The paper, written with didactic intent (and we have to admit that writing it was useful to clarify
some issues even to ourselves), is organized in the following way.
- Section shows some simple evaluations based on the
nominal capabilities of the test, without entering into the
probabilistic treatment of the problem. The limitations
of such `rough reasoning' become immediately clear.
- Then we move in Sec. to probabilistic
reasoning, applied to the probability
that a person tagged as positive/negative `is' (or `has been') really
infected or not infected. The probabilistic tool needed
to perform this so-called `probabilistic inversion' (Bayes' theorem)
is then recalled and applied, showing the relevance
of the probability that the individual is infected or not,
based on other pieces of information/knowledge
(`prior probability'), a fundamental ingredient
of inference that is often overlooked.
- The effect of the uncertainties about sensitivity, specificity
and the proportion of infectees in the population
is discussed in Sec. .
But, before doing that, we have to model the probability density
functions of these uncertain quantities. Hence an introduction
to the application of Bayes' theorem
to continuous quantities is required,
including some notes on the use of conjugate priors.
- From Sec. we switch our focus
from single individuals to populations. Our aim,
that is inferring the proportion of `infectees'
(meaning, let us repeat it once more,
`individuals being or having been infected'), will be reached
in Secs. and .
But, for didactic purposes, we proceed step by step, starting from
the expected number of positives and examining in depth the various
sources of uncertainty.
In particular, in Sec.
we study the measurability of the proportion of infectees and the dependence
of its `resolution power' on the test performances and
the sample size. Most of the work is done using Monte Carlo methods,
but some useful approximate formulae
for the evaluation of the uncertainty of the result are given as well
(a minimal Monte Carlo sketch in this spirit is shown right after this outline).
- The probabilistic inference of the proportion of infectees, that is the evaluation of its probability
density function conditioned on the data and on well-stated hypotheses,
is finally done in Sec. . Having to solve
a multidimensional problem, in which the quantity of interest is eventually obtained by
marginalization, Markov Chain Monte Carlo (MCMC) methods
become a must. In particular, we use JAGS [24],
interfaced with R [25] through the package
rjags [26]. With the help of JAGS we also evaluate some joint
probability distributions and the correlation coefficients among
the variables of interest, thus showing the great power of
MCMC methods, which have given a decisive boost to Bayesian inference
in the past decades.
- However, we show in Sec. how to solve
the problem exactly, although not in closed form, limiting ourselves
to the pdf of the proportion of infectees. A simple extension of
the expression of the normalization constant allows us to evaluate
the first moments of the distribution, from which expected value,
variance, skewness and kurtosis can be computed
(and then an approximation of the pdf can be `reconstructed').
- An important issue, also of practical relevance,
is the inference of the proportions of infectees in
different populations, analyzed in Sec. ,
after having been anticipated in Sec. .
In fact, since the uncertainties about sensitivity and specificity
act as systematic errors (hereafter `systematics'),
the differences between these proportions
can be determined better than the individual proportions themselves.
- The role of the prior in the inference of the proportion of infectees, already
analyzed in detail
in Sec. , is discussed again
in Sec. , with particular emphasis
on the case in which priors are at odds `with the data'
(in the sense specified there). The take-away message
is to be very careful about taking
`comfortable' mathematical models literally, never forgetting
the quotes by Laplace and Box recalled on the front page.
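Before moving on, and as a preview of the kind of simulations mentioned in the outline above, the following minimal R sketch (written for this summary, not taken from the scripts of Appendix B; all numerical values are illustrative assumptions) propagates the sampling uncertainty and the uncertainty about sensitivity and specificity into the number of individuals tagged as positive in a sample:

```r
# Minimal illustrative sketch (not one of the paper's scripts): forward
# Monte Carlo propagation of the sampling uncertainty and of the uncertainty
# about sensitivity and specificity into the number of tagged positives.
set.seed(123)
n     <- 10000     # sample size (illustrative)
p_inf <- 0.10      # assumed proportion of infectees in the population
n_mc  <- 100000    # number of Monte Carlo iterations

# Uncertain sensitivity and specificity, modeled here by Beta distributions
# roughly centered on the nominal 98% and 88% (parameters are our guesses,
# not those derived in the paper from the counts reported in Ref. [16]).
sens <- rbeta(n_mc, 98, 2)
spec <- rbeta(n_mc, 88, 12)

n_infected <- rbinom(n_mc, n, p_inf)                  # infectees in the sample
n_pos      <- rbinom(n_mc, n_infected, sens) +        # true positives
              rbinom(n_mc, n - n_infected, 1 - spec)  # false positives

cat(sprintf("tagged positives: mean = %.0f, std = %.0f\n", mean(n_pos), sd(n_pos)))
```

Inverting this forward simulation, that is inferring the proportion of infectees from the observed number of positives, is precisely what JAGS and the exact treatment are used for in the sections listed above.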
Two appendixes complete the paper. Appendix A is
a kind of summary of `Bayesian formulae', with emphasis
on the practical importance of the unnormalized posteriors obtained
by a suitable use of the so-called chain rule of probability
theory, on which
most Monte Carlo methods to perform Bayesian inference are based.
In Appendix B several R scripts are provided in order to allow the reader
to reproduce most of the results presented in the paper.