Introduction

The Covid-19 outbreak of recent months has raised new interest in data analysis, especially among lay people, who were locked down for a long time and flooded by a tidal wave of numbers whose meaning has often been rather unclear, including the death counts, which should in principle be the easiest to assess. Like practically anyone with some experience in data analysis, we too were tempted, we have to confess, to build some models in order to understand what was going on, and especially to forecast future numbers. But we gave up immediately, and not only because we were faced with numbers that were not really meaningful without clear information, within reasonable uncertainty, about how they had been obtained. The basic problem, we soon realized, is that a virus spreading in a human population cannot be treated like a bacterial colony in a homogeneous medium, or like a continuous (or discretized) thermodynamic system. People live, fortunately, in far more complex communities (`clusters'), starting from families, villages and suburbs, then cities, regions, countries and continents with different characteristics, population densities and social behaviors. We would then have to take into account `osmosis' of different kinds among the clusters, due to local, intermediate and long-distance movements of individuals, not to speak of the diffusion properties of viruses in general and of this one in particular.

A related problem, which would further complicate the model, was that tests were applied, at least at the beginning of the pandemic, mainly to people showing evident symptoms or at risk for various reasons, such as health-care personnel. We therefore asked ourselves rather soon why tests were not also performed on a possibly representative sample of the entire population, independently of the presence of symptoms. This would be, in our opinion, the best way to get an idea of the proportion of the population affected at a given `instant' (to be understood as one or a few days) and to take decisions accordingly. Clearly, surveys of this kind would require rather fast and inexpensive tests, to the detriment of their quality, thus unavoidably yielding a non-negligible fraction of so-called false positives and false negatives.

When we read in a newspaper [16] about a rather cheap antibody blood test able to tag individuals who are or have been infected, we decided to do some exercises in order to understand whether such a `low quality' test would be adequate for the purpose, and what sample size would be required in order to get `snapshots' of a population at regular times. In fact, Ref. [16] reported not only the relevant `probabilities', namely 98% to tag an Infected (presently or previously) as Positive (`sensitivity') and 88% to tag a not-Infected as Negative (`specificity'), but also the numbers of tests from which these two values resulted. This extra information is important to understand how believable these two numbers are and how to propagate their uncertainty into the other quantities of interest, together with other sources of uncertainty. This convinced us to go through the exercise of understanding how the main uncertainties of the problem would affect the conclusions:

Experts might argue that other sources of uncertainty should be considered, but our point was that clarifying some issues related to the above contributions would already be of some interest. From the probabilistic point of view, there is another source of uncertainty to be taken into account, namely the prior distribution of the proportion of infectees in the population, although it is not as important here as the prior is when we have to judge from a single test whether an individual is infected or not.
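To make the role of the prior in the single-individual case concrete, the following minimal Python sketch applies Bayes' theorem using the nominal sensitivity (98%) and specificity (88%) quoted above, taken as exact. It is only an illustration: the function names and the prior values are hypothetical choices made here, the uncertainty on sensitivity and specificity is deliberately ignored, and the code is not one of the R scripts of Appendix B.

```python
# Illustrative sketch only: sensitivity and specificity are the nominal values
# quoted in the text, treated as exact (their uncertainty is ignored here).
SENSITIVITY = 0.98  # P(Positive | Infected)
SPECIFICITY = 0.88  # P(Negative | not Infected)


def fraction_positive(p, sensitivity=SENSITIVITY, specificity=SPECIFICITY):
    """Expected fraction of Positives when a proportion p of the population is infected."""
    return sensitivity * p + (1.0 - specificity) * (1.0 - p)


def prob_infected_given_positive(prior, sensitivity=SENSITIVITY, specificity=SPECIFICITY):
    """Bayes' theorem: P(Infected | Positive) for an individual with the given prior."""
    true_positive = sensitivity * prior
    false_positive = (1.0 - specificity) * (1.0 - prior)
    return true_positive / (true_positive + false_positive)


if __name__ == "__main__":
    # Hypothetical prior probabilities of infection, chosen only for illustration.
    for prior in (0.01, 0.10, 0.50):
        posterior = prob_infected_given_positive(prior)
        expected = fraction_positive(prior)
        print(f"prior = {prior:.2f}  P(Inf | Pos) = {posterior:.3f}  "
              f"expected fraction of Positives = {expected:.3f}")
```

With these nominal values, an individual whose prior probability of infection is 10% still has less than a 50% probability of being infected after a single positive result, because false positives from the much larger not-infected group outnumber the true positives. This is why the prior matters so much when judging a single test.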

The paper, written with didactic intent (and we have to admit that writing it helped to clarify some issues even to us), is organized in the following way.

Two appendixes complete the paper. Appendix A is a kind of summary of `Bayesian formulae', with emphasis on the practical importance of unnormalized posteriors, which are obtained by a suitable application of the so-called chain rule of probability theory and on which most Monte Carlo methods for Bayesian inference are based. In Appendix B several R scripts are provided to allow the reader to reproduce most of the results presented in the paper.