The Covid-19 outbreak of these months has raised new interest in data analysis,
especially among lay people, locked down for a long time and flooded
by a tidal wave of numbers whose meaning has often
been rather unclear,
including the body counts, which should in principle
be the easiest to assess.
Like practically anyone with some experience in data
analysis, we too were tempted - we have to confess -
to build models in order to
understand what was going on and, especially, to forecast future numbers.
But we gave up almost immediately, and not only because
we were faced with numbers that were not really meaningful, lacking clear
indications, within reasonable uncertainty, of how they had been obtained.
The basic point, we soon realized, is that
a virus spreading through a human population cannot be treated
like a bacterial colony in a homogeneous medium,
or like a continuous (or discretized) thermodynamic system.
People live - fortunately! -
in far more complex communities (`clusters'), starting from families,
villages and suburbs, and then cities, regions, countries and
continents with different characteristics, population densities
and social behaviors. One would then have to take into account
`osmosis' of various kinds among the clusters,
due to local, intermediate and long-distance movements
of individuals, not to speak of the diffusion properties
of viruses in general and of this one in particular.
A related problem, which would further complicate the model,
was the fact that tests were applied,
at least at the beginning of the pandemic,
mainly to people showing evident symptoms
or at risk for other reasons, such as health-care personnel.
We therefore soon began asking ourselves
why tests were not also performed on
a representative sample of the entire population,
independently of the presence or absence of symptoms.
This would be, in our opinion,
the best way to get an idea of the proportion of the
population affected at a given
`instant' (to be understood as one or a few days)
and to take decisions accordingly.
It is quite obvious that surveys of this kind
would require rather fast and inexpensive tests,
to the detriment of their quality, thus unavoidably yielding
a non-negligible fraction of so-called false positives
and false negatives.
When we read in a newspaper [16]
about a rather cheap
antibody blood test able to tag
individuals who are or have been infected, we decided to make some exercises in order to understand
whether such a `low quality' test would be adequate for the purpose,
and what sample size would be required in order
to get `snapshots' of a population at regular times.
In fact, Ref. [16]
not only reported the relevant `probabilities',
namely 98% to tag an Infected (presently or previously) as Positive
(`sensitivity') and 88% to tag a not-Infected as Negative
(`specificity'), but also the
numbers of tests from which these two figures resulted.
This extra information is important in order to understand how
believable the two numbers are and how to propagate their uncertainty,
together with other sources of uncertainty, into the other numbers of interest.
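To give a rough idea of why those counts matter, consider the following minimal R sketch; the counts used here are hypothetical, not the ones reported in Ref. [16]. Starting from a uniform prior, a sensitivity estimated from x positive results out of n truly infected samples has a Beta(x+1, n-x+1) posterior, whose width quantifies how believable the quoted 98% actually is.

```r
# Illustrative sketch with hypothetical counts (NOT the ones of Ref. [16]):
# uncertainty about a 98% sensitivity estimated from a finite number of tests.
x <- 98; n <- 100   # hypothetical: 98 positives out of 100 infected samples
curve(dbeta(s, x + 1, n - x + 1), xname = "s", from = 0.85, to = 1,
      xlab = "sensitivity", ylab = "probability density")
round(qbeta(c(0.025, 0.975), x + 1, n - x + 1), 3)   # approximate 95% interval
```

Had the same 98% resulted from ten times as many tests, the posterior would be correspondingly narrower: this is precisely the information conveyed by the counts.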
This convinced us to go through the exercise
of understanding how the main uncertainties of the problem
would affect the conclusions:
- uncertainty due to sampling;
- uncertainty due to the fact that the above
probabilities differ from 1;
- uncertainty about the exact values of these
`probabilities'.
Experts might argue that other sources of uncertainty should
be considered, but our point was that clarifying some issues
related to the above contributions would already be of some interest.
From the probabilistic point of view there is another
source of uncertainty to be taken into account,
namely the prior distribution
of the proportion of infectees in the population,
although it is not as important here as it is when we have
to judge from a single test whether an individual is infected or not.
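To fix ideas before entering into the details, here is a minimal R sketch of such a single-individual judgment, using the nominal performances quoted above; the 10% prior proportion of infectees is an arbitrary assumption of ours, used only for illustration.

```r
# Minimal illustrative sketch (not one of the paper's scripts): Bayes' theorem
# applied to a single tested individual, with the nominal test performances.
sensitivity <- 0.98   # P(Positive | Infected)
specificity <- 0.88   # P(Negative | not Infected)
prior       <- 0.10   # assumed P(Infected) before the test (arbitrary)

p_positive <- sensitivity * prior + (1 - specificity) * (1 - prior)
p_inf_pos  <- sensitivity * prior / p_positive                        # P(Inf | Pos)
p_inf_neg  <- (1 - sensitivity) * prior /
              ((1 - sensitivity) * prior + specificity * (1 - prior)) # P(Inf | Neg)
cat(sprintf("P(Inf | Pos) = %.2f   P(Inf | Neg) = %.4f\n", p_inf_pos, p_inf_neg))
```

With these numbers a positive result raises the probability of being (or having been) infected from 10% to about 48%, while a negative result lowers it to about 0.3%; the strong dependence on the assumed prior is exactly the point developed in the following sections.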
The paper, written with didactic intent (and we have to admit that writing it was useful to clarify
some issues even to ourselves), is organized in the following way.
- Section shows some simple evaluations based on the
nominal capabilities of the test, without entering into the
probabilistic treatment of the problem. The limitations
of such `rough reasoning' become immediately clear.
- Then we move in Sec. to probabilistic
reasoning, applied to the probability
that a person tagged as positive/negative `is' (or `has been') really
infected or not infected. The probabilistic tool needed
to perform this so-called `probabilistic inversion' (Bayes' theorem)
is then recalled and applied, showing the relevance
of the probability that the individual is infected or not,
based on other pieces of information/knowledge
(`prior probability'), a fundamental ingredient
of inference that is often overlooked.
- The effect of the uncertainties about sensitivity, specificity
and the proportion of infectees in the population
is discussed in Sec. .
But, before doing that, we have to model the probability density
functions of these uncertain quantities. Hence an introduction
to the application of Bayes' theorem
to continuous quantities is required,
including some notes on the use of conjugate priors.
- From Sec. we switch our focus
from single individuals to populations. Our aim,
that is inferring the proportion of `infectees'
(meaning, let us repeat it once more,
`individuals being or having been infected'), will be reached
in Secs. and .
But, for didactic purposes, we proceed step by step, starting from
the expected number of positives and examining in depth the various
sources of uncertainty.
In particular, in Sec.
we study the measurability of the proportion of infectees and the dependence
of its `resolution power' on the test performances and
the sample size. Most of the work is done using Monte Carlo methods,
but some useful approximate formulae
for the evaluation of the uncertainty of the result are given as well
(a minimal Monte Carlo sketch in this spirit is shown right after this outline).
- The probabilistic inference of the proportion of infectees, that is the evaluation of its probability
density function conditioned on the data and on well-stated hypotheses,
is finally done in Sec. . Having to solve
a multidimensional problem, in which the quantity of interest is eventually obtained by
marginalization, Markov Chain Monte Carlo (MCMC) methods
become a must. In particular, we use JAGS [24],
interfaced with R [25] through the package
rjags [26]. With the help of JAGS we also evaluate some joint
probability distributions and the correlation coefficients among
the variables of interest, thus showing the great power of
MCMC methods, which have given a decisive boost to Bayesian inference
in the past decades.
- However, we show in Sec. how to solve
the problem exactly, although not in closed form, limiting ourselves
to the pdf of the proportion of infectees. A simple extension of
the expression of the normalization constant allows us to evaluate
the first moments of the distribution, from which expected value,
variance, skewness and kurtosis can be computed
(and then an approximation of the pdf can be `reconstructed').
- An important issue, also of practical relevance,
is the inference of the proportions of infectees in
different populations, analyzed in Sec. ,
after having been anticipated in Sec. .
In fact, since the uncertainties about sensitivity and specificity
act as systematic errors (hereafter `systematics'),
the differences between these proportions
can be determined better than the individual proportions themselves.
- The role of the prior in the inference of the proportion of infectees, already
analyzed in detail
in Sec. , is discussed again
in Sec. , with particular emphasis
on the case in which priors are at odds `with the data'
(in the sense specified there). The take-away message
is to be very careful about taking
`comfortable' mathematical models literally, never forgetting
the quotes by Laplace and Box recalled on the front page.
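Before moving on, and as a preview of the kind of simulations mentioned in the outline above, the following minimal R sketch (written for this summary, not taken from the scripts of Appendix B; all numerical values are illustrative assumptions) propagates the sampling uncertainty and the uncertainty about sensitivity and specificity into the number of individuals tagged as positive in a sample:

```r
# Minimal illustrative sketch (not one of the paper's scripts): forward
# Monte Carlo propagation of the sampling uncertainty and of the uncertainty
# about sensitivity and specificity into the number of tagged positives.
set.seed(123)
n     <- 10000     # sample size (illustrative)
p_inf <- 0.10      # assumed proportion of infectees in the population
n_mc  <- 100000    # number of Monte Carlo iterations

# Uncertain sensitivity and specificity, modeled here by Beta distributions
# roughly centered on the nominal 98% and 88% (parameters are our guesses,
# not those derived in the paper from the counts reported in Ref. [16]).
sens <- rbeta(n_mc, 98, 2)
spec <- rbeta(n_mc, 88, 12)

n_infected <- rbinom(n_mc, n, p_inf)                  # infectees in the sample
n_pos      <- rbinom(n_mc, n_infected, sens) +        # true positives
              rbinom(n_mc, n - n_infected, 1 - spec)  # false positives

cat(sprintf("tagged positives: mean = %.0f, std = %.0f\n", mean(n_pos), sd(n_pos)))
```

Inverting this forward simulation, that is inferring the proportion of infectees from the observed number of positives, is precisely what JAGS and the exact treatment are used for in the sections listed above.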
Two appendixes complete the paper. Appendix A is
a kind of summary of `Bayesian formulae', with emphasis
on the practical importance of the unnormalized posteriors obtained
by a suitable use of the so-called chain rule of probability
theory, on which
most Monte Carlo methods to perform Bayesian inference are based.
In Appendix B several R scripts are provided in order to allow the reader
to reproduce most of the results presented in the paper.