

Forward to the past: probabilistic reasoning

The dominant school in statistics since the beginning of the last century is based on a rather unnatural approach to probability, in contrast to that of the founding fathers (Poisson, Bernoulli, Bayes, Laplace, Gauss, etc.). In this approach (frequentism) there is no room for the concept of probability of causes, probability of hypotheses, probability of the values of physical quantities, and so on. Problems concerning the probability of causes (``the essential problem of the experimental method'' [4]) have been replaced by the machinery of hypothesis tests. But people think naturally in terms of probability of causes, and the mismatch between natural thinking and standard education in statistics leads to the troubles discussed above.

I think that the way out is simply to go back to the past. In our time of rushed progress an invitation to return to century-old ideas seems at least odd (imagine a similar proposal regarding physics, chemistry or biology!). I admit it, but I do think it is the proper way to follow. This does not mean we have to drop everything done in probability and statistics in between. Most of the mathematical work can easily be recovered. In particular, we can benefit from the theoretical clarifications and progress in probability theory of the past century. We also take great advantage of the recent boost in computational capability, from which both symbolic and numeric methods have enormously benefited. (In fact, many frequentist ideas had their raison d'être in the computational barrier that the original probabilistic approach met. Many simplified - though often simplistic - methods were then proposed to make the life of practitioners easier. But nowadays computation can no longer be used as an excuse.)

In summary, the proposed way out is an invitation to use probability theory consistently. But before doing so, one needs to review the definition of probability, otherwise it is simply impossible to exploit the full power of the theory. In the approach advocated here, probability quantifies how much we believe in something, i.e. we recover its intuitive meaning. Once this is done, we can essentially use the formal probability theory based on the Kolmogorov axioms (which can indeed be derived, with a better awareness of their meaning, from more general principles - but I shall not enter into this issue here).

This `new' approach is called Bayesian because of the central role played by Bayes' theorem in learning from experimental data. The theorem teaches us how the probability of each hypothesis $H_i$ has to be updated in the light of the new observation $E$:

$P(H_i\,\vert\,E,I) = \frac{P(E\,\vert\,H_i,I)\cdot P(H_i\,\vert\,I)}{P(E\,\vert\,I)}\,.$   (1)

$I$ stands for a background condition, or status of information, under which the inference is made. A form of Bayes' formula more frequently found in textbooks, valid if the hypotheses are exhaustive and mutually exclusive, is

$P(H_i\,\vert\,E,I) = \frac{P(E\,\vert\,H_i,I)\cdot P(H_i\,\vert\,I)}{\sum_j P(E\,\vert\,H_j,I)\cdot P(H_j\,\vert\,I)}\,.$   (2)
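As a minimal numerical illustration of Eq. (2), the sketch below updates the probabilities of a finite set of exhaustive, mutually exclusive hypotheses. The function name and the numbers are ours, chosen only for illustration, not taken from the text.

\begin{verbatim}
# Minimal sketch of Eq. (2): discrete Bayesian updating.
# The priors and likelihoods below are arbitrary illustrative numbers.

def bayes_update(priors, likelihoods):
    """Return P(H_i | E, I) from P(H_i | I) and P(E | H_i, I), Eq. (2)."""
    unnormalized = [p * l for p, l in zip(priors, likelihoods)]
    norm = sum(unnormalized)              # P(E|I) = sum_j P(E|H_j,I) P(H_j|I)
    return [u / norm for u in unnormalized]

priors = [0.5, 0.3, 0.2]                  # beliefs before observing E
likelihoods = [0.1, 0.4, 0.7]             # P(E | H_i, I)
print(bayes_update(priors, likelihoods))  # ~ [0.16, 0.39, 0.45]
\end{verbatim}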

The denominator on the right hand side of (2) is just a normalization factor and, as such, it can be dropped when only relative beliefs matter. Moreover, it is possible to show that a similar structure holds for probability density functions (p.d.f.'s) when a continuous variable is considered ($\mu$ stands here for a generic `true value', associated with a parameter of a model). Calling `data' the overall effect $E$, we get the following formulae on which inference is to be grounded:
$P(H_i\,\vert\,\mbox{data},I) \propto P(\mbox{data}\,\vert\,H_i,I)\cdot P(H_i\,\vert\,I)$   (3)

$f(\mu\,\vert\,\mbox{data},I) \propto f(\mbox{data}\,\vert\,\mu,I)\cdot f(\mu\,\vert\,I)\,,$   (4)

where the first formula is used in the probabilistic comparison of hypotheses and the second (mainly) in parametric inference. In both cases we have the same structure:
$\mbox{\bf posterior} \propto \mbox{\bf likelihood} \times \mbox{\bf prior}\,,$   (5)

where `posterior' and `prior' refer to our belief in that hypothesis, respectively taking or not taking into account the `data' on which the present inference is based. The likelihood, that is ``how much we believe that the hypothesis can produce the data'' (not to be confused with ``how much we believe that the data come from the hypothesis''!), models the stochastic flow that leads from the hypothesis to the observations, including the best modeling of the detector response. The structure of (5) shows that inference based on Bayes' theorem automatically satisfies the likelihood principle (likelihoods that differ by constant factors lead to the same posterior).
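A quick numerical check of the last statement, reusing the bayes_update() sketch introduced above (again with made-up numbers): rescaling all likelihoods by a common constant leaves the normalized posterior unchanged.

\begin{verbatim}
# Likelihood principle in practice: a common constant factor in the
# likelihoods cancels in the normalization, so the posterior is unchanged.

priors = [0.2, 0.5, 0.3]
likelihoods = [0.10, 0.40, 0.25]          # arbitrary illustrative values
scaled = [10.0 * l for l in likelihoods]  # same likelihoods, times 10

print(bayes_update(priors, likelihoods))  # ~ [0.068, 0.678, 0.254]
print(bayes_update(priors, scaled))       # identical posteriors
\end{verbatim}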

The proportionality factors in (3) and (4) are determined by normalization, if absolute probabilities are needed. Otherwise we can simply focus on probability ratios:

$\frac{P(H_i\,\vert\,\mbox{data},I)}{P(H_j\,\vert\,\mbox{data},I)} = \frac{P(\mbox{data}\,\vert\,H_i,I)}{P(\mbox{data}\,\vert\,H_j,I)}\cdot\frac{P(H_i\,\vert\,I)}{P(H_j\,\vert\,I)}$   (6)

i.e.

$\mbox{\bf posterior odds} \propto \mbox{\bf Bayes factor} \times \mbox{\bf prior odds}\,:$   (7)

the odds are updated by the data through the ratio of the likelihoods, called the Bayes factor.
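The odds update of Eq. (7) is easy to turn into code. Here is a minimal sketch (the helper names are ours), which is reused in the numerical examples below.

\begin{verbatim}
# Sketch of the odds update of Eqs. (6)-(7).

def posterior_odds(bayes_factor, prior_odds):
    """Posterior odds = Bayes factor * prior odds."""
    return bayes_factor * prior_odds

def odds_to_probability(odds):
    """Convert odds o = P/(1-P) back into a probability P = o/(1+o)."""
    return odds / (1.0 + odds)
\end{verbatim}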

There are some well-known psychological (indeed cultural and even ideological) resistances to this approach, due to the presence of the priors in the theory. Some remarks are therefore in order:

To give some numerical examples, let us solve two of the problems met above. (In order to simplify the notation, the background condition `$I$' is not indicated explicitly in the following formulae.)
Solution of the AIDS problem (Example 4)

Applying Eq. (6) we get
$\frac{P(\mbox{HIV}\,\vert\,\mbox{Pos})}{P(\overline{\mbox{HIV}}\,\vert\,\mbox{Pos})} = \frac{P(\mbox{Pos}\,\vert\,\mbox{HIV})}{P(\mbox{Pos}\,\vert\,\overline{\mbox{HIV}})}\cdot \frac{P(\mbox{HIV})}{P(\overline{\mbox{HIV}})}\,.$   (8)

The Bayes factor $P(\mbox{Pos}\,\vert\,\mbox{HIV})/P(\mbox{Pos}\,\vert\,\overline{\mbox{HIV}})$ is equal to 1/0.002 = 500. This is how much the information provided by the data `pushes' towards the hypothesis `infected' with respect to the hypothesis `healthy'. If the ratio of priors were equal to 1 [i.e. $P(\mbox{HIV})=P(\overline{\mbox{HIV}})$!], we would get final odds of 500, i.e. $P(\mbox{HIV}\,\vert\,\mbox{Pos})=500/501= 99.8\%$. But, fortunately, for a randomly chosen Italian $P(\mbox{HIV})$ is not 50%. Putting in more reasonable numbers, say 1/600 or 1/700, we get final odds of 0.83 or 0.71, corresponding to a $P(\mbox{HIV}\,\vert\,\mbox{Pos})$ of 45% or 42%. We now understand the source of the mistake made by quite a few people when faced with this problem: their implicit priors were unreasonable! This is a typical situation: using Bayesian reasoning it is possible to expose the hidden assumptions of non-Bayesian reasoning, though most users of the latter methods object, insisting that they ``do not use priors''.
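The numbers above can be reproduced with the odds helpers sketched earlier; the prior values are those discussed in the text.

\begin{verbatim}
# Example 4 with the helpers sketched above.
bf = 1.0 / 0.002                         # P(Pos|HIV) / P(Pos|not HIV) = 500

for prior in (0.5, 1 / 600, 1 / 700):    # P(HIV) before the test
    prior_odds = prior / (1.0 - prior)
    post_odds = posterior_odds(bf, prior_odds)
    print(prior, post_odds, odds_to_probability(post_odds))
# P(HIV) = 0.5   -> odds 500,   P(HIV|Pos) ~ 99.8%
# P(HIV) = 1/600 -> odds ~ 0.8, P(HIV|Pos) ~ 45%
# P(HIV) = 1/700 -> odds ~ 0.7, P(HIV|Pos) ~ 42%
\end{verbatim}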
Solution of the three hypothesis problem (Example 1)

[Figure 4 (Example 1): likelihoods for the three different hypotheses; the vertical bar corresponds to the observation $x=3$.]
The Bayes factors between hypotheses $i$ and $j$, i.e. $BF_{i,j}=f(x=3\,\vert\,H_i)/f(x=3\,\vert\,H_j)$, are $BF_{2,1}=18$, $BF_{3,1}=25$ and $BF_{3,2}=1.4$. The observation $x=3$ favors models 2 and 3, but the resulting probabilities depend on the priors. Assuming prior equiprobability among the three generators, we get the following posterior probabilities for the three models: 2.3%, 41% and 57%. (Alternatively, we might know that the extraction mechanism does not choose among the three generators at random with equal probability, and the result would change accordingly.)
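Since only ratios of likelihoods matter, the posterior probabilities just quoted follow directly from the Bayes factors relative to $H_1$; a small check (with equal priors, so that the priors cancel):

\begin{verbatim}
# Example 1: posteriors from the Bayes factors BF_{i,1}, with flat priors.
bf_wrt_h1 = {1: 1.0, 2: 18.0, 3: 25.0}
norm = sum(bf_wrt_h1.values())
posteriors = {i: b / norm for i, b in bf_wrt_h1.items()}
print(posteriors)                        # ~ {1: 0.023, 2: 0.41, 3: 0.57}
\end{verbatim}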

Instead, if we performed an analysis based on the p-value, we would find that $H_1$ is ``excluded'' at a 99.87% C.L. or at a 99.7% C.L., depending on whether a one-tail or a two-tail test is done. Essentially, the perceived chance that $H_1$ could be the correct cause of $x=3$ is about 10-20 times smaller than that given by the Bayesian analysis. As far as the comparison between $H_2$ and $H_3$ is concerned, the p-value analysis is in practice inapplicable (what would you do?) and one simply says that both models describe the result about equally well, which is more or less what we get out of the Bayesian analysis. However, the latter analysis gives some quantitative information: a slight hint in favor of $H_3$, which could be properly combined with many other small hints coming from other pieces of experimental information and which, all together, might finally allow us to select one of the models.

