
Why do frequentistic hypothesis tests `often work'?

The problem of classifying hypotheses according to their credibility is natural in the Bayesian framework. Let us recall briefly the following way of drawing conclusions about two hypotheses in the light of some data:

$\displaystyle \frac{P(H_i\,\vert\,\mbox{Data})}{P(H_j\,\vert\,\mbox{Data})} = \frac{P(\mbox{Data}\,\vert\,H_i)}{P(\mbox{Data}\,\vert\,H_j)}\cdot \frac{P_\circ(H_i)}{P_\circ(H_j)}\,.$ (8.6)

This form is very convenient, because:

it is valid even if the hypotheses do not form a complete class [a necessary condition if, instead, one wants to give the result in the standard form of Bayes' theorem given by formula ()];
it shows that the Bayes factor is an unbiased way of reporting the result (especially if a different initial probability could substantially change the conclusions);
the Bayes factor depends only on the likelihoods of observed data and not at all on unobserved data (contrary to what happens in conventional statistics, where conclusions depend on the probability of all the configurations of data in the tails of the distribution). In other words, Bayes' theorem applies in the form (8.6) and not as

$\displaystyle \underbrace{\frac{P(H_i\,\vert\,\mbox{Data+Tail})}{P(H_j\,\vert\,\mbox{Data+Tail})} = \frac{P(\mbox{Data+Tail}\,\vert\,H_i)}{P(\mbox{Data+Tail}\,\vert\,H_j)}\cdot \frac{P_\circ(H_i)}{P_\circ(H_j)}}_{\bf\Large ?}\,;$
testing a single hypothesis does not make sense: one may talk of the probability of the Standard Model (SM) only if one is considering an Alternative Model (AM), thus getting, for example,

$\displaystyle \frac{P(\mbox{AM}\,\vert\,\mbox{Data})}{P(\mbox{SM}\,\vert\,\mbox{Data})} = \frac{P(\mbox{Data}\,\vert\,\mbox{AM})}{P(\mbox{Data}\,\vert\,\mbox{SM})}\cdot \frac{P_\circ(\mbox{AM})}{P_\circ(\mbox{SM})}\,:$

${P(\mbox{Data}\,\vert\,\mbox{SM})}$ can be arbitrarily small, but if there is not a reasonable alternative one has only to accept the fact that some events have been observed which are very far from the expectation value;
repeating what has been said several times, in the Bayesian scheme the conclusions depend only on observed data and on previous knowledge; in particular, they do not depend on
- how the data have been combined;
- data not observed and considered to be even rarer than the observed data;
- what the experimenter was planning to do before starting to take data. (I am referring to predefined fiducial cuts and the stopping rule, which, according to the frequentistic scheme, should be defined in the test protocol. Unfortunately I cannot discuss this matter here in detail, and I recommend reading Ref. [10].)
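The independence from the stopping rule can be illustrated with a toy case that is not taken from the text: a coin gives 9 heads and 3 tails, and the experimenter either fixed 12 tosses in advance or planned to stop at the third tail. The following is a minimal sketch under these assumptions; the two values of $\theta$ compared (0.75 and 0.5) and all function names are hypothetical choices made only for illustration.

```python
from math import comb

def binomial_likelihood(theta, heads, tails):
    """P(data | theta) when the total number of tosses was fixed in advance."""
    return comb(heads + tails, heads) * theta**heads * (1 - theta)**tails

def negative_binomial_likelihood(theta, heads, tails):
    """P(data | theta) when tossing stops at the `tails`-th tail."""
    return comb(heads + tails - 1, tails - 1) * theta**heads * (1 - theta)**tails

def binomial_tail(theta, heads, tails):
    """P(at least `heads` heads in heads + tails tosses | theta)."""
    n = heads + tails
    return sum(comb(n, k) * theta**k * (1 - theta)**(n - k)
               for k in range(heads, n + 1))

def negative_binomial_tail(theta, heads, tails):
    """P(at least `heads` heads before the `tails`-th tail | theta),
    i.e. at most tails - 1 tails in the first heads + tails - 1 tosses."""
    m = heads + tails - 1
    return sum(comb(m, k) * (1 - theta)**k * theta**(m - k)
               for k in range(tails))

HEADS, TAILS = 9, 3  # hypothetical observed data

# Bayes factor of theta = 0.75 versus theta = 0.5 under the two protocols:
# the combinatorial factors cancel, so the two numbers coincide.
bf_fixed_n = (binomial_likelihood(0.75, HEADS, TAILS)
              / binomial_likelihood(0.5, HEADS, TAILS))
bf_stop_at_tail = (negative_binomial_likelihood(0.75, HEADS, TAILS)
                   / negative_binomial_likelihood(0.5, HEADS, TAILS))

# Tail probabilities under theta = 0.5 include unobserved data and differ.
p_fixed_n = binomial_tail(0.5, HEADS, TAILS)                 # 299/4096
p_stop_at_tail = negative_binomial_tail(0.5, HEADS, TAILS)   # 67/2048

print(bf_fixed_n, bf_stop_at_tail)
print(p_fixed_n, p_stop_at_tail)
```

The likelihood ratio is the same for both protocols, whereas the tail areas fall on opposite sides of the conventional 5% threshold, although the observed sequence of tosses is identical.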

At this point we can finally reply to the question: ``why do commonly-used methods of hypothesis testing usually work?'' (see Sections () and ()).

**Figure:** Testing a hypothesis $H_\circ$ implies that one is ready to replace it with an alternative hypothesis.

By reference to the figure above (imagine for a moment the figure without the curve of the alternative hypothesis), the argument that $\theta_m$ provides evidence against $H_\circ$ is intuitively accepted and often works, not (only) because of probabilistic considerations of $\theta$ in the light of $H_\circ$, but because it is often reasonable to imagine an alternative hypothesis $H_1$ that

maximizes the likelihood $f(\theta_m\,\vert\,H_1)$ or, at least

$\displaystyle \frac{P(\theta_m\,\vert\,H_1)}{P(\theta_m\,\vert\,H_\circ)} \gg 1\,;$
has a comparable prior [ $P_\circ(H_1)\approx P_\circ(H_\circ)$ ], such that

$\displaystyle \frac{P(H_1\,\vert\,\theta_m)}{P(H_\circ\,\vert\,\theta_m)} = \frac{P(\theta_m\,\vert\,H_1)}{P(\theta_m\,\vert\,H_\circ)}\cdot \frac{P_\circ(H_1)}{P_\circ(H_\circ)} \approx \frac{P(\theta_m\,\vert\,H_1)}{P(\theta_m\,\vert\,H_\circ)} \longrightarrow \gg 1\,.$
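A numerical sketch may make the role of the two conditions clearer. It assumes, purely for illustration, Gaussian likelihoods of unit width centred at 0 for $H_\circ$ and at 3 for $H_1$, with an observed $\theta_m = 3$; these numbers and the function names are hypothetical and do not come from the figure.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mean, sigma):
    """Gaussian probability density evaluated at x."""
    return exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def posterior_odds(theta_m, mean_1, mean_0, sigma, prior_odds):
    """Posterior odds of H_1 versus H_0: likelihood ratio times prior odds."""
    bayes_factor = (gaussian_pdf(theta_m, mean_1, sigma)
                    / gaussian_pdf(theta_m, mean_0, sigma))
    return bayes_factor * prior_odds

# Hypothetical setup: H_0 centred at 0, H_1 centred at 3, observation at 3.
THETA_M, MEAN_1, MEAN_0, SIGMA = 3.0, 3.0, 0.0, 1.0

# Comparable priors: the posterior odds equal the likelihood ratio, exp(4.5).
odds_comparable = posterior_odds(THETA_M, MEAN_1, MEAN_0, SIGMA, prior_odds=1.0)

# Strongly disfavoured alternative: the same data leave H_0 preferred.
odds_sceptical = posterior_odds(THETA_M, MEAN_1, MEAN_0, SIGMA, prior_odds=1e-3)

print(odds_comparable, odds_sceptical)
```

With comparable priors the `3 $\sigma$' observation favours $H_1$ by a large factor; with prior odds of one in a thousand the very same observation still leaves $H_\circ$ the more probable hypothesis.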

So, even though there is no objective or logical reason why the frequentistic scheme should work, the reason why it often does is that in many cases the test is made when one has serious doubts about the null hypothesis. But a peak appearing in the middle of a distribution, or any excess of events, is not, in itself, a hint of new physics (the figure below is an invitation to meditation...).

**Figure:** Experimental obituary (courtesy of Alvaro de Rujula[71]).

My recommendations are therefore the following.

Be very careful when drawing conclusions from $\chi^2$ tests, the `3$\sigma$ golden rule', and other `bits of magic';
Do not pay too much attention to fixed rules suggested by statistics `experts', supervisors, and even Nobel laureates, taking also into account that
- they usually have permanent positions and risk less than PhD students and postdocs who do most of the real work;
- they have been `miseducated' by the exciting experience of the glorious 1950s to 1970s: as Giorgio Salvini says, ``when I was young, and it was possible to go to sleep at night after having added within the day some important brick to the building of the elementary particle palace. We were certainly lucky.''[72]. Especially when they were hunting for resonances, priors were very high, and the 3-4 $\sigma$ rule was a good guide.
Fluctuations exist. There are millions of frequentistic tests made every year in the world. And there is no probability theorem ensuring that the most extreme fluctuations occur to a particular Chinese student, rather than to a large HEP collaboration (this is the same reasoning as that of many Italians who buy national lottery tickets in Rome or in motorway restaurants, because `these tickets win more often'...).
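The order of magnitude behind the last point can be estimated with a short sketch. It assumes, hypothetically, one million independent tests in which the null hypothesis is true and the test statistic is Gaussian; the number of tests and the names used are illustrative only.

```python
from math import erfc, sqrt

def two_sided_tail(n_sigma):
    """Probability that a Gaussian variable deviates by more than n_sigma
    standard deviations from its mean, in either direction."""
    return erfc(n_sigma / sqrt(2.0))

N_TESTS = 1_000_000  # hypothetical number of tests with a true null hypothesis

p_3sigma = two_sided_tail(3.0)
expected_3sigma = N_TESTS * p_3sigma          # about 2700 pure fluctuations
prob_at_least_one = 1.0 - (1.0 - p_3sigma) ** N_TESTS

for n_sigma in (2.0, 3.0, 4.0, 5.0):
    print(n_sigma, N_TESTS * two_sided_tail(n_sigma))
```

Under these assumptions thousands of `3 $\sigma$ effects' are expected every year from fluctuations alone, and nothing in the calculation says to whom they will happen.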

As a conclusion to these remarks, and to invite the reader to treat with great care the assumption of equiprobability of hypotheses (a hidden assumption in many frequentistic methods), I would like to add this quotation by Poincaré [6]:

``To make my meaning clearer, I go back to the game of écarté mentioned before. My adversary deals for the first time and turns up a king. What is the probability that he is a sharper? The formulae ordinarily taught give 8/9, a result which is obviously rather surprising. If we look at it closer, we see that the conclusion is arrived at as if, before sitting down at the table, I had considered that there was one chance in two that my adversary was not honest. An absurd hypothesis, because in that case I should certainly not have played with him; and this explains the absurdity of the conclusion. The function on the à priori probability was unjustified, and that is why the conclusion of the à posteriori probability led me into an inadmissible result. The importance of this preliminary convention is obvious. I shall even add that if none were made, the problem of the à posteriori probability would have no meaning. It must be always made either explicitly or tacitly.''
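Poincaré's number can be reproduced with Bayes' theorem. The sketch below assumes that an honest dealer turns up a king with probability 4/32 = 1/8 (écarté is played with a 32-card pack) and that a sharper turns one up with certainty; the second assumption and the alternative prior of 1/1000 are mine, introduced only to show how strongly the result depends on the initial probability.

```python
from fractions import Fraction

P_KING_IF_HONEST = Fraction(4, 32)   # 4 kings in a 32-card pack
P_KING_IF_SHARPER = Fraction(1)      # assumption: a sharper always turns up a king

def posterior_sharper(prior_sharper):
    """P(sharper | king turned up), from Bayes' theorem with two hypotheses."""
    numerator = P_KING_IF_SHARPER * prior_sharper
    denominator = numerator + P_KING_IF_HONEST * (1 - prior_sharper)
    return numerator / denominator

# The `formulae ordinarily taught': one chance in two that the adversary cheats.
print(posterior_sharper(Fraction(1, 2)))      # 8/9

# A more sensible (hypothetical) prior for someone one agrees to play with.
print(posterior_sharper(Fraction(1, 1000)))   # 8/1007
```

The same king changes a prior of 1/2 into 8/9, but a prior of 1/1000 only into about 0.8%: the `absurd' conclusion comes entirely from the tacit prior.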


Giulio D'Agostini 2003-05-15