
Misunderstandings caused by the standard paradigm of hypothesis tests

Similar problems of interpretation appear in the usual methods used to test hypotheses. I will briefly outline the standard procedure and then give some examples to show the kind of paradoxical conclusions that one can reach.

A frequentistic hypothesis test follows the scheme outlined below (see the figure that follows the list).^1.10

1. Formulate a hypothesis $H_\circ$.
2. Choose a test variable $\theta$ of which the probability density function $f(\theta\,\vert\,H_\circ)$ is known (analytically or numerically) for a given $H_\circ$.
3. Choose an interval $[\theta_1,\theta_2]$ such that there is high probability that $\theta$ falls inside the interval:

   $\displaystyle P(\theta_1 \le \theta \le \theta_2) = 1 - \alpha\,,$ (1.6)

   with $\alpha$ typically equal to 1% or 5%.
4. Perform an experiment, obtaining $\theta = \theta_{{\small\it m}}$.
5. Draw the following conclusions:
   - if $\theta_1 \le \theta_{{\small\it m}} \le \theta_2 \Longrightarrow$ $H_\circ$ accepted;
   - otherwise $\Longrightarrow$ $H_\circ$ rejected with a significance level $\alpha$.

**Figure:** Hypothesis test scheme in the frequentistic approach.

The usual justification for the procedure is that the probability $\alpha$ is so low that it is practically impossible for the test variable to fall outside the interval. Then, if this event happens, we have good reason to reject the hypothesis.
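The scheme above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the test variable is taken to be standard normal under $H_\circ$, the interval is chosen central, and the function names `acceptance_interval` and `frequentist_test` are hypothetical, not part of any library.

```python
from statistics import NormalDist


def acceptance_interval(dist, alpha):
    """Central interval [theta_1, theta_2] with probability 1 - alpha under H0."""
    return dist.inv_cdf(alpha / 2), dist.inv_cdf(1 - alpha / 2)


def frequentist_test(theta_m, dist, alpha=0.01):
    """Return 'accepted' if the measured value falls inside the interval."""
    theta_1, theta_2 = acceptance_interval(dist, alpha)
    return "accepted" if theta_1 <= theta_m <= theta_2 else "rejected"


# ASSUMPTION: f(theta | H0) is a standard normal density.
h0 = NormalDist(mu=0.0, sigma=1.0)

theta_1, theta_2 = acceptance_interval(h0, alpha=0.05)
decision_inside = frequentist_test(1.0, h0, alpha=0.05)
decision_outside = frequentist_test(2.5, h0, alpha=0.05)

print(f"interval = [{theta_1:.2f}, {theta_2:.2f}]")
print(f"theta_m = 1.0 -> H0 {decision_inside}")
print(f"theta_m = 2.5 -> H0 {decision_outside}")
```

Note that the code only ever uses $f(\theta\,\vert\,H_\circ)$: nothing in it refers to the probability of $H_\circ$ itself.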

One can recognize behind this reasoning a revised version of the classical `proof by contradiction' (see, e.g., Ref. [10]). In standard dialectics, one assumes a hypothesis to be true and looks for a logical consequence which is manifestly false in order to reject the hypothesis. The slight difference is that in the hypothesis test scheme, the false consequence is replaced by an improbable one. The argument may look convincing, but it has no grounds. In order to analyse the problem well, we need to review the logic of uncertainty. For the moment a few examples are enough to indicate that there is something troublesome behind the procedure.

Example 4:

Choosing the rejection region in the middle of the distribution.
Imagine choosing an interval $[\theta_1^*,\theta_2^*]$ around the expected value of $\theta$ (or around the mode) such that

$\displaystyle P(\theta_1^* \le \theta \le \theta_2^*) = \alpha\,,$ (1.7)

with $\alpha$ small (see the figure below). We can then reverse the test, and reject the hypothesis if the measured $\theta_{{\small\it m}}$ is inside the interval.

**Figure:** Would you accept this scheme to test hypotheses?

This strategy is clearly unacceptable, indicating that the rejection decision cannot be based on the argument of practically impossible observations (smallness of $\alpha$ ).
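A short numerical sketch makes the point concrete. Assuming, purely for illustration, that the test variable is standard normal under $H_\circ$, a narrow interval around the mode can be given exactly the same probability $\alpha$ as the usual two tails:

```python
from statistics import NormalDist

# ASSUMPTION: f(theta | H0) is a standard normal density.
h0 = NormalDist(mu=0.0, sigma=1.0)
alpha = 0.01

# 'Reversed' rejection region: a narrow interval centred on the mode.
theta_1_star = h0.inv_cdf(0.5 - alpha / 2)
theta_2_star = h0.inv_cdf(0.5 + alpha / 2)
p_central = h0.cdf(theta_2_star) - h0.cdf(theta_1_star)

# Usual rejection region: the two tails outside [-theta_2, theta_2].
theta_2 = h0.inv_cdf(1 - alpha / 2)
p_tails = 2 * (1 - h0.cdf(theta_2))

print(f"central region [{theta_1_star:.4f}, {theta_2_star:.4f}]: P = {p_central:.4f}")
print(f"tails beyond +/-{theta_2:.3f}: P = {p_tails:.4f}")
```

Both regions are equally "improbable" under $H_\circ$, so the smallness of $\alpha$ alone cannot distinguish between them.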

One may object that the reason is not only the small probability of the rejection region, but also its distance from the expected value. The following figure is an example against this objection.

**Figure:** Would you accept this scheme to test hypotheses?

Although the situation is not as extreme as that depicted in the first figure of this example, one would need a certain amount of courage to say that $H_\circ$ is rejected if the test variable falls by chance in `the bad region'.

Example 5:

Has the student made a mistake?
A teacher gives each student an individual sample of 300 random numbers, uniformly distributed between 0 and 1. The students are asked to calculate the arithmetic average. The prevision^1.11 of the teacher can be quantified with

$\displaystyle \mbox{E}\left[\overline{X}_{300}\right] = \frac{1}{2}$ (1.8)

$\displaystyle \sigma\left[\overline{X}_{300}\right] = \frac{1}{\sqrt{12}} \cdot \frac{1}{\sqrt{300}} = 0.017\,,$ (1.9)

with the random variable $\overline{X}_{300}$ normally distributed because of the central limit theorem. This means that there is a 99% probability that an average will come out in the interval $0.5\pm (2.6 \times 0.017)$:

$\displaystyle P(0.456 \le \overline{X}_{300} \le 0.544) = 99\%\,.$ (1.10)
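The numbers in Eqs. (1.8)-(1.10) can be checked with a few lines of Python. The sketch below uses the normal approximation invoked in the text; the variable names, and the sample value 0.550 used for the tail probability, are chosen only for illustration.

```python
import math
from statistics import NormalDist

n = 300
sigma_single = 1 / math.sqrt(12)           # std. deviation of one uniform number on [0, 1]
sigma_mean = sigma_single / math.sqrt(n)   # Eq. (1.9): equals 1/60, about 0.017

# Normal approximation for the average (central limit theorem).
average = NormalDist(mu=0.5, sigma=sigma_mean)

# Probability content of the interval quoted in Eq. (1.10).
p_inside = average.cdf(0.544) - average.cdf(0.456)

# Two-sided tail probability of a result at least as far from 0.5 as 0.550,
# computed under the hypothesis that the calculation is correct.
x_bar = 0.550
p_tail = 2 * (1 - average.cdf(0.5 + abs(x_bar - 0.5)))

print(f"sigma = {sigma_mean:.4f}")
print(f"P(0.456 <= average <= 0.544) = {p_inside:.4f}")
print(f"two-sided tail probability for 0.550 = {p_tail:.4f}")
```

Every quantity computed here is conditional on the calculations having been done correctly.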

Imagine that a student obtains an average outside the above interval (e.g. $\overline{x}=0.550$). The teacher may be interested in the probability that the student has made a mistake (for example, to decide whether it is worthwhile checking the calculation in detail). Applying the standard methods, one draws the conclusion that

``the hypothesis $H_\circ$ = `no mistakes' is rejected at the 1% level of significance'',

i.e. one receives a precise answer to a different question. In fact, the meaning of the previous statement is simply

``there is only a 1% probability that the average falls outside the selected interval, if the calculations were done correctly''.

But this does not answer our natural question,^1.12 i.e. that concerning the probability of a mistake, and not that of results far from the average if there were no mistakes. Moreover, the statement sounds as if one were 99% sure that the student has made a mistake! This conclusion is highly misleading.

How is it possible, then, to answer the very question concerning the probability of mistakes? If you ask the students (before they take a standard course in hypothesis tests) you will hear the right answer, and it contains a crucial ingredient extraneous to the logic of hypothesis tests:

``It all depends on who has made the calculation!''

In fact, if the calculation was done by a well-tested program, the probability of a mistake would be zero. And students know their own probability of making mistakes rather well.
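To see how strongly the answer depends on `who has made the calculation', one can sketch a toy calculation. Everything beyond the normal distribution of the average is an assumption made here for illustration only: the prior probabilities of a mistake are invented, a mistaken calculation is assumed to yield an average spread uniformly over $[0,1]$, and the function name `prob_mistake` is hypothetical.

```python
import math
from statistics import NormalDist

# Distribution of the average if the calculation is correct (from the text).
no_mistake = NormalDist(mu=0.5, sigma=1 / math.sqrt(12 * 300))


def prob_mistake(x_bar, prior_mistake, density_if_mistake=1.0):
    """Probability of a mistake given the reported average x_bar.

    ASSUMPTION (not in the text): a mistaken calculation gives an average
    spread uniformly over [0, 1], i.e. a constant density equal to 1.
    """
    weight_mistake = prior_mistake * density_if_mistake
    weight_correct = (1 - prior_mistake) * no_mistake.pdf(x_bar)
    return weight_mistake / (weight_mistake + weight_correct)


# Hypothetical prior probabilities of a mistake for different 'calculators'.
cases = [
    ("well-tested program", 0.0),
    ("careful student", 0.01),
    ("careless student", 0.5),
]
for who, prior in cases:
    print(f"{who}: P(mistake | average = 0.550) = {prob_mistake(0.550, prior):.3f}")
```

The same observed average leads to very different conclusions in the three cases, which is what the students' answer expresses.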

Example 6:

A bad joke to a journal.^1.13

A scientific journal changes its publication policy. The editors announce that results with a significance level of 5% will no longer be accepted. Only those with a level of $\le 1\%$ will be published. The rationale for the change, explained in an editorial, looks reasonable and it can be shared without hesitation: ``We want to publish only good results.''

1000 experimental physicists, not convinced by this severe rule, conspire against the journal. Each of them formulates a wrong physics hypothesis and performs an experiment to test it according to the accepted/rejected scheme.

Roughly 10 physicists get $1\%$ significant results. Their papers are accepted and published. It follows that, contrary to the wishes of the editors, the first issue of the journal under the new policy contains only wrong results!
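The figure of roughly 10 papers can be reproduced with a small simulation. The sketch assumes one particular reading of the example: in each experiment only chance fluctuations are at work, the test variable is standard normal, and a result is `significant' when it falls in the two-sided 1% tails. The seed is arbitrary.

```python
import random
from statistics import NormalDist

n_physicists = 1000
alpha = 0.01

# Expected number of 'significant' results produced by chance alone.
expected_published = n_physicists * alpha

# ASSUMPTION: each test variable is standard normal when only chance is at work.
z_cut = NormalDist().inv_cdf(1 - alpha / 2)

rng = random.Random(2003)  # arbitrary seed, for reproducibility only
published = sum(
    1 for _ in range(n_physicists) if abs(rng.gauss(0.0, 1.0)) > z_cut
)

print(f"expected: {expected_published:.0f}, simulated: {published}")
```

The simulated count fluctuates around the expected value from one seed to another, but it is hardly ever zero.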

The solution to the kind of paradox raised by this example seems clear: The physicists knew with certainty that the hypotheses were wrong. So the example looks like an odd case with no practical importance. But in real life who knows in advance with certainty if a hypothesis is true or false?


Giulio D'Agostini 2003-05-15
