Falsificationism and its statistical variations

The proof by contradiction of standard dialectics and mathematics
consists in assuming a hypothesis to be true and in looking
for (at least) one of its logical consequences
that is manifestly false.
If a false consequence exists,
then the hypothesis under test is considered false
and its opposite true (in the sequel $\overline{H}$ will indicate
the hypothesis opposite to $H$, i.e. $\overline{H}$ is
true if $H$ is false, and vice versa).
Indeed, there is no doubt that if we observe an effect that
is *impossible* within a theory, this theory
has to be ruled out. But the *strict* application of the
falsificationist criterion is untenable in actual
scientific practice, for several reasons.

- What should we do with all the theories which have not yet been falsified?
Should we consider them all on the same footing, parked in a kind of
*Limbo*? This approach is not very effective. Which experiment should we
perform next? The natural development of Science shows that new
investigations are made in the direction that seems most *credible*
(and fruitful) at a given moment.

- If the predictions of a theory are characterized by the internal or
external probabilistic behavior discussed above, how can we ever think,
rigorously speaking, of falsifying such a theory? For instance, there is no
way to falsify hypothesis $H_1$ of Example 1, because any real number is
compatible with any Gaussian. For the same reason, falsificationism cannot
be used to make an inference about the value of a physical quantity: for a
Gaussian response of the detector, no value of the quantity can be falsified
whatever we observe, and, unfortunately, falsificationism does not tell us
how to classify the non-falsified values in credibility (see the short
numerical sketch right after this list).
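
A minimal numerical sketch of the last point (the numbers below are purely
illustrative and are not taken from Example 1): a Gaussian model assigns a
non-vanishing probability density to every real observation, so no single
measurement can strictly falsify it; at most one can compare how well two
Gaussian hypotheses account for the same observation.

```python
from math import exp, pi, sqrt

def gauss_pdf(x, mu, sigma):
    """Gaussian probability density N(x; mu, sigma)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# Illustrative hypothesis: detector response centred at mu = 0 with sigma = 1.
# Even an observation 8 sigma away has a tiny but NON-ZERO density, so,
# strictly speaking, the hypothesis is never falsified by it.
x_obs = 8.0
print(gauss_pdf(x_obs, mu=0.0, sigma=1.0))      # ~5e-15: small, but not zero

# What we can do is COMPARE hypotheses: the likelihood ratio says how much
# the observation favours a Gaussian centred at 10 over one centred at 0.
print(gauss_pdf(x_obs, 10.0, 1.0) / gauss_pdf(x_obs, 0.0, 1.0))   # ~1e13
```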

- a) The impossible consequence is replaced by an *improbable consequence*.
If this improbable consequence occurs, then the hypothesis is rejected,
otherwise it is accepted. The implicit argument at the basis of the
hypothesis-test approach of conventional statistics is: ``if $E$ is
*practically impossible* given $H$, then $H$ is considered *practically
false* given the observation of $E$''. But this probability inversion --
initially qualitative, but then turned erroneously quantitative by most
practitioners, attributing to `$H$ given $E$' the same probability as
`$E$ given $H$' -- is not logically justified, and it is not difficult to
show that it yields misleading conclusions. Let us see some simple examples.

**Example 2** - Considering *only* hypothesis $H_1$ of Example 1 and taking
as $E$ an observation far in the tail of the distribution, we can calculate
the probability $P(E\,|\,H_1)$ of obtaining $E$ from $H_1$. This probability
is rather small, but, once $E$ has occurred, we cannot state that ``$E$ has
little probability to come from $H_1$'', or that ``$H_1$ has little
probability to have caused $E$'': $E$ is *certainly* due to $H_1$!

**Example 3** - ``I play honestly at lotto, betting on a rare combination''
($H$) and ``I win'' ($E$). You cannot say that, since $E$ is `practically
impossible' given $H$, then hypothesis $H$ has to be `practically excluded'
after you have got the information that I have won [such a conclusion would
imply that it is `practically true' that ``I have cheated'' ($\overline{H}$)].
**Example 4** - An AIDS test to detect HIV infection is perfect in tagging
HIV-infected people as `positive' (Pos),
i.e. $P(\mathrm{Pos}\,|\,\mathrm{HIV}) = 1$, but it can sometimes err and
classify healthy persons ($\overline{\mathrm{HIV}}$) as positive,
although with low probability, e.g.
$P(\mathrm{Pos}\,|\,\overline{\mathrm{HIV}}) = 0.2\%$.
An Italian citizen is chosen *at random* to undergo such a test and he/she
is tagged positive. We cannot claim that ``since it was practically
impossible that a healthy person resulted positive, then this person is
practically infected'', or, quantitatively, ``there is only 0.2% probability
that this person is not infected'' (a numerical sketch based on Bayes'
theorem is given right after this list).

*It is not correct to replace `impossible' with `improbable' in logical
methods that reason in terms of what is `impossible'* (and then to use that
reasoning to perform `probabilistic inversions'): impossible and improbable
differ in quality, not just in quantity!

- b) In many cases the number of possible effects due to a hypothesis is so
large that *each single effect is `practically impossible'*.^{3} Even those
who trust the reasoning based on the small probability of effects in order
to falsify hypotheses have to realize that the reasoning fails in these
cases, because every observation could then be used as evidence against the
hypothesis under test. Statisticians have therefore worked out methods in
which the observed effect is replaced by two *ensembles of effects*, one of
high chance and one of low chance. The reasoning based on the `practically
impossible' effect is then extended to the latter ensemble. This is the
essence of all tests based on ``p-values'' [8] (what physicists know as the
``probability of tails'', upon which $\chi^2$ and other famous tests are
based; a small numerical illustration of such a tail probability is given at
the end of this section). Logically,^{4} the situation gets worse, because
the conclusions no longer depend only on what has been observed, but also on
effects that have not been observed^{5} (see e.g. Ref. [7]).

- c) Apart from the simple case of just one observation, the data are
summarized by a `test variable' (e.g. $\chi^2$), a function of the data, and
the reasoning discussed above is applied to the test variable. This
introduces an additional, arbitrary ingredient into this already logically
tottering construction.^{6}

- d) Even in simple problems that could be formulated in terms of a single
quantity, given the empirical information there might be ambiguity about
which quantity plays the role of the random variable upon which the p-value
has to be calculated.^{7}
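
A minimal numerical sketch of the point made in Example 4, using Bayes'
theorem (the prior prevalence used below is an invented figure for
illustration only, not real epidemiological data): the probability of being
infected given a positive result depends crucially on the proportion of
infected people in the sampled population, and it is in general very
different from $1 - P(\mathrm{Pos}\,|\,\overline{\mathrm{HIV}})$.

```python
# Bayes' theorem applied to the AIDS test of Example 4.
# P(Pos|HIV) = 1 and P(Pos|not HIV) = 0.2% are the values quoted in the text;
# the prior P(HIV) = 0.1% is a purely ILLUSTRATIVE prevalence.
p_pos_given_hiv = 1.0
p_pos_given_healthy = 0.002
p_hiv = 0.001                       # hypothetical prior prevalence
p_healthy = 1.0 - p_hiv

p_pos = p_pos_given_hiv * p_hiv + p_pos_given_healthy * p_healthy
p_hiv_given_pos = p_pos_given_hiv * p_hiv / p_pos

print(f"P(HIV | Pos)     = {p_hiv_given_pos:.3f}")      # ~0.33, not 0.998
print(f"P(not HIV | Pos) = {1 - p_hiv_given_pos:.3f}")  # ~0.67, not 0.002
```

With these (illustrative) numbers the randomly chosen positive person is
actually more likely to be healthy than infected, the opposite of the naive
`99.8% infected' reading of the test performance.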

- In my opinion the most serious problem is the fact that p-values are
constantly used in scientific conclusions as if they were the probability
that the hypothesis under test is true (for example, people report a p-value
of 0.0003 as ``the hypothesis is excluded at 99.97% C.L.'', as if they were
99.97% confident that the hypothesis under test is false). The consequence
of this misunderstanding is very serious, and it is essentially responsible
for all the claims of fake discoveries of the past decades (see some
examples in Sec. 1.9 of Ref. [1]).
- Statistical tests are not based on first principles of any kind. Hundreds
of statistical tests have been contrived and their choice is basically
arbitrary. I have experienced that discussions within experimental teams
about which test to use, and how to use it, are no deeper than discussions
in pubs among soccer fans (Italian readers might think of the `Processo di
Biscardi' talk show, quite often also in the tones).
- There is sometimes a tendency to look for the test that gives the desired
result. Personally, I find that *the fancier the name of the test, the less
believable the claim*, because I am pretty sure that other, more common
tests were discarded because `they were not appropriate', an expression
often to be interpreted as ``the other tests did not support what the
experimentalist wanted the data to prove'' (and I could report of people
who, frustrated by the `bad results' obtained with frequentistic tests,
contacted me hoping for a Bayesian miracle -- they got regularly
disappointed because, `unfortunately', Bayesian methods, consciously
applied, tend not to feed vain illusions).

- Standard statistical methods, essentially a contradictory collection of
*ad-hoc-eries*, induce scientists, and physicists in particular, to think
that `statistics' is something `not serious', thus encouraging `creative'
behaviors.^{8}
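
A minimal numerical sketch of what such a `probability of tails' is (the
background expectation and the observed count below are invented for
illustration): the p-value attached to an observed count is the probability,
under the hypothesis being tested, of outcomes at least as `extreme' as the
observed one, so it also involves counts that have never been observed; and
in any case it is not the probability that the hypothesis itself is true.

```python
from math import exp, factorial

def poisson_pmf(k, mu):
    """Poisson probability of observing k counts with expectation mu."""
    return exp(-mu) * mu**k / factorial(k)

mu_bkg = 1.0   # illustrative background-only expectation
n_obs = 5      # illustrative observed count

# p-value = P(n >= n_obs | background only): it sums the probabilities of
# the outcomes 5, 6, 7, ... -- i.e. mostly counts that were NOT observed.
p_value = 1.0 - sum(poisson_pmf(k, mu_bkg) for k in range(n_obs))
print(f"p-value = {p_value:.5f}")   # ~0.0037

# Reporting this as ``background excluded at 99.6% C.L.'' would silently turn
# P(data at least this extreme | H) into P(H | data); the latter also requires
# the prior and the alternative hypotheses, as discussed in the text.
```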
*ad-hoc-eries*, induce scientists, and physicists in particular, to think that `statistics' is something `not serious', thus encouraging `creative' behaviors.^{8}