Likelihood and maximum likelihood methods

Some comments on likelihood are also in order, because the reader might have heard this term and might wonder if and how it fits in the scheme of reasoning expounded here.

One of the problems with this term is that it tends to have several meanings, and thus to create misunderstandings. In plain English `likelihood' is ``1. the condition of being likely or probable; probability'', or ``2. something that is probable'' (footnote 58); but also ``3. (Mathematics & Measurements / Statistics) the probability of a given sample being randomly drawn, regarded as a function of the parameters of the population''.

Technically, with reference to the example of the previous appendix, the likelihood is simply $ P(x_E\,\vert\,H_i,I)$, where $ x_E$ is fixed (the observation) and $ H_i$ is the `parameter'. It can then take only two values, $ P(x_E\,\vert\,H_1,I) = 3.68\times 10^{-8}$ and $ P(x_E\,\vert\,H_2,I) = 1.99\times 10^{-8}$.

If, instead of only two models, we had a continuum of models, for example the family of all Gaussian distributions characterized by central value $ \mu$ and `effective width' (standard deviation) $ \sigma$, our likelihood would be $ P(x_E\,\vert\,\mu,\sigma,I)$, i.e.

$ {\cal L}(\mu,\sigma\,;\,x_E) \,=\, P(x_E\,\vert\,\mu,\sigma,I)\,,$   (37)

written in this way to remind us that: 1) a likelihood is a function of the model parameters and not of the data; 2) $ {\cal L}(\mu,\sigma\,;\,x_E)$ is not a probability (or a probability density function) of $ \mu$ and $ \sigma$. Anyway, for the rest of the discussion we stick to the very simple likelihood based on the two Gaussians. That is, instead of a double infinity of possibilities, our space of parameters is made of only two points, $ \{\mu_1=0,\,\sigma_1=1\}$ and $ \{\mu_2=0.4,\,\sigma_2=2\}$. Thus the situation gets simpler, although the main conceptual issues remain substantially the same.
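As a minimal numerical sketch of points 1) and 2) (not from the text, and using a purely illustrative value of $ x_E$, so the numbers it prints are not those quoted above), one can regard the likelihood as a function taking the parameters as arguments while the observation stays fixed:

\begin{verbatim}
# Minimal sketch (not from the text): the likelihood as a function of the
# parameters, with the observation x_E held fixed.  The value of x_E used
# here is purely illustrative, so the printed numbers are NOT the 3.68e-8
# and 1.99e-8 quoted above.
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density f(x | mu, sigma)."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))

def likelihood(mu, sigma, x_E):
    """L(mu, sigma; x_E) = P(x_E | mu, sigma, I): data fixed, parameters vary."""
    return gaussian_pdf(x_E, mu, sigma)

x_E = 1.3                          # illustrative observation (assumption)

# Two-point parameter space: {mu_1=0, sigma_1=1} and {mu_2=0.4, sigma_2=2}
L1 = likelihood(0.0, 1.0, x_E)
L2 = likelihood(0.4, 2.0, x_E)
print(L1, L2)   # one number per hypothesis; NOT probabilities of the hypotheses
\end{verbatim}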

In principle there is nothing wrong with giving a special name to this function of the parameters. But, frankly, I would have preferred that statistics gurus had named it after their dog or their lover, rather than call it `likelihood' (footnote 59). The problem is that it is very frequent to hear students, teachers and researchers explaining that the `likelihood' tells ``how likely the parameters are'' (but that is the probability of the parameters, not the `likelihood'!). Or they would say, with reference to our example, ``it is the probability that $ x_E$ comes from $ H_i$'' (again, this expression would be the probability of $ H_i$ given $ x_E$, and not the probability of $ x_E$ given the models!). Imagine we have only $ H_1$ in the game: $ x_E$ comes with certainty from $ H_1$, although $ H_1$ does not yield $ x_E$ with certainty (footnote 60).
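The last remark is just Bayes' theorem at work: if $ H_1$ is the only hypothesis taken into account, i.e. $ P(H_1\,\vert\,I)=1$, then

$ P(H_1\,\vert\,x_E,I) \,=\, \frac{P(x_E\,\vert\,H_1,I)\,P(H_1\,\vert\,I)}{P(x_E\,\vert\,H_1,I)\,P(H_1\,\vert\,I)} \,=\, 1\,,$

however small $ P(x_E\,\vert\,H_1,I)$ might be.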

Several methods in `conventional statistics' somehow use the likelihood to decide which model or which set of parameters best describes the data. Some even use the likelihood ratio (our Bayes factor), or the logarithm of it (something equal or proportional, depending on the base, to the weight of evidence we have indicated here by JL). The most famous method of the series is the maximum likelihood principle. As is easy to guess from its name, it states that the best estimates of the parameters are those which maximize the likelihood.
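As an illustration of how these quantities are computed in practice, here is a minimal sketch (with assumed, purely illustrative data, not taken from the text): it evaluates the closed-form maximum likelihood estimates of $ \mu$ and $ \sigma$ for a Gaussian sample, and the likelihood ratio of the two fixed hypotheses above together with its logarithm:

\begin{verbatim}
# Minimal sketch with assumed, illustrative data (not taken from the text):
# closed-form maximum likelihood estimates for a Gaussian model, plus the
# likelihood ratio of the two fixed hypotheses and its base-10 logarithm.
from math import exp, log, log10, pi, sqrt

def log_likelihood(mu, sigma, data):
    """log L(mu, sigma; data) for independent Gaussian observations."""
    return sum(-0.5 * ((x - mu) / sigma) ** 2 - log(sigma * sqrt(2.0 * pi))
               for x in data)

data = [0.8, -0.3, 1.1, 0.2, 0.5]        # illustrative sample (assumption)
n = len(data)

# Maximum likelihood estimates (sample mean and rms deviation):
mu_ml = sum(data) / n
sigma_ml = sqrt(sum((x - mu_ml) ** 2 for x in data) / n)
print("ML estimates:", mu_ml, sigma_ml)

# Likelihood ratio (Bayes factor) of the two hypotheses of the text:
logL1 = log_likelihood(0.0, 1.0, data)   # H1: mu=0,   sigma=1
logL2 = log_likelihood(0.4, 2.0, data)   # H2: mu=0.4, sigma=2
ratio = exp(logL1 - logL2)
print("likelihood ratio L1/L2:", ratio)
print("log10 of the ratio    :", log10(ratio))
\end{verbatim}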

All that seems reasonable and in agreement with what has been expounded here, but it is not quite so. First, for those who support this approach, likelihoods are not just a part of the inferential tool, they are everything. Priors are completely neglected, more or less because of the objections in footnote 9. This can be acceptable if the evidence is overwhelming, but this is not always the case. Unfortunately, as it is now easy to understand, neglecting priors is mathematically equivalent to considering the alternative hypotheses equally likely! A consequence of this statistical miseducation (most statistics courses in universities all around the world only teach `conventional statistics', and teach probabilistic inference either not at all, too little, or badly) is that too many people one would never suspect fail to solve the AIDS problem of appendix B, or confuse the likelihood with the probability of the hypothesis, resulting in misleading scientific claims (see also footnote 60 and Ref. [3]).
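To see concretely what neglecting priors does, here is a minimal sketch in the spirit of the AIDS problem, with assumed, purely illustrative numbers (the actual figures of appendix B are not reproduced here): the posterior odds are the Bayes factor times the prior odds, and dropping the prior amounts to setting the prior odds to one:

\begin{verbatim}
# Minimal sketch with assumed, purely illustrative numbers (the actual
# figures of appendix B are not reproduced here): an AIDS-test-like problem.
p_pos_given_infected = 0.999     # hypothetical P(Positive | infected)
p_pos_given_healthy  = 0.002     # hypothetical P(Positive | healthy)
p_infected           = 1.0/600   # hypothetical prior (base rate) for a random person

bayes_factor   = p_pos_given_infected / p_pos_given_healthy   # ~ 500
prior_odds     = p_infected / (1.0 - p_infected)
posterior_odds = bayes_factor * prior_odds
print("P(infected | Positive), prior included:",
      posterior_odds / (1.0 + posterior_odds))                # ~ 0.45: far from certainty

# Neglecting the prior is the same as setting the prior odds to 1
# (i.e. taking 'infected' and 'healthy' as initially equally likely):
odds_no_prior = bayes_factor * 1.0
print("same quantity, prior ignored          :",
      odds_no_prior / (1.0 + odds_no_prior))                  # ~ 0.998: misleading
\end{verbatim}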

The second difference is that, since ``there are no priors'', the result cannot have a probabilistic meaning, as is openly recognized by the promoters of this method, who, in fact, do not admit that we can talk about probabilities of causes (though most practitioners seem not to be aware of this `little philosophical detail', also because frequentistic gurus, having difficulties in explaining the meaning of their methods, say that they are `probabilities', but in quote marks!) (footnote 61). As a consequence, the resulting `error analysis', which in human terms means assigning different beliefs to different values of the parameters, is cumbersome. In practice the results are reasonable only if the possible values of the parameters are initially equally likely and the `likelihood function' has a `kind shape' (for more details see chapters 1 and 12 of Ref. [3]).
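Indeed, Bayes' theorem applied to the parameters gives

$ f(\mu,\sigma\,\vert\,x_E,I) \,\propto\, P(x_E\,\vert\,\mu,\sigma,I)\, f_0(\mu,\sigma\,\vert\,I)\,,$

so only when the prior $ f_0$ is (at least approximately) flat can the likelihood alone be read as if it were a probability density of the parameters, and only when its shape around the maximum is well behaved (e.g. roughly Gaussian) does the usual summary in terms of a best value and a standard deviation make sense.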
