

From Bayesian inference to maximum-likelihood and minimum chi-square model fitting

Let us continue with the case in which we know so little about appropriate values of the parameters that a uniform distribution is a practical choice for the prior. Equation (52) becomes

$$p(\boldsymbol{\theta}\,\vert\,\boldsymbol{d}, I) \propto p(\boldsymbol{d}\,\vert\,\boldsymbol{\theta},I) = {\cal L}(\boldsymbol{\theta}; \boldsymbol{d}) \,, \qquad (61)$$

where, we recall, the likelihood ${\cal L}(\boldsymbol{\theta}; \boldsymbol{d})$ is regarded as a mathematical function of $\boldsymbol{\theta}$, with the data $\boldsymbol{d}$ playing the role of fixed parameters.

The set of $\boldsymbol{\theta}$ that is most probable is the one that maximizes ${\cal L}(\boldsymbol{\theta}; \boldsymbol{d})$, a result known as the maximum likelihood principle. Here it has been obtained once more as a special case of a more general framework, under clearly stated hypotheses, without the need to introduce new ad hoc rules. Note also that the inference does not depend on multiplicative factors in the likelihood. This is one way of stating the likelihood principle, which frequentist methods would ideally like to satisfy but often violate; in Bayesian statistics this `principle' always and naturally holds. It is important to remark that unnecessary principles are dangerous, because there is a tendency to apply them uncritically. For example, formulae resulting from maximum likelihood are often used even when non-uniform, reasonable priors should be taken into account, or when the shape of ${\cal L}(\boldsymbol{\theta}; \boldsymbol{d})$ is far from multi-variate Gaussian. (Gaussianity is a kind of ancillary default hypothesis that comes along with this principle, and it is the source of the often misused `$\Delta (-\ln {\cal L}) = 1/2$' rule for determining probability intervals.)

The usual least-squares formulae are easily derived in the well-known case of data pairs $\{x_i,y_i\}$ (the generic $\boldsymbol{d}$ stands for the whole set of data points) whose true values are related by a deterministic function $\mu_{y_i} = y(\mu_{x_i},\boldsymbol{\theta})$, with Gaussian errors only in the ordinates, i.e. we take $x_i \approx \mu_{x_i}$. If the measurements are independent, the likelihood-dominated result becomes

$$p(\boldsymbol{\theta}\,\vert\,\boldsymbol{x},\boldsymbol{y},I) \propto \prod_i \exp\left[-\frac{\big(y_i-y(x_i,\boldsymbol{\theta})\big)^2}{2\,\sigma_{i}^2}\right] \qquad (62)$$

or
$$p(\boldsymbol{\theta}\,\vert\,\boldsymbol{x},\boldsymbol{y},I) \propto \exp\left[-\frac{1}{2}\,\chi^2\right] \,, \qquad (63)$$

where
$$\chi^2(\boldsymbol{\theta}) = \sum_i \frac{\big(y_i-y(x_i,\boldsymbol{\theta})\big)^2}{\sigma_{i}^2} \qquad (64)$$

is the `chi-square', well known among physicists. Maximizing the likelihood is equivalent to minimizing $\chi^2$, and the most probable value of $\boldsymbol{\theta}$ (i.e. the mode, indicated by $\boldsymbol{\theta}_m$) is easily obtained, analytically in simple cases or numerically in more complex ones.
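To make the procedure concrete, here is a minimal Python sketch (not part of the original text): a hypothetical straight-line model $y(x;\boldsymbol{\theta})=\theta_1+\theta_2\,x$ is fitted to invented data by minimizing the $\chi^2$ of Eq. (64) with scipy.optimize.minimize. The data values, model and starting point are assumptions made purely for illustration.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

# Hypothetical data {x_i, y_i} with known standard deviations sigma_i
# (numbers invented purely for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
sigma = np.full(5, 0.3)

def model(x, theta):
    """Assumed deterministic relation y(x; theta) = theta_1 + theta_2 * x."""
    return theta[0] + theta[1] * x

def chi2(theta):
    """Chi-square of Eq. (64): sum of squared, normalized residuals."""
    return np.sum((y - model(x, theta)) ** 2 / sigma ** 2)

# Numerical minimization; theta_m is the mode of the posterior
# (uniform prior, independent Gaussian errors on the ordinates).
result = minimize(chi2, x0=[0.0, 1.0])
theta_m = result.x
print("theta_m =", theta_m, "  chi2_min =", result.fun)
\end{verbatim}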

As far as the uncertainty in $\boldsymbol{\theta}$ is concerned, the widely used evaluation of the covariance matrix $\mathbf{V}(\boldsymbol{\theta})$ (see Sect. 5.6) from the Hessian,

$$\left(V^{-1}\right)_{ij}(\boldsymbol{\theta}) = \left.\frac{1}{2}\, \frac{\partial^2\chi^2}{\partial\theta_i\,\partial\theta_j} \right\vert_{\boldsymbol{\theta}=\boldsymbol{\theta}_m} \,, \qquad (65)$$

is merely a consequence of an assumed multi-variate Gaussian distribution of $\boldsymbol{\theta}$, that is, of a parabolic shape of $\chi^2$ (note that the `$\Delta (-\ln {\cal L}) = 1/2$' rule, and the `$\Delta \chi^2 = 1$' rule that follows from it, have the same motivation). In fact, expanding $\chi^2(\boldsymbol{\theta})$ in a series around its minimum, we have
$$\chi^2(\boldsymbol{\theta}) \approx \chi^2(\boldsymbol{\theta}_m) + \frac{1}{2}\, \Delta\boldsymbol{\theta}^T\, \mathbf{H}\, \Delta\boldsymbol{\theta} \,, \qquad (66)$$

where $\Delta\boldsymbol{\theta}$ stands for the set of differences $\theta_i-\theta_{m_i}$ and $\mathbf{H}$ is the Hessian matrix, whose elements are given by twice the right-hand side of Eq. (65). Equation (63) then becomes
$$p(\boldsymbol{\theta}\,\vert\,\boldsymbol{x},\boldsymbol{y},I) \propto \exp\left[-\frac{1}{4}\, \Delta\boldsymbol{\theta}^T\, \mathbf{H}\, \Delta\boldsymbol{\theta} \right] \,, \qquad (67)$$

which we recognize to be a multi-variate Gaussian distribution if we identify $\mathbf{V}^{-1}=\mathbf{H}/2$, in accordance with Eq. (65). After normalization, we finally get
$$p(\boldsymbol{\theta}\,\vert\,\boldsymbol{x},\boldsymbol{y},I) \approx (2 \pi)^{-n/2}\, (\det\mathbf{V})^{-1/2}\, \exp\left[-\frac{1}{2}\, \Delta\boldsymbol{\theta}^T\, \mathbf{V}^{-1}\, \Delta\boldsymbol{\theta}\right] \,, \qquad (68)$$

with $n$ equal to the dimension of $\boldsymbol{\theta}$ and $\det\mathbf{V}$ the determinant of $\mathbf{V}$. As long as this approximation holds, $\mbox{E}(\boldsymbol{\theta})$ is approximately equal to $\boldsymbol{\theta}_m$. Note that the result (68) is exact when $y(\mu_{x_i},\boldsymbol{\theta})$ depends linearly on the various $\theta_i$.
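Continuing the Python sketch above, the covariance matrix of Eq. (65) can be estimated numerically as the inverse of half the Hessian of $\chi^2$ evaluated at $\boldsymbol{\theta}_m$. The finite-difference Hessian below is only one simple way of doing this, not a prescription taken from the text.

\begin{verbatim}
import numpy as np
# Continuation of the sketch after Eq. (64): chi2 and theta_m are defined there.

def hessian(f, theta, eps=1e-5):
    """Symmetric finite-difference estimate of the Hessian of a scalar f."""
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            t = [np.array(theta, dtype=float) for _ in range(4)]
            t[0][i] += eps; t[0][j] += eps   # f(+, +)
            t[1][i] += eps; t[1][j] -= eps   # f(+, -)
            t[2][i] -= eps; t[2][j] += eps   # f(-, +)
            t[3][i] -= eps; t[3][j] -= eps   # f(-, -)
            H[i, j] = (f(t[0]) - f(t[1]) - f(t[2]) + f(t[3])) / (4 * eps ** 2)
    return H

# Eq. (65): V^{-1} = (1/2) * Hessian of chi2, evaluated at the minimum.
V = np.linalg.inv(0.5 * hessian(chi2, theta_m))
std_dev = np.sqrt(np.diag(V))            # standard uncertainties on theta
corr = V / np.outer(std_dev, std_dev)    # correlation matrix
print("std_dev =", std_dev)
\end{verbatim}

For a model linear in the parameters, as in this sketch, $\chi^2$ is exactly quadratic in $\boldsymbol{\theta}$, so the Hessian, and hence $\mathbf{V}$, does not depend on the expansion point; this is the situation in which Eq. (68) is exact.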

In routine applications, the hypotheses that lead to the maximum-likelihood and least-squares formulae often hold. When these hypotheses are not justified, however, we need to characterize the result by the multi-dimensional posterior distribution $p(\boldsymbol{\theta})$, going back to the more general expression, Eq. (52).
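As a sketch of what `going back to Eq. (52)' can mean in practice, one can simply evaluate the un-normalized posterior on a grid of parameter values. The example below reuses the hypothetical data and model introduced earlier and adds an invented Gaussian prior on the slope, chosen purely for illustration.

\begin{verbatim}
import numpy as np
# Continuation of the earlier sketch: x, y, sigma and chi2 are defined there.

# Grid of parameter values (ranges chosen by hand around the least-squares solution).
theta1 = np.linspace(-1.0, 1.5, 200)    # intercept values
theta2 = np.linspace(1.5, 2.5, 200)     # slope values

# Log-likelihood from Eq. (63), evaluated point by point on the grid.
log_like = np.array([[-0.5 * chi2([a, b]) for b in theta2] for a in theta1])

# Hypothetical non-uniform prior: Gaussian on the slope, uniform on the intercept.
log_prior = -0.5 * ((theta2[None, :] - 2.0) / 0.5) ** 2

# Un-normalized posterior (prior times likelihood), then normalized on the grid.
post = np.exp(log_like + log_prior - (log_like + log_prior).max())
post /= post.sum() * (theta1[1] - theta1[0]) * (theta2[1] - theta2[0])
\end{verbatim}

From such a grid one can read off modes, expected values and probability intervals directly, with no Gaussian approximation involved.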

The important conclusion of this section, as was the case for the definitions of probability in Sect. 3, is that Bayesian methods often lead to well-known conventional results, but without introducing them as new ad hoc rules as the need arises. The analyst thus acquires a heightened awareness of the range of validity of the methods. One might as well use these `recovered' methods within the Bayesian framework, with its more natural interpretation of the results: one can then speak of the uncertainty in the model parameters and quantify it with probability values, which is the way physicists usually think.

