Computational issues: normalization, fit summaries, priors and approximations

where is `simply' a normalization factor. (This factor is usually the most difficult thing to calculate and it is often obtained approximately by numerical methods. But this is, in principle, just a technical issue.) Once we have got we have a full knowledge about and therefore about our uncertainty concerning the model parameters, the distribution of each of which can be obtained by marginalization:

(52) | |||

(53) | |||

(54) |

Similarly the joint distribution of and can be obtained as

(55) |

from which we can easily see that we recover Eq. (30) in the case we think the extra variability discussed in the previous section is absent. This limit case corresponds to a prior of sharply peaked around zero, i.e. .

Other interesting limit cases are the following.

- Errors only on the axis and no extra variability.

Making the limit of Eq. (30) for and neglecting irrelevant factors we get

This is the best known and best understood case. - Errors only on the axis and extra variability.

Making the limit of Eq. (53) for

- Scattering of data point around the hypothesized straight line
only due to `extra variability'.

This case corresponds to the joint determination of , and made by the method of the `residuals', that can be considered a kind of approximated solution of Eq. (61), achieved by iteration. [Indeed, if there are `enough' data points the `best estimates' achieved by the residual method are very close to the expected values of , and evaluated from if we assumed a flat prior distribution for the parameters.]

The prior has been left on purpose open in the above formulas, although we have already anticipated that usually a flat prior about all parameters gives the correct result in most 'healthy' cases, characterized by a sufficient number of data points. I cannot go here through an extensive discussion about the issue of the priors, often criticized as the weak point of the Bayesian approach and that are in reality one of its points of force. I refer to more extensive discussions available elsewhere (see e.g. [2] and references therein), giving here only a couple of advices. A flat prior is in most times a good starting point (unless one uses some packages, like BUGS [11], that does not like flat prior in the range to ; in this case one can mimic it with a very broad distribution, like a Gaussian with very large ). If the result of the inference `does not offend your physics sensitivity', it means that, essentially, flat priors have done a good job and it is not worth fooling around with more sophisticated ones. In the specific case we are looking closer, that of Eq. (53), the most critical quantity to watch is obviously , because it is positively defined. If, starting from a flat prior (also allowing negative values), the data constrain the value of in a (positive) region far from zero, and - in practice consequently - its marginal distribution is approximatively Gaussian, it means the flat prior was a reasonable choice. Otherwise, the next-to-simple modeling of is via the step function . A more technical choice would be a gamma distribution, with suitable parameters to `easily' accommodate all envisaged values of .

The easiest case, that happens very often if one has `many' data points (where `many' might be already as few as some dozens), is that obtained starting from flat priors is approximately a multi-variate Gaussian distribution, i.e. each marginal is approximately Gaussian. In this case the expected value of each variable is close to its mode, that, since the prior was a constant, corresponds to the value for which the likelihood gets its maximum. Therefore the parameter estimates derived by the maximum likelihood principle are very good approximations of the expected values of the parameters calculated directly from . In a certain sense the maximum likelihood principle best estimates are recovered as a special case that holds under particular conditions (many data points and vague priors). If either condition fails, the result the formulas derived from such a principle might be incorrect. This is the reason I dislike unneeded principles of this kind, once we have a more general framework, of which the methods obtained by `principles' are just special cases under well defined conditions.

The simple case in which
is approximately
multi-variate Gaussian allows also to approximately
evaluate the covariance matrix of the fit parameters from
the Hessian of its logarithm.^{6}This is due to a well known
property of the multi-variate Gaussian and it is not strictly
related to flat priors.
In fact it can easily proved that if the generic
is a multivariate Gaussian, then

where

(62) |

is the covariance matrix of the parameters and is the value for which gets its maximum and then its minimum.

An interesting feature of this approximated procedure is that, since it is based on the logarithm of the pdf, normalization factors are irrelevant. In particular, if the priors are flat, the relevant summaries of the inference can be obtained from the logarithm of the likelihood, stripped of all irrelevant factors (that become additive constants in the logarithm and vanish in the derivatives). Let us write down, for some cases of interest, the minus-log-likelihoods, stripped of constant terms and indicated by , i.e. .

- Simplest case: linear fit with only known errors on the axis
[from Eq. (58)]:

where we recognize the famous chi-squared. Applying Eq. (63) we get then the covariance matrix of the fit parameters as

(See Ref. [2] for the fully developed example yielding analytic formulas for the expected values and covariance matrix of the and .) Note that the often used (but also often misused![15]) ` rule' to calculate the covariance matrix of the parameters comes from the same Gaussian approximation of the final pdf and prior insensitivity. [And, because of the factor between Eqs. (63) and (66), there is an equivalent `minus-log-likelihood = ' rule, applicable under the same conditions]. - Errors also on the axis:

In this case expected values and covariance matrix cannot be obtained directly in closed form. Nevertheless, one can use iteratively the formulas for in which the estimate of is used to evaluate the terms (having the meaning of effective -error) in the likelihood of the next iteration. Instead it is wrong to simply replace the denominator of the of Eq. (65) with , because this approximation does not take into account the first term of the r.h.s. of Eq. (67) and the slope will be underestimated (as a consequence, the intercept will be over- or under-estimated, depending on the sign of the correlation coefficient between and , a sign that depends on the sign of the barycenter of the points.) - Dispersion on the axis only due to
[from Eq. (61)]:

(66)

- The most complete case seen here [from Eq. (53)]:

- As the previous item, but for the general
[from Eq. (34)]: