
# Computational issues: normalization, fit summaries, priors and approximations

At this point it is important to understand that in the Bayesian approach the full result of the inference is given by the final distribution $f(m, c, \sigma_v \,|\, \text{Data})$, which in our case is (we rewrite it here):

$$f(m, c, \sigma_v \,|\, \text{Data}) = \frac{1}{\mathcal{N}} \prod_i \frac{1}{\sqrt{2\pi \left(\sigma_v^2 + \sigma_{y_i}^2 + m^2 \sigma_{x_i}^2\right)}} \, \exp\!\left[-\frac{(y_i - m\,x_i - c)^2}{2 \left(\sigma_v^2 + \sigma_{y_i}^2 + m^2 \sigma_{x_i}^2\right)}\right] f_0(m, c, \sigma_v)\,,$$

where $\mathcal{N}$ is simply a normalization factor. (This factor is usually the most difficult thing to calculate, and it is often obtained approximately by numerical methods. But this is, in principle, just a technical issue.) Once we have $f(m, c, \sigma_v \,|\, \text{Data})$ we have full knowledge about $m$, $c$ and $\sigma_v$, and therefore about our uncertainty concerning the model parameters, the distribution of each of which can be obtained by marginalization:

$$f(m \,|\, \text{Data}) = \iint f(m, c, \sigma_v \,|\, \text{Data}) \, dc \, d\sigma_v \tag{52}$$

$$f(c \,|\, \text{Data}) = \iint f(m, c, \sigma_v \,|\, \text{Data}) \, dm \, d\sigma_v \tag{53}$$

$$f(\sigma_v \,|\, \text{Data}) = \iint f(m, c, \sigma_v \,|\, \text{Data}) \, dm \, dc \tag{54}$$

Similarly, the joint distribution of $m$ and $c$ can be obtained as

$$f(m, c \,|\, \text{Data}) = \int f(m, c, \sigma_v \,|\, \text{Data}) \, d\sigma_v\,, \tag{55}$$

from which we can easily see that we recover Eq. (30) in the case in which we think the extra variability discussed in the previous section is absent. This limit case corresponds to a prior on $\sigma_v$ sharply peaked around zero, i.e. $f_0(\sigma_v) = \delta(\sigma_v)$.
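As a purely illustrative numerical sketch (not from the paper; the data, grid ranges and true values $m = 2$, $c = 1$ below are invented), the posterior can be evaluated on a grid, normalized numerically, and the marginals of Eqs. (52)-(54) obtained by summing out the other parameters. For brevity the grid covers only $m$ and $c$, i.e. the no-extra-variability limit:

```python
import numpy as np

# Invented synthetic data: y_i = m*x_i + c with known Gaussian errors on y.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 20)
sigma_y = np.full_like(x, 0.3)                 # known standard deviations on y
y = 2.0 * x + 1.0 + rng.normal(0.0, sigma_y)   # true m = 2, true c = 1

m_grid = np.linspace(1.5, 2.5, 101)
c_grid = np.linspace(0.0, 2.0, 101)
M, C = np.meshgrid(m_grid, c_grid, indexing="ij")

# Log of the unnormalized posterior (flat prior over the grid).
resid = y - M[..., None] * x - C[..., None]
log_post = -0.5 * np.sum(resid**2 / sigma_y**2, axis=-1)

post = np.exp(log_post - log_post.max())
post /= post.sum()                  # the normalization factor, done numerically

f_m = post.sum(axis=1)              # f(m | Data): sum out c
f_c = post.sum(axis=0)              # f(c | Data): sum out m

E_m = float(np.sum(m_grid * f_m))
E_c = float(np.sum(c_grid * f_c))
print(f"E[m] ~ {E_m:.2f}, E[c] ~ {E_c:.2f}")
```

On a grid the integrals of Eqs. (52)-(54) reduce to sums, which is exactly what `post.sum(axis=...)` does.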

Other interesting limit cases are the following.

• Errors only on the $y$ axis and no extra variability.
Taking the limit of Eq. (30) for $\sigma_{x_i} \to 0$ and neglecting irrelevant factors we get

$$f(m, c \,|\, \text{Data}) \propto \prod_i \exp\!\left[-\frac{(y_i - m\,x_i - c)^2}{2\,\sigma_{y_i}^2}\right] f_0(m, c) \tag{56}$$

$$\propto \exp\!\left[-\frac{1}{2} \sum_i \frac{(y_i - m\,x_i - c)^2}{\sigma_{y_i}^2}\right] f_0(m, c)\,. \tag{57}$$

This is the best known and best understood case.
• Errors only on the $y$ axis and extra variability.
Taking the limit of Eq. (53) for $\sigma_{x_i} \to 0$:

$$f(m, c, \sigma_v \,|\, \text{Data}) \propto \prod_i \frac{1}{\sqrt{\sigma_v^2 + \sigma_{y_i}^2}} \exp\!\left[-\frac{(y_i - m\,x_i - c)^2}{2 \left(\sigma_v^2 + \sigma_{y_i}^2\right)}\right] f_0(m, c, \sigma_v)\,. \tag{58}$$

• Scattering of the data points around the hypothesized straight line only due to the 'extra variability'.

$$f(m, c, \sigma_v \,|\, \text{Data}) \propto \prod_i \frac{1}{\sigma_v} \exp\!\left[-\frac{(y_i - m\,x_i - c)^2}{2\,\sigma_v^2}\right] f_0(m, c, \sigma_v) \tag{59}$$

$$\propto \frac{1}{\sigma_v^n} \exp\!\left[-\sum_i \frac{(y_i - m\,x_i - c)^2}{2\,\sigma_v^2}\right] f_0(m, c, \sigma_v)\,. \tag{60}$$

This case corresponds to the joint determination of $m$, $c$ and $\sigma_v$ made by the 'method of the residuals', which can be considered a kind of approximate solution of Eq. (61), achieved by iteration. [Indeed, if there are 'enough' data points, the 'best estimates' achieved by the residual method are very close to the expected values of $m$, $c$ and $\sigma_v$ evaluated from $f(m, c, \sigma_v \,|\, \text{Data})$ if we assume a flat prior distribution for the parameters.]
Although, as has been pointed out above, the full result of the inference is provided by the final pdf, often we do not need such a detailed description of our uncertainty, and we are only interested in providing some 'summaries'. The most interesting ones are the expected values, standard deviations and correlation coefficients, i.e. $\mathrm{E}[m]$, $\mathrm{E}[c]$, $\mathrm{E}[\sigma_v]$, $\sigma(m)$, $\sigma(c)$, $\sigma(\sigma_v)$, $\rho(m,c)$, $\rho(m,\sigma_v)$ and $\rho(c,\sigma_v)$. They are evaluated from $f(m, c, \sigma_v \,|\, \text{Data})$ using their definitions, which are assumed to be known [hereafter we often omit the conditions on which the pdf depends, and we write $f(m, c, \sigma_v)$ instead of $f(m, c, \sigma_v \,|\, \text{Data})$, and so on]. Obviously, these are not the only possible summaries. One might report in addition the mode or the median of each variable, one-dimensional or multi-dimensional probability regions [i.e. regions in the space of the parameters that are believed to contain the true value of the parameter(s) with a well defined probability level], and so on. It all depends on how standard or unusual the shape of $f(m, c, \sigma_v)$ is. I would just like to stress that the most important summaries are the expected value, standard deviation and correlation coefficients, because these are the quantities that matter most in subsequent evaluations of uncertainty. Giving only 'most probable' values and probability intervals might bias the results of further analyses [15].
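These summaries can be read directly off a gridded posterior. A hedged sketch with invented data (true $m = 1.5$, $c = 0.8$), computing expected values, standard deviations and the correlation coefficient $\rho(m, c)$ from their definitions:

```python
import numpy as np

# Invented data: straight line with equal known errors on y.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 5.0, 30)
sigma_y = 0.4
y = 1.5 * x + 0.8 + rng.normal(0.0, sigma_y, x.size)

# Gridded posterior (flat prior), errors-only-on-y limit for brevity.
m_grid = np.linspace(1.2, 1.8, 161)
c_grid = np.linspace(0.0, 1.6, 161)
M, C = np.meshgrid(m_grid, c_grid, indexing="ij")
resid = y - M[..., None] * x - C[..., None]
post = np.exp(-0.5 * np.sum(resid**2, axis=-1) / sigma_y**2)
post /= post.sum()

# Summaries: expectations, standard deviations, correlation coefficient.
E_m = np.sum(M * post)
E_c = np.sum(C * post)
s_m = np.sqrt(np.sum((M - E_m) ** 2 * post))
s_c = np.sqrt(np.sum((C - E_c) ** 2 * post))
rho = np.sum((M - E_m) * (C - E_c) * post) / (s_m * s_c)
print(f"E[m]={E_m:.3f} sigma(m)={s_m:.3f} E[c]={E_c:.3f} "
      f"sigma(c)={s_c:.3f} rho={rho:.2f}")
```

With all $x_i > 0$ the correlation between slope and intercept comes out negative, a point used again below when discussing the intercept bias.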

The prior has been left open on purpose in the above formulas, although we have already anticipated that a flat prior on all parameters gives the correct result in most 'healthy' cases, characterized by a sufficient number of data points. I cannot go here through an extensive discussion of the issue of priors, often criticized as the weak point of the Bayesian approach while they are in reality one of its points of strength. I refer to more extensive discussions available elsewhere (see e.g. [2] and references therein), giving here only a couple of pieces of advice. A flat prior is most of the time a good starting point (unless one uses some packages, like BUGS [11], that do not like flat priors in the range $-\infty$ to $+\infty$; in this case one can mimic a flat prior with a very broad distribution, like a Gaussian with very large $\sigma$). If the result of the inference does not offend your 'physics sensitivity', it means that, essentially, the flat priors have done a good job and it is not worth fooling around with more sophisticated ones. In the specific case we are looking at more closely, that of Eq. (53), the most critical quantity to watch is obviously $\sigma_v$, because it is positive by definition. If, starting from a flat prior (one that also allows negative values), the data constrain the value of $\sigma_v$ in a (positive) region far from zero, and - in practice as a consequence - its marginal distribution is approximately Gaussian, it means the flat prior was a reasonable choice. Otherwise, the next-to-simplest modeling of $f_0(\sigma_v)$ is via the step function $\theta(\sigma_v)$. A more technical choice would be a gamma distribution, with suitable parameters to 'easily' accommodate all envisaged values of $\sigma_v$.
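A minimal sketch of the prior choices just mentioned for $f_0(\sigma_v)$, as log-priors to be added to a log-likelihood; the gamma parameters and the broad-Gaussian scale are placeholders, to be tuned to the envisaged range of $\sigma_v$:

```python
import numpy as np

def log_prior_flat(sv):
    """Improper flat prior; negative sv also allowed, for diagnostic use."""
    return 0.0

def log_prior_step(sv):
    """theta(sigma_v): flat for sv >= 0, impossible below."""
    return 0.0 if sv >= 0.0 else -np.inf

def log_prior_gamma(sv, k=2.0, theta=1.0):
    """Gamma(k, theta) prior, unnormalized; k and theta are placeholders."""
    if sv <= 0.0:
        return -np.inf
    return (k - 1.0) * np.log(sv) - sv / theta

def log_prior_broad_gauss(sv, scale=1.0e3):
    """Very broad Gaussian mimicking a flat prior, for tools that
    reject improper priors (the BUGS-style workaround in the text)."""
    return -0.5 * (sv / scale) ** 2
```

Near the region where the data are informative, the broad Gaussian is numerically indistinguishable from a flat prior.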

The easiest case, which happens very often if one has 'many' data points (where 'many' might already be as few as a few dozen), is that $f(m, c, \sigma_v)$ obtained starting from flat priors is approximately a multi-variate Gaussian distribution, i.e. each marginal is approximately Gaussian. In this case the expected value of each variable is close to its mode, which, since the prior was a constant, corresponds to the value for which the likelihood reaches its maximum. Therefore the parameter estimates derived from the maximum likelihood principle are very good approximations of the expected values of the parameters calculated directly from $f(m, c, \sigma_v)$. In a certain sense, the maximum likelihood best estimates are recovered as a special case that holds under particular conditions (many data points and vague priors). If either condition fails, the results of the formulas derived from such a principle might be incorrect. This is the reason I dislike unneeded principles of this kind, once we have a more general framework of which the methods obtained from 'principles' are just special cases holding under well defined conditions.

The simple case in which $f(m, c, \sigma_v)$ is approximately multi-variate Gaussian also allows us to evaluate, approximately, the covariance matrix of the fit parameters from the Hessian of its logarithm.6 This is due to a well known property of the multi-variate Gaussian and is not strictly related to flat priors. In fact, it can easily be proved that if the generic $f(\boldsymbol{\theta})$ is a multivariate Gaussian, then

$$f(\boldsymbol{\theta}) \propto \exp\!\left[-\frac{1}{2}\,(\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})^T V^{-1}\, (\boldsymbol{\theta} - \hat{\boldsymbol{\theta}})\right], \tag{61}$$

where $V$, given by

$$(V^{-1})_{ij} = \left.\frac{\partial^2 \left[-\ln f(\boldsymbol{\theta})\right]}{\partial \theta_i\, \partial \theta_j}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}, \tag{62}$$

is the covariance matrix of the parameters, and $\hat{\boldsymbol{\theta}}$ is the value for which $f(\boldsymbol{\theta})$ reaches its maximum and hence $-\ln f(\boldsymbol{\theta})$ its minimum.
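This statement translates directly into a numerical recipe: locate the mode, build the Hessian of the minus-log-pdf there (here by central finite differences), and invert it. A sketch for the linear-Gaussian case with invented data, where the result can be checked against the exact covariance $\sigma^2 (A^T A)^{-1}$ of least squares:

```python
import numpy as np

# Invented linear data with known Gaussian errors on y only.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 5.0, 25)
sigma_y = 0.5
y = 1.2 * x + 0.4 + rng.normal(0.0, sigma_y, x.size)

def neg_log_post(p):
    """Minus-log posterior (flat priors), up to an additive constant."""
    m, c = p
    return 0.5 * np.sum((y - m * x - c) ** 2) / sigma_y**2

# Mode: for this linear-Gaussian case it is the least-squares solution.
A = np.vstack([x, np.ones_like(x)]).T
p_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

# Hessian of the minus-log-pdf at the mode, by central finite differences.
eps = 1e-4
H = np.empty((2, 2))
for i in range(2):
    for j in range(2):
        pp, pm, mp, mm = (p_hat.copy() for _ in range(4))
        pp[i] += eps; pp[j] += eps
        pm[i] += eps; pm[j] -= eps
        mp[i] -= eps; mp[j] += eps
        mm[i] -= eps; mm[j] -= eps
        H[i, j] = (neg_log_post(pp) - neg_log_post(pm)
                   - neg_log_post(mp) + neg_log_post(mm)) / (4.0 * eps**2)

V = np.linalg.inv(H)   # covariance matrix under the Gaussian approximation
print("sigma(m), sigma(c) ~", np.sqrt(np.diag(V)))
```

For a truly quadratic minus-log-pdf, as here, the finite-difference Hessian is exact up to rounding.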

An interesting feature of this approximate procedure is that, since it is based on the logarithm of the pdf, normalization factors are irrelevant. In particular, if the priors are flat, the relevant summaries of the inference can be obtained from the logarithm of the likelihood, stripped of all irrelevant factors (which become additive constants in the logarithm and vanish in the derivatives). Let us write down, for some cases of interest, the minus-log-likelihoods, stripped of constant terms and indicated by $L$, i.e. $L = -\ln f(\text{Data} \,|\, m, c, \sigma_v) + \text{const}$.

• Simplest case: linear fit with only known errors on the $y$ axis [from Eq. (58)]:

$$L = \sum_i \frac{(y_i - m\,x_i - c)^2}{2\,\sigma_{y_i}^2}\,, \tag{63}$$

where we recognize the famous chi-squared ($L = \chi^2/2$). Applying Eq. (63) we then get the covariance matrix of the fit parameters as

$$(V^{-1})_{ij} = \frac{1}{2} \left.\frac{\partial^2 \chi^2}{\partial \theta_i\, \partial \theta_j}\right|_{\boldsymbol{\theta} = \hat{\boldsymbol{\theta}}}\,. \tag{64}$$

(See Ref. [2] for the fully developed example yielding analytic formulas for the expected values and covariance matrix of $m$ and $c$.) Note that the often used (but also often misused! [15]) '$\Delta\chi^2 = 1$' rule to calculate the covariance matrix of the parameters comes from the same Gaussian approximation of the final pdf and prior insensitivity. [And, because of the factor of 2 between Eqs. (63) and (66), there is an equivalent '$\Delta(\text{minus-log-likelihood}) = 1/2$' rule, applicable under the same conditions.]
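The '$\Delta\chi^2 = 1$' rule can be checked numerically: profiling the chi-squared over $c$ and scanning $m$, the half-width of the band where $\chi^2 \le \chi^2_{\min} + 1$ reproduces the textbook standard deviation of the slope. A sketch with invented data:

```python
import numpy as np

# Invented data: straight line with equal known errors on y.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 4.0, 40)
sigma_y = 0.3
y = 0.9 * x + 2.0 + rng.normal(0.0, sigma_y, x.size)

def chi2(m, c):
    return np.sum((y - m * x - c) ** 2) / sigma_y**2

# Profile chi-squared over m; for fixed m the minimizing c is mean(y - m*x).
m_scan = np.linspace(0.7, 1.1, 2001)
chi2_prof = np.array([chi2(m, np.mean(y - m * x)) for m in m_scan])

chi2_min = chi2_prof.min()
inside = m_scan[chi2_prof <= chi2_min + 1.0]    # the Delta-chi2 = 1 band
half_width = 0.5 * (inside.max() - inside.min())

# Textbook standard deviation of the slope for this simple case.
sigma_m = sigma_y / np.sqrt(np.sum((x - x.mean()) ** 2))
print(f"Delta-chi2=1 half-width: {half_width:.4f}, sigma(m): {sigma_m:.4f}")
```

The two numbers agree up to the scan-grid resolution, as expected for an exactly parabolic profile.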
• Errors also on the $x$ axis:

$$L = \frac{1}{2} \sum_i \ln\!\left(\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2\right) + \sum_i \frac{(y_i - m\,x_i - c)^2}{2 \left(\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2\right)}\,. \tag{65}$$

In this case expected values and covariance matrix cannot be obtained directly in closed form. Nevertheless, one can use iteratively the formulas of the simplest case, in which the estimate of $m$ is used to evaluate the terms $\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2$ (having the meaning of an effective $y$-error) entering the likelihood of the next iteration. Instead, it is wrong to simply replace the denominator of the $\chi^2$ of Eq. (65) with $\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2$, because this approximation does not take into account the first term on the r.h.s. of Eq. (67), and the slope will be underestimated (as a consequence, the intercept will be over- or under-estimated, depending on the sign of the correlation coefficient between $m$ and $c$, a sign that depends on the sign of the barycenter of the points).
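A sketch of the iterative scheme just described, with invented heteroscedastic data: at each pass a weighted least-squares fit is done with effective variances $\sigma_{y_i}^2 + m^2 \sigma_{x_i}^2$ computed from the previous slope estimate:

```python
import numpy as np

# Invented data: true m = 1.8, c = 0.5, known errors on both axes.
rng = np.random.default_rng(4)
n = 50
x_true = np.linspace(0.0, 5.0, n)
sigma_x = rng.uniform(0.05, 0.20, n)
sigma_y = rng.uniform(0.20, 0.40, n)
x = x_true + rng.normal(0.0, sigma_x)
y = 1.8 * x_true + 0.5 + rng.normal(0.0, sigma_y)

# Iterate weighted least squares, updating the effective variances from m.
m, c = 0.0, 0.0
for _ in range(50):
    w = 1.0 / (sigma_y**2 + m**2 * sigma_x**2)   # effective-variance weights
    Sw, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
    Sxx, Sxy = (w * x * x).sum(), (w * x * y).sum()
    delta = Sw * Sxx - Sx**2
    m_new = (Sw * Sxy - Sx * Sy) / delta        # standard WLS formulas
    c_new = (Sxx * Sy - Sx * Sxy) / delta
    converged = abs(m_new - m) < 1e-12 and abs(c_new - c) < 1e-12
    m, c = m_new, c_new
    if converged:
        break

print(f"m ~ {m:.3f}, c ~ {c:.3f}")
```

The iteration typically converges in a handful of passes; it is the legitimate approximation described above, not the wrong direct substitution into the $\chi^2$.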
• Dispersion on the $y$ axis only due to $\sigma_v$ [from Eq. (61)]:

$$L = n \ln \sigma_v + \sum_i \frac{(y_i - m\,x_i - c)^2}{2\,\sigma_v^2}\,. \tag{66}$$

• The most complete case seen here [from Eq. (53)]:

$$L = \frac{1}{2} \sum_i \ln\!\left(\sigma_v^2 + \sigma_{y_i}^2 + m^2 \sigma_{x_i}^2\right) + \sum_i \frac{(y_i - m\,x_i - c)^2}{2 \left(\sigma_v^2 + \sigma_{y_i}^2 + m^2 \sigma_{x_i}^2\right)}\,. \tag{67}$$

• As in the previous item, but for the general model [from Eq. (34)]:
 (68)
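The complete minus-log-likelihood of Eq. (67) can be minimized numerically over $(m, c, \sigma_v)$; a brute-force grid search keeps the following sketch dependency-free (in practice a standard optimizer would be used). Data and grid ranges are invented:

```python
import numpy as np

# Invented data: unit slope, zero intercept, extra variability sigma_v = 0.4.
rng = np.random.default_rng(5)
n = 60
x_true = np.linspace(0.0, 5.0, n)
sigma_x = np.full(n, 0.10)
sigma_y = np.full(n, 0.20)
x = x_true + rng.normal(0.0, sigma_x)
y = x_true + rng.normal(0.0, np.sqrt(sigma_y**2 + 0.4**2))

# Grid over (m, c, sigma_v); ranges are guesses bracketing the truth.
m_g = np.linspace(0.8, 1.2, 81)
c_g = np.linspace(-0.6, 0.6, 121)
sv_g = np.linspace(0.0, 0.8, 81)

# Squared residuals, shape (m, c, data point); independent of sigma_v.
resid2 = (y - m_g[:, None, None] * x - c_g[None, :, None]) ** 2

best_L = np.inf
for sv in sv_g:
    var = sv**2 + sigma_y**2 + m_g[:, None, None] ** 2 * sigma_x**2
    # Minus-log-likelihood of Eq. (67): log term plus weighted residuals.
    L = 0.5 * np.log(var).sum(axis=-1) + 0.5 * (resid2 / var).sum(axis=-1)
    i, j = np.unravel_index(np.argmin(L), L.shape)
    if L[i, j] < best_L:
        best_L = L[i, j]
        m_hat, c_hat, sv_hat = m_g[i], c_g[j], sv

print(f"m ~ {m_hat:.3f}, c ~ {c_hat:.3f}, sigma_v ~ {sv_hat:.3f}")
```

Note the role of the $\frac{1}{2}\sum_i \ln(\cdot)$ term: without it the minimization would push $\sigma_v$ (and effectively the slope) toward biased values, which is exactly the warning issued above for the naive $\chi^2$ substitution.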

Giulio D'Agostini 2005-11-21