

Computational issues: normalization, fit summaries, priors and approximations

At this point it is important to understand that in the Bayesian approach the full result of the inference is given by the final distribution, which in our case is the one derived above, rewritten here:
$\displaystyle f(m,c,\sigma_v\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$ $\textstyle =$ $\displaystyle k\, \prod_i
\frac{1}{\sqrt{\sigma^2_v + \sigma_{y_i}^2+m^2\,\sigma_{x_i}^2}}\,
\exp{\left[-\frac{(y_i-m\,x_i-c)^2}{2\,(\sigma^2_v + \sigma_{y_i}^2+m^2\,\sigma_{x_i}^2)}\right]}\,
f(m,c,\sigma_v\,\vert\,I)\,,$

where $k$ is `simply' a normalization factor. (This factor is usually the most difficult thing to calculate, and it is often obtained approximately by numerical methods. But this is, in principle, just a technical issue.) Once we have obtained $k$ we have full knowledge of $f(m,c,\sigma_v\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$, and therefore of our uncertainty concerning the model parameters, the distribution of each of which can be obtained by marginalization:
$\displaystyle f(m\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$ $\textstyle =$ $\displaystyle \int\! f(m,c,\sigma_v\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)\,\,dc\,d\sigma_v$ (52)
$\displaystyle f(c\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$ $\textstyle =$ $\displaystyle \int\! f(m,c,\sigma_v\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)\,\,dm\,d\sigma_v$ (53)
$\displaystyle f(\sigma_v\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$ $\textstyle =$ $\displaystyle \int\! f(m,c,\sigma_v\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)\,\,dm\,dc\,.$ (54)

Similarly, the joint distribution of $m$ and $c$ can be obtained as
$\displaystyle f(m,c\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$ $\textstyle =$ $\displaystyle \int\! f(m,c,\sigma_v\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)\,d\sigma_v\,,$ (55)

from which we can easily see that we recover Eq. (30) if we believe the extra variability discussed in the previous section to be absent. This limiting case corresponds to a prior on $\sigma _v$ sharply peaked around zero, i.e. $f(\sigma_v\,\vert\,I) = \delta(\sigma_v)$.
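To give a concrete idea of how $k$ and the marginals (52)-(55) can be obtained numerically, here is a minimal sketch in Python (not part of the original derivation): the unnormalized posterior is evaluated on a grid, normalized, and summed over the unwanted variables. The data, grid ranges and step sizes are invented for illustration, and flat priors are assumed (with $\sigma _v$ restricted to non-negative values by the grid).
\begin{verbatim}
import numpy as np

# Invented data, purely for illustration.
x  = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y  = np.array([1.2, 1.7, 3.4, 3.7, 5.4, 6.3])
sx = np.full_like(x, 0.1)          # sigma_{x_i}
sy = np.full_like(y, 0.2)          # sigma_{y_i}

def log_post(m, c, sv):
    """Log of the unnormalized posterior (flat priors understood)."""
    var = sv**2 + sy**2 + m**2 * sx**2
    return np.sum(-0.5*np.log(var) - (y - m*x - c)**2 / (2.0*var))

# Grid over (m, c, sigma_v); ranges chosen by eye for these data.
m_g  = np.linspace(0.8, 1.2, 80)
c_g  = np.linspace(-0.5, 0.5, 80)
sv_g = np.linspace(0.0, 0.5, 60)

logp = np.array([[[log_post(m, c, sv) for sv in sv_g]
                  for c in c_g] for m in m_g])

p = np.exp(logp - logp.max())      # rescaled to avoid numerical underflow
cell = (m_g[1]-m_g[0]) * (c_g[1]-c_g[0]) * (sv_g[1]-sv_g[0])
k = 1.0 / (p.sum() * cell)         # normalization of the rescaled values
p *= k                             # p now integrates to ~1 over the grid

# Marginal pdfs, Eqs. (52)-(54), integrating out the other variables.
f_m  = p.sum(axis=(1, 2)) * (c_g[1]-c_g[0]) * (sv_g[1]-sv_g[0])
f_c  = p.sum(axis=(0, 2)) * (m_g[1]-m_g[0]) * (sv_g[1]-sv_g[0])
f_sv = p.sum(axis=(0, 1)) * (m_g[1]-m_g[0]) * (c_g[1]-c_g[0])
\end{verbatim}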

Other interesting limit cases are the following.

Although, as pointed out above, the full result of the inference is provided by the final pdf, often we do not need such a detailed description of our uncertainty and we are only interested in providing some `summaries'. The most interesting ones are the expected values, standard deviations and correlation coefficients, i.e. $E(m)$, $E(c)$, $E(\sigma_v)$, $\sigma(m)$, $\sigma(c)$, $\sigma(\sigma_v)$, $\rho(m,c)$, $\rho(m,\sigma_v)$ and $\rho(c,\sigma_v)$. They are evaluated from $f(m,c,\sigma_v)$ using their definitions, which are assumed to be known [hereafter we often omit the conditions on which the pdf depends, writing $f(m,c,\sigma_v)$ instead of $f(m,c,\sigma_v\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$, and so on]. Obviously, these are not the only possible summaries. One might report in addition the mode or the median of each variable, one-dimensional or multi-dimensional probability regions [i.e. regions in the space of the parameters that are believed to contain the true value of the parameter(s) with a well defined probability level], and so on. It all depends on how standard or unusual the shape of $f(m,c,\sigma_v)$ is. I would just like to stress that the most important summaries are the expected values, standard deviations and correlation coefficients, because these are the quantities that matter most in subsequent evaluations of uncertainty. Giving only `most probable' values and probability intervals might bias the results of further analyses [15].
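As a sketch of how these summaries can be extracted from the numerically normalized pdf (continuing from the grid built in the previous sketch; again, all names are illustrative):
\begin{verbatim}
# Continues from the previous sketch: p, m_g, c_g, sv_g, cell.
import numpy as np

w = p * cell                      # probability mass of each grid cell
M, C, SV = np.meshgrid(m_g, c_g, sv_g, indexing='ij')

E_m, E_c, E_sv = (w*M).sum(), (w*C).sum(), (w*SV).sum()  # expected values

s_m  = np.sqrt((w*(M  - E_m )**2).sum())                 # standard deviations
s_c  = np.sqrt((w*(C  - E_c )**2).sum())
s_sv = np.sqrt((w*(SV - E_sv)**2).sum())

rho_mc  = (w*(M - E_m)*(C  - E_c )).sum() / (s_m*s_c)    # correlations
rho_msv = (w*(M - E_m)*(SV - E_sv)).sum() / (s_m*s_sv)
rho_csv = (w*(C - E_c)*(SV - E_sv)).sum() / (s_c*s_sv)
\end{verbatim}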

The prior $f(m,c,\sigma_v\,\vert\,I)$ has been left open on purpose in the above formulas, although we have already anticipated that a flat prior on all parameters usually gives the correct result in most `healthy' cases, characterized by a sufficient number of data points. I cannot go here through an extensive discussion of the issue of priors, often criticized as the weak point of the Bayesian approach while in reality being one of its points of strength. I refer to the more extensive discussions available elsewhere (see e.g. [2] and references therein), giving here only a couple of pieces of advice. A flat prior is in most cases a good starting point (unless one uses packages, like BUGS [11], that do not like flat priors over the range $-\infty$ to $+\infty$; in this case one can mimic it with a very broad distribution, like a Gaussian with very large $\sigma$). If the result of the inference `does not offend your physics sensitivity', it means that, essentially, flat priors have done a good job and it is not worth fooling around with more sophisticated ones. In the specific case we are looking at more closely, that of Eq. (53), the most critical quantity to watch is obviously $\sigma _v$, because it is positively defined. If, starting from a flat prior (also allowing negative values), the data constrain the value of $\sigma _v$ in a (positive) region far from zero, and, in practice consequently, its marginal distribution is approximately Gaussian, it means the flat prior was a reasonable choice. Otherwise, the next-to-simplest modeling of the prior on $\sigma _v$ is via the step function $\theta(\sigma_v)$. A more technical choice would be a gamma distribution, with suitable parameters to `easily' accommodate all envisaged values of $\sigma _v$.
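As a minimal sketch of these three options for the prior on $\sigma _v$ (the gamma parameters below are invented, only meant to show the idea), the corresponding log-priors could be coded as follows and simply added to the log-likelihood:
\begin{verbatim}
import numpy as np
from scipy.stats import gamma

# Three possible priors for sigma_v, as log-pdfs (up to additive constants).
def log_prior_flat(sv):
    return 0.0                                # flat, also over negative values

def log_prior_step(sv):
    return 0.0 if sv >= 0.0 else -np.inf      # theta(sigma_v)

def log_prior_gamma(sv, a=1.5, scale=1.0):    # illustrative parameters
    return gamma.logpdf(sv, a, scale=scale)   # -inf for sv < 0
\end{verbatim}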

The easiest case, which happens very often if one has `many' data points (where `many' might already be as few as a few dozen), is that $f(m,c,\sigma_v)$ obtained starting from flat priors is approximately a multi-variate Gaussian distribution, i.e. each marginal is approximately Gaussian. In this case the expected value of each variable is close to its mode, which, since the prior was a constant, corresponds to the value for which the likelihood ${\cal L}(m,c,\sigma_v\,;\, {\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}})$ reaches its maximum. Therefore the parameter estimates derived from the maximum likelihood principle are very good approximations of the expected values of the parameters calculated directly from $f(m,c,\sigma_v)$. In a certain sense the maximum likelihood best estimates are recovered as a special case that holds under particular conditions (many data points and vague priors). If either condition fails, the results of the formulas derived from such a principle might be incorrect. This is the reason I dislike unneeded principles of this kind, once we have a more general framework of which the methods obtained by `principles' are just special cases under well defined conditions.

The simple case in which $f(m,c,\sigma_v)$ is approximately multi-variate Gaussian also allows one to approximately evaluate the covariance matrix of the fit parameters from the Hessian of its logarithm.6 This is due to a well known property of the multi-variate Gaussian and is not strictly related to flat priors. In fact it can easily be proved that, if the generic $f({\mbox{\boldmath$\theta$}})$ is a multivariate Gaussian, then

$\displaystyle (V^{-1})_{ij}({\mbox{\boldmath$\theta$}})$ $\textstyle =$ $\displaystyle \left.\frac{\partial^2 \varphi}{\partial\theta_i\,\partial\theta_j}\right\vert _{{\mbox{\boldmath$\theta$}}={\mbox{\boldmath$\theta$}}_m}$ (61)

where
$\displaystyle \varphi({\mbox{\boldmath$\theta$}})$ $\textstyle =$ $\displaystyle -\log f({\mbox{\boldmath$\theta$}})\,,$ (62)

$V_{ij}({\mbox{\boldmath$\theta$}})$ is the covariance matrix of the parameters and ${\mbox{\boldmath$\theta$}}_m$ is the value for which $f({\mbox{\boldmath$\theta$}})$ reaches its maximum and hence $\varphi({\mbox{\boldmath$\theta$}})$ its minimum.
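As a numerical illustration of Eqs. (61)-(62) for the linear model of this section (flat priors; the data, starting point and step size are invented, and the finite-difference Hessian is just one possible choice), one can locate ${\mbox{\boldmath$\theta$}}_m$ by minimizing $\varphi$ and then invert the Hessian:
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

# Invented data, as in the first sketch.
x  = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y  = np.array([1.2, 1.7, 3.4, 3.7, 5.4, 6.3])
sx = np.full_like(x, 0.1)
sy = np.full_like(y, 0.2)

def phi(theta):
    """phi = -log f (flat priors, additive constants dropped)."""
    m, c, sv = theta
    var = sv**2 + sy**2 + m**2 * sx**2
    return np.sum(0.5*np.log(var) + (y - m*x - c)**2 / (2.0*var))

res = minimize(phi, x0=[1.0, 0.0, 0.1])        # BFGS by default
theta_m = res.x                                # mode of f, minimum of phi

def hessian(f, t, h=1e-4):
    """Central finite-difference Hessian of f at t."""
    n = len(t)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(t+ei+ej) - f(t+ei-ej)
                       - f(t-ei+ej) + f(t-ei-ej)) / (4.0*h*h)
    return H

V = np.linalg.inv(hessian(phi, theta_m))       # covariance matrix, Eq. (61)
\end{verbatim}
(The `hess_inv' attribute returned by the BFGS minimizer gives a rough approximation of the same matrix, useful as a cross-check.)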

An interesting feature of this approximate procedure is that, since it is based on the logarithm of the pdf, normalization factors are irrelevant. In particular, if the priors are flat, the relevant summaries of the inference can be obtained from the logarithm of the likelihood, stripped of all irrelevant factors (which become additive constants in the logarithm and vanish in the derivatives). Let us write down, for some cases of interest, the minus-log-likelihoods, stripped of constant terms and indicated by $L$, i.e. $\varphi({\mbox{\boldmath$\theta$}}\, ;\, {\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}}) = L({\mbox{\boldmath$\theta$}}\, ;\, {\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}}) + \mbox{\it const}$.
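For instance, for the linear model with extra variance rewritten at the beginning of this section, with flat priors, one such minus-log-likelihood follows directly from the posterior above:
$\displaystyle L(m,c,\sigma_v\,;\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}})$ $\textstyle =$ $\displaystyle \sum_i\left[\frac{1}{2}\log\left(\sigma^2_v + \sigma_{y_i}^2+m^2\,\sigma_{x_i}^2\right) + \frac{(y_i-m\,x_i-c)^2}{2\,(\sigma^2_v + \sigma_{y_i}^2+m^2\,\sigma_{x_i}^2)}\right]\,.$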

