
Non-parabolic $\chi^2$ or log-likelihood curves

The standard methods in physics for adjusting theoretical parameters to experimental data are based on the maximum likelihood principle. In practice, depending on the situation, one minimizes either the `minus log-likelihood' of the parameters [ $\varphi(\boldsymbol{\theta};\mbox{data})=-\ln L(\boldsymbol{\theta};\mbox{data})$] or the $\chi^2$ function of the parameters [i.e. the function $\chi^2(\boldsymbol{\theta};\mbox{data})$]. The notation reminds us that $\varphi$ and $\chi^2$ are seen as mathematical functions of the parameters $\boldsymbol{\theta}$, with the data acting as `parameters' of the functions. As is well understood, apart from an irrelevant constant not depending on the fit parameters, $\varphi$ and $\chi^2$ differ by just a factor of two when the likelihood, seen as a joint probability function or p.d.f. of the data, is a (multivariate) Gaussian distribution of the data: $\varphi=\chi^2/2+k$ (the constant $k$ is often neglected, since we concentrate on the terms which depend on the fit parameters - but sometimes $\chi^2$ and minus log-likelihood might differ by terms depending on the fit parameters!). For the sake of simplicity, let us consider a one-parameter fit. Following the usual practice, we indicate the parameter by $\theta$ (though this fit parameter is just any of the input quantities $\boldsymbol{X}$ of Sec. 2).
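As a minimal numerical sketch of the relation $\varphi=\chi^2/2+k$ for Gaussian data (the measurements and standard deviations below are invented for illustration):

```python
import numpy as np

# Hypothetical measurements y_i of a single quantity theta,
# with known standard deviations sigma_i (values invented):
y = np.array([4.8, 5.3, 5.1])
sigma = np.array([0.2, 0.3, 0.25])

def chi2(theta):
    return np.sum(((y - theta) / sigma) ** 2)

def minus_log_like(theta):
    # -ln of the product of Gaussian densities of the data
    return np.sum(0.5 * ((y - theta) / sigma) ** 2
                  + np.log(sigma * np.sqrt(2 * np.pi)))

# The difference phi - chi2/2 is the same constant k for every theta:
k1 = minus_log_like(4.0) - chi2(4.0) / 2
k2 = minus_log_like(6.0) - chi2(6.0) / 2
assert abs(k1 - k2) < 1e-9
```

The constant $k$ collects the $\theta$-independent normalization terms of the Gaussian densities, which is why it drops out of the minimization.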

If $\varphi(\theta)$ or $\chi^2(\theta)$ has a nice parabolic shape, the likelihood is, apart from a multiplicative factor, a Gaussian function$^4$ of $\theta$. In fact, as is well known from calculus, any function can be approximated by a parabola in the vicinity of its minimum. Let us look in detail at the expansion of $\varphi(\theta)$ around its minimum $\theta_m$:

$\displaystyle \varphi(\theta) \approx \varphi(\theta_m) + \left.\frac{\partial \varphi}{\partial \theta}\right\vert_{\theta_m}\!(\theta-\theta_m) + \frac{1}{2}\left.\frac{\partial^2 \varphi}{\partial \theta^2}\right\vert_{\theta_m}\!(\theta-\theta_m)^2$ (4)
$\displaystyle \phantom{\varphi(\theta)} \approx \varphi(\theta_m) + \frac{1}{2}\,\frac{1}{\alpha^2}\,(\theta-\theta_m)^2\,,$ (5)

where the first-order term of the r.h.s. vanishes by definition of minimum and we have indicated with $\alpha^2$ the inverse of the second derivative at the minimum. Going back to the likelihood, we get:
$\displaystyle L(\theta; \mbox{data}) \approx \exp{\left[-\varphi(\theta_m)\right]} \cdot \exp{\left[-\frac{1}{2}\,\frac{1}{\alpha^2}\,(\theta-\theta_m)^2\right]}$ (6)
$\displaystyle \phantom{L(\theta; \mbox{data})} \approx k\,\exp{\left[-\frac{(\theta-\theta_m)^2}{2\,\alpha^2}\right]}\,,$ (7)

i.e., apart from a multiplicative factor, this is a Gaussian centered at $\theta_m$ with standard deviation $\alpha=(\partial^2 \varphi/\partial \theta^2\vert_{\theta_m})^{-1/2}$. However, although this function is mathematically a Gaussian, it does not yet have the meaning of a probability density $f(\theta\,\vert\,\mbox{data})$ in the inferential sense, i.e. describing our knowledge about $\theta$ in the light of the experimental data. In order to achieve this, we need to process the likelihood through Bayes' theorem, which allows probabilistic inversions to be performed using the basic rules of probability theory and logic. Apart from a conceptually irrelevant normalization factor (that has to be calculated at some point), Bayes' formula is
$\displaystyle f(\theta\,\vert\,\mbox{data}) \propto f(\mbox{data}\,\vert\,\theta) \cdot f_0(\theta)\,.$ (8)
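A minimal numerical sketch of Eq. (8) on a grid (the observation, grid and flat prior below are invented for illustration):

```python
import numpy as np

# Grid of theta values (hypothetical range chosen for illustration)
theta = np.linspace(0, 10, 2001)
dth = theta[1] - theta[0]

# Gaussian likelihood of one invented observation x = 5.0, sigma = 1.0
like = np.exp(-0.5 * (5.0 - theta) ** 2)
prior = np.ones_like(theta)     # flat (vague) prior

# Posterior = likelihood x prior, normalized "at some point"
post = like * prior
post /= post.sum() * dth

# With a flat prior and a symmetric likelihood, the posterior
# expectation coincides with the observed value:
mean = (theta * post).sum() * dth
assert abs(mean - 5.0) < 1e-6
```

The normalization step is where the "conceptually irrelevant" constant of Eq. (8) is actually computed.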

We can now speak about the ``probability that $\theta$ lies within a given interval'' and calculate it, together with the expectation of $\theta$, the standard deviation, and so on.$^5$ If the prior $f_0(\theta)$ is much vaguer than what the data can teach us (via the likelihood), then it can be reabsorbed in the normalization constant, and we get:
$\displaystyle f(\theta\,\vert\,\mbox{data}) \propto f(\mbox{data}\,\vert\,\theta) = L(\theta;\mbox{data})$ (9)
$\displaystyle \mbox{i.e.}\quad\ \ \propto \exp{\left[-\varphi(\theta;\mbox{data})\right]}$ (10)
$\displaystyle \mbox{or}\quad\ \ \propto \exp{\left[-\frac{\chi^2(\theta;\mbox{data})}{2}\right]}$ (11)
$\displaystyle \mbox{parabolic }\varphi\mbox{ or }\chi^2:\ \rightarrow\ f(\theta\,\vert\,\mbox{data}) = \frac{1}{\sqrt{2\pi}\,\sigma_\theta}\,\exp{\left[-\frac{(\theta-\mbox{E}[\theta])^2}{2\,\sigma_\theta^2}\right]}\,.$ (12)

If this is the case, it is a simple exercise to show that:
a) $\mbox{E}[\theta]$ is equal to the value $\theta_m$ which minimizes the $\chi^2$ or $\varphi$;
b) $\sigma_\theta$ can be obtained from the famous conditions $\Delta \chi^2 = 1$ or $\Delta \varphi = 1/2$, respectively, or from the second derivative at $\theta_m$: $\sigma_\theta^{-2} = \frac{1}{2}\left.(\partial^2 \chi^2/\partial \theta^2)\right\vert_{\theta_m}$ or $\sigma_\theta^{-2} = \left.(\partial^2 \varphi/\partial \theta^2)\right\vert_{\theta_m}$, respectively.
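The agreement of the two rules in the parabolic case can be checked numerically; the sketch below uses an invented, exactly parabolic $\chi^2$ with known width:

```python
import numpy as np

# Exactly parabolic chi^2 (minimum position, width and offset invented)
theta_m, sigma_true = 5.0, 0.4

def chi2(t):
    return 3.0 + ((t - theta_m) / sigma_true) ** 2

# Rule Delta chi^2 = 1: find where chi2 crosses chi2_min + 1
ts = np.linspace(3, 7, 400001)
inside = ts[chi2(ts) <= chi2(theta_m) + 1.0]
sigma_from_delta = (inside[-1] - inside[0]) / 2

# Rule from the second derivative: sigma^-2 = (1/2) d2(chi2)/dtheta2
h = 1e-4
d2 = (chi2(theta_m + h) - 2 * chi2(theta_m) + chi2(theta_m - h)) / h**2
sigma_from_deriv = (0.5 * d2) ** -0.5

assert abs(sigma_from_delta - sigma_true) < 1e-3
assert abs(sigma_from_deriv - sigma_true) < 1e-6
```

Both prescriptions recover the width of the parabola; the point of the following discussion is precisely that this coincidence is lost when the curve is not parabolic.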
Though in the frequentist approach the language and methods are usually more convoluted (even when the same numerical results as in the Bayesian reasoning are obtained), because probabilistic statements about physics quantities and fit parameters are not allowed in that approach, it is usually accepted that the above rules $a$ and $b$ rely on the parabolic behavior of the minimized functions. When this approximation does not hold, the frequentist has to replace one prescription by other prescriptions that can handle the exception.$^6$ The situation is simpler and clearer in the Bayesian approach, in which the above rules $a$ and $b$ also hold, but only as approximations under well-defined conditions. If the underlying conditions fail, we know immediately what to do: for example, if the $\chi^2$ description of the data was a good approximation, then $f(\theta)\propto e^{-\chi^2/2}$, properly normalized, is the solution to the problem.$^7$
Figure 2: Example (Ref. [3]) of an asymmetric $\chi^2$ curve (left plot) with a minimum at $\mu=5$ ($\mu$ stands for the value of a generic physics quantity). The result based on the $\chi^2_{min}+1$ `prescription' is compared (right plot) with the p.d.f. based on a uniform prior, i.e. $f(\mu\,\vert\,\mbox{data})\propto \exp[-\chi^2/2]$.
A non-parabolic, asymmetric $\chi^2$ produces an asymmetric $f(\theta)$ (see Fig. 2), the mode of which corresponds, indeed, to what is obtained by minimizing the $\chi^2$, but the expected value and standard deviation differ from what is obtained by the `standard rule'. In particular, the expected value and variance must be evaluated from their definitions:
$\displaystyle \mbox{E}[\theta] = \int\!\theta\, f(\theta\,\vert\,\mbox{data})\,\mbox{d}\theta$ (13)
$\displaystyle \sigma^2_\theta = \int\!(\theta-\mbox{E}[\theta])^2\, f(\theta\,\vert\,\mbox{data})\,\mbox{d}\theta\,.$ (14)

Other examples of asymmetric $\chi^2$ curves, including cases with more than one minimum, are shown in Chapter 12 of Ref. [3], and compared with the results coming from frequentist prescriptions (but, indeed, there is no generally accepted rule to get frequentist results - whatever they mean - when the $\chi^2$ shape gets complicated).
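Eqs. (13)-(14) are easily evaluated numerically on a grid; the skewed single-minimum $\chi^2$ below is invented for illustration, but shows how the expected value moves away from the mode:

```python
import numpy as np

# Grid and skewed chi^2 form invented for illustration:
# chi^2 symmetric in ln(theta), hence asymmetric in theta
theta = np.linspace(0.1, 30, 200001)
dth = theta[1] - theta[0]
chi2 = (np.log(theta / 5.0) / 0.3) ** 2   # minimum at theta = 5

# f(theta|data) proportional to exp(-chi^2/2), normalized on the grid
f = np.exp(-chi2 / 2)
f /= f.sum() * dth

mode = theta[np.argmin(chi2)]              # what chi^2 minimization returns
mean = (theta * f).sum() * dth             # Eq. (13)
var = ((theta - mean) ** 2 * f).sum() * dth  # Eq. (14)

# For a right-skewed curve the expected value exceeds the mode:
assert mean > mode
```

For this particular form the mode stays at the $\chi^2$ minimum ($\theta=5$), while the expectation is pulled toward the long tail, which is exactly the effect discussed around Fig. 2.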

Unfortunately, it is not easy to translate numbers obtained by ad hoc rules into probabilistic results, because the dependence on the actual shape of the $\chi^2$ or $\varphi$ curve can be non-trivial. Anyhow, some rules of thumb can be given for next-to-simple situations in which the $\chi^2$ or $\varphi$ has only one minimum and the curve looks like a `skewed parabola', as in Fig. 2:

Figure 3: Example of two-dimensional multi-spot ``68% CL'' and ``95% CL'' contours obtained by slicing the $\chi^2$ or the minus log-likelihood curve at some magic levels. What do they mean?

The remarks about the misuse of the $\Delta \chi^2 = 1$ and $\Delta \varphi = 1/2$ rules can be extended to cases where several parameters are involved. I do not want to go into details (in the Bayesian approach there is nothing deeper than studying $k\, e^{-\chi^2/2}$ or $k\, e^{-\varphi}$ as a function of several parameters$^8$), but I just want to make the reader worried about the meaning of contour plots of the kind shown in Fig. 3.
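As a hint of what "studying $k\,e^{-\chi^2/2}$ as a function of several parameters" amounts to in practice, the following sketch (grid and $\chi^2$ form invented for illustration) normalizes $e^{-\chi^2/2}$ on a two-dimensional grid and marginalizes one parameter out, instead of slicing contours at magic levels:

```python
import numpy as np

# Two invented parameters a, b on a grid
a = np.linspace(-3, 7, 401)
b = np.linspace(-3, 7, 401)
A, B = np.meshgrid(a, b, indexing="ij")

# Invented two-parameter chi^2 (here a simple uncorrelated paraboloid)
chi2 = ((A - 2.0) / 1.0) ** 2 + ((B - 1.0) / 0.5) ** 2

# Joint p.d.f. proportional to exp(-chi^2/2), normalized on the grid
post = np.exp(-chi2 / 2)
post /= post.sum()

# Marginal p.d.f. of the first parameter: integrate the second one out
marg_a = post.sum(axis=1)
mean_a = (a * marg_a).sum()
assert abs(mean_a - 2.0) < 1e-6
```

Summaries (expectations, standard deviations, probability intervals) then follow from the marginal distributions, with a clear probabilistic meaning, whatever the shape of the $\chi^2$ surface.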

Giulio D'Agostini 2004-04-27