
Probabilistic parametric inference from a set of data points with errors on both axes

Let us consider a `law' that relates the `true' values of two quantities, indicated here by $\mu_x$ and $\mu_y$ :

$\begin{displaymath} \mu_y = \mu_y(\mu_x;{\mbox{\boldmath$\theta$}})\,, \end{displaymath}$

(3)

where ${\mbox{\boldmath$\theta$}}$ stands for the parameters of the law, whose number is $M$. In the linear case Eq. (3) reduces to

$\displaystyle \mu_y$	$\textstyle =$	$\displaystyle m\,\mu_x + c$	(4)

i.e. ${\mbox{\boldmath$\theta$}} = \{m,c\}$ and $M=2$. As is well understood, because of `errors' we do not observe $\mu_x$ and $\mu_y$ directly, but experimental quantities² $x$ and $y$ that might differ, on an event by event basis, from $\mu_x$ and $\mu_y$. The outcome of the `observation' (see footnote 2) $x_i$ for a given $\mu_{x_i}$ (analogous reasoning applies to $y_i$ and $\mu_{y_i}$) is modeled by an error function $f(x_i\,\vert\,\mu_{x_i},I)$, that is indeed a probability density function (pdf) conditioned by $\mu_{x_i}$ and the `general state of knowledge' $I$. The latter stands for all background knowledge behind the analysis, that is, what makes us believe, for example, the relation $\mu_y = \mu_y(\mu_x;{\mbox{\boldmath$\theta$}})$, the particular mathematical expressions for $f(x_i\,\vert\,\mu_{x_i},I)$ and $f(y_i\,\vert\,\mu_{y_i},I)$, and so on. Note that the shape of the error function might depend on the value of $\mu_{x_i}$, as happens if the detector does not respond in the same way to different stimuli. A usual assumption is that errors are normally distributed, i.e.

$\displaystyle x_i$	$\textstyle \sim$	$\displaystyle {\cal N}(\mu_{x_i}, \sigma_{x_i})$	(5)
$\displaystyle y_i$	$\textstyle \sim$	$\displaystyle {\cal N}(\mu_{y_i}, \sigma_{y_i})\,,$	(6)

where the symbol ` $\sim$ ' stands for `is described by the distribution' (or `follows the distribution'), and where we still leave open the possibility that the standard deviations, which we consider known, might be different in different observations. Anyway, for the sake of generality, we shall make use of assumptions (5) and (6) only in the next section.
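To make the model of Eqs. (4)-(6) concrete, the following Python sketch generates observed pairs from given true values. It is only an illustration: the function name `simulate_points`, the parameter values and the standard deviations are all hypothetical choices, not taken from the text.

```python
import numpy as np

def simulate_points(mu_x, m, c, sigma_x, sigma_y, rng):
    """Draw observed (x_i, y_i) from true values mu_x under a linear law."""
    mu_x = np.asarray(mu_x, dtype=float)
    mu_y = m * mu_x + c              # Eq. (4): deterministic link between true values
    x = rng.normal(mu_x, sigma_x)    # Eq. (5): x_i ~ N(mu_x_i, sigma_x_i)
    y = rng.normal(mu_y, sigma_y)    # Eq. (6): y_i ~ N(mu_y_i, sigma_y_i)
    return x, y

rng = np.random.default_rng(seed=1)
mu_x = np.linspace(0.0, 10.0, 6)                      # assumed true values
sigma_x = np.full(6, 0.3)                             # same x error for every point
sigma_y = np.array([0.2, 0.2, 0.5, 0.5, 1.0, 1.0])    # y error varies point by point
x, y = simulate_points(mu_x, m=2.0, c=1.0,
                       sigma_x=sigma_x, sigma_y=sigma_y, rng=rng)
```

Passing arrays for `sigma_x` and `sigma_y` reflects the remark that the standard deviations may differ from one observation to another.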

If we think of $N$ pairs of measurements of $\mu_x$ and $\mu_y$, before doing the experiment we are uncertain about $4\,N$ quantities (all $x$'s, all $y$'s, all $\mu_x$'s and all $\mu_y$'s, indicated respectively as ${\mbox{\boldmath$x$}}$, ${\mbox{\boldmath$y$}}$, ${\mbox{\boldmath$\mu$}}_x$ and ${\mbox{\boldmath$\mu$}}_y$) plus the number of parameters, i.e. in total $4\,N + M$, which become $4\,N + 2$ in linear fits. [But note that, due to the believed deterministic relationship (3), the number of independent variables is in fact $3\,N + M$.] Our final goal, expressed in probabilistic terms, is to get the pdf of the parameters given the experimental information and all background knowledge:

$\begin{displaymath} f({\mbox{\boldmath$\theta$}}\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I) \ \ \ [\,\Longrightarrow f(m,c\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I) \ \ \mbox{for linear fits}\,]\,. \end{displaymath}$

Probability theory teaches us how to get the conditional pdf $f({\mbox{\boldmath$\theta$}}\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$ if we know the joint distribution $f({\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,I)$. The first step consists in calculating the $2\,N + M$ variable pdf (only $N + M$ of which are independent) that describes the uncertainty of what is not precisely known, given what is known (plus all background knowledge). This is achieved by a multivariate extension of Eq. (1):

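$\displaystyle f({\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$	$\textstyle =$	$\displaystyle \frac{f({\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,I)} {f({\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}}\,\vert\,I)}$	(7)
	$\textstyle =$	$\displaystyle \frac{f({\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,I)} {\int\! f({\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,I)\,\, d{\mbox{\boldmath$\mu$}}_x\,d{\mbox{\boldmath$\mu$}}_y\,d{\mbox{\boldmath$\theta$}}}$	(8)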

Equations (7) and (8) are two different ways of writing Bayes' theorem in the case of multiple inference. Going from (7) to (8) we have `marginalized' $f({\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,I)$ over ${\mbox{\boldmath$\mu$}}_x$, ${\mbox{\boldmath$\mu$}}_y$ and ${\mbox{\boldmath$\theta$}}$, i.e. we used an extension of Eq. (2) to many variables. [The standard textbook version of the Bayes formula differs from Eqs. (7) and (8) because the joint pdf's that appear on the r.h.s. of Eqs. (7)-(8) are usually factorized using the so-called `chain rule', i.e. an extension of Eq. (1) to many variables.]
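As a toy illustration of Eqs. (7) and (8), the Python sketch below replaces the vectors with a single discrete observable and a single discrete parameter, so that the integral in the denominator becomes a sum. The joint probabilities are made-up numbers, chosen only so that they add up to one.

```python
import numpy as np

# Hypothetical joint probabilities f(x, theta | I):
# rows are the three possible values of x, columns the two values of theta.
joint = np.array([[0.10, 0.05],
                  [0.20, 0.25],
                  [0.10, 0.30]])

x_obs = 1                              # index of the observed value of x

# Denominator of Eq. (8): marginalize the joint over theta at the observed x.
evidence = joint[x_obs].sum()

# Eq. (7): conditional pdf of theta given the observation.
posterior = joint[x_obs] / evidence
```

Once `x_obs` is fixed, `evidence` is a single number, so dividing by it only rescales the row of the joint table so that it sums to one.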

The second step consists in marginalizing the $(2\,N + M)$-dimensional pdf over the variables in which we are not interested:

$\displaystyle f({\mbox{\boldmath$\theta$}}\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$	$\textstyle =$	$\displaystyle \int\! f({\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)\,\, d{\mbox{\boldmath$\mu$}}_x\,d{\mbox{\boldmath$\mu$}}_y$	(9)

Before doing that, we note that the denominator of the r.h.s. of Eqs. (7)-(8) is just a number, once the model and the set of observations $\{{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}}\}$ are defined, so we can absorb it in the normalization constant. Therefore Eq. (9) can be simply rewritten as

$\displaystyle f({\mbox{\boldmath$\theta$}}\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$	$\textstyle \propto$	$\displaystyle \int\! f({\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,I)\,\, d{\mbox{\boldmath$\mu$}}_x\,d{\mbox{\boldmath$\mu$}}_y\,.$	(10)

We understand then that, essentially, we need to set up $f({\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,I)$ using the pieces of information that come from our background knowledge $I$. This seems a horrible task, but it becomes feasible thanks to the chain rule of probability theory, which allows us to rewrite $f({\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,I)$ in the following way:

$\displaystyle f({\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,I)$	$\textstyle =$	$\displaystyle f({\mbox{\boldmath$x$}}\,\vert\,{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}},I)$
		$\displaystyle \cdot f({\mbox{\boldmath$y$}}\,\vert\,{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}},I)$
		$\displaystyle \cdot f({\mbox{\boldmath$\mu$}}_y\,\vert\,{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\theta$}},I)$
		$\displaystyle \cdot f({\mbox{\boldmath$\mu$}}_x\,\vert\,{\mbox{\boldmath$\theta$}},I)$
		$\displaystyle \cdot f({\mbox{\boldmath$\theta$}}\,\vert\,I)$	(11)

(Obviously, among the several possible ones, we choose the factorization that matches our knowledge of the physics case.) At this point let us make an inventory of the ingredients, stressing their effective conditions and making use of independence, when it holds.

Each observation $x_i$ depends directly only on the corresponding true value $\mu_{x_i}$:

$\displaystyle f({\mbox{\boldmath$x$}}\,\vert\,{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}},I)$ $\textstyle =$ $\displaystyle f({\mbox{\boldmath$x$}}\,\vert\,{\mbox{\boldmath$\mu$}}_x,I) = \prod_i f(x_i\,\vert\,\mu_{x_i},I)$ (12)

$\displaystyle [\ \Longrightarrow \prod_i {\cal N}(\mu_{x_i},\sigma_{x_i}) \ ].$ (13)

(In square brackets is the `routinely' used pdf.)

Each observation $y_i$ depends directly only on the corresponding true value $\mu_{y_i}$:

$\displaystyle f({\mbox{\boldmath$y$}}\,\vert\,{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}},I)$ $\textstyle =$ $\displaystyle f({\mbox{\boldmath$y$}}\,\vert\,{\mbox{\boldmath$\mu$}}_y,I) = \prod_i f(y_i\,\vert\,\mu_{y_i},I)$ (14)

$\displaystyle [\ \Longrightarrow \prod_i {\cal N}(\mu_{y_i},\sigma_{y_i}) \ ].$ (15)

Each true value $\mu_y$ depends only, and in a deterministic way, on the corresponding true value $\mu_x$ and on the parameters ${\mbox{\boldmath$\theta$}}$. This is formally equivalent to taking an infinitely sharp distribution of $\mu_{y_i}$ around $\mu_y(\mu_{x_i};{\mbox{\boldmath$\theta$}})$, i.e. a Dirac delta function:

$\displaystyle f({\mbox{\boldmath$\mu$}}_y\,\vert\,{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\theta$}},I)$ $\textstyle =$ $\displaystyle \prod_i \delta[\,\mu_{y_i}- \mu_{y}(\mu_{x_i},{\mbox{\boldmath$\theta$}})\,]$ (16)

$\displaystyle [\ \Longrightarrow \prod_i \delta( \mu_{y_i} -m\, \mu_{x_i} - c )\ ]$ (17)

Finally, $\mu_{x_i}$ and ${\mbox{\boldmath$\theta$}}$ are usually independent and become the priors of the problem,³ which one takes `vague' enough, unless physical motivations suggest otherwise. For the $\mu_{x_i}$ we immediately take uniform distributions over a large domain (a `flat prior'). Instead, we leave here the expression of $f({\mbox{\boldmath$\theta$}}\,\vert\,I)$ undefined, as a reminder for critical problems (e.g. one of the parameters is positive by definition because of its physical meaning), though it can also be taken flat in routine applications with `many' data points.

$\displaystyle f({\mbox{\boldmath$\mu$}}_x\,\vert\,{\mbox{\boldmath$\theta$}},I)\cdot f({\mbox{\boldmath$\theta$}}\,\vert\,I)$ $\textstyle =$ $\displaystyle f({\mbox{\boldmath$\mu$}}_x\,\vert\,I)\cdot f({\mbox{\boldmath$\theta$}}\,\vert\,I)$ (18)

$\textstyle =$ $\displaystyle k_x\,f({\mbox{\boldmath$\theta$}}\,\vert\,I)$ (19)

The constant value of $f({\mbox{\boldmath$\mu$}}_x\,\vert\,I)$, indicated here by $k_x$, is then in practice absorbed in the normalization constant.

In conclusion we have

$\displaystyle f({\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},{\mbox{\boldmath$\mu$}}_x,{\mbox{\boldmath$\mu$}}_y,{\mbox{\boldmath$\theta$}}\,\vert\,I)$	$\textstyle =$	$\displaystyle \prod_i f(x_i\,\vert\,\mu_{x_i},I) \cdot f(y_i\,\vert\,\mu_{y_i},I) \cdot \delta[\,\mu_{y_i}- \mu_{y}(\mu_{x_i},{\mbox{\boldmath$\theta$}})\,]\, \cdot f(\mu_{x_i}\,\vert\,I)\cdot f({\mbox{\boldmath$\theta$}}\,\vert\,I)$
			(20)
	$\textstyle =$	$\displaystyle \prod_i k_{x_i}\,f(x_i\,\vert\,\mu_{x_i},I) \cdot f(y_i\,\vert\,\mu_{y_i},I) \cdot \delta[\,\mu_{y_i}- \mu_{y}(\mu_{x_i},{\mbox{\boldmath$\theta$}})\,]\, \cdot f({\mbox{\boldmath$\theta$}}\,\vert\,I)$	(21)
	$\textstyle \propto$	$\displaystyle \prod_i f(x_i\,\vert\,\mu_{x_i},I) \cdot f(y_i\,\vert\,\mu_{y_i},I) \cdot \delta[\,\mu_{y_i}- \mu_{y}(\mu_{x_i},{\mbox{\boldmath$\theta$}})\,]\, \cdot f({\mbox{\boldmath$\theta$}}\,\vert\,I) \,.$	(22)

**Figure 1:** *Graphical representation of the model in terms of a Bayesian network (see text).*

Figure 1 provides a graphical representation of the model [or, more precisely, a graphical representation of Eq. (20)]. In this diagram the probabilistic connections are indicated by solid lines and the deterministic connections by dashed lines. This kind of network of probabilistic and deterministic relations among uncertain quantities is known as a `Bayesian network',⁴ `belief network', `influence network', `causal network' and other names meaning substantially the same thing. From Eqs. (10) and (22) we then get

$\displaystyle f({\mbox{\boldmath$\theta$}}\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$	$\textstyle \propto$	$\displaystyle \left[\int \prod_i k_{x_i}\,f(x_i\,\vert\,\mu_{x_i},I) \cdot f(y_i\,\vert\,\mu_{y_i},I) \cdot \delta[\,\mu_{y_i}- \mu_{y}(\mu_{x_i},{\mbox{\boldmath$\theta$}})\,]\,\, d{\mbox{\boldmath$\mu$}}_x\,d{\mbox{\boldmath$\mu$}}_y \right] \cdot f({\mbox{\boldmath$\theta$}}\,\vert\,I)$
			(23)
	$\textstyle \propto$	$\displaystyle f( {\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}}\,\vert\,{\mbox{\boldmath$\theta$}},I) \cdot f({\mbox{\boldmath$\theta$}}\,\vert\,I) = {\cal L}({\mbox{\boldmath$\theta$}}\,;\, {\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}}) \cdot f({\mbox{\boldmath$\theta$}}\,\vert\,I)$	(24)

where we have factorized the unnormalized `final' pdf into the `likelihood'⁵ ${\cal L}({\mbox{\boldmath$\theta$}}\,;\, {\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}})$ (the content of the large square bracket) and the `prior' $f({\mbox{\boldmath$\theta$}}\,\vert\,I)$ .

We see then that, apart from the prior, the result is essentially given by a product of terms, each of which depends on an individual pair of measurements:

$\displaystyle f({\mbox{\boldmath$\theta$}}\,\vert\,{\mbox{\boldmath$x$}},{\mbox{\boldmath$y$}},I)$	$\textstyle \propto$	$\displaystyle \left[\prod_i {\cal L}_i({\mbox{\boldmath$\theta$}}\,;\,x_i,y_i,I)\right] \cdot f({\mbox{\boldmath$\theta$}}\,\vert\,I)\,,$	(25)

where

$\displaystyle {\cal L}_i({\mbox{\boldmath$\theta$}}\,;\,x_i,y_i) = f(x_i,y_i\,\vert\,{\mbox{\boldmath$\theta$}},I)$	$\textstyle =$	$\displaystyle k_{x_i}\, \int f(x_i\,\vert\,\mu_{x_i},I) \cdot f(y_i\,\vert\,\mu_{y_i},I) \cdot \delta[\,\mu_{y_i}- \mu_{y}(\mu_{x_i},{\mbox{\boldmath$\theta$}})\,] \,\, d{\mu_{x_i}}d{\mu_{y_i}}$
			(26)
	$\textstyle =$	$\displaystyle k_{x_i}\, \int f(x_i\,\vert\,\mu_{x_i},I) \cdot f(y_i\,\vert\,\mu_{y}(\mu_{x_i},{\mbox{\boldmath$\theta$}}),I) \,\, d{\mu_{x_i}}$	(27)

and the constant factor $k_{x_i}$ , irrelevant in the Bayes formula, is a reminder of the priors about $\mu_{x_i}$ (see footnote 5).
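Equations (25) and (27) can be tried out numerically. The Python sketch below is a minimal illustration under stated assumptions, not the method developed in the text: it assumes normal error functions as in Eqs. (5)-(6), the linear law of Eq. (4), $k_{x_i}=1$ and a flat prior on $m$ and $c$. The data values, the grid ranges and the function names are all hypothetical. The integral over $\mu_{x_i}$ is approximated by a plain Riemann sum.

```python
import numpy as np

def normal_pdf(v, mean, sigma):
    """Density of N(mean, sigma) evaluated at v."""
    return np.exp(-0.5 * ((v - mean) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def point_likelihood(m, c, x_i, y_i, sx, sy, n_grid=3001, width=12.0):
    """Eq. (27) with k_x = 1: Riemann sum over mu_x on a uniform grid."""
    mu = np.linspace(x_i - width * sx, x_i + width * sx, n_grid)
    d_mu = mu[1] - mu[0]
    # f(x_i | mu_x) times f(y_i | mu_y(mu_x; theta)), with mu_y = m * mu_x + c
    integrand = normal_pdf(x_i, mu, sx) * normal_pdf(y_i, m * mu + c, sy)
    return np.sum(integrand) * d_mu

def log_likelihood(m, c, x, y, sigma_x, sigma_y):
    """Logarithm of the product of terms in Eq. (25), prior excluded."""
    return sum(np.log(point_likelihood(m, c, *p))
               for p in zip(x, y, sigma_x, sigma_y))

# Made-up data, roughly following y = 2 x + 1 (hypothetical numbers).
x = np.array([0.1, 2.2, 3.9, 6.1, 7.8, 10.2])
y = np.array([1.3, 5.1, 9.2, 12.8, 17.1, 21.0])
sigma_x = np.full(6, 0.3)
sigma_y = np.full(6, 0.5)

# Evaluate the unnormalized posterior on a grid of (m, c); flat prior assumed.
m_grid = np.linspace(1.5, 2.5, 41)
c_grid = np.linspace(-1.0, 3.0, 41)
log_post = np.array([[log_likelihood(m, c, x, y, sigma_x, sigma_y)
                      for c in c_grid] for m in m_grid])

# Normalize numerically: constant factors such as k_x drop out at this step.
post = np.exp(log_post - log_post.max())
dm = m_grid[1] - m_grid[0]
dc = c_grid[1] - c_grid[0]
post /= post.sum() * dm * dc

i_m, i_c = np.unravel_index(post.argmax(), post.shape)
m_mode, c_mode = m_grid[i_m], c_grid[i_c]
```

For normal errors and a linear law the integral in `point_likelihood` is a Gaussian convolution with a well-known closed form, a normal density in $y_i$ centred on $m\,x_i + c$ with variance $\sigma_{y_i}^2 + m^2\sigma_{x_i}^2$, which gives a convenient check of the numerical sum.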


Giulio D'Agostini 2005-11-21
