Appendix A - Some remarks on `Bayes' formulae'

Equation ([*]) is a direct consequence of the probability rule relating joint and conditional probability, that is, for generic `events' $A$ and $B$,
$\displaystyle P(A\cap B) \;=\; P(B\,\vert\,A)\cdot P(A) \;=\; P(A\,\vert\,B)\cdot P(B)\,, \hspace{2.3cm}$ (A.1)

having added to $P($Inf$)$ of Eq. ([*]) the suffix `0' in order to emphasize its role as `prior' probability. Equation (A.1) trivially yields
$\displaystyle P(A\,\vert\,B) \;=\; \frac{P(B\,\vert\,A)\cdot P(A)}{P(B)} \ \ \longrightarrow \ \ \frac{P(B\,\vert\,A)\cdot P_0(A)}{P(B)}\,, \hspace{1.5cm}$ (A.2)

having also emphasized that $P(A)$ in the r.h.s. is the probability of $A$ before it is updated by the new condition $B$. But, indeed, the essence of Bayes' rule is given by
$\displaystyle P(A\,\vert\,B) \;=\; \frac{P(A\cap B)}{P(B)} \;=\; \frac{P(A,B)}{P(B)}\,, \hspace{4.9cm}$ (A.3)

in which we have rewritten `$A \cap B$' in the way that is customary for uncertain numbers (`random variables'), as we shall see in a while. Moreover, just as we can `expand' the numerator (using the so-called chain rule) to go from Eq. (A.3) to Eq. (A.2), and then to Eq. ([*]), we can similarly expand the denominator in two steps. We start by `decomposing' $B$ into $B\cap A$ and $B\cap \overline{A}$, from which it follows that
$\displaystyle B \;=\; (B\cap A) \cup (B\cap \overline{A})\,,$
$\displaystyle P(B) \;=\; P(B\cap A) + P(B\cap \overline{A}) \;=\; P(B\,\vert\,A)\cdot P(A) + P(B\,\vert\,\overline{A})\cdot P(\overline{A})\,.$

After the various `expansions' we can rewrite Eq. (A.3) as
$\displaystyle P(A\,\vert\,B) \;=\; \frac{P(B\,\vert\,A)\cdot P(A)}{P(B\,\vert\,A)\cdot P(A) + P(B\,\vert\,\overline{A})\cdot P(\overline{A})}\,. \hspace{2.7cm}$ (A.4)
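As a purely illustrative numerical example of Eq. (A.4) (the numbers are invented and not taken from the paper), let $A$ stand for `infected' (Inf) and $B$ for a positive test result (Pos), with assumed $P(\mbox{Pos}\,\vert\,\mbox{Inf})=0.98$, $P(\mbox{Pos}\,\vert\,\overline{\mbox{Inf}})=0.05$ and prior $P_0(\mbox{Inf})=0.01$. Then

$\displaystyle P(\mbox{Inf}\,\vert\,\mbox{Pos}) \;=\; \frac{0.98\times 0.01}{0.98\times 0.01 + 0.05\times 0.99} \;\approx\; 0.17\,,$

that is, even a rather good test yields a modest probability of infection when the prior probability is small.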

Finally, if instead of only two possibilities $A$ and $\overline{A}$, we have a complete class of hypotheses $H_i$, i.e. such that $\sum_iP(H_i)=1$ and $P(H_i\cap H_j)=0$ for $i\ne j$, we get the famous
$\displaystyle P(H_i\,\vert\,E) \;=\; \frac{P(E\,\vert\,H_i)\cdot P(H_i)}{\sum_j P(E\,\vert\,H_j)\cdot P(H_j)} \ \ \ \longleftarrow \ \ \ \frac{P(H_i\cap E)}{P(E)}\,, \hspace{1.3cm}$ (A.5)

having also replaced the symbol $B$ by $E$, given its meaning of `effect', upon which the probabilities of the different hypotheses $H_i$ are updated. Moreover, the sum in the denominator of the first r.h.s. of Eq. (A.5) makes it explicit that the denominator is just a normalization factor, and therefore the essence of the reasoning can be expressed as
$\displaystyle P(H_i\,\vert\,E) \;\propto\; P(E\,\vert\,H_i)\cdot P(H_i) \;=\; P(H_i\cap E)\,. \hspace{3.4cm}$ (A.6)
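The fact that the denominator only restores normalization can be visualized with a few lines of Python (the three hypotheses and all the numbers below are invented, purely for illustration): each prior is multiplied by the corresponding likelihood, as in Eq. (A.6), and the products are then divided by their sum, i.e. by $P(E)$ of Eq. (A.5).

```python
import numpy as np

# Hypothetical example of Eqs. (A.5)-(A.6): three exclusive hypotheses H_i
# with priors P(H_i) and likelihoods P(E|H_i); all numbers are made up.
prior      = np.array([0.60, 0.30, 0.10])        # P(H_i), summing to 1
likelihood = np.array([0.10, 0.50, 0.90])        # P(E | H_i)

unnormalized = likelihood * prior                # proportional to P(H_i | E), Eq. (A.6)
posterior    = unnormalized / unnormalized.sum() # divide by P(E), Eq. (A.5)

print(posterior)        # [0.2 0.5 0.3]
print(posterior.sum())  # 1.0 -- the denominator only restores normalization
```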

The extension to discrete `random variables' is straightforward, since the probability distribution $f(x)$ has the meaning of $P(X=x)$, with $X$ the name of the variable and $x$ one of the possible values that it can assume. Similarly, $f(x,y)$ stands for $P(X=x,\,Y=y)\equiv P\left((X=x)\cap (Y=y)\right)$, $f(x\,\vert\,y)$ for $P(X=x\,\vert\,Y=y)$, and so on. Moreover, all possible values of $X$, as well as all possible values of $Y$, form a complete class of hypotheses (the distributions are normalized). Equation (A.3), with its variations and `expansions', then becomes, for $X$ and $Y$,
$\displaystyle f(x\,\vert\,y) \;=\; \frac{f(x,y)}{f(y)} \;=\; \frac{f(y\,\vert\,x)\cdot f(x)}{f(y)} \;=\; \frac{f(y\,\vert\,x)\cdot f(x)}{\sum_{x'} f(y\,\vert\,x')\cdot f(x')}$
  $\displaystyle \propto\; f(y\,\vert\,x)\cdot f(x) \;=\; f(x,y)\,, \hspace{7.3cm}$ (A.7)

which can be further extended to several other variables. For example, adding $Z$, $V$ and $W$, and being interested in the joint probability that $X$ and $Z$ assume the values $x$ and $z$, conditioned by $Y=y$, $V=v$ and $W=w$, we get
$\displaystyle f(x,z\,\vert\,y,v,w) \;=\; \frac{f(x,y,v,w,z)}{f(y,v,w)}\,. \hspace{5.4cm}$ (A.8)
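The discrete version of Eq. (A.7) can also be checked numerically. The following sketch, with an invented $2\times 2$ joint distribution, obtains $f(x\,\vert\,y)$ by dividing the joint table $f(x,y)$ by the marginal $f(y)$, computed by summing over $x$:

```python
import numpy as np

# Invented joint distribution f(x, y): rows index the values of x,
# columns the values of y; the entries sum to 1.
f_xy = np.array([[0.10, 0.20],
                 [0.30, 0.40]])

f_y = f_xy.sum(axis=0)            # marginal f(y), summing over x
f_x_given_y = f_xy / f_y          # f(x|y) = f(x,y)/f(y), column by column (Eq. A.7)

print(f_x_given_y)                # each column is a normalized distribution over x
print(f_x_given_y.sum(axis=0))    # [1. 1.]
```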

To conclude, some remarks are important, especially for the applications:
  1. Equations (A.7) and (A.8) are also valid for continuous variables, in which case the various `$f()$' have the meaning of probability density functions, and the sums needed to get the (possibly joint) marginal in the denominator are replaced by integrals.
  2. The numerator of Eq. (A.8) is `expanded' using a chain rule, choosing, among the several possibilities, the one which makes explicit the (assumed) causal connections of the different variables in the game, as stressed in the proper places throughout the paper (see e.g. footnote [*], Sec. [*] and Sec. [*]).
  3. A related remark is that, among the variables entering the game, such as those of Eq. (A.8), some may be continuous and others discrete. The probabilistic meaning of `$f(\ldots)$', taking the example of a bivariate case $f(x,y)$ with $x$ discrete and $y$ continuous, is then $P(X=x,\, y\le Y \le y+\mbox{d}y) = f(x,y)\,\mbox{d}y$, with the normalization condition $\sum_x\int f(x,y)\,\mbox{d}y = 1$.
  4. Finally, a crucial observation is that, given the model which connects the variables (the graphical representations of the kind shown in the paper are very useful to understand it) and its parameters, the denominator of Eq. (A.8) is just a number (although often very difficult to evaluate!), and therefore, as we have seen in Eq. (A.7), the last equation can be rewritten as$^{(*)}$
    $\displaystyle f(x,z\,\vert\,y,v,w) \;\propto\; f(x,y,v,w,z)\,, \hspace{5.4cm}$ (A.9)



    or, denoting by $\tilde{f}()$ the un-normalized posterior distribution,

    $\displaystyle \tilde f(x,z\,\vert\,y,v,w) \;=\; f(x,y,v,w,z)\,. \hspace{5.4cm}$ (A.10)

    The importance of this remark is that, although a closed form of the posterior is often prohibitive in practical cases, an approximation of it can be obtained by Monte Carlo techniques, which allow us to evaluate the quantities of interest, like averages, probability intervals, and so on (see references in footnote [*]; a minimal sampling sketch is shown right after this list).
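To make the last remark concrete, here is a minimal sampling sketch (not the specific method or model used in the paper): a plain Metropolis algorithm in Python that draws values of a single variable $x$ from an un-normalized posterior $\tilde f(x)$, whose particular form is invented for the example; averages and probability intervals are then estimated directly from the samples.

```python
import numpy as np

rng = np.random.default_rng(1)

def f_tilde(x):
    # Un-normalized posterior (invented): a Gaussian 'likelihood' centred on 2.0
    # times an exponential 'prior' restricted to x > 0.
    return np.exp(-0.5 * ((x - 2.0) / 0.5) ** 2) * np.exp(-x) * (x > 0)

# Plain Metropolis sampler: only ratios of f_tilde enter, so the
# normalization constant (the denominator) is never needed.
samples, x = [], 1.0
for _ in range(50_000):
    x_new = x + rng.normal(0.0, 0.5)
    if rng.random() < f_tilde(x_new) / f_tilde(x):
        x = x_new
    samples.append(x)

samples = np.array(samples[5_000:])            # discard burn-in
print(samples.mean())                          # estimate of the posterior mean
print(np.percentile(samples, [2.5, 97.5]))     # central 95% probability interval
```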


---------------------------------------------------
$^{(*)}$ Perhaps a better way to rewrite (A.9) and (A.10), in order to avoid confusion, could be

$\displaystyle f(x,z\,\vert\,y=y_0,\,v=v_0,\,w=w_0) \;\propto\; f(x,y_0,v_0,w_0,z)$
$\displaystyle \tilde f(x,z\,\vert\,y=y_0,\,v=v_0,\,w=w_0) \;=\; f(x,y_0,v_0,w_0,z)\,,$

in order to emphasize the fact that $y$, $v$ and $w$ assume precise values, under which the possible values of $x$ and $z$ are conditioned. Anyway, it is just a question of getting used to the notation. For example, sticking to a textbook two-dimensional case, the bivariate normal distribution is given by
$\displaystyle f(x,y) \;=\; \frac{1}{2\,\pi\,\sigma_x\,\sigma_y\,\sqrt{1-\rho^2}}\, \exp \left\{ -\frac{1}{2\,(1-\rho^2)} \left[ \frac{(x-\mu_x)^2}{\sigma_x^2} - 2\,\rho\,\frac{(x-\mu_x)\,(y-\mu_y)}{\sigma_x\,\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2} \right] \right\} \,.$





The distribution of $x$, conditioned by $y=y_0$, is then

$\displaystyle f(x\,\vert\,y_0) \;\propto\; \frac{1}{2\,\pi\,\sigma_x\,\sigma_y\,\sqrt{1-\rho^2}}\, \exp \left\{ -\frac{1}{2\,(1-\rho^2)} \left[ \frac{(x-\mu_x)^2}{\sigma_x^2} - 2\,\rho\,\frac{(x-\mu_x)\,(y_0-\mu_y)}{\sigma_x\,\sigma_y} + \frac{(y_0-\mu_y)^2}{\sigma_y^2} \right] \right\}$
  $\displaystyle \propto\; \exp \left\{ -\frac{1}{2\,(1-\rho^2)} \left[ \frac{(x-\mu_x)^2}{\sigma_x^2} - 2\,\rho\,\frac{x\,(y_0-\mu_y)}{\sigma_x\,\sigma_y} \right] \right\}$
  $\displaystyle \propto\; \exp \left\{ -\frac{1}{2\,(1-\rho^2)\,\sigma_x^2} \left[x^2 -2\,x\,\left(\mu_x+\rho\,\frac{\sigma_x}{\sigma_y}\,(y_0-\mu_y)\right) \right] \right\}$
  $\displaystyle \propto\; \exp \left\{-\frac{x^2 -2\,x\,\left(\mu_x+\rho\,\frac{\sigma_x}{\sigma_y}\,(y_0-\mu_y)\right)}{2\,(1-\rho^2)\,\sigma_x^2} \right\}$
  $\displaystyle \propto\; \exp \left\{-\frac{\left[x -\left(\mu_x+\rho\,\frac{\sigma_x}{\sigma_y}\,(y_0-\mu_y)\right)\right]^2}{2\,(1-\rho^2)\,\sigma_x^2} \right\} \,,$

in which we recognize a Gaussian distribution with $\mu_{x\vert y_0} = \mu_x +\rho\,\frac{\sigma_x}{\sigma_y}\,(y_0-\mu_y)$ and $\sigma_{x\vert y_0} = \sqrt{1-\rho^2}\,\sigma_x$.
$[$In the various steps all factors (and hence all additive terms in the exponent) not depending on $x$ have been ignored. Finally, in the last step the `trick' of completing the square in the exponent has been used, because adding $\left(\mu_x+\rho\,\frac{\sigma_x}{\sigma_y}\,(y_0-\mu_y)\right)^2\!/\,\left(2\,(1-\rho^2)\,\sigma_x^2\right)$ to the exponent is the same as multiplying by a constant factor.$]$
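As a quick numerical sanity check of this result (with arbitrary, purely illustrative parameter values), one can sample a bivariate normal and compare the mean and standard deviation of $x$, among the samples having $y$ very close to $y_0$, with the formulae above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative parameters of the bivariate normal
mu_x, mu_y, sig_x, sig_y, rho = 1.0, -2.0, 2.0, 0.5, 0.7
cov = [[sig_x**2,            rho * sig_x * sig_y],
       [rho * sig_x * sig_y, sig_y**2           ]]

xy = rng.multivariate_normal([mu_x, mu_y], cov, size=1_000_000)

y0 = -1.5
x_sel = xy[np.abs(xy[:, 1] - y0) < 0.01, 0]    # samples with y close to y0

# Monte Carlo estimates vs. the closed-form conditional mean and sigma
print(x_sel.mean(), mu_x + rho * sig_x / sig_y * (y0 - mu_y))
print(x_sel.std(),  np.sqrt(1 - rho**2) * sig_x)
```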