Gibbs sampling in the spaces $(\mu,\sigma)$ and $(\mu,\tau)$

Before we write down the model to solve the little problem described by the graphical model of Fig. 5, some words on Gibbs sampling are needed, also in order to understand why this kind of program does not use $\sigma$ as the second parameter of the Gaussian, but rather $1/\sigma^2$, traditionally indicated by $\tau$.

We have seen that if we have a problem with Gaussian error functions and a flat prior on $\mu$, then the posterior of $\mu$ is still a Gaussian, and it remains Gaussian also if we assign to $\mu$ a Gaussian prior characterized by $x_0$ and $\sigma_0$, a flat prior being recovered for $\sigma_0\rightarrow\infty$ (and allow me to draw your attention again to footnote 9). Let us see what happens if we are also uncertain about $\sigma$, assumed to be the same in all $n$ measurements (the typical problem arising when we collect a sample of measurements under apparently identical conditions and we are interested in inferring both $\mu$ and $\sigma$). The graphical model is still the one on the right-hand side of Fig. 5, but with $\sigma_i=\sigma$ for all $i$. The analogue of Eq. (3) is now

$$f(\mu,\sigma\,\vert\,\underline{x},I) \;\propto\; f(\mu,\sigma,\underline{x}\,\vert\,I)\,. \qquad\qquad (14)$$

Applying the chain rule only to $x_1$ and $x_2$, to begin with, and noting that $\mu$ and $\sigma$ do not depend on each other, as is usually the case,$^{13}$ we have, instead of Eq. (4),
$$f(x_1,x_2,\mu,\sigma\,\vert\,I) \;=\; f(x_1\,\vert\,\mu,\sigma,I) \cdot f(x_2\,\vert\,\mu,\sigma,I) \cdot f(\mu\,\vert\,I)\cdot f(\sigma\,\vert\,I) \,. \qquad\qquad (15)$$

Equation (5) then becomes, also extending Eq. (15) to all $x_i$,

$$f(\underline{x},\mu,\sigma\,\vert\,I) \;=\; \left[ \prod_i f_{{\cal N}}(x_i\,\vert\,\mu,\sigma)\right]\cdot f_0(\mu)\cdot f_0(\sigma) \,.$$


Now, if for some reason we fix $\sigma$ to the hypothetical value $\sigma^*$ (and for simplicity we use a flat prior for $\mu$), then we recover, without any calculation, something similar to Eq. (12):
$$f(\mu\,\vert\,\underline{x},\sigma^*,f_0(\mu)=k) \;=\; \frac{1}{\sqrt{2\,\pi}\,\sigma_C}\, \exp\left[- \frac{(\mu-\overline{x})^2}{2\,\sigma_C^2} \right] \qquad\qquad (16)$$

where now $\overline{x}$ is simply the arithmetic average and $\sigma_C = \sigma^*/\sqrt{n}$. If we had taken into account a prior $f_0(\mu)$ modeled by a Gaussian, then $f(\mu\,\vert\,\underline{x},\sigma^*)$ would still be a Gaussian, as we have seen above. In particular, it is easy to sample by Monte Carlo the `random' variable $\mu$ described by Eq. (16), because it is rather easy to write a Gaussian `random' number generator, or to use one of those available in the mathematical libraries of programming languages.
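For concreteness, here is a minimal sketch (not from the original text) of how one might draw values of $\mu$ from the Gaussian of Eq. (16) using a standard library generator. It is written in Python with NumPy; the data values and $\sigma^*$ are purely hypothetical.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
x = np.array([10.2, 9.8, 10.5, 10.1, 9.9])  # hypothetical sample
sigma_star = 0.3                            # hypothetical fixed sigma*

x_bar = x.mean()                            # arithmetic average
sigma_C = sigma_star / np.sqrt(len(x))      # sigma* / sqrt(n)

# draw mu from the Gaussian f(mu | x, sigma*, flat prior) of Eq. (16)
mu_sample = rng.normal(loc=x_bar, scale=sigma_C, size=10000)
print(mu_sample.mean(), mu_sample.std())
\end{verbatim}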

Let us now do the opposite exercise, the utility of which will become clear shortly: imagine we are interested in $f(\sigma\,\vert\,\underline{x},\mu^*,f_0(\sigma)=k)$, having imposed the condition $\mu=\mu^*$. Equations (6) and (7) now turn into (note the factor $1/\sigma$ in front of each exponential, since it can no longer be absorbed into the normalization constant!)

\begin{eqnarray*}
f(\sigma\,\vert\,\underline{x},\mu^*,f_0(\sigma)=k) & \propto & f(\underline{x},\mu^*,\sigma) \ \propto\ \prod_i f_{{\cal N}}(x_i\,\vert\,\mu^*,\sigma) \qquad\qquad (17)\\
 & \propto & \prod_i\, \frac{1}{\sigma}\, \exp\left[-\frac{(x_i-\mu^*)^2}{2\,\sigma^2}\right] \qquad\qquad\qquad\qquad\ \ (18)\\
 & \propto & \frac{1}{\sigma^n}\, \exp\left[-\frac{\sum_i(x_i-\mu^*)^2}{2\,\sigma^2}\right] \\
 & \propto & \frac{1}{\sigma^n}\, \exp\left[-\frac{K^2(\underline{x},\mu^*)}{\sigma^2}\right] \,,
\end{eqnarray*}

where $K^2(\underline{x},\mu^*)$ is a constant, given $\underline{x}$ and $\mu^*$, written in a way that reminds us it is by definition non-negative. Unfortunately, in contrast to the case of $f(\mu\,\vert\,\underline{x},\sigma^*)$, this is an unusual form in probability theory. But a simple change of variable rescues us. In fact, if instead of $\sigma$ we use $\tau=1/\sigma^2$, so that $1/\sigma^n = \tau^{n/2}$, then the last equation becomes
$$f(\tau\,\vert\,\underline{x},\mu^*,f_0(\sigma)=k) \;\propto\; \tau^{\frac{n}{2}}\,\exp\left[-K^2(\underline{x},\mu^*)\cdot\tau\right]$$

in which probability and statistics experts recognize a Gamma distribution, usually written for the generic variable $z$ as
$$f(z\,\vert\,\alpha,\beta) \;=\; \frac{\beta^{\,\alpha}}{\Gamma(\alpha)}\, z^{\alpha -1}\,e^{-\beta\,z}\,,$$

where $\Gamma()$ is the Gamma function (and hence the name of the distribution). Therefore
$$f(\tau\,\vert\,\underline{x},\mu^*,f_0(\sigma)=k) \;=\; \frac{\beta^{\,\alpha}}{\Gamma(\alpha)}\,\tau^{\alpha-1}\,e^{-\beta\,\tau} \qquad\qquad (19)$$

with $\alpha = 1 + n/2$ and $\beta = K^2(\underline{x},\mu^*) = \sum_i(x_i-\mu^*)^2/2$. This being a well-known probability distribution, formulae are available for the summaries of interest.$^{14}$ For example, the expected value and the variance are given by $\alpha/\beta$ and $\alpha/\beta^2$, respectively. Moreover, Gamma random number generators are available, which is what we need for sampling.
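As an illustration (not part of the original text), here is a minimal Python/NumPy sketch of drawing $\tau$ from the Gamma of Eq. (19); note that NumPy's generator is parameterized by shape and scale, with scale $=1/\beta$. The data values and $\mu^*$ are hypothetical.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
x = np.array([10.2, 9.8, 10.5, 10.1, 9.9])  # hypothetical sample
mu_star = 10.0                              # hypothetical fixed mu*

alpha = 1 + len(x) / 2                      # alpha = 1 + n/2
beta = np.sum((x - mu_star) ** 2) / 2       # beta = K^2(x, mu*)

# NumPy uses shape/scale, so scale = 1/beta
tau_sample = rng.gamma(shape=alpha, scale=1 / beta, size=10000)
sigma_sample = 1 / np.sqrt(tau_sample)      # each tau has a related sigma

print(tau_sample.mean(), alpha / beta)      # E[tau] = alpha/beta
\end{verbatim}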

We are then finally ready to describe the Gibbs sampler algorithm, applied here to our two-dimensional case (although it can be applied to higher-dimensional problems too):

1. choose an initial point $(\mu_0,\tau_0)$;
2. sample $\mu_1$ from $f(\mu\,\vert\,\underline{x},\tau_0)$, i.e. from the Gaussian of Eq. (16) with $\sigma^* = 1/\sqrt{\tau_0}$;
3. sample $\tau_1$ from $f(\tau\,\vert\,\underline{x},\mu_1)$, i.e. from the Gamma of Eq. (19) with $\mu^* = \mu_1$;
4. iterate the two previous steps, each time conditioning on the latest values, thus obtaining the sequence of points $(\mu_1,\tau_1)$, $(\mu_2,\tau_2)$, $\ldots$, $(\mu_i,\tau_i)$, $\ldots$

(And, obviously, for each $\tau_i$ there is a related $\sigma_i$.) Now, amazingly enough (but there are mathematical theorems ensuring the `correct' behavior [21]), the points so obtained sample the two-dimensional distribution of $(\mu,\,\tau)$, and hence of $(\mu,\sigma)$, in the sense that the expected frequency with which a given region is visited is proportional to the probability of that region (just the Bernoulli theorem, not to be confused with the frequentist `definition' of probability! See e.g. Ref. [31]). Moreover, from the way it has been described, it is clear that the probability of the move $(\mu_i,\tau_i)\rightarrow (\mu_{i+1},\tau_i)$ depends only on $(\mu_i,\tau_i)$ and not on the previous states. This is what defines a Markov Chain Monte Carlo, of which the Gibbs sampler is one of the possible algorithms.
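To make the algorithm concrete, here is a minimal sketch of the Gibbs sampler in Python with NumPy (not from the original text): it alternates a draw of $\mu$ from the Gaussian conditional of Eq. (16), using the current value of $\tau$, and a draw of $\tau$ from the Gamma conditional of Eq. (19), using the new value of $\mu$. Flat priors are assumed; the data and the starting point are hypothetical.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
x = np.array([10.2, 9.8, 10.5, 10.1, 9.9])  # hypothetical sample
n = len(x)
x_bar = x.mean()

n_iter = 20000
mu = np.empty(n_iter)
tau = np.empty(n_iter)
mu[0], tau[0] = x_bar, 1 / x.var()          # hypothetical starting point

for i in range(1, n_iter):
    sigma = 1 / np.sqrt(tau[i - 1])
    # mu_i ~ f(mu | x, tau_{i-1}): the Gaussian of Eq. (16)
    mu[i] = rng.normal(x_bar, sigma / np.sqrt(n))
    # tau_i ~ f(tau | x, mu_i): the Gamma of Eq. (19)
    beta = np.sum((x - mu[i]) ** 2) / 2
    tau[i] = rng.gamma(shape=1 + n / 2, scale=1 / beta)

sigma_chain = 1 / np.sqrt(tau)              # related sigma for each tau
print(mu.mean(), sigma_chain.mean())
\end{verbatim}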

There is still the question of $f_0(\tau)$, less trivial than $f_0(\mu)$, because $\tau$ has to be positive.$^{15}$ In this case a convenient prior is a Gamma, with $\alpha_0$ and $\beta_0$ properly chosen, because it is easy to see that, when it is multiplied by Eq. (19), the result is still a Gamma:

\begin{eqnarray*}
f(\tau\,\vert\,\underline{x},\mu^*) & \propto & \tau^{\alpha-1}\,e^{-\beta\,\tau} \cdot \tau^{\alpha_0-1}\,e^{-\beta_0\,\tau} \qquad\qquad (20)\\
 & \propto & \tau^{\alpha+\alpha_0-2}\,e^{-(\beta+\beta_0)\,\tau}\,. \qquad\qquad\quad (21)
\end{eqnarray*}

We can easily see that a flat prior for $\tau$ is recovered in the limit $\alpha_0\rightarrow 1$ and $\beta_0\rightarrow 0$.
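As a small illustration (again not from the original text), including such a Gamma prior only changes the parameters of the Gamma from which $\tau$ is drawn, as Eq. (21) shows: the shape becomes $\alpha+\alpha_0-1$ and the rate $\beta+\beta_0$. A hypothetical Python/NumPy sketch, with made-up data and prior parameters:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
x = np.array([10.2, 9.8, 10.5, 10.1, 9.9])  # hypothetical sample
mu_star = 10.0                              # hypothetical fixed mu*
alpha0, beta0 = 2.0, 0.5                    # hypothetical Gamma prior on tau

alpha = 1 + len(x) / 2
beta = np.sum((x - mu_star) ** 2) / 2

# Eq. (21): still a Gamma, with shape alpha+alpha0-1 and rate beta+beta0
# (alpha0 -> 1, beta0 -> 0 recovers the flat-prior case)
tau_sample = rng.gamma(shape=alpha + alpha0 - 1,
                       scale=1 / (beta + beta0), size=10000)
\end{verbatim}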

A last comment concerning the initial point of the sampling is in order. Obviously, the initial steps of the history (the sequence) depend on our choice, and they can therefore be somewhat unrepresentative. The usual procedure to overcome this problem consists of discarding the `first points' of the sequence, preferably after a visual inspection, or using criteria based on past experience (notoriously, this kind of technique lies somewhere between science and art, even when it is grounded on mathematical theorems, which however only speak of `asymptotic behavior'). But, as a matter of fact, the convergence of the Gibbs sampler for low-dimensional problems is very fast, and modern computers are so powerful that, in case of doubt, we can simply throw away several thousand initial points `just for security'.