Introduction

As Gauss' opening quote reminds us, the reason why we combine “single observations” is to obtain an equivalent one “of greater accuracy than the single observations”. In fact, accidental errors tend to cancel,$^1$ if the observations are independent, and small systematic errors do too, if the measurements are performed with different devices and the data are analyzed by different methods. The simplest combination of the individual results is the arithmetic mean. But... let Gauss speak again [1]:
“But if it seems that the same degree of accuracy cannot be attributed to the several observations, let us assume that the degree of accuracy in each may be considered proportional to the numbers $e$, $e'$, $e''$, $e'''$, etc. respectively, that is, that errors reciprocally proportional to these numbers could have been made in the observations with equal facility; then, according to the principles to be propounded below, the most probable mean value will no longer be the simple arithmetic mean, but

\begin{displaymath}
\frac{e\,e\,\delta + e'e'\,\delta' + e''e''\,\delta'' + e'''e'''\,\delta''' + \mbox{etc}}
     {e\,e + e'e' + e''e'' + e'''e''' + \mbox{etc}}\,,\mbox{''}
\hspace{1cm}\mbox{(G1)}
\end{displaymath}

that is, in modern notation, $\left(\sum_ie_i^2\,\delta_i\right)/\left(\sum_ie_i^2\right)$, in which we recognize the well-known weighted average, provided $e_i=1/\sigma_i$. Then, later on (the conversion to modern notation is now straightforward),
“The degree of precision to be assigned to the mean found as above will be [...]

\begin{displaymath}
\sqrt{e\,e + e'e' + e''e'' + e'''e''' + \mbox{etc}}\,;
\hspace{2.0cm}\mbox{(G2)}
\end{displaymath}

so that four or nine equally exact observations are required, if the mean is to possess a double or a triple accuracy.”
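In modern notation, with $e_i=1/\sigma_i$, Eq. (G1) is the inverse-variance weighted average, and Eq. (G2) is the equivalent `degree of precision' of the mean, i.e. the reciprocal of its standard deviation, $1/\sigma_{\bar{x}} = \sqrt{\sum_i 1/\sigma_i^2}$. The following minimal sketch (plain Python, with invented numbers used only for illustration) spells out both formulae and Gauss' remark about four equally exact observations:

\begin{verbatim}
import math

# Invented individual results d_i with standard uncertainties sigma_i
# (four equally exact observations, as in Gauss' remark)
d     = [10.2, 9.8, 10.5, 10.1]
sigma = [0.4, 0.4, 0.4, 0.4]

w = [1.0 / s**2 for s in sigma]                        # weights e_i^2 = 1/sigma_i^2
mean = sum(wi * di for wi, di in zip(w, d)) / sum(w)   # Eq. (G1): weighted average
sigma_mean = 1.0 / math.sqrt(sum(w))                   # reciprocal of Eq. (G2)

print(f"{mean:.3f} +/- {sigma_mean:.3f}")   # 10.150 +/- 0.200: four observations
                                            # of sigma 0.4 give double accuracy
\end{verbatim}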
The advantage of having `distilled' the many observations into a single, equivalent one is that further calculations are simplified, or made feasible at all, especially in the absence of powerful computers. But this idea works only if there is little or no loss of information.$^2$ This leads us to the important concept of statistical sufficiency, which will be recalled in Sec. 2 for the well understood case of Gaussian errors.

There is then the question of what to do when the individual results `appear' to be in mutual disagreement. The reason for the quote marks has been discussed in Ref. [2], of which this work is a kind of appendix. In short, they are a reminder, if needed, that, rigorously speaking, we can never be absolutely sure that the `discrepancies' are not just due to statistical fluctuations. A way to implement our doubts has been shown in Ref. [2]: it consists in modifying the probabilistic model relating causes (the parameters of the model, first and foremost the `true' value of the quantity we aim to infer, although with uncertainty) and effects (the empirical observations), adding extra causes (additional parameters of the model) which might affect the observations. All quantities of interest are then embedded in a kind of network, whose graphical representation can be very helpful to grasp its features.$^3$

Traditionally, at least in particle physics, a different approach is followed. The degree of disagreement is quantified by the $\chi^2$ of the differences between the individual results and the weighted average. Then, possibly, the standard deviation of the weighted mean is enlarged by a factor $\sqrt{\chi^2/\nu}$. The rationale of the prescription, as a quick and dirty rule to get a rough idea of the range of possible values of the quantity of interest, is easy to understand. However, there are several reasons for dissatisfaction, as discussed later in Sec. 4, and therefore this simplistic scaling should be used with some care, and definitely avoided in those cases in which the outcome is critical for fundamental physics issues.$^4$ Moreover, it will be shown how the outcome can be biased if the prescription is first applied to a sub-sample of the individual results and, subsequently, the partial result is combined with the remaining ones using the same rule. The conclusions will also contain some general considerations on how an experimental result should be presented in order to make the best use of it a) to confront it with theory; b) to combine it with other results concerning the same physical quantity; c) to `propagate' it into other quantities of interest.
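To make the prescription and the ordering effect concrete, here is a small numerical sketch (plain Python; the individual results and uncertainties are invented, and the scale factor is assumed to be applied only when $\chi^2/\nu > 1$, as is customary). Combining all results at once, and combining a sub-sample first and then merging the partial result with the remaining one using the same rule, lead to different outcomes:

\begin{verbatim}
import math

def combine(results, scale=True):
    """Weighted average of (value, sigma) pairs; if chi2/nu > 1,
    the standard deviation is enlarged by sqrt(chi2/nu)."""
    w = [1.0 / s**2 for _, s in results]
    mean = sum(wi * v for wi, (v, _) in zip(w, results)) / sum(w)
    sig = 1.0 / math.sqrt(sum(w))
    nu = len(results) - 1
    if scale and nu > 0:
        chi2 = sum(((v - mean) / s)**2 for v, s in results)
        if chi2 > nu:
            sig *= math.sqrt(chi2 / nu)
    return mean, sig

# Invented, mutually discrepant results (value, sigma)
data = [(10.0, 0.1), (10.5, 0.1), (10.9, 0.3)]

# (a) single global combination
print(combine(data))                    # ~ (10.28, 0.20)

# (b) combine the first two, then merge the partial result with the third
partial = combine(data[:2])             # ~ (10.25, 0.25)
print(combine([partial] + data[2:]))    # ~ (10.52, 0.32), different from (a)
\end{verbatim}

Without the scaling, the two procedures would give identical results, since the plain weighted average can be built up sequentially; the scale factor breaks this property.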

Finally, a puzzle is proposed in the Appendix in order to show that independent Gaussian errors are not a sufficient condition for the standard weighted average.