“But if it seems that the same degree of accuracy cannot be attributed to the several observations, let us assume that the degree of accuracy in each may be considered proportional to the numbers $e$, $e'$, $e''$, etc. respectively, that is, that errors reciprocally proportional to these numbers could have been made in the observations with equal facility; then, according to the principles to be propounded below, the most probable mean value will no longer be the simple arithmetic mean, but
\[
\frac{e^2\,p + e'^2\,p' + e''^2\,p'' + \mbox{etc.}}{e^2 + e'^2 + e''^2 + \mbox{etc.}}\,,
\]
that is, in modern notation,
\[
\overline{x} \,=\, \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2}\,.
\]
“The degree of precision to be assigned to the mean found as above will be [...]
so that four or nine equally exact observations are required, if the mean is to possess a double or a triple accuracy.” The advantage of having `distilled' the many observations into a single, equivalent one is that further calculations are simplified, or made feasible at all, especially in the absence of powerful computers. But this idea works only if there is no, or little, loss of information. This leads us to the important concept of statistical sufficiency, which will be recalled in Sec. 2 for the well-understood case of Gaussian errors.
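For the Gaussian case just anticipated, the sufficiency of the weighted average can be checked numerically. The following minimal sketch (in Python, with made-up numbers) verifies that the likelihood of the true value computed from the individual observations differs from that computed from the single `equivalent' result only by a constant factor, i.e. no information relevant to the inference is lost:

```python
import numpy as np

# Illustrative data: individual results x_i with standard uncertainties s_i
x = np.array([10.2, 9.8, 10.5])
s = np.array([0.3, 0.4, 0.5])

# Weighted average and its standard deviation (the 'equivalent' observation)
w = 1.0 / s**2
x_bar = np.sum(w * x) / np.sum(w)
s_bar = 1.0 / np.sqrt(np.sum(w))

mu = np.linspace(8, 12, 2001)  # grid of hypothesized true values

# Log-likelihood of mu from the individual observations (Gaussian model)
logL_full = -0.5 * np.sum(((x[:, None] - mu) / s[:, None])**2, axis=0)

# Log-likelihood of mu from the single combined result
logL_comb = -0.5 * ((x_bar - mu) / s_bar)**2

# Their difference is independent of mu (up to rounding): sufficiency
diff = logL_full - logL_comb
print(f"weighted mean = {x_bar:.3f} +- {s_bar:.3f}")
print("variation of (logL_full - logL_comb) over the grid:",
      diff.max() - diff.min())  # ~1e-13, i.e. zero up to rounding
```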
There is then the question of what to do when the individual results `appear' to be in mutual disagreement. The reason for the quote marks has been discussed in Ref. [2], of which this work is a kind of appendix. In short, they are a reminder, if needed, of the fact that, rigorously speaking, we can never be absolutely sure that the `discrepancies' are not just due to statistical fluctuations. A way to implement our doubts has been shown in Ref. [2]: it consists in modifying the probabilistic model relating causes (the parameters of the model, first and foremost the `true' value of the quantity we aim to infer, although with uncertainty) and effects (the empirical observations), adding some extra causes (additional parameters of the model) which might affect the observations. All quantities of interest are then embedded in a kind of network, a graphical representation of which can be very helpful to grasp its features.
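Just to give a feeling of how such a network can be set up (this is only an illustrative sketch, not the specific model of Ref. [2]; in particular, the log-normal prior on the scaling factors is an ad hoc assumption), one may allow each quoted standard uncertainty to be underestimated by an unknown factor $r_i$ and marginalize over these extra parameters when inferring the true value:

```python
import numpy as np
from scipy import stats

x = np.array([10.2, 9.8, 11.5])   # illustrative results; the third looks discrepant
s = np.array([0.3, 0.4, 0.3])

mu = np.linspace(8, 13, 1001)     # grid for the true value (flat prior assumed)
r = np.linspace(0.3, 10.0, 500)   # grid for the uncertainty-scaling factors
dmu, dr = mu[1] - mu[0], r[1] - r[0]

# Ad hoc prior on each r_i (an assumption, not the prior of Ref. [2]):
# log-normal, peaked near r = 1, with a tail allowing large underestimations
prior_r = stats.lognorm.pdf(r, s=0.5)
prior_r /= prior_r.sum() * dr

# Posterior of mu: product over i of the r-marginalized Gaussian likelihoods
post = np.ones_like(mu)
for xi, si in zip(x, s):
    like_i = stats.norm.pdf(xi, loc=mu[:, None], scale=r[None, :] * si)
    post *= (like_i * prior_r).sum(axis=1) * dr
post /= post.sum() * dmu

mean = (mu * post).sum() * dmu
std = np.sqrt(((mu - mean)**2 * post).sum() * dmu)
print(f"posterior for the true value: {mean:.2f} +- {std:.2f}")
```

The effect of the extra causes is that an observation far from the bulk of the others is automatically down-weighted, because large values of its $r_i$ become probable in the light of the data.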
Traditionally, at least in particle physics, a different approach is followed. The degree of disagreement is quantified by the $\chi^2$ of the differences between the individual results and the weighted average. Then, possibly, the standard deviation of the weighted mean is enlarged by a factor $\sqrt{\chi^2/\nu}$, with $\nu$ the number of degrees of freedom (the number of results minus one). The rationale of the prescription, as a quick and dirty rule to get a rough idea of the range of the possible values of the quantity of interest, is easy to understand.
However, there are several reasons for dissatisfaction, as discussed later on in Sec. 4, and therefore this simplistic scaling should be used with some care, and definitely avoided in those cases in which the outcome is critical for fundamental physics issues. Moreover, it will be shown how the outcome can be biased if the prescription is first applied to a sub-sample of the individual results and, subsequently, the partial result is combined with the remaining ones using the same rule (a numerical illustration follows below).
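To anticipate the effect numerically (made-up numbers; the function is the same sketch given above), the following snippet applies the prescription once to all results, and then sequentially, combining a sub-sample first and the partial result with the remaining one afterwards:

```python
import numpy as np

def combine(x, s):
    """Weighted average with sqrt(chi2/nu) scaling (same rule as above)."""
    x, s = np.asarray(x, float), np.asarray(s, float)
    w = 1.0 / s**2
    mean = np.sum(w * x) / np.sum(w)
    sigma = 1.0 / np.sqrt(np.sum(w))
    chi2 = np.sum(((x - mean) / s)**2)
    scale = max(1.0, np.sqrt(chi2 / (len(x) - 1)))
    return mean, scale * sigma

x, s = [10.2, 9.4, 11.5], [0.3, 0.4, 0.3]

# All results combined at once
m_all, s_all = combine(x, s)

# Sequential: combine the first two, then the partial result with the third
m_12, s_12 = combine(x[:2], s[:2])
m_seq, s_seq = combine([m_12, x[2]], [s_12, s[2]])

print(f"one-shot  : {m_all:.2f} +- {s_all:.2f}")   # -> about 10.53 +- 0.59
print(f"sequential: {m_seq:.2f} +- {s_seq:.2f}")   # -> about 10.90 +- 0.77
```

The two procedures disagree both in the central value and in the uncertainty: the scaling applied in the first step enlarges the uncertainty of the partial result, thus reducing its weight in the second combination.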
The conclusions will also contain some general considerations on how an experimental result should be presented in order to make the best use of it: a) to confront it with theory; b) to combine it with other results concerning the same physical quantity; c) to `propagate' it into other quantities of interest.
Finally, a puzzle is proposed in the Appendix in order to show that independent Gaussian errors are not a sufficient condition for the standard weighted average.