However, this does not imply that the latter is the correct way to proceed in case the pattern of the individual results is at odds with the weighted average applied to all points. A more pondered analysis should rather be performed in order to model our doubts, as done e.g. in Ref. [2]. (In the case of the charged kaon mass there is, however, a curious compensation, such that the biased result comes out to agree, at least in terms of central value and `error', with that of the `sceptical analysis' [2].)
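For reference, the standard combination under discussion is the inverse-variance weighted average. The following minimal Python sketch only illustrates the rule itself; the numerical inputs are invented and are not the actual kaon-mass measurements.

```python
import numpy as np

def weighted_average(values, sigmas):
    """Inverse-variance weighted average of independent results
    assumed Gaussian, x_i +/- sigma_i."""
    w = 1.0 / np.asarray(sigmas, dtype=float) ** 2
    x_bar = np.sum(w * np.asarray(values, dtype=float)) / np.sum(w)
    sigma_bar = 1.0 / np.sqrt(np.sum(w))
    return x_bar, sigma_bar

# purely illustrative numbers (not real data)
values = [10.12, 9.95, 10.31]
sigmas = [0.10, 0.15, 0.08]
x_bar, sigma_bar = weighted_average(values, sigmas)
print(f"weighted average: {x_bar:.3f} +/- {sigma_bar:.3f}")
```

It is precisely the blind application of this rule, when the individual results show a suspicious pattern, that is being questioned here.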
I would like to conclude with some remarks concerning how to report an experimental result, in view of its further uses. In fact, a result is not an end in itself, as Physics and all Sciences are not just collections of facts (and even an experimental result is not a mere `fact', since it is derived from many empirical observations through models relying on a web of beliefs$^{17}$).
Focusing on pure science, results are finally confronted with theoretical evaluations (not strictly `predictions') in order to rank, in degree of belief, the possible models describing how `the World works' (note that the acclaimed Popperian falsification is an idealistic scheme that seldom applies in practice [23,24]). But in order to achieve the best selective power, individual results are combined together, as we have seen in this note. Moreover, a result can be propagated into other evaluations, as it is, itself, practically always based on other results, since it depends on quantities which enter the theoretical model(s) on which it relies (`principles of measurement' [10]), including those which govern the `pieces of apparatus', as recalled in footnote 17.
Therefore, it is important to provide, as outcome of an experimental investigation, something that can be used at best, even after years, for comparison, combination and propagation. Fortunately there is something on which there is universal consensus: the most complete information resulting from the empirical findings, concerning a quantity that can assume values with continuity, is the so-called likelihood function.$^{18}$ In fact, in the case of independent experiments reporting evidence on the same physical quantity, the rule of combination is straightforward, as it results from probability theory without the need of ad hoc prescriptions: just multiply the individual likelihoods. It follows then that the likelihood (or its negative log) should be described at best in a publication, as for example done in Ref. [27], in which several negative log-likelihoods were shown in figures and parameterized around their minimum by suitable polynomials.
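As a purely illustrative sketch (not taken from Ref. [27]; the polynomial coefficients and the range of the quantity are invented), the following Python fragment shows the combination rule in practice: for independent experiments the likelihoods multiply, i.e. the negative log-likelihoods, here parameterized by polynomials around their minima, simply add.

```python
import numpy as np

# grid of the physical quantity of interest (illustrative range)
mu = np.linspace(-1.0, 3.0, 4001)

# each experiment's negative log-likelihood, parameterized around its
# minimum by a polynomial (invented coefficients, in the spirit of Ref. [27])
nll_1 = 0.5 * ((mu - 1.2) / 0.4) ** 2                          # parabolic (Gaussian case)
nll_2 = 0.5 * ((mu - 0.9) / 0.6) ** 2 + 0.1 * (mu - 0.9) ** 3  # slightly skewed

# independent experiments: the likelihoods multiply,
# hence the negative log-likelihoods simply add
nll_comb = nll_1 + nll_2
print("combined minimum at mu ~", mu[np.argmin(nll_comb)])
```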
Reducing the detailed information provided by the likelihood to a couple of numbers does not provide, in general, an effective and unbiased way to report the result of the findings, unless the likelihood is, with some degree of approximation, Gaussian. Instead, if the likelihood is not Gaussian [or the $\chi^2$ is not parabolic, in those cases in which the likelihood can be rewritten as $\exp(-\chi^2/2)$], then reporting the value that maximizes it, with an `error' related to the curvature of its negative log at the minimum, or `asymmetric errors' derived from a prescription that is only justified for a Gaussian likelihood, is also an inappropriate way of reporting the information contained in the findings. This is because, when a result is given in terms of $x_{\rm best}\,^{+\Delta_+}_{-\Delta_-}$, then $x_{\rm best}$ is often used in further calculations, and the $\Delta$'s are `propagated' into further uncertainties in `creative' ways, forgetting that the well known formulae for propagation in linear combinations (or in linearized forms) rely on probabilistic properties of means and variances (and the Central Limit Theorem makes the result Gaussian if `several' contributions are considered). There are, instead, no similar theorems that apply to the `best values' obtained by minimizing the $\chi^2$ or the negative log-likelihood, and to the (possibly asymmetric) `errors' obtained by the $\Delta\chi^2 = 1$ and the $\Delta(-\ln{\cal L}) = 1/2$ rules, still commonly used to evaluate `errors'. Therefore these rules might produce biased results, directly and/or in propagations [26].
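A toy numerical illustration of this point (the distribution and all numbers are invented, chosen only to exhibit the mechanism): if a skewed likelihood is summarized by its mode and `asymmetric errors', propagating the mode into a sum does not reproduce the expectation of the sum, which is what the standard propagation formulae actually refer to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent quantities whose (normalized) likelihoods, with flat
# priors, are taken to be lognormal(0, 0.5) pdfs -- a skewed shape whose
# mode, exp(-sigma^2) ~ 0.78, differs from its mean, exp(sigma^2/2) ~ 1.13.
x = rng.lognormal(mean=0.0, sigma=0.5, size=1_000_000)
y = rng.lognormal(mean=0.0, sigma=0.5, size=1_000_000)

mode = np.exp(-0.5 ** 2)        # mode of lognormal(0, 0.5)
s = x + y                       # the quantity actually of interest

print("sum of the two modes:  ", 2 * mode)   # what naive propagation would use
print("E[x + y] = E[x] + E[y]:", s.mean())   # what probability theory gives
```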
Reporting the likelihood is also very important in the case of `negative searches', in which a lower/upper bound is usually reported. In fact, although there is no way to combine the bounds (and so people often rely on the most stringent one, which could just be due to a larger fluctuation of the background with respect to its expectation), there is little doubt about how to `merge' the individual (independent) likelihoods into a single combined likelihood, from which conventional bounds can be evaluated (see Ref. [28] and chapter 13 of Ref. [29]).
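A minimal numerical sketch of such a merging follows, assuming Poisson counting experiments with known expected backgrounds and efficiencies, and, for the final conventional bound, a flat prior on the non-negative signal (all numbers are invented; this is only an illustration of the multiplication of likelihoods, not a reproduction of Refs. [28,29]).

```python
import numpy as np

# Toy 'negative search' combination: two counting experiments with
# expected background b, efficiency eps and observed counts n; the
# likelihood of the signal s in each is Poisson with mean b + eps*s.
experiments = [dict(n=0, b=0.5, eps=1.0),
               dict(n=2, b=2.0, eps=0.8)]

s = np.linspace(0.0, 20.0, 2001)

def log_like(s, n, b, eps):
    lam = b + eps * s
    return n * np.log(lam) - lam      # Poisson log-likelihood (n! term dropped)

# independent experiments: multiply the likelihoods, i.e. sum the logs
log_L = sum(log_like(s, **e) for e in experiments)
L = np.exp(log_L - log_L.max())

# with a flat prior on s >= 0, a conventional 95% probabilistic bound follows
cdf = np.cumsum(L) / L.sum()
print("95% upper bound on s ~", s[np.searchsorted(cdf, 0.95)])
```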
Finally, a puzzle is proposed in the Appendix, as a warning on the use of the weighted average to combine results, even if they are believed to be independent and affected by Gaussian errors.
I am indebted to Enrico Franco for extensive discussions on the subject and for comments on the manuscript.