First referee (received 5 January 2000)
-------------------------------------------------------------------------
Referee report DK7212

This is a very unusual paper. Unfortunately, it doesn't meet the standards of publication in Phys. Rev.

The main idea of this paper is a proposal for a Bayesian method to combine results of different experiments, to replace the usual weighted average, which for inconsistent data is customarily modified using the time-tested "PDG scale factor." The author gives no evidence that there is anything wrong with the PDG scale factor approach, but he tries to throw doubt on two of the pillars of classical statistics on which the PDG method is based, namely the weighted average (which he calls the "standard combination rule") and Pearson's chi-square. I don't find either of these attacks convincing; in fact, the author shows somewhat less knowledge of classical statistics than one would expect even from a typical reader of this paper, which is a bad start.

For example, after stating the conditions assumed to hold when the weighted average is used, he says only that "If one, or several, of these hypotheses [the conditions he gives] are not satisfied, the result [the weighted average and its standard deviation] is questionable." But he doesn't say what is "questionable", namely which properties of the weighted average hold under the stated conditions but fail when they are not satisfied. This failure to make things clear at the outset has a bad effect on the clarity of the whole paper and reduces the author's credibility.

A little later on the same page, the author states that "As a strict rule, the chi-square test is not really logically grounded ... although it does 'often work' ... ". Now it should be pointed out that Pearson's chi-square test occupies a position in statistical theory roughly comparable to that of special relativity in HEP.
It is used in all fields of science, and an enormous number (perhaps even the majority) of all experimental results are based on it in one way or another. In HEP alone, no track or event is accepted in a reconstruction program without passing a chi-square test. Hundreds of Monte Carlo programs study the results of these tests, and if they do not conform to the theoretical predictions, this is taken as an indication that the experiment or the program is not yet well understood. If there were anything fundamentally wrong with the chi-square test, it should have been discovered long ago. Thus it was with great interest that I consulted the CERN internal report (by the same author) given as a reference for this earth-shaking discovery. Needless to say, the report was just as unclear and unconvincing as this manuscript.

In section 2, the author gives a Bayesian interpretation of the weighted average, and shows that it has severe problems, namely that it requires distributions of beliefs, including prior beliefs which have to exist before the experiments are performed. Then he concludes that the usual (non-Bayesian) interpretation must be wrong because it does not suffer from the same problems as the Bayesian interpretation. The logic here is that the Bayesian interpretation is right, therefore any other interpretation that is right must include the elements (and hence the problems) of the Bayesian interpretation. I don't think many physicists will be convinced by this logic.

In section 3, the new method (essentially the Dose-von der Linden method) is proposed. It is Bayesian, and the important element is the prior belief function of the stretch factors r by which the individual experimental errors have been underestimated. A gamma distribution of beliefs is chosen. There is no justification for this choice, and no indication of how the results depend on this function (although there is some discussion of varying the parameters of the function).
He points out some nice intuitive features of the new method, such as the fact that two independent experiments get more weight than a single experiment with the same amount of data. Unfortunately, he then makes the misleading statement that in the weighted average "the two situations are absolutely equivalent". This is misleading because it is true only if there is no scale factor. That is why the scale factor was introduced: to use the additional information to determine whether the conditions for the validity of the weighted average are satisfied.

In section 4, all this is applied to a somewhat controversial situation in weak interactions. The desired result is achieved, namely he gives evidence that the Fermilab measurements should have less weight than the CERN measurements. The same result could have been achieved much more simply and more transparently using standard robust classical statistics, in which the measurements farthest from the weighted average are down-weighted by an amount depending on various criteria, usually their contribution to chi-square. In the present manuscript, the mathematics of the Bayesian formulation obscures the way in which the re-weighting is performed, but at least the values of the "weights" are recovered in section 5.

Section 5 calculates the individual scale factors (weights) resulting from the assumed prior distribution of r. The weights are reasonable, not very different from what you might get with a classical robust method. We don't know if several prior functions were tried before finding one which gave reasonable weights; we are only told that a particular function gives these particular weights.

Conclusions. The proposed method gives not only the probability density of the true value of Re(eps'/eps), but it also tells us exactly by how much each experiment underestimated its errors. Unfortunately, there is no indication of whether any of this is true.
The author criticizes classical theory because the conditions under which the weighted average is valid may not be satisfied, but for the method he proposes, we don't even know under what conditions (if any!) the results are valid. This is not Science.

I do not recommend publication.
-------------------------------------------------------------------------
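
For readers less familiar with the classical procedure discussed in this report, the weighted average and a simplified form of the PDG scale factor can be sketched as follows. This is only an illustration under stated assumptions: the function names and the two input measurements are hypothetical, and the scale factor is taken in its basic form, S = sqrt(chi2/(N-1)), applied to the error only when S > 1. The full PDG prescription has further rules that are not reproduced here.

```python
import math


def weighted_average(values, sigmas):
    """Inverse-variance weighted mean and its standard deviation."""
    weights = [1.0 / s ** 2 for s in sigmas]
    wsum = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / wsum
    return mean, 1.0 / math.sqrt(wsum)


def scaled_average(values, sigmas):
    """Weighted mean with a simplified PDG-style scale factor.

    Assumption: S = sqrt(chi2 / (N - 1)), and the error is
    inflated by S only when S > 1.
    """
    mean, sigma = weighted_average(values, sigmas)
    n = len(values)
    if n < 2:
        # A single measurement offers no consistency check.
        return mean, sigma, 1.0
    chi2 = sum(((x - mean) / s) ** 2 for x, s in zip(values, sigmas))
    scale = max(1.0, math.sqrt(chi2 / (n - 1)))
    return mean, sigma * scale, scale


# Hypothetical, deliberately inconsistent measurements (not real data).
mean, sigma, scale = scaled_average([10.0, 14.0], [1.0, 1.0])
print(f"mean = {mean:.3f}, scaled sigma = {sigma:.3f}, S = {scale:.3f}")
```

With a single measurement there is nothing to compare against, so the scale factor stays at 1. With two or more measurements, chi-square carries the extra information about their mutual consistency, which is the point made above about the statement that "the two situations are absolutely equivalent".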