Reply to 1st report (sent 6 January 2000)
-------------------------------------------------------------------------
Dear Editor,

I have received the communication of the negative decision concerning the
manuscript `Sceptical combination of experimental results: General
considerations and application to epsilon_prime/epsilon' (dk7212). I have
read the referee report and I do not find in it arguments that convince me
that the work is bad. Comments on the report are enclosed. Therefore, I
would like to ask you to reconsider the manuscript for publication.

Sincerely,
Giulio D'Agostini
-------------------------------------------------------------------------
Referee report DK7212 and comments by the author
>
> This is a very unusual paper. Unfortunately it doesn't meet the standards
> of publication in Phys Rev.
>
> The main idea of this paper is a proposal for a Bayesian method to combine
> results of different experiments, to replace the usual weighted average,
> which for inconsistent data is customarily modified using the time-tested
> "PDG scale factor."

- The fact that the method is `Bayesian' does not mean that it is wrong.
  Quite the opposite: Bayesian methods have been receiving more and more
  attention over the last decades in all scientific disciplines (see e.g.
  Science, Vol. 286, 19 Nov 1999, pp. 1460-1464). I understand that these
  methods are unknown or misunderstood among physicists, but in our field
  too things are changing, and papers based on Bayesian analysis are
  starting to come out. The interest is such that editors want to present
  this approach to their readers. For example, I have been invited to write
  an introductory paper for the American Journal of Physics (December 1999
  issue), and quite recently I have been invited by the Institute of
  Physics Publishing to write a review paper for Reports on Progress in
  Physics.
- The paper is `unusual' in the sense that it invites the reader to look
  critically at prescriptions "customarily" used because of the authority
  of the PDG, but which have no solid statistical ground.
- At the very end of the report the referee argues that the paper is not
  scientific. Anticipating my conclusions, I think that what is not
  scientific is to prevent the circulation of ideas just because of the
  personal prejudices (or conflict of interest?) of the referee.

> The author gives no evidence that there is anything wrong with the PDG
> scale factor approach, but he tries to throw doubt on two of the pillars
> of classical statistics on which the PDG method is based, namely the
> weighted average (which he calls the "standard combination rule") and
> Pearson's chi-square. I don't find either of these attacks convincing; in
> fact, the author shows somewhat less knowledge of classical statistics
> than one would expect even from a typical reader of this paper, which is a
> bad start. For example, after stating the conditions assumed to hold when
> the weighted average is used, he says only that "If one, or several, of
> these hypotheses [the conditions he gives] are not satisfied, the result
> [the weighted average and its standard deviation] is questionable." But
> he doesn't say what is "questionable", namely what are the properties of
> the weighted average which hold under the stated conditions but do not
> hold if they are not satisfied. This failure to make things clear at the
> outset has a bad effect on the clarity of the whole paper and reduces the
> author's credibility.

- I have studied the so-called "classical statistics" rather in depth and,
  as a result of this study, I now belong to the increasing number of
  people who think that it has no more scientific validity than a
  collection of "cooking recipes", starting from the "basic pillars" on
  which the PDG method is based.
- I avoided on purpose criticizing the basis of the PDG rule in the paper.
  A rule is a rule: either one obeys it or not. I just describe it briefly
  and compare the results. I did not even try to apply it in some of the
  complicated situations depicted in Fig. 4 (but the reader can guess what
  would happen).
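- For the discussion that follows, it may help to recall that the
  "standard combination rule" of the paper (its Eqs. (1)-(2)) is nothing
  other than the usual weighted average of results x_i with standard
  uncertainties sigma_i:

      \bar{x} = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2} ,
      \qquad
      \sigma(\bar{x}) = \Big( \sum_i 1/\sigma_i^2 \Big)^{-1/2} .

  Note that the result depends on the individual values only through these
  two sums, a point which will matter further below.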
> A little later on the same page, the author states that "As a strict rule,
> the chi-square test is not really logically grounded ... although it does
> 'often work' ... ". Now it should be pointed out that Pearson's
> chi-square test occupies a position in statistical theory roughly
> comparable to that of special relativity in HEP.

- This last sentence is, to say the least, blasphemous. I am not the only
  one to state that the chi-square test belongs to the class of
  frequentistic ad-hoc-eries for testing hypotheses, which `often work' but
  sometimes produce terrible mistakes in scientific judgement: I give
  examples and an extended discussion of this point in the CERN Report
  99-03, mentioned later by the referee (the test at issue, and the PDG
  scale factor built upon it, are recalled after this exchange).

> It is used in all fields
> of science, and an enormous number -- perhaps even the majority -- of all
> experimental results are based on it in one way or another. In HEP alone,
> no track or event is accepted in a reconstruction program without passing
> a chi-square test. Hundreds of Monte Carlo programs study the results of
> these tests and if they do not conform to the theoretical predictions,
> this is taken as an indication that the experiment or the program is not
> yet well understood. If there was anything fundamentally wrong with the
> chi-square test, it should have been discovered long ago. Thus it was
> with great interest that I consulted the CERN internal report (by the same
> author) given as a reference for this earth-shaking discovery. Needless
> to say, the report was just as unclear and unconvincing as this
> manuscript.
>
> In section 2, the author gives a Bayesian interpretation of the weighted
> average, and shows that it has severe problems, namely that it requires
> distribution of beliefs, including prior beliefs which have to exist
> before the experiments are performed. Then he concludes that the usual
> (non-Bayesian) interpretation must be wrong because it does not suffer
> from the same problems as the Bayesian interpretation. The logic here is
> that the Bayesian interpretation is right, therefore any other
> interpretation that is right must include the elements (and hence the
> problems) of the Bayesian interpretation. I don't think many physicists
> will be convinced by this logic.

- These arguments show that the referee has confused ideas and/or strong
  personal prejudices against the Bayesian approach. Unbiased, experienced
  physicists who have read the paper were absolutely convinced by its
  logic.
- As far as the CERN report is concerned, it has been called enlightening
  by those who have read it with an open mind, but I cannot pretend that it
  is appreciated by a defender of the so-called statistical theory (better
  called `statistical practice') that I strongly criticize.
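- As promised above, let me recall for the Editor's convenience the two
  ingredients under discussion (standard textbook and PDG material).
  Applied to the consistency of N results x_i with standard uncertainties
  sigma_i, Pearson's test uses

      \chi^2 = \sum_{i=1}^{N} \frac{(x_i - \bar{x})^2}{\sigma_i^2} ,

  to be compared with the chi-square distribution with N-1 degrees of
  freedom. The PDG prescription, in its simplest form, enlarges the
  uncertainty of the weighted average by the scale factor

      S = \sqrt{\chi^2/(N-1)}

  whenever S > 1, and leaves it unchanged otherwise.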
> In section 3, the new method (essentially the Dose-von der Linden method)
> is proposed. It is Bayesian, and the important element is the prior belief
> function of the stretch factors r by which the individual experimental
> errors have been underestimated. A gamma distribution of beliefs is
> chosen. There is no justification for this, and no indication of how the
> results depend on this function (although there is some discussion of
> varying the parameters of the function). He points out some nice intuitive
> features of the new method, such as the fact that two independent
> experiments get more weight than a single experiment with the same amount
> of data. Unfortunately he then makes the misleading statement that in the
> weighted average "the two situations are absolutely equivalent". This is
> misleading because it is true only if there is no scale factor. That is
> why the scale factor was introduced, to use the additional information to
> determine whether the conditions for the validity of the weighted average
> are satisfied.

- The reason for the choice of the function is clearly stated in the paper.
- The statement which is considered "misleading" is, instead, absolutely
  appropriate, because the text says "the two situations are absolutely
  equivalent in the standard combination rule", and in the text "standard
  combination rule" is properly defined as Eqs. (1)-(2).
- As far as the PDG `scale factor' is concerned, one should note that:
  1) it is mentioned only in section 4, and only for comparison, i.e. after
  the sentence in question; 2) it is only used to enlarge the uncertainty,
  while the method proposed here also works in the case of `too much'
  overlapping data; 3) the interpretation of the PDG result is always
  Gaussian, while the method proposed here has no such constraint, and
  hence it can better describe the case of conflicting results.

> In section 4, all this is applied to a somewhat controversial situation in
> weak interactions. The desired result is achieved, namely he gives
> evidence that the Fermilab measurements should have less weight than the
> CERN measurements. The same result could have been achieved much more
> simply and more transparently using standard robust classical statistics,
> in which measurements farthest from the weighted average are unweighted by
> an amount depending on various criteria, usually their contribution to
> chi-square. In the present manuscript, the mathematics of the Bayesian
> formulation obscures the way in which the re-weighting is performed, but
> at least the values of the "weights" are recovered in section 5.

- The point of the application was not to show that CERN is "better" than
  Fermilab, but rather to see whether the picture of a positive
  epsilon'/epsilon survives this kind of sceptical analysis, and I am very
  glad of the positive result.
- I have had an exchange of e-mails with the outstanding KTeV member NN
  [private reference omitted in the web version of this letter ...] and
  clarified my position; he had absolutely no problem in following the
  logic of the paper, and was only initially somewhat upset by the outcomes
  of section 5. By the way, NN belongs to the many persons who do not like
  the PDG rule (see hep-ex/9911031), and, in this respect, he agrees with
  ... [omitted for the same privacy reason].
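- Concerning the claim that the mathematics of the Bayesian formulation
  "obscures" the re-weighting, the mechanism can be exhibited in a few
  lines. What follows is only a minimal numerical sketch of this kind of
  marginalization, with illustrative data and illustrative prior
  parameters (they are not the values used in the paper):

      import numpy as np
      from scipy import stats

      # Illustrative inputs: central values d_i with quoted standard
      # uncertainties s_i (NOT the epsilon'/epsilon data of the paper).
      d = np.array([1.0, 1.2, 3.0])
      s = np.array([0.3, 0.4, 0.3])

      # Grid for the true value mu (flat prior) and for the factor r by
      # which each quoted uncertainty might have been underestimated.
      mu = np.linspace(-1.0, 5.0, 601)
      r = np.linspace(0.05, 10.0, 400)

      # Sceptical prior on r: a Gamma distribution (shape and scale here
      # are illustrative assumptions, not the parameters of the paper).
      prior_r = stats.gamma.pdf(r, a=2.0, scale=1.0)

      # Independent experiments: marginalize each Gaussian likelihood
      # N(d_i; mu, r*s_i) over r and multiply the resulting factors.
      posterior = np.ones_like(mu)
      for d_i, s_i in zip(d, s):
          like = stats.norm.pdf(d_i, loc=mu[:, None], scale=r[None, :] * s_i)
          posterior *= np.trapz(like * prior_r, r, axis=1)

      posterior /= np.trapz(posterior, mu)  # density of the true value
      print("posterior mean:", np.trapz(mu * posterior, mu))

  The outlying result is automatically down-weighted, because the
  sceptical prior allows large values of r, which broaden its likelihood;
  in this sketch the posterior for each individual r_i can be extracted
  from the same integrand, which is how rescaling factors of the kind
  discussed in section 5 are obtained. Nothing is hidden.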
> Section 5 calculates the individual scale factors (weights) resulting from
> the assumed prior distribution of r. The weights are reasonable, not very
> different from what you might get with a classical robust method. We
> don't know if several prior functions were tried before finding one which
> gave reasonable weights, we are only told that a particular function gives
> these particular weights.

- Table 4 gives the variation of the results under the different
  _reasonable_ assumptions used in the analysis (I am not interested in
  seeing what happens if one uses crazy assumptions);
- The text explicitly states the importance of the "initial distribution
  of r, which protects us against unexpectedly large values of the
  rescaling factors". So I do not see how a `classical robust method'
  should provide "not very different" results.

> Conclusions.
> The proposed method gives not only the probability density of the true
> value of Re(eps'/eps), but it also tells us exactly by how much each
> experiment underestimated its errors. Unfortunately, there is no
> indication of whether any of this is true.

- What does `true' mean in the realm of uncertainty? We can only state how
  confident we are that Re(eps'/eps) lies in a certain range and, at a
  second level, which experiment is most likely to have overlooked some
  systematics. This is the most we can do, in our human condition, in
  front of the unknown.

> The author criticizes
> classical theory because the conditions under which the weighted average
> is valid may not be satisfied, but for the method he proposes, we don't
> even know under what conditions (if any!) the results are valid. This is
> not Science.

- The conditions are all clearly defined, and they look reasonable to
  experienced physicists who do not have to defend a position of power in
  dictating rules as a PDG member or consultant.

> I do not recommend publication.
-------------------------------------------------------------------------

In conclusion, I think that the negative reaction of the referee is
essentially due to his strong prejudices against Bayesian statistics and to
his personal interest in defending a `prescription' on which there is quite
some disagreement in the HEP community.

Giulio D'Agostini