Reply to 1st report (sent 6 January 2000)
-------------------------------------------------------------------------
Dear Editor,

I have received the communication of the negative decision concerning the
manuscript `Sceptical combination of experimental results: General
considerations and application to epsilon_prime/epsilon' (dk7212). I have
read the referee report and I do not find in it arguments that convince me
that the work is bad. Comments on the report are enclosed. Therefore, I
would like to ask you to reconsider the manuscript for publication.

Sincerely,
Giulio D'Agostini
-------------------------------------------------------------------------
Referee report DK7212 and comments by the author
>
> This is a very unusual paper. Unfortunately it doesn't meet the standards
> of publication in Phys Rev.
>
> The main idea of this paper is a proposal for a Bayesian method to combine
> results of different experiments, to replace the usual weighted average,
> which for inconsistent data is customarily modified using the time-tested
> "PDG scale factor."

- The fact that the method is `Bayesian' does not mean that it is wrong.
  Quite the opposite: Bayesian methods have been receiving more and more
  attention over the last decades in all scientific disciplines (see e.g.
  Science, Vol. 286, 19 Nov 1999, pp. 1460-1464). I understand that these
  methods are unknown or misunderstood among physicists, but in our field
  too things are changing, and papers based on Bayesian analysis are
  starting to come out. The interest is such that editors want to present
  this approach to their readers. For example, I have been invited to write
  an introductory paper for the American Journal of Physics (December 1999
  issue), and quite recently I have been invited by the Institute of
  Physics Publishing to write a review paper for Reports on Progress in
  Physics.
- The paper is `unusual' in the sense that it invites the reader to look
  critically at prescriptions "customarily" used because of the authority
  of the PDG, but which have no solid statistical ground.
- At the very end of the report the referee argues that the paper is not
  scientific. Anticipating my conclusions, I think that what is not
  scientific is to prevent the circulation of ideas just because of the
  personal prejudices (or conflict of interest?) of the referee.

> The author gives no evidence that there is anything wrong with the PDG
> scale factor approach, but he tries to throw doubt on two of the pillars
> of classical statistics on which the PDG method is based, namely the
> weighted average (which he calls the "standard combination rule") and
> Pearson's chi-square. I don't find either of these attacks convincing; in
> fact, the author shows somewhat less knowledge of classical statistics
> than one would expect even from a typical reader of this paper, which is a
> bad start. For example, after stating the conditions assumed to hold when
> the weighted average is used, he says only that "If one, or several, of
> these hypotheses [the conditions he gives] are not satisfied, the result
> [the weighted average and its standard deviation] is questionable." But
> he doesn't say what is "questionable", namely what are the properties of
> the weighted average which hold under the stated conditions but do not
> hold if they are not satisfied. This failure to make things clear at the
> outset has a bad effect on the clarity of the whole paper and reduces the
> author's credibility.

- I have studied the so-called "classical statistics" rather in depth and,
  as a result of this study, I now belong to the increasing number of
  people who think that it has no more scientific validity than a
  collection of "cooking recipes", starting from the "basic pillars" on
  which the PDG method is based.
- I avoided on purpose criticizing the basis of the PDG rule in the paper.
  A rule is a rule: either one obeys it or not. I just describe it briefly
  and compare the results. I did not even try to apply it in some of the
  complicated situations depicted in Fig. 4 (but the reader can guess what
  would happen).
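- For the discussion that follows, it may help to recall that the
  "standard combination rule" of the paper (its Eqs. (1)-(2)) is nothing
  other than the usual weighted average of results x_i with standard
  uncertainties sigma_i:

      \bar{x} = \frac{\sum_i x_i/\sigma_i^2}{\sum_i 1/\sigma_i^2} ,
      \qquad
      \sigma(\bar{x}) = \Big( \sum_i 1/\sigma_i^2 \Big)^{-1/2} .

  Note that the result depends on the individual values only through these
  two sums, a point which will matter further below.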
> A little later on the same page, the author states that "As a strict rule,
> the chi-square test is not really logically grounded ... although it does
> 'often work' ... ". Now it should be pointed out that Pearson's
> chi-square test occupies a position in statistical theory roughly
> comparable to that of special relativity in HEP.

- This last sentence is, to say the least, blasphemous. I am not the only
  one to state that the chi-square test belongs to the class of
  frequentistic ad-hoc-eries for testing hypotheses, which `often work' but
  sometimes produce terrible mistakes in scientific judgement: I give
  examples and an extended discussion of this point in the CERN Report
  99-03, mentioned later by the referee (the test at issue, and the PDG
  scale factor built upon it, are recalled after this exchange).

> It is used in all fields
> of science, and an enormous number -- perhaps even the majority -- of all
> experimental results are based on it in one way or another. In HEP alone,
> no track or event is accepted in a reconstruction program without passing
> a chi-square test. Hundreds of Monte Carlo programs study the results of
> these tests and if they do not conform to the theoretical predictions,
> this is taken as an indication that the experiment or the program is not
> yet well understood. If there was anything fundamentally wrong with the
> chi-square test, it should have been discovered long ago. Thus it was
> with great interest that I consulted the CERN internal report (by the same
> author) given as a reference for this earth-shaking discovery. Needless
> to say, the report was just as unclear and unconvincing as this
> manuscript.
>
> In section 2, the author gives a Bayesian interpretation of the weighted
> average, and shows that it has severe problems, namely that it requires
> distribution of beliefs, including prior beliefs which have to exist
> before the experiments are performed. Then he concludes that the usual
> (non-Bayesian) interpretation must be wrong because it does not suffer
> from the same problems as the Bayesian interpretation. The logic here is
> that the Bayesian interpretation is right, therefore any other
> interpretation that is right must include the elements (and hence the
> problems) of the Bayesian interpretation. I don't think many physicists
> will be convinced by this logic.

- These arguments show that the referee has confused ideas and/or strong
  personal prejudices against the Bayesian approach. Unbiased, experienced
  physicists who have read the paper were absolutely convinced by its
  logic.
- As far as the CERN report is concerned, it has been called enlightening
  by those who have read it with an open mind, but I cannot pretend that it
  is appreciated by a defender of the so-called statistical theory (better
  called `statistical practice') that I strongly criticize.
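- As promised above, let me recall for the Editor's convenience the two
  ingredients under discussion (standard textbook and PDG material).
  Applied to the consistency of N results x_i with standard uncertainties
  sigma_i, Pearson's test uses

      \chi^2 = \sum_{i=1}^{N} \frac{(x_i - \bar{x})^2}{\sigma_i^2} ,

  to be compared with the chi-square distribution with N-1 degrees of
  freedom. The PDG prescription, in its simplest form, enlarges the
  uncertainty of the weighted average by the scale factor

      S = \sqrt{\chi^2/(N-1)}

  whenever S > 1, and leaves it unchanged otherwise.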
> In section 3, the new method (essentially the Dose-von der Linden method)
> is proposed. It is Bayesian, and the important element is the prior belief
> function of the stretch factors r by which the individual experimental
> errors have been underestimated. A gamma distribution of beliefs is
> chosen. There is no justification for this, and no indication of how the
> results depend on this function (although there is some discussion of
> varying the parameters of the function). He points out some nice intuitive
> features of the new method, such as the fact that two independent
> experiments get more weight than a single experiment with the same amount
> of data. Unfortunately he then makes the misleading statement that in the
> weighted average "the two situations are absolutely equivalent". This is
> misleading because it is true only if there is no scale factor. That is
> why the scale factor was introduced, to use the additional information to
> determine whether the conditions for the validity of the weighted average
> are satisfied.

- The reason for the choice of the function is clearly stated in the paper.
- The statement which is considered "misleading" is, instead, absolutely
  appropriate, because the text says "the two situations are absolutely
  equivalent in the standard combination rule", and in the text "standard
  combination rule" is properly defined as Eqs. (1)-(2).
- As far as the PDG `scale factor' is concerned, one should note that:
  1) it is mentioned only in section 4, and only for comparison, i.e. after
  the sentence in question; 2) it is only used to enlarge the uncertainty,
  while the method proposed here also works in the case of `too much'
  overlapping data; 3) the interpretation of the PDG result is always
  Gaussian, while the method proposed here has no such constraint, and
  hence it can better describe the case of conflicting results.

> In section 4, all this is applied to a somewhat controversial situation in
> weak interactions. The desired result is achieved, namely he gives
> evidence that the Fermilab measurements should have less weight than the
> CERN measurements. The same result could have been achieved much more
> simply and more transparently using standard robust classical statistics,
> in which measurements farthest from the weighted average are unweighted by
> an amount depending on various criteria, usually their contribution to
> chi-square. In the present manuscript, the mathematics of the Bayesian
> formulation obscures the way in which the re-weighting is performed, but
> at least the values of the "weights" are recovered in section 5.

- The point of the application was not to show that CERN is "better" than
  Fermilab, but rather to see whether the picture of a positive
  epsilon'/epsilon survives this kind of sceptical analysis, and I am very
  glad of the positive result.
- I have had an exchange of e-mails with the outstanding KTeV member NN
  [private reference omitted in the web version of this letter ...] and
  clarified my position; he had absolutely no problem in following the
  logic of the paper, and was only initially somewhat upset by the outcomes
  of section 5. By the way, NN belongs to the many persons who do not like
  the PDG rule (see hep-ex/9911031), and, in this respect, he agrees with
  ... [omitted for the same privacy reason].
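- Concerning the claim that the mathematics of the Bayesian formulation
  "obscures" the re-weighting, the mechanism can be exhibited in a few
  lines. What follows is only a minimal numerical sketch of this kind of
  marginalization, with illustrative data and illustrative prior
  parameters (they are not the values used in the paper):

      import numpy as np
      from scipy import stats

      # Illustrative inputs: central values d_i with quoted standard
      # uncertainties s_i (NOT the epsilon'/epsilon data of the paper).
      d = np.array([1.0, 1.2, 3.0])
      s = np.array([0.3, 0.4, 0.3])

      # Grid for the true value mu (flat prior) and for the factor r by
      # which each quoted uncertainty might have been underestimated.
      mu = np.linspace(-1.0, 5.0, 601)
      r = np.linspace(0.05, 10.0, 400)

      # Sceptical prior on r: a Gamma distribution (shape and scale here
      # are illustrative assumptions, not the parameters of the paper).
      prior_r = stats.gamma.pdf(r, a=2.0, scale=1.0)

      # Independent experiments: marginalize each Gaussian likelihood
      # N(d_i; mu, r*s_i) over r and multiply the resulting factors.
      posterior = np.ones_like(mu)
      for d_i, s_i in zip(d, s):
          like = stats.norm.pdf(d_i, loc=mu[:, None], scale=r[None, :] * s_i)
          posterior *= np.trapz(like * prior_r, r, axis=1)

      posterior /= np.trapz(posterior, mu)  # density of the true value
      print("posterior mean:", np.trapz(mu * posterior, mu))

  The outlying result is automatically down-weighted, because the
  sceptical prior allows large values of r, which broaden its likelihood;
  in this sketch the posterior for each individual r_i can be extracted
  from the same integrand, which is how rescaling factors of the kind
  discussed in section 5 are obtained. Nothing is hidden.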
> Section 5 calculates the individual scale factors (weights) resulting from
> the assumed prior distribution of r. The weights are reasonable, not very
> different from what you might get with a classical robust method. We
> don't know if several prior functions were tried before finding one which
> gave reasonable weights, we are only told that a particular function gives
> these particular weights.

- Table 4 gives the variation of the results under the different
  _reasonable_ assumptions used in the analysis (I am not interested in
  seeing what happens if one uses crazy assumptions);
- The text explicitly states the importance of the "initial distribution
  of r, which protects us against unexpectedly large values of the
  rescaling factors". So I do not see how a `classical robust method'
  should provide "not very different" results.

> Conclusions.
> The proposed method gives not only the probability density of the true
> value of Re(eps'/eps), but it also tells us exactly by how much each
> experiment underestimated its errors. Unfortunately, there is no
> indication of whether any of this is true.

- What does `true' mean in the realm of uncertainty? We can only state how
  confident we are that Re(eps'/eps) lies in a certain range and, at a
  second level, which experiment is most likely to have overlooked some
  systematics. This is the most we can do, in our human condition, in
  front of the unknown.

> The author criticizes
> classical theory because the conditions under which the weighted average
> is valid may not be satisfied, but for the method he proposes, we don't
> even know under what conditions (if any!) the results are valid. This is
> not Science.

- The conditions are all clearly defined, and they look reasonable to
  experienced physicists who do not have to defend a position of power in
  dictating rules as a PDG member or consultant.

> I do not recommend publication.
-------------------------------------------------------------------------

In conclusion, I think that the negative reaction of the referee is
essentially due to his strong prejudices against Bayesian statistics and to
his personal interest in defending a `prescription' on which there is quite
some disagreement in the HEP community.

Giulio D'Agostini