------------------------------------------------------------------------
Clarifications and answers to questions concerning the YR CERN-99-03
Last modification 16/12/99
------------------------------------------------------------------------

Foreword: The YR is a report on still ongoing work toward the
--------- understanding of probabilistic reasoning and its application
to data analysis. All comments and suggestions are welcome.

References: all papers cited with just the beginning of the title
----------- can be found on my web page on probability and statistics:
http://www-zeus.roma1.infn.it/~agostini/prob+stat.html

Acknowledgements: For the moment I prefer to keep most authors of
----------------- questions and comments anonymous, but if any author
wishes, I can cite his/her name. The only exception, for the moment, is
Fred James (CERN), the main contributor to this list (Q5-Q15) [Q12-Q14
are in collaboration with Louis Lyons (Oxford)], who accounts in total
for the "various people claiming there are mathematical errors and
problems of interpretation in the report"
(www.cern.ch/CERN/Divisions/EP/Events/CLW/reading.html). I wish to
thank all those who have contributed.

------------------------------------------------------------------------

Q15: Footnote 3 on p. 120 is very misleading and has upset the authors
of both ref [46] and ref [60]. They do not "admit" the Bayesian
approach is good for decision problems, they "claim" it. And they don't
"stick to the frequentist approach" for decision problems, they stick
to the frequentist approach for some other problems. Unlike many
Bayesians, they do distinguish between decision problems and other
problems.

A15: Many other people got upset by the way Bayesian inference has been
presented by the PDG and by the F-C paper, and by the latter's prompt
adoption by the PDG. Coming back to the sentence, I think it is not as
dramatic as you put it.
The meaning is simply this: a decision problem deals with at least two
hypotheses, a value of probability assigned to each of them and some
utility/loss assigned to each of them. Since we are dealing with
probabilities of hypotheses, they can only be degrees of belief. So,
sometimes you admit that probability can be a degree of belief,
sometimes not ("stick to the frequentistic approach": this is the sense
of the last part of the sentence). In this sense "admit" means just
that sometimes you have to make use of degrees of belief, although you
insist (also during this exchange of opinions) that the right
probability is the "frequentistic" one. Finally, if you seriously
believe that there are several probabilities, I don't know what to say:
in my opinion, an attitude of having several "probabilities" is at
least strange. A consistent use of subjective probability doesn't have
this kind of problems (as Section 8.1 was meant to show).

Q14: The paragraph at the top of p. 153 is not right mathematically. It
is a well-known property of Bayesian analysis that conclusions drawn
from a prior which is flat in one variable will not be the same as
conclusions drawn from the same data with a prior flat in a different
variable. Making the variables discrete does not change this.

A14: Making the variable discrete means, by definition, not considering
continuous variables anymore, and then also the Jacobian loses meaning.
Let us assume we have two variables, x and y, such that y = ln(x). The
variable x is defined (or we are interested in it) in the interval
[0.5,1.5], and the experiment has a resolution of 0.01 on it, all over
the range. Discretizing the variable means considering 100 intervals of
x, to which correspond 100 intervals of y, as follows:

        x                 y
   [0.50,0.51]    [-0.693,-0.673]
   [0.51,0.52]    [-0.673,-0.654]
      ....            ......
   [1.49,1.50]    [ 0.399, 0.405]

Clearly, equiprobability in the x intervals maps into equiprobability
in the y intervals.
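The discretization above can be sketched in a few lines of code (a
minimal sketch; the variable names are mine). The point is that, once
the bins are fixed, a probability assignment is just one number per
bin, and the very same numbers describe the x bins and the y bins:

```python
import math

# Discretize x in [0.5, 1.5] into 100 bins of width 0.01 (the stated
# resolution), and map each bin to the corresponding interval of y = ln(x).
edges_x = [0.5 + 0.01 * i for i in range(101)]
bins_x = list(zip(edges_x[:-1], edges_x[1:]))
bins_y = [(math.log(a), math.log(b)) for a, b in bins_x]

# Equiprobability assigns 1/100 to each bin.  The *same* list of numbers
# is the probability assignment over the y bins, since the bin-by-bin
# mapping is one-to-one; no Jacobian appears anywhere.
prob = [1.0 / len(bins_x)] * len(bins_x)

print(bins_y[0])   # first y interval, approx (-0.693, -0.673)
print(sum(prob))   # total probability
```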
Q13: At the top of page 127 you say that a simulation is equivalent to
assuming a flat distribution for mu. Simulations are true for the value
assumed in the simulation, and no prior distribution, flat or
otherwise, is implied in any simulation.

A13: This is a continuation of what is discussed in Sec. 1.7, and,
indeed, in that section there is a reference to figure 1.3. Let us
explain it better with a numerical example:

- The likelihood is Gaussian around the true value, with standard
  deviation 0.1.
- Given the nature of the problem, _we_know_ that the prior
  distribution of the true value is not uniform in that region. Let's
  take, for the sake of simplicity, a prior p.d.f. f_0(m) = 1/m, with a
  cutoff at 0.01 to avoid numerical problems.
- Take the simple simulation in which only one event is generated, for
  the true value m=0.2, and the result is x=0.2 (an idealistic and
  symmetric case, which gives, however, the idea of what is going on).
- What will our inference on m be?
  1. Maximum likelihood will give a best estimate of m=0.2, and a 68%
     confidence interval [0.1,0.3];
  2. A Bayesian inference with uniform prior will provide a Gaussian
     p.d.f. for m around x=0.2 with sigma=0.1, from which the 68%
     probability interval [0.1,0.3] will be inferred;
  3. The "correct" Bayesian inference, in which the most likely prior
     knowledge for m is taken, will give a non-Gaussian p.d.f., having
     E[m]=0.14 and sigma=0.095, and a central 68% probability interval
     (around the median) of [0.04,0.25].

=> the "correct" procedure gives the wrong result. What is the problem?
The initial distribution should have been used also in the simulation,
as is well known to people who work with asymmetric distributions,
where large migrations play an important role. [Clearly an example with
one event like the one above is impossible, and one has to go through a
complete simulation.]
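The "correct" Bayesian numbers quoted above (E[m]=0.14, sigma=0.095)
can be reproduced by direct numerical integration of the posterior,
Gaussian likelihood times the 1/m prior (a sketch; the grid step and
the integration range are my choices):

```python
import math

# Example above: likelihood Gaussian with x = 0.2, sigma = 0.1;
# prior f_0(m) = 1/m with a cutoff at m = 0.01.
x, sigma, cutoff = 0.2, 0.1, 0.01

def posterior_unnorm(m):
    return math.exp(-0.5 * ((x - m) / sigma) ** 2) / m

# Midpoint Riemann sum on a fine grid (the integrand is negligible
# above m ~ 1, so the upper limit is harmless).
dm = 1e-5
grid = [cutoff + dm * (i + 0.5) for i in range(int(1.0 / dm))]
w = [posterior_unnorm(m) for m in grid]
norm = sum(w) * dm
mean = sum(m * wi for m, wi in zip(grid, w)) * dm / norm
var = sum((m - mean) ** 2 * wi for m, wi in zip(grid, w)) * dm / norm

print(round(mean, 3), round(math.sqrt(var), 3))   # approx 0.14 and 0.095
```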
The "problem" is that we are too often used to working with very
narrow likelihoods, and tend to forget the cases in which the problem
is more complicated, and one needs more sophisticated `unfoldings'.
For example, the people working on gamma-gamma physics are very well
aware of this problem. The inference on the first and last bins of the
acceptance can depend critically on the `believed' (people prefer to
say "assumed") behaviour of the physics quantity outside the region of
acceptance.

Q12: The paragraph at the top of p. 81 implies that there is some
arbitrariness in the classical likelihood method due to the
arbitrariness of the variable used. This is not true, because the
method is invariant under change of variables.

A12: I must admit that, at first sight, extracting the sentence from
the context, it gives this impression. It is really a case of bad
wording, for which I apologize. Certainly the likelihood function is
invariant under change of variable, and this is the reason why, when
the choice of priors is really critical (as happens in frontier
measurements close to the detector sensitivity), we recommend
abstaining from calculating probabilities and giving only likelihoods
(both the likelihood of background alone and that of
signal+background), for example in the form of Bayes factors (see
"Inferring the intensity of ..."). For example, in the ZEUS paper
hep-ex/9905039, which makes use of the R-function, it is stated
explicitly that "R is invariant with respect to the variable
transformation". The intention of the section was to show that
inferences certainly depend on the choice of the prior (unless the
likelihood is very narrow, as discussed elsewhere in the YR). However,
at the moment of the intuitive interpretation of the result (which is
what the practitioner does, at the end of the day), an implicit
uniform distribution is intuitively used (as discussed in Sec. 1.7).
So the effect of adhering strictly to the maximum likelihood principle
is that of hiding the problem, under the slogan that "there are no
priors".

Q11: On p. 65, the second bullet is not meaningful, since the condition
f(x1) > f(x2) is not invariant under transformation of variables
x -> x'. The problem is that f is not a probability (belief), but a
probability DENSITY, and has to be integrated between two values to
give a finite belief.

A11: This is a tricky point, related to events having "zero probability
but different degrees of belief" (see de Finetti's book). In the report
it is just sketched, since in the following I recover "normal"
p.d.f.'s, to which physicists are more accustomed, and go on. Let's say
that for the note it is not a crucial problem, but there is nothing
wrong. A way to understand it is the following. Imagine an ideal
(mathematical) experiment in which a `point' is dropped on a plane
having a coordinate system (x,y) on it. The `point' is dropped from
approximately above (e.g. the hand is trembling) the position
(x=0, y=0). Then you ask: "what is the probability that it reaches
exactly the point (x=0,y=0)?" The answer is "zero". We give the same
answer for (x=pi/2, y=ln2), (x=1, y=\sqrt{3}), etc. But now ask: "do
you believe more that it will reach (x=0,y=0) or (x=pi^3, y=e^3)?"
(units in cm). We believe more in (x=0,y=0) than in (x=pi^3, y=e^3),
even though they both have zero probability.

Q10: What does it mean that "P(E) is not an intrinsic characteristic of
the event E" (p. 47) and that "Absolute probability makes no sense"
(p. 122, line 6)? Does this mean, for example, that the probability
that a Lambda will decay to (p pi-) is not an intrinsic characteristic
of Lambda decays? Is it conditional on something?

A10: This is a very crucial philosophical point, with respect to which,
I admit, I oscillate, as I confess at the end of p. 136, in a part
which can perhaps be obscure. Anyhow, the simplest interpretation is
the following.
Let's take a Lambda, the first time we "see" a Lambda (obviously this
is a bit artificial). Let us also suppose we know nothing about
strangeness, the SM, etc. What will this Lambda do? You can have some
ideas, other physicists others. We do experiments and, after much
experimentation, we arrive at stating that, GIVEN all our past
knowledge, we are 63.9% sure that it will decay into p pi- and 35.8%
into n pi0. As you can imagine, one can argue forever about the deep
meaning of quantum mechanics. In an approach in which probability is
related to knowledge, it doesn't really matter whether one believes
that probability is an intrinsic property of nature or just the limit
of our knowledge (a la Einstein).

Q9: The second quote on p. 15 does not imply the first one.

A9: The quote in line 9 is exactly how frequentist books report the
"verdict" of a test; its frequentistic meaning is the statements in
lines 12-13. It is a matter of fact that this is a major source of
confusion, and I have spent hours (many!) trying to convince people
(professional physicists of all ages and nationalities) that line 9
had to be meant as lines 12-13, and NOT that there is only 1% chance
that H0 is true. In the 19 November 1999 issue of Science magazine
there is an article about the `Bayesian boom'. The misinterpretation
of p-values is one of the things which is universally recognised by
statisticians of all schools (and even the frequentistic ones admit
that they are at a loss trying to explain their meaning to students).
[Science Vol. 286, page 1460]

Q8: The frequentistic hypothesis test scheme of p. 13 is not well
described. Moreover, you don't consider `type A and type B errors'.
[Nothing to do with type A and type B uncertainties of ISO!]

A8: I insist that what I wrote is correct. The fact that, if chi^2 is
far away from E[chi^2], there could easily be a new model which would
have a better chi^2 is a different story, although this is how we
intuitively reason. This is discussed in Section 8.8.
Important points are:

- testing a single hypothesis (H0 alone) has, strictly speaking, no
  sense. Nevertheless, it is true that we very often find ourselves in
  situations in which it is easy to imagine alternative hypotheses,
  therefore ...
- ... large deviations can stimulate people to search for alternative
  explanations, but they are not enough to state that H0 is unlikely;
- what really matters is not the probability of the tail, but the
  relative weights (see Fig. 8.2) (but, obviously, a small "p-value"
  corresponds to a low p.d.f. and therefore it can be taken as a rough
  rule for where to start worrying; what is dangerous is to take it
  literally).

As far as type A and type B errors are concerned, considering them is
certainly a big improvement, but the frequentistic scheme is still
missing the important ingredient of the prior probabilities with which
the two hypotheses are thought to enter into the game. But at least,
the information about the two errors could be, in simple problems, the
analogue of Bayes factors (although the latter are superior, since
they are based only on the ratio of likelihoods, and not on integrals
over the tail(s)).

Q7: The example on p. 11 is mathematically wrong! How can the distance
between dog and hunter be different from that between hunter and dog?

A7: The text says: "we know that ...", and then "if we observe the
dog". Calling d the distance, your two statements coincide with the
following single statement

   P( d < 100 | "I see the hunter") = 50%

and obviously the probability has to be the same. Now, the inferential
problem is different:

   P( d < 100 | "I see the dog") = ?

and this is 50% only under the conditions stated in the YR.

Q6: Do you really agree with the Howson and Urbach statement that
`statisticians' don't say why one should be 95% confident? (Quote on
p. 11)

A6: Yes, I agree with Howson and Urbach.
The problem is that "classical" confidence has no "physical meaning",
where by physical I mean a statement concerning the physical world
(perhaps it would be more appropriate to call it "epistemological
meaning"). Perhaps I was more explicit in physics/9811046, Section 5
(and end of p. 8). Stated more explicitly: a person stating "I am 95%
confident that the Higgs mass is above 77.5 GeV" should be as
confident that the Higgs mass is above 77.5 GeV as he/she is "95%
confident of extracting a white ball from a box containing 95 white
balls out of 100 in total". Unfortunately, conventional statistics
does not make the two statements comparable, and this creates
confusion.

Q5: How does point 10 of the ISO list of sources of uncertainty
(Sec. 1.3) account for uncertainties of the kind \sqrt{n} in the case
of counting measurements? In this case there are no "repeated
observations".

A5: ISO is essentially interested in metrological measurements, less
in Poisson processes. In my notes I extend point 10 to cover "whatever
could come from a non-deterministic process", and therefore I
associate the \sqrt{n} uncertainty with the type A category (in the
sense of Sec. 6.1.2). I understand now that one could simply stick to
the literal definition of type A and type B uncertainties, and thus
consider the \sqrt{n} uncertainty to be of type B, i.e. "evaluated by
other means", "based on all available information on the possible
variability of X_i" (see pp. 100-101). Perhaps this way of seeing it
is even more appropriate; anyhow, the overall result will not change.

Q4: In the case of the measurement of a neutrino mass (assuming it is
zero), doesn't the prior defined in the positive region bias the
results? If many experiments are done, the average will converge to a
positive value, instead of zero.

A4: This is similar to the problem discussed in Sec. 6.1.5.
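The effect the questioner of Q4 describes can be made concrete with a
small simulation (a sketch under assumptions of my own: unit Gaussian
likelihood, flat prior restricted to m >= 0, and the posterior mean
taken as "the result" of each experiment). With that prior, the
posterior for one observation x is a Gaussian truncated at zero, whose
mean is the standard truncated-normal formula x + phi(x)/Phi(x):

```python
import math
import random

random.seed(1)

def phi(z):   # standard normal p.d.f.
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi(z):   # standard normal c.d.f.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior_mean(x):
    # Mean of a unit Gaussian centred at x, truncated to m >= 0
    # (flat prior on the positive region); always positive.
    return x + phi(x) / Phi(x)

# Many experiments, each observing x ~ N(m_true, 1) with m_true = 0.
means = [posterior_mean(random.gauss(0.0, 1.0)) for _ in range(100_000)]
avg = sum(means) / len(means)
print(avg)   # positive: averaging these "results" does not converge to 0
```

This only illustrates the question; whether the posterior mean is the
right quantity to average at all is exactly what Sec. 6.1.5 discusses.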
Q3: You introduce the Gaussian distribution from the central limit
theorem, while it could be better justified by the Maximum Entropy
principle.

A3: I have already commented on the indiscriminate use of MaxEnt in
A2. As for the use of the Gaussian as a convenient likelihood function
for many physics measurements, this is due to our present
understanding of the probabilistic behaviour of statistical errors and
to the central limit theorem (the same as for Brownian motion). I find
this introduction corresponds even better to our understanding of
measuring devices than the original one given by Gauss. He did, in
fact, exactly the opposite: he started from a uniform prior for the
true value, assumed that the best estimate of the true value was the
arithmetic average, and, just using symmetry arguments, found that the
error function was ... the Gaussian. [For reference see the AJP paper
"Teaching statistics ..."]

Q2: With respect to your footnote on page 88: I belong to the school
of thought which assumes 1/lambda to be the appropriate prior. Am I
right in supposing that you hesitate in view of (5.33) and (5.34)? The
point I want to make is that the uniform prior in the case of the
binomial problem is not least committal. It is certainly contained in
the class of beta distributions, i.e. f_0(p) \propto
p^{a-1}*(1-p)^{b-1}. Taking 1/2 as initial expected value, because of
symmetry, it is possible to show that the maximum variance (i.e.
maximum uncertainty) is obtained in the limit a = b -> 0, thus
yielding f_0(p) \propto 1/[p(1-p)].

A2: This question contains two questions, both related to the
so-called `entropic priors', i.e. priors derived from the Maximum
Entropy principle. In these two cases entropic priors lead to the
Jeffreys priors, derived by symmetry arguments. I have written two
papers in which I manifest my worries about this kind of priors and,
more generally, about `reference priors' [see "Jeffreys' priors..."
and "Overcoming priors anxiety"].
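As an aside, the beta-variance argument in Q2 can be checked directly:
for Beta(a,b) the variance is ab/((a+b)^2 (a+b+1)) (a standard closed
form), and in the symmetric case a = b (mean 1/2) it reduces to
1/(4(2a+1)), which grows as a decreases toward 0. A quick numerical
check:

```python
# Variance of a Beta(a, b) distribution: ab / ((a+b)^2 (a+b+1)).
def beta_var(a, b):
    return a * b / ((a + b) ** 2 * (a + b + 1))

# The uniform prior is Beta(1, 1): variance 1/12.
print(beta_var(1, 1))

# The variance increases monotonically as a = b decreases toward 0,
# approaching 1/4, so the uniform prior is *not* the maximum-variance
# (least committal, in this sense) member of the symmetric beta family:
for a in (1.0, 0.5, 0.1, 0.01):
    print(a, beta_var(a, a))
```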
There I also give reasons for the use of the uniform prior "as a
starting point". In principle I have nothing against entropic priors.
My worries are just practical. I have not been able to find a single
realistic example in which they seem reasonable, in the sense that
they model the knowledge of the person (by definition an expert) who
does the inference. For example: measuring the efficiency of a physics
detector; inferring the proportion of people who will vote for a
candidate, having interviewed a small sample of persons; etc. The
other risks related to reference priors, generally speaking, are the
dogmatization of the theory (also discussed in the cited papers) and
possible absurd results, because mathematical convenience is preferred
to the real knowledge of the problem. Finally, answering the direct
question whether my choice is meant to recover Laplace's formulae: not
at all. I even criticise the great Jeffreys, whose book is a milestone
of probability theory, because it seems to me that he loves recovering
the Student distribution in the small-sample problem. I don't care
about recovering famous or well-"established" results (i.e. results to
which we are just accustomed by use, often because of the authority of
the proposer, or because "they often work" - there is a section in the
YR dedicated to this last point). To conclude: what matters is honest
prior knowledge, the rules of logic to `propagate' probability, and
the normative rule of the coherent bet - in the end I will always ask
you to assess some betting odds!

Q1: Why should one feel obliged to follow metrological rules?
(footnote 8, p. 25)

A1: Just a typing error. It should have been exactly the opposite:
"One should not ..." (in fact the following sentence starts with
"however"). The idea is that physicists in frontier research should be
very critical toward "rules", no matter whether they come from ISO,
DIN or PDG. Nevertheless, one cannot simply ignore their work and
conclusions.