... imperfect.1
This problem has been treated in much detail in Ref. [1], taking a cue from questions related to the Covid-19 pandemic.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... by2
I try, whenever possible, to stick to the convention of capital letters for the names of variables and small letters for their possible values. Exceptions are Greek letters and quantities naturally denoted by a small letter, like $r$ for a `rate'.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
....3
If, instead, we are uncertain about $\lambda$ and quantify its uncertainty by the probability density function $f(\lambda\,\vert\,I)$, where $I$ stands for our status of information about that quantity, the distribution of the counts will be given by $f(x\,\vert\,I) = \int_0^\infty\!\! f(x\,\vert\,\lambda,I)\cdot
f(\lambda\,\vert\,I)\,$d$\lambda\,.$
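As a minimal illustration (a sketch not taken from the paper: the Gamma form and parameters of $f(\lambda\,\vert\,I)$ are made up), the integral can be approximated by Monte Carlo sampling in R:

  # sample lambda from f(lambda | I), here assumed Gamma(shape=4, rate=2),
  # then x from f(x | lambda): the histogram of x approximates f(x | I)
  n = 10^6
  lambda = rgamma(n, shape=4, rate=2)
  x = rpois(n, lambda)
  barplot(table(x)/n, col='cyan', xlab='x', ylab='f(x | I)')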
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... theory4
In brief: the expected value of a linear combination is the linear combination of the expected values; the variance of a linear combination of independent (or at least uncorrelated) variables is the linear combination of the variances, with squared coefficients.
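In formulae (a standard result, recalled here for convenience), for constants $c_1$, $c_2$ and independent, or at least uncorrelated, variables $X_1$ and $X_2$:
\begin{eqnarray*}
\mbox{E}[c_1 X_1 + c_2 X_2] &=& c_1\,\mbox{E}[X_1] + c_2\,\mbox{E}[X_2] \\
\mbox{Var}(c_1 X_1 + c_2 X_2) &=& c_1^2\,\mbox{Var}(X_1) + c_2^2\,\mbox{Var}(X_2)\,.
\end{eqnarray*}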
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...,5
Such a distribution is known in the literature as the Skellam distribution [10], and it is available in R [11] by installing the homonymous package [12]. The distribution of the differences corresponding to the cases of Tab. [*] can be easily plotted by the following R commands, producing a bar plot similar to that of Fig. [*],

  library(skellam)                               # provides dskellam()
  d = -5:5                                       # possible differences x1 - x2
  barplot(dskellam(d,1,1), names=d, col='cyan')

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...R:6
This function, which hopefully has some didactic value, is not optimized at all; it exploits the fact that the R function dpois() returns zero for negative values of the variable.
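A possible implementation of such a function (a didactic sketch under the assumptions just stated, not necessarily the author's exact code) is:

  # pmf of d = x1 - x2 by direct convolution: P(d) = sum_k P(x1 = d + k) * P(x2 = k);
  # dpois() returning zero for negative arguments keeps the sum over k simple
  dskellam.naive = function(d, lambda1, lambda2, kmax = 200) {
    sapply(d, function(di) {
      k = 0:kmax
      sum( dpois(di + k, lambda1) * dpois(k, lambda2) )
    })
  }

For instance, dskellam.naive(-5:5, 1, 1) reproduces, within the truncation of the sum, the values returned by dskellam().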
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... R.7
The core of the R code is given, for the case of $\lambda_1=\lambda_2=1$, by

lambda1 = lambda2 = 1; n = 10^6
x1 = rpois(n, lambda1)                       # simulated counts
x2 = rpois(n, lambda2)
rx = x1/x2                                   # ratio of counts
rx = rx[!is.nan(rx) & (rx != Inf)]           # remove 0/0 (NaN) and x2 = 0 (Inf)
barplot(table(rx)/n, col='cyan', xlab='x1/x2', ylab='f(x1/x2)')

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... counts.8
Ref. [3] is a kind of `masterpiece' of the convoluted reasoning involved. For example, the paper starts with the following incipit (quote marks original):
When the result of the measurement of a physical quantity is published as $R=R_0\pm \sigma_0$ without further explanation, it is implied that $R$ is a gaussian-distributed measurement with mean $R_0$ and variance $\sigma_0^2$. This allows one to calculate various confidence intervals of given “probability”, i.e., the “probability” $P$ that the true value of $R$ is within a given interval.
However, nowhere in the paper is it explained why probability is within quote marks. The reason is simply that the authors are fully aware that the frequentist ideology, to which they overtly adhere, refuses to attach the concept of probability to true values, as well as to model parameters, and so on (see e.g. Ref. [13]). But authoritative statements of this kind might contribute to increasing the confusion of practitioners [14], who then tend to take frequentist `confidence levels' as if they were probability values [15].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... `prior'.9
This name is somewhat unfortunate, because it might induce people to think of a time order; as discussed in Ref. [1], the `prior' can instead be applied in a second step, in particular by someone else, if a `flat prior' was initially used.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...fig:pdf_lambda_x1_x2_MC_high.10
The complete script is provided in Appendix B.2.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... value”11
For the reason behind the quote marks see footnote [*].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...fig:pdf_lambda_x1_x2_MC_high,12
The absence of overflow in that plot is only due to the `limited' number of samples, chosen to be $10^7$, and to the fact that the same script as for the other plot has been used.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...eq:sommatoria),13
For a different approach to obtaining Eq. ([*]) see footnote [*], in which the integrand of Eq. ([*]) is interpreted as the joint pdf of $\rho_\lambda$, $\lambda_1$ and $\lambda_2$.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... integrals14
Work done on a Raspberry Pi3, thanks to Mathematica 12.0 generously offered by Wolfram Inc. to the Raspbian system.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
....15
To be precise, the condition is $x_1>-1$ and $x_2>-1$; but, given the role of the two variables in our context, it holds for all possible counts (including $x_1=x_2=0$, for which the pdf becomes $1/(1+\rho_\lambda)^2$, which has, however, infinite mean and variance).
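The divergence of the mean in this limiting case can be checked directly:
\[
\mbox{E}[\rho_\lambda] \,=\, \int_0^\infty \frac{\rho_\lambda}{(1+\rho_\lambda)^2}\,\mbox{d}\rho_\lambda
\,=\, \left[\,\ln(1+\rho_\lambda) + \frac{1}{1+\rho_\lambda}\,\right]_0^\infty \rightarrow\, \infty\,.
\]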
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... concept16
For a historical review and modern developments, with implications for Artificial Intelligence applications, see Ref. [18], an influential book about which I nevertheless have reservations where the author talks about causality in Physics (I suspect he has never really read Newton, Laplace [20] or Poincaré [21], and perhaps not even Hume [16]).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... true,17
Usually we say `if $E$ has occurred', but, indeed, in probability theory one deals, in general, with `hypotheses', to which we associate a degree of belief; the states TRUE and FALSE are just the limits, mapped into $P=1$ and $P=0$, respectively.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... variables.18
Starting from $f(\lambda\,\vert\,x)$ given by Eq. ([*]) we get
\begin{eqnarray*}
f(\lambda\,\vert\,x)\,\mbox{d}\lambda \,=\, \frac{\lambda^x\cdot e^{-\lambda}}{x!}\,\mbox{d}\lambda
&=& \frac{(r\cdot T)^x\cdot e^{-r\,T}}{x!}\cdot T\,\mbox{d}r \,=\, f(r\,\vert\,x,T)\,\mbox{d}r \\
f(r\,\vert\,x,T) &=& \frac{T^{x+1}\cdot r^x\cdot e^{-T\,r}}{x!}\,,
\end{eqnarray*}
in which we recognize a Gamma pdf with $\alpha=x+1$ and $\beta=T$.
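As a quick numerical check of this change of variable (a sketch not taken from the paper; $x=5$ and $T=2$ are made-up values), one can verify in R that rescaling samples of $\lambda$ by $1/T$ reproduces the Gamma pdf with $\alpha=x+1$ and $\beta=T$:

  n = 10^5
  x.obs = 5; T.obs = 2
  lambda = rgamma(n, shape = x.obs + 1, rate = 1)   # lambda given x
  r = lambda / T.obs                                # r = lambda / T
  hist(r, breaks = 100, freq = FALSE, col = 'cyan', main = '', xlab = 'r')
  curve(dgamma(x, shape = x.obs + 1, rate = T.obs), add = TRUE, lwd = 2)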
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
....19
The real issue with the `likelihood' is not just replacing $f(x\,\vert\,r,T,I)$ by ${\cal L}(r\,;\, x,T)$ in Eq. ([*]), but rather the fact that, being a function of $r$, it is perceived as `the likelihood of $r$'. The result is that it is often (almost always) turned by practitioners into the `probability of $r$', `likelihood' and `probability' being used practically as synonyms in spoken language. It follows, for example, that the value which maximizes the likelihood function is perceived as the `most probable' value, in the light of the observations.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...,20
This is true unless the balance shows a `clear anomaly', in which case you stick to what you believed your weight should be. But you still learn something from the measurement: namely, that the balance is broken [13].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... derived.21
It is perhaps important to remind the reader that in probability theory the full result of the inference is the probability distribution (typically a pdf, for continuous quantities) of the quantity of interest, as inferred from the data, the model and all other pertinent pieces of information. Mode, mean, median, standard deviation and probability intervals are just useful numbers to summarize the distribution, which should always be reported, unless it is (to some degree of approximation) as simple as a Gaussian, so that mean and standard deviation provide the complete information. For example, the shape of a non-trivial pdf can be expressed with the coefficients of a suitable fit made in the region of interest. Or one can provide several moments of the distribution, from which the pdf can be reobtained (see e.g. Ref. [22]).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...BR,conPia,conPeppe,22
Note how at that time we wrote Eq. ([*]) in a more expanded way, but the essence of this factor is given by Eq. ([*]). For recent developments and applications see Refs. [24,25,26].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... likelihood.23
The more interesting case, originally taken into account in Refs. [17,23], is when some events are observed which could, however, also be attributed to irreducible background.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... belief.24
Remember that, as Laplace used to say, “the theory of probabilities is basically just common sense reduced to calculus”, that “all models are wrong, but some are useful” (G. Box), and that even Gauss was `sorry' because `his' error function could not be strictly true [28] (see the quote in footnote 9 of Ref. [29]).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... details,25
Another way to arrive at Eq. ([*]) is to start from Eq. ([*]), applying the transformation of variables $\rho=\rho_\lambda\cdot T_2/T_1$:
\begin{eqnarray*}
f(\rho_\lambda\,\vert\,\ldots)\,\mbox{d}\rho_\lambda &=& \frac{(x_1+x_2+1)!}{x_1!\,x_2!} \cdot
\rho_\lambda^{\,x_1}\cdot (1+\rho_\lambda)^{\,-(x_1+\,x_2+2)}\,\mbox{d}\rho_\lambda \\
&=& \frac{(x_1+x_2+1)!}{x_1!\,x_2!} \cdot
\left(\frac{T_1}{T_2}\right)^{\!x_1}\!\cdot \rho^{\,x_1}\cdot
\left(1+\frac{T_1}{T_2}\cdot \rho\right)^{\,-(x_1+\,x_2+2)}\cdot
\frac{T_1}{T_2} \,\mbox{d}\rho \\
&=& \frac{(x_1+x_2+1)!}{x_1!\,x_2!} \cdot
T_1^{x_1}\cdot T_2^{-x_1}\cdot T_2^{\,x_1+x_2+2}\cdot \rho^{\,x_1}\cdot
\left(T_2 + T_1\!\cdot\! \rho\right)^{\,-(x_1+\,x_2+2)}\cdot
\frac{T_1}{T_2} \,\mbox{d}\rho \\
&=& \frac{(x_1+x_2+1)!}{x_1!\,x_2!} \cdot
T_1^{\,x_1+1}\cdot T_2^{\,x_2+1} \cdot \rho^{\,x_1}\cdot
\left(T_2 + T_1\!\cdot\! \rho\right)^{\,-(x_1+\,x_2+2)} \,\mbox{d}\rho \\
f(\rho\,\vert\,\ldots) &=& \frac{(x_1+x_2+1)!}{x_1!\,x_2!} \cdot
T_1^{\,x_1+1}\cdot T_2^{\,x_2+1} \cdot \rho^{\,x_1}\cdot
\left(T_2 + T_1\!\cdot\! \rho\right)^{\,-(x_1+\,x_2+2)}\,.
\end{eqnarray*}
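As a cross-check (a sketch added here, with arbitrarily chosen values of $x_1$, $x_2$, $T_1$ and $T_2$), the closed-form pdf can be compared in R with a Monte Carlo sampling of $\rho=r_1/r_2$, with $r_1$ and $r_2$ Gamma distributed as derived above:

  x1 = 8; x2 = 3; T1 = 2; T2 = 1; n = 10^6
  rho.mc = rgamma(n, x1 + 1, rate = T1) / rgamma(n, x2 + 1, rate = T2)
  f.rho = function(rho) exp(lfactorial(x1 + x2 + 1) - lfactorial(x1) - lfactorial(x2)) *
            T1^(x1 + 1) * T2^(x2 + 1) * rho^x1 * (T2 + T1 * rho)^(-(x1 + x2 + 2))
  hist(rho.mc[rho.mc < 10], breaks = 200, freq = FALSE,   # tail beyond 10 is negligible here
       col = 'cyan', main = '', xlab = 'rho')
  curve(f.rho, from = 0, to = 10, add = TRUE, lwd = 2)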

.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...inferred 26
For those who have doubts about the meaning of `deduction' and `induction', Ref. [35] is highly recommended (and they will discover that Sherlock Holmes was indeed not deducing explanations).
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...,27
It is interesting to note that there is an alternative way to get Eq. ([*]), starting from the joint distribution $f(\rho_\lambda,\lambda_1,\lambda_2\,\vert\,x_1,x_2)$ and then marginalizing it. In fact, using the chain rule, we get
\begin{eqnarray*}
f(\rho_\lambda,\lambda_1,\lambda_2\,\vert\,x_1,x_2) &=& f(\rho_\lambda\,\vert\,\lambda_1,\lambda_2)\cdot
f(\lambda_1\,\vert\,x_1,x_2)\cdot f(\lambda_2\,\vert\,x_1,x_2) \\
&=& f(\rho_\lambda\,\vert\,\lambda_1,\lambda_2)\cdot
f(\lambda_1\,\vert\,x_1)\cdot f(\lambda_2\,\vert\,x_2)\,.
\end{eqnarray*}

But, being $\rho_\lambda$ deterministically related to $\lambda_1$ and $\lambda_2$, $f(\rho_\lambda\,\vert\,\lambda_1,\lambda_2)$ is nothing but $\delta(\rho_\lambda\! -\! \lambda_1/\lambda_2)$ (see also other examples in Ref. [1]). Integrating then $f(\rho_\lambda,\lambda_1,\lambda_2\,\vert\,x_1,x_2)$ over $\lambda_1$ and $\lambda_2$ we get Eq. ([*]).
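Spelled out (a step added here for clarity, using the usual property of the Dirac delta), the marginalization reads
\[
f(\rho_\lambda\,\vert\,x_1,x_2) \,=\, \int_0^\infty\!\!\int_0^\infty
\delta(\rho_\lambda - \lambda_1/\lambda_2)\cdot f(\lambda_1\,\vert\,x_1)\cdot f(\lambda_2\,\vert\,x_2)
\,\mbox{d}\lambda_1\,\mbox{d}\lambda_2
\,=\, \int_0^\infty \lambda_2\cdot f(\rho_\lambda\,\lambda_2\,\vert\,x_1)\cdot f(\lambda_2\,\vert\,x_2)\,\mbox{d}\lambda_2\,.
\]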
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... exponentially,28
Empirically, taking two points from the histogram of Fig. [*], we can evaluate the following exponential: $f(\log(r))\approx 1.16\times \exp{(-2.30\cdot\vert\log(r)\vert)}$.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... equations:29
Perhaps it is worth noticing again (see footnote [*]), since this observation seems to be made for the first time in this paper, that Eq. ([*]) can be seen not only as an extension to continuous variables of Eq. ([*]), but also as the joint pdf $f(\rho,r_1,r_2)$ obtained by the chain rule, that is $f(\rho,r_1,r_2)\! =\! f(\rho\,\vert\,r_1,r_2)\cdot f(r_1)\cdot f(r_2)$, where $f(\rho\,\vert\,r_1,r_2)\!=\!\delta(\rho\!-\!r_1/r_2)$, followed by marginalization.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
....30
If you would like to reproduce the final result, given by Eq. ([*]), with Mathematica, here are the commands to obtain it, although the output will appear a bit cryptic (but you will recognize the resulting plot):
 
  rM = 10                                   (* upper limit of the flat priors on r1 and r2 *)
  frho := Integrate[r2*DiracDelta[r1 - rho*r2]/rM^2, {r1, 0, rM}, {r2, 0, rM}]
  frho                                      (* show the closed-form result *)
  Plot[frho, {rho, 0, 5}]                   (* plot the pdf of rho = r1/r2 *)
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... as31
We can check that $P(1/10\le \rho\le 10)\!=\!9/10$, as previously guessed from symmetry arguments.
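For instance, if the pdf has the form $f(\rho)=1/2$ for $0\le\rho\le 1$ and $f(\rho)=1/(2\rho^2)$ for $\rho>1$ (an assumption consistent with the result obtained with Mathematica above), the check is immediate:
\[
P(1/10\le \rho\le 10) \,=\, \int_{1/10}^{1}\frac{1}{2}\,\mbox{d}\rho
\,+\, \int_{1}^{10}\frac{1}{2\,\rho^2}\,\mbox{d}\rho
\,=\, \frac{9}{20} + \frac{9}{20} \,=\, \frac{9}{10}\,.
\]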
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... simulations.32
Curiously, this distribution has the property that $f(1/\rho) = f(\rho)$. I wonder if there are others.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... interest.33
But when measuring rates, a flat prior has more implications than one might think, as discussed in chapter 13 of Ref. [13], and therefore a full understanding of the physical case is desirable.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
...,34
Note that, since priors are logically needed, programs of this kind require them, even if they are flat. This can be seen as an annoyance, but it is instead one of the strengths of these programs: first, they can also include non-trivial priors; second, even if one wants to use flat priors, the user is forced to reflect on the fact that priors are unavoidable, instead of following the illusion of using a prior-free method [42], which is sometimes very dangerous, unless one performs simple routine measurements characterized by a very narrow likelihood [13].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... other.35
At this point a clarification is in order. When we make fits and say, again with reference to Fig. 1 of Ref. [43], that the observations $y_i$ are independent of each other, we are referring to the fact that each $y_i$ depends only on its $\mu_{y_i}$, e.g. $y_i \sim {\cal N}(\mu_{y_i}, \sigma_Y)$, but not on the other $y_{j\ne i}$. Instead, the true values $\mu_{y_i}$ are certainly correlated, being $\mu_{y} = \mu_{y}(\mu_{x};\,\theta)$.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... which36
Indeed, it can also be found in the literature as the probability of `the number of failures before the first success occurs' (for example, my preferred vademecum of probability distributions, that is the homonymous app [31], reports both distributions).
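As a side remark (not in the original text), R's dgeom() adopts exactly this second convention:

  # dgeom(x, prob) returns P(X = x) with X = number of failures before
  # the first success, i.e. (1 - prob)^x * prob
  p = 0.2
  x = 0:5
  dgeom(x, prob = p)
  (1 - p)^x * p          # identical values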
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... occurs.37
This distribution, too, comes in two flavors, the other one describing the number of trials before the $k$-th success [31].
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... instant38
One could also think of `things' occurring at `points' of some different space. All that we are going to say in the time domain can be translated into other domains.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... details,39
It is indeed a useful exercise to derive the Erlang distribution starting from

\[
f(t\,\vert\,r,k\!=\!2)\ =\ \int_0^\infty\!\!\int_0^\infty\delta(t-t_1-t_2)\cdot
f(t_1\,\vert\,r,k\!=\!1)\cdot f(t_2\,\vert\,r,k\!=\!1)\,\mbox{d}t_1\,\mbox{d}t_2\,,
\]

and going on until the general rule is obtained.
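Carrying out the first integration (a step spelled out here for convenience) one gets
\[
f(t\,\vert\,r,k\!=\!2) \,=\, \int_0^t r\,e^{-r\,t_1}\cdot r\,e^{-r\,(t-t_1)}\,\mbox{d}t_1
\,=\, r^2\, t\, e^{-r\,t}\,,
\]
from which the general pattern $f(t\,\vert\,r,k) = r^k\,t^{\,k-1}\,e^{-r\,t}/(k-1)!$ emerges.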
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
... maximum40
Taking the log of $f(x\,\vert\,\alpha,\beta)$, we get the condition of maximum from
\[
\frac{\partial}{\partial x}\log f(x\,\vert\,\alpha,\beta) \,=\, \frac{\alpha-1}{x} - \beta \,=\, 0\,,
\]
resulting in $x=(\alpha-1)/\beta$.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.