Contents
Preface
List of Contributors
1 Fundamental Concepts
1.1 Introduction
1.2 Probability Density Functions
1.3 Theoretical Distributions
1.4 Probability
1.5 Inference and Measurement
1.6 Exercises
References
2 Parameter Estimation
2.1 Parameter Estimation in High Energy Physics: Introductory Words
2.2 Parameter Estimation: Definition and Properties
2.3 The Method of Maximum Likelihood
2.4 The Method of Least Squares
2.5 Maximum-Likelihood Fits: Unbinned, Binned, Standard and Extended Likelihood
2.6 Bayesian Parameter Estimation
2.7 Exercises
References
3 Hypothesis Testing
3.1 Basic Concepts
3.2 Choosing the Test Statistic
3.3 Choice of the Critical Region
3.4 Determining Test Statistic Distributions
3.5 p-Values
3.6 Inversion of Hypothesis Tests
3.7 Bayesian Approach to Hypothesis Testing
3.8 Goodness-of-Fit Tests
3.9 Conclusion
3.10 Exercises
References
4 Interval Estimation
4.1 Introduction
4.2 Characterisation of Interval Constructions
4.3 Frequentist Methods
4.4 Bayesian Methods
4.5 Graphical Comparison of Interval Constructions
4.6 The Role of Intervals in Search Procedures
4.7 Final Remarks and Recommendations
4.8 Exercises
References
5 Classification
5.1 Introduction to Multivariate Classification
5.2 Classification from a Statistical Perspective
5.3 Multivariate Classification Techniques
5.4 General Remarks
5.5 Dealing with Systematic Uncertainties
5.6 Exercises
References
6 Unfolding
6.1 Inverse Problems
6.2 Solution with Orthogonalisation
6.3 Regularisation Methods
6.4 The Discrete Cosine Transformation and Projection Methods
6.5 Iterative Unfolding
6.6 Unfolding Problems in Particle Physics
6.7 Programs Used for Unfolding in High Energy Physics
6.8 Exercise
References
7 Constrained Fits
7.1 Introduction
7.2 Solution by Elimination
7.3 The Method of Lagrange Multipliers
7.4 The Lagrange Multiplier Problem with Linear Constraints and Quadratic Objective Function
7.5 Iterative Solution of the Lagrange Multiplier Problem
7.6 Further Reading and Web Resources
7.7 Exercises
References
8 How to Deal with Systematic Uncertainties
8.1 Introduction
8.2 What Are Systematic Uncertainties?
8.3 Detection of Possible Systematic Uncertainties
8.4 Estimation of Systematic Uncertainties
8.5 How to Avoid Systematic Uncertainties
8.6 Conclusion
8.7 Exercise
References
9 Theory Uncertainties
9.1 Overview
9.2 Factorisation: A Cornerstone of Calculations in QCD
9.3 Power Corrections
9.4 The Final State
9.5 From Hadrons to Partons
9.6 Exercises
References
10 Statistical Methods Commonly Used in High Energy Physics
10.1 Introduction
10.2 Estimating Efficiencies
10.3 Estimating the Contributions of Processes to a Dataset: The Matrix Method
10.4 Estimating Parameters by Comparing Shapes of Distributions: The Template Method
10.5 Ensemble Tests
10.6 The Experimenter’s Role and Data Blinding
10.7 Exercises
References
11 Analysis Walk-Throughs
11.1 Introduction
11.2 Search for a Z′ Boson Decaying into Muons
11.3 Measurement
11.4 Exercises
References
12 Applications in Astronomy
12.1 Introduction
12.2 A Survey of Applications
12.3 Nested Sampling
12.4 Outlook and Conclusions
12.5 Exercises
References
The Authors
Index
Related Titles
Brock, I., Schorner-Sadenius, T. (eds.)
Physics at the Terascale
2011
ISBN: 978-3-527-41001-9
Russenschuck, S.
Field Computation for Accelerator Magnets
Analytical and Numerical Methods for Electromagnetic Design and Optimization
2010
ISBN: 978-3-527-40769-9
Halpern, P.
Collider
The Search for the World's Smallest Particles
2009
ISBN: 978-0-470-28620-3
Martin, B., Shaw, G.
Particle Physics
2008
ISBN: 978-0-470-03294-7
Griffiths, D.
Introduction to Elementary Particles
2008
ISBN: 978-3-527-40601-2
Reiser, M.
Theory and Design of Charged Particle Beams
2008
ISBN: 978-3-527-40741-5
Wangler, T.P.
RF Linear Accelerators
2008
ISBN: 978-3-527-40680-7
Padamsee, H., Knobloch, J., Hays, T.
RF Superconductivity for Accelerators
2008
ISBN: 978-3-527-40842-9
Talman, R.
Accelerator X-Ray Sources
2006
ISBN: 978-3-527-40590-9
The Editors
Dr. Olaf Behnke
DESY
Hamburg
Germany
olaf.behnke@desy.de
Dr. Kevin Kröninger
Universität Göttingen
II. Physikalisches Institut
Göttingen, Germany
kevin.kroeninger@phys.uni-goettingen.de
Dr. Gregory Schott
Karlsruher Institut für Technologie
Institut für Experimentelle Kernphysik
Karlsruhe, Germany
gregory.schott@cern.ch
Dr. Thomas Schörner-Sadenius
DESY
Hamburg, Germany
thomas.schoerner@desy.de
The Cover Picture
represents a hypothetical invariant-mass distribution. The markers with error bars represent the experimental data, the blue area the estimated background and the green regions possible signals for M = 200, M = 300 and M = 400 (in arbitrary units).
The inset shows the negative logarithm of the likelihood function used to identify a resonance in the mass spectrum.
All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
Library of Congress Card No.:
applied for
British Library Cataloguing-in-Publication Data:
A catalogue record for this book is available from the British Library.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
© 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany
All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers.
Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Print ISBN 978-3-527-41058-3
ePDF ISBN 978-3-527-65344-7
ePub ISBN 978-3-527-65343-0
mobi ISBN 978-3-527-65342-3
oBook ISBN 978-3-527-65341-6
Cover Design Grafik-Design Schulz, Fußgönheim
Preface
Statistical inference plays a crucial role in the exact sciences. In fact, many results can only be obtained with the help of sophisticated statistical methods. In our field of experimental particle physics, statistical reasoning enters into basically every step of our data analysis work.
Recent years have seen the development of many new statistical techniques and of complex software packages implementing these. Consequently, the requirements on the statistics knowledge for scientists in high energy physics have increased dramatically, as have the needs for education and documentation in this field. This book aims at contributing to this purpose. It targets a broad readership at all career levels, from students to senior researchers, and is intended to provide comprehensive and practical advice for the various statistical analysis tasks typically encountered in high energy physics. To achieve this, the book is split into 12 chapters, all written by a different expert author or team of two authors and focusing on a well-defined topic:
The next chapters elucidate the basic tools used to infer results from data:
The following chapters deal with more advanced tasks encountered frequently:
The determination of systematic uncertainties is a key task for any measurement that is often performed as the very last step of a data analysis. We feel that it is worthwhile to discuss this – often neglected – topic in two chapters:
The following three chapters complete the book:
In all chapters, care has been taken to be as practical and concrete as the material allows – for this purpose many specifically designed examples have been inserted into the text body of the chapters. A further deepening of the understanding of the book material can be achieved with the dedicated exercises at the end of all chapters. Hints and solutions to the exercises, together with some necessary software, are available from a webpage provided by the publisher. Here, we will also collect feedback, corrections and other information related to this volume; please check www.wiley.com for the details.
Many people have contributed to this book, and we would like to thank all of them. First of all, we thank the authors of the individual chapters for the high-quality material they provided.
Besides the authors, a number of people are needed to successfully conclude a book project like this one: numerous colleagues contributed by means of discussion, by providing expert advice and answers to our questions. We cannot name them all.
Katarina Brock spent many hours editing and polishing all the figures and providing a unified layout for them. Konrad Kieling from Wiley provided valuable support in typesetting the book. Vera Palmer and Ulrike Werner from Wiley provided constant support in all questions related to this book. We thank Tatsuya Nakada for his permission to use his exercise material.
Our last and very heartfelt thanks goes to our friends, partners and families who endured, over a considerable period, the very time- and also nerve-consuming genesis of this book. Without their support and tolerance this book would not exist today.
All comments, criticisms and questions you might have on the book are welcome – please send them to the authors via email:
olaf.behnke@desy.de,
kevin.kroeninger@phys.uni-goettingen.de,
thomas.schoerner@desy.de,
gregory.schott@cern.ch.
Hamburg, Göttingen, Karlsruhe
November 2012
Olaf Behnke, Kevin Kröninger, Thomas Schörner-Sadenius and Grégory Schott
List of Contributors
Roger Barlow
University of Huddersfield
Huddersfield
United Kingdom
Olaf Behnke
DESY
Hamburg
Germany
Volker Blobel
Universität Hamburg
Hamburg
Germany
Luc Demortier
The Rockefeller University
New York, New York
United States of America
Markus Diehl
DESY
Hamburg
Germany
Aart Heijboer
Nikhef
Amsterdam
Netherlands
Carsten Hensel
Universität Göttingen
II. Physikalisches Institut
Göttingen
Germany
Kevin Kröninger
Universität Göttingen
II. Physikalisches Institut
Göttingen
Germany
Benno List
DESY
Hamburg
Germany
Lorenzo Moneta
CERN
Geneva
Switzerland
Harrison B. Prosper
Florida State University
Tallahassee, Florida
United States of America
Grégory Schott
Karlsruher Institut für Technologie
Institut für Experimentelle Kernphysik
Karlsruhe
Germany
Helge Voss
Max-Planck-Institut für Kernphysik
Heidelberg
Germany
Ivo van Vulpen
Nikhef
Amsterdam
Netherlands
Rainer Wanke
Institut für Physik
Universität Mainz
Mainz
Germany
Particle physics is all about random behaviour. When two particles collide, or even when a single particle decays, we can’t predict with certainty what will happen; we can only give probabilities of the various different outcomes. Although we measure the lifetimes of unstable particles and quote them to high precision – for the τ lepton, for example, it is 0.290±0.001 ps – we cannot say exactly when a particular τ will decay: its lifetime may well be shorter or longer. Although we know the probabilities (called, in this context, branching ratios) for the different decay channels, we can’t predict how any particular τ will decay – to an electron, or a muon, or various hadrons.
Then, when particles travel through a detector system they excite electrons in random ways, in the gas molecules of a drift chamber or the valence band of semiconducting silicon, and these electrons will be collected and amplified in further random processes. Photons and phototubes are random at the most basic quantum level. The experiments with which we study the properties of the basic particles are random through and through, and a thorough knowledge of that fundamental randomness is essential for machine builders, for analysts, and for the understanding of the results they give.
It was not always like this. Classical physics was deterministic and predictable. Laplace could suggest a hypothetical demon who, aware of all the coordinates and velocities of all the particles in the Universe, could then predict all future events. But in today’s physics the demon is handicapped not only by the uncertainties of quantum mechanics – the impossibility of knowing both coordinates and velocities – but also by the greater understanding we now have of chaotic systems. For predicting the flight of cannonballs or the trajectories of comets it was assumed, as a matter of common sense, that although our imperfect information about the initial conditions gave rise to increasing inaccuracy in the predicted motion, better information would give rise to more accurate predictions, and that this process could continue without limit, getting as close as one needed (and could afford) to perfect prediction. We now know that this is not true even for some quite simple systems, such as the compound pendulum.
That is only one of the two ways that probability comes into our experiments. When a muon passes through a detector it may, with some probability, produce a signal in a drift chamber: the corresponding calculation is a prediction. Conversely a drift chamber signal may, with some probability, have been produced by a muon, or by some other particle, or just by random noise. To interpret such a signal is a process called inference. Prediction works forwards in time and inference works backwards. We use the same mathematical tool – probability – to cover both processes, and this causes occasional confusion. But the statistical processes of inference are, though less visibly dramatic, of vital concern for the analysis of experiments. Which is what this book is about.
The outcomes of random processes may be described by a variable (or variables) which can be discrete or continuous, and a discrete variable can be quantitative or qualitative. For example, when a τ lepton decays it can produce a muon, an electron, or hadrons: that’s a qualitative difference. It may produce one, three or five charged particles: that’s quantitative and discrete. The visible energy (i.e. not counting neutrinos) may be between 0 and 1777 MeV: that’s quantitative and continuous.
The probability prediction for a variable x is given by a function: we can call it f(x). If x is discrete then f(x) is itself a probability. If x is continuous then f(x) has the dimensions of the inverse of x: it is ∫ f(x)dx that is the dimensionless probability, and f(x) is called a probability density function or pdf.1) There are clearly an infinite number of different pdfs and it is often convenient to summarise the properties of a particular pdf in a few numbers.
If the variable x is quantitative then for any function g(x) one can form the average
E[g(x)] = ∫ g(x) f(x) dx    (1.1)
where the integral (for continuous x) or the sum (for discrete x) covers the whole range of possible values. This is called the expectation value. It is also sometimes written ⟨g⟩, as in quantum mechanics. It gives the mean, or average, value of g, which is not necessarily the most likely one – particularly if x is discrete.
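As a numerical illustration of (1.1) for a discrete variable (a minimal Python sketch; the Poisson pdf with mean 2.3 is a hypothetical choice), the expectation value is the pdf-weighted sum, and it need not coincide with the most likely value:

```python
import math

# Hypothetical example: a Poisson pdf with mean nu = 2.3.
nu = 2.3
def f(n):
    return math.exp(-nu) * nu**n / math.factorial(n)

# E[x] = sum over n of n*f(n) (the sum replaces the integral for discrete x).
mean = sum(n * f(n) for n in range(100))
# The most likely value (mode) is a different summary of the pdf.
mode = max(range(100), key=f)
print(mean)  # 2.3 (up to a negligible truncation of the sum)
print(mode)  # 2 -- the expectation value is not the most likely value
```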
For any pdf f(x), the integer powers of x have expectation values. These are called the (algebraic) moments and are defined as
αn = E[x^n] = ∫ x^n f(x) dx    (1.2)
The first moment, α1, is called the mean or, more properly, the arithmetic mean of the distribution; it is usually called µ and often written ⟨x⟩. It acts as a key measure of location, in cases where the variable x is distributed with some known shape about a particular point.
Conversely there are cases where the shape is what matters, and the absolute location of the distribution is of little interest. For these it is useful to use the central moments
mn = E[(x − µ)^n] = ∫ (x − µ)^n f(x) dx    (1.3)
The second central moment is also known as the variance, and its square root as the standard deviation:
V = m2 = E[(x − µ)²],    σ = √V    (1.4)
The variance is a measure of the width of a distribution. It is often easier to deal with algebraically whereas the standard deviation σ has the same dimensions as the variable x; which to use is a matter of personal choice. Broadly speaking, statisticians tend to use the variance whereas physicists tend to use the standard deviation.
The third and fourth central moments are used to build shape-describing quantities known as skew and kurtosis (or curtosis):
skew: γ = m3/σ³    (1.5)
kurtosis: κ = m4/σ⁴ − 3    (1.6)
Division by the appropriate power of σ makes these quantities dimensionless and thus independent of the scale of the distribution, as well as of its location. Any symmetric distribution has zero skew: distributions with positive skew have a tail towards higher values, and conversely negative skew distributions have a tail towards lower values. The Poisson distribution has a positive skew, the energy recorded by a calorimeter has a negative skew. A Gaussian has a kurtosis of zero – by definition, that’s why there is a ‘3’ in the formula. Distributions with positive kurtosis (which are called leptokurtic) have a wider tail than the equivalent Gaussian; more centralised or platykurtic distributions have negative kurtosis. The Breit–Wigner distribution is leptokurtic, as is Student’s t. The uniform distribution is platykurtic.
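These moment formulas are easy to check on simulated samples (a sketch using Python's standard random module; the sample sizes are arbitrary). A uniform distribution should come out platykurtic, with kurtosis −1.2, and a Gaussian should give zero for both quantities:

```python
import random

random.seed(1)

def skew_kurtosis(xs):
    # Sample estimates of the central moments m2, m3, m4 of (1.3),
    # combined into the dimensionless skew and kurtosis.
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

gauss = [random.gauss(0, 1) for _ in range(200_000)]
flat = [random.uniform(-1, 1) for _ in range(200_000)]
g_skew, g_kurt = skew_kurtosis(gauss)  # both close to 0
u_skew, u_kurt = skew_kurtosis(flat)   # kurtosis close to -1.2
```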
Suppose you have a pdf f(x, y) which is a function of two random variables, x and y. You can not only form moments for both x and y, but also for combinations, particularly the covariance
cov(x, y) = E[(x − µx)(y − µy)] = E[x y] − µx µy    (1.7)
If the joint pdf is factorisable: f(x, y) = fx(x) · fy(y), then x and y are independent, and the covariance is zero (although the converse is not necessarily true: a zero covariance is a necessary but not a sufficient condition for two variables to be independent).
A dimensionless version of the covariance is the correlation ρ:
ρ = cov(x, y)/(σx σy)    (1.8)
The magnitude of the correlation lies between 0 (uncorrelated) and 1 (completely correlated). The sign can be positive or negative: amongst a sample of students there will probably be a positive correlation between height and weight, and a negative correlation between academic performance and alcohol consumption.
If there are several (i.e. more than two) variables, x1, x2,…, xN, one can form the covariance and correlation matrices:
Vij = cov(xi, xj) = E[xi xj] − E[xi]E[xj]    (1.9)
ρij = Vij/(σi σj)    (1.10)
and Vii is just σi².
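A small sketch of (1.7)–(1.10) in Python (assumed toy model: y is x plus independent unit-Gaussian noise, so the true correlation is 1/√2 ≈ 0.71):

```python
import random

random.seed(2)
xs = [random.gauss(0, 1) for _ in range(100_000)]
ys = [x + random.gauss(0, 1) for x in xs]  # correlated with xs by construction

def cov(a, b):
    # Sample covariance: E[ab] - E[a]E[b].
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

# Covariance matrix V; the diagonal holds the variances.
V = [[cov(xs, xs), cov(xs, ys)],
     [cov(ys, xs), cov(ys, ys)]]
rho = V[0][1] / (V[0][0] ** 0.5 * V[1][1] ** 0.5)  # close to 0.71
```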
Mathematically, any pdf f(x, y) is a function of two variables x and y. They can be similar in nature, for example the energies of the two electrons produced by a converting high energy photon, or they can be different, for example the position and direction of particles undergoing scattering in material.
Often we are really interested in one parameter (say x) while the other (say y) is just a nuisance parameter. We want to discard the extra information contained in the two-dimensional function (or scatter plot). This can be done in two ways: the projection f(x)|y is obtained by choosing a particular value of y; the marginal distribution f(x) = ∫ f(x, y)dy is found by integrating over y.
Projections can be useful for illustration; otherwise, to be meaningful, you have to have a good reason for choosing that specific value of y. Marginalisation requires that the distribution in y, like that of x, is properly normalised.
There are many other properties that can be quoted, depending on the point we want to bring out, and on the established usage of the field.
The mean is not always the most helpful measure of location. The mode is the value of x at which the pdf f(x) is maximum, and if you want a typical value to quote it serves well. The median is the midway point, in the sense that half the data lie above and half below. It is useful in describing very skewed distributions (particularly financial income) in which fluctuations in a small tail would give a big change in the mean.
We can also specify dispersion in ways that are particularly useful for non-Gaussian distributions by using quantiles: the upper and lower quartiles give the values above which, and below which, 25% of the data lie. Deciles and percentiles are also used.
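For a very skewed distribution these summaries behave quite differently (a sketch with an exponential sample, whose mean is 1 but whose median is ln 2 ≈ 0.69):

```python
import random
import statistics

random.seed(3)
# An exponential distribution with mean 1: strongly skewed to the right.
xs = [random.expovariate(1.0) for _ in range(100_000)]

mean = statistics.fmean(xs)                 # pulled up by the tail, ~ 1.0
median = statistics.median(xs)              # ~ ln 2 = 0.69, robust to the tail
q1, q2, q3 = statistics.quantiles(xs, n=4)  # quartiles; q2 is the median
```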
The cumulative distribution function
F(a) = ∫ f(x) Θ(a − x) dx    (1.11)
where Θ is the Heaviside or step function (Θ(x) = 1 for x ≥ 0 and 0 otherwise), giving the probability that a variable will take a value up to a, is occasionally useful.
The characteristic function
ϕ(u) = E[e^(iux)] = ∫ e^(iux) f(x) dx    (1.12)
which is just (up to factors of 2π) the Fourier transform of the pdf, is also met with sometimes as it has useful properties.
A pdf is a mathematical function. It involves a variable (or variables) describing the random quantity concerned. This may be a discrete integer or a continuous real number. It also involves one or more parameters. In what follows we will denote a random variable by x for a real number and r for an integer. Parameters generally have their traditional symbols for particular pdfs: where we refer to a generic parameter we will call it θ. It is often helpful to write a function as f(x; θ) or f(x|θ), separating this way more clearly the random variable(s) from the adjustable parameter(s). The semicolon is preferred by some, the line has the advantage that it matches the notation used for conditional probabilities, described in Section 1.4.4.1.
There are many pdfs in use to model the results of random processes. Some are based on physical motivations, some on mathematics, and some are just empirical forms that happen to work well in particular cases.
The overwhelmingly most useful form is the Gaussian or normal distribution. The Poisson distribution is also encountered very often, and the binomial distribution is not uncommon. So we describe these in some detail, and then some other distributions rather more briefly.
The Gaussian, or normal, distribution for a continuous random variable x is given by
f(x; µ, σ) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))    (1.13)
It has two parameters; the function is manifestly symmetrical about the location parameter µ, which is the mean (and mode, and median) of the distribution. The scale parameter σ is also the standard deviation of the distribution. So there is, in a sense, only one Gaussian, the unit Gaussian or standard normal distribution f(x; 0, 1) shown in Figure 1.1. Any other Gaussian can be obtained from this by scaling by a factor σ and translating by an amount µ. The Gaussian distribution is sometimes denoted N(µ, σ²).
The Gaussian is ubiquitous (hence the name ‘normal’) because of the central limit theorem, which states that if any distribution is convoluted with itself a large number of times, the resulting distribution tends to a Gaussian form. For a proof, see for example Appendix 2 in [1].
Gaussian random numbers are much used in simulation, and a suitable random number generator is available on most systems. If it is not, then you can generate a unit Gaussian by taking two uniformly generated random numbers u1, u2, setting θ = 2πu1 and r = √(−2 ln u2); then r cos θ and r sin θ are independent samples from a unit Gaussian.
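This recipe (the Box–Muller transform) is a few lines in any language; here is a sketch in Python, checked against the unit-Gaussian mean and variance:

```python
import math
import random

random.seed(4)

def box_muller():
    # Two uniform numbers give two independent unit-Gaussian samples.
    u1 = random.random()
    u2 = 1.0 - random.random()  # in (0, 1], avoids log(0)
    theta = 2.0 * math.pi * u1
    r = math.sqrt(-2.0 * math.log(u2))
    return r * math.cos(theta), r * math.sin(theta)

samples = [z for _ in range(100_000) for z in box_muller()]
mean = sum(samples) / len(samples)                           # ~ 0
var = sum(z * z for z in samples) / len(samples) - mean**2   # ~ 1
```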
The product of two independent Gaussians gives a two-dimensional function
f(x, y) = (1/(2π σx σy)) e^(−(x−µx)²/(2σx²) − (y−µy)²/(2σy²))    (1.14)
but the most general quadratic form in the exponent must include the cross term and can be written as
f(x, y) = (1/(2π σx σy √(1−ρ²))) exp{−[(x−µx)²/σx² − 2ρ(x−µx)(y−µy)/(σx σy) + (y−µy)²/σy²]/(2(1−ρ²))}    (1.15)
where the parameter ρ is the correlation between x and y. For N variables, for which we will use the vector x, the full form of the multivariate Gaussian can be compactly written using matrix notation:
f(x) = (1/((2π)^(N/2) |V|^(1/2))) exp(−(x − µ)ᵀ V⁻¹ (x − µ)/2)    (1.16)
Here, V is the covariance matrix described in Section 1.2.2.3.
The error function and the complementary error function are closely related to the cumulative Gaussian:
erf(y) = (2/√π) ∫0^y e^(−t²) dt    (1.17)
erfc(y) = 1 − erf(y) = (2/√π) ∫y^∞ e^(−t²) dt    (1.18)
Their main use is in calculating Gaussian p-values (see Section 1.3.4.6). The probability that a Gaussian random variable will lie within one standard deviation, or ‘1 σ’, of the mean is 68%, obtained by calculating erf(1/√2). Conversely, the chance that a variable drawn from a Gaussian random process will lie outside 1 σ is 32%. Given such a process – say with a mean of 10.2 and a standard deviation of 3.1 – then if you confront a particular measurement – say 13.3 – it is quite plausible that it was produced by the process. One says that its p-value, the probability that the process would produce a measurement this far, or further, from the ideal mean, is 32%. Conversely, if the number were 25.7 rather than 13.3, that would be 5 σ rather than 1 σ, for which the p-value is only 5.7 · 10⁻⁷. In discussion of discoveries (or otherwise) of new particles and new effects this language is turned round, and a discovery with a p-value of 5.7 · 10⁻⁷ is referred to as a ‘5 σ result’2). A translation is given in Table 1.1 – although for practical purposes it is easier to use functions such as pnorm and qnorm in the programming language R [2], or TMath::Prob in ROOT [3].
Deviation | p-value (%) |
1σ | 31.7 |
2σ | 4.56 |
3σ | 0.270 |
4σ | 0.00633 |
5σ | 0.0000573 |
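Table 1.1 can be reproduced from the complementary error function alone: the two-sided p-value for an n σ deviation is erfc(n/√2). A sketch using only the Python standard library (rather than R or ROOT):

```python
import math

def p_value(n_sigma):
    # Probability for a Gaussian variable to lie more than n_sigma
    # standard deviations from the mean (two-sided).
    return math.erfc(n_sigma / math.sqrt(2.0))

for n in range(1, 6):
    print(f"{n} sigma: p = {100 * p_value(n):.4g}%")
# 1 sigma -> 31.7%, 3 sigma -> 0.27%, 5 sigma -> 5.7e-5%
```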
The Poisson distribution
P(n; ν) = e^(−ν) ν^n/n!    (1.19)
describes the probability of n events occurring when the mean expected number is ν; n is discrete and ν is continuous. Typical examples are the number of clicks produced by a Geiger counter in an interval of time, or, famously, the number of Prussian cavalrymen killed by horse-kicks [4]. Some examples are shown in Figure 1.2.
The Poisson distribution has a mean of ν and a standard deviation of √ν. This property – that the standard deviation is the square root of the mean – is a key fact about distributions generated by a Poisson process, which is important as this includes most cases where a number of samples is taken, including the contents of a bin of a histogram.
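The square-root property is easy to verify by simulation (a sketch; the sampler is Knuth's multiply-uniforms algorithm, and the mean ν = 9 is an arbitrary choice, so the standard deviation should come out near 3):

```python
import math
import random

random.seed(5)

def poisson_sample(nu):
    # Knuth's algorithm: count uniform factors until the running
    # product drops below exp(-nu).
    limit = math.exp(-nu)
    k, p = 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

xs = [poisson_sample(9.0) for _ in range(100_000)]
mean = sum(xs) / len(xs)                                    # ~ 9
std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5   # ~ sqrt(9) = 3
```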
The binomial distribution describes a generalisation of the simple problem of the numbers of heads and tails that can arise from spinning a coin several times. The probability for getting r ‘successes’ from N ‘trials’ given an intrinsic probability of success p is
P(r; N, p) = (N!/(r!(N − r)!)) p^r (1 − p)^(N−r)    (1.20)
Sometimes one writes q instead of 1 − p, which makes the algebra prettier. The distribution has a mean of Np and a standard deviation of √(Npq). The factor N!/[r!(N − r)!] is the number of ways that r objects may be chosen from N, and is often written as the binomial coefficient C(N, r).
If p is small then the distribution can be approximated by a Poisson distribution3) of mean Np. This is often used implicitly when analysing Monte Carlo samples: if you generate 1 000 000 Monte Carlo events, of which 100 end up in some particular histogram bin, then strictly speaking this is described by a binomial process rather than a Poisson. In practice you can take the error as the Poisson √100 = 10 rather than the binomial √(Np(1 − p)) = √(100 · 0.9999) ≈ 10. This doesn’t work if p is large. If 9 out of 10 events are accepted by the trigger, the error on the trigger efficiency of 90% is not √9/10 = 30% but √(Npq)/N = √(10 · 0.9 · 0.1)/10 ≈ 9.5% (in such a case the shortcut is to take the one lost event as approximately Poisson, giving the error as √1/10 = 10%, which is close).
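The trigger example can be made concrete (a sketch with the numbers from the text: 9 of N = 10 events accepted):

```python
import math

N, r = 10, 9
eff = r / N  # 90% trigger efficiency

binomial_err = math.sqrt(N * eff * (1 - eff)) / N  # ~ 0.095: the right answer
naive_poisson_err = math.sqrt(r) / N               # 0.30: wrong, p is not small
lost_event_err = math.sqrt(N - r) / N              # 0.10: the one-lost-event shortcut
```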
If N is large and p is not small then the distribution is approximately a Gaussian.
If there are not just two possible outcomes but n, with probabilities {p1, p2, …, pn}, then the total probability of getting r1 of the first outcome, r2 of the second, and so on, is
P(r1, r2, …, rn) = (N!/(r1! r2! ⋯ rn!)) p1^r1 p2^r2 ⋯ pn^rn,    N = r1 + r2 + ⋯ + rn    (1.21)
This is the multinomial distribution.
There are many, many other possible distribution functions, and it is worth listing some of those more often met with.
The uniform distribution, also known as the rectangular or top-hat distribution, is constant inside some range – call this range −a/2 to a/2, so the width is a; if the range is not central about zero but about some other value this is easily handled by a translation. The mean, clearly, is zero, and the standard deviation is a/√12. This can be used in position measurements by a hodoscope: if a rectangular slab of scintillator gives a signal, you know that a track went through it but you do not know where. It is reasonable to assume a uniform distribution for the pdf of the hit position.
This can be relevant in considering some systematic uncertainties on the total result, as is also discussed in Section 8.4.1.2. For example, if you set up an experiment to run overnight, counting events with some efficiency E1, and when you arrive in the morning you find a component has tripped so the efficiency is E2, with no information about when this happened, your efficiency has to be quoted as (E1 + E2)/2 ± |E1 − E2|/√12. It can also be applied to theoretical models: when two models give different predictions you are justified in using their mean as your prediction, with a (systematic) error which is the difference divided by √12, if (and only if) these two models represent absolute extremes and you really have no feeling as to where between the two extremes the truth may lie.
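For the overnight-run example this amounts to two lines (a sketch; the efficiencies E1 = 0.95 and E2 = 0.80 are hypothetical values):

```python
import math

E1, E2 = 0.95, 0.80  # efficiency before and after the (unknown) trip time
# Uniform pdf between the two extremes: quote mean +/- width/sqrt(12).
central = (E1 + E2) / 2
error = abs(E1 - E2) / math.sqrt(12)
print(f"efficiency = {central:.3f} +/- {error:.3f}")  # 0.875 +/- 0.043
```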
In nuclear and particle physics the function
f(E) = (1/π) (Γ/2)/((E − M)² + Γ²/4)    (1.22)
gives the variation with the energy E of a cross section produced by the formation of a state with mass M and width Γ. It can be written more neatly in dimensionless form as
f(x) = (1/π) · 1/(1 + x²)    (1.23)
where x = (E − M)/(Γ/2). The mean is clearly M. It does not have a variance: the integral ∫ x² f(x)dx is divergent. If you have to compare this curve with that of a Gaussian, the full width at half maximum (FWHM) is clearly Γ for this curve, while for a Gaussian it is 2σ√(2 ln 2) ≈ 2.35 σ.
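Two of these properties are easy to check numerically (a sketch of the dimensionless form (1.23); samples are drawn by inverting the cumulative distribution):

```python
import math
import random

random.seed(6)
# The dimensionless Breit-Wigner (Cauchy) pdf.
def f(x):
    return 1.0 / (math.pi * (1.0 + x * x))

# Half maximum at x = +/-1, i.e. E = M +/- Gamma/2: the FWHM is Gamma.
print(f(1.0), f(0.0) / 2)  # equal

# The divergent variance in practice: the sample variance never settles.
def sample():
    return math.tan(math.pi * (random.random() - 0.5))

for n in (10**3, 10**4, 10**5):
    xs = [sample() for _ in range(n)]
    print(n, sum(x * x for x in xs) / n)  # jumps around, does not converge
```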
This distribution is used in fitting resonance peaks (provided the width is much larger than the measurement error on E). It also has an empirical use in fitting a set of data which is almost Gaussian but has wider tails. This often arises in cases where a fraction of the data is not so well measured as the rest. A double Gaussian may give a good fit, but it often turns out that this form does an adequate job without the need to invoke extra parameters.
When a charged particle passes an atom, its electrons experience a changing electromagnetic field and acquire energy. The amount of energy may be large; on rare occasions it will be large enough to create a delta ray. The probability distribution for the energy loss was computed by Landau [5] and is given by
f(λ) = (1/π) ∫0^∞ e^(−t ln t − λt) sin(πt) dt    (1.24)
where λ = (Δ − Δ0)/ξ. Here, Δ is the actual energy loss, Δ0 is a location parameter, and ξ is a scale, exact values for which depend on the material. This distribution has a peak at Δ0, cuts off quickly below that, and has a very long positive tail. The function is shown in Figure 1.3.
The Landau distribution has very unpleasant mathematical properties. Some of its integrals diverge, for example it has no variance (like the Cauchy distribution), and, worse than that, it does not even have a mean. The ensuing complications can be avoided on a case-by-case basis by imposing an upper limit on the energy loss, as a particle cannot lose more than 100% of its energy.
There is a function which is described in some places as ‘the Landau distribution’. It is not. It is an approximation to the Landau distribution [6], and not a very good one at that.
This considers the familiar binomial, but with a twist. As before, some process has a random probability p of success and q = 1 – p of failure, and is repeated for many trials. But now instead of asking the probability of r successes from a fixed number of trials n, we ask for the probability of r successes before encountering a fixed number k of failures. This is given by
P(r) = ((r + k − 1)!/(r!(k − 1)!)) p^r q^k    (1.25)
It is the probability for r successes and k − 1 failures in any permutation, followed by a final kth failure. The combinatorial factor can also be written (−1)^r C(−k, r), hence the name ‘negative binomial’. This can readily be extended to non-integer values by writing it as
P(r) = (Γ(r + k)/(Γ(k) r!)) p^r q^k    (1.26)
although it is not clear what physical meaning this may have. Γ is the Gamma function, defined as
Γ(z) = ∫0^∞ t^(z−1) e^(−t) dt    (1.27)
The negative binomial distribution has a mean µ = (p/q)k and a variance V = (p/q2)k. The negative binomial approaches the Poisson as k becomes large and p small with constant pk = µ.
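The quoted Poisson limit can be checked directly by comparing the two probability functions at fixed µ = pk (a sketch; µ = 2 is an arbitrary choice):

```python
import math

def neg_binomial(r, k, p):
    # Probability of r successes before the k-th failure, as in (1.25).
    q = 1.0 - p
    return math.comb(r + k - 1, r) * p**r * q**k

def poisson(r, mu):
    return math.exp(-mu) * mu**r / math.factorial(r)

mu = 2.0
diffs = []
for k in (10, 100, 1000):
    p = mu / k  # keep p*k = mu fixed while k grows and p shrinks
    diffs.append(max(abs(neg_binomial(r, k, p) - poisson(r, mu))
                     for r in range(20)))
print(diffs)  # shrinks towards zero as k grows
```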
If you take a sample of n values, {x1,…, xn}, from a Gaussian and histogram their differences from the true mean, divided by the standard deviation (a quantity often called the pull), then this gives a unit Gaussian, that is a Gaussian with µ = 0, σ = 1, which can be a useful check that you have your errors right. If, as often happens, the true mean is unknown, then the spread about the measured mean is slightly smaller than 1, by a factor √((n − 1)/n).
If the standard deviation σ is also unknown, then you can use instead the estimated if µ is known or if it is not. Now, for small n especially, this is not a very good estimator, and because you are dividing the differences from the mean by this bad estimate, the distribution for
t = \frac{\bar{x} - \mu}{\hat{\sigma}/\sqrt{n}}    (1.28)
is not given by a Gaussian, but by Student’s t distribution for n – 1 degrees of freedom, where Student’s t distribution is given by
f(t; n) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\;\Gamma\!\left(\frac{n}{2}\right)} \left(1 + \frac{t^2}{n}\right)^{-(n+1)/2} .    (1.29)
This tends to a unit Gaussian as n becomes large, but for small n it has tails which are significantly wider (see Figure 1.4): large t values can result if \hat{\sigma} is an underestimate of the true value. The mean is clearly zero; the variance is not one, as it would be for a unit Gaussian, but n/(n − 2).
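These properties are easy to verify numerically. The sketch below implements the density of Eq. (1.29), checks its normalisation and that its variance is n/(n − 2) by crude numerical integration, and compares its far tail with the unit Gaussian (the choice n = 5 and the grid parameters are illustrative):

```python
from math import gamma, sqrt, pi, exp

def student_t_pdf(t, n):
    # Student's t density for n degrees of freedom, Eq. (1.29)
    return (gamma((n + 1) / 2) / (sqrt(n * pi) * gamma(n / 2))
            * (1 + t * t / n) ** (-(n + 1) / 2))

def gauss_pdf(t):
    # unit Gaussian density
    return exp(-t * t / 2) / sqrt(2 * pi)

def integrate(f, a=-200.0, b=200.0, steps=400_000):
    # crude midpoint rule, good enough for a sanity check
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

n = 5
norm = integrate(lambda t: student_t_pdf(t, n))         # ~ 1
var = integrate(lambda t: t * t * student_t_pdf(t, n))  # ~ n/(n-2)
tail_ratio = student_t_pdf(4.0, n) / gauss_pdf(4.0)     # tail much heavier than Gaussian
```

At t = 4 the t density for n = 5 exceeds the unit Gaussian by more than an order of magnitude, which is exactly the heavy-tail behaviour described above.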
In describing the agreement between a predictive function g(x) and a set of n measurements {(xi, yi)}, it is useful to form the total squared deviation
\chi^2 = \sum_{i=1}^{n} \left( \frac{y_i - g(x_i)}{\sigma_i} \right)^2 ,    (1.30)
where σi is the Gaussian error on measurement i: if these errors are the same for all measurements then the factor 1/σ² can, of course, be taken outside the summation.
Each term will clearly contribute an amount of order one to the sum, and it is no surprise that the expectation value of χ² is n. The distribution f(χ²; n) is given by
f(\chi^2; n) = \frac{(\chi^2)^{n/2 - 1}\, e^{-\chi^2/2}}{2^{n/2}\, \Gamma(n/2)} .    (1.31)
Some examples for different n are shown in Figure 1.5.
The χ2 distribution is used a great deal in considering the question of whether a particular set of measurements (with their errors) and a particular model are compatible. This is addressed through the cumulative χ2 distribution. For a given value of χ2, the complement of the cumulative distribution gives the p-value, the probability that, given that the model is indeed correct, a measurement would give a result with a χ2 this large, or larger. If the value of χ2 obtained is large compared to n then the p-value is small, that is the probability that a set of measurements truly described by this model would give such a large disagreement is small, and doubt is cast on the model, or the data (or both). The mean of f(χ²; n) is just n, and the standard deviation is \sqrt{2n}. For large n the distribution converges to the Gaussian, as it must by the central limit theorem. However, the convergence is actually rather slow, and this approximation is not often used. Instead the p-value should be obtained accurately from functions such as TMath::Prob in ROOT or pchisq in R.
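If such library functions are not to hand, the p-value can also be obtained directly from Eq. (1.31). The sketch below does this by crude numerical integration; the case n = 2, where the p-value is exactly exp(−χ²/2), provides a closed-form cross-check (the test values χ² = 6 and χ² = 11.07 are illustrative):

```python
from math import gamma, exp

def chi2_pdf(x, n):
    # chi-square density for n degrees of freedom, Eq. (1.31)
    return x ** (n / 2 - 1) * exp(-x / 2) / (2 ** (n / 2) * gamma(n / 2))

def p_value(chi2, n, upper=200.0, steps=200_000):
    # probability of a chi-square this large or larger, by midpoint integration
    h = (upper - chi2) / steps
    return sum(chi2_pdf(chi2 + (i + 0.5) * h, n) for i in range(steps)) * h

# for n = 2 degrees of freedom the p-value is exactly exp(-chi2/2)
pv = p_value(6.0, 2)   # close to exp(-3) ~ 0.0498
```

As a familiar benchmark, χ² = 11.07 for n = 5 degrees of freedom corresponds to a p-value of about 0.05.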
If the model has free parameters θ which are not given, but were found by fitting the data, then the same χ2 test can be used, but for n one takes the number of data points minus the number of fitted parameters. This is called the number of degrees of freedom. Strictly speaking this is only true if the model is a linear one (i.e. linear in the parameters). This is often the case, either exactly or to a good approximation, but there are some instances where this condition does not hold, leading to the computation of deceptively small and inaccurate p-values.
You will occasionally obtain χ² values that seem very small: χ² ≪ n. There is no standard procedure for rejecting these, but you should treat them with some suspicion and consider whether the model may have been formulated after the data had been measured (‘retrospective prediction’), or whether perhaps the errors have been over-generously estimated.
If the logarithm of the variable is given by a Gaussian distribution f(ln x; µ, σ) then the distribution for x itself is the log-normal distribution
f(x; \mu, \sigma) = \frac{1}{x\,\sigma\sqrt{2\pi}} \exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right) .    (1.32)
Just as the central limit theorem dictates that any variable which is the sum of a large number of random components is described by a Gaussian distribution, any variable which is the product of a large number of random factors, none of which dominates the behaviour, is described by the log-normal. For instance, the signal registered by an electron in a calorimeter may be described by a log-normal distribution, as a certain fraction of the energy may be lost to dead material, a fraction to lost photons, a fraction to neutron production, and so on. The mean is given by e^{\mu + \sigma^2/2}, and the standard deviation is e^{\mu + \sigma^2/2}\sqrt{e^{\sigma^2} - 1}.
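The mean and standard deviation formulae can be checked against a Monte Carlo sample; the Python standard library samples a log-normal directly. A minimal sketch, with µ = 0.5 and σ = 0.3 as illustrative values:

```python
import random
from math import exp, sqrt

random.seed(1)
mu, sigma = 0.5, 0.3

# sample x such that ln x is Gaussian with mean mu and width sigma
xs = [random.lognormvariate(mu, sigma) for _ in range(200_000)]

sample_mean = sum(xs) / len(xs)
sample_sd = sqrt(sum((x - sample_mean) ** 2 for x in xs) / (len(xs) - 1))

mean_formula = exp(mu + sigma ** 2 / 2)                        # e^(mu + sigma^2/2)
sd_formula = mean_formula * sqrt(exp(sigma ** 2) - 1)          # e^(mu+sigma^2/2) * sqrt(e^sigma^2 - 1)
```

Note that the mean is larger than e^µ: the distribution is skewed to the right, so the mean exceeds the median.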
The Weibull distribution is:
f(x; \alpha, \beta) = \alpha\beta\,(\alpha x)^{\beta - 1}\, e^{-(\alpha x)^\beta} .    (1.33)
This gives a shape which rises from zero to a peak and then falls back to zero again. It was originally invented to describe the failure rates in aging light bulbs. There are few failures at small times (because the bulbs are new and fresh) or at long times (because by then they have nearly all failed). It is a rather more realistic modelling of real-life ‘lifetime’ than the simple exponential decay law, for which the failure probability is constant.
The parameter α is just a scale factor and β controls the shape. The case β = 1 corresponds to the simple exponential decay law, whereas β > 1 describes the behaviour when the failure probability increases with age, and gives successively sharper peaks. A case where the failure probability falls with time (perhaps because of initial burn-in) is described by β < 1. Examples are shown in Figure 1.6. The mean is (1/α) Γ[1 + (1/β)] and the variance is 1/α2 {Γ [1 + (2/β)] – Γ [1 + (1/β)]2}. A location parameter x0 may also be needed in some problems, replacing x by x – x0.
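The normalisation, mean and variance quoted above can be confirmed numerically from Eq. (1.33); the sketch below does so by midpoint integration (α = 2, β = 3 are illustrative choices):

```python
from math import gamma, exp

alpha, beta = 2.0, 3.0

def weibull_pdf(x):
    # Weibull density, Eq. (1.33), with scale alpha and shape beta
    return alpha * beta * (alpha * x) ** (beta - 1) * exp(-(alpha * x) ** beta)

def integrate(f, a=0.0, b=5.0, steps=100_000):
    # crude midpoint rule; the density is negligible beyond x = 5 for these parameters
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

norm = integrate(weibull_pdf)                    # ~ 1
mean = integrate(lambda x: x * weibull_pdf(x))   # ~ (1/alpha) * Gamma(1 + 1/beta)
second = integrate(lambda x: x * x * weibull_pdf(x))
var = second - mean ** 2                         # ~ (1/alpha^2) * {Gamma(1+2/beta) - Gamma(1+1/beta)^2}
```

Beware that software libraries use differing Weibull parametrisations (often with the scale appearing as 1/α), so the correspondence with Eq. (1.33) should always be checked.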
We use probability every day, in both our work as physicists and our everyday lives. Sometimes this is a matter of precise calculation, when we buy an insurance policy or decide whether to publish a result, sometimes it is more intuitive, as when we decide to take an umbrella to work in the morning.
But although we are familiar with the concept of probability, on closer inspection it turns out that there are subtleties. When we get into technicalities there turn out to be different definitions of the concept which are not always compatible.
Let A be an event. Then the probability P(A) is a number obeying three conditions, the Kolmogorov axioms [7]:
1. P(A) ≥ 0 for every event A;
2. P(Ω) = 1, where Ω is the set of all possible outcomes;
3. P(A ∪ B) = P(A) + P(B) for any two mutually exclusive events A and B.
From these axioms a whole system of theorems and properties can be derived. However, the theory contains no statement as to what the numbers actually mean. For mathematicians this is, of course, not a problem, but it does not help us to apply the results.
The probability of a coin landing heads or tails is clearly 1/2. Symmetry dictates that it cannot be anything else. Likewise the chance of drawing a particular card from a pack has to be 1/52. The original development of probability by Laplace, Pascal and their contemporaries, to aid the gambling fraternity, was founded on this equally likely construction. ‘Probability’ could be defined by taking fundamental symmetry where all cases were equally likely (say, the six sides on a dice), and extended to more complex cases (say, rolling two dice) by counting combinations.
Unfortunately this definition does not generalise to cases of continuous variables, where there is no fundamental symmetry. If you ‘draw a line at random’ from a given point, this could be done by taking the coordinates of the endpoint from a uniform distribution, or by drawing an angle uniformly between 0° and 360°; the results are incompatibly different. This approach thus leads to a dead end.
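The ambiguity is easy to demonstrate by simulation. In the sketch below (an illustrative construction of my own, not taken from the text) a line is drawn from the origin either by choosing the endpoint uniformly in a rectangle or by choosing the direction uniformly; the probability that the line makes an angle below 45° differs between the two prescriptions:

```python
import random
from math import atan2, pi

random.seed(0)
N = 100_000

# prescription 1: endpoint uniform in the rectangle [0,1] x [0,2]
frac_endpoint = sum(
    atan2(random.uniform(0, 2), random.uniform(0, 1)) < pi / 4
    for _ in range(N)) / N          # -> 0.25 for this rectangle

# prescription 2: direction uniform between 0 and 90 degrees
frac_angle = sum(random.uniform(0, pi / 2) < pi / 4 for _ in range(N)) / N  # -> 0.5
```

Both prescriptions sound equally ‘random’, yet they assign different probabilities to the same event, which is exactly why the classical definition breaks down for continuous variables.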
Problems with the classical definition led to the alternative definition of probability as the limit of frequency by Venn, von Mises [8] and others. If a selection is made N times under identical circumstances, then the fraction of cases resulting in a particular outcome A tends to a limit, and this limit is what is meant by the probability:
P(A) = \lim_{N \to \infty} \frac{N(A)}{N} ,    (1.34)
where N(A) is the number of times outcome A occurs in the N trials.
This is the generally adopted definition, taught in most elementary courses and textbooks. It satisfies, of course, the Kolmogorov axioms.
Where the classical definition is valid it leads to the same results. But there is an important philosophical difference. The probability P(A) is not some intrinsic property of A, it also depends on the way the sampling is done: on how the collective or ensemble of total possible outcomes has been constructed.
Thus, to use von Mises’ example: the life insurance companies determine that the probability of one of their (male) clients dying between the ages of 40 and 41 is 1.1%. This is a hard and verifiable number, essential for the correct adjustment of the premium paid. However, it is not an intrinsic probability of the person concerned: you cannot say that a particular client has this number attached to them as a property in the same way that their height and weight are. The client belongs not just to this ensemble (insured 40-yr-old males) but to many others: 40-yr-old males, non-smoking 40-yr-old males, non-smoking professional lion tamers – and for each of these ensembles there will be a different number.
So there are cases with several possible ensembles, and the value of P(A) is ambiguous until the ensemble is specified. There are also cases where there is no ensemble, as the event is unique. The Big Bang is an obvious example, but others can be found much nearer home. For example, what is the probability P(rain) that it will rain tomorrow? Now, there is only one tomorrow, and it will either rain or it will not, so P(rain) is either 0 or 1. Von Mises condemns any further discussion as ‘unscientific’ use of language. This is further discussed (and resolved) in Section 1.5.2.
Another way of extending the unsatisfactory classical definition of probability was made by de Finetti [9] and others. De Finetti’s starting point is the provocative ‘Probability does not exist.’ It has no objective status: it is something the human mind has constructed.
He shows that one can consistently define a personal probability (or degree-of-belief) P(A) in A by establishing the odds of a bet whereby you lose €1 if A subsequently turns out to be false, and you receive €G if it turns out to be true. If P(A) > 1/(1 + G) you will accept the bet; if P(A) < 1/(1 + G) you will decline it.
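The break-even point follows from a simple expected-value argument: staking 1 against a prize of G, the expected gain P·G − (1 − P) vanishes exactly at P = 1/(1 + G). A minimal sketch:

```python
def break_even(G):
    # personal probability at which a stake of 1 against a prize of G is a fair bet
    return 1.0 / (1.0 + G)

P = break_even(3.0)               # 0.25: accept the bet only if your P(A) exceeds this
gain_at_break_even = P * 3.0 - (1 - P)   # expected gain, zero at the break-even point
```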
Such personal probability is indeed something we use every day: when you decide whether or not to take an umbrella to work in the morning your decision is based on your personal probability of there being rain (and also the ‘costs’ involved in (a) getting wet and (b) having something extra to carry). However, there is no need for my personal probability to be the same as yours, or anyone else’s. It is thus often referred to as a subjective probability. Subjective probability is also generally known as Bayesian probability, because of the great use it makes of Bayes’ theorem [10]. This is a simple and fundamental result which is actually valid for any of the probability definitions being used.
Suppose A and B are two events, and introduce the conditional probability P(A | B), the probability of event A given that B is true (for instance: the probability that a card is the six of spades, given that it is black, P(six of spades|black) is 1/26).
The probability of both A and B occurring, P(A ∩ B), is clearly P(A | B)P(B). But it is also P(B | A)P(A). Equating these two expressions gives
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} .    (1.35)
This is used in problems like the famous ‘taxi colour’ example.
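The card example from above makes a compact check of Eq. (1.35). The sketch below computes P(six of spades | black) both via Bayes’ theorem and by brute-force enumeration of the deck:

```python
from fractions import Fraction

# Bayes' theorem, Eq. (1.35): P(six of spades | black)
p_black = Fraction(26, 52)
p_sos = Fraction(1, 52)            # prior: any single card
p_black_given_sos = Fraction(1)    # the six of spades is certainly black
p_sos_given_black = p_black_given_sos * p_sos / p_black   # = 1/26

# brute-force check by enumerating the deck
suits = ["spades", "clubs", "hearts", "diamonds"]
deck = [(rank, suit) for rank in range(1, 14) for suit in suits]
black = [card for card in deck if card[1] in ("spades", "clubs")]
frac = Fraction(sum(card == (6, "spades") for card in black), len(black))
```

Both routes give 1/26, as quoted in the text.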