Contents
Preface
List of Contributors
1 Fundamental Concepts
1.1 Introduction
1.2 Probability Density Functions
1.3 Theoretical Distributions
1.4 Probability
1.5 Inference and Measurement
1.6 Exercises
References
2 Parameter Estimation
2.1 Parameter Estimation in High Energy Physics: Introductory Words
2.2 Parameter Estimation: Definition and Properties
2.3 The Method of Maximum Likelihood
2.4 The Method of Least Squares
2.5 Maximum-Likelihood Fits: Unbinned, Binned, Standard and Extended Likelihood
2.6 Bayesian Parameter Estimation
2.7 Exercises
References
3 Hypothesis Testing
3.1 Basic Concepts
3.2 Choosing the Test Statistic
3.3 Choice of the Critical Region
3.4 Determining Test Statistic Distributions
3.5 p-Values
3.6 Inversion of Hypothesis Tests
3.7 Bayesian Approach to Hypothesis Testing
3.8 Goodness-of-Fit Tests
3.9 Conclusion
3.10 Exercises
References
4 Interval Estimation
4.1 Introduction
4.2 Characterisation of Interval Constructions
4.3 Frequentist Methods
4.4 Bayesian Methods
4.5 Graphical Comparison of Interval Constructions
4.6 The Role of Intervals in Search Procedures
4.7 Final Remarks and Recommendations
4.8 Exercises
References
5 Classification
5.1 Introduction to Multivariate Classification
5.2 Classification from a Statistical Perspective
5.3 Multivariate Classification Techniques
5.4 General Remarks
5.5 Dealing with Systematic Uncertainties
5.6 Exercises
References
6 Unfolding
6.1 Inverse Problems
6.2 Solution with Orthogonalisation
6.3 Regularisation Methods
6.4 The Discrete Cosine Transformation and Projection Methods
6.5 Iterative Unfolding
6.6 Unfolding Problems in Particle Physics
6.7 Programs Used for Unfolding in High Energy Physics
6.8 Exercise
References
7 Constrained Fits
7.1 Introduction
7.2 Solution by Elimination
7.3 The Method of Lagrange Multipliers
7.4 The Lagrange Multiplier Problem with Linear Constraints and Quadratic Objective Function
7.5 Iterative Solution of the Lagrange Multiplier Problem
7.6 Further Reading and Web Resources
7.7 Exercises
References
8 How to Deal with Systematic Uncertainties
8.1 Introduction
8.2 What Are Systematic Uncertainties?
8.3 Detection of Possible Systematic Uncertainties
8.4 Estimation of Systematic Uncertainties
8.5 How to Avoid Systematic Uncertainties
8.6 Conclusion
8.7 Exercise
References
9 Theory Uncertainties
9.1 Overview
9.2 Factorisation: A Cornerstone of Calculations in QCD
9.3 Power Corrections
9.4 The Final State
9.5 From Hadrons to Partons
9.6 Exercises
References
10 Statistical Methods Commonly Used in High Energy Physics
10.1 Introduction
10.2 Estimating Efficiencies
10.3 Estimating the Contributions of Processes to a Dataset: The Matrix Method
10.4 Estimating Parameters by Comparing Shapes of Distributions: The Template Method
10.5 Ensemble Tests
10.6 The Experimenter’s Role and Data Blinding
10.7 Exercises
References
11 Analysis Walk-Throughs
11.1 Introduction
11.2 Search for a Z′ Boson Decaying into Muons
11.3 Measurement
11.4 Exercises
References
12 Applications in Astronomy
12.1 Introduction
12.2 A Survey of Applications
12.3 Nested Sampling
12.4 Outlook and Conclusions
12.5 Exercises
References
The Authors
Index
Related Titles
Brock, I., Schorner-Sadenius, T. (eds.)
Physics at the Terascale
2011
ISBN: 978-3-527-41001-9
Russenschuck, S.
Field Computation for Accelerator Magnets
Analytical and Numerical Methods for Electromagnetic Design and Optimization
2010
ISBN: 978-3-527-40769-9
Halpern, P.
Collider
The Search for the World's Smallest Particles
2009
ISBN: 978-0-470-28620-3
Martin, B., Shaw, G.
Particle Physics
2008
ISBN: 978-0-470-03294-7
Griffiths, D.
Introduction to Elementary Particles
2008
ISBN: 978-3-527-40601-2
Reiser, M.
Theory and Design of Charged Particle Beams
2008
ISBN: 978-3-527-40741-5
Wangler, T.P.
RF Linear Accelerators
2008
ISBN: 978-3-527-40680-7
Padamsee, H., Knobloch, J., Hays, T.
RF Superconductivity for Accelerators
2008
ISBN: 978-3-527-40842-9
Talman, R.
Accelerator X-Ray Sources
2006
ISBN: 978-3-527-40590-9
The Editors
Dr. Olaf Behnke
DESY
Hamburg
Germany
olaf.behnke@desy.de
Dr. Kevin Kröninger
Universität Göttingen
II. Physikalisches Institut
Göttingen, Germany
kevin.kroeninger@phys.uni-goettingen.de
Dr. Gregory Schott
Karlsruher Institut für Technologie
Institut für Experimentelle Kernphysik
Karlsruhe, Germany
gregory.schott@cern.ch
Dr. Thomas Schörner-Sadenius
DESY
Hamburg, Germany
thomas.schoerner@desy.de
The Cover Picture
represents a hypothetical invariant-mass distribution. The markers with error bars represent the experimental data, the blue area the estimated background and the green regions possible signals for M = 200, M = 300 and M = 400 (in arbitrary units).
The inset shows the negative logarithm of the likelihood function used to identify a resonance in the mass spectrum.
All books published by Wiley-VCH are carefully produced. Nevertheless, authors, editors, and publisher do not warrant the information contained in these books, including this book, to be free of errors. Readers are advised to keep in mind that statements, data, illustrations, procedural details or other items may inadvertently be inaccurate.
Library of Congress Card No.:
applied for
British Library Cataloguing-in-Publication Data:
A catalogue record for this book is available from the British Library.
Bibliographic information published by the Deutsche Nationalbibliothek
The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
© 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Boschstr. 12, 69469 Weinheim, Germany
All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form – by photoprinting, microfilm, or any other means – nor transmitted or translated into a machine language without written permission from the publishers.
Registered names, trademarks, etc. used in this book, even when not specifically marked as such, are not to be considered unprotected by law.
Print ISBN 978-3-527-41058-3
ePDF ISBN 978-3-527-65344-7
ePub ISBN 978-3-527-65343-0
mobi ISBN 978-3-527-65342-3
oBook ISBN 978-3-527-65341-6
Cover Design Grafik-Design Schulz, Fußgönheim
Preface
Statistical inference plays a crucial role in the exact sciences. In fact, many results can only be obtained with the help of sophisticated statistical methods. In our field of experimental particle physics, statistical reasoning enters into basically every step of our data analysis work.
Recent years have seen the development of many new statistical techniques and of complex software packages implementing these. Consequently, the requirements on the statistics knowledge for scientists in high energy physics have increased dramatically, as have the needs for education and documentation in this field. This book aims at contributing to this purpose. It targets a broad readership at all career levels, from students to senior researchers, and is intended to provide comprehensive and practical advice for the various statistical analysis tasks typically encountered in high energy physics. To achieve this, the book is split into 12 chapters, all written by a different expert author or team of two authors and focusing on a well-defined topic:
The next chapters elucidate the basic tools used to infer results from data:
The following chapters deal with more advanced tasks encountered frequently:
The determination of systematic uncertainties is a key task for any measurement that is often performed as the very last step of a data analysis. We feel that it is worthwhile to discuss this – often neglected – topic in two chapters:
The following three chapters complete the book:
In all chapters, care has been taken to be as practical and concrete as the material allows – for this purpose many specifically designed examples have been inserted into the text body of the chapters. A further deepening of the understanding of the book material can be achieved with the dedicated exercises at the end of all chapters. Hints and solutions to the exercises, together with some necessary software, are available from a webpage provided by the publisher. Here, we will also collect feedback, corrections and other information related to this volume; please check www.wiley.com for the details.
Many people have contributed to this book, and we would like to thank all of them. First of all, we thank the authors of the individual chapters for the high-quality material they provided.
Besides the authors, a number of people are needed to successfully conclude a book project like this one: numerous colleagues contributed by means of discussion, by providing expert advice and answers to our questions. We cannot name them all.
Katarina Brock spent many hours editing and polishing all the figures and providing a unified layout for them. Konrad Kieling from Wiley provided valuable support in typesetting the book. Vera Palmer and Ulrike Werner from Wiley provided constant support in all questions related to this book. We thank Tatsuya Nakada for his permission to use his exercise material.
Our last and very heartfelt thanks goes to our friends, partners and families who endured, over a considerable period, the very time- and also nerve-consuming genesis of this book. Without their support and tolerance this book would not exist today.
All comments, criticisms and questions you might have on the book are welcome – please send them to the authors via email:
olaf.behnke@desy.de,
kevin.kroeninger@phys.uni-goettingen.de,
thomas.schoerner@desy.de,
gregory.schott@cern.ch.
Hamburg, Göttingen, Karlsruhe
November 2012
Olaf Behnke, Kevin Kröninger, Thomas Schörner-Sadenius and Grégory Schott
List of Contributors
Roger Barlow
University of Huddersfield
Huddersfield
United Kingdom
Olaf Behnke
DESY
Hamburg
Germany
Volker Blobel
Universität Hamburg
Hamburg
Germany
Luc Demortier
The Rockefeller University
New York, New York
United States of America
Markus Diehl
DESY
Hamburg
Germany
Aart Heijboer
Nikhef
Amsterdam
Netherlands
Carsten Hensel
Universität Göttingen
II. Physikalisches Institut
Göttingen
Germany
Kevin Kröninger
Universität Göttingen
II. Physikalisches Institut
Göttingen
Germany
Benno List
DESY
Hamburg
Germany
Lorenzo Moneta
CERN
Geneva
Switzerland
Harrison B. Prosper
Florida State University
Tallahassee, Florida
United States of America
Grégory Schott
Karlsruher Institut für Technologie
Institut für Experimentelle Kernphysik
Karlsruhe
Germany
Helge Voss
Max-Planck-Institut für Kernphysik
Heidelberg
Germany
Ivo van Vulpen
Nikhef
Amsterdam
Netherlands
Rainer Wanke
Institut für Physik
Universität Mainz
Mainz
Germany
Particle physics is all about random behaviour. When two particles collide, or even when a single particle decays, we can’t predict with certainty what will happen; we can only give probabilities of the various different outcomes. Although we measure the lifetimes of unstable particles and quote them to high precision – for the τ lepton, for example, it is 0.290±0.001 ps – we cannot say exactly when a particular τ will decay: its lifetime may well be shorter or longer. Although we know the probabilities (called, in this context, branching ratios) for the different decay channels, we can’t predict how any particular τ will decay – to an electron, or a muon, or various hadrons.
Then, when particles travel through a detector system they excite electrons in random ways, in the gas molecules of a drift chamber or the valence band of semiconducting silicon, and these electrons will be collected and amplified in further random processes. Photons and phototubes are random at the most basic quantum level. The experiments with which we study the properties of the basic particles are random through and through, and a thorough knowledge of that fundamental randomness is essential for machine builders, for analysts, and for the understanding of the results they give.
It was not always like this. Classical physics was deterministic and predictable. Laplace could suggest a hypothetical demon who, aware of all the coordinates and velocities of all the particles in the Universe, could then predict all future events. But in today’s physics the demon is handicapped not only by the uncertainties of quantum mechanics – the impossibility of knowing both coordinates and velocities – but also by the greater understanding we now have of chaotic systems. For predicting the flight of cannonballs or the trajectories of comets it was assumed, as a matter of common sense, that although our imperfect information about the initial conditions gave rise to increasing inaccuracy in the predicted motion, better information would give rise to more accurate predictions, and that this process could continue without limit, getting as close as one needed (and could afford) to perfect prediction. We now know that this is not true even for some quite simple systems, such as the compound pendulum.
That is only one of the two ways that probability comes into our experiments. When a muon passes through a detector it may, with some probability, produce a signal in a drift chamber: the corresponding calculation is a prediction. Conversely a drift chamber signal may, with some probability, have been produced by a muon, or by some other particle, or just by random noise. To interpret such a signal is a process called inference. Prediction works forwards in time and inference works backwards. We use the same mathematical tool – probability – to cover both processes, and this causes occasional confusion. But the statistical processes of inference are, though less visibly dramatic, of vital concern for the analysis of experiments. Which is what this book is about.
The outcomes of random processes may be described by a variable (or variables) which can be discrete or continuous, and a discrete variable can be quantitative or qualitative. For example, when a τ lepton decays it can produce a muon, an electron, or hadrons: that’s a qualitative difference. It may produce one, three or five charged particles: that’s quantitative and discrete. The visible energy (i.e. not counting neutrinos) may be between 0 and 1777 MeV: that’s quantitative and continuous.
The probability prediction for a variable x is given by a function: we can call it f(x). If x is discrete then f(x) is itself a probability. If x is continuous then f(x) has the dimensions of the inverse of x: it is ∫ f(x)dx that is the dimensionless probability, and f(x) is called a probability density function or pdf.1) There are clearly an infinite number of different pdfs and it is often convenient to summarise the properties of a particular pdf in a few numbers.
If the variable x is quantitative then for any function g(x) one can form the average
E[g(x)] = ∫ g(x) f(x) dx    (1.1)
where the integral (for continuous x) or the sum (for discrete x) covers the whole range of possible values. This is called the expectation value. It is also sometimes written ⟨g⟩, as in quantum mechanics. It gives the mean, or average, value of g, which is not necessarily the most likely one – particularly if x is discrete.
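As a numerical illustration of (1.1) for a discrete variable (a minimal Python sketch; the Poisson pdf with mean 2.3 is a hypothetical choice), the expectation value is the pdf-weighted sum, and it need not coincide with the most likely value:

```python
import math

# Hypothetical example: a Poisson pdf with mean nu = 2.3.
nu = 2.3
def f(n):
    return math.exp(-nu) * nu**n / math.factorial(n)

# E[x] = sum over n of n*f(n) (the sum replaces the integral for discrete x).
mean = sum(n * f(n) for n in range(100))
# The most likely value (mode) is a different summary of the pdf.
mode = max(range(100), key=f)
print(mean)  # 2.3 (up to a negligible truncation of the sum)
print(mode)  # 2 -- the expectation value is not the most likely value
```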
For any pdf f(x), the integer powers of x have expectation values. These are called the (algebraic) moments and are defined as
αn = E[x^n] = ∫ x^n f(x) dx    (1.2)
The first moment, α1, is called the mean or, more properly, the arithmetic mean of the distribution; it is usually called µ and often written ⟨x⟩. It acts as a key measure of location, in cases where the variable x is distributed with some known shape about a particular point.
Conversely there are cases where the shape is what matters, and the absolute location of the distribution is of little interest. For these it is useful to use the central moments
mn = E[(x − µ)^n] = ∫ (x − µ)^n f(x) dx    (1.3)
The second central moment is also known as the variance, and its square root as the standard deviation:
V = m2 = E[(x − µ)²],    σ = √V    (1.4)
The variance is a measure of the width of a distribution. It is often easier to deal with algebraically whereas the standard deviation σ has the same dimensions as the variable x; which to use is a matter of personal choice. Broadly speaking, statisticians tend to use the variance whereas physicists tend to use the standard deviation.
The third and fourth central moments are used to build shape-describing quantities known as skew and kurtosis (or curtosis):
skew: γ = m3/σ³    (1.5)
kurtosis: κ = m4/σ⁴ − 3    (1.6)
Division by the appropriate power of σ makes these quantities dimensionless and thus independent of the scale of the distribution, as well as of its location. Any symmetric distribution has zero skew: distributions with positive skew have a tail towards higher values, and conversely negative skew distributions have a tail towards lower values. The Poisson distribution has a positive skew, the energy recorded by a calorimeter has a negative skew. A Gaussian has a kurtosis of zero – by definition, that’s why there is a ‘3’ in the formula. Distributions with positive kurtosis (which are called leptokurtic) have a wider tail than the equivalent Gaussian; more centralised or platykurtic distributions have negative kurtosis. The Breit–Wigner distribution is leptokurtic, as is Student’s t. The uniform distribution is platykurtic.
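These moment formulas are easy to check on simulated samples (a sketch using Python's standard random module; the sample sizes are arbitrary). A uniform distribution should come out platykurtic, with kurtosis −1.2, and a Gaussian should give zero for both quantities:

```python
import random

random.seed(1)

def skew_kurtosis(xs):
    # Sample estimates of the central moments m2, m3, m4 of (1.3),
    # combined into the dimensionless skew and kurtosis.
    n = len(xs)
    mu = sum(xs) / n
    m2 = sum((x - mu) ** 2 for x in xs) / n
    m3 = sum((x - mu) ** 3 for x in xs) / n
    m4 = sum((x - mu) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

gauss = [random.gauss(0, 1) for _ in range(200_000)]
flat = [random.uniform(-1, 1) for _ in range(200_000)]
g_skew, g_kurt = skew_kurtosis(gauss)  # both close to 0
u_skew, u_kurt = skew_kurtosis(flat)   # kurtosis close to -1.2
```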
Suppose you have a pdf f(x, y) which is a function of two random variables, x and y. You can not only form moments for both x and y, but also for combinations, particularly the covariance
cov(x, y) = E[(x − µx)(y − µy)] = E[x y] − µx µy    (1.7)
If the joint pdf is factorisable: f(x, y) = fx(x) · fy(y), then x and y are independent, and the covariance is zero (although the converse is not necessarily true: a zero covariance is a necessary but not a sufficient condition for two variables to be independent).
A dimensionless version of the covariance is the correlation ρ:
ρ = cov(x, y)/(σx σy)    (1.8)
The magnitude of the correlation lies between 0 (uncorrelated) and 1 (completely correlated). The sign can be positive or negative: amongst a sample of students there will probably be a positive correlation between height and weight, and a negative correlation between academic performance and alcohol consumption.
If there are several (i.e. more than two) variables, x1, x2,…, xN, one can form the covariance and correlation matrices:
Vij = cov(xi, xj) = E[xi xj] − E[xi]E[xj]    (1.9)
ρij = Vij/(σi σj)    (1.10)
and Vii is just σi².
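A small sketch of (1.7)–(1.10) in Python (assumed toy model: y is x plus independent unit-Gaussian noise, so the true correlation is 1/√2 ≈ 0.71):

```python
import random

random.seed(2)
xs = [random.gauss(0, 1) for _ in range(100_000)]
ys = [x + random.gauss(0, 1) for x in xs]  # correlated with xs by construction

def cov(a, b):
    # Sample covariance: E[ab] - E[a]E[b].
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((u - ma) * (v - mb) for u, v in zip(a, b)) / len(a)

# Covariance matrix V; the diagonal holds the variances.
V = [[cov(xs, xs), cov(xs, ys)],
     [cov(ys, xs), cov(ys, ys)]]
rho = V[0][1] / (V[0][0] ** 0.5 * V[1][1] ** 0.5)  # close to 0.71
```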
Mathematically, any pdf f(x, y) is a function of two variables x and y. They can be similar in nature, for example the energies of the two electrons produced by a converting high energy photon, or they can be different, for example the position and direction of particles undergoing scattering in material.
Often we are really interested in one parameter (say x) while the other (say y) is just a nuisance parameter. We want to discard the extra information contained in the two-dimensional function (or scatter plot). This can be done in two ways: the projection f(x)|y is obtained by choosing a particular value of y; the marginal distribution f(x) = ∫ f(x, y)dy is found by integrating over y.
Projections can be useful for illustration; otherwise, to be meaningful, you have to have a good reason for choosing that specific value of y. Marginalisation requires that the distribution in y, like that of x, is properly normalised.
There are many other properties that can be quoted, depending on the point we want to bring out, and on the established usage of the field.
The mean is not always the most helpful measure of location. The mode is the value of x at which the pdf f(x) is maximum, and if you want a typical value to quote it serves well. The median is the midway point, in the sense that half the data lie above and half below. It is useful in describing very skewed distributions (particularly financial income) in which fluctuations in a small tail would give a big change in the mean.
We can also specify dispersion in ways that are particularly useful for non-Gaussian distributions by using quantiles: the upper and lower quartiles give the values above which, and below which, 25% of the data lie. Deciles and percentiles are also used.
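For a very skewed distribution these summaries behave quite differently (a sketch with an exponential sample, whose mean is 1 but whose median is ln 2 ≈ 0.69):

```python
import random
import statistics

random.seed(3)
# An exponential distribution with mean 1: strongly skewed to the right.
xs = [random.expovariate(1.0) for _ in range(100_000)]

mean = statistics.fmean(xs)                 # pulled up by the tail, ~ 1.0
median = statistics.median(xs)              # ~ ln 2 = 0.69, robust to the tail
q1, q2, q3 = statistics.quantiles(xs, n=4)  # quartiles; q2 is the median
```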
The cumulative distribution function
F(a) = ∫ f(x) Θ(a − x) dx    (1.11)
where Θ is the Heaviside or step function (Θ(x) = 1 for x ≥ 0 and 0 otherwise), giving the probability that a variable will take a value up to a, is occasionally useful.
The characteristic function
ϕ(u) = E[e^(iux)] = ∫ e^(iux) f(x) dx    (1.12)
which is just (up to factors of 2π) the Fourier transform of the pdf, is also met with sometimes as it has useful properties.
A pdf is a mathematical function. It involves a variable (or variables) describing the random quantity concerned. This may be a discrete integer or a continuous real number. It also involves one or more parameters. In what follows we will denote a random variable by x for a real number and r for an integer. Parameters generally have their traditional symbols for particular pdfs: where we refer to a generic parameter we will call it θ. It is often helpful to write a function as f(x; θ) or f(x|θ), separating this way more clearly the random variable(s) from the adjustable parameter(s). The semicolon is preferred by some, the line has the advantage that it matches the notation used for conditional probabilities, described in Section 1.4.4.1.
There are many pdfs in use to model the results of random processes. Some are based on physical motivations, some on mathematics, and some are just empirical forms that happen to work well in particular cases.
The overwhelmingly most useful form is the Gaussian or normal distribution. The Poisson distribution is also encountered very often, and the binomial distribution is not uncommon. So we describe these in some detail, and then some other distributions rather more briefly.
The Gaussian, or normal, distribution for a continuous random variable x is given by
f(x; µ, σ) = (1/(σ√(2π))) e^(−(x−µ)²/(2σ²))    (1.13)
It has two parameters; the function is manifestly symmetrical about the location parameter µ, which is the mean (and mode, and median) of the distribution. The scale parameter σ is also the standard deviation of the distribution. So there is, in a sense, only one Gaussian, the unit Gaussian or standard normal distribution f(x; 0, 1) shown in Figure 1.1. Any other Gaussian can be obtained from this by scaling by a factor σ and translating by an amount µ. The Gaussian distribution is sometimes denoted N(µ, σ²).
The Gaussian is ubiquitous (hence the name ‘normal’) because of the central limit theorem, which states that if any distribution is convoluted with itself a large number of times, the resulting distribution tends to a Gaussian form. For a proof, see for example Appendix 2 in [1].
Gaussian random numbers are much used in simulation, and a suitable random number generator is available on most systems. If it is not, then you can generate a unit Gaussian by taking two uniformly generated random numbers u1, u2, setting θ = 2πu1 and r = √(−2 ln u2); then r cos θ and r sin θ are independent samples from a unit Gaussian.
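This recipe (the Box–Muller transform) is a few lines in any language; here is a sketch in Python, checked against the unit-Gaussian mean and variance:

```python
import math
import random

random.seed(4)

def box_muller():
    # Two uniform numbers give two independent unit-Gaussian samples.
    u1 = random.random()
    u2 = 1.0 - random.random()  # in (0, 1], avoids log(0)
    theta = 2.0 * math.pi * u1
    r = math.sqrt(-2.0 * math.log(u2))
    return r * math.cos(theta), r * math.sin(theta)

samples = [z for _ in range(100_000) for z in box_muller()]
mean = sum(samples) / len(samples)                           # ~ 0
var = sum(z * z for z in samples) / len(samples) - mean**2   # ~ 1
```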
The product of two independent Gaussians gives a two-dimensional function
f(x, y) = (1/(2π σx σy)) e^(−(x−µx)²/(2σx²) − (y−µy)²/(2σy²))    (1.14)
but the most general quadratic form in the exponent must include the cross term and can be written as
f(x, y) = (1/(2π σx σy √(1−ρ²))) exp{−[(x−µx)²/σx² − 2ρ(x−µx)(y−µy)/(σx σy) + (y−µy)²/σy²]/(2(1−ρ²))}    (1.15)
where the parameter ρ is the correlation between x and y. For N variables, for which we will use the vector x, the full form of the multivariate Gaussian can be compactly written using matrix notation:
f(x) = (1/((2π)^(N/2) |V|^(1/2))) exp(−(x − µ)ᵀ V⁻¹ (x − µ)/2)    (1.16)
Here, V is the covariance matrix described in Section 1.2.2.3.
The error function and the complementary error function are closely related to the cumulative Gaussian:
erf(y) = (2/√π) ∫0^y e^(−t²) dt    (1.17)
erfc(y) = 1 − erf(y) = (2/√π) ∫y^∞ e^(−t²) dt    (1.18)
Their main use is in calculating Gaussian p-values (see Section 1.3.4.6). The probability that a Gaussian random variable will lie within one standard deviation, or ‘1 σ’, of the mean is 68%, obtained by calculating erf(1/√2). Conversely, the chance that a variable drawn from a Gaussian random process will lie outside 1 σ is 32%. Given such a process – say with a mean of 10.2 and a standard deviation of 3.1 – then if you confront a particular measurement – say 13.3 – it is quite plausible that it was produced by the process. One says that its p-value, the probability that the process would produce a measurement this far, or further, from the ideal mean, is 32%. Conversely, if the number were 25.7 rather than 13.3, that would be 5 σ rather than 1 σ, for which the p-value is only 5.7 · 10⁻⁷. In discussion of discoveries (or otherwise) of new particles and new effects this language is turned round, and a discovery with a p-value of 5.7 · 10⁻⁷ is referred to as a ‘5 σ result’2). A translation is given in Table 1.1 – although for practical purposes it is easier to use functions such as pnorm and qnorm in the programming language R [2], or TMath::Prob in ROOT [3].
Deviation | p-value (%) |
1σ | 31.7 |
2σ | 4.56 |
3σ | 0.270 |
4σ | 0.00633 |
5σ | 0.0000573 |
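Table 1.1 can be reproduced from the complementary error function alone: the two-sided p-value for an n σ deviation is erfc(n/√2). A sketch using only the Python standard library (rather than R or ROOT):

```python
import math

def p_value(n_sigma):
    # Probability for a Gaussian variable to lie more than n_sigma
    # standard deviations from the mean (two-sided).
    return math.erfc(n_sigma / math.sqrt(2.0))

for n in range(1, 6):
    print(f"{n} sigma: p = {100 * p_value(n):.4g}%")
# 1 sigma -> 31.7%, 3 sigma -> 0.27%, 5 sigma -> 5.7e-5%
```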
The Poisson distribution
P(n; ν) = e^(−ν) ν^n/n!    (1.19)
describes the probability of n events occurring when the mean expected number is ν; n is discrete and ν is continuous. Typical examples are the number of clicks produced by a Geiger counter in an interval of time, or, famously, the number of Prussian cavalrymen killed by horse-kicks [4]. Some examples are shown in Figure 1.2.
The Poisson distribution has a mean of ν and a standard deviation of √ν. This property – that the standard deviation is the square root of the mean – is a key fact about distributions generated by a Poisson process, which is important as this includes most cases where a number of samples is taken, including the contents of a bin of a histogram.
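The square-root property is easy to verify by simulation (a sketch; the sampler is Knuth's multiply-uniforms algorithm, and the mean ν = 9 is an arbitrary choice, so the standard deviation should come out near 3):

```python
import math
import random

random.seed(5)

def poisson_sample(nu):
    # Knuth's algorithm: count uniform factors until the running
    # product drops below exp(-nu).
    limit = math.exp(-nu)
    k, p = 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

xs = [poisson_sample(9.0) for _ in range(100_000)]
mean = sum(xs) / len(xs)                                    # ~ 9
std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5   # ~ sqrt(9) = 3
```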
The binomial distribution describes a generalisation of the simple problem of the numbers of heads and tails that can arise from spinning a coin several times. The probability for getting r ‘successes’ from N ‘trials’ given an intrinsic probability of success p is
P(r; N, p) = (N!/(r!(N − r)!)) p^r (1 − p)^(N−r)    (1.20)
Sometimes one writes q instead of 1 − p, which makes the algebra prettier. The distribution has a mean of Np and a standard deviation of √(Npq). The factor N!/[r!(N − r)!] is the number of ways that r objects may be chosen from N, and is often written as the binomial coefficient C(N, r).
If p is small then the distribution can be approximated by a Poisson distribution3) of mean Np. This is often used implicitly when analysing Monte Carlo samples: if you generate 1 000 000 Monte Carlo events, of which 100 end up in some particular histogram bin, then strictly speaking this is described by a binomial process rather than a Poisson. In practice you can take the error as the Poisson √100 = 10 rather than the binomial √(Np(1 − p)) = √(100 · 0.9999) ≈ 10. This doesn’t work if p is large. If 9 out of 10 events are accepted by the trigger, the error on the trigger efficiency of 90% is not √9/10 = 30% but √(Npq)/N = √(10 · 0.9 · 0.1)/10 ≈ 9.5% (in such a case the shortcut is to take the one lost event as approximately Poisson, giving the error as √1/10 = 10%, which is close).
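The trigger example can be made concrete (a sketch with the numbers from the text: 9 of N = 10 events accepted):

```python
import math

N, r = 10, 9
eff = r / N  # 90% trigger efficiency

binomial_err = math.sqrt(N * eff * (1 - eff)) / N  # ~ 0.095: the right answer
naive_poisson_err = math.sqrt(r) / N               # 0.30: wrong, p is not small
lost_event_err = math.sqrt(N - r) / N              # 0.10: the one-lost-event shortcut
```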
If N is large and p is not small then the distribution is approximately a Gaussian.
If there are not just two possible outcomes but n, with probabilities {p1, p2, …, pn}, then the total probability of getting r1 of the first outcome, r2 of the second, and so on, is
P(r1, r2, …, rn) = (N!/(r1! r2! ⋯ rn!)) p1^r1 p2^r2 ⋯ pn^rn,    N = r1 + r2 + ⋯ + rn    (1.21)
This is the multinomial distribution.
There are many, many other possible distribution functions, and it is worth listing some of those more often met with.
The uniform distribution, also known as the rectangular or top-hat distribution, is constant inside some range – call this range −a/2 to a/2, so the width is a; if the range is not central about zero but about some other value this is easily handled by a translation. The mean, clearly, is zero, and the standard deviation is a/√12. This can be used in position measurements by a hodoscope: if a rectangular slab of scintillator gives a signal, you know that a track went through it but you do not know where. It is reasonable to assume a uniform distribution for the pdf of the hit position.
This can be relevant in considering some systematic uncertainties on the total result, as is also discussed in Section 8.4.1.2. For example, if you set up an experiment to run overnight, counting events with some efficiency E1, and when you arrive in the morning you find a component has tripped so the efficiency is E2, with no information about when this happened, your efficiency has to be quoted as (E1 + E2)/2 ± |E1 − E2|/√12. It can also be applied to theoretical models: when two models give different predictions you are justified in using their mean as your prediction, with a (systematic) error which is the difference divided by √12, if (and only if) these two models represent absolute extremes and you really have no feeling as to where between the two extremes the truth may lie.
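For the overnight-run example this amounts to two lines (a sketch; the efficiencies E1 = 0.95 and E2 = 0.80 are hypothetical values):

```python
import math

E1, E2 = 0.95, 0.80  # efficiency before and after the (unknown) trip time
# Uniform pdf between the two extremes: quote mean +/- width/sqrt(12).
central = (E1 + E2) / 2
error = abs(E1 - E2) / math.sqrt(12)
print(f"efficiency = {central:.3f} +/- {error:.3f}")  # 0.875 +/- 0.043
```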
In nuclear and particle physics the function
f(E) = (1/π) (Γ/2)/((E − M)² + Γ²/4)    (1.22)
gives the variation with the energy E of a cross section produced by the formation of a state with mass M and width Γ. It can be written more neatly in dimensionless form as
f(x) = (1/π) · 1/(1 + x²)    (1.23)
where x = (E − M)/(Γ/2). The mean is clearly M. It does not have a variance: the integral ∫ x² f(x)dx is divergent. If you have to compare this curve with that of a Gaussian, the full width at half maximum (FWHM) is clearly Γ for this curve, while for a Gaussian it is 2σ√(2 ln 2) ≈ 2.35 σ.
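Two of these properties are easy to check numerically (a sketch of the dimensionless form (1.23); samples are drawn by inverting the cumulative distribution):

```python
import math
import random

random.seed(6)
# The dimensionless Breit-Wigner (Cauchy) pdf.
def f(x):
    return 1.0 / (math.pi * (1.0 + x * x))

# Half maximum at x = +/-1, i.e. E = M +/- Gamma/2: the FWHM is Gamma.
print(f(1.0), f(0.0) / 2)  # equal

# The divergent variance in practice: the sample variance never settles.
def sample():
    return math.tan(math.pi * (random.random() - 0.5))

for n in (10**3, 10**4, 10**5):
    xs = [sample() for _ in range(n)]
    print(n, sum(x * x for x in xs) / n)  # jumps around, does not converge
```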
This distribution is used in fitting resonance peaks (provided the width is much larger than the measurement error on E). It also has an empirical use in fitting a set of data which is almost Gaussian but has wider tails. This often arises in cases where a fraction of the data is not so well measured as the rest. A double Gaussian may give a good fit, but it often turns out that this form does an adequate job without the need to invoke extra parameters.
When a charged particle passes an atom, its electrons experience a changing electromagnetic field and acquire energy. The amount of energy may be large; on rare occasions it will be large enough to create a delta ray. The probability distribution for the energy loss was computed by Landau [5] and is given by
f(λ) = (1/π) ∫0^∞ e^(−t ln t − λt) sin(πt) dt    (1.24)
where λ = (Δ − Δ0)/ξ. Here, Δ is the actual energy loss, Δ0 is a location parameter, and ξ is a scale, exact values for which depend on the material. This distribution has a peak at Δ0, cuts off quickly below that, and has a very long positive tail. The function is shown in Figure 1.3.
The Landau distribution has very unpleasant mathematical properties. Some of its integrals diverge, for example it has no variance (like the Cauchy distribution), and, worse than that, it does not even have a mean. The ensuing complications can be avoided on a case-by-case basis by imposing an upper limit on the energy loss, as a particle cannot lose more than 100% of its energy.
There is a function which is described in some places as ‘the Landau distribution’. It is not. It is an approximation to the Landau distribution [6], and not a very good one at that.
This considers the familiar binomial, but with a twist. As before, some process has a random probability p of success and q = 1 – p of failure, and is repeated for many trials. But now instead of asking the probability of r successes from a fixed number of trials n, we ask for the probability of r successes before encountering a fixed number k of failures. This is given by
P(r) = ((r + k − 1)!/(r!(k − 1)!)) p^r q^k    (1.25)
It is the probability for r successes and k − 1 failures in any permutation, followed by a final kth failure. The combinatorial factor can also be written (−1)^r C(−k, r), hence the name ‘negative binomial’. This can readily be extended to non-integer values by writing it as
P(r) = (Γ(r + k)/(Γ(k) r!)) p^r q^k    (1.26)
although it is not clear what physical meaning this may have. Γ is the Gamma function, defined as
Γ(z) = ∫0^∞ t^(z−1) e^(−t) dt    (1.27)
The negative binomial distribution has a mean µ = (p/q)k and a variance V = (p/q2)k. The negative binomial approaches the Poisson as k becomes large and p small with constant pk = µ.
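The quoted Poisson limit can be checked directly by comparing the two probability functions at fixed µ = pk (a sketch; µ = 2 is an arbitrary choice):

```python
import math

def neg_binomial(r, k, p):
    # Probability of r successes before the k-th failure, as in (1.25).
    q = 1.0 - p
    return math.comb(r + k - 1, r) * p**r * q**k

def poisson(r, mu):
    return math.exp(-mu) * mu**r / math.factorial(r)

mu = 2.0
diffs = []
for k in (10, 100, 1000):
    p = mu / k  # keep p*k = mu fixed while k grows and p shrinks
    diffs.append(max(abs(neg_binomial(r, k, p) - poisson(r, mu))
                     for r in range(20)))
print(diffs)  # shrinks towards zero as k grows
```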
If you take a sample of n values, {x1,…, xn}, from a Gaussian and histogram their differences from the true mean, divided by the standard deviation (a quantity often called the pull), then this gives a unit Gaussian, that is a Gaussian with µ = 0, σ = 1, which can be a useful check that you have your errors right. If, as often happens, the true mean is unknown, then the spread about the measured mean is slightly smaller than 1, by a factor √((n − 1)/n).
If the standard deviation σ is also unknown, then you can use instead the estimated if µ is known or if it is not. Now, for small n especially, this is not a very good estimator, and because you are dividing the differences from the mean by this bad estimate, the distribution for
t = \frac{\bar{x} - \mu}{\hat{\sigma}/\sqrt{n}}    (1.28)
is not given by a Gaussian, but by Student’s t distribution for n – 1 degrees of freedom, where Student’s t distribution is given by
f(t; n) = \frac{\Gamma\!\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\;\Gamma\!\left(\frac{n}{2}\right)} \left(1 + \frac{t^2}{n}\right)^{-(n+1)/2} .    (1.29)
This tends to a unit Gaussian as n becomes large, but for small n it has tails which are significantly wider (see Figure 1.4): large t values can result if \hat{\sigma} is an underestimate of the true value. The mean is clearly zero; the variance is not one, as it would be for a unit Gaussian, but n/(n − 2).
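These properties are easy to verify numerically. The sketch below implements the density of Eq. (1.29), checks its normalisation and that its variance is n/(n − 2) by crude numerical integration, and compares its far tail with the unit Gaussian (the choice n = 5 and the grid parameters are illustrative):

```python
from math import gamma, sqrt, pi, exp

def student_t_pdf(t, n):
    # Student's t density for n degrees of freedom, Eq. (1.29)
    return (gamma((n + 1) / 2) / (sqrt(n * pi) * gamma(n / 2))
            * (1 + t * t / n) ** (-(n + 1) / 2))

def gauss_pdf(t):
    # unit Gaussian density
    return exp(-t * t / 2) / sqrt(2 * pi)

def integrate(f, a=-200.0, b=200.0, steps=400_000):
    # crude midpoint rule, good enough for a sanity check
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

n = 5
norm = integrate(lambda t: student_t_pdf(t, n))         # ~ 1
var = integrate(lambda t: t * t * student_t_pdf(t, n))  # ~ n/(n-2)
tail_ratio = student_t_pdf(4.0, n) / gauss_pdf(4.0)     # tail much heavier than Gaussian
```

At t = 4 the t density for n = 5 exceeds the unit Gaussian by more than an order of magnitude, which is exactly the heavy-tail behaviour described above.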
In describing the agreement between a predictive function g(x) and a set of n measurements {(xi, yi)}, it is useful to form the total squared deviation
\chi^2 = \sum_{i=1}^{n} \left( \frac{y_i - g(x_i)}{\sigma_i} \right)^2 ,    (1.30)
where σi is the Gaussian error on measurement i: if these errors are the same for all measurements then the factor 1/σ² can, of course, be taken outside the summation.
Each term will clearly contribute an amount of order one to the sum, and it is no surprise that the expectation value of χ² is n. The distribution f(χ²; n) is given by
f(\chi^2; n) = \frac{(\chi^2)^{n/2 - 1}\, e^{-\chi^2/2}}{2^{n/2}\, \Gamma(n/2)} .    (1.31)
Some examples for different n are shown in Figure 1.5.
The χ2 distribution is used a great deal in considering the question of whether a particular set of measurements (with their errors) and a particular model are compatible. This is addressed through the cumulative χ2 distribution. For a given value of χ2, the complement of the cumulative distribution gives the p-value, the probability that, given that the model is indeed correct, a measurement would give a result with a χ2 this large, or larger. If the value of χ2 obtained is large compared to n then the p-value is small, that is the probability that a set of measurements truly described by this model would give such a large disagreement is small, and doubt is cast on the model, or the data (or both). The mean of f(χ²; n) is just n, and the standard deviation is \sqrt{2n}. For large n the distribution converges to the Gaussian, as it must by the central limit theorem. However, the convergence is actually rather slow, and this approximation is not often used. Instead the p-value should be obtained accurately from functions such as TMath::Prob in ROOT or pchisq in R.
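If such library functions are not to hand, the p-value can also be obtained directly from Eq. (1.31). The sketch below does this by crude numerical integration; the case n = 2, where the p-value is exactly exp(−χ²/2), provides a closed-form cross-check (the test values χ² = 6 and χ² = 11.07 are illustrative):

```python
from math import gamma, exp

def chi2_pdf(x, n):
    # chi-square density for n degrees of freedom, Eq. (1.31)
    return x ** (n / 2 - 1) * exp(-x / 2) / (2 ** (n / 2) * gamma(n / 2))

def p_value(chi2, n, upper=200.0, steps=200_000):
    # probability of a chi-square this large or larger, by midpoint integration
    h = (upper - chi2) / steps
    return sum(chi2_pdf(chi2 + (i + 0.5) * h, n) for i in range(steps)) * h

# for n = 2 degrees of freedom the p-value is exactly exp(-chi2/2)
pv = p_value(6.0, 2)   # close to exp(-3) ~ 0.0498
```

As a familiar benchmark, χ² = 11.07 for n = 5 degrees of freedom corresponds to a p-value of about 0.05.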
If the model has free parameters θ which are not given, but were found by fitting the data, then the same χ2 test can be used, but for n one takes the number of data points minus the number of fitted parameters. This is called the number of degrees of freedom. Strictly speaking this is only true if the model is a linear one (i.e. linear in the parameters). This is often the case, either exactly or to a good approximation, but there are some instances where this condition does not hold, leading to the computation of deceptively small and inaccurate p-values.
You will occasionally obtain χ² values that seem very small: χ² ≪ n. There is no standard procedure for rejecting these, but you should treat them with some suspicion and consider whether the model may have been formulated after the data had been measured (‘retrospective prediction’), or whether perhaps the errors have been over-generously estimated.
If the logarithm of the variable is given by a Gaussian distribution f(ln x; µ, σ) then the distribution for x itself is the log-normal distribution
f(x; \mu, \sigma) = \frac{1}{x\,\sigma\sqrt{2\pi}} \exp\left( -\frac{(\ln x - \mu)^2}{2\sigma^2} \right) .    (1.32)
Just as the central limit theorem dictates that any variable which is the sum of a large number of random components is described by a Gaussian distribution, any variable which is the product of a large number of random factors, none of which dominates the behaviour, is described by the log-normal. For instance, the signal registered by an electron in a calorimeter may be described by a log-normal distribution, as a certain fraction of the energy may be lost to dead material, a fraction to lost photons, a fraction to neutron production, and so on. The mean is given by e^{\mu + \sigma^2/2}, and the standard deviation is e^{\mu + \sigma^2/2}\sqrt{e^{\sigma^2} - 1}.
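The mean and standard deviation formulae can be checked against a Monte Carlo sample; the Python standard library samples a log-normal directly. A minimal sketch, with µ = 0.5 and σ = 0.3 as illustrative values:

```python
import random
from math import exp, sqrt

random.seed(1)
mu, sigma = 0.5, 0.3

# sample x such that ln x is Gaussian with mean mu and width sigma
xs = [random.lognormvariate(mu, sigma) for _ in range(200_000)]

sample_mean = sum(xs) / len(xs)
sample_sd = sqrt(sum((x - sample_mean) ** 2 for x in xs) / (len(xs) - 1))

mean_formula = exp(mu + sigma ** 2 / 2)                        # e^(mu + sigma^2/2)
sd_formula = mean_formula * sqrt(exp(sigma ** 2) - 1)          # e^(mu+sigma^2/2) * sqrt(e^sigma^2 - 1)
```

Note that the mean is larger than e^µ: the distribution is skewed to the right, so the mean exceeds the median.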
The Weibull distribution is:
f(x; \alpha, \beta) = \alpha\beta\,(\alpha x)^{\beta - 1}\, e^{-(\alpha x)^\beta} .    (1.33)
This gives a shape which rises from zero to a peak and then falls back to zero again. It was originally invented to describe the failure rates in aging light bulbs. There are few failures at small times (because the bulbs are new and fresh) or at long times (because by then they have nearly all failed). It is a rather more realistic modelling of real-life ‘lifetime’ than the simple exponential decay law, for which the failure probability is constant.
The parameter α is just a scale factor and β controls the shape. The case β = 1 corresponds to the simple exponential decay law, whereas β > 1 describes the behaviour when the failure probability increases with age, and gives successively sharper peaks. A case where the failure probability falls with time (perhaps because of initial burn-in) is described by β < 1. Examples are shown in Figure 1.6. The mean is (1/α) Γ[1 + (1/β)] and the variance is 1/α2 {Γ [1 + (2/β)] – Γ [1 + (1/β)]2}. A location parameter x0 may also be needed in some problems, replacing x by x – x0.
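The normalisation, mean and variance quoted above can be confirmed numerically from Eq. (1.33); the sketch below does so by midpoint integration (α = 2, β = 3 are illustrative choices):

```python
from math import gamma, exp

alpha, beta = 2.0, 3.0

def weibull_pdf(x):
    # Weibull density, Eq. (1.33), with scale alpha and shape beta
    return alpha * beta * (alpha * x) ** (beta - 1) * exp(-(alpha * x) ** beta)

def integrate(f, a=0.0, b=5.0, steps=100_000):
    # crude midpoint rule; the density is negligible beyond x = 5 for these parameters
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

norm = integrate(weibull_pdf)                    # ~ 1
mean = integrate(lambda x: x * weibull_pdf(x))   # ~ (1/alpha) * Gamma(1 + 1/beta)
second = integrate(lambda x: x * x * weibull_pdf(x))
var = second - mean ** 2                         # ~ (1/alpha^2) * {Gamma(1+2/beta) - Gamma(1+1/beta)^2}
```

Beware that software libraries use differing Weibull parametrisations (often with the scale appearing as 1/α), so the correspondence with Eq. (1.33) should always be checked.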
We use probability every day, in both our work as physicists and our everyday lives. Sometimes this is a matter of precise calculation, when we buy an insurance policy or decide whether to publish a result, sometimes it is more intuitive, as when we decide to take an umbrella to work in the morning.
But although we are familiar with the concept of probability, on closer inspection it turns out that there are subtleties. When we get into technicalities there turn out to be different definitions of the concept which are not always compatible.
Let A be an event. Then the probability P(A) is a number obeying three conditions, the Kolmogorov axioms [7]:
1. P(A) ≥ 0 for every event A;
2. P(Ω) = 1, where Ω is the set of all possible outcomes;
3. P(A ∪ B) = P(A) + P(B) for any two mutually exclusive events A and B.
From these axioms a whole system of theorems and properties can be derived. However, the theory contains no statement as to what the numbers actually mean. For mathematicians this is, of course, not a problem, but it does not help us to apply the results.
The probability of a coin landing heads or tails is clearly 1/2. Symmetry dictates that it cannot be anything else. Likewise the chance of drawing a particular card from a pack has to be 1/52. The original development of probability by Laplace, Pascal and their contemporaries, to aid the gambling fraternity, was founded on this equally likely construction. ‘Probability’ could be defined by taking fundamental symmetry where all cases were equally likely (say, the six sides on a dice), and extended to more complex cases (say, rolling two dice) by counting combinations.
Unfortunately this definition does not generalise to cases of continuous variables, where there is no fundamental symmetry. If you ‘draw a line at random’ from a given point, this could be done by taking the coordinates of the endpoint from a uniform distribution, or by drawing an angle uniformly between 0° and 360°; the results are incompatibly different. This approach thus leads to a dead end.
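The ambiguity is easy to demonstrate by simulation. In the sketch below (an illustrative construction of my own, not taken from the text) a line is drawn from the origin either by choosing the endpoint uniformly in a rectangle or by choosing the direction uniformly; the probability that the line makes an angle below 45° differs between the two prescriptions:

```python
import random
from math import atan2, pi

random.seed(0)
N = 100_000

# prescription 1: endpoint uniform in the rectangle [0,1] x [0,2]
frac_endpoint = sum(
    atan2(random.uniform(0, 2), random.uniform(0, 1)) < pi / 4
    for _ in range(N)) / N          # -> 0.25 for this rectangle

# prescription 2: direction uniform between 0 and 90 degrees
frac_angle = sum(random.uniform(0, pi / 2) < pi / 4 for _ in range(N)) / N  # -> 0.5
```

Both prescriptions sound equally ‘random’, yet they assign different probabilities to the same event, which is exactly why the classical definition breaks down for continuous variables.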
Problems with the classical definition led to the alternative definition of probability as the limit of frequency by Venn, von Mises [8] and others. If a selection is made N times under identical circumstances, then the fraction of cases resulting in a particular outcome A tends to a limit, and this limit is what is meant by the probability:
P(A) = \lim_{N \to \infty} \frac{N(A)}{N} ,    (1.34)
where N(A) is the number of times outcome A occurs in the N trials.
This is the generally adopted definition, taught in most elementary courses and textbooks. It satisfies, of course, the Kolmogorov axioms.
Where the classical definition is valid it leads to the same results. But there is an important philosophical difference. The probability P(A) is not some intrinsic property of A, it also depends on the way the sampling is done: on how the collective or ensemble of total possible outcomes has been constructed.
Thus, to use von Mises’ example: the life insurance companies determine that the probability of one of their (male) clients dying between the ages of 40 and 41 is 1.1%. This is a hard and verifiable number, essential for the correct adjustment of the premium paid. However, it is not an intrinsic probability of the person concerned: you cannot say that a particular client has this number attached to them as a property in the same way that their height and weight are. The client belongs not just to this ensemble (insured 40-yr-old males) but to many others: 40-yr-old males, non-smoking 40-yr-old males, non-smoking professional lion tamers – and for each of these ensembles there will be a different number.
So there are cases with several possible ensembles, and the value of P(A) is ambiguous until the ensemble is specified. There are also cases where there is no ensemble, as the event is unique. The Big Bang is an obvious example, but others can be found much nearer home. For example, what is the probability P(rain) that it will rain tomorrow? Now, there is only one tomorrow, and it will either rain or it will not, so P(rain) is either 0 or 1. Von Mises condemns any further discussion as ‘unscientific’ use of language. This is further discussed (and resolved) in Section 1.5.2.
Another way of extending the unsatisfactory classical definition of probability was made by de Finetti [9] and others. De Finetti’s starting point is the provocative ‘Probability does not exist.’ It has no objective status: it is something the human mind has constructed.
He shows that one can consistently define a personal probability (or degree-of-belief) P(A) in A by establishing the odds of a bet whereby you lose €1 if A subsequently turns out to be false, and you receive €G if it turns out to be true. If P(A) > 1/(1 + G) you will accept the bet; if P(A) < 1/(1 + G) you will decline it.
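The break-even point follows from a simple expected-value argument: staking 1 against a prize of G, the expected gain P·G − (1 − P) vanishes exactly at P = 1/(1 + G). A minimal sketch:

```python
def break_even(G):
    # personal probability at which a stake of 1 against a prize of G is a fair bet
    return 1.0 / (1.0 + G)

P = break_even(3.0)               # 0.25: accept the bet only if your P(A) exceeds this
gain_at_break_even = P * 3.0 - (1 - P)   # expected gain, zero at the break-even point
```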
Such personal probability is indeed something we use every day: when you decide whether or not to take an umbrella to work in the morning your decision is based on your personal probability of there being rain (and also the ‘costs’ involved in (a) getting wet and (b) having something extra to carry). However, there is no need for my personal probability to be the same as yours, or anyone else’s. It is thus often referred to as a subjective probability. Subjective probability is also generally known as Bayesian probability, because of the great use it makes of Bayes’ theorem [10]. This is a simple and fundamental result which is actually valid for any of the probability definitions being used.
Suppose A and B are two events, and introduce the conditional probability P(A | B), the probability of event A given that B is true (for instance: the probability that a card is the six of spades, given that it is black, P(six of spades|black) is 1/26).
The probability of both A and B occurring, P(A ∩ B), is clearly P(A | B)P(B). But it is also P(B | A)P(A). Equating these two expressions gives
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} .    (1.35)
This is used in problems like the famous ‘taxi colour’ example.
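The card example from above makes a compact check of Eq. (1.35). The sketch below computes P(six of spades | black) both via Bayes’ theorem and by brute-force enumeration of the deck:

```python
from fractions import Fraction

# Bayes' theorem, Eq. (1.35): P(six of spades | black)
p_black = Fraction(26, 52)
p_sos = Fraction(1, 52)            # prior: any single card
p_black_given_sos = Fraction(1)    # the six of spades is certainly black
p_sos_given_black = p_black_given_sos * p_sos / p_black   # = 1/26

# brute-force check by enumerating the deck
suits = ["spades", "clubs", "hearts", "diamonds"]
deck = [(rank, suit) for rank in range(1, 14) for suit in suits]
black = [card for card in deck if card[1] in ("spades", "clubs")]
frac = Fraction(sum(card == (6, "spades") for card in black), len(black))
```

Both routes give 1/26, as quoted in the text.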