
Chapter 8
Hypotheses Testing
In the previous chapters we used experimental data to estimate parameters. Here we will use data to test hypotheses. A typical example is to test whether the data are compatible with a theoretical prediction, or to choose among different hypotheses the one that best represents the data.
8.1 Hypotheses and test statistics
Let’s begin by defining some terminology that we will need in the following. The goal of a statistical test is to make a statement about how well the observed data stand in agreement (accept) or not (reject) with given predicted probabilities, i.e. a hypothesis. The hypothesis under test is usually called the null hypothesis, H0. The alternative hypothesis, if there is one, is usually called H1. If there are several alternative hypotheses they are labeled H1, H2, . . . A hypothesis is called simple if the p.d.f. of the random variable under test is completely specified (e.g. the data are drawn from a gaussian p.d.f. with specified mean and width) and composite if at least one of the parameters is not specified (e.g. the data are drawn from a Poisson with mean greater than 3).
In order to say in a quantitative way what it means to test a hypothesis we need to build a function of the measured variables $\vec{x}$, called the test statistic $t(\vec{x})$. If we build it in a clever way, the test statistic will be distributed differently depending on the hypothesis under test: $g(t(\vec{x})|H_0)$ or $g(t(\vec{x})|H_1)$. This pedantic notation is used here to stress that the test statistic is a function of the data and that it is the distribution of the test statistic values that differs under the different hypotheses (the lighter notation g(t|Hi) will be used from now on). Comparing the value of the test statistic computed on the actual data with the value(s) obtained computing it under the different hypotheses, we can quantitatively state the level of agreement. That’s the general idea. The way this is implemented in practice will be explained in the next sections.
The test statistic can be any function of the data: it can be a multidimensional vector $\vec{t}(\vec{x})$ or a single real number $t(\vec{x})$. Even the data themselves, $\{\vec{x}\}$, can be used as a test statistic. Collapsing all the information about the data into a single meaningful variable is particularly helpful in visualizing the test statistic and the separation between the two hypotheses. There is no general rule for the choice of the test statistic: the specific choice will depend on the particular case at hand. Different test statistics will in general give different results and
it is up to the physicist to decide which is the most appropriate for the specific problem.
Example: In order to better understand the terminology we can use a specific example based on particle identification. The average specific ionization dE/dx of two charged particles with the same speed passing through matter will be different depending on their masses (see Fig. 8.1.1). Because of this dependence, dE/dx can be used as a particle identification tool to distinguish particle types. For example, the ionization of electrons with momenta in the range of a few GeV tends to be larger than that of pions in the same momentum range.
Figure 8.1.1: Left: the specific ionization for some particle types (in green pions and in red electrons; other particle species are shown with different colors); Right: the projections of the left plot on the y-axis, i.e. the measured specific ionization for pions and electrons.

If we want to distinguish an electron from a pion in a given momentum bin we can use the specific ionization itself as test statistic, $t(\vec{x}) = dE/dx$. This is a typical case where the data themselves are used as test statistic. The test statistic will then be distributed differently under the two following hypotheses (see Fig. 8.1.1 right):
• null hypothesis: $g(t|H_0) = P(dE/dx \,|\, e^{\pm})$
• alternative hypothesis: $g(t|H_1) = P(dE/dx \,|\, \pi^{\pm})$
□
Example: When testing data for the presence of a signal, we define the null hypothesis as
the “background only” hypothesis and the alternative hypothesis as the “signal+background”
hypothesis.
□
Example: Fig. 8.1.2 shows the cross section of $e^+e^- \to W^+W^-(\gamma)$ measured by the L3 collaboration at different centre-of-mass energies. In this case the test statistic is the cross section as a function of energy. The measured values are then compared with different theoretical models (different hypotheses). We haven’t explained yet how to quantitatively accept or reject a hypothesis, but already at a naive level we can see that the data clearly prefer one of the models.
□
Figure 8.1.2: Analysis of the cross section of $e^+e^- \to W^+W^-(\gamma)$ as a function of the centre of mass energy (L3 detector at LEP).
The p.d.f. describing the test statistic corresponding to a certain hypothesis, $g(\vec{t}\,|H)$, is usually built from a data set that has precisely the characteristics associated to that hypothesis. In the particle identification example discussed before, the data used to build the p.d.f. for the two hypotheses were pure samples of electrons and pure samples of pions. For example, you can get a pure sample of electrons by selecting tracks from photon conversions $\gamma \to e^+e^-$, and a pure sample of pions from the self-tagging decays of charmed mesons $D^{*+} \to \pi^+ D^0;\ D^0 \to K^-\pi^+$ ($D^{*-} \to \pi^- \bar{D}^0;\ \bar{D}^0 \to K^+\pi^-$); self-tagging means that by knowing the charge of the π in the first decay you can unambiguously assign the pion/kaon hypothesis to the positive/negative charge of the second decay. In other cases the p.d.f. are built from dedicated measurements (e.g. a test beam¹) or from Monte Carlo simulations.
8.2 Significance, power, consistency and bias
In order to accept or reject a null hypothesis we partition the space of the test statistic values into a critical region and its complement, the acceptance region (see Fig. 8.2.3), such that there is a small probability, assuming H0 to be correct, of observing data with a test statistic in the critical region. The value of the test statistic chosen to define the two regions is called the decision boundary, tcut. If the value of the test statistic computed on the data sample under test falls in the critical region then the null hypothesis is discarded, otherwise it is accepted (or, more precisely, not rejected).
¹ In a test beam, a beam of particles is prepared in well defined conditions (particle type, energy, etc.) and it is typically used to test a device under development. This configuration inverts the typical experimental conditions, where a device with known properties is used to characterize particles in a beam or from collisions.
Figure 8.2.3: A test statistic in red, where we defined an acceptance region x ≤ 1 and a rejection region x > 1.
Given a test statistic, some parameters are usually defined when sizing a rejection region. The first one is the significance level of the test (see Fig. 8.2.4). It is defined as the integral of the null hypothesis p.d.f. above the decision boundary:
$$\alpha = \int_{t_{cut}}^{\infty} g(t|H_0)\,dt \qquad (8.2.1)$$
The probability α can be read as the probability to reject H0 even if H0 is in reality correct. This is called an error of the first kind. If we have an alternative hypothesis H1, an error of the second kind occurs when H0 is accepted but the correct hypothesis is in reality the alternative one, H1. The probability of an error of the second kind is the integral of the alternative hypothesis p.d.f. below tcut:
$$\beta = \int_{-\infty}^{t_{cut}} g(t|H_1)\,dt \qquad (8.2.2)$$
and 1 − β is called the power of the test to discriminate against the alternative hypothesis H1 (see Fig. 8.2.4).
A good test has both α and β small, which is equivalent to saying high significance and high power. This means that H0 and H1 are well separated.

Figure 8.2.4: Illustration of the acceptance and rejection regions both for the hypothesis H0 (on the left hand side) and the alternative H1 (on the right hand side) under the same choice of decision boundary.

Table 8.2.5 summarizes the different
ways to mistakenly interpret the data in terms of errors of the first and second kind. While errors of the first kind can be controlled by choosing α sufficiently small, errors of the second kind depend on the separation between the two hypotheses and are not as easily controllable. In HEP searches we typically speak of “evidence” when α ≤ 3·10⁻³, and of “discovery” when α ≤ 3·10⁻⁷ (corresponding to the probability outside 3σ and 5σ respectively in a single-sided gaussian tail); these numbers are purely conventional and have no deep scientific ground. They are defined this way to set a high threshold for such important claims about the observation of new phenomena.
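For a quick numerical check, the conversion between a number of standard deviations and the one-sided gaussian tail probability (and back) can be done for instance in Python with scipy; this is only a minimal sketch, equivalent in spirit to the ROOT calls used elsewhere in this chapter:

# One-sided gaussian tail probability for a given significance Z, and the inverse conversion.
from scipy.stats import norm

for z in (3.0, 5.0):
    p = norm.sf(z)                          # P(x > z) for a standard gaussian
    print(f"Z = {z:.0f} sigma  ->  one-sided p-value = {p:.2e}")

p = 2.9e-7
print(f"p = {p:.1e}  ->  Z = {norm.isf(p):.2f} sigma")   # back to an equivalent significance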
Figure 8.2.5: Example of errors of the first and second kind (Wikipedia).
Example We consider now a machine BM1 which is used for bonding wires of Si-detector
modules. The produced detectors had a scrap rate of P0 = 0.2. This machine BM1 should
now be replaced with a newer bonding machine called BM2, if (and only if) the new machine
can produce detector modules with a lower scrap rate P1 . In a test run we produce n = 30
modules. To verify P1 < P0 statistically, we use the hypothesis test discussed above. Define
the two hypotheses H0 and H1 as:
$$H_0: P_1 \ge 0.2; \qquad H_1: P_1 < 0.2. \qquad (8.2.3)$$
We choose α = 0.05 and our test statistic t is the number of malfunctioning detector modules. This quantity is distributed according to a binomial distribution, with the total number of produced modules n = 30 and a probability P. The rejection region for H0 is constructed out of
$$\sum_{i=0}^{n_c} \binom{n}{i}\, P_0^i\, (1-P_0)^{n-i} < \alpha. \qquad (8.2.4)$$
Here the critical value is denoted by nc; it is the maximal number of malfunctioning modules produced by BM2 which still implies a rejection of H0 at confidence level 1 − α. By going through the calculation we find that for nc = 2 the value of α is still just below 0.05. Thus the rejection region for H0 is K = {0, 1, 2}. This means that if we find two or fewer malfunctioning modules produced by BM2 we will replace BM1 by the new machine BM2. If there are 3 or more malfunctioning detector modules, the old bonding machine BM1 should be preferred.
□
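A minimal sketch of this calculation in Python with scipy (instead of doing the binomial sum by hand); the numbers n = 30, P0 = 0.2 and α = 0.05 are those of the example:

# Find the largest critical value nc such that P(t <= nc | H0: P = P0) stays below alpha.
from scipy.stats import binom

n, P0, alpha = 30, 0.2, 0.05
nc = 0
while binom.cdf(nc + 1, n, P0) < alpha:      # cumulative binomial probability
    nc += 1
print("critical value nc =", nc)                      # -> 2
print("P(t <= nc | H0)   =", binom.cdf(nc, n, P0))    # just below 0.05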
Once the test statistic is defined there is a trade-off between α and β: the smaller you make α, the larger β will be; it is up to the experimenter to decide what is acceptable and what is not.
Example Suppose we want to distinguish K⁻p elastic scattering events from inelastic scattering events where a π⁰ is produced: H0: K⁻p → K⁻p; H1: K⁻p → K⁻p π⁰. The detector used for this experiment is a spectrometer capable of measuring the momenta of all the charged particles (K⁻, p) but it is blind to neutral particles (π⁰). The considered test statistic is the “missing mass” M, defined as the difference between the initial and final visible mass. The true value of the missing mass is M = 0 under the null hypothesis H0 (no π⁰ produced) and Mπ⁰ = 135 MeV/c² under the alternative hypothesis H1 (a π⁰ is produced). The critical region can be defined as M > Mc. The value of Mc depends on the significance and power we want to obtain (see Fig. 8.2.6): a high value of Mc will correspond to a high significance at the expense of the power, while low values of Mc will result in a high power but low significance.
□
Figure 8.2.6: Top: the p.d.f. of the test statistic M under the null hypothesis of elastic scattering H0, centred at M = 0; bottom: the p.d.f. of the test statistic under the alternative hypothesis of inelastic scattering H1, centred at M = mπ⁰. Mc defines the critical region.
Some caution is necessary when using α. Suppose you have 20 researchers looking for a new phenomenon which “in reality” does not exist. Their H0 hypothesis is that what they see is only background. One of them is liable to reject H0 with α = 5%, while the other 19 will not. This is part of the game and therefore, before rushing for publication, that researcher should balance the claim against what the others do not see. That is the main reason why, any time there is a discovery claim, the result needs to be corroborated by independent measurements. We will come back to this point when we talk about the look-elsewhere effect.
Example Let’s use again the example of the electron/pion separation. As already shown, the specific ionization dE/dx of a charged particle can be used as a test statistic to distinguish particle types, for example electrons (e) from pions (π) (see Fig. 8.1.1). The selection efficiency is defined as the probability for a particle to pass the selection cut:
$$\epsilon_e = \int_{-\infty}^{t_{cut}} g(t|e)\,dt = 1-\alpha, \qquad \epsilon_\pi = \int_{-\infty}^{t_{cut}} g(t|\pi)\,dt = \beta \qquad (8.2.5)$$
By moving the value of tcut you can change the composition of your sample: the lower the value of tcut, the larger the electron efficiency but the higher the contamination from pions, and vice-versa. In general, one can set a value of tcut, select a sample, and work out the number of electrons Ne present in the initial sample (before the requirement t < tcut). The number of accepted particles in the sample is
$$N_{acc} = \epsilon_e N_e + \epsilon_\pi N_\pi = \epsilon_e N_e + \epsilon_\pi (N_{tot} - N_e) \qquad (8.2.6)$$
which gives
$$N_e = \frac{N_{acc} - \epsilon_\pi N_{tot}}{\epsilon_e - \epsilon_\pi}. \qquad (8.2.7)$$
From this one can immediately notice that Ne can only be calculated if εe ≠ επ, i.e. Ne can only be extracted if there is some separation power at all. If there are systematic uncertainties on εe or επ these will translate into an uncertainty on Ne. One should try to select the critical region tcut such that the total error on Ne is as small as possible.
Up to now we have used only the p.d.f. describing the probability that an electron/pion gives a certain amount of ionization; using Bayes’ theorem (see 1.6) we can invert the problem and ask what is the probability that a particle releasing a given ionization signal t is an electron or a pion:
$$h(e|t) = \frac{a_e\, g(t|e)}{a_e\, g(t|e) + a_\pi\, g(t|\pi)} \qquad (8.2.8)$$
$$h(\pi|t) = \frac{a_\pi\, g(t|\pi)}{a_e\, g(t|e) + a_\pi\, g(t|\pi)} \qquad (8.2.9)$$
where ae and aπ = 1 − ae are the prior probabilities for the electron and pion hypotheses. So to give the probability that a particle is an electron (or better, the degree of belief that a given particle with a measured t is an electron) one needs to know the prior probabilities for all possible hypotheses as well as their p.d.f.
The other side of the problem is to estimate the purity pe of the sample of candidates passing the requirement t < tcut:
$$p_e = \frac{\#\ \text{electrons with } t < t_{cut}}{\#\ \text{particles with } t < t_{cut}} \qquad (8.2.10)$$
$$= \frac{\int_{-\infty}^{t_{cut}} a_e\, g(t|e)\,dt}{\int_{-\infty}^{t_{cut}} \left[a_e\, g(t|e) + (1-a_e)\, g(t|\pi)\right] dt} \qquad (8.2.11)$$
$$= \frac{a_e\, \epsilon_e\, N_{tot}}{N_{acc}} \qquad (8.2.12)$$
□
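To make the chain of formulas (8.2.5)-(8.2.12) concrete, here is a minimal numerical sketch in Python. The gaussian shapes, their means and widths, and the electron fraction a_e are invented for illustration (they are not taken from Fig. 8.1.1); the electron distribution is placed below the cut purely so that the t < tcut convention of Eq. (8.2.5) keeps electrons:

# Efficiencies, extracted electron yield and purity for a cut t < t_cut,
# assuming (for illustration only) gaussian responses for the two species.
from scipy.stats import norm

mu_e, sig_e = 2.0, 1.0      # hypothetical electron response
mu_pi, sig_pi = 4.0, 1.0    # hypothetical pion response
a_e = 0.1                   # assumed electron fraction (prior) in the sample
N_tot = 10000
t_cut = 3.0                 # accept particles with t < t_cut

eps_e = norm.cdf(t_cut, mu_e, sig_e)       # electron efficiency, eq. (8.2.5)
eps_pi = norm.cdf(t_cut, mu_pi, sig_pi)    # pion contamination efficiency

N_e, N_pi = a_e * N_tot, (1 - a_e) * N_tot
N_acc = eps_e * N_e + eps_pi * N_pi                            # eq. (8.2.6)
N_e_extracted = (N_acc - eps_pi * N_tot) / (eps_e - eps_pi)    # eq. (8.2.7)
purity = a_e * eps_e * N_tot / N_acc                           # eq. (8.2.12)

print(f"eps_e = {eps_e:.3f}, eps_pi = {eps_pi:.3f}")
print(f"N_acc = {N_acc:.0f}, extracted N_e = {N_e_extracted:.0f}, purity = {purity:.2f}")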
In high energy physics a parallel nomenclature has been established over time to express the same concepts we have encountered in this section. Typically we call:
• background efficiency = probability to accept background = probability of a type-I error
• signal efficiency = power of the test = 1 − probability of a type-II error
• purity = probability for an event to be signal, once we have accepted it as signal
8.3 Is there a signal?
A typical application of hypothesis testing in high energy physics is to test for the presence of a signal in the data. The easiest case is represented by counting experiments. In this type of experiment the detector is used to count the number of events satisfying some selection criteria (slang: “cut-and-count”). The number of events expected in the case of the background-only hypothesis is compared with the measured number, and the signal would typically appear as an excess over the expected background².
Let n be a number of events which is the sum of some signal and some background events, n = ns + nb. Each of the components can be treated as a Poisson variable, with means νs (signal) and νb (background), and so the total n is also a Poisson variable with mean ν = νs + νb. The probability to observe n events is:
$$f(n; \nu_s, \nu_b) = \frac{(\nu_s + \nu_b)^n}{n!}\, e^{-(\nu_s + \nu_b)} \qquad (8.3.13)$$
Suppose you measure nobs events. To quantify our degree of confidence in the discovery of a new phenomenon, i.e. νs ≠ 0, we can compute how likely it is to find nobs events or more from background alone:
$$P(n \ge n_{obs}) = \sum_{n=n_{obs}}^{\infty} f(n; \nu_s=0, \nu_b) = 1 - \sum_{n=0}^{n_{obs}-1} f(n; \nu_s=0, \nu_b) = 1 - \sum_{n=0}^{n_{obs}-1} \frac{\nu_b^n}{n!}\, e^{-\nu_b}. \qquad (8.3.14)$$
For example, if we expect νb = 0.5 background events and we observe nobs = 5, then the p-value from (8.3.14) is 1.7·10⁻⁴. This is not the probability of the hypothesis νs = 0. It is rather the probability, under the assumption νs = 0, of obtaining as many events as observed or more.
Often the result of a measurement is given as the estimated value of a number of events plus or minus one standard deviation. Since the standard deviation of a Poisson variable is equal to the square root of its mean, from the previous example we would quote 5 ± √5 for our estimate of ν, i.e., after subtracting the expected background, 4.5 ± 2.2 for our estimate of νs. This is very misleading: it is only two standard deviations from zero, and it gives the impression that νs is not very incompatible with zero, but we have seen from the p-value that this is not the case. The subtlety is that we need to ask for the probability that a Poisson variable of mean νb will fluctuate up to nobs or higher, not for the probability that a gaussian variable with mean nobs will fluctuate down to νb or lower.

² The signal doesn’t always appear as an excess of events. In the case of neutrino disappearance experiments the signal is given by a deficit of events.
Another important point is that usually νb is known only within some uncertainty. If we set νb = 0.8 rather than 0.5, the p-value increases by almost an order of magnitude, to 1.4·10⁻³. It is therefore crucial to quantify the systematic uncertainty on the background when evaluating the significance of a new effect.
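A minimal sketch of this computation in Python with scipy; the Poisson survival function gives the sum in Eq. (8.3.14) directly:

# p-value for observing n_obs or more events when only background is expected.
from scipy.stats import poisson

n_obs = 5
for nu_b in (0.5, 0.8):
    p = poisson.sf(n_obs - 1, nu_b)      # P(n >= n_obs) = 1 - P(n <= n_obs - 1)
    print(f"nu_b = {nu_b}: p-value = {p:.2e}")
# nu_b = 0.5 gives ~1.7e-4 and nu_b = 0.8 gives ~1.4e-3, as quoted in the text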
In other types of searches the signal would reveal itself as a resonance, i.e. an excess of data in a localized region of a mass spectrum (slang: “bump hunt”), or as an excess of events in the tail of a distribution. Two examples are shown in Fig. 8.3.7. In these cases the signal is extracted from the background using a fit (more on this will be developed in the next sections): on top of the number of expected events, we also use the information about the shape.
Figure 8.3.7: Left: Higgs boson search in 2011. The data are well described by the background
only hypothesis. Right: search for an excess of events at high missing transverse energy.
8.4 Neyman-Pearson Lemma
We haven’t addressed so far the choice of tcut. The only thing we know up to now is that it affects the efficiency and the purity of the sample under study. Ideally what we want is to set the desired efficiency and, for that value, get the best possible purity.
Take the case of a simple hypothesis H0 and allow for an alternative hypothesis H1 (e.g. the typical situation of signal and background). The Neyman-Pearson lemma states that the acceptance region giving the highest power (i.e. the highest purity) for a given significance level α is the region of $\vec{t}$-space such that
$$\frac{g(\vec{t}\,|H_0)}{g(\vec{t}\,|H_1)} > c, \qquad (8.4.15)$$
where c is the knob we can tune to achieve the desired efficiency and $g(\vec{t}\,|H_i)$ is the distribution of $\vec{t}$ under the hypothesis Hi.
Basically what the lemma says is that there is a function r, defined as
$$r = \frac{g(\vec{t}\,|H_0)}{g(\vec{t}\,|H_1)},$$
that reduces the problem to a one-dimensional one and gives the best purity for a fixed efficiency. The function r is called the likelihood ratio for the simple hypotheses H0 and H1 (in the likelihood the data are fixed, the hypothesis is the variable). The corresponding acceptance region is given by r > c. Any monotonic function of r will do just as well and will lead to the same test.
The main drawback of the Neyman-Pearson lemma is that it is valid if and only if both H0 and H1 are simple hypotheses (and that is pretty rare). Even in those cases, in order to determine c one needs to know g(t|H0) and g(t|H1). These must be determined from Monte Carlo simulations (or data-driven techniques) for both hypotheses. The resulting p.d.f. is represented by a multidimensional histogram. This can cause some trouble when the dimensionality of the problem is high: say we have M bins for each of the n components of t, then the total number of bins is Mⁿ, i.e. Mⁿ parameters must be determined from Monte Carlo or data. A way to address this problem is by using a multivariate technique, as we will see in Chapter ??.
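As an illustration of the lemma only, the sketch below (Python) uses two invented gaussian p.d.f.s standing in for g(t|H0) and g(t|H1), builds the likelihood ratio r, and uses r > c as acceptance region; the resulting α and power correspond to the chosen value of c:

# Likelihood-ratio test for two simple hypotheses with known (here: gaussian) p.d.f.s.
import numpy as np
from scipy.stats import norm

g0 = norm(loc=0.0, scale=1.0)   # invented g(x|H0)
g1 = norm(loc=2.0, scale=1.0)   # invented g(x|H1)

def likelihood_ratio(x):
    """r = g(x|H0) / g(x|H1); accept H0 when r > c."""
    return g0.pdf(x) / g1.pdf(x)

c = 1.0                                   # the tunable knob of Eq. (8.4.15)
x = np.linspace(-5.0, 7.0, 4001)
dx = x[1] - x[0]
accept_H0 = likelihood_ratio(x) > c       # acceptance region in x

alpha = g0.pdf(x)[~accept_H0].sum() * dx  # P(reject H0 | H0 true)
power = g1.pdf(x)[~accept_H0].sum() * dx  # P(reject H0 | H1 true) = 1 - beta
print(f"alpha = {alpha:.3f}, power = {power:.3f}")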
8.5 Goodness of Fit
A typical application of hypothesis testing is the goodness of fit: quantifying how well the null hypothesis H0 (a function f(x)) describes a sample of data, without any specific reference to an alternative hypothesis. The test statistic has to be constructed such that it reflects the level of agreement between the observed data and the predictions of H0 (i.e. the values of f(x)).
The p-value is the probability, under the assumption of H, to observe data with equal or lesser compatibility with H, relative to the data we got. N.B. it is not the probability that H is true! For a frequentist the probability of a hypothesis is not even defined: probabilities are defined on the data. For a Bayesian the probability of the hypothesis is a different thing, and it is defined through Bayes’ theorem using a prior for the hypothesis.
8.5.1 The χ²-Test
We have already encountered the χ² as a goodness-of-fit test in Sec. 7.5. The χ²-test is by far the most commonly used goodness-of-fit test. Its first application is to a set of measurements xi and yi, where the xi are supposed to be exact (or at least with negligible uncertainty) and the yi are known with an uncertainty σi. We want to test the function f(x) which we believe gives (predicts) the correct value of yi for each value of xi; to do so we define the χ² as:
$$\chi^2 = \sum_{i=1}^{N} \frac{[y_i - f(x_i)]^2}{\sigma_i^2}. \qquad (8.5.16)$$
If the uncertainties on the yi measurements are correlated, the above formula becomes (in the lighter matrix notation, see Sec. 7.3):
$$\chi^2 = (\mathbf{y}^T - \mathbf{f}^T)\, V^{-1}\, (\mathbf{y} - \mathbf{f}) \qquad (8.5.17)$$
where V is the covariance matrix. A function that correctly describes the data will give small differences between the values predicted by the function f and the measurements yi. These differences reflect the statistical uncertainties on the measurements, so for N measurements the χ² should be roughly N.
Recalling the p.d.f. of the χ² distribution (see section 2.2.3):
$$P(\chi^2; N) = \frac{(\chi^2)^{N/2-1}\, e^{-\chi^2/2}}{2^{N/2}\, \Gamma(N/2)} \qquad (8.5.18)$$
(where the expectation value of this distribution is N, so χ²/N ∼ 1), we can base our decision boundary on the goodness of fit by defining the p-value:
$$p = \mathrm{Prob}(\chi^2; N) = \int_{\chi^2}^{\infty} P(\chi'^2; N)\, d\chi'^2 \qquad (8.5.19)$$
which is called the χ² probability. This expression gives the probability that the function describing the N measured data points yields a χ² as large as or larger than the one we obtained from our measurement.
Example Suppose you compute a χ² of 20 for N = 5 points. The naive reaction is that the function is a very poor model of the data (20/5 = 4 ≫ 1). To quantify that we compute the χ² probability $\int_{20}^{\infty} P(\chi^2; 5)\, d\chi^2$. In ROOT you can compute this as TMath::Prob(20,5) = 0.0012. The χ² probability is indeed very small and the H0 hypothesis should be discarded. □
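The same number can be obtained outside ROOT, e.g. in Python with scipy, since the χ² probability is just the upper tail of the χ² distribution (a two-line sketch):

# Chi-square probability: upper tail of the chi2 distribution.
from scipy.stats import chi2

print(chi2.sf(20.0, df=5))   # ~0.0012, the same as TMath::Prob(20,5)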
You have to be careful when using the χ² probability to take decisions. For instance, if the χ² is large, giving a very small χ² probability, it could be either that the function f is a bad representation of the data or that the uncertainties are underestimated. On the other hand, if you obtain a very small value of the χ², the function cannot be blamed, so you might have overestimated the uncertainties. It is up to you to interpret correctly the meaning of the probability. A very useful tool for this purpose is the pull distribution (see Sec. 6.11), where each entry is defined as (measured − predicted)/uncertainty = (yi − f(xi))/σi; if everything is done correctly (i.e. the model is correct and the uncertainties are computed correctly) the pulls will follow a normal distribution centred at 0 with width 1. If the pull distribution is not centred at 0 (bias) the model is incorrect; if it has a width larger than 1 either the uncertainties are underestimated or the model is wrong; if it has a width smaller than 1 the uncertainties are overestimated.
8.5.2 Degrees of freedom for χ²-test on fitted data
The concept of χ² developed above only works if you are given a set of data points and a function (model). If the function comes out of a fit to the data then, by construction, you will get a χ² which is smaller than expected, because you fit the parameters of the
function in order to minimize it.
This problem turns out to be very easy to treat: you just need to change the number of degrees of freedom in the computation. For example, suppose you have N points and you fitted m parameters of your function to minimize the χ² sum; then all you have to do to compute the new χ² probability is to reduce the number of d.o.f. to n = N − m.
Example You have a set of 20 points; you consider as function f(x) a straight line and you get χ² = 36.3. If you use a parabola you get χ² = 20.1. The straight line has 2 free parameters (slope and intercept), so the number of d.o.f. of the problem is 20 − 2 = 18; the χ² probability is TMath::Prob(36.3,18) = 0.0065, which makes the hypothesis that the data are described by a straight line improbable. If you now fit the data with a parabola you get TMath::Prob(20.1,17) = 0.27, which means that you cannot reject the hypothesis that the data are distributed according to a parabolic shape.
□
Notes on the χ²-test:
• For large numbers of d.o.f. the distribution of $\sqrt{2\chi^2}$ can be approximated with a gaussian distribution with mean $\sqrt{2n-1}$ and standard deviation 1. When, in the past, the integrals were extracted from tables this was a neat trick; it is still a useful simplification when the χ² is used in some explicit calculation.
• The χ²-test can also be used as a goodness-of-fit test for binned data. The number of events in bin i (i = 1, 2, . . . , n) is yi, with bin i having mean value xi. The predicted number of events is thus f(xi). The errors are given by Poisson statistics in each bin ($\sqrt{f(x_i)}$) and the χ² is
$$\chi^2 = \sum_{i=1}^{n} \frac{[y_i - f(x_i)]^2}{f(x_i)}, \qquad (8.5.20)$$
where the number of degrees of freedom is given by the number of bins minus the number of fitted parameters (do not forget the overall normalization of the model when counting the number of fitted parameters). A small numerical sketch is given after this list.
• When binning data, you should try to have enough entries per bin such that the computation of the χ² is actually meaningful; as a rule of thumb you should have at least 5 entries per bin. Most of the results for binned data are only true asymptotically, e.g. the normal limit of the multinomial p.d.f., the distribution of −2 ln λ, or the distribution of the χ².
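As referenced in the list above, here is a minimal sketch of the binned χ² of Eq. (8.5.20) in Python; the bin contents, the model prediction and the number of fitted parameters are invented for illustration:

# Binned chi-square goodness of fit with Poisson errors sigma_i^2 = f(x_i).
import numpy as np
from scipy.stats import chi2

y = np.array([12.0, 18.0, 25.0, 31.0, 22.0, 17.0, 9.0, 8.0])   # observed bin contents (invented)
f = np.array([9.0, 22.0, 28.0, 25.0, 25.0, 13.0, 11.0, 6.0])   # model prediction per bin (invented)
n_fitted_params = 2                                             # e.g. normalization and one shape parameter

chi2_val = np.sum((y - f) ** 2 / f)          # eq. (8.5.20)
ndof = len(y) - n_fitted_params
p_value = chi2.sf(chi2_val, ndof)
print(f"chi2 = {chi2_val:.1f}, ndof = {ndof}, p-value = {p_value:.2f}")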
8.5.3 Run test
The χ² collapses into a single number the level of agreement between a hypothesis and a set of data. There are cases where behind a χ²/n ∼ 1 hides in reality a very poor agreement between the data and the model. Consider the situation illustrated in Fig. 8.5.8: the 12 data points are fitted by a straight line, which clearly does not describe the data adequately. Nevertheless, in this example, the χ² is 12.0 and thus χ²/n = 1. In cases such as this one the run test provides important extra information.

Figure 8.5.8: Example for the application of the run test. The dashed line is the hypothesized fit (a straight line), whereas the crosses are the actual data.

The run test works like this: every time a measured data point lies Above the function we write an A in a sequence, and every time the data point lies Below the function we write a B. If the data are distributed according to the hypothesized function, then they should fluctuate up and down, creating very short sequences of A’s and B’s (runs). The sequence in the figure reads AAABBBBBBAAA, making only three runs and possibly pointing to a poor description of the data.
The probability of the A’s and B’s giving a particular number of runs can be calculated. Suppose there are NA points above and NB points below, with N = NA + NB. The total number of possible combinations without repetition is given by (see Chapter 1):
$$\binom{N}{N_A} = \frac{N!}{N_A!\, N_B!} \qquad (8.5.21)$$
This will be our denominator. For the numerator, suppose that r is even and the sequence starts with an A. There are NA A-points and r/2 − 1 divisions between them (occupied by the B runs); these divisions can be placed in the NA − 1 gaps between the A-points in $\binom{N_A-1}{r/2-1}$ different ways. The same argument can be made for the B’s. So we find for the probability of r runs:
$$P_r = \frac{2\, \binom{N_A-1}{r/2-1} \binom{N_B-1}{r/2-1}}{\binom{N}{N_A}}, \qquad (8.5.22)$$
where the extra factor of 2 is there because we chose to start with an A and we could have
started with a B. When r is odd you get:
$$P_r = \frac{\binom{N_A-1}{(r-3)/2} \binom{N_B-1}{(r-1)/2} + \binom{N_A-1}{(r-1)/2} \binom{N_B-1}{(r-3)/2}}{\binom{N}{N_A}} \qquad (8.5.23)$$
These are the probabilities to get r runs given a sequence of A’s and B’s. It can be shown that
$$\langle r\rangle = 1 + \frac{2 N_A N_B}{N} \qquad (8.5.24)$$
$$V(r) = \frac{2 N_A N_B\, (2 N_A N_B - N)}{N^2 (N-1)} \qquad (8.5.25)$$
In the example above the number of expected runs is ⟨r⟩ = 1 + 2·6·6/12 = 7 with σ = 1.65. The deviation between the expected and the observed number is 7 − 3 = 4, which amounts to 2.4 standard deviations and is significant at the 1% level for a one-sided test. Thus the straight-line fit can be rejected despite the (far too) good χ² value.
The run test does not substitute the χ² test, it is in a sense complementary: the χ² test ignores the sign of the fluctuations, while the run test only looks at them.
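The counting of runs and the gaussian approximation based on Eqs. (8.5.24)-(8.5.25) can be sketched in a few lines of Python; the sequence below is the one read off Fig. 8.5.8:

# Run test: count runs in a sequence of A's and B's and compare with the expectation.
import math

seq = "AAABBBBBBAAA"                 # sequence read from Fig. 8.5.8
N_A = seq.count("A")
N_B = seq.count("B")
N = N_A + N_B

# number of runs = number of positions where the letter changes, plus one
r_obs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

r_exp = 1 + 2 * N_A * N_B / N                                       # eq. (8.5.24)
var_r = 2 * N_A * N_B * (2 * N_A * N_B - N) / (N ** 2 * (N - 1))    # eq. (8.5.25)
z = (r_exp - r_obs) / math.sqrt(var_r)
print(f"observed runs = {r_obs}, expected = {r_exp:.1f} +- {math.sqrt(var_r):.2f}, z = {z:.1f}")
# -> 3 observed against 7 +- 1.65 expected, about 2.4 standard deviations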
8.5.4 Unbinned tests
Unbinned tests are used when the binning procedure would result in too large a loss of information (e.g. when the data set is small). They are all based on the comparison of the cumulative distribution function (c.d.f.) F(x) of the model f(x) under some hypothesis H0 with the c.d.f. of the data. To define a c.d.f. on the data we define an order statistic, i.e. a rule to order the data³, and then define on it the Empirical Cumulative Distribution Function (e.c.d.f.):
$$S_n(x) = \begin{cases} 0, & x < x_1 \\ r/n, & x_r \le x < x_{r+1} \\ 1, & x_n \le x \end{cases} \qquad (8.5.26)$$
This is just the fraction of events not exceeding x (a staircase function going from 0 to 1), see Fig. 8.5.10.
The first unbinned test we describe is the Smirnov-Cramér-von Mises test. We define a measure of the distance between Sn(x) and F(x) as:
$$W^2 = \int_{-\infty}^{+\infty} \left[S_n(x) - F(x)\right]^2\, dF(x) \qquad (8.5.27)$$
(dF(x) can in general be replaced by a non-decreasing weight). Inserting the explicit expression of Sn(x) in this definition we get:
$$nW^2 = \frac{1}{12n} + \sum_{i=1}^{n} \left[\frac{2i-1}{2n} - F(x_i)\right]^2 \qquad (8.5.28)$$
From the asymptotic distribution of nW² the critical regions can be computed; frequently used test sizes are given in Tab. 8.5.9. The asymptotic distribution is reached remarkably rapidly (in this table the asymptotic limit is already reached for n ≥ 3).
The Kolmogorov-Smirnov test follows the same idea of comparing the model c.d.f. with the data e.c.d.f., but it defines a different metric for the distance between the two. The test statistic is $d := D\sqrt{N}$, where D is the maximal vertical difference between Fn(x) and F(x) (see Fig. 8.5.10):
$$D := \max_x |F_n(x) - F(x)|$$
The hypothesis H0 corresponding to the function f(x) is rejected if d is larger than a given critical value. The tail probability P(d > t0) can be calculated in ROOT with TMath::KolmogorovProb(t0).
³ In 1-D the ordering is trivial (ascending/descending); in n-D it is arbitrary: you have to choose a convention and map the data onto a 1-D sequence.
Figure 8.5.9: Rejection regions for the Smirnov-Cramér-von Mises test for some typical test sizes.
Some values are reported in table 8.5.1.
Figure 8.5.10: Example of a c.d.f. and an e.c.d.f. The arrow indicates the largest distance, used by the Kolmogorov-Smirnov test.
Table 8.5.1: Critical values t0 for various significances 1 − α.

  α          99%   95%   50%   32%   5%    1%    0.1%
  P(d ≤ t0)  1%    5%    50%   68%   95%   99%   99.9%
  t0         0.44  0.50  0.83  0.96  1.36  1.62  1.95
The Kolmogorov-Smirnov test can also be used to test whether two data sets have been drawn from the same parent distribution. Take the two histograms corresponding to the data to be compared and normalize them (such that the cumulative distribution plateaus at 1). Then compare the e.c.d.f. of the two histograms and compute the maximum distance as before (in ROOT use h1.KolmogorovTest(h2)).
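Outside ROOT the same comparisons can be made, for instance, with scipy; the sketch below shows both the one-sample test against a model c.d.f. and the two-sample test (the data sets are randomly generated toys, used only for illustration):

# Kolmogorov-Smirnov tests on unbinned data.
import numpy as np
from scipy.stats import norm, kstest, ks_2samp

rng = np.random.default_rng(1)
data1 = rng.normal(0.0, 1.0, size=200)   # toy data set 1
data2 = rng.normal(0.2, 1.0, size=150)   # toy data set 2, slightly shifted

# one-sample test: compare data1 with the c.d.f. of a standard gaussian (H0)
stat1, p1 = kstest(data1, norm.cdf)
# two-sample test: are data1 and data2 drawn from the same parent distribution?
stat2, p2 = ks_2samp(data1, data2)
print(f"one-sample: D = {stat1:.3f}, p = {p1:.2f}")
print(f"two-sample: D = {stat2:.3f}, p = {p2:.2f}")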
Notes on the Kolmogorov-Smirnov test:
• the test is more sensitive to departures of the data from the median of H0 than to departures in the width (i.e. more sensitive to the core than to the tails of the distributions);
• the test becomes meaningless if the H0 p.d.f. is a fit to the data. This is due to the fact that there is no equivalent of the number of degrees of freedom as in the χ²-test, hence it cannot be corrected for.
8.6 Two-sample problem
In this section we will look at the problem of telling whether two samples are compatible with each other, i.e. whether both are drawn from the same parent distribution. Clearly the complication is that, even if they are compatible, they will exhibit differences coming from statistical fluctuations. In the following we will examine some typical examples of two-sample problems.
8.6.1 Two gaussians, known σ
Suppose you have two random variables X and Y distributed as gaussians of known width. Typical situations are when you have two measurements taken with the same device with a known resolution, or two samples taken under different conditions where the variances of the parent distributions are known (you have the two means ⟨X⟩ and ⟨Y⟩ and the uncertainties on the means, σx/√Nx and σy/√Ny).
This problem is equivalent to checking whether X − Y is compatible with 0. The variance of X − Y is V(X − Y) = σx² + σy², and so the problem boils down to how many σ the difference is from 0: $(X - Y)/\sqrt{\sigma_x^2 + \sigma_y^2}$.
More generally, what you are doing is defining a test statistic $\frac{|\langle x\rangle - \mu_0|}{\sigma/\sqrt{N}}$ (in the previous case µ0 = 0) and a double-sided rejection region. This means that you choose the significance of your test (α) and set as rejection region the (symmetric) values beyond $u_{1-\alpha/2}$ on the corresponding gaussian:
$$\int_{-\infty}^{-u_{1-\alpha/2}} G(x; \mu_0, \sigma)\,dx = \int_{u_{1-\alpha/2}}^{+\infty} G(x; \mu_0, \sigma)\,dx = \frac{\alpha}{2} \qquad (8.6.29)$$
If the measured difference ends up in the rejection region (either of the two tails) then the two samples are to be considered different.
You can also decide to test whether X > Y (or Y > X). In this case the test statistic is $\frac{\langle x\rangle - \mu_0}{\sigma/\sqrt{N}}$ and the rejection region becomes single-sided, $(u_{1-\alpha}, \infty)$ (or $(-\infty, -u_{1-\alpha})$).
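A minimal sketch of the double-sided version of this test in Python (all numbers are invented):

# Two-sided test for the compatibility of two means with known parent sigmas.
import math
from scipy.stats import norm

mean_x, sigma_x, N_x = 10.3, 0.5, 25     # invented sample 1: mean, parent sigma, size
mean_y, sigma_y, N_y = 10.0, 0.5, 25     # invented sample 2

sigma_d = math.sqrt(sigma_x**2 / N_x + sigma_y**2 / N_y)   # sigma of the difference of the means
z = (mean_x - mean_y) / sigma_d
p_two_sided = 2 * norm.sf(abs(z))
print(f"z = {z:.2f}, two-sided p-value = {p_two_sided:.3f}")

alpha = 0.05
u = norm.isf(alpha / 2)                  # decision boundary u_{1-alpha/2}
print("reject H0 (equal means)" if abs(z) > u else "do not reject H0")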
8.6.2 Two gaussians, unknown σ
The problem is like the previous one, you are comparing two gaussian distributions with means ⟨X⟩ and ⟨Y⟩, but this time you do not know the parent standard deviations. All you can do is estimate them from the samples at hand:
$$s_x = \sqrt{\frac{\sum_i (x_i - \langle x\rangle)^2}{N_x - 1}}; \qquad s_y = \sqrt{\frac{\sum_i (y_i - \langle y\rangle)^2}{N_y - 1}}. \qquad (8.6.30)$$
Because we are using the estimated standard deviations we have to use the Student’s t to assess the significance, and not the gaussian p.d.f. as in the previous case (see Sec. 2.2.7). So we build a variable which is the ratio between a gaussian and a χ²-distributed variable. The expression
$$\frac{\langle x\rangle - \langle y\rangle}{\sqrt{\sigma_x^2/N_x + \sigma_y^2/N_y}} \qquad (8.6.31)$$
under the null hypothesis that the two distributions have the same mean, is a gaussian centred at zero with standard deviation one.
The sum
$$\frac{(N_x-1)\,s_x^2}{\sigma_x^2} + \frac{(N_y-1)\,s_y^2}{\sigma_y^2} \qquad (8.6.32)$$
is a χ² with Nx + Ny − 2 d.o.f.
If we take the ratio of the two (assuming that the unknown parent standard deviations satisfy σx = σy, so that they cancel out in the ratio) we get the definition of a t-distributed variable:
$$t = \frac{\langle x\rangle - \langle y\rangle}{S\,\sqrt{1/N_x + 1/N_y}} \qquad (8.6.33)$$
where
$$S^2 = \frac{(N_x-1)\,s_x^2 + (N_y-1)\,s_y^2}{N_x + N_y - 2} \qquad (8.6.34)$$
(S is called the “pooled estimate” of the standard deviation, as it is the combined estimate from the two samples, appropriately weighted. The term $S\sqrt{1/N_x + 1/N_y}$ is analogous to the standard error on the mean, $\sigma/\sqrt{N}$, used when σ is known.)
The variable t is distributed as a Student’s t with Nx + Ny − 2 d.o.f. With this variable we can now use the same testing procedure (double- or single-sided rejection regions) as in Sec. 8.6.1, substituting the c.d.f. of the gaussian with the c.d.f. of the Student’s t.
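A sketch of this procedure in Python on invented toy samples; scipy's ttest_ind with equal_var=True implements the pooled-variance Student's t, and the same t value is also recomputed by hand following Eqs. (8.6.33)-(8.6.34):

# Two-sample Student's t test with unknown but equal parent sigmas.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
x = rng.normal(5.0, 1.0, size=12)    # toy sample 1
y = rng.normal(5.8, 1.0, size=15)    # toy sample 2

t_stat, p_two_sided = ttest_ind(x, y, equal_var=True)   # pooled-variance t-test
print(f"t = {t_stat:.2f}, two-sided p-value = {p_two_sided:.3f}, ndof = {len(x) + len(y) - 2}")

# the same t statistic by hand, eqs. (8.6.33)-(8.6.34)
Nx, Ny = len(x), len(y)
S2 = ((Nx - 1) * x.var(ddof=1) + (Ny - 1) * y.var(ddof=1)) / (Nx + Ny - 2)
t_manual = (x.mean() - y.mean()) / np.sqrt(S2 * (1 / Nx + 1 / Ny))
print(f"t (by hand) = {t_manual:.2f}")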
8.6.3 F-test
The F-test is used to test whether the variances of two samples, of size n1 and n2 respectively, are compatible. Because the true variances are not known, the sample variances V1 and V2 are used to build the ratio F = V1/V2. Recalling the definition of the sample variance, we can write:
$$F = \frac{V_1}{V_2} = \frac{\frac{1}{n_1-1}\sum_{i=1}^{n_1}(x_{1i} - \bar{x}_1)^2}{\frac{1}{n_2-1}\sum_{i=1}^{n_2}(x_{2i} - \bar{x}_2)^2} \qquad (8.6.35)$$
(by convention the bigger sample variance is put in the numerator, such that F ≥ 1). Intuitively the ratio will be close to 1 if the two variances are similar, while it will take a large value if they are not. When you divide the sample variance by σ² you obtain a random variable which is distributed as a χ² with N − 1 d.o.f. Given that the random variable F is the ratio of two such variables, the σ cancels and we are left with the ratio of two χ²-distributed variables, with f1 = N1 − 1 d.o.f. for the numerator and f2 = N2 − 1 d.o.f. for the denominator.
The variable F follows the F-distribution with f1 and f2 degrees of freedom, F(f1, f2):
$$P(F) = \frac{\Gamma\!\left(\frac{f_1+f_2}{2}\right)}{\Gamma\!\left(\frac{f_1}{2}\right)\Gamma\!\left(\frac{f_2}{2}\right)}\, \sqrt{f_1^{f_1} f_2^{f_2}}\; \frac{F^{f_1/2-1}}{(f_2 + f_1 F)^{(f_1+f_2)/2}} \qquad (8.6.36)$$
For large numbers of degrees of freedom the variable
$$Z = \frac{1}{2} \ln F \qquad (8.6.37)$$
converges to a gaussian distribution with mean $\frac{1}{2}\left(1/f_2 - 1/f_1\right)$ and variance $\frac{1}{2}\left(1/f_2 + 1/f_1\right)$. In ROOT you can use the function ROOT::Math::fdistribution_pdf.
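A sketch of the F-test in Python on invented samples; the p-value is the upper tail of the F-distribution with f1 and f2 degrees of freedom:

# F-test for the compatibility of two sample variances.
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(3)
x1 = rng.normal(0.0, 1.0, size=20)   # toy sample 1
x2 = rng.normal(0.0, 1.5, size=25)   # toy sample 2, larger true spread

V1, V2 = x1.var(ddof=1), x2.var(ddof=1)
# by convention put the larger sample variance in the numerator, so that F >= 1
if V1 >= V2:
    F, f1, f2 = V1 / V2, len(x1) - 1, len(x2) - 1
else:
    F, f1, f2 = V2 / V1, len(x2) - 1, len(x1) - 1

p_one_sided = f_dist.sf(F, f1, f2)
print(f"F = {F:.2f} with ({f1}, {f2}) d.o.f., one-sided p-value = {p_one_sided:.3f}")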
Example Background model for the H → γγ search: the collected diphoton events are divided into several categories (based on resolution and S/B, to optimize the analysis sensitivity). Once a model for the background is chosen (e.g. a polynomial), the number of d.o.f. for that model (e.g. the order of the polynomial) can be chosen using an F-test. The main idea is to gradually increase the number of d.o.f. until you no longer see a significant decrease in the variance.
□
Table 8.6.2: Summary of the hypothesis tests for the two-sample problem.

Comparison of two normal distributions with means µ1, µ2 and known σ1², σ2²
(x̄i: arithmetic mean of sample i;  σd := √(σ1²/n1 + σ2²/n2))

  H0          H1          Test statistic     Rejection region
  µ1 ≤ µ2     µ1 > µ2     (x̄1 − x̄2)/σd       (u_{1−α}; ∞)
  µ1 ≥ µ2     µ1 < µ2     (x̄1 − x̄2)/σd       (−∞; −u_{1−α})
  µ1 = µ2     µ1 ≠ µ2     |x̄1 − x̄2|/σd       (u_{1−α/2}; ∞)

Comparison of µ1 and µ2 with unknown (but supposed equal) σ1² and σ2²
(Sd := √[ ((n1−1)S1² + (n2−1)S2²)/(n1+n2−2) · (n1+n2)/(n1 n2) ];  f = n1 + n2 − 2;
 β is calculated with the non-central t-distribution)

  H0          H1          Test statistic     Rejection region
  µ1 ≤ µ2     µ1 > µ2     (x̄1 − x̄2)/Sd       (t_{f;1−α}; ∞)
  µ1 ≥ µ2     µ1 < µ2     (x̄1 − x̄2)/Sd       (−∞; −t_{f;1−α})
  µ1 = µ2     µ1 ≠ µ2     |x̄1 − x̄2|/Sd       (t_{f;1−α/2}; ∞)

F-test: hypotheses about σ1² and σ2² of two normal distributions (Ni = ni − 1)

  H0          H1          Test statistic     Rejection region
  σ1² ≤ σ2²   σ1² > σ2²   S1²/S2²            A1 = (F_{N1;N2;1−α}; ∞)
  σ1² ≥ σ2²   σ1² < σ2²   S1²/S2²            A2 = (0; F_{N1;N2;α})
  σ1² = σ2²   σ1² ≠ σ2²   S1²/S2²            A1 and A2
8.6.4 Matched and correlated samples
In the previous sections we have seen how to compare two samples under different hypotheses. The tests are more discriminating the smaller the variances are. Correlations between the two samples can be used to our advantage to reduce the variance. Take as test statistic
$$\sum_i (x_i - y_i) \qquad (8.6.38)$$
where each data point of the first sample is paired to a corresponding one in the second sample. The variance of each difference is
$$V(x - y) = \sigma_x^2 + \sigma_y^2 - 2\rho\,\sigma_x \sigma_y \qquad (8.6.39)$$
Now, if the two samples are correlated (ρ > 0) the variance is reduced, which makes the test more discriminating.
Example A consumer magazine is testing a widget claimed to increase fuel economy. The data on seven cars are reported in Fig. 8.6.11. Is there evidence for any improvement?

Figure 8.6.11: Data from seven cars.

If you ignore the matching, the means are 38.6 ± 3.0 and 35.6 ± 2.3 m.p.g. for the samples with and without the widget: the improvement of 3 m.p.g. is within the statistical uncertainties. Now look instead at the differences. Their average is 3.0 and the estimated standard deviation s of the differences is 3.6, so the error on the estimated average is 3.6/√7 ≈ 1.4, and t = 3.0/1.4 ≈ 2.2. This is significant at the 5% level using Student’s t (one-tailed test, 6 degrees of freedom, tcritic = 1.943).
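The arithmetic of the matched comparison can be checked with a few lines of Python, using only the summary numbers quoted above (with the full per-car data one would instead call scipy.stats.ttest_rel on the two columns):

# Paired comparison from the summary numbers of the widget example.
import math
from scipy.stats import t as t_dist

n = 7                 # number of matched pairs (cars)
mean_diff = 3.0       # average improvement in m.p.g.
s_diff = 3.6          # estimated standard deviation of the differences

se = s_diff / math.sqrt(n)               # error on the average difference
t_val = mean_diff / se
p_one_sided = t_dist.sf(t_val, df=n - 1)
print(f"t = {t_val:.2f}, one-sided p-value = {p_one_sided:.3f}")   # t ~ 2.2, p < 0.05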
8.6.5 The most general test
As we already said, the more precisely the test can be formulated, the more discriminating it will be. The most general two-sample test makes no assumptions at all about the two distributions; it just asks whether the two are the same.
You can apply an unbinned test like the Kolmogorov-Smirnov (as explained in Sec. 8.5.4) by ordering the two samples and computing the maximal distance between the two e.c.d.f. Or you can approach the problem by ordering both samples together and then applying a run test. If the two samples are drawn from the same parent distribution there will be several very short runs; if on the other hand the two samples come from different parent distributions you will have long runs from both samples. This test can be tried only if the number of points in sample A is similar to the number in sample B.
Example Two samples A and B from the same parent distribution will give something like AABBABABAABBAABABAABBBA.
Two samples from two narrow distributions with different means will give something like AAAAAAAAAABBBBBBBBBBBBB.
□
8.7 References
• G. Cowan, “Statistical Data Analysis”, Ch. 4
• R. Barlow, “A guide to the use of statistical methods in the physical sciences”, Ch. 8
• W. Metzger, “Statistical Methods in Data Analysis”, Ch. 10