DTC Quantitative Methods Statistical Inference II: Statistical Testing

DTC Quantitative Research Methods
Statistical Inference II:
Statistical Testing
Thursday 7th November 2014
Hypothesis testing
• Imagine that we know that the mean income of university
graduates is £16,500. We then do a survey of 64 sociology
graduates and find that they earn a mean income of £15,400
with a standard deviation of £4,000. Can we say that this is
convincing evidence that sociology graduates earn
less than other graduates?
• The null hypothesis here is that sociology graduates earn the
same as other graduates. This is a hypothesis of no difference.
The alternative hypothesis is that there is a difference.
• The null hypothesis (or H0) is usually one of no difference, and the
alternative hypothesis (or Ha) is usually one of difference.
• When we carry out statistical tests, we attempt, as here, to reject
the null hypothesis at a 95% level of confidence (or sometimes at
a 99% or 99.9% level).
Statistical significance
• A conclusion (e.g. that a difference or
relationship exists) is statistically significant if
the probability that the conclusion would be
drawn if it is, in fact, erroneous falls below the
significance level chosen (in social science
research this is often 5% = 0.05 = 1 in 20).
• The significance level is sometimes referred to
as alpha (α).
Hypothesis testing
• So, thinking about the example again:
Imagine that we know that the mean income of university graduates is
£16,500. We then do a survey of 64 sociology graduates and find that
they earn a mean income of £15,400 with a standard deviation of
£4,000. Can we say that this is convincing evidence that sociology
graduates earn less than other graduates?
• If we construct a 95% confidence interval for the population mean
income of sociology graduates it will look like this:
– 15,400 plus or minus 1.96 x (4,000 / 64)
– 15,400 plus or minus 1.96 x (4,000 / 8)
– 15,400 plus or minus 980  £14,420 to £16,380
• The top point of this range is still below the mean income for
graduates generally – there is no overlap. This means that there is
less than a 5% chance that a difference as big as £1,100 would
have occurred if there is no difference between sociology graduates’
mean income and the mean income for all graduates.
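The interval can be checked with a few lines of Python. This is only a sketch of the arithmetic on this slide; the variable names are illustrative and the 1.96 multiplier is the usual large-sample value for 95% confidence.

```python
import math

# Figures from the example on this slide.
sample_mean = 15_400      # mean income of the sociology graduates (£)
sample_sd = 4_000         # sample standard deviation (£)
n = 64                    # sample size
population_mean = 16_500  # known mean income of all graduates (£)

z_95 = 1.96  # critical z-value for a 95% confidence interval

standard_error = sample_sd / math.sqrt(n)  # 4,000 / 8 = 500
margin = z_95 * standard_error             # 1.96 x 500 = 980
ci_lower = sample_mean - margin
ci_upper = sample_mean + margin

print(f"95% CI: £{ci_lower:,.0f} to £{ci_upper:,.0f}")
# Is the mean for all graduates inside the interval?
print("Graduate mean inside interval:", ci_lower <= population_mean <= ci_upper)
```

The second line printed should be False, matching the 'no overlap' conclusion.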
p-values
• A p-value quantifies the statistical significance
of a result.
• More precisely, it is the probability that a
difference or relationship of equal or greater
magnitude to that observed would have
occurred if there were no difference/relationship
in the population (i.e. if the null hypothesis is
correct).
Back to the example…
• In the example, the standard error (i.e. the standard
deviation of the sample mean) is equal to (4,000 / √64)
= 500.
• Thus the sample mean is 1,100/500 = 2.2 standard
errors away from the suggested population mean.
• Statistical theory tells us that 95% of sample means are
within 1.96 standard errors of the population mean.
• And also tells us that 97.2% of sample means are
within 2.2 standard errors of the population mean.
• Hence the p-value for the difference of 2.2 standard
errors (which is a test statistic) is (100-97.2)/100 =
0.028
• Since p < 0.05, it is statistically significant at the
conventional 5% significance level.
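The same p-value can be obtained without a table. A minimal sketch, assuming a two-tailed z-test and using the standard normal distribution from Python's standard library:

```python
from statistics import NormalDist

# Figures from the example.
standard_error = 4_000 / 64 ** 0.5        # 500
z = (15_400 - 16_500) / standard_error    # test statistic: -2.2

# Two-tailed p-value: the chance of a sample mean at least 2.2
# standard errors away from the population mean, in either direction.
p_value = 2 * NormalDist().cdf(-abs(z))

print(round(z, 2), round(p_value, 3))
```

This should reproduce the test statistic of 2.2 standard errors (with a negative sign, since the sample mean is below the population mean) and a p-value of about 0.028.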
Hypothesis testing
Theory
You test out particular hypotheses with reference to your
sample statistics. However these hypotheses are about
underlying population characteristics (parameters)
Procedure
• Set up ‘null’ (and ‘alternative’) hypothesis
• Note sample size and design
• Establish sampling distribution under the assumption
that the null hypothesis is true
• Identify decision rule (i.e. what constitutes
acceptance/rejection of the null hypothesis)
• Compute sample statistic(s), and apply the decision rule
(N.B. This is where Type I and Type II errors can occur).
Error Types
                               Truth about population
Decision (based on
hypothesis test)               H0 true             Ha true
Reject H0                      Type I error        Correct decision
Do not reject H0               Correct decision    Type II error
Note: For a given sample size, reducing the chance of one type of error occurring
increases the chance that the other type will!
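A small simulation can make the Type I error rate concrete. The sketch below is illustrative only: it assumes normally distributed incomes with a known standard deviation, reuses the figures from the income example, and draws repeated samples from a population in which H0 is true. The function name, seed and number of trials are arbitrary choices, not part of the lecture.

```python
import random
from statistics import NormalDist

def type_i_error_rate(trials=2_000, n=64, mu=16_500, sigma=4_000,
                      alpha=0.05, seed=1):
    """Share of samples that wrongly reject a TRUE null hypothesis."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    se = sigma / n ** 0.5
    rejections = 0
    for _ in range(trials):
        # Draw a sample from a population where H0 really is true.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        z = (sum(sample) / n - mu) / se
        if abs(z) > z_crit:
            rejections += 1  # a Type I error
    return rejections / trials

rate = type_i_error_rate()
print(f"Proportion of Type I errors: {rate:.3f}")
```

The proportion printed should be close to the chosen significance level of 0.05.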
(Statistical) Power
• Power is defined as the probability that a test
will correctly reject the null hypothesis, i.e.
correctly conclude that there is a difference,
relationship, etc.
• The probability of a Type II error is sometimes
labelled beta (β), hence power equals 1-β.
• The power of a test depends on the size of the
effect (which is, of course, unknown!)
What is the point of power?
• Power also depends on the sample size and the
significance level chosen.
• So if we want to use the usual 5% significance
level (to obtain ‘95% confidence’ in our results)
and we want to be able to identify an effect of a
given size, we can calculate how likely, for a given
sample size, we are to find an effect of that size,
assuming such an effect exists.
• If the power of a test is low, there is little point in
applying it, which suggests a need for a larger
sample.
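As a rough illustration of how power depends on sample size, the sketch below computes the power of a two-tailed z-test. It assumes, purely for illustration, that the true difference is £1,100 and the standard deviation is £4,000 (the figures from the income example); the function name is hypothetical.

```python
from statistics import NormalDist

def z_test_power(effect, sd, n, alpha=0.05):
    """Power of a two-tailed z-test to detect a true difference of `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # True difference expressed in standard errors.
    shift = abs(effect) / (sd / n ** 0.5)
    # Probability that the test statistic falls beyond either critical value.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

for n in (16, 64, 256):
    print(n, round(z_test_power(1_100, 4_000, n), 2))
```

Under these assumptions a sample of 64 gives power of about 0.59, so a real difference of this size would be missed roughly four times in ten; the larger sample does much better.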
Never innocent…
• Rather than deciding between ‘guilty’ and ‘innocent’,
statistical tests decide between ‘guilty’ and ‘not
proven’.
• In other words, a statistically insignificant or non-significant result (sometimes indicated by NS rather
than, say, p > 0.05) does not indicate that a difference
or relationship does not exist, but simply that there is
insufficient evidence to conclude that one does exist!
• This leaves open the possibility of a small difference or
weak relationship, which the statistical test was
insufficiently powerful to identify…
Applying the logic of a statistical test…
There are a large number of different statistical tests that use inferential methods
to ask questions about different forms of differences/relationships:
• Is the sample mean sufficiently different from the suggested population
mean that it is implausible that the suggested population mean is correct?
Testing the plausibility of a suggested population mean (via a z-test). [This is
what we’ve just done].
• Are the means from two samples sufficiently different for it to be
implausible that the populations from which they come are actually the
same? Test via a two-sample t-test, or if comparing more than two (sub-)
samples (i.e. more than two groups) testing for differences via Analysis of
Variance (usually referred to as ANOVA).
• Are the observed frequencies in a cross-tabulation sufficiently different
from what one would have expected to have seen if there were no
relationship in the population for the idea that there is no relationship in the
population to be implausible? Test this via a chi-square test.
In each instance we are asking whether the difference between the actual
(observed) data and what one would have expected to have seen, given
some hypothesis H0, is sufficiently large that the hypothesis is implausible.
Thus we are always trying to disprove a (null) hypothesis.
(Two sample) t-tests
• Test the null hypothesis, which is:
H0: μ1 = μ2
or
H0: μ1 − μ2 = 0
i.e. the equality of means
• The alternative hypothesis is:
Ha: μ1 ≠ μ2 or Ha: μ1 − μ2 ≠ 0
What does a t-test measure?
Note: T = treatment group and C = control group. (The above depicts a comparison in
experimental research; in most discussions the groups tend just to be labelled as
groups 1 and 2, indicating different groups.)
Example
• We want to compare the average amounts of
television watched by Australian and by British
children.
• We have a sample of Australian and a sample of
British children. We could say that what we have and
want to do are something like this:
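[Diagram: the sample of Australian children is used to make an inference about the
population of Australian children, and the sample of British children to make an
inference about the population of British children. It is the two populations that
we want to compare.]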
Example (continued)
• Here the dependent variable is number of
hours of TV watched each night
• And the independent variable is nationality
(or, perhaps, national context).
• When we are comparing means SPSS calls
the independent variable the grouping
variable and the dependent variable the
test variable.
Example (continued)
• If the null hypothesis, hypothesising no difference
between the two groups, was correct (and children thus
watch the same average amount of television in Australia
as in Britain), we would assume that if we took repeated
samples from the two groups the difference in means
between them would generally be small or zero.
• However it is highly likely that the difference between
any two particular samples will not be zero.
• Therefore we need to know the sampling
distribution of the difference between the two sample
means.
• We use this distribution to determine the probability of
getting an observed difference (of a given size) between
two sample means from populations with no difference.
If we take a large number of random samples and calculate the difference between each
pair of sample means, we will end up with a sampling distribution that has the following
properties:
It will be a t-distribution.
The mean of the difference between sample means will be zero if the null hypothesis is
correct.
Mean (M1 – M2) = 0
The ‘average’ spread of scores around this mean of zero (the standard error) will be
defined by the formula:
$$S_{DM} = \sqrt{\left[\frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}\right]\left[\frac{1}{N_1} + \frac{1}{N_2}\right]}$$
This estimate ‘pools’ the variance in the groups – just take it at face value!
Back to the example…
When we are choosing the test of significance it is important to note that:
1. We are making an inference from TWO samples (of Australian and of
British children). And these samples are independent (the number of
hours of TV watched by British children doesn’t affect the number of
hours watched by Australian children, and vice versa). Therefore we need
a two-sample test (what SPSS calls an ‘independent samples’ t-test).
2. The two samples are being compared in terms of an interval-ratio
variable (hours of TV watched). Therefore the relevant descriptive
statistic is the mean.
→ These facts lead us to select the two-sample t-test for the equality of
means as the relevant test of significance.
Table 1. Descriptive statistics for the samples
Descriptive statistic    Australian sample    British sample
Mean                     166 minutes          187 minutes
Standard deviation       29 minutes           30 minutes
Sample size              20                   20
t-test of independent means: formulae
$$t = \frac{M_1 - M_2}{S_{DM}}$$

$$S_{DM} = \sqrt{\left[\frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}\right]\left[\frac{1}{N_1} + \frac{1}{N_2}\right]}$$

Note: $\frac{1}{N_1} + \frac{1}{N_2} = \frac{N_1 + N_2}{N_1 N_2}$

$$df = N_1 + N_2 - 2$$

Where:
M = mean
S_DM = Standard error of the difference between means
N = number of subjects in a group
s = Sample standard deviation of a group
df = degrees of freedom
What are ‘degrees of freedom’?
• Degrees of freedom can be thought of as the
‘sources of variation’ in a particular situation.
• If we are comparing groups of 20, then within
each group there are 19 (independent)
sources of difference between the values for
that group.
• Thus for the two groups combined there are
19+19 = 38 degrees of freedom (d.f.)
Example: Calculating the t-value
$$t = \frac{M_1 - M_2}{S_{DM}}, \qquad S_{DM} = \sqrt{\left[\frac{(N_1 - 1)s_1^2 + (N_2 - 1)s_2^2}{N_1 + N_2 - 2}\right]\left[\frac{1}{N_1} + \frac{1}{N_2}\right]}$$

Descriptive statistic    Australian sample    British sample
Mean                     166 minutes          187 minutes
Std. dev.                29 minutes           30 minutes
Sample size              20                   20

$$S_{DM} = \sqrt{\left[\frac{(20 - 1)29^2 + (20 - 1)30^2}{20 + 20 - 2}\right]\left[\frac{20 + 20}{20 \times 20}\right]} = 9.3$$

$$t_{sample} = \frac{166 - 187}{9.3} = -2.3$$
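The calculation can be reproduced in Python as a check. This is a sketch of the pooled-variance formula as given in these slides; the function name `pooled_t` is just an illustrative label.

```python
import math

def pooled_t(m1, s1, n1, m2, s2, n2):
    """Two-sample t-test statistic using the pooled variance estimate."""
    # Pooled variance: each group's variance weighted by its degrees of freedom.
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    # Standard error of the difference between the two sample means (S_DM).
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (m1 - m2) / se_diff
    df = n1 + n2 - 2
    return t, se_diff, df

# Australian sample first, British sample second (figures from Table 1).
t_value, se_diff, df = pooled_t(166, 29, 20, 187, 30, 20)
print(f"S_DM = {se_diff:.1f}, t = {t_value:.2f}, df = {df}")
# prints: S_DM = 9.3, t = -2.25, df = 38
```

To two decimal places the t-value is -2.25, which the slides round to -2.3.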
Example: Obtaining a p-value for a t-value
• To obtain the p-value for this t-value (score) we could consult a table of
critical values for the t-distribution.
• Such a table may not have a row of probabilities for 38 degrees of
freedom (d.f.). In that case we would (to be cautious) refer to the row
for the nearest reported number of degrees of freedom below the
desired number. Here that might be 30.
• For 30 degrees of freedom and a two-tailed test, the tabulated t-scores
for p=0.05 and p=0.02 are 2.042 and 2.457.
• The absolute magnitude of the t-statistic (2.3) falls between these scores,
hence the p-value linked to this t-statistic is between 0.02
and 0.05.
• Therefore the result is statistically significant at the 5% (0.05) level
but not at the 2% or 1% (0.02 or 0.01) level.
• Of course, SPSS is set up to calculate exact p-values for test statistics
such as the t-statistic (in this case the exact value is p=0.030).
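For anyone without SPSS or a table to hand, the p-value can also be approximated directly. The sketch below uses only Python's standard library and integrates the t-distribution numerically (Simpson's rule); this is a swapped-in numerical method for illustration, not how SPSS computes it, and the function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))

def two_tailed_p(t_value, df, steps=10_000):
    """Two-tailed p-value: area beyond +/-|t|. `steps` must be even."""
    upper = abs(t_value)
    h = upper / steps
    # Simpson's rule for the area under the density between 0 and |t|.
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3
    # The distribution is symmetric, so double the area to cover -|t| to +|t|.
    return 1 - 2 * central_area

p_value = two_tailed_p(-2.25, 38)  # t to two decimal places, 38 d.f.
print(round(p_value, 3))
```

The result should land close to the p = 0.030 reported by SPSS, and inside the 0.02 to 0.05 range found from the table.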
Example: Reporting the results
“The mean number of minutes of TV watched by the
sample of 20 British children is 187 minutes, which
is 21 minutes higher than the mean of 166 minutes
for the sample of 20 Australian children; this
difference is statistically significant at the 0.05
level (t(38)= -2.3, p = 0.03, two-tailed test).
Based on these results we can reject the hypothesis
that British and Australian children watch the same
average amount of television every night.”
Some final thoughts…
• ANOVA (Analysis of Variance) works on broadly similar principles, but is
a technique allowing one to look simultaneously at differences between
the means of more than two groups.
• Both t-tests and ANOVA make an assumption of homogeneity of
variance (i.e. that the spread of values in each of the groups being
considered is consistent).
• We will look at ANOVA in more detail later in the module.
• What is crucial to remember from this session are the principles of
hypothesis testing:
– That we start with a null hypothesis (of no difference in the population).
– That, using our sample, we can test whether this is plausible.
– The p-values that we get (and that we report) show how likely results at least as
extreme as those observed would be, given no difference.
– Therefore (to simplify), the lower the p-value, the stronger the evidence that there is
a real difference between the groups.
• A reminder: The three things that affect the test statistic are the sample
size (of each group), the size of the differences in the means (between
groups) and the variability of scores (within each group).