The normal distribution assumption and other assumptions.
I. t-tests assume that the data from the population are distributed normally.
So how do we know if a population has a normal distribution?
As usual, we use the sample and use this as an estimate (sort of).
If you take a “sample” from a normally distributed population, it makes sense that your
sample will also be normally distributed.
You’re more likely to pick items near the average, because more items will be
NEAR the average!
As your sample size goes up, you start picking up more extreme values (just as in
a normal distribution)
So how do you figure out if a population is actually normally distributed?
There are two types of methods:
1) statistical tests:
A lot of people don’t like “tests” of normality.
In all cases you're trying to prove the H0, which is impossible!
We won’t be covering them in this class, but if you ever do need to use
them (e.g., someone tells you to, even when they shouldn't) here are some
that work:
Shapiro-Wilk test - better for smaller samples.
Kolmogorov-Smirnov test - better for larger samples.
(The division is not clear cut, and there's lots of overlap).
NEVER, never use a goodness of fit test to test for normality
(You can literally do anything you want with this one!)
2) graphical methods:
Histograms - these will tell you very quickly when something's not normal, but
it's more difficult to use these to figure out if something is normal.
Box plots - are similar to histograms - they'll let you know when something's not
normal, but aren't so good to figure out if something's actually normal.
QQ plots (normal probability plots in your text) - probably the best method for
figuring out if something's normal.
II. Making Q-Q plots (= quantile - quantile plots)
In general, we’ll let the computer do this. But let's explain how it works:
1) Determine your sample size.
2) Then calculate what you would expect from this sample size (more below).
3) Plot the expected values (x-axis) vs. the actual values of your data (y-axis).
4) If the points are more or less on a straight line, then your sample is probably normal.
Your text has a pretty good description on p. 138 [136] {134}.
The details:
Suppose you took a sample of size 10.
Where would you expect the smallest value to be?
At the 10th percentile (= 1/10).
The second smallest value should be at the 20th percentile (= 2/10) and so on.
So you could simply get the z-score for the 10th percentile and plot that vs. the smallest
value in your data.
[You could convert your z-score into a y-value (the 4th edition does this), but it
really isn't necessary - the graph will look the same:
This does, however, plot the actual vs. expected values using the same
scale, which might be a little easier to understand.]
However, we can't do this. If we just plot the percentiles like this, eventually we'd
get to the 100th percentile.
There is no z-score for 100% of the area (it's at infinity!)
In other words, we can't use
	i / n
to calculate our percentiles, because this will eventually give us 1 (or 100%).
Notice that we use i (the rank of the observation), not the actual values of our data.
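To see the problem concretely, here is a small sketch in Python (an illustration only; the course software is R). The standard library's NormalDist does the reverse z-table lookup, and it refuses the 100th percentile outright:

```python
from statistics import NormalDist, StatisticsError

n = 10
naive = [i / n for i in range(1, n + 1)]  # plain i/n percentiles
print(naive[-1])  # the largest rank gives exactly 1.0 (100%)

try:
    # no z-score exists for 100% of the area (it's at infinity)
    NormalDist().inv_cdf(naive[-1])
except StatisticsError as err:
    print("inv_cdf(1.0) fails:", err)
```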
Instead we have to tweak our percentiles a bit and we use:
	2nd or 3rd edition: (i − 1/3) / (n + 1/3)
	4th edition: (i − 1/2) / n
Unfortunately this is a bit different depending on which edition you have. If you have the
2nd or 3rd edition, the first formula is used; if you have the 4th edition, the second formula
is used.
In both cases, note that you never get 1 or 100%.
These equations give us the percentiles (or quantiles).
Important!
You still have to convert these to a z-score!
Remember that you need to use i, not the actual value of your data.
Before we get really confused, let's just do an example (see p. 137 [134] {136}):
We have a sample of heights for 11 women, and we want to determine if the sample has a normal
distribution.
First note that n = 11.
Now we calculate the z-scores we would expect for the smallest height, the second
smallest height, and so on.
For the smallest height we have (using the 2nd or 3rd edition approach):
	quantile = percentile = (1 − 1/3) / (11 + 1/3) = 0.0588
and then look up the z-score for this percentile:
	Z_α = Z_0.0588 = −1.56
If we wanted to use the 4th edition approach we would have:
	quantile = percentile = (1 − 1/2) / 11 = 0.0455
and then look up the z-score for this percentile:
	Z_α = Z_0.0455 = −1.69
Obviously, you stick with one approach or the other (the 4th edition approach is
arguably a bit better, and is the one that R uses).
The results will actually be quite similar, whichever method you use.
Let's try the second smallest value:
2nd or 3rd edition:
	quantile = (2 − 1/3) / (11 + 1/3) = 0.1471
so we look up
	Z_α = Z_0.1471 = −1.05
4th edition:
	quantile = (2 − 1/2) / 11 = 0.1364
so we look up
	Z_α = Z_0.1364 = −1.10
Remember, you're finding the quantile (or percentile), then doing a reverse lookup
for your z-value.
The other 9 numbers would be calculated the same way - aren’t you glad the computer
can do all this?
Once you have all 11 z-scores you then plot everything (see the graph on p. 137 [135] {136}).
You plot your smallest observed value against the 1st z-score
You plot the second smallest observed value against the 2nd z-score
And so on, until you plot all 11 values.
(Caution: the description in your text (2nd and 3rd edition) is not quite correct - it uses "1-α" as a
subscript for z, which is exactly backwards from the way your tables are set up.)
III. Interpreting Q-Q plots.
So what should a q-q plot look like?
If things are perfectly normal, then all the dots should line up on a perfectly straight line.
Think of this a little like plotting x = y.
If your data points are exactly normal, all your z-scores will match your observed data
perfectly, and everything will be on a straight line.
In other words, the straighter, the better.
Our plot of heights for women is very straight and is a good example of a q-q plot that shows
normally distributed data.
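As a sketch of the "straighter, the better" idea (in Python rather than R, and using the correlation of the Q-Q points as an illustrative straightness measure, which is my own choice, not from the text): pair each sorted observation with its expected z-score and see how close the points come to a line.

```python
import random
from statistics import NormalDist

def qq_points(data):
    """Return (expected z-score, observed value) pairs for a normal Q-Q plot."""
    n = len(data)
    nd = NormalDist()
    # 4th edition plotting positions: (i - 1/2) / n
    expected = [nd.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(expected, sorted(data)))

def correlation(xs, ys):
    """Plain Pearson correlation -- near 1 means the Q-Q points are nearly straight."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

random.seed(1)
sample = [random.gauss(65, 3) for _ in range(30)]  # made-up "heights", roughly normal
zs, obs = zip(*qq_points(sample))
print(f"Q-Q correlation: {correlation(zs, obs):.3f}")  # close to 1 for normal data
```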
Here are some guidelines to interpreting q-q plots:
1) Don’t worry about every little bump. There’ll be lots of bumps.
Nothing is ever perfectly straight (unless the data are made up).
2) Worry if you see a strong curve of some kind (e.g., fig. 4.26 on p. 140 [4.28, p. 137] {4.4.7, p.
138}).
If you see a backwards "S" (the ends of the curve point towards the vertical), this
indicates long tails, which is BAD.
Long tails can be very difficult to work with.
It takes a long time for the central limit theorem to work.
If you see a regular “S” that indicates short tails.
That's not so bad, and even a moderate sample size means things are probably
sufficiently normal.
With practice, you can tell a lot about the distribution of your data from q-q plots.
Caution: some software (e.g., Minitab) reverses the axes on q-q plots.
If the axes are reversed, then the above comments need to be “reversed” as well (a regular
“S” becomes long tailed).
Always make sure you look at the labels of the axes before you interpret a probability
plot.
Exercise 4.15, p. 143 [4.18, p. 139 - 140] {4.4.2, p. 140} should help you with some of the
interpretation.
Some comments on the normal assumption.
Depending on how the data are "not normal", the t-test might still perform pretty well, though there
are often more powerful tests (we'll learn one soon).
Note that the normal assumption needs to be satisfied for each sample.
You need to do a q-q plot for each sample
For example, do a q-q plot for both males and females if you're doing a two-sample
test comparing males vs. females.
IV. Finally, what are all the assumptions of a two sample t-test?
1) The data must be collected “randomly” (a “random sample”)
1.5) In the case of a two sample test, each sample needs to be independent.
Being in sample 1 cannot influence being in sample 2.
Data in sample 1 should be completely independent of data in sample 2.
2) The data should be normal, BUT, if your sample size is large enough, you can get away with non-normal data (remember the CLT).
What is large enough? For a two-sample test, if each sample is 20 to 25 or over that is often
good enough. But it depends on how non-normal the data is.
If the data are strongly non-normal, you might need a larger sample size, say 50 or even
more.
If the data are almost, but not quite normal, you can get away with a smaller sample size
(say about 15 or so).
Worry more if you have long tails in a q-q plot.
3) For a two-sample t-test you should figure out whether or not you should assume equal variances.
Unless you're pretty sure, you should not assume equal variances.
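A minimal sketch of the unequal-variance (Welch) version of the two-sample t statistic, in Python with made-up numbers (an illustration only; in R, t.test() does not assume equal variances unless you ask it to):

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t statistic and Welch degrees of freedom (equal variances NOT assumed)."""
    n1, n2 = len(a), len(b)
    v1, v2 = variance(a), variance(b)   # sample variances (n - 1 denominator)
    se2 = v1 / n1 + v2 / n2             # squared standard error of the difference in means
    t = (mean(a) - mean(b)) / sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2**2 / ((v1 / n1)**2 / (n1 - 1) + (v2 / n2)**2 / (n2 - 1))
    return t, df

# Hypothetical data for two independent samples
t, df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(f"t = {t:.3f}, df = {df:.2f}")
```

Because the variances are not pooled, the degrees of freedom come out fractional (here about 5.88 rather than the pooled n1 + n2 − 2 = 8), which is the price of not assuming equal variances.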