Stacie Bowman
Math 1040
Skittles Term Project Report
4/24/2015
Introduction
Have you ever noticed, when eating a bag of Skittles, that there is often an abundance of one
color and only a few or none of others? This raises the question: are Skittle colors random, or is there
actually a higher probability of getting some colors over others? This term project for Math 1040 aims to
answer this question while showing how to make use of statistics with a real-life data set. The project's
goal is to illustrate and practice the beginning-to-end process of statistics: collecting, organizing, and
analyzing data, then drawing conclusions and presenting results.
Organizing and Displaying Categorical Data: Colors
My Skittles
Color     Count   Percent
Red          22     35.5%
Orange       13     21.0%
Yellow        3      4.8%
Green         8     12.9%
Purple       16     25.8%
[Figure: My Skittles Distribution - Pie Chart. Slices: Red 35.5%, Orange 21.0%, Yellow 4.8%, Green 12.9%, Purple 25.8%]
[Figure: My Skittles Distribution - Pareto Chart. Vertical axis: Number of Skittles. Bars in descending order: Red 22, Purple 16, Orange 13, Green 8, Yellow 3]
Class Total Skittles
Color     Count   Percent
Red         283     22.4%
Orange      271     21.4%
Yellow      254     20.1%
Green       206     16.3%
Purple      250     19.8%
[Figure: Class Total Skittles Distribution - Pie Chart. Slices: Red 22.4%, Orange 21.4%, Yellow 20.1%, Green 16.3%, Purple 19.8%]
[Figure: Class Total Skittles Distribution - Pareto Chart. Vertical axis: Number of Skittles. Bars in descending order: Red 283, Orange 271, Yellow 254, Purple 250, Green 206]
When opening my bag of Skittles I assumed the colors would not be very evenly distributed,
because I have experienced this in the past (it is especially noticeable when your favorite color/flavor is
the least frequent). Although this is often the case with a small bag (a small sample), I expected the larger
class sample to be more evenly distributed (assuming the manufacturer produces roughly the same
number of each color). Both graphs, for my own bag and for the class total, turned out about as I
expected: my own bag has an uneven color distribution, with an overabundance of red (35.5%) and only
a few yellow (4.8%), whereas the class distribution is much more even (16.3% to 22.4% per color). I would
expect an even larger sample to be even more evenly distributed, just as the share of heads in repeated
coin flips gets closer to 50% as the number of flips grows. The overall data is very different from the small
sample of my own bag, but interestingly red is the most frequent color for the class while also being the
dominant color in my own bag. One can't help but wonder if the Skittles factory actually does produce
more red than other colors.
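The comparison above can be checked with a short Python sketch. It is only an illustration, not the method used for the class calculations: the counts are copied from the two tables in this report, and the helper names `proportions` and `spread` are hypothetical names introduced here.

```python
# Color counts copied from the two tables in this report.
my_bag = {"Red": 22, "Orange": 13, "Yellow": 3, "Green": 8, "Purple": 16}
class_total = {"Red": 283, "Orange": 271, "Yellow": 254, "Green": 206, "Purple": 250}


def proportions(counts):
    """Return each color's share of the total as a percentage (1 decimal)."""
    total = sum(counts.values())
    return {color: round(100 * n / total, 1) for color, n in counts.items()}


def spread(counts):
    """Gap, in percentage points, between the most and least common color."""
    pct = proportions(counts)
    return round(max(pct.values()) - min(pct.values()), 1)


my_pct = proportions(my_bag)
class_pct = proportions(class_total)

for color in my_bag:
    print(f"{color:<7} my bag {my_pct[color]:5.1f}%   class {class_pct[color]:5.1f}%")
print("Spread, my bag:", spread(my_bag), "points")
print("Spread, class: ", spread(class_total), "points")
```

The spread is a rough way to put a number on "uneven": a smaller gap between the most and least common color means a more even distribution.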
Organizing and Displaying Quantitative Data: the Number of Candies per Bag
The shape of the distribution is normal for the most part, with a majority (12 of 21) of the bags
containing between 60 and 62 Skittles. The overall graph is skewed only slightly to the left, with outliers
at 52 and 67. This is mostly what I expected to see, a normal and narrow distribution, but what surprised
me was how far the outliers are from the average. It seems you can either get lucky and get an extra 7
or so Skittles, or get unlucky and get 8 or so fewer than the average. My own bag had 62 Skittles in it,
which is the class mode (62) and is within one standard deviation (3.01) of the class mean (60.2),
making my bag very typical when compared to the whole class.
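As a minimal sketch, the summary statistics quoted above can be reproduced with Python's standard `statistics` module. The bag totals are the 21 values listed in this report. The z-score line is an added illustration of what "within one standard deviation" means and is not one of the class calculations.

```python
import statistics

# Candies per bag for the 21 class bags, as listed in this report.
bags = [52, 56, 57, 58, 59, 59, 59, 60, 60, 60, 60, 60,
        62, 62, 62, 62, 62, 62, 62, 63, 67]
my_bag = 62

mean = statistics.mean(bags)
stdev = statistics.stdev(bags)   # sample standard deviation (divides by n - 1)
mode = statistics.mode(bags)

# How many standard deviations my bag sits from the class mean.
z = (my_bag - mean) / stdev

# Number of bags holding between 60 and 62 candies.
in_60_to_62 = sum(1 for b in bags if 60 <= b <= 62)

print(f"mean = {mean:.1f}, stdev = {stdev:.2f}, mode = {mode}")
print(f"my bag z-score = {z:.2f}")
print(f"bags with 60 to 62 candies: {in_60_to_62} of {len(bags)}")
```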
Skittle Count - Five Number Summary
Min    Q1    Q2    Q3    Max
 52    59    60    62     67

Mean: 60.2
Stdev: 3.01
My Bag: 62
# of Bags: 21
Order #:   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
Total:    52  56  57  58  59  59  59  60  60  60  60  60  62  62  62  62  62  62  62  63  67

Min: 52 (1st value)
Q1: 0.25*21 = 5.25, round up to 6; the 6th value is 59
Q2: median, the middle (11th) value, is 60
Q3: 0.75*21 = 15.75, round up to 16; the 16th value is 62
Max: 67 (21st value)
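The round-up locator rule used above can be sketched in Python as follows. The function name `percentile_value` is a hypothetical name introduced here. The branch for a whole-number locator (averaging two neighboring values) is the usual textbook convention and is an assumption, since that case never arises with 21 bags.

```python
import math

# Candies per bag, sorted in ascending order (from the list in this report).
bags = sorted([52, 56, 57, 58, 59, 59, 59, 60, 60, 60, 60, 60,
               62, 62, 62, 62, 62, 62, 62, 63, 67])


def percentile_value(sorted_data, k):
    """Value at the k-th percentile (0 < k < 100) using the locator L = k/100 * n.

    If L is not a whole number, round it up and take the L-th value.
    If L is a whole number, average the L-th and (L + 1)-th values
    (assumed textbook convention; not needed for this data set).
    """
    n = len(sorted_data)
    locator = k * n / 100
    if locator == int(locator):
        i = int(locator)
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    return sorted_data[math.ceil(locator) - 1]   # positions are 1-based


five_number_summary = (
    bags[0],
    percentile_value(bags, 25),
    percentile_value(bags, 50),
    percentile_value(bags, 75),
    bags[-1],
)
print("Min, Q1, Q2, Q3, Max:", five_number_summary)
```

Other quartile conventions exist (for example, interpolating between values), so software may report slightly different quartiles for the same data.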
Confidence Interval Estimates
A confidence interval indicates how accurate an estimate of a population parameter is expected
to be. It gives a range of values together with a confidence level: one is x% confident that the true
population parameter falls within that range.
Calculations show that we are 95% confident that the true proportion of purple candies falls
between 0.176 and 0.220. This is roughly a fifth of the candies, which seems reasonable as there are five
colors. We are 99% confident that the true mean number of candies per bag falls between 58.3 and 62.1.
This could be interpreted as expecting about 60 Skittles in a bag, plus or minus 2. I also calculated that
we are 98% confident that the true standard deviation of candies per bag is between 2.20 and 4.68. All
of these ranges seem reasonable for the given parameters.
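The three intervals can be reproduced with the sketch below, which assumes the standard formulas: a z interval for a proportion, a t interval for a mean, and a chi-square interval for a standard deviation. Python's standard library has no t or chi-square functions, so the critical values for 20 degrees of freedom are typed in from standard tables. The constant names are hypothetical.

```python
import math
from statistics import NormalDist

# Summary numbers taken from this report.
n_candies, purple = 1264, 250          # class total and purple count
n_bags, mean, stdev = 21, 60.2, 3.01   # bags, mean and stdev of candies per bag

# Critical values for df = 20, copied from standard t and chi-square tables.
T_CRIT_99 = 2.845                             # t, 99% confidence, two-tailed
CHI2_LEFT_98, CHI2_RIGHT_98 = 8.260, 37.566   # chi-square, 98% confidence

# 95% confidence interval for the proportion of purple candies.
p_hat = purple / n_candies
z_crit = NormalDist().inv_cdf(0.975)          # about 1.96
margin_p = z_crit * math.sqrt(p_hat * (1 - p_hat) / n_candies)
ci_proportion = (p_hat - margin_p, p_hat + margin_p)

# 99% confidence interval for the mean number of candies per bag.
margin_mean = T_CRIT_99 * stdev / math.sqrt(n_bags)
ci_mean = (mean - margin_mean, mean + margin_mean)

# 98% confidence interval for the standard deviation of candies per bag.
scaled_var = (n_bags - 1) * stdev ** 2
ci_stdev = (math.sqrt(scaled_var / CHI2_RIGHT_98),
            math.sqrt(scaled_var / CHI2_LEFT_98))

print(f"proportion purple: {ci_proportion[0]:.3f} to {ci_proportion[1]:.3f}")
print(f"mean per bag:      {ci_mean[0]:.1f} to {ci_mean[1]:.1f}")
print(f"stdev per bag:     {ci_stdev[0]:.2f} to {ci_stdev[1]:.2f}")
```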
Hypothesis Tests
A hypothesis test offers a way to test a claim made about a population. The claim could be made
about the proportion, mean, or other property of the population. The claim is tested at a selected
significance level and is either rejected or not rejected. If the claim includes equality (is specific), the
hypothesis test determines whether there is sufficient evidence to warrant rejection of the claim; if the
claim does not include equality (e.g. greater than), the test determines whether there is sufficient
evidence to support the claim.
Our hypothesis tests for the bags of Skittles tested the claims that 20% of all the Skittles are
green and that the mean number of candies per bag is 56. My calculations determined that in both
cases the claims are rejected. The claim that 20% are green is tested against our class proportion of
16.3% green Skittles and is found unlikely to be correct, so it is rejected (the calculated P-value is less
than the significance level). The claim that the mean number of Skittles per bag is 56 is tested against
our class mean of 60.2 and is also found unlikely to be correct, so it is also rejected (the calculated test
statistic falls within the critical region).
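The sketch below shows one way these two tests could be carried out. It makes assumptions the report does not state: a significance level of 0.05 and two-tailed tests. The t critical value for 20 degrees of freedom is typed in from a standard t table, because Python's standard library has no t distribution.

```python
import math
from statistics import NormalDist

ALPHA = 0.05   # assumed significance level; the report does not state one

# Claim 1: 20% of all Skittles are green (H0: p = 0.20), tested with a z statistic.
n, green, p0 = 1264, 206, 0.20
p_hat = green / n
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 2 * NormalDist().cdf(-abs(z))   # two-tailed P-value
reject_proportion = p_value < ALPHA

# Claim 2: the mean number of candies per bag is 56 (H0: mu = 56), tested with a t statistic.
n_bags, mean, stdev, mu0 = 21, 60.2, 3.01, 56
t = (mean - mu0) / (stdev / math.sqrt(n_bags))
T_CRIT = 2.086   # t table: df = 20, alpha = 0.05, two-tailed
reject_mean = abs(t) > T_CRIT             # True if t lands in the critical region

print(f"green: z = {z:.2f}, P-value = {p_value:.4f}, reject = {reject_proportion}")
print(f"mean:  t = {t:.2f}, critical value = {T_CRIT}, reject = {reject_mean}")
```

With these assumptions both claims are rejected, which agrees with the conclusions described above.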
Hypothesis Testing Reflection
The condition for a hypothesis test for a population proportion is that np >= 5 and nq >= 5. In
our case np = (1264)*(0.2) = 252.8 >= 5 and nq = (1264)*(0.8) = 1011.2 >= 5, so our sample met these
conditions and the test should be valid, except that the sampling may not be considered simple random,
which would disqualify the result. The condition for a hypothesis test for a population mean is that the
population is normally distributed or n > 30. In our case the number of bags is 21, which is not greater
than 30, so we must determine whether the distribution is normal. The graph below shows the frequency
distribution, and it is not very normal; therefore our sample does not meet the requirements for this
hypothesis test, and our test result may be incorrect. Estimates of the standard deviation have an even
stricter normality requirement, which our data do not satisfy.
Errors could have been made using this data because our sample mean (of Skittles per bag) may
not be at all representative of the population mean. The sampling method could be improved by
including many more samples (more bags) and by selecting those bags from randomized locations (as
opposed to all of them from here in town).
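The two requirement checks described above can be written as a short sketch. The function names are hypothetical, and whether the data "look normal" is passed in as a judgment call, because the code does not test for normality.

```python
def proportion_test_ok(n, p, minimum=5):
    """Check np >= 5 and nq >= 5, where q = 1 - p."""
    q = 1 - p
    return n * p >= minimum and n * q >= minimum


def mean_test_ok(n, looks_normal):
    """A mean test needs a normally distributed population or n > 30."""
    return looks_normal or n > 30


# Proportion test: 1264 candies, claimed proportion 0.20.
print("proportion test requirements met:", proportion_test_ok(1264, 0.20))

# Mean test: 21 bags, and the frequency distribution was judged not very normal.
print("mean test requirements met:", mean_test_ok(21, looks_normal=False))
```

Neither check covers the simple random sampling requirement, which has to be judged from how the bags were collected.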
[Figure: Frequency Distribution - Skittles per Bag. Horizontal axis: # of skittles in bag (52 to 67). Vertical axis: Frequency (0 to 8)]
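Because the chart itself is not reproduced here, the following sketch rebuilds the frequency distribution from the 21 bag totals listed in this report and prints it as a simple text histogram. It illustrates the idea and is not the original graph.

```python
from collections import Counter

# Candies per bag for the 21 class bags, as listed in this report.
bags = [52, 56, 57, 58, 59, 59, 59, 60, 60, 60, 60, 60,
        62, 62, 62, 62, 62, 62, 62, 63, 67]

freq = Counter(bags)   # counts that never occur report a frequency of 0

# One row per possible bag size from the minimum to the maximum.
for size in range(min(bags), max(bags) + 1):
    print(f"{size:>2} | {'#' * freq[size]} ({freq[size]})")
```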
Term Project Conclusion: Reflective Writing
The Skittles term project has really helped me understand how statistics can be used to make
sense of a real-world data set. This project helped me learn the whole process of statistics, from
collecting data to drawing conclusions and presenting results. The results of this work also taught me
that you have to be careful not to make assumptions about a population parameter when your sample
size is very small (such as one bag of Skittles). My own bag contained only a few yellow candies, which
could lead one to assume that the factory produces far fewer yellow candies than the others. In the
data collected from the whole class, however, yellow was almost exactly a proportionate fifth of the
total candies (20.1%), illustrating how a small sample can be extremely unrepresentative of the larger
population.
The way I think about math has changed due to this project and class. Statistics is quite different
from the math I am used to, and it felt more real to me, involving real-world information and data sets
rather than simply solving abstract equations. The Skittles project in particular really helped convince me
that math can be used to make sense of all sorts of things, be it clinical trials or candy. I definitely think
this project has also taught me skills that will be useful for other classes and for life in general, such as
organizing information, making sense of raw data, and presenting ideas and results.
It appears that the Skittles company produces about the same number of each color (based on
the proportions in the class total), although a disproportionate number can certainly end up in a single
bag. The evidence also suggests the company does a decent job of keeping close to the same number
of candies in each bag (small standard deviation relative to the average), although outliers (lucky and
unlucky bags) with several more or fewer Skittles than average can occur. Before this project I would
have had no idea how to make sense of data like this and reach these conclusions. Statistics has turned
out to be more interesting and useful than I ever thought.