Chapter 1

Know how to create a graphical representation of data with two variables (data in a two-way table). (10, 6a; 09, 1a)
Know how to describe differences in categorical data. (10, 6b; 09, 1b)
Know how to compare distributions (shape, center, spread). Make sure you use comparative words (larger, wider, etc.). (10b, 1a; 08, 1a; 08b, 1a; 07, 1b; 06, 1a; 05, 1a; 05b, 1a; 04b, 5a)
Know how to make and interpret a stemplot or back-to-back stemplot, including labels and a legend. (10b, 1b; 07b, 1a; M7, 7)
Know what a gap is and what clusters are. Remember to state where they are in the distribution and what that means in the context of the problem. (10b, 1c; 07b, 1c)
Know how skewness and symmetry affect the relationship between mean and median. (09, 6b; 09, 1c; 05, 1d; 05b, 1b)
Know how to create (including labels) and interpret a boxplot, including how to find the range, the interquartile range, and the first and third quartiles (25th and 75th percentiles), and how to determine which way, if at all, it is skewed. (09b, 1a; 04, 1ab; 04b, 5a; 03, 1ab; M7, 29; M2, 14)
Know what effect adding a constant has on the center and spread of a set of data.
Know what effect multiplying by a constant has on the center and spread of a set of data. (09b, 1b; 09, 1b; M2, 7)
Know how to create a dotplot. (08b, 1a)
Know how to interpret standard deviation in the context of a problem. (07, 1a)
Know how to describe a distribution (shape, center, spread) when looking at a stemplot, boxplot, dotplot, and other types of graphical representations of quantitative data. (07b, 1b)
Know what the standard deviation or overall spread of a distribution will look like for a set of data that is more consistent than a different set of data. (06, 1b)
Know how to interpret center in the context of a problem. (06, 1c)
Know how to read and interpret a cumulative frequency or cumulative relative frequency plot, and how to interpret points and slope in context. (06b, 1; M2, 27)
Know what a median is and what it represents in the context of a problem. (M7, 1)
Know how to compare standard deviations when looking at two or more graphical displays (if you have two histograms, know which one has the larger standard deviation). (M7, 15)

2010B #1, 2009 #1, 2008 #1, 2007B #1, 2006 #1, 2006B #1, 2005B #1

Slide 1
Chapter 1
Exploring Data

Slide 2
Section 1.1
Displaying Distributions with Graphs

Slide 3
Definitions: Individuals and variables
• Individuals are the objects described by a set of data. Individuals can be people, animals, or things.
• A variable is any characteristic of an individual
– Gender, Height, Weight, Race, Religion, etc.

Slide 4
Categorical versus Quantitative Variables
• A categorical variable (qualitative variable) records which of several groups or categories an individual belongs to
– Gender, Race, City, Zip Code, Area Code, Religion, Color, Age Group (21-25)
• A quantitative variable takes numerical values for which it makes sense to do arithmetic
– Age, Height, Weight, Number of RBCs, Score

Slide 5
More on Quantitative Variables
• Units tell you what the numbers measure.
• Quantitative data can be categorized
– Discrete is where every possible value can be listed
• AP scores: 1, 2, 3, 4, 5
• Test Score: 0-100
– Continuous is where there is an infinite number of possible values, because the variable can be measured to any precision
• Weight: 20, 19, 18.8, 18.81, 18.806

Slide 6
Displaying Categorical Data
• Pie Chart
– Make sure it's labeled
– You could not represent quantitative data with a pie chart
• Bar Chart
– The bars do not touch
– Make sure each bar is labeled
– What 2 things is this bar chart missing?

Slide 7
More Bar Charts
• The only time the bars would touch is if you were displaying data on more than one variable
– Here we are looking at scores for each grade
– We are still missing a couple of labels

Slide 8
Stemplots, Splitstems, and Back-to-Back Stemplots
• Stemplots are used when you have a smaller amount of data, because they display all of the individual data values.
• You would split the stems if you had a small range of data
– For example, if I had collected everybody's height in inches, I wouldn't want to have only 6's and 7's as stems
• Back-to-back stemplots are used when you collect similar data from different samples
– For example, if I collected heights from 1st period and I wanted to compare them to 3rd period

Slide 9
Creating A Stemplot
• Here is data collected on the weight of packages delivered to Ford
• 103, 118, 131, 134, 134, 168, 191, 222, 232, 242, 268, 280, 280, 290, 301, 361, 381, 401, 431, 431, 441, 481
– Each weight is rounded to the nearest 10 pounds before it is plotted
1 | 0, 2, 3, 3, 3, 7, 9
2 | 2, 3, 4, 7, 8, 8, 9
3 | 0, 6, 8
4 | 0, 3, 3, 4, 8
4|3 = 430 pounds

Slide 10
Splitstem
• Here is the same data but represented with a splitstem
• 103, 118, 131, 134, 134, 168, 191, 222, 232, 242, 268, 280, 280, 290, 301, 361, 381, 401, 431, 431, 441, 481
Weight of Packages delivered to Ford
1 | 0, 2, 3, 3, 3
1 | 7, 9
2 | 2, 3, 4
2 | 7, 8, 8, 9
3 | 0
3 | 6, 8
4 | 0, 3, 3, 4
4 | 8

Slide 11
Back to Back Stemplot
• Here is data from two different days
Tuesday    |   | Wednesday
2, 4, 6, 6 | 1 | 3, 4
3, 3, 3    | 2 | 4, 5, 6
1          | 3 | 5, 6, 6
1|4 = 140 lbs

Slide 12
Know how to create a dotplot
• Dotplots are used when you have a small amount of data, and, like a stemplot, can display every piece of data collected
– This shows every piece of data, but we might not know every possible value

Slide 13
Histograms
• Histograms are generally used when you have a large amount of data and it wouldn't be reasonable to display every value collected
• Histograms can be confusing if the scales aren't described
– Let's take a look at the histogram on page 50

Slide 14
Cumulative Frequency
• A cumulative frequency chart is like a histogram that, instead of showing how many items are in each interval, shows how many have occurred up to that point
– It's usually a percentage, but it does not have to be
http://stattrek.com/AP-Statistics-1/CumulativeFrequency-Plot.aspx?Tutorial=AP

Slide 15
Timeplots
• A timeplot is a way to display quantitative data measured
against time
– You are collecting the same data, sometimes from the same individuals, and trying to discover a pattern over time

Slide 16
Describing Data
Shape, Center, and Spread

Slide 17
Describing data: shape, center, and spread
DESCRIBE is one of the major tipoff words on the AP exam. Whenever you see that word, it automatically tells you to talk about these three items.
• Shape
– Symmetric, skewed, uniform
– Shape can be unimodal, bimodal, or multimodal.
– Shape can have clusters and gaps.
• Center
– We start by just estimating where the middle is
– We will get more in depth in the next section
• Spread
– We start by talking about spread in terms of the range of the data.
– We will get more in depth in the next section

Slide 18
Shape

Slide 19
Skewed

Slide 20
Uniform

Slide 21
Unimodal

Slide 22
Bimodal

Slide 23
Gaps and Clusters
1 | 1, 1, 4, 7, 9
2 | 5, 6, 7
3 |
4 | 1, 2
5 | 0
In this skewed right stemplot there is a gap in the 30's, and you could say that there are two clusters (one in the 10's and 20's and the other in the 40's and 50's).

Slide 24
Making lists and histograms with the calculator
• STAT/Edit… allows you to create a list
• STAT PLOT (on the function keys) allows you to create a histogram with L1
• WINDOW (on the function keys) allows you to set up the criteria for your window
• Talk about the idea of range of data
• GRAPH (on the function keys) allows you to see the histogram
– You can then trace the data using the Trace button.

Slide 25
Section 1.2
Describing Distributions With Numbers

Slide 26
Basics of Central Tendency
• We use these three measurements to find the center of the data.
– Mean = average
– Median = middle
– Mode = most frequent
• Which one is better? Which one is least useful? Why?

Slide 27
Finding and Using Mean
• Add all the numbers up and divide by how many numbers there are
• We can use our calculator to find the mean as well
– Create our list
– Press STAT, then go over to CALC and choose 1-Var Stats
• Mean is nonresistant, because outliers can affect it
• For the most part, mean is only an appropriate measure of center when you are looking at a symmetric distribution of data

Slide 28
Finding and Using Median
• Find the middle number when you put them in order.
– If there is an even number of data values, we have to take the average of the middle two
• Let's look at a couple
– 1, 4, 6, 7, 8 (median = 6)
– 1, 4, 6, 8 (median = 5, the average of 4 and 6)
• You can also use your calculator.
• Median is resistant to outliers
• Median represents the point in your data where 50% of the values are above and 50% are below
– It is usually not halfway between your min and your max

Slide 29
Finding and Using Quartiles
• The quartiles are the medians of the lower and upper halves of the data (Q1 and Q3)
• They are used as a measure of spread
• We can use our calculator as well
– Create our list
– Press STAT, then go over to CALC and choose 1-Var Stats
– After choosing this, choose the list you want to look at
– Scroll down to Q1 and Q3
• Quartiles are useful in telling us where the middle 50%, the top 25%, and the bottom 25% of the data are

Slide 30
Outliers
• The Interquartile Range is the difference of the two quartiles (Q3 – Q1)
• We can use the IQR to find outliers
• Multiply the IQR by 1.5, and that tells you how far to move up from Q3 and down from Q1
• If data are past these limits, then they are outliers.
• 2, 4, 8, 10, 11
– The IQR = 10 – 4 = 6
– 1.5*6 = 9
– So any number bigger than 10 + 9 = 19 is an outlier

Slide 31
Know how to create and interpret a boxplot…
• The five number summary shows the…
– Min, Q1, Med, Q3, Max
• A boxplot is a visual representation of a five number summary, and can be used to show/determine skewness
• Outliers are marked with stars
This distribution is slightly skewed left. It has a median near 82 and an interquartile range near 20. There appears to be an outlier in the 20's. The first quartile (25th percentile) is at 70 and the third quartile (75th percentile) is at 90. That means that the middle 50% of the data is between 70 and 90. The bottom 50% is below 82 and the upper 50% is above 82.
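
The five number summary and the 1.5 × IQR rule from Slides 30 and 31 can be sketched in a few lines of Python. This is only an illustration, not part of the original notes: the function names are made up, and the quartile convention (the median is included in both halves when the count is odd) is an assumption chosen because it reproduces the IQR of 6 on Slide 30. Other conventions, such as leaving the median out of both halves, would give Q1 = 3 and Q3 = 10.5 for the same data.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: when the count is odd, the median belongs to both halves.
    This matches the IQR = 6 worked out on Slide 30.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2  # size of each half (median included if n is odd)
    lower, upper = data[:half], data[n - half:]
    return data[0], median(lower), median(data), median(upper), data[-1]


def outlier_fences(values):
    """Return the (low, high) limits of the 1.5 * IQR rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [2, 4, 8, 10, 11]  # the data from Slide 30
summary = five_number_summary(data)
low_fence, high_fence = outlier_fences(data)
outliers = [x for x in data if x < low_fence or x > high_fence]

print("five number summary:", summary)
print("fences:", low_fence, high_fence)
print("outliers:", outliers)
```

The rule works in both directions: values below Q1 minus 1.5 × IQR are outliers just as values above Q3 plus 1.5 × IQR are.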

Slide 32
Skewed Left

Slide 33
Skewed Right

Slide 34
Calculating Standard Deviation
• If we use quartiles to measure spread with the median, what do we use to measure spread with the average? The standard deviation.
• It is measured in the same units as the data
• For example, if I were taking an average age, the standard deviation would be measured in years

Slide 35
Calculating Standard Deviation By Hand
• 1, 2, 4, 6, 12
• Start by calculating the average: 25/5 = 5.
• Now look at the differences between each number and the average.
– -4, -3, -1, 1, 7
• What do you get when you add these up?
– 0
• What we do instead is look at the squares of these differences.
– 16, 9, 1, 1, 49
• Add them up, and that gives you a number (76) that has little meaning on its own.
• Divide that number by one less than the total number of data values: 76/4 = 19.
• That number is your variance.
• The square root of the variance is the standard deviation, s = 4.36.
• This is the number that is used to measure spread when the average is used.
• It is really only used with symmetric data.
• Make sure you get the right number when you use 1-Var Stats.
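
The by-hand steps on Slide 35 can be checked with a short Python sketch. It follows the same steps in the same order; the variable names are made up for illustration, and the final comparison against `statistics.stdev` (the sample standard deviation in Python's standard library) is an added check, not something from the slide.

```python
import math
import statistics

data = [1, 2, 4, 6, 12]  # the data from Slide 35

# Step 1: the average.
mean = sum(data) / len(data)

# Step 2: differences between each value and the average (these sum to 0).
deviations = [x - mean for x in data]

# Step 3: square the differences so they no longer cancel out.
squared = [d ** 2 for d in deviations]

# Step 4: divide the total by one less than the number of values.
variance = sum(squared) / (len(data) - 1)

# Step 5: the standard deviation is the square root of the variance.
s = math.sqrt(variance)

print("mean:", mean)
print("deviations:", deviations, "sum:", sum(deviations))
print("squared deviations:", squared, "sum:", sum(squared))
print("variance:", variance)
print("standard deviation:", round(s, 2))

# Cross-check against the library's sample standard deviation.
print("matches statistics.stdev:", math.isclose(s, statistics.stdev(data)))
```

Dividing by n – 1 rather than n is what makes this the sample standard deviation, which is the value reported as Sx by 1-Var Stats.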

Slide 36
What does standard deviation mean?
• What does it mean if something has a small standard deviation?
– The numbers are close together
– 1, 1, 1, 1, 1, 1
• In this case s = 0 because there is no spread
• Which one has the largest standard deviation?
– 80, 80, 80, 80
– 1, 2, 3, 4
– 2, 4, 6, 8

Slide 37
Know how to interpret standard deviation in the context of a problem
• In layman's terms, standard deviation is the average or typical distance away from the mean
• Again, if I were looking at a collection of data involving age, I might say that the average age is 17.5 years and the standard deviation is 2 years.
• Interpretation
– The typical person in this study is 2 years older or younger than 17.5 years.
– The average distance from 17.5 years is 2 years

Slide 38
Know what effect adding/multiplying a set of data by a constant can have
• What if I take a group of data with the five number summary: 1, 5, 10, 12, 20 and I…
– Add three to every number?
• The summary becomes 4, 8, 13, 15, 23
– Multiply every number by 3?
• The summary becomes 3, 15, 30, 36, 60
• What if I take a group of data with avg = 10 and s = 2 and I…
– Add three to every number
• Avg = 13 and s = 2 (spread did not change)
– Multiply every number by 3
• Avg = 30 and s = 6 (center and spread are both multiplied by 3)

Slide 39
Things I might have missed

Slide 40
Know how to compare distributions
• Compare means to talk about shape, center, and spread
• Compare also means to use comparative words
The distribution of the raw group appears to be skewed to the right, while the smoothed distribution appears to be approximately symmetric. They both appear to have the same median near 25, but the raw group appears to have a larger spread, as is seen by a larger range and interquartile range.

Slide 41
Know how skewness and symmetry affect the relationship between mean and median
Since this distribution is skewed right (toward the higher numbers), the mean would be larger than the median, because the mean is not resistant to the higher numbers, while the median is resistant. If it were skewed left (toward the lower numbers), the mean would be lower than the median.
Since this distribution is symmetric, the mean and median should be very close, if not the same. This is true of any symmetric distribution, even if it has a bizarre shape.
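
The relationship described on Slide 41 can be seen with three tiny data sets. This is a sketch for illustration only: the right-skewed list reuses the numbers from Slide 35, while the left-skewed and symmetric lists are made-up examples, not data from the slides.

```python
import statistics

# Right-skewed: a long tail toward the high values (data from Slide 35).
right_skewed = [1, 2, 4, 6, 12]

# Left-skewed: a hypothetical mirror image of the list above.
left_skewed = sorted(13 - x for x in right_skewed)  # [1, 7, 9, 11, 12]

# Symmetric: hypothetical evenly spaced values.
symmetric = [1, 3, 5, 7, 9]

for name, data in [
    ("right skewed", right_skewed),
    ("left skewed", left_skewed),
    ("symmetric", symmetric),
]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name}: mean = {mean}, median = {median}")
```

In each skewed list the single extreme value pulls the mean toward the tail, while the median stays with the bulk of the data.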

Slide 42
Know how to compare standard deviations when looking at two graphical displays
Since the raw group has an overall larger spread than the smoothed group, the standard deviation of the raw group would be larger than that of the smoothed group, because standard deviation is a measure of spread.

Slide 43
Know how to create a graphical representation of data with two variables
• Here is a distribution of M+M's that were left in a bowl at the end of a party last weekend
         Plain   Peanut
Red       10       5
Yellow    12       8
Blue      15      10
• You would get a distribution that looks like…
[Bar chart: counts of Plain and Peanut M+M's for Red, Yellow, and Blue]

Slide 44
Know how to describe differences in categorical data
[Bar chart: counts of Red, Yellow, and Blue M+M's for Plain and Peanut]
• There were fewer peanut M+M's left in each color, and there were fewer red M+M's left of each type. Assuming that the bowl started with the same number of peanut and plain and the same number of each color, this would make me think that people preferred peanut over plain and red over the other two colors.
If there were no preference, we would expect all of the bars to have the same height.

Chapter 2

Know what a percentile is and how to find a value at a certain percentile (What is the 60th percentile?). (09, 2a; 06b, 3c; 04, 6c; M7, 3)
Know how to find the probability of an event occurring or the percentage of time that an event will occur over a long period of time ( or ). (08b, 5b; 06, 3a; 06b, 3a; 05b, 6b; 04b, 3a; 03, 3ab)
Know how to solve for a population average if you know the probability of an event happening and its standard deviation (What does the average have to be for the class to get a 90 1% of the time?). (08b, 5c)
Know how to use the symmetry of a normal distribution to find a probability (P(z > 1) = P(z < -1)). (M7, 8)
Know how to use z-scores to compare two individuals from two different groups. (M7, 22; M2, 3)
Know how to find the middle 20% (or other percentage) of a normal distribution. (M2, 10)

Normal Distribution
If there is reason to believe that a distribution is normal, you must state that it is normal and state the average (µ) and the standard deviation (σ)
o This can be done by simply writing the shorthand version: N(µ, σ)
You must draw a normal curve picture with the problem's numbers referenced in it, and it would also be good to reference the formula for a standardized score (z-score)
o z = (x – µ)/σ or z = (x̄ – µ)/(σ/√n)
Calculate your probability, and verify with a calculator
State your probability and its meaning in the context of the question

Example 1
A box of candy is known to have an average weight of 50 oz. If it is known that the amount of packaged candy is normally distributed with a standard deviation of 5 oz, is it likely to get a box that weighs 62 oz or more?
z = (62 – 50)/5 = 2.4
P(z > 2.4) = 0.0082
There is a 0.82% chance that a box would weigh 62 oz or more. So this is very unlikely.

Example 2
Using the information above, find the middle 20% of the data.
Solution: Since we want the middle 20%, there will be 40% in the top tail and 40% in the bottom tail. So the question has really become: what is the 40th percentile and what is the 60th percentile? The 40th percentile occurs at a z-score of about -0.25. So… x = 50 + (-0.25)(5) = 48.75
Since the distribution is symmetric, the 60th percentile will be at 51.25.
So, the middle 20% of the data is between 48.75 and 51.25 oz.

Slide 1
Chapter 2
The Normal Distributions

Slide 2
Z-Scores and Density Curves

Slide 3
A Question
• Last year, Eunice had Mr. Allen for math and received an 87% in the class, while Irene had Mr. Merlo for the same math class and received an 80%. It has been mathematically proven that Mr. Merlo is a much harder teacher. In fact, his class average was 15% lower than Mr. Allen's last year. Who is smarter?
• Why can you argue that Irene is smarter?
• What extra piece of information might prove that Eunice is actually smarter?

Slide 4
The Standardized Value
• The Standardized Value (z-score) is a measure of the number of standard deviations a piece of data is away from the mean in a normal distribution.
• If a test or other measure has been standardized, z-scores can be used to determine which individual did better relative to his or her own group.
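
As a rough check on the z-score idea, here is a small Python sketch that reworks the candy box numbers from Examples 1 and 2 (mean 50 oz, standard deviation 5 oz). The helper name `z_score` is made up for illustration; `NormalDist` comes from Python's standard `statistics` module. Because the code does not round z to two decimal places, the percentile answers come out slightly different from the table-based 48.75 and 51.25.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


# Example 1: how unusual is a 62 oz box when boxes are N(50, 5)?
z = z_score(62, 50, 5)
p_62_or_more = 1 - NormalDist().cdf(z)  # area to the right of z on the standard normal curve

# Example 2: the middle 20% lies between the 40th and 60th percentiles.
boxes = NormalDist(mu=50, sigma=5)
low = boxes.inv_cdf(0.40)
high = boxes.inv_cdf(0.60)

print("z =", z)
print("P(weight >= 62) =", round(p_62_or_more, 4))
print("middle 20%:", round(low, 2), "to", round(high, 2))
```

The same `z_score` helper can be applied to two individuals from different groups, each with his or her own group's mean and standard deviation, to see who stands further above the group.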

Slide 5
A More Detailed Question
• Last year, Eunice had Mr. Allen for math and received an 87% in the class, while Irene had Mr. Merlo for the same math class and received an 80%. It has been mathematically proven that Mr. Merlo is a much harder teacher. In fact, Mr. Allen's class average was 15 points higher than Mr. Merlo's 70% average. If we know that Mr. Allen's class had a standard deviation of 2% and Mr. Merlo's class had a standard deviation of 10%, who is smarter?

Slide 6
Density Curves
• A density curve is the smooth curve you get when you collect a lot of data and graph its distribution.
• It has an area of exactly 1 underneath it.
– That's because it represents 100% of your data.
– The median cuts the area in half.
– The mean is the balance point.

Slide 7
Two Different Density Curves

Slide 8
What Is the Most Common Density Curve?

Slide 9
Normal Distribution
• This is the standard bell-shaped curve.
• The mean and median are always the same in a normal distribution.
• Although different normal distributions are similar, they might have different shapes.
– Some are "taller" or "wider" than others.
– What determines how "tall" or "wide" a normal distribution is?
• The standard deviation.

Slide 10
One Up, One Down
• Although the shape may change, the proportion of the data within a given number of standard deviations of the mean remains the same.
– 68% of the outcomes are between one standard deviation below and one standard deviation above the average.
– Notice that one standard deviation away is at the inflection point.

Slide 11
The Empirical Rule
• The Empirical Rule (68-95-99.7 Rule) tells you the proportion of the data that is in the middle when you move 1, 2, and 3 standard deviations away from the mean.

Slide 12
Standard Normal Calculations
And what the AP graders are looking for

Slide 13
Finding a Probability
• If a population is known to have a normal distribution of ages with an average of 16 and a standard deviation of 1.2, what is the probability that a randomly chosen individual will be older than 18?
• N(µ, σ) = N(16, 1.2)
P(x > 18) = P(z > (18 – 16)/1.2) = P(z > 1.67) = 1 – 0.9525 = 0.0475

Slide 14
Know how to find the probability of an event occurring
• Using the same information from the previous slide, what proportion of the population is between the ages of 16 and 17?
N(16, 1.2)
P(16 < X < 17) = P((16 – 16)/1.2 < Z < (17 – 16)/1.2) = P(0 < Z < 0.83) = 0.7967 – 0.5 = 0.2967

Slide 15
Know what a percentile is and how to find a value at a certain percentile
• Using the same information from the previous two slides, what age does an individual have to be in order to be above the 35th percentile?
N(16, 1.2)
P(Z < -0.39) = 0.35
-0.39 = (x – 16)/1.2
-0.47 = x – 16
x = 15.53

Slide 16
Calculator
• normalcdf(lowerbound, upperbound, avg, s.d.)
Example: Find P(x > 18) = normalcdf(18, 999999999, 16, 1.2)
• invNorm(percent behind, avg, s.d.)
Example: P(x < ___) = 0.35, so use invNorm(0.35, 16, 1.2)

Slide 17
Know how to find a population average if you know the probability of an event and s.d.
In a certain baseball league, 20% of the individuals have more than 60 RBIs. If the standard deviation of all the players' RBIs is 15 and the distribution is known to be approximately normal, what is the average number of RBIs in this league?
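
One way to check the calculations on Slides 13 through 17 is with Python's standard `statistics.NormalDist`, which plays roughly the role that normalcdf and invNorm play on the calculator. This is a sketch rather than the method the slides use, and the variable names are made up. The code keeps z unrounded, so the results differ a little in the third decimal place from the table-based answers (0.0475, 0.2967, 15.53).

```python
from statistics import NormalDist

ages = NormalDist(mu=16, sigma=1.2)

# Slide 13: P(x > 18), the area to the right of 18.
p_over_18 = 1 - ages.cdf(18)

# Slide 14: P(16 < x < 17), the area between 16 and 17.
p_16_to_17 = ages.cdf(17) - ages.cdf(16)

# Slide 15: the age at the 35th percentile (35% of the area is below it).
age_35th = ages.inv_cdf(0.35)

# Slide 17: 20% of players have more than 60 RBIs, so 60 sits at the 80th percentile.
# Solve z = (60 - mu) / 15 for mu, where z is the 80th percentile of the standard normal.
z_80 = NormalDist().inv_cdf(0.80)
league_mean = 60 - z_80 * 15

print("P(x > 18) =", round(p_over_18, 4))
print("P(16 < x < 17) =", round(p_16_to_17, 4))
print("35th percentile age =", round(age_35th, 2))
print("league average RBIs =", round(league_mean, 1))
```

For the RBI question, the key step is recognizing that "20% above 60" means 60 is the 80th percentile, so its z-score is positive and the league average must be below 60.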
Since 20% of players are above 60 RBIs, 60 is the 80th percentile, which has a z-score of about 0.84. So 0.84 = (60 – µ)/15, and µ = 60 – 0.84(15) = 47.4.
The league average is 47.4 RBIs

Slide 18
Know how to find the middle n% of a normal distribution
• Looking at a distribution that is N(10, 2), what interval contains the 20% of the population with the largest number of individuals?
Solution
In any normal distribution, the n% with the largest number of individuals will be in the middle, because that is where the data is most concentrated. So, this question is really just, "where is the middle 20%?"

Slide 19
Solution Continued
• Since we're looking for the middle 20% of a N(10, 2), we will look for the z-scores that have 40% above and 40% below.
• The 40th percentile is at about z = -0.25, which gives us an x value of 10 + (-0.25)(2) = 9.5.
• A similar method using z = 0.25 would give us an x value of 10.5. So, the smallest interval containing 20% of the data is between 9.5 and 10.5.

Chapter 3

Can you plot a scatter plot? What goes on the x-axis? The y-axis? Don't forget to label.
(10, 1b;08, 4a) Know how to get the equation of a LSRL from a Minitab printout and how to make it context specific, (10b, 6a; 08, 6b; 06, 2a; 05b, 5a; 05b, 5b) Know how to interpret the slope and y-intercept of a LSRL in a the context of a problem, (10b, 6a; 08, 6b; 07, 6abe; M2, 31) Know what a residual, how to calculate it give data points and a LSRL, and know how to interpret it contextually, (10b, 6b; 07b, 4b; M2, 17) Know how to describe a scatterplot (direction, strength, linear/nonlinear), (08, 4b; 08b, 6b; 04b, 1a) Know how to graph a least squares regression line on a x and y plane, (07, 6d; 07b, 4a) Know what happens to the slope of a LSRL if new data points are added, (07b, 4c; 03b, 1) Know what happens to the correlation of a set of data if new data points are added, (07b, 4c; 03b, 1) Know that the s in the bottom left of a Minitab is the standard deviation of the residuals, or typical distance each observation is from the LSRL, and know how to interpret that in the context of a problem, (06, 2b) Know that you need a scattered residual plot to prove that something is linear (correlation does not prove linearity), (05, 3a; 04b, 1c; M7, 40) Know how to find an expected number for a LSRL if you are given an x-value (plug it in), (05, 3b) Know what r-sq is and how to interpret it in the context of the problem and how to find it if you know the correlation, (05, 3c; M2, 34) Know what extrapolation is and when it is and is not appropriate, (05, 3d) Know what happens to correlation if you change the units of measurement (Change weight from lbs to kgs), (M7, 10; M2, 6) Know how correlation relates to the slope of a LSRL, (M7, 19) 2010 #1 (If you do Chapter 5 before) 2010B # 6 (If you do Chapter 10 before) 2008 #4 (If you do Chapter 7 before) 2008B #6 (If you do Chapter 13 before) 2007B #4 2006 #2 2005 #3 2003B #1 Slide 1 ___________________________________ ___________________________________ Chapter 3 Examining Relationships ___________________________________ 

Slide 2
Scatterplots and Correlation

Slide 3 Variables Review
• An explanatory variable is the variable that we believe is causing the change.
– If we were testing a new blood pressure drug, the explanatory variable would be the dosage of the drug.
• A response variable is the variable that we believe is changing due to the explanatory variable.
– In the blood pressure example, it is the blood pressure.

Slide 4 Scatterplots
• A scatterplot shows the relationship between two quantitative variables measured on the same individuals.
• If there is an explanatory variable, it goes on the x-axis.
• If there is a response variable, it goes on the y-axis.
• If there does not appear to be a clear explanatory or response variable, it does not matter which variable goes where.

Slide 5 Example
• Here is a scatterplot comparing the age and height of plants.
• We know that there are 22 plants.
• We believe that age is the explanatory variable.
• We should have a title at the top as well as a label with units for each variable.

Slide 6 Describe a scatter plot
• When describing a scatter plot, one must talk about strength, association, and form.
• Strength
– We use general terms like strong, weak, slightly strong, slightly weak, very strong, etc.
• Association
– It is either positive, negative, or neither.
• Form
– It is linear or non-linear.

Slide 7 Describe a scatterplot
• This appears to have a strong, positive, linear relationship.

Slide 8 Describe another scatterplot
• This scatterplot has a very strong, negative, non-linear relationship.

Slide 9 Correlation
• Correlation is a number that we use to measure HOW linear a relationship is.
• It is a number between -1 and 1.
– Negative association = negative correlation
– Positive association = positive correlation
– No association = correlation of about 0
• The closer to 1 or -1, the more linear a relationship is.

Slide 10 Correlation

Slide 11 Notes about correlation
• Never switch correlation and association when describing a relationship.
– You would never say there is a strong, positive, linear correlation.
• Correlation is a number.
– You would never say something has a strong correlation.
– That’s like saying it has a strong 0.76.
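As a rough illustration of where that number comes from, here is a minimal sketch that computes r with the usual z-score formula (multiply each point's z-scores, add them up, divide by n - 1). The plant data below are made up for illustration and are not the data from the slides.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Return r: the sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical data (invented for this sketch): age in months, height in inches
ages = [1, 2, 3, 4, 5, 6]
heights = [1.5, 3.0, 3.4, 5.0, 5.5, 6.8]

r = correlation(ages, heights)                         # positive association
r_flipped = correlation(ages, [-h for h in heights])   # same strength, negative direction

print(round(r, 2), round(r_flipped, 2))
```

Flipping the direction of the association only flips the sign of r; the number still stays between -1 and 1.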

Slide 12 Notes about correlation
• Just because something has a correlation very close to 1 or -1 does not mean it is necessarily linear.
– This graph is non-linear, but would have a very high correlation.

Slide 13 Know what happens to correlation if you change units
• Changing the units used by the variables will not change the correlation.
• For example, in our plant problem, if we changed all the age measurements to days (instead of years) and all the height measurements to cm (instead of inches), we would get a very different looking scatterplot.
• But it would not change the correlation, because changing the units does not change the overall relationship between age and height.

Slide 14 Know how correlation relates to the slope of a LSRL
• One way to find the slope of a LSRL is using the equation b = r(s_y / s_x), where s_x and s_y are the standard deviations of x and y.
• So, for the same standard deviations, as the correlation gets larger, so does the slope.
• You can also use this to see how data affects slope.

Slide 15
Least-Squares Regression

Slide 16 Least Squares Regression Line (LSRL)
• A LSRL is a line of best fit for a linear association.
• It is usually written in the form ŷ = a + bx.
– This is just another way to write y = mx + b.
– We use ŷ because it is a predicted value for y, not the actual y value that will occur.
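The two ideas above (units do not change r, and the slope comes from r and the two standard deviations) can be sketched together. This is only an illustration under stated assumptions: the data are invented, the intercept uses the standard fact that the LSRL passes through (x̄, ȳ), and a month is treated as 30 days for the unit change.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Return r: the sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + b*x, using b = r * s_y / s_x."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)   # the line passes through (x-bar, y-bar)
    return a, b

# Hypothetical data (invented for this sketch)
ages_months = [1, 2, 3, 4, 5, 6]
heights_in = [1.5, 3.0, 3.4, 5.0, 5.5, 6.8]
a, b = lsrl(ages_months, heights_in)

# Change units: months -> days (assuming 30-day months), inches -> cm
ages_days = [30 * x for x in ages_months]
heights_cm = [2.54 * h for h in heights_in]

r_original = correlation(ages_months, heights_in)
r_converted = correlation(ages_days, heights_cm)   # matches r_original up to rounding
a_cm, b_cm = lsrl(ages_days, heights_cm)           # the slope does change with the units
```

The correlation survives the unit change, but the slope does not, because s_x and s_y are measured in the new units.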

Slide 17 LSRL
• This scatterplot has a LSRL of ŷ = 0.83 + 0.96x.
• So a plant that is 4 months old should be close to 4.67 inches tall.
• Can we make a prediction that at 20 months old the plant will be 20.03 inches tall?
• No. This is called extrapolation and is very dangerous.
• You cannot make a prediction about data outside of the domain of the data that you collected.

Slide 18 Interpreting a LSRL
Let’s look at ŷ = 0.83 + 0.96x comparing age and height of plants.
• Interpret the slope
– In general, we would say that for every increase of 1 in x, y increases by an average of “b”.
– In this case, for every increase of one month in the age of a plant, the height increases by an average of 0.96 inches.
• Interpreting the y-intercept
– In general, we would say that when x is 0, the predicted y is “a”.
– In this case, a plant that is 0 months old should be 0.83 inches tall.
– This obviously does not make sense, which does occur sometimes when you interpret the y-intercept.
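Plugging into the LSRL, and refusing to extrapolate, can be sketched as follows. The equation ŷ = 0.83 + 0.96x comes from the slides, but the age range of the collected data is not given there, so `min_age` and `max_age` below are assumed placeholder values.

```python
def predict_height(age_months, min_age=1.0, max_age=8.0):
    """Predicted plant height in inches from y-hat = 0.83 + 0.96x.

    min_age and max_age are ASSUMED bounds for the ages that were actually
    observed; the slides do not state the real range of the data.
    """
    if not (min_age <= age_months <= max_age):
        # Outside the domain of the collected data: this would be extrapolation
        raise ValueError(f"age {age_months} is outside the observed range")
    return 0.83 + 0.96 * age_months

height_at_4 = predict_height(4)                        # about 4.67 inches
slope_check = predict_height(5) - predict_height(4)    # one more month adds about 0.96 inches
```

Calling `predict_height(20)` raises an error here instead of returning 20.03, which is the point of the extrapolation warning.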

Slide 19 Residuals
• A residual is the vertical distance between an observed point in a scatterplot and the predicted point from a LSRL.
– Residual = observed - predicted = y - ŷ
• For the three points at age 4, there are 3 residuals:
– 7.5 - 4.67 = 2.83
– 5 - 4.67 = 0.33
– 4 - 4.67 = -0.67

Slide 20 Residual Plot
• A residual plot is used to determine if something is linear.
– Note: correlation does not determine if something is linear, it determines how linear.
– Here is an example of a residual plot of a linear association.

Slide 21 Residual Plot
• You know that it is a linear relationship, because the residuals are scattered.
• This residual plot is plotted against ŷ.
– You can do this, but I usually plot it against the x variable.

Slide 22 More on residual plots
• Here is a residual plot for a non-linear association.
– We know that it is non-linear because the residual plot is NOT SCATTERED.

Slide 23 Coefficient of determination (R-Squared)
• The coefficient of determination (the square of the correlation) is a number that represents the “percent of the variation in y that can be explained by x”.
• Let’s say that the correlation for a scatter plot comparing age to amount of hair is -0.7 (because you lose hair as you age).
• In this case r squared would be 0.49.
• So, 49% of the variation in people’s amount of hair can be explained by their age. The remaining 51% of the variation is due to other factors.

Slide 24 Reading a Minitab Printout
• Here is a Minitab printout of amount of tile versus cost of laying it in a house.
• You can see the LSRL on the top left.
• R-squared is 0.81 (we never use adj).
• So, we could conclude that the correlation is 0.9, because that is the square root of 0.81, and we know that the correlation is positive because the slope is positive.
• The s = 9.282 is the standard deviation of the residuals.
• So, the typical point on the scatterplot will be about 9.282 dollars above or below the predicted value.

Chapter 4
Know how to use the inverse function to solve for ln y, (09b, 6c)
Know how to interpret r squared in terms of transformed data, (04b, 6b)
Know how to look at a residual plot with a relationship between x and log y to determine if something is linear or non-linear, (M2, 28)

2009B #6
2004B #1

Chapter 5
Know what a treatment is and how to list them based on the description of an experiment with more than one factor. (10, 1a; 06, 5a; 06b, 5a)
What is an experimental unit? (10, 1a; 06b, 5a)
What is a response variable? (10, 1a; 06b, 5a)
Know what a stratified random sample is, how to describe it so a reasonable person could do it, and how to implement one where you need to represent different proportions in a population (10% Hispanic, 20% Native American, etc), (10, 4c; 10b, 2b; 05, 5c; M7, 20; M2, 15)
Know how to take a simple random sample from a large population, and how to describe it so a reasonable person could do it, (10b, 2a; 08, 2c; 04b, 2a)
Why is a stratified sample sometimes better than a SRS? Better than a cluster? (10b, 2c)
Why wouldn’t it be appropriate to assign people by flipping a coin at times? (09, 3a)
Why is it important to assign individuals in an experiment as opposed to letting them pick? (09, 3b; 03, 4a)
Know what a block design is and how to randomly assign experimental units within that design, (09b, 4a; 07b, 3b; 04, 2ab; M2, 16)
Know what it means for an experiment to be double blind and how it can be implemented, (09b, 6a)
Know what nonresponse bias is and how it can affect results of an observational study, (08, 2a; M7, 9)
Know how to create a completely randomized design experiment, including how to randomly assign your experimental units so that a reasonable statistician could do it, (09, 3a; 08b, 4a; 07, 2b; 06, 5b; 06b, 5b; 05b, 3a; 03b, 4a; M7, 35; M2, 25)
Know what a control group is, why we control experiments, and how to describe its benefits in the context of a problem, (07, 2a; 03, 4b; 03b, 4b)
Know why it is beneficial to do a block design at times, and why it is important to create homogeneous groups, (07, 2c; 07b, 3a; 04, 2c; M7, 14; M7, 31)
Know the difference between an experiment and an observational study, (07, 5a; 03b, 3a; M2, 1)
Know that you can only make conclusions about the population that your sample is taken from (If I choose five people from 4th period, I can make conclusions about 4th period. If I take 5 people from CV, I can make conclusions about CV), (06, 5cd; 05, 1b; 04, 3c; 04, 5b; 03, 4c; 03b, 4d; M7, 16)
Know how replication is used to improve a study/experiment and how it is implemented correctly, (06b, 5c; 05, 1c)
Know what confounding is and how it can affect the results of an experiment, (06b, 5d)
Know what bias is and be able to explain in context how a bias can directly affect the results (the proportion would be higher if the sample was truly random), (05, 5a; 04b, 2a)
Know how to create a matched pairs design, including how you randomly assign treatments, (05b, 3b; 04b, 4b)
Know what wording bias is, how to fix it, and how it affects an outcome in the context of the problem, (04b, 2b)
Know what a census is, (M7, 2)

2010B #2
2008 #2
2007 #2
2007B #3
2006 #5
2006B #5
2005 #1
2004 #2
2004B #2
2003 #4

Slide 1
Chapter 5 Producing Data: Samples and Experiments

Slide 2
DESIGNING SAMPLES

Slide 3 Does my mommy really love me?
• An advice columnist, Ann Landers, once asked her readers, “If you had it to do over again, would you have children?” A few weeks later, her column was headlined, “70% OF PARENTS SAY KIDS NOT WORTH IT.” Indeed, 70% of the 10,000 respondents said they would not have children.

Slide 4 Designing Samples
• Population
– This is who we are trying to study.
• We usually can’t get everyone, though.
• Sample
– A part of the population that represents the whole.
– What is a true sample?
• Is our class a sample that represents the school?

Slide 5 Types of Samples
• Census
– When you can survey/test everyone in the population.
• Voluntary Response (Self-Selected) Sample
– When people choose whether or not to respond.
– American Idol
– Mail-home survey
• Convenience Sample
– When you survey/test those easiest to reach.
– Taking a survey in the quad at lunch.
• Quota Sample
– When you hand pick a group that seems to match your population.
• Probability Sample
– Each member of the population has a known probability of being in the sample.

Slide 6 Probability Sampling
• Simple Random Sample (SRS)
– A sample of size n chosen so that every set of n individuals is equally likely to be chosen.
– This is the “best” type of sampling.
• Systematic Sample
– Picking every nth individual.
– Every third person that comes through the door will win a prize.
– This is random, but it is not an SRS. The first two people through the door can’t both win.
• Stratified Random Sample
– Subgroups (strata) are picked that are similar in some way, and then individuals are chosen out of each group.
– They can be split up by proportion. If 55% of the population is female, then I will make sure that my sample is 55% female.
– This is beneficial if you want to represent certain groups of a population, or you need to make sure a certain group is represented.
– For example, in a large group you might have a 2% population of Native Americans, but you might not get many of them if you took an SRS.
You would want to make sure you get them in your sample by doing a stratified sample.

Slide 7 Other Sampling Methods
• Cluster (Area) Sampling
– The population is split into clusters and only certain clusters are studied to get a feel for the population.
• If I want to get a feel for town governments, an SRS will cause us to have to do too much travelling.
• So, we randomly choose five counties (these are your clusters) and then study every town government in those counties.
• This saves us travelling time, but still gives us a random sample of the population.

Slide 8 More Sampling Methods
• Multistage Sampling
– This is when you use a sampling method, or a combination of sampling methods, more than once to get a sample of the population.
– If I want to interview California residents, I might do a cluster sample to pick different counties, and then an SRS to pick individuals. This is a two-stage sample.
– If I want to study US seniors, we might do a stratified random sample to get districts of certain demographics, then do an SRS to get a smaller number of schools, then do an SRS of seniors in those schools. This is a three-stage sample.

Slide 9 Sampling Bias
• Bias
– A design is biased if it systematically favors a certain outcome.
• Undercoverage
– This is when certain groups are left out of the sample.
– A telephone survey can have undercoverage, because people without phones aren’t included.
– How is a systematic sample biased?
– Most samples, no matter how good, suffer from some undercoverage.
• Nonresponse
– An individual can’t be contacted or refuses to participate.
– This occurs if I randomly call 100 houses, but only 50 are reached or only 50 agree to participate.
• Response Bias
– Occurs if a respondent gives false answers, can’t understand the question, or wants to please the interviewer, or if the ordering of the questions favors an answer.
• Wording Bias
– The wording of a question affects the outcome.
– “Don’t you think the driving age should be raised to 18 since teenagers are so reckless?”

Slide 10 Using a Table of Random Digits
• When you pick an SRS, you need to be random.
• We can use a table of random digits (Table B).
• Assign a numerical label to every individual.
– Make sure that every label has the same number of digits.
– Don’t do 0001-1000, because then you have to use four-digit numbers. Instead use 000-999.
• Use Table B to select at random.

Slide 11 Using Your Calculator
• Go to MATH/PRB
– Choose randInt
• randInt(1, 100, 23)
– You will randomly pick 23 numbers between 1 and 100.
• randInt(0, 99, 45)
– You will randomly pick 45 numbers between 0 and 99.

Slide 12
Designing Experiments

Slide 13 Observational Study vs. Experiment
• An observational study observes and records behavior but does not impose a treatment.
– I’m going to take a survey to see how many students drink energy drinks.
• An experiment is a study in which the researcher imposes some sort of treatment.
– I want to determine the effects of energy drinks on hours of sleep. So, I’m going to give some students energy drinks, and the others aren’t allowed to drink energy drinks.
• The difference is that an experiment is

Slide 14 Experimental units and treatments
• An experimental unit is an individual on which a treatment is being imposed.
– An experimental unit is called a subject if it is a person.
• A treatment is a specific experimental condition applied to the experimental units.
– Two different individuals in an experiment might get two different treatments.
• One might get an energy drink, another might not.
• To find the number of treatments when there is more than one variable, you use the multiplication principle.
– For example, I am testing based on energy drinks and number of classes.
» So, there are two levels for energy drinks (yes or no) and three for number of classes (4, 5, 6), which means that there are six total treatments.

Slide 15 Explanatory and Response Variables
• An explanatory variable is what is being implemented.
– This is the amount of caffeine given or the dosage of blood pressure medicine.
– Each explanatory variable is referred to as a factor.
– A factor can have different levels.
– In our drink and classes experiment there are two factors: energy drink and number of classes.
– Two levels in one factor (yes/no) and three in the other (4, 5, 6) create six different treatments.
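To see the multiplication principle at work, the six treatments can be listed out by pairing every level of one factor with every level of the other. This is a small sketch using Python's standard `itertools.product`; the variable names are made up for illustration.

```python
from itertools import product

# The two factors and their levels from the energy drink example
energy_drink_levels = ["yes", "no"]
class_count_levels = [4, 5, 6]

# Every combination of one level from each factor is one treatment
treatments = list(product(energy_drink_levels, class_count_levels))

for drink, classes in treatments:
    print(f"energy drink: {drink}, classes: {classes}")

print(len(treatments))  # 2 levels x 3 levels = 6 treatments
```

Adding a third factor would just mean adding another list to the `product` call.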
• A response variable is what is being measured.
– This would be blood pressure or the number of hours of sleep.
• An experiment usually is trying to determine if or how

Slide 16 Principles of Experimental Design
• There are three principles of experimental design:
1. Control
2. Randomization
3. Replication

Slide 17 1. Control
• The biggest aspect of the actual experiment is whether or not you are controlling the lurking variables and confounding.
– Is it the treatment that is affecting the response variable, or is it something else?
– Lurking variables are those that are not among the explanatory and response variables but can influence results.
• Many experiments are controlled with a placebo.
– Half of the class will get the love potion while the other half gets sugar water. This way we know if it’s the love potion or just a newfound confidence.
• Controlling experiments reduces the chances of confounding.
– Confounding occurs when you cannot distinguish if the explanatory variable is causing an effect or if another

Slide 18 Controlling Bias
• You can avoid some personal bias by blinding experiments.
• All experiments should at least be single blind.
– The subjects should not be aware that their treatment is different than someone else’s.
– You don’t tell the subject her dosage is higher.
• In order to avoid bias from the person implementing the experiment, it can be made double blind.
– In this case the implementer and the subjects are not aware of the differences in the treatments.
– The doctor does not know whether he is giving medicine or a placebo.

Slide 19 2. Randomization
• How are you picking your units/subjects?
• You want to equalize groups so that lurking variables will be equal among the different groups.
• We want to make the groups as equal as possible except for the difference in treatments.
– If I were to study heart medicine, I wouldn’t put all the people who have had heart attacks in one group. I would want them to be in both groups.
• You can use the different methods of sampling in order to create randomization.

Slide 20 3. Replication
• The more units/subjects I have, the better.
• The bigger the number, the more likely you are to have a representation of the population.
• This reduces the role of chance variation in the results.
• I don’t have to run the experiment more than once.
– I just need to have a lot of experimental units.

Slide 21
Types of Experiments

Slide 22 Completely Randomized Design
• In a completely randomized design, all of the experimental units are assigned to the treatments at random. This is like taking an SRS.
• In a completely randomized design each treatment group is unique and independent of the others.
Example
I want to test the effects of energy drinks and number of classes on sleep.
I have created six treatment groups based on the two factors. I put the names of the 300 high school students that have volunteered in a hat. The first fifty names pulled will be in the yes/4 group, the next 50 in the yes/5 group, and so on. We will measure every individual’s sleeping patterns for a month and then compare.

Slide 23 Block Design
• A block design separates the experimental units into blocks and runs the experiment within each block.
• This is the same idea as a stratified random sample.
• We could create gender blocks of men and women.
• Each block receives the exact same treatments.
• Although it is nice, blocks do not have to be the same size. We can have 55 men and 45 women.
Example
Using the same information on energy drinks from the previous slide, I will split up the 300 volunteers into two groups based on gender. I will then take all the men and randomly put them into six groups (one for each treatment) using an SRS and run the experiment as before. I will then take the women and put them into six groups (one for each treatment) using an SRS and run the experiment. I will collect data for a month and then compare the results.

Slide 24 Matched Pairs
• A matched pairs design is a type of block design that compares only two treatments.
– I will have several pairs of fish tanks in different parts of the room. One gets one fish food, one gets the other.
• In this case the different parts of the room are the blocks.
• You can also have one subject get both treatments.
– Which is better, Dr. Pepper or Diet Dr. Pepper?
• In this case, each individual is the block.
Example
I want to determine if a new type of bicycle tire will last longer than the old one. I have found 100 bicyclists and asked them to take one new tire and one old tire. 50 of them will put the new tire on the front and the old on the back, and the other 50 will do the opposite. We will measure each tire on a 10-point scale, find the difference between the new and old (n - o), and review our results.

Slide 25 What the data looks like
• Completely Randomized and Block designs
– You will have at least two lists of data, one for each treatment group.
– In our example, the group that had the energy drink and four classes should have 50 pieces of data measuring each individual’s average hours of sleep during that month.
• y/4: {7.1, 8.0, 6.8, …}
• y/5: {7.0, 8.0, 6.6, …}
• y/6: {6.8, 7.1, 7.2, …}
• …
• Matched Pairs
– Since we are comparing two treatments in individual blocks, we will be looking at one list of data, usually representing a difference.
• In our example with the tires, we would have 100 numbers representing the difference (New - Old) from each biker’s tires.
– Difference: {1.0, 0.5, -0.2, 0.0, 2.0, …}

Slide 26
Simulating Experiments

Slide 27 Simulations
• You can run simulations the same way that you do an SRS.
• I want to run a simulation of picking ten people where 53% are men and 47% are women. Either labeling works:
– 00-52 represent men; 53-99 represent women
– 01-53 represent men; 54-99, 00 represent women
• I can use Table B or randInt on my calculator.
• How many women were picked in this simulation?
Slide 28
What We Missed
Slide 29 Know how to take a SRS from a large population
• Observational study
– Put a name in a hat for every individual from a population and choose n individuals
– Assign every individual in the population a number and use a RNG or a table of random digits to pick n people
• Experiment
– Put all the experimental units’ names/assigned #s in a hat; the first n/2 you pull go into one group, the remaining go in the other group
– For every individual, we flip a coin; if it’s heads they go into one group, if it’s tails they go in the other group.
• Once one group fills n/2, the remaining individuals go in the other group
• You have to make sure that the individuals are chosen in a random order. You would not want to go through students in order of grade in a class, because the last students would all be put into one group, and they are all the students with the lowest grades
– For every individual, roll a die.
If it’s a 1 or 2 they go into one group…
• As with the coin, you have to make sure that individuals are chosen in a random order
Slide 30 Know who you can draw conclusions about
• You can only draw conclusions about the group from which you drew your sample
– If I took 100 random students from CV, I could only draw conclusions about CV students, not students in general
– If I took 100 students from California, I could only draw conclusions about students from California, not the nation
• It also does not matter how many you take as long as it’s random
– If I randomly chose 5 students from CV, I could make a conclusion about students from CV as long as it’s random
– It does not matter how large the sample is
– We will talk about the drawbacks of small samples second semester
Chapter 6
Know how to find a probability, a union probability (A or B), and a conditional probability using a two way table, (10b, 5ab; 03b, 2ab; M7, 18)
Know the two ways to check for independence (10b, 5c; 03b, 2c)
Know how and when to find a conditional probability, (09b, 2a)
Know how to find union probabilities of disjoint events, (09b, 2bc; 08, 3cd; 04, 4a)
Know how to find a joint probability (A and B) of two or more independent events, (08, 3b; 04, 3b; 04, 4a; 03b, 5a; M7, 38; M2, 36)
Know how to interpret a probability in the context of a problem and deem if it is likely to occur or not, (04, 3c)
Know what it means for events to be mutually exclusive and how that affects their joint probability and their union probability, (M7, 36; M2, 23)
Know how to set up a simulation and perform it using a table of random digits, (M2, 4)
2009B #2 2003B #2
Slide 1
Chapter 6: PROBABILITY The Study of Randomness
Slide 2
Simulation
Slide 3 Simulating Randomness
• Simulation is the imitation of chance behavior, based on a model that accurately reflects the phenomenon under consideration
Example A statistician wants to simulate pulling ten people at random from the US population. Describe a simulation to estimate how many women there will be.
Solution The statistician will assume that women and men are split 50-50. So, he will flip a coin ten times and every head he gets will represent choosing a woman from the population. After ten flips he will record the number of women. He will run this simulation 100 times, recording the number of women every time, and then will average his 100 numbers to estimate the number of women that he “should” get.
Slide 4 Another Simulation
Example Gary is a pretty decent free throw shooter, converting 81% of his free throws last season. He figures this year that he will take about 50 free throws. Run a simulation to estimate how many of them he should convert.
Solution Using the numbers 1-100 we will assign 1-81 as a make, and 82-100 as a miss. Using the random number generator on our calculator we will simulate 50 free throws: randInt(1, 100, 50). We will record the number of makes. Then we will run this simulation 19 more times, and take the average of the makes to give him an idea of how many he “should” make this season.
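The free throw simulation can also be sketched in Python instead of on the calculator. This is only an illustration of the steps described in the solution: the function names and the fixed seed are assumptions made here, and `rng.randint(1, 100)` stands in for the calculator’s randInt(1, 100).

```python
import random

def simulate_season(rng, shots=50, make_cutoff=81):
    """One simulated season: draw an integer 1-100 per shot; 1-81 counts as a make."""
    return sum(1 for _ in range(shots) if rng.randint(1, 100) <= make_cutoff)

def average_makes(seasons=20, seed=0):
    """Repeat the season simulation and average the number of makes.

    The seed is an assumption added only so the run is repeatable.
    """
    rng = random.Random(seed)
    results = [simulate_season(rng) for _ in range(seasons)]
    return sum(results) / seasons, results

if __name__ == "__main__":
    avg, results = average_makes()
    print("makes per simulated season:", results)
    print("average makes:", avg)
```

With 20 simulated seasons the average should land near 0.81 × 50 = 40.5, but any single run will vary.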
Slide 5
Randomness
Slide 6 The Language of Probability
Random
◦ A phenomenon where outcomes are unpredictable, but a pattern will emerge in the long run.
What is the pattern when I flip a coin?
Probability
◦ The proportion (percentage) of times that an event will occur after many repetitions.
What is the proportion of heads that we will get?
Independence
◦ Events are independent if one event has no effect on another event.
Flipping a coin twice.
Slide 7
Probability Models
Slide 8 Sample Spaces and Events
Sample Space
◦ All possible outcomes.
Flipping a coin: {H, T}
Rolling a die: {1, 2, 3, 4, 5, 6}
Event
◦ An outcome or set of outcomes of a random phenomenon.
Multiplication Principle
◦ When you combine two phenomena, the size of the new sample space is the product of the sizes of the two original sample spaces.
When I flip a coin and roll a die, there are 2 X 6 = 12 outcomes in the sample space.
We used this last chapter when we were establishing how many experimental treatments there are
Slide 9 Probability Rules
1. The probability of any event must be between 0 and 1.
2.
The probability of the sample space is 1.
   1. P(S) = 1
3. The complement of an event is the event that it does not occur; its probability is 1 minus the probability of the event.
4. The addition rule.
   1. Disjoint (Mutually Exclusive) events are events that don’t have anything in common.
Slide 10 Complement rule
• The complement rule states that P(not A) = 1 – P(A)
Example #1 If P(G) = 0.4, then P(not G) = 0.6
Example #2 If P(getting an A) = 0.1, then P(not getting an A) = 0.9
Slide 11 The Addition Rule
• If two events are disjoint (mutually exclusive), then the addition rule is P(A or B) = P(A) + P(B)
Slide 12
What is Independence?
Slide 13 Independent and “And”
A Partner Question If 50% of the population are men and 20% of the population have a college degree, what percent of the population falls under both categories? It might help to pretend that there are only 100 people in the entire population.
Slide 14 The Question
• What did we have to assume in order to get an answer of 10%?
– That men were just as likely to have a degree as women.
• That is independence
• How does our answer change if we know that 30% of women have a degree, while only 10% of men have a degree?
– That still gives us the 20% of the population that have a degree, but gives us a different answer to our problem: P(man and degree) = 0.5 X 0.1 = 0.05 or 5%.
Slide 15 What Statistically is Independence?
• Two events are independent if P(A and B) = P(A) X P(B)
• In our example
P(man and degree) = P(man) X P(Degree)
P(man and degree) = 0.5 X 0.2
P(man and degree) = 0.1 or 10%
Slide 16 What If They’re Not Independent?
• If two events are not independent then P(A and B) ≠ P(A) X P(B)
• If we know that 50% are men and 20% have a degree, but that only 5% have both, are these events independent?
P(man and degree) ?=? P(man) X P(Degree)
0.05 ?=? 0.5 X 0.2
0.05 ≠ 0.1
These are not equal. So, being a man and having a degree are not independent. Knowing that a randomly chosen person is a man changes the likelihood that the person has a degree.
Slide 17 What is independence?
Events are independent if one event has no effect on another event.
• Essentially that means that if one event occurs, it does not change the likelihood of another event occurring
Slide 18
General Probability Rules
Slide 19 The General Addition Rule
• If two events are NOT disjoint, then P(A or B) = P(A) + P(B) – P(A and B)
Slide 20 Our Question
Example If the probability of choosing a man from a population is 50%, the probability of choosing a college grad from a population is 20%, and the probability of choosing someone who is both is 5%, what is the probability that we would choose an individual who has at least one of those qualities?
Solution
P(M or D) = P(M) + P(D) – P(M and D)
P(M or D) = 0.5 + 0.2 – 0.05
P(M or D) = 0.65
Slide 21 What if events are Independent?
Example In a population, 25% of the group are seniors and 40% are Hispanic. If the two events have been shown to be independent, what is the probability of randomly choosing someone who is either a senior or Hispanic?
Solution
P(S or H) = P(S) + P(H) – P(S and H)
P(S or H) = 0.25 + 0.4 – (0.25)(0.4)
P(S or H) = 0.55 or 55%
Slide 22 A conditional probability
• A conditional probability is the probability of an event given that a different event has already occurred
• If I have a bag full of blue and red marbles and I have already pulled a red marble, then the probability of pulling a blue marble is written
– P(Blue|Red) = Probability of Blue “given” that a red has been drawn
Slide 23 General Multiplication Rule
• The general multiplication rule is P(A and B) = P(A) X P(B|A)
Example There are 6 red marbles and 4 blue marbles in the same bag. What is the probability of pulling a red marble followed by a blue marble, without putting the first marble back?
Solution
P(R and B) = P(R) X P(B|R)
P(R and B) = (6/10) X (4/9) = .266… or 27%
Slide 24 Know how and when to find a conditional probability
• Given the following data of M+Ms
         Plain   Peanut
Red       10       5
Yellow    12       8
Blue      15      10
• What percent of blue M+Ms are peanut?
Solution Notice that it says “of blue” not “are blue and peanut.” That’s why it is a conditional probability, not a joint probability.
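To make the difference between the conditional and joint probability concrete, here is a small Python sketch built on the M+M table. The dictionary layout and function names are assumptions chosen for illustration; only the counts come from the slide.

```python
# Counts from the M+M two way table: (color, type) -> number of candies
counts = {
    ("Red", "Plain"): 10, ("Red", "Peanut"): 5,
    ("Yellow", "Plain"): 12, ("Yellow", "Peanut"): 8,
    ("Blue", "Plain"): 15, ("Blue", "Peanut"): 10,
}

def conditional(table, given_color, kind):
    """P(kind | color): divide by the total for that color only."""
    color_total = sum(n for (color, _), n in table.items() if color == given_color)
    return table[(given_color, kind)] / color_total

def joint(table, color, kind):
    """P(color and kind): divide by the grand total."""
    return table[(color, kind)] / sum(table.values())

if __name__ == "__main__":
    print("P(Peanut | Blue)   =", conditional(counts, "Blue", "Peanut"))
    print("P(Blue and Peanut) =", round(joint(counts, "Blue", "Peanut"), 4))
```

The conditional probability uses only the 25 blue candies as the denominator (10/25 = 40%), while the joint probability uses all 60 candies (10/60, about 17%).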
Slide 25
More On Independence
Slide 26 More on the Multiplication Rule
• Here are two things that we know…
– The general multiplication rule says P(A and B) = P(A) X P(B|A)
– If two events are independent, then P(A and B) = P(A) X P(B)
• Our Conclusion: If two events are independent, then A has no effect on B. So, P(B|A) = P(B)
Slide 27 Two Ways to Check Independence
• Two events are independent if either of the following is true:
– P(A and B) = P(A) X P(B)
– P(B|A) = P(B)
Slide 28
Tree Diagrams
Slide 29 Tree Diagrams
• Tree diagrams can be used to show probabilities when you have two or more events
– Here is an example of pulling marbles out of a bag with replacement
Slide 30 Tree Diagrams Part II
• Here is a probability tree based on the likelihood of meeting with an individual on three house visits
– What is the likelihood of missing the person every time?
Slide 31
Items We Might Have Missed
Slide 32 Know how to find the union probabilities of disjoint events
An HIV test has a 99% chance of giving a negative result if an individual does not have HIV. If an individual fails (tests positive), they take a second test to make sure, and if that is negative, the individual is cleared. How likely is it that an individual who is not HIV positive will pass?
Solution In this case, there are two disjoint events: passing the first time (A), and failing the first time but passing the second time
P(A) + P(not A)*P(A) = 0.99 + (0.01)(0.99) = 0.9999
Slide 33 Know what it means for events to be mutually exclusive and how that affects their joint/union probabilities
• Disjoint (mutually exclusive) events have P(A and B) = 0
– Thus P(A and B) ≠ P(A)*P(B) (when both events have nonzero probability)
– Disjoint events are not independent and independent events cannot be disjoint
• Events that are not disjoint have a P(A and B) > 0
– They could be independent or dependent
– Events that are not disjoint could be either
• Independent events must NOT be disjoint
• Dependent events could be disjoint or not
Chapter 7
Know how to find the expected outcome of a discrete random variable, (08, 3a; 05, 2a; 05b, 2a; 04, 4b; 03b, 5b; M2, 5)
Know how to find the standard deviation when you average two variables, (08, 4c)
If you have two distributions
that are both normal and you add or subtract them, what is the shape of the new distribution? How do you find the average of the new distribution? How do you find the standard deviation of the new distribution? (08b, 5a; 05b, 2b; M7, 26)
Know how to create and read a graph of the probability distribution for a discrete random variable, (07b, 1a)
Know what the law of large numbers is and how to describe it in the context of a problem, (05, 2b)
Know how to find the median, quartiles, and interquartile range by looking at a probability distribution for a discrete random variable, (05, 2c; M7, 12; M7, 24)
Know how to find the standard deviation of a discrete random variable, (05b, 2a)
Know how to find the average and standard deviation when you combine distributions and change the units( ), (05b, 2c)
2008 #3 2008B #5 2004 #4
Probability Distribution (non specific)
o Probabilities must add to 1
Expected Outcome (Mean of the distribution)
o $\mu_X = \sum x_i p_i$
o Multiply each outcome by its probability and add together
Standard deviation
o $\sigma_X = \sqrt{\sum (x_i - \mu_X)^2 p_i}$
Example: The number of hats sold at Landry’s per week is as follows
X      0     1     2     3
P(X)   0.4   0.3   0.2   0.1
$\mu_X = (0)(0.4) + (1)(0.3) + (2)(0.2) + (3)(0.1) = 1.0$
$\sigma_X = \sqrt{(0-1)^2(0.4) + (1-1)^2(0.3) + (2-1)^2(0.2) + (3-1)^2(0.1)} = \sqrt{1.0} = 1.0$
On average, Landry’s sells 1 hat per week with a standard deviation of 1 hat. He should buy fifty two hats per year
Other notes: We know that the five number summary is 0 0 1 2 3 based on where the percentiles are.
Slide 1
Chapter 7 Random Variables
Slide 2
Discrete and Continuous Random Variables
Slide 3 Random Variables
• A random variable is a variable whose value is a numerical outcome of a random phenomenon.
– Remember that a random phenomenon is one where outcomes are unpredictable, but a pattern will emerge in the long run.
Slide 4 Discrete Random Variable
• A discrete random variable has a countable number of possible values.
– I know when I roll a die that there are exactly six possibilities.
– I know when I pick an integer between 1 and 10 that there are exactly ten possibilities.
• These individual probabilities are all between 0 and 1, and they add up to 1.
• Discrete probability histograms use bars to show all the individual probabilities.
Slide 5 Probability Histograms
Slide 6 Continuous Random Variable
• A continuous random variable has an uncountable number of individual outcomes, but probabilities of intervals can be found.
– These distributions are described by density curves.
– Probabilities of intervals are found by finding the area under the curve.
• These include normal distributions, as well as other distributions with uncountable outcomes.
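The idea that an interval probability is an area under a density curve can be sketched numerically. The example below is an illustration only: it assumes a standard normal curve, and the function names and the choice of a midpoint Riemann sum are assumptions made here, not something taken from the slides.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Height of the normal density curve at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def area_under_curve(a, b, mu=0.0, sigma=1.0, steps=10_000):
    """Approximate P(a < X < b) by adding up thin rectangles (midpoint rule)."""
    width = (b - a) / steps
    heights = (normal_pdf(a + (i + 0.5) * width, mu, sigma) for i in range(steps))
    return sum(heights) * width

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Exact area to the left of x, written with the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

if __name__ == "__main__":
    approx = area_under_curve(-1, 1)
    exact = normal_cdf(1) - normal_cdf(-1)
    print(round(approx, 4), round(exact, 4))
```

Both numbers come out near 0.68, which matches the familiar fact that about 68% of a normal distribution lies within one standard deviation of the mean.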
Slide 7 Continuous Random Variable
Slide 8
Means and Variances of Random Variables
Slide 9 Means of Discrete Random Variables
• The mean of a random variable can also be described as the expected outcome.
• To find the expected outcome of a discrete random variable…
– Multiply each possible outcome by its individual probability.
– Add up those numbers.
Slide 10 Finding Expected Value
• Here is the distribution for the AP scores from 2010.
Score   1     2     3     4     5
P(X)    .23   .18   .24   .22   .13
What was the average score (expected outcome)?
= 1(.23) + 2(.18) + 3(.24) + 4(.22) + 5(.13) = 2.84
Slide 11 Standard Deviation of a Discrete Random Variable
• Whenever you can find the average (center) of a distribution you can also find out its standard deviation (spread)
• Finding the standard deviation of a discrete random variable is similar to finding the standard deviation of a set of numbers.
• The formula is $\sigma_X = \sqrt{\sum (x_i - \mu_X)^2 p_i}$
Slide 12 Finding Standard Deviation
• Here is the distribution for the AP scores from 2010.
Score   1     2     3     4     5
P(X)    .23   .18   .24   .22   .13
What was the standard deviation of these test scores?
Slide 13 Law of Large Numbers
• The House always wins.
• The Law of Large Numbers says that if you continue to make observations of a random event, the proportion of outcomes should approach the expected probabilities.
• What would happen if you flipped a coin ten times?
– How about 100 times?
– How about 1000 times?
– How about 1,000,000 times?
• Is there a law of small numbers?
Slide 14 What happens if I change data
• Let’s say that I added a certain number to all pieces of data or multiplied all pieces of data by a number. We would follow this formula for average: $\mu_{a+bX} = a + b\mu_X$
– This formula might be used for converting between units of measurement.
• In this same situation, the standard deviation also changes: $\sigma_{a+bX} = |b|\sigma_X$
Slide 15 Example
• A sports store near the beach makes money by renting boats to patrons. Each patron must pay an initial $20 fee and then $10 per hour. If the average customer rents a boat for 3.4 hours, on average how much money does the sports store make per boat customer?
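The boat rental question is a linear change of data: cost = 20 + 10 × hours. A minimal Python sketch of the rule follows, assuming the standard results that adding a constant shifts the mean but not the spread, and multiplying by a constant scales both. The function name is hypothetical, and the 0.5 hour standard deviation is the value given in the follow-up question about this same example.

```python
def transform_mean_sd(mean, sd, a, b):
    """Mean and standard deviation of a + b*X, given the mean and sd of X.

    Adding a shifts the mean only; multiplying by b scales the mean by b
    and the standard deviation by |b|.
    """
    return a + b * mean, abs(b) * sd

if __name__ == "__main__":
    # $20 flat fee plus $10 per hour; rental time has mean 3.4 h and sd 0.5 h
    mean_dollars, sd_dollars = transform_mean_sd(mean=3.4, sd=0.5, a=20, b=10)
    print("average revenue per customer:", round(mean_dollars, 2))
    print("standard deviation of revenue:", round(sd_dollars, 2))
```

The average revenue works out to 20 + 10(3.4) = $54 per customer. The $20 fee has no effect on the spread; only the $10 per hour multiplier does.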
Slide 16 Example Continued
• Using the information in the previous problem, what would be the standard deviation of the dollar amount, if you knew that the standard deviation of the time is 0.5 hours?
Slide 17 What happens if I try to combine two groups of data
• If there are two groups of INDEPENDENT data, then the mean of the sum or difference of their data can be described by $\mu_{X \pm Y} = \mu_X \pm \mu_Y$
Slide 18 Example
• Mr. Merlo is trying to compare his first period class from this year to his first period class from last year. He knows that their class average last year was 58% and this year it is 63%. What is the average difference between the two classes?
Slide 19 What about standard deviation
• In this same situation the standard deviation also changes, but you have to use the variances to find out how much: $\sigma^2_{X \pm Y} = \sigma^2_X + \sigma^2_Y$
• Notice that it is the same formula whether you are adding or subtracting the two sets of data.
Slide 20 Example of Standard Deviation
• Mr. Merlo is trying to compare his first period class from this year to his first period class from last year. He knows that their class standard deviation last year was 6% and this year it is 8%.
What is the standard deviation of the difference between the two classes?
Slide 21 Standard Deviations of Sums
• In order to find the standard deviation of the sum or difference of two independent variables, you must add their variances, not their standard deviations.
Slide 22 Two Normal Distributions
• If we know that two distributions are normal, then we can use our rules of normal distributions to find different probabilities
Slide 23 Let’s Go Bowling
• Mr. Merlo and Mr. Furutani have entered a team bowling tournament. If Mr. Merlo has a bowling score with a normal distribution N(120, 30) and Mr. Furutani has a bowling score with normal distribution N(160, 40), what is the probability, assuming that their scores are independent, that they will bowl a combined score of at least 320?
• Mean: 120 + 160 = 280. Standard deviation: √(30² + 40²) = √2500 = 50
• So, now we know that the distribution of their combined scores will be N(280, 50)
Slide 24 Let’s Go Bowling Continued
• Mr. Merlo and Mr. Furutani have entered a team bowling tournament. If Mr. Merlo has a bowling score with a normal distribution N(120, 30) and Mr.
Furutani has a bowling score with normal distribution N(160, 40), what is the probability, assuming that their scores are independent, that they will bowl a combined score of at least 320?
– Now we know that the combined distribution is N(280, 50)
P(X > 320) = P(Z > (320 – 280)/50) = P(Z > 0.80) = .2119
There is a 21% chance that they will combine for a score of at least 320.
Slide 25
What We Missed
Slide 26
• Know how to find the standard deviation when you average 2 variables
Mr. Merlo and Mr. Furutani have entered a team bowling tournament. If Mr. Merlo has a bowling score with a normal distribution N(120, 30) and Mr. Furutani has a bowling score with normal distribution N(160, 40), what is the distribution of the average of their two scores? Assume that they are independent.
• Since they are both normal, we know that the distribution of the average will also be normal
Slide 27 Know how to…continued
• In terms of mean and s.d., we are looking at a distribution of an average
– That means we want the distribution of (M + F)/2
• The average is what you would expect it to be: (120 + 160)/2 = 140
• We have to think about the standard deviation a little more: the sum M + F has a standard deviation of 50, and dividing by 2 cuts that in half, so the s.d. of the average is 25
Slide 28 Know how to find median and quartiles by looking at a probability distribution
• Here is the distribution of scores on a 5 point test
Grade   0     1     2     3      4      5
P(G)    0.1   0.1   0.4   0.14   0.16   0.1
• What is the median?
– Median is the number that has 50% above and 50% below
– So, that number would occur at 2, because 60% of the data is 2 or smaller and 20% is 1 or smaller, so the 50th percentile has to be at 2
– By the same idea, the 25th percentile (Q1) will also be 2, since only 20% of the data is 1 or smaller
Chapter 8
Can you find the expected outcome and standard deviation of a binomial distribution? (10, 4a;10b, 3b) - 525*
Can you set up and solve a binomial problem that is cumulative using a calculator( 4b;10b, 3c;09, 2b) =binomcdf(n, p, 30)?
(2010, Know how to solve a binomial problem that is the complement of a cumulative problem ( ), (06, 3b; 06b, 3b)
If you believe that you are doing a binomial question, state so immediately in your answer, B(n, p), (10b, 3a; 09, 2b; 07b, 1b; 06b, 6c; 03, 3c)
Know how to solve a basic binomial problem (P(X=1)), (07b, 2b; 06b, 6c; 03, 3c; M2, 32)
Know the conditions for a binomial problem and when a problem is not binomial, (04, 3a; M7, 7; M7, 11)
Binomial Distribution
Both of these formulas are in your packet
Solving a binomial problem
You must state that it is binomial, state what a success is, what the probability of success is and how many observations are being made.
o This can be done with the shorthand: B(n,p)
Make sure to state what p represents.
You must show that it meets the four criteria
o Success/Failure
o There are a set number (n) of observations
o Probability of success never changes
o Each observation is independent
Plug into the formula
o P(X = k) = (n choose k) * p^k * (1 – p)^(n – k)
Describe your answer in the context of the question
Example 1 The probability of making any money in a state lottery is 0.4. There is a drawing once a week. What is the probability that you would win at least six times in a seven week period?
B(7, 0.4) where n = 7 is the number of weeks being observed and p = the probability of making any money = 0.4
1. success = making money / failure = not making money
2. There are seven weeks of observations
3. p = probability of winning = 0.4
4. There is no reason to believe that each drawing is not independent
P(X ≥ 6) = P(X = 6) + P(X = 7) = 7(0.4)^6(0.6) + (0.4)^7 = 0.0172 + 0.0016 = 0.0188
There is a 1.88% chance of randomly winning 6 or more times
Example 2 You could also do this same problem the following way:
Example 3 What is the expected outcome and the standard deviation of this distribution?
Expected = 7(0.4) = 2.8, s.d. = √(7(0.4)(0.6)) = 1.296
2010 #4 2010B #3 2006 #3 2006B #3 2004 #3 2003 #3
Slide 1 Chapter 8 The Binomial and Geometric Distributions
Slide 2 The Binomial Distributions
Slide 3 Let’s Flip A Coin
• Suppose I flip a coin three times. What is the probability that I will get exactly 2 heads?
• The number of heads follows what we call a binomial distribution.
Slide 4 What makes it binomial?
1. Each observation falls into one of just two categories: “success” or “failure.”
2. There is a fixed number of observations.
3. All observations are independent.
4. The probability of “success” never changes.
Slide 5 The Binomial Distribution
We make a set number of observations (n) and each “success” has the same probability (p).
Example
Let’s say that the probability that a girl says she will go out with me this week is 0.1, and I ask out 7 girls.
▪ B(7, 0.1)
What are we assuming is true?
▪ Each answer is independent of the others.
Slide 6 Finding Probabilities
To find a binomial probability, we can use the formula P(X = k) = C(n, k) p^k (1 − p)^(n − k)
Let’s look at the dating example: B(7, 0.1)
First, define X = # of times Merlo gets a date.
Find P(X = 2)
▪ This is a little funky because any two of the seven girls could be the ones who say yes.
P(X = 2) = 21 * (0.1)^2 * (0.9)^5
21 ways it could happen, 2 yes, 5 no.
So, you get 0.1240
Slide 7 Dating Continued
Find P(X = 1)
P(X = 1) = 7 * (0.1) * (0.9)^6 = 0.3720
Find P(X = 0)
P(X = 0) = (0.9)^7 = 0.4783
We don’t have to use factorials because there is only one way that this could happen
I can use my calculator
2nd/Distr/0:binompdf
▪ binompdf(7, 0.1, 3) finds P(X = 3)
2nd/Distr/1:binomcdf
▪ binomcdf(7, 0.1, 3) finds P(X ≤ 3)
▪ This is cumulative
Slide 8 Desperate Dating
• Let’s go ahead and assume the information from the previous problems is true about my dating life: B(7, 0.1). What is the probability that at least one girl will say yes?
P(X ≥ 1) = P(X = 1) + P(X = 2) + P(X = 3) + … + P(X = 7) = 1 − P(X < 1) = 1 − P(X = 0) = 1 − (0.9)^7 = 0.5217
There is a 52% chance that at least one girl will say yes.
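The dating calculations above can be checked with a few lines of Python. This is only a sketch: the function names binom_pmf and binom_cdf are made up here to mirror the calculator's binompdf and binomcdf, and they are not part of the original notes.

```python
from math import comb

def binom_pmf(n, p, k):
    """P(X = k) for X ~ B(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_cdf(n, p, k):
    """P(X <= k) for X ~ B(n, p); cumulative, like the calculator's binomcdf."""
    return sum(binom_pmf(n, p, i) for i in range(k + 1))

n, p = 7, 0.1  # the B(7, 0.1) dating example
print(round(binom_pmf(n, p, 2), 4))      # 0.124  (exactly two say yes)
print(round(binom_pmf(n, p, 1), 4))      # 0.372  (exactly one says yes)
print(round(binom_pmf(n, p, 0), 4))      # 0.4783 (nobody says yes)
print(round(1 - binom_cdf(n, p, 0), 4))  # 0.5217 (at least one says yes)
```

The last line uses the complement rule, P(X ≥ 1) = 1 − P(X ≤ 0), which is the same shortcut the slide uses.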
Slide 9 What if he asks 100 girls?
• Now we are looking at B(100, 0.1)
• Let’s find the probability that at most 15 will say yes
– P(X ≤ 15) = binomcdf(100, 0.1, 15) = 0.9601
• Let’s find the probability that at least 8 will say yes
– P(X ≥ 8) = 1 − P(X ≤ 7) = 1 − binomcdf(100, 0.1, 7) = 0.7939
Slide 10 Mean of a binomial distribution
• To find the mean, you multiply n*p
– The formula is μ = np
• On average, you would expect me to get 7*0.1 = 0.7 dates per week.
Slide 11 Standard Deviation of a binomial distribution
• The formula for finding the standard deviation of a binomial distribution is σ = √(np(1 − p))
• In our problem, Mr. Merlo averages 0.7 dates per week with a standard deviation of √(7(0.1)(0.9)) = 0.7937 dates per week
Slide 12 Geometric Distributions
Slide 13 How Long Until We…
• Josh and Eli are playing a game where they flip a coin. If it’s heads Josh wins, if it’s tails Eli wins. On a bright and shiny Wednesday, they play the game, and Josh keeps losing. In fact, he doesn’t win until the 8th flip of the game.
He exclaims, “What is the probability of that?” What exactly is the probability of that?
• This is a geometric distribution
Slide 14 What makes it geometric?
1. Each observation falls into one of just two categories: “success” or “failure.”
2. The probability of “success” never changes.
3. All observations are independent.
4. The variable of interest (X) is the number of trials required to obtain the first success.
Slide 15 The Geometric Distribution
We are going to keep making observations until we succeed.
The formula for this is P(X = n) = (1 − p)^(n − 1) p
Example
I am going to keep asking out girls until one says yes.
P(X = 1) = 0.1
▪ That’s the likelihood that the first girl I ask will say yes.
P(X = 5) = (0.9)(0.9)(0.9)(0.9)(0.1)
▪ That’s the probability that the fifth girl I ask is the first to say yes.
Slide 16 Adding an infinite amount of numbers
• The probability that it takes more than n trials to see the first success is P(X > n) = (1 − p)^n
• Let’s find the probability that it will take me at least 6 tries to get a date.
– P(X ≥ 6) = P(X > 5) = (1 – 0.1)^5 = (0.9)^5 = 0.59
Slide 17 Expected Outcome Again
• The mean of a geometric distribution is the average number of trials it should take until you succeed.
– It can be found by μ = 1/p
• On average it would take me 1/0.1 = 10 tries until I got a date.
– This makes sense, because there is a one-in-ten chance that I will actually get a date on any one try.

Chapter 9
How is a sampling distribution of a sample mean different from a distribution of individual observations? (10, 2a; M7, 23)
Know how to find the probability of averaging a certain amount in a Normal distribution, (10, 2b; 09, 2c; 07, 3b; 06, 3c; 04b, 3c)
Know how to find the standard deviation of a proportion, (08, 4c)
What does it mean to be considered a biased or unbiased estimator? (08b, 2; M7, 33)
Know how sampling distributions of sample means change as your sample size gets larger, and how that affects the likelihood of extreme events occurring, (07, 3a; M2, 9)
Know that when you are referring to a standard deviation of a sampling distribution of the sample mean, you do not have to divide by √n; it has already been done, (07, 3b)
Know that when you have a large sample size, even the most nonnormal distributions will have a sampling distribution of the sample mean close to a normal distribution by the Central Limit Theorem, (07, 3c; 07b, 2c)
Know how to find the average and standard deviation of a sampling distribution of a sample mean if you know the average and standard deviation of the original data, (07b, 2c; M2, 30)
Know the difference between finding the probability that a group of 10 will average more than 150 lbs and a group of ten will all weigh more than 150 lbs, (05b, 6c)
2010 #2 2009 #2 2008B #2 2007 #3 2007B #2 2006 #3 2004B #3

Sampling Distributions of Sample Means
As your sample size becomes larger, these are the things that you notice:
1) It becomes more normal (but will never be exactly normal, only approximately normal). This is the Central Limit Theorem.
a.
If something is already symmetric, then you only have to make a few observations for the sampling distribution of the sample means to become approximately normal (notice the picture on the right)
b. If something is not at all symmetric, then you need about 30 observations for the sampling distribution of the sample means to become approximately normal
2) The average remains the same (if the overall mean is μ, then the mean of the sampling distribution is μ no matter how large the sample size)
3) It becomes less spread out (if the overall standard deviation is σ, then the standard deviation of the sampling distribution is σ/√n no matter what n is)
This is true whether or not the sampling distribution of the sample means is normal
Example 1 (Know the difference between finding the probability that a group of 10 will average more than 150 lb and a group of ten will all weigh more than 150 lb)
A group of 10,000 people has a normal distribution of weights with an average of 145 lb and a standard deviation of 20 lb.
1) What is the probability that a randomly chosen person will weigh more than 150 lb? N(145, 20)
2) What is the probability of 10 random people averaging more than 150 lb? N(145, 20/√10)
3) What is the probability that 10 random people will all weigh more than 150 lb? Since the probability that one individual will weigh more than 150 is 0.4013, the probability that 10 individuals will all weigh more than 150 is…
Slide 1 Chapter 9 Sampling Distributions
Slide 2 Sampling Distributions
Slide 3 Parameter and Statistic
• A parameter is a number that describes a population. It is usually unknown.
– 53% of CV students like Mr. Merlo.
• 53% is a parameter.
We write that p = 0.53
• A statistic is a number taken from sample data. This is usually what we can get.
– In a survey of 100 students, 53% say that they like Mr. Merlo.
• 53% is the statistic. We write that p-hat = 0.53
• Statistics are used to estimate parameters.
Slide 4 Sampling Variability
A statistic is merely a good guess at a parameter.
If I took several surveys of students, looking at different groups of 100, I might get different statistics (53%, 47%, 58%).
This is called sampling variability.
In an ideal world we would survey every possible group of 100 and create a histogram of those statistics. This is called a sampling distribution.
I understand that this is unreasonable, because it would be easier just to ask every person, but I’m setting up something for later.
Slide 5 Sampling distribution
Slide 6 Sampling Distribution
Slide 7 Bias and Variability
• If the mean of a statistic’s sampling distribution is far from the actual parameter, the statistic is said to be biased.
• It is unbiased if the mean of its sampling distribution is equal to the true value of the parameter.
• Variability is how spread out the statistics gathered are.
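Sampling variability and the sampling distribution can be illustrated with a small simulation. The sketch below assumes each surveyed student independently likes Mr. Merlo with probability p = 0.53 (a simplification that ignores the finite school population). The function name simulate_p_hats, the seed, and the number of repeated surveys are all choices made here for illustration.

```python
import random
from statistics import mean, stdev

def simulate_p_hats(p, n, num_samples, seed=0):
    """Take num_samples surveys of n students each and return every sample proportion."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [sum(rng.random() < p for _ in range(n)) / n
            for _ in range(num_samples)]

p_hats = simulate_p_hats(p=0.53, n=100, num_samples=5000)

print(p_hats[:3])               # a few individual statistics; they differ from survey to survey
print(round(mean(p_hats), 3))   # center of the simulated sampling distribution (compare with p)
print(round(stdev(p_hats), 3))  # spread of the statistics, i.e. the variability
```

If the statistic is unbiased, the mean of the simulated proportions should land very close to 0.53, while the standard deviation should be near √(0.53(0.47)/100), roughly 0.05.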
Slide 8 Bias And Variability
Slide 9 Sample Proportions
Slide 10 The Distribution of a Statistic
• Let’s look at the Mr. Merlo likers.
• If I took as many samples of groups of 100 as I could, that sampling distribution (histogram) would be close to normal.
– Not only would it be close to normal, but its mean should be the parameter that I’m looking for.
– And the standard deviation of the distribution would be √(p(1 − p)/n)
Slide 11 Rule of Thumb
1. Use the formula for standard deviation of a statistic only when the population is at least 10 times bigger than the sample.
– In the Merlo example, I would have to use a sample no larger than 300 because the population of the school is 3000.
2. We can only really assume that it’s normal if np and n(1 − p) are both at least 10.
Slide 12 Example
• If 100 students are chosen at random, how likely is it that 60 or more will like Mr. Merlo if we know that 53% actually do like him?
• Since 100 is less than 10% of the school population, it would be appropriate to use the standard deviation formula.
• Since 100(.53) > 10 and 100(.47) > 10, we can also use the normal approximation.
• So, now we know that this distribution is approximately N(0.53, 0.0499), because √(0.53(0.47)/100) = 0.0499
Slide 13 Example Continued
• Now that we know the distribution of p-hat is approximately N(0.53, 0.0499)
• We want to find P(p-hat ≥ 0.60) = P(z ≥ (0.60 − 0.53)/0.0499) = P(z ≥ 1.40) ≈ 0.0808
Slide 14 Sample Means
Slide 15 Sample means
• If I know the true values…
– The mean of the population (a parameter) is usually represented with μ.
– The standard deviation of the population is σ.
• If I’m using sample data…
– The mean of the sampling distribution of x-bar is μ.
– If the standard deviation of the population is σ, then the standard deviation of the sampling distribution of x-bar is σ/√n.
• What happens to the standard deviation as our sample size gets bigger?
Slide 16 Know how to find the prob of averaging a certain amount
A soda factory releases soda out of a machine and the amount of liquid that comes out has a normal distribution with an average of 12 oz and a standard deviation of 0.4 oz. How likely is it that a six pack of soda will average more than 12.2 oz per bottle?
Solution
The average of six bottles follows N(12, 0.4/√6) = N(12, 0.1633), so P(x-bar > 12.2) = P(z > (12.2 − 12)/0.1633) = P(z > 1.22) = 0.1112
Slide 17 Central Limit Theorem
• As your n gets bigger, the sampling distributions of sample means, no matter how skewed the original distributions are, will begin to become normal.
• These sampling distributions will have the same mean as the original distribution and a standard deviation of σ/√n.
• An example of the Central Limit Theorem can be found at http://www.stat.sc.edu/~west/javahtml/CLT.html
Slide 18 CLT
Slide 19 CLT
Slide 20 Example
• Mr. Merlo is going to roll a die one hundred times and find the average. What is the probability that the average will be greater than 3.6 if you know that the average of rolling one die is 3.5 with a standard deviation of 1.71?
• By the CLT, we know that the average of the rolls will have a mean of 3.5.
• Even though rolling a die is definitely not normal, by the CLT, we can say the distribution of the average is approximately normal.
• So, now we have a new distribution of the average of rolling a die. It is N(3.5, 1.71/√100) = N(3.5, 0.171)
Slide 21 Example Continued
• Mr. Merlo is going to roll a die one hundred times and find the average. What is the probability that the average will be greater than 3.6 if you know that the average of rolling one die is 3.5 with a standard deviation of 1.71?
• We now know that the CLT tells us the average follows N(3.5, 0.171)
• P(x-bar > 3.6) = P(z > (3.6 − 3.5)/0.171)
• = P(z > 0.58)
• = 0.2810

Chapter 10
Know how to interpret a confidence level (i.e. What does it mean to be 95% confident?) (10, 3a; 03b, 6a; M2, 37), 622
What exactly is a confidence interval? How do I use it to argue for or against a claim (e.g. Based on your confidence interval, is there evidence that boys are smarter than girls?) (10, 3b; 10b, 6d; 03, 6b)
If I need to have a certain margin of error, how do I establish how many observations to make? (10, 3c; 08b, 3a; 05, 5b; 03b, 6b; M2, 26), 635, 671
Know the conditions for a 1-proportion z interval, specifically how to check normality, (10b, 4a; 03, 6b; M7, 21)
If we are taking a sample of 50 from a population, why don’t we consider the fact that the probability is changing because we are not replacing? (10b, 4b)
Know how to calculate a one proportion z interval, (03, 6b; 03b, 6a)
Know how to interpret a one proportion z interval, (03, 6b; 03b, 6a; M7, 34)
Know how to check the conditions for a one proportion z interval, (03b, 6a)
Know how changing the sample size affects the margin of error, (M7, 30)
Know what happens to a confidence interval and the margin of error if you increase the confidence level, (M2, 13)
Know how to calculate a confidence interval for a one sample t-interval, (04, 6a; M2, 8)
Know how to interpret the confidence interval for a one sample t-interval, (04, 6a)
Know when to do a t-test/interval instead of a z-test/interval, (M7, 25; M2, 33)
Know what a t-distribution is and how it is similar to and different from a normal distribution, (M2, 18)
2010 #3 2010B #4 2008B #3 2008B #4 2005 #5 2003 #6 2003B #6

One sample (paired) t-confidence interval (Quantitative Data)
I) Name the Test and state the formula
a. One sample (paired) t-confidence interval
b. x-bar ± t(s/√n), with df = n − 1
II) Determine whether or not conditions are met
a. Determine whether the n samples were taken at random
b. Determine whether or not the distribution is normal
i. If n < 10, make a boxplot and verify that it’s normal
ii. If n < 30, make a boxplot and verify that it’s close to normal
iii. If n ≥ 30, state that the CLT allows us to say that it’s approximately normal
c. Determine whether the n observations were taken independently. If sampling with replacement this can be deduced. If sampling without replacement, then you must verify that the sample is less than 10% of the population.
III) Do the math
a. Plug the given numbers into the formula, state the degrees of freedom that you are using, and use a calculator to verify
IV) Draw a conclusion in context
a. We are _____% confident that the avg. ____________________ is between _____ and _____

One sample z-confidence interval for proportions (Categorical Data)
I) Name the Test and state the formula
a. One sample z-confidence interval for proportions
b. p-hat ± z√(p-hat(1 − p-hat)/n)
II) Determine whether or not conditions are met
a. Determine whether the n samples were taken at random
b. Determine whether or not the distribution is normal
i. np ≥ 10
ii. n(1 − p) ≥ 10
c. Determine whether the n observations were taken independently. If sampling with replacement this can be deduced. If sampling without replacement, then you must verify that the sample is less than 10% of the population.
III) Do the math
a. Plug the given numbers into the formula, and use a calculator to verify
IV) Draw a conclusion in context
a. We are _____% confident that the proportion of ____________________ is between _____ and _____

Slide 1 Estimating With Confidence
Slide 2 Definitions
• Inference
– The process of arriving at some conclusion that, though it is not logically derivable from the assumed premises, possesses some degree of probability relative to the premises.
• Statistical Inference
– Provides methods for drawing conclusions about a population from sample data.
– We will have several “inference procedures” that we will learn in the next couple of months.
Slide 3 Questions
• What does the following statement mean to you?
– I am 95% confident that the average age of high school teachers is between 30 and 36.
• How would the statement change if I altered the first part to, “I am 99% confident…”?
Slide 4 Confidence Interval
• Confidence Intervals
– This is a guess at a parameter using a statistic.
• We think that the average age is between 30 and 36.
• Margin of Error
– This is the distance up and down from our sample mean that we are willing to go.
• This is the + or – you see when you watch elections.
• In our example above, we collected a statistic of 33 from a sample.
– This is just a guess, though.
• So, in our study we had a margin of error of 3 years.
– That’s where the 30 and 36 came from.
Slide 5 Confidence Level
• A confidence level is a percentage: the share of intervals, built this way from repeated samples, that would contain the true parameter.
• We are 95% confident that the average age of high school teachers is between 30 and 36 years.
– This means that if we were to take several samples (where we would get different statistics: 32, 29, 35,…), 95% of the resulting intervals would contain the true parameter.
Slide 6 Know how to interpret a confidence level
• We are 95% confident that the average age of high school seniors is between 17.1 and 17.8
• What does it mean?
– If we took several samples, 95% of the resulting intervals would contain the true average age of high school seniors
• What does it not mean?
– It does not mean that 95% of seniors are between 17.1 and 17.8
– It does not mean that if you took several samples 95% would have an average between 17.1 and 17.8
Slide 7 Confidence Interval Conditions
• In order for us to be able to create a confidence interval we need three things
– 1) The data come from an SRS of the population
– 2) The sampling distribution is close to normal, or the sample is large enough to use the CLT.
– 3) The individual observations are independent
• The sample must be less than 10% of the population
Slide 8 Confidence Interval for a Population Mean
• Once we have checked the conditions, we then want to find our interval by using our formula x-bar ± z(σ/√n)
Slide 9 Example
We sampled 100 students and found that the average SAT score was 1800. We are told that the population standard deviation is 200.
◦ 1800 is a statistic. Is it the true average of all students? No. So we will take a good guess at the actual parameter.
First establish how confident you want to be.
◦ 95% is pretty good. Using the empirical rule, that is two standard deviations above and below the average.
Since it’s a sample of 100, the standard deviation of the sample mean is 200/√100 = 20.
Slide 10 Example Continued
1800 ± 2(20) = (1760, 1840)
Slide 11 Example Continued
• We are 95% confident that the average SAT score is between 1760 and 1840.
– So, we think that the parameter, the actual average, is between 1760 and 1840.
– This means that if we took a lot of samples, our intervals would miss the true average 5% of the time.
• 95% is the confidence level, and (1760, 1840) is the confidence interval.
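The SAT interval above can be reproduced in a few lines. This is a sketch only: the helper name z_interval is made up for illustration, and it assumes the population standard deviation is known, as in the example. The second half swaps the empirical-rule value 2 for the more precise 95% critical value from Python's standard-library NormalDist.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar, sigma, n, z):
    """Confidence interval x_bar ± z * sigma / sqrt(n) for a mean with known sigma."""
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

# Empirical rule: about two standard deviations either side for 95%
print(z_interval(1800, 200, 100, z=2))  # (1760.0, 1840.0)

# More precise critical value: the 97.5th percentile of the standard normal
z_95 = NormalDist().inv_cdf(0.975)
low, high = z_interval(1800, 200, 100, z_95)
print(round(z_95, 3), round(low, 1), round(high, 1))  # 1.96 1760.8 1839.2
```

Using 1.96 instead of 2 narrows the interval slightly, which is why precise z scores are preferred over the empirical rule when they are available.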
Slide 12 Writing Confidence Intervals
• This is how I want you to write answers to confidence intervals…
– We are ____% confident that the average _________________ is between _____ and ____.
– You fill in the blanks to make it question specific.
Slide 13 Being more specific
• Instead of using the empirical rule, we are going to use more specific z scores.
– 90% confidence --- z = 1.645
– 95% confidence --- z = 1.960
– 99% confidence --- z = 2.576
• So we would actually be 95% confident that the average SAT score is between 1760.8 and 1839.2
Slide 14 Margin Of Error
• The margin of error is the last part of the formula: m = z(σ/√n)
• We use this to establish how many people we need to interview
Slide 15 Example
• I am studying SAT scores and I want to be more precise without changing my confidence level. So, how many people do I need to interview in order to reduce my margin of error to 10?
• 10 = 1.96(200/√n), so n = (1.96(200)/10)^2 = 1536.64
• We would have to interview 1537 people.
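The sample-size calculation above amounts to solving m = z(σ/√n) for n and rounding up. A minimal sketch follows; the function name sample_size_for_mean is invented here for illustration.

```python
from math import ceil

def sample_size_for_mean(z, sigma, margin):
    """Smallest n with z * sigma / sqrt(n) <= margin, i.e. n >= (z * sigma / margin)^2."""
    return ceil((z * sigma / margin) ** 2)

# SAT example: 95% confidence (z = 1.96), sigma = 200, margin of error 10
print(sample_size_for_mean(1.96, 200, 10))  # 1537
```

Because n sits under a square root, cutting the margin of error in half takes about four times as many interviews.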
Slide 16 Inference for the Mean of a Population
Slide 17 Standard Error
• Standard error is what we get when we estimate a standard deviation from a set of sample data.
– We use s, the sample standard deviation, instead of σ
• s is our best guess at the standard deviation of the population, so the standard error of the sample mean is s/√n.
• What do we do with confidence intervals and hypothesis tests if we don’t know σ, and we have to use s?
Slide 18 T-scores and Degrees of Freedom
• A t-score is what we use instead of a z-score when we have to use s instead of σ.
– It is similar to a z-score, but it follows a non-normal distribution that is shorter and wider
– The larger your sample size, the closer to normal it becomes
• In order to use a t statistic, you have to know the distribution’s degrees of freedom
– This can be found by taking n − 1.
– The larger your degrees of freedom, the closer to normal your distribution is
Slide 19 T-distribution
• The smaller your sample size (df), the more area there is in the tails
• The bigger the sample size (df), the more normal the distribution becomes
– That’s why if you have infinite degrees of freedom, you just use the z score
Slide 20 Confidence Interval for a Population Mean
• Once we have checked the conditions, we then want to find our interval by using our formula x-bar ± t(s/√n)
• Notice how this differs from the formula used when you know the population standard deviation
Slide 21 Example
• We sampled 30 random students and found that the average SAT score was 1800 and the standard deviation of our sample was 200. Find a 95% confidence interval.
– This is different from what we’ve done before, because we do not know the population standard deviation, only that from our sample.
Slide 22 Example Continued
• State your test
– One sample t-interval
• State your Conditions
• 1) Randomness is given in the problem
• 2) Since our sample is large (n = 30), the distribution of sample means will be approximately normal
• 3) Since 30 is less than 10% of the population that takes the SAT, it is safe to say that our observations are independent
Slide 23 Example Continued
• Do the math
– 1800 ± 2.045(200/√30) = 1800 ± 74.67
– Notice that we used df = 29, but n = 30 inside the radical
• Draw a Conclusion
– We are 95% confident that the average SAT score is between 1725.33 and 1874.67.
Slide 24 Estimating a Population Proportion
Slide 25 Conditions Again
• The three conditions needed in order to do a confidence interval of a proportion are
– 1) Random: SRS
– 2) Normal
– 3) Independent
• Less than 10% of the population
• These are all the same as for C.I.’s for sample means, except the normal check
Slide 26 Confidence Intervals for a Population Proportion
• If all of the conditions are met, you can use a confidence interval to make a guess at the population proportion.
$\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
• Notice that the formula for the standard deviation is inside the equation, and that you always use a z-score, never a t-score, for a proportion

Slide 27
Example
• A statistician was trying to estimate the proportion of students that pass the AP exam. He had the results of 80 randomly chosen students and saw that 60 had a passing score. Create a 95% confidence interval for the proportion of students that will pass the exam.
• We have a sample proportion of 60/80 = 0.75.

Slide 28
Example Continued
• Since the problem fulfills the three conditions, we may use the formula
• We are 95% confident that the proportion of students that pass the AP exam is between

Slide 29
Margin of Error
• As with sample means, you can also use the margin of error formula for sample proportions.
$ME = z^* \sqrt{\frac{p^*(1-p^*)}{n}}$
• You can use your best guess at the sample proportion for p* ("p star"), or you can use 0.5.
– I recommend using 0.5 because it is the most conservative guess.

Slide 30
Example
• How many randomly chosen college students would you need to interview to estimate the proportion of college students who lived off campus their freshman year to within 3% at 95% confidence?
Solution
• This is asking us to find the sample size that gives this interval a .03 margin of error
$1.96 \sqrt{\frac{(0.5)(0.5)}{n}} \le 0.03$, so $n \ge 1067.1$
• We use 0.5, because that is the proportion that will have the widest interval, making it the most conservative guess
• We always round up in margin of error problems. So, we need to interview 1068 college students

Slide 31
What We Missed

Slide 32
Know how to use a confidence interval to make a conclusion
• We are 95% confident that the average age of high school seniors is between 17.1 and 17.8
Question: Is there evidence that the average high school senior is 18 years old?
Solution: No, there is not evidence that the average age is 18, because 18 is not an option in the interval.

Slide 33
Know how to use a confidence interval to make a conclusion
• We are 95% confident that the average age of high school seniors is between 17.1 and 17.8
Question: Is there evidence that the average high school senior is 17.5 years old?
Solution: There is not evidence against it, but we cannot conclude that 17.5 is the answer.
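The two proportion calculations from Slides 27 to 30 can be checked with a short script. This is a sketch under stated assumptions rather than the method shown in class: the function names are invented here, and z* is computed from the standard normal distribution instead of being read from a table.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """z-interval for a proportion: p_hat +/- z* sqrt(p_hat (1 - p_hat) / n)."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def sample_size_for_margin(margin, confidence=0.95, p_star=0.5):
    """Smallest n whose margin of error is at most `margin` (always rounds up)."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z_star / margin) ** 2 * p_star * (1 - p_star))

low, high = proportion_interval(60, 80)    # AP exam example: 60 of 80 passed
n_needed = sample_size_for_margin(0.03)    # off-campus example: within 3%

print(f"95% interval for the pass rate: ({low:.3f}, {high:.3f})")
print(f"Students to interview: {n_needed}")
```

Using p* = 0.5 in `sample_size_for_margin` mirrors the conservative choice recommended on Slide 29; passing a different `p_star` gives a smaller required sample.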
We just

Chapter 11/12
Know how to write a pair of hypotheses for a one sample/paired t-test, including a definition of parameters, (09, 6a; 09b, 5a; 08b, 6c; 07, 4; 06, 6a; 06b, 4; 05B, 4a; 05b, 6a; 03, 1c; M7, 5)
Know what the conditions are and how to check them for a one sample/paired t-test or confidence interval, (09b, 5a; 08b, 6c; 07, 4; 06b, 4; 05b, 4a; 05b, 6a; 04, 6a; 04b, 5b)
Know how to calculate a test statistic for a one sample/paired t-test, find its degrees of freedom, and calculate its P-value, (09b, 5a; 08b, 6c; 07, 4; 06b, 4; 05b, 4a; 05b, 6a; M2, 39)
Know how to draw a conclusion in context based on a P-value in a one sample/paired t-test, (09b, 5a; 08b, 6c; 07, 4; 06b, 4; 05b, 4a; 05b, 6a; M2, 24)
Know what a P-value is and how to interpret it when analyzing a large number of simulations or large number of observations, (10, 6e; 09, 6c; 09b, 5b; 06b, 6d)
Know when to do a paired t-test as opposed to a 2 sample t-test, (08b, 6c)
Know how to write a pair of hypotheses for a one proportion z test, including how to define parameters, (06b, 6a; 05, 4; 03, 2a; M2, 2)
Know how to check the conditions for a one proportion z-test, (06b, 6b; 05, 4)
Know how to find a z-statistic and a P-value for a one proportion z-test, (06b, 6f; 05, 4)
Know how to draw a conclusion based on a P-value for a one proportion z-test in context, (06b, 6f; 05, 4)
Know that a matched pairs design experiment will need a paired t-test, (05b, 3a)
Know how a confidence interval relates to a two sided test of the same data, and how it relates to a one sided test of the same data, (04, 6b; M7, 27; M2, 29)
Know what Type 1 and Type 2 errors are, and be able to distinguish which one has worse consequences in the context of a problem if given a null and alternative hypothesis, (09, 5c; 08b, 4b; 03, 2b)
Know what power is and what can be done to increase it in an observational study or an experiment, (09b, 4b; M7, 32; M2, 35)
2010 #6 2009 #6 2009B #5 2006B #6 2005B #6 2004 #6 2003 #1 2003 #2 2009B #4

One sample (paired) t-test for a sample mean: Quantitative Data
I) Name the test and state the formula
a. One sample (paired) t-test
b. $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$, with df = n − 1
II) Write your pair of hypotheses
III) Determine whether or not conditions are met
a. Determine whether the n samples were taken at random
b. Determine whether or not the distribution is normal
i. If n < 10, make a boxplot and verify that it’s normal
ii. If n < 30, make a boxplot and verify that it’s close to normal
iii. If n > 30, state that the CLT allows us to say that it’s normal
c. Determine whether the n observations were taken independently. If sampling with replacement this can be deduced. If sampling without replacement, then you must verify that the sample is less than 10% of the population.
IV) Do the math
a. Plug the given numbers into the formula, state the t statistic, the degrees of freedom that you are using, and your P-value. Use a calculator to verify
V) Draw a conclusion in context
a. We can/cannot reject the null hypothesis at the ____% significance level. There is/isn’t evidence to say that ____________________________

One sample z-test for a proportion: Categorical Data
I) Name the test and state the formula
a. One sample z-test for a proportion
b. $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$
II) State your pair of hypotheses
III) Determine whether or not conditions are met
a. Determine whether the n samples were taken at random
b. Determine whether or not the distribution is normal
i. np ≥ 10
ii. n(1 − p) ≥ 10
c. Determine whether the n observations were taken independently. If sampling with replacement this can be deduced. If sampling without replacement, then you must verify that the sample is less than 10% of the population.
IV) Do the math
a. Plug the given numbers into the formula, state your z statistic, and your P-value. Use a calculator to verify.
V) Draw a conclusion in context
a. We can/cannot reject the null hypothesis at the ____% significance level. There is/isn’t evidence to say that ____________________________

Slide 1
Chapter 11 Testing a Claim

Slide 2
Using Inference to Make Decisions

Slide 3
Type 1 error (told I’m wrong when I’m right)
A Type 1 error occurs when the null hypothesis is actually true, but you get a small P-value and reject the null hypothesis.
Example
◦ An actual fair coin is being tossed. If you tossed it a million times, the proportion of heads would be very close to 0.5. Mr. Merlo took this coin and tossed it 100 times and got 90 heads. This would give an extremely low P-value in a hypothesis test, even though the coin is actually fair. Mr. Merlo thinks the coin is unfair, even though it isn’t. We just happened to get the craziest 100 flips ever.
◦ This is a type 1 error.

Slide 4
Finding the probability of a Type 1 error
Assuming that we are right, what is the probability that we would be told that we are wrong?
◦ It depends on the point at which you consider a group special or different.
◦ It’s your level of significance. It’s the probability of getting that far away from the actual average. If your level of significance (alpha) is 0.05 or 5%, then the probability of a type 1 error is 0.05 or 5%.
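To make the link between the significance level and the Type 1 error rate concrete, here is a rough sketch (not from the slides) of how often a truly fair coin would be rejected. It assumes a two-sided one-proportion z-test on 100 flips at α = 0.05 and uses exact binomial probabilities; the helper names are invented for the illustration.

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Exact binomial probability of k heads in n flips."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def rejects_fair_coin(heads, n, alpha=0.05):
    """Two-sided one-proportion z-test of H0: p = 0.5."""
    z = (heads / n - 0.5) / math.sqrt(0.5 * 0.5 / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha

n = 100
# Type 1 error rate: chance that a fair coin lands in the rejection region
type1_rate = sum(binom_pmf(k, n, 0.5) for k in range(n + 1)
                 if rejects_fair_coin(k, n))
# Chance of a result at least as extreme as 90 heads with a fair coin
p_90_or_more = sum(binom_pmf(k, n, 0.5) for k in range(90, n + 1))

print(f"Type 1 error rate at alpha = 0.05: {type1_rate:.4f}")
print(f"P(90 or more heads | fair coin):   {p_90_or_more:.2e}")
```

The computed rate lands near 5% rather than exactly on it, because head counts are whole numbers and the z-test is a normal approximation.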
Slide 5
Type 2 Error (told I’m right when I’m wrong)
A type 2 error occurs when the null hypothesis is actually false, but you get a big P-value and do not reject the null hypothesis.
Example
◦ An unfair coin is being tossed. In fact, for this particular coin, the probability of getting heads is 0.75. Mr. Merlo tossed this unfair coin 100 times and got 50 heads. Remember that he has to assume that the coin is fair.
◦ This gives a very high P-value in a hypothesis test, even though the coin is actually not fair. Mr. Merlo thinks the coin is fair, even though it isn’t. We just happened to get a crazy 100 flips, because we should have gotten a number near 75.
◦ This is a type 2 error

Slide 6
Power (told I’m wrong when I’m wrong)
• This isn’t an error, because this is what you want to happen.
• Finding the probability
– This is just the complement of the probability of a type 2 error: power = 1 − P(type 2 error).
– It’s the opposite of being told I’m right when I’m wrong (type 2 error)
• This tells you how good your test is if your alternative hypothesis is true.
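As an illustration of power as the complement of a Type 2 error, the sketch below works out how often the unfair coin from Slide 5 (true probability of heads 0.75) would be caught. The two-sided z-test at α = 0.05 and the function names are assumptions made for this example, not something stated on the slides.

```python
import math
from statistics import NormalDist

def binom_pmf(k, n, p):
    """Exact binomial probability of k heads in n flips."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def power_of_coin_test(n, true_p, alpha=0.05):
    """P(reject H0: p = 0.5) when the coin's real heads probability is true_p."""
    power = 0.0
    for heads in range(n + 1):
        z = (heads / n - 0.5) / math.sqrt(0.25 / n)
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        if p_value < alpha:                  # this outcome leads to rejection
            power += binom_pmf(heads, n, true_p)
    return power

power_100 = power_of_coin_test(100, 0.75)
type2_100 = 1 - power_100                    # Type 2 error is the complement
power_20 = power_of_coin_test(20, 0.75)      # same coin, only 20 flips

print(f"n = 100: power = {power_100:.4f}, P(Type 2 error) = {type2_100:.4f}")
print(f"n = 20:  power = {power_20:.4f}")
```

Comparing the two runs shows why getting only 50 heads in 100 flips of this coin is such a rare event, and how a smaller sample makes a Type 2 error much more likely.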
Slide 7
How to Increase Power
1) Have a larger significance level (go from 1% to 5%)
2) Increase the sample size
3) Decrease the standard deviation

Slide 8
Know how a confidence interval relates to a two sided test of the same data
• 95% C.I. relates to a 5% significance level
• 99% C.I. relates to a 1% significance level
• 96% C.I. relates to a 4% significance level
• If you reject at the 5% level, then your 95% confidence interval would not contain the parameter value from the null hypothesis
– Think of doing a test and a confidence interval for the same set of data: you would get the same result
– Let’s look at P. 710-711

Slide 1
Chapter 12 Significance Tests

Slide 2
Using t-scores

Slide 3
Standard Error
• Standard error is what we get when we estimate a standard deviation from a set of data.
– It uses s instead of σ (for a sample mean, the standard error is s/√n)
• The sample standard deviation, s, is our best guess at the standard deviation of the population.
• What do we do with confidence intervals and hypothesis tests if we don’t know σ, and we have to use s?
Slide 4
T-scores and Degrees of Freedom
• A t-score is what we use instead of a z-score when we have to use s instead of σ.
• In order to use a t statistic, you have to know the distribution’s degrees of freedom
– This can be found by taking n − 1.

Slide 5
Using T-scores
We can find t-scores using the same formula we do for z-scores. We don’t have a chart for t statistics like we do for z statistics, because we’d have to have a chart for every degree of freedom. So, we pick a range of P-values instead of an exact P-value.
Example
We have a sample size of 12 and a t statistic of 1.78. This gives 11 degrees of freedom, and the area above t = 1.78 gives a P-value between 0.05 and 0.10 using Table C in our book.

Slide 6
Basics of a Hypothesis Test

Slide 7
What is a Hypothesis Test?
• A hypothesis test is an inference procedure where we take data collected from a sample and determine if it is extreme enough to say that something is not true
• Example
– The national AP test score average is 2.87 with a standard deviation of 0.9
– 100 randomly chosen students from California averaged a 3.12
– Is there statistical evidence that Californians do better than the nation on the AP test?
Slide 8
Steps of a Hypothesis Test
1) Name the test and state the formula
2) State your hypotheses
3) Check the conditions
4) Do the math
5) Draw a conclusion in context

Slide 9
Name the test and state the formula
• Since we collected our data from one sample, California AP test takers, we are going to run a one sample t-test for a sample mean
– It is a t test, because we do not know the population standard deviation
$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$

Slide 10
State your hypotheses
• You always state a null hypothesis ($H_0$), which is what you assume is true.
– This is the assumption that your sample is not different than the usual population
• You always state an alternative hypothesis ($H_a$), which is what you think might be true.
$H_0: \mu = 2.87$
$H_a: \mu > 2.87$
µ = the average AP Stats score for all California students

Slide 11
Check Conditions
1) Your sample must be random
2) Your distribution must be approximately normal
a. If your sample is larger than 30, you can use the CLT to justify approximate normality
b. If your sample is smaller than 30, it must already be known to be normal, or you must graph the data to see if it appears normal
3) Your observations must be independent
a. This is usually checked by having less than 10% of the population

Slide 12
Check Conditions Continued
1) Students were chosen at random
2) Since our sample is large (n = 100), the distribution will be approximately normal by the central limit theorem
3) Since there were more than 1000 students that took the AP Statistics exam in California, the students are considered independent

Slide 13
Do the math
• Find a standardized statistic
$t = \frac{3.12 - 2.87}{0.9/\sqrt{100}} \approx 2.78$
• State your degrees of freedom
– df = 80 in the table, since n = 100 gives df = 99 and Table C has no row for 99 (round down to the nearest row)
• Find your P-value
– Between .0025 and .005

Slide 14
Draw A Conclusion
• You either reject or you cannot reject
– You cannot accept, because there is always a chance that you got bad data
– You just say, there is evidence that… or there is not evidence that…
• Since our P-value is smaller than .05, we can reject at the 5% significance level. There is evidence that California students did better than the national average.
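A table only brackets the P-value, so it can be helpful to see roughly where inside the bracket it falls. The sketch below is an illustration rather than the classroom method: it integrates the t density numerically with Simpson's rule, using only the standard library. The function names and the step count are assumptions made here, and the sketch uses the exact df = n − 1 = 99 instead of the table row.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area above t (for t >= 0), via Simpson's rule on [0, t]; steps must be even."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Slide 5 example: n = 12, t = 1.78, so df = 11
p_slide5 = t_upper_tail(1.78, df=11)

# California AP example: x_bar = 3.12, mu_0 = 2.87, s = 0.9, n = 100
t_ap = (3.12 - 2.87) / (0.9 / math.sqrt(100))
p_ap = t_upper_tail(t_ap, df=99)

print(f"Slide 5 example: P = {p_slide5:.4f}")
print(f"AP example: t = {t_ap:.3f}, P = {p_ap:.4f}")
```

Both results should fall inside the ranges read from the table: between 0.05 and 0.10 for the first example, and between .0025 and .005 for the AP example.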
Slide 15
Calculator
• Our calculator can do this
– 1 sample t test

Slide 16
What is a P-value
• A P-value is a probability
– The probability of getting the data that you got (or more extreme) assuming that the null hypothesis is true
• Our P-value
– If we assume that the CA average is the same as the national average (2.87), then there is a .33% chance that a group of 100 students would randomly average 3.12 or more.

Slide 17
One Proportion z-test

Slide 18
The only differences
• The first difference is that you never use a t statistic for a proportion. You always use a z-statistic
• The second difference is checking normality. You cannot use CLT with proportions
• You must check by
– np>10
– n(1-p)>10

Slide 19
Example
• Mr. Merlo’s lucky quarter was flipped 100 times and 60 times it came up heads. Is there evidence that the coin is unfair?
• State your test
– One proportion z test

Slide 20
Example Continued
• Write your hypotheses
$H_0: p = 0.5$
$H_a: p \ne 0.5$
• Check your conditions
– 1) We will assume that the coin flips are random
– 2) 100(0.5)>10 and 100(1-0.5)>10, using the value of p from the null hypothesis
– 3) It’s safe to say that all coin flips are independent

Slide 21
Example continued
• The z score is
$z = \frac{0.6 - 0.5}{\sqrt{(0.5)(0.5)/100}} = 2$
P = 0.0456
We cannot reject the null hypothesis at the 1% significance level. There is not evidence that Mr. Merlo’s coin is unfair.

Chapter 13
How do I create a set of hypotheses for a two sample t-test? (10, 5; 08, 6a; 07b, 5)
What are the conditions for a two sample t-test/interval and how do I check them? (10, 5; 09, 4a; 08, 6a; 08b, 1b; 07b, 5; 06, 4; 05, 6a; 04b, 4a; 04b, 5c) - 782
Know how to check normality if you have less than 30 pieces of data and if you have more than 30 pieces of data, (09, 4a; 08, 6a; 07b, 5; 06, 4; 05, 6a)
How do I calculate the t statistic for a two sample t-test? (10, 5; 08, 6a; 07b, 5) - 788
Know how to write a conclusion in context for a two sample t test, (10, 5; 08, 6a; 07b, 5; M7, 13; M7, 37)
Know how to calculate a confidence interval comparing two sample means, (09, 4a; 06, 4; 05, 6a; 04b, 4a) - 788
Know how to write a conclusion in context for a two sample/proportion t/z interval (09, 4a; 09, 5b; 06, 4; 06b, 2a; 05, 6a; 04b, 4a)
Know how to use a confidence interval to determine if there is evidence that there is a difference between two populations, (09, 4b; 07, 1c; 06, 4; 06b, 2b; 05b, 4b)
Know how to interpret a P-value in the context of a two sample/proportion problem. How is this different from just drawing a conclusion? (09, 5a; 07, 5c; 07b, 6a)
Know how to write a pair of hypotheses for a two prop z test, (09b, 3b; 07, 5b; 07b, 6a; 04b, 6a; 03b, 3b)
Know how to check the conditions for a 2 proportion z test/interval, (09b, 3a; 07, 5c; 07b, 6a; 06b, 2a; 04b, 6a; M7, 39; M2, 22; M2, 40)
Know how to find a test statistic for a 2 prop z test, and how to look up its corresponding P-value, (09b, 3b; 07b, 6a; 04b, 6a)
Know how to draw a conclusion in context for a 2 prop z test, (09b, 3b; 07, 5c; 07b, 6a; 04b, 6a)
Know how to calculate a z confidence interval for a difference of two proportions (2 prop z interval), (09b, 6b; 06b, 2a; M7, 4)
Know how to distinguish a two sample t-test from a paired t-test, (07, 4; 06b, 4; 05b, 4a; M2, 12)
2010 #5 2009 #4 2009 #5 2009B #3 2008B #1 2007 #1 2007 #4 2007 #5 2006B #2 2006B #4 2005 #6 2005B #3 2005B #4 2004B #4 2004B #5 2003B #3 2003B #4

Two Sample t-confidence interval for a difference in sample means: Quantitative Data
I) Name the test and state the formula
a. Two-sample t-confidence interval for a difference in sample means
b. $(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
II) Determine whether or not conditions are met
a. Determine whether the n samples were taken at random
b. Determine whether or not the distribution is normal for both of the samples (you must check twice)
i. If n < 10, make a boxplot and verify that it’s normal
ii. If n < 30, make a boxplot and verify that it’s close to normal
iii. If n > 30, state that the CLT allows us to say that it’s normal
c. Determine whether the n observations were taken independently. If sampling with replacement this can be deduced. If sampling without replacement, then…
i. If there are two distinct populations, this can be done by verifying that each sample is less than 10% of its respective population
ii. If the two groups come from the same population, this can be done by verifying that the individuals were placed in their respective groups at random
III) Do the math
a. Plug the given numbers into the formula, state the degrees of freedom that you are using, and use a calculator to verify
IV) Draw a conclusion in context
a. We are _____% confident that the avg. difference between ____ and _____ is between _____ and _____

Two sample z-confidence interval for a difference of proportions: Categorical Data
I) Name the test and state the formula
a. Two sample z-interval for a difference of proportions
b. $(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$
II) Determine whether or not conditions are met
a. Determine whether the n samples were taken at random
b. Determine whether or not the distribution is normal
i. and
ii. and
c. Determine whether the n observations were taken independently. If sampling with replacement this can be deduced. If sampling without replacement, then…
i. If there are two distinct populations, this can be done by verifying that each sample is less than 10% of its respective population
ii. If the two groups come from the same population, this can be done by verifying that the individuals were placed in their respective groups at random
III) Do the math
a. Plug the given numbers into the formula, and use a calculator to verify
IV) Draw a conclusion in context
a. We are _____% confident that the difference in the proportion of ________ and _______ is between _____ and _____

Two Sample t-test for a difference of sample means: Quantitative Data
I) Name the test and state the formula
a. Two-sample t-test
b. $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
II) State your pair of hypotheses
III) Determine whether or not conditions are met
a. Determine whether the n samples were taken at random
b. Determine whether or not the distribution is normal for both of the samples (you must check twice)
i. If n < 10, make a boxplot and verify that it’s normal
ii. If n < 30, make a boxplot and verify that it’s close to normal
iii. If n > 30, state that the CLT allows us to say that it’s normal
c. Determine whether the n observations were taken independently. If sampling with replacement this can be deduced. If sampling without replacement, then…
i. If there are two distinct populations, this can be done by verifying that each sample is less than 10% of its respective population
ii. If the two groups come from the same population, this can be done by verifying that the individuals were placed in their respective groups at random
IV) Do the math
a. Plug the given numbers into the formula, state your t-statistic, the degrees of freedom that you are using, and your P-value. Use a calculator to verify
V) Draw a conclusion in context
a. We can/cannot reject the null hypothesis at the _____% significance level. There is/isn’t evidence to say that_______________________

Two sample z-test for a difference of proportions: Categorical Data
I) Name the test and state the formula
a. Two sample z-test for a difference of proportions
b. $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_c(1-\hat{p}_c)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\hat{p}_c$ is the combined (pooled) sample proportion
II) Write your pair of hypotheses
III) Determine whether or not conditions are met
a. Determine whether the n samples were taken at random
b. Determine whether or not the distribution is normal
i. and
ii. and
c. Determine whether the n observations were taken independently. If sampling with replacement this can be deduced. If sampling without replacement, then…
i. If there are two distinct populations, this can be done by verifying that each sample is less than 10% of its respective population
ii. If the two groups come from the same population, this can be done by verifying that the individuals were placed in their respective groups at random
IV) Do the math
a. Plug the given numbers into the formula, state your z statistic, and your P-value. Use a calculator to verify.
V) Draw a conclusion in context
a. We can/cannot reject the null hypothesis at the ____% significance level. There is/isn’t evidence to say that ____________________________

Slide 1
Chapter 13 Comparing Two Population Parameters

Slide 2
Comparing Two Means

Slide 3
What does it mean to compare two means?
• In the previous section we were comparing our results to an expected outcome.
– For example, we might know that a machine that fills bottles with soda should put an average of 300 mL in each bottle.
• We tested to see if it was underfilling.
– Null: Average = 300
– Alternative: Average < 300
• In this section we will test to see how two different populations compare to each other.
– Do women study more than men?

Slide 4
Here’s the Data
• We collected a random sample of 35 men and 30 women and recorded their study habits.
• We found that men averaged 4 hours of studying per week with a sample standard deviation of 1.5.
$\bar{x} = 4$, s = 1.5
• We found that women averaged 5 hours of studying per week with a sample standard deviation of 1.
$\bar{x} = 5$, s = 1.0

Slide 5
State your test
• We are doing a two sample t test, comparing sample means

Slide 6
Write your hypotheses
• In this case, we have two averages that we are testing.
– $\mu_M$ = The average number of hours per week that men study.
– $\mu_W$ = The average number of hours per week that women study.
• So, the test is this…
$H_0: \mu_M = \mu_W$
$H_a: \mu_M < \mu_W$

Slide 7
Check the Conditions
• Random
– It was given that both samples were chosen at random
• Normality
– There is a large sample of men (n=35). So, by CLT our distribution should be approximately normal
– There is a large sample of women (n=30). So, by CLT our distribution is approximately normal
• Independence
– There are more than 350 men who study
– There are more than 300 women who study

Slide 8
Do the math
• Find your t statistic
$t = \frac{4 - 5}{\sqrt{\frac{1.5^2}{35} + \frac{1^2}{30}}} = -3.20$
• State your degrees of freedom (df = 59.6104)
• P = 0.0011

Slide 9
Draw a conclusion in context
• Since we have a small P-value, we can reject the null hypothesis in favor of the alternative at the 1% significance level. There is evidence that women study more than men.
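The numbers on Slide 8 can be reproduced with the sketch below. It is only an illustration: the degrees of freedom are taken as reported on the slide, the P-value comes from a simple numerical integration of the t density (Simpson's rule) rather than from a calculator, and all function names are made up for this example.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_coef) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Area above t (for t >= 0), via Simpson's rule on [0, t]; steps must be even."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

# Study-hours data: men (mean 4, s = 1.5, n = 35), women (mean 5, s = 1.0, n = 30)
se_diff = math.sqrt(1.5**2 / 35 + 1.0**2 / 30)
t_stat = (4 - 5) / se_diff

df = 59.6104                                # degrees of freedom reported on the slide
# Ha is mu_M < mu_W, so the P-value is the area below t; by symmetry that
# equals the area above |t|.
p_value = t_upper_tail(abs(t_stat), df)

print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
```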
Slide 10
What’s up with two standard deviations?
We have a problem with standard deviation, though. There are two of them. Since we are going to look at a difference of averages, we need the standard deviation of that difference. Remember, though, we can’t subtract standard deviations; we have to add their variances.
So, we get this formula… $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
This gives us the following… $\sqrt{\frac{1.5^2}{25} + \frac{1^2}{30}} = 0.3512$ (this calculation takes the men’s sample size as 25)

Slide 11
Find our t-statistic
In this case $t = \frac{4 - 5}{0.3511884584} = -2.847$
How many degrees of freedom?
Since there are two samples we will use the df of the smaller sample, because that will be a more conservative guess. So, we will say that df = 24. So, our P-value is between .0025 and .005. This is definitely enough evidence to reject the null hypothesis, and we can say pretty securely that women study more than men!

Slide 12
Our Calculator Can Do This
• This is a 2-SampTTest
• Notice that we get the same t-statistic, but that they use df=40.
– They get that number from the formula on page 633.
– We aren’t going to worry about that. We’ll let the calculator do it.
• Our P-Value was a pretty good guess, though, even with the conservative degrees of freedom.
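The two ways of choosing degrees of freedom can be compared with a short calculation. This sketch uses the numbers from Slides 10 and 11 (25 men, 30 women). It assumes that the calculator's formula is the Welch–Satterthwaite approximation, which the slides do not state but which does reproduce a value near 40; the variable names are illustrative.

```python
import math

# Numbers used on Slides 10-11
mean_m, s_m, n_m = 4.0, 1.5, 25   # men
mean_w, s_w, n_w = 5.0, 1.0, 30   # women

# Variances add, so the standard error of the difference is the square root
# of the summed variances of the two sample means.
var_m = s_m**2 / n_m
var_w = s_w**2 / n_w
se_diff = math.sqrt(var_m + var_w)
t_stat = (mean_m - mean_w) / se_diff

# Conservative choice: df of the smaller sample
df_conservative = min(n_m, n_w) - 1

# Assumed calculator formula (Welch-Satterthwaite approximation)
df_welch = (var_m + var_w) ** 2 / (var_m**2 / (n_m - 1) + var_w**2 / (n_w - 1))

print(f"standard error = {se_diff:.4f}, t = {t_stat:.3f}")
print(f"conservative df = {df_conservative}, calculator-style df = {df_welch:.2f}")
```

The smaller df gives heavier tails and so a slightly larger P-value, which is why it counts as the conservative choice.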

Slide 13
Two sample t-interval

Slide 14
Two proportions
• Two proportion z-interval
• Two proportion z-test

Chapter 14
Know when to use a chi-squared test of independence and how to find the degrees of freedom, (r − 1)(c − 1), (10b, 5d)
Know how to write a null and alternative hypothesis for a chi-square test of independence, (09, 1c; 04, 5a)
Know how to write a null and alternative hypothesis for a goodness of fit test, (08, 4a; 03b, 5c)
Know how to find expected counts for a GOF test if each possibility has a different proportion, (08, 4a; 03b, 5c; M2, 19)
Know how to find a chi squared statistic, the degrees of freedom, and the P-value for a GOF test, (08, 4a; 06, 6c; 03b, 5c; M7, 17; M2, 19)
Know how to draw a conclusion in context for a GOF test, (08, 4a; 03b, 5c)
What does an individual chi squared value represent? Does it tell you if the expected was higher or lower?
(08, 4b)
Know how to interpret a P-value for a GOF test in the context of the problem, (06, 6f)
Know how to check the conditions for a chi-squared test of independence, (04, 5a; 03, 5)
Know how to find a chi squared statistic, the degrees of freedom, and the P-value for a chi squared test of independence, (04, 5a; 03, 5)
Know how to interpret a P-value for a chi squared test of independence, (04, 5a; 03, 5)
Know how to write a pair of hypotheses for a chi squared test of independence, (03, 5)
Know how to find the expected outcomes for a chi squared test of independence, (M2, 11)

2010B #5 2009 #1 2008 #5 2004 #5 2003 #5 2003B #5

Chi Squared Goodness of Fit Test: Categorical Data
I) Name the test and state the formula
   a. Chi Squared Goodness of Fit Test
   b. $\chi^2 = \sum \frac{(O - E)^2}{E}$
II) Write your pair of hypotheses
III) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether the sample is large enough
      i. At most 20% of the expected outcomes are less than 5
      ii. All the expected outcomes are more than 1
IV) Do the math
   a. Find your expected outcomes
   b. Plug the given numbers into the formula and find your chi squared value, state your degrees of freedom, and state your P-value. Use a calculator to verify
V) Draw a conclusion in context
   a. We can/cannot reject the null hypothesis at the ____% significance level. There is/isn’t evidence to say that ____________________________

Chi Squared Test of Independence: Categorical Data (Two way table)
I) Name the test and state the formula
   a. Chi Squared test of independence
   b. $\chi^2 = \sum \frac{(O - E)^2}{E}$
II) Write your pair of hypotheses
III) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether the sample is large enough
      i. At most 20% of the expected outcomes are less than 5
      ii. All the expected outcomes are more than 1
IV) Do the math
   a. Find your expected outcomes (expected = row total × column total / table total)
   b.
Plug the given numbers into the formula and find your chi squared value, state your degrees of freedom, (r − 1)(c − 1), and state your P-value. Use a calculator to verify
V) Draw a conclusion in context
   a. We can/cannot reject the null hypothesis at the ____% significance level. There is/isn’t evidence to say that ____________________________

Slide 1
Chapter 14
Distributions of Categorical Variables: Chi-Square Procedures

Slide 2
What is a Chi-Square Distribution?

Slide 3
Chi Square Distributions
• Chi Square Distributions are a family of distributions that are skewed right, never negative, and specified by their degrees of freedom
• I think of it as a special type of t or z statistic

Slide 4
Chi Square Distributions
• We treat them similarly to t distributions
• If we have df = 4 and look up our $\chi^2$ value in the table, we get a P-value between .005 and .010 in this example

Slide 5
Chi Square Goodness of Fit Test

Slide 6
When to do a GOF
• You do a goodness of fit test when you have a categorical variable with more than 2 options
• You are trying to see if the distribution is not what you expected
• This usually takes place when you see a one
way table

Slide 7
Example
• Historically, the distribution of CV soccer games is as follows: 50% win, 30% loss, and 20% tie
• This last season, they had the following distribution

  Win   Loss   Tie
  16    5      4

• Is there statistical evidence that this year is different?

Slide 8
How to do a chi square test
1) Name your test and state the formula
2) Write your hypotheses
3) Check the conditions
4) Do the math
5) Make a conclusion

Slide 9
Name your test
• We are doing a chi square Goodness of Fit Test

Slide 10
Write your hypotheses
$H_0$: The distribution of wins, losses, and ties is the same as it has been historically.
(The distribution of wins, losses, and ties for this year’s team is 50%/30%/20%)
$H_a$: The distribution of wins, losses, and ties is not the same as it has been historically

Slide 11
Check Conditions
• For chi square tests, there are only two conditions: the data are random, and all expected values are greater than 1 (with fewer than 20% of them smaller than 5)
• For this problem, we will assume the games are a random sample
• The expected outcomes are all 5 or larger

  Wins   Losses   Ties
  12.5   7.5      5

Slide 12
Do the math
• Find your test statistic:
  $\chi^2 = \frac{(16 - 12.5)^2}{12.5} + \frac{(5 - 7.5)^2}{7.5} + \frac{(4 - 5)^2}{5} \approx 2.01$
• We have two degrees of freedom
• So, our P-value is larger than 0.25 (about 0.37)

Slide 13
Draw a Conclusion
• Since we have a P-value larger than 5%, we cannot reject the null hypothesis at the 1% or 5% level.
There is no evidence that this team’s distribution of wins, losses, and ties differs from the historical distribution

Slide 14
Chi Square Test of Independence

Slide 15
When to do a Chi Square Test of Independence
• You use this test when you have a two way table and you want to check if the two variables in the table are independent

Slide 16
Example
• Based on the following distribution, is there evidence that gender and grade in Mr. Merlo’s class are not independent?

           A    B    C    D    F
  Male     20   15   30   10   10
  Female   10   13   20   8    8

Slide 17
Name your test
• We are doing a chi square test of independence

Slide 18
Write your hypotheses
$H_0$: Gender and the grade one receives in Mr. Merlo’s class are independent
$H_a$: Gender and the grade one receives in Mr.
Merlo’s class are not independent

Slide 19
Check Conditions
• For chi square tests, there are only two conditions: the data are random, and all expected values are greater than 1 (with fewer than 20% of them smaller than 5)
• For this problem, we will assume the students are a random sample
• The expected outcomes are all 5 or larger

           A       B       C       D       F
  Male     17.71   16.53   29.51   10.63   10.63
  Female   12.29   11.47   20.49   7.38    7.38

Slide 20
How to get the expected values
• If you do not use your calculator, the best way to find the expected values is to use the formula
  expected = (row total × column total) / table total
• To find the expected number of men who should get C’s:
  $\frac{85 \times 50}{144} \approx 29.51$

Slide 21
Do the math
• Our calculator gives us $\chi^2 \approx 1.27$
• The degrees of freedom can be found by the following formula: df = (r − 1)(c − 1)
  – Our problem has df = (2 − 1)(5 − 1) = 4
• So, the P-value is .8669

Slide 22
Draw a Conclusion
• We cannot reject the null hypothesis in favor of the alternative. There is no evidence that gender and grade in Mr. Merlo’s class are not independent.

Slide 23
Know how to interpret a P value in context
• If gender and grade in Mr.
Merlo’s class were independent, there would be an 86.69% chance of getting, by random chance alone, a distribution at least as extreme as the one that actually occurred

Chapter 15
Know what the t score and P-value on the Minitab printout are the results of, (08, 6c)
Know how to write a null and alternative hypothesis for testing slope and when this is an appropriate test, (07, 6c; M7, 28)
Know how to find a t-statistic, degrees of freedom (n − 2), and P-value for a test on slope, (07, 6c)
Know how to draw a conclusion from a P-value for a test on slope, (07, 6c)
Know how to calculate a confidence interval for a slope when given a Minitab printout, (07b, 6b; 05b, 5c; M2, 21)
Know what it means if 0 is a possibility in a confidence interval or if you fail to reject the test against $\beta = 0$, (07b, 6c)
Know that the SE in a Minitab printout is the standard error of the slope, and know how to interpret that in the context of a problem, (06, 2c)

2008 #6 (If you do Chapter 13 before) 2007 #6 2007B #6 2005B #5

Slide 1
Chapter 15
Inference for Regression

Slide 2
Confidence Intervals for slope
• We will do a 95% confidence interval for the slope of the regression line of the cost to lay tile in a house on the square inches that need to be covered (x = square inches, y = cost). We will say that there are 11 observations

Slide 3
Confidence Intervals Continued
C.I.
= your statistic ± (t statistic) × (standard deviation)
Since there are n = 11 observations, we will use df = n − 2 = 9, because the regression line estimates two parameters (the slope and the intercept)
We are 95% confident that the slope of the LSRL of cost on square footage of tile is between 0.4746 and .7931 dollars per square inch
Since 0 is not in the interval, it is safe to say that there is a relationship between the two variables

Slide 4
Hypothesis Test
• Most of the time, we will run the following hypothesis test:
  $H_0: \beta = 0$
  $H_a: \beta \neq 0$
• This is what the Minitab printout gives us
• Which gives us a P-value = 0.000
• We can reject the null hypothesis in favor of the alternative. There is evidence that there is a relationship between square footage and cost

Confidence Intervals
Every confidence interval can be done using this formula:
C.I. = your statistic ± (z or t score) × (standard deviation)
All problems will be done in the following format:
I) State what confidence interval you are doing
II) Check the conditions to make sure it is appropriate
III) Do the math
IV) State a conclusion in context

One sample (paired) z-confidence interval: Quantitative Data
I) Name the test and state the formula
   a. One sample (matched pairs) z-confidence interval
   b. $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$
II) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether or not the distribution is normal
      i. If n < 10, make a boxplot and verify that it’s normal
      ii. If n < 30, make a boxplot and verify that it’s close to normal
      iii. If n > 30, state that the CLT allows us to say that it’s approximately normal
   c. Determine whether the n observations were taken independently.
If sampling with replacement, this can be assumed. If sampling without replacement, then you must verify that the sample is less than 10% of the population.
III) Do the math
   a. Plug the given numbers into the formula, and use a calculator to verify
IV) Draw a conclusion in context
   a. We are _____% confident that the avg. ____________________ is between _____ and _____

One sample (paired) t-confidence interval: Quantitative Data
I) Name the test and state the formula
   a. One sample (paired) t-confidence interval
   b. $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$
II) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether or not the distribution is normal
      i. If n < 10, make a boxplot and verify that it’s normal
      ii. If n < 30, make a boxplot and verify that it’s close to normal
      iii. If n > 30, state that the CLT allows us to say that it’s approximately normal
   c. Determine whether the n observations were taken independently. If sampling with replacement, this can be assumed. If sampling without replacement, then you must verify that the sample is less than 10% of the population.
III) Do the math
   a. Plug the given numbers into the formula, state the degrees of freedom that you are using, and use a calculator to verify
IV) Draw a conclusion in context
   a. We are _____% confident that the avg. ____________________ is between _____ and _____

Two Sample t-confidence interval for a difference in sample means: Quantitative Data
I) Name the test and state the formula
   a. Two-sample t-confidence interval for a difference in sample means
   b. $(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
II) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether or not the distribution is normal for both of the samples (you must check twice)
      i. If n < 10, make a boxplot and verify that it’s normal
      ii. If n < 30, make a boxplot and verify that it’s close to normal
      iii. If n > 30, state that the CLT allows us to say that it’s approximately normal
   c.
Determine whether the n observations were taken independently. If sampling with replacement, this can be assumed. If sampling without replacement, then…
      i. If there are two distinct populations, this can be done by verifying that each sample is less than 10% of its respective population
      ii. If the two groups come from the same population, this can be done by verifying that the individuals were placed in their respective groups at random
III) Do the math
   a. Plug the given numbers into the formula, state the degrees of freedom that you are using, and use a calculator to verify
IV) Draw a conclusion in context
   a. We are _____% confident that the avg. difference between ____ and _____ is between _____ and _____

One sample z-confidence interval for proportions: Categorical Data
I) Name the test and state the formula
   a. One sample z-confidence interval for proportions
   b. $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
II) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether or not the distribution is normal
      i. $n\hat{p} \geq 10$
      ii. $n(1 - \hat{p}) \geq 10$
   c. Determine whether the n observations were taken independently. If sampling with replacement, this can be assumed. If sampling without replacement, then you must verify that the sample is less than 10% of the population.
III) Do the math
   a. Plug the given numbers into the formula, and use a calculator to verify
IV) Draw a conclusion in context
   a. We are _____% confident that the proportion of ____________________ is between _____ and _____

Two sample z-confidence interval for a difference of proportions: Categorical Data
I) Name the test and state the formula
   a. Two sample z-interval for a difference of proportions
   b. $(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$
II) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether or not the distribution is normal
      i. $n_1\hat{p}_1$ and $n_1(1 - \hat{p}_1)$ are both large enough
      ii. $n_2\hat{p}_2$ and $n_2(1 - \hat{p}_2)$ are both large enough
   c. Determine whether the n observations were taken independently. If sampling with replacement, this can be assumed.
If sampling without replacement, then…
      i. If there are two distinct populations, this can be done by verifying that each sample is less than 10% of its respective population
      ii. If the two groups come from the same population, this can be done by verifying that the individuals were placed in their respective groups at random
III) Do the math
   a. Plug the given numbers into the formula, and use a calculator to verify
IV) Draw a conclusion in context
   a. We are _____% confident that the difference in the proportion of ________ and _______ is between _____ and _____

Hypothesis Tests
Every test statistic can be calculated using this formula:
test statistic = (statistic − parameter) / (standard deviation of the statistic)
All problems will be done in the following format:
I) State what test you are doing
II) State your null and alternative hypotheses and define parameters
III) Check the conditions to make sure it is appropriate
IV) Do the math
V) State a conclusion in context

One sample (paired) z-test: Quantitative Data
I) Name the test and state the formula
   a. One sample z-test for a sample mean
   b. $z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$
II) Write your pair of hypotheses
III) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether or not the distribution is normal
      i. If n < 10, make a boxplot and verify that it’s normal
      ii. If n < 30, make a boxplot and verify that it’s close to normal
      iii. If n > 30, state that the CLT allows us to say that it’s approximately normal
   c. Determine whether the n observations were taken independently. If sampling with replacement, this can be assumed. If sampling without replacement, then you must verify that the sample is less than 10% of the population.
IV) Do the math
   a. Plug the given numbers into the formula, state your z score, and your P-value. Use a calculator to verify
V) Draw a conclusion in context
   a. We can/cannot reject the null hypothesis at the ____% significance level.
There is/isn’t evidence to say that ____________________________

One sample (paired) t-test for a sample mean: Quantitative Data
I) Name the test and state the formula
   a. One sample (paired) t-test
   b. $t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$
II) Write your pair of hypotheses
III) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether or not the distribution is normal
      i. If n < 10, make a boxplot and verify that it’s normal
      ii. If n < 30, make a boxplot and verify that it’s close to normal
      iii. If n > 30, state that the CLT allows us to say that it’s approximately normal
   c. Determine whether the n observations were taken independently. If sampling with replacement, this can be assumed. If sampling without replacement, then you must verify that the sample is less than 10% of the population.
IV) Do the math
   a. Plug the given numbers into the formula, state the t statistic, the degrees of freedom that you are using, and your P-value. Use a calculator to verify
V) Draw a conclusion in context
   a. We can/cannot reject the null hypothesis at the ____% significance level. There is/isn’t evidence to say that ____________________________

Two Sample t-test: Quantitative Data
I) Name the test and state the formula
   a. Two-sample t-test
   b. $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
II) State your pair of hypotheses
III) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether or not the distribution is normal for both of the samples (you must check twice)
      i. If n < 10, make a boxplot and verify that it’s normal
      ii. If n < 30, make a boxplot and verify that it’s close to normal
      iii. If n > 30, state that the CLT allows us to say that it’s approximately normal
   c. Determine whether the n observations were taken independently. If sampling with replacement, this can be assumed. If sampling without replacement, then…
      i. If there are two distinct populations, this can be done by verifying that each sample is less than 10% of its respective population
      ii.
If the two groups come from the same population, this can be done by verifying that the individuals were placed in their respective groups at random
IV) Do the math
   a. Plug the given numbers into the formula, state your t-statistic, the degrees of freedom that you are using, and your P-value. Use a calculator to verify
V) Draw a conclusion in context
   a. We can/cannot reject the null hypothesis at the _____% significance level. There is/isn’t evidence to say that _______________________

One sample z-test for a proportion: Categorical Data
I) Name the test and state the formula
   a. One sample z-test for a proportion
   b. $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0(1 - p_0)}{n}}}$
II) State your pair of hypotheses
III) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether or not the distribution is normal
      i. $np_0 \geq 10$
      ii. $n(1 - p_0) \geq 10$
   c. Determine whether the n observations were taken independently. If sampling with replacement, this can be assumed. If sampling without replacement, then you must verify that the sample is less than 10% of the population.
IV) Do the math
   a. Plug the given numbers into the formula, state your z statistic, and your P-value. Use a calculator to verify.
V) Draw a conclusion in context
   a. We can/cannot reject the null hypothesis at the ____% significance level. There is/isn’t evidence to say that ____________________________

Two sample z-test for a difference of proportions: Categorical Data
I) Name the test and state the formula
   a. Two sample z-test for a difference of proportions
   b. $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}_c(1 - \hat{p}_c)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\hat{p}_c$ is the combined (pooled) sample proportion
II) Write your pair of hypotheses
III) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether or not the distribution is normal
      i. $n_1\hat{p}_1$ and $n_1(1 - \hat{p}_1)$ are both large enough
      ii. $n_2\hat{p}_2$ and $n_2(1 - \hat{p}_2)$ are both large enough
   c. Determine whether the n observations were taken independently. If sampling with replacement, this can be assumed. If sampling without replacement, then…
      i.
If there are two distinct populations, this can be done by verifying that each sample is less than 10% of its respective population
      ii. If the two groups come from the same population, this can be done by verifying that the individuals were placed in their respective groups at random
IV) Do the math
   a. Plug the given numbers into the formula, state your z statistic, and your P-value. Use a calculator to verify.
V) Draw a conclusion in context
   a. We can/cannot reject the null hypothesis at the ____% significance level. There is/isn’t evidence to say that ____________________________

Chi-Squared Test
Every test statistic can be calculated using this formula:
$\chi^2 = \sum \frac{(O - E)^2}{E}$
All problems will be done in the following format:
I) State what test you are doing
II) State your null and alternative hypotheses
III) Check the conditions to make sure it is appropriate
IV) Do the math
V) State a conclusion in context

Chi Squared Goodness of Fit Test: Categorical Data
I) Name the test and state the formula
   a. Chi Squared Goodness of Fit Test
   b. $\chi^2 = \sum \frac{(O - E)^2}{E}$
II) Write your pair of hypotheses
III) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b. Determine whether the sample is large enough
      i. At most 20% of the expected outcomes are less than 5
      ii. All the expected outcomes are more than 1
IV) Do the math
   a. Find your expected outcomes
   b. Plug the given numbers into the formula and find your chi squared value, state your degrees of freedom, and state your P-value. Use a calculator to verify
V) Draw a conclusion in context
   a. We can/cannot reject the null hypothesis at the ____% significance level. There is/isn’t evidence to say that ____________________________

Chi Squared Test of Independence: Categorical Data (Two way table)
I) Name the test and state the formula
   a. Chi Squared test of independence
   b. $\chi^2 = \sum \frac{(O - E)^2}{E}$
II) Write your pair of hypotheses
III) Determine whether or not conditions are met
   a. Determine whether the n samples were taken at random
   b.
Determine whether the sample is large enough
      i. At most 20% of the expected outcomes are less than 5
      ii. All the expected outcomes are more than 1
IV) Do the math
   a. Find your expected outcomes (expected = row total × column total / table total)
   b. Plug the given numbers into the formula and find your chi squared value, state your degrees of freedom, (r − 1)(c − 1), and state your P-value. Use a calculator to verify
V) Draw a conclusion in context
   a. We can/cannot reject the null hypothesis at the ____% significance level. There is/isn’t evidence to say that ____________________________
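
The "Do the math" step of both chi-squared outlines can be sketched in a few lines of Python, using the two worked examples from the Chapter 14 slides (the soccer record and the grade-by-gender table). This is a minimal sketch under stated assumptions: the function names are made up for illustration, and the P-value uses a closed-form tail formula that holds only when the degrees of freedom are even, which happens to cover both examples (df = 2 and df = 4). For other cases, use the calculator or a table.

```python
import math


def chi_square_stat(observed, expected):
    """Sum of (O - E)^2 / E over matching observed and expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_p_value_even_df(stat, df):
    """Upper-tail P-value of a chi-square statistic.

    Uses the closed form exp(-x/2) * sum_{i < df/2} (x/2)^i / i!,
    which is valid only for positive even df (an illustration helper,
    not a general-purpose routine).
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form only holds for positive even df")
    half = stat / 2
    return math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )


def expected_counts(table):
    """Expected count for each cell: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


# Goodness of fit: 16 wins, 5 losses, 4 ties against 50% / 30% / 20%
observed = [16, 5, 4]
proportions = [0.5, 0.3, 0.2]
gof_expected = [sum(observed) * p for p in proportions]
gof_stat = chi_square_stat(observed, gof_expected)
gof_df = len(observed) - 1
gof_p = chi_square_p_value_even_df(gof_stat, gof_df)
print(f"GOF: chi-square = {gof_stat:.2f}, df = {gof_df}, P = {gof_p:.2f}")

# Independence: grades (A, B, C, D, F) by gender (male row, female row)
grades = [[20, 15, 30, 10, 10],
          [10, 13, 20, 8, 8]]
ind_expected = expected_counts(grades)
ind_stat = sum(chi_square_stat(o, e) for o, e in zip(grades, ind_expected))
ind_df = (len(grades) - 1) * (len(grades[0]) - 1)
ind_p = chi_square_p_value_even_df(ind_stat, ind_df)
print(f"Independence: chi-square = {ind_stat:.2f}, df = {ind_df}, "
      f"P = {ind_p:.4f}")
```

For the soccer data this gives a chi-square of about 2.01 with df = 2 and a P-value of about 0.37, which supports the decision not to reject the null hypothesis. For the grade table it gives a chi-square of about 1.27 with df = 4 and a P-value of about 0.867, in line with the .8669 reported on the slide.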