SUMMER KNOWHOW
STUDY AND LEARNING CENTRE
An introduction to STATISTICS

Contents
Data
Summation Notation
Measures of Spread
Introductory Probability
Sample Spaces
Conditional Probability
Binomial Distribution
Normal Distribution
Standard Normal Distribution
Probability and Normal Distribution
Sampling Distributions
Confidence Intervals
Hypothesis Testing

DATA

Definitions:
Population: the total group of individuals or items.
Sample: a group of individuals or items chosen from the population.
Data: the information collected from the sample or population.
Statistic: a number calculated from the sample data.
Parameter: a number calculated from the population data.

Types of data:
Data may be either qualitative (categorical) or quantitative (numerical).
Qualitative Data (classified or labelled). Data is put into non-numerical categories. Blood type, religion and cause of death are all examples of qualitative data.
Quantitative Data (counted or measured). There are two types of quantitative data.
o Discrete Data: data is put into categories depending on its counted number; for example, the number of children in a family.
o Continuous Data: data is put into categories depending on its measured size; for example, height.

Graphical Representation
Qualitative/Categorical data is often represented by means of a bar chart or a pie chart.
Example 1
The table shows the percentage of imports from various countries. This data can be represented on a pie chart so that comparisons are easier:

Country        Imports (%)
USA            25
Japan          20
Germany        10
UK             7
China          6
New Zealand    4
Italy          3
Other          25

[Pie chart: percentage of imports by country]

Quantitative/Numerical Data is often represented by means of a frequency bar chart called a histogram.

Example 2
A group of school students were surveyed to find the number of children in their families. This data can be represented using a histogram.

No. of Children   1    2    3    4    5    6    7    Total
Frequency         13   21   11   4    3    1    1    54

[Histogram: No. of Children in a Family; horizontal axis Children, vertical axis Frequency]

Exercises
1. Label each of the following as either a categorical or numerical variable. For the numerical variables label each as either discrete or continuous.
(a) Hair colour
(b) A person's religion
(c) A person's height
(d) Number of children in a family
(e) The weights of babies born on a particular day
(f) The number of crimes committed in Victoria each week
(g) The distance travelled to work by the employees of a large company
(h) The make of car driven by students at RMIT
2. Represent the data in example 1 in a bar graph.

Answers
1. (a) Categorical (b) Categorical (c) Numerical – continuous (d) Numerical – discrete (e) Numerical – continuous (f) Numerical – discrete (g) Numerical – continuous (h) Categorical
2. [Bar graph: Percentage Imports by Country, with bars for China, Germany, Italy, Japan, NZ, UK, USA and Other]

SUMMATION NOTATION

Summation notation or sigma notation is a shorthand method of writing the sum or addition of a string of similar terms. A typical element of the sequence which is being summed appears to the right of the summation sign.
$\sum_{i=1}^{5} 2i$

In this expression:
$\sum$ indicates a sum of terms.
$2i$ shows what each term looks like. The value of i will change with each term; the 2 will remain constant with each term.
$i = 1$ (below the sigma symbol) is the first value of i.
5 (above the sigma symbol) is the last value of i.

To expand we replace i by its starting value (below the sigma symbol) and obtain each successive term by adding 1 to the previous value of i, until we reach the final value of i (above the sigma symbol).
For the above sequence:
$\sum_{i=1}^{5} 2i = 2 \times 1 + 2 \times 2 + 2 \times 3 + 2 \times 4 + 2 \times 5 = 30$

Examples:
1. Expand and evaluate $\sum_{i=0}^{3} (i^2 - 3)$
$\sum_{i=0}^{3} (i^2 - 3) = (0^2 - 3) + (1^2 - 3) + (2^2 - 3) + (3^2 - 3)$
$= (-3) + (-2) + 1 + 6$
$= 2$

2. Given the set of data $x_1 = 1$, $x_2 = 2$, $x_3 = 4$, $x_4 = 5$ evaluate
(a) $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
(b) $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

(a) $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + x_4}{n} = \frac{1 + 2 + 4 + 5}{4} = 3$

(b) $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + (x_3 - \bar{x})^2 + (x_4 - \bar{x})^2}{4 - 1}$
$= \frac{(1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2}{3}$
$= \frac{4 + 1 + 1 + 4}{3}$
$= \frac{10}{3}$

NB: If n is not specified then it is assumed to be the number of scores or values. $\sum x$ means the sum of all the scores.

Exercise
1. Find
(a) $\sum_{i=1}^{3} (5i - 2)$
(b) $\sum_{i=1}^{3} (5i) - 2$
2. Given $x_1 = -2$, $x_2 = 0$, $x_3 = 1$, $x_4 = 3$, $x_5 = 3$ find
(a) $\sum_{i=1}^{5} 10x_i$
(b) $10\sum_{i=1}^{5} x_i$
(c) $\sum_{i=1}^{5} (x_i)^2$
(d) $\left(\sum_{i=1}^{5} x_i\right)^2$
(e) $\sum_{i=1}^{5} i(x_i)$
(f) $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

Answers:
1. (a) 24 (b) 28
2. (a) 50 (b) 50 (c) 23 (d) 25 (e) 28 (f) 1

MEAN, MODE & MEDIAN

The mean, mode and median are measures of the centre or middle of a set of data. They are sometimes called measures of central tendency and they provide a single value that is typical of the data.
The mode is the value that occurs most often.
The median is the middle value when the data is arranged in order.
The mean (or average) is the sum of all the scores divided by the number of scores in the data set:
$\bar{x} = \frac{\sum x}{n}$

Examples
1. Consider the data set 3, 2, 0, 5, 2
The mode is 2 because it has the highest frequency.
Rearranging the data in order gives 0, 2, 2, 3, 5: the median is 2, the middle score.
The mean is $\bar{x} = \frac{\sum x}{n} = \frac{12}{5} = 2.4$
2.
Find the mean, mode and median of the data displayed in the frequency table

x           -3   -1   4   5   24
frequency   1    3    1   2   1     n = $\sum f$ = 8

[n, the number of scores, is the sum of all the frequencies]
The mode is -1 [this score has the highest frequency, so it occurs most often].
There are 8 scores and so two 'middle' scores, the 4th and 5th. The median is the average of these two scores:
median = $\frac{-1 + 4}{2}$ = 1.5
The mean is
$\bar{x} = \frac{(-3) \times 1 + (-1) \times 3 + 4 \times 1 + 5 \times 2 + 24 \times 1}{8} = 4$
[NB: A disadvantage of the mean is that it is affected or distorted by extreme or outlying values.]

Graphs and the mode, median and mean
For symmetrical bell shaped graphs the mode, median and mean all have the same value.
[Figure: symmetrical bell shaped graph centred at 100]

The data set 1, 1, 2 … 86, 94, 96 is shown in the stem and leaf plot below.
[Stem and leaf plot]
A scan of the data organised into the plot reveals that the mode is 35.
There are 21 scores less than or equal to 38 and 20 scores greater than or equal to 50. There are also 7 scores in the forties. So altogether there are 48 scores.
The median will be midway between the 24th and 25th scores, which are easy to locate when we know the 21st score is 38. The median is 47.5.

Exercise
1. Given the following scores: 12, 12, 13, 14, 14, 15, 15, 15, 16.
(a) Find the mean score
(b) Find the median
(c) What is the mode?
2. Determine the mean, mode and median for the data in the frequency table

Score       40   50   60   70   80   90   Total
Frequency   1    4    8    3    3    1    20

3. Find the mode and median for the data displayed in the stem and leaf plot for which the smallest score is 10 and the largest 69.
[Stem and leaf plot]

Answers
1. (a) 14 (b) 14 (c) 15
2. mean 63, mode 60, median 60
3. mode 41, median 31

MEASURES OF SPREAD

Measuring spread or dispersion in data:
Consider the two sets of values below:
Set A: 4, 4, 5, 5, 5, 6, 6.
Set B: 1, 3, 4, 5, 6, 7, 9.
Both groups A and B have mean = median = 5 but the data sets are quite different. The values in Set A are less spread out than those in Set B. To compare data sets it is also useful to look at the measure of spread.

Range
The most basic measure of spread is the range, the distance from the smallest to the largest value.
Range = Largest Value - Smallest Value

Set A: Range = 6 - 4 = 2
Set B: Range = 9 - 1 = 8

We can see that Set B has greater spread than Set A.
But the problem with the range is that it uses only two of the values in the data set. One of these may be an odd or unusual value called an outlier.
Consider the two sets of values below:
Set Y: 1, 1, 2, 2, 2, 2, 2, 100.
Set Z: 1, 18, 23, 41, 59, 63, 87, 100.
The range for both is 99, even though the values in Set Y are tightly clustered apart from one unusual value.

Interquartile Range
The Interquartile Range (IQR) is the distance between the first quartile Q1 and the third quartile Q3.
IQR = Q3 - Q1
[Diagram: ordered data from lowest value to highest value, marked with Q1, Q2 (median) and Q3]
The first and third quartiles are values that are ¼ and ¾ of the way through the ordered data. Q1 is the median of the lower half of the data and Q3 is the median of the upper half of the data. (NB: there are other ways to find Q1 and Q3 so check with your program.)

Set Y: 1 1 2 2 | 2 2 2 100
Q1 = 1.5 and Q3 = 2
IQR = Q3 - Q1 = 2 - 1.5 = 0.5

Set Z: 1 18 23 41 | 59 63 87 100
Q1 = 20.5 and Q3 = 75
IQR = Q3 - Q1 = 75 - 20.5 = 54.5

So for the data sets Y and Z the median together with the IQR are better for summarising the data sets.

Standard Deviation
A measure of dispersion or spread in a data set that takes into account all of the data is the standard deviation. It gives an indication of the typical or average distance of each score from the mean for the data.
The standard deviation can be calculated using the formula
$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
but it is much more convenient to use your calculator or the computer.
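As one way of doing this on a computer, the minimal Python sketch below applies the standard deviation formula with the n - 1 divisor (the same divisor as in the formula for s² in the summation notation section) to Set A and Set B from the start of this section. The function name sample_std is an illustrative choice and is not taken from this booklet.

```python
import math
import statistics


def sample_std(values):
    """Sample standard deviation: sqrt( sum((x - mean)^2) / (n - 1) )."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = sum((x - mean) ** 2 for x in values)
    return math.sqrt(squared_deviations / (n - 1))


set_a = [4, 4, 5, 5, 5, 6, 6]
set_b = [1, 3, 4, 5, 6, 7, 9]

for name, data in (("Set A", set_a), ("Set B", set_b)):
    # statistics.stdev from the standard library uses the same n - 1 divisor,
    # so both columns should agree.
    print(name, round(sample_std(data), 2), round(statistics.stdev(data), 2))
```

Both columns give 0.82 for Set A and 2.65 for Set B. If a calculator or program reports slightly smaller values, it is probably dividing by n rather than n - 1.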
Some statistical tests make use of the variance, which is the square of the standard deviation:
Variance = $s^2$

Set A: 4 4 5 5 5 6 6     $\bar{x}$ = 5, s = 0.82
Set B: 1 3 4 5 6 7 9     $\bar{x}$ = 5, s = 2.65

We can interpret the standard deviation: the scores in set A are typically 0.82 away from the mean, but the scores in set B are typically 2.65 away from the mean. Though sets A and B have the same centre, the scores in set B are clearly more dispersed, or have greater spread, than those in set A.

Exercise
1. Given the following scores: 12, 12, 13, 14, 14, 15, 15, 15, 16, find the standard deviation.
2. A class of 22 students gained the following scores, out of 10, on a test: 5, 7, 8, 7, 6, 5, 6, 4, 7, 4, 8, 3, 7, 9, 4, 9, 7, 3, 6, 8, 7, 5. Find the
(a) range (b) IQR (c) standard deviation.
3. Pistol Pete is the star full-forward for the local football team. Last season he played 20 games and kicked the following number of goals in each game: 5, 6, 6, 5, 7, 4, 3, 1, 3, 8, 7, 8, 6, 0, 5, 2, 7, 6, 5, 6.
(a) Find the mean and the standard deviation for the number of goals that Pete kicked per game.
(b) This season the mean number of goals Pete kicks per game is 5, with a standard deviation of 2.7. In which year was his performance more consistent?

Answers
1. 1.414
2. (a) 6 (b) 2 (c) 1.807
3. (a) $\bar{x}$ = 5, s = 2.22 (b) Last year: smaller standard deviation means less variation

Parts of this resource were adapted from materials created by the Academic Skills Unit at Southern Cross University.

INTRODUCTORY PROBABILITY

A probability is written as a number between zero and one:
$0 \le \Pr(A) \le 1$
Pr(A) = 0 means that event A is impossible.
Pr(A) = 1 means that event A is certain.
When considering a set of all possible outcomes an event is a particular outcome of interest.
For example,
In tossing a coin the particular event of interest might be 'obtaining a head'.
In considering the weather for Saturday the event of interest might be 'it doesn't rain'.
In planning a two child family the particular event of interest might be 'a boy and a girl'.

The probability of an event E can be found with the formula:
$\Pr(E) = \frac{\text{number of favourable outcomes}}{\text{total number of outcomes}}$ [assuming all outcomes are equally likely]

Examples:
If two coins are tossed find the probability of obtaining two heads.
Let E be the event 'two heads'.
The possible outcomes are HH HT TH TT
$\Pr(E) = \frac{\text{number of favourable outcomes}}{\text{total number of outcomes}} = \frac{1}{4}$

If a die is thrown find the probability of obtaining an odd number.
Let E be the event 'an odd number'.
The possible outcomes are 1 2 3 4 5 6
$\Pr(E) = \frac{3}{6} = \frac{1}{2}$

The multiplication principle
Two events, A and B, are independent if the fact that A occurs does not affect the probability of B occurring. Because successive tosses of a coin are independent events, an alternate way of calculating the probability in example one would be to use the multiplication principle.
If A and B are independent events then
$\Pr(A \text{ and } B) = \Pr(A \cap B) = \Pr(A) \times \Pr(B)$
The probability of a head on the first toss (H1) and a head on the second toss (H2) is
$\Pr(H_1 \cap H_2) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4}$

The addition principle
$\Pr(A \text{ or } B) = \Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
If A and B are mutually exclusive (cannot happen together) then
$\Pr(A \text{ or } B) = \Pr(A \cup B) = \Pr(A) + \Pr(B)$
If we are tossing a single die twice and want to calculate the probability that a 6 occurs, then the 6 could occur on the first toss (S1) or on the second toss (S2):
$\Pr(S_1 \text{ or } S_2) = \Pr(S_1 \cup S_2) = \Pr(S_1) + \Pr(S_2) - \Pr(S_1 \cap S_2)$ [because the events are not mutually exclusive]
$= \frac{1}{6} + \frac{1}{6} - \frac{1}{36} = \frac{11}{36}$

Complementary events
If E is an event then (not E), also written $\bar{E}$ or E', is called the complement of E.
Examples of complementary events:
'winning the grand final' and 'not winning the grand final'
'passing a test' and 'failing a test'
'being left handed' and 'being right handed'
Because Pr(E) + Pr(E') = 1 it follows that Pr(E') = 1 - Pr(E)
In the previous example where a die was tossed twice the probability of not getting a 6 on either the first or second toss
$= 1 - \frac{11}{36} = \frac{25}{36}$

Exercise
1. If 1000 tickets are sold in a raffle and one winning ticket is chosen at random, what is my probability of winning the raffle if I buy 5 tickets?
2. If I roll a die, what is the probability that the number uppermost is greater than 4?
3. A bag contains 6 white marbles and 4 black marbles. A marble is chosen, the colour recorded and the marble replaced; this is done three times. What is the probability that all three marbles are white?
4. The probability that person A is alive in 30 years time is 0.7. The probability that person B is alive in 30 years time is 0.4. Find the probability that:
(a) both are alive in 30 years time
(b) neither is alive in 30 years time
(c) only one is alive in 30 years time
(d) at least one is alive in 30 years time.

Answers
1. $\frac{5}{1000} = \frac{1}{200}$
2. $\frac{2}{6} = \frac{1}{3}$
3. 0.216
4. (a) 0.28 (b) 0.18 (c) 0.54 (d) 0.82

SAMPLE SPACES

A list or diagram showing all possible outcomes in a probability experiment is called a sample space. Then
$\Pr(E) = \frac{\text{number of outcomes in } E}{\text{number of outcomes in the sample space}} = \frac{n(E)}{n(S)}$
For tossing a single die the sample space is 1, 2, 3, 4, 5, 6
and Pr(1) = Pr(2) = Pr(3) = Pr(4) = Pr(5) = Pr(6) = $\frac{1}{6}$
For a spinner which has 4 equal sectors, the sample space is Red, Green, Yellow, Blue
and Pr(R) = Pr(G) = Pr(Y) = Pr(B) = $\frac{1}{4}$
NB: The sum of the probabilities of the distinct outcomes within a sample space is 1.

Tree diagrams
A tree diagram can be used to find the sample space.
For example, if two coins are tossed there are four possible outcomes:
[Tree diagram: two tosses of a coin]
The sample space for tossing two coins is HH HT TH TT
If E is the event 'at least one head' then
$\Pr(E) = \Pr(HH \text{ or } HT \text{ or } TH) = \frac{1}{4} + \frac{1}{4} + \frac{1}{4} = \frac{3}{4}$

The sample space for a three child family is shown below:
[Tree diagram: three child family]
If E is the event 'first child a girl' then
$\Pr(E) = \frac{4}{8} = \frac{1}{2}$

Other sample spaces and diagrams
[Diagram: a deck of 52 playing cards]
If a single card is drawn from the deck and
(a) D is the event 'the card is a diamond' then $\Pr(D) = \frac{13}{52} = \frac{1}{4}$
(b) E is the event 'the card is a diamond (D) or an ace (A)' then
$\Pr(E) = \Pr(D \text{ or } A) = \Pr(D \cup A) = \Pr(D) + \Pr(A) - \Pr(D \cap A) = \frac{13}{52} + \frac{4}{52} - \frac{1}{52} = \frac{16}{52} = \frac{4}{13}$

Tables and Venn diagrams can also be used to organise information in a way that makes finding probabilities easier.

The diagram shows the number of people in a survey of 256 who regularly ate Kit Kats (K), Mars Bars (M) or Rocky Road (R).
[Venn diagram: Kit Kats, Mars Bars and Rocky Road]
From the diagram we can find probabilities such as Pr(K), probabilities involving both M and R, and Pr(KitKat and MarsBar but not Rocky Road). Using complementary events we can also find
Pr(at least one of these things) = 1 - Pr(none of these things)

The table shows the results of a study that looked at the association between smoking (S) and lung cancer (C).
[Table: smoking and lung cancer]
From the table we can find probabilities such as Pr(S), Pr(C') and probabilities involving both S and C.

Exercise
1. Use a tree diagram to find the sample space for a two child family. Hence find
(a) The probability that both children are girls
(b) The probability that the oldest child is a girl
(c) The probability that at least one child is a girl
2. The diagram shows the sample space for tossing a single die twice.
[Diagram: the 36 outcomes of two tosses of a die]
Find the probability that
(a) the first toss is a 4
(b) the sum of the two tosses is 5
(c) at least one toss is a 6
(d) neither toss is a 6
3. In a classroom of 20 Yr 12 VCE students 10 study Maths Methods, 7 study Specialist Maths and 5 study both. Organise the information in a Venn diagram and find the probability that a student chosen at random
(a) Studies neither of these maths subjects
(b) Studies Maths Methods but not Specialist Maths
4.
Find the probability that a card drawn at random from a pack is
(a) A red card
(b) Lower than a 5 (ace low)

Answers
1. (a) $\frac{1}{4}$ (b) $\frac{1}{2}$ (c) $\frac{3}{4}$
2. (a) $\frac{1}{6}$ (b) $\frac{1}{9}$ (c) $\frac{11}{36}$ (d) $\frac{25}{36}$
3. (a) $\frac{2}{5}$ (b) $\frac{1}{4}$
4. (a) $\frac{1}{2}$ (b) $\frac{4}{13}$

CONDITIONAL PROBABILITY

Dependent events
Two events are dependent if the outcome or occurrence of the first affects the outcome or occurrence of the second so that the probability is changed.

Example
A card is chosen at random from a pack. If the first card chosen is the jack of diamonds and it is not replaced what is the probability that the second card is
(a) a diamond? (b) a jack? (c) the queen of clubs?

(a) Pr(diamond) = $\frac{12}{51} = \frac{4}{17}$ [one less diamond in the pack, and one less card in the pack]
(b) Pr(jack) = $\frac{3}{51} = \frac{1}{17}$
(c) Pr(Q♧) = $\frac{1}{51}$

The events J1 'jack of diamonds on the first draw' and D2 'a diamond on the second draw' are dependent when there is no replacement.
The probability of choosing a diamond on the second draw given that the jack of diamonds was chosen on the first draw is called a conditional probability.
We say $\Pr(D_2 | J_1) = \frac{12}{51}$: "the probability of D2 given J1 is $\frac{12}{51}$".

Multiplication Rule
When two events, A and B, are dependent, the probability of both occurring is:
$\Pr(A \text{ and } B) = \Pr(A \cap B) = \Pr(A) \times \Pr(B|A)$

Example
Find the probability of obtaining two jacks if two cards are drawn in succession from a pack
(a) with replacement (b) without replacement
(a) If the cards are replaced then the events are independent:
$\Pr(J_1 \cap J_2) = \Pr(J_1) \times \Pr(J_2) = \frac{4}{52} \times \frac{4}{52} = \frac{1}{169}$
(b) If the cards are not replaced then the probability of the second draw depends on the first draw:
$\Pr(J_1 \cap J_2) = \Pr(J_1) \times \Pr(J_2 | J_1) = \frac{4}{52} \times \frac{3}{51} = \frac{1}{221}$

Conditional probability
The multiplication rule for dependent events can be rearranged to find a conditional probability:
$\Pr(B|A) = \frac{\Pr(A \cap B)}{\Pr(A)}$ or $\Pr(A|B) = \frac{\Pr(A \cap B)}{\Pr(B)}$

Examples
1. Find Pr(A|B) if Pr(A) = 0.7, Pr(B) = 0.5 and $\Pr(A \cup B)$ = 0.8.
We must first find $\Pr(A \cap B)$:
$\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$
$0.8 = 0.7 + 0.5 - \Pr(A \cap B)$
$\Pr(A \cap B) = 0.4$
and
$\Pr(A|B) = \frac{\Pr(A \cap B)}{\Pr(B)} = \frac{0.4}{0.5} = 0.8$
2.
In a class of 15 boys and 12 girls two students are to be randomly chosen to collect homework. What is the probability that both students chosen are boys?
$\Pr(B_1 \cap B_2) = \Pr(B_1) \times \Pr(B_2 | B_1) = \frac{15}{27} \times \frac{14}{26} = \frac{210}{702} = \frac{35}{117}$

Another way to do conditional probability problems is to reduce the sample space:
3. Given the information in the following table find the probability that someone was sunburnt given that they were not wearing a hat.

                 Sunburnt face
                 Yes    No     Total
Hat     Yes      3      77     80
        No       12     8      20
Total            15     85     100

Highlight the part of the table that satisfies the condition "not wearing a hat" (the row Hat = No). This becomes the sample space for the question.
$\Pr(S | \bar{H}) = \frac{12}{20} = \frac{3}{5}$

Exercise
1. The results of a survey of music preferences are displayed in the Venn diagram. Find the probability that a student likes rock music given that they like dance music.
[Venn diagram: music preferences. Image Source: Passy's World of Mathematics]
2. Three cards are chosen at random from a pack without replacement. What is the probability of choosing 3 aces?
3. In a maths class of 20 students 5 failed the final exam. If two students are chosen at random without replacement, what is the probability that the first passed but the second failed?
4. If Pr(X) = 0.5, Pr(Y) = 0.5 and $\Pr(X \cap Y)$ = 0.2 find
(a) Pr(X|Y) (b) $\Pr(X \cup Y)$ (c) Pr(X) × Pr(Y|X)
5. In a three child family what is the probability that all three children will be girls given that the first child is a girl? [Hint: Draw a tree diagram to find the sample space]

Answers
1. [read from the Venn diagram]
2. $\frac{4}{52} \times \frac{3}{51} \times \frac{2}{50} = \frac{1}{5525}$
3. $\frac{15}{20} \times \frac{5}{19} = \frac{15}{76}$
4. (a) 0.4 (b) 0.8 (c) 0.2
5. 0.25

BINOMIAL DISTRIBUTION

A variable may be described as having a binomial distribution when each trial has only two possible outcomes. The following are all examples of probability questions about binomial data:
What is the probability of obtaining 5 heads in 6 tosses of a coin?
What is the probability that in a randomly selected group of 30 people none of them will have a particular disease?
What is the probability that in a sample of 100 manufactured components no more than 2 will be defective?
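Questions like the first one in the list above can be checked by brute force before any formula is introduced. The sketch below is only an illustration: it assumes a fair coin (the question does not say so explicitly), lists every possible sequence of six tosses, and counts those containing exactly five heads.

```python
from fractions import Fraction
from itertools import product

# Every possible sequence of 6 tosses, e.g. ('H', 'T', 'H', 'H', 'T', 'H').
outcomes = list(product("HT", repeat=6))

# Keep only the sequences that contain exactly 5 heads.
favourable = [seq for seq in outcomes if seq.count("H") == 5]

# Assumption: the coin is fair, so all 64 sequences are equally likely and
# the probability is favourable outcomes / total outcomes.
probability = Fraction(len(favourable), len(outcomes))
print(len(favourable), len(outcomes), probability)  # 6 64 3/32
```

Listing outcomes quickly becomes impractical (30 people already give more than a billion sequences), which is why a counting formula is needed for the other two questions.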
Suppose that in a particular family the probability that a child will have red hair is ¼. If the parents have three children, and we write R for 'red hair' and R' for 'not red hair', then:
(i) the probability that all three will have red hair is
P(R and R and R) = ¼ × ¼ × ¼ = $\frac{1}{64}$
(ii) the probability that none will have red hair is
P(R' and R' and R') = ¾ × ¾ × ¾ = $\frac{27}{64}$
(iii) the probability that at least one child will have red hair is
1 - P(none with red hair) = 1 - P(R' and R' and R') = 1 - (¾ × ¾ × ¾) = 1 - $\frac{27}{64}$ = $\frac{37}{64}$
[NB: 'at least one child…' and 'no children…' are complementary events]
(iv) the probability that only the first child will have red hair is
P(R and R' and R') = ¼ × ¾ × ¾ = $\frac{9}{64}$
(v) the probability that exactly one child will have red hair is
P(R and R' and R') + P(R' and R and R') + P(R' and R' and R)
= ¼ × ¾ × ¾ + ¾ × ¼ × ¾ + ¾ × ¾ × ¼
= 3 × $\frac{9}{64}$ = $\frac{27}{64}$
[NB: The last example demonstrates that it is important to consider all the ways in which the child with red hair might be selected.]

Binomial Probability Formula
If 'n' is the number of trials (e.g. number of tosses of a coin, number of children in a family, number of items in a sample), and 'p' is the probability of the outcome of interest, then the probability of 'x' outcomes is given by the formula
$P(X = x) = {}^{n}C_{x} \times p^{x} \times (1 - p)^{n - x}$
Here ${}^{n}C_{x}$ is a calculator button that counts the number of ways the desired outcome can occur.

Example
One in every hundred items a machine produces is defective. What is the probability that in a sample of five items produced by this machine
(a) Exactly three are defective?
(b) None are defective?
(c) At least 1 is defective?

(a) n = 5, p = $\frac{1}{100}$ = 0.01, x = 3
$P(X = x) = {}^{n}C_{x}\, p^{x} (1 - p)^{n - x}$
$P(X = 3) = {}^{5}C_{3} (0.01)^{3} (1 - 0.01)^{5 - 3} \approx 0.00001$
[so if we obtained three defective items in a sample of 5 we might be suspicious of the claim that only one in a hundred is defective!]
(b) $P(X = 0) = {}^{5}C_{0} (0.01)^{0} (1 - 0.01)^{5} = 0.95$
(c) $P(X \ge 1) = 1 - P(X = 0)$ ['none defective' and 'at least one defective' are complementary events]
= 1 - 0.95
= 0.05

Probability Distribution
A list of all possible outcomes of an event and their associated probabilities is called a probability distribution. The probability distribution table for the variable X in the example is

x (no. of defectives)   0         1         2         3         4         5
P(X = x)                0.95099   0.04803   0.00097   0.00001   0.00000   0.00000

For a binomial distribution the mean and standard deviation are found using the formulae:
$\mu = E(X) = np$
$\sigma = \sqrt{np(1 - p)}$
For the previous example the expected value and standard deviation of the number of defectives in a batch of one thousand would be
$\mu = E(X) = 1000 \times 0.01 = 10$
$\sigma = \sqrt{np(1 - p)} = \sqrt{1000(0.01)(0.99)} = 3.15$
So in a batch of 1000 we would expect to get 10 defectives, and the number of defectives will deviate from this amount by an 'average' of 3.15. We would expect most batches to have between 7 and 13 defective items.

Exercise
The probability that an archer will hit a bullseye is 0.7. If he is allowed ten attempts, find the probability that he
(a) hits it every time
(b) misses each time
(c) scores at least two bullseyes

Answers: (a) 0.028 (b) 0.000006 (c) 0.99986

NORMAL DISTRIBUTION

Graphical data can display different forms:
[Figures: three histograms with different shapes]
But many things that can be measured, such as
heights of people
blood pressure
errors in measurements
scores on a test
follow a bell shaped curve.
Such data is said to be normally distributed.
[Figure: bell shaped curve. Graph from Wolfram Alpha]

Properties of a normal distribution
Symmetry about the mean
Mean = median = mode
50% of values are greater than the mean and 50% are less than the mean
68% of values fall within one standard deviation either side of the mean
95% of values fall within two standard deviations either side of the mean
99.7% of values fall within three standard deviations either side of the mean
This is sometimes known as the empirical or 68-95-99.7 rule.
NB: Even though most of the data will fall within three standard deviations of the mean there is in theory no upper or lower bound to a normal distribution. We are just less and less likely to find values beyond these points.

Example
If scores on an IQ test are normally distributed with mean = 100 and standard deviation = 10, what percentage of people would we expect to
(a) score between 90 and 110?
(b) score less than 80?
(a) Because 90 = 100 - 10 and 110 = 100 + 10 are both one standard deviation from the mean, 68% of people would be expected to score between 90 and 110.
(b) 80 = 100 - 2 × 10 is two standard deviations below the mean. We know that 95% of scores fall between 80 and 120 so 5% must fall outside this range. Half of these, 2.5%, will be below 80. Therefore we would expect 2.5% of people to have IQ scores less than 80.

Exercise
1. Scores on a general achievement test are normally distributed with a mean of 80 and a standard deviation of 15. Adam scored 95. What proportion of students had a higher score than Adam?
2. The actual weights of cereal boxes that are supposed to contain 500g are normally distributed with a mean of 510g and a standard deviation of 5g. What proportion of boxes are underfilled?
3. In a maths class the bottom 16% of students are given an F grade. If the class mean is 63 and the standard deviation is 18 what score must a student get to pass?
4.
If newborn birth weights in a certain hospital are normally distributed with a mean of 3200g and a standard deviation of 400g
(a) what percentage of babies weigh more than 3200g?
(b) what percentage of babies weigh between 2400g and 4000g?
(c) what percentage of babies weigh less than 3600g?
(d) if the 16% of babies with the lowest birth weights are placed in the special care nursery will a baby that weighs 2500g need special care?
5. 95% of people in a clinical study had systolic blood pressure readings between 116 and 144. If the blood pressure measurements follow a normal distribution what are the mean and standard deviation of the blood pressures for this group?
6. A class of ten students get the following marks in a test: 13, 23, 41, 55, 66, 78, 49, 33, 35, 67. If anyone who scores more than one standard deviation below the mean fails, how many students will fail?

Answers
1. 16%
2. 2.5%
3. 45
4. (a) 50% (b) 95% (c) 84% (d) yes
5. μ = 130, σ = 7
6. 2

STANDARD NORMAL DISTRIBUTION

The standard normal distribution (sometimes called a z-distribution) has a mean of zero (μ = 0) and a standard deviation of 1 (σ = 1).
If we are working with the standard normal distribution we are not restricted to the 68-95-99.7 rule because tables are available to enable us to find proportions or percentages or probabilities for any value in the distribution.
Tables come in different layouts, but this table gives the proportion to the left of a chosen z-value of up to 2 decimal places.
[Table: standard normal probabilities]
We can also interpret our proportions or percentages as probabilities:
Pr(z < 0) = 0.5
Pr(z < 0.03) = 0.512
Pr(z < 0.75) = 0.7734
NB: It is also possible to use a graphics calculator or a computer to find areas, proportions and probabilities in a normal distribution.

Example:
In a standard normal distribution what percentage of values will be
(a) less than 1.28?
(b) more than 1.28?
(c) between 0 and 1.28?
(d) greater than -1.28?
(e) between -1.28 and 1.28?
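Each of the five questions above asks for an area under the standard normal curve, so a computer can be used in place of the table. The sketch below is one possible approach, not the booklet's method: it builds the area to the left of z from Python's math.erf, and the helper name phi is an illustrative choice.

```python
import math


def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


z = 1.28
less_than = phi(z)             # (a) area to the left of 1.28
more_than = 1.0 - phi(z)       # (b) area to the right of 1.28
zero_to_z = phi(z) - phi(0.0)  # (c) area between 0 and 1.28
above_neg = 1.0 - phi(-z)      # (d) area to the right of -1.28
between = phi(z) - phi(-z)     # (e) area between -1.28 and 1.28

for label, area in [("a", less_than), ("b", more_than), ("c", zero_to_z),
                    ("d", above_neg), ("e", between)]:
    print(f"({label}) {100 * area:.2f}%")
```

Because the computer does not round to four decimal places at each step, its answers can differ from table-based working in the last digit.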
(a) First draw a diagram:
[Diagram: area to the left of 1.28 shaded]
We are looking for the percentage of the graph to the left of 1.28.
We can see from the table that the percentage of values less than 1.28 is 89.97%.

(b) First draw a diagram:
[Diagram: area to the right of 1.28 shaded]
We are looking for the percentage of the graph to the right of 1.28.
However we cannot read areas to the right of a z-value directly from the table. Instead we must observe that
100% of all values lie under the curve
The area to the left of the shaded region is the same as part (a)
So the percentage of values to the right of 1.28 will be 100% minus the percentage to the left of 1.28.
Therefore the percentage of values more than 1.28 is 100% - 89.97% = 10.03%

(c) First draw a diagram:
[Diagram: area between 0 and 1.28 shaded]
We are looking for the percentage of the graph between 0 and 1.28.
We cannot look up areas between two values directly from the table. But we know
from part (a) that 89.97% of values are less than 1.28
50% of values lie to the left of the mean because this is a property of our symmetrical bell curve.
Therefore the percentage of values between 0 and 1.28 is 89.97% - 50% = 39.97%

(d) First draw a diagram:
[Diagram: area to the right of -1.28 shaded]
We are looking for the percentage of the graph to the right of -1.28.
We cannot look up negative values in the table. But we know that
The bell curve is symmetrical
The area to the right of -1.28 is the same as the area to the left of 1.28
Therefore the percentage of values greater than -1.28 is 89.97%

(e) First draw a diagram:
[Diagram: area between -1.28 and 1.28 shaded]
We are looking for the percentage of the graph between -1.28 and 1.28.
To find this area we use the symmetry of the graph. Observe that the area between -1.28 and 0 is exactly the same as the area between 0 and 1.28. We also know the area between 0 and 1.28 because we found it in part (c).
Therefore the percentage of the graph between -1.28 and 1.28 is 2 × 39.97% = 79.94%

Exercise
1. In a standard normal distribution what percentage of values will be
(a) less than 1.95?
(b) less than -1.95?
(c) between -1.95 and 0?
(d) greater than 1.95?
(e) between -1.95 and 1.95?
2. In a standard normal distribution what proportion of values lie between -0.5 and 1.5?
3. In a standard normal distribution what proportion of values lie outside the interval ±1.7?
4. Given that a value in a standard normal distribution is greater than -1 what is the probability that it will be less than 2? [Hint: use the conditional probability formula]

Answers (these may vary slightly depending on whether calculators or tables are used)
1. (a) 97.44% (b) 2.56% (c) 47.44% (d) 2.56% (e) 94.88%
2. 0.6247
3. 0.0892
4. 0.9729

The proportion of the area under the curve to the left of a chosen value of z is given in the table below.
[Table: standard normal probabilities, area to the left of z]

PROBABILITY AND THE NORMAL DISTRIBUTION

Even when data follows a normal distribution, different data sets will have their own mean and standard deviation and a different bell shaped curve. But every score in a normally distributed data set, regardless of the shape, has an equivalent score in the standard normal distribution.
The mean of a normal distribution corresponds to a standardised score of 0, and we can see that a score 1 standard deviation above the mean → 1, 2 standard deviations → 2 and 3 standard deviations → 3.
[Diagram: normal curve with x = 0.8 at z = -1, x = 1.2 at z = 0 and x = 1.7 at z = ?]
For other values we can use the formula
$z = \frac{x - \mu}{\sigma}$
to find z scores.
To find the standardised score for x = 1.7:
$z = \frac{1.7 - 1.2}{0.4} = 1.25$
A score of 1.7 in the distribution with mean 1.2 and standard deviation 0.4 is equivalent to a standardised score of 1.25. Alternatively, the score 1.7 is 1.25 standard deviations above the mean for that distribution.
Once we have converted the scores of our distribution into standard scores or z-scores we can use normal distribution tables to calculate precise percentages and probabilities.
The normal distribution is a continuous distribution, so we can find the probability that x is greater than or less than a particular value, but not that x is equal to a particular value.
Because the total area under the standardised curve is 1, Pr(z < β) is equivalent to the area to the left of β.
[Diagram: area to the left of β shaded]

Examples
1.
If the mean maximum temperature for Melbourne in January is 25.9°C with a standard deviation of 2.1°C, what is the probability that the mean maximum temperature for January 2015 will be above 28°C?
First draw a diagram:
[Diagram: area to the right of x = 28 shaded]
Then standardise x = 28:
$z = \frac{x - \mu}{\sigma} = \frac{28 - 25.9}{2.1} = \frac{2.1}{2.1} = 1$
Pr(x > 28) = Pr(z > 1)
= 1 - 0.84 [from tables]
= 0.16

2. The top 0.5% of students applying for Stato University are given full scholarships. If the mean score on the entrance exam is 372 and the standard deviation is 40 what mark is needed to obtain a scholarship?
First draw a diagram:
[Diagram: area of 0.005 to the right of $x_s$ shaded]
In this question we know that $\Pr(x > x_s) = 0.005$ but must work backwards to find the cut off score that defines that area on the graph.
First we find the z-score:
$\Pr(z > z_s) = 0.005$
$z_s = 2.58$ [using the tables in reverse]
Then substituting into the formula $z = \frac{x - \mu}{\sigma}$:
$2.58 = \frac{x_s - 372}{40}$
$2.58 \times 40 = x_s - 372$
$103.2 = x_s - 372$
$x_s = 103.2 + 372 = 475.2$
Applicants who score more than 475.2 will obtain a scholarship.

Exercise
1. If a population has a mean I.Q. of 100 and a standard deviation of 15,
(a) find the probability that an individual chosen at random will have an I.Q. between 110 and 130.
(b) find the probability that an individual chosen at random will have an I.Q. greater than 87.
2. A coffee machine is regulated to deliver 200 mL per cup. In fact, the amount of coffee varies, following a normal distribution with a mean of 200 mL and a standard deviation of 10 mL.
(a) What is the probability that a cup will contain less than 195 mL?
(b) What is the probability that a cup will contain more than 220 mL?
(c) What is the probability of a cup containing between 195 and 215 mL?
3. (a) The heights of a group of men follow a normal distribution with a mean of 180 cm and a standard deviation of 6 cm. What is the probability that a man chosen from this group is less than 185 cm tall?
(b) If the tallest 10% of this group are automatically eligible for a basketball team what is the qualifying height?

Answers
1. (a) 0.2297 (b) 0.8069
2.
(a) 0.3085 (b) 0.0228 (c) 0.6247
3. (a) 0.7977 (b) 187.69 cm

SAMPLING DISTRIBUTIONS

A sampling distribution is the probability distribution for the means of all samples of size n from a given distribution.

The sampling distribution will be normally distributed with parameters $\mu_{\bar{x}}$ and $\sigma_{\bar{x}}$ if either the population from which the samples are drawn is normally distributed, or the samples are large (n ≧ 30), where

$\mu_{\bar{x}} = \mu$ and $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$ [for large samples]

NB:
⦁ the sampling distribution has the same centre as the population
⦁ the measure of variability of a sampling distribution, $\sigma_{\bar{x}}$, is called the standard error. The distribution of means is not as spread out as the values in the population from which the sample was drawn.
⦁ if we do not know the population standard deviation we approximate with the sample standard deviation: $s_{\bar{x}} \cong \sigma_{\bar{x}}$ and $\frac{s}{\sqrt{n}} \cong \frac{\sigma}{\sqrt{n}}$

Consider the little ‘population’ of values P = {1 2 3 4 5}
This population has μ = 3 and σ = 1.41

If a sample of size n = 3 was drawn from this population it could be any one of…
(1 2 3) (1 2 4) (1 2 5) (1 3 4) (1 3 5) (1 4 5) (2 3 4) (2 3 5) (2 4 5) (3 4 5)

The means of each of the samples, and a histogram of the distribution of means, are shown in the table and graph below:

Sample    Mean $\bar{x}$
1 2 3     2
1 2 4     2.33
1 2 5     2.67
1 3 4     2.67
1 3 5     3
1 4 5     3.33
2 3 4     3
2 3 5     3.33
2 4 5     3.67
3 4 5     4

Mean of the sample means: $\bar{\bar{x}} = 3$, with $\sigma_{\bar{x}} = 0.61$

The sampling distribution of the means for samples of size 3 is:

$\bar{x}$               2     2.33  2.67  3     3.33  3.67  4
P($\bar{X} = \bar{x}$)  0.1   0.1   0.2   0.2   0.2   0.1   0.1

Even though the sample size is small, and the population is not normally distributed (though it is symmetric), the sampling distribution is reasonably normally distributed:

[Histogram: probability against sample mean, for means from 2 to 4]

We can see that the mean of the sampling distribution (the mean of all the means) is the same as the population mean, $\bar{\bar{x}} = \mu = 3$. But the variability in the sampling distribution is less than that of the population: $\sigma_{\bar{x}} = 0.61$ and σ = 1.41.
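The small-population illustration above can be checked by listing every sample directly. The following is a minimal Python sketch, not part of the original handout; the variable names are illustrative assumptions. It also suggests that the quoted spread of 0.61 corresponds to the standard deviation of the ten means using an n − 1 divisor, while dividing by the number of means gives about 0.58.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev, stdev

population = [1, 2, 3, 4, 5]
n = 3  # sample size

# Every possible sample of size n, drawn without replacement.
samples = list(combinations(population, n))
sample_means = [sum(s) / n for s in samples]

# Probability of each distinct sample mean (rounded to 2 decimal places).
counts = Counter(round(m, 2) for m in sample_means)
distribution = {m: c / len(samples) for m, c in sorted(counts.items())}

mean_of_means = mean(sample_means)
population_sd = pstdev(population)      # divisor N
spread_n_minus_1 = stdev(sample_means)  # divisor (number of means - 1)
spread_n = pstdev(sample_means)         # divisor (number of means)

print(distribution)
print(round(mean_of_means, 2), round(population_sd, 2))
print(round(spread_n_minus_1, 2), round(spread_n, 2))
```

Either way of measuring the spread of the means gives a value well below the population standard deviation of 1.41, which is the point of the illustration.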
Because the means of larger samples, or of samples drawn from normally distributed populations, will follow a normal distribution, we can use the properties of normal distributions to find probabilities relating to samples:

$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$

Example
The shire of Bondara has 1200 preschoolers. The mean weight of preschoolers is known to be 18 kg with a standard deviation of 3 kg. What is the probability that a random sample of 50 preschoolers will have a mean weight more than 19 kg?

n = 50, μ = 18 and σ = 3

The sampling distribution of the means for samples of size 50 will have mean $\mu_{\bar{x}} = \mu = 18$, and standard error

$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{50}} = 0.42$

$z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}} = \frac{19 - 18}{0.42} = 2.38$

Pr($\bar{x}$ > 19) = Pr(z > 2.38) = 1 - 0.9913 [from tables] = 0.0087

Exercise
1. List all samples of size 2 for the population {1, 2, 3, 4, 5, 6}. What is the probability of obtaining a sample mean of less than 3?
2. Samples of size 40 are drawn from a population with μ = 50 and σ = 5.
(a) What are the mean and standard error of the sampling distribution?
(b) What is the probability that a particular sample has a mean less than 48.5?
3. If IQ in the general population of secondary students is known to follow a normal distribution with μ = 100 and σ = 10,
(a) find the mean and standard error for random samples of size 100.
(b) To test whether a secondary school is representative of the general population a sample of 100 students from that school is chosen. What is the probability of the mean IQ being more than 105?
(c) What would be your conclusion?

Answers
1. 4/15
2. (a) $\mu_{\bar{x}} = 50$ and $\sigma_{\bar{x}} = 0.79$ (b) 0.0288
3. (a) $\mu_{\bar{x}} = 100$ and $\sigma_{\bar{x}} = 1$ (b) approximately 0.0000003 (z = 5, which is beyond the range of most tables) (c) either the sample was not random (perhaps all the smartest students were in the sample) or this school has a higher mean IQ than the general population.

CONFIDENCE INTERVALS

We use the statistics we obtain from samples to make inferences or estimates about the population from which the sample was drawn. For example:
⦁ A batch may be selected in a factory production process to assess how the process is operating.
⦁ Surveys of consumers are used to determine the preferred brands in the population.
⦁ Polls are conducted on samples of the voting population before elections to predict the result of the election.

Together with our estimate of the population parameter it is often helpful to provide a confidence interval. After constructing a confidence interval we are able to make statements such as: “we are 95% confident that the true mean weight of boxes of cocobix cereal labelled 450g is in the interval [44 .5, 453.8]”.

For large samples (n ≧ 30) we can use the mean of a sample, $\bar{x}$, to estimate the mean of the population, μ, using the formula:

$\mu = \bar{x} \pm z\frac{\sigma}{\sqrt{n}}$ or $\mu = \bar{x} \pm z\frac{s}{\sqrt{n}}$ [when σ is not known]

The value of z is determined by the level of confidence and can be found using normal tables, a graphics calculator or an online statistics program such as Stat Trek:

For a 95% confidence interval z = 1.96
For a 99% confidence interval z = 2.575
For a 90% confidence interval z = 1.645

Example
36 of a certain type of fish were caught in Port Phillip Bay. This sample had a mean length of 30 cm and a standard deviation of 3 cm.
(a) What is the 95% confidence interval for the true mean length of this type of fish?
(b) What is the 98% confidence interval for the true mean length of this type of fish?

(a) Confidence interval for μ: $\bar{x} \pm z\frac{s}{\sqrt{n}} = 30 \pm 1.96 \times \frac{3}{\sqrt{36}} = 30 \pm 0.98$

We can state with 95% confidence that the mean of the entire population of fish will be between 29.02 cm and 30.98 cm.

(b) Confidence interval for μ: $\bar{x} \pm z\frac{s}{\sqrt{n}} = 30 \pm 2.326 \times \frac{3}{\sqrt{36}} = 30 \pm 1.163$

We can state with 98% confidence that the mean of the entire population of fish will be between 28.84 cm and 31.16 cm.

Exercises
1. In an effort to improve appointment scheduling, a doctor agreed to estimate the average time spent with each patient. A random sample of 49 patients yielded a mean of 30 minutes and a standard deviation of 7 minutes.
(a) Construct a 95% confidence interval for the true mean.
(b) Construct an 80% confidence interval for the true mean.
2.
To estimate the average weight of males in the town of Cityville a random sample of 100 men was drawn from the population of 10 000 men and their weights recorded. The mean weight was found to be 83 kg and the standard deviation 12 kg.
(a) What is the 99% confidence interval for the mean weight of the male population?
(b) In two of the suburbs of Cityville, Subtown and Tubtown, the mean weights for males were found to be 80 kg and 88 kg respectively. Comment on these results.
3. A market research company conducted a randomised survey of 50 regular smokers to find the amount spent on cigarettes per week. They found that the smokers spent on average $22 each week and the standard deviation was $4.50. Using a 95% level of confidence calculate the confidence interval for the true mean amount spent on cigarettes by regular smokers.
4. After randomly sampling 400 individuals and obtaining a sample mean of 56.5 a research company was able to claim they were 90% certain that the true mean of the population was between 56.089 and 56.911. What was the standard deviation of the sample?

Answers
1. (a) [28.04, 31.96] (b) [28.72, 31.28]
2. (a) [79.91, 86.09] (b) The mean weight for Subtown men is within the expected range but men who live in Tubtown appear to be extremely heavy compared with the general population. This may reflect lifestyle differences or a failure to select a random and representative sample.
3. [20.75, 23.25]
4. 5

HYPOTHESIS TESTING

Consider statements such as:
⦁ Teenagers aged 13-15 spend no more than 10 hours a week on Facebook.
⦁ The average weight of Australian men is the same as it was in 1990.
⦁ Students from private schools have the same mean ATAR score as the Victorian average.
⦁ The mean winter rainfall for the last 10 years is the same as the historical mean.

Our confidence about the probabilities of values drawn from normally distributed populations and sampling distributions enables us to formally test hypotheses (or claims) such as these.
When we perform an ‘experiment’ we know there will be chance variation. For example, if we toss a supposedly fair coin 100 times we would not be surprised to obtain 48 or 45 or perhaps even 40 heads. However we would be surprised to obtain only 5 heads. If we were testing a coin for ‘fairness’ we might even like to decide beforehand what we would consider a reasonable number of heads. In hypothesis testing ‘reasonable’ is defined as what we could expect 95% (or 99% or 90% etc.) of the time.

In a hypothesis test we are concerned to assess how unusual our result is: whether it is reasonable chance variation (obtaining 45 heads in 100 tosses of a coin) or whether the result is too extreme to be considered chance variation (obtaining 5 heads in 100 tosses of a coin).

The experiment may consist of drawing a sample and comparing the sample mean with the population mean for ‘reasonableness’.

A hypothesis test is a formal process with the following steps:
1. State the null and alternative hypotheses
Ho: $\bar{x} = \mu$ [the sample mean is the same as the population mean allowing for chance variation]
Ha: $\bar{x} \neq \mu$ [the sample mean is not the same as the population mean after allowing for chance variation]
2. A significance level α is chosen [with α = 0.05 we are defining reasonable as what we can expect 95% of the time]
3. Tables or a calculator or a computer are used to find the z-value that corresponds to the chosen significance level, e.g.:
[Diagram: standard normal curve with central area 0.95 between z = -1.96 and z = 1.96, and area 0.025 in each tail]
These are the thresholds for ‘reasonableness’ and are called the critical values.
4. The test statistic is the standardised difference between the sample mean (calculated from the given data) and the known population mean:
$z = \frac{\bar{x} - \mu}{s/\sqrt{n}}$ [for a large sample where σ is unknown]
5. A decision is made regarding the ‘reasonableness’ of the test statistic: “Is the test statistic more extreme than the critical value?”
Yes: Reject Ho
No: Do not reject Ho
6.
State your conclusion: There is (if you reject)/is not (if you do not reject) evidence to suggest that….

NB:
(i) The steps for hypothesis testing may differ from course to course so check with your Program.
(ii) The decision relates only to rejecting or not rejecting Ho. Ha is not mentioned in the decision, and we do not accept Ho or Ha.

Example
Because students had previously found a statistics course very difficult, the average score over many years was 48% with a standard deviation of 12%. A bridging program was introduced and the 120 students that attended achieved a mean score of 50% in the final exam. Is there evidence that the scores of those who attended the bridging program have changed at a 99% level of significance?

1. Hypotheses: Ho: μ = 48, Ha: μ ≠ 48
2. α = 0.01 [level of significance 99% = 0.99]
3. Critical values: for α = 0.01, z = -2.58 or z = 2.58
4. Test statistic: $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} = \frac{50 - 48}{12/\sqrt{120}} = 1.83$
5. Decision: Is 1.83 more extreme than 2.58? No, therefore do not reject Ho.
6. Conclusion: There is not evidence to suggest that the scores of those who attended the bridging program have changed*. It is reasonable that the apparent improvement is due to chance variation.
*[use the wording of the question]

Exercise
1. Repeat the example to decide if there is evidence at the 90% level of significance that attending the bridging program is associated with the change in scores.
2. A random sample of 36 soft drinks from vending machines had an average content of 370 ml with a standard deviation of 20 ml. Test the null hypothesis that μ = 375 ml against the alternative hypothesis μ ≠ 375 ml at the % significance level.
3. A bank manager has historical data that shows over lunchtime Mon-Fri the mean number of customers that come into the bank is 32. Accordingly he believes he has no need to change the number of tellers. However a branch survey conducted every lunchtime over eight weeks found that the mean number of customers was 36 with a standard deviation of 8.2.
Conduct a hypothesis test with a 95% level of significance to test whether the mean number of lunchtime customers has changed. What recommendation would you make to the bank manager?
4. The manufacturer of ‘longlast’ batteries claims the mean lifetime of his batteries is 450 hours. A consumer interest magazine samples 100 batteries and finds that they have a mean of 444 hours with a standard deviation of 28 hours. Do the sample data contradict the manufacturer’s claim? [use α = 0.02]

Answers
Your answers should be set out and contain all the steps shown above. A brief outline of the main features is given below:
1. Test statistic = 1.83, reject Ho: evidence of change in scores.
2. Test statistic = -1.5, do not reject Ho: difference consistent with chance variation.
3. Test statistic = 3.09, reject Ho: evidence of increase in number of lunchtime customers and therefore need more tellers.
4. Test statistic = -2.14, do not reject Ho: the difference is consistent with chance variation and there is no evidence to contradict the claim that the mean battery life is 450 hours.
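The steps of the two-tailed hypothesis test in this section can also be carried out by computer rather than with tables. The following is a minimal Python sketch under stated assumptions: the function name `two_tailed_z_test` and its parameters are illustrative and not part of the handout, and the critical value comes from the standard library's `statistics.NormalDist` instead of printed tables, so results may differ slightly from table values.

```python
from math import sqrt
from statistics import NormalDist


def two_tailed_z_test(sample_mean, pop_mean, sd, n, alpha):
    """Return (test statistic, positive critical value, reject Ho?)."""
    standard_error = sd / sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Two-tailed test: alpha is split equally between the two tails.
    critical = NormalDist().inv_cdf(1 - alpha / 2)
    reject = abs(z) > critical
    return z, critical, reject


# Bridging program example: mean 48, sd 12, n = 120, sample mean 50.
z, critical, reject = two_tailed_z_test(50, 48, 12, 120, alpha=0.01)
print(round(z, 2), round(critical, 2), reject)

# Exercise 1: the same data tested with alpha = 0.10.
z10, critical10, reject10 = two_tailed_z_test(50, 48, 12, 120, alpha=0.10)
print(round(z10, 2), round(critical10, 2), reject10)
```

The function only reports whether Ho is rejected; the conclusion still needs to be written in the wording of the question, as in step 6.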
© Copyright 2024