DATA ANALYSIS – CORE MATERIAL
Chapter 1 UNIVARIATE DATA

A  DATA

When information for a statistical investigation is collected and recorded, the information is referred to as data.

WHAT IS A STATISTICAL INVESTIGATION?

There are four processes involved in a statistical investigation:

Collection of data (information)
Data for a statistical investigation can be collected from records, from surveys (face-to-face, telephone, or postal), by direct observation, or by measuring or counting. Unless the correct data is collected, valid conclusions cannot be made.

Organisation and display of data
Data can be organised into tables and displayed on a graph. This allows us to identify features of the data more easily.

Calculation of descriptive statistics
Some statistics used to describe a set of data are the centre and the spread of the data. These give us a picture of the sample or population under investigation.

Interpretation of statistics
This process involves explaining the meaning of the table, graph or descriptive statistics in terms of the variable, or theory, being investigated.

COLLECTION OF DATA

The variable is the subject that we are investigating.
The entire group of objects from which information is required is called the population.

Gathering statistical information properly is vitally important. If it is gathered incorrectly, any resulting analysis of the data will almost certainly lead to incorrect conclusions about the population.

The gathering of statistical data may take the form of:
• a census, where information is collected from the whole population, or
• a survey, where information is collected from a much smaller group of the population, called a sample.

For example:
• The Australian Bureau of Statistics conducts a census of the whole population of Australia every five years.
• In opinion polls before an election, a survey is conducted to see which way a sample of the population will vote.
• The students in a school are to vote for a new school captain. If 20 students from the school are asked how they will vote, then the population is all the students who attend the school, and the 20 students are a sample.

When taking a sample it is hoped that the information gathered is representative of the entire population. For accurate information when sampling, it is essential that:
• the number of individuals in the sample is large enough
• the individuals involved in the survey are randomly chosen from the population. This means that every member of the population has an equal chance of being chosen.

If the individuals are not randomly chosen or the sample is too small, the data collected may be biased towards a particular outcome.

For example: If the purpose of a survey is to investigate how the population of Melbourne will vote at the next election, then surveying the residents of only one suburb would not provide information that represents all of Melbourne.

TYPES OF DATA

Data are individual observations of a variable. A variable is a quantity that can have a value recorded for it or to which we can assign an attribute or quality.

Two types of variable that we commonly deal with are categorical variables and numerical variables.

CATEGORICAL VARIABLES

A quality or category is recorded for this type of variable. The information collected is called categorical data.

Examples of categorical variables and their possible categories include:
  Colour of eyes: blue, brown, hazel, green and violet
  Continent of birth: Europe, Asia, North America, South America, Africa, Australia and Antarctica
  Gender: male or female
  Type of car: General Motors, Toyota, Ford, Mazda, BMW, Subaru, etc.

NUMERICAL VARIABLES

A number is recorded for this type of variable. The information collected is called numerical data.
There are two types of numerical variables:

Discrete numerical variables
A discrete variable can only take distinct values, and these values are often obtained by counting.
Examples of discrete numerical variables and their possible values include:
  The number of children in a family: 0, 1, 2, 3, ...
  The score on a test, out of 30 marks: 0, 1, 2, ..., 29, 30.

Continuous numerical variables
A continuous numerical variable can theoretically take any value on a part of the number line. Its value often has to be measured.
Examples of continuous numerical variables and their possible values include:
  The height of Year 9 students: any value from about 120 cm to 200 cm
  The speed of cars on a stretch of highway: any value from 0 km/h to the fastest speed that a car can travel, but most likely in the range 30 km/h to 120 km/h
  The weight of newborn babies: any value from 0 kg to 10 kg, but most likely in the range 0.5 kg to 5 kg
  The time taken to run 100 m: any value from 9 seconds to 30 seconds.

EXERCISE 1A

1 40 students, from a school with 820 students, are randomly selected to complete a survey on their school uniform. In this situation:
  a what is the population size
  b what is the size of the sample?

2 A television station is conducting a viewer telephone-into-the-station poll on the question ‘Should Australia become a republic?’
  a What is the population being surveyed in this situation?
  b How is the data biased if it is used to represent the views of all Australians?

3 A polling agency is employed to survey the voting intention of residents of a particular electorate in the next election. From the data collected they are to predict the election result in that electorate. Explain why each of the following situations would produce a biased sample.
  a A random selection of people in the local large shopping complex is surveyed between 1 pm and 3 pm on a weekday.
  b All the members of the local golf club are surveyed.
  c A random sample of people on the local train station between 7 am and 9 am are surveyed.
  d A doorknock is undertaken, surveying every voter in a particular street.

4 Classify the following data as categorical, discrete numerical or continuous numerical:
  a the quantity of soil in a particular size of potplant
  b the number of pages in a daily newspaper
  c the number of cousins a person has
  d the speed of cars on a particular stretch of highway
  e the state of Australia where a person was born
  f the maximum daily temperature in Melbourne
  g the manufacturer of a car
  h the preferred football code
  i the position taken by a player on a football field
  j the time it takes 12-year-olds to run one kilometre
  k the length of feet
  l the number of goals shot by a netballer
  m the amount spent weekly, by an individual, at the supermarket.

5 A sample of public trees in a municipality was surveyed for the following data:
  a the diameter of the tree (in centimetres) measured 1 metre above the ground
  b the type of tree
  c the location of the tree (nature strip, park, reserve, roundabout)
  d the height of the tree, in metres
  e the time (in months) since the last inspection
  f the number of inspections since planting
  g the condition of the tree (very good, good, fair, unsatisfactory).
  Classify the data collected as categorical, discrete numerical or continuous numerical.

B  ORGANISING AND DISPLAYING DATA

CATEGORICAL DATA

Tally and frequency tables are used to organise categorical data, and there are several types of graphs that can be used to effectively display the data.

For example: A centrally-located school is investigating how their students get to school. This is of interest to them because of local traffic problems. A sample of 50 students was asked which of the following five categories they used most.
The results were:

  B B C W Tn    Tn Tn Tm C C    W C C B C    C W B B Tn    Tm C B W Tn
  W W Tn Tn C   Tm Tn C C Tm    B B B B W    C C B W C     Tn B C B B

  (Tn = train, Tm = tram, B = bus, W = walk, C = private car)

The variable ‘mode of transport to school’ is a categorical variable.

We can organise the data using a tally and frequency table. One stroke for each data value is recorded in the tally column. A crossed group, written here as ||||/, represents a tally of five.

  Mode of transport   Tally                  Frequency
  Train               ||||/ ||||                  9
  Tram                ||||                        4
  Bus                 ||||/ ||||/ ||||           14
  Walk                ||||/ |||                   8
  Private car         ||||/ ||||/ ||||/          15
  Total                                          50

From the frequency table we can see:
• The most favoured ‘mode of transport’ in the sample was ‘Private car’.
• 9 + 4 + 14 = 27 of the 50 students came by public transport (train, tram, or bus).
• Only 8 of the 50 students (16%) walked to school.

GRAPHS TO DISPLAY CATEGORICAL DATA

1 A barchart (or column graph) is usually drawn with the categories along the horizontal axis and the frequency on the vertical axis. Each bar (or column) is drawn with height equal to the frequency of its category. The ‘bars’ are equally spaced (not joined together) and are of the same width.
  Below is a barchart for the example.
  Note: A barchart can also be drawn with horizontal bars.

  [Figure: two barcharts titled ‘Mode of transport to school’, one with vertical bars (frequency on the vertical axis, categories train, tram, bus, walk, private car) and one with horizontal bars.]

2 A segmented barchart is a single ‘bar’ divided into segments so that the length of each segment is proportional to the frequency.
  A percentaged segmented barchart can also be produced. The percentage for each category is calculated using

  $\frac{\text{frequency of category}}{\text{total}} \times 100\%$.

  For example, for the traffic data shown previously:
  The category with the highest frequency of 15 was Private car. So, $\frac{15}{50} \times 100\% = 30\%$ of the students came by private car.
  27 students came by public transport.
  So the percentage who came by public transport was $\frac{27}{50} \times 100\% = 54\%$.

  Following is a segmented barchart and a percentaged segmented barchart for the above example. The segments can be labelled, or shaded with a legend included.

  [Figure: a segmented barchart (frequency scale 0 to 50) and a percentaged segmented barchart (scale 0% to 100%), each with segments for train, tram, bus, walk and private car.]

EXERCISE 1B.1

1 55 randomly selected year eight students were asked to nominate their favourite subject studied at school. The results of the survey are displayed in the barchart alongside.

  [Barchart: subject (English, Mathematics, Science, Language, History, Geography, Music, Art) against frequency, scale 0 to 10.]

  a Which subject was the most favoured?
  b How many students chose Art as their favourite subject?
  c What percentage of the students nominated Mathematics as their favourite subject?
  d What percentage of the students chose either Music or Art as their favourite subject?

2 A randomly selected sample of adults was asked to nominate the evening television news service that they watched. The results alongside were obtained:

  News service   Frequency
  ABC                40
  Channel 7          45
  Channel 9          64
  Channel 10         25
  SBS                23
  None                3

  a Construct a barchart for this data.
  b Use the table and graph to answer the following questions about the data.
    i How many adults were surveyed?
    ii Which news service is the most popular?
    iii What percentage of those surveyed watched the most popular news service?
    iv What percentage of those surveyed watched the news service on Channel 7?

3 Construct a percentaged segmented barchart for the following categorical data, shading the categories and including a legend.

  Expenditure item   Weekly household expenditure ($)
  Food                    60
  Clothing                30
  Rent                   120
  Travel                  15
  Utilities               30
  Entertainment           45

DISCRETE NUMERICAL DATA

A discrete numerical variable can take only distinct values. The data is often obtained by counting.
For example, a farmer has a crop of peas and wishes to investigate the number of peas in the pods. He takes a random sample of 50 pods and counts the number of peas in each pod, obtaining the following data:

  6 6 5 4 9 8 7 7 7 6 5 6 7 8 8 8 7 5 2 4 7 7 6 7 8
  8 7 8 6 6 4 2 9 1 3 3 5 9 8 8 7 7 6 7 7 6 8 4 5 5

The variable in this situation is the discrete numerical variable ‘the number of peas in a pod’. The data could only take the discrete numerical values 0, 1, 2, 3, 4, ....

TABLES AND GRAPHS

To organise his data the farmer could use the tally and frequency table shown (a crossed group, written here as ||||/, represents a tally of five).

  No. peas in pod   Tally                 Frequency
  1                 |                          1
  2                 ||                         2
  3                 ||                         2
  4                 ||||                       4
  5                 ||||/ |                    6
  6                 ||||/ ||||                 9
  7                 ||||/ ||||/ |||           13
  8                 ||||/ ||||/               10
  9                 |||                        3
  Total                                       50

A barchart could be used to display the results.

  [Figure: barchart of frequency (scale 0 to 14) against number of peas in pod (0 to 9).]

Alternatively, the farmer could use a dot plot, which is a convenient method of tallying the data and at the same time displaying the frequencies.

To draw a dot plot:
1 Draw a horizontal axis and mark it with the values that the variable can take. For this example, the variable took values from 1 to 9, so we mark the axis from 0 to 10.
2 Label the axis with a description, in this case: number of peas in pod.
3 Systematically go through the data, placing a dot or cross above the appropriate position on the axis.

The dot plot for this example is:

  [Figure: dot plot on an axis from 0 to 10, labelled ‘number of peas in pod’.]

Notice that the dots are evenly spaced so the final plot looks similar to the barchart.

From both the barchart and the dot plot it can be seen that:
• Seven was the most frequently occurring number of peas in a pod.
• $\frac{35}{50} \times 100\% = 70\%$ of the pods yielded six or more peas.
• 10% of the pods had fewer than 4 peas in them.

DESCRIBING THE DISTRIBUTION OF A SET OF DATA

The distribution of a set of data is the pattern or shape of its graph.
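As a rough illustration of how the shape of the pea-pod sample emerges from its frequencies, the following Python sketch tallies the 50 values and builds a text dot plot with one row per value. The helper name `dot_plot_rows` and the use of `*` for each dot are choices made here for illustration only; the rows run sideways rather than upwards, as they would on a drawn dot plot.

```python
from collections import Counter

# The 50 pea counts from the farmer's sample, one digit per pod.
PEAS = [int(d) for d in "6654987776567888752477678" "8786642913359887767768455"]


def dot_plot_rows(data):
    """Return one text row per value from min to max, e.g. '7: *****'.

    Hypothetical helper: each '*' stands for one dot of the dot plot.
    """
    freq = Counter(data)
    return [f"{v}: {'*' * freq[v]}" for v in range(min(data), max(data) + 1)]


if __name__ == "__main__":
    for row in dot_plot_rows(PEAS):
        print(row)
```

Printed this way, the longest row is at 7; the rows shrink gradually towards 1 but fall away quickly after 8.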
For the example above, the graph has the general shape shown alongside:

  [Figure: a curve with its peak on the right and a tail stretched to the left.]

This distribution of the data is said to be negatively skewed because it is stretched to the left (the negative direction).

A positively skewed distribution of data would have a shape that is stretched to the right:

  [Figure: a curve with its peak on the left and a tail stretched to the right.]

A symmetrical distribution of data is neither positively nor negatively skewed, but is symmetrical about a central value.

A set of data whose graph has two peaks is said to be bimodal.

Note that the horizontal axis is a number line with numbers in ascending order from left to right.

Outliers are data values that are either much larger or much smaller than the general body of data. Outliers appear separated from the body of data on a frequency graph.

For the example, if the farmer found one pod in his sample contained 13 peas then the data value 13 would be considered an outlier. It is much larger than the other data in the sample. On the column graph it appears separated.

  [Figure: column graph of frequency (scale 0 to 12) against number of peas in pod (0 to 13), with the column at 13 labelled ‘outlier’.]

EXERCISE 1B.2

1 A randomly selected sample of households has been asked, “How many people live in your household?” A column graph has been constructed for the results.

  [Column graph: ‘Size of households’, frequency (scale 0 to 8) against number of people in the household (1 to 10).]

  a How many households were surveyed?
  b How many households had only one or two occupants?
  c What percentage of the households had five or more occupants?
  d Describe the distribution of the data.

2 a Construct a barchart for the discrete numerical data alongside.
  b Comment on the distribution of the data (positively or negatively skewed or symmetric).

  Number of toothpicks   33   34   35   36   37   38   39
  Frequency               1    5    7   13   12    8    2

3 A bowler has recorded the number of wickets he has taken in each of the last 30 innings he has played:

  1 1 3 2 0 0 4 2 2 4 3 1 0 1 0 2 1 5 1 3 7 2 2 2 4 3 1 1 0 3

  a Construct a dot plot for the raw data.
  b Comment on the distribution of the data, noting any outliers.

4 For an investigation into the number of phonecalls made by teenagers, a sample of 50 fifteen-year-olds were asked the question, “How many phonecalls did you make yesterday?” The following dot plot was constructed for the data.

  [Dot plot: ‘The number of phone calls made in a day by a sample of 50 fifteen year olds’, axis from 0 to 11, labelled ‘number of phone calls’.]

  a What is the variable in this investigation?
  b Explain why the data is discrete numerical data.
  c What percentage of the fifteen-year-olds did not make any phonecalls?
  d What percentage of the fifteen-year-olds made 5 or more phonecalls?
  e Copy and complete: “The most frequent number of phonecalls made was .........”.
  f Describe the distribution of the data.
  g How would you describe the data value ‘11’?

5 The number of matches in a box is stated as 50, but the actual number of matches has been found to vary. To investigate this, the number of matches in a box is counted for a sample of 60 boxes:

  51 50 50 51 52 49 50 48 51 50 47 50 52 48 50
  49 51 50 50 52 52 51 50 50 52 50 53 48 50 51
  50 50 49 48 51 49 52 50 49 50 50 52 50 51 49
  52 52 50 49 50 49 51 50 50 51 50 53 48 49 49

  a What is the variable in this investigation?
  b Is the data continuous or discrete numerical data?
  c Construct a dot plot for this data.
  d Describe the distribution of the data.
  e What percentage of the boxes contained exactly 50 matches?

CONTINUOUS NUMERICAL DATA

The height of 14-year-old children is being investigated.
The variable ‘height of 14-year-old children’ is a continuous numerical variable because the values recorded for the variable could, theoretically, be any value on the number line. They are most likely to fall between 120 and 190 centimetres.

The heights of thirty children are measured in centimetres. The measurements are rounded to one decimal place, and the values are recorded below:

  163.0 154.2 152.8 160.5 148.3 149.2 154.7 172.7 171.3 162.5
  165.0 160.2 166.2 175.3 143.4 174.6 180.9 162.4 167.3 158.4
  159.4 164.5 163.7 183.8 150.8 163.4 181.9 158.3 165.0 156.8

Note that these rounded values are actually discrete. However, when we tally them, we use continuous class intervals as follows:

The smallest height is 143.4 cm and the largest is 183.8 cm, so we will use the class intervals 140 up to 150 (this does not include 150), 150 up to 160, 160 up to 170, 170 up to 180, and 180 up to 190. Note that we choose class intervals of the same width.

These class intervals are written as 140 - , 150 - , 160 - , etc. in the frequency table. The final class interval is written as 180 - < 190, which means 180 cm up to a height that is less than 190 cm.

A tally-frequency table for this example is (a crossed group, written here as ||||/, represents a tally of five):

  Height (cm)    Tally                Frequency
  140 -          |||                       3
  150 -          ||||/ |||                 8
  160 -          ||||/ ||||/ ||           12
  170 -          ||||                      4
  180 - < 190    |||                       3
  Total                                   30

A histogram is used to display continuous numerical data. This is similar to a barchart but, because of the continuous nature of the variable, the ‘bars’ are joined together. The frequency is represented by the height of the ‘bars’.

A histogram for this example is shown opposite:

  [Histogram: ‘Heights of a sample of fourteen-year-old children’, frequency (scale 0 to 12) against height (cm) from 140 to 190.]

Note: The two oblique lines that cross the horizontal axis indicate that the numbers on this axis are not starting at zero. This can also be shown using an alternative axis-break symbol.
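The tallying into class intervals described above can be sketched in Python as follows. This is a minimal illustration, not a prescribed method: the function name `class_frequencies` and the `width` parameter are assumptions introduced here. Each height is assigned to the interval that starts at the multiple of 10 at or below it, so the upper end of every interval is excluded, matching ‘140 up to 150 (this does not include 150)’.

```python
from collections import Counter

# Heights (cm) of the thirty children, as recorded in the text.
HEIGHTS = [
    163.0, 154.2, 152.8, 160.5, 148.3, 149.2, 154.7, 172.7, 171.3, 162.5,
    165.0, 160.2, 166.2, 175.3, 143.4, 174.6, 180.9, 162.4, 167.3, 158.4,
    159.4, 164.5, 163.7, 183.8, 150.8, 163.4, 181.9, 158.3, 165.0, 156.8,
]


def class_frequencies(values, width=10):
    """Count values in intervals [lower, lower + width); upper end excluded.

    Hypothetical helper: 'lower' is the multiple of 'width' at or below
    the value, so 150.0 would fall in the class 150 - , not 140 - .
    """
    counts = Counter(int(v // width) * width for v in values)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    for lower, freq in class_frequencies(HEIGHTS).items():
        print(f"{lower} - < {lower + 10}: {freq}")
```

The resulting counts agree with the Frequency column of the tally-frequency table.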
A relative frequency table and histogram can also be drawn:

  Height (cm)    Frequency   Relative %
  140 -              3       $\frac{3}{30} \times 100\% = 10\%$
  150 -              8       26.7%
  160 -             12       40%
  170 -              4       13.3%
  180 - < 190        3       10%
  Total             30       100%

  [Histogram: relative frequency % (scale 0 to 40) against height (cm) from 140 to 190.]

From the tables and graphs we can see:
• More children had a height in the class interval 160 up to 170 cm than any other class interval. This class interval is called the modal class. $\frac{12}{30} \times 100\% = 40\%$ of the children had a height in this class.
• Three of the children ($\frac{3}{30} \times 100\% = 10\%$) had a height less than 150 cm.
• Three of the children (10%) were 180 cm or more tall.
• The distribution of heights was approximately symmetrical.

EXERCISE 1B.3

1 Construct a histogram for the following continuous numerical data.

  Time to complete 100 m swim (secs)   50 -   55 -   60 -   65 -   70 -   75 - < 80
  Number of swimmers                     3      6     16     11      2       2

2 The speed of vehicles travelling along a section of highway has been recorded and displayed using the histogram alongside.

  [Histogram: number of vehicles (scale 0 to 200) against speed (km/h) from 50 to 130.]

  a How many vehicles were included in this survey?
  b What percentage of the vehicles were travelling at speeds equal to or greater than 100 km/h?
  c What percentage of the vehicles were travelling at a speed from 100 up to 110 km/h?
  d What percentage of the vehicles were travelling at a speed less than 80 km/h?
  e If the owners of the vehicles travelling at 110 km/h or more were fined $165 each, what amount would be collected in fines?

3 The daily maximum temperature (°C) to the nearest degree, in Melbourne, for each day in January 2001, is recorded below:

  34 38 31 38 23 24 25 26 29 35 41 23 32 36 22 21
  24 26 35 36 25 32 27 30 34 30 27 25 26 23 25

  a Using class intervals of 5 degrees, construct a tally and frequency table for the data.
  b Construct a histogram to display the data.
  c Describe the distribution of Melbourne’s daily maximum temperatures in January 2001.

4 The height of each member of a basketball squad has been measured and the results are displayed using the frequency table alongside.

  Height (cm)    Frequency
  165 -              1
  170 -              3
  175 -              5
  180 -             12
  185 -              7
  190 -              5
  195 -              2
  200 - < 205        1

  a Calculate the relative frequencies and construct a relative frequency histogram for the data.
  b Comment on the distribution of the heights.
  c Find the percentage of members of the squad whose height is
    i greater than 180 cm
    ii less than 170 cm
    iii between 175 and 190 cm.

C  STEM-AND-LEAF PLOTS (STEMPLOTS)

Constructing a stem-and-leaf plot, commonly called a stemplot, is often a convenient method to organise and display a set of numerical data. A stemplot groups the data and shows the relative frequencies but has the added advantage of retaining the actual data values.

CONSTRUCTING A STEMPLOT

Data values such as 25 36 38 49 23 46 47 15 28 38 34 are all two digit numbers, so the first digit will be the ‘stem’ and the last digit the ‘leaf’ for each of the numbers.

The stems will be 1, 2, 3, 4 to allow for numbers from 10 to 49. The stemplot for the data is shown alongside.

  Stem | Leaf
    1  | 5
    2  | 3 5 8
    3  | 4 6 8 8
    4  | 6 7 9          2 | 3 means 23

Notice that:
• 1 | 5 represents 15
• 2 | 3 5 8 represents 23, 25 and 28
• the data in the leaves is evenly spaced with no commas
• the leaves are placed in increasing order, so this stemplot is ordered
• the scale (sometimes called the key) tells us the place value of each leaf. If the scale was 2 | 3 means 2.3, then 4 | 6 7 9 would represent 4.6, 4.7 and 4.9.

For data values such as 195 199 207 183 201 ...... the first two digits are the stem and the last digit is the leaf.

Example 1

The score, out of 50, on a test was recorded for 36 students.
a Organise the data using a stemplot.
b Comment on the distribution of the data.

  25 36 38 49 23 46 47 15 28 38 34  9
  30 24 27 27 42 16 28 31 24 46 25 31
  37 35 32 39 43 40 50 47 29 36 35 33

a Recording the data from the list gives an unordered stemplot:

  Stem | Leaf
    0  | 9
    1  | 5 6
    2  | 5 3 8 4 7 7 8 4 5 9
    3  | 6 8 8 4 0 1 1 7 5 2 9 6 5 3
    4  | 9 6 7 2 6 3 0 7
    5  | 0

  Ordering the data from smallest to largest for each stem gives an ordered stemplot:

  Stem | Leaf
    0  | 9
    1  | 5 6
    2  | 3 4 4 5 5 7 7 8 8 9
    3  | 0 1 1 2 3 4 5 5 6 6 7 8 8 9
    4  | 0 2 3 6 6 7 7 9
    5  | 0              2 | 4 means 24 marks

b The shape of the distribution can be seen when the stemplot is rotated:

  [Figure: the ordered stemplot rotated so that the leaves rise in columns above the stems 0 to 5.]

  The data is slightly negatively skewed.

  We also observe these important features:
  • The minimum (smallest) test score is 9.
  • The maximum (largest) test score is 50.
  • The modal class is 30 - 39.

SPLIT STEMS

Consider the following example:
The residue that results when a cigarette is smoked collects in the filter. This residue has been weighed for twenty cigarettes, giving the following data, in milligrams.

  1.62 1.55 1.59 1.56 1.56 1.55 1.63 1.59 1.56 1.69
  1.61 1.57 1.56 1.55 1.62 1.61 1.52 1.58 1.63 1.58

Scanning the data reveals that there will be only two ‘stems’, i.e., 15 and 16. In cases like this we will need to split the stems.

If we use the stem 15 to represent data with values 1.50 to 1.54 and 15* to represent data with values 1.55 to 1.59 etc., we can construct a stemplot with four stems:

  Stem | Leaf
   15  | 2
   15* | 5 5 5 6 6 6 6 7 8 8 9 9
   16  | 1 1 2 2 3 3
   16* | 9              15 | 2 means 1.52

If we split the stems five ways, where 15₀ represents data with values 1.50 and 1.51, 15₂ represents data with values 1.52 and 1.53 etc., the stemplot has ten stems, from 15₀ to 16₈.

The stemplot with the stems split five ways clearly gives a better view of the distribution of the data. The value 1.69 appears as an outlier in this graph. The stemplot with the stems split two ways was not sensitive enough to show this.
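The two-way split can be sketched in Python as below. This is only an illustration under stated assumptions: the weights are entered in hundredths of a milligram (1.62 mg becomes 162) so that stems and leaves can be found with whole-number arithmetic, and the function name `split_stemplot` is hypothetical. Leaves 0 to 4 go on the plain stem and leaves 5 to 9 on the starred stem.

```python
# Residue weights in hundredths of a milligram (1.62 mg -> 162).
# Using integers is an assumption made here to avoid rounding problems.
RESIDUE = [
    162, 155, 159, 156, 156, 155, 163, 159, 156, 169,
    161, 157, 156, 155, 162, 161, 152, 158, 163, 158,
]


def split_stemplot(values):
    """Two-way split stemplot: returns {stem label: ordered leaves}.

    Hypothetical helper: the last digit is the leaf, the remaining
    digits are the stem; leaves 5-9 are placed on the 'stem*' row.
    """
    plot = {}
    for v in sorted(values):  # sorting gives ordered stems and leaves
        stem, leaf = divmod(v, 10)
        label = f"{stem}*" if leaf >= 5 else str(stem)
        plot.setdefault(label, []).append(leaf)
    return plot


if __name__ == "__main__":
    for label, leaves in split_stemplot(RESIDUE).items():
        print(f"{label:>3} | {' '.join(str(leaf) for leaf in leaves)}")
    print("15 | 2 means 1.52")
```

A stem that receives no leaves is simply absent from the result, so a printed plot built this way would need the empty stems added by hand if gaps are to be shown.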
Stemplot of the residue data with the stems split five ways:

  Stem | Leaf
   15₀ |
   15₂ | 2
   15₄ | 5 5 5
   15₆ | 6 6 6 6 7
   15₈ | 8 8 9 9
   16₀ | 1 1
   16₂ | 2 2 3 3
   16₄ |
   16₆ |
   16₈ | 9

EXERCISE 1C

1 A school has conducted a survey of 60 of their students to investigate the time it takes for students to travel to school. The following data gives the travel time to the nearest minute.

  12 15 16  8 10 17 25 34 42 18 24 18 45 33 38
  45 40  3 20 12 10 10 27 16 37 45 15 16 26 32
  35  8 14 18 15 27 19 32  6 12 14 20 10 16 14
  28 31 21 25  8 32 46 14 15 20 18  8 10 25 22

  a Is travel time a discrete or continuous variable?
  b Construct a stemplot for the data using stems 0, 1, 2, ....
  c Describe the distribution of the data.
  d Copy and complete: “Most students spent between ...... and ...... minutes travelling to school.”

2 The weight of 900 g loaves of bread varies slightly from loaf to loaf. A manufacturer of bread is concerned that he may be producing too many underweight loaves of bread in his 900 gram range. He weighs a sample of sixty 900 g loaves and records their weight to the nearest gram. Construct a stemplot for the following data and comment on the distribution of the data.

  901 907 898 893 904 904 895 894 913 892 904 903 924 888 908
  900 921 905 913 906 893 907 924 910 894 901 927 928 895 915
  885 901 878 901 898 896 885 909 903 886 896 917 903 897 910
  889 913 899 901 891 916 908 903 894 931 904 907 894 882 889

3 A taxi driver has recorded the fares, to the nearest dollar, of 60 passengers that he has collected from Melbourne airport:

  25 32 35 16 39 18 19 25 16 41 40 43 16 13  9
  48 42 20 20 22 23 33 35 24 23 14 34 37 36 36
  44 51 22 48 55 13 16 20 26 30 12 30 33 35 41
  17 22 54 24 20 21 35 42 43 54 28 38 37 46 25

  a Construct a stemplot with stems 0, 1, 2, 3, ...... Comment on the distribution of the data.
  b Construct a stemplot with two-way split stems. Comment on the feature of the distribution that is revealed by this split-stem stemplot.
4 The time spent (minutes) by 20 people in a queue at a bank, waiting to be attended by a teller, has been recorded:

  3.4 2.1 3.8 2.2 4.5 1.4 0   0   1.6 4.8
  1.5 1.9 0   3.6 5.2 2.7 3.0 0.8 3.8 5.2

  Construct a stemplot for this data (include a legend). Comment on the distribution of the data.

D  SAMPLE SUMMARY STATISTICS: MEASURES OF CENTRE

MEASURES OF CENTRE

A picture of a data set can be obtained if we have an indication of the centre of the data and the spread of the data.

Three statistics that provide a measure of the centre of a set of data are:
• the mean
• the median
• the mode.

THE MEAN

The mean $\bar{x}$ is the statistical name for ‘average’. The mean is calculated by adding all the data values $x$, then dividing this sum by the number of data values $n$.

  $\text{mean} = \frac{\text{sum of the data values}}{\text{number of data values}}$, denoted $\bar{x} = \frac{\sum x}{n}$

Note: The Greek letter sigma, $\sum$, means ‘the sum of’.

• The mean involves all the data values.
• If you are told that the mean mark for a test is 65% then there will be some marks higher than 65% and some marks lower than 65%.
• The mean does not have to be one of the data values.
  For example: The mean number of children per family is 1.8 in Melbourne. It is obvious that a family cannot have 1.8 children, but this statistic tells us that most families have either 1 or 2 children, with more families having 2 children.

Example 2

Find the mean of the following data: 5 5 7 3 8 2 3 4 6 5 7 6 4

There are 13 data values in this set, so n = 13.

  $\text{Mean} = \frac{5+5+7+3+8+2+3+4+6+5+7+6+4}{13} = \frac{65}{13} = 5$

Example 3

Megan has had three Maths tests and her mean (average) mark is 78.
a What is the total of Megan’s marks for the three tests?
b She scores 82 marks for her next test. What is the mean mark for the four tests?
c How many marks did she need to score for the fourth test so that her overall mean mark would increase to 80?

a The total number of marks for the three tests is 78 × 3 = 234.
b The average of her marks for the four tests is $\frac{234 + 82}{4} = 79$.
c To get an average mark of 80 in four tests, Megan needed to score a total of 4 × 80 = 320 marks. Hence she needed to score 320 − 234 = 86 marks on the fourth test to bring her overall mean mark to 80.

THE MEDIAN

The median is the middle value of an ordered set of data.

An ordered set of data is the data listed from smallest to largest value (or largest to smallest).

The median splits the data set into two halves: half of the data have values less than or equal to the median and half have values greater than or equal to the median.

For example, if the median mark for a test is 65%, then half the marks scored are greater than or equal to 65% and half the marks scored are lower than or equal to 65%.

To find the median:
1 Order the data by rearranging the values from smallest to largest.
2 Locate the middle of the data values.
  • If there is an odd number of data then the median will be one of the data values. The median is the $\frac{n+1}{2}$th value in a data set of n values.
  • If there is an even number of data then the median is the average of the two middle values and may not be equal to any of the data values.

Example 4

Find the median for the following data sets:
a 5 5 7 3 8 2 3 4 6 5 7 6 4
b 3 5 5 5 5 6 6 6 7 7 7 8 8 8 9 10

a The data set is ordered (arranged from smallest to largest).

  2 3 3 4 4 5 [5] 5 6 6 7 7 8

  The median is the $\frac{13+1}{2}$ = 7th value (bracketed).
  The median is 5.

b There are 16 data values so the median is the average of the 8th and 9th values (bracketed).

  3 5 5 5 5 6 6 [6] [7] 7 7 8 8 8 9 10

  The median is $\frac{6+7}{2} = 6.5$. (Note: This is not one of the data values.)

THE MODE

The mode is the most frequently occurring value in the data set.

This statistic can usually be found easily from a frequency table, barchart or dot plot.
If there are two modes in a data set then the data can be described as bimodal. If there are more than two modes then it is said that “the mode is not distinct” and the mode is not useful as a descriptive statistic.

For continuous data, the class interval with the highest frequency is the modal class.

Example 5

Find the mode for the following data: 5 5 7 3 8 2 3 4 6 5 7 6 4

The mode is the most frequently occurring value. There are three 5s and the most we have of any other number is two. So, the mode is 5.

EXERCISE 1D.1

1 Find the i mean ii median iii mode for each of the following data sets:
  a 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9
  b 10, 12, 12, 15, 15, 16, 16, 17, 18, 18, 18, 18, 19, 20, 21
  c 22.4, 24.6, 21.8, 26.4, 24.9, 25.0, 23.5, 26.1, 25.3, 29.5, 23.5
  d 127, 123, 115, 105, 145, 133, 142, 115, 135, 148, 129, 127, 103, 130, 146, 140, 125, 124, 119, 128, 141.

2 Consider the following two data sets:
  Data set A: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 10
  Data set B: 3, 4, 4, 5, 6, 6, 7, 7, 7, 8, 8, 9, 15
  a Find the mean for both Data set A and Data set B.
  b Find the median of both Data set A and Data set B.
  c Explain why the mean of Data set A is less than the mean of Data set B.
  d Explain why the median of Data set A is the same as the median of Data set B.

3 A cricketer has scored an average of 25.4 runs in his last 10 innings. He scores 58 and 16 runs in his next two innings. What is his new batting average?

4 On the first five days of his holiday David drove an average of 256 kilometres per day and on the next three days he drove an average of 172 kilometres per day.
  a What is the total distance that David drove in the first five days?
  b What is the total distance that David drove in the next three days?
  c What is the mean distance travelled per day over the eight days?

5 A basketball team scored 43, 55, 41 and 37 goals in their first four matches.
  a What is the mean number of goals scored for the first four matches?
b What score will the team need to shoot in the next match so that they maintain the same mean score?
c The team shoots only 25 goals in the fifth match. What is the mean number of goals scored for the five matches?
d The team shoots 41 goals in their sixth and final match. Will this increase or decrease their previous mean score? What is the mean score for all six matches?

COMPARING MEASURES OF CENTRE

Consider the data set 5 5 7 3 8 2 3 4 6 5 7 6 4 used in Examples 4a and 5. For this data set the mean, median and mode all had the same value, 5, and this fact indicates that the distribution of data in this set is symmetrical. A dot plot of the data confirms this:

[Dot plot: data values 2 to 8 on the horizontal axis, with the mean, median and mode all marked at 5]

When the distribution of data is not symmetrical the measures of centre can have different values.

CALCULATING MEASURES OF CENTRE FROM A FREQUENCY TABLE

When the same data appear several times we often summarise the data in table form. Consider the data of the given table. We can find the measures of the centre directly from the table.

Data value    Frequency    Data value × frequency
3             1            3 × 1 = 3
4             1            4 × 1 = 4
5             3            5 × 3 = 15
6             7            6 × 7 = 42
7             15           7 × 15 = 105
8             8            8 × 8 = 64
9             5            9 × 5 = 45
Total         40           278

The mode
The mode is 7. There are 15 of data value 7 which is more than any other data value.

The mean
There are 40 data in this set, made up of one 3, one 4, three 5s, seven 6s and so on. The data in an ordered list would look like
3 4 5 5 5 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 ......
To add these numbers we could say 3 × 1 + 4 × 1 + 5 × 3 + 6 × 7 + 7 × 15 + ...... so it is not necessary to write out all the data values. Adding a ‘Data value × frequency’ column to the table helps to add all the scores. For example, there are 15 data of value 7 and these add to 7 × 15 = 105.

Since the total of the 40 data values is 278, the mean = 278/40 = 6.95.
The median
Since there are 40 data in this set, if the data is written out in order from smallest to largest then the median will be the average of the two middle values, i.e., the 20th and 21st values. The median can be found by counting down the frequency table.

Data value    Frequency    Accumulated frequency
3             1            1     one number is 3
4             1            2     two numbers are 4 or less
5             3            5     five numbers are 5 or less
6             7            12    12 numbers are 6 or less
7             15           27    27 numbers are 7 or less
8             8
9             5
Total         40

In the table, the accumulated values show that the 20th and 21st data values (in order) must both be 7s; the median = (7 + 7)/2 = 7.

Example 6
Find the mean, median and mode for the data given in the following frequency table.

Data value    2    4    5    6    7    8    9    Total
Frequency     1    1    2    3    4    9    6    26

Adding a Data value × Frequency column, we get:

Data value    Freq    Data value × Freq
2             1       2 × 1 = 2
4             1       4 × 1 = 4
5             2       5 × 2 = 10
6             3       6 × 3 = 18
7             4       7 × 4 = 28
8             9       8 × 9 = 72
9             6       9 × 6 = 54
Total         26      188

The mean = 188/26 ≈ 7.23

There are 26 data in this set, so the median will be the average of the 13th and 14th values.

Data value    Freq    Position in the ordered list
2             1       1st value
4             1       2nd value
5             2       3rd and 4th values
6             3       5th, 6th and 7th values
7             4       8th, 9th, 10th and 11th values
8             9       12th, 13th, 14th, 15th to 20th values
9             6       21st to 26th values
Total         26

The 13th and 14th values are both 8 so their average is 8. The median is 8.

8 is the data value with the highest frequency of 9, so the mode is 8.

Which measure of centre is the most suitable to use?

In Example 6, the mean (7.23) is less than the median (8) and mode (8). A dot plot shows the distribution of the data:

[Dot plot: data values 2 to 9 on the horizontal axis; the values 2 and 4 are marked as outliers, the mean is marked at 7.23 and the median and mode at 8]

The data is negatively skewed and the data values 2 and 4 are much smaller than most of the data values.
The mean depends on the actual values of the data so it has been ‘dragged’ towards these outliers. If the data value ‘2’ was replaced by a ‘7’ then the overall total would increase by 5 and hence the mean would increase.

The median is not influenced by extreme values because it depends on the position of data rather than their value. If the data value ‘2’ was replaced by a ‘7’ then the median would not change; the middle values would remain the same.

In cases where there are outliers in one direction so the distribution is skewed, the most suitable measure of centre to use is the median or the mode. In this case the mode has the same value as the median and would be a suitable measure of centre for the data. However, because the mode does not take all the data values into account, in some situations it is not representative of a data set.

For example, the data set 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 9, 9 has a mode of 2 and this is not representative of the data set. A more suitable measure of centre for this data set would be the median 4 or the mean 4.5.

MEASURES OF CENTRE FROM A STEMPLOT

Example 7
Find the mean, median and mode from the ordered stemplot shown.

Stem | Leaf
1    | 6 7 8 8
2    | 2 3 3 4 4 6 7 8 9
3    | 1 2 5 8
4    | 0 4 6
5    | 1

The mean is found by dividing the sum of all the data values by the number of data. We must make sure that the ‘stem’ is included with the ‘leaf’.

Mean = (16 + 17 + 18 + 18 + 22 + 23 + 23 + ...... + 51)/21 = 29.14

The median is the middle value, the 11th value in this ordered data set. Counting the leaves from the beginning gives a median of 27.

The mode is the most frequently occurring value; there are two 18s, two 23s and two 24s in this set of data. We can say that the mode is not distinct in this case and is not useful as a measure of centre.

Note: The mean of 29.14 is larger than the median of 27, indicating that the distribution is positively skewed. This can be seen from the stemplot.
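The calculation in Example 7 can be reproduced with a minimal Python sketch. The dictionary layout used for the stemplot is an assumption made for illustration; the key point is that each data value is rebuilt as 10 × stem + leaf before the mean and median are taken.

```python
from statistics import mean, median

# Ordered stemplot from Example 7: each stem maps to its list of leaves
stemplot = {
    1: [6, 7, 8, 8],
    2: [2, 3, 3, 4, 4, 6, 7, 8, 9],
    3: [1, 2, 5, 8],
    4: [0, 4, 6],
    5: [1],
}

# Combine the 'stem' with the 'leaf' to recover the data values
values = [10 * stem + leaf for stem, leaves in stemplot.items() for leaf in leaves]

n = len(values)
stem_mean = mean(values)
stem_median = median(values)

# A mean larger than the median is consistent with positive skew
print(n, round(stem_mean, 2), stem_median)
```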
USING A CALCULATOR TO FIND THE MEAN AND MEDIAN

Consider the data 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9.

The data is entered into the calculator under the STAT menu. Choose 1:Edit and press ENTER. Use List 1 (L1), and after checking that the cursor is in the first position of List 1 we can type the first data value. This value will appear at the bottom of the screen as L1(1)=2. Press ENTER and ‘2’ appears in the list. Continue in a similar way through the list of data, pressing ENTER after each data entry to move the cursor to the next position.

To find the descriptive statistics for the data:
STAT then the right arrow to CALC will get you into the menus for finding descriptive statistics. We are dealing with only one variable so we choose 1:1-Var Stats and press ENTER. 1-Var Stats appears on the home screen. We need to tell the calculator which list our data is entered in, so type 2nd 1 ENTER.

All the available descriptive statistics for this variable appear on the screen:
• The first statistic, x̄, is the mean. The mean of the data is 4.867 (to 3 decimal places).
• The second statistic, Σx = 73, means that the sum of all the data values is 73.
• The next three statistics we will consider in Section 1E.
• ‘n=15’ indicates that there are 15 data values in the set.

The arrow beside n=15 means that there are other entries for this screen. Scroll down using the down arrow key. Med=5 means the median is 5. The other statistics on this part of the screen give the statistics of the five-number summary which is also covered in Section 1E.
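For readers without a graphics calculator, the same figures can be checked with a small Python sketch. The function name one_var_stats is hypothetical; it only imitates a few entries of the calculator's 1-Var Stats screen and is not a reproduction of the calculator's own routine.

```python
from statistics import mean, median

def one_var_stats(data):
    """Return some of the statistics shown on a 1-Var Stats screen (illustrative only)."""
    return {
        "mean": mean(data),      # x-bar
        "sum": sum(data),        # sigma x
        "n": len(data),          # number of data values
        "min": min(data),
        "median": median(data),  # Med
        "max": max(data),
    }

data = [2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9]
stats = one_var_stats(data)

print(round(stats["mean"], 3), stats["sum"], stats["n"], stats["median"])
# 4.867 73 15 5
```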
EXERCISE 1D.2

1 Find the mean, median and mode for each of the following data sets given as frequency tables:

a
Data value    Frequency
1             2
2             5
3             8
4             6
5             4

b
Number of rooms    Frequency
2                  1
3                  4
4                  12
5                  15
6                  2
7                  4
8                  2

2 The test scores, out of 30 marks, for a class of twenty-two students are:
15, 16, 18, 23, 22, 28, 29, 25, 25, 24, 27, 18, 11, 20, 23, 26, 26, 30, 25, 18, 15, 17
a Find the i mean ii median iii mode for the data.
b Explain why the mean is not the most suitable measure of centre for this set of data.
c Explain why the mode is not the most suitable measure of centre for this set of data.

3 a Find the i mean ii median iii mode for the data displayed in the following stem-and-leaf plot:

Stem | Leaf
5    | 3 5 6
6    | 0 1 2 4 6 7 9
7    | 3 3 6 8
8    | 4 7
9    | 1

b Which measure of centre would be the best representative for this set of data?

4 The following data is the daily rainfall (to the nearest millimetre) for the month of October 2000 in Melbourne:
3, 1, 0, 0, 0, 0, 0, 2, 0, 0, 3, 0, 0, 0, 7, 1, 1, 0, 3, 8, 0, 0, 0, 32, 38, 3, 0, 3, 1, 0, 0
a Find the i mean ii median iii mode for this data.
b Explain why the median is not the most suitable measure of centre for this data.
c Explain why the mode is not the most suitable measure of centre for this data.

5 The frequency table below records the number of phonecalls made in a day by 50 fifteen-year-olds.

Number of phonecalls    0    1    2    3    4    5    6    7    8    9    10    11
Frequency               5    8    13   8    6    3    3    2    1    0    0     1

a Find the i mean ii median iii mode for this data.
b Construct a barchart for the data and show the position of the measures of centre (mean, median and mode) on the horizontal axis.
c Describe the distribution of the data.
d Why is the mean larger than the median for this data?
e Which measure of centre would be the most suitable for this data set?
6 Which one of the following will always be true for the mean, median and mode of a set of discrete numerical data, assuming a distinct mode exists?
A The mean always equals one of the data values in the set.
B The median always equals one of the data values in the set.
C The mode always equals one of the data values in the set.
D The median is distorted by extreme values.
E In a positively skewed set of data, the median will be greater than the mean.

E SAMPLE SUMMARY STATISTICS: MEASURES OF SPREAD

MEASURES OF SPREAD

Three commonly used statistics that indicate the spread of a set of data are:
• the range
• the interquartile range
• the standard deviation.

THE RANGE AND INTERQUARTILE RANGE

The range is the difference between the maximum (largest) data value and the minimum (smallest) data value.

Range = maximum data value − minimum data value

Example 8
Find the range for the data set: 5 5 7 3 8 2 3 4 6 5 7 6 4.

Scanning the data we can see that the minimum is 2 and the maximum is 8. Hence the range is 8 − 2 = 6.

Now the median divides an ordered data set into two halves. These halves are divided in half again by the quartiles. The median is denoted Q2.

The middle value of the lower half is called the lower quartile, denoted Q1. One quarter (25%) of the data have values less than or equal to the lower quartile. Three quarters (75%) of the data have values greater than or equal to the lower quartile.

The middle value of the upper half is called the upper quartile, denoted Q3. One quarter (25%) of the data have values greater than or equal to the upper quartile. Three quarters (75%) of the data have values less than or equal to the upper quartile.

The interquartile range (IQR) is the spread of the middle half (50%) of the data.
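The definitions of the range and the quartiles can be sketched in Python as follows. The function name quartiles is hypothetical, and the code assumes the convention described in this section: the quartiles are the middle values of the lower and upper halves, and for an odd number of data the median itself belongs to neither half. Other software may use a different quartile convention and give slightly different answers.

```python
from statistics import median

def quartiles(data):
    """Return (Q1, Q2, Q3) using the 'middle value of each half' convention."""
    ordered = sorted(data)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For odd n the median value is left out of both halves
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

# Data set from Example 8
data = [5, 5, 7, 3, 8, 2, 3, 4, 6, 5, 7, 6, 4]

data_range = max(data) - min(data)   # maximum minus minimum
q1, q2, q3 = quartiles(data)
iqr = q3 - q1                        # spread of the middle half of the data

print(data_range, q1, q2, q3, iqr)
```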
Interquartile range (IQR) = upper quartile − lower quartile = Q3 − Q1

Example 9
For the data set 5 5 7 3 8 2 3 4 6 5 7 6 4 find the:
a median
b lower quartile
c upper quartile
d interquartile range

The ordered data set is 2 3 3 4 4 5 5 5 6 6 7 7 8

a There are 13 data values so the median is the 7th value, 5. There is an odd number of data and the median is one of the values so it divides the data into two halves of six values each.

Lower half (6 values): 2 3 3 4 4 5
Median: 5
Upper half (6 values): 5 6 6 7 7 8

Note: For an odd number of data the median data value is not included in the lower or upper half for the calculation of the quartiles.

b The middle value of the lower half is the average of the 3rd and 4th values.
Lower quartile = (3 + 4)/2 = 3.5

c Similarly, the middle value of the upper half is the average of the 10th and 11th values:
Upper quartile = (6 + 7)/2 = 6.5

d Interquartile range = upper quartile − lower quartile = 6.5 − 3.5 = 3
So, the middle half of the data has a spread of 3.

A summary for the set of data in Example 9 is:

Ordered data: 2 3 3 4 4 5 5 5 6 6 7 7 8
Range = 8 − 2 = 6
Lower quartile = 3.5
Median = 5
Upper quartile = 6.5
Interquartile range = 3

The data has a spread of 6 (range = 6), centred around the value 5 (median = 5). The middle half of the data has a spread of 3 (interquartile range = 3).

Example 10
Find the range and the interquartile range and describe the distribution of the data:
8, 4, 3, 9, 6, 5, 5, 10, 3, 6, 7, 9, 11, 14, 9, 8, 7, 12

The ordered data set (there are 18 data values) is:
3, 3, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 9, 10, 11, 12, 14

The range = 14 − 3 = 11

The median will be the average of the 9th and 10th values:
Median = (7 + 8)/2 = 7.5

The median divides the data set into two sets of 9 values:
Lower half (9 values): 3, 3, 4, 5, 5, 6, 6, 7, 7
Median: 7.5
Upper half (9 values): 8, 8, 9, 9, 9, 10, 11, 12, 14

The lower quartile is the middle value of the lower half, 5, and the upper quartile is the middle value of the upper half, 9.

The interquartile range = 9 − 5 = 4

The data is centred at 7.5 (median) and has a spread of 11 (range). The middle half of the data has a spread of 4 (interquartile range).

USING THE CALCULATOR TO FIND THE RANGE AND INTERQUARTILE RANGE

Key the data into a list. The data does not have to be ordered.
Enter STAT, move right to CALC and choose 1:1-Var Stats, then press ENTER.
Press 2nd 1 ENTER to select the list L1.

The screens show all the statistics for the data. Use the down arrow key to scroll down and reveal the lower part of the screen.

The range is maxX − minX = 14 − 3 = 11
The IQR = Q3 − Q1 = 9 − 5 = 4

MEASURES OF SPREAD FROM A STEMPLOT

The median, range, and interquartile range can be found easily from an ordered stemplot.

Example 11
The number of cars travelling along a particular road were counted for 21 days and the data was recorded in this ordered stemplot. Find the median, range and interquartile range for this data.

Stem | Leaf
1    | 6 7 8 8
2    | 2 4 7 8 9
3    | 0 2 3 3 4 5 6 8
4    | 0 4 6
5    | 1

The data is ordered so we can read from the smallest value to the largest value. Combining the ‘stem’ with the ‘leaf’, we get: 16, 17, 18, 18, 22, 24, 27, ......, 40, 44, 46, 51.

The minimum is 16 and the maximum is 51, so the range = 51 − 16 = 35.

The median is the middle value (the 11th data value in a list of 21) and counting from the beginning, the median = 32.

The median divides the data into two groups of 10 data values. The average of the middle values of these groups gives the lower and upper quartiles.
Lower quartile = (22 + 24)/2 = 23
Upper quartile = (36 + 38)/2 = 37
Interquartile range = Upper quartile − Lower quartile = 37 − 23 = 14

EXERCISE 1E.1

1 For each of the following data sets, find:
i the median (make sure the data is ordered)
ii the upper and lower quartiles
iii the range
iv the interquartile range.
a 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 8, 9, 9
b 10, 12, 15, 12, 24, 18, 19, 18, 18, 15, 16, 20, 21, 17, 18, 16, 22, 14
c 21.8, 22.4, 23.5, 23.5, 24.6, 24.9, 25, 25.3, 26.1, 26.4, 29.5
d 127, 123, 115, 105, 145, 133, 142, 115, 135, 148, 129, 127, 103, 130, 146, 140, 125, 124, 119, 128, 141

2 For the data given in the following ordered stem-and-leaf plot, find the:
a median
b upper quartile
c lower quartile
d range
e interquartile range

Stem | Leaf
0    | 3 4 7 9
1    | 0 3 4 6 7 8
2    | 0 0 3 5 6 9 9 9
3    | 1 3 7 8
4    | 2

3 The time spent (in minutes) by 20 people in a queue at a bank has been recorded:
3.4, 2.1, 3.8, 2.2, 4.5, 1.4, 0, 0, 1.6, 4.8, 1.5, 1.9, 0, 3.6, 5.2, 2.7, 3.0, 0.8, 3.8, 5.2
a Find the median waiting time and the upper and lower quartiles.
b Find the range and interquartile range of the waiting times.
c Copy and complete the following statements:
i “50% of the waiting times were greater than ...... minutes.”
ii “75% of the waiting times were less than or equal to ...... minutes.”
iii “The minimum waiting time was ...... minutes and the maximum waiting time was ...... minutes. The waiting times were spread over ...... minutes.”

4 The following data gives the number of novels counted in 30 households.
a Find the median number of novels per household and the upper and lower quartiles of the data.
b Copy and complete the following statements:
i “Half of the households have more than ...... novels.”
ii “75% of the households have at least ......
novels.”

Stem 2 3 4 5 6 7
Leaf 025 013 224 001 25 2 5 5 7 2 899 6689 789 6

c Find the i range ii interquartile range for the number of novels per household.
d Describe the distribution of the data using the statistics found.

5 The height (to the nearest centimetre) of 20 ten year olds is recorded in the following stemplot.

Stem | Leaf
10   | 9
11   | 1 3 4 4 8 9
12   | 2 2 4 4 6 8 9 9
13   | 1 2 5 8 8

a Find the i median height ii upper and lower quartiles of the data.
b Copy and complete the following statements:
i “Half of the children are less than or ...... cm tall.”
ii “75% of the children are less than ...... cm tall.”
iii “The middle 50% of the children have heights spread over ...... cm.”

THE VARIANCE AND STANDARD DEVIATION

Now the range and IQR both only use two values in their calculation. It is sometimes better to use a measure of spread that includes all of the data values in its calculation. One such statistic is the variance, which measures the average of the squared deviations of each data value from the mean.

The deviation of a data value x from the mean x̄ is given by x − x̄.

For a sample, i.e., when we have surveyed a portion of the population:
• the variance is s² = Σ(x − x̄)² / (n − 1), where n is the sample size
• the standard deviation s is the square root of the variance, s = √( Σ(x − x̄)² / (n − 1) ).

Note: The variance and standard deviation for a whole population have slightly different formulae. However, we do not use these in this course.

Example 12
Use the formula to find the variance and the standard deviation of the sample data: 3, 4, 4, 8, 7, 6, 10

The mean, x̄, of the data is (3 + 4 + 4 + 8 + 7 + 6 + 10)/7 = 42/7 = 6

Using a table for the calculations:

x        x − x̄    (x − x̄)²
3        −3        9
4        −2        4
4        −2        4
8        2         4
7        1         1
6        0         0
10       4         16
Total              38

variance s² = Σ(x − x̄)² / (n − 1) = 38/6 = 6.3333....

standard deviation s = √variance = √(38/6) = 2.5166 (4 d.p.)
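The table method of Example 12 can be followed step by step in a short Python sketch. The variable names are chosen for illustration; the last lines compare the hand calculation with the standard-library functions statistics.variance and statistics.stdev, which also divide by n − 1.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Sample data from Example 12
data = [3, 4, 4, 8, 7, 6, 10]

x_bar = mean(data)                                   # mean of the data
deviations = [x - x_bar for x in data]               # x - x_bar column
squared = [d ** 2 for d in deviations]               # (x - x_bar)^2 column

sum_squared = sum(squared)                           # total of the last column
s_squared = sum_squared / (len(data) - 1)            # sample variance
s = sqrt(s_squared)                                  # sample standard deviation

print(x_bar, sum_squared, round(s_squared, 4), round(s, 4))

# Cross-check against the library's sample variance and standard deviation
assert abs(s_squared - variance(data)) < 1e-9
assert abs(s - stdev(data)) < 1e-9
```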
Using a table to calculate the standard deviation is an interesting exercise, but you will normally use your calculator to find this statistic.

USING THE CALCULATOR TO FIND THE STANDARD DEVIATION

Press STAT and choose 1:Edit. Key data into list L1.
Press STAT then the right arrow to choose CALC, then choose 1:1-Var Stats.
Press 2nd 1 ENTER to choose list L1.

On the results screen, Sx is the sample standard deviation.

Note: The variance is not given on the screen, but it can be found by squaring the standard deviation.

STANDARD DEVIATION FOR GROUPED DATA

Example 13
The frequency table below shows data collected from a random sample of 50 households in a particular suburb, investigating the number of people in the household. Use the calculator to find the standard deviation of the number of people in a household for this sample.

Number of people in the household    1    2    3     4     5    6
Frequency                            5    8    13    14    7    3

Press STAT and choose 1:Edit. Key the variable values into L1 and the frequency values into L2.
Press STAT then the right arrow to choose CALC, then choose 1:1-Var Stats.
Enter L1, L2 by pressing 2nd 1 , 2nd 2 ENTER.

The sample standard deviation is 1.3536....

Note: If you do not include L2 then you will still get a screen of statistics, but they will be for L1 only.

GIVING MEANING TO THE STANDARD DEVIATION

Many data sets have frequency distributions that are ‘bell-shaped’ and symmetrical about the mean.

For example, the histogram alongside exhibits this typical ‘bell-shape’. The data represents the heights of a group of adult women and has a mean of 165 and a standard deviation of 8. The data is centred about the mean and spreads from 140 to 190. However, most of the data have values between 155 and 170 and not many have values more than 180 or less than 150.

[Histogram: height (cm) from 140 to 190 on the horizontal axis, frequency from 0 to 25 on the vertical axis]
The Normal distribution is an important bell-shaped distribution. For the Normal distribution it can be shown that:
• 68% of the data will have values within one standard deviation of the mean.
• 95% of the data will have values within two standard deviations of the mean.
• 99.7% of the data will have values within three standard deviations of the mean.

Graphically this can be summarised:

[Figure: three bell curves centred on the mean x̄, showing 68% of data between x̄ − s and x̄ + s, 95% of data between x̄ − 2s and x̄ + 2s, and 99.7% of data between x̄ − 3s and x̄ + 3s]

If we model the bell-shaped data above using the Normal distribution:
• 68% of the heights will have values between 165 − 8 = 157 and 165 + 8 = 173, i.e., between 157 and 173 cm. 68% of the data values will be in the interval [x̄ − s, x̄ + s].
• 95% of the heights will have values between 165 − 2 × 8 = 149 and 165 + 2 × 8 = 181, i.e., between 149 and 181 cm. 95% of the data values will be in the interval [x̄ − 2s, x̄ + 2s].
• 99.7% of the heights will have values between 165 − 3 × 8 = 141 and 165 + 3 × 8 = 189, i.e., between 141 and 189 cm. 99.7% of the data values will be in the interval [x̄ − 3s, x̄ + 3s].

Example 14
A set of data has a Normal distribution with a mean x̄ = 30 and a standard deviation of s = 7. What percentage of the data is:
a greater than 30
b between 23 and 37
c more than 37
d between 16 and 44
e more than 44
f between 37 and 44?

a The distribution of data is symmetrical about the mean, so 50% of the data have a value greater than 30.

b Now x̄ − s = 30 − 7 = 23 and x̄ + s = 30 + 7 = 37.
68% of the data are between 23 and 37.

c Since 68% of scores are between 23 and 37, 32% are outside this interval. The distribution of scores is symmetrical, so 16% are greater than 37.

d Now x̄ + 2s = 30 + 14 = 44 and x̄ − 2s = 30 − 14 = 16.
95% of the data fall between these two values.
e Since 95% of the data are between 16 and 44, 5% are outside this interval. The distribution is symmetric so 2.5% of the data are greater than 44.

f From c, we know that 16% of the data are greater than 37, and from e, we know 2.5% of the data are greater than 44.
16% − 2.5% = 13.5% of the data lie between 37 and 44.

Example 15
The contents of a sample of two hundred ‘800 gram packets’ of muesli were weighed and the weights were found to have a bell-shaped distribution with a mean of 800 grams and a standard deviation of 8 grams. How many of the packets in the sample would be expected to have a weight of more than 792 grams?

We model the bell-shaped distribution using the Normal distribution.
Now 792 = 800 − 8. So, 792 g is one standard deviation less than the mean.
Since 68% of the weights are within one standard deviation of the mean, 32% are outside this range.
Since the distribution is symmetric, 32/2 = 16% of the weights are lower than 792 g.
So 68% + 16% = 84% of the weights are above 792 g.
84% of 200 = (84/100) × 200 = 168.
So 168 of the 200 packets in the sample would be expected to have a weight greater than 792 grams.

EXERCISE 1E.2

1 a Use the formula to find the standard deviation of the following set of data:
3 3 4 4 5 6 6 7 8 8 9 9
b Check your answer to a using your calculator.

2 Use your calculator to find the standard deviation and variance of the following data:
25.6, 32.8, 24.7, 36.0, 32.1, 30.9, 34.4, 27.5

3 Find the standard deviation of the data given in the frequency table below.
Number of cars owned by the business    0    1    2    3    4     5     6     7    8    9    10    11
Frequency                               3    4    6    9    12    10    10    8    5    2    1     0

4 The following data are the heights, to the nearest centimetre, of the thirty footballers that belong to an AFL club.
192 185 189 183 189 191 190 192 198 187
191 194 198 181 189 191 190 187 189 194
198 191 187 196 181 193 187 196 192 178
a Find the i mean, x̄ ii standard deviation, s of the height of the footballers in this club.
b i Calculate the interval [x̄ − s, x̄ + s].
ii What percentage of the heights would be expected to fall in this interval?
iii What percentage of the actual heights fall in this interval?
c What percentage of the actual heights fall in the interval [x̄ − 2s, x̄ + 2s]? What percentage would you expect to fall in this interval?

5 The distribution of weights of 600 g loaves of bread is bell-shaped with a mean weight of 605 g and a standard deviation of 8 g. What percentage of the loaves can be expected to have a weight between 597 g and 613 g? (Use the Normal distribution as a model.)

6 [1997 FM CAT 2 Q4] The distribution of the weight of ice-cream served in a single scoop of Danish Delight is known to be bell-shaped with a mean of 104 grams and a standard deviation of 2 grams. The percentage of single scoops of Danish Delight containing less than 100 grams will be closest to:
A 0%
B 2.5%
C 5%
D 16%
E 95%

7 The diameters of washers produced by a machine have a bell-shaped distribution with a mean diameter of 10 mm and a standard deviation of 0.3 mm. Using the Normal distribution as a model, find the percentage of the washers that would have a diameter:
a between 9.7 mm and 10.3 mm
b greater than 10 mm
c greater than 10.6 mm
d between 9.4 and 9.7 mm
e greater than 9.7 mm.

8 The distribution of exam scores for 780 students who sat an exam is Normal with a mean of 55 and a standard deviation of 15.
a Find the number of students who would be expected to obtain a score:
i greater than 70
ii less than 55
iii less than 25
iv between 70 and 85
b If the pass mark for the exam was 40, then how many students are expected to pass the exam?

9 The distribution of times taken to swim 50 metres by a group of 16 year-olds is bell-shaped with a mean of 38 seconds and a standard deviation of 3 seconds. The slowest 16% of the students would be expected to have a swim-time:
A greater than 32 seconds but less than 35 seconds
B less than 32 seconds
C greater than 35 seconds but less than 38 seconds
D greater than 35 seconds
E greater than 41 seconds.

STANDARD SCORES (z-SCORES)

The relative significance of a particular data value can be considered in terms of the number of standard deviations that it differs from the mean. This is called the standard score or z-score of the data value, and the process of finding the standard score is called standardisation. Non-standardised data are often referred to as raw scores.

CALCULATING STANDARD SCORES

Standard score (z-score) = (raw score − mean) / standard deviation

Example 16
The mean percentage on a mathematics exam is 60 and the standard deviation is 13.
a Find the standard scores for students who, on the exam, scored:
i 82% ii 45% iii 73%
b Find the raw score of a student whose standardised score was 0.61.

a Using the formula for standard score:
i standard score = (82 − 60)/13 = 1.69 (2 dec. pl.)
ii standard score = (45 − 60)/13 = −1.15 (2 dec. pl.)
iii standard score = (73 − 60)/13 = 1

b z-score = (raw score − mean) / standard deviation
0.61 = (raw score − 60)/13
0.61 × 13 = raw score − 60
7.93 + 60 = raw score
raw score = 67.93
So, the student’s raw score would have been 68%.

Consider the following example: The bell-shaped distribution alongside has mean 35 and standard deviation 10.
[Figure: bell-shaped distribution of the raw scores x, marking the central 68% and 95% of the data]

The distribution of the standardised data is shown alongside.

[Figure: the same bell-shaped distribution with the horizontal axis rescaled to −2, −1, 0, 1, 2, marking the central 68% and 95% of the data]

Notice that:
• the shape of the distribution is unchanged
• the values on the x-axis have been scaled so that:
  - the 68% of the data within one standard deviation of the mean have z-scores between −1 and 1
  - the 95% of the data within two standard deviations of the mean have z-scores between −2 and 2
• a standard score of 0 represents a raw score of the same value as the mean
• a positive standard score represents a raw score that is greater than the mean
• a negative standard score represents a raw score that is less than the mean.

These facts are always true when we standardise a bell-shaped distribution.

Example 17
Find the percentage of scores that come from a Normal distribution that will have a z-score:
a greater than 0
b between −2 and 2
c between 1 and 2
d less than −2
e more than 3.

a A z-score of 0 corresponds to a raw score of the mean and 50% of the data will have a value greater than the mean.

b 95% of the data will have a z-score between −2 and 2.

c (95 − 68)/2 = 13.5% of raw scores will have a z-score between 1 and 2.

d If 95% of the raw scores have a z-score between −2 and 2 then 2.5% (1/2 of 5%) will have a z-score less than −2.

e If 99.7% of the raw scores have a z-score between −3 and 3 then (100 − 99.7)/2 = 0.3/2 = 0.15% will have a z-score more than 3.

COMPARING RAW SCORES FROM DIFFERENT DATA SETS

Since standard scores:
• keep the relative value of raw scores within a data set
• scale the x-axis of distributions in terms of their standard deviations,
standard scores are useful for comparing scores from different data sets.
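The standard score formula and its rearrangement can be written as two small Python functions. The names z_score and raw_score are hypothetical; the numbers are those of Example 16 (mean 60, standard deviation 13).

```python
def z_score(raw, mean, sd):
    """Standard score: number of standard deviations the raw score is from the mean."""
    return (raw - mean) / sd

def raw_score(z, mean, sd):
    """Rearranged formula: recover the raw score from a standard score."""
    return z * sd + mean

exam_mean, exam_sd = 60, 13

z_82 = z_score(82, exam_mean, exam_sd)      # positive: above the mean
z_45 = z_score(45, exam_mean, exam_sd)      # negative: below the mean
z_73 = z_score(73, exam_mean, exam_sd)      # exactly one standard deviation above
raw = raw_score(0.61, exam_mean, exam_sd)   # raw score for a z-score of 0.61

print(round(z_82, 2), round(z_45, 2), z_73, round(raw, 2))
```

Because both functions take the mean and standard deviation as arguments, the same code can be reused to compare scores from two different data sets, each standardised with its own mean and standard deviation.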
Example 18
Archie scored 62% on his Mathematics exam. This exam had a mean of 57 and a standard deviation of 5. In his English exam Archie scored 75% and this exam had a mean of 70 and a standard deviation of 6. In which subject was his relative performance better?

In Maths: standard score = (62 − 57)/5 = 5/5 = 1
In English: standard score = (75 − 70)/6 = 5/6 = 0.83

Since Archie’s standard score for Maths was greater than his standard score for English, his Maths result was further to the right in the distribution of the scores of the class. Archie’s relative performance was better in Maths.

EXERCISE 1E.3

1 Find the standard scores for the following raw scores that come from a set of data that has a mean of 6.4 and a standard deviation of 2.
a 10
b 5.2
c 12
d 6.5

2 A raw score from a data set has a z-score of −0.85. If the data set has a mean of 50 and a standard deviation of 5.6, find the value of the raw score.

3 A raw score of 72 has a z-score of 1.25. If the standard deviation from the data set is 8, find the mean of the data.

4 A raw score of 20 has a z-score of −1.6. If the mean of the data set is 28, find the standard deviation.

5 Peter has had four Mathematics tests for the year and his results and the class averages and standard deviations are given in the table below.

Test    Peter’s mark    Class average    Standard deviation
1       58              60               7
2       72              65               12
3       68              60               10
4       78              72               9

a Calculate Peter’s standard score for each test.
b In which test did Peter perform best?

6 The semester English exam results for four students are given in the table below. The mean was 60 for both exams and the standard deviation was 15 for the Semester 1 exam and 8 for the Semester 2 exam.

Student    Semester 1    Semester 2
David      70            65
Rodney     54            58
Gavan      92            75
Daniel     75            70

a Which of the students improved their performance from Semester 1 to Semester 2?
b Which student improved the most?
c Which student’s performance was the most consistent for the year?

7 For a set of data that has a bell-shaped distribution, find the percentage of raw scores that have a z-score:
a less than 0
b between −1 and 1
c greater than 2
d between −1 and 0
e between −2 and −1
f between 0 and 3

THE BOXPLOT (BOX-AND-WHISKER PLOT) F

A boxplot is a visual display of some of the descriptive statistics of a set of data, namely its minimum and maximum values, the median and the upper and lower quartiles. These five statistics form what is called the five-number summary of the data set.

CONSTRUCTING A BOXPLOT

A boxplot (box-and-whisker plot) is constructed above a number line (labelled and scaled) which is drawn so that it covers all the data values in the data set.

The boxplot is drawn with a rectangular ‘box’ representing the middle half of the data. The ‘box’ goes from the lower quartile to the upper quartile. The ‘whiskers’ extend from the ‘box’ to the maximum value and to the minimum value. A vertical line marks the position of the median in the ‘box’.

For example, for the data set 2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9:

The ordered data set is 1, 2, 3, 3, 3, 4, 5, 5, 5, 5, 6, 7, 7, 8, 9 (15 data values). There are 7 values on either side of the median (Q2); the lower quartile (Q1) is the median of the lower 7 values and the upper quartile (Q3) is the median of the upper 7 values.

The minimum is 1.
The maximum is 9.
The median is the 8th value, 5.
The lower quartile is the 4th value, 3.
The upper quartile is the 12th value, 7.

These 5 statistics form the five-number summary.

[Figure: boxplot on a number line from 1 to 9, labelling the minimum, lower quartile, median, upper quartile, maximum and the two whiskers]

Using the graphics calculator to find descriptive statistics and construct a boxplot

Press STAT and choose 1:Edit. Enter the data from the example above into L1:
2, 3, 5, 4, 3, 6, 5, 7, 3, 8, 1, 7, 5, 5, 9

Statistical graphs are drawn using STAT PLOT, which is located above the Y= key. Press 2nd Y= to use it. Press ENTER to use Plot 1.
Turn the plot On by pressing ENTER, then use the arrow keys to choose the boxplot icon and press ENTER. Press ZOOM 9 to draw the boxplot.

TRACE can be used to locate the statistics of the five-number summary. The arrow keys move backwards and forwards between them.

[Calculator screen: boxplot with the trace cursor on the median]

INTERPRETING A BOXPLOT

A set of data with a symmetric distribution will have a symmetric boxplot. For example:

[Figure: histogram of a symmetric distribution on an x-axis from 10 to 20, with its boxplot on the same scale]

The whiskers of the boxplot are the same length and the median line is in the centre of the box.

A set of data which is positively skewed will have a positively skewed boxplot. For example:

[Figure: histogram of a positively skewed distribution on an x-axis from 1 to 8, with its boxplot on the same scale]

The right whisker is longer than the left whisker and the median line is towards the left-hand end of the box.

A set of data which is negatively skewed will have a boxplot that appears stretched to the left. For example:

[Figure: negatively skewed distribution on an x-axis from 1 to 9, with its boxplot on the same scale]

The left whisker is longer than the right whisker and the median line is towards the right-hand end of the box.

Example 19
A boxplot has been drawn to show the distribution of marks (out of 100) in a test for a particular class:

[Figure: boxplot on a ‘score on test’ axis from 0 to 100]

a What was the highest mark scored for this test?
b What was the median test score for the class?
c What is the range of marks scored for this test?
d What percentage of students scored 60 or more for the test?
e What was the lowest mark scored?
f What is the interquartile range for this test?
g The top 25% of students scored a mark between ...... and ......
h If you scored 70 on this test, would you be in the top 50% of students in the class?
i Comment on the symmetry of the distribution of marks.
a The highest score corresponds to the end of the upper whisker, so the highest mark scored was 98.
b The median corresponds to the vertical line inside the box, which is at 73.
c The range = maximum score − minimum score = 98 − 30 = 68
d The score of 60 corresponds to the lower quartile. 25% of the students have a score less than or equal to the lower quartile, so 75% scored 60 or more.
e The lowest score corresponds to the end of the lower whisker, so the lowest score was 30.
f The interquartile range = upper quartile − lower quartile = 82 − 60 = 22
g The top 25% of scores correspond to the upper whisker. So, the top 25% of students scored a mark between 82 and 98.
h The top 50% of students had a mark greater than or equal to the median of 73. You would not be in the top 50% of students if you scored 70 for the test.
i The distribution of test scores is stretched to the left, and is therefore negatively skewed. The lower whisker is longer than the upper whisker and the median is not in the centre of the box but further towards the upper end. The distribution is therefore not symmetrical.

[Figure: the boxplot of test scores on an axis from 0 to 100, annotated ‘stretched to the left’]

TESTING FOR OUTLIERS

Outliers are extraordinary data that are either much larger or much smaller than the main body of the data.

There are several tests that identify outliers. One commonly used test involves the following calculation of ‘boundaries’:

The upper boundary = upper quartile + 1.5 × IQR. Any data value larger than this is an outlier.
The lower boundary = lower quartile − 1.5 × IQR. Any data value smaller than this is an outlier.

When outliers exist, the ‘whiskers’ of a boxplot extend to the last value that is not an outlier. Each outlier is marked with an asterisk; it is possible to have more than one outlier at either end.

Example 20
Draw a boxplot for the following data, identifying any outliers.
1, 3, 7, 8, 8, 5, 9, 9, 12, 14, 7, 1, 4, 8, 16, 8, 7, 9, 10, 13, 7, 6, 8, 11, 17, 7

The ordered data is:
1, 1, 3, 4, 5, 6, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10, 11, 12, 13, 14, 16, 17

There are 26 values, so there are 13 values in the lower half and 13 values in the upper half. The median is 8, the lower quartile (the median of the lower 13 values) is 7, and the upper quartile (the median of the upper 13 values) is 10.

The five-number summary is:
minimum value is 1
lower quartile is 7
median is 8
upper quartile is 10
maximum value is 17

IQR = 10 − 7 = 3

The upper boundary = upper quartile + 1.5 × IQR = 10 + 1.5 × 3 = 14.5
The lower boundary = lower quartile − 1.5 × IQR = 7 − 1.5 × 3 = 2.5

Values outside the interval [2.5, 14.5] are outliers. Hence the two outliers at the upper end are the data values 16 and 17, and the two at the lower end are both the data value 1.

We now have all the information to draw the boxplot:

[Figure: boxplot on a number line from 0 to 17. Annotations: ‘Two outliers of the same value are shown like this.’ (at 1) and ‘The whisker is drawn to the last value that is not an outlier.’]

Using the calculator to draw the boxplot in Example 20 above, we begin by entering the data in L1. Use STAT PLOT by pressing 2nd Y=. Press ENTER to use Plot 1. Turn the plot On, then use the arrow keys to choose the ‘boxplot with outliers’ icon and press ENTER. Press ZOOM 9 to draw the boxplot. Note that only one of the outliers at 1 appears on the screen.

Press TRACE and use the arrow keys to move the cursor through the summary statistics. Note that both values at 1 are included.

You may wonder why we would need both the boxplot and the stemplot or histogram. Each complements the other and shows slightly different things. Boxplots provide an excellent display of the summary statistics, while stemplots and histograms illustrate the shape of the distribution more accurately.
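The outlier test in Example 20 can be checked with a short Python sketch. The function names are hypothetical, and the quartiles are taken as the medians of the lower and upper halves of the ordered data (leaving out the middle value when the number of values is odd), which is the convention the worked examples above appear to use; other software may use a different quartile rule and give slightly different boundaries.

```python
from statistics import median


def five_number_summary(data):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]
    # When n is odd, the middle value is left out of both halves.
    upper = values[half:] if n % 2 == 0 else values[half + 1:]
    return values[0], median(lower), median(values), median(upper), values[-1]


def outlier_test(data):
    """Return (lower boundary, upper boundary, sorted list of outliers)."""
    _, q1, _, q3, _ = five_number_summary(data)
    iqr = q3 - q1
    lower_boundary = q1 - 1.5 * iqr
    upper_boundary = q3 + 1.5 * iqr
    outliers = sorted(x for x in data if x < lower_boundary or x > upper_boundary)
    return lower_boundary, upper_boundary, outliers


example_20 = [1, 3, 7, 8, 8, 5, 9, 9, 12, 14, 7, 1, 4, 8, 16,
              8, 7, 9, 10, 13, 7, 6, 8, 11, 17, 7]

print(five_number_summary(example_20))  # (1, 7, 8.0, 10, 17)
print(outlier_test(example_20))         # (2.5, 14.5, [1, 1, 16, 17])
```

The whiskers of the boxplot would then be drawn to 3 and 14, the last values inside the boundaries.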
Note: Consider the following example, in which a boxplot and a stemplot display the same set of heights:

[Figure: boxplot on a height (cm) axis from 3 to 11, alongside a stemplot of the same data with leaf unit 0.1 cm]

These graphs display the same distribution. The boxplot displays the summary statistics, while the stemplot reveals the bimodal nature of the distribution. Hence both graphics are of value.

EXERCISE 1F

1 The following boxplot summarises the heights of the players in an AFL team.

[Figure: boxplot with one outlier on a height (cm) axis from 165 to 215]

Use the boxplot to find:
a the median height of the team
b the range of heights of the team (ignoring the outlier)
c the height that 75% of the team are taller than
d the height of the player that is an outlier
e the interquartile range of the heights.

2 Find the five-number summary (minimum, lower quartile, median, upper quartile, maximum) for each of the following data sets, and construct a boxplot for the data.

a Essendon’s game scores for the year 2000 (not including the finals):
156, 130, 124, 137, 123, 144, 140, 127, 106, 132, 145, 169, 119, 89, 108, 89, 167, 159, 165, 109, 81, 97

b
Number of toothpicks    33   34   35   36   37   38   39
Frequency                1    5    7   13   12    8    2

c The daily maximum temperature (°C) in Melbourne for the month of March 2001:

Stem | Leaf
1    |
1*   | 7 8 8 8 9 9
2    | 0 0 0 2 2 2 2 3 3 3 4 4
2*   | 5 5 6 7 8 8
3    | 0 0 1 2 2 3
3*   | 5
2 | 4 represents 24°C

3 A set of data has a lower quartile of 31.5, median of 37, and upper quartile of 43.5.
a Calculate the interquartile range for this data set.
b Calculate the boundaries that identify outliers.
c Which of the data 22, 13.2, 60, 65 would be outliers?

4 The boxplot below shows the distribution of weights of a sample of Jack Russell terriers:

[Figure: boxplot on a weight (kg) axis from 4 to 10]

Which one of the following would not be true for this data?
A The interquartile range is more than 1.5 kg.
B The heaviest 25% of the dogs all weighed more than 8 kg.
C The median weight was 7 kg.
D At least 75% of the weights were more than 6 kg.
E The lightest 25% weighed less than or equal to 6.2 kg.

5 The boxplot below shows the distribution of taxi fares for 50 trips taken from Melbourne Airport.

[Figure: boxplot on a fare ($) axis from 15 to 45]

a Find:
i the median fare
ii the range of fares
iii the IQR of fares.
b Write a sentence describing the distribution of the data, mentioning each of the statistics from a.
c Complete the following:
i Approximately ...... % of fares were greater than $32.
ii The minimum fare was $ ......
iii 75% of the fares were greater than $ ......

6 Match the histograms A, B, C, D and E to the boxplots I, II, III, IV and V.

[Figures: displays A to E (frequency histograms, and one stemplot with stems 1 to 7 and leaf unit 0.1) and boxplots I to V, each drawn on an x-axis]

RANDOM SAMPLES G

When we conduct a statistical survey, it is important that our data reflects the whole population.

If data is to be collected from a sample then the sample must accurately represent the population. Otherwise, reliable conclusions about the population cannot be made.

Samples must be chosen so that the results will not show bias towards a particular outcome. The sample size is also an important feature to be considered if conclusions about the population are to be made from the sample.

For example: Measuring a group of three fifteen-year-olds would not give a very reliable estimate of the height of fifteen-year-olds all over the world.
We therefore need to choose a random sample that is large enough to represent the population.

Note that conclusions based on a sample will never be as accurate as conclusions made from the whole population, but if we choose our sample carefully, the sample will be a good representation of the population.

CHOOSING A RANDOM SAMPLE

In a simple random sample, every member of the population has an equal chance of being chosen, and each member is chosen independently of any other member.

Random samples can be chosen using coins, dice, numbered tokens, random number tables, or random number generators on computers or calculators.

For example: Suppose you wish to choose Tattslotto numbers. The population of numbers is the integers 1 to 45 inclusive and you are going to choose a ‘sample’ of six different numbers. How could you choose these numbers randomly?

Three possible methods:

1 Number forty-five pieces of paper from 1 to 45, place them in a container and select six pieces of paper without looking.

2 Use a random number table (Table 1).

Table 1
39634 62349 74088 65564 16379 19713 39153 69459 17986 24537
14595 35050 40469 27478 44526 67331 93365 54526 22356 93208
30734 71571 83722 79712 25775 65178 07763 82928 31131 30196
64628 89126 91254 24090 25752 03091 39411 73146 06089 15630
42831 95113 43511 42082 15140 34733 68076 18292 69486 80468
80583 70361 41047 26792 78466 03395 17635 09697 82447 31405
00209 90404 99457 72570 42194 49043 24330 14939 09865 45906
05409 20830 01911 60767 55248 79253 12317 84120 77772 50103
95836 22530 91785 80210 34361 52228 33869 94332 83868 61672
15358 70469 87149 89509 72176 18103 55169 79954 72002 20582

The digits in the table are generated by computer and are printed in groups of five for easy reading. You can start anywhere in the table and move across or down. To choose numbers between 1 and 45, you need to look at two digits at a time. If the digits are 04 then the chosen number is ‘4’. If the digits give a number greater than 45 then you ignore it.
If you get a repeat of a number then you will also ignore it.

Starting in the top left hand corner and going across (crossing out the inappropriate numbers until you have six numbers), the numbers are:
39, 63, 46, 23, 49, 74, 08, 86, 55, 64, 16, 37, 91, 97, 13
Your chosen numbers would be 39, 23, 8, 16, 37 and 13.

3 Use the random number generator on the calculator. This can be found in the menu as follows:
Press MATH then the left arrow key to select PRB. Choose 5:randInt(. This will bring randInt( to the screen. We need to type in the range of random integers that we are considering, i.e., 1 to 45. Press 1 , 4 5 ). Pressing ENTER repeatedly will give random integers between 1 and 45.

[Calculator screen: randInt(1,45) followed by six random integers]

In this case the first six numbers were all different, so these are the randomly chosen Tattslotto numbers. If numbers were repeated, we would generate more until we had six different ones.

You could also type in the sample size of 6 as a third value, i.e., randInt(1,45,6). However, if this gave repeats in the sample, you would need to repeat the procedure.

Example 21
The table below gives the monthly sales figures, in thousands of dollars, for a shop over a six year period.

             2000   2001   2002   2003   2004   2005
January      43.1   48.7   45.7   44.0   48.6   46.3
February     38.2   35.3   36.4   38.3   37.7   40.2
March        38.6   36.0   36.2   34.8   35.3   33.3
April        40.2   40.9   42.4   42.5   43.8   35.7
May          43.2   44.2   47.0   48.7   50.3   52.4
June         27.8   32.3   33.5   34.1   32.2   35.8
July         26.4   27.2   23.5   27.2   27.7   28.1
August       23.8   24.9   24.8   27.6   26.1   28.2
September    27.4   30.8   32.7   33.6   34.9   35.1
October      40.4   39.3   38.7   41.3   42.4   44.9
November     68.3   67.4   67.3   69.8   70.4   72.6
December     81.2   83.9   84.6   85.5   88.3   87.2

a Choose a year at random.
b Choose a month at random.
c Choose three consecutive years.
d Choose a period of three consecutive years (36 months) starting with any month.
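Before working through Example 21, the random number table method used above for the Tattslotto numbers can be sketched in Python. This is a hedged illustration only: `sample_from_table` is a hypothetical helper, and `first_row` holds the digit groups that produce the sequence 39, 63, 46, 23, ... quoted in method 2 when read across from the top left corner.

```python
def sample_from_table(rows, low, high, size):
    """Read a table of digits two at a time, keeping numbers from low to high
    and ignoring repeats, until size different numbers have been chosen."""
    digits = "".join(rows).replace(" ", "")
    chosen = []
    for i in range(0, len(digits) - 1, 2):
        number = int(digits[i:i + 2])
        # Ignore numbers outside the range (including 00) and repeats.
        if low <= number <= high and number not in chosen:
            chosen.append(number)
            if len(chosen) == size:
                break
    return chosen


# Digit groups read across from the top left corner of the table.
first_row = "39634 62349 74088 65564 16379 19713 39153 69459 17986 24537"

print(sample_from_table([first_row], 1, 45, 6))  # [39, 23, 8, 16, 37, 13]
```

On a computer the table is not really needed: Python's `random.sample(range(1, 46), 6)` draws six different integers from 1 to 45 directly, which plays the same role as the calculator's random number generator.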
a There are six years from which to choose. We could use a die to randomly choose one of these years; the year 2000 would be represented by 1, 2001 by 2, ......, 2005 by 6. Alternatively, we could use the random number generator on a calculator.
The randomly chosen year is 2004.

b There are twelve months from which we need to choose one month. We use the calculator, with 1 representing January, 2 representing February, etc.
The randomly chosen month is November.

c To choose three consecutive years, we need to establish the number of sets of three consecutive years that are possible:
1 2000 - 2002
2 2001 - 2003
3 2002 - 2004
4 2003 - 2005
There are four possibilities, from which we have to choose one.
Using the calculator, the randomly chosen period is number 3, 2002 to 2004.

d To choose a period of three consecutive years starting with any month, we need to establish the number of sets that are possible:
1 Jan 2000 - Dec 2002
2 Feb 2000 - Jan 2003
3 Mar 2000 - Feb 2003
...
37 Jan 2003 - Dec 2005
There are thirty-seven possibilities, from which we have to choose one.
Using the calculator, the randomly chosen period is number 11, November 2000 to October 2003.

TO CHOOSE A SIMPLE RANDOM SAMPLE:
1 State the sample size needed.
2 State the number of possibilities from which you can choose, and number them if necessary.
3 State the random number generator that you are using.
4 Explain what you will do if repeated random numbers are not applicable.
5 State the random number(s) chosen and the data that is now in your sample.

EXERCISE 1G

1 Use the random number table (Table 1), starting at the top left corner and working down, to:
a select a random sample of six different numbers between 1 and 45 inclusive
b select a random sample of 5 different numbers between 100 and 499 inclusive.
2 Use your calculator to:
a select a random sample of six different numbers between 5 and 25 inclusive
b select a random sample of 10 different numbers between 1 and 25 inclusive.

3 The following calendar for 2006 shows the weeks of the year. Each of the days is numbered. Using a random number generator, choose a sample from the calendar of:
a five different dates
b a complete week starting with a Monday
c a month
d three different months
e three consecutive months
f a four week period starting on a Saturday
g a four week period starting on any day.
Explain your method of selection in each case.

[Calendar for 2006: for each month, every date is listed with its day of the week and its day number in the year, from (1) for Sunday 1 January to (365) for Sunday 31 December, and the weeks are numbered Wk 1 to Wk 53]

Chapter 2 BIVARIATE DATA

Many statistical investigations involve analysing the relationship between two variables. We call the data in these investigations bivariate data.

The way that bivariate data is analysed depends on whether the data is categorical or numerical. In this chapter we will study the display and analysis of bivariate data where:
• one variable is a categorical variable and the other is a numerical variable
• both variables are categorical
• both variables are numerical.

For any pair of variables, one of the pair is described as the dependent or response variable, while the other is the independent or explanatory variable.
The dependent variable responds to changes in the independent variable. The independent variable explains the changes in the dependent variable.

For example, the number of children in a family influences the type of car they have, but not the other way around. The type of car is therefore the dependent variable and the number of children is the independent variable.

COMPARING ONE CATEGORICAL AND ONE NUMERICAL VARIABLE A

BACK-TO-BACK STEMPLOTS

If the categorical variable has only two categories then a back-to-back stemplot is useful. It is a visual display that enables easy analysis and comparison of the data.

Consider this example: An office worker has the choice of travelling to work by tram or train. He has recorded the travel times from recent journeys on both of these types of transport. He wishes to know which type of transport is quicker and which is the more reliable.

Recent tram journey times (minutes): 21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24
Recent train journey times (minutes): 23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16

A back-to-back stemplot could be used to display the relationship between the categorical variable type of transport, which has two categories (or levels), and the numerical variable travel time. The type of transport is the independent variable and the travel time is the dependent variable, because the travel time depends on the type of transport.

A back-to-back stemplot is constructed with only one stem. The leaves are grouped on either side of this central stem. The ordered back-to-back stemplot for the data is shown below:

Train leaf | Stem | Tram leaf
  88877666 |  1   | 34889
    831100 |  2   | 1224578
         0 |  3   | 03
           |  4   | 3

The most frequently occurring travel times by train were between 10 and 20 minutes whereas the most frequently occurring travel times by tram were between 20 and 30 minutes.
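As a rough check, the back-to-back stemplot for the train and tram times can be rebuilt from the raw data with a short Python sketch. The function names are hypothetical and the layout is just one plain-text way of printing the display; it assumes whole-number data with the tens digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict


def leaves_by_stem(data):
    """Group the units digits (leaves) of the sorted data under their tens digits (stems)."""
    stems = defaultdict(list)
    for value in sorted(data):
        stems[value // 10].append(value % 10)
    return stems


def back_to_back(left, right):
    """Return the rows of a back-to-back stemplot, with left leaves written outwards from the stem."""
    left_stems, right_stems = leaves_by_stem(left), leaves_by_stem(right)
    stems = set(left_stems) | set(right_stems)
    width = max(len(leaves) for leaves in left_stems.values())
    rows = []
    for stem in range(min(stems), max(stems) + 1):
        left_leaves = "".join(map(str, reversed(left_stems.get(stem, []))))
        right_leaves = "".join(map(str, right_stems.get(stem, [])))
        rows.append(f"{left_leaves:>{width}} | {stem} | {right_leaves}")
    return rows


tram = [21, 25, 18, 13, 33, 27, 28, 14, 18, 43, 19, 22, 30, 22, 24]
train = [23, 18, 16, 16, 30, 20, 21, 18, 18, 17, 20, 21, 28, 17, 16]

# Train leaves on the left, tram leaves on the right, as in the text.
for row in back_to_back(train, tram):
    print(row)
```

Printing the rows gives the train leaves 88877666, 831100 and 0 against stems 1, 2 and 3, and the tram leaves 34889, 1224578, 03 and 3 against stems 1 to 4.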
The median train travel time is 18 minutes and the median tram travel time is 22 minutes. This supports the observation that train journeys are generally shorter.

The range of the train travel times is 30 − 16 = 14 minutes while the range of the tram travel times is 43 − 13 = 30 minutes. The interquartile range of travel times for the train is 21 − 17 = 4 minutes, while the IQR for tram travel times is 28 − 18 = 10 minutes.

Comparison of these measures of spread indicates that the train travel times are less ‘spread out’ than the tram travel times. The train travel times are therefore more predictable or reliable.

In conclusion, it is generally quicker and the travel times are more reliable if the worker travels by train to work.

EXERCISE 2A.1

1 The heights (to the nearest centimetre) of Year 10 boys and girls in a school are being investigated. The sample data are as follows:

Boys: 164 168 175 169 172 171 171 180 168 168 166 168 170 165 171 173 187 179 181 175 174 165 167 163 160 169 167 172 174 177 188 177 185 167 160

Girls: 165 170 158 166 168 163 170 171 177 169 168 165 156 159 165 164 154 170 171 172 166 152 169 170 163 162 165 163 168 155 175 176 170 166

a What are the two variables in this investigation? Classify the variables as categorical or numerical, dependent or independent.
b Construct a back-to-back stemplot for the data.
c Find the statistics in the five-number summaries for each of the data sets.
d Compare and comment on the distributions of the data, mentioning the shape, centre and spread and quoting statistics to support your statements.

2 A new cancer drug is being developed and is being tested on rats. Two groups of twenty rats with cancer were formed; one group was given the drug while the other was not. The survival time of each rat in the experiment was recorded up to a maximum of 192 days.
Survival times of rats that were given the drug:
64 78 106 106 106 127 127 134 148 186 78 106 106 192* 192* 192* 192* 192* 192* 64

Survival times of rats that were not given the drug:
37 38 42 43 43 43 43 43 51 51 55 57 59 62 66 69 48 86 49 37

* denotes that the rat was still alive at the end of the experiment

a What are the variables in this investigation? Classify the variables as categorical or numerical, dependent or independent.
b Construct a back-to-back stemplot for the data and find the statistics that make up the 5-number summaries.
c Compare and comment on the distributions of the data, mentioning the shape, centre and spread and quoting statistics to support your statements.

3 Peter and John are competing taxi-drivers who wish to know who earns more money. They have recorded the amount of money (in dollars) collected per hour for five hours over five days:

Peter: 17.3 11.3 15.7 18.9 9.6 13 19.1 18.3 22.8 16.7 11.7 15.8 12.8 24 15 13 12.3 21.1 18.6 18.9 13.9 11.7 15.5 15.2 18.6

John: 23.7 10.1 8.8 13.3 12.2 11.1 12.2 13.5 12.3 14.2 18.6 18.9 15.7 13.3 20.1 14 12.7 13.8 10.1 13.5 14.6 13.3 13.4 13.6 14.2

a Construct a back-to-back stemplot for the data and find the statistics that make up the 5-number summaries.
b Compare and comment on the distributions of the data, mentioning the shape, centre and spread and quoting statistics to support your statements.

4 The residue that results when a cigarette is smoked collects in the filter.
The residue from twenty cigarettes from each of two different brands was measured, giving the following data, in milligrams:

Brand X: 1.62 1.55 1.59 1.56 1.56 1.55 1.63 1.59 1.56 1.69 1.61 1.57 1.56 1.55 1.62 1.61 1.52 1.58 1.63 1.58

Brand Y: 1.61 1.62 1.69 1.62 1.60 1.59 1.66 1.55 1.61 1.62 1.64 1.61 1.58 1.57 1.57 1.57 1.58 1.60 1.63 1.59

a Copy and complete the back-to-back stemplot for this data:

Brand Y | Stem | Brand X
        | 150  |
        | 152  | 2
        | 154  | 5 5 5
        | 156  | 6 6 6 6 7
        | 158  | 8 8 9 9
        | 160  | 1 1
        | 162  | 2 2 3 3
        | 164  |
        | 166  |
        | 168  | 9
Stem 156 includes values 1.56 and 1.57

b Comment on and compare the shape of the distributions.

PARALLEL BOXPLOTS

Parallel boxplots are used to display and compare data where one of the variables is numerical and the other is a categorical variable with two or more categories.

For example: If additional car travel time data is available to the office worker in the earlier tram and train example, we can use parallel boxplots to compare the data. They help us decide which type of transport is the quickest to get him to work and which is the most reliable.

Car travel times (minutes): 30, 21, 19, 17, 24, 28, 23, 25, 25, 16, 18, 19, 29, 22

The categorical variable type of transport now has three categories and is the independent variable.

Ordering the car travel times we get: 16, 17, 18, 19, 19, 21, 22, 23, 24, 25, 25, 28, 29, 30

The 5-number summary is: min = 16, max = 30, median = (22 + 23)/2 = 22.5, lower quartile = 19, upper quartile = 25

The three boxplots are drawn on the one axis:

[Figure: parallel boxplots for car, train and tram (the categorical variable with three categories) on a travel time axis from 10 to 45 minutes (the numerical variable)]

The car travel times have almost the same spread (range = 14 mins, IQR = 6 mins) as the train travel times (range = 14 mins, IQR = 4 mins), suggesting that the car travel time is as reliable as the train travel time.
However, the train travel times include two outliers which may be due to extraordinary events. If these are ignored then the range of travel times for the train would be 7 minutes, which is considerably less than the ranges for the car and tram.

The median car travel time is 22.5 minutes, compared to 18 minutes for the train and 22 minutes for the tram, so it is still generally quicker to travel by train.

In conclusion: From the data given, it is generally quicker and more reliable to travel by train than it is by either tram or car.

Using the graphing calculator to graph parallel boxplots

Press STAT and choose 1:Edit. Press ENTER. The data for each of the transport types is entered in separate lists.

Press 2nd Y= to select STAT PLOT. The three boxplots can be drawn on the screen at the same time by turning each of them On. Make sure that the ‘boxplot with outliers’ icon and the correct list are selected for each plot.

ZOOM 9 will bring the graphs to the screen. TRACE, then the arrow keys, can be used to find the ‘5-number summary’ values on the screen.

General rules for interpreting and comparing the distribution of bivariate data:
1 Comment on the shape of the distributions (symmetric, positively skewed, negatively skewed, outliers).
2 Comment on and compare the centres of the data (median and mean).
3 Comment on and compare the spread of the data (range, interquartile range).

EXERCISE 2A.2

1 The percentage scores on a SAC for three classes of Further Mathematics students have been recorded, and the distributions of results for the three classes are summarised on the graph below:

[Figure: parallel boxplots for class A, class B and class C on a ‘score on SAC (%)’ axis from 0 to 100]

a In which class was:
i the highest mark scored
ii the lowest mark scored?
b Comment on the shape of the distribution of marks in each of the classes.
c Comment on and compare the centre of the scores for the classes.
d Comment on and compare the spread of the scores for the classes.

2 [VCAA FM 2001 Q6]
[Figure: parallel boxplots of age (years), axis from 0 to 45, for female (n = 26) and male (n = 23) elephants]
A conservation park in Thailand is home to 49 elephants, of which 26 are females and 23 are males. The parallel boxplots above show the distribution of their ages by sex. Based on the information contained in the parallel boxplots, which one of the following statements is incorrect?
A The youngest elephant is male.
B There are fewer female elephants under the age of 15 years than male elephants under the age of 15 years.
C There are no female elephants over the age of 40 years.
D The median age of the female elephants is approximately the same as the median age of the male elephants.
E Approximately 25% of the male elephants are 30 years of age or older.

3 The daily maximum temperatures in Melbourne for June 21st and December 21st (the solstices) are being compared. The data for the 20 years from 1981 to 2000 is given below:
June 21st: 13.6, 17.4, 10.6, 13.5, 19.1, 16.7, 14.2, 14.0, 12.2, 11.1, 11.9, 17.0, 18.3, 15.4, 14.9, 16.3, 14.6, 15.6, 15.1, 16.3
December 21st: 24.2, 21.3, 19.4, 23.0, 21.4, 28.1, 22.7, 20.3, 21.4, 17.2, 20.0, 35.0, 22.3, 33.7, 21.1, 21.9, 18.9, 21.4, 23.5, 38.6
a What are the variables in this investigation? Classify the variables as categorical or numerical, dependent or independent.
b Find the statistics that make up the 5-number summaries and construct parallel boxplots for the data.
c Compare and comment on the distributions of the data, mentioning the shape, centre and spread and quoting statistics to support your statements.

4 Using the data from question 4, Exercise 2A.1, find five-number summaries and construct parallel boxplots to summarise the distributions of residue for the two types of cigarettes.
What conclusions can be made from comparing the boxplots? Support your statements with statistics.

5 Plant fertilisers come in many different brands, but there are essentially two types: organic and inorganic. A student was interested to discover whether radish plants responded better to organic or inorganic fertiliser. He prepared three identical plots of ground, named plots A, B and C, in his mother's garden, and planted 40 radish seeds in each plot. After planting, each plot was treated in an identical manner, except for the way they were fertilised. Cost prevented him using a variety of fertilisers, so he chose one organic and one inorganic fertiliser. Plot A received no fertiliser, plot B received the organic fertiliser as prescribed on the packet, and plot C received the inorganic fertiliser as prescribed on the packet. The student was interested in the weight of the root that forms under the ground. The data supplied below is the weight of the root (measured to the nearest gram) of the individual plants:
Data from plot A: 27 39 32 29 38 30 9 50 34 10 34 22 8 41 36 39 36 40 42 12 32 14 32 35 32 35 30 42 38 25
Data from plot B: 51 47 45 54 58 58 56 56 34 41 63 50 66 47 54 47 48 46 48 48 53 52 47 34 29 20 46 28 33
Data from plot C: 55 69 68 76 70 68 65 76 63 61 43 54 67 70 61 69 62 72 68 60 58 64 58 77 76 79 66 59 65 65 56 75 47 79 60 50 70 39
a Produce parallel boxplots for the data.
b Compare and comment on the distributions of the weights of the root for each plot, mentioning the shape, centre and spread and quoting statistics to support your statements.

TWO CATEGORICAL VARIABLES B

Two-way frequency tables are used to demonstrate the relationship between two categorical variables. Percentaged segmented barcharts give a visual display of the data.

TWO-WAY FREQUENCY TABLES

In two-way frequency tables, the independent variable fills the columns.
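Percentaging a two-way frequency table by column can be sketched in a few lines. The example below is a minimal illustration, assuming the counts from Example 1, which follows (11 and 14 males, 20 and 15 females); the dictionary layout and the function name are illustrative choices, not part of the text. Each column (a category of the independent variable) is divided by its own column total.

```python
# Counts by column: the independent variable (gender) fills the columns.
counts = {
    "Male": {"In favour": 11, "Against": 14},
    "Female": {"In favour": 20, "Against": 15},
}


def column_percentages(table):
    """Convert each column of counts to whole-number percentages of its column total."""
    result = {}
    for column, cells in table.items():
        total = sum(cells.values())
        result[column] = {row: round(100 * n / total) for row, n in cells.items()}
    return result


percentages = column_percentages(counts)
for column, cells in percentages.items():
    print(column, cells)
```

Percentaging within each column is what allows groups of different sizes (here 25 men and 35 women) to be compared fairly.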
Example 1

A town council is considering bringing in a rule banning the drinking of alcohol in public places. A random survey of 60 residents gave the following results: Of the 35 women surveyed, 20 were in favour of the rule. However, only 11 of the men were in favour of it.
a Construct a two-way frequency table to summarise these findings.
b Construct a two-way percentaged frequency table and answer the following:
i What percentage of those surveyed were female?
ii What percentage of those surveyed were in favour of the proposal?
iii What percentage of the females surveyed were in favour of the proposal?
c Do the results of the survey support the theory that females would be more in favour of this rule than males?

The two categorical variables involved in this question are:
Gender: Male or Female
Opinion about rule: In favour or Against
Opinion about rule depends on gender, so the variable gender is the independent variable.

a
                      Gender
Opinion        Male    Female    Total
In favour       11       20        31
Against         14       15        29
Total           25       35        60

b The two-way percentaged frequency table is:
                      Gender
Opinion        Male                   Female
In favour      11/25 × 100 = 44%      20/35 × 100 = 57%
Against        14/25 × 100 = 56%      15/35 × 100 = 43%
Total          100%                   100%

i 35/60 × 100 = 58.33% of those surveyed are female.
ii 31/60 × 100 = 51.67% of those surveyed were in favour of the rule.
iii 57% of the females were in favour of the rule.

c 57% of the females surveyed were in favour of the proposed rule compared with 44% of the males. This shows a difference of 13 percentage points. The results support the theory.

PERCENTAGED SEGMENTED BARCHARTS

The percentaged frequency table in Example 1 can be graphed using a percentaged segmented barchart:
[Figure: percentaged segmented barchart of percentage (0 to 100) against gender (male, female), with segments for in favour and against]

EXERCISE 2B

1 A survey of Victorians was recently conducted to ascertain their interest in AFL football.
The data was presented in the following two-way percentaged frequency table:

                         Gender
Level of interest    Male    Female    Total
Very interested       28       18        22
Somewhat              25       19        21
Not very              19       20        20
Not at all            28       43        37
Total                100      100       100

a Use the table to find:
i the percentage of those surveyed who are very interested in football
ii the percentage of women who are either very or somewhat interested in football.
b Construct a percentaged segmented barchart that compares the interest in Australian Rules for men and women.
c Does the data support the theory that gender influences the level of interest in AFL football? Quote percentages to support your statement.

2 A survey of sixteen-year-old students revealed that 32 of the 48 boys and 23 of the 37 girls played a team sport outside school.
a Copy and complete the two-way frequency table shown:
                                        Gender
Play team sport outside school?    Boys    Girls    Total
Yes
No
Total
b Find the percentage of all the students who play a team sport outside school.
c Find the percentage of girls who play a team sport outside school.
d Construct a two-way percentaged frequency table.
e Do the figures support the theory that more boys than girls play a team sport outside school? Quote some percentages to support your statement.

3 A market research company is contracted to investigate the age of people who listen to the three radio stations, A, B or C, in a city. The results of their survey are given in the table alongside:
Station A B C Total < 30 35 40 175 Age group 30 - 60 > 60 30 200 83 68 37 132 Total
a Complete the Totals row and column in the table alongside.
b Why do we need a two-way percentaged frequency table to help analyse the data?
c Construct the two-way percentaged frequency table.
d Compare and comment on which age groups listen to which radio station.
4 The two-way percentaged frequency table alongside was produced to show the labour force status of parents from one-parent families.

Labour force status        Father    Mother
Employed full-time          48.6      16.8
Employed part-time          13.3      27.2
Unemployed                   8.3       8.9
Not in the labour force     29.8      47.0
Total                      100.0     100.0
(Source: ABS June 2002 Labour Force Survey)

a What are the variables in this survey? Classify them as categorical or numerical, independent or dependent.
b Construct a percentaged segmented barchart to illustrate the data.
c What conclusions can be made from this table and graph? Support your statements with percentages from the table.

5 A polling agency wants to test the theory that in a particular municipality, "more of the female residents vote for female candidates". A random sample of eighty residents in the municipality were asked their voting preference, either Smith the female candidate, or Jones the male candidate. Of the 35 female residents in the sample, 20 said they would vote for Smith, whereas 25 of the male residents said they would vote for Jones.
a Fill in the missing values on the two-way frequency table alongside.
                          Gender
Voting intention    Male    Female    Total
Smith                         20
Jones                25
Total                         35        80
b Construct a two-way percentaged frequency table for the data.
c Use the figures in the table to comment on the validity of the theory.

TWO NUMERICAL VARIABLES C

Scatterplots are used to demonstrate and visualise the relationship between two numerical variables. The data is plotted as points on a graph where the independent variable is the horizontal axis and the dependent variable is the vertical axis.

CONSTRUCTING A SCATTERPLOT

The pattern formed by the points on a scatterplot indicates the strength of the relationship between the two variables.

For example: The relationship between weight and height of members of an AFL football team is being investigated.
We expect there to be a fairly strong association between these variables as it is generally perceived that the taller a person is, the more they will weigh. The height and weight of each of the players in the team is recorded and these values form a coordinate pair for each of the players:

Player    Height (cm)    Weight (kg)
  1          203            106
  2          189             93
  3          193             95
  4          187             86
  5          186             85
  6          197             92
  7          180             78
  8          186             84
  9          188             93
 10          181             84
 11          179             86
 12          191             92
 13          178             80
 14          178             77
 15          186             90
 16          190             86
 17          189             95
 18          193             89

Before a scatterplot is constructed you need to establish which of the variables is the independent variable and which is the dependent variable. In this case we assume that weight depends on height and so weight is the dependent variable and height is the independent variable.

The points are therefore plotted as coordinate pairs (height, weight) for the individuals in the investigation.
[Figure: scatterplot "Weight versus Height", weight (kg) from 75 to 105 against height (cm) from 175 to 205]

Using the calculator to construct a scatterplot

Press STAT and choose 1:Edit. Press ENTER. Enter the data into lists. The independent variable should be L1 and the dependent variable should be L2.
Press 2nd Y= to select STAT PLOT. Press ENTER. Turn the plot On and select the scatterplot icon. The XList is for the independent variable L1 and the YList is for the dependent variable L2.
Press ZOOM 9 to view the scatterplot. You can press TRACE and use the arrow keys to identify the points.

INTERPRETATION OF A SCATTERPLOT

There are four aspects that we need to consider:

1 Direction
Positive association: The points generally go up as x increases, similar to a straight line with positive gradient.
“As the independent variable (x) increases, the dependent variable (y) also increases.”
Negative association: The points generally go down as x increases, similar to a straight line with negative gradient.
“As the independent variable (x) increases, the dependent variable (y) decreases.”

2 Form
In the scatterplots above, the points are generally in a straight line. The relationship between the variables is said to be linear.
These scatterplots show relationships which are not linear.
[Figures: scatterplots showing non-linear relationships]

3 Strength
If the points form a well-ordered pattern then the strength of the association is said to be strong. For example:
[Figures: strong positive, strong negative, strong non-linear]
If the points form a pattern which is less well defined, then the strength is said to be moderate. For example:
[Figures: moderate positive, moderate negative]
If the points are scattered but a general pattern is still discernible then the association is said to be weak. For example:
[Figures: weak positive, weak negative]
If the points appear to be randomly scattered then there is no association between the variables. An example of this is shown opposite.

4 Outliers
Outliers stand out from the general body of data. The example opposite shows a “moderate positive association with one outlier”.
[Figure: scatterplot with one point labelled as an outlier]

We can interpret the Weight versus Height scatterplot from earlier as follows:
“There is a moderate positive association between the variables height and weight. This means that as height increases, weight increases. The relationship appears linear and there are no obvious outliers.”

Outliers should be checked to ensure they are genuine outstanding data and not errors in the data or errors in plotting.
A decision can be made to ignore them as they will influence correlation measures and models fitted to the data, but this should only be done after careful consideration.
[Figure: scatterplot "Weight versus Height", weight (kg) against height (cm)]

EXERCISE 2C

1 For each of the following, state whether you would expect to find positive, negative, or no association between the following variables. Indicate the strength (none, weak, moderate or strong) of the association.
a Shoe size and height.
b Speed and time taken for a journey.
c The number of occupants in a household and the water consumption of the household.
d Maximum daily temperature and the number of newspapers sold.
e Age and hearing ability.

2 Copy and complete the following:
a If the variables x and y are positively associated then as x increases, y ..........
b If there is negative association between the variables m and n then as m increases, n ..........
c If there is no association between two variables then the points on the scatterplot appear to be .......... ..........

3 For each of the scatterplots below, state:
i whether there is positive, negative or no association between the variables
ii the strength of the association between the variables (zero, weak, moderate or strong)
iii whether the relationship between the variables appears to be linear or not
iv the presence of outliers.
[Figures: scatterplots a to f, each plotting y against x]

4 Consider the data:
x: 1 2 3 4 5 6 7 8 9 10
y: 2 1 4 3 5 6 5 5 7 8
a Construct a scatterplot for the data.
b State whether the association between the variables is:
i positive, negative or no association
ii weak, moderate or strong
iii linear or not.

5 The following data was collected by a milkbar owner over fifteen consecutive days:
Max. daily temp. (°C): 29 40 35 30 34 34 27 27 19 37 22 19 25 36 23
No. of icecreams sold: 119 164 131 152 206 169 122 143 63 208 155 96 125 248 139
a Which of the two variables is the independent variable?
b Construct a scatterplot of the data.
c Interpret the scatterplot in terms of the variables, mentioning direction, strength, linearity and outliers.

6 A class of 25 students was asked to record their times (in minutes) spent preparing for a test. The table below gives the score that they achieved on the test and the recorded preparation time.
Score: 25 31 30 38 55 20 39 47 35 45 32 33 34
Minutes spent preparing: 75 30 35 65 110 60 40 80 56 70 50 110 18
Score: 38 17 38 17 17 26 41 50 30 45 36 23
Minutes spent preparing: 80 22 30 15 10 85 100 60 55 80 50 75
a Which of the two variables is the independent variable?
b Construct a scatterplot of the data.
c Interpret the scatterplot in terms of the variables, mentioning direction, strength, linearity and outliers.

CORRELATION D

Correlation is a statistical word that means relationship or association. We can talk about the correlation/relationship/association between two variables and mean the same thing.

PEARSON'S CORRELATION COEFFICIENT r

The correlation between two numerical variables can be measured by a correlation coefficient. There are several correlation coefficients that can be used, but the most widely used coefficient is Pearson's correlation coefficient, named after the statistician Karl Pearson who developed it. Its full name is Pearson's product-moment correlation coefficient, and it is denoted r.

For a set of n bivariate numerical data with variables x and y, Pearson's correlation coefficient is:

$$ r = \frac{1}{n-1} \sum \left( \frac{x - \bar{x}}{s_x} \right) \left( \frac{y - \bar{y}}{s_y} \right) $$

where $\bar{x}$ and $\bar{y}$ are the means of the x and y data respectively and $s_x$ and $s_y$ are their standard deviations.
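To see how the formula works term by term, it can be evaluated directly. The following is a minimal sketch, assuming sample standard deviations (divisor n − 1) for s_x and s_y, applied to the ten-point data set used in the calculator walkthrough later in this section; the function name pearson_r is illustrative.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's r as the sum of products of standardised values, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (divisor n - 1), matching the 1/(n - 1) in the formula.
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    total = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 5, 6, 5, 5, 7, 8]
r = pearson_r(x, y)
print(round(r, 4))
```

The result agrees, to four decimal places, with the value r = .9130 shown on the calculator's linear regression screen for the same data.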
This formula is tedious to use, so in all situations you will be using your calculator to find r.

INTERPRETATION OF PEARSON'S CORRELATION COEFFICIENT

Pearson's correlation coefficient gives a measure of the relationship between two variables on a scale from −1 to 1.

Word descriptors based on r-values seem doubtful at the best of times and the majority of texts on this subject do not include them. Many texts and Internet sites vary on the advice they give. Here is one possible interpretation.

r                 Description
1                 perfect positive correlation
0.75 to 1         strong positive correlation
0.50 to 0.75      moderate positive correlation
0.25 to 0.50      weak positive correlation
0 to 0.25         almost no correlation
−0.25 to 0        almost no correlation
−0.50 to −0.25    weak negative correlation
−0.75 to −0.50    moderate negative correlation
−1 to −0.75       strong negative correlation
−1                perfect negative correlation

Notes about Pearson's correlation coefficient:
• It is designed for linear data only.
• It should be used with caution if there are outliers. For example, the data in the two scatterplots below both have a correlation coefficient of r = 0.8. The presence of the outlier in the second graph has greatly reduced the r value; without this point, r would equal 1.
[Figures: two scatterplots of y against x, the second with one point labelled as an outlier]

Using the calculator to find Pearson's correlation coefficient

The first step is to activate the diagnostic tools on the calculator. Once turned on these will remain on, but if the memory is cleared or the battery changed then the calculator will revert to the default functions, which do not include r.

To activate the diagnostic tools:
Locate the menu CATALOG using 2nd 0.
Use the arrow keys to scroll down to DiagnosticOn and press ENTER.
DiagnosticOn will appear on the screen. Press ENTER and you will have turned the diagnostic tools on.
We consider finding Pearson's correlation coefficient for the data opposite:
x: 1 2 3 4 5 6 7 8 9 10
y: 2 1 4 3 5 6 5 5 7 8

Enter the data into lists, the x-data into L1 and the y-data into L2. We check the scatterplot at this stage as it will reveal any errors made in entering the data, and any outliers. It will also indicate whether the data is linear.

Press STAT then the right arrow to select CALC and choose 4:LinReg(ax+b). (This means we are fitting a linear model or linear regression of the form y = ax + b to the data.) Regression will be discussed in greater detail in Chapter 3.

LinReg(ax+b) appears on the screen. You need to tell the calculator where your data is: Enter L1, L2 by pressing 2nd 1 , 2nd 2 ENTER.

The linear regression screen appears and the last figure, r = .9130..., is Pearson's correlation coefficient for this data set. The r value indicates a strong positive correlation, which agrees with the scatterplot.

CAUSATION

When analysing data, we must be aware of causation. A high degree of correlation between two variables does not necessarily imply that a change in one variable causes the other to change.

For example:
1 The heights and reading speeds of children were measured and a strong positive correlation was found. Does this mean that increasing height makes you read faster or that increasing your reading speed will cause you to grow? These suggestions are obviously not sensible. The strong correlation results because both variables are closely associated with age. As age increases, both the variables height and reading speed increase. It is age which causes height and reading speed to increase.
2 The number of television sets sold in Ballarat and the number of stray dogs collected in Bendigo were recorded over several years and a strong positive association was found between the variables.
Obviously the number of television sets sold in Ballarat was not influencing the number of stray dogs collected in Bendigo. Both variables have simply been increasing over the period of time that their numbers were recorded.

If a change in one variable causes a change in the other variable then we say that a causal relationship exists between them.

For example: The age and height of a group of children is measured and there is a strong positive correlation between these variables. This will be a causal relationship because an increase in age will cause an increase in height.

EXERCISE 2D

1 a Use your calculator to find Pearson's correlation coefficient for the data given in question 5, Exercise 2C.
Max. daily temp. (°C): 29 40 35 30 34 34 27 27 19 37 22 19 25 36 23
No. of icecreams sold: 119 164 131 152 206 169 122 143 63 208 155 96 125 248 139
b Interpret the value of r in terms of strength and direction.
c Does the value of the correlation coefficient confirm your observations from the scatterplot? Was it appropriate to find r for this data? Explain.

2 a Use your calculator to find Pearson's correlation coefficient for the data given in question 6, Exercise 2C:
Minutes spent preparing: 75 30 35 65 110 60 40 80 56 70 50 110 18
Score: 25 31 30 38 55 20 39 47 35 45 32 33 34
Minutes spent preparing: 80 22 30 15 10 85 100 60 55 80 50 75
Score: 38 17 38 17 17 26 41 50 30 45 36 23
b Interpret the value of r in terms of strength and direction.
c Does the value of the correlation coefficient confirm your observations from the scatterplot? Was it appropriate to find r for this data? Explain.

3 [VCAA FM 2000 Q5] The scatterplot alongside shows the birth rate and the average food intake for 14 different countries.
The value of the product moment correlation coefficient, r, for this data is closest to:
A −0.6
B −0.2
C 0.2
D 0.6
E 0.9
[Figure: scatterplot of birth rate (per 100 000), axis from 20 to 50, against average food intake (1000 calories per person), axis from 1.7 to 2.7]

4 Which one of the following is true for Pearson's correlation coefficient r?
A The addition of an outlier to a set of data would always result in a lesser value of r.
B An r value of 1 represents a stronger relationship between the variables than an r value of −1.
C A high value of r means that one variable is causing the other variable to change.
D An r value of −0.8 means that as the independent variable increases, the dependent variable will tend to decrease.
E It can take values between 0 and 1 inclusive.

5 The following pairs of variables were measured and a strong positive correlation between them was found. Discuss whether a causal relationship exists between the variables. If not, suggest a third variable to which they may both be related.
a The lengths of one's left and right feet.
b The damage caused by a fire and the number of firemen who attend it.
c Company expenditure on advertising, and sales.
d The height of parents and the height of their adult children.
e The number of hotels and the number of churches in rural towns.

THE COEFFICIENT OF DETERMINATION E

In a bivariate set of numerical data, the coefficient of determination gives us a means of measuring the influence that one variable has over the other variable.

Coefficient of determination = r² = (Pearson's correlation coefficient)²

CALCULATION OF THE COEFFICIENT OF DETERMINATION

r² is found on the linear regression screen of your calculator as shown opposite. Alternatively, if the value of r is known, then this can simply be squared.
INTERPRETATION OF THE COEFFICIENT OF DETERMINATION

r² indicates the strength of association between the dependent or response variable and the independent or explanatory variable. If there is a causal relationship then r² indicates the degree to which change in the explanatory variable explains change in the response variable.

For example: An investigation into many different brands of muesli found that there is strong positive correlation between the variables fat content and kilojoule content. Pearson's correlation coefficient, r, was found to be 0.8625. The coefficient of determination for this study is (0.8625)² = 0.7439.

An interpretation of this r² value is “the proportion of variation in kilojoule content that can be explained by the variation in fat content of muesli is 0.7439.”

It is usual to quote the coefficient of determination as a percentage. A proportion of 0.7439 is equivalent to 0.7439 × 100 = 74.39%. The interpretation becomes:
74.39% of the variation in kilojoule content of muesli (the dependent variable) can be explained by the variation in fat content of muesli (the independent variable).

If 74.4% of the variation in kilojoule content of muesli can be explained by the fat content of muesli then we can assume that the other 25.6% (100% − 74.4%) of the variation in kilojoule content of muesli can be explained by other factors (which may or may not be known).

Note:
• Since −1 ≤ r ≤ 1, 0 ≤ r² ≤ 1.
• If r = −0.625 then r² = (−0.625)² = 0.3906, a positive value.
• It is only appropriate to use r² values, like r values, in situations where there is a linear relationship between the two variables.
• r² values of 10% or more are worth mentioning.
• If you are finding an r value from an r² value then you must consider that the r value can be positive or negative. The solutions to r² = a are r = √a and r = −√a. Your calculator will only give you a positive value.
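The notes above can be illustrated with a short script. This is a minimal sketch using the values quoted in this section (r = 0.8625 for the muesli study and r = −0.625 from the notes); the function names and the negative_association flag are illustrative. The flag stands in for the judgement, made from the context of the variables, about the direction of the association, since the square root alone cannot supply the sign.

```python
from math import sqrt


def coefficient_of_determination(r):
    """Square Pearson's r to get the coefficient of determination."""
    return r ** 2


def r_from_r_squared(r_squared, negative_association):
    """Recover r from r^2; the sign must come from the context, not the calculation."""
    magnitude = sqrt(r_squared)
    return -magnitude if negative_association else magnitude


muesli_r2 = coefficient_of_determination(0.8625)
print(f"{muesli_r2:.4f} of the variation, i.e. {muesli_r2 * 100:.2f}%")

note_r2 = coefficient_of_determination(-0.625)  # squaring a negative r gives a positive value
recovered_r = r_from_r_squared(note_r2, negative_association=True)
print(note_r2, recovered_r)
```

Calling r_from_r_squared with negative_association=False would return +0.625 for the same r² value, which is why the direction of the relationship has to be decided before quoting r.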
Example 2

A study has found that 45% of the variation in selling price can be explained by the variation in age of a used car. If this statement was based on the coefficient of determination then what would be the value of Pearson's correlation coefficient for this study?

We are told that r² = 0.45, so r is a square root of 0.45. (√0.45 ≈ 0.6708)
At this point we need to consider the variables involved: selling price and age of a car. We would assume that as the age of a car increases, the selling price of the car decreases, i.e., there is negative correlation between the variables.
Hence we can conclude for this study that Pearson's correlation coefficient, r, will be −0.6708.

EXERCISE 2E

1 In an investigation the coefficient of determination for the variables preparation time and exam score is found to be 0.5624. Complete the following interpretation of the coefficient of determination:
...... % of the variation in .......... can be explained by the .......... in preparation time.

2 For each of the following find the value of the coefficient of determination correct to four decimal places, and interpret it in terms of the variables.
a An investigation has found the association between the variables time spent gambling and money lost has an r value of 0.4732.
b For a group of children a product-moment correlation coefficient of −0.365 is found between the variables heart rate and age.
c In a study of a sample of countries, Pearson's correlation coefficient for the variables female literacy and gross domestic product is found to be 0.7723.

3 A study of the relationship between stress levels and productivity has produced a product-moment correlation coefficient of 0.5629. Which one of the following would be an interpretation that could be made from this study?
A 56.3% of the variation in productivity can be explained by the variation in stress levels.
B 75% of the variation in productivity can be explained by the variation in stress levels.
C 31.7% of the variation in productivity is caused by the variation in stress levels.
D 56.3% of the variation in productivity is caused by the variation in stress levels.
E 31.7% of the variation in productivity can be explained by the variation in stress levels.

4 A rural school has investigated the relationship between the time spent travelling to school (minutes) and a student's year ten average (%) for a sample of students. The results are given in the table below:
Travel time (mins): 10 33 18 43 34 30 24 47 44 41 17 45 39 31 23 11 14 25 16 17
Year 10 average (%): 51 78 97 56 90 70 64 67 37 46 95 67 31 57 43 99 98 82 40 67
a Construct a scatterplot of the data and interpret the scatterplot.
b Find Pearson's correlation coefficient for the data and interpret.
c Calculate the coefficient of determination and interpret this in terms of the variables.
© Copyright 2024