Australia has received migrants from many countries over many years. Immediately after the end of World War 2 there was a surge of migration from European countries, as people tried to find a better life for themselves and their families. War, or civil unrest, has often been the spur for migration to Australia. Whatever the reason, and wherever they are from, migrants have added to Australian culture in innumerable ways.

Census statistics indicate that the spread of migrants throughout Victoria has been extensive. In the Greater Geelong area, 19.6% of the population was born overseas, while in the Latrobe Valley, 15.6% of the population was born overseas. Through the data generated in the census, the waves of migration can be tracked. The figures by themselves could be misinterpreted; analysis of the data, alongside a reading of history, allows us to interpret the data more successfully.

Prepare for this chapter by attempting the following questions. If you have difficulty with a question, click on the Replay Worksheet icon on your Exam Café CD or ask your teacher for the Replay Worksheet.

Worksheet R3.1
1 Find the range of each of the following data sets.
(a) 5, 7, 7, 8, 9, 9, 10, 11, 13, 15
(b) 4, 8, 12, 13, 16, 23, 15, 19, 17, 16
− − − (c) 6, 2, 1, 5, 3, 2, 5, 4, 7

Worksheet R3.2
2 For each of the following data sets find:
(i) the mean (ii) the median (iii) the mode
(a) 1, 2, 3, 3, 4, 5, 6
(b) −3, −3, −2, −1, 0, 5, 8, 10, 11
(c) 4, 5, 2, 2, 4, 3, 3, 2, 5, 4
(d) 10, 11, 22, 19, 12, 18, 15, 11

Worksheet R3.3
3 The graph shows the mean rainfall data for Tennant Creek in the Northern Territory.
(a) How many rain days are usual for January?
(b) Which is the wettest month?
(c) Which month(s) has the most rain days?
(d) Which month(s) has the least rain days?
(e) About how much rain is recorded in Tennant Creek over the course of the year?

[Figure: Average monthly rainfall. Rainfall (mm) and No. of rain days for each month, J to D.]

\[ \text{Mean} = \frac{\text{sum of data values}}{\text{number of data values}} \]
Median is the physical middle value in the data set (the data must be put in order first).
Mode is the most frequently occurring data value.
Range is the difference between the highest data value and the lowest data value.

Heinemann VCE Zone: Advanced General Mathematics Enhanced

3.1 Types of data

Data is information of some kind. (Data is the plural of the word datum, although now data is commonly used grammatically as a singular word.) Data can take many forms. Consider the following questions which might appear as part of a survey (although probably not the same one!).

My favourite fruit is:
Apple   Banana   Orange   Mango   Mandarin   Other

The death penalty should be reintroduced as the punishment for serious crimes.
strongly agree   agree   not sure   disagree   strongly disagree

The data collected in both cases is known as categorical data. But, can you see a difference between the data in each case? In the first case, the categories are simply fruits, each is no different to any other; this is known as nominal data. In the second case, there is an order implied in the categories offered; this is known as ordinal data.

Quite often, the data collected consists of numbers and, not surprisingly, is known as numerical data. Consider these questions which might appear in surveys.
• What was the year of your birth?
• What is your height?

In the first case, we record answers, data values, such as 1960, 1984 and 1989, and there is no interpretation needed. Data like this is known as discrete data. With discrete data you can count the different values possible.

Discrete data does not need to be whole numbers.
For example, shoe sizes are discrete, but the data values are not necessarily whole numbers: …, 6, 6½, 7, 7½, …

In the second case above, we need to interpret the question. We might write 180 cm or 179.8 cm or 179.82 cm (although this degree of accuracy is unlikely). The value recorded varies depending on the degree of accuracy requested or the degree of accuracy possible with our measuring instrument. Data like this is known as continuous data; the values can be any (real) numbers within a particular range.

We make this distinction between discrete and continuous data because there are different things we can do with each data type. However, we very often treat continuous data as if it were discrete. This is because we must record the data in some way. With the height example above, if the question said to record the height to the nearest cm, then we would have the discrete possibilities 178, 179, 180 etc.

Worked example 1
Decide whether these data samples are categorical or numerical.
(a) The number of cars in the staff car park recorded daily for 3 weeks.
(b) The favourite type of fast food for the members of your class.

Steps
(a) Will the data collected involve numbers? If yes, the data is numerical, otherwise it is categorical.
(b) Will the data collected involve numbers? If yes, the data is numerical, otherwise it is categorical.

Solutions
(a) The data is clearly numerical.
(b) As numbers are not involved, this data is categorical.

3 ● Descriptive statistics

Exercise 3.1 Types of data

Short answer
1 Decide whether these data examples are categorical (C) or numerical (N). (See Worked Example 1.)
(a) The amount of change given to customers in a shop.
(b) The voting intention recorded by a sample of people in a poll conducted before a State election.
(c) The eye colours of the students in your General Mathematics class.
(d) The ages, in years, of the teachers at your school.
(e) The individual mass of each sheep in a paddock.
(f) The brand of shoe worn by each Member of Parliament.

2 Decide whether these categorical data examples are nominal (M) or ordinal (O).
(a) The favourite TV show of the members of your family.
(b) The degree of support for the retention of school uniforms recorded by individual students at a particular school.
(c) The starting letter of the company code for each company listed on the Australian Stock Exchange (ASX).
(d) The listing of the top 10 CDs for the week.
(e) The colour of cars in the used-car lot.
(f) The brand of toothbrush used by a sample of dentists.

3 Decide whether these numerical data examples are continuous (T) or discrete (D).
(a) The mass of wool taken from each animal in a flock of sheep.
(b) The number of brands of grocery items available at the supermarket.
(c) The number of spectators at each of the Melbourne Redbacks home games.
(d) The time taken by each competitor in the Olympic 100 m final for women.
(e) The height of each individual tree in a forestry coop.
(f) The distance between each of the capital cities in Australia.

4 Imagine you were setting up a number of survey instruments. Prepare a question for each of the following and indicate the data type which will be collected. Use the code M: nominal; O: ordinal; D: discrete; T: continuous.
(a) You are trying to find out the most popular make of car.
(b) You are investigating the height of kindergarten pupils.
(c) You are investigating the most popular type of fast food.
(d) You are trying to determine the support for the question: Should local government be scrapped?
(e) You are investigating the most popular brand of cat food.
(f) You are trying to find out the ages of the players at the local football club.
Multiple choice
5 (a) The strength of agreement with a number of statements is recorded using a scale of 1–5 where 1 stands for strongly disagree and 5 stands for strongly agree. The most precise way of describing the data collected would be:
A continuous   B categorical   C nominal   D ordinal   E discrete
(b) The height of the high tide at the Barwon Heads bridge is recorded for 30 consecutive days. The most precise way of describing the data collected would be:
A continuous   B categorical   C nominal   D ordinal   E discrete
(c) The number of brothers and sisters of each member of your class is recorded. The most precise way of describing the data collected would be:
A continuous   B categorical   C nominal   D ordinal   E discrete

Extended answer
6 (a) The measurement of distance is supposed to result in continuous data. However, all Olympic records that involve distance appear to produce discrete data. How can this be?
(b) Surveys often use the numbers 1–5 to represent various levels of support for statements that are made. Often, 1 stands for strongly disagree and 5 stands for strongly agree. Your responses would appear to be numerical, but are they really?

3.2 Recording data

Frequency tables

Once we have gone to the trouble of collecting data we need to record it in some way. For example, if we had found the number of pets for each of the 25 members of our class we might have found something like the following.

2, 3, 2, 1, 2, 0, 1, 3, 1, 2, 1, 0, 2, 1, 1, 0, 0, 2, 1, 4, 5, 1, 0, 2, 1

Recorded like this the numbers don’t mean much to us. A frequency table is an efficient way of recording data.

Worked example 2
Put the following data representing the number of pets owned by each of the 25 members of a class into a frequency table.

2, 3, 2, 1, 2, 0, 1, 3, 1, 2, 1, 0, 2, 1, 1, 0, 0, 2, 1, 4, 5, 1, 0, 2, 1

Steps
1. Draw up a table showing each of the data values in one column, a tally mark column and a frequency column.
2. Go through the data list putting a tally mark for each entry (go through the list only once). Use a bundle of four strokes crossed by a fifth (written here as ||||/) for 5.
3. Count up the tally marks and put this total in the frequency column. Add up the frequency column as a check.

Solution
No. of pets   Tally marks   Frequency
0             ||||/         5
1             ||||/ ||||    9
2             ||||/ ||      7
3             ||            2
4             |             1
5             |             1
Total                       25

Worked Example 2 dealt with a discrete data set with only a small number of different data values. If the range of values is too big, we need to group the data into class intervals. We need to think about how big the intervals should be. Although there are no hard and fast rules, we prefer to have somewhere between 6 and 15 groups. So, we find the range and use it to decide on the interval size.

Worked example 3
The following data represents the number of goals kicked over the course of the season by the players at a local football club. Display the data in a frequency table.

17 21 35 64 2 0 20 44 52 81
47 21 14 10 0 2 1 1 3 1
5 2 1 18 0 1 0 2 1 19

Steps
1. Find the range.
2. Choose a convenient interval size. (In this case, notice that the first interval is 0–9, which covers 10 discrete values.)
3. Draw up a table showing each of the intervals in one column, a tally mark column, and a frequency column.
4. Go through the data list putting a tally mark for each entry.
5. Count up the tally marks and put this total in the frequency column. Add up the frequency column as a check.

Solution
Range: 81 − 0 = 81
Interval size: 10. (This gives nine intervals.)

No. of goals   Tally marks           Frequency
0–9            ||||/ ||||/ ||||/ |   16
10–19          ||||/                 5
20–29          |||                   3
30–39          |                     1
40–49          ||                    2
50–59          |                     1
60–69          |                     1
70–79                                0
80–89          |                     1
Total                                30

When the data is continuous we set up the intervals in a slightly different fashion.
We do this even when we round off the values to discrete quantities.

Worked example 4
The following data represents the haemoglobin level for each member of the class. The value has been measured correct to one decimal place. Display the data in a frequency table.

10.1 16.4 11.3 14.6 10.8
13.3 14.5 14.5 14.8 12.3
14.6 9.6 11.9 13.6 11.8
11.8 13.7 13.7 12.6 13.0
15.8 15.7 13.6 11.9 13.9

Steps
1. Find the range.
2. Choose a convenient interval size.
3. Draw up a table showing each of the intervals in one column, a tally mark column and a frequency column.
4. Go through the data list putting a tally mark for each entry.
5. Count up the tally marks and put this total in the frequency column. Add up the frequency column as a check.

Solution
Range: 16.4 − 9.6 = 6.8
Interval size: 1. (This gives eight intervals.)

Haemoglobin level   Tally marks   Frequency
9.0–10              |             1
10.0–11             ||            2
11.0–12             ||||/         5
12.0–13             ||            2
13.0–14             ||||/ ||      7
14.0–15             ||||/         5
15.0–16             ||            2
16.0–17             |             1
Total                             25

There are several things to notice about Worked Example 4. It shows one way of writing the class intervals. Another popular method is shown in the frequency table below. The right-hand end-point of the interval is not given; it is assumed to be (less than) the next left-hand end-point. For the last interval the pattern just needs to be continued.

Haemoglobin level   Tally marks   Frequency
9.0–                |             1
10.0–               ||            2
11.0–               ||||/         5
12.0–               ||            2
13.0–               ||||/ ||      7
14.0–               ||||/         5
15.0–               ||            2
16.0–               |             1

For any of these frequency tables we can add additional columns for relative frequency or percentage frequency, as shown in the table below. Both are useful if we wish to compare values in different frequency tables where the total for each table is different.

Haemoglobin level   Frequency   Relative frequency   Percentage frequency
9.0–10              1           0.04                 4%
10.0–11             2           0.08                 8%
11.0–12             5           0.20                 20%
12.0–13             2           0.08                 8%
13.0–14             7           0.28                 28%
14.0–15             5           0.20                 20%
15.0–16             2           0.08                 8%
16.0–17             1           0.04                 4%
Total               25          1.00                 100%

To find the relative frequency we divide the frequency by the total number of data values in the table, for example:
\[ \frac{1}{25} = 0.04 \]
and we multiply this value by 100 to convert to the percentage frequency:
\[ 0.04 \times 100 = 4\% \]
In both cases, we may need to round off to a sensible number of decimal places. The relative frequency column should add to 1 and the percentage frequency column should add to 100, although there may be slight variations from these totals due to the rounding process.

Stemplots

Although a frequency table is a useful way to record data, when we group data values we lose information. For example, look again at the final table in Worked Example 3; from the frequency table we can no longer identify the highest number of goals scored. Another way to record data that preserves this information is the stemplot. Stemplots are usually divided into intervals of 10, although if there are many values an interval of five may be used. Stemplots are sometimes called stem-and-leaf diagrams. This is because the leaf represents the units component of the number, while the stem is the other part.

Worked example 5
The following data represents the scores obtained by the students in a maths class on a topic test. Draw a stemplot for the data.

12 18 34 13 15 44 52 9 23 33
59 11 45 23 41 15 56 19 23 18

Steps
1. Draw up the outline of the table using the tens digit as the stem.
2. Go through the data list writing each value in the appropriate place.
3. Order the data.

Solution
Unordered:
STEM   LEAF
0      9
1      2 5 8 9 3 1 5 8
2      3 3 3
3      3 4
4      5 4 1
5      6 2 9

Ordered:
STEM   LEAF
0      9
1      1 2 3 5 5 8 8 9
2      3 3 3
3      3 4
4      1 4 5
5      2 6 9

Note: We can still clearly identify the highest score or, for that matter, the lowest score.
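The steps in Worked Example 5 can be sketched in Python as a check on the hand-drawn result. This is a minimal sketch, not part of the original text: the function name `stemplot` and its `stem_unit` parameter are illustrative choices, and it assumes the data are non-negative whole numbers.

```python
from collections import defaultdict


def stemplot(values, stem_unit=10):
    """Build an ordered stemplot for non-negative whole numbers.

    Returns a list of strings, one per stem, in the form 'stem | leaves'.
    (Hypothetical helper written for this illustration.)
    """
    leaves = defaultdict(list)
    for value in sorted(values):          # sorting first gives ordered leaves
        stem, leaf = divmod(value, stem_unit)
        leaves[stem].append(leaf)
    rows = []
    # Include every stem between the smallest and largest, even empty ones,
    # so that gaps in the data remain visible.
    for stem in range(min(leaves), max(leaves) + 1):
        leaf_text = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        rows.append(f"{stem} | {leaf_text}".rstrip())
    return rows


# Topic-test scores from Worked Example 5.
scores = [12, 18, 34, 13, 15, 44, 52, 9, 23, 33,
          59, 11, 45, 23, 41, 15, 56, 19, 23, 18]

for row in stemplot(scores):
    print(row)

# Unlike a grouped frequency table, the individual values can be recovered.
print("lowest:", min(scores), "highest:", max(scores))
```

Sorting the values before splitting them into stem and leaf combines steps 2 and 3, so the sketch produces the ordered stemplot directly.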
If we needed to break the stems because of the large number of leaf values, we would use a code such as 5L to represent the values in the lower half of the 50s (for example, 50, 51, 52, 53 and 54) and 5U for the upper half.

A stemplot cannot be used for true continuous data but, remember, we often treat continuous data as discrete. In these cases, we could use a stemplot. When the data values are large we sometimes need to round them off to be able to use a stemplot.

Exercise 3.2 Recording data

Short answer
1 Construct frequency tables for the following discrete data sets. (See Worked Example 2.)
(a) 2, 3, 5, 4, 3, 6, 1, 0, 2, 4, 3, 2, 1, 2, 4, 2
(b) 25, 26, 25, 24, 23, 22, 21, 20, 19, 25, 26, 22, 23, 24, 22, 21
(c) 0, 1, 0, 1, 2, 3, 1, 0, 3, 4, 0, 1, 4, 3, 2, 1, 0, 1, 5, 0, 0, 1, 2, 3, 2
(d) 32, 35, 33, 34, 32, 31, 30, 31, 32, 34, 35, 35, 32, 34, 32, 31, 30

2 Construct frequency tables for the following discrete data sets, using appropriate class intervals. (See Worked Example 3.)
(a) 1, 2, 8, 10, 22, 33, 42, 46, 41, 33, 32, 38, 51, 47, 46, 32, 17, 21, 24, 23, 27, 29, 36
(b) 101, 122, 134, 159, 145, 152, 147, 117, 109, 102, 117, 118, 134, 147, 148, 156, 159, 162, 155
(c) 15, 28, 42, 42, 43, 47, 51, 43, 46, 47, 48, 46, 16, 19, 22, 24, 26, 29, 31, 34, 37, 41, 38, 52, 54, 55
(d) 161, 162, 171, 175, 182, 183, 191, 183, 167, 169, 173, 178, 177, 192, 189, 190, 180, 170, 164, 168, 173, 175

3 Construct frequency tables for the following continuous data sets, using appropriate class intervals. (See Worked Example 4.)
(a) 16.1, 17.3, 18.4, 17.6, 18.2, 19.5, 19.7, 18.6, 19.2, 27.5, 26.7, 24.5, 26.5, 28.7, 29.6, 24.3, 26.1, 18.5, 19.7, 22.6, 23.5, 29.9, 28.1, 18.7, 21.2, 23.5, 26.7, 27.7, 26.5
(b) 57.2, 72.5, 83.9, 52.3, 49.9, 63.9, 71.7, 54.5, 64.3, 98.3, 42.9, 73.9, 57.2, 91.9, 83.1, 47.7, 53.7, 89.1, 37.3, 72.3, 56.8, 51.1, 78.5, 37.2, 92.1, 78.9, 38.6, 78.4, 89.3
(c) 155.5, 154.3, 198.5, 167.9, 176.5, 178.0, 156.8, 176.5, 154.9, 157.0, 187.8, 192.3, 178.0, 187.6, 175.0, 156.0, 184.6, 167.9, 165.8, 154.9, 158.8, 192.6, 178.7
(d) 19.6, 25.4, 19.5, 22.4, 24.6, 21.8, 20.7, 20.4, 23.5, 25.4, 22.7, 23.4, 21.5, 22.2, 23.6, 24.6, 20.2, 21.1, 23.1, 24.5, 22.1, 23.5, 24.1, 23.0, 24.0, 23.5, 23.1, 24.6

4 Draw stemplots for each of the following data sets. (See Worked Example 5.)
(a) 1, 2, 8, 10, 22, 33, 42, 46, 41, 33, 32, 38, 51, 47, 46, 32, 17, 21, 24, 23, 27, 29, 36
(b) 101, 122, 134, 159, 145, 152, 147, 117, 109, 102, 117, 118, 134, 147, 148, 156, 159, 162, 155
(c) 15, 28, 42, 42, 43, 47, 51, 43, 46, 47, 48, 46, 16, 19, 22, 24, 26, 29, 31, 34, 37, 41, 38, 52, 54, 55
(d) 161, 162, 171, 175, 182, 183, 191, 183, 167, 169, 173, 178, 177, 192, 189, 190, 180, 170, 164, 168, 173, 175

5 For the following frequency tables add a relative frequency column and a percentage frequency column.
(a)
Score       0    1    2    3    4    5
Frequency   3    6    7    4    3    1
(b)
Score       4    5    6    7    8    9    10
Frequency   15   23   14   23   17   19   20
(c)
Score       10–19   20–29   30–39   40–49   50–59   60–69   70–79
Frequency   15      23      17      15      13      20      5
(d)
Score       101–110   111–120   121–130   131–140   141–150   151–160   161–170
Frequency   9         4         7         1         4         3         7
(e)
Score       0–10   10–20   20–30   30–40   40–50   50–60
Frequency   9      18      13      15      14      16
(f)
Score       141–146   146–151   151–156   156–161   161–166   166–171
Frequency   22        31        21        26        22        24

Multiple choice
6 If we have a frequency table drawn, then:
A the relative frequency column should add to 100
B the percentage frequency column should add to 100 and the frequency column should add to 1
C the frequency column should add to a number less than 20
D the relative frequency column should add to 1 and the percentage frequency column should add to 100
E both the relative frequency column and the percentage frequency column should add to 100

7 A data set that has a range of 120 is to be recorded in a grouped data frequency table. The size of the groups would be best set at:
A 2   B 5   C 10   D 20   E 25

Extended answer
8 The following data represents the number of kilometres travelled each day by a solar powered vehicle.

112 95 122 78 99 145 133 160 145 144 78 66 68 79 93 125 133 89 67 78 114 134 80 65 105 115 127 134 117 84 104

(a) Given the way the data has been recorded would you consider it to be discrete or continuous?
(b) Construct a frequency table for the data, using a class interval of 10 and a starting value of 60.
(c) Looking only at the frequency table what can you say about the maximum distance travelled on any day?
(d) Construct a stemplot for the data.
(e) What can a stemplot tell us that a frequency table cannot?
(f) What does this question reinforce?

3.3 Simple data displays

Bar graphs

A bar graph (or chart) is used to display categorical (nominal) information.
Bar graphs can have horizontal bars or vertical bars. Consider the bar chart below. Note there is no scale on the axis showing the categories (horizontal in this case). Gaps are left between the bars to emphasise that we are not dealing with continuous data. We can simply read values from the graph; for example, 15 of the students sampled rode a bike to school on the day of the survey. Bar charts can easily be constructed from data presented in a frequency table.

[Figure: bar chart of Number (0 to 15) against Mode of transport: car, walk, bus, bike, train.]

Worked example 6
Draw a bar chart to represent the following data collected as a result of a survey.

sedan, sedan, ute, truck, stationwagon, sedan, sedan, sedan, truck, motorbike, sedan, stationwagon, sedan, ute, truck, sedan, sedan, stationwagon, sedan, ute, truck, ute, truck, truck, stationwagon, sedan, sedan, sedan, truck, ute, sedan, stationwagon, sedan, ute, truck, ute, sedan, stationwagon

Steps
1. Construct a frequency table for the information.
2. Draw the bar chart, remembering to leave a gap between each of the categories.

Solution
Type of vehicle   Frequency
sedan             16
ute               7
truck             8
stationwagon      6
motorbike         1
Total             38

[Figure: bar chart of Frequency (0 to 18) against Type of vehicle: sedan, ute, truck, stationwagon, motorbike.]

Divided bar charts

Another type of data display is the divided bar chart, where a single rectangle is divided into pieces according to the contribution of the category to the whole. An appropriate scale needs to be chosen to make the graph easier to draw. Sometimes, we find the percentage contribution for each category and base the scale on those values.

Worked example 7
Draw a divided bar chart to represent the number of grams of various components in a hot dog.

Component       No. of grams
Protein         17
Fat             18.2
Saturated fat   6.3
Carbohydrates   34.5
Sugar           2.6

Steps
1. Find the total number of grams involved.
2. Draw a rectangle of a suitable length and divide it into pieces according to the category values.

Solution
Total: 78.6 g

[Figure: divided bar on a scale of 0 to 100 grams, with segments protein 17, fat 18.2, saturated fat 6.3, carbohydrates 34.5, sugar 2.6.]

Frequency diagrams

When dealing with numerical data we should not really use a bar chart. However, the temptation is strong when we have discrete data. We sometimes use strips rather than bars in these cases to make the diagram look different from a bar chart. We might call this a frequency diagram.

[Figure: frequency diagram of Frequency (0 to 20) against No. of passengers (0 to 4).]

Note that this time the data value axis needs a label, but the graph does not have a title. The axis labels tell us everything we need to know to be able to interpret the diagram. Again, there are gaps to emphasise the discrete nature of the data. Grouped discrete data can also be represented in a frequency diagram. The horizontal axis scale values would be the intervals.

Worked example 8
Draw a frequency diagram to represent the following discrete data set.

Score   Frequency
7       4
8       8
9       6
10      2
11      1
Total   21

Steps
1. Draw a set of axes marking Score on the horizontal axis and Frequency on the vertical axis.
2. Complete the graph.

Solution
[Figure: frequency diagram of Frequency (0 to 8) against Score (7 to 11).]

Histograms

When dealing with continuous numerical data we draw a diagram known as a histogram. A histogram is like a bar chart, but there are some important differences. One of these is that the bars are joined together. There are no gaps in the data values so we leave no gaps in the diagram. However, we do usually start the first column one-half column width from the vertical axis.

We can use Worked Example 4 about the haemoglobin level in the blood to illustrate a histogram. The frequency table is repeated to remind us of the data values.
Haemoglobin level   Frequency
9.0–10              1
10.0–11             2
11.0–12             5
12.0–13             2
13.0–14             7
14.0–15             5
15.0–16             2
16.0–17             1
Total               25

[Figure: histogram of Frequency (0 to 7) against Haemoglobin level (9 to 17).]

Note the way the interval end-points are used as the dividers on the horizontal axis. A break symbol on the horizontal axis is used to indicate that part of the axis has been left out.

When we set up our own frequency tables we emphasised the need to make the intervals the same size. However, when we use frequency tables that have already been set up we do not always have this luxury. If the intervals are not the same size then we need to be very careful. Consider the frequency table below, which gives information about road deaths of males in Victoria in 2002.

Age range       0–   5–   13–   16–   18–   22–   26–   30–   40–   50–   60–   75+   Total
No. of deaths   3    9    3     14    48    39    27    50    33    25    29    19    299
Source: CrashStats, owned by VicRoads

Now, look at the diagram below, which is meant to be the histogram illustrating this data. Notice that we have taken 90 to be the upper limit for the age of the fatalities.

[Figure: histogram of No. of deaths (0 to 50) against Age (0 to 90).]

It seems to indicate that the most dangerous age range is 30–40 years, as this is when there are the most fatalities. But this is not correct; because the interval is larger we would expect more deaths. We need a relative measure to make such a decision. To do this, we introduce a new term, frequency density, which is defined as follows.

\[ \text{Frequency density} = \frac{\text{percentage frequency}}{\text{class width}} \]

The following worked example shows us how to draw a histogram correctly when we have inconsistent intervals.

Worked example 9
Draw a histogram for the data shown in this frequency table.

Age range       0–   5–   13–   16–   18–   22–   26–   30–   40–   50–   60–   75+   Total
No. of deaths   3    9    3     14    48    39    27    50    33    25    29    19    299
Source: CrashStats, owned by VicRoads

Steps
1. Add a percentage frequency column if not already there. Add a frequency density column using:
\[ \text{Frequency density} = \frac{\text{percentage frequency}}{\text{class width}} \]
For example, for the first interval:
\[ \text{percentage frequency} = \frac{3}{299} \times 100 = 1.0\% \]
\[ \text{frequency density} = \frac{1.0}{5} = 0.2 \]
Note the use of Σ (the Greek letter sigma) to represent ‘the sum of’.
2. Draw the histogram, placing frequency density on the vertical axis. (In the histogram the frequency density values have been rounded off to one decimal place.)

Solution
Age range   No. of deaths   Percentage frequency   Frequency density
0–          3               1.0                    0.2
5–          9               3.0                    0.375
13–         3               1.0                    0.333
16–         14              4.7                    2.35
18–         48              16.1                   4.025
22–         39              13.0                   3.25
26–         27              9.0                    2.25
30–         50              16.7                   1.67
40–         33              11.0                   1.1
50–         25              8.4                    0.84
60–         29              9.7                    0.647
75+         19              6.4                    0.427
            Σf = 299

[Figure: histogram of Frequency density (0 to 5) against Age (0 to 90).]

From this, we can see that the most dangerous age range is actually 18–22. This would fit with anecdotal evidence which suggests that young males who have just obtained a driving licence are more dangerous on the road than other groups.

Cumulative frequency diagrams

When we have a continuous data set recorded in class intervals we can draw what is called a cumulative frequency diagram. This shows the number of data values less than a particular value. To help in producing a cumulative frequency diagram we add a new column to our frequency table and label it cumulative frequency. We also put in a new first row to emphasise that we are finding the number of data values less than a given value. However, we should note that the cumulative frequency column tells us how many values there are less than the right-hand end-point of the interval.
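The cumulative frequency column described above is a running total of the frequency column, with an extra starting row of 0. The following is a minimal sketch of that calculation, using the haemoglobin frequencies from Worked Example 4; the function name `cumulative_frequency` and its parameter names are illustrative assumptions, not notation from this chapter.

```python
from itertools import accumulate


def cumulative_frequency(boundaries, frequencies):
    """Pair each class boundary with the number of values below it.

    boundaries  -- class end-points, one more than the number of classes
    frequencies -- frequency of each class, in order
    (Hypothetical helper written for this illustration.)
    """
    if len(boundaries) != len(frequencies) + 1:
        raise ValueError("need one more boundary than there are classes")
    # The extra first row: no values lie below the first left-hand end-point.
    running_totals = [0] + list(accumulate(frequencies))
    return list(zip(boundaries, running_totals))


# Haemoglobin levels from Worked Example 4 (classes 9.0-10, 10.0-11, ...).
boundaries = [9, 10, 11, 12, 13, 14, 15, 16, 17]
frequencies = [1, 2, 5, 2, 7, 5, 2, 1]

table = cumulative_frequency(boundaries, frequencies)
for boundary, total in table:
    print(f"less than {boundary}: {total}")

# Check: the final cumulative frequency equals the sum of the frequencies.
assert table[-1][1] == sum(frequencies)
```

Each pair gives a point for the cumulative frequency diagram: the boundary goes on the horizontal axis and the running total on the vertical axis.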
Worked example 10
Draw a cumulative frequency diagram for the following data set which gives the mass (kg) of pupils at a small rural school.

Class interval (mass, kg)   Frequency (number of students)
10–20                       5
20–30                       8
30–40                       16
40–50                       5
                            Σf = 34

Steps
1. Add the new first row and new column to the frequency table.
2. Fill in the cumulative frequency column. Notice that the new row has an entry of 0. Add on each frequency as you go down the column. (Remember, this tells us, for instance, that there are 13 values less than 30.) Also notice that the final entry in the cumulative frequency column is equal to the sum of the frequency column.
3. Draw the graph. Put cumulative frequency on the vertical axis and record these values against the right-hand end-point of the class interval. Join the points with straight-line segments.

Solution
Mass (kg)   Frequency   Cumulative frequency
10                      0
10–20       5           5
20–30       8           13
30–40       16          29
40–50       5           34
            Σf = 34

[Figure: cumulative frequency diagram of Cumulative frequency (0 to 40) against Mass (kg) (10 to 50).]

Sometimes, cumulative frequency diagrams are drawn with a smooth curve, called an ogive, passing through the points rather than using straight-line segments. For our purposes it does not make any real difference which method is used.

You can use the cumulative frequency curve to obtain approximate answers to questions such as ‘What percentage of students are less than 35 kg?’ and ‘Under what mass are 70% of the students?’

Worked example 11
For the cumulative frequency graph shown, find approximate answers to the following questions.
(a) What percentage of students are less than 35 kg? Give your answer correct to the nearest per cent.
(b) Under what mass are 70% of the students? Give your answer correct to the nearest kg.

[Figure: cumulative frequency diagram of Cumulative frequency (0 to 40) against Mass (kg) (10 to 50).]

Steps
(a) 1. Draw a line from the given mass on the horizontal scale to the graph and then, from this point, draw a line across to the vertical scale.
2. Read off the value and convert this to a percentage. Write this correct to the nearest per cent.
(b) 1. Calculate a value to represent the given percentage of students.
2. Draw a line from the calculated value on the vertical axis across to the graph and then a line, from this point, down to the horizontal axis.
3. Read off the value.

Solutions
(a) [Figure: the graph with a line drawn up from 35 kg to the curve and across to the vertical scale.]
The value read off is 20 students.
\[ \frac{20}{34} \times 100 = 58.82\% \]
59% of students are less than 35 kg.
(b) 70% of 34 = 23.8
[Figure: the graph with a line drawn across from 23.8 to the curve and down to the horizontal axis.]
70% of students are less than approximately 37 kg.

Calculating percentiles using CAS

We can use our CAS to find only approximate values for percentiles. These arise from questions such as ‘Under what mass are 70% of the students?’ Of course, you need to have also been given the distribution of the students’ mass. We have the following information.

Class interval (mass, kg)   Frequency (number of students)
10–<20                      5
20–<30                      8
30–<40                      16
40–<50                      5

We need to enter the cumulative frequency values as the second column in our list. As a check, the total frequency is 34 so this should be the value in the last row.

Using the TI-Nspire CAS
1. Enter the data in the Lists & Spreadsheet application. Call the first column x and use the right-hand boundary values of the class intervals, and call the second column y. Include the point (10, 0) to give the graph a starting point.
2. Insert a new page (ctrl + I) and choose Data & Statistics. Label the axes as seen previously.
3. Join the dots by pressing menu > Plot Type > XY Line Plot.
4. To estimate the 70th percentile we need to know 70% of the cumulative frequency. In our case this is 0.7 × 34 = 23.8. We now want to draw in the line y = 23.8. To do this press menu > Analyse > Plot Function and fill in the dialog box. We can now approximate the x-value of the point of intersection. So, the 70th percentile is approximately 37.

Using the ClassPad
1. To make this process work best we need to do it slightly out of order. The first step is to actually draw the line equivalent to the percentile value in question and then to do the line graph. To estimate the 70th percentile we need to know 70% of the cumulative frequency. In our case this is 0.7 × 34 = 23.8. Enter this as y = 23.8 in Graph & Table mode and draw as usual. The values on the axes are not really important at this stage.
2. Enter the data in Statistics mode. Use the right-hand boundary values of the class intervals for list1, and the cumulative frequency values for list2. Include the point (10, 0) to give the graph a starting point.
3. Tap SetGraph > Setting and make StatGraph1 the scatter plot as shown previously and StatGraph2 an xy line plot of the same lists.
4. Tap the icon that draws the statistics graphs and the cumulative frequency curve appears. Tap the icon that draws the function graph and the horizontal line will appear. We can now approximate the x-value of the point of intersection. Just tap the screen at the point of intersection and the coordinate pair will appear at the bottom of the screen. This shows the 70th percentile is approximately 37.

Summary

Recording data
• A frequency table is an efficient way of recording data. It records, in column format, the possible data values and the associated frequencies.
• Class intervals are used when the range of values recorded is too great for a meaningful ungrouped arrangement.
• Relative frequency and percentage frequency columns can be added to frequency tables.
Relative frequency is found by dividing the frequency by the total frequency, and we multiply this value by 100 to obtain the percentage relative frequency.

Data types
• There are two types of categorical data, nominal (such as type of fruit) and ordinal (such as strongly agree, agree …) and two types of numerical data, discrete (such as number of brothers and sisters) and continuous (such as height of players in a team).

Simple data displays
• Stem-and-leaf diagrams are another way of recording data. In a stem-and-leaf diagram, the leaf is the units value of the individual datum value and the stem is the other part of the number. The advantage of a stem-and-leaf diagram is that it does not lose any of the detail in the data.
• A bar chart is used to represent categorical (nominal) data.
• A frequency diagram is used to represent numerical (discrete) data whether it is discrete or grouped.
• A histogram is used to represent numerical (continuous) data.
• Frequency density is used when we are trying to draw a histogram and the class intervals are not equal.
• Continuous data can also be represented in a cumulative frequency curve which indicates the number of data values less than a particular value.

Values of central tendency and spread
• There are three measures of central tendency used:
– the mean is calculated by finding $\frac{\text{sum of data values}}{\text{number of data values}}$; this can be written as $\frac{\Sigma x}{n} = \frac{\Sigma x}{\Sigma f}$.
– the median is the physical middle of the data set. For odd n, the median is the $\left(\frac{n}{2} + 0.5\right)$th value; for even n, the median is the mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2} + 1\right)$th values.
– the mode is the most frequently occurring data value.
• If a data set has two values with the same highest frequency we say the set is bimodal.
• Grouped discrete data is difficult to deal with as far as the median is concerned. We can, at best, estimate the value.

Data analysis methods and tools
• When we divide a data set into a number of equal pieces we call these pieces quantiles. Some quantiles have special names:
– if we divide the data set into four equal pieces we have quartiles
– if we choose 10 equal pieces we have deciles
– if we choose 100 equal pieces we have percentiles.
• When we divide a data set into quartiles we obtain the five-figure summary: the lowest value, the lower quartile (Q1 or QL), the median (or Q2), the upper quartile (Q3 or QU) and the highest value.
• The interquartile range is the difference between the upper quartile and the lower quartile (IQR = Q3 − Q1). It is a robust measure of spread unaffected by extreme values.
• An outlier is located more than 1.5 × IQR away from the nearer of Q1 and Q3.
• A boxplot is a way of representing the spread of a data set. It uses each of the values from the five-figure summary: lowest value, lower quartile, median, upper quartile and highest value.
• Standard deviation is a measure of spread which uses all of the data values; however, it can be affected by extreme values.
• Your CAS can be used to produce all of the statistics we are interested in as well as to draw boxplots.
• Skewness is a way of describing how far a data set is from being symmetrical.
[Figure: three distribution shapes labelled Negative skew, Symmetrical and Positive skew.]
Negative skew: mean < median < mode
Positive skew: mode < median < mean
• The degree of skewness = $\frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$.
• Composite bar charts, back-to-back stem-and-leaf plots and comparative boxplots are all used to assist in the analysis of data sets.

Use the following to check your progress. If you need more help with any questions, turn back to the section given in the side column, look carefully at the explanation of the skill and the worked examples, and try a few similar questions from the Exercise provided.
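Before attempting the questions, two results from the summary can be checked numerically. The sketch below is only an illustration, not the CAS procedure from the chapter: it estimates a percentile from cumulative frequency values by joining the plotted points with straight lines, and it evaluates the degree of skewness formula. The function names are made up for this example, and two assumptions are built in: straight-line interpolation between points (a hand-drawn smooth curve may read slightly differently) and the sample standard deviation (the summary does not say which standard deviation is intended).

```python
from statistics import mean, median, stdev


def percentile_from_cumulative(boundaries, cum_freq, percent):
    """Estimate the value below which `percent` per cent of the data lie.

    boundaries: class boundaries, starting at the lower boundary of the first class
    cum_freq:   cumulative frequency at each boundary (so it starts at 0)
    Assumption: neighbouring points are joined by straight lines.
    """
    target = percent / 100 * cum_freq[-1]  # e.g. 70% of 34 = 23.8
    for i in range(1, len(boundaries)):
        lower_cf, upper_cf = cum_freq[i - 1], cum_freq[i]
        if upper_cf > lower_cf and lower_cf <= target <= upper_cf:
            # How far along this segment the target cumulative frequency sits
            fraction = (target - lower_cf) / (upper_cf - lower_cf)
            width = boundaries[i] - boundaries[i - 1]
            return boundaries[i - 1] + fraction * width
    raise ValueError("percent must lie between 0 and 100")


def degree_of_skewness(data):
    """3(mean - median) / standard deviation, using the sample standard deviation."""
    return 3 * (mean(data) - median(data)) / stdev(data)


# Mass data from the worked example: frequencies 5, 8, 16, 5 give
# cumulative frequencies 5, 13, 29, 34, plus the starting point (10, 0).
mass_boundaries = [10, 20, 30, 40, 50]
mass_cum_freq = [0, 5, 13, 29, 34]

p70 = percentile_from_cumulative(mass_boundaries, mass_cum_freq, 70)
print(round(p70, 2))  # about 36.75, i.e. approximately 37 kg

# A symmetrical data set has mean = median, so its degree of skewness is 0.
print(degree_of_skewness([1, 2, 3, 4, 5]))
```

The percentile estimate agrees with the value of approximately 37 kg read from the graph. A positive result from `degree_of_skewness` corresponds to positive skew (mean above median) and a negative result to negative skew.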
Short answer
1 Select the correct data type for each of the following examples. You are to choose from: nominal, ordinal, discrete and continuous. [3.1]
(a) The height jumped by each of the competitors in the school high jump competition.
(b) The shirt size of each of the students in your class.
(c) The favourite colour of each of the students in your class.
(d) The highest level of education achieved by each of the respondents to a survey.

Questions 2 to 8 relate to Sections 3.2 to 3.5.

2 The following data set represents the number of siblings for a sample of students in a school.
0 3 2 1 0 2 3 2 1 2 5 3 1 0 2 3 2 1 6 3 2 5 3 0 2 3 1 0 1 1 1 4 3 2 5 4 2 0 1 1 2 3 2 1 4 5 2 1
Construct a frequency table for the data set and include both a relative frequency column, stated to three decimal places where necessary, and a percentage frequency column, stated to one decimal place where necessary.

3 Draw a stemplot for the following data set.
25 38 65 32 77 21 79 81 66 50 47 53 25 30 42 60 70 29 51 63 68 82 40 33 22 45 37 65 74 70 35 61 81 77 65 66

4 Draw a divided bar chart to represent the number of grams of various components in a fried dim sim.

Component        No. of grams
Protein          5
Fat              4.3
Saturated fat    2.2
Carbohydrates    13.0
Sugar            3.1

5 Draw a frequency diagram for the data set below.

Number of pets   0  1  2  3  4  5  6
Frequency        3  3  5  6  1  2  1

6 Find two estimates of the median for this data set.

x    0–4  5–9  10–14  15–19  20–24
f    2    5    7      3      1

7 The data set shown below represents the heights of the players in the under 16 football squad at the local football club.
(a) Draw a cumulative frequency curve and then use it to find:
(b) an estimate for the median
(c) an estimate for the IQR.

Class interval   130–  140–  150–  160–  170–  180–
Frequency        3     3     4     12    4     2

8 Find the standard deviation of the following sample data set using your calculator.
1, 1, 2, 2, 2, 2, 3, 4, 4, 5, 6, 7, 8, 8, 8, 9, 9, 9, 9

9 Find the five-figure summary for the data set below. [3.5]
STEM 13 14 15 16 17 18 19
LEAF 1 2 0 2 0 2 0 1 3 0 3 0 2 0 2 3 0 4 1 2 1 3 3 1 4 2 5 5 5 2 5 3 7 6 7 3 8 4 8 7 9 9 3 3 3 9 9

10 For the 2003 season, each AFL club had 38 players on its senior list. The data represents the weight (kg) of the players for two clubs. Draw a back-to-back stemplot and use this to assist in comparing the two teams. [3.6]
Geelong: 83 84 81 97 83 84 89 90 93 92 86 96 96 94 96 84 87 85 106 94 89 84 87 79 101 91 86 97 96 86 88 86 90 94 78 104 96 94
Western Bulldogs: 77 98 91 77 89 81 90 76 94 80 98 84 79 72 77 75 75 98 94 82 94 96 84 89 85 72 94 74 81 88 83 85 82 83 78 101 83 85

11 The table indicates the GDP for various countries in the OECD. The figures represent the % GDP for each country. Draw a comparative frequency diagram to help you analyse the data. [3.6]

Country        2000   2001   2002
Australia      3.0    2.7    3.3
Canada         5.3    1.9    3.3
Japan          2.8    0.4    0.2
Korea          9.3    3.1    6.3
New Zealand    3.7    2.2    4.2
USA            3.8    0.3    2.4
Brazil         3.9    1.5    1.5
China          8.0    7.3    8.0
Russia         9.1    4.9    4.3

12 The bar chart shown gives the number of fatalities on Victorian roads for 2003 for a number of categories of road user, divided by sex. Discuss the content of the chart. [3.6]
[Bar chart: Traffic fatalities 2003. Vertical axis: number of fatalities (0 to 100). Series: Male, Female. Categories: Driver, Passenger, Pedestrian, Motor cyclist, Bicyclist, Pillion passenger.]

Multiple choice
13 The collar size of shirts worn by the pupils in a class is recorded. The most precise way of describing the data collected would be as: [3.1]
A continuous B categorical C nominal D ordinal E discrete

14 A data set with a range of 300 is to be recorded in a grouped data frequency table.
The size of the groups would be best set at: [3.2]
A 5 B 10 C 15 D 50 E 100

15 The following table gives the number of grams of various components in a 150 g packet of salted peanuts. [3.3]

Component        No. of grams
Protein          37.2
Fat              70.95
Saturated fat    8.1
Carbohydrates    13.5
Sugar            7.65

The percentage of the mass of peanuts that is protein is closest to:
A 70.95% B 37.4% C 37.2% D 27.1% E 29.05%

The data set below is to be used for questions 16 to 18. [3.4]

x    1  2  3  4  5
f    3  5  2  4  2

16 The mean of the data set is closest to:
A 2.5 B 5 C 2.8 D 4 E 2
17 The median of the data set is closest to:
A 2.5 B 5 C 2.8 D 4 E 2
18 The mode of the data set is closest to:
A 2.5 B 5 C 2.8 D 4 E 2

19 You are given the data set below. The mean of the data set is closest to: [3.4]
A 34.5 B 25 C 25.5 D 29 E 24.5

x    0–9  10–19  20–29  30–39  40–49
f    6    2      7      8      3

The data set below is to be used for questions 20 to 21. [3.4]

x    0–5  5–10  10–15  15–20  20–25
f    2    5     7      3      1

20 The mean of the data set is closest to:
A 11.4 B 12.5 C 11.9 D 10.9 E 12.2
21 The modal class of the data set is best represented by:
A 5–10 B 5 C 10–15 D 7 E 12.5

The data set below is to be used for questions 22 and 23.
STEM 3 4 5 6 7 8 9
LEAF 0 3 8 1 0 1 2 1 3 9 3 6 2 2 4 6 7 3 3 4 5 4 7 2 2 6 7 2 5 6 6 7 8 9 5 6 7

22 The median of the data set is best represented by: [3.4]
A 6 B 30 C 43 D 63 E 66
23 The IQR of the data set is best represented by: [3.5]
A 38.5 B 43.5 C 38 D 44 E 39

Questions 24 to 28 relate to Sections 3.5 and 3.6.

24 The IQR for the data set:
2, 4, 5, 2, 4, 3, 1, 4, 6, 5, 4, 7, 8, 6, 8, 4, 5, 6, 2, 4, 3, 1, 3, 5, 7, 9, 1, 3, 2, 4, 6, 5, 3, 2, 1, 5, 2, 5, 6
is best represented by:
A 2 B 4 to 5 C 3 to 6 D 4 E 6

25 For the data set below the IQR is best represented by:
A 8 B 4 C 2.5 D 13 E 15

x    8  13  15  19  21  25
f    4  8   6   3   1   2

26 Look at the data set given. The degree of skewness is closest to:
A 66.675 B −0.049 C 11.258 D 0.247 E −0.146

STEM   LEAF
6      1 2 5 5 6 9
7      0 0 1 1 3 3 3 4 6 8 9
8      4 5 5 5 6 6 7
9      3 3 6 9 9
10     0 1 3 6
11     0 0 2

Extended answer
27 The table shows the total road deaths by age group.

Age range       0–   16–  18–  21–  26–  40–  60–70
No. of deaths   30   15   45   58   115  64   93

(a) Draw a histogram to show this data. (Be careful!)
(b) Draw a cumulative frequency curve and use it to estimate the age:
(i) under which 60% of deaths occurred
(ii) under which 18% of deaths occurred
(iii) over which 70% of deaths occurred.
(c) Draw a boxplot for the data, clearly identifying the five-figure summary values.
(d) Describe the data set including some mention of its degree of skewness.

28 Two brands of rechargeable batteries were tested to compare their lifetimes, in hours, before they required recharging. Samples of 10 batteries of each brand were tested in the same equipment, producing the following times before requiring recharging.
Big Zap: 7, 22, 9, 24, 16, 22, 25, 26, 23, 26
Sparky: 19, 20, 21, 15, 17, 15, 39, 23, 14, 15
(a) Calculate the mean, median, standard deviation and interquartile range of lifetime before recharging for each brand of battery.
(b) Represent the data as a back-to-back stemplot.
(c) Construct comparative boxplots for the lifetimes of the two brands of batteries.
(d) If you wanted to be confident that the battery you purchased would last at least 14 hours, which battery would you buy?
(e) If you wanted to maximise your chances of the battery you purchased lasting over 20 hours, which battery would you buy?
(f) Which battery tends to last longer? Justify your answer with reference to the summary statistics and diagrams.

exam focus 4
VCAA 2004 Further Mathematics Units 3 & 4, Exam 1, Section A, Questions 1 & 2
The following information relates to Questions 1 and 2.
The marks obtained by students who sat for a test are displayed as an ordered stemplot as shown.
STEM 0 1 2 3 4 5
LEAF 9 0 1 2 5 6 0 1 1 1 3 5 5 7 8 9 9 1 2 3 4 4 6 7 7 0
1. The number of students who sat the test is
A. 25 B. 26 C. 27 D. 32 E. 50
2. The interquartile range of these test marks is closest to
A. 9 B. 13 C. 30 D. 36 E. 41

exam focus 5
VCAA 2004 Further Mathematics Units 3 & 4, Exam 1, Section A, Questions 4 & 5
The following information relates to Questions 1 and 2.
The number of DVD players in each of 20 households is recorded in the frequency table below.

Number of DVD players   0  1  2  3  4  5  Total
Frequency               6  9  3  1  0  1  20

1. For this sample of households, the percentage of households with at least one DVD player is
A. 30% B. 45% C. 50% D. 70% E. 90%
2. For this sample of households, the mean number of DVD players in these 20 households is
A. 0.75 B. 1.00 C. 1.15 D. 1.64 E. 2.00
© Copyright 2024