Australia has received migrants from many countries over
many years. Immediately after the end of World War 2
there was a surge of migration from European countries, as
people tried to find a better life for themselves and their
families. War, or civil unrest, has often been the spur for
migration to Australia. Whatever the reason, and wherever
they are from, migrants have added to Australian culture in
innumerable ways.
Census statistics indicate that the spread of migrants
throughout Victoria has been extensive. In the Greater
Geelong area, 19.6% of the population was born overseas,
while in the Latrobe Valley, 15.6% of the population was
born overseas. Through the data generated in the census,
the waves of migration can be tracked. The figures by
themselves could be misinterpreted—analysis of the data,
alongside a reading of history, allows us to interpret the
data more successfully.
Prepare for this chapter by attempting the following questions. If you have difficulty with
a question, click on the Replay Worksheet icon on your Exam Café CD or ask your teacher
for the Replay Worksheet.
Worksheet R3.1
1 Find the range of each of the following data sets.
(a) 5, 7, 7, 8, 9, 9, 10, 11, 13, 15
(b) 4, 8, 12, 13, 16, 23, 15, 19, 17, 16
−
−
−
(c) 6, 2, 1, 5, 3, 2, 5, 4, 7
Worksheet R3.2
2 For each of the following data sets find:
(i) the mean
(ii) the median
(iii) the mode
(a) 1, 2, 3, 3, 4, 5, 6
(b) −3, −3, −2, −1, 0, 5, 8, 10, 11
(c) 4, 5, 2, 2, 4, 3, 3, 2, 5, 4
(d) 10, 11, 22, 19, 12, 18, 15, 11
Worksheet R3.3
3 The graph shows the mean rainfall
data for Tennant Creek in the
Northern Territory.
(a) How many rain days are usual
for January?
(b) Which is the wettest month?
(c) Which month(s) has the most
rain days?
(d) Which month(s) has the least
rain days?
(e) About how much rain is
recorded in Tennant Creek over
the course of the year?
[Figure: average monthly rainfall (Rainfall (mm), scale 0–125) and No. of rain days (scale 0–30) for each month, J to D.]
$\text{Mean} = \dfrac{\text{sum of data values}}{\text{number of data values}}$
Median is the physical middle value in the data set (the data must be put in order first).
Mode is the most frequently occurring data value.
Range is the difference between the highest data value and the lowest data value.
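The four definitions above can be checked with a short program. This is only a sketch using Python's standard `statistics` module; the helper name `summarise` is made up for illustration, and the data is set 2(a) from the questions above.

```python
from statistics import mean, median, multimode

def summarise(values):
    """Return the mean, median, mode(s) and range of a list of numbers."""
    return {
        "mean": mean(values),                # sum of values / number of values
        "median": median(values),            # middle value once the data is ordered
        "modes": multimode(values),          # a list, because there can be ties
        "range": max(values) - min(values),  # highest value minus lowest value
    }

data = [1, 2, 3, 3, 4, 5, 6]  # data set 2(a)
summary = summarise(data)
print(summary)
```

Returning the modes as a list is a design choice: a data set can have more than one most frequent value.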
Heinemann VCE Zone: Advanced General Mathematics Enhanced
3.1 Types of data
Data is information of some kind. (Data is the plural of the word datum, although now data is
commonly used grammatically as a singular word.)
Data can take many forms. Consider the following questions which might appear as part of
a survey (although probably not the same one!).
My favourite fruit is: Apple / Banana / Orange / Mango / Mandarin / Other
The death penalty should be reintroduced as the punishment for serious crimes.
strongly agree / agree / not sure / disagree / strongly disagree
The data collected in both cases is known as categorical data. But can you see a difference
between the data in each case? In the first case, the categories are simply fruits and have no
natural order; this is known as nominal data. In the second case, there is an order
implied in the categories offered; this is known as ordinal data.
Quite often, the data collected consists of numbers and, not surprisingly, is known as
numerical data. Consider these questions which might appear in surveys.
• What was the year of your birth?
• What is your height?
In the first case, we record answers, data values, such as 1960, 1984 and 1989, and there is
no interpretation needed. Data like this is known as discrete data. With discrete data you can
count the different values possible.
Discrete data does not need to be whole numbers. For example, shoe sizes are discrete, but the data
values are not necessarily whole numbers: … 6, 6½, 7, 7½ …
In the second case above, we need to interpret the question. We might write 180 cm or
179.8 cm or 179.82 cm (although this degree of accuracy is unlikely). The value recorded varies
depending on the degree of accuracy requested or the degree of accuracy possible with our
measuring instrument. Data like this is known as continuous data; the values can be any (real)
numbers within a particular range.
We make this distinction between discrete and continuous data because there are different
things we can do with each data type. However, we very often treat continuous data as if it were
discrete. This is because we must record the data in some way. With the height example above,
if the question said to record the height to the nearest cm, then we would have the discrete
possibilities 178, 179, 180 etc.
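As a small illustration of treating continuous data as discrete, the sketch below records some heights to the nearest centimetre. The height values are invented for the example and are not taken from the text.

```python
# Hypothetical measured heights in cm (illustrative values only).
heights_cm = [179.82, 178.4, 180.49, 177.61]

# Recording to the nearest cm turns each continuous measurement into one of
# the discrete possibilities 177, 178, 179, 180, ...
# Note: Python's round() sends exact halves (e.g. 178.5) to the even neighbour.
recorded = [round(h) for h in heights_cm]
print(recorded)
```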
worked example 1
Decide whether these data samples are categorical or numerical.
(a) The number of cars in the staff car park recorded daily for 3 weeks.
(b) The favourite type of fast food for the members of your class.
Steps
(a) Will the data collected involve numbers? If yes,
the data is numerical, otherwise it is categorical.
(b) Will the data collected involve numbers? If yes,
the data is numerical, otherwise it is categorical.
Solutions
(a) The data is clearly numerical.
(b) As numbers are not involved, this data is
categorical.
3 ● descriptive STATISTICS
exercise 3.1
Types of data
Short answer
1 Decide whether these data examples are categorical (C) or numerical (N).
(a) The amount of change given to customers in a shop.
(b) The voting intention recorded by a sample of people in a poll conducted before a
State election.
(c) The eye colours of the students in your General Mathematics class.
(d) The ages, in years, of the teachers at your school.
(e) The individual mass of each sheep in a paddock.
(f) The brand of shoe worn by each Member of Parliament.
2 Decide whether these categorical data examples are nominal (M) or ordinal (O).
(a) The favourite TV show of the members of your family.
(b) The degree of support for the retention of school uniforms recorded by individual
students at a particular school.
(c) The starting letter of the company code for each company listed on the Australian
Stock Exchange (ASX).
(d) The listing of the top 10 CDs for the week.
(e) The colour of cars in the used-car lot.
(f) The brand of toothbrush used by a sample of dentists.
3 Decide whether these numerical data examples are continuous (T) or discrete (D).
(a) The mass of wool taken from each animal in a flock of sheep.
(b) The number of brands of grocery items available at the supermarket.
(c) The number of spectators at each of the Melbourne Redbacks home games.
(d) The time taken by each competitor in the Olympic 100 m final for women.
(e) The height of each individual tree in a forestry coop.
(f) The distance between each of the capital cities in Australia.
4 Imagine you were setting up a number of survey instruments. Prepare a question for each
of the following and indicate the data type which will be collected. Use the code M:
nominal; O: ordinal; D: discrete; T: continuous.
(a) You are trying to find out the most popular make of car.
(b) You are investigating the height of kindergarten pupils.
(c) You are investigating the most popular type of fast food.
(d) You are trying to determine the support for the question: Should local government
be scrapped?
(e) You are investigating the most popular brand of cat food.
(f) You are trying to find out the ages of the players at the local football club.
Multiple choice
5 (a) The strength of agreement with a number of statements is recorded using a scale of
1–5 where 1 stands for strongly disagree and 5 stands for strongly agree. The most
precise way of describing the data collected would be:
A continuous   B categorical   C nominal   D ordinal   E discrete
(b) The height of the high tide at the Barwon Heads bridge is recorded for 30
consecutive days. The most precise way of describing the data collected would be:
A continuous   B categorical   C nominal   D ordinal   E discrete
(c) The number of brothers and sisters of each member of your class is recorded. The
most precise way of describing the data collected would be:
A continuous   B categorical   C nominal   D ordinal   E discrete
Extended answer
6 (a) The measurement of distance is supposed to result in continuous data. However, all Olympic
records that involve distance appear to produce discrete data. How can this be?
(b) Surveys often use the numbers 1–5 to represent various levels of support for statements that
are made. Often, 1 stands for strongly disagree and 5 stands for strongly agree. Your responses
would appear to be numerical, but are they really?
3.2 Recording data
Frequency tables
Once we have gone to the trouble of collecting data we need to record it in some way. For
example, if we had found the number of pets for each of the 25 members of our class we might
have found something like the following.
2, 3, 2, 1, 2, 0, 1, 3, 1, 2, 1, 0, 2, 1, 1, 0, 0, 2, 1, 4, 5, 1, 0, 2, 1
Recorded like this the numbers don’t mean much to us. A frequency table is an efficient
way of recording data.
worked example 2
Put the following data representing the number of pets owned by each of the 25 members of a class
into a frequency table.
2, 3, 2, 1, 2, 0, 1, 3, 1, 2, 1, 0, 2, 1, 1, 0, 0, 2, 1, 4, 5, 1, 0, 2, 1
Steps
1. Draw up a table showing each of the data
values in one column, a tally mark column and
a frequency column.
2. Go through the data list putting a tally mark for
each entry (go through the list only once).
Use |||| for 5.
3. Count up the tally marks and put this total in
the frequency column. Add up the frequency
column as a check.
Solution
No. of pets   Tally marks   Frequency
0             ||||          5
1             |||| ||||     9
2             |||| ||       7
3             ||            2
4             |             1
5             |             1
Total                       25
Worked Example 2 dealt with a discrete data set with only a small number of different data
values. If the range of values is too big, we need to group the data into class intervals. We need
to think about how big the intervals should be. Although there are no hard and fast rules, we
prefer to have somewhere between six and 15 groups. So, we find the range and use it to decide
on the interval size.
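The tallying in Worked Example 2 can also be checked with a short program. This is a minimal sketch, assuming the 25 pet counts listed in that example; the function name `frequency_table` is made up for illustration.

```python
from collections import Counter

# Number of pets for each of the 25 class members (Worked Example 2).
pets = [2, 3, 2, 1, 2, 0, 1, 3, 1, 2, 1, 0, 2,
        1, 1, 0, 0, 2, 1, 4, 5, 1, 0, 2, 1]

def frequency_table(values):
    """Return (data value, frequency) pairs in increasing order of value."""
    counts = Counter(values)  # one pass through the data, like tallying
    return [(value, counts[value]) for value in sorted(counts)]

table = frequency_table(pets)
for value, freq in table:
    print(f"{value:>3} {freq:>3}")
# Adding up the frequency column is the same check used in step 3.
print("Total", sum(freq for _, freq in table))
```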
worked example 3
The following data represents the number of goals kicked over the course of the season by the players
at a local football club.
17, 21, 35, 64, 2, 0, 20, 44, 52, 81, 47, 21, 14, 10, 0, 2, 1, 1, 3, 1, 5, 2, 1, 18, 0, 1, 0, 2, 1, 19
Display the data in a frequency table.
Steps
1. Find the range.
2. Choose a convenient interval size. (In this case, notice that the first interval is 0–9, which
covers 10 discrete values.)
3. Draw up a table showing each of the intervals in one column, a tally mark column, and a
frequency column.
4. Go through the data list putting a tally mark for each entry.
5. Count up the tally marks and put this total in the frequency column. Add up the frequency
column as a check.
Solution
81 − 0 = 81
Interval size: 10. (This gives nine intervals.)
No. of goals   Tally marks         Frequency
0–9            |||| |||| |||| |    16
10–19          ||||                5
20–29          |||                 3
30–39          |                   1
40–49          ||                  2
50–59          |                   1
60–69          |                   1
70–79                              0
80–89          |                   1
Total                              30
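Grouping whole-number data into class intervals, as in Worked Example 3, can be sketched as follows. The function name `grouped_frequencies` and its parameters are made up for illustration, and the sketch assumes whole-number data with equal-width intervals.

```python
from collections import Counter

# Goals kicked by each player (Worked Example 3).
goals = [17, 21, 35, 64, 2, 0, 20, 44, 52, 81, 47, 21, 14, 10, 0,
         2, 1, 1, 3, 1, 5, 2, 1, 18, 0, 1, 0, 2, 1, 19]

def grouped_frequencies(values, width, start=0):
    """Count whole-number values in intervals start..start+width-1, and so on."""
    counts = Counter((v - start) // width for v in values)
    n_intervals = (max(values) - start) // width + 1
    table = []
    for k in range(n_intervals):
        low = start + k * width
        high = low + width - 1               # e.g. 0-9 covers 10 discrete values
        table.append((f"{low}-{high}", counts.get(k, 0)))  # keep empty intervals
    return table

table = grouped_frequencies(goals, width=10)
for label, freq in table:
    print(f"{label:>6} {freq:>3}")
```

Empty intervals such as 70–79 are kept with a frequency of 0 so that the table has no gaps.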
When the data is continuous we set up the intervals in a slightly different fashion. We do
this even when we round off the values to discrete quantities.
worked example 4
The following data represents the haemoglobin level for each member of the class. The value has
been measured correct to one decimal place.
10.1, 16.4, 11.3, 14.6, 10.8, 13.3, 14.5, 14.5, 14.8, 12.3, 14.6, 9.6, 11.9, 13.6, 11.8, 11.8, 13.7,
13.7, 12.6, 13.0, 15.8, 15.7, 13.6, 11.9, 13.9
Display the data in a frequency table.
Steps
1. Find the range.
2. Choose a convenient interval size.
3. Draw up a table showing each of the intervals in one column, a tally mark column and a
frequency column.
4. Go through the data list putting a tally mark for each entry.
5. Count up the tally marks and put this total in the frequency column. Add up the frequency
column as a check.
Solution
16.4 − 9.6 = 6.8
Interval size: 1 (This gives eight intervals.)
Haemoglobin level   Tally marks   Frequency
9.0–10              |             1
10.0–11             ||            2
11.0–12             ||||          5
12.0–13             ||            2
13.0–14             |||| ||       7
14.0–15             ||||          5
15.0–16             ||            2
16.0–17             |             1
Total                             25
There are several things to notice about Worked Example 4. It shows one way of writing the class
intervals. Another popular method is shown in the frequency table below. The right-hand end-point
of the interval is not given; it is assumed to be (less than) the next left-hand end-point. For the
last interval the pattern just needs to be continued.
Haemoglobin level   Tally marks   Frequency
9.0–                |             1
10.0–               ||            2
11.0–               ||||          5
12.0–               ||            2
13.0–               |||| ||       7
14.0–               ||||          5
15.0–               ||            2
16.0–               |             1
For any of these frequency tables we can add additional columns for relative frequency or
percentage frequency, as shown in the table below. Both are useful if we wish to compare values in
different frequency tables where the total for each table is different.
Haemoglobin level   Frequency   Relative frequency   Percentage frequency
9.0–10              1           0.04                 4%
10.0–11             2           0.08                 8%
11.0–12             5           0.20                 20%
12.0–13             2           0.08                 8%
13.0–14             7           0.28                 28%
14.0–15             5           0.20                 20%
15.0–16             2           0.08                 8%
16.0–17             1           0.04                 4%
Total               25          1.00                 100%
To find the relative frequency we divide the frequency by the total number of data values in the
table, for example:
$\dfrac{1}{25} = 0.04$
and we multiply this value by 100 to convert to the percentage frequency:
0.04 × 100 = 4%
In both cases, we may need to round off to a sensible number of decimal places. The relative
frequency column should add to 1 and the percentage frequency column should add to 100,
although there may be slight variations from these totals due to the rounding process.
Stemplots
Although a frequency table is a useful way to record data, when we group data values we lose
information. For example, look again at the final table in Worked Example 3; from the frequency
table we can no longer identify the highest number of goals scored. Another way to record data
that preserves this information is the stemplot. Stemplots are usually divided into intervals of
10, although if there are many values an interval of five may be used. Stemplots are sometimes
called stem-and-leaf diagrams. This is because the leaf represents the units component of the
number, while the stem is the other part.
worked example 5
The following data represents the scores obtained by the students in a maths class on a topic test.
12  15  23  45  56
18  44  33  23  19
34  52  59  41  23
13   9  11  15  18
Draw a stemplot for the data.
Steps
1. Draw up the outline of the table using the tens digit as the stem.
2. Go through the data list writing each value in the appropriate place.
3. Order the data.
Solution
STEM | LEAF
0    | 9
1    | 2 5 8 9 3 1 5 8
2    | 3 3 3
3    | 3 4
4    | 5 4 1
5    | 6 2 9

STEM | LEAF
0    | 9
1    | 1 2 3 5 5 8 8 9
2    | 3 3 3
3    | 3 4
4    | 1 4 5
5    | 2 6 9
Note: We can still clearly identify the highest score or, for that matter, the lowest score.
If we needed to break the stems because of the large number of leaf values, we would use a
code such as 5L to represent the values in the lower half of the 50s; for example, 50, 51, 52, 53
and 54, and 5U for the upper half.
A stemplot cannot be used for true continuous data but, remember, we often treat continuous data as
discrete. In these cases, we could use a stemplot. When the data values are large we sometimes need
to round them off to be able to use a stemplot.
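An ordered stemplot like the one in Worked Example 5 can be built with a short program. This is a sketch only, assuming whole-number scores with the tens digit as the stem; the function name `stemplot` is made up for illustration.

```python
from collections import defaultdict

# Topic test scores (Worked Example 5).
scores = [12, 15, 23, 45, 56, 18, 44, 33, 23, 19,
          34, 52, 59, 41, 23, 13, 9, 11, 15, 18]

def stemplot(values):
    """Return (stem, ordered leaves) pairs; stem = tens part, leaf = units digit."""
    leaves = defaultdict(list)
    for v in sorted(values):                 # sorting first gives ordered leaves
        leaves[v // 10].append(v % 10)
    stems = range(min(leaves), max(leaves) + 1)   # include any empty stems
    return [(stem, leaves.get(stem, [])) for stem in stems]

plot = stemplot(scores)
for stem, leaf_list in plot:
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaf_list)}")
```

Unlike a grouped frequency table, the output still shows every individual score, so the highest and lowest values can be read off directly.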
exercise 3.2
Recording data
Short answer
1 Construct frequency tables for the following discrete data sets.
(a) 2, 3, 5, 4, 3, 6, 1, 0, 2, 4, 3, 2, 1, 2, 4, 2
(b) 25, 26, 25, 24, 23, 22, 21, 20, 19, 25, 26, 22, 23, 24, 22, 21
(c) 0, 1, 0, 1, 2, 3, 1, 0, 3, 4, 0, 1, 4, 3, 2, 1, 0, 1, 5, 0, 0, 1, 2, 3, 2
(d) 32, 35, 33, 34, 32, 31, 30, 31, 32, 34, 35, 35, 32, 34, 32, 31, 30
2 Construct frequency tables for the following discrete data sets, using appropriate
class intervals.
(a) 1, 2, 8, 10, 22, 33, 42, 46, 41, 33, 32, 38, 51, 47, 46, 32, 17, 21, 24, 23, 27, 29, 36
(b) 101, 122, 134, 159, 145, 152, 147, 117, 109, 102, 117, 118, 134, 147, 148, 156, 159, 162, 155
(c) 15, 28, 42, 42, 43, 47, 51, 43, 46, 47, 48, 46, 16, 19, 22, 24, 26, 29, 31, 34, 37, 41, 38, 52,
54, 55
(d) 161, 162, 171, 175, 182, 183, 191, 183, 167, 169, 173, 178, 177, 192, 189, 190, 180, 170,
164, 168, 173, 175
3 Construct frequency tables for the following continuous data sets, using appropriate
class intervals.
(a) 16.1, 17.3, 18.4, 17.6, 18.2, 19.5, 19.7, 18.6, 19.2, 27.5, 26.7, 24.5, 26.5, 28.7, 29.6, 24.3,
26.1, 18.5, 19.7, 22.6, 23.5, 29.9, 28.1, 18.7, 21.2, 23.5, 26.7, 27.7, 26.5
(b) 57.2, 72.5, 83.9, 52.3, 49.9, 63.9, 71.7, 54.5, 64.3, 98.3, 42.9, 73.9, 57.2, 91.9, 83.1, 47.7,
53.7, 89.1, 37.3, 72.3, 56.8, 51.1, 78.5, 37.2, 92.1, 78.9, 38.6, 78.4, 89.3
(c) 155.5, 154.3, 198.5, 167.9, 176.5, 178.0, 156.8, 176.5, 154.9, 157.0, 187.8, 192.3, 178.0,
187.6, 175.0, 156.0, 184.6, 167.9, 165.8, 154.9, 158.8, 192.6, 178.7
(d) 19.6, 25.4, 19.5, 22.4, 24.6, 21.8, 20.7, 20.4, 23.5, 25.4, 22.7, 23.4, 21.5, 22.2, 23.6, 24.6,
20.2, 21.1, 23.1, 24.5, 22.1, 23.5, 24.1, 23.0, 24.0, 23.5, 23.1, 24.6
4 Draw stemplots for each of the following data sets.
(a) 1, 2, 8, 10, 22, 33, 42, 46, 41, 33, 32, 38, 51, 47, 46, 32, 17, 21, 24, 23, 27, 29, 36
(b) 101, 122, 134, 159, 145, 152, 147, 117, 109, 102, 117, 118, 134, 147, 148, 156, 159,
162, 155
(c) 15, 28, 42, 42, 43, 47, 51, 43, 46, 47, 48, 46, 16, 19, 22, 24, 26, 29, 31, 34, 37, 41, 38, 52,
54, 55
(d) 161, 162, 171, 175, 182, 183, 191, 183, 167, 169, 173, 178, 177, 192, 189, 190, 180, 170,
164, 168, 173, 175
5 For the following frequency tables add a relative frequency column and a percentage
frequency column.
(a)
Score       0    1    2    3    4    5
Frequency   3    6    7    4    3    1
(b)
Score       4    5    6    7    8    9    10
Frequency   15   23   14   23   17   19   20
(c)
Score       10–19   20–29   30–39   40–49   50–59   60–69   70–79
Frequency   15      23      17      15      13      20      5
(d)
Score       101–110   111–120   121–130   131–140   141–150   151–160   161–170
Frequency   9         4         7         1         4         3         7
(e)
Score       0–10   10–20   20–30   30–40   40–50   50–60
Frequency   9      18      13      15      14      16
(f)
Score       141–146   146–151   151–156   156–161   161–166   166–171
Frequency   22        31        21        26        22        24
Multiple choice
6 If we have a frequency table drawn, then:
A the relative frequency column should add to 100
B the percentage frequency column should add to 100 and the frequency column should
add to 1
C the frequency column should add to a number less than 20
D the relative frequency column should add to 1 and the percentage frequency column
should add to 100
E both the relative frequency column and the percentage frequency column should add
to 100
7 A data set that has a range of 120 is to be recorded in a grouped data frequency table.
The size of the groups would be best set at:
A 2   B 5   C 10   D 20   E 25
Extended answer
8 The following data represents the number of kilometres travelled each day by a solar
powered vehicle.
112, 95, 122, 78, 99, 145, 133, 160, 145, 144, 78, 66, 68, 79, 93, 125, 133, 89, 67, 78, 114, 134,
80, 65, 105, 115, 127, 134, 117, 84, 104
(a) Given the way the data has been recorded would you consider it to be discrete or
continuous?
(b) Construct a frequency table for the data, using a class interval of 10 and a starting
value of 60.
(c) Looking only at the frequency table what can you say about the maximum distance
travelled on any day?
(d) Construct a stemplot for the data.
(e) What can a stemplot tell us that a frequency table cannot?
(f) What does this question reinforce?
3.3 Simple data displays
Bar graphs
A bar graph (or chart) is used to display categorical (nominal) information. Bar graphs can have
horizontal bars or vertical bars. Consider the bar chart at right. Note there is no scale on
the axis showing the categories (horizontal in this case). Gaps are left between the bars to
emphasise that we are not dealing with continuous data. We can simply read values from the graph;
for example, 15 of the students sampled rode a bike to school on the day of the survey. Bar charts
can easily be constructed from data presented in a frequency table.
[Figure: bar chart. Vertical axis: Number (0–15); horizontal axis: Mode of transport (car, walk, bus, bike, train).]
worked example 6
Draw a bar chart to represent the following data collected as a result of a survey.
sedan, sedan, ute, truck, stationwagon, sedan, sedan, sedan, truck, motorbike, sedan,
stationwagon, sedan, ute, truck, sedan, sedan, stationwagon, sedan, ute, truck, ute, truck, truck,
stationwagon, sedan, sedan, sedan, truck, ute, sedan, stationwagon, sedan, ute, truck, ute,
sedan, stationwagon
Steps
1. Construct a frequency table for the information.
2. Draw the bar chart, remembering to leave a gap between each of the categories.
Solution
Type of vehicle   Frequency
sedan             16
ute               7
truck             8
stationwagon      6
motorbike         1
Total             38
[Figure: bar chart. Vertical axis: Frequency (0–18); horizontal axis: Type of vehicle (sedan, ute, truck, stationwagon, motorbike).]
Divided bar charts
Another type of data display is the divided bar chart, where a single rectangle is divided into
pieces according to the contribution of the category to the whole. An appropriate scale needs
to be chosen to make the graph easier to draw. Sometimes, we find the percentage contribution
for each category and base the scale on those values.
worked example 7
Draw a divided bar chart to represent the number of grams of various components in a hot dog.
Component       No. of grams
Protein         17
Fat             18.2
Saturated fat   6.3
Carbohydrates   34.5
Sugar           2.6
Steps
1. Find the total number of grams involved.
2. Draw a rectangle of a suitable length and divide it into pieces according to the category
values.
Solution
78.6 g
[Figure: divided bar chart on a scale of 0–100 grams, with segments for protein (17), fat (18.2), saturated fat (6.3), carbohydrates (34.5) and sugar (2.6).]
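The arithmetic behind the divided bar chart in Worked Example 7 can be sketched as below: it finds the total, where each segment starts and ends along the bar, and the percentage contribution of each category. The function name `divided_bar` is made up for illustration.

```python
# Grams of each component in a hot dog (Worked Example 7).
components = {
    "Protein": 17,
    "Fat": 18.2,
    "Saturated fat": 6.3,
    "Carbohydrates": 34.5,
    "Sugar": 2.6,
}

def divided_bar(parts):
    """Return the total and (name, start, end, percentage) for each segment."""
    total = sum(parts.values())
    segments = []
    start = 0.0
    for name, amount in parts.items():
        end = start + amount                      # segments sit end to end
        share = round(100 * amount / total, 1)    # percentage of the whole bar
        segments.append((name, round(start, 1), round(end, 1), share))
        start = end
    return round(total, 1), segments

total, segments = divided_bar(components)
print("Total:", total, "g")
for name, start, end, share in segments:
    print(f"{name:<14} {start:>5} to {end:>5} g  ({share}%)")
```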
Frequency diagrams
When dealing with numerical data we should not really use a
bar chart. However, the temptation is strong when we have
discrete data. We sometimes use strips rather than bars in
these cases to make the diagram look different from a bar
chart. We might call this a frequency diagram. Note that this
time the data value axis needs a label, but the graph does not
have a title. The axis labels tell us everything we need to know
to be able to interpret the diagram. Again, there are gaps to
emphasise the discrete nature of the data. Grouped discrete
data can also be represented in a frequency diagram. The
horizontal axis scale values would be the intervals.
[Figure: frequency diagram. Vertical axis: Frequency (0–20); horizontal axis: No. of passengers (0–4).]
worked example 8
Draw a frequency diagram to represent the following discrete data set.
Score   Frequency
7       4
8       8
9       6
10      2
11      1
Total   21
Steps
1. Draw a set of axes marking Score on the horizontal axis and Frequency on the vertical
axis.
2. Complete the graph.
Solution
[Figure: frequency diagram. Vertical axis: Frequency (0–8); horizontal axis: Score (7–11).]
Histograms
When dealing with continuous numerical data we draw a diagram known as a histogram.
A histogram is like a bar chart, but there are some important differences. One of these is that
the bars are joined together. There are no gaps in the data values so we leave no gaps in the
diagram. However, we do usually start the first column one-half column width from the
vertical axis.
We can use Worked Example 4 about the haemoglobin level in the blood to illustrate a
histogram. The frequency table is repeated to remind us of the data values.
Haemoglobin level   Frequency
9.0–10              1
10.0–11             2
11.0–12             5
12.0–13             2
13.0–14             7
14.0–15             5
15.0–16             2
16.0–17             1
Total               25
[Figure: histogram. Vertical axis: Frequency (0–7); horizontal axis: Haemoglobin level (9–17).]
Note the way the interval end-points are used as the dividers on the horizontal axis. The
break symbol on the horizontal axis is used to indicate that part of the axis has been left out.
When we set up our own frequency tables we emphasised the need to make the intervals the same size.
However, when we use frequency tables that have already been set up we do not always have this
luxury. If the intervals are not the same size then we need to be very careful. Consider the
frequency table below which gives information about road deaths of males in Victoria in 2002.
Age range   No. of deaths
0–          3
5–          9
13–         3
16–         14
18–         48
22–         39
26–         27
30–         50
40–         33
50–         25
60–         29
75+         19
Total       299
Source: CrashStats, owned by VicRoads
Now, look at the diagram below, which is meant to be the histogram illustrating this data. Notice
that we have taken 90 to be the upper limit for the age of the fatalities. The diagram seems to
indicate that the most dangerous age range is 30–40 years, as this is when there are the most
fatalities. But this is not correct; because the interval is larger we would expect more deaths.
We need a relative measure to make such a decision. To do this, we introduce a new term,
frequency density, which is defined as follows.
[Figure: histogram. Vertical axis: No. of deaths (0–50); horizontal axis: Age (0–90).]
$\text{Frequency density} = \dfrac{\text{percentage frequency}}{\text{class width}}$
The following worked example shows us how to draw a histogram correctly when we have
inconsistent intervals.
worked example 9
Draw a histogram for the data shown in this frequency table.
Age range   No. of deaths
0–          3
5–          9
13–         3
16–         14
18–         48
22–         39
26–         27
30–         50
40–         33
50–         25
60–         29
75+         19
Total       299
Source: CrashStats, owned by VicRoads
Steps
1. Add a percentage frequency column if not already there.
Add a frequency density column using:
$\text{Frequency density} = \dfrac{\text{percentage frequency}}{\text{class width}}$
For example:
$\text{percentage frequency} = \dfrac{3}{299} \times 100 = 1.0\%$
$\text{frequency density} = \dfrac{1.0}{5} = 0.2$
Note the use of Σ (the Greek letter sigma) to represent ‘the sum of’.
2. Draw the histogram, placing frequency density on the vertical axis. (In the histogram the
frequency density values have been rounded off to one decimal place.)
Solution
Age range   No. of deaths   Percentage frequency   Frequency density
0–          3               1.0                    0.2
5–          9               3.0                    0.375
13–         3               1.0                    0.333
16–         14              4.7                    2.35
18–         48              16.1                   4.025
22–         39              13.0                   3.25
26–         27              9.0                    2.25
30–         50              16.7                   1.67
40–         33              11.0                   1.1
50–         25              8.4                    0.84
60–         29              9.7                    0.647
75+         19              6.4                    0.427
            Σf = 299
[Figure: histogram. Vertical axis: Frequency density (0–5); horizontal axis: Age (0–90).]
From this, we can see that the most dangerous age range is actually 18–22. This would fit
with anecdotal evidence which suggests that young males who have just obtained a driving
licence are more dangerous on the road than other groups.
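The frequency density column in Worked Example 9 can be reproduced with a short program. This sketch assumes, as the text does, that 90 is the upper limit of the last age class, and that the percentage frequency is rounded to one decimal place before dividing by the class width; the function name `frequency_densities` is made up for illustration.

```python
# Class boundaries (years); 90 is taken as the upper limit of the 75+ class.
boundaries = [0, 5, 13, 16, 18, 22, 26, 30, 40, 50, 60, 75, 90]
deaths = [3, 9, 3, 14, 48, 39, 27, 50, 33, 25, 29, 19]

def frequency_densities(boundaries, frequencies):
    """Return (low, high, percentage frequency, frequency density) per class."""
    total = sum(frequencies)
    rows = []
    for low, high, f in zip(boundaries, boundaries[1:], frequencies):
        pct = round(100 * f / total, 1)          # percentage frequency
        density = round(pct / (high - low), 3)   # divide by the class width
        rows.append((low, high, pct, density))
    return rows

rows = frequency_densities(boundaries, deaths)
for low, high, pct, density in rows:
    print(f"{low:>2}-{high:<2}  {pct:>5}%  {density}")

# The class with the greatest density, not the greatest frequency,
# marks the most dangerous age range.
most_dense = max(rows, key=lambda row: row[3])
print("Highest frequency density:", most_dense[:2])
```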
Cumulative frequency diagrams
When we have a continuous data set recorded in class intervals we can draw what is called a
cumulative frequency diagram. This shows the number of data values less than a particular
value. To help in producing a cumulative frequency diagram we add a new column to our
frequency table and label it cumulative frequency. We also put in a new first row to emphasise
that we are finding the number of data values less than a given value. However, we should note
that the cumulative frequency column tells us how many values there are less than the right-hand
end-point of the interval.
worked example 10
Draw a cumulative frequency diagram for the following data set which gives the mass (kg)
of pupils at a small rural school.
Class interval (mass, kg)   Frequency (number of students)
10–20                       5
20–30                       8
30–40                       16
40–50                       5
                            Σf = 34
Steps
1. Add the new first row and new column to the frequency table.
2. Fill in the cumulative frequency column. Notice that the new row has an entry of 0. Add on
each frequency as you go down the column. (Remember, this tells us, for instance, that
there are 13 values less than 30.) Also notice that the final entry in the cumulative frequency
column is equal to the sum of the frequency column.
3. Draw the graph. Put cumulative frequency on the vertical axis and record these values
against the right-hand end-point of the class interval. Join the points with straight-line
segments.
Solution
Mass (kg)   Frequency   Cumulative frequency
10
10–20       5
20–30       8
30–40       16
40–50       5
            Σf = 34

Mass (kg)   Frequency   Cumulative frequency
10                      0
10–20       5           5
20–30       8           13
30–40       16          29
40–50       5           34
            Σf = 34
[Figure: cumulative frequency diagram. Vertical axis: Cumulative frequency (0–40); horizontal axis: Mass (kg) (10–50).]
Sometimes, cumulative frequency diagrams are drawn with a smooth curve, called an ogive, passing
through the points rather than using straight-line segments. For our purposes it does not make any
real difference which method is used.
You can use the cumulative frequency curve to obtain approximate answers to questions such
as ‘What percentage of students are less than 35 kg?’ and ‘Under what mass are 70% of the
students?’
worked example 11
For the cumulative frequency graph shown, find approximate answers to the following questions.
(a) What percentage of students are less than 35 kg? Give your answer correct to the nearest
per cent.
(b) Under what mass are 70% of the students? Give your answer correct to the nearest kg.
[Figure: cumulative frequency diagram. Vertical axis: Cumulative frequency (0–40); horizontal axis: Mass (kg) (10–50).]
Steps
(a) 1. Draw a line from the given mass on the horizontal scale to the graph and then,
from this point, draw a line across to the vertical scale.
2. Read off the value and convert this to a percentage. Write this correct to the
nearest per cent.
(b) 1. Calculate a value to represent the given percentage of students.
2. Draw a line from the calculated value on the vertical axis across to the graph and
then a line, from this point, down to the horizontal axis.
3. Read off the value.
Solutions
(a) 20 students
$\dfrac{20}{34} \times 100 = 58.82\%$
59% of students are less than 35 kg.
(b) 70% of 34 = 23.8
70% of students are less than approximately 37 kg.
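Part (b) can also be estimated by calculation rather than by reading the graph. The sketch below builds the cumulative frequency column and then interpolates along the straight-line segment that crosses the target value. It assumes the points are joined by straight lines (not a smooth ogive) and that the fraction asked for is greater than 0; the function names are made up for illustration.

```python
boundaries = [10, 20, 30, 40, 50]   # class end-points (mass, kg)
frequencies = [5, 8, 16, 5]         # number of students in each class

def cumulative(freqs):
    """Cumulative frequencies, starting with 0 at the first boundary."""
    running, out = 0, [0]
    for f in freqs:
        running += f
        out.append(running)
    return out

def value_below_which(fraction, boundaries, freqs):
    """Estimate the value under which the given fraction of the data lies."""
    cum = cumulative(freqs)
    target = fraction * cum[-1]                  # e.g. 0.7 * 34 = 23.8
    for i in range(1, len(cum)):
        if cum[i] >= target:                     # segment that crosses the target
            x0, x1 = boundaries[i - 1], boundaries[i]
            y0, y1 = cum[i - 1], cum[i]
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)

cum = cumulative(frequencies)
estimate = value_below_which(0.7, boundaries, frequencies)
print(cum)                 # cumulative frequency column
print(round(estimate))     # mass to the nearest kg
```

Because a value read from a hand-drawn graph is only approximate, a calculated estimate may differ slightly from a graph reading.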
Calculating percentiles using CAS
We can use our CAS to find only approximate values for percentiles. These arise from questions
such as “Under what mass are 70% of the students?” Of course, you need to have also been
given the distribution of the students’ weight.
We have the following information.
Class interval
(mass, kg)
Frequency
(number of
students)
10–<20
20–<30
30–<40
40–<50
5
8
16
5
We need to enter the cumulative frequency values as the second column in our list. As a
check, the total frequency is 34 so this should be the value in the last row.
Using the ClassPad
1. To make this process work best we need to do it slightly out of order. The first step is to
actually draw the line equivalent to the percentile value in question and then to do the line
graph. To estimate the 70th percentile we need to know 70% of the cumulative frequency. In
our case this is 0.7 × 34 = 23.8. Enter this in W mode and draw as usual. The values on the
axes are not really important at this stage.
2. Enter the data in I mode. Use the right-hand boundary values of the class intervals for
list1, and the cumulative frequency values for list2. Include the point (10, 0) to give the
graph a starting point.
3. Tap SetGraph > Setting and make StatGraph1 the scatter plot as shown previously and
StatGraph2 as shown at right.
4. Tap h y and the cumulative frequency curve appears. Tap N $ and the horizontal line will
appear. We can now approximate the x-value of the point of intersection. Just tap the screen
at the point of intersection and the coordinate pair will appear at the bottom of the screen.
Unfortunately, this cannot be shown in a screen shot, but it shows the 70th percentile is
approximately 37.
Using the TI-Nspire CAS
1. Enter the data in the Lists & Spreadsheet application. Call the first column x and use the
right-hand boundary values of the class intervals, and call the second column y. Include the
point (10, 0) to give the graph a starting point.
2. Insert a new page, / I , and choose Data & Statistics. Label the axes as seen previously.
3. Join the dots by pressing b > Plot Type > XY Line Plot.
4. To estimate the 70th percentile we need to know 70% of the cumulative frequency. In our
case this is 0.7 × 34 = 23.8. We now want to draw in the line y = 23.8. To do this press
b > Analyse > Plot Function and fill in the dialog box. We can now approximate the x-value
of the point of intersection. So, the 70th percentile is approximately 37.
Summary
Recording data
• A frequency table is an efficient way of recording data.
It records, in column format, the possible data values
and the associated frequencies.
• Class intervals are used when the range of values
recorded is too great for a meaningful ungrouped
arrangement.
• Relative frequency and percentage frequency columns
can be added to frequency tables. Relative frequency is
found by dividing the frequency by the total frequency,
and we multiply this value by 100 to obtain the
percentage relative frequency.
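As a brief illustration of the relative and percentage frequency columns, the sketch below uses the mass frequencies 5, 8, 16 and 5 from the percentile example earlier in the chapter. The variable names and the rounding (three decimal places and one decimal place) are choices made for this example only.

```python
# Frequencies for the four mass classes (kg) used in the percentile example.
frequencies = {"10-<20": 5, "20-<30": 8, "30-<40": 16, "40-<50": 5}

total = sum(frequencies.values())  # total frequency

# Relative frequency = frequency / total; percentage = relative frequency x 100.
relative = {c: round(f / total, 3) for c, f in frequencies.items()}
percentage = {c: round(100 * f / total, 1) for c, f in frequencies.items()}

for c, f in frequencies.items():
    print(c, f, relative[c], percentage[c])
```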
Data analysis methods and tools
• When we divide a data set into a number of equal
pieces we call these pieces quantiles. Some quantiles
have special names:
– if we divide the data set into four equal pieces we
have quartiles
– if we choose 10 equal pieces we have deciles
– if we choose 100 equal pieces we have percentiles.
• When we divide a data set into quartiles we obtain the
five-figure summary—the lowest value, the lower
quartile (Q1 or QL), the median (or Q2), the upper
quartile (Q3 or QU) and the highest value.
• The interquartile range is the difference between the
upper quartile and the lower quartile (IQR = Q3 − Q1).
It is a robust measure of spread unaffected by extreme
values.
• An outlier is a value located more than 1.5 × IQR below
Q1 or more than 1.5 × IQR above Q3.
• A boxplot is a way of representing the spread of a data
set. It uses each of the values from the five-figure
summary—lowest value, lower quartile, median,
upper quartile and highest value.
• Standard deviation is a measure of spread which uses
all of the data values; however, it can be affected by
extreme values.
• Your CAS can be used to produce all of the statistics we
are interested in as well as to draw boxplots.
• Skewness is a way of describing how far a data set is
from being symmetrical.
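The five-figure summary, the IQR and the outlier test can be sketched in a few lines of Python. This is an illustration under two stated assumptions: the data set is made up for the example, and the quartiles are taken as the medians of the lower and upper halves of the ordered data (leaving out the middle value when n is odd). A CAS or other software may use a slightly different quartile rule.

```python
from statistics import median


def five_figure_summary(data):
    """Return (lowest, Q1, median, Q3, highest); needs at least two values.

    Assumption: Q1 and Q3 are the medians of the lower and upper halves.
    """
    values = sorted(data)
    half = len(values) // 2
    lower, upper = values[:half], values[-half:]
    return values[0], median(lower), median(values), median(upper), values[-1]


# Hypothetical data set, chosen only to illustrate the calculations.
data = [2, 4, 5, 7, 8, 9, 11, 12, 30]

lowest, q1, med, q3, highest = five_figure_summary(data)
iqr = q3 - q1

# Outliers lie more than 1.5 x IQR below Q1 or above Q3.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print((lowest, q1, med, q3, highest), iqr, outliers)
```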
Simple data displays
• Stem-and-leaf diagrams are another way of recording
data. In a stem-and-leaf diagram, the leaf is the units
value of the individual datum value and the stem is the
other part of the number. The advantage of a
stem-and-leaf diagram is that it does not lose any of the
detail in the data.
• A bar chart is used to represent categorical (nominal)
data.
• A frequency diagram is used to represent numerical
(discrete) data whether it is ungrouped or grouped.
• A histogram is used to represent numerical
(continuous) data.
• Frequency density is used when we are trying to draw
a histogram and the class intervals are not equal.
• Continuous data can also be represented in a
cumulative frequency curve which indicates the
number of data values less than a particular value.
Data types
• There are two types of categorical data, nominal (such
as type of fruit) and ordinal (such as strongly agree,
agree …) and two types of numerical data, discrete
(such as number of brothers and sisters) and
continuous (such as height of players in a team).
Values of central tendency and
spread
• There are three measures of central tendency used:
– the mean is calculated by finding
$\frac{\text{sum of the data values}}{\text{number of data values}}$.
This can be written as $\frac{\Sigma x}{n} = \frac{\Sigma x}{\Sigma f}$.
– the median is the physical middle of the data set.
For odd n, the median is the $\left(\frac{n}{2} + 0.5\right)$th value;
for even n, the median is the mean of the $\left(\frac{n}{2}\right)$th and
$\left(\frac{n}{2} + 1\right)$th values.
– the mode is the most frequently occurring data
value.
• If a data set has two values with the same highest
frequency we say the set is bimodal.
• Grouped discrete data is difficult to deal with as far as
the median is concerned. We can, at best, estimate the
value.
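The three measures of central tendency can be checked with a short Python sketch. The frequency table here is hypothetical and exists only to illustrate the rules: the mean as the sum of the data values divided by the total frequency, the median from the position rules for odd and even n, and the mode as the value (or values) with the highest frequency.

```python
# Hypothetical frequency table: data value -> frequency.
table = {0: 2, 1: 5, 2: 3}

# Write the data out in order so that positions can be counted.
values = sorted(x for x, f in table.items() for _ in range(f))
n = len(values)  # total frequency

# Mean: sum of the data values divided by the number of data values.
mean = sum(x * f for x, f in table.items()) / n

# Median: (n/2 + 0.5)th value for odd n; mean of the (n/2)th and
# (n/2 + 1)th values for even n. Python lists count from 0.
if n % 2 == 1:
    median = values[(n + 1) // 2 - 1]
else:
    median = (values[n // 2 - 1] + values[n // 2]) / 2

# Mode: every value that shares the highest frequency (two values -> bimodal).
highest = max(table.values())
modes = [x for x, f in table.items() if f == highest]

print(mean, median, modes)
```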
[Figures: three distribution shapes labelled Negative skew, Symmetrical and Positive skew]
Positive skew: mode < median < mean
Negative skew: mean < median < mode
• The degree of skewness = $\frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$.
• Composite bar charts, back-to-back stem-and-leaf
plots and comparative boxplots are all used to assist in
the analysis of data sets.
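The degree of skewness formula can be tried out with the short sketch below. It is an illustration only: the data sets are made up, and the sample standard deviation (`statistics.stdev`) is assumed because the summary does not say whether the sample or population version is intended.

```python
from statistics import mean, median, stdev


def degree_of_skewness(data):
    """3(mean - median) / standard deviation.

    Assumption: the sample standard deviation is used.
    """
    return 3 * (mean(data) - median(data)) / stdev(data)


# Hypothetical data sets, chosen only to show the sign of the result.
symmetrical = [1, 2, 3, 4, 5]
long_right_tail = [2, 4, 5, 7, 8, 9, 11, 12, 30]

print(degree_of_skewness(symmetrical))      # mean equals median
print(degree_of_skewness(long_right_tail))  # mean is above the median
```

A positive result corresponds to positive skew (mean above the median) and a negative result to negative skew.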
Use the following to check your progress. If you need more help with any questions, turn back to the
section given in the side column, look carefully at the explanation of the skill and the worked examples,
and try a few similar questions from the Exercise provided.
Short answer
1 Select the correct data type for each of the following examples. You are to choose from:
nominal, ordinal, discrete and continuous.
(a) The height jumped by each of the
competitors in the school high
jump competition.
(b) The shirt size of each of the
students in your class.
(c) The favourite colour of each of the
students in your class.
(d) The highest level of education
achieved by each of the
respondents to a survey.
3.1
2 The following data set represents the number of siblings for a sample of students in a school.
0  3  2  1  0  2  3  2  1  2  5  3
1  0  2  3  2  1  6  3  2  5  3  0
2  3  1  0  1  1  1  4  3  2  5  4
2  0  1  1  2  3  2  1  4  5  2  1
Construct a frequency table for the data set and include both a relative frequency column, stated to
three decimal places where necessary, and a percentage frequency column, stated to one decimal
place where necessary.
3 Draw a stemplot for the following data set.
25  38  65  32  77  21  79  81  66  50  47  53
25  30  42  60  70  29  51  63  68  82  40  33
22  45  37  65  74  70  35  61  81  77  65  66
4 Draw a divided bar chart to represent the number of grams of various components in a fried dim sim.
Component        No. of grams
Protein          5
Fat              4.3
Saturated fat    2.2
Carbohydrates    13.0
Sugar            3.1
5 Draw a frequency diagram for the data set below.
Number of pets    Frequency
0                 3
1                 3
2                 5
3                 6
4                 1
5                 2
6                 1
6 Find two estimates of the median for this data set.
x        f
0–4      2
5–9      5
10–14    7
15–19    3
20–24    1
7 The data set below represents the heights of the
players in the under 16 football squad at the local football club.
(a) Draw a cumulative frequency curve and then use it to find:
(b) an estimate for the median
(c) an estimate for the IQR.
Class interval    Frequency
130–              3
140–              3
150–              4
160–              12
170–              4
180–              2
8 Find the standard deviation of the following data sample set using your calculator.
1, 1, 2, 2, 2, 2, 3, 4, 4, 5, 6, 7, 8, 8, 8, 9, 9, 9, 9
3.2
3.2
3.3
3.3
3.4
3.4
3.5
3.5
9 Find the five-figure summary for the data set below. (Section 3.5)
STEM | LEAF
13   | 1 1 2 3 5 6 7
14   | 2 3 3 3 5 7 9 9
15   | 0 0 0 1 2 3 3 3 3
16   | 2 3 4 4 5 8 9
17   | 0 0 1 2 3 4 9
18   | 2 2 2 5 7 8
19   | 0 0 1
10 For the 2003 season, each AFL club had 38 players on its senior list. The data represents the
weight (kg) of the players for two clubs. Draw a back-to-back stemplot and use this to assist in
comparing the two teams. (Section 3.6)
Geelong:
83  84  81  97  83  84  89  90  93  92  86  96  96
94  96  84  87  85  106  94  89  84  87  79  101  91
86  97  96  86  88  86  90  94  78  104  96  94
Western Bulldogs:
77  98  91  77  89  81  90  76  94  80  98  84  79
72  77  75  75  98  94  82  94  96  84  89  85  72
94  74  81  88  83  85  82  83  78  101  83  85
11 The table indicates the GDP for various countries in the OECD. The figures represent the % GDP for
each country. Draw a comparative frequency diagram to help you analyse the data. (Section 3.6)
Country        2000    2001    2002
Australia      3.0     2.7     3.3
Canada         5.3     1.9     3.3
Japan          2.8     0.4     0.2
Korea          9.3     3.1     6.3
New Zealand    3.7     2.2     4.2
USA            3.8     0.3     2.4
Brazil         3.9     1.5     1.5
China          8.0     7.3     8.0
Russia         9.1     4.9     4.3
12 The bar chart shown gives the number of fatalities on Victorian roads for 2003 for a number of
categories of road user, divided by sex. Discuss the content of the chart. (Section 3.6)
[Bar chart: Traffic fatalities 2003. Vertical axis: number of fatalities, 0 to 100. Horizontal axis:
Driver, Passenger, Pedestrian, Motor cyclist, Bicyclist, Pillion passenger. Separate bars for Male and Female.]
Multiple choice
13 The collar size of shirts worn by the pupils in a class is recorded. The most precise way of describing
the data collected would be as: (Section 3.1)
A continuous
B categorical
C nominal
D ordinal
E discrete
14 A data set with a range of 300 is to be recorded in a grouped data frequency table. The size of the
groups would be best set at: (Section 3.2)
A 5
B 10
C 15
D 50
E 100
15 The following table gives the number of grams of various components in a 150 g packet of salted
peanuts. (Section 3.3)
Component        No. of grams
Protein          37.2
Fat              70.95
Saturated fat    8.1
Carbohydrates    13.5
Sugar            7.65
The percentage of the mass of peanuts that is protein is closest to:
A 70.95%
B 37.4%
C 37.2%
D 27.1%
E 29.05%
The data set below is to be used for questions 16 to 18.
x    f
1    3
2    5
3    2
4    4
5    2
16 The mean of the data set is closest to: (Section 3.4)
A 2.5
B 5
C 2.8
D 4
E 2
17 The median of the data set is closest to: (Section 3.4)
A 2.5
B 5
C 2.8
D 4
E 2
18 The mode of the data set is closest to: (Section 3.4)
A 2.5
B 5
C 2.8
D 4
E 2
19 You are given the data set below. (Section 3.4)
x        f
0–9      6
10–19    2
20–29    7
30–39    8
40–49    3
The mean of the data set is closest to:
A 34.5
B 25
C 25.5
D 29
E 24.5
The data set below is to be used for questions 20 to 21.
x        f
0–5      2
5–10     5
10–15    7
15–20    3
20–25    1
20 The mean of the data set is closest to: (Section 3.4)
A 11.4
B 12.5
C 11.9
D 10.9
E 12.2
21 The modal class of the data set is best represented by: (Section 3.4)
A 5–10
B 5
C 10–15
D 7
E 12.5
The data set below is to be used for questions 22 and 23.
STEM | LEAF
3    | 0 1 4 6 7
4    | 3 3 3 3 4 5
5    | 8 9
6    | 1 3 4 6 6 6 7 8
7    | 0 6 7 7 9
8    | 1 2 2 2 5 6
9    | 2 2 2 5 7
22 The median of the data set is best represented by: (Section 3.4)
A 6
B 30
C 43
D 63
E 66
23 The IQR of the data set is best represented by: (Section 3.5)
A 38.5
B 43.5
C 38
D 44
E 39
24 The IQR for the data set: 2, 4, 5, 2, 4, 3, 1, 4, 6, 5, 4, 7, 8, 6, 8, 4, 5, 6, 2, 4, 3, 1, 3, 5, 7, 9, 1, 3, 2, 4, 6, 5,
3, 2, 1, 5, 2, 5, 6 is best represented by: (Section 3.5)
A 2
B 4 to 5
C 3 to 6
D 4
E 6
25 For the data set below the IQR is best represented by: (Section 3.5)
x     f
8     4
13    8
15    6
19    3
21    1
25    2
A 8
B 4
C 2.5
D 13
E 15
26 Look at the data set given. The degree of skewness is closest to: (Section 3.6)
STEM | LEAF
6    | 1 2 5 5 6 9
7    | 0 0 1 1 3 3 3 4 6 8 9
8    | 4 5 5 5 6 6 7
9    | 3 3 6 9 9
10   | 0 1 3 6
11   | 0 0 2
A 66.675
B −0.049
C 11.258
D 0.247
E −0.146
Extended answer
27 The table shows the total road deaths by age group.
Age range    No. of deaths
0–           30
16–          15
18–          45
21–          58
26–          115
40–          64
60–70        93
(a) Draw a histogram to show this data.
(Be careful!)
(b) Draw a cumulative frequency curve and use it
to estimate the age:
(i) under which 60% of deaths occurred
(ii) under which 18% of deaths occurred
(iii) over which 70% of deaths occurred.
(c) Draw a boxplot for the data, clearly identifying
the five-figure summary values.
(d) Describe the data set including some mention
of its degree of skewness.
28 Two brands of rechargeable batteries were tested to
compare their lifetimes, in hours, before they
required recharging. Samples of 10 batteries of each
brand were tested in the same equipment, producing
the following times before requiring recharging.
Big Zap: 7, 22, 9, 24, 16, 22, 25, 26, 23, 26
Sparky: 19, 20, 21, 15, 17, 15, 39, 23, 14, 15
(a) Calculate the mean, median, standard
deviation and interquartile range of lifetime
before recharging for each brand of battery.
(b) Represent the data as a back-to-back stemplot.
(c) Construct comparative boxplots for the
lifetimes of the two brands of batteries.
(d) If you wanted to be confident that the battery
you purchased would last at least 14 hours,
which battery would you buy?
(e) If you wanted to maximise your chances of the
battery you purchased lasting over 20 hours,
which battery would you buy?
(f) Which battery tends to last longer? Justify your
answer with reference to the summary statistics
and diagrams.
exam focus 4
VCAA 2004 Further Mathematics Units 3 & 4, Exam 1, Section A,
Questions 1 & 2
The following information relates to Questions 1 and 2.
The marks obtained by students who sat for a test are displayed as an ordered stemplot as shown.
0 | 9
1 | 0 1 2 5 6
2 | 0 1 1 1 3 5 5 7 8 9 9
3 | 1 2 3 4 4 6 7 7
4 |
5 | 0
1. The number of students who sat the test is
A. 25
B. 26
C. 27
D. 32
E. 50
2. The interquartile range of these test marks is closest to
A. 9
B. 13
C. 30
D. 36
E. 41
exam focus 5
VCAA 2004 Further Mathematics Units 3 & 4, Exam 1, Section A,
Questions 4 & 5
The following information relates to Questions 1 and 2.
The number of DVD players in each of 20 households is recorded in the frequency table below.
Number of DVD players    Frequency
0                        6
1                        9
2                        3
3                        1
4                        0
5                        1
Total                    20
1. For this sample of households, the percentage of households with at least one DVD player is
A. 30%
B. 45%
C. 50%
D. 70%
E. 90%
2. For this sample of households, the mean number of DVD players in these 20 households is
A. 0.75
B. 1.00
C. 1.15
D. 1.64
E. 2.00