Examiner's Manual - Chapter 5: Reliability of the TOMAGS

5
Reliability of the TOMAGS
According to Anastasi and Urbina (1997), the concept of test reliability refers to the "extent to which individual differences in test scores are attributable to 'true' differences in the characteristics under consideration and the extent to which they are attributable to chance errors" (p. 84). For instruments such as the TOMAGS, reliability coefficients must approximate or exceed .80 in magnitude for the instrument to be considered minimally reliable; coefficients must be .90 or above to be considered most desirable (Aiken, 1994; Helmstadter, 1964; Nunnally, 1978; Salvia & Ysseldyke, 1995). This chapter examines three types of errors that can affect the reliability of the TOMAGS: (a) content sampling, (b) time sampling, and (c) scorer differences.

Content Sampling

Content sampling reflects the degree of homogeneity among items within an instrument. To determine this homogeneity, internal consistency reliability of the items must be studied. This type of reliability demonstrates the extent to which the items correlate with one another, and it is computed by using Cronbach's (1951) coefficient alpha method. Coefficient alphas for the TOMAGS Primary and TOMAGS Intermediate are reported in Tables 5.1 and 5.2. Coefficient alphas were computed for the two norming groups (normal and gifted) at each age interval from 6-0 to 9-11 for the TOMAGS Primary and from 9-0 to 12-11 for the TOMAGS Intermediate using the entire normative sample as subjects. In addition, the alphas for the entire normative samples (all ages) were computed. Finally, the alphas were averaged across age intervals using the z-transformation technique. The average alphas are listed in the right-hand column of each table. The average alphas for the quotient are highly acceptable. All of the coefficients exceed .80; in 6 of 20 instances (not counting average alphas), they reach or exceed .90. The magnitude of these coefficients indicates that both levels of the TOMAGS are highly reliable and the results can be used with confidence.

Standard errors of measurement for the two groups are reported in Tables 5.3 and 5.4. As noted previously, the SEM is a useful statistic that is closely related to reliability and can be used to estimate a confidence interval that surrounds a particular score. It estimates the amount of error that may be reflected in an individual's score due to the less than perfect reliability of the instrument. The SEM is based on the formula $SEM = SD\sqrt{1 - r}$ (SD = standard deviation; r = reliability), and establishes a zone within which an individual's true score probably lies. The r in the formula is represented by the coefficient alphas reported in Tables 5.1 and 5.2.
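As an illustration of the internal consistency statistic reported in Tables 5.1 and 5.2, the following minimal sketch computes Cronbach's coefficient alpha from item-level scores. It is not the procedure or software used to norm the TOMAGS; the function name `cronbach_alpha` and the small data set are hypothetical, and the sketch assumes one row per examinee and one column per item.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a list of rows (one row per examinee, one value per item).

    alpha = k / (k - 1) * (1 - sum of item variances / variance of total scores)
    """
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha requires at least two items")
    # Transpose rows into per-item columns.
    columns = list(zip(*item_scores))
    item_variance_sum = sum(pvariance(col) for col in columns)
    total_variance = pvariance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores have no variance; alpha is undefined")
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical right/wrong (1/0) scores for five examinees on four items.
example_scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
print(round(cronbach_alpha(example_scores), 2))
```

The same variance estimator is used for the items and for the total score, so the choice between population and sample variance does not change the ratio.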
TABLE 5.1
Coefficient Alphas for the TOMAGS Primary at Four Ages and All Ages for Both Standardization Groups
(Decimals Omitted)

                      Age
Group     6     7     8     9     All Ages   Average
Normal    92    81    89    81    92         86
Gifted    90    84    88    81    90         87
TABLE 5.2
Coefficient Alphas for the TOMAGS Intermediate at Four Ages and All Ages for Both Standardization Groups
(Decimals Omitted)

                      Age
Group     9     10    11    12    All Ages   Average
Normal    81    86    90    90    88         87
Gifted    85    83    82    82    86         84
TABLE 5.3
Standard Errors of Measurement for the TOMAGS Primary at Four Ages and All Ages for Both Standardization Groups

                        Age
Group     6       7       8       9       All Ages   Average
Normal    4.24    6.53    4.97    6.53    4.24       5.61
Gifted    4.74    6.00    5.19    6.53    4.74       5.40
TABLE 5.4
Standard Errors of Measurement for the TOMAGS Intermediate at Four Ages and All Ages for Both Standardization Groups

                        Age
Group     9       10      11      12      All Ages   Average
Normal    6.53    5.61    4.74    4.74    5.19       5.40
Gifted    5.80    6.18    6.36    6.36    5.61       6.00
The clinical value of these figures is exemplified
by the case of 11-year-old Susan. She has a TOMAGS
Intermediate Quotient of 98, and the standard score
SEM (using the normal group) for her age is found to
be 4.74. Thus, the examiner knows with 68% probability that her true score lies between 93.26 and
102.74 (1 SEM); with 95% probability that her true
score lies between 88.52 and 107.48 (2 SEM); and
with 99% probability that her true score lies between
83.78 and 112.22 (3 SEM). Obviously, the smaller the
SEM, the more confidence one can have in an instrument's results.
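The arithmetic behind Susan's case can be sketched as follows. This is an illustrative sketch only: the function names `sem` and `confidence_bands` are hypothetical, the standard deviation of 15 is the standard score metric used for TOMAGS quotients, and the probability labels are the conventional approximations used in the text for bands of 1, 2, and 3 SEMs.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement: SD * sqrt(1 - r)."""
    return sd * math.sqrt(1.0 - reliability)


def confidence_bands(score, sem_value):
    """Score bands of +/- 1, 2, and 3 SEMs, rounded to two decimals."""
    return {
        k: (round(score - k * sem_value, 2), round(score + k * sem_value, 2))
        for k in (1, 2, 3)
    }


# An alpha of .90 with SD = 15 gives an SEM of about 4.74, the tabled value
# for the normal group at age 11 on the TOMAGS Intermediate.
print(sem(15, 0.90))

# Susan's quotient of 98, using the tabled SEM of 4.74.
for k, (low, high) in confidence_bands(98, 4.74).items():
    print(f"{k} SEM: {low} to {high}")
```

Using the tabled SEM (4.74) rather than the unrounded value keeps the band limits identical to those quoted for Susan.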
One cannot always assume that, because an assessment instrument is reliable for a general population, it will be equally reliable for every subgroup within that population. Persons who develop instruments thus should demonstrate that they are indeed reliable for subgroups, especially those subgroups that are likely to be assessed or that, because of ethnic, cultural, racial, or linguistic differences, may not be appropriate candidates for assessment.

The alphas for four selected subgroups within the two normative samples for both levels of the TOMAGS are reported in Tables 5.5 and 5.6. The groups studied are males, females, African Americans (blacks), and Hispanic Americans. The large composite alphas found in the right column of the tables demonstrate that the TOMAGS is equally reliable for all the subgroups investigated. These alphas are essentially the same as the average alphas reported in Tables 5.1 and 5.2 for the entire normative sample. These findings support the idea that the TOMAGS contains little or no bias relative to the subgroups studied.

TABLE 5.5
Coefficient Alphas for the TOMAGS Primary for Four Subgroups for Both Standardization Groups
(Decimals Omitted)

                              Subgroup
                              African      Hispanic
Group     Males    Females    Americans    Americans    Average
Normal    92       92         94           89           92
Gifted    92       92         88           92           91

TABLE 5.6
Coefficient Alphas for the TOMAGS Intermediate for Four Subgroups for Both Standardization Groups
(Decimals Omitted)

                              Subgroup
                              African      Hispanic
Group     Males    Females    Americans    Americans    Average
Normal    90       89         89           88           89
Gifted    86       85         90           82           86

Time Sampling

Time sampling examines the extent to which a child's test performance is constant over time, and it is generally measured using the test-retest technique. This approach involves administering an instrument and then readministering it a week or two later. Anastasi and Urbina (1997) stated that this form of reliability "shows the extent to which scores on a test can be generalized over different occasions; the higher the reliability, the less susceptible the scores are to the random daily changes in the conditions of the test takers or of the testing environment" (p. 92).

Two studies were undertaken to determine whether the results of the TOMAGS were stable over time. In both studies, 30 students identified as gifted in mathematics in a public elementary school in Texas were tested twice, with a 2-week interval between testings. The students in Study 1 took the TOMAGS Primary and were 6 through 9 years old. The students in Study 2 took the TOMAGS Intermediate and were 9 through 11 years old. Standard scores were computed (M = 100, SD = 15), and the results from the two testings for both studies were correlated. The resulting test-retest coefficients are reported in Table 5.7.

Scorer Differences

The type of reliability discussed in this section refers to the amount of test error due to examiner variability because of scoring differences. Unreliable scoring is usually the result of clerical errors or improper application of standard scoring criteria on the part of the examiner. Scorer error can be reduced considerably by the availability of clear administration procedures and detailed guidelines governing scoring, and by opportunities to practice scoring.
Nevertheless, developers should demonstrate
statistically the amount of error in their instruments
due to different scorers. To do this, Anastasi and
Urbina (1997) recommended that two trained individuals score a set of assessments independently.
The correlation between scores is a relational index
of agreement.
In the case of the TOMAGS, two staff persons in
PRO-ED's research department independently
scored 38 completed protocols for the TOMAGS
Primary and 46 completed protocols for the
TOMAGS Intermediate. The protocols were randomly
selected from the normative sample. The sample for
the TOMAGS Primary ranged in age from 8-2 through
9-10 and the sample for the TOMAGS Intermediate
ranged in age from 9-6 through 12-1. The results of
the scoring were correlated. The resulting coefficients for both TOMAGS Quotients were .99. The size
of these coefficients provides convincing evidence in
support of the scale's scorer reliability.
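The correlation between two scorers' results can be illustrated with a short sketch of the Pearson product-moment correlation. The quotients below are hypothetical and are not taken from the TOMAGS protocols; the function name `pearson_r` is likewise an assumption made for illustration. The same computation applies to the test-retest coefficients, where the two lists would hold first and second testing scores.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no variance")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical quotients assigned by two scorers to the same five protocols;
# the scorers disagree by one point on the third protocol.
scorer_a = [98, 112, 105, 120, 91]
scorer_b = [98, 112, 106, 120, 91]
print(round(pearson_r(scorer_a, scorer_b), 3))
```

A high correlation shows that the two scorers rank protocols alike; it does not by itself show that they assign identical scores, because a constant difference between scorers leaves the coefficient unchanged.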
Summary
The TOMAGS's overall reliability is summarized in
Table 5.8, which shows the scale's status relative to
three sources of test error: content, time, and scorer.
The coefficients displayed are drawn from those
reported in previous sections of this chapter. They
were averaged using the z-transformation technique
described by Guilford and Fruchter (1978). As can be
seen from viewing the figures listed in the average
column at the right of the table, the TOMAGS evidences a high degree of reliability. This reliability is
consistently high across all three sources of test error.
The magnitude of these coefficients strongly suggests that (a) the TOMAGS possesses little test error
and (b) users can have confidence in its results.
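The z-transformation averaging mentioned in this chapter can be sketched as follows: each coefficient is converted to Fisher's z, the z values are averaged, and the mean is converted back to a coefficient. This is a minimal sketch that assumes a simple unweighted mean; the manual does not state whether the published averages were weighted by sample size, so results may differ slightly from some tabled values. The function name `average_coefficients` is hypothetical.

```python
import math


def average_coefficients(coefficients):
    """Average correlation-type coefficients through Fisher's z transformation.

    Each coefficient must lie strictly between -1 and 1, because the
    transformation is undefined at the endpoints.
    """
    z_values = [math.atanh(r) for r in coefficients]
    mean_z = sum(z_values) / len(z_values)
    return math.tanh(mean_z)


# Reading the Normal row of Table 5.5 as .92, .92, .94, and .89 for the four
# subgroups, the unweighted average rounds to .92.
normal_subgroups = [0.92, 0.92, 0.94, 0.89]
print(round(average_coefficients(normal_subgroups), 2))
```

Averaging in the z metric gives more weight to large coefficients than a simple arithmetic mean does, which is why the technique is preferred for values near the upper end of the scale.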
TABLE 5.7
Test-Retest Reliability for the TOMAGS

                      First Testing       Second Testing
Level           N     M         SD        M         SD        r
Primary         30    112.28    16.69     115.10    15.74     .84
Intermediate    29    109.59    12.60     111.29    10.94     .94
TABLE 5.8
Summary of TOMAGS: Reliability Related to Three Sources of Test Error
(Decimals Omitted)

                                    Source of Test Error
                       Content Sampling     Time
                       Normal    Gifted     Sampling    Scorer    Average
TOMAGS Primary         92        90         84          99        93
TOMAGS Intermediate    88        86         94          99        93