
J. EDUCATIONAL COMPUTING RESEARCH, Vol. 24(1) 57–75, 2001
ON TEST AND COMPUTER ANXIETY:
TEST PERFORMANCE UNDER CAT
AND SAT CONDITIONS
MARK D. SHERMIS
HOWARD R. MZUMARA
SCOTT T. BUBLITZ
Indiana University Purdue University Indianapolis
ABSTRACT
This article examines the differences between computer-adaptive testing (CAT) and
self-adapted testing (SAT) along with possible differences in feedback
conditions and gender. Areas of comparison include measurement precision/
efficiency and student test characteristics. Participants included 623 undergraduates from a large Midwestern university who took math placement tests in a
4 (condition) × 2 (feedback) × 2 (gender) design. The four conditions included:
a) CAT; b) SAT-Global; c) SAT-Individual; and d) SAT-Placebo groups.
Multivariate Analysis of Variance was used to analyze the data. The perceived
control hypothesis was used as a framework to explain the differences between
CAT and SAT. Results indicated that measurement efficiency is differentially
affected by the type of test condition with the SAT-Global condition performing worse than the others. Moreover, there were significant gender effects
with regard to ability, test length, and test anxiety. There was no relative advantage for the inclusion of item feedback. Implications for computerized adaptive
testing and areas of future research are discussed.
INTRODUCTION
Computerized testing is a popular alternative to the traditional paper-and-pencil
format. These instruments are easy and inexpensive to administer, and generally
take less time to score and tabulate. Moreover, computerized-adaptive testing
(CAT) has been shown to produce more precise ability estimates while requiring
© 2001, Baywood Publishing Co., Inc.
fewer items than traditional fixed-item tests [1]. However, the increased efficiency of computerized testing may come at the cost of higher test anxiety for
some examinees [2].
Self-adapted testing (SAT) attempts to give the examinee increased control over the testing situation and thereby reduce test anxiety, and it has met with some success [3, 4]. Instead of the computer systematically selecting an item to maximize information at the examinee’s current ability estimate, SAT lets individuals select items calibrated to a “desired” (relative) difficulty level. Otherwise, the calculation of the ability estimate and its precision is carried out as it normally is with CAT. For example, before an item is presented, the examinee is asked how difficult an item he or she would prefer.
The levels of difficulty can vary, typically ranging from three (easy, medium, hard)
up to eight. Using the difficulty range chosen by the individual, the algorithm selects
items specifically tailored to the present ability estimate. SAT has been found to
lessen the increase in anxiety that occurs with computerized testing [3-5].
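The difference between the two selection rules can be made concrete with a short sketch. The Python fragment below is illustrative only and is not the algorithm implemented in the software described later in this article; the item bank, the Rasch-type information function, and the six-level difficulty banding are simplifying assumptions.

```python
import math

# Hypothetical 156-item bank; each item carries an IRT difficulty parameter b.
ITEM_BANK = [{"id": i, "b": -3.0 + 0.04 * i} for i in range(156)]

def information(theta, b):
    """Fisher information of a Rasch-type item at ability theta (a simplifying assumption)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def cat_select(theta, administered):
    """CAT rule: choose the unused item with maximum information at the current theta."""
    unused = [item for item in ITEM_BANK if item["id"] not in administered]
    return max(unused, key=lambda item: information(theta, item["b"]))

def sat_select(chosen_level, administered, n_levels=6):
    """SAT rule: choose an unused item from the difficulty band the examinee requested (1 = easiest)."""
    ranked = sorted(ITEM_BANK, key=lambda item: item["b"])
    band_size = len(ranked) // n_levels
    band = ranked[(chosen_level - 1) * band_size: chosen_level * band_size]
    unused = [item for item in band if item["id"] not in administered]
    # Fall back to the CAT rule if the requested band has been exhausted.
    return unused[0] if unused else cat_select(0.0, administered)
```

In a placebo variant of this scheme (see the study design described below), the program would still ask for a difficulty level but would silently ignore it and apply the CAT rule.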
Research has not fully investigated the cause of anxiety reduction in SAT.
Three hypotheses were offered by Wise et al. [6]. One possible explanation, the
self-monitoring hypothesis, states that SAT allows the examinee access to information beyond what would be available to a traditional computerized testing
algorithm (e.g., current affective (emotional) and motivational states) [7]. For
example, some examinees begin an examination by feeling apprehensive, anxious,
or insecure about their ability. In other cases, examinees may feel quite optimistic
and confident in their mastery of the test content. This additional information
about affective and motivational states allows the student to tailor the test to their
specific psychological states and thus reduce their anxiety. Another hypothesis
is that examinees who are anxious about the assessment experience test-irrelevant
thoughts that interfere with test performance. With SAT, the act of continuously
choosing item difficulty levels may block test-irrelevant thoughts and keep the
examinee more focused. Allowing students to choose the difficulty of the items
may be enough to keep them focused and thereby reduce anxiety during the test
[8, 9]. A final possible explanation of SAT effects focuses on perceived
control. Perceptions of control over stress sources have been hypothesized to
improve performance, reduce anxiety, and increase motivation [5]. Since SAT
allows for choice over item difficulty, examinees can perceive this as increased
control of the testing process. Wise suggests that “a person’s perception of control
has been found to be more important in stress reduction than actually having control in an aversive situation” [5, p. 18].
The purpose of this research is to examine possible differences between CAT
and SAT with respect to test anxiety, computer anxiety, and a variety of efficiency
measures. Moreover, this research will examine the participants’ attitudes about
CAT and SAT. Attitudes of interest include examinees’ opinions concerning
strengths, weaknesses, and modifications of each administrative condition. Finally,
the present study intends to explore the processes underlying self-adapted testing
by manipulating several key characteristics. One way this will be accomplished
is by varying levels of control over item difficulty: some individuals will vary the difficulty level of each item presented, while others can set only a single difficulty level for all items in the test (referred to as the “global difficulty level”). Test control is also manipulated through the presence or absence of item response feedback.
A great deal of research has recently been conducted comparing computer-adaptive testing (CAT) and self-adapted testing (SAT). In general, it appears that
CAT surpasses the SAT in terms of measurement precision and efficiency [10].
Wise et al. found that participants taking a 20-item CAT took less time to complete
the test and received a lower standard error of ability estimate than participants
given the 20-item SAT [9]. These results were replicated in another study by Roos,
Wise, and Plake [11]. They also found that the average difficulty of items administered did not differ between CAT and SAT conditions. Another study by Ponsoda,
Wise, Olea, and Revuelta attempted to extend the findings to a non-U.S. population of Spanish high-school students [12]. Although they found no
difference in ability estimates between CAT and SAT, they did find that the CAT
had a lower standard error of ability estimate and administered more difficult
items than the SAT.
In another study by Vispoel and Coffman, a direct comparison was made among
fixed-item (FIT), CAT, and SAT tests [13]. The authors concluded that CATs were
significantly more precise and efficient in their ability estimates than either SATs
or paper-and-pencil tests. Specifically, in order to achieve a specific precision
level the SAT needed almost twice as many items and the FIT more than three times
the number of items as the CAT. Rocklin et al. also found that students taking the
SAT not only needed more time to complete each item than those taking the CAT,
but also needed more items to reach a specified precision level [3]. Finally, two
studies by Vispoel found that CAT was more precise and required less time to
administer than SAT [14, 15]. These results are not surprising, considering that
computer adaptive tests use algorithms to administer items that provide the most
information about the examinee’s ability. Due to the examinee’s option to choose
the item difficulty level in SATs, the efficiency of the test could decrease when
items too easy or too difficult for the examinee are selected. In addition, the time that the examinee spends choosing difficulty levels of the items inflates the length of the
testing session [9, 11]. However, problems other than precision and efficiency
may materialize in testing situations that reduce the effectiveness of CAT.
Individual difference variables such as computer anxiety, verbal self-concept,
or test anxiety may confound the ability estimates derived from CATs, which can
limit their advantage over SATs. For computer anxiety, previous research has not
established whether this variable significantly affects computer test performance.
One position in this debate suggests that computer anxiety can significantly affect
test performance [16-18], while other research examining computer anxiety fails
to find this relationship [8, 19]. The Vispoel, Rocklin, and Wang study also found
similar results for another individual variable, verbal self-concept, which was based
on the participants’ belief in their ability to perform well on the administered vocabulary test [8]. Their research suggested that participants with low verbal self-concepts who took the SAT performed better than similar participants in the CAT and FIT conditions.
Research regarding the effects of test anxiety on CAT and SAT is consistent with the literature on computer anxiety. For CAT, Mulkey and O’Neil found that success and failure on the CAT affected state self-efficacy and the worry aspect of test
anxiety [20]. Vispoel found that students with lower levels of test anxiety performed better than those with higher levels of test anxiety [14]. However, an
earlier study by Vispoel and Coffman found that estimated ability and test anxiety
were significantly correlated in the CAT and FIT conditions, but were not in the
SAT condition [13]. Another study performed by Rocklin, O’Donnell, and Holst
indicated similar findings on the anxiety-performance relationship [3]. The relationship found between ability estimates and anxiety was strongly negative in the
CAT condition (r = –.53), but significantly weaker in the SAT condition (r = –.16).
This difference in relationship strength means that test anxiety contaminated the
ability estimates much less in the SAT than in the CAT. Other studies also found
that participants in the SAT condition reported significantly less post-test state
anxiety than those taking the CAT [7, 9]. These findings collectively suggest that
the SAT has higher construct validity because test anxiety may confound the CAT
ability estimates.
Research has not yet investigated the cause of the anxiety reduction in the self-adapted test. One explanation, taken from the psychological literature, focuses on perception of control. Averill found that subjects who were given control over aversive stimuli had lower stress levels when compared to subjects who had no control
[21]. Research performed by Kanfer and Seidner indicated that people can tolerate unpleasant situations if they believe they have some control over the source,
even if no actual control is present [22]. Allowing students to believe that they are in
control of the testing situation may be enough to reduce their anxiety during the test.
Another important consideration when comparing CAT and SAT is examinees’
attitudes about the various features of the tests. Research indicates that people
generally prefer computerized testing, but dislike not being able to review or skip
items as in traditional FIT formats [cf. 8]. Research directly comparing participant
attitudes of CAT and SAT has been sparse. Research conducted by Vispoel, Rocklin,
and Wang provided evidence that the inclusion of answer feedback and item difficulty selection was seen as a strength of the SAT [8]. They also found that most
participants in the SAT condition rated the ability to review and skip items as less important than did those in the CAT condition. Since the inability to review and skip items is viewed as a disadvantage of the CAT, using the SAT may reduce these negative
effects [10].
Feedback in SAT allows participants to make informed choices about items
based on the correctness of their last response [23]. The majority of previous
research indicates that the inclusion or absence of answer feedback should not
affect the construct validity of mean ability estimates [24, 25]. In addition, Vispoel
found that administration time for CAT and SAT decreased when item feedback
was given, and that this reduction was greater when the examinee had high levels of test
anxiety [14]. One study by Roos et al. did find higher proficiency estimates for a
subset of examinees receiving feedback in CAT and SAT conditions, but admitted
the result was difficult to interpret and suggested additional research in the area
[11]. The present study attempts to investigate further the effects of feedback on CAT
and SAT in terms of efficiency, anxiety, and student attitudes toward testing.
This study used four test conditions to investigate underlying differences between
CAT and SAT: a traditional CAT; a SAT-Individual condition in which individuals select the difficulty of each item; a SAT-Global condition in which the individual initially selects the difficulty range for the entire test; and a SAT-Placebo condition in which the individual is asked to choose the difficulty of each item, but the computer selects items in a manner similar to the CAT condition. The SAT-Placebo condition looks identical to the SAT-Individual condition to the examinee, but the items are selected using the same algorithm as the CAT condition. The SAT-Placebo
attempts to separate the actual control from the perception of control over the item
difficulty. These four test conditions were crossed with either the presence or
absence of feedback about the correctness of the individual response.
Based on a review of the literature, the present study addressed the following
three research hypotheses: 1) test conditions that employ CAT procedures will
yield more precise and efficient ability estimates than SAT conditions; 2) participants in the SAT conditions will report significantly less post-test state anxiety
than those taking CATs; and 3) the inclusion of item response feedback will
result in greater efficiency across test conditions where feedback exists, as participants make informed choices about items based on the correctness of their
last response.
METHOD
Participants
Participants were 623 undergraduates in a large Midwestern university. All
entering students are required to take math, reading, and written English exams
in order to be placed in appropriate courses. Of the students who participated,
75 percent were freshmen, 12 percent were sophomores, 4 percent were juniors,
6 percent were seniors, and 3 percent (17 cases) failed to report their class level. In
addition, 46 percent of the participants were male and 54 percent were female.
With respect to ethnicity, 79 percent (495 cases) were classified as white and
18 percent (109 cases) as nonwhite, with 3 percent (21 cases) missing. The age of
the participants ranged from fifteen to sixty-two years old, with a median age
of twenty-one years.
Instruments
Computerized Adaptive Test
The first instrument was a computerized adaptive test. Each individual’s test was
drawn from an item bank of 156 questions designed to assess college mathematics
ability and to place students in one of several mathematics course options. The test
curriculum was divided into four general clusters, including elementary algebra
(23 items), advanced algebra (43 items), geometry and trigonometry (64 items), and
calculus (26 items) [26]. Because of the item bank size, it was not possible for any
one individual to complete all the items. Consequently, three paper-and-pencil
tests, each with fifteen anchor items, were constructed and vertically equated. A
check on the internal consistency of the three forms, using the alpha (α) coefficient,
yielded coefficients that ranged from .77 to .93. The item data obtained from these
tests (N = 1360) were used to calibrate and equate the items using the LOGIST IV,
a computer program that uses a joint maximum likelihood method in estimating
ability and item parameters. See Wingersky [27] and Wingersky, Barton, and
Lord [28] for details on the process of model parameter estimation or item calibration used in LOGIST. The steps involved in constructing the test and calibrating
the items are described in more detail elsewhere [29]. The dimensionality of the test
was examined using a procedure employed by Cook, Dorans, and Eignor [30]. The
item bank was found to be essentially unidimensional [31].
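LOGIST estimates item and ability parameters jointly; the details are given in the cited manuals. As a rough illustration of only the ability-scoring half of that process, the sketch below applies Newton-Raphson maximum likelihood to fixed, already-calibrated item parameters under a three-parameter logistic model with the conventional 1.7 scaling constant. The model choice and function names are assumptions, not a description of LOGIST itself.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))

def estimate_theta(responses, items, iterations=20):
    """Newton-Raphson ML estimate of ability with item parameters held fixed.
    responses: list of 0/1 scores; items: list of (a, b, c) tuples."""
    theta = 0.0
    for _ in range(iterations):
        d1 = d2 = 0.0
        for u, (a, b, c) in zip(responses, items):
            p = p_3pl(theta, a, b, c)
            q = 1.0 - p
            dp = 1.7 * a * (p - c) * q / (1.0 - c)    # dP/dtheta for the 3PL
            d1 += dp * (u - p) / (p * q)              # first derivative of the log-likelihood
            d2 -= dp ** 2 / (p * q)                   # minus the test information
        theta = max(-4.0, min(4.0, theta - d1 / d2))  # Newton step, clamped to a plausible range
    se = 1.0 / math.sqrt(-d2)                         # standard error from the test information
    return theta, se
```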
The test items were then transferred to the computerized adaptive testing
package, HyperCAT™, which runs under HyperCard™ on MacOS® computers or
WinPlus™ on Windows®-based machines. HyperCard™ and WinPlus™ are
database programming environments that allow the test administrator to create
educational “stackware.” Stackware is similar to a compiled program except that
it does not use formal computer languages, but rather is “scripted.” Each stack
contains “cards” on which objects appear. These objects represent “events” that
the author wants the computer to produce. One such “event” might be the presentation of a test item such as that depicted in Figure 1.
Shermis and Chang examined the results of tests administered using this item
bank against a 40-item paper-and-pencil math test covering similar domains and
found a strong concordance between the two (r(79) = .79, p < .05) [31]. The same
study evaluated the marginal reliability of the adaptive test. Marginal reliability
as measured by r is the CAT counterpart to the calculation of α, a measure of
internal consistency. For the Shermis and Chang study, it was computed to be r =
.80. Mzumara, Shermis, and Wimer conducted a recent study which looked at the
predictive validity of the adaptive math test in a large Midwestern university [32].
The correlation with final math exam scores was r = .55, p < .05.
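The article does not give the formula behind the marginal reliability figure, but a commonly used definition treats it as the proportion of variance in the ability estimates that is not estimation error. A minimal sketch under that assumption:

```python
from statistics import mean, pvariance

def marginal_reliability(theta_estimates, standard_errors):
    """Marginal reliability as 1 minus the ratio of mean error variance to the
    variance of the ability estimates (one common definition; an assumption here)."""
    error_variance = mean(se ** 2 for se in standard_errors)
    return 1.0 - error_variance / pvariance(theta_estimates)
```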
Self-Adaptive Test
The second instrument was the self-adaptive test. The SAT contained items from
the same pool as the existing CAT, but the pool was further divided into six difficulty levels.
Figure 1. An example of what a test item might look like in HyperCard™.
Item difficulty levels can be set either globally at the beginning of the test or individually prior to the administration of each subsequent item.
Test Anxiety Inventory (TAI) [33]
This measure contains twenty items rated on a 4-point Likert scale ranging
from almost never to almost always. Examples of items are “During tests I feel
very tense” and “Thoughts of doing poorly interfere with my concentration on
tests.” The TAI consists of two subscales—worry and emotionality—and scores
range from 20 to 80, with the higher scores indicating increased test anxiety. The
TAI manual reports an internal consistency α of .92 for the overall scale, .88 for the worry subscale, and .90 for the emotionality subscale.
Validity estimates reported in the manual range from .34 to .82, with a median
value of .75 [33].
Computer Anxiety Rating Scale
Participants also completed the Computer Anxiety Rating Scale (CARS) [34].
The CARS consists of nineteen items rated on a 5-point Likert scale ranging from
strongly disagree to strongly agree. Typical items on this scale include “I feel
apprehensive about using computers” and “I feel insecure about my ability to
interpret a computer printout.” Scores on the CARS range from 19 to 95, with
higher scores indicating higher computer anxiety. The CARS manual reported an
internal consistency of α = .87 and a test-retest reliability of r = .70. Estimates of
validity range from r = .20 to .74 with a median value of .48 [35].
Student Attitude Questionnaire (SAQ)
This questionnaire consisted of Likert scale and open-ended items assessing
participant attitudes about specific features of the test. Among the test features
assessed were clarity of directions, usefulness of practice items, and general attitudes toward the testing procedure. Because the initial ten items were identical for
all participants, comparisons of student attitudes across conditions were possible.
In addition, participants were asked to describe three things they liked and disliked about the testing format. The questionnaire also included a space to write
additional comments to improve the quality of the test.
Procedure
As participants checked in to take the mathematics placement test, proctors
randomly assigned each participant to one of the four test conditions (i.e., CAT,
SAT-Individual, SAT-Global, and SAT-Placebo) in addition to one of the two
feedback (yes, no) conditions. Note that the order of test conditions was determined randomly using a table of random numbers. The step involving selection of
the particular order of test condition/administration was completed prior to the
subjects’ arrival for placement testing. Subjects were given instructions describing the details of the test. The measures administered included: a) computerized
mathematics test; b) TAI; c) CARS; and d) SAQ. Participants in the CAT condition completed the first ten items of the SAQ. The participants in the SAT
conditions completed the full 16-item SAQ with the additional six items assessing
attitudes about item difficulty selection.
The present study utilized a 4 × 2 × 2 independent groups design. The three independent variables included: condition: 1) CAT, 2) SAT-Individual, 3) SAT-Global, 4) SAT-Placebo; feedback: 1) feedback, 2) no feedback; and gender: 1) male,
2) female. Two sets of dependent variables were used in this study. The first set
included estimated ability (theta), number of items, and amount of time spent
taking the test. This set of dependent variables focused on characteristics of
student performance. The second set of dependent variables was directed toward
characteristics of the examinee, including level of computer anxiety (CARS),
level of test anxiety (TAI), and examinee attitudes towards the testing situation
(SAQ).
The condition variable was administered with four levels of control. In the first
condition (CAT), the test was given as a standard computerized adaptive test. The
participants had no control over which item was administered next. In the second
condition (SAT-Individual), the participants were asked to choose the difficulty
level of each individual item presented. In the third condition (SAT-Global), the
student could choose the level of difficulty of all items administered before the test
began. Since each of the difficulty levels contained more items than the maximum
number allowed on the test, participants could not exhaust the items available
at any particular difficulty level. The final condition (SAT-Placebo) appeared
to allow the students to choose the difficulty level of each item similar to the
SAT-Individual. However, the difficulty levels were selected by the computer
using the same algorithm as in the CAT. Thus, this condition examines what
effect the perception of control has on SAT, apart from the actual ability to control
the test difficulty.
The present research manipulated the presence of feedback about participant item responses. In one condition, the student received a message indicating
whether or not they responded correctly. In the other condition, the student
received no such feedback. The CAT, SAT, or placebo test ended once either the standard error of the participant's ability estimate fell below .2 or twenty-five items had been administered. The twenty-five-item limit was incorporated as an operational constraint for this test. Previous work with the item bank suggested that good placements could be achieved with as few as twenty items [26]. This limit was
raised to twenty-five to ensure variability with respect to the number of items
administered.
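The termination logic just described can be summarized in a few lines. The loop below is a schematic sketch, not the HyperCAT/WinPlus implementation; the three callables stand in for the item-selection, response-capture, and ability-update steps, and their names are hypothetical.

```python
SE_CUTOFF = 0.2    # stop once the error of the ability estimate falls below this value
MAX_ITEMS = 25     # operational ceiling on test length

def administer_adaptive_test(select_item, get_response, update_ability):
    """Generic adaptive loop reflecting the stopping rule described above."""
    theta, se = 0.0, float("inf")
    record = []                                    # (item, scored response) pairs so far
    while se > SE_CUTOFF and len(record) < MAX_ITEMS:
        item = select_item(theta, record)          # CAT, SAT, or placebo selection rule
        record.append((item, get_response(item)))  # present the item and score the answer
        theta, se = update_ability(record)         # re-estimate ability and its error
    return theta, se, len(record)
```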
RESULTS
The primary data for the present study were analyzed using both multivariate
and univariate statistical methods. The first set of analyses focuses on the adequacy
of the instruments employed in the study. The second set of analyses examines the
research hypotheses and follow-up exploratory work.
Instruments
Because the Student Attitude Questionnaire was created for this study, a factor
analysis was conducted to determine scale dimensionality. Figure 2 shows the scree
plot for the SAQ. The scale showed adequate evidence for unidimensionality
with the first factor explaining almost four times as much variance as the second
factor. The internal consistency reliability for the SAQ was α = .87. The ten-item version of the SAQ had a mean of 36.82 and a standard deviation of 5.65. In addition, a factor analysis of the last six items of the SAQ, administered to all SAT conditions, indicated unidimensionality. An internal consistency reliability coefficient of α = .79 was computed for these six items. The items had scores ranging from 8 to 30, with a mean of 21.46 and a standard deviation of 3.72. The sixteen-item Student Attitude Questionnaire for the SAT conditions resulted in an
average score of 57.70 and a standard deviation of 8.44.
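The internal consistency coefficients reported here are Cronbach's alpha values. For readers who want to reproduce such a figure from raw item scores, a generic computation looks like the sketch below; the function name and the respondents-by-items input format are assumptions, since the article does not say which software produced its coefficients.

```python
from statistics import pvariance

def cronbach_alpha(score_matrix):
    """Cronbach's alpha from a respondents-by-items matrix (list of equal-length rows)."""
    n_items = len(score_matrix[0])
    totals = [sum(row) for row in score_matrix]
    item_variances = [pvariance([row[j] for row in score_matrix]) for j in range(n_items)]
    return (n_items / (n_items - 1)) * (1.0 - sum(item_variances) / pvariance(totals))
```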
Figure 2. The scree plots for the Student Attitude Questionnaires.
These results indicate moderately positive reactions to the SAT conditions.
Common responses to the open-ended item regarding self-adaptive testing were
that the procedure was faster and easier than conventional (paper-and-pencil)
tests, that the results were immediate, and that students liked the ability to choose the item
difficulty level. Examinees also indicated that they wanted the ability to review or
skip items, and expressed a desire for more sample items in the instructions. There
was also an expressed concern about the confidentiality of test scores.
A reliability analysis was then conducted for the CARS and TAI. The results were
consistent with those reported in the test manuals (α = .89 and α = .94, respectively). The overall mean score on the TAI was 39.16 with a standard deviation of
11.81, with a worry subscale mean of 14.25 (SD = 4.69) and emotionality subscale
mean of 15.76 (SD = 5.23). The CARS scale reported an overall mean of 36.85
with a standard deviation of 9.93. A complete summary of the descriptive statistics is provided in Table 1.
Research Hypotheses
The three main hypotheses focused on measurement precision and/or efficiency, test anxiety/satisfaction, and the role of feedback in computerized testing.
The issue of gender-related differences in ability estimation and test satisfaction
was also addressed. The group means for the dependent variables, broken down
by condition (CAT, SAT-Global, SAT-Individual, SAT-Placebo) and feedback
(yes, no) conditions for test length, test time, ability, SAQ score, CARS, and TAI
are given in Table 2.
Table 1. Descriptive Statistics of Important Variables for All Examinees

Variable                             M        SD       n
Theta                              –1.39     1.02     623
Items Administered                  23.34     4.42     623
Placement Score                      9.34     3.62     623
Test length (minutes)               27.49    14.71     623
TAI-Overall (α = .94)               39.16    11.81     604
TAI-Worry                           14.25     4.69     612
TAI-Emotionality                    15.76     5.23     617
CARS (α = .89)                      36.85     9.93     590
SAQ-CAT Conditions (α = .87)        36.82     5.65     167
SAQ-SAT Conditions (α = .79)        57.70     8.44     425
Table 2. Group Means for Test Length, Test Time, Ability Estimates, SAQ, Total CARS, and Total TAI Score

Test Type                     (1)      (2)      (3)      (4)      (5)      (6)      n
CAT (no fdbk.)               23.30    26.08    –1.40    36.96    36.42    40.80    84
SAT-Individual (no fdbk.)    22.71    26.79    –1.41    36.41    36.25    39.66    81
SAT-Global (no fdbk.)        24.76    29.70    –1.40    36.32    38.07    38.35    77
SAT-Placebo (no fdbk.)       23.08    28.05    –1.37    36.57    37.17    37.10    73
CAT (w/fdbk.)                23.08    26.08    –1.43    36.67    36.67    39.68    83
SAT-Individual (w/fdbk.)     22.75    27.63    –1.29    35.98    36.49    39.99    74
SAT-Global (w/fdbk.)         24.94    29.45    –1.50    35.50    37.25    38.12    71
SAT-Placebo (w/fdbk.)        22.27    26.63    –1.29    36.47    36.54    39.37    80

(1) = Test Length (Items)
(2) = Test Time (minutes)
(3) = Ability Estimate (theta)
(4) = Score on SAQ (common CAT & SAT items)
(5) = Total CARS Score
(6) = Total TAI Score
The first MANOVA utilized the three performance-related dependent variables
with condition, feedback, and gender as the independent variables. The results of
this analysis are given in Table 3. Two main effects, condition (Wilks' λ = .94, F(3,607) = 4.52, p < .001) and gender (Wilks' λ = .96, F(1,607) = 8.29, p < .001), emerged
as being statistically significant. A follow-up on the independent variable condition showed that the differences occurred in the test length dependent variable
(F(3,607) = 8.79, p < .001) with the SAT-Global condition requiring significantly
more items (M = 24.85) than the other three conditions (combined M = 22.79), a
difference of about two items.
With regard to gender, two of the three dependent variables were identified as
having significant differences. Females required significantly more items to complete the test than males (F(1,607) = 23.77, p < .001; females M = 23.98, males M = 22.63), while males had statistically higher ability estimates than females (F(1,607) = 14.98, p < .001; females M = –1.57, males M = –1.17).
Table 3. Summary of Results for the Multivariate Analysis of Variance for the Dependent Variables of Ability, Number of Items, and Number of Minutes

Effect                           Wilks' λ   Hypoth. DF   Error DF      F       p
Condition                          .94           3          607       4.59    .001
Feedback                           .99           1          605        .17     ns
Gender                             .96           1          607       8.29    .001
Condition × Feedback               .99           3          607        .34     ns
Condition × Gender                 .97           3          607       1.56     ns
Feedback × Gender                  .99           1          607        .06     ns
Feedback × Condition × Gender      .99           3          607        .61     ns
Table 4. Summary of Results for the Multivariate Analysis of Variance for the Dependent Variables of TAI, CARS, and SAQ

Effect                           Wilks' λ   Hypoth. DF   Error DF      F       p
Condition                          .98           3          557       1.06     ns
Feedback                           .99           1          557        .70     ns
Gender                             .94           1          557      12.64    .001
Condition × Feedback               .99           3          557        .26     ns
Condition × Gender                 .97           3          557       1.77     ns
Feedback × Gender                  .99           1          557        .43     ns
Feedback × Condition × Gender      .99           3          557        .34     ns
The second MANOVA incorporated the three examinee characteristic measures, including the CARS, TAI, and SAQ. As before, the independent variables
consisted of the Condition, Feedback, and Gender variables. The results of this
analysis are given in Table 4.
Only one main effect, gender (Wilks' λ = .94, F(3,555) = 12.64, p < .001), emerged as being statistically significant. A follow-up analysis revealed that differences occurred in the
Total TAI Score dependent variable (F(1,555) = 37.58, p < .001) with females displaying significantly higher levels of test anxiety (M = 41.63) than males (M = 35.76).
With respect to the Feedback condition, the present study did not yield any statistically significant differences as either a main effect or interaction.
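Both multivariate analyses follow the same general form. A minimal sketch of such an analysis using the statsmodels MANOVA class is shown below; the data file and column names are hypothetical stand-ins, since the original data set is not distributed with the article.

```python
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical file: one row per examinee with the dependent and independent variables.
df = pd.read_csv("placement_test.csv")

# First MANOVA: performance-related dependent variables.
performance = MANOVA.from_formula(
    "theta + n_items + minutes ~ C(condition) * C(feedback) * C(gender)", data=df
)
print(performance.mv_test())   # reports Wilks' lambda (among other criteria) for each effect

# Second MANOVA: examinee-characteristic dependent variables.
characteristics = MANOVA.from_formula(
    "tai + cars + saq ~ C(condition) * C(feedback) * C(gender)", data=df
)
print(characteristics.mv_test())
```

Significant multivariate effects would then be followed up with univariate tests on the individual dependent variables, as described in the text above.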
DISCUSSION
This study was designed to compare CAT and SAT across two sets of outcome
variables, those related to performance/efficiency and those associated with test
characteristics of participants. Both the CAT and SAT conditions were differentiated by whether or not feedback was provided and by the gender of the research
participant. Moreover, SAT was modified by the amount of control given to the
examinee. In one condition the individual was able to specify the difficulty level
of each subsequent item while in the other condition only a global setting of item
difficulty was permitted. A placebo condition was also included which allowed
examinees to make item difficulty level selections (individual or global) even though
actual item selection was accomplished through CAT scoring.
A main effect was found with the condition variable, but the statistically significant differences were only between the SAT-Global and the other three groups for
test length. There were no statistically significant differences among the SAT-Individual, SAT-Placebo, and CAT groups. This is heartening news that suggests there is no performance penalty for using SAT unless one incorporates the global strategy. Even though the SAT-Global condition did require significantly more items, the difference was relatively modest, about two items. Since most implementations of SAT are consistent with the SAT-Individual strategy, the current study suggests that there will be no loss in test efficiency (compared to CAT) from using it.
The gender differences were consistent with previous research on the math test
[32]. Males scored significantly higher on this measure of ability and required significantly fewer items to complete the adaptive test than females. The differences
on both ability and test length, however, were small. One plausible explanation is that the observed gender-related difference reflects the general observation that female students tend to perform less well in mathematics than their male counterparts. Thus, a marginal difference in the response patterns between female and male students could have had some impact on both measurement
precision and efficiency of test administration in mathematics. Further research on
the gender issue seems warranted.
Unfortunately, this study could not replicate some of the advantages of SAT
documented in previous work [3-5]. There were no statistically significant differences in rates of computer anxiety, test anxiety, or satisfaction with the test
experience among the CAT and SAT conditions. The only significant difference
revolved around the higher test anxiety scores of females when compared to their male counterparts.[1]

[1] With respect to Table 4, forty cases are missing because of listwise deletion of data employed in this multivariate analysis. However, the distribution of missing cases is random with respect to the three independent variables.
The lack of statistically significant feedback differences was also surprising.
Providing feedback is perhaps a double-edged sword. For some it may be reassuring to know that one is “doing well” on the test, while for others it may be unnerving to
see a high proportion of “incorrect” responses. Alternatively, it may be the case
that students inadvertently get feedback from the test simply by observing the difficulty level of the next administered item. Based on our informal work with students,
one of the biggest demands regarding the CAT is not so much feedback on individual
item responses, but rather knowing how many total items remain on the test.
In an attempt to explain the inconsistency with previous literature regarding
the lack of a main effect for satisfaction, a follow-up ANOVA on the test satisfaction instrument was conducted. A mean comparison was made on satisfaction levels
of the CAT and SAT-Individual for high and low ability students (as measured by
their math scores). These analyses extended the work done by Vispoel, Rocklin,
and Wang [8] where they found interactions existing for two other individual
differences: verbal self-concept and test anxiety. Theta level was divided into two
equal groups using a median split. The results indicated a significant ability-by-test-type (disordinal) interaction effect (F(1,291) = 4.57, p < .02) between CAT and
SAT-Individual satisfaction for the high- and low-ability groups. Specifically, low-ability students in the SAT conditions rated their satisfaction with the test lower than low-ability students in the CAT conditions.
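A follow-up of this kind can be sketched as a median split followed by a two-way ANOVA. The fragment below is illustrative only; the data file and the "saq" satisfaction column are the same hypothetical stand-ins used in the earlier sketch.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

df = pd.read_csv("placement_test.csv")   # hypothetical file, as in the earlier sketch
sub = df[df["condition"].isin(["CAT", "SAT-Individual"])].copy()

# Median split on estimated ability, then a 2 x 2 ANOVA on the satisfaction (SAQ) scores.
sub["ability_group"] = (sub["theta"] >= sub["theta"].median()).map({True: "high", False: "low"})
model = ols("saq ~ C(condition) * C(ability_group)", data=sub).fit()
print(sm.stats.anova_lm(model, typ=2))   # the interaction row corresponds to the ability-by-test-type effect
```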
One rationale for limiting SAT has been the argument that SATs are less precise and/or require more items than CATs to reach similar levels of precision. The
results of this study suggested that while globally-set SATs did require more items,
the practical differences were minor.
On the other hand, the prevailing argument used by SAT proponents is that
examinees will be more satisfied with their testing experience once test and computer anxiety levels are controlled. The results of this study provide mixed
evidence pertaining to satisfaction. Although the high-ability students may have
preferred the increased control provided by the SAT, the converse relationship
was observed for the low-ability group. The results of the study were inconsistent
with the predicted direction (SAT having higher satisfaction levels than CAT).
One possible explanation is that lower-ability students prefer CAT to SAT because the SAT demands more information processing than is required
in CAT. Perhaps the lower-ability students have less desire for procedural control
created by the SAT than the higher-ability students.
Of course it is possible that the results could be merely a reflection of the item
bank characteristics or pool of student respondents. For example, an item bank of
only 156 items could conceivably be too limited in generating desired difficulty
levels, especially for the SAT conditions. In addition, the average ability level of
the participant population (n = 623) was –1.39, significantly lower than the peak test information level, which is located near –.5.[2]

[2] Approximately 72 percent of the students are enrolled in developmental math courses (36 percent is the national average). This percentage did not change significantly from the developmental rates in place prior to implementing the CAT. Perhaps part of the explanation for this is that the study took place in a state that does not have a community college system.
It would appear from the results of this study that with “real world” data, sub-optimizations such as SAT or providing feedback detract little in practice
from the efficiency of computerized adaptive testing. Other investigators have looked
into alternate variables (i.e., skipping and reviewing items) that appear to have little
negative impact on the overall functioning of CATs [36]. Future research might
be directed toward combining variables associated with perceived control to see what impact they have on both efficiency and satisfaction measures. Also, it would be
worthwhile to address satisfaction levels for students with high ability levels.
APPENDIX A. STUDENT ATTITUDE
QUESTIONNAIRE (ADAPTED FROM [8])
STUDENT ID#: ________________________________
_____________________________________________________________
Please answer the following questions about the math placement test. If you
don’t understand a question, raise your hand and a test proctor will assist you.
Each item below is rated on a 5-point scale: 1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly Agree.

1. The test directions were clear and easy to understand.

2. The practice items helped me learn how to take the computerized test.

3. Taking the test on the computer was awkward and confusing.

4. The computerized test took less time than other tests I usually take.
5. Taking this type of test makes me feel like I have more control over the test.

6. I feel my score could have increased had I not taken this test on computer.

7. I would recommend using this type of test format in the future.

8. This computerized test took too much time to complete.

9. I feel this test made me more anxious than I was before I started.

10. I felt like I had no control over the test or my score.

11. Choosing the difficulty of items I want to answer is important to me.

12. Having to choose the difficulty of items took too much time.

13. Choosing the difficulty of the items helped me feel less anxious about taking the test.

14. It is a good idea to let students choose the difficulty level of the test items.

15. Choosing the difficulty of the items helped me feel in control of the test.

16. I feel I was distracted by having to choose the level of difficulty for each item.
Describe three of the things that you liked most about the test. (Please be specific)
1. ___________________________________________________________
2. ___________________________________________________________
3. ___________________________________________________________
Describe three of the things that you disliked most about the test. (Again, be
specific)
1. ___________________________________________________________
2. ___________________________________________________________
3. ___________________________________________________________
Please provide any suggestions to improve the quality of this testing format.
________________________________________________________________
________________________________________________________________
________________________________________________________________
REFERENCES
1. H. Wainer, Computerized Adaptive Testing: A Primer, Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1990.
2. T. J. Ward, S. R. Hooper, and K. M. Hannafin, The Effects of Computerized Tests on
the Performance and Attitudes of College Students, Journal of Educational Computing Research, 5:3, pp. 327–333, 1989.
3. T. R. Rocklin, A. M. O'Donnell, and P. M. Holst, Effects and Underlying Mechanisms of Self-Adapted Testing, Journal of Educational Psychology, 87:1, pp. 103–116,
1995.
4. S. L. Wise, L. L. Roos, B. S. Plake, and L. J. Nebelsick-Gullett, The Relationship
between Examinee Anxiety and Preference for Self-Adapted Testing, Applied Measurement in Education, 7:1, pp. 81-91, 1994.
5. S. L. Wise, Understanding Self-Adapted Testing: The Perceived Control Hypothesis,
Applied Measurement in Education, 7:1, pp. 15-24, 1994.
6. S. L. Wise, L. L. Roos, B. S. Plake, and L. J. Nebelsick-Gullett, Comparing Computerized Adaptive and Self-Adapted Tests: The Influence of Examinee Achievement on
Locus of Control, paper presented at the National Council on Measurement in Education, New Orleans, Louisiana, April 1994.
7. T. R. Rocklin and A. M. O'Donnell, Self-Adapted Testing: A Performance-Improving Variant of Computerized Adaptive Testing, Journal of Educational Psychology, 79:3, pp. 315–319, 1987.
8. W. P. Vispoel, T. R. Rocklin, and T. Wang, Individual Differences and Test Administration Procedures: A Comparison of Fixed-Item, Computerized-Adaptive, and
Self-Adapted Testing, Applied Measurement in Education, 7:1, pp. 53–79, 1994.
9. S. L. Wise, B. S. Plake, P. L. Johnson, and L. L. Roos, A Comparison of Self-Adapted
and Computerized-Adaptive Tests, Journal of Educational Measurement, 29:4,
pp. 329-339, 1992.
10. T. Rocklin, Self-Adapted Testing: Improving Performance by Modifying Tests Instead
of Examinees, Anxiety, Stress, and Coping, 10, pp. 83–104, 1997.
11. L. L. Roos, S. L. Wise, and B. S. Plake, The Role of Item Feedback in Self-Adapted
Testing, Educational and Psychological Measurement, 57:1, pp. 85–98, 1997.
12. V. Ponsoda, S. L. Wise, J. Olea, and J. Revuelta, An Investigation of Self-Adaptive
Testing in a Spanish High School Population, Educational and Psychological Measurement, 57:2, pp. 210–221, 1997.
13. W. P. Vispoel and D. D. Coffman, Computerized-Adaptive and Self-Adapted
Music-Listening Tests: Psychometric Features and Motivational Benefits, Applied
Measurement in Education, 7:1, pp. 15–24, 1994.
14. W. P. Vispoel, Psychometric Characteristics of Computer-Adaptive and Self-Adaptive
Vocabulary Tests: The Role of Answer Feedback and Test Anxiety, Journal of Educational Measurement, 35:2, pp. 115–167, 1998.
15. W. P. Vispoel, Reviewing and Changing Answers on Computer-Adaptive and
Self-Adaptive Vocabulary Tests, Journal of Educational Measurement, 35:4,
pp. 328–347, 1998.
16. C. H. L. Chin, J. S. Donn, and R. F. Conry, Effects of Computer-Based Tests on the
Achievement, Anxiety, and Attitudes of Grade 10 Science Students, Educational and
Psychological Measurement, 51, pp. 735–745, 1991.
17. H. J. Johnson and K. N. Johnson, Psychological Considerations in the Development
of Computerized Testing Situations, Behavioral Research Methods and Instrumentation, 13, pp. 421–424, 1981.
18. M. D. Shermis and D. Lombard, Effects of Computer-Based Test Administrations on
Test Anxiety and Performance, Computers in Human Behavior, 14:1, pp. 111–123,
1998.
19. S. L. Wise, B. S. Plake, B. J. Pozehl, and L. B. Barnes, Providing Item Feedback in
Computer-Based Tests: Effects of Initial Success and Failure, Educational and Psychological Measurement, 49:2, pp. 479–486, 1989.
20. J. R. Mulkey and H. F. O'Neil, Jr., The Effects of Test Item Format on Self-Efficacy
and Worry during a High-Stakes Computer-Based Certification Examination, Computers in Human Behavior, 15, pp. 495–509, 1999.
21. J. R. Averill, Personal Control Over Aversive Stimuli and Its Relationship to Stress,
Psychological Bulletin, 80, pp. 286–303, 1973.
22. F. H. Kanfer and M. L. Seidner, Self Control: Factors Enhancing Tolerance of Noxious
Stimulation, Journal of Personality and Social Psychology, 25, pp. 381–389, 1973.
23. T. Rocklin, Individual Differences in Item Selection in Computerized Self Adapted
Testing, paper presented at the annual meeting of American Educational Research
Association, San Francisco, California, March 1989.
24. P. M. Holst, A. M. O’Donnell, and T. R. Rocklin, Effects of Feedback during
Self-Adaptive Testing on Estimates of Ability, paper presented at the American Educational Research Association, San Francisco, California, April 1992.
25. L. L. Roos, B. S. Plake, and S. L. Wise, The Effects of Feedback on Computerized and
Self-Adaptive Tests, paper presented at the National Council on Measurement in
Education, San Francisco, California, April 1992.
26. T. C. Hsu and M. D. Shermis, The Development and Evaluation of a Microcomputerized Adaptive Placement Test System for College Mathematics, Journal of
Educational Computing Research, 5:4, pp. 473–485, 1989.
27. M. S. Wingersky, LOGIST: A Program for Computing Maximum Likelihood Procedures for Logistic Test Models, in Applications of Item Response Theory, R. K. Hambleton (ed.), Educational Research Institute of British Columbia, Vancouver, British Columbia, 1983.
28. M. S. Wingersky, M. A. Barton, and F. M. Lord, LOGIST User’s Guide, Educational
Testing Service, Princeton, New Jersey, 1982.
29. M. D. Shermis, Microcomputerized Adaptive Placement Testing for College Mathematics, paper presented at the American Educational Research Association, San
Francisco, California, 1986.
30. L. L. Cook, N. J. Dorans, and D. R. Eignor, An Assessment of Dimensionality of Three
SAT-Verbal Test Editions, Journal of Educational Statistics, 13:1, pp. 19–43, 1988.
31. M. D. Shermis and S. H. Chang, The Use of Item Response Theory (IRT) to Investigate the Hierarchical Nature of a College Mathematics Curriculum, Educational and
Psychological Measurement, 57:3, pp. 450–458, 1997.
32. H. R. Mzumara, M. D. Shermis, and D. Wimer, Validity of the IUPUI Placement Test
Scores for Course Placement, Indiana University-Purdue University Indianapolis,
Indianapolis, Indiana, 1996.
33. C. D. Spielberger, Preliminary Professional Manual for the Test Anxiety Inventory, Consulting Psychologists Press, Palo Alto, California, 1980.
34. A. K. Heinssen, C. R. Glass, and L. A. Knight, Assessing Computer Anxiety: Development and Validation of the Computer Anxiety Rating Scale, Computer in Human
Behavior, 3, pp. 49–59, 1987.
35. P. C. Chu and E. E. Spires, Validating the Computer Anxiety Rating Scale: Effects of
Cognitive Style and Computer Courses on Computer Anxiety, Computers in Human
Behavior, 7:1–2, pp. 7–21, 1991.
36. M. E. Lunz and B. A. Bergstrom, An Empirical Study of Computerized Adaptive Test
Administration Conditions, Journal of Educational Measurement, 31:3, pp. 251–263,
1994.
Direct reprint requests to:
Dr. Mark D. Shermis
IUPUI Testing Center
620 Union Drive
Indianapolis, IN 46202-5168