J. EDUCATIONAL COMPUTING RESEARCH, Vol. 24(1) 57-75, 2001

ON TEST AND COMPUTER ANXIETY: TEST PERFORMANCE UNDER CAT AND SAT CONDITIONS

MARK D. SHERMIS
HOWARD R. MZUMARA
SCOTT T. BUBLITZ
Indiana University-Purdue University Indianapolis

ABSTRACT

This article examines the differences between computerized adaptive testing (CAT) and self-adapted testing (SAT), along with possible differences across feedback conditions and gender. Areas of comparison include measurement precision/efficiency and student test characteristics. Participants were 623 undergraduates from a large Midwestern university who took math placement tests in a 4 (condition) x 2 (feedback) x 2 (gender) design. The four conditions were: a) CAT; b) SAT-Global; c) SAT-Individual; and d) SAT-Placebo. Multivariate analysis of variance was used to analyze the data, and the perceived control hypothesis was used as a framework to explain the differences between CAT and SAT. Results indicated that measurement efficiency is differentially affected by the type of test condition, with the SAT-Global condition performing worse than the others. Moreover, there were significant gender effects with regard to ability, test length, and test anxiety. There was no relative advantage for the inclusion of item feedback. Implications for computerized adaptive testing and areas of future research are discussed.

INTRODUCTION

Computerized testing is a popular alternative to the traditional paper-and-pencil format. These instruments are easy and inexpensive to administer, and generally take less time to score and tabulate. Moreover, computerized adaptive testing (CAT) has been shown to produce more precise ability estimates while requiring fewer items than traditional fixed-item tests [1]. However, the increased efficiency of computerized testing may come at the cost of higher test anxiety for some examinees [2].

As an attempt to give the examinee increased control over the testing situation and thereby reduce test anxiety, self-adapted testing (SAT) is a technique that has met with some success [3, 4]. Instead of the computer systematically selecting the item that maximizes information at the examinee's current ability estimate, SAT lets individuals select items calibrated to a "desired" (relative) difficulty level; otherwise, the calculation of the ability estimate and its precision is carried out as it normally is with CAT. Before each item is presented, the examinee is asked how difficult an item he or she would prefer. The number of difficulty levels varies, typically ranging from three (easy, medium, hard) up to eight. Using the difficulty range chosen by the individual, the algorithm selects an item from that range tailored to the current ability estimate, as sketched below. SAT has been found to lessen the increase in anxiety that occurs with computerized testing [3-5].
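To make the distinction between the two selection rules concrete, the following is a minimal sketch, not the authors' implementation: it assumes a two-parameter logistic (2PL) model, a toy 60-item bank, and six difficulty bands, all of which are illustrative assumptions rather than details taken from the study.

```python
# Hypothetical sketch contrasting CAT and SAT item selection under a 2PL model.
# Item parameters, band definitions, and function names are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.8, 2.0, size=60)      # toy discrimination parameters
b = rng.normal(0.0, 1.2, size=60)       # toy difficulty parameters
band = np.digitize(b, np.quantile(b, [1/6, 2/6, 3/6, 4/6, 5/6])) + 1   # 1 (easy) .. 6 (hard)

def item_information(theta):
    """Fisher information of each 2PL item at ability level theta."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def cat_next_item(theta_hat, administered):
    """CAT rule: the most informative not-yet-used item at the current ability estimate."""
    info = item_information(theta_hat)
    for j in administered:
        info[j] = -np.inf
    return int(np.argmax(info))

def sat_next_item(theta_hat, administered, chosen_band):
    """SAT rule: the examinee picks a difficulty band; the most informative unused
    item within that band is administered (ability estimation proceeds as in CAT)."""
    info = item_information(theta_hat)
    info[band != chosen_band] = -np.inf
    for j in administered:
        info[j] = -np.inf
    return int(np.argmax(info))

theta_hat, used = -0.5, set()
print("CAT would administer item", cat_next_item(theta_hat, used))
print("SAT with band 2 chosen would administer item", sat_next_item(theta_hat, used, 2))
```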
Research has not fully investigated the cause of the anxiety reduction in SAT. Three hypotheses were offered by Wise et al. [6]. One possible explanation, the self-monitoring hypothesis, states that SAT gives the examinee access to information beyond what is available to a traditional computerized testing algorithm, such as current affective (emotional) and motivational states [7]. For example, some examinees begin an examination feeling apprehensive, anxious, or insecure about their ability; in other cases, examinees may feel quite optimistic and confident in their mastery of the test content. This additional information about affective and motivational states allows students to tailor the test to their specific psychological state and thus reduce their anxiety.

Another hypothesis is that examinees who are anxious about the assessment experience test-irrelevant thoughts that interfere with test performance. With SAT, the act of continuously choosing item difficulty levels may block test-irrelevant thoughts and keep the examinee more focused. Allowing students to choose the difficulty of the items may be enough to keep them focused and thereby reduce anxiety during the test [8, 9].

A final explanation of SAT effects has focused on perceived control. Perceptions of control over sources of stress have been hypothesized to improve performance, reduce anxiety, and increase motivation [5]. Since SAT allows choice over item difficulty, examinees may perceive this as increased control of the testing process. Wise suggests that "a person's perception of control has been found to be more important in stress reduction than actually having control in an aversive situation" [5, p. 18].

The purpose of this research is to examine possible differences between CAT and SAT with respect to test anxiety, computer anxiety, and a variety of efficiency measures. Moreover, this research examines participants' attitudes about CAT and SAT, including examinees' opinions concerning the strengths, weaknesses, and possible modifications of each administration condition. Finally, the present study explores the processes underlying self-adapted testing by manipulating several key characteristics. One way this is accomplished is by varying the level of control over item difficulty: some individuals vary the difficulty level of each item that is presented, while others set a single difficulty level for all items in the test (referred to as a "global" difficulty level). Another manipulation of test control is the presence or absence of item response feedback.

A great deal of research has recently compared computerized adaptive testing (CAT) and self-adapted testing (SAT). In general, it appears that CAT surpasses SAT in terms of measurement precision and efficiency [10]. Wise et al. found that participants taking a 20-item CAT took less time to complete the test and received a lower standard error of ability estimate than participants given a 20-item SAT [9]. These results were replicated in a study by Roos, Wise, and Plake [11], who also found that the average difficulty of items administered did not differ between the CAT and SAT conditions. Ponsoda, Wise, Olea, and Revuelta extended these findings to a non-U.S. population of Spanish high-school students [12]. Although they found no difference in ability estimates between CAT and SAT, the CAT had a lower standard error of ability estimate and administered more difficult items than the SAT. Vispoel and Coffman made a direct comparison among fixed-item (FIT), CAT, and SAT tests [13] and concluded that CATs were significantly more precise and efficient in their ability estimates than either SATs or paper-and-pencil tests. Specifically, to achieve a given level of precision, the SAT needed almost twice as many items, and the FIT more than three times as many items, as the CAT. Rocklin et al.
also found that students taking the SAT not only needed more time to complete each item than those taking the CAT, but also needed more items to reach a specified precision level [3]. Finally, two studies by Vispoel found that CAT was more precise and required less time to administer than SAT [14, 15]. These results are not surprising, considering that computerized adaptive tests use algorithms to administer the items that provide the most information about the examinee's ability. Because the examinee chooses the item difficulty level in SAT, the efficiency of the test can decrease when items that are too easy or too difficult for the examinee are selected. In addition, the time that the examinee spends choosing difficulty levels inflates the length of the testing session [9, 11].

However, problems other than precision and efficiency may materialize in testing situations and reduce the effectiveness of CAT. Individual difference variables such as computer anxiety, verbal self-concept, or test anxiety may confound the ability estimates derived from CATs, which can limit their advantage over SATs. For computer anxiety, previous research has not established whether this variable significantly affects computerized test performance. One position in this debate suggests that computer anxiety can significantly affect test performance [16-18], while other research examining computer anxiety fails to find this relationship [8, 19]. The Vispoel, Rocklin, and Wang study found similar results for another individual-difference variable, verbal self-concept, which was based on participants' belief in their ability to perform well on the administered vocabulary test [8]. In their study, participants with low verbal self-concept who took the SAT performed better than similar participants in the CAT and FIT conditions.

Research regarding the effects of test anxiety on CAT and SAT is consistent with the literature on computer anxiety. For CAT, Mulkey and O'Neil found that success and failure on the CAT affected state self-efficacy and the worry component of test anxiety [20]. Vispoel found that students with lower levels of test anxiety performed better than those with higher levels of test anxiety [14]. However, an earlier study by Vispoel and Coffman found that estimated ability and test anxiety were significantly correlated in the CAT and FIT conditions, but not in the SAT condition [13]. A study by Rocklin, O'Donnell, and Holst indicated similar findings on the anxiety-performance relationship [3]: the relationship between ability estimates and anxiety was strongly negative in the CAT condition (r = -.53), but significantly weaker in the SAT condition (r = -.16). This difference in strength means that test anxiety contaminated the ability estimates much less in the SAT than in the CAT. Other studies have also found that participants in the SAT condition reported significantly less post-test state anxiety than those taking the CAT [7, 9]. Collectively, these findings suggest that the SAT has higher construct validity because test anxiety may confound CAT ability estimates.

Research has not yet established the cause of the anxiety reduction in the self-adapted test. One explanation, taken from the psychological literature, focuses on perceived control. Averill found that subjects who were given control over aversive stimuli had lower stress levels than subjects who had no control [21].
Research performed by Kanfer and Seidner indicated that people can tolerate unpleasant situations if they believe they have some control over the source, even if no actual control is present [22]. Allowing students to believe that they are in control of the testing situation may therefore be enough to reduce their anxiety during the test.

Another important consideration when comparing CAT and SAT is examinees' attitudes about the various features of the tests. Research indicates that people generally prefer computerized testing, but dislike not being able to review or skip items as in traditional FIT formats [cf. 8]. Research directly comparing participant attitudes toward CAT and SAT has been sparse. Vispoel, Rocklin, and Wang provided evidence that the inclusion of answer feedback and item difficulty selection was seen as a strength of the SAT [8]. They also found that most participants in the SAT condition regarded the ability to review and skip items as less important than participants in the CAT condition did. Since the inability to review and skip items is viewed as a disadvantage of the CAT, using the SAT may reduce these negative effects [10].

Feedback in SAT allows participants to make informed choices about items based on the correctness of their last response [23]. The majority of previous research indicates that the inclusion or absence of answer feedback should not affect the construct validity of mean ability estimates [24, 25]. In addition, Vispoel found that administration time for CAT and SAT decreased when item feedback was given, and that this effect was larger for examinees with high levels of test anxiety [14]. One study by Roos et al. did find higher proficiency estimates for a subset of examinees receiving feedback in the CAT and SAT conditions, but the authors acknowledged that the result was difficult to interpret and suggested additional research in the area [11]. The present study further investigates the effects of feedback on CAT and SAT in terms of efficiency, anxiety, and student attitudes toward testing.

This study used four test conditions to investigate underlying differences between CAT and SAT: a) a traditional CAT; b) a SAT-Individual condition, in which individuals select the difficulty of each item; c) a SAT-Global condition, in which the individual selects the difficulty range for the entire test at the outset; and d) a SAT-Placebo condition, in which the individual is asked to choose the difficulty of each item but the computer selects items in the same manner as in the CAT condition. The SAT-Placebo condition looks identical to the SAT-Individual condition from the examinee's point of view, but the items are selected using the CAT algorithm. The SAT-Placebo thus attempts to separate actual control from the perception of control over item difficulty (a schematic sketch of the four selection rules follows the hypotheses below). These four test conditions were crossed with the presence or absence of feedback about the correctness of each response.

Based on a review of the literature, the present study addressed three research hypotheses: 1) test conditions that employ CAT procedures will yield more precise and efficient ability estimates than SAT conditions; 2) participants in the SAT conditions will report significantly less post-test state anxiety than those taking CATs; and 3) the inclusion of item response feedback will result in greater efficiency in the conditions where feedback is present, as participants make informed choices about items based on the correctness of their last response.
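As a schematic illustration of how the four conditions differ only in their item-selection rule, consider the following hedged sketch; the model, item bank, and function names are hypothetical assumptions and are not drawn from the testing software described in the Method.

```python
# Hypothetical sketch of the four experimental conditions as four item-selection rules.
# All parameter arrays, band definitions, and names are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(1)
a = rng.uniform(0.8, 2.0, 60)     # toy 2PL discriminations
b = rng.normal(0.0, 1.2, 60)      # toy 2PL difficulties
band = np.digitize(b, np.quantile(b, np.linspace(1/6, 5/6, 5))) + 1   # six difficulty bands

def most_informative(theta, used, allowed_band=None):
    """Most informative unused item, optionally restricted to one difficulty band."""
    p = 1 / (1 + np.exp(-a * (theta - b)))
    info = a**2 * p * (1 - p)
    if allowed_band is not None:
        info[band != allowed_band] = -np.inf
    for j in used:
        info[j] = -np.inf
    return int(np.argmax(info))

def next_item(condition, theta, used, examinee_choice=None, global_band=None):
    if condition == "CAT":
        return most_informative(theta, used)                    # no examinee control
    if condition == "SAT-Individual":
        return most_informative(theta, used, examinee_choice)   # control over each item
    if condition == "SAT-Global":
        return most_informative(theta, used, global_band)       # one choice for the whole test
    if condition == "SAT-Placebo":
        # The examinee is prompted for a choice, but it is ignored: the perception
        # of control without actual control over item difficulty.
        return most_informative(theta, used)
    raise ValueError(f"unknown condition: {condition}")

print(next_item("SAT-Placebo", theta=-1.0, used=set(), examinee_choice=5))
```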
METHOD

Participants

Participants were 623 undergraduates at a large Midwestern university. All entering students are required to take math, reading, and written English exams in order to be placed in appropriate courses. Of the students who participated, 75 percent were freshmen, 12 percent were sophomores, 4 percent were juniors, 6 percent were seniors, and 3 percent (17 cases) failed to report their class level. In addition, 46 percent of the participants were male and 54 percent were female. With respect to ethnicity, 79 percent (495 cases) were classified as white and 18 percent (109 cases) as nonwhite, with 3 percent (21 cases) missing. The age of the participants ranged from fifteen to sixty-two years, with a median age of twenty-one years.

Instruments

Computerized Adaptive Test

The first instrument was a computerized adaptive test. Each individual's test was drawn from an item bank of 156 questions designed to assess college mathematics ability and to place students in one of several mathematics course options. The test curriculum was divided into four general clusters: elementary algebra (23 items), advanced algebra (43 items), geometry and trigonometry (64 items), and calculus (26 items) [26]. Because of the item bank size, it was not possible for any one individual to complete all the items. Consequently, three paper-and-pencil tests, each with fifteen anchor items, were constructed and vertically equated. A check on the internal consistency of the three forms, using the alpha (α) coefficient, yielded coefficients ranging from .77 to .93. The item data obtained from these tests (N = 1360) were used to calibrate and equate the items using LOGIST IV, a computer program that uses a joint maximum likelihood method to estimate ability and item parameters. See Wingersky [27] and Wingersky, Barton, and Lord [28] for details on the model parameter estimation and item calibration procedures used in LOGIST. The steps involved in constructing the test and calibrating the items are described in more detail elsewhere [29]. The dimensionality of the test was examined using a procedure employed by Cook, Dorans, and Eignor [30]; the item bank was found to be essentially unidimensional [31].

The test items were then transferred to the computerized adaptive testing package HyperCAT™, which runs under HyperCard™ on MacOS® computers or WinPlus™ on Windows®-based machines. HyperCard™ and WinPlus™ are database programming environments that allow the test administrator to create educational "stackware." Stackware is similar to a compiled program except that it does not use a formal computer language; rather, it is "scripted." Each stack contains "cards" on which objects appear. These objects represent "events" that the author wants the computer to produce; one such event might be the presentation of a test item such as that depicted in Figure 1.

Shermis and Chang examined the results of tests administered using this item bank against a 40-item paper-and-pencil math test covering similar domains and found a strong concordance between the two (r(79) = .79, p < .05) [31]. The same study evaluated the marginal reliability of the adaptive test. Marginal reliability is the CAT counterpart to coefficient α as a measure of internal consistency; for the Shermis and Chang study it was computed to be .80.
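For readers unfamiliar with the statistic, marginal (empirical) reliability can be obtained from examinees' ability estimates and their standard errors. The snippet below is a generic sketch of that calculation with placeholder values; it is not the computation reported in [31].

```python
# Generic sketch of marginal (empirical) reliability for an adaptive test:
# the proportion of observed theta-hat variance not attributable to estimation error.
# The arrays below are placeholders, not study data.
import numpy as np

theta_hat = np.array([-2.1, -1.4, -0.9, -1.7, -0.3, -1.1])   # ability estimates
se = np.array([0.19, 0.18, 0.20, 0.17, 0.19, 0.18])          # their standard errors

error_var = np.mean(se**2)                 # average error variance
observed_var = np.var(theta_hat, ddof=1)   # variance of the estimates
marginal_reliability = 1 - error_var / observed_var
print(round(marginal_reliability, 2))
```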
Mzumara, Shermis, and Wimer examined the predictive validity of the adaptive math test at a large Midwestern university [32]; the correlation with final math exam scores was r = .55, p < .05.

[Figure 1. An example of what a test item might look like in HyperCard™.]

Self-Adapted Test

The second instrument was the self-adapted test. The SAT contained items from the same pool as the existing CAT, but the pool was further divided into six difficulty levels. Item difficulty levels could either be set globally at the beginning of the test or individually prior to the administration of each subsequent item.

Test Anxiety Inventory (TAI) [33]

This measure contains twenty items rated on a 4-point Likert scale ranging from "almost never" to "almost always." Examples of items are "During tests I feel very tense" and "Thoughts of doing poorly interfere with my concentration on tests." The TAI consists of two subscales, worry and emotionality, and scores range from 20 to 80, with higher scores indicating greater test anxiety. The TAI manual reports an internal consistency of α = .92 for the overall scale, .88 for the worry subscale, and .90 for the emotionality subscale. Validity estimates reported in the manual range from .34 to .82, with a median value of .75 [33].

Computer Anxiety Rating Scale

Participants also completed the Computer Anxiety Rating Scale (CARS) [34]. The CARS consists of nineteen items rated on a 5-point Likert scale ranging from "strongly disagree" to "strongly agree." Typical items on this scale include "I feel apprehensive about using computers" and "I feel insecure about my ability to interpret a computer printout." Scores on the CARS range from 19 to 95, with higher scores indicating higher computer anxiety. The CARS manual reports an internal consistency of α = .87 and a test-retest reliability of r = .70. Estimates of validity range from r = .20 to .74, with a median value of .48 [35].

Student Attitude Questionnaire (SAQ)

This questionnaire consisted of Likert-scale and open-ended items assessing participant attitudes about specific features of the test. Among the test features assessed were clarity of directions, usefulness of practice items, and general attitudes toward the testing procedure. Because the initial ten items were identical for all participants, comparisons of student attitudes were possible across conditions. In addition, participants were asked to describe three things they liked and three things they disliked about the testing format. The questionnaire also included space for additional comments on how to improve the quality of the test.

Procedure

As participants checked in to take the mathematics placement test, proctors randomly assigned each participant to one of the four test conditions (CAT, SAT-Individual, SAT-Global, or SAT-Placebo) and to one of the two feedback (yes, no) conditions. The order of test conditions was determined randomly using a table of random numbers, and this ordering was completed prior to the subjects' arrival for placement testing. Subjects were given instructions describing the details of the test. The measures administered were: a) the computerized mathematics test; b) the TAI; c) the CARS; and d) the SAQ.
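Scoring of the two anxiety scales described above is a simple summation of item responses. The sketch below illustrates this, with the proviso that the split of TAI items into worry and emotionality subscales shown here is hypothetical (the actual assignments come from the TAI manual [33]), and any reverse-keyed CARS items are ignored.

```python
# Illustrative sketch of summative scoring for the TAI (20 items, 1-4) and the
# CARS (19 items, 1-5). Subscale index lists are hypothetical placeholders.
import numpy as np

tai_responses = np.random.default_rng(2).integers(1, 5, size=20)    # 20 items, values 1-4
cars_responses = np.random.default_rng(3).integers(1, 6, size=19)   # 19 items, values 1-5

WORRY_ITEMS = [0, 2, 4, 6, 8, 10, 12, 14]          # hypothetical 0-based indices
EMOTIONALITY_ITEMS = [1, 3, 5, 7, 9, 11, 13, 15]   # hypothetical 0-based indices

tai_total = int(tai_responses.sum())               # possible range 20-80
tai_worry = int(tai_responses[WORRY_ITEMS].sum())
tai_emotionality = int(tai_responses[EMOTIONALITY_ITEMS].sum())
cars_total = int(cars_responses.sum())             # possible range 19-95
print(tai_total, tai_worry, tai_emotionality, cars_total)
```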
Participants in the CAT condition completed the first ten items of the SAQ; participants in the SAT conditions completed the full 16-item SAQ, with the additional six items assessing attitudes about item difficulty selection.

The present study utilized a 4 x 2 x 2 independent-groups design. The three independent variables were condition (1: CAT, 2: SAT-Individual, 3: SAT-Global, 4: SAT-Placebo), feedback (1: feedback, 2: no feedback), and gender (1: male, 2: female). Two sets of dependent variables were used. The first set included estimated ability (theta), number of items, and amount of time spent taking the test; this set focused on characteristics of student performance. The second set of dependent variables was directed at characteristics of the examinee, including level of computer anxiety (CARS), level of test anxiety (TAI), and examinee attitudes toward the testing situation (SAQ).

The condition variable was administered with four levels of control. In the first condition (CAT), the test was given as a standard computerized adaptive test; the participants had no control over which item was administered next. In the second condition (SAT-Individual), participants were asked to choose the difficulty level of each item presented. In the third condition (SAT-Global), the student chose the difficulty level of all items to be administered before the test began. Since each of the difficulty levels contained more items than the maximum number allowed on the test, participants could not exhaust the items at any particular difficulty level. The final condition (SAT-Placebo) appeared to allow students to choose the difficulty level of each item, as in the SAT-Individual condition; however, the items were actually selected by the computer using the same algorithm as in the CAT. This condition therefore examines what effect the perception of control has on SAT, apart from the actual ability to control test difficulty.

The present research also manipulated the presence of feedback about participants' item responses. In one condition, the student received a message indicating whether or not he or she had responded correctly; in the other condition, the student received no such feedback.

The CAT, SAT, or placebo test ended once either the standard error of the participant's ability estimate fell below .2 or twenty-five items had been administered. The twenty-five-item limit was incorporated as an operational constraint for this test. Previous work with the item bank suggested that good placements could be achieved with as few as twenty items [26]; the limit was raised to twenty-five to ensure variability with respect to the number of items administered.

RESULTS

The primary data for the present study were analyzed using both multivariate and univariate statistical methods. The first set of analyses focuses on the adequacy of the instruments employed in the study; the second set examines the research hypotheses and follow-up exploratory work.

Instruments

Because the Student Attitude Questionnaire was created for this study, a factor analysis was conducted to determine scale dimensionality (a generic sketch of such a check follows). Figure 2 shows the scree plot for the SAQ. The scale showed adequate evidence of unidimensionality, with the first factor explaining almost four times as much variance as the second factor. The internal consistency reliability for the SAQ was α = .87. The ten-item version of the SAQ had a mean of 36.82 and a standard deviation of 5.65.
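The kind of dimensionality and internal-consistency check described above can be sketched generically as follows; the simulated responses and the alpha function are illustrative assumptions, not the analysis actually run on the SAQ data.

```python
# Generic sketch of two scale checks: eigenvalues of the inter-item correlation
# matrix (scree-style dimensionality evidence) and Cronbach's alpha.
# The simulated responses are placeholders, not SAQ data.
import numpy as np

rng = np.random.default_rng(4)
n_people, n_items = 300, 10
common = rng.normal(size=(n_people, 1))   # one dominant factor drives all items
responses = np.clip(np.rint(3 + common + rng.normal(scale=0.8, size=(n_people, n_items))), 1, 5)

# Scree information: sorted eigenvalues of the item correlation matrix.
eigenvalues = np.sort(np.linalg.eigvalsh(np.corrcoef(responses, rowvar=False)))[::-1]
print("first/second eigenvalue ratio:", round(eigenvalues[0] / eigenvalues[1], 2))

def cronbach_alpha(x):
    """alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total score)."""
    k = x.shape[1]
    item_var = x.var(axis=0, ddof=1).sum()
    total_var = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var / total_var)

print("alpha:", round(cronbach_alpha(responses), 2))
```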
In addition, a factor analysis of the last six items of the SAQ, administered in all SAT conditions, indicated unidimensionality. An internal consistency reliability coefficient of α = .79 was computed for these six items; scores on them ranged from 8 to 30, with a mean of 21.46 and a standard deviation of 3.72. The sixteen-item Student Attitude Questionnaire for the SAT conditions yielded an average score of 57.70 and a standard deviation of 8.44.

[Figure 2. The scree plots for the Student Attitude Questionnaires.]

These results indicate moderately positive reactions to the SAT conditions. Common responses to the open-ended item regarding self-adapted testing were that the procedure was faster and easier than conventional (paper-and-pencil) tests, that the results were immediate, and that students liked the ability to choose the item difficulty level. Examinees also indicated that they wanted the ability to review or skip items, and expressed a desire for more sample items in the instructions. There was also some expressed concern about the confidentiality of test scores.

A reliability analysis was then conducted for the CARS and TAI. The results were consistent with those reported in the test manuals (α = .89 and α = .94, respectively). The overall mean score on the TAI was 39.16 with a standard deviation of 11.81, with a worry subscale mean of 14.25 (SD = 4.69) and an emotionality subscale mean of 15.76 (SD = 5.23). The CARS had an overall mean of 36.85 with a standard deviation of 9.93. A complete summary of the descriptive statistics is provided in Table 1.

Table 1. Descriptive Statistics of Important Variables for All Examinees

Variable                           M        SD       n
Theta                             -1.39     1.02     623
Items Administered                23.34     4.42     623
Placement Score                    9.34     3.62     623
Test length (minutes)             27.49    14.71     623
TAI-Overall (α = .94)             39.16    11.81     604
TAI-Worry                         14.25     4.69     612
TAI-Emotionality                  15.76     5.23     617
CARS (α = .89)                    36.85     9.93     590
SAQ-CAT Conditions (α = .87)      36.82     5.65     167
SAQ-SAT Conditions (α = .79)      57.70     8.44     425

Research Hypotheses

The three main hypotheses focused on measurement precision and/or efficiency, test anxiety/satisfaction, and the role of feedback in computerized testing. The issue of gender-related differences in ability estimation and test satisfaction was also addressed. The group means for the dependent variables (test length, test time, ability, SAQ score, CARS, and TAI), broken down by condition (CAT, SAT-Global, SAT-Individual, SAT-Placebo) and feedback (yes, no), are given in Table 2.

Table 2. Group Means for Test Length, Test Time, Ability Estimates, SAQ, Total CARS, and Total TAI Score

Test Type                     (1)      (2)      (3)      (4)      (5)      (6)      n
CAT (no fdbk.)               23.30    26.08    -1.40    36.96    36.42    40.80    84
SAT-Individual (no fdbk.)    22.71    26.79    -1.41    36.41    36.25    39.66    81
SAT-Global (no fdbk.)        24.76    29.70    -1.40    36.32    38.07    38.35    77
SAT-Placebo (no fdbk.)       23.08    28.05    -1.37    36.57    37.17    37.10    73
CAT (w/fdbk.)                23.08    26.08    -1.43    36.67    36.67    39.68    83
SAT-Individual (w/fdbk.)     22.75    27.63    -1.29    35.98    36.49    39.99    74
SAT-Global (w/fdbk.)         24.94    29.45    -1.50    35.50    37.25    38.12    71
SAT-Placebo (w/fdbk.)        22.27    26.63    -1.29    36.47    36.54    39.37    80

(1) = Test Length (Items); (2) = Test Time (minutes); (3) = Ability Estimate (theta); (4) = Score on SAQ (common CAT & SAT items); (5) = Total CARS Score; (6) = Total TAI Score
The first MANOVA utilized the three performance-related dependent variables, with condition, feedback, and gender as the independent variables. The results of this analysis are given in Table 3. Two main effects, condition (Wilks' λ = .94, F(3,607) = 4.52, p < .001) and gender (Wilks' λ = .96, F(1,607) = 8.29, p < .001), emerged as statistically significant. A follow-up on the condition variable showed that the differences occurred on the test length dependent variable (F(3,607) = 8.79, p < .001), with the SAT-Global condition requiring significantly more items (M = 24.85) than the other three conditions (combined M = 22.79), a difference of about two items. With regard to gender, two of the three dependent variables showed significant differences: females required significantly more items to complete the test than males (F(1,607) = 23.77, p < .001; females M = 23.98, males M = 22.63), while males had statistically higher ability estimates than females (F(1,607) = 14.98, p < .001; females M = -1.57, males M = -1.17).

Table 3. Summary of Results for the Multivariate Analysis of Variance for the Dependent Variables of Ability, Number of Items, and Number of Minutes

Effect                             Wilks' λ   Hypoth. DF   Error DF      F       p
Condition                            .94          3           607       4.59    .001
Feedback                             .99          1           605        .17     ns
Gender                               .96          1           607       8.29    .001
Condition x Feedback                 .99          3           607        .34     ns
Condition x Gender                   .97          3           607       1.56     ns
Feedback x Gender                    .99          1           607        .06     ns
Feedback x Condition x Gender        .99          3           607        .61     ns

Table 4. Summary of Results for the Multivariate Analysis of Variance for the Dependent Variables of TAI, CARS, and SAQ

Effect                             Wilks' λ   Hypoth. DF   Error DF      F       p
Condition                            .98          3           557       1.06     ns
Feedback                             .99          1           557        .70     ns
Gender                               .94          1           557      12.64    .001
Condition x Feedback                 .99          3           557        .26     ns
Condition x Gender                   .97          3           557       1.77     ns
Feedback x Gender                    .99          1           557        .43     ns
Feedback x Condition x Gender        .99          3           557        .34     ns

The second MANOVA incorporated the three examinee characteristic measures: the CARS, TAI, and SAQ. As before, the independent variables were condition, feedback, and gender. The results of this analysis are given in Table 4. Only one main effect, gender (Wilks' λ = .94, F(3,555) = 12.64, p < .001), emerged as statistically significant. A follow-up analysis revealed that the differences occurred on the Total TAI Score dependent variable (F(1,555) = 37.58, p < .001), with females displaying significantly higher levels of test anxiety (M = 41.63) than males (M = 35.76). With respect to the feedback manipulation, the present study did not yield any statistically significant differences, either as a main effect or in interaction.
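A generic sketch of how such a 4 x 2 x 2 MANOVA could be specified follows; the data frame, column names, and use of statsmodels are assumptions for illustration and do not reproduce the authors' analysis or results.

```python
# Hedged sketch of a 4 x 2 x 2 MANOVA on the performance-related dependent
# variables, using statsmodels. The data frame and column names are hypothetical.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(5)
n = 623
df = pd.DataFrame({
    "condition": rng.choice(["CAT", "SAT-Individual", "SAT-Global", "SAT-Placebo"], n),
    "feedback": rng.choice(["yes", "no"], n),
    "gender": rng.choice(["male", "female"], n),
    "theta": rng.normal(-1.4, 1.0, n),     # placeholder dependent variables
    "items": rng.normal(23.3, 4.4, n),
    "minutes": rng.normal(27.5, 14.7, n),
})

mv = MANOVA.from_formula(
    "theta + items + minutes ~ C(condition) * C(feedback) * C(gender)", data=df
)
print(mv.mv_test())   # reports Wilks' lambda (among other criteria) for each effect
```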
DISCUSSION

This study was designed to compare CAT and SAT across two sets of outcome variables: those related to performance/efficiency and those associated with test characteristics of participants. Both the CAT and SAT conditions were differentiated by whether or not feedback was provided and by the gender of the research participant. Moreover, the SAT was varied by the amount of control given to the examinee: in one condition the individual was able to specify the difficulty level of each subsequent item, while in another only a global setting of item difficulty was permitted. A placebo condition was also included, which allowed examinees to make item difficulty selections even though actual item selection was accomplished through the CAT algorithm.

A main effect was found for the condition variable, but the statistically significant differences were only between the SAT-Global condition and the other three groups, and only for test length. There were no statistically significant differences among the SAT-Individual, SAT-Placebo, and CAT groups. This is heartening news, suggesting that there is no performance penalty for using SAT unless one incorporates the global strategy. Even though the SAT-Global condition did require statistically significantly more items, the difference was relatively modest, about two items. Since most implementations of SAT are consistent with the SAT-Individual strategy, the current study suggests that there will be no loss in test efficiency (compared to CAT) from using it.

The gender differences were consistent with previous research on the math test [32]. Males scored significantly higher on this measure of ability and required significantly fewer items to complete the adaptive test than females, although the differences in both ability and test length were small. One plausible explanation for the observed gender-related difference is the general observation that female students tend to perform less well in mathematics than their male counterparts; a marginal difference in response patterns between female and male students could thus have had some impact on both measurement precision and the efficiency of test administration in mathematics. Further research on the gender issue seems warranted.

Unfortunately, this study could not replicate some of the advantages of SAT documented in previous work [3-5]. There were no statistically significant differences in rates of computer anxiety, test anxiety, or satisfaction with the test experience among the CAT and SAT conditions. The only significant difference was the higher test anxiety scores of females when compared to their male counterparts.[1]

[1] With respect to Table 4, forty cases are missing because of the listwise deletion of data employed in this multivariate analysis. However, the distribution of missing cases is random with respect to the three independent variables.

The lack of statistically significant feedback differences was also surprising. Providing feedback is perhaps a double-edged sword: for some examinees it may be reassuring to know that one is "doing well" on the test, while for others it may be unnerving to see a high proportion of "incorrect" responses. Alternatively, it may be that students inadvertently get feedback from the test simply by observing the difficulty level of the next administered item. Based on our informal work with students, one of the biggest demands in the CAT is not so much feedback on individual item responses, but rather knowing how many items remain on the test.

In an attempt to explain the inconsistency with previous literature regarding the lack of a main effect for satisfaction, a follow-up ANOVA on the test satisfaction instrument was conducted. A mean comparison was made of the satisfaction levels of the CAT and SAT-Individual conditions for high- and low-ability students (as measured by their math scores).
These analyses extended the work done by Vispoel, Rocklin, and Wang [8], who found interactions for two other individual differences: verbal self-concept and test anxiety. Theta was divided into two equal groups using a median split. The results indicated a significant ability-by-test-type (disordinal) interaction effect (F(1,291) = 4.57, p < .02) between CAT and SAT-Individual satisfaction for the high- and low-ability groups. Specifically, low-ability students in the SAT conditions rated their satisfaction with the test lower than low-ability students in the CAT conditions. (A sketch of this follow-up analysis appears at the end of the Discussion.)

One rationale for limiting SAT has been the argument that SATs are less precise and/or require more items than CATs for similar levels of precision. The results of this study suggest that while globally-set SATs did require more items, the practical differences were minor. On the other hand, the prevailing argument used by SAT proponents is that examinees will be more satisfied with their testing experience once test and computer anxiety levels are controlled. The results of this study provide mixed evidence pertaining to satisfaction. Although the high-ability students may have preferred the increased control provided by the SAT, the converse relationship was observed for the low-ability group. The results were thus inconsistent with the predicted direction (SAT having higher satisfaction levels than CAT). One possible explanation is that lower-ability students prefer CAT to SAT because the SAT demands more information processing than is required in CAT. Perhaps the lower-ability students have less desire for the procedural control created by the SAT than the higher-ability students.

Of course, it is possible that the results merely reflect the characteristics of the item bank or of the pool of student respondents. For example, an item bank of only 156 items could conceivably be too limited for generating the desired difficulty levels, especially in the SAT conditions. In addition, the average ability level of the participant population (n = 623) was -1.39, significantly lower than the peak test information level, which is located near -.5.[2]

[2] Approximately 72 percent of the students are enrolled in developmental math courses (36% is the national average). This percentage did not change significantly from the developmental rates in place prior to implementing the CAT. Perhaps part of the explanation is that the study took place in a state that does not have a community college system.

It would appear from the results of this study that with "real world" data, suboptimizations such as SAT or the provision of feedback do little to detract from the efficiency of computerized adaptive testing. Other investigators have examined additional variables (i.e., skipping and reviewing items) that appear to have little negative impact on the overall functioning of CATs [36]. Future research might be directed at combining variables associated with perceived control to see what impact they have on both efficiency and satisfaction measures. It would also be worthwhile to address satisfaction levels for students with high ability levels.
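As a concrete illustration of the follow-up analysis described above (a median split on theta followed by a two-way ANOVA testing the ability-by-test-type interaction on satisfaction), here is a hedged sketch with hypothetical data and column names; it does not use the study's data.

```python
# Hedged sketch: median split on the ability estimate and a 2 x 2 ANOVA with an
# ability-group by test-type interaction on SAQ satisfaction scores.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(6)
n = 300
df = pd.DataFrame({
    "test_type": rng.choice(["CAT", "SAT-Individual"], n),
    "theta": rng.normal(-1.4, 1.0, n),
    "saq": rng.normal(36.8, 5.7, n),
})
# Median split: two equal-sized ability groups.
df["ability_group"] = np.where(df["theta"] >= df["theta"].median(), "high", "low")

model = ols("saq ~ C(ability_group) * C(test_type)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))   # the interaction row is the test of interest
```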
APPENDIX A. STUDENT ATTITUDE QUESTIONNAIRE (ADAPTED FROM [8])

STUDENT ID#: ________________________________

Please answer the following questions about the math placement test. If you don't understand a question, raise your hand and a test proctor will assist you. Each statement is rated on a 5-point scale: 1 = Strongly Disagree, 2 = Disagree, 3 = Neutral, 4 = Agree, 5 = Strongly Agree.

1. The test directions were clear and easy to understand.
2. The practice items helped me learn how to take the computerized test.
3. Taking the test on the computer was awkward and confusing.
4. The computerized test took less time than other tests I usually take.
5. Taking this type of test makes me feel like I have more control over the test.
6. I feel my score could have been higher had I not taken this test on computer.
7. I would recommend using this type of test format in the future.
8. This computerized test took too much time to complete.
9. I feel this test made me more anxious than I was before I started.
10. I felt like I had no control over the test or my score.
11. Choosing the difficulty of items I want to answer is important to me.
12. Having to choose the difficulty of items took too much time.
13. Choosing the difficulty of the items helped me feel less anxious about taking the test.
14. It is a good idea to let students choose the difficulty level of the test items.
15. Choosing the difficulty of the items helped me feel in control of the test.
16. I feel I was distracted by having to choose the level of difficulty for each item.

Describe three of the things that you liked most about the test. (Please be specific)
1. ___________________________________________________________
2. ___________________________________________________________
3. ___________________________________________________________

Describe three of the things that you disliked most about the test. (Again, be specific)
1. ___________________________________________________________
2. ___________________________________________________________
3. ___________________________________________________________

Please provide any suggestions to improve the quality of this testing format.
________________________________________________________________
________________________________________________________________
________________________________________________________________

REFERENCES

1. H. Wainer, Computerized Adaptive Testing: A Primer, Lawrence Erlbaum Associates, Hillsdale, New Jersey, 1990.
2. T. J. Ward, S. R. Hooper, and K. M. Hannafin, The Effects of Computerized Tests on the Performance and Attitudes of College Students, Journal of Educational Computing Research, 5:3, pp. 327-333, 1989.
3. T. R. Rocklin, A. M. O'Donnell, and P. M. Holst, Effects and Underlying Mechanisms of Self-Adapted Testing, Journal of Educational Psychology, 87:1, pp. 103-116, 1995.
4. S. L. Wise, L. L. Roos, B. S. Plake, and L. J. Nebelsick-Gullett, The Relationship between Examinee Anxiety and Preference for Self-Adapted Testing, Applied Measurement in Education, 7:1, pp. 81-91, 1994.
5. S. L. Wise, Understanding Self-Adapted Testing: The Perceived Control Hypothesis, Applied Measurement in Education, 7:1, pp. 15-24, 1994.
6. S. L. Wise, L. L. Roos, B. S. Plake, and L. J. Nebelsick-Gullett, Comparing Computerized Adaptive and Self-Adapted Tests: The Influence of Examinee Achievement on Locus of Control, paper presented at the National Council on Measurement in Education, New Orleans, Louisiana, April 1994.
7. T. R. Rocklin and A. M. O'Donnell, Self-Adapted Testing: A Performance-Improving Variant of Computerized Adaptive Testing, Journal of Educational Psychology, 79:3, pp. 315-319, 1987.
8. W. P. Vispoel, T. R. Rocklin, and T. Wang, Individual Differences and Test Administration Procedures: A Comparison of Fixed-Item, Computerized-Adaptive, and Self-Adapted Testing, Applied Measurement in Education, 7:1, pp. 53-79, 1994.
9. S. L. Wise, B. S. Plake, P. L. Johnson, and L. L. Roos, A Comparison of Self-Adapted and Computerized-Adaptive Tests, Journal of Educational Measurement, 29:4, pp. 329-339, 1992.
10. T. Rocklin, Self-Adapted Testing: Improving Performance by Modifying Tests Instead of Examinees, Anxiety, Stress, and Coping, 10, pp. 83-104, 1997.
11. L. L. Roos, S. L. Wise, and B. S. Plake, The Role of Item Feedback in Self-Adapted Testing, Educational and Psychological Measurement, 57:1, pp. 85-98, 1997.
12. V. Ponsoda, S. L. Wise, J. Olea, and J. Revuelta, An Investigation of Self-Adaptive Testing in a Spanish High School Population, Educational and Psychological Measurement, 57:2, pp. 210-221, 1997.
13. W. P. Vispoel and D. D. Coffman, Computerized-Adaptive and Self-Adapted Music-Listening Tests: Psychometric Features and Motivational Benefits, Applied Measurement in Education, 7:1, pp. 15-24, 1994.
14. W. P. Vispoel, Psychometric Characteristics of Computer-Adaptive and Self-Adaptive Vocabulary Tests: The Role of Answer Feedback and Test Anxiety, Educational and Psychological Measurement, 35:2, pp. 115-167, 1998.
15. W. P. Vispoel, Reviewing and Changing Answers on Computer-Adaptive and Self-Adaptive Vocabulary Tests, Educational and Psychological Measurement, 35:4, pp. 328-347, 1998.
16. C. H. L. Chin, J. S. Donn, and R. F. Conry, Effects of Computer-Based Tests on the Achievement, Anxiety, and Attitudes of Grade 10 Science Students, Educational and Psychological Measurement, 51, pp. 735-745, 1991.
17. H. J. Johnson and K. N. Johnson, Psychological Considerations in the Development of Computerized Testing Situations, Behavioral Research Methods and Instrumentation, 13, pp. 421-424, 1981.
18. M. D. Shermis and D. Lombard, Effects of Computer-Based Test Administrations on Test Anxiety and Performance, Computers in Human Behavior, 14:1, pp. 111-123, 1998.
19. S. L. Wise, B. S. Plake, B. J. Pozehl, and L. B. Barnes, Providing Item Feedback in Computer-Based Tests: Effects of Initial Success and Failure, Educational and Psychological Measurement, 49:2, pp. 479-486, 1989.
20. J. R. Mulkey and H. F. O'Neil, Jr., The Effects of Test Item Format on Self-Efficacy and Worry during a High-Stakes Computer-Based Certification Examination, Computers in Human Behavior, 15, pp. 495-509, 1999.
21. J. R. Averill, Personal Control Over Aversive Stimuli and Its Relationship to Stress, Psychological Bulletin, 80, pp. 286-303, 1973.
22. F. H. Kanfer and M. L. Seidner, Self Control: Factors Enhancing Tolerance of Noxious Stimulation, Journal of Personality and Social Psychology, 25, pp. 381-389, 1973.
23. T. Rocklin, Individual Differences in Item Selection in Computerized Self Adapted Testing, paper presented at the annual meeting of the American Educational Research Association, San Francisco, California, March 1989.
24. P. M. Holst, A. M. O'Donnell, and T. R. Rocklin, Effects of Feedback during Self-Adaptive Testing on Estimates of Ability, paper presented at the American Educational Research Association, San Francisco, California, April 1992.
25. L. L. Roos, B. S. Plake, and S. L. Wise, The Effects of Feedback on Computerized and Self-Adaptive Tests, paper presented at the National Council on Measurement in Education, San Francisco, California, April 1992.
26. T. C. Hsu and M. D. Shermis, The Development and Evaluation of a Microcomputerized Adaptive Placement Test System for College Mathematics, Journal of Educational Computing Research, 5:4, pp. 473-485, 1989.
27. M. S. Wingersky, LOGIST: A Program for Computing Maximum Likelihood Procedures for Logistic Test Models, in Applications of Item Response Theory, R. K. Hambleton (ed.), Educational Research Institute of British Columbia, Vancouver, British Columbia, 1983.
28. M. S. Wingersky, M. A. Barton, and F. M. Lord, LOGIST User's Guide, Educational Testing Service, Princeton, New Jersey, 1982.
29. M. D. Shermis, Microcomputerized Adaptive Placement Testing for College Mathematics, paper presented at the American Educational Research Association, San Francisco, California, 1986.
30. L. L. Cook, N. J. Dorans, and D. R. Eignor, An Assessment of the Dimensionality of Three SAT-Verbal Test Editions, Journal of Educational Statistics, 13:1, pp. 19-43, 1988.
31. M. D. Shermis and S. H. Chang, The Use of Item Response Theory (IRT) to Investigate the Hierarchical Nature of a College Mathematics Curriculum, Educational and Psychological Measurement, 57:3, pp. 450-458, 1997.
32. H. R. Mzumara, M. D. Shermis, and D. Wimer, Validity of the IUPUI Placement Test Scores for Course Placement, Indiana University-Purdue University Indianapolis, Indianapolis, Indiana, 1996.
33. C. D. Spielberger, Preliminary Professional Manual for the Test Anxiety Inventory, Consulting Psychologists Press, Palo Alto, California, 1980.
34. A. K. Heinssen, C. R. Glass, and L. A. Knight, Assessing Computer Anxiety: Development and Validation of the Computer Anxiety Rating Scale, Computers in Human Behavior, 3, pp. 49-59, 1987.
35. P. C. Chu and E. E. Spires, Validating the Computer Anxiety Rating Scale: Effects of Cognitive Style and Computer Courses on Computer Anxiety, Computers in Human Behavior, 7:1-2, pp. 7-21, 1991.
36. M. E. Lunz and B. A. Bergstrom, An Empirical Study of Computerized Adaptive Test Administration Conditions, Journal of Educational Measurement, 31:3, pp. 251-263, 1994.

Direct reprint requests to:
Dr. Mark D. Shermis
IUPUI Testing Center
620 Union Drive
Indianapolis, IN 46202-5168