Document 285706

Jones, M. H. (1988, August). The effects of a small examinee sample
size on the precision of measurement for tests developed by four
different item-selection strategies. Dissertation. Florida State
University, Tallahassee, Florida.
Abstract
This study sought to determine the best item selection procedure to use in mastery testing situations where small examinee samples exist. The item selection procedures evaluated were: (a) modified classical, utilizing p-values and point biserial correlations; (b) criterion referenced, utilizing phi coefficients; (c) domain sampling, utilizing random item selection; and (d) Rasch model, utilizing item logits (b-values). A computer program was used to create a data matrix of simulated responses of 1000 examinees to 240 items. From this data matrix 12 sets of data were drawn. Each data set contained 50 randomly selected examinees and their associated responses to the 240 items. The item responses were used for computing item statistics for each item selection procedure. Item selection procedures were evaluated in terms of test information, standard error of the estimate, misclassification rate, and accuracy of the domain percentage correct score. Test information associated with each item was computed according to a three parameter item response theory (IRT) model.
The results show that
referenced
procedures
which,
a given
for
parameter
three
However,
can not
way to
were effective
cut-off
ability
effectively
utilize
rndeed,
item
selection
score
procedures
this
than
which
produced
a three
items
random iten
reason
the
random item
reconmended
for
mastery
samples exist.
rLL
of
item
biased
high
indices
as
estimates
of
'roptimal,
of
the
item
higher
accurate
selectj.on
in which
no
procedures
As a result
generally
is
with
selection
of the
model
50 there
identified
selection
tests
by a
parameter
used statistical
and less
for
alr
itens
information.
score.
estimates
serecting
would be identified
produced
correct
rates
the
at
high
of the
selection,
misclassification
estimates
the
all
domain percentage
donain
and criterion
from sample sizes
(Rasch model included),
biased
score,
scores
be generated
information.
the
classical
model as having
since
a bases for
the
domain score
procedure.
procedure
smarr
was
examinee
For
Table of Contents
Chapter 1. Introduction
    Introduction
    Purpose of Study
Chapter 2. Review of the Literature
    Mastery Testing
    Reliability of Mastery Decisions
    Test Size and Classification Accuracy
    Item and Test Information Curves
    Using Traditional Item Statistics to Focus Measurement Information
    Strengths and Weaknesses of Conventional and IRT Optimal Item Selection Strategies
    Interpretation of Domain Score Estimates
    Studies Comparing IRT to Other Test Development Procedures
Chapter 3. Methodology
    Introduction
    Definitions of Item Selection Techniques
    Establishing Item Selection Criteria for the Modified Classical Technique
    Data Generating Program
    Program for Selecting Item and Subject Samples
    Exam Item Pool
    Test Length and Subject Sample Size
    Test Item Selection
    Dependent Measures
    Procedures for Judging the Results
Chapter 4. Results
    Test Information and Standard Errors of Estimates
    Misclassification Rates
    Accuracy of Domain Score Estimates
Chapter 5. Discussion
    Overview
    Measurement Precision for Small Sample Conditions
    Measurement Precision for Large Sample Conditions
    Accuracy of Domain Score Estimates and Classification Accuracy
    Should Traditional Item Selection Procedures be Used to Simulate IRT Item Selection Procedures
    Maximizing the Percentage Correct Domain Score Accuracy
Chapter 6. Conclusions and Suggestions for Future Research
References
Appendix 1
Appendix 2
Chapter I
Introduction
The purpose of this study was to compare various item selection methods in terms of their effectiveness in maximizing test information at a cutoff score under conditions involving small examinee sample sizes (i.e., N = 50).
In recent years there has been a proliferation of mastery testing programs in education, professional licensure, and personnel selection. In each of these areas, mastery/nonmastery classifications derived from test scores have serious impact on the lives of the examinees. For this reason it is critical that the classifications be as accurate as possible. The primary means of achieving accurate classifications is to ensure that the standard error of estimate (SEE) around the cutoff score is as low as possible. This is accomplished by selecting test questions which concentrate measurement information around the cutoff point of interest (Birnbaum, 1969; Novick, 1958).
There are many item selection techniques described in the literature that can be used to focus measurement information. Most of the methods discussed are based
on a statistical
identifying
I
providing
I
test
index
iterns
which
measurement
statistical
index
development
that
is
are optimal
associated
t
models
commonly used today.
and/or
test
I
theory,
development
referenced.
each of
theoretical
The objective
be easily
is
sirnplest
IRT model,
selected
which
(theta
t
measurement
I
I
are
However,
it
sufficiently
up entirely
Therefore,
I
1
I
as possible
an item
Item
test
development
a one parameter
of
the
at
iterns
the
enable
with
one t1pically
response
the
generate
the
models
are
the
a test
have difficurty
ability
the
should
values
cutoff
that
score.
itern pools
cornposition
of tests
same difficulty
selects
cutoff
to
In the
To maximize
score,
of
can
(IRT)
items
close
score.
level
to
a cutoff
theory
model,
a cutoff
ability
to
items
cutoff
study.
and scoring.
values
that
unrealistic
to the
present
if
response
with
and/or
SEE around
test
large
of
approaches
associated
reducing
information
is
response
and domain
technique
have difficulty
to
item
referenced,
the
major
models
of
level)
identical
are;
four
measurement
in the
used for
be cornposed of
one of the
investigated
achieved
approach
level
approaches
of
each
These theoretical
development
rnodels is
purpose
the
theoretical
and/or
criterion
test
in
Further,
with
One itern selection
these
for
information.
approaches
classical,
the user
assists
iterns that
made
value.
are as crose
point.
are we1l
adapted
to
the
task
of
item
selection
classical
at
the
models,
item
t
same scale
I
decreases
so does the
estimates
(see t e.g.
parameter
estimates
even though
serious
I
is
they
not
always
estabrish
length
on the
from
for
smalr
test
accurate
of an
and sample size
itern parameter
of
19g2).
However,
samples
rnight
purposes
construction
enough
for
is
to
a rarge
the
score
deverop
a large
pool
examinees
Lord
moders for
the
purpose
no recommendations
than
has shown that
1960) is
sample sizes
the
of
about
to
is
the
in
testing.
examinees
1OO.
estimating
what
areas
number of
one pararneter
superior
between 1oo to
there
This
and personnel
the
srightly
items,
from which
incidence
ricensure,
areas
domain of
of
estimates.
row examinee
may be less
(1983)
(Rasch,
of
poor
a rather
apprications
an adequate
itern parameter
low incidence
presents
deveropment
typically
professionar
t,ested per year
sampre size
many test
there
accurat,e
education,
for
test
adeguate
for
case in many of
noder
are reported
Hambreton & cook,
not
unrike
The disadvant,ages
precision
derived
of
from which
rn these
that
have,
purposes.
while
content
as the
are
obstacle
I
statistics
information
The probrem
because,
are that
useful
generating
because they
as examinee abilities.
rRT approach
provide
cutoff
2oo.
to
when examinee samples are below 100.
other
examineesr
However,
rRT model to
Rasch
use,
rRT
true
score
Lord offers
if
any,
The literature
is
silent
with
regard
useful
for
purposes
ability
to whether
estinates
of
optirnal
inclusion
includes
item
in
iten
that
r.
is
this
reason
for
poo1.
it
However,
used by these
optimal'
the
entirery
cutoff,
the
than
correlations
the
purpose
of
to
valuesrf
for
That
is,
if
rater
would
not
with
high
item
around
items.
rndeed,
,crassical
itern test
were more useful
be
a
information
that:
a test
programs
have some usefurness
items
,non
and that
selecting
stated
values
inappropriate
more optirnal
focusing
argue
a given
from
rRT computer
for
for
are
most circumstances
randomry
(discriminations)
strategy
strong,
scales
(1993)
term
itern statistics
allow
rn general,
the
descriptor.
under
sinpry
items
and
crassical
between
that
may be too
will
low correlation
A third
p,
itern statistics
have access
discrinination
selection.
with
with
optirnar
itens
mode1, which
unfortunately,
believed
Hambreton and de Gruijter
item
optimal
difficulty,
classicar
approach
for
be
even though
such as iten
a connection
inappropriate.
selection,
classical
selecting
authorsr
use of
identifying
crassical
is
does not
crassical
the
the
rnight be a better
developer
selection
Harnbleton and de Gruijter
trinappropriatert
then
not
and items.
for
is
statistics
there
persons
itern
use in
tests
discrimination,
statistics
Rasch model may stirl
rnight be unacceptable.
A second model for
for
the
in
itern
score
than
it,ems
(p.356).
optimal
item
selection
wourd be
to
use one of the
criterion
(1987)
discrinination
reference
review
procedures:
(cRT).
testing
four
indices
phi
the
(b)
difficurties
passing
B-index-
(i.e.,
of the phi
as a sorution
to
from differences
phi
coefficient
(d)
agreement
the
functions
derived
through
rtem
by which
infomation
(1969)
measurement
info:mation
(theta)
particularly
strategies
for
because
they
Gruijter
indices
(1993)
for
this
response
(rrFs)
resulting
is
the
on a test
as
information
theory
formulas
as
was evaluated.
were first
proposed
as the
by an iteur at
The iten
infor:mation
optirnal
measures of
power at
cut-of
f scores.
consider
rrFs
superior
(1959)
agreement
provided
measuring
max-
coefficient.
used iten
offer
reason.
value
of
and can be interpreted
useful
dj.scrirnination
This
each cRT procedure
leveI.
of phi
range
and outcomes
authors
iteur
phirzphi
by cureton
naxinum phi
functions
by Birnbaun
ability
in
item
the
(c)
probability
on a given
study
of exaninees
test.
totals.
the
rn this
between item
proposed
index
by the
statistic-
outcomes
standard
the
in rnarginal
divided
between
correct)
restriction
a whole.
the
difference
failing
serection
performance
test
proportion
and exaninees
a nodification
between
the
iten
correlation
exami.nees I itern and dichotonized
with
shannon and cliver
criterion-referenced
(a) phi-
outcomes-
associated
anount
a specific
functions
itern
of
are
selection
itern
Hanrbreton and de
to
conventional
The results
showed that
of the
some of
study
the
CRT statistics
produced
high
correlations
function
(IIF)
used in
indicate
that
resurts
reasonable
alrow
the
statistics
for
that
index
Several
studies,
crassically
was the
item
item
for
short
reasonabre
to
do not
require
present
there
selection
produced
rrFs.
This
optirnal
selection
strategy
item
purposes
selection.
procedures.
assume that
lirniting
are not
rerative
20 to
most appried
test
sizes
any studies
in which
to
the
a
studies
(i.e.,
studies
testing
2o or
of
rt
30 items.
optirnal
test
factor
that
has not
been studied,
is
situations
sizes
At
item
were
used.
Another
50
various
were used.
comparing
realistic
an rRT
these
rnslead,
30 itens
amount of
to
size
comparing
and
through
of realistic
of
resurt
the
A weakness of
the
CRT
investigated
can be gained
that
length,
strategies
for
do not
conventional
may be the
(1987),
selection
relatively
may be
Hambleton and Cook (L979)
to use tests
i-tems or more)
the
with
choice
These
indices
coefficient
based approach.
fairure
optimal
of
precision
based optirnal
theory.
of
coefficient
and Arrasrnith
measurement
information
when circumstances
correlation
discrirnination
Hanbreton
phi
the
phi
the
item
response
rRT procedures.
median rank
suggests
the
rfFrs
(L987)
investigated
cRT discrimination
compared,
highest
with
iten
substitutes
use of
by Shannon and Cliver
whi.ch
affects
the
measurement
strategiies,
is
particularly
the
decreases
it
also
of
of
subject
is
selection
conventionar
area
stability
as the
Therefore,
of
optirnal
examinee sample size.
important
known that
item
accuracy
important
to
selection
a
because
parameter
it
is
estimates
decreases.
detennine
functions
in
how an rRT
relation
techniques
referenced)
is
investigation
rRT iten
itern selection
and criterion
This
sample size
strategy
itern
to
(e.g.,
more
classical
in
situations
from
studies
invorving
samples
is
where a smarr
(i.e.,
rnformation
test
sizes
complete
and small
the
performance
Further,
the
examinee
research
of
developers
base regarding
optimar
resurts
itern
wirl
method to
of
strategies.
be of
the
advantages
from one test
assistance
to
test
and
deveropnent
another.
A fourth
iten
selection
t'he domain sarnpling
one sirnpry
fashion
rerative
clearly
switching
realistic
needed to
the
selection
who are weighinE
disadvantages
is
gleaned
serect
iterns
from the
content
model.
in
This
selected
behavior
make up the
to
model requires
rn this
represent
domain.
(Popham and Husek,
r-969; Harnbleton,
and coulson,
Nitko,
l97g;
being
random
approach
important
used,
that
a random or stratified
domain.
are pri-marily
that
mode1, currentry
items
crasses
of
Many researchers
swaminathan,
1984) have stated
that
Algina
in domai_n
referencing
serecting
the
an optirnal
weaken the
serecting
takes
application
set
of
of
items
int,erpretability
this
model
an inplicit
item
of
for
item
position
I
more important
accuracy
which
is
gained
t
position
raises
accuracy
is
iten
Hanbleton
a cut-off
sizes
derTeroper
domain
the
item
measurement
serection.
itern serection
(de Gruijter
such a
verses
random
both,
random item
were used,
100) ltrere large
parameter
terms
the
Tests for
optimar
subject
enough to
of
rates
selection.
estimates.
question
selection
strategy
100.
achieve
information
iteur
selection
and
smarl
relatively
stable
did
address
rate,
of
were
used (i.e.,
not
at
of various
relatively
N >
Rasch
the
domain score
a random itern
an optiural
where exaninee
information
tests
sample sizes
Because domain sarnpling
settings
for
Although
compares to
circumstances
I9g3;
these studies
of how the misclassification
sEE and test
test
(rRT)
These studies
accuracy,
used in
in
B-2o items).
through
& Harnbleton,
1993; Haladyna & Roid., 1983)
question
this
(i.e.,
through
studies
and misclassification
constructed
than
test
of how much measurement
by optinal
& de Gruijter,
investigated
under
By
selection.
Several
tests
than
by optinal
question
the
accurate
is
gained
domain score.
selection
that
for
would theoretically
the
representation
the
statistics
samples
rRT strategy
are
moders are
where examinee sample sizes
smaller
freguently
are
less
than
100 this
research
Of the
studies
comparisons
various
for
using
test
interpreted
the
rRT procedures
characteristic
estimate
relative
tneasured by the
This
the
thus
that
taken
estimate
of the
of
the
to
from the
(i.e.
in
observed
of
terms
similar
items
reviewed
the
certification,
the
the
rittle
an unbiased
is
score
donnain.
research
study
ratter
known with
other
itern
of
observed
focuses
those
encountered
professional
in
(i.e.,
teacher
licensure,
is
since
regard
serection
scores
to domain scores.
low examinee incidence
to
from
accuracy
estimates)
is
domain score
obserrred
utilized
compare with
of
score,
the
with
Study
The present
involving
a
be
associated
moder,
when the
a domain score,
domain score
Purpose
produces
can be contrasted
rn this
domain score
how rRT procedures
procedures
method
domain being
domain score
rnodel.
studies
of
This
(1984)
ability
estimate
based on a random sample of
the
method.
derived
appropriately
latent
for
IRT procedure.
version
estimate,
were tlpicarly
should
the
rates
domain score
and swaminathan
to
domain sanpling
definition
in which
the
curve
rRT dornain score
another
none of
far,
proced.ures,
by Harnbleton
domain score
with
reviewed
selection
estimates
described
needs to be addressed.
were made of the misclassification
itern
the
question
on settings
N's
< LOO)
content
area
and personnel
testing.
for
The test
the
study
50 itens),
are:
at
underlying
2.
of
questions
ability
the
(e.g.,
24O items.
were addressed.:
and point
biserial
selecting
items
cut-off
score
the
sEE, test
of dornain score
accuracy
such as classical,
and the
in
accuracy
classification
size
used
to
focus
a given
with
distribution?
differences
functions,
pool
item
of reasonable
be used for
a specified
of
Given a srnall examinee sample size
the
for
test
criterion
(n = 50),
what
information
estimates
development
referenced.,
and
strategies
domain sampring,
Rasch model?
3,
How do the
compare with
statistics,
results
resurts
used to
large
4.
optimal
(a) a test
should
infonnation
use of
size
What range of p-values
correlations
are
and the
(b) an itern pool
The following
L.
size
found
derived
select
and accuracy
number 2 (above)
from tests
itens,
in whieh
hrere derived
the
item
through
the
examj_nee samples.
How does focusing
item
in
selection
of
affect
test
inforrnation
through
the misclassification
domain percentage
correct
score
rate
estimates?
Chapter
Review of
This
to
study
increase
iten
of
topics:
mastery
decisions,
the
test
conventional
strengths
samples
reliability
of
to
This
of
measurement
comparing
rRT to
a
begins
using
to
focus
curves,
and rRT optinal
of
domain
by a more detailed
other
test
development
procedures.
Masterv
Testing
As Berk
terms
like
(1983) points
domain-referenced
LL
out
it
test,
i-s not
with
information,
inforrnation
and interpretation
was followed
a
following
indices
conventional
is
to use in
accurdcy,
itern and test
focus
using
classification
discrimination
strategies,
studies
of
how
IRT,
study
the
and classification
and weaknesses
estimates.
This
with
test
there
are best
exist.
testing,
referenced
selection
mastery
procedures,
dealing
information,
know more about
Additionally,
selection
statistics
measurement
review
in
literature
size
criterion
score
need to
information
itern
where small
a review
iteur
the
CRT, or d.omain sampling,
situation
using
Literature
statistics.
know which
classical,
the
fron
measurement
conventional
need to
resurts
II
uncommon to
objectives-referenced
find
test,
competency-based
test,
and criterion-referenced
in
the
to
the
.literature.
meaning
of
the
the purposes
rrMastery
tesLing
testing
in which
mastery
score
is
of this
above which
he is
donrt
is
above the mastery
nastery
us what the
tests
informative
but
referenced
testst'
The general
to
d5-rectly
(Hills,
falls
into
threshord
constructed
interpretable
estimation.
to
yield
call
from the
create
terms
of
score
that
test
he
of this
more
criterion-
9 7 l r.
testingr
that
are
performance
specified
L97L, p.
which
rr a test
as:
measurements,
653).
a dozen different
of criterion
(b)
The other
can do except
we will
or
Decisions
one of three
has its
and appropriate
statistics
referenced
categories
squared-error
Each index
and disadvantages
person
a cut-off
as successful
as fairing.
was defined
to
in
of l{asterv
loss,
regarded
1981, p.
testing
definition
important,
rfcriterion-referenced
term
reliability
the
as follows:
keep them separate
There are more than
measuring
is
score.
standardsrr (Glaser & Nitko,
Reliabilitv
is
much more difficult
encompasses mastery
deriberately
testing
as
criterion-referenced.
one is
regarded
revels
teII
of
one score
mastery
rnay be some guestion
study
a subtype
test,
used interchangeably
terrn mastery
only
and below which
type
test
Because there
used for
not
proficiency
test,
of
lossr
tests.
reriability,
or
(c)
own associated
applications.
for
Each
(a)
donain
seore
advantages
The threshold
dichotornous
It
assumes the
are equalry
this
type
Novick,
L973).
examinees
(i.e.,
pass or fail)
The squared
score.
The sguared
mastery
or nonmastery
deals
that
with
losses
instructional
the
along
.tests
of
individual
along
the
of
the
classified
there
is
mastery-nonmastery,
however,
teacher
another
on the
The dornain score
a concern
the
score
estiuration
rn
approach
and assurnes
and nonmastery
are most useful
for
rn licensure
not
of
The statistics
of mastery-nonmastery
typically
degree
this
mastery
serious.
cutoff
continuum.
approach,
where the
continuum.
from the
the
score
farse
approach
a score
based on the
of measurenent,
with
situations
is
reflect
loss
equally
this
perspective
&
percentage
scores
deviations
associated
degree
nonmastery)
(Hanbleton
the
1oss approach
consistency
with
all
An example of
index
reflects
threshold
are not
associated
with
the
size.
who were correctly
error
devi-ations
the
with
exists.
.
squared
to
of
or
score
and false
po
the
index
a test
of mastery
associated
mastery
is
This
taking
contrast
losses
regardless
function
a
based on a cutoff
(false
serious
of
t
I
an objective
misclassifications
assumes that
classification
of
usually
function
qualitative
nonmastery
decisions
l
loss
in
is
concerned
each student
and certification
with
statistic
degrees
does offer
precision.
statistics
deal
with
of
estiraating
are
d.own into
two categori.es,
specific.
of
confidence
intenrals
The group
individual
true
error
of
progress
to
the
extent
and group
ic
are
can be used to
are
form
obsenred
averages
calculated
the
use of
smaller
sub-sampres
produce
the
rarely
is
over
of
the
all
of
computers.
of
simulated
on the
item
domai.ns are
rn this
regrard,
finite
item
study
as well
(i.e.,
For example,
by sinply
differs
this
domain
as differences
where exarnineesr crassification
obserrred score,
can be
By using
domain
scores.
can be obtained
domains
domain,
responses
estimates.
total
and sarnple domain
based on their
content
From a finite
dourain score
scores
needed, for
specifying
but
one of
observred scores.
1982).
large
available,
t1pically
some finite
& Haladyna,
can be calculated
misclassification
are
accuracy
that
through
instances
can be broken
statistics
has been nade in
can sinurate
population
or mastery
specif
statistics
above
the
researchers
scores)
score
each ind,ividuals
statistics
(Roid
technique
that
specific
specifiable
drawn to
a domain which
statistics
specific
nentioned
estirnates
donai.ns
off
individuar
domain scores
procedures
Hoi*ever,
within
tested.
since
naking
adequacy
around
specific
individuals
the
standard
iterns
any cut
The individual
score.
I
of
Domain score
of
standard.
proportion
known independent
estirnates
t
the
between
the
summing the
(pass/fail),
fron
their
true
classification
based on their
accuracy
of the observed scores
correct)
can be evaluated
deviations
of obsenred scores
then taking
test
when using
length
i,s directly
errors
and,
Aqcur acy
the matter
related
that
are tolerable
of specified
upper linit
on test
the lengths
certification,
practicar
faI1
of most of the tests
few tests
within
items simply
acceptable
reliabilities
o f th e studies
Hanbleton
& de Gruijter,
serection
methods for
of less than 30 i.tems.
shourd utilize
with
test
testing
( e.g.,
contain
situations.
ress than 5o
Hanbleton & Cook, LgT g
is believed
that
very
fewer than 50 items.
mastery testj.ng
sizes
200 itens.
of generating
1983) investigating
rt
suggests
and occupational
the range of 50 to
these realms will
involve
in teacher
because of the difficulty
test
H ow e ve r, a l l
within
can
which set the
experience
personner selection,
licensing,
users.
situations
tirne parameters
size.
a
is very 1arge.
mastery testing
the reality
to test
of urisclassification
the size of the test
Ilowever, most appried
of determining
to the number of
very low probabirities
be achieved if
of applied
as percentage
from the domain score,
a mastery test
classification
that
the
an average.
t
obviousry
(expressed
simirarty
by surnming the absorute
T e st S i ze a n d C l a ssification
t
donain score.
optinral
have utilized,
that
future
are more realistic
itern
tests
studi_es
in terms
Item and Test Information Curves
Birnbaum (1969) defined the notion of information as a quantity inversely proportional to the squared length of the confidence interval around an estimate of an examinee's ability.
Generally, tests vary from one another in terms of where information is focused within the test. In this regard, the information curve within a single test varies with the ability level measured by the test. Because the information varies from one point on the test scale to the next, it has been suggested that test information curves should replace the use of classical reliability estimates and standard errors of measurement in test score interpretation.
Stated mathematically, the test information curve appears as follows:
$$I(\theta) = \sum_{g=1}^{n} \frac{P_g'^{\,2}}{P_g Q_g}$$
In this expression the amount of information at an ability level is expressed as $I(\theta)$; $P_g$ is the probability of a correct answer to item $g$ by an examinee with ability level $\theta$; $Q_g$ is equal to $1 - P_g$; and $P_g'$ is the slope of the item characteristic curve.
The guantity
presented
represents
contributes
level.
which
the
to the total
The prot
at all
is
ability
of the
levels
infornation
that
inforrnat,ion
at a given
information
is referred
curve.
when the
and plotted
for
test
called
the test
curves,
information
one to detenrtine
level
measure test
measuring
ability
the
items
item
with
of test
test
off.
The test
of test
the construction
information
infornration
at a given
paraneter
items
of the
difficulty
level.
However,
are considered
and therefore
it
nod.ers,
curve and the cond.itional
mod.er the
information
alrow
by sirnply
curltre d,epends on the slope
curve at a cut-off
information
curves
level
scores at each abirity
number of items with
is
one can directly
the two and three
discriminating,
informati.on
plot
which each abirity
ability
itern characteristic
in the one parameter
equally
with
a given
are sum:ned.
The itern infornation
information
of the test
i.nformation
particular
variance
for
by an itern
curves
the resulting
specifically,
the height
level.
test
abirity
to as the item
curve.
the accuracy
is estimated.
o.
item g
contributed.
information
items,
and the resulting
revel
sumrned in the equation
information
alr
curve at ability
the height
is
of the
dependent on the
varues crose to the cut-
curve and the particular
provides
of mastery
must be focused.
to be
tests
is particularly
type
useful
where measurement
Many excerlent
discussions
in
of
information
curves
(Hambleton &
can be found
S w a r n j . n a t h a n ,1 9 8 5 ; L o r d , I 9 8 O ; L o r d , 1 9 7 7 ;
Wright,
L977i
Birnbaurn, 196g).
The procedure
for
usi.ng the test
focus measurement infornation
involves
four
1.
basic
Describe
steps
fu n cti o n .
infornration
function.
will
select
fill
3.
4-
continue
function
h i g h e st
and three
the.tar get
functions
that
areas under the t,arget
for
test
items until
approximates
iterns.
the test
the target
information
be noted that, iterns which are optinal
i n i nfonnat,ion)
a given cut- off
for
The rerationships
without
of
using
can be
the
iten
three
serection
will
(see
Although the
vary
models the basic
item
item parameters
have been shown by nany researchers
parameter
rn all
test
degree.
data.
for
carcurate
the serected
H a mb l e to n & S waminathan, L986) .
procedures
test
information
or estimated
and information
same.
1977) z
identified
infonnation
e . 9 .,
filr
selecting
function
should
generally
iten
function
to a satisfactory
rt
and
each item is added to the test,
information
information
( i - e .,
items with
curve to
function.
After
the test
(Lord,
sinple
Lor d ( Lg77) cal1s this
up the hard to
informati.on
fairly
the shape of the desired
i n fo rma ti o n
2-
is
information
for
the one, two
concepts
moders the b-value
are the
estimates
the point
on the ability scale where an item is maximally discriminating. However, for the two and three parameter models the addition of the a-value provides an estimate of the amount of discrimination at the b-value. For example, in a two parameter model, to choose items which are optimal for a given ability cut-off one would select items with b-values focused at that cut-off and which have the highest a-values. The c-value added with the three parameter model tends to distort the properties of the a-value and the b-value when the c-value rises above zero. However, the item selection procedures would remain basically the same.
A function which is related to the test information function is the standard error of estimation (SEE). The SEE is the expected standard deviation of errors of estimated ability. This is equal to one over the square root of the information, $1/\sqrt{I(\theta)}$. For example, if one were to give a test to a group of examinees with identical $\theta$'s, and use the test to estimate their $\theta$'s, the standard deviation of those estimates would be the SEE. As the information increases, the SEE decreases. Since $I(\theta)$ varies along the $\theta$ scale, so will the SEE. This concept of SEE in IRT (Hambleton & Swaminathan, 1983) is a more viable alternative to the classical standard error function:
$$\sigma_e = \sigma_x \left[\rho_{xx}\,(1 - \rho_{xx})\right]^{1/2}$$
This function represents the standard errors averaged over the ability levels. Samejima (1977) concludes that the act of averaging errors and assuming the independence of true and error scores is unreasonable, and that the classical coefficient and its standard error of measurement are unpalatable.
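The contrast drawn here is that the IRT standard error of estimation changes with ability, whereas a classical standard error is a single averaged value. A minimal sketch of the IRT side follows, using the relation SEE = 1 / sqrt(I(theta)). The function name and the information values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def standard_error_of_estimation(information):
    """SEE at one ability level, computed as 1 / sqrt(I(theta))."""
    if information <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(information)

# Hypothetical test information values at three ability levels (illustrative only).
info_by_theta = {-1.0: 4.0, 0.0: 16.0, 1.0: 6.25}

# The SEE is smallest where the test information is largest.
see_by_theta = {
    theta: standard_error_of_estimation(info)
    for theta, info in info_by_theta.items()
}
```

In this made-up example the test is most precise at an ability of 0.0, which is the situation a mastery test aims for when the cut-off sits at that point.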
Using Traditional Item Statistics to Focus Measurement Information
Richardson (1936) showed that if one wants to differentiate examinees below a given ability level from those above it, without making distinctions among the examinees in the two groups, the items in the test should be of a difficulty level such that they are marked correctly by half the examinees at the ability level of interest. In other words, if a test developer wants to build an exam to discriminate between examinees capable of passing thirty percent of all the items, he/she would build a test composed of items which have a thirty percent difficulty level. Since it will be unlikely to have enough items of precisely the desired difficulty level, the test developer will have to use some items above and below the desired difficulty level. Other authors (e.g., Davis, 1961; Henrysson, 1971) have also discussed the subject of item selection to focus measurement information.
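The difficulty-matching idea attributed to Richardson can be sketched as a simple ranking: pick the items whose p-values lie closest to the desired difficulty, accepting some items slightly above and below it when exact matches run out. The function name, the item pool, and the target value are hypothetical; this is an illustration of the idea, not the selection program used in the study.

```python
def select_items_near_difficulty(p_values, target_p, n_items):
    """Return indices of the n_items whose p-values lie closest to target_p.

    Ties in distance are broken by the lower item index.
    """
    ranked = sorted(
        range(len(p_values)),
        key=lambda i: (abs(p_values[i] - target_p), i),
    )
    return sorted(ranked[:n_items])

# Hypothetical pool of classical p-values (proportion answering correctly).
pool = [0.10, 0.28, 0.30, 0.33, 0.55, 0.80]

# Build a three item test aimed at a thirty percent difficulty level.
chosen = select_items_near_difficulty(pool, target_p=0.30, n_items=3)
```

A fuller version would then rank the shortlisted items by a discrimination index such as the point biserial correlation.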
The
of
I
I
t
I
t
T
2L
The basic procedure offered by these sources is to use p-values to select items falling in the area of interest. Then, of the items falling at the area of interest, those with the best discrimination are used (e.g., selecting those with high point biserial correlations). To date there have not been any studies investigating what specific ranges of p-values and point biserials to use in order to maximize the measurement information at a given cut score.

Hambleton and de Gruijter (1983) point out that there is not a relationship between the underlying scale of the domain scores and the p-values. They conclude that, for this reason, classical item statistics are inappropriate as an optimal item selection method. However, in two other studies, Hambleton and Cook (1979) and Hambleton and Arrasmith (1987), the authors used a specified range of p-values and point biserial correlations to select items. Although the apparent intent of using the particular range of p-values discussed by the authors was not to focus test information at a particular cut-off point, intuition would suggest that there is some relationship between the p-value and the domain ability cut-off score, albeit not a mathematically direct relationship. In the next section of this study a mathematical relationship will be shown between another conventional item statistic, the phi coefficient, and the IRT test information at the cut-off score.
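The two-stage classical procedure described above (screen items on p-value, then rank the survivors by point biserial) can be sketched as follows. This is only an illustration under stated assumptions: the `ItemStats` record, the function name `select_classical`, the p-value window and the item values are hypothetical and are not taken from the studies cited.

```python
from typing import NamedTuple


class ItemStats(NamedTuple):
    item_id: int
    p_value: float         # proportion of examinees answering correctly
    point_biserial: float  # item-total point biserial correlation


def select_classical(items, p_low, p_high, n_select):
    """Keep items whose p-value lies in [p_low, p_high], then return the
    n_select items with the highest point biserial correlations."""
    in_range = [it for it in items if p_low <= it.p_value <= p_high]
    ranked = sorted(in_range, key=lambda it: it.point_biserial, reverse=True)
    return ranked[:n_select]


# Hypothetical five-item pool, for illustration only.
pool = [
    ItemStats(1, 0.35, 0.50),
    ItemStats(2, 0.55, 0.42),
    ItemStats(3, 0.62, 0.31),
    ItemStats(4, 0.70, 0.47),
    ItemStats(5, 0.92, 0.60),
]
chosen = select_classical(pool, p_low=0.40, p_high=0.80, n_select=2)
print([it.item_id for it in chosen])
```

Items 1 and 5 are screened out by the p-value window even though they discriminate well, which is the trade-off the window is meant to enforce.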
Although a similar relationship with test information can not be established for the p-value and point biserial correlation, these conventional statistics are related to other IRT statistical concepts. Schmidt (1977) shows that the b parameter is related to the p-value in the following way:

\[ b = \frac{y \, z \, (1 - c) \, \sqrt{\text{KR-20}}}{d \, \sqrt{pq}} \]

where

d = d-value, the point biserial item-test correlation
p = p-value, the proportion of examinees correctly answering the item
q = 1 - p
KR-20 = Kuder-Richardson formula 20 reliability
y = the height of the N(0,1) curve at the z score that cuts off p' proportion of the area under the N(0,1) frequency function
z = the z-score that cuts off p' proportion of the area in the upper portion of the N(0,1) frequency function
c = the c-value (item pseudo chance level)
p' = (p - c) / (1 - c)
Schmidt shows that the a parameter is related to the point biserial correlation in the following complex way:

\[ a = \frac{d \, \sqrt{pq}}{\sqrt{(\text{KR-20}) \, (1 - c)^2 \, y^2 - d^2 \, pq}} \]

These formulas demonstrate that there is a mathematical relationship between conventional statistical measures of difficulty (p-value) and discrimination (point biserial correlation) and the IRT corollaries of difficulty (b-value) and discrimination (a-value). Given these mathematical relationships it is believed that an empirical relationship (e.g., correlational) between item information and these statistical measures can be shown. If this assumption is true then the p-value and point biserial could be used in a manner similar to the b-value and the a-value in identifying items that will focus information at a given cut-off score.
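As a rough numerical illustration of the two approximations attributed to Schmidt (1977), the sketch below computes a and b from classical statistics. It rests on assumptions: the radicals in the formulas are read as square roots over KR-20, pq and the denominator of a (the usual form of these approximations, but not legible in the source), z is taken as the normal deviate with p' of the area above it, and the function name and input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def schmidt_approximation(p, d, c, kr20):
    """Approximate IRT a and b from the p-value (p), point biserial (d),
    pseudo chance level (c) and KR-20 reliability (assumed reading)."""
    q = 1.0 - p
    p_prime = (p - c) / (1.0 - c)           # p-value corrected for guessing
    if not 0.0 < p_prime < 1.0:
        raise ValueError("p must lie strictly between c and 1")
    std_normal = NormalDist()
    z = std_normal.inv_cdf(1.0 - p_prime)   # cuts off the upper p' of the area
    y = std_normal.pdf(z)                   # height of the N(0,1) curve at z
    # Item-ability correlation implied by the classical statistics.
    rho = d * sqrt(p * q) / (y * (1.0 - c) * sqrt(kr20))
    if not 0.0 < rho < 1.0:
        raise ValueError("implied item-ability correlation is outside (0, 1)")
    b = z / rho                             # = y z (1-c) sqrt(KR-20) / (d sqrt(pq))
    a = rho / sqrt(1.0 - rho ** 2)          # = d sqrt(pq) / sqrt(KR-20 (1-c)^2 y^2 - d^2 pq)
    return a, b


# Hypothetical item: 60% correct, point biserial .40, c = .20, KR-20 = .90.
a_hat, b_hat = schmidt_approximation(p=0.6, d=0.4, c=0.2, kr20=0.9)
print(round(a_hat, 3), round(b_hat, 3))
```

Writing a and b in terms of the intermediate correlation `rho` makes it easy to see why the approximation breaks down when the point biserial is large relative to y(1 - c): the implied correlation reaches 1 and a is undefined.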
The theoretical work by Richardson (1936) clearly shows that measurement information can be focused through the use of classical item statistics. The question that remains is whether there are ways to maximize the effectiveness of these item statistics for the purpose of maximizing information at a cut score. Because the empirical or mathematical relation of p-values and point biserials to the test information function is at present unknown, there has not been any research into ways to maximize the effectiveness of using p-values and point biserials to make item selections. It is believed that research regarding this question is warranted.

While the relation of p-values and point biserials to item information is unknown, there are other traditional item statistics for which a relationship to item information functions (IIF) has been shown. A study in 1987 by Shannon and Cliver evaluated four conventional item discrimination indices (phi, B index, phi over phi max, and the agreement statistic; Harris & Subkoviak, 1986; van der Linden, 1981), using the IIFs as the criterion of effectiveness. Of the indices reviewed, the one with the highest rank correlation with the IIFs (median r = .96) was the phi coefficient. The authors postulated that this finding was explained by the fact that phi is approximately proportional to the IIF. This approximate relationship is based on first relating the IIF to the B index and then the B index to the phi coefficient.
This was accomplished by delineating a proportional relation:

\[ \mathrm{IIF} \propto \frac{B^2}{P_i Q_i} \quad \text{(approximately)} \]

where P_i = the proportion of examinees who answer item i correctly and Q_i = 1 - P_i.

In this context \sigma_{iT} represents the covariance between the binary item and test scores, expressed as

\[ \sigma_{iT} = \mathrm{Cov}(u_i, u_T) = p_{iT} - p_i p_T . \]

The variance of u_T = p_T q_T, so B can be expressed as:

\[ B = \frac{\mathrm{Cov}(u_i, u_T)}{\mathrm{Var}(u_T)} \]

This equation represents the slope of the regression line predicting u_i from the binary ability measure u_T. The numerator of the information function in the equation below is defined by Birnbaum (1968) as the squared derivative of the item response function with respect to ability.
\[ I(\theta, u_i) = \frac{[P_i'(\theta)]^2}{P_i(\theta) \, [1 - P_i(\theta)]} \]

Since P_i(\theta) is the regression of the item score on the continuous ability measure \theta, the numerator in the equation above is the squared slope of the item ability regression. Therefore the squared slope in the information function can be replaced with B^2, and p_i can be substituted as an estimate of P_i(\theta). The IIF can then be related to phi because phi and B are related:

\[ B^2 = \phi^2 \, \frac{P_i Q_i}{P_T Q_T} \]

then, because P_T Q_T is constant for all items in a test:

\[ \frac{B^2}{P_i Q_i} \propto \phi^2 \]
therefore:

\[ \mathrm{IIF} \propto \phi^2 \quad \text{(approximately)} \]

The mathematically direct and empirically strong relationship between the phi coefficient and the IIF suggests that the phi coefficient would perform well for the purpose of selecting optimal test items. However, at present there have not been any studies confirming that tests produced through the use of the phi coefficient are comparable in measurement precision to those developed through IRT procedures. Specifically, there have not been any studies that compared the classification accuracy, standard error of the estimate, and test information of tests developed through the use of IRT and phi coefficients.
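The algebraic link between the B index and phi used in the derivation above can be checked numerically. The sketch below is illustrative only: the function name and the ten-examinee data set are hypothetical, and mastery status u_T is simply given rather than derived from a cut score.

```python
from math import sqrt


def b_index_and_phi(item, mastery):
    """item, mastery: equal-length lists of 0/1 scores (u_i and u_T).
    Returns (B, phi, p_i, p_T)."""
    n = len(item)
    p_i = sum(item) / n
    p_t = sum(mastery) / n
    p_it = sum(u * m for u, m in zip(item, mastery)) / n
    cov = p_it - p_i * p_t                       # Cov(u_i, u_T)
    b = cov / (p_t * (1 - p_t))                  # slope of u_i on u_T
    phi = cov / sqrt(p_i * (1 - p_i) * p_t * (1 - p_t))
    return b, phi, p_i, p_t


# Hypothetical data: five masters followed by five non-masters.
item_scores = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]
mastery_status = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

b, phi, p_i, p_t = b_index_and_phi(item_scores, mastery_status)
lhs = b ** 2 / (p_i * (1 - p_i))      # B^2 / (P_i Q_i)
rhs = phi ** 2 / (p_t * (1 - p_t))    # phi^2 / (P_T Q_T)
print(round(b, 3), round(phi, 3), round(lhs, 3), round(rhs, 3))
```

Here B equals the difference between the proportion of masters and of non-masters answering correctly (.8 versus .2), and the two sides of the identity agree, so ranking items by B^2 / (P_i Q_i) is the same as ranking them by phi squared within a single test.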
Strengths and Weaknesses of Conventional and IRT Optimal Item Selection Strategies

The concepts of strong and weak are relative, and their use must be accompanied by a reference point from which a comparison can be made. In the case of IRT the reference point for comparison is the standard testing technology, sometimes referred to as classical test theory, which dominated the applied world of testing through the 1970's and perhaps still dominates the 1980's. Briefly, the commonly cited weaknesses of conventional item statistics are group dependent item statistics, test dependent ability estimates, and a single statistic representing the measurement error existing in a test (Lord & Novick, 1968; Marco, 1977; Hambleton, Swaminathan, Cook, Eignor & Gifford, 1979).
The use of an IRT model allows the user to avoid these pitfalls by generating item and sample free statistics and measures of the precision of ability estimation at different ability levels. IRT is especially useful for the task of focusing test information because the contribution of any item to the test information function can be determined independently of the other items in the test. With classical testing technology, the exact contribution of each item to test reliability or the error of measurement can not be determined independently of all other items. However, the concept of focusing test information at a cut off point is not alien to the classical measurement philosophy. Indeed, p-values and the discrimination index were being used to focus information long before IRT was popular (Richardson, 1936).

The benefits of using an IRT approach do not come without some disadvantages. The most commonly cited disadvantages according to Hambleton and Swaminathan (1985) are: meeting dimensionality assumptions, identifying the model that best fits the data, meeting the needs for large samples, using complicated computer programs, securing highly trained technical staff to interpret the results, and explaining results to lay individuals.
The disadvantages associated with an IRT approach to optimal item selection in general do not apply to the statistical procedures utilized in the classical and criterion referenced approaches. Indeed, it is the disadvantages associated with the IRT approach that make conventional (i.e., classical and criterion referenced) statistical approaches appealing. The following are four advantages that conventional approaches offer which are conceptually parallel to the four disadvantages of the IRT approach.

(a) The conventional approaches do not require complex statistical analysis to prove the data is unidimensional.

(b) The conventional model that is chosen is determined by the purposes of the test rather than the nature of the candidate response data. For example, from an IRT perspective, if analysis of the candidate response data indicates that extensive guessing is occurring then a three parameter model is warranted.

(c) Small sample sizes do not present as serious a problem for the more conventional approaches. The point biserial and phi coefficient could be calculated with few bivariate pairs of numbers. In contrast, it is recommended (Lord, 1980) that at least 1000 subjects and 30 items be used with the computer program LOGIST for calculating the IRT item and ability parameters.

(d) The general public and test developers have more familiarity with conventional procedures for computing item statistics and total scores.
Interpretation of Domain Score Estimates

In 1969 Popham and Husek published an article titled "Implications of Criterion Referenced Measurement". In this article they discussed the distinctions between norm referenced and criterion referenced approaches to testing. The authors point out that one of the central differences is the use of score variability. For the norm referenced procedure, score variability is very important because the test constructor wants to be able to evaluate an examinee's performance in relation to all other examinees. In this case the amount and the type of discriminations an item makes become important. Therefore, certain statistical indices are used (e.g., point biserial correlations) in order to evaluate the discrimination power each item exhibits.

In the case of criterion referenced measurement, tests are typically constructed by selecting a subset of items from a large pool of items that represent a domain of task or instructional material. Each item's importance to the measurement accuracy of the test is determined by the importance of the instructional material or task it represents, and not by the amount and the type of discriminations it produces. Therefore the use of item discrimination indices to increase score variability may reduce the interpretability of the test scores.
This would occur if the items chosen through item analysis were not representative of the larger item group (domain). This position is supported by other researchers (e.g., Hambleton, Swaminathan, Algina & Coulson, 1978; Berk, 1980).

Early in the 1980's researchers (Hambleton & de Gruijter, 1983; Haladyna & Roid, 1983) began investigating how IRT procedures might be used to evaluate item performance on criterion referenced tests. They compared several traditional item discrimination indices to item response theory item information indices for the purpose of focusing measurement information at a cut-off score. As a result of their research findings, these researchers proposed the use of IRT measures of information, and more specifically item information, to improve measurement accuracy.

The question that arises is how IRT procedures, which depend on score variability, can be used for selecting test items without destroying the interpretability of the test scores. A review of the related literature reveals that this question has yet to be directly addressed and debated. Perhaps the question has apparently been lost in the confusion and ill-defined terminology that has typically surrounded criterion referenced testing. It is believed, however, that the reasoning behind the suggested use of item information to make item selections is based on the definition of a "domain score" as it is normally used in IRT applications to criterion referenced tests.
Specifically, IRT allows one to estimate a proportion correct score (domain score) on a large or infinite domain of items from a small subset of items which have been selected to provide optimal discrimination. Hambleton and Swaminathan (1984) state the following:

When the test items included in the test are a representative sample of test items from the domain of items measuring the ability, the associated test characteristic function transforms the ability score estimates into meaningful domain score estimates. A problem arises, however, if a non-representative sample of test items is drawn from a pool of test items measuring an ability of interest. Such a sample may be drawn to, for example, improve decision making accuracy in some region of interest on the ability scale. The test characteristic function derived from such a non-representative sample of test items does not provide a way for converting ability score estimates to domain score estimates. While ability estimates do not depend upon the choice of items, the domain score estimates will be biased due to the non-representative selection of test items. However, if the test characteristic function for the total pool ...
random or stratified random selection of items from the larger pool of items. The score resulting from a typical CRT test of this type is not completely unambiguous. If an examinee earned a score of 90 percent correct, one does not know which items the examinee missed. However, if the items were a representative sample of the larger domain of items, then one does have an unbiased estimate of the percentage of the knowledge domain the subject has mastered. This is just the interpretation of the percent correct score offered by Popham and Husek.

It is with the second type of CRT test score estimate, the domain ability estimate, where an IRT derived score would appear to diverge from the desired domain score. It is difficult to specify what interpretation a user may give to a domain ability score estimate derived from the set of items, because the ability dimension which the items define is ethereal and difficult to interpret. If the items are drawn out of a single narrowly defined subject matter, for example, anatomy of the foot, then the ability dimension would probably have some relation to the level of knowledge in terms of some taxonomic schema like that of Gagne or Bloom. However, as the number of subject areas is increased, as in a typical licensure and/or certification test, then the meaning of the ability dimension becomes more complex, because certain subject areas might be systematically more complex than other subject areas and by chance produce items with p-values all centered at a particular ability level.
If this occurs it would be difficult to imagine how an ability score could be directly interpretable in terms of any specified performance standards. Many IRT theorists might argue that CRT tests of the second type would tend to violate the unidimensionality assumptions underlying IRT theories, thus contraindicating the use of IRT procedures. However, it is likely that IRT procedures will continue to be used on CRT tests of this second type and thus open the way for mis-interpretation of the domain score estimates.

The difference in the interpretations that can be given to a domain score has importance for the purposes of this study. Specifically, this study focuses on increasing measurement accuracy at a cut-off point in testing situations where small examinee samples exist (i.e., n = 50). The small samples used eliminate from consideration the use of IRT estimates of domain abilities. Therefore, the domain score estimates must necessarily be derived from percentage correct observed scores. In this way information will be gleaned regarding the effect that focusing measurement information at a cut-off point has on the accuracy of a domain score estimate calculated from the percentage correct observed score. As previously mentioned, this can be accomplished because with simulated data a large subject by item pool can be created, and smaller samples of subjects and items can be drawn for estimating the known population parameters.
The accuracy of the domain score estimate can be evaluated both in terms of classification accuracy (pass/fail) and absolute deviation from the domain score (i.e., percentage correct score on the item pool).

The information gained from this analysis will be especially important to the use of the phi coefficient recommended by Shannon and Cliver (1987). In their study the authors recommend the use of the phi coefficient to identify items that have a high correlation with three parameter item information when IRT procedures are not feasible (i.e., small sample sizes). The authors recommend (personal conversation, May 6, 1988) using the number right score as the estimate of the domain score, since a test characteristic curve would not be available to estimate domain ability (from an IRT perspective). With a number correct score the domain score would have to be reported in terms of the percentage of items in the domain which could be answered correctly. Once again, the importance of the present study will be in determining what effect focusing measurement accuracy has on the classification accuracy of domain score estimates when the estimates are interpreted as an estimate of the percentage of items examinees would answer correctly if they were given all items in the item pool.
Studies Comparing IRT to Other Test Development Procedures

Studies contrasting the measurement accuracy of IRT item selection strategies to classical, domain referenced and other item selection strategies are limited. In 1979 Hambleton and Cook used simulated data for 200 items and 200 subjects to compare five item selection techniques. Tests were compared in terms of test score information at five ability levels. The item selection strategies were: (a) Random: items were selected totally at random; (b) Standard: items with difficulties between .30 and .70 were selected, and of the items within this range, only the thirty items with the highest discriminations were chosen; (c) Middle difficulty: items that provided the maximum amount of information at an ability level of 0.0 were selected; (d) Up and down: this method involved a three step process. First, an item was selected that provided the maximum information at an ability level of -1.0. Then an item was selected that provided the maximum amount of information at 0.0. The third step was to select an item that provided the maximum information at +1.0 and then go to step one. This was repeated until thirty test items were selected from the pool; (e) Maximum information: this method involved averaging the information provided by each of the items in the pool across three ability levels of -1.0, 0.0, and 1.0. The items with the highest average across the three ability levels were selected.
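The information-based strategies described above can be sketched in a few lines. The sketch assumes a three parameter logistic model with scaling constant D = 1.7 and its standard item information function; the function names, the six-item pool, and the order in which the "up and down" method visits the ability levels are assumptions made for illustration, not details confirmed by the source.

```python
from math import exp

D = 1.7  # conventional logistic scaling constant (assumed)


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model."""
    return c + (1 - c) / (1 + exp(-D * a * (theta - b)))


def info3pl(theta, a, b, c):
    """3PL item information at ability theta."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2


def select_max_information(pool, thetas, n):
    """Rank items by their mean information over `thetas`; keep the top n.
    With thetas=[0.0] this behaves like the 'middle difficulty' strategy."""
    def mean_info(item):
        return sum(info3pl(t, *item) for t in thetas) / len(thetas)
    return sorted(pool, key=mean_info, reverse=True)[:n]


def select_up_and_down(pool, thetas, n):
    """Cycle through `thetas`, each time taking the remaining item with
    the most information at the current ability level."""
    remaining = list(pool)
    chosen = []
    step = 0
    while remaining and len(chosen) < n:
        theta = thetas[step % len(thetas)]
        best = max(remaining, key=lambda item: info3pl(theta, *item))
        remaining.remove(best)
        chosen.append(best)
        step += 1
    return chosen


# Hypothetical pool of (a, b, c) triples.
pool = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2),
        (0.5, 0.0, 0.2), (1.5, 0.0, 0.2), (0.8, 2.0, 0.2)]

middle = select_max_information(pool, [0.0], 1)
up_down = select_up_and_down(pool, [-1.0, 0.0, 1.0], 3)
print(middle, up_down)
```

Because each item's information is computed without reference to the other items, both strategies reduce to sorting or repeated maximization over the pool, which is the property that makes IRT convenient for focusing information at a cut score.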
In the results of this study, the random method, not surprisingly, produced the smallest amount of information at the ability levels of interest. All other procedures surpassed the random method in providing appreciable amounts of information at the ability levels of interest. The middle difficulty method provided the greatest amount of information at the cutoff level of 0.0. However, this method also provided the least amount of information at the other ability levels, with the exception of the random method. The standard/classical method provided almost as much information at the center of the ability distribution (i.e., 0.0), a reflection of the roughly normal distributional shape of the item pool. In fact, the only method that surpassed the classical method at 0.0 was the middle difficulty method. The up and down method surpassed the standard method at the levels below -1.0 and at the upper levels of ability. Interestingly, the "maximum information" method provided the same level of information as the up and down method at the levels of -1 and +1, but was equal to the classical method and surpassed by the middle difficulty method for the level of 0.0.
In summary, the random method, which would most closely correspond to the domain referenced approach, fared poorly at the theta levels studied. The classical approach did surprisingly well, equal to or better than the two IRT based approaches at the center of the ability distribution, and was surpassed only by one IRT based approach (i.e., middle difficulty) at the level of 0.0. The difference in information at the ability level of 0.0 was five points, with a value of 40 for the middle difficulty approach and 35 for the classical approach.

In light of the purposes of this study, there are a few questions a researcher might ask which are relevant to comparing alternative item selection strategies. They are:

1. Given the fact that the classical approach was only surpassed by the middle difficulty procedure at the center of the ability distribution, what would the difference in information have been had the test lengths been set at 50 or 60 items?

2. Assuming that the test lengths remained at 30 items, what would happen to the differences in information between the IRT based procedures and the classically based procedures if the parameter estimates had been based on sample sizes of less than 100?

In 1983 Hambleton and de Gruijter conducted a similar study with three primary objectives: first, to consider the inappropriateness of classical item statistics in criterion referenced test item selection; second, to clarify the IRT item selection procedure; and finally, to offer two examples that highlight the advantages of an IRT method.
The authors point out that the primary disadvantage to a classical approach is that the item difficulty index, the p-value, is on a different scale from domain scores. To illustrate, the authors provide a simple example using three groups of twenty items at three different difficulty levels (.60, .80, and 1.0) and five subjects each. In the example the p-values and the domain score estimates are not on the same scale. The example is effective in demonstrating that there is not an exact relationship between the domain score scale and the item difficulty index. What the example fails to show with regard to classical item statistics is that they can not be useful in selecting optimal items. It does not show that the classical approach does not produce accurate domain score estimates at the correct cut-off point. Further, although the authors point out that there is a relationship between the IRT difficulty index (b-parameter) and the classical difficulty index (p-value), the authors fail to point out that there is a relationship between the IRT discrimination index (a-parameter) and the classical discrimination index (point biserial). In fact, the results of a 1979 study by Hambleton and Cook, previously cited, show this quite clearly.
The Hambleton and de Gruijter (1983) article does not mention the results of the 1979 study. In fact, the authors did not generate data for a classical approach. Instead the authors compared a random item selection strategy with an IRT optimal item selection strategy using simulated data. As would be expected, the results showed the one parameter IRT approach to be superior for tests of various sizes (i.e., 8-20 items) and at several different cut-off scores (i.e., .65, .75, and .80) when item pools are homogeneous with regard to discrimination. The conclusion of this study is still important, which is that with an IRT strategy it is possible to produce the shortest test which meets certain preset criteria regarding probabilities of misclassification, as derived from an ability score scale.

A similar study was conducted by the same authors in the same year (1983). This study differed from the previous study in that simulated data was used and therefore the true probabilities of misclassification could be computed. The study also differed in that item parameters were estimated, thus making the study more realistic. The results were generally congruent with those of the previous study.

A third study was conducted in 1983 by Haladyna and Roid, which was very similar to the other two studies reviewed.
In this study the authors used a one parameter model (Rasch model) to focus information at a cut-off score for a criterion referenced test on dental health. Once again a random sampling model was used for comparison purposes. This study differed from the other studies cited in that an additional measure of measurement accuracy was calculated for the domain score estimates derived. This measure was the average absolute deviation (AAD) of the domain score estimate from the domain score. The domain score for this study was defined as a large pool of items created for the study. The study utilized a ratio of the AAD to the SD of the deviations in order to compare the one parameter model to the random sampling model. The authors felt that through this type of comparison the relative accuracy of the domain score estimates from the Rasch and the random procedures could be judged. However, because of the conceptual differences that exist between a domain ability score and a domain percentage correct score, the present author believes that this comparison is inappropriate.

In 1985, Hambleton and Arrasmith carried the research in this area one step further by comparing several test development strategies based on items taken from a real test. Using procedures similar to some of the previous studies, the researchers defined a finite domain of items (item pool) and then developed a smaller test by selecting items from the pool.
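The average absolute deviation (AAD) measure described above for the Haladyna and Roid study, and the pass/fail misclassification rate used throughout this review, can be computed as in the sketch below. It is a minimal illustration: the function names and the four-examinee data are hypothetical, and the use of the population (rather than sample) standard deviation of the deviations is an assumption, since the source does not say which was used.

```python
from statistics import mean, pstdev


def aad_and_ratio(estimates, domain_scores):
    """Average absolute deviation of the estimates from the domain scores,
    and the ratio of the AAD to the SD of the deviations."""
    deviations = [e - d for e, d in zip(estimates, domain_scores)]
    aad = mean(abs(x) for x in deviations)
    sd = pstdev(deviations)  # population SD: an assumption
    ratio = aad / sd if sd > 0 else float("nan")
    return aad, ratio


def misclassification_rate(estimates, domain_scores, cut):
    """Proportion of examinees whose pass/fail status based on the
    estimate differs from the status based on the domain score."""
    wrong = sum((e >= cut) != (d >= cut)
                for e, d in zip(estimates, domain_scores))
    return wrong / len(estimates)


# Hypothetical proportion-correct scores for four examinees.
test_estimates = [0.70, 0.80, 0.65, 0.90]
pool_scores = [0.75, 0.78, 0.70, 0.85]

aad, ratio = aad_and_ratio(test_estimates, pool_scores)
error_rate = misclassification_rate(test_estimates, pool_scores, cut=0.75)
print(round(aad, 4), round(ratio, 3), error_rate)
```

Both measures compare a short-test estimate with a score on the full item pool, so they apply directly to the percentage correct definition of the domain score without requiring an ability scale.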
As with the previous studies, the random method and the classical approach were compared to several IRT based strategies. However, the authors included a new approach called "content optimal." The content optimal approach was created to deal with the problem of content representation, which may arise when items are selected based upon statistical rather than content consideration. For this procedure items were selected which provided maximum information at the cut-off score, subject to the constraint that the final version of the exam must follow the content specifications approved by a content committee. This study was the first to address the possibility of low content validity when using IRT procedures to focus information. However, the authors did not mention the accuracy of the domain percentage correct score estimate as the reason for increasing content representation. Indeed, there was no analysis of the differences between the percent correct observed scores (domain score estimates) and the domain percent correct scores for the finite population defined. As was the case in previous studies, the authors chose to evaluate classification accuracy in terms of an IRT domain score perspective.
The classical approach involved selecting items that had (a) p-values between .40 and .80; and (b) the highest available classical item discrimination index (point biserial correlations). Additionally, the exam had to be in compliance with the test blueprint. For this study the exam length was kept short to minimize the overlap of items across tests. The authors stated that tests of only twenty items were used. The criterion test contained 249 items distributed across 12 different sub-areas within the nursing field. Despite the large number of different subtests, suggesting the possibility of more than one dimension, all three IRT models (i.e., one, two and three parameter) fit the criterion test well based on analysis of the standardized residuals. The authors chose to use the three parameter model for the study because it did produce a slightly better fit. The cut-off scores selected for this study were 65%, 70% and 75%. The distributional shape of the examinee population was not mentioned.

The findings of this study were similar to previous studies in that optimal exams provided three to four times more information than the random exams and resulted in practically significant improvement in classification accuracy based on domain ability scores. The classical exams produced decision accuracy comparable to IRT based exams when cut-off scores were near the center of the distribution. However, they fared less well when cut-off scores were not near the center of the distribution.
A new finding mentioned by the authors was that when cut-off scores were near the center of the distribution, the overlap of the content optimal exam with the IRT exam was high. The opposite was found when cut-off scores were not near the center.

The findings of the studies reviewed in this section do not seem to indict the traditional procedures as being inappropriate for optimal item selection. Indeed, the classical procedure produced favorable results considering that tests consisted of relatively small numbers of items (i.e., between eight and thirty items). It is believed that the gaps between the classical procedure and the best IRT procedure would be reduced if test sizes were more realistic, say 50 items. Further, modifications of the item selection criteria utilized for the classical procedure would probably produce even more favorable results for the classical approach. For example, if p-values which were more closely related to the upper end of the ability distribution had been used for the cut off of 75%, more accurate classifications would have been obtained. Perhaps the p-value selection range should have been modified to encompass the values .65 to .95 instead of .40 to .80.

Also missing from the studies reviewed was the evaluation of the phi coefficient for the purpose of selecting items with high information.
Given the strong relationship with item information identified by Shannon and Cliver (1987), the phi coefficient would be expected to perform quite well. Finally, none of the studies reviewed discussed the dual definitions that are given to domain scores. Indeed, all of the studies cited evaluated the estimates of domain scores in terms of latent ability domains. Consequently, no evaluations were made as to what effects the selection of items to increase test information has on estimates of the percentage correct domain score.
Chapter III
Methodology
Introduction
The purpose of this study was to examine the differences that exist between the results of IRT and more traditional approaches to maximizing measurement information at a particular score point for competency or mastery examinations. Specifically, this study focused on the differences in results that may exist among these approaches when the parameter estimates, needed to guide the test developers' selection of test items, are based on small sample sizes. The general methodological outline that was followed for this study is as follows:
1. A simulated item pool was established through the use of a computer program designed to generate simulated examinee item scores;

2. Item information functions were calculated for all items in this item by subject pool. The relationship between the items' information functions and p-values and point biserial correlations was determined in order to select items expected to yield maximum information at the chosen cut-off score;

3. A second simulated item by subject pool (identical in size to the first) was established, from which random samples of 50 examinees' responses were drawn for use in calculating item b-values, p-values, point biserial correlations, and phi coefficients;

4. Using the test statistics associated with the selection methods (described below), 50 test items were chosen by each selection method. Note that the items used for the "random selection" method were simply randomly selected;

5. The second simulated data matrix of 1000 examinees by 240 items was also used to construct a 50 item test by each of the item selection methods utilizing the full subject by item pool for calculation of the item statistics. The tests produced in these ideal conditions will serve as a reference point for evaluating the performance of the tests developed under adverse (i.e., small samples) conditions;

6. Finally, the effectiveness of the four test development techniques was evaluated by comparing the SEEs, mean test information functions, misclassification rate, and the mean average absolute deviations (AAD) from the domain scores for each technique.

These methodological steps are further delineated through the following sections, beginning with definitions of the four item selection techniques to be studied. Next, because of the complexity of the modified classical approach, a
detailed description of the procedures is given. This will be followed by a description of the computer program used for generating the data and of the distributional shape of the item and ability parameters created by the program. Next, a rationale will be given for the length of the test and the size of the subject samples to be used. This is followed by a description of the item selection procedures and the procedures for generating the random samples of examinee response patterns. Finally, the dependent measures that were used to evaluate the effectiveness of the four item selection techniques are defined and the procedures for judging results are described.

Definitions of Item Selection Techniques

The item selection techniques that were examined in this study were referred to as modified classical, criterion referenced, item response theory, and random. Each is operationally defined as follows:

1. Modified classical - items are selected using p-values and point biserial correlations which are expected to maximize measurement precision at the cut-off score. This approach deviates from the traditional classical approach slightly in that specific ranges of p-values and point biserial values were identified a priori which identify the items with high information functions at a particular cutoff score.

2. Criterion referenced - the items with the highest phi coefficients between passing or failing on the item and passing or failing status on the test were selected for inclusion on the test.

3. Item response theory - the items yielding the highest amount of item information at a cut-off score according to the one parameter Rasch model were selected.

4. Random - items were selected at random.

These terms will be given further definition in the section that follows.

Establishing Item Selection Criteria For The Modified Classical Technique

A data pool consisting of the simulated responses of 1000 examinees to 240 items was created. The information associated with each item at a cut-off ability of .5 was calculated. All of the items were rank ordered (high to low) in terms of the information that the items produced at the chosen cutoff. The sorted item information values, p-values, and point biserial values are presented in Appendix 1. The sorted values were used in selecting the range of p-values and biserial values to be used in selecting the items for the modified classical procedure. In addition, two bivariate plots, representing item information by p-values and item information by point biserial correlations, were also used in determining the values to be used in the modified classical procedure. Appendix 2 contains these bivariate plots.
second data pool was in how the a, b, and c parameters appeared for each item. In other words, an item with given a, b, and c values from the first pool did not necessarily have an identical mate in the second group. Differences in the two item pools were also manifested by the differences between the correlations for p-values and information and for point biserials and information. For the second item pool the correlation between item information values and p-values increased to -.24 and the correlation between information values and point biserial values increased to .75. From this pool three random samples of 50 subjects by 240 items were drawn and the p-values and point biserial correlations for the items were calculated. The ranges of p-values and point biserial values determined from the first pool were then used to make the item selections for three tests representing this selection technique.

Data Generating Program

The test items used in this study were simulated through the use of a computer program, DATAGEN (Hambleton & Rovinelli, 1973). This FORTRAN computer program generates examinee response data from logistic models. The user controls the number of test items, the examinee sample size, the distribution of examinee abilities, and the distribution of item parameters. The output from the program includes examinee response
patterns, item parameters, examinee abilities, descriptive statistics on the item parameters, and item information for ability levels ranging from -3 to +3 advancing in increments of .5.

Program For Selecting Item and Subject Samples

The random selection of 50 subjects from the 1000 subject by item matrix pool was accomplished through a separate FORTRAN computer program. This program controls the random selection of items or subjects from a larger pool. With this program the user has the option of specifying a quantity of items or subjects to be randomly selected or of specifying specific items and subjects to be selected. The program automatically produces a record of item responses for the subset of items or subjects which is selected, and the user also controls the format of the output. The program also produces a summary report which lists the items or subjects which were selected. This program will also be used for generating the random selection of 50 test items for the "random selection" method. The program was written by Kermit Rose at the Florida State University Computing Center.

Exam Item Pool

DATAGEN was used to generate a 240 item pool. Each test item in the pool is described by the item parameters of the three parameter logistic model: item discrimination (a), item difficulty (b), and item pseudo-chance level (c). The average and range of values of the item statistics in the pool were chosen to correspond to values that are characteristic of mastery examinations for competency determination. The ranges of item parameter values were B, -2.00 to 2.00; A, .19 to 2.00; C, .00 to .20.

The ability scores were drawn from a slightly negatively skewed distribution with a mean of approximately .3 (raw score mean = 139) and standard deviation of approximately 1.0. A slightly skewed distribution was generated for the present study because it represents the general distributional shape that is encountered for most mastery type tests used for licensure and certification purposes. From these specifications latent trait item and ability parameters for 1000 examinees were generated. Because item scores and total test scores were produced for the responses it was possible to compute conventional item statistics. Such statistics would include p-values, point biserial correlations, and phi coefficients between each item score and the pass-fail score on the total test.
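The data generation and conventional item statistics described above can be illustrated with a minimal Python sketch. This is not DATAGEN. The normal ability distribution (the study used a slightly negatively skewed one), the uniform draws of a, b, and c within the reported ranges, the scaling constant D = 1.7, and every function name are assumptions made only for illustration.

```python
import math
import random

D = 1.7  # logistic scaling constant (an assumption; not stated in the text)

def p_correct(theta, a, b, c):
    """Three parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def simulate(n_examinees, items, rng):
    """Return abilities and a 0/1 response matrix (examinees x items)."""
    # Assumption: normal abilities with mean .3 and SD 1.0 (no skew modelled).
    thetas = [rng.gauss(0.3, 1.0) for _ in range(n_examinees)]
    data = [[1 if rng.random() < p_correct(t, a, b, c) else 0
             for (a, b, c) in items] for t in thetas]
    return thetas, data

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    syy = sum((v - my) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def item_statistics(data, cut_score):
    """p-value, point biserial, and phi coefficient for every item."""
    totals = [sum(row) for row in data]
    passed = [1 if t >= cut_score else 0 for t in totals]
    stats = []
    for j in range(len(data[0])):
        col = [row[j] for row in data]
        stats.append({
            "p": sum(col) / len(col),                # proportion correct
            "point_biserial": pearson(col, totals),  # item vs. total score
            "phi": pearson(col, passed),             # item vs. pass/fail
        })
    return stats

rng = random.Random(1988)
# Assumption: parameters drawn uniformly within the ranges reported above.
items = [(rng.uniform(0.19, 2.0), rng.uniform(-2.0, 2.0), rng.uniform(0.0, 0.2))
         for _ in range(240)]
thetas, pool = simulate(1000, items, rng)
sample = rng.sample(pool, 50)                  # one small sample of 50 examinees
stats = item_statistics(sample, cut_score=144)
```

The point biserial and the phi coefficient are both computed here as Pearson correlations, which is what they reduce to when one or both variables are dichotomous.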
Test Length And Subject Sample Size

Ideally, mastery tests, such as those common to minimum competency testing programs, should be long enough to produce reliable scores yet not long enough to fatigue the examinees. Hambleton and Arrasmith (1987) state that "a common characteristic of credentialing exams is their unusual length. Exams with 200 to 500 items are regularly found in practice." These authors point out that excessive lengths are defended by exam developers on the grounds that since competency exams are rarely pilot-tested, extra items are needed so the bad items that are found following exam administrations can be eliminated from exam scoring without fear of shortening the exam to the point where the psychometric properties of exam scores will not be acceptable. These authors suggest that exams of smaller lengths would be an improvement. The authors are not clear as to why smaller is better, but it is believed that the authors were referring to fatigue. Cronbach (1984), in justifying the need for smaller exams, states that test fatigue may affect such things as the effort level of examinees.

Previous studies (Hambleton & Arrasmith, 1987; Hambleton & de Gruijter, 1983; Hambleton & Cook, 1979) attempted to demonstrate the effectiveness of IRT based optimal item selection strategies over other traditional item selection strategies using test sizes between eight and thirty items. Intuition suggests that tests of such short length would not provide adequate content representation for most mastery testing programs and therefore would not be realistic. Indeed, from a fatigue standpoint there is simply no reason to want to limit a test to only twenty or thirty items. Tests of this size could be completed in an hour or less even allowing two minutes per item. Therefore, given the need to reduce tests to sizes that allow reasonable content representation and/or adequate measurement accuracy, the use of fifty item tests would appear more congruent with the goals of applied mastery testing programs. Therefore, for the purposes of this study tests consisting of fifty items were used.

A subject sample size of fifty has been selected for this study because it is believed to reasonably represent the sample size encountered in many competency exams involving low examinee incidence. This belief is based on observations of the present author through seven years of applied professional test development experience.
Test Item Selection

The following delineates the steps that were followed in generating the fifty item tests for each of the item selection strategies.

1. A 240 item test pool was generated using the DATAGEN FORTRAN program. The subject sample size was 1000.

2. From the 240 by 1000 item subject matrix of item scores a random sample of 50 subjects was selected. The random selection of subjects was accomplished through a FORTRAN computer program. This program writes the response patterns of the subjects selected to a separate data file. For this step all 240 items were maintained and only the subjects were randomly sampled. Note, the data generated in this step will simulate an applied testing situation where the test developer must pilot or administer the total item bank on a small sample of subjects in order to establish item parameters.

3. From the 50 subjects by 240 items matrix item parameters were generated. A version of BICAL, a one parameter IRT Rasch model program, modified to accommodate 240 test items was used to generate the item difficulty values. Another FORTRAN computer program was used to generate the p-values, point biserial correlations, and the phi coefficients for all items from the 50 subjects by 240 items pool.

4. The procedures for selecting items for each of the item selection strategies are as follows:
(a) Random - a random sample of fifty items was selected.
(b) IRT - items were selected which have the highest information at the cutoff ability level of .5, in order to maximize the information of the 50 items selected.
(c) Modified classical - the range of p-values and point biserial correlations which were used to select the 50 items was determined by the findings of the first phase of the study.
(d) Criterion referenced - phi coefficients between each item score and the pass/fail score on the total test were used to select the 50 optimal items. The data were reported as number correct scores. In order to determine candidate pass/fail status, the number correct score corresponding to an ability of .5 was determined through the use of the test characteristic curve for the total 240 items by 1000 subjects matrix. This value was determined to be 144 items correct or 60 percent.

Note that the SEE values which are derived based upon the IIFs produced by DATAGEN are not estimates but known population values. These known SEE values are computed for the items selected by each of the item selection techniques. Therefore, any error involved in estimating the SEEs was due to error in item selection. In other words, the only variable that changed in each computation of the SEE was the items selected. There is not any error interjected through the estimation of the SEE because the information for each item selected is known. It is believed that, after summing the information produced by each selection technique, a small number of tests should produce reasonably stable point estimates of the SEE variation.

BICAL (Mead, Wright, & Bell, 1979), a one parameter IRT computer program based on the Rasch model (Rasch, 1960, 1966), was used to calculate the difficulty values used in selecting items for the IRT strategy. The one parameter model has been shown (Lord, 1983) to be superior to more general models when small sample sizes are involved. de Gruijter (1986) points out that Lord's findings are correct except when guessing is a problem. The nature of the types of tests this study is simulating (e.g., negatively skewed) do not typically exhibit extensive guessing. For this reason the c-value (pseudo-guessing index) was allowed to vary between 0 and .20.
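The idea of summing known item information for the items a technique selects, and converting that sum to an SEE, can be sketched as follows. The sketch assumes the standard three parameter logistic item information formula with scaling constant D = 1.7 and the usual relation SEE = 1 / sqrt(information). The function names and the four item pool are hypothetical and are not taken from DATAGEN or BICAL.

```python
import math

D = 1.7  # logistic scaling constant (an assumption; not stated in the text)

def item_information(theta, a, b, c):
    """Item information at theta under the three parameter logistic model."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def select_by_information(items, theta_cut, k):
    """Indices of the k items with the highest information at the cut-off."""
    ranked = sorted(range(len(items)),
                    key=lambda j: item_information(theta_cut, *items[j]),
                    reverse=True)
    return ranked[:k]

def total_information(items, chosen, theta):
    """Test information: the sum of item information over the chosen items."""
    return sum(item_information(theta, *items[j]) for j in chosen)

def see(information):
    """Standard error of the ability estimate implied by test information."""
    return 1.0 / math.sqrt(information)

# Hypothetical (a, b, c) triples, for illustration only.
demo_items = [(1.0, 0.5, 0.0), (1.0, 2.5, 0.0), (2.0, 0.5, 0.0), (0.5, 0.5, 0.2)]
chosen = select_by_information(demo_items, theta_cut=0.5, k=2)
info = total_information(demo_items, chosen, 0.5)
print(chosen, round(info, 4), round(see(info), 3))
```

Because the item parameters are known in a simulation, the only thing that changes the resulting SEE is which items a technique picks, which is the point made in the text above.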
Dependent Measures

The primary focus of this study was to compare how four item selection models (modified classical, criterion referenced, IRT, and random selection) perform under conditions involving small sample size (N = 50). Four dependent measures were gathered.

The first dependent measure was the test information functions, which were computed from item information values generated by DATAGEN. The item information functions (IIFs) and the standard error of estimate (SEEs), which are measurement error concepts based on item response theory mathematics, were used for two of the dependent measures in this study. As previously stated, the item information values were calculated based on the total subject population of 1000 examinees and assuming the three parameter model fits. Information estimates generated in this manner will, for practical purposes, be considered population values and used as the standards for comparing the various item selection methods. It is believed that the use of IRT statistics is appropriate for making comparisons in the present study because IRT offers certain statistics well suited to the task of
making comparisons of measurement accuracy. For example, the calculation of an estimate of the standard error at a cut-off, which can be accomplished by an IRT approach, was of great assistance in determining the effectiveness of the item selections made through each selection method. Additionally, by generating 1000 examinee responses such that the assumptions of a three parameter IRT model are known to fit, a high degree of accuracy is assured for the particular statistics (e.g., SEE) used to make comparisons of item selection procedures based on only 50 examinees.

The second dependent measure was the mean standard error of ability estimate (SEE) generated for the tests composed by each item selection strategy. The mean of the three SEEs derived from the three replications was used in order to observe differences that might occur due to sampling alone. The SEEs were calculated at the hypothetical cut-off ability level of .5.

The third dependent measure was the number of misclassification errors made by each of the test generation strategies for the 100 subjects closest to the cut-off score. In other words, the 100 subjects used had raw scores which all fell into a range of 19 points which encompassed the cut-off score; their scores were plus or minus nine points on either side of the cut-off point of 144 for the entire domain of 240 items. A subgroup of the subjects closest to the cutoff score were used for
evaluating misclassifications for two reasons. (a) The majority of misclassifications occur within this group. (b) It was believed that it would be easier to evaluate the relative fluctuations in misclassifications associated with the item selection procedures if a smaller population subgroup were used. To accomplish this, a passing or failing score (pass or fail) was determined for each simulated examinee in the population (N=1000) who took the 240 item test and each examinee in each random sample (N=50) who took a 50 item test. The cut-off score for determining passing or failing was set at the integer percentage correct score nearest the domain percentage correct score at ability level .5. An approximation of the number correct score corresponding to a theta of .5 was calculated using the test characteristic curve for the 240 item by 1000 subject matrix.

The fourth dependent measure was the average absolute deviation from the domain score estimate. This represents a measure of the accuracy of the domain score estimate, or the amount of error that exists in the particular test scores. The 100 subjects closest to the cut-off were once again used to evaluate this measure, since these are the scores where the majority of misclassifications occurred.

Another purpose of this study was to investigate ways to maximize test information at a cut-off score through the use of conventional item statistics under conditions
in which large subject samples exist, in order to determine the one which should be used when IRT methods are not feasible because of reasons other than small sample sizes. The same dependent measures mentioned above (i.e., SEE, test information, percent of misclassification, and AAD) were used to evaluate the performance of conventional item selection procedures under ideal conditions (N=1000). In addition to providing general information about the performance of the item statistics with regard to all the item selection procedures (i.e., large examinee samples), the results of this investigation were helpful in evaluating the relative performance of the item selection procedures under small sample conditions. It should be noted that multiple runs for each procedure, used earlier to determine the relative stability of each selection procedure, were not necessary because the total population was used.

A review of the previous studies (Hambleton & Cook, 1979; Hambleton & de Gruijter, 1983; Hambleton & Arrasmith, 1986; Shannon & Cliver, 1987) investigating item selection procedures revealed that the authors did not utilize any parametric or nonparametric tests of statistical significance. However, the reason for not using any statistical tests was not given. It is believed that parametric statistical tests of significance
were not used because certain assumptions relating to such tests would have been violated. For example, the validity of inferences from a t-test or analysis of variance procedure would have been questionable given that the distributions of the dependent variables (e.g., test information) were not normal. Although the nonparametric procedures might have been applicable, the fact that population values were known eliminated the need for making inferences to a population.

It is also noted that the authors did not establish threshold values for assessing the practical importance of the results. This is probably wise since the decision of how much measurement precision is needed will vary a great deal from one applied situation to the next. It is therefore important that benchmarks or standards be established so that the relative performance of the various item selection procedures can be judged by researchers and test developers.

The design of the present study provided for the computation of information values which served as the upper and lower values of measurement precision that could realistically be expected given the examinee population and their item responses. The upper value was established by the construction of the "best" 50 item test possible through using information values for items derived through the three parameter IRT model utilizing the total 240 x 1000
item by subject pool. The lower value was set by the 50 item tests constructed through random item selection. In addition to upper and lower values of measurement precision some interim values will also be established which will serve as reference points for judging the precision of small sample test administrations. For example, the mean SEE for the 50 item modified classical test can be compared to the upper and lower values but it may also be compared to the SEE produced by the best 50 item test produced by the modified classical procedure utilizing the total population of 1000 subjects.
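The third and fourth dependent measures described in this chapter can be illustrated with a short Python sketch. It assumes that both the 50 item test score and the 240 item domain score are expressed as proportions correct, and that a score at or above the cut-off is a pass. These conventions, the function names, and the four example scores are assumptions for illustration and are not taken from the study's programs.

```python
def classify(score, cut_score):
    """Pass (True) when the score is at or above the cut-off (an assumption)."""
    return score >= cut_score

def misclassification_rate(test_pct, domain_pct, test_cut, domain_cut):
    """Share of examinees whose test decision differs from the domain decision."""
    errors = sum(1 for t, d in zip(test_pct, domain_pct)
                 if classify(t, test_cut) != classify(d, domain_cut))
    return errors / len(test_pct)

def average_absolute_deviation(test_pct, domain_pct):
    """Mean absolute difference between test and domain proportion correct."""
    return sum(abs(t - d) for t, d in zip(test_pct, domain_pct)) / len(test_pct)

# Hypothetical scores for four examinees near a 60 percent cut-off.
domain = [0.58, 0.61, 0.63, 0.57]   # proportion correct on the full domain
short = [0.62, 0.60, 0.56, 0.54]    # proportion correct on a 50 item test
rate = misclassification_rate(short, domain, test_cut=0.60, domain_cut=0.60)
aad = average_absolute_deviation(short, domain)
```

In this example two of the four examinees receive a different decision on the short test than on the full domain, so the misclassification rate is .5.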
Chapter IV
Results
Test Information and Standard Errors of Estimates

Table 1 presents the information values for each of the three 50 item tests developed by each item selection procedure. The data showed that all of the item selection procedures, with the exception of the random procedure, produced information distributions which were peaked in shape. However, each selection procedure varied with regard to the amount of information which was focused around the cut-off, or at ability level of .5. The range of information values at the cut-off showed that there was little variability in the information values generated by each procedure for the three tests constructed. The differences between the maximum information values and the minimum information values generated were 1.7, 1.9, 4.8 and 1.3 for the phi, modified classical, Rasch model and random procedures respectively. The form of the information function for the random procedure was basically flat relative to the other item selection procedures. However, the random procedure did provide more information at the extremes of the ability distributions than any other procedure.
Table 1
Information Values at Nine Ability Levels For Three 50 Item Tests Developed by Each Selection Procedure

                                               Ability Levels
Selection
Procedure  Test #   -3.0  -2.5  -2.0  -1.5  -1.0   -.5   0.0    .5   1.0   1.5   2.0   2.5   3.0
Phi          1        .0    .2    .9   3.3   9.9  18.9  29.7  40.6  43.2  27.3  11.6   4.4   1.7
Phi          2        .0    .1    .5   2.3   8.2  17.3  29.7  41.0  41.2  26.3  11.7   4.5   1.7
Phi          3        .1    .2   1.1   3.9  10.5  19.8  32.0  42.3  41.9  26.8  11.6   3.9   1.3
Md.Cls.      1        .2    .4    .7   1.6   4.2  11.8  26.2  41.7  46.4   8.1  11.8   1.3   1.6
Md.Cls.      2        .1    .2    .4   1.0   3.1   9.8  24.7  41.3  46.7  31.7  14.4   5.4   1.9
Md.Cls.      3        .1    .1    .4   1.4   4.5  12.5  27.5  42.6  47.1  30.5  12.7   4.6   1.7
Rasch        1        .3    .4    .6   1.1   2.0   4.7  11.2  23.1  28.6  19.7  10.4   5.4   2.9
Rasch        2        .2    .3    .6   1.1   2.1   5.4  14.9  27.1  28.3  18.0   9.2   4.7   2.6
Rasch        3        .3    .4    .7   1.1   2.2   5.8  16.2  27.9  27.1  17.1   9.0   4.8   2.6
Random       1       1.3   3.3   6.0  12.4  15.3  14.3  13.0  15.1  18.9  20.2  10.2   4.4   1.9
Random       2       1.5   4.2  11.1  15.6  13.3  11.1  11.3  15.0  20.9  18.2  11.4   5.5   2.2
Random       3        .7   2.0   7.4  11.4  12.3  11.6  11.5  16.3  21.9  18.0  14.3   6.2   2.4

Note. Md.Cls. is the abbreviation for modified classical procedure.
Table 2 provides the average information values for the three replications and also provides other information values which may be used as reference points. At the top are the test information values that are accrued when every item in the bank is used. These values represent the maximum information possible given the present examinee population and the total item bank of 240 items. Next, the information values for the best 50 item test, composed through the use of a three parameter IRT model, are presented. These values represent the highest information values that could be achieved for a 50 item test generated from the present item bank. Presented next are the average information values, at various ability levels, for the three tests for each of the three selection procedures. Item selection was accomplished by utilizing item parameters (traditional and IRT) which were derived from 50 randomly selected examinees. Finally, this table displays information values for each selection procedure under conditions where the total 1000 subjects were used to calculate item parameters/statistics.
Table 2
Comparison of Average Information Values For Three Tests Generated By Each Selection Procedure Utilizing Small Samples (N = 50) To Information Values For a Single Test Generated By Each Selection Procedure Utilizing Large Samples (N = 1000)
AbiLity tevel,s
Setect i on
!!
-3.0
-2.0 -1.5
.5
0.0
.5
1.0
1.5
2.0
62.7 74.5
89.3 t!6.1
61.4
2.5
3.0
Procedure
Total Bank nla
5 . 4 1 5 . 5 38.0 57.7 & . 6
63.3
30.4 13.2
3 Para.
1000
.0
.1
.3
1.1
4 . 0 1 3 . 0 30.2 47.3
51.5 31.5
12.1
Phi
50
.0
.2
.8
3.2
9.5 18.7
30.4 41.3
42.1 26.8
1 1. 6
12.8 4.74
Phi
1000
.0
.1
.7
2.5
7.5 17.4
32.3 16.0
47.8 28.9
11.1
3.5
1.1
Itld.Cts.
50
.1
.2
.5
1.3
3 . 9 1 1 . 3 26.1 41.9
46-7 26.8
13.0
4.8
1.74
lild.Cts.
1000
.0
.1
.5
1.5
4.4 12.3
27.7 42.2
41.9 24.8
10.3
3.8
1.5
Rasch
50
.3
.4
.6
1.1
2.1
5.3
14.1 26.0
28.0 18.3
9.5
5.0
Z.7a
Rasch
1000
.3
.4
.7
1.2
2.4
6.1
15.5 25.4
23.1 15.1
8.1
4.2
2.4
nla
1.2
3.2
8 . 2 1 3 . 0 13.6 12.3
1'f.9 15.4
?0.6 18.8
11.9
Random
3.9
5.35 6.54
Note. Md.Cls. is the abbreviation for modified classical; 3 Para. is the abbreviation for the three parameter model.
a Numbers in the row represent the average information values for three tests.
1.3
Review of the information
2
shows that
a total
varues at the top of table
information
t
accrued at the cut-off
rn contrast
the three
information
varue at the cut-off
fifty
was able to capture
tot,al
information
p e rce n t
(i .e .
selection
of
47.3 when the best
Thus, the three
approximatery
63.4 percent
utilizing
5 o /2 4o) of the item s.
model with
parameter model
of the
onry about 2t
r n tur n,
com par ison
the traditional
item
procedures
indicates
produced
information
values
at the cut,-off
were a maximum of 6 information
points
from the maximum
procedures
information
This
that
includes
traditional
the
50 were used.
that
information
a 3 parameter
words,
or modified
and capture
p e r ce n t
(i .e .
rnodel.
that
test.
composed through
classicar
that
of the
information
g7
Further,
be captured
with
only
5
needed to pr oduc e
The one parameter
points
the one parameter
53.6 percent
one
item selection
2L.g information
Therefore
approximately
iten
the data indicates
could
estirnates.
(Rasch) was a maximum of
captured
tests
5 0 /L 0 00) of the subjects
parameter
a fifty
nrod.el would produce.
of the infonnation
3 parameter
for
over 97 percent
percent
I
the traditional
in which examinee samples of onry
rn other
could use the phi
procedures
that
could be achieved. for
procedures
t
I
model produced an
at the cut-off,
of t'he 3 parameter
of 74.5 was
.5 when a1r 24o items were used..
parameter
items were used.
the three
t
of
value
of the total
moder
berow the
model only
information
possible
comparisons
representing
obtained
off.
large
differences
at, the various
for
values
between proced.ures
examinee sampre conditions,
The difference
cut-off
a 50 itern test.
of information
large
relatively
for
in information
ability
1evels,
show some
values
including
between the information
the phi and the rnodified
values
classical
l-5.8.
The differences
procedure
between the modified
in the information
the Rasch model (IRT) procedure,
(traditional)
as the difference
utodified
crassical
for
phi
procedures
as large
also
was four
between the phi
procedures.
comparison
similar
to five
of the
sample conditions
small
absolute
5 . 3,
.3 a n d -6 fo r
rnoder procedures
other
and the
differences
were
procedures
values
for
srnarr absolute
rn general,
the tests
showed
at the cut-off
s/ere arso found at ability
ability.
and
varues.
r nodified classical,
respectively.
functions
between rarge
in the information
the phi,
values
varues
the same procedure
for
inforrnation
than the cut-off
informatj.on
information
differences
in test
in information
times
sarnple conditions.
small
differences
between
and modified
found between Rasch model and the other
the small
2O.7
and the Rasch model proced,ure was
Thus, the differences
crassical
at the
procedures
crassicar
a n d th e p h i a n d R a sch r nodel pr ocedur es wer e 4.L,
respectively.
the cut-
The
were
and Ras c h
differences
levers
the test
composed through
the
use of smarl sample sizes
generated
by the
procedure.
rt
large
should
were very crose to the functions
sampre size
be noted that
model computer program using
produced
run.
small
an average inforrnation
was .6 points
rt
above the value
appears that
fluctuations
tests
this
the runs of the Rasch
samples actually
value
at the cut-off
produced by the
is a resurt
in the results
d.erived
small
fact
that
Table
23-L whereas the other
large
of simple
sampre
chance
This belief
i. shows that
sarnple runs produced
which
from the Rasch moder
runs on the smalr examinee samples.
on the
any given
for
is based
one of the Rasch rnodel
an information
value
of onry
two Rasch model runs produced
values of 27.L and 22.9.
The distribution of information values for the random selection procedure displays lower information values near the cut-off relative to the distribution for the total bank. Note that the information values for the total 240 items in the bank peak at a value of 1.0 and decrease in either direction from that point. Random selection of 50 items should produce information values which are smaller than the total bank values but which show a similar distributional shape to that of the total bank. This is expected due to the lower test information at the extremes of the distribution for items with difficulty at the ends of the continuum. This phenomenon can clearly be seen by examining the information values for the total bank and the random selection tests.
Indeed, that is exactly what occurs when the information values for the 3 randomly generated tests are averaged. The peak of the distribution is at the ability level of 1.0, around the cut-off of .5, and information decreases with movement toward either end of the distribution. Thus, the distribution of information created by the random procedure was simply a reflection of the underlying distribution of item information.

The standard errors of the estimates (SEE), at the cut-off, which were produced by tests representing the various item selection procedures, are displayed in Table 3.

Table 3
Standard Errors of the Estimates at the Cut-off for Tests Composed by Six Different Item Selection Procedures

Selection Procedure    SEE
Total Bank             .116
3 Parameter            .145
Phi                    .155
Md.Cls.                .154
Rasch                  .196
Random                 .254
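The standard errors in Table 3 can be related back to test information through the usual IRT identity SEE = 1 / sqrt(information). The Python sketch below assumes that identity applies to the values reported here. The function names are illustrative and are not taken from the study.

```python
import math


def information_from_see(see: float) -> float:
    """Test information implied by a standard error of estimate (I = 1 / SEE^2)."""
    if see <= 0:
        raise ValueError("SEE must be positive")
    return 1.0 / (see * see)


def see_from_information(information: float) -> float:
    """Standard error of estimate implied by a test information value."""
    if information <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(information)


def percent_increase(see: float, baseline_see: float) -> float:
    """Percentage increase of one SEE relative to a baseline SEE."""
    return 100.0 * (see - baseline_see) / baseline_see
```

Under this assumption, the phi entry of .155 compared with the three parameter entry of .145 corresponds to an increase of roughly 7 percent.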
These SEE values represent the expected standard deviation of ability scores if all 1000 examinees had the same ability and were given a test produced by each procedure. Analysis of this information shows that the traditional item selection procedures produced a 7 percent increase in the SEEs relative to the three parameter model. The BICAL procedure produced a 37 percent increase in the SEE relative to the three parameter model. The random procedure produced a 43 percent increase in the SEE.

Misclassification Rates

Because of the small sample sizes involved in the present study, ability scores could not be calculated. Since domain scores are typically represented on a scale of 0 to 1.00 (i.e. percentage of items correct), raw scores were transformed to percentage correct scores. In fact, the only ability measurement available to test developers who must deal with small samples is some form of raw score transformation other than nonlinear IRT transformations (e.g. percentage correct).

Table 4 presents the misclassification rates for each of the three tests developed under small examinee sample conditions and for each item selection procedure. Note that the misclassification errors listed for each item selection procedure were derived from the 100 examinees
closest to the cutoff score. That is, the examinees most difficult to classify. Because of this fact the ratio of misclassifications to correct classifications appears to be very high (e.g. 50 percent in some cases). The reader should keep in mind that the ratio of misclassifications to correct classifications for all 1000 examinees will be much smaller than the ratio for the 100 subjects in the subgroup used.

From the data in table 4 it can be seen that there was not a great deal of variability among the total misclassification rates for a given procedure. The difference between the largest and smallest misclassification rates for each procedure was 5, 7, 3 and 7 for the phi, modified classical, Rasch model and random procedures respectively. From this table it can also be seen that all the optimal item selection strategies showed a similar pattern of misclassifications. That is, virtually all of the misclassification errors were of the false fail type. The opposite was found for the random item selection procedure. That is, most of the misclassifications were of the false pass type.
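The false pass and false fail counts discussed above can be tallied as in the following Python sketch. It assumes that domain scores and observed test scores are on the same percentage scale and that an examinee passes when the score is at or above the cut-off. Both the function name and the pass rule are illustrative assumptions rather than details taken from the study.

```python
def tally_misclassifications(domain_scores, test_scores, cutoff):
    """Count false passes and false fails for paired domain and observed scores."""
    counts = {"false_fail": 0, "false_pass": 0}
    for domain, observed in zip(domain_scores, test_scores):
        truly_passes = domain >= cutoff        # status on the domain (true) score
        observed_pass = observed >= cutoff     # status on the test score
        if observed_pass and not truly_passes:
            counts["false_pass"] += 1
        elif truly_passes and not observed_pass:
            counts["false_fail"] += 1
    counts["total"] = counts["false_fail"] + counts["false_pass"]
    return counts
```

A test built from items that are harder than the domain average pushes observed scores down, so its errors would be expected to concentrate in the false fail count.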
Table 4
Misclassification Rates For Each of Three Tests Developed

                        Misclassifications
Procedure   Test #   Total   False Fail   False Pass
Phi           1       38        37            1
              2       37        36            1
              3       33        30            3
Md.Cls.       1       43        43            0
              2       50        50            0
              3       48        48            0
Rasch         1       50        50            0
              2       47        46            1
              3       46        46            1
Random        1       37         2           35
              2       39         6           33
              3       32         6           26

Note. Each selection procedure is based on item parameters calculated from a sample size of 50 examinees. The 100 simulated subjects being classified by each test all have true scores at or within plus or minus 9 points of the cut-off score.
A summary of the data in table 4 showing the compilation of the average misclassification rates for the three tests composed by each optimal item selection procedure under small sample conditions is presented in table 5. Additionally, this table provides the misclassification rates for each test developed through optimal item selection procedures under large sample conditions. Finally, the table provides the average misclassification rate for the three tests developed through random item selection.

Review of the misclassification rates for the item selection procedures used under large sample conditions shows the phi procedure had the smallest total misclassification rate, followed by the modified classical, the three parameter model, and BICAL. The difference in total misclassification errors between the phi and the modified classical procedure was three. The difference between the phi procedure and the Rasch model procedure was 10. It should be noted that each misclassification represents 1 examinee who was misclassified. Therefore it can be stated that for the 100 subjects closest to the cut-off score the phi procedure had 3 percent fewer misclassifications than the modified classical procedure and 10 percent fewer misclassifications than the Rasch model procedure.
Table 5
Comparison of Average Misclassification Rates For Three Small Sample Tests Generated By Each Selection Procedure To Rates For Single Large Sample Tests

                      Misclassifications
Procedure    n      Total   False Fail   False Pass
3 Param.     1000   42      41           1
Phi          1000   36      35           1
Phi          50     36      34.3         1.6
Md.Cls.      1000   39      38           1
Md.Cls.      50     47      47           0
Rasch        1000   49      47           2
Rasch        50     47.6    47.3         .3
Random       n/a    36      4.6          31.3

Note. 3 Param. is the abbreviation for 3 parameter model.
Note. The misclassifications were based on 100 examinees drawn from the total population of 1000 subjects. The 100 examinees drawn had domain (percentage) scores close to the domain cut-off score of 60 percent. Of the 100 subjects classified 51 passed and 49 failed.
The data in this table also shows that the item selection procedures with the lowest total misclassification rates are the phi procedure and the random procedure. These two procedures produced approximately the same total number (n = 36) of misclassifications. The procedure that was second in terms of the fewest number of misclassifications was the modified classical procedure using the total 1000 subjects. The three parameter (n = 1000), modified classical (n = 50), Rasch model (n = 50), Rasch model (n = 1000), were third, fourth, fifth and sixth respectively.

From the data in table 5 it appears as though the misclassification rate for the phi procedure is relatively unaffected by small sample sizes. That is, the average misclassification rate (n = 36) for the small sample size was the same rate (n = 36) as that for the large. The same can not be said for the modified classical procedure which produced a 25 percent difference between the large sample condition and the average of the small sample conditions.

In summary, the data in tables 4 and 5 shows that there is generally little variability in misclassification rates for optimal item selection procedures involving small examinee samples. Further, misclassification rates for small examinee sample procedures are comparable to the misclassification rates for large examinee samples. From
the data in these tables it can also be seen that optimal item selection procedures tended to produce primarily false fails while the random item selection procedure produced primarily false passes. Finally, the reader is cautioned that the magnitude of the misclassification errors for the total 1000 subjects is much less than that for the subset of 100 subjects used for compiling the tables.

In order to investigate the reasons for this latter finding the average p-value for the items selected by each item selection procedure was calculated. Table 6 presents the average p-value for the item bank and the average p-values for each of the tests produced by each of the item selection procedures using large examinee samples. Note that the values shown represent the values obtained when the whole population is used for calculating item parameters. Therefore, the p-values are stable and would not change if the study were replicated using the same data. Inspection of the data reveals that in general the optimal item selection strategies tended to select items with lower p-values relative to the average p-value for the item bank.
Table 6
Average P Values For Items Selected by Each Procedure

Procedure      Average P Value
Total Bank     .571
3 Parameter    .472
Phi            .494
Md.Cls.        .491
Rasch          .481
Random         .601a
Random 1       .605
Random 2       .587
Random 3       .611

Note. a represents the mean p-value for the three tests generated through random item selection.
It is believed that the reason the modified classical procedure tended to select items with low p-values for the present data set is based in the relationship that item variance has with the point biserial correlation.
Specifically, the upper limit of the point biserial correlation for an item is set by the p-value of the item. The point biserial correlation for an item can reach a maximum only when the p-value for an item is .5. Therefore, there will be a greater incidence of high point biserial correlations for items with p-values close to .5 than any other item p-values. This does not mean that an item with a p-value of .5 will automatically have a larger point biserial than an item with a higher or lower p-value. In this regard the value of the correlation coefficient, for an item with a given p-value, is affected by the degree to which the dichotomous variable and the continuous variable systematically vary. If the score (i.e. 1 or 0) of the item with a p-value of .5 does not systematically vary with examinee test scores then a low correlation will be produced.
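The dependence of the point biserial on the item p-value can be illustrated with a short Python sketch. The first function computes the coefficient from item and total scores. The second gives the largest value the coefficient can take for a given p-value under the added assumption that total scores are normally distributed, which is not stated in the study. Function names are illustrative.

```python
import math
from statistics import NormalDist


def point_biserial(item_scores, total_scores):
    """Point biserial correlation between a 0/1 item score and a total test score."""
    n = len(item_scores)
    p = sum(item_scores) / n                      # item p-value
    if not 0.0 < p < 1.0:
        raise ValueError("item needs both correct and incorrect responses")
    mean_all = sum(total_scores) / n
    sd_all = math.sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    if sd_all == 0:
        raise ValueError("total scores must vary")
    passers = [t for i, t in zip(item_scores, total_scores) if i == 1]
    mean_pass = sum(passers) / len(passers)
    return (mean_pass - mean_all) / sd_all * math.sqrt(p / (1.0 - p))


def max_point_biserial(p):
    """Upper limit of the point biserial for p-value p, assuming normal total scores."""
    z = NormalDist().inv_cdf(p)                   # threshold that cuts off proportion p
    return NormalDist().pdf(z) / math.sqrt(p * (1.0 - p))
```

Under the normality assumption the limit is largest at a p-value of .5 and shrinks as the p-value moves toward 0 or 1, which matches the pattern described in the text.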
For the modified classical procedure the items with the highest point biserial correlations, with p-values between the range of .3 to .7, were chosen for inclusion in the tests. Since it is known that items with the highest biserial correlations tend to be items with p-values centered around .5, the modified classical procedure tended to select items around the p-value of .50. This resulted in a test with an average p-value of .49 which was considerably lower than the mean p-value for
the item bank which was .579. The result was a systematic bias toward falsely failing candidates. It should be pointed out that if a different cut-off score had been used, for example, 80/240 items (i.e., 33.3% correct), then a range of p-values other than .3 to .7 would have been designated. In this case items between .55 to .99 would have produced the highest test information. Had the subset of items falling into this range of p-values been used for inclusion in a test, the result would have been a test with a tendency to falsely pass candidates rather than to falsely fail them as in the present case.

Lindquist (1961) provides a discussion about the relationship of p-values and biserial correlations to mastery tests which helps one to understand why items with high p-values tend to have higher information at low abilities and items with low p-values tend to have higher information at high abilities. Lindquist states that if we want to discriminate between examinees capable of passing an item at the 30 percent difficulty level and those not capable of doing so, we have to employ an item of 30 percent difficulty level. If there are 100 examinees in the sample, this item will make only 30 x 70, or 2,100 discriminations, but they will be useful discriminations. This concept is very familiar to IRT applications. For example, in the one parameter
model one selects items at the difficulty level corresponding to the ability level of interest in order to make useful discriminations at the cut-off.

The reason the phi coefficient tended to select items with lower p-values than the average is based in the nature of the computation of the correlation coefficient. The phi coefficient, like the point biserial correlation, is a particular case of the product moment correlation coefficient. Like the point biserial coefficient the phi coefficient has a range of -1 to +1. These limits can only be obtained when the proportions of the two variables (item score and test score) which are used for the correlation are the same. That is, when the percentage of subjects passing and failing the item and the test are the same (e.g. the percentage of subjects passing an item is .7 and the percentage of subjects passing the test is .7). Ferguson (1983) states that when these variables are asymmetrical, that is, the proportion passing the item does not equal the proportion passing the test, then one of the scale limits (i.e. -1, +1) may be reached but not both. In general it can be said that the proportion of candidates passing the test will affect the level of correlation for an item of a given p-value. Since the cut-off score affects the proportion passing the test, the cut-off score will cause the phi coefficient of a given item
to vary. For example, if the cut-off score is varied such that the proportion passing increases from .3 to .6 then the phi coefficient for items at the level of .6 will generally also increase. The nature of the present data set was such that with a cut-off of 144 items, the highest phi coefficients tended to be associated with items which had p-values below the mean p-value. As was the case with the modified classical procedure, had a cut-off of 80/240 items been used then there would have been a tendency to select items with an average p-value which would have been systematically higher than the average p-value.
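The way the phi coefficient depends on the item and test pass proportions can be sketched in Python as follows. The first function computes phi from paired pass/fail indicators using the standard 2 x 2 formula. The second gives the largest positive phi attainable for two given pass proportions, which equals 1 only when the proportions match. Function names are illustrative and not taken from the study.

```python
import math


def phi_coefficient(item_pass, test_pass):
    """Phi between item pass/fail and test pass/fail indicators (truthy = pass)."""
    a = b = c = d = 0
    for item, test in zip(item_pass, test_pass):
        if item and test:
            a += 1          # passed item, passed test
        elif item:
            b += 1          # passed item, failed test
        elif test:
            c += 1          # failed item, passed test
        else:
            d += 1          # failed item, failed test
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("each variable needs both passes and fails")
    return (a * d - b * c) / denom


def max_phi(p_item, p_test):
    """Largest positive phi attainable for the given pass proportions."""
    lo, hi = sorted((p_item, p_test))
    return math.sqrt(lo * (1.0 - hi) / (hi * (1.0 - lo)))
```

Because the cut-off score fixes the proportion passing the test, moving the cut-off changes which item p-values can reach the highest phi values.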
Like the optimal item selection procedures which utilize product-moment correlations, the IRT procedures also selected items with low p-values. The reason why this occurred is complex but is primarily due to the fact that item variability affects test information in a positive way. The greater the item variability the greater the chance that the discriminations made by an item can be maximized. Also, when the discrimination of an item increases, the slope of the item characteristic curve also increases. As the slope of the curve increases so does the information for the range of theta levels located near the curve inflexion point (i.e. b-value). The items with the greatest discrimination in a classical sense were those near a p-value of .5. Given
the rather strong correlational relationship (.25) between item discrimination values and item information values it is not surprising to find that the items that produced the highest information values were items with p-values of around .5. The items selected by the three parameter model produced an average p-value of approximately .47. The Rasch model procedure also produced an average p-value of .48. As is the case with the other optimal item selection procedures, had the cut-off score been set at 80 items correct, which is an ability level of -1.0, then the average p-values for items selected by both Rasch model and the three parameter IRT procedures would have been higher than the average p-value in the bank. The result would have been a test that produced primarily false pass misclassifications.

Inspection of the data in table 6 shows that the random procedure produced tests with average p-values above the mean of the item pool. Therefore the tests generated through the random procedure typically produced observed scores for examinees which were higher relative to their percentage correct domain scores. In contrast to the optimal item selection procedures, the random procedure produced a correspondingly large proportion of false pass misclassifications. The average p-values produced by the optimal procedures were much lower than the average p-value of the bank, and these procedures produced primarily false fails.
To gain insight into the process by which mean p-values are produced by the random item selection procedure it is helpful to examine various descriptive statistics for the population distribution and the theoretical sampling distribution of means which would be produced from the finite population of 240 items. The mean and standard deviation of the population distribution of p-values from which the samples of 50 items were drawn was calculated to be .578 and .250 respectively. The range of the p-value distribution was .967 (i.e., from .103 to .970) and the shape was approximately uniform. From the central limit theorem it is known that the shape of the theoretical sampling distribution will approach a normal distribution as the number of samples increases asymptotically. It is also known that the mean of the sampling distribution will approach the mean of the population as the number of samples increases asymptotically. The standard deviation of the theoretical sampling distribution of means for the present data was calculated to be .0315. The range of the sampling distribution was found to be .677 (i.e., from .236 to .913).
Given this information it can be seen from table 6 that the means of the three samples ranged from approximately .3 standard deviations to 1 standard deviations above the mean of the theoretical sampling
distribution. Given that the number of possible combinations of 240 item p-values, taken 50 at a time, would be very large (i.e., 240!/(50! x 190!)) and could vary from .236 to .913, it is not surprising that the mean of a small sample of means was .601 and not .579.
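The reported standard deviation of the sampling distribution of means can be checked with the textbook standard error formula for sampling without replacement from a finite population. The Python sketch below assumes the .250 figure is the population standard deviation of the 240 p-values. Function names are illustrative.

```python
import math


def se_of_mean_finite(pop_sd: float, sample_size: int, pop_size: int) -> float:
    """Standard error of a sample mean drawn without replacement from a finite population."""
    correction = math.sqrt((pop_size - sample_size) / (pop_size - 1))
    return pop_sd / math.sqrt(sample_size) * correction


def standardized_mean(sample_mean: float, pop_mean: float, se: float) -> float:
    """Distance of a sample mean from the population mean in standard error units."""
    return (sample_mean - pop_mean) / se


se = se_of_mean_finite(0.250, 50, 240)      # values reported in the text
n_possible_tests = math.comb(240, 50)       # 240! / (50! x 190!)
```

With the reported values the standard error rounds to .0315, which agrees with the figure given above, and the number of distinct 50 item tests exceeds 10 to the 50th power.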
Accuracy of Domain Score Estimates

For an additional perspective on the measurement accuracy of the percentage correct domain score estimates produced by each item selection procedure, the average absolute deviation (AAD) of the domain score estimates from the domain score was calculated for each procedure. The 100 subjects with scores closest to the cut-off were used for calculation of the AAD. This was the same group of subjects which were used to calculate the misclassification rates for each item selection procedure. The same groups of subjects were used to allow for evaluation of the relative amounts of error that occurred in the scores, which in turn, resulted in the misclassifications found in tables 4 and 5. Table 8 presents the means and standard deviations of the absolute deviations between domain scores and estimated domain scores.
Table 8
Means and Standard Deviations of Absolute Deviations Between Domain Scores (True Scores) and Estimated Domain Scores

Procedure   Test #   Mean    S. D.
Phi           1       8.8    4.8
              2       8.8    5.0
              3       6.0    4.0
Md.Cls.       1      11.7    5.7
              2      15.5    5.5
              3      12.6    5.6
Rasch         1      15.9    5.5
              2      13.1    6.2
              3      12.1    5.2
Random        1       5.1    3.6
              2       3.8    2.7
              3       4.3    3.8

Note. The means and standard deviations are represented on a percentage score scale of 0 to 100 points.
Review of the table reveals that in terms of absolute deviations the random procedure consistently produced the lowest accuracy errors. In general, there was little variation between the replications for any of the procedures. The differences between the largest and smallest values obtained for each procedure were 2.8, 3.8, 3.8, and 1.8 for the phi, modified classical, Rasch model and random procedures respectively. The random and phi procedures consistently produced mean and standard deviation values lower than the Rasch model or modified classical procedure. Indeed, the random procedure produced values lower than any of the other item selection procedures. The fact that the random procedure did not produce average absolute deviations of zero was due to the nature of the samples drawn. That is, the average p-values of the 50 items selected were slightly higher than the mean of the item bank, which resulted in percentage correct scores which were generally higher than the domain scores.

A summary of the accuracy data shown in table 8 is presented in table 9 along with pertinent comparison data for procedures using large examinee samples.
Table 9
Means and Standard Deviations of Absolute Deviations Between Domain Scores (True Scores) and Estimated Domain Scores For Various Item Selection Procedures

Procedure   n      Mean    S. D.
3 Para.     1000   12.4    5.1
Phi         1000    8.5    4.7
Phi         50      7.8    4.6a
Md.Cls.     1000    9.4    4.7
Md.Cls.     50     13.2    5.6a
Rasch       1000   10.7    5.3
Rasch       50     13.7    5.6a
Random      n/a     4.4    3.3a

Note. Means and standard deviations are represented on a percentage score scale.
a Means and standard deviations represent averages of the three values displayed in table 8.
From this table it can be seen that clearly the random item selection procedure produced the lowest mean absolute deviations of estimated domain scores from the domain scores in addition to the smallest standard deviations. This includes the three parameter model and the phi procedure utilizing the total 1000 subjects. Of the optimal item selection procedures presented the phi procedure produced the most accurate estimates of the domain score.
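The average absolute deviation (AAD) summarized in tables 8 and 9 can be computed as in the Python sketch below. It assumes paired domain and estimated scores on the same percentage scale. The study does not say whether a population or sample standard deviation was used, so the population form (`pstdev`) here is an assumption, as is the function name.

```python
from statistics import mean, pstdev


def absolute_deviation_summary(domain_scores, estimated_scores):
    """Mean and standard deviation of |estimate - domain score| over examinees."""
    deviations = [abs(est - dom) for dom, est in zip(domain_scores, estimated_scores)]
    return mean(deviations), pstdev(deviations)


# Averaging the three phi means listed in table 8 gives about 7.87,
# close to the small sample phi entry of 7.8 in table 9.
phi_small_sample_mean = mean([8.8, 8.8, 6.0])
```

An AAD near zero would require a test whose average item difficulty matches that of the domain, which is the condition random selection approximates.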
Chapter V
Discussion

Overview

The literature on optimal item selection for mastery testing is silent with regard to which procedures are most effective with small examinee samples. However, there have been numerous studies in optimal item selection in which large samples were utilized. The studies reviewed (e.g. Haladyna & Roid, 1983; Hambleton & de Gruijter, 1983) all recommend the use of IRT procedures over traditional procedures (e.g. classical), citing the advantage of having a measure of item information which is on the same scale as the cut-off score. The value of this advantage has been called into question by recent findings (Shannon and Cliver, 1987) which show certain criterion referenced item discrimination indices (e.g. the phi coefficient) are very effective at identifying items which have high information according to the calculations derived from LOGIST (i.e. three parameter IRT program).

The purpose of the present study was to fill a void in the literature regarding the efficacy of optimal item selection procedures where small samples exist. Further,
this study investigated the effect the use of any optimal item selection procedure will have on the domain percentage correct score produced.

In brief the results of this study show that certain forms of traditional item selection procedures, specifically the use of a phi coefficient or modified classical procedure, are effective at focusing measurement precision. However, the findings also show that only the random item selection procedure will produce unbiased domain percentage correct score estimates and that the use of any optimal item selection procedure with a small or large examinee sample size will bias the domain percentage correct estimates. The next three sections will interpret and evaluate the implications of these major findings.

Measurement Precision For Small Sample Conditions

Of the four item selection strategies evaluated under conditions involving small examinee samples, the modified classical and the phi procedures produced the highest levels of test information and the lowest standard errors of the estimates at the cut-off point. Relative to these two procedures the Rasch model and random procedures demonstrated considerably less measurement precision at the cut-off. The Rasch model procedure produced consistently lower measurement precision across the entire spectrum of abilities which were measured. However, the random procedure produced higher
measurement precision at abilities beyond approximately one point above and below the cut-off.

The present study clearly supports the findings of Shannon and Cliver (1987) that the phi coefficients are comparable to item information functions as measures of discrimination power at a passing score. Further, the results of the present study support a hypothesis that the phi and the modified classical procedures are capable of producing tests with high test information at a cut-off score. These results do not support the general conclusions of Hambleton and de Gruijter (1983) and de Gruijter and Hambleton (1983), that classical item statistics are ineffective at focusing test information in criterion referenced tests. While it is true that classical item statistics still have the disadvantage of using a difficulty estimate (i.e., p-value) which is not defined on the domain score scale, when used together these two statistics, the p-value and the point biserial correlation, can, under modified circumstances, effectively identify items that will provide high information at a cut-off. The phi coefficient, which is considered a criterion referenced item discrimination index, overcomes the problem of item difficulty and person ability scale incompatibility. The phi coefficient avoids this problem by correlating examinees' pass or fail status on each
item with their status on the test. The phi coefficient is affected by the cut-off score value. When the cut-off score is varied, for a given data matrix, the phi coefficient will also vary. This is not true for the classical statistics, which have no direct relationship to the cut-off score or the domain score scale. Thus, the phi coefficient offers an advantage over classical statistics in that the user can identify items with high information relative to the cut-off point at a unique cut-off score through the calculation of a single item statistic.

It is believed that the failure of the Rasch model procedure to identify items with high information can be attributed to two factors. (a) The data did not fit the assumptions of the one parameter model, such as, uniform discrimination of items and no guessing. (b) The information of the items was calculated based on the three parameter model and not on the one parameter model. The small examinee sample size (n = 50) used did not appear to cause serious errors with regard to estimating the item parameters (b-values) because there was little difference between the information values generated by the large and small sample conditions. This finding is also supported by the fact that when the item numbers (i.e. item identification number) for the items selected in the small sample conditions were compared to the item numbers for items selected in the large sample conditions, it was
found that, on the average, 70 percent of the items selected were common to both the large and small sample conditions. Therefore, the small samples did not seem to radically change the subset of items selected under large sample conditions. For this reason it can not be concluded that examinee sample sizes as small as 50 prohibit the Rasch model procedure from identifying high information items in situations where the data fits the assumptions of the one parameter model, a Rasch score scale is used, and information is defined in terms of the one parameter model.
Measurement Precision For Large Sample Conditions

In general the findings of this study relating to small examinee sample conditions were also valid for large examinee sample conditions. The exception to this finding would be in the case of the phi coefficient. In this case the large sample estimate (n = 1000) produced 7 percent larger information values than any of the three small samples. This would mean that the user would gain from 0 to 7 percent in information as samples increased from 50 to 1000. The large sample test information estimate for the phi procedure was larger than the modified classical procedure by seven percent. Since these large sample test information values represent the population values, this finding would be expected to generalize to similar data
sets. Therefore, the phi procedure would also be preferred over the modified classical procedure for examinee samples of 1000 or more.

Accuracy

The random item selection procedure produced the highest level of accuracy with regard to the average absolute deviation of the domain score estimate. The optimal item selection procedures produced information values which were two to three times as high as the random procedure. The gains in test information acquired through optimal item selection are minor compared to losses in accuracy of domain score estimates, because the accompanying bias in difficulty of items selected more than offsets any increase in test information. The classification accuracy for the optimal item selection procedures showed losses ranging from 0 to 18 percent.

The primary obstacle to using traditional item selection procedures for simulating the two or three parameter IRT item selection procedure is that there is no way to report scores in terms of the latent ability scale. This is a serious problem because the advantages of increasing item information, such as increasing classification accuracy, can only be realized if scores
are reported on an IRT ability scale. By simulating the IRT item selection procedure using a non IRT approach (e.g. phi), the advantages which are gained through an IRT model are lost when scores are reported on a percentage correct scale, although other advantages, such as ease of computation, are derived. Indeed, the results of the present study show that seeking to increase test information through the use of any item selection procedure, other than random item selection, only serves to decrease the accuracy of the domain percentage correct score, which results in higher misclassification. The results of this study show that the use of a traditional item selection procedure to simulate an IRT procedure is not advised. Test developers who consider the use of such procedures for domain-referenced tests must understand this.

It should be noted that these findings do not in any way dispute the general findings of previous studies that increasing test information at a cut-off will increase the classification accuracy of ability scores.

As previously mentioned, there is a difference between estimates of IRT ability scores and estimates of domain percentage correct scores associated with criterion referenced testing. Popham and Husek (1969) describe two different types of criterion referenced tests. The first type is deterministic in nature and
highly
t
l- scale.
I
I
I
I
I
I
I
I
I
I
I
I
I
I
L00
unidirnensional,
Iike
measures of intelligence.
use of an IRT derived
domain ability
for
because the score represents
this
type of test
dominant underlying
ability
The second type
score is well
stratified
random sample of
items that
represent
the
0 to
a random or
is
items
suited
on a familiar
represented
of test
The
from a large
some performance
group of
criterion.
It
is
the second type which is most common in certification
licensure
testing.
For this
selection
procedure
is particularly
will
provide
correct
type of test
an unbiased estimate
the random item
suited
scenario
of the domain percentage
in trying
estimate
of a domain percentage
a domain of
the danger
illustrates
involved
to use a domain ability
items derived
correct
from
score as an
score.
several
are shown to be sufficiently
unidimensional
of IRT application.
assume that
areas produced
Further
items with
From an IRT perspective
mastery/non-mastery
composed of
ability
level
subject
purposes
for
these
areas,
subject
determination
on the cut-off
the
ideal
test
one
ability
for
would be one which was
items which were drawn from the cut-off
level.
were all
developed
Assume
narrow ranges of b-values,
of which happened to be centered
level.
because it
score.
The following
that
well
and
If
items representing
from one subject
the cut-off
ability
area then the test
from the item pool would be composed of the
items from only
one subject
of the domain ability
of estimating
procedure
designated
content
area.
the resulting
resulting
from a test
optimal)
a content
optimal
conducted.
fashion
be
It
is believed
correct
which would be implied
by
approach until
further
study would contraindicate
this
to content.
items with
in situations
domain percentage correct
non-
p-values
that
the same
where items with
from clusters
The result
all
than the average p-value
would suggest
are selected
is
which utilized
tended to select
Intuition
research
study found that
procedures,
phenomenon would occur
estimate
of how to
can the scores
in this
scores?
which were on the average lower
according
from
interpretation
indices,
information
items
the
specifically,
in the domain.
information
that
random item selection
statistical
would draw a
of both domain percentage
of the present
a dual
that
constructed
scores and domain ability
attempting
score.
an item sampling
scores arises.
as estimates
in terms
Hambleton and
once again the question
interpret
estimate
biased
ability
possibility,
this
number of the highest
each subject
results
correct
(1987) recommend using
(i.e.,
interpreted
The resulting
score might be very
the percentage
To account for
Arrasmith
area.
of
high
items grouped
would be estimates
of
scores which would tend to under
the examinees' domain percentage
Thus one would have to exercise great
correct
caution
with
score.
regard
to making dual
interpretations.
Maximizing The percentage
The results
value,
for
sampling
of this
tests
correct
study
Domain score Accuracy
show that
developed through
procedure,
deviated
the random item
from the mean p-value
item bank the result
was a systematic
Further,
indicate
the results
that
deviations
from the mean p-value
systematic
bias
deviation
mean p-values
passes.
false
that
would seem that
content
is
effort
adequately
present
which were
range of
referenced
insure
test,
that
the
as the distribution
The alternative
it
of
could
in which domain percentage
be
correct
are produced which are radically
data set the lowest
through
would have resulted
For example, in the
mean p-value
that
value being generated
could
random sampling was .236,
in large numbers of false
misclassifications.
stratified
the second
in the typical
be made to
from the domain scores.
been generated
for
Given the large
as well
sampled.
an unhappy situation
different
in a large
For example a positive
a criterion
should
of p-values
score estimates
result
would be inherent
for
the
small
in misclassifications
sampling distribution
distribution
will
for
in the scores.
.01 from the mean p-value
random sample resulted
almost entirely
bias
relatively
in classification.
of only
when the mean p-
Although the probability
is very
small,
item sampling is not used.
it
have
which
fail
of this
is possible
if
Chapter VI
Conclusions and Suggestions for Future Research

The results of this study show that the use of either a modified classical procedure or a phi coefficient item selection procedure can be quite effective at selecting high information items when only 50 examinees are used. That is, these procedures can identify items that, for a given cut-off score, would be identified by a 3 parameter IRT model as having high item information. Further, these traditional (i.e. phi and classical) procedures can identify high information items for an examinee sample that is one twentieth the size of the minimal sample size (i.e. 1,000) recommended for use with LOGIST, which is the most widely used computer program for the three parameter IRT model.
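As an illustration of what "high information at a given cut-off score" means under the three parameter model, the following is a minimal sketch using Birnbaum's item information function. The scaling constant D = 1.7, the item parameter values, and the function names are assumptions introduced for this example; they are not taken from the study or from LOGIST.

```python
import math

D = 1.7  # conventional logistic scaling constant (assumed here, not stated in the text)

def p_correct(theta, a, b, c):
    """Three parameter logistic probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, a, b, c):
    """Birnbaum's item information for the three parameter model at ability theta."""
    p = p_correct(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def select_items(items, theta_cut, n_items):
    """Return the ids of the n_items items with the most information at theta_cut."""
    ranked = sorted(
        items,
        key=lambda it: item_information(theta_cut, it["a"], it["b"], it["c"]),
        reverse=True,
    )
    return [it["id"] for it in ranked[:n_items]]

if __name__ == "__main__":
    # Hypothetical item parameters, for illustration only.
    pool = [
        {"id": 1, "a": 1.0, "b": 0.0, "c": 0.2},
        {"id": 2, "a": 1.0, "b": 2.0, "c": 0.2},
        {"id": 3, "a": 0.5, "b": 0.0, "c": 0.2},
    ]
    print(select_items(pool, theta_cut=0.0, n_items=2))
```

Items whose difficulty lies near the cut-off and whose discrimination is high rank first, which is the sense in which the traditional procedures are said to agree with the IRT selection.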
However, the primary reason for increasing test information at a cut-off score for a mastery test is to increase the classification accuracy. As test information increases, the classification accuracy of test scores also increases, but only for scores reported on an ability scale such as the IRT theta scale. Since ability scores can not be generated from small examinee sample sizes, such as those concerned in this study, there is no way to
effectively
utilize
focused through
procedures.
the test
traditional
can not be effectively
measurement information
For mastery
situations
for
tests
the test
developer
estimates
of IRT ability
is,
at present,
tests
Thus, the test
offer
should
optimal
some hope for
correct
traditional
classification
with
ability
will
trait,
procedures
the obstacle
only be useful
scores from small
if
the
in small
IRT item parameter estimation.
procedures
estimating
score.
procedures.
underlying
item selection
dealing
score.
be used to produce
who wish to evaluate
in terms of some latent
samples pose for
for
correct
of examinees, who are being tested
traditional
which is to
only a random or stratified
item selection
developers
there
score estimates.
produce the greatest
of any of the
For test
groups
However,
small sample conditions
procedure
will
or
are desired.
of the domain percentage
procedures
abilities
scores,
of the domain percentage
random item selection
accuracy
correct
whether
assembler has only one choice,
under these circumstances
These
examinee sample
no way to produce ability
use an estimate
the estimate
test.
large
scores,
involving
focus
must decide
of domain percentage
mastery
item selection
used to
involving
might be
item selection
mastery
estimates
for
that
the use of traditional
Therefore,
procedures
information
the
studied
that
small
However,
some method.
samples can be
found.
In this
regard
perform
might
when data has been edited
parameter
model
terms
log abilities.
of
there
and
is
the
model
items
conditions,
selected
versus
gathered
from
findings
provide
error
accuracy
tests
in
for
that
suggest
values
Rasch
further
item
sample
parameters
conditions.
research
These
the use
into
sample conditions.
research
is needed to explore
(e.g.
the measurement precision
classification
estimated
small
from
developed using
small
further
one
composed using
derived
impetus
of the estimate)
ways
standard
the
and at the same time maximizing
accuracy and the domain score estimate
in terms of domain percentage correct
may be possible
ability
use of a random or stratified
provides
data
examinee sample
of the Rasch model for
of maximizing
The present
for
tests
large
Finally,
are
the
fit
between the information
parameters
item
to
scores
and candidate
difference
little
procedure
is unknown how the BICAL
it
to achieve
a reasonable
a selection
strategy
scores.
that
compromise between accuracy
score estimates
and domain percentage
It
of
correct
score estimates.
In conclusion, the results of this study suggest that in circumstances where small examinee samples exist and a domain percentage correct score estimate is desired, the random item selection procedure should be used. This procedure will produce the lowest misclassification rate and highest level of accuracy, between the domain score estimates and the domain scores, relative to any item selection procedure currently available.
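The claim that a random item sample yields an unbiased estimate of the domain percentage correct score can be illustrated with a short simulation. This is a sketch under assumed values, not the study's simulation program: the 240-item domain size comes from the study, while the examinee's response vector, the 40-item test length, the number of replications, and the function names are hypothetical.

```python
import random

def domain_score(responses):
    """Proportion correct over every item in the domain."""
    return sum(responses) / len(responses)

def random_test_score(responses, n_items, rng):
    """Proportion correct on a test of n_items drawn at random without replacement."""
    sampled = rng.sample(responses, n_items)
    return sum(sampled) / n_items

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical examinee who would answer 168 of the 240 domain items correctly.
    responses = [1] * 168 + [0] * 72
    estimates = [random_test_score(responses, 40, rng) for _ in range(2000)]
    print("domain score:", domain_score(responses))
    print("mean of random-test estimates:", sum(estimates) / len(estimates))
```

Any single random test can still miss the domain score by chance, but the estimates centre on it. A test built only from items favoured by a statistical index has no such guarantee, which is the bias the text describes.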
References
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord and M. R. Novick (Eds.), Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley.
Carver, R. P. (1970). Special problems in measuring change with psychometric devices. In Evaluative Research: Strategies and Methods. Pittsburgh: American Institutes for Research, pages 48-53.
Cook, L. L. & Hambleton, R. K. (1979a). A comparative study of item selection methods utilizing latent trait theoretic models and concepts. (Report Number 88). Amherst, MA: University of Massachusetts, Amherst.
Cook, L. L. & Hambleton, R. K. (1979). Application of latent trait models to the development of norm-referenced and criterion-referenced test. (Report Number 72). Amherst, MA: University of Massachusetts.
Cureton, E. E. (1959). Note on phi/phi max. Psychometrika, 24, 89-91.
Cronbach, L. J. (1984). Essentials of Psychological Testing (4th ed.). (p.p. 55), New York: Harper and Row.
Davis, F. B. (1961). Item selection techniques. In E. F. Lindquist (Ed.), Educational Measurement (4th ed.). (p.p. 309-311). Washington, D. C.: American Council on Education.
de Gruijter, D. N. M. (1986). Small N does not always justify Rasch model. Applied Psychological Measurement, 2, 187-194.
Ferguson, G. A. (1981). Statistical Analysis in Psychology and Education. New York: McGraw-Hill.
Haladyna, T. M. & Roid, G. H. (1983). A comparison of two approaches to criterion-referenced test construction. Journal of Educational Measurement, 20, 271-292.
Hambleton, R. K., Arrasmith, D. & Smith, L. (1987). Optimal item selection with credentialing examinations. (Report Number 157). Amherst, MA: University of Massachusetts.
Hambleton, R. K. & De Gruijter, D. N. M. (1983). Using item response models to criterion-referenced test item selection. Journal of Educational Measurement, 24, 355-370.
Hambleton, R. K. & Cook, L. L. (1982). The robustness of latent trait models and effects of test length and sample size on the precision of ability estimates. In D. Weiss (Ed.), New horizons in testing. New York: Academic Press.
Hambleton, R. K. & Swaminathan, H. (1985). Item response theory: Principles and applications. Hingham, MA: Nijhoff.
Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B. (1978). Criterion-referenced testing and measurement: A review of technical issues and developments. Review of Educational Research, 48, 1-46.
Hambleton, R. K., Swaminathan, H., Cook, L. L., Eignor, D. R., & Gifford, J. A. (1978). Developments in latent trait theory: models, technical issues, application. Review of Educational Research, 48, 467-510.
Hambleton, R. K. & Novick, M. R. (1973). Toward an integration of theory and method for criterion-referenced test. Journal of Educational Measurement, 10, 159-170.
Hambleton, R. K. & Rovinelli, R. (1973). A FORTRAN IV program for generating examinee response data from logistic test models (Computer program). Behavioral Science, 18, 74-75.
Henrysson, S. (1971). Gathering, analyzing and using data on test items. In R. L. Thorndike (Ed.), Educational Measurement (2nd ed.). (p.p. 130-141). Washington, D. C.: American Council on Education.
Hills, J. R. (1981). Measurement and evaluation in the classroom (2nd ed.). Columbus: Charles E. Merrill Publishing Company.
Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13, 253-264.
Lord, F. M. (1982). Small N justifies Rasch methods. In D. Weiss (Ed.), New horizons in testing. New York: Academic Press.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, New Jersey: Lawrence Erlbaum Assoc.
Lord, F. M. (1977). Practical applications of item characteristic curve theory. Journal of Educational Measurement, 14, 117-139.
Lord, F. M., & Novick, M. R. (1969). Statistical theories of mental test scores. Reading, Mass.: Addison-Wesley.
Marco, G. (1977). Item characteristic curve solutions to three intractable testing problems. Journal of Educational Measurement, 14, 139-160.
Nitko, A. J. (1974). Problems in the development of criterion-referenced test: The IPI Pittsburgh experience. In C. W. Harris, M. C. Alkin, and W. J. Popham (Eds.), Problems in criterion-referenced measurement (CSE Monograph Series in Evaluation, No. 3). Los Angeles: Center for the Study of Evaluation, University of California, 59-82.
Nitko, A. J. (1980). Defining "criterion-referenced test". In R. A. Berk (Ed.), A guide to criterion-referenced test construction. Baltimore: Johns Hopkins University Press, page 12.
Popham, J. W. & Husek, T. R. (1969). Implications of criterion-referenced measurement. Journal of Educational Measurement, 6, 1-9.
Rasch, G. (1966). An item analysis which takes individual differences into account. British Journal of Mathematical and Statistical Psychology, 19, 49-57.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: The Danish Institute for Educational Research.
Richardson, M. W. (1936). The relation between the difficulty and the differential validity of a test. Psychometrika, 1, 33-49.
Samejima, F. (1977). A use of the information function in tailored testing. Applied Psychological Measurement, 1, 233-247.
Shannon, A. G. & Cliver, B. A. (1987). An application of item response theory in the comparison of four conventional item discrimination indices for criterion-referenced test. Journal of Educational Measurement, 24, 347-359.
Subkoviak, M. J. (1976). Estimating reliability from a single administration of a mastery test. Journal of Educational Measurement, 13, 265-276.
Warm, T. A. (1979). A primer of item response theory. U.S. Coast Guard Institute, Oklahoma City, OK. U.S. Department of Commerce, National Technical Information Service Technical Report 941279 AD-A063 072.
Wilcox, R. (1976). A note on the length and passing score of a mastery test. Journal of Educational Statistics, 1, 359-364.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.
Wright, B. D. & Stone, M. H. (1979). Best test design: Rasch measurement. Chicago: MESA Press.
Wright, B. D., Mead, R., & Bell, S. R. (1979). BICAL [Computer program]. Chicago: University of Chicago, Statistical Laboratory, Department of Education.
Appendix
INF.   P-VAL.  Pbis    ITEM NUM.
0.00   0.231   0.146   173
0.01   0.974   0.312   227
0.01   0.197   0.194   30
0.01   0.962   0.359   31
0.01   0.167   0.169   99
0.01   0.961   0.363   215
0.01   0.967   0.351   179
0.01   0.970   0.334   98
0.01   0.225   0.179   125
0.01   0.960   0.398   205
0.02   0.161   0.304   236
0.02   0.924   0.485   91
0.02   0.921   0.514   195
0.02   0.427   0.177   196
0.02   0.634   0.155   122
0.02   0.925   0.507   116
0.03   0.518   0.188   39
0.03   0.514   0.185   59
0.03   0.556   0.148   26
0.03   0.504   0.164   24
0.03   0.709   0.251   2
0.03   0.759   0.218   66
0.03   0.696   0.181   61
0.03   0.936   0.434   161
0.03   0.934   0.417   230
0.03   0.941   0.415   10
0.03   0.740   0.194   165
0.04   0.210   0.244   226
0.04   0.932   0.444   67
0.04   0.568   0.227   3
0.04   0.721   0.176   193
0.04   0.763   0.177   162
0.04   0.920   0.485   189
0.04   0.935   0.432   110
0.04   0.123   0.322   135
0.04   0.914   0.519   29
0.04   0.923   0.475   6
0.04   0.924   0.479   146
0.05   0.510   0.156   210
0.05   0.914   0.498   50
0.05   0.486   0.192   131
0.05   0.157   0.277   139
INF.   P-VAL.  Pbis    ITEM NUM.
0.05   0.902   0.509   168
0.05   0.792   0.221   87
0.05   0.939   0.370   63
0.05   0.908   0.502   9
0.05   0.911   0.539   149
0.05   0.382   0.209   181
0.05   0.485   0.220   182
0.06   0.125   0.353   126
0.06   0.927   0.329   77
0.06   0.911   0.467   237
0.06   0.821   0.239   132
0.06   0.800   0.278   37
0.06   0.660   0.283   55
0.06   0.930   0.340   170
0.06   0.578   0.214   40
0.06   0.912   0.392   25
0.06   0.357   0.301   83
0.07   0.156   0.365   84
0.07   0.365   0.267   82
0.07   0.291   0.299   163
0.07   0.855   0.288   60
0.07   0.913   0.405   176
0.07   0.905   0.327   90
0.07   0.895   0.307   165
0.07   0.438   0.179   127
0.07   0.529   0.305   18
0.07   0.706   0.205   143
0.07   0.905   0.360   47
0.07   0.480   0.208   142
0.08   0.250   0.329   74
0.08   0.891   0.347   72
0.08   0.895   0.362   222
0.09   0.188   0.334   51
0.09   0.405   0.270   159
0.09   0.826   0.325   217
0.09   0.891   0.444   128
0.09   0.312   0.274   14
0.09   0.454   0.245   45
0.10   0.280   0.294   113
0.10   0.504   0.252   187
0.10   0.294   0.317   114
0.10   0.459   0.296   145
0.10   0.335   0.264   178
0.10   0.881   0.535   120
0.10   0.443   0.229   64
INF.   P-VAL.  Pbis    ITEM NUM.
0.10   0.129   0.305   157
0.10   0.682   0.310   164
0.10   0.757   0.357   44
0.10   0.567   0.308   218
0.10   0.379   0.289   118
0.11   0.872   0.411   188
0.11   0.834   0.353   94
0.11   0.718   0.360   102
0.11   0.864   0.374   105
0.11   0.599   0.328   7
0.12   0.352   0.255   144
0.12   0.872   0.515   88
0.12   0.856   0.600   192
0.12   0.364   0.294   229
0.12   0.843   0.371   92
0.12   0.829   0.345   103
0.12   0.883   0.517   108
0.12   0.313   0.371   140
0.12   0.435   0.321   117
0.13   0.112   0.296   107
0.13   0.858   0.568   225
0.13   0.578   0.343   198
0.13   0.336   0.282   22
0.13   0.362   0.319   71
0.13   0.810   0.396   239
0.13   0.832   0.342   219
0.13   0.859   0.497   221
0.13   0.408   0.326   169
0.13   0.742   0.384   46
0.13   0.321   0.370   190
0.14   0.446   0.313   43
0.14   0.851   0.555   58
0.14   0.826   0.464   213
0.14   0.177   0.379   23
0.15   0.309   0.304   167
0.15   0.351   0.312   20
0.15   0.275   0.278   174
0.15   0.730   0.327   154
0.15   0.769   0.375   34
0.15   0.112   0.379   194
0.15   0.439   0.324   1
INF.   P-VAL.  Pbis    ITEM NUM.
0.15   0.242   0.319   115
0.15   0.842   0.607   199
0.16   0.735   0.360   95
0.16   0.867   0.525   202
0.17   0.550   0.383   112
0.17   0.789   0.415   203
0.17   0.823   0.447   121
0.18   0.270   0.442   231
0.18   0.751   0.447   35
0.18   0.833   0.579   49
0.18   0.355   0.378   141
0.18   0.429   0.335   101
0.18   0.832   0.463   209
0.19   0.354   0.284   27
0.19   0.539   0.317   48
0.19   0.596   0.393   68
0.20   0.676   0.418   42
0.20   0.228   0.480   184
0.20   0.731   0.464   206
0.20   0.115   0.387   13
0.21   0.836   0.507   73
0.21   0.799   0.464   185
0.22   0.346   0.337   238
0.22   0.711   0.484   53
0.23   0.373   0.352   212
0.23   0.820   0.622   233
0.23   0.819   0.627   80
0.25   0.574   0.399   180
0.25   0.463   0.345   234
0.25   0.494   0.415   11
0.25   0.307   0.380   81
0.25   0.813   0.597   32
0.26   0.624   0.387   79
0.27   0.803   0.582   155
0.27   0.796   0.643   191
0.27   0.355   0.318   183
0.27   0.824   0.574   148
0.29   0.274   0.476   15
0.29   0.777   0.585   220
0.30   0.424   0.357   155
INF.   P-VAL.  Pbis    ITEM NUM.
0.31   0.716   0.481   33
0.31   0.377   0.433   70
0.32   0.143   0.484   124
0.34   0.238   0.495   235
0.34   0.380   0.422   85
0.35   0.591   0.520   54
0.36   0.740   0.519   200
0.36   0.746   0.551   16
0.36   0.227   0.415   75
0.36   0.308   0.399   38
0.36   0.687   0.500   150
0.37   0.216   0.455   204
0.37   0.634   0.554   56
0.37   0.746   0.503   8
0.37   0.387   0.417   57
0.37   0.545   0.483   186
0.38   0.394   0.447   78
0.38   0.786   0.639   158
0.40   0.757   0.545   111
0.40   0.494   0.405   177
0.42   0.214   0.456   62
0.42   0.461   0.486   224
0.42   0.778   0.618   133
0.42   0.447   0.427   93
0.44   0.779   0.649   130
0.45   0.303   0.420   28
0.45   0.471   0.491   208
0.45   0.543   0.558   41
0.45   0.752   0.620   175
0.46   0.325   0.485   17
0.48   0.591   0.512   214
0.48   0.796   0.594   137
0.49   0.327   0.494   119
0.50   0.765   0.693   100
0.50   0.346   0.485   36
0.51   0.411   0.495   97
0.52   0.477   0.537   19
0.52   0.362   0.492   232
0.53   0.325   0.467   211
0.54   0.298   0.458   171
0.57   0.654   0.552   4
0.57   0.349   0.465   152
0.59   0.713   0.639   109
0.60   0.669   0.605   129
INF.   P-VAL.  Pbis    ITEM NUM.
0.60   0.329   0.485   86
0.61   0.269   0.503   106
0.61   0.490   0.497   12
0.62   0.437   0.481   21
0.63   0.735   0.662   197
0.64   0.731   0.681   228
0.69   0.689   0.582   65
0.72   0.758   0.586   5
0.72   0.520   0.471   151
0.75   0.695   0.692   216
0.75   0.650   0.604   138
0.76   0.501   0.541   89
0.78   0.369   0.548   69
0.85   0.684   0.528   201
0.89   0.678   0.607   240
0.91   0.654   0.538   147
1.02   0.595   0.646   96
1.22   0.614   0.697   76
1.31   0.398   0.587   153
1.37   0.500   0.645   160
1.37   0.536   0.535   135
1.46   0.633   0.580   52
1.57   0.438   0.503   104
1.60   0.464   0.660   172
1.62   0.381   0.622   134
1.74   0.526   0.599   207
1.97   0.507   0.678   223
2.14   0.516   0.654   123
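For readers who want to see how the P-VAL. and Pbis columns of a table like this are conventionally obtained, the following is a minimal sketch of the classical item statistics. It assumes dichotomous (0/1) item scores and computes the point biserial as the Pearson correlation between the item score and the total score; whether the study's totals included the item being analysed is not stated, so that detail, the function names, and the sample data are assumptions.

```python
import math

def p_value(item_scores):
    """Item difficulty: proportion of examinees answering the item correctly."""
    return sum(item_scores) / len(item_scores)

def point_biserial(item_scores, total_scores):
    """Point biserial correlation between a 0/1 item score and the total test score."""
    n = len(item_scores)
    mean_x = sum(item_scores) / n
    mean_y = sum(total_scores) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(item_scores, total_scores)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in item_scores) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in total_scores) / n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined when either score has no variance")
    return cov / (sd_x * sd_y)

if __name__ == "__main__":
    # Hypothetical scores for four examinees, for illustration only.
    item = [1, 1, 0, 0]
    totals = [4, 3, 2, 1]
    print(p_value(item), round(point_biserial(item, totals), 3))
```

With only 50 examinees, as in the small-sample conditions studied, both statistics carry considerable sampling error, so values computed this way should be read as estimates.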
Appendix 2
Test Information
p-value
Test
Information