
Psychometric Considerations for
the Next Generation of
Performance Assessment
Charlene G. Tucker, Ed.D.
Maine Department of Education
National Council on Measurement in Education
April 18, 2015
Chicago, IL
Study Group
• Rich Shavelson (Chair), Emeritus, Stanford
• Tim Davey, ETS
• Steve Ferrara, Pearson
• Noreen Webb, UCLA
• Laurie Wise, HumRRO
• Paul Holland, Emeritus, UC Berkeley
• Charlene Tucker, Project Manager
Acknowledgements
• Expert Presenters
– Brian Clauser, NBME
– Peter Foltz, Pearson
– Aurora Graf, ETS
– Yigal Rosen, Pearson
– Matthias von Davier, ETS
• Consortia Stakeholders
– Enis Dogan, PARCC
– Marty McCall, SBAC
Acknowledgements
• State Stakeholders
– Juan D’Brot (WV)
– Garron Gianopulos (NC)
– Pete Goldschmidt (NM)
– Jeffrey Hauger (NJ)
• Technical Reviewers
– Suzanne Lane (UPitt)
– Matthias von Davier (ETS)
Study Group Work
• Three 2-day meetings in San Francisco
• Chapters and primary authors emerged
• Lots of work and calls between meetings
• 100-page white paper intended for test
development and measurement experts
• Shorter, less technical policy brief intended for
broader audience of practitioners and policy
makers
Session Sequence
• Steve Ferrara, Chapter 2
Definition of Performance Assessment (15 min.)
• Tim Davey, Chapter 3
Comparability of Individual-Level Scores (15 min.)
• Noreen Webb, Chapter 4
Reliability and Comparability of Groupwork Scores (15 min.)
• Laurie Wise, Chapter 5
Modeling, Dimensionality, and Weighting (15 min.)
• Ronald K. Hambleton, Discussant (15 min.)
University of Massachusetts
• Audience Comments and Questions (10 min.)
Recommendations
#1: Measure what matters. If expectations
include the ability to apply complex
knowledge to solve problems or to deliver
complex performances, explore ways to
measure those things. What gets assessed
often commands greater attention.
#2: If psychometric advisors are concerned
about your emphasis on performance
assessment, listen to them. They are trying
to advise you responsibly.
Discussion
Ronald K. Hambleton
University of Massachusetts
Both papers may be downloaded at:
http://www.k12center.org/publications/all.html
II. Definition of Performance
Assessment
Steve Ferrara
Center for Next Generation Learning and Assessment
Pearson Research and Innovation Network
Presentation in C. Tucker (Chair), Psychometric Considerations for
the Next Generation of Performance Assessment, a coordinated
session conducted at the annual meeting of the National Council on
Measurement in Education, Chicago.
April 18, 2015
Overview
• Formal definition of performance assessment
• Five distinguishing characteristics
• Five approaches
• Illustration
• Two recommendations
• Other topics in the chapter that I will not address today
– Assessment activities not consistent with our definition
– Performance assessment as a guide to teaching, higher
expectations, and learning and achievement
– Role of groupwork
– Performance assessment features and characteristics for each
approach
– Some practical considerations
– Psychometric considerations
DEFINITION OF
PERFORMANCE ASSESSMENT
Definition in the chapter
An assessment activity or set of activities that requires test
takers, individually or in groups, to generate products or
performances in response to a complex task that provides
observable or inferable and scorable evidence of the test
taker’s knowledge, skills, and abilities (KSAs) in an academic
content domain, a professional discipline, or a job.
Typically, performance assessments emulate a context outside
of the assessment in which the KSAs ultimately will be applied;
require use of complex knowledge, skills, and/or reasoning;
and require application of evaluation criteria to determine
levels of quality, correctness, or completeness
(adapted from Lai, Ferrara, Nichols, & Reilly, 2014)
Definition parsed
• An assessment activity or set of activities that requires test
takers, individually or in groups,
• to generate products or performances in response to a
complex task
• that provides observable or inferable and scorable evidence
• of the test taker’s knowledge, skills, and abilities (KSAs) in
an academic content domain, a professional discipline, or a
job.
Typically, performance assessments emulate a context outside
of the assessment in which the KSAs ultimately will be
applied; require use of complex knowledge, skills, and/or
reasoning; and require application of evaluation criteria to
determine levels of quality, correctness, or completeness
Definition parsed
• Activity or set of activities
– Require examinees to generate products or performances
– In response to a complex task
• Provide
– Observable or inferable and scorable evidence
– About knowledge, skills, and abilities (KSAs) in an academic
content domain, a professional discipline, or a job.
• Typically:
– Emulate a context
– Require applied use of complex knowledge, skills, and/or
reasoning;
– Require application of evaluation criteria to determine levels of
quality, correctness, or completeness
Why care?
• Clarity in communication
• Broad thinking during assessment design
• Avoid all sizzle and no steak
FIVE DISTINGUISHING
CHARACTERISTICS
Distinguishing characteristics
• Tasks
• Responses
• Scoring
• Fidelity
• Connectedness
Characteristics (cont.)
• Tasks
– Directions about what problem to solve, product to create, or
performance to undertake
– Requirements for responding
– Response features that will be scored
• Responses
– Intended to reflect responses in the real world situation
represented by the assessment tasks
– Constructed, complex
– Vary in length or amount of behavior captured
– Declarative, procedural, and strategic knowledge
Characteristics (cont.)
• Scoring
– For accuracy, completeness, effectiveness, justifiability, etc.
– Rubrics, counts of features, other scoring rules
• Fidelity
– Assess KSAs in real-world contexts that are relevant to
experiences inside or outside of the learning environment or on
a job
– Advocates argue that fidelity supports learning
– Also known as authenticity
– Scenarios and assessment activities intended to engage interest
and effort
Characteristics (cont.)
• Connectedness
– Among the assessment activities and other requirements for responding
(e.g., stimuli)
– In contrast to the discreteness of conventional test design
– Intended to elicit broad thinking and higher-order thinking skills (HOTS;
e.g., reasoning, critical thinking) more explicitly than is achievable with
selected-response items
– Supports fidelity and enables focus on complex HOTS
APPROACHES TO
PERFORMANCE ASSESSMENT
Approaches
• Short constructed-response and technology-enhanced items (as opposed to
technology-enabled items)
• Essay prompts
• Performance tasks
• Portfolios
• Simulations
Performance task
Literary Analysis Task
− Six evidence-based selected response (EBSR) and/or technology-enhanced
constructed response (TECR) items, plus one prose constructed response (PCR) item
− Read first passage, respond to items
− Read second passage, respond to items
− Respond to a prose constructed response (PCR) prompt
From http://www.parcconline.org/samples/english-language-artsliteracy/grade-10-elaliteracy
Culminating activity: Respond to a prose
constructed response item
• Analyze how Anne Sexton transformed Daedalus and
Icarus
• Use what you have learned
• Develop your claims, using evidence from both texts
• Think-abouts (emphasized, absent, different)
Multi-trait rubric
Summary of the task features
• Six SR and CR items, one essay prompt
• Read first passage, respond to items
• Read second passage, respond to items
• Respond to a prose constructed response (PCR) prompt
• Aligned with claims in Written Expression, English Conventions, and Reading Literature
• Multi-trait rubric

Elements of the definition of performance assessment (illustrated by this task):
• Set of activities
• Product(s)
• Complex tasks
• Observable, scorable evidence of KSAs
• Context from instruction
• Complex analysis, reasoning, supported argumentation
• Scoring rubric
Summary of the task features
• Six SR and CR items, one essay prompt
• Read first passage, respond to items
• Read second passage, respond to items
• Respond to a prose constructed response (PCR) prompt
• Aligned with claims in Written Expression, English Conventions, and Reading Literature
• Multi-trait rubric

Five distinguishing characteristics of performance assessment (illustrated by this task):
• Task: Defined for each item
• Responses: Requirements for each item are defined
• Scoring: Rubrics align with response requirements
• Fidelity: Literary analysis consistent with instruction; analysis and argumentation are crucial in the “real world”
• Connectedness: Six items require close reading, which is prep for the culminating activity
Recommendations
#4 Use complex performance assessment
when the construct of interest is important
and cannot be assessed using multiple-choice items.
#7 Consider ways to enhance performance
assessment by combining it with other
types of test items.
Reference and additional resources
Lai, E. R., Ferrara, S., Nichols, P., & Reilly, A. (2014). The once and future
legacy of performance assessment. Manuscript submitted for publication.
Additional details on approaches to performance assessment and examples of
performance assessments are available from:
• ETS (see https://www.ets.org/about/who/leaders)
• Pearson’s Research and Innovation Network (see
http://paframework.csprojecthub.com/?page=home)
• Stanford Center for Opportunity Policy in Education (SCOPE; see
https://edpolicy.stanford.edu/category/topics/32/0?type=scope_publications)
• University of Oregon (see
http://pages.uoregon.edu/kscalise/taxonomy/taxonomy.html)
• WestEd (see http://www.wested.org/resources/preparing-for-the-common-coreusing-performance-assessments-tasks-for-professional-development/)
Thanks!
Steve Ferrara
[email protected]
+1 612-581-6453
Comparability of Test Scores
Tim Davey
ETS
Score Comparability & Inference
• Mary performed at the 87th percentile of all 4th graders this year.
• John’s math proficiency grew at an average rate during 6th grade.
• 38% of New Jersey 8th graders were judged as “Proficient” relative to state standards.
Requirements for Comparability
• Substantive
- Tests being compared measure the same construct(s).
• Statistical
Test forms being compared:
- Are equally reliable.
- Produce equivalent score distributions.
- Correlate equally with outside criteria.
The nature of performance assessments
poses challenges to both requirements
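
As a rough, non-authoritative illustration of these statistical checks, the sketch below compares two hypothetical forms on reliability (coefficient alpha), total-score distributions (a two-sample Kolmogorov–Smirnov test), and correlation with an outside criterion. The item matrices and criterion scores are assumed placeholders, not data from the chapter.

```python
# Sketch only: rough checks of the statistical comparability requirements
# for two hypothetical test forms (examinees x items score matrices).
import numpy as np
from scipy import stats

def cronbach_alpha(items):
    """Coefficient alpha for an examinees x items score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_var_sum = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def compare_forms(items_x, items_y, criterion_x, criterion_y):
    total_x, total_y = items_x.sum(axis=1), items_y.sum(axis=1)
    return {
        # 1. Are the forms equally reliable?
        "alpha_x": cronbach_alpha(items_x),
        "alpha_y": cronbach_alpha(items_y),
        # 2. Do they produce equivalent score distributions?
        "ks_p_value": stats.ks_2samp(total_x, total_y).pvalue,
        # 3. Do they correlate equally with an outside criterion?
        "r_x_criterion": stats.pearsonr(total_x, criterion_x)[0],
        "r_y_criterion": stats.pearsonr(total_y, criterion_y)[0],
    }
```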
Substantive Equivalence
• Substantive equivalence of test forms starts by building each to the same
specifications, which define the characteristics of a sample from a universe
of items/tasks.
Domain Sampling

Selected response: Forms composed of many, discrete items.
Performance assessment: Forms composed of few, large, context-heavy tasks.

Selected response: Test specifications can be detailed and specific.
Performance assessment: Broader specifications befitting task size.

Selected response: Small sampling unit can promote broad, representative sampling of the item universe.
Performance assessment: Large sampling unit makes representative sampling more difficult.

Selected response: Unanticipated / uncontrolled item influences on performance have a chance to cancel out.
Performance assessment: Unanticipated / uncontrolled task influences can produce idiosyncratic form differences.
Statistical Equivalence
• Usually promoted (if not ensured) by equating, a set of statistical methods
applied to data collected under any of several designs.
• Characteristics of performance assessment tasks and test forms complicate
equating and can distort results.
Equating Designs
• Single group
- All examinees take both test forms.
• Equivalent group
- Each form is administered to a distinct, but randomly equivalent, group.
• Common-item
- Test forms share a block of items in common (called an anchor; see the sketch below).
[Diagrams: the single group, equivalent group, and common-item data collection designs]
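
To make the common-item idea concrete, here is a minimal sketch of chained linear equating through an anchor block. The method and the score arrays (Form X plus anchor for group 1, Form Y plus the same anchor for group 2) are illustrative assumptions, not the procedure recommended in the chapter.

```python
# Sketch only: chained linear equating of Form X onto the Form Y scale
# through a common anchor. All score arrays are hypothetical.
import numpy as np

def linear_link(from_scores, to_scores):
    """Map the 'from' scale onto the 'to' scale by matching means and SDs."""
    mu_f, sd_f = np.mean(from_scores), np.std(from_scores, ddof=1)
    mu_t, sd_t = np.mean(to_scores), np.std(to_scores, ddof=1)
    return lambda x: mu_t + (sd_t / sd_f) * (np.asarray(x, dtype=float) - mu_f)

def chained_linear_equate(x_group1, anchor_group1, y_group2, anchor_group2):
    """X -> anchor (estimated in group 1), then anchor -> Y (estimated in group 2)."""
    x_to_anchor = linear_link(x_group1, anchor_group1)
    anchor_to_y = linear_link(anchor_group2, y_group2)
    return lambda x: anchor_to_y(x_to_anchor(x))

# Hypothetical usage:
# equate = chained_linear_equate(x_g1, v_g1, y_g2, v_g2)
# y_scale_scores = equate(new_form_x_scores)
```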
Unhelpful Characteristics
• Tests composed of small numbers of tasks can defy internal anchoring.
• Tasks can be memorable, meaning performance on them may change upon reuse.
• Raters’ judgments may differ across occasions.
Mitigation
• The litany of performance assessment challenges outlined above can be
addressed from several angles:
- During test development and assembly
- While equating
- During task scoring
Task Development & Test Assembly
• Attempt to identify and control as many influences on task performance as
possible. The extreme case is to build forms of task “clones,” developed to
perform as equivalently as possible.
• Lengthen the test, perhaps with the addition of selected-response items.
Equating
• Use external anchors.
• Use “collateral” anchors:
- Blocks of selected-response items.
- Other, related tests.
• Use the equivalent groups design.
• Use the SiGNET design.
Scoring
• Monitor rater performance (see the sketch below).
• Correct for rater trends over time.
• Hope for advancements in automated scoring.
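
One concrete way to monitor rater performance, offered here only as an illustration, is to track quadratic-weighted kappa between each rater and a consensus or validity score; drift in the index over time can flag rater trends. The sketch assumes integer scores from 0 to n_categories − 1.

```python
# Sketch only: quadratic-weighted kappa between two sets of ratings
# (e.g., an operational rater vs. a consensus/validity score).
import numpy as np

def quadratic_weighted_kappa(ratings_a, ratings_b, n_categories):
    a = np.asarray(ratings_a, dtype=int)
    b = np.asarray(ratings_b, dtype=int)
    observed = np.zeros((n_categories, n_categories))
    for i, j in zip(a, b):
        observed[i, j] += 1
    observed /= observed.sum()
    # Expected cell proportions if the two sets of ratings were independent
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights: 0 on the diagonal, largest in the corners
    idx = np.arange(n_categories)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_categories - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()
```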
Conclusions
• Extend test length, perhaps through augmentation with selected-response items.
• Rigidly standardize tasks and test forms.
• Accept that scores are not comparable and limit inferences accordingly.
Recommendations
#5 In order to provide an adequate number of tasks under the required
conditions, one may need to compromise on the length/complexity of tasks.
#6 If inferences are not needed at the student level, consider using a matrix
sampling approach.
Thanks!
The Role of Groupwork in
Performance Assessment
Noreen Webb
University of California, Los Angeles
National Council on Measurement in Education,
Chicago, April 18, 2015
Psychometric Considerations for the Next
Generation of Performance Assessment
(C. Tucker, Chair)
What is groupwork?
Individuals work together in small groups
to achieve a common goal
• Solve a problem
• Complete a task
• Produce a product
Group members may interact
• Face to face or virtually
• Synchronously or asynchronously
• Following roles or scripts, or not
Why groupwork?
• 21st century workplace requires
collaboration and teamwork skills
• Essential teamwork competencies
–Communication
–Coordination
–Decision making
–Adaptation
–Conflict resolution
Possible groupwork outcomes
• Productivity and performance
–What can the group produce?
–What does each individual
contribute?
• Collaboration and teamwork skills
–How does the group function?
–What are the collaboration and
teamwork skills of individuals in the
group?
Special measurement challenges
#1 Groupwork outcomes depend on how a
group functions:
• How do group members interact?
• Are all team members involved?
• Do group members coordinate their
communication?
• How well do members of the group get
along?
• Does the group work together? Divide up
the work?
Special measurement challenges
#2 Group functioning and groupwork
outcomes depend on:
• Composition of the group
• Roles assigned to group members
• Type of task and specific task
• Occasion
• Interaction mode: Face-to-face vs.
remote
• Scorers and methods of scoring
Taking a closer look
Emily
The Task
• Generate multiple strategies and representations
for solving mathematics story problems involving
multiplication and division
• Example problem contexts:
• Number of seashells collected per day and/or in
total on a multiple-day trip to the beach
• Number of dollars raised per student and/or in
total for the school jog-a-thon
• Number of minutes spent practicing Folklorico
per session and/or in total for a month or year
To help standardize
group compositions,
computer agents
might be
programmed to
communicate in
predetermined ways
Photo credit: Manis, F. My Virtual Child. As depicted in Graves, S.,
Journal of Online Learning and Teaching, 9(1), March 2013.
Special measurement challenges
#3 Non-independence
• Within a group
Contributions of group members are
linked to each other
• Between groups
Contributions of groups that have
group members in common are
linked to each other
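
A minimal sketch, using hypothetical scores: the one-way intraclass correlation, ICC(1), is one simple index of how strongly members' scores are linked within groups. A value well above zero signals that individual contributions cannot be treated as independent observations; the ANOVA formula below assumes equal group sizes.

```python
# Sketch only: ICC(1) from a one-way random-effects ANOVA, as a quick
# check on the degree of within-group dependence. Hypothetical data.
import numpy as np

def icc1(scores_by_group):
    """scores_by_group: list of arrays, one array of member scores per group
    (equal group sizes assumed for this simple formula)."""
    groups = [np.asarray(g, dtype=float) for g in scores_by_group]
    n_groups = len(groups)
    k = len(groups[0])                         # members per group
    grand_mean = np.concatenate(groups).mean()
    ms_between = k * sum((g.mean() - grand_mean) ** 2 for g in groups) / (n_groups - 1)
    ms_within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n_groups * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```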
Tackling non-independence
• Dynamic factor analysis
• Multilevel modeling
• Dynamic linear models
• Differential equation models
• Intra-variability models
• Hidden Markov models
• Multiple membership models
• Bayesian belief networks
• Bayesian knowledge tracing
• Machine learning methods
• Latent class analysis
• Neural networks
• Point processes
• Social network analysis
How many groupwork situations
do we need for dependable
measurement?
The answer may differ depending on the focus:
• Process (teamwork skills) vs. productivity
(quality of team solution)
• Individual vs. group level
Generalizability theory may be helpful here for
estimating the magnitude of sources of error
and the number of conditions needed
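
As a sketch of how generalizability theory can be used here (hypothetical data, and a fully crossed persons-by-tasks design assumed), a G study estimates variance components from the score matrix and a D study projects the generalizability coefficient for alternative numbers of groupwork tasks.

```python
# Sketch only: one-facet (persons x tasks) G study and D study.
import numpy as np

def g_study_p_x_t(scores):
    """scores: persons x tasks matrix from a fully crossed design."""
    n_p, n_t = scores.shape
    grand = scores.mean()
    p_means = scores.mean(axis=1)
    t_means = scores.mean(axis=0)
    ms_p = n_t * ((p_means - grand) ** 2).sum() / (n_p - 1)
    ms_t = n_p * ((t_means - grand) ** 2).sum() / (n_t - 1)
    residual = scores - p_means[:, None] - t_means[None, :] + grand
    ms_pt = (residual ** 2).sum() / ((n_p - 1) * (n_t - 1))
    # Variance components from expected mean squares (negatives truncated to 0)
    var_pt = ms_pt
    var_p = max((ms_p - ms_pt) / n_t, 0.0)
    var_t = max((ms_t - ms_pt) / n_p, 0.0)
    return var_p, var_t, var_pt

def d_study_rho2(var_p, var_pt, n_tasks):
    """Relative generalizability coefficient for a design with n_tasks tasks."""
    return var_p / (var_p + var_pt / n_tasks)
```

Scanning d_study_rho2 over candidate numbers of tasks shows how quickly (or slowly) dependability grows, which is exactly the question raised above.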
What if the number of groupwork
situations needed for dependable
measurement is simply too large?
Strategies
• Shift the focus from individual or group to
classroom or school
• Apply matrix sampling
Assign different groups within a
classroom or school to different
conditions
• Form groups randomly
Effects associated with particular group
compositions may cancel out in the
aggregate
Recommendations
#8 Consider ways of making performance
assessment – individually and with groupwork –
part of the fabric of classroom instruction.
#10 Consider using performance assessment in
the context of groups, but be careful about any
inferences about individual students.
Noreen Webb
[email protected]
Psychometric Considerations for the Next
Generation of Performance Assessment
Modeling, Dimensionality, and Weighting
of Performance Task Scores
Lauress L. Wise, HumRRO
National Council on Measurement in Education Annual Meeting
April 18, 2015, Chicago, IL
What is a Construct Measurement Model?
● A mathematical function giving the probability of any observed response for
test takers at any particular point on the underlying construct(s)
– Examples (two of these are spelled out in the sketch below):
• 1-, 2-, and even 3-parameter logistic models for dichotomously scored
item responses
• Partial credit or (generalized) graded response models for polytomously
scored responses (with ordered categories)
– Will be more complicated for performance tasks with responses that are not
dichotomously or polytomously scored:
• Time to (successfully) complete a task
• More continuous measures of nearness to perfection
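
For concreteness, the sketch below writes two of these models as explicit response-probability functions: the three-parameter logistic model for dichotomous items (the 2PL is the special case c = 0) and Samejima's graded response model for ordered categories. These are standard textbook forms, not code from any operational program.

```python
# Sketch only: standard IRT response-probability functions.
import numpy as np

def p_3pl(theta, a, b, c=0.0):
    """Probability of a correct dichotomous response under the 3PL
    (discrimination a, difficulty b, lower asymptote c)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def p_graded(theta, a, thresholds):
    """Category probabilities for a single ability value theta under the
    graded response model; thresholds are increasing boundaries b_1 < ... < b_{m-1}."""
    # Cumulative probabilities of responding in category k or higher
    p_star = [1.0] + [1.0 / (1.0 + np.exp(-a * (theta - b))) for b in thresholds] + [0.0]
    return np.array([p_star[k] - p_star[k + 1] for k in range(len(thresholds) + 1)])
```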
Modeling: What is it good for?
● Research on the construct being assessed
– Is it one thing or does it involve multiple dimensions?
– How does it relate to other similar constructs?
● Research on the measure of that construct
– Does the measure contain “nuisance” factors (e.g., methods effects) that
must be “factored” out?
– How well do scores generalize across tasks, raters, occasions, and other
potentially important factors?
● Operational use of construct models
– You will need a likelihood function to do maximum likelihood estimation!
– Estimating the precision of score estimates
– Person and task “fit” indices to identify outliers
Modeling Different Types of Constructs and Response Data

Table 5.1. Probability Models for Different Types of Response Data and Construct Models

Type of Response Data | Construct: Continuous (Bounded) | Construct: Continuous (Infinite) | Construct: Ordered Discrete Categories
Dichotomous | Classical Test Theory | Various IRT Models | Latent Class Models
Polytomous | Simple Score Point Models | Partial Credit; Graded Response | Log Linear Models
Continuous | Inverse Normal or Logistic Transformations | Regression Models; Generalizability Theory Models | Logistic or Discriminant Analysis Models
Special Considerations for Performance Tasks
● Performance tasks require test takers to exhibit relatively complex behaviors
– A simple dichotomous score would be a waste of good information
– High likelihood that multiple skills will be involved
● Scoring performance tasks is also far from simple
– Need to incorporate accuracy and consistency of human or machine scorers
in the measurement model
● Targeted content domain may not be well defined
– What really is “math reasoning”?
– Do we know good writing when we see it?
Models Need to be Tested
● To confirm hypothesized relationships
– Within and across item types
– With other related measures
● To identify additional, unintended dimensions in observed scores (one quick
check is sketched below)
– Incorporate unexpected, but valid aspects of the targeted construct into
the overall measurement model
– Eliminate or control for “nuisance” factors that might bias score estimates
for some groups of examinees
– Check item and person fit indicators
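
As one quick exploratory check for unintended dimensions (an illustration, not the chapter's prescribed procedure), the eigenvalues of the inter-item or inter-component correlation matrix can be compared with eigenvalues from random data of the same size, in the spirit of parallel analysis. The response matrix here is a hypothetical placeholder.

```python
# Sketch only: eigenvalue/parallel-analysis check for extra dimensions.
import numpy as np

def correlation_eigenvalues(responses):
    """responses: examinees x items (or score components) matrix."""
    corr = np.corrcoef(responses, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1]

def parallel_analysis_baseline(n_examinees, n_items, n_reps=200, seed=0):
    """Average eigenvalues from random normal data of the same size;
    observed eigenvalues above this baseline suggest 'real' dimensions."""
    rng = np.random.default_rng(seed)
    sims = [correlation_eigenvalues(rng.standard_normal((n_examinees, n_items)))
            for _ in range(n_reps)]
    return np.mean(sims, axis=0)
```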
Modeling Construct and Nuisance Factors
Weighting Scores on Multiple Dimensions
● Often there are multiple hypothesized dimensions
– Reading and writing components of language arts
– Areas of mathematics knowledge versus math reasoning
– NGSS Core ideas, practices, and crosscutting concepts
● Alternative criteria for total composite scores
– Maximizing reliability (see the composite-reliability sketch after this list)
– Maximizing correlation with an external criterion
• Such as combining reading and writing to predict Freshman GPA
– Transparency in scoring (e.g., using simple sums)
– Sending correct messages for instruction (judgmental weights), such as:
• All areas are equally important
• Writing is one-third as important as reading
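
As an illustration of the "maximizing reliability" criterion (inputs are assumed, not values from the chapter), the reliability of a weighted composite can be computed from component weights, standard deviations, reliabilities, and intercorrelations, assuming uncorrelated errors; alternative weighting schemes, such as equal weights versus judgmental weights, can then be compared directly.

```python
# Sketch only: reliability of a weighted composite of component scores,
# assuming uncorrelated measurement errors across components.
import numpy as np

def composite_reliability(weights, sds, reliabilities, correlations):
    """weights, sds, reliabilities: one value per component;
    correlations: matrix of correlations among observed component scores."""
    w = np.asarray(weights, dtype=float)
    sd = np.asarray(sds, dtype=float)
    rel = np.asarray(reliabilities, dtype=float)
    cov = np.asarray(correlations, dtype=float) * np.outer(sd, sd)  # observed covariances
    composite_var = w @ cov @ w
    error_var = np.sum(w ** 2 * sd ** 2 * (1 - rel))
    return 1 - error_var / composite_var

# Hypothetical example: reading weighted three times as heavily as writing
# rho = composite_reliability([3, 1], [10, 8], [0.90, 0.80], [[1, 0.6], [0.6, 1]])
```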
Corollary Conclusions
1. Models are useful both for estimating and for understanding construct scores.
2. Confirmatory analyses are needed to verify assumptions in the construct model.
3. Exploratory analyses are needed to identify any unintended factors in the observed scores.
4. The purpose and intended uses of the composite score should drive component weighting.
Recommendations
#3 Where possible, use multiple-choice assessment
items or use other techniques to increase the amount
and reliability of the information generated from
performance tasks.
– Also work to improve amount and reliability of information
from performance tasks
#9 Carefully consider the implications of various ways of
combining results of performance assessment with
other assessment types.
– Specific weighting of performance assessment components
may send a clearer message of their importance