Psychometric Considerations for the Next Generation of Performance Assessment
Charlene G. Tucker, Ed.D.
Maine Department of Education
National Council on Measurement in Education
April 18, 2015, Chicago, IL

Study Group
• Rich Shavelson (Chair), Emeritus, Stanford
• Tim Davey, ETS
• Steve Ferrara, Pearson
• Noreen Webb, UCLA
• Laurie Wise, HumRRO
• Paul Holland, Emeritus, UC Berkeley
• Charlene Tucker, Project Manager

Acknowledgements
• Expert Presenters
  – Brian Clauser, NBME
  – Peter Foltz, Pearson
  – Aurora Graf, ETS
  – Yigal Rosen, Pearson
  – Matthias von Davier, ETS
• Consortia Stakeholders
  – Enis Dogan, PARCC
  – Marty McCall, SBAC

Acknowledgements
• State Stakeholders
  – Juan D'Brot (WV)
  – Garron Gianopulos (NC)
  – Pete Goldschmidt (NM)
  – Jeffrey Hauger (NJ)
• Technical Reviewers
  – Suzanne Lane (UPitt)
  – Matthias von Davier (ETS)

Study Group Work
• Three 2-day meetings in San Francisco
• Chapters and primary authors emerged
• Lots of work and calls between meetings
• 100-page white paper intended for test development and measurement experts
• Shorter, less technical policy brief intended for a broader audience of practitioners and policy makers

Session Sequence
• Steve Ferrara, Chapter 2: Definition of Performance Assessment (15 min.)
• Tim Davey, Chapter 3: Comparability of Individual-Level Scores (15 min.)
• Noreen Webb, Chapter 4: Reliability and Comparability of Groupwork Scores (15 min.)
• Laurie Wise, Chapter 5: Modeling, Dimensionality, and Weighting (15 min.)
• Ronald K. Hambleton, University of Massachusetts, Discussant (15 min.)
• Audience Comments and Questions (10 min.)

Recommendations
#1: Measure what matters. If expectations include the ability to apply complex knowledge to solve problems or to deliver complex performances, explore ways to measure those things. What gets assessed often commands greater attention.
#2: If psychometric advisors are concerned about your emphasis on performance assessment, listen to them. They are trying to advise you responsibly.

Discussion
Ronald K. Hambleton, University of Massachusetts

Both papers may be downloaded at: http://www.k12center.org/publications/all.html

II. Definition of Performance Assessment
Steve Ferrara
Center for Next Generation Learning and Assessment
Pearson Research and Innovation Network

Presentation in C. Tucker (Chair), Psychometric Considerations for the Next Generation of Performance Assessment, a coordinated session conducted at the annual meeting of the National Council on Measurement in Education, Chicago,
April 18, 2015

Overview
• Formal definition of performance assessment
• Five distinguishing characteristics
• Five approaches
• Illustration
• Two recommendations
• Other topics in the chapter that I will not address today
  – Assessment activities not consistent with our definition
  – Performance assessment as a guide to teaching, higher expectations, and learning and achievement
  – Role of groupwork
  – Performance assessment features and characteristics for each approach
  – Some practical considerations
  – Psychometric considerations

DEFINITION OF PERFORMANCE ASSESSMENT

Definition in the chapter
An assessment activity or set of activities that requires test takers, individually or in groups, to generate products or performances in response to a complex task that provides observable or inferable and scorable evidence of the test taker's knowledge, skills, and abilities (KSAs) in an academic content domain, a professional discipline, or a job. Typically, performance assessments emulate a context outside of the assessment in which the KSAs ultimately will be applied; require use of complex knowledge, skills, and/or reasoning; and require application of evaluation criteria to determine levels of quality, correctness, or completeness (adapted from Lai, Ferrara, Nichols, & Reilly, 2014).

Definition parsed
• An assessment activity or set of activities that requires test takers, individually or in groups,
• to generate products or performances in response to a complex task
• that provides observable or inferable and scorable evidence
• of the test taker's knowledge, skills, and abilities (KSAs) in an academic content domain, a professional discipline, or a job.
Typically, performance assessments emulate a context outside of the assessment in which the KSAs ultimately will be applied; require use of complex knowledge, skills, and/or reasoning; and require application of evaluation criteria to determine levels of quality, correctness, or completeness.

Definition parsed
• Activity or set of activities
  – Require examinees to generate products or performances
  – In response to a complex task
• Provide
  – Observable or inferable and scorable evidence
  – About knowledge, skills, and abilities (KSAs) in an academic content domain, a professional discipline, or a job
• Typically:
  – Emulate a context
  – Require applied use of complex knowledge, skills, and/or reasoning
  – Require application of evaluation criteria to determine levels of quality, correctness, or completeness

Why care?
• Clarity in communication
• Broad thinking during assessment design
• Avoid all sizzle and no steak

FIVE DISTINGUISHING CHARACTERISTICS

Distinguishing characteristics
• Tasks
• Responses
• Scoring
• Fidelity
• Connectedness

Characteristics (cont.)
• Tasks
  – Directions about what problem to solve, product to create, or performance to undertake
  – Requirements for responding
  – Response features that will be scored
• Responses
  – Intended to reflect responses in the real-world situation represented by the assessment tasks
  – Constructed, complex
  – Vary in length or amount of behavior captured
  – Declarative, procedural, and strategic knowledge

Characteristics (cont.)
• Scoring
  – For accuracy, completeness, effectiveness, justifiability, etc.
  – Rubrics, counts of features, other scoring rules
• Fidelity
  – Assess KSAs in real-world contexts that are relevant to experiences inside or outside of the learning environment or on a job
  – Advocates: Supports learning
  – Aka authenticity
  – Scenarios and assessment activities intended to engage interest and effort

Characteristics (cont.)
• Connectedness
  – Among the assessment activities and other requirements for responding (e.g., stimuli)
  – In contrast to the discreteness of conventional test design
  – Intended to elicit broad thinking and HOTS (e.g., reasoning, critical thinking) more explicitly than is achievable with selected-response items
  – Supports fidelity; enables focus on complex HOTS

APPROACHES TO PERFORMANCE ASSESSMENT

Approaches
• Short constructed response and technology-enhanced items (as opposed to technology-enabled items)
• Essay prompts
• Performance tasks
• Portfolios
• Simulations

Performance task: Literary Analysis Task
− Six evidence-based selected response (EBSR) and/or technology-enhanced constructed response (TECR) items, one prose constructed response (PCR) item
− Read first passage, respond to items
− Read second passage, respond to items
− Respond to a prose constructed response (PCR) prompt
From http://www.parcconline.org/samples/english-language-artsliteracy/grade-10-elaliteracy

Culminating activity: Respond to a prose constructed response item
• Analyze how Anne Sexton transformed Daedalus and Icarus
• Use what you have learned
• Develop your claims, using evidence from both texts
• Think-abouts (emphasized, absent, different)

Multi-trait rubric
[Slide shows the multi-trait scoring rubric as an image; no text content.]
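As a purely illustrative sketch of how multi-trait rubric scores might be represented and combined in code, the example below uses assumed trait names and point ranges; it is not the actual PARCC rubric or scoring procedure.

```python
# Illustrative only: trait names and score ranges are assumptions, not the PARCC rubric.
from dataclasses import dataclass

@dataclass
class TraitScore:
    trait: str       # e.g., "Reading Comprehension", "Written Expression"
    score: int       # rater-assigned score point on this trait
    max_points: int  # top of the scale for this trait

def total_score(trait_scores):
    """Sum rubric traits into a task total (simple unweighted sum)."""
    return sum(t.score for t in trait_scores)

response = [
    TraitScore("Reading Comprehension", 3, 4),
    TraitScore("Written Expression", 2, 4),
    TraitScore("English Conventions", 3, 3),
]
print(total_score(response))  # 8 of a possible 11 points
```

Operational scoring would of course apply the published rubric descriptors and combine trait scores according to the program's own rules.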
Summary of the task features
• Six SR and CR items, one essay prompt
• Read first passage, respond to items
• Read second passage, respond to items
• Respond to a prose constructed response (PCR) prompt
• Aligned with claims in Written Expression, English Conventions, and Reading Literature
• Multi-trait rubric

Elements of the definition of performance assessment reflected in the task:
• Set of activities
• Product(s)
• Complex tasks
• Observable, scorable evidence of KSAs
• Context from instruction
• Complex analysis, reasoning, supported argumentation
• Scoring rubric

Five distinguishing characteristics of performance assessment reflected in the task:
• Task: Defined for each item
• Responses: Requirements for each item are defined
• Scoring: Rubrics align with response requirements
• Fidelity: Literary analysis consistent with instruction; analysis and argumentation are crucial in the "real world"
• Connectedness: Six items require close reading, which is preparation for the culminating activity

Recommendations
#4: Use complex performance assessment when the construct of interest is important and cannot be assessed using multiple-choice items.
#7: Consider ways to enhance performance assessment by combining it with other types of test items.

Reference and additional resources
Lai, E. R., Ferrara, S., Nichols, P., & Reilly, A. (2014). The once and future legacy of performance assessment. Manuscript submitted for publication.
Additional details on approaches to performance assessment and examples of performance assessments are available from:
• ETS (see https://www.ets.org/about/who/leaders)
• Pearson's Research and Innovation Network (see http://paframework.csprojecthub.com/?page=home)
• Stanford Center for Opportunity Policy in Education (SCOPE; see https://edpolicy.stanford.edu/category/topics/32/0?type=scope_publications)
• University of Oregon (see http://pages.uoregon.edu/kscalise/taxonomy/taxonomy.html)
• WestEd (see http://www.wested.org/resources/preparing-for-the-common-coreusing-performance-assessments-tasks-for-professional-development/)

Thanks!
Steve Ferrara
[email protected]
+1 612-581-6453
Copyright © 2015 Pearson Education, Inc. or its affiliates. All rights reserved.

Comparability of Test Scores
Tim Davey
ETS
Copyright © 2015 by Educational Testing Service. All rights reserved.

Score Comparability & Inference
• Mary performed at the 87th percentile of all 4th graders this year.
• John's math proficiency grew at an average rate during 6th grade.
• 38% of New Jersey 8th graders were judged as "Proficient" relative to state standards.

Requirements for Comparability
Substantive
– Tests being compared measure the same construct(s).
Statistical
Test forms being compared:
– Are equally reliable.
– Produce equivalent score distributions.
– Correlate equally with outside criteria.
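Each of the statistical requirements above can be checked from data. The following is a minimal sketch, using simulated scores, of how equal reliability, equivalent score distributions, and equal correlations with an outside criterion might be examined for two forms; none of the numbers reflect any actual assessment.

```python
# Minimal sketch: checking the statistical comparability requirements for two forms.
# Data are simulated for illustration; real checks would use operational data.
import numpy as np

rng = np.random.default_rng(0)

def cronbach_alpha(item_scores):
    """Coefficient alpha from an examinees-by-items score matrix."""
    k = item_scores.shape[1]
    item_var = item_scores.var(axis=0, ddof=1).sum()
    total_var = item_scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# Simulate two forms taken by randomly equivalent groups (500 examinees, 10 items each).
theta_a, theta_b = rng.normal(size=(2, 500, 1))
form_a = (rng.uniform(size=(500, 10)) < 1 / (1 + np.exp(-(theta_a - 0.0)))).astype(float)
form_b = (rng.uniform(size=(500, 10)) < 1 / (1 + np.exp(-(theta_b - 0.1)))).astype(float)
total_a, total_b = form_a.sum(axis=1), form_b.sum(axis=1)

# 1. Equal reliability
print("alpha A:", round(cronbach_alpha(form_a), 3), "alpha B:", round(cronbach_alpha(form_b), 3))
# 2. Equivalent score distributions (means/SDs shown; a full distributional test could be added)
print("mean/SD A:", round(total_a.mean(), 2), round(total_a.std(ddof=1), 2))
print("mean/SD B:", round(total_b.mean(), 2), round(total_b.std(ddof=1), 2))
# 3. Equal correlation with an outside criterion (here the simulated proficiency itself)
print("r(A, criterion):", round(np.corrcoef(total_a, theta_a.ravel())[0, 1], 3))
print("r(B, criterion):", round(np.corrcoef(total_b, theta_b.ravel())[0, 1], 3))
```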
The nature of performance assessments poses challenges to both requirements.

Substantive Equivalence
Substantive equivalence of test forms starts by building each to the same specifications, which define the characteristics of a sample from a universe of items/tasks.

Domain Sampling
Selected response:
– Forms composed of many, discrete items.
– Test specifications can be detailed and specific.
– Small sampling unit can promote broad, representative sampling of the item universe.
– Unanticipated/uncontrolled item influences on performance have a chance to cancel out.
Performance assessment:
– Forms composed of few, large, context-heavy tasks.
– Broader specifications befitting task size.
– Large sampling unit makes representative sampling more difficult.
– Unanticipated/uncontrolled task influences can produce idiosyncratic form differences.

Statistical Equivalence
Usually promoted (if not ensured) by equating, a set of statistical methods applied to data collected under any of several designs.
Characteristics of performance assessment tasks and test forms complicate equating and can distort results.

Equating Designs
– Single group: All examinees take both test forms.
– Equivalent group: Each form is administered to a distinct, but randomly equivalent, group.
– Common-item: Test forms share a block of items in common (called an anchor).

Unhelpful Characteristics
– Tests composed of small numbers of tasks can defy internal anchoring.
– Tasks can be memorable, meaning their performance may change upon reuse.
– Raters' judgments may differ across occasions.

Mitigation
The performance assessment challenges outlined above can be addressed from several angles:
– During test development and assembly
– While equating
– During task scoring

Task Development & Test Assembly
– Attempt to identify and control as many influences on task performance as possible. The extreme case is to build forms of task "clones", developed to perform as equivalently as possible.
– Lengthen the test, perhaps with the addition of selected-response items.

Equating
– Use external anchors.
– Use "collateral" anchors: blocks of selected-response items; other, related tests.
– Use the equivalent groups design.
– Use the SiGNET design.

Scoring
– Monitor rater performance.
– Correct for rater trends over time.
– Hope for advancements in automated scoring.

Conclusions
– Extend test length, perhaps through augmentation with selected-response items.
– Rigidly standardize tasks and test forms.
– Accept that scores are not comparable and limit inferences accordingly.

Recommendations
#5: In order to provide an adequate number of tasks under the required conditions, one may need to compromise on the length/complexity of tasks.
#6: If inferences are not needed at the student level, consider using a matrix sampling approach.
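As a concrete illustration of the equivalent-groups design mentioned above, here is a minimal sketch of linear equating, in which Form X scores are placed on the Form Y scale by matching means and standard deviations. The data are simulated, and no specific program's equating procedure is implied.

```python
# Minimal sketch: linear equating under the equivalent-groups design.
# Form X scores are placed on the Form Y scale by matching means and SDs.
import numpy as np

def linear_equate(x_scores, y_scores):
    """Return a function mapping Form X raw scores onto the Form Y scale."""
    mx, sx = np.mean(x_scores), np.std(x_scores, ddof=1)
    my, sy = np.mean(y_scores), np.std(y_scores, ddof=1)
    return lambda x: my + (sy / sx) * (x - mx)

# Illustrative data: two randomly equivalent groups, one per form.
rng = np.random.default_rng(1)
form_x = rng.normal(22, 5, size=1000)   # Form X happens to be harder
form_y = rng.normal(25, 6, size=1000)

to_y_scale = linear_equate(form_x, form_y)
print(round(to_y_scale(22), 2))  # a Form X score of 22 maps to roughly the Form Y mean
```

Equipercentile methods or IRT-based linking would follow the same logic of placing forms on a common scale, but with different assumptions.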
Thanks!

The Role of Groupwork in Performance Assessment
Noreen Webb
University of California, Los Angeles
National Council on Measurement in Education, Chicago, April 18, 2015
Psychometric Considerations for the Next Generation of Performance Assessment (C. Tucker, Chair)

What is groupwork?
Individuals work together in small groups to achieve a common goal
• Solve a problem
• Complete a task
• Produce a product
Group members may interact
• Face to face or virtually
• Synchronously or asynchronously
• Following roles or scripts, or not

Why groupwork?
• 21st century workplace requires collaboration and teamwork skills
• Essential teamwork competencies
  – Communication
  – Coordination
  – Decision making
  – Adaptation
  – Conflict resolution

Possible groupwork outcomes
• Productivity and performance
  – What can the group produce?
  – What does each individual contribute?
• Collaboration and teamwork skills
  – How does the group function?
  – What are the collaboration and teamwork skills of individuals in the group?

Special measurement challenges #1
Groupwork outcomes depend on how a group functions:
• How do group members interact?
• Are all team members involved?
• Do group members coordinate their communication?
• How well do members of the group get along?
• Does the group work together? Divide up the work?

Special measurement challenges #2
Group functioning and groupwork outcomes depend on:
• Composition of the group
• Roles assigned to group members
• Type of task and specific task
• Occasion
• Interaction mode: face-to-face vs. remote
• Scorers and methods of scoring

Taking a closer look
Emily
[Figure-only slides; no text content.]

The Task
• Generate multiple strategies and representations for solving mathematics story problems involving multiplication and division
• Example problem contexts:
  – Number of seashells collected per day and/or in total on a multiple-day trip to the beach
  – Number of dollars raised per student and/or in total for the school jog-a-thon
  – Number of minutes spent practicing Folklorico per session and/or in total for a month or year

To help standardize group compositions, computer agents might be programmed to communicate in predetermined ways.
Photo credit: Manis, F. My Virtual Child. As depicted in Graves, S., Journal of Online Learning and Teaching, 9(1), March 2013.

Special measurement challenges #3
Non-independence
• Within a group: contributions of group members are linked to each other
• Between groups: contributions of groups that have group members in common are linked to each other

Tackling non-independence
• Dynamic factor analysis
• Multilevel modeling
• Dynamic linear models
• Differential equation models
• Intra-variability models
• Hidden Markov models
• Multiple membership models
• Bayesian belief networks
• Bayesian knowledge tracing
• Machine learning methods
• Latent class analysis
• Neural networks
• Point processes
• Social network analysis
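Before fitting any of the models listed above, it can help to quantify how strongly group members' scores are actually linked. A minimal sketch, using simulated data, of the intraclass correlation ICC(1) as one simple index of within-group dependence; the averaged group size in the formula is a simplification adequate for illustration.

```python
# Minimal sketch: quantifying within-group non-independence with ICC(1).
# Members of the same group share a group effect, so their scores are correlated.
import numpy as np

def icc1(scores, groups):
    """One-way ANOVA ICC(1) for individual scores nested in groups."""
    scores, groups = np.asarray(scores, float), np.asarray(groups)
    labels = np.unique(groups)
    k = len(labels)
    n_per = np.array([np.sum(groups == g) for g in labels])
    n0 = n_per.mean()  # simple average group size (adequate for a sketch)
    group_means = np.array([scores[groups == g].mean() for g in labels])
    grand = scores.mean()
    ms_between = np.sum(n_per * (group_means - grand) ** 2) / (k - 1)
    ms_within = sum(np.sum((scores[groups == g] - scores[groups == g].mean()) ** 2)
                    for g in labels) / (len(scores) - k)
    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)

# Simulated data: 50 groups of 4; a shared group effect induces dependence among members.
rng = np.random.default_rng(2)
groups = np.repeat(np.arange(50), 4)
group_effect = np.repeat(rng.normal(0, 1.0, 50), 4)
scores = group_effect + rng.normal(0, 1.0, 200)
print(round(icc1(scores, groups), 2))  # roughly 0.5 given equal variance components
```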
How many groupwork situations do we need for dependable measurement?
The answer may differ depending on the focus:
• Process (teamwork skills) vs. productivity (quality of team solution)
• Individual vs. group level
Generalizability theory may be helpful here for estimating the magnitude of sources of error and the number of conditions needed.

What if the number of groupwork situations needed for dependable measurement is simply too large?

Strategies
• Shift the focus from individual or group to classroom or school
• Apply matrix sampling: assign different groups within a classroom or school to different conditions
• Form groups randomly: effects associated with particular group compositions may cancel out in the aggregate

Recommendations
#8: Consider ways of making performance assessment – individually and with groupwork – part of the fabric of classroom instruction.
#10: Consider using performance assessment in the context of groups, but be careful about any inferences about individual students.

Noreen Webb
[email protected]

Psychometric Considerations for the Next Generation of Performance Assessment
Modeling, Dimensionality, and Weighting of Performance Task Scores
Lauress L. Wise, HumRRO
National Council on Measurement in Education Annual Meeting
April 18, 2015, Chicago, IL
66 Canal Center Plaza, Suite 700, Alexandria, VA 22314-1578 | Phone: 703.549.3611 | Fax: 703.549.9661 | www.humrro.org

What is a Construct Measurement Model?
● A mathematical function giving the probability of any observed response for test takers at any particular point on the underlying construct(s)
  – Examples:
    • 1-, 2-, and even 3-parameter logistic models for dichotomously scored item responses
    • Partial credit or (generalized) graded response models for polytomously scored responses (with ordered categories)
  – Will be more complicated for performance tasks with responses that are not dichotomously or polytomously scored:
    • Time to (successfully) complete a task
    • More continuous measures of nearness to perfection

Modeling: What is it good for?
● Research on the construct being assessed
  – Is it one thing or does it involve multiple dimensions?
  – How does it relate to other similar constructs?
● Research on the measure of that construct
  – Does the measure contain "nuisance" factors (e.g., methods effects) that must be "factored" out?
  – How well do scores generalize across tasks, raters, occasions, and other potentially important factors?
● Operational use of construct models
  – You will need a likelihood function to do maximum likelihood estimation!
  – Estimating the precision of score estimates
  – Person and task "fit" indices to identify outliers

Modeling Different Types of Constructs and Response Data
Table 5.1. Probability Models for Different Types of Response Data and Construct Models
● Dichotomous response data
  – Continuous bounded construct: Classical Test Theory
  – Continuous infinite construct: Various IRT Models
  – Ordered discrete categories: Latent Class Models
● Polytomous response data
  – Continuous bounded construct: Simple Score Point Models
  – Continuous infinite construct: Partial Credit; Graded Response
  – Ordered discrete categories: Log Linear Models
● Continuous response data
  – Continuous bounded construct: Inverse Normal or Logistic Transformations
  – Continuous infinite construct: Regression Models, Generalizability Theory Models
  – Ordered discrete categories: Logistic or Discriminant Analysis Models
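For the polytomous row of Table 5.1, a minimal sketch of Samejima's graded response model shows how category probabilities for a single polytomously scored task follow from a continuous construct. The item parameters below are assumed values for illustration, not estimates from any assessment, and the code only evaluates the model rather than estimating parameters.

```python
# Minimal sketch: category response probabilities under the graded response model
# for one polytomous task. Item parameters are assumed for illustration.
import numpy as np

def grm_category_probs(theta, a, thresholds):
    """P(score = k | theta) for k = 0..m under Samejima's graded response model."""
    theta = np.atleast_1d(theta).astype(float)
    b = np.asarray(thresholds, float)                     # ordered thresholds b_1 < ... < b_m
    p_star = 1 / (1 + np.exp(-a * (theta[:, None] - b)))  # P(score >= k) for k = 1..m
    cum = np.hstack([np.ones((len(theta), 1)), p_star, np.zeros((len(theta), 1))])
    return cum[:, :-1] - cum[:, 1:]                       # adjacent differences give P(score = k)

# One task scored 0-3, discrimination 1.2, thresholds -1.0, 0.0, 1.5 (assumed values).
probs = grm_category_probs([-2.0, 0.0, 2.0], a=1.2, thresholds=[-1.0, 0.0, 1.5])
for th, row in zip([-2.0, 0.0, 2.0], probs):
    print(th, np.round(row, 3))  # each row sums to 1 across the four score categories
```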
Special Considerations for Performance Tasks
● Performance tasks require test takers to exhibit relatively complex behaviors
  – A simple dichotomous score would be a waste of good information
  – High likelihood that multiple skills will be involved
● Scoring performance tasks is also far from simple
  – Need to incorporate accuracy and consistency of human or machine scorers in the measurement model
● Targeted content domain may not be well defined
  – What really is "math reasoning"?
  – Do we know good writing when we see it?

Models Need to be Tested
● To confirm hypothesized relationships
  – Within and across item types
  – With other related measures
● To identify additional, unintended dimensions in observed scores
  – Incorporate unexpected, but valid, aspects of the targeted construct into the overall measurement model
  – Eliminate or control for "nuisance" factors that might bias score estimates for some groups of examinees
  – Check item and person fit indicators

Modeling Construct and Nuisance Factors
[Figure-only slide; no text content.]

Weighting Scores on Multiple Dimensions
● Often there are multiple hypothesized dimensions
  – Reading and writing components of language arts
  – Areas of mathematics knowledge versus math reasoning
  – NGSS core ideas, practices, and crosscutting concepts
● Alternative criteria for total composite scores
  – Maximizing reliability
  – Maximizing correlation with an external criterion
    • Such as combining reading and writing to predict freshman GPA
  – Transparency in scoring (e.g., using simple sums)
  – Sending correct messages for instruction (judgmental weights), such as:
    • All areas are equally important
    • Writing is one-third as important as reading

Corollary Conclusions
1. Models are useful both for estimating and for understanding construct scores.
2. Confirmatory analyses are needed to verify assumptions in the construct model.
3. Exploratory analyses are needed to identify any unintended factors in the observed scores.
4. The purpose and intended uses of composite scores should drive component weighting.

Recommendations
#3: Where possible, use multiple-choice assessment items or use other techniques to increase the amount and reliability of the information generated from performance tasks.
  – Also work to improve the amount and reliability of information from performance tasks
#9: Carefully consider the implications of various ways of combining results of performance assessment with other assessment types.
  – Specific weighting of performance assessment components may send a clearer message of their importance
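Recommendation #9 concerns how performance assessment results are combined with other components. As a rough illustration of the weighting alternatives discussed above, the sketch below compares the reliability of a two-component composite under different judgmental weights, using the classical (Mosier) composite reliability formula. The component reliabilities, standard deviations, and correlation are assumed values, not results from any program.

```python
# Minimal sketch: reliability of a weighted composite under different weighting schemes.
# Component reliabilities, SDs, and correlations below are assumed illustrative values.
import numpy as np

def composite_reliability(weights, sds, reliabilities, corr):
    """Mosier composite reliability for C = sum(w_i * X_i) given observed-score statistics."""
    w, s, r = map(np.asarray, (weights, sds, reliabilities))
    cov = np.outer(s, s) * np.asarray(corr)        # observed covariance matrix of components
    var_c = w @ cov @ w                            # composite observed variance
    error_var = np.sum(w**2 * s**2 * (1 - r))      # weighted sum of component error variances
    return 1 - error_var / var_c

sds = [10.0, 8.0]        # e.g., selected-response section, performance task section
reliab = [0.90, 0.70]    # performance tasks are typically the less reliable component
corr = [[1.0, 0.6],
        [0.6, 1.0]]

for label, w in [("equal weights", [0.5, 0.5]),
                 ("SR-heavy 2:1", [2 / 3, 1 / 3]),
                 ("PT-heavy 1:2", [1 / 3, 2 / 3])]:
    print(label, round(composite_reliability(w, sds, reliab, corr), 3))
```

Comparing such figures against the instructional message the weights send is exactly the trade-off the corollary conclusions describe.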