STAR Reading™ Technical Manual

United Kingdom
Renaissance Learning UK Ltd.
32 Harbour Exchange Square
London E14 9GE
Tel: +44(0)845 260 3570
Fax: +44(0)20 7538 2625
Email: [email protected]
Website: www.renlearn.co.uk

Australia
EdAlliance Pty Ltd
PO Box 8099
Armadale Victoria 3143
Australia
FreeCall (AU): 1800 655 359
FreeCall (NZ): 0800 440 668
Email: [email protected]
Website: www.EdAlliance.com.au

Copyright Notice

Copyright © 2014 Renaissance Learning, Inc. All Rights Reserved.

This publication is protected by US and international copyright laws. It is unlawful to duplicate or reproduce any copyrighted material without authorisation from the copyright holder. This document may be reproduced only by staff members in schools that have a license for STAR Reading software. For more information, contact Renaissance Learning, Inc., at the address above.

All logos, designs, and brand names for Renaissance Learning’s products and services, including but not limited to Accelerated Maths, Accelerated Reader, AR, AM, ATOS, MathsFacts in a Flash, Renaissance Home Connect, Renaissance Learning, Renaissance School Partnership, STAR, STAR Assessments, STAR Early Literacy, STAR Maths and STAR Reading are trademarks of Renaissance Learning, Inc. and its subsidiaries, registered, common law, or pending registration in the United Kingdom, United States and other countries. All other product and company names should be considered as the property of their respective companies and organisations.

METAMETRICS®, LEXILE® and LEXILE® FRAMEWORK are trademarks of MetaMetrics, Inc., and are registered in the United States and abroad. Copyright © 2014 MetaMetrics, Inc. All rights reserved.

STAR Reading has been reviewed for scientific rigor by the US National Center on Student Progress Monitoring. It was found to meet the Center’s criteria for scientifically based progress monitoring tools, including its reliability and validity as an assessment. For more details, visit www.studentprogress.org.
Please Note: This manual presents technical data accumulated over the course of the development of the US version of STAR Reading. All of the calibration, reliability, validity and normative data are based on US children, and these may not apply to UK children. The US norm-referenced scores and reliability and validity data presented in this manual are for informational purposes only.

8/2014 SRRPUK

Contents

Introduction
    STAR Reading: Progress Monitoring Assessment
    Tier 1: Formative Class Assessments
    Tier 2: Interim Periodic Assessments
    Tier 3: Summative Assessments
    STAR Reading Purpose
    Design of STAR Reading
    Improvements to the STAR Reading Test in Versions 2.x and Higher
    Improvements Specific to STAR Reading Versions 3.x RP and Higher
    Improvements Specific to STAR Reading Version 4.3 RP
    Test Security
    Split-Application Model
    Individualised Tests
    Data Encryption
    Access Levels and Capabilities
    Test Monitoring/Password Entry
    Final Caveat
    Test Administration Procedures
    Test Interface
    Practice Session
    Adaptive Branching/Test Length
    Test Repetition
    Item Time Limits

Content and Item Development
    Content Development
    The Educational Development Laboratory’s Core Vocabulary List: ATOS Graded Vocabulary List
    Item Development
    Vocabulary-in-Context Item Specifications

Item and Scale Calibration
    Calibration of STAR Reading Items for Use in Version 2.0
    Sample Description
    Item Presentation
    Item Difficulty
    Item Discrimination
    Item Response Function
    Rules for Item Retention
    Computer-Adaptive Test Design
    Scoring in the STAR Reading Tests
    Scale Calibration
    The Linking Study
    Dynamic Calibration

Score Definitions
    Types of Test Scores
    Estimated Oral Reading Fluency (Est. ORF)
    Lexile® Measures
    Lexile ZPD Ranges
    Lexile Measures of Students and Books: Measures of Student Reading Achievement and Text Readability
    National Curriculum Level–Reading (NCL–R)
    Normed Referenced Standardised Score (NRSS)
    Percentile Rank (PR) and Percentile Rank Range
    Reading Age (RA)
    Scaled Score (SS)
    Zone of Proximal Development (ZPD)
    Diagnostic Codes
    Comparing the STAR Reading US Test with Classical Tests

Reliability and Validity
    Split-Half Reliability
    Test-Retest Reliability
    UK Reliability Study
    Validity
    External Validity
    Meta-Analysis of the STAR Reading Validity Data
    Post-Publication Study Data
    Predictive Validity: Correlations with SAT9 and the California Standards Tests
    A Longitudinal Study: Correlations with the Stanford Achievement Test in Reading
    Concurrent Validity: An International Study of Correlations with Reading Tests in England
    Construct Validity: Correlations with a Measure of Reading Comprehension
    Investigating Oral Reading Fluency and Developing the Estimated Oral Reading Fluency Scale
    Cross-Validation Study Results
    UK Study Results
    Concurrent Validity
    Summary of STAR Reading Validity Data

Norming
    Sample Characteristics
    Regional Distribution
    Standardised Scores
    Percentile Ranks
    Gender
    Regional Differences in Outcome
    Other Issues
    Reference

Frequently Asked Questions
    Does STAR Reading Assess Comprehension, Vocabulary or Reading Achievement?
    How Do Zone of Proximal Development (ZPD) Ranges Fit In?
    How Can the STAR Reading Test Determine a Child’s Reading Level in Less Than Ten Minutes?
    How Does the STAR Reading Test Compare with Other Standardised/National Tests?
    What Are Some of the Other US Standardised Tests That Might Be Compared to the STAR Reading Test?
    Why Do Some of My Students Who Took STAR Reading Tests Have Scores That Are Widely Varying from the Results of Our Other US-Standardised Test Program?
    Why Do We See a Significant Number of Our Students Performing at a Lower Level Now Than They Were Nine Weeks Ago?
    How Many Items Will a Student Be Presented With When Taking a STAR Reading Test?
    How Many Items Does the STAR Reading Test Have at Each Year?
    What Guidelines Are Offered as to Whether a Student Can Be Tested Using STAR Reading Software?
    How Will Students With a Fear of Taking Tests Do With STAR Reading Tests?
    Is There Any Way for a Teacher to See Exactly Which Items a Student Answered Correctly and Which He or She Answered Incorrectly?
    What Evidence Do We Have That STAR Reading Software Will Perform as Claimed?
    Can or Should the STAR Reading Test Replace a School’s Current National Tests?
    What Is Item Response Theory?
    What Are the Cloze and Maze Procedures?

Appendix A
    US Norming
    Sample Characteristics
    Test Administration
    Data Analysis
    US Norm-Referenced Score Definitions
    Types of Test Scores
    Grade Equivalent (GE)
    Estimated Oral Reading Fluency (Est. ORF)
    Comparing the STAR Reading Test with Classical Tests
    Understanding IRL and GE scores
    Percentile Rank (PR)
    Normal Curve Equivalent (NCE)
    US Norm-Referenced Conversion Tables

References

Index

Introduction

STAR Reading: Progress Monitoring Assessment

The Renaissance Place Edition of the STAR Reading computer-adaptive test and database allows teachers to assess students’ reading comprehension and overall reading achievement in ten minutes or less. This computer-based progress-monitoring assessment provides immediate feedback to teachers and administrators on each student’s reading development.

STAR Reading runs on the Renaissance Place RT platform, which stores three levels of critical student data: daily progress monitoring, periodic progress monitoring and annual assessment results. Renaissance Learning identifies these three levels as Tier 1, Tier 2 and Tier 3, as described below.

[Figure: Renaissance Place gives you information from all 3 tiers. Tier 3: Summative Assessments; Tier 2: Interim Periodic Assessments; Tier 1: Formative Class Assessments.]

Tier 1: Formative Class Assessments

Formative class assessments provide daily, even hourly, feedback on students’ task completion, performance and time on task. Renaissance Learning Tier 1 programs include Accelerated Reader, MathsFacts in a Flash and Accelerated Maths.

Tier 2: Interim Periodic Assessments

Interim periodic assessments help educators match the level of instruction and materials to the ability of each student, measure growth throughout the year, predict outcomes on county-mandated tests and track growth in student achievement longitudinally, facilitating the kind of growth analysis recommended by county and national organisations. Renaissance Learning Tier 2 programs include STAR Early Literacy, STAR Maths and STAR Reading.

Tier 3: Summative Assessments

Summative assessments provide quantitative and qualitative data in the form of high-stakes tests. The best way to ensure success on Tier 3 assessments is to monitor progress and adjust instructional methods and practice activities throughout the year using Tier 1 and Tier 2 assessments.

STAR Reading Purpose

As a periodic progress monitoring assessment, STAR Reading serves three purposes for students with a sight vocabulary of at least 100 words. First, it provides educators with quick and accurate estimates of reading comprehension using students’ teaching and learning reading levels. Second, it assesses reading achievement on a continuous scale over the range of school Years 1–13. Third, it provides the means for tracking growth in a consistent manner longitudinally for all students. This is especially helpful to school- and school network-level administrators.

The STAR Reading test is not intended to be used as a “high-stakes” or “national” test whose main function is to report end-of-period performance to parents and educationists. Although that is not its purpose, STAR Reading scores are highly correlated with large-scale survey achievement tests, as attested to by the data in “Reliability and Validity” on page 38.
The high correlations of STAR Reading scores with such national instruments make it easier to fine-tune instruction while there is still time to improve performance before the regular testing cycle.

The STAR Reading test’s repeatability and flexible administration provide specific advantages for everyone responsible for the education process:

• For students, STAR Reading software provides a challenging, interactive and brief test that builds confidence in their reading ability.
• For teachers, the STAR Reading test facilitates individualised instruction by identifying children who need remediation or enrichment most.
• For head teachers, the STAR Reading 3.x and higher RP browser-based management program provides regular, accurate reports on performance at the class, year, school and school network level, as well as year-to-year comparisons.
• For administrators and assessment specialists, the Management program provides a wealth of reliable and timely data on reading growth at each school and throughout the school network. It also provides a valid basis for comparing data across schools, years and special student populations.

This manual documents the suitability of STAR Reading computer-adaptive testing for these purposes and demonstrates quantitatively how well this innovative instrument in reading assessment performs.

Design of STAR Reading

One of the fundamental STAR Reading design decisions involved the choice of how to administer the test. The primary advantage of using computer software to administer STAR Reading tests is the ability to tailor students’ tests based on their responses to previous items. Paper-and-pencil tests cannot do this: every student must respond to the same items in the same sequence. Using computer-adaptive procedures, it is possible for students to test on items that appropriately match their current level of proficiency.
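
As a rough illustration of tailoring a test to a student's responses to previous items, the following Python sketch chooses each next item to match a running ability estimate. It is a minimal sketch under stated assumptions, not the STAR Reading algorithm: the function names, the shared scale for item difficulty and student ability, the shrinking-step update rule and the idea of passing in an answer function are all hypothetical and chosen only for illustration.

```python
def pick_next_item(ability, item_difficulties, used):
    """Return the index of the unused item whose difficulty is closest to
    the current ability estimate (hypothetical selection rule)."""
    candidates = [i for i in range(len(item_difficulties)) if i not in used]
    return min(candidates, key=lambda i: abs(item_difficulties[i] - ability))


def run_adaptive_test(item_difficulties, answer_fn, n_items, start=0.0):
    """Administer n_items items adaptively.

    answer_fn(difficulty) -> bool stands in for the student's response.
    The ability estimate moves up after a correct answer and down after an
    incorrect one, by a step that shrinks as the test proceeds. This simple
    rule is an assumption made for the sketch, not a statistical estimator.
    """
    ability = start
    step = 1.0
    used = set()
    administered = []  # list of (item index, answered correctly?)
    for _ in range(n_items):
        i = pick_next_item(ability, item_difficulties, used)
        used.add(i)  # no item is shown twice within one test
        correct = answer_fn(item_difficulties[i])
        ability += step if correct else -step
        step = max(step * 0.7, 0.1)  # smaller adjustments later in the test
        administered.append((i, correct))
    return ability, administered
```

Called with an evenly spaced bank of difficulties and a simulated student who answers correctly whenever an item's difficulty is at or below a fixed level, the sketch starts in the middle of the bank and then concentrates its items near that level, which is the behaviour the paragraph above describes in general terms.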
The item selection procedures, termed Adaptive Branching, effectively customise the test for each student’s achievement level.

Adaptive Branching offers significant advantages in terms of test reliability, testing time and student motivation. Reliability improves over paper-and-pencil tests because the test difficulty matches each individual’s performance level; students do not have to fit a “one test fits all” model. Most of the test items that students respond to are at difficulty levels that closely match their achievement level. Testing time decreases because, unlike in paper-and-pencil tests, there is no need to expose every student to a broad range of material, portions of which are inappropriate because they are either too easy for high achievers or too difficult for those with low current levels of performance. Finally, student motivation improves for the same reasons: test time is minimised and test content is neither too difficult nor too easy.

Another fundamental STAR Reading design decision involved the choice of the content and format of items for the test. Many types of stimulus and response procedures were explored, researched, discussed and prototyped. These procedures included the traditional reading passage followed by sets of literal or inferential questions, previously published extended selections of text followed by open-ended questions requiring student-constructed answers and several cloze-type procedures for passage presentation. While all of these procedures can be used to measure reading comprehension and overall reading achievement, the vocabulary-in-context format was finally selected as the primary item format. This decision was made for interrelated reasons of efficiency, breadth of construct coverage, objectivity and simplicity of scoring.

For students at US grade levels 1 and 2 (Years 2 and 3), the STAR Reading 3.x and higher test administers 25 vocabulary-in-context items.
For students at US grade levels 3 and above (Years 4 and above), the test administers 20 vocabulary-in-context items in the first section of the test and five authentic text passages with multiple-choice literal or inferential questions in the second section of the test.

Four fundamental arguments support the use of the STAR Reading design for obtaining quick and reliable estimates of reading comprehension and reading achievement:

1. The vocabulary-in-context test items, while using a common format for assessing reading, require reading comprehension. Each test item is a complete, contextual sentence with a tightly controlled vocabulary level. The semantics and syntax of each context sentence are arranged to provide clues as to the correct cloze word. The student must actually interpret the meaning of (in other words, comprehend) the sentence in order to choose the correct answer because all of the answer choices “fit” the context sentence either semantically or syntactically. In effect, each sentence provides a mini-selection on which the student demonstrates the ability to interpret the correct meaning. This is, after all, what most reading theorists believe reading comprehension to be: the ability to draw meaning from text.

2. In the course of taking the STAR Reading tests, students read and respond to a significant amount of text in the form of vocabulary-in-context test items. The STAR Reading test typically asks the student to demonstrate comprehension of material that ranges over several US grade levels (UK years). Students will read, use context clues from, interpret the meaning of and attempt to answer 25 cloze sentences across these levels, generally totalling more than 300 words. The student must select the correct word from sets of words that are all at the same reading level, and that at least partially fit the sentence context.
Students clearly must demonstrate reading comprehension to correctly respond to these 25 questions.

3. A child’s level of vocabulary development is a major, perhaps the major, factor in determining the child’s ability to comprehend written material. Decades of reading research have consistently demonstrated that a student’s level of vocabulary knowledge is the most important single element in determining the child’s ability to read with comprehension. Tests of vocabulary knowledge typically correlate better than do any other components of reading with valid assessments of reading comprehension.

4. The student’s performance on the vocabulary-in-context section is used to determine the initial difficulty level of the subsequent authentic text passage items. Although this section consists of just five items, the accurate entry level and the continuing adaptive selection process mean that all of the authentic text passage items are closely matched to the student’s reading ability level. This results in unusually high measurement efficiency.

For these reasons, the STAR Reading test design and item format provide a valid procedure for assessing a student’s reading comprehension. Data and information presented in this manual reinforce this.

Improvements to the STAR Reading Test in Versions 2.x and Higher

Since the introduction of STAR Reading version 1.0 in 1996, STAR Reading has undergone a process of continuous research and improvement. Version 2.0 was an entirely new test, with new content and several technical innovations.

• The item bank was expanded from its previous 838 test items distributed among 14 difficulty levels. For the UK, there are 1,000 items distributed among 54 difficulty levels.
• The technical psychometric foundation for the test was improved. Versions 2.x and higher are now based on Item Response Theory (IRT).
The use of IRT permits more accurate calibration of item difficulty and more accurate measurement of students’ reading ability.
• The Adaptive Branching process was likewise improved. By using IRT, the STAR Reading tests effect an improvement in measurement efficiency.
• The length of the STAR Reading test was shortened and standardised. Taking advantage of improved measurement efficiency, the STAR Reading 2.x and higher tests administer just 25 vocabulary-in-context items to every student. The average length of version 1.x tests was 30 items per student.
• The STAR Reading test was nationally standardised in the UK prior to release.

Improvements Specific to STAR Reading Versions 3.x RP and Higher

Versions 3.x RP and 4.x RP are adaptations of version 2.x designed specifically for use on a computer with web access. In versions 3.x RP and higher, all management and test administration functions are controlled using a management system which is accessed by means of a computer with web access.

This makes a number of new features possible:

• Multiple schools can share a central database, such as a school network-level database. Records of students transferring between schools within the school network will be maintained in the database; the only information that needs revision following a transfer is the student’s updated school and class assignments.
• The same database that contains STAR Reading data can contain data on other STAR tests, including STAR Early Literacy and STAR Maths. The Renaissance Place RT program is a powerful information-management program that allows you to manage all your school network, school, personnel and student data in one place. Changes made to school network, school, teacher and student data for any of these products, as well as other Renaissance Place software, are reflected in every other Renaissance Place program sharing the central database.
• Multiple levels of access are available, from the test administrator within a school or class, to teachers, head teachers and school network administrators.
• Renaissance Place RT takes reporting to a new level. Not only can you generate reports from the student level all the way up to the school level, but you can also limit reports to specific groups, subgroups and combinations of subgroups. This supports “disaggregated” reporting; for example, a report might be specific to students eligible for Free School Meals, to English language learners or to students who fit both categories. It also supports compiling reports by teacher, class, school, year within a school and many other criteria such as a specific date range. In addition, the Renaissance Place consolidated reports allow you to gather data from more than one program (such as STAR Reading and Accelerated Reader) at the teacher, class, school and school network level and display the information in one report.
• Since the Renaissance Place RT software is accessed through a web browser, teachers (and administrators) will be able to access the program from home, provided the school gives them that access.
• When you upgrade from STAR Reading version 3.x to version 4.x or higher, all shortcuts to the student program will automatically redirect to the browser-based program (the Renaissance Place Welcome page) each time they are used.

Improvements Specific to STAR Reading Version 4.3 RP

STAR Reading versions 3.x RP to 4.2 RP were identical in content to STAR Reading version 2.x. With the development of version 4.3 RP, changes in content have been made, along with other changes, all described below.

• The Adaptive Branching process has been further improved by changing the difficulty target used to select each item.
The new difficulty target further improves the measurement efficiency of STAR Reading, and is expected to increase measurement precision, score reliability and test validity.
• A new feature, Dynamic Calibration, has been added. Dynamic Calibration makes it possible to include small numbers of unscored items in selected students’ tests, for the purpose of collecting item response data for research and development use.
• STAR Reading can now be used to test Year 1 students, at the teacher’s discretion. Score reports for Year 1 students will include Scale Scores, estimated Reading Ages and Estimated National Curriculum Levels.

Test Security

STAR Reading software includes a number of features intended to provide adequate security to protect the content of the test and to maintain the confidentiality of the test results.

Split-Application Model

In the STAR Reading RP software, when students log in, they do not have access to the same functions that teachers, administrators and other personnel can access. Students are allowed to test, but they have no other tasks available in STAR Reading RP; therefore, they have no access to confidential information. When teachers and administrators log in, they can manage student and class information, set preferences, register students for testing and create informative reports about student test performance.

Individualised Tests

Using Adaptive Branching, every STAR Reading test consists of items chosen from a large number of items of similar difficulty based on the student’s estimated ability. Because each test is individually assembled based on the student’s past and present performance, identical sequences of items are rare. This feature, while motivated chiefly by psychometric considerations, contributes to test security by limiting the impact of item exposure.

Data Encryption

A major defence against unauthorised access to test content and student test scores is data encryption.
All of the items and export files are encrypted. Without the appropriate decryption code, it is practically impossible to read the STAR Reading data or access or change it with other software.

Access Levels and Capabilities

Each user’s level of access to a Renaissance Place program depends on the primary position assigned to that user and the capabilities the user has been granted in the Renaissance Place program. Each primary position is part of a user group. There are six user groups: school network administrator, school network staff, school administrator, school staff, teacher and student.

By default, each user group is granted a specific set of capabilities. Each capability corresponds to one or more tasks that can be performed in the program. The capabilities in these sets can be changed; capabilities can also be granted or removed on an individual level. Since users can be assigned to the school network and/or one or more schools (and be assigned different primary positions at the different locations), and since the capabilities granted to a user can be customised, there are many, varied levels of access an individual user can have.

Renaissance Place RT also allows you to restrict students’ access to certain computers. This prevents students from taking STAR Reading RP tests from unauthorised computers (such as a home computer). For more information on student access security, see the Renaissance Place Software Manual.

The security of the STAR Reading RP data is also protected by each person’s user name (which must be unique) and password. User names and passwords identify users, and the program only allows them access to the data and features that they are allowed based on their primary position and the capabilities that they have been granted.
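The way group defaults combine with individual grants and removals can be sketched as a small data structure. This is only an illustration: the six group names come from the manual, but the capability names (`take_test`, `view_reports` and so on), the `User` class and its fields are hypothetical, and the sketch ignores the case of one user holding different positions at several schools.

```python
from dataclasses import dataclass, field

# The six user groups are named in the manual; the capability names below
# are hypothetical placeholders, since the manual does not list the real sets.
DEFAULT_CAPABILITIES = {
    "school network administrator": frozenset({"manage_schools", "manage_classes", "view_reports"}),
    "school network staff": frozenset({"view_reports"}),
    "school administrator": frozenset({"manage_classes", "view_reports"}),
    "school staff": frozenset({"view_reports"}),
    "teacher": frozenset({"manage_classes", "view_reports"}),
    "student": frozenset({"take_test"}),
}


@dataclass
class User:
    """Hypothetical user record: one group plus individual adjustments."""
    name: str
    group: str
    granted: set = field(default_factory=set)   # capabilities added for this user
    removed: set = field(default_factory=set)   # capabilities taken away from this user

    def capabilities(self) -> set:
        # Start from the group defaults, then apply individual changes.
        return (set(DEFAULT_CAPABILITIES[self.group]) | self.granted) - self.removed

    def can(self, capability: str) -> bool:
        return capability in self.capabilities()


pupil = User("pupil01", "student")
teacher = User("teacher01", "teacher", granted={"manage_schools"}, removed={"view_reports"})
print(pupil.can("take_test"), pupil.can("view_reports"))
print(sorted(teacher.capabilities()))
```

Keeping the individual grants and removals separate from the group defaults means a later change to a group's default set still reaches every user in that group.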
Personnel who log in to Renaissance Place RT (teachers, administrators or staff) must enter a user name and password before they can access the data and create reports. Without an appropriate user name and password, personnel cannot use the STAR Reading RP software.

Test Monitoring/Password Entry

Test monitoring is another useful STAR Reading security feature. Test monitoring is implemented using the Testing Password preference, which specifies whether teaching assistants must enter their passwords at the start of a test. Students are required to enter a user name and password to log in before taking a test. This ensures that students cannot take tests using other students’ names.

Final Caveat

While STAR Reading software can do much to provide specific measures of test security, the most important line of defence against unauthorised access or misuse of the program is the user’s responsibility. Teachers and teaching assistants need to be careful not to leave the program running unattended and to monitor all testing to prevent students from cheating, copying down questions and answers or performing “print screens” during a test session. Taking these simple precautionary steps will help maintain STAR Reading’s security and the quality and validity of its scores.

Test Administration Procedures

In order to ensure consistency and comparability of results to the STAR Reading norms, students taking STAR Reading tests should follow the same administration procedures used by the norming participants. It is also a good idea to make sure that the testing environment is as free from distractions for the student as possible.

During the norming, all of the participants received the same set of test instructions and corresponding graphics contained in the Pretest Instructions included with the STAR Reading product.
These instructions describe the standard test orientation procedures that teachers should follow to prepare their students for the STAR Reading test. These instructions are intended for use with students of all ages; however, the STAR Reading test should only be administered to students who have a reading vocabulary of at least 100 words. The instructions were successfully field-tested with students ranging from the first US grade (Year 2) through the eighth US grade (Year 9). It is important to use these same instructions with all students before they take the STAR Reading test.

Test Interface

The STAR Reading test interface was designed to be both simple and effective. Students can use either the mouse or the keyboard to answer questions.

• If using the keyboard, students press one of the four letter keys (A, B, C and D) and then press the Enter key (or the return key on Macintosh computers).
• If using the mouse, students click the answer of choice and then click Next to enter the answer.

Practice Session

The practice session before the test allows students to get comfortable with the test interface and to make sure that they know how to operate it properly. As soon as a student has answered three practice questions correctly, the program takes the student into the actual STAR Reading test. Even the lowest-level readers should be able to answer the sample questions correctly. If the student has not successfully answered three items by the end of the practice session, STAR Reading will halt the testing session and tell the student to ask the teacher for help. It may be that the student cannot read at even the most basic level, or it may be that the student needs help operating the interface, in which case the teacher should help the student through the practice session the next time.
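The practice-session rule described above can be sketched in a few lines. This is an illustration only: the function name, the return labels and the idea of passing the practice responses as a list of booleans are assumptions, and the manual does not say how many practice questions a session contains.

```python
def run_practice(responses, required_correct=3):
    """Decide what happens after a practice session.

    `responses` is a sequence of booleans, one per practice question in the
    order presented (True = answered correctly). The length of the practice
    session is not stated in the manual, so it is left to the caller.
    """
    correct = 0
    for is_correct in responses:
        if is_correct:
            correct += 1
        # The manual says the test starts as soon as three practice
        # questions have been answered correctly.
        if correct >= required_correct:
            return "start_test"
    # Fewer than three correct by the end of practice: halt and ask for help.
    return "halt_and_ask_teacher"


print(run_practice([True, False, True, True]))
print(run_practice([True, False, False, True]))
```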
Before beginning the next test session with the student, the program will recommend that the teacher assist the student during the practice.

Adaptive Branching/Test Length

STAR Reading’s branching control uses a proprietary approach somewhat more complex than the simple Rasch maximum information IRT model. The STAR Reading approach was designed to yield reliable test results for both the criterion-referenced and norm-referenced scores by adjusting item difficulty to the responses of the individual being tested while striving to minimise test length and student frustration.

In order to minimise student frustration, the first administration of the STAR Reading 4.4 test begins with items that have a difficulty level that is substantially below what a typical student at a given UK year can handle, usually one or two years below year placement. On average, about 86 per cent of students will be able to answer the first item correctly. Teachers can override this typical value by entering an even lower Estimated Instructional Reading Level for the student. On the second and subsequent administrations, the STAR Reading test begins with items that have a difficulty level lower than the previously demonstrated reading ability. Students generally have an 85 per cent chance of answering the first item correctly on second and subsequent tests.

Once the testing session is underway, the test administers 25 items of varying difficulty based on the student’s responses; this is sufficient information to obtain a reliable Scaled Score.

Test Repetition

STAR Reading data can be used for multiple purposes such as screening, placement, planning teaching, benchmarking and outcomes measurement. The frequency with which the assessment is administered depends on the purpose for assessment and how the data will be used. Renaissance Learning recommends assessing students only as frequently as necessary to get the data needed.
Schools that use STAR for screening purposes typically administer it two to five times per year. Teachers who want to monitor student progress more closely or use the data for instructional planning may use it more frequently. STAR may be administered as frequently as weekly for progress monitoring purposes.

STAR Reading keeps track of the questions presented to each student from test session to test session and will not ask the same question more than once in any 90-day period.

Item Time Limits

The STAR Reading test has time-out limits for individual items that are based on a student’s year. Students in Years 1–3 have up to 60 seconds to answer each item during their test sessions (both practice questions and test questions). Students in Years 4–13 are allowed 60 seconds to answer each practice question and 45 seconds to answer each test question. These time-out values are based on latency data obtained during item validation. Very few vocabulary-in-context items at any year had latencies longer than 30 seconds, and almost none (fewer than 0.3%) had latencies of more than 45 seconds. Thus, the time-out limit was set to 45 seconds for most students and increased to 60 seconds for the very young students.

Beginning with version 2.2, STAR Reading provides the option of extended time limits for selected students who, in the judgment of the test administrator, require more than the standard amount of time to read and answer the test questions. Extended time may be a valuable accommodation for English language learners as well as for some students with disabilities. Test users who elect the extended time limit for their students should be aware that STAR Reading technical data, such as reliability and validity, are based on test administration using the standard time limits.
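The time-out rules lend themselves to a short worked sketch. The function name `item_time_limit` and its parameters are illustrative assumptions, not part of the STAR Reading software; the numbers are the ones given in this section, with the extended limit taken as three times the standard limit, as the manual states for the extended time accommodation.

```python
STANDARD_PRACTICE_SECONDS = 60   # all years, practice questions
YOUNG_TEST_SECONDS = 60          # Years 1-3, test questions
OLDER_TEST_SECONDS = 45          # Years 4-13, test questions
EXTENDED_FACTOR = 3              # extended limits are three times the standard


def item_time_limit(year, is_practice, extended=False):
    """Return the per-item time limit in seconds for a UK year (1-13)."""
    if not 1 <= year <= 13:
        raise ValueError("year must be between 1 and 13")
    if is_practice:
        seconds = STANDARD_PRACTICE_SECONDS
    elif year <= 3:
        seconds = YOUNG_TEST_SECONDS
    else:
        seconds = OLDER_TEST_SECONDS
    return seconds * EXTENDED_FACTOR if extended else seconds


for year in (2, 7):
    print(year,
          item_time_limit(year, is_practice=False),
          item_time_limit(year, is_practice=False, extended=True))
```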
When the extended time limit accommodation is elected, students have three times longer than the standard time limits to answer each question. Therefore, students in Years 1–3 with the extended time limit accommodation have up to 180 seconds to answer each item (both practice questions and test questions). Students in Years 4–13 with the extended time limit accommodation have 180 seconds to answer each practice question and 135 seconds to answer each test question.

Regardless of the extended time limit setting, when a student has only 15 seconds remaining for a given item, a time-out warning appears, indicating that the student should make a final selection and move on. Items that time out are counted as incorrect responses unless the student has the correct answer selected when the item times out. If the correct answer is selected at that time, the item will be counted as a correct response.

If a student does not respond to an item, the item times out and briefly gives the student a message describing what has happened. Then the next item is presented. The student does not have an opportunity to take the item again. If a student does not respond to any item, all items are marked as incorrect.

Content and Item Development

Content Development

The content of UK STAR Reading version 4.3 RP is identical to the content in versions 2 and 3. Content development was driven by the test design and test purposes, which are to measure comprehension and general reading achievement. Based on test purpose, the desired content had to meet certain criteria. First, it had to cover a range broad enough to test students from Years 1–13. Thus, items had to represent reading levels ranging all the way from Year 1 to post-upper years. Second, the final collection of test items had to be large enough so that students could test up to five times per year without being given the same items twice.
To adapt the STAR Reading Renaissance Place US item content for STAR Reading Renaissance Place UK, the Renaissance Learning UK Ltd. content development staff reviewed all US items and made recommendations for deletions and modifications. Given that all STAR Reading US authentic-text passage items contain passages from popular US children’s books, all 262 of these items were removed for STAR Reading UK.

The remaining vocabulary-in-context items underwent review by Renaissance Learning UK. Out of a total of 1,159 items in version 2.0, 159 (13.7%) were deleted and 306 (26.4%) underwent slight modifications. The majority of the modifications pertained to language differences. For example, all references to “faucet” became “tap”. Other changes involved spelling (e.g. “airplane” to “aeroplane”) and grammar (“can not” to “cannot”) modifications. The resulting STAR Reading UK test contains 1,000 items.

The Educational Development Laboratory’s Core Vocabulary List: ATOS Graded Vocabulary List

The original point of reference for the development of US STAR Reading items was the 1995 updated vocabulary lists that are based on the Educational Development Laboratory’s (EDL) A Revised Core Vocabulary (1969) of 7,200 words. The EDL vocabulary list is a soundly developed, validated list that is often used by developers of educational instruments to create all types of educational materials and assessments. It categorises hundreds of vocabulary words according to year placement, from reception through post-upper years. This was exactly the span desired for the STAR Reading test.

Beginning with new test items introduced in version 4.3, STAR Reading item developers used ATOS instead of the EDL word list. ATOS is a system for evaluating the reading level of continuous text; it contains 23,000 words in its graded vocabulary list. This readability formula was developed by Renaissance Learning, Inc. and designed by leading readability experts.
ATOS is the first formula to include statistics from actual student book reading (over 30,000 US students, reading almost 1,000,000 books).

Item Development

During item development, every effort was made to avoid the use of stereotypes, potentially offensive language or characterisations and descriptions of people or events that could be construed as being offensive, demeaning, patronising or otherwise insensitive. The editing process also included a strict sensitivity review of all items to attend to issues of gender and ethnic-group balance and fairness.

Vocabulary-in-Context Item Specifications

Once the test design was determined, individual test items were assembled for try-out and calibration. For the STAR Reading US 2.x test, the item try-out and calibration included all 838 vocabulary items from the STAR Reading US 1.x test, plus 836 new vocabulary items created for the STAR Reading US 2.x test. It was necessary to write and test about 100 new questions at each US grade level (year) to ensure that approximately 60 new items per level would be acceptable for the final item collection. (Due to the limited number of primer words available for Year 1, the starting set for this level contained only 30 items.) Having a pool of almost 1,700 vocabulary items allowed significant flexibility in selecting only the best items from each group for the final product.

Each of the vocabulary items was written to the following specifications:

1. Each vocabulary-in-context test item consists of a single-context sentence. This sentence contains a blank indicating a missing word. Three or four possible answers are shown beneath the sentence. For questions developed at a Year 1 reading level, three possible answers are given. Questions at a Year 3 reading level and higher offer four possible answers.

2.
To answer the question, the student selects the word from the answer choices that best completes the sentence. The correct answer option is the word that appropriately fits both the semantics and the syntax of the sentence. All of the incorrect answer options either fit the syntax of the sentence or relate to the meaning of something in the sentence. They do not, however, meet both conditions.

3. The answer blanks are generally located near the end of the context sentence to minimise the amount of rereading required.

4. The sentence provides sufficient context clues for students to determine the appropriate answer choice. However, the length of each sentence varies according to the guidelines shown in Table 1.

Table 1: Grade Level Maximum Sentence Length (Including Sentence Blank)

UK Year          Maximum Sentence Length
Years 1 and 2    10 words
Years 3 and 4    12 words
Years 5–7        14 words
Years 8–14       16 words

5. Typically, the words providing the context clues in the sentence are below the level of the actual test word. However, due to a limited number of available words, not all of the questions at or below Year 3 meet this criterion. Even at these levels, no context words are above the year level of the item.

6. The correct answer option is a word selected from the appropriate year level of the item set. Incorrect answer choices are words at the same test level or one year below.

Through vocabulary-in-context test items, STAR Reading requires students to rely on background information, apply vocabulary knowledge and use active strategies to construct meaning from the assessment text. These cognitive tasks are consistent with what researchers and practitioners describe as reading comprehension.

Item and Scale Calibration

Beginning with STAR Reading version 4.3 RP, the adaptive test item bank consists of 1,792 calibrated test items.
Of these, 626 items are new, and 1,166 items were carried over from the set of 1,409 test items that were developed for use in STAR Reading version 2.0, and used in that and later versions up to and including version 4.1 RP.

The test items in version 4.3 RP were developed and calibrated at two separate times, using very different methods. Items carried over from version 2.0 were calibrated by administering them to national student samples in printed test booklets. Items developed specifically for version 4.3 were calibrated online, by using the newly developed Dynamic Calibration feature to embed them in otherwise normal STAR Reading tests. This chapter describes both item calibration efforts.

Calibration of STAR Reading Items for Use in Version 2.0

This chapter summarises the psychometric research and development undertaken to prepare a large pool of calibrated reading test questions for use in the STAR Reading 2.x test, and to link STAR Reading 2.x scores to the original STAR Reading 1.x score scale. This research took place in two stages: item calibration and score scale calibration. These are described in their respective sections below.

The previous chapter described the design and development of the STAR Reading US 2.x test items. Regardless of how carefully test items are written and edited, it is critical to study how students actually perform on each item. The first large-scale research activity undertaken in creating the test was the item validation program conducted in the US in March 1995. This project provided data concerning the technical and statistical quality of each test item written for the STAR Reading test. The results of the item validation study were used to decide whether item grade assignments, or tags, were correct as obtained from the EDL vocabulary list, or whether they needed to be adjusted up or down based on student response data. This refinement of the item year level tags made the STAR Reading criterion reference more timely.
In STAR Reading US 2.0 development, a large-scale item calibration program was conducted in the spring of 1998. The STAR Reading US 2.0 item calibration study incorporated all of the newly written vocabulary-in-context and authentic text passage items, as well as all 838 vocabulary items in the STAR Reading US 1.x item bank.

Two distinct phases comprised the US item calibration study. The first phase was the collection of item response data from a multi-level national student sample. The second phase involved the fitting of item response models to the data, and developing a single IRT difficulty scale spanning all levels from US grades 1–12 (equivalent to Years 2–13).

Sample Description

The data collection phase of the STAR Reading US 2.x calibration study began with a total item pool of 2,133 items. A nationally representative sample of US students tested these items. A total of 27,807 US students from 247 US schools participated in the item calibration study. Table 2 provides the numbers of US students in each grade (year) who participated in the study.

Table 2: Numbers of US Students Tested by Grade, STAR Reading Item Calibration Study, Spring 1998

US Grade Level    Number of Students Tested
1                 4,037
2                 3,848
3                 3,422
4                 3,322
5                 2,167
6                 1,868
7                 1,126
8                 713
9                 2,030
10                1,896
11                1,326
12                1,715
Not Given         337

Table 3 presents descriptive statistics concerning the make-up of the US calibration sample. This sample included 13,937 males and 13,626 females (244 student records did not include gender information). As Table 3 illustrates, the try-out sample approximated the national school population fairly well.
Table 3: US Sample Characteristics, STAR Reading US 2.x Calibration Study, Spring 1998 (N = 27,807 Students)

                                                     Students
                                                     National %    Sample %
Geographic Region in United States
  Northeast                                          20%           16%
  Midwest                                            24%           34%
  Southeast                                          24%           25%
  West                                               32%           25%
District (School Network) Socioeconomic Status
  Low: 31–100%                                       30%           28%
  Average: 15–30%                                    29%           26%
  High: 0–14%                                        31%           32%
  Nonpublic                                          10%           14%
School Type & District (School Network) Enrolment
  Public: < 200                                      17%           15%
  Public: 200–499                                    19%           21%
  Public: 500–2,000                                  27%           25%
  Public: > 2,000                                    28%           24%
  Nonpublic                                          10%           14%

Table 4 provides information about the ethnic composition of the calibration sample. As Table 4 shows, the students participating in the calibration sample closely approximate the national school population.

Table 4: Ethnic Group Participation, STAR Reading US 2.0 Calibration Study, Spring 1998 (N = 27,807 Students)

                    Students
Ethnic Group        National %    Sample %
Asian               3%            3%
Black               15%           13%
Hispanic            12%           9%
Native American     1%            1%
White               59%           63%
Unclassified        9%            10%

Item Presentation

For the US calibration research study, seven levels of test booklets were constructed corresponding to varying US grade levels. Because reading ability and vocabulary growth are much more rapid in the lower grades, only one grade was assigned per test level for the first four levels of the test (through Year 5, US grade 4). As grade level (year) increases, there is more variation among both students and school curricula, so a single test can cover more than one year. US grades were assigned to test levels after extensive consultation with reading instruction experts as well as considering performance data for items as they functioned in the STAR Reading US 1.x test. Items were assigned to years such that the resulting test forms sampled an appropriate range of reading ability typically represented at or near the targeted US grade levels.
US grade levels corresponding to each of the seven test levels are shown in the first two columns of Table 5. Students answered a set number of questions at their current year, as well as a number of questions one year above and one year below their US grade level. Anchor items were included to allow for vertically scaling the test across the seven test levels. Table 5 breaks down the composition of test forms at each test level in terms of types and number of test questions, as well as the number of calibration test forms at each level.

Table 5: Calibration Test Forms Design by Test Level, STAR Reading US 2.x Calibration Study, Spring 1998

Test     US Grade    Items       Anchor Items    Unique Items    Number of
Level    Levels      per Form    per Form        per Form        Test Forms
A        1           44          21              23              14
B        2           44          21              23              11
C        3           44          21              23              11
D        4           44          21              23              11
E        5–6         44          21              23              14
F        7–9         44          21              23              14
G        10–12       44          21              23              15

Each of the calibration test forms within a test level included a set of 21 anchor items which were common across all test forms within that test level. Anchor items consisted of items: a) on year, b) one year above and c) one year below the targeted US grade level (year). The use of anchor items facilitated equating of both test forms and test levels for purposes of data analysis and the development of the overall score scale.

In addition to the anchor items, each form contained a set of 23 items that were unique to that specific test form (within a level). Items were selected for a specific test level based on STAR Reading US 1.x grade level assignment, EDL vocabulary grade designation or expert judgment.

To avoid problems with positioning effects resulting from the placement of items within each test booklet form, items were shuffled within each test form. This created two variations of each test form such that items appeared in different sequential positions within each “shuffled” test form.
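The form design in Table 5 (21 anchor items plus 23 unique items, presented in two differently ordered variations) can be sketched as follows. This is a hypothetical illustration: the manual does not say how the shuffling was carried out, so the seeded random shuffle, the function name and the item identifiers are all assumptions.

```python
import random

ANCHOR_ITEMS_PER_FORM = 21   # common to every form within a test level
UNIQUE_ITEMS_PER_FORM = 23   # specific to one form


def build_form_variations(anchor_ids, unique_ids, seed=0):
    """Return two orderings of the same 44-item form.

    A seeded random shuffle is assumed here purely for illustration; the
    manual only states that items appeared in different positions.
    """
    if len(anchor_ids) != ANCHOR_ITEMS_PER_FORM:
        raise ValueError("expected 21 anchor items")
    if len(unique_ids) != UNIQUE_ITEMS_PER_FORM:
        raise ValueError("expected 23 unique items")
    rng = random.Random(seed)
    items = list(anchor_ids) + list(unique_ids)
    first = items[:]
    rng.shuffle(first)
    second = items[:]
    rng.shuffle(second)
    while second == first:       # guarantee the two variations really differ
        rng.shuffle(second)
    return first, second


anchors = [f"anchor-{i}" for i in range(21)]      # hypothetical item IDs
uniques = [f"form3-item-{i}" for i in range(23)]  # hypothetical item IDs
variation_a, variation_b = build_form_variations(anchors, uniques)
print(len(variation_a), variation_a[:3], variation_b[:3])
```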
Since the final items would be administered as part of a computer-adaptive test, it was important to remove any effects of item positioning from the calibration data so that each item could be administered at any point during the test.

The number of field test forms constructed for each of the seven test levels is shown in the last column of Table 5 (varying from 11–15 forms per level). Calibration test forms were spiralled within a class such that each student received a test form essentially at random. This design ensured that no more than two or three students in any class attempted any particular try-out item. Additionally, it ensured a balance of student ability across the various try-out forms. Typically, 250–300 students at the designated US grade level of the test item received a given question on their test.

It is important to note that the majority of questions in the STAR Reading US 2.x calibration study already had some performance data on them. All of the questions from the STAR Reading US 1.x item bank were included, as were many items that were previously field tested, but were not included in the STAR Reading US 1.x test.

Following extensive quality control checks, the STAR Reading US 2.x calibration research item response data were analysed, by level, using both traditional item analysis techniques and IRT methods.
For each test item, the following information was derived using traditional psychometric item analysis techniques:

• The number of students who attempted to answer the item
• The number of students who did not attempt to answer the item
• The percentage of students who answered the item correctly (a traditional measure of difficulty)
• The percentage of students who selected each answer choice
• The correlation between answering the item correctly and the total score (a traditional measure of item discrimination)
• The correlation between the endorsement of an alternative answer and the total score

Item Difficulty

The difficulty of an item, in traditional item analysis, is the percentage of students who answer the item correctly. This is typically referred to as the “p-value” of the item. Low p-values (such as 15%) indicate that the item is difficult since only a small percentage of students answered it correctly. High p-values (such as 90%) indicate that the majority of students answered the item correctly, and thus the item is easy. It should be noted that the p-value only has meaning for a particular item relative to the characteristics of the sample of students who responded to it.

Item Discrimination

The traditional measure of the discrimination of an item is the correlation between the mark on the item (correct or incorrect) and the total test score. Items that correlate well with total test score also tend to correlate well with one another and produce a test that is more reliable (more internally consistent). For the correct answer, the higher the correlation between item mark and total score, the better the item is at discriminating between low scoring and high scoring students. Such items generally will produce optimal test performance.
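The two traditional statistics described here, the p-value and the item-total correlation, can be computed with a few lines of code. The sketch below is a generic illustration rather than the procedure actually used in the calibration study; the function names and the tiny data set are made up, and it is an assumption that the total score includes the item itself, since the manual does not say.

```python
from math import sqrt


def p_value(item_marks):
    """Proportion of students answering correctly (marks are 0 or 1)."""
    return sum(item_marks) / len(item_marks)


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def item_discrimination(item_marks, total_scores):
    """Correlation between the 0/1 item mark and the total test score."""
    return pearson(item_marks, total_scores)


# Made-up data: four students, one item, plus one distractor's endorsement.
marks = [1, 1, 0, 0]             # 1 = chose the correct answer
chose_distractor = [0, 0, 1, 1]  # 1 = chose a particular wrong answer
totals = [10, 8, 4, 2]           # total test scores

print(p_value(marks))
print(round(item_discrimination(marks, totals), 3))
print(round(item_discrimination(chose_distractor, totals), 3))
```

In this toy example the correct answer correlates positively with total score while the distractor correlates negatively, which is the pattern a well-behaved item is expected to show.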
When the correlation between the correct answer and total test score is low (or negative), it typically indicates that the item is not performing as intended. The correlation between endorsing incorrect answers and total score should generally be low since there should not be a positive relationship between selecting an incorrect answer and scoring higher on the overall test.

Item Response Function

In addition to traditional item analyses, the STAR Reading calibration data were analysed using Item Response Theory (IRT) methods. Although IRT encompasses a family of mathematical models, the one-parameter (or Rasch) IRT model was selected for the STAR Reading 2.x data both for its simplicity and its ability to accurately model the performance of the STAR Reading 2.x items.

IRT attempts to model quantitatively what happens when a student with a specific level of ability attempts to answer a specific question. IRT calibration places the item difficulty and student ability on the same scale; the relationship between them can be represented graphically in the form of an item response function (IRF), which describes the probability of answering an item correctly as a function of the student’s ability and the difficulty of the item.

Figure 1 is a plot of three item response functions: one for an easy item, one for a more difficult one and one for a very difficult item. Each plot is a continuous S-shaped (ogive) curve. The horizontal axis is the scale of student ability, ranging from very low ability (–5.0 on the scale) to very high ability (+5.0 on the scale). The vertical axis is the percentage of students expected to answer each of the three items correctly at any given point on the ability scale. Notice that the expected percentage correct increases as student ability increases, but varies from one item to another.

In Figure 1, each item’s difficulty is the scale point where the expected percentage correct is exactly 50.
These points are depicted by vertical lines going from the 50% point to the corresponding locations on the ability scale. The easiest item has a difficulty scale value of about –1.67; this means that students located at –1.67 on the ability scale have a 50-50 chance of answering that item right. The scale values of the other two items are approximately +0.20 and +1.25, respectively.

Calibration of test items estimates the IRT difficulty parameter for each test item and places all of the item parameters onto a common scale. The difficulty parameter for each item is estimated, along with measures to indicate how well the item conforms to (or “fits”) the theoretical expectations of the presumed IRT model.

Also plotted in Figure 1 are “empirical item response functions (EIRF)”: the actual percentages of correct responses of groups of students to all three items. Each group is represented as a small triangle, circle or diamond. Each of those geometric symbols is a plot of the percentage correct against the average ability level of the group. Ten groups’ data are plotted for each item; the triangular points represent the groups responding to the easiest item. The circles and diamonds, respectively, represent the groups responding to the moderate and to the most difficult item.

Figure 1: Example of Item Statistics Database Presentation of Information

For purposes of the STAR Reading 2.x calibration research, two different “fit” measures (both unweighted and weighted) were computed. Additionally, if the IRT model is functioning well, then the EIRF points should approximate the (estimated) theoretical IRF.
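The curves and points described for Figure 1 can be reproduced in outline. The sketch below uses the standard logistic form of the Rasch model, P = 1 / (1 + exp(-(ability - difficulty))), with no extra scaling constant; that exact parameterisation is an assumption, since the manual names the model but does not print the formula. The three difficulties are the approximate values quoted above. Forming the ten groups by sorting students on ability and cutting into equal-sized blocks is also an assumption, as are the function names.

```python
from math import exp


def rasch_probability(ability, difficulty):
    """Rasch item response function: probability of a correct answer."""
    return 1.0 / (1.0 + exp(-(ability - difficulty)))


# Approximate difficulty values quoted for the three items in Figure 1.
ITEM_DIFFICULTIES = {"easy": -1.67, "moderate": 0.20, "hard": 1.25}


def empirical_irf(abilities, responses, n_groups=10):
    """Return (mean ability, proportion correct) for ability-ordered groups.

    Equal-sized groups are an assumption; the manual only says that ten
    groups' data were plotted for each item.
    """
    if len(abilities) != len(responses):
        raise ValueError("abilities and responses must have the same length")
    if len(abilities) < n_groups:
        raise ValueError("need at least one student per group")
    paired = sorted(zip(abilities, responses))
    points = []
    for g in range(n_groups):
        start = g * len(paired) // n_groups
        stop = (g + 1) * len(paired) // n_groups
        group = paired[start:stop]
        mean_ability = sum(a for a, _ in group) / len(group)
        proportion_correct = sum(r for _, r in group) / len(group)
        points.append((mean_ability, proportion_correct))
    return points


# At an ability equal to the item's difficulty, the chance of success is 50%.
for name, b in ITEM_DIFFICULTIES.items():
    print(name, rasch_probability(b, b), round(rasch_probability(0.0, b), 3))
```

Plotting `rasch_probability` over abilities from -5.0 to +5.0 gives the S-shaped curves, and overlaying the `empirical_irf` points shows how closely observed responses follow the model.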
Thus, in addition to the traditional item analysis information, the following IRT-related information was determined for each item administered during the calibration research analyses:

• The IRT item difficulty parameter
• The unweighted measure of fit to the IRT model
• The weighted measure of fit to the IRT model
• The theoretical and empirical IRF plots

Rules for Item Retention

Following these analyses, each test item, along with both traditional and IRT analysis information (including IRF and EIRF plots) and information about the test level, form and item identifier, was stored in an item statistics database. A panel of US content reviewers then examined each item, within content strands, to determine whether the item met all criteria for inclusion in the bank of items that would be used in the US norming version of the STAR Reading US 2.x test. The item statistics database allowed experts easy access to all available information about an item in order to interactively designate items that, in their opinion, did not meet acceptable standards for inclusion in the STAR Reading US 2.x item bank.

US item selection was completed based on the following criteria. Items were eliminated when:

• Item-total correlation (item discrimination) was less than 0.30
• Some other answer option had an item discrimination that was high
• Sample size of students attempting the item was less than 300
• The traditional item difficulty indicated that the item was too difficult or too easy
• The item did not appear to fit the Rasch IRT model

After each US content reviewer had designated certain items for elimination, their recommendations were combined and a second review was conducted to resolve issues where there was not uniform agreement among all reviewers.
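The elimination criteria listed above could be screened automatically before reviewers look at an item. The sketch below is an illustration only, not the procedure the reviewers used. The two numeric cut-offs (0.30 and 300) come from the criteria; the remaining thresholds and all names (`ItemStats`, `elimination_reasons`, the `MAX_`/`MIN_` constants) are hypothetical, because the manual leaves "high", "too difficult or too easy" and model fit to reviewer judgement.

```python
from dataclasses import dataclass
from typing import List

# Numeric cut-offs stated in the retention criteria.
MIN_ITEM_TOTAL_CORRELATION = 0.30
MIN_SAMPLE_SIZE = 300

# Hypothetical cut-offs: the manual gives no numbers for these criteria.
MAX_DISTRACTOR_CORRELATION = 0.20
MIN_PROPORTION_CORRECT = 0.10
MAX_PROPORTION_CORRECT = 0.95


@dataclass
class ItemStats:
    item_id: str
    item_total_correlation: float      # point-biserial for the correct answer
    max_distractor_correlation: float  # highest point-biserial among wrong options
    sample_size: int
    proportion_correct: float          # traditional item difficulty (p-value)
    fits_rasch_model: bool             # reviewer's judgement from the fit measures


def elimination_reasons(item: ItemStats) -> List[str]:
    """Return the criteria an item fails; an empty list means no flag was raised."""
    reasons = []
    if item.item_total_correlation < MIN_ITEM_TOTAL_CORRELATION:
        reasons.append("item-total correlation below 0.30")
    if item.max_distractor_correlation > MAX_DISTRACTOR_CORRELATION:
        reasons.append("an incorrect option discriminates highly")
    if item.sample_size < MIN_SAMPLE_SIZE:
        reasons.append("fewer than 300 students attempted the item")
    if not MIN_PROPORTION_CORRECT <= item.proportion_correct <= MAX_PROPORTION_CORRECT:
        reasons.append("item too difficult or too easy")
    if not item.fits_rasch_model:
        reasons.append("item does not appear to fit the Rasch model")
    return reasons
```

A screen of this kind only flags candidates; in the study, the final decision rested with the panel of content reviewers and the second review.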
Of the initial 2,133 items administered in the STAR Reading US 2.0 calibration research study, 1,409 were deemed of sufficient quality to be retained for further analyses. Traditional item-level analyses were conducted again on the reduced data set that excluded the eliminated items. IRT calibration was also performed on the reduced data set and all test forms and levels were equated based on the information provided by the embedded anchor items within each test form. This resulted in placing the IRT item difficulty parameters for all items onto a single scale spanning US grades 1–12 (Years 2–13).

Table 6 summarises the final analysis information for the test items included in the US calibration test forms by test level (A–G). As shown in the table, the item placements in test forms were appropriate: the average percentage of students correctly answering items is relatively constant across test levels. Note, however, that the average scaled difficulty of the items increases across successive levels of the calibration tests, as does the average scaled ability of the students who answered questions at each test level. The median point-biserial correlation, as shown in the table, indicates that the test items were performing well.
Table 6: US Calibration Test Item Summary Information by Test Level
STAR Reading US 2.x Calibration Study, Spring 1998

Test   US Grade  Number    Sample  Average    Median     Median          Average Scaled  Average Scaled
Level  Level(s)  of Items  Size    % Correct  % Correct  Point-Biserial  Difficulty      Ability
A      1         343       4,226   67         75         0.56            –3.61           –2.36
B      2         274       3,911   78         88         0.55            –2.35           –0.07
C      3         274       3,468   76         89         0.51            –1.60           0.76
D      4         274       3,340   69         81         0.51            –0.14           1.53
E      5–6       343       4,046   62         73         0.47            1.02            2.14
F      7–9       343       3,875   68         76         0.48            2.65            4.00
G      10–12     366       4,941   60         60         0.37            4.19            4.72

Computer-Adaptive Test Design

The third phase of content specification is determined by the student’s performance during testing. In the conventional paper-and-pencil standardised test, items retained from the item try-out or item calibration study are organised by level, then each student takes all items within a given test level. Thus, the student is only tested on reading skills deemed to be appropriate for the student’s US grade level (year). In computer-adaptive tests like the STAR Reading US 2.x test and the STAR Reading UK test, the items taken by a student are dynamically selected in light of that student’s performance during the testing session. Thus, a low-performing student may branch to easier items in order to better estimate the student’s reading achievement level. High-performing students may branch to more challenging reading items in order to better determine the breadth of their reading skills and their reading achievement level.

During an adaptive test, a student may be “routed” to items at the lowest reading level or to items at higher reading levels within the overall pool of items, depending on the student’s performance during the testing session. In general, when an item is answered correctly, the student is then given a more difficult item.
When an item is answered incorrectly, the student is then given an easier item. Item difficulty here is defined by results of the STAR Reading US item calibration study. All STAR Reading tests between version 2.0 and 4.3 RP, inclusive, administer a fixed-length, 25-item, computer-adaptive test.

Students who have not taken a STAR Reading test within six months initially receive an item whose difficulty level is relatively easy for students at that year. The selection of an item that is a bit easier than average minimises any effects of initial anxiety that students may have when starting the test and serves to better facilitate the student’s initial reactions to the test. These starting points vary by year and were based on research conducted as part of the US national item calibration study.

When a student has taken a STAR Reading test within the last six months, the difficulty of the first item depends on that student’s previous STAR Reading test score information. After the administration of the initial item, and after the student has entered an answer, STAR Reading software estimates the student’s reading ability. The software then selects the next item randomly from among all of the items available that closely match the student’s estimated reading ability. (See Table 1 on page 14 for converting US grade levels to UK years.)

Randomisation of items with difficulty values near the student’s adjusted reading ability allows the program to avoid overexposure of test items. All items are dynamically selected from an item bank consisting of all the retained vocabulary-in-context items. Items that have been administered to the same student within the past six-month time period are not available for administration.
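The selection rule just described (a random choice among available items whose difficulty is close to the current ability estimate, excluding items the student has seen in the past six months) might be sketched as follows. This is a hedged illustration, not the STAR Reading algorithm: the function name, the `window` parameter that defines "closely match", and the nearest-item fallback are all assumptions made for the example.

```python
import random
from typing import Dict, Optional, Set


def select_next_item(
    ability_estimate: float,
    item_difficulties: Dict[str, float],
    recently_administered: Set[str],
    window: float = 0.25,  # hypothetical half-width for "closely match"
    rng: Optional[random.Random] = None,
) -> str:
    """Pick the next item at random from those near the ability estimate."""
    rng = rng or random.Random()

    # Items given to this student in the past six months are not available.
    available = {
        item_id: difficulty
        for item_id, difficulty in item_difficulties.items()
        if item_id not in recently_administered
    }
    if not available:
        raise ValueError("no items available for administration")

    # Candidates whose difficulty lies within the window around the estimate.
    close = [
        item_id
        for item_id, difficulty in available.items()
        if abs(difficulty - ability_estimate) <= window
    ]
    if not close:
        # Assumed fallback: use the single nearest available item.
        close = [min(available, key=lambda i: abs(available[i] - ability_estimate))]

    # Random choice among close matches limits overexposure of any one item.
    return rng.choice(sorted(close))
```

Choosing at random within a window, rather than always taking the single best-matching item, is what spreads exposure across the item bank.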
The large number of items available in the item pools, however, ensures that this minor constraint has negligible impact on the quality of each STAR Reading computer-adaptive test.

Scoring in the STAR Reading Tests

Following the administration of each STAR Reading item, and after the student has selected an answer, an updated estimate of the student’s reading ability is computed based on the student’s responses to all items that have been administered up to that point. A proprietary Bayesian-modal Item Response Theory (IRT) estimation method is used for scoring until the student has answered at least one item correctly and one item incorrectly. Once the student has met the 1-correct/1-incorrect criterion, STAR Reading software uses a proprietary Maximum-Likelihood IRT estimation procedure to avoid any potential for bias in the Scaled Scores.

This approach to scoring enables the STAR Reading 3.x RP and higher test to provide Scaled Scores that are statistically consistent and efficient. Accompanying each Scaled Score is an associated measure of the degree of uncertainty, called the standard error of measurement (SEM). Unlike a conventional paper-and-pencil test, the SEM values for the STAR Reading test are unique for each student. SEM values are dependent on the particular items the student received and on their performance on those items.

Scaled Scores are expressed on a common scale that spans all UK years covered by STAR Reading 3.x RP and higher (Years 1–13). Because of this common scale, Scaled Scores are directly comparable with each other, regardless of year.

Scale Calibration

The outcome of the US item calibration study described above was a sizeable bank of test items suitable for use in the STAR Reading test, with an IRT difficulty scale parameter for each item.
The difficulty scale itself was devised such that it spanned a range of item difficulty from US kindergarten level through US grade level 12 (Years 1–13). An important feature of Item Response Theory is that the same scale used to characterise the difficulty of the test items is also used to characterise examinees’ ability; in fact, IRT models express the probability of a correct response as a function of the difference between the scale values of an item’s difficulty and an examinee’s ability. The IRT ability/difficulty scale is continuous; in the STAR Reading US 2.x norming, described in “Score Definitions” on page 29, the values of observed ability ranged from about –7.3 to +9.2, with the zero value occurring at about the US sixth-grade level (Year 7).

The Linking Study

4,589 US students from around the country, spanning all 12 US grades, participated in the linking study. Linking study participants took both STAR Reading US 1.x and STAR Reading US 2.x tests within a few days of each other. The order in which they took the two test versions was counterbalanced to account for the effects of practice and fatigue. Test score data collected were edited for quality assurance purposes, and 38 cases with anomalous data were eliminated from the linking analyses; the linking was accomplished using data from 4,551 cases.

The linking of the two score scales was accomplished by means of an equipercentile equating involving all 4,551 cases, weighted to account for differences in sample sizes across US grades. The resulting table of 99 sets of equipercentile equivalent scores was then smoothed using a monotonic spline function, and that function was used to derive a table of Scaled Score equivalents corresponding to the entire range of IRT ability scores observed in the norming study. These STAR Reading US 2.x Scaled Score equivalents range from 0 to 1400.
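The core idea of equipercentile equating, pairing the scores that sit at the same percentile in two distributions, can be illustrated with the Python standard library. The sketch below is a simplified stand-in for the procedure described above and rests on several assumptions: it ignores the weighting by US grade, it replaces the monotonic spline smoothing with plain linear interpolation between the 99 tabled points, and the function names are invented for the example.

```python
import statistics
from bisect import bisect_right
from typing import List, Sequence, Tuple


def equipercentile_table(
    scores_x: Sequence[float], scores_y: Sequence[float]
) -> List[Tuple[float, float]]:
    """Pair the 1st to 99th percentile points of two score distributions."""
    x_cuts = statistics.quantiles(scores_x, n=100)  # 99 cut points
    y_cuts = statistics.quantiles(scores_y, n=100)
    return list(zip(x_cuts, y_cuts))


def convert(x: float, table: List[Tuple[float, float]]) -> float:
    """Convert a score on scale X to scale Y by linear interpolation.

    Scores outside the tabled range are clamped to the end points; no
    extrapolation is attempted in this sketch.
    """
    xs = [pair[0] for pair in table]
    ys = [pair[1] for pair in table]
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # guarantees xs[i - 1] <= x < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

In the setting of the linking study, `scores_x` would play the role of the IRT ability estimates and `scores_y` the STAR Reading US 1.x Scaled Scores obtained by the same students.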
STAR Reading UK uses the same Scaled Score that was developed for STAR Reading US 3.x RP and higher.¹

Summary statistics of the test scores of the 4,551 cases included in the US linking analysis are listed in Table 7. The table lists actual STAR Reading US 1.x Scaled Score means and standard deviations, as well as the same statistics for STAR Reading US 2.x IRT ability estimates and equivalent Scaled Scores calculated using the conversion table from the linking study. Comparing the STAR Reading US 1.x Scaled Score means to the IRT ability score means illustrates how different the two metrics are. Comparing the STAR Reading US 1.x Scaled Score means to the STAR Reading US 2.x Equivalent Scaled Scores in the rightmost two columns of Table 7 illustrates how successful the scale linking was.

1. Data from the linking study made it clear that STAR Reading US 2.x software measures ability levels extending beyond the minimum and maximum STAR Reading US 1.x Scaled Scores. In order to retain the superior bandwidth of STAR Reading US 2.x software, extrapolation procedures were used to extend the Scaled Score range below 50 and above 1,350.

Table 7: Summary Statistics of STAR Reading US 1.x and 2.x Scores from the Linking Study, by US Grade, Spring 1999 (N = 4,551 Students)

                      STAR Reading US 1.x   STAR Reading US 2.x   STAR Reading US 2.x
                      Scaled Scores         IRT Ability Scores    Equivalent Scale Scores
US Grade   Sample
Level      Size       Mean      S.D.        Mean      S.D.        Mean      S.D.
1          284        216       95          –1.98     1.48        208       109
2          772        339       115         –0.43     1.60        344       148
3          476        419       128         0.33      1.53        419       153
4          554        490       152         0.91      1.51        490       187
5          520        652       176         2.12      1.31        661       213
6          219        785       222         2.98      1.29        823       248
7          702        946       228         3.57      1.18        943       247
8          545        958       285         3.64      1.40        963       276
9          179        967       301         3.51      1.59        942       292
10         81         1,079     292         4.03      1.81        1,047     323
11         156        1,031     310         3.98      1.53        1,024     287
12         63         1,157     299         4.81      1.42        1,169     229
1–12       4,551      656       345         1.73      2.36        658       353

Table 8 contains an excerpt from the IRT ability to Scaled Score conversion table that was developed in the course of the US linking study.

Table 8: Example IRT Ability to Equivalent Scaled Score Conversions from the US Linking Study

IRT Ability                Equivalent
From        To             Scaled Score
–6.2845     –6.2430        50
–3.1790     –3.1525        100
–2.5030     –2.4910        150
–1.9030     –1.8910        200
–1.2955     –1.2840        250
–0.7075     –0.6980        300
–0.1805     –0.1715        350
0.3390      0.3490         400
0.7600      0.7695         450
1.2450      1.2550         500
1.6205      1.6270         550
1.9990      2.0045         600
2.3240      2.3300         650
2.5985      2.6030         700
2.8160      2.8185         750
3.0090      3.0130         800
3.2120      3.2180         850
3.4570      3.4635         900
3.7435      3.7485         950
3.9560      3.9580         1,000
4.2120      4.2165         1,100
4.3645      4.3680         1,150
4.5785      4.5820         1,200
4.8280      4.8345         1,250
5.0940      5.1020         1,300
7.5920      7.6340         1,350
9.6870      and above      1,400

Dynamic Calibration

An important new feature, dynamic calibration, has been added to the assessment. This new feature will allow response data on new test items to be collected during STAR testing sessions, for the purpose of field testing and calibrating those items. When dynamic calibration is active, it works by embedding one or more new items at random points during a STAR test. These items do not count towards the student’s STAR test score, but item responses are stored for later psychometric analysis.
Students may take as many as three additional items per test; in some cases, no additional items will be administered. On average, this will only increase the testing time by one to two minutes. The new, non-calibrated items will not count toward the student’s final scores, but will be analysed in conjunction with the responses of hundreds of other students from across the country. Student identification does not enter into the analyses; they are statistical analyses only. The response data collected on new items will allow for frequent evaluation of new item content, and will contribute to continuous improvement in STAR tests’ assessment of student performance.

Score Definitions

STAR Reading software provides two broad types of scores: criterion-referenced scores and norm-referenced scores. For informational purposes, the full range of STAR Reading criterion-referenced and norm-referenced scores is described in this chapter.

Types of Test Scores

In a broad sense, STAR Reading software provides two different types of test scores that measure student performance in different ways:

• Criterion-referenced scores measure student performance by comparing it to a standard criterion. This criterion can come in any number of forms; common criterion foundations include material covered in a specific text, lecture or course. It could also take the form of curriculum or district educational standards. These scores provide a measure of student achievement compared with a fixed criterion; they do not provide any measure of comparability to other students. The criterion-referenced score reported by STAR Reading US software is the Instructional Reading Level, which compares a student’s test performance to 1995 updated vocabulary lists that were based on the EDL Core Vocabulary.
• Norm-referenced scores compare a student’s test results with the results of other students who have taken the same test. In this case, scores provide a relative measure of student achievement compared to the performance of a group of students at a given time. Norm Referenced Standardised Score, Percentile Ranks and Year Equivalents are the primary norm-referenced scores available in STAR Reading software.

Estimated Oral Reading Fluency (Est. ORF)

Estimated Oral Reading Fluency is an estimate of a student’s ability to read words quickly and accurately in order to comprehend text efficiently. Students with oral reading fluency demonstrate accurate decoding, automatic word recognition and appropriate use of the rhythmic aspects of language (e.g., intonation, phrasing, pitch and emphasis). Estimated ORF is reported in correct words per minute, and is based on the correlation between STAR Reading performance and oral reading scores from a recent study that measured student oral reading using a popular assessment. Estimated ORF is only reported for students in Years 2–5.

Lexile® Measures

In cooperation with MetaMetrics®, beginning in August 2014, users of STAR Reading will have the option of including Lexile measures and Lexile ZPD ranges on certain STAR Reading score reports. Reported Lexile measures will range from BR400L to 1825L. (The “L” suffix identifies the score as a Lexile measure. Where it appears, the “BR” prefix indicates a score that is below 0 on the Lexile scale; such scores are typical of beginning readers.)

Lexile ZPD Ranges

A Lexile ZPD range is a student’s ZPD Range converted to MetaMetrics’ Lexile scale of the readability of text. When a STAR Reading user opts to report student reading abilities in the Lexile metric, the ZPD range will also be reported in that same metric.
The reported Lexile ZPD ranges are equivalent to the grade level ZPD ranges used in STAR Reading and Accelerated Reader, expressed on the Lexile scale instead of as ATOS reading grade levels.

Lexile Measures of Students and Books: Measures of Student Reading Achievement and Text Readability

The ability to read and comprehend written text is important for academic success. Students may, however, benefit most from reading materials that match their reading ability/achievement: reading materials that are neither too easy nor too hard so as to maximise learning. To facilitate students’ choices of appropriate reading materials, measures commonly referred to as readability measures are used in conjunction with students’ reading achievement measures.

A text readability measure can be defined as a numeric scale, often derived analytically, that takes into account text characteristics that influence text comprehension or readability. An example of a readability measure is an age-level estimate of text difficulty. Among text characteristics that can affect text comprehension are sentence length and word difficulty.

A person’s reading measure is a numeric score obtained from a reading achievement test, usually a standardised test such as STAR Reading. A person’s reading score quantifies his/her reading achievement level at a particular point in time.

Matching a student with text/books that target a student’s interest and level of reading achievement is a two-step process: first, a student’s reading achievement score is obtained by administering a standardised reading achievement test; second, the reading achievement score serves as an entry point into the readability measure to determine the difficulty level of text/books that would best support independent reading for the student.
Optimally, a readability measure should match students with books that they are able to read and comprehend independently without boredom or frustration: books that are engaging yet slightly challenging to students based on the students’ reading achievement and grade level.

Renaissance Learning’s (RLI) readability measure is known as the Advantage/TASA Open Standard for Readability (ATOS). The ATOS for Text readability formula was developed through extensive research by RLI in conjunction with Touchstone Applied Science Associates, Inc. (TASA), now called Questar Assessment, Inc. A great many school libraries use ATOS book levels to index readability of their books. ATOS book levels, which are derived from ATOS for Books measures, express readability as year levels; for example, an ATOS readability measure of 4.2 means that the book is at a difficulty level appropriate for students reading at a typical level of students in year 5, month 2. To match students to books at an appropriate level, the widely used Accelerated Reader system uses ATOS measures of readability and students’ Grade Equivalent (GE) scores on standardised reading tests such as STAR Reading.

Another widely used system for matching readers to books at appropriate difficulty levels is The Lexile® Framework® for Reading, developed by MetaMetrics, Inc. The Lexile scale is a common scale for both text measure (readability or text difficulty) and reader measure (reading achievement scores); in the Lexile Framework, both text difficulty and person reading ability are measured on the same scale. Unlike ATOS for Books, the Lexile Framework expresses a book’s reading difficulty level (and students’ reading ability levels) on a continuous scale ranging from below 0 to 1825 or more.
Because some schools and school libraries use the Lexile Framework to index the reading difficulty levels of their books, there was a need to provide users of STAR Reading with a student reading ability score compatible with the Lexile Framework.

In 2014, MetaMetrics, Inc., developed a means to translate STAR Reading scale scores into equivalent Lexile measures of student reading ability. To do so, more than 200 MetaMetrics reading test items that had already been calibrated on the Lexile scale were administered in small numbers as unscored scale anchor items at the end of STAR Reading tests. More than 250,000 students in grades 1 through 12 took up to 6 of those items as part of their STAR Reading tests in April 2014. MetaMetrics’ analysis of the STAR Reading and Lexile anchor item response data yielded a means of transforming STAR Reading’s underlying Rasch scores into equivalent Lexile scores. That transformation, in turn, was used to develop a concordance table listing the Lexile equivalent of each unique STAR Reading scale score.

In some cases, a range of text/book reading difficulty in which a student can read independently or with minimal guidance is desired. At RLI, we define the range of reading difficulty level that is neither too hard nor too easy as the Zone of Proximal Development (ZPD). The ZPD range allows, potentially, optimal learning to occur because students are engaged and appropriately challenged by reading materials that match their reading achievement and interest. The ZPD range is simply an approximation of the range of reading materials that is likely to benefit the student most. ZPD ranges are not absolute and teachers should also use their objective judgment to help students select reading books that enhance learning.

In a separate linking procedure, MetaMetrics compared the ATOS readability measures of thousands of books to the Lexile measures of the same books.
Analysis of those data yielded a table of equivalence between ATOS reading grade levels and Lexile readability measures. That equivalence table supports matching students to books regardless of whether a book’s readability is measured using the Renaissance Learning ATOS system or the Lexile Framework created by MetaMetrics. Additionally, it supports translating ATOS ZPD ranges into equivalent ZPD ranges expressed on the Lexile scale.

National Curriculum Level–Reading (NCL–R)

The National Curriculum Level in Reading is a calculation of a student’s standing on the National Curriculum based on the student’s STAR Reading performance. This score is based on the demonstrated relationship between STAR Reading scale scores and teachers’ judgments, as expressed in their Teacher Assessments (TA) of students’ attained skills. NCL–R should not be taken to be the student’s actual national curriculum level, but rather the curriculum level at which the child is most likely performing. Stating this another way, the NCL–R from STAR Reading is an estimate of the individual’s standing in the national curriculum framework based on a modest number of STAR Reading test items, selected to match the student’s estimated ability level. It is meant to provide information useful for decisions with respect to a student’s present level of functioning.

The NCL–R score is reported in the following format: the estimated national curriculum level followed by a sublevel category, labelled a, b or c. The sublevels can be used to monitor student progress more finely, as they provide an indication of how far a student has progressed within a specific national curriculum level. For instance, an NCL–R of 4c indicates that the student is estimated to have just attained level 4, while a student with a 4a is estimated to be approaching level 5. Table 9 shows the correspondence between NCL–R scores and Scaled Scores.
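The correspondence in Table 9 amounts to a range lookup, which can be sketched as below. This is an illustration only; the function and list names are invented for the example. One assumption should be noted: Table 9 as printed lists overlapping ranges for 2a/3c (263–360) and 3b (351–456), and because the sketch keys on the lower bound of each range, Scaled Scores of 351–360 are assigned to 3b here.

```python
from bisect import bisect_right

# Lower bound of each Scaled Score range in Table 9, in ascending order,
# with the NCL-R score reported for that range.
NCLR_LOWER_BOUNDS = [0, 91, 105, 263, 351, 457, 535, 700, 896, 1232, 1337, 1347]
NCLR_SCORES = [
    "1b", "1a/2c", "2b", "2a/3c", "3b", "3a/4c",
    "4b", "4a/5c", "5b", "5a/6c", "6b", "6a/7c",
]


def ncl_r(scaled_score: int) -> str:
    """Return the NCL-R score for a STAR Reading Scaled Score (0 to 1400)."""
    if not 0 <= scaled_score <= 1400:
        raise ValueError("Scaled Scores range from 0 to 1400")
    # Index of the last lower bound that does not exceed the score.
    index = bisect_right(NCLR_LOWER_BOUNDS, scaled_score) - 1
    return NCLR_SCORES[index]
```

For example, `ncl_r(700)` returns the transition category `"4a/5c"`, the lowest Scaled Score in that range.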
Table 9: Correspondence of NCL–R Scores to Scaled Scores

NCL–R Score   Scaled Score Range     NCL–R Score   Scaled Score Range
1b            0–90                   4b            535–699
1a/2c         91–104                 4a/5c         700–895
2b            105–262                5b            896–1231
2a/3c         263–360                5a/6c         1232–1336
3b            351–456                6b            1337–1346
3a/4c         457–534                6a/7c         1347–1400

It is sometimes difficult to identify whether or not a student is in the top of one level (for instance, 4a) or just beginning the next highest level (for instance, 5c). Therefore, a transition category is used to indicate that a student is performing around the cusp of two adjacent levels. These transition categories are identified by a concatenation of the contiguous levels and sublevel categories. For instance, a student whose skills appear to range between levels 4 and 5, indicating they are probably starting to transition from one level to the next, would obtain an NCL–R of 4a/5c. These transition scores are provided only at the junction of one level and the next highest.

A student’s actual NCL is obtained through national testing and assessment protocols. The estimated score is meant to provide information useful for decisions with respect to a student’s present level of functioning when no current value of the actual NCL is available.

Norm Referenced Standardised Score (NRSS)

The Norm Referenced Standardised Score is an age standardised score that converts a student’s “raw score” to a standardised score which takes into account the student’s age in years and months and gives an indication of how the student is performing relative to a national sample of students of the same age. The average score is 100. A higher score is above average and a lower score is below average.

Percentile Rank (PR) and Percentile Rank Range

Percentile Ranks range from 1–99 and express student ability relative to the scores of other students of a similar age.
For a particular student, this score indicates the percentage of students in the norms group who obtained lower scores. For example, if a student has a PR of 85, the student’s reading skills are greater than those of 85% of other students of a similar age.

The PR Range reflects the amount of statistical variability in a student’s PR score. If the student were to take the STAR Reading test many times in a short period of time, the score would likely fall in this range.

Reading Age (RA)

The Reading Age (RA) indicates the typical reading age for an individual with a given value of the STAR Reading Scaled Score. This provides an estimate of the chronological age at which students typically obtain that score. The RA score is an approximation based on the demonstrated relationship between STAR Reading and other tests of student reading ability, which were normed in the UK.

RA scores are transformations of the STAR Reading Scaled Score. The scale is expressed in the following form: YY:MM, where YY indicates the reading age in years and MM the months (see Table 10). For example, an individual who has obtained a reading age of 7:10 would be estimated to be reading as well as the average individual at 7 years, 10 months of age. Due to the range of items in STAR Reading and the intended range of years appropriate for use, a reading age cannot be determined with great accuracy if the reading ability of the student is either below 6:00 or above 16:06. Therefore, students who obtain an RA of 6:00 should be considered to have a reading age of 6 years, 0 months or lower, and an RA of 16:06 indicates a reading age of 16 years, 6 months or older.
Table 10: Correspondence of Reading Ages to Scaled Score Ranges

RA     SS         RA     SS         RA     SS         RA     SS         RA     SS          RA     SS
6:00   0–43       8:00   283–292    10:00  519–528    12:00  756–765    14:00  993–1002    16:00  1230–1239
6:01   44–65      8:01   293–301    10:01  529–538    12:01  766–775    14:01  1003–1012   16:01  1240–1249
6:02   66–74      8:02   302–311    10:02  539–548    12:02  776–785    14:02  1013–1022   16:02  1250–1259
6:03   75–84      8:03   312–321    10:03  549–558    12:03  786–795    14:03  1023–1032   16:03  1260–1269
6:04   85–94      8:04   322–331    10:04  559–568    12:04  796–805    14:04  1033–1042   16:04  1270–1278
6:05   95–104     8:05   332–341    10:05  569–578    12:05  806–815    14:05  1043–1051   16:05  1279–1288
6:06   105–114    8:06   342–351    10:06  579–588    12:06  816–824    14:06  1052–1061   16:06  1289–1400
6:07   115–124    8:07   352–361    10:07  589–597    12:07  825–834    14:07  1062–1071
6:08   125–134    8:08   362–370    10:08  598–607    12:08  835–844    14:08  1072–1081
6:09   135–143    8:09   371–380    10:09  608–617    12:09  845–854    14:09  1082–1091
6:10   144–153    8:10   381–390    10:10  618–627    12:10  855–864    14:10  1092–1101
6:11   154–163    8:11   391–400    10:11  628–637    12:11  865–874    14:11  1102–1111
7:00   164–173    9:00   401–410    11:00  638–647    13:00  875–884    15:00  1112–1120
7:01   174–183    9:01   411–420    11:01  648–657    13:01  885–893    15:01  1121–1130
7:02   184–193    9:02   421–430    11:02  658–667    13:02  894–903    15:02  1131–1140
7:03   194–203    9:03   431–440    11:03  668–676    13:03  904–913    15:03  1141–1150
7:04   204–213    9:04   441–449    11:04  677–686    13:04  914–923    15:04  1151–1160
7:05   214–222    9:05   450–459    11:05  687–696    13:05  924–933    15:05  1161–1170
7:06   223–232    9:06   460–469    11:06  697–706    13:06  934–943    15:06  1171–1180
7:07   233–242    9:07   470–479    11:07  707–716    13:07  944–953    15:07  1181–1190
7:08   243–252    9:08   480–489    11:08  717–726    13:08  954–963    15:08  1191–1199
7:09   253–262    9:09   490–499    11:09  727–736    13:09  964–972    15:09  1200–1209
7:10   263–272    9:10   500–509    11:10  737–745    13:10  973–982    15:10  1210–1219
7:11   273–282    9:11   510–518    11:11  746–755    13:11  983–992    15:11  1220–1229

Scaled Score (SS)

STAR Reading software creates a virtually unlimited number of test forms as it dynamically interacts with the students taking the test. In order to make the results of all tests comparable, and in order to provide a basis for deriving the norm-referenced scores for STAR Reading, it is necessary to convert all the results of STAR Reading tests to scores on a common scale. STAR Reading software does this in two steps. First, maximum likelihood is used to estimate each student’s location on the Rasch ability scale, based on the difficulty of the items administered and the pattern of right and wrong answers. Second, the Rasch ability scores are converted to STAR Reading Scaled Scores, using the conversion table described in “Item and Scale Calibration” on page 15. STAR Reading Scaled Scores range from 0 to 1400.

Zone of Proximal Development (ZPD)

The Zone of Proximal Development (ZPD) defines the readability range from which students should be selecting books in order to ensure sufficient comprehension and therefore achieve optimal growth in reading skills without experiencing frustration. STAR Reading software uses Grade Equivalents (GE) to derive a student’s ZPD score. Specifically, it relates the GE estimate of a student’s reading ability with the range of most appropriate readability levels to use for reading practice. Table 47 on page 104 shows the relationship between GE and ZPD scores.

The Zone of Proximal Development is especially useful for students who use Accelerated Reader, which provides readability levels on over 80,000 trade books. Renaissance Learning developed the ZPD ranges according to Vygotskian theory, based on an analysis of Accelerated Reader book reading data from 80,000 students in the 1996–1997 school year. More information is available in Research Foundation for Reading Renaissance Target Setting (2003), which is published by Renaissance Learning.

Diagnostic Codes

Diagnostic codes represent general behavioural characteristics of readers at particular stages of development.
They are based on a student’s Grade Equivalent and Percentile Rank achieved on a STAR Reading test. The diagnostic codes themselves (01A, 04B, etc.) do not appear on the STAR Reading Diagnostic Report, but the descriptive text associated with each diagnostic code is available on the report. Table 11 shows the relationship between the GE and PR scores and the resulting STAR Reading diagnostic codes. Note that the diagnostic codes ending in “B” (which are only used in the US version of STAR Reading Diagnostic Reports²) contain additional prescriptive information to better assist those students performing below the 25th percentile.

2. The descriptive text associated with “A” codes provides recommended actions for students to optimise their reading growth. “B” codes recommend additional actions for a student whose PR score has fallen below 25; since PR scores are only estimated in the UK (not calculated), the descriptive text is not included on the UK version of the STAR Reading Diagnostic Report.

Table 11: Diagnostic Code Values by Percentile Rank for STAR Reading US

US Grade Equivalent    Diagnostic Code (PR > 25)    Diagnostic Code (PR <= 25)
0.0–0.7                01A                          01B
0.8–1.7                02A                          02B
1.8–2.7                03A                          03B
2.8–3.7                04A                          04B
3.8–4.7                05A                          05B
4.8–5.7                06A                          06B
5.8–6.7                07A                          07B
6.8–8.7                08A                          08B
8.8–13.0               09A                          09B

Expert consultants from both academia and public education developed and reviewed the diagnostic codes and accompanying text using standard scope and sequence paradigms from the field of reading education. The reviewers found:

1. The diagnostic information succinctly characterises readers at each stage of development and across US grade levels K–12 (Years 1–13);
2. Critical reading behaviours are listed for successful students at each stage of development; and
3. Corrective procedures are recommended at each stage of development that adequately address important interventions.
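The lookup in Table 11 can be expressed in a few lines of code. The following Python sketch is illustrative only and is not taken from the STAR Reading software. The names `diagnostic_code` and `GE_UPPER_BOUNDS` are hypothetical, and the sketch assumes that the GE is reported to one decimal place within the 0.0–13.0 range shown in the table.

```python
from bisect import bisect_left

# Upper GE bound of each band in Table 11, in order (bands 01 to 09).
GE_UPPER_BOUNDS = [0.7, 1.7, 2.7, 3.7, 4.7, 5.7, 6.7, 8.7, 13.0]


def diagnostic_code(ge: float, pr: float) -> str:
    """Return the Table 11 diagnostic code for a US GE and a Percentile Rank."""
    ge = round(ge, 1)  # GE scores are assumed to carry one decimal place
    if not 0.0 <= ge <= 13.0:
        raise ValueError("GE is outside the range covered by Table 11")
    # Index of the first band whose upper bound is >= ge.
    band = bisect_left(GE_UPPER_BOUNDS, ge) + 1
    # "A" codes apply when PR > 25, "B" codes when PR <= 25.
    suffix = "A" if pr > 25 else "B"
    return f"{band:02d}{suffix}"
```

For example, `diagnostic_code(2.8, 50)` falls in the 2.8–3.7 band with a PR above 25 and therefore returns the code 04A.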
Comparing the STAR Reading US Test with Classical Tests

Because the STAR Reading test adapts to the reading level of the student being tested, STAR Reading US GE scores are more consistently accurate across the achievement spectrum than those provided by classical test instruments. Grade Equivalent scores obtained using classical (non-adaptive) test instruments are less accurate when a student’s US grade placement and GE score differ markedly. It is not uncommon for a US fourth-grade student to obtain a GE score of 8.9 when using a classical test instrument. However, this does not necessarily mean that the student is performing at a level typical of an end-of-year US eighth-grader; more likely, it means that the student answered all, or nearly all, of the items correctly and thus performed beyond the range of the US fourth-grade test.

STAR Reading US Grade Equivalent scores are more consistently accurate, even as a student’s achievement level deviates from the level of US grade placement. A student may be tested on any level of material, depending upon the student’s actual performance on the test; students are tested on items of an appropriate level of difficulty, based on their individual level of achievement. Thus, a GE score of 7.6 indicates that the student’s performance can be appropriately compared to that of a typical US seventh-grader in the sixth month of the school year.

Reliability and Validity

Reliability is the extent to which a test yields consistent results from one administration to another and from one test form to another. Tests must yield consistent results in order to be useful. Because STAR Reading is a computer-adaptive test, many of the typical internal consistency methods used to assess reliability (such as KR-20 and coefficient alpha) are not appropriate.
The question of the reliability of the test was approached in two ways: by calculating split-half reliability and by calculating test-retest reliability, in both cases for both Scaled Scores and Standardised Scores.

Split-Half Reliability

In the split-half analysis, Scaled Scores had an overall mean of 590.47 and a standard deviation of 281.42 (n = 818,064). The Spearman-Brown coefficient was 0.918, which indicates a good level of reliability.

Standardised Scores had an overall mean of 100.03 and a standard deviation of 15.25 (n = 818,064). The Spearman-Brown coefficient was again 0.918, which indicates a good level of reliability.

Note that reliability for Reading is somewhat higher than that for Maths. However, the Scaled Score standard deviation for Reading was much higher than that for Maths, indicating greater variance.

Test-Retest Reliability

Calculating test-retest reliability was more complex, since it required obtaining a sample of cases from the most recent full year of testing (August 1, 2009–July 31, 2010) and comparing their scores to those of the same cases in the previous year (August 1, 2008–July 31, 2009). Ensuring that only scores for the same students on both occasions were entered in this analysis took a great deal of time. All cases with more than one test in each of these periods were deleted. In the current year, 64,472 cases were listed; in the previous year 39,993 cases were listed, but only 8,849 of these were the same students. A scatter diagram of Current Scaled Score × Previous Scaled Score was then constructed to determine whether the distribution was relatively normal and to establish the presence of outlier or rogue results (see Figure 2). Figure 3 shows the difference between the two scores as a histogram.
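The manual does not describe how the test halves were formed or correlated for the split-half analysis. As a point of reference, the standard Spearman-Brown formula steps up the correlation between two half-tests to an estimate of the reliability of the full-length test. The Python sketch below shows that formula only; the function name and the `length_factor` parameter are hypothetical and not part of any STAR Reading software.

```python
def spearman_brown(r_half: float, length_factor: float = 2.0) -> float:
    """Step up a part-test correlation to a full-length reliability estimate.

    r_half        -- correlation between the two half-test scores
    length_factor -- how many times longer the full test is than each part
                     (2.0 for an ordinary split-half analysis)
    """
    return length_factor * r_half / (1 + (length_factor - 1) * r_half)
```

With the default factor of 2 this reduces to 2r / (1 + r). As a purely arithmetical illustration (not a figure reported in the manual), a half-test correlation of about 0.848 would step up to roughly 0.918.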
Figure 2: Scatter Diagram of Current Score × Previous Scaled Score, Showing Outliers

Figure 3: Histogram of the Difference between Current Scaled Score and Previous Scaled Score, All Cases

The outlier results were then deleted; 83 outliers were removed in total (see Figure 4).

Figure 4: Histogram of the Difference between Current Scaled Score and Previous Scaled Score, with 83 Outliers Removed

A total of 8,849 students could be matched from one year to the next with a single test result in each year (much higher than with Maths). The initial Pearson correlation between Current Scaled Score and Previous Scaled Score was 0.829. When the 83 outliers were removed, this improved to 0.853 (n = 8,766). Both correlations were highly statistically significant. Although slightly lower than the corresponding correlation for Maths, this is closely comparable and shows good reliability.

UK Reliability Study

During October and November 2006, 28 schools in England participated in a study to investigate the reliability of scores for STAR Reading across Years 2–9. Estimates of the generic reliability were obtained from completed assessments. In addition to the reliability estimates, the conditional standard error of measurement was computed for each individual student and summarised by school year. Results are found in Table 12, and indicate a high level of score consistency.

As Table 12 indicates, the generic reliability coefficients are higher for all years than the corresponding US results, and the average SEMs are lower. This is consistent with the improvement in measurement precision expected as a result of changing the adaptive item difficulty target from 75% to 67.5% correct, as described under “Improvements Specific to STAR Reading Version 4.3 RP” on page 6.
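The test-retest analysis reported under Test-Retest Reliability amounts to pairing each student's previous and current Scaled Scores, removing outlying pairs and computing a Pearson correlation. The Python sketch below illustrates that sequence. The manual does not state the rule used to identify the 83 outliers, so the cut-off used here (pairs whose score difference lies more than `k` standard deviations from the mean difference) is an assumption, as are all function names.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_outliers(pairs, k=3.0):
    """Keep (previous, current) pairs whose difference is within k SDs of the mean.

    The k-SD rule is an assumed stand-in for the unspecified rule in the manual.
    """
    diffs = [current - previous for previous, current in pairs]
    centre, spread = mean(diffs), pstdev(diffs)
    return [p for p, d in zip(pairs, diffs) if abs(d - centre) <= k * spread]


def retest_correlation(pairs):
    """Correlation between previous and current scores for matched students."""
    previous = [p for p, _ in pairs]
    current = [c for _, c in pairs]
    return pearson(previous, current)
```

Calling `retest_correlation(pairs)` before and after `drop_outliers(pairs)` gives the two kinds of coefficient the section reports: one for all matched cases and one with outliers removed.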
Table 12: Reliability and Conditional SEM Estimates by Year in the UK Sample

UK Year    Number of Students    Generic Reliability    Average SEM    Standard Deviation of SEM
2          557                   0.96                   26.28          19.92
3          1,076                 0.96                   34.36          17.28
4          1,439                 0.94                   45.56          19.48
5          1,514                 0.94                   53.88          24.94
6          1,229                 0.93                   63.93          27.43
7          4,029                 0.92                   71.74          29.24
8          1,480                 0.93                   76.42          31.58
9          632                   0.93                   81.44          31.19

Validity

The key concept often used to judge an instrument’s usefulness is its validity. The validity of a test is the degree to which it assesses what it claims to measure. Determining the validity of a test involves the use of data and other information both internal and external to the test instrument itself. One touchstone is content validity: the relevance of the test questions to the attributes supposed to be measured by the test (reading ability, in the case of the STAR Reading test). These content validity issues were discussed in detail in “Content and Item Development” on page 12 and were an integral part of the design and construction of the STAR Reading test items.

Construct validity, which is the overarching criterion for evaluating a test, investigates the extent to which a test measures the construct that it claims to be assessing. Establishing construct validity involves the use of data and other information external to the test instrument itself. For example, the STAR Reading 2.x and higher tests claim to provide an estimate of a child’s reading achievement level. Therefore, demonstration of the STAR Reading test’s construct validity rests on the evidence that the test in fact provides such an estimate. There are, of course, a number of ways to demonstrate this.

For instance, in a study linking STAR Reading and the Degrees of Reading Power comprehension assessment, a raw correlation of 0.89 was observed between the two tests.
Adjusting that correlation for attenuation due to unreliability yielded a corrected correlation of 0.96, indicating that the construct (reading comprehension) measured by STAR Reading and by Degrees of Reading Power is almost indistinguishable. Table 17 on page 53 and Table 18 on page 55 present evidence of predictive validity collected subsequent to the STAR Reading 2.0 norming study. These two tables display numerous correlations between STAR Reading and other measures administered at points in time at least two months later than STAR Reading.

Since reading ability varies significantly within and across years (US grade levels) and improves with students’ age and number of years in school, STAR Reading 2.x and higher scores should demonstrate these anticipated internal relationships; in fact, they do. Additionally, STAR Reading scores should correlate highly with other accepted procedures and measures that are used to determine reading achievement level; this is external validity.

External Validity

During the STAR Reading US 2.x norming study, schools submitted data on how their students performed on several other popular standardised test instruments along with their students’ STAR Reading results. These data included more than 12,000 student test results from such tests as the California Achievement Test (CAT), the Comprehensive Test of Basic Skills (CTBS), the Iowa Test of Basic Skills (ITBS), the Metropolitan Achievement Test (MAT), the Stanford Achievement Test (SAT-9) and several state tests.

Computing the correlation coefficients was a two-step process. First, where necessary, data were placed onto a common scale. If Scaled Scores were available, they could be correlated with STAR Reading Scaled Scores. However, since Percentile Ranks (PRs) are not on an equal interval scale, when PRs were reported for the other tests, they were converted into Normal Curve Equivalents (NCEs).
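The conversion from Percentile Ranks to Normal Curve Equivalents can be sketched as follows. The sketch uses the conventional definition of the NCE scale (a normal distribution with mean 50 and standard deviation 21.06, so that NCEs of 1, 50 and 99 coincide with PRs of 1, 50 and 99); the manual does not give the exact procedure used in the norming study, and the function name is hypothetical.

```python
from statistics import NormalDist

# Standard deviation of the NCE scale under its conventional definition.
NCE_SD = 21.06


def pr_to_nce(percentile_rank: float) -> float:
    """Convert a Percentile Rank (1-99) to a Normal Curve Equivalent."""
    if not 1 <= percentile_rank <= 99:
        raise ValueError("Percentile Rank must lie between 1 and 99")
    # z-score below which the given percentage of a normal distribution falls
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return 50 + NCE_SD * z
```

Because NCEs are equally spaced in standard-deviation units, they can be averaged and correlated in a way that raw PRs cannot.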
Scaled Scores or NCE scores were then used to compute the Pearson product-moment correlation coefficients.

Research on the validity of STAR Reading scores is ongoing. In addition to the original validity data gathered at the time of initial development, numerous other studies have investigated the correlations between STAR Reading tests and other external measures. Both concurrent and predictive validity estimates have been investigated. Concurrent validity was defined for students taking a STAR Reading test and external measures within a two-month time period. Predictive validity provided an estimate of the extent to which scores on the STAR Reading test predicted scores on criterion measures given at a later point in time, operationally defined as more than two months between the STAR test (predictor) and the criterion test. It provided an estimate of the linear relationship between STAR scores and scores on measures covering a similar academic domain. Predictive correlations are attenuated by time, because students are gaining skills in the interim between testing occasions, and also by differences between the tests’ content specifications.

Tables 13–18 present the correlation coefficients between the STAR Reading US test and each of the other test instruments for which data were received. Tables 13 and 14 display “concurrent validity” data, that is, correlations between STAR Reading 2.0 and later versions’ test scores and other tests administered at close to the same time. Tables 15 and 16 display all other correlations of STAR Reading 2.0 and later versions and external tests; the external tests were administered at various times, and their scores were obtained from student records.
Tables 17 and 18 provide the predictive validity estimates of STAR Reading predicting numerous external criterion measures given at least two months after the initial STAR test. Tables 13, 15 and 17 present validity coefficients for US grades 1–6, and Tables 14, 16 and 18 present the validity coefficients for US grades 7–12. The bottom of each table presents a grade-by-grade summary, including the total number of students for whom test data were available, the number of validity coefficients for that grade and the average value of the validity coefficients.

The within-grade average concurrent validity coefficients varied from 0.71 to 0.81 for grades 1–6, and from 0.64 to 0.75 for grades 7–12. The overall concurrent validity coefficient was 0.73 for grades 1–6 and 0.72 for grades 7–12. The other validity coefficient within-grade averages varied from 0.60 to 0.77; the overall average was 0.73 for grades 1–6, and 0.71 for grades 7–12. The predictive validity coefficients ranged from 0.68 to 0.82 in grades 1–6, with an overall average of 0.79. For grades 7–12, the predictive validity coefficients ranged from 0.81 to 0.86, with an overall average of 0.82. The extent to which the STAR Reading US 2.x test correlates with these tests provides support for STAR Reading construct validity.

The process of establishing the validity of a test is an involved one, and one that usually takes a significant amount of time. Thus, the data collection process and the validation of the US and UK versions of the STAR Reading test are an ongoing activity that seeks to establish evidence of STAR Reading’s validity for a variety of settings. STAR Reading UK users who collect relevant data are encouraged to contact Renaissance Learning UK Ltd.

Since correlation coefficients are available for different editions, forms and dates of administration, many of the tests have several correlation coefficients associated with them.
Where test data quality could not be verified, and when sample size was limited, those data were eliminated. Correlations were computed separately on tests according to the unique combination of test edition/form and time when testing occurred. Testing data for other standardised tests administered more than two years prior to the spring of 1999 were excluded from the analyses since those test results represent very dated information about the current reading ability of students. In general, these correlation coefficients reflect very well on the validity of the STAR Reading US 2.x and higher tests as tools for assessing reading performance. These results, combined with the reliability and SEM estimates, demonstrate quantitatively how well this innovative instrument in reading assessment performs. STAR Reading™ Technical Manual 43 Reliability and Validity External Validity Table 13: Concurrent Validity Data: STAR Reading US Correlations (r) with External Tests Administered Spring 1999 and Later, US Grades 1–6a Test Form 1 Date Score n 2 r n 3 r n 4 r 5 6 n r n r n r 34 0.72* 146 0.76* – – 0.83* 93 0.68* 280 0.80* California Achievement Test (CAT) /5 Spr 99 NCE 93 0.80* 36 0.67* – – Colorado Student Assessment Program Spr 06 Scaled – – – – 82 0.75* 79 Comprehensive Test of Basic Skills (CTBS) /4 Spr 99 NCE – – – – – – 18 0.81* – – – – A-19/20 Spr 99 Scaled – – – – – – – – – – 8 0.91* Delaware Student Testing Program–Reading Spr 05 Scaled – – – – 104 0.57* – – – – – – Spr 06 Scaled – – 158 0.68* 126 0.43* 141 0.62* 157 0.59* 75 0.66* Dynamic Indicators of Basic Early Literacy Skills (DIBELS)–Oral Reading Fluency Fall 05 WCPM – – 59 0.78* – – – – – – – – Win 06 WCPM 61 0.87* 55 0.75* – – – – – – – – Spr 06 WCPM 67 0.87* 63 0.71* – – – – – – – – Fall 06 WCPM – – 515 0.78* 354 0.81* 202 0.72* – – – – Win 07 WCPM 208 0.75* 415 0.73* 175 0.69* 115 0.71* – – – – Spr 07 WCPM 437 0.81* 528 0.70* 363 0.66* 208 0.54* – – – – Fall 07 WCPM – – 626 0.79* 828 0.73* 503 0.73* 46 
0.73* – – 0.65* – – – – Florida Comprehensive Assessment Test Spr 06 SSS – – – – – – 41 Gates-MacGinitie Reading Test (GMRT) 2nd Ed., D Spr 99 NCE – – 21 0.89* – – – – – – – – L-3rd Spr 99 NCE – – 127 0.80* – – – – – – – – Illinois Standards Achievement Test–Reading Spr 05 Scaled – – 106 0.71* 594 0.76* – – 449 0.70* – – Spr 06 Scaled – – – – 140 0.80* 144 0.80* 146 0.72 – – STAR Reading™ Technical Manual 44 Reliability and Validity External Validity Table 13: Concurrent Validity Data: STAR Reading US Correlations (r) with External Tests Administered Spring 1999 and Later, US Grades 1–6a (Continued) Test Form 1 Date Score n 2 r n 3 r n 4 r 5 6 n r n r n r Iowa Test of Basic Skills (ITBS) Form K Spr 99 NCE 40 0.75* 36 0.84* 26 0.82* 28 0.89* 79 0.74* – – Form L Spr 99 NCE – – – – 18 0.70* 29 0.83* 41 0.78* 38 0.82* Form M Spr 99 NCE – – – – 158 0.81* – – 125 0.84* – – Form K Spr 99 Scaled – – 58 0.74* – – 54 0.79* – – – – Form L Spr 99 Scaled – – – – 45 0.73* – – – – 50 0.82* Metropolitan Achievement Test (MAT) 7th Ed. Spr 99 NCE – – – – – – 46 0.79* – – – – 6th Ed Spr 99 Raw – – – – 8 0.58* – – 8 0.85* – – 7th Ed. 
Spr 99 Scaled – – – – 25 0.73* 17 0.76* 21 0.76* 23 0.58* Michigan Educational Assessment Program–English Language Arts Fall 04 Scaled – – – – – – 155 0.81* – – – – Fall 05 Scaled – – – – 218 0.76* 196 0.80* 202 0.80* 207 0.69* Fall 06 Scaled – – – – 116 0.79* 132 0.69* 154 0.81* 129 0.66* Michigan Educational Assessment Program–Reading Fall 04 Scaled – – – – – – 155 0.80* – – – – Fall 05 Scaled – – – – 218 0.77* 196 0.78* 202 0.81* 207 0.68* Fall 06 Scaled – – – – 116 0.75* 132 0.70* 154 0.82* 129 0.70* 175 0.66* 81 0.69* – – – 26 0.62* – – – – 85 0.79* – – 92 0.58* 46 0.52* 80 0.60* Mississippi Curriculum Test Spr 02 Scaled – – 159 0.73* 148 0.62* Missouri Mastery Achievement Test (MMAT) Spr 99 NCE – – – – – – – North Carolina End of Grade Test (NCEOG) Spr 99 Scaled – – – – – – Oklahoma Core Curriculum Test Spr 06 Scaled – – – – 78 0.62* Stanford Achievement Test (Stanford) 9th Ed. Spr 99 NCE 68 0.79* – – 26 0.44* – – – – 86 0.65* 9th Ed. Spr 99 Scaled 11 0.89* 18 0.89* 67 0.79* 66 0.79* 72 0.80* 64 0.72* STAR Reading™ Technical Manual 45 Reliability and Validity External Validity Table 13: Concurrent Validity Data: STAR Reading US Correlations (r) with External Tests Administered Spring 1999 and Later, US Grades 1–6a (Continued) Test Form 1 Date Score 2 n r 3 n r 4 n 5 6 r n r n r n r 0.78* – – – – – – – – – – 229 0.66* – – 7 0.68* 7 0.66* TerraNova Spr 99 Scaled – – 61 0.72* 117 Texas Assessment of Academic Skills (TAAS) Spr 99 NCE – – – – – – Woodcock Reading Mastery (WRM) Spr 99 – – – – – – Summary Grade(s) All 1 2 3 4 5 6 Number of students 16,985 985 3,451 4,539 3,317 2,717 1,976 8 18 25 25 22 16 0.81 0.74 0.72 0.72 0.73 0.71 Number of coefficients 114 Average validity – Overall average 0.73 a. Sample sizes are in the columns labeled “n.” * Denotes correlation coefficients that are statistically significant at the 0.05 level. 
Table 14: Concurrent Validity Data: STAR Reading US Correlations (r) with External Tests Administered Spring 1999 and Later, US Grades 7–12a Test Form 7 Date Score n 8 r n 9 r n 10 r 11 12 n r n r n r California Achievement Test (CAT) /5 Spr 99 NCE – – – – 59 0.65* – – – – – – /5 Spr 99 Scaled 124 0.74* 131 0.76* – – – – – – – – – – – – – Colorado Student Assessment Program Spr 06 Scaled 299 0.84* 185 0.83* – – – Delaware Student Testing Program–Reading Spr 05 Scaled – – – – – – 112 0.78* – – – – Spr 06 Scaled 150 0.72* – – – – – – – – – – – – – – – Florida Comprehensive Assessment Test Spr 06 STAR Reading™ Technical Manual SSS – – 74 0.65* 46 – – – Reliability and Validity External Validity Table 14: Concurrent Validity Data: STAR Reading US Correlations (r) with External Tests Administered Spring 1999 and Later, US Grades 7–12a (Continued) Test Form 7 Date Score n 8 r n 9 r n 10 r 11 n 12 r n r n r Illinois Standards Achievement Test–Reading Spr 05 Scaled – – 157 0.73* – – – – – – – – Spr 06 Scaled 140 0.70* – – – – – – – – – – Iowa Test of Basic Skills (ITBS) Form K Spr 99 NCE – – – – 67 0.78* – – – – – – Form L Spr 99 Scaled 47 0.56* – – 65 0.64* – – – – – – Michigan Educational Assessment Program–English Language Arts Fall 04 Scaled 154 0.68* – – – – – – – – – – Fall 05 Scaled 233 0.72* 239 0.70* – – – – – – – – Fall 06 Scaled 125 0.79* 152 0.74* – – – – – – – – Michigan Educational Assessment Program–Reading Fall 04 Scaled 156 0.68* – – – – – – – – – – Fall 05 Scaled 233 0.71* 239 0.69* – – – – – – – – Fall 06 Scaled 125 0.86* 154 0.72* – – – – – – – – – – – – – – – – – – – Mississippi Curriculum Test Spr 03 Scaled 372 0.70* – – – – Missouri Mastery Achievement Test (MMAT) Spr 99 NCE – – 29 0.78* 19 0.71* – Northwest Evaluation Association Levels Test (NWEA) Achieve Spr 99 NCE – – 124 0.66* – – – – – – – – Stanford Achievement Test (Stanford) 9th Ed. Spr 99 NCE 50 0.65* 50 0.51* – – – – – – – – 9th Ed. 
Spr 99 Scaled 70 0.70* 68 0.80* – – – – – – – – 0.80* 7 0.60 – – 0.60* – – – – – – – – – Test of Achievement and Proficiency (TAP) Spr 99 NCE – – – – 6 0.42 13 Texas Assessment of Academic Skills (TAAS) Spr 99 NCE – – – – – – 43 Wide Range Achievement Test 3 (WRAT3) Spr 99 STAR Reading™ Technical Manual – – 17 0.81* 47 – – – Reliability and Validity External Validity Table 14: Concurrent Validity Data: STAR Reading US Correlations (r) with External Tests Administered Spring 1999 and Later, US Grades 7–12a (Continued) Test Form 7 Date Score 8 n r 9 n r 10 n r 11 n r n 12 r n r Summary Grade(s) All 7 8 9 10 11 12 Number of students 4,288 2,278 1,619 216 168 7 0 Number of coefficients 38 15 14 5 3 1 0 Average validity – 0.73 0.72 0.64 0.75 0.66 – Overall average 0.72 a. Sample sizes are in the columns labeled “n.” * Denotes correlation coefficients that are statistically significant at the 0.05 level. Table 15: Other External Validity Data: STAR Reading US 2.x Correlations (r) with External Tests Administered Prior to the Spring of 1999, US Grades 1–6a 1 Test Form Date Score n 2 r n 3 r 4 n 5 6 r n r n r n r – – – – – – – American Testronics Level C-3 Spr 98 Scaled – – 20 0.71* – California Achievement Test (CAT) /4 Spr 98 Scaled – – 16 0.82* – – 54 0.65* – – 10 0.88* /5 Spr 98 Scaled – – – – 40 0.82* 103 0.85* – – – – /5 Fall 98 NCE 40 0.83* – – – – – – – – – – /5 Fall 98 Scaled – – – – 39 0.85* – – – – – – Comprehensive Test of Basic Skills (CTBS) A-15 Fall 97 NCE – – – – – – – – – – 24 0.79* /4 Spr 97 Scaled – – – – – – – – 31 0.61* – – /4 Spr 98 Scaled – – – – – – 6 0.49 68 0.76* – – A-19/20 Spr 98 Scaled – – – – – – – – 10 0.73* – – A-15 Spr 98 Scaled – – – – – – – – – – 93 0.81* A-16 Fall 98 NCE – – – – – – – – – – 73 0.67* – – 25 0.72* 23 0.38 Degrees of Reading Power (DRP) Spr 98 STAR Reading™ Technical Manual – – – – 8 48 0.71* Reliability and Validity External Validity Table 15: Other External Validity Data: STAR Reading US 2.x Correlations (r) with External 
Tests Administered Prior to the Spring of 1999, US Grades 1–6a (Continued) 1 Test Form Date Score n 2 r n 3 r n 4 r 5 n 6 r n r n r Gates-MacGinitie Reading Test (GMRT) 2nd Ed., D Spr 98 NCE – – – – – – – – – – 47 0.80* L-3rd Spr 98 NCE – – 31 0.69* 27 0.62* – – – – – – L-3rd Fall 98 NCE 60 0.64* – – 66 0.83* – – – – – – Indiana Statewide Testing for Educational Progress (ISTEP) Fall 98 NCE – – – – 19 0.80* – – – – 21 0.79* Iowa Test of Basic Skills (ITBS) Form K Spr 98 NCE – – – – 88 0.74* 17 0.59* – – 21 0.83* Form L Spr 98 NCE – – – – 50 0.84* – – – – 57 0.66* Form M Spr 98 NCE – – 68 0.71* – – – – – – – – Form K Fall 98 NCE – – 67 0.66* 43 0.73* 67 0.74* 28 0.81* – – Form L Fall 98 NCE – – – – – – 27 0.88* 6 0.97* 37 0.60* Form M Fall 98 NCE – – 65 0.81* – – 53 0.72* – – – – Metropolitan Achievement Test (MAT) 7th Ed. Spr 98 NCE – – – – – – 29 0.67* 22 0.68* 17 0.86* 6th Ed Spr 98 Raw – – – – – – 6 0.91* – – 5 0.67 7th Ed. Spr 98 Scaled – – 48 0.75* – – – – 30 0.79* – – 7th Ed. Fall 98 NCE – – – – – – – – – – 49 0.75* Metropolitan Readiness Test (MRT) Spr 96 NCE – – – – 5 0.81 – – – – – – Spr 98 NCE 4 0.63 – – – – – – – – – – – 14 0.75* 24 0.62* 0.92* – – – – – – 53 0.76* – – – – – – – – Missouri Mastery Achievement Test (MMAT) Spr 98 Scaled – – – – 12 0.44 – New York State Student Evaluation Program (P&P) Spr 98 – – – – – – 13 North Carolina End of Grade Test (NCEOG) Spr 98 Scaled – – – – – – NRT Practice Achievement Test (NRT) Practice Spr 98 STAR Reading™ Technical Manual NCE – – 56 0.71* 49 – – Reliability and Validity External Validity Table 15: Other External Validity Data: STAR Reading US 2.x Correlations (r) with External Tests Administered Prior to the Spring of 1999, US Grades 1–6a (Continued) 1 Test Form Date Score 2 n r 3 n r 4 n r 5 n 6 r n r n r Stanford Achievement Test (Stanford) 9th Ed. Spr 97 Scaled – – – – – – – – 68 0.65* – – 7th Ed. Spr 98 Scaled 11 0.73* 7 0.94* 8 0.65 15 0.82* 7 0.87* 8 0.87* 8th Ed. 
Spr 98 Scaled 8 0.94* 8 0.64 6 0.68 11 0.76* 8 0.49 7 0.36 9th Ed. Spr 98 Scaled 13 0.73* 93 0.73* 19 0.62* 314 0.74* 128 0.72* 62 0.67* 4th Ed. 3/V Spr 98 Scaled 14 0.76* – – – – – – – – – – 9th Ed. Fall 98 NCE – – – – 45 0.89* – – 35 0.68* – – 9th Ed. Fall 98 Scaled – – 88 0.60* 25 0.79* – – 196 0.73* – – 9th Ed. 2/SA Fall 98 Scaled – – – – 103 0.69* – – – – – – Tennessee Comprehensive Assessment Program (TCAP) Spr 98 Scaled – – 30 0.75* – – – – – – – – TerraNova Fall 97 Scaled – – – – – – – – 56 0.70* – – Spr 98 NCE – – – – 76 0.63* – – – – – – Spr 98 Scaled – – 94 0.50* 55 0.79* 299 0.75* 86 0.75* 23 0.59* Fall 98 NCE – – – – – – – – – – 126 0.74* Fall 98 Scaled – – – – – – 14 0.70* – – 15 0.77* – – – 10 0.89* 0.58* – – – – Wide Range Achievement Test 3 (WRAT3) Fall 98 – – – – – – – Wisconsin Reading Comprehension Test Spr 98 – – – – – – 63 Summary Grade(s) All 1 2 3 4 5 6 Number of students 4,289 150 691 734 1,091 871 752 Number of coefficients 95 7 14 19 16 18 21 Average validity – 0.75 0.72 0.73 0.74 0.73 0.71 Overall average 0.73 a. Sample sizes are in the columns labeled “n.” * Denotes correlation coefficients that are statistically significant at the 0.05 level. 
STAR Reading™ Technical Manual 50 Reliability and Validity External Validity Table 16: Other External Validity Data: STAR Reading US 2.x Correlations (r) with External Tests Administered Prior to Spring 1999, US Grades 7–12a Test Form 7 Date Score n 8 r n 9 r n 10 r 11 12 n r n r n r California Achievement Test (CAT) /4 Spr 98 Scaled – – 11 0.75* – – – – – – – – /5 Spr 98 NCE 80 0.85* – – – – – – – – – – Comprehensive Test of Basic Skills (CTBS) /4 Spr 97 NCE – – 12 0.68* – – – – – – – – /4 Spr 98 NCE 43 0.84* – – – – – – – – – – /4 Spr 98 Scaled 107 0.44* 15 0.57* 43 0.86* – – – – – – A-16 Spr 98 Scaled 24 0.82* – – – – – – – – – – Explore (ACT Program for Educational Planning, 8th Grade) Fall 97 NCE – – – – 67 0.72* – – – – – – Fall 98 NCE – – 32 0.66* – – – – – – – – Iowa Test of Basic Skills (ITBS) Form K Spr 98 NCE – – – – 35 0.84* – – – – – – Form K Fall 98 NCE 32 0.87* 43 0.61* – – – – – – – – Form K Fall 98 Scaled 72 0.77* 67 0.65* 77 0.78* – – – – – – Form L Fall 98 NCE 19 0.78* 13 0.73* – – – – – – – – Metropolitan Achievement Test (MAT) 7th Ed. Spr 97 Scaled 114 0.70* – – – – – – – – – – 7th Ed. Spr 98 NCE 46 0.84* 63 0.86* – – – – – – – – 7th Ed. Spr 98 Scaled 88 0.70* – – – – – – – – – – 7th Ed. 
Fall 98 NCE 50 0.55* 48 0.75* – – – – – – – – – – – – – Missouri Mastery Achievement Test (MMAT) Spr 98 Scaled 24 0.62* 12 0.72* – – – North Carolina End of Grade Test (NCEOG) Spr 97 Scaled – – – – – – 58 0.81* – – – – Spr 98 Scaled – – – – 73 0.57* – – – – – – PLAN (ACT Program for Educational Planning, 10th Grade) Fall 97 NCE – – – – – – – – 46 0.71* – – Fall 98 NCE – – – – – – 104 0.53* – – – – STAR Reading™ Technical Manual 51 Reliability and Validity External Validity Table 16: Other External Validity Data: STAR Reading US 2.x Correlations (r) with External Tests Administered Prior to Spring 1999, US Grades 7–12a (Continued) Test Form 7 Date Score n 8 r n 9 r n 10 r n 11 12 r n r n r – 78 0.67* – – Preliminary Scholastic Aptitude Test (PSAT) Fall 98 Scaled – – – – – – – Stanford Achievement Test (Stanford) 9th Ed. Spr 97 Scaled – – – – – – – – – – 11 0.90* 7th Ed. Spr 98 Scaled – – 8 0.83* – – – – – – – – 8th Ed. Spr 98 Scaled 6 0.89* 8 0.78* 91 0.62* – – 93 0.72* – – 9th Ed. Spr 98 Scaled 72 0.73* 78 0.71* 233 0.76* 32 0.25 64 0.76* – – 4th Ed. 3/V Spr 98 Scaled – – – – – – 55 0.68* – – – – 9th Ed. Fall 98 NCE 92 0.67* – – – – – – – – – – 9th Ed. Fall 98 Scaled – – – – 93 0.75* – – – – 70 0.75* 0.81 24 0.82* – – – – Stanford Reading Test 3rd Ed. 
Fall 97 NCE – – – – 5 TerraNova Fall 97 NCE 103 0.69* – – – – – – – – – – Spr 98 Scaled – – 87 0.82* – – 21 0.47* – – – – Fall 98 NCE 35 0.69* 32 0.74* – – – – – – – – Test of Achievement and Proficiency (TAP) Spr 97 NCE – – – – – – – – 36 0.59* – – Spr 98 NCE – – – – – – 41 0.66* – – 43 0.83* – – – 41 0.58* Texas Assessment of Academic Skills (TAAS) Spr 97 TLI – – – – – – – Wide Range Achievement Test 3 (WRAT3) Spr 98 9 0.35 – – – – – – – – – – Fall 98 – – – – 16 0.80* – – – – – – 0.58* – – – – Wisconsin Reading Comprehension Test Spr 98 STAR Reading™ Technical Manual – – – – – 52 – 63 Reliability and Validity External Validity Table 16: Other External Validity Data: STAR Reading US 2.x Correlations (r) with External Tests Administered Prior to Spring 1999, US Grades 7–12a (Continued) Test Form 7 Date Score 8 n r 9 n r 10 n r 11 n r n 12 r n r Summary Grade(s) All 7 8 9 10 11 12 Number of students 3,158 1,016 529 733 398 317 165 Number of coefficients 60 18 15 10 8 5 4 Average validity – 0.71 0.72 0.75 0.60 0.69 0.77 Overall average 0.71 a. Sample sizes are in the columns labeled “n.” * Denotes correlation coefficients that are statistically significant at the 0.05 level. 
Table 17: Predictive Validity Data: Correlations of STAR Reading Scaled Scores Predicting Later Performance for Grades 1–6ᵃ

Predictor Date   Criterion Dateᵇ   Grade 1 (n r)   Grade 2 (n r)   Grade 3 (n r)   Grade 4 (n r)   Grade 5 (n r)   Grade 6 (n r)

Colorado Student Assessment Program
Fall 05   Spr 06   – –   – –   82 0.72*   79 0.77*   93 0.70*   280 0.77*

Delaware Student Testing Program–Reading
Fall 04   Spr 05   – –   – –   189 0.58*   – –   – –   – –
Win 05   Spr 05   – –   – –   120 0.67*   – –   – –   – –
Spr 05   Spr 06   – –   – –   161 0.52*   191 0.55*   190 0.62*   – –
Fall 05   Spr 06   – –   253 0.64*   214 0.39*   256 0.62*   270 0.59*   242 0.71*
Win 05   Spr 06   – –   275 0.61*   233 0.47*   276 0.59*   281 0.62*   146 0.57*
– – 409 0.67* – – – 417 0.76*

Florida Comprehensive Assessment Test
Fall 05   Spr 06   – – – – – –
Win 07   Spr 07   – – – – – – 42 0.73* –

Illinois Standards Achievement Test–Reading
Fall 04   Spr 05   – –   – –   450 0.73*   – –   317 0.68*   – –
Win 05   Spr 05   – –   – –   564 0.76*   – –   403 0.68*   – –
Fall 05   Spr 06   – –   – –   133 0.73*   140 0.74*   145 0.66*   – –
Win 06   Spr 06   – –   – –   138 0.76*   145 0.77*   146 0.70*   – –

Michigan Educational Assessment Program–English Language Arts
Fall 04   Fall 05P   – –   – –   193 0.60*   181 0.70*   170 0.75*   192 0.66*
Win 05   Fall 05P   – –   – –   204 0.68*   184 0.74*   193 0.75*   200 0.70*
Spr 05   Fall 05P   – –   – –   192 0.73*   171 0.73*   191 0.71*   193 0.62*
Fall 05   Fall 06P   – –   – –   111 0.66*   132 0.71*   119 0.77*   108 0.60*
Win 06   Fall 06P   – –   – –   114 0.77*   – –   121 0.75*   109 0.66*

Michigan Educational Assessment Program–Reading
Fall 04   Fall 05P   – –   – –   193 0.60*   181 0.69*   170 0.76*   192 0.66*
Win 05   Fall 05P   – –   – –   204 0.69*   184 0.74*   193 0.78*   200 0.70*
Spr 05   Fall 05P   – –   – –   192 0.72*   171 0.72*   191 0.74*   193 0.62*
Fall 05   Fall 06P   – –   – –   111 0.63*   132 0.70*   119 0.78*   108 0.62*
Win 06   Fall 06P   – –   – –   114 0.72*   – –   121 0.75*   109 0.64*

Mississippi Curriculum Test
Fall 01   Spr 02   – – 86 0.57* 95 0.70* 97 0.65* 78 0.76*
Fall 02   Spr 03   – – 340 0.67* 337 0.67* 282 0.69* 407 0.71* – – 442 0.72*

Oklahoma Core Curriculum Test
Fall 04   Spr 05   – –   – –   – –   – –   44 0.63*   – –
Win 05   Spr 05   – –   – –   – –   – –   45 0.66*   – –
Fall 05   Spr 06   – – – – 89 0.59* 90 0.60*
Win 06   Spr 06   – – – – 60 0.65* 40 0.67* 79 0.69* – – 84 0.63* – –

STAR Reading
Fall 05   Spr 06   16,982 0.66*   42,601 0.78*   46,237 0.81*   44,125 0.83*   34,380 0.83*   23,378 0.84*
Fall 06   Spr 07   25,513 0.67*   63,835 0.78*   69,835 0.81*   65,157 0.82*   57,079 0.83*   35,103 0.83*
Fall 05   Fall 06P   8,098 0.65*   20,261 0.79*   20,091 0.81*   18,318 0.82*   7,621 0.82*   5,021 0.82*
Fall 05   Spr 07P   8,098 0.55*   20,261 0.72*   20,091 0.77*   18,318 0.80*   7,621 0.80*   5,021 0.79*
Spr 06   Fall 06P   8,098 0.84*   20,261 0.82*   20,091 0.83*   18,318 0.83*   7,621 0.83*   5,021 0.83*
Spr 06   Spr 07P   8,098 0.79*   20,261 0.80*   20,091 0.81*   18,318 0.82*   7,621 0.82*   5,021 0.81*

Summary
Grades                   All       1        2         3         4         5         6
Number of students       857,996   74,877   184,434   200,929   185,528   126,029   82,189
Number of coefficients   123       6        10        30        25        29        23
Average validity         –         0.68     0.78      0.80      0.82      0.82      0.82
Overall validity         0.79

a. Grade given in the column signifies the grade within which the predictor variable was given (as some validity estimates span contiguous grades).
b. P indicates a criterion measure was given in a subsequent grade from the predictor.
* Denotes significant correlation (p < 0.05).
Table 18: Predictive Validity Data: Correlations of STAR Reading Scaled Scores Predicting Later Performance for Grades 7–12ᵃ

Predictor Date   Criterion Dateᵇ   Grade 7 (n r)   Grade 8 (n r)   Grade 9 (n r)   Grade 10 (n r)   Grade 11 (n r)   Grade 12 (n r)

– – – – – 48 0.66* – – – –

Colorado Student Assessment Program
Fall 05   Spr 06   299 0.83* 185 0.83* – – –

Delaware Student Testing Program–Reading
Spr 05   Spr 06P   100 0.75* 143 0.63* –
Fall 05   Spr 06   273 0.69*   247 0.70*   152 0.73*   97 0.78*   – –   – –
Win 05   Spr 06   – 61 0.64* 230 0.64* 145 0.71* – – – – – –

Florida Comprehensive Assessment Test
Fall 05   Spr 06   381 0.61*   387 0.62*   – –   – –   – –   – –
Win 07   Spr 07   342 0.64*   361 0.72*   – –   – –   – –   – –

Illinois Standards Achievement Test–Reading
Fall 05   Spr 06   173 0.51*   158 0.66*   – –   – –   – –   – –

Michigan Educational Assessment Program–English Language Arts
Fall 04   Fall 05P   181 0.71*   88 0.85*   – –   – –   – –   – –
Win 05   Fall 05P   214 0.73*   212 0.73*   – –   – –   – –   – –
Spr 05   Fall 05P   206 0.75*   223 0.69*   – –   – –   – –   – –
Fall 05   Fall 06P   114 0.66*   126 0.66*   – –   – –   – –   – –
Win 06   Fall 06P   114 0.64*   136 0.71*   – –   – –   – –   – –
Spr 06   Fall 06P   – –   30 0.80*   – –   – –   – –   – –

Michigan Educational Assessment Program–Reading
Fall 04   Fall 05P   181 0.70*   88 0.84*   – –   – –   – –   – –
Win 05   Fall 05P   214 0.72*   212 0.73*   – –   – –   – –   – –
Spr 05   Fall 05P   206 0.72*   223 0.69*   – –   – –   – –   – –
Fall 05   Fall 06P   116 0.68*   138 0.66*   – –   – –   – –   – –
Win 06   Fall 06P   116 0.68*   138 0.70*   – –   – –   – –   – –
Spr 06   Fall 06P   – –   30 0.81*   – –   – –   – –   – –

Mississippi Curriculum Test
Fall 02   Spr 03   425 0.68*   – –   – –   – –   – –   – –

STAR Reading
Fall 05   Spr 06   17,370 0.82*   9,862 0.82*   2,462 0.82*   1,527 0.85*   1,443 0.83*   596 0.85*
Fall 06   Spr 07   22,177 0.82*   19,152 0.82*   4,087 0.84*   2,624 0.85*   2,930 0.85*   2,511 0.86*
Fall 05   Fall 06P   5,399 0.81*   641 0.76*   659 0.89*   645 0.88*   570 0.90*   – –
Fall 05   Spr 07P   5,399 0.79*   641 0.76*   659 0.83*   645 0.83*   570 0.87*   – –
Spr 06   Fall 06P   5,399 0.82*   641 0.83*   659 0.87*   645 0.88*   570 0.89*   – –
Spr 06   Spr 07P   5,399 0.80*   641 0.83*   659 0.85*   645 0.85*   570 0.86*   – –

Summary
Grades                   All       7        8        9       10      11      12
Number of students       126,090   64,978   34,764   9,567   7,021   6,653   3,107
Number of coefficients   73        23       25       8       9       6       2
Average validity         –         0.81     0.81     0.83    0.85    0.86    0.86
Overall validity         0.82

a. Grade given in the column signifies the grade within which the predictor variable was given (as some validity estimates span contiguous grades).
b. P indicates a criterion measure was given in a subsequent grade from the predictor.
* Denotes significant correlation (p < 0.05).

Meta-Analysis of the STAR Reading Validity Data

Meta-analysis is a set of statistical procedures that combines results from different sources or studies. When applied to a set of correlation coefficients that estimate test validity, meta-analysis combines the observed correlations and sample sizes to yield estimates of overall validity, as well as standard errors and confidence intervals, both overall and within grades. To conduct a meta-analysis of the STAR Reading validity data, the 223 correlations with other tests, first reported in the STAR Reading version 2.0 technical manual, were combined and analysed using a fixed effects model for meta-analysis. The results are displayed in Table 19. The table lists results for the correlations within each US grade, as well as results with all twelve grades’ data combined.
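To illustrate how a fixed effects combination of correlations can work, the following Python sketch pools (n, r) pairs through Fisher's z transformation with weights of n − 3. The manual does not say which transformation or weighting was actually used, so that choice, the function name `pool_correlations` and the three sample rows (taken from the grade 3 column of Table 17 purely as input data) are all assumptions made for illustration.

```python
import math


def pool_correlations(pairs, z_crit=1.96):
    """Fixed-effect pooling of (n, r) pairs via Fisher's z (an assumed method).

    Returns the pooled correlation, the standard error on the z scale,
    and the lower and upper limits of the confidence interval for r.
    """
    weights = [n - 3 for n, _ in pairs]          # inverse variance of each z
    zs = [math.atanh(r) for _, r in pairs]       # Fisher z transform
    total_w = sum(weights)
    z_bar = sum(w * z for w, z in zip(weights, zs)) / total_w
    se_z = 1.0 / math.sqrt(total_w)
    lower = math.tanh(z_bar - z_crit * se_z)     # back-transform to r
    upper = math.tanh(z_bar + z_crit * se_z)
    return math.tanh(z_bar), se_z, lower, upper


# Sample input only: three grade 3 rows from Table 17 (n, r).
sample = [(82, 0.72), (189, 0.58), (450, 0.73)]
estimate, se_z, lower, upper = pool_correlations(sample)
print(f"pooled r = {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because the interval is built on the z scale and then transformed back, it is generally not symmetric around the pooled correlation.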
For each set of results, the table lists an estimate of the true validity, a standard error and the lower and upper limits of a 95 per cent confidence interval for the validity coefficient.

Table 19: Results of the Meta-Analysis of STAR Reading US 2.x Correlations with Other Tests

US Grade   Validity Estimate   Standard Error   95% Lower Limit   95% Upper Limit
1          0.77                0.02             0.72              0.81
2          0.72                0.02             0.68              0.74
3          0.75                0.01             0.73              0.78
4          0.75                0.01             0.73              0.77
5          0.75                0.01             0.72              0.77
6          0.71                0.01             0.68              0.74
7          0.70                0.01             0.67              0.72
8          0.72                0.02             0.69              0.75
9          0.72                0.02             0.69              0.75
10         0.61                0.03             0.55              0.67
11         0.70                0.03             0.64              0.75
12         0.74                0.03             0.69              0.79
All        0.72                0.00             0.71              0.73

Thus, based on the STAR Reading 2.0 pilot study data, the overall estimate of the validity of STAR Reading is 0.72, with a standard error of 0.005. The true validity is estimated to lie within the range of 0.71–0.73, with a 95 per cent confidence level. The probability of observing the 223 reported correlations, if the true validity were zero, is virtually zero. Because the 223 correlations were obtained with widely different tests, and among students from twelve different grades (years), these results provide support for the validity of STAR Reading as a measure of reading ability.

Post-Publication Study Data

Subsequent to publication of STAR Reading 2.0 in 1999, additional US external validity data have become available, both from users of the assessment and from special studies conducted by Renaissance Learning. This section provides summaries of those new data along with tables of results. Data from three sources are presented here.
They include a predictive validity study, a longitudinal study and a study of STAR Reading’s construct validity as a measure of reading comprehension.

Predictive Validity: Correlations with SAT9 and the California Standards Tests

A doctoral dissertation (Bennicoff-Nan, 2002)³ studied the validity of STAR Reading as a predictor of students’ scores in a California school district (school network) on the California Standards Test (CST) and the Stanford Achievement Tests, Ninth Edition (SAT9), the reading accountability tests mandated by the State of California. At the time of the study, those two tests were components of the California Standardized Testing and Reporting Program. The study involved analysis of test scores of more than 1,000 school children in four US grades in a rural central California school district; 83% of students in the district were eligible for free and reduced lunch and 30% were identified as having limited English proficiency.

3. Bennicoff-Nan, L. (2002). A correlation of computer adaptive, norm referenced, and criterion referenced achievement tests in elementary reading. Unpublished doctoral dissertation, The Boyer Graduate School of Education, Santa Ana, California.

Bennicoff-Nan’s dissertation addressed a number of different research questions. For purposes of this technical manual, we are primarily interested in the correlations of STAR Reading 2.0 with SAT9 and CST scores. Those correlations are displayed by US grade in Table 20.

Table 20: Correlations of STAR Reading 2.0 Scores with US SAT9 and California Standards Test Scores, by US Grade

US Grade   SAT9 Total Reading   CST English and Language Arts
3          0.82                 0.78
4          0.83                 0.81
5          0.83                 0.79
6          0.81                 0.78

In summary, the average correlation between STAR Reading and SAT9 was 0.82. The average correlation with CST was 0.80.
These values are evidence of the validity of STAR Reading for predicting performance on both norm-referenced reading tests such as the SAT9, and criterion-referenced accountability measures such as the CST. Bennicoff-Nan concluded that STAR Reading was “a time and labour effective” means of progress monitoring in the class, as well as suitable for program evaluation and monitoring student progress towards state accountability targets.

A Longitudinal Study: Correlations with the Stanford Achievement Test in Reading

Sadusky and Brem (2002)⁴ conducted a study to determine the effects of implementing Reading Renaissance (RR) at a Title I school in the southwestern United States from 1997 to 2001. This was a retrospective longitudinal study. Incidental to the study, they obtained students’ STAR Reading posttest scores and Stanford Achievement Test (SAT9) end-of-year Total Reading scores from each school year and calculated correlations between them. Students’ test scores were available for multiple school years, spanning US grades 2–6 (Years 3–7). Data on gender, ethnic group and Title I eligibility were also collected.

The observed correlations for the overall group are displayed in Table 21. Table 22 displays the same correlations, broken out by ethnic group.

4. Sadusky, L. A. & Brem, S. K. (2002). The integration of Renaissance programs into an urban Title I elementary school, and its effect on school-wide improvement (Tech. Rep.). Tempe: Arizona State University. Available online: http://drbrem.ml1.net/renlearn/publications/RR2002.pdf.

Table 21: Correlations of the STAR Posttest with the SAT9 Total Reading Scores 1998–2002ᵃ

Year   US Grades   N     Correlation
1998   3–6         44    0.66
1999   2–6         234   0.69
2000   2–6         389   0.67
2001   2–6         361   0.73

a. All correlations significant, p < 0.001.
Table 22: Correlations of the STAR Posttest with the SAT9 Total Reading Scores, by Ethnic Group, 1998–2002ᵃ

Year   US Grade   Hispanic N   Hispanic Correlation   White N   White Correlation
1998   3–6        7            0.55 (n.s.)            35        0.69
1999   2–6        42           0.64                   179       0.75
2000   2–6        67           0.74                   287       0.71
2001   2–6        76           0.71                   255       0.73

a. All correlations significant, p < 0.001, unless otherwise noted.

Overall correlations by school year ranged from 0.66 to 0.73. Sadusky and Brem concluded that “STAR results can serve as a moderately good predictor of SAT9 performance in reading”.

Enough Hispanic and white students were identified in the sample to calculate correlations separately for those two groups. Within each ethnic group, the correlations were similar in magnitude, as Table 22 shows. This provides support for the assertion that STAR Reading’s validity does not depend on student ethnicity.

Concurrent Validity: An International Study of Correlations with Reading Tests in England

NFER, the National Foundation for Educational Research, conducted a study of the concurrent validity of both STAR Reading and STAR Maths in 16 schools in England in 2006 (Sewell, Sainsbury, Pyle, Keogh and Styles, 2007).⁵ English primary and secondary students in Years 2–9 (equivalent to US grades 1–8) took both STAR Reading and one of three age-appropriate forms of the Suffolk Reading Scale 2 (SRS2) in the fall of 2006.

5. Sewell, J., Sainsbury, M., Pyle, K., Keogh, N. and Styles, B. (2007). Renaissance Learning Equating Study Report. Technical Report submitted to Renaissance Learning, Inc. National Foundation for Educational Research, Slough, Berkshire, United Kingdom.

Scores on the SRS2 included traditional scores, as well as estimates of the students’ Reading Age (RA), a scale that is roughly equivalent to the Grade Equivalent (GE) scores used in the US.
Additionally, teachers conducted individual assessments of each student’s attainment in terms of curriculum levels, a measure of developmental progress that spans the primary and secondary years in England.

Correlations with all three measures are displayed in Table 23, by year and overall. As the table indicates, the overall correlation between STAR Reading and Suffolk Reading Scale scores was 0.91, the correlation with Reading Age was 0.91 and the correlation with teacher assessments was 0.85. Within-form correlations with the SRS ability estimate ranged from 0.78 to 0.88, with a median correlation of 0.84, and ranged from 0.78 to 0.90 on Reading Age, with a median of 0.85.

Table 23: Correlations of STAR Reading with Scores on the Suffolk Reading Scale and Teacher Assessments in a Study of 16 Schools in England

                             Suffolk Reading Scale              Teacher Assessments
School Yearsᵃ   Test Form    N       SRS Scoreᵇ   Reading Age   N       Assessment Levels
2–3             SRS1A        713     0.84         0.85          n/a     n/a
4–6             SRS2A        1,255   0.88         0.90          n/a     n/a
7–9             SRS3A        926     0.78         0.78          n/a     n/a
Overall                      2,694   0.91         0.91          2,324   0.85

a. UK school year values are 1 greater than the corresponding US school grade. Thus, Year 2 corresponds to grade 1, etc.
b. Correlations with the individual SRS forms were calculated with within-form raw scores. The overall correlation was calculated with a vertical scale score.

Construct Validity: Correlations with a Measure of Reading Comprehension

The Degrees of Reading Power (DRP) test is widely recognised in the United States as a measure of reading comprehension. Yoes (1999)⁶ conducted an analysis to link the STAR Reading Rasch item difficulty scale to the item difficulty scale of DRP. As part of the study, nationwide samples of students in US grades 3, 5, 7 and 10 (Years 4, 6, 8 and 11) took two tests each (levelled forms of both the DRP and of STAR Reading calibration tests). The forms administered were appropriate to each student’s US grade level. Both tests were administered in paper-and-pencil format.
All STAR Reading test forms consisted of 44 items, a mixture of vocabulary-in-context and extended passage comprehension item types. The US grade 3 (Year 4) DRP test form (H-9) contained 42 items, and the DRP test forms for the remaining US grades (5, 7 and 10; Years 6, 8 and 11) each contained 70 items.

6. Yoes, M. (1999). Linking the STAR and DRP Reading Test Scales. Technical Report. Submitted to Touchstone Applied Science Associates and Renaissance Learning.

STAR Reading and DRP test score data were obtained on 273 students at US grade 3 (Year 4), 424 students at US grade 5 (Year 6), 353 students at US grade 7 (Year 8) and 314 students at US grade 10 (Year 11).

Item-level factor analysis of the combined STAR and DRP response data indicated that the tests were essentially measuring the same construct at each of the four grades. Latent roots (eigenvalues) from the factor analysis of the tetrachoric correlation matrices tended to verify the presence of an essentially unidimensional construct. In general, the eigenvalue associated with the first factor was very large in relation to the eigenvalue associated with the second factor. Overall, these results confirmed the essential unidimensionality of the combined STAR Reading and DRP data. Since DRP is an acknowledged measure of reading comprehension, the factor analysis data support the assertion that STAR Reading likewise measures reading comprehension.

Subsequent to the factor analysis, the STAR Reading item difficulty parameters were transformed to the DRP difficulty scale, so that scores on both tests could be expressed on a common scale. STAR Reading scores on that scale were then calculated using the methods of item response theory. The correlations between STAR Reading and DRP reading comprehension scores were then computed both overall and by US grade. Table 24 below displays the correlations.
Table 24: Correlations between STAR Reading and DRP Test Scores, Overall and by Grade

                         Test Form                  Number of Items
US Grade   Sample Size   STAR Calibration   DRP     STAR   DRP     Correlation
3          273           321                H-9     44     42      0.84
5          424           511                H-7     44     70      0.80
7          353           623                H-6     44     70      0.76
10         314           701                H-2     44     70      0.86
Overall    1,364                                                   0.89

Combining students across US grade levels and plotting their STAR Reading and DRP scores on the same scale yielded the plot shown in Figure 5. The plot shows a slightly curvilinear relationship between STAR and DRP scales, but the strong linear correlation between scores on the two tests is evident as well.

Figure 5: STAR to DRP Linking Study, Grades Combined (r = 0.89). Scatterplot of Est DRP Theta (from STAR) on the vertical axis against DRP Theta on the horizontal axis; both axes run from –6.0 to 8.0.

In sum, the Yoes (1999) study indicates by means of item factor analysis that STAR Reading items measure the same underlying attribute as the DRP: reading comprehension. The overall correlation of 0.89 between the DRP and STAR Reading test scores corroborates that. Furthermore, correcting that correlation coefficient for the effects of less than perfect reliability yields a corrected correlation of 0.96. Thus, both at the item level and at the test score level, STAR Reading was shown to measure essentially the same attribute as DRP.

Investigating Oral Reading Fluency and Developing the Estimated Oral Reading Fluency Scale

During the fall of 2007 and winter of 2008, 32 schools across the United States that were then using both STAR Reading and DIBELS oral reading fluency (DORF) for interim assessments were contacted and asked to participate in the research.
The schools were asked to ensure that students were tested on both STAR Reading and DORF within a 2-week interval during the fall and winter interim assessment schedules, usually September and January, respectively. Schools used the benchmark assessment passages from the year-appropriate DORF passage sets.

In addition, schools were asked to submit data from the previous school year on the interim assessments. Any student who had a valid STAR Reading and DORF assessment within a 2-week time span was used in the analysis. Thus, the research involved both a current sample of students who took benchmark assessments during the fall and winter of the 2007–2008 school year, as well as historical data from those same schools for students who took either the fall, winter, or spring benchmark assessments from the 2006–2007 school year.

This single-group design provided data for both evaluation of concurrent validity and the linking of the two score scales. For the linking analysis, an equipercentile methodology was used. Analysis was done independently for grades 1–4 (Years 2–5). Grade 1 (Year 2) data did not include any fall data, and all analyses were done using data from winter (both historical data from 2006–2007, and extant data collections during the 2007–2008 school year) and spring (historical data from the 2006–2007 school year). To evaluate the extent to which the linking accurately approximated student performance, 90% of the sample was used to calibrate the linking model and the remaining 10% were used for cross-validating the results. The 10% were chosen by a simple random function.

The 32 schools in the sample came from 9 states: Alabama, Arizona, California, Colorado, Delaware, Illinois, Michigan, Tennessee and Texas. This represented a broad range of geographic areas, and resulted in a large number of students (N = 12,220).
The distribution of students by year was as follows:

1st grade (Year 2): 2,001
2nd grade (Year 3): 4,522
3rd grade (Year 4): 3,859
4th grade (Year 5): 1,838

The sample was composed of 61 per cent of students of European ancestry; 21 per cent of African ancestry; 11 per cent of Hispanic ancestry; with the remaining 7 per cent of Native American, Asian or other ancestry. Just over 3 per cent of the students were eligible for services due to limited English proficiency (LEP), and between 13 and 14 per cent were eligible for special education services.

Students were individually assessed using the DORF benchmark passages. The students read the three benchmark passages under standardised conditions. The raw score for passages was computed as the number of words read correctly within the one-minute limit (WCPM, Words Correctly read Per Minute) for each passage. The final score for each student was the median WCPM across the benchmark passages and was the score used for analysis. Each student also took a STAR Reading assessment within two weeks of the DORF assessment.

Descriptive statistics for each year in the study on STAR Reading scale scores and DORF WCPM are found in Table 25. Correlations between the STAR Reading scale score and DORF WCPM at all grades were significant (p < 0.01) and diminished consistently as grades increased. Figure 6 shows a scatterplot of observed DORF WCPM and STAR Reading scale scores, with the equipercentile linking function overlaid. The equipercentile linking function appeared linear; however, deviations at the tails of the distribution for higher and lower performing students were observed. The root mean square error of linking for US grades 1–4 was found to be 14, 19, 22 and 25, respectively.
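The linking procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used in the study: the paired scores are synthetic (made up for the example), the helper name `make_equipercentile_link` is hypothetical, and no smoothing of the score distributions is applied. It matches percentile ranks across the two scales, calibrates on a random 90% of the pairs, and reports a root mean square error on the remaining 10%.

```python
import bisect
import math
import random


def make_equipercentile_link(from_scores, to_scores):
    """Return a function mapping a 'from' score to the 'to' score
    that has the same percentile rank (no smoothing applied)."""
    src = sorted(from_scores)
    dst = sorted(to_scores)

    def link(x):
        # Mid-percentile rank of x on the source scale, in [0, 1].
        below = bisect.bisect_left(src, x)
        equal = bisect.bisect_right(src, x) - below
        rank = (below + 0.5 * equal) / len(src)
        # Linearly interpolated quantile on the destination scale.
        pos = rank * (len(dst) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(dst) - 1)
        return dst[lo] + (pos - lo) * (dst[hi] - dst[lo])

    return link


# Synthetic paired scores standing in for one grade (not study data).
rng = random.Random(0)
star = [rng.gauss(275, 125) for _ in range(2000)]
wcpm = [max(0.0, 0.22 * s + 12 + rng.gauss(0, 18)) for s in star]

# 90% calibration / 10% cross-validation split by simple random shuffle.
pairs = list(zip(star, wcpm))
rng.shuffle(pairs)
cut = int(0.9 * len(pairs))
calibration, holdout = pairs[:cut], pairs[cut:]

link = make_equipercentile_link(
    [s for s, _ in calibration], [w for _, w in calibration]
)

# Root mean square error of the linked estimate on the held-out 10%.
rmse = math.sqrt(sum((link(s) - w) ** 2 for s, w in holdout) / len(holdout))
print(f"hold-out RMSE: {rmse:.1f} WCPM")
```

Because the mapping depends only on ranks, it can follow a curved relationship between the two scales, which is consistent with the deviations at the tails noted above.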
Table 25: Descriptive Statistics and Correlations between STAR Reading Scale Scores and DIBELS Oral Reading Fluency for the Calibration Sample

                   STAR Reading Scale Score   DORF WCPM
US Grade   N       Mean      SD               Mean     SD      Correlation
1          1794    172.90    98.13            46.05    28.11   0.87
2          4081    274.49    126.14           72.16    33.71   0.84
3          3495    372.07    142.95           90.06    33.70   0.78
4          1645    440.49    150.47           101.43   33.46   0.71

Figure 6: Scatterplot of Observed DORF WCPM and SR Scale Scores for Each US Grade with the Grade Specific Linking Function Overlaid

Cross-Validation Study Results

The 10 per cent of students randomly selected from the original sample were used to provide evidence of the extent to which the models based on the calibration samples were accurate. The cross-validation sample was kept out of the calibration of the linking estimation, and the results of the calibration sample linking function were applied to the cross-validation sample.

Table 26 provides descriptive information on the cross-validation sample. Means and standard deviations for DORF WCPM and STAR Reading scale score for each year were of a similar magnitude to the calibration sample. Table 27 provides results of the correlation between the observed DORF WCPM scores and the estimated WCPM from the equipercentile linking. All correlations were similar to results in the calibration sample. The average differences between the observed and estimated scores, and their standard deviations, are reported in Table 27 along with the results of one-sample t-tests of whether the mean difference differed significantly from zero. At all US grades the mean differences were not significantly different from zero, and standard deviations of the differences were very similar to the root mean square error of linking from the calibration study.
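As a worked check of the one-sample t-tests just described, the short Python sketch below recomputes each t statistic from the N, mean difference and SD of the differences reported in Table 27, using t = mean / (SD / √N). The function name `one_sample_t` is only illustrative. Because the table values are rounded, the recomputed statistics agree with the reported ones to within about 0.01 rather than exactly; p-values are not recomputed here, since that would need a t-distribution routine outside the standard library.

```python
import math


def one_sample_t(mean_diff, sd_diff, n):
    """t statistic for testing whether a mean difference is zero."""
    return mean_diff / (sd_diff / math.sqrt(n))


# US grade: (N, mean difference, SD of differences), from Table 27.
table_27 = {
    1: (205, -1.62, 15.14),
    2: (438, 0.23, 18.96),
    3: (362, -0.49, 22.15),
    4: (190, -1.92, 23.06),
}

t_values = {}
for grade, (n, mean_diff, sd_diff) in table_27.items():
    t_values[grade] = one_sample_t(mean_diff, sd_diff, n)
    # Degrees of freedom are N - 1, matching t(204), t(437), and so on.
    print(f"US grade {grade}: t({n - 1}) = {t_values[grade]:.2f}")
```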
Table 26: Descriptive Statistics and Correlations between STAR Reading Scale Scores and DIBELS Oral Reading Fluency for the Cross-Validation Sample

                  STAR Reading Scale Score   DORF WCPM
US Grade   N      Mean      SD               Mean     SD
1          205    179.31    100.79           45.61    26.75
2          438    270.04    121.67           71.18    33.02
3          362    357.95    141.28           86.26    33.44
4          190    454.04    143.26           102.37   32.74

Table 27: Correlation between Observed WCPM and Estimated WCPM Along with the Mean and Standard Deviation of the Differences between Them

US Grade   N     Correlation   Mean Difference   SD Difference   t-test on Mean Difference
1          205   0.86          –1.62             15.14           t(204) = –1.54, p = 0.13
2          438   0.83          0.23              18.96           t(437) = 0.25, p = 0.80
3          362   0.78          –0.49             22.15           t(361) = –0.43, p = 0.67
4          190   0.74          –1.92             23.06           t(189) = –1.15, p = 0.25

UK Study Results

Descriptive statistics on test results for the students from the 28 schools that participated in the 2006 UK reliability study are found in Table 28. As STAR Reading is a vertically scaled assessment, it is expected that scores will increase over time and provide adequate separation between contiguous years. Results in Table 28 indicate that the median score (50th percentile rank) and all other score distribution points gradually increase over years. In addition, a single-factor ANOVA was computed to evaluate the significance of differences between means at each year. The results indicated significant differences between years, F(7, 11948) = 905.22, p < 0.001, η² = 0.35, with observed power of 0.99. Follow-up analyses using Games-Howell post-hoc testing found significant differences, p < 0.001, between all years.
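The effect size reported for the ANOVA can be recovered from the F statistic and its degrees of freedom, since for a single-factor design η² = (F × df_between) / (F × df_between + df_within). The Python sketch below applies that identity to the reported F(7, 11948) = 905.22 and also checks the within-groups degrees of freedom against the year-by-year N values in Table 28; the function name is illustrative only.

```python
def eta_squared_from_f(f_value, df_between, df_within):
    """Eta squared for a one-way ANOVA, recovered from F and its df."""
    return (f_value * df_between) / (f_value * df_between + df_within)


# N per year (Years 2-9), as listed in Table 28.
n_per_year = [557, 1076, 1439, 1514, 1229, 4029, 1480, 632]
df_between = len(n_per_year) - 1                  # 8 years -> 7
df_within = sum(n_per_year) - len(n_per_year)     # 11,956 - 8 = 11,948

eta_sq = eta_squared_from_f(905.22, df_between, df_within)
print(f"df = ({df_between}, {df_within}), eta squared = {eta_sq:.2f}")  # eta squared = 0.35
```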
Table 28: Descriptive Statistics for Student Test Performance in Scale Scores

               Percentile Rank
Year   N       5     25    50    75    95
2      557     60    72    106   251   456
3      1,076   66    98    228   350   508
4      1,439   78    234   360   469   650
5      1,514   87    294   433   554   811
6      1,229   149   393   510   662   983
7      4,029   228   449   585   768   1,119
8      1,480   198   470   653   901   1,222
9      632     241   513   711   953   1,258

In addition, the time to complete a STAR Reading assessment was computed to provide evidence for the length of time a test session lasted. The distribution of test times is provided in Table 29 by year and described by percentile ranks. Results indicate at least half of the students at each year finished within 8 minutes while at least 75 per cent finished within 10 minutes. Total test time also decreased with each subsequent year.

Table 29: Total Test Time, in Minutes, for a STAR Reading Test by Year (Given in Percentiles)

               Percentile
Year   N       5      25     50     75     95
2      557     3.18   5.74   7.55   9.87   14.50
3      1,076   2.99   5.45   7.11   8.77   11.92
4      1,439   3.88   5.38   6.60   7.90   10.48
5      1,514   3.57   5.05   6.38   7.70   10.15
6      1,229   3.62   4.93   5.98   7.16   9.43
7      4,029   3.57   4.80   5.82   7.00   8.98
8      1,480   3.12   4.55   5.58   6.75   8.88
9      632     3.20   4.38   5.32   6.50   8.59

Concurrent Validity

A large validation study was conducted in partnership with the National Foundation for Educational Research (NFER) in the UK across Years 2–9. The study was undertaken during the 2006–2007 academic year to investigate the validity of STAR Reading in a sample of students attending schools in England. Over 250 students per year were recruited and evaluated on both STAR Reading and the norm-referenced test Suffolk Reading Scale 2 by nferNelson.⁷ In addition, all participants had their teachers provide a teacher assessment (TA) of their present reading skills with respect to the National Curriculum Level.
A single-group, cross-sectional design was used with counterbalanced test administrations. Students took both the STAR Reading assessment and the Suffolk Reading Scale 2 (nferNelson, 2002). Students in Years 2 and 3 were administered the Suffolk Scale form level 1A; students in Years 4–6 were administered level 2A; and students in Years 7–9 were administered level 3A. Age-standardised scores measured student performance on the Suffolk tests, and since students within a year all took the same form, the number correct score was also used in year analyses. Student reading ages (RA), given in months of age, were computed from their age and Suffolk reading scores.

In addition to gathering external test data from the Suffolk Reading Scale 2, teachers were asked to provide the students’ present National Curriculum Level in English (Reading) through teacher assessment (TA).

Descriptive data for STAR Reading scale scores (STAR), Suffolk Reading Scale 2 age-standardised score, Suffolk Reading Scale 2 raw total score, and reading age (RA) for each year are provided in Table 30. Results for the age-standardised scores suggested the students in each year compared similarly with the Suffolk norming group expectations, with average scores ranging from 96 to 102 with a median of 100, and standard deviations ranging from 14 to 15 with a median of 14. Reading ages (RA) and STAR Reading scale scores (STAR) also increased from Year 2 to Year 9. This was the expected progression, as both scales span all years in the study and should show gradual increases across the years.
Table 30: Descriptive Statistics and Validity Coefficients by Yearsᵃ

                                                         STAR Correlation With
Year   Test                         Nᵇ    Mean   SDᶜ     Total Score   Standardised Score   RA
2      STAR Scale Score             271   156    114     0.85          0.79                 0.89
       Suffolk Total Score                29     16
       Suffolk Standardised Score         96     15
       Suffolk RA                         80     11
3      STAR Scale Score             262   280    167     0.89          0.82                 0.83
       Suffolk Total Score                44     14
       Suffolk Standardised Score         100    15
       Suffolk RA                         91     15
4      STAR Scale Score             389   369    176     0.86          0.85                 0.88
       Suffolk Total Score                46     13
       Suffolk Standardised Score         100    14
       Suffolk RA                         105    18
5      STAR Scale Score             383   458    208     0.85          0.86                 0.89
       Suffolk Total Score                51     12
       Suffolk Standardised Score         99     14
       Suffolk RA                         114    21
6      STAR Scale Score             331   571    239     0.85          0.85                 0.88
       Suffolk Total Score                57     12
       Suffolk Standardised Score         100    15
       Suffolk RA                         126    23
7      STAR Scale Score             271   715    259     0.75          0.76                 0.77
       Suffolk Total Score                45     11
       Suffolk Standardised Score         100    14
       Suffolk RA                         138    23
8      STAR Scale Score             312   830    264     0.74          0.77                 0.77
       Suffolk Total Score                51     11
       Suffolk Standardised Score         102    14
       Suffolk RA                         153    25
9      STAR Scale Score             206   885    264     0.77          0.77                 0.77
       Suffolk Total Score                52     9
       Suffolk Standardised Score         98     14
       Suffolk RA                         157    25

a. Scores rounded to nearest integer. Correlations are those of STAR Reading scale scores with the Suffolk scores indicated in the respective column headings.
b. Number of students with STAR and Suffolk Reading Scale 2 scores.
c. Standard Deviation.

7. nferNelson (compiled by F. Hagley). (2002). Suffolk Reading Scale 2. London: nferNelson.

Concurrent validity indices are provided for STAR with the external measures in the correlation columns. Correlations with the Suffolk raw total score ranged from 0.74 to 0.89 with a median of 0.85.
Correlations with age-standardised scores on the Suffolk Reading Scale 2 ranged from 0.76 to 0.86 with a median of 0.81. Correlations with reading age (RA) ranged from 0.77 to 0.89 with a median of 0.86. These results provided strong evidence of concurrent validity of scores on STAR Reading with respect to reading achievement measured using different metrics normed in the UK school-aged population.

Table 31 provides overall correlations between STAR scale scores and both the reading age (RA) and the teacher assessment (TA-NCL) scales for all years combined. This analysis was possible because all three scores scale across the years in the study and are neither age- nor year-specific. As STAR is a vertically scaled assessment, the correlations with RA and TA were computed on the complete sample. The results indicated a large correlation with reading ages (RA) of 0.93 and of 0.83 with the teacher assessments of student reading achievement with respect to the national curriculum levels.

Table 31: Overall Correlations between STAR Reading Scale Scores and Reading Ages (RA) and the Teacher Assessment of Student Reading Attainment on the National Curriculum Level (TA-NCL)

Measures   N       Correlation
RA         2,425   0.92
TA-NCL     2,425   0.83

Summary of STAR Reading Validity Data

The validity data presented in this manual include evidence of STAR Reading’s concurrent, retrospective, predictive and construct validity. The Meta-Analysis section showed the average uncorrected correlation between STAR Reading and all other reading tests to be 0.72. (Many meta-analyses adjust the correlations for range restriction and attenuation to less than perfect reliability; had we done that here, the average correlation would have exceeded 0.80.) Correlations with specific measures of reading ability were often higher than this average.
For example, Bennicoff-Nan (2002) found very consistent within-US grade correlations averaging 0.82 with SAT9 and 0.80 with the California Standards Test. Yoes (1999) found within-US grade correlations with DRP averaging 0.81. When these data were combined across US grades, the correlation was 0.89. The latter correlation may be interpreted as an estimate of the construct validity of STAR Reading as a measure of reading comprehension. Yoes also reported that an item factor analysis of DRP and STAR Reading items yielded a single common factor. This provides strong support for the assertion that STAR Reading is a measure of reading comprehension.

Norming

The data for this standardisation were mostly gathered during the academic year 2009–2010, starting August 1, 2009, although much of the data came before this, going back to 2006.

Before the norming process could begin, the data needed cleaning. Some schools and school networks were not in the UK, and were deleted. Some tests were incomplete (test status not equal to 33) and were deleted. In some cases the Scaled Score recorded was impossibly low (e.g. 3.7546), reflecting errors in using the test. All cases where the score was below 5 and where decimal points appeared were deleted, since the number in this column should be a whole number. In addition, all cases from schools with fewer than 5 cases were deleted, as it was thought that such small numbers within a school might represent teachers experimenting with the test.

Sample Characteristics

Regional Distribution

We considered whether the regional distribution of Scaled Scores was proportionally representative of the school population of these regions in the UK. Table 32 gives the distribution of tests by region (with the number in each region expressed as a percentage of the total number of tests in all regions), then the school population of the regions.
Table 32: Distribution of Test Results by Region

                                          Scotland &
                              North       Northern Ireland  Southeast   Southwest   Total
Distribution of Tests
  Primary School              44,245      48,236            150,586     32,315      275,382
                              16.07%      17.52%            54.68%      11.73%      100%
  Secondary School            200,888     38,647            190,037     111,475     541,047
                              37.13%      7.14%             35.12%      20.60%      100%
  Total                       245,133     86,883            340,623     143,790     816,429
                              30.03%      10.64%            41.72%      17.61%      100%
School Population of Regions
  Primary School              1,509,674   703,781           1,587,115   1,397,178   5,197,748
  Secondary School            1,272,036   581,914           1,340,884   1,252,343   4,447,177
  Total                       2,781,710   1,285,695         2,927,999   2,649,521   9,644,925

At primary level, tests are disproportionately high from the Southeast, with Scotland somewhat under quota but the North and Southwest greatly under quota. At secondary level the picture is different: the North is the most disproportionately high and the Southeast is somewhat disproportionately high. However, Scotland is very low. When these differences are tested statistically, both primary and secondary distributions are statistically significantly different from what would be expected if the tests were distributed proportionately according to the school population of the regions (see Table 33).

Table 33: Proportionate Test Redistribution Based on Regional School Populations

Region                  Observed  Expected  Pearson χ2
Primary Schools (a)
  North                 16.07%    29.04%    –24.068
  Scotland              17.52%    26.88%    –18.053
  Southeast             54.68%    30.53%    43.707
  Southwest             11.73%    13.54%    –4.919
Secondary Schools (b)
  North                 37.13%    28.84%    15.437
  Scotland              7.14%     27.47%    –38.789
  Southeast             35.12%    30.35%    8.658
  Southwest             20.60%    13.33%    19.912

a. Pearson χ2(3) = 2.8e+03 Pr = 0.000; Likelihood-ratio χ2(3) = 2.6e+03 Pr = 0.000.
b. Pearson χ2(3) = 2.2e+03 Pr = 0.000; Likelihood-ratio χ2(3) = 2.8e+03 Pr = 0.000.

Consequently, we cannot say with certainty that the standardisation equally represents all areas of the UK.
However, it is not unusual for a standardisation not to represent all areas of the UK.

Standardised Scores

Student age at time of testing in Years and Months was established by subtracting their date of birth from their date of testing. Students within the same Month of age were treated as equal and aggregated. All students with a given Month had their test scores analysed and a new variable of Standardised Score was created with a mean of 100, a standard deviation of 15, and consistent and regular psychometric properties.

Table 34 is a list of all ages in Years:Months with the number of students (frequencies) who were at each Month of Age. It is evident that much younger and much older students were not well represented. There were fewer than 100 students at every age below 5:07, and at every age above 17:01. By contrast, at the age of 11:08, there were 22,981 students. At extremes of age the standardisation may not be entirely reliable owing to small numbers of students.
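The age grouping and rescaling described above can be sketched as follows. The manual does not state the exact transformation that was used, so this sketch assumes a simple linear rescaling of Scaled Scores within each month-of-age group (z-score multiplied by 15, plus 100). The function names and the sample scores are hypothetical and serve only as illustration.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev


def age_in_months(date_of_birth: date, date_of_testing: date) -> int:
    """Completed months between birth and testing (testing date minus birth date)."""
    months = (date_of_testing.year - date_of_birth.year) * 12
    months += date_of_testing.month - date_of_birth.month
    if date_of_testing.day < date_of_birth.day:
        months -= 1  # the current month of age is not yet complete
    return months


def format_age(months: int) -> str:
    """Render an age in months in the Years:Months style used in Table 34."""
    return f"{months // 12}:{months % 12:02d}"


def standardise_by_age(records):
    """records: list of (age_in_months, scaled_score) pairs.

    Returns one standardised score per record, rescaled within each
    month-of-age group to mean 100 and standard deviation 15.
    ASSUMPTION: a linear (z-score) transformation; the manual does not
    specify the method actually used.
    """
    groups = defaultdict(list)
    for age, score in records:
        groups[age].append(score)
    stats = {age: (mean(vals), pstdev(vals)) for age, vals in groups.items()}

    standardised = []
    for age, score in records:
        group_mean, group_sd = stats[age]
        if group_sd == 0:
            # A group with one student (or identical scores) carries no spread.
            standardised.append(100.0)
        else:
            standardised.append(100 + 15 * (score - group_mean) / group_sd)
    return standardised


# Hypothetical example: three students aged 10:07 and one aged 10:08.
age = age_in_months(date(2000, 3, 15), date(2010, 11, 14))
scores = standardise_by_age([(age, 400), (age, 500), (age, 600), (age + 1, 450)])
print(format_age(age), [round(s, 1) for s in scores])
```

The single-student group in the example shows why sparsely populated ages are a concern: with so few scores, the group mean and standard deviation are poorly estimated, so the resulting standardised scores are unstable.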
Table 34: Number of Students at Each Month of Age

Age in  Number of  Age in  Number of  Age in  Number of  Age in  Number of  Age in  Number of  Age in  Number of
Months  Students   Months  Students   Months  Students   Months  Students   Months  Students   Months  Students
4:00    3          6:05    666        8:10    5,366      11:03   13,456     13:08   7,142      16:01   299
4:01    1          6:06    878        8:11    5,397      11:04   15,012     13:09   6,250      16:02   289
4:02    5          6:07    914        9:00    5,871      11:05   15,926     13:10   5,292      16:03   228
4:03    5          6:08    972        9:01    5,734      11:06   17,304     13:11   4,775      16:04   199
4:04    6          6:09    1,139      9:02    5,960      11:07   18,215     14:00   4,971      16:05   172
4:05    7          6:10    1,191      9:03    5,770      11:08   22,981     14:01   3,867      16:06   117
4:06    6          6:11    1,241      9:04    5,957      11:09   22,237     14:02   3,750      16:07   123
4:07    7          7:00    1,678      9:05    6,057      11:10   22,186     14:03   3,093      16:08   127
4:08    5          7:01    1,952      9:06    5,770      11:11   22,748     14:04   3,009      16:09   138
4:09    8          7:02    2,332      9:07    6,223      12:00   24,441     14:05   2,226      16:10   95
4:10    22         7:03    2,434      9:08    6,359      12:01   20,245     14:06   1,978      16:11   112
4:11    14         7:04    2,762      9:09    6,275      12:02   21,206     14:07   1,564      17:00   104
5:00    13         7:05    3,061      9:10    6,264      12:03   19,606     14:08   1,640      17:01   86
5:01    14         7:06    3,127      9:11    6,185      12:04   20,200     14:09   1,214      17:02   72
5:02    26         7:07    3,587      10:00   6,746      12:05   18,083     14:10   1,029      17:03   71
5:03    37         7:08    3,672      10:01   6,358      12:06   18,012     14:11   797        17:04   71
5:04    49         7:09    3,877      10:02   6,443      12:07   17,128     15:00   757        17:05   39
5:05    64         7:10    3,980      10:03   5,987      12:08   18,236     15:01   738        17:06   33
5:06    96         7:11    4,055      10:04   6,164      12:09   17,003     15:02   841        17:07   49
5:07    98         8:00    4,417      10:05   6,200      12:10   14,571     15:03   616        17:08   39
5:08    133        8:01    4,419      10:06   6,064      12:11   14,266     15:04   631        17:09   31
5:09    134        8:02    4,614      10:07   6,063      13:00   15,166     15:05   701        17:10   31
5:10    174        8:03    4,415      10:08   6,268      13:01   11,775     15:06   613        17:11   30
5:11    198        8:04    4,829      10:09   6,181      13:02   10,988     15:07   554        18:00   27
6:00    248        8:05    4,738      10:10   6,477      13:03   10,117     15:08   501
6:01    339        8:06    4,740      10:11   6,665      13:04   9,542      15:09   447
6:02    444        8:07    5,070      11:00   8,088      13:05   8,578      15:10   375
6:03    561        8:08    5,321      11:01   10,794     13:06   8,093      15:11   355
6:04    637        8:09    5,411      11:02   12,667     13:07   7,114      16:00   366

There are some very high standardised scores at the top end of the distribution for the youngest children; this is a result of rather small frequencies in these cells.

Percentile Ranks

From the Standardised Scores, Percentile Ranks were developed. All Percentile Ranks from 1 to 100 are listed with the mean Standardised Score which goes with each item. For each Percentile Rank, the 90% Confidence Limits are also given.

Table 35: Percentile Ranks Developed from Mean Standardised Scores

            Mean                 90% Confidence                Mean                 90% Confidence
            Standardised  Std.   Interval                      Standardised  Std.   Interval
Percentile  Score         Dev.   Lower    Upper    Freq.  Percentile  Score         Dev.   Lower    Upper    Freq.
0.01        67.21         0.755  67.203   67.225   8224   0.51        98.85         0.099  98.853   98.855   8194
0.02        69.51         0.777  69.496   69.519   8223   0.52        99.20         0.102  99.200   99.203   8235
0.03        72.10         0.680  72.091   72.111   8222   0.53        99.56         0.104  99.554   99.557   8219
0.04        74.19         0.528  74.181   74.196   8218   0.54        99.91         0.101  99.909   99.912   8224
0.05        75.82         0.433  75.815   75.828   8222   0.55        100.28        0.110  100.277  100.280  8226
0.06        77.18         0.362  77.178   77.188   8227   0.56        100.66        0.111  100.662  100.665  8233
0.07        78.35         0.316  78.349   78.359   8227   0.57        101.05        0.110  101.044  101.047  8220
0.08        79.38         0.280  79.371   79.380   8225   0.58        101.42        0.108  101.418  101.421  8227
0.09        80.32         0.264  80.312   80.320   8221   0.59        101.81        0.112  101.805  101.809  8218
0.1         81.21         0.251  81.206   81.213   8222   0.6         102.19        0.112  102.185  102.188  8219
0.11        82.04         0.228  82.032   82.038   8226   0.61        102.59        0.116  102.589  102.593  8223
0.12        82.79         0.214  82.791   82.797   8220   0.62        103.00        0.120  102.996  102.999  8226
0.13        83.52         0.206  83.516   83.522   8225   0.63        103.41        0.118  103.407  103.411  8223
0.14        84.22         0.205  84.221   84.227   8225   0.64        103.82        0.122  103.822  103.826  8237
0.15        84.90         0.186  84.893   84.899   8225   0.65        104.25        0.125  104.253  104.256  8209
0.16        85.52         0.174  85.518   85.523   8212   0.66        104.69        0.124  104.685  104.688  8221
0.17        86.09         0.156  86.087   86.091   8235   0.67        105.12        0.127  105.116  105.120  8221
0.18        86.62         0.151  86.619   86.624   8222   0.68        105.57        0.132  105.565  105.569  8213
0.19        87.14         0.147  87.138   87.142   8216   0.69        106.02        0.132  106.018  106.022  8236
0.2         87.63         0.139  87.626   87.630   8233   0.7         106.49        0.136  106.483  106.487  8225
0.21        88.10         0.134  88.100   88.104   8214   0.71        106.96        0.140  106.954  106.958  8216
0.22        88.55         0.128  88.549   88.553   8226   0.72        107.45        0.141  107.444  107.448  8231
0.23        88.98         0.122  88.981   88.984   8221   0.73        107.94        0.141  107.934  107.938  8212
0.24        89.39         0.115  89.389   89.393   8215   0.74        108.44        0.150  108.435  108.440  8230
0.25        89.79         0.113  89.784   89.787   8229   0.75        108.96        0.151  108.960  108.965  8224
0.26        90.17         0.113  90.171   90.175   8226   0.76        109.48        0.154  109.478  109.483  8223
0.27        90.56         0.111  90.560   90.563   8206   0.77        110.04        0.161  110.033  110.038  8216
0.28        90.94         0.106  90.936   90.939   8236   0.78        110.60        0.164  110.593  110.598  8223
0.29        91.30         0.105  91.303   91.306   8223   0.79        111.16        0.165  111.161  111.166  8235
0.3         91.66         0.104  91.662   91.665   8231   0.8         111.74        0.171  111.739  111.744  8216
0.31        92.03         0.108  92.031   92.035   8220   0.81        112.35        0.182  112.348  112.353  8224
0.32        92.39         0.101  92.387   92.390   8238   0.82        113.00        0.194  113.000  113.005  8223
0.33        92.73         0.097  92.731   92.734   8216   0.83        113.69        0.203  113.687  113.693  8219
0.34        93.07         0.100  93.068   93.071   8211   0.84        114.41        0.213  114.410  114.417  8227
0.35        93.42         0.098  93.414   93.417   8234   0.85        115.16        0.223  115.161  115.168  8225
0.36        93.75         0.096  93.749   93.752   8203   0.86        115.96        0.238  115.960  115.967  8227
0.37        94.09         0.095  94.086   94.088   8229   0.87        116.83        0.253  116.827  116.835  8217
0.38        94.42         0.099  94.420   94.423   8233   0.88        117.77        0.284  117.761  117.769  8227
0.39        94.76         0.097  94.760   94.763   8223   0.89        118.80        0.307  118.795  118.804  8220
0.4         95.10         0.097  95.101   95.103   8224   0.9         119.89        0.336  119.889  119.899  8222
0.41        95.44         0.099  95.438   95.441   8226   0.91        121.08        0.355  121.077  121.087  8229
0.42        95.78         0.098  95.779   95.782   8211   0.92        122.39        0.392  122.381  122.393  8215
0.43        96.12         0.101  96.119   96.122   8224   0.93        123.76        0.403  123.754  123.765  8227
0.44        96.47         0.099  96.467   96.470   8226   0.94        125.21        0.419  125.200  125.212  8228
0.45        96.81         0.100  96.811   96.814   8228   0.95        126.69        0.436  126.684  126.697  8217
0.46        97.16         0.100  97.154   97.157   8220   0.96        128.29        0.498  128.286  128.301  8222
0.47        97.50         0.096  97.495   97.498   8224   0.97        130.11        0.566  130.099  130.116  8223
0.48        97.83         0.096  97.830   97.833   8217   0.98        132.29        0.715  132.280  132.301  8221
0.49        98.16         0.099  98.162   98.165   8238   0.99        135.36        1.117  135.344  135.376  8223
0.5         98.51         0.100  98.505   98.508   8228   1           143.60        7.776  143.485  143.711  8226

Gender

Having established the basic standardisation, further studies could then be conducted. One investigation explored whether boys and girls had significantly different outcomes in terms of test scores and standardised scores (see Table 36).

Table 36: Test of Differences between Females and Males (a)

Group     Obs.    Mean      Std. Err.  Std. Dev.  95% Conf. Interval
Female    452944  100.5341  0.021436   14.42665   100.4921  100.5762
Male      369363  99.345    0.0257447  15.64639   99.29454  99.39545
Combined  822307  100       0.0165398  14.99848   99.96758  100.0324
Diff (b)          1.189143  0.033226              1.124021  1.254264

a. Female test scores are statistically significantly higher than male test scores.
b. Diff = mean(0) – mean(1), t = 35.7896, degrees of freedom = 822305. Ho: diff = 0; Ha: diff < 0 Pr(T < t) = 1.0000; Ha: diff != 0 Pr(|T| > |t|) = 0.0000; Ha: diff > 0 Pr(T > t) = 0.0000.

As with Maths, female test scores are higher than male (although this is less surprising in Reading), and this difference is statistically significant.
However, as with Maths, this is largely because the samples are so big, and the actual difference is only one point, so for practical purposes there is no point producing separate tables for boys and girls. As with Maths, many more girls have been tested than boys, by an even bigger differential than in Maths. It is interesting to speculate about why this should be.

Regional Differences in Outcome

A further interesting question is whether students in the four regions of the UK (Southeast, Southwest, North, and Scotland and Northern Ireland) have significantly different outcomes. Of course, if they did, it would not necessarily say anything about the relative effectiveness of schools or the degree of socio-economic disadvantage in these areas, only whether the test is targeted on more or less able students (see Table 37).

Table 37: Test of Regional Differences in Test Scores (a)

ANOVA (b)
Source    Partial SS  df      MS          F        Prob > F
Model     808301.541  3       269433.847  1202.12  0
Sector    808301.541  3       269433.847  1202.12  0
Residual  182987646   816424  224.133105
Total     183795948   816424  255.122329

Regression
Standard Score  Coef.       Std. Err.  t        P>t  95% Conf. Interval
Scotland        2.439411    0.0591105  41.27    0    2.323557    2.555266
Southeast       –0.1634189  0.0396528  –4.12    0    –0.2411371  –0.085701
Southwest       1.872009    0.0497302  37.64    0    1.77454     1.969479
_cons           99.49933    0.030238   3290.54  0    99.44007    99.5586

a. N.B. North is the reference category.
b. The ANOVA shows that there are statistically significantly different test scores between regions. The regression above shows that this is driven by higher average test scores in Scotland and the South West, and lower average test scores in the South East.

Analysis of Variance shows there are highly significant differences between regions. Regression shows this is driven by higher than average test scores in Scotland (and to a lesser extent the Southwest), while the lowest scores are in the Southeast. This is the same as for Maths.
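To put the gender and regional comparisons above on a common footing, the following sketch recomputes the pooled-variance t statistic from the summary figures in Table 36, adds a standardised effect size (Cohen's d), and derives the regional means implied by the Table 37 regression coefficients. The effect-size calculation and all variable names are illustrative additions and are not part of the original analysis; the input figures are copied from the two tables.

```python
import math

# Summary statistics copied from Table 36 (standardised scores by gender).
n_f, mean_f, sd_f = 452944, 100.5341, 14.42665
n_m, mean_m, sd_m = 369363, 99.345, 15.64639

# Pooled-variance two-sample t test, matching the reported degrees of
# freedom (n_f + n_m - 2 = 822305).
pooled_var = ((n_f - 1) * sd_f**2 + (n_m - 1) * sd_m**2) / (n_f + n_m - 2)
pooled_sd = math.sqrt(pooled_var)
diff = mean_f - mean_m
std_err = pooled_sd * math.sqrt(1 / n_f + 1 / n_m)
t_stat = diff / std_err

# Cohen's d: the mean difference expressed in pooled standard deviations.
# This is an added illustration, not a statistic reported in the manual.
cohens_d = diff / pooled_sd

# Regression coefficients copied from Table 37. North is the reference
# category, so its mean equals the constant and every other regional mean
# is the constant plus that region's coefficient.
constant = 99.49933
coefficients = {"Scotland": 2.439411, "Southeast": -0.1634189, "Southwest": 1.872009}
regional_means = {"North": constant}
regional_means.update({region: constant + b for region, b in coefficients.items()})

print(f"t = {t_stat:.2f}, Cohen's d = {cohens_d:.3f}")
for region, value in sorted(regional_means.items(), key=lambda kv: kv[1]):
    print(f"{region:<10} {value:.2f}")
```

With these inputs the t statistic comes out close to the reported 35.79, while d stays below 0.1, which is consistent with the remark that the gender difference reaches significance mainly because the samples are so large. The implied regional means likewise differ by only a few points on the standardised scale.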
Other Issues

Examining differences by socio-economic disadvantage of school or ethnic minority of student would have been of interest, but unfortunately data were not available on these factors.

Reference

National Foundation for Educational Research (2007). Renaissance Learning Equating Study: Report. Slough: NFER.

Frequently Asked Questions

This chapter addresses a number of questions that educators have asked about STAR Reading tests and score interpretations.

Does STAR Reading Assess Comprehension, Vocabulary or Reading Achievement?

STAR Reading assesses reading comprehension and overall reading achievement. Through vocabulary-in-context test items, STAR Reading requires students to rely on background information, apply vocabulary knowledge and use active strategies to construct meaning from the assessment text. These cognitive tasks are consistent with what researchers and practitioners describe as reading comprehension. STAR Reading’s IRL score is a measure of reading comprehension. The STAR Reading Scaled Score is a measure of reading achievement.

How Do Zone of Proximal Development (ZPD) Ranges Fit In?

The Zone of Proximal Development [8] defines the reading level range from which the student should be selecting books for reading practice in order to achieve optimal growth in literacy skills. The ZPD is derived from a student’s demonstrated Grade Equivalent score. Renaissance Learning developed the ZPD ranges according to Vygotskian theory, based on an analysis of Accelerated Reader book reading data from 80,000 students in the 1996–1997 school year. More information is available in Research Foundation for Reading Renaissance Target Setting Practices (2003), which is published and distributed by Renaissance Learning. Table 47 on page 104 contains the relationship between GE and ZPD.

How Can the STAR Reading Test Determine a Child’s Reading Level in Less Than Ten Minutes?
Short test times are possible because the STAR Reading test is computer-adaptive. It adapts to test students at their level of proficiency. Because the STAR Reading test can adapt and adjust to the student with virtually every question, it is more efficient than conventional pencil and paper tests and acquires more information about a student’s reading ability in less time. This means the STAR Reading test can achieve measurement precision comparable to a conventional test that takes two or more times as long to administer.

8. Although the score is not displayed on UK reports, it is used in calculations behind the scenes.

How Does the STAR Reading Test Compare with Other Standardised/National Tests?

Very well. The STAR Reading test has a standard error and reliability that are very comparable to other standardised tests. Also, STAR Reading test results correlate well with results from these other test instruments.

Validity information reported here is drawn from the National Foundation for Educational Research (NFER) (2007). NFER reported that the correlations between the Suffolk Reading Scales and the STAR Reading Test were 0.84 for SRS1A, 0.98 for SRS2A and 0.78 for SRS3A (the Suffolk Scale being in three separate sections, each suitable for a different age range). These are all satisfactorily high. A further analysis of all the Suffolk items taken together (eliminating any duplicated items which could have appeared in more than one of the three separate Suffolk tests) indicated that the correlation between Suffolk and STAR Reading was 0.91. This is a very high figure for a validity measure.

When performing the US norming of the STAR Reading test, we gathered student performance data from several other commonly used reading tests. These data comprised more than 12,000 student test results from test instruments including CAT, ITBS, MAT, Stanford, TAAS, CTBS and others.
We computed correlation coefficients between STAR Reading 2.x results and results of each of these test instruments for which we had sufficient data. These correlation coefficients are included in “Reliability and Validity” (pages 42–57).

Using IRT computer-adaptive technology, the STAR Reading test achieves its results with fewer test items and shorter test times than other standardised tests.

What Are Some of the Other US Standardised Tests That Might Be Compared to the STAR Reading Test?

CAT: California Achievement Test, US Grades (K–12)
Designed to measure achievement in the basic skills commonly found in state and district (school network) curricula.

CTBS: Comprehensive Test of Basic Skills, US Grades (K–12)
Modular testing system that evaluates students’ academic achievement from US grades K–12. It measures the basic content areas: reading, language, spelling, mathematics, study skills, science and social studies.

Gates–MacGinitie Reading Test, US Grades (K–12)
Designed to assess student achievement in reading.

ITBS: Iowa Test of Basic Skills, US Grades (K–9)
Designed to provide for comprehensive and continuous measurement of growth in the fundamental skills, vocabulary, reading, the mechanics of writing, methods of study and mathematics.

MAT: Metropolitan Achievement Test, US Grades (1–12)
Designed to measure the achievement of students in the major skill and content areas of the school curriculum.

Stanford Achievement Test, US Grades (1–12)
Designed to measure the important learning outcomes of the school curriculum. Measures student achievement in reading, mathematics, language, spelling, study skills, science, social studies and listening.

TAKS: Texas Assessment of Knowledge and Skills, US Grades (3–11)
Texas Education Agency mandated criterion-referenced test used to assess student and school system performance throughout the state. Includes tests in reading, maths, writing, science and social studies.
Passage of a US grade 10 exit exam is required for graduation.

Why Do Some of My Students Who Took STAR Reading Tests Have Scores That Are Widely Varying from the Results of Our Other US-Standardised Test Program?

The simple answer is that at least three factors work to make scores different on different tests: score scale differences, measurement error in both testing instruments and differences between their norms groups.

Scale scores obtained on different tests, such as the Suffolk Reading Scale and STAR Reading, are not comparable, so we should not expect students to get the same scale scores on both tests, any more than we would expect the same results when measuring weights using one scale calibrated in pounds and another calibrated in kilograms. If norm-referenced scores, such as GE scores or percentile ranks, are being compared, scores will certainly differ to some extent because of sampling differences between the two tests’ respective norms groups. Finally, even if the score scales were made comparable, or the norms groups were identical, measurement error in both tests would cause the scores to be different in most cases.

Although actual scores will differ because of the factors discussed above, the statistical correlation between scores on STAR Reading and other standardised tests is generally high. That is, the higher students’ scores are on STAR Reading, the higher their scores on another test tend to be. You will find that, on the whole, STAR Reading results agree very well with almost all other standardised reading test results.

All standardised test scores have measurement error. The STAR Reading measurement error is comparable to that of most other standardised tests. When one compares the results from different tests taken at different times, it is not unusual to see differences in test scores ranging from 2–5 grade (year) levels. This is true when comparing results from other test instruments as well.
Standardised tests provide approximate measurements. The STAR Reading test is no different in this regard, but its adaptive nature makes its scores more reliable than conventional test scores near the minimum and maximum scores on a given form. A common shortcoming of conventional tests involves “floor” and “ceiling” effects at each test level. The STAR Reading test is not subject to this shortcoming because of its adaptive branching and large item bank.

Other factors, such as student motivation and the testing environment, are also different for STAR Reading and high-stakes tests.

Why Do We See a Significant Number of Our Students Performing at a Lower Level Now Than They Were Nine Weeks Ago?

This is a result of measurement error. As mentioned above, all psychometric instruments, including the STAR Reading test, have some level of measurement error associated with them. Measurement error causes students’ scores to fluctuate around their “true scores”. About half of all observed scores are smaller than the students’ true scores; the result is that some students’ capabilities are underestimated to some extent.

If a group of students were to take a test twice on the same day, without repeating any items, about half of their scores would increase on the second test, while the other half would decline; the size of the individual score variations is an indicator of measurement error. Although measurement error affects all scores to some degree, the average scores on the two tests would be very similar to one another.

Scores on a second test taken after a longer time interval will tend to increase as a result of growth; however, if the amount of growth is small relative to the amount of measurement error, an appreciable percentage of students may show score declines, even though the majority of scores increase.
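A small simulation can illustrate the argument above. It is only a sketch: the error size (30 scale-score points), the growth amount (20 points) and the range of true scores are hypothetical values chosen for illustration and are not STAR Reading parameters. Each observed score is modelled as a true score plus independent, normally distributed measurement error.

```python
import random


def share_of_declines(n_students=10_000, error_sd=30.0, growth=0.0, seed=0):
    """Fraction of simulated students whose second score is lower than the first.

    ASSUMPTIONS (illustrative only): true scores are uniform between 200 and
    800, and measurement error is normal with standard deviation error_sd.
    """
    rng = random.Random(seed)
    declines = 0
    for _ in range(n_students):
        true_score = rng.uniform(200, 800)
        first = true_score + rng.gauss(0, error_sd)
        second = true_score + growth + rng.gauss(0, error_sd)
        if second < first:
            declines += 1
    return declines / n_students


# Same-day retest: no growth, so roughly half of the scores go down.
same_day = share_of_declines(growth=0.0)

# Later retest with growth that is small relative to the measurement error:
# most scores rise, yet a sizeable minority still decline.
after_growth = share_of_declines(growth=20.0)

print(f"declines with no growth:    {same_day:.1%}")
print(f"declines with small growth: {after_growth:.1%}")
```

Raising `growth` relative to `error_sd` shrinks the share of declines, which is why apparent drops are most common over short intervals, when little real growth has had time to accumulate.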
The degree of variation due to measurement error is expressed as the “standard error of measurement” (SEM). The “Reliability and Validity” chapter discusses standard error of measurement (see page 38).

How Many Items Will a Student Be Presented With When Taking a STAR Reading Test?

The STAR Reading UK RP tests administer the same number of items (25 vocabulary-in-context items) to all students.

How Many Items Does the STAR Reading Test Have at Each Year?

The STAR Reading test has enough items at each year level that students can be tested ten times per year and should not be presented with the same material they have already been tested on in the same school year. Generally, the STAR Reading software will not administer the same item twice to a student within a three-month period.

What Guidelines Are Offered as to Whether a Student Can Be Tested Using STAR Reading Software?

In general the student should have a reading vocabulary of at least 100 words. In other words, the student should have at least beginning reading skills. Practically, if the student can work through the practice questions unassisted, that student should be able to be tested using STAR Reading software. If the student has a lot of trouble getting through the practice, it is likely that he or she does not possess the basic skills necessary to be measured by STAR Reading software.

How Will Students With a Fear of Taking Tests Do With STAR Reading Tests?

Students who have a fear of tests should be less disadvantaged by the STAR Reading test than they are in the case of conventional tests. The STAR Reading test purposely starts out at a level that most students will find to be very easy. This was done in order to give almost all students immediate success with the STAR Reading test.
Once the student has had an opportunity to gain some confidence with the relatively easy material, the STAR Reading test moves into more challenging material in order to assess the level of reading proficiency. In addition, most students find it fun to take STAR Reading tests on the computer, which helps relieve some test anxiety.

Is There Any Way for a Teacher to See Exactly Which Items a Student Answered Correctly and Which He or She Answered Incorrectly?

No. This was done for two reasons. First, in computer-adaptive testing, the student’s performance on individual items is not as meaningful as the pattern of responses to the entire test. The student’s pattern of performance on all items taken together forms the basis of the scores STAR Reading reports. Second, for purposes of test security, we decided to do everything possible to protect our items from compromise and overexposure.

What Evidence Do We Have That STAR Reading Software Will Perform as Claimed?

This evidence comes in two forms. First, we have demonstrated test-retest reliability estimates that are very good. Second, the correlation of STAR Reading results with those of other standardised tests is also quite impressive. See “Reliability and Validity” on page 38 for reliability and validity data.

Can or Should the STAR Reading Test Replace a School’s Current National Tests?

This is up to the school system to decide, although this is not what the STAR Reading test was primarily designed to do. The primary purpose of the STAR Reading test is to provide teachers with a tool to improve the teaching and learning match for each student. Every school system has to consider its needs in the area of reading assessment and make decisions as to what instruments will meet those needs. We are happy to provide as much information as we can to help schools make these decisions, but we cannot make the decision for them.

What Is Item Response Theory?
Item Response Theory (IRT) is an approach to psychometric test design and analysis that uses mathematical models to describe what happens when an examinee is administered a particular test question. IRT models give the probability of answering an item correctly as a function of the item’s difficulty and the examinee’s ability. More information can be found in any text on modern test theory.

What Are the Cloze and Maze Procedures?

These are terms for different kinds of fill-in-the-blank exercises that test a student’s ability to create meaning from contextual information, and which have elements in common with the STAR Reading test design.

Appendix A

STAR Reading is a norm-referenced and criterion-referenced test. STAR reports norm-referenced scores, including Percentile Ranks (PR) and Norm-Referenced Standardised Scores (NRSS). The norm-referenced scores are based on score distributions of nationally representative samples of students who participated in the norming of STAR Reading. The information in this chapter pertains to that norming study.

US Norming

National norms for STAR Reading version 1 were collected in 1996. Substantial changes introduced in STAR Reading version 2 necessitated the development of new norms in 1999. Those norms were used in subsequent versions, from version 2.1 through version 4.3, which was released in March 2008. The following is a description of the development of new norms for version 4.3, collected in April and May 2008.

The spring 2008 norming represents a change in the norms development procedure. Previous norms were developed by means of special-purpose norming studies, in which national samples of schools were drawn, and those schools were solicited to participate in the norming by administering a special norming version of the assessment. The spring 2008 norming of STAR Reading 4.3 is the first study in which national samples of students were drawn from routine administrations of STAR Reading.
Details of the procedures employed are given below.

Students participating in the norming study took assessments between April 15 and May 30, 2008. Students took the STAR Reading tests under normal test administration conditions. No specific norming test was developed and no deviations were made from the usual test administration. Thus, students in the norming sample took STAR Reading tests as they are administered in everyday use.

Sample Characteristics

During the norming period, a total of 1,312,212 US students in grades 1–12 (Years 2–13) took STAR Reading version 4.3 tests administered using Renaissance Place RT servers hosted by Renaissance Learning.

To obtain a representative sample of the student populations in US grades 1–12, a stratified random sample of the tested students was drawn, with proportional representation based on geographic region. Geographic region was based on the four broad areas identified by the National Educational Association as Northeastern, Midwestern, Southeastern and Western regions.

A total sample size of approximately 70,000 was identified to ensure at least 1,000 students per year were eligible for sampling while maintaining geographic proportionality.

The final size for the norming sample was 69,738 students in US grades 1–12. These students came from 2,709 schools across 48 states and the District of Columbia. Table 38 provides a breakdown of the number of students participating per year.

Table 38: Number of Students per Year (US Grade) in the Norms Sample

US Grades  N       US Grades  N
1          7,253   7          4,767
2          10,132  8          4,364
3          10,476  9          2,921
4          9,984   10         2,079
5          8,352   11         1,795
6          6,462   12         1,153
                   Total      69,738

National estimates of student population characteristics in the US were obtained from two entities: the US Census Bureau and Market Data Research (MDR).
First, national population estimates for children aged 5–19 were obtained from the US Census Bureau (www.census.gov); these estimates were from 2006, the most recent data available. Estimates of race/ethnicity were computed using the Census Bureau data based on single race/ethnicity. Second, estimates of other school-related characteristics were obtained from December 2007 Market Data Research (MDR) information.

Table 39 shows national estimates for children aged 5–19 by region, race/ethnicity and gender, along with the corresponding percentages in the sample summarised in Table 38. The sample statistics are quite similar to the national estimates, with a slightly larger proportion of students coming from the Western and Southeastern portions of the US. The sample weights included in Table 39 were used during norms analysis to weight student data, in order to align score estimates more closely with national demographic estimates.

Table 39: Sample Demographic Characteristics Along with National US Population Estimates and Sample Weighting Coefficient

                            National Estimate   Norming Sample   Sample Weight
Region
  Midwest                   22.2%               22.1%            1.00
  Northeast                 19.9%               14.7%            1.35
  Southeast                 24.2%               26.7%            0.91
  Western                   33.6%               36.5%            0.92
Race/Ethnicity
  White                     58.7%               52.3%            1.12
  Black                     14.8%               16.4%            0.90
  Hispanic                  19.2%               24.2%            0.79
  Asian/Pacific Islander    4.0%                3.9%             1.02
  Other                     3.3%                3.2%             1.03
Gender
  Female                    48.8%               49.2%            0.99
  Male                      51.2%               50.8%            1.01

Table 40 provides information on the school and district level characteristics of students in the sample and national estimates provided by MDR. No weighting was done on the basis of these school level variables; they are provided to help describe the sample of students and the schools they attended.
District socioeconomic status (SES) was defined by the percentage of students within the district that were eligible for free/reduced price lunches and was based only on the students attending public schools. School type was defined to be either public (including charter schools) or non-public (private, Catholic, etc.). District enrolment was defined as the average number of students per year within the district. However, district enrolment data was not available for private schools, so they were treated as a single group for this norming and not broken down by enrolment numbers as the public schools were. School location was defined as urban, suburban or rural using the definitions utilised by MDR.

Table 40: School and District Level Information: National US Estimates and Sample Percentages

                                     National Estimate   Norming Sample
District Socioeconomic Status
  Low: 33–99%                        26.4%               12.5%
  Average: 16–32%                    33.5%               44.6%
  High: 1–15%                        40.1%               42.9%
School Type & District Enrolment
  Public                             90.3%               93.0%
    < 200                            15.0%               21.1%
    200–499                          26.9%               28.8%
    500–1,999                        17.7%               28.6%
    > 1,999                          30.7%               21.4%
  Non-Public                         9.7%                7.0%
Location
  Urban                              32.4%               24.1%
  Suburban                           43.1%               35.0%
  Rural                              24.1%               34.3%
  Unclassified                       0.4%                6.6%

Test Administration

All students took STAR Reading version 4.3 tests under normal administration procedures. As STAR Reading 4.3 tests normally include several pre-test items administered using the Dynamic Calibration feature, students were administered the appropriate number of pre-test items randomly positioned within each test. Some students in the normative sample also took the assessment two or more times within the norming window; scores from their initial and second test administrations were used for estimation of score reliability. This allowed alternate forms reliability to be estimated, with a short time interval between testing occasions.
Conditions for administering the retest were identical to the first, except that the second test excluded any items to which the student had previous exposure.

Data Analysis

Student test records were compiled from the complete database of STAR Reading hosted users. Only students’ scores on their first STAR Reading test between 15 April and 30 May were used in the norms computations. The scores used in the norms computation were the Rasch ability estimates (theta). The norms were based on the distribution of theta estimates for each year; interpolation was used to estimate norms for times of the year not represented in the norming study.

As noted above, students were sampled within regional strata proportional to the national population estimates. The student test records were joined to the student-level demographics and school-level information. Sample weights from the regional, race/ethnicity and gender results were computed and applied to each student’s ability estimate (theta). Norms were developed based on the ability estimates and then transformed to the STAR Reading Scaled Score scale. Table 41 provides descriptive statistics for each US grade with respect to the normative sample performance, in Scaled Score units.
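The weighting step can be illustrated with the regional figures from Table 39. The manual does not state the formula behind the Sample Weight column; the sketch below assumes each weight is the national percentage divided by the sample percentage, an assumption that reproduces the published regional weights to two decimal places. How the region, race/ethnicity and gender weights were combined for an individual student is not described, so it is not attempted here. All function and variable names are hypothetical.

```python
# Regional percentages copied from Table 39: (national estimate, norming sample).
REGION_SHARES = {
    "Midwest": (22.2, 22.1),
    "Northeast": (19.9, 14.7),
    "Southeast": (24.2, 26.7),
    "Western": (33.6, 36.5),
}


def sample_weight(national_pct, sample_pct):
    """Assumed formula: population share divided by sample share."""
    return national_pct / sample_pct


def weighted_mean(values, weights):
    """Weighted average of scores, for example Rasch ability estimates."""
    total_weight = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Weights rounded to two decimals, the precision used in Table 39.
REGION_WEIGHTS = {
    region: round(sample_weight(national, sample), 2)
    for region, (national, sample) in REGION_SHARES.items()
}
```

Under this assumption an under-represented group (the Northeast, at 14.7% of the sample against 19.9% nationally) receives a weight above 1, so its students count for more in a weighted statistic such as `weighted_mean`.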
Table 41: Descriptive Statistics for Unweighted (U) and Weighted (W) Scaled Scores by US Grade for the Norming Sample: Spring 2008

                      Scaled Score      Scaled Score           Scaled Score
                      Means             Standard Deviations    Medians
US Grade   N          U       W         U       W              U       W
1          7,253      221     231       116     127            207     248
2          10,132     350     349       136     137            350     352
3          10,476     450     459       158     191            456     444
4          9,984      543     557       194     247            526     501
5          8,352      640     671       232     290            609     589
6          6,462      721     778       266     362            679     669
7          4,767      789     845       291     381            780     801
8          4,364      854     875       305     397            871     832
9          2,921      959     941       287     343            975     981
10         2,079      1,036   999       290     346            1,117   1,124
11         1,795      1,072   1,056     281     342            1,169   1,142
12         1,153      1,119   1,089     278     342            1,228   1,217

New normative scores, such as the year equivalent or percentile rank, should not be compared between the previous version of STAR Reading and the present version. If student change must be tracked across time and the new norms interrupt that tracking, use the Scaled Score, because that metric has not changed and its unit has remained the same. In addition, it is inadvisable to continue to use the older norms, which were collected in 1999, as the newer norms collected in 2008 represent more current estimates of the population of US school children. A major demographic shift can be seen between the norming periods: Hispanic students were the third largest race/ethnic group in 1999 but by 2008 had become the second largest, growing from about 12 per cent of the population to about 19 per cent.
Grade Equivalent (GE) scores within the US normative sample were defined as the median (50th percentile) Scaled Scores at each US grade; as the mean test date was in the month of April, these empirical median scores constitute the GE scores for month 7 of each grade. GE scores for other time periods were determined by interpolation.

Scaled Score to Percentile Rank conversion tables for the empirical norming period are presented in Table 44 on page 99. The Scaled Score to US Grade Equivalent conversion table is presented in Table 43 on page 95.

As stated previously, the norm-related information is presented in the STAR Reading UK manual only for informative purposes. All norm-referenced scores have been derived from US students and therefore should not be construed to apply to students in other countries.

US Norm-Referenced Score Definitions

Types of Test Scores

STAR Reading US software provides three broad types of scores: Scaled Scores, Criterion-Referenced Scores and Norm-Referenced Scores. Scaled Scores and Criterion-Referenced Scores are described under “Score Definitions” in the main body of this manual. Norm-referenced scores are described in this appendix section.

US Norm-Referenced scores compare a student’s test results to the results of other US students who have taken the same test. In this case, scores provide a relative measure of student achievement compared to the performance of a group of US students at a given time. Percentile Ranks and Grade Equivalents are the two primary norm-referenced scores provided by STAR Reading software. Both of these scores are based on a comparison of a student’s test results to the data collected during the 2008 US norming study.

Grade Equivalent (GE)

A Grade Equivalent (GE) indicates the year placement of students for whom a particular score is typical.
If a student receives a GE of 10.7, this means that the student scored as well on STAR Reading as did the typical student in the seventh month of US grade 10. It does not necessarily mean that the student can read independently at a tenth-grade level, only that he or she obtained a Scaled Score as high as the average tenth-grade, seventh-month student in the norms group.

GE scores are often misinterpreted as though they convey information about what a student knows or can do, that is, as if they were criterion-referenced scores. To the contrary, GE scores are norm-referenced; a student’s GE score indicates the US grade and school month at which the median student would be expected to achieve the same Scaled Score the student achieved.

STAR Reading Grade Equivalents range from 0.0 to 12.9+. The scale divides the academic year into 10 monthly increments, and is expressed as a decimal with the unit denoting the US grade level and the individual “months” in tenths. Table 42 indicates how the GE scale corresponds to the various calendar months. For example, if a student obtained a GE of 4.6 on a STAR Reading assessment, this would suggest that the student was performing similarly to the average student in the fourth grade at the sixth month (March) of the academic year. Because the STAR Reading 4.x norming took place during the end of the seventh month (April) and the entire eighth month of the school year (May), the GEs ending in .7 are empirically based, that is, derived from the observed data from the normative sample. All other monthly GE scores are derived through interpolation by fitting a curve to the grade-by-grade medians.

Table 43 on page 95 contains the Scaled Score to GE conversions.
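The decimal convention described above can be unpacked mechanically. The following is a minimal sketch, not part of the STAR Reading software: it splits a GE into its US grade and the calendar month given by Table 42 for decimals .0 to .9. The function names are hypothetical, and the open-ended value 12.9+ and the July/August entries of Table 42 are deliberately not handled.

```python
# Month labels for GE decimals .0 to .9, following Table 42 (September = .0).
GE_MONTHS = [
    "September", "October", "November", "December", "January",
    "February", "March", "April", "May", "June",
]


def describe_ge(ge):
    """Split a GE such as 4.6 into (US grade, calendar month)."""
    tenths = round(ge * 10)  # work in whole tenths to avoid floating-point drift
    grade, month_index = divmod(tenths, 10)
    return grade, GE_MONTHS[month_index]


def uk_year(us_grade):
    """UK school year is the US grade plus 1, as stated in this appendix."""
    return us_grade + 1
```

For instance, `describe_ge(4.6)` yields grade 4 and March, matching the worked example in the text, and `uk_year(4)` gives Year 5.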
Table 42: Incremental Level Placement Values per Month

Month       Decimal Increment
July        0.00 or 0.99 (a)
August      0.00 or 0.99 (a)
September   0.0
October     0.1
November    0.2
December    0.3
January     0.4
February    0.5
March       0.6
April       0.7
May         0.8
June        0.9

a. Depends on the current school year set in Renaissance Place RT.

The Grade Equivalent scale is not an equal-interval scale. For example, an increase of 50 Scaled Score points might represent only two or three months of GE change at the lower grades, but over a year of GE change in the high-school grades. This is because student growth in reading (and other academic areas) is not linear; it occurs much more rapidly in the lower grades and slows greatly after the middle years. This should be taken into account when averaging GE scores, especially if it is done across two or more grades.

Estimated Oral Reading Fluency (Est. ORF)

Estimated Oral Reading Fluency (Est. ORF) is an estimate of a student’s ability to read words quickly and accurately in order to comprehend text efficiently. Students with oral reading fluency demonstrate accurate decoding, automatic word recognition and appropriate use of the rhythmic aspects of language (e.g., intonation, phrasing, pitch and emphasis).

Est. ORF is reported as an estimated number of words a student can read correctly within a one-minute time span on grade-level-appropriate text. Grade-level text was defined to be connected text in a comprehensible passage form that has a readability level within the range of the first half of the school year. For instance, an Est. ORF score of 60 for a second-grade (third-year) student would be interpreted as meaning the student is expected to read 60 words correctly within one minute on a passage with a readability level between 2.0 and 2.5.
Therefore, when this estimate is compared to an observed score on a specific passage, which has a fixed level of readability, there might be noticeable differences, as the Est. ORF provides an estimate across a range of readability levels.

The Est. ORF score was computed using the results of a large-scale research study investigating the linkage between STAR Reading scores and estimates of oral reading fluency on a range of passages with grade-level-appropriate difficulty. An equipercentile linking was done between STAR Reading scores and oral reading fluency, providing an estimate of oral reading fluency for each Scaled Score unit in STAR Reading, separately for each of US grades 1–4 (Years 2–5).

Comparing the STAR Reading Test with Classical Tests

Because the STAR Reading test adapts to the reading level of the student being tested, STAR Reading GE scores are more consistently accurate across the achievement spectrum than those provided by classical test instruments. Grade Equivalent scores obtained using classical (non-adaptive) test instruments are less accurate when a student’s year placement and GE score differ markedly. It is not uncommon for a fourth-grade (fifth-year) student to obtain a GE score of 8.9 when using a classical test instrument. However, this does not necessarily mean that the student is performing at a level typical of an end-of-year eighth-grader; more likely, it means that the student answered all, or nearly all, of the items correctly and thus performed beyond the range of the fourth-grade test.

STAR Reading Grade Equivalent scores are more consistently accurate, even as a student’s achievement level deviates from the level of year placement.
A student may be tested on any level of material, depending upon the student’s actual performance on the test; students are tested on items of an appropriate level of difficulty, based on their individual level of achievement. Thus, a GE score of 7.6 indicates that the student’s performance can be appropriately compared to that of a typical seventh grader in the sixth month of the school year.

Understanding IRL and GE Scores

The US version of STAR Reading software provides both criterion-referenced and norm-referenced scores. As such, it provides more than one frame of reference for describing a student’s current reading performance. The two frames of reference differ significantly, however, so it is important to understand the two estimates and their development when making interpretations of STAR Reading results.

The Instructional Reading Level (IRL) is a criterion-referenced score. It provides an estimate of the year level of written material with which the student can most effectively be taught. While the IRL, like any test result, is simply an estimate, it provides a useful indication of the level of material on which the student should be receiving instruction. For example, if a student (regardless of current year placement) receives a STAR Reading IRL of 4.0, this indicates that the student can most likely learn without experiencing too many difficulties when using materials written to be on a fourth-grade level.

The IRL is estimated based on the student’s pattern of responses to the STAR Reading items. A given student’s IRL is the highest year of items at which it is estimated that the student can correctly answer at least 80% of the items.

In effect, the IRL references each student’s STAR Reading performance to the difficulty of written material appropriate for instruction. This is a valuable piece of information in planning the instructional program for individuals or groups of students.
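The 80% rule can be made concrete with a small sketch. The manual does not describe how the software carries out this estimate, so everything below is an assumption for illustration: it uses the Rasch model (the model behind the theta estimates used in norming) to compute the expected proportion correct on the items of each grade, then returns the highest grade that reaches the threshold. The item difficulties are invented and all names are hypothetical.

```python
import math


def rasch_probability(theta, difficulty):
    """Rasch model: probability of a correct answer given ability and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))


def expected_proportion_correct(theta, difficulties):
    """Mean probability of success across a set of items."""
    return sum(rasch_probability(theta, b) for b in difficulties) / len(difficulties)


def estimate_irl(theta, items_by_grade, threshold=0.80):
    """Highest grade whose items are expected to be answered at or above threshold.

    Returns None when no grade reaches the threshold.
    """
    qualifying = [
        grade
        for grade, difficulties in items_by_grade.items()
        if expected_proportion_correct(theta, difficulties) >= threshold
    ]
    return max(qualifying) if qualifying else None


# Hypothetical item difficulties in logits; not taken from STAR Reading.
EXAMPLE_ITEMS = {
    1: [-2.0, -1.5],
    2: [-1.0, -0.5],
    3: [0.5, 1.0],
}
```

With these invented difficulties, a student with theta = 1.0 is expected to answer roughly 94% of grade 1 items, 85% of grade 2 items and 56% of grade 3 items correctly, so `estimate_irl(1.0, EXAMPLE_ITEMS)` returns 2.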
The Grade Equivalent (GE) is a norm-referenced score. It provides a comparison of a student’s performance with that of other students around the nation. If a student receives a GE of 4.0, this means that the student scored as well on the STAR Reading test as did the typical student at the beginning of Grade 4 (Year 5). It does not mean that the student can read books that are written at a fourth-grade level, only that he or she reads as well as fourth-grade students in the norms group.

In general, IRLs and GEs will differ. These differences are caused by the fact that the two score metrics are designed to provide different information. That is, IRLs estimate the level of text that a student can read with some instructional assistance; GEs express a student’s performance in terms of the grade level for which that performance is typical. Usually, a student’s GE score will be higher than the IRL.

The score to be used depends on the information desired. If a teacher or educator wishes to know how a student’s STAR Reading score compares with that of other students across the nation, either the GE or the Percentile Rank should be used. If the teacher or educator wants to know what level of instructional materials a student should be using for ongoing class schooling, the IRL is the preferred score. Again, both scores are estimates of a student’s current level of reading achievement. They simply provide two ways of interpreting this performance: relative to a national sample of students (GE) or relative to the level of written material the student can read successfully (IRL).

Percentile Rank (PR)

Percentile Rank is a norm-referenced score that indicates the percentage of students in the same year and at the same point of time in the school year who obtained scores lower than the score of a particular student.
In other words, Percentile Ranks show how an individual student’s performance compares to that of the student’s same-year peers on the national level. For example, a Percentile Rank of 85 means that the student is performing at a level that exceeds 85% of other students in that year at the same time of the year. Percentile Ranks simply indicate how a student performed compared to the others who took STAR Reading tests as a part of the national norming program. The range of Percentile Ranks is 1–99.

The Percentile Rank scale is not an equal-interval scale. For example, for a student with a US grade placement of 7.7, a Scaled Score of 1,119 corresponds to a PR of 80, and a Scaled Score of 1,222 corresponds to a PR of 90. Thus, a difference of 103 Scaled Score points represents a 10-point difference in PR. However, for the same student, a Scaled Score of 843 corresponds to a PR of 50, and a Scaled Score of 917 corresponds to a PR of 60. While there is now only a 74-point difference in Scaled Scores, there is still a 10-point difference in PR. For this reason, PR scores should not be averaged or otherwise algebraically manipulated. NCE scores are much more appropriate for these activities.

Table 44 on page 99 contains an abridged version of the Scaled Score to Percentile Rank conversion table that the STAR Reading software uses. The actual table includes data for all of the monthly US grade placement values from 1.0–12.9. Because STAR Reading norming occurred in the seventh month of the school year (April), the seventh-month values for each year are empirically based. The remaining monthly values were estimated by interpolating between the empirical points. The table also includes a column representing students who are just about to graduate from high school.
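The recommendation to work in NCE units rather than averaging PRs directly can be sketched as follows. An NCE is a normal-curve rescaling of the percentile with mean 50 and standard deviation 21.06, so NCE = 50 + 21.06 × z, where z is the standard normal deviate for the percentile. The code below is an illustration under that standard definition, not the software's implementation; the function names are hypothetical, and `nce_to_pr` simply rounds to the nearest percentile, which may differ from the published NCE-to-PR ranges near range boundaries.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1
NCE_SD = 21.06                   # NCE standard deviation stated in this manual


def pr_to_nce(pr):
    """Convert a Percentile Rank (1-99) to a Normal Curve Equivalent."""
    z = _STANDARD_NORMAL.inv_cdf(pr / 100)
    return 50 + NCE_SD * z


def nce_to_pr(nce):
    """Convert an NCE back to a Percentile Rank, clamped to the 1-99 range."""
    pr = round(100 * _STANDARD_NORMAL.cdf((nce - 50) / NCE_SD))
    return min(99, max(1, pr))


def average_pr(prs):
    """Average PRs by way of NCEs instead of averaging the PRs themselves."""
    mean_nce = sum(pr_to_nce(pr) for pr in prs) / len(prs)
    return nce_to_pr(mean_nce)
```

Rounded to one decimal place, `pr_to_nce` agrees with the Percentile Rank to NCE table at the values checked here (PR 1, 50, 75, 90 and 99).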
Normal Curve Equivalent (NCE)

Normal Curve Equivalents (NCEs) are scores that have been scaled in such a way that they have a normal distribution, with a mean of 50 and a standard deviation of 21.06 in the normative sample for a given test. Because they range from 1–99, they appear similar to Percentile Ranks, but they have the advantage of being based on an equal-interval scale. That is, the difference between two successive scores on the scale has the same meaning throughout the scale. NCEs are useful for purposes of statistically manipulating norm-referenced test results, such as interpolating test scores, calculating averages and computing correlation coefficients between different tests. For example, in STAR Reading score reports, average Percentile Ranks are obtained by first converting the PR values to NCE values, averaging the NCE values and then converting the average NCE back to a PR.

Table 45 on page 102 provides the NCEs corresponding to integer PR values and facilitates the conversion of PRs to NCEs. Table 46 on page 103 provides the conversions from NCE to PR. The NCE values are given as a range of scores that convert to the corresponding PR value.

US Norm-Referenced Conversion Tables

Conversion tables used in the US version of STAR Reading are reproduced below. These have no counterparts in the UK version. They are reproduced here solely as technical reference material. The tables include the following:

Table 43, “Scaled Score to Grade Equivalent Conversions,” on page 95
This table indicates the US GE (Grade Equivalent) scores corresponding to all values of Scaled Scores. US school grades differ from UK school years by 1; to convert a GE score to a UK school year, add 1 to the GE score.

Table 44, “Scaled Score to Percentile Rank Conversions,” on page 99
This table lists the minimum Scaled Scores corresponding to Percentile Ranks 1–99 in the US norming sample, for each of US grades 1–12.
US school grades differ from UK school years by 1; to look up a percentile score, UK users should add 1 to the US grade to determine the equivalent UK school year. Users should also note that these are the empirical norms, and apply only at the 7th month of the US school year. Percentile Ranks for all other months are determined by interpolation within this table.

Table 45, “Percentile Rank to Normal Curve Equivalent Conversions,” on page 102
In the US, program evaluation studies often use NCE (Normal Curve Equivalent) scores in preference to Percentile Rank scores, because the NCE scale is preferred for statistical analysis purposes. This table lists the NCE scores equivalent to each Percentile Rank 1–99. This table is based on the non-linear translation of Percentile Ranks to NCE scores.

Table 46, “Normal Curve Equivalent to Percentile Rank Conversion,” on page 103
This table is the inverse of the percentile-to-NCE transformation documented in Table 45.

Table 47, “US Grade Equivalent to ZPD Conversions,” on page 104
This table lists the ZPD (Zone of Proximal Development) ranges corresponding to each possible GE score.

Table 48: Scaled Score to Instructional Reading Level Conversions
This table lists the IRL scores (Instructional Reading Levels) corresponding to each possible Scaled Score. IRLs are expressed as US school grades; to find the equivalent UK school year, add 1 to each IRL score in this table.

Table 49: Estimated Oral Reading Fluency (Est. ORF) Given in Words Correct per Minute (WCPM) by Grade for Selected STAR Reading Scale Score Units (SR SS)
Research in the US has found a strong correlation between STAR Reading scale scores, which measure reading comprehension, and measures of students’ oral reading fluency, which is often used as a proxy for reading comprehension.
This table lists oral reading fluency measures, expressed as words read aloud correctly per minute from grade-appropriate text, corresponding to STAR Reading scale scores, for each of US grades 1–4. (US grades differ from UK years by 1; to convert a US grade in this table to a UK year, add 1 to the US grade.)

Table 43: Scaled Score to Grade Equivalent Conversions (a)

SS Range
Low     High    Grade Equivalent
0       45      0.0
46      50      0.1
51      55      0.2
56      58      0.3
59      60      0.4
61      63      0.5
64      65      0.6
66      68      0.7
69      71      0.8
72      79      0.9
80      82      1.0
83      86      1.1
87      89      1.2
90      96      1.3
97      105     1.4
106     121     1.5
122     141     1.6
142     159     1.7
160     176     1.8
177     194     1.9
195     212     2.0
213     229     2.1
230     247     2.2
248     266     2.3
267     283     2.4
284     302     2.5
303     322     2.6
323     333     2.7
334     343     2.8
344     354     2.9
355     364     3.0
365     372     3.1
373     383     3.2
384     395     3.3
396     407     3.4
408     421     3.5
422     434     3.6
435     442     3.7
443     449     3.8
450     455     3.9
456     461     4.0
462     466     4.1
467     473     4.2
474     481     4.3
482     490     4.4
491     497     4.5
498     505     4.6
506     514     4.7
515     522     4.8
523     531     4.9
532     542     5.0
543     553     5.1
554     560     5.2
561     569     5.3
570     579     5.4
580     588     5.5
589     600     5.6
601     612     5.7
613     624     5.8
625     637     5.9
638     650     6.0
651     664     6.1
665     678     6.2
679     693     6.3
694     710     6.4
711     726     6.5
727     748     6.6
749     763     6.7
764     772     6.8
773     780     6.9
781     788     7.0
789     796     7.1
797     805     7.2
806     814     7.3
815     824     7.4
825     833     7.5
834     842     7.6
843     852     7.7
853     864     7.8
865     877     7.9
878     888     8.0
889     897     8.1
898     904     8.2
905     910     8.3
911     919     8.4
920     930     8.5
931     943     8.6
944     950     8.7
951     959     8.8
960     966     8.9
967     972     9.0
973     978     9.1
979     987     9.2
988     1,001   9.3
1,002   1,016   9.4
1,017   1,032   9.5
1,033   1,044   9.6
1,045   1,050   9.7
1,051   1,055   9.8
1,056   1,060   9.9
1,061   1,066   10.0
1,067   1,071   10.1
1,072   1,080   10.2
1,081   1,089   10.3
1,090   1,095   10.4
1,096   1,099   10.5
1,100   1,103   10.6
1,104   1,106   10.7
1,107   1,110   10.8
1,111   1,115   10.9
1,116   1,120   11.0
1,121   1,124   11.1
1,125   1,129   11.2
1,130   1,133   11.3
1,134   1,137   11.4
1,138   1,142   11.5
1,143   1,146   11.6
1,147   1,151   11.7
1,152   1,155   11.8
1,156   1,160   11.9
1,161   1,163   12.0
1,164   1,166   12.1
1,167   1,170   12.2
1,171   1,173   12.3
1,174   1,176   12.4
1,177   1,179   12.5
1,180   1,182   12.6
1,183   1,185   12.7
1,186   1,189   12.8
1,190   1,192   12.9
1,193   1,400   12.9+

a. The information presented in this table was developed for STAR Reading US and was calculated using normative data collected in the United States. As a result, this information may not generalise to students in other countries. STAR Reading UK users should avoid using the information in the table to make decisions about students’ reading ability.
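A range table of this kind is read by finding the row whose Low–High interval contains the Scaled Score. The sketch below is illustrative only: it copies two contiguous stretches of Table 43 (GE 2.0–2.5 and GE 10.0–11.0) into a hypothetical `TABLE_43_EXCERPT` list and raises an error for scores outside that excerpt. It also shows why the GE scale is not equal-interval: the same 50-point Scaled Score gain maps to very different GE gains at the two ends of the scale.

```python
# Excerpt of Table 43: (lowest Scaled Score, highest Scaled Score, Grade Equivalent).
TABLE_43_EXCERPT = [
    (195, 212, 2.0), (213, 229, 2.1), (230, 247, 2.2),
    (248, 266, 2.3), (267, 283, 2.4), (284, 302, 2.5),
    (1061, 1066, 10.0), (1067, 1071, 10.1), (1072, 1080, 10.2),
    (1081, 1089, 10.3), (1090, 1095, 10.4), (1096, 1099, 10.5),
    (1100, 1103, 10.6), (1104, 1106, 10.7), (1107, 1110, 10.8),
    (1111, 1115, 10.9), (1116, 1120, 11.0),
]


def scaled_score_to_ge(scaled_score):
    """Return the GE of the row whose range contains the Scaled Score."""
    for low, high, ge in TABLE_43_EXCERPT:
        if low <= scaled_score <= high:
            return ge
    raise ValueError(f"Scaled Score {scaled_score} is outside the excerpt")


def ge_gain(start_score, end_score):
    """GE change between two Scaled Scores, rounded to one decimal place."""
    return round(scaled_score_to_ge(end_score) - scaled_score_to_ge(start_score), 1)
```

In this excerpt, moving from a Scaled Score of 200 to 250 is a gain of 0.3 GE (three months), whereas moving from 1,066 to 1,116 is a gain of 1.0 GE (a full year).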
Table 44: Scaled Score to Percentile Rank Conversions (a)

US Grade Placement
PR  1  2  3  4  5  6  7  8  9  10  11  12
1  49  68  98  173  226  280  328  355  395  464  468  493
2  50  71  107  194  247  303  355  379  430  495  501  526
3  55  73  125  210  264  321  372  403  454  519  527  556
4  56  75  139  222  276  337  390  425  470  542  553  577
5  56  77  150  232  287  350  405  444  489  559  569  598
6  58  79  160  241  298  362  421  456  505  576  587  617
7  60  80  168  250  309  370  435  467  520  591  605  635
8  60  82  176  258  317  378  447  478  534  608  620  651
9  60  83  183  265  325  390  455  491  550  621  635  668
10  60  85  190  272  333  398  463  501  559  634  649  683
11  61  86  196  277  340  407  471  513  571  647  664  700
12  61  87  202  283  346  416  479  522  583  660  678  715
13  61  88  207  289  353  425  489  532  593  673  691  730
14  61  89  213  294  360  433  497  543  607  685  706  752
15  62  90  217  300  365  441  505  554  617  698  719  773
16  63  93  222  306  369  448  513  560  627  711  733  787
17  63  95  227  311  374  453  520  568  637  723  754  801
18  63  97  231  316  378  458  528  577  648  737  773  816
19  63  99  235  320  385  463  536  586  658  756  786  832
20  64  101  239  325  391  467  545  594  670  773  798  845
21  65  103  244  330  396  473  553  604  679  785  811  855
22  65  105  248  335  401  478  558  612  690  796  826  867
23  65  107  252  339  407  485  564  620  701  808  839  880
24  65  109  256  343  413  491  570  628  713  821  850  890
25  66  114  260  347  418  496  578  637  723  834  859  898
26  66  118  263  352  423  502  585  644  735  845  872  904
27  67  123  267  356  429  508  591  653  752  853  883  910
28  67  127  271  360  434  514  599  662  769  863  892  917
29  67  132  274  364  440  518  607  671  780  875  900  925
30  68  135  277  367  444  524  613  679  791  885  905  937
31  68  139  281  370  449  529  620  687  802  893  910  947
32  68  142  284  373  452  535  626  697  813  900  917  957
33  69  146  287  376  456  542  633  706  826  905  925  967
34  69  149  291  379  459  548  640  716  838  909  937  974
35  70  152  294  384  463  554  647  724  848  916  946  984
36  70  156  298  389  466  558  654  735  856  923  956  998
37  70  159  302  393  470  562  661  749  866  933  966  1018
38  71  162  305  397  474  567  669  766  878  943  972  1036
39  71  165  309  401  478  572  676  776  887  951  981  1049
40  72  168  313  404  483  578  683  785  895  961  994  1063
41  72  171  316  409  488  584  691  794  901  969  1013  1079
42  73  174  319  414  493  588  700  804  906  975  1032  1097
43  73  177  322  418  496  594  708  815  911  985  1046  1106
44  74  181  326  423  500  601  716  827  918  999  1059  1121
45  74  184  329  427  505  607  724  837  926  1017  1072  1135
46  75  187  333  431  511  613  733  847  938  1035  1094  1149
47  77  190  337  436  515  618  744  854  947  1048  1103  1160
48  77  193  340  441  519  624  760  864  957  1060  1116  1169
49  77  196  343  444  523  630  772  875  966  1074  1131  1177
50  78  199  346  448  528  636  781  884  973  1094  1146  1185
51  78  203  350  451  532  642  789  892  983  1103  1157  1195
52  79  206  354  454  538  648  798  899  996  1116  1167  1206
53  80  209  358  457  544  654  807  904  1016  1130  1175  1214
54  81  212  361  460  550  661  818  909  1035  1144  1183  1219
55  82  215  364  464  555  669  829  915  1049  1156  1193  1226
56  83  218  367  467  558  675  839  921  1064  1166  1204  1233
57  84  221  370  470  562  681  848  932  1081  1174  1213  1244
58  85  224  373  474  567  689  855  942  1098  1182  1219  1252
59  86  227  375  478  572  697  864  951  1110  1191  1225  1259
60  87  230  379  483  578  705  875  962  1126  1202  1233  1268
61  88  233  383  488  583  713  884  970  1142  1211  1244  1280
62  89  236  388  493  588  721  892  978  1156  1217  1252  1290
63  90  239  393  497  593  729  899  989  1167  1223  1260  1295
64  92  243  396  501  601  741  904  1009  1176  1231  1269  1300
65  94  246  400  506  607  755  909  1030  1185  1241  1282  1305
66  96  250  405  512  613  769  915  1046  1198  1251  1291  1309
67  99  253  410  516  619  779  922  1061  1209  1258  1296  1314
68  101  257  415  520  625  788  934  1079  1217  1267  1301  1316
69  104  260  420  525  631  797  944  1099  1224  1280  1307  1318
70  106  264  425  531  638  808  954  1112  1233  1290  1312  1321
71  109  268  430  537  644  820  965  1130  1247  1295  1315  1323
72  117  271  436  544  652  832  973  1148  1255  1301  1317  1325
73  124  275  441  551  660  843  984  1162  1265  1306  1320  1327
74  131  279  446  556  669  852  1002  1173  1280  1311  1322  1328
75  138  282  451  560  676  862  1025  1183  1291  1315  1325  1330
76  143  286  454  565  684  875  1044  1197  1298  1317  1327  1332
77  150  291  459  572  694  886  1062  1210  1305  1320  1328  1335
78  156  295  463  579  704  895  1084  1219  1311  1323  1330  1337
79  161  300  467  586  715  903  1102  1229  1315  1325  1333  1339
80  168  306  473  592  725  909  1121  1243  1318  1327  1335  1341
81  173  311  479  603  739  917  1143  1255  1321  1329  1338  1342
82  180  316  486  611  760  929  1161  1268  1324  1331  1340  1343
83  187  321  493  619  776  944  1174  1287  1327  1334  1342  1344
84  195  327  500  628  789  958  1188  1296  1329  1337  1343  1345
85  204  334  508  638  803  971  1206  1305  1332  1340  1344  1345
86  214  340  516  648  821  986  1219  1313  1336  1341  1344  1345
87  224  346  525  660  839  1017  1232  1317  1339  1342  1345  1346
88  234  355  534  674  854  1047  1251  1321  1342  1344  1345  1346
89  243  362  547  687  873  1076  1267  1326  1343  1345  1346  1346
90  254  369  557  704  891  1107  1291  1329  1344  1345  1346  1347
91  266  376  568  722  904  1143  1303  1333  1345  1346  1346  1347
92  279  389  583  749  918  1171  1314  1338  1345  1346  1347  1347
93  294  400  600  781  944  1198  1320  1342  1346  1346  1347  1347
94  310  417  619  810  972  1223  1327  1344  1346  1347  1347  1347
95  329  436  642  848  1024  1255  1333  1345  1347  1347  1347  1353
96  358  455  675  888  1096  1296  1341  1346  1347  1347  1350  1353
97  388  480  724  924  1171  1319  1344  1346  1347  1347  1350  1353
98  452  529  827  1031  1252  1336  1346  1347  1354  1353  1360  1363
99  1400  1400  1400  1400  1400  1400  1400  1400  1400  1400  1400  1400

a. The information presented in this table was developed for STAR Reading US and was calculated using normative data collected in the United States. As a result, this information may not generalise to students in other countries. STAR Reading UK users should avoid using the information in the table to make decisions about students’ reading ability.

Table 45: Percentile Rank to Normal Curve Equivalent Conversions

PR  NCE     PR  NCE     PR  NCE     PR  NCE
1   1.0     26  36.5    51  50.5    76  64.9
2   6.7     27  37.1    52  51.1    77  65.6
3   10.4    28  37.7    53  51.6    78  66.3
4   13.1    29  38.3    54  52.1    79  67.0
5   15.4    30  39.0    55  52.6    80  67.7
6   17.3    31  39.6    56  53.2    81  68.5
7   18.9    32  40.1    57  53.7    82  69.3
8   20.4    33  40.7    58  54.2    83  70.1
9   21.8    34  41.3    59  54.8    84  70.9
10  23.0    35  41.9    60  55.3    85  71.8
11  24.2    36  42.5    61  55.9    86  72.8
12  25.3    37  43.0    62  56.4    87  73.7
13  26.3    38  43.6    63  57.0    88  74.7
14  27.2    39  44.1    64  57.5    89  75.8
15  28.2    40  44.7    65  58.1    90  77.0
16  29.1    41  45.2    66  58.7    91  78.2
17  29.9    42  45.8    67  59.3    92  79.6
18  30.7    43  46.3    68  59.9    93  81.1
19  31.5    44  46.8    69  60.4    94  82.7
20  32.3    45  47.4    70  61.0    95  84.6
21  33.0    46  47.9    71  61.7    96  86.9
22  33.7    47  48.4    72  62.3    97  89.6
23  34.4    48  48.9    73  62.9    98  93.3
24  35.1    49  49.5    74  63.5    99  99.0
25  35.8    50  50.0    75  64.2

Table 46: Normal Curve Equivalent to Percentile Rank Conversion

NCE Range           NCE Range           NCE Range
Low   High  PR      Low   High  PR      Low   High  PR
1.0   4.0   1       41.0  41.5  34      59.0  59.5  67
4.1   8.5   2       41.6  42.1  35      59.6  60.1  68
8.6   11.7  3       42.2  42.7  36      60.2  60.7  69
11.8  14.1  4       42.8  43.2  37      60.8  61.3  70
14.2  16.2  5       43.3  43.8  38      61.4  61.9  71
16.3  18.0  6       43.9  44.3  39      62.0  62.5  72
18.1  19.6  7       44.4  44.9  40      62.6  63.1  73
19.7  21.0  8       45.0  45.4  41      63.2  63.8  74
21.1  22.3  9       45.5  45.9  42      63.9  64.5  75
22.4  23.5  10      46.0  46.5  43      64.6  65.1  76
23.6 24.6 11   46.6 47.0 44   65.2 65.8 77
24.7 25.7 12   47.1 47.5 45   65.9 66.5 78
25.8 26.7 13   47.6 48.1 46   66.6 67.3 79
26.8 27.6 14   48.2 48.6 47   67.4 68.0 80
27.7 28.5 15   48.7 49.1 48   68.1 68.6 81
28.6 29.4 16   49.2 49.7 49   68.7 69.6 82
29.5 30.2 17   49.8 50.2 50   69.7 70.4 83
30.3 31.0 18   50.3 50.7 51   70.5 71.3 84
31.1 31.8 19   50.8 51.2 52   71.4 72.2 85
31.9 32.6 20   51.3 51.8 53   72.3 73.1 86
32.7 33.3 21   51.9 52.3 54   73.2 74.1 87
33.4 34.0 22   52.4 52.8 55   74.2 75.2 88
34.1 34.7 23   52.9 53.4 56   75.3 76.3 89
34.8 35.4 24   53.5 53.9 57   76.4 77.5 90
35.5 36.0 25   54.0 54.4 58   77.6 78.8 91
36.1 36.7 26   54.5 55.0 59   78.9 80.2 92
36.8 37.3 27   55.1 55.5 60   80.3 81.7 93
37.4 38.0 28   55.6 56.1 61   81.8 83.5 94
38.1 38.6 29   56.2 56.6 62   83.6 85.5 95
38.7 39.2 30   56.7 57.2 63   85.6 88.0 96
39.3 39.8 31   57.3 57.8 64   88.1 91.0 97
39.9 40.4 32   57.9 58.3 65   91.1 95.4 98
40.5 40.9 33   58.4 58.9 66   95.5 99.0 99

Table 47: US Grade Equivalent to ZPD Conversions
   ZPD Range     ZPD Range     ZPD Range
GE Low High   GE Low High   GE Low High
0.0 0.0 1.0   4.4 3.2 4.9   8.8 4.6 8.8
0.1 0.1 1.1   4.5 3.2 5.0   8.9 4.6 8.9
0.2 0.2 1.2   4.6 3.2 5.1   9.0 4.6 9.0
0.3 0.3 1.3   4.7 3.3 5.2   9.1 4.6 9.1
0.4 0.4 1.4   4.8 3.3 5.2   9.2 4.6 9.2
0.5 0.5 1.5   4.9 3.4 5.3   9.3 4.6 9.3
0.6 0.6 1.6   5.0 3.4 5.4   9.4 4.6 9.4
0.7 0.7 1.7   5.1 3.5 5.5   9.5 4.7 9.5
0.8 0.8 1.8   5.2 3.5 5.5   9.6 4.7 9.6
0.9 0.9 1.9   5.3 3.6 5.6   9.7 4.7 9.7
1.0 1.0 2.0   5.4 3.6 5.6   9.8 4.7 9.8
1.1 1.1 2.1   5.5 3.7 5.7   9.9 4.7 9.9
1.2 1.2 2.2   5.6 3.8 5.8   10.0 4.7 10.0
1.3 1.3 2.3   5.7 3.8 5.9   10.1 4.7 10.1
1.4 1.4 2.4   5.8 3.9 5.9   10.2 4.7 10.2
1.5 1.5 2.5   5.9 3.9 6.0   10.3 4.7 10.3
1.6 1.6 2.6   6.0 4.0 6.1   10.4 4.7 10.4
1.7 1.7 2.7   6.1 4.0 6.2   10.5 4.8 10.5
1.8 1.8 2.8   6.2 4.1 6.3   10.6 4.8 10.6
1.9 1.9 2.9   6.3 4.1 6.3   10.7 4.8 10.7
2.0 2.0 3.0   6.4 4.2 6.4   10.8 4.8 10.8
2.1 2.1 3.1   6.5 4.2 6.5   10.9 4.8 10.9
2.2 2.1 3.1   6.6 4.2 6.6   11.0 4.8 11.0
2.3 2.2 3.2   6.7 4.2 6.7   11.1 4.8 11.1
2.4 2.2 3.2   6.8 4.3 6.8   11.2 4.8 11.2
2.5 2.3 3.3   6.9 4.3 6.9   11.3 4.8 11.3
2.6 2.4 3.4   7.0 4.3 7.0   11.4 4.8 11.4
2.7 2.4 3.4   7.1 4.3 7.1   11.5 4.9 11.5
2.8 2.5 3.5   7.2 4.3 7.2   11.6 4.9 11.6
2.9 2.5 3.5   7.3 4.4 7.3   11.7 4.9 11.7
3.0 2.6 3.6   7.4 4.4 7.4   11.8 4.9 11.8
3.1 2.6 3.7   7.5 4.4 7.5   11.9 4.9 11.9
3.2 2.7 3.8   7.6 4.4 7.6   12.0 4.9 12.0
3.3 2.7 3.8   7.7 4.4 7.7   12.1 4.9 12.1
3.4 2.8 3.9   7.8 4.5 7.8   12.2 4.9 12.2
3.5 2.8 4.0   7.9 4.5 7.9   12.3 4.9 12.3
3.6 2.8 4.1   8.0 4.5 8.0   12.4 4.9 12.4
3.7 2.9 4.2   8.1 4.5 8.1   12.5 5.0 12.5
3.8 2.9 4.3   8.2 4.5 8.2   12.6 5.0 12.6
3.9 3.0 4.4   8.3 4.5 8.3   12.7 5.0 12.7
4.0 3.0 4.5   8.4 4.5 8.4   12.8 5.0 12.8
4.1 3.0 4.6   8.5 4.6 8.5   12.9 5.0 12.9
4.2 3.1 4.7   8.6 4.6 8.6   13.0 5.0 13.0
4.3 3.1 4.8   8.7 4.6 8.7

Table 48: Scaled Score to Instructional Reading Level (IRL) Conversions
Low High IRL
0 124 Pre-Primer (PP)
125 159 Primer (P)
160 168 1.0
169 176 1.1
177 185 1.2
186 194 1.3
195 203 1.4
204 212 1.5
213 220 1.6
221 229 1.7
230 238 1.8
239 247 1.9
248 256 2.0
257 266 2.1
267 275 2.2
276 284 2.3
285 293 2.4
294 304 2.5
305 315 2.6
316 325 2.7
326 336 2.8
337 346 2.9
347 359 3.0
360 369 3.1
370 379 3.2
380 394 3.3
395 407 3.4
408 423 3.5
424 439 3.6
440 451 3.7
452 462 3.8
463 474 3.9
475 487 4.0
488 498 4.1
499 512 4.2
513 523 4.3
524 537 4.4
538 553 4.5
554 563 4.6
564 577 4.7
578 590 4.8
591 607 4.9
608 616 5.0
617 624 5.1
625 633 5.2
634 642 5.3
643 652 5.4
653 662 5.5
663 673 5.6
674 682 5.7
683 694 5.8
695 706 5.9
707 725 6.0
726 752 6.1
753 780 6.2
781 801 6.3
802 826 6.4
827 848 6.5
849 868 6.6
869 890 6.7
891 904 6.8
905 916 6.9
917 918 7.0
919 920 7.1
921 922 7.2
923 924 7.3
925 928 7.4
929 930 7.5
931 934 7.6
935 937 7.7
938 939 7.8
940 942 7.9
943 948 8.0
949 954 8.1
955 960 8.2
961 966 8.3
967 970 8.4
971 974 8.5
975 981 8.6
982 988 8.7
989 998 8.8
999 1,011 8.9
1,012 1,022 9.0
1,023 1,034 9.1
1,035 1,042 9.2
1,043 1,050 9.3
1,051 1,058 9.4
1,059 1,067 9.5
1,068 1,076 9.6
1,077 1,090 9.7
1,091 1,098 9.8
1,099 1,104 9.9
1,105 1,111 10.0
1,112 1,121 10.1
1,122 1,130 10.2
1,131 1,139 10.3
1,140 1,147 10.4
1,148 1,155 10.5
1,156 1,161 10.6
1,162 1,167 10.7
1,168 1,172 10.8
1,173 1,177 10.9
1,178 1,203 11.0
1,204 1,221 11.1
1,222 1,243 11.2
1,244 1,264 11.3
1,265 1,290 11.4
1,291 1,303 11.5
1,304 1,314 11.6
1,315 1,319 11.7
1,320 1,324 11.8
1,325 1,328 11.9
1,329 1,330 12.0
1,331 1,332 12.1
1,333 1,335 12.2
1,336 1,337 12.3
1,338 1,340 12.4
1,341 1,341 12.5
1,342 1,342 12.6
1,343 1,343 12.7
1,344 1,344 12.8
1,345 1,345 12.9
1,346 1,400 Post-High School (PHS)

Table 49: Estimated Oral Reading Fluency (Est. ORF) Given in Words Correct per Minute (WCPM) by Grade for Selected STAR Reading Scale Score Units (SR SS)
SR SS   Grade 1   Grade 2   Grade 3   Grade 4
50   0   4   0   8
100   29   30   32   31
150   41   40   43   41
200   55   52   52   47
250   68   64   60   57
300   82   78   71   69
350   92   92   80   80
400   111   106   97   93
450   142   118   108   104
500   142   132   120   115
550   142   152   133   127
600   142   175   147   137
650   142   175   157   145
700   142   175   167   154
750   142   175   170   168
800   142   175   170   184
850–1400   142   175   170   190

References

Allington, R., & McGill-Franzen, A. (2003). Use students’ summer-setback months to raise minority achievement. Education Digest, 69(3), 19–24.

Bennicoff-Nan, L. (2002). A Correlation of Computer Adaptive, Norm Referenced, and Criterion Referenced Achievement Tests in Elementary Reading. Doctoral dissertation, The Boyer Graduate School of Education, Santa Ana, CA.

Borman, G. D., & Dowling, N. M. (2004). Testing the Reading Renaissance Program Theory: A Multilevel Analysis of Student and Classroom Effects on Reading Achievement. University of Wisconsin-Madison.

Bracey, G. (2002). Summer loss: The phenomenon no one wants to deal with. Phi Delta Kappan, 84(1), 12–13.

Bryk, A., & Raudenbush, S. (1992). Hierarchical linear models: Applications and data analysis methods. Newbury Park, CA: Sage Publications.

Campbell, D., & Stanley, J. (1966). Experimental and quasi-experimental designs for research. Chicago: Rand McNally & Company.

Cook, T., & Campbell, D. (1979). Quasi-experimentation: Design & analysis issues for field settings. Boston: Houghton Mifflin Company.

Deno, S. (2003). Developments in curriculum-based measurement. Journal of Special Education, 37(3), 184–192.

Diggle, P., Heagerty, P., Liang, K., & Zeger, S. (2002). Analysis of longitudinal data (2nd ed.). Oxford: Oxford University Press.

Duncan, T., Duncan, S., Strycker, L., Li, F., & Alpert, A. (1999). An introduction to latent variable growth curve modeling: Concepts, issues, and applications. Mahwah, NJ: Lawrence Erlbaum Associates.

Fuchs, D., & Fuchs, L. S. (2006). Introduction to Response to Intervention: What, why, and how valid is it? Reading Research Quarterly, 41(1), 93–99.

Holmes, C. T., & Brown, C. L. (2003). A controlled evaluation of a total school improvement process, School Renaissance. University of Georgia. Available online: http://www.eric.ed.gov/PDFS/ED474261.pdf.

Kirk, R. (1995). Experimental Design: Procedures for the behavioral sciences (3rd ed.). New York: Brooks/Cole Publishing Company.

Kolen, M., & Brennan, R. (2004). Test equating, scaling, and linking (2nd ed.). New York: Springer.

Multiple authors. (2002). Modeling intraindividual variability with repeated measures data: Methods and applications. In D. S. Moskowitz & S. L. Hershberger (eds.), Mahwah, NJ: Lawrence Erlbaum Associates.

Neter, J., Kutner, M., Nachtsheim, C., & Wasserman, W. (1996). Applied linear statistical models (4th ed.). New York: WCB McGraw-Hill.

Pedhazur, E., & Schmelkin, L. (1991). Measurement, design, and analysis: An integrated approach. Hillsdale, NJ: Lawrence Erlbaum Associates.

Renaissance Learning. (2003). Guide to Reading Renaissance Goal-Setting Changes. Madison, WI: Renaissance Learning, Inc. Available online: http://doc.renlearn.com/KMNet/R001398009GC0386.pdf.

Ross, S. M., & Nunnery, J. (2005). The effect of School Renaissance on student achievement in two Mississippi school districts. Memphis: University of Memphis, Center for Research in Educational Policy. Available online: http://www.eric.ed.gov/PDFS/ED484275.pdf.

Sadusky, L. A., & Brem, S. K. (2002). The Integration of Renaissance Programs into an Urban Title I Elementary School, and Its Effect on School-wide Improvement. Arizona State University.

Sewell, J., Sainsbury, M., Pyle, K., Keogh, N., & Styles, B. (2007). Renaissance Learning Equating Study Report. Technical Report submitted to Renaissance Learning, Inc. National Foundation for Educational Research, Slough, Berkshire, United Kingdom.

Yoes, M. (1999). Linking the STAR and DRP Reading Test Scales. Technical Report. Submitted to Touchstone Applied Science Associates and Renaissance Learning.

About Renaissance Learning UK

Renaissance Learning UK is a leading provider of assessment technology for primary and secondary schools. Our products promote success amongst students of all ages and abilities through personalised practice in reading, writing and maths, and by providing teachers with immediate feedback and data that help inform instruction.

Our Accelerated Reader (AR) Advantage and Accelerated Maths (AM) Advantage software, together with the interactive NEO 2 writing tool, help to enhance literacy and numeracy skills, support differentiated instruction, and personalise practice to optimise student development. Schools using Accelerated Reader, the world’s most widely used reading software, report an average of two years’ reading age growth in just one academic year.

A member of BESA, we also support The Schools Network (formerly SSAT), National Literacy Trust and Chartered Institute of Library and Information Professionals amongst other organisations.

Renaissance Learning™, 32 Harbour Exchange Square, London, E14 9GE. Tel: +44 (0)20 7184 4000. Website: www.renlearn.co.uk