Agreement Indices in Multilevel Analysis
Ayala Cohen
Faculty of Industrial Engineering & Management
Technion-Israel Institute of Technology
May 2007

Outline
• Introduction (on interrater agreement, IRA)
• rWG(J) index of agreement
• AD (Absolute Deviation), an alternative measure of agreement
-------------------------------
Review
Our work
• (2001): Etti Doveh, Uri Eick
• (2007): Etti Doveh, Inbal Shani

INTRODUCTION
Why we need a measure of agreement
In recent years there has been a growing number of studies based on multilevel data in applied psychology, organizational behavior, and clinical trials.
Typical data structure:
• Individuals within groups (two levels)
• Groups within departments (three levels)

Constructs
• Constructs are our building blocks in developing and in testing theory.
• Group-level constructs describe the group as a whole and are of three types (Kozlowski & Klein, 2000):
– Global, shared, or configural.

Global Constructs
• Relatively objective, easily observable, descriptive group characteristics.
• Originate and are manifest at the group level.
• Examples:
– Group function, size, or location.
• No meaningful within-group variability.
• Measurement is generally straightforward.

Shared Constructs
• Group characteristics that are common to group members.
• Originate in group members’ attitudes, perceptions, cognitions, or behaviors
– Which converge as a function of socialization, leadership, shared experience, and interaction.
• Within-group variability predicted to be low.
• Examples: group climate, norms.

Configural Group-Level Constructs
• Group characteristics that describe the array, pattern, dispersion, or variability within a group.
• Originate in group member characteristics (e.g., demographics, behaviors, personality)
– But no assumption or prediction of convergence.
• Examples:
– Diversity, team star or weakest member.

Justifying Aggregation
• Why is this essential?
– In the case of shared constructs, our construct definitions rest on assumptions regarding within- and between-group variability.
– If our assumptions are wrong, our construct “theories” and our measures are flawed, and so are our conclusions.
• So, test both:
– Within-group agreement: the construct is supposed to be shared; is it really?
– Between-group variability (reliability): groups are expected to differ significantly; do they really?

Chen, Mathieu & Bliese (2004) proposed a framework for conceptualizing and testing multilevel constructs. This framework includes the assessment of inter-group agreement. Assessment of agreement is a prerequisite for arguing that a higher-level construct can be operationalized.

A distinction should be made between interrater reliability (IRR) and interrater agreement (IRA). Many past studies wrongly used the two terms interchangeably in their discussions.

The term interrater agreement refers to the degree to which ratings from individuals are interchangeable; namely, it reflects the extent to which raters provide essentially the same rating (Kozlowski & Hattrup, 1992; Tinsley & Weiss, 1975).

Interrater reliability refers to the degree to which ratings of different judges are proportional when expressed as deviations from their means.

Interrater reliability (IRR) refers to relative consistency and is assessed by correlations.
Interrater agreement (IRA) refers to the absolute consensus in scores assigned by the raters and is assessed by measures of variability.

Scale of Measurement
• Questionnaire with J parallel items on a Likert scale with A categories, e.g.
A=5:
1 = Strongly disagree, 2 = Disagree, 3 = Indifferent, 4 = Agree, 5 = Strongly agree

Example
k=3 raters, Likert scale with A=7 categories, J=5 items

Item   Rater 1   Rater 2   Rater 3   Dev. from mean
1      7         4         3          1
2      5         2         1         -1
3      5         2         1         -1
4      6         3         2          0
5      7         4         3          1

(Each rater's deviation from his or her own mean rating; it is the same for all three raters.)

Prior to aggregation, we assessed within-unit agreement on …
To do so, we used two complementary approaches (Kozlowski & Klein, 2000):
• A consistency-based approach: computation of the intraclass correlation coefficient, ICC(1)
• A consensus-based approach (index of agreement)

How can we assess agreement?
• Variability measures, e.g.
– Variance
– MAD (Mean Absolute Deviation)
Problem: what are “small” / “large” values?

The most widely used index of interrater agreement on Likert-type scales has been rWG(J), introduced by James, Demaree & Wolf (1984). J stands for the number of items in the scale.

Examples when rWG(J) was used to assess interrater agreement
• Group cohesiveness
• Group socialization emphasis
• Transformational and transactional leadership
• Positive and negative affective group tone
• Organizational climate

This index compares the observed within-group variances to an expected variance from “random responding”. In the particular case of one item (stimulus), J=1, this index is denoted rWG and is equal to

$$ r_{WG} = 1 - \frac{S_x^2}{\sigma^2} $$

where $S_x^2$ is the variance of ratings for the single stimulus and $\sigma^2$ is the variance of some “null distribution” corresponding to no agreement.

Problem:
A limitation of rWG(J) is that there is no clear-cut definition of a random response, and the appropriate specification of the null distribution which models no agreement is debatable. If the null distribution used to define $\sigma^2$ fails to model a random response properly, then the interpretability of the index is suspect.
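The three-rater example table above can be checked numerically. The following is a minimal Python sketch, not part of the original talk; the helper name `pearson` is our own. It suggests why consistency and consensus are assessed separately: the raters' profiles across items are perfectly correlated, yet the three ratings of any single item are clearly spread out.

```python
from statistics import mean, pvariance

# Ratings from the example table: one list per rater, one entry per item.
ratings = {
    "Rater 1": [7, 5, 5, 6, 7],
    "Rater 2": [4, 2, 2, 3, 4],
    "Rater 3": [3, 1, 1, 2, 3],
}


def pearson(x, y):
    """Pearson correlation between two equally long rating profiles."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Consistency view: correlation between pairs of raters across the items.
r12 = pearson(ratings["Rater 1"], ratings["Rater 2"])
r13 = pearson(ratings["Rater 1"], ratings["Rater 3"])

# Consensus view: variance of the three ratings of each item
# (population variance, i.e. n in the denominator).
item_variances = [pvariance(item) for item in zip(*ratings.values())]

print("correlations:", r12, r13)
print("item variances:", item_variances)
```

Both correlations come out as 1, while every item variance equals 78/27 (about 2.89), so a correlation-based measure alone would not reveal that the raters use different parts of the scale.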
The most natural candidate to represent non-agreement is the uniform (rectangular) distribution, which implies that for an item with A categories, the proportion of cases in each category equals 1/A.

For a uniform null
For an item with A categories:

$$ \sigma^2_{null} = \frac{A^2 - 1}{12}, \qquad r_{WG} = 1 - \frac{S_x^2}{\sigma^2} $$

How to calculate the sample variance $S_x^2$?
We have n ratings, and suppose n is “small”:

$$ S_x^2 = \frac{1}{n} \sum_{k=1}^{n} (X_k - \bar{X})^2 $$

Example
A=5, k=9 raters: 3 3 3 3 5 5 5 5 4
(With (n-1) in the denominator) $S_x^2 = 1$

$$ \sigma^2_{null} = \frac{A^2 - 1}{12} = 2, \qquad r_{WG} = 1 - \frac{S_x^2}{\sigma^2} = 1 - \frac{1}{2} = \frac{1}{2} $$

James et al. (1984): “The distribution of responses could be non-uniform when no genuine agreement exists among the judges. The systematic biasing of a response distribution due to a common response style within a group of judges [should] be considered. This distribution may be negatively skewed, yielding a smaller variance than the variance of a uniform distribution”.

Slight Skewed Null
Proportion of responses in each category: 1 = .05, 2 = .15, 3 = .20, 4 = .35, 5 = .25, yielding $\sigma^2_{null} = 1.34$.
Used as a “null distribution” in several studies (e.g., Schriesheim et al., 1995; Shamir et al., 1998). Their justification for choosing this null distribution was that it appears to be a reasonably good approximation of random response to leadership and attitude questionnaires.

[Figure: Null Distributions, A=5. Bar chart of the slight-skew category proportions over categories 1 to 5.]

James et al. (1984) suggested several skewed distributions (which differ in their skewness and variance) to accommodate systematic bias.

Often, several null distributions (including the uniform) could be suitable to model disagreement. Thus, the following procedure is suggested: consider the subset of likely null distributions and calculate the largest and smallest null variance specified by this subset.
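The procedure above (bracketing rWG between the smallest and largest plausible null variance) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `null_variance` and `r_wg` are hypothetical, and the choice of the uniform and slight-skew distributions as the "subset of likely nulls" is an assumption taken from the two examples in the talk.

```python
def null_variance(probs):
    """Variance of a discrete distribution on categories 1..A.

    probs[i] is the probability of category i + 1.
    """
    cats = range(1, len(probs) + 1)
    mu = sum(c * p for c, p in zip(cats, probs))
    return sum((c - mu) ** 2 * p for c, p in zip(cats, probs))


def r_wg(ratings, sigma2_null, ddof=1):
    """Single-item rWG = 1 - Sx^2 / sigma^2_null.

    ddof=1 uses (n - 1) in the variance denominator, as in the
    nine-rater example; ddof=0 uses n.
    """
    n = len(ratings)
    mu = sum(ratings) / n
    s2 = sum((x - mu) ** 2 for x in ratings) / (n - ddof)
    return 1 - s2 / sigma2_null


uniform = [0.2] * 5                           # A = 5, equal proportions
slight_skew = [0.05, 0.15, 0.20, 0.35, 0.25]  # proportions from the talk

ratings = [3, 3, 3, 3, 5, 5, 5, 5, 4]         # the nine-rater example

var_uniform = null_variance(uniform)          # (A^2 - 1) / 12 = 2
var_skew = null_variance(slight_skew)         # 1.34

upper = r_wg(ratings, var_uniform)            # largest null variance
lower = r_wg(ratings, var_skew)               # smallest null variance
print(f"rWG lies between {lower:.3f} and {upper:.3f}")
```

For the nine-rater example this gives a range from roughly 0.25 (slight-skew null) to 0.5 (uniform null), which shows how strongly the reported agreement depends on the null distribution chosen.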
Additional “problem”
The index can have negative values:

$$ r_{WG} = 1 - \frac{S_x^2}{\sigma^2} $$

This happens when the observed variance is larger than expected from random response.

Bimodal distribution (extreme disagreement)
Example: A=5, half answer 1, half answer 5
Variance: 4
Uniform variance:

$$ \sigma^2_{null} = \frac{A^2 - 1}{12} = 2, \qquad r_{WG} = 1 - \frac{S_x^2}{\sigma^2} = -1 $$

What to do when rWG is negative?
James et al. (1984) recommended replacing a negative value by zero. This was criticized by Lindell et al. (1999).

For a scale of J items

$$ r_{WG(J)} = \frac{J[1 - (\bar{s}^2/\sigma^2)]}{J[1 - (\bar{s}^2/\sigma^2)] + \bar{s}^2/\sigma^2} $$

where $\bar{s}^2$ is the average variance over the J items.

For a scale of J items
With $r_1 = 1 - \bar{s}^2/\sigma^2$:

$$ r_{WG(J)} = \frac{J[1 - (\bar{s}^2/\sigma^2)]}{J[1 - (\bar{s}^2/\sigma^2)] + \bar{s}^2/\sigma^2} = \frac{J r_1}{J r_1 + (1 - r_1)} = \frac{J r_1}{1 + (J - 1) r_1} $$

For a scale of J items

$$ r_{WG(J)} = \frac{J r_1}{1 + (J - 1) r_1} $$

Spearman-Brown reliability:

$$ \frac{k r_1}{1 + (k - 1) r_1} $$

Example
3 raters, Likert scale with 7 categories, 5 items

$$ r_{WG(J)} = \frac{J[1 - (\bar{s}^2/\sigma^2)]}{J[1 - (\bar{s}^2/\sigma^2)] + \bar{s}^2/\sigma^2} = \frac{5[1 - (\bar{s}^2/\sigma^2)]}{1 + 4[1 - (\bar{s}^2/\sigma^2)]} = 0.66, \qquad 1 - (\bar{s}^2/\sigma^2) = 0.2777 $$

Item   Rater 1   Rater 2   Rater 3   Var*
1      7         4         3         78/27
2      5         2         1         78/27
3      5         2         1         78/27
4      6         3         2         78/27
5      7         4         3         78/27

* Var calculated with n in the denominator

Since its introduction, the use of rWG(J) has raised several criticisms and debates. It was initially described by James et al. (1984) as a measure of interrater reliability. Schmidt & Hunter (1989) criticized this index, claiming that an index of reliability cannot be defined on a single item.

In response, Kozlowski and Hattrup (1992) argued that it is an index of agreement, not reliability. James, Demaree & Wolf (1993) concurred with this distinction, and it has now been accepted that rWG(J) is a measure of agreement.

Lindell, Brandt and Whitney (1999) suggested, as an alternative to rWG(J), a modified index which is allowed to take negative values (even beyond minus 1):

$$ r^*_{WG(J)} = 1 - \frac{\bar{s}^2}{\sigma^2} $$

The modified index r*WG(J) addresses two of the criticisms which were raised against rWG(J). First, it can take negative values when the observed agreement is less than hypothesized.
Second, unlike rWG(J), it does not include a Spearman-Brown correction, and thus it does not depend on the number of items (J).

Academy of Management Journal, 2006
“Does CEO charisma matter…” (Agle et al.)
128 CEOs, 770 team members
“Because of strengths and weaknesses of various interrater agreement measures, we computed both the intraclass correlation statistics ICC(1) and ICC(2), and the interrater agreement statistics r*WG(J)…”

Agle et al. (2006)
“Overall, the very high interrater agreement justified the combination of individual manager’s responses into a single measure of charisma for each CEO…”
-----------------
They display ICC(1)= , ICC(2)= , r*WG(J)=
One number?

Ensemble of groups (RMNET)
Shall we report median, mean?
“… Observed distributions of rWG(J) are often wildly skewed … medians are the most appropriate summary statistic …”

Ehrhart, M.G. (Personnel Psychology, 2004)
Leadership and procedural justice climate as antecedents of unit-level organizational citizenship behavior
Grocery store chain: 3914 employees in 249 departments

“… The median rwg values across the 249 departments were: 0.88 for servant leadership, …”
WHAT TO CONCLUDE?

Rule-of-Thumb
The practice of viewing rWG values in the .70s and higher as representing acceptable convergence is widespread. For example, Zohar (2000) cited rWG values in the .70s and mid .80s as proof that judgments “were sufficiently homogeneous for within group aggregation”.

Benchmarking rWG Interrater Agreement Indices: Let’s Drop the .70 Rule-of-Thumb
R.J. Harvey and E. Hollander
Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, April 2004

“It is puzzling why many researchers and practitioners continue to rely on arbitrary rules-of-thumb to interpret rWG, especially the popular rule-of-thumb stating that rWG ≥ 0.70 denotes acceptable agreement”…

“The justification of the rule rests largely on the argument that some researchers (e.g.
James et al., 1984) viewed rater agreement as being similar to reliability, reliabilities as low as .7 are useful (e.g. Nunnally, 1978), therefore rWG ≥ 0.7 implies interrater reliability”…

There is little empirical basis for a .70 cutoff, and few studies have attempted to determine how various values equate with “real world” levels of interrater agreement.

The sources of four commonly reported cutoff criteria
Lance, Butts & Michels (2006), ORM
1) GFI > .9 indicates well-fitting SEMs
2) Reliability of .7 or higher is acceptable
3) rWG values > .7 justify aggregation of individual responses to group-level measures
4) Keep the number of factors whose eigenvalues are greater than 1

Rule-of-Thumb
A reviewer: “I believe the phrase has its origins in describing the size of the stick appropriate for beating one’s wife. A stick was to be no larger in diameter than the man’s thumb … Thus, use of this phrase might be offensive to some of the readership” …

Rule-of-Thumb
Feminists often make the claim that the “rule of thumb” used to mean that it was legal to beat your wife with a rod, so long as that rod was no thicker than the husband’s thumb. But it turns out to be an excellent example of what may be called fiction …

Rule-of-Thumb
From carpentry: the length from the tip of one’s thumb to the first knuckle was used as an approximation for one inch.
As such, we apologize to readers who may be offended by the reference to “rule of thumb” but remind them of the mythology surrounding its interpretation.

Statistical Tests
Test the null hypothesis of no agreement.
Dunlop et al. (JAP, 2003) provided a table of rWG “critical values” under the null hypothesis of a uniform null (J=1, one item; different values of A for the Likert scale and different numbers of judges).

Critical values of the rWG statistic at the 5% level of statistical significance
A: 2  3  5  11
n = 4:  1.0
n = 5:  .85  .78
n = 8:  .61  .59
n = 10: .53  .49

Dunlop et al. (2003) …
“Statistical tests of rWG are useful if one’s objective is to determine if any nonzero agreement exists; although useful, this reflects a qualitatively different goal from determining if reasonable consensus exists for a group to aggregate individual level data to the group level of analysis” …

Alternative index AD
Proposed by Burke, Finkelstein & Dusig (Organizational Research Methods, 1999)

$$ AD_{M(j)} = \frac{1}{n} \sum_{k=1}^{n} |X_{jk} - \bar{X}_j| $$

$$ AD_{M(J)} = \frac{\sum_{j=1}^{J} AD_{M(j)}}{J} $$

Gonzalez-Roma, V., Peiro, J.M., & Tordera, N. (2002). JAP
“The ADM(J) index has several advantages compared with the James, Demaree and Wolf (1984) interrater agreement index rwg, see Burke et al. (1999)” …

First, the ADM(J) index does not require modeling the null response distribution. It only requires an a priori specification of a null response range of interrater agreement. Second, the index provides estimates of interrater agreement in the metric of the original response range.

We followed Burke and colleagues’ (1999) specification of using a null response range equal to or less than 1 when the response scale is a Likert-type 5-point scale. This value is consistent with our judgement that any two contiguous scale points are somewhat similar for the 5-point Likert-type scales used in the present study.

… Organizational commitment
Measured by three items (J=3). Respondents answered using a 5-point scale (A=5).
The mean ADM(J) was 0.68 (SD=0.32) and the ICC(1) was .22. The one-way ANOVA result, F(196,441)=1.7, p<.01, suggests an adequate between differentiation and supports the validity of the aggregate organizational commitment measure.

Statistical Tests
Test the null hypothesis of no agreement.
Dunlop et al. (JAP, 2003) provided a table of AD “critical values”.
Under the null hypothesis of a uniform null (one item; different values of A for the Likert scale and different numbers of judges).

Critical values of the AD statistic at the 5% level of statistical significance
A: 2  3  5  11
n = 4:  0.0  .75
n = 5:  .40  1.04
n = 8:  .69  1.53
n = 10: .74  1.70

Criticism vs Reality
• Citations and applications of rWG(J) in more than 450 papers

So far, the index rWG(J) has been used much more frequently than ADM(J). We performed a systematic literature search of organizational multilevel studies published during the years 2000-2006 (ADM(J) was introduced in 1999). Among the 41 papers that included justification of the aggregation from individual to group level, 40 (98%) used the index rWG(J) and only 2 (5%) used the index ADM(J). One study used both indices.

Statistical properties of rWG(J)
• Cohen, A., Doveh, E., & Eick, U. (2001). Statistical properties of the rWG(J) index of agreement. Psychological Methods, 6, 297-310.
Studied the sampling distribution of rWG(J) under the null hypothesis.

Simulations to Study the Sampling Distribution
Depends on
• J (number of items in the scale)
• A (number of categories of the Likert scale)
• n (group size)
• The null variance $\sigma^2$
• Correlation structure between items in the scale

$$ r_{WG(J)} = \frac{J[1 - (\bar{s}^2/\sigma^2)]}{J[1 - (\bar{s}^2/\sigma^2)] + \bar{s}^2/\sigma^2} $$

Simulations to Study the Sampling Distribution
“Easy” for a single item:

$$ r_{WG} = 1 - \frac{S_x^2}{\sigma^2} $$

Simulate data for a discrete “null” distribution.

Dependent vs Independent
Display of E(rWG(J))
• n=3, 10, 100, corresponding to small, medium, and large group sizes
• J=6, 10
• Data either uniform or slight skew
• A=5
• Independent or CS (Compound Symmetry), ρ=0.6

Compound Symmetry Structure (CS)

$$ V_{3 \times 3} = \begin{pmatrix} \sigma^2 & \rho\sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \rho\sigma^2 & \sigma^2 \end{pmatrix} $$

Uniform data, uniform null: “Error of first kind”
Skew data, uniform null: “power”

CS, ρ=0.6

Testing Agreement for Multi-item Scales with the Indices rWG(J) and ADM(J)
Ayala Cohen, Etti Doveh & Inbal Shani
Organizational Research Methods (2007)

Table of .95 rWG(J) percentiles, CS correlation structure
J=5, A=5

n (group size)   rho=0   rho=.3   rho=.4
3                0.87    0.90     0.90
5                0.74    0.77     0.78
6                0.70    0.73     0.75
7                0.69    0.70     0.72
8                0.66    0.66     0.67
15               0.52    0.56     0.58
20               0.49    0.50     0.53

Simulation
• Software available: simulate the null distribution of the index for a given n, A, J, k and the correlation structure among the items.

Example
Bliese et al.'s (2002) sample of 2042 soldiers in 49 U.S. Army companies. The companies ranged in size from n=10 to n=99. Task significance is a three-item scale (J=3) assessing a sense of task significance during the deployment (A=5).

        Sample values    Simulation-based percentiles    Rejection area (1=in, 0=out)
size    rwg     AD       .95 rwg    .05 AD               rwg    AD
10      .00     .97      .72        .63                  0      0
20      .67     .69      .61        .72                  1      1
37      .44     .78      .48        .80                  0      1
41      .46     .81      .47        .80                  0      0
45      .00     .93      .45        .81                  0      0
53      .00     .99      .43        .82                  0      0
58      .39     .76      .43        .83                  0      1
68      .16     .91      .40        .85                  0      0
85      .51     .80      .37        .86                  1      1
99      .29     .87      .34        .88                  0      1

Inference on the Ensemble
• ICC summarizes the whole structure, assuming homogeneity of within variances.
• Agreement indices are derived for each group separately.
• How to infer on the ensemble?
Analogous to regression analysis: F vs individual t tests

Based on the rWG(J) tests, significance was obtained for 5 companies, while based on ADM(J) it was obtained for 6 companies. Under the null hypothesis, the probability of obtaining at least 5 significant results (out of 49 independent tests) at significance level α=0.05 is 0.097, and it is 0.035 for obtaining at least 6. Thus, if we allow a global significance level of no more than 0.05, we cannot reject the null hypothesis based on rWG(J), but will reject it based on ADM(J).

Ensemble of groups (RMNET)
What shall we do with groups that fail the threshold?
1) Toss them out because something is “wrong” with these groups. The logic is that if they do not agree, then the construct has not solidified with them; maybe they are not a collective, or they are distinct subgroups …
Ensemble of groups (RMNET)
2) KEEP them … If on average we get reasonable agreement across groups on a construct, it justifies aggregation for all. (… CFA: we index everyone, even if some people may have answered in a way which is inconsistent across items, but we do not drop them …)

Open Questions
• Extension of slight skew to A>5
• Power analysis
• Comparison of degrees of agreement, non-null cases
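Two computations described in the talk lend themselves to a short Python sketch: simulating a null percentile of rWG(J), and the binomial tail used for the ensemble test on the 49 companies. This is not the authors' software. The function names are hypothetical, and the sketch assumes independent items drawn from a uniform null, item variances with n in the denominator (as in the five-item worked example), and rWG(J) set to 0 whenever r1 ≤ 0 (in the spirit of the James et al. (1984) recommendation for negative values).

```python
import math
import random


def r_wg_j(ratings, n_categories):
    """rWG(J) for one group; ratings[k][j] is rater k's answer to item j.

    Assumptions (not taken from the authors' software): uniform null
    variance (A^2 - 1) / 12, item variances with n in the denominator,
    and a value of 0 whenever r1 <= 0.
    """
    n, n_items = len(ratings), len(ratings[0])
    sigma2 = (n_categories ** 2 - 1) / 12
    item_vars = []
    for j in range(n_items):
        col = [row[j] for row in ratings]
        mu = sum(col) / n
        item_vars.append(sum((x - mu) ** 2 for x in col) / n)
    r1 = 1 - (sum(item_vars) / n_items) / sigma2
    if r1 <= 0:
        return 0.0
    return n_items * r1 / (1 + (n_items - 1) * r1)


def simulate_null_percentile(n, n_items, n_categories,
                             q=0.95, reps=2000, seed=0):
    """Empirical q-th percentile of rWG(J) under independent uniform items."""
    rng = random.Random(seed)
    values = sorted(
        r_wg_j([[rng.randint(1, n_categories) for _ in range(n_items)]
                for _ in range(n)], n_categories)
        for _ in range(reps)
    )
    return values[min(reps - 1, int(q * reps))]


def prob_at_least(m, n_tests, alpha):
    """P(at least m rejections among n_tests independent level-alpha tests)."""
    return sum(math.comb(n_tests, i) * alpha ** i * (1 - alpha) ** (n_tests - i)
               for i in range(m, n_tests + 1))


# Worked example from the talk: 3 raters, A = 7 categories, J = 5 items.
example = [[7, 5, 5, 6, 7], [4, 2, 2, 3, 4], [3, 1, 1, 2, 3]]
example_rwgj = r_wg_j(example, 7)

# Simulated .95 percentiles for a small and a larger group (J = 5, A = 5).
p_small = simulate_null_percentile(3, 5, 5)
p_large = simulate_null_percentile(20, 5, 5)

# Ensemble test: 49 companies, alpha = 0.05.
p_five = prob_at_least(5, 49, 0.05)
p_six = prob_at_least(6, 49, 0.05)
print(round(example_rwgj, 2), round(p_five, 3), round(p_six, 3))
```

The worked example reproduces the value of about 0.66, and the binomial tails come out at about 0.097 and 0.035, matching the figures quoted for the ensemble test. The simulated percentiles depend on the assumptions listed above, so they should be expected to resemble, not exactly reproduce, the published table.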