Agreement Indices in Multi-Level Analysis
Ayala Cohen
Faculty of Industrial Engineering & Management
Technion-Israel Institute of Technology
May 2007
Outline
• Introduction (on interrater agreement, IRA)
• rWG(J) index of agreement
• AD (Absolute Deviation), an alternative measure of agreement
• Review of our work:
– 2001, with Etti Doveh and Uri Eick
– 2007, with Etti Doveh and Inbal Shani
INTRODUCTION
Why do we need a measure of agreement?
In recent years there has been a growing number of studies based on multi-level data in applied psychology, organizational behavior, and clinical trials.
Typical data structures:
Individuals within groups (two levels)
Groups within departments (three levels)
Constructs
• Constructs are our building blocks in
developing and in testing theory.
• Group-level constructs describe the group
as a whole and are of three types
(Kozlowski & Klein, 2000):
– Global, shared, or configural.
Global Constructs
• Relatively objective, easily observable,
descriptive group characteristics.
• Originate and are manifest at the group level.
• Examples:
– Group function, size, or location.
• No meaningful within-group variability.
• Measurement is generally straightforward.
Shared Constructs
• Group characteristics that are common to
group members
• Originate in group members’ attitudes,
perceptions, cognitions, or behaviors
– Which converge as a function of socialization,
leadership, shared experience, and interaction.
• Within-group variability predicted to be low.
• Examples: Group climate, norms.
Configural Group-Level Constructs
• Group characteristics that describe the array,
pattern, dispersion, or variability within a group.
• Originate in group member characteristics (e.g.,
demographics, behaviors, personality)
– But no assumption or prediction of convergence.
• Examples:
– Diversity, team star or weakest member.
Justifying Aggregation
• Why is this essential?
– In the case of shared constructs, our construct definitions rest on assumptions regarding within- and between-group variability.
– If our assumptions are wrong, our construct “theories” and our measures are flawed, and so are our conclusions.
• So, test both:
– Within-group agreement: the construct is supposed to be shared; is it really?
– Between-group variability (reliability): groups are expected to differ significantly; do they really?
Chen, Mathieu & Bliese (2004) proposed a framework for conceptualizing and testing multilevel constructs. This framework includes the assessment of within-group agreement.
Assessment of agreement is a prerequisite for arguing that a higher-level construct can be operationalized.
A distinction should be made between:
Interrater reliability (IRR) and
Interrater agreement (IRA)
Many past studies wrongly used the two terms interchangeably in their discussions.
The term interrater agreement refers to the degree to which ratings from individuals are interchangeable; namely, it reflects the extent to which raters provide essentially the same rating (Kozlowski & Hattrup, 1992; Tinsley & Weiss, 1975).
Interrater reliability refers to the degree to which the ratings of different judges are proportional when expressed as deviations from their means.
Interrater reliability (IRR) refers to relative consistency and is assessed by correlations.
Interrater agreement (IRA) refers to absolute consensus in the scores assigned by the raters and is assessed by measures of variability.
Scale of Measurement
• Questionnaire with J parallel items on a Likert scale with A categories, e.g. A = 5:
1 = Strongly disagree, 2 = Disagree, 3 = Indifferent, 4 = Agree, 5 = Strongly agree
Example
k = 3 raters, Likert scale with A = 7 categories, J = 5 items.

Item | Rater 1 | Rater 2 | Rater 3 | Dev. from mean
  1  |    7    |    4    |    3    |  1
  2  |    5    |    2    |    1    | -1
  3  |    5    |    2    |    1    | -1
  4  |    6    |    3    |    2    |  0
  5  |    7    |    4    |    3    |  1

Expressed as deviations from each rater's own mean, all three raters give the identical profile shown in the last column: the ratings are perfectly proportional (high reliability) even though the raters never assign the same absolute score (low agreement).
Prior to aggregation, we assessed within-unit agreement on …
To do so, we used two complementary approaches (Kozlowski & Klein, 2000):
A consistency-based approach: computation of the intraclass correlation coefficient, ICC(1).
A consensus-based approach: an index of agreement.
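For the consistency-based approach, ICC(1) is usually obtained from a one-way random-effects ANOVA. A minimal Python sketch; the function name and the use of the average group size for unbalanced data are choices of this illustration, not prescribed by the talk:

```python
import numpy as np

def icc1(groups):
    """ICC(1) from a one-way random-effects ANOVA.

    groups: list of 1-D arrays, one array of individual ratings per group.
    """
    k = len(groups)
    sizes = np.array([len(g) for g in groups])
    grand = np.concatenate(groups).mean()
    # between- and within-group mean squares
    msb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups) / (k - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (sizes.sum() - k)
    n_bar = sizes.mean()   # average group size: one convention for unbalanced data
    return (msb - msw) / (msb + (n_bar - 1) * msw)

print(icc1([np.array([3., 3, 4]), np.array([5., 5, 4]), np.array([1., 2, 2])]))
```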
How can we assess agreement?
• Variability measures, e.g. the variance or MAD (Mean Absolute Deviation).
Problem: what counts as “small” or “large” values?
The most widely used index of interrater agreement on Likert-type scales has been rWG(J), introduced by James, Demaree & Wolf (1984). J stands for the number of items in the scale.
Examples where rWG(J) was used to assess interrater agreement:
Group cohesiveness
Group socialization emphasis
Transformational and transactional leadership
Positive and negative affective group tone
Organizational climate
This index compares the observed within-group variance to the variance expected under “random responding”.
In the particular case of one item (stimulus), J = 1, the index is denoted rWG and equals

rWG = 1 − Sx²/σ²

where Sx² is the variance of the ratings for the single stimulus, and σ² is the variance of some “null distribution” corresponding to no agreement.
Problem:
A limitation of rWG(J) is that there is no clear-cut definition of a random response, and the appropriate specification of the null distribution that models no agreement is debatable.
If the null distribution used to define σ² fails to model a random response properly, then the interpretability of the index is suspect.
The most natural candidate to represent non-agreement is the uniform (rectangular) distribution, which implies that for an item with A categories, the proportion of cases in each category equals 1/A.
For a uniform null, for an item with A categories:

σ²null = (A² − 1)/12

and rWG = 1 − Sx²/σ²null.
How should we calculate the sample variance Sx²? We have n ratings, and suppose n is “small”:

Sx² = (1/n) Σk=1..n (Xk − X̄)²
Example
A = 5, k = 9 raters: 3 3 3 3 5 5 5 5 4
With (n − 1) in the denominator, Sx² = 1.
σ²null = (A² − 1)/12 = 2
rWG = 1 − Sx²/σ²null = 1 − 1/2 = 1/2
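A minimal Python sketch (the helper name is ours, not from the talk) that reproduces this computation, assuming the uniform null and, by default, the (n − 1) variance denominator used above:

```python
import numpy as np

def rwg_single_item(ratings, A, ddof=1):
    """rWG for one item: 1 - observed variance / uniform-null variance."""
    ratings = np.asarray(ratings, dtype=float)
    s2 = ratings.var(ddof=ddof)          # ddof=1 -> (n - 1) denominator
    sigma2_null = (A**2 - 1) / 12.0      # variance of the uniform null
    return 1.0 - s2 / sigma2_null

print(rwg_single_item([3, 3, 3, 3, 5, 5, 5, 5, 4], A=5))   # -> 0.5
```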
James et al. (1984):
“The distribution of responses could be non-uniform when no genuine agreement exists among the judges. The systematic biasing of a response distribution due to a common response style within a group of judges should be considered. This distribution may be negatively skewed, yielding a smaller variance than the variance of a uniform distribution.”
Slight Skewed Null
p1 = .05, p2 = .15, p3 = .20, p4 = .35, p5 = .25
yielding σ²null = Σ pa·a² − (Σ pa·a)² = 14.30 − 3.60² = 1.34.
Used as a “null distribution” in several studies (e.g., Schriesheim et al., 1995; Shamir et al., 1998). Their justification for choosing this null distribution was that it appears to be a reasonably good approximation of random responses to leadership and attitude questionnaires.
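The null variance of any discrete null distribution follows from its category probabilities; a short sketch (hypothetical helper name) that reproduces the 1.34:

```python
import numpy as np

def null_variance(probs):
    """Variance of a discrete null distribution on categories 1..A."""
    p = np.asarray(probs, dtype=float)
    a = np.arange(1, len(p) + 1)         # category values 1..A
    mean = (p * a).sum()
    return (p * a**2).sum() - mean**2

print(null_variance([.05, .15, .20, .35, .25]))   # -> 1.34
```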
[Figure: bar chart of null distributions for A = 5, showing the slight-skewed null over categories 1-5.]
James et al. (1984) suggested several skewed distributions (which differ in their skewness and variance) to accommodate systematic bias.
Often, several null distributions (including the uniform) could be suitable to model disagreement. Thus, the following procedure is suggested: consider the subset of likely null distributions and calculate the largest and smallest null variance specified by this subset.
An additional “problem”: since rWG = 1 − Sx²/σ², the index takes negative values whenever the observed variance is larger than expected under random responding.
Bi-modal distribution (extreme disagreement)
Example: A = 5; half answer 1, half answer 5.
Variance: 4
Uniform null variance: σ²null = (A² − 1)/12 = 2
rWG = 1 − 4/2 = −1
What to do when rWG is negative? James et al. (1984) recommended replacing a negative value by zero. This practice was criticized by Lindell et al. (1999).
For a scale of J items:

rWG(J) = J(1 − s̄x²/σ²) / [J(1 − s̄x²/σ²) + s̄x²/σ²]

where s̄x² is the average variance over the J items.

Writing r1 = 1 − s̄x²/σ², this becomes

rWG(J) = J·r1 / [J·r1 + (1 − r1)] = J·r1 / [1 + (J − 1)·r1]

which has exactly the form of the Spearman-Brown reliability formula, k·r1 / [1 + (k − 1)·r1].
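The equality of the two forms holds because J·r1 + (1 − r1) = 1 + (J − 1)·r1; a purely illustrative symbolic check with sympy:

```python
import sympy as sp

J, r = sp.symbols('J r', positive=True)
lhs = J * r / (J * r + (1 - r))           # rWG(J) as first written
rhs = J * r / (1 + (J - 1) * r)           # Spearman-Brown form
print(sp.simplify(lhs - rhs))             # -> 0: the two forms coincide
```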
Example
3 raters, 7-category Likert scale, J = 5 items (data in the table below).
1 − s̄x²/σ² = 0.2778, so

rWG(J) = 5 × 0.2778 / (1 + 4 × 0.2778) = 0.66
Item | Rater 1 | Rater 2 | Rater 3 | Var*
  1  |    7    |    4    |    3    | 78/27
  2  |    5    |    2    |    1    | 78/27
  3  |    5    |    2    |    1    | 78/27
  4  |    6    |    3    |    2    | 78/27
  5  |    7    |    4    |    3    | 78/27

* Var calculated with n in the denominator.
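A minimal sketch (hypothetical function name) computing rWG(J) under a uniform null from a raters × items matrix; with the n variance denominator, as in the table, it reproduces the 0.66:

```python
import numpy as np

def rwg_j(ratings, A, ddof=0):
    """rWG(J) for a (raters x items) matrix under a uniform null."""
    x = np.asarray(ratings, dtype=float)
    J = x.shape[1]
    s2_bar = x.var(axis=0, ddof=ddof).mean()   # average within-item variance
    sigma2_null = (A**2 - 1) / 12.0
    ratio = s2_bar / sigma2_null
    return J * (1 - ratio) / (J * (1 - ratio) + ratio)

ratings = np.array([[7, 5, 5, 6, 7],    # Rater 1
                    [4, 2, 2, 3, 4],    # Rater 2
                    [3, 1, 1, 2, 3]])   # Rater 3; columns are the J = 5 items
print(round(rwg_j(ratings, A=7), 2))    # -> 0.66
```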
Since its introduction, the use of rWG(J) has raised several criticisms and debates.
It was initially described by James et al. (1984) as a measure of interrater reliability. Schmidt & Hunter (1989) criticized the index, claiming that an index of reliability cannot be defined on a single item.
In response, Kozlowski and Hattrup (1992) argued that it is an index of agreement, not reliability. James, Demaree & Wolf (1993) concurred with this distinction, and it is now accepted that rWG(J) is a measure of agreement.
Lindell, Brandt and Whitney (1999) suggested, as an alternative to rWG(J), a modified index which is allowed to take negative values (even beyond −1):

r*WG(J) = 1 − s̄x²/σ²
The modified index r*WG(J) provides corrections to two of the criticisms raised against rWG(J).
First, it can take negative values when the observed agreement is less than hypothesized.
Second, unlike rWG(J), it does not include a Spearman-Brown correction and thus does not depend on the number of items (J).
Academy of Management Journal, 2006
“Does CEO Charisma Matter…”, Agle et al.
128 CEOs, 770 team members
“Because of strengths and weaknesses of various interrater agreement measures, we computed both the intraclass correlation statistics ICC(1) and ICC(2), and the interrater agreement statistic r*WG(J)……”
Agle et al. (2006):
“Overall, the very high interrater agreement justified the combination of individual managers’ responses into a single measure of charisma for each CEO….”
They display ICC(1) = …, ICC(2) = …, r*WG(J) = … A single number for the whole sample?
Ensemble of groups (RMNET)
Shall we report the median or the mean?
…“Observed distributions of rWG(J) are often wildly skewed…. medians are the most appropriate summary statistic”…
Ehrhart, M.G. (Personnel Psychology, 2004)
Leadership and procedural justice climate as antecedents of unit-level organizational citizenship behavior.
Grocery store chain: 3,914 employees in 249 departments.
…“The median rwg values across the 249 departments were: 0.88 for servant leadership, …………”
WHAT TO CONCLUDE?
Rule-Of-Thumb
The practice of viewing rWG values in the .70s and higher as representing acceptable convergence is widespread. For example, Zohar (2000) cited rWG values in the .70s and mid .80s as proof that judgments “were sufficiently homogeneous for within-group aggregation”.
Benchmarking rWG Interrater Agreement Indices: Let’s Drop the .70 Rule-Of-Thumb
Paper presented at the Annual Conference of the Society for Industrial and Organizational Psychology, Chicago, April 2004.
R.J. Harvey and E. Hollander
“It is puzzling why many researchers and practitioners continue to rely on arbitrary rules-of-thumb to interpret rWG, especially the popular rule-of-thumb stating that rWG ≥ 0.70 denotes acceptable agreement”…
“The justification of the rule rests largely on the argument that some researchers (e.g. James et al., 1984) viewed rater agreement as being similar to reliability; reliabilities as low as .7 are useful (e.g. Nunnally, 1978); therefore rWG ≥ 0.7 implies interrater reliability”…
There is little empirical basis for a .70 cutoff, and few studies have attempted to determine how various values equate with “real world” levels of interrater agreement.
The sources of four commonly reported cutoff criteria (Lance, Butts & Michels, 2006, ORM):
1) GFI > .9 indicates well-fitting SEMs.
2) Reliability of .7 or higher is acceptable.
3) rWG values > .7 justify aggregation of individual responses to group-level measures.
4) Keep the number of factors whose eigenvalues are greater than 1.
Rule-Of-Thumb
A reviewer…
“I believe the phrase has its origins in describing the size of the stick appropriate for beating one’s wife. A stick was to be no larger in diameter than the man’s thumb….. Thus, use of this phrase might be offensive to some of the readership”…
Rule-Of-Thumb
Feminists often make the claim that the “rule of thumb” used to mean that it was legal to beat your wife with a rod, so long as that rod was no thicker than the husband’s thumb. But it turns out to be an excellent example of what may be called fiction….
Rule-Of-Thumb
From carpentry: the length from the tip of one’s thumb to the first knuckle was used as an approximation for one inch.
As such, we apologize to readers who may be offended by the reference to “rule of thumb”, but remind them of the mythology surrounding its interpretation.
Statistical Tests
Test the null hypothesis of no agreement.
Dunlop et al. (JAP, 2003) provided a table of rWG “critical values” under the null hypothesis of a uniform null (J = 1, one item; different values of A for the Likert scale; and different numbers of judges).
Critical values of the rWG statistic at the 5% level of statistical significance
[Table: columns A = 2, 3, 5, 11; rows n = 4, 5, 8, 10. The surviving values, whose column alignment was lost in extraction, are: n = 4: 1.0; n = 5: .85, .78; n = 8: .61, .59; n = 10: .53, .49. See Dunlop et al. (2003) for the full table.]
Dunlop et al. (2003):
“Statistical tests of rWG are useful if one’s objective is to determine if any nonzero agreement exists; although useful, this reflects a qualitatively different goal from determining if reasonable consensus exists for a group to aggregate individual-level data to the group level of analysis”…
Alternative index AD
Proposed by Burke, Finkelstein & Dusig (Organizational Research Methods, 1999):

ADM(j) = (1/n) Σk=1..n |Xjk − X̄j|   (for item j)

ADM(J) = Σj ADM(j) / J   (average over the J items)
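A minimal sketch (hypothetical function name) of both quantities for a raters × items matrix, shown on the 3-rater, 5-item example used earlier:

```python
import numpy as np

def ad_m(ratings):
    """ADM(j) per item and ADM(J): mean absolute deviation from item means."""
    x = np.asarray(ratings, dtype=float)
    per_item = np.abs(x - x.mean(axis=0)).mean(axis=0)   # ADM(j) for each item
    return per_item, per_item.mean()                     # (vector, ADM(J))

ratings = np.array([[7, 5, 5, 6, 7],
                    [4, 2, 2, 3, 4],
                    [3, 1, 1, 2, 3]])    # rows: raters, columns: items
per_item, overall = ad_m(ratings)
print(per_item, overall)                 # each item ~1.56, so ADM(J) ~ 1.56
```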
Gonzalez-Roma, V., Peiro, J.M., & Tordera, N. (2002), JAP:
“The ADM(J) index has several advantages compared with the James, Demaree and Wolf (1984) interrater agreement index rwg; see Burke et al. (1999)”…
First, the ADM(J) index does not require modeling the null response distribution; it only requires an a priori specification of a null response range of interrater agreement. Second, the index provides estimates of interrater agreement in the metric of the original response range.
We followed Burke and colleagues’ (1999) specification of using a null response range equal to or less than 1 when the response scale is a 5-point Likert-type scale. This value is consistent with our judgment that any two contiguous scale points are somewhat similar for the 5-point Likert-type scales used in the present study.
…Organizational commitment was measured by three items (J = 3). Respondents answered using a 5-point scale (A = 5). The mean ADM(J) was 0.68 (SD = 0.32) and the ICC(1) was .22.
The one-way ANOVA result, F(196, 441) = 1.7, p < .01, suggests adequate between-group differentiation and supports the validity of the aggregate organizational commitment measure.
Statistical Tests
Test the null hypothesis of no agreement.
Dunlop et al. (JAP, 2003) provided a table of AD “critical values” under the null hypothesis of a uniform null (one item; different values of A for the Likert scale; and different numbers of judges).
Critical values of the AD statistic at the 5% level of statistical significance
[Table: columns A = 2, 3, 5, 11; rows n = 4, 5, 8, 10. The surviving values, whose column alignment was lost in extraction, are: n = 4: 0.0, .75; n = 5: .40, 1.04; n = 8: .69, 1.53; n = 10: .74, 1.70. See Dunlop et al. (2003) for the full table.]
Criticism vs Reality
• Citations and applications of rWG(J) appear in more than 450 papers.
So far, the index rWG(J) has been used much more frequently than ADM(J).
We performed a systematic literature search of organizational multilevel studies published during the years 2000-2006 (ADM(J) was introduced in 1999).
Among the 41 papers that included justification of the aggregation from the individual to the group level, 40 (98%) used the index rWG(J) and only 2 (5%) used the index ADM(J). One study used both indices.
Statistical properties of rWG(J)
• Cohen, A., Doveh, E., & Eick, U. (2001). Statistical properties of the rWG(J) index of agreement. Psychological Methods, 6, 297-310.
Studied the sampling distribution of rWG(J) under the null hypothesis.
Simulations to Study the Sampling Distribution
The null distribution of

rWG(J) = J(1 − s̄x²/σ²) / [J(1 − s̄x²/σ²) + s̄x²/σ²]

depends on:
• J (number of items in the scale)
• A (number of Likert scale categories)
• n (group size)
• the null variance σ²
• the correlation structure between items in the scale
Simulations to Study the Sampling Distribution
This is “easy” for a single item: for rWG = 1 − Sx²/σ², simply simulate data from a discrete “null” distribution.
Dependent vs Independent
Display of E(rWG(J)) for:
n = 3, 10, 100, corresponding to small, medium, and large group sizes
J = 6, 10
Data either uniform or slightly skewed
A = 5
Items independent, or CS (Compound Symmetry) with ρ = 0.6
Compound Symmetry (CS) Structure
A 3 × 3 example:

         | σ²   ρσ²  ρσ² |
V(3×3) = | ρσ²  σ²   ρσ² |
         | ρσ²  ρσ²  σ²  |
[Figures: E(rWG(J)) under the simulated scenarios. Uniform data with a uniform null illustrates the “error of the first kind”; skewed data with a uniform null illustrates “power”. Results shown for independent items and for CS with ρ = 0.6.]
Testing Agreement for Multi-item Scales with the Indices rWG(J) and ADM(J)
Ayala Cohen, Etti Doveh & Inbal Shani
Organizational Research Methods (2007)
Table of the .95 rWG percentile, CS correlation structure (J = 5, A = 5):

n (group size) | rho = 0 | rho = .3 | rho = .4
       3       |  0.87   |   0.90   |   0.90
       5       |  0.74   |   0.77   |   0.78
       6       |  0.70   |   0.73   |   0.75
       7       |  0.69   |   0.70   |   0.72
       8       |  0.66   |   0.66   |   0.67
      15       |  0.52   |   0.56   |   0.58
      20       |  0.49   |   0.50   |   0.53
Simulation
• Software available: simulates the null distribution of the index for a given n, A, J, k and correlation structure among the items. (A sketch of the idea follows below.)
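A Monte Carlo sketch of such a simulation. The helper name is ours, and inducing correlated Likert responses by discretizing a latent multivariate normal into A equally likely categories is an assumption of this illustration, not necessarily the exact scheme used by the software; the choice of variance denominator also affects the percentiles.

```python
import numpy as np
from scipy.stats import norm

def simulate_rwg_j_null(n_raters, A, J, rho, n_sims=20000, seed=0):
    """Monte Carlo null distribution of rWG(J) under a uniform null
    with compound-symmetry correlation rho among the J items."""
    rng = np.random.default_rng(seed)
    sigma2_null = (A**2 - 1) / 12.0
    cov = np.full((J, J), rho)
    np.fill_diagonal(cov, 1.0)
    L = np.linalg.cholesky(cov)
    cuts = norm.ppf(np.arange(1, A) / A)     # equal-probability cut points
    out = np.empty(n_sims)
    for s in range(n_sims):
        z = rng.standard_normal((n_raters, J)) @ L.T
        x = np.searchsorted(cuts, z) + 1     # Likert categories 1..A
        ratio = x.var(axis=0, ddof=1).mean() / sigma2_null
        out[s] = J * (1 - ratio) / (J * (1 - ratio) + ratio)
    return out

# e.g., an estimated .95 critical value for J = 5, A = 5, n = 8, rho = .3
# (compare with the table above)
print(np.percentile(simulate_rwg_j_null(8, 5, 5, 0.3), 95))
```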
Example
Bliese et al.’s (2002) sample of 2,042 soldiers in 49 U.S. Army companies. The companies ranged in size from n = 10 to n = 99. Task significance is a three-item scale (J = 3) assessing a sense of task significance during the deployment (A = 5).
Simulation-based values and percentiles (rows shown for ten of the 49 companies; rejection area: 1 = in, 0 = out):

Size | rwg | AD  | .95 percentile (rwg) | .05 percentile (AD) | Reject rwg | Reject AD
 10  | .00 | .97 |         .72          |        .63          |     0      |    0
 20  | .67 | .69 |         .61          |        .72          |     1      |    1
 37  | .44 | .78 |         .48          |        .80          |     0      |    1
 41  | .46 | .81 |         .47          |        .80          |     0      |    0
 45  | .00 | .93 |         .45          |        .81          |     0      |    0
 53  | .00 | .99 |         .43          |        .82          |     0      |    0
 58  | .39 | .76 |         .43          |        .83          |     0      |    1
 68  | .16 | .91 |         .40          |        .85          |     0      |    0
 85  | .51 | .80 |         .37          |        .86          |     1      |    1
 99  | .29 | .87 |         .34          |        .88          |     0      |    1
Inference on the Ensemble
• ICC summarizes the whole structure, assuming homogeneity of the within-group variances.
• Agreement indices are derived for each group separately.
• How can we draw inference on the ensemble? The situation is analogous to regression analysis: an overall F test vs individual t tests.
Based on the rWG(J) tests, significance was obtained for 5 companies, while based on ADM(J) it was obtained for 6 companies.
Under the null hypothesis, the probability of obtaining at least 5 significant results out of 49 independent tests, each at significance level α = 0.05, is 0.097; the probability of obtaining at least 6 is 0.035.
Thus, if we allow a global significance level of no more than 0.05, we cannot reject the null hypothesis based on rWG(J), but we will reject it based on ADM(J).
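These two tail probabilities are binomial; a short check using scipy's binomial survival function reproduces them:

```python
from scipy.stats import binom

# P(at least k significant results out of 49 independent tests at alpha = .05)
for k in (5, 6):
    print(k, binom.sf(k - 1, n=49, p=0.05))   # -> ~0.097 and ~0.035
```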
Ensemble of groups (RMNET)
What shall we do with groups that fail the threshold?
1) Toss them out, because something is “wrong” with these groups. The logic is that if they do not agree, then the construct has not solidified within them; maybe they are not a collective but distinct subgroups….
Ensemble of groups (RMNET)
2) KEEP them…
If on average we get reasonable agreement across groups on a construct, it justifies aggregation for all.
(Compare CFA: we index everyone, even if some people may have answered in a way that is inconsistent across items, but we do not drop them.)
Open Questions
• Extension of the slight-skew null to A > 5
• Power analysis
• Comparison of degrees of agreement in non-null cases