HOW TO USE PROC CATMOD
IN ESTIMATION PROBLEMS
Olaf Gefeller¹, Franz Woltering²
¹Abteilung Medizinische Statistik, Georg-August-Universität Göttingen
²Fachbereich Statistik, Universität Dortmund
Abstract
The paper describes a new application of the statistical analysis procedure PROC CATMOD. It demonstrates
how PROC CATMOD can be easily adapted to estimation problems in contingency tables. In particular, it
is shown how estimators of measures of association and their asymptotic variances can be calculated using
PROC CATMOD. The practical application of this approach is illustrated for the kappa-coefficient. Real
data of the German Forest Decline Survey are provided in the practical example.
Keywords: GSK-model, kappa-coefficient, measures of association, variance calculation
1. Introduction
PROC CATMOD, one of the statistical analysis procedures within the SAS/STAT* software, was originally
designed in the context of the Grizzle, Starmer & Koch (1969) (subsequently abbreviated GSK) approach
to fit linear models to functions of response frequencies in contingency tables, as used in linear modeling,
log-linear modeling, logistic regression, and repeated measurement analysis. The GSK-approach is typically
applied to the analysis of contingency tables with ordered responses (Williams & Grizzle, 1972), measurement
of observer agreement (Landis & Koch, 1977), the analysis of repeated measurement experiments (Koch et
al., 1977), and rank function analysis (Semenya et al., 1983). In all these applications PROC CATMOD has
been successfully employed to solve the computational problems of the analysis. It is a complex but
powerful tool that takes some experience to handle (and patience to struggle through more than 100 pages
of documentation in the SAS/STAT* manual). In this paper we demonstrate that it can be easily adapted to
estimation problems in contingency tables. For a broad class of estimators of measures of association under the
multinomial sampling model we show how to use PROC CATMOD to calculate the estimator of the measure
of association and - more importantly - the asymptotic variance of the estimator. The statistical background
of our method, the link between the GSK-approach and the estimation of measures of association, will not
be presented here (see Gefeller & Woltering, 1991). To illustrate the practical application of the procedure
the well-known kappa-coefficient, a measure of agreement between two different ratings, is used, and data
of the German Forest Decline Survey (Krahl-Urban et al., 1988) are presented. Further general remarks on
the proposed method are provided in the concluding section.
2. Methods
To adapt the GSK-approach to estimation problems in contingency tables, complex ratio statistics such as
measures of association have to be written as special functions of the probability estimates of the underlying
product multinomial model. To generate estimators of this type, compounded functions involving only linear,
logarithmic, and exponential transformations (Forthofer & Koch, 1973) of the general form
\[
F(p) = \dots A_5 \left[ \exp \left( A_4 \left[ \ln \left( A_3 \left[ \exp \left( A_2 \left[ \ln \left( A_1 p \right) \right] \right) \right] \right) \right] \right) \right],
\]
where each $A_i$ denotes a matrix of constants,
are employed. This general framework offers the opportunity to compute complex estimators in which probabilities from different subpopulations are combined.
In the special situation of a single multinomial population achieved by unrestricted sampling of elements,
measures of association can be expressed in the same way as illustrated above. The advantage derived in this situation lies in the chance of using standard software for GSK-models such as PROC CATMOD to calculate the
estimators and their asymptotic variances. The device consists of fitting the simple linear model $F(p) = X\beta$,
where $F(p)$ is the 1-dimensional response function as specified above, $X$ denotes the degenerate 1 x 1 matrix
consisting of the constant '1', and $\beta$ represents the 1-dimensional parameter. Then the estimator $b$ of $\beta$ equals
the response function $F$ itself and the variance of $b$ is given by $V_b = V_F$, the estimated asymptotic variance of $F$. This analysis is possible in PROC
CATMOD by directly specifying the design matrix on the MODEL statement and by using the RESPONSE
statement to describe a series of transformations to the probability estimates in order to produce $F(p)$, the
function of interest.
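For orientation, the variance $V_F$ can be written out with the delta method. The display below is a sketch in common GSK-style notation and is not taken from this paper: the symbols $n$ (sample size), $D_x$ (diagonal matrix with the vector $x$ on its diagonal), $H$, $a_1$ and $a_2$ are introduced here for illustration, and a two-stage function $F(p) = \exp(A_2 \ln(A_1 p))$ under a single multinomial sample is assumed.

```latex
\[
V_p = \frac{1}{n}\left( D_p - p\,p' \right), \qquad
a_1 = A_1 p, \qquad a_2 = \exp\left( A_2 \ln a_1 \right) = F(p),
\]
\[
H = \frac{\partial F}{\partial p} = D_{a_2}\, A_2\, D_{a_1}^{-1}\, A_1, \qquad
V_F = H\, V_p\, H' .
\]
```

Each additional logarithmic or exponential stage contributes one more factor of the same kind to $H$ by the chain rule, which is why the derivative matrix can be assembled mechanically from the matrices $A_i$.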
At first glance this procedure seems to increase the computational effort by introducing the tedious work
of constructing a series of transformations to describe the measure of association. But, in fact, the most
annoying part of the computational task, the calculation of the asymptotic variance through computation
of the first derivative matrix and of additional matrix products, is completely undertaken by the computer
program. Thus the computational effort for the user is reduced substantially.
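To make the remark about the derivative matrix concrete, the following minimal Python sketch carries out by hand what the program does internally, for a deliberately simple response function: the log odds ratio $F(p) = A \ln(p)$ with $A = (1, -1, -1, 1)$ in a 2 x 2 table. The counts are hypothetical, the helper name `delta_method_variance` is introduced here for illustration only, and the multinomial covariance $(D_p - pp')/n$ is assumed.

```python
import math


def delta_method_variance(h, p, n):
    """Return h' V h for V = (diag(p) - p p') / n, the multinomial covariance."""
    second_moment = sum(pk * hk * hk for pk, hk in zip(p, h))
    first_moment = sum(pk * hk for pk, hk in zip(p, h))
    return (second_moment - first_moment ** 2) / n


# Hypothetical 2 x 2 table, cells ordered (1,1), (1,2), (2,1), (2,2).
counts = [10, 20, 30, 40]
n = sum(counts)
p = [c / n for c in counts]

# Response function F(p) = A ln(p) with A = (1, -1, -1, 1): the log odds ratio.
A = [1, -1, -1, 1]
log_odds_ratio = sum(a * math.log(pk) for a, pk in zip(A, p))

# First derivative of F with respect to p: dF/dp_k = A_k / p_k.
h = [a / pk for a, pk in zip(A, p)]
variance = delta_method_variance(h, p, n)

print(round(log_odds_ratio, 4), round(variance, 4))
```

For this function the result coincides with the familiar expression for the variance of the log odds ratio, the sum of the reciprocal cell counts, which gives a simple check that the derivative step was done correctly.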
3. Practical Example
To illustrate the practical application of the method the well-known kappa-coefficient (Cohen, 1960) is used.
Cohen's kappa constitutes a popular measure of agreement between two different ratings. It is defined for
square K x K contingency tables. Using the usual row-column parametrization of cell probabilities in
contingency tables, which will be denoted as $\pi_{ij}$, $i,j = 1, \dots, K$, the kappa-coefficient is defined as
\[
\kappa := \frac{\sum_{i=1}^{K} \pi_{ii} - \sum_{i=1}^{K} \pi_{i\cdot}\,\pi_{\cdot i}}{1 - \sum_{i=1}^{K} \pi_{i\cdot}\,\pi_{\cdot i}} .
\]
The term $\sum_{i=1}^{K} \pi_{i\cdot}\,\pi_{\cdot i}$ represents the expected value of agreement under the hypothesis of independent ratings.
Procedures for the estimation of the kappa-coefficient and its asymptotic variance are not available in standard
statistical software packages. To use the SAS/STAT* procedure CATMOD to do the calculations in the way
outlined above, the following steps have to be applied:
(1) specify $\kappa$ as a function of a suitable vector of probabilities $\pi_{ij}$
(2) transform $\kappa$ to a compounded function involving only linear, logarithmic and exponential operations
(3) set up the 'dummy' MODEL statement consisting of the constant '1' as the design matrix
(4) set up the RESPONSE statement using the transformation constructed in (2)
(5) run PROC CATMOD and look for the 'Analysis of weighted-least-squares estimates'-table in the
output, where the estimated value of $\kappa$ and its asymptotic standard error appear (in addition, the
estimated asymptotic variance of $\kappa$ can be obtained directly by specifying the 'COVB'-option on the
MODEL statement)
As a numerical example we use data from the Forest Decline Survey (Krahl-Urban et al., 1988). The data
based on the variable 'loss of needles' (in percent), which has been categorized independently by two observers
into four groups according to the severity of damage, are presented in the following 4 x 4 contingency table:

          1     2     3     4     Σ
    1    60    30     6     1    97
    2    23    43    18     2    86
    3     5     4     1     9    19
    4     1     1     1    11    14
    Σ    89    78    26    23   216
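As a plausibility check on the table, the kappa-coefficient can be evaluated directly from its definition before any compounded function is set up. The short Python sketch below does this for the counts above; the function name `cohen_kappa` is chosen here for illustration, and the code is only a cross-check, not part of the CATMOD approach.

```python
# 4 x 4 table of the example; rows and columns index the two observers' ratings.
table = [
    [60, 30, 6, 1],
    [23, 43, 18, 2],
    [5, 4, 1, 9],
    [1, 1, 1, 11],
]


def cohen_kappa(table):
    """Plug-in estimate of Cohen's kappa for a square table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = sum(table[i][i] for i in range(k)) / n             # sum of p_ii
    expected = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    return (observed - expected) / (1 - expected)


kappa = cohen_kappa(table)
print(round(kappa, 4))
```

The observed agreement is 115/216 and the agreement expected under independence is 16157/46656, so the value printed should match the estimate reported in the SAS output of section 5.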
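Now, step 1 involves only the definition that the vector $\pi$ is built up using the row-column cell
probabilities $\pi_{ij}$, $i, j = 1, \dots, 4$, as follows:
\[
\pi := (\pi_{11}, \dots, \pi_{14}, \pi_{21}, \dots, \pi_{24}, \pi_{31}, \dots, \pi_{34}, \pi_{41}, \dots, \pi_{44})' \in \mathbb{R}^{16} .
\]
Step 2 needs a little more work. The representation of $\kappa$ as a compounded function described in (2) is of the
following form:
\[
\kappa = \exp \left( A_4 \left[ \ln \left( A_3 \left[ \exp \left( A_2 \left[ \ln \left( A_1 \pi \right) \right] \right) \right] \right) \right] \right)
\]
where
\[
A_1 = \left( \begin{array}{cccccccccccccccc}
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{array} \right),
\]
\[
A_2 = \left( \begin{array}{cccccccccc}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array} \right),
\]
\[
A_3 = \left( \begin{array}{cccccc}
-1 & -1 & -1 & -1 & 1 & 0 \\
-1 & -1 & -1 & -1 & 0 & 1
\end{array} \right),
\qquad
A_4 = \left( \begin{array}{cc} 1 & -1 \end{array} \right).
\]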
The other steps (3) - (5) can be seen in the listing of the SAS-program and output in the following sections.
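The compounded representation of step 2 can be checked numerically outside SAS. The Python sketch below builds the four matrices for a general K x K table and evaluates exp(A4 ln(A3 exp(A2 ln(A1 π)))) for the example data. It rests on one reading of the matrices, which is an assumption made here: A1 produces the row marginals, the column marginals, the observed agreement and the total probability, in that order. The helper names `matvec`, `kappa_matrices` and `compounded_kappa` are hypothetical.

```python
import math


def matvec(a, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def kappa_matrices(k):
    """A1..A4 for a k x k table, cells ordered row by row (pi_11, pi_12, ...)."""
    cells = [(i, j) for i in range(k) for j in range(k)]
    a1 = [[int(i == r) for i, j in cells] for r in range(k)]     # row marginals
    a1 += [[int(j == c) for i, j in cells] for c in range(k)]    # column marginals
    a1.append([int(i == j) for i, j in cells])                   # observed agreement
    a1.append([1] * (k * k))                                     # total probability
    # A2 adds log(row marginal i) and log(column marginal i); last two rows pass through.
    a2 = [[int(col in (i, k + i)) for col in range(2 * k + 2)] for i in range(k)]
    a2 += [[int(col == 2 * k + s) for col in range(2 * k + 2)] for s in range(2)]
    a3 = [[-1] * k + [1, 0], [-1] * k + [0, 1]]                  # numerator, denominator
    a4 = [[1, -1]]                                               # log num - log den
    return a1, a2, a3, a4


def compounded_kappa(p, k):
    a1, a2, a3, a4 = kappa_matrices(k)
    step = [math.log(x) for x in matvec(a1, p)]
    step = [math.exp(x) for x in matvec(a2, step)]
    step = [math.log(x) for x in matvec(a3, step)]   # requires a positive numerator
    return math.exp(matvec(a4, step)[0])


counts = [60, 30, 6, 1, 23, 43, 18, 2, 5, 4, 1, 9, 1, 1, 1, 11]
n = sum(counts)
kappa = compounded_kappa([c / n for c in counts], 4)
print(round(kappa, 4))
```

Because the last step takes the logarithm of the numerator, this representation only works when the observed agreement exceeds the agreement expected under independence, as it does for these data.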
4. SAS-Program of the Practical Example
*---------------------------------------------------------------*
  Kappa statistic for interobserver agreement

  4 response categories of the variable 'loss of needles':
     1 = less than 5%
     2 =  5% - 15%
     3 = 15% - 25%
     4 = more than 25%

  2 independent observers

  Data source: Krahl-Urban et al. (1988)
*---------------------------------------------------------------*;

title 'Measurement of interobserver agreement';

data fds;
*---------------------------------------------------------------*
  input of the 4*4 contingency table
*---------------------------------------------------------------*;
input ob1 ob2 freq @@;
cards;
1 1 60   1 2 30   1 3  6   1 4  1
2 1 23   2 2 43   2 3 18   2 4  2
3 1  5   3 2  4   3 3  1   3 4  9
4 1  1   4 2  1   4 3  1   4 4 11
;

*---------------------------------------------------------------*
  calculation of the measure of association
  here: kappa-coefficient (see: Cohen, 1960)
*---------------------------------------------------------------*;
proc catmod data=fds;

*---------------------------------------------------------------*
  response statement to specify series of transformations
  describing the kappa-coefficient
*---------------------------------------------------------------*;
response exp  1 -1
         log -1 -1 -1 -1  1  0,
             -1 -1 -1 -1  0  1
         exp  1 0 0 0 1 0 0 0 0 0,
              0 1 0 0 0 1 0 0 0 0,
              0 0 1 0 0 0 1 0 0 0,
              0 0 0 1 0 0 0 1 0 0,
              0 0 0 0 0 0 0 0 1 0,
              0 0 0 0 0 0 0 0 0 1
         log  1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0,
              0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0,
              0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0,
              0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1,
              1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0,
              0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0,
              0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0,
              0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1,
              1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1,
              1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1;

weight freq;

*---------------------------------------------------------------*
  degenerate 'dummy' model statement to use PROC CATMOD as
  a procedure for estimating measures of association and their
  asymptotic variances rather than for usual modeling
*---------------------------------------------------------------*;
model ob1*ob2 = (1) / nodesign noprofile covb;
run;
5. SAS-Output of the Practical Example
              Measurement of interobserver agreement

                        CATMOD PROCEDURE

    Response: OB1*OB2             Response Levels (R)=   16
    Weight Variable: FREQ         Populations     (S)=    1
    Data Set: FDS                 Total Frequency (N)=  216
                                  Observations  (Obs)=   16

                   ANALYSIS OF VARIANCE TABLE

    Source          DF     Chi-Square     Prob
    ------------------------------------------
    MODEL|MEAN       0*
    RESIDUAL         0

    NOTE: Effects marked with * contained 1 or more
          singularities (i.e., redundant parameters).

          ANALYSIS OF WEIGHTED-LEAST-SQUARES ESTIMATES

                                    Standard     Chi-
    Effect    Parameter  Estimate    Error      Square    Prob
    ----------------------------------------------------------
    MODEL         1       0.2847     0.0497      32.78   0.0000

          COVARIANCE MATRIX OF THE PARAMETER ESTIMATES

                        1
              1     0.00247259
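The figures in this output can be cross-checked independently of SAS. The Python sketch below evaluates the compounded function for the example table and then obtains the asymptotic variance by the delta method, multiplying the chain-rule factors of each logarithmic and exponential stage. It assumes the multinomial covariance (diag(p) - pp')/n and the reading of A1 as row marginals, column marginals, observed agreement and total probability; the helper names `matmul`, `matvec` and `scale_rows` are introduced here for illustration, and the code is a sketch rather than a description of how PROC CATMOD is implemented.

```python
import math


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def scale_rows(d, a):
    """Product diag(d) * a."""
    return [[dk * x for x in row] for dk, row in zip(d, a)]


k = 4
counts = [60, 30, 6, 1, 23, 43, 18, 2, 5, 4, 1, 9, 1, 1, 1, 11]
n = sum(counts)
p = [c / n for c in counts]

# Matrices of the compounded function, cells ordered row by row.
cells = [(i, j) for i in range(k) for j in range(k)]
a1 = ([[int(i == r) for i, j in cells] for r in range(k)]
      + [[int(j == c) for i, j in cells] for c in range(k)]
      + [[int(i == j) for i, j in cells], [1] * (k * k)])
a2 = ([[int(c in (i, k + i)) for c in range(2 * k + 2)] for i in range(k)]
      + [[int(c == 2 * k + s) for c in range(2 * k + 2)] for s in range(2)])
a3 = [[-1] * k + [1, 0], [-1] * k + [0, 1]]
a4 = [[1, -1]]

# Forward pass: intermediate vectors of exp(A4 ln(A3 exp(A2 ln(A1 p)))).
v1 = matvec(a1, p)
v2 = [math.exp(x) for x in matvec(a2, [math.log(x) for x in v1])]
v3 = matvec(a3, v2)
f = [math.exp(x) for x in matvec(a4, [math.log(x) for x in v3])]

# Chain rule: H = D_f A4 D_v3^-1 A3 D_v2 A2 D_v1^-1 A1 (a 1 x 16 row vector).
h = scale_rows([1 / x for x in v1], a1)
h = scale_rows(v2, matmul(a2, h))
h = scale_rows([1 / x for x in v3], matmul(a3, h))
h = scale_rows(f, matmul(a4, h))[0]

# Variance h V h' with V = (diag(p) - p p') / n.
variance = (sum(pk * hk ** 2 for pk, hk in zip(p, h))
            - sum(pk * hk for pk, hk in zip(p, h)) ** 2) / n
kappa, std_err = f[0], math.sqrt(variance)
print(round(kappa, 4), round(std_err, 4), variance)
```

Up to rounding, the three printed values should agree with the estimate, the standard error and the covariance entry shown in the listing above.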
6. Discussion
In different fields of statistical application such as the social sciences, psychology, biology, and epidemiology a huge
number of specific measures of association have been proposed. Producers of statistical software packages like
SAS fight a losing battle in trying to extend their systems to cover all measures of association proposed in
specific applications, because the variety of ways to describe the relationship between variables with regard to
some specific feature of the association seems to be unlimited. Each year some new measures of association are
added to this multitude, and there is no end to this development in sight. Whereas, in general, the calculation
of the estimator of the measure of association poses no problem, the asymptotic variance of the estimator
is not easy to obtain computationally. In this paper we have shown how to use PROC CATMOD of the
SAS/STAT* software to solve the computational problems. The advantage of this new approach lies in a
substantial reduction of the computational effort for the user. The cumbersome calculation of the asymptotic
variance is completely undertaken by the program. The only restriction of our method results from the
distributional assumption implicitly employed when using the GSK-methodology. Therefore, data from
contingency tables arising, for example, from the hypergeometric sampling model (i.e. all marginal distributions are fixed
prior to sampling) cannot be analysed in this framework. But for all situations of the multinomial sampling
model our approach provides a flexible and convenient way of estimating measures of association and their
asymptotic variances.
References
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psych. Meas. 20, 37 - 46.
Forthofer, R.N. and Koch, G.G. (1973). An analysis for compounded functions of categorical data.
Biometrics 29, 143 - 157.
Gefeller, O. and Woltering, F. (1991). A general method of estimating measures of association and their
asymptotic variances under the multinomial model using standard SAS software. Computat. Statist. Data
Analysis (submitted).
Grizzle, J.E., Starmer, C.F. and Koch, G.G. (1969). Analysis of categorical data by linear models.
Biometrics 25, 489 - 504.
Koch, G.G., Landis, J.R., Freeman, J.L., Freeman, D.H. and Lehnen, R.G. (1977). A general
methodology for the analysis of experiments with repeated measurement of categorical data. Biometrics 33,
133 - 158.
Krahl-Urban, B., Papke, H.E., Peters, K. and Schimansky, C. (1988). Forest decline. Cause-effect
research in the United States of North America and Federal Republic of Germany, Jülich.
Landis, J.R. and Koch, G.G. (1977). The measurement of observer agreement for categorical data.
Biometrics 33, 159 - 174.
Semenya, K.A., Koch, G.G., Stokes, M.E. and Forthofer, R.N. (1983). Linear models methods for
some rank functions analyses of ordinal categorical data. Commun. Statistics - Theory Meth. 12, 1277 - 1298.
Williams, O.D. and Grizzle, J.E. (1972). Analysis of contingency tables having ordered response categories. JASA 67, 55 - 63.
SAS/STAT is a registered trademark of SAS Institute Inc., Cary, NC, USA.