HOW TO USE PROC CATMOD IN ESTIMATION PROBLEMS

Olaf Gefeller (1), Franz Woltering (2)

(1) Abteilung Medizinische Statistik, Georg-August-Universität Göttingen
(2) Fachbereich Statistik, Universität Dortmund

Abstract

The paper describes a new application of the statistical analysis procedure PROC CATMOD. It demonstrates how PROC CATMOD can easily be adapted to estimation problems in contingency tables. In particular, it shows how estimators of measures of association and their asymptotic variances can be calculated with PROC CATMOD. The approach is illustrated for the kappa-coefficient, using real data from the German Forest Decline Survey.

Keywords: GSK-model, kappa-coefficient, measures of association, variance calculation

1. Introduction

PROC CATMOD, one of the statistical analysis procedures within the SAS/STAT* software, was originally designed in the context of the approach of Grizzle, Starmer & Koch (1969) (subsequently abbreviated GSK) to fit linear models to functions of response frequencies in contingency tables, as used in linear modeling, log-linear modeling, logistic regression, and repeated measurement analysis. The GSK-approach is typically applied to the analysis of contingency tables with ordered responses (Williams & Grizzle, 1972), the measurement of observer agreement (Landis & Koch, 1977), the analysis of repeated measurement experiments (Koch et al., 1977), and rank function analysis (Semenya et al., 1983). In all these applications PROC CATMOD has been successfully employed to solve the computational problems of the analysis. It is a complex but powerful tool that takes some experience to handle (and patience to struggle through more than 100 pages of documentation in the SAS/STAT* manual). In this paper we demonstrate that it can easily be adapted to estimation problems in contingency tables.
For a broad class of estimators of measures of association under the multinomial sampling model we show how to use PROC CATMOD to calculate the estimator of the measure of association and, more importantly, the asymptotic variance of the estimator. The statistical background of our method, the link between the GSK-approach and the estimation of measures of association, is not presented here (see Gefeller & Woltering, 1991). To illustrate the practical application of the procedure we use the well-known kappa-coefficient, a measure of agreement between two different ratings, together with data from the German Forest Decline Survey (Krahl-Urban et al., 1988). Further general remarks on the proposed method are given in the concluding section.

2. Methods

To adapt the GSK-approach to estimation problems in contingency tables, complex ratio statistics such as measures of association have to be written as special functions of the probability estimates of the underlying product multinomial model. To generate estimators of this type, compounded functions involving only linear, logarithmic, and exponential transformations (Forthofer & Koch, 1973) of the general form

\[ F(p) = \ldots A_5 \left[ \exp\left( A_4 \left[ \ln\left( A_3 \left[ \exp\left( A_2 \left[ \ln\left( A_1 p \right) \right] \right) \right] \right) \right] \right) \right], \]

where $A_i$ denotes a matrix of constants, are employed. This general framework makes it possible to compute complex estimators in which probabilities from different subpopulations are combined. In the special situation of a single multinomial population obtained by unrestricted sampling of elements, measures of association can be expressed in the same way as illustrated above. The advantage in this situation is that standard software for GSK-models, such as PROC CATMOD, can be used to calculate the estimators and their asymptotic variances.
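
To make the compounded form concrete, the following Python sketch (standard library only) evaluates such a function step by step and propagates the multinomial covariance matrix $(D_p - pp')/n$ through the chain rule, which is the kind of calculation a GSK program carries out internally. It is an illustration under our own assumptions, not the code of PROC CATMOD: the function name `compounded`, the list-based encoding of the operations, and the 2 x 2 counts in the example are all hypothetical. The example uses the odds ratio, written as $\exp(A_2 [\ln(A_1 p)])$ with $A_1$ the identity matrix and $A_2 = (1, -1, -1, 1)$, because its asymptotic variance has a well-known closed form to compare against.

```python
import math

def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix-matrix product for matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def scale_rows(d, A):
    """Return diag(d) * A, i.e. multiply row i of A by d[i]."""
    return [[di * a for a in row] for di, row in zip(d, A)]

def compounded(p, n, ops):
    """Evaluate a compounded function of p and its delta-method covariance.

    ops is applied from first to last; each entry is either a matrix
    (linear step) or one of the strings "log" / "exp" (hypothetical encoding).
    """
    k = len(p)
    # covariance of the sample proportions under multinomial sampling
    V = [[((p[i] if i == j else 0.0) - p[i] * p[j]) / n for j in range(k)]
         for i in range(k)]
    x = list(p)
    H = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for op in ops:
        if op == "log":            # derivative of ln(x) is diag(1/x)
            H = scale_rows([1.0 / v for v in x], H)
            x = [math.log(v) for v in x]
        elif op == "exp":          # derivative of exp(x) is diag(exp(x))
            x = [math.exp(v) for v in x]
            H = scale_rows(x, H)
        else:                      # linear step: multiply by the matrix
            x = matvec(op, x)
            H = matmul(op, H)
    Ht = [list(col) for col in zip(*H)]
    return x, matmul(matmul(H, V), Ht)

# Hypothetical 2 x 2 table, cells ordered (n11, n12, n21, n22)
counts = [10, 20, 30, 40]
n = sum(counts)
p = [c / n for c in counts]
A1 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A2 = [[1.0, -1.0, -1.0, 1.0]]

F, VF = compounded(p, n, [A1, "log", A2, "exp"])
closed_form = F[0] ** 2 * sum(1.0 / c for c in counts)
print("odds ratio:", F[0])
print("delta-method variance:", VF[0][0])
print("closed-form variance: ", closed_form)
```

The two printed variances should agree up to rounding. Encoding the operations as a list read from first to last is a design choice of this sketch; a RESPONSE statement in PROC CATMOD lists the same operations in the opposite order, with the first transformation written last.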
The device consists of fitting the simple linear model $F(p) = X\beta$, where $F(p)$ is the 1-dimensional response function as specified above, $X$ denotes the degenerate 1 x 1 matrix consisting of the constant '1', and $\beta$ represents the 1-dimensional parameter. Then the estimator $b$ of $\beta$ equals the response function $F(p)$ itself, and the variance of $b$ is given by $V_b = V_F$. This analysis is possible in PROC CATMOD by directly specifying the design matrix on the MODEL statement and by using the RESPONSE statement to describe a series of transformations of the probability estimates that produces $F(p)$, the function of interest.

At first glance this procedure seems to increase the computational effort by introducing the tedious work of constructing a series of transformations to describe the measure of association. In fact, the most annoying part of the computational task, the calculation of the asymptotic variance through computation of the first derivative matrix and of additional matrix products, is carried out entirely by the computer program. The computational effort for the user is thus reduced substantially.

3. Practical Example

To illustrate the practical application of the method, the well-known kappa-coefficient (Cohen, 1960) is used. Cohen's kappa is a popular measure of agreement between two different ratings. It is defined for square K x K contingency tables. Using the usual row-column parametrization of cell probabilities in contingency tables, denoted by $\pi_{ij}$, $i, j = 1, \ldots, K$, the kappa-coefficient is defined as

\[ \kappa := \frac{\sum_{i=1}^{K} \pi_{ii} - \sum_{i=1}^{K} \pi_{i.}\pi_{.i}}{1 - \sum_{i=1}^{K} \pi_{i.}\pi_{.i}} . \]

The term $\sum_{i=1}^{K} \pi_{i.}\pi_{.i}$ represents the expected value of agreement under the hypothesis of independent ratings.

Procedures for the estimation of the kappa-coefficient and its asymptotic variance are not available in standard statistical software packages.
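
As an independent check of the definition above, kappa and a delta-method variance can also be computed directly from a table of counts. The Python sketch below (standard library only) is not part of the method proposed in this paper. The function name `kappa_with_se` is hypothetical, and the gradient $\partial\kappa/\partial\pi_{ij} = \delta_{ij}/(1-p_e) - (1-p_o)(\pi_{.i} + \pi_{j.})/(1-p_e)^2$, with $p_o = \sum_i \pi_{ii}$ and $p_e = \sum_i \pi_{i.}\pi_{.i}$, is our own derivation from the formula for $\kappa$. The counts are those of the 4 x 4 Forest Decline Survey table used as the numerical example in this section.

```python
import math

def kappa_with_se(table):
    """Cohen's kappa with delta-method variance and standard error.

    table is a square list of rows of counts (rows: first rating,
    columns: second rating); a single multinomial sample is assumed.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    p = [[cell / n for cell in row] for row in table]
    r = [sum(row) for row in p]                                   # pi_i.
    c = [sum(p[i][j] for i in range(k)) for j in range(k)]        # pi_.j
    po = sum(p[i][i] for i in range(k))                           # observed agreement
    pe = sum(r[i] * c[i] for i in range(k))                       # chance agreement
    kappa = (po - pe) / (1.0 - pe)
    # gradient of kappa with respect to each cell probability (our derivation)
    g = [[(1.0 if i == j else 0.0) / (1.0 - pe)
          - (1.0 - po) * (c[i] + r[j]) / (1.0 - pe) ** 2
          for j in range(k)] for i in range(k)]
    # g' (diag(p) - p p') g / n, written out as sums over the cells
    first = sum(p[i][j] * g[i][j] for i in range(k) for j in range(k))
    second = sum(p[i][j] * g[i][j] ** 2 for i in range(k) for j in range(k))
    var = (second - first ** 2) / n
    return kappa, var, math.sqrt(var)

# 'Loss of needles' ratings by two observers (Krahl-Urban et al., 1988)
needles = [
    [60, 30,  6,  1],
    [23, 43, 18,  2],
    [ 5,  4,  1,  9],
    [ 1,  1,  1, 11],
]

kappa, var, se = kappa_with_se(needles)
print("kappa   =", round(kappa, 4))
print("var     =", round(var, 8))
print("std.err =", round(se, 4))
```

For these counts the sketch gives a kappa of about 0.2847 with a standard error of about 0.0497, in line with the PROC CATMOD output in Section 5.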
To use the SAS/STAT* procedure CATMOD to do the calculations in the way outlined above, the following steps have to be applied:

(1) specify $\kappa$ as a function of a suitable vector of probabilities $\pi_{ij}$
(2) transform $\kappa$ to a compounded function involving only linear, logarithmic and exponential operations
(3) set up the 'dummy' MODEL statement consisting of the constant '1' as the design matrix
(4) set up the RESPONSE statement using the transformation constructed in (2)
(5) run PROC CATMOD and look for the 'Analysis of weighted-least-squares estimates' table in the output, where the estimated value of $\kappa$ and its asymptotic standard error appear (in addition, the estimated asymptotic variance of $\kappa$ can be obtained directly by specifying the COVB option on the MODEL statement)

As a numerical example we use data from the Forest Decline Survey (Krahl-Urban et al., 1988). The data are based on the variable 'loss of needles' (in percent), which has been categorized independently by two observers into four groups according to the severity of damage. They are presented in the following 4 x 4 contingency table (rows: first observer, columns: second observer, as in the SAS program of Section 4; marginal totals in the last row and column):

                 1     2     3     4   Total
        1       60    30     6     1     97
        2       23    43    18     2     86
        3        5     4     1     9     19
        4        1     1     1    11     14
    Total       89    78    26    23    216

Step 1 only requires the definition that $\pi$ is built up from the row-column cell probabilities $\pi_{ij}$, $i, j = 1, \ldots, 4$, as follows:

\[ \pi := (\pi_{11}, \ldots, \pi_{14}, \pi_{21}, \ldots, \pi_{24}, \pi_{31}, \ldots, \pi_{34}, \pi_{41}, \ldots, \pi_{44})' \in \mathbb{R}^{16} . \]

Step 2 needs a little more work. The representation of $\kappa$ as a compounded function described in (2) is of the following form:

\[ \kappa = \exp\left( A_4 \left[ \ln\left( A_3 \left[ \exp\left( A_2 \left[ \ln\left( A_1 \pi \right) \right] \right) \right] \right) \right] \right), \]

where

\[ A_1 = \left( \begin{array}{*{16}{c}}
1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1
\end{array} \right), \]

\[ A_2 = \left( \begin{array}{*{10}{c}}
1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1
\end{array} \right), \]

\[ A_3 = \left( \begin{array}{*{6}{c}}
-1 & -1 & -1 & -1 & 1 & 0 \\
-1 & -1 & -1 & -1 & 0 & 1
\end{array} \right), \qquad A_4 = \left( \begin{array}{cc} 1 & -1 \end{array} \right) . \]

Here $A_1 \pi$ collects the four row marginals $\pi_{i.}$, the four column marginals $\pi_{.i}$, the sum of the diagonal probabilities and the constant 1; $A_2$ forms the logarithms of the products $\pi_{i.}\pi_{.i}$; $A_3$ yields the numerator and the denominator of $\kappa$; and $A_4$ forms their ratio on the logarithmic scale. The other steps (3) - (5) can be seen in the listing of the SAS program and output in the following sections.

4. SAS-Program of the Practical Example

    *---------------------------------------------------------------*
     | Kappa statistic for interobserver agreement                   |
     |                                                               |
     | 4 response categories of the variable 'loss of needles':      |
     |    1 = less than 5%                                           |
     |    2 = 5% - 15%                                               |
     |    3 = 15% - 25%                                              |
     |    4 = more than 25%                                          |
     |                                                               |
     | 2 independent observers                                       |
     |                                                               |
     | Data source: Krahl-Urban et al. (1988)                        |
     *---------------------------------------------------------------*;

    title 'Measurement of interobserver agreement';

    data fds;
    *---------------------------------------------------------------*
     | input of the 4*4 contingency table                            |
     *---------------------------------------------------------------*;
       input ob1 ob2 freq @@;
       cards;
    1 1 60   1 2 30   1 3  6   1 4  1
    2 1 23   2 2 43   2 3 18   2 4  2
    3 1  5   3 2  4   3 3  1   3 4  9
    4 1  1   4 2  1   4 3  1   4 4 11
    ;

    *---------------------------------------------------------------*
     | calculation of the measure of association                     |
     | here: kappa-coefficient (see: Cohen, 1960)                    |
     *---------------------------------------------------------------*;

    proc catmod data=fds;

    *---------------------------------------------------------------*
     | response statement to specify series of transformations       |
     | describing the kappa-coefficient                               |
     *---------------------------------------------------------------*;

       response exp  1 -1
                log -1 -1 -1 -1  1  0,
                    -1 -1 -1 -1  0  1
                exp  1 0 0 0 1 0 0 0 0 0,
                     0 1 0 0 0 1 0 0 0 0,
                     0 0 1 0 0 0 1 0 0 0,
                     0 0 0 1 0 0 0 1 0 0,
                     0 0 0 0 0 0 0 0 1 0,
                     0 0 0 0 0 0 0 0 0 1
                log  1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0,
                     0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0,
                     0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0,
                     0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1,
                     1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0,
                     0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0,
                     0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0,
                     0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1,
                     1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1,
                     1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1;

       weight freq;

    *---------------------------------------------------------------*
     | degenerated 'dummy' model statement to use PROC CATMOD as a   |
     | procedure for estimating measures of association and their    |
     | asymptotic variances rather than for usual modeling           |
     *---------------------------------------------------------------*;

       model ob1 * ob2 = (1) / nodesign noprofile covb;
    run;

5. SAS-Output of the Practical Example

                  Measurement of interobserver agreement

                            CATMOD PROCEDURE

      Response: OB1*OB2              Response Levels (R)=    16
      Weight variable: FREQ          Populations     (S)=     1
      Data Set: FDS                  Total Frequency (N)=   216
                                     Observations  (Obs)=    16

                       ANALYSIS OF VARIANCE TABLE

         Source             DF     Chi-Square     Prob
         -----------------------------------------------
         MODEL|MEAN          0*
         RESIDUAL            0

      NOTE: Effects marked with * contained 1 or more
            singularities (i.e., redundant parameters).

              ANALYSIS OF WEIGHTED-LEAST-SQUARES ESTIMATES

                                         Standard     Chi-
      Effect     Parameter    Estimate      Error    Square     Prob
      ----------------------------------------------------------------
      MODEL          1          0.2847     0.0497     32.78   0.0000

             COVARIANCE MATRIX OF THE PARAMETER ESTIMATES

                                     1
                          1     0.00247259

6. Discussion

In different fields of statistical application such as the social sciences, psychology, biology, and epidemiology a huge number of specific measures of association have been proposed. Producers of statistical software packages like SAS fight a losing battle in trying to extend their systems to cover all measures of association proposed in specific applications, because the variety of ways to describe the relationship between variables with regard to some specific feature of the association seems to be unlimited.
Each year some new measures of association are added to this multitude, and there is no end to this development in sight. Whereas, in general, the calculation of the estimator of the measure of association poses no problem, the asymptotic variance of the estimator is not easy to obtain computationally. In this paper we have shown how to use PROC CATMOD of the SAS/STAT* software to solve the computational problems. The advantage of this new approach lies in a substantial reduction of the computational effort for the user. The cumbersome calculation of the asymptotic variance is carried out entirely by the program.

The only restriction of our method results from the distributional assumption implicitly made when using the GSK-methodology. Therefore, for example, data of contingency tables arising from the hypergeometric sampling model (i.e. all marginal distributions are fixed prior to sampling) cannot be analysed in this framework. For all situations of the multinomial sampling model, however, our approach provides a flexible and convenient way of estimating measures of association and their asymptotic variances.

References

Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psych. Meas. 20, 37 - 46.

Forthofer, R.N. and Koch, G.G. (1973). An analysis for compounded functions of categorical data. Biometrics 29, 143 - 157.

Gefeller, O. and Woltering, F. (1991). A general method of estimating measures of association and their asymptotic variances under the multinomial model using standard SAS software. Computat. Statist. Data Analysis (submitted).

Grizzle, J.E., Starmer, C.F. and Koch, G.G. (1969). Analysis of categorical data by linear models. Biometrics 25, 489 - 504.

Koch, G.G., Landis, J.R., Freeman, J.L., Freeman, D.H. and Lehnen, R.G. (1977). A general methodology for the analysis of experiments with repeated measurement of categorical data. Biometrics 33, 133 - 158.

Krahl-Urban, B., Papke, H.E., Peters, K. and Schimansky, C. (1988). Forest decline. Cause-effect research in the United States of North America and Federal Republic of Germany, Jülich.

Landis, J.R. and Koch, G.G. (1977). The measurement of observer agreement for categorical data. Biometrics 33, 159 - 174.

Semenya, K.A., Koch, G.G., Stokes, M.E. and Forthofer, R.N. (1983). Linear models methods for some rank functions analyses of ordinal categorical data. Commun. Statistics - Theory Meth. 12, 1277 - 1298.

Williams, O.D. and Grizzle, J.E. (1972). Analysis of contingency tables having ordered response categories. JASA 67, 55 - 63.

* SAS/STAT is a registered trademark of SAS Institute Inc., Cary, NC, USA.