Three Threats to Validity of Choice-Based and Matched Sample Studies in Accounting Research*

Donald P. Cram*, Vijay Karan**, and Iris Stuart***

November 25, 2007

*Corresponding author. Los Angeles, California, USA. [email protected].
**Department of Accounting, College of Business and Economics, California State University, Fullerton, California, USA.
***Department of Accounting, Auditing and Law, Norwegian School of Economics and Business Administration, Bergen, Norway.

Abstract

We consider three technical errors in the statistical analysis of choice-based and matched sample studies in accounting research. These problems constitute threats to both internal and external validity of the research. First, we note that researchers have often failed to control for the effects of matching variables used in sample selection. Commonly, researchers believe that the selection of a matched sample already controls for the matching variables, and hence controlling for them in analyses would not be necessary, but in fact it is. Typically an unconditional analysis is performed, rather than the conditional one that is justified. Thus, failure to account for industry, size, and other matching variables may have driven incorrect findings in many research studies, or may have suppressed results waiting to be revealed. Second, where matching is by “closest” size or other continuous measure, the matching is imperfect, and there remains the possibility that case vs. control differences in this matching variable could be the cause of differences in outcome, so researchers must evaluate that possibility and perhaps control for it.
Third, the disproportionate sampling for different population strata that is implicit in choice-based and matched sample selection would usually necessitate weighting data in statistical analyses by the sampling rate in each stratum, but reweighting or other appropriate adjustment to the analysis is often not implemented. A “logit exemption” to the need for reweighting has been noted in the literature, but has been used in settings where it does not apply. We provide a simulation example to demonstrate the problems, and provide suggestions for more precise ways to analyze choice-based and matched samples.

Keywords: Choice-based, matched samples, research designs, research methodology

1. Introduction

There is a long history in accounting research of employing a research design that involves matched samples or choice-based samples. These approaches are typically employed to limit data collection costs (e.g. Geiger and Rama 2003, Heninger 2001) or to deal with nonlinearities that are not well specified (e.g. Kothari, Leone, and Wasley 2005). For example, Bartov, Gul, and Tsui (2001) identified 173 Compustat firms with qualified opinions and fully matched the firms by year, 2-digit SIC, and Big 6 or non-Big-6 auditor to a set of control firms. The authors used logit regression to identify determinants of audit opinion qualifications. This is just one of many examples.1 However, we find numerous examples of incorrect statistical analysis of matched samples and choice-based samples in accounting research going back over many years and spread across all accounting journals. These incorrect methods of analysis have gained wide acceptance within accounting research in spite of misgivings on the part of some researchers (see Maddala 1996, Smith 2003). These technical errors in statistical analysis can cause the researcher to either reject the null hypothesis when the null is true or fail to reject the null hypothesis when the null is false.
1 Early examples of choice-based matched sample studies include Beaver (1966), Altman (1968) and Deakin (1972). In excess of 300 papers using such designs in financial distress prediction have appeared since then, and the approaches are used widely in other accounting research areas as well. We identified 73 such papers in the area of audit research alone from 1990 to 2003. This audit research paper listing, with our assessment of probable Errors 1, 2, and 3 occurrences, is available upon request.

The purpose of this study is to describe three technical errors that can occur due to incorrect analysis of matched samples or choice-based samples, show how they can affect statistical inferences, and recommend simple corrections.

This paper makes several contributions. First, we identify six research designs that we discern among choice-based and matched sample studies in accounting research. These categories are defined by the manner in which the researcher chooses the sample of firms to study, the “treatment” group, and then selects the comparison sample, the “control” group. Clearly defining the six categories enables us to identify and explain the correct analysis that should be used with each design.

Second, we provide guidance for correct analysis and use of univariate, ordinary least squares regression, and logit models with choice-based and matched samples. We prove that, while maximum likelihood estimation generally requires reweighting in a choice-based or matched sample design, there exists an exception to such reweighting in the special case of pair-matched logit regression. We also provide a proof that, for a pair-matched logit regression, asymptotically correct estimates can be derived by either (a) fully saturating the regression model, that is, using a dummy variable for each pairing, or (b) using a no-intercept logit regression upon pair-wise differences.
Third, we describe how incorrect methods of analysis of matched samples or choice-based samples have gained common acceptance over many years. We identify three technical errors found in numerous studies in accounting research and document their frequency in auditing research during the years 1980 to 2003: Error 1, use of unconditional analysis when analysis conditional upon effects of matching variables is needed; Error 2, failure to control for the effect of imperfectly matched variables; and Error 3, failure to reweight observations according to differing sampling rates.

Fourth, we demonstrate with simulated data how incorrect analysis, that is, analysis which fails to recognize that matched samples and/or choice-based samples are not random samples, can lead to incorrect inferences. The simulations tangibly demonstrate that incorrect analysis may (a) fail to detect significant true effects (Type II error), (b) find false significant effects (Type I error), and (c) find significant results that are opposite in sign to the true effects.

Last, we demonstrate with a replication and reanalysis of a published paper, Ghicas (1990), differences in the results obtained by correct and incorrect analysis of a matched sample. The correct analysis, unlike the original analysis in the paper, finds support for a key hypothesis and helps resolve an anomalous result the author had remarked upon.

The rest of the paper is organized as follows. In Section 2 we discuss basic terms, describe six research design categories in matched samples or choice-based samples, and discuss the three technical errors due to the use of incorrect statistical analysis. In Section 3, we demonstrate with the use of simulations the effects of Errors 1 and 2 on reported results. Section 4 includes a replication of a matched sample study that shows changes in results. In Section 5, we describe the correct statistical analysis to apply within each of six distinct research categories.
Finally, we summarize and conclude.

2. Research Designs and Three Errors

A choice-based sample is a non-random sample where cases having one outcome (e.g., firms “choosing” to file for bankruptcy or firms receiving a qualified audit opinion) are identified, and then comparison samples of control observations are selected from available data having different outcomes. The analysis then uses the outcome as the dependent variable to be explained by other variables. Choice-based sampling is particularly useful when data collection is costly and one category of the outcome to be explained is rare, so random sampling from the population would not yield very many observations of the rare type unless very costly large samples were collected (Zmijewski 1984).

Matched sample research is another form of non-random sampling that is intuitively appealing and widely used in accounting research. Matched samples are those having each member matched with a corresponding member or members in the other sample or samples, with matching by characteristics not of immediate research interest. One type of matched sample research is a “within-subject” study, often used in behavioral and experimental settings in psychology and medical research, that compares repeated measures for each subject taken before and after alternative treatments. In accounting, a within-subject matched pair might be two firm-year observations, for the same firm, before and after an event of interest. Examples are studies of audit fee level changes (e.g. Iyer and Iyer 1996 and Maher et al. 1992). In accounting usage these studies are known as “changes” studies; they are correctly analyzed by comparing pair-wise differences in outcome to pair-wise differences in explanatory variables, an approach that explicitly takes into account their pair-wise matching.2

2 In accounting research terminology, “changes” studies are contrasted with “levels” studies. “Changes” studies can be less prone to omitted variables problems. An example “levels” study would be to analyze the 4,000 or so firm-year observations available in Compustat for one year, to explain audit fee levels as a function of various firm characteristics. A study of the somewhat fewer firms for which data is available in each of two years can be analyzed as a “changes” study: one regresses the firm-specific difference in audit fee upon the year-to-year differences in the various characteristics. In the latter case, one has pair-matched the firm at year 1 with the same-named firm at year 2. Assuming that firms do not change from year to year in the unmeasured firm-specific characteristics, the firm-specific characteristics are successfully “differenced out” and should have no bearing on the “changes” analysis.

A second type of matched sample study is a between-subjects study. For example, Mack et al. (1976) studied women in a residential retirement community population to measure potential risk factors for a type of cancer. For each woman diagnosed with the cancer a matched set of four women of the same age, marital status, and similar entry date into the retirement community was chosen, then detailed personal and family histories were painstakingly collected, coded, and analyzed. Differences in discrete outcome (e.g. cancer detected or not) were to be explained by differences in measured factors of interest (e.g. prior exposure to various drugs, use of hormonal treatments), with additional factors, say for presence of cancer in family history, also measured and controlled for by inclusion in the model. In accounting, matched pairs might be firm-year observations of two different firms, chosen so that the firms match in some characteristics such as industry code and asset size.
A matched sample may or may not also be a choice-based sample; to be both, the match selection must focus on drawing comparison sets of subjects that have opposite outcomes of the variable that is to be explained in the analysis but that are similar on matching variables. A sample of litigated firms paired to non-litigated firms, with matching by industry and closest size, and to be analyzed by a logit model explaining litigation, is both (e.g. Lys and Watts 1994). A matched sample may reflect opposite “choices” taken, but its analysis is not in the form of a choice-based statistical analysis if the choice is not the dependent variable in the analysis. For example, Wallace (1997) collected pairs of firms matched by industry and size, but which made opposite decisions on whether or not to adopt residual income-based compensation plans. In what we term a non-choice-based analysis, he ran OLS regressions of financial performance upon a decision indicator and other variables.

We use the term “fully-matched” samples to distinguish situations in which each stratum or case-control comparison subset is unique, and “semi-matched” samples for those which have strata or pairings of cases and controls that are nominally but not meaningfully unique. For an example of semi-matching, Heninger (2001) obtained 67 cases of firms whose auditors were sued, and identified firm-year control observations by matching on year and industry and then randomly selecting one from the available candidates. There were multiple occurrences of auditor litigation in some industries, so the nominally 1-1 matched pairs in those industries can be combined into fewer than 67 matched sets in analysis, and we classify the sample as semi-matched. If instead, at the last step in matching, he had chosen the unique firm closest in size, or if all 67 cases were in different industries, then each pairing would be distinctly defined and we would deem his sample to be fully-matched.
Different approaches to analysis are available for semi-matched samples: for example, in a regression analysis one might include an intercept dummy variable for each nominally matched pair, or include only one for each meaningfully distinct matched set.

Choice-based and matched samples are not random samples. Therefore, it is necessary to perform statistical analysis upon them differently than would be appropriate for random samples, in order to generate results that should generalize to the larger populations from which they are selected. Campbell and Stanley (1966) categorize internal and external threats to validity of research designs. They define internal validity as “the basic minimum without which any experiment is uninterpretable”. External validity asks the question of generalizability: To what populations, settings, treatment variables, and measurement variables can an observed effect be generalized? We identify three ways in which accounting researchers have sometimes failed to account for the non-randomness of choice-based and matched sample selection in their analyses of such samples.

Three common errors in accounting research using matched and choice-based samples

In Figure 1 we identify six categories of choice-based and matched sample designs that we discern in accounting research, for which we determine that varying guidance is in fact needed. We map these out in a Venn diagram, where large overlapping ovals indicate choice-based samples and matched samples. Some studies are choice-based but non-matched (indicated as CB-NM). These, like Palepu (1986), have sample selection based on an observed outcome variable that is to be explained, but selection within each outcome category is random. Within matched samples, it is necessary to differentiate between semi-matched and fully-matched samples.
So, choice-based papers using matching may be fully-matched (CB-FM, having unique pairings), or may be semi-matched (CB-SM, having some groups larger than pairs). Within matched samples that are not choice-based, there is the same fully-matched vs. semi-matched distinction, defining NCB-FM and NCB-SM types.3 Finally, within NCB-FM we must distinguish between within-subjects and between-subjects studies (denoted NCB-FM-W and NCB-FM-B). An example study in each of these six design categories is listed below the Venn diagram, as well as the count of how many of each category we found in our review of audit research studies.

3 An NCB-SM example is Krishnan (2003), who created a nonrandom sample by identifying 15,342 firm-year observations in Big 6 audited firms, selecting only those for which corresponding non-Big 6 control observations in the same 2-digit SIC and cash flow deciles were available, and then adding those 3316 corresponding non-Big 6 observations.

Accounting researchers argue convincingly the importance of controlling for the industry, size, and other variables that they use in match selection (often citing prior research), but then fail to include the variables in their analysis, as might be done by including an intercept dummy variable for each matched set. It is well accepted that the likelihood or levels of bankruptcy, litigation, audit fees, and other common dependent variables will vary by industry and firm size; in other words, knowing the industry and firm size provides conditional information for inferring the outcome. The accounting ratios and other independent variables commonly of interest also vary systematically by industry and firm size. In an industry- and size-matched sample, therefore, the vector of intercept dummies that would control for industry and size is correlated with both the outcome variable and the explanatory variables.
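The point that matched-set intercepts can be correlated with both the outcome and the explanatory variables can be illustrated with a small simulation. The sketch below is not the authors' simulation; the data-generating process, the parameter values (`beta`, `gamma`), and all function names are assumptions chosen only for illustration. Each hypothetical pair shares an unobserved pair-level effect `u` (standing in for industry and size) that shifts both the regressor and the outcome. A pooled regression that ignores the pairing is compared with a no-intercept regression on within-pair differences, which removes the pair-level effect.

```python
import random

def ols_slope(xs, ys, intercept=True):
    """Least-squares slope for a single regressor, with or without an intercept."""
    mx = sum(xs) / len(xs) if intercept else 0.0
    my = sum(ys) / len(ys) if intercept else 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate_pairs(n_pairs=2000, beta=1.0, gamma=-4.0, seed=0):
    """Hypothetical matched pairs: u is a shared pair effect (assumed, for illustration)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        u = rng.gauss(0, 1)                      # pair-level effect, e.g. industry/size
        members = []
        for _ in range(2):
            x = u + rng.gauss(0, 1)              # regressor correlated with u
            y = beta * x + gamma * u + rng.gauss(0, 1)
            members.append((x, y))
        pairs.append(members)
    return pairs

pairs = simulate_pairs()

# Unconditional (pooled) analysis: ignores the pairing entirely.
pooled_x = [x for members in pairs for x, _ in members]
pooled_y = [y for members in pairs for _, y in members]
pooled_slope = ols_slope(pooled_x, pooled_y)

# Conditional analysis: regress within-pair differences, no intercept.
dx = [m[0][0] - m[1][0] for m in pairs]
dy = [m[0][1] - m[1][1] for m in pairs]
diff_slope = ols_slope(dx, dy, intercept=False)

print(f"true slope 1.0 | pooled {pooled_slope:.2f} | differenced {diff_slope:.2f}")
```

Under these assumed parameters the pooled slope is pulled well away from the true value of 1.0 (here it even changes sign), while the differenced estimate stays close to it. Different choices of `gamma` would produce bias in the other direction or attenuation instead.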
In ordinary least squares regression analysis, omission of such a correlated variable causes an omitted correlated variables problem, rendering coefficient estimates biased and inconsistent. In logit and probit regressions, it is arguably worse: coefficients on included variables will be estimated inconsistently even if the omitted variables are uncorrelated with the included variables.4

4 We thank an anonymous reviewer for clarifying this point regarding nonlinear analyses.

Researchers have believed that selecting the control sample using matching will by itself ensure that results have been controlled for the effects of the matching variables, and that the matching variables therefore need not be included in the statistical analysis. However, as has been prominently noted in guidance provided in the biomedical field, “the matching process requires that the data …be analyzed with the matching taken into account” (Breslow and Day 1980, p. 32). To account for the matching within any regression model estimated in a matched sample, one must fully saturate the model, i.e., one must either include a dummy variable for each stratum, or perform the analysis on differenced data within each stratum. For fully-matched paired designs analyzed by Ordinary Least Squares (OLS) regression, that means adding a dummy variable for each pairing, or, equivalently, performing analysis on differenced data, i.e., by regression of the pair-wise difference in outcome upon pair-wise differences in independent variables. For matched designs to be analyzed by logit regression, when the outcome is a 0-1 categorical variable, accounting for the matching is similar, although there is a complication, to be explained, that motivates use of specialized software for correct implementation.5 In accounting research, however, estimated models commonly omit the use of the match selection information, effectively omitting the effect of matching variables on the dependent variable.6 We term this omission Error 1.

5 See Appendix I for development.

6 In our audit research review, we find 67 papers that employed matching, with 55 of those failing to control for it in their analyses.

Error 1: Use of Unconditional Analysis, when Analysis Conditional upon Effects of Matching Variables is Needed

Introductory guidance for non-random sample research designs is found in introductory statistics textbooks such as Johnson and Bhattacharyya (1985) and Rice (1995). These introduce the idea that for univariate comparisons in experimental settings where there is a natural pairing in the data, a matched sample t-test (a one sample test) is more powerful than an unmatched (two sample) t-test in detecting a mean difference in a given measure. Researchers may have misperceived that either is appropriate; we must note that an Error 1 has occurred, however, when a matched pair t-test is required but a two sample, unmatched t-test is performed instead.

The first technical problem (Error 1), that is, the use of unmatched analysis for matched samples, we observe first in the seminal papers on bankruptcy prediction by Beaver (1966 and 1968)7 and Altman (1968). The use of unmatched analyses has persisted in the literature ever since. Deakin (1972, p. 172), citing statistician Tatsuoka (1971), noted that Altman (1968)’s discriminant analysis of a pair-matched sample would have required “more complex procedures” to be correct, but only one subsequent paper in accounting research implemented those procedures.8 All other discriminant analyses of matched pair samples in accounting and finance appear to have been performed incorrectly in that they failed to take into account the pairings. Maddala (1996), discussed below, entirely dismisses the bankruptcy prediction studies that used matched samples. We develop here that choice-based and matched sample studies, analyzed correctly in ways we describe fully, can usefully find conditional effects.

7 Beaver (1966) applied both matched and unmatched analysis; he stated a preference for the unmatched analysis as the output seemed to provide unconditional probabilities (that would be justified in a random sample) and he reported only unmatched analysis in his 1968 paper.

8 A well-executed study by Harrison (1977) is the only matched pair discriminant analysis which we can identify that addresses the concerns to which Deakin alluded. It is a CB-FM study employing pair-matching of firms having an accounting method change to controls that did not change accounting method, with matching by industry, year, and risk measured by beta. Harrison also references Tatsuoka (1971) in his application of a Hotelling T² test (a multivariate generalization of a matched-pair t-test) to determine whether the single sample of pair-wise differences in market returns differed from zero. This avoided Error 1 by use of a one sample (differences) test. Essentially he examined whether pair-wise differences were significantly different from zero. His work is subject to Error 2, however, in that his results could have been driven by residual differences in his matching variables, although he did perform one sensitivity analysis to attempt to address that. Further, his work is subject to Error 3, as discriminant analysis does not benefit from the logit exemption. In fact it may not be possible to avoid Error 3 in discriminant analysis of CB-FM studies: reweighting cannot be applied to each datum according to its stratum's sampling rate when data have been differenced. Tatsuoka (1971) provides guidance on applying prior probability information in some statistical settings, but not specifically in the matched pair example he gives (and Tatsuoka’s example itself suffers from Error 2 and Error 3).

Schlesselman (1982) comments on unmatched analysis of matched data that “If cases and controls have been matched on a variable that is associated with the study exposure, then an analysis that does not account for the matching will result in an estimate of the odds ratio that is biased toward unity” (p. 272). Our simulations show that the impact of Error 1 is worse: bias can go in any direction.

While awareness within accounting research of problems relating to unconditional analysis is unfocused, Error 1 has been commented upon in the literature on discrimination in mortgage lending. Giles and Courchane (2000) note that

“inconsistency of estimates appears to have been missed… Several authors recognize that stratifying [use of matching in sample selection] will affect the estimation of the constant term, but then fail to realize that the inclusion of a racial group dummy variable results in separate stratum constraints. In particular, reference is made to the discussion in Maddala (1983, pp. 90-91) and Maddala (1991, pp. 792-793), which relate to stratifying by outcome only; we need to extend the results when we also stratify by a dummy variable covariate.” (pp. 8-9)

Dietrich (2001), examining the impact of that misanalysis in mortgage lending, has reported the striking conclusion that 6 of 23 studies, when replicated and reanalysed appropriately, have their results reversed.

We observe that the textbook examples of matched sample studies differ in crucial ways from the non-random observational studies common in accounting research in which matched sample t-tests are often applied. First, in accounting research matching is often by closest size or other measure, and Error 2 is possible, that is, a difference in outcome may be driven by the residual difference in the matching variable. To address this, as we explain in Section 5, an OLS regression generalization of the t-test is required.

Error 2: Failure to Control for Effect of Imperfectly Matched Variables

A second technical error (Error 2) in the analysis of matched or choice-based samples occurs when the matchings are not exact. This error stems from imperfection in the matching process, e.g. from selecting pair-wise controls that are “closest” rather than exact matches on a continuous variable such as firm size. Where closest matching is used, the researcher should consider explicitly the possibility that the remaining pair-wise differences in the matching variable may themselves be sufficient to explain observed patterns of outcomes, and attempt to control for this possibility. For example, the researcher can try including linear and/or quadratic terms in the size variable in the model. Frequently accounting researchers have failed to control for this problem in estimation when closest matching is used.9

The second technical problem (Error 2) has been noted at times in the accounting literature. Some authors have included in their models the continuous variables, such as size, which they used in matching, explicitly to control for the residual effect of pair-wise imperfect matching in those variables (e.g. Lys and Watts (1994) and Carcello and Neal (2003)), while others have not. Of 37 audit papers needing an evaluation of the impact of imperfect matching, 27 did not provide it, so discerning consumers of the research would be left with uncertainty as to the likely impact of this omitted correlated variable upon reported results.
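How a residual mismatch in a “closest” matching variable can masquerade as a case effect may be sketched with a small simulation. This is an illustrative sketch only, not the authors' analysis: the assumed setup has no true case-control effect at all, cases that are on average slightly larger than their closest-size controls, and an outcome that depends on size. All names and parameter values are hypothetical. The naive matched-pair comparison (the mean pair-wise difference) is contrasted with a regression of the pair-wise outcome difference on the pair-wise size difference, whose intercept plays the role of the case effect after controlling for the residual mismatch.

```python
import random

def fit_line(xs, ys):
    """Return (intercept, slope) of a simple least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

rng = random.Random(1)
TRUE_CASE_EFFECT = 0.0   # assumption: being a "case" has no effect on the outcome
SIZE_EFFECT = 2.0        # assumption: outcome rises with firm size

d_size, d_outcome = [], []
for _ in range(2000):
    # Residual mismatch: the closest control is, on average, a bit smaller.
    gap = rng.gauss(0.5, 0.5)
    d_size.append(gap)
    d_outcome.append(TRUE_CASE_EFFECT + SIZE_EFFECT * gap + rng.gauss(0, 1))

# Matched-pair t-test logic: mean pair-wise difference in outcome.
naive_effect = sum(d_outcome) / len(d_outcome)

# Regression generalization: intercept is the effect net of the size gap.
adjusted_effect, size_slope = fit_line(d_size, d_outcome)

print(f"naive {naive_effect:.2f} | adjusted {adjusted_effect:.2f} | size slope {size_slope:.2f}")
```

In this assumed setting the naive mean difference is close to 1.0 even though the true case effect is zero, while the regression intercept is close to zero. A quadratic term in the size difference could be added in the same spirit when the size relation is suspected to be nonlinear.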
Error 3: Failure to Reweight Observations According to Differing Sampling Rates

A third technical error (Error 3) in the analysis of matched or choice-based samples arises from the fact that these non-random samples’ numbers of observations in outcome groups or in matched sets are not proportional to the size of their categories in the general population. The choice-based and/or matched sample selection process creates stratified samples that are deliberately not proportionally representative: the rare outcome is represented in the sample as often as the common one; the large industry has no more representation than the small industry.10

9 Of the audit research sample, 37 papers used closest matching and, of those, 27 failed to include at least a linear term for that matching variable in their analysis.

10 Five of six research design categories described later are non-random. The exception is certain within-subject studies. For example, if cases for a before-and-after study of audit fee levels are randomly selected from all potential subjects having data availability at the before and after times, the results are fairly generalizable to the population of continuing firms. (A distinct survivorship bias is potentially present; audit fee level changes identified within continuing firms may not generalize to the whole population that includes entering and departing firms.)

Second, with the exception of NCB-FM-W designs, accounting studies also differ in that they employ non-random selection, hence Error 3 or non-generalizability will apply, unless the logit exemption is enjoyed or unless reweighting of each observation in the analysis is employed. Outside of NCB-FM-W settings, application of a matched pair t-test towards ascertaining a group difference in an accounting ratio does not yield any generalizable result.
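As a side illustration of the disproportionate sampling behind Error 3, the following sketch compares an unweighted and a reweighted estimate of a simple proportion. The two strata, their population sizes, and the sampled outcomes are all hypothetical numbers chosen for the example; the weight applied to each observation is the inverse of its stratum's sampling rate, which is one common way to reweight and is assumed here rather than taken from the paper.

```python
# Hypothetical population: one large and one small stratum.
population_sizes = {"large_industry": 9000, "small_industry": 1000}

# Equal-sized samples from each stratum (deliberately not proportional).
# 1 = outcome present, 0 = outcome absent.
sample = {
    "large_industry": [1] * 20 + [0] * 80,   # 20% with the outcome
    "small_industry": [1] * 80 + [0] * 20,   # 80% with the outcome
}

def unweighted_mean(sample):
    """Treat the stratified sample as if it were a simple random sample."""
    values = [v for obs in sample.values() for v in obs]
    return sum(values) / len(values)

def reweighted_mean(sample, population_sizes):
    """Weight each observation by the inverse of its stratum's sampling rate."""
    num = den = 0.0
    for stratum, obs in sample.items():
        weight = population_sizes[stratum] / len(obs)  # N_h / n_h
        num += weight * sum(obs)
        den += weight * len(obs)
    return num / den

naive = unweighted_mean(sample)
reweighted = reweighted_mean(sample, population_sizes)
print(f"unweighted {naive:.2f} | reweighted {reweighted:.2f}")
```

With these made-up figures the unweighted estimate is 0.50, whereas the reweighted estimate is 0.26, which matches the population proportion implied by the assumed stratum sizes (0.9 × 0.2 + 0.1 × 0.8).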
Hence we find the classical introductory statististics textbook discussion is adequate only for guiding univariate analyses in the NCB-FM-W research design, and does not address the other five research design situations. In accounting research, non-random samples are commonly analyzed as if they were random. In marketing research, by contrast, where stratified sampling (one kind of non-random sampling) is commonly employed, it is well understood that to preserve generalizability it is necessary to reweight the data. A stratified sample taken to assess a univariate measure (e.g. proportion of likely buyers of a product, or likely voters for a political candidate) is analyzed by reweighting: down-weighting observations from the over-represented strata, upweighting the under-represented. The usual method to make use of nonrandom samples in regression and other analyses, also, would be for the researcher to reweight observations analogously, weighting each observation according to the sampling rate applied in selecting from its strata of the larger 11 population . In accounting research, an exception to the general need for reweighting has been noted and applied for some logit regression studies such as Palepu (1986). However most applications in accounting research fail to reweight and fail also to conform to the limited requirements for the logit exemption to apply, and hence the generalizability of their results is in 12 question. On Error 3, there is recognition that reweighting or the use of logit regression is necessary in the analysis of choice-based sampled data. The logit exemption to reweighting has 11 Examples of reweighted analyses in accounting research are Zmijewski (1984), Dopuch, Holthausen and Leftwich (1987), and Koh (1991). 12 A limited version of the logit exemption was described and endorsed prominently by Maddala (1991). 
The limited version, which involves applying a logit model as if the sample were randomly selected, applies only to settings with choice-based such as Palepu (1986)’s sample of reorganized firm cases compared to a control sample selected randomly from non-reorganized firms, hence there is stratification by outcome alone. When pair-matching or other further stratification within the control sample selection is utilized, the limited version does not apply and adjustments that fully saturate the model are necessary for the logit exemption to apply (as will be developed in Section III). Of the 73 audit research papers, 42 suffer Error 3 and are not logit regression. There are also 22 logit regression papers that would need to utilize a fully-saturated model for the logit exemption to apply, but do not, so Error 3 applies for these as well. Using abbreviations defined below, we identify only two WESML studies, two CB-NM logit studies and five NCB-FM-W studies not suffering Error 3. 10 been discussed in accounting research since Palepu (1986). Palepu noted that only the intercept term in his logit analysis of a choice-based sample was biased, and even that could be corrected by an adjustment using exogenous population frequency information, citing statisticians Manski and Lerman (1977) and Manski and McFadden (1981). Zmijewski (1984), citing Palepu’s working paper, reviewed 17 papers on bankruptcy and advocated the use of reweighting, in particular the use of weighted exogenous sample maximum likelihood (WESML) to make adjusted estimations.13 Greene (2004), summarizes the issue for choice-based sampling succinctly: In what we infer were CB-NM studies of loan default, “the dependent variable measured the occurence of loan default, which is a relatively uncommon occurence. To enrich the sample, observations with y=1 (default) were oversampled. 
Intuition should suggest (correctly) that the bias in the sample should be transmitted to the parameter estimates, which will be estimated so as to mimic the sample, not the population, which is known to be different.” For CB-SM or CB-FM studies, we note the oversampling would vary by strata within outcome group. Greene then explains the WESML estimator that would address the CB-NM case correctly, and does not consider the logit exemption. Maddala (1991) reviewed Zmijewski’s and other tabulations of logit and probit applications, and concluded that WESML is not needed in logit settings, and endorsed the use of logit in Palepu (1986) and two other papers.14 In an earlier paper, Maddala (1983) had described the problem of Choice-based sampling as “a case of stratification by an endogenous variable” and had gone on to state that “Manski and Lerman (1977) showed that treating choice-based samples as if they were random and calculating estimators appropriate to random samples will generally yield inconsistent estimates.” He notes that logit coefficients besides the intercept would not be biased in analysis of 0-1 outcome choice-based samples. 13 Zmijewski’s preference for reweighting may have been influenced by his incorrect assertion that it is discriminant analysis, not logit analysis, which provides estimates unbiased but for the intercept. Discriminant analysis has strong distributional assumptions that are usually not justified in accounting research. 14 The two papers are Dopuch, Holthausen, Leftwich (1987) and McNichols and Dravid (1990). We note these authors had selected their control samples with semi-matching on year which was not accounted for in the analysis. To obtain technically correct coefficients on the research variables of interest in these datasets, the researcher applying a logit analysis would need to control for the levels of the matching variables by including a dummy variable for each year. 
If WESML is used, different weightings would have to be applied to each year’s stratum. While the technical error may well not have had a significant impact in these studies (having only two years of data, and those years perceived to be similar), the error may be very significant in studies involving data over different time periods. Maddala endorsed these papers without calling attention to this problem.

Maddala’s suggestion has been widely cited by accounting researchers who thought that as long as logit regression was used, coefficients other than the intercept would not be biased. This is not always true. The logit exemption allows the use of unweighted logit regressions to analyse choice-based matched sample data, delivering asymptotically unbiased coefficient estimates and standard errors on non-intercept variables, provided that the model is fully saturated.15 The typical logit application in accounting research, however, has estimated an unweighted and unsaturated model, which does not control for the matching variables’ effects and does not enjoy the logit exemption from the need to weight data to reflect population proportions. We also observe that researchers have considered unweighted estimation of choice-based samples to be acceptable outside of logit settings, too. Not adequately appreciated is the logical corollary to Maddala’s statement: if logit regression is not employed, then analysis of a non-random sample as if it were a random sample is not acceptable. By not accounting for the matching in the analysis, the analysis suffers from the omission of correlated variables, leading to bias in all coefficient estimates, including unpredictable biases on those of research interest.
Almost all of the published discriminant, logit, and probit studies of matched samples in accounting research have been misanalyzed along these lines.16 We believe that the persistence of the incorrect practice is due to the guidance perceived to have been provided by Zmijewski (1984), Palepu (1986), and Maddala (1991); they are cited frequently in our sample of audit research studies. A typical quote is as follows: “Maddala (1991: 793) argues that if this choice-based sample is used to estimate a logit model, no weighting procedure is needed. The coefficients of the explanatory variables are not affected by the unequal sampling rates. It is only the constant term that is affected.” This seems to be an accurate statement of Maddala’s 1991 position, as Maddala did not preface it with a disclaimer that its applicability was restricted to non-matched samples. In fact, when matching is also present, constant terms for each matched set must also be included; each of those will be affected, but their inclusion will permit accurate estimation of the research variables of interest.

15 A model can be fully saturated by including a dummy variable for each stratum, e.g. including a pair-identification dummy for each pairing in a fully matched sample. Alternatively, the model can be posed on pairwise differenced data. The latter is preferred for logit analysis.

16 The two exceptions are the aforementioned Harrison (1977) and Burgstahler et al. (1989), who uniquely applied a no-intercept probit model to pairwise differences in bankruptcy prediction, avoiding Error 1. We have not found any other exceptions in accounting research in print prior to 2004.

While Maddala did not note these complications in his 1991 work, in a later review for a Handbook of Statistics article (Maddala, 1996), he broadly dismissed the use of matched samples, stating that “a logit analysis based on ‘matched samples’ cannot tell us anything about the effects of measured characteristics on failure rates” (p.
560). It is unclear to us whether Maddala (1996)’s strong dismissal referred only to the incorrect unconditional analysis commonly applied, or whether he would also have had reservations about conditional analysis, but at any rate only his endorsement of logit analysis has been cited widely while his dismissal has gone unnoticed in accounting research.17 We rely upon a correct statistical theory for the analysis of choice-based and matched samples that has developed largely in the biostatistics literature, where it has been shown that logit models in fact can very usefully discern conditional (within cluster) effects from matched samples. This literature supports our society’s massive investment in cancer and other medical research. Breslow (1996) provides a historical review. Briefly, Anderson (1972) and Prentice and Pyke (1979) made key contributions, establishing that appropriate logit analysis can yield coefficient estimates (besides the intercept) and corresponding standard errors that are valid when performed on fully-matched sample data . Breslow and Day (1980, 1987) developed and popularized the matched sample methodology, leading to wide application in medical research. Monographs by Schlesselman (1982) and Hosmer and Lemeshow (1988) serve practitioners. There are few citations in accounting of this literature, besides several citations of Hosmer and Lemeshow’s discussion of the logit model in general terms; their chapter describing the appropriate analysis of case-control paired samples seems not to have been noticed or understood to be applicable. A complication in the appropriate logit analysis of choice-based pair-matched samples has been noted, and has been developed by Abrevaya (1996). 
It turns out that appropriate logit analysis of pair-wise differences yields the same relative estimates of coefficients as does logit analysis of pooled, non-differenced data with pairings accounted for by inclusion of a dummy variable for each pairing (less one, or without an overall intercept). The latter approach, however, yields coefficients that are exactly twice the magnitude, while its standard errors are not scaled proportionately, so different p-values are reported. As the scaling of logit coefficients is arbitrary, either method is correct for obtaining coefficient estimates, but it is the differences analysis that has the correctly corresponding standard errors and p-values and that is correct for use in inferences. For logit analysis, the approach of including pair-wise dummies overstates the significance of its coefficients. The differences approach is easily implemented in SAS software by application of its PROC LOGISTIC with use of the STRATA statement to identify pairings, or in Stata software by application of its CLOGIT with similar use of its corresponding STRATA command.

17 No paper in our extensive review of accounting research has cited the late Maddala’s 1996 paper, and the Social Science Citation Index shows no such citations.

Modern econometrics textbooks have little mention of matched samples. Heckman, Ichimura, and Todd (1998), however, show that the correct treatment is well understood among current econometricians. They dislike matched sample studies, instead preferring to model selection processes explicitly. This criticism seems particularly apt for non-choice-based studies of management decisions such as Wallace (1997). Biostatisticians, on the other hand, would use matched samples to quickly investigate possible relationships in a new area, and then apply randomized experiments, longitudinal studies, and other approaches to deepen knowledge.
Econometricians’ preference to model the selection processes explicitly would require more data, and would preclude, for example, the choice-based matched sample studies that biostatisticians and accounting researchers sometimes employ involving analysis of costly hand-collected variables for strategically chosen observations. Bergstrahl, Kosanke, and Jacobsen (1991) provide an efficient means to identify matches for matched sample selection in SAS software, with both optimal and greedy algorithms.18 Parsons (2002) provides another SAS software implementation of a greedy algorithm. Barber and Lyon (1986, 1987) and Kothari, Leone, and Wasley (2005) discuss potential strategies in selecting matched samples. They advocate application of what are termed propensity scoring methods; Heckman, Ichimura, and Todd (1998), however, provide a critical dismissal of those approaches. The focus of the present paper is upon the correct analysis of choice-based and matched samples, once selected, rather than upon the match selection process. In the next section we go on to demonstrate the potential problems of Errors 1 and 2 in an extended example.

18 BKJ (1991)’s SAS macro offers options for weighting on multiple matching variables and for greedy matching (for the first treatment observation, selecting the closest match available) versus more strategic algorithms. We employed the macro in our simulations and recommend its use.

3. Simulation-Based Example and Statistical Theory

In this section we provide an extended example of Errors 1 and 2 using logit regression. We use a simulation to generate settings where "true" parameter values are known, allowing us to compare the performance of alternative methods of analysis used in practice. This enables us to explore which methods of analysis lead to incorrect conclusions, and which methods are more efficient in converging to correct conclusions.
This exercise will demonstrate ways that past research may be incorrect: a) true effects may be suppressed, b) non-existent effects may erroneously be found to be significant, and c) effects can be mismeasured, even to the extent that an apparent effect in one direction may be found when the true effect is in the opposite direction.

Suppose that a discrete 0-1 outcome Y is hypothesized to be driven by two variables of interest, X1 and X2, as well as by a ‘nuisance’ categorical variable Z. This would be an appropriate model, for example, in research attempting to measure the effect of market value (X1) and of discretionary accruals (X2) upon whether or not an audit-related litigation (Y) occurs, where it is already known that industry membership (Z) has a significant influence upon the outcome. We simulate this situation by generating a population of X1, X2, and Z data and then generating outcomes Y that are a function of those plus a random error term. Arbitrarily, we let there be five "industry" groups (five categories of Z), which will have (X1, X2) values in clusters centered at (1,1), (3,1), (5,1), (7,1) and (9,1). In order to approximate variables in accounting research, we generate X1 and X2 to be normally distributed and correlated. Specifically, within each group, we generate 5000 observations of (X1, X2), distributed bivariate normal with correlation of 0.4. These data will be the independent variables in each of the three simulations that follow.
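The population just described can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the reported simulations: the function name `simulate_population`, the seed, and the unit variances of X1 and X2 are assumptions, since the text fixes only the group centers, the group size, and the within-group correlation.

```python
import numpy as np

def simulate_population(n_per_group=5000, corr=0.4, seed=0):
    """Draw (X1, X2, Z) for five 'industry' groups.

    Assumption: X1 and X2 have unit variance within each group; the
    text specifies only the centers and the correlation.
    """
    rng = np.random.default_rng(seed)
    centers = [(1, 1), (3, 1), (5, 1), (7, 1), (9, 1)]
    cov = np.array([[1.0, corr], [corr, 1.0]])
    x_parts, z_parts = [], []
    for group, center in enumerate(centers):
        x_parts.append(rng.multivariate_normal(center, cov, size=n_per_group))
        z_parts.append(np.full(n_per_group, group))
    x = np.vstack(x_parts)
    return x[:, 0], x[:, 1], np.concatenate(z_parts)

x1, x2, z = simulate_population()
```

The group label `z` takes values 0 through 4 and plays the role of the nuisance variable Z.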
Simulation 1 Demonstration of an Error 1 Impact: (a) Failure to Find a True Effect

Consistent with the distributional assumptions of logit regression, we generate values Y according to the following formula:

$$y_i = 1 \ \text{if} \ \alpha_j + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i > 0; \qquad y_i = 0 \ \text{otherwise};$$

where $\alpha_j$ is an intercept specific to the jth of J=5 groups in the population, $\beta_1$ and $\beta_2$ are the coefficients reflecting X1 and X2’s influence upon the outcome, and $\varepsilon_i$ is a logistic distributed random error term having mean 0 and variance $\sigma^2 = 1$. In this first of three cases, we set $\beta_1 = 1$ and $\beta_2 = 1$. In order to generate a mix of 1 and 0 outcomes within each group, given the distribution of X1 and X2 and the chosen values of $\beta_1$ and $\beta_2$, we set the $\alpha_j$ values at -2, -4, -6, -8, and -10. This yields more outcome 1’s above and to the right of each group’s center, and more outcome 0’s below and to the left. This simulation yielded 12,477 outcome 1’s and 12,523 outcome 0’s.

Assuming there was a significant cost to data collection, a researcher might reasonably choose to study the relationship of X1 and X2 to Y by gathering a limited choice-based matched sample. We simulate this by selecting, at random, 50 out of the 12,477 observations having outcome 1. We then create a comparison sample by randomly selecting, for each one of those, a matching observation from those having outcome 0 and appearing in the same “industry” group. This yields 100 observations in 50 nominal pairs that may be analysed in various ways. A scatter plot of the data generated for 50 pairs is included in Table 1. We create increasingly larger samples of 100, 200, 400, 800, and 1600 pairs by adding observations selected in the same way. Results of analysis for each sample size are tabulated in Table 1.

First consider analysis as has been most commonly done in accounting research, i.e.
running the logit regression:

$$y_i = 1 \ \text{if} \ \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \varepsilon_i > 0; \qquad y_i = 0 \ \text{otherwise}, \tag*{[1]}$$

where $\beta_0$ is a single overall intercept that is estimated and $\varepsilon_i$ is an error distributed according to the logistic distribution. This is an unconditional, pooled analysis that we term “unmatched”. Note that it uses neither group identifier information nor pairing information. Logistic regression software finds the maximum likelihood estimates $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ by a numerical search that maximizes the product of probabilities for occurrence of the observed data, where for outcomes $y_i = 1$ the probability is:

$$\Pr(y_i = 1 \mid x_{1i}, x_{2i}) = F(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2) = \frac{\exp(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2)}{1 + \exp(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2)} \tag*{[2]}$$

and for outcomes $y_i = 0$ the probability is:

$$\Pr(y_i = 0 \mid x_{1i}, x_{2i}) = 1 - F(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2) = \frac{1}{1 + \exp(\beta_0 + x_{1i}\beta_1 + x_{2i}\beta_2)} \tag*{[3]}$$

where F is the cumulative distribution function of the logistic distribution.

In the first two columns of Table 1 we report selected results of this unmatched analysis applied for 100 observations in 50 pairs: estimated coefficients $\hat{\beta}_1 = -.083$ and $\hat{\beta}_2 = 1.56$, with corresponding p-values of .32 and <.0001. We do not report $\hat{\beta}_0$. A researcher could infer that X2 affects Y, while X1 does not, when in fact true $\beta_1 = \beta_2$, so X1 and X2 affect Y equally. The next question we consider is whether increasing the sample size would allow estimation to identify what we know, by construction, to be true: that the two variables have an equal and positive effect on Y. Continuing down the column within Simulation 1, we observe that increasing the sample size up to 1600 pairs does not accomplish that. Unmatched analysis applied to all 25,000 observations in the simulation eventually yields a statistically significant coefficient $\hat{\beta}_1 = .086$, but its estimated magnitude is a small fraction, about seven percent, of the estimated coefficient $\hat{\beta}_2$.
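The outcome generation and the choice-based matched draw used in Simulation 1 can be sketched as follows. This is a hedged illustration, not the procedure behind Table 1: the helper name `draw_matched_pairs`, the seed, the unit variances of X1 and X2, and the use of a standard logistic error (scale 1, the normalization a logit model assumes) are all assumptions. Controls are also drawn independently for each case, so the same control could in principle be picked twice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Population: five groups of 5000, centers (1,1) ... (9,1), correlation 0.4.
# Unit variances are an assumption; the text does not state them.
centers = np.array([1, 3, 5, 7, 9])
cov = [[1.0, 0.4], [0.4, 1.0]]
z = np.repeat(np.arange(5), 5000)
x = np.vstack([rng.multivariate_normal([c, 1], cov, size=5000) for c in centers])

# Simulation 1 outcomes: y = 1 if alpha_j + beta1*x1 + beta2*x2 + eps > 0.
alpha = np.array([-2.0, -4.0, -6.0, -8.0, -10.0])
beta = np.array([1.0, 1.0])
eps = rng.logistic(loc=0.0, scale=1.0, size=len(z))  # assumed scale of 1
y = (alpha[z] + x @ beta + eps > 0).astype(int)

def draw_matched_pairs(y, z, n_pairs, rng):
    """Pick n_pairs cases (y=1) at random, then for each case a random
    control (y=0) from the same group. Returns case and control indices."""
    cases = rng.choice(np.flatnonzero(y == 1), size=n_pairs, replace=False)
    controls = np.array(
        [rng.choice(np.flatnonzero((y == 0) & (z == z[i]))) for i in cases]
    )
    return cases, controls

cases, controls = draw_matched_pairs(y, z, 50, rng)
```

Because sampling is on the outcome and then within group, the resulting 100 observations are stratified both by Y and by Z, which is the feature the unmatched analysis ignores.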
Unmatched analysis in this case is inconsistent: it will not converge to the true values even asymptotically.

Now consider an analysis that takes into account the pair-matching. This is implemented by running a conditional logit regression. This is essentially a no-intercept logit regression of pair-wise differences in Y upon pair-wise differences of the independent variables.19

19 As developed in Appendix 1, the estimation in effect maximizes the product of observed probabilities that the pair-wise difference is 1, as a function of pairwise differences in X1, X2, and the vector of industry dummies Z. To implement this, a numerical search is run to maximize the product of probabilities:

$$\begin{aligned}
\Pr\big((y_i - y_j) = 1 \mid x_{i1}, x_{i2}, x_{j1}, x_{j2}, z_i, z_j\big)
&= \frac{\exp\big((\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \gamma z_i) - (\beta_0 + \beta_1 x_{j1} + \beta_2 x_{j2} + \gamma z_j)\big)}{1 + \exp\big((\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \gamma z_i) - (\beta_0 + \beta_1 x_{j1} + \beta_2 x_{j2} + \gamma z_j)\big)} \\
&= \frac{\exp\big((x_{i1} - x_{j1})\beta_1 + (x_{i2} - x_{j2})\beta_2\big)}{1 + \exp\big((x_{i1} - x_{j1})\beta_1 + (x_{i2} - x_{j2})\beta_2\big)}
\end{aligned}$$

Note that in taking the pairwise difference, the overall intercept drops out, as does each pairwise difference in industry, Z (matched observations share the same industry, so $z_i = z_j$). Thus the vector of industry effects is not estimated. The expression can be interpreted as the conditional probability that it is Yi = 1 and Yj = 0, rather than the other way around, given that one is 1 and the other is 0. For convenience in estimation, however, we reorder as necessary so Yi – Yj = 1 always. Note when X1i – X1j = 0 and X2i – X2j = 0, the expression simplifies to ½, the intercept probability value.

Disconcertingly for some, the pairwise difference in Y that is the dependent variable is uniformly one for each observation, giving rise to a seeming paradox in estimation. It may seem impossible to estimate such an expression, i.e. regressing a vector of 1’s on a vector of data times coefficients to be estimated.20 If this were a regular logit regression, having all outcomes equal to 1 would mean that it could not be estimated: it would be a situation of “complete separation” and the maximum likelihood estimation procedure would find that increasing coefficient estimates continually towards infinity would indefinitely continue to increase the likelihood. Resolution of the paradox is found by noting that this is not a regular logit regression, but instead a no-intercept regression. Here, one is finding the unique coefficients maximizing a likelihood expression subject to a very strong constraint that the intercept value is fixed. In an analogous no-intercept OLS regression, the intercept would be zero, but here, when the pairwise difference in each independent variable is zero, the conditional probability that it is Yi = 1 and Yj = 0, rather than the reverse, is in fact ½. Maximizing the appropriate likelihood expression enforces that; this is easily implemented in standard statistical software.

The next two columns give conditional logit coefficient estimates based on the same 50 pairs of data: estimated coefficients $\hat{\beta}_1 = .89$ and $\hat{\beta}_2 = 1.89$, both different from zero at conventional significance levels. Estimation on increasing amounts of data up to 1600 pairs yields estimated coefficients $\hat{\beta}_1 = .989$ and $\hat{\beta}_2 = .966$, which are close to their true values. The simulation suggests that conditional logit estimation is consistent, i.e. that it converges upon the true coefficient values. We provide a proof that the conditional logit estimation is in fact consistent in Appendix A.

Note, within this simulation, from 50 pairs on, a true effect that X1 contributes positively to Y is revealed in the conditional logit analysis, but is concealed in the unmatched analysis until
400 pairs are collected and analyzed. This demonstrates a Type II error stemming from Error 1: a significant effect is not identified when in fact it is true. The industry groupings from columns 5 and 6 will be discussed below.

20 The paradox has puzzled accounting researchers and, in general, led some to avoid matched sample designs and led others to perform pooled, unmatched analyses of matched sample data that are not justified.

Simulation 2 Demonstration of an Error 1 Impact: (b) Finding a False Effect

If only Type II errors were caused by misanalysis, perhaps one could still trust unmatched analysis that achieved significant results, as if the results were shown despite a bias against finding them. But if a variable makes no real contribution, can a Type I error occur? To examine this possibility, we keep the same distribution of X1, X2, and Z, and regenerate Y as a slightly different function, now setting $\beta_1 = 1$ and $\beta_2 = 0$. In this second simulation, X2 has no contribution to Y, and one would hope that analysis will not erroneously identify a significant coefficient $\hat{\beta}_2$. In the Simulation 2 panel of Table 1, we present results of analyzing successively larger samples. In the first columns of this panel, the coefficient estimates go to .087 and to .303 in the unmatched analysis. In the conditional logit, estimates of .988 and .079 are obtained that are closer to the 1 and 0 true values. In fact these conditional logit results are not significantly different from the corresponding true parameter values of 1 and 0. At sample sizes below 1600 pairs, the conditional logit analysis correctly identifies no significant effect of X2. In unmatched analysis, however, a Type I error occurs: a highly significant positive influence of X2 is erroneously assessed.
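The conditional logit estimator used in Simulations 1 and 2 can be sketched as a no-intercept logit on case-minus-control differences, which makes concrete why a dependent variable that is uniformly one can still be estimated. The code below is a minimal sketch under stated assumptions, not the SAS or Stata routine: the function name `fit_conditional_logit_pairs`, the fixed number of Newton-Raphson iterations, and the demonstration data (pairs generated directly from the conditional probability, with true coefficients of 1 and 1) are all illustrative choices.

```python
import numpy as np

def fit_conditional_logit_pairs(d, n_iter=25):
    """No-intercept logit on case-minus-control differences d (n_pairs x k).

    Maximizes sum(log(sigmoid(d @ beta))) by Newton-Raphson and returns
    (beta_hat, standard_errors) from the inverse information matrix.
    """
    beta = np.zeros(d.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(d @ beta)))      # Pr(observed ordering)
        grad = d.T @ (1.0 - p)                      # score vector
        info = (d * (p * (1.0 - p))[:, None]).T @ d # information matrix
        beta = beta + np.linalg.solve(info, grad)
    p = 1.0 / (1.0 + np.exp(-(d @ beta)))
    info = (d * (p * (1.0 - p))[:, None]).T @ d
    return beta, np.sqrt(np.diag(np.linalg.inv(info)))

# Hypothetical demonstration data: within each pair, unit a is the case
# with probability sigmoid((xa - xb) @ true_beta), given one case per pair.
rng = np.random.default_rng(0)
true_beta = np.array([1.0, 1.0])
n_pairs = 2000
xa = rng.normal(size=(n_pairs, 2))
xb = rng.normal(size=(n_pairs, 2))
p_a = 1.0 / (1.0 + np.exp(-((xa - xb) @ true_beta)))
a_is_case = rng.random(n_pairs) < p_a
d = np.where(a_is_case[:, None], xa - xb, xb - xa)  # case minus control

beta_hat, se = fit_conditional_logit_pairs(d)
```

With no intercept in the model, a pair whose differences are all zero contributes a probability of exactly ½ whatever the coefficients are, so the likelihood cannot be driven upward indefinitely the way it can under complete separation.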
Simulation 3 Demonstration of an Error 1 Impact: (c) A Sign Reversal

With unmatched analysis there will be some degree of misestimation of coefficients due to the omitted correlated variables issue in any setting where there are correlations among the independent, outcome, and matching variables, as demonstrated in Simulations 1 and 2. Simulation situations where misanalysis finds even more disturbing results that are the opposite of true effects are not difficult to find. We obtain such a situation by continuing to rotate the relative effects of X1 and X2, within each industry, on the outcome variable. As before, we keep the same distribution of X1, X2, and Z, but now regenerate Y setting $\beta_1 = -4$ and $\beta_2 = 1$. We find a sign reversal effect of misanalysis: unmatched analysis yields a mistaken result that $\beta_2$ is significantly negative when in fact it is positive.

Results are reported in the Simulation 3 panel of Table 1. In the 200 pair case, the conditional logit analysis performs well, identifying estimates of -4.49 and .810, which are statistically different from zero and not far from the true values. With larger sample sizes, the estimates improve, as before. But, for the unmatched analysis, the initial estimate is $\hat{\beta}_1 = -.164$ and $\hat{\beta}_2 = -.207$, not close to the true values of -4 and 1. Going to larger sample sizes, the unmatched analysis continues to lead to the erroneous conclusion: at 1600 pairs the unmatched analysis estimates $\hat{\beta}_2 = -.18$. The unmatched analysis has identified a negative effect for X2, when in truth X2 has a positive effect that is shown in the conditional logit analysis.

Simulation 4 Demonstration of Error 2: Failure to Control for Imperfect Matching

Up to this point we have considered simulations where the matching was exact: each outcome 1 case was matched to an outcome 0 control having exactly the same industry.
To illustrate the potential for Error 2, failure to control for residual effects of imperfectly matched variables, we revise the simulation process slightly. Suppose now that the researcher is only interested in the effect of X2 on outcome, and chooses to match by industry group and by closest X1. One could interpret X1 as a firm-size measure, perhaps the log of total assets. Using the same distributions of variables and the same true equal and positive effects relating X1 and X2 to outcome as in Simulation 1, we perform the matching by industry and now also by closest X1. We must consider the results of analysis with and without including control for residual differences in X1. In a modified simulation run with 1600 pairs, with X2 alone in the model, the conditional logit estimate $\hat{\beta}_2$ is 1.4229 (standard error .0913, p-value <.0001).21 With both in the model, however, we obtain an estimate for $\hat{\beta}_2$ of 1.1041 (standard error .1018, p-value <.0001). Estimation is by conditional logit with stratification on pair identifiers. Including X1 as well as X2 in the model “soaks up” the effect of residual differences in X1 upon pairwise differences in outcome, and obtains an estimate close to the true value of $\beta_2 = 1$. In this simulation setting, we know that the form of X1’s effect on outcome is that it enters linearly, and hence including it in the model controls for residual effects correctly. In other settings one might include a quadratic term as well. The point of this example is that the residual difference from imperfect matching can drive results in analysis omitting control for that residual difference.

21 For brevity, we do not tabulate results for different numbers of pairs.

Interestingly, in this situation the estimate for $\beta_1$ at 22.3933 (standard error 3.4813, p-value <.0001) is not very close to 1, the true effect of X1. That perhaps occurs because the simulation, as implemented, obtains a domain of pairwise differences in X1 that is vanishingly small.
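The matching step in this simulation (same industry group, closest X1) can be sketched as a simple greedy search. This is a hypothetical simplification and not the BKJ (1991) SAS macro: the function name `greedy_match`, matching without replacement, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def greedy_match(case_x, case_z, ctrl_x, ctrl_z):
    """For each case in order, pick the unused control in the same group
    whose x is closest. Returns control positions (-1 if none is left).
    Matching without replacement is an assumption of this sketch."""
    used = np.zeros(len(ctrl_x), dtype=bool)
    matches = np.full(len(case_x), -1)
    for i, (x, group) in enumerate(zip(case_x, case_z)):
        eligible = np.flatnonzero((ctrl_z == group) & ~used)
        if eligible.size == 0:
            continue
        j = eligible[np.argmin(np.abs(ctrl_x[eligible] - x))]
        matches[i] = j
        used[j] = True
    return matches

# Toy data: three cases and four candidate controls in two groups.
case_x = np.array([1.0, 1.1, 5.0])
case_z = np.array([0, 0, 1])
ctrl_x = np.array([1.05, 0.2, 4.0, 5.4])
ctrl_z = np.array([0, 0, 1, 1])

matches = greedy_match(case_x, case_z, ctrl_x, ctrl_z)
# Residual case-minus-control difference in the matching variable
# (every case is matched in this toy example, so no -1 entries occur).
residual = case_x - ctrl_x[matches]
```

The `residual` vector is what Error 2 concerns: it is generally not zero, so the matching variable still needs to enter the conditional model as a control.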
A precept to observe is that one cannot accurately measure the effect of a matching variable upon outcome. When the effect of a variable is of interest, it should not itself be used as a matching variable. To reiterate, the first analysis, omitting the residual effect of X1, suffers Error 2 (omission of control for residual differences), yielding an estimate more than 4 standard errors away from the true value $\beta_2 = 1$; the second analysis, controlling for it, is “spot on” the true value.

Potential for Simulation To Explore Error 3

The simulations presented above do not illustrate Error 3, non-generalizability of results. In these simulations the error distributions conform to the logistic distribution, and the logit exemption to a need for reweighting applies. Error 3 can occur in univariate analysis or OLS or probit regression analysis, where the data must be reweighted according to sampling rates in each stratum or else the analysis is not generalizable to any universal population. The impact of Error 3 could be explored by simulations in such non-logit settings, by comparing unweighted vs. appropriately reweighted analyses.

Semi-Matched Analysis Versus Fully Matched Analysis

A further potential improvement in the efficiency of analysis in Simulations 1, 2 and 3 can sometimes be obtained by recognizing the similarity of pairings within each industry group. In the samples selected, we know that there are sets of multiple pairings within the same industry, hence sharing the same true value of industry intercept $\alpha_j$. We have not yet exploited this additional information. To do so, we run conditional logit analysis stratified upon the 5 industry groups, instead of stratifying on the 50 or more pairings. The results are reported in columns 5 and 6 of Table 1. The tabulation shows that this yields very similar results. We have noted in other simulations, however, that use of the industry groupings can be more efficient.
For the Simulation 1 setting but with correlation of .9 rather than .4, for example, we find that this yields statistically significant results sooner. With just 50 pairs, the estimate for $\beta_2$ is significant at the .05 level, while equivalently significant results for $\beta_2$ are not found until 400 pairs for the conditional logit using pairings. Both approaches appear to yield the correct estimates, as do results using industry groupings in Simulations 2 and 3. The Simulation 1, 2, and 3 settings were in effect CB-SM, not CB-FM, settings. We provide a proof, in Appendix B, that conditional logit provides asymptotically correct estimates for the effects of interest, as long as a fully saturated model (including one intercept for each group) is used. Including multiple intercepts for each separate pairing within each industry is duplicative and reduces the degrees of freedom in the analysis.

The simulations have demonstrated that Error 1 and Error 2 problems can be very severe. In the next section we provide further evidence from replications.

4. Replication and Reanalysis of Two Choice-Based Matched Sample Studies

Errors in analysis can also be illustrated by replications. We replicate two studies here. Replication of a matched sample study tangibly demonstrates that when data is analyzed taking into account the matching, estimates are different than when unmatched analysis is used. Our first replication, of a medical example, illustrates that the difference uncovered may not be very great in magnitude, but the context can be such that even a small difference is very important. Second, our replication of Ghicas (1990) suggests that both Type I and Type II errors were made in an accounting research analysis, and resolves an anomaly in the results that Ghicas had noted.

The Mack et al. (1976) retrospective study of cancer in a retirement community is a prominent study published in The New England Journal of Medicine.
It was employed as a running example in Breslow and Day’s (1980) monograph on the use of matched samples. It is employed as an example of conditional logit analysis within SAS software documentation. We chose it for convenience and also because the scientific community has weighed in on which is the correct analysis. The goal of this case-control analysis is to determine the relative risk of cancer that having a gall bladder disease condition contributes, while controlling for the effect of hypertension. The researchers chose matching within one year of age, same marital status (ever-married or single), and living in the community at the time of diagnosis of the patient’s disease. It is a retrospective study and the cost of data collection was high: they interviewed subjects and collected medical histories, clinical records, and prescription history. Using the subsample of the study’s data that is included in the SAS documentation of PROC LOGISTIC, Example 42.10, we replicate and then vary the analysis. Table 2 presents the results reported in the SAS documentation, and our results applying unmatched and matched analysis. We obtained conditional logit analysis results identical to those in the SAS documentation. In this example the coefficient estimates vary only slightly across specifications, and there is no Type I or Type II crossing of significance levels as to the importance of any variable. But this medical example is one in which it is especially easy to understand that even small differences do matter. Here, the statistical significance of the gall bladder condition factor does not change very much, going from a p-value of .0675 for unmatched to .0770 for conditional logit. The Hypertension variable remains insignificant (although including the variable does provide value in the interpretation). The gall bladder variable’s coefficient estimate changes from .8417 to .9704, an increase of only 10%.
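The link between a logit coefficient and the odds ratio it implies is simply exponentiation, which can be checked directly. The short sketch below is illustrative only; the function name `odds_ratio` is hypothetical, and the single input is the conditional logit coefficient of .9704 quoted above.

```python
import math

def odds_ratio(coef):
    """Odds ratio implied by a logit coefficient on a 0-1 indicator."""
    return math.exp(coef)

# Conditional logit coefficient on the gall bladder variable, as quoted.
or_conditional = odds_ratio(0.9704)
print(round(or_conditional, 3))
```

Because the transformation is nonlinear, a given absolute change in a coefficient produces a proportionally larger change in the odds ratio.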
The odds risk ratio, a nonlinear function of the coefficient estimates, increases by about 20%, from 2.258 to 2.639. The received interpretation of this study is that the odds risk ratio is 2.639 for the gall bladder factor, meaning that cancer occurs 2.639 times as often among persons having the gall bladder condition as among those who do not, in the population from which the study sample is drawn. The unmatched estimate of 2.258 could be viewed as similar, perhaps, but the difference could affect serious decisions. In medical situations like this one, there are available courses of treatment that would mitigate the risk of cancer, including options that are costly, invasive, and accompanied by painful and/or uncertain side-effects. Doctors use odds ratio numbers to consider offering treatment or costly additional testing options, or not, and patients, informed by the doctors of the costs and the risks, must make serious decisions. If one is informed of a cancer risk that is .4 times higher, one might make different choices about whether or not to pursue options that could mitigate the risk.

Second, we replicate "Determinants of Actuarial Cost Method Changes for Pension Accounting and Funding", the Accounting Review paper based on Dimitrios Ghicas’ University of Florida Ph.D. dissertation. Ghicas (1990) is one of few papers in accounting research that both employs matching in its sample selection and provides an explicit listing of its firm-year observations with identification of its pairings, facilitating a replication.22 Ghicas (1990) applied logistic regression to explain firms’ choices to switch actuarial methods for pension funding. The choice to switch actuarial methods was explained by Ghicas as a function of factors that allowed firms to subsequently report higher earnings.
He was surprised to find no significance of size, measured by assets, a likely proxy for the political exposure that might prevent more prominent firms from raiding their pension funds. Because he pooled the case and control data from the 45 matched pairs, and omitted dummy variables for the pairings, his analysis suffers from Errors 1 and 2. Table 3 Panel A presents Ghicas’ reported results. Table 3 Panel B reports our unmatched logit regression of the salient model, which substantially matched Ghicas’ results. We then reanalyzed the data with a pair-matched logit regression on the 86 pooled treatment and control observations, by using conditional logit analysis. We compare the unmatched results to our conditional logit results, which are reported in Table 3, Panel C. We found important differences. While there were no changes in sign between our unmatched versus pair-matched analyses on the ten variables, the estimated statistical significances of six of the ten variables shifted across traditional thresholds for strong and for marginal significance (p-values of .01 and .10).23 The significance of two variables, IR and LogTA, increased dramatically (p-values changing from .02 to .001, and from .08 to .02). Three variables that were marginally significant in the unmatched analysis, WC, RUNI, and INT, were now found not to be significant. One variable, CI, that was insignificant, became marginally significant. The other four variables did not have significance level changes across the .01 or .10 significance level thresholds.

22 After contacting Ghicas and finding that his original data is no longer available, we substantially reconstructed Ghicas’ dataset from Compustat, 10-K statements, and annual reports, without certitude of exactly replicating his data because of ambiguities about fiscal year-ends and unavailable reports. Data for two pairs of observations were especially problematic to reconstruct and thus were omitted from our analyses.

23 We would prefer to perform statistical tests for whether each coefficient is the same before and after, individually, so as to say whether the change for each was statistically significant. This could be performed easily in an ordinary regression context. But in the logit regression context, where the relative sizes are determined but overall scaling is arbitrary, we do not see how to construct those tests. (Note, in a stacked regression model, an included observation will have no influence upon a given model’s estimation if it has value zero for each variable in that model. This is not true in a logit regression context.)

While Ghicas finds significant support for six of nine hypotheses (two with statistically significant support and four with marginal support), we find significant support for only two of Ghicas’ six hypotheses, and no support for the other hypotheses. Interestingly, the size measure, logTA, now comes in strongly with .01 significance, while Ghicas’ discussion in hypothesis development and footnotes suggested puzzlement on his part that this measure, a likely proxy for political visibility, entered only marginally (at .11 in his and .08 in our unmatched analyses). Also, we find marginal support in the “wrong” direction for one hypothesis for which he expected to find support, but did not.

The differences in our analysis can largely be explained by the fact that the matching controlled appropriately for factors related to industry and stock exchange listing. At least four of the included variables are related to industry effects: leverage, working capital, rate-of-undertaking-new-investments, and size. Hence, these variables are correlated with the industry dummy variables omitted from the unmatched analyses. And, the omitted dummy variables would be expected to affect the probability of pension method switching.
Therefore the unmatched analysis performed originally reflects an omitted correlated variables problem and its estimated coefficients are biased and unreliable. Ghicas believed that his analysis did control for his matching variables, and, citing Palepu (1986) and Zmijewski (1984), he stated that “The primary advantage of logit models is the presence of consistent coefficient estimates whenever choice-based sampling is involved” (Ghicas 1990, p. 385.) This statement, although consistent with the general state of knowledge within accounting research, failed to recognize the complication due to his further stratification (of matched pairs) within 0-1 outcome sub-samples. These replications show that reanalysis provides new insights in old data, and suggest that previous choice-based and matched sample studies might best be reanalysed before being relied upon further. In the next section we review other accounting studies and provide guidance for future research. 5. Guidance for Statistical Analysis of Choice-Based and Matched Samples In this section, we seek to provide the guidance on choice-based and matched samples needed by researchers going forward, informed by past usage and the potential for errors that we have demonstrated. We discuss, first, how univariate analysis can suffer each of the three errors 25 that we identify and what are the corresponding remedies. We discuss the general solution that reweighting analysis provides, to compensate for non-random sample selection, when exogenous sampling rate data is available. Then we provide more specific guidance in each of the six distinct research designs that we have identified as important. Finally we comment on approaches for performing matching and on considerations in choosing between possible research designs, in advance of data collection. Guidance for univariate case Univariate analysis has routinely been applied in choice-based and matched samples in accounting research. 
Sometimes this has been the main analysis, but more often recently this is preliminary to multiple variable analysis that will follow. Often this is done to compare two samples on each variable, e.g. in a choice-based sample, to compare the treatment sample of bankrupt firms and the control sample of non-bankrupt firms. As noted in Section II, many researchers studied the use of matched sample t-tests in introductory statistics textbooks such as Johnson and Bhattacharyya (1985) or Rice (1995). These textbooks establish that a matched sample t-test (a one sample test) is more powerful than an unmatched (two sample) t-test in detecting a mean difference in a given measure, provided there is a natural pairing in the data. What is a natural pairing? The presented examples include pairings that are within-subject, as in Rice’s example comparing a blood platelets aggregation index measured in blood samples taken from each subject before and after smoking a cigarette. Or the pairing is prospective and treatment is randomized. “In a medical experiment, for example, subjects might be matched by age or weight or severity of condition, and then one member of each pair randomly assigned to the treatment group and the other to the control group” (Rice, 1995, p. 410.) Johnson and Bhattacharyya enjoin: “After pairing, the assignment of treatments should be randomized for each pair” (p. 347). In these settings, the pairing is shown to be useful if there is any positive pair-wise covariance in the univariate measure to be compared, as that covariance will be subtracted from the sum of the variances of the measure calculated within each of the two samples, yielding a lower unexplained variance, and hence a higher ratio of mean difference divided by estimated standard error (t-value) can be discerned. 
In all six research design cases, we deem an Error 1 to have occurred when a univariate analysis fails to account for the matching, as when an unmatched t-test is applied rather than the matched t-test that is justified. In univariate t-test analysis, the inappropriate test choice cannot cause an actual reversal of results: the numerator in the t-test for a matched sample is the average of pair-wise differences, which is mathematically the same as the numerator for an unmatched sample, the difference of the two samples’ averages. (In multivariate settings, however, as demonstrated in Section III, entirely opposite results may obtain when matched vs. unmatched analyses are employed.) What differs is the estimated standard error of the difference, and corresponding p-values of statistical significance for the null hypothesis of no difference. As the researcher is usually correct in his judgment that there is some benefit in matching, the matched sample t-test will be more powerful, statistically. And to be clear, use of the unmatched t-test is invalid. It is especially inappropriate to report an unmatched t-test when the researcher wishes, for some reason, to show no statistical difference between two samples.

We observe, further, that the situations described in statistics textbooks differ from many of the non-random observational study settings common in accounting research in which matched sample t-tests are often applied. The NCB-FM-W research design can be an exception. Accounting studies may differ from those described by Johnson and Bhattacharyya (1985) and Rice (1995) in that matching is sometimes by closest size or other measure, and hence Error 2 is possible. When matching is by industry and closest log-size, for example, we argue that a univariate t-test, matched or not, is not even appropriate.
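To make concrete the point above that matched and unmatched t-tests share the same numerator and differ only in the estimated standard error, the following is a minimal sketch on simulated data. The data-generating values (40 pairs, a shared pair effect, a true difference of 0.5) and the function names are assumptions chosen for illustration; they are not taken from any study discussed here.

```python
import math
import random
import statistics


def unmatched_t(cases, controls):
    """Two-sample (pooled-variance) t statistic that ignores the pairing."""
    n1, n2 = len(cases), len(controls)
    diff = statistics.mean(cases) - statistics.mean(controls)
    pooled_var = ((n1 - 1) * statistics.variance(cases)
                  + (n2 - 1) * statistics.variance(controls)) / (n1 + n2 - 2)
    return diff, diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def matched_t(cases, controls):
    """One-sample t statistic computed on the pairwise differences."""
    d = [a - b for a, b in zip(cases, controls)]
    mean_d = statistics.mean(d)
    return mean_d, mean_d / (statistics.stdev(d) / math.sqrt(len(d)))


# Hypothetical data: each pair shares a common effect (e.g. industry),
# which induces positive pairwise covariance.
rng = random.Random(0)
pair_effect = [rng.gauss(0, 2) for _ in range(40)]
controls = [e + rng.gauss(0, 1) for e in pair_effect]
cases = [e + 0.5 + rng.gauss(0, 1) for e in pair_effect]

num_u, t_u = unmatched_t(cases, controls)
num_m, t_m = matched_t(cases, controls)
print(f"numerators: unmatched {num_u:.4f}, matched {num_m:.4f}")
print(f"t-values:   unmatched {t_u:.2f}, matched {t_m:.2f}")
```

With positive pairwise covariance, as simulated here, the matched t-value is larger in magnitude than the unmatched one while the sign cannot differ, which is the sense in which the choice of test changes significance but cannot reverse a univariate result.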
The residual difference in log-size might explain pair-wise difference in the measure of interest as well or better than the group membership does, in truth, but that would not be revealed in a matched t-test. The corresponding analysis that would avoid Error 2 is an OLS regression of differences, in particular with the pairwise difference of the measure of interest regressed upon an intercept and pair-wise difference in log-size (and perhaps more terms, such as pair-wise difference in log-size-squared). Significance of the intercept would indicate a group membership effect on the measure of interest. (Or, equivalently, the OLS regression to avoid Error 2 may be run on all non-differenced observations but with the dependent variable being the measure of interest and independent variables being the size variable or a size-difference variable and pair-identifier dummy variables, with omission of one such dummy variable or of an overall intercept.) A quadratic term for the size or the size-difference may also be accommodated in this formulation.

Accounting studies also differ from those described in the statistics textbooks in that there is a possibility of non-generalizability, of Error 3. Except in the NCB-FM-W settings, where the samples arguably are randomly drawn, the application of a matched pair t-test towards ascertaining a group difference, or of the OLS regression just described, does not yield any generalizable result: there exists no larger population to which the analysis generalizes. In the NCB-FM-W setting of before-and-after, assuming the cases are drawn randomly from all continuing firms, the result generalizes to the population of all continuing firms. In the other five settings, the match selection draws a non-random sample. The OLS implementation, but not the univariate t-test, may be corrected by reweighting to avoid Error 3.

Guidance for reweighting

Reweighting provides for generalizability, and we explain how to apply it here.
However, this requires exogenous sampling information that will only be available if the researcher focuses data collection effort, early in the research design process, on the wider population to which analysis in a nonrandom sample is to generalize. To reweight a choice-based non-matched sample, the researcher must know the count of each outcome category in the wider population. For example, to study bankruptcy across Compustat-listed firms by a choice-based sample of bankrupt and non-bankrupt firms, the researcher would effectively need to determine whether or not the definition of bankruptcy is met, for all Compustat firms in any stratum that will be sampled. It does not suffice to examine the bankruptcy status for just the sample firms. The researcher’s choice of definition for bankruptcy, then, is limited to those for which information is universally available in the wider population to which the sample will generalize. Likewise, to evaluate a non choice-based matched sample, the researcher must know the count in the wider population of members of each possible matched set. For example, in an industry matched sample, the number of Compustat firms in each industry category must be known or collected, again limiting the researcher’s choice of definition of industries to one that can be determined universally. For a sample that is both choice-based and matched, the researcher must know the count in all strata formed by intersection of outcome partitioning and matched set partitioning. For example, in each separate industry category, the number of bankruptcies and non-bankruptcies must be collected. In the analysis, each observation is to be weighted by the inverse of the sampling rate for its stratum.

Consider a CB-FM setting such as bankrupt firms matched to non-bankrupt firms by industry, with random selection out of the multiple possible matches (rather than by closest size, eliminating, for simplicity, any possibility of Error 2).
To examine whether there is a difference between bankrupt firms vs. non-bankrupt firms in a given accounting ratio, it would be incorrect to apply a univariate t-test. It is possible, however, to perform a correct analysis by running a weighted OLS regression of the accounting ratio upon a group membership dummy variable plus pair-identifier dummy variables (omitting an overall intercept or one pair-identifier). The weight applied to each observation should be inversely proportional to the sampling rate for its stratum. For example, let the weight be 1 for all the bankruptcies in one industry where all bankruptcies available are selected in the sample. Then, the weight should be 2 for bankruptcies in a second industry where only one-half of all available bankruptcies are randomly selected. And, within each of these industries, the randomly selected non-bankruptcies should be weighted according to the prevalence of non-bankruptcies in the industry. In the second industry, supposing there are 99 non-bankruptcies for each bankruptcy, and just one non-bankruptcy is chosen randomly for each bankruptcy in the sample, the weight for each non-bankruptcy would be 198. If performed appropriately, the weighted regression results provide a generally valid result of the association of a single accounting ratio with bankruptcy.24

To be sure, in many situations the need for exogenous sampling information is onerous, and, compared to collecting a random sample, the cost of collecting exogenous rate information would often outweigh the advantage of being selectively strategic in taking a non-random sample rather than a random one.25

Choice Based Non-Matched

More specifically, let us summarize our guidance for researchers, by research design category. See Table 4 for a summary. If research employs choice-based sampling, without use of matching (CB-NM design), as 6 of the 73 audit research papers do, then of our three errors only Error 3 can apply.
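As an illustration of the inverse-sampling-rate weights described in the reweighting guidance above, the following minimal sketch reproduces the arithmetic of the two-industry bankruptcy example. The population and sample counts are hypothetical values chosen only so that the sampling rates match those stated in the example (all bankruptcies sampled in the first industry, one-half in the second, and 99 non-bankruptcies per bankruptcy in the second); the stratum labels and function name are likewise assumptions.

```python
from fractions import Fraction


def inverse_rate_weight(n_population, n_sampled):
    """Weight = 1 / sampling rate = population count / sampled count."""
    return Fraction(n_population, n_sampled)


# Hypothetical stratum counts (industry x outcome).
strata = {
    ("industry_1", "bankrupt"): {"population": 10, "sampled": 10},
    ("industry_2", "bankrupt"): {"population": 20, "sampled": 10},
    # 99 non-bankruptcies per bankruptcy; one sampled per sampled bankruptcy.
    ("industry_2", "non_bankrupt"): {"population": 20 * 99, "sampled": 10},
}

weights = {
    stratum: inverse_rate_weight(c["population"], c["sampled"])
    for stratum, c in strata.items()
}

for stratum, w in weights.items():
    print(stratum, "weight =", w)
```

Each observation would then carry the weight of its stratum into the weighted regression; as footnote 24 cautions, a maximum likelihood implementation such as WESML may differ in detail from this simple weighting.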
Because the CB sample is non-random, either logit analysis can be used, or reweighting is needed. Some might believe, incorrectly, that having the same number of observations in each of two subsamples is needed, and discard valuable data from one in order to even them up, but that is not necessary. If pair-wise or other matching is not employed, it is not necessary to have equal sizes of case versus comparison samples, so all available data should be used to enhance power in the analysis. (It is true that if additional observations could be collected, it would generally be more beneficial to add data to the smaller sized outcome group; in a narrow sense, equal sizing is most efficient. However, data collection decisions should be based on cost as well as benefit considerations, and it is not appropriate to throw away, unnecessarily, available data for which no additional collection costs need be incurred.)

24 Our characterization may be over-simplified. Greene (2000), summarizing the Weighted Exogenous Sampling Maximum Likelihood (WESML) estimator applied to a CB-NM case, states that weights are to be applied within a weighted log-likelihood expression, i.e. as constants multiplying the logs of each observation’s individual likelihood. Our CB-SM example might better be implemented in maximum likelihood form, which may differ in implementation from how we describe it above. And, Greene notes complications in the estimation of the covariance matrix requiring use of a special estimator.

25 Also, Greene (2004) notes that use of a non-random sample followed by reweighting is not a “free lunch”: “What the biased sampling does, the weighting undoes. It is common for the end result to be very large standard errors, which might be viewed as unfortunate, insofar as the purpose of the biased sampling was to balance the data precisely to avoid this problem” (p. 823). We note that the unbalanced sampling in choice-based studies does at least ensure that observations having each outcome are represented in the sample, which random sampling would not guarantee. Choice-based and matched samples do permit investigation of areas where data collection is costly.

Choice-Based Fully Matched

It is more common now, however, for researchers using a choice-based design also to choose to use matching (as Abdel-khalik and Ajinkya (1979) recommend). If research employs a choice-based fully matched design (CB-FM) that yields equal sized pair-matched comparisons, as 27 of 73 do, all three errors can apply. To utilize the matching and avoid Error 1, pair-wise differences can be taken or, equivalently, an intercept for each pair can be included in analysis (with omission for one pairing or omission of an overall intercept, to permit estimation). Of the 27, 23 do suffer from this error; the four that avoided it do so by employing only univariate pairwise tests (and then they cannot avoid suffering Error 3). For multivariate OLS regressions, including pair-identifier intercepts would be an easy adjustment to avoid Error 1 and to soak up the variance in outcome that relates to the matching variables. Discriminant analysis of matched sample data could be performed correctly by applying analysis to assess the location in the independent variable space of pair-wise differences, and examining that for significant deviation from the origin, as described earlier. Other statistical methods may or may not lend themselves to controlling for matching that found locations of outcome groups in the space of the independent variables. To avoid Error 2, which 21 of the 27 fail to do, researchers must include the imperfectly matched variable in the analysis (e.g. include size or size² as a control variable in a regression analysis). To avoid Error 3, either logit regression or reweighting needs to be applied. Again, for discriminant analysis, we are not aware of any approach that could implement the necessary reweighting.
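For a fully matched design analyzed by logit, one way to take pair-wise differences is conditional logit. For 1:1 pairs, a standard result is that the conditional likelihood reduces to a logistic likelihood with no intercept, evaluated on case-minus-control differences in the covariates. The following is a minimal sketch of that idea with a single covariate, fitted by Newton's method on simulated data. The simulation design (2000 pairs, a true coefficient of 1, a stratum effect that also shifts the covariate) and all function names are assumptions for illustration; this is not the code or data used in our simulations or replications.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def simulate_pairs(n_pairs, beta, rng):
    """Draw one case and one control per stratum; return x_case - x_control."""
    diffs = []
    for _ in range(n_pairs):
        alpha = rng.gauss(0, 1)            # stratum effect shared by the pair
        case_x = control_x = None
        while case_x is None or control_x is None:
            x = alpha + rng.gauss(0, 1)    # covariate correlated with stratum
            y = rng.random() < sigmoid(alpha + beta * x)
            if y and case_x is None:
                case_x = x
            elif not y and control_x is None:
                control_x = x
        diffs.append(case_x - control_x)
    return diffs


def fit_pair_conditional_logit(diffs, max_iter=50, tol=1e-10):
    """Maximize sum(log sigmoid(beta * d)) over beta by Newton's method."""
    beta = 0.0
    for _ in range(max_iter):
        probs = [sigmoid(beta * d) for d in diffs]
        grad = sum(d * (1.0 - p) for d, p in zip(diffs, probs))
        hess = -sum(d * d * p * (1.0 - p) for d, p in zip(diffs, probs))
        step = grad / hess
        beta -= step
        if abs(step) < tol:
            break
    return beta


rng = random.Random(1)
diffs = simulate_pairs(2000, beta=1.0, rng=rng)
beta_hat = fit_pair_conditional_logit(diffs)
print(f"conditional logit estimate of beta: {beta_hat:.3f}")
```

The stratum effects never need to be estimated: they cancel within each pair, which is how the pairing is controlled for. Only the slope is identified in this formulation, consistent with the logit exemption applying to non-intercept coefficients.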

Choice-Based Semi-Matched

If research employs a choice-based semi-matched (CB-SM) design, as 23 of 73 do, then Error 1 and Error 3 may easily apply. Error 2 is conceivable, but not often seen. This is the research design illustrated in simulations 1, 2, and 3 above. Unlike for the CB-FM design, here it is not most appropriate to perform the analysis upon pair-wise differences, even when there is a nominal pairing that could be used for that purpose, as in the simulation. (Note, for semi-matched analysis there does not need to be an equal number of each outcome collected in each stratum. But if an equal number of outcome 1’s and outcome 0’s are collected in each group, as was done in the collection of 50 pairs in the simulation, a researcher might be inclined to take pairwise differences.) Instead, the most efficient way to control for matching is by including just one stratum identifier for each defined stratum (e.g. industry), and hence to pool together the nominal pairings that appear within each stratum. In the simulation, then, effectively, just five industry intercepts are to be estimated, rather than 50 or 100 or 1600 pairwise intercepts, and greater efficiency is achieved. In our review of audit papers, we identify as Semi-matched a number of studies where the researchers themselves present the work as if it were Fully-matched, because it was apparent to us that the researchers’ pairings included groups of pairs that could best be pooled together, whether or not the researchers perceived it that way. Again, to achieve generalizability of results and to avoid Error 3, logit regression must be employed in the analysis or explicit reweighting is needed.

NonChoiceBased Fully Matched Between Subjects

If a researcher employs nonchoice-based fully matched (NCB-FM), as 11 of the 73 audit research papers do, then it is possible that the matching is between-subjects (NCB-FM-B, 6) or within-subjects (NCB-FM-W, 5). If the former, all three errors could possibly apply.
To avoid Error 1, pairing must be accounted for by analysis based on differences or otherwise fully saturated. To avoid Error 2, imperfection in matching must be controlled for by including terms for the imperfectly matched variables in the analysis. To avoid Error 3, if logit regression is not employed, reweighting must be applied.

NonChoiceBased Fully Matched Within Subjects

An example within-subjects experimental design would be a study of firms’ audit fees compared before and after some event. Here, the sampling is not choice-based but it is matched (e.g. a firm-year observation before an event, to the same firm’s firm-year observation after the event). In within-subjects designs, if the subjects are themselves chosen randomly, then there is no issue of non-random selection that would require reweighting to strata proportions. So within-subjects studies are not subject to Error 3. Since there may be sample selection issues that come up in an audit fees within-subjects study (e.g. firms which undergo mergers may drop from the audit fee study), a selection bias concern can rule out the widest generalizations, but the within-subjects sample results can fairly be generalized at least to the larger population of continuing firms. Error 2 also is not possible; there is a perfect matching of each subject (before) with itself (after). To avoid Error 1, pairing must be accounted for in the analysis.

NonChoice-Based Semi-Matched

If research employs nonchoice-based semi-matched (NCB-SM), as 6 of the 73 papers do, analysis should proceed as with the NCB-FM-B, but for the use of fewer intercept dummies (condensing pairs within the same stratum into pools, to have just one intercept estimated more accurately).

Some Considerations in Match Selection

Better matching variables are those which are not of research interest but which are believed to explain variation in outcome and which are cheap to gather, before data of final samples is to be collected.
We observe many examples of matching by closest size, but where it is log-size that is deemed the appropriate form of the variable to include in analysis (e.g. Lys & Watts 1994). It would be more appropriate to use closest log-size in the match selection initially. To perform the matching, on whatever matching variables have been chosen, use software that provides an audit trail and implements a consistent approach. We recommend Kosanke and Bergstralh’s SAS macro for selecting matched sets, described in Bergstralh, Kosanke, and Jacobsen (1991). Parsons (2002) provides an alternative SAS macro that we have not evaluated. There are other perspectives, but we find the arguments of Heckman, Ichimura and Todd (1998) against the use of propensity matching to be compelling. Again, if analysis is to incorporate reweighting, match selection must be applied using only variables for which the strata-level and universal rates are known.

Some Considerations in Choosing Research Designs

In advance of data collection, we recommend use of matched and choice-based sampling plans only when there exist relatively inexpensive-to-gather candidate matching variables and when other control variables and/or the variables of research interest are expensive to collect. Then, a strategically selected matched sample can be more powerful than a similar sized random sample. If data collection costs are not high, a random sample is preferred. If data collection costs are intermediate, it is preferred to collect more than one match for each case observation; the analysis as semi-matched rather than fully matched can easily accommodate the additional information.

6. Summary and Conclusions

Technical errors in the analysis of non-random samples run through accounting research. Controls for matching are not included, although needed (Error 1). As we showed in simulation, incorrect conclusions can then be reached. A lesser error is failure to evaluate the potential effect of imperfect matching (Error 2).
Where the logit exemption from the need for reweighting does not apply, WESML or other reweighting is needed but typically is not applied (Error 3), so presented results are not in fact generalizable.

Our main finding from our review of accounting research is that the vast majority of choice-based and matched papers suffer one or more of the three technical deficiencies. Of the 73 audit research papers we reviewed, 55 need to but do not explicitly control for matching in their analysis, and thus suffer from Error 1. Of the remainder, 6 are choice-based samples but not matched samples (so control for matching is not needed). Only 12 of the 73 are matched samples where researchers correctly controlled for the matching. Our most urgent guidance to researchers then is to either avoid use of matching, or to take the matching into account when analyzing the data. If matching is not taken into account, by either evaluating pairwise differences, or by including dummy variables for each matched set, then the research should not be accepted by the field.

On the second technical criticism we make, which can only apply to the 38 fully matched research designs within the 73 audit research papers, we note that 30 of the 38 papers suffer from lack of explicit control for “closest” imperfect matching. We advise researchers either to avoid imperfect matching, or to perform and report sensitivity analyses on how imperfection in the matching might have influenced outcomes. A closest-matched variable such as size can still have influence. It might be controlled for by including a linear term. But, as the contribution of size or another variable might be non-linear, in general, there is no fully satisfactory resolution. The researcher, we argue, must make some effort to examine the possibility that all results are driven by the omitted effect. Sensitivity analyses including linear and quadratic terms, for example, might be performed and discussed.
Otherwise, the researcher has not established that other reported effects are not merely the result of an omitted variable problem.

Our third technical criticism, the need for reweighting, can apply to only 42 of the 73 papers. We note that 40 of these 42 are in error for performing statistical analysis without necessary reweighting. A large fraction of the other papers, 25 of the 73, can be regarded as largely exempt from the need to perform reweighting because they used logit regression. When logit is used, inferences based on non-intercept coefficients can be correct, assuming there are not other methodological problems present. (However, of these 25 papers, only the 4 that avoided use of matching do not themselves suffer from Error 1.) Only two of the papers applied an explicit reweighting. To avoid this criticism, we suggest that choice-based and matched sampling should be avoided unless explicit sampling rate information can be obtained (allowing for explicit reweighting) or unless logit regression will suffice to analyze the research questions (taking advantage of the logit exemption to the need for reweighting).

How important are the errors we identify for the course of accounting research? It is possible that research streams have been misdirected due to mistaken identification of effects that are not true, or due to mistaken findings that are reversed from true effects. We suspect that the most common effect may be the failure to find effects that in fact appear to be true, as illustrated in our replication of Ghicas (1990). Burgstahler (1987) argues that if tests reported in the accounting literature are characterized by low power (as would be the case for misanalyzed choice-based and matched samples) and high effective levels (i.e. if there is a bias to publish significant findings), the results of published tests properly should have little or no impact on the beliefs of a Bayesian.
It is common knowledge that papers with no result (finding insufficient evidence to reject null hypotheses) are much less likely to be published; Greenwald (1975) discusses such publication bias and its consequences. This supports a severe view that accounting research involving misanalysed non-random samples, published over a very long period, should be disregarded.

We hope that researchers recognize new opportunities for research projects from this work. First, there are many opportunities to reconsider published results in audit research and other areas where choice-based and matched samples have been used. Many researchers might now salvage studies unpublished previously for reason of unexplainable anomalies or for lack of statistically significant results. And future work may now exploit greater-than-previously-understood power of choice-based and matched sampling methods, when correctly analysed, in appropriate settings.

References

Abdel-khalik, A. R. and B. B. Ajinkya. 1979. Empirical Research in Accounting: A Methodological Viewpoint. Sarasota, FL: American Accounting Association.
Abrevaya, J. 1996. The Equivalence of Two Estimators of the Fixed-Effects Logit Model. Economics Letters 55: 41-43.
Agresti, A. 2002. Categorical Data Analysis. 2nd edition. Hoboken, NJ: John Wiley & Sons, Inc.
Altman, E. I. 1968. Financial Ratios as Predictors of Failure. Journal of Finance 23(4): 589-609.
Anderson, J. A. 1972. Separate Sample Logistic Discrimination. Biometrika 59: 19-35.
Barber, B. M. and J. D. Lyon. 1996. Detecting Abnormal Operating Performance: The Empirical Power and Specification of Test Statistics. Journal of Financial Economics 41(3): 359-399.
Barber, B. M. and J. D. Lyon. 1997. Detecting Long-run Abnormal Stock Returns: The Empirical Power and Specification of Test Statistics. Journal of Financial Economics 43(3): 341-372.
Bartov, E., F. A. Gul, and J. S. L. Tsui. 2001. Discretionary-Accruals Models and Audit Qualifications.
Journal of Accounting and Economics 30: 421-452.
Beaver, W. H. 1966. Financial Ratios as Predictors of Failure. Empirical Research in Accounting: Selected Studies, 1966, supplement to Journal of Accounting Research: 71-111.
Beaver, W. H. 1968. Market Prices, Financial Ratios, and the Prediction of Failure. Journal of Accounting Research 6(2): 179-192.
Bhojraj, S. and C. M. C. Lee. 2002. Who is my peer? A valuation-based approach to the selection of comparable firms. Journal of Accounting Research 40: 407-439.
Breslow, N. E. 1996. Statistics in Epidemiology: The Case-Control Study. Journal of the American Statistical Association 91(433): 14-28.
Breslow, N. E. and N. E. Day. 1980. Statistical Methods in Cancer Research: Volume I: The Analysis of Case-Control Studies. The International Agency for Research on Cancer, Lyon, France.
Breslow, N. E. and N. E. Day. 1987. Statistical Methods in Cancer Research: Volume II: The Design and Analysis of Cohort Studies. The International Agency for Research on Cancer, Lyon, France.
Burgstahler, D. 1987. Inference from Empirical Research. The Accounting Review 62(1): 203-214.
Burgstahler, D., J. Jiambalvo, and E. Noreen. 1989. Changes in the Probability of Bankruptcy and Equity Value. Journal of Accounting & Economics 11(2,3): 207-224.
Campbell, D. T. and J. C. Stanley. 1966. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally & Company.
Deakin, E. B. 1972. A Discriminant Analysis of Predictors of Business Failure. Journal of Accounting Research, Spring: 167-179.
Dietrich, J. 2001. The Effects of Choice-Based Sampling and Small-sample Bias on Past Fair Lending Exams. Working paper, Office of The Comptroller of The Currency, Department of The Treasury, Washington, DC.
Dopuch, N., R. W. Holthausen, and R. W. Leftwich. 1987. Predicting Audit Qualifications with Financial and Market Variables. The Accounting Review LXII(3): 431-454.
Ghicas, D. 1990.
Determinants of Actuarial Cost Method Changes for Pension Accounting and Funding. The Accounting Review (April): 384-405.
Giles, J. A., and M. J. Courchane. 2000. Stratified Sampling Design for Fair Lending Binary Logit Models. Working paper.
Greenwald, A. G. 1975. Consequences of Prejudice Against the Null Hypothesis. Psychological Bulletin 82(1): 1-20.
Harrison, T. 1977. Different Market Reactions to Discretionary and Nondiscretionary Accounting Changes. Journal of Accounting Research 15(1): 84-107.
Heckman, J. J. 2002. Unpublished notes.
Heckman, J. J., H. Ichimura, and P. Todd. 1998. Matching as an Econometric Evaluation Estimator. The Review of Economic Studies 65(2): 261-294.
Hillegeist, S., E. K. Keating, D. P. Cram, and K. G. Lundstedt. 2004. Assessing the Probability of Bankruptcy. Review of Accounting Studies 9: 5-34.
Hosmer, D. W. and S. Lemeshow. 1988. Applied Logistic Regression. NY: John Wiley & Sons.
Johnson and Bhattacharyya. 1985. Statistics: Principles and Methods. NY: John Wiley & Sons.
Kerlinger. 1973. [cited in Abdel-Khalik and Ajinkya, details to be added]
Kinney, W. R. 1986. Empirical Accounting Research Design for Ph.D Students. The Accounting Review 61(2): 338-350.
Kothari, S. P., A. J. Leone and C. E. Wasley. 2005. Performance Matched Discretionary Accrual Measures. Journal of Accounting and Economics 39:1.
Lys, T. and R. Watts. 1994. Lawsuits Against Auditors. Journal of Accounting Research 32 (Supplement): 65-93.
Mack, T. M., M. C. Pike, B. E. Henderson, R. I. Pfeffer, V. R. Gerkins, M. Arthur, and S. E. Brown. 1976. Estrogens and Endometrial Cancer in a Retirement Community. The New England Journal of Medicine 294: 1262-1267.
Maddala, G. S. 1991. A Perspective on the Use of Limited-Dependent and Qualitative Variables Models in Accounting Research. The Accounting Review 66(4): 788-807.
Maddala, G. S. 1996. Applications of Limited Dependent Variable Models in Finance. Handbook of Statistics 14: 553-566.
Maddala, G. S. 1983.
Limited-Dependent and Qualitative Variables in Econometrics. Cambridge University Press. Manski, C. F. and S. R. Lerman. 1977. The Estimation of Choice Probabilities from Choice Based Samples. Econometrica 45 (November): 1977-88. Manski, C. F. and D. McFadden. 1981. Alternative Estimators and Sample Designs for Discrete Choice Analysis, in: C. F. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications. MIT Press, Cambridge, MA. McNichols, M. and A. Dravid. 1990. Stock dividends, Stock Splits, and Signalling. Journal of Finance 45 (July): 857-79. Palepu, K. G. 1986. Predicting Takeover Targets: a Methodological and Empirical Analysis. Journal of Accounting and Economics 8: 3-35. Parsons, L. S. 2002. Reducing Bias in a Propensity Score Matched-Pair Sample Using Greedy Matching Techniques. SAS Users’ Group Conference Proceedings. Paper 214-26. Prentice, R. L. and R. Pyke. 1979. Logistic Disease Incidence Models and Case-Control Studies. Biometrika 66(3): 403-11. Rice, J. A. 1995. Mathematical Statistics and Data Analysis. 2nd edition. Belmont, Ca.: Duxbury Press, International Thomson Publishing. SAS Version 9.1, 2003. Software Documentation, SAS/STAT PROC LOGISTIC, Example 42.10. Schlesselman, J. J. 1982. Case-Control Studies: Design, Conduct, Analysis. Oxford University Press, Oxford. Scott and Wild. 1991. Fitting Logistic Models in Stratified Case-Control Studies. Biometrics 47: 497-510. Smith, M. 2003. Research Methods in Accounting. Sage Publications. Tatsuoka, M.M. 1971. Multivariate Analysis: Techniques for Educational and Psychological Research. New York: John Wiley & Sons, Inc. Wallace, J. S. 1997. Adopting Residual Income-Based Compensation Plans: Do You Get What You Pay For? Journal of Accounting and Economics. 24: 275-300. Zmijewski, M. 1984. Methodological Issues Related to the Estimation of Financial Distress Prediction Models. Journal of Accounting Research 22: 59-82. 
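Figure 1: Research Design Categories for Choice Based and Matched Samples

[Diagram: sets labeled "Choice Based", "Matched", and "Fully Matched", with regions CB-NM, CB-SM, CB-FM, NCB-SM, NCB-FM-W, and NCB-FM-B.]

| Category | Description | Example Audit Research | Count in Category |
|---|---|---|---|
| CB-NM | Choice Based Non-Matched | Palepu (1986) | 5 |
| CB-SM | Choice Based Semi-Matched | Heninger (2001) | 23 |
| CB-FM | Choice Based Fully Matched | Lys and Watts (1994) | 27 |
| NCB-FM-W | Non Choice Based Fully Matched Within-Subject | Teoh and Wong (1993) | 6 |
| NCB-FM-B | Non Choice Based Fully Matched Between Subject | Iyer and Iyer (1996) | 7 |
| NCB-SM | Non Choice Based Semi-Matched | Krishnan (2003) | 5 |
| Total | | | 73 |

Table 1: Comparison of Methods Applied to Simulated Data

Simulation 1: True $\beta_1 = 1$, $\beta_2 = 1$, correlation = .4

| Method | Coefficient | 50 pr | 100 pr | 200 pr | 400 pr | 800 pr | 1600 pr | All Data |
|---|---|---|---|---|---|---|---|---|
| Unmatched Analysis | $\hat{\beta}_1$ | -0.083 | -0.004 | 0.045 | 0.062** | 0.083*** | 0.077*** | 0.086*** |
| Unmatched Analysis | $\hat{\beta}_2$ | 1.562*** | 1.441*** | 1.208*** | 1.324*** | 1.250*** | 1.137*** | 1.204*** |
| Conditional Logit Using Pairings | $\hat{\beta}_1$ | 0.886** | 0.833*** | 0.956*** | 1.159*** | 1.032*** | 0.989*** | N.A. |
| Conditional Logit Using Pairings | $\hat{\beta}_2$ | 1.898*** | 1.480*** | 1.131*** | 1.348*** | 1.191*** | .966*** | N.A. |
| Conditional Logit Using "Industry" Groupings | $\hat{\beta}_1$ | 0.668** | 0.802*** | 0.906*** | 0.967*** | 1.004*** | 0.967*** | 0.977*** |
| Conditional Logit Using "Industry" Groupings | $\hat{\beta}_2$ | 1.622*** | 1.413*** | 1.106*** | 1.209*** | 1.096*** | 0.949*** | 1.019*** |

Simulation 2: True $\beta_1 = 1$, $\beta_2 = 0$, correlation = .4

| Method | Coefficient | 50 pr | 100 pr | 200 pr | 400 pr | 800 pr | 1600 pr | All Data |
|---|---|---|---|---|---|---|---|---|
| Unmatched Analysis | $\hat{\beta}_1$ | 0.075 | 0.081 | 0.089** | 0.093*** | 0.090*** | 0.082*** | 0.087*** |
| Unmatched Analysis | $\hat{\beta}_2$ | 0.131 | 0.424*** | 0.423*** | 0.337*** | 0.318*** | 0.328*** | 0.303*** |
| Conditional Logit Using Pairings | $\hat{\beta}_1$ | 0.772*** | 0.865*** | 0.959*** | 0.926*** | 0.972*** | 0.988*** | N.A. |
| Conditional Logit Using Pairings | $\hat{\beta}_2$ | -0.117 | 0.150 | 0.106 | 0.051 | 0.059 | 0.079* | N.A. |
| Conditional Logit Using "Industry" Groupings | $\hat{\beta}_1$ | 0.854*** | 0.857*** | 0.944*** | 0.910*** | 0.935*** | 0.947*** | 0.986*** |
| Conditional Logit Using "Industry" Groupings | $\hat{\beta}_2$ | -0.109 | 0.193 | 0.127 | 0.065 | 0.050 | 0.062 | -0.002 |

Simulation 3: True $\beta_1 = -4$, $\beta_2 = 1$, correlation = .4

| Method | Coefficient | 50 pr | 100 pr | 200 pr | 400 pr | 800 pr | 1600 pr | All Data |
|---|---|---|---|---|---|---|---|---|
| Unmatched Analysis | $\hat{\beta}_1$ | -0.164** | -0.151*** | -0.162*** | -0.164*** | -0.160*** | -0.157*** | -0.149*** |
| Unmatched Analysis | $\hat{\beta}_2$ | -0.207 | -0.328** | -0.198** | -0.175** | -0.217*** | -0.254*** | -0.180*** |
| Conditional Logit Using Pairings | $\hat{\beta}_1$ | -4.749*** | -4.877*** | -4.485*** | -3.829*** | -4.021*** | -4.192*** | N.A. |
| Conditional Logit Using Pairings | $\hat{\beta}_2$ | 0.932 | 0.801* | 0.810*** | 0.917*** | 0.869*** | 0.865*** | N.A. |
| Conditional Logit Using "Industry" Groupings | $\hat{\beta}_1$ | -5.945*** | -4.070*** | -4.161*** | -4.129*** | -3.926*** | -4.013*** | -3.946*** |
| Conditional Logit Using "Industry" Groupings | $\hat{\beta}_2$ | 0.641 | 0.529* | 0.880*** | 1.025*** | 0.902*** | 0.905*** | 0.996*** |

*,**,*** = significant at .10, .05, .01 level

[Plots: Graphical Presentation of Final Coefficient Estimates, Simulated Data in Five Industry Groupings. One panel per simulation, with axes Est Beta1 and Est Beta2 and series labeled Unmatched Analysis, Strata on Pairs, and With Industry Fixed Effects. Labeled points: Simulation 1, True (1,1), Unmatched (.09, 1.2); Simulation 2, True (1,0), Unmatched (.09, .3); Simulation 3, True (-4,1), Unmatched (-.15, -.2).]

Table 2: Logit Regressions Analyzed in Replication of Mack et al (1976)

| | Panel A: SAS Reported Results | | Panel B: Replication: Conditional Logit | | Panel C: Replication: Unmatched Analysis | |
|---|---|---|---|---|---|---|
| | Coefficient | P-value | Coefficient | P-value | Coefficient | P-value |
| Intercept | | | | | -0.3013 | 0.2242 |
| Gall | .9704 | 0.0675 | .9704 | 0.0675 | 0.8417 | 0.0770 |
| Hyper | .3481 | 0.3558 | .3481 | 0.3558 | 0.3682 | 0.3264 |

| | Panel A: Odds Ratio | low | high | Panel B: Odds Ratio | low | high | Panel C: Odds Ratio | low | high |
|---|---|---|---|---|---|---|---|---|---|
| Gall | 2.639 | .933 | 7.468 | 2.639 | .933 | 7.468 | 2.258 | .915 | 5.571 |
| Hyper | 1.416 | .677 | 2.965 | 1.416 | .677 | 2.965 | 1.445 | .693 | 3.015 |

N: 63 pairs (conditional logit); 126 observations (unmatched analysis).

Panel A: SAS Version 9.1 (2003) Example 42.10: Conditional Logistic Regression for Matched Pairs Data.
Panel B: Data analyzed using conditional logit matches SAS Example 42.10 of Panel A exactly.
Panel C: Reanalysis using unmatched analysis. The apparent finding is that the Gall variable is significant and Hyper is not, as before. However, significance is slightly lower, and the Odds Ratio for Gall is estimated at 2.258, 40% lower.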
Gall = Gall bladder condition; Hyper = Hypertension condition; Bold = significant at .10 level

Table 3: Logit Regressions Analyzed in Ghicas (1990) Replication

| | Panel A: Ghicas' Reported Results | | Panel B: Replication: Unmatched Analysis | | Panel C: Replication: Conditional Logit | |
|---|---|---|---|---|---|---|
| | Coefficient | P-value | Coefficient | P-value | Coefficient | P-value |
| Intercept | -2.880 | 0.31 | 0.713 | 0.76 | | |
| PR | 1.036 | 0.06 | 0.210 | 0.39 | 0.250 | 0.35 |
| IR | 0.785 | 0.01 | 0.484 | 0.02 | 0.735 | 0.00 |
| LEV | 3.632 | 0.12 | 1.698 | 0.40 | 1.716 | 0.53 |
| WC | -1.788 | 0.03 | -1.188 | 0.10 | -0.847 | 0.28 |
| RUNI | -31.547 | 0.01 | -21.983 | 0.06 | -4.528 | 0.61 |
| INT | 14.43 | 0.01 | 9.376 | 0.09 | 1.090 | 0.74 |
| ln(TA) | -0.845 | 0.11 | -0.378 | 0.08 | -0.726 | 0.01 |
| ETR | 1.625 | 0.20 | 0.098 | 0.69 | 0.160 | 0.77 |
| CI | -0.338 | 0.46 | 0.195 | 0.20 | 0.189 | 0.09 |
| FFO | 3.109 | 0.34 | 0.752 | 0.70 | 0.679 | 0.75 |
| N | 90 | | 86 | | 86 | |

Panel A: Ghicas (1990), Table 4, p. 397.
Panel B: We reconstructed Ghicas' data to the extent possible, yielding 43 usable pairs. The unmatched logit regression yields results substantially similar to those reported by Ghicas.
Panel C: Data analyzed using conditional logit shows changes in results. Results for variable IR were now found to be highly significant at a .001 level. Results for variables WC, RUNI, and INT now show no significant effects. Variable LOGTA (ln(TA) in the table), a proxy for political visibility, and central to Ghicas' hypotheses, was not found by Ghicas to be significant. However, conditional logit regression shows this variable to be significant at a .02 level.

PR = Pension Assets / Pension Liabilities
IR = Interest Rate, used for the computation of pension liabilities per SFAS No. 36
Bold = significant at .10 level

Variable definitions, with numbers indicating Compustat item numbers:
LEV = Long-term Debt, #9 / (Total Assets, #6 - Intangible Assets, #33);
WC = Current Assets, #4 / Current Liabilities, #5;
RUNI = (Capital Expenditures, #128 + Acquisitions, #129 + Advertising, #45 + R&D, #46) / Total Assets, #6;
INT = WC * RUNI;
TA = Total Assets, #6 (in millions);
ETR = (Tax Expense, #16 - Change in Deferred Taxes, #35) / Funds Flow from Operations, #35;
CI = $(I_t - I_{t-1}) / I_{t-1}$, where $I_t$ = Income for year t before extraordinary items and discontinued operations, #18;
FFO = Funds Flow from Operations, #110 / Sales, #12.
Table 4: Description of Research Design and Potential Errors

Counts are for papers in the audit research sample. "N/A" indicates that the error does not apply to the design.

CB-FM (27 papers)
Treatment group: Selected on basis of outcome.
Control group: One firm selected as match for each firm in treatment group from set of firms having similar characteristics, by matching on "closest" values.
Error 1: 23. Error 2: 21. Error 3: 18 (26*).
Guidance: If logit, run conditional logit with pairs identified as strata. If OLS, include dummy variables for pairs, and reweight each observation for its sampling rate (i.e., apply WESML).

CB-SM (23 papers)
Treatment group: Selected on basis of outcome.
Control group: Randomly selected from firms not having the same outcome, but matching by industry, year, size or group level.
Error 1: 21. Error 2: 0. Error 3: 8 (22*).
Guidance: If logit, run conditional logit with groups identified as strata. If OLS, include dummy variables for groups and apply WESML.

CB-NM (5 papers)
Treatment group: Selected on basis of outcome.
Control group: Randomly selected from firms not having the same outcome.
Error 1: N/A. Error 2: N/A. Error 3: 4.
Guidance: If logit, run regular logit, and only the intercept is biased. If not logit, apply WESML.

NCB-FM-W (6 papers)
Treatment group: Randomly selected sample of firms.
Control group: Same subject, usually before and after.
Error 1: 2. Error 2: N/A. Error 3: N/A.
Guidance: If OLS, include pair-identifier dummies or analyse as differences-on-differences. If MANOVA, block on subject. Univariate comparisons okay. WESML not required.

NCB-FM-B (7 papers)
Treatment group: Randomly selected sample of firms.
Control group: One firm selected as match for each firm in treatment group from set of firms having similar characteristics, by matching on "closest" values.
Error 1: 4. Error 2: 6. Error 3: 7.
Guidance: If OLS, include pair-identifier dummies and linear (and perhaps more) terms for imperfectly matched variables, or analyse as differences-on-differences including differences of imperfectly matched variables. WESML required.

NCB-SM (5 papers)
Treatment group: Randomly selected sample of firms.
Control group: Randomly selected from firms not having the same outcome, but matching by industry, year, size or group level.
Error 1: 5. Error 2: 0. Error 3: 5.
Guidance: If OLS, include group dummies. If MANOVA, block on groups. WESML required.

Total (73 papers)
Error 1: 55. Error 2: 27. Error 3: 42 (64*).

Error 1: Count of audit papers that use unconditional analysis, when analysis conditional upon effects of matching variables is needed.
Error 2: Count of audit papers that fail to control for effect of imperfectly matched variables.
Error 3: Count of audit papers that fail to reweight observations according to appropriate sampling rates, not including logit papers with unsaturated models.
*Count including logit papers with unsaturated models.

Appendix A

This appendix provides proof that coefficient estimates in a logit regression on one-to-one matched pairs data are correctly analysed either by a no-intercept logit regression on pairwise differences, or, equivalently, by a pooled no-intercept logit regression having dummy variables indicating pair memberships. It follows that an unmatched pooled logit regression (as has been routinely employed in practice) is misspecified.26 Specifically, the proof establishes that the relative magnitudes of coefficients are estimated correctly each way. The corresponding standard errors and p-values, however, are estimated correctly only by the former approach, but proof is not herein provided. (See Abrevaya (1996) for explication of this complication.)

Suppose that a population of data exists where the following logistic model relationship is true:

$$y = 1 \ \text{if} \ \alpha_j + x\beta + \varepsilon > 0; \qquad y = 0 \ \text{otherwise}$$

where $\alpha_j$ is an intercept specific to the $j$th of $J$ strata in the population, $x\beta$ is the vector product of coefficients and independent variables, and $\varepsilon$ is a logistic distributed error term having mean 0 and variance $\sigma^2$. The $j$th stratum is a subpopulation having uniform measurements on industry and size or other combination of factors that influence the outcome through $\alpha_j$. From the population, suppose that $n$ matched pairs of data are randomly selected without replacement, i.e., where each pair of observations is matched only in that both are selected from within the same stratum.
Note that the selection is not outcome-based; both outcomes in a pair might be the same. Denote the sample data as follows:

$$\{(y_{11}, x_{11}, y_{12}, x_{12}),\ (y_{21}, x_{21}, y_{22}, x_{22}),\ \ldots,\ (y_{n1}, x_{n1}, y_{n2}, x_{n2})\}$$

where $y_{i1}$ and $y_{i2}$ denote the paired 0-1 outcomes for the $i$th matched pair and vectors $x_{i1}$ and $x_{i2}$ denote the corresponding sets of values of the explanatory variable(s).

26 The proof follows Agresti (2002)'s notation and suggestions for extension from a simpler setting that he presents. The basic result for coefficients is attributed to Anderson (1972) and extended by Prentice and Pyke (1979) and others; see Breslow (1996) for a review.

The likelihood function for the sample is as follows, from which the proof will follow directly:

$$L^* = \prod_{i=1}^{n} \left[ \left( \frac{\exp(\alpha_i + x_{i1}\beta)}{1 + \exp(\alpha_i + x_{i1}\beta)} \right)^{y_{i1}} \left( \frac{1}{1 + \exp(\alpha_i + x_{i1}\beta)} \right)^{1 - y_{i1}} \left( \frac{\exp(\alpha_i + x_{i2}\beta)}{1 + \exp(\alpha_i + x_{i2}\beta)} \right)^{y_{i2}} \left( \frac{1}{1 + \exp(\alpha_i + x_{i2}\beta)} \right)^{1 - y_{i2}} \right] \qquad \text{[A1]}$$

where $L^*$ denotes the likelihood of the sample, and the $i$th expression in square brackets expresses the probability that the $i$th pair would have the outcomes that are observed for it. The likelihood expression provides for intercepts $\alpha_i$, $i = 1$ to $n$, and coefficient vector $\beta$, and may be maximized to yield maximum likelihood estimates for these parameters directly using iterative search methods. Note that this is in the form of the estimation of a pooled logit regression with an intercept/dummy variable for each pair, and hence we have shown that logit regression with dummy variables is appropriate for the assumed matched sample setting. Maximizing expression [A1] will yield coefficient estimates that are correct in their relative magnitudes; the scaling in logit software implementations is arbitrarily chosen by normalizing the scale of the unobservable $\varepsilon$.
As the purpose of estimation is to determine $\beta$, it is useful to note that there is no information about $\beta$ available in pairs where both outcomes are 1 or both outcomes are 0, because the distribution of $(y_{i1}, x_{i1}, y_{i2}, x_{i2})$ depends on $\beta$ only when the pairwise success total $S_i \equiv y_{i1} + y_{i2}$ equals one. If $S_i = 0$ or $S_i = 2$, the value of $\alpha_i$ may be set arbitrarily large or small, depending on the sign of $\beta$, so that the $i$th pair's contribution to the expression above is arbitrarily close to one; hence varying $\hat{\beta}$ in the maximum likelihood search will not affect that pair's contribution to the likelihood. Restricting ourselves then to pairs having opposing outcomes, i.e. where $S_i = 1$, we can write:

$$P(Y_{i1} = 0, Y_{i2} = 1 \mid S_i = 1) + P(Y_{i1} = 1, Y_{i2} = 0 \mid S_i = 1) = 1$$

and

$$P(Y_{i1} = y_{i1}, Y_{i2} = y_{i2} \mid S_i = 1) = \frac{P(Y_{i1} = y_{i1}, Y_{i2} = y_{i2})}{P(Y_{i1} = 0, Y_{i2} = 1) + P(Y_{i1} = 1, Y_{i2} = 0)}$$

Expanding the last expression out to reflect the influence of independent variables, using the usual logit formulae, we obtain:

$$P(Y_{i1} = y_{i1}, Y_{i2} = y_{i2} \mid S_i = 1) = \frac{\left( \dfrac{\exp(\alpha_i + x_{i1}\beta)}{1 + \exp(\alpha_i + x_{i1}\beta)} \right)^{y_{i1}} \left( \dfrac{1}{1 + \exp(\alpha_i + x_{i1}\beta)} \right)^{1 - y_{i1}} \left( \dfrac{\exp(\alpha_i + x_{i2}\beta)}{1 + \exp(\alpha_i + x_{i2}\beta)} \right)^{y_{i2}} \left( \dfrac{1}{1 + \exp(\alpha_i + x_{i2}\beta)} \right)^{1 - y_{i2}}}{\dfrac{1}{1 + \exp(\alpha_i + x_{i1}\beta)} \cdot \dfrac{\exp(\alpha_i + x_{i2}\beta)}{1 + \exp(\alpha_i + x_{i2}\beta)} + \dfrac{\exp(\alpha_i + x_{i1}\beta)}{1 + \exp(\alpha_i + x_{i1}\beta)} \cdot \dfrac{1}{1 + \exp(\alpha_i + x_{i2}\beta)}} \qquad \text{[A2]}$$

The expression is the probability that the sample outcomes would be observed, given that opposing outcomes are observed. Without loss of generality, we can reorder within any pairs where necessary so that it is the first outcome in the pair that is zero, i.e. so that $y_{i1} = 0$ and $y_{i2} = 1$. Then [A2] above simplifies to:

$$P(Y_{i1} = 0, Y_{i2} = 1 \mid S_i = 1) = \frac{\dfrac{1}{1 + \exp(\alpha_i + x_{i1}\beta)} \cdot \dfrac{\exp(\alpha_i + x_{i2}\beta)}{1 + \exp(\alpha_i + x_{i2}\beta)}}{\dfrac{1}{1 + \exp(\alpha_i + x_{i1}\beta)} \cdot \dfrac{\exp(\alpha_i + x_{i2}\beta)}{1 + \exp(\alpha_i + x_{i2}\beta)} + \dfrac{\exp(\alpha_i + x_{i1}\beta)}{1 + \exp(\alpha_i + x_{i1}\beta)} \cdot \dfrac{1}{1 + \exp(\alpha_i + x_{i2}\beta)}}$$

Cancelling the common factor $\left[(1 + \exp(\alpha_i + x_{i1}\beta))(1 + \exp(\alpha_i + x_{i2}\beta))\right]^{-1}$ and dividing above and below by $\exp(\alpha_i + x_{i1}\beta)$, the above simplifies to:

$$= \frac{\dfrac{\exp(\alpha_i + x_{i2}\beta)}{\exp(\alpha_i + x_{i1}\beta)}}{1 + \dfrac{\exp(\alpha_i + x_{i2}\beta)}{\exp(\alpha_i + x_{i1}\beta)}} = \frac{\exp\big((\alpha_i + x_{i2}\beta) - (\alpha_i + x_{i1}\beta)\big)}{1 + \exp\big((\alpha_i + x_{i2}\beta) - (\alpha_i + x_{i1}\beta)\big)} \qquad \text{[A3]}$$

Now, in one more algebraic step we see that the pair-identifier intercepts can be dropped out, as this further simplifies to:

$$= \frac{\exp\big((x_{i2} - x_{i1})\beta\big)}{1 + \exp\big((x_{i2} - x_{i1})\beta\big)} \qquad \text{[A4]}$$

Note that this is in the form of a logistic regression across pairs $i$, with no intercept, with predictor values $x_i^* = x_{i2} - x_{i1}$, and with artificial response $y_i^* = 1$ for every observation. Thus, we have proven the equivalence between the no-intercept logit regression of pair-wise differences in outcome (all 1's) on differences in explanatory variables, and the pooled logit regression including an intercept/dummy variable for each pair, because as noted above the estimation may be performed directly on [A1]. Maximizing [A4] yields coefficient estimates that are the same in relative magnitudes as maximization of [A1]. In practice, however, software implementation with dummy variables as in [A1] will yield coefficient estimates that are twice as large as those from implementing [A4], and will report corresponding standard errors and p-values that are incorrect. (Again, Abrevaya (1996) provides explanation.) The correct software implementation is applied by SAS software's PROC LOGISTIC with use of its STRATA statement, or by STATA software's CLOGIT command.

Appendix B

This appendix provides proof that choice-based matched sampling generally requires modification of the usual estimation methods, but that in the binary, ordered, or multinomial logit regression setting the data may be analysed by the usual logit regression provided a fully saturated model is employed.27 First, let us consider random sampling, and thereafter how non-random sampling and its analysis differ.
The likelihood function in a random sampling scheme is:

$$L = \prod_{i=1}^{I} f(Y_i, X_i, Z_i) = \prod_{i=1}^{I} f(Y_i \mid X_i, Z_i, \beta)\, h(X_i, Z_i)$$

where $Y$ is the dependent variable, $X$ is research variables, $Z$ is nuisance variables, and $f$ and $h$ are joint density functions.

27 This understanding is due to Scott and Wild (1991). This proof is informed by unpublished class lecture notes by Heckman (2002) that addressed a simpler case.

Or, in logarithm form:

$$\ln L = \sum_{i=1}^{I} \ln f(Y_i \mid X_i, Z_i, \beta) + \sum_{i=1}^{I} \ln h(X_i, Z_i) \qquad \text{[B1]}$$

First order conditions for estimation are found by differentiating with respect to $\beta$ and setting the result equal to zero:

$$\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{I} \frac{\partial \ln f(Y_i \mid X_i, Z_i, \beta)}{\partial \beta} = 0 \qquad \text{[B2]}$$

Note that in [B2] the second summation term in [B1] has dropped out, due to exogeneity, given random sampling. In the case of binary logistic regression, where $Y$ is constrained to 1's and 0's and $f$ is the logistic density function, this simplifies to:

$$\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{I} \left( y_i - \frac{\exp(x_i\beta)}{1 + \exp(x_i\beta)} \right) x_i = 0 \qquad \text{[B3]}$$

It may be instructive to observe that, for $x_i$ including a constant, [B3] implies that the average of predicted probabilities must equal the proportion of 1's in the sample.

For an endogenous sampling scheme, instead, such as choice-based sampling, the likelihood function is:

$$L = \prod_{i=1}^{I} f(Y_i \mid X_i, Z_i, \beta)\, h(X_i, Z_i)\, \frac{C(y_i)}{g(y_i)}$$

where $g(y_i)$ is the sampling rate for the given outcome $y_i$. In logarithm form:

$$\ln L = \sum_{i=1}^{I} \ln f(Y_i \mid X_i, Z_i, \beta) + \sum_{i=1}^{I} \ln h(X_i, Z_i) + \sum_{i=1}^{I} \ln C(Y_i) - \sum_{i=1}^{I} \ln g(Y_i) \qquad \text{[B4]}$$

and the first order conditions are:

$$\frac{\partial \ln L}{\partial \beta} = \sum_{i=1}^{I} \frac{\partial \ln f(Y_i \mid X_i, Z_i, \beta)}{\partial \beta} - \sum_{i=1}^{I} \frac{\partial \ln g(Y_i)}{\partial \beta} = 0 \qquad \text{[B5]}$$

Note that estimators using just the first term in [B5], as is done for random sampling, will in general be biased under choice-based sampling. It would be possible under any maximum likelihood estimation approach to use explicit sampling rates as here.
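The property noted after [B3] for the random-sampling case can be checked numerically. The following is a minimal sketch, not the authors' code: the sample size, the coefficient values, and the helper name `fit_logit` are illustrative assumptions, and the fit uses a plain Newton-Raphson iteration on the score in [B3].

```python
import numpy as np

def fit_logit(X, y, iters=25):
    """Hypothetical helper: Newton-Raphson maximum likelihood for binary logit."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        score = X.T @ (y - p)                          # left-hand side of [B3]
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X  # negative second derivative
        beta = beta + np.linalg.solve(hessian, score)
    return beta

# Simulated random sample (all values below are assumptions for illustration).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])                   # x_i includes a constant
true_beta = np.array([-1.0, 0.8])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

beta_hat = fit_logit(X, y)
p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))

score_at_mle = X.T @ (y - p_hat)   # should be numerically zero, per [B3]
mean_predicted = p_hat.mean()      # average of predicted probabilities
share_of_ones = y.mean()           # proportion of 1's in the sample
```

Because the first column of `X` is a constant, the first component of the score is the sum of `y - p_hat`, so setting it to zero forces `mean_predicted` to equal `share_of_ones` up to numerical precision.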
We will deduce that in the case of binary logistic regression on a matched sample, however, the second term is zero for each coefficient other than intercepts in a fully saturated model, i.e. for clusters in semi-matched samples or for pairs in fully-matched samples. If logistic regression is not used, then the estimation must incorporate the weighting as here.

In a choice-based matched sample, data are not randomly sampled but rather follow this scheme:
1. Draw choice $D = d$ and industry $Z = z$ by $\varphi(d, z)$.
2. Draw $X$ by $f(X \mid d, z)$.

The joint density of the sampled data is then:

$$f^*(X, d, z) = \varphi(d, z)\, f(X \mid d, z) \qquad \text{(B6)}$$

Suppose that outcomes $d$ range from 0 to $I$. Suppose industries $z$ range from 1 to $J$. The observed sample distribution of $X$ is:

$$g^*(X) = \sum_{d=0}^{I} \sum_{z=1}^{J} f(X \mid d, z)\, \varphi(d, z)$$

and the probability in the sample of observing $d$ and $z$ given $X$ is:

$$\Pr{}^*(D = d, Z = z \mid X) = \frac{f(X \mid D = d, Z = z)\, \varphi(D = d, Z = z)}{g^*(X)} \qquad \text{(B7)}$$

Assume $f(X) > 0$. Using Bayes' theorem in the population, $f(X \mid D = d, Z = z) = \Pr(D = d, Z = z \mid X)\, f(X) / \Pr(D = d, Z = z)$, write, for a fixed $z$ and a binary outcome:

$$\Pr{}^*(D = 1, Z = z \mid X) = \frac{\dfrac{\Pr(D = 1, Z = z \mid X)\, f(X)\, \varphi(D = 1, Z = z)}{\Pr(D = 1, Z = z)}}{\dfrac{\Pr(D = 1, Z = z \mid X)\, f(X)\, \varphi(D = 1, Z = z)}{\Pr(D = 1, Z = z)} + \dfrac{\Pr(D = 0, Z = z \mid X)\, f(X)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)}}$$

Cancelling the $f(X)$'s, and multiplying through by $\dfrac{\Pr(D = 1, Z = z)}{\varphi(D = 1, Z = z)}$, this reduces to:

$$\Pr{}^*(D = 1, Z = z \mid X) = \frac{\Pr(D = 1, Z = z \mid X)}{\Pr(D = 1, Z = z \mid X) + \Pr(D = 0, Z = z \mid X)\, \dfrac{\Pr(D = 1, Z = z)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)\, \varphi(D = 1, Z = z)}}$$

Now divide above and below by $\Pr(D = 1, Z = z \mid X)$ to yield:

$$\Pr{}^*(D = 1, Z = z \mid X) = \frac{1}{1 + \dfrac{\Pr(D = 0, Z = z \mid X)}{\Pr(D = 1, Z = z \mid X)} \cdot \dfrac{\Pr(D = 1, Z = z)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)\, \varphi(D = 1, Z = z)}} \qquad \text{(B8)}$$

Recall that the log-odds form of the logit model resembles part of the above.
In logit regression,

$$\ln \frac{\Pr(D = 1, Z = z \mid X)}{\Pr(D = 0, Z = z \mid X)} = \alpha_z + x\beta$$

which implies

$$\frac{\Pr(D = 0, Z = z \mid X)}{\Pr(D = 1, Z = z \mid X)} = e^{-(\alpha_z + x\beta)}$$

So, if the logit model is true, then (B8) becomes:

$$\Pr{}^*(D = 1, Z = z \mid X) = \frac{1}{1 + e^{-(\alpha_z + x\beta)}\, e^{\ln\left\{\frac{\Pr(D = 1, Z = z)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)\, \varphi(D = 1, Z = z)}\right\}}} = \frac{1}{1 + e^{-\left(\alpha_z + x\beta - \ln\left\{\frac{\Pr(D = 1, Z = z)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)\, \varphi(D = 1, Z = z)}\right\}\right)}} = \frac{e^{\alpha_z^* + x\beta}}{1 + e^{\alpha_z^* + x\beta}} \qquad \text{[B9]}$$

where

$$\alpha_z^* = \alpha_z - \ln\left\{\frac{\Pr(D = 1, Z = z)\, \varphi(D = 0, Z = z)}{\Pr(D = 0, Z = z)\, \varphi(D = 1, Z = z)}\right\}.$$

Observe that this is in the form of the usual logit estimator. Applying the usual logit estimator within this cluster $z$, then, we get an unbiased estimate of $\beta$, and only the intercept $\alpha_z$ estimated for this cluster is not correct. Pooling across all clusters, provided we include an intercept for each cluster, $\beta$ will be estimated correctly.

Any cluster where all outcomes $d$ are the same will contribute nothing to the estimation of $\beta$; note that the intercept for the cluster can be set arbitrarily high or low so that the cluster likelihood approaches 1 and is unaffected by $\beta$. So clusters where all outcomes are the same may be deleted from the sample estimation.

Therefore we have derived that estimation using the usual logit estimators works correctly, except for the intercepts, so we have deduced the claim following [B5] above. We have not proven that the standard errors will be estimated correctly using the usual logit estimators, but that can be shown as well. The proof extends naturally to the ordered logit and multinomial logit settings.

Note, also, that at an extreme where each cluster consists merely of a pair of observations having different outcomes, we have matched pairs. As Breslow (1996) observes, the matched pair result [B9] can be reached as the extreme of the stratification process, so that each stratum consists of just one pair.
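The intercept-only distortion derived above can be illustrated with a small simulation. This is a hedged sketch rather than the simulation reported in Table 1: the two strata, their intercepts, the control sampling rates, and the helper `fit_logit` are all assumptions chosen for illustration. Every observation with $D = 1$ is kept and observations with $D = 0$ are kept at a stratum-specific rate, so the relative sampling rate of cases to controls in stratum $z$ is `1 / control_rate[z]` and the formula for $\alpha_z^*$ predicts an intercept of $\alpha_z + \ln(1 / \text{rate}_z)$.

```python
import numpy as np

def fit_logit(X, y, iters=30):
    """Hypothetical helper: Newton-Raphson maximum likelihood for binary logit."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        hessian = (X * (p * (1.0 - p))[:, None]).T @ X
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    return beta

rng = np.random.default_rng(1)
alpha = np.array([-3.0, -2.0])         # assumed stratum intercepts alpha_z
beta_true = 1.0                        # assumed common slope
control_rate = np.array([0.10, 0.25])  # assumed sampling rate for D=0, by stratum
n_per_stratum = 100_000

# Population: two strata, outcome generated by the logit model.
z = np.repeat([0, 1], n_per_stratum)
x = rng.normal(size=z.size)
p = 1.0 / (1.0 + np.exp(-(alpha[z] + beta_true * x)))
d = (rng.random(z.size) < p).astype(float)

# Choice-based sampling: keep every D=1, keep D=0 at the stratum's rate.
keep = (d == 1) | (rng.random(z.size) < control_rate[z])
zs, xs, ds = z[keep], x[keep], d[keep]

# Usual logit with one intercept (dummy) per stratum, i.e. a saturated model.
X = np.column_stack([zs == 0, zs == 1, xs]).astype(float)
estimates = fit_logit(X, ds)
alpha_star_hat, beta_hat = estimates[:2], estimates[2]

# Intercepts predicted by the alpha_z^* formula under this sampling scheme.
alpha_star_theory = alpha + np.log(1.0 / control_rate)
```

Under these assumptions `beta_hat` should land near `beta_true`, while `alpha_star_hat` should track `alpha_star_theory` rather than `alpha`, which is the sense in which only the intercepts are affected.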