Advanced Adverse Impact Analysis
Why the (Uncorrected) Fisher Exact Test Should Not Be Used for Most Adverse Impact Analyses (8-26-09)
BCGi Institute for Workforce Development
© Copyright 2009 Biddle Consulting Group, Inc. All Rights Reserved

Visit BCGi Online
While you are waiting for the webinar to begin:
Don't forget to check out our other training opportunities through the BCGi website. Join our online learning community by signing up (it's free) and we will notify you of our upcoming free training events as well as other information of value to the HR community.
www.BCGinstitute.org

HRCI Credit
BCG is an HRCI Preferred Provider. CE credits are available for attending this webinar. Only those who remain with us for at least 80% of the webinar will be eligible to receive the HRCI training completion form for CE submission.

About Our Sponsor: BCG
• Assisted hundreds of clients with cases involving Equal Employment Opportunity (EEO) / Affirmative Action (AA) (both plaintiff and defense)
• Analyses / Test Development and Validation
• Compensation
• Published: Adverse Impact and Test Validation, 2nd Ed., as a practical guide for HR professionals
• Editor & Publisher: EEO Insight, an industry e-Journal
• Creator and publisher of a variety of productivity software/web tools:
  – OPAC® (Administrative Skills Testing)
  – CritiCall® (9-1-1 Dispatcher Testing)
  – AutoAAP™ (Affirmative Action Software and Services)
  – C4™ (Contact Center Employee Testing)
  – Encounter™ (Video Situational Judgment Test)
  – Adverse Impact Toolkit™ (free online at www.disparateimpact.com)
  – AutoGOJA® (Automated Guidelines Oriented Job Analysis)
  – COMPare (Compensation Analysis in Excel)
www.Biddle.com

Contact Information
Daniel Biddle, Ph.D.
[email protected]
Biddle Consulting Group, Inc.
193 Blue Ravine, Ste. 270
Folsom, CA 95630
1-800-999-0438
www.biddle.com

Questions?
Should you have any questions during the webinar, you have two options: ask a question through the GoToMeeting screen console and we will try to address it at the end of the webinar. Should you have any questions regarding OFCCP Audits, Testing and Selection, or Statistical Analysis, visit: www.BCGInstitute.org

Presentation Overview
Disclaimer: These are complicated topics!
• Adverse Impact Analyses Background: Is this really important?
• Issue #1: Marginal Totals
• Issue #2: Conservativeness
• Data simulation results
• Implications and recommendations

What's the Big Deal?
The issues we'll be discussing are a "big deal" and not a big deal at the same time.
The big deal? Adverse impact is serious and no one wants to calculate liability statistics inaccurately.
But it's not a big deal because most of the time, under most circumstances, when statistically significant (SS) adverse impact is there, it's there! Most court cases and audits are typically only enforced when statistical evidence is strong.

How Did this Come About?
For decades, EEO professionals have relied on "Chi-Square" type analyses for the 2x2 table question:

            Pass   Fail   Totals
  Men         8      2      10
  Women       2      6       8
  Totals     10      8      18

Sometimes various corrections have been used (Yates, Cochran). Sometimes the Fisher Exact Test (FET) has been used.
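As a concrete illustration (ours, not from the slides), the sketch below runs the three familiar 2x2 approaches on the table above, assuming Python with SciPy installed. On this table the uncorrected chi-square comes in under .05 while the two-tailed FET (roughly .054) does not, which previews the conservativeness issue discussed later.

```python
# Minimal sketch: the usual 2x2 analyses applied to the example table
# (Men: 8 pass / 2 fail; Women: 2 pass / 6 fail). Assumes SciPy is available.
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

table = np.array([[8, 2],   # Men:   pass, fail
                  [2, 6]])  # Women: pass, fail

# Uncorrected chi-square (equivalent to the two-tailed Z-test on a 2x2 table)
chi2_stat, p_uncorrected, dof, expected = chi2_contingency(table, correction=False)

# Chi-square with the Yates continuity correction
chi2_yates, p_yates, _, _ = chi2_contingency(table, correction=True)

# Two-tailed Fisher Exact Test
odds_ratio, p_fet = fisher_exact(table, alternative="two-sided")

print(f"chi-square (uncorrected): p = {p_uncorrected:.4f}")
print(f"chi-square (Yates):       p = {p_yates:.4f}")
print(f"Fisher Exact Test:        p = {p_fet:.4f}")
```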
How Did this Come About?
But is that what Fisher intended? All 2x2 tables to be run through his "exact" test? Since about the 1950s, various challenges have been brought to the FET:
• What are the assumptions required for the FET results to be accurately interpreted?
• Is the FET too conservative?
• Are there other, more accurate techniques when the strict FET conditions are not met?

How Did this Come About?
Most recently, from the mid-90s to this year, a barrage of articles has been published in the biomedical field, in theoretical statistics journals, and in other fields that have criticized the FET, abandoned the FET, and recommended other "less conservative" replacements that are more applicable and accurate across a greater diversity of 2x2 situations… The adverse impact context has not been neglected in these discussions… We've reviewed over 80 such articles and chapters…

How Did this Come About?
2x2 analyses can be conducted in three situations: fixed, mixed, and free margins. While there is a consensus in the current literature that the FET is inappropriate in 2 of these 3 "2x2" situations, there is not a consensus regarding whether the uncorrected FET should be used in 1 of the 3 "2x2" situations. When evaluating these 2x2 situations, it becomes clear that the FET should not be used in many AI circumstances, but may be used in some situations… Let's take a look at the "2x2 situations."

FET Issue #1: Marginal Totals
The FET requires meeting conditional assumptions that are not always met in practice:
• Are the margins FIXED before or after the event?
• Are the margins FIXED, CONSTRAINED, OR CORRELATED to the employer's previous decisions?
(Refer to the 2x2 table shown above.)

FET Issue #1: Marginal Totals
The Three Major 2x2 Models (see Collins & Morris, 2008): FIXED, MIXED, and FREE.
• Model 1, Independence Trial (FIXED): The marginal proportions are assumed to be fixed in advance (i.e., the proportion of each group and the selection totals are fixed). Data are not viewed as a random sample from a larger population.
• Model 2, Comparative Trial (MIXED): Applicants are viewed as random samples from two distinct populations (e.g., minority and majority). The proportion from each population is fixed (i.e., the marginal proportion on one variable is assumed to be constant across replications). The second marginal proportion (e.g., the marginal proportion of applicants who pass the selection test) is estimated from the sample data.
• Model 3, Double Dichotomy (FREE): Neither the row nor the column margins are assumed to be fixed. Applicants are viewed as a random sample from a population characterized by two dichotomous characteristics. No purposive sampling or assignment to groups is used, and the proportion in each group can vary across samples.
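To make the three sampling models concrete, here is a brief simulation sketch of our own (the group sizes follow the example table; the passing rates and joint probabilities are purely hypothetical), assuming NumPy:

```python
# Sketch of how a 2x2 table arises under each of the three sampling models.
# Margins follow the example table; rates/probabilities are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_men, n_women, n_pass = 10, 8, 10          # margins from the example table
p_men, p_women = 0.80, 0.25                 # hypothetical passing rates

# Model 1 (FIXED, independence trial): both margins fixed in advance; given
# the margins, the only random quantity is how the passers split between
# the groups, which follows a hypergeometric distribution.
men_pass_fixed = rng.hypergeometric(ngood=n_men, nbad=n_women, nsample=n_pass)

# Model 2 (MIXED, comparative trial): group sizes fixed; each group's passing
# count is an independent binomial draw from its (unknown) passing rate.
men_pass_mixed = rng.binomial(n_men, p_men)
women_pass_mixed = rng.binomial(n_women, p_women)

# Model 3 (FREE, double dichotomy): only the total N is fixed; both group
# membership and pass/fail vary, i.e., a multinomial draw over the 4 cells.
cell_probs = [0.45, 0.11, 0.11, 0.33]       # hypothetical joint probabilities
free_table = rng.multinomial(n_men + n_women, cell_probs).reshape(2, 2)

print("FIXED  - men among the", n_pass, "passers:", men_pass_fixed)
print("MIXED  - table:", [[men_pass_mixed, n_men - men_pass_mixed],
                          [women_pass_mixed, n_women - women_pass_mixed]])
print("FREE   - table:\n", free_table)
```

Only under Model 1 is the hypergeometric distribution (the basis of the FET) the exact sampling distribution for the table; under the other two models the margins themselves vary from replication to replication.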
FET Issue #1: Marginal Totals
The Three Major 2x2 Models Applied to EEO Analysis:
• FIXED: Terminations / RIFs. However, the margins are shared with, and correlated with, past practices and are oftentimes not predetermined.
• MIXED: Some promotions. However, shared "odds ratios."
• FREE: Applicants are widely recruited; they show up; passing rates / hiring rates are unknown in advance.

FET Issue #1: Marginal Totals
Which of the Three 2x2 Models Apply to HR Decisions?
Reviewing the Three 2x2 Models Against HR Decisions (adapted from Collins & Morris, 2008):

HR Practice: Hiring with a fixed cutoff score
2x2 Model: Double Dichotomy
Comments: Selection decisions use a fixed cutoff score. The passing score is typically set in advance or using normative data. MQs might be used.

HR Practice: Top-down selection
2x2 Model: None of the three models fit appropriately.
Comments: Candidates are selected top-down based on hiring criteria until a fixed number of positions are filled. The selection rate is fixed based on staffing needs. If a different sample had been used, the number passing would have been the same. However, each group's proportion is likely to vary across samples and is best treated as an estimate of an unknown population parameter. Further, because selection decisions depend on applicant rank position in a particular sample, the selected and nonselected groups are sample-specific and do not reflect two distinct populations as in the comparative trial model.

HR Practice: Banding
2x2 Model: None of the three models fit appropriately.
Comments: Banding is a combination of "ranking" and typically also involves a minimum cutoff score, so it is a hybrid method for which none of the sampling models are a perfect fit.

HR Practice: Promotion
2x2 Model: None of the three models fit appropriately.
Comments: The candidate pool is relatively fixed… If decisions were repeated, the candidate set would be similar. In such cases, probabilities based on randomly sampling from a population, as in the comparative trial and double dichotomy models, would not apply. Similarly, probabilities based on random reassignment of participants (i.e., the independence trial model) would not be appropriate. Without a theoretical process for producing different data patterns (e.g., random …)

FET Issue #1: Marginal Totals
Which of the Three 2x2 Models Apply to HR Decisions?
• "Because the independence trial model ('Fixed') does not represent typical personnel selection data, there is reason to question the appropriateness of the Fisher Exact Test for adverse impact analysis."
• "The tendency of these tests to be conservative under the other sampling models indicates that the Fisher Exact Test and Yates's test will be less likely than other tests to identify true cases of adverse impact" (Morris & Collins, 2008).
FET Issue #1: Marginal Totals
Which of the Three 2x2 Models Apply to HR Decisions?
• In the EEO analysis field, "The justification of conditional tests (those for 'fixed' margins) depends on the assumption that the process determining the fixed marginal counts is not dependent on the process under study…"
• For example, when considering whether to use a conditional test (the FET) for a promotional analysis, "The number of minority members hired out of a labor pool should not provide information about the odds ratio of the promotion rates, the parameter of interest."
• Gastwirth advises checking this assumption before calculating conditional tests in situations where the available sample results from a previous selection process that may be affected by the same factors involved in the process being examined (because the odds ratio of the hiring rates and the promotion rates would be related).
• For this reason, the unconditional tests may be a "more accurate" test across a greater number of AI cases (Gastwirth, J. (1997). Statistical evidence in discrimination cases. Journal of the Royal Statistical Society, 160, Part 2, 289-303).

FET Issue #1: Marginal Totals
[Diagram: hires, promotions, and terminations, each split into men and women, illustrating that the pools available for later decisions are created by earlier ones.]
In the EEO area, when using statistical tests, it is important to consider a crucial assumption underlying conditional tests. This assumption requires that one can condition on fixed marginal numbers that are not dependent on any factor related to the process being investigated. For example, if one examines the promotion data of a firm, the marginal sample sizes of minority and majority employees eligible for advancement clearly result from the hiring practice of that firm (Ibid).

FET Issue #1: Marginal Totals
The FET Requires "Calling Out" Marginal Totals Before the Analysis is Conducted…
• When, if ever, is this really the case in AI analyses?
• "The FET assumes that both of the margins in a 2X2 table are fixed by construction—i.e., both the treatment and outcome margins are fixed a priori" (Sekhon, 2005).
• "Over decades there has been a lively debate among statisticians on the applicability of the conditional FET. The argumentation against the test mainly is that it conditions inference on both margins where only one margin is fixed by most experimental designs and the test is inherently conservative…the row and column marginal totals are fixed by the researcher prior to data collection" (p. 171, Gimpel, 2007).
• "Fisher's 2 x 2 exact test requires that the marginal frequencies in both margins are fixed a priori" (Romualdi et al., 2001).

FET Issue #1: Marginal Totals
Which of the Three 2x2 Models Apply to HR Decisions?
• For fun: it isn't "truth" today if it's not "so" on Wiki!
• "FET assumes that the row and column totals are known in advance. In cases where this assumption is not met, FET is very conservative, resulting in a Type I error rate below the nominal significance level. In practice, this assumption is not met in many experimental designs and almost all non-experimental ones. An alternative exact test, Barnard's exact test, has been developed, and proponents of it suggest that this method is more powerful, particularly in 2x2 tables."
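For readers who want to see the conditional/unconditional contrast on the running example, here is a short sketch of our own, assuming SciPy 1.7 or later (which ships implementations of Barnard's and Boschloo's unconditional exact tests alongside the FET):

```python
# Sketch: conditional FET vs. two unconditional exact tests on the example
# table. Assumes SciPy >= 1.7; illustrative only, not the procedure used by
# any of the authors quoted above.
from scipy.stats import fisher_exact, barnard_exact, boschloo_exact

table = [[8, 2],   # Men:   pass, fail
         [2, 6]]   # Women: pass, fail

_, p_fet = fisher_exact(table, alternative="two-sided")
p_barnard = barnard_exact(table, alternative="two-sided").pvalue
p_boschloo = boschloo_exact(table, alternative="two-sided").pvalue

print(f"Fisher (conditional):     p = {p_fet:.4f}")
print(f"Barnard (unconditional):  p = {p_barnard:.4f}")
print(f"Boschloo (unconditional): p = {p_boschloo:.4f}")
```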
FET Issue #2: Conservativeness
The FET is "Too Conservative" Compared to Other Methods
• "The tendency of these tests to be conservative under the other sampling models indicates that the FET and Yates's test will be less likely than other tests to identify true cases of adverse impact" (Morris & Collins, 2008).
• "The exact test of Fisher…gives tests which are both extremely conservative and inappropriate" (Upton, 1982) (Upton later endorsed the mid-P).
• "The traditional FET should practically never be used" … "the FET is unnecessarily conservative with lower power than conditional mid-p tests and unconditional tests" … "We do not recommend the use of FET. FET is conservative, that is, other tests generally have higher power yet still preserve test size" (Lydersen et al., 2009).
• "FET can be conservative in the sense of its actual significance level (or size) being much less than the nominal level" (Lin & Yang, 2009).

Probability Theory Applied to 2x2 Tables
[Chart: demonstration of "discreteness" in the FET probability distribution, plotted against the asymptotic "best estimate" line used by the chi-square. For the example table, FET p = 0.0536, mid-p = 0.0392, unconditional = 0.0338. The FET has only 4 "stopping places" below .05; chi-square theory has more.]

Probability Theory Applied to 2x2 Tables
[Charts: actual significance level versus the desired (.05) significance level, shown for the mid-P and for the uncorrected FET.]

Important Questions for HR Professionals…
• What is the significance level used for testing whether a test is valid?
• What is the significance level used for testing adverse impact?
Answers: Validity: .05. Adverse impact: .05.
What statistical tests are useful for answering these statistical questions?
• Validity: the Pearson correlation is common.
• Adverse impact: the Fisher Exact Test (under a variety of methods), chi-square, Z-test, etc.

Example
• Let's take two employers that use a physical test with a standardized mean group difference of 1.0 (d) between men and women.
  – This difference is commonly observed on written tests (minority/non-minority) and physical tests (men/women).
• Each employer tests 1,000 applicants per year.
• One employer hires only the top 10%; the other, only the top 40%.
• Such a test will exhibit adverse impact; whether the statistics detect it depends on two factors:
  1. the number of applicants tested and hired, and
  2. the power of the statistical test used to detect the AI.

Example
This example constitutes one where a "substantial passing rate difference" (required by the Guidelines) has been observed between the two groups in the population.
Using a 40% hiring rate, a standardized mean group difference of 1.0 (d) between men and women equates to:
• a 58% male passing rate
• a 22% female passing rate
Using a 10% hiring rate, a d of 1.0 equates to:
• an 18% male passing rate
• a 3% female passing rate
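The passing rates above can be approximated with a simple normal model. The following sketch is ours (it assumes two equal-size applicant groups whose score distributions differ by d standard deviations, with a single cutoff chosen to hit the employer's overall hiring rate); it lands within about a point of the figures quoted above, with small differences due to rounding. Requires SciPy.

```python
# Sketch: translate a group difference of d and an overall hiring rate into
# group-specific passing rates, assuming equal-size normal score
# distributions (men ~ N(+d/2, 1), women ~ N(-d/2, 1)).
from scipy.stats import norm
from scipy.optimize import brentq

def group_pass_rates(d, overall_rate):
    # Find the cutoff c so the pooled passing rate equals the hiring rate.
    pooled = lambda c: 0.5 * norm.sf(c - d / 2) + 0.5 * norm.sf(c + d / 2) - overall_rate
    cutoff = brentq(pooled, -6, 6)
    return norm.sf(cutoff - d / 2), norm.sf(cutoff + d / 2)

for rate in (0.40, 0.10):
    men, women = group_pass_rates(1.0, rate)
    print(f"{rate:.0%} hiring rate: men pass {men:.1%}, women pass {women:.1%}, "
          f"impact ratio {women / men:.0%}")
```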
Practical Implications of a Test with a 1.0 (d)
• How much overlap is there between two groups based on various d values?
  – .25 d = 82% overlap
  – .50 d = 67% overlap
  – .75 d = 55% overlap
  – 1.0 d = 45% overlap

Summarizing the AI Evidence on a 1.0 (d) Test
Evaluating the company that uses a 40% hiring rate:
• 58% of the men will pass and 22% of the women will pass
• Hiring ratio = 2.6 male hires for every 1 female hire
• The impact ratio is 38% (less than half of the 80% test's threshold)
• Less than one-half (45%) of the male distribution overlaps with the female distribution
Evaluating the company that uses a 10% hiring rate:
• 18% of the men will pass and 3% of the women will pass
• Hiring ratio = 6 male hires for every 1 female hire
• The impact ratio is 17% (less than a quarter of the 80% test's threshold)
• Less than one-half (45%) of the male distribution overlaps with the female distribution

Finding Adverse Impact
Next, let's investigate the usefulness of three statistical tools in answering the AI question:
• Fisher Exact Test (FET)
• Fisher Exact Test (mid-P)
• Chi-Square (or "Z" test)
The sample sizes for both the "40% hiring rate" and "10% hiring rate" employers will be scaled and evaluated. Sample sizes will be "matched" for both men and women.

First, Some Definitions
• Type I Error (α): rejecting the null ("no difference") hypothesis when the null hypothesis is true. In other words, finding AI when it does not exist.
• Type II Error (β): failing to reject the null hypothesis when the null hypothesis is false. In other words, missing AI when it exists.
• Type I Error Rate: the percentage of Type I errors made by a statistical test (i.e., the rate at which it falsely concludes AI).
• Type II Error Rate: the percentage of Type II errors made by a statistical test (i.e., the rate at which it misses AI that exists).
• Nominal Level: the p-value of significance, declared in advance (e.g., .05). In AI cases, the major concern is with answering the "big .05 question."

More Definitions
• Statistical power analysis evaluates the likelihood that a statistical test will find a meaningful difference at the specified level (e.g., .05, or 2 SDs).
• Adverse impact tests that are "more powerful" are more likely to find adverse impact when it exists.

Power Analysis for 40% Hiring Rate Employer
[Chart: power curve for detecting adverse impact on a 1.0 (d) test used with a 40% overall passing rate. The chart answers the question: what percent of the time will the test find adverse impact when it exists? Series: FET, FET (mid-P), and Chi-Square; x-axis: sample size (equal for each group) from 5 to 50; y-axis: 0% to 100%. A gap between the curves is labeled "increased likelihood of missing AI when it exists."]

Power Analysis for 10% Hiring Rate Employer
[Chart: the same power curves for a 10% overall passing rate, with sample sizes (equal for each group) from 5 to 80. A gap between the curves again marks the increased likelihood of missing AI when it exists.]
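Power curves like those described above can be approximated by simulation. The sketch below is our own Monte Carlo approximation (not the simulation behind the charts), assuming SciPy/NumPy and using the 58% vs. 22% passing rates from the 40% hiring-rate example:

```python
# Monte Carlo sketch: how often each test flags adverse impact at alpha = .05
# when the true passing rates are 58% (men) vs. 22% (women).
import numpy as np
from scipy.stats import fisher_exact, chi2_contingency, hypergeom

def midp(table):
    """Two-tailed Fisher exact p minus half the observed table's probability."""
    (a, b), (c, d) = table
    _, p = fisher_exact(table, alternative="two-sided")
    point = hypergeom.pmf(a, a + b + c + d, a + b, a + c)
    return p - 0.5 * point

def fet(table):
    return fisher_exact(table, alternative="two-sided")[1]

def chisq(table):
    # chi2_contingency raises an error if a margin is zero; treat as no finding.
    try:
        return chi2_contingency(table, correction=False)[1]
    except ValueError:
        return 1.0

def power(test, n_per_group, p_hi=0.58, p_lo=0.22, reps=2000, seed=0):
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        x = rng.binomial(n_per_group, p_hi)
        y = rng.binomial(n_per_group, p_lo)
        table = [[x, n_per_group - x], [y, n_per_group - y]]
        hits += test(table) < 0.05
    return hits / reps

for n in (15, 30, 50):
    print(n, f"FET={power(fet, n):.2f}",
          f"mid-P={power(midp, n):.2f}",
          f"chi-square={power(chisq, n):.2f}")
```

Increasing `reps` tightens the estimates; swapping in 18% vs. 3% approximates the 10% hiring-rate scenario.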
Power Comparison in Small Samples
[Bar chart: average statistical power in samples between 10 and 50 for the FET, FET (mid-P), and Chi-Square at 10%, 20%, and 40% selection ratios; the plotted power values range from 29% at the 10% selection ratio to 76% at the 40% selection ratio.]

Accuracy of Tests for Answering the "Just .05 or Less" Question
[Chart: comparison between the FET and mid-P based on sample size (based on Monte Carlo simulations from the cited articles). The plot shows the SD value required for significance across sample-size bins (0-20 through 126-200), with polynomial trend lines: the average "overage" of the Fisher Exact Test (the amount higher than 1.96 required to find an actual 1.96 finding) versus the much smaller average "overage" of the mid-P.]

How Accurately Do the Tests Answer the .05 Question?
Actual FET/mid-P significance levels (compared to the desired .05 level):

  Sample Size                        | Typical Alpha Range | % Below Desired .05 Level | Actual SD Required for Significance
  0-20                               | .015                | 70%                       | 2.43
  21-50                              | .025                | 50%                       | 2.24
  50-75                              | .026                | 47%                       | 2.22
  76-100                             | .032                | 36%                       | 2.15
  101-125                            | .035                | 30%                       | 2.11
  126-200                            | .043                | 13%                       | 2.02
  Typical estimate for n<50 (FET)    | .029                | 41%                       | 2.19
  Typical estimate for n<50 (mid-P)  | .046                | 8%                        | 1.99
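As an aside, the "Actual SD Required for Significance" column above is simply the two-tailed normal (z) equivalent of each test's actual attained alpha. A small sketch of ours, assuming SciPy, reproduces the column to within about ±.01 of rounding:

```python
# Sketch: convert an attained (actual) significance level into the two-tailed
# z ("SD") threshold it effectively imposes.
from scipy.stats import norm

actual_alphas = {
    "FET, n = 0-20": 0.015,
    "FET, n = 21-50": 0.025,
    "FET typical, n < 50": 0.029,
    "mid-P typical, n < 50": 0.046,
    "nominal level": 0.05,
}

for label, alpha in actual_alphas.items():
    print(f"{label:>22}: z = {norm.isf(alpha / 2):.2f}")
```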
Type I Error Rates Between Tests
[Chart: Type I error rate comparison between three 2x2 tests (Z-test, FET, and mid-P) across scenarios crossing selection ratio (SRT = 10% to 70%), minority representation (PMIN = 10% to 50%), and sample size (N = 20, 50, 100); the error rates generated by the tests range from 0 to about .08.]

Summary
The best AI test is one that balances the three concerns between:
• being able to answer the .05 question,
• missing adverse impact when it exists, and
• falsely concluding AI exists when it does not.
The FET consistently "undershoots" the .05 level of significance: drastically in smaller samples (n<50) and substantially in samples of 50-125.
The mid-P provides a "correctly" sized adjustment across various samples.
Type II error rates ("missing" AI when it exists) differ substantially by test, especially in smaller samples, where the FET is much less powerful.
All three common tests share similarly low Type I error rates, leaving the employer with very low odds of "incorrectly" concluding AI.

Summary
• Using the FET unilaterally in all three conditions is unacceptable and should be discontinued in light of the recent findings just reviewed.
• The conditional FET may be appropriate in limited conditional settings, but there will always be an argument against such use:
  – The FET is conservative regardless.
  – Does the situation analyzed truly meet the conditional requirements?
• However, the mid-P has power advantages, adheres more closely to the .05 level, and is very closely aligned with the FET (where appropriate).

Summary
• Is either position, the FET or the mid-P, aligned with a "plaintiff" or "defense" position? It depends on the question being asked…
• If an employer is interested in knowing the exact p-value given in a clearly conditional situation where the margins were indeed fixed beforehand, the FET will provide this answer.
• If an employer is interested in not missing adverse impact that may exist (i.e., wants strong power to detect AI), the mid-P will better answer the question, in both conditional and unconditional situations (i.e., all 3 models).
• For a test to be useful, it should be reasonably accurate, reasonably powerful, and versatile across a wide range of situations.
• The p-value from an AI test using a discrete distribution should be reasonably "aligned" with a p-value from a comparable continuous distribution.

Summary
• The FET gives the actual conditional p-value, but its attained level will always fall below the .05 nominal level, thus not answering the "exactly ≤ .05 or 2 SD" question asked in Title VII situations.
• The mid-p may be thought of as "assessing the strength of evidence against the null hypothesis" (Barnard, 1989, p. 1474). This is not true of the exact p-value from the FET.
• The question being asked in Title VII situations is not necessarily "what is the p-value?" but rather "is the p-value less than .05?" The mid-p answers this question more accurately.

Summary
Advantages of Using the Mid-P (adapted from Hirji, 2006)
Hirji provides the basis for endorsing the mid-P as the preferred exact method (for either conditional or unconditional situations):
• Statisticians who hold very divergent views on statistical inference have either recommended or given justification for the mid-p method.
• A mid-p version has been or can be devised for most of the statistics used in exact conditional and unconditional analysis of discrete data.
• The confidence intervals associated with mid-p values are often preferred by statistical programs (e.g., StatXact) because they are narrower / more accurate.
• The shape and power function of the mid-p tests are generally close to the shape of the ideal power function, an important distinction because it demonstrates that the power of the test is uniform and able to detect AI when it exists across a variety of data sets (both balanced and unbalanced).

Summary
Advantages of Using the Mid-P (adapted from Hirji, 2006)
• In a wide variety of designs and models, the mid-p rectifies the extreme conservativeness of the traditional exact conditional method without substantially compromising the Type I error.
• Empirical studies show that the performance of the mid-p method resembles that of the exact unconditional methods and the conditional randomized methods.
• With the exception of a few studies, most studies indicate that in comparison with a wide variety of exact and asymptotic methods, the mid-p methods are among the preferred, if not the preferred, ones.
• The mid-p has good comparative small- and large-sample properties.
• Hirji concludes by stating: The mid-p method is thus a widely accepted, conceptually sound, practical tool and among the better of the tools of data analysis.
Especially for sparse and not-so-large-sample discrete data, we thereby echo the words of Cohen and Yang (1994) that it is among the "sensible tools for the applied statistician."

Summary
Evaluation of the three tests:

  Evaluation Factor                                     | FET (conditional) | FET (mid-P) | FET-Boschloo (unconditional)
  Appropriate in Independence Trial? (Model 1, Fixed)   | MAYBE             | YES         | NO
  Appropriate in Comparative Trial? (Model 2, Mixed)    | NO                | YES         | YES
  Appropriate in Double Dichotomy? (Model 3, Free)      | NO                | YES         | YES
  Average distance from .05 level in small samples      | 41%               | 8%          | 5-10%
  Actual SD required for significance in small samples  | 2.19              | 1.99        | 1.95-2.05
  Preserves .05 nominal significance level              | YES               | NO          | NO
  Average power in small samples (n<50)                 | 54%               | 62%         | 62%

How Do You Compute the Mid-P?
It's rather simple… many stat packages will provide the mid-p.
If you already have an AI tool or stat program:
• Compute the 2-tail FET.
• Subtract ½ of the probability of the observed (first) table from that value.
• The "HYPGEOMDIST" function can be used to obtain that probability.
If you want to avoid the hassle, just calculate mid-p values for FETs that are "on the cusp" of significance, such as 1.80 SDs (corresponding to p-values of about .07).
This can easily be done for Mantel-Haenszel style analyses.
If the exact unconditional test is preferred: http://www.stat.ncsu.edu/exact/
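For instance, here is a minimal sketch of that recipe in Python (ours, assuming SciPy; the Excel route uses HYPGEOMDIST for the same point probability), applied to the running example table:

```python
# Sketch: mid-P = two-tailed FET p-value minus half the probability of the
# observed table (the hypergeometric "point probability").
from scipy.stats import fisher_exact, hypergeom

def fet_midp(a, b, c, d):
    """Return (two-tailed FET p, mid-P) for the 2x2 table [[a, b], [c, d]]."""
    _, p_fet = fisher_exact([[a, b], [c, d]], alternative="two-sided")
    point = hypergeom.pmf(a, a + b + c + d, a + b, a + c)  # P(observed table)
    return p_fet, p_fet - 0.5 * point

# Running example: Men 8 pass / 2 fail, Women 2 pass / 6 fail
p_fet, p_mid = fet_midp(8, 2, 2, 6)
print(f"FET (two-tailed) p = {p_fet:.4f}")   # approx. .054, as on the earlier slide
print(f"mid-P              = {p_mid:.4f}")   # approx. .039
```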
Questions?
www.BCGinstitute.org