UK FHS Historical sociology (2014)
Quantitative Data Analysis II.
Standard error and confidence intervals (2.): for proportion (and other parameters)
Jiří Šafr, jiri.safr(AT)seznam.cz
updated 25/11/2014
® Jiří Šafr, 2014

Content
• Principles of inferential statistics and interval estimates (reminder, see http://metodykv.wz.cz/QDA2_CfI_1.ppt)
• Standard error (SE) and confidence interval (CI) for categorical variables (p, %); for numeric variables (mean) see http://metodykv.wz.cz/QDA2_CfI_1.ppt
• Assumptions for inferential statistics (SE, CI)
• How to compute a CI for % in SPSS? Some alternatives
• Comparing two population proportions
• Simultaneous confidence intervals
• Standard error and confidence intervals for other parameters (proportion difference, correlation coefficient, median, …)

For an introduction to the logic of inferential statistics and to the computation of the standard error and confidence interval for numerical variables, first see the presentation: Standard error and confidence intervals (1.): introduction to inferential statistics, SE and CI for numerical variables (means) http://metodykv.wz.cz/QDA2_CfI_1.ppt

Sampling Error
Population → sample → population
Random sampling error is encountered in survey research because the sample selected is not a perfect representation of the test population. [Assael, Keon 1982]
Only one sample is drawn at random (without replacement), and its data represent the population. The error caused by the choice of sample can be bounded, with a probability chosen in advance, on the basis of sampling theory.

Results from survey samples are always only estimates of the true (population) parameter.
• Their accuracy depends mainly on sample size and on the distribution of values (variance).
• Rule of thumb: for large samples (ca. N = 1000) from a large (national) population, the true (population) relative frequencies (percentages) lie within these intervals: [table not reproduced] Source: [Special Eurobarometer 337]
However, we will learn how to compute the interval exactly, for any value and any parameter or level of measurement (e.g. %, mean, percentage-point difference, correlation, …).

Principle of inferential statistics: categorical variables
Distribution of probability (i.e. %) in random sample(s) from a population [Figure 13.15 in De Vaus (1986) 2002: 304 or 232]
• „If we have a random sample then probability theory again provides the answer. If we took a large number of random samples most will come up with percentage estimates close to that which actually exists in the population. In only a few samples will the sample estimates be way off the mark. In fact the sample estimates would approximate a ‘normal’ distribution (Figure 13.15).“ [De Vaus (1986) 2002: 232]
• The X axis shows the share (relative frequency) of answers choosing the Conservative party across many random samples. As the number of repeated random samples grows, the estimated % approaches the true value in the population.

Binomial distribution
Church attendance, West Germany, July-August 1956
      Regular   Irregular   Seldom   Never   Total
%     30.3      24.6        28.6     16.5    100.0
A random sample of 4000 persons is split into groups of 40 persons, which gives 100 random subsamples. This split corresponds to interviewing 100 representative cross-sections. The subsamples, however, do not all contain the same percentage of persons who attend church only "seldom". According to the law of large numbers, small deviations must occur more often than large ones. [Noelleová 1968: 115]
A share of 27.5 % of persons who "seldom" attend church, i.e. 11 of 40 respondents, occurs for example in 18 of the 100 subsamples, whereas a share of 10 % = 4 of 40 respondents occurs in only one subsample.
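The subsample experiment described above can be imitated with a small simulation. The sketch below is only an illustration, not the original survey: it assumes independent random draws with a true share of 28.6 % ("seldom"), and the seed and all variable names are arbitrary choices made for this example.

```python
import random
from collections import Counter
from math import sqrt

P_SELDOM = 0.286   # assumed true share of "seldom" (from the table above)
N_SUB = 40         # respondents per subsample
N_SAMPLES = 100    # number of subsamples

rng = random.Random(1956)  # arbitrary fixed seed so that the run is repeatable

# Number of "seldom" answers in each simulated subsample of 40 respondents.
counts = [sum(rng.random() < P_SELDOM for _ in range(N_SUB))
          for _ in range(N_SAMPLES)]
shares = [c / N_SUB for c in counts]
histogram = Counter(counts)

mean_share = sum(shares) / N_SAMPLES
# Standard error of a proportion for one subsample of 40: sqrt(p(1 - p) / n).
theoretical_se = sqrt(P_SELDOM * (1 - P_SELDOM) / N_SUB)

# Text histogram: one '#' per subsample with that number of "seldom" answers.
for k in sorted(histogram):
    print(f"{k:2d} of {N_SUB} ({100 * k / N_SUB:5.1f} %): {'#' * histogram[k]}")
print(f"mean of the subsample shares: {mean_share:.3f}")
print(f"theoretical SE for n = {N_SUB}: {theoretical_se:.3f}")
```

Most simulated subsamples should fall close to 28.6 %, and only a few far away from it, which is the pattern the text describes.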
From the bell-shaped curve one can read what distribution would be expected in the limiting case, if not just 100 but an arbitrarily large number of random subsamples were examined.

What precedes the computation of a confidence interval:
0. Variance / standard deviation (needed to compute the standard error)
1. Standard error
2. Level of confidence → z-values
(general principle and how to obtain them → see http://metodykv.wz.cz/QDA2_CfI_1.ppt)

Standard error and estimation of the parameter (e.g. mean or %)
• or, generally, the standard error of a sample statistic
• It quantifies the uncertainty of our measurement
for mean: standard error (of mean) SE = s / √n (s = standard deviation, n = sample size)
for percent (%): standard error (of proportion) SE = √( p(1 − p) / n )
• Note: a probability, i.e. a proportion (%), is in fact a mean over the observations, so we calculate the SE of a proportion in essentially the same way as the SE of a mean (standard deviation of the proportion divided by the square root of the sample size).

Standard error
• Gets smaller as the sample size increases (the accuracy of the parameter estimate increases)
• Doubling the sample size narrows the confidence interval only 1.41 times (by a factor of √k for a k-fold increase), which is why doubling the accuracy requires four times the sample size
• Results are usually considered sufficiently accurate if about 2/3 of the measured values lie within ± 1 standard error (SE) of the mean.

What is the standard error (SE) for?
• It specifies how (in)accurate our results are
• for the computation of a confidence interval
• for testing whether two (or more) parameters differ (in the population)
• for testing whether a sample parameter is statistically significantly different from zero in the population (e.g. if we divide the correlation coefficient r by its SE and get a number greater than 2, then the correlation is non-zero with 95 % probability, i.e. it also exists in the whole population)

Confidence interval (assumptions)
• From here on we consider only the two-sided confidence interval (there is also a one-sided CI, where we determine only the upper or only the lower bound)
• for a simple random sample
• and for large samples (n > 30)
• We assume an at least approximately normal distribution of the values of the phenomenon (which in social reality is mostly unrealistic in principle)

Confidence intervals for qualitative data: nominal variable → frequency (probability / percent)
% is probability multiplied by 100, i.e. p = 0.1 corresponds to 10 % (and p = 0.8 → 1 − p = 0.2)

Confidence interval for a proportion (i.e. % / 100), dichotomised variable
Point estimate ± critical value for the chosen level of confidence × standard error of the estimate
• Probability (point estimate): p = x/n
• Standard error of probability/proportion: SE = √( p(1 − p) / n )
• Confidence interval: p ± zα/2 × SE
• Critical value C for 95 % confidence: error α = 0.05; zα/2 = 1.96
→ There is 95 % confidence that the true (population) value lies between the lower and upper bounds computed from the sample.
If the variable has more than two categories, we always compute p as a dichotomy of the given category against the sum of the others (e.g. education: university / other levels (primary + vocational + secondary)).

Example: Voter turnout in 2006, CR
FREQ q34.
• To compute the CI (SE) we have to fill in the formula and calculate it by hand (or use some other tool)
Source: ISSP 2007 data

Example: Voter turnout in 2006, CR
• We have a sample estimate (from the ISSP 2007 survey) for the variable Voted in 2006 (categories:
Voted / Didn't vote)
• Standard error of proportion (SE) for Voted:
– Probability of Voted: p = 750/1196 = 0.627 (= 62.7 %)
– (Probability of Didn't vote: 1 − p = 446/1196 = 0.373)
• SE = √( 0.627 × (1 − 0.627) / 1196 ) = 0.014
• The true (population) value of Voted, estimated from our sample, lies in the interval (for α = 5 % → C = 1.96): 0.627 ± 1.96 × √( 0.627 × 0.373 / 1196 ) = 0.627 ± 0.0274, i.e. (0.5996; 0.6544), or 62.7 (± 2.7) %
Source: ISSP 2007 data, CR

Example: Voter turnout in 2006, CR
• Here, exceptionally, we know the true population parameter from official statistics: in the elections to the Chamber of Deputies held on 2 and 3 June 2006, 64.47 % of Czech citizens participated. (official data from CZSO/ČSÚ)
• Our sample estimate (ISSP 2007 data)
– for 95 % CI (where zα/2 = 1.960): 59.96 ← 62.7 → 65.44
– for 99 % CI (where zα/2 = 2.576): 59.10 ← 62.7 → 66.30
– for 90 % CI (where zα/2 = 1.645): 60.40 ← 62.7 → 65.00
Indeed, all interval estimates contain the true population proportion.

How to compute it in SPSS?
Not as easily as for numerical variables (mean): by default SPSS provides a CI for a proportion (%) only in a graph → BARCHART
GRAPH /BAR(SIMPLE)=PCT BY q34 /INTERVAL CI(95.0).
Source: ISSP 2007 data

BARCHART for % with CI, via the SPSS menus

Bivariate BARCHART (with CI for %)
• GRAPH /BAR(SIMPLE)=PCT BY q34 BY q38 /INTERVAL CI(95.0).
• For comparing the % „Voted in 2006 parliamentary election“ within subgroups (e.g. by Union membership)
Source: ISSP 2007 data

1. In the output (on the FREQ table) you can use a (post)script
The script can be downloaded from: http://www.acrea.cz/sc_intervaly_spolehlivosti_cetnosti.htm
This is the most convenient way. However, the script has to be stored on your computer and must match your SPSS version; sometimes a programming environment (Python) has to be installed as well, and the script is probably available only in Czech. It does not exist for PSPP.
Source: ISSP 2007 data, CR

2.
Syntax routine CI for proportion [Pryce 2002] http://www.spsstools.net/Syntax/Distributions/ProportionTestsAndCI.txt
Here we have to fill in results, e.g. from FREQ (univariate) or possibly CROSSTAB (bivariate). In fact there are four tests in this syntax. For univariate description it is the second test, Large-Sample Confidence Interval for a Single Population Proportion. Fill in only the values of n and x; you can also choose the confidence level (originally set to 99 % CI) and the number of decimals shown.

*-------------------------------------------------------------------------------.
* Large-Sample Confidence Interval for a Single Population Proportion.
* (see Moore and McCabe (2001) Intro to the Practice of Statistics, p. 586-588).
*-------------------------------------------------------------------------------.
*For the inverse normal computation, I use the approximation used by http://www.hpmuseum.org/software/67pacs/67ndist.htm adapted from Abramowitz and Stegun, Handbook of Mathematical Functions, National Bureau of Standards 1970.
MATRIX.
COMPUTE n = {4040}. /* Enter the sample size here (change the number in curly brackets)*/
COMPUTE x = {2048}. /* Enter the number of "successes" (change the number in curly brackets)*/
COMPUTE CONFID = {0.99}. /* Enter the desired confidence level here */
*The remainder of the syntax calculates the Confidence Interval given the values for n and x which you have entered above.
*NB you don't need to alter anything from here on.
COMPUTE Q = 0.5 * (1-CONFID).
COMPUTE A = ln(1/(Q**2)).
COMPUTE T_ = SQRT(A).
COMPUTE zstar = T_ - ((2.515517 + (0.802853*T_) + (0.010328*T_**2))/ (1 + (1.432788*T_) + (0.189269*T_**2) + (0.001308*T_**3))).
COMPUTE phat = x/n.
COMPUTE SE_phat = SQRT((phat*(1-phat))/n).
COMPUTE m = zstar * SE_phat.
COMPUTE LOWER = phat - m.
COMPUTE UPPER = phat + m.
COMPUTE ANSWER = {n, phat, zstar, SE_phat, Lower, Upper}.
PRINT ANSWER / FORMAT "F10.5" /Title = "Confidence Interval for a Single Population Proportion" / CLABELS = n, phat, zstar, SE, Lower, Upper.
END MATRIX.
*NB if you want to obtain values to a greater (lesser) number of decimal places, change the format specified in the last but one line of the syntax.
*e.g. if you want only 3 decimal places, change the format to "F10.3".
*------------------------------------------------------------------------------.

The output:
Run MATRIX procedure:
Confidence Interval for a Single Population Proportion
         n     phat    zstar       SE    Lower    Upper
  1196,000     ,627    1,960     ,014     ,600     ,655
------ END MATRIX -----
Source: ISSP 2007 data, CR

And don't forget: if you use this script (e.g. in a diploma thesis), you should credit it. Cite: Gwilym Pryce 2002. Large-Sample Confidence Interval for a Single Population Proportion. Inference for Proportions. Available at: http://www.spsstools.net/Syntax/Distributions/ProportionTestsAndCI.txt.

For a contingency table in SPSS: (only) graphically, or by computing via the syntax routine CI for proportion [Pryce 2002]
Example: Type of housing [s31] by description of place of living (size of community) [s21]
CROSS s31 BY s21.
Then fill the results into the formula or into the syntax routine by Pryce [2002].
Source: ISSP 2007 data, CR
• for the category „small town“:
                               p        lower bound   upper bound
Family house                   0.3266   0.2805        0.3727
Smaller apartment building     0.1482   0.1133        0.1832
Larger apartment building      0.5251   0.4761        0.5742
CROSS s31 BY s21 /CEL COL.
GRAPH /BAR(SIMPLE)=PCT BY s31 BY s21/INTERVAL CI(95.0).

Web calculators of confidence intervals for nominal variables (%)
• http://ncalculators.com/statistics/confidence-interval-calculator.htm
• http://www.surveysystem.com/sscalc.htm
• http://vassarstats.net/prop1.html
• Confidence Interval for the Difference Between Two Independent Proportions.
http://vassarstats.net/prop2_ind.html

Rule of thumb (if there is neither a computer nor a calculator): statistical margins for the binomial distribution
Value of 2σ (two standard deviations) in % → level of statistical significance 95.45 %
n = sample size (random sample)
p = frequency in population (%)
[table not reproduced] Source: [Noelle 1968: 118]

Task
• Compute the confidence interval for the proportion of people with a university diploma in CR using the ISSP 2007 sample data.
• Compare it with the true population value (statistics of CZSO/ČSÚ for 2007).
• What's wrong? → show solution in AKD2_1_CfI_RESENI

Comparing two population proportions (dichotomised variables in a crosstabulation)
• We can compute the confidence interval for the proportion of a specific value/category within subgroups, or for already existing results. For example, with dichotomised variables: Voted (dependent variable) by categories of Religion (Christian/otherwise) (independent variable), comparing whether the interval estimates within the categories of Religion overlap or not.
• It is more exact and easier to compute the CI of the % difference between the proportions/categories
• If the confidence interval of the difference of proportions does not include 0 (i.e. the difference is not „zero“ in the whole population), we can assert that the % difference between the (sub)categories is statistically significant (at the given α), i.e. it holds, with the given statistical error, for the whole population.
→ You can compute it by hand (for the formula see later) or use the SPSS syntax routine by G. Pryce [2002] http://www.spsstools.net/Syntax/Distributions/ProportionTestsAndCI.txt; use the last (4th) test, Large-sample Confidence Intervals for Comparing for two population proportions.
• This method can also be applied to a crosstabulation with more categories → step by step, comparing one value/category at a time.

Comparing two population proportions: SPSS syntax routine by G.
Pryce [2002] http://www.spsstools.net/Syntax/Distributions/ProportionTestsAndCI.txt
• Here we have to fill in results, e.g. from FREQ (univariate) or possibly CROSSTAB (bivariate). In fact there are four tests in this syntax.
• For comparing two population proportions it is the fourth test, Large-sample Confidence Intervals for Comparing for two population proportions. Fill in only the values of n1, n2 and x1, x2; you can also choose the confidence level (originally set to 90 % CI) and the number of decimals shown.
Example: Non-participation in sport clubs and culture associations [ISSP 2007, CR]

*-------------------------------------------------------------------------------.
* Large-sample Confidence Intervals for Comparing for two population proportions.
* (see Moore and McCabe (2001) Intro to the Practice of Statistics, p. 602-604).
*-------------------------------------------------------------------------------.
*For the inverse normal computation, I use the approximation used by http://www.hpmuseum.org/software/67pacs/67ndist.htm adapted from Abramowitz and Stegun, Handbook of Mathematical Functions, National Bureau of Standards 1970.
MATRIX.
COMPUTE n1 = {1222}. /* Enter the first sample size here (change the number in curly brackets)*/
COMPUTE n2 = {1222}. /* Enter the second sample size here (change the number in curly brackets)*/
COMPUTE x1 = {958}. /* Enter the number of "successes" for sample 1 here (change the nb in curly brackets)*/
COMPUTE x2 = {1016}. /* Enter the number of "successes" for sample 2 here (change the nb in curly brackets)*/
COMPUTE CONFID = {0.95}. /* Enter the desired confidence level here */
*The remainder of the syntax calculates the Confidence Interval given the values for n and x which you have entered above.
*NB you don't need to alter anything from here on.
COMPUTE Q = 0.5 * (1-CONFID).
COMPUTE A = ln(1/(Q**2)).
COMPUTE T_ = SQRT(A).
COMPUTE zstar = T_ - ((2.515517 + (0.802853*T_) + (0.010328*T_**2))/ (1 + (1.432788*T_) + (0.189269*T_**2) + (0.001308*T_**3))).
COMPUTE p1hat = x1/n1.
COMPUTE p2hat = x2/n2.
COMPUTE SE_phat = SQRT(((p1hat*(1-p1hat))/n1) + ((p2hat*(1-p2hat))/n2)).
COMPUTE m = zstar * SE_phat.
COMPUTE LOWER = (p1hat - p2hat) - m.
COMPUTE UPPER = (p1hat - p2hat) + m.
COMPUTE diffp1p2 = p1hat - p2hat.
COMPUTE ANSWER = {n1, n2, diffp1p2, zstar, SE_phat, Lower, Upper}.
PRINT ANSWER / FORMAT "F10.5" /Title = "Confidence Interval for Comparing 2 Proportions" / CLABELS = n1, n2, diffp1p2, zstar, SE, Lower, Upper.
END MATRIX.

Input data: Sport (q13_a) = 958; Culture (q13_b) = 1016; TOTAL = 1222.

The output (sport clubs and culture associations [ISSP 2007, CR]):
Run MATRIX procedure:
Confidence Interval for Comparing 2 Proportions
         n1          n2    diffp1p2       zstar          SE       Lower       Upper
 1222,00000  1222,00000     -,04746     1,96039      ,01592     -,07866     -,01626
------ END MATRIX -----
The result: the CI does not include 0 → the difference of 4.7 percentage points is statistically significant (at α = 5 %).

And don't forget: if you use this script (e.g. in a diploma thesis), you should credit it. Cite: Gwilym Pryce 2002. Large-Sample Confidence Interval for a Single Population Proportion. Inference for Proportions. Available at: http://www.spsstools.net/Syntax/Distributions/ProportionTestsAndCI.txt.
Or you can use the web calculator for the Confidence Interval for the Difference Between Two Independent Proportions http://vassarstats.net/prop2_ind.html

Simultaneous confidence intervals (for proportion) → multiple comparison problem
• So far we have drawn each conclusion independently, but if we want to assess several proportions together, we need to ensure that all parameters are covered at the predetermined level of confidence.
• For a simultaneous conclusion about several proportions we make the critical value C stricter: instead of the z-value for α we use the z-value for α / S, where S = the number of proportions for which we need simultaneous confidence intervals
• For example: for 4 proportions with the desired α = 0.05, α / S = 0.05 / 4 = 0.0125, and the corresponding two-sided z-value is approximately 2.50, i.e. rounded C = 2.5 (we use it instead of the common C = 1.96 for a single population proportion)
This method is similar to the multiple comparison of means in more than two subgroups, i.e. the post-hoc tests in analysis of variance. See statistical tables with critical values of the standard normal test for simultaneous testing.
Source: [Řehák, Řeháková 1986: 64-65]

Confidence interval in SPSS?
• SPSS computes a CI only for numerical variables, i.e. for the mean (e.g. EXPLORE)
• (In OLS regression we can get a CI for the regression coefficient B, in logistic regression for exp(B).)
• However, it is easy to compute the standard error (e.g. for a proportion % or a correlation coefficient; sometimes SPSS provides it), and by filling it into the formulas we can easily calculate the CI by hand (see later)
• Alternatively we can use syntax routines, e.g. for proportion by G. Pryce [2002] http://www.spsstools.net/Syntax/Distributions/ProportionTestsAndCI.txt, or scripts that post-process the output (for univariate % http://www.acrea.cz/skripty-interval-spolehlivosti-cetnosti.htm)
• Or we can calculate the CI outside SPSS, e.g. using internet calculators… (see later)

Standard error and confidence intervals for various parameters (correlation coefficient, median, difference of proportions (%), …)

Standard error and CI of the correlation coefficient (in SPSS)
The SE is not included in CORRELATION, but it is in CROSSTABS
CROSSTABS OC2011 BY PrijmD /FORMAT=NOTABLES /STATISTICS=CORR .
CI (95%) for R = 0.072 ± 1.96 × 0.023 = 0.072 ± 0.045, or 0.027 ← 0.072 → 0.117
The CI of a correlation coefficient can be computed at http://vassarstats.net/rho.html

Computation of standard error
• for mean: SE = s / √n
• for standard deviation
• for median
• for correlation coefficient

Computation of standard error
• for proportion (%): SE = √( p(1 − p) / n )
• for difference of proportions (%) p1 − p2: SE = √( p1(1 − p1) / n1 + p2(1 − p2) / n2 )
Web calculator for the Confidence Interval for the Difference Between Two Independent Proportions http://vassarstats.net/prop2_ind.html
• for Odds Ratio
More on http://davidmlane.com/hyperstat/A111955.html
http://www.miislita.com/information-retrieval-tutorial/a-tutorial-on-standard-errors.pdf

Routines for confidence intervals in SPSS syntax
• for proportion (%) http://www.spsstools.net/Syntax/Distributions/ProportionTestsAndCI.txt
• for median http://www.spsstools.net/Syntax/Distributions/Calculate95PercCIforTheMedian.txt
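For readers without SPSS, the hand calculations from these slides can be sketched in a few lines of Python. This is a minimal illustration under the same large-sample (normal approximation) assumptions, not a port of Pryce's routine: the function names are invented for this example, and the critical value comes from the standard library's exact inverse normal rather than the approximation used in the syntax above, so the last decimals can differ slightly from the SPSS output.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(confidence=0.95, n_intervals=1):
    """Two-sided critical value; n_intervals > 1 divides alpha by the number
    of simultaneous intervals, as on the simultaneous-CI slide."""
    alpha = (1 - confidence) / n_intervals
    return NormalDist().inv_cdf(1 - alpha / 2)


def ci_proportion(x, n, confidence=0.95, n_intervals=1):
    """Return (p, SE, lower, upper) for x 'successes' among n cases."""
    p = x / n
    se = sqrt(p * (1 - p) / n)
    margin = z_critical(confidence, n_intervals) * se
    return p, se, p - margin, p + margin


def ci_difference(x1, n1, x2, n2, confidence=0.95):
    """Return (p1 - p2, SE, lower, upper) for two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    margin = z_critical(confidence) * se
    diff = p1 - p2
    return diff, se, diff - margin, diff + margin


# Voter turnout example: 750 of 1196 respondents voted.
p, se, lower, upper = ci_proportion(750, 1196)
print(f"turnout: p = {p:.3f}, SE = {se:.3f}, 95 % CI = ({lower:.4f}; {upper:.4f})")

# Sport clubs vs. culture associations example: 958 and 1016 of 1222.
diff, se_diff, d_lower, d_upper = ci_difference(958, 1222, 1016, 1222)
print(f"difference = {diff:.5f}, SE = {se_diff:.5f}, "
      f"95 % CI = ({d_lower:.5f}; {d_upper:.5f})")

# Critical value for 4 simultaneous intervals at an overall alpha of 0.05.
z_four = z_critical(0.95, n_intervals=4)
print(f"C for 4 simultaneous proportions: {z_four:.3f}")
```

Any parameter with a known standard error can be treated the same way: the estimate plus and minus `z_critical(...)` times the SE, as in the correlation example above.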