Using Sample Survey Weights in Multiple Regression Analyses of Stratified Samples

WILLIAM H. DuMOUCHEL and GREG J. DUNCAN*

Source: Journal of the American Statistical Association, Vol. 78, No. 383 (Sep., 1983), pp. 535-543
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2288115

The rationale for the use of sample survey weights in a least squares regression analysis is examined with respect to four increasingly general specifications of the population regression model. The appropriateness of the weighted regression estimate depends on which model is chosen. A proposal is made to use the difference between the weighted and unweighted estimates as an aid in choosing the appropriate model and hence the appropriate estimator. When applied to an analysis of the familial and environmental determinants of the educational level attained by a sample of young adults, the methods lead to a revision of the initial additive model in which interaction terms between county unemployment and race, as well as between sex and mother's education, are included.

KEY WORDS: Weighted regression; Finite population; Superpopulation model; Educational attainment model.

* William H. DuMouchel is Associate Professor of Applied Mathematics, Statistics Center, Massachusetts Institute of Technology, Cambridge, MA 02139. Greg J. Duncan is Senior Study Director, Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI 48109. The authors want to thank Professor Herman Chernoff for his careful reading and constructive comments on an earlier draft of this article. They also want to thank two editors, three associate editors, and four referees for their many helpful comments on earlier drafts, and Martin Frankel, Jr., Graham Kalton, and Roger Wright for valuable discussion.

1. INTRODUCTION

Suppose that a sample survey measures $(p + 1)$ variables on each of $n$ individuals, so that the data consist of the $n \times 1$ matrix $Y$ and the $n \times p$ matrix $X$. Then the least squares estimator of the coefficients of the regression of $Y$ on $X$ is

$$\hat\beta = (X'X)^{-1} X'Y. \tag{1.1}$$

However, the rows of $Y$ and $X$ often are not a simple random sample from the population. Differential sampling rates and differential response rates among various strata lead to different probabilities of selection for each individual. Kish (1965) discusses the computation of these probabilities for various sampling schemes, but this article is concerned only with stratified sampling and not with the further complication of cluster sampling. Further, the stratification is permitted to be based on $X$ but not on $Y$.

As described in Kish (1965), the differential sampling and response rates lead to the computation of weights for each case, which attempts to give each stratum the same relative importance in the sample that it has in the population. This article assumes that an observable stratification variable $J$ takes on $k$ levels and that $\{\pi_j\}$, the proportions of the population for which $J = j$, $j = 1, \ldots, k$, are known. Let $n_j$ be the size of a simple random sample drawn from the $j$th stratum, $j = 1, \ldots, k$, so that $n_1 + \cdots + n_k = n$. Since the $j$th stratum is underrepresented in the sample by a factor proportional to $n_j/\pi_j$, the weight assigned to the $i$th observation is

$$w_i \propto \pi_{j_i}/n_{j_i}, \tag{1.2}$$

where $j_i$ is the value of $J$ for the $i$th observation, $i = 1, \ldots, n$. Let $W$ denote the diagonal matrix whose $i$th diagonal element is $w_i$. In some textbooks, and in many analyses of survey data (see Klein 1953, Bachman, Green, and Wirtanen 1974, Blumenthal et al. (1972), Duncan and Morgan (1976), Hu and Stromsdorfer (1970), and Juster (1976)), a weighted least squares estimator is used, namely,

$$\hat\beta_w = (X'WX)^{-1} X'WY. \tag{1.3}$$

Which estimator should be used? Controversy has raged at least since Klein and Morgan (1951). The advocates of $\hat\beta$ can point out that the justification for weighted regression in terms of adjusting for unequal error variances (see, e.g., Draper and Smith 1966) is not at issue here. In the usual homoscedastic regression model, $\hat\beta$ is minimum variance unbiased whether or not the strata are sampled proportional to size. Nevertheless, the advocates of $\hat\beta_w$ are concerned with reducing the supposed bias caused by the sampling scheme, reasoning by analogy to the estimation of an overall population total or mean. In that case, such weighting is clearly necessary if there are systematic differences in the stratum means. In addition, they argue that the assumptions that lead to the optimality of $\hat\beta$ are likely to be violated in populations of interest. Brewer and Mellor (1973) discuss how the choice between $\hat\beta$ and $\hat\beta_w$ is influenced by the choice of a model-based approach to inference versus an approach based on randomization within a finite population in which no particular model is assumed.

The point of this article is to clarify this issue by showing how the appropriate estimator depends on which of several possible regression models (if any) is appropriate and to show how a test based on $\hat\beta_w - \hat\beta$ may be used as a device to help decide which model, and hence which estimator, is appropriate. In Section 2 we define four different regression models of increasing generality that might be used to justify the use of $\hat\beta_w$. In Section 3 we discuss the relationship between the models and the choice of estimator, and in Section 4 we show how an easily computed test based on $\hat\beta_w - \hat\beta$ may help in choosing a model. Section 5 contains further discussion, and in the last two sections we illustrate the issues by the construction of an educational attainment model based on a national survey.

2. FOUR REGRESSION MODELS

2.1 Notation

The decision of whether to use the weights depends on what one assumes about the population from which the data have been drawn. In this section we describe four models that exemplify the most common assumptions. Associated with each model is a certain target quantity, or parameter of interest. The question is whether $\hat\beta$ or $\hat\beta_w$ is the more appropriate estimate of that target quantity.

The reader may find it easier to think in terms of sampling from an infinite population, since population size per se is not a major issue here. We always assume that the stratum sample sizes $\{n_j\}$ are small fractions of the corresponding population stratum sizes, and the mathematics of sampling with replacement or from infinite populations are used throughout this article. Let $\mathcal{Y}$ and $\mathcal{X}$ denote the scalar and $(1 \times p)$ random variables defined by a single draw of the dependent and independent variables, respectively, from the entire population. Let $y$ and $x$ denote values of $\mathcal{Y}$ and $\mathcal{X}$, namely, single rows of the data matrices $Y$ and $X$ respectively. Unconditional expectations $E(\cdot)$ refer to a simple random sample from the population, while conditional expectations $E(\cdot \mid J)$ refer to stratified sampling, where a simple random sample of size $n_j$ is taken from the $j$th stratum, $j = 1, \ldots, k$.

2.2 The Simple Linear Homoscedastic Model

This is the usual regression model in which

$$\mathcal{Y} = \mathcal{X}\beta + \epsilon, \tag{2.1}$$

where $\beta$ is a $p \times 1$ vector of coefficients, and $\epsilon$ is random error with mean 0 and variance $\sigma^2$. The key assumption is that the mean and variance of $\epsilon$, conditional on $(\mathcal{X}, J)$, are independent of $(\mathcal{X}, J)$.

2.3 The Mixture Model

This model supposes no unique $\beta$, but that $\beta$ varies by stratum in the population. That is, there are $k$ parameter vectors $\beta(1), \ldots, \beta(k)$, and, conditional on $J = j$,

$$\mathcal{Y} = \mathcal{X}\beta(j) + \epsilon, \tag{2.2}$$

where, again, $\epsilon$ has mean 0 and variance $\sigma^2$, independent of $(\mathcal{X}, J)$. By analogy to the univariate sample survey problem of estimating a mean, one target quantity of interest is the weighted average coefficient, namely, the vector

$$\bar\beta = \sum_{j=1}^{k} \pi_j \beta(j) = \sum_{i=1}^{n} w_i \beta_i \Big/ \sum_{i=1}^{n} w_i, \tag{2.3}$$

where $\beta_i = \beta(j_i)$ and the second equality follows from (1.2).

2.4 The Omitted-Predictor Model

This model assumes that the simple homoscedastic model of Section 2.2 would hold if only $\mathcal{X}$ were augmented by the unfortunately omitted $(1 \times q)$ variable $\mathcal{Z}$. That is, given $(\mathcal{X}, \mathcal{Z}, J)$,

$$\mathcal{Y} = \mathcal{X}\alpha + \mathcal{Z}\gamma + \epsilon = \mathcal{X}\beta + \mathcal{U}\gamma + \epsilon, \tag{2.4}$$

where $\epsilon$ has mean 0 and variance $\sigma^2$, independent of $(\mathcal{X}, \mathcal{Z}, J)$. The coefficients of $\mathcal{X}$ and $\mathcal{Z}$ are $\alpha$ and $\gamma$, respectively, while $\mathcal{U}$ is the part of $\mathcal{Z}$ orthogonal to $\mathcal{X}$, namely

$$\mathcal{U} = \mathcal{Z} - \mathcal{X}\,E(\mathcal{X}'\mathcal{X})^{-1} E(\mathcal{X}'\mathcal{Z}). \tag{2.5}$$

Since $\mathcal{Z}$ has not been identified, the target quantity or parameter of interest in this model is $\beta$, but if $\mathcal{Z}$ were identified, we assume the analyst would prefer to know $(\alpha, \gamma)$ rather than to know merely $\beta$.

There are two important points relating this model and the mixture model. First, even if $\mathcal{Z}$ is taken to be the $\mathcal{X} \times J$ interaction variable, so that the two models are identical, the two parameters $\bar\beta$ and $\beta$ usually will not be equal. Second, even when assuming that omitted predictors exist, we have in mind that they are not too numerous, so that although the omitted-predictor model is theoretically a generalization of the mixture model, in practice it would have fewer parameters since one would hope that $(p + q) < kp$, especially if $k$, the number of strata, is large.

2.5 The General Nonlinear Model (No Model)

This model makes the minimal assumption that

$$\mathcal{Y} = \mathcal{X}\beta^* + \epsilon^*, \tag{2.6}$$

where $E(\epsilon^*) = 0$ and $\operatorname{cov}(\mathcal{X}, \epsilon^*) = 0$. However, no other assumptions are made about $E(\epsilon^* \mid \mathcal{X}, J)$ or $V(\epsilon^* \mid \mathcal{X}, J)$. The parameter $\beta^*$ is thus defined as

$$\beta^* = E(\mathcal{X}'\mathcal{X})^{-1} E(\mathcal{X}'\mathcal{Y}). \tag{2.7}$$

The parameter $\beta^*$ will be called the census coefficient, since it would be the least squares estimate if the population were finite and the entire population were sampled, as in a census. Another interpretation of $\beta^*$ is that $\mathcal{X}\beta^*$ represents the best linear predictor of $\mathcal{Y}$ in the sense of minimizing the expected squared error of prediction. This "model" is not really a model since it assumes no population structure except that necessary to define the target quantity $\beta^*$.

If every $n_j$ is small compared to the size of the $j$th population stratum, this model seems equivalent to the finite population formulation in which the values of $\mathcal{Y}$ and $\mathcal{X}$ in the population are treated as fixed with no underlying structure. This model includes the three earlier models as special cases. Note, however, that if the mixture model is true, it is not generally true that $\beta^* = \bar\beta$, while, if the simple homoscedastic model or the omitted-predictor model is true, then $\beta^* = \beta$. In fact, setting $\epsilon^* = \mathcal{U}\gamma + \epsilon$ shows that the omitted-predictor model is formally equivalent to the general nonlinear model. But the former model assumes that $\mathcal{U}$ (actually $\mathcal{Z}$) is an easily interpreted and not-too-hard to measure variable that was omitted by oversight or by some practical necessity, while the latter model allows $\mathcal{U}$ to be any variable with $\operatorname{cov}(\mathcal{U}, \mathcal{X}) = 0$, perhaps an unobservable variable.

One reason for introducing both models is, as discussed in Section 3.4, to contrast two approaches a statistician might take after rejecting the simple homoscedastic model. The optimist ("a good fitting model is just around the corner") searches for extra predictors. The pessimist ("models are never valid in the real world") refuses to rely on any population structure.

3. WHEN TO USE WEIGHTED REGRESSION

3.1 Not If the Simple Linear Homoscedastic Model is Acceptable

Under the linear homoscedastic model, $\hat\beta$ is unbiased and has minimum variance among all linear unbiased estimators; it would naturally be preferred to $\hat\beta_w$. Haberman (1975) proves various relations between $\hat\beta$ and $\hat\beta_w$. For example, he shows that for any linear combination $c'\beta$ of the coefficients

$$4R/(1 + R)^2 \le V(c'\hat\beta \mid J)\,/\,V(c'\hat\beta_w \mid J) \le 1,$$

where $R$ is the ratio of the largest to the smallest of the $\{w_i\}$. In order for the linear homoscedastic model to be acceptable, it must be a priori plausible substantively and in addition pass the usual data analytic tests involving examination of residuals, checking for interactions, and so on.

3.2 Not If the Parameter $\bar\beta$ in the Mixture Model is to be Estimated

The mixture model cannot provide a general rationale for preferring $\hat\beta_w$ to $\hat\beta$ as an estimator of $\bar\beta$. To see this, consider the model of Section 2.3 and let $\nu$ and $\mu$ be defined by

$$\nu_i = x_i\beta_i, \qquad \mu_i = x_i\bar\beta, \qquad i = 1, \ldots, n,$$

where $x_i$ is the $i$th row of $X$. Then elementary calculations (let $Y = \nu + \epsilon = \mu + \nu - \mu + \epsilon$ in (1.1) and (1.3)) show that

$$E(\hat\beta \mid J) = \bar\beta + (X'X)^{-1}X'(\nu - \mu),$$
$$E(\hat\beta_w \mid J) = \bar\beta + (X'WX)^{-1}X'W(\nu - \mu).$$

Notice that, in general, neither $\hat\beta$ nor $\hat\beta_w$ is an unbiased estimate of the average coefficient $\bar\beta$, and there is no simple way to tell from the two preceding expressions which has the smaller bias. For example, if $p = 1$, so that the $x_i$ and the $\beta_i$ are all scalars, then

$$\operatorname{Bias}(\hat\beta) = \sum x_i^2(\beta_i - \bar\beta) \Big/ \sum x_i^2,$$
$$\operatorname{Bias}(\hat\beta_w) = \sum w_i x_i^2(\beta_i - \bar\beta) \Big/ \sum w_i x_i^2.$$

Then, if $x_i \equiv 1$, $\operatorname{Bias}(\hat\beta_w) = 0$ by the definition (2.3) of $\bar\beta$, but for other choices of $X$ this will not be true. In fact, it could happen that $x_i^2$ is proportional to $w_i$, in which case $\hat\beta$ would be unbiased but $\hat\beta_w$ would not be. In general, neither $\hat\beta$ nor $\hat\beta_w$ appear to be suitable estimators for $\bar\beta$ in the mixture model. Konijn (1962) and Porter (1973) use the mixture model and recommend estimating $\beta$ separately within each stratum and then taking a weighted average of the estimates as the final estimate of $\bar\beta$. That is, use

$$\hat{\bar\beta} = \sum_j \pi_j \hat\beta(j) = \sum_i w_i \hat\beta_i \Big/ \sum_i w_i.$$

Unfortunately, this recommendation is inadvisable for sampling schemes with many strata and relatively few observations per stratum. Pfefferman and Nathan (1981) suggest using weights for the $\hat\beta_i$ that take into account the precision of each $\hat\beta_i$. Sometimes separate estimation within many strata is impossible because there are too few degrees of freedom. If one were especially suspicious that one of the coefficients (typically the constant term of the usual regression) varies by strata, one could allow the estimate of only that coefficient to vary by strata. Such an analysis of covariance on the entire dataset costs only one degree of freedom per stratum.

3.3 Use $\hat\beta_w$ If the Linear Homoscedastic Model is not Acceptable but an Estimate of $\beta^*$ is Desired

The advantage of $\hat\beta_w$ in the models of Sections 2.4 and 2.5 is that $\hat\beta_w$ is at least a consistent estimator of $\beta = \beta^*$, while $\hat\beta$ may not be. Proof of consistency: let each $n_j \to \infty$ and $w_i \propto \pi_{j_i}/n_{j_i}$, $i = 1, \ldots, n$. Then with probability one $X'WX/\sum w_i$ approaches $E(\mathcal{X}'\mathcal{X})$ and $X'WY/\sum w_i$ approaches $E(\mathcal{X}'\mathcal{Y})$, so that by (2.7) $\hat\beta_w$ approaches $\beta^*$. On the other hand, if the sample sizes of the strata, $n_j$, are not proportional to the population proportions $\pi_j$ (i.e., the $w_i$ do not approach equality), then $\hat\beta$ need not approach $\beta^*$.

3.4 A Strategy for Choosing Between $\hat\beta$ and $\hat\beta_w$

First, if one believes the mixture model of Section 2.3 and desires to estimate $\bar\beta$ of Equation (2.3), then neither $\hat\beta$ nor $\hat\beta_w$ is appropriate. Therefore, for the rest of this article we ignore the mixture model and the estimation of $\bar\beta$.
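The estimators (1.1) and (1.3), the weights (1.2), and the consistency argument of Section 3.3 can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the two-stratum population, the quadratic mean function, the sample sizes, and the helper names `draw`, `ols`, and `wls` are all hypothetical choices made for this illustration and do not come from the article. Because the mean of the response is nonlinear in the predictor, the simple homoscedastic model fails, and the relevant target is the census coefficient $\beta^*$ of (2.7).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-stratum population (all numbers are illustrative assumptions).
pop_props = np.array([0.9, 0.1])      # pi_j: population proportions, assumed known
x_ranges = [(0.0, 1.0), (1.0, 3.0)]   # x is uniform on a different range in each stratum


def draw(stratum, size):
    """Simple random sample of the given size from one stratum; returns (X, y)."""
    lo, hi = x_ranges[stratum]
    x = rng.uniform(lo, hi, size)
    y = x**2 + rng.normal(0.0, 0.25, size)   # the mean of y is nonlinear in x
    return np.column_stack([np.ones(size), x]), y


def ols(X, y):
    """Unweighted least squares, equation (1.1)."""
    return np.linalg.solve(X.T @ X, X.T @ y)


def wls(X, y, w):
    """Weighted least squares with W = diag(w), equation (1.3)."""
    XtW = (X * w[:, None]).T
    return np.linalg.solve(XtW @ X, XtW @ y)


# Approximate the census coefficient (2.7) with a very large simple random sample.
N = 1_000_000
n_first = rng.binomial(N, pop_props[0])
parts = [draw(0, n_first), draw(1, N - n_first)]
beta_census = ols(np.vstack([p[0] for p in parts]),
                  np.concatenate([p[1] for p in parts]))

# Stratified sample that heavily oversamples the small stratum.
n_j = np.array([5000, 5000])
samples = [draw(j, n_j[j]) for j in range(2)]
X = np.vstack([s[0] for s in samples])
y = np.concatenate([s[1] for s in samples])
strata = np.repeat([0, 1], n_j)
w = pop_props[strata] / n_j[strata]   # w_i proportional to pi_j / n_j, equation (1.2)

beta_ols = ols(X, y)
beta_wls = wls(X, y, w)

print("census    :", beta_census)
print("weighted  :", beta_wls)
print("unweighted:", beta_ols)
```

Under these assumptions the weighted estimate should land close to the approximate census coefficient, while the unweighted estimate reflects the 50/50 composition of the sample and not the 90/10 composition of the population. If the sample sizes were made proportional to `pop_props`, the weights would be equal and the two estimators would coincide.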
There remains the problem of choosing between the linear homoscedastic model of Section 2.2 (thus choosing $\hat\beta$) and the more general models of Sections 2.4 and 2.5. If one prefers to estimate $\beta^*$ within the general nonlinear model, $\hat\beta_w$ is appropriate. If one believes the omitted-predictor model, then one should try to identify the extra predictor $\mathcal{Z}$ and estimate $(\alpha, \gamma)$ in Equation (2.4) or, failing that, settle for using $\hat\beta_w$ as an estimate of $\beta$.

The controversy arises in deciding how much evidence, if any, to require before giving up the linear homoscedastic model. Closely related is the question of how hard to look for additional predictors. On one side are those (see Kish and Frankel 1974, Brewer and Mellor 1973, and references therein) who tend to be extremely dubious of the assumptions of the linear homoscedastic model and who also may not be very interested in searching for extra predictors. They are satisfied with making inferences about the census parameter $\beta^*$.

On the other side are those who tend to accept the simple model of Section 2.2 so long as it can withstand the scrutiny of a careful regression analysis as described, for example, in the books by Mosteller and Tukey (1977) and Belsley, Kuh, and Welsch (1980). The process of refitting with transformed variables, checking for interactions, plotting residuals, and so on, may lead to the use of extra predictors, but the basic strategy is to accept the simple model (and use $\hat\beta$) unless evidence against it develops. The advantages of this approach are, one, the simple inferential procedures (standard F tests and confidence intervals) and, two, the more straightforward interpretation of $\beta$, which the model of Section 2.2 allows. Without the assumptions of that model, the interpretation of $\beta^*$ is difficult. For example, years of schooling may have a positive $\beta^*$ for predicting income, but the income of some subgroups may drop with increasing education. Published regression analyses are often applied to subpopulations or to completely different populations by later researchers. In that case, $\beta^*$ may be misleading, while the extra effort spent to identify interactions or other omitted predictors may lead to greater theoretical understanding. Smith (1976) makes a similar point.

Another way to contrast the opposing sides of the controversy is by the priority they assign to defining a target quantity. One side first defines the target quantity $\beta^*$ as the parameter of interest, no matter what the population structure may be. The other side tends to spend more effort searching for structure in the population and then chooses, from among target quantities like $\beta$, $\bar\beta$, $(\alpha, \gamma)$, and $\beta^*$, that parameter most suited to whatever structure seems to be present.

In the spirit of this latter approach, we next describe yet another test that the data should pass before one accepts the simple model and uses the estimator $\hat\beta$.

4. USING THE WEIGHTS TO TEST THE SIMPLE MODEL

The test is based on the difference $\hat\Delta = \hat\beta_w - \hat\beta$. The assumptions of Section 2.2 imply that $\Delta \equiv E(\hat\Delta) = 0$. As an alternative hypothesis, we consider the omitted-predictor model of Section 2.4,

$$Y = X\alpha + Z\gamma + \epsilon, \tag{4.1}$$

where the columns of $Z$ are further (perhaps unobserved) predictors that should have been included in the regression but were not. Although this particular alternative hypothesis formulation is not essential here, its use will be convenient for expressing and interpreting the expected mean squares of an ANOVA test of $\Delta = 0$ using standard linear models theory. We will see that the hypothesis $\gamma = 0$ implies $\Delta = 0$ but not vice versa. The hypothesis $\Delta = 0$ can also be interpreted as $E(\hat\beta \mid J) = \beta^*$ in the context of the general model of Section 2.5, but our development concentrates on the use of $\hat\Delta$ in a test (perhaps one of many) of the simple model versus the omitted-predictor model. Furthermore, when our test rejects the simple model, examination of $\hat\Delta$ usually suggests candidates for the needed predictors $Z$. In this section we do not distinguish between $E(\cdot)$ and $E(\cdot \mid J)$, since all expectations here are conditional on $(X, Z)$ and, for the two models being compared, the additional conditioning on $J$ makes no difference.

Since $\hat\Delta = \hat\beta_w - \hat\beta$, it may be represented as $\hat\Delta = DY$, where

$$D = (X'WX)^{-1}X'W - (X'X)^{-1}X'. \tag{4.2}$$

Under the model of Section 2.2, elementary calculations show that the covariance matrices of $\hat\beta$, $\hat\beta_w$, and $\hat\Delta$ are

$$V(\hat\beta) = (X'X)^{-1}\sigma^2,$$
$$V(\hat\beta_w) = (X'WX)^{-1}(X'W^2X)(X'WX)^{-1}\sigma^2,$$
$$V(\hat\Delta) = DD'\sigma^2 = \left[(X'WX)^{-1}(X'W^2X)(X'WX)^{-1} - (X'X)^{-1}\right]\sigma^2 = V(\hat\beta_w) - V(\hat\beta). \tag{4.3}$$

Notice two things about these expressions. First, $V(\hat\beta_w)$ is not $(X'WX)^{-1}\sigma^2$, as would be true if $V(\epsilon_i) = \sigma^2/w_i$, $i = 1, \ldots, n$. Thus, the standard errors and t statistics output by most weighted regression computer programs are invalid for our situation, even if the linear model holds and $\Delta = 0$. Second, since $V(\hat\beta_w) = V(\hat\beta + \hat\Delta) = V(\hat\beta) + V(\hat\Delta)$, we see that $\hat\beta$ and $\hat\Delta$ are uncorrelated, as can be shown directly by noticing that, as a linear transformation of $Y$, $\hat\Delta$ is orthogonal to the columns of $X$ (i.e., $DX = 0$). Therefore, the sum of the squared residuals from the unweighted regression can be partitioned into a part due to $\hat\Delta$ and an error, or unexplained, component. (Assume $n > 2p$ and that both $(X'X)$ and $V_\Delta = V(\hat\Delta)/\sigma^2$ are nonsingular.) This leads to an ANOVA table with three independent components (see Table 1).

Table 1. Formulas for ANOVA Table Comparing Weighted and Unweighted Regressions

Source          df          Sum of Squares                                    Mean Square
Regression^a    $p$         $SS_R = \hat\beta'(X'X)\hat\beta$                 $MS_R = SS_R/p$
Weights         $p$         $SS_w = \hat\Delta' V_\Delta^{-1} \hat\Delta$     $MS_w = SS_w/p$
Error           $n - 2p$    $SS_E$ = remainder                                $\hat\sigma^2 = SS_E/(n - 2p)$
Total           $n$         $Y'Y$

^a The source labeled regression here includes the constant term if it is present in the model. In most applications, and in our examples of Section 7, the effect of the grand mean is omitted and the df for regression is $p - 1$, while $SS_R$ and $Y'Y$ are reduced by the square of the grand mean.

If the model of Section 2.2 is true, and in addition $\epsilon$ is normally distributed, then the ratio $MS_w/\hat\sigma^2$ has an F distribution with $p$ and $(n - 2p)$ degrees of freedom. Under the extended model of Section 2.4, the expected value of $MS_w$ is $\sigma^2 + \Delta' V_\Delta^{-1} \Delta/p$, while the expected value of $\hat\sigma^2$ is $\sigma^2 + T^2/(n - 2p)$, where

$$T^2 = \gamma' Z'\left(I - X(X'X)^{-1}X'\right)Z\gamma - \Delta' V_\Delta^{-1} \Delta.$$

The formula for $T^2$ can be interpreted as the difference between the excess in the residual sum of squares due to neglecting the term $Z\gamma$ in the model (4.1) and that accounted for by estimating $\Delta = DZ\gamma$, where $D$ is given by (4.2). If $T^2$ is small (the next theorem implies that $T^2 = 0$ is equivalent to $Z = WXC$ for some matrix $C$), then the F test based on $MS_w/\hat\sigma^2$ will be a useful test of the simple linear model of Section 2.2. If this model is rejected, we conclude that $\hat\beta$ and $\hat\beta_w$ have different expectations. The rationale for preferring unweighted to weighted regression is also rejected unless some other variables $Z$ can be found that lead one to accept an extended model of the form (4.1).

A weighted least squares computer program is required to compute $\hat\beta_w$ and $\hat\Delta$, and another special program is required to compute $V_\Delta$. However, as the following theorem shows, $SS_w$ can be computed and the test performed with ordinary, unweighted, regression programs.

Theorem. The F test for $\Delta = 0$ is the same as the usual F test for $\gamma = 0$, if the regression model $Y = X\alpha + WX\gamma + \epsilon$ is fitted by ordinary least squares. (That is, create the new variables $Z = WX$ and test for the effect of $Z$ partialled on $X$.)

Proof. Since $\hat\Delta = DY = D(X\alpha + WX\gamma + \epsilon)$, and $DX = 0$,

$$\Delta = E(\hat\Delta) = DWX\gamma = V_\Delta (X'WX)\gamma$$

using (4.2) and (4.3), where $V_\Delta = V(\hat\Delta)/\sigma^2$. Therefore, the F tests of $\Delta = 0$ and $\gamma = 0$ will be equivalent if $V_\Delta$ and $(X'WX)$ are both nonsingular. One can show that this condition is equivalent to the assumption that the matrix $(X : WX)$ is full rank $(= 2p)$, which is true if there are at least $(p + 1)$ distinct $w_i$ whose corresponding rows of $X$ have rank $p$. (We conjecture that the theorem is true for arbitrary $X$ and $W$, although the F tests would have fewer degrees of freedom in the numerator.)

In practice, one might use one of two different methods to compute $SS_R$ and $SS_w$, depending on the details of one's least squares regression computer program, after having formed the variables $Y$, $X$, and $Z = WX$ ($Z_{ij} = w_i X_{ij}$).

Method A. Perform the regression of $Y$ on $X$ and $Z$, and then refit the regression, dropping the $Z$ variables. The two "due to regression" sums of squares will be $SS_R + SS_w$, and $SS_R$, respectively.

Method B. Perform the regressions of $Y$ and $Z$ on $X$. Then perform the regression of $Y$ on the residuals of the $Z$-on-$X$ regressions. The last "due to regression" sum of squares will be $SS_w$.

5. SOME FURTHER REMARKS

The following remarks are somewhat independent of each other but are offered as discussion and for clarification.

Remark 1. Although the tests involving $\Delta$ and $\gamma$ are equivalent, interpretation of their individual components is somewhat different and less straightforward for $\hat\gamma$ than for $\hat\Delta$. If the hypothesis $\Delta = 0$ is rejected, we suggest checking for interactions among the variables for which the corresponding components of $\hat\Delta$ or $\hat\gamma$ are significantly different from zero. This procedure has, of course, no power to detect interactions not related to the weight variable, that is, those for which the $T^2$ of Section 4 is large.

Remark 2. Bishop (1977) suggests using a weighted regression in certain situations when the sampling ratio is a function of the dependent variable (not merely the predictor variables) even when the simple model holds. The present article does not discuss that situation, which is akin to retrospective or case-control sampling. Manski and McFadden (1980) provide a general formulation and analysis of the problem. Holt, Smith, and Winter (1980) provide a different formulation and solution that assumes that the probability of selection depends on a normally distributed design variable with known variance.

Remark 3. Thomson (1978) presents another rationale for using estimates of regression coefficients that depend on the sample design, even when the additive model holds. He shows that although $\hat\beta$ is best conditional on $X$, there may be other unbiased estimators with smaller variance for certain ranges of the true $\beta$, if bias and variance are computed by averaging over all values of $X$ in a given sampling design. However, under Thomson's model, $\hat\beta$ is always more efficient than $\hat\beta_w$.

6. AN EXAMPLE

To elucidate the issues in terms of the applicability of the models discussed in Section 2, we consider the analysis of a subset of data from the Panel Study of Income Dynamics, a continuing longitudinal study begun in 1968 by the University of Michigan's Survey Research Center. The original sample of 4,802 families was composed of two subsamples. The larger portion (2,930) of the original interviews were conducted with household heads from a representative cross-section sample of families in the United States. An additional 1,872 interviews were conducted with heads of low-income households drawn from a sample identified and interviewed by the Census Bureau for the 1966-1967 Survey of Economic Opportunity. Annual interviews have been conducted since 1968 with these household heads and also with the heads of new families formed by original panel members who have left home. At the end of the fifth year of the study, a set of weights was calculated to account for initial variations in sampling rates and variations in nonresponse rates. The weights were intended to help estimate population means and totals and for possible use in other statistical analyses. The calculation of the weights, which are inversely proportional to the probability of selection for each individual, is described in Morgan (1972, pp. 33-34). For the purpose of this example, we ignore the fact that the sampling scheme was clustered as well as stratified.

In this analysis we attempt to predict educational attainment and we restrict ourselves to panel individuals (a) who were children in 1968 households, aged 14-18; (b) who had become heads or wives of families by 1975; (c) who had completed their schooling by 1975; and (d) who had completed at least eight years of schooling. Restrictions (c) and (d) eliminated 46 and 9 observations, respectively. A final restriction was necessary because the 867 individuals satisfying restrictions (a) through (d) came from only 658 different families. Since the educational attainments of siblings are not likely to be independent, we randomly selected one observation from each set of siblings that came from the same family, reducing the number of observations to 658. The weights for these 658 cases range from 1 to 83, with a mean of 29.0 and standard deviation equal to 21.4. The data used in this analysis, consisting of 867 computer card images, is available from the authors.

Theoretical and empirical studies of the economics of educational attainment (Becker 1975, Ben-Porath 1967, Duncan 1974, Edwards 1975, Hill 1979, Liebowitz 1974, Parsons 1975, and Wachtel 1975) have identified numerous characteristics of the family and the economic environment that may affect the attainment decision. This past research leads to our initial specification of the following form:

$$\begin{aligned}
\mathrm{Ed} = \alpha &+ \underset{(+)}{\beta_1\,\mathrm{FaEd}} + \underset{(+)}{\beta_2\,\mathrm{MothEd}} + \underset{(-)}{\beta_3\,\mathrm{Sibs}} \\
&+ \underset{(+)}{\beta_4\,\mathrm{FamilyIncome}} + \underset{(+)}{\beta_5\,\mathrm{Age}} + \underset{(+)}{\beta_6\,\mathrm{Exp/Pupil}} \\
&+ \underset{(+)}{\beta_7\,\mathrm{Unemployment}} + \underset{(-)}{\beta_8\,\mathrm{Rural}} + \underset{(+)}{\beta_9\,\%\mathrm{College}} \\
&+ \underset{(+)}{\beta_{10}\,\mathrm{CountyIncome}}.
\end{aligned} \tag{6.1}$$

Table 2 defines the variables in the model (6.1) and the hypothesized signs of the coefficients are given in the parentheses of (6.1). In addition, we include dummy variables for black males, white females, and black females in order to compare their educational attainment with white males. Because low-income families were oversampled, the number of black and white observations are far more evenly distributed in the sample than in the population. There are 176 observations on white males, 112 on black males, 221 on white females, and 149 on black females.

Table 2. Variable Definitions in Model Equation (6.1)

Variable        Definition
Ed              Completed educational attainment of the individual, in years, self-reported
FaEd            Educational attainment level of the father, in years, reported by the father
MothEd          Educational attainment level of the mother, in years, reported by the father
Sibs            Number of siblings
FamilyIncome    1967-1971 average total parental family income, in thousands of 1967 dollars, reported by the father
Age             The age of the individual, in years
Exp/Pupil       Per student public school expenditure in 1968, for county of residence in 1968
Unemployment    The percent of county labor force unemployed in 1970, for county of residence in 1968
Rural           A dichotomous variable equal to one if the parental family resides more than 50 miles from a city of 50,000 or more in 1968, and zero otherwise
% College       The percent of persons 25 or more years old in the 1968 county of residence who have completed four or more years of college
CountyIncome    The median 1969 income in 1968 county of residence, in thousands of 1969 dollars

7. RESULTS OF THE DATA ANALYSES

We began our analysis by obtaining unweighted estimates of the parameters of (6.1), with the race-sex dummy variables included as additive predictors. The coefficients and associated standard errors are the boldface entries in the second and third columns, respectively, of Table 3. Considered as a whole, these variables account for more than a quarter of the variance in educational attainment. Virtually all of them have the hypothesized signs, although, with the exception of the county income variable, only the ones measured at the family level are statistically significant at conventional levels. The result for the county income variable is puzzling, although a significant negative coefficient has also been found by Wachtel (1975, p. 515) with different data. A histogram of the residuals and a scatterplot of the residuals versus the fitted values showed no gross deviations from the assumptions of the simple model.

Using Method A of Section 4 to compare $\hat\beta$ with $\hat\beta_w$, we formed 14 $Z$ variables by multiplying each independent variable (including the constant) by the weight variable. When the dependent variable is regressed on both sets of independent variables, we found that the null hypothesis, that $\Delta = 0$ (i.e., the simple model is correct), could be rejected at about the 6 percent level, as Table 4 indicates. We proceeded to calculate estimates of $\Delta$, $V_\Delta$, and weighted estimates of $\beta$.

The weighted estimates of $\beta$ and the t ratios of $\hat\Delta$ are given in the boldface entries of columns 4 and 5 of Table 3. The regression performed in Method A provides an estimate of coefficients ($\gamma$) and standard errors for each of the $Z$ variables; the t ratios of each are given in the sixth column of the table.

Table 3. Coefficients, Standard Errors, and t Ratios of Various Tests for Two Versions of the Educational Attainment Model

Columns: Independent Variable; $\hat\beta$: Unweighted Estimate of $\beta$; SE($\hat\beta$): Standard Error; $\hat\beta_w$: Weighted Estimate of $\beta$; $t_\Delta$: t Ratio of Difference Between Weighted and Unweighted Estimates of $\beta$; $t_\gamma$: t Ratio of Estimated Coefficient of Variable × Weight.

Rows: FaEd; MothEd; Sibs; Family Income; Age; Whether Black Male; Whether White Female; Whether Black Female; Expenditure/Pupil; Unemployment; Rural; % College; County Income; Constant; Whether White and Unemployment; Whether Female and MothEd; MothEd²; Whether Female × MothEd²; R²; Standard Error of Estimate; Sample Size.

Rows whose entries are recoverable:

Independent Variable          $\hat\beta$    SE($\hat\beta$)    $\hat\beta_w$    $t_\Delta$    $t_\gamma$
MothEd²                       .008           (.007)             .005             -.6           -.4
Whether Female × MothEd²      .017           (.009)             .017             -.0           -.2

                              Initial model    Final model
R²                            .289             .315
Standard Error of Estimate    1.60             1.57
Sample Size                   658              658

Remaining cell entries as extracted (row and column alignment lost): .0 1.6 (.021) (.026) (.020) (.126) .082 .158 .082 .055 -.069 .033 (.031) (.012) (.031) (.012) -.069 .039 -.074 .030 .251 -1.267 (.045) (.215) (.045) (.448) .314 -.639 -.068 1.385 (.163) (.796) -.218 -.070 .491 (.196) (.900) .142 .268 1.0 .022 .037 -.081 .016 -.104 .006 .183 -.097 .019 -.084 (.078) (.042) (.175) (.017) (.049) (.077) (.076) (.172) (.017) (.048) .027 -.015 -.113 .030 -.129 .027 .210 -.119 .024 -.092 6.705 7.679 -.208 (.907) (1.080) (.085) 6.073 6.794 -.252 .082 .125 -.073 -.044 .271 -.280 .076 .002 -.340 (.169) -.2 -.9 .3 .6 .0 2.0 -.2 .5 .0 -.2 -.2 -.5 -.1 .5 .297 1.753 1.3 -1.5 1.4 -1.0 1.6 -2.1 1.6 -2.0 1.048 -2.8 -.5 -2.8 -1.1 -.3 -.3 1.0 .1 -2.3 -.3 1.0 -.7 .4 .3 -.2 .4 -.2 .5 -2.3 .2 .6 -.7 .6 .3 .2 .4 -.3 -1.0 -1.1 -.7 -.320 -.5 .2 -.6 -1.2 .2

Source: Morgan (1972).

Table 4. ANOVA Table Comparing Weighted and Unweighted Regressions Using the Initial Model

Source        df     Sum of Squares    Mean Square    F       Significance
Regression    13     670.0             51.5           20.1    <.0001
Weights       14     59.2              4.2            1.7     .056
Error         630    1586.7            2.5
Total         657    2315.9

Given the significance of these differences, we could have chosen to use the $\hat\beta_w$ of column 4 as descriptive estimates of $\beta^*$ in the census model. This rejection of the simple model would put more emphasis on race and sex, and less on unemployment, as predictors of educational attainment.

We chose instead to use the information in the boldface entries of Table 3 to explore extensions of the simple model of (6.1). The unweighted coefficients differed substantially from the weighted coefficients for four variables: mother's education, county unemployment rate, and two of the race-sex dummy variables. Notice that there is a rough correspondence between the ranking (and direction) of the t ratios on differences between unweighted and weighted coefficients and the t ratios of the corresponding $Z$ variables. Thus, the more easily computed tests of significance on the $Z$ variables can apparently serve as a guide to variables with large $\hat\Delta$'s.

Prior research and the significant $\hat\gamma$'s suggest that the most probable cause of misspecification is omitted interactions between race and sex and some of the independent variables listed in (6.1), especially mother's education and county unemployment. Although the hypothesis of equal slopes for the four race and sex subgroups could not be rejected (F = .94; df = 30, 614), the subset regressions did suggest a possible interaction between race and county unemployment rate in which increases in unemployment had a stronger positive effect on the educational attainment of blacks than of whites. This interaction is quite plausible since unemployment rates for blacks are considerably higher than those of whites and a unit change in the overall county unemployment rate has more effect on blacks than on whites. When this interaction term was added to (6.1), the coefficient was -.21 with a standard error of .09. Furthermore, when Method A was repeated with it and its associated $Z$ variable (i.e., whether white × county unemployment rate × weight) included as predictors, the F ratio of the entire set of weight interactions dropped from 1.7 to 1.3, and the coefficients of the two $Z$ variables formed from the county unemployment variable were insignificant.

Since the three other $Z$ variables significant at the .05 level in the original specification remained significant when the race-unemployment interaction was included, we continued our search for additional interactions. We discovered that mother's education interacted with itself (i.e., its effect was nonlinear) and, furthermore, that these nonlinear effects of mother's education on the educational attainment of the children depended upon the sex of the child.

Table 6. Estimated Increase in Mean Ed per Year of Increase in MothEd, Based on Final Model

                 Education Level of Mother
Sex of Child     8 Grades    12 Grades (H. S. Grad.)    16 Grades (College Grad.)
Male             .13         .19                        .26
Female           .06         .26                        .46

... these 209 individuals from families with at least two children in this five-year age cohort. When the final model was fitted to the complete sample of 867 individuals, the coefficient of income rose from .033 to .044.

Our use of the weights to test for model misspecification has led us to the substantive conclusion that a simple linear additive educational attainment model is not appropriate for several reasons. First, a worsening of local economic conditions (as measured by the county unemployment rate) appears to provide more of an incentive for blacks to stay in school than for whites. A one percentage point increase in the unemployment rate was associated with an additional one-fifth of a year of educational attainment for blacks, while the effect for whites was essentially zero. Second, the effects of moth
children of attainment er's educationon theeducational Resultsfromthe estimation of our finalspecificationincreasewiththelevelofhereducationandfurthermore of theeducationattainment modelare theitalicizeden- dependon the sex of the child.Table 6 evaluatesaEd/ triesof Table 3. In contrastto theinitialspecifications,aMothEdformotherswith8, 12, and 16 yearsof eduthe highestt ratioforthe difference betweenweighted cation. and unweighted is 1.4. The analysisof vari- There is a modestincreasein the marginaleffectof coefficients ance tableforthefinalmodel,Table 5, showsthattheF mother'seducationalattainment for sons and a much ratioassociatedwiththeweightsis below 1.0. An moredramaticincreasein thiseffectfordaughters. The coefficients and standarderrorspresentedas the extrayearofmother'seducationlevelis associatedwith italicentriesof columns2 and 3 of Table 3 shouldbe virtually whose ofdaughters no increaseintheattainment regardedwithsomecautionbecause thedatawereused mothers havean eighthgradeeducationbutis associated to suggesttheappropriate functional form. withan additionalone-halfyearofeducationfordaughAs a finalanalysisstep,we reestimated themodelon terswithcollege-educated do Theseconclusions mothers. the209 individuals whowereexcludedwhenthesample notchangewhenthemodelis estimatedfromthecomwas restricted to onlyone siblingperfamily.The results pletesampleof867individuals. None ofthesixnumbers were quite similar,particularly forthe race-unemploy-in thistablechangesby morethan.03. mentinteraction andthenonlinear effectofmother'sedFinally,an analysisoftheresidualsfromthisextended ucation.Although thesignoftheinteraction betweensex modelrevealedno anomalies,andwe thusprefer theunof childand mother'seducationchangeddirection, the weightedestimatesof its coefficients. thatchangedbya statistically onlycoefficient significant amountwas familyincome.The incomecoefficient in- [ReceivedNovember1976.RevisedNovember1982.] 
a moreimportant roleforincomefor creased,suggesting REFERENCES BACHMAN,J.G.,GREEN, S., andWIRTANEN,I.D. (1974),Youth Ann Vol. III, DroppingOut-Problemor Symptom?, in Transition, forSocial Research. Arbor,Mich.:Institute and EmBECKER, GARY S. (1975),HumanCapital:A Theoretical toEducation(2nded.), New piricalAnalysis,withSpecialReference Press. York:ColumbiaUniversity Significance BELSLEY, D.A., KUH, E., and WELSCH, R.E. (1980),Regression Data andSourcesofCollinearity, Influential Diagnostics:Identifying <.0001 New York:JohnWiley. .494 of HumanCapital BEN-PORATH,YORAM (1967),"The Production and theLifeCycleof Earnings,"JournalofPoliticalEconomy,75, 352-365. BISHOP, JOHN (1977), "EstimationWhenthe SamplingRatiois a Table 5. ANOVATable ComparingWeightedand UnweightedRegressionsUsing the Final Model Source Regression Weights df Sum of Squares Mean Square F 17 18 730.6 43.3 43.0 2.4 17.35 . .97 Error 622 1542.0 Total 657 2315.9 2.5 DuMouchel and Duncan: UsingSample SurveyWeightsin Regression 543 LinearFunctionoftheDependentVariable,'Proceedings ComplexSamples,"JournalofRoyalStatistical Society,Ser. B, 36, ofSocial 1-37. StatisticsSectionoftheAmericanStatistical Association,848-853. BLUMENTHAL, M.D., KAHN, R.L., ANDREWS, F.M., and KLEIN, L.R. (1953),A Textbook Ill.: Row, ofEconometrics, Evanston, Peterson,and Co. HEAD, K.B. (1972),Justifying Violence:Attitudes ofAmerican Men, AnnArbor,Mich.:Institute forSocial Research. KLEIN, L.R., and MORGAN,JAMESN. (1951),"ResultsofAlternativeStatisticalTreatments of SampleSurveyData," Journalof BREWER,K.R.W., andMELLOR, R.W. (1973),"The Effect ofSamAmericanStatistical ple Structure Association,46, 442-460. on Analytical Journalof StatisSurveys,"Australian KONIJN,H.S. (1962),"Regression AnalysisinSampleSurveys,"Jourtics,15, 145-152. Association,57, 590-606. DRAPER, N.R., andSMITH, H. (1966),AppliedRegression Analysis, nal ofAmericanStatistical LEIBOWITZ, ARLEEN (1974), "Home Investments in Children," New York:JohnWiley. 
DUNCAN, GREG (1974),"EducationalAttainment," Five Thousand JournalofPoliticalEconomy,82, 111-131. Estimators AmericanFamilies-Patternsof EconomicProgress,Vol. I. eds. MANSKI, C., and McFADDEN, D. (1980),"Alternative and SampleDesignsforDiscreteChoice Analysis,"in Structural Morganetal., AnnArbor,Mich.:Institute forSocialResearch,305AnalysisofDiscreteData, eds. C. ManskiandD. McFadden,Cam332. bridge,Mass.: M.I.T. Press. DUNCAN, GREG J.,and MORGAN,JAMESN. (eds.) (1976),Five ThousandAmerican Families-Patterns ofEconomicProgress,Vol. MORGAN,JAMES N. (1972),A Panel Studyof IncomeDynamics: AvailableData, Vol.1, AnnArbor,Mich.: StudyDesign,Procedures, forSocial Research. IV, AnnArbor,Mich.:Institute Institute forSocial Research. EDWARDS, LINDA N. (1975),"The Economicsof SchoolingDecisions:TeenageEnrollment Rates,"Journal ofHumanResources,10, MOSTELLER, F., and TUKEY, J.W. (1977), Data Analysisand Regression,Reading,Mass.: Addison-Wesley. 155-73. Transfers andthe HABERMAN,SHELBY J.(1975),"How MuchDo Gauss-Markov and PARSONS, DONALD 0. (1975),"Intergenerational EducationalDecisionsof Male Youth," Quarterly Journalof EcoLeast-SquareEstimatesDiffer? A Coordinate-Free Approach,"The nomics,89, 603-617. AnnalsofStatistics, 3, 982-990. Analysis HILL, C. RUSSELL (1979),"Capacities,Opportunities, and Educa- PFEFFERMANN,D., andNATHAN,G. (1981),"Regression ofData Froma ClusterSample,"Journal Statistical tionalInvestments: oftheAmerican The Case oftheHighSchoolDropout,"TheReAssociation,76, 681-689. viewofEconomicsand Statistics, 61, 9-20. HOLT, D., SMITH, T.M.F., andWINTER,P.D. (1980),"Regression PORTER, RICHARD D. (1973), "On the Use of SurveySample intheLinearModel,"AnnalsofEconomicandSocialMeasWeights AnalysisofData fromComplexSurveys,"JournaloftheRoyalStaurement, 212, 141-158. tisticalSociety,Ser. A, 143,474-487. Stratified HU, T.W., and STROMSDORFER, E.W. (1970), "A Problemof SMITH, KENT W. 
(1976),"AnalysingDisproportionately Samples With ComputerizedStatisticalPackages," Sociological Bias inEstimating Weighting theWeighted Regression Model,"Proceedingsof theBusinessand EconomicsSectionof theAmerican Methodsand Research,5, 207-230. THOMSEN, IB (1978),"Design and Estimation ProblemsWhenEsStatisticalAssociation,513-516. a Regression timating Coefficient FromSurveyData," Metrika,.25, JUSTER,F. THOMAS (ed.) (1976),TheEconomicandPoliticalImpact of GeneralRevenueSharing,Washington, D.C.: U.S. Government 27-36. WACHTEL, PAUL (1975),"The Effect ofSchoolQualityon AchievePrinting Office. ment,Attainment Levels, and LifetimeEarnings,"Explorations in KISH, LESLIE (1965),SurveySampling,New York:JohnWiley. KISH, LESLIE, andFRANKEL, MARTINR. (1974),"Inference from EconomicResearch,2, 502-536.
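The Method A comparison used in the educational attainment example (multiply every regressor, including the constant, by the weight, regress the dependent variable on both sets, and test the block of Z coefficients) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' computation: the data are simulated, the two strata with weights 1 and 4 are invented, and the helper names `solve`, `residual_ss`, `method_a_f_ratio`, and `simulate` are hypothetical. Only the F ratio is computed; a significance level would additionally require the F distribution with p and n - 2p degrees of freedom.

```python
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def residual_ss(rows, y):
    """Residual sum of squares from an unweighted least squares fit of y on rows."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def method_a_f_ratio(x_rows, y, weights):
    """F ratio for the hypothesis that every Z = weight * X coefficient is zero."""
    p = len(x_rows[0])  # number of regressors, constant included
    full_rows = [r + [w * v for v in r] for r, w in zip(x_rows, weights)]
    rss_simple = residual_ss(x_rows, y)   # regression on X only
    rss_full = residual_ss(full_rows, y)  # regression on X and Z
    df_error = len(y) - 2 * p
    return ((rss_simple - rss_full) / p) / (rss_full / df_error)


def simulate(slope_depends_on_stratum, n=200, seed=0):
    """Simulated data (an assumption): two strata with weights 1 and 4."""
    rng = random.Random(seed)
    x_rows, y, weights = [], [], []
    for i in range(n):
        w = 1.0 if i % 2 == 0 else 4.0
        x = rng.uniform(0.0, 10.0)
        slope = 0.5 + 0.5 * w if slope_depends_on_stratum else 2.0
        x_rows.append([1.0, x])
        y.append(1.0 + slope * x + rng.gauss(0.0, 1.0))
        weights.append(w)
    return x_rows, y, weights


f_ok = method_a_f_ratio(*simulate(False))
f_bad = method_a_f_ratio(*simulate(True))
print(f"simple model correct:        F = {f_ok:.2f}")
print(f"omitted stratum interaction: F = {f_bad:.2f}")
```

With 13 regressors plus a constant and n = 658, the same bookkeeping gives 14 degrees of freedom for the weights and 658 - 28 = 630 for error, which matches the degrees of freedom reported in Table 4.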
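The marginal effects in Table 6 follow from differentiating the final model with respect to MothEd, which enters linearly, quadratically, and through interactions with the sex of the child. The sketch below shows that arithmetic. The two quadratic coefficients (.008 and .017) are the legible final-model entries of Table 3; the two linear coefficients (.002 for MothEd and -.340 for Whether Female × MothEd) are an assumed reading of the garbled table, adopted here only because they reproduce the Table 6 values to two decimals. The function name `marginal_effect` is hypothetical.

```python
# Final-model unweighted coefficients.
B_MOTHED = 0.002          # assumed reading of Table 3 (linear term)
B_FEM_MOTHED = -0.340     # assumed reading of Table 3 (Whether Female x MothEd)
B_MOTHED_SQ = 0.008       # MothEd^2, legible in Table 3
B_FEM_MOTHED_SQ = 0.017   # Whether Female x MothEd^2, legible in Table 3


def marginal_effect(moth_ed, female):
    """dEd/dMothEd for a model quadratic in MothEd with sex interactions."""
    slope = B_MOTHED + 2 * B_MOTHED_SQ * moth_ed
    if female:
        slope += B_FEM_MOTHED + 2 * B_FEM_MOTHED_SQ * moth_ed
    return slope


for label, female in (("Male", False), ("Female", True)):
    effects = [round(marginal_effect(grades, female), 2) for grades in (8, 12, 16)]
    print(label, effects)
```

Because the derivative is linear in MothEd, each additional four years of mother's education adds 4 × 2 × .008 = .064 to the effect for sons and 4 × 2 × (.008 + .017) = .20 for daughters, which is the contrast between the "modest" and "dramatic" increases described in the text.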