Using Sample Survey Weights in Multiple Regression Analyses of Stratified Samples
Author(s): William H. DuMouchel and Greg J. Duncan
Source: Journal of the American Statistical Association, Vol. 78, No. 383 (Sep., 1983), pp. 535-543
Published by: American Statistical Association
Stable URL: http://www.jstor.org/stable/2288115 .
Using Sample Survey Weights in Multiple Regression Analyses of Stratified Samples
WILLIAM H. DuMOUCHEL and GREG J. DUNCAN*

* William H. DuMouchel is Associate Professor of Applied Mathematics, Statistics Center, Massachusetts Institute of Technology, Cambridge, MA 02139. Greg J. Duncan is Senior Study Director, Survey Research Center, Institute for Social Research, University of Michigan, Ann Arbor, MI 48109. The authors want to thank Professor Herman Chernoff for his careful reading and constructive comments on an earlier draft of this article. They also want to thank two editors, three associate editors, and four referees for their many helpful comments on earlier drafts, and Martin Frankel, Jr., Graham Kalton, and Roger Wright for valuable discussion.

© Journal of the American Statistical Association, September 1983, Volume 78, Number 383, Applications Section

The rationale for the use of sample survey weights in a least squares regression analysis is examined with respect to four increasingly general specifications of the population regression model. The appropriateness of the weighted regression estimate depends on which model is chosen. A proposal is made to use the difference between the weighted and unweighted estimates as an aid in choosing the appropriate model and hence the appropriate estimator. When applied to an analysis of the familial and environmental determinants of the educational level attained by a sample of young adults, the methods lead to a revision of the initial additive model in which interaction terms between county unemployment and race, as well as between sex and mother's education, are included.

KEY WORDS: Weighted regression; Finite population; Superpopulation model; Educational attainment model.

1. INTRODUCTION

Suppose that a sample survey measures (p + 1) variables on each of n individuals, so that the data consist of the n x 1 matrix Y and the n x p matrix X. Then the least squares estimator of the coefficients of the regression of Y on X is

$$\hat\beta = (X'X)^{-1} X'Y. \tag{1.1}$$

However, the rows of Y and X often are not a simple random sample from the population. Differential sampling rates and differential response rates among various strata lead to different probabilities of selection for each individual. Kish (1965) discusses the computation of these probabilities for various sampling schemes, but this article is concerned only with stratified sampling and not with the further complication of cluster sampling. Further, the stratification is permitted to be based on X but not on Y.

As described in Kish (1965), the differential sampling and response rates lead to the computation of weights for each case, which attempts to give each stratum the same relative importance in the sample that it has in the population. This article assumes that an observable stratification variable J takes on k levels and that $\{\pi_j\}$, the proportions of the population for which J = j, j = 1, ..., k, are known. Let $n_j$ be the size of a simple random sample drawn from the jth stratum, j = 1, ..., k, so that $n_1 + \cdots + n_k = n$. Since the jth stratum is underrepresented in the sample by a factor proportional to $n_j/\pi_j$, the weight assigned to the ith observation is

$$w_i \propto \pi_{j_i}/n_{j_i}, \quad i = 1, \ldots, n, \tag{1.2}$$

where $j_i$ is the value of J for the ith observation. Let W denote the diagonal matrix whose ith diagonal element is $w_i$. In some textbooks, and in many analyses of survey data (see Klein 1953, Bachman, Green, and Wirtanen 1974, Blumenthal et al. (1972), Duncan and Morgan (1976), Hu and Stromsdorfer (1970), and Juster (1976)), a weighted least squares estimator is used, namely,

$$\hat\beta_W = (X'WX)^{-1} X'WY. \tag{1.3}$$

Which estimator should be used? Controversy has raged at least since Klein and Morgan (1951). The advocates of $\hat\beta$ can point out that the justification for weighted regression in terms of adjusting for unequal error variances (see, e.g., Draper and Smith 1966) is not at issue here. In the usual homoscedastic regression model, $\hat\beta$ is minimum variance unbiased whether or not the strata are sampled proportional to size. Nevertheless, the advocates of $\hat\beta_W$ are concerned with reducing the supposed bias caused by the sampling scheme, reasoning by analogy to the estimation of an overall population total or mean. In that case, such weighting is clearly necessary if there are systematic differences in the stratum means. In addition, they argue that the assumptions that lead to the optimality of $\hat\beta$ are likely to be violated in populations of interest. Brewer and Mellor (1973) discuss how the choice between $\hat\beta$ and $\hat\beta_W$ is influenced by the choice of a model-based approach to inference versus an approach based on randomization within a finite population in which no particular model is assumed.

The point of this article is to clarify this issue by showing how the appropriate estimator depends on which of
several possible regression models (if any) is appropriate and to show how a test based on $\hat\beta_W - \hat\beta$ may be used as a device to help decide which model, and hence which estimator, is appropriate. In Section 2 we define four different regression models of increasing generality that might be used to justify the use of $\hat\beta_W$. In Section 3 we discuss the relationship between the models and the choice of estimator, and in Section 4 we show how an easily computed test based on $\hat\beta_W - \hat\beta$ may help in choosing a model. Section 5 contains further discussion, and in the last two sections we illustrate the issues by the construction of an educational attainment model based on a national survey.

2. FOUR REGRESSION MODELS

2.1 Notation

The decision of whether to use the weights depends on what one assumes about the population from which the data have been drawn. In this section we describe four models that exemplify the most common assumptions. Associated with each model is a certain target quantity, or parameter of interest. The question is whether $\hat\beta$ or $\hat\beta_W$ is the more appropriate estimate of that target quantity.

The reader may find it easier to think in terms of sampling from an infinite population, since population size per se is not a major issue here. We always assume that the stratum sample sizes $\{n_j\}$ are small fractions of the corresponding population stratum sizes, and the mathematics of sampling with replacement or from infinite populations are used throughout this article. Let Y and X denote the scalar and (1 x p) random variables defined by a single draw of the dependent and independent variables, respectively, from the entire population. Let y and x denote values of Y and X, namely, single rows of the data matrices Y and X respectively. Unconditional expectations $E(\cdot)$ refer to a simple random sample from the population, while conditional expectations $E(\cdot \mid J)$ refer to stratified sampling, where a simple random sample of size $n_j$ is taken from the jth stratum, j = 1, ..., k.

2.2 The Simple Linear Homoscedastic Model

This is the usual regression model in which

$$Y = X\beta + \epsilon, \tag{2.1}$$

where $\beta$ is a p x 1 vector of coefficients, and $\epsilon$ is random error with mean 0 and variance $\sigma^2$. The key assumption is that the mean and variance of $\epsilon$, conditional on (X, J), are independent of (X, J).

2.3 The Mixture Model

This model supposes no unique $\beta$, but that $\beta$ varies by stratum in the population. That is, there are k parameter vectors $\beta(1), \ldots, \beta(k)$, and, conditional on J = j,

$$Y = X\beta(j) + \epsilon, \tag{2.2}$$

where, again, $\epsilon$ has mean 0 and variance $\sigma^2$, independent of (X, J). By analogy to the univariate sample survey problem of estimating a mean, one target quantity of interest is the weighted average coefficient, namely, the vector

$$\bar\beta = \sum_{j=1}^{k} \pi_j \beta(j) = \sum_{i=1}^{n} w_i \beta_i \Big/ \sum_{i=1}^{n} w_i, \tag{2.3}$$

where $\beta_i = \beta(j_i)$ and the second equality follows from (1.2).

2.4 The Omitted-Predictor Model

This model assumes that the simple homoscedastic model of Section 2.2 would hold if only X were augmented by the unfortunately omitted (1 x q) variable Z. That is, given (X, Z, J),

$$Y = X\alpha + Z\gamma + \epsilon = X\beta + U\gamma + \epsilon, \tag{2.4}$$

where $\epsilon$ has mean 0 and variance $\sigma^2$, independent of (X, Z, J). The coefficients of X and Z are $\alpha$ and $\gamma$, respectively, while U is the part of Z orthogonal to X, namely

$$U = Z - X\,E(X'X)^{-1} E(X'Z). \tag{2.5}$$

Since Z has not been identified, the target quantity or parameter of interest in this model is $\beta$, but if Z were identified, we assume the analyst would prefer to know $(\alpha, \gamma)$ rather than to know merely $\beta$.

There are two important points relating this model and the mixture model. First, even if Z is taken to be the X x J interaction variable, so that the two models are identical, the two parameters $\beta$ and $\bar\beta$ usually will not be equal. Second, even when assuming that omitted predictors exist, we have in mind that they are not too numerous, so that although the omitted-predictor model is theoretically a generalization of the mixture model, in practice it would have fewer parameters since one would hope that (p + q) < kp, especially if k, the number of strata, is large.

2.5 The General Nonlinear Model (No Model)

This model makes the minimal assumption that

$$Y = X\beta^* + \epsilon^*, \tag{2.6}$$

where $E(\epsilon^*) = 0$ and $\mathrm{cov}(X, \epsilon^*) = 0$. However, no other assumptions are made about $E(\epsilon^* \mid X, J)$ or $V(\epsilon^* \mid X, J)$. The parameter $\beta^*$ is thus defined as

$$\beta^* = E(X'X)^{-1} E(X'Y). \tag{2.7}$$

The parameter $\beta^*$ will be called the census coefficient, since it would be the least squares estimate if the population were finite and the entire population were sampled, as in a census. Another interpretation of $\beta^*$ is that $X\beta^*$ represents the best linear predictor of Y in the sense of minimizing the expected squared error of prediction. This
"model" is not really a model since it assumes no population structure except that necessary to define the target quantity $\beta^*$.

If every $n_j$ is small compared to the size of the jth population stratum, this model seems equivalent to the finite population formulation in which the values of Y and X in the population are treated as fixed with no underlying structure. This model includes the three earlier models as special cases. Note, however, that if the mixture model is true, it is not generally true that $\beta^* = \bar\beta$, while, if the simple homoscedastic model or the omitted-predictor model is true, then $\beta^* = \beta$. In fact, setting $\epsilon^* = U\gamma + \epsilon$ shows that the omitted-predictor model is formally equivalent to the general nonlinear model. But the former model assumes that U (actually Z) is an easily interpreted and not-too-hard to measure variable that was omitted by oversight or by some practical necessity, while the latter model allows U to be any variable with cov(U, X) = 0, perhaps an unobservable variable.

One reason for introducing both models is, as discussed in Section 3.4, to contrast two approaches a statistician might take after rejecting the simple homoscedastic model. The optimist ("a good fitting model is just around the corner") searches for extra predictors. The pessimist ("models are never valid in the real world") refuses to rely on any population structure.

3. WHEN TO USE WEIGHTED REGRESSION

3.1 Not If the Simple Linear Homoscedastic Model is Acceptable

Under the linear homoscedastic model, $\hat\beta$ is unbiased and has minimum variance among all linear unbiased estimators; it would naturally be preferred to $\hat\beta_W$. Haberman (1975) proves various relations between $\hat\beta$ and $\hat\beta_W$. For example, he shows that for any linear combination $c'\beta$ of the coefficients

$$4R/(1 + R)^2 \le V(c'\hat\beta \mid J)\,/\,V(c'\hat\beta_W \mid J) \le 1,$$

where R is the ratio of the largest to the smallest of the $\{w_i\}$. In order for the linear homoscedastic model to be acceptable, it must be a priori plausible substantively and in addition pass the usual data analytic tests involving examination of residuals, checking for interactions, and so on.

3.2 Not If the Parameter $\bar\beta$ in the Mixture Model is to be Estimated

The mixture model cannot provide a general rationale for preferring $\hat\beta_W$ to $\hat\beta$ as an estimator of $\bar\beta$. To see this, consider the model of Section 2.3 and let $\nu$ and $\mu$ (each n x 1) be defined by

$$\nu_i = x_i\beta_i, \qquad \mu_i = x_i\bar\beta, \qquad i = 1, \ldots, n,$$

where $x_i$ is the ith row of X. Then elementary calculations (let $Y = \nu + \epsilon = \mu + \nu - \mu + \epsilon$ in (1.1) and (1.3)) show that

$$E(\hat\beta \mid J) = \bar\beta + (X'X)^{-1} X'(\nu - \mu),$$

$$E(\hat\beta_W \mid J) = \bar\beta + (X'WX)^{-1} X'W(\nu - \mu).$$

Notice that, in general, neither $\hat\beta$ nor $\hat\beta_W$ is an unbiased estimate of the average coefficient $\bar\beta$, and there is no simple way to tell from the two preceding expressions which has the smaller bias. For example, if p = 1, so that the $x_i$, the $\beta_i$, and $\bar\beta$ are all scalars, then

$$\mathrm{Bias}(\hat\beta) = \sum x_i^2(\beta_i - \bar\beta) \Big/ \sum x_i^2,$$

$$\mathrm{Bias}(\hat\beta_W) = \sum w_i x_i^2(\beta_i - \bar\beta) \Big/ \sum w_i x_i^2.$$

Then, if $x_i \equiv 1$, $\mathrm{Bias}(\hat\beta_W) = 0$ by the definition (2.3) of $\bar\beta$, but for other choices of X this will not be true. In fact, it could happen that $x_i^2$ is proportional to $w_i$, in which case $\hat\beta$ would be unbiased but $\hat\beta_W$ would not be. In general, neither $\hat\beta$ nor $\hat\beta_W$ appears to be a suitable estimator for $\bar\beta$ in the mixture model. Konijn (1962) and Porter (1973) use the mixture model and recommend estimating $\beta$ separately within each stratum and then taking a weighted average of the estimates as the final estimate of $\bar\beta$. That is, use

$$\hat{\bar\beta} = \sum \pi_j \hat\beta(j) = \sum w_i \hat\beta_i \Big/ \sum w_i.$$

Unfortunately, this recommendation is inadvisable for sampling schemes with many strata and relatively few observations per stratum. Pfefferman and Nathan (1981) suggest using weights for the $\hat\beta_i$ that take into account the precision of each $\hat\beta_i$. Sometimes separate estimation within many strata is impossible because there are too few degrees of freedom. If one were especially suspicious that one of the coefficients (typically the constant term of the usual regression) varies by strata, one could allow the estimate of only that coefficient to vary by strata. Such an analysis of covariance on the entire dataset costs only one degree of freedom per stratum.

3.3 Use $\hat\beta_W$ If the Linear Homoscedastic Model is not Acceptable but an Estimate of $\beta^*$ is Desired

The advantage of $\hat\beta_W$ in the models of Sections 2.4 and 2.5 is that $\hat\beta_W$ is at least a consistent estimator of $\beta = \beta^*$, while $\hat\beta$ may not be. Proof of consistency: let each $n_j \to \infty$ and $w_i \propto \pi_j/n_j$, i = 1, ..., n. Then with probability one $X'WX/\sum w_i$ approaches $E(X'X)$ and $X'WY/\sum w_i$ approaches $E(X'Y)$, so that by (2.7) $\hat\beta_W \to \beta^*$. On the other hand, if the sample sizes of the strata, $n_j$, are not proportional to the population proportions $\pi_j$ (i.e., the $w_i$ do not approach equality), then $\hat\beta$ need not approach $\beta^*$.

3.4 A Strategy for Choosing Between $\hat\beta$ and $\hat\beta_W$

First, if one believes the mixture model of Section 2.3 and desires to estimate $\bar\beta$ of Equation (2.3), then neither $\hat\beta$ nor $\hat\beta_W$ is appropriate. Therefore, for the rest of this article we ignore the mixture model and the estimation of $\bar\beta$.
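The consistency argument of Section 3.3 can be illustrated numerically. The following is a minimal sketch, not part of the article: it assumes NumPy, two hypothetical strata with population shares 0.9 and 0.1 that are sampled in equal numbers, and an omitted predictor Z = x^2. The function names (`ols`, `wls`, `census_coefficient`, `simulate`) and all numeric settings are illustrative assumptions.

```python
import numpy as np

PI = (0.9, 0.1)      # assumed population proportions pi_j
MEANS = (0.0, 2.0)   # assumed stratum means of x (x ~ N(mean, 1) within a stratum)

def ols(X, y):
    """Unweighted least squares, Equation (1.1)."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def wls(X, y, w):
    """Weighted least squares, Equation (1.3), with W = diag(w)."""
    XtW = X.T * w                      # X'W: column i of X' scaled by w_i
    return np.linalg.solve(XtW @ X, XtW @ y)

def census_coefficient(pi=PI, means=MEANS):
    """beta* = E(X'X)^{-1} E(X'Y) for Y = 1 + 2x + x^2 + eps, X = (1, x)."""
    m1 = sum(p * m for p, m in zip(pi, means))               # E[x]
    m2 = sum(p * (m**2 + 1) for p, m in zip(pi, means))      # E[x^2]
    m3 = sum(p * (m**3 + 3 * m) for p, m in zip(pi, means))  # E[x^3]
    Exx = np.array([[1.0, m1], [m1, m2]])
    Exy = np.array([1 + 2 * m1 + m2, m1 + 2 * m2 + m3])
    return np.linalg.solve(Exx, Exy)

def simulate(n_j=(40_000, 40_000), pi=PI, means=MEANS, seed=0):
    """Draw a stratified sample and return (beta_hat, beta_hat_W)."""
    rng = np.random.default_rng(seed)
    xs, ws = [], []
    for n, p, m in zip(n_j, pi, means):
        xs.append(rng.normal(m, 1.0, size=n))
        ws.append(np.full(n, p / n))   # w_i proportional to pi_j / n_j, Equation (1.2)
    x, w = np.concatenate(xs), np.concatenate(ws)
    # The x^2 term plays the role of the omitted predictor Z of Section 2.4.
    y = 1.0 + 2.0 * x + x**2 + rng.normal(0.0, 1.0, size=x.size)
    X = np.column_stack([np.ones_like(x), x])
    return ols(X, y), wls(X, y, w)

if __name__ == "__main__":
    beta_hat, beta_w = simulate()
    print("census beta*:", census_coefficient())
    print("unweighted  :", beta_hat)
    print("weighted    :", beta_w)
```

Under these assumed settings the census coefficient works out analytically to (3.04/1.36, 3.84/1.36), roughly (2.24, 2.82), while the unweighted estimate under equal allocation converges to (2, 4). The weighted estimate should land near the census coefficient, which is the behavior the proof of consistency describes.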
There remains the problem of choosing between the linear homoscedastic model of Section 2.2 (thus choosing $\hat\beta$) and the more general models of Sections 2.4 and 2.5. If one prefers to estimate $\beta^*$ within the general nonlinear model, $\hat\beta_W$ is appropriate. If one believes the omitted-predictor model, then one should try to identify the extra predictor Z and estimate $(\alpha, \gamma)$ in Equation (2.4) or, failing that, settle for using $\hat\beta_W$ as an estimate of $\beta$.

The controversy arises in deciding how much evidence, if any, to require before giving up the linear homoscedastic model. Closely related is the question of how hard to look for additional predictors. On one side are those (see Kish and Frankel 1974, Brewer and Mellor 1973, and references therein) who tend to be extremely dubious of the assumptions of the linear homoscedastic model and who also may not be very interested in searching for extra predictors. They are satisfied with making inferences about the census parameter $\beta^*$.

On the other side are those who tend to accept the simple model of Section 2.2 so long as it can withstand the scrutiny of a careful regression analysis as described, for example, in the books by Mosteller and Tukey (1977) and Belsley, Kuh, and Welsch (1980). The process of refitting with transformed variables, checking for interactions, plotting residuals, and so on, may lead to the use of extra predictors, but the basic strategy is to accept the simple model (and use $\hat\beta$) unless evidence against it develops. The advantages of this approach are, one, the simple inferential procedures (standard F tests and confidence intervals) and, two, the more straightforward interpretation of $\beta$, which the model of Section 2.2 allows. Without the assumptions of that model, the interpretation of $\beta^*$ is difficult. For example, years of schooling may have a positive $\beta^*$ for predicting income, but the income of some subgroups may drop with increasing education. Published regression analyses are often applied to subpopulations or to completely different populations by later researchers. In that case, $\beta^*$ may be misleading, while the extra effort spent to identify interactions or other omitted predictors may lead to greater theoretical understanding. Smith (1976) makes a similar point.

Another way to contrast the opposing sides of the controversy is by the priority they assign to defining a target quantity. One side first defines the target quantity $\beta^*$ as the parameter of interest, no matter what the population structure may be. The other side tends to spend more effort searching for structure in the population and then chooses, from among target quantities like $\beta$, $\bar\beta$, $(\alpha, \gamma)$, and $\beta^*$, that parameter most suited to whatever structure seems to be present.

In the spirit of this latter approach, we next describe yet another test that the data should pass before one accepts the simple model and uses the estimator $\hat\beta$.

4. USING THE WEIGHTS TO TEST THE SIMPLE MODEL

The test is based on the difference $\hat\Delta = \hat\beta_W - \hat\beta$. The assumptions of Section 2.2 imply that $\Delta \equiv E(\hat\Delta) = 0$. As an alternative hypothesis, we consider the omitted-predictor model of Section 2.4,

$$Y = X\alpha + Z\gamma + \epsilon, \tag{4.1}$$

where the columns of Z are further (perhaps unobserved) predictors that should have been included in the regression but were not. Although this particular alternative hypothesis formulation is not essential here, its use will be convenient for expressing and interpreting the expected mean squares of an ANOVA test of $\Delta = 0$ using standard linear models theory. We will see that the hypothesis $\gamma = 0$ implies $\Delta = 0$ but not vice versa. The hypothesis $\Delta = 0$ can also be interpreted as $E(\hat\beta \mid J) = \beta^*$ in the context of the general model of Section 2.5, but our development concentrates on the use of $\hat\Delta$ in a test (perhaps one of many) of the simple model versus the omitted-predictor model. Furthermore, when our test rejects the simple model, examination of $\hat\Delta$ usually suggests candidates for the needed predictors Z. In this section we do not distinguish between $E(\cdot)$ and $E(\cdot \mid J)$, since all expectations here are conditional on (X, Z) and, for the two models being compared, the additional conditioning on J makes no difference.

Since $\hat\Delta = \hat\beta_W - \hat\beta$, it may be represented as $\hat\Delta = DY$, where

$$D = (X'WX)^{-1}X'W - (X'X)^{-1}X'. \tag{4.2}$$

Under the model of Section 2.2, elementary calculations show that the covariance matrices of $\hat\beta$, $\hat\beta_W$, and $\hat\Delta$ are

$$V(\hat\beta) = (X'X)^{-1}\sigma^2,$$

$$V(\hat\beta_W) = (X'WX)^{-1}(X'W^2X)(X'WX)^{-1}\sigma^2,$$

$$V(\hat\Delta) = DD'\sigma^2 = \left[(X'WX)^{-1}(X'W^2X)(X'WX)^{-1} - (X'X)^{-1}\right]\sigma^2 = V(\hat\beta_W) - V(\hat\beta). \tag{4.3}$$

Notice two things about these expressions. First, $V(\hat\beta_W)$ is not $(X'WX)^{-1}\sigma^2$, as would be true if $V(\epsilon_i) = \sigma^2/w_i$, i = 1, ..., n. Thus, the standard errors and t statistics output by most weighted regression computer programs are invalid for our situation, even if the linear model holds and $\Delta = 0$. Second, since $V(\hat\beta_W) = V(\hat\beta + \hat\Delta) = V(\hat\beta) + V(\hat\Delta)$, we see that $\hat\beta$ and $\hat\Delta$ are uncorrelated, as can be shown directly by noticing that, as a linear transformation of Y, $\hat\Delta$ is orthogonal to the columns of X (i.e., DX = 0). Therefore, the sum of the squared residuals from the unweighted regression can be partitioned into a part due to $\hat\Delta$ and an error, or unexplained, component. (Assume n > 2p and that both (X'X) and $V_\Delta = V(\hat\Delta)/\sigma^2$ are nonsingular.) This leads to an ANOVA table with three independent components (see Table 1). If the model of Section 2.2 is true, and in addition $\epsilon$ is normally distributed, then the ratio $MS_W/\hat\sigma^2$ has an F distribution with p and (n - 2p) degrees of freedom. Under the extended model of Section 2.4, the expected value of $MS_W$ is $\sigma^2 + \Delta' V_\Delta^{-1}\Delta/p$, while the expected value of $\hat\sigma^2$ is $\sigma^2 + T^2/(n - 2p)$, where

$$T^2 = \gamma' Z'\left(I - X(X'X)^{-1}X'\right)Z\gamma - \Delta' V_\Delta^{-1}\Delta.$$

The formula for $T^2$ can be interpreted as the difference
Table 1. Formulas for ANOVA Table Comparing Weighted and Unweighted Regressions

Source          df        Sum of Squares                                     Mean Square
Regression(a)   p         $SSR = \hat\beta'(X'X)\hat\beta$                   $MSR = SSR/p$
Weights         p         $SS_W = \hat\Delta' V_\Delta^{-1} \hat\Delta$      $MS_W = SS_W/p$
Error           n - 2p    SSE = remainder                                    $\hat\sigma^2 = SSE/(n - 2p)$
Total           n         $Y'Y$

(a) The source labeled regression here includes the constant term if it is present in the model. In most applications, and in our examples of Section 7, the effect of the grand mean is omitted and the df for regression is p - 1, while SSR and Y'Y are reduced by the square of the grand mean.
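The entries of Table 1 can be computed directly from the definitions of D and $V_\Delta$ in Equations (4.2) and (4.3). The following is a minimal sketch under stated assumptions rather than the authors' program: it assumes NumPy, a design matrix X that already contains the constant column, and that (X'X) and $V_\Delta$ are nonsingular. The function name `weights_anova` and the returned dictionary keys are hypothetical.

```python
import numpy as np

def weights_anova(X, y, w):
    """Sums of squares of Table 1 for data (X, y) and survey weights w.

    Hypothetical helper, not from the article. Assumes n > 2p and that
    both X'X and V_Delta are nonsingular.
    """
    n, p = X.shape
    XtX = X.T @ X
    XtW = X.T * w                       # X'W with W = diag(w)
    XtWX = XtW @ X
    XtW2X = (X.T * w**2) @ X            # X'W^2 X

    beta = np.linalg.solve(XtX, X.T @ y)        # unweighted estimate (1.1)
    beta_w = np.linalg.solve(XtWX, XtW @ y)     # weighted estimate (1.3)
    delta = beta_w - beta                       # Delta-hat

    # V_Delta = V(Delta-hat) / sigma^2, from Equation (4.3)
    XtWX_inv = np.linalg.inv(XtWX)
    V_delta = XtWX_inv @ XtW2X @ XtWX_inv - np.linalg.inv(XtX)

    ssr = beta @ XtX @ beta                         # regression sum of squares
    ssw = delta @ np.linalg.solve(V_delta, delta)   # sum of squares due to weights
    sse = y @ y - ssr - ssw                         # remainder
    ms_w = ssw / p
    sigma2_hat = sse / (n - 2 * p)
    return {"delta": delta, "SSR": ssr, "SSW": ssw, "SSE": sse,
            "df": (p, n - 2 * p), "F": ms_w / sigma2_hat}
```

The returned `F` is the ratio $MS_W/\hat\sigma^2$, to be compared with an F distribution on the degrees of freedom in `df` when the simple model and normal errors are assumed. Forming `V_delta` explicitly keeps the sketch close to the formulas in the table, at some cost in numerical robustness when the weights are nearly constant.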
between the excess in the residual sum of squares due to neglecting the term $Z\gamma$ in the model (4.1) and that accounted for by estimating $\Delta = DZ\gamma$, where D is given by (4.2). If $T^2$ is small (the next theorem implies that $T^2 = 0$ is equivalent to Z = WXC for some matrix C), then the F test based on $MS_W/\hat\sigma^2$ will be a useful test of the simple linear model of Section 2.2. If this model is rejected, we conclude that $\hat\beta$ and $\hat\beta_W$ have different expectations. The rationale for preferring unweighted to weighted regression is also rejected unless some other variables Z can be found that lead one to accept an extended model of the form (4.1).

A weighted least squares computer program is required to compute $\hat\beta_W$ and $\hat\Delta$, and another special program is required to compute $V_\Delta$. However, as the following theorem shows, $SS_W$ can be computed and the test performed with ordinary, unweighted, regression programs.

Theorem. The F test for $\Delta = 0$ is the same as the usual F test for $\gamma = 0$, if the regression model $Y = X\alpha + WX\gamma + \epsilon$ is fitted by ordinary least squares. (That is, create the new variables Z = WX and test for the effect of Z partialled on X.)

Proof. Since $\hat\Delta = DY = D(X\alpha + WX\gamma + \epsilon)$, and DX = 0,

$$\Delta = E(\hat\Delta) = DWX\gamma = V_\Delta (X'WX)\gamma$$

using (4.2) and (4.3), where $V_\Delta = V(\hat\Delta)/\sigma^2$. Therefore, the F tests of $\Delta = 0$ and $\gamma = 0$ will be equivalent if $V_\Delta$ and (X'WX) are both nonsingular. One can show that this condition is equivalent to the assumption that the matrix (X : WX) is full rank (= 2p), which is true if there are at least (p + 1) distinct $w_i$ whose corresponding rows of X have rank p. (We conjecture that the theorem is true for arbitrary X and W, although the F tests would have fewer degrees of freedom in the numerator.)

In practice, one might use one of two different methods to compute SSR and $SS_W$, depending on the details of one's least squares regression computer program, after having formed the variables Y, X, and Z = WX ($Z_{ij} = w_i X_{ij}$).

Method A. Perform the regression of Y on X and Z, and then refit the regression, dropping the Z variables. The two "due to regression" sums of squares will be $SSR + SS_W$, and SSR, respectively.

Method B. Perform the regressions of Y and Z on X. Then perform the regression of Y on the residuals of the Z-on-X regressions. The last "due to regression" sum of squares will be $SS_W$.

5. SOME FURTHER REMARKS

The following remarks are somewhat independent of each other but are offered as discussion and for clarification.

Remark 1. Although the tests involving $\Delta$ and $\gamma$ are equivalent, interpretation of their individual components is somewhat different and less straightforward for $\hat\gamma$ than for $\hat\Delta$. If the hypothesis $\Delta = 0$ is rejected, we suggest checking for interactions among the variables for which the corresponding components of $\hat\Delta$ or $\hat\gamma$ are significantly different from zero. This procedure has, of course, no power to detect interactions not related to the weight variable, that is, those for which the $T^2$ of Section 4 is large.

Remark 2. Bishop (1977) suggests using a weighted regression in certain situations when the sampling ratio is a function of the dependent variable (not merely the predictor variables) even when the simple model holds. The present article does not discuss that situation, which is akin to retrospective or case-control sampling. Manski and McFadden (1980) provide a general formulation and analysis of the problem. Holt, Smith, and Winter (1980) provide a different formulation and solution that assumes that the probability of selection depends on a normally distributed design variable with known variance.

Remark 3. Thomson (1978) presents another rationale for using estimates of regression coefficients that depend on the sample design, even when the additive model holds. He shows that although $\hat\beta$ is best conditional on X, there may be other unbiased estimators with smaller variance for certain ranges of the true $\beta$, if bias and variance are computed by averaging over all values of X in a given sampling design. However, under Thompson's model, $\hat\beta$ is always more efficient than $\hat\beta_W$.

6. AN EXAMPLE

To elucidate the issues in terms of the applicability of the models discussed in Section 2, we consider the analysis of a subset of data from the Panel Study of Income Dynamics, a continuing longitudinal study begun in 1968 by the University of Michigan's Survey Research Center. The original sample of 4,802 families was composed of two subsamples. The larger portion (2,930) of the original interviews were conducted with household heads from a representative cross-section sample of families in the United States. An additional 1,872 interviews were conducted with heads of low-income households drawn from a sample identified and interviewed by the Census Bureau for the 1966-1967 Survey of Economic Opportunity. Annual interviews have been conducted since 1968 with
540
Journalof the American StatisticalAssociation,September 1983
thesehouseholdheads and also withtheheads of new Table 2. VariableDefinitionsin Model Equation (6.1)
whohaveleft
formedbyoriginalpanelmembers
families
ofthe
Completededucationalattainment
yearof thestudy,a set of Ed
home.At theend of thefifth
in years,self-reported
individual,
in FaEd
weightswas calculatedto accountforinitialvariations
levelof the father,in
Educationalattainment
years,reportedbythe father
rates.The
samplingratesand variationsin nonresponse
levelof the mother,in
Educationalattainment
means MothEd
to helpestimatepopulation
weightswereintended
years,reportedbythe father
anal- Sibs
and totalsand forpossibleuse in otherstatistical
Numberof siblings
1967-1971 averagetotalparentalfamily
whichare inversely FamilyIncome
yses.The calculationoftheweights,
income,in thousandsof 1967 dollars,
of selectionforeach into the probability
proportional
reportedbythe father
dividual,is describedin Morgan(1972,pp. 33-34). For Age
in years
The age of the individual,
Per studentpublicschool expenditurein
thepurposeof thisexample,we ignorethefactthatthe ExpiPupil
1968,forcountyof residencein 1968
samplingschemewas clusteredas wellas stratified.
Unemployment The percentof countylaborforce
In thisanalysiswe attemptto predicteducationalatunemployedin 1970,forcountyof
residencein 1968
ourselvesto panel individuals
tainment
and we restrict
A dichotomousvariableequal to one ifthe
(a) who were childrenin 1968households,aged 14-18; Rural
parentalfamilyresidesmorethan50 miles
(b) who had becomeheadsor wivesoffamiliesby 1975;
froma cityof 50,000or morein 1968,and
zero otherwise
(c) who had completedtheirschoolingby 1975;and (d)
The percentof persons25 or moreyearsold
whohad completedat leasteightyearsofschooling.Re- % college
in the 1968 countyof residencewho have
46 and 9 observations,
strictions
(c) and (d) eliminated
completedfouror moreyearsof college
was necessarybecause CountyIncome The median1969 incomein 1968 countyof
A finalrestriction
respectively.
residence,in thousandsof 1969 dollars
restrictions
(d)
(a) through
the867 individuals
satisfying
families.Since theeducacame fromonly658 different
tionalattainments
of siblingsare not likelyto be indeblack
pendent,we randomlyselectedone observationfrom on blackmales,221 on whitefemales,and 149on
re- females.
each set of siblingsthatcame fromthesamefamily,
to 658. The weights
ducingthe numberof observations
7. RESULTS
OF THEDATAANALYSES
forthese658 cases rangefrom1 to 83, witha meanof
29.0 and standarddeviationequal to 21.4. The dataused
estiunweighted
We beganour analysisby obtaining
cardimages,
in thisanalysis,consisting
of867computer
of(6.1),withtherace-sexdummy
matesoftheparameters
is availablefromtheauthors.
Thecoefficients
variablesincludedas additivepredictors.
Theoreticaland empiricalstudiesoftheeconomicsof
andassociatedstandarderrorsaretheboldfaceentriesin
educationalattainment
(Becker 1975,Ben-Porath1967,
of Table 3.
the secondand thirdcolumns,respectively,
Duncan 1974,Edwards1975,Hill 1979,Liebowitz1974,
Consideredas a whole,thesevariablesaccountformore
numerParsons1975,and Wachtel1975)have identified
thana quarterofthevariancein educationalattainment.
ous characteristics
of the familyand the economicensigns,alVirtuallyall of themhave the hypothesized
decision.This
vironment
thatmayaffecttheattainment
withtheexceptionofthecountyincomevariable,
though,
ofthefolpastresearchleads to ourinitialspecification
levelarestatistically
onlytheonesmeasuredatthefamily
lowingform:
at conventionallevels. The resultsfor the
significant
a significant
countyincomevariableis puzzling,although
Ed = Ot+ P1FaEd + P2MothEd + I3Sibs
has also been foundby Wachtel
negativecoefficient
(+)
(-)
(+)
oftheredata. A histogrami
(1975,p. 515) withdifferent
+ P6ExplPupil
+ P4FamilyIncome + P35Age
oftheresidualsversusthefitted
sidualsand a scatterplot
(+)
(+)
(+)
fromtheassumptions
valuesshowedno grossdeviations
ofthesimplemodel.
+ P7Unemployment+ P8Rural + P9%College
UsingMethodA of Section4 to compare,Bwith13w,
(+)
(-)
(+)
each indepenwe formed14 Z variablesby multiplying
+ P1oCounty
Income.
theconstant)bytheweightvardentvariable(including
(+)
(6.1) iable. Whenthedependentvariableis regressedon both
variables,we foundthatthenullhyTable 2 definesthevariablesin the model(6.1) and the setsofindependent
are givenin the pothesis,thatA = 0 (i.e., thesimplemodelis correct),
signsof the coefficients
hypothesized
var- could be rejectedat aboutthe6 percentlevel,as Table
of(6.1). In addition,we includedummy
parentheses
iablesforblackmales,whitefemales,and blackfemales 4 indicates.We proceededto calculateestimatesof A,
with VA,and weightedestimatesof ,B.
in orderto comparetheireducationalattainment
whitemales. Because low-incomefamilieswere over- The weightedestimatesof P and thet ratiosof A are
are givenin theboldfaceentriesofcolumns4 and 5 ofTable
sampled,thenumberofblackandwhiteobservations
in MethodA providesan
inthesamplethaninthepop- 3. The regressionperformed
farmoreevenlydistributed
(y) and standarderrorsforeach
on whitemales,112 estimateof coefficients
ulation.Thereare 176observations
Table 3. Coefficients, Standard Errors, and t Ratios of Various Tests for Two Versions of the Educational Attainment Model
Independent
Variable
EaEd
MothEd
:
Unweighted
Estimateof,
w:
WeightedEstimate
of,8
.0
1.6
(.021)
(.026)
(.020)
(.126)
.082
.158
.082
.055
-.069
.033
(.031)
(.012)
(.031)
(.012)
-.069
.039
-.074
.030
.251
- 1.267
(.045)
(.215)
(.045)
(.448)
.314
-.639
-.068
1.385
(.163)
(.796)
-.218
-.070
.491
(.196)
(.900)
.142
.268
1.0
.022
.037
-.081
.016
-.104
.006
.183
-.097
.019
-.084
(.078)
(.042)
(.175)
(.017)
(.049)
(.077)
(.076)
(.172)
(.017)
(.048)
.027
-.015
-.113
.030
-.129
.027
.210
-.119
.024
-.092
6.705
7.679
-.208
(.907)
(1.080)
(.085)
6.073
6.794
-.252
.082
.125
Sibs
Family Income
-.073
-.044
Age
Whether
Black
Male
WhetherWhite
Female
Whether
Black
Female
Expenditure/Pupil
Unemployment
Rural
% College
CountyIncome
.271
-.280
Constant
Whether White
SE(g):
StandardError
tA:
t Ratioof
Difference
Between
Weightedand
Unweighted
Estimatesof,
and
Unemployment
WhetherFemale
.076
.002
-.340
and MothEd
(.169)
-
.2
-.9
ty:
t Ratioof
Estimated
Coefficient
of
Variablex
Weight
.3
.6
.0
2.0
-.2
.5
.0
-.2
-.2
-.5
-
.1
.5
.297
1.753
1.3
-1.5
1.4
- 1.0
1.6
-2.1
1.6
-2.0
1.048
-2.8
- .5
-2.8
- 1.1
-.3
-.3
1.0
.1
-2.3
-.3
1.0
-.7
.4
.3
-.2
.4
-.2
.5
-2.3
.2
.6
-.7
.6
.3
.2
.4
-.3
-1.0
- 1.1
-.7
-.320
-.5
.2
-.6
- 1.2
.2
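Independent Variable        Unweighted   SE       Weighted   t_Delta   t_gamma
MothEd²                     .008         (.007)   .005       -.6       -.4
Whether Female × MothEd²    .017         (.009)   .017       -.0       -.2

                              Initial model (boldface)   Final model (italic)
R²                            .289                       .315
Standard Error of Estimate    1.60                       1.57
Sample Size                   658                        658

Source: Morgan (1972).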
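As a rough illustration of how the t ratios on the Z variables (last column of Table 3) and the "Weights" line of an ANOVA comparison can be produced, the sketch below regresses the response on the predictors and on each predictor multiplied by the sample weight (the Z variables), then forms the F ratio for the whole set of Z coefficients. It is a minimal sketch on synthetic data, not the authors' program: the function name `weights_f_test`, the stratum weights, and the simulated coefficients are all assumptions made for the example.

```python
import numpy as np


def weights_f_test(X, y, w):
    """F ratio for adding Z = weight * [1, X] to an unweighted regression of y on X."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])   # predictors plus a constant
    Z = X1 * w[:, None]                     # every column (constant included) times the weight
    full = np.column_stack([X1, Z])

    def rss(design):
        # Residual sum of squares from an ordinary least squares fit.
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        return float(resid @ resid)

    ss_total = float(np.sum((y - y.mean()) ** 2))
    rss_x = rss(X1)        # unweighted model without the Z variables
    rss_full = rss(full)   # augmented model with the Z variables

    df_weights = Z.shape[1]
    df_error = n - full.shape[1]
    ss_weights = rss_x - rss_full
    f_weights = (ss_weights / df_weights) / (rss_full / df_error)
    return {
        "ss_regression": ss_total - rss_x,
        "ss_weights": ss_weights,
        "ss_error": rss_full,
        "ss_total": ss_total,
        "df_weights": df_weights,
        "df_error": df_error,
        "F_weights": f_weights,
    }


# Synthetic stratified sample (hypothetical numbers, for illustration only).
rng = np.random.default_rng(0)
n = 400
stratum = rng.integers(0, 3, size=n)
w = np.array([0.5, 1.0, 2.5])[stratum]      # assumed weight for each stratum
X = rng.normal(size=(n, 3))
y = 1.0 + X @ np.array([0.5, -0.3, 0.2]) + rng.normal(size=n)

anova = weights_f_test(X, y, w)
print(anova)
```

With the sums of squares printed in Table 4, the same ratio is (59.2/14)/(1586.7/630), or about 1.7, which agrees with the F reported there for the weights.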
of the Z variables; the t ratios of each are given in the sixth column of the table.

Given the significance of these differences, we could have chosen to use the $\hat{\beta}_w$ of column 4 as descriptive estimates of $\beta^*$ in the census model. This rejection of the simple model would put more emphasis on race and sex, and less on unemployment, as predictors of educational attainment.

Table 4. ANOVA Table Comparing Weighted and Unweighted Regressions Using the Initial Model

Source        df     Sum of Squares    Mean Square    F       Significance
Regression    13     670.0             51.5           20.1    <.0001
Weights       14     59.2              4.2            1.7     .056
Error         630    1586.7            2.5
Total         657    2315.9

We chose instead to use the information in the boldface entries of Table 3 to explore extensions of the simple model of (6.1). The unweighted coefficients differed substantially from the weighted coefficients for four variables: mother's education, county unemployment rate, and two of the race-sex dummy variables. Notice that there is a rough correspondence between the ranking (and direction) of the t ratios on differences between unweighted and weighted coefficients and the t ratios of the corresponding Z variables. Thus, the more easily computed tests of significance on the Z variables can apparently serve as a guide to variables with large $\Delta$'s.

Prior research and the significant $\Delta$'s suggest that the most probable cause of misspecification is omitted interactions between race and sex and some of the independent variables listed in (6.1), especially mother's education and county unemployment. Although the hypothesis of equal slopes for the four race and sex subgroups could not be rejected (F = .94; df = 30,614),
the subset regressions did suggest a possible interaction between race and county unemployment rate in which increases in unemployment had a stronger positive effect on the educational attainment of blacks than of whites. This interaction is quite plausible since unemployment rates for blacks are considerably higher than those of whites and a unit change in the overall county unemployment rate has more effect on blacks than on whites. When this interaction term was added to (6.1), the coefficient was -.21 with a standard error of .09. Furthermore, when Method A was repeated with it and its associated Z variable (i.e., whether white × county unemployment rate × weight) included as predictors, the F ratio of the entire set of weight interactions drops from 1.7 to 1.3, and the coefficients of the two Z variables formed from the county unemployment variable are insignificant.

Since the three other Z variables significant at the .05 level in the original specification remained significant when the race-unemployment interaction was included, we continued our search for additional interactions. We discovered that mother's education interacted with itself (i.e., its effect was nonlinear) and, furthermore, that these nonlinear effects of mother's education on the educational attainment of the children depended upon the sex of the child.

Results from the estimation of our final specification of the education attainment model are the italicized entries of Table 3. In contrast to the initial specifications, the highest t ratio for the difference between weighted and unweighted coefficients is 1.4. The analysis of variance table for the final model, Table 5, shows that the F ratio associated with the weights is below 1.0.

The coefficients and standard errors presented as the italic entries of columns 2 and 3 of Table 3 should be regarded with some caution because the data were used to suggest the appropriate functional form.

As a final analysis step, we reestimated the model on the 209 individuals who were excluded when the sample was restricted to only one sibling per family. The results were quite similar, particularly for the race-unemployment interaction and the nonlinear effect of mother's education. Although the sign of the interaction between sex of child and mother's education changed direction, the only coefficient that changed by a statistically significant amount was family income. The income coefficient increased, suggesting a more important role for income for these 209 individuals from families with at least two children in this five-year age cohort. When the final model was fitted to the complete sample of 867 individuals, the coefficient of income rose from .033 to .044.

Our use of the weights to test for model misspecification has led us to the substantive conclusion that a simple linear additive educational attainment model is not appropriate for several reasons. First, a worsening of local economic conditions (as measured by the county unemployment rate) appears to provide more of an incentive for blacks to stay in school than for whites. A one percentage point increase in the unemployment rate was associated with an additional one-fifth of a year of educational attainment for blacks, while the effect for whites was essentially zero. Second, the effects of mother's education on the educational attainment of children increase with the level of her education and furthermore depend on the sex of the child. Table 6 evaluates $\partial \mathrm{Ed} / \partial \mathrm{MothEd}$ for mothers with 8, 12, and 16 years of education.

Table 6. Estimated Increase in Mean Ed per Year of Increase in MothEd, Based on Final Model

                 Education Level of Mother
Sex of Child     8 Grades    12 Grades (H. S. Grad.)    16 Grades (College Grad.)
Male             .13         .19                        .26
Female           .06         .26                        .46

There is a modest increase in the marginal effect of mother's educational attainment for sons and a much more dramatic increase in this effect for daughters. An extra year of mother's education level is associated with virtually no increase in the attainment of daughters whose mothers have an eighth grade education but is associated with an additional one-half year of education for daughters with college-educated mothers. These conclusions do not change when the model is estimated from the complete sample of 867 individuals. None of the six numbers in this table changes by more than .03.

Finally, an analysis of the residuals from this extended model revealed no anomalies, and we thus prefer the unweighted estimates of its coefficients.

[Received November 1976. Revised November 1982.]
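The entries of Table 6 can be reproduced, to two decimals, by differentiating the terms of the final model that involve MothEd. The sketch below assumes the final model contains MothEd, MothEd², Whether Female × MothEd, and Whether Female × MothEd², and that the unweighted (italic) coefficients for these terms in Table 3 read .002, .008, -.340, and .017 respectively. Both the functional form and these readings of the table are assumptions of the example, as are the names used in the code.

```python
# Assumed unweighted final-model coefficients, as read from the italic entries of Table 3.
B_MOTHED = 0.002             # MothEd
B_MOTHED_SQ = 0.008          # MothEd squared
B_FEMALE_MOTHED = -0.340     # Whether Female x MothEd
B_FEMALE_MOTHED_SQ = 0.017   # Whether Female x MothEd squared


def marginal_effect(mother_ed, female):
    """Derivative of predicted Ed with respect to MothEd at a given level of MothEd."""
    slope = B_MOTHED + 2 * B_MOTHED_SQ * mother_ed
    if female:
        # Daughters add the two interaction terms to the slope for sons.
        slope += B_FEMALE_MOTHED + 2 * B_FEMALE_MOTHED_SQ * mother_ed
    return slope


grades = (8, 12, 16)
table6 = {
    sex: [round(marginal_effect(g, sex == "Female"), 2) for g in grades]
    for sex in ("Male", "Female")
}

for sex, effects in table6.items():
    print(sex, effects)
```

Under these assumed coefficients the rounded slopes are .13, .19, .26 for sons and .06, .26, .46 for daughters, the same six numbers that appear in Table 6.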
Table 5. ANOVA Table Comparing Weighted and Unweighted Regressions Using the Final Model

Source        df     Sum of Squares    Mean Square    F        Significance
Regression    17     730.6             43.0           17.35    <.0001
Weights       18     43.3              2.4            .97      .494
Error         622    1542.0            2.5
Total         657    2315.9

REFERENCES
BACHMAN, J.G., GREEN, S., and WIRTANEN, I.D. (1974), Youth in Transition, Vol. III, Dropping Out-Problem or Symptom?, Ann Arbor, Mich.: Institute for Social Research.
BECKER, GARY S. (1975), Human Capital: A Theoretical and Empirical Analysis, with Special Reference to Education (2nd ed.), New York: Columbia University Press.
BELSLEY, D.A., KUH, E., and WELSCH, R.E. (1980), Regression Diagnostics: Identifying Influential Data and Sources of Collinearity, New York: John Wiley.
BEN-PORATH, YORAM (1967), "The Production of Human Capital and the Life Cycle of Earnings," Journal of Political Economy, 75, 352-365.
BISHOP, JOHN (1977), "Estimation When the Sampling Ratio is a Linear Function of the Dependent Variable," Proceedings of Social Statistics Section of the American Statistical Association, 848-853.
BLUMENTHAL, M.D., KAHN, R.L., ANDREWS, F.M., and HEAD, K.B. (1972), Justifying Violence: Attitudes of American Men, Ann Arbor, Mich.: Institute for Social Research.
BREWER, K.R.W., and MELLOR, R.W. (1973), "The Effect of Sample Structure on Analytical Surveys," Australian Journal of Statistics, 15, 145-152.
DRAPER, N.R., and SMITH, H. (1966), Applied Regression Analysis, New York: John Wiley.
DUNCAN, GREG (1974), "Educational Attainment," Five Thousand American Families-Patterns of Economic Progress, Vol. I, eds. Morgan et al., Ann Arbor, Mich.: Institute for Social Research, 305-332.
DUNCAN, GREG J., and MORGAN, JAMES N. (eds.) (1976), Five Thousand American Families-Patterns of Economic Progress, Vol. IV, Ann Arbor, Mich.: Institute for Social Research.
EDWARDS, LINDA N. (1975), "The Economics of Schooling Decisions: Teenage Enrollment Rates," Journal of Human Resources, 10, 155-73.
HABERMAN, SHELBY J. (1975), "How Much Do Gauss-Markov and Least-Square Estimates Differ? A Coordinate-Free Approach," The Annals of Statistics, 3, 982-990.
HILL, C. RUSSELL (1979), "Capacities, Opportunities, and Educational Investments: The Case of the High School Dropout," The Review of Economics and Statistics, 61, 9-20.
HOLT, D., SMITH, T.M.F., and WINTER, P.D. (1980), "Regression Analysis of Data from Complex Surveys," Journal of the Royal Statistical Society, Ser. A, 143, 474-487.
HU, T.W., and STROMSDORFER, E.W. (1970), "A Problem of Weighting Bias in Estimating the Weighted Regression Model," Proceedings of the Business and Economics Section of the American Statistical Association, 513-516.
JUSTER, F. THOMAS (ed.) (1976), The Economic and Political Impact of General Revenue Sharing, Washington, D.C.: U.S. Government Printing Office.
KISH, LESLIE (1965), Survey Sampling, New York: John Wiley.
KISH, LESLIE, and FRANKEL, MARTIN R. (1974), "Inference from Complex Samples," Journal of Royal Statistical Society, Ser. B, 36, 1-37.
KLEIN, L.R. (1953), A Textbook of Econometrics, Evanston, Ill.: Row, Peterson, and Co.
KLEIN, L.R., and MORGAN, JAMES N. (1951), "Results of Alternative Statistical Treatments of Sample Survey Data," Journal of American Statistical Association, 46, 442-460.
KONIJN, H.S. (1962), "Regression Analysis in Sample Surveys," Journal of American Statistical Association, 57, 590-606.
LEIBOWITZ, ARLEEN (1974), "Home Investments in Children," Journal of Political Economy, 82, 111-131.
MANSKI, C., and McFADDEN, D. (1980), "Alternative Estimators and Sample Designs for Discrete Choice Analysis," in Structural Analysis of Discrete Data, eds. C. Manski and D. McFadden, Cambridge, Mass.: M.I.T. Press.
MORGAN, JAMES N. (1972), A Panel Study of Income Dynamics: Study Design, Procedures, Available Data, Vol. 1, Ann Arbor, Mich.: Institute for Social Research.
MOSTELLER, F., and TUKEY, J.W. (1977), Data Analysis and Regression, Reading, Mass.: Addison-Wesley.
PARSONS, DONALD O. (1975), "Intergenerational Transfers and the Educational Decisions of Male Youth," Quarterly Journal of Economics, 89, 603-617.
PFEFFERMANN, D., and NATHAN, G. (1981), "Regression Analysis of Data From a Cluster Sample," Journal of the American Statistical Association, 76, 681-689.
PORTER, RICHARD D. (1973), "On the Use of Survey Sample Weights in the Linear Model," Annals of Economic and Social Measurement, 2/2, 141-158.
SMITH, KENT W. (1976), "Analysing Disproportionately Stratified Samples With Computerized Statistical Packages," Sociological Methods and Research, 5, 207-230.
THOMSEN, IB (1978), "Design and Estimation Problems When Estimating a Regression Coefficient From Survey Data," Metrika, 25, 27-36.
WACHTEL, PAUL (1975), "The Effect of School Quality on Achievement, Attainment Levels, and Lifetime Earnings," Explorations in Economic Research, 2, 502-536.