
Effective Geographic Sample Size in the Presence of Spatial Autocorrelation

Daniel A. Griffith
Ashbel Smith Professor, School of Social Sciences, University of Texas at Dallas

As spatial autocorrelation latent in georeferenced data increases, the amount of duplicate information contained in these data also increases. This property suggests the research question asking what the number of independent observations, say n*, is that is equivalent to the sample size, n, of a data set. This is the notion of effective sample size. Intuitively speaking, when zero spatial autocorrelation prevails, n* = n; when perfect positive spatial autocorrelation prevails in a univariate regional mean problem, n* = 1. Equations are presented for estimating n* based on the sampling distribution of a sample mean or sample correlation coefficient with the goal of obtaining some predetermined level of precision, using the following spatial statistical model specifications: (1) simultaneous autoregressive, (2) geostatistical semivariogram, and (3) spatial filter. These equations are evaluated with simulation experiments and are illustrated with selected empirical examples found in the literature.
Key Words: geographic sample, geostatistics, redundant information, spatial autocorrelation, spatial autoregression.

Sampling for data-gathering purposes must address questions asking how and what to sample (Levy and Lemeshow 1991), and it is the foundation of much empirical social science research, whether quantitative or qualitative methodologies are employed. One distinction between these two methodological approaches is that quantitative research frequently requires relatively large sample sizes to collect somewhat superficial, albeit important, attribute information that is generalizable to a population, while qualitative research often is restricted to relatively small sample sizes in order to collect large quantities of in-depth, detailed information from subjects or case studies. Quantitative analysis generalization is achieved through a sound random-sampling design (i.e., how to sample); qualitative analysis generalization, if desired, may be achieved through such techniques as triangulation. Considerable effort has been devoted to geographic sampling designs for quantitative investigations (e.g., Stehman and Overton 1996), translating the what into where to sample; these are designs that exploit random sampling error. Impacts of spatial autocorrelation in this context are partially understood and are the topic of this article. One of an array of purposive sampling strategies (i.e., how to sample) can be employed in qualitative research (see Marshall and Rossman 1999, 78). The goal often is representativeness and adherence to selected theoretical considerations, as well as convenience. Impacts of spatial autocorrelation in this latter context are almost completely unknown, although a spatial researcher should realize that it still will come into play. For example, a snowball sampling strategy will be impacted by spatial autocorrelation if subjects are from nearby locations and by social network autocorrelation because of the way the sample is generated. And an extreme-cases strategy could be impacted by the existence of geographically nonrandom "hot spots" or "cold spots," which arise because of spatial autocorrelation. Findings reported in this article for quantitative methodologies offer at least some speculative insights into qualitative sample sizes, too.

Important Sample Properties

Sample size determination, often in terms of statistical power calculations, frequently is a valuable step in planning a sample-based, quantitative study. Most introductory statistics textbooks discuss hypothesis testing in the context of appropriate sample size determination, with or without statistical power specification. The popularity and cumbersomeness of these calculations have resulted in web-based interactive calculators to execute the necessary computations for researchers (e.g., http://calculators.stat.ucla.edu/powercalc/). For the case of independent observations, Flores, Martinez, and Ferrer (2003) furnish some insights into sample-size determination for arithmetic means of georeferenced data, but for systematic sampling designs rather than the tessellated random sampling design promoted in this article. As this literature illustrates, calculating an appropriate sample size unavoidably involves mathematical notation, which accordingly appears in the ensuing discussion.

Annals of the Association of American Geographers, 95(4), 2005, pp. 740-760 © 2005 by Association of American Geographers
Initial submission, April 2004; revised submission, December 2004; final acceptance, February 2005
Published by Blackwell Publishing, 350 Main Street, Malden, MA 02148, and 9600 Garsington Road, Oxford OX4 2DQ, U.K.

Statistical power (Tietjen 1986, 38) is the probability that a test will reject a false null hypothesis (i.e., the complement of a Type II error). It frequently is denoted by $1 - \beta$, where $\beta$ is the probability of failing to reject the null hypothesis when the alternative hypothesis is true (i.e., a Type II error). The higher the power, the greater the chance of obtaining a statistically significant result when a null hypothesis is false. The power of all statistical tests is dependent on the following design parameters: significance level selected for a statistical test; sample size; the tolerable magnitude of difference between a sample statistic and its corresponding population parameter; and natural variability for the phenomenon under study.
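
The dependence of power on sample size can be made concrete with a small sketch. It assumes a one-sided z-test for a mean with known standard deviation; the function name `z_test_power` and the values `delta=0.5` and `sigma=1.0` are hypothetical choices made only for illustration. The two sample sizes echo the High Peak example discussed in this article, where n = 900 autocorrelated pixels behave roughly like 5 independent ones.

```python
from statistics import NormalDist


def z_test_power(n, delta, sigma, alpha=0.05):
    """Power of a one-sided z-test for a mean shift `delta`.

    Assumes n independent observations with known standard deviation
    `sigma` (hypothetical setup, not a calculation from the article).
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha)  # rejection threshold
    # Power = P(Z > z_crit - delta * sqrt(n) / sigma)
    return 1.0 - std_normal.cdf(z_crit - delta * (n ** 0.5) / sigma)


# Same effect size and variability, two different sample sizes.
power_nominal = z_test_power(900, delta=0.5, sigma=1.0)
power_effective = z_test_power(5, delta=0.5, sigma=1.0)
print(round(power_nominal, 3), round(power_effective, 3))
```

If the nominal n is used where only the effective n* is warranted, the computed power can be badly overstated, which is the practical motivation for the adjustment developed below.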

Spatial autocorrelation, which may arise from common variables associated with locations or from direct interaction between locations (see Griffith 1992), has an impact on significance levels, detectable differences in attribute measures for a population, and measures of attribute variability (see, e.g., Arbia, Griffith, and Haining 1998, 1999). These impacts motivated Clifford, Richardson, and Hémon (1989) to apply the phrase "effective degrees of freedom"¹ to analyses in which these spatial autocorrelation effects are adjusted for in the case of correlation coefficients. The phrase refers to the equivalent number of degrees of freedom for spatially unautocorrelated (i.e., independent) observations, exploiting redundant or duplicated information contained in georeferenced data due to the relative locations of observations (i.e., spatial autocorrelation). The duplicate information in question may arise from geographic trends induced by common variables or from information sharing resulting from spatial interaction (e.g., geographic diffusion).

This article highlights the nearly equivalent notion of effective sample size²: the number of independent observations, say n*, that is equivalent to a spatially autocorrelated data set's sample size, n. Intuitively speaking, when zero spatial autocorrelation prevails and a regional mean is being estimated, n* = n; when perfect positive spatial autocorrelation prevails, n* = 1. The importance of correcting n to n* may be illustrated by analysis of remotely sensed data for the High Peak district of England, for which n = 900 pixels containing markedly high positive spatial autocorrelation is equivalent to n* ≈ 5 independent pixels (see the ensuing discussion for details). As an aside, Getis and Ord (2000) furnish a similar type of analysis for the multiple testing of local indices of spatial autocorrelation, which themselves are highly spatially autocorrelated by construction.

Of note is that establishing effective sample size unavoidably requires mathematical derivations; basic ones are outlined in the body of this article in order to establish the soundness of results. The validity of reported findings is further bolstered with simulation experiment results.

Important Considerations When Designing a Sampling Network

Random sampling in a geographic landscape requires considerations much like those used when designing a conventional stratified random sample. Geographic representativeness needs to be cast in terms of spatial coverage rather than in terms of, say, stratification to achieve good socioeconomic/demographic coverage. Designing a geographic sampling network also needs to protect against sample locations being correlated with the geographic distribution to be studied; this specific concern is why purely systematic sampling often is not used. Geographic sampling networks enable regional means to be estimated, either for predefined sets of aggregate areal units (choropleth maps) or as interpolations of continuous surfaces (contour maps). Geographic sampling networks, designed for efficient estimation of parameters describing the geographic distribution of interest, need to guard against grossly inefficient spatial prediction (Martin 2001; Miller 2001; Diggle and Lophaven 2004) and vice versa. This trade-off results in a compromise between a systematic sample, comprising regularly spaced sampling locations in order to achieve good geographic coverage and hence good interpolation accuracy, and irregularly spaced sampling locations in order to achieve better estimation of parameters for the geographic distribution of interest.

A sampling network can be devised in various ways to satisfy the condition of containing both regularly and irregularly spaced sampling locations. One way is to position n/2 of the locations systematically (e.g., on a regular square tessellation) and the remaining n/2 locations in a random fashion (i.e., randomly select east-west and, independently, north-south coordinates). This is the type of design associated with the GEOEAS data example (Englund and Sparks 1991; see http://www.websl1.uidaho.edu/geoe428/data_files.htm). A second method proposed by Diggle and Lophaven (2004) involves positioning some sampling locations on a regular square tessellation grid and the remaining locations on more finely spaced regular square tessellation grids within a randomly selected subset of cells demarcating the coarser grid: the lattice plus in-fill design. Diggle and Lophaven also propose a third design, which they prefer, that involves positioning some sampling locations on a regular square tessellation grid, with the remaining locations being randomly selected from constant radius circular buffer zones circumscribing a random subset of the systematically positioned locations: the lattice plus close pairs design. Unfortunately, the software currently available to support their designs "would encounter serious difficulties ... with numbers of locations larger than a few hundred" (Diggle and Lophaven 2004, 8). Yet a fourth design is the one employed for this article, which is based on hexagonal-tessellation, stratified, random sampling (Stehman and Overton 1996). A regular hexagonal tessellation containing n cells is superimposed on a region (the systematic component). Then a single location is randomly selected from within each hexagon (the random component). This design shares many similarities with the lattice plus close pairs design. Of note is that these network layout issues are central to debates about geostatistical sampling designs. Cressie (1991, §5.6) furnishes a useful overview of numerous spatial sampling designs.
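
The two-component logic of the tessellation-stratified design (a systematic grid of cells plus one random location per cell) can be sketched in a few lines. As a deliberate simplification, the sketch below uses square cells rather than the hexagonal cells employed in this article; the function name `stratified_tessellation_sample` and its parameters are hypothetical.

```python
import random


def stratified_tessellation_sample(n_rows, n_cols, cell_size=1.0, seed=0):
    """Draw one uniformly random location inside each cell of a square grid.

    Simplified stand-in for hexagonal-tessellation stratified random
    sampling: the grid is the systematic component, the within-cell
    draw is the random component.
    """
    rng = random.Random(seed)
    points = []
    for row in range(n_rows):
        for col in range(n_cols):
            x = (col + rng.random()) * cell_size  # east-west coordinate
            y = (row + rng.random()) * cell_size  # north-south coordinate
            points.append((row, col, x, y))
    return points


sample = stratified_tessellation_sample(5, 5)
print(len(sample), sample[0])
```

Every cell contributes exactly one location, so spatial coverage is guaranteed while the exact positions remain random.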

Mixing regularly and irregularly spaced sampling locations highlights another important feature of spatial analysis, namely design-based and model-based inference. The preceding sampling designs support design-based inference, which assumes that a given location has a unique fixed but unknown value for the geographic distribution of interest. The reference sampling distribution is constructed, conceptually, by repeatedly sampling from a geographic landscape and using the same design and calculating parameter estimates with each sample. Initially, spatial scientists believed that this strategy could not be legitimately used when data contain non-zero spatial autocorrelation (Brus and de Gruijter 1993). An alternative strategy is to let the value for some geographic distribution at a given location vary. In other words, the joint distribution of data values forming a map is one of an infinite number of possible realizations of some stochastic process; the total set of possible maps is called a superpopulation. Hence, the essential tool for describing a map is a model, resulting in this inferential basis being labeled model-based. A severe shortcoming of this latter approach is the difficulty in knowing whether or not model assumptions are valid, necessitating diagnostic analyses. But it furnishes an indispensable analytical tool for understanding nonrandomly sampled data such as remotely sensed data and for enabling spatial autocorrelation to be accounted for when devising a sampling design: the model-informed, design-based perspective outlined in this article.

A Conceptual Framework: The Effective Size of a Geographic Sample

A basis for establishing effective sample size for normally distributed georeferenced data is presented in terms of the sampling distribution of a single sample mean; extensions exploiting multiple sample means or the sample correlation coefficient are presented in Appendices A, B, and C. This approach, for which the assumption of a bell-shaped curve is critical, is directly analogous to that reported for time series (e.g., see the R Documentation) and is indirectly analogous to what is reported for survey weights with superpopulation models (Potthoff, Woodbury, and Manton 1992), whereby applying weights to sample results alters the value of n.

Measuring natural variability for some georeferenced phenomenon results in an inflated variance estimate when spatial autocorrelation is overlooked (see Haining 2003, §8.1). Suppose the n × n matrix $V$ contains the covariation structure among n georeferenced observations (more precisely, matrix $\sigma^2 V^{-1}$ is the covariance matrix), such that $Y = \mu\mathbf{1} + e = \mu\mathbf{1} + V^{-1/2}e^*$, where $Y$ denotes an n × 1 vector of georeferenced attribute values, $\mu$ denotes the population mean of variable Y, $\mathbf{1}$ denotes an n × 1 vector of ones, and $e$ and $e^*$, respectively, denote n × 1 vectors of spatially autocorrelated and unautocorrelated errors. Suppose $e^*$ is independent and identically distributed $N(0, \sigma^2_{e^*})$, where N denotes the normal distribution, and $\sigma^2_{e^*}$ denotes the population variance for variate $e^*$. If $V = I$, the n × n identity matrix, then the n observations are uncorrelated. Using matrix notation, the population variance estimate based upon a sample, and ignoring spatial autocorrelation, is given by

$$E(\hat{\sigma}^2) = E[(Y - \mu\mathbf{1})^T(Y - \mu\mathbf{1})/n] = \frac{\mathrm{TR}(V^{-1})}{n}\,\sigma^2_{e^*}, \tag{1A}$$

where $\hat{\sigma}^2$ denotes the estimate of $\sigma^2$, the variance of attribute variable Y, E denotes the calculus of expectations operator, and T and TR respectively denote the matrix transpose and trace operators. The quantity $\mathrm{TR}(V^{-1})/n$ is a variance inflation factor (VIF), similar to the VIF generated by multicollinearity in conventional multiple linear regression analysis; it expresses the degree to which collinearity among georeferenced observations degrades the precision of Y relative to similarly dispersed spatially uncorrelated values. Popular versions of matrix $V$ include, for spatial autoregressive parameter $\rho$ and binary geographic weights matrix $C$: $(I - \rho C)$ for the conditional autoregressive (CAR) model; and $[(I - \rho W)^T(I - \rho W)]$ for the autoregressive response (AR) and simultaneous autoregressive (SAR) models, where matrix $W$ is the row-standardized version of matrix $C$.³ Cliff and Ord (1981, Ch. 7), Anselin (1988), Griffith (1988), Haining (1990), and Cressie (1991, Ch. 1), among others, furnish additional details about these models.

Again using matrix notation, the variance of the sample mean of variable Y, $\bar{y}$, ignoring spatial autocorrelation, is given by

$$E(\hat{\sigma}^2_{\bar{y}}) = \frac{\mathbf{1}^T V^{-1}\mathbf{1}}{n^2}\,\sigma^2_{e^*}. \tag{1B}$$

Rearranging the terms on the right-hand side of this equation and making the necessary algebraic manipulations yields

$$E(\hat{\sigma}^2_{\bar{y}}) = \frac{\dfrac{\mathrm{TR}(V^{-1})}{n}\,\sigma^2_{e^*}}{\dfrac{\mathrm{TR}(V^{-1})}{\mathbf{1}^T V^{-1}\mathbf{1}}\,n}. \tag{1C}$$

The denominator on the right-hand side of this equation furnishes the formula for effective sample size, namely,

$$n^* = \frac{\mathrm{TR}(V^{-1})}{\mathbf{1}^T V^{-1}\mathbf{1}}\,n. \tag{2}$$

If the n observations are independent, and hence $V = I$, then n* = n, and the VIF becomes $\mathrm{TR}(V^{-1})/n = 1$. If perfect positive spatial autocorrelation prevails, then, conceptually, $V^{-1} = k\mathbf{1}\mathbf{1}^T$, with $k \to \infty$ as positive spatial autocorrelation increases, and n* = 1.
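
The effective sample size formula, n* = n TR(V⁻¹)/(1ᵀV⁻¹1), is straightforward to evaluate numerically. The following is a minimal sketch, assuming a SAR specification with V = (I − ρW)ᵀ(I − ρW) and a row-standardized rook-adjacency matrix W on a regular square lattice; the lattice, the helper names `rook_weights` and `effective_sample_size_sar`, and the ρ values are illustrative assumptions rather than the configurations used in the article's experiments.

```python
import numpy as np


def rook_weights(rows, cols):
    """Row-standardized rook-adjacency matrix W for a rows x cols lattice."""
    n = rows * cols
    C = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if r + 1 < rows:  # neighbor in the next row
                j = (r + 1) * cols + c
                C[i, j] = C[j, i] = 1.0
            if c + 1 < cols:  # neighbor in the next column
                j = r * cols + c + 1
                C[i, j] = C[j, i] = 1.0
    return C / C.sum(axis=1, keepdims=True)


def effective_sample_size_sar(W, rho):
    """n* = n * TR(V^-1) / (1' V^-1 1) with V = (I - rho W)'(I - rho W)."""
    n = W.shape[0]
    A = np.eye(n) - rho * W
    V_inv = np.linalg.inv(A.T @ A)  # proportional to the covariance matrix
    ones = np.ones(n)
    return n * np.trace(V_inv) / (ones @ V_inv @ ones)


W = rook_weights(10, 10)
n_star = {rho: effective_sample_size_sar(W, rho) for rho in (0.0, 0.5, 0.9)}
for rho, value in n_star.items():
    print(f"rho = {rho:.1f}: n* = {value:.2f}")
```

With ρ = 0 the function returns n exactly, and n* shrinks toward 1 as ρ approaches 1, matching the limiting cases stated above.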

In addition to the mathematical statistical theory derivation of Equation (2), its validity can be assessed through simulation experimentation. A simple exploratory experiment (100 replications) was conducted for selected cases in which variable Y was distributed N(0, 1), spatial autocorrelation was embedded with an SAR model, and n, the level of positive spatial autocorrelation $\rho$, and geographic connectivity were varied. Figure 1A portrays a scatterplot of the simulated standard error versus a standard error computed with the VIF and Equation (2). The goodness-of-fit of the regression line appearing in this graph highlights the soundness of Equation (2), with a noticeable deviation being attributable only to simulation variability.
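
A simulation of this kind can be sketched as follows. This is not the article's experiment: it assumes a single 7 × 7 rook-adjacency lattice, ρ = 0.6, unit error variance, and 5,000 replications (rather than 100) so that the comparison is stable; all names and settings are hypothetical. The formula-based standard error is computed as the square root of VIF/n*, which reduces to the square root of 1ᵀV⁻¹1/n².

```python
import numpy as np


def rook_weights(rows, cols):
    """Row-standardized rook-adjacency matrix W for a rows x cols lattice."""
    n = rows * cols
    C = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            if r + 1 < rows:
                C[i, i + cols] = C[i + cols, i] = 1.0
            if c + 1 < cols:
                C[i, i + 1] = C[i + 1, i] = 1.0
    return C / C.sum(axis=1, keepdims=True)


rng = np.random.default_rng(42)
rows = cols = 7
n = rows * cols
rho = 0.6  # assumed level of positive spatial autocorrelation

W = rook_weights(rows, cols)
A_inv = np.linalg.inv(np.eye(n) - rho * W)
cov = A_inv @ A_inv.T  # V^-1 when the error variance equals 1
ones = np.ones(n)

# Standard error of the mean from the VIF and the effective sample size.
vif = np.trace(cov) / n
n_star = n * np.trace(cov) / (ones @ cov @ ones)
se_formula = np.sqrt(vif / n_star)
se_direct = np.sqrt(ones @ cov @ ones) / n  # algebraically identical

# Simulated standard error: each row of Y is one autocorrelated map.
reps = 5000
Y = rng.standard_normal((reps, n)) @ A_inv.T
se_simulated = Y.mean(axis=1).std(ddof=1)

print(f"n* = {n_star:.2f}, formula SE = {se_formula:.4f}, "
      f"simulated SE = {se_simulated:.4f}")
```

Agreement between the two standard errors, up to simulation noise, is what the scatterplot in Figure 1A summarizes across many configurations.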

Mean-Based Results for a Spatial Autoregression Model Specification: The SAR Model

Findings based on a SAR model are illuminating here because a SAR model can be rewritten as a CAR model (see Cliff and Ord 1981), whereas semivariogram models can be directly related to it (see Griffith and Layne 1997).

The Case of a Single Geographic Mean

Griffith (2003) reports findings for Equation (2) and its extension to expression (A1) in Appendix A, including the following conjecture, which is a slight improvement on the result reported by Griffith and Zhang (1999): If georeferenced attribute variable Y is normally distributed, or approximately so, and $\hat{\rho}$ is the spatial autocorrelation parameter estimate for a SAR model specification, then the effective sample size is given by

$$\hat{n}^* = n\left[1 - \frac{1}{1 - e^{-1.92369}}\,\frac{n-1}{n}\left(1 - e^{-2.12373\hat{\rho} + 0.20024\sqrt{\hat{\rho}}}\right)\right]. \tag{3}$$

Figure 1. (A): a scatterplot of the simulated standard error (100 replications) versus a standard error computed with the variance inflation factor (VIF) and Equation (2), denoted by solid circles; the superimposed solid straight gray line denotes predicted values produced by the estimated regression equation. (B): a scatterplot of n* computed with Equation (2) versus $\hat{n}^*$ computed with approximation Equation (3), denoted by asterisks (*), and with Cressie's (1991, 15) equation, denoted by open circles (o); the superimposed solid straight gray line denotes predicted values produced by the estimated regression equation based upon Equation (3) results, and the broken straight gray line denotes predicted values produced by the estimated regression equation based upon Cressie's equation's results.

Of note, again, is that the normality assumption is critical here. In addition to a nonlinear regression analysis of empirical cases used to calibrate Equation (3), its validity can be assessed through simulation experimentation. The experiment used to validate Equation (2) also was used to validate Equation (3). Figure 1B portrays a scatterplot of n* computed with Equation (2) versus $\hat{n}^*$ computed with Equation (3). The goodness-of-fit of the regression line appearing in this graph highlights the soundness of Equation (3). Of note is that altering the geographic connectivity definition results in slight but perceptible variation about values calculated with Equation (3).

Cressie (1991, 15) reports a comparable effective sample size formula, which also was used to predict n* for the simulation experiment (see Figure 1B). Equation (3) outperforms Cressie's equation, yielding a mean squared error (MSE) of 55 that is substantially less than the MSE of 334 for his equation.
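
Because the approximation needs only n and an estimate of ρ, it is easy to evaluate directly. The sketch below assumes the closed form n[1 − (1/(1 − e^(−1.92369))) ((n − 1)/n) (1 − e^(−2.12373ρ̂ + 0.20024√ρ̂))], with coefficients as read from the partly legible source equation; the function name `n_star_sar_approx` is hypothetical. As a plausibility check, the ρ̂ values reported in Table 1A for the Puerto Rico mean elevation (0.83139, n = 73) and Murray arsenic (0.53180, n = 253) variables are used as inputs, for which the table lists n* values of 6.24 and 68.24.

```python
import math


def n_star_sar_approx(n, rho_hat):
    """Approximate effective sample size for a SAR model, 0 <= rho_hat <= 1.

    Coefficients are taken as read from the source equation and should be
    treated as an assumption of this sketch.
    """
    decay = 1.0 - math.exp(-2.12373 * rho_hat + 0.20024 * math.sqrt(rho_hat))
    scale = 1.0 / (1.0 - math.exp(-1.92369))
    return n * (1.0 - scale * (n - 1) / n * decay)


puerto_rico = n_star_sar_approx(73, 0.83139)
murray_as = n_star_sar_approx(253, 0.53180)
high_peak = n_star_sar_approx(900, 0.98834)
print(round(puerto_rico, 2), round(murray_as, 2), round(high_peak, 2))
```

The approximations land close to the tabulated n* values, and setting ρ̂ = 0 returns n, consistent with the limiting behavior described for Equation (2).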

The Case of Two Geographic Means

Following the same logic that establishes Equation (2), Griffith also reports, for the bivariate weighted mean case (see additional details in Appendix A) specified in terms of a SAR model, the conventional variance term in expression (A1), namely

$$\frac{w^2\sigma_X^2 + (1-w)^2\sigma_Y^2 + 2w(1-w)\rho_{XY}\sigma_X\sigma_Y}{n}, \tag{4A}$$

where $\sigma_X^2$ and $\sigma_Y^2$ respectively denote the variance of variables X and Y, $\rho_{XY}$ denotes the correlation between attribute variables X and Y, w (0 < w < 1) is the weight applied to the mean of variable X, and the term $2w(1-w)\rho_{XY}\sigma_X\sigma_Y$ adjusts for the presence of redundant attribute information in a bivariate georeferenced dataset. This expression is multiplied by the VIF appearing in expression (A1), namely,

$$\left(w^2\hat{\sigma}_X^2\,\mathrm{TR}\{[(I - \hat{\rho}_X W)^T(I - \hat{\rho}_X W)]^{-1}\} + (1-w)^2\hat{\sigma}_Y^2\,\mathrm{TR}\{[(I - \hat{\rho}_Y W)^T(I - \hat{\rho}_Y W)]^{-1}\}\right)/n, \tag{4B}$$

where $\hat{\rho}_X$ and $\hat{\rho}_Y$, respectively, are the spatial autocorrelation parameter estimates for variables X and Y. The resulting product then is divided by the term denoting sampling variability of means in the presence of non-zero spatial autocorrelation, also appearing in expression (A1), namely,

$$\left(w^2\hat{\sigma}_X^2\,\mathbf{1}^T[(I - \hat{\rho}_X W)^T(I - \hat{\rho}_X W)]^{-1}\mathbf{1} + (1-w)^2\hat{\sigma}_Y^2\,\mathbf{1}^T[(I - \hat{\rho}_Y W)^T(I - \hat{\rho}_Y W)]^{-1}\mathbf{1} + 2w(1-w)\hat{\rho}_{XY}\hat{\sigma}_X\hat{\sigma}_Y\,\mathbf{1}^T[(I - \hat{\rho}_X W)^T(I - \hat{\rho}_Y W)]^{-1}\mathbf{1}\right)/n^2. \tag{4C}$$

A closer inspection of this conventional variance expression reveals that it contains the individual, weighted, standard error terms $w^2\sigma_X^2/n$ and $(1-w)^2\sigma_Y^2/n$. A closer inspection of this VIF term reveals that it contains the weighted, individual VIF terms $w^2\hat{\sigma}_X^2\,\mathrm{TR}\{[(I - \hat{\rho}_X W)^T(I - \hat{\rho}_X W)]^{-1}\}/n$ and $(1-w)^2\hat{\sigma}_Y^2\,\mathrm{TR}\{[(I - \hat{\rho}_Y W)^T(I - \hat{\rho}_Y W)]^{-1}\}/n$, which can be seen in Equation (2). One surprise here is that this VIF expression does not include the cross-product term involving $[(I - \hat{\rho}_X W)^T(I - \hat{\rho}_Y W)]^{-1}$. And a closer inspection of this means variability term reveals that it contains the prorating factor for each individual variable n*, as well as a cross-products term.

A simulation experiment based on n = 625, $\sigma_X = \sigma_Y = 1$ and 100 replications was conducted to establish the validity of expression (A1) across the range of w and $\rho_{XY}$ values. Figure 2 portrays a scatterplot of the simulated standard error versus a standard error computed with expression (A1). The goodness-of-fit of the regression line appearing in this graph highlights the soundness of expression (A1), with noticeable deviations being attributable only to simulation variability.

Figure 2. A scatterplot of the simulated standard error versus a standard error computed with expression (A1), denoted by solid circles; the superimposed solid straight gray line denotes predicted values produced by the estimated regression equation.

Graphs for two illustrative empirical case studies that portray the curve described by expression (A1) when $\hat{\rho}_X \neq \hat{\rho}_Y$ appear in Figure 3. Relevant statistics for these two examples used to construct these graphs appear in Table 1A. The curves portrayed in Figure 3 may be approximated with the following equations, which are equivalent to but simpler in form than the one reported in Griffith (2003, 85) and demonstrate that the joint n* value essentially is a weighted function of the two effective sample sizes that can be computed separately for the individual means:

$$\hat{n}^*_{e,s_e} = \frac{w^{1.5899}}{w^{1.5899} + (1-w)^{1.5968}}\,n^*_{e} + \frac{(1-w)^{1.5968}}{w^{1.5899} + (1-w)^{1.5968}}\,n^*_{s_e}, \quad \text{pseudo-}R^2 = 0.9998, \tag{5A}$$

and

$$\hat{n}^*_{As,Pb} = \frac{w^{2.0171}}{w^{2.0171} + (1-w)^{1.3475}}\,n^*_{As} + \frac{(1-w)^{1.3475}}{w^{2.0171} + (1-w)^{1.3475}}\,n^*_{Pb}, \quad \text{pseudo-}R^2 = 0.9994, \tag{5B}$$

where pseudo-$R^2$ is the squared multiple correlation coefficient ($R^2$) between n* and $\hat{n}^*$. Because of the role played by variance terms, which are specific to cases, only the general form of the equation can be established at this time. This empirical analysis further corroborates the validity of expression (A1).

As an aside, the extension of this two-means result to a multiple-means result is summarized in Appendix B.

Figure 3. Plots of the bivariate curve for the illustrative examples; the approximation curve is denoted by the dotted line (...), and the scatterplot of the exact n* values is denoted by solid circles. (A): the Puerto Rico digital elevation model (DEM). (B): the Murray Smelter site. [Horizontal axis in each panel: w.]

Table 1A. Selected descriptive statistics for the Puerto Rico digital elevation model (DEM) and Murray smelter site soil contaminants

Landscape                       Variable                                    Standard deviation   ρ̂         Bivariate correlation   n*
Puerto Rico (n = 73)            DEM mean elevation (e)                      0.80270              0.83139   0.68102                 6.24
                                DEM elevation standard deviation (s_e)      3.54528              0.67638                           12.69
Murray smelter site (n = 253)   Arsenic (As)                                3.46316              0.53180   0.74775                 68.24
                                Lead (Pb)                                   7.69417              0.49363                           76.95

NOTE: Griffith (2003) provides descriptions of these two datasets. The Murray, Utah, landscape is a superfund site. DEM denotes digital elevation model.

The Case of a Pearson Product-Moment Correlation Coefficient

Spatial scientists are often interested in measuring relationships between, rather than means of, geographically distributed variables. Details for computing effective sample size in this context, again assuming normally distributed variables, appear in Appendix C. The following approximate relationship holds between the individual mean-based univariate effective sample sizes and their corresponding bivariate correlation-based effective sample size:

[Equation (6), giving $\hat{n}^*_{XY}$, is not legible in the source; the surviving fragments are a leading term $1 + \cdots$, the factor $(\hat{\rho}_X + \hat{\rho}_Y)^{1.161}$, and a coefficient 0.04 applied to a term in $n^*_X$ and $n^*_Y$.]

This approximation results in: $\hat{n}^*_{XY} = n$ when spatial autocorrelation is absent; $\hat{n}^*_{XY}$ asymptotically converging on 2 from below when $\hat{\rho}_X = \hat{\rho}_Y \approx 1$ and $\rho_{XY} = 1$; and, $\hat{n}^*_{XY}$ asymptotically converging on roughly 5 when $\hat{\rho}_X = \hat{\rho}_Y \approx 1$ and $\rho_{XY} = 0$, highlighting that it is an approximation; theoretically, desired values here are 1 and 2. Results for five Thiessen polygon-based empirical examples appear in Table 1B; these results demonstrate that Equation (6) furnishes a good approximation for Equation (C1).

These findings highlight the implication that impacts of spatial autocorrelation can be mitigated to some extent by incorporating redundant georeferenced attribute information into an analysis, a natural form of which arises in space-time data series. Lahiri (1996, 2003) notes that this is one way of regaining estimator consistency when employing infill asymptotics (i.e., the sample size increases by keeping the study area size constant and increasing the sampling intensity).
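
Mean-Based Results for Selected Geostatistical/Semivariogram Model Specifications

Geostatistical models involve defining matrix $V^{-1}$ rather than matrix $V$. The scalar form of Equation (2) for semivariogram models is given by

$$n^* = \frac{n(C_0 + C_1)}{n(C_0 + C_1) + C_1\left[n(n-1) - \sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n} f(d_{ij}, r)\right]}\,n = \frac{n}{1 + \dfrac{C_1}{C_0 + C_1}\left[(n-1) - \dfrac{1}{n}\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n} f(d_{ij}, r)\right]}, \tag{7}$$

where $f(d_{ij}, r)$ denotes a particular semivariogram model with range parameter r, nugget $C_0$, and slope $C_1$, with $d_{ij}$ denoting distance separating locations i and j. The sill in a semivariogram model "represents a value that the semivariogram tends to when distance gets very large," and hence at such extremely large distances equals the variance of the variable under study (K. Johnson et al. 2001, 283), $C_0 + C_1$. If no spatial autocorrelation is present, then the right-hand term in the denominator of Equation (7), $\sum\sum f(d_{ij}, r)/n$, equals (n − 1), and n* = n; as spatial autocorrelation increases, the denominator increases [n* goes to $n(C_0 + C_1)/(C_0 + nC_1)$, since $1 - f(d_{ij}, r) \to 1$ as r increases], resulting in n* decreasing from n to 1 when $C_0 = 0$. The following are three model-specific special instances of Equation (7), when $C_0$ is zero, where the approximation formula is given by $1 + (n-1)/(1 + b\,r/d_{max})^{c}$, in which $d_{max}$ denotes the maximum interpoint distance, which is the counterpart to Equation (3) for semivariogram models:

$$\hat{n}^*_{spherical} = \frac{n}{1 + \dfrac{1}{n}\sum_{i=1}^{n}\sum_{j\neq i,\ d_{ij}\le r}\left[1 - \dfrac{3}{2}\dfrac{d_{ij}}{r} + \dfrac{1}{2}\dfrac{d_{ij}^3}{r^3}\right]} \approx 1 + \frac{n-1}{\left(1 + 5.12\,\dfrac{r}{d_{max}}\right)^{1.776}}, \tag{8A}$$

$$\hat{n}^*_{exponential} = \frac{n}{1 + \dfrac{1}{n}\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n} e^{-d_{ij}/r}} \approx 1 + \frac{n-1}{\left(1 + 51.4879\,\dfrac{r}{d_{max}}\right)^{1.7576}}, \text{ and} \tag{8B}$$

$$\hat{n}^*_{K\text{-}Bessel} = \frac{n}{1 + \dfrac{1}{n}\sum_{i=1}^{n}\sum_{j=1,\,j\neq i}^{n} \dfrac{d_{ij}}{r}K_1\!\left(\dfrac{d_{ij}}{r}\right)} \approx 1 + \frac{n-1}{\left(1 + 69.6698\,\dfrac{r}{d_{max}}\right)^{1.8601}}, \tag{8C}$$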
where $K_1$ is a modified Bessel function of the first order and second kind and the respective relative error sums of squares (RESSs) for the approximations are 5.4 × 10⁻⁷, 8.9 × 10⁻⁶, and 7.5 × 10⁻⁷. The spherical model finding helps highlight why average nearest-neighbor distance could serve as an informative index about spatial autocorrelation here. For the spherical model, if n = 1 then n* = 1. If n = 2, $n^* = 2/[2 - \frac{3}{2}(d_{12}/r) + \frac{1}{2}(d_{12}/r)^3]$. As $d_{12} \to 0$, $n^* \to 1$; in the limit, when an observation is replicated, only a single observation effectively exists. As $d_{12} \to r$, $n^* \to 2$; in the limit, when two observations are far enough apart, they contain no duplicated information. For the exponential model, again, if n = 1 then n* = 1. If n = 2, $n^* = 2/(1 + e^{-d_{12}/r})$. As $d_{12} \to 0$, $n^* \to 1$; as $d_{12} \to \infty$, $n^* \to 2$. For the K-Bessel function model, once more, if n = 1 then n* = 1. If n = 2, $n^* = 2/[1 + (d_{12}/r)K_1(d_{12}/r)]$. As $d_{12} \to 0$, $n^* \to 1$; as $d_{12} \to \infty$, $n^* \to 2$; these particular limits can be confirmed with Maple or Mathematica. Graphs of these three functions, based on nonlinear regression generalizations of a simulation experiment using 250 randomly selected points from a unit square geographic region and with 500 replications, appear in Figure 4. These graphs suggest that the range parameter of a semivariogram model behaves similar to $\rho$ for an autoregressive model; the (practical) range increases as the degree of spatial autocorrelation increases. These graphs also suggest that spatial autocorrelation increases as a model description goes from the spherical specification, to the exponential specification, to the K-Bessel function specification. This implication is consistent with Griffith and Csillag's (1993) and Griffith and Layne's (1997, 1999) findings that conceptually and numerically link the exponential and CAR models and the K-Bessel function and SAR models. This correspondence is closest for georeferenced data distributed across a regular square tessellation (e.g., remotely sensed data). Whereas autoregressive models are specified in terms of an inverse covariance matrix, semivariogram models are specified in terms of the associated covariance matrix itself. Therefore, Equation (2), expression (A1), and Equation (C3) also can be employed with semivariogram modeling results.

Figure 4. Effective sample size (n*) for selected semivariogram models across their respective range parameters based upon simulation experiments (250 replications); solid circles denote simulation results, and dots (...) denote generalized function results. (A): spherical semivariogram. (B): exponential semivariogram. (C): Bessel function semivariogram. [Horizontal axis in each panel: r/dmax.]

The Case of Negative Spatial Autocorrelation

A large majority of georeferenced data display moderate positive spatial autocorrelation. One exception is remotely sensed data, which tend to display very strong positive spatial autocorrelation. Negative spatial autocorrelation is not encountered in practice nearly as often as is positive spatial autocorrelation. Rare examples of it are furnished by Anselin (1988), and by Griffith, Wong, and Whitfield (2003). Richardson (1990) notes that when negative spatial autocorrelation is present, n* > n, which at first glance seems counterintuitive. But negative spatial autocorrelation is nothing more than an antithetic variate (see Hall 1989) in spatial guise, which allows more to be gleaned from less, rather than less from more when spatial autocorrelation is positive. This type of result can be obtained with Equation (2) by letting the autoregressive parameter be negative (i.e., $\rho < 0$). For example, for the Murray site, n* nearly reaches 1,200 when $\rho = -0.9$. The wave-hole specification is the most popular semivariogram model for describing negative spatial autocorrelation (other possibilities include the cubic model). But for it, as the range increases, n* decreases from n to 1. This fundamental difference between these two classes of models is attributable to factors other than continuity (e.g., autoregressive models depend upon a rigid neighborhood structure for discretized data aggregations that is defined a priori, whereas geostatistical models tend to over-smooth when spatial intensities are scattered in small clusters) and matrix inversion impacts (e.g., edge effects due to the shape and size of both a region and the areal units into which it is partitioned). The wave-hole semivariogram model actually behaves more like the AR(2) time series model

$$Y_t = \mu + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \varepsilon_t,$$

for which $\phi_1 > 0$ and $\phi_2 < 0$ (e.g., $\phi_1 = 0.66$ and $\phi_2 = -0.22$). As such, positive spatial autocorrelation still exists locally at relatively short distances. This feature is necessarily so because in a continuous situation, changing nearby distances by a small amount cannot be accompanied by continual dissimilarity.
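
The damped-wave behavior of such an AR(2) process can be checked with the standard Yule-Walker recursion for its autocorrelation function, using the example coefficients φ1 = 0.66 and φ2 = −0.22 quoted above. The function name `ar2_autocorrelations` and the choice of eight lags are illustrative assumptions.

```python
def ar2_autocorrelations(phi1, phi2, max_lag):
    """Autocorrelations rho_0..rho_max_lag of a stationary AR(2) process.

    Uses the Yule-Walker relations: rho_1 = phi1 / (1 - phi2) and
    rho_k = phi1 * rho_{k-1} + phi2 * rho_{k-2} for k >= 2.
    Assumes max_lag >= 1.
    """
    rho = [1.0, phi1 / (1.0 - phi2)]
    for _ in range(2, max_lag + 1):
        rho.append(phi1 * rho[-1] + phi2 * rho[-2])
    return rho


acf = ar2_autocorrelations(0.66, -0.22, 8)
for lag, value in enumerate(acf):
    print(f"lag {lag}: {value:+.4f}")
```

The first two lags are positive and the next ones turn negative, which mirrors the point made in the text: autocorrelation remains positive at short separations even when a hole effect appears farther out.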

A Remotely Sensed Image Example

Bailey and Gatrell (1995, 254) analyze LANDSAT 5 "Thematic Mapper" (TM) sensor data from a 1-km², 30 × 30 pixels (n = 900) portion of the High Peak District in England. This image includes a mix of land uses, from a reservoir located in the southwest part, through mixed deciduous and coniferous woodland inhabiting the south-central area, to rough grazing and moorland covering the northern part of this region. In other words, these data are very heterogeneous. The illustration discussed here is in terms of the ratio of spectral Band #4 (B4) to spectral Band #3 (B3), bands that, respectively, represent the near-infrared and red wavelengths of the electromagnetic spectrum. The geographic visualization of this ratio provides a good picture of spatial variation in biomass, with healthy green vegetation reflecting strongly in B4, which measures the vigor of vegetation, whereas its energy absorption is strongly sensed in B3, which aids in the identification of plant species.
Box-CoxpowertransThe following
heterogeneous
was appliedto the biomassindexto better
formation
stabilizethe varianceacrossthe HighPeak geographic
indexwitha
thetransformed
landscape,betteraligning
normaldistribution:
LN
B4 + 29
L[B317
rank/B, 0.44
+1
-0.18]
rankB4,/B3
n-
0.98834 and n* = 5.12. In other words,
on average,the 900 spatiallyautocorrelated
pixels
the
Peak
attain
the
forming High
image
approximately
statistical
as
five
precision only
independsamelevelof
entpixels.
Eleven semivariogram
modelswere fittedto these
HighPeakdata. In all butone case (cubic),theCo inestimate
wasvery
(i.e.,nuggeteffect)
tercept
parameter
close to zero,and thus,forsimplicity,
Co was set to zero.
Estimation
and (practical)rangeresultsappearin Table
2. Mostofthesemivariogram
modelsfurnish
a verygood
of spatialautocorrelation
latentin these
description
data. MostRESSs are roughly
2-3 percent.Semivariogramplots(Figure5) also revealthatthe modelpredictionsclosely track the data; the two poorest
are providedby the cubic model,which
descriptions
tendstoyieldvaluesthataretoohighforshortdistances,
andtheGaussianmodel,whichtendstoyieldvaluesthat
are too lowforshortdistances.
Effectivesample sizes here rangefrom6.75 to 17.42.
Althoughthesevaluesare ofthe sameorderofmagnitude as the one producedwiththe SAR model,they
tendto be noticeably
theK-Bessel
Furthermore,
higher.
whichis thetheoretical
model
function,
semivariogram
companionforthe SAR model,does not producethe
closestof the set ofn* valuesto 5.12. Of noteis that
the SAR value is consistent
withwhatwouldbe obtainedwitha timeseriesforwhichj = 0.98834:
n* = 900(1 - 0.98834)/(1 + 0.98834) = 5.28.
(9)
whereLN denotesthe naturallogarithm.
SpatialSAR
data inresultsforthesetransformed
modelestimation
==
And it is consistent with the value of 5.83 ren-
deredby Cressie's(1991, 15) formulae,
forwhichthis
Variable
* gamma
0.09-
* spherical
0.08)
*
0.07 -
&
*
4
0.06E 0 E 0.05
'
eassel
circular
0.03
0.01
penta-spherical
quadratic
rational
+ Gaussian
X cubic
0.04
0.02
exponential
A stable
-
0.00-" 0
0.0
0.1
0.2
0.3
standardized
distance
0.4
5. Semivariogram
Figure
plotsandpredicted
valuesforelevengeostatistical
modelsdeintheHigh
latent
scribing
spatial
dependency
Peakbiomass
index.
EffectiveGeographicSample Size in the Presenceof Spatial Autocorrelation
16
terval (56.3, 74.5) and 68.2 fallingabove the interval
(46.0, 67.6).
15
and a SingleMean: The MurraySmelter
InfillSampling
SiteExample
14
b
12
10
80
6
749
0.25
0.30
0.35
0.40
0.45
range
0.50
0.55
0.60
0.65
6. Modelestimated
versus
effective
Figure
(practical)
range
sample
size(n*)for11geostatistical
theHighPeakbiomass
index.
models,
precedingtime series resultis the asymptoticlimitin
termsof n.
the negative relationshipbetween the
Interestingly,
(practical) range and n* can be detected fromthese
modelingresults (see Figure 6). In other words, as a
spatial dependencyfield increases in size, more information becomes redundant,and effectivesample size
decreases.
TheMurray
Smelter
SiteRevisited
To illustrateresultsobtainableusingirregularspaced
pointdata and semivariogram
depictionsof spatialautocorrelation,the Murraysmeltersite data were analyzed
in detail. SAR model descriptionsof arsenic (As) and
lead (Pb) are reportedin Table 1B. As previouslyseen
with the High Peak data, these SAR-based, effective
fromthe rangeof similarsemisample size values differ
variogrammodel resultsreportedin Table 3, although
theyall are of the same orderof magnitude.Here the
values are lower than theirautoregressive
counterparts,
withthe SAR-based values of 77.0 fallingabove the in-
Semivariogram models, because of their continuous nature, are especially useful for assessing the situation of infill sampling (i.e., the size of a region is held constant while sampling increasingly is more intensive). In other words, for a given region, as sampling becomes increasingly more intense, what happens to n*? In this context, n* should become a function of the average first nearest-neighbor distance, say d_NN1, between sample locations. As distance between nearby locations decreases, the overall amount of redundant information in a sample containing spatial autocorrelation will tend to increase.
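The infill effect described above can be illustrated with a small numerical sketch. It is not the simulation experiment reported in this article: it assumes a unit-square region, regular square grids of points rather than hexagonal tessellations, an exponential correlation function with a hypothetical range parameter of 0.1, and the standard variance-ratio definition of effective sample size for a mean, n* = n²/(sum of all pairwise correlations).

```python
import math

def effective_sample_size(points, range_param):
    """n* = n^2 / (sum of all pairwise correlations) under exponential correlation."""
    n = len(points)
    total = 0.0
    for (x1, y1) in points:
        for (x2, y2) in points:
            # Correlation decays exponentially with interpoint distance.
            total += math.exp(-math.hypot(x1 - x2, y1 - y2) / range_param)
    return n * n / total

def square_grid(k):
    """k x k regular grid of points covering the unit square (region size held fixed)."""
    return [(i / (k - 1), j / (k - 1)) for i in range(k) for j in range(k)]

# Infill: same region, progressively denser grids.
results = {}
for k in (3, 6, 12, 24):
    n = k * k
    results[n] = effective_sample_size(square_grid(k), range_param=0.1)
    print(n, round(results[n], 1), round(results[n] / n, 3))
```

Under these assumptions the ratio n*/n falls steadily as the grid becomes denser, so each additional observation contributes less new information.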
A second exploratory simulation experiment was conducted with the Murray smelter site data, which is based upon hexagonal-tessellation stratified random sampling (Stehman and Overton 1996). Sample size was sequentially increased from roughly 100 to roughly 2,000 by increments of approximately 100. The design for this type of sampling is outlined in Appendix D. The hexagonal tessellations for samples of size n = 104 and n = 2,008 appear in Figure 7. Because these sampling schemes are being used for illustrative purposes, the noncovered sections of the landscape are ignored; in reality, these parts of the landscape would be covered with partial hexagons that then could be grouped into a set of artificial piecemeal hexagons whose individual areas would equal that of each complete hexagon for sampling purposes.
Relationships between n* and n, by semivariogram model specification (see Table 3) for As and Pb separately, are portrayed in Figure 8A. These graphs suggest asymptotic diminishing returns for n* in each case. Meanwhile, as n increases, the average first nearest-neighbor distance, d_NN1, for a regular hexagonal tessellation will tend to decrease; this value, which can be
Table 1B. Selected descriptive statistics for five particular geographic landscape examples

Geographic landscape     Variables (X & Y)                 ρ̂_x     ρ̂_y     r_xy      n     n*_x    n*_y    n*_xy   n̂*_xy
Puerto Rico DEM          Elevation mean & variance         0.8314  0.6764   0.6810    73     6.24   12.69   30.97    29.12
Kansas oil wells         Thickness & % shale               0.3743  0.1093  -0.5463   124    52.95   98.97  119.38   117.25
Texas                    pH & Se                           0.1398  0.3605  -0.0031   127    95.03   56.82  120.57   120.59
Texas                    U & SO4                           0.3452  0.6834  -0.2521   127    59.07   20.93   90.12    91.23
Texas                    Mo & B                            0.3218  0.3022  -0.4942   127    62.59   65.66  113.98   114.26
Texas                    V & As                            0.3881  0.6878   0.7122   127    52.92   20.56   84.55    85.65
Texas                    Bu & SO4                          0.4530  0.6834  -0.7686   127    44.44   20.93   77.31    80.52
Murray site              As & Pb                           0.5318  0.4936   0.7478   253    68.24   76.95  179.08   175.41
Minnesota forest stand   basal area & suitability index    0.3648  0.3569   0.4200   513   219.49  224.27  436.52   437.48
Table 2. Selected semivariogram modeling results for the High Peak remotely sensed image biomass index

Model                C1       r        100 x RESS   (practical) range       n*
Spherical            0.0858   0.3527    2.7         0.3527                  15.55
Exponential          0.1066   0.2133    2.3         0.6399                   6.75
Stable               0.0986   0.1857    2.2         0.4956                   9.15
Penta-spherical      0.0868   0.4298    2.4         0.4298                  14.86
Rational quadratic   0.0962   0.1301    4.1         0.5671                   9.03
Bessel               0.0923   0.1001    2.8         0.4004                  11.95
Gaussian             0.0837   0.1505    8.8         0.2607                  17.42
Cubic                0.0538   0.5002   45.8         0.5002                  fit unacceptable
Circular             0.0852   0.3107    3.5         0.3107                  15.89
Cauchy               0.4772   0.0431    2.5         no practical range      NA
Power                0.1605   0.5837    4.5         essentially no range    NA

Practical ranges are calculated following Griffith and Layne (1999, 468). C1 denotes the variance estimate (i.e., C0 = 0); r denotes the range parameter.
standardized by the observed maximum interpoint distance, yielding d_NN1/d_max, will be a function of the regular hexagonal tessellation centroids.
Infill sampling results are for a fixed range parameter, r, and may be expressed as the following function of d_NN1/d_max and of the product n d_NN1/d_max, which is equivalent to the sum of the individual point nearest-neighbor distances divided by the maximum interpoint distance:

$$ n^*_{infill} = \frac{n}{1 + b e^{-0.13867c - 3.46687d}} \times \left(1 + b e^{-c \times d_{NN1}/d_{max} - d \times n d_{NN1}/d_{max}}\right), \quad (10) $$

which equals n, as it should, when the uniform spacing of all neighboring points is at a distance exceeding the (practical) range. The quantity $1/(1 + b e^{-0.13867c - 3.46687d})$ is the infill asymptotic effective sample size discount factor by which n needs to be multiplied to calculate n*. Estimates of b, c, d, and the discount factor for the Murray data appear in Table 4. A scatterplot for the observed and predicted values associated with Equation (10) appears in Figure 8B, corroborating the high pseudo-R² values reported in Table 4 that imply a close correspondence between Equation (10) and the infill sampling results.
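As a check on the discount factor, the following sketch recomputes it for three of the arsenic (As) rows of Table 4. It reads the discount factor as the reciprocal 1/(1 + b e^(-0.13867c - 3.46687d)), which is the reading that reproduces the tabulated values; the function and variable names are hypothetical.

```python
import math

def infill_discount_factor(b, c, d, u=0.13867, v=3.46687):
    """Asymptotic infill discount factor 1 / (1 + b * exp(-u*c - v*d))."""
    return 1.0 / (1.0 + b * math.exp(-u * c - v * d))

# (b, c, d, tabulated discount factor) for arsenic (As), Murray smelter site, Table 4.
as_rows = {
    "Spherical":   (68.908, 6.233, 0.223, 0.06942),
    "Exponential": (40.567, -0.799, 0.163, 0.03742),
    "Gaussian":    (120.588, 10.568, 0.268, 0.08332),
}

computed = {}
for model, (b, c, d, tabulated) in as_rows.items():
    computed[model] = infill_discount_factor(b, c, d)
    print(model, round(computed[model], 5), tabulated)
```

The recomputed factors agree with the tabulated ones to within the rounding of the printed coefficients.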
Whereas Equation (7) describes how n* changes as spatial autocorrelation in an attribute variable changes, Equation (10) describes particular model instances of how n* changes as spatial autocorrelation in a sample changes, while the spatial autocorrelation of the attribute variable remains constant. Of note is that the practical range of As for the Gaussian semivariogram model is less than d_NN1 for that case, resulting in n* = n for both the univariate and bivariate cases, as it should be.
Infill Sampling and a Single Mean: A Skeet Shooting Superfund Site Example

A third exploratory simulation experiment was conducted with data collected for a skeet shooting superfund site. For evaluation purposes, 236 surface soil samples were collected in the superfund site. A generalization of the geographic distribution of these measures reveals a single spot within the site that was intensively sampled. The sampling network used throughout most of the site
Table 3. Selected semivariogram modeling results for soil samples from the Murray smelter site using a maximum standardized distance of 0.20; the maximum standardized distance is 1.05010

                     Lead (Pb)                                                       Arsenic (As)
Model                C0        C1        Range/practical range  100 x RESS  n*      C0        C1        Range/practical range  100 x RESS  n*
Spherical            0.13235   0.86765   0.08139                44.9        73.5    0.09189   0.90811   0.08511                33.4        67.0
Exponential          0.03661   0.96339   0.09179                43.4        64.6    0.02487   0.97513   0.10630                31.0        52.0
Stable               0.00222   0.99778   0.10922                43.2        56.3    0.00000   1.00000   0.12260                30.9        46.0
Penta-spherical      0.11345   0.88655   0.09566                44.9        73.4    0.08819   0.91181   0.10371                33.2        64.4
Rational quadratic   0.07561   0.92439   0.10600                44.3        59.6    0.07642   0.92358   0.12476                31.7        47.2
Bessel function      0.09468   0.90532   0.08021                44.4        70.2    0.08561   0.91439   0.09130                32.0        58.2
Gaussian             0.21566   0.78434   0.06706                47.2        74.5    0.18588   0.81412   0.07175                35.0        66.1
Circular             0.16236   0.83764   0.07522                45.1        71.9    0.10731   0.89269   0.07607                33.6        67.6
Figure 7. The two extreme hexagonal tessellations employed in the Murray smelter site sampling design. (A): n = 104. (B): n = 2,008.

Figure 8. Infill sampling results for the Murray smelter site data. As is denoted by a solid circle. Pb is denoted by an asterisk (*). (A): n* versus n. (B): n*/n versus predicted values from Equation (10).
reflects a square grid pattern. The frequency distribution of Pb concentration in this geographic landscape is approximately log-normally distributed. The 236 soil sample locations were used to generate a Thiessen polygon surface partitioning in order to measure spatial autocorrelation. The Moran Coefficient (MC) based on the log-transformed measures is 0.48873 (test statistic: z = 12.7), which is significant and indicates that a moderate tendency exists for similar values of log-Pb concentration measures to be in nearby sample locations. The K-Bessel semivariogram model was found to furnish the best description of spatial autocorrelation latent in these data. (See Thayer et al. 2003.)

Again for infill sampling assessment, sample size was sequentially increased from 20 to roughly 2,000 by increments of approximately 100. The hexagonal tessellations for samples of size n = 20 and n = 1,997 appear in Figure 9.
Estimates of the K-Bessel function semivariogram model parameters and the discount factor prorating coefficients for the skeet shoot data appear in Table 5. Spatial autocorrelation latent in these Pb data is much stronger than that latent in the Murray site data (see Table 4), a feature reflected in the smaller discount factor. Furthermore, the goodness-of-fit for the discount factor Equation (10) implies just as close a tracking of the data as is found for the Murray site.

Figure 9. Two extreme hexagonal tessellations employed in the skeet shooting superfund site sampling design. (A): n = 101. (B): n = 1,997.

Results for a Spatial Filter Model Specification
Spatial filtering techniques (Getis 1990, 1995; Griffith 2000, 2003; Borcard and Legendre 2002; Getis and Griffith 2002) allow spatial analysts to employ traditional regression techniques while insuring that regression residuals behave according to the traditional model assumption of no spatial autocorrelation in these residuals. One spatial filtering method exploits an eigenfunction decomposition associated with the MC. A spatial filter is constructed from the eigenfunctions of a modified geographic weights matrix that depicts the configuration of areal units in the MC and is used to capture the covariation among attribute values of one or more georeferenced random variables. The simplest version of this weights matrix is denoted by the binary 0-1 matrix C. Spatial filtering uses such geographic configuration information to partition georeferenced data into a synthetic spatial variate containing the spatial autocorrelation and a synthetic aspatial attribute that is free of spatial autocorrelation.

The preceding spatial autoregressive and geostatistical models are nonlinear in form; a spatial filter model is linear in form. In addition, the eigenvectors used to construct the aforementioned spatial filter come from the following modified version of matrix C found in the numerator of a MC:

$$ (I - 11^T/n)\,C\,(I - 11^T/n), $$

where $(I - 11^T/n)$ is the projection matrix commonly found in conventional multivariate and regression analysis that centers the n x 1 vector of attribute values. The eigenvectors of this modified form of matrix C are both orthogonal and uncorrelated. Consequently, the sampling variability of the mean of some attribute variable Y is given by the standard result, σ²/n. But when
Table 4. Coefficient estimates for Equation (10), and the resulting spatial autocorrelation discount factor, by semivariogram model specification, for the Murray smelter site

                     Arsenic (As)                                  Lead (Pb)
Model                b         c        d       discount factor    b        c        d       discount factor
Spherical             68.908    6.233   0.223   0.06942            67.902    5.004   0.223   0.06010
Exponential           40.567   -0.799   0.163   0.03742            40.722   -1.941   0.194   0.03549
Stable                39.583   -1.467   0.190   0.03830            32.623   -6.090   0.161   0.02251
Penta-spherical       59.645    4.916   0.207   0.06355            56.928    3.030   0.207   0.05195
Rational quadratic    43.906    0.010   0.191   0.04242            38.716   -4.464   0.169   0.02442
Bessel function       46.542    1.440   0.171   0.04536            44.739   -0.593   0.180   0.03696
Gaussian             120.588   10.568   0.268   0.08332            66.168    4.581   0.218   0.05733
Circular              81.921    7.365   0.243   0.07296            87.816    7.172   0.250   0.06822
Pseudo-R²            0.9969                                        0.9980
Table 5. Coefficient estimates for Equation (10), and the resulting spatial autocorrelation discount factor, by semivariogram model specification for lead (Pb), for the skeet shooting superfund site

                      Semivariogram model coefficients       Infill sampling prorating coefficients
Model                 C0     C1       Practical range        b         c        d       discount factor
Bessel function       0†     1.0706   0.1916                 131.122   -0.101   0.375   0.02686
Goodness-of-fit       100 x RESS = 17.5                      Pseudo-R² = 0.9989

†The estimated value of C0 is 0.1284, which was not significantly different from 0. For simplicity, then, C0 was set to 0.
spatial autocorrelation is overlooked, then $\hat{\sigma}_Y^2$ is inflated, as has been shown for autoregressive and geostatistical model specifications. This VIF is given by the standard regression result of $1/(1 - R^2)$, where R is the multiple correlation coefficient for attribute variable Y regressed on H eigenvectors contained in a spatial filter, yielding $\hat{\sigma}_Y^2 = \hat{\sigma}_e^2/(1 - R^2)$. In other words, the effective sample size for this linear model is given by

$$ n^* = (1 - R^2)\,n, \quad (11) $$

which produces a linear plot for n* versus R². Of note is that the degrees of freedom adjustment, n-H-1, occurs in the numerator of $\hat{\sigma}_e^2$ only in the common case of the regression parameters being unknown, highlighting that n* truly is not a function of this adjustment.
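Equation (11) is simple enough to verify directly. The sketch below, offered only as an arithmetic check, applies it to the n and R² values printed in Table 6 and compares the results with the n* values printed there; the function name and the row labels are hypothetical.

```python
def spatial_filter_effective_n(n, r_squared):
    """Equation (11): n* = (1 - R^2) * n."""
    return (1.0 - r_squared) * n

# (label, n, R^2, tabulated n*) taken from the Max R^2 rows of Table 6.
table6_rows = [
    ("Puerto Rico DEM mean elevation", 73, 0.7487, 18.34),
    ("Puerto Rico DEM elevation standard deviation", 73, 0.5876, 30.11),
    ("Murray smelter site arsenic (As)", 253, 0.4520, 138.64),
    ("Murray smelter site lead (Pb)", 253, 0.3684, 159.79),
    ("High Peak biomass", 900, 0.9698, 27.18),
]

computed_n_star = []
for label, n, r2, tabulated in table6_rows:
    value = spatial_filter_effective_n(n, r2)
    computed_n_star.append(value)
    print(f"{label}: n* = {value:.2f} (table: {tabulated})")
```

Each recomputed value matches the tabulated n* to two decimal places, up to rounding of the printed R².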
If zero spatial autocorrelation is contained in variable Y, then R = 0, and n* = n; as R → 1, n* goes to one, with an upper limit of R² = (n - 1)/n; only n - 1 eigenvectors are available for constructing a spatial filter, since one eigenvector is proportional to vector 1, which is the vector for the linear regression intercept term.

Spatial filter-based values of n* are reported in Table 6. Comparing the n* values for As and Pb in Tables 1 and 6 indicates that spatial filtering produces a more modest adjustment when converting sample size to effective sample size in the presence of non-zero spatial autocorrelation. Results tend to be very similar when either an R² maximization or a residual MC minimization criterion is used. In cases where the residual spatial autocorrelation fails to become trivial (e.g., High Peak biomass and Puerto Rico mean elevation, in Table 6), the threshold value of MC could be reduced, prominent negative spatial autocorrelation eigenvectors could become candidates, the stepwise regression inclusion probability could be altered, and/or geographic contiguity could be redefined (e.g., a "queen's" connection criterion could be substituted for a "rook's" connection criterion). Of note is that the average univariate semivariogram values are closer to those values reported in Table 1A.
Implications: Model-Informed Geographic Sampling

Results reviewed and new findings reported in this article are of great importance to geographers interested in collecting data and provide necessary input for model-informed geographic sampling designs. During the planning stage of a study, a quantitative spatial researcher may naively compute the following when deciding on a sample size for a given power in order to estimate the regional mean, μ, of attribute variable Y (Ott 1988, 147):

$$ n = \frac{\hat{\sigma}_Y^2\,(Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2}, $$
Table 6. Selected spatial filter (α = 0.10 for inclusion) features used to compute n* for three particular geographic landscape examples

Geographic landscape   Variable                           Eigenvector (MC > 0.25)   K     R²       Residual z_MC   Spatial filter MC   n     n*
                                                          selection criterion
Puerto Rico            DEM mean elevation                 Max R²                     11   0.7487    2.7356         0.6155               73    18.34
                                                          Min MC                     15   0.7849    2.2617         0.6753               73    15.70
Puerto Rico            DEM elevation standard deviation   Max R²                      9   0.5876    0.7685         0.6316               73    30.11
                                                          Min MC                     12   0.6230    0.4924         0.6856               73    27.52
Murray smelter site    arsenic (As)                       Max R²                     19   0.4520   -0.5589         0.7601              253   138.64
                                                          Min MC                     19   0.4317    0.0001         0.8306              253   143.78
Murray smelter site    lead (Pb)                          Max R²                     13   0.3684   -0.1687         0.7331              253   159.79
                                                          Min MC                     15   0.3642    0.0021         0.7989              253   160.86
High Peak              biomass                            Max R²                    191   0.9698    8.4710         0.8882              900    27.18
                                                          Min MC                    214   0.9717    8.0830         0.9062              900    25.47
where $Z_{1-\alpha/2}$ represents the Type I error (i.e., rejecting the null hypothesis when it is true) probability for a two-tailed test, $Z_{1-\beta}$ represents the Type II error probability, and $\Delta = |\mu - \mu_0|$, with $\mu_0$ and $\mu$, respectively, denoting the null and the alternate hypothesis means. The value of n rendered by this computation seeks to allow a predetermined, desired level of statistical precision to be obtained for an analysis.

All three of the preceding n* results are helpful with increasing domain sampling (i.e., the size of a geographic region is expanded in order to increase the number of areal units). Accordingly, from Equations (3), (7) and (11),

spatial autoregression:

$$ n = \frac{\hat{\sigma}_e^2\,(Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2} \times \frac{1}{1 - \frac{1}{1 - e^{-1.92349}}\left(1 - e^{-2.12373\hat{\rho} + 0.20024\sqrt{\hat{\rho}}}\right)}, \quad (12A) $$

geostatistics:

$$ n = \frac{(C_0 + C_1)\,(Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2} \times \left(1 + b e^{-c \times d_{NN1}/d_{max} - d \times n_r d_{NN1}/d_{max}}\right), \quad (12B) $$

and spatial filtering:

$$ n = \frac{\hat{\sigma}_e^2\,(Z_{1-\alpha/2} + Z_{1-\beta})^2}{\Delta^2} \times \frac{1}{1 - R^2}. \quad (12C) $$

For the High Peak biomass example data, $\hat{\sigma}_Y^2 = 0.26701^2$, from the SAR model analysis $\hat{\sigma}_e^2 = 0.07424^2$, from the K-Bessel function semivariogram model analysis $\hat{\sigma}^2 = 0.09234^2$, and from the spatial filtering analysis $\hat{\sigma}_e^2 = 0.02432^2$. If the mean of the normally distributed transformed biomass index is hypothesized to be 2, the maximum discrepancy to be detected is 0.1 (Δ), a two-tailed hypothesis test with a 5 percent level of significance is to be employed (i.e., $Z_{1-\alpha/2} = 1.95996$), and statistical power is set to 0.9 (i.e., $Z_{1-\beta} = 1.28115$), then rather than n = 75 (i.e., approximately a 9 x 9 image) when spatial autocorrelation is overlooked, the SAR model results suggest that n = 1,236, the K-Bessel semivariogram model results suggest that n = 2,382, and the spatial filter model results suggest that n = 963. In other words, rather than the 30 x 30 pixels image being more than adequate in size, it needs to be expanded to at least a 35 x 35 image for a SAR analysis, a 49 x 49 image for a geostatistical analysis, and a 31 x 31 image for a spatial filter analysis.

Figure 10. (A): soil sample locations, denoted by crosses (x), and one of the hexagonal tessellations together with its centroids, denoted by solid circles, for the Syracuse, NY, study. (B): scatterplots of relative n* versus n/n_max, denoted by solid circles, and relative standardized nearest neighbor distance versus n/n_max, denoted by solid squares, for the Syracuse, NY, study.

If a spatial researcher wants to estimate an attribute mean for a particular geographic region, with a specific degree of precision (i.e., a specific confidence interval), then power becomes irrelevant (Ott 1988, 131) and infill sampling becomes relevant. Consider a current project whose goal is to determine soil Pb pollution across the City of Syracuse (also see Griffith 2002). Then, based on a pilot study involving 167 sample points (see Figure 10A), transformed attribute values LN(Pb+3), a K-Bessel semivariogram model whose practical range is 0.09392, in standardized distance units, and an equation of the form given by Equation (10),

$$ n = \frac{(C_0 + C_1)\,Z_{1-\alpha/2}^2}{\Delta^2} \times \left[1 + 47.9041\,e^{-(0.0958)(-7.4247) - 70(0.0958)(0.1317)}\right], \quad (13) $$

which contains the following substitutions [see Equation (10)]: $n_r = 70$ [the value of n at (effective) range r for which spatial autocorrelation is negligible], $d_{NN1}/d_{max} = 0.0958$, b = 47.9041, c = -7.4247, and d = 0.1317 (see Figure 10). As was done for the Murray data, estimates of coefficients b, c, and d were obtained here by estimating n* for increasingly denser hexagonal tessellations. Because power is not of interest here, $Z_{1-\beta}$ disappears from the formula. Because spatial autocorrelation increases with increasing infill sampling, and because diminishing returns for obtaining new, nonredundant information are encountered as n increases, with the limiting case (n goes to ∞) approaching no new information with each additional sample selection, more and more samples have to be taken in order to acquire less and less new information. A reasonable value of Δ for the Syracuse study would be 0.25, indicating a confidence interval of μ̂ ± 0.25. For α = 0.05 and K-Bessel function semivariogram coefficient estimates of C1 = 0.98432 and C0 = 0, Equation (13) indicates that achieving this level of precision would require a sample size of 2,501. Given that the mean of the pilot study sample is roughly 4.5, this level of precision would allow a researcher to estimate the log-population mean to within roughly ± 6% of its actual value. Of note is that the RESS for the semivariogram model is 0.416, and the description of the experiment furnishing Equation (10) coefficient estimates is accompanied by a pseudo-R² of approximately 0.996.
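The two sample size figures that can be recomputed from values stated in the text are the naive n = 75 for the High Peak example and the Syracuse n = 2,501 from Equation (13). The sketch below reproduces both as an arithmetic check; the function and parameter names are hypothetical, and the second function is simply Equation (13) with its coefficients passed as arguments.

```python
import math

def naive_sample_size(sigma2, z_alpha, z_beta, delta):
    """Conventional n = sigma^2 (Z_{1-alpha/2} + Z_{1-beta})^2 / Delta^2."""
    return sigma2 * (z_alpha + z_beta) ** 2 / delta ** 2

def infill_sample_size(sill, z_alpha, delta, b, c, d, dnn_ratio, n_ref):
    """Equation (13): conventional n (no power term) times an infill inflation factor."""
    inflation = 1.0 + b * math.exp(-c * dnn_ratio - d * n_ref * dnn_ratio)
    return sill * z_alpha ** 2 / delta ** 2 * inflation

# High Peak biomass: sigma_Y = 0.26701, Delta = 0.1, 5 percent two-tailed test, power 0.9.
n_high_peak = naive_sample_size(0.26701 ** 2, 1.95996, 1.28115, 0.1)

# Syracuse soil Pb: C0 + C1 = 0.98432, Delta = 0.25, coefficients from the text.
n_syracuse = infill_sample_size(0.98432, 1.95996, 0.25,
                                b=47.9041, c=-7.4247, d=0.1317,
                                dnn_ratio=0.0958, n_ref=70)

print(math.ceil(n_high_peak), math.ceil(n_syracuse))
```

Rounded up to whole observations, the two results correspond to the n = 75 and n = 2,501 quoted above.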
Therefore, findings reported in this article furnish a methodology and formulae that enable the computation of appropriate sample sizes for quantitative studies when non-zero spatial autocorrelation is present in georeferenced data. The first step (Step #1) of this methodology involves a pilot study to obtain initial estimates of spatial autocorrelation and variable variance. If a researcher chooses to obtain a variance estimate from the literature, then assuming moderate positive spatial autocorrelation for most variables and extremely strong positive spatial autocorrelation for remotely sensed images would be reasonable, too. The second step (Step #2) involves selection of a spatial model specification to be used in subsequent data analyses. Although this model-informed sampling design approach is somewhat sensitive to the way in which spatial autocorrelation is modeled, all three alternative model specifications indicate that geographic studies require substantially larger sample sizes than are suggested by conventional statistical theory. This particular result is relevant to qualitative sampling, too. The third step (Step #3) is to compute n, superimpose the corresponding hexagonal tessellation over the study area, and then randomly select a single point from within each hexagon. This is the sample to be drawn. A useful post-data-collection diagnostic exercise would be to compare parameter estimates based on the sample data with those used to formulate the sampling design. Of note is that the principal contribution of this article is the development of Equations (12A)-(12C) for calculating n.
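Step #3 can be sketched in code. The following is an illustrative outline only, not the design of Appendix D: it assumes a rectangular study area, pointy-top regular hexagons of a chosen circumradius, whole hexagons along the boundary (partial hexagons are not clipped or regrouped), and rejection sampling to draw one uniformly distributed point per hexagon. All names are hypothetical.

```python
import math
import random

def hexagonal_stratified_sample(width, height, radius, seed=0):
    """Return (center, point) pairs: one uniformly random point per hexagon."""
    rng = random.Random(seed)
    root3 = math.sqrt(3.0)
    dx = root3 * radius   # horizontal spacing between hexagon centers in a row
    dy = 1.5 * radius     # vertical spacing between rows of centers
    pairs = []
    row = 0
    while row * dy <= height:
        offset = 0.5 * dx if row % 2 else 0.0   # odd rows are shifted half a cell
        col = 0
        while offset + col * dx <= width:
            cx, cy = offset + col * dx, row * dy
            while True:
                # Rejection sampling: draw from the bounding box, keep points in the hexagon.
                px = rng.uniform(-0.5 * dx, 0.5 * dx)
                py = rng.uniform(-radius, radius)
                if abs(py) <= radius - abs(px) / root3:
                    break
            pairs.append(((cx, cy), (cx + px, cy + py)))
            col += 1
        row += 1
    return pairs

sample = hexagonal_stratified_sample(1.0, 1.0, 0.1, seed=42)
print(len(sample), "hexagons; first sampled point:", sample[0][1])
```

In practice the hexagon size would be chosen so that the number of hexagons covering the study area equals the computed n.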
Future research needs to address extensions of findings reported here to other than means of normally distributed georeferenced variables. The previously listed UCLA web page furnishes calculators for means of Poisson (which would require the Winsorized auto-Poisson model) and exponential variables, and correlation coefficients (which are touched upon in this article). Meanwhile, the Web page http://www.stat.uiowa.edu/%7Erlenth/Power/index.html furnishes Russ Lenth's calculators for proportions (which would require the auto-binomial model), and analysis of variance (multiple means, which are touched upon in this article; also see Cliff and Ord 1975; Griffith 1978). Future research also needs to outline impacts of spatial autocorrelation explicitly for purposeful samples used in qualitative studies.
Acknowledgements

This material is based on work supported by the National Science Foundation under Grant #BCS-0400559. Execution of computational and GIS work by Matthew Vincent and Marco Millones is gratefully acknowledged. This research was completed while the author was in the Department of Geography and Regional Studies, University of Miami.
AppendixA
The Case of a WeightedAverageof Two Correlated
SampleMeans
Sometimesa spatial scientistmay be simultaneously
interestedin two variables. One consequence of exofthe
tendingEquation (2) resultsto thejointtreatment
pairofmeans,x and y,fortwo attributevariables,X and
Y, is the presenceof two sourcesof redundantinformation:correlationbetweenthe twoattributevariablesand
spatial autocorrelationwithin each attributevariable.
Dutilleul (1993) updates the Clifford,
Richardson,and
H~mon (1989) discussionabout how spatial autocorrelation impacts upon the correlationcoefficient.Extendinghis discussionrevealsthatcovariationalso has a
VIF similarto that appearingin Equation (2), withthis
factorbeing compensatedforby the individualvariable
is computed;thatis,
VIFs when a correlationcoefficient
mustbe containedin the intera correlationcoefficient
val [-1, 1], regardlessof the nature and degree of
present.Constructinga weighted
spatialautocorrelation
Griffith
756
averageofX and Y,say [wX+(1- w)Y] for0 < w < 1,
resultsin the sampling
distribution
varianceofinterest
nentsor factorscores(see R. Johnsonand Wichern
oflargesampleinference),
yields
2002,fora discussion
1Ad d dAd
1
(I~cl
d 0 I) V/2
TR(Ad d
T)Vd/2"r?
(Ad
.dAd)TR[
((Adl @ I)Vd/2)T( ?
1T
I)AI)
beingforwex+ (1 - w)y. IfvariablesX and Y are independent,thenthisstandarderrorreducesto the theoreticalresultof a weightedsum of the variables'two
variances.
In thisbivariatemeanscase, effective
samplesize
becomesa weighted
variables'
averageoftheindividual
effective
samplesizesthatis adjustedforthe attribute
betweenX andY.The generalexpression
correlation
for
n*becomes
+ 2w(1 -w)pxyaxcy
w2au + (1 -
+2(2
1 R(-i
w2?T(V
(12
++ (A2
w2T (X
wu2 X TV-11
+
(11
[w22WTR(Vx1)+
n,
(B1)
I)V/2) 1
(Al)
-w)2yTR(Vy)]
l)
n,'
positivespatialautocorreapproachthe case ofperfect
one. If all butone oftheweights
lation,n* approaches
are zero,thenexpression(B1) reducesto the right-hand
valueproducedby
sideofEquation(2). The numerical
interval
definedby
in
the
is
contained
expression
(B1)
resultsobtainedwith
fortheP individual
theextremes
Equation(2).
0)
0)(ox
Sdd(w
_(wox)
Considerthe 2-meanscase (see AppendixA). Then
Ad d
0
1 - Wery 0
0
Ad d
Appendix B
od
The Case of a LinearCombinationofP > 2
CorrelatedSampleMeans
(Al) to a
Generalizing
Equation(2) and expression
multivariate
situation
P
variable
which
means,
involving
is particularly
relevantto the use of principalcompo-
I)Vd'/2
(
where0 denotesKronecker
Ad is a P x P diproduct,
coeffithelinearcombination
agonalmatrixcontaining
cient ap in diagonalcell Od is aP x P diagonal
P,
deviation
standard
matrix
containing
cYpin diagonalcell
matrixcontaining
Vd is an nP x nP block-diagonal
P,x n inversecovariancestructure
matrixVp'-1 in diagn
correlation
is a P x P attribute
onal block
matrix,
matrix.If Vd = I, thenexand I is anP,n x n identity
pression(B1) reducesto n. As all of the Vp matrices
-1/2
(1 ---2w)2o21TV--1 + 2wu(1- w)pxyJxoy1(V)-1/2V
whereVx and Vy respectively
are the nx n inverse
covariancestructure
matricescontainingthe spatial
autocorrelation
among n observationsfor attribute
variablesX and Y. If Vx = Vy = I, then thisexpression reducesto n. If w = 0, w = 1, or Vx = Vy, then
reducesto theright-hand
thisexpression
sideofEquation (2). In otherwords,the bivariatemeans effectivesamplesizeis a weighted
averageoftheindividual
effective
univariate
samplesizes (i.e., it mustbe containedin the intervaldefinedby them).And as Vx
and Vy approachthe case of perfectpositivespatial
for both attributevariables,n* apautocorrelation
of the twolimiting
effecproachesone. The weighting
tive samplesizes is determined
by both the relative
variancesof X and Y and the weightsused in cona linearcombination
of X and Y and is imstructing
little
the
attribute
correlation,
by
pacted
Pxy,computed
X and Y.
forvariables
d
[Displayed matrix derivation of expression (B1) for the two-variable case: not legible in this copy.]

Therefore, expression (B1) reduces to expression (A1) when only two means are being considered.

Appendix C
An Illustration of Other Possible Extensions: The Case of Bivariate Attribute Correlation Coupled with Positive Spatial Autocorrelation

Clifford, Richardson, and Hémon (1989) and Richardson (1990) use semivariogram modeling to link the correlation coefficient, $r$, to its correct sampling distribution for georeferenced data (also see Haining 2003, §8.2). They develop the notion of effective degrees of freedom. Based on the standard result of $\sigma_r = 1/\sqrt{n-1}$, the standard error of the correlation coefficient $r$, under the null hypothesis of $\rho_{xy} = 0$, the number of observations, $n$, may be prorated to $n^*$ using the formula

$$n^* = 1 + \sigma_r^{-2}, \tag{C1}$$

where $\sigma_r^2$ denotes the variance of the sampling distribution of the correlation coefficient $r$. Equation (C1) indicates that low sampling variability is associated with larger values of $n$ and sizeable sampling variability is associated with smaller values of $n$. Consider two variables containing maximal positive spatial autocorrelation, which manifests itself approximately as a linear data gradient across a geographic landscape. This situation relates to three qualitatively different values of $r_{xy}$, namely, $r_{xy} \approx 1$ (both gradients align), $r_{xy} \approx -1$ (the gradients are in exactly opposite directions), and two cases of $r_{xy} \approx 0$ (the gradients are orthogonal in two different ways). Hence, positive spatial autocorrelation increases $\sigma_r$, resulting in $n^*$ decreasing. Meanwhile, if zero spatial autocorrelation is present, then $n^* = n$; if perfect positive spatial autocorrelation is present, then $n^*$ approaches one. Dutilleul (1993) rewrites Equation (C1) using matrix notation and incorporates impacts of estimating means and variances for a correlation coefficient, an adjustment that is set aside here for simplicity (these types of adjustments also are outlined by Griffith and Zhang 1999). These results are for the case of independent attribute variables (i.e., $\rho_{xy} = 0$).

The new development presented here departs from this earlier work in order to incorporate the entire range of correlation values by beginning with the following standard error of a correlation coefficient:

$$\sigma_r = \frac{1 - \rho_{xy}^2}{\sqrt{n-1}}. \tag{C2}$$
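As a quick numerical check of the proration idea, the following minimal sketch assumes that Equation (C1) reads $n^* = 1 + \sigma_r^{-2}$ and that Equation (C2) reads $\sigma_r = (1 - \rho_{xy}^2)/\sqrt{n-1}$. The function names `se_r` and `effective_n` are hypothetical and are introduced only for illustration. Because (C1) is stated under the null hypothesis $\rho_{xy} = 0$, the sketch applies it only to null-case standard errors.

```python
import math


def se_r(n, rho_xy=0.0):
    """Standard error of r for n independent observations, as in Equation (C2)."""
    return (1.0 - rho_xy ** 2) / math.sqrt(n - 1)


def effective_n(sigma_r):
    """Prorated sample size n* = 1 + sigma_r^(-2), as in Equation (C1)."""
    return 1.0 + sigma_r ** -2


n = 100
sigma_null = se_r(n)  # 1 / sqrt(n - 1) when rho_xy = 0

# With no spatial autocorrelation the standard error is not inflated,
# so the prorated sample size recovers n (up to floating-point rounding).
n_star_independent = effective_n(sigma_null)

# Hypothetical illustration: suppose positive spatial autocorrelation
# triples the standard error of r; n* then drops well below n.
n_star_inflated = effective_n(3.0 * sigma_null)

print(n_star_independent, n_star_inflated)
```

The factor of three is an arbitrary illustrative value, not a result from the paper; it simply shows that a larger $\sigma_r$ translates into a smaller $n^*$.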
Following the developments for sample means and for $r$ under the null hypothesis of $\rho_{xy} = 0$, the logic for deriving Equation (2) suggests

[Display equation not legible in this copy. Its recoverable ingredients are the traces $\mathrm{TR}(\mathbf{V}_x^{-1})$, $\mathrm{TR}(\mathbf{V}_y^{-1})$, $\mathrm{TR}(\mathbf{V}_x^{-1}\mathbf{V}_y^{-1})$, and $\mathrm{TR}[(\mathbf{V}_x^{-1})^2]$, the sample correlation $r_{xy}$, the constant 16.40, and an exponent involving the constants 0.47 and 8.]

where $\mathbf{V}_x^{-1}$ and $\mathbf{V}_y^{-1}$, respectively, are the $n \times n$ spatial autocorrelation covariance structure matrices for attribute variables $X$ and $Y$. This equation was established with a simulation experiment conducted using regular square tessellations forming rectangular regions ranging in size from 3 × 3 to 40 × 40, combinations of SAR autocorrelation parameter values, $\rho_x$ and $\rho_y$, in the set {0.00, 0.25, 0.50, 0.75, 0.99}, $\rho_{xy}$ values in the set {0.0, 0.2, 0.4, 0.6, 0.8, 0.999}, and 10,000 replications. Pseudo-$R^2$ values for the three special theoretical cases that can be checked are as follows: 0.9984 for $\rho_x = \rho_y = 0$ (and, as a check, 0.9957 for the analytical results); 0.9874 for $\rho_{xy} = 0$; and 0.9814 for $\rho_x = \rho_y \approx 1$. The overall pseudo-$R^2$ value is 0.9954, with the performance of this equation improving asymptotically. In addition, it has been validated using the irregular Thiessen polygon surface partitionings for the 127 Texas groundwater sample locations and the 124 Kansas oil well locations, the 73 municipios of Puerto Rico, and the 513 Minnesota forest stands reported by Griffith and Layne (1999). These four irregular surface partitioning data sets were supplemented with the Murray smelter Thiessen polygons data. The pseudo-$R^2$ for results obtained by reexecuting the simulation experiment using the geographic configuration matrices for these five empirical geographic landscapes and the coefficients obtained with the preceding simulation experiment is 0.9995.

Therefore, following the same logic used to establish Equation (2), as well as employing the common statistical practice of using $r$ to estimate its own standard deviation,

[Display equation (C3), giving the estimate of $n^*$, is not legible in this copy. It is built from the same trace terms, the constant 16.40, $1 - r_{xy}^2$, $n - 1$, and the same exponent as the preceding display.] (C3)

If the $n$ observations contain zero spatial autocorrelation, then $\mathbf{V}_x^{-1} = \mathbf{V}_y^{-1} = \mathbf{I}$ and Equation (C3) becomes $1 + (n - 1) = n$; if the $n$ observations contain perfect positive spatial autocorrelation, then, conceptually, $\mathbf{V}_x^{-1} = \mathbf{V}_y^{-1} \approx \mathbf{1}\mathbf{1}^{\mathrm{T}}$ and hence Equation (C3) asymptotically converges on 2 for $\rho_{xy} = 1$ (calculation of a correlation coefficient requires at least 2 observations) and equals roughly 5 for $\rho_{xy} = 0$. This latter result coupled with the aforementioned pseudo-$R^2$ values suggests the need for further refinement of Equations (C2) and (C3).

If $\rho_{xy} = 0$, Equation (C3) differs from the one reported by Clifford, Richardson, and Hémon (1989) and Dutilleul (1993) by the multiplicative factor

[Display expression not legible in this copy. Its recoverable ingredients are the ratio $n/(n-1)$, the same trace terms, the constant 16.40, and an exponent involving the constants 0.47 and 8.]

This adjustment factor may relate to the use of sample statistics, rather than population parameter values, in calculations, a modification Dutilleul introduces.

Equation (C3) links to the spatial autoregressive model specifications through the inverse covariance matrix, rather than the covariance matrix itself (see, e.g., Haining 1991). Because it is specified in terms of SAR spatial autocorrelation parameters, it will need to have a semivariogram counterpart formulated, which presumably needs to be specified in terms of the range parameter.
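The trace quantities that Equation (C3) is built from can be computed directly for a small regular square tessellation. The sketch below is illustrative only and rests on assumptions that this appendix does not state: a row-standardized rook-adjacency matrix $\mathbf{W}$, and an SAR covariance structure of the form $\mathbf{V}^{-1} = [(\mathbf{I} - \rho\mathbf{W})^{\mathrm{T}}(\mathbf{I} - \rho\mathbf{W})]^{-1}$. All function names are hypothetical, and the sketch does not evaluate Equation (C3) itself.

```python
def rook_weights(p, q):
    """Row-standardized rook-adjacency matrix for a p-by-q square lattice (assumed form of W)."""
    n = p * q
    w = [[0.0] * n for _ in range(n)]
    for i in range(p):
        for j in range(q):
            nbrs = [(i + di, j + dj)
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= i + di < p and 0 <= j + dj < q]
            for ii, jj in nbrs:
                w[i * q + j][ii * q + jj] = 1.0 / len(nbrs)
    return w


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting (adequate for small lattices)."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        d = aug[c][c]
        aug[c] = [x / d for x in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def sar_structure(w, rho):
    """Assumed SAR covariance structure: inverse of (I - rho W)^T (I - rho W)."""
    n = len(w)
    a = [[(1.0 if i == j else 0.0) - rho * w[i][j] for j in range(n)]
         for i in range(n)]
    return inverse(matmul(transpose(a), a))


def trace(m):
    return sum(m[i][i] for i in range(len(m)))


w = rook_weights(3, 3)             # smallest lattice used in the simulation design
v_x = sar_structure(w, 0.50)       # rho_x = 0.50
v_y = sar_structure(w, 0.25)       # rho_y = 0.25
identity_case = sar_structure(w, 0.0)

traces = {
    "TR(Vx^-1)": trace(v_x),
    "TR(Vy^-1)": trace(v_y),
    "TR(Vx^-1 Vy^-1)": trace(matmul(v_x, v_y)),
    "TR[(Vx^-1)^2]": trace(matmul(v_x, v_x)),
}
print(traces)
```

With $\rho = 0$ the structure matrix is the identity, so every trace equals $n$, which is the case in which Equation (C3) is said to reduce to $n$.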
Appendix D
The Hexagonal Tessellation Stratified Sampling Design

Because a regular hexagonal tessellation is employed, the radius, $r_h$, of a desired hexagon can be approximated by calculating the quantity [expression not legible in this copy], where "area" denotes the area of the landscape to be sampled. The value produced by this computation actually is an upper limit for the desired radius. Next, starting with an arbitrary (0, 0) point positioned outside of the study landscape, a grid of hexagon centroid coordinates can be generated with the formula $(3ur_h, \sqrt{3}\,vr_h/2)$, $u = 0, 1, \ldots, u_{\max}$, and $v = 0, 1, \ldots, v_{\max}$, where the rectangular grid defined by $(u_{\max}, v_{\max})$ extends beyond the study landscape in all directions. Next, the hexagons can be generated with a standard Thiessen polygon program. Both of these last two steps have been implemented in ArcView 3.3 with Avenue scripts.

Sample locations $(u, v)$ within a hexagon can be generated by drawing pairs of independent samples from the uniform distribution $U(0, 1)$; the coordinate pair is being drawn from a unit square, and its joint probability distribution will be a Poisson. First, the $v$-coordinate is selected; then, the $u$-coordinate is selected such that $u \le$ [bound not legible in this copy]. Because the partial hexagon eliminates one-quarter of the unit square, roughly 25 percent of the selected values of $u$ will be rejected. Next, a random selection is made from the set of integers {1, 2, 3, 4}. If the integer 1 is selected, then the coordinate pair $(u, v)$ is retained. If 2 is selected, then the coordinate pair becomes $(-u, v)$, a reflection of the originally selected
$(u, v)$ coordinate along the vertical axis. If 3 is selected, then the coordinate pair becomes $(-u, -v)$, a reflection of the originally selected $(u, v)$ coordinate across the origin. And if 4 is selected, then the coordinate pair becomes $(u, -v)$, a reflection of the originally selected $(u, v)$ coordinate along the horizontal axis. A simulation experiment, involving 10,000 replications and executed using this sampling design, produced Figure D-1A.

For a single sample of size $n$, $n$ sample coordinate pairs $(u, v)$ are drawn according to the preceding protocol. Each coordinate value has to be rescaled by the radius of the hexagon, $r_h$, contained in the tessellation. Then, in turn, one of the sample coordinate pairs is added to each hexagon centroid. The resulting sample furnishes good geographic coverage and allows all points in a landscape that are covered by the hexagonal tessellation to have equal probability of selection. One sample of this type is portrayed in Figure D-1B.
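The within-hexagon sampling protocol can be sketched as follows. Several details are assumptions rather than statements from the text: the acceptance bound is taken to be $u \le 1 - v/2$, which removes exactly one-quarter of the unit square; the $v$-coordinate is rescaled by $(\sqrt{3}/2)\,r_h$ so that the accepted region maps onto a regular hexagon; and the radius is taken from the area identity for a regular hexagon, $(3\sqrt{3}/2)\,r_h^2$, applied to $n$ hexagons. The function names are hypothetical.

```python
import math
import random

SQRT3 = math.sqrt(3.0)


def hexagon_radius(area, n_hexagons):
    """ASSUMED radius: n regular hexagons of area (3*sqrt(3)/2)*r^2 tile `area`."""
    return math.sqrt(2.0 * area / (3.0 * SQRT3 * n_hexagons))


def draw_unit_hexagon_point(rng):
    """Draw one (u, v) in the base hexagon; also return the number of rejected draws."""
    rejected = 0
    while True:
        v = rng.random()          # v-coordinate first, from U(0, 1)
        u = rng.random()          # then the u-coordinate
        if u <= 1.0 - v / 2.0:    # ASSUMED acceptance bound (quarter hexagon)
            break
        rejected += 1             # the whole pair is redrawn on rejection
    k = rng.randint(1, 4)         # choose one of the four reflections
    if k == 2:
        u = -u                    # reflect along the vertical axis
    elif k == 3:
        u, v = -u, -v             # reflect across the origin
    elif k == 4:
        v = -v                    # reflect along the horizontal axis
    return u, v, rejected


def to_landscape(u, v, centroid, r_h):
    """Rescale by the hexagon radius and shift to a hexagon centroid (ASSUMED v scaling)."""
    return centroid[0] + r_h * u, centroid[1] + r_h * (SQRT3 / 2.0) * v


def inside_hexagon(x, y, centroid, r_h, tol=1e-12):
    """Point-in-hexagon test for a regular hexagon with vertices at (+/- r_h, 0)."""
    dx, dy = abs(x - centroid[0]), abs(y - centroid[1])
    return dy <= SQRT3 / 2.0 * r_h + tol and dx + dy / SQRT3 <= r_h + tol


rng = random.Random(2005)                         # fixed seed for repeatability
r_h = hexagon_radius(area=100.0, n_hexagons=104)  # illustrative area value
centroid = (10.0, 20.0)                           # illustrative centroid
draws = [draw_unit_hexagon_point(rng) for _ in range(10000)]
points = [to_landscape(u, v, centroid, r_h) for u, v, _ in draws]
total_rejected = sum(rej for _, _, rej in draws)
rejection_rate = total_rejected / (len(draws) + total_rejected)
print(round(rejection_rate, 3))
```

One design choice deserves mention: the sketch redraws the full $(u, v)$ pair after a rejection, which keeps accepted points uniform over the quarter hexagon. The text speaks only of rejecting values of $u$, so holding $v$ fixed while redrawing $u$ is another possible reading, although that variant would not give every location equal probability of selection.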
Notes

1. The notion of effective degrees of freedom dates to Satterthwaite (1946), whose adjustment for the two-sample difference of means test when population variances are unequal, for example, is popularized in many introductory statistics books.
2. This concept also appears in the time series literature (e.g., see Dawdy and Matalas 1964). Additional insight may be found in papers by Box (1954a, b).
3. Frequently, $\mathbf{C}$ is an $n \times n$ binary matrix whose entries are $c_{ij} = 1$ if areal units $i$ and $j$ are neighbors, and $c_{ij} = 0$ otherwise.
4. The estimation equation is [expression not legible in this copy; its recoverable pieces are exponential terms with the coefficients 0.13867 and 3.46687 and the distances $d_{NN}$ and $d_{\max}$].
Figure D-1. Example hexagonal tessellation random sampling outcomes. (A): 10,000 selections from the unit square that have been converted to the base sampling hexagon. (B): a single tessellation stratified random sample, denoted by crosses (×), drawn for the case of n = 104 hexagons, whose centroids are denoted by solid circles (•), covering the Murray smelter site.

References

Anselin, L. 1988. Spatial econometrics: Methods and models. Dordrecht, the Netherlands: Martinus Nijhoff.
Arbia, G., D. Griffith, and R. Haining. 1998. Error propagation modelling in raster GIS: Overlay operations. International Journal of Geographical Information Systems 12:145-67.
---. 1999. Error propagation modelling in raster GIS: Addition and ratioing operations. Cartography & Geographic Information Systems 26:297-315.
Bailey, T., and A. Gatrell. 1995. Interactive spatial data analysis. London: Longman.
Borcard, D., and P. Legendre. 2002. All-scale spatial analysis of ecological data by means of principal coordinates of neighbour matrices. Ecological Modelling 153:51-68.
Box, G. 1954a. Some theorems on quadratic forms applied in the study of analysis of variance problems. I. Effect of inequality of variance in the one-way classification. Annals of Mathematical Statistics 25:290-302.
---. 1954b. Some theorems on quadratic forms applied in the study of analysis of variance problems. II. Effects of inequality of variance and of correlation between errors in the two-way classification. Annals of Mathematical Statistics 25:484-98.
Brus, D., and J. de Gruijter. 1993. Design-based versus model-based estimates of spatial means: Theory and application in environmental science. Environmetrics 4:123-52.
Cliff, A., and J. Ord. 1975. The comparison of means when samples consist of spatially autocorrelated observations. Environment and Planning A 7:725-34.
---. 1981. Spatial processes. London: Pion.
Clifford, P., S. Richardson, and D. Hémon. 1989. Assessing the significance of the correlation between two spatial processes. Biometrics 45:123-34.
Cressie, N. 1991. Statistics for spatial data. New York: Wiley.
Dawdy, D., and N. Matalas. 1964. Statistical and probability analysis of hydrologic data, Part III: Analysis of variance, covariance and time series. In Handbook of applied hydrology, A compendium of water-resources technology, ed. V. Chow, 8.68-8.90. New York: McGraw-Hill.
Diggle, P., and S. Lophaven. 2004. Bayesian geostatistical design, Working Paper #42. Baltimore: Department of Biostatistics, Johns Hopkins University.
Dutilleul, P. 1993. Modifying the t test for assessing the correlation between two spatial processes. Biometrics 49:305-14.
Englund, E., and A. Sparks. 1991. Geostatistical Environmental Assessment Software: User's Guide. Las Vegas, NV: Environmental Monitoring Systems Laboratory, U.S. EPA.
Flores, L., L. Martinez, and C. Ferrer. 2003. Systematic sample design for estimation of spatial means. Environmetrics 14:45-61.
Getis, A. 1990. Screening for spatial dependence in regression analysis. Papers of the Regional Science Association 69:69-81.
---. 1995. Spatial filtering in a regression framework: Experiments on regional inequality, government expenditures, and urban crime. In New directions in spatial econometrics, ed. L. Anselin and R. Florax, 172-88. Berlin: Springer.
Getis, A., and D. Griffith. 2002. Comparative spatial filtering in regression analysis. Geographical Analysis 34:130-40.
Getis, A., and J. Ord. 2000. Seemingly independent tests: Addressing the problem of multiple simultaneous and dependent tests. Paper presented at the 39th Annual Meeting of the Western Regional Science Association, Kauai, HI, 28 February.
Griffith, D. 1978. A spatially adjusted ANOVA model. Geographical Analysis 10:296-301.
---. 1988. Advanced spatial statistics. Dordrecht, the Netherlands: Martinus Nijhoff.
---. 1992. What is spatial autocorrelation? Reflections on the past 25 years of spatial statistics. l'Espace Géographique 21:265-80.
---. 2000. A linear regression solution to the spatial autocorrelation problem. Journal of Geographical Systems 2:141-56.
---. 2002. The geographic distribution of soil-lead concentration: Description and concerns. URISA Journal 14:5-15.
---. 2003. Spatial autocorrelation and spatial filtering: Gaining understanding through theory and scientific visualization. Berlin: Springer-Verlag.
Griffith, D., and F. Csillag. 1993. Exploring relationships between semi-variogram and spatial autoregressive models. Papers in Regional Science 72:283-96.
Griffith, D., and L. Layne. 1997. Uncovering relationships between geo-statistical and spatial autoregressive models. In the 1996 Proceedings of the Section on Statistics and the Environment, 91-96. Washington, DC: American Statistical Association.
---. 1999. A casebook for spatial statistical data analysis: A compilation of analyses of different thematic data sets. New York: Oxford University Press.
Griffith, D., D. Wong, and T. T. Whitfield. 2003. Exploring relationships between the global and regional measures of spatial autocorrelation. Journal of Regional Science 43:683-710.
Griffith, D., and Z. Zhang. 1999. Computational simplifications needed for efficient implementation of spatial statistical techniques in a GIS. Journal of Geographic Information Science 5:97-105.
Haining, R. 1990. Spatial data analysis in the social and environmental sciences. Cambridge, U.K.: Cambridge University Press.
---. 1991. Bivariate correlation and spatial data. Geographical Analysis 23:210-27.
---. 2003. Spatial data analysis: Theory and practice. Cambridge, U.K.: Cambridge University Press.
Hall, P. 1989. Antithetic resampling for the bootstrap. Biometrika 76:713-24.
Johnson, K., J. ver Hoef, K. Krivoruchko, and N. Lucas. 2001. Using ArcGIS Geostatistical Analyst. Redlands, CA: ESRI.
Johnson, R., and D. Wichern. 2002. Applied multivariate statistical analysis, 5th ed. Upper Saddle River, NJ: Prentice Hall.
Lahiri, S. 1996. On inconsistency of estimators under infill asymptotics for spatial data. Sankhya, Series A 58:403-17.
---. 2003. Central Limit Theorems for weighted sums under some stochastic and fixed spatial sampling designs. Sankhya, Series A 65:356-88.
Levy, P., and S. Lemeshow. 1991. Sampling of populations: Methods and applications. New York: Wiley.
Marshall, C., and G. Rossman. 1999. Designing qualitative research, 3rd ed. Thousand Oaks, CA: Sage.
Martin, R. 2001. Comparing and contrasting some environmental and experimental design problems. Environmetrics 12:303-17.
Müller, W. 2001. Collecting spatial data. Heidelberg: Springer-Verlag.
Ott, L. 1988. An introduction to statistical methods and data analysis, 3rd ed. Boston: Duxbury Press.
Potthoff, R., M. Woodbury, and K. Manton. 1992. "Equivalent sample size" and "equivalent degrees of freedom" refinements for inference using survey weights under superpopulation models. Journal of the American Statistical Association 87:383-96.
R Documentation. http://www.maths.lth.se/help/R/.R/library/html/effectiveSize.html (last accessed 10 October 2003).
Richardson, S. 1990. Some remarks on the testing of association between spatial processes. In Spatial statistics: Past, present, and future, ed. D. Griffith, 277-309. Ann Arbor, MI: Institute of Mathematical Geography.
Satterthwaite, F. 1946. An approximate distribution of estimates of variance components. Biometrics Bulletin 2:110-14.
Stehman, S., and W. Overton. 1996. Spatial sampling. In Practical handbook of spatial statistics, ed. S. Arlinghaus, 31-63. Boca Raton, FL: CRC Press.
Thayer, W., D. Griffith, P. Goodrum, G. Diamond, and J. Hassett. 2003. Application of geostatistics to risk assessment. Risk Analysis: An International Journal 23:945-60.
Tietjen, G. 1986. A topical dictionary of statistics. New York: Chapman and Hall.

Correspondence: School of Social Sciences, University of Texas at Dallas, P.O. Box 830688, GR31, Richardson, TX 75083-0688, e-mail: [email protected].