Why Least Squares and Maximum Entropy? An Axiomatic Approach to Inference for Linear Inverse Problems Author(s): Imre Csiszar Source: The Annals of Statistics, Vol. 19, No. 4 (Dec., 1991), pp. 2032-2066 Published by: Institute of Mathematical Statistics Stable URL: http://www.jstor.org/stable/2241918 . Accessed: 03/07/2013 17:53 Your use of the JSTOR archive indicates your acceptance of the Terms & Conditions of Use, available at . http://www.jstor.org/page/info/about/policies/terms.jsp . JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms of scholarship. For more information about JSTOR, please contact [email protected]. . Institute of Mathematical Statistics is collaborating with JSTOR to digitize, preserve and extend access to The Annals of Statistics. http://www.jstor.org This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions The Annals ofStatistics 1991,Vol. 19, No. 4, 2032-2066 WHY LEAST SQUARES AND MAXIMUM ENTROPY? AN AXIOMATIC APPROACH TO INFERENCE FOR LINEAR INVERSE PROBLEMS' BY IMRECSISZAR Institute Mathematical oftheHungarianAcademyofSciences An attemptis made to determinethe logicallyconsistentrules for when selectinga vectorfromanyfeasibleset definedbylinearconstraints, or thosewithpositivecomponents or the probability eitherall n-vectors ifand onlyifthe Somebasicpostulatesare satisfied vectorsare permissible. selectionruleis to minimizea certainfunction which,ifa "priorguess" is available,is a measureof distancefromthe priorguess. Two further naturalpostulatesrestrictthe permissibledistancesto the author's fAs corollaries,axand Bregman'sdivergences, divergences respectively. iomaticcharacterizations of the methodsof least squares and minimum thelatterare also information are arrivedat. Alternatively, discrimination As a specialcase,a characterized bya postulateofcomposition consistency. froma smallset ofnatural derivation ofthe methodofmaximumentropy axiomsis obtained. 1. Introduction. A frequently occurring problemin statisticsand applied frominsufficient is thata function has to be inferred information mathematics that specifiesonlya feasibleset of functions. Problemsofthiskindare often ofa signalor calledinverseproblems.Typicalexamplesare thereconstruction and the assignmentof a image fromthe resultsof certainmeasurements, that is, probability densityor mass functionsubjectto momentconstraints, thatspecify theexpectations ofcertainfunctions oftheunderlying constraints randomvariable.Oftenthe practicalsolutionto such problemsis to selectan elementofthefeasiblesetbya moreor less ad hocrule,usuallybyminimizing somefunctional such as the L2-normor negativeentropy.If somefunction is specifiedas a "priorguess," it is naturalto minimizea measureof distance fromthelatter,mostoftenthe L2-distance, densityor mass or,forprobability Kullback'sI-divergence fordiscrimination functions, (also calledinformation or cross-entropy). In thispaperwe addressthe questionofwhatselectionrulesare "good" in this framework and, in particular,whetherthe mentionedstandardones are indeedthe "best." Unfortunately, meaningto it is hardto givea mathematical thisquestion.It does not seempossibleto definea generalcriterion by which December Received 1990. May1989;revised ofTokyo,supported by 'This researchwas startedduringtheauthor'svisitat theUniversity forthePromotion ofScience. theJapanSociety Its completion wassupported bytheHungarian National ScienceFoundation, Scientific Research Grant1806. AMS 1980subject classifications. Primary 62A99;secondary 68T01,94A17,92C55. andphrases. linearconstraints, logically consistent inference, Keywords Imagereconstruction, minimum discrimination nonlinear distance, selection rule. information, projection, nonsymmetric 2032 This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2033 the goodness of selectionrules could be compared.Still, whereas people apparently feellittleneedforanyspecialjustification ofleastsquares(L2-norm variousreasonshave been put forwardto justifyI-divergence minimization), intostatisticsby Kullback(1959) as the methodof minimization [introduced minimumdiscrimination and entropymaximization. The recent information] widespreadapplicationsof "maximumentropy"have been pioneeredto a greatextentbyJaynes[forhis viewscf.Jaynes(1982)].The presentauthorhas arguedelsewhere[Csisza6r (1985)] that the conditionallimittheoremsofVan Campenhoutand Cover(1981) and Csiszar (1984) suggestthe interpretation thatthe minimumI-divergence distribution "updating"ofa priorprobability is a limitingformofBayesianupdating. to meetmomentconstraints Here we adoptan axiomaticapproachand considerthoseselectionrulesas in the sense of "good" thatlead to a logicallyconsistentmethodofinference, not natural The term inference is meant in a some satisfying postulates. even statisticalsense. Indeed, our considerationswill be nonprobabilistic, distributions. The relation thoughtheobjectsto be inferred maybe probability ofthisworkto previousaxiomaticapproacheswillbe discussedat the end of thissection. As a typicalexample,we briefly sketcha modelofthe imagereconstruction and in variousotherfields problemthat occursin computerized tomography [cf.,e.g.,Hermanand Lent (1976) or Censor(1983)].An imageis represented functionf definedon some domain.The availableinforby a positive-valued mation consistsin the measuredvalues of some linear functionalsR if, i = 1,.. ., k. In X-raytomography, f is the unknownX-rayattenuationfunctionand Ri f is itsintegralalongthepathofthe i's ray.Now,thedomainof f is partitioned intoa finitenumberofpictureelements,calledpixels,numbered in some way from1 to n. Assumingthat f is nearlyconstantwithineach pixel,we can write n (1.1) f=Ev j=1 fi, where fj is the indicatorfunctionof the jth pixel.Then,settingaij and bi = Ri f, we have (1.2) = Ri fj n , aijvj j=1 = bi, i = 1,...,k. Thus the unknownfunctionf is represented by the vectorv = (v1,.. ., vn)T and the feasibleset is identified withthe set of thosevectorsv withpositive thatsatisfythe linearconstraints components (1.2). The reconstruction problemis to selecta "suitable"elementofthisfeasibleset (possiblydependingon a "priorguess" of f represented by a vectoru). In thispaper,we concentrate on linearinverseproblemsofform(1.2). The functions extensionofour resultsto the continuouscase, thatis, to inferring definedon some domainand not representable by finite-dimensional vectors, should not be hard but will not be entered.The above example will be This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions I. CSISZAR 2034 repeatedlyused as an illustration,but it should be emphasizedthat our axiomaticapproachis by no means limitedto imagereconstruction. On the other hand, since a generalapproachinevitablyinvolvesidealizations,the solutionsit leads to are not necessarily"best" forspecificpracticalproblems, includingimagereconstruction. Our goal is to determinethe "logicallyconsistent"rules forselectingan elementfromanypossiblefeasibleset.We adopttheidealizedassumptionthat all conceivablelinearconstraints mayoccur;thusthepossiblefeasiblesetsare all thosesubsetsofa basic set S ofpermissible vectorsthatcan be definedby constraintsof form(1.2). Threecases willbe consideredin a parallelmanner namelywhen S consistsofall n-vectors, or ofthosewithpositivecomponents (as in imagereconstruction) or ofthe probability vectorswithpositivecomponents.The choiceofvectorswithpositiveratherthannonnegative components has been preferred in orderto reducetechnicaldifficulties; also, thisensures that a nonnegativequantityis never inferredto be 0 when the available information permitsit to be positive,whichis generallyconsidereddesirable. Our postulateswill be stated and intuitively justifiedin Section2. The resultswillbe statedin Section3 and provedin Section5, usingthelemmasin Section4. The keyresultis Theorem1, namelythat the basic postulatesof "regularity"and "locality"of a selectionrule are satisfiedif and onlyifthe selectionis by minimizing a certainfunction. If a priorguess is available,this functionis a measureof distance(nonsymmetric in general)fromthe prior guess. The subsequenttheoremsshow how certainadditionalpostulatesrestricttheclass offunctions thatmaybe used. Perhapsthemoststriking result in Theorem5. It providesa parallelcharacterization of the methodsof least squares and minimumI-divergence as the onlyones satisfying, in additionto and locality,a postulateof"composition regularity The intuitive consistency." meaningofthispostulateis thatifthe objectofinference is composedoftwo and the availableinformation components, says nothingabout theirinteraction,we shouldinferthatno interaction is present. It shouldbe mentionedthatin practice(1.2) is oftenrelaxedto (1.3) n E aij + ei =bi, j=1 i = ly,...,)k, wheree = (el, ..., ek)' is an errorvector.For the reconstruction problemof positronemissiontomography, Vardi,Sheppand Kaufman(1985) describeda modelequivalentto (1.3) withdata bi that were Poisson randomvariables (countsofdetectedemissionsin "tubes" determined by pairsofdetectors). A vectore as in (1.3) did notexplicitly entertheirmodel,but,definingei as the ofthe actualand expectedcountsforthe ith tube,(1.3) wouldhold. difference The suggestedreconstruction was the maximumlikelihoodestimateofv, and forits computationthe EM algorithmof Dempster,Laird and Rubin(1977) was used. This reconstruction techniqueis sometimesconsideredas relatedto "maximumentropy"[cf.,e.g.,Millerand Snyder(1987)],fortheformalrather thanconceptualreasonthatthe EM algorithm is equivalentto an alternating This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2035 I-divergence minimization; Vardi,Sheppand Kaufman(1985) haveshownthat the convergence of theiralgorithmis an instanceof a resultof Csiszar and minimization. Tusnady(1984) on alternatingI-divergence Maximumlikelihoodestimationis not a generallyapplicablemethodfor "solving"inverseproblemsof form(1.3) because (i) the "errors"ei maybe nonrandomor else theirjointdistribution maybe unknownand (ii) evenifthe the maximumlikelihoodesti"errors"are randomwithknowndistribution, mate is typicallynonuniqueif n > k. Vardi, Shepp and Kaufman(1985) theobjectintofewerpixelsthanthe avoidedthelatterdifficulty bypartitioning numberof "tubes" thatwas sufficiently large.If boundson the magnitudeof the errorsare known,it maybe convenientto interpret (1.3) as a systemof inequalitiesofform (1.4) bi n j=l1 aijvj < btzi=l,i , k, or as an inequalityofform k (1.5) k i=l ( n j=l 2 aiivi - b) < c. Our axiomaticapproachcould be extendedto the problemof selectingan elementfromanyset definedbyinequalityconstraints suchas (1.4) or (1.5) or, moregenerally, fromanyconvexsubsetofthebasic set S. Alternatively, (1.3) could be interpreted as determining a "feasible set" of pairs (v,e), and a selectionfromthis set could be made by any methodthat has been deemed "good" for the problem(1.2). Indeed, this is oftendone in practice,for somequadraticfunction ofthe pair(v,e) [cf.Herman example,by minimizing and Lent (1976) or Censor(1983)]. It remainsto be seen whether"solutions" of this kind to the problem(1.3) can be coveredby an extensionof our axiomaticapproach;one of the difficulties is that the possiblelinear constraintson pairs(v, e) are ofthe veryspecialform(1.3). The approachin thispaperwas strongly motivatedby Shore and Johnson (1980), where-forthe problemofassigninga probability distribution subject to momentconstraints-an intuitively appealingaxiomaticderivationof the methodsof maximumentropyand minimumI-divergencewas provided. forinferring Skilling(1988) gave a similarderivation arbitrary positive-valued because of the greaterliberty, this case turnedout to be considerfunctions; ablysimpler.BothShoreand Johnson(1980) and Skilling(1988) startedfrom the assumptionthat inferencewas based on minimizingsome functionor, equivalently,on some transitiverankingthat could be describedby real numbers.Afterhavingsubmitted thispaper,theauthorlearnedthatParis and ofmaximumentropy"from Vencovska(1990) had arrivedat "the inevitability axiomsthat-like ours-did not a prioriassume the minimization of some fromtheirs, function;in otherrespects,our approachappearsquite different in the axiomsdo exist.The authoris indebtedto Profesalthoughsimilarities sor Paris forsendinghimmanuscripts ofthisand relatedworks. This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions 2036 I. CSISZAR Our resultsalso provideaxiomaticcharacterizations ofmeasuresofdistance whoseminimization leads to "good" methodsofinference. Whereasthereis an of entropy,I-divergence extensiveliteratureon axiomaticcharacterizations and theirgeneralizations [cf.,e.g.,Aczeland Dar6czy(1975)],ourcharacterizadiffer fromthe usual ones: Our postulatesinvolvenot the tionssubstantially but ratherthe inferencemethodit leads to. In measureto be characterized Theorem2(ii), a class of distancesintroduced by Csisza6r (1963) [and independentlybyAli and Silvey(1966)] is characterized; relatedresultsalso appearin Shoreand Johnson(1980) and Skilling(1988). Theorem3 characterizes a class ofdistancesintroduced by Bregman(1967); sincethispaperhad been submitted,a different axiomaticcharacterization ofthisclass (in thecontinuouscase) was givenbyJonesand Byrne(1990). Finally,we commenton theone-parameterfamilyofdistancescharacterized in Theorem4(ii). In theoriginalversionof this paper,the previouslynot consideredmembersof that familyhad been mentionedas candidatesforbecomingpractically useful.Recently, Jonesand Trutzer(1989) reportedapplicationsof (continuousversionsof) these distancesthatseemto confirm thatprediction. 2. Preliminaries, postulates. The reallineand thepositivehalf-line are denotedbyR and R+, respectively. We emphasizethat R+ does notcontain0. The vectorsin Rn whosecomponents are all 0 or all 1 are denotedby0 or 1. All vectorsare columnvectors. The set of n-dimensional vectorswith positivecomponentsof sum 1 is denotedby An nthat is, (2.1) A = (v: vE R, Tv= 1}. The familyofaffinesubspacesof Rn, thatis, the familyofnonvoidsubsets of Rn definedby linearconstraints, will be denotedby .Zn. In otherwords # LE iff L 0 and n L = {v: vE Rn,Av = b} forsomeA E Mk xn (k x n matrix)and b E Rk. We denoteby Yn' thefamily of all nonvoidsubsetsof R+ of form(2.2), withv E Rn replacedby v E R+. The familyof those L E Yn+ whichare subsets of An will be denotedby Y4+(1). For any fixeddimensiond < n, the set of d-dimensional elementsof 4n (or 4+k) is endowedwith the natural topology,that is, the quotient topologyderivedfromthe Euclideantopology oftheset ofthosepairs(A,b) E elementof .4n (or Yn+). M(n-d)xn x Rn-d thatdefinea d-dimensional Throughoutthis paper,our basic set S willbe eitherof Rn, Rn and An The set ofcomponents ofvectorsin S willbe denotedby V, thatis, V stands for R, R + or the open interval(0, 1), accordingto the choiceof S. Unless statedotherwise,u, v and w willalwaysdenoteelementsof V, whereasu, v and w denotevectorsin Vn. Further, we denote by ./ the familyof nonvoid subsetsof S determined by linearconstraints.Thus, accordingto the three cases, ? equals .n, .+ or .n'(1). For convenience,we will call the elementsof Y subspacesof S also if S $ Rn. (2.2) This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2037 Noticethat S itselfis an elementof ..f; amongthe propersubsetsof S the maximalsubspacesare thoseofdimensionn - 1 (if S = Rn or R+) or n - 2 (if S = An). The subfamily of Y consisting ofthesemaximalsubspaceswillbe denotedby X4/. Thus X/ consistsofthe (nonvoid)sets (2.3) _ ({v:a av a =b 0, if S = R~ or R+, {v: aTv = b, ITv = 1}, a A1, if S = Azn As mentionedpreviously, the conditionv E Vn is implicitin the notationin (2.3). Having in mindinferenceproblemsas in Section 1, we are interestedin rulesthatspecify foreach L E Y (and "priorguess" u) an elementof L, to be selectedif L is in the feasibleset (and the priorguess was u). The vector selectedfromL whena priorguessu was availableis regardedas a nonlinear projectionon L ofu, denotedby fl(Llu). (2 L DEFINITION 1. A selectionrule (withbasic set S) is a mapping[l: Yt-*S suchthat 11(L) E L foreveryL E .. A projectionrule is a familyofselection rules fIl( Iu), u e S, such thatu E L impliesfl(Liu) = u. A selectionrule is generatedbya functionF(v), v E S, if foreveryL E Y, fl(L) is the unique elementof L whereF(v) is minimizedsubjectto v E L. A projectionrule is generatedbya function F(vlu), u E S, v E S, if its componentselectionrules fl( Iu) are generatedbythe functionsF( Iu). REMARK.A necessaryconditionforF(viu) to generatea projectionrule is thatforanyfixedu E S theuniqueglobalminimumof F on S be attainedat v = u. Attention maybe restricted to functionsF forwhichthisminimum is 0 [because a projectionrule generatedby some F(vlu) is also generatedby F(vIu) = F(vIu) - F(uIu)I. A functionF(vIu), u E S, v E S, withtheproperty F(vlu) 2 0, withequalityiffv = u, willbe calleda measureofdistanceon S. EXAMPLE1. In the case S = Rn, the Euclidean distance F(vlu) = ilu - vil generatesa projectionrule that gives rise to the ordinaryprojectionin Euclideangeometry. It willbe calledthe least squaresprojectionrule and its componentselectionrules will be called least squares selectionrules. In particular,the selectionrule generatedby F(v) = livilis the standardleast squares selectionrule. A projectionrule generatedby a weightedL2-distance (E n 1ai(vi - ui)2)1/2 willbe calleda weightedleastsquaresprojectionrule.Of course,"least squares" is a standardmethodof a verylong history.Notice, however, thatit is notsuitablein thecases S = R or An; thenF(v) = liv- uil does not generatea selectionrule forany u E S (neitherdoes any weighted L2-distance)becauseit does notattaina minimumon somesubspacesL E .Y. EXAMPLE2. definedby (2.4) Let S = R or An. The I-divergence ofv E S fromu E S is I(viiu) = f (vilog - vi+ ui. This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions 2038 I. CSISZAR This generalization ofKullback'sformula I(v||u) n v vilog,U u E- Anx V E- A\n retainsthe propertyI(vllu) ? 0, withequalityif and onlyif u = v. Clearly, F(vlu) = I(vllu) generatesa projectionrule (forany fixedu E S, it attainsa unique minimumon any L E /, and if u E L, this minimumis attainedat v = u); this will be called the I-divergence projectionrule,and eitherof its componentselectionrulesis an I-divergence selectionrule.The selectionrule generatedby the negativeentropyF(v) = E n 1vilogVi will be called the maximumentropy selectionrule.This is a specialcase of I-divergence selection rulescorresponding to u = (1/n)1 if S = A\ and u = (1/e)1 if S = Rn. As a comparisonof the naturalityof these choicesof u indicates,the maximum role mainlyin the case S = An, entropyselectionrule plays a distinguished thatis, forinferring probability distributions. We emphasizethat we do not a priorirestrictattentionto selectionrules 2 and 3 generatedby some function.However,the postulatesin Definitions belowwillsuffice to provesuch generatedness. Givena selectionrule II withbasic set S, we willdesignateHI(S) byv?(H). Then the componentselectionrules H( Iu) of a projectionrule satisfy,by Definition1, (2.5) vo(H(. Iu)) = u. DEFINITION2. A selectionrule H:Y-+ S is regular if it satisfiesthe axioms: following if L' c L and H(L) e L', then E(L')=l(L); (i) (consistency) (ii) (distinctness) if L # L' are bothin X#,then H(L) M H(L') unlessboth L and L' containv?(HI); (iii) (continuity) the restriction of Hl to the subspacesof any fixeddimension is continuous. A projectionrule is regularifits componentselectionrulesare such. The consistencyaxiom formalizesthe intuitiveidea that if v = H(L) selected on the basis of constraintsspecifyingL also satisfiesthe stronger constraintsspecifying L', thenthe additionalconstraints provideno reasonto changethe selectionof v. The case forthis axiomappears quite strong,for one wouldhardlywantto use a selectionrule example,in imagereconstruction not satisfying it. Nevertheless, forcertain this axiom may be inappropriate problems.Whenthe selectedelementoughtto resemblethe otherelementsof thefeasibleset L as muchas possible,it is reasonableto selectthev E L that minimizes-forsome given measure of distance d-either sup,, L d(v,w) of d(v,w) ["barycenter expectation method,"Perez (1984)] or the conditional on the conditionw E L (Bayesianrule,requiresa priordistribution on S). This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2039 These selectionrulesdo not satisfythe consistency axiomand are outsidethe scopeofthispaper. in a The distinctness axiomsaysthatdifferent bothconsisting information, singlelinear constraint,must lead to different conclusions,unless both are This consistent withtheselectionthatwouldbe madewithoutanyconstraints. is a technicalpostulateand it wouldbe desirableifit couldbe dispensedwith. Notice that the distinctnessaxiom is certainlysatisfiedfor selectionrules functionsF. generatedby differentiable is an obviousregularity Continuity hypothesis.Notice,however,that (for projectionrules) we did not postulatethe continuousdependenceof H(LIu) on u. For any set of indices J = Ul, ... , jk, 1 < il < < Jk < n, and any vectora E Rn, we denoteby aj the vectorin Rk definedby (2.6) aj = (ajlX ... , aik) For a selectionrule [I, we willdenote(II(L))j briefly by rlj(L). S is local ifforeveryJ c {1, ... , n} DEFINITION3. A selectionrule II: Y ofarbitrary size k, and any L' and L" in Y ofform (2.7) L'= {v: vj e L, vjcEL'}, L" = {v: vjE L, vjcEL"), L" E4k whereLo E..4, L' E Yk, (if S = Rn) or Lo E E4yn+-k (if S = Rn orAn), we have L" -k XL' E/A-k, (2.8) H( L') = 11JW) . A projectionruleis local ifits componentselectionrules IL( Iu) are local and, in addition,forL' and L' as abovewe have (2.9) HIj(L'u') = llJ(L"'Iu") ifu', = U'. = +,then anyL' and L" of REMARK.If S=Rn, )Y=n orS=R+, form(2.7) necessarily belongto ..?. On theotherhand,if S = An, ..= 4+ (1), the sets L' and L" in (2.7) belongto Y iffthe sum of componentsof each vectorin Lo is equal to the same 0 < c < 1, and the sum of componentsof each vectorin L' and L" is equal to 1 - c. Localitymeans,in otherwords,thatforL' E Y as in (2.7), [Ij(L') depends onlyon LO) and [IH(L'Iu') dependsonlyon Lo and u'j. the localityaxiomsays thatwheneverthe availableinformation Intuitively, consistsof two piecesthat involvecomplementary sets of componentsof the vectorto be inferred, each component ofthevectorselectedon thebasis ofthis whichinvolvesthe willbe determined information bythatpieceofinformation componentin question.In the reconstruction problemof X-raytomography, this means that if two sets of ray paths coveringdisjointsets of pixelswere wouldbe determined used,thenforeach pixelthereconstruction bythoserays whose paths are in the set coveringthat pixel. This axiom appears very This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions 2040 I. CSISZAIR natural,but strictadherenceto it mayperhapsbe criticized in thetomography exampleon thebasis ofsmoothnessproperties oftheunknownX-rayattenuationfunction.For the case ofinferring probability mass functions, Shoreand Johnson(1980) used a similarbut strongerpostulatecalled "subset independence." In this paper, regularityand localitywill be the basic postulates.The selectionand projection rulessatisfying themwillbe characterized in Theorem 1. In the restofthissectionwe formulate someotherdesirableproperties that suggestthemselvesas additionalpostulates,ifwe wantto arriveat methodsof practicalinterest. An important rolewillbe playedbythe specialsubspaces ({v: vj + Vj=t}, (2.10) Lij(t) | v: + vi = t, E if S =R v= 1-t}, orR, if S = 'A lo,j Given a selectionrule fl, we will writev ij v' to designatethat v and v' equal the ith and jth componentsof il(Lij(t)) forsome t; of course,then necessarilyt = v + v'. Similarly,givena local projectionrule,we will write (vlu) -ij (v'u') to designate that v and v' equal the ith and jth components of [l(Lij(t)Iu) ifthe ith and jth component ofu are u and u' (and t = v + v'). Clearly,v <ij v' means the same as v' *ji v. Further,we will show (corollaryof Lemma 3) that if v -ij V' and v' jk V" for a regular,local selectionrule II, then also v *>ik V", providedin the case S = An that v + v' + v" < 1. Ofcourse,similarstatements holdfortherelations(v u) i*- (v'Iu') associatedwitha regular,local projectionrule. DEFINITION4. (i) A local projectionrule is semisymmetric if for every Lij(t) as in (2.10), the ith and jth componentsof Hl(Lij(t)u) are equal wheneverui = uj, that is, (vyu) <-ij (v'lu) iffv = v' (providing, in the case S = An, that v < 1/2, u < 1/2). (ii) A projectionrule withbasic set R+ or An is statisticalifit is regular, local,and the relation(ylu) *-ii (v'lu') is equivalentto v/u = v'/u',withthe additionalconditionsv + v' < 1, u + u' < 1 if S = An. The term"semisymmetric" refersto the factthatthisweak and plausible postulateoftenimpliesthe apparentlymuchstrongerpropertyof symmetry (permutation invariance);see Theorem2(i). It wouldnot be unreasonableto use symmetry as a postulate,as did Shore and Johnson(1980), but by not doingso we will obtain strongermathematicalresultswithlittleadditional effort. Also the strongerpostulatein (ii), applicablewhen S = R n or An, appears in thelattercase. Namely,ifa priorguess natural;it is particularly compelling about a probabilitymass functionhas to be updated subject to a single ofa givenset,it is standardto assign constraintthatspecifiesthe probability This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2041 to the prior ones. probabilitiesto the elementsof this set proportionally 4(u) Definition requires this for two-elementsets only. The proportional the probabilitiesof several updating,in general for constraintsspecifying pairwisedisjointsets ["Jeffrey's rule,"cf.Diaconisand Zabell(1982)] appears uncontroversial. It was also partof the axiomsof Shore and Johnson(1980). 5. (i) A projectionrule withbasic set S = R' or R+ is scaleDEFINITION invariantifforeveryA > 0, L E Y and u E S we have [l(ALJAu) Afl(LIu). (ii) A projectionrule withbasic set S = Rn is translation-invariant if for everyL E Y, u E S and ,uE R, fl(L + ,1Iu + 1) = fl(LIu) + 41. Here AL and L + ,i1 denotethe set ofvectorsAvand v + At1,respectively, such thatvE L. The intuitivemeaningof these invariancepropertiesis obvious,and they are clearlydesirable.The characterization ofall (regularand local) projection rules satisfying eitheror both of thesepostulatesis not difficult but will be omittedbecause of its limitedpracticalinterest.Rather,thesepostulateswill be used in connection withthe nextone only. DEFINITION 6. (i) A projectionrule is subspace-transitive if forarbitrary L' c L in Y and anyu E S we have fH(L'Ju)= fl(L'jfl(L1u)). (ii) A projectionrule is parallel-transitive if (2.11) holds forsubspaces L and L' that can be representedin the form{v: Av = b} [cf.(2.2)] withthe same matrixA. (2.11) Subspacetransitivity meansthatifupdatinga "priorguess" u based upon information thefeasibleset L resultsin v = Hl(LJu),and additional specifying information leads to L' as the actual feasibleset,thenupdatingthe "present guess" v on the basis of all availableinformation leads to the same resultas wouldthe directupdatingof the "priorguess" u. For example,if an image reconstruction has beenobtainedfrommeasurements Ri f, i = 1,. . ., k, using some priorguess,and thenfurther measurementsRi f, i = k + 1,. . ., k + 1, are made,thereconstruction fromall themeasurements Ri f,i = 1,..., k + 1, willbe the same no matterwhetherthe originalpriorguess or the previous reconstruction is used as " priorguess." Parallel transitivity has a similar intuitivemeaningforthecase whenafterhavingobtainedthefirstreconstruction,the same measurements are repeatedwithresultsdifferent fromthose before;the second reconstruction is now based on the new results,the previouscontradicting ones beingdiscarded. Subspace transitivity is a highlydesirableproperty also because it implies (and is actuallyequivalentto) the commutativity of two-stageupdatings. Indeed,letus be giventwosetsoflinearconstraints determining subspacesL1 and L2 withL1 n L2 = 0. Then (2.11) impliesthatupdatinga priorguess u based upon the firstset of constraints and thenupdatingthe resultfl(L1Iu) This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions 2042 I. CSISZAR based uponbothsets ofconstraints, theresultwillbe H(L1 n L21u); thesame resultwouldbe obtainedifthe firstupdatingwerebased on the secondset of constraints. Ofcourse,thiscommutativity holdsfor"proper"two-stage updatingsonly,thatis, whentheconstraints used in thefirststageare notdiscarded in the secondstage. For regularprojectionrules,paralleltransitivity impliessubspacetransitivity,and more generally,also that (2.11) holds wheneverL = {v: Av = b}, L {v: Av = b'}, wherethe matrixA containsall rowsof A. In fact,then L = {v: A'v = A'v*} c L, wherev* = fl(Llu), and thus fl(Llu) = v* by the axiom.Then paralleltransitivity consistency impliesthat = I(L'Il(LIu)), [(L'ju) and (2.11) holdsas claimed. We will show that for regular,local projectionrules the two kinds of definedabove are actuallyequivalent.The familyof all regular, transitivity local and transitive in Theorem3. projectionruleswillbe characterized Combiningresultscharacterizing projectionruleswithpropertiesstatedin Definitions4-6 will lead us to the least squares and I-divergence projection rules as the onlyones that simultaneously satisfysome intuitively appealing postulates.On theotherhand,a singlepostulate(in additionto regularity and locality)willalso suffice to uniquelycharacterize theseprojection rules,as well as the least squares and I-divergenceselectionrules. To formulatethis postulate(Definition7 below),it is necessaryto considervectorsindexedby pairsofintegers(i, j), 1 ? i < m, 1 < j < n, ratherthanbyintegers1 < i < n. This is the natural representation for vectorsdescribingtwo-dimensional objects,such as images. The marginalsof v = fvij: 1 < i < m, 1 <j < n} E Rmn are the vectors E Rm, v = (v,.. =(i1,.. . ., v (2.12) v= n j=1 .., vn vij, E R, where m V E i=l vij. As before,we considerthreechoicesof our basic set S, namely,S Rmn or Amn = Rmnor For any v = {vij} E S, we denoteby L, the subspaceof S consistingof thosevectorsthathave the same marginalsas v, thatis, (2.13) We say thatv (2.14) Lv = = (w: w = v; w = v}. {vij} is of sum or productformif vij = si + tj or vij = s t, respectively. DEFINITION 7. A selection rule H with basic set S as above is composition-consistent, more precisely,sum- or product-consistent, if This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2043 A projecHI(L,) = v wheneverv E S is of sum or productform,respectively. if its compotion rule is composition-consistent (sum-or product-consistent) wheneveru itselfis of sum or nentselectionrules fl( Iu) have thisproperty productform,respectively. the vectorsv = {vij, 1 < i < m, 1 < j < n} are interpreted Intuitively, as oftwointeracting compositions the individualcomponents components, being represented by the marginalsv and v. Then Lv beingthe feasibleset means thattheavailableinformation but nothing specifiesthe individualcomponents else. The postulateofcomposition formalizes theintuitiverequireconsistency mentthaton the basis ofsuch information we shouldinfer"no interaction," unless a prior guess is available that impliesinteraction.Implicitin this interpretation is that "no interaction"is represented by the sum or product formof v. In particular,for inferring probabilitymass functions,product is a hardlyavoidablepostulate.It could be argued,thoughwith consistency less weight,that productconsistency is an appropriateaxiomalso forimage that is, that if the marginalsof an image were knownand reconstruction, nothingelse (an unlikelysituationin practice),the "best" reconstruction wouldbe ofproductform.In inverseproblemswithouta positivity constraint, the sum formofv appearsto be the naturaldescription of "no interaction," suggestingthat in this case the postulateof sum consistencyshould be adopted. REMARKS. (i) Product-consistentselection rules cannot exist for S = Rmn because in that case different elementsof S, each of productform,can have thesame marginals.On theotherhand,sum-consistent selectionrules,satisfyingthecontinuity axiomin Definition 2, cannotexistforS = R + n or S = Amn To see this,considera sequencev(k) of elementsof S of sum formv(*)-*t wees1t= s(k) >O> suhtaS(k) S (k) +t (k) > jh0,, such t5(k)O, -_> that SI -*+( SSi' t(k) i tj, wheres, = t, = 0 I( *j Ij)Si* and si > 0 fori > 1, tj > 0 forj > 1. Then v(k) - v and L(k) - Lv, where selectionrulewe shouldhave vij = si + tj. Thus fora sum-consistent fl(Lv) = limfl(Lv(k)) k- = lim v(k) = v k- a contradiction becausev i S. (ii) In the case S = Rmn,everyLv as in (2.13) containsan elementofsum form;hence fora sum-consistent selectionrule II, fl(Lv) is alwaysof sum form.It follows,subjectto regularity, thatv? = v0(01)is ofsumform,because = v? II(Lvo) by the consistency axiom. Similarly,in the cases S = Rmn or if a is HI product-consistent (regular)selectionrule,then v0(HI) is of Amn productform. (iii) For thecase S = Amn, thepostulateofproductconsistency is similarto but weakerthan Shore and Johnson's(1980) "systemindependence"postulate. This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions 2044 I. CSISZAR 3. Statement of results. Recallthatourbasic set S is eitherRn, R n or An,the familyofaffinesubspacesof S is denotedby /, and V denotesR, R + or theinterval(0, 1), accordingto thechoiceof S. Throughout thissection,we assumethat n 2 3 (if S = Rn or R+) or n ? 5 (if S = An). The following willfacilitatethe statementofour results. terminology DEFINITION 8. A n-tupleoffunctionsf1,. . . ,fn definedon V willbe called a standardn-tuplewith0 at vo = (vI, .. ., v? E S if: (i) fi is continuously differentiable and vanishestogetherwithits derivativeat v ,i = 1, ..., n, (ii) in the cases S = Rn or A ,lim oif(V) = -0i =L. n (iii) the function n (3.1) F(v)= E fi(vi) i=l is nonnegativeand strictlyquasiconvexon S, that is, forany v and v' in S and any 0 < a < 1 we have (3.2) F(av + (1 - a)v') < max(F(v), F(v')). REMARK. The functionsfi in a standardn-tuplemustbe convexif S = Rn or R n but not necessarilyif S = A\. The proofofthisis omittedin orderto save space. (i) For anyregular,local selectionrule [: s-- 5, thereexists a standardn-tuplef1,.. ., fn with0 at v0 = v0(l) such thatr1 is generated bythefunctionF in (3.1). Conversely, any F as in (3.1) generatesa regular, local selectionrule 1l withv?(JII)= v?. (ii) For any regular,local projectionrule, thereexistfunctionsfi(vlu), u e V, v E V, such thatforeveryfixedU = (U1,..., Un)T E S, thefunctions fi(vIu) forma standardn-tuplewith0 at u and thegivenprojectionruleis generatedby THEOREM 1. (3.3) n F(vlu) = E fi(vilui). i=1 Conversely, anysuch F generatesa regular,localprojectionrule. (iii) Two functions F(v)= E= if(vi), F(v)= En1fi(vi) or F(vlu)= E Z=lfi(vIui) F(VIU) = E1 fi(viIui) as in (i) or (ii) generatethesame selection or projectionrule, respectively, if and onlyif fi= cfi,i = 1,... , n, for someconstantc > 0. REMARK. In part (iii),the assumptionthat F and F be ofthe statedform is essential.Otherwise,F(v) and F = 4(F(v)) generatethesame selectionrule for any strictlyincreasingfunctionFD,and F(vlu) and F = F(F(v1u),u) generatethe same projectionrule whenever(?(* Iu) is strictlyincreasingfor everyfixedu E S. This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2045 THEOREM 2. (i) A projectionrule withbasic set S = Rn or Rn is regular, local and semisymmetric (Definition4) ifand onlyifit is generatedbyF(vlu) as in Theorem1(i) withfi notdependingon i. (ii) A projectionrule withbasic set S = R or An is statistical(Definition 4) ifand onlyif it is generatedby (3.4) F(vlu) = E ui fI-'-I) for some continuously strictlyconvexfunctionf on R+, with differentiable, = = = f(l) f'(1) 0 and limt 0 f'(t) oo. Functionsofprobability ofform(3.4), as measuresofdistance distributions motivated byinformation theory,wereintroduced by Csiszar(1963) underthe name f-divergences [cf. also Csiszar (1967)] and independently by Ali and Silvey(1966). For theirapplicationsin statistics,see, forexample,Liese and Vajda (1987). The conditionson thederivative of f are notpartoftheoriginal definition of f-divergences. Notice that the conditionf'(1) = 0 is essential onlyin the case S = R+ because if S = AnX thenany f(t) and f(t) = f(t) + c(t - 1) definethe same F in (3.4). The conditionlimt 0 f'(t)= -o is necessaryforF in (3.4) to generatea projection rule,thatis, to ensurethatits minimumin v be attainedon everyL E Y. THEOREM 3. (i) For any regular,local and subspace-transitive projection rule (Definition6), thereexists a standard n-tuple(P1, , fpnsuch that ?(V)= E=p1,i(vi) is strictlyconvexon S and the givenprojectionrule is generatedby F(vlu) (3.5) = ?(v) n E i=l - ?(u) ((v) - - (grad 1(u))T i(Ui) - (v - u) 0(Up(iu)(V-ui)). (ii) AnyF as in part (i) generatesa regular,local and parallel-transitive projectionrule. On accountofthistheorem,in the sequel we neednotdistinguish between the twokindsoftransitivity. COROLLARY.The only transitivestatisticalprojectionrule (with S An) is theI-divergence projectionrule (cf. Example2). = Rn or The class of measuresof distanceassociatedas in Theorem3 withstrictly convex functions'F (not necessarilyof a sum form)was introducedby 'F under Bregman(1967). He developedan iterativealgorithm forminimizing linear(and, moregenerally, underlinearinequality)constraints, whosesteps involvedprojectionsin the sense of the distancecorresponding to 'P [cf.also Censorand Lent (1981)]. This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions I. CSISZAR 2046 Bregman'sdivergencesinclude squared Euclidean distanceand I-divergence,and sharewiththesethe property (3.6) F(vlu) + F(wlv) = F(w1u) if [l(Llu) = v, w E L. Notice that for squared Euclidean distance,(3.6) is just the Pythagorean also has thisPythagorean property playsa theorem.The factthat I-divergence keyrolein its applicationsin statistics[cf.Kullback(1959)]. A resultrelatedto but notdirectly comparablewithTheorem3 was recently thepreviousresultsofJones obtainedbyJonesand Byrne(1990),generalizing (1989). They showedthat amongthe continuousanalogs of the distancesof form(3.3), onlythe analogsofthosein Theorem3 satisfieda postulatecalled projectivity. That postulateis stronglymotivatedfromthe pointof view of inference, but it also involvesthe distanceitself,as opposedto transitivity which is a propertyof the projectionrule alone. Jones and Byrne(1990) pointedout that theirdistancessatisfied(3.6), and, in fact,that (3.6) was equivalentto the projectivity postulate. THEOREM4. (i) A projectionrule withbasic set S = Rn is regular,local, transitive and bothlocationand scale-invariant ifand onlyifit is a weighted least squares projectionrule (cf. Example 1). The above propertiesand semisymmetry uniquelycharacterize theleastsquaresprojectionrule. (ii) A projectionrulewithbasic setS = Rn+is regular,local, transitive and scale-invariantif and only if it is generatedby F(vlu)= ELz alha(vilhu4 wherea1, ... , an are positive constants,a < 1, and v v log-u (3.7) ha(VIU) v + u, if = 1, u v logg-+---1, 1 _(ua a v if a =0, u - va) + Ua l(v - u), if 0 < a < 1 or a < 0. A projectionrulewiththeaboveproperties ifand onlyif is also semisymmetric it is generatedby (3.8) Fa(vlu) = n E i=1 ha(vilui), a < 1. REMARK. The one-parameter family(3.8) containstwo well-knowndistances: I-divergence (a = 1) and Itakura and Saito (1968) distance (a = 0). RecentresultsofJonesand Trutzer(1989) indicatethat(continuousversions of) the membersof this familywith a = 1/m (m some integer)may be a tradipreferable to the Itakura-Saitodistancein spectrumreconstruction, tionalfieldofapplicationofthe latter. This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2047 ofthe least squaresand I-diverOur last resultprovidesa characterization genceselectionand projectionrulesthat,insteadofinvarianceor transitivity, relyon composition consistency (Definition 7). We emphasizethatthischaracterizationalso appliesto individualselectionrules,ratherthan to projection rules only.On the otherhand,we have to assume thatthe basic set S is as the inferenceproblemin question describedbeforeDefinition7. Intuitively, shouldrelateto objectswith(at least) twocomponents. THEOREM5. Let S be as in Definition7, S = Rmnor R+n or Ain' with m > 2, n > 2, and if S = Amn, m + n ? 5. (i) In the case S = Rmnn,the regular, local, sum-consistentselection and projectionrulesare exactlythoseleastsquaresselectionrulesforwhichv0 is of sum form,and theleast squaresprojectionrule,respectively (cf. Example 1). (ii) In the cases S = R mnor A the regular, local and product-consistent selectionrulesfor selectionand projectionrulesare exactlythoseI-divergence whichv? is ofproductform,and theI-divergence projectionrule,respectively (cf. Example 2). (i) For S = Rmnn,thestandardleast squares selectionrule is theuniqueregular,local, sum-consistent selectionruleforwhichv0(H) = 0. (ii) For S = Almn'themaximumentropy selectionrule is theunique regular, local, product-consistent selectionruleforwhichv0(Lt) is the "uniform distribution" (1/mn)l. The same holdsalso forS = R+ n ifthelast condition is replacedby vo(H) = (1/e)1. COROLLARY. 4. Basic lemmas. LEMMA1. For a regularselectionrule H: S , for everyL' E Y of dimension less than n - 1 (if S = Rn or Rn) or less than n - 2 (if S = A thereexists L E X/ [cf. (2.3)] such that L' c L and H(L') = [1(L). - COROLLARY . For a regularH and L' E ., [1(L) impliesL' c L unless H(L') = v0(H). L E X#,theequality[I(L') = PROOF. Clearly, it sufficesto prove that if dim L' = d with 0 < d < n - 1 (or 0 < d < n - 2), then thereexistsan L e .. such that L D L', dimL = d + 1, H(L') = H(L). Further,insteadofthe last equality,it suffices to show that [(L) E L' becausethis,bytheconsistency axiomin Definition 2, already implies (L') = [(L). L1 D L' and supposethat [1(Ll) t L'. Now, pick any (d + 1)-dimensional Then we will "rotate" L1 to obtaina family{Lt: 0 < t < 2} and show that 1(Ld) E L' forsome t. To this end, pick any vo 0 L1 in S, set v, = [M(L,) and let v2 be such that some interiorpointof the segment[vO,v2] is in L'. Set vt = (1 - t)vo + tv1 if 0 < t < 1 and vt = (2 - t)v1 + (t - 1)v2 if 1 < t < 2, This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions 2048 I. CSISZAR and let Lt denotethe subspaceof S spannedby L' and vt.Then Lo = L2 and Ltn Lt2= L' if0< t1<t2 <2. By the continuity axiom,{I1(Lt): 0 < t < 21 is a continuousclosedcurvein the subspaceL spannedby L1 and vo (t = 0 and t = 2 representing the same point).For ? > 0 sufficiently small,fl(L1) and fl(Ll+E)-which are arbitrarily close to 11(L1)= vl-are separatedby L1 withinL; hence thereexists some t with It- 11> E forwhich fl(Lt) E L1. Then fl(Lt) E Lt n L, = L' and theproofofLemma 1 is complete. The corollaryis immediate,because for L E X containingL' such that axiomimpliesL = L. El ME(L')= 11(L),the distinctness LEMMA2. The restriction ofa regularselectionrule Vt:.-* S to X'\ Y0 is a homeomorphism ontoS \ {v?}, wherevo = vo(H) and Y0 - {L: vo E LI. PROOF. ApplyingLemma 1 to the zero-dimensional subspace L' = {v}, it followsthatforeach v E S thereexistsL EX-#suchthat11(L) = v. Ifv 0 v?, thenL 4 ?, bytheconsistency axiom.Thus 11maps X#\f ? ontoS \ {v0}. This mappingis one-to-one and continuousbythe distinctness and continuity axioms.It remainsto provethe continuity of the inverse.In otherwords,we have to showthat U(Lk) --+H(L) #v? impliesLk-- L. We do this by showingthat everysubsequenceof {Lk* containsa subseto L. Write quenceconverging L k J{v: akv =bk {v: aTv = b; 1Tv1if= ifS =Rn orS =R+, S =A [cf.(2.3)]. Here we may supposethat Ilakll= 1 and in the case S = An also thatak ? 1. Now, any subsequenceof {L*} containsa subsequence{Lkj such that by asa, say. Writefl(Ld) = Vk, fl(L) = v*, aTv* = b. As Vkv* aki = ak.Vk that b. This a means that if a aTv= sumption, implies aki bki the set -* LJ[{v:aTv-b}, ifS=R orRn, (v: &Tv= b, 1Tv = 1}, if S = An, is in we have Lk. L. But L EX,# holds because (i) L ? 0 (namely, = 1 and,if S = An, also a ? 1. We have provedthat everysubsequenceof {L*) containsa convergent subsequenceLk. -+ L. Bythecontinuity axiom,here1l(L) = limi lH(Lk.)= 11(L) # v?, and hence necessarilyL = L by the distinctnessaxiom. This completestheproofofLemma2. oJ c@, -* v*~E L bythe definition of b) and (ii) IlaI We will need the following notation,fork < n: The vectorsin Rk whose are all 0 or all 1, willbe denotedby Ok and lk, respectively components (thus This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE o= on, 1 = In). Two vectorsin a a, iffforsome A # 0, 2049 will be called equivalent,denotedby Rk if S =Rn orS =RRn+ if S = An- {Aa+ XAa+ ,k, AE R, Observethatin the representation L (4.1) ) -ftv:av = t{v: aTv blTV ifS=R orR+, fS = A = 1} ofa subspaceL E X1,thevectora E Rn is determined up to equivalence,and it is notequivalentto 0. LEMMA3. Let H: ..t-* S be a regular, local selectionrule, and let therelationintroducedin thepassage containing(2.10). *-*ij be (i) Let L fe, [(L) = v* * v? = vo(H). Then vi* ij vj* if and only if (4.1) of L; further,in the cases S = Rn or ai - aj in the representation S =R+n, v = v ifand onlyifa =0. (ii) Let L and L be both in Xt' with [(L) # v?, [(L) 0 vo, and let = Hj(L) impliesaj J c{1, ..., n}. Then[j(L) iaj fora and -a representL ing and L as in (4.1). In otherwords,aj is determined by jII(L) up to equivalence. COROLLARY.If vi +>ij also vi +ik Vj and vj *->jk Vk for some {i, j, k) c (1, .. ., n}, then = An thatvi + Vj + Vk < 1. Vk, providedin thecase S PROOF. We will show that for L as in (i), arbitraryJ = {j1, {1, . . ., n}, a E Rk, and for *, iv= a -avTv (4.2) L a={ aTvj* i J l lTvVjc), iv ifS =Rn ,ik} C orR+, = if S = 'An) the following holds:aj _a is a sufficient conditionfor (4.3) [j(L) = [rj(L') and thisconditionis also necessaryunlessaj To provethis,set (4.4) L" = L n {V: vjc ?k. = VJc. Then,by locality,Hj(L') = r1i(L"). Since,of course,[ljc(L") = V5c, it follows that(4.3) holdsifand onlyif [(L") = v*. Now, if aj a, then L" c L. By the consistency axiom,the latterimplies H(L") = v*. Thus aj a is sufficient for(4.3), as claimed. This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions I. CSISZAR 2050 if(4.3) and therefore Conversely, [(L") = v* hold,the corollaryofLemma 1 yieldsL" c L. Comparing(4.1), (4.3) and (4.4), L" c L meansthat aTvi = = aivj (subjectto the additionalconstraintE jeJVj = aTvj impliesaiv v if S = An). Clearly,this implicationholds iffthe two conditions E iO representthe same constraint, exceptforthe case a1 0?k. This provesthat for indeed unlessajk.? a" is (4.3) necessary aj, To proveassertion(i) of the lemma,supposefirstthat S = R' or S = R+ and applythe resultjust provedto J = {i}, a = 0. Then L' in (4.2) equals S and (4.3) meansthat vi = vo.Thus we obtainthat ai = 0 impliesvi = v,0and if ai # 0, then vi*= vO cannot hold. Next, apply our resultto conversely, J = {i, j}, a = (1, 1)T. Then L' in (4.2) equals the specialsubspaceLij(t) (with t = V* + VJ) thatdefines therelation+-*ij [cf.(2.10)].Hence(4.3) is equivaa = (1,1)T, except when S = R' lent to v* v Now if ai = aj, then a VJ. or S = R+ and ai = a =O. Thus we have that ai = aj impliesvO*-*jV, because the mentionedexceptionalcase has alreadybeen covered. a; hence v* VJ if ai # aj, thenneitheraj - 02 nor ai Conversely, i cannothold. To proveassertion(ii), supposethat IIj(L) = lJ4(L)= vj and considerL' as in (4.2) with a = &j. Then ljl(L) = Hj(L) = HII(L'), by assumption and the sufficient conditionfor(4.3). This implies,bythenecessarycondition, that when L L Since the roles of and except perhaps 0k' are aj ~aj, aj we similarlyhave aj when is symmetric, not equivalent to ?k' aj aj whereasthe same trivially holdswhenbothaj ?k and aJOk. To provethe corollary, pick any v* E S with vi*= vi, Vj*= V, vk = Vk. If # v* v?, the hypothesesvi ij vj and vj **jk Vk implyby Lemmas2 and 3(i) that v* = 11(L) for some L as in (4.1) with ai = aj = ak. This, in turn, implies[again,byLemma3(i)] that Vi jik Vk whereasthelatteris obvious(by the consistency axiom)ifv* = v?. LEMMA4. (i) Let Fij(u, v), i s j, {i, j) c {1, . . ., n), n > 3, be real-valued functionsdefinedforu E Ai, v c Aj, whereA, ..., An are arbitrary sets. If theseFij satisfythefunctionalequations (4.5) Fi(u, v)Fjk(V, w) = Fik(u, w), Fij(u, v)Fji(v, u) = 1, thenthereexistfunctions gi definedon Ai, i = 1,... , n, such thatforeveryi andj, (4.6) Fij1u,v) = gi(u) (ii) Let n 2 4 and letBij c Ai x Aj, Cijkc Ai x Ai x Ak be setssuch that (1) (u, v)E Bij if and onlyif (v,u) E Bji; (2) (u, v,w) C Cijk impliesthat (u, v) E Bij, (u, w) E Bik, (v, w) E B3k; (3) forany distincti, j, k,I and any uEAi, veAj, wEAk thereexists s e Al with (u, s) e Bil, (v, s) e Bjl, (w, s) E Bkl; and (4) for any distincti, j, k,I and any u, v,s, s' with (u,s),(u,s') in BiI and (v,s),(v,s') in Bjl, there exists w EAk with This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2051 (u, w,s), (u, w,s') in Cikl and (s, w, v),(s', w,v) in Clki. Thenifthefunctions Fij are definedon thesetsB i and theequations(4.5) holdfor(u, v,w) E Cijk theconclusionofpart (i) stillholds. PROOF. (i) Fix some k and w~E Ak, and writegi(u) = FW(u, W) fori # k. Then gi(u) # 0 bythesecondpartof(4.5), and thefirstpartof(4.5) gives(4.6) if i, j are bothdifferent fromk. In addition,(4.5) impliesthat gi (u) g,(v) (V Fjk , W ) Fi k (U, W ) foreveryi * k, j * k, u E A,, vE Ai. Denotingthe commonvalue of these quotientsby gk(W),we obtain(4.6) also forj = k. Finally,the validityof(4.6) forthe remainingcase i = k followsfromthe secondequationin (4.5). (ii) On account of (i) it sufficesto provethat the functionsFij can be extendedfromBij to Ai x Ai so thattheequations(4.5) remainvalid.To this end,givenu E Ai, v E Aj, pickarbitrarily 1 = k (bothdifferent fromi, j) and s and s' in Al such as in hypothesis(4); choose w E Ak accordingto that hypothesis. Then,applyingthe firstand thenthe secondpartof(4.5), we get Fil(U, s)F1j(s, v) = Fik(U, W)Fkl(W, S)Flk(S, = Fik(U, W)Fkj(w, v) V) W)Fkj(W, and similarly Fil(U, s ( S%v) )Flj = Fik(U, v) W)Fkj(W, This meansthat Fij( u, v) = Fil( u, s) F,j(s, v) (4.7) is welldefinedbecausetheright-hand sidedoes notdependon I and s [subject to (u, s) E Bii, (v, s) E Bjl, thelatterbeingequivalentto (s, v) E B,j]. Clearly, (4.7) definesan extensionof Fij to Ai x Ai. To see thattheseextensionsFj satisfythe functional equations(4.5) foreveryu E Ai, v E Aj, wE Ak, let l be different fromi, j, k and pick s E Al accordingto hypothesis (3). Then by (4.7) and the secondpartof(4.5) Fij(u, V)Fjk(V, w) = = Fij(u, v) Fji(u Fil(U, s)F,j(s, v)Fjl(v, FEA(u, S)Flk(s, W) s) Flj(s, v) Fjl(v, I v) = Fil(u, = Fil(u, s)F,i(s, u) s)Flk(s, = Fk(U, = w) W), s) Fli(s, u) 1. 5. Proof of the main results. PROOFOF THEOREM 1. (i) Let H: S be a regular,local selectionrule, and let v? = (v9,... , v2)T= v'(H1). By Lemma 2, foreveryv / v? there exists This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions I. CSISZAR 2052 L(v) dependscontinua unique L = L(v) E X suchthatfl(L) = v; moreover, on v. Write L = L(v) as ously (5.1) (5.1) ~L w aT {w: S= R'or R', =aTV,if aTw = aTv; lTW = 1}, if S = A". up to equivalence(cf.thepassagebefore Here thevectora E R' is determined Lemma3). Our firstclaimis that thereexistcontinuousfunctionsgi(v), v e V, with (5.1) of L = L(v) (for gi(v) = 0, i = 1,...,n, such thatin the representation arbitrary v #v?) a - (91(V J (5.2) gn(Vn)) For laterreference, observethat(5.2) impliesby Lemma3(i) that Vi ij vj ifand onlyif gi(vi) =gj(vj) (5.3) in the case S = An that vi + vj < 1. providing To proveour firstclaim,we startwiththe simplercases S = Rn or Rn. By up to Lemma3(i), appliedto J = {i, j}, it followsthat(a1, aj )T is determined equivalenceby vi and vj. Thus (5.4) Fij(vi,vj) = of vi and vj wheneverv; 'vj9[which,by is a well-defined continuousfunction (5.4) Lemma3(i), is necessaryand sufficient foraj # 0]. Clearly,thefunctions equations satisfythe functional Fij(Vi Vj)Fjk(Vj, Vk) = Fik(Vi, Uk), Fij(vi,vj)Fji(v, vi) = 1 forvi E V\ {v9), i = 1,. . ., n (recallthatV = R or R+ accordingas S = Rn or Rn). It followsby Lemma4 that (5.5) Fij(vi,vj) = gi(vi) if vi #v10,vj #vjo,for suitable functionsdefinedand not equal to 0 on V \ {v(0.Lettinggi(vo)= 0, (5.5) also holdsforvi = vo?wheneverFij(vi,vj) is thatis, vj 0 vu?. defined, Comparing(5.4) and (5.5), we obtain(5.2). The functionsgi are continuous becausethe Fij are such. case S = An, applyLemma 3(i) with J= Turningto the more difficult {i, j, l} to obtainfora E Rn in (5.1) that(ai, aj, ai)T is uniquelydetermined, up to equivalence,by vi,vj,vI. Hence a (5.6) Fijl(vi, vj,v1) = a, -- a, a, is a well-defined continuousfunctionof (vi,vj,v,) subjectto vi + vj+ v, < 1 This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2053 foraj = a,, by Lemma exceptforvj *j v1[whichis necessaryand sufficient 3(i)]. Clearly,the functions (5.6) satisfythe functional equations (5.7) if VI + V (5.8) (5.9) (5.10) Fijl(Vi + Vk + Vl <1, Vk) VI) = Fikl(Vi) Vk, Vl) vI vjlV)Fjkl(vjy as wellas Fijl(vi, Vj,vl)Fjil(vj, vi, vl) Fijl(vi v, vj)Fjli( , vjv, vvi)Fij( vl,Vij) + Fij(vi,vl,vj) Fij(vi,vj,vl) = 1, = -1, = 1 if vi + vj + v1< 1, assumingin each case thatall functions are defined. These functional equationscan be solvedapplyingLemma4(i) threetimes. First,we use (5.7) and (5.8) fixingI and v1and restricting the domainofthe Fij1's-as functionsof vi and vj-by excludingvi *il vl, that is, Fil = 0. Thenthehypotheses ofLemma4(i) are easilychecked,takingintoaccountfor hypotheses (3) and (4) that(forfixedvl) the relationv >il vl neverholdsif v is sufficiently small;the latterfollowsfromthe continuity axiomin Definition 2. Lemma4 gives (5.11) G-1(vi, vl) Fijl(vi vj,vl) = G1('v) for suitable functionsGil defined(and nonzero) for ui + v, < 1, unless Vi *4il V1 Substituting (5.11) into(5.9), we obtain,afterrearranging, Gji(vj, vi) Glj(vl, vj) Gij(vi, vj) Gjl(vj,vl) Gli(vl vi) 'Gil(vi, vl) NowweapplyLemma4(i) to thefunctions -Gl1(vl, sG V(v)-vl Hil( vg VI) Hi1(vi, vi) (definedfor vi + v, < 1, unless vi il vl), yielding h vi) (( .Hi1(vi,v1) = h1v1 forsuitablefunctionshi defined(and nonzero)in V= (0, 1). This meansthat the functions G 1(v ,v1) Gil(v) vl) = Gjl(vi, v/) h (v1) satisfy (5.12) Gil(vi, vl) = Gli(vl vi) This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions 2054 I. CSISZAR and,by(5.11), (5.13) Fijl(vi,vj,vj) = G"(vi ) At this point we removethe temporaryexclusionof vi ji, v1 fromthe ofthe domainof Fijl; defining definition Gil(vi,vl) = 0 if vi 'a v1,(5.13) will alwaysholdwheneverFijl in (5.6) is defined. Finally,substituting (5.13) into (5.10) gives,afterrearranging, using also (5.12), (5.14) vj) = Gij(vi,vj), G6i(vi, vi) + G1j(v1, whenevervj + v + v1< 1. Moreexactly,the givenderivation of(5.14) is valid unless Vj -jl v1and a similarderivation from(5.10), withtherolesof i and j interchanged, is validunlessvi*->il vl; ifbothvi*->il v1 and v3 >l vl,then also vi ij vj and (5.14) holdstrivially. (5.14) and (5.12) mean that Lemma 4(i) is applicableto the functions Fij = expGij. It followsthat G7ij(vi,vi) = gi(vi) - gi(vi), forsuitablefunctionsgi definedon V = (0, 1). SinceGij(vi,vj) = 0 if vi +->ijvj andthus,in particular, if vi = v0,vj = here gi(v ) is independentof i, and we may assume that gj(v0) = 0, i = 1,... I n. Substituting (5.15) into(5.13), we obtain (5.15) (5.16) Fijl(vi, vjvl) = g(vi) -- g1(v1) g1(v3) g1(v1) whenevertheleft-hand sideis defined. As thefunctionsFijl are continuous, so are the gi's, too. Comparing(5.16) with(5.6), we obtain(5.2). Havingestablishedour firstclaim,we define v (5.17) fi(v) = f g(t) dt, v? n F(v) = E i=l fi(vi). We willshowthat f1,..., fn forma standardn-tuple[noticethatproperty (i) in Definition 8 obviouslyholds]and that F(v) generatesII. With(5.17), (5.2) becomesa - gradF(v). As the vectora in (5.1) is determinedup to equivalenceonly,it followsthat(forarbitrary v 0 vo)L(v) is the set ofall w E S satisfying (5.18) (gradF(v) )T(W - V) = 0. Observenextthat forany L E -t we have by the consistency axiomand Lemma1 thatfl(L) = v ifand onlyifv e L c L(v), providing v # v?, whereas ifv? E L, thenalways[1(L) = v?. Thus we mayassertwithoutanyrestriction This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2055 on v and forarbitraryL E t, that [1(L) = v if and onlyifv E L and (5.18) fulfilled ifv = vo). holdsforeveryw E L (the latterbeingautomatically For any distinctv and w in S satisfying (5.18), the result in the last paragraphappliedto the line L' throughv and w givesthat (grad F(w))T(w - v) # 0; herewe used the factthatthe difference ofanytwoelementsof L' is a scalar multipleofw - v. By continuity, this nonzeroinnerproductmust be of constantsign for w 0 v satisfying (5.18),whenv E S is fixed.Further,againbycontinuity, this sign cannotactuallydependon v, either.Withoutrestricting generality, we may assume that this constantsign is positive.Indeed,the functionsgi in (5.17) maybe multiplied by(-1) ifnecessary,withoutchangingtheirproperties assertedin our firstclaim. We have obtainedthat(5.18) withw 0 v alwaysimplies (grad F(w))T(w - v) > 0. (5.19) It followsthat forany line L' in S, F(w) is strictlyincreasingas w moves away fromv = 11(L') in eitherdirection.This immediately givesthat F has property (iii) in Definition 8. Further,foranyLEYE, thefactthat F(w) is strictly increasingas w moves awayfromv = 11(L) on anyline L' c L provesthat H(L) is the uniquepoint whereF is minimizedoverL. Thus fl is generatedby F. To provethat (ii) in Definition8 also holds,noticefirstthat as v = v? trivially satisfies(5.18) foreveryw e S, (5.19) gives (grad F(w)j)T(W- v?) > 0 forallw # v' in S. (5.20) IfS = R' or A, let v = ( vl,... , Qn)" be a boundarypointof S such that = s 0 if j 0 i, further,fj'(Vj) f/(01) v forsome j and 1 different vi 0 and VQ fromi. (5.20) impliesthat liminf(grad F(w)"( v w -r )>0; V hence f'(v) is bounded from above as v -- 0. Thus, if the assertion = -mowere false, fi' would have a finitelimiton a suitable limv 0/f7'(v) sequence of positivenumbersapproaching0. Then therewould exist a sequence vk -e v (withvk e S) such that gradF(vk) -- a, say, wherea is not equivalentto 0. This providesthe desiredcontradiction becausethe sequence ofsubspaces Lk = (w: (gradF(vk)j)T(w-vk) = 0, wES} Es X convergesto L-={w: and the sequence[(Lk) aT(w = -_ ) = 0, w e S} E, g to 11(L), forit goesto v Vk does notconverge This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions S. I. CSISZAR 2056 This completesthe proofof the assertionsmade in the passage containing (5.17). Finally,let fl,..., fnbe any standardn-tuplewith0 at v0 = (v?,.. ., and set F(v) = E 1fi(vi). Notice firstthat in the cases S = Rn or Rn property (iii) in Definition8 impliesthat fi(v) is strictly increasing(decreasthelatter, ing)forv > v? (v < v) and also that fi(v) Xc as v -- oo.To verify that fi(v) - c < ooas v xc, choosevi> vf', > v) with supposeindirectly fi(vi)+ fj(vj) = c, and apply(3.2) to v = (v,..., vn)T withv, = vI for1 # i, j and v' = (vi, ... V,v)T withv' = v' for1 # i. It followsthat (5.21) fi(avi + (1 - a)v') + fj(avj + (1 - a)vJ) < f-(vi) + fj(vj), foreveryvi > v? and 0 < a < 1. Lettingfirstvi -> and then a -* 0, (5.21) resultsin thecontradiction c < fi(vi).One verifies in the samewaythatin the case S = R , fi(v) > ooalso as v or An, F(v) has a limit . If S = (finiteor + cm)as v convergesto a boundarypointof S, becauseproperty (ii) in Definition8 impliesthe monotonicity of each fi near 0. It followsthat F, extendedby continuityto the closure of S if S = R+ or A attains its minimumon any L e /, moreexactly,if S = R n or An,on the closureof L. (ii) rules out the minimumbeingattainedon the boundary.By But property property (iii),the pointwhereF attainsits minimumoverL mustbe unique; hence F does generatea selectionrule [I. Property(i) impliesthatthis HI is regularand it is obviouslylocal. (ii) Let us be givena regular,local projectionrule withcomponentprojection rules fl( Iu), u E S. For arbitrary v ? u, let L(vlu) denotethe unique L E X#withrl(LIu) = v. We claimthatthe following modified versionofour firstclaimin the proofofpart(i) is valid: There existfunctionsgi(vIu), u, v E V, continuousin v and vanishingfor v = u, i = 1,.. , n, such that in the representation(5.1) of L = L(vlu) we have T U a - (g(V1 UI 1), (5.22) gn(Vn I n)) In the proofofpart(i), the functionF generating the selectionrule LI was constructed fromfunctionsgi withthe property (5.2). If (5.22) is established, the functionsgi(vlui) therecan be takenas functionsgi(v) corresponding to fl = H(I Iu). Then [cf.(5.17)] Hf(Iu) willbe generatedby (5.23) n F(vlu) = E fi(vilui), i=1 fi(vilui)= f v. ui gi(vlui)dv as a functionofv. This means,by definition, thatthe givenprojectionrule is generatedby F(vlu). Thus it suffices to provethe claimabout(5.22). This can be donealongthe linesofpart(i); hencewe onlysketchthe proof,forthe hardcase S = An . We needan obviousmodification ofLemma3, namelythatforL as in (4.1) with rl(Llu) = v # u, (a) (vilui) ij (vjluj) iffai = aj and (b) the vectoraj is determined by uj and vj up to equivalence. This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2057 Applying thisto J = {i, j, 1),it followsthatthe functions (5.24) Fijl(vi,vj,vljui,uj, ul) = a, - a1 are well definedand continuousin vi,vj,v1 subjectto ui + u; + u < 1, vi + vj + v, < 1, exceptwhen(vjIuj) -jl (vilu,). The functions(5.24) satisfyfunctionalequations similarto (5.7)-(5.10), each variablevi in thelatterbeingreplacedbya pairofvariablesvi,u i, where witheach constrainton sums of variablesvi a similarconstrainton sums of variables u is imposed.This systemof functionalequationscan be solved similarlyto (5.7)-(5.10), again applyingLemma4(u)threetimes.It is convein Lemma4 werenot requiredto be nientthatthe variablesofthe functions reals; presentlywe have to let them stand forpairs of real numbersv,u. of the functions(5.24) analogousto Finally,we arriveat a representation (5.16), namely, (5.25) gi(vilui) -- g1(v1ul) riJ~U~UJ,U1tt~U1,It1 Fiji (vi,ij, VIlgi,X j,Xal) = g1(v1iu1) =gj(vjtu1) part Comparing (5.24) and (5.25) provesthedesiredrelation(5.22) and thereby (ii) ofTheorem1. (iii) Suppose that F(v) = E ,7 1fi(vi) and F(v) =E f(vi) generatethe same selectionrule,where(f1,. , fn) and ( f,..., fn) are standardn-tuples. to prove fi = cfi forthecase when( fl,. . ., fn) is arbitrary, Clearly,it suffices with0 at vo = (v?,.. ., v')T, say,and ( to the (regufn) is constructed lar,local) selectionrule 11generatedby F as in theproofofpart(i) [cf.(5.17)]. Now,since F generatesII, its minimumon L = L(v) is achievedat the point v. Hence foreveryv + v' we have (gradF(v))T(w - v) = 0 forall w E L(v). Since L(v) is the set ofall w E S satisfying (5.18), it followsthatforv + v', (5.26) (5.26) gradF(v) =A(v)gradF(v) gTad A(v)gTadF(v) ifS ifS = Rn orRn+, F(v) = gradF(v) = A(v)gradF(v) + i(v)1 if S = An; the same holdstrivially also forv = v?, withj,(v?) = 0 in the case S = An. As thecomponents ofthegradientvectorsdependonlyon thecorresponding componentsof v, the scalar functionsA and ,u in (5.26) and (5.27) mustbe constantand, in particular,,u in (5.27) is identically 0. Thus we actuallyhave (5.27) gradF(v) = c gradF(v). Since f1,..., fn and f1,..., f, have the property (i) in Definition8, (5.28) rules impliesthat fi = cfi,as claimed.The proofofassertion(iii) forprojection is similar.El (5.28) PROOF OF THEOREM 2. By theproofofTheorem1(i), anyregularand local projectionrule is generatedby a functionas in (5.23), where gi(vlu) is a continuousfunctionof v thatvanishesat v = u, i = 1,. . . , n; moreover, for anyu # v thesubspaceL definedby(5.1) withai = gi(viIu1)has theproperty This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions I. CSISZAR 2058 of Lemma 3(i) used in that proof,it that Il(Llu) = v. By the modification followsthat (vilui) ij (vjluj) ifandonlyif gi(vilui) =gj(vjluj) (5.29) providedin the case S = An that ui + uj < 1, vi + vj < 1. ruleis semisymmetric, Now,if S = Rn or Ri+ and thegivenprojection that is, (vlu) jj (vyu) foreveryu and v in V and everyi, j E {1, ... , n}, (5.29) impliesthatthe functionsgi do not dependon i; hence,by (5.23), neitherdo the functionsfi. Further,if S = R n or Ai and thegivenprojection ruleis statistical, thatis, (vyu) * i1(v'lu')ifand onlyif v/u = v'/u',(5.29) impliesthat (5.30) (5 30) g~vJ~ = g(~u)if~~gi(vilui) gi(vilui) i V. Vi providedin the case S = An that ui + uj < 1, vi + vj < 1. Actually,the last constraintcan be dispensedwith, because for any given u1,1U jv , vj [in V = (O,1)] thereexist Uk, Vk such that ui + Uk < 1, vi + Vk < 1, uj + Uk < 1, Vj + Vk < 1 and vj/ui = Vk/U. (5.30) means that gi(vlu) is a one-to-one functionof v/u, notdependingon i, thatis, gj(vlu) = g (5.31) ). The continuityof gi as a functionof v and the one-to-onepropertyof g implies that g(t) is a continuous,strictlymonotonicfunctionof t, and g(l) = gi(u lu) = 0. Substituting (5.31) into(5.26) gives (5.32) = Uif fi(vijui) = |dv ) f (t) = fg(s) ds. This completesthe proofofTheorem2. o PROOFOF THEOREM3. (i) First,we showthatifa regularselectionruleis thenfordistinctelementsu, v,w of S, subspace-transitive, (5.33) L(vlu) n L(wlv) c L(wlu) ifwEE L(vju). Here,as in theproofofTheorem1(i), L(vlu) denotestheunique L E X. with H(LIu)= v. Indeed, let L' = L(vlu) n L(wiv). Then [I(L'iv) = w by the consistency axiom,and the transitivity postulateappliedto L' c L = L(vlu) gives fl(L'u) = [l(L'Iv) = w. But Hl(L'lu)= w impliesthat L' c L(wlu), by the corollaryof Lemma 1, proving(5.33). This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2059 Now let us be givena regular,local projectionrule,generatedby F(vlu) as in (5.23). In particular, (5.34) gi(vlu) = -fi(vju) dv is a continuousfunctionof v, vanishingat v = u. Then L(vlu) consistsof thosew E S thatsatisfy n (5.35) E i-l gi(vilui)(wi - vi) = 0. Hence,ifthe givenprojection ruleis subspace-transitive, it followsfrom(5.33) thatforany distinctu, v,w in S satisfying there existscalars a, ,l, y, (5.35), possiblydependingon u, v,w, withy = 0 unless S = An, suchthat (5.36) i - 1,..., n. agi(uilui) + fgi(wi(vi) + y = gi(wilui), We claimthatactually (5 37) gj(vlu) + gj(wlv)= gj(wlu), foreveryu, v,w in V and i 1,.. . , n. Clearly,it sufficesto provethis for = i = 1. Considerfirstthe simplercases S = Rn or R+ . Then forany u, v,w in V, thereexistu,v,win S satisfying(5.35) suchthat u1 = u, v1 = v, w1 = w and, in addition, = (5.38) U2 = V2 + W29 U3 ? V3 W3. Withthese u,v,w, (5.38) impliesthat in (5.36), wherenow y- 0, we have a = f = 1 [usingthat,by Lemma3, gi(vlu) # 0 if v + u]. This proves(5.37) fori = 1. If S = An then v # u does not necessarilyimplygi(vlu) / 0. It follows, however,from(iii) in Definition 8, thatforat mostone index i can u < 1/2 and an intervalI c (0, 1/2) be foundsuch that fi(vlu)is constantforv E I. This implies(assuming,withoutany loss ofgenerality, thatthe exceptionali, if any, is different from2 and 3) that for any 8 < 1/2, the numbersE satisfying (5.39) g2(Ej8) + 0, g3(cJ) # 0 are densein theinterval(0, 1/2). Usingthis,it is easyto see thatforanyfixed v and w sufficiently close to v, thereexist u, v, w in S = An satisfying (5.35), such that u1 = u, v1= v, w1 = w and, withsome E and 8 satisfying (5.39), ui, (5.40) U2 = V2 = U3 W2 = V3 = W3 =? and (5.41) U4 = V4 = W4. Withtheseu, v,w, (5.41) impliesthatin (5.36) we have y = 0, then(5.40) and This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions 2060 I. CSISZAR (5.39) implythat a = f3= 1. This proves(5.37) (fori = 1) if w is sufficiently closeto v. Observenext that givenany u and v in V = (0,1), the numbersw E V satisfying (5.37) fori = 1 forman openset. Indeed,bythe last paragraph,for closeto w we have w' sufficiently gl(w'lw) + gl(wlv) = g1(w'v), g1(w'Jw)+ gl(wlu) = gl(w'Iu); it followsthatif w satisfies(5.37) fori = 1, thenso does w'. of w, the(nonvoid)set Since gl(wlu) and gl(wlv) are continuousfunctions of those w E V that satisfy(5.37) fori = 1 can be open onlyif it equals the wholeV = (0, 1). This completesthe proofof(5.37) in the case S = An The functionalequations(5.37) implythat the functionsgi can be representedas (5.42) gi(VIU) = iV(V) - i(u), whereit may be assumed that Ofi(s)= 0, i = 1,... , n, forsome s E V [set i(v) = gi(slv), say]. Since gi(vlu) is a continuousfunctionof v, i mustbe continuous,too.Writing (5.43) (Pi(v)= v i(t) dt, the functions'Pi .. ., (Pn satisfy(i) in Definition 8, withvo = s 1, and (in the cases S = R n or An) the validityof (ii) in Definition8 forthe functions'pi followsfromthat forgif(lu), by (5.42). Property(iii) for (Pl, Pn will,of of C4(v)= E ' 1= that we course,followfromthe strictconvexity are going li(vd) to verify immediately. From(5.34), (5.42) and (5.43) we get (5.44) fi(vIu) =] v gi(tlu) dt = fi(v) - fi(u) - M(u)(v - u). This provesthat F(vlu) = E= fi(v(Iu ) has the claimedform F(vlu) = ?F(v) - ?F(u) - (grad?(u) ) T (V - U). Since F(vlu) 2 0, withequalityifand onlyifv = u, thisresultalso provesthe strictconvexity of CFon S. (ii) Let (P1 'Pn be a standardn-tuplesuch that C(v) = E=>pi(vi) is convexon S, and let fi(vlu)be definedby(5.43) where 0i((u)= f(u). strictly Then foranyfixedu E S, the functionsfif(Iun) forma standardn-tuplewith 0 at u; hencebyTheorem1(ii),F(vlu) = E. lfi(vilui) generates a regular, local projectionrule. To provethat the latteris parallel-transitive, suppose that v = H(LIu), w = H[('Iv), where L and L' are "parallel" subspaces. Then gradF(vlu) and gradF(wlv) are both orthogonalto these subspaces (wherethe gradientrefersto the firstvariable)and henceso is gradF(wlu) = gradF(vlu) + gradF(wlv). This meansthat l(L'Iu) = w, as claimed. This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2061 To provethe corollary, recallthatbyTheorem2, everystatisticalprojection ruleis generatedby an f-divergence F(vlu) z f Ui where f(t) is a continuouslydifferentiable functionwith f(1) = f'(1) = 0. Then the functionsgi(vlu) in (5.34) do not dependon i and are equal to g(v/u), whereg(t) = f'(t). If thisprojectionrule is transitive, we musthave g()q = 1(v) -q(u) [cf.(5.42)] forsome continuousfunctionq. This impliesthat g satisfiesthe functional equation g(ts) =g(t) +g(s), t,s eR+, whose onlycontinuoussolutionsare g(t) = c logt [cf.Aczel (1966), Section 2.1.2]. Hence f(t) = Jtg(t)dt = c(tlogt - t) and this means that F(vlu) = cI(vIIu). o (5.45) PROOFOF THEOREM4. (i) First we show that a (regular, local) projection rulewithbasic set S = R', generatedby (5.46) n F(vlu) = E i=l fi(vilui) as in Theorem1(i), is translation-invariant ifand onlyif (5.47) fi(v + Alu + A) = c(u) fi(vlu), foreveryu, v and ,Ain R, wherec(,u) is a suitablepositive-valued function. Indeed,sincev* = fl(L + ullu + Al) minimizesF(vlu + Al) subjectto v E L + ,ul, v* - ,ul minimizesF,(vlu) = F(v + ,Illu + Al) subject to v E L. Hence the givenprojectionrule is translation-invariant, that is, Il(Llu) = v* - 4l, if and onlyif this projectionrule is also generatedby Fu(vIu).The latteris, by Theorem1(iii),equivalentto (5.47), as claimed. By Theorem3, if(5.46) generatesa tranlsitive projection rule,then (5.48) fi(vIu) = gPi(v)-si(u) - PJ(u)(v - u), where(,P.. ., Spn) is a standardn-tuple.By the paragraphcontaining(5.43), we mayassumewithoutanyloss ofgenerality thatthisstandardn-tuplehas 0 at v? = 0. Our nextgoal is to determine whatfunctionsfi(vlu) ofform(5.48) satisfy(5.47). Observethat (5.47) implies c(GL1+ tL2) = c(t1)c(A2) and that (5.47) and (5.48) implythe continuity of c(,u).It followsthat c(,u) = el [cf.Aczel(1966), Section2.1.2]. (5.49) forsomef3e R This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions I. CSISZAR 2062 From(5.47) (with,u = -u) and (5.49), we obtain (5.50) = ePufi(v fi(vIu) ulO); - thisand (5.48)-where, by assumption,(pi(O)= fp(O)= 0- resultin (5.51) fi(vlu) = e- a). - i(V by v, it followsthat Comparing(5.48) and (5.51) and differentiating (p'(v) - (p'(u) = epuSp(v - u). (5.52) This means, substitutingv = u + t, that ifr= < satisfiesthe functionalequa- tion (5.53) f(u + t) = qf(u) + ePUif,(t). This functional equationis solvedeasily.First, (5.54) +lr(t)= at if,B= 0 [cf.Aczel (1966), Section 2.1.2]. For f3 0 0, observethat (5.53) impliesby symmetry +fi(u) + eiSulk(t) = +(t) + ept`O(u); thus ept - 1 +f(t) q_(u) =eBU - 1 This gives 1) if,B=kO. Thus (p' mustbe eitherofform(5.54) or ofform(5.55), wheretheconstant factora (but not ,B)maydependon i. Clearly,the positivity or negativity of these factors,accordingas p 2 0 or p < 0, is necessaryand sufficient for gettinga standardn-tuple(Pi,. I n fortranslationinvariFinally,just as (5.47) was necessaryand sufficient if ance,one sees thatthe projectionrulegeneratedby(5.46) is scale-invariant and onlyif (t (5.55) a(ept - (5.56) fi(AvlAu) = c(A) fi(vIu), foreveryA > 0. Now, fi(vlIu)definedby(5.48) with'p' ofform(5.55) does not satisfy(5.56), whereaswith'p ofform(5.54) it does. In the lattercase (5.46) becomes n (5.57) F(vju) = E i=l ai(vi - ui)2, ai > O,i = 1,...,n, thatis, onlythe weightedleast squaresprojectionrulesare transitiveas well as location-and scale-invariant. impliesthatall By Theorem2(i), the additionalpostulateofsemisymmetry in coefficients (5.57) must be equal. ai This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2063 (ii) Supposenowthat S = R+ and a projection rulegeneratedbya function and scale-invariant. Then(5.48) and (5.56) hold;in the as in (5.46) is transitive former we nowsupposethatthestandardn-tuple(pl, .. ., 9n)has 0 at vo = 1. As (5.56) impliesthat c(AlA2)= c(A1)c(A2)and (5.48) and (5.56) implythe of c(A), we have c(A) = Aa forsome a e R [Acz6l(1966), Section continuity 2.1.2]. Hence from(5.56) (with A = l/u) and (5.48) [wherenow 'i(l)= po(1) = 0], we obtain (5.58) fi(Vju) = Oaf - 1i) = U'9i Comparing(5.48) and (5.58) and differentiating by v, it followsthat = u q'i(U) -V) fi u v = tu, or,substituting (5.59) p'i(tu) = 0'i(u) + U-'p,(t). Since (5.59) means that p(t) = '9(et) satisfiesthe functionalequation(5.53) (with ( = a - 1), we obtain from(5.54) and (5.55) (5.60) 'p4(t) failog t, a,=(ta- if a =1, - 1), if a 0 1. Clearly,(5.60) with (pi(l) = 0 definesa standardn-tuplefor S = Rn iff a < 1 and, in addition,a > 0 in the case a = 1 or ai < 0 in the case a < 1. Withthis choiceof pi,(5.48) becomes fi(vlu) = lailha(Vlu),with ha defined by (3.7). On the otherhand,(5.46) withthese functionsfi does generatea transitiveand scale-invariant projectionrule. If this projectionrule is also it followsby Theorem2(i) that it is generatedby (3.8). The semisymmetric, proofis complete.0 PROOFOF THEOREM5. Supposethat S Rmn,Rn or Ain the elements of S beingrepresentedas v= {vij), i = 1,...,m, j = 1, . . ., n. Let H be a regular,local selectionrulewithbasic set S. By Theorem1, HIis generatedby a function (5.61) F(v) = m n E E fii(Vii) i=lj=1 wherethe functionsfij forma standardmn-tuplewith 0 at vo {v0} = v?(11).In particular, the functionsgij(t) = (d/dt) fj(t) are continuousand - (5.62) fij(V&?) = gij (v9j) = 0. Considerthe subspaces Lv definedby (2.13), that is, Lv= {w: w =v; = v}. Then if HI(Lv) = v forsomev E S, thatis, ifthe minimum of(5.61) on Lv is attainedat v, we have This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions I. CSISZAR 2064 Ai), . It followsfrom(5.63) that forsuitable"Lagrangemultipliers" (5.64) + gkl(Vkl) gij(Vij) = + gkj(vkj), gil(vil) forevery i,j,k,1. then Nowwe proceedto provepart(i) ofTheorem5. If fl is sum-consistent, by Definition7, rl(L,) = v always holds ifv is of sum form,vij = si + tj. Thus (5.64) givesthe systemoffunctional equations (5.65) gij(si + tj) + gkl(sk + tl) = gil(si + tl) + gkj(Sk + tj). of the (continuous) Observefirstthat (5.65) impliesthe differentiability bothsidesof(5.65) can for this be functions seen, example, by integrating gij; withrespectto Sk* bothsides of(5.64) by tj on the one hand and by Sk on the Differentiating other,we obtain i'j(si + tj) = gkj(Sk + tj) = gjl(Sk + tl). This meansthat g' equals the same constantc foreveryi, j; hence gij(v) = cv + dij. (5.66) Recalling(5.62), it followsthat (5.67) gjj(v) = c(v - V?)g = fij(V) _ (u V)2. it mustbe a least squares selection This provesthat if 17is sum-consistent, rule. Recall that,as remarkedafterDefinition7, vo mustbe of sum form. it is easy to see thatthe least squares selectionruleswithvo of Conversely, sum formare sum-consistent. The resultjust provedeasilyimpliesthatthe only(regular,local) sum-consistentprojectionrule is the least squares projectionrule. In fact,let this projectionrulebe generatedby m F(vlu) n = i=l j=1 f fij(VijUij), wherethe functionsfij( luij) forma standardn-tuplewith0 at u forevery fixedu = (uijs E S. Then (5.67) givesthatif u is of sum form,the termsof thisstandardmn-tuple mustbe constantmultiplesof(v - u ij)2. Sinceforany fixed u there exists u = {uij} of sum formwith uij = u, it followsthat fij(vlu) alwaysequals a constanttimes(v - u)2. The proofof part (ii) is similar.If fl generatedby (5.61) is product-consistent,thenby Definition 8, [l(L,) = v alwaysholdsifv is ofproductform, = Thus (5.64) gives vij sitj. + gkl(Sktl) + 9kj(Skt;)(5.68) =gil(Sitl) 9ij(Sitj) Noticethatwhereasin thecase S = Amn theconditionE vij = 1 represents on thepermissiblesi and t, in (5.68), the*equationmustcertainly a constraint bvai-oanfiei, j kg 1*i si+S 1 t+ < 1. This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions AXIOMATICAPPROACHTO INFERENCE 2065 to (5.65). The systemof functionalequations(5.68) can be transformed Namely,if the functionsgij(v) satisfy(5.68), then gij(v) = gij(ev) satisfy (5.65). Hence,from(5.66), gij(v) = gij(log v) (5.69) c log v + dij. Recalling(5.62), it followsthat gij(v) = clog V[oivlo O, fj( v) = cv log- - v + v9j. (5.70) This provesthata product-consistent selectionrulemustbe an I-divergence selectionrule(cf.Example2) withv? ofproductform.Conversely, it is easyto see that these I-divergence selectionrules are product-consistent. The assertionthattheonlyproduct-consistent projection ruleis the I-divergence projectionrule followsin the same wayas did its analogin part(i). The corollaryis immediate.o REFERENCES J.(1966). Lectureson FunctionalEquationsand TheirApplications. ACZfL, NewYork. Academic, ACZEL,J. and DAR6czY,Z. (1975). On Measuresof Information and Their Characterizations. NewYork. Academic, ALI,S. M. and SILVEY, S. D. (1966). A generalclass ofcoefficients ofdivergence ofone distribution fromanother.J. Roy.Statist.Soc. Ser. B 28 131-142. BREGMAN,L. M. (1967). The relaxation methodoffinding thecommonpointofconvexsetsand its applicationto the solutionof problemsin convexprogramming. U.S.S.R. Comput. Math. and Math. Phys.7 200-217. VANCAMPENHOUT,J. M. and COVER, T. M. (1981). Maximumentropy and conditional probability. IEEE Trans. Inform.TheoryIT-27 483-489. CENSOR, Y. (1983). Finiteseries-expansion reconstruction methods.Proceedings oftheIEEE 71 409-419. CENSOR, Y. and LENT, A. (1981). An iterative row-action methodforintervalconvexprogramming. J. Optim.TheoryAppl.34 321-353. CsiszAR,I. (1963). Eine Informationstheoretische Ungleichungund ihre Anwendungauf den Beweis der Ergodizitiit von Markoffschen Ketten.Publicationsof theMathematical Institute oftheHungarianAcademyofSciences8 85-108. CsisZAiR, I. (1967). Information-type measuresofdifference ofprobability distributions and indirectobservations. Studia Sci. Math. Hungar.2 299-318. I. (1984). Sanovproperty, CsIsZAR, generalizedI-projection and a conditional limittheorem.Ann. Probab.12 768-793. CsIsZAR, I. (1985). An extendedmaximumentropyprincipleand a Bayesianjustification (with discussion).In BayesianStatistics2 (J.M. Bernardo, M. H. DeGroot,D. V. Lindleyand A. F. M. Smith,eds.) 83-89. North-Holland, Amsterdam. CSISZAR, I. and TUSNADY, G. (1984). Information and alternating geometry minimization procedures.Statist.DecisionsSuppl. 1 205-237. DEMPSTER, A. D., LAIRD, N. M. and RUBIN, D. B. (1977). Maximum likelihood fromincomplete data via theEM algorithm. J. Roy.Statist.Soc. Ser. B 39 1-37. DIACONIS, P. and ZABELL, S. L. (1982). Updatingsubjective probability. J. Amer.Statist.Assoc. 77 831-834. HERMAN, G. T. and LENT, A. (1976). Iterativereconstruction algorithms. in Biology Computers and Medicine6 273-294. This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions 2066 I. CSISZAR telephony basedon themaximumlikelihood F. and SAITO,S. (1968). Analysissynthesis ITAKURA, Congresson Acoustics(Y. Kohasi,ed.) method.In ReportsoftheSixthInternational 17-20. Tokyo,Japan. oftheIEEE 70 entropy methods.Proceedings JAYNES,E. T. (1982). On therationaleofmaximum 939-952. entropyprinciplesfor theoreticderivationof logarithmic JONES, L. K. (1989). Approximation methodto incorporate ofthemaximum entropy inverseproblemsanduniqueextension SIAM J. Appl. Math.49 650-661. priorknowledge. C. L. (1990). Generalentropycriteriafor inverseproblems,with JONES, L. K. and BYRNE, and clusteranalysis.IEEE patternclassification applicationsto data compression, Trans.Inform.TheoryIT-36 23-30. minimum-distance feasiblehigh-resolution JONES, L. and TRUTZER, V. (1989). Computationally method.InverseProblems5 749-766. whichextendthemaximum-entropy procedures Theoryand Statistics.Wiley,NewYork. KULLBACK, S. (1959). Information Distances.Teubner,Leipzig. LIESE, F. and VAJDA,I. (1987). ConvexStatistical and entropyin incomplete-data MILLER, M. I. and SNYDER, D. L. (1987). The role of likelihood and Toeplitzconstrained intensities point-process to estimating Applications problems: oftheIEEE 75 892-907. covariances.Proceedings Internaofmaximumentropy. A. (1990). A noteon theinevitability J. B. and VENcOvsKA, PARIS, tionalJournalofInexactReasoning4 183-223. in statistical measuresand its application ofa set ofprobability PEREZ, A. (1984). "Barycenter" decision.Compstat Lectures154-159.Physica,Heidelberg. derivationof the principleof maximum SHORE, J. E. and JOHNSON, R. W. (1980). Axiomatic entropy.IEEE Trans. Inform.Theory entropyand the principleof minimum-cross IT-29 (1983) 942-943.] IT-26 26-37. [Correction SKILLING, J. (1988). The axioms of maximumentropy.In MaximumEntropyand Bayesian Methodsin Scienceand Engineering1 173-187.Kluwer,Amsterdam. VARDI, Y., SHEPP, L. A. and KAUFMAN, L. (1985). A statisticalmodel for positronemission tomography (withdiscussion).J. Amer.Statist.Assoc.80 8-35. MATHEMATICALINSTITUTE OF THE HUNGARIANACADEMYOF SCIENCES BUDAPEST, P.O.B. 127 H-1364 HUNGARY This content downloaded from 152.2.104.17 on Wed, 3 Jul 2013 17:53:26 PM All use subject to JSTOR Terms and Conditions
© Copyright 2024