Gaussian processes and Bayesian moment estimation∗

Jean-Pierre Florens† (Toulouse School of Economics)    Anna Simoni‡ (CNRS - CREST)

March 19, 2015

– PRELIMINARY and INCOMPLETE –

Abstract

When a large number of moment restrictions is available for the estimation of a finite dimensional parameter θ, some restrictions may be more important or credible than others. In these situations it might be desirable to weight each restriction according to our beliefs. This is automatically implemented by a Bayesian procedure. We develop, in this paper, a semiparametric Bayesian approach to moment estimation that is able to impose over-identifying moment restrictions via the prior. We show how to construct a nonparametric prior with support equal to the subset of data distributions F that satisfy the moment restrictions. We establish that a Gaussian process (GP) prior for the density function associated with F is particularly convenient for imposing over-identifying restrictions and allows us to obtain a posterior distribution in closed form. We provide a frequentist validation of our procedure by showing consistency of the maximum a posteriori estimator for θ and asymptotic normality of the posterior of θ.

∗ First draft: February 2012. The authors are grateful to participants in seminars and conferences in: Bristol, Carlos III - Madrid, CREST, NASM 2012 at Northwestern, AFSE 2012 in Paris, Toulouse, ICEEE 2015. We acknowledge financial support from ANR-13-BSH1-0004 (IPANEMA). Anna Simoni gratefully acknowledges financial support from Labex ECODEC (ANR-11-LABEX-0047) and financial support and hospitality from the University of Mannheim through the SFB-884.
† Toulouse School of Economics - 21, allée de Brienne - 31000 Toulouse (France). Email: [email protected]
‡ CREST, 15 Boulevard Gabriel Péri, 92240 Malakoff (France). Email: [email protected] (corresponding author).

Key words: Moment conditions, Gaussian processes, overidentification, posterior consistency.
JEL code: C11, C14, C13

1 Introduction

In many practical applications and empirical economic studies a large set of moment restrictions characterizing the parameter of interest is available. Examples are provided for instance in Cazals et al. [2005] and Fève et al. [2006]. Such a situation is complicated to manage since it requires cumbersome computations due to the large number of moment restrictions. Moreover, the researcher often does not believe equally in all the restrictions. It is therefore desirable to have a procedure that assigns a specific weight to each moment restriction based on the researcher's beliefs and that updates it (and possibly decides whether or not to include a restriction) based on the information contained in the data. This is easily performed by a Bayesian procedure, where each moment restriction can receive a different weight through the prior and posterior distributions.

The purpose of this paper is to develop a Bayesian approach to estimate a structural parameter θ characterized by over-identifying moment conditions. In such a framework, the researcher has only limited information on the data generating process (DGP), which is given by a set of moment conditions. Any parametric specification of the DGP is, therefore, completely arbitrary. In this paper we do not restrict the DGP except for requiring that it satisfies the moment restrictions. Thus, we construct a nonparametric prior with support equal to the subset of data distributions that satisfy the moment restrictions.
Let x be a random element in Rm with distribution F and x1 , . . . , xn be an i.i.d. sample of x. We are interested in a parameter θ ∈ Θ ⊂ Rk which is linked to F through the relation (moment restrictions) EF [h(θ, x)] = 0 (1.1) where h is a known function with values in Rr . This model is semiparametric since it includes a finite dimensional structural parameter θ and a functional parameter F which, apart from the moment restrictions, is not at all constraint. Contrarily to the classical generalized method of moments (GMM) which imposes the moments restrictions directly in the DGP, we impose the moment restrictions via the prior for (θ, F ). Hence, the random parameter generated from the prior satisfies the moment restrictions by construction. Imposing moment restrictions via semiparametric priors may be challenging depending on the relationship existing between θ and F . More precisely, 2 when the model is just-identified, that is k = r, (1.1) characterizes θ as an explicit function of F : θ = B(F ), where B is a function defined on the space of probability distributions. For particular functional forms of B, the prior of θ may be easily recovered from an unrestricted prior of F and (θ, F ) generated by this prior automatically satisfy the constraints. On the contrary, in an overidentified model where k < r, θ cannot be expressed as an explicit function of F . More precisely, (1.1) imposes constraints on F and the existence of a solution θ to (1.1) is guaranteed only for a subset of distributions F . In a Bayesian approach this entails that the prior on F must be specified conditionally on θ and its support must be a proper subset of the set of probability distributions. However, in an overidentified model, the restrictions on F are in general complicated to incorporate in its prior distribution. The approach proposed in Florens and Rolin [1994], for instance, which is based on a Dirichlet process prior distribution, presents several difficulties to treat overidentified models. An important contribution of this paper is to construct a semiparametric prior that incorporate the over-identifying moment restrictions. Our approach differs substantially from Zellner’s (1996) Bayesian Method of Moments and is more in the vain of the recent literature on Bayesian estimation under moment restrictions. Kitamura and Otsu construct the restricted prior on F by projecting a Dirichlet process prior (see Ferguson [1973, 1974]) via minimization of the Kullback-Leibler divergence with respect to the Dirichlet process prior under the moment constraint. A Dirichlet process prior has nice properties due to the fact that it is a natural conjugate of the i.i.d. model, however projection of it, necessary to treat the overidentified case, comes at the price of conjugacy. Schennach [2005] proposes a maximum entropy nonparametric Bayesian procedure which, instead of employing the Dirichlet process prior, relies on a noninformative prior on the space of distributions. Kim [2002] and Chernozhukov and Hong [2003] proposes a quasi-Bayesian approach by incorporating the moment restrictions in the pseudo-likelihood. We specify a degenerate Gaussian process (GP) prior with restricted support that is easy to deal with and that works as follows. The DGP F is assumed to admit a density function f with respect to some positive measure Π chosen by the econometrician (for instance the Lebesgue measure). Then, we endow f with a GP prior conditional on θ. 
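To fix ideas, the distinction between the two cases can be illustrated informally with the mean example of Remark 2.2 below and the exponential example of section 5.3; the display is only an illustration, not part of the formal development.

\[
k=r=1:\qquad h(\theta,x)=\theta-x \;\;\Rightarrow\;\; \theta=B(F)=\int x\,\mathrm{d}F(x),
\]
\[
k=1<r=2:\qquad h(\theta,x)=\bigl(\theta-x,\;\theta^{2}-x^{2}/2\bigr)' \;\;\Rightarrow\;\; \text{a solution }\theta\text{ of (1.1) exists iff } \mathrm{E}^{F}(x^{2})=2\bigl[\mathrm{E}^{F}(x)\bigr]^{2}.
\]

In the second case no θ solves the system for an arbitrary F, so the conditional prior of f given θ has to put all its mass on densities satisfying the constraint at that value of θ.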
Because this prior imposes the moment restrictions it will be degenerate on a proper subset of the parameter space. The essential reason for the appropriateness of a GP prior in such a framework is due to the fact that the moment functions in (1.1) are linear in f and the linearity of the model matches extremely well with a GP prior. Thus, the (over-identifying) moment restrictions can be incorporated in an easy way by constraining the prior mean and prior covariance of f . An advantage of our method is that, in both the just-identified and overidentified cases, the moment restrictions are imposed directly through the (conditional) 3 prior of f (given θ) without requiring a second step projection as in Kitamura and Otsu. At the best of our knowledge this prior has not been used yet in the GMM framework. In the overidentified case we first specify a prior on θ and then a GP prior on f condi- tional on θ. In the just-identified case we may either proceed as in the overidentified case or specify an unrestricted GP prior on f and deduce from it the prior for θ through the explicit relationship θ = B(f ). We circumvent the difficulty of specifying the likelihood function, which is not available, by constructing a linear functional transformation of the DGP F , say rn , that has an asymptotic Gaussian distribution. This construction has the advantage to provide a conjugate model (because the prior of f given θ is also a GP) without the risk of misspecifying the sampling distribution. In other words, our approach stays nonparametric in F but allows easy computations. In the following we call our method the GP-approach. We propose as an estimator for θ the maximum of the marginal posterior distribu- tion of θ (obtained by integrating out f ). This is usually not available in closed-form but can be easily computed via MCMC methods. Finally, we provide a frequentist validation of our method by showing: frequentist consistency of the maximum a posteriori estimator, and frequentist consistency and asymptotic normality of the posterior distribution of θ. The paper is organized as follows. The GP-approach is described in section 2. In section 3 we analyze asymptotic properties of the posterior distribution of θ and of the maximum a posteriori estimator. There exists a link between our approach and the GMM approach for moment estimations. We stress this link in section 4. In section 5 we detail how to implement our method for both the just identified case and the overidentified case. The GP-approach 2 Let x be a random element in S ⊆ Rm with distribution F and x1 , . . . , xn be an i.i.d. sample of x.1 Assume that F is absolutely continuous with respect to some positive measure Π (e.g. the Lebesgue measure) with density function f . In other words, conditionally on f the data are drawn from F : x1 , . . . , xn |f ∼ F . The set of probability density functions on S with respect to Π is denoted by M . Let θ ∈ Θ ⊆ Rk be the parameter of interest characterized by (1.1). Throughout the paper we denote the true value of θ by θ∗ , the true DGP by F∗ and its density with respect to Π by f∗ . Hence, EF∗ (h(θ∗ , x)) = 0. We endow S ⊆ Rm with the trace of the Borelian σ-field BS and we specify Π as a positive measure on this subset. We denote by E = L2 (S, BS , Π) the Hilbert space of square integrable functions on S and by BE the 1 The DGP for x could be more general than an i.i.d. sampling process but we focus on this case for simplicity. 4 Borel σ-field generated by the open sets of E. 
The scalar product and norm on this space are defined in the usual way and denoted by h·, ·i and || · ||, respectively. Given a value for θ, a conditional prior for f |θ is specified such that its support is equal to the subset of M of functions that satisfy (1.1) for this value of θ. Hence, the parameters of the model are (θ, f ), with f the nuisance parameter, and the parameter space is Z Λ = (θ, f ) ∈ Θ × EM ; h(θ, x)f (x)dΠ = 0 , EM := E ∩ M, where h : Θ × Rm → Rr is a known function that satisfies the following assumption. Assumption 2.1. (i) The true f∗ satisfies f∗ ∈ EM := E ∩ M ; (ii) the moment function h(θ, ·) satisfies hi (θ, ·) ∈ E for every i = 1, . . . , r and for every θ ∈ Θ, where hi denotes the i-th component of h. Assumption 2.1 (i) restricts f∗ to be square integrable with respect to Π and is for instance verified if f∗ is bounded and Π is a bounded measure. The model is made up of three elements that we detail in the next two subsections: a prior on θ, denoted by µ(θ), a conditional prior on f given θ, denoted by µ(f |θ) and the sampling model. In the following, we shorten almost surely by a.s., we denote by EF and Ef the expectation taken with respect to F . 2.1 Prior distribution We specify a prior probability measure µ on the pair (θ, f ) of the form µ = µ(θ)⊗µ(f |θ). By abuse of notation, µ(θ) will also denote the density of the prior distribution of θ with respect to the Lebesgue measure in the case it admits it. The prior µ(θ) may either be flat (non-informative) or incorporate any additional information available to the econometrician about θ. The conditional prior µ(f |θ) requires a more careful analysis and we now explain it. Conditional prior on f given θ. The conditional prior distribution µ(f |θ) of f , given θ, is specified as a GP on BE with mean function f0θ ∈ EM and covariance operator Ω0θ : E → E. We restrict the prior µ(f |θ) to generate trajectories f ∈ E that integrate to 1 (with respect to Π) and such that the corresponding F satisfies (1.1) with probability 1. For that, we impose two sets of restrictions, one on f0θ and one on Ω0θ : Restriction 1 (Restrictions on f0θ ). The prior mean function is chosen such that f0θ ∈ EM and Z h(θ, x)f0θ (x)Π(dx) = 0. 5 (2.1) Restriction 2 (Restrictions on Ω0θ ). The prior covariance operator Ω0θ : E → E is such that ( 1/2 Ω0θ h(θ, x) = 1/2 Ω0θ 1 0 (2.2) = 0. The covariance operator Ω0θ is linear, self-adjoint and trace-class.2 Due to Restriction 2, Ω0θ is not injective. In fact, the null space of Ω0θ , denoted by N(Ω0θ ) is not trivial and contains effectively the constant 1 – which implies that the trajectories f generated by the prior integrate to 1 a.s. – and the function h(θ, x) – which implies that the trajectories f satisfy the moment conditions a.s. Therefore, the support of µ(f |θ) is a proper subset of E. In practice, this means that Ω0θ is degenerate in the directions along which we want that the corresponding projections of f and f0θ are equal. This is the meaning of the next lemma. Lemma 2.1. The conditional GP prior µ(f |θ), with mean function f0θ and covariance operator Ω0θ satisfying Restrictions 1 and 2, generates trajectories f that satisfy µ(f |θ)- a.s. the conditions Z f (x)Π(dx) = 1 Z and h(θ, x)f (x)Π(dx) = 0. Remark 2.1. Our restrictions imply that the trajectories generated by µ(f |θ) integrates to 1 a.s. but they do not guarantee non-negativity of the trajectories. Thus, the support of µ(f |θ) is smaller than E but bigger than EM . 
To impose non-negativity we could: (i) either project the prior on the space of non-negative functions or (ii) write f = g2 , g ∈ E, and specify a conditional prior distribution, given θ, for g instead of for f . However, the resulting prior would be non Gaussian and the resulting posterior for θ would not be available in closed form which is instead one of the main advantages of our procedure. From a practical implementation point of view, a covariance operator satisfying Restriction 2 may be constructed as Ω0θ g = ∞ X j=0 λj hg, ϕj iϕj , g∈E where (λj )j∈N is a decreasing sequence of non-negative numbers accumulating at 0 such P that j λj < ∞ and (ϕj )j∈N is a basis for E. Remark that (λj )j∈N and (ϕj )j∈N correspond to the eigenvalues and eigenfunctions of Ω0θ , respectively. Since the null space N(Ω0θ ) ⊂ E is generated by {1, h1 (θ, ·), . . . , hr (θ, ·)}, without loss of generality we can set the first 2 A trace-class operator is a compact operator with eigenvalues that R are summable. Remark that this guarantees that the trajectories f generated by µ(f |θ) satisfy f 2 dΠ < ∞ a.s. 6 eigenvalues equal to the elements that span N(Ω0θ ). Restriction 2 is fulfilled by setting the corresponding eigenvalues equal to 0. For instance, if {1, h1 (θ, ·), . . . , hr (θ, ·)} are linearly independent as elements of E, then N(Ω0θ ) has dimension r + 1, the first eigenvalues are (ϕ0 , ϕ1 , . . . , ϕr )T = (1, hT )T and the corresponding eigenvalues are λj = 0, ∀j = 0, 1, . . . , r.3 The remaining components (ϕj )j>r are chosen such that (ϕj )j≥0 forms a basis of E and P (λj )j>r such that j>r λj < ∞. In section 5 we provide some examples that explain in detail the construction of Ω0θ . Remark 2.2. In the just-identified case where r = k and the moment restrictions EF (h(θ, x)) = 0 rewrite in an explicit form as θ = B(f ), where B is a linear functional, we may construct the prior in an alternative way. This consists in first specifying a GP prior µ(f ) for f (independent of θ) with a mean function f0 restricted to be a pdf and a covariance operator Ω0 1/2 restricted to satisfy Ω0 1 = 0. Then, the prior for θ is obtained through the transformation B(·), under mild regularity conditions. For example, if θ = Ef (x) then B(f ) = hf, ιi, where ι ∈ E denotes the identity function ι(x) = x, and µθ = N (hf0 , ιi, hΩ0 ι, ιi). 2.2 The sampling model Conditional on f , the sample likelihood is Qn i=1 f (xi ). This sampling model has the disadvantage that the corresponding posterior distribution of (θ, f ), as follows from the prior of section 2.1, is not available in closed-form. Hence, we propose an alternative and original way for constructing the sampling model that allows for a closed-form posterior and a conjugate analysis. In order to construct the sampling distribution let us consider a functional transformation rn of the sample x1 , . . . , xn . This transformation rn is chosen by the researcher and must have the following characteristics: (I) rn converges weakly towards a Gaussian process (in a Hilbert space F to be defined below); (II) rn is an observable element of an infinite-dimensional Hilbert space F, for instance a L2 -space; (III) rn is linked to the nuisance parameter f according to the following linear scheme rn = Kf + U (2.3) where K : E → F is a linear operator, F is an infinite-dimensional separable Hilbert space and U ∈ F is a Hilbert space-valued random variable (H-r.v.). We recall that, for a complete probability space (Z, Z, P), U is a H-r.v. 
if it defines a measurable map U : (Z, Z, P) → (F, BF ), where BF denotes the Borel σ-fields generated by the open sets of F. 3 Remark that if {1, h1 (θ, ·), . . . , hr (θ, ·)} are not linearly independent then we can use their orthonormalized counterparts. 7 To construct rn we first select a function k(t, x) : T × S → R+ that is measurable in x for every t ∈ T , where T ⊆ Rp , p > 0. The transformation rn is then the expectation of k(t, ·) under the empirical measure: n 1X k(t, xi ). rn = n i=1 Define F = L2 (T, BT , ρ) where ρ is a measure on T and BT denotes the Borel σ-field generated by the open sets of T . We use the same notation h·, ·i and k · k for the inner product and norm, respectively, in F and in E. The function k(t, x) defines also the operator R K: Kφ := k(t, x)φ(x)Π(dx), ∀φ ∈ F and must be such that Kφ and rn are elements of R F for every φ ∈ E. For every f ∈ EM , Kf is the expectation of k(t, ·) under F := S f dΠ: R Kf = k(t, x)f (x)Π(dx) = EF (k(t, x)). Hence, model (2.3) rewrites: n 1X k(t, xi ) = rn = n i=1 Z k(t, x)f (x)Π(dx) + U (t). (2.4) Conditionally on f , the expectation of rn is equal to Kf and the error term U has zero mean and covariance kernel σnF (t, s) = EF U (t)U (s) = 1 F E (k(t, x)k(s, x)) − EF (k(t, x))EF (k(s, x)) . n We denote by P f the sampling distribution, namely, the conditional distribution of rn given f satisfying (2.3). When f = f∗ , i.e. the truth, we simply use the notation P ∗ = P f∗ and σnF∗ (t, s). If the function k(t, ·) is Donsker then P f weakly converges to a Gaussian process GP(Kf, Σn ) with Σn = 1 Σ:F →F n Σϕ = Z σnF (t, s)ϕ(s)ds, ϕ∈F (2.5) as n → ∞. In our analysis we use this Gaussian process as the sampling model in order to construct the posterior distribution. The frequentist asymptotic properties of the posterior distribution will be analyzed under the true measure P ∗ . The covariance operator Σn is one-to-one, linear, positive definite, self-adjoint and trace-class. In several examples Σn is unknown because F is unknown; thus in the posterior distribution we will replace F with the empirical cumulative distribution function (cdf ). Because this estimator converges at the parametric rate, then does not affect the asymptotic properties of our procedure (see Florens and Simoni [2012]). The next example clarifies the construction of the sampling model (2.3). Example 2.1. Let us suppose that we dispose of an i.i.d. sample of x: (x1 , . . . , xn ), where 8 xi ∈ R, i = 1, . . . , n. The functional transformation rn may be, for example, the empirical P P cdf Fn (t) = n1 ni=1 1{xi ≤ t} or the empirical characteristic function cn (t) = n1 ni=1 eitxi for t ∈ R. In these two cases we can write: Z Fn (t) = ZS cn (t) = 1{s ≤ t}f (s)Π(ds) + U (t), eits f (s)Π(ds) + U (t), S R respectively. In the first case rn = Fn and ∀ϕ ∈ E, Kϕ = S 1{s ≤ t}ϕ(s)Π(ds), while in R the second case rn = cn and ∀ϕ ∈ E, Kϕ = S eits ϕ(s)Π(ds). In these two cases, by the Donsker’s theorem, U is asymptotically Gaussian with zero mean and covariance operator characterized by the kernel n1 (F (s ∧ t) − F (s)F (t)) in the first case and n1 (c(s + t) − c(s)c(t)) in the second case, where c(t) denotes the characteristic function of the distribution F . These variances are unknown when f is unknown but we can replace them in the posterior distribution by their empirical counterparts. The following lemma gives an useful characterization of the operator Σn in terms of K and its adjoint K ∗ . 
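Before turning to that lemma, here is a minimal numerical sketch of Example 2.1 (illustrative only; the helper names are ours, not the paper's). It computes the transformation rn as the empirical cdf on a grid of points t, and a plug-in estimate of the covariance kernel n⁻¹(F(s∧t) − F(s)F(t)) obtained by replacing F with its empirical counterpart, as suggested in the text.

```python
import numpy as np

def empirical_cdf_transform(x, t_grid):
    """r_n(t) = (1/n) sum_i 1{x_i <= t}, evaluated on a grid of t values."""
    return (np.asarray(x)[None, :] <= t_grid[:, None]).mean(axis=1)

def empirical_cov_kernel(x, t_grid):
    """Plug-in estimate of the kernel (F(min(s,t)) - F(s)F(t)) / n of Sigma_n,
    with F replaced by the empirical cdf (monotonicity gives F(min(s,t)) = min(F(s), F(t)))."""
    F = empirical_cdf_transform(x, t_grid)
    return (np.minimum.outer(F, F) - np.outer(F, F)) / len(x)

# toy illustration with a standard normal sample
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
t_grid = np.linspace(-3.0, 3.0, 200)
r_n = empirical_cdf_transform(x, t_grid)    # the observable transformation r_n of (2.3)
Sigma_n = empirical_cov_kernel(x, t_grid)   # discretized covariance kernel of U
```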
We recall that the adjoint K∗ of a bounded and linear operator K : E → F is defined as the operator from F to E that satisfies ⟨Kϕ, ψ⟩ = ⟨ϕ, K∗ψ⟩, ∀ϕ ∈ E and ψ ∈ F. In our case, an elementary computation shows that K∗ψ = ∫_T k(t, x)ψ(t)ρ(dt).

Lemma 2.2. Let K : E → F be the bounded and linear operator defined as Kϕ = ∫_S k(t, x)ϕ(x)Π(dx), ∀ϕ ∈ E, and K∗ : F → E be its adjoint, that is, K∗ψ = ∫_T k(t, x)ψ(t)ρ(dt), ∀ψ ∈ F. Moreover, denote by f∗ the true value of f that characterizes the DGP. Then the operator Σn = (1/n)Σ takes the form

∀ψ ∈ F,    Σψ = K Mf K∗ψ − (K Mf 1) ⟨Mf 1, K∗ψ⟩        (2.6)

where Σ : F → F and Mf : E → E is the multiplication operator ∀ϕ ∈ E, Mf ϕ = f∗(x)ϕ(x).

We denote by D the subset of E whose elements integrate to 0 with respect to Π:

D := { g ∈ E ; ∫ g(x)Π(dx) = 0 }.

We remark that D contains the subset of functions in E that are differences of probability density functions with respect to Π. Moreover, R(Ω0θ^{1/2}) ⊂ D, where R(Ω0θ^{1/2}) denotes the range of Ω0θ^{1/2} and is defined as

R(Ω0θ^{1/2}) := { ϕ ∈ E ; ∫_S ϕ(x)h(θ, x)Π(dx) = 0 and ∫_S ϕ(x)Π(dx) = 0 }.

The following lemma states the relationship between the range of K and the range of Σ^{1/2}.

Lemma 2.3. Let K : E → F be the operator: ∀ϕ ∈ E, Kϕ = ∫_S k(t, x)ϕ(x)Π(dx), and denote by K|_D the operator K restricted to D ⊂ E. Then, if K|_D is injective we have R(K|_D) = D(Σ^{−1/2}).

2.3 Posterior distribution

The Bayesian model defines a joint distribution on (the Borel σ-field of) Λ × F and can be summarized in the following way:

θ ∼ µ(θ)
f|θ ∼ µ(f|θ) ∼ GP(f0θ, Ω0θ),    with ∫ h(θ, x) f0θ(x) Π(dx) = 0 and Ω0θ^{1/2}(1, h(θ, ·)′)′ = 0
rn|f, θ ∼ rn|f ∼ P^f ∼ GP(Kf, Σn)

where we use the approximated sampling distribution P^f. Theorem 1 in Florens and Simoni [2012] shows that the conditional distribution of (f, rn), conditional on θ, is:

(f, rn)′ | θ ∼ GP( (f0θ, Kf0θ)′ ,  [ Ω0θ ,  Ω0θK∗ ;  KΩ0θ ,  Σn + KΩ0θK∗ ] )        (2.7)

where (Σn + KΩ0θK∗) : F → F, Ω0θK∗ : F → E and KΩ0θ : E → F. The marginal sampling distribution of rn conditional on θ, obtained by integrating out f, is:

rn|θ ∼ Pn^θ ∼ GP(Kf0θ, Σn + KΩ0θK∗).        (2.8)

We now discuss the posterior distribution, denoted by µ(·|rn). In the main text we discuss only the posterior distribution of the parameter of interest θ and postpone to Appendix A the discussion about the conditional posterior distribution µ(f|rn, θ) of f given θ.

2.3.1 Posterior distribution of θ

The marginal posterior for θ, denoted by µ(θ|rn), is obtained by using the marginal sampling distribution Pn^θ given in (2.8). To obtain µ(θ|rn) we characterize the likelihood of Pn^θ with respect to a dominating measure Pn^0 that has to be defined. The following theorem, which is a slight modification of Theorem 3.4 in Kuo [1975, page 125], characterizes a probability measure Pn^0 which is equivalent to Pn^θ, as well as the corresponding likelihood of Pn^θ with respect to Pn^0.

Theorem 2.1. Let f̃ ∈ EM denote a probability density function (with respect to Π) and Pn^0 be a Gaussian measure with mean Kf̃ and covariance operator n^{−1}Σ, i.e. Pn^0 = GP(Kf̃, n^{−1}Σ). For n fixed, if K|_D is injective, then Pn^0 and Pn^θ are equivalent. Moreover, assume that ψj ∈ R(Σ^{1/2}), ∀j ≥ 0; then the Radon-Nikodym derivative is given by

pnθ(rn; θ) := (dPn^θ/dPn^0)(rn) = ∏_{j=1}^{∞} (n l_jθ² + 1)^{−1/2} exp{ −(1/2) [ ⟨rn − Kf0θ, Σ^{−1/2}ψj(θ)⟩² / (n^{−1} + l_jθ²) − n ⟨rn − Kf̃, Σ^{−1/2}ψj(θ)⟩² ] }        (2.9)

where l_jθ² and ψj(θ) are the eigenvalues and eigenfunctions of Σ^{−1/2}KΩ0θK∗Σ^{−1/2}.

The quantities l_jθ² and ψj(θ) depend on θ because the operator Σ^{−1/2}KΩ0θK∗Σ^{−1/2} depends on θ.
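Purely as an illustration of how (2.9) could be evaluated in practice (not the authors' implementation), the infinite product can be truncated at J terms once discretized versions of Σ, KΩ0θK∗, Kf0θ, Kf̃ and rn are available; all names below are hypothetical and quadrature weights for the inner products are omitted for simplicity.

```python
import numpy as np

def log_pn_theta(r_n, K_f0theta, K_ftilde, Sigma, K_Omega_Kt, n, J=50, jitter=1e-10):
    """Truncated log Radon-Nikodym derivative of (2.9).

    All arguments are discretized on a common grid: r_n, K_f0theta, K_ftilde
    are vectors; Sigma and K_Omega_Kt (= K Omega_{0theta} K*) are matrices.
    """
    # symmetric square root of Sigma via its eigendecomposition
    s_val, s_vec = np.linalg.eigh(Sigma)
    s_val = np.clip(s_val, jitter, None)
    Sigma_inv_half = s_vec @ np.diag(s_val ** -0.5) @ s_vec.T

    # eigenpairs (l_j^2, psi_j) of Sigma^{-1/2} K Omega K* Sigma^{-1/2}
    B = Sigma_inv_half @ K_Omega_Kt @ Sigma_inv_half
    l2, psi = np.linalg.eigh(B)
    keep = np.argsort(l2)[::-1][:J]            # J largest eigenvalues
    l2, psi = l2[keep], psi[:, keep]

    a = psi.T @ (Sigma_inv_half @ (r_n - K_f0theta))   # <r_n - K f_{0theta}, Sigma^{-1/2} psi_j>
    b = psi.T @ (Sigma_inv_half @ (r_n - K_ftilde))    # <r_n - K ftilde,     Sigma^{-1/2} psi_j>
    return np.sum(-0.5 * np.log(n * l2 + 1.0)
                  - 0.5 * (a ** 2 / (1.0 / n + l2) - n * b ** 2))
```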
We also remark that we can use any density function for the mean function f˜ as long as it does not depend on θ. For instance, it could be f˜ = f∗ even if it is unknown in practice. The marginal posterior distribution of θ takes the form µ(θ|rn ) = R pnθ (rn ; θ)µ(θ) dPnθ (rn |θ)µ(θ) =R θ Θ dPn (rn |θ)µ(θ)dθ Θ pnθ (rn ; θ)µ(θ)dθ and can be used to compute a point estimator of θ. We propose to use the maximum a posterior (MAP) estimator θn defined as θn = arg max µ(θ|rn ). θ∈Θ (2.10) In general, when the conditional prior distribution on f , given θ, is very precise the MAP will essentially be equivalent to the maximum likelihood estimator (MLE) that we would obtain if we use the prior mean function f0θ as the likelihood. On the contrary, with a prior µ(f |θ) almost uninformative, the MAP will be close to the GMM estimator (up to a prior on θ). Section 4 shows this argument in a rigorous way. Remark 2.3. We have already discussed (see Remark 2.2) the possibility of using a different prior scheme when we are in the just-identified case and θ can be written as a linear functional of f . In that case, given a Gaussian process prior on f , the prior of θ is recovered through the transformation θ = B(f ). The posterior distribution for θ is recovered from the posterior distribution of f (which is obviously unconditional on θ) through the transformation B(f ). 11 , 2.3.2 Choice of f0θ Our approach requires that the prior mean function f0θ be an element of the following set: Z F(θ) = f ∈ EM ; h(θ, x)f (x)dΠ(x) = 0 . Selection of an element of F(θ) is a linear inverse problem. While any element of F(θ) is a valid candidate for f0θ , asymptotic arguments require that the chosen f0θ converges to f∗ when θ is replaced by the true θ∗ . Henceforth, one has to define a population criterion Q(·) and to select f0θ as the minimizer of this criterion over F(θ) for a given θ. In practice, Q(·) will depend on unknown quantities that have to be replaced by estimated ones. We denote by Qn (·) the corresponding sample criterion and by f0θ,n the chosen prior mean function defined as: for a fixed θ ∈ Θ. f0θ,n = arg min Qn (f ) f ∈F (θ) (2.11) This procedure delivers an empirical Bayes procedure because f0θ,n is data-dependent. We outline two possible criteria. Regularization. The population criterion is Q(f ) = kEF∗ (k(t, x)) − Kf k2 + αpen(f ) where α > 0 and pen(·) is a penalty function. For instance, if pen(f ) = kf k2 then f0θ is equal to the constrained Tikhonov regularized solution. Alternatively, as f0θ is known to be non-negative, a more appropriate penalty is the negative entropy: pen(f ) = Z f log f dΠ under the assumption that this quantity is finite. The existence of a constrained maximum entropy solution is guaranteed under very general conditions, see e.g. Amato and Hughes [1991]. The sample criterion is Qn (f ) = krn − Kf k2 + αpen(f ). I-Divergence. The population criterion is the I-divergence D(F ||F∗ ) (or Kullback-Leibler information criterion) between F and F∗ Q(f ) = D(F ||F∗ ) = Z log dF dF∗ dF, where f dF = . dF∗ f∗ The sample criterion is obtained by replacing F∗ by its empirical counterpart Fn : Qn (f ) = R dF log dFn dF be . Then, we solve the problem 12 n 1X min npi log(npi ) p1 ,...,pn n such that n X pi h(xi , θ) = 0, i=1 i=1 n X pi = 1. i=1 See e.g. the beautiful paper by Csiszar [1975], Kitamura and Stutzer [1997] and Kitamura [2007] for a description of this procedure. 
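To make the I-divergence choice of f0θ,n concrete, here is a minimal illustrative sketch (the helper names are ours and scipy is used only for convenience). It computes the tilting coefficient by minimizing the dual criterion (1/n)Σ_i exp(γ′h(x_i, θ)), anticipating the dual representation derived in the next paragraph, and returns the implied (normalized) weights on the observations.

```python
import numpy as np
from scipy.optimize import minimize

def exponential_tilting_weights(x, theta, h, gamma0=None):
    """I-divergence (exponential tilting) solution: gamma_n(theta) minimizes
    (1/n) sum_i exp(gamma' h(x_i, theta)); the implied weights tilt the
    empirical distribution so that the moment restrictions hold exactly."""
    H = np.array([h(theta, xi) for xi in x])   # n x r matrix of moment functions
    n, r = H.shape
    gamma0 = np.zeros(r) if gamma0 is None else gamma0

    def dual(gamma):
        return np.exp(H @ gamma).mean()

    gamma_n = minimize(dual, gamma0, method="BFGS").x
    w = np.exp(H @ gamma_n)
    return gamma_n, w / w.sum()                # weights proportional to exp(gamma' h)

# toy illustration with the restrictions of section 5.3: h(theta, x) = (theta - x, theta^2 - x^2/2)
rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)
h = lambda theta, xi: np.array([theta - xi, theta ** 2 - xi ** 2 / 2.0])
gamma_n, p = exponential_tilting_weights(x, theta=2.0, h=h)
print(np.sum(p * x), np.sum(p * x ** 2) / 2.0)  # ~ theta (=2) and ~ theta^2 (=4)
```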
By using duality arguments it is possible to find an explicit solution to the minimization problem that, once replaced in Kf0θ,n , gives: n Kf0θ,n t 1X k(xi , t)eγn (θ) h(xi ,θ) = n (2.12) i=1 where γn (θ) = arg minγ 1 n Pn γ 0 h(xi ,θ) . i=1 e Remark, that, because f0θ enters µ(θ|rn ) only through Kf0θ , thus Kf0θ,n is all what we need to have. Equation (2.12) is obtained by R R writing Kf0θ,n = k(x, t)f0θ,n (x)dΠ(x) = k(x, t)(dF/dFn )dFn , by abuse of notation. Both these procedures give an f0θ,n that, for a given θ, converges to the pseudo-true function f0θ := arg minf ∈F (θ) Q(f ) as n → ∞. Because f0θ is the solution of a convex problem, then its continuity in θ follows from Theorem 4.2 in Bonnans and Shapiro. Then f0θ converges to f∗ as θ → θ∗ . 3 Asymptotic Analysis In this section we focus on the asymptotic properties of our approach. We analyze three issues: (i) frequentist consistency of the MAP estimator θn (section 3.1), (ii) weak convergence of µ(θ|rn ) towards a normal distribution (section 3.2), and (iii) convergence in Total Variation of the posterior µrθn towards the asymptotic distribution of the method of moments estimator of θ for the just-identified case where θ is a linear functional of f ˜ δ) the closed (section 3.3). In the following, for every θ˜ ∈ Θ and δ > 0 we denote by B(θ, ˜ δ) = {θ ∈ Θ; ||θ − θ|| ˜ ≤ δ}. Moreover, let ball centered in θ˜ with radius δ, that is, B(θ, ln (θ) = log(pnθ (rn ; θ)) and Ln (θ) = log(pnθ (rn ; θ)) + log µ(θ) = ln (θ) + log µ(θ). 3.1 Posterior Consistency In this section we study the consistency of the MAP estimator θn when f0θ is determined by using the I-Divergence. In this case, f0θ (x) 0 = eγ(θ) h(θ,x) f∗ (x) 13 0 where γ(θ) := arg min E∗ eγ h(θ,x) , see Kitamura [2007]. Let ln (θ) = log(pnθ (rn ; θ)) where we replace f0θ by f0θ,n . Define l(θ) as l(θ) = − ∞ ∞ j=1 j=1 1X 1X hK(f0θ − f∗ ), Σ−1/2 ψj (θ)i2 + hK(f˜ − f∗ ), Σ−1/2 ψj (θ)i2 (3.1) 2 2 1 1 = − kΣ−1/2 K(f0θ − f∗ )k2 + kΣ−1/2 K(f˜ − f∗ )k2 2 2 (3.2) where the second equality is due to the fact that Σ−1/2 K|D is well defined by Lemma 2.3. The maximizer of l(θ) is equal to the minimizer of kΣ−1/2 K(f0θ − f∗ )k2 which can be written 0 kΣ−1/2 K(f0θ − f∗ )k2 = kΣ−1/2 K(eγ(θ) h(θ,x) − 1)f∗ k2 . 0 Since eγ(θ) h(θ,x) = 0 eγ(θ) h(θ,x) f0θ (x) f∗ (x) (3.3) hence, the value of θ that solves (3.3) is the value that makes as close as possible to one, that is, the value that makes f0θ as close as possible to f∗ . If Σ−1/2 K is injective then there exists a unique value that minimizes (3.3). This value is the true θ∗ . From this reasoning, it turns out that consistency of θn will then follow from uniform convergence of 1 1 n ln (θ) + n log µ(θ) towards l(θ). To get uniform convergence we replace the fixed GP prior for f with a shrinking GP prior, that is, a prior where Ω0θ is replaced by an Ω0θ where an = o(n−1 ). The intuition for that is that, because the prior mean f0θ,n converges towards f∗ for n → ∞ and θ = θ∗ then, the prior distribution should also be more and more concentrated. Remark that with 2 in Theorem 2.1 satisfy nl2 → 0. We introduce the following this prior, the eigenvalues ljθ jθ assumptions: A1. The true parameter θ∗ belongs to the interior of a compact convex subset Θ of Rr . A2. The maps t 7→ k(x, t) and θ 7→ h(θ, x) are continuous for every x. A3. The prior distribution µ(θ) is continuous in θ. Theorem 3.1. 
Let µ(f |θ) be the GP prior with mean function f0θ,n = arg minf ∈F (θ) Qn (f ) where Qn is the I-divergence sample criterion, and covariance operator an Ω0θ where Ω0θ satisfies restriction 2 and nan → 0. Let T be compact; under A1-A3: p θn − → θ∗ . Moreover, for some δn & 0 µ (θ ∈ B(θn , δn )c |rn ) → 0 14 (3.4) in P ∗ -probability as n → 0. 3.2 Asymptotic Normality: weak convergence We show now that asymptotically the posterior distribution of θ behaves like a normal distribution centered around the MAP estimator θn defined in (2.10). Denote Ln (θ) = log(pnθ (rn ; θ)) + log µ(θ) = ln (θ) + log µ(θ). We make the following assumptions ∂Ln (θ) ∂θ θ=θ B1. θn is a strict local maximum of µ(θ|rn ) and L0n (θn ) := B2. ∆n = {−L00n (θn )}−1 := − ∂ 2 Ln (θ) ∂θ∂θ t θ=θ n −1 = 0. n exists and is positive definite. B3. For any > 0, there exists an integer N and δn & 0 such that, for any n > N and θ ∈ B(θn , δn ), L00n (θ) exists and satisfies I − G() ≤ L00n (θ){L00n (θn )}−1 ≤ I + G(), where I is a (k×k) identity matrix and G() is a (k×k) positive semidefinite symmetric matrix whose largest eigenvalue gmax () converges to 0 as → 0. B4. Let dmax be the largest eigenvalue of ∆n and δn be as in B3. Then, dmax & 0 and 1/2 δn /dmax → 0 as n → ∞. Assumptions B1- B4 hold depending on the form taken on by the operator K, Σ, Ω0 and the function rn . The next theorem, which is a slight modification of results in Kim [2002] and Chen [1985], shows that the posterior distribution µ(θ|rn ) converges weakly towards a normal distribution centered at the MAP estimator and with variance equal to ∆n . Theorem 3.2. Let δn & 0 be as in Theorem 3.1. Under Assumptions B1-B4 and the −1/2 assumptions of Theorem 3.1, the posterior distribution of ∆n to a standard normal distribution, that is, for every θ1 , θ2 ∈ Θ, Z −1/2 {θ; ∆n (θ−θn )∈(θ1 ,θ2 )} dµ(θ|rn ) → Z θ2 φ(u)du θ1 in P ∗ -probability, where φ(·) denotes the standard normal pdf. 15 (θ − θn ) converges weakly 3.3 Convergence in Total Variation for linear functionals: the just-identified case In this section we consider the just-identified case where k = r and where the assumptions of the implicit function theorem are satisfied so that every pair (θ, F ) satisfying (1.1) can be explicitly written as θ = B(f ). Assume in addition that B(·) is a linear functional with values in Rr . Denote by E r the cartesian product Πri=1 E. Hence, by the Riesz theorem, there exists a unique g ∈ E r such that we can identify B by such a g such that for all f ∈ E: θ = B(f ) = Z g(x)dF (x). In this situation the prior distribution of θ is specified through the prior distribution of f : θ ∼ µ(θ) = N (hf0 , gi, hΩ0 g, gi), as described in Remark 2.2. The posterior distribution is the given by µ(θ|rn ) = N (θ rn , Ωn ), where θ rn = hf0 + A(rn − Kf0 ), gi, Ωn = h(Ω0 − AKΩ0 )g, g) and A : F → E is a continuous and linear operator whose expression is given in Lemmas A.1 and A.2 in the Appendix. In the following, we establish convergence in total variation of µ(θ|rn ) towards a Normal distribution. This convergence is stronger than the weak convergence considered in section 3.2. We implicitly assume that the conditions of Lemmas A.1 and A.2 are satisfied. When this is not the case, then our asymptotic results can be easily extended to the case where the exact posterior µ(f |θ, rn ) is replaced by the regularized posterior distribution discussed in Remark A.4. 
P By using the usual notation for empirical processes, we define θˆ = B(Pn ) := n−1 ni=1 g(xi ) √ the method of moments estimator and by V the variance of nB(Pn ) under F∗ . The efficient influence function g˜ : S → Rk takes the form g˜ = g − E∗ g and V = E∗ (˜ g 2 ). For two probability measures P and Q absolutely continuous with respect to a positive measure Π, define the total variation (TV) distance as ||P − Q||T V 1 = 2 Z |fP − fQ |dΠ = Z (fP − fQ )+ dΠ where fP and fQ are the Radon-Nikodym derivatives of P and Q, respectively, √ with respect ˆ rn to Π. The next theorem states that the total variation distance between µ n(θ − θ) and N (0, V ) converges to 0 in probability. P Theorem 3.3. Let θˆ = B(Pn ) := n−1 ni=1 g(xi ) and consider the Gaussian model (2.7) −1/2 (without θ). If f∗ ∈ R(K ∗ ) and g ∈ C ∞ , then 16 √ ˆ rn − N (0, V ) n(θ − θ) µ in P ∗ -probability. 4 TV →0 The case with span{1, h1(θ, ·), . . . , hr (θ, ·)} independent of θ We focus on the case where the space spanned by {1, h1 (θ, ·), . . . , hr (θ, ·)}, namely N(Ω0θ ), does not depend on θ. This arises for instance when the moment functions h(θ, x) are separable in θ and x. Therefore, we can choose any orthonormal basis (o.n.b.) with respect to Π, say {ϕj }rj=0 , that spans N(Ω0θ ) and that does not depend on θ. Moreover, N(Ω0θ )⊥ is also independent of θ and spanned by an o.n.b. {ϕj }j>r independent of θ as well. An example where span{1, h1 (θ, ·), . . . , hr (θ, ·)} does not depend on θ is the case where h(θ, x), after normalization, is of the form: h(θ, x) = a(x) − b(θ) with a(x) = (a1 (x), . . . , ar (x))0 and b(θ) = (b1 (θ), . . . , br (θ))0 , and V ar(h(θ, x)) = Ir , where Ir denotes the r-dimensional identity matrix. Thus, the prior covariance operator Ω0θ does not depend on θ and writes: ∀ρ ∈ E, (Ω0θ ρ)(x) = Ω0 ρ = X j>r λj hρ, ϕj iϕj (x). On the other side, the prior mean function f0θ does depend on θ. Let {ψj }j≥0 be an o.n.b. in F and {λjK }j≥0 be a square-summable sequence of positive real numbers. We can then construct the operator K and the transformation rn as: ∀ρ ∈ E, (Kρ)(t) = ∞ X j=0 rn = λjK hρ, ϕj iψj (t) = n ∞ Z X ∞ λjK ρ(x)ϕj (x)ψj (t)Π(dx) j=0 1 XX λjK ϕj (xi )ψj (t). n i=1 j=0 P Hence, the kernel k(x, t) characterizing the operator K writes: k(x, t) = ∞ j=0 λjK ϕj (x)ψj (t). P∞ ∗ ∗ The adjoint of K, denoted by K , writes: ∀φ ∈ F, (K φ)(x) = j=0 λjK hφ, ψj iϕj (x). Re- mark that neither K, K ∗ , rn nor Σ depend on θ. In this section we fix Π = F∗ and denote E∗ = L2 (S, BS , F∗ ). Therefore, {ϕj }j≥0 is an o.n.b. with respect to F∗ and the truth is f∗ = 1. The operator Σ has an eigensystem which satisfies: Σψj = λ2jK ψj for j ≥ 1 and Σψ0 = 0. To see this, we write Σ in the form given in lemma 2.2: Σ = KMf K ∗ − KMf hMf , K ∗ ·i 17 and since f∗ = 1 we have: Σψj = KK ∗ ψj − K1hK1, ψj i = λjK Kϕj − λ20K ψ0 hψ0 (t), ψj (t)i ∀j 6= 0, = λ2jK ψj Σψ0 = λ20K ψ0 − (λ0K ψ0 )hK1, ψ0 i = λ20K − λ20K = 0. The result has been obtained by using the fact that hK1, ψj i = λ0K hψ0 (t), ψj (t)i = 0 for j ≥ 1 and h1, ϕ0 i = 1. Remark 4.1. The intuition behind the fact that Σ has an eigenvalue equal to 0 comes from lemma 2.3. Having the same eigenfunctions means that the ranges of Σ1/2 and K must be equal. However, we know from lemma 2.3 that when K|D is injective R(Σ1/2 ) is equal to the range of the restriction of K to D, that is K|D . Since D is orthogonal to ϕ0 the eigenfunction ψ0 , which is the transformation of ϕ0 under K, does not belong to the eigensystem of Σ corresponding to nonzero eigenvalues. 
Therefore, the eigenvalue corresponding to ψ0 must be equal to 0. Finally, because Σ has one eigenvalue equal to 0 it is not injective and the R.K.H.S. associated with Σ is not dense in F.

In this particular case the eigenvalues l_jθ² and eigenfunctions ψj(θ) in (2.9) are as follows:

l_jθ² = λj for j > r,    l_jθ² = 0 for j = 0, . . . , r,    and    ψj(θ) = ψj for j = 0, 1, . . .        (4.1)

This can be verified by trivial computations. It follows that the likelihood in (2.9) can be simplified and the MAP writes as:

θn = argmax_{θ∈Θ} pnθ(rn; θ)µ(θ) = argmax_{θ∈Θ} ( log pnθ(rn; θ) + log µ(θ) )

 = argmax_{θ∈Θ} { − (n/2) Σ_{j=1}^{r} (1/λ_{jK}²) [ ⟨rn − Kf0θ, ψj⟩² − ⟨rn − Kf̃, ψj⟩² ]
          − (n/2) Σ_{j>r} (1/λ_{jK}²) [ ⟨rn − Kf0θ, ψj⟩²/(1 + nλj) − ⟨rn − Kf̃, ψj⟩² ]
          − (1/2) Σ_{j>r} log(nλj + 1) + log µ(θ) }

 = argmin_{θ∈Θ} { (n/2) Σ_{j=1}^{r} (1/λ_{jK}²) ⟨rn − Kf0θ, ψj⟩² + (n/2) Σ_{j>r} (1/λ_{jK}²) ⟨rn − Kf0θ, ψj⟩²/(1 + nλj) − log µ(θ) }

 = argmin_{θ∈Θ} { (n/2) Σ_{j=1}^{r} ( (1/n) Σ_{i=1}^{n} ϕj(xi) − ⟨f0θ, ϕj⟩ )²        (4.2)
          + (n/2) Σ_{j>r} (1/(1 + nλj)) ( (1/n) Σ_{i=1}^{n} ϕj(xi) − ⟨f0θ, ϕj⟩ )² − log µ(θ) }        (4.3)

where we have eliminated the terms that do not depend on θ and we have used the fact that (1/λ_{jK}²) ⟨rn − Kf0θ, ψj⟩² = ( (1/n) Σ_{i=1}^{n} ϕj(xi) − ⟨f0θ, ϕj⟩ )². Equation (4.3) is quite useful and allows us to emphasize several aspects of our methodology.

I. The first term in (4.3) accounts for the moment restrictions and minimization of this term corresponds to the classical GMM. In fact, by construction ∫ϕj(x)f0θ(x)Π(dx) is not 0 because we are using transformations of the moment functions. Thus, ( (1/n) Σ_{i=1}^{n} ϕj(xi) − ∫ϕj(x)f0θ(x)Π(dx) )² depends on θ through f0θ, which is chosen as close as possible to the true f∗ according to the criterion (2.11). Remark that if we do the inverse transformation from {ϕj}_{j=1}^{r} to {hj(x, θ)}_{j=1}^{r} then the term involving f0θ will be zero and the term ϕj(xi) will be written in terms of θ. This is similar to the LIL method proposed by Kim [2002].

II. The second term in (4.3) accounts for the extra information that we have, namely, the information contained in the subspace of E orthogonal to span{1, h1(θ, ·), . . . , hr(θ, ·)}. This information, which is in general not exploited by classical GMM estimation, can be exploited thanks to the prior distribution and the choice of f0θ. In fact, the parameter θ enters our criterion in (4.3) through f0θ. Thus, the value of θ that maximizes the posterior distribution is the value that makes the pdf f0θ as close as possible to the true f∗.

III. Expression (4.3) makes an explicit connection between the parametric case (infinite number of moment restrictions) and the semi-parametric case (when only the first r moment conditions hold). The semiparametric case corresponds to the classical GMM approach while the parametric case corresponds to (and is as efficient as) the MLE. Indeed, the prior distribution for f specifies a parametric model for f0θ which satisfies the r moment restrictions and possibly other "extra" moment restrictions. The eigenvalues λj of the prior covariance operator play the role of weights of the "extra" moment restrictions and represent our "beliefs" concerning these restrictions. When we are very confident about these "extra" conditions, or equivalently when we believe that f0θ is close to f∗, then the λj are close to zero. In particular, since f0θ converges to f∗(θ) for every θ, the λj should converge to 0 as n → ∞. The prior distribution for f is degenerate when the parametric model is the true one.
When we are very uncertain about f0θ then the λj are very large and may tend to +∞. In this case we are back to the classical GMM.

IV. The posterior distribution (4.3) does not depend on Σ anymore. This is a considerable advantage since Σ is in general unknown, so that we do not have to estimate it.

4.1 Testing procedures

Remark III in section 4 is important if one is interested in constructing testing procedures. We are not going to develop a formal test here, as this will be the subject of a separate paper, but we would like to point out that our procedure suggests an easy way to test a parametric model against a semi-parametric one characterized by a finite number of moment restrictions. We can deal with the two following situations:

1. We know that the distribution of the data satisfies r moment restrictions and we want to test that it has a particular parametric form. In this case, for a given pdf g, the null hypothesis is H0 : f∗ = g ∈ EM, where g is such that ∫h(θ, x)g(x)Π(dx) = 0. An example is the univariate linear regression model: Y = Zθ + ε, where X = (Y, Z)′ and f∗ is the true joint pdf of X that satisfies the linear model. We may want to test that f∗ belongs to a particular parametric class.

2. There are r moment restrictions of which we are sure and we want to test the other moment restrictions. The null hypothesis writes H0 : E^{F∗}(hj(θ, x)) = 0 for some j > r.

The natural approach would be to treat the λj's corresponding to the extra conditions (that is, j > r) as hyperparameters for which a prior distribution is specified. The null hypothesis, in both the cases above, writes as H0 : λj = 0 for some (or for all) j > r. Then, the posterior distribution of λj may be used to draw a conclusion on the test: either by looking at whether the posterior concentrates around 0 (naive approach), or by considering posterior odds, or by constructing encompassing tests. The last two methods are more rigorous and will be the subject of a separate paper.

To construct a prior for the λj's let us write: λj = cρj, where c = tr Ω0 and Σ_{j=1}^{∞} ρj = 1. We propose two alternative priors.

Dirichlet prior. Suppose that we want to test the nullity of some λj's, say λj for r < j < J < ∞. Then we specify a Dirichlet prior for (ρ_{r+1}, . . . , ρ_{J−1}):

µρ(ρ_{r+1}, . . . , ρ_{J−1}|ν) ∝ ∏_{j=r+1}^{J−1} ρj^{νj−1} ( 1 − Σ_{j=r+1}^{J−1} ρj )^{νJ−1} ∏_{j=r+1}^{J−1} I(ρj ≥ 0) I( Σ_{j=r+1}^{J−1} ρj ≤ 1 )

where ν = (ν_{r+1}, . . . , νJ).

Prior on c > 0. Suppose that we want to test that all the moment restrictions are true (that is, a test of a parametric model against a semi-parametric one). Thus, the null hypothesis is H0 : c = 0. Remark that the {λj}_{j=1}^{r} corresponding to the first r moment restrictions do not affect the trace of Ω0 since they are equal to 0. A prior for c will be any distribution with support contained in the positive real semi-axis, for example an inverse gamma distribution.

5 Implementation

In this section we show, through the illustration of several examples, how our method can be implemented in practice. We start with toy examples that could also be treated with nonparametric priors different from the Gaussian prior. The interest in using Gaussian priors will become evident in the more complicated examples with over-identifying restrictions, which we show can be easily dealt with by using Gaussian priors.

5.1 Just identification and prior on θ through µ(f)

Let the parameter θ of interest be the population mean with respect to f, that is, θ = ∫xf(x)dx and h(θ, x) = (θ − x).
This example considers the just identified case where the prior on θ is deduced from the prior distribution of f , denoted by µ(f ). The prior µ(f ) is a Gaussian measure which is unrestricted except for the fact that it must generate trajectories that integrate to 1 a.s. To guarantee this, the prior mean function f0 must be 1 a pdf and the prior covariance operator Ω0 must be such that Ω02 1 = 0. Summarizing, the Bayesian experiment is 21 ( 1 f rn |f ∼ µ(f ) ∼ N (f0 , Ω0 ) , Ω02 1 = 0 (5.1) ∼ P f ∼ N (Kf, Σn ). In the simulations we present the results for a scaled and a non-scaled prior. The implied prior and posterior distribution for θ is given in the following lemma. We use the notation ι to denote the identity functional, that is, ι(x) = x. Lemma 5.1. The Bayesian experiment (5.1) implies that the prior distribution for θ = R xf (x)dx is Gaussian with mean hf0 , ιi and variance hΩ0 ι, ιi and its posterior distribution is θ|rn ∼ N (hf0 , ιi + hΩ0 K ∗ Cn−1 (rn − Kf0 ), ιi, h[Ω0 − Ω0 K ∗ Cn−1 KΩ0 ]ι, ιi) −1 where Cn−1 = n−1 Σ + KΩ0 K ∗ This approach is appealing because it avoids the specification of two prior distributions while keeping the specification of the sampling distribution completely nonparametric. The prior is specified for the parameter with the highest dimension, that is f , and it implies a prior on the parameter θ. We illustrate now how to construct in practice the covariance operator Ω0 in (5.1) for two cases: the case where F∗ has a compact support and the case where the support of F∗ is R. In both cases we suppose m = 1. 5.1.1 Compact support Let S = [−1, 1] and Π be the Lebesgue measure. Then, the Legendre polynomials {Pj }j≥0 are suitable to construct the eigenfunctions of Ω0 . dre polynomials are {1, x, (3x2 − 1)/2, (5x3 The first few Legen- − 3x)/2, . . .} and an important property of these polynomials is that they are orthogonal with respect to the L2 inner product on R1 [−1, 1]: −1 Pl (x)Pj (x)dx = 2/(2j + 1)δlj , where δlj is equal to 1 if l = j and to 0 other- wise. Moreover, the Legendre polynomial obey the recurrence relation (j + 1)Pj+1 (x) = (2j + 1)xPj (x) − jPj−1 (x) which is useful for computing Ω0 in practice. The normalized Legendre polynomials form a basis for L2 [−1, 1] so that we can construct the operator Ω0 as Ω 0 · = σ0 ∞ X j=0 λj 2j + 1 hPj , ·iPj 2 where λ0 = 0 and the {λj , j ≥ 1} can be chosen in an arbitrary way provided that P j≥1 λj < ∞. The constant σ0 can be set to an arbitrary value and has the purpose of tuning the size of the prior covariance. This construction of Ω0 and the fact that f0 is a 22 pdf guarantee that the prior distribution generates functions that integrate to 1 a.s. In our simulation exercise we generate n i.i.d. observations (x1 , . . . , xn ) from a N (0, 1) ˆ distribution truncated to the interval [−1, 1] and construct the function rn = Φ(t) = P n −1 itx i n as the empirical characteristic function. Therefore, f∗ = N (0, 1; −1, 1) and i=1 e θ∗ = 0. We set T = [−1, 1] and ρ equal to the Lebesgue measure. Thus, the operators K and K ∗ take the form ∀φ ∈ E, Kφ = Z 1 −1 itx e φ(x)dx and ∀ψ ∈ F, ∗ K ψ= Z 1 eitx ψ(t)dt. −1 p The eigenfunction of Ω0 are set equal to the normalized Legendre polynomials { (2j + 1)/2Pj }j≥0 , the eigenvalues are set equal to σ0 λj = 5 ∗ j −a for j ≥ 1 and a > 1. We rescale Ω0 by 1 αn . The prior mean function f0 is set equal to a N (%, 1) distribution truncated to the interval [−1, 1]. 
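For concreteness, a minimal numerical sketch of this construction (assumed discretization on a grid over [−1, 1]; not the authors' code) builds Ω0 from the normalized Legendre polynomials with λ0 = 0 and λj = σ0 j^{−a}, and checks that the resulting operator annihilates the constant function, so that prior trajectories integrate to one.

```python
import numpy as np
from numpy.polynomial import legendre

# grid on S = [-1, 1] (Pi = Lebesgue measure), trapezoidal quadrature weights
x = np.linspace(-1.0, 1.0, 501)
w = np.full_like(x, x[1] - x[0])
w[0] *= 0.5
w[-1] *= 0.5

J, sigma0, a = 30, 1.0, 1.7
# normalized Legendre polynomials sqrt((2j+1)/2) P_j, an orthonormal basis of L^2[-1, 1]
phi = np.column_stack([np.sqrt((2 * j + 1) / 2.0) * legendre.Legendre.basis(j)(x)
                       for j in range(J)])
lam = np.array([0.0] + [sigma0 * j ** -a for j in range(1, J)])   # lambda_0 = 0 kills the direction of 1

# discretized covariance operator: Omega_0 g = sum_j lambda_j <phi_j, g> phi_j
Omega0 = (phi * lam) @ (phi * w[:, None]).T

print(np.max(np.abs(Omega0 @ np.ones_like(x))))   # ~ 0: Omega_0 1 = 0 up to discretization error
```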
We show in Figures 1 and 2 the prior and posterior distribution of θ obtained for different values of σ0 and α. We consider 4 cases: figures 1a and 1b are for a shrinking prior (α = 0.1), figures 2a and 2b are for a spreading out prior (α = 0.0001). We also show the prior mean (magenta asterisk) and the posterior mean of θ (blue asterisk) and report their values. The prior mean is computed by using the exact formula (for the mean of the truncated normal distribution f0) while the posterior mean of θ is computed by discretizing the inner product ⟨E(f|rn), ι⟩. The pictures are obtained for n = 1000, ϱ = 1, a = 1.7. We consider two values for σ0: σ0 = 1 and σ0 = 1/3. The number of discretization points, used to approximate the integrals, is equal to 1000 for all the simulation schemes.

We see from the pictures that the support of the prior and posterior distribution of θ is not restricted to [−1, 1]. This is due to the fact that a Gaussian process prior does not guarantee that the trajectories generated from it are non-negative. Therefore, constructing a prior for θ from the Gaussian process prior for f neglects part of the information that we have, that is, the fact that θ belongs to [−1, 1]. In order to avoid this phenomenon, we can use the following procedure to construct the prior (posterior) for θ from that of f: (1) draw f(i) from its prior (posterior), (2) compute max(f(i), 0) and re-normalize it (in order to eliminate the negative part of the trajectory), (3) compute θ(i) = ⟨f(i), ι⟩, (4) iterate steps (1)-(3) for i = 1, . . . , I, (5) compute an approximation of the prior (posterior) distribution of θ through a kernel smoothing by using the I drawings of θ. Figure 3 shows the prior and posterior obtained by using this approach and a bandwidth h = 0.1 for the prior and h = 0.0001 for the posterior (this is because the posterior is much more concentrated than the prior).

[Figure 1: CASE I (Shrinking prior) - prior and posterior distributions and means of θ. The true value of θ is 0. Panel (a): σ0 = 1/3, α = 0.1; panel (b): σ0 = 1, α = 0.1.]

[Figure 2: CASE II (Spreading out prior) - prior and posterior distributions and means of θ. The true value of θ is 0. Panel (a): σ0 = 1/3, α = 0.0001; panel (b): σ0 = 1, α = 0.0001.]
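A minimal sketch of the normalization procedure just described (steps (1)-(5)) is given below; it is illustrative only and assumes that draws of f are available on a grid with spacing dx, with `draw_f()` a hypothetical sampler from the prior or posterior of f.

```python
import numpy as np
from scipy.stats import gaussian_kde

def theta_draws_from_f(draw_f, x_grid, n_draws=1000):
    """Steps (1)-(4): draw f, truncate at 0, renormalize, compute theta = <f, iota>."""
    dx = x_grid[1] - x_grid[0]
    thetas = np.empty(n_draws)
    for i in range(n_draws):
        f = draw_f()                          # (1) trajectory of f on x_grid
        f = np.maximum(f, 0.0)                # (2) drop the negative part ...
        f = f / (f.sum() * dx)                #     ... and renormalize to integrate to 1
        thetas[i] = np.sum(x_grid * f) * dx   # (3) theta^(i) = <f^(i), iota>
    return thetas

# (5) kernel-smoothed approximation of the distribution of theta, e.g.
# thetas = theta_draws_from_f(draw_f, x_grid); density = gaussian_kde(thetas, bw_method=0.1)
```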
[Figure 3: CASE III (Fixed prior) - prior and posterior distributions and means of θ obtained after a normalization of the drawings from µ(f). The true value of θ is 0. Panel (a): σ0 = 1, α = 0.0001, h = 0.1; panel (b): σ0 = 1, α = 0.0001, h = 0.0001.]

5.1.2 Support equal to R

Let S = R; the Hermite polynomials {Hj}j≥0 form an orthogonal basis of L2(R, B, Π) for dΠ(x) = e^{−x²/2}dx and can be used to construct the eigenfunctions of Ω0. The first few Hermite polynomials are {1, x, (x² − 1), (x³ − 3x), . . .} and an important property of these polynomials is that they are orthogonal with respect to Π: ∫_R Hl(x)Hj(x)e^{−x²/2}dx = √(2π) j! δ_{lj}, where δ_{lj} is equal to 1 if l = j and to 0 otherwise. They obey the recurrence relation Hj+1(x) = xHj(x) − jHj−1(x). The operator Ω0 is constructed as

Ω0 · = σ0 Σ_{j=0}^{∞} λj (1/(√(2π) j!)) ⟨Hj, ·⟩ Hj

where λ0 = 0 and {λj, j ≥ 1} = {a^j, j ≥ 1} with a < 1.

In our simulation exercise we generate n i.i.d. observations (x1, . . . , xn) from a N(1, 1) distribution and construct the function rn = Φ̂(t) = n^{−1} Σ_{i=1}^{n} e^{itx_i} as the empirical characteristic function. Therefore, f∗ = (1/√(2π)) e^{−(1−2x)/2} and θ∗ = 1. We set T = R and ρ = Π. Thus, the operators K and K∗ take the form

∀φ ∈ E, Kφ = ∫_R e^{itx} φ(x) e^{−x²/2} dx    and    ∀ψ ∈ F, K∗ψ = ∫_R e^{itx} ψ(t) e^{−t²/2} dt.

We rescale Ω0 by 1/(αn). The prior mean function f0 is set equal to a N(ϱ, 1) distribution. In our simulation we set: a = 0.3, ...

We show in Figures 4 and 5 the prior and posterior distribution of θ. We also show the prior mean (magenta asterisk) and the posterior mean of θ (blue asterisk) and report their values. The posterior mean of θ is computed by discretizing the inner product ⟨E(f|rn), ι⟩. The pictures are obtained for n = 1000, f0 = (1/√(2π)) e^{−(ϱ²−2ϱx)/2}, ϱ = 0, a = 0.3, σ0 = 1 and two values of α. The number of discretization points, used to approximate the integrals, is equal to 1000 for all the simulation schemes. Concerning the values for α, we consider two cases: in figure 4 the prior is shrinking, in figure 5 the prior is spreading out.

[Figure 4: CASE I (Shrinking prior) - prior and posterior distributions and means of θ. The true value of θ is 1. Panels (a) and (b): σ0 = 1, α = 1.]

[Figure 5: CASE II (Spreading out prior) - prior and posterior distributions and means of θ. The true value of θ is 1. Panels (a) and (b): σ0 = 1, α = 0.0001.]

5.2 Just identification and prior on θ

We consider the same framework as in the previous example where the parameter θ of interest is the population mean, that is, θ = ∫xf(x)dx and h(θ, x) = (θ − x), but now we are going to specify a joint proper prior distribution on (θ, f). We specify a marginal prior µ(θ) on θ and a conditional prior on f given θ. While the first one can be arbitrarily chosen, the latter is specified as a Gaussian distribution constrained to generate functions that integrate to 1 and that have mean equal to θ a.s. In particular, the prior mean function f0θ must be a pdf and ∫xf0θ(x)dx = θ must hold. The prior covariance operator Ω0 must be such that Ω0^{1/2} 1 = 0 and Ω0^{1/2} x = 0. Together with the constraint on f0θ, the first constraint on Ω0 guarantees that the trajectories of f generated by this prior integrate to 1 a.s., while the second one guarantees that ∫xf(x)dx = θ a.s.
Summarizing, the Bayesian experiment is

θ ∼ µ(θ)
f|θ ∼ µ(f|θ) ∼ N(f0θ, Ω0θ),    with ∫ x f0θ(x) dx = θ and Ω0^{1/2}(1, x)′ = 0        (5.2)
rn|f ∼ P^f ∼ N(Kf, Σn).

Compared to the approach in section 5.1, this approach allows us to incorporate easily any prior information that an economist may have about θ. In fact, taking into account the information on θ through the prior distribution of f is complicated, while incorporating such information directly in the prior distribution of θ turns out to be very simple.

Let us suppose that m = 1, S = [−1, 1] and Π is the Lebesgue measure. Then, the covariance operator Ω0θ can be constructed in the same way as proposed in section 5.1 since the second Legendre polynomial P1(x) = x allows us to implement the constraint on θ. The only difference concerns the number λ1 which has to be equal to 0 in this case. Therefore, we construct the operator Ω0θ as:

Ω0θ · = σ0 Σ_{n=2}^{∞} λn ((2n + 1)/2) ⟨Pn, ·⟩ Pn

where the λn, n ≥ 2, can be chosen in an arbitrary way provided that Σ_{n≥2} λn < ∞. The constant σ0 can be set to an arbitrary value and has the purpose of tuning the size of the prior covariance.

Many orthogonal polynomials are suitable for the construction of Ω0θ and they may be used to treat cases where S is different from [−1, 1]. Consider for instance the case S = R; then a suitable choice is the basis made of the Hermite polynomials. The Hermite polynomials {Hen}n≥0 form an orthogonal basis of the Hilbert space L2(R, BS, Π) where dΠ(x) = e^{−x²/2}dx. It turns out that f will be the density of F with respect to Π, and f0θ the density of another probability measure with respect to Π instead of with respect to the Lebesgue measure. The first Hermite polynomials are {1, x, (x² − 1), (x³ − 3x), (x⁴ − 6x² + 3), . . .}, so that we can construct an Ω0θ that satisfies the constraints by setting λ0 = λ1 = 0 in the following way:

Ω0θ · = Σ_{n=2}^{∞} λn (1/(√(2π) n!)) ⟨Hen, ·⟩ Hen.

We perform two simulation exercises: one makes use of the empirical cumulative distribution function to construct rn: rn(t) = F̂(t) := n^{−1} Σ_{i=1}^{n} 1{x_i ≤ t} (CASE IV), and one uses the empirical characteristic function rn(t) = Φ̂(t) := n^{−1} Σ_{i=1}^{n} e^{itx_i} as in section 5.1 (CASE V). In both simulations we use Legendre polynomials and we generate n i.i.d. observations (x1, . . . , xn) from a N(0, 1) distribution truncated to the interval [−1, 1] as in section 5.1. The prior distribution for θ is uniform over the interval [−1, 1]. The prior mean function f0θ is taken equal to the pdf of a Beta distribution with parameters pθ and q and with support [−1, 1]:

f0θ(x) = (x + 1)^{pθ−1} (1 − x)^{q−1} / ( B(pθ, q) 2^{pθ+q−1} ).

We use the notation pθ to stress the dependence on θ of this shape parameter. We fix q = 2 and recover pθ such that ∫_{−1}^{1} xf0θ(x)dx = θ. It is easy to see that for our Beta distribution: ∫_{−1}^{1} xf0θ(x)dx = (pθ − q)/(pθ + q). The covariance operator Ω0θ is constructed by using the Legendre polynomials and λn = n^{−a} for a > 1.
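As a small illustrative check of this construction (hypothetical code, not from the paper), one can map θ to the shape parameter pθ with q = 2 and verify numerically that the resulting prior mean function integrates to one and has mean θ.

```python
import numpy as np
from scipy.special import beta as beta_fn

def f0_theta(x, theta, q=2.0):
    """Prior mean function: Beta(p_theta, q) density rescaled to [-1, 1],
    with p_theta chosen so that int x f0_theta(x) dx = theta."""
    p = q * (1.0 + theta) / (1.0 - theta)        # solves (p - q)/(p + q) = theta
    return ((x + 1.0) ** (p - 1.0) * (1.0 - x) ** (q - 1.0)
            / (beta_fn(p, q) * 2.0 ** (p + q - 1.0)))

x = np.linspace(-1.0, 1.0, 2001)
theta = 0.3
f = f0_theta(x, theta)
print(np.trapz(f, x), np.trapz(x * f, x))        # approximately 1 and theta
```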
The covariance operator Ω0θ is constructed by using the Legendre polynomials and λn = n^{−a} for a > 1. The marginal posterior distribution of θ is obtained by integrating out f from the sampling distribution and by using the following Bayesian scheme:

θ ∼ U[−1, 1],
rn | θ ∼ N(Kf0θ, Σn + KΩ0θK∗).

Since the posterior distribution µ(θ|rn) cannot be computed in closed form, we simulate from it by using a Metropolis-Hastings algorithm, see for instance ?. To implement this algorithm we have to select an auxiliary pdf g.

The two simulations differ basically because in the first one we need to add an artificial regularization to the conditional posterior distribution of f given θ, as described in Remark A.4. On the contrary, in the simulation with Φ̂ we do not need to use an artificial regularization and we use a scaling prior with covariance operator (1/(αn)) Ω0. In both simulations, n = 1000 and the number of discretization points, used to approximate the integrals, is equal to 1000. We summarize the simulation schemes for the two cases.

1. Draw an i.i.d. sample (x1, . . . , xn) of size n from f∗ (where f∗ is a truncated standard normal distribution);
2. compute rn = F̂ (CASE IV) or rn = Φ̂ (CASE V);
3. draw θ ∼ U[−1, 1] and denote it θ̃;
4. compute pθ as pθ = 2(θ̃ + 1)/(1 − θ̃) (where we have fixed q = 2);
5. compute f0θ as the pdf of a Beta distribution with parameters (pθ, q) and support [−1, 1];
6. compute the conditional posterior mean and variance of f, given θ = θ̃ (these are used for Figure 7b);
7. draw θ from the marginal posterior distribution of θ by using a Metropolis-Hastings algorithm with the following auxiliary pdf (triangular distribution), see the sketch after Table 1:

g(ξ; θ) = ((ξ + 1)/(θ + 1)) 1{ξ ∈ [−1, θ)} + ((1 − ξ)/(1 − θ)) 1{ξ ∈ [θ, 1]}.

We draw 2000 values and discard the first 1000. The initial value for the algorithm is θ = 0.5.

We represent in Figure 6 the results for the simulation with rn(t) = F̂(t) and in Figure 7 the results for rn(t) = Φ̂(t). More precisely, Figure 6a (resp. 7a) shows drawings from the conditional prior distribution of f given θ (blue dashed-dotted line) together with the true f∗ that generates the data (black line) and the prior mean (dashed red line). Figure 6b (resp. 7b) shows drawings from the regularized conditional posterior distribution µ(f|τ, rn, θ) of f given θ (resp. from the conditional posterior distribution µ(f|rn, θ) of f given θ) (blue dashed-dotted line) together with the true f∗ that generates the data (black line) and the posterior mean (dashed red line). In Figure 6 the regularization parameter τ has been set equal to 0.1. Lastly, Figure 6c (resp. 7c) shows the posterior distribution of θ (marginalized with respect to f), approximated by using a kernel smoothing and 1000 drawings from the posterior, together with the posterior mean of θ. All the pictures in Figures 6 and 7 are obtained for σ0 = 5.

Table 1: Just identification and prior on θ: simulation schemes

            CASE IV: rn = F̂     CASE V: rn = Φ̂
n           1000                 1000
α or τ      τ = 0.1              α = 0.1
a           1.7                  1.7
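Step 7 is the only non-standard ingredient of the scheme, so here is a minimal sketch of that Metropolis-Hastings step (Python, assuming NumPy). The function log_post is a placeholder for the log of the unnormalized marginal posterior of θ, i.e. the U[−1, 1] prior combined with the N(Kf0θ, Σn + KΩ0θK∗) likelihood of rn evaluated on the discretization grid; only the triangular proposal g(ξ; θ) and the 2000/1000 draw/burn-in choices come from the text.

```python
import numpy as np

rng = np.random.default_rng(1)

def tri_pdf(xi, mode):
    # triangular proposal density g(xi; theta) on [-1, 1] with mode at theta
    return np.where(xi < mode, (xi + 1.0) / (mode + 1.0), (1.0 - xi) / (1.0 - mode))

def mh_theta(log_post, n_draws=2000, burn=1000, theta0=0.5):
    # log_post(theta): log of the (unnormalized) marginal posterior of theta given r_n (placeholder)
    draws, theta = [], theta0
    for _ in range(n_draws):
        prop = rng.triangular(-1.0, theta, 1.0)          # draw from the triangular proposal
        log_ratio = (log_post(prop) - log_post(theta)
                     + np.log(tri_pdf(theta, prop)) - np.log(tri_pdf(prop, theta)))
        if np.log(rng.uniform()) < log_ratio:            # Metropolis-Hastings accept/reject
            theta = prop
        draws.append(theta)
    return np.array(draws)[burn:]
```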
5.3 Overidentified case

Let us consider the case in which the one-dimensional parameter of interest θ is characterized by the moment conditions EF(h(θ, x)) = 0 with h(θ, x) = (θ − x, θ² − x²/2)′. For instance, this arises when the true data generating process F is an exponential distribution with parameter θ. We specify a prior distribution for (θ, f). The prior µ(θ) is chosen arbitrarily provided that the potential constraints on θ are satisfied; these are essentially constraints on the support of θ.

The moment conditions affect the conditional prior distribution of f given θ. This is a Gaussian distribution whose mean function f0θ is any pdf with the same support as F that satisfies ∫ x f0θ(x) dΠ(x) = θ and ∫ x² f0θ(x) dΠ(x) = 2θ². The covariance operator Ω0θ of f must be such that

Ω0θ^{1/2} (1, x, x²)′ = 0.      (5.3)

In our simulation exercise we take S = R+ and dΠ(x) = e^{−x} dx. We use the empirical cumulative distribution function to construct rn: rn(t) = F̂(t) := n⁻¹ ∑_{i=1}^n 1{xi ≤ t}. We generate n = 1000 observations x1, . . . , xn independently from an exponential distribution with mean θ∗ = 2. Therefore, the true f associated with this DGP is f∗(x) = θ∗⁻¹ e^{−(1−θ∗)x/θ∗}, which obviously satisfies the moment restrictions. The marginal prior distribution µ(θ) for θ is a chi-squared distribution with 1 degree of freedom and, for every value of θ drawn from this µ(θ), the prior mean function f0θ is fixed equal to f0θ(x) = θ⁻¹ e^{−(1−θ)x/θ}.

[Figure 6 about here. Panels: (a) draw from the conditional prior distribution µ(f|θ = 0.2) (in green), prior mean and true f∗, τ = 0.1 and a = 1.7; (b) draw from the regularized conditional posterior distribution µ(f|rn, θ = 0.2, τ) (in blue), regularized posterior mean, prior mean and true f∗, τ = 0.1 and a = 1.7; (c) posterior distribution and mean of θ (posterior mean of θ = −0.06), a = 1.7, kernel bandwidth h = 0.2.]

Figure 6: CASE IV: rn = F̂ and n = 1000. The true value of θ is 0 and an ad-hoc regularization has been used.

We fix the eigenfunctions of Ω0θ proportional to the Laguerre polynomials {Ln}n≥0. The first few Laguerre polynomials are {1, 1 − x, (x² − 4x + 2)/2, (−x³ + 9x² − 18x + 6)/6, . . .} and they are orthogonal in L²(R+, e^{−x}). Remark that x = L0 − L1 and x²/2 = L2 − 2L1 + L0. Therefore, we construct the operator Ω0θ as:

Ω0θ · = σ0 ∑_{n=0}^∞ λn ⟨Ln, ·⟩ Ln

with λ0 = λ1 = λ2 = 0 to guarantee that (5.3) holds. The constant σ0 and the λn, n ≥ 3, can be arbitrarily set provided that ∑_{n≥3} λn < ∞.
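As a small sanity check of this construction (a sketch assuming NumPy; the truncation level and the quadrature order are illustrative), one can verify numerically that 1, x and x²/2 load only on L0, L1 and L2, so that setting λ0 = λ1 = λ2 = 0 makes Ω0θ^{1/2} annihilate these directions, which is exactly constraint (5.3).

```python
import numpy as np
from numpy.polynomial.laguerre import lagval, laggauss

N = 20                                       # truncation level for the Laguerre basis (illustrative)
nodes, weights = laggauss(40)                # Gauss-Laguerre quadrature: integrals against exp(-x)dx on R+
L = np.stack([lagval(nodes, np.eye(N)[n]) for n in range(N)])    # L_0, ..., L_{N-1} at the nodes

# Fourier coefficients <L_n, phi> in L^2(R+, exp(-x)dx) for phi = 1, x and x^2/2
for phi in (np.ones_like(nodes), nodes, nodes**2 / 2.0):
    coeff = (L * phi * weights).sum(axis=1)
    print(np.round(coeff[:5], 10))           # only the n = 0, 1, 2 coefficients are non-zero, so
                                             # lambda_0 = lambda_1 = lambda_2 = 0 gives
                                             # Omega_{0theta}^{1/2} phi = 0, i.e. (5.3) holds
```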
[Figure 7 about here. Panels: (a) draw from the conditional prior distribution µ(f|θ = 0.2268) (in green), prior mean and true f∗, α = 0.1 and a = 1.7; (b) draw from the conditional posterior distribution µ(f|rn, θ = 0.2268) (in blue), posterior mean and true f∗, α = 0.1 and a = 1.7; (c) posterior distribution and mean of θ (posterior mean of θ = −0.11), a = 1.7, kernel bandwidth h = 0.2.]

Figure 7: CASE V: rn = Φ̂ and n = 1000. The true value of θ is 0, scaled prior and no regularization scheme.

We present in Figure 8 results obtained by using the regularized posterior distribution µ(f|τ, rn, θ) of f given θ, and in Figure 9 results obtained by using the (non-regularized) posterior distribution µ(f|rn, θ) of f given θ and a non-scaled prior µ(f|θ). In Figure 8 we take σ0 = 1 and λn = n^{−1.1} for n ≥ 3, while in Figure 9 we take σ0 = 10 and λn = n^{−1.7} for n ≥ 3. The regularization parameter τ in Figure 8 is set equal to 0.1. We represent in Figure 8a (resp. 9a) draws from the conditional prior distribution of f given θ (blue dashed-dotted line) together with the true f∗ that has generated the data (black line) and the prior mean (dashed red line). Figure 8b (resp. 9b) shows draws from the regularized (resp. non-regularized) conditional posterior distribution of f given θ (blue dashed-dotted line) together with the true f∗ having generated the data (black line) and the posterior mean (dashed red line). Lastly, Figure 8c (resp. 9c) shows the posterior distribution of θ (marginalized with respect to f), approximated by using a kernel smoothing and 1000 drawings from the posterior distribution, together with the posterior mean of θ. The posterior distribution of θ is obtained by integrating out f from the sampling distribution in the following way:

θ ∼ χ²_1,
rn | θ ∼ N(Kf0θ, Σn + KΩ0θK∗).

As the posterior distribution of θ cannot be computed in closed form, we have used a Metropolis-Hastings algorithm to simulate from it. To implement this algorithm we selected, as auxiliary distribution, a χ²_{⌈θ⌉} distribution.

[Figure 8 about here. Panels: (a) draw from the conditional prior distribution µ(f|θ), prior mean and true f, a = 1.1; (b) draw from the regularized conditional posterior distribution µ(f|τ, rn, θ), regularized posterior mean, prior mean and true f, τ = 0.1 and a = 1.1; (c) posterior distribution and mean of θ, τ = 0.1, a = 1.1, kernel bandwidth h = 0.1.]

Figure 8: Overidentified case and regularized posterior distribution µ(f|τ, rn, θ). rn = F̂. The true value of θ is 2 and the posterior mean is E(θ|rn) = 2.3899.

Appendix A  Conditional posterior distribution of f, given θ

The conditional distribution of f given (rn, θ), that is, the posterior distribution of f, is a Gaussian process, see Theorem 1 in Florens and Simoni [2014]. The conditional posterior mean and variance of f|rn, θ, in general, raise problems due to the infinite dimension of f. While this point has been broadly discussed in [Florens and Simoni, 2012, 2014] and references therein, in this section we analyze it in the particular case considered in the paper where the operators take a specific form.

[Figure 9 about here. Panels: (a) draw from the conditional prior distribution µ(f|θ), prior mean and true f, a = 1.1; (b) draw from the conditional posterior distribution µ(f|rn, θ), posterior mean and true f, a = 1.1; (c) posterior distribution and mean of θ, a = 1.1, kernel bandwidth h = 0.1.]
Figure 9: Overidentified case and posterior distribution µ(f |rn , θ) with non-scaled prior. rn = Fˆ . The true value of θ is 2 and the posterior mean is E(θ|ˆ r) = 3.5561. Intuitively, the problem encountered in the computation of the moments of the Gaussian posterior distribution of f given θ is the following. The moments of a conditional Gaussian distribution involve the inversion of the covariance operator of the conditioning variable, that is (Σn + KΩ0θ K ∗ ) in our case. The problem arises because the inverse operator (Σn +KΩ0θ K ∗ )−1 is in general defined only on a subset of F of measure zero. Therefore, in general there is no closed-form available for the posterior mean and variance of f |θ which implies that they cannot be computed. However, for the framework under consideration we determine mild conditions that allow to solve this problem and to have an inverse (Σn + KΩ0θ K ∗ )−1 continuously defined on F. Therefore, there exists a closed-form for the posterior mean and variance of f |θ and no problems of continuity are risen. We illustrate these conditions in the lemmas below where we use the notation µ(f |rn , θ) for the conditional posterior distribution of f given θ and B for the Borel σ-field generated by the open sets of Θ. 33 Lemma A.1. Consider the Gaussian distribution (2.7) on BE × BF and assume that −1/2 f∗ ∈ R(K ∗ ). Then, the conditional distribution on BE conditional on BF × B, denoted by µ(f |rn , θ), exists, is regular and a.s. unique. It is Gaussian with mean E[f |rn ] = f0θ + A(rn − Kf0θ ) (A.1) and trace class covariance operator V ar[f |rn ] = Ω0θ − AKΩ0θ : E → E (A.2) where A := −1/2 Ω0θ Mf 1 1/2 1 −1/2 −1/2 1/2 Ω0θ Mf I − Mf hMf , ·i + Mf n n −1 −1/2 ∗ ((K ∗ )−1 Mf ) is a continuous and linear operator from F to E. Proof. The first part of the theorem follows from theorem 1 (ii ) in Florens and Simoni [2014]. From this result, since Σn = know that E[f |rn ] = f0θ + 1 n Σ, Ω0θ K ∗ ( n1 Σ where Σ : F → F is defined in Lemma 2.2, we + KΩ0θ K ∗ )−1 (rn − Kf0θ ) and V ar[f |rn ] = Ω0θ − Ω0θ K ∗ ( n1 Σ+KΩ0θ K ∗ )−1 KΩ0θ . Hence, we have to show that Ω0θ K ∗ ( n1 Σ+KΩ0θ K ∗ )−1 = A −1 1 1 − 21 − 12 1 1 2 2 ˜ and that A is continuous and linear. Denote M = n I − n Mf hMf , ·i + Mf Ω0θ Mf and ˘ = M 1 1 KMf K ∗ − (KMf 1)hMf , K ∗ ·i + KΩ0θ K ∗ n n −1 . ˘ and By using the result of Lemma 2.2, we can write: Ω0θ K ∗ ( n1 Σ + KΩ0θ K ∗ )−1 = Ω0θ K ∗ M then 1 − 21 − 21 ∗ − 21 ∗ − 12 ∗ ˘ ∗ −1 ∗ −1 ∗ −1 ˜ ˜ Ω0θ K ( Σ + KΩ0θ K ) = Ω0θ Mf M ((K ) Mf ) + Ω0θ K M − Mf M ((K ) Mf ) n ∗ 1 1 − ˜ ((K ∗ )−1 M − 2 )∗ = Ω0θ Mf 2 M f where the second equality follows because 34 − 21 ∗ − 21 ∗ − 21 − 21 −1 ∗ −1 ∗ −1 ∗ ˘ ˘ ˜ ˜ ˘ M K M − Mf M ((K ) Mf ) = K − Mf M ((K ) Mf ) M 1 21 1 12 − 21 − 21 − 21 ∗ ∗ ∗ −1 −1 ˜ ˘ ˘ = Mf M M − Mf hMf , ·i + Mf Ω0θ K − ((K ) Mf ) M M n f n ∗ = 0. This establishes that Ω0θ K ∗ ( n1 Σ+KΩ0θ K ∗ )−1 is equal to A. We now show that the operator − 21 A is continuous and linear on F. First, remark that the assumption f∗ that −1 (K ∗ )−1 Mf 2 ∈ R(K ∗ ) ensures exists and is bounded. Since Ω0θ is the covariance operator of a Gaussian 1 2 is Hilbert-Schmidt, measure on a Hilbert space then, it is trace class. This means that Ω0θ which is a compact operator. Therefore, since the product of two bounded and compact − 12 operators is compact, it follows that Ω0θ , Ω0θ Mf It is also easy to show that the operator −1 − 21 and Mf 2 Ω0θ Mf 1 2 1 n Mf are compact. 1 2 hMf , ·i : E → E is compact since its Hilbert-Schmidt norm is equal to 1. 
In particular this operator has rank equal to 1/n since it has only one eigenvalue different from 0 and which is equal to 1. This eigenvalue corresponds 1 1 1 −1 −1 to the eigenfunction f∗2 . Therefore, the operator ( n1 Mf2 hMf2 , ·i−Mf 2 Ω0θ Mf 2 ) is compact. By the Cauchy-Schwartz inequality we have ∀φ ∈ E, ˜ −1 φ, φi = 1 ||φ||2 − hM n 1 ≥ ||φ||2 − n 1 1 1 1 12 −1 −1 2 2 hf∗ , φi2 + hΩ0θ f∗ 2 φ, Ω0θ f∗ 2 φi n 1 1 12 2 −1 2 f∗ 2 φ||2 ||f∗ || ||φ||2 + ||Ω0θ n −1 2 ≥ ||Ω0θ f∗ 2 φ||2 ≥ 0 1 ˜ is injective. Then, from the Riesz Theosince ||f∗2 ||2 = 1. Therefore, we conclude that M ˜ : E → E is bounded. rem 3.4 in Kress [1999] it follows that the operator M Finally, the operator A is bounded and linear since it is the product of bounded linear operators. We conclude that A is a continuous operator from F to E. Remark A.1. If f∗−1 ∈ R(K ∗ ) then the operator A : F → E of the theorem may be written in an equivalent way as: ∀ϕ ∈ F Aϕ = Ω0θ 1 1 I + hf∗ , ·i + f∗−1 Ω0θ n n −1 ((K ∗ )−1 f∗−1 )∗ . (A.3) Remark A.2. If f∗ is assumed to be bounded away from 0 and ∞ on its support, then 35 −1/2 ∈ R(K ∗ ), cannot be satisfied if the condition f∗−1 ∈ R(K ∗ ), as well as the condition f∗ R k(t, x) is such that ∀ψ ∈ F, K ∗ ψ = T k(t, x)ψ(t)ρ(dt) vanishes at some x in the support of f∗ . This excludes the kernel k(t, x) = 1{x ≤ t} when T is equal to a compact set, say T = [a, b]. This remark suggests that some care must be taken by the researcher when he/she chooses the operator K according to its prior information about f∗ . The next lemma provides a condition alternative to the one given in lemma A.1 which also guarantees continuity of the inverse of (Σn + KΩ0θ K ∗ ). Let H(Ω0θ ) denote the repro- ducing kernel Hilbert space associated with Ω0θ and embedded in E. Lemma A.2. Consider the Gaussian distribution (2.7) on BE × BF and assume that 1 2 ) ⊆ R(Σ). Then, the result of K|H(Ω0θ ) is injective and that Ω0θ is such that R(KΩ0θ lemma A.1 holds with A equal to A := 1/2 Ω0θ 1 1/2 1/2 I + Ω0θ K ∗ Σ−1 KΩ0θ n −1 1/2 (Σ−1 KΩ0θ )∗ . Proof of Lemma A.2 1 1 1 2 2 Since KΩ0θ = K|H(Ω0θ ) Ω0θ and K|H(Ω0θ ) is injective by assumption then Σ− 2 K|H(Ω0θ ) is well defined by lemma 2.3. By applying theorem 1 (iii) in Florens and Simoni [2014] we conclude. The trajectories of f generated by the conditional posterior distribution µfrn ,θ verify a.s. the moment conditions and integrate to 1. To see this, first remark that the posterior covariance operator satisfies the moment restrictions: 1/2 [Ω0θ − AKΩ0θ ]1/2 (1, h0 (θ, ·))0 = [I − AK]1/2 Ω0θ (1, h0 (θ, ·))0 = 0 where we have factorized Ω0θ on the right and used assumption (2.2). Second, a trajectory f drawn from the posterior µ(f |θ, rn ) is such that (f − f0θ ) ∈ R (Ω0θ − AKΩ0θ )1/2 , µ(f |θ, rn )-a.s. Now, for any ϕ ∈ R (Ω0θ − AKΩ0θ )1/2 we have hϕ, h(θ, ·)i = h[Ω0θ − 1 1/2 ψ, Ω 2 h(θ, ·)i = 0, for some ψ ∈ E where A ˜ ˜ = AKΩ0θ ]1/2 ψ, h(θ, ·)i = h[I − AKΩ 0θ ] 0θ K ∗ (Σ n + KΩ0θ K ∗ )−1 , and hϕ, 1i = 0 by a similar argument. This shows that Z Z 1/2 ⊂ ϕ∈E ; ϕ(x)h(θ, x)Π(dx) = 0 and ϕ(x)Π(dx) = 0 R (Ω0θ − AKΩ0θ ) and since the set on the right of this inclusion is closed we have 36 R (Ω0θ − AKΩ0θ )1/2 Z Z ⊂ ϕ∈E ; ϕ(x)h(θ, x)Π(dx) = 0 and ϕ(x)Π(dx) = 0 . R Therefore, µ(f |θ, rn )-a.s. a trajectory f drawn from µ(f |θ, rn ) is such (f −f0θ )(x)Π(dx) = R R R 0 and (f −f0θ )(x)h(θ, x)Π(dx) = 0 which implies: f (x)Π(dx) = 1 and f (x)h(θ, x)Π(dx) = 0. Remark A.3. 
The posterior distribution of f conditional on θ gives the revision of the prior on f except in the direction of the constant and of the moment conditions that remain unchanged. For estimating θ we have many possibilities: we could for instance estimate θ by maximum likelihood by using the density given by E(f |rn , θ) as the probability density of the data. We do not develop this approach but we use a completely Bayesian approach by trying to recover a conditional distribution of θ conditional on rn . Remark A.4. [Regularized Posterior Distribution] When neither the conditions of lemma A.1 nor the conditions of lemma A.2 are satisfied then we cannot use the exact posterior distribution µ(f |θ, rn ). Instead, we use the regularized posterior distribution denoted by µ(f |τ, θ, rn ), where τ > 0 is a regularization parameter that must be suitably chosen and that converges to 0 with n. This distribution has been proposed by Florens and Simoni [2012] and we refer to this paper for a complete description of it. Here, we only give its expression: µ(f |τ, θ, rn ) is a Gaussian distribution with mean function E[f |rn , τ ] = f0θ + Aτ (rn − Kf0θ ) (A.4) V ar[f |rn , τ ] = Ω0θ − Aτ KΩ0θ : E → E (A.5) and covariance operator where Aτ := Ω0θ K ∗ 1 τ I + I + KΩ0θ K ∗ n and I : F → F denotes the identity operator. 37 −1 :E →E (A.6) B Proofs for Section 2 Proof of Lemma 2.1 Let H(Ω0θ ) denote the reproducing kernel Hilbert space associated with Ω0θ and em- bedded in E and H(Ω0θ ) denote its closure. If f |θ ∼ N (f0θ , Ω0θ ) then (f − f0θ ) ∈ H(Ω0θ ), −1/2 µ(F |θ)-a.s. Moreover, H(Ω0θ ) = D(Ω0θ 1/2 ) = R(Ω0θ ) where D and R denote the do- main and the range of an operator, respectively. This means that ∀ϕ ∈ H(Ω0θ ) there 1 2 exists ψ ∈ E such that ϕ = Ω0θ ψ. Moreover, for any ϕ ∈ H(Ω0θ ) we have hϕ, h(θ, ·)i = 1 1 R 2 2 ψ, h(θ, ·)i = hψ, Ω0θ h(θ, ·)i = 0 and hϕ, 1i = 0 by a similar ϕ(x)h(θ, x)Π(dx) = hΩ0θ argument. Hence, H(Ω0θ ) ⊂ Z Z ϕ∈E; ϕ(x)h(θ, x)Π(dx) = 0 and ϕ(x)Π(dx) = 0 . (B.1) Since the set on the right of this inclusion is closed we have H(Ω0θ ) ⊂ Z Z ϕ∈E; ϕ(x)h(θ, x)Π(dx) = 0 and ϕ(x)Π(dx) = 0 . We deduce that µ(F |θ)-a.s. Z (f − f0θ )(x)Π(dx) = 0 Z and (f − f0θ )(x)h(θ, x)Π(dx) = 0. Condition (2.1) and the fact that f0θ is a pdf imply the results of the lemma. Proof of Lemma 2.2 The result follows trivially from the definition of the covariance operator Σn : F → F: ∀ψ ∈ F, Z Z Z Z Z 1 k(t, x)f∗ (x)Π(dx) (k(t, x)k(s, x)) f∗ (x)Π(dx)ψ(t)ρ(dt) − k(s, x)f∗ (x)Π(dx) ψ(t)ρ(dt) Σn ψ = n T S T S Z Z Z S Z Z 1 = k(s, x)f∗ (x) k(t, x)ψ(t)ρ(dt)Π(dx) − k(s, x)f∗ (x)Π(dx) k(t, x)ψ(t)ρ(dt)f∗ (x)Π(dx) n S T S S T 1 = [KMf K ∗ ψ − (KMf 1)hMf , K ∗ ψi] n where the second equality has been obtained by using the Fubini’s theorem. 38 Proof of Lemma 2.3 We can rewrite Σ as ∀ψ ∈ F, Σψ = = Z ZT T E (v(x, t)v(x, s)) ψ(t)ρ(dt) Z (v(x, t)v(x, s)) f∗ (x)Π(dx)ψ(t)ρ(dt) S where v(x, t) = [k(x, t) − E(k(x, t))]. Then, ∀ψ ∈ F we can write Σψ = RMf R∗ ψ where R : E → F, Mf : E → E and R∗ : F → E are the operators defined as R∗ ψ = ∀ψ ∈ F, Z v(x, t)ψ(t)ρ(dt) T ∀ϕ ∈ E, Mf ϕ = f∗ (x)ϕ(x) Z v(x, t)ϕ(x)Π(dx). ∀ϕ ∈ E, Rϕ = S 1 1 1/2 1 Moreover, we have D(Σ− 2 ) = R(Σ 2 ) = R((RMf R∗ ) 2 ) = R(RMf ). R Let h ∈ R(K), that is, there exists a g ∈ E such that h(t) = S k(t, x)g(x)Π(dx). Then 1 R 1 h ∈ D(Σ− 2 ) if there exists an element ν ∈ E such that h(t) = S v(x, t)f∗2 (x)ν(x)Π(dx). 
By developing this equality, the element ν has to satisfy Z Z 1 v(x, t)f∗2 (x)ν(x)Π(dx) 1 Z ZS ZS k(x, t)f∗ (x)Π(dx) f∗2 (x)ν(x)Π(dx) k(t, x)g(x)Π(dx) = k(x, t) − ⇔ 1 S Z 1 ZS ZS 2 2 k(x, t) f∗ (x)ν(x) − f∗ (x) k(t, x)g(x)Π(dx) = ⇔ f∗ (x)ν(x)Π(dx) Π(dx). k(t, x)g(x)Π(dx) = S S S If K is injective it follows that such an element ν must satisfy 1 2 g(x) = f∗ ν(x) − f∗ (x) which in turn implies that solution is ν(x) −1 = f∗ 2 g(x) − 21 R S Z S f∗ (x)ν(x)Π(dx) 1 2 g(x)Π(dx) = 0, i.e. that h ∈ R(K|D ). Therefore, one which proves that the range of the truncated operator K|D in 1 contained in D(Σ ). On the other side, let h ∈ D(Σ− 2 ), then there exists a ν ∈ E such 1 R that h = S v(x, t)f∗2 (x)ν(x)Π(dx). By the previous argument and under the assumption that K|D is injective, this that h ∈ R(K|D ) since there exists g ∈ D such that implies 1 R 21 1 2 g(x) = f∗ ν(x) − f∗ (x) S f∗ (x)ν(x)Π(dx) . This shows the inclusion of D(Σ− 2 ) in R(K|D ) and concludes the proof. 39 Proof of Theorem 2.1 1/2 In this proof we denote B = Σ−1/2 KΩ0θ . To prove that Pnθ and Pn0 are equivalent we first rewrite the covariance operator of Pnθ as h i 1√ √ 1 1 1 n−1 Σ 2 I + nΣ− 2 KΩ0θ K ∗ Σ− 2 Σ 2 n−1 . = n−1 Σ + KΩ0θ K ∗ ˜ Then according to htheorem 3.3 p.125 in Kuo i [1975] we have to verify that K(f − f0θ ) ∈ 1 1 R(Σ1/2 ) and that I + nΣ− 2 KΩ0θ K ∗ Σ− 2 is positive definite, bounded, invertible with 1 1 nΣ− 2 KΩ0θ K ∗ Σ− 2 Hilbert Schmidt. • Since (f˜ − f0θ ) ∈ D and since K|D is injective then, by lemma 2.3, K(f˜ − f0θ ) ∈ R(Σ1/2 ). • Positive definiteness. It is trivial to show that the operator (I +nBB ∗ ) is self-adjoint, i.e. (I + nBB ∗ )∗ = (I + nBB ∗ ). Moreover, ∀ϕ ∈ F, ϕ 6= 0 h(I + nBB ∗ )ϕ, ϕi = hϕ, ϕi + nhB ∗ ϕ, B ∗ ϕi = ||ϕ||2 + n||B ∗ ϕ|| > 0. • Boundedness. By lemma 2.3, if K|D is injective, the operators B and B ∗ are bounded ; the operator I is bounded by definition and a linear combination of bounded operators is bounded, see Remark 2.7 in Kress [1999]. • Continuously invertible. The operator (I + nBB ∗ ) is continuously invertible if its inverse is bounded, i.e. there exists a positive number C such that ||(I +nBB ∗ )−1 ϕ|| ≤ C||ϕ||, ∀ϕ ∈ F. We have ||(I + nBB ∗ )−1 ϕ|| ≤ (supj n−1 )||ϕ|| n−1 +lj2 = ||ϕ||, ∀ϕ ∈ F. p • Hilbert-Schmidt. We consider the Hilbert-Schmidt norm ||nBB ∗ ||HS = n tr((BB ∗ )2 ). ˜ ∗ BΩ ˜ 0θ B ˜ ∗ B) ˜ ≤ tr(Ω0θ )||B ˜ ∗ BΩ ˜ 0θ B ˜ ∗ B|| ˜ < ∞ since B ˜ := Now, tr((BB ∗ )2 ) = tr(Ω0 B 1 Σ− 2 K|H(Ω0θ ) has a bounded norm by lemma 2.3. Hence, ||nBB ∗ ||HS < ∞ for n finite. This shows that Pnθ and Pn0 are equivalent. Next we derive (2.9). Let zj = hrn − K f˜, Σ−1/2 ψj (θ)i. This variable is defined under the further assumption that ψj (θ) ∈ R(Σ1/2 ), ∀j ≥ 0. By theorem 2.1 in Kuo [1975, page 116]: ∞ Y dνj dPnθ = dPn0 dµj j=1 40 √ where νj denotes the distribution of nzj under Pnθ and µj denotes the distribution of √ nzj under Pn0 . By writing down the likelihoods of νj and µj with respect to the Lebesgue measure we obtain dPnθ (rn ) dPn0 = ∞ Y j=1 1+ 2 n ljθ −1/2 −1 2 n exp{− 21 (zj − hK(f0θ − f˜), Σ−1/2 ψj (θ)i)2 n 1 + ljθ } exp{− 12 zj2 n} which, after simplifications, gives the result. C Proofs for Section 3 Proof of Theorem 3.1 First remark that assumption A3 and compactness of Θ imply that γ 0 h(θ,x) 0. Let γ(θ) := arg min E∗ e k(xi T , t)(1−eγn (θ) h(xi ,θ) ), and G(k)(t) := √1 n Pn γ(θ)0 h(θ,x) and f0θ (x) = e g(xi , t, θ) := k(xi i=1 (k(xi , t) 1 n supθ∈Θ µ(θ) → f∗ (x). 
Denote gn (xi , t, θ) := P G(g)(t) := √1n ni=1 (g(xi , t, θ) − K(f∗ − f0θ )) T , t)(1−eγ(θ) h(xi ,θ) ), − Kf∗ )). We first have to show | n1 ln (θ) − l(θ)| → 0 uni- formly in θ. For that, we make the following decomposition: ∞ ∞ n X 1 1 X 1 X 1 2 ln (θ) − l(θ) = − 1 √ +1)− log(nl h g(xi , t, θ), Σ−1/2 ψj (θ)i2 jθ n 2 2n 2n n 1 + nljθ j=1 j=1 i=1 ∞ n 1 1 X 1 X h√ (gn (xi , t, θ) − g(xi , t, θ)) , Σ−1/2 ψj (θ)i2 − 2 2n n 1 + nljθ j=1 i=1 + ∞ n 1 X 1 X h√ k(xi , t) − K f˜ , Σ−1/2 ψj (θ)i2 2n n j=1 i=1 ∞ n n 1X 1 X 1 X 1 −1/2 h√ − (gn (xi , t, θ) − g(xi , t, θ)) , Σ ψj (θ)ih √ g(xi , t, θ), Σ−1/2 ψj (θ)i 2 n n n 1 + nljθ j=1 i=1 + 1 2 ∞ X j=1 hK(f0θ − f∗ ), Σ−1/2 ψj (θ)i2 − 41 i=1 ∞ X 1 2 j=1 hK(f˜ − f∗ ), Σ−1/2 ψj (θ)i2 Because 1 n Pn i=1 (gn (xi , t, θ) − g(xi , t, θ)) →p 0 uniformly in θ then, we can write ∞ ∞ 1 X 1 X 1 1 2 log(nljθ + 1) − hG(g)(t), Σ−1/2 ψj (θ)i2 | ln (θ) − l(θ)| = op (1) + − 2 n 2n 2n 1 + nljθ j=1 j=1 ∞ n 1 X hK(f∗ − f0θ ), Σ−1/2 ψj (θ)i2 − 2 2n 1 + nljθ j=1 1 −√ n ∞ X j=1 hG(g)(t), Σ−1/2 ψj (θ)ihK(f∗ −f0θ ), Σ−1/2 ψj (θ)i ∞ ∞ 1 1 X hG(k), Σ−1/2 ψj (θ)i2 + 2 1 + nljθ 2n j=1 1 X +√ hG(k), Σ−1/2 ψj (θ)ihK(f˜ − f∗ ), Σ−1/2 ψj (θ)i n j=1 + ∞ 1X hK(f0θ − f∗ ), Σ−1/2 ψj (θ)i2 2 j=1 ∞ ∞ 1 X 1 X 1 2 =− log(nljθ + 1) − hG(g)(t), Σ−1/2 ψj (θ)i2 2 2n 2n 1 + nljθ j=1 j=1 2 nljθ 1 + kΣ−1/2 K(f∗ − f0θ )k2 2 2 1 + nljθ ∞ ∞ 1 1 X 1 X hG(g)(t), Σ−1/2 ψj (θ)ihK(f∗ −f0θ ), Σ−1/2 ψj (θ)i hG(k), Σ−1/2 ψj (θ)i2 −√ + 2 n 2n 1 + nljθ j=1 j=1 1 +√ n ∞ X j=1 hG(k), Σ −1/2 −1/2 ˜ ψj (θ)ihK(f − f∗ ), Σ ψj (θ)i where op (1) is uniform in θ and ψj (θ) is uniformly continuous in θ due to compactness of Θ. Morevoer, remark that: supt∈T,θ∈Θ |n−1/2 G(g)(t)| → 0, P ∗ -a.s., by Lemma D.1; 2 → 0 by supt∈T |n−1/2 G(k)(t)| → 0, P ∗ -a.s., because k(x, t) is chosen to be Donsker; nljθ construction of the prior. Therefore, supθ∈Θ | n1 ln (θ) − l(θ)| → 0, P∗ -a.s. Finally, by continuity of l(θ) in θ and because the maximizer of l(θ) is the true value θ∗ which is a well-separated point of maximum, we conclude that θn → θ∗ in P∗ -probability. Proof of Theorem 3.2 We first state and prove a preliminary lemma. The proof of this lemma and of the theorem are slightly modifications of results in Kim [2002] and Chen [1985]. Lemma C.1. Under B1 – B4, 1 lim µ(θn |rn )|∆n | 2 ≤ (2π)−k/2 . n→∞ 1 (C.1) Moreover, limn→∞ µ(θn |rn )|∆n | 2 = (2π)−k/2 in P ∗ -probability if and only if for some δn > 42 0, µ(θ ∈ B(θn , δn )|rn ) → 1 in P ∗ -probability. Proof. Denote pn the denominator of µ(θ|rn ). For any > 0, let N > 0 and δn & 0 such that Assumption B3 is satisfied for every n > N . Under B1 and B2, for every θ ∈ B(θn , δn ) a second order Taylor expansion of Ln (θ) around θn gives pnθ (rn ; θ)µ(θ) = pnθ (rn ; θn )µθ (θn ) exp(Ln (θ) − Ln (θn )) 1 t 00 ˜ (θ − θn ) Ln (θ)(θ − θn ) = pnθn (rn ; θn )µ(θn ) exp 2 1 ˜ −1 (θ − θn ) (C.2) = pnθn (rn ; θn )µ(θn ) exp − (θ − θn )t [I + Ψn (θ)]∆ n 2 ˜ = with Ψn (θ) ∂ 2 Ln (θ) ∂θ∂θ t θ=θ˜ ∂ 2 Ln (θ) ∂θ∂θ t θ=θ n −1 − I, I a k × k identity matrix and θ˜ lies on the segment joining θ and θn . 
Therefore, under B3, the probability µ(B(θn , δn )|rn ) is bounded above by µ(B(θn , δn )|rn ) = = p−1 n pnθ (rn ; θn )µ(θn ) p−1 n pnθ (rn ; θn )µ(θn ) −1/2 ≤ |I − G()| Z Z exp B(θn ,δn ) 1 t 00 ˜ (θ − θn ) Ln (θ)(θ − θn ) dθ 2 ˜ −1/2 ∆1/2 {z;||[I+Ψn(θ)] n z||<δn } |∆n |1/2 p−1 n pnθ (rn ; θn )µ(θn ) ≤ |I − G()|−1/2 |∆n |1/2 p−1 n pnθ (rn ; θn )µ(θn ) 2 ˜ 1 Z e−z t z/2 e−z ˜ −1/2 |∆n |1/2 dz|I + Ψn (θ)| t z/2 dz B(0,un ) 1/2 Ln (θ) 1/2 where z = [ ∂∂θ∂θ (θ − θn ), un = δn (1 + g max ()) 2 /dmin , g max () is the largest eigenvalue t ] of G() and dmin is the smallest eigenvalue of ∆n . The inequality in the third line is due to Assumption B3 which also implies4 ˜ −1/2 ∆1/2 z|| < δn } ⊆ {z; ||z|| < un }. {z; ||[I + Ψn (θ)] n The inequality in the last line is due to the fact that 4 R −z t z/2 dz B(0,un ) e < (2π)k/2 . In a More precisely, ˜ −1/2 ∆1/2 z|| < δn } ⊆ {z; ||[I + G()]−1/2 ∆1/2 z|| < δn } {z; ||[I + Ψn (θ)] n n ˜ −1/2 ≥ [I + G()]−1/2 which is guaranteed by Assumption B3. Moreover, if and only if [I + Ψn (θ)] 1/2 1/2 −1/2 1/2 2 ||[I + G()] ∆n z|| = tr(z t ∆n [I + G()]−1 ∆n z) ≥ λmin ([I + G()]−1 )dmin tr(zz t ), where for 1/2 a matrix A, λmin (A) denotes the minimum eigenvalue of A. Hence, {z; ||[I + G()]−1/2 ∆n z|| < 1/2 δn } ⊆ {z; (1 + g max ())−1/2 dmin ||z|| < δn }. 43 similar way, under B3 ˜ −1/2 ∆1/2 z|| < δn } ⊇ {z; ||z|| < ln }. {z; ||[I + Ψn (θ)] n 1 1/2 where ln = δn (1 − g max ()) 2 /dmax , and so we can bound µ(B(θn , η)|rn ) from below by −1/2 µ(B(θn , δn )|rn ) ≥ |I + G()| |∆n |1/2 p−1 n pnθ (rn ; θn )µ(θn ) 0 e−z z/2 dz. B(0,ln ) R −z t z/2 dz converges to B(0,ln ) e µ(θn |rn ) = p−1 n pnθ (rn ; θn )µ(θn )) By Assumption B4, ln → ∞ as n → ∞ and then n → ∞. Summarizing, we obtain (by writing Z (2π)k/2 as |I − G()|1/2 lim µ(B(θn , δn |rn )|rn ) ≤ (2π)k/2 lim |∆n |1/2 µ(θn |rn ) n→∞ n→∞ ≤ |I + G()|1/2 lim µ(B(θn , δn |rn )|rn ) n→∞ and (C.1) is implied by the facts that under B3, |I ±G()| → 1 as → 0, µ(B(θn , δn )|rn ) ≤ 1 for every n. The equality holds if and only if limn→∞ µ(B(θn , δn )|rn ) = 1 in P ∗ -probability. Now, we proceed with the proof of Theorem 3.2. Denote pn the denominator of µ(θ|rn ). For any θ1 , θ2 ∈ Θ we write θ2 ≥ θ1 (or θ2 ≤ θ1 ) if every component of θ2 −θ1 is nonnegative. Without loss of generality, consider θ1 ≤ 0 and θ2 ≥ 0. −1/2 Let sn = ∆n (θ − θn ). By using (C.2), the probability µ(sn ∈ [θ1 , θ2 ]|rn ) writes µ(sn ∈ [θ1 , θ2 ]|rn ) = µ(θn |rn ) Z =µ(θn |rn ) Z 1/2 ∆n θ2 +θn 1/2 ∆n θ1 +θn 1 t −1 ˜ exp − (θ − θn ) [I + Ψn (θ)]∆n (θ − θn ) dθ 2 ˜ 1/2 θ1 ≤z≤[I+Ψn (θ)] ˜ 1/2 θ2 } {z;[I+Ψn (θ)] e−z t z/2 ˜ −1/2 |∆n |1/2 . dz|I + Ψn (θ)| Because o n ˜ 1/2 θ1 ≤ z ≤ [I + Ψn (θ)] ˜ 1/2 θ2 H − () := {z; [I + G()]1/2 θ1 ≤ z ≤ θ2 [I − G()]1/2 } ⊆ z; [I + Ψn (θ)] ⊆ H + () := {z; [I − G()]1/2 θ1 ≤ z ≤ θ2 [I + G()]1/2 } we can lower and upper bound the probability µ(sn ∈ [θ1 , θ2 ]|rn ) as 44 |I + G()|−1/2 |∆n |1/2 µ(θn |rn ) Z e−z t z/2 H − () dz ≤ µ(sn ∈ [θ1 , θ2 ]|rn ) −1/2 ≤ |I − G()| 1/2 |∆n | µ(θn |rn ) Z e−z t z/2 H + () Remark that lim→0 H − () = lim→0 H + () = {z; θ1 ≤ z ≤ θ2 }. Therefore, by letting → 0 and taking the limit for n → ∞ we obtain: lim |∆n |1/2 µ(θn |rn )(2π)k/2 Φ([θ1 , θ2 ]) ≤ lim µ(sn ∈ [θ1 , θ2 ]|rn ) n→∞ n→∞ ≤ lim |∆n |1/2 µ(θn |rn )(2π)k/2 Φ([θ1 , θ2 ]) n→∞ where Φ([θ1 , θ2 ]) denotes the probability of [θ1 , θ2 ] under a standard normal distribution. Under assumption A2, the second result of theorem 3.1 holds and, by lemma C.1, implies that limn→∞ |∆n |1/2 µ(θn |rn )(2π)k/2 = 1. This concludes the proof. 
Proof of Theorem 3.3 √ Let θ rn = hE[f |rn ], gi and Ωn = hV ar[f |rn ]g, gi and remark that µ(s|rn ) is a N ( n(θ rn − ˆ We can upper bound the TV-distance by ˆ nΩn ) distribution. Denote sbn := √n(θ rn − θ). θ), ˆ V ) µ(s|rn ) − N (θ, TV ≤ kµ(s|rn ) − N (0, nΩn )kT V + kN (0, nΩn ) − N (0, V )kT V =: A + B. We start by considering A. By doing a change of variable it is easy to show that A = N (0, Ik ) − N (−(nΩn )−1/2 sbn , Ik ) TV . An application of Lemma D.2 and then of Lemmas C.2 and C.3 allows to conclude that A ≤ 1 √ k(nΩn )−1/2 sbn k = Op (n−1/2 ). 2π To bound B we introduce the Kullback-Leibler distance between two probability meaR sures P and Q, denoted by K(P, Q), and satisfying K(P, Q) = log pq dP where p and q are the densities of P and Q, respectively, with respect to the Lebesgue measure. We get 45 dz. B≤ p K (N (0, V ), N (0, nΩn )) v ! uZ 1 t −1 u 1 t −1 1 |V |−1/2 e{− 2 s V s} t |V |−1/2 e{− 2 s V s} ds = log 1 t −1 k/2 |nΩn |−1/2 e{− 2 s (nΩn ) s} (2π) s Z 1 t −1 1 |nΩn | 1 t −1 −1 = |V |−1/2 e{− 2 s V s} ds − s [V − (nΩn ) ]s log k/2 2 |V | (2π) s 1 |nΩn | 1 = log − trV [V −1 − (nΩn )−1 ] 2 |V | 2 that converges to zero by the result of Lemma C.2. Lemma C.2. Let Ωn and V be defined as in Theorem 3.3. Then, under the assumption of Theorem 3.3: (i) nΩn = O(1) and (ii) nΩn → V, as n → ∞. Proof. Under the conditions of the theorem the posterior variance writes as in (A.2) with the operator A defined in Lemma A.1. By using this expression: 1 1/2 1/2 1 −1/2 −1 −1/2 −1/2 Ω0 g, gt i ) f∗ Ω0 f ∗ ( I − f∗ hf∗ , ·i + f∗ n n 1 −1/2 1 1/2 −1/2 −1 1/2 ( f∗ − f∗ hf∗ , ·i + Ω0 f∗ = hΩ0 − Ω0 f∗ ) Ω0 g, gt i n n 1 −1/2 −1 1/2 −1/2 1 1/2 ) Ω0 g, g t i ( f∗ − f∗ hf∗ , ·i + Ω0 f∗ = h I − Ω0 f ∗ n n 1 1/2 1 1 1/2 1 −1/2 −1 1/2 1/2 = h f∗ − f∗ hf∗ , ·i ( f∗ − f∗ hf∗ , ·i + Ω0 f∗ ) Ω0 g, gt i n n n n 1 1/2 1 1 1/2 1/2 1 = h f∗ − f∗ hf∗ , ·i f∗ ( f∗ − f∗ hf∗ , ·i + Ω0 )−1 Ω0 g, g t i n n n n 1 1 1 1 = h f∗ − f∗ hf∗ , ·i ( f∗ − f∗ hf∗ , ·i + Ω0 )−1 Ω0 g, gt i n n n n 1 1 = hT ( T + Ω0 )−1 Ω0 g, gt i n n −1/2 Ωn = hΩ0 − Ω0 f∗ where T : E → E is the self-adjoint operator defined as T φ = f∗ (φ − E∗ φ), ∀φ ∈ E. This shows (i). To show (ii), by using the previous expression for Ωn we write 46 1 V − nΩn = −hT [( T + Ω0 )−1 Ω0 − I]g, g t i n 1 −1/2 −1/2 −1/2 1 −1/2 + I)−1 Ω0 T g, T g t i = −hΩ0 ( Ω0 T Ω0 n n 1 1 −1/2 −1/2 = h( Ω0 T Ω0 + I)−1 ν, ν t i n n 1/2 1/2 since there exists ν ∈ E such that T g = Ω0 ν. This is because T g = f∗ (g −E∗ g) ∈ R(Ω0 ). −1/2 So, because ( n1 Ω0 −1/2 T Ω0 + I)−1 is bounded then nΩn → V as n → ∞. Lemma C.3. Let θ rn and θˆ be defined as in Theorem 3.3. Then, under the assumption of Theorem 3.3: √ rn ˆ = Op (n−1/2 ). n(θ − θ) Proof. We want to show that √ ˆ → 0. Define T : E → E as the operator n(θ rn − θ) T φ = f∗ (φ − E∗ φ), ∀φ ∈ E. Remark that T is self-adjoint and that R(T ) ⊂ R(Ω0 ) so that Ω−1 T is well defined. √ rn ˆ n(θ − θ) * + −1 √ 1 1 1/2 1/2 1/2 −1/2 −1/2 −1/2 −1 = n( f0 + Ω0 f∗ Ω0 f ∗ I − f∗ hf∗ , ·i + f∗ K (rn − Kf0 ), g −B(Pn )) f∗ n n −1 √ 1 T + Ω0 )f0 , gi = nh(I − Ω0 n √ 1 −1 −1 + n hΩ0 ( T + Ω0 ) K rn , gi − B(Pn ) . n We consider these two terms separately. Consider the first term: √ nh(I − Ω0 1 T + Ω0 n −1 −1 √ 1 1 )f0 , gi = nh T f0 , gi T + Ω0 n n −1 1 1 T + Ω0 f0 , T gi =√ h n n −1 1 −1 1 −1/2 Ω−1 ). Ω T +I = √ hf0 , 0 T gi = O(n n n 0 47 P Consider the second term and remark that B(Pn ) = n1 ni=1 [( n1 T + Ω0 )−1 Ω0 g](xi ) + 1 1 Pn −1 i=1 (g(xi ) − [( n T + Ω0 ) Ω0 g](xi )). 
Let δxi be the Dirac measure that assigns a unit n mass to the point xi . Hence, it is possible to identify such a measure with a distribution Di , that is, a linear functional Di defined on C ∞ and continuous with respect to the supremum norm: g 7→ g(xi ) = Di (g), see Schwartz [1966]. We get n 1 1X 1 1X −1 −1 Di I − ( T + Ω0 ) Ω0 g (g(xi ) − [( T + Ω0 ) Ω0 g](xi )) = n n n n i=1 i=1 1 1 1X −1 Di ( T + Ω0 ) T g = n n n i=1 1 X 1 −1 −1 −1 = 2 Di ( Ω0 T + I) Ω0 T g = Op (n−1 ). n n i=1 Similarly, we can write: ∀t ∈ T , k 7→ k(xi , t) = Di (k(x, t)) =: [Di (k)](t) so that P rn (t) = n1 ni=1 [Di (k)](t) and n 1 1X 1 hΩ0 ( T + Ω0 )−1 K −1 [Di (k)], gi hΩ0 ( T + Ω0 )−1 K −1 rn , gi = n n n i=1 n 1X 1 = Di (hΩ0 ( T + Ω0 )−1 K −1 k, gi) n n i=1 n 1 1X Di (hk, (K ∗ )−1 ( T + Ω0 )−1 Ω0 gi) = n n i=1 n 1X 1 −1 = Di ( T + Ω0 ) Ω0 g n n i=1 since ∗ −1 hk, (K ) 1 ( T + Ω0 )−1 Ω0 gi = n Z 1 −1 k(x, t) (K ) ( T + Ω0 ) Ω0 g (t)ρ(t)dt n 1 1 = K ∗ (K ∗ )−1 ( T + Ω0 )−1 Ω0 g = ( T + Ω0 )−1 Ω0 g. n n ∗ −1 Therefore, 48 n √ 1 1 Xh 1 −1 −1 −1 n hΩ0 ( T + Ω0 ) K rn , gi − B(Pn ) = √ Di ( T + Ω0 ) Ω0 g n n n i=1 i 1 + O(n−1/2 ) = Op (n−1/2 ). − Di ( T + Ω0 )−1 Ω0 g n D Technical Lemmas T h(x Lemma D.1. Let g(xi , t, θ) := k(xi , t)(1 − eγ(θ) i ,θ) ). Assume that T × Θ is compact and that the maps t 7→ k(x, t) and θ 7→ h(θ, x) are continuous for every x. Then n 1X g(xi , t, θ) − E∗ g(xi , t, θ)| → 0 sup | t∈T,θ∈Θ n i=1 almost surely. Proof. Consider the following parametric class of functions G := {g(·, t, θ); t ∈ T θ, ∈ Θ} and remark that, by the assumptions of the lemma and because γ(·) is continuous in its argument (see proof of Theorem 1 in Kitamura and Stutzer), the map (t, θ) 7→ g(x, t, θ) is continuous for every x. Moreover, for every t ∈ T , θ ∈ Θ we observe that the function sup |g(x, t, θ)| t∈T,θ∈Θ is an envelope function, which is integrable because of continuity of the function and compactness of T × Θ. Hence, by Example 19.8 in Van der Vaart we conclude that G is Glivenko-Cantelli. The next Lemma applies to the just identified case described in Remark 2.2 where the prior covariance operator does not depend on θ for which we use the notation Ω0 . 1/2 Lemma D.2. Let Ω0 : E → E be a covariance operator of a GP on BE such that Ω0 1 = 0. Let D ∈ E be defined as Z D := g ∈ E; g(x)Π(dx) = 0 . Then, R(Ω0 ) = D. 1/2 1/2 Proof. Because Ω0 = Ω0 Ω0 1/2 and because Ω0 1 = 0 then Ω0 1 = 0, that is, 1 ∈ N(Ω0 ). 49 Hence, if g ∈ R(Ω0 ) then ∃ν ∈ E such that g = Ω0 ν and so Z g(x)Π(dx) = Z Ω0 νΠ(dx) =< ν, Ω0 1 >= 0. This shows that R(Ω0 ) ⊂ D. Now, take g ∈ D (so, < g, 1 >= 0) and suppose that g∈ / R(Ω0 ). Then, ∀h ∈ R(Ω0 ): < g, h >= 0 =< g, 1 > and < g, h − 1 >= 0. Because the same reasoning holds for every g ∈ D, then: < g, h − 1 >= 0 for every g ∈ D and for every h ∈ R(Ω0 ). Hence, it must be h = 1 but this is impossible since 1 ∈ / R(Ω0 ). Therefore, g must belong to R(Ω0 ) and so D ⊂ R(Ω0 ). References U. Amato and W. Hughes. Maximum entropy regularization of fredholm integral equations of the first kind. Inverse Problems, 7(6):793, 1991. J. F. Bonnans and A. Shapiro. Optimization problems with perturbations: A guided tour. SIAM Rev., 40(2). C. Cazals, F. F`eve, P. F`eve, and J.-P. Florens. Simple structural econometrics of price elasticity. Economics Letters, 86(1):1–6, 2005. C.-F. Chen. On asymptotic normality of limiting density functions with bayesian implications. Journal of the Royal Statistical Society. Series B (Methodological), 47(3):540–546, 1985. V. Chernozhukov and H. Hong. 
An {MCMC} approach to classical estimation. Journal of Econometrics, 115(2):293–346, 2003. I. Csiszar. i-divergence geometry of probability distributions and minimization problems. Ann. Probab., 3(1):146–158, 1975. T. S. Ferguson. A bayesian analysis of some nonparametric problems. Ann. Statist., 1(2): 209–230, 03 1973. T. S. Ferguson. Prior distributions on spaces of probability measures. Ann. Statist., 2(4): 615–629, 07 1974. 50 F. F`eve, J.-P. Florens, and S. Richard. Microeconomic demand modelling for price elasticities. In M. Crew and P. Kleindorfer, editors, Lineralization of the Postal and Delivery Sector. Edouard Elgar, Northampton, USA, 2006. J.-P. Florens and J.-M. Rolin. Bayes, bootstrap, moments. Technical report, Universit´e Catholique de Louvain, 1994. J.-P. Florens and A. Simoni. Regularized posteriors in linear ill-posed inverse problems. Scandinavian Journal of Statistics, 39(2):pp. 214–235, 2012. J.-P. Florens and A. Simoni. Regularizing priors for linear inverse problems. Econometric Theory, FirstView, 11 2014. J.-Y. Kim. Limited information likelihood and bayesian analysis. Journal of Econometrics, 107(1-2):175–193, 2002. Y. Kitamura. Empirical likelihood methods in econometrics: Theory and practice. In R. Blundell, W. Newey, and T. Persson, editors, Advances in Economics and Econometrics, volume 3, pages 174–237. Cambridge University Press, 2007. Cambridge Books Online. Y. Kitamura and T. Otsu. Bayesian analysis of moment condition models using nonparametric priors. Technical report. Y. Kitamura and M. Stutzer. An information-theoretic alternative to generalized method of moments estimation. Econometrica, 65(4):pp. 861–874, 1997. R. Kress. Linear Integral Equation. Springer New York, 1999. H.-H. Kuo. Gaussian Measures in Banach Spaces. Springer Berlin Heidelberg, 1975. S. M. Schennach. Bayesian exponentially tilted empirical likelihood. Biometrika, 92(1): 31–46, 2005. ´ L. Schwartz. Th´eorie des distributions. Hermann Editeurs, 1966. 51