
Gaussian processes and Bayesian moment estimation∗
Jean-Pierre Florens†
Anna Simoni‡
Toulouse School of Economics
CNRS - CREST
March 19, 2015
– PRELIMINARY and INCOMPLETE –
Abstract
When a large number of moment restrictions is available for the estimation of a finite dimensional parameter θ, some restrictions may be more important or credible than others. In these situations it might be desirable to weight each restriction according to our beliefs. This is automatically implemented by a Bayesian procedure. We develop, in this paper, a semiparametric Bayesian approach to moment estimation that is able to impose over-identifying moment restrictions via the prior. We show how to construct a nonparametric prior with support equal to the subset of data distributions F that satisfy the moment restrictions. We establish that a Gaussian process (GP) prior for the density function associated with F is particularly convenient for imposing over-identifying restrictions and allows us to obtain a posterior distribution in closed form. We provide a frequentist validation of our procedure by showing consistency of the maximum a posteriori estimator for θ and asymptotic normality of the posterior of θ.
∗ First draft: February 2012. The authors are grateful to participants to seminars and conferences in: Bristol, Carlos III - Madrid, CREST, NASM 2012 at Northwestern, AFSE 2012 in Paris, Toulouse, ICEEE 2015. We thank financial support from ANR-13-BSH1-0004 (IPANEMA). Anna Simoni gratefully acknowledges financial support from Labex ECODEC (ANR - 11-LABEX-0047) and financial support and hospitality from the University of Mannheim through the SFB-884.
† Toulouse School of Economics - 21, allée de Brienne - 31000 Toulouse (France). Email: [email protected]
‡ CREST, 15 Boulevard Gabriel Péri, 92240 Malakoff (France). Email: [email protected] (corresponding author).
Key words: Moment conditions, Gaussian processes, overidentification, posterior consistency.
JEL code: C11, C14, C13
1 Introduction
In many practical applications and empirical economic studies a large set of moment
restrictions characterizing the parameter of interest is available. Examples are provided
for instance in Cazals et al. [2005] and Fève et al. [2006]. Such a situation is complicated
to manage since it requires cumbersome computations due to the high number of moment
restrictions. It is often the case that the researcher does not equally believe in all the
restrictions. Therefore, it is desirable to have a procedure that assigns a specific weight to
each moment restriction based on the researcher’s beliefs and that updates it (and possibly
decides whether to include or not a restriction) based on the information contained in the
data. This may be easily performed by a Bayesian procedure where each moment restriction
can have a different weight based on the prior and posterior distribution.
The purpose of this paper is to develop a Bayesian approach to estimate a structural
parameter θ characterized by over-identifying moment conditions. In such a framework,
the researcher has only limited information on the data generating process (DGP) which
is given by a set of moment conditions. Any parametric specification of the DGP is,
therefore, completely arbitrary. In this paper we do not restrict the DGP except for requiring that it satisfy
the moment restrictions. Thus, we construct a nonparametric prior with support equal to
the subset of data distributions that satisfy the moment restrictions.
Let x be a random element in Rm with distribution F and x1 , . . . , xn be an i.i.d. sample
of x. We are interested in a parameter θ ∈ Θ ⊂ Rk which is linked to F through the relation
(moment restrictions)
EF [h(θ, x)] = 0
(1.1)
where h is a known function with values in Rr . This model is semiparametric since it
includes a finite dimensional structural parameter θ and a functional parameter F which,
apart from the moment restrictions, is not constrained at all.
In contrast to the classical generalized method of moments (GMM), which imposes the
moments restrictions directly in the DGP, we impose the moment restrictions via the prior
for (θ, F ). Hence, the random parameter generated from the prior satisfies the moment
restrictions by construction. Imposing moment restrictions via semiparametric priors may
be challenging depending on the relationship existing between θ and F . More precisely,
when the model is just-identified, that is k = r, (1.1) characterizes θ as an explicit function
of F : θ = B(F ), where B is a function defined on the space of probability distributions. For
particular functional forms of B, the prior of θ may be easily recovered from an unrestricted
prior of F and (θ, F ) generated by this prior automatically satisfy the constraints.
On the contrary, in an overidentified model where k < r, θ cannot be expressed as an
explicit function of F . More precisely, (1.1) imposes constraints on F and the existence of a
solution θ to (1.1) is guaranteed only for a subset of distributions F . In a Bayesian approach
this entails that the prior on F must be specified conditionally on θ and its support must
be a proper subset of the set of probability distributions. However, in an overidentified
model, the restrictions on F are in general complicated to incorporate into its prior distribution. The approach proposed in Florens and Rolin [1994], for instance, which is based on a Dirichlet process prior distribution, presents several difficulties in treating overidentified models. An important contribution of this paper is to construct a semiparametric prior that incorporates the over-identifying moment restrictions.
Our approach differs substantially from Zellner’s (1996) Bayesian Method of Moments and is more in the vein of the recent literature on Bayesian estimation under moment restrictions. Kitamura and Otsu construct the restricted prior on F by projecting a Dirichlet process prior (see Ferguson [1973, 1974]) via minimization of the Kullback-Leibler divergence with respect to the Dirichlet process prior under the moment constraint. A Dirichlet process prior has nice properties due to the fact that it is a natural conjugate of the i.i.d. model; however, the projection of it, necessary to treat the overidentified case, comes at the price of losing conjugacy. Schennach [2005] proposes a maximum entropy nonparametric Bayesian procedure which, instead of employing the Dirichlet process prior, relies on a noninformative prior on the space of distributions. Kim [2002] and Chernozhukov and Hong [2003] propose a quasi-Bayesian approach by incorporating the moment restrictions in the pseudo-likelihood.
We specify a degenerate Gaussian process (GP) prior with restricted support that is
easy to deal with and that works as follows. The DGP F is assumed to admit a density
function f with respect to some positive measure Π chosen by the econometrician (for
instance the Lebesgue measure). Then, we endow f with a GP prior conditional on θ.
Because this prior imposes the moment restrictions it will be degenerate on a proper subset
of the parameter space. The essential reason for the appropriateness of a GP prior in such
a framework is due to the fact that the moment functions in (1.1) are linear in f and the
linearity of the model matches extremely well with a GP prior. Thus, the (over-identifying)
moment restrictions can be incorporated in an easy way by constraining the prior mean and
prior covariance of f . An advantage of our method is that, in both the just-identified and
overidentified cases, the moment restrictions are imposed directly through the (conditional)
prior of f (given θ) without requiring a second step projection as in Kitamura and Otsu.
To the best of our knowledge, this prior has not yet been used in the GMM framework.
In the overidentified case we first specify a prior on θ and then a GP prior on f condi-
tional on θ. In the just-identified case we may either proceed as in the overidentified case
or specify an unrestricted GP prior on f and deduce from it the prior for θ through the
explicit relationship θ = B(f ). We circumvent the difficulty of specifying the likelihood
function, which is not available, by constructing a linear functional transformation of the
DGP F , say rn , that has an asymptotic Gaussian distribution. This construction has the advantage of providing a conjugate model (because the prior of f given θ is also a GP) without the risk of misspecifying the sampling distribution. In other words, our approach stays
nonparametric in F but allows easy computations. In the following we call our method the
GP-approach.
We propose as an estimator for θ the maximum of the marginal posterior distribu-
tion of θ (obtained by integrating out f ). This is usually not available in closed-form but
can be easily computed via MCMC methods. Finally, we provide a frequentist validation
of our method by showing: frequentist consistency of the maximum a posteriori estimator, and frequentist consistency and asymptotic normality of the posterior distribution of θ.
The paper is organized as follows. The GP-approach is described in section 2. In
section 3 we analyze asymptotic properties of the posterior distribution of θ and of the
maximum a posteriori estimator. There exists a link between our approach and the GMM
approach for moment estimations. We stress this link in section 4. In section 5 we detail
how to implement our method for both the just identified case and the overidentified case.
2 The GP-approach
Let x be a random element in S ⊆ Rm with distribution F and x1 , . . . , xn be an i.i.d.
sample of x.1 Assume that F is absolutely continuous with respect to some positive measure
Π (e.g. the Lebesgue measure) with density function f . In other words, conditionally on f
the data are drawn from F : x1 , . . . , xn |f ∼ F . The set of probability density functions on
S with respect to Π is denoted by M .
Let θ ∈ Θ ⊆ Rk be the parameter of interest characterized by (1.1). Throughout
the paper we denote the true value of θ by θ∗ , the true DGP by F∗ and its density with
respect to Π by f∗ . Hence, EF∗ (h(θ∗ , x)) = 0. We endow S ⊆ Rm with the trace of the
Borelian σ-field BS and we specify Π as a positive measure on this subset. We denote
by E = L2 (S, BS , Π) the Hilbert space of square integrable functions on S and by BE the
1 The DGP for x could be more general than an i.i.d. sampling process but we focus on this case for simplicity.
Borel σ-field generated by the open sets of E. The scalar product and norm on this space
are defined in the usual way and denoted by ⟨·, ·⟩ and || · ||, respectively.
Given a value for θ, a conditional prior for f |θ is specified such that its support is equal
to the subset of M of functions that satisfy (1.1) for this value of θ. Hence, the parameters
of the model are (θ, f ), with f the nuisance parameter, and the parameter space is
Λ = {(θ, f ) ∈ Θ × EM : ∫ h(θ, x)f (x)dΠ = 0},   EM := E ∩ M,
where h : Θ × Rm → Rr is a known function that satisfies the following assumption.
Assumption 2.1. (i) The true f∗ satisfies f∗ ∈ EM := E ∩ M ; (ii) the moment function
h(θ, ·) satisfies hi (θ, ·) ∈ E for every i = 1, . . . , r and for every θ ∈ Θ, where hi denotes the
i-th component of h.
Assumption 2.1 (i) restricts f∗ to be square integrable with respect to Π and is for
instance verified if f∗ is bounded and Π is a bounded measure. The model is made up
of three elements that we detail in the next two subsections: a prior on θ, denoted by
µ(θ), a conditional prior on f given θ, denoted by µ(f |θ) and the sampling model. In the
following, we shorten almost surely to a.s., and we denote by EF and Ef the expectation taken with respect to F and to the distribution with density f , respectively.
2.1 Prior distribution
We specify a prior probability measure µ on the pair (θ, f ) of the form µ = µ(θ)⊗µ(f |θ).
By abuse of notation, µ(θ) will also denote the density of the prior distribution of θ with
respect to the Lebesgue measure in the case it admits it. The prior µ(θ) may either be flat
(non-informative) or incorporate any additional information available to the econometrician
about θ. The conditional prior µ(f |θ) requires a more careful analysis and we now explain
it.
Conditional prior on f given θ. The conditional prior distribution µ(f |θ) of f ,
given θ, is specified as a GP on BE with mean function f0θ ∈ EM and covariance operator
Ω0θ : E → E. We restrict the prior µ(f |θ) to generate trajectories f ∈ E that integrate to
1 (with respect to Π) and such that the corresponding F satisfies (1.1) with probability 1.
For that, we impose two sets of restrictions, one on f0θ and one on Ω0θ :
Restriction 1 (Restrictions on f0θ ). The prior mean function is chosen such that f0θ ∈ EM and

∫ h(θ, x)f0θ (x)Π(dx) = 0.    (2.1)
Restriction 2 (Restrictions on Ω0θ ). The prior covariance operator Ω0θ : E → E is such that

Ω0θ^{1/2} h(θ, ·) = 0   and   Ω0θ^{1/2} 1 = 0.    (2.2)
The covariance operator Ω0θ is linear, self-adjoint and trace-class.2 Due to Restriction
2, Ω0θ is not injective. In fact, the null space of Ω0θ , denoted by N(Ω0θ ) is not trivial and
contains effectively the constant 1 – which implies that the trajectories f generated by the
prior integrate to 1 a.s. – and the function h(θ, x) – which implies that the trajectories f
satisfy the moment conditions a.s. Therefore, the support of µ(f |θ) is a proper subset of
E. In practice, this means that Ω0θ is degenerate in the directions along which we want
that the corresponding projections of f and f0θ are equal. This is the meaning of the next
lemma.
Lemma 2.1. The conditional GP prior µ(f |θ), with mean function f0θ and covariance
operator Ω0θ satisfying Restrictions 1 and 2, generates trajectories f that satisfy µ(f |θ)-
a.s. the conditions
∫ f (x)Π(dx) = 1   and   ∫ h(θ, x)f (x)Π(dx) = 0.
Remark 2.1. Our restrictions imply that the trajectories generated by µ(f |θ) integrate
to 1 a.s. but they do not guarantee non-negativity of the trajectories. Thus, the support
of µ(f |θ) is smaller than E but bigger than EM . To impose non-negativity we could: (i)
either project the prior on the space of non-negative functions or (ii) write f = g2 , g ∈ E,
and specify a conditional prior distribution, given θ, for g instead of for f . However,
the resulting prior would be non Gaussian and the resulting posterior for θ would not be
available in closed form which is instead one of the main advantages of our procedure.
From a practical implementation point of view, a covariance operator satisfying Restriction 2 may be constructed as

Ω0θ g = Σ_{j=0}^∞ λj ⟨g, ϕj ⟩ϕj ,   g ∈ E,

where (λj )j∈N is a decreasing sequence of non-negative numbers accumulating at 0 such that Σ_j λj < ∞ and (ϕj )j∈N is a basis for E. Remark that (λj )j∈N and (ϕj )j∈N correspond
to the eigenvalues and eigenfunctions of Ω0θ , respectively. Since the null space N(Ω0θ ) ⊂ E
is generated by {1, h1 (θ, ·), . . . , hr (θ, ·)}, without loss of generality we can set the first eigenfunctions equal to the elements that span N(Ω0θ ). Restriction 2 is fulfilled by setting the corresponding eigenvalues equal to 0. For instance, if {1, h1 (θ, ·), . . . , hr (θ, ·)} are linearly independent as elements of E, then N(Ω0θ ) has dimension r + 1, the first eigenfunctions are (ϕ0 , ϕ1 , . . . , ϕr )^T = (1, h^T )^T and the corresponding eigenvalues are λj = 0, ∀j = 0, 1, . . . , r.3 The remaining components (ϕj )j>r are chosen such that (ϕj )j≥0 forms a basis of E and (λj )j>r such that Σ_{j>r} λj < ∞. In section 5 we provide some examples that explain in detail the construction of Ω0θ .
2 A trace-class operator is a compact operator with eigenvalues that are summable. Remark that this guarantees that the trajectories f generated by µ(f |θ) satisfy ∫ f^2 dΠ < ∞ a.s.
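For concreteness, the following sketch (in Python, not part of the paper's formal development) shows one way to build a discretized version of such a degenerate covariance operator: the constant 1 and the moment functions are placed first in a basis, orthonormalized with respect to Π on a grid, and assigned zero eigenvalues. The grid, quadrature weights, eigenvalue sequence and moment function used here are illustrative assumptions, not choices made in the paper.

```python
# Illustrative sketch: a discretized covariance operator satisfying Restriction 2.
import numpy as np

def degenerate_covariance(h_funcs, grid, weights, lambdas):
    """Covariance matrix whose null space contains 1 and each h_i(theta, .).

    h_funcs : callables x -> h_i(theta, x) for a fixed theta (assumed given)
    grid    : quadrature nodes approximating S
    weights : quadrature weights for the measure Pi
    lambdas : eigenvalues for the non-degenerate directions (summable sequence)
    """
    n = len(grid)
    # Columns spanning N(Omega_0theta): the constant 1 and the moment functions.
    constrained = np.column_stack([np.ones(n)] + [h(grid) for h in h_funcs])
    r1 = constrained.shape[1]
    # Complete to a full set of vectors and orthonormalize in the Pi-weighted
    # inner product <u, v> = sum_i w_i u_i v_i (classical Gram-Schmidt).
    rng = np.random.default_rng(0)
    full = np.column_stack([constrained, rng.standard_normal((n, n - r1))])
    W = np.diag(weights)
    Q = np.zeros_like(full)
    for j in range(n):
        v = full[:, j].copy()
        for k in range(j):
            v -= (Q[:, k] @ (W @ v)) * Q[:, k]
        Q[:, j] = v / np.sqrt(v @ (W @ v))
    eig = np.concatenate([np.zeros(r1), lambdas[: n - r1]])  # zero on N(Omega_0theta)
    # Omega_0theta g = sum_j eig_j <g, phi_j> phi_j, as a matrix acting on grid values.
    return (Q * eig) @ Q.T @ W

# Example with one moment function h(theta, x) = x - theta at theta = 0.
grid = np.linspace(-1.0, 1.0, 200)
w = np.full(grid.size, 2.0 / grid.size)            # Riemann weights for Lebesgue Pi
lam = 1.0 / np.arange(1, grid.size + 1) ** 1.7
Omega = degenerate_covariance([lambda x: x - 0.0], grid, w, lam)
print(np.abs(Omega @ np.ones(grid.size)).max())    # ~0: Omega_0theta 1 = 0
print(np.abs(Omega @ grid).max())                  # ~0: Omega_0theta h(theta, .) = 0
```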
Remark 2.2. In the just-identified case where r = k and the moment restrictions EF (h(θ, x)) = 0 rewrite in an explicit form as θ = B(f ), where B is a linear functional, we may construct the prior in an alternative way. This consists in first specifying a GP prior µ(f ) for f (independent of θ) with a mean function f0 restricted to be a pdf and a covariance operator Ω0 restricted to satisfy Ω0^{1/2} 1 = 0. Then, the prior for θ is obtained through the transformation B(·), under mild regularity conditions. For example, if θ = Ef (x) then B(f ) = ⟨f, ι⟩, where ι ∈ E denotes the identity function ι(x) = x, and µ(θ) = N (⟨f0 , ι⟩, ⟨Ω0 ι, ι⟩).
2.2 The sampling model
Conditional on f , the sample likelihood is ∏_{i=1}^n f (xi ). This sampling model has the
disadvantage that the corresponding posterior distribution of (θ, f ), as follows from the
prior of section 2.1, is not available in closed-form. Hence, we propose an alternative and
original way for constructing the sampling model that allows for a closed-form posterior
and a conjugate analysis.
In order to construct the sampling distribution let us consider a functional transformation rn of the sample x1 , . . . , xn . This transformation rn is chosen by the researcher
and must have the following characteristics: (I) rn converges weakly towards a Gaussian
process (in a Hilbert space F to be defined below); (II) rn is an observable element of
an infinite-dimensional Hilbert space F, for instance a L2 -space; (III) rn is linked to the
nuisance parameter f according to the following linear scheme
rn = Kf + U    (2.3)
where K : E → F is a linear operator, F is an infinite-dimensional separable Hilbert
space and U ∈ F is a Hilbert space-valued random variable (H-r.v.). We recall that,
for a complete probability space (Z, Z, P), U is a H-r.v. if it defines a measurable map
U : (Z, Z, P) → (F, BF ), where BF denotes the Borel σ-fields generated by the open sets
of F.
3 Remark that if {1, h1 (θ, ·), . . . , hr (θ, ·)} are not linearly independent then we can use their orthonormalized counterparts.
To construct rn we first select a function k(t, x) : T × S → R+ that is measurable in
x for every t ∈ T , where T ⊆ Rp , p > 0. The transformation rn is then the expectation of
k(t, ·) under the empirical measure:
rn = (1/n) Σ_{i=1}^n k(t, xi ).
Define F = L2 (T, BT , ρ) where ρ is a measure on T and BT denotes the Borel σ-field
generated by the open sets of T . We use the same notation ⟨·, ·⟩ and || · || for the inner product and norm, respectively, in F and in E. The function k(t, x) also defines the operator K: Kφ := ∫ k(t, x)φ(x)Π(dx), ∀φ ∈ E, and must be such that Kφ and rn are elements of F for every φ ∈ E. For every f ∈ EM , Kf is the expectation of k(t, ·) under F := ∫_S f dΠ: Kf = ∫ k(t, x)f (x)Π(dx) = EF (k(t, x)). Hence, model (2.3) rewrites:

rn = (1/n) Σ_{i=1}^n k(t, xi ) = ∫ k(t, x)f (x)Π(dx) + U (t).    (2.4)
Conditionally on f , the expectation of rn is equal to Kf and the error term U has zero
mean and covariance kernel
σn^F (t, s) = EF [U (t)U (s)] = (1/n) [EF (k(t, x)k(s, x)) − EF (k(t, x))EF (k(s, x))].

We denote by P^f the sampling distribution, namely, the conditional distribution of rn given f satisfying (2.3). When f = f∗ , i.e. the truth, we simply use the notation P ∗ = P^{f∗} and σn^{F∗} (t, s). If the function k(t, ·) is Donsker then P^f weakly converges to a Gaussian process GP(Kf, Σn ) with

Σn = (1/n) Σ : F → F,    Σn ϕ = ∫ σn^F (t, s)ϕ(s)ds,   ϕ ∈ F,    (2.5)
as n → ∞. In our analysis we use this Gaussian process as the sampling model in order to
construct the posterior distribution. The frequentist asymptotic properties of the posterior
distribution will be analyzed under the true measure P ∗ . The covariance operator Σn is
one-to-one, linear, positive definite, self-adjoint and trace-class. In several examples Σn is
unknown because F is unknown; thus in the posterior distribution we will replace F with
the empirical cumulative distribution function (cdf ). Because this estimator converges at
the parametric rate, it does not affect the asymptotic properties of our procedure (see
Florens and Simoni [2012]). The next example clarifies the construction of the sampling
model (2.3).
Example 2.1. Let us suppose that we have an i.i.d. sample of x: (x1 , . . . , xn ), where xi ∈ R, i = 1, . . . , n. The functional transformation rn may be, for example, the empirical cdf Fn (t) = (1/n) Σ_{i=1}^n 1{xi ≤ t} or the empirical characteristic function cn (t) = (1/n) Σ_{i=1}^n e^{itxi} for t ∈ R. In these two cases we can write:

Fn (t) = ∫_S 1{s ≤ t}f (s)Π(ds) + U (t),
cn (t) = ∫_S e^{its} f (s)Π(ds) + U (t),

respectively. In the first case rn = Fn and ∀ϕ ∈ E, Kϕ = ∫_S 1{s ≤ t}ϕ(s)Π(ds), while in the second case rn = cn and ∀ϕ ∈ E, Kϕ = ∫_S e^{its} ϕ(s)Π(ds). In these two cases, by Donsker's theorem, U is asymptotically Gaussian with zero mean and covariance operator characterized by the kernel (1/n)(F (s ∧ t) − F (s)F (t)) in the first case and (1/n)(c(s + t) − c(s)c(t)) in the second case, where c(t) denotes the characteristic function of the distribution F . These variances are unknown when f is unknown but we can replace them in the posterior distribution by their empirical counterparts.
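The empirical objects in this example are straightforward to compute. The sketch below (illustrative sample, grid and distribution, not those of the paper) evaluates both versions of rn together with the empirical counterparts of the two covariance kernels.

```python
# Illustrative computation of r_n (empirical cdf / characteristic function) and of
# the empirical counterparts of the covariance kernels of U in Example 2.1.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(500)            # i.i.d. sample, here from N(0, 1)
t = np.linspace(-3.0, 3.0, 61)          # evaluation grid on T
n = x.size

def ecdf(u):    # empirical cdf evaluated at the points in u
    return (x[None, :] <= np.asarray(u)[..., None]).mean(axis=-1)

def ecf(u):     # empirical characteristic function evaluated at the points in u
    return np.exp(1j * np.asarray(u)[..., None] * x).mean(axis=-1)

F_n, c_n = ecdf(t), ecf(t)              # the two possible choices of r_n

# Empirical versions of (1/n)(F(s ^ t) - F(s)F(t)) and (1/n)(c(s + t) - c(s)c(t)).
S, T = np.meshgrid(t, t, indexing="ij")
sigma_cdf = (ecdf(np.minimum(S, T)) - np.outer(F_n, F_n)) / n
sigma_cf = (ecf(S + T) - np.outer(c_n, c_n)) / n
print(sigma_cdf.shape, sigma_cf.shape)
```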
The following lemma gives a useful characterization of the operator Σn in terms of K and its adjoint K ∗ . We recall that the adjoint K ∗ of a bounded and linear operator K : E → F is defined as the operator from F to E that satisfies ⟨Kϕ, ψ⟩ = ⟨ϕ, K ∗ ψ⟩, ∀ϕ ∈ E and ψ ∈ F. In our case, an elementary computation shows that K ∗ ψ = ∫_T k(t, x)ψ(t)ρ(dt).

Lemma 2.2. Let K : E → F be the bounded and linear operator defined as Kϕ = ∫_S k(t, x)ϕ(x)Π(dx), ∀ϕ ∈ E, and K ∗ : F → E be its adjoint, that is, K ∗ ψ = ∫_T k(t, x)ψ(t)ρ(dt), ∀ψ ∈ F. Moreover, denote with f∗ the true value of f that characterizes the DGP. Then, the operator Σn = (1/n)Σ takes the form

∀ψ ∈ F,    Σψ = KMf K ∗ ψ − (KMf 1)⟨Mf 1, K ∗ ψ⟩    (2.6)

where Σ : F → F and Mf : E → E is the multiplication operator defined by Mf ϕ = f∗ (x)ϕ(x), ∀ϕ ∈ E.
We denote by D the subset of E whose elements integrate to 0 with respect to Π:

D := {g ∈ E : ∫ g(x)Π(dx) = 0}.

We remark that D contains the subset of functions in E that are differences of pdfs of F with respect to Π. Moreover, R(Ω0θ^{1/2}) ⊂ D where R(Ω0θ^{1/2}) denotes the range of Ω0θ^{1/2} and is defined as

R(Ω0θ^{1/2}) := {ϕ ∈ E : ∫_S ϕ(x)h(θ, x)Π(dx) = 0 and ∫_S ϕ(x)Π(dx) = 0}.
The following lemma states the relationship between the range of K and the range of Σ^{1/2}.

Lemma 2.3. Let K : E → F be the operator Kϕ = ∫_S k(t, x)ϕ(x)Π(dx), ∀ϕ ∈ E, and denote by K|D the operator K restricted to D ⊂ E. Then, if K|D is injective we have R(K|D ) = D(Σ^{−1/2}).
2.3 Posterior distribution
The Bayesian model defines a joint distribution on (the Borel σ-field of) Λ × F and can be summarized in the following way:

θ ∼ µ(θ),
f |θ ∼ µ(f |θ) ∼ GP(f0θ , Ω0θ ),   with ∫ h(θ, x)f0θ (x)Π(dx) = 0 and Ω0θ^{1/2} (1, h(θ, ·)')' = 0,
rn |f, θ ∼ rn |f ∼ P^f ∼ GP(Kf, Σn ),

where we use the approximated sampling distribution P^f . Theorem 1 in Florens and Simoni [2012] shows that the distribution of (f, rn ), conditional on θ, is:

(f, rn )|θ ∼ GP( (f0θ , Kf0θ )',  [ Ω0θ , Ω0θ K ∗ ; KΩ0θ , Σn + KΩ0θ K ∗ ] )    (2.7)

where (Σn + KΩ0θ K ∗ ) : F → F, Ω0θ K ∗ : F → E and KΩ0θ : E → F. The marginal sampling distribution of rn conditional on θ, obtained by integrating out f , is:

rn |θ ∼ Pnθ ∼ GP(Kf0θ , Σn + KΩ0θ K ∗ ).    (2.8)
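Numerically, once E and F are replaced by grids, (2.7)-(2.8) reduce to ordinary multivariate normal computations; K, f0θ, Ω0θ and Σn become matrices and vectors. The toy objects below are arbitrary stand-ins meant only to show the shape of the computation.

```python
# Finite-dimensional illustration of the marginal sampling distribution (2.8).
import numpy as np

def marginal_moments(K, f0theta, Omega0theta, Sigma_n):
    """Mean and covariance of r_n | theta after integrating out f."""
    return K @ f0theta, Sigma_n + K @ Omega0theta @ K.T

# Toy dimensions: 50 grid points for E, 40 for F (arbitrary stand-ins).
rng = np.random.default_rng(2)
K = rng.standard_normal((40, 50)) / 50.0
f0 = np.abs(rng.standard_normal(50)); f0 /= f0.sum()    # a discretized pdf
B = rng.standard_normal((50, 50)) / 50.0
Omega0 = B @ B.T                                         # some psd prior covariance
Sigma_n = np.eye(40) * 1e-3                              # stand-in for (1/n) Sigma
mean, cov = marginal_moments(K, f0, Omega0, Sigma_n)
print(mean.shape, cov.shape)
```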
We now discuss the posterior distribution, denoted by µ(·|rn ). In the main text we discuss
only the posterior distribution of the parameter of interest θ and postpone to Appendix A
the discussion about the conditional posterior distribution µ(f |rn , θ) of f given θ.
2.3.1 Posterior distribution of θ
The marginal posterior for θ, denoted by µ(θ|rn ), is obtained by using the marginal
sampling distribution Pnθ given in (2.8). To obtain µ(θ|rn ) we characterize the likelihood of
Pnθ with respect to a dominating measure Pn0 that has to be defined. The following theorem,
which is a slight modification of Theorem 3.4 in Kuo [1975, page 125], characterizes a
probability measure Pn0 which is equivalent to Pnθ as well as the corresponding likelihood of
Pnθ with respect to Pn0 .
Theorem 2.1. Let f̃ ∈ EM denote a probability density function (with respect to Π) and Pn0 be a Gaussian measure with mean K f̃ and covariance operator n^{−1} Σ, i.e. Pn0 = GP(K f̃, n^{−1} Σ). For n fixed, if K|D is injective, then Pn0 and Pnθ are equivalent. Moreover, assume that ψj ∈ R(Σ^{1/2}), ∀j ≥ 0; then the Radon-Nikodym derivative is given by

pnθ (rn ; θ) := (dPnθ /dPn0 )(rn ) = ∏_{j=1}^∞ (1/√(n ljθ^2 + 1)) exp{ −(1/2) [ ⟨rn − Kf0θ , Σ^{−1/2} ψj (θ)⟩^2 / (n^{−1} + ljθ^2 ) − n ⟨rn − K f̃, Σ^{−1/2} ψj (θ)⟩^2 ] }    (2.9)

where ljθ^2 and ψj (θ) are the eigenvalues and eigenfunctions of Σ^{−1/2} KΩ0θ K ∗ Σ^{−1/2}.

The quantities ljθ^2 and ψj (θ) depend on θ because the operator Σ^{−1/2} KΩ0θ K ∗ Σ^{−1/2} depends on θ. We also remark that we can use any density function for the mean function f̃ as long as it does not depend on θ. For instance, it could be f̃ = f∗ even if it is unknown in practice.
The marginal posterior distribution of θ takes the form

µ(θ|rn ) = dPnθ (rn |θ)µ(θ) / ∫_Θ dPnθ (rn |θ)µ(θ)dθ = pnθ (rn ; θ)µ(θ) / ∫_Θ pnθ (rn ; θ)µ(θ)dθ,

and can be used to compute a point estimator of θ. We propose to use the maximum a posteriori (MAP) estimator θn defined as

θn = arg max_{θ∈Θ} µ(θ|rn ).    (2.10)
In general, when the conditional prior distribution on f , given θ, is very precise, the MAP will essentially be equivalent to the maximum likelihood estimator (MLE) that we would obtain if we used the prior mean function f0θ as the likelihood. On the contrary, with an almost uninformative prior µ(f |θ), the MAP will be close to the GMM estimator (up to a prior on θ). Section 4 makes this argument rigorous.
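To illustrate how (2.9)-(2.10) can be handled numerically, the sketch below works with a discretized model: it diagonalizes Σ^{−1/2} KΩ0θ K ∗ Σ^{−1/2}, forms the coefficients ⟨rn − Kf0θ , Σ^{−1/2} ψj (θ)⟩ and ⟨rn − K f̃, Σ^{−1/2} ψj (θ)⟩, and returns the log of (2.9). All matrices are assumed to be supplied by the user (e.g. from the constructions of sections 4 and 5); the symmetric inverse square root is just one possible numerical choice.

```python
# Illustrative evaluation of the log-likelihood (2.9) in a discretized model.
import numpy as np

def inv_sqrt(Sigma):
    # Symmetric inverse square root; tiny eigenvalues are clipped for stability.
    w, V = np.linalg.eigh(Sigma)
    return V @ np.diag(1.0 / np.sqrt(np.clip(w, 1e-10, None))) @ V.T

def log_p_ntheta(rn, Kf0theta, Kftilde, KOmegaKt, Sigma, n):
    """Log Radon-Nikodym derivative (2.9); all arguments are finite-dimensional."""
    W = inv_sqrt(Sigma)
    l2, Psi = np.linalg.eigh(W @ KOmegaKt @ W)     # l_jtheta^2 and psi_j(theta)
    l2 = np.clip(l2, 0.0, None)
    a = Psi.T @ (W @ (rn - Kf0theta))              # <r_n - K f_0theta, Sigma^{-1/2} psi_j>
    b = Psi.T @ (W @ (rn - Kftilde))               # <r_n - K f_tilde,  Sigma^{-1/2} psi_j>
    return float(np.sum(-0.5 * np.log(n * l2 + 1.0)
                        - 0.5 * (a ** 2 / (1.0 / n + l2) - n * b ** 2)))

# The MAP (2.10) is then a standard optimization, e.g. over a grid of theta values:
#   theta_n = max(theta_grid, key=lambda th: log_p_ntheta(rn, K @ f0(th), K @ f_tilde,
#                 K @ Omega0(th) @ K.T, Sigma, n) + log_prior(th))
```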
Remark 2.3. We have already discussed (see Remark 2.2) the possibility of using a different prior scheme when we are in the just-identified case and θ can be written as a linear functional of f . In that case, given a Gaussian process prior on f , the prior of θ is recovered through the transformation θ = B(f ). The posterior distribution for θ is recovered from the posterior distribution of f (which is obviously unconditional on θ) through the transformation B(f ).
2.3.2 Choice of f0θ
Our approach requires that the prior mean function f0θ be an element of the following set:

F(θ) = {f ∈ EM : ∫ h(θ, x)f (x)dΠ(x) = 0}.

Selection of an element of F(θ) is a linear inverse problem. While any element of F(θ) is a valid candidate for f0θ , asymptotic arguments require that the chosen f0θ converges to f∗ when θ is replaced by the true θ∗ . Hence, one has to define a population criterion Q(·) and to select f0θ as the minimizer of this criterion over F(θ) for a given θ. In practice, Q(·) will depend on unknown quantities that have to be replaced by estimated ones. We denote by Qn (·) the corresponding sample criterion and by f0θ,n the chosen prior mean function defined as:

f0θ,n = arg min_{f ∈F (θ)} Qn (f ),   for a fixed θ ∈ Θ.    (2.11)
This procedure delivers an empirical Bayes procedure because f0θ,n is data-dependent. We
outline two possible criteria.
Regularization. The population criterion is

Q(f ) = ||EF∗ (k(t, x)) − Kf ||^2 + α pen(f )

where α > 0 and pen(·) is a penalty function. For instance, if pen(f ) = ||f ||^2 then f0θ is equal to the constrained Tikhonov regularized solution. Alternatively, as f0θ is known to be non-negative, a more appropriate penalty is the negative entropy:

pen(f ) = ∫ f log f dΠ,

under the assumption that this quantity is finite. The existence of a constrained maximum entropy solution is guaranteed under very general conditions, see e.g. Amato and Hughes [1991]. The sample criterion is Qn (f ) = ||rn − Kf ||^2 + α pen(f ).
I-Divergence. The population criterion is the I-divergence D(F ||F∗ ) (or Kullback-Leibler information criterion) between F and F∗ :

Q(f ) = D(F ||F∗ ) = ∫ log(dF/dF∗ ) dF,   where dF/dF∗ = f /f∗ .

The sample criterion is obtained by replacing F∗ by its empirical counterpart Fn : Qn (f ) = ∫ log(dF/dFn ) dF . Then, we solve the problem

min_{p1 ,...,pn } (1/n) Σ_{i=1}^n n pi log(n pi )   such that   Σ_{i=1}^n pi h(xi , θ) = 0,   Σ_{i=1}^n pi = 1.
See e.g. the beautiful paper by Csiszar [1975], Kitamura and Stutzer [1997] and Kitamura [2007] for a description of this procedure. By using duality arguments it is possible to find an explicit solution to the minimization problem that, once replaced in Kf0θ,n , gives:

Kf0θ,n = (1/n) Σ_{i=1}^n k(xi , t) e^{γn (θ)' h(xi ,θ)}    (2.12)

where γn (θ) = arg min_γ (1/n) Σ_{i=1}^n e^{γ' h(xi ,θ)} . Remark that, because f0θ enters µ(θ|rn ) only through Kf0θ , Kf0θ,n is all we need. Equation (2.12) is obtained by writing Kf0θ,n = ∫ k(x, t)f0θ,n (x)dΠ(x) = ∫ k(x, t)(dF/dFn )dFn , by abuse of notation.
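A possible numerical implementation of (2.12) is sketched below: γn (θ) is obtained by minimizing the empirical exponential moment and Kf0θ,n is the tilted empirical average of k(xi , t). The moment function and kernel used in the example are illustrative assumptions, not the paper's choices.

```python
# Illustrative exponential-tilting computation of K f_{0theta,n}(t) as in (2.12).
import numpy as np
from scipy.optimize import minimize

def tilted_Kf0(x, theta, t_grid, h, k):
    """Return K f_{0theta,n} on t_grid via the I-divergence (exponential tilting)."""
    H = h(theta, x)                                # (n, r) matrix of moment functions
    obj = lambda g: np.mean(np.exp(H @ g))         # (1/n) sum_i exp(gamma' h(x_i, theta))
    gamma = minimize(obj, np.zeros(H.shape[1]), method="BFGS").x   # gamma_n(theta)
    tilt = np.exp(H @ gamma)                       # exp(gamma_n(theta)' h(x_i, theta))
    return (k(x[:, None], t_grid[None, :]) * tilt[:, None]).mean(axis=0)

# Example: h(theta, x) = (x - theta, x^2 - theta^2 - 1)' and k(x, t) = 1{x <= t}.
rng = np.random.default_rng(3)
x = rng.standard_normal(500)
h = lambda th, x: np.column_stack([x - th, x ** 2 - th ** 2 - 1.0])
k = lambda x, t: (x <= t).astype(float)
print(tilted_Kf0(x, 0.1, np.linspace(-3.0, 3.0, 5), h, k))
```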
Both these procedures give an f0θ,n that, for a given θ, converges to the pseudo-true
function f0θ := arg min_{f ∈F (θ)} Q(f ) as n → ∞. Because f0θ is the solution of a convex problem, its continuity in θ follows from Theorem 4.2 in Bonnans and Shapiro. Then
f0θ converges to f∗ as θ → θ∗ .
3 Asymptotic Analysis
In this section we focus on the asymptotic properties of our approach. We analyze
three issues: (i) frequentist consistency of the MAP estimator θn (section 3.1), (ii) weak
convergence of µ(θ|rn ) towards a normal distribution (section 3.2), and (iii) convergence
in Total Variation of the posterior µ(θ|rn ) towards the asymptotic distribution of the method of moments estimator of θ for the just-identified case where θ is a linear functional of f (section 3.3). In the following, for every θ̃ ∈ Θ and δ > 0 we denote by B(θ̃, δ) the closed ball centered at θ̃ with radius δ, that is, B(θ̃, δ) = {θ ∈ Θ : ||θ − θ̃|| ≤ δ}. Moreover, let ln (θ) = log(pnθ (rn ; θ)) and Ln (θ) = ln (θ) + log µ(θ).
3.1 Posterior Consistency
In this section we study the consistency of the MAP estimator θn when f0θ is determined by using the I-Divergence. In this case,

f0θ (x)/f∗ (x) = e^{γ(θ)' h(θ,x)}

where γ(θ) := arg min_γ E∗ e^{γ' h(θ,x)} , see Kitamura [2007]. Let ln (θ) = log(pnθ (rn ; θ)) where
we replace f0θ by f0θ,n . Define l(θ) as

l(θ) = −(1/2) Σ_{j=1}^∞ ⟨K(f0θ − f∗ ), Σ^{−1/2} ψj (θ)⟩^2 + (1/2) Σ_{j=1}^∞ ⟨K(f̃ − f∗ ), Σ^{−1/2} ψj (θ)⟩^2    (3.1)
     = −(1/2) ||Σ^{−1/2} K(f0θ − f∗ )||^2 + (1/2) ||Σ^{−1/2} K(f̃ − f∗ )||^2    (3.2)

where the second equality is due to the fact that Σ^{−1/2} K|D is well defined by Lemma 2.3. The maximizer of l(θ) is equal to the minimizer of ||Σ^{−1/2} K(f0θ − f∗ )||^2 which can be written

||Σ^{−1/2} K(f0θ − f∗ )||^2 = ||Σ^{−1/2} K(e^{γ(θ)' h(θ,x)} − 1)f∗ ||^2 .    (3.3)

Since e^{γ(θ)' h(θ,x)} = f0θ (x)/f∗ (x), the value of θ that minimizes (3.3) is the value that makes e^{γ(θ)' h(θ,x)} as close as possible to one, that is, the value that makes f0θ as close as possible to f∗ . If Σ^{−1/2} K is injective then there exists a unique value that minimizes (3.3). This value is the true θ∗ . From this reasoning, it turns out that consistency of θn will then follow from uniform convergence of (1/n) ln (θ) + (1/n) log µ(θ) towards l(θ).
To get uniform convergence we replace the fixed GP prior for f with a shrinking GP prior, that is, a prior where Ω0θ is replaced by an Ω0θ with an = o(n^{−1}). The intuition for this is that, because the prior mean f0θ,n converges towards f∗ as n → ∞ and θ = θ∗ , the prior distribution should also be more and more concentrated. Remark that with this prior the eigenvalues ljθ^2 in Theorem 2.1 satisfy n ljθ^2 → 0. We introduce the following assumptions:
A1. The true parameter θ∗ belongs to the interior of a compact convex subset Θ of Rk .
A2. The maps t 7→ k(x, t) and θ 7→ h(θ, x) are continuous for every x.
A3. The prior distribution µ(θ) is continuous in θ.
Theorem 3.1. Let µ(f |θ) be the GP prior with mean function f0θ,n = arg min_{f ∈F (θ)} Qn (f ), where Qn is the I-divergence sample criterion, and covariance operator an Ω0θ where Ω0θ satisfies Restriction 2 and nan → 0. Let T be compact; under A1-A3:

θn →p θ∗ .

Moreover, for some δn ↓ 0,

µ (θ ∈ B(θn , δn )^c |rn ) → 0    (3.4)

in P ∗ -probability as n → ∞.
3.2 Asymptotic Normality: weak convergence
We show now that asymptotically the posterior distribution of θ behaves like a normal
distribution centered around the MAP estimator θn defined in (2.10). Denote Ln (θ) =
log(pnθ (rn ; θ)) + log µ(θ) = ln (θ) + log µ(θ). We make the following assumptions
B1. θn is a strict local maximum of µ(θ|rn ) and L'n (θn ) := ∂Ln (θ)/∂θ |_{θ=θn} = 0.

B2. ∆n = {−L''n (θn )}^{−1} := { −∂^2 Ln (θ)/∂θ∂θ' |_{θ=θn} }^{−1} exists and is positive definite.

B3. For any ε > 0, there exists an integer N and δn ↓ 0 such that, for any n > N and θ ∈ B(θn , δn ), L''n (θ) exists and satisfies

I − G(ε) ≤ L''n (θ){L''n (θn )}^{−1} ≤ I + G(ε),

where I is a (k × k) identity matrix and G(ε) is a (k × k) positive semidefinite symmetric matrix whose largest eigenvalue gmax (ε) converges to 0 as ε → 0.

B4. Let dmax be the largest eigenvalue of ∆n and δn be as in B3. Then, dmax ↓ 0 and δn /dmax^{1/2} → 0 as n → ∞.

Assumptions B1-B4 hold depending on the form taken by the operators K, Σ, Ω0 and the function rn .
The next theorem, which is a slight modification of results in Kim [2002] and Chen
[1985], shows that the posterior distribution µ(θ|rn ) converges weakly towards a normal
distribution centered at the MAP estimator and with variance equal to ∆n .
Theorem 3.2. Let δn ↓ 0 be as in Theorem 3.1. Under Assumptions B1-B4 and the assumptions of Theorem 3.1, the posterior distribution of ∆n^{−1/2} (θ − θn ) converges weakly to a standard normal distribution, that is, for every θ1 , θ2 ∈ Θ,

∫_{ {θ: ∆n^{−1/2} (θ−θn )∈(θ1 ,θ2 )} } dµ(θ|rn ) → ∫_{θ1}^{θ2} φ(u)du

in P ∗ -probability, where φ(·) denotes the standard normal pdf.
3.3 Convergence in Total Variation for linear functionals: the just-identified case
In this section we consider the just-identified case where k = r and where the assumptions of the implicit function theorem are satisfied so that every pair (θ, F ) satisfying (1.1) can be explicitly written as θ = B(f ). Assume in addition that B(·) is a linear functional with values in Rr . Denote by E^r the cartesian product ∏_{i=1}^r E. Hence, by the Riesz theorem, there exists a unique g ∈ E^r identifying B such that for all f ∈ E:

θ = B(f ) = ∫ g(x)dF (x).

In this situation the prior distribution of θ is specified through the prior distribution of f : θ ∼ µ(θ) = N (⟨f0 , g⟩, ⟨Ω0 g, g⟩), as described in Remark 2.2. The posterior distribution is then given by

µ(θ|rn ) = N (θ^{rn} , Ωn ),   where θ^{rn} = ⟨f0 + A(rn − Kf0 ), g⟩,   Ωn = ⟨(Ω0 − AKΩ0 )g, g⟩

and A : F → E is a continuous and linear operator whose expression is given in Lemmas A.1 and A.2 in the Appendix. In the following, we establish convergence in total variation of µ(θ|rn ) towards a Normal distribution. This convergence is stronger than the weak convergence considered in section 3.2. We implicitly assume that the conditions of Lemmas A.1 and A.2 are satisfied. When this is not the case, our asymptotic results can be easily extended to the case where the exact posterior µ(f |θ, rn ) is replaced by the regularized posterior distribution discussed in Remark A.4.

By using the usual notation for empirical processes, we define θ̂ = B(Pn ) := n^{−1} Σ_{i=1}^n g(xi ), the method of moments estimator, and by V the variance of √n B(Pn ) under F∗ . The efficient influence function g̃ : S → Rk takes the form g̃ = g − E∗ g and V = E∗ (g̃^2 ). For two probability measures P and Q absolutely continuous with respect to a positive measure Π, define the total variation (TV) distance as

||P − Q||_{TV} = (1/2) ∫ |fP − fQ |dΠ = ∫ (fP − fQ )_+ dΠ

where fP and fQ are the Radon-Nikodym derivatives of P and Q, respectively, with respect to Π. The next theorem states that the total variation distance between µ(√n(θ − θ̂)|rn ) and N (0, V ) converges to 0 in probability.
Theorem 3.3. Let θ̂ = B(Pn ) := n^{−1} Σ_{i=1}^n g(xi ) and consider the Gaussian model (2.7) (without θ). If f∗^{−1/2} ∈ R(K ∗ ) and g ∈ C^∞ , then

|| µ(√n(θ − θ̂)|rn ) − N (0, V ) ||_{TV} → 0

in P ∗ -probability.

4 The case with span{1, h1 (θ, ·), . . . , hr (θ, ·)} independent of θ
We focus on the case where the space spanned by {1, h1 (θ, ·), . . . , hr (θ, ·)}, namely
N(Ω0θ ), does not depend on θ. This arises for instance when the moment functions h(θ, x)
are separable in θ and x. Therefore, we can choose any orthonormal basis (o.n.b.) with
respect to Π, say {ϕj }rj=0 , that spans N(Ω0θ ) and that does not depend on θ. Moreover,
N(Ω0θ )⊥ is also independent of θ and spanned by an o.n.b. {ϕj }j>r independent of θ as
well. An example where span{1, h1 (θ, ·), . . . , hr (θ, ·)} does not depend on θ is the case
where h(θ, x), after normalization, is of the form h(θ, x) = a(x) − b(θ) with a(x) = (a1 (x), . . . , ar (x))' and b(θ) = (b1 (θ), . . . , br (θ))', and V ar(h(θ, x)) = Ir , where Ir denotes the r-dimensional identity matrix. Thus, the prior covariance operator Ω0θ does not depend on θ and writes:

(Ω0θ ρ)(x) = Ω0 ρ = Σ_{j>r} λj ⟨ρ, ϕj ⟩ϕj (x),   ∀ρ ∈ E.
On the other hand, the prior mean function f0θ does depend on θ.

Let {ψj }j≥0 be an o.n.b. in F and {λjK }j≥0 be a square-summable sequence of positive real numbers. We can then construct the operator K and the transformation rn as:

(Kρ)(t) = Σ_{j=0}^∞ λjK ⟨ρ, ϕj ⟩ψj (t) = ∫ Σ_{j=0}^∞ λjK ρ(x)ϕj (x)ψj (t)Π(dx),   ∀ρ ∈ E,
rn = (1/n) Σ_{i=1}^n Σ_{j=0}^∞ λjK ϕj (xi )ψj (t).

Hence, the kernel k(x, t) characterizing the operator K writes: k(x, t) = Σ_{j=0}^∞ λjK ϕj (x)ψj (t). The adjoint of K, denoted by K ∗ , writes: (K ∗ φ)(x) = Σ_{j=0}^∞ λjK ⟨φ, ψj ⟩ϕj (x), ∀φ ∈ F. Remark that neither K, K ∗ , rn nor Σ depend on θ.
In this section we fix Π = F∗ and denote E∗ = L2 (S, BS , F∗ ). Therefore, {ϕj }j≥0 is an
o.n.b. with respect to F∗ and the truth is f∗ = 1.
The operator Σ has an eigensystem which satisfies: Σψj = λjK^2 ψj for j ≥ 1 and Σψ0 = 0. To see this, we write Σ in the form given in Lemma 2.2: Σ = KMf K ∗ − (KMf 1)⟨Mf 1, K ∗ ·⟩, and since f∗ = 1 we have:

Σψj = KK ∗ ψj − K1⟨K1, ψj ⟩ = λjK Kϕj − λ0K^2 ψ0 ⟨ψ0 , ψj ⟩ = λjK^2 ψj ,   ∀j ≠ 0,
Σψ0 = λ0K^2 ψ0 − (λ0K ψ0 )⟨K1, ψ0 ⟩ = λ0K^2 ψ0 − λ0K^2 ψ0 = 0.

The result has been obtained by using the fact that ⟨K1, ψj ⟩ = λ0K ⟨ψ0 , ψj ⟩ = 0 for j ≥ 1 and ⟨1, ϕ0 ⟩ = 1.
Remark 4.1. The intuition behind the fact that Σ has an eigenvalue equal to 0 comes
from lemma 2.3. Having the same eigenfunctions means that the ranges of Σ1/2 and K
must be equal. However, we know from lemma 2.3 that when K|D is injective R(Σ1/2 )
is equal to the range of the restriction of K to D, that is K|D . Since D is orthogonal
to ϕ0 the eigenfunction ψ0 , which is the transformation of ϕ0 under K, does not belong
to the eigensystem of Σ corresponding to nonzero eigenvalues. Therefore, the eigenvalue
corresponding to ψ0 must be equal to 0. Finally, because Σ has one eigenvalue equal to 0
it is not injective and the R.K.H.S. associated with Σ is not dense in F.
In this particular case the eigenvalues lj^2 (θ) and eigenfunctions ψj (θ) in (2.9) are as follows:

lj^2 (θ) = λj for j > r,   lj^2 (θ) = 0 for j = 0, . . . , r,   and   ψj (θ) = ψj for j = 0, 1, . . . .    (4.1)
This can be verified by trivial computations. It follows that the likelihood in (2.9) can be
simplified and the MAP writes as:
θn = arg max_{θ∈Θ} pnθ (rn ; θ)µ(θ) = arg max_{θ∈Θ} {log pnθ (rn ; θ) + log µ(θ)}

   = arg max_{θ∈Θ} { −(1/2) Σ_{j=1}^r (n/λjK^2) [ ⟨rn − Kf0θ , ψj ⟩^2 − ⟨rn − K f̃, ψj ⟩^2 ]
        −(1/2) Σ_{j>r} (n/λjK^2) [ ⟨rn − Kf0θ , ψj ⟩^2 (1/(1 + nλj )) − ⟨rn − K f̃, ψj ⟩^2 ] − (1/2) Σ_{j>r} log(nλj + 1) + log µ(θ) }

   = arg min_{θ∈Θ} { Σ_{j=1}^r (n/(2λjK^2)) ⟨rn − Kf0θ , ψj ⟩^2 + Σ_{j>r} (n/(2λjK^2)) ⟨rn − Kf0θ , ψj ⟩^2 (1/(1 + nλj )) − log µ(θ) }

   = arg min_{θ∈Θ} { Σ_{j=1}^r ( (1/n) Σ_{i=1}^n ϕj (xi ) − ⟨f0θ , ϕj ⟩ )^2    (4.2)
        + Σ_{j>r} ( (1/n) Σ_{i=1}^n ϕj (xi ) − ⟨f0θ , ϕj ⟩ )^2 (1/(1 + nλj )) − log µ(θ) }    (4.3)

where we have eliminated the terms that do not depend on θ and we have used the fact that (1/λjK^2) ⟨rn − Kf0θ , ψj ⟩^2 = ( (1/n) Σ_{i=1}^n ϕj (xi ) − ⟨f0θ , ϕj ⟩ )^2 . Equation (4.3) is quite useful and allows us to emphasize several aspects of our methodology.
I. The first term in (4.3) accounts for the moment restrictions and minimization of this term corresponds to the classical GMM. In fact, by construction ∫ ϕj (x)f0θ (x)Π(dx) is not 0 because we are using transformations of the moment functions. Thus, ( (1/n) Σ_{i=1}^n ϕj (xi ) − ∫ ϕj (x)f0θ (x)Π(dx) )^2 depends on θ through f0θ which is chosen as close as possible to the true f∗ according to the criterion (2.11). Remark that if we do the inverse transformation from {ϕj }_{j=1}^r to {hj (x, θ)}_{j=1}^r then the term involving f0θ will be zero and the term ϕj (xi ) will be written in terms of θ. This is similar to the LIL method proposed by Kim [2002].
II. The second term in (4.3) accounts for the extra information that we have, namely, the
information contained in the subspace of E orthogonal to span{1, h1 (θ, ·), . . . , hr (θ, ·)}.
This information, which is in general not exploited by the classical GMM estimation,
can be exploited thanks to the prior distribution and the choice of f0θ . In fact, the
parameter θ enters our criterion in (4.3) through f0θ . Thus, the value of θ that
maximizes the posterior distribution is the value that makes the pdf f0θ as close as
possible to the true f∗ .
III. Expression (4.3) makes an explicit connection between the parametric case (infinite number of moment restrictions) and the semi-parametric case (when only the first r moment conditions hold). The semiparametric case corresponds to the classical GMM approach while the parametric case corresponds to (and is as efficient as) the MLE. Indeed, the prior distribution for f specifies a parametric model for f0θ which satisfies the r moment restrictions and eventually other “extra” moment restrictions. The eigenvalues λj of the prior covariance operator play the role of weights of the “extra” moment restrictions and represent our “beliefs” concerning these restrictions. When we are very confident about these “extra” conditions, or equivalently we believe that f0θ is close to f∗ , then the λj are close to zero. In particular, since f0θ converges to f∗ (θ) for every θ, the λj should converge to 0 as n → ∞. The prior distribution for f is degenerate when the parametric model is the true one. When we are very uncertain about f0θ then the λj are very large and may tend to +∞. In this case we are back to the classical GMM.
IV. The posterior distribution (4.3) does not depend on Σ anymore. This is a considerable
advantage since Σ is in general unknown so that we do not have to estimate it.
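To make the role of the two terms in (4.3) concrete, the sketch below evaluates the criterion for a given θ from the basis coefficients of the data and of the prior mean. The basis functions, the prior-mean coefficients and the eigenvalues λj are all inputs assumed to be supplied by the user; this follows the reconstruction of (4.3) given above.

```python
# Illustrative evaluation of the criterion minimized in (4.3).
import numpy as np

def map_objective(theta, x, phi, f0_coef, lambdas, r, log_prior):
    """Criterion in (4.3) for one value of theta.

    phi(x)        : (n, J) matrix of basis functions [phi_j(x_i)], j = 1, ..., J
    f0_coef(theta): length-J vector of prior-mean coefficients <f_0theta, phi_j>
    lambdas       : length-J vector of prior eigenvalues (only entries j > r are used)
    r             : number of moment restrictions (the first r terms get weight 1)
    """
    n = x.size
    emp = phi(x).mean(axis=0)                # (1/n) sum_i phi_j(x_i)
    dev = (emp - f0_coef(theta)) ** 2        # squared deviations from <f_0theta, phi_j>
    weights = np.ones_like(dev)
    weights[r:] = 1.0 / (1.0 + n * lambdas[r:])   # downweight the "extra" directions
    return float(dev @ weights - log_prior(theta))
```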
4.1 Testing procedures
Remark III in section 4 is important if one is interested in constructing testing procedures. We are not going to develop a formal test here, as this will be the object of a separate paper, but we would like to point out that our procedure suggests an easy way
to test a parametric model against a semi-parametric one characterized by a finite number
of moment restrictions. We can deal with the two following situations:
1. We know that the distribution of the data satisfies r moment restrictions and we
want to test that it has a particular parametric form. In this case, for a given pdf g,
the null hypothesis is H0 : f∗ = g ∈ EM where g is such that ∫ h(θ, x)g(x)Π(dx) = 0.
An example is the univariate linear regression model: Y = Zθ +ε, where X = (Y, Z)T
and f∗ is the true joint pdf of X that satisfies the linear model. We may want to test
that f∗ belongs to a particular parametric class.
2. There are r moment restrictions of which we are sure and we want to test the other
moment restrictions. The null hypothesis writes H0 : EF∗ (hj (θ, x)) = 0 for some j > r.
The natural approach would be to treat the λj s corresponding to the extra conditions
(that is, j > r) as hyperparameters for which a prior distribution is specified. The null
hypothesis, in both the cases above, writes as H0 : λj = 0 for some (or for all) j > r. Then,
the posterior distribution of λj may be used to draw a conclusion on the test: either by looking at whether the posterior concentrates around 0 (naïve approach), or by considering posterior odds, or by constructing encompassing tests. The last two methods are more rigorous and will be the object of a separate paper.
To construct a prior for the λj 's let us write λj = cρj where c = tr Ω0 and Σ_{j=1}^∞ ρj = 1. We propose two alternative priors.
Dirichlet prior. Suppose that we want to test the nullity of some λj 's, say λj for r < j < J < ∞. Then we specify a Dirichlet prior for (ρr+1 , . . . , ρJ−1 ):

µρ (ρr+1 , . . . , ρJ−1 |ν) ∝ ∏_{j=r+1}^{J−1} ρj^{νj −1} ( 1 − Σ_{j=r+1}^{J−1} ρj )^{νJ −1} ∏_{j=r+1}^{J−1} I(ρj ≥ 0) I( Σ_{j=r+1}^{J−1} ρj ≤ 1 )

where ν = (νr+1 , . . . , νJ ).
Prior on c > 0. Suppose that we want to test that all the moment restrictions are
true (that is, test of a parametric model against a semi-parametric one). Thus, the null
hypothesis is H0 : c = 0. Remark that the {λj }_{j=1}^r corresponding to the first r moment
restrictions do not affect the trace of Ω0 since they are equal to 0. A prior for c will be any
distribution with support contained in the positive real semi-axis, for example an inverse
gamma distribution.
5 Implementation
In this section we show, through the illustration of several examples, how our method
can be implemented in practice. We start with toy examples that can be treated also with
nonparametric priors different from the Gaussian prior. The interest in using Gaussian priors will be made evident in the more complicated examples where there are overidentifying
restrictions which we show can be easily dealt with by using Gaussian priors.
5.1 Just identification and prior on θ through µ(f )
Let the parameter θ of interest be the population mean with respect to f , that is, θ = ∫ xf (x)dx and h(θ, x) = (θ − x). This example considers the just identified case where the prior on θ is deduced from the prior distribution of f , denoted by µ(f ). The prior µ(f ) is a Gaussian measure which is unrestricted except for the fact that it must generate trajectories that integrate to 1 a.s. To guarantee this, the prior mean function f0 must be a pdf and the prior covariance operator Ω0 must be such that Ω0^{1/2} 1 = 0. Summarizing, the Bayesian experiment is

f ∼ µ(f ) ∼ N (f0 , Ω0 ),   with Ω0^{1/2} 1 = 0,
rn |f ∼ P^f ∼ N (Kf, Σn ).    (5.1)
In the simulations we present the results for a scaled and a non-scaled prior. The implied
prior and posterior distribution for θ is given in the following lemma. We use the notation
ι to denote the identity functional, that is, ι(x) = x.
Lemma 5.1. The Bayesian experiment (5.1) implies that the prior distribution for θ = ∫ xf (x)dx is Gaussian with mean ⟨f0 , ι⟩ and variance ⟨Ω0 ι, ι⟩, and its posterior distribution is

θ|rn ∼ N ( ⟨f0 , ι⟩ + ⟨Ω0 K ∗ Cn^{−1} (rn − Kf0 ), ι⟩,  ⟨[Ω0 − Ω0 K ∗ Cn^{−1} KΩ0 ]ι, ι⟩ )

where Cn^{−1} = (n^{−1} Σ + KΩ0 K ∗ )^{−1} .
This approach is appealing because it avoids the specification of two prior distributions
while keeping the specification of the sampling distribution completely nonparametric. The
prior is specified for the parameter with the highest dimension, that is f , and it implies a
prior on the parameter θ.
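Once the operators are discretized, the posterior mean and variance in Lemma 5.1 are plain matrix-vector computations, as in the sketch below. All objects are assumed to be given in coordinates in which inner products are Euclidean dot products (i.e. quadrature weights are already absorbed into the matrices and vectors).

```python
# Illustrative discretized version of the posterior of theta in Lemma 5.1.
import numpy as np

def posterior_theta(rn, K, f0, Omega0, Sigma_n, iota):
    """Posterior mean and variance of theta = <f, iota> under (5.1), discretized."""
    Cn = Sigma_n + K @ Omega0 @ K.T                 # (n^{-1} Sigma + K Omega_0 K*)
    A = Omega0 @ K.T @ np.linalg.inv(Cn)            # Omega_0 K* C_n^{-1}
    mean = iota @ (f0 + A @ (rn - K @ f0))
    var = iota @ ((Omega0 - A @ K @ Omega0) @ iota)
    return mean, var
```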
We illustrate now how to construct in practice the covariance operator Ω0 in (5.1) for
two cases: the case where F∗ has a compact support and the case where the support of F∗
is R. In both cases we suppose m = 1.
5.1.1 Compact support
Let S = [−1, 1] and Π be the Lebesgue measure. Then, the Legendre polynomials {Pj }j≥0 are suitable to construct the eigenfunctions of Ω0 . The first few Legendre polynomials are {1, x, (3x^2 − 1)/2, (5x^3 − 3x)/2, . . .} and an important property of these polynomials is that they are orthogonal with respect to the L2 inner product on [−1, 1]: ∫_{−1}^{1} Pl (x)Pj (x)dx = 2/(2j + 1) δlj , where δlj is equal to 1 if l = j and to 0 otherwise. Moreover, the Legendre polynomials obey the recurrence relation (j + 1)Pj+1 (x) = (2j + 1)xPj (x) − jPj−1 (x), which is useful for computing Ω0 in practice. The normalized Legendre polynomials form a basis for L2 [−1, 1] so that we can construct the operator Ω0 as

Ω0 · = σ0 Σ_{j=0}^∞ λj ((2j + 1)/2) ⟨Pj , ·⟩Pj

where λ0 = 0 and the {λj , j ≥ 1} can be chosen in an arbitrary way provided that Σ_{j≥1} λj < ∞. The constant σ0 can be set to an arbitrary value and has the purpose of tuning the size of the prior covariance. This construction of Ω0 and the fact that f0 is a pdf guarantee that the prior distribution generates functions that integrate to 1 a.s.
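The sketch below shows a possible discretized implementation of this construction using numpy's Legendre routines; the number of polynomials, the eigenvalue decay and σ0 are arbitrary illustrative choices.

```python
# Illustrative construction of Omega_0 from Legendre polynomials on S = [-1, 1].
import numpy as np
from numpy.polynomial import legendre

def legendre_covariance(grid, J=50, sigma0=1.0, a=1.7):
    """Omega_0 = sigma0 * sum_{j>=1} lambda_j ((2j+1)/2) <P_j, .> P_j, with lambda_0 = 0."""
    P = legendre.legvander(grid, J - 1)                 # columns P_0, ..., P_{J-1}
    lam = np.concatenate([[0.0], np.arange(1, J, dtype=float) ** (-a)])
    norm = (2.0 * np.arange(J) + 1.0) / 2.0             # 1 / ||P_j||^2 on [-1, 1]
    w = np.full(grid.size, 2.0 / grid.size)             # Riemann weights for Lebesgue Pi
    return sigma0 * (P * (lam * norm)) @ P.T @ np.diag(w)

grid = np.linspace(-1.0, 1.0, 200)
Omega0 = legendre_covariance(grid)
print(np.abs(Omega0 @ np.ones(grid.size)).max())        # ~0: trajectories integrate to 1
```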
In our simulation exercise we generate n i.i.d. observations (x1 , . . . , xn ) from a N (0, 1) distribution truncated to the interval [−1, 1] and construct the function rn = Φ̂(t) = n^{−1} Σ_{i=1}^n e^{itxi} as the empirical characteristic function. Therefore, f∗ = N (0, 1; −1, 1) and θ∗ = 0. We set T = [−1, 1] and ρ equal to the Lebesgue measure. Thus, the operators K and K ∗ take the form

Kφ = ∫_{−1}^{1} e^{itx} φ(x)dx, ∀φ ∈ E,   and   K ∗ ψ = ∫_{−1}^{1} e^{itx} ψ(t)dt, ∀ψ ∈ F.

The eigenfunctions of Ω0 are set equal to the normalized Legendre polynomials {√((2j + 1)/2) Pj }j≥0 , the eigenvalues are set equal to σ0 λj = 5 j^{−a} for j ≥ 1 and a > 1. We rescale Ω0 by 1/(αn). The prior mean function f0 is set equal to a N (ϱ, 1) distribution truncated to the interval [−1, 1]. We show in Figures 1 and 2 the prior and posterior distribution of θ obtained for different values of σ0 and α. We consider 4 cases: figures 1a and 1b are for a shrinking prior (α = 0.1), figures 2a and 2b are for a spreading out prior (α = 0.0001). We also show the prior mean (magenta asterisk) and the posterior mean of θ (blue asterisk) and report their values. The prior mean is computed by using the exact formula (for the mean of the truncated normal distribution f0 ) while the posterior mean of θ is computed by discretizing the inner product ⟨E(f |rn ), ι⟩. The pictures are obtained for n = 1000, ϱ = 1, a = 1.7. We consider two values for σ0 : σ0 = 1 and σ0 = 1/3. The number of discretization points, used to approximate the integrals, is equal to 1000 for all the simulation schemes.
We see from the pictures that the support of the prior and posterior distribution of θ is not restricted to [−1, 1]. This is due to the fact that a Gaussian process prior does not guarantee that the trajectories generated from it are non-negative. Therefore, constructing a prior for θ from the Gaussian process prior for f neglects part of the information that we have, that is, the fact that θ belongs to [−1, 1]. In order to avoid this phenomenon, we can use the following procedure to construct the prior (posterior) for θ from the prior (posterior) for f : (1) draw f^{(i)} from its prior (posterior), (2) compute max(f^{(i)} , 0) and re-normalize it (in order to eliminate the negative part of the trajectory), (3) compute θ^{(i)} = ⟨f^{(i)} , ι⟩, (4) iterate steps (1)-(3) for i = 1, . . . , I, (5) compute an approximation of the prior (posterior) distribution of θ through a kernel smoothing by using the I drawings of θ. Figure 3 shows the prior and posterior obtained by using this approach and a bandwidth h = 0.1 for the prior and h = 0.0001 for the posterior (this is because the posterior is much more concentrated than the prior).
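Steps (1)-(5) can be implemented directly, as in the sketch below; the covariance kernel matrix on the grid, the quadrature weights and the number of draws are assumptions of the illustration. The same code applies to draws of f from the posterior.

```python
# Illustrative implementation of the truncation-and-renormalization procedure.
import numpy as np

def theta_draws(f0, Cov, grid, w, n_draws=2000, seed=0):
    """Draw f ~ N(f0, Cov) on the grid, truncate at 0, renormalize, map to theta.

    Cov : symmetric psd matrix of the prior (or posterior) covariance kernel on the grid
    w   : quadrature weights, so that integrals are weighted sums over the grid
    """
    rng = np.random.default_rng(seed)
    evals, evecs = np.linalg.eigh(Cov)
    A = evecs * np.sqrt(np.clip(evals, 0.0, None))             # Cov = A A'
    F = f0 + rng.standard_normal((n_draws, grid.size)) @ A.T   # step (1): draws of f
    F = np.clip(F, 0.0, None)                                  # step (2): drop negative part
    F /= (F * w).sum(axis=1, keepdims=True)                    # ... and renormalize
    return F @ (w * grid)                                      # step (3): theta = <f, iota>

# Step (5): smooth the resulting draws with any kernel density estimator, e.g.
# scipy.stats.gaussian_kde(theta_draws(f0, Cov, grid, w)).
```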
[Figure 1: CASE I (Shrinking prior) - prior and posterior distributions and means of θ. Panels: (a) σ0 = 1/3, α = 0.1; (b) σ0 = 1, α = 0.1. The true value of θ is 0.]

[Figure 2: CASE II (Spreading out prior) - prior and posterior distributions and means of θ. Panels: (a) σ0 = 1/3, α = 0.0001; (b) σ0 = 1, α = 0.0001. The true value of θ is 0.]

[Figure 3: CASE III (Fix prior) - prior and posterior distributions and means of θ obtained after a normalization of the drawings from µ(f ). Panels: (a) σ0 = 1, α = 0.0001, h = 0.1; (b) σ0 = 1, α = 0.0001, h = 0.0001. The true value of θ is 0.]
5.1.2 Support equal to R
Let S = R; the Hermite polynomials {Hj }j≥0 form an orthogonal basis of L2 (R, B, Π) for dΠ(x) = e^{−x^2/2} dx and can be used to construct the eigenfunctions of Ω0 . The first few Hermite polynomials are {1, x, (x^2 − 1), (x^3 − 3x), . . .}, they obey the recurrence relation Hj+1 (x) = xHj (x) − jHj−1 (x), and an important property of these polynomials is that they are orthogonal with respect to Π: ∫_R Hl (x)Hj (x)e^{−x^2/2} dx = √(2π) j! δlj , where δlj is equal to 1 if l = j and to 0 otherwise. The operator Ω0 is constructed as

Ω0 · = σ0 Σ_{j=0}^∞ λj (1/(√(2π) j!)) ⟨Hj , ·⟩Hj

where λ0 = 0 and {λj , j ≥ 1} = {a^j , j ≥ 1} with a < 1.
In our simulation exercise we generate n i.i.d. observations (x1 , . . . , xn ) from a N (1, 1) distribution and construct the function rn = Φ̂(t) = n^{−1} Σ_{i=1}^n e^{itxi} as the empirical characteristic function. Therefore, f∗ = (1/√(2π)) e^{−(1−2x)/2} and θ∗ = 1. We set T = R and ρ = Π. Thus, the operators K and K ∗ take the form

Kφ = ∫_R e^{itx} φ(x)e^{−x^2/2} dx, ∀φ ∈ E,   and   K ∗ ψ = ∫_R e^{itx} ψ(t)e^{−t^2/2} dt, ∀ψ ∈ F.

We rescale Ω0 by 1/(αn). The prior mean function f0 is set equal to a N (ϱ, 1) distribution. In our simulation we set a = 0.3, . . . We show in Figures 4 and 5 the prior and posterior distribution of θ. We also show the prior mean (magenta asterisk) and the posterior mean of θ (blue asterisk) and report their values. The posterior mean of θ is computed by discretizing the inner product ⟨E(f |rn ), ι⟩. The pictures are obtained for n = 1000, f0 = (1/√(2π)) e^{−(ϱ^2 −2ϱx)/2} , ϱ = 0, a = 0.3, σ0 = 1 and two values of α. The number of discretization points, used to approximate the integrals, is equal to 1000 for all the simulation schemes. Concerning the values for α, we consider two cases: in figure 4 the prior is shrinking, in figure 5 the prior is spreading out.

[Figure 4: CASE I (Shrinking prior) - prior and posterior distributions and means of θ. Panels: (a) and (b) σ0 = 1, α = 1. The true value of θ is 1.]

[Figure 5: CASE II (Spreading out prior) - prior and posterior distributions and means of θ. Panels: (a) and (b) σ0 = 1, α = 0.0001. The true value of θ is 1.]
5.2 Just identification and prior on θ
We consider the same framework as in the previous example where the parameter θ of interest is the population mean, that is, θ = ∫ xf (x)dx and h(θ, x) = (θ − x), but now we are going to specify a joint proper prior distribution on (θ, f ). We specify a marginal prior µ(θ) on θ and a conditional prior on f given θ. While the first one can be arbitrarily chosen, the latter is specified as a Gaussian distribution constrained to generate functions that integrate to 1 and that have mean equal to θ a.s. In particular, the prior mean function f0θ must be a pdf and ∫ xf0θ (x)dx = θ must hold. The prior covariance operator Ω0 must be such that Ω0^{1/2} 1 = 0 and Ω0^{1/2} x = 0. Together with the constraint on f0θ , the first constraint
on Ω0 guarantees that the trajectories of f generated by this prior integrate to 1 a.s. while
the second one guarantees that ∫ xf (x)dx = θ a.s. Summarizing, the Bayesian experiment is

θ ∼ µ(θ),
f |θ ∼ µ(f |θ) ∼ N (f0θ , Ω0θ ),   with ∫ xf0θ (x)dx = θ and Ω0^{1/2} (1, x)' = 0,
rn |f ∼ P^f ∼ N (Kf, Σn ).    (5.2)
Compared to the approach in section 5.1, this approach allows us to easily incorporate any prior information that an economist may have about θ. In fact, taking into account the information on θ through the prior distribution of f is complicated, while incorporating such information directly in the prior distribution of θ is very simple.
Let us suppose that m = 1, S = [−1, 1] and Π is the Lebesgue measure. Then, the covariance operator Ω0θ can be constructed in the same way as proposed in section 5.1 since the second Legendre polynomial P1 (x) = x allows us to implement the constraint on θ. The only difference concerns the number λ1 which has to be equal to 0 in this case. Therefore, we construct the operator Ω0θ as:

Ω0θ · = σ0 Σ_{n=2}^∞ λn ((2n + 1)/2) ⟨Pn , ·⟩Pn

where the λj , j ≥ 2 can be chosen in an arbitrary way provided that Σ_{j≥2} λj < ∞. The constant σ0 can be set to an arbitrary value and has the purpose of tuning the size of the prior covariance.
Many orthogonal polynomials are suitable for the construction of Ω0θ and they may
be used to treat cases where S is different from [−1, 1]. Consider for instance the case
S = R, then, a suitable choice is the basis made of the Hermite polynomials. The Hermite
polynomials {Hen }n≥0 form an orthogonal basis of the Hilbert space L2 (R, BS , Π) where
dΠ(x) = e−x
2 /2
dx. It turns out that f will be the density of F with respect to Π and f0θ
the density of another probability measure with respect to Π instead of with respect to the
Lebesgue measure. The first Hermite polynomials are {1, x, (x2 − 1), (x3 − 3x), (x4 − 6x2 +
3), . . .} so that we can construct an Ω0θ that satisfies the constraints by setting λ0 = λ1 = 0
in the following way
Ω0θ · =
∞
X
n=2
λn √
1
hHen , ·iHen .
2πn!
We perform two simulations exercises: one makes use of the empirical cumulative
P
distribution function to construct rn : rn (t) = Fˆ (t) := n−1 ni=1 1{xi ≤ t} (CASE IV) and
P
ˆ
one uses the empirical characteristic function rn (t) = Φ(t)
:= n−1 ni=1 eitxi as in section
5.1 (CASE V). In both the simulations we use Legendre polynomials and we generate n
i.i.d. observations (x1 , . . . , xn ) from a N (0, 1) distribution truncated to the interval [−1, 1]
as in section 5.1. The prior distribution for θ is uniform over the interval [−1, 1]. The prior
mean function f0θ is taken equal to the pdf of a Beta distribution with parameters pθ and
q and with support [−1, 1]:
f0θ (x) =
(x + 1)pθ −1 (1 − x)q−1
.
B(pθ , q)2pθ +q−1
We use the notation pθ to stress the dependence on θ of this shape parameter. We fix q = 2
R1
and recover pθ such that −1 xf0θ (x)dx = θ. It is easy to see that for our Beta distribution:
R1
pθ −q
−1 xf0θ (x)dx = pθ +q . The covariance operator Ω0θ is constructed by using the Legendre
polynomials and λn = n−a for a > 1.
The marginal posterior distribution of θ is obtained by integrating out f from the
27
sampling distribution and by using the following Bayesian scheme:
(
θ
∼ U [−1, 1]
rn |θ ∼ N (Kf0θ , Σn + KΩ0θ K ∗ ).
Since the posterior distribution µ(θ|rn ) can not be computed in a closed-form we simulate
from it by using a Metropolis-Hastings algorithm, see for instance ?. To implement this
algorithm we have to selected an auxiliary pdf g.
The two simulations differ basically because in the first one we need to add an artificial
regularization to the conditional posterior distribution of f given θ as described in remark
ˆ we do not need to use an artificial regularA.4. On the contrary, in the simulation with Φ
ization and we use a scaling prior with covariance operator
1
αn Ω0 .
In both the simulations,
n = 1000 and the number of discretization points, used to approximate the integrals, is
equal to 1000.
We summarize the simulation schemes for the two cases.
1. Draw a n i.i.d. sample (x1 , . . . , xn ) from f∗ (where f∗ is a truncated standard Normal
distribution);
ˆ (CASE V);
2. compute rn = Fˆ (CASE IV) or rn = Φ
˜
3. draw θ ∼ U [−1, 1] and denote it θ;
4. compute pθ as pθ =
˜
(θ+1)2
1−θ˜
(where we have fixed q = 2);
5. compute f0θ as the pdf of a Beta distribution with parameters (pθ , q) and support
[−1, 1];
6. compute the conditional posterior mean and variance of f , given θ = θ˜ (these are
used for the figure 7b);
7. draw θ from the marginal posterior distribution of θ by using a Metropolis-Hasting
algorithm with the following auxiliary pdf (triangular distribution):
g(ξ; θ) =
ξ+1
1−ξ
I{ξ ∈ [−1, θ)} +
I{ξ ∈ [θ, 1]}.
θ+1
1−θ
We draw 2000 values and discard the first 1000. The initial value for the algorithm
is θ = 0.5
We represent in Figure 6 the results for the simulation with rn (t) = Fˆ (t) and in Figure
ˆ
7 the result for rn (t) = Φ(t).
More precisely, Figure 6a (resp. 7a) shows drawings from
the conditional prior distributions of f given θ (blue dashed-dotted line) together with
28
the true f∗ that generates the data (black line) and the prior mean (dashed red line).
Figure 6b (resp. 7b) shows drawings from the regularized conditional posterior distribution
µ(f |τ, rn , θ) of f given θ (blue dashed-dotted line) (resp. from the conditional posterior
distribution µ(f |rn , θ) of f given θ) together with the true f∗ that generates the data (black
line) and the posterior mean (dashed red line). In Figure 6 the regularization parameter
τ has been set equal to 0.1. Lastly, Figure 6c (resp. 7c) shows the posterior distribution
of θ (marginalized with respect to f ) approximated by using a kernel smoothing and 1000
drawings from the posterior together with the posterior mean of θ. All the pictures in
Figures 6 and 7 are obtained for σ0 = 5.
Table 1: Just identification and prior on θ: Simulation schemes
α or τ
a
5.3
CASE IV: rn = Fˆ
n = 1000
τ = 0.1
1.7
ˆ
CASE V: rn = Φ
n = 1000
α = 0.1
1.7
Overidentified case
Let us consider the case in which the one-dimensional parameter of interest θ is characterized by the moment conditions EF (h(θ, x)) = 0 with h(θ, x) = (θ − x, θ 2 −
x2 0
2 ).
For
instance, this arises when the true data generating process F is an exponential distribution
with parameter θ. We specify a prior distribution for (θ, f ). The prior µθ is chosen arbitrarily provided that the potential constraint on θ are satisfied. These are essentially constraint
on the support of θ. The moment conditions affect the conditional prior distribution of f
conditionally on θ. This is a Gaussian distribution with mean function f0θ whatever pdf
R
R
with the same support as F that satisfies xf0θ (x)dΠ(x) = θ and x2 f0θ (x)dΠ(x) = 2θ 2 .
The covariance operator Ω0θ of f must be such that

1

1

2 
Ω0θ
 x  = 0.
x2
(5.3)
In our simulation exercise we take S = R+ and dΠ(x) = e−x dx. We use the empirical cumuP
lative distribution function to construct rn : rn (t) = Fˆ (t) := n−1 n 1{xi ≤ t}. We generi=1
ate N = 1000 observations x1 , . . . , xN independently from an exponential distribution with
parameter θ∗ = 2. Therefore, the true f associated with this DGP is f∗ (x) = θ∗ e−(θ∗ −1)x
which obviously satisfies the moment restrictions. The marginal prior distribution µ(θ) for
θ is a chi-squared distribution with 1 degree of freedom and, for every value of θ drawn
29
Prior mean of f and trajectories drawn from the prior
Regularized posterior mean for f and trajectories drawn from the prior
0.9
0.8
Prior mean
True density
0.8
0.6
0.7
0.4
0.6
0.2
f(x)
f(x)
0.5
0.4
0
0.3
0.2
Conditional on θ = −0.056427
−0.2
0.1
Regularized posterior Mean
True density
Prior Mean
−0.4
0
−0.1
−1
Conditional on θ = −0.056427
−0.5
0
x
0.5
−0.6
−1
1
(a) Draw from the conditional prior distribution µ(f |θ = 0.2) (in green), Prior
mean and true f∗ , τ = 0.1 and a = 1.7.
−0.5
0
x
0.5
1
(b) Draw from the regularized conditional posterior distribution µ(f |rn , θ =
0.2, τ ) (in blue), posterior mean and true
f∗ , τ = 0.1 and a = 1.7.
Posterior density function of θ
0.9
0.8
0.7
θ
µ (θ|F
hat
)
0.6
0.5
0.4
0.3
0.2
posterior distribution
posterior mean of θ = −0.06
0.1
0
−1
−0.5
0
θ
0.5
1
(c) Posterior distribution and mean of θ
and a = 1.7. The bandwidth is h = 0.2.
Figure 6: CASE IV: rn = Fˆ and n = 1000. The true value of θ is 0 and an ad-hoc
regularization has been used.
from this µ(θ), the prior mean function f0θ is fixed equal to f0θ =
1 −(1−θ)x/θ
.
θe
We fix
the eigenfunctions of Ω0θ proportional to the Laguerre polynomials {Ln }n≥0 . The first few
Laguerre polynomials are {1, (1−x), 12 (x2 −4x+2), 61 (−x3 +9x2 −18x+6), . . .} and they are
orthogonal in L2 (R+ , e−x ). Remark that x = L0 − L1 and
we construct the operator Ω0θ as:
Ω0θ · = σ0
∞
X
n=0
x2
2
= L2 − 2L1 + L0 . Therefore
λn hLn , ·iLn
with λ0 = λ1 = λ2 = 0 to guarantee that (5.3) holds. The constant σ0 and the λn , n ≥ 3
P
can be arbitrarily set provided that n≥3 λn < ∞.
We present in Figure 8 results obtained by using the regularized posterior distribution
µ(f |τ, rn , θ) of f given θ and in Figure 9 results obtained by using the (non-regularized)
posterior distribution µ(f |rn , θ) of f given θ and a non-scaled prior µ(f |θ). In Figure 8 we
take σ0 = 1 and λn = n−1.1 for n ≥ 3 while Figure 9 we take σ0 = 10 and λn = n−1.7 for
30
Prior mean of f and trajectories drawn from the prior
Posterior mean for f and trajectories drawn from the prior
2.5
2.5
Prior mean
True density
2
1.5
f(x)
1.5
f(x)
Posterior Mean
True density
Prior Mean
2
1
1
0.5
0.5
0
0
Conditional on θ = 0.68754
−0.5
−1
−0.5
0
x
Conditional on θ = 0.68754
0.5
−0.5
−1
1
(a) Draw from the conditional prior distribution µ(f |θ = 0.2268) (in green),
Prior mean and true f∗ , α = 0.1 and
a = 1.7.
−0.5
0
x
0.5
1
(b) Draw from the conditional posterior distribution µ(f |rn , θ = 0.2268) (in
blue), posterior mean and true f∗ , α =
0.1 and a = 1.7.
Posterior density function of θ
1
0.9
0.8
0.6
0.5
θ
µ (θ|F
hat
)
0.7
0.4
0.3
0.2
posterior distribution
posterior mean of θ = −0.11
0.1
0
−1
−0.5
0
θ
0.5
1
(c) Posterior distribution and mean of θ
and a = 1.7. The bandwidth is h = 0.2
ˆ and n = 1000. The true value of θ is 0, scaled prior and no
Figure 7: CASE V: rn = Φ
regularization scheme.
n ≥ 3. The regularization parameter τ in Figure 8 is set equal to 0.1.
We represent in Figure 8a (resp. 9a) draws from the conditional prior distributions
of f given θ (blue dashed-dotted line) together with the true f∗ that has generated the
data (black line) and the prior mean (dashed red line). Figure 8b (resp. 9b) shows draws
from the regularized (resp. non regularized) conditional posterior distribution of f given
θ (blue dashed-dotted line) together with the true f∗ having generated the data (black
line) and the posterior mean (dashed red line). Lastly, Figure 8c (resp. 9c) shows the
posterior distribution of θ (marginalized with respect to f ) approximated by using a kernel
smoothing and 1000 drawings from the posterior distribution together with the posterior
mean of θ. The posterior distribution of θ is obtained by integrating out f from the sampling
distribution in the following way
(
θ
∼ χ21
rn |θ ∼ N (Kf0θ , Σn + KΩ0θ K ∗ ).
31
As the posterior distribution of θ cannot be computed in a closed-form we have used a
Metropolis-Hastings algorithm to simulate from it. To implement this algorithm we selected,
as auxiliary distribution, a χ2dθe distribution.
Prior mean of f
Regularized posterior mean for f
10
7
Prior mean
True density
9
Regularized Posterior Mean
Prior mean
True f
6
*
8
5
7
4
f(x)
f(x)
6
5
3
4
3
2
2
1
1
0
0
0
0.5
1
1.5
2
2.5
0
0.5
x
(a) Draw from the conditional prior distribution µθf , Prior mean and true f and
a = 1.1.
1
1.5
x
2
2.5
3
(b) Draw from the regularized condin ,θ
tional posterior distribution µrf,τ
, posterior mean and true f , τ = 0.1 and
a = 1.1.
Posterior density function of θ
posterior distribution
posterior mean of θ
1
µθ(θ|Fhat)
0.8
0.6
0.4
0.2
0
0
2
4
6
8
10
θ
12
14
16
18
20
22
(c) Posterior distribution and mean of θ,
τ = 0.1 and a = 1.1. The bandwidth is
h = 0.1.
rn ,θ
Figure 8: Overidentified case and regularized posterior distribution µf,τ
. rn = Fˆ . The
true value of θ is 2 and the posterior mean is E(θ|rn ) = 2.3899.
Appendix
A
Conditional posterior distribution of f , given θ
The conditional distribution of f given (rn , θ), that is, the posterior distribution of f , is
a Gaussian process, see Theorem 1 in Florens and Simoni [2014]. The conditional posterior
mean and variance of f |rn , θ, in general, rise problems due to the infinite dimension of f .
While this point has been broadly discussed in [Florens and Simoni, 2012, 2014] and ref-
erences therein, in this section we analyze it in the particular case considered in the paper
where the operators take a specific form.
32
Regularized posterior mean for f
Prior mean of f
7
Regularized Posterior Mean
Prior mean
True f
Prior mean
True density
*
6
10
5
8
f(x)
f(x)
4
6
3
4
2
2
0
1
0
0
0.5
1
1.5
2
0
2.5
0.5
1
x
(a) Draw from the conditional prior distribution µ(f |θ), Prior mean, true f and
a = 1.1.
1.5
x
2
2.5
3
(b) Draw from the conditional posterior
distribution µ(f |rn , θ), posterior mean,
true f and a = 1.1.
Posterior density function of θ
1.4
posterior distribution
posterior mean of θ
1.2
0.8
µθ(θ|F
hat
)
1
0.6
0.4
0.2
0
0
5
10
θ
15
20
25
(c) Posterior distribution and mean of θ.
The bandwidth is h = 0.1 and a = 1.1.
Figure 9: Overidentified case and posterior distribution µ(f |rn , θ) with non-scaled prior.
rn = Fˆ . The true value of θ is 2 and the posterior mean is E(θ|ˆ
r) = 3.5561.
Intuitively, the problem encountered in the computation of the moments of the Gaussian posterior distribution of f given θ is the following. The moments of a conditional
Gaussian distribution involve the inversion of the covariance operator of the conditioning
variable, that is (Σn + KΩ0θ K ∗ ) in our case. The problem arises because the inverse operator (Σn +KΩ0θ K ∗ )−1 is in general defined only on a subset of F of measure zero. Therefore,
in general there is no closed-form available for the posterior mean and variance of f |θ which
implies that they cannot be computed.
However, for the framework under consideration we determine mild conditions that
allow to solve this problem and to have an inverse (Σn + KΩ0θ K ∗ )−1 continuously defined
on F. Therefore, there exists a closed-form for the posterior mean and variance of f |θ and
no problems of continuity are risen. We illustrate these conditions in the lemmas below
where we use the notation µ(f |rn , θ) for the conditional posterior distribution of f given θ
and B for the Borel σ-field generated by the open sets of Θ.
33
Lemma A.1. Consider the Gaussian distribution (2.7) on BE × BF and assume that
−1/2
f∗
∈ R(K ∗ ). Then, the conditional distribution on BE conditional on BF × B, denoted
by µ(f |rn , θ), exists, is regular and a.s. unique. It is Gaussian with mean
E[f |rn ] = f0θ + A(rn − Kf0θ )
(A.1)
and trace class covariance operator
V ar[f |rn ] = Ω0θ − AKΩ0θ : E → E
(A.2)
where
A :=
−1/2
Ω0θ Mf
1 1/2
1
−1/2
−1/2
1/2
Ω0θ Mf
I − Mf hMf , ·i + Mf
n
n
−1
−1/2 ∗
((K ∗ )−1 Mf
)
is a continuous and linear operator from F to E.
Proof. The first part of the theorem follows from theorem 1 (ii ) in Florens and Simoni
[2014]. From this result, since Σn =
know that E[f |rn ] = f0θ +
1
n Σ,
Ω0θ K ∗ ( n1 Σ
where Σ : F → F is defined in Lemma 2.2, we
+ KΩ0θ K ∗ )−1 (rn − Kf0θ ) and V ar[f |rn ] = Ω0θ −
Ω0θ K ∗ ( n1 Σ+KΩ0θ K ∗ )−1 KΩ0θ . Hence, we have to show that Ω0θ K ∗ ( n1 Σ+KΩ0θ K ∗ )−1 = A
−1
1
1
− 21
− 12
1
1
2
2
˜
and that A is continuous and linear. Denote M = n I − n Mf hMf , ·i + Mf Ω0θ Mf
and
˘ =
M
1
1
KMf K ∗ − (KMf 1)hMf , K ∗ ·i + KΩ0θ K ∗
n
n
−1
.
˘ and
By using the result of Lemma 2.2, we can write: Ω0θ K ∗ ( n1 Σ + KΩ0θ K ∗ )−1 = Ω0θ K ∗ M
then
1
− 21
− 21 ∗
− 21 ∗
− 12
∗ ˘
∗ −1
∗ −1
∗ −1
˜
˜
Ω0θ K ( Σ + KΩ0θ K )
= Ω0θ Mf M ((K ) Mf ) + Ω0θ K M − Mf M ((K ) Mf )
n
∗
1
1
−
˜ ((K ∗ )−1 M − 2 )∗
= Ω0θ Mf 2 M
f
where the second equality follows because
34
− 21 ∗
− 21 ∗
− 21
− 21
−1
∗ −1
∗ −1
∗
˘
˘
˜
˜
˘
M
K M − Mf M ((K ) Mf ) = K − Mf M ((K ) Mf ) M
1 21
1 12
− 21
− 21
− 21 ∗
∗
∗ −1
−1
˜
˘
˘
= Mf M
M − Mf hMf , ·i + Mf Ω0θ K − ((K ) Mf ) M
M
n f
n
∗
= 0.
This establishes that Ω0θ K ∗ ( n1 Σ+KΩ0θ K ∗ )−1 is equal to A. We now show that the operator
− 21
A is continuous and linear on F. First, remark that the assumption f∗
that
−1
(K ∗ )−1 Mf 2
∈ R(K ∗ ) ensures
exists and is bounded. Since Ω0θ is the covariance operator of a Gaussian
1
2
is Hilbert-Schmidt,
measure on a Hilbert space then, it is trace class. This means that Ω0θ
which is a compact operator. Therefore, since the product of two bounded and compact
− 12
operators is compact, it follows that Ω0θ , Ω0θ Mf
It is also easy to show that the operator
−1
− 21
and Mf 2 Ω0θ Mf
1
2
1
n Mf
are compact.
1
2
hMf , ·i : E → E is compact since its
Hilbert-Schmidt norm is equal to 1. In particular this operator has rank equal to 1/n since it
has only one eigenvalue different from 0 and which is equal to 1. This eigenvalue corresponds
1
1
1
−1
−1
to the eigenfunction f∗2 . Therefore, the operator ( n1 Mf2 hMf2 , ·i−Mf 2 Ω0θ Mf 2 ) is compact.
By the Cauchy-Schwartz inequality we have
∀φ ∈ E,
˜ −1 φ, φi = 1 ||φ||2 −
hM
n
1
≥ ||φ||2 −
n
1
1
1
1 12
−1
−1
2
2
hf∗ , φi2 + hΩ0θ
f∗ 2 φ, Ω0θ
f∗ 2 φi
n
1
1 12 2
−1
2
f∗ 2 φ||2
||f∗ || ||φ||2 + ||Ω0θ
n
−1
2
≥ ||Ω0θ
f∗ 2 φ||2 ≥ 0
1
˜ is injective. Then, from the Riesz Theosince ||f∗2 ||2 = 1. Therefore, we conclude that M
˜ : E → E is bounded.
rem 3.4 in Kress [1999] it follows that the operator M
Finally, the operator A is bounded and linear since it is the product of bounded linear
operators. We conclude that A is a continuous operator from F to E.
Remark A.1. If f∗−1 ∈ R(K ∗ ) then the operator A : F → E of the theorem may be
written in an equivalent way as: ∀ϕ ∈ F
Aϕ = Ω0θ
1
1
I + hf∗ , ·i + f∗−1 Ω0θ
n
n
−1
((K ∗ )−1 f∗−1 )∗ .
(A.3)
Remark A.2. If f∗ is assumed to be bounded away from 0 and ∞ on its support, then
35
−1/2
∈ R(K ∗ ), cannot be satisfied if
the condition f∗−1 ∈ R(K ∗ ), as well as the condition f∗
R
k(t, x) is such that ∀ψ ∈ F, K ∗ ψ = T k(t, x)ψ(t)ρ(dt) vanishes at some x in the support
of f∗ . This excludes the kernel k(t, x) = 1{x ≤ t} when T is equal to a compact set, say
T = [a, b]. This remark suggests that some care must be taken by the researcher when
he/she chooses the operator K according to its prior information about f∗ .
The next lemma provides a condition alternative to the one given in lemma A.1 which
also guarantees continuity of the inverse of (Σn + KΩ0θ K ∗ ). Let H(Ω0θ ) denote the repro-
ducing kernel Hilbert space associated with Ω0θ and embedded in E.
Lemma A.2. Consider the Gaussian distribution (2.7) on BE × BF and assume that
1
2
) ⊆ R(Σ). Then, the result of
K|H(Ω0θ ) is injective and that Ω0θ is such that R(KΩ0θ
lemma A.1 holds with A equal to
A :=
1/2
Ω0θ
1
1/2
1/2
I + Ω0θ K ∗ Σ−1 KΩ0θ
n
−1
1/2
(Σ−1 KΩ0θ )∗ .
Proof of Lemma A.2
1
1
1
2
2
Since KΩ0θ
= K|H(Ω0θ ) Ω0θ
and K|H(Ω0θ ) is injective by assumption then Σ− 2 K|H(Ω0θ )
is well defined by lemma 2.3. By applying theorem 1 (iii) in Florens and Simoni [2014] we
conclude.
The trajectories of f generated by the conditional posterior distribution µfrn ,θ verify
a.s. the moment conditions and integrate to 1. To see this, first remark that the posterior
covariance operator satisfies the moment restrictions:
1/2
[Ω0θ − AKΩ0θ ]1/2 (1, h0 (θ, ·))0 = [I − AK]1/2 Ω0θ (1, h0 (θ, ·))0 = 0
where we have factorized Ω0θ on the right and used assumption (2.2). Second, a trajectory
f drawn from the posterior µ(f |θ, rn ) is such that (f − f0θ ) ∈ R (Ω0θ − AKΩ0θ )1/2 ,
µ(f |θ, rn )-a.s. Now, for any ϕ ∈ R (Ω0θ − AKΩ0θ )1/2 we have hϕ, h(θ, ·)i = h[Ω0θ −
1
1/2 ψ, Ω 2 h(θ, ·)i = 0, for some ψ ∈ E where A
˜
˜ =
AKΩ0θ ]1/2 ψ, h(θ, ·)i = h[I − AKΩ
0θ ]
0θ
K ∗ (Σ
n
+ KΩ0θ
K ∗ )−1 ,
and hϕ, 1i = 0 by a similar argument. This shows that
Z
Z
1/2
⊂ ϕ∈E ;
ϕ(x)h(θ, x)Π(dx) = 0 and
ϕ(x)Π(dx) = 0
R (Ω0θ − AKΩ0θ )
and since the set on the right of this inclusion is closed we have
36
R (Ω0θ − AKΩ0θ
)1/2
Z
Z
⊂ ϕ∈E ;
ϕ(x)h(θ, x)Π(dx) = 0 and
ϕ(x)Π(dx) = 0 .
R
Therefore, µ(f |θ, rn )-a.s. a trajectory f drawn from µ(f |θ, rn ) is such (f −f0θ )(x)Π(dx) =
R
R
R
0 and (f −f0θ )(x)h(θ, x)Π(dx) = 0 which implies: f (x)Π(dx) = 1 and f (x)h(θ, x)Π(dx) =
0.
Remark A.3. The posterior distribution of f conditional on θ gives the revision of the
prior on f except in the direction of the constant and of the moment conditions that remain
unchanged. For estimating θ we have many possibilities: we could for instance estimate θ
by maximum likelihood by using the density given by E(f |rn , θ) as the probability density
of the data. We do not develop this approach but we use a completely Bayesian approach
by trying to recover a conditional distribution of θ conditional on rn .
Remark A.4. [Regularized Posterior Distribution] When neither the conditions of lemma
A.1 nor the conditions of lemma A.2 are satisfied then we cannot use the exact posterior
distribution µ(f |θ, rn ). Instead, we use the regularized posterior distribution denoted by
µ(f |τ, θ, rn ), where τ > 0 is a regularization parameter that must be suitably chosen and
that converges to 0 with n. This distribution has been proposed by Florens and Simoni
[2012] and we refer to this paper for a complete description of it. Here, we only give its
expression: µ(f |τ, θ, rn ) is a Gaussian distribution with mean function
E[f |rn , τ ] = f0θ + Aτ (rn − Kf0θ )
(A.4)
V ar[f |rn , τ ] = Ω0θ − Aτ KΩ0θ : E → E
(A.5)
and covariance operator
where
Aτ := Ω0θ K
∗
1
τ I + I + KΩ0θ K ∗
n
and I : F → F denotes the identity operator.
37
−1
:E →E
(A.6)
B
Proofs for Section 2
Proof of Lemma 2.1
Let H(Ω0θ ) denote the reproducing kernel Hilbert space associated with Ω0θ and em-
bedded in E and H(Ω0θ ) denote its closure. If f |θ ∼ N (f0θ , Ω0θ ) then (f − f0θ ) ∈ H(Ω0θ ),
−1/2
µ(F |θ)-a.s. Moreover, H(Ω0θ ) = D(Ω0θ
1/2
) = R(Ω0θ ) where D and R denote the do-
main and the range of an operator, respectively. This means that ∀ϕ ∈ H(Ω0θ ) there
1
2
exists ψ ∈ E such that ϕ = Ω0θ
ψ. Moreover, for any ϕ ∈ H(Ω0θ ) we have hϕ, h(θ, ·)i =
1
1
R
2
2
ψ, h(θ, ·)i = hψ, Ω0θ
h(θ, ·)i = 0 and hϕ, 1i = 0 by a similar
ϕ(x)h(θ, x)Π(dx) = hΩ0θ
argument. Hence,
H(Ω0θ ) ⊂
Z
Z
ϕ∈E;
ϕ(x)h(θ, x)Π(dx) = 0 and
ϕ(x)Π(dx) = 0 .
(B.1)
Since the set on the right of this inclusion is closed we have
H(Ω0θ ) ⊂
Z
Z
ϕ∈E;
ϕ(x)h(θ, x)Π(dx) = 0 and
ϕ(x)Π(dx) = 0 .
We deduce that µ(F |θ)-a.s.
Z
(f − f0θ )(x)Π(dx) = 0
Z
and
(f − f0θ )(x)h(θ, x)Π(dx) = 0.
Condition (2.1) and the fact that f0θ is a pdf imply the results of the lemma.
Proof of Lemma 2.2
The result follows trivially from the definition of the covariance operator Σn : F → F:
∀ψ ∈ F,
Z Z
Z
Z Z
1
k(t, x)f∗ (x)Π(dx)
(k(t, x)k(s, x)) f∗ (x)Π(dx)ψ(t)ρ(dt) −
k(s, x)f∗ (x)Π(dx) ψ(t)ρ(dt)
Σn ψ =
n T S
T S
Z
Z Z S
Z
Z
1
=
k(s, x)f∗ (x)
k(t, x)ψ(t)ρ(dt)Π(dx) −
k(s, x)f∗ (x)Π(dx)
k(t, x)ψ(t)ρ(dt)f∗ (x)Π(dx)
n S
T
S
S T
1
= [KMf K ∗ ψ − (KMf 1)hMf , K ∗ ψi]
n
where the second equality has been obtained by using the Fubini’s theorem.
38
Proof of Lemma 2.3
We can rewrite Σ as
∀ψ ∈ F,
Σψ =
=
Z
ZT
T
E (v(x, t)v(x, s)) ψ(t)ρ(dt)
Z
(v(x, t)v(x, s)) f∗ (x)Π(dx)ψ(t)ρ(dt)
S
where v(x, t) = [k(x, t) − E(k(x, t))]. Then, ∀ψ ∈ F we can write Σψ = RMf R∗ ψ where
R : E → F, Mf : E → E and R∗ : F → E are the operators defined as
R∗ ψ =
∀ψ ∈ F,
Z
v(x, t)ψ(t)ρ(dt)
T
∀ϕ ∈ E,
Mf ϕ = f∗ (x)ϕ(x)
Z
v(x, t)ϕ(x)Π(dx).
∀ϕ ∈ E, Rϕ =
S
1
1
1/2
1
Moreover, we have D(Σ− 2 ) = R(Σ 2 ) = R((RMf R∗ ) 2 ) = R(RMf ).
R
Let h ∈ R(K), that is, there exists a g ∈ E such that h(t) = S k(t, x)g(x)Π(dx). Then
1
R
1
h ∈ D(Σ− 2 ) if there exists an element ν ∈ E such that h(t) = S v(x, t)f∗2 (x)ν(x)Π(dx).
By developing this equality, the element ν has to satisfy
Z
Z
1
v(x, t)f∗2 (x)ν(x)Π(dx)
1
Z
ZS ZS
k(x, t)f∗ (x)Π(dx) f∗2 (x)ν(x)Π(dx)
k(t, x)g(x)Π(dx) =
k(x, t) −
⇔
1 S
Z 1
ZS
ZS
2
2
k(x, t) f∗ (x)ν(x) − f∗ (x)
k(t, x)g(x)Π(dx) =
⇔
f∗ (x)ν(x)Π(dx) Π(dx).
k(t, x)g(x)Π(dx) =
S
S
S
If K is injective it follows that such an element ν must satisfy
1
2
g(x) = f∗ ν(x) − f∗ (x)
which in turn implies that
solution is ν(x)
−1
= f∗ 2 g(x)
− 21
R
S
Z
S
f∗ (x)ν(x)Π(dx)
1
2
g(x)Π(dx) = 0, i.e. that h ∈ R(K|D ). Therefore, one
which proves that the range of the truncated operator K|D in
1
contained in D(Σ ). On the other side, let h ∈ D(Σ− 2 ), then there exists a ν ∈ E such
1
R
that h = S v(x, t)f∗2 (x)ν(x)Π(dx). By the previous argument and under the assumption
that K|D is injective, this
that h ∈ R(K|D ) since there exists g ∈ D such that
implies
1
R 21
1
2
g(x) = f∗ ν(x) − f∗ (x) S f∗ (x)ν(x)Π(dx) . This shows the inclusion of D(Σ− 2 ) in
R(K|D ) and concludes the proof.
39
Proof of Theorem 2.1
1/2
In this proof we denote B = Σ−1/2 KΩ0θ . To prove that Pnθ and Pn0 are equivalent we
first rewrite the covariance operator of Pnθ as
h
i 1√
√
1
1
1
n−1 Σ 2 I + nΣ− 2 KΩ0θ K ∗ Σ− 2 Σ 2 n−1 .
=
n−1 Σ + KΩ0θ K ∗
˜
Then according to htheorem 3.3 p.125 in Kuo
i [1975] we have to verify that K(f − f0θ ) ∈
1
1
R(Σ1/2 ) and that I + nΣ− 2 KΩ0θ K ∗ Σ− 2 is positive definite, bounded, invertible with
1
1
nΣ− 2 KΩ0θ K ∗ Σ− 2 Hilbert Schmidt.
• Since (f˜ − f0θ ) ∈ D and since K|D is injective then, by lemma 2.3, K(f˜ − f0θ ) ∈
R(Σ1/2 ).
• Positive definiteness. It is trivial to show that the operator (I +nBB ∗ ) is self-adjoint,
i.e. (I + nBB ∗ )∗ = (I + nBB ∗ ). Moreover, ∀ϕ ∈ F, ϕ 6= 0
h(I + nBB ∗ )ϕ, ϕi = hϕ, ϕi + nhB ∗ ϕ, B ∗ ϕi = ||ϕ||2 + n||B ∗ ϕ|| > 0.
• Boundedness. By lemma 2.3, if K|D is injective, the operators B and B ∗ are bounded ;
the operator I is bounded by definition and a linear combination of bounded operators
is bounded, see Remark 2.7 in Kress [1999].
• Continuously invertible. The operator (I + nBB ∗ ) is continuously invertible if its inverse is bounded, i.e. there exists a positive number C such that ||(I +nBB ∗ )−1 ϕ|| ≤
C||ϕ||, ∀ϕ ∈ F. We have ||(I + nBB ∗ )−1 ϕ|| ≤ (supj
n−1
)||ϕ||
n−1 +lj2
= ||ϕ||, ∀ϕ ∈ F.
p
• Hilbert-Schmidt. We consider the Hilbert-Schmidt norm ||nBB ∗ ||HS = n tr((BB ∗ )2 ).
˜ ∗ BΩ
˜ 0θ B
˜ ∗ B)
˜ ≤ tr(Ω0θ )||B
˜ ∗ BΩ
˜ 0θ B
˜ ∗ B||
˜ < ∞ since B
˜ :=
Now, tr((BB ∗ )2 ) = tr(Ω0 B
1
Σ− 2 K|H(Ω0θ ) has a bounded norm by lemma 2.3. Hence, ||nBB ∗ ||HS < ∞ for n
finite.
This shows that Pnθ and Pn0 are equivalent.
Next we derive (2.9). Let zj = hrn − K f˜, Σ−1/2 ψj (θ)i. This variable is defined under
the further assumption that ψj (θ) ∈ R(Σ1/2 ), ∀j ≥ 0. By theorem 2.1 in Kuo [1975, page
116]:
∞
Y
dνj
dPnθ
=
dPn0
dµj
j=1
40
√
where νj denotes the distribution of nzj under Pnθ and µj denotes the distribution of
√
nzj under Pn0 . By writing down the likelihoods of νj and µj with respect to the Lebesgue
measure we obtain
dPnθ
(rn )
dPn0
=
∞
Y
j=1
1+
2 n
ljθ
−1/2
−1
2 n
exp{− 21 (zj − hK(f0θ − f˜), Σ−1/2 ψj (θ)i)2 n 1 + ljθ
}
exp{− 12 zj2 n}
which, after simplifications, gives the result.
C
Proofs for Section 3
Proof of Theorem 3.1
First remark that assumption A3 and compactness of Θ imply that
γ 0 h(θ,x)
0. Let γ(θ) := arg min E∗ e
k(xi
T
, t)(1−eγn (θ) h(xi ,θ) ),
and G(k)(t) :=
√1
n
Pn
γ(θ)0 h(θ,x)
and f0θ (x) = e
g(xi , t, θ) := k(xi
i=1 (k(xi , t)
1
n
supθ∈Θ µ(θ) →
f∗ (x). Denote gn (xi , t, θ) :=
P
G(g)(t) := √1n ni=1 (g(xi , t, θ) − K(f∗ − f0θ ))
T
, t)(1−eγ(θ) h(xi ,θ) ),
− Kf∗ )). We first have to show | n1 ln (θ) − l(θ)| → 0 uni-
formly in θ. For that, we make the following decomposition:
∞
∞
n
X
1
1 X 1 X
1
2
ln (θ) − l(θ) = − 1
√
+1)−
log(nl
h
g(xi , t, θ), Σ−1/2 ψj (θ)i2
jθ
n
2
2n
2n
n
1 + nljθ
j=1
j=1
i=1
∞
n
1
1 X 1 X
h√
(gn (xi , t, θ) − g(xi , t, θ)) , Σ−1/2 ψj (θ)i2
−
2
2n
n
1 + nljθ
j=1
i=1
+
∞
n
1 X 1 X
h√
k(xi , t) − K f˜ , Σ−1/2 ψj (θ)i2
2n
n
j=1
i=1
∞
n
n
1X 1 X
1 X
1
−1/2
h√
−
(gn (xi , t, θ) − g(xi , t, θ)) , Σ
ψj (θ)ih √
g(xi , t, θ), Σ−1/2 ψj (θ)i
2
n
n
n
1 + nljθ
j=1
i=1
+
1
2
∞
X
j=1
hK(f0θ − f∗ ), Σ−1/2 ψj (θ)i2 −
41
i=1
∞
X
1
2
j=1
hK(f˜ − f∗ ), Σ−1/2 ψj (θ)i2 Because
1
n
Pn
i=1 (gn (xi , t, θ) −
g(xi , t, θ)) →p 0 uniformly in θ then, we can write
∞
∞
1 X
1 X
1
1
2
log(nljθ + 1) −
hG(g)(t), Σ−1/2 ψj (θ)i2
| ln (θ) − l(θ)| = op (1) + −
2
n
2n
2n
1 + nljθ
j=1
j=1
∞
n
1 X
hK(f∗ − f0θ ), Σ−1/2 ψj (θ)i2
−
2
2n
1 + nljθ
j=1
1
−√
n
∞
X
j=1
hG(g)(t), Σ−1/2 ψj (θ)ihK(f∗ −f0θ ), Σ−1/2 ψj (θ)i
∞
∞
1
1 X
hG(k), Σ−1/2 ψj (θ)i2
+
2
1 + nljθ 2n j=1
1 X
+√
hG(k), Σ−1/2 ψj (θ)ihK(f˜ − f∗ ), Σ−1/2 ψj (θ)i
n
j=1
+
∞
1X
hK(f0θ − f∗ ), Σ−1/2 ψj (θ)i2 2
j=1
∞
∞
1 X
1 X
1
2
=−
log(nljθ + 1) −
hG(g)(t), Σ−1/2 ψj (θ)i2
2
2n
2n
1 + nljθ
j=1
j=1
2
nljθ
1
+ kΣ−1/2 K(f∗ − f0θ )k2
2
2
1 + nljθ
∞
∞
1
1 X
1 X
hG(g)(t), Σ−1/2 ψj (θ)ihK(f∗ −f0θ ), Σ−1/2 ψj (θ)i
hG(k), Σ−1/2 ψj (θ)i2
−√
+
2
n
2n
1 + nljθ
j=1
j=1
1
+√
n
∞
X
j=1
hG(k), Σ
−1/2
−1/2
˜
ψj (θ)ihK(f − f∗ ), Σ
ψj (θ)i
where op (1) is uniform in θ and ψj (θ) is uniformly continuous in θ due to compactness
of Θ. Morevoer, remark that: supt∈T,θ∈Θ |n−1/2 G(g)(t)| → 0, P ∗ -a.s., by Lemma D.1;
2 → 0 by
supt∈T |n−1/2 G(k)(t)| → 0, P ∗ -a.s., because k(x, t) is chosen to be Donsker; nljθ
construction of the prior. Therefore, supθ∈Θ | n1 ln (θ) − l(θ)| → 0, P∗ -a.s.
Finally, by continuity of l(θ) in θ and because the maximizer of l(θ) is the true value θ∗
which is a well-separated point of maximum, we conclude that θn → θ∗ in P∗ -probability.
Proof of Theorem 3.2
We first state and prove a preliminary lemma. The proof of this lemma and of the
theorem are slightly modifications of results in Kim [2002] and Chen [1985].
Lemma C.1. Under B1 – B4,
1
lim µ(θn |rn )|∆n | 2 ≤ (2π)−k/2 .
n→∞
1
(C.1)
Moreover, limn→∞ µ(θn |rn )|∆n | 2 = (2π)−k/2 in P ∗ -probability if and only if for some δn >
42
0, µ(θ ∈ B(θn , δn )|rn ) → 1 in P ∗ -probability.
Proof. Denote pn the denominator of µ(θ|rn ). For any > 0, let N > 0 and δn & 0 such
that Assumption B3 is satisfied for every n > N . Under B1 and B2, for every θ ∈ B(θn , δn )
a second order Taylor expansion of Ln (θ) around θn gives
pnθ (rn ; θ)µ(θ) = pnθ (rn ; θn )µθ (θn ) exp(Ln (θ) − Ln (θn ))
1
t 00 ˜
(θ − θn ) Ln (θ)(θ − θn )
= pnθn (rn ; θn )µ(θn ) exp
2
1
˜ −1 (θ − θn ) (C.2)
= pnθn (rn ; θn )µ(θn ) exp − (θ − θn )t [I + Ψn (θ)]∆
n
2
˜ =
with Ψn (θ)
∂ 2 Ln (θ) ∂θ∂θ t θ=θ˜
∂ 2 Ln (θ) ∂θ∂θ t θ=θ
n
−1
− I, I a k × k identity matrix and θ˜ lies on the
segment joining θ and θn . Therefore, under B3, the probability µ(B(θn , δn )|rn ) is bounded
above by
µ(B(θn , δn )|rn ) =
=
p−1
n pnθ (rn ; θn )µ(θn )
p−1
n pnθ (rn ; θn )µ(θn )
−1/2
≤ |I − G()|
Z
Z
exp
B(θn ,δn )
1
t 00 ˜
(θ − θn ) Ln (θ)(θ − θn ) dθ
2
˜ −1/2 ∆1/2
{z;||[I+Ψn(θ)]
n z||<δn }
|∆n |1/2 p−1
n pnθ (rn ; θn )µ(θn )
≤ |I − G()|−1/2 |∆n |1/2 p−1
n pnθ (rn ; θn )µ(θn )
2
˜
1
Z
e−z
t z/2
e−z
˜ −1/2 |∆n |1/2
dz|I + Ψn (θ)|
t z/2
dz
B(0,un )
1/2
Ln (θ) 1/2
where z = [ ∂∂θ∂θ
(θ − θn ), un = δn (1 + g max ()) 2 /dmin , g max () is the largest eigenvalue
t ]
of G() and dmin is the smallest eigenvalue of ∆n . The inequality in the third line is due to
Assumption B3 which also implies4
˜ −1/2 ∆1/2 z|| < δn } ⊆ {z; ||z|| < un }.
{z; ||[I + Ψn (θ)]
n
The inequality in the last line is due to the fact that
4
R
−z t z/2 dz
B(0,un ) e
< (2π)k/2 . In a
More precisely,
˜ −1/2 ∆1/2 z|| < δn } ⊆ {z; ||[I + G()]−1/2 ∆1/2 z|| < δn }
{z; ||[I + Ψn (θ)]
n
n
˜ −1/2 ≥ [I + G()]−1/2 which is guaranteed by Assumption B3. Moreover,
if and only if [I + Ψn (θ)]
1/2
1/2
−1/2 1/2
2
||[I + G()]
∆n z|| = tr(z t ∆n [I + G()]−1 ∆n z) ≥ λmin ([I + G()]−1 )dmin tr(zz t ), where for
1/2
a matrix A, λmin (A) denotes the minimum eigenvalue of A. Hence, {z; ||[I + G()]−1/2 ∆n z|| <
1/2
δn } ⊆ {z; (1 + g max ())−1/2 dmin ||z|| < δn }.
43
similar way, under B3
˜ −1/2 ∆1/2 z|| < δn } ⊇ {z; ||z|| < ln }.
{z; ||[I + Ψn (θ)]
n
1
1/2
where ln = δn (1 − g max ()) 2 /dmax , and so we can bound µ(B(θn , η)|rn ) from below by
−1/2
µ(B(θn , δn )|rn ) ≥ |I + G()|
|∆n |1/2 p−1
n pnθ (rn ; θn )µ(θn )
0
e−z z/2 dz.
B(0,ln )
R
−z t z/2 dz converges to
B(0,ln ) e
µ(θn |rn ) = p−1
n pnθ (rn ; θn )µ(θn ))
By Assumption B4, ln → ∞ as n → ∞ and then
n → ∞. Summarizing, we obtain (by writing
Z
(2π)k/2 as
|I − G()|1/2 lim µ(B(θn , δn |rn )|rn ) ≤ (2π)k/2 lim |∆n |1/2 µ(θn |rn )
n→∞
n→∞
≤ |I + G()|1/2 lim µ(B(θn , δn |rn )|rn )
n→∞
and (C.1) is implied by the facts that under B3, |I ±G()| → 1 as → 0, µ(B(θn , δn )|rn ) ≤ 1
for every n. The equality holds if and only if limn→∞ µ(B(θn , δn )|rn ) = 1 in P ∗ -probability.
Now, we proceed with the proof of Theorem 3.2. Denote pn the denominator of µ(θ|rn ).
For any θ1 , θ2 ∈ Θ we write θ2 ≥ θ1 (or θ2 ≤ θ1 ) if every component of θ2 −θ1 is nonnegative.
Without loss of generality, consider θ1 ≤ 0 and θ2 ≥ 0.
−1/2
Let sn = ∆n
(θ − θn ). By using (C.2), the probability µ(sn ∈ [θ1 , θ2 ]|rn ) writes
µ(sn ∈ [θ1 , θ2 ]|rn ) = µ(θn |rn )
Z
=µ(θn |rn )
Z
1/2
∆n θ2 +θn
1/2
∆n θ1 +θn
1
t
−1
˜
exp − (θ − θn ) [I + Ψn (θ)]∆n (θ − θn ) dθ
2
˜ 1/2 θ1 ≤z≤[I+Ψn (θ)]
˜ 1/2 θ2 }
{z;[I+Ψn (θ)]
e−z
t z/2
˜ −1/2 |∆n |1/2 .
dz|I + Ψn (θ)|
Because
o
n
˜ 1/2 θ1 ≤ z ≤ [I + Ψn (θ)]
˜ 1/2 θ2
H − () := {z; [I + G()]1/2 θ1 ≤ z ≤ θ2 [I − G()]1/2 } ⊆ z; [I + Ψn (θ)]
⊆ H + () := {z; [I − G()]1/2 θ1 ≤ z ≤ θ2 [I + G()]1/2 }
we can lower and upper bound the probability µ(sn ∈ [θ1 , θ2 ]|rn ) as
44
|I + G()|−1/2 |∆n |1/2 µ(θn |rn )
Z
e−z
t z/2
H − ()
dz ≤ µ(sn ∈ [θ1 , θ2 ]|rn )
−1/2
≤ |I − G()|
1/2
|∆n |
µ(θn |rn )
Z
e−z
t z/2
H + ()
Remark that lim→0 H − () = lim→0 H + () = {z; θ1 ≤ z ≤ θ2 }. Therefore, by letting → 0
and taking the limit for n → ∞ we obtain:
lim |∆n |1/2 µ(θn |rn )(2π)k/2 Φ([θ1 , θ2 ]) ≤ lim µ(sn ∈ [θ1 , θ2 ]|rn )
n→∞
n→∞
≤ lim |∆n |1/2 µ(θn |rn )(2π)k/2 Φ([θ1 , θ2 ])
n→∞
where Φ([θ1 , θ2 ]) denotes the probability of [θ1 , θ2 ] under a standard normal distribution.
Under assumption A2, the second result of theorem 3.1 holds and, by lemma C.1, implies
that limn→∞ |∆n |1/2 µ(θn |rn )(2π)k/2 = 1. This concludes the proof.
Proof of Theorem 3.3
√
Let θ rn = hE[f |rn ], gi and Ωn = hV ar[f |rn ]g, gi and remark that µ(s|rn ) is a N ( n(θ rn −
ˆ We can upper bound the TV-distance by
ˆ nΩn ) distribution. Denote sbn := √n(θ rn − θ).
θ),
ˆ V )
µ(s|rn ) − N (θ,
TV
≤
kµ(s|rn ) − N (0, nΩn )kT V + kN (0, nΩn ) − N (0, V )kT V
=: A + B.
We start by considering A. By doing a change of variable it is easy to show that
A = N (0, Ik ) − N (−(nΩn )−1/2 sbn , Ik )
TV
.
An application of Lemma D.2 and then of Lemmas C.2 and C.3 allows to conclude that
A ≤
1
√ k(nΩn )−1/2 sbn k = Op (n−1/2 ).
2π
To bound B we introduce the Kullback-Leibler distance between two probability meaR
sures P and Q, denoted by K(P, Q), and satisfying K(P, Q) = log pq dP where p and q are
the densities of P and Q, respectively, with respect to the Lebesgue measure. We get
45
dz.
B≤
p
K (N (0, V ), N (0, nΩn ))
v
!
uZ
1 t −1
u
1 t −1
1
|V |−1/2 e{− 2 s V s}
t
|V |−1/2 e{− 2 s V s} ds
=
log
1 t
−1
k/2
|nΩn |−1/2 e{− 2 s (nΩn ) s} (2π)
s Z 1 t −1
1
|nΩn |
1
t
−1
−1
=
|V |−1/2 e{− 2 s V s} ds
− s [V − (nΩn ) ]s
log
k/2
2
|V |
(2π)
s
1
|nΩn | 1
=
log
− trV [V −1 − (nΩn )−1 ]
2
|V |
2
that converges to zero by the result of Lemma C.2.
Lemma C.2. Let Ωn and V be defined as in Theorem 3.3. Then, under the assumption of
Theorem 3.3: (i) nΩn = O(1) and (ii)
nΩn → V,
as n → ∞.
Proof. Under the conditions of the theorem the posterior variance writes as in (A.2) with
the operator A defined in Lemma A.1. By using this expression:
1 1/2 1/2
1
−1/2 −1 −1/2
−1/2
Ω0 g, gt i
) f∗
Ω0 f ∗
( I − f∗ hf∗ , ·i + f∗
n
n
1
−1/2 1 1/2
−1/2 −1
1/2
( f∗ − f∗ hf∗ , ·i + Ω0 f∗
= hΩ0 − Ω0 f∗
) Ω0 g, gt i
n
n
1
−1/2 −1
1/2
−1/2 1 1/2
)
Ω0 g, g t i
( f∗ − f∗ hf∗ , ·i + Ω0 f∗
= h I − Ω0 f ∗
n
n
1 1/2 1
1 1/2 1
−1/2 −1
1/2
1/2
= h f∗ − f∗ hf∗ , ·i ( f∗ − f∗ hf∗ , ·i + Ω0 f∗
) Ω0 g, gt i
n
n
n
n
1 1/2 1
1
1/2
1/2 1
= h f∗ − f∗ hf∗ , ·i f∗ ( f∗ − f∗ hf∗ , ·i + Ω0 )−1 Ω0 g, g t i
n
n
n
n
1
1
1
1
= h f∗ − f∗ hf∗ , ·i ( f∗ − f∗ hf∗ , ·i + Ω0 )−1 Ω0 g, gt i
n
n
n
n
1
1
= hT ( T + Ω0 )−1 Ω0 g, gt i
n
n
−1/2
Ωn = hΩ0 − Ω0 f∗
where T : E → E is the self-adjoint operator defined as T φ = f∗ (φ − E∗ φ), ∀φ ∈ E. This
shows (i). To show (ii), by using the previous expression for Ωn we write
46
1
V − nΩn = −hT [( T + Ω0 )−1 Ω0 − I]g, g t i
n
1 −1/2
−1/2
−1/2 1 −1/2
+ I)−1 Ω0 T g, T g t i
= −hΩ0 ( Ω0 T Ω0
n
n
1 1 −1/2 −1/2
= h( Ω0 T Ω0
+ I)−1 ν, ν t i
n n
1/2
1/2
since there exists ν ∈ E such that T g = Ω0 ν. This is because T g = f∗ (g −E∗ g) ∈ R(Ω0 ).
−1/2
So, because ( n1 Ω0
−1/2
T Ω0
+ I)−1 is bounded then nΩn → V as n → ∞.
Lemma C.3. Let θ rn and θˆ be defined as in Theorem 3.3. Then, under the assumption of
Theorem 3.3:
√ rn
ˆ = Op (n−1/2 ).
n(θ − θ)
Proof. We want to show that
√
ˆ → 0. Define T : E → E as the operator
n(θ rn − θ)
T φ = f∗ (φ − E∗ φ), ∀φ ∈ E. Remark that T is self-adjoint and that R(T ) ⊂ R(Ω0 ) so that
Ω−1 T is well defined.
√ rn
ˆ
n(θ − θ)
*
+
−1
√
1
1 1/2 1/2
1/2
−1/2
−1/2
−1/2 −1
= n( f0 + Ω0 f∗
Ω0 f ∗
I − f∗ hf∗ , ·i + f∗
K (rn − Kf0 ), g −B(Pn ))
f∗
n
n
−1
√
1
T + Ω0
)f0 , gi
= nh(I − Ω0
n
√
1
−1 −1
+ n hΩ0 ( T + Ω0 ) K rn , gi − B(Pn ) .
n
We consider these two terms separately. Consider the first term:
√
nh(I − Ω0
1
T + Ω0
n
−1
−1
√ 1
1
)f0 , gi = nh T
f0 , gi
T + Ω0
n
n
−1
1
1
T + Ω0
f0 , T gi
=√ h
n n
−1
1 −1
1
−1/2
Ω−1
).
Ω T +I
= √ hf0 ,
0 T gi = O(n
n
n 0
47
P
Consider the second term and remark that B(Pn ) = n1 ni=1 [( n1 T + Ω0 )−1 Ω0 g](xi ) +
1
1 Pn
−1
i=1 (g(xi ) − [( n T + Ω0 ) Ω0 g](xi )). Let δxi be the Dirac measure that assigns a unit
n
mass to the point xi . Hence, it is possible to identify such a measure with a distribution Di ,
that is, a linear functional Di defined on C ∞ and continuous with respect to the supremum
norm: g 7→ g(xi ) = Di (g), see Schwartz [1966]. We get
n
1
1X
1
1X
−1
−1
Di I − ( T + Ω0 ) Ω0 g
(g(xi ) − [( T + Ω0 ) Ω0 g](xi )) =
n
n
n
n
i=1
i=1
1 1
1X
−1
Di
( T + Ω0 ) T g
=
n
n n
i=1
1 X
1 −1
−1 −1
= 2
Di ( Ω0 T + I) Ω0 T g = Op (n−1 ).
n
n
i=1
Similarly, we can write: ∀t ∈ T , k 7→ k(xi , t) = Di (k(x, t)) =: [Di (k)](t) so that
P
rn (t) = n1 ni=1 [Di (k)](t) and
n
1
1X
1
hΩ0 ( T + Ω0 )−1 K −1 [Di (k)], gi
hΩ0 ( T + Ω0 )−1 K −1 rn , gi =
n
n
n
i=1
n
1X
1
=
Di (hΩ0 ( T + Ω0 )−1 K −1 k, gi)
n
n
i=1
n
1
1X
Di (hk, (K ∗ )−1 ( T + Ω0 )−1 Ω0 gi)
=
n
n
i=1
n
1X
1
−1
=
Di ( T + Ω0 ) Ω0 g
n
n
i=1
since
∗ −1
hk, (K )
1
( T + Ω0 )−1 Ω0 gi =
n
Z
1
−1
k(x, t) (K ) ( T + Ω0 ) Ω0 g (t)ρ(t)dt
n
1
1
= K ∗ (K ∗ )−1 ( T + Ω0 )−1 Ω0 g = ( T + Ω0 )−1 Ω0 g.
n
n
∗ −1
Therefore,
48
n
√
1
1 Xh
1
−1 −1
−1
n hΩ0 ( T + Ω0 ) K rn , gi − B(Pn ) = √
Di ( T + Ω0 ) Ω0 g
n
n
n
i=1
i
1
+ O(n−1/2 ) = Op (n−1/2 ).
− Di ( T + Ω0 )−1 Ω0 g
n
D
Technical Lemmas
T h(x
Lemma D.1. Let g(xi , t, θ) := k(xi , t)(1 − eγ(θ)
i ,θ)
). Assume that T × Θ is compact
and that the maps t 7→ k(x, t) and θ 7→ h(θ, x) are continuous for every x. Then
n
1X
g(xi , t, θ) − E∗ g(xi , t, θ)| → 0
sup |
t∈T,θ∈Θ n
i=1
almost surely.
Proof. Consider the following parametric class of functions G := {g(·, t, θ); t ∈ T θ, ∈ Θ}
and remark that, by the assumptions of the lemma and because γ(·) is continuous in its
argument (see proof of Theorem 1 in Kitamura and Stutzer), the map (t, θ) 7→ g(x, t, θ) is
continuous for every x. Moreover, for every t ∈ T , θ ∈ Θ we observe that the function
sup |g(x, t, θ)|
t∈T,θ∈Θ
is an envelope function, which is integrable because of continuity of the function and compactness of T × Θ. Hence, by Example 19.8 in Van der Vaart we conclude that G is
Glivenko-Cantelli.
The next Lemma applies to the just identified case described in Remark 2.2 where the
prior covariance operator does not depend on θ for which we use the notation Ω0 .
1/2
Lemma D.2. Let Ω0 : E → E be a covariance operator of a GP on BE such that Ω0 1 = 0.
Let D ∈ E be defined as
Z
D := g ∈ E;
g(x)Π(dx) = 0 .
Then, R(Ω0 ) = D.
1/2
1/2
Proof. Because Ω0 = Ω0 Ω0
1/2
and because Ω0 1 = 0 then Ω0 1 = 0, that is, 1 ∈ N(Ω0 ).
49
Hence, if g ∈ R(Ω0 ) then ∃ν ∈ E such that g = Ω0 ν and so
Z
g(x)Π(dx) =
Z
Ω0 νΠ(dx) =< ν, Ω0 1 >= 0.
This shows that R(Ω0 ) ⊂ D. Now, take g ∈ D (so, < g, 1 >= 0) and suppose that
g∈
/ R(Ω0 ). Then, ∀h ∈ R(Ω0 ): < g, h >= 0 =< g, 1 > and
< g, h − 1 >= 0.
Because the same reasoning holds for every g ∈ D, then: < g, h − 1 >= 0 for every g ∈ D
and for every h ∈ R(Ω0 ). Hence, it must be h = 1 but this is impossible since 1 ∈
/ R(Ω0 ).
Therefore, g must belong to R(Ω0 ) and so D ⊂ R(Ω0 ).
References
U. Amato and W. Hughes. Maximum entropy regularization of fredholm integral equations
of the first kind. Inverse Problems, 7(6):793, 1991.
J. F. Bonnans and A. Shapiro. Optimization problems with perturbations: A guided tour.
SIAM Rev., 40(2).
C. Cazals, F. F`eve, P. F`eve, and J.-P. Florens. Simple structural econometrics of price
elasticity. Economics Letters, 86(1):1–6, 2005.
C.-F. Chen. On asymptotic normality of limiting density functions with bayesian implications. Journal of the Royal Statistical Society. Series B (Methodological), 47(3):540–546,
1985.
V. Chernozhukov and H. Hong. An {MCMC} approach to classical estimation. Journal of
Econometrics, 115(2):293–346, 2003.
I. Csiszar. i-divergence geometry of probability distributions and minimization problems.
Ann. Probab., 3(1):146–158, 1975.
T. S. Ferguson. A bayesian analysis of some nonparametric problems. Ann. Statist., 1(2):
209–230, 03 1973.
T. S. Ferguson. Prior distributions on spaces of probability measures. Ann. Statist., 2(4):
615–629, 07 1974.
50
F. F`eve, J.-P. Florens, and S. Richard. Microeconomic demand modelling for price elasticities. In M. Crew and P. Kleindorfer, editors, Lineralization of the Postal and Delivery
Sector. Edouard Elgar, Northampton, USA, 2006.
J.-P. Florens and J.-M. Rolin. Bayes, bootstrap, moments. Technical report, Universit´e
Catholique de Louvain, 1994.
J.-P. Florens and A. Simoni. Regularized posteriors in linear ill-posed inverse problems.
Scandinavian Journal of Statistics, 39(2):pp. 214–235, 2012.
J.-P. Florens and A. Simoni. Regularizing priors for linear inverse problems. Econometric
Theory, FirstView, 11 2014.
J.-Y. Kim. Limited information likelihood and bayesian analysis. Journal of Econometrics,
107(1-2):175–193, 2002.
Y. Kitamura. Empirical likelihood methods in econometrics: Theory and practice. In
R. Blundell, W. Newey, and T. Persson, editors, Advances in Economics and Econometrics, volume 3, pages 174–237. Cambridge University Press, 2007. Cambridge Books
Online.
Y. Kitamura and T. Otsu. Bayesian analysis of moment condition models using nonparametric priors. Technical report.
Y. Kitamura and M. Stutzer. An information-theoretic alternative to generalized method
of moments estimation. Econometrica, 65(4):pp. 861–874, 1997.
R. Kress. Linear Integral Equation. Springer New York, 1999.
H.-H. Kuo. Gaussian Measures in Banach Spaces. Springer Berlin Heidelberg, 1975.
S. M. Schennach. Bayesian exponentially tilted empirical likelihood. Biometrika, 92(1):
31–46, 2005.
´
L. Schwartz. Th´eorie des distributions. Hermann Editeurs,
1966.
51