But the last expression converges to zero if, and only if,

$$\lim_{i\to\infty} \int_{\Theta_i} p(x \mid \theta)\,\pi(\theta)\,d\theta = \int_{\Theta} p(x \mid \theta)\,\pi(\theta)\,d\theta,$$

and this follows from the monotone convergence theorem.

It is well known that logarithmic convergence implies convergence in $L_1$, which implies uniform convergence of probabilities, so Theorem 1 could, at first sight, be invoked to justify the formal use of virtually any improper prior in Bayes theorem. As illustrated below, however, logarithmic convergence of the approximating posteriors is not necessarily good enough.

Example 1 (Fraser, Monette and Ng [21]). Consider the model, with both discrete data and parameter space,

$$\mathcal{M} = \{p(x \mid \theta) = 1/3,\ x \in \{[\theta/2],\, 2\theta,\, 2\theta+1\},\ \theta \in \{1, 2, \ldots\}\},$$

where $[u]$ denotes the integer part of $u$, and $[1/2]$ is separately defined as 1. Fraser, Monette and Ng [21] show that the naive improper prior $\pi(\theta) = 1$ produces a posterior $\pi(\theta \mid x) \propto p(x \mid \theta)$ which is strongly inconsistent, leading to credible sets for $\theta$ given by $\{2x, 2x+1\}$ which have posterior probability 2/3 but frequentist coverage of only 1/3 for all $\theta$ values. Yet, choosing the natural approximating sequence of compact sets $\Theta_i = \{1, \ldots, i\}$, it follows from Theorem 1 that the corresponding sequence of posteriors converges logarithmically to $\pi(\theta \mid x)$.

The difficulty shown by Example 1 lies in the fact that logarithmic convergence is only pointwise convergence for given $x$, which does not guarantee that the approximating posteriors are accurate in any global sense over $x$. For that we turn to a stronger notion of convergence.

Definition 4 (Expected logarithmic convergence of posteriors). Consider a parametric model $\mathcal{M} = \{p(x \mid \theta),\ x \in \mathcal{X},\ \theta \in \Theta\}$, a strictly positive continuous function $\pi(\theta)$, $\theta \in \Theta$, and an approximating compact sequence $\{\Theta_i\}$ of parameter spaces. The corresponding sequence of posteriors $\{\pi_i(\theta \mid x)\}_{i=1}^{\infty}$ is said to be expected logarithmically convergent to the formal posterior $\pi(\theta \mid x)$ if

$$\lim_{i\to\infty} \int_{\mathcal{X}} \kappa\{\pi(\cdot \mid x) \mid \pi_i(\cdot \mid x)\}\, p_i(x)\,dx = 0, \tag{2.2}$$

where $p_i(x) = \int_{\Theta_i} p(x \mid \theta)\,\pi_i(\theta)\,d\theta$.

This notion was first discussed (in the context of reference priors) in Berger and Bernardo [7], and achieves one of our original goals: A prior
distribution satisfying this condition will yield a posterior that, on average over $x$, is a good approximation to the proper posterior that would result from restriction to a large compact subset of the parameter space.

To some Bayesians, it might seem odd to worry about averaging the logarithmic discrepancy over the sample space but, as will be seen, reference priors are designed to be "noninformative" for a specified model, the notion being that repeated use of the prior with that model will be successful in practice.

Example 2 (Fraser, Monette and Ng [21] continued). In Example 1, the discrepancies $\kappa\{\pi(\cdot \mid x) \mid \pi_i(\cdot \mid x)\}$ between $\pi(\theta \mid x)$ and the posteriors derived from the sequence of proper priors $\{\pi_i(\theta)\}_{i=1}^{\infty}$ converged to zero. However, Berger and Bernardo [7] show that $\int_{\mathcal{X}} \kappa\{\pi(\cdot \mid x) \mid \pi_i(\cdot \mid x)\}\, p_i(x)\,dx \to \log 3$ as $i \to \infty$, so that the expected logarithmic discrepancy does not go to zero. Thus, the sequence of proper priors $\{\pi_i(\theta) = 1/i,\ \theta \in \{1,\ldots,i\}\}_{i=1}^{\infty}$ does not provide a good global approximation to the formal prior $\pi(\theta) = 1$, providing one explanation of the paradox found by Fraser, Monette and Ng [21].

Interestingly, for the improper prior $\pi(\theta) = 1/\theta$, the approximating compact sequence considered above can be shown to yield posterior distributions that expected logarithmically converge to $\pi(\theta \mid x) \propto \theta^{-1} p(x \mid \theta)$, so that this is a good candidate objective prior for the problem. It is also shown in Berger and Bernardo [7] that this prior has posterior confidence intervals with the correct frequentist coverage.

Two potential generalizations are of interest. Definition 4 requires convergence only with respect to one approximating compact sequence of parameter spaces. It is natural to wonder what happens for other such approximating sequences. We suspect, but have been unable to prove in general, that convergence with respect to one sequence will guarantee convergence with respect to any sequence. If true, this makes expected logarithmic convergence an even more compelling property.

Related to this is the possibility of allowing not just an approximating series of priors based on truncation to compact parameter spaces, but instead allowing any approximating sequence of priors. Among the difficulties in dealing with this is the need for a better notion of divergence that is symmetric in its arguments. One possibility is the symmetrized form of the logarithmic divergence in Bernardo and Rueda [12], but the analysis is considerably more difficult.

2.2. Permissible priors. Based on the previous considerations, we restrict consideration of possibly objective priors to those that satisfy the expected logarithmic convergence condition, and formally define them as follows. (Recall that $x$ represents the entire data vector.)
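Before stating the definition, the Fraser, Monette and Ng paradox that motivates it can be verified by direct enumeration. The following sketch is our own illustration (the helper names are not from [21]); it checks that the credible set $\{2x, 2x+1\}$ always carries posterior probability 2/3 under $\pi(\theta) = 1$, while its frequentist coverage is exactly 1/3 for every $\theta \geq 2$:

```python
# Exploratory check of the Fraser-Monette-Ng paradox of Example 1.
# Helper names are ours, not from the original papers.
from fractions import Fraction

def sample_space(theta):
    """The three equally likely observations under p(x | theta)."""
    return [max(theta // 2, 1), 2 * theta, 2 * theta + 1]  # [1/2] := 1

def covers(theta, x):
    """Does the credible set {2x, 2x + 1} contain theta?"""
    return theta in (2 * x, 2 * x + 1)

# Frequentist coverage of {2x, 2x + 1} is exactly 1/3 for each theta >= 2
# (theta = 1 is a boundary case because of the special definition [1/2] = 1).
for theta in (2, 3, 7, 100):
    cov = Fraction(sum(covers(theta, x) for x in sample_space(theta)), 3)
    print(f"theta={theta}: coverage = {cov}")  # prints 1/3 each time

# ... yet its posterior probability under pi(theta) = 1 is always 2/3: the
# thetas compatible with x are {[x/2], 2x, 2x + 1}, each with likelihood 1/3,
# so the formal posterior is uniform on these three values.
x = 10
compatible = [x // 2, 2 * x, 2 * x + 1]
post_prob = Fraction(sum(t in (2 * x, 2 * x + 1) for t in compatible),
                     len(compatible))
print(f"posterior probability of {{2x, 2x+1}} at x={x}: {post_prob}")  # 2/3
```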
Definition 5. A strictly positive continuous function $\pi(\theta)$ is a permissible prior for model $\mathcal{M} = \{p(x \mid \theta),\ x \in \mathcal{X},\ \theta \in \Theta\}$ if:

1. for all $x \in \mathcal{X}$, $\pi(\theta \mid x)$ is proper, that is, $\int_{\Theta} p(x \mid \theta)\,\pi(\theta)\,d\theta < \infty$;
2. for some approximating compact sequence, the corresponding posterior sequence is expected logarithmically convergent to $\pi(\theta \mid x) \propto p(x \mid \theta)\pi(\theta)$.

The following theorem, whose proof is given in Appendix A, shows that, for one observation from a location model, the objective prior $\pi(\theta) = 1$ is permissible under mild conditions.

Theorem 2. Consider the model $\mathcal{M} = \{f(x - \theta),\ \theta \in \mathbb{R},\ x \in \mathbb{R}\}$, where $f(t)$ is a density function on $\mathbb{R}$. If, for some $\varepsilon > 0$,

$$\lim_{|t|\to\infty} |t|^{1+\varepsilon} f(t) = 0, \tag{2.3}$$

then $\pi(\theta) = 1$ is a permissible prior for the location model $\mathcal{M}$.

Example 3 (A nonpermissible constant prior in a location model). Consider the location model $\mathcal{M} \equiv \{p(x \mid \theta) = f(x - \theta),\ \theta \in \mathbb{R},\ x > \theta + e\}$, where $f(t) = t^{-1}(\log t)^{-2}$, $t > e$. It is shown in Appendix B that, if $\pi(\theta) = 1$, then $\int_{\mathcal{X}} \kappa\{\pi(\cdot \mid x) \mid \pi_0(\cdot \mid x)\}\, p_0(x)\,dx = \infty$ for any compact set $\Theta_0 = [a, b]$ with $b - a \geq 1$; thus, $\pi(\theta) = 1$ is not a permissible prior for $\mathcal{M}$. Note that this model does not satisfy (2.3).

This is an interesting example because we are still dealing with a location density, so that $\pi(\theta) = 1$ is still the invariant (Haar) prior and, as such, satisfies numerous nice properties, such as being exact frequentist matching (i.e., a Bayesian $100(1-\alpha)\%$ credible set will also be a frequentist $100(1-\alpha)\%$ confidence set; cf. equation (6.22) in Berger [2]). This is in stark contrast to the situation with the Fraser, Monette and Ng example. However, the basic fact remains that posteriors from uniform priors on large compact sets do not seem here to be well approximated (in terms of logarithmic divergence) by a uniform prior on the full parameter space. The suggestion is that this is a situation in which assessment of the "true" bounded parameter space is potentially needed.

Of course, a prior might be permissible for a larger sample size, even if it is not permissible for the minimal sample size. For instance, we suspect that $\pi(\theta) = 1$ is permissible for any location model having two or more independent observations.

The condition in the definition of permissibility that the posterior must be proper is not vacuous, as Example 4 below shows.
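Before turning to that example, note that the tail condition (2.3) is easy to probe numerically. The sketch below is our own (with the arbitrary choice $\varepsilon = 0.5$); it tabulates $|t|^{1+\varepsilon} f(t)$ for a standard normal density, which satisfies (2.3), and for the density of Example 3, which does not:

```python
# Numerical look at condition (2.3): |t|^(1+eps) f(t) -> 0 as |t| -> infinity.
# Our own sketch with eps = 0.5, not part of the paper's formal argument.
import math

eps = 0.5

def normal(t):        # standard normal location kernel: satisfies (2.3)
    return math.exp(-t * t / 2) / math.sqrt(2 * math.pi)

def example3(t):      # density of Example 3, supported on t > e
    return 1.0 / (t * math.log(t) ** 2) if t > math.e else 0.0

for t in (5.0, 10.0, 1e2, 1e4, 1e6):
    print(f"t={t:>10.0f}  normal: {t**(1 + eps) * normal(t):.3e}"
          f"  example3: {t**(1 + eps) * example3(t):.3e}")
# The normal column collapses to zero, so pi(theta) = 1 is permissible by
# Theorem 2; the example3 column behaves like sqrt(t)/(log t)^2 and
# eventually diverges, so (2.3) fails, consistent with Example 3.
```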
Example 4 (Mixture model). Let $\mathbf{x} = \{x_1, \ldots, x_n\}$ be a random sample from the mixture $p(x_i \mid \theta) = \frac{1}{2} N(x_i \mid \theta, 1) + \frac{1}{2} N(x_i \mid 0, 1)$, and consider the uniform prior function $\pi(\theta) = 1$. Since the likelihood function is bounded below by $2^{-n} \prod_{j=1}^{n} N(x_j \mid 0, 1) > 0$, the integrated likelihood $\int_{-\infty}^{\infty} p(\mathbf{x} \mid \theta)\,\pi(\theta)\,d\theta = \int_{-\infty}^{\infty} p(\mathbf{x} \mid \theta)\,d\theta$ will diverge. Hence, the corresponding formal posterior is improper, and therefore the uniform prior is not a permissible prior function for this model. It can be shown that the Jeffreys prior for this mixture model has the shape of an inverted bell, with a minimum value $1/2$ at $\theta = 0$; hence, it is also bounded from below and is, therefore, not a permissible prior for this model either.

Example 4 is noteworthy because it is very rare for the Jeffreys prior to yield an improper posterior in univariate problems. It is also of interest because there is no natural objective prior available for the problem. (There are data-dependent objective priors: see Wasserman [43].)

Theorem 2 can easily be modified to apply to models that can be transformed into a location model.

Corollary 1. Consider $\mathcal{M} \equiv \{p(x \mid \theta),\ \theta \in \Theta,\ x \in \mathcal{X}\}$. If there are monotone functions $y = y(x)$ and $\phi = \phi(\theta)$ such that $p(y \mid \phi) = f(y - \phi)$ is a location model and there exists $\varepsilon > 0$ such that $\lim_{|t|\to\infty} |t|^{1+\varepsilon} f(t) = 0$, then $\pi(\theta) = |\phi'(\theta)|$ is a permissible prior function for $\mathcal{M}$.

The most frequent transformation is the log transformation, which converts a scale model into a location model. Indeed, this transformation yields the following direct analogue of Theorem 2.

Corollary 2. Consider $\mathcal{M} = \{p(x \mid \theta) = \theta^{-1} f(|x|/\theta),\ \theta > 0,\ x \in \mathbb{R}\}$, a scale model where $f(s)$, $s > 0$, is a density function. If, for some $\varepsilon > 0$,

$$\lim_{|t|\to\infty} |t|^{1+\varepsilon} e^{t} f(e^{t}) = 0, \tag{2.4}$$

then $\pi(\theta) = \theta^{-1}$ is a permissible prior function for the scale model $\mathcal{M}$.

Example 5 (Exponential data). If $x$ is an observation from an exponential density, (2.4) becomes $|t|^{1+\varepsilon} e^{t} \exp(-e^{t}) \to 0$ as $|t| \to \infty$, which is true. From Corollary 2, $\pi(\theta) = \theta^{-1}$ is a permissible prior; indeed, the posteriors corresponding to $\pi_i(\theta) = (2i)^{-1}\theta^{-1}$, $e^{-i} \leq \theta \leq e^{i}$, are expected logarithmically convergent to $\pi(\theta \mid x)$.

Example 6 (Uniform data). Let $x$ be one observation from the uniform distribution $\mathcal{M} = \{\mathrm{Un}(x \mid 0, \theta) = \theta^{-1},\ x \in [0, \theta],\ \theta > 0\}$. This is a scale density, and equation (2.4) becomes $|t|^{1+\varepsilon} e^{t} 1_{\{0 < e^{t} < 1\}} \to 0$ as $|t| \to \infty$, which is indeed true. Thus, $\pi(\theta) = \theta^{-1}$ is a permissible prior function for $\mathcal{M}$.
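The same kind of numerical check applies to the scale condition (2.4). A short sketch (again our own, with $\varepsilon = 0.5$) evaluates $|t|^{1+\varepsilon} e^{t} f(e^{t})$ in both tails for the exponential and uniform models of Examples 5 and 6:

```python
# Checking condition (2.4): |t|^(1+eps) e^t f(e^t) -> 0 as |t| -> infinity,
# for the scale models of Examples 5 and 6 (our own sketch, eps = 0.5).
import math

eps = 0.5

def f_exponential(s):   # Example 5: exponential density on s > 0
    return math.exp(-s) if s > 0 else 0.0

def f_uniform(s):       # Example 6: Un(x | 0, theta) scale kernel
    return 1.0 if 0 < s < 1 else 0.0

def lhs(f, t):          # the quantity appearing in (2.4)
    return abs(t) ** (1 + eps) * math.exp(t) * f(math.exp(t))

for t in (-30.0, -10.0, 10.0, 30.0):
    print(f"t={t:>6}  exponential: {lhs(f_exponential, t):.3e}"
          f"  uniform: {lhs(f_uniform, t):.3e}")
# Both columns vanish in each tail, so pi(theta) = 1/theta is permissible
# for both models by Corollary 2.
```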
The examples showing permissibility were for a single observation. Pleasantly, it is enough to establish permissibility for a single observation or, more generally, for the sample size necessary for posterior propriety of $\pi(\theta \mid x)$, because of the following theorem, which shows that the expected logarithmic discrepancy is monotonically nonincreasing in sample size.

Theorem 3 (Monotone expected logarithmic discrepancy). Let $\mathcal{M} = \{p(x_1, x_2 \mid \theta) = p(x_1 \mid \theta)\, p(x_2 \mid x_1, \theta),\ x_1 \in \mathcal{X}_1,\ x_2 \in \mathcal{X}_2,\ \theta \in \Theta\}$ be a parametric model. Consider a continuous improper prior $\pi(\theta)$ satisfying $m(x_1) = \int_{\Theta} p(x_1 \mid \theta)\,\pi(\theta)\,d\theta < \infty$ and $m(x_1, x_2) = \int_{\Theta} p(x_1, x_2 \mid \theta)\,\pi(\theta)\,d\theta < \infty$. For any compact set $\Theta_0 \subset \Theta$, let $\pi_0(\theta) = \pi(\theta) 1_{\Theta_0}(\theta) / \int_{\Theta_0} \pi(\theta)\,d\theta$. Then,

$$\int\!\!\int_{\mathcal{X}_1 \times \mathcal{X}_2} \kappa\{\pi(\cdot \mid x_1, x_2) \mid \pi_0(\cdot \mid x_1, x_2)\}\, m_0(x_1, x_2)\,dx_1\,dx_2 \leq \int_{\mathcal{X}_1} \kappa\{\pi(\cdot \mid x_1) \mid \pi_0(\cdot \mid x_1)\}\, m_0(x_1)\,dx_1, \tag{2.5}$$

where, for $\theta \in \Theta_0$,

$$\pi_0(\theta \mid x_1, x_2) = \frac{p(x_1, x_2 \mid \theta)\,\pi(\theta)}{m_0(x_1, x_2)}, \qquad m_0(x_1, x_2) = \int_{\Theta_0} p(x_1, x_2 \mid \theta)\,\pi(\theta)\,d\theta,$$

$$\pi_0(\theta \mid x_1) = \frac{p(x_1 \mid \theta)\,\pi(\theta)}{m_0(x_1)}, \qquad m_0(x_1) = \int_{\Theta_0} p(x_1 \mid \theta)\,\pi(\theta)\,d\theta.$$

Proof. The proof of this theorem is given in Appendix C.

As an aside, the above result suggests that, as the sample size grows, the convergence of the posterior to normality given in Clarke [16] is monotone.

3. Reference priors.

3.1. Definition of reference priors. Key to the definition of reference priors is Shannon expected information (Shannon [38] and Lindley [36]).

Definition 6 (Expected information). The information to be expected from one observation from model $\mathcal{M} \equiv \{p(x \mid \theta),\ x \in \mathcal{X},\ \theta \in \Theta\}$, when the prior for $\theta$ is $q(\theta)$, is

$$I\{q \mid \mathcal{M}\} = \int\!\!\int_{\mathcal{X} \times \Theta} p(x \mid \theta)\, q(\theta) \log \frac{p(\theta \mid x)}{q(\theta)}\,dx\,d\theta.$$
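As a concrete illustration (ours, not from the paper), the double integral in Definition 6 reduces to a finite sum for a discrete model, and $I\{q \mid \mathcal{M}\}$ is then the mutual information between $x$ and $\theta$. The following sketch computes it for a single Bernoulli($\theta$) observation with $\theta$ restricted to a uniform grid:

```python
# Expected information of Definition 6 for a simple discrete model:
# one Bernoulli(theta) observation, theta on a grid, q uniform on that grid.
# Our own sketch; the grid discretization is an arbitrary choice.
import math

thetas = [(k + 0.5) / 20 for k in range(20)]   # grid over (0, 1)
q = [1.0 / len(thetas)] * len(thetas)          # uniform prior q(theta)

def likelihood(x, theta):                      # Bernoulli model p(x | theta)
    return theta if x == 1 else 1.0 - theta

info = 0.0
for x in (0, 1):
    p_x = sum(likelihood(x, th) * qt for th, qt in zip(thetas, q))  # marginal
    for th, qt in zip(thetas, q):
        post = likelihood(x, th) * qt / p_x    # posterior p(theta | x)
        info += likelihood(x, th) * qt * math.log(post / qt)
print(f"I(q | M) = {info:.4f} nats")           # about 0.19 nats for this grid
```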