(3.1)   I{q | M} = ∫_X κ{q | p(· | x)} p(x) dx,

where p(θ | x) = p(x | θ) q(θ)/p(x) and p(x) = ∫_Θ p(x | θ) q(θ) dθ.

Note that x here refers to the entire observation vector. It can have any dependency structure whatsoever (e.g., it could consist of n normal random variables with mean zero, variance one and correlation θ). Thus, when we refer to a model henceforth, we mean the probability model for the actual complete observation vector. Although somewhat nonstandard, this convention is necessary here because reference prior theory requires the introduction of (artificial) independent replications of the entire experiment.

The amount of information I{q | M} to be expected from observing x from M depends on the prior q(θ): the sharper the prior, the smaller the amount of information to be expected from the data. Consider now the information I{q | M^k} which may be expected from k independent replications of M. As k → ∞, the sequence of realizations {x1, ..., xk} would eventually provide any missing information about the value of θ. Hence, as k → ∞, I{q | M^k} provides a measure of the missing information about θ associated with the prior q(θ). Intuitively, a reference prior will be a permissible prior which maximizes the missing information about θ within the class P of priors compatible with any assumed knowledge about the value of θ.

With a continuous parameter space, the missing information I{q | M^k} will typically diverge as k → ∞, since an infinite amount of information would be required to learn the value of θ. Likewise, the expected information is typically not defined on an unbounded set. These two difficulties are overcome with the following definition, which formalizes the heuristics described in Bernardo [10] and in Berger and Bernardo [7].

DEFINITION 7 [Maximizing Missing Information (MMI) Property]. Let M ≡ {p(x | θ), x ∈ X, θ ∈ Θ ⊂ R} be a model with one continuous parameter, and let P be a class of prior functions for θ for which ∫_Θ p(x | θ) p(θ) dθ < ∞. The function π(θ) is said to have the MMI property for model M given P if, for any compact set Θ0 ⊂ Θ and any p ∈ P,

(3.2)   lim_{k→∞} {I{π0 | M^k} − I{p0 | M^k}} ≥ 0,

where π0 and p0 are, respectively, the renormalized restrictions of π(θ) and p(θ) to Θ0.

The restriction of the definition to a compact set typically ensures the existence of the missing information for given k. That the missing information will diverge for large k is handled by the device of simply insisting that the missing information for the reference prior be larger, as k → ∞, than the missing information for any other candidate p(θ).
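To see heuristically why Definition 7 singles out a particular prior, consider a regular one-parameter model and suppose (a heuristic sketch on our part, assuming the asymptotic expansion of Clarke and Barron [17] applies on Θ0) that

I{q | M^k} = (1/2) log[k/(2πe)] + ∫_{Θ0} q(θ) log[√i(θ)/q(θ)] dθ + o(1),

where i(θ) is the Fisher information. The divergent term (1/2) log k is the same for every prior, so it cancels in (3.2) and, writing π* for √i(θ) renormalized on Θ0,

lim_{k→∞} {I{π0 | M^k} − I{p0 | M^k}} = κ{π* | p0} − κ{π* | π0},

with κ{p̃ | p} = ∫ p(θ) log[p(θ)/p̃(θ)] dθ as in (3.1). This limit is nonnegative for every candidate p0 precisely when κ{π* | π0} = 0, that is, when π0 = π*; so, under these assumptions, the MMI property forces π(θ) ∝ √i(θ) on every compact Θ0.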
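The comparison in (3.2) can also be explored numerically. The following minimal Monte Carlo sketch is illustrative only and not from the paper: the Bernoulli model, the compact set Θ0 = [0.1, 0.9], the grid, the simulation sizes and the function names are all our choices. It estimates I{q | M^k} through the sufficient statistic s = x1 + ··· + xk (sufficiency preserves the mutual information) and compares the restricted Jeffreys prior with a restricted uniform prior.

    # Numerical sketch of Definition 7 for the Bernoulli model (illustrative).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    grid = np.linspace(0.1, 0.9, 1601)       # compact set Theta_0
    dx = grid[1] - grid[0]

    def renormalize(vals):
        """Renormalize a positive function on the grid to a density on Theta_0."""
        return vals / (vals.sum() * dx)

    def expected_information(prior, k, n_sim=4000):
        """Monte Carlo estimate of I{q | M^k} = E[log p(s | theta) - log p(s)],
        using the sufficient statistic s ~ Binomial(k, theta)."""
        cdf = np.cumsum(prior) * dx          # inverse-cdf sampling on the grid
        theta = np.interp(rng.random(n_sim), cdf / cdf[-1], grid)
        s = rng.binomial(k, theta)
        log_lik = stats.binom.logpmf(s, k, theta)
        # marginal p(s) = integral of Binom(s | k, t) q(t) dt, by quadrature
        marg = (stats.binom.pmf(s[:, None], k, grid[None, :]) * prior).sum(axis=1) * dx
        return float(np.mean(log_lik - np.log(marg)))

    jeffreys = renormalize(grid ** -0.5 * (1.0 - grid) ** -0.5)  # restricted Jeffreys
    uniform = renormalize(np.ones_like(grid))                    # restricted uniform

    for k in (1, 10, 100, 1000):
        diff = expected_information(jeffreys, k) - expected_information(uniform, k)
        print(f"k = {k:4d}   I(jeffreys) - I(uniform) = {diff:+.4f}")

If the heuristic expansion above is accurate, the printed differences should settle at a nonnegative value as k grows, in line with (3.2); for small k the ordering need not favor the Jeffreys restriction.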
DEFINITION 8. A function π(θ) = π(θ | M, P) is a reference prior for model M given P if it is permissible and has the MMI property.

Implicit in this definition is that the reference prior on Θ will also be the reference prior on any compact subset Θ0. This is an attractive property that is often stated as the practical way to proceed when dealing with a restricted parameter space, but here it is simply a consequence of the definition.

Although we feel that a reference prior needs to be both permissible and have the MMI property, the MMI property is considerably more important. Thus, others have defined reference priors only in relation to this property, and Definition 7 is compatible with a number of these previous definitions in particular cases. Clarke and Barron [17] proved that, under appropriate regularity conditions, essentially those which guarantee asymptotic posterior normality, the prior which asymptotically maximizes the information to be expected by repeated sampling from M ≡ {p(x | θ), x ∈ X, θ ∈ Θ ⊂ R} is the Jeffreys prior,

(3.3)   π(θ) = √i(θ),   i(θ) = −∫_X p(x | θ) (∂²/∂θ²) log[p(x | θ)] dx,

which, hence, is the reference prior under those conditions. Similarly, Ghosal and Samanta [27] gave conditions under which the prior that asymptotically maximizes the information to be expected by repeated sampling from nonregular models of the form M ≡ {p(x | θ), x ∈ S(θ), θ ∈ Θ ⊂ R}, where the support S(θ) is either monotonically decreasing or monotonically increasing in θ, is

(3.4)   π(θ) = ∫_X p(x | θ) |(∂/∂θ) log[p(x | θ)]| dx,

which is, therefore, the reference prior under those conditions.

3.2. Properties of reference priors. Some important properties of reference priors—generally regarded as required properties for any sensible procedure to derive objective priors—can be immediately deduced from their definition.

THEOREM 4 (Independence of sample size). If data x = {y1, ..., yn} consists of a random sample of size n from model M = {p(y | θ), y ∈ Y, θ ∈ Θ} with reference prior π(θ | M, P), then π(θ | M^n, P) = π(θ | M, P), for any fixed sample size n.

PROOF. This follows from the additivity of the information measure. Indeed, for any sample size n and number of replicates k, k independent replications of M^n amount to nk independent replications of M, so that I{q | (M^n)^k} = I{q | M^{nk}} and the limits in Definition 7 coincide. □
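Both (3.3) and (3.4) are easy to evaluate on textbook models; the following two computations are our illustrations, not the paper's. For a single Bernoulli observation, p(x | θ) = θ^x (1 − θ)^{1−x} with x ∈ {0, 1}, the regular formula (3.3) gives

i(θ) = θ/θ² + (1 − θ)/(1 − θ)² = 1/[θ(1 − θ)],   π(θ) = √i(θ) ∝ θ^{−1/2}(1 − θ)^{−1/2},

the Beta(1/2, 1/2) prior. A random sample of size n has Fisher information n/[θ(1 − θ)], whose square root is proportional to the same function of θ, so the same prior results, as Theorem 4 guarantees. For the nonregular model x ~ Uniform(0, θ), the support S(θ) = (0, θ) is monotonically increasing and p(x | θ) = 1/θ on S(θ), so (3.4) gives

π(θ) = ∫₀^θ (1/θ) |(∂/∂θ) log(1/θ)| dx = ∫₀^θ (1/θ)(1/θ) dx = 1/θ.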