target invariant! Hastings (1970) actually refers to Erhman et al. (1960) as a preliminary if specific instance of this sampler. More precisely, this is Metropolis-within-Gibbs except for the name. It is quite amazing that this first introduction of the Gibbs sampler has been completely overlooked, even though the proof of convergence is completely general, based on a composition argument as in Tierney (1994)! The remainder of the paper deals with (a) an importance sampling version of MCMC, (b) general remarks about assessment of the error, and (c) an application to random orthogonal matrices (with yet again an example of Gibbs sampling).

Three years later, still in Biometrika, Peskun (1973) publishes a comparison of Metropolis' and Barker's forms of acceptance probabilities and shows (again in a discrete setup) that the optimal choice (in terms of the asymptotic variance of any empirical average) is Metropolis'. The proof is a direct consequence of a result by Kemeny and Snell (1960) on the asymptotic variance. Peskun also establishes that this asymptotic variance can improve upon the iid case if and only if the eigenvalues of P − A are all negative, where A is the transition matrix corresponding to the iid simulation and P the transition matrix corresponding to the Metropolis algorithm, but he concludes that the trace of P − A is always positive.
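To make the comparison concrete, here is a minimal sketch (our own toy illustration, not Peskun's computation) of the two acceptance rules on a small discrete target with a symmetric proposal: writing r = π(y)/π(x), the Metropolis rule accepts with probability min(1, r) while Barker's rule accepts with probability r/(1 + r). The target, proposal, and diagnostic below are illustrative choices of ours.

```python
# A minimal sketch contrasting the Metropolis and Barker acceptance rules on a
# toy discrete target with a symmetric nearest-neighbour proposal (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.1, 0.2, 0.4, 0.2, 0.1])          # toy target on 5 states
n_states = len(pi)

def propose(i):
    """Symmetric nearest-neighbour proposal on {0, ..., 4}, clamped at the ends."""
    j = i + rng.choice([-1, 1])
    return min(max(j, 0), n_states - 1)

def run_chain(accept_rule, n_iter=20_000):
    x, samples = 0, []
    for _ in range(n_iter):
        y = propose(x)
        r = pi[y] / pi[x]                         # target ratio (proposal is symmetric)
        if rng.random() < accept_rule(r):
            x = y
        samples.append(x)
    return np.array(samples)

metropolis = lambda r: min(1.0, r)                # Metropolis: accept with prob min(1, r)
barker     = lambda r: r / (1.0 + r)              # Barker: accept with prob r / (1 + r)

for name, rule in [("Metropolis", metropolis), ("Barker", barker)]:
    s = run_chain(rule)
    print(name, "mean:", s.mean().round(3),
          "lag-1 autocorr:", np.corrcoef(s[:-1], s[1:])[0, 1].round(3))
```

On runs like this one the Metropolis rule typically shows the smaller lag-one autocorrelation, in line with Peskun's ordering of the asymptotic variances.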
3 Seeds of the Revolution

A number of earlier pioneers had brought forward the seeds of Gibbs sampling; in particular, Hammersley and Clifford had produced a constructive argument in 1970 to recover a joint distribution from its conditionals, a result later called the Hammersley-Clifford theorem by Besag (1974, 1986). Besides Hastings (1970) and Geman and Geman (1984), already mentioned, other papers that contained the germs of Gibbs sampling are Besag and Clifford (1989), Broniatowski et al. (1984), Qian and Titterington (1990), and Tanner and Wong (1987).

3.1 Besag's Early Work and the Fundamental (Missing) Theorem

In the early 1970's, Hammersley, Clifford, and Besag were working on the specification of joint distributions from conditional distributions and on necessary and sufficient conditions for the conditional distributions to be compatible with a joint distribution. What is now known as the Hammersley-Clifford theorem states that a joint distribution for a vector associated with a dependence graph (an edge meaning dependence and the absence of an edge meaning conditional independence) must be represented as a product of functions over the cliques of the graph, that is, of functions depending only on the components indexed by the labels in the clique (a clique being a subset of the nodes of the graph such that every node is connected by an edge to every other node in the subset). See Cressie (1993) or Lauritzen (1996) for detailed treatments.

From an historical point of view, Hammersley (1974) explains why the Hammersley-Clifford theorem was never published as such, but only through Besag (1974). The reason is that Clifford and Hammersley were dissatisfied with the positivity constraint: the joint density could be recovered from the full conditionals only when the support of the joint was the product of the supports of the full conditionals (with obvious counter-examples, as in Robert and Casella 2004). While they strived to make the theorem independent of any positivity condition, their graduate student, Moussouris, published a counter-example (Moussouris 1974) that put a full stop to their endeavors.

While Julian Besag can certainly be credited to some extent with the (re-)discovery of the Gibbs sampler (as in Besag 1974), Besag (1975) makes the following curious and anticlimactic comment:

    The simulation procedure is to consider the sites cyclically and, at each stage, to amend or leave unaltered the particular site value in question, according to a probability distribution whose elements depend upon the current value at neighboring sites (...) However, the technique is unlikely to be particularly helpful in many other than binary situations and the Markov chain itself has no practical interpretation.

So, while stating the basic version of the Gibbs sampler on a graph with discrete variables, Besag dismisses it as impractical.
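As a concrete version of the cyclic single-site scheme Besag describes, here is a minimal sketch of Gibbs sampling for a toy Ising-type binary field; the grid size, interaction strength, and free-boundary convention are our own illustrative choices rather than anything taken from Besag (1975).

```python
# A minimal sketch (our own toy example, not Besag's) of cyclic single-site updates:
# each site is redrawn from its conditional distribution given the current values
# at its neighbours, here for an Ising-type binary field on a small grid.
import numpy as np

rng = np.random.default_rng(1)
n, beta = 16, 0.6                                   # grid size and interaction strength (illustrative)
x = rng.integers(0, 2, size=(n, n)) * 2 - 1          # spins in {-1, +1}

def neighbour_sum(x, i, j):
    """Sum of the four nearest neighbours, with free boundaries."""
    s = 0
    if i > 0:     s += x[i - 1, j]
    if i < n - 1: s += x[i + 1, j]
    if j > 0:     s += x[i, j - 1]
    if j < n - 1: s += x[i, j + 1]
    return s

def gibbs_sweep(x):
    """One cyclic pass over the sites, as in Besag's description."""
    for i in range(n):
        for j in range(n):
            s = neighbour_sum(x, i, j)
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * beta * s))   # P(x_ij = +1 | neighbours)
            x[i, j] = 1 if rng.random() < p_plus else -1
    return x

for sweep in range(100):
    x = gibbs_sweep(x)
print("mean spin after 100 sweeps:", x.mean().round(3))
```

Each sweep leaves the Ising-type joint distribution invariant, so Besag's reservation concerned practical usefulness rather than validity.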
On the other hand, Hammersley, together with Handscomb, wrote a textbook on Monte Carlo methods (the first?) (Hammersley and Handscomb 1964). There they cover such topics as "Crude Monte Carlo" (which is (3)); importance sampling; control variates; and "Conditional Monte Carlo", which looks surprisingly like a missing-data completion approach. Of course, they do not cover the Hammersley-Clifford theorem but, in contrast to Besag (1974), they state in the Preface:

    We are convinced nevertheless that Monte Carlo methods will one day reach an impressive maturity.

Well said!

3.2 EM and its Simulated Versions as Precursors

Besides a possibly difficult computation in the E-step, problems with the EM algorithm (Dempster et al. 1977) do occur in the case of multimodal likelihoods. The increase of the likelihood function at each step of the algorithm ensures its convergence to the maximum likelihood estimator in the case of unimodal likelihoods, but it implies a dependence on initial conditions for multimodal likelihoods. Several proposals can be found in the literature to overcome this problem, one of which we now describe because of its connection with Gibbs sampling.

Broniatowski et al. (1984) and Celeux and Diebolt (1985, 1992) have tried to overcome the dependence of EM methods on the starting value by replacing the E step with a simulation step (the Stochastic EM, or SEM, algorithm), the missing data z being generated conditionally on the observation x and on the current value of the parameter $\theta_m$. The maximization in the M step is then done on the (simulated) complete-data log-likelihood, $\tilde{H}(x, z_m \mid \theta)$. The appeal of this approach is that it allows for a more systematic exploration of the likelihood surface by partially avoiding the fatal attraction of the closest mode. Unfortunately, the theoretical convergence results for these methods are limited. Celeux and Diebolt (1990) have, however, solved the convergence problem of SEM by devising a hybrid version called SAEM (for Simulated Annealing EM), where the amount of randomness in the simulations decreases with the iterations, ending up with an EM algorithm. This version actually relates to simulated annealing methods.
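To fix ideas, here is a minimal sketch of one such simulated E-step, for a made-up two-component Gaussian mixture with unit variances; the data, starting values, and helper name are our own illustrative assumptions and are not taken from Broniatowski et al. (1984) or Celeux and Diebolt (1985).

```python
# A minimal SEM sketch (our own toy example): a two-component Gaussian mixture with
# unit variances. The E step is replaced by simulation of the missing component
# labels z; the M step then maximizes the completed-data log-likelihood in closed form.
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(2, 1, 100)])   # synthetic data

def sem_step(x, w, mu1, mu2):
    # Simulation step: draw each missing label from its conditional distribution
    p1 = w * np.exp(-0.5 * (x - mu1) ** 2)
    p2 = (1 - w) * np.exp(-0.5 * (x - mu2) ** 2)
    z = rng.random(x.size) < p1 / (p1 + p2)           # True -> component 1
    # M step: closed-form maximizers given the completed data
    w = z.mean()
    mu1 = x[z].mean() if z.any() else mu1
    mu2 = x[~z].mean() if (~z).any() else mu2
    return w, mu1, mu2

w, mu1, mu2 = 0.5, -1.0, 1.0                          # starting values
for m in range(200):
    w, mu1, mu2 = sem_step(x, w, mu1, mu2)
print("final SEM state:", round(w, 3), round(mu1, 3), round(mu2, 3))
```

Unlike EM, the resulting parameter sequence is a Markov chain rather than a deterministic recursion; in practice one typically averages or monitors the iterates, and it is exactly this injected randomness that helps escape the attraction of a nearby mode.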
3.3 Gibbs, and Beyond

Although somewhat removed from statistical inference in the classical sense and based on earlier techniques used in Statistical Physics, the landmark paper by Geman and Geman (1984) brought Gibbs sampling into the arena of statistical application. This paper is also responsible for the name Gibbs sampling, because it implemented this method for the Bayesian study of Gibbs random fields which, in turn, derive their name from the physicist Josiah Willard Gibbs (1839-1903). This original implementation of the Gibbs sampler was applied to a discrete image processing problem and did not involve completion. But this was one more spark that led to the explosion, as it had a clear influence on Green, Smith, Spiegelhalter and others.

4 The Revolution

The gap of more than 30 years between Metropolis et al. (1953) and Gelfand and Smith (1990) can still be partially attributed to the lack of appropriate computing power, as most of the examples now processed by MCMC algorithms could not have been treated previously, even though the hundreds of dimensions processed in Metropolis et al. (1953) were quite formidable. However, by the mid-1980s the pieces were all in place.

After Peskun, MCMC in the statistical world was dormant for about 10 years, and then several papers appeared that highlighted its usefulness in specific settings (see, for example, Geman and Geman 1984, Tanner and Wong 1987, Besag 1989). In particular, Geman and Geman (1984), building on Metropolis et al. (1953), Hastings (1970), and Peskun (1973), influenced Gelfand and Smith (1990) to write a paper that is the genuine starting point for an intensive use of MCMC methods by the statistical community. It sparked new interest in Bayesian methods, statistical computing, algorithms, and stochastic processes through the use of computing algorithms such as the Gibbs sampler and the Metropolis-Hastings algorithm. (See Casella and George 1992 for an elementary introduction to the Gibbs sampler.[7])

Interestingly, the earlier Tanner and Wong (1987) has essentially the same ingredients as Gelfand and Smith (1990), namely the fact that simulating from the conditional distributions is sufficient to simulate (in the limiting sense) from the joint. This paper was considered important enough to be a discussion paper in the Journal of the American Statistical Association, but its impact was somewhat limited, compared with that of Gelfand and Smith (1990). There are several reasons for this; one being that the method seemed to apply only to missing data problems (hence the name data augmentation instead of Gibbs sampling), and another being that the authors were more focused on approximating the posterior distribution. They suggested a (Markov chain) Monte Carlo approximation to the target $\pi(\theta \mid x)$ at each iteration of the sampler, based on
$$
\frac{1}{m} \sum_{k=1}^{m} \pi(\theta \mid x, z_{t,k}), \qquad z_{t,k} \sim \hat{\pi}_{t-1}(z \mid x),
$$
that is, by replicating m times the simulations from the current approximation $\hat{\pi}_{t-1}(z \mid x)$ of the marginal posterior distribution of the missing data. This focus on estimation of the posterior distribution connects the original Data Augmentation algorithm to EM, as pointed out by Dempster in the discussion. Although the discussion by Morris gets very close to the two-stage Gibbs sampler for hierarchical models, he is still concerned about doing m iterations, and worries about how costly that would be. Tanner and Wong mention

[7] On a humorous note, the original Technical Report of this paper was called Gibbs for Kids, which was changed because a referee did not appreciate the humor. However, our colleague Dan Gianola, an Animal Breeder at Wisconsin, liked the title. In using Gibbs sampling in his work, he gave a presentation in 1993 at the 44th Annual Meeting of the European Association for Animal Production, Arhus, Denmark. The title: Gibbs for Pigs.