A History of Markov Chain Monte Carlo∗
—Subjective Recollections from Incomplete Data—

Christian Robert†, Université Paris Dauphine and CREST, INSEE
George Casella‡, University of Florida

August 14, 2008

Abstract

In this note we attempt to trace the history and development of Markov chain Monte Carlo (MCMC) from its early inception in the late 1940s through its use today. We see how the earlier stages of Monte Carlo (MC, not MCMC) research have led to the algorithms currently in use. More importantly, we see how the development of this methodology has not only changed our solutions to problems, but has changed the way we think about problems.

1 Introduction

Markov chain Monte Carlo (MCMC) methods have been around for almost as long as Monte Carlo techniques, even though their impact on Statistics was not truly felt until the very early 1990s. (The emergence of Markov-based techniques in Physics and, in particular, Particle Physics is another story that will remain mostly untold within this survey. See Landau and Binder 2005 for a review.) Also, we will not launch into a description of MCMC techniques, unless they have some historical link, as the remainder of this volume covers

∗We are grateful for comments and suggestions from Olivier Cappé, David Spiegelhalter, Alan Gelfand, Jun Liu, Sharon McGrayne, Peter Müller, Gareth Roberts, and Adrian Smith.
†Professor of Statistics, CEREMADE, Université Paris Dauphine, 75785 Paris cedex 16, France. Supported by the Agence Nationale de la Recherche (ANR, 212 rue de Bercy, 75012 Paris) through the 2006–2008 project Adap'MC. Email: xian@ceremade.dauphine.fr.
‡Distinguished Professor, Department of Statistics, University of Florida, Gainesville, FL 32611. Supported by National Science Foundation Grants DMS-04-05543, DMS-0631632 and SES-0631588. Email: casella@stat.ufl.edu.
the technical aspects. A comprehensive entry with further references can also be found in Robert and Casella (2004).

We will distinguish the introduction of Metropolis–Hastings based algorithms from that of algorithms related to Gibbs sampling, since they stem from radically different origins, even though their mathematical justification via Markov chain theory is the same. Tracing the development of Monte Carlo methods, we will also briefly mention what we might call the "second-generation MCMC revolution". Starting in the mid-to-late 1990s, this includes the development of particle filters, reversible jump and perfect sampling, and concludes with more current work on population or sequential Monte Carlo, regeneration, and the computing of "honest" standard errors. (But is it still history?!)

As mentioned above, the realization that Markov chains could be used in a wide variety of situations only came (to mainstream statisticians) with Gelfand and Smith (1990), despite earlier publications in the statistical literature such as Hastings (1970), Geman and Geman (1984), and Tanner and Wong (1987). Several reasons can be advanced: lack of computing machinery (think of the computers of 1970!), lack of background on Markov chains, lack of trust in the practicality of the method... It thus required visionary researchers like Alan Gelfand and Adrian Smith to spread the good news, backed up with a collection of papers that demonstrated, through a series of applications, that the method was easy to understand, easy to implement, and practical (Gelfand et al. 1990, 1992, Smith and Gelfand 1992, Wakefield et al. 1994).
The rapid emergence of the dedicated BUGS (Bayesian inference Using Gibbs Sampling) software as early as 1991 (when a paper on BUGS was presented at the Valencia meeting) was another compelling argument for adopting (at large) MCMC algorithms.¹

2 Before the Revolution

Monte Carlo methods were born in Los Alamos, New Mexico during World War II, eventually resulting in the Metropolis algorithm in the early 1950s. While Monte Carlo methods were in use by that time, MCMC was brought closer to statistical practicality by the work of Hastings in the 1970s.

What can reasonably be seen as the first MCMC algorithm is the Metropolis algorithm, published by Metropolis et al. (1953). It emanates from the same group of scientists who produced the Monte Carlo method, namely the research scientists of Los Alamos, mostly

¹Historically speaking, the development of BUGS initiated from Geman and Geman (1984) and Pearl (1987), in tune with developments in the artificial intelligence community, and it pre-dates Gelfand and Smith (1990).
physicists working on mathematical physics and the atomic bomb.²

MCMC algorithms therefore date back to the same time as the development of the regular (MC-only) Monte Carlo method, which is usually traced to Ulam and von Neumann in the late 1940s. Stanislaw Ulam associates the original idea with an intractable combinatorial computation he attempted in 1946 (calculating the probability of winning at the card game "solitaire"). This idea was enthusiastically adopted by John von Neumann for implementation with direct applications to neutron diffusion, the name "Monte Carlo" being suggested by Nicholas Metropolis. (See the interesting recounting of this in Eckhardt 1987.)

These occurrences very closely coincide with the appearance of the very first computer, the ENIAC, which came to life in February 1946, after three years of construction. The Monte Carlo method was set up by von Neumann, who was using it on thermonuclear and fission problems as early as 1947. At the same time, that is, in 1947, Ulam and von Neumann invented inversion and accept-reject techniques (also recounted in Eckhardt 1987) to simulate from non-uniform distributions. Without computers, a rudimentary version invented by Fermi in the 1930s did not get any recognition (Metropolis 1987). Note also that, as early as 1949, a symposium on Monte Carlo was supported by Rand, NBS and the Oak Ridge laboratory, and that Metropolis and Ulam (1949) published the very first paper about the Monte Carlo method.

2.1 The Metropolis et al. (1953) paper

The first MCMC algorithm is associated with a second computer, called MANIAC(!), built³ in Los Alamos in early 1952. Both a physicist and a mathematician, Metropolis, who died in Los Alamos in 1999, came to this place in April 1943. The other members of the team also came to Los Alamos during those years, with Edward Teller being the most controversial character of the group.
As early as 1942, he was one of the first scientists to work on the Manhattan Project that led to the production of the A-bomb. Almost as early, he became obsessed with the hydrogen (H) bomb, which he eventually managed to design with Stanislaw Ulam, using the better computer facilities, in the early 1950s.⁴

²The atomic bomb construction did not involve simulation techniques, even though the subsequent development of the hydrogen bomb did.
³MANIAC stands for Mathematical Analyzer, Numerical Integrator and Computer.
⁴On a somber note, Edward Teller later testified against Robert Oppenheimer in the McCarthy trials and, much later, was a fervent proponent of the "Star Wars" defense system under the Reagan administration.

Published in June 1953 in the Journal of Chemical Physics, Metropolis et al. (1953)
primary focus is the computation of integrals of the form

\[
I = \frac{\int F(p,q)\,\exp\{-E(p,q)/kT\}\,dp\,dq}{\int \exp\{-E(p,q)/kT\}\,dp\,dq}\,,
\]

with the energy E being defined as

\[
E(p,q) = \frac{1}{2}\sum_{i=1}^{N}\sum_{\substack{j=1\\ j\neq i}}^{N} V(d_{ij})\,,
\]

where N is the number of particles, V a potential function and d_ij the distance between particles i and j. The Boltzmann distribution exp{−E(p,q)/kT} is parameterised by the temperature T, k being the Boltzmann constant, with a normalisation factor

\[
Z(T) = \int \exp\{-E(p,q)/kT\}\,dp\,dq
\]

that is not available in closed form. Since p and q are 2N-dimensional vectors, numerical integration is impossible. Given the large dimension of the problem, even standard Monte Carlo techniques fail to correctly approximate I, since exp{−E(p,q)/kT} is very small for most realizations of the random configurations of the particle system (uniformly in the 2N or 4N square). In order to improve the efficiency of the Monte Carlo method, Metropolis et al. (1953) propose a random walk modification of the N particles. That is, for each particle i (1 ≤ i ≤ N), values

\[
x_i' = x_i + \alpha\xi_{1i} \qquad\text{and}\qquad y_i' = y_i + \alpha\xi_{2i}
\]

are proposed, where both ξ_1i and ξ_2i are uniform U(−1, 1). The energy difference ΔE between the new configuration and the previous one is then computed and the new configuration is accepted with probability

\[
1 \wedge \exp\{-\Delta E/kT\}\,, \tag{1}
\]

and otherwise the previous configuration is replicated, in the sense that it will count one more time in the final average of the F(p_t, q_t)'s over the τ moves of the random walk (1 ≤ t ≤ τ). Note that Metropolis et al. (1953) move one particle at a time, rather than moving all of them together, which makes the initial algorithm appear as a primitive kind of Gibbs sampling (!).

The authors of Metropolis et al. (1953) demonstrate the validity of the algorithm by first establishing irreducibility (which they call ergodicity) and second proving ergodicity, that is, convergence to the stationary distribution. The second part is obtained via a discretization of the space: they first note that the proposal move is reversible, then establish that exp{−E/kT} is invariant. The result is therefore proven in its full generality (modulo the
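The single-site random walk update with acceptance rule (1) can be sketched as follows. This is only an illustrative reconstruction, not the original MANIAC code: the harmonic energy function, the parameter values, and all function names are our own choices, and we recompute the full energy at each proposal rather than the incremental ΔE that the 1953 implementation would have used for efficiency.

```python
import math
import random

def metropolis_sweep(xs, ys, energy, alpha, kT, rng):
    """One sweep of the Metropolis et al. (1953) scheme: each particle in
    turn receives a uniform random-walk proposal, accepted with probability
    1 ∧ exp{-ΔE/kT}; on rejection the old configuration is replicated."""
    for i in range(len(xs)):
        x_old, y_old = xs[i], ys[i]
        e_old = energy(xs, ys)
        # proposal: x'_i = x_i + α ξ_1i,  y'_i = y_i + α ξ_2i,  ξ ~ U(-1, 1)
        xs[i] = x_old + alpha * rng.uniform(-1.0, 1.0)
        ys[i] = y_old + alpha * rng.uniform(-1.0, 1.0)
        delta_e = energy(xs, ys) - e_old
        if rng.random() >= min(1.0, math.exp(-delta_e / kT)):
            xs[i], ys[i] = x_old, y_old  # reject: keep (replicate) old state

# toy demo: three particles in a harmonic well (our stand-in for V(d_ij))
def harmonic_energy(xs, ys):
    return sum(x * x + y * y for x, y in zip(xs, ys))

rng = random.Random(1953)
xs, ys = [1.0, -1.0, 0.5], [0.0, 0.5, -0.5]
for sweep in range(200):
    metropolis_sweep(xs, ys, harmonic_energy, alpha=0.5, kT=1.0, rng=rng)
```

Averaging F over the visited configurations, rejections included, is what makes the replicated states "count one more time" in the estimate of I.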
discretization). The remainder of the paper is concerned with the specific problem of the rigid-sphere collision model. The number of iterations of the Metropolis algorithm seems to be limited: 16 steps for burn-in and 48 to 64 subsequent iterations (which still required four to five hours on the Los Alamos MANIAC).

An interesting variation of (1) is the Simulated Annealing algorithm, developed by Kirkpatrick et al. (1983), who connected optimization with annealing, the cooling of a metal. Their variation is to allow T in (1) to change as the algorithm runs, according to a "cooling schedule". The Simulated Annealing algorithm can be shown to find the global maximum with probability 1, although the analysis is quite complex because, with varying T, the algorithm is no longer a time-homogeneous Markov chain.

2.2 The Hastings (1970) paper

The Metropolis algorithm was later generalized by Hastings (1970) and Peskun (1973, 1981) as a statistical simulation tool that could overcome the curse of dimensionality met by regular Monte Carlo methods (already emphasized in Metropolis et al. 1953).⁵

In his Biometrika paper,⁶ Hastings (1970) also defines his methodology on finite and reversible Markov chains, treating the continuous case by a discretization analogy. The generic probability of acceptance for a move from state i to state j is

\[
\alpha_{ij} = \frac{s_{ij}}{1 + \dfrac{\pi_i}{\pi_j}\,\dfrac{q_{ij}}{q_{ji}}}\,,
\]

where s_ij is a symmetric function. This generic form of acceptance probability encompasses the forms of both Metropolis et al. (1953) and Barker (1965). At this stage, Peskun's ordering had not yet been discovered, and Hastings thus mentions that little is known about the relative merits of those two choices (even though) Metropolis's method may be preferable. He also warns against high rejection rates as indicative of a poor choice of transition matrix, but does not mention the opposite pitfall of low rejection rates, associated with a slow exploration of the target.
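The two classical choices are recovered from Hastings' generic form by particular symmetric functions s_ij. Writing r = (π_i q_ij)/(π_j q_ji), Barker's form corresponds to s_ij = 1 and the Metropolis form to s_ij = 1 + min(r, 1/r). A minimal sketch (function names are ours):

```python
def hastings_accept(s_ij, r):
    """Hastings' generic acceptance: α_ij = s_ij / (1 + r),
    with r = (π_i q_ij)/(π_j q_ji)."""
    return s_ij / (1.0 + r)

def barker_accept(r):
    """Barker (1965): s_ij = 1, giving α_ij = 1/(1 + r)."""
    return hastings_accept(1.0, r)

def metropolis_accept(r):
    """Metropolis et al. (1953): s_ij = 1 + min(r, 1/r), which is symmetric
    in (i, j) since swapping states replaces r by 1/r, and which reduces the
    acceptance to the familiar min(1, 1/r)."""
    return hastings_accept(1.0 + min(r, 1.0 / r), r)
```

A quick check of these formulas shows that the Metropolis acceptance is never below Barker's for any r > 0, which is consistent with the preference Hastings hints at and with the ordering Peskun (1973) later established.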
The examples given in the paper are a Poisson target with a ±1 random walk proposal, a normal target with a uniform random walk proposal mixed with its reflection (i.e., centered at −X(t) rather than X(t)), and then a multivariate target for which Hastings introduces a Gibbs sampling strategy, updating one component at a time and defining the composed transition as satisfying the stationarity condition because each component does leave the

⁵In fact, Hastings starts by mentioning a decomposition of the target distribution into a product of one-dimensional conditional distributions, but this falls short of an early Gibbs sampler!
⁶Hastings (1970) is one of the ten papers reproduced in the Biometrika 100th anniversary volume by Titterington and Cox (2001).
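Hastings' first example, the Poisson target with a ±1 random walk proposal, can be sketched as below. The proposal is symmetric, so the acceptance ratio reduces to π(y)/π(x), which for π(x) ∝ λ^x/x! is λ/(x+1) for an upward move and x/λ for a downward one; a proposed negative state has zero target mass and is always rejected. The parameter values and function name are our own illustrative choices.

```python
import random

def poisson_rw_mh(lam, n_iter, x0=0, seed=1970):
    """Metropolis-Hastings for a Poisson(λ) target with a ±1 random walk
    proposal, in the spirit of Hastings (1970)."""
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n_iter):
        step = 1 if rng.random() < 0.5 else -1
        y = x + step
        if y < 0:
            ratio = 0.0              # outside the support: certain rejection
        elif step == 1:
            ratio = lam / (x + 1)    # π(x+1)/π(x) = λ/(x+1)
        else:
            ratio = x / lam          # π(x-1)/π(x) = x/λ
        if rng.random() < min(1.0, ratio):
            x = y
        chain.append(x)
    return chain

chain = poisson_rw_mh(lam=3.0, n_iter=20000)
```

After discarding a burn-in portion, the empirical mean of the chain should settle near λ, the Poisson mean.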