Table 1. The EM Algorithm for Example 2 (from Little and Rubin (1987))

 t   θ^(t)          θ^(t) − θ̂      (θ^(t+1) − θ̂)/(θ^(t) − θ̂)
 0   0.500000000    0.126821498     0.1465
 1   0.608247423    0.018574075     0.1346
 2   0.624321051    0.002500447     0.1330
 3   0.626488879    0.000332619     0.1328
 4   0.626777323    0.000044176     0.1328
 5   0.626815632    0.000005866     0.1328
 6   0.626820719    0.000000779
 7   0.626821395    0.000000104
 8   0.626821484    0.000000014

Example 3: Sample from a Binomial/Poisson Mixture

The following table shows the number of children of N widows entitled to support from a certain pension fund.

 Number of children:     0    1    2    3    4    5    6
 Observed # of widows:   n_0  n_1  n_2  n_3  n_4  n_5  n_6

Since the actual data were not consistent with being a random sample from a Poisson distribution (the number of widows with no children being too large), the following alternative model was adopted. Assume that the discrete random variable is distributed as a mixture of two populations:

 Population A: with probability ξ, the random variable takes the value 0;
 Population B: with probability (1 − ξ), the random variable follows a Poisson distribution with mean λ.

Let the observed vector of counts be n_obs = (n_0, n_1, ..., n_6)^T. The problem is to obtain the maximum likelihood estimate of (λ, ξ). This is reformulated as an incomplete-data problem by regarding the observed number of widows with no children as the sum of observations coming from each of the two populations. Define

 n_0 = n_A + n_B,
 n_A = # widows with no children from population A,
 n_B = n_0 − n_A = # widows with no children from population B.

Now the problem becomes an incomplete-data problem because n_A is not observed. Let n = (n_A, n_B, n_1, n_2, ..., n_6) be the complete-data vector, where we assume that n_A and n_B are observed and n_0 = n_A + n_B.
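To make the incomplete-data structure concrete, the following short simulation sketch (not part of the original example; the sample size and the parameter values ξ = 0.6, λ = 1.0 are arbitrary choices for illustration) draws from the mixture and shows that only the total n_0 = n_A + n_B is observable, while the split into n_A and n_B is latent.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4000                # number of widows (arbitrary, for illustration)
xi, lam = 0.6, 1.0      # hypothetical mixture parameters

# Latent membership: True = population A (always 0 children), False = population B
from_A = rng.random(N) < xi
children = np.where(from_A, 0, rng.poisson(lam, size=N))

n0 = int(np.sum(children == 0))   # observed: widows with no children
nA = int(np.sum(from_A))          # latent: zero-children widows from population A
nB = n0 - nA                      # latent: zero-children widows from population B
print(n0, nA, nB)                 # only n0 would be available in the real data
```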
The complete-data density is then

\[
\begin{aligned}
f(\mathbf{n};\,\xi,\lambda)
  &= k(\mathbf{n})\,\{P(y=0)\}^{n_0}\,\prod_{i=1}^{6}\{P(y=i)\}^{n_i}\\[4pt]
  &= k(\mathbf{n})\,\bigl[\xi + (1-\xi)e^{-\lambda}\bigr]^{n_0}\,
     \prod_{i=1}^{6}\left\{(1-\xi)\,e^{-\lambda}\,\frac{\lambda^{i}}{i!}\right\}^{n_i}\\[4pt]
  &= k(\mathbf{n})\,\bigl[\xi + (1-\xi)e^{-\lambda}\bigr]^{n_A+n_B}\,
     \bigl\{(1-\xi)e^{-\lambda}\bigr\}^{\sum_{i=1}^{6}n_i}\,
     \prod_{i=1}^{6}\left(\frac{\lambda^{i}}{i!}\right)^{n_i},
\end{aligned}
\]

where k(n) = N!/(n_0! n_1! · · · n_6!) is the multinomial coefficient. Obviously, the complete-data sufficient statistic is (n_A, n_B, n_1, n_2, ..., n_6). The complete-data log-likelihood is

\[
\ell(\xi,\lambda;\,\mathbf{n}) = n_0\,\log\bigl(\xi + (1-\xi)e^{-\lambda}\bigr)
  + (N - n_0)\bigl[\log(1-\xi) - \lambda\bigr]
  + \sum_{i=1}^{6} i\,n_i\,\log\lambda + \text{const.}
\]

Thus the complete-data log-likelihood is linear in the sufficient statistic. The E-step requires computing E_{ξ,λ}(n | n_obs). This computation gives

\[
E_{\xi,\lambda}(n_i \mid \mathbf{n}_{\mathrm{obs}}) = n_i \quad \text{for } i = 1,\dots,6,
\]

and

\[
E_{\xi,\lambda}(n_A \mid \mathbf{n}_{\mathrm{obs}}) = \frac{n_0\,\xi}{\xi + (1-\xi)\exp(-\lambda)},
\]

since, given n_obs, n_A is Binomial(n_0, p) with p = p_A/(p_A + p_B), where p_A = ξ and p_B = (1 − ξ)e^{−λ}. The corresponding expression for E_{ξ,λ}(n_B | n_obs) follows from n_B = n_0 − n_A and is not needed separately in the E-step. So the E-step consists of computing

\[
n_A^{(t)} = \frac{n_0\,\xi^{(t)}}{\xi^{(t)} + \bigl(1-\xi^{(t)}\bigr)\exp\bigl(-\lambda^{(t)}\bigr)} \qquad (1)
\]

at the t-th iteration.
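As a quick numerical check of the E-step formula (1), the following sketch evaluates it at the starting values ξ = 0.75, λ = 0.40 and the observed count n_0 = 3,062 used for the pension data below; the result agrees with the first row of Table 2.

```python
from math import exp

def e_step(n0, xi, lam):
    """Expected number of zero-count observations from population A, eq. (1)."""
    return n0 * xi / (xi + (1.0 - xi) * exp(-lam))

# n0 and the starting values are those used for the pension data below
print(e_step(3062, 0.75, 0.40))   # approx. 2502.78, matching row t = 0 of Table 2
```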
For the M-step, the complete-data maximum likelihood estimates of (ξ, λ) are needed. To obtain these, note that n_A ~ Bin(N, ξ) and that n_B, n_1, ..., n_6 are the observed counts of the values 0, 1, ..., 6 from a Poisson distribution with parameter λ. Thus the complete-data maximum likelihood estimates of ξ and λ are

\[
\hat{\xi} = \frac{n_A}{N}
\qquad\text{and}\qquad
\hat{\lambda} = \frac{\sum_{i=1}^{6} i\,n_i}{n_B + \sum_{i=1}^{6} n_i}.
\]

The M-step therefore computes

\[
\xi^{(t+1)} = \frac{n_A^{(t)}}{N} \qquad (2)
\]

and

\[
\lambda^{(t+1)} = \frac{\sum_{i=1}^{6} i\,n_i}{n_B^{(t)} + \sum_{i=1}^{6} n_i}, \qquad (3)
\]

where n_B^(t) = n_0 − n_A^(t).

The EM algorithm consists of iterating between (1), (2) and (3) successively. The following data are reproduced from Thisted (1988).

 Number of children   0      1    2    3    4   5   6
 Number of widows     3,062  587  284  103  33  4   2

Starting with ξ^(0) = 0.75 and λ^(0) = 0.40, the following results were obtained.

Table 2. EM Iterations for the Pension Data

 t   ξ          λ          n_A        n_B
 0   0.75       0.40       2502.779   559.221
 1   0.614179   1.035478   2503.591   558.409
 2   0.614378   1.036013   2504.219   557.781
 3   0.614532   1.036427   2504.704   557.296
 4   0.614651   1.036747   2505.079   556.921
 5   0.614743   1.036995   2505.369   556.631

A single iteration produced estimates that are within 0.5% of the maximum likelihood estimates and are comparable to the results after about four iterations of Newton-Raphson. However, the convergence of the subsequent iterations is very slow, which is more typical of the behavior of the EM algorithm.
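Putting (1), (2) and (3) together, a minimal sketch of the iteration for the pension data might look as follows (Python; the variable names are illustrative). Its output should reproduce the rows of Table 2 up to rounding.

```python
from math import exp

# Pension data from Thisted (1988): counts of widows with 0,...,6 children
counts = [3062, 587, 284, 103, 33, 4, 2]
n0 = counts[0]
N = sum(counts)
sum_i_ni = sum(i * ni for i, ni in enumerate(counts))   # sum_{i=1}^{6} i * n_i
sum_ni = sum(counts[1:])                                 # sum_{i=1}^{6} n_i

xi, lam = 0.75, 0.40   # starting values
for t in range(6):
    # E-step, eq. (1): expected number of zero-count widows from population A
    nA = n0 * xi / (xi + (1.0 - xi) * exp(-lam))
    nB = n0 - nA
    print(t, round(xi, 6), round(lam, 6), round(nA, 3), round(nB, 3))
    # M-step, eqs. (2) and (3)
    xi = nA / N
    lam = sum_i_ni / (nB + sum_ni)
```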
Example 4: Variance Component Estimation (Little and Rubin (1987))

The following example is from Snedecor and Cochran (1967, p. 290). In a study of artificial insemination of cows, semen samples from six randomly selected bulls were tested for their ability to produce conceptions. The number of samples tested varied from bull to bull, and the response variable was the percentage of conceptions obtained from each sample. Here the interest is in the variability of the bull effects, which are modeled as random. The data are:

Table 3. Data for Example 4 (from Snedecor and Cochran (1967))

 Bull (i)   Percentages of Conception                 n_i
 1          46, 31, 37, 62, 30                         5
 2          70, 59                                     2
 3          52, 44, 57, 40, 67, 64, 70                 7
 4          47, 21, 70, 46, 14                         5
 5          42, 64, 50, 69, 77, 81, 87                 7
 6          35, 68, 59, 38, 57, 76, 57, 29, 60         9
 Total                                                35

A common model used for the analysis of such data is the one-way random effects model:

\[
y_{ij} = a_i + \epsilon_{ij}, \qquad j = 1,\dots,n_i;\ i = 1,\dots,k;
\]

where it is assumed that the bull effects a_i are distributed as i.i.d. N(µ, σ²_a) and the within-bull effects (errors) ε_ij as i.i.d. N(0, σ²) random variables, with the a_i and ε_ij independent.

The standard one-way random effects analysis of variance is:

 Source   d.f.   S.S.        M.S.     F      E(M.S.)
 Bull      5      3322.059   664.41   2.68   σ² + 5.67 σ²_a
 Error    29      7200.341   248.29          σ²
 Total    34     10522.400

Equating observed and expected mean squares from the above gives s² = 248.29 as the estimate of σ² and (664.41 − 248.29)/5.67 = 73.39 as the estimate of σ²_a.

To construct an EM algorithm to obtain MLEs of θ = (µ, σ²_a, σ²), first consider the joint density of y* = (y, a)^T, where y* is regarded as the complete data. This joint density can be written as the product of two factors: the first corresponds to the joint density of the y_ij given a_i, and the second to the joint density of the a_i:

\[
\begin{aligned}
f(\mathbf{y}^*;\,\theta) &= f_1(\mathbf{y}\mid\mathbf{a};\theta)\,f_2(\mathbf{a};\theta)\\
&= \prod_i\prod_j \frac{1}{\sqrt{2\pi}\,\sigma}\,
   e^{-\frac{1}{2\sigma^2}(y_{ij}-a_i)^2}\;
   \prod_i \frac{1}{\sqrt{2\pi}\,\sigma_a}\,
   e^{-\frac{1}{2\sigma_a^2}(a_i-\mu)^2}.
\end{aligned}
\]
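As a check on the method-of-moments calculation above, the following sketch (not part of the original text) computes the one-way ANOVA sums of squares and the resulting estimates of σ² and σ²_a directly from the data in Table 3.

```python
import numpy as np

# Percentages of conception for the six bulls (Table 3)
bulls = [
    [46, 31, 37, 62, 30],
    [70, 59],
    [52, 44, 57, 40, 67, 64, 70],
    [47, 21, 70, 46, 14],
    [42, 64, 50, 69, 77, 81, 87],
    [35, 68, 59, 38, 57, 76, 57, 29, 60],
]
n = np.array([len(b) for b in bulls])
k, N = len(bulls), n.sum()
means = np.array([np.mean(b) for b in bulls])
grand = sum(sum(b) for b in bulls) / N

ss_bull = np.sum(n * (means - grand) ** 2)            # between-bull SS, approx. 3322.06
ss_err = sum(((np.array(b) - m) ** 2).sum() for b, m in zip(bulls, means))
ms_bull, ms_err = ss_bull / (k - 1), ss_err / (N - k)
c = (N - np.sum(n ** 2) / N) / (k - 1)                # coefficient of sigma_a^2, approx. 5.67

sigma2_hat = ms_err                                   # approx. 248.29
sigma2_a_hat = (ms_bull - ms_err) / c                 # approx. 73.4 (text uses 5.67, giving 73.39)
print(ms_bull, ms_err, c, sigma2_hat, sigma2_a_hat)
```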
Thus, the log-likelihood is linear in the following complete-data sufficient statistics:

\[
T_1 = \sum_i a_i, \qquad
T_2 = \sum_i a_i^2, \qquad
T_3 = \sum_i \sum_j (y_{ij} - a_i)^2
    = \sum_i \sum_j (y_{ij} - \bar{y}_{i\cdot})^2 + \sum_i n_i(\bar{y}_{i\cdot} - a_i)^2 .
\]

Here "complete data" means that both y and a are available. Since only y is observed, let y*_obs = y. Then the E-step of the EM algorithm requires the computation of the expectations of T_1, T_2 and T_3 given y*_obs, i.e., E_θ(T_i | y) for i = 1, 2, 3. The conditional distribution of a given y is needed for computing these expectations. First, note that the joint distribution of y* = (y, a)^T is (N + k)-dimensional multivariate normal, N(µ*, Σ*), where µ* = (µ, µ_a)^T with µ = µ j_N and µ_a = µ j_k, and Σ* is the (N + k) × (N + k) matrix

\[
\Sigma^* = \begin{pmatrix} \Sigma & \Sigma_{12} \\ \Sigma_{12}^{T} & \sigma_a^2 I \end{pmatrix}.
\]

Here

\[
\Sigma = \begin{pmatrix} \Sigma_1 & & 0 \\ & \ddots & \\ 0 & & \Sigma_k \end{pmatrix},
\qquad
\Sigma_{12} = \sigma_a^2 \begin{pmatrix} \mathbf{j}_{n_1} & & 0 \\ & \ddots & \\ 0 & & \mathbf{j}_{n_k} \end{pmatrix},
\]

where Σ_i = σ² I_{n_i} + σ²_a J_{n_i} is an n_i × n_i matrix. The covariance matrix Σ of the joint distribution of y is obtained by recognizing that the y_ij are jointly normal with common mean µ and common variance σ² + σ²_a, with covariance σ²_a within the same bull and 0 between bulls. That is,

\[
\mathrm{Cov}(y_{ij},\, y_{i'j'}) = \mathrm{Cov}(a_i + \epsilon_{ij},\, a_{i'} + \epsilon_{i'j'}) =
\begin{cases}
\sigma^2 + \sigma_a^2 & \text{if } i = i',\ j = j',\\
\sigma_a^2            & \text{if } i = i',\ j \ne j',\\
0                     & \text{if } i \ne i'.
\end{cases}
\]

Σ_12 is the covariance of y and a and follows from the fact that Cov(y_ij, a_{i'}) = σ²_a if i = i' and 0 if i ≠ i'. The inverse of Σ is needed for computing the conditional distribution of a given y, and is obtained as

\[
\Sigma^{-1} = \begin{pmatrix} \Sigma_1^{-1} & & 0 \\ & \ddots & \\ 0 & & \Sigma_k^{-1} \end{pmatrix},
\qquad\text{where}\qquad
\Sigma_i^{-1} = \frac{1}{\sigma^2}\left[ I_{n_i} - \frac{\sigma_a^2}{\sigma^2 + n_i\sigma_a^2}\, J_{n_i} \right].
\]

Using a well-known theorem in multivariate normal theory, the distribution of a given y is N(α, A), where

\[
\boldsymbol{\alpha} = \boldsymbol{\mu}_a + \Sigma_{12}^{T}\,\Sigma^{-1}(\mathbf{y} - \boldsymbol{\mu})
\qquad\text{and}\qquad
A = \sigma_a^2 I - \Sigma_{12}^{T}\,\Sigma^{-1}\,\Sigma_{12}.
\]

It can be shown after some algebra that the a_i given y are independently distributed as

\[
a_i \mid \mathbf{y} \;\sim\; N\bigl(w_i\,\mu + (1 - w_i)\,\bar{y}_{i\cdot},\; v_i\bigr)
\]
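The "some algebra" can be verified numerically. The sketch below (my own illustration, not part of the text) builds Σ and Σ_12 for the bull sample sizes, evaluates the matrix expression α = µ_a + Σ_12^T Σ^{-1}(y − µ), and checks that it agrees with the scalar form above under the standard one-way random-effects weights w_i = σ²/(σ² + n_i σ²_a), which are assumed here for illustration; the parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = np.array([5, 2, 7, 5, 7, 9])           # n_i for the six bulls (Table 3)
k, N = len(n), n.sum()
mu, sigma2_a, sigma2 = 53.6, 73.4, 248.3   # illustrative parameter values

# Build Sigma (block diagonal), Sigma_12 and the mean structure
Sigma = np.zeros((N, N))
Sigma12 = np.zeros((N, k))
pos = 0
for i, ni in enumerate(n):
    Sigma[pos:pos + ni, pos:pos + ni] = sigma2 * np.eye(ni) + sigma2_a * np.ones((ni, ni))
    Sigma12[pos:pos + ni, i] = sigma2_a
    pos += ni

y = mu + rng.normal(size=N) * np.sqrt(sigma2 + sigma2_a)  # any y vector works for the check
alpha = mu + Sigma12.T @ np.linalg.solve(Sigma, y - mu)   # matrix form of E(a | y)

# Scalar form with assumed weights w_i = sigma^2 / (sigma^2 + n_i * sigma_a^2)
w = sigma2 / (sigma2 + n * sigma2_a)
ybar = np.array([y[sum(n[:i]):sum(n[:i + 1])].mean() for i in range(k)])
print(np.allclose(alpha, w * mu + (1 - w) * ybar))        # True
```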