4. 0 is an eigenvalue of $L_{rw}$ with the constant one vector $\mathbb{1}$ as eigenvector. 0 is an eigenvalue of $L_{sym}$ with eigenvector $D^{1/2}\mathbb{1}$.

5. $L_{sym}$ and $L_{rw}$ are positive semi-definite and have $n$ non-negative real-valued eigenvalues $0 = \lambda_1 \leq \ldots \leq \lambda_n$.

Proof. Part (1) can be proved similarly to Part (1) of Proposition 1. Part (2) can be seen immediately by multiplying the eigenvalue equation $L_{sym} w = \lambda w$ with $D^{-1/2}$ from the left and substituting $u = D^{-1/2} w$. Part (3) follows directly by multiplying the eigenvalue equation $L_{rw} u = \lambda u$ with $D$ from the left. Part (4): The first statement is obvious as $L_{rw} \mathbb{1} = 0$; the second statement follows from (2). Part (5): The statement about $L_{sym}$ follows from (1), and then the statement about $L_{rw}$ follows from (2).

As is the case for the unnormalized graph Laplacian, the multiplicity of the eigenvalue 0 of the normalized graph Laplacian is related to the number of connected components:

Proposition 4 (Number of connected components and spectra of $L_{sym}$ and $L_{rw}$) Let $G$ be an undirected graph with non-negative weights. Then the multiplicity $k$ of the eigenvalue 0 of both $L_{rw}$ and $L_{sym}$ equals the number of connected components $A_1, \ldots, A_k$ in the graph. For $L_{rw}$, the eigenspace of 0 is spanned by the indicator vectors $\mathbb{1}_{A_i}$ of those components. For $L_{sym}$, the eigenspace of 0 is spanned by the vectors $D^{1/2} \mathbb{1}_{A_i}$.

Proof. The proof is analogous to the one of Proposition 2, using Proposition 3.

4 Spectral Clustering Algorithms

Now we would like to state the most common spectral clustering algorithms. For references and the history of spectral clustering we refer to Section 9. We assume that our data consists of $n$ "points" $x_1, \ldots, x_n$, which can be arbitrary objects. We measure their pairwise similarities $s_{ij} = s(x_i, x_j)$ by some similarity function which is symmetric and non-negative, and we denote the corresponding similarity matrix by $S = (s_{ij})_{i,j=1,\ldots,n}$.

Unnormalized spectral clustering

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.

• Construct a similarity graph by one of the ways described in Section 2. Let $W$ be its weighted adjacency matrix.
• Compute the unnormalized Laplacian $L$.
• Compute the first $k$ eigenvectors $u_1, \ldots, u_k$ of $L$.
• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.
• For $i = 1, \ldots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $U$.
• Cluster the points $(y_i)_{i=1,\ldots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \ldots, C_k$.

Output: Clusters $A_1, \ldots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.
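To make the boxed procedure concrete, the following is a minimal sketch in Python (NumPy, SciPy, scikit-learn). It assumes the weighted adjacency matrix $W$ of the similarity graph has already been built as described in Section 2, it uses a dense eigensolver that is only practical for moderate $n$, and the function name and interface are illustrative choices rather than anything prescribed by this tutorial.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def unnormalized_spectral_clustering(W, k):
    """Sketch of unnormalized spectral clustering.

    W : (n, n) symmetric, non-negative weighted adjacency matrix.
    k : number of clusters to construct.
    Returns an array of n cluster labels.
    """
    d = W.sum(axis=1)                      # degrees d_i = sum_j w_ij
    L = np.diag(d) - W                     # unnormalized Laplacian L = D - W
    # First k eigenvectors of L (eigh returns eigenvalues in ascending order).
    _, U = eigh(L, subset_by_index=[0, k - 1])
    # The rows of U are the new representations y_i in R^k; cluster them with k-means.
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```

For large graphs one would typically store $W$ as a sparse matrix and use a sparse eigensolver such as scipy.sparse.linalg.eigsh instead of the dense routine.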
There are two different versions of normalized spectral clustering, depending on which of the normalized graph Laplacians is used. We name both algorithms after two popular papers; for more references and history please see Section 9.

Normalized spectral clustering according to Shi and Malik (2000)

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.

• Construct a similarity graph by one of the ways described in Section 2. Let $W$ be its weighted adjacency matrix.
• Compute the unnormalized Laplacian $L$.
• Compute the first $k$ generalized eigenvectors $u_1, \ldots, u_k$ of the generalized eigenproblem $L u = \lambda D u$.
• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.
• For $i = 1, \ldots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $U$.
• Cluster the points $(y_i)_{i=1,\ldots,n}$ in $\mathbb{R}^k$ with the k-means algorithm into clusters $C_1, \ldots, C_k$.

Output: Clusters $A_1, \ldots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.

Note that this algorithm uses the generalized eigenvectors of $L$, which according to Proposition 3 correspond to the eigenvectors of the matrix $L_{rw}$. So in fact, the algorithm works with eigenvectors of the normalized Laplacian $L_{rw}$, and hence is called normalized spectral clustering. The next algorithm also uses a normalized Laplacian, but this time the matrix $L_{sym}$ instead of $L_{rw}$. As we will see, this algorithm needs to introduce an additional row normalization step which is not needed in the other algorithms. The reasons will become clear in Section 7.

Normalized spectral clustering according to Ng, Jordan, and Weiss (2002)

Input: Similarity matrix $S \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.

• Construct a similarity graph by one of the ways described in Section 2. Let $W$ be its weighted adjacency matrix.
• Compute the normalized Laplacian $L_{sym}$.
• Compute the first $k$ eigenvectors $u_1, \ldots, u_k$ of $L_{sym}$.
• Let $U \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.
• Form the matrix $T \in \mathbb{R}^{n \times k}$ from $U$ by normalizing the rows to norm 1, that is, set $t_{ij} = u_{ij} / \big( \sum_k u_{ik}^2 \big)^{1/2}$.
• For $i = 1, \ldots, n$, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the $i$-th row of $T$.
• Cluster the points $(y_i)_{i=1,\ldots,n}$ with the k-means algorithm into clusters $C_1, \ldots, C_k$.

Output: Clusters $A_1, \ldots, A_k$ with $A_i = \{ j \mid y_j \in C_i \}$.

All three algorithms stated above look rather similar, apart from the fact that they use three different graph Laplacians. In all three algorithms, the main trick is to change the representation of the abstract data points $x_i$ to points $y_i \in \mathbb{R}^k$. It is due to the properties of the graph Laplacians that this change of representation is useful. We will see in the next sections that this change of representation enhances the cluster properties in the data, so that clusters can be trivially detected in the new representation. In particular, the simple k-means clustering algorithm has no difficulties detecting the clusters in this new representation. Readers not familiar with k-means can read up on this algorithm in numerous text books, for example in Hastie, Tibshirani, and Friedman (2001).
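As a companion to the previous sketch, here is a hedged Python sketch of the Ng, Jordan, and Weiss variant based on $L_{sym}$, again assuming a precomputed adjacency matrix $W$ with no isolated vertices and dense linear algebra. The Shi and Malik variant is indicated in a comment, since by Proposition 3 it amounts to solving the generalized eigenproblem $L u = \lambda D u$ and skipping the row normalization. Names and interfaces are illustrative, not taken from the tutorial.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def normalized_spectral_clustering_njw(W, k):
    """Sketch of normalized spectral clustering with L_sym (Ng/Jordan/Weiss style)."""
    d = W.sum(axis=1)                                   # degrees; assumed strictly positive
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, U = eigh(L_sym, subset_by_index=[0, k - 1])      # first k eigenvectors as columns
    # Row normalization: t_ij = u_ij / (sum_k u_ik^2)^(1/2)
    T = U / np.linalg.norm(U, axis=1, keepdims=True)
    return KMeans(n_clusters=k, n_init=10).fit_predict(T)

# Shi/Malik variant (sketch): solve L u = lambda D u instead, e.g.
#   _, U = eigh(np.diag(d) - W, np.diag(d), subset_by_index=[0, k - 1])
# and cluster the rows of U directly, without the row normalization step.
```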
[Figure 1 panels not reproduced: a histogram of the sample and, for each of the four settings (normalized/unnormalized Laplacian, kNN graph/fully connected graph), plots of the eigenvalues and of eigenvectors 1 to 5.]

Figure 1: Toy example for spectral clustering where the data points have been drawn from a mixture of four Gaussians on $\mathbb{R}$. Left upper corner: histogram of the data. First and second row: eigenvalues and eigenvectors of $L_{rw}$ and $L$ based on the $k$-nearest neighbor graph. Third and fourth row: eigenvalues and eigenvectors of $L_{rw}$ and $L$ based on the fully connected graph. For all plots, we used the Gaussian kernel with $\sigma = 1$ as similarity function. See text for more details.

Before we dive into the theory of spectral clustering, we would like to illustrate its principle on a very simple toy example. This example will be used at several places in this tutorial, and we chose it because it is so simple that the relevant quantities can easily be plotted. This toy data set consists of a random sample of 200 points $x_1, \ldots, x_{200} \in \mathbb{R}$ drawn according to a mixture of four Gaussians. The first row of Figure 1 shows the histogram of a sample drawn from this distribution (the $x$-axis represents the one-dimensional data space). As similarity function on this data set we choose the Gaussian similarity function $s(x_i, x_j) = \exp(-|x_i - x_j|^2 / (2\sigma^2))$ with $\sigma = 1$. As similarity graph we consider both the fully connected graph and the 10-nearest neighbor graph. In Figure 1 we show the first eigenvalues and eigenvectors of the unnormalized Laplacian $L$ and the normalized Laplacian $L_{rw}$. That is, in the eigenvalue plot we plot $i$ vs. $\lambda_i$ (for the moment ignore the dashed line and the different shapes of the eigenvalues in the plots for the unnormalized case; their meaning will be discussed in Section 8.5). In the eigenvector plots of an eigenvector $u = (u_1, \ldots, u_{200})'$ we plot $x_i$ vs. $u_i$ (note that in the example chosen $x_i$ is simply a real number, hence we can depict it on the $x$-axis). The first two rows of Figure 1 show the results based on the 10-nearest neighbor graph. We can see that the first four eigenvalues are 0, and the corresponding eigenvectors are cluster indicator vectors. The reason is that the clusters form disconnected parts in the 10-nearest neighbor graph, in which case the eigenvectors are given as in Propositions 2 and 4.
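The following Python sketch reproduces the flavor of this experiment for the 10-nearest neighbor graph. The mixture means and spreads, the symmetrization of the kNN graph, and all names are illustrative assumptions; the tutorial does not fix these details.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
# 200 points on R from a mixture of four Gaussians (means/spreads chosen for illustration).
x = np.sort(np.concatenate([rng.normal(m, 0.25, 50) for m in (1.0, 3.0, 5.0, 7.0)]))

# Gaussian similarity s(x_i, x_j) = exp(-|x_i - x_j|^2 / (2 sigma^2)) with sigma = 1.
sigma = 1.0
S = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * sigma ** 2))

# Symmetrized 10-nearest neighbor graph, weighted by the Gaussian similarities.
knn = kneighbors_graph(x.reshape(-1, 1), n_neighbors=10).toarray()
W = np.maximum(knn, knn.T) * S

d = W.sum(axis=1)
L = np.diag(d) - W                 # unnormalized Laplacian
# By Proposition 3, the spectrum of L_rw coincides with that of the generalized
# problem L u = lambda D u, which a symmetric eigensolver can handle directly.
vals, vecs = eigh(L, np.diag(d))
print(np.round(vals[:5], 4))       # first four eigenvalues ~0 if the four groups
                                   # form disconnected parts of the kNN graph
```

With well-separated components the first four eigenvectors returned here are, up to rescaling, the indicator vectors of the four groups, which is exactly the behavior visible in the first rows of Figure 1.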
The next two rows show the results for the fully connected graph. As the Gaussian similarity function is always positive, this graph only consists of one connected component. Thus, eigenvalue 0 has multiplicity 1, and the first eigenvector is the constant vector. The following eigenvectors carry the information about the clusters. For example in the unnormalized case (last row), if we threshold the second eigenvector at 0, then the part below 0 corresponds to clusters 1 and 2, and the part above 0 to clusters 3 and 4. Similarly, thresholding the third eigenvector separates clusters 1 and 4 from clusters 2 and 3, and thresholding the fourth eigenvector separates clusters 1 and 3 from clusters 2 and 4. Altogether, the first four eigenvectors carry all the information about the four clusters. In all the cases illustrated in this figure, spectral clustering using k-means on the first four eigenvectors easily detects the correct four clusters.

5 Graph cut point of view

The intuition of clustering is to separate points in different groups according to their similarities. For data given in the form of a similarity graph, this problem can be restated as follows: we want to find a partition of the graph such that the edges between different groups have a very low weight (which means that points in different clusters are dissimilar from each other) and the edges within a group have high weight (which means that points within the same cluster are similar to each other). In this section we will see how spectral clustering can be derived as an approximation to such graph partitioning problems.

Given a similarity graph with adjacency matrix $W$, the simplest and most direct way to construct a partition of the graph is to solve the mincut problem. To define it, please recall the notation $W(A, B) := \sum_{i \in A, j \in B} w_{ij}$ and $\bar{A}$ for the complement of $A$. For a given number $k$ of subsets, the mincut approach simply consists in choosing a partition $A_1, \ldots, A_k$ which minimizes
\[
\operatorname{cut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} W(A_i, \bar{A}_i).
\]
Here we introduce the factor $1/2$ for notational consistency, otherwise we would count each edge twice in the cut. In particular for $k = 2$, mincut is a relatively easy problem and can be solved efficiently, see Stoer and Wagner (1997) and the discussion therein. However, in practice it often does not lead to satisfactory partitions. The problem is that in many cases, the solution of mincut simply separates one individual vertex from the rest of the graph. Of course this is not what we want to achieve in clustering, as clusters should be reasonably large groups of points. One way to circumvent this problem is to explicitly request that the sets $A_1, \ldots, A_k$ are "reasonably large". The two most common objective functions to encode this are RatioCut (Hagen and Kahng, 1992) and the normalized cut Ncut (Shi and Malik, 2000). In RatioCut, the size of a subset $A$ of a graph is measured by its number of vertices $|A|$, while in Ncut the size is measured by the weights of its edges $\operatorname{vol}(A)$. The definitions are:
\[
\operatorname{RatioCut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{|A_i|} = \sum_{i=1}^{k} \frac{\operatorname{cut}(A_i, \bar{A}_i)}{|A_i|}
\]
\[
\operatorname{Ncut}(A_1, \ldots, A_k) := \frac{1}{2} \sum_{i=1}^{k} \frac{W(A_i, \bar{A}_i)}{\operatorname{vol}(A_i)} = \sum_{i=1}^{k} \frac{\operatorname{cut}(A_i, \bar{A}_i)}{\operatorname{vol}(A_i)}.
\]
Note that both objective functions take a small value if the clusters $A_i$ are not too small.
In particular, the minimum of the function $\sum_{i=1}^{k} (1/|A_i|)$ is achieved if all $|A_i|$ coincide, and the minimum of $\sum_{i=1}^{k} (1/\operatorname{vol}(A_i))$ is achieved if all $\operatorname{vol}(A_i)$ coincide. So what both objective functions try to achieve is that the clusters are "balanced", as measured by the number of vertices or edge weights, respectively.

Unfortunately, introducing balancing conditions makes the previously simple-to-solve mincut problem NP hard; see Wagner and Wagner (1993) for a discussion.
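To make these objective functions concrete, here is a short Python sketch that evaluates cut, RatioCut, and Ncut for a given partition of the vertex set, using the right-hand forms of the definitions above with each boundary term $\operatorname{cut}(A_i, \bar{A}_i) = W(A_i, \bar{A}_i)$. The tiny example at the end is purely illustrative.

```python
import numpy as np

def _boundary_weight(W, idx):
    """W(A, complement of A): total weight of edges leaving the vertex set idx."""
    mask = np.zeros(W.shape[0], dtype=bool)
    mask[idx] = True
    return W[np.ix_(mask, ~mask)].sum()

def cut(W, parts):
    """cut(A_1, ..., A_k) = 1/2 * sum_i W(A_i, complement of A_i)."""
    return 0.5 * sum(_boundary_weight(W, A) for A in parts)

def ratio_cut(W, parts):
    """RatioCut = sum_i cut(A_i, complement) / |A_i|."""
    return sum(_boundary_weight(W, A) / len(A) for A in parts)

def ncut(W, parts):
    """Ncut = sum_i cut(A_i, complement) / vol(A_i), with vol(A) the sum of degrees in A."""
    deg = W.sum(axis=1)
    return sum(_boundary_weight(W, A) / deg[A].sum() for A in parts)

# Tiny illustration: two 3-vertex cliques joined by one weak edge.
W = np.zeros((6, 6))
W[:3, :3] = W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = W[3, 2] = 0.1
print(ratio_cut(W, [[0, 1, 2], [3, 4, 5]]))   # balanced split: small RatioCut (~0.067)
print(ratio_cut(W, [[0], [1, 2, 3, 4, 5]]))   # splitting off one vertex: much larger (2.4)
```

The example shows the balancing effect: cutting off a single vertex may remove few edges, but dividing by $|A_i|$ (or $\operatorname{vol}(A_i)$) heavily penalizes such lopsided partitions.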
Spectral clustering is a way to solve relaxed versions of those problems. We will see that relaxing Ncut leads to normalized spectral clustering, while relaxing RatioCut leads to unnormalized spectral clustering (see also the tutorial slides by Ding (2004)).

5.1 Approximating RatioCut for k = 2

Let us start with the case of RatioCut and $k = 2$, because the relaxation is easiest to understand in this setting. Our goal is to solve the optimization problem
\[
\min_{A \subset V} \operatorname{RatioCut}(A, \bar{A}). \tag{1}
\]
We first rewrite the problem in a more convenient form. Given a subset $A \subset V$ we define the vector $f = (f_1, \ldots, f_n)' \in \mathbb{R}^n$ with entries
\[
f_i =
\begin{cases}
\sqrt{|\bar{A}| / |A|} & \text{if } v_i \in A \\[2pt]
-\sqrt{|A| / |\bar{A}|} & \text{if } v_i \in \bar{A}.
\end{cases} \tag{2}
\]
Now the RatioCut objective function can be conveniently rewritten using the unnormalized graph Laplacian. This is due to the following calculation:
\[
\begin{aligned}
f' L f &= \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2 \\
&= \frac{1}{2} \sum_{i \in A,\, j \in \bar{A}} w_{ij} \left( \sqrt{\frac{|\bar{A}|}{|A|}} + \sqrt{\frac{|A|}{|\bar{A}|}} \right)^{\!2} + \frac{1}{2} \sum_{i \in \bar{A},\, j \in A} w_{ij} \left( -\sqrt{\frac{|\bar{A}|}{|A|}} - \sqrt{\frac{|A|}{|\bar{A}|}} \right)^{\!2} \\
&= \operatorname{cut}(A, \bar{A}) \left( \frac{|\bar{A}|}{|A|} + \frac{|A|}{|\bar{A}|} + 2 \right) \\
&= \operatorname{cut}(A, \bar{A}) \left( \frac{|A| + |\bar{A}|}{|A|} + \frac{|A| + |\bar{A}|}{|\bar{A}|} \right) \\
&= |V| \cdot \operatorname{RatioCut}(A, \bar{A}).
\end{aligned}
\]
Additionally, we have
\[
\sum_{i=1}^{n} f_i = \sum_{i \in A} \sqrt{\frac{|\bar{A}|}{|A|}} - \sum_{i \in \bar{A}} \sqrt{\frac{|A|}{|\bar{A}|}} = |A| \sqrt{\frac{|\bar{A}|}{|A|}} - |\bar{A}| \sqrt{\frac{|A|}{|\bar{A}|}} = 0.
\]
In other words, the vector $f$ as defined in Equation (2) is orthogonal to the constant one vector $\mathbb{1}$. Finally, note that $f$ satisfies
\[
\|f\|^2 = \sum_{i=1}^{n} f_i^2 = |A| \frac{|\bar{A}|}{|A|} + |\bar{A}| \frac{|A|}{|\bar{A}|} = |\bar{A}| + |A| = n.
\]
Altogether we can see that the problem of minimizing (1) can be equivalently rewritten as
\[
\min_{A \subset V} f' L f \quad \text{subject to } f \perp \mathbb{1}, \; f_i \text{ as defined in Eq. (2)}, \; \|f\| = \sqrt{n}. \tag{3}
\]
This is a discrete optimization problem as the entries of the solution vector $f$ are only allowed to take two particular values, and of course it is still NP hard. The most obvious relaxation in this setting is