3.3 Fowlkes-Mallows Index Fowlkes and Mallows introduced their index as a measure for comparing hierarchical clusterings2 [4].However,it can also be used for flat clusterings since it consists in calculating an index Bi for each level i=2,...,n-1 of the hierarchies in consideration and plotting Bi against i.The measure B is easily generalized to a measure for clusterings with different numbers of clusters.The generalized Fowlkes-Mallows Index is defined by FM(C,C')= ∑1f=1m喝-n n11 V(∑:lCP-n)(∑,lCgP-n)Vm1+no)m1+no In the context of Information Retrieval this measure can be interpreted as the geometric mean of precision (ratio of the number of retrieved relevant docmnsto the total mber of retrieved docmentand recall (ratio of the number of retrieved relevant documents to the total number of relevant documents n11) n11十n01 Like for the adjusted Rand Index,the "amount"of similarity of two clus- terings corresponds to the deviation from the expected value under the null hypothesis of independant clusterings with fixed cluster sizes.Again,the strong assumptions on the distribution make the result hard to interpret. Futhermore,this measure has the undesirable property that for small num- bers of clusters,the value is very high,even for independant clusterings (which even achieve the maximum value for small numbers of clusters).Wal- lace proposed to attenuate this effect by substracting the number of pairs whose match is forced by the cluster overlaps from the number of"good" pairs and from the number of all pairs [9]. 3.4 Mirkin Metric The Mirkin Metric which is also known as Equivalence Mismatch Distance [11]is defined by MC,C=】 It corresponds to the Hamming distance for binary vectors if the set of all pairs of elements is enumerated and a clustering is represented by a 2A hierarchical clustering of a set X is a hierarchy ofXI clusterings,with the two trivial clusterings at the top and bottom,respectively,and each level of the hierarchy is a refinement of all the levels above. 6
3.3 Fowlkes–Mallows Index Fowlkes and Mallows introduced their index as a measure for comparing hierarchical clusterings2 [4]. However, it can also be used for flat clusterings since it consists in calculating an index Bi for each level i = 2, . . . , n − 1 of the hierarchies in consideration and plotting Bi against i. The measure Bi is easily generalized to a measure for clusterings with different numbers of clusters. The generalized Fowlkes–Mallows Index is defined by FM(C, C 0 ) = Pk i=1 P` j=1 m2 ij − n q ( P i |Ci | 2 − n)(P j |C0 j | 2 − n) = p n11 (n11 + n10)(n11 + n01) In the context of Information Retrieval this measure can be interpreted as the geometric mean of precision (ratio of the number of retrieved relevant documents to the total number of retrieved documents = n11 n11+n10 ) and recall (ratio of the number of retrieved relevant documents to the total number of relevant documents = n11 n11+n01 ). Like for the adjusted Rand Index, the ”amount” of similarity of two clusterings corresponds to the deviation from the expected value under the null hypothesis of independant clusterings with fixed cluster sizes. Again, the strong assumptions on the distribution make the result hard to interpret. Futhermore, this measure has the undesirable property that for small numbers of clusters, the value is very high, even for independant clusterings (which even achieve the maximum value for small numbers of clusters). Wallace proposed to attenuate this effect by substracting the number of pairs whose match is forced by the cluster overlaps from the number of ”good” pairs and from the number of all pairs [9]. 3.4 Mirkin Metric The Mirkin Metric which is also known as Equivalence Mismatch Distance [11] is defined by M(C, C 0 ) = X k i=1 |Ci | 2 + X ` j=1 |C 0 j | 2 − 2 X k i=1 X l j=1 m2 ij . It corresponds to the Hamming distance for binary vectors if the set of all pairs of elements is enumerated and a clustering is represented by a 2A hierarchical clustering of a set X is a hierarchy of |X| clusterings, with the two trivial clusterings at the top and bottom, respectively, and each level of the hierarchy is a refinement of all the levels above. 6
binary vector defined on this enumeration.An advantage is the fact that this distance is a metric on P(X).However,this measure is very sensitive to cluster sizes such that two clusterings that are"at right angles"to each other (i.e.each cluster in one clustering contains the same amount of elements of each of the clusters of the other clustering)are closer to each other than two clusterings for which one is a refinement of the other [11].The Mirkin Metric is a variation of the Rand Index [7]since it can be rewritten as M(C,C')=2(no1+n1o)=n(n-1)(1-R(C,C') 3.5 Other measures 3.5.1 Jaccard Index The Jaccard Index is very common in geology and ecology,e.g.for measur- ing the species diversity between two different communities [10.It is very similar to the Rand Index,however it disregards the pairs of elements that are in different clusters for both clusterings.It is defined as follows: J(C,C)= n11 n11+n10+n01 3.5.2 Partition Difference The Partition Difference [19]simply counts the pairs of elements that belong to different clusters unter both clusterings: PD(C,C=n0o According to [19],this measure is commonly used.In our opinion,it has too many drawbacks and should therefore not be used:the measure wants to express a distance,but it is not a distance in the mathematical sense, since it fulfills neither the identity of indiscernibles-property (you can have PD(C,C)=0,but CC,e.g.for the trivial one-clustering C and an arbitrary clustering CC),nor the triangle inequality (take two arbi- trary,non-trivial clusterings CC and the trivial one-clustering C",then PD(C,C")+PD(C",C)=0<PD(C,C)).The measure is sensitive to clus- ter sizes and the number of clusters and since it is not normalized,the values are hard to interpret(what does a distance of 5 mean,when we do not know the total number of pairs of elements?). 7
binary vector defined on this enumeration. An advantage is the fact that this distance is a metric on P(X). However, this measure is very sensitive to cluster sizes such that two clusterings that are ”at right angles” to each other (i.e. each cluster in one clustering contains the same amount of elements of each of the clusters of the other clustering) are closer to each other than two clusterings for which one is a refinement of the other [11]. The Mirkin Metric is a variation of the Rand Index [7] since it can be rewritten as M(C, C 0 ) = 2(n01 + n10) = n(n − 1)(1 − R(C, C 0 )). 3.5 Other measures 3.5.1 Jaccard Index The Jaccard Index is very common in geology and ecology, e.g. for measuring the species diversity between two different communities [10]. It is very similar to the Rand Index, however it disregards the pairs of elements that are in different clusters for both clusterings. It is defined as follows: J (C, C 0 ) = n11 n11 + n10 + n01 3.5.2 Partition Difference The Partition Difference [19] simply counts the pairs of elements that belong to different clusters unter both clusterings: PD(C, C 0 ) = n00 According to [19], this measure is commonly used. In our opinion, it has too many drawbacks and should therefore not be used: the measure wants to express a distance, but it is not a distance in the mathematical sense, since it fulfills neither the identity of indiscernibles-property (you can have PD(C, C 0 ) = 0, but C 6= C 0 , e.g. for the trivial one-clustering C and an arbitrary clustering C 0 6= C), nor the triangle inequality (take two arbitrary, non-trivial clusterings C 6= C 0 and the trivial one-clustering C 00, then PD(C, C 00) + PD(C 00 , C 0 ) = 0 < PD(C, C 0 )). The measure is sensitive to cluster sizes and the number of clusters and since it is not normalized, the values are hard to interpret (what does a distance of 5 mean, when we do not know the total number of pairs of elements?). 7