What is a Cluster? Perspectives from Game Theory

Marcello Pelillo and Samuel Rota Bulò
Department of Computer Science
Ca’ Foscari University, Venice, Italy
{pelillo,srotabul}@dsi.unive.it

Abstract. Contrary to the vast majority of approaches to clustering, which view the problem as one of partitioning a set of observations into coherent classes, thereby obtaining the clusters as a by-product of the partitioning process, we propose to reverse the terms of the problem and attempt instead to derive a rigorous formulation of the very notion of a cluster. In our endeavor to provide an answer to this question, we found that game theory offers a very elegant and general perspective that serves well our purposes. Accordingly, we formulate the clustering problem as a non-cooperative clustering game. Within this context, the notion of a cluster turns out to be equivalent to a classical equilibrium concept from (evolutionary) game theory.

1 Motivations

There is no shortage of clustering algorithms, and recently a new wave of excitement has spread across the machine learning community mainly because of the important development of spectral methods. At the same time, there is also growing interest around fundamental questions pertaining to the very nature of the clustering problem [1–3]. Yet, despite the tremendous progress in the field, the clustering problem remains elusive and a satisfactory answer even to the most basic questions is still to come.

Upon scrutinizing the relevant literature on the subject, it becomes apparent that the vast majority of the existing approaches deal with a very specific version of the problem, which asks for partitioning the input data into coherent classes.
In fact, almost invariably, the problem of clustering is defined as a partitioning problem, and even the classical distinction between hierarchical and partitional algorithms [4] seems to suggest the idea that partitioning data is, in essence, what clustering is all about (as hierarchies are but nested partitions). This is unfortunate, because it has drawn the community’s attention away from different, and more general, variants of the problem and has led people to neglect underdeveloped foundational issues. As J. Hartigan clearly put it more than a decade ago: “We pay too much attention to the details of algorithms. [...] We must begin to subordinate engineering to philosophy.” [5, p. 3].

The partitional paradigm (as we will call it, following Kuhn) is attractive as it leads to elegant mathematical and algorithmic treatments and allows us to employ powerful ideas from such sophisticated fields as linear algebra, graph theory, optimization, statistics, information theory, etc. However, there are several (far too often neglected) reasons for feeling uncomfortable with this oversimplified formulation. Probably the best-known limitation of the partitional approach is the typical (algorithmic) requirement that the number of clusters be known in advance, but there is more than that.

To begin, the very idea of a partition implies that all the input data will have to get assigned to some class. This subsumes the old philosophical view which gives categories an a priori ontological status, namely that they exist independent of human experience, a view which has now been discredited by cognitive scientists, linguists, philosophers, and machine learning researchers alike (see, e.g., [6–8]). Further, there are various applications for which it makes little sense to force all data items to belong to some group, a process which might result either in poorly-coherent clusters or in the creation of extra spurious classes.
As an extreme example, consider the classical figure/ground separation problem in computer vision which asks for extracting a coherent region (the figure) from a noisy background [9, 10]. It is clear that, due to their intrinsic nature, partitional algorithms have no chance of satisfactorily solving this problem, being, as they are, explicitly designed to partition all the input data, and hence the unstructured clutter items too, into coherent groups. More recently, motivated by practical applications arising in document retrieval and bioinformatics, a conceptually identical problem has attracted some attention within the machine learning community and is generally known under the name of one-class clustering [11, 12].

The second intrinsic limitation of the partitional paradigm is even more severe as it imposes that each element cannot belong to more than one cluster. There are a variety of important applications, however, where this requirement is too restrictive. Examples abound and include, e.g., clustering micro-array gene expression data (wherein a gene often participates in more than one process), clustering documents into topic categories, perceptual grouping, and segmentation of images with transparent surfaces. In fact, the importance of dealing with overlapping clusters was recognized long ago [13] and recently, in the machine learning community, there has been a renewed interest around this problem [14, 15]. Typically, this is solved by relaxing the constraints imposed by crisp partitions in such a way as to have “soft” boundaries between clusters.

Finally, we would like to mention another limitation of current state-of-the-art approaches to clustering which, admittedly, is not caused in any direct way by the partitioning assumption but, rather, by the intrinsic nature of the technical tools typically used to attack the problem.
This is the symmetry assumption, namely the requirement that the similarities between the data being clustered be symmetric (and non-negative). Indeed, since Tversky’s classical work [16], it has been widely recognized by psychologists that similarity is an asymmetric relation. Further, there are many practical applications where asymmetric (or, more generally, non-metric) similarities do arise quite naturally. For example, such (dis)similarity measures are typically derived when images, shapes or sequences are aligned in a template matching process. In image and video processing, these measures are preferred in the presence of partially occluded objects [17]. Other examples include pairwise structural alignments of proteins that focus on local similarity [18], variants of the Hausdorff distance [19], normalized edit-distances, and probabilistic measures such as the Kullback-Leibler divergence. A common method to deal with asymmetric affinities is simply to symmetrize them, but in so doing we might lose important information that resides in the asymmetry. As argued in [17], the violation of metricity is often not an artifact of poor choice of features or algorithms, but it is inherent in the problem of robust matching when different parts of objects (shapes) are matched to different images. The same argument may hold for any type of local alignments. Corrections or simplifications of the original affinity matrix may therefore destroy essential information.

Although probabilistic model-based approaches do not suffer from several of the limitations mentioned above, here we will suggest an alternative strategy. Instead of insisting on the idea of determining a partition of the input data, and hence obtaining the clusters as a by-product of the partitioning process, we propose to reverse the terms of the problem and attempt instead to derive a rigorous formulation of the very notion of a cluster.
Clearly, the conceptual question “what is a cluster?” is as hopeless, in its full generality, as is its companion “what is an optimal clustering?” which has dominated the literature in the past few decades, both being two sides of the same coin. An attempt to answer the former question, however, besides shedding fresh light on the nature of the clustering problem, would allow us, as a consequence, to naturally overcome the major limitations of the partitional approach alluded to above, and to deal with more general problems where, e.g., clusters may overlap and clutter elements may get unassigned, thereby hopefully reducing the gap between theory and practice.

In our endeavor to provide an answer to the question raised above, we found that game theory offers a very elegant and general perspective that serves well our purposes [20–22]. The starting point is the elementary observation that a “cluster” may be informally defined as a maximally coherent set of data items, i.e., as a subset $C$ of the input data which satisfies both an internal criterion (all elements belonging to $C$ should be highly similar to each other) and an external one (no larger cluster should contain $C$ as a proper subset). We then formulate the clustering problem as a non-cooperative clustering game. Within this context, the notion of a cluster turns out to be equivalent to a classical equilibrium concept from (evolutionary) game theory, as the latter reflects both the internal and external cluster conditions mentioned above.

2 The clustering game

The (pairwise) clustering problem can be formulated as the following (two-player) game. Assume a pre-existing set of objects $O$ and a (possibly asymmetric and even negative) matrix of affinities $A$ between the elements of $O$. Two players with complete knowledge of the setup play by simultaneously selecting an element of $O$.
After both have shown their choice, each player receives a payoff, monetary or otherwise, proportional to the affinity that the chosen element has with respect to the element chosen by the opponent, except in the case when the selected objects coincide, which leads to zero reward. Clearly, it is in each player’s interest to pick an element that is strongly supported by the elements that the adversary is likely to choose. As an example, let us assume that our clustering problem is one of figure/ground discrimination, that is, the objects in $O$ consist of a cohesive group with high mutual affinity (figure) and of non-structured noise (ground). Being non-structured, the noise gives equal average affinity to elements of the figure as to elements of the ground. Informally, assuming no prior knowledge of the inclination of the adversary, a player will be better off selecting elements of the figure rather than of the ground.

The clustering game is thus a two-player symmetric game, where $O = \{1, \dots, n\}$ is the set of pure strategies and $A = (a_{ij})$ is the payoff matrix. Mixed strategies will be elements of the standard simplex $\Delta$ of $\mathbb{R}^n$, which is defined as

$$\Delta = \Big\{ x \in \mathbb{R}^n : \sum_{i=1}^{n} x_i = 1,\ x_i \ge 0 \text{ for all } i = 1 \dots n \Big\}.$$

The support $\sigma(x)$ of a mixed strategy $x$ is the set $\sigma(x) = \{ i \in O : x_i > 0 \}$. The average payoff that a player selecting strategy $i \in O$ obtains against an opponent playing mixed strategy $x \in \Delta$ is $(Ax)_i = \sum_{j \in O} a_{ij} x_j$, while the average payoff that he obtains by playing mixed strategy $y \in \Delta$ is $y^\top A x = \sum_{j \in O} y_j (Ax)_j$. The set of best replies $\beta(x)$ to a mixed strategy $x$ is given by $\beta(x) = \arg\max_{y \in \Delta} y^\top A x$, while the set of pure best replies to $x$ is given by $\Omega(x) = \{ i \in O : e_i \in \beta(x) \}$, where $e_i$ is the $i$th column of the $n \times n$ identity matrix. If $x \in \beta(x)$ then $x$ is said to be a Nash equilibrium. It is straightforward to verify that if $x \in \Delta$ is a Nash equilibrium then $(Ax)_i \le x^\top A x$ for all $i \in O$, with equality if $i \in \sigma(x)$.
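The payoff definitions and the Nash condition above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the function names and the 4-object affinity matrix are hypothetical (three mutually similar "figure" objects plus one clutter object) and are not taken from the paper.

```python
def payoffs(A, x):
    """Return the vector Ax; (Ax)_i is the average payoff of pure strategy i against x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


def average_payoff(A, y, x):
    """Return y^T A x, the average payoff of mixed strategy y against x."""
    return sum(y_i * p_i for y_i, p_i in zip(y, payoffs(A, x)))


def support(x, tol=1e-9):
    """Return sigma(x) = {i : x_i > 0}, up to a numerical tolerance."""
    return {i for i, x_i in enumerate(x) if x_i > tol}


def satisfies_nash_condition(A, x, tol=1e-9):
    """Check (Ax)_i <= x^T A x for all i, with equality for i in the support of x."""
    Ax = payoffs(A, x)
    avg = average_payoff(A, x, x)
    sigma = support(x, tol)
    for i, p_i in enumerate(Ax):
        if p_i > avg + tol:
            return False  # pure strategy i does strictly better than x
        if i in sigma and abs(p_i - avg) > tol:
            return False  # a strategy in the support earns less than average
    return True


# Hypothetical toy affinities: objects 0-2 form the figure, object 3 is clutter.
# The diagonal is zero because coinciding choices yield no reward.
A = [
    [0.0, 1.0, 1.0, 0.1],
    [1.0, 0.0, 1.0, 0.1],
    [1.0, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

x_figure = [1 / 3, 1 / 3, 1 / 3, 0.0]  # all mass on the figure
x_uniform = [0.25, 0.25, 0.25, 0.25]   # mass spread over figure and clutter

print(satisfies_nash_condition(A, x_figure))   # True
print(satisfies_nash_condition(A, x_uniform))  # False
```

In this toy game the uniform strategy fails the condition because the figure objects earn more than the population average, which matches the informal remark that a player is better off selecting elements of the figure.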
An interesting refinement of the Nash equilibrium is that of an evolutionarily stable strategy (ESS). This concept was introduced in evolutionary game theory, a field that originated in the early 1970’s as an attempt to apply the principles and tools of game theory to biological contexts, with a view to modeling the evolution of animal, as opposed to human, behavior [23]. It considers an idealized scenario whereby pairs of individuals are repeatedly drawn at random from a large, ideally infinite, population to play a symmetric two-player game. In contrast to conventional game theory, here players are not supposed to behave rationally or to have complete knowledge of the details of the game. They act instead according to an inherited behavioral pattern, or pure strategy, and it is supposed that some evolutionary selection process operates over time on the distribution of behaviors. We refer the reader to [24, 25] for classical introductions to this rapidly expanding field.

An ESS is essentially a Nash equilibrium satisfying an additional stability property which guarantees that if an ESS is established in a population, and if a small proportion of the population adopts some mutant behavior, then the selection process will eventually drive the mutants to extinction. Specifically, $x \in \Delta$ is an ESS if $x$ is a Nash equilibrium and

$$y \in \beta(x) \setminus \{x\} \ \Rightarrow\ x^\top A y > y^\top A y \qquad \text{(stability property)}.$$

In the next section we will provide a combinatorial characterization of the ESSs of a clustering game, which motivates their use as a notion of a cluster.

3 Clusters as ESSs

Consider a clustering problem where $O = \{1, \dots, n\}$ is the set of the objects to cluster and $A = (a_{ij})$ is the $n \times n$ similarity matrix. Let $C \subseteq O$ be a non-empty subset of objects. The (average) weighted in-degree of $i \in O$ with respect to $C$ is defined as:

$$\operatorname{awindeg}_C(i) = \frac{1}{|C|} \sum_{j \in C} a_{ij},$$

where $|C|$ denotes the cardinality of $C$.
Moreover, if $j \in C$ we define:

$$\phi_C(i, j) = a_{ij} - \operatorname{awindeg}_C(j),$$

which is a measure of the similarity of object $i$ with object $j$ with respect to the average similarity of object $j$ with elements in $C$. The weight of $i$ with respect to $C$ is

$$W_C(i) = \begin{cases} 1 & \text{if } |C| = 1, \\ \sum_{j \in C \setminus \{i\}} \phi_{C \setminus \{i\}}(i, j)\, W_{C \setminus \{i\}}(j) & \text{otherwise,} \end{cases}$$

while the total weight of $C$ is defined as:

$$W(C) = \sum_{i \in C} W_C(i).$$

Intuitively, $W_C(i)$ gives us a measure of the support that object $i$ receives from the objects in $C \setminus \{i\}$ relative to the overall mutual similarity of the objects in $C \setminus \{i\}$. Here positive values indicate that $i$ has high similarity to $C \setminus \{i\}$.

A non-empty subset of objects $C \subseteq O$ such that $W(T) > 0$ for any non-empty $T \subseteq C$, is said to be a dominant set if:

1. $W_C(i) > 0$, for all $i \in C$,
2. $W_{C \cup \{i\}}(i) \le 0$, for all $i \notin C$.

The two previous conditions correspond to the two main properties of a cluster. The first regards internal homogeneity: each object in the cluster $C$ is positively supported by all other objects in $C$. The second regards external heterogeneity: any extension of the cluster $C$ with an external object $i$ will undermine the internal coherency, since the new cluster $C \cup \{i\}$ will negatively support the novel entry $i$. The next result provides a one-to-one correspondence between ESSs and dominant sets, thereby motivating the use of ESSs as clusters.

Theorem 1. If $C \subseteq O$ is a dominant set with respect to similarity matrix $A$, then $x^C$ is an ESS for a two-player game with payoff matrix $A$, where $x^C$ is a vector defined as

$$x^C_i = \begin{cases} \dfrac{W_C(i)}{W(C)} & \text{if } i \in C, \\ 0 & \text{otherwise.} \end{cases}$$

Conversely, if $x$ is an ESS for a two-person game with payoff matrix $A$, then $C = \sigma(x)$ is a dominant set with respect to $A$, provided that $C = \Omega(x)$.

Evolutionary game theory provides us with an algorithmic means of extracting clusters.
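Before turning to the dynamics, the combinatorial definitions above can be checked on a small instance. The following Python sketch transcribes the recursion for the weights, the two dominant-set conditions, and the vector of Theorem 1. The function names and the toy matrix are hypothetical, and the brute-force check over all subsets is exponential, so it is meant only for tiny examples rather than as a practical algorithm.

```python
from functools import lru_cache
from itertools import combinations


def make_weights(A):
    """Build W_C(i) and W(C) for the affinity matrix A (a list of lists)."""

    def awindeg(C, i):
        # (Average) weighted in-degree of i with respect to C.
        return sum(A[i][j] for j in C) / len(C)

    def phi(C, i, j):
        # Similarity of i to j relative to the average similarity of j within C.
        return A[i][j] - awindeg(C, j)

    @lru_cache(maxsize=None)
    def weight(C, i):
        # W_C(i); C must be a frozenset containing i.
        if len(C) == 1:
            return 1.0
        rest = C - {i}
        return sum(phi(rest, i, j) * weight(rest, j) for j in rest)

    def total_weight(C):
        # W(C) = sum of W_C(i) over i in C.
        C = frozenset(C)
        return sum(weight(C, i) for i in C)

    return weight, total_weight


def is_dominant_set(A, C):
    """Brute-force check of the dominant-set definition (tiny inputs only)."""
    C = frozenset(C)
    weight, total_weight = make_weights(A)
    # W(T) > 0 for every non-empty subset T of C.
    for size in range(1, len(C) + 1):
        for T in combinations(sorted(C), size):
            if total_weight(T) <= 0:
                return False
    internal = all(weight(C, i) > 0 for i in C)              # condition 1
    external = all(weight(C | {i}, i) <= 0
                   for i in range(len(A)) if i not in C)     # condition 2
    return internal and external


def characteristic_vector(A, C):
    """The vector x^C of Theorem 1: W_C(i) / W(C) on C, zero elsewhere."""
    C = frozenset(C)
    weight, total_weight = make_weights(A)
    total = total_weight(C)
    return [weight(C, i) / total if i in C else 0.0 for i in range(len(A))]


# Hypothetical toy affinities: objects 0-2 are mutually similar, object 3 is clutter.
A = [
    [0.0, 1.0, 1.0, 0.1],
    [1.0, 0.0, 1.0, 0.1],
    [1.0, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

print(is_dominant_set(A, {0, 1, 2}))  # True
print(is_dominant_set(A, {0, 1}))     # False: adding object 2 is still supported
print(characteristic_vector(A, {0, 1, 2}))
```

On this matrix the three similar objects each receive weight 1 within {0, 1, 2}, so the vector of Theorem 1 places probability 1/3 on each of them and none on the clutter object.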
Indeed, the hypotheses that each object belongs to a cluster compete with one another, each obtaining support from similar objects and competitive pressure from the others. Competition will reduce the population of individuals that assume weakly supported hypotheses, while allowing populations assuming hypotheses with strong support to thrive. Eventually, all inconsistent hypotheses will be driven to extinction, while all the surviving ones will reach an equilibrium whereby they will all receive the same average support, hence exhibiting the internal coherency characterizing a cluster (this derives from the Nash condition). As for the extinct hypotheses, they will provably have a lower support, thereby hinting at external incoherency. The stable strategies can be found using (discrete-time) replicator dynamics,

$$x_i^{(t+1)} = x_i^{(t)} \, \frac{\big(A x^{(t)}\big)_i}{x^{(t)\top} A x^{(t)}},$$

which is a classic formalization of a natural selection process [25, 24].

4 Applications and extensions

The proposed game-theoretic framework for clustering has been used to tackle a variety of problems. Applications range from image and video segmentation [26, 20] to perceptual grouping [21], analysis of fMRI data [27, 28], content-based image retrieval [29], detection of anomalous activities in video streams [30], bioinformatics [31] and human action recognition [32].

In [33], we show how this framework can also be used for addressing matching problems, where the goal is to find correspondences between two sets of objects $O_1$ and $O_2$ satisfying some desired criteria. The formulation we derive is general and application-independent. The main idea is to model the set of possible correspondences between two sets of objects $O_1$ and $O_2$ as a set of game strategies. The matching task thus becomes a non-cooperative game where the potential associations between objects to be matched correspond to strategies, while the payoffs reflect the degree of compatibility between competing hypotheses.
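Whether the strategies are objects to be clustered or candidate associations to be matched, the discrete-time replicator update given in Section 3 can be iterated on the payoff matrix directly. The Python sketch below is a minimal illustration under stated assumptions: the payoffs are assumed non-negative so that the iterates remain in the simplex, the starting point is taken to be the barycenter of the simplex, and the function name, stopping rule, support threshold and toy matrix are hypothetical choices rather than details taken from the paper.

```python
def replicator_dynamics(A, x=None, max_iter=1000, tol=1e-10):
    """Iterate x_i <- x_i * (Ax)_i / (x^T A x) until the update is negligible.

    Assumes A has non-negative entries so that x stays in the simplex.
    """
    n = len(A)
    if x is None:
        x = [1.0 / n] * n  # start from the barycenter of the simplex
    for _ in range(max_iter):
        Ax = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        avg = sum(x[i] * Ax[i] for i in range(n))  # x^T A x
        if avg <= 0:
            break  # update undefined; stop rather than divide by zero
        new_x = [x[i] * Ax[i] / avg for i in range(n)]
        delta = sum(abs(a - b) for a, b in zip(new_x, x))
        x = new_x
        if delta < tol:
            break
    return x


# Hypothetical toy affinities: objects 0-2 are mutually similar, object 3 is clutter.
A = [
    [0.0, 1.0, 1.0, 0.1],
    [1.0, 0.0, 1.0, 0.1],
    [1.0, 1.0, 0.0, 0.1],
    [0.1, 0.1, 0.1, 0.0],
]

x = replicator_dynamics(A)
cluster = {i for i, x_i in enumerate(x) if x_i > 1e-6}  # numerical support
print(cluster)  # {0, 1, 2}
```

On this toy input the mass assigned to the clutter object shrinks at every step, and the support of the limit point is read off as the extracted cluster; the threshold used to decide which entries count as zero is a practical choice, not part of the formal definition.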
Within this formulation, the solutions to the matching problem correspond to ESSs of the underlying “matching game”. A distinguishing feature of this formulation is that it allows one to naturally deal with general many-to-many matching problems even in the presence of asymmetric compatibilities. Note that, within the context of the clustering framework introduced in the previous sections, this is equivalent to finding a cluster over the set of feasible associations, where similarities reflect the association compatibilities. The potential of the proposed approach is demonstrated in [33] via two sets of image matching experiments, both of which show that our results outperform those obtained using well-known domain-specific algorithms.

Our game-theoretic framework for clustering has also been extended to the case of high-order similarities [22]. Indeed, object similarities are typically expressed as pairwise relations, but in some applications higher-order relations are more appropriate, and approximating them in terms of pairwise interactions can lead to substantial loss of information. Consider for instance the problem of clustering a given set of $d$-dimensional Euclidean points into lines. As every pair of data points trivially defines a line, there does not exist a meaningful pairwise measure of similarity for this problem. In [22], we show that our game-theoretic clustering framework can be naturally adapted to cases when high-order similarities are involved. In this case we refer to the hypergraph clustering problem, which is the process of extracting maximally coherent groups from a set of objects using high-order (rather than pairwise) similarities. As in the pairwise case, traditional approaches to this problem are based on the idea of partitioning the input data into a user-defined number of classes, thereby obtaining the clusters as a by-product of the partitioning process.
Moreover, most of them cast the hypergraph clustering problem into a graph clustering problem, by approximating higher-order similarities with pairwise similarities. In contrast to the classical approach, we show that the hypergraph clustering problem can be naturally cast into a non-cooperative multi-player clustering game (without resorting to approximations), whereby the notion of a cluster is equivalent to a classical game-theoretic equilibrium concept. In this case the number of players involved in the game is determined by the arity of the similarity relations, while the payoff function is driven by the weights of the high-order similarities. From the computational viewpoint, we show that the problem of finding the equilibria of our clustering game is, in the case of supersymmetric similarities, equivalent to locally optimizing a polynomial function over the standard simplex, and we provide a discrete-time dynamics to perform this optimization, which generalizes the replicator dynamics. In [22], we assess the superiority of the proposed approach by performing synthetic experiments on line clustering with different types of noise as well as real experiments on illuminant-invariant face clustering.

5 Conclusions

In this paper we have introduced a game-theoretic notion of a cluster, which has found applications in fields as diverse as computer vision and bioinformatics. In a nutshell, our game-theoretic perspective has the following attractive features:

1. it makes no assumption on the underlying (individual) data representation: like spectral (and, more generally, graph-based) clustering, it does not require that the elements to be clustered be represented as points in a vector space;
2. it makes no assumption on the structure of the affinity matrix, being able to work with asymmetric and even negative similarity functions alike;
3. it does not require a priori knowledge of the number of clusters (since it extracts them sequentially);
4. it leaves clutter elements unassigned (useful, e.g., in figure/ground separation or one-class clustering problems);
5. it allows extracting overlapping clusters [34];
6. it generalizes naturally to hypergraph clustering problems, i.e., in the presence of high-order affinities [22], in which case the clustering game is played by more than two players.

The approach outlined above is but one example of using purely game-theoretic concepts to model generic machine learning problems (see [35] for another such example in a totally different context), and the potential of game theory for machine learning is yet to be fully explored. Other areas where game theory could potentially offer a fresh and powerful perspective include, e.g., semi-supervised learning, multi-similarity learning, multi-task learning, learning with incomplete information, and learning with context-dependent similarities. The concomitant increasing interest around the algorithmic aspects of game theory [36] is certainly beneficial in this respect, as it will allow useful cross-fertilization of ideas.

6 Acknowledgments

We acknowledge financial support from the FET programme within EU FP7, under the SIMBAD project (contract 213250), and we are grateful to Mário Figueiredo and Tiberio Caetano for providing useful comments on a preliminary version of this paper.

References

1. Kleinberg, J.M.: An impossibility theorem for clustering. In: Adv. in Neural Inform. Proces. Syst. (NIPS). (2002)
2. Ackerman, M., Ben-David, S.: Measures of clustering quality: A working set of axioms for clustering. In: Adv. in Neural Inform. Proces. Syst. (NIPS). (2008)
3. Zadeh, R.B., Ben-David, S.: A uniqueness theorem for clustering. In: Uncertainty in Artif. Intell. (2009)
4. Jain, A.K., Dubes, R.C.: Algorithms for data clustering. Prentice-Hall (1988)
5. Hartigan, J.: Introduction.
In: Arabie, P., Hubert, L.J., de Soete, G., eds.: Clustering and Classification, River Edge, NJ, World Scientific (1996)
6. Lakoff, G.: Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. The University of Chicago Press (1987)
7. Eco, U.: Kant and the Platypus: Essays on Language and Cognition. Harvest Books (2000)
8. Guyon, I., von Luxburg, U., Williamson, R.C.: Clustering: Science or art? (Opinion paper, 2009)
9. Herault, L., Horaud, R.: Figure-ground discrimination: a combinatorial optimization approach. IEEE Trans. Pattern Anal. Machine Intell. 15(9) (1993) 899–914
10. Shashua, A., Ullman, S.: Structural saliency: The detection of globally salient features using a locally connected network. In: Int. Conf. Comp. Vision (ICCV). (1988)
11. Gupta, G., Ghosh, J.: Robust one-class clustering using hybrid global and local search. In: Int. Conf. on Mach. Learning. (2005)
12. Crammer, K., Talukdar, P.P., Pereira, F.: A rate-distortion one-class model and its applications to clustering. In: Int. Conf. on Mach. Learning. (2008)
13. Jardine, N., Sibson, R.: The construction of hierarchic and non-hierarchic classifications. Computer J. 11 (1968) 177–184
14. Banerjee, A., Krumpelman, C., Basu, S., Mooney, R.J., Ghosh, J.: Model-based overlapping clustering. In: Int. Conf. on Knowledge Discovery and Data Mining. (2005) 532–537
15. Heller, K., Ghahramani, Z.: A nonparametric Bayesian approach to modeling overlapping clusters. In: Int. Conf. AI and Statistics. (2007)
16. Tversky, A.: Features of similarity. Psychological Review 84 (1977) 327–352
17. Jacobs, D.W., Weinshall, D., Gdalyahu, Y.: Classification with nonmetric distances. IEEE Trans. Pattern Anal. Machine Intell. 22(6) (2000) 583–600
18. Altschul, S.F., Gish, W., Miller, W., Meyers, E.W., Lipman, D.J.: Basic local alignment search tool. J. Molec. Biology 215(3) (1990) 403–410
19. Dubuisson, M.P., Jain, A.K.: A modified Hausdorff distance for object matching. In: Int. Conf. Patt. Recogn. (ICPR). (1994) 566–568
20. Pavan, M., Pelillo, M.: Dominant sets and pairwise clustering. IEEE Trans. Pattern Anal. Machine Intell. 29(1) (2007) 167–172
21. Torsello, A., Rota Bulò, S., Pelillo, M.: Grouping with asymmetric affinities: a game-theoretic perspective. In: IEEE Conf. Computer Vision and Patt. Recogn. (CVPR). (2006) 292–299
22. Rota Bulò, S., Pelillo, M.: A game-theoretic approach to hypergraph clustering. In: Adv. in Neural Inform. Proces. Syst. (NIPS). (2009) In press.
23. Maynard Smith, J.: Evolution and the theory of games. Cambridge University Press (1982)
24. Hofbauer, J., Sigmund, K.: Evolutionary games and population dynamics. Cambridge University Press (1998)
25. Weibull, J.W.: Evolutionary game theory. Cambridge University Press (1995)
26. Pavan, M., Pelillo, M.: A new graph-theoretic approach to clustering and segmentation. In: IEEE Conf. Computer Vision and Patt. Recogn. (CVPR). Volume 1. (2003) 145–152
27. Müller, K., Neumann, J., Grigutsch, M., von Cramon, D.Y., Lohmann, G.: Detecting groups of coherent voxels in functional MRI data using spectral analysis and replicator dynamics. J. Magn. Reson. Imaging 26(6) (2007) 1642–1650
28. Neumann, J., von Cramon, D.Y., Forstmann, B.U., Zysset, S., Lohmann, G.: The parcellation of cortical areas using replicator dynamics in fMRI. NeuroImage 32(1) (2006) 208–219
29. Wang, M., Ye, Z.L., Wang, Y., Wang, S.X.: Dominant sets clustering for image retrieval. Signal Process. 88(11) (2008) 2843–2849
30. Hamid, R., Johnson, A., Batta, S., Bobick, A., Isbell, C., Coleman, G.: Detection and explanation of anomalous activities: representing activities as bags of event n-grams. In: IEEE Conf. Computer Vision and Patt. Recogn. (CVPR). Volume 1. (2005) 20–25
31. Florian, F.: Tag SNP selection based on clustering according to dominant sets found using replicator dynamics. Adv.
in Data Analysis and Classif. (2010) (in press).
32. Wei, Q.D., Hu, W.M., Zhang, X.Q., Luo, G.: Dominant sets-based action recognition using image sequence matching. In: Int. Conf. Image Processing (ICIP). Volume 6. (2007) 133–136
33. Albarelli, A., Torsello, A., Rota Bulò, S., Pelillo, M.: Matching as a non-cooperative game. In: Int. Conf. Comp. Vision (ICCV). (2009)
34. Torsello, A., Rota Bulò, S., Pelillo, M.: Beyond partitions: allowing overlapping groups in pairwise clustering. In: Int. Conf. Patt. Recogn. (ICPR). (2008)
35. Cesa-Bianchi, N., Lugosi, G.: Prediction, Learning, and Games. Cambridge University Press (2006)
36. Nisan, N., Roughgarden, T., Tardos, É., Vazirani, V., eds.: Algorithmic Game Theory. Cambridge University Press (2007)