Springer, 2011, 298 p.
Clustering is an important unsupervised classification technique in which a set of patterns, usually vectors in a multidimensional space, is grouped into clusters based on some similarity or dissimilarity criterion. In crisp clustering, each pattern is assigned to exactly one cluster, whereas in fuzzy clustering, each pattern is given a degree of membership in each cluster. Fuzzy clustering is inherently more suitable for handling imprecise and noisy data with overlapping clusters. Clustering techniques aim to find a grouping of the input dataset that optimizes some criteria. Hence the problem of clustering can be posed as an optimization problem. The objectives to be optimized may represent different characteristics of the clusters, such as compactness, separation, and connectivity. A straightforward way to pose clustering as an optimization problem is to optimize a cluster validity index that reflects the goodness of the clustering solutions. All possible partitionings of the dataset, together with the corresponding values of the validity index, define the complete search space. Traditional partitional clustering techniques, such as K-means and fuzzy C-means, employ greedy search over this space to optimize the compactness of the clusters. Although these algorithms are computationally efficient, they often get stuck at a local optimum, depending on the choice of the initial cluster centers. Moreover, they optimize a single cluster validity index (compactness in this case), and therefore do not cover different characteristics of the datasets.
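The difference between crisp and fuzzy assignment, and the compactness criterion that fuzzy C-means optimizes, can be illustrated with a minimal sketch. It assumes Euclidean distance and a fuzzifier m = 2, and the function names are hypothetical, chosen for illustration only. The membership formula is the standard fuzzy C-means update for fixed centers; the sketch is not taken from the book.

```python
import math

def fuzzy_memberships(points, centers, m=2.0):
    """Return one row per point; row[i] is the membership in cluster i.

    Uses the standard fuzzy C-means rule for fixed centers:
    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)). Each row sums to 1.
    """
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if any(d == 0.0 for d in dists):
            # The point coincides with one or more centers:
            # share full membership among those centers only.
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            rows.append([h / sum(hits) for h in hits])
            continue
        rows.append([1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                     for d_i in dists])
    return rows

def crisp_labels(memberships):
    """Crisp view: assign each point to its highest-membership cluster."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]

def fuzzy_compactness(points, centers, memberships, m=2.0):
    """Fuzzy C-means objective J_m: sum of u**m times squared distance."""
    return sum((u ** m) * math.dist(p, c) ** 2
               for p, row in zip(points, memberships)
               for c, u in zip(centers, row))

# Toy data (assumed for illustration): three points on a line, two centers.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]
u = fuzzy_memberships(points, centers)
labels = crisp_labels(u)
j_m = fuzzy_compactness(points, centers, u)
```

In this toy case the middle point belongs mostly, but not entirely, to the first cluster, whereas the crisp labels hide that ambiguity. A lower value of `j_m` indicates more compact clusters for the given centers.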
To overcome the problem of local optima, global optimization tools such as Genetic Algorithms (GAs) have been widely used to search for the global optimum of the chosen validity measure. GAs are randomized search and optimization techniques guided by the principles of evolution and natural genetics, and have a large amount of implicit parallelism. GAs perform multimodal search in complex landscapes and provide near-optimal solutions for the objective or fitness function of an optimization problem. Conventional GA-based clustering techniques use some validity measure as the fitness value. However, no single validity measure works equally well for different kinds of datasets. Thus it is natural to optimize multiple such measures simultaneously in order to capture different characteristics of the data.
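When several validity measures are optimized at once, candidate clusterings are usually compared by Pareto dominance rather than by a single fitness value. The sketch below assumes two objectives per candidate, a compactness score to be minimized and a separation score to be maximized. The function names and the toy scores are hypothetical, and the sketch is not the specific procedure used in the book.

```python
def dominates(a, b):
    """True if score a Pareto-dominates score b.

    Scores are (compactness, separation) pairs; compactness is minimized
    and separation is maximized (an assumption of this sketch).
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better

def non_dominated(scores):
    """Return the indices of candidates that no other candidate dominates."""
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s)
                       for j, t in enumerate(scores) if j != i)]

# Toy scores (assumed): one (compactness, separation) pair per clustering.
scores = [(1.0, 2.0), (2.0, 5.0), (3.0, 4.0), (1.5, 1.0)]
front = non_dominated(scores)
```

Here the third candidate is dominated by the second and the fourth by the first, so only the first two remain. Neither of these two is better on both objectives, which is why a multiobjective search returns a set of trade-off solutions rather than a single answer.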
The present volume is aimed at providing a treatise in a unified framework, with extensive applications to real-life datasets. Chapter 1 introduces the clustering problem and discusses different types of clustering algorithms, cluster validity measures, research issues, challenges and applications, as well as the representation of the clustering problem as an optimization task and the possible use of GAs and multiobjective GAs (MOGAs) for this purpose. Chapter 2 describes the basic principles of GAs and MOGAs, their theoretical analysis and their applications to data mining and bioinformatics problems. Chapter 3 gives a broad overview of data mining and knowledge discovery tasks and applications. Chapter 4 provides an introduction to basic molecular biology concepts followed by basic tasks in bioinformatics, with a discussion of some of the existing works. Chapter 5 presents a multiobjective fuzzy clustering algorithm that uses real-valued center-based encoding of chromosomes and simultaneously optimizes two fuzzy cluster validity indices. Chapter 6 describes a method based on the Support Vector Machine (SVM) classifier for obtaining the final solution from the set of non-dominated Pareto-optimal clustering solutions produced by the multiobjective fuzzy clustering method. Chapter 7 describes a two-stage fuzzy clustering technique that utilizes the data points having significant membership to multiple classes (SiMM points). A variable string length genetic fuzzy clustering algorithm and a multiobjective clustering algorithm are used for this purpose. Results of all the algorithms described in Chapters 5–7 are demonstrated on both remote sensing imagery and microarray gene expression data. Chapter 8 addresses the problem of multiobjective fuzzy clustering of categorical data. A cluster mode-based encoding technique is used, and fuzzy compactness and fuzzy separation are simultaneously optimized in the context of categorical data. The results are demonstrated for various synthetic and real-life categorical datasets. Chapter 9 presents an application of the multiobjective clustering technique to unsupervised cancer classification and to identifying relevant genetic markers using some statistics followed by multiobjective feature selection. Finally, Chapter 10 presents an overview of the biclustering problem and algorithms. The application of multiobjective GAs to biclustering is also described, with different characteristics of the biclusters optimized simultaneously. The effect of incorporating fuzziness has also been studied. Results are reported for various artificial and real-life gene expression datasets with biological and statistical significance tests.
The present volume is devoted to clustering using multiobjective GAs, with extensive real-life applications in data mining and bioinformatics. The volume, which is unique in its character, will be useful to graduate students and researchers in bioinformatics, computer science, biotechnology, electrical engineering, system science, and information technology as both a text and a reference book for some parts of the curriculum. Researchers and practitioners in industry and R&D laboratories in the fields of system design, pattern recognition, data mining, soft computing, geoscience, remote sensing and bioinformatics will also benefit from this volume.
Genetic Algorithms and Multiobjective Optimization
Data Mining Fundamentals
Computational Biology and Bioinformatics
Multiobjective Genetic Algorithm-Based Fuzzy Clustering
Combining Pareto-Optimal Clusters Using Supervised Learning
Two-Stage Fuzzy Clustering
Clustering Categorical Data in a Multiobjective Framework
Unsupervised Cancer Classification and Gene Marker Identification
Multiobjective Biclustering in Microarray Gene Expression Data