Second Edition. — Morgan Kaufmann, 2006. — 743 p.
This book explores the concepts and techniques of data mining, a promising and flourishing frontier in database systems and new database applications. Data mining, also popularly referred to as knowledge discovery in databases (KDD), is the automated or convenient extraction of patterns representing knowledge implicitly stored in large databases, data warehouses, and other massive information repositories.
What Motivated Data Mining? Why Is It Important?
So, What Is Data Mining?
Data Mining — On What Kind of Data?
Relational Databases
Data Warehouses
Transactional Databases
Advanced Data and Information Systems and Advanced Applications
Data Mining Functionalities — What Kinds of Patterns Can Be Mined?
Concept/Class Description: Characterization and Discrimination
Mining Frequent Patterns, Associations, and Correlations
Classification and Prediction
Cluster Analysis
Outlier Analysis
Evolution Analysis
Are All of the Patterns Interesting?
Classification of Data Mining Systems
Data Mining Task Primitives
Integration of a Data Mining System with a Database or Data Warehouse System
Major Issues in Data Mining
Exercises
Bibliographic Notes
Data Preprocessing
Why Preprocess the Data?
Descriptive Data Summarization
Measuring the Central Tendency
Measuring the Dispersion of Data
Graphic Displays of Basic Descriptive Data Summaries
Data Cleaning
Missing Values
Noisy Data
Data Cleaning as a Process
Data Integration and Transformation
Data Integration
Data Transformation
Data Reduction
Data Cube Aggregation
Attribute Subset Selection
Dimensionality Reduction
Numerosity Reduction
Data Discretization and Concept Hierarchy Generation
Discretization and Concept Hierarchy Generation for Numerical Data
Concept Hierarchy Generation for Categorical Data
Exercises
Bibliographic Notes
Data Warehouse and OLAP Technology: An Overview
What Is a Data Warehouse?
Differences between Operational Database Systems and Data Warehouses
But, Why Have a Separate Data Warehouse?
A Multidimensional Data Model
From Tables and Spreadsheets to Data Cubes
Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Databases
Examples for Defining Star, Snowflake, and Fact Constellation Schemas
Measures: Their Categorization and Computation
Concept Hierarchies
OLAP Operations in the Multidimensional Data Model
A Starnet Query Model for Querying Multidimensional Databases
Data Warehouse Architecture
Steps for the Design and Construction of Data Warehouses
A Three-Tier Data Warehouse Architecture
Data Warehouse Back-End Tools and Utilities
Metadata Repository
Types of OLAP Servers: ROLAP versus MOLAP versus HOLAP
Data Warehouse Implementation
Efficient Computation of Data Cubes
Indexing OLAP Data
Efficient Processing of OLAP Queries
From Data Warehousing to Data Mining
Data Warehouse Usage
From On-Line Analytical Processing to On-Line Analytical Mining
Exercises
Bibliographic Notes
Data Cube Computation and Data Generalization
Efficient Methods for Data Cube Computation
A Road Map for the Materialization of Different Kinds of Cubes
Multiway Array Aggregation for Full Cube Computation
BUC: Computing Iceberg Cubes from the Apex Cuboid Downward
Star-cubing: Computing Iceberg Cubes Using a Dynamic Star-tree Structure
Precomputing Shell Fragments for Fast High-Dimensional OLAP
Computing Cubes with Complex Iceberg Conditions
Further Development of Data Cube and OLAP Technology
Discovery-Driven Exploration of Data Cubes
Complex Aggregation at Multiple Granularity: Multifeature Cubes
Constrained Gradient Analysis in Data Cubes
Attribute-Oriented Induction — An Alternative Method for Data Generalization and Concept Description
Attribute-Oriented Induction for Data Characterization
Efficient Implementation of Attribute-Oriented Induction
Presentation of the Derived Generalization
Mining Class Comparisons: Discriminating between Different Classes
Class Description: Presentation of Both Characterization and Comparison
Exercises
Bibliographic Notes
Mining Frequent Patterns, Associations, and Correlations
Basic Concepts and a Road Map
Market Basket Analysis: A Motivating Example
Frequent Itemsets, Closed Itemsets, and Association Rules
Frequent Pattern Mining: A Road Map
Efficient and Scalable Frequent Itemset Mining Methods
The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
Generating Association Rules from Frequent Itemsets
Improving the Efficiency of Apriori
Mining Frequent Itemsets without Candidate Generation
Mining Frequent Itemsets Using Vertical Data Format
Mining Closed Frequent Itemsets
Mining Various Kinds of Association Rules
Mining Multilevel Association Rules
Mining Multidimensional Association Rules from Relational Databases and Data Warehouses
From Association Mining to Correlation Analysis
Strong Rules Are Not Necessarily Interesting: An Example
From Association Analysis to Correlation Analysis
Constraint-Based Association Mining
Metarule-Guided Mining of Association Rules
Constraint Pushing: Mining Guided by Rule Constraints
Exercises
Bibliographic Notes
Classification and Prediction
What Is Classification? What Is Prediction?
Issues Regarding Classification and Prediction
Preparing the Data for Classification and Prediction
Comparing Classification and Prediction Methods
Classification by Decision Tree Induction
Decision Tree Induction
Attribute Selection Measures
Tree Pruning
Scalability and Decision Tree Induction
Bayesian Classification
Bayes’ Theorem
Naïve Bayesian Classification
Bayesian Belief Networks
Training Bayesian Belief Networks
Rule-Based Classification
Using IF-THEN Rules for Classification
Rule Extraction from a Decision Tree
Rule Induction Using a Sequential Covering Algorithm
Classification by Backpropagation
A Multilayer Feed-Forward Neural Network
Defining a Network Topology
Backpropagation
Inside the Black Box: Backpropagation and Interpretability
Support Vector Machines
The Case When the Data Are Linearly Separable
The Case When the Data Are Linearly Inseparable
Associative Classification: Classification by Association Rule Analysis
Lazy Learners (or Learning from Your Neighbors)
k-Nearest-Neighbor Classifiers
Case-Based Reasoning
Other Classification Methods
Genetic Algorithms
Rough Set Approach
Fuzzy Set Approaches
Prediction
Linear Regression
Nonlinear Regression
Other Regression-Based Methods
Accuracy and Error Measures
Classifier Accuracy Measures
Predictor Error Measures
Evaluating the Accuracy of a Classifier or Predictor
Holdout Method and Random Subsampling
Cross-validation
Bootstrap
Ensemble Methods — Increasing the Accuracy
Bagging
Boosting
Model Selection
Estimating Confidence Intervals
ROC Curves
Exercises
Bibliographic Notes
Cluster Analysis
What Is Cluster Analysis?
Types of Data in Cluster Analysis
Interval-Scaled Variables
Binary Variables
Categorical, Ordinal, and Ratio-Scaled Variables
Variables of Mixed Types
Vector Objects
A Categorization of Major Clustering Methods
Partitioning Methods
Classical Partitioning Methods: k-Means and k-Medoids
Partitioning Methods in Large Databases: From k-Medoids to CLARANS
Hierarchical Methods
Agglomerative and Divisive Hierarchical Clustering
BIRCH: Balanced Iterative Reducing and Clustering Using Hierarchies
ROCK: A Hierarchical Clustering Algorithm for Categorical Attributes
Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling
Density-Based Methods
DBSCAN: A Density-Based Clustering Method Based on Connected Regions with Sufficiently High Density
OPTICS: Ordering Points to Identify the Clustering Structure
DENCLUE: Clustering Based on Density Distribution Functions
Grid-Based Methods
STING: STatistical INformation Grid
WaveCluster: Clustering Using Wavelet Transformation
Model-Based Clustering Methods
Expectation-Maximization
Conceptual Clustering
Neural Network Approach
Clustering High-Dimensional Data
CLIQUE: A Dimension-Growth Subspace Clustering Method
PROCLUS: A Dimension-Reduction Subspace Clustering Method
Frequent Pattern–Based Clustering Methods
Constraint-Based Cluster Analysis
Clustering with Obstacle Objects
User-Constrained Cluster Analysis
Semi-Supervised Cluster Analysis
Outlier Analysis
Statistical Distribution-Based Outlier Detection
Distance-Based Outlier Detection
Density-Based Local Outlier Detection
Deviation-Based Outlier Detection
Exercises
Bibliographic Notes
Mining Stream, Time-Series, and Sequence Data
Mining Data Streams
Methodologies for Stream Data Processing and Stream Data Systems
Stream OLAP and Stream Data Cubes
Frequent-Pattern Mining in Data Streams
Classification of Dynamic Data Streams
Clustering Evolving Data Streams
Mining Time-Series Data
Trend Analysis
Similarity Search in Time-Series Analysis
Mining Sequence Patterns in Transactional Databases
Sequential Pattern Mining: Concepts and Primitives
Scalable Methods for Mining Sequential Patterns
Constraint-Based Mining of Sequential Patterns
Periodicity Analysis for Time-Related Sequence Data
Mining Sequence Patterns in Biological Data
Alignment of Biological Sequences
Hidden Markov Model for Biological Sequence Analysis
Exercises
Bibliographic Notes
Graph Mining, Social Network Analysis, and Multirelational Data Mining
Graph Mining
Methods for Mining Frequent Subgraphs
Mining Variant and Constrained Substructure Patterns
Applications: Graph Indexing, Similarity Search, Classification, and Clustering
Social Network Analysis
What Is a Social Network?
Characteristics of Social Networks
Link Mining: Tasks and Challenges
Mining on Social Networks
Multirelational Data Mining
What Is Multirelational Data Mining?
ILP Approach to Multirelational Classification
Tuple ID Propagation
Multirelational Classification Using Tuple ID Propagation
Multirelational Clustering with User Guidance
Exercises
Bibliographic Notes
Mining Object, Spatial, Multimedia, Text, and Web Data
Multidimensional Analysis and Descriptive Mining of Complex Data Objects
Generalization of Structured Data
Aggregation and Approximation in Spatial and Multimedia Data Generalization
Generalization of Object Identifiers and Class/Subclass Hierarchies
Generalization of Class Composition Hierarchies
Construction and Mining of Object Cubes
Generalization-Based Mining of Plan Databases by Divide-and-Conquer
Spatial Data Mining
Spatial Data Cube Construction and Spatial OLAP
Mining Spatial Association and Co-location Patterns
Spatial Clustering Methods
Spatial Classification and Spatial Trend Analysis
Mining Raster Databases
Multimedia Data Mining
Similarity Search in Multimedia Data
Multidimensional Analysis of Multimedia Data
Classification and Prediction Analysis of Multimedia Data
Mining Associations in Multimedia Data
Audio and Video Data Mining
Text Mining
Text Data Analysis and Information Retrieval
Dimensionality Reduction for Text
Text Mining Approaches
Mining the World Wide Web
Mining the Web Page Layout Structure
Mining the Web’s Link Structures to Identify Authoritative Web Pages
Mining Multimedia Data on the Web
Automatic Classification of Web Documents
Web Usage Mining
Exercises
Bibliographic Notes
Applications and Trends in Data Mining
Data Mining Applications
Data Mining for Financial Data Analysis
Data Mining for the Retail Industry
Data Mining for the Telecommunication Industry
Data Mining for Biological Data Analysis
Data Mining in Other Scientific Applications
Data Mining for Intrusion Detection
Data Mining System Products and Research Prototypes
How to Choose a Data Mining System
Examples of Commercial Data Mining Systems
Additional Themes on Data Mining
Theoretical Foundations of Data Mining
Statistical Data Mining
Visual and Audio Data Mining
Data Mining and Collaborative Filtering
Social Impacts of Data Mining
Ubiquitous and Invisible Data Mining
Data Mining, Privacy, and Data Security
Trends in Data Mining
Exercises
Bibliographic Notes
Appendix: An Introduction to Microsoft’s OLE DB for Data Mining
A.1 Model Creation
A.2 Model Training
A.3 Model Prediction and Browsing