Academic Press, 2009. — 864 p. — ISBN: 0123747651.
Robert Nisbet, Pacific Capital Bank Corporation, Santa Barbara, CA, USA
John Elder, Elder Research, Inc. and the University of Virginia, Charlottesville, USA
Gary Miner, StatSoft, Inc., Tulsa, OK, USA
Description
The Handbook of Statistical Analysis and Data Mining Applications is a comprehensive professional reference book that guides business analysts, scientists, engineers and researchers (both academic and industrial) through all stages of data analysis, model building and implementation. The Handbook helps one discern the technical and business problem, understand the strengths and weaknesses of modern data mining algorithms, and employ the right statistical methods for practical application. Use this book to address massive and complex datasets with novel statistical approaches and to evaluate analyses and solutions objectively. It offers clear, intuitive explanations of the principles and tools for solving problems using modern analytic techniques and discusses their application to real problems in ways accessible and beneficial to practitioners across industries, from science and engineering to medicine, academia and commerce. This handbook brings together, in a single resource, all the information a beginner will need to understand the tools and issues in data mining and to build successful data mining solutions.
Theoretical Considerations for Data Mining
Preamble
In Chapter 1, we explored the historical background of statistical analysis and data mining. Statistical analysis is a relatively old discipline (particularly if you consider its origins in China). But data mining is a relatively new field, which developed during the 1990s and coalesced into a field of its own during the early years of the twenty-first century. It represents a confluence of several well-established fields of interest:
Traditional statistical analysis
Artificial intelligence
Machine learning
Development of large databases
Traditional statistical analysis follows the deductive method in the search for relationships in data sets. Artificial intelligence (e.g., expert systems) and machine learning techniques (e.g., neural nets and decision trees) follow the inductive method to find faint patterns of relationship in data sets. Deduction (or deductive reasoning) is the Aristotelian process of analyzing detailed data, calculating a number of metrics, and forming some conclusions based (or deduced) solely on the mathematics of those metrics.
Induction is the more Platonic process of using information in a data set as a "springboard" to make general conclusions, which are not wholly contained directly in the input data. The scientific method follows the inductive approach but has strong Aristotelian elements in the preliminary steps.
The scientific method
The scientific method is as follows:
Define the problem.
Gather existing information about a phenomenon.
Form one or more hypotheses.
Collect new experimental data.
Analyze the information in the new data set.
Interpret results.
Synthesize conclusions, based on the old data, new data, and intuition.
Form new hypotheses for further testing.
Do it again (iteration).
Steps 1-5 involve deduction, and steps 6-9 involve induction. Even though the scientific method is based strongly on deductive reasoning, the final products arise through inductive reasoning. Data mining is a lot like that.
In fact, machine learning algorithms used in data mining are designed to mimic the process that occurs in the mind of the scientist. Data mining uses mathematics, but the results are not mathematically determined. This statement may sound somewhat contradictory until you view it in terms of the human brain.
You can describe many of the processes in the human conceptual pathway with various mathematical relationships, but the result of being human goes far beyond the mathematical descriptions of these processes. Women's intuition, mothers' wisdom regarding their offspring, and "gut" level feelings about who should win the next election are all intuitive models of reality created by the human brain. They are based largely on empirical data, but the mind extrapolates beyond the data to form the conclusions following a purely inductive reasoning process.
What is data mining?
Data mining can be defined in several ways, which differ primarily in their focus on different aspects of data mining. One of the earliest definitions is
The non-trivial extraction of implicit, previously unknown, and potentially useful information from data (Frawley et al., 1991).
As data mining developed as a professional activity, it was necessary to distinguish it from the previous activity of statistical modeling and the broader activity of knowledge discovery.
For the purposes of this handbook, we will use the following working definitions:
Statistical modeling: The use of parametric statistical algorithms to group or predict an outcome or event, based on predictor variables.
Data mining: The use of machine learning algorithms to find faint patterns of relationship between data elements in large, noisy, and messy data sets, which can lead to actions to increase benefit in some form (diagnosis, profit, detection, etc.).
Knowledge discovery: The entire process of data access, data exploration, data preparation, modeling, model deployment, and model monitoring. This broad process includes data mining activities, as shown in Figure 2.1.
As the practice of data mining developed further, the focus of the definitions shifted to specific aspects of the information and its sources. In 1996, Fayyad et al. proposed the following:
Knowledge discovery in databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
The second definition focuses on the patterns in the data rather than on information in a generic sense. These patterns are faint and hard to distinguish, and they can be sensed only by analysis algorithms that can evaluate nonlinear relationships between predictor variables and their targets, and among the predictors themselves. This form of the definition of data mining developed along with the rise of machine learning tools for use in data mining. Tools like decision trees and neural nets permit the analysis of nonlinear patterns in data more easily than is possible with parametric statistical algorithms. The reason is that machine learning algorithms learn the way humans do, by example, not by calculation of metrics based on averages and data distributions.
The definition of data mining was originally confined to just the process of model building. But as the practice matured, data mining tool packages (e.g., SPSS Clementine) included other necessary tools to facilitate the building of models and to evaluate and display them. Soon, the definition of data mining expanded to include the operations shown in Figure 2.1 (and some definitions include model visualization as well).
The modern Knowledge Discovery in Databases (KDD) process combines the mathematics used to discover interesting patterns in data with the entire process of extracting data and using resulting models to apply to other data sets to leverage the information for some purpose. This process blends business systems engineering, elegant statistical methods, and industrial-strength computing power to find structure (connections, patterns, associations, and basis functions) rather than statistical parameters (means, weights, thresholds, knots).
In Chapter 3, we will expand this rather linear organization of data mining processes to describe the iterative, closed-loop system with feedbacks that comprise the modern approach to the practice of data mining.
A theoretical framework for the data mining process
The evolutionary nature of the definition and focus of data mining occurred primarily as a matter of experience and necessity. A major problem with this development was the lack of a consistent body of theory that could encompass all aspects of what information is, where it comes from, and how it is used. This logical concept is sometimes called a model-theoretic. Model theory links logic with algebraic expressions of structure to describe a system or complex process with a body of terms having a consistent syntax and relationships between them (semantics). Most expressions of data mining activities include inconsistent terms (e.g., attribute and predictor), which may imply different logical semantic relations with the data elements employed. Mannila (2000) summarized a number of criteria that should be satisfied in an approach to develop a model-theoretic for data mining. These criteria include the ability to
Model typical data mining tasks (clustering, rule discovery, classification)
Describe data and the inductive generalizations derived from the data
Express information from a variety of forms of data (relational data, sequences, text, Web)
Support interactive and iterative processes
Express comprehensible relationships
Incorporate users in the process
Incorporate multiple criteria for defining what is an "interesting" discovery
Mannila describes a number of approaches to developing an acceptable model-theoretic but concludes that none of them satisfy all the above criteria. The closest we can come is to combine the microeconomic approach with the inductive database approach.
Microeconomic Approach
The starting point of the microeconomic approach is that data mining is concerned with finding actionable patterns in data that have some utility to form a decision aimed at getting something done (e.g., employ interdiction strategies to reduce attrition). The goal is to find the decision that maximizes the total utility across all customers.
Inductive Database Approach
An inductive database includes all the data available in a given structure plus all the questions (queries) that could be asked about patterns in the data. Both stored and derived facts are handled in the same way. One of the most important functions of the human brain is to serve as a pattern recognition engine. Detailed data are submerged in the unconscious memory, and actions are driven primarily by the stored patterns.
Mannila suggests that the microeconomic approach can express most of the requirements for a model-theoretic based on stored facts, but the inductive database approach is much better suited to expressing derived facts. One attempt to implement this was taken in the development of the Predictive Model Markup Language (PMML), an XML-based standard built on the Extensible Markup Language (XML). Most data mining packages available today store internal information (e.g., arrays) in XML format and can output results (analytical models) in the form of PMML. This combination of XML and PMML permits expression of the same data elements and the data mining process either in a physical database environment or a Web environment. When you choose your data mining tool, look for these capabilities.
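As a rough illustration of the idea (not the authors' example), the sketch below uses Python's standard xml.etree module to emit a small PMML-style description of a hypothetical two-predictor regression model; the element names follow common PMML conventions, but the document is simplified and not validated against the PMML schema.

```python
# A minimal sketch of expressing a fitted model as PMML-style XML.
# Element and attribute names are simplified for illustration; a real
# export would come from a data mining package and conform to the schema.
import xml.etree.ElementTree as ET

def model_to_xml(intercept, coefficients):
    """Serialize a hypothetical linear model as a PMML-style XML string."""
    root = ET.Element("PMML", version="4.4")
    data_dict = ET.SubElement(root, "DataDictionary")
    for name in coefficients:
        ET.SubElement(data_dict, "DataField", name=name,
                      optype="continuous", dataType="double")
    model = ET.SubElement(root, "RegressionModel", functionName="regression")
    table = ET.SubElement(model, "RegressionTable", intercept=str(intercept))
    for name, weight in coefficients.items():
        ET.SubElement(table, "NumericPredictor", name=name,
                      coefficient=str(weight))
    return ET.tostring(root, encoding="unicode")

# Hypothetical predictors and weights, purely for illustration.
print(model_to_xml(0.5, {"tenure_months": 0.02, "avg_balance": -0.001}))
```

The point is simply that a fitted model, like the data it came from, can travel as markup between a database environment and a Web environment.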
Strengths of the data mining process
Traditional statistical studies use past information to determine a future state of a system (often called prediction), whereas data mining studies use past information to construct patterns based not solely on the input data, but also on the logical consequences of those data.
This process is also called prediction, but it contains a vital element missing in statistical analysis: the ability to provide an orderly expression of what might be in the future, compared to what was in the past (based on the assumptions of the statistical method).
Compared to traditional statistical studies, which often provide only hindsight, the field of data mining finds patterns and classifications that look toward and even predict the future.
In summary, data mining can (1) provide a more complete understanding of data by finding patterns previously not seen and (2) make models that predict, thus enabling people to make better decisions, take action, and therefore mold future events.
Customer-Centric versus Account-Centric: A New Way to Look at Your Data
Most computer databases in business were designed for the efficient storage and retrieval of account or product information. Business operations were controlled by accounting systems; it was natural that the application of computers to business followed the same data structures. The focus of these data structures was on transactions, and multiple transactions were stored for a given account. Data in early transactional business systems were held in Indexed Sequential Access Method (ISAM) databases. But as data volumes increased and the need for flexibility increased, Relational Database Management Systems (RDBMS) were developed. Relational theory, developed by E. F. Codd, distributed data into tables linked by primary and foreign keys, which progressively reduced data redundancy (like customer names) in (eventually) six "normal forms" of data organization.
Some of the very large relational systems using NCR Teradata technology extend into the hundreds of terabytes. These systems provide relatively efficient storage and retrieval of account-centric information. Account-centric systems were quite efficient for their intended purpose, but they have a major drawback: it is difficult to manage customers per se, rather than accounts, as the primary responders. One person could have one account or multiple accounts. One account could be owned by more than one person. As a result, it was very difficult for a company on an RDBMS to relate its business to specific customers. Also, accounts (per se) don't buy products or services; products don't buy themselves. People buy products and services, and our business operations (and the databases that serve them) should be oriented around the customer, not around accounts.
When we store data in a customer-centric format, extracts to build the Customer Analytical Record (CAR) are much easier to create.
And customer-centric databases are much easier to update in relation to the customer.
The Physical Data Mart
One solution to this problem is to organize data structures to hold specific aspects (dimensions) of customer information. These structures can be represented by tables with common keys to link them together. This approach was championed by Oracle to hold customer-related information apart from the transactional data associated with them. The basic architecture was organized around a central (fact) table, which stored general information about a customer. This fact table formed the hub of a structure like a wheel (Figure 2.2). This structure became known as the star-schema.
Another name for a star-schema is a multidimensional database. In an online store, the dimensions can hold data elements for Products, Orders, Back-orders, etc. The transactional data are often stored in another very different data structure. The customer database system is refreshed daily with summaries and aggregations from the transactional system. This smaller database is "dependent" on the larger database to create the summary and aggregated data stored in it. When the larger database is a data warehouse, the smaller dependent database is referred to as a dependent data mart. In Chapter 3, we will see how a system of dependent data marts can be organized around a relational data warehouse to form the Corporate Information Factory.
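As a loose sketch of how such a structure is navigated, the following assumes pandas and invents a tiny customer hub table with Orders and Products dimensions joined to it by common keys; all table and column names are hypothetical.

```python
# Minimal sketch of a star-schema-style structure as described above:
# a central customer (hub/fact) table with dimension tables joined to it
# by common keys. All table and column names are hypothetical.
import pandas as pd

customers = pd.DataFrame({            # central table: one row per customer
    "customer_id": [1, 2],
    "segment": ["retail", "premium"],
})
orders = pd.DataFrame({               # dimension: orders placed by customers
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "product_id": ["A", "B", "A"],
    "amount": [20.0, 35.0, 50.0],
})
products = pd.DataFrame({             # dimension: product attributes
    "product_id": ["A", "B"],
    "category": ["books", "music"],
})

# Resolve the "spokes" of the star back to the hub via their keys.
flat = (orders
        .merge(customers, on="customer_id")
        .merge(products, on="product_id"))
print(flat.groupby("customer_id")["amount"].sum())   # summary per customer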
The Virtual Data Mart
As computing power and disk storage capacity increased, it became obvious in the early 1990s that a business could appeal to customers directly by using characteristics and historical account information, and Customer Relationship Management (CRM) was born. One-to-one marketing appeals could be supported, and businesses became "smarter" in their ability to convince customers to buy more goods and services. This success of CRM operations changed the way some companies looked at their data. No longer did companies have to view their databases in terms of just accounts and products; rather, they could view their customers directly, in terms of all accounts, products, and demographic data associated with each customer. These "logical" data marts could even be implemented as "views" in an RDBMS.
Householded Databases
Another way to gain customer-related insights is to associate all accounts with the customers who own them and to associate all individual members of the same household. This process is called householding. The householding process requires some fuzzy matching to aggregate all accounts to the same customer. The reason is that the customer names may not be spelled exactly the same way in all records. An analogous situation occurs when trying to gather all individuals into the same household, because not all addresses are listed in exactly the same format. This process of fuzzy matching can be performed by a number of data integration and data quality tools available in the market today (DataFlux, Trillium, Informatica Data Quality, IBM QualityStage).
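A minimal sketch of the fuzzy-matching idea, using only Python's standard difflib module; the similarity threshold and the sample records are arbitrary, and commercial data quality tools apply far more sophisticated matching rules.

```python
# Minimal sketch of fuzzy matching for householding: group records whose
# name/address strings are "close enough", using only the standard library.
# The threshold (0.8) and the sample records are arbitrary.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    """True if two strings are similar enough to treat as the same entity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

records = [
    "John Q. Smith, 12 Oak St.",
    "Jon Q Smith, 12 Oak Street",
    "Mary Jones, 7 Elm Ave.",
]

households = []          # each household is a list of matched records
for rec in records:
    for group in households:
        if similar(rec, group[0]):   # compare against the group's first member
            group.append(rec)
            break
    else:
        households.append([rec])

print(households)   # the two Smith variants fall into the same household
```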
The householded data structure could consist of the following tables:
Accounts
Individuals
Households
Historical data could be combined with each of the preceding hierarchical levels of aggregation. Alternatively, the preceding tables could be restricted to current data, and historical data could be installed in historical versions of the same tables (e.g., Accounts_Hst), linked together with common keys. This compound structure would optimize speed of database queries and simplify data extraction for most applications requiring only current data. Also, the historical data would be available for trending in the historical tables.
The data paradigm shift
The organization of data structures suitable for data mining requires a basic shift in thinking about data in business. Data do not serve the account; data should be organized to serve the customer who buys goods and services. To directly serve customers, data must be organized in a customer-centric data structure to permit the following:
The relationships among all data elements must be relevant to the customer.
Data structures must make it relatively easy to convert all required data elements into a form suitable for data mining: the Customer Analytical Record (CAR).
Creation of the CAR
All input data must be loaded into the CAR (Accenture Global Services, 2006). This process is similar to preparing for a vacation by automobile. If your camping equipment is stored in one place in your basement, you can easily access it and load it into the automobile. If it is spread throughout your house and mixed in with noncamping equipment, access will be more difficult because you have to separate (extract) it from among other items. Gathering data for data mining is a lot like that. If your source data are in a data warehouse, this process will denormalize your data. Denormalization is the process of extracting data from normalized tables in the relational model of a data warehouse. Data from these tables must be associated with the proper individuals (or households) along the way. Data integration tools (like SAS DataFlux or Informatica) are required to extract and transform data from the relational database tables to build the CAR. See any one of a number of good books on relational data warehousing to understand what this process entails. If your data are already in a dimensional or householded data structure, you are already halfway there. The CAR includes the following:
All data elements are organized into one record per customer.
One or more "target" (Y) variables are assigned or derived.
The CAR is expressed as a textual version of an equation: Y = X1 + X2 + X3 + ... + Xn.
This expression represents a computerized "memory" of the information about a customer.
These data constructs are analyzed by either statistical or machine learning "algorithms," following specific methodological operations. Algorithms are mathematical expressions that describe relationships between the variable to be predicted (Y, or the customer response) and the predictor variables (X1, X2, X3, ..., Xn).
Basic and advanced data mining algorithms are discussed later. The CAR is analyzed by parametric statistical or machine learning algorithms within the broader process of Knowledge Discovery in Databases (KDD), as shown in Figure 2.1.
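A minimal sketch of that flattening step, assuming pandas and invented account-level tables: account rows are aggregated and pivoted so that each customer becomes a single row of predictors (X1, ..., Xn) joined to a target Y.

```python
# Minimal sketch of building a Customer Analytical Record (CAR):
# denormalize account-level rows into one row per customer, then attach
# the target Y. All table, column, and variable names are invented.
import pandas as pd

accounts = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "product":     ["checking", "savings", "checking", "card"],
    "balance":     [1200.0, 5300.0, 800.0, -250.0],
})
responses = pd.DataFrame({            # the target variable Y
    "customer_id": [1, 2],
    "responded":   [1, 0],
})

car = (accounts.groupby("customer_id")
       .agg(total_balance=("balance", "sum"),      # derived predictor X1
            n_accounts=("product", "count"))       # derived predictor X2
       .join(pd.crosstab(accounts["customer_id"],  # one flag column per product
                         accounts["product"]))
       .reset_index()
       .merge(responses, on="customer_id"))        # attach the target Y

print(car)   # one row per customer: Y alongside predictors X1..Xn
```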
The data mining aspect of KDD consists of an ordered series of activities aimed at training and evaluating the best patterns (for machine learning) or equations (for parametric statistical procedures). These optimum patterns or equations are called models.
Major activities of data mining
Major data mining activities include the following general operations (Hand et al., 2001):
1. Exploratory Data Analysis: These data exploration activities include interactive and visual techniques that allow you to "view" a data set in terms of summary statistical parameters and graphical display to "get a feel" for any patterns or trends that are in the data set.
2. Descriptive Modeling: This activity forms higher-level "views" of a data set, which can include the following:
Determination of overall probability distributions of the data (sometimes called density estimations);
Models describing the relationship between variables (sometimes called dependency modeling);
Partitioning of the data into groups, either by cluster analysis or segmentation.
Cluster analysis is a little different, as the clustering algorithms try to find "natural" groups, either with many clusters or, in one type of cluster analysis, with the user specifying that all the cases "must be" put into x number of clusters (say, for example, three cluster groups). For segmentation, the goal is to find homogeneous groups related to the variable to be modeled (e.g., customer segments like big-spenders).
3. Predictive Modeling: Classification and Regression: The goal here is to build a model in which the value of one variable can be predicted from the values of other variables. Classification is used for "categorical" variables (e.g., Yes/No variables or multiple-choice answers for a variable, like 1-5 for "like best" to "like least"). Regression is used for "continuous" variables (e.g., variables whose values can be any number, with decimals, between one number and another; the age of a person would be an example, or blood pressure, or the number of cases of a product coming off an assembly line each day).
4. Discovering Patterns and Rules: This activity can involve anything from finding the combinations of items that occur frequently together in transaction databases (e.g., products that are usually purchased together, at the same time, by a customer at a convenience store), to finding groupings of stars, maybe new stars, in astronomy, to finding genetic patterns in DNA microarray assays. Analyses like these can be used to generate association rules; e.g., if a person goes to the store to buy milk, he will also buy orange juice. Development of association rules is supported by algorithms in many commercial data mining software products. An advanced association method is Sequence, Association, and Link (SAL) analysis. SAL analysis develops not only the associations, but also the sequences of the associated items. From these sequenced associations, "links" can be calculated, resulting in Web link graphs or rule graphs (see the NTSB Text Mining Tutorial, included with this book, for nice illustrations of both rule graphs and SAL graphs).
5. Retrieval by Content: This activity begins with a known pattern of interest and has the goal of finding similar patterns in the new data set. This approach to pattern recognition is most often used with text material (e.g., written documents, brought into the analysis as Word docs, PDFs, or even the text content of Web pages) or image data sets. To those unfamiliar with these data mining activities, their operations might appear magical or invoke images of the wizard.
Contrary to the image of data miners as magicians, their activities are very simple in principle. They perform their activities following a very crude analog to the way the human brain learns. Machine learning algorithms learn case by case, just the way we do. Data input to our senses are stored in our brains not in the form of individual inputs, but in the form of patterns. These patterns are composed of a set of neural signal strengths our brains have associated with known inputs in the past. In addition to their abilities to build and store patterns, our brains are very sophisticated pattern recognition engines. We may spend a lifetime building a conceptual pattern of "the good life" event by event and pleasure by pleasure. When we compare our lives with those in other countries, we unconsciously compare what we know about their lives (data inputs) with the patterns of our good lives. Analogously, a machine learning algorithm builds the pattern it "senses" in a data set. The pattern is saved in terms of mathematical weights, constants, or groupings. The mined pattern can be used to compare mathematical patterns in other data sets, to score their quality.
Granted, data miners have to perform many detailed numerical operations required by the limitations of our tools. But the principles behind these operations are very similar to the ways our brains work.
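To make the first three activities above concrete, here is a minimal sketch on synthetic data, assuming scikit-learn is available; the parameter choices are arbitrary.

```python
# Minimal sketch of three core activities on synthetic data:
# clustering (descriptive), classification and regression (predictive).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # 200 cases, 3 predictor variables
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)  # categorical target (Yes/No)
y_cont = 2.0 * X[:, 0] - X[:, 2] + rng.normal(scale=0.1, size=200)  # continuous target

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
classifier = DecisionTreeClassifier(max_depth=3).fit(X, y_class)
regressor = LinearRegression().fit(X, y_cont)

print("cluster sizes:", np.bincount(clusters))
print("classification accuracy (training):", classifier.score(X, y_class))
print("regression R^2 (training):", regressor.score(X, y_cont))
```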
Data mining did not arise as a new academic discipline from the studies in universities.
Rather, data mining is the logical next step in a series of developments in business to use data and computers to do business better in the future. Table 2.1 shows the historical roots of data mining.
The discussion in Chapter 1 ended with the question of whether the latest data mining algorithms represent the best we can do. The answer was probably no. The human minds of the data mining algorithm developers will continue to generate novel and increasingly sophisticated methods of emulating the human brain.
Major challenges of data mining
Some of the major challenges of data mining projects include
Use of data in transactional databases for data mining
Data reduction
Data transformation
Data cleaning
Data sparsity
Data rarity (rare case pattern recognition and thus "data set balancing")
Each of these challenges will be discussed in the ensuing chapters.
Examples of data mining applications
Data mining technology can be applied anywhere a decision is made, based on some body of evidence. The diversity of applications in the past included the following:
Sales Forecasting: One of the earliest applications of data mining technology
Shelf Management: A logical follow-on to sales forecasting
Scientific Discovery: A way to identify which among the half-billion stellar objects are worthy of attention (JPL/Palomar Observatory)
Gaming: A method of predicting which customers have the highest potential for spending
Sports: A method of discovering which players/game situations have the highest potential for high scoring
Customer Relationship Management: Retention, cross-sell/up-sell propensity
Customer Acquisition: A way to identify the prospects most likely to respond to a membership offer
Major issues in data mining
Some major issues of data mining include the following (adapted from Han and Kamber, 2006):
Mining of different kinds of information in databases: It is necessary to integrate data from diverse input sources, including data warehouses/data marts, Excel spreadsheets, text documents, and image data. This integration may be quite complex and time consuming.
Interactive mining of knowledge at multiple levels of abstraction: Account-level data must be combined with individual-level data and coordinated with data with different time-grains (daily, monthly, etc.). This issue requires careful transformation of each type of input data to make them consistent with each other.
Incorporation of background information: Some of the most powerful predictor variables are those gathered from outside the corporate database. These data can include demographic and firmographic data, historical data, and other third-party data.
Integration of this external data with internal data can be very tricky and imprecise. Inexact ("fuzzy") matching is necessary in many cases. This process can be very time consuming also.
Data mining query languages and ad hoc data mining: Data miners must interface closely with database management systems to access data. Structured Query Language (SQL) is the most common query tool used to extract data from large databases. Sometimes, specialized query languages must be used in place of SQL. This requirement means that data miners must become proficient (at least to some extent) in programming with these languages. This is the most important interface between data mining and database management operations (see the short extraction sketch after this list).
Presentation and visualization of data mining results: Presenting highly technical results to nontechnical managers can be very challenging. Graphics and visualizations of the result data can be very valuable to communicate properly with managers who are more graphical rather than numerical in their analytical skills.
Handling "noisy" or incomplete data: Many items of data ("fields") for a given customer or account (a "record") are often blank. One of the most challenging tasks in data mining is filling those blanks with intuitive values. In addition to data that is not there, some data present in a record represent randomness and are analogous to noise in a signal transmission.
Different data mining algorithms have different sensitivities to missing data and noise. Part of the art of data mining is selecting the algorithm with the right balance of sensitivity to these "distractions" and also to have a relatively high potential to recognize the target pattern.
Pattern evaluation-the "interestingness" problem: Many patterns may exist in a data set. The challenge for data mining is to distinguish those patterns that are "interesting" and useful to solve the data mining problem at hand. Various measures of interestingness have been proposed for selecting and ranking patterns according to their potential interest to the user. Applying good measures of interestingness can highlight those variables likely to contribute significantly to the model and eliminate unnecessary variables. This activity can save much time and computing "cycles" in the model building process.
Efficiency and scalability of data mining algorithms: Efficiency of a data mining algorithm can be measured in terms of its predictive power and the time it takes to produce a model. Scalability issues can arise when an algorithm or model built on a relatively small data set is applied to a much larger data set. Good data mining algorithms and models are linearly scalable; that is, the time consumed in processing increases linearly, rather than exponentially, with the size of the data set.
Parallel, distributed, and incremental mining algorithms: Large data mining problems can be processed much more efficiently by "dividing and conquering" the problem with multiple processors in parallel computers. Another strategy for processing large data sets is to distribute the processing to multiple computers and compose the results from the combined outputs. Finally, some data mining problems (e.g., power grid controls) must be solved by using incremental algorithms, or those that work on continuous streams of data, rather than on large "chunks" of data. A good example of such an algorithm is a Generalized Regression Neural Net (GRNN). Many power grids are controlled by GRNNs.
Handling of relational and complex types of data: Much input data might come from relational databases (a system of "normalized" tables linked together by common keys). Other input data might come from complex multidimensional databases (elaborations of star-schemas). The data mining process must be flexible enough to encompass both.
Mining information from heterogeneous and global information systems: Data mining tools must have the ability to process data input from very different database structures. In tools with graphical user interfaces (GUIs), multiple nodes must be configured to input data from very different data structures.
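As noted under the query-language issue above, most extraction work comes down to SQL. A minimal sketch using Python's built-in sqlite3 module follows; the table and column names are invented for illustration.

```python
# Minimal sketch of extracting modeling data with SQL via Python's built-in
# sqlite3 module (table and column names are invented for illustration).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (customer_id INTEGER, balance REAL);
    INSERT INTO accounts VALUES (1, 1200.0), (1, 5300.0), (2, 800.0);
""")

# Aggregate account rows to one row per customer, the kind of extract
# that feeds the CAR described earlier in this chapter.
rows = conn.execute("""
    SELECT customer_id,
           COUNT(*)     AS n_accounts,
           SUM(balance) AS total_balance
    FROM accounts
    GROUP BY customer_id
""").fetchall()

for customer_id, n_accounts, total_balance in rows:
    print(customer_id, n_accounts, total_balance)
conn.close()
```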
General requirements for success in a data mining project
Following are general requirements for success of a data mining project:
Significant gain is expected. Usually, either
Results will identify "low-hanging fruit," as in a customer acquisition model where analytic techniques haven't been tried before (and anything rational will work better).
Improved results can be highly leveraged; that is, an incremental improvement in a vital process will have a strong bottom-line impact. For instance, reducing "chargeoffs" in credit scoring from 10% to 9.8% could make a difference of millions of dollars.
A team skilled in each required activity. For other than very small projects, it is unlikely that one person will be sufficiently skilled in all activities. Even if that is so, one person will not have the time to do it all, including data extraction, data integration, analytical modeling, and report generation and presentation. But, more importantly, the analytic and business people must cooperate closely so that analytic expertise can build on the existing domain and process knowledge.
Data vigilance: Capture and maintain the accumulating information stream (e.g. , model results from a series of marketing campaigns).
Time: Learning occurs over multiple cycles. The corporate mantra of Dr. Ferdinand Porsche was "Racing improves the breed." Today, Porsche is the most profitable automobile manufacturer in the world.
Likewise, data mining models must be "raced" against reality. The customer acquisition model used in the first marketing campaign is not very likely to be optimal. Successive iterations breed successive increases in success. Each of these types of data mining applications followed a common methodology in principle.
Example of a data mining project: Classify a bat's species by its sound
Approach:
Use time-frequency features of echo-location signals to classify bat species in the field (no capture is necessary).
University of Illinois biologists gathered 98 signals from 19 bats representing 6 species.
Thirty-five data features were calculated from the signals, such as low frequency at the 3 dB level, time position of the signal peak, and amplitude ratio of the 1st and 2nd harmonics.
Multiple data mining algorithms were employed to relate the features to the species.
The groupings of bat signals are depicted in terms of the color and shape of the plotted symbols. From these groupings, we can see that it is likely that a modeling algorithm could distinguish between many of them (as many colored groups cluster), but not all (as there are multiple clusters for most bat types). The first set of models used decision trees and was 46% accurate. A second set used a new tree algorithm that looks two steps ahead and did better, at 58%. A third set of models used neural nets with different configurations of inputs. The best neural net solution increased the correct prediction rate to 68%, and it was observed that the simplest neural net architecture (the one with the fewest input variables) did best. The reduced set of inputs for the neural networks had been suggested by the inputs chosen by the two decision tree algorithms. Further models with nearest neighbors, using this same reduced set of inputs, also did as well as the best neural networks. Lastly, an ensemble of the estimates from four different types of models did better than any of the individual models.
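A minimal sketch of the combining step in such a study, assuming scikit-learn, a feature matrix X of the 35 time-frequency features, and species labels y; the estimator settings are hypothetical, and this is not the authors' original code.

```python
# Minimal sketch of combining several model families into an ensemble,
# loosely mirroring the bat-call study (data loading and settings hypothetical).
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

def build_ensemble():
    return VotingClassifier(
        estimators=[
            ("tree", DecisionTreeClassifier(max_depth=4)),
            ("net",  MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000)),
            ("knn",  KNeighborsClassifier(n_neighbors=3)),
        ],
        voting="soft",   # average predicted class probabilities across models
    )

# X: rows of time-frequency signal features, y: species labels (loaded elsewhere).
# from sklearn.model_selection import cross_val_score
# scores = cross_val_score(build_ensemble(), X, y, cv=5)
# print("cross-validated accuracy:", scores.mean())
```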
The bat signal modeling example illustrates several key points in the process of creating data mining models:
Multiple algorithms are better than a single algorithm.
Multiple configurations of an algorithm permit identification of the best configuration, which yields the best model.
Iteration is important and is the only way to assure that the final model is the right one for a given application.
Data mining can speed up the solution cycle, allowing you to concentrate on higher aspects of the problem.
Data mining can address information gaps in difficult-to-characterize problems.
The importance of domain knowledge
One data mining analyst might build a model with a data set and find very low predictability in the results. Another analyst might start with the same data set but create a model with a much higher predictability. Why the difference? In most cases like this, the difference is in the data preparation, not in the modeling algorithm chosen. Granted, some algorithms are clearly superior to others for a particular data set. But a model is no better than the predictor variables input to it. The second analyst may know much more about the business domain from which the data came. This intimate knowledge facilitates the derivation of powerful predictor variables from the set of existing variables.
There is simply no substitute for domain knowledge. If you don't have it, get it, either by learning it before building a model or by bringing it into the project team in the form of someone who does know it.
Postscript
Why Did Data Mining Arise?
Now, we can go on to a final dimension of our subject of analytical theory. Statistical analysis has been around for a long time. Why did data mining development occur when it did? Necessity may indeed be the mother of invention. During the past 50 years, business, industry, and society have accumulated a huge amount of data. It has been estimated that over 90% of the total knowledge we have now has been learned since 1950. Faced with huge data sets, analysts could bring computers to their "knees" with the processing of classical statistical analyses. A new form of learning was needed. A new approach to decision making based on input data had to be created to work in this environment of huge data sets. Scientists in artificial intelligence (AI) disciplines proposed that we use an approach modeled on the human brain rather than on Fisher's Parametric Model. From early AI research, neural nets were developed as crude analogs to the human thought process, and decision trees (hierarchical systems of Yes/No answers to questions) were developed as a systematic approach to discovering "truth" in the world around us. Data mining approaches were also applied to relatively small data sets, with predictive accuracies equal to or better than statistical techniques. Some medical and pharmaceutical data sets have relatively few cases but many hundreds of thousands of data attributes (fields). One such data set was used in the 2001 KDD Cup competition, which had only about 2,000 cases, but each case had over 139,000 attributes! Such data sets are not very tractable with parametric statistical techniques. But some data mining algorithms (like MARS) can handle data sets like this with relative ease.
Some Caveats with Data Mining Solutions
Hand (2005) summarized some warnings about using data mining tools for pattern discovery:
Data Quality: Poor data quality may not be explicitly revealed by the data mining methods, and this poor data quality will produce poor models. It is possible that poor data will support the building of a model with relatively high predictability, but the model will be a fantasy.
Opportunity: Multiple opportunities can transform the seemingly impossible into a very probable event. Hand refers to this as the problem of multiplicity, or the law of truly large numbers. For example, the odds of a person winning the lottery in the United States are extremely small, and the odds of that person winning it twice are fantastically so. But the odds of someone in the United States winning it twice (in a given year) are actually better than even. As another example, you can search the digits of pi for "prophetic" strings such as your birthday or significant dates in history and usually find them, given enough digits. (A toy calculation of this multiplicity effect appears at the end of this list.)
Interventions: One unintended result of a data mining model is that some changes will be made to invalidate it. For example, developing fraud detection models may lead to some effective short-term preventative measures. But soon thereafter, fraudsters may evolve in their behavior to avoid these interventions in their operations.
Separability: Often, it is difficult to separate the interesting information from the mundane information in a data set. Many patterns may exist in a data set, but only a few may be of interest to the data miner for solving a given problem. The definition of the target variable is one of the most important factors that determine which pattern the algorithm will find. For one purpose, retention of a customer may be defined very distinctively by using a variable like Close_date to derive the target. In another case, a 70% decline in customer activity over the last two billing periods might be the best way to define the target variable. The pattern found by the data mining algorithm for the first case might be very different from that of the second case.
Obviousness: Some patterns discovered in a data set might not be useful at all because they are quite obvious, even without data mining analysis. For example, you could find that there is an almost equal number of married men as married women (duh!). Or, you could learn that ovarian cancer occurs primarily in women and that check fraud occurs most often for customers with checking accounts.
Nonstationarity: Nonstationarity occurs when the process that generates a data set changes of its own accord. For example, a model of deer browsing propensity on leaves of certain species will be quite useless when the deer population declines rapidly. Any historical data on browsing will have little relationship to patterns after the population crash.
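To close, here is a toy calculation of the multiplicity effect described under Opportunity above; all of the numbers are made up and serve only to show how "someone, somewhere" events become far more likely than "this particular person" events.

```python
# Toy illustration of the "law of truly large numbers" (all figures hypothetical).
p = 1e-7              # assumed chance that one ticket wins a given drawing
tickets = 100         # assumed tickets one regular player buys in a year
players = 50_000_000  # assumed number of such players in the country

# Chance a *specific* player wins at least twice in a year (binomial tail).
p_two_for_one = 1 - (1 - p) ** tickets - tickets * p * (1 - p) ** (tickets - 1)

# Chance that *someone* among all players wins at least twice in a year.
p_two_for_some = 1 - (1 - p_two_for_one) ** players

print(f"one named player winning twice: {p_two_for_one:.3e}")
print(f"anyone at all winning twice:    {p_two_for_some:.3e}")
```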