Pearson R.K. Mining Imperfect Data. Dealing with Contamination and Incomplete Records

Society for Industrial and Applied Mathematics, 2005. 316 p.
Data mining may be defined broadly as the use of automated procedures to extract useful information and insight from large data sets. In practice, these data sets contain various types of anomalous records that significantly complicate the analysis problem. In particular, the prevalence of outliers, missing or incomplete data, and other more subtle phenomena such as misalignments can completely invalidate the results obtained with standard analysis procedures, often with no indication that anything is wrong. This book is concerned with the problems of detecting these data anomalies and overcoming their deleterious effects.
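As a concrete illustration of the kind of outlier detection the book treats (not an excerpt from it), the following sketch implements a MAD-based rule of the Hampel-identifier type: the classical mean and standard deviation in a "3-sigma" test are replaced by the median and the scaled median absolute deviation, which remain stable in the presence of the very outliers being sought. The function name and threshold choice here are illustrative assumptions.

```python
import statistics

def hampel_outliers(values, t=3.0):
    """Flag points lying more than t scaled-MAD units from the median.

    The MAD (median absolute deviation) is scaled by 1.4826 so that it
    estimates the standard deviation for Gaussian data, making t=3.0
    roughly comparable to the classical 3-sigma rule, but far more
    resistant to the outliers the rule is trying to detect.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(x - med) for x in values)
    scale = 1.4826 * mad
    if scale == 0:
        # More than half the data are identical: no spread to judge by.
        return []
    return [i for i, x in enumerate(values) if abs(x - med) > t * scale]

data = [9.8, 10.1, 10.0, 9.9, 55.0, 10.2, 9.7]
print(hampel_outliers(data))  # flags index 4 (the value 55.0)
```

Note that a single gross outlier like 55.0 would inflate the sample standard deviation enough that a mean-based 3-sigma rule might fail to flag it at all; the median/MAD pair avoids this masking effect.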
Two ideas that are central to this book are data pretreatment and analytical validation. Data pretreatment is concerned with the issues of detecting outliers of various types, treatment strategies once we have found them, the closely allied problem of missing data, the detection of noninformative variables that should be excluded from subsequent analyses, and the use of simple preliminary analyses to detect other types of data anomalies such as misalignments. The essential idea behind pretreatment is the too-often-overlooked early computing axiom "garbage in, garbage out." Analytical validation is concerned with the assessment of results once we have them in hand, to determine whether they are garbage, gold, or something in between. The essential idea here is the use of systematic and, to the degree possible, extensive comparison to assess the quality of these results. This idea is formalized here as generalized sensitivity analysis (GSA), based on the following idea:
A "good" data analysis result should be insensitive to small changes in either the methods or the datasets on which the analysis is based.
Although this statement sounds very much like the definition of resistance in robust statistics, the GSA formulation differs in two important respects, both of which are particularly relevant to the data-mining problem. First, the GSA formulation is broader, since the notion of "small changes in a dataset" considered here is based on the concept of exchangeability as defined by Draper et al. (1993), whereas resistance usually involves either small changes in all of the data values or large changes in a few of the values. The difference is important since exchangeable datasets can include either of these data modifications, together with many others, such as randomly selected subsets of a larger dataset, subsets obtained by deleting observations, or subsets obtained by stratification based on auxiliary variables. The second important difference is that the GSA formulation is informal, based ultimately on graphical data summaries, although formal extensions are possible, as in the case of the comparison-based GSA strategies discussed in Chapter 7. This informality does cost us in terms of the power of the statements we can make about the data, but this loss is offset by the tremendous flexibility of the GSA framework. For example, this framework is applicable to any of the popular data-mining methods discussed by Cios, Pedrycz, and Swiniarski (1998) without modification.
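The GSA principle can be sketched in code: re-run an analysis on many randomly drawn subsets (one simple family of exchangeable datasets among those mentioned above) and examine the spread of the results. A result that varies little across subsets is, in the GSA sense, a "good" characterization. The function `gsa_spread` and its parameters below are illustrative assumptions, not the book's notation.

```python
import random
import statistics

def gsa_spread(data, analysis, n_subsets=200, frac=0.8, seed=0):
    """Re-run `analysis` on many random subsets of `data` and return the
    (min, max) range of the results. A narrow range suggests the result
    is insensitive to small changes in the dataset; a wide range warns
    that the result depends heavily on which records happen to be present.
    """
    rng = random.Random(seed)
    k = max(1, int(frac * len(data)))
    results = [analysis(rng.sample(data, k)) for _ in range(n_subsets)]
    return min(results), max(results)

# Contaminated sample: twenty ordinary values plus one gross outlier.
data = [10.0 + 0.1 * i for i in range(20)] + [500.0]

lo, hi = gsa_spread(data, statistics.mean)
print(f"mean varies over   [{lo:.1f}, {hi:.1f}]")  # wide range: sensitive
lo, hi = gsa_spread(data, statistics.median)
print(f"median varies over [{lo:.1f}, {hi:.1f}]")  # narrow range: insensitive
```

Here the mean swings widely depending on whether the outlier lands in a given subset, while the median barely moves; only the graphical or numerical comparison of the two spreads is needed, which is exactly the informal, comparison-based character of GSA described above.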
  • Imperfect Datasets: Character, Consequences, and Causes
  • Univariate Outlier Detection
  • Data Pretreatment
  • What Is a "Good" Data Characterization?
  • GSA
  • Sampling Schemes for a Fixed Dataset
  • Concluding Remarks and Open Questions