Wu James, Coggeshall Stephen. Foundations of Predictive Analytics

Boca Raton: CRC Press, 2012. — 337 p. — (Data Mining and Knowledge Discovery Series). — ISBN10: 1439869464.

Drawing on the authors’ two decades of experience in applied modeling and data mining, Foundations of Predictive Analytics presents the fundamental background required for analyzing data and building models for many practical applications, such as consumer behavior modeling, risk and marketing analytics, and other areas. It also discusses a variety of practical topics that are frequently missing from similar texts.

The book begins with the statistical and linear algebra/matrix foundation of modeling methods, from distributions to cumulant and copula functions to the Cornish–Fisher expansion and other useful but hard-to-find statistical techniques. It then describes common and unusual linear methods as well as popular nonlinear modeling approaches, including additive models, trees, support vector machines, fuzzy systems, clustering, naïve Bayes, and neural nets. The authors go on to cover methodologies used in time series and forecasting, such as ARIMA, GARCH, and survival analysis. They also present a range of optimization techniques and explore several special topics, such as Dempster–Shafer theory.

An in-depth collection of the most important fundamental material on predictive analytics, this self-contained book provides the necessary information for understanding various techniques for exploratory data analysis and modeling. It explains the algorithmic details behind each technique (including underlying assumptions and mathematical formulations) and shows how to prepare and encode data, select variables, use model goodness measures, normalize odds, and perform reject inference.

What Is a Model?
What Is a Statistical Model?
The Modeling Process
Modeling Pitfalls
Characteristics of Good Modelers
The Future of Predictive Analytics
Properties of Statistical Distributions
Fundamental Distributions
Central Limit Theorem
Estimate of Mean, Variance, Skewness, and Kurtosis from Sample Data
Estimate of the Standard Deviation of the Sample Mean
(Pseudo) Random Number Generators
Transformation of a Distribution Function
Distribution of a Function of Random Variables
Moment Generating Function
Cumulant Generating Function
Characteristic Function
Chebyshev’s Inequality
Markov’s Inequality
Gram–Charlier Series
Edgeworth Expansion
Cornish–Fisher Expansion
Copula Functions
Important Matrix Relationships
Pseudo-Inverse of a Matrix
A Lemma of Matrix Inversion
Identity for a Matrix Determinant
Inversion of Partitioned Matrix
Determinant of Partitioned Matrix
Matrix Sweep and Partial Correlation
Singular Value Decomposition (SVD)
Diagonalization of a Matrix
Spectral Decomposition of a Positive Semi-Definite Matrix
Normalization in Vector Space
Conjugate Decomposition of a Symmetric Definite Matrix
Cholesky Decomposition
Cauchy–Schwarz Inequality
Relationship of Correlation among Three Variables
Linear Modeling and Regression
Properties of Maximum Likelihood Estimators
Linear Regression
Fisher’s Linear Discriminant Analysis
Principal Component Regression (PCR)
Factor Analysis
Partial Least Squares Regression (PLSR)
Generalized Linear Model (GLM)
Logistic Regression: Binary
Logistic Regression: Multiple Nominal
Logistic Regression: Proportional Multiple Ordinal
Fisher Scoring Method for Logistic Regression
Tobit Model: A Censored Regression Model
Nonlinear Modeling
Naive Bayesian Classifier
Neural Network
Segmentation and Tree Models
Additive Models
Support Vector Machine (SVM)
Fuzzy Logic System
Clustering
Time Series Analysis
Fundamentals of Forecasting
ARIMA Models
Survival Data Analysis
Exponentially Weighted Moving Average (EWMA) and GARCH(1, 1)
Data Preparation and Variable Selection
Data Quality and Exploration
Variable Scaling and Transformation
How to Bin Variables
Interpolation in One and Two Dimensions
Weight of Evidence (WOE) Transformation
Variable Selection Overview
Missing Data Imputation
Stepwise Selection Methods
Mutual Information, KL Distance
Detection of Multicollinearity
Model Goodness Measures
Training, Testing, Validation
Continuous Dependent Variable
Binary Dependent Variable (Two-Group Classification)
Population Stability Index Using Relative Entropy
Optimization Methods
Lagrange Multiplier
Gradient Descent Method
Newton–Raphson Method
Conjugate Gradient Method
Quasi-Newton Method
Genetic Algorithms (GA)
Simulated Annealing
Linear Programming
Nonlinear Programming (NLP)
Nonlinear Equations
Expectation-Maximization (EM) Algorithm
Optimal Design of Experiment
Miscellaneous Topics
Multidimensional Scaling
Simulation
Odds Normalization and Score Transformation
Reject Inference
Dempster–Shafer Theory of Evidence
Appendix A. Useful Mathematical Relations
Information Inequality
Relative Entropy
Saddle-Point Method
Stirling’s Formula
Convex Function and Jensen’s Inequality
Appendix B. DataMinerXL – Microsoft Excel Add-In for Building Predictive Models
Overview
Utility Functions
Data Manipulation Functions
Basic Statistical Functions
Modeling Functions for All Models
Weight of Evidence Transformation Functions
Linear Regression Functions
Partial Least Squares Regression Functions
Logistic Regression Functions
Time Series Analysis Functions
Naive Bayes Classifier Functions
Tree-Based Model Functions
Clustering and Segmentation Functions
Neural Network Functions
Support Vector Machine Functions
Optimization Functions
Matrix Operation Functions
Numerical Integration Functions
Excel Built-in Statistical Distribution Functions