Apress, 2018. — 821 p. — ISBN: 1484230531.
Learn how to build a data science technology stack and perform good data science with repeatable methods. You will learn how to turn data lakes into business assets.
The data science technology stack demonstrated in Practical Data Science is built from components in general use in the industry. Data scientist Andreas Vermeulen demonstrates in detail how to build and provision a technology stack to yield repeatable results. He shows you how to apply practical methods to extract actionable business knowledge from data lakes consisting of data from a polyglot of data types and dimensions.
Data Science Technology Stack
Rapid Information Factory Ecosystem
Data Science Storage Tools
Data Lake
Data Vault
Data Warehouse Bus Matrix
Data Science Processing Tools
Spark
Mesos
Akka
Cassandra
Kafka
Elastic Search
R
Scala
Python
MQTT (MQ Telemetry Transport)
What’s Next?
Vermeulen-Krennwallner-Hillman-Clark
Windows
Linux
It’s Now Time to Meet Your Customer
Processing Ecosystem
Example Ecosystem
Sample Data
Layered Framework
Definition of Data Science Framework
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Homogeneous Ontology for Recursive Uniform Schema
The Top Layers of a Layered Framework
Layered Framework for High-Level Data Science and Engineering
Business Layer
Business Layer
Engineering a Practical Business Layer
Utility Layer
Basic Utility Design
Engineering a Practical Utility Layer
Three Management Layers
Operational Management Layer
Audit, Balance, and Control Layer
Balance
Control
Yoke Solution
Cause-and-Effect Analysis System
Functional Layer
Data Science Process
Retrieve Superstep
Data Lakes
Data Swamps
Training the Trainer Model
Understanding the Business Dynamics of the Data Lake
Actionable Business Knowledge from Data Lakes
Engineering a Practical Retrieve Superstep
Connecting to Other Data Sources
Assess Superstep
Assess Superstep
Errors
Analysis of Data
Practical Actions
Engineering a Practical Assess Superstep
Process Superstep
Data Vault
Time-Person-Object-Location-Event Data Vault
Data Science Process
Data Science
Transform Superstep
Transform Superstep
Building a Data Warehouse
Transforming with Data Science
Hypothesis Testing
Overfitting and Underfitting
Precision-Recall
Cross-Validation Test
Univariate Analysis
Bivariate Analysis
Multivariate Analysis
Linear Regression
Logistic Regression
Clustering Techniques
ANOVA
Principal Component Analysis (PCA)
Decision Trees
Support Vector Machines, Networks, Clusters, and Grids
Data Mining
Pattern Recognition
Machine Learning
Bagging Data
Random Forests
Computer Vision (CV)
Natural Language Processing (NLP)
Neural Networks
TensorFlow
Organize and Report Supersteps
Organize Superstep
Report Superstep
Graphics
Pictures
Showing the Difference
Closing Words