Technical University of Berlin, 2018 - 130 p.
The popularity of the World Wide Web and its ubiquitous global online services have led
to unprecedented amounts of available data. Novel distributed data processing systems
have been developed in order to scale out computations and analysis to such massive
data set sizes. These "Big Data Analytics" systems are also popular choices to scale out
the execution of machine learning algorithms. However, it remains an open question how
efficiently these systems perform at this task and how to adequately evaluate and benchmark
them for scalable machine learning workloads in general. In this thesis, we present work
on all crucial building blocks for a benchmark of distributed data processing systems
for scalable machine learning, including extensive experimental evaluations of distributed
data flow systems.
First, we introduce a representative set of distributed machine learning algorithms
suitable for large-scale distributed settings which closely resemble industry-relevant
applications and provide generalizable insights into system performance. We specify data
sets, workloads, experiments, and metrics that address all relevant aspects of scalability,
including the important aspect of model dimensionality. We provide the results of a
comprehensive experimental evaluation of popular distributed data flow systems, which
highlight shortcomings in these systems. Our results show that, while current
state-of-the-art data flow systems scale robustly with increasing data set sizes, they are
surprisingly inefficient at coping with high-dimensional data, which is a crucial requirement
for large-scale machine learning algorithms.
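As an illustration of the kind of dimensionality-scaling experiment described above, the following Python sketch (our own illustration, not taken from the thesis; the synthetic sparse data, the chosen dimensions, and the SGD-based logistic regression workload are assumptions) times training while the number of features grows and the number of examples stays fixed:

```python
import time
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import SGDClassifier

# Hypothetical micro-benchmark: train a logistic regression model on sparse
# synthetic data while the model dimensionality (number of features) grows.
N_EXAMPLES = 10_000
DIMENSIONS = [1_000, 10_000, 100_000, 1_000_000]

rng = np.random.default_rng(42)
for d in DIMENSIONS:
    # Sparse feature matrix with ~0.1% non-zero entries, as in bag-of-words data.
    X = sparse_random(N_EXAMPLES, d, density=0.001, format="csr", random_state=42)
    y = rng.integers(0, 2, size=N_EXAMPLES)

    model = SGDClassifier(loss="log_loss", max_iter=5)
    start = time.perf_counter()
    model.fit(X, y)
    print(f"d={d:>9}: {time.perf_counter() - start:.2f}s")
```

Running the same workload on a distributed data flow system and comparing how the runtime grows with the dimension d is one way to expose the inefficiency noted above.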
Second, we propose methods and experiments to explore the trade-off space between
the runtime for training a machine learning model and the model quality. We make the
case for state-of-the-art single-machine algorithms as baselines when evaluating distributed
data processing systems for scalable machine learning workloads, and we present such an
experimental evaluation for two popular and representative machine learning algorithms
on distributed data flow systems and single-machine libraries. Our results show that
even latest-generation distributed data flow systems require substantial hardware resources
to provide prediction quality comparable to a state-of-the-art single-machine library
within the same time frame. This insight is valuable for future systems papers
as well as for scientists and practitioners considering distributed data processing systems
for applying machine learning algorithms to their problem domain.
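A single-machine baseline of the kind argued for here can be as simple as the following sketch (our own illustration, not the thesis code; the dataset, target accuracy, and SGD-based model are assumptions), which records the wall-clock time a library needs to reach a given hold-out accuracy:

```python
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical time-to-quality measurement for a single-machine baseline:
# how long does the library take to reach a target hold-out accuracy?
X, y = make_classification(n_samples=50_000, n_features=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

TARGET_ACCURACY = 0.90
model = SGDClassifier(loss="log_loss", random_state=0)

start = time.perf_counter()
for epoch in range(1, 51):
    # partial_fit performs one pass over the training data per call.
    model.partial_fit(X_train, y_train, classes=[0, 1])
    accuracy = accuracy_score(y_test, model.predict(X_test))
    if accuracy >= TARGET_ACCURACY:
        break
print(f"reached {accuracy:.3f} accuracy after {epoch} epochs "
      f"in {time.perf_counter() - start:.2f}s")
```

The same measurement, run against a distributed system on identical data, yields the time-to-quality comparison the trade-off analysis builds on.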
Third, we present work on reducing the operational complexity of carrying out
benchmark experiments. We introduce a framework for defining, executing, analyzing
and sharing experiments on distributed data processing systems. On the one hand, this
framework automatically orchestrates experiments; on the other hand, it introduces a
unified and transparent way of specifying experiments, including the actual application
code, system configuration, and experiment setup description, thus enabling the sharing of
end-to-end experiment artifacts. With this, our framework fosters reproducibility and
portability of benchmark experiments and significantly reduces the "entry barrier" to
running benchmarks of distributed data processing systems.
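The following Python sketch is purely illustrative of what such a unified experiment specification might look like; the class and field names are hypothetical and do not reflect the framework's actual API:

```python
from dataclasses import dataclass, field

# Hypothetical, simplified experiment specification bundling application code,
# system configuration, and experiment setup into one shareable artifact.
@dataclass
class ExperimentSpec:
    name: str
    application_jar: str               # path to the packaged application code
    system: str                        # e.g. "spark" or "flink" (illustrative values)
    system_config: dict = field(default_factory=dict)
    dataset: str = ""
    repetitions: int = 3

spec = ExperimentSpec(
    name="logreg-scaling",
    application_jar="jobs/logreg-assembly.jar",
    system="spark",
    system_config={"executor.memory": "8g", "executor.instances": "16"},
    dataset="hdfs:///data/training-sample",
    repetitions=5,
)

# An orchestration layer (hypothetical) would deploy the system with
# spec.system_config, submit spec.application_jar, and collect results
# for each of the spec.repetitions runs.
```

Because such a specification captures the application code, the system configuration, and the experiment setup in one place, the entire experiment can be versioned and shared as a single end-to-end artifact.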