Richter S. Efficient indexing for big data in Hadoop MapReduce and main memory databases

pdf file
size 2,84 MB

added by Lesly 03/24/2019 22:47
info modified 03/25/2019 06:23

Richter S. Efficient indexing for big data in Hadoop MapReduce and main memory databases

Saarland University, 2016. — 167 p.

In recent years, the database community has witnessed the advent and breakthrough of many new system designs like the famous Hadoop MapReduce or main memory databases like MonetDB, Hekaton, SAP Hana, and HyPer to solve the problems of “Big Data”. The system architectures in this generation of emerging systems often radically differ from traditional relational databases. For database research, this trend creates new challenges to design and optimize data structures for those novel architectures. Premier candidates for innovations are index structures, which are traditionally among the most crucial performance factors in databases. In this thesis, we focus on efficient indexing methods for Hadoop MapReduce and main memory databases. Our work consists of three independent parts that resulted from different research projects. In the first part, we introduce HAIL, a novel approach for efficient static and adaptive indexing in Hadoop MapReduce. We believe that efficient indexing becomes increasingly important in the context of very large data sets. HAIL combines very low index build times that are often even invisible for users, with significant runtime improvements for selective MapReduce jobs. We provide extensive experiments, and show that HAIL can improve job runtimes by up to 68x over Hadoop. In the second part of this thesis, we present an in-depth evaluation of the adaptive radix tree ART, a recent and very promising competitor in the domain of tree-indexes for main memory databases. ART was reported by its inventors to be significantly faster than previous tree indexes and even competitive to hash tables. However, the original evaluation of ART did not consider Judy Arrays, which is, to the best of our knowledge, the first data structure introducing adaptivity to radix trees. Furthermore, the hash table used in the comparison with ART was just a textbook implementation of chained hashing and not a more sophisticated state-of-the-art hash tables. We provide an extended analysis and experimental evaluation of ART, including a detailed comparison to Judy Arrays, hashing via quadratic probing, and three variants of Cuckoo hashing. Our results give a more differentiated look on ART. In particular, we present striking conceptual similarities between ART and Judy Arrays and show that well-engineered hash tables can beat the lookup throughput of adaptive radix trees by up 6x. In the third part, motivated by our previous results, we take a closer look at hashing methods in main memory databases. We identify seven key factors that influence hashing performance, evaluate their impact, and discuss the implications on hashing in modern databases. Our study indicates that choosing the right hashing method and configuration can make an order of magnitude difference in insert and lookup performance. We also provide a guideline for practitioners on when to use which hashing method.