Sonnenburg S. Machine learning for genomic sequence analysis

Technical University of Berlin, 2008. — 170 p.

With the development of novel sequencing technologies, the way has been paved for cost-efficient, high-throughput whole genome sequencing. In the year 2008 alone, about 250 genomes will have been sequenced. Handling this wealth of data requires efficient and accurate computational methods for sequence analysis. They are needed to tackle one of the most important problems in computational biology: the localisation of genes on DNA. In this thesis, I describe the development of state-of-the-art genomic signal detectors based on Support Vector Machines (SVMs) that can be used in gene finding systems. The main contributions of this work can be summarized as follows:

String Kernels: We have developed and extended string kernels so that they are particularly well suited for the detection of genomic signals. These kernels are computationally very efficient (their cost is linear in the length of the input sequences) and are applicable to a wide range of signal detection problems. Little prior knowledge is needed to select a suitable string kernel combination for use in a classifier that delivers high recognition accuracy.

Large Scale Learning: The training of SVMs used to be too computationally demanding to be applicable to datasets of genomic scale. We have developed large scale learning methods that enable the training of string kernel based SVMs on up to ten million instances and the application of the trained classifiers to six billion instances within reasonable time. The proposed linadd approach speeds up the computation of linear combinations of string kernels that are already linear-time. Due to its high efficiency, there is no need for kernel caches in SVM training. This leads to drastically reduced memory requirements.

Interpretability: An often criticised downside of SVMs with complex kernels is that it is very hard for humans to understand the learnt decision rules and to derive insight from them. We have opened the “black box” of SVM classifiers by developing two concepts helpful for their understanding: Multiple Kernel Learning (MKL) and Positional Oligomer Importance Matrices (POIMs). While MKL algorithms work with arbitrary kernels and are also useful in fusing heterogeneous data sources, POIMs are especially well suited for string kernels and the identification of the most discriminative sequence motifs.

Genomic Sequence Analysis: We have applied SVMs using novel string kernels to the detection of various genomic signals, such as transcription start sites and splice sites, outperforming state-of-the-art methods. Using POIMs, we analysed the trained classifiers, demonstrating the fidelity of POIMs by identifying, among others, many previously known functional sequence motifs. Finally, we have used the improved signal detectors to accurately predict gene structures. This thesis concludes with an outlook on how this framework prepares the ground for a fully fledged gene finding system outcompeting prior state-of-the-art methods.
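
To make the string kernel and linadd claims concrete, here is a minimal sketch. It assumes a plain k-mer spectrum kernel as a stand-in for the kernels developed in the thesis. All function names (`kmer_counts`, `spectrum_kernel`, `build_weights`, `score`) are hypothetical and are not taken from the thesis or from any library. The sketch only illustrates the general idea suggested by the abstract: a linear combination of kernel evaluations can be folded into a single lookup table, so that scoring a new sequence takes one pass over it (linear in its length for a fixed k).

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring of seq in a single pass."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(x, y, k):
    """Spectrum kernel: dot product of the k-mer count vectors of x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    if len(cx) > len(cy):
        cx, cy = cy, cx  # iterate over the smaller table
    return sum(n * cy[kmer] for kmer, n in cx.items())


def build_weights(support_seqs, coeffs, k):
    """Fold sum_i coeff_i * Phi(x_i) into one k-mer -> weight table.

    coeffs are assumed to be the signed SVM coefficients (alpha_i * y_i).
    """
    w = {}
    for seq, c in zip(support_seqs, coeffs):
        for kmer, n in kmer_counts(seq, k).items():
            w[kmer] = w.get(kmer, 0.0) + c * n
    return w


def score(w, seq, k):
    """Evaluate sum_i coeff_i * k(x_i, seq) using table lookups only."""
    return sum(w.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))


# Toy data, invented purely for illustration.
support = ["GATTACA", "ACGTACGT"]
coeffs = [0.5, -1.0]
query = "TTACGTA"

direct = sum(c * spectrum_kernel(s, query, 3) for s, c in zip(support, coeffs))
fast = score(build_weights(support, coeffs, 3), query, 3)
```

Here `direct` evaluates one kernel per support sequence, while `fast` touches the query once after the table has been built; both compute the same value. Once the table exists, scoring no longer depends on the number of support sequences, which is the kind of saving that matters when a trained classifier is applied to billions of positions.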