University of Freiburg, 2015. — 161 p.
Text mining covers a range of methods for extracting information from literature resources, which are usually unstructured. The largest freely available repository for searching this information is PubMed. Among the many software solutions for gaining knowledge from texts, the newly developed software library PubMed2Go provides a unified approach to indexing abstracts and their meta information on a local computer, making them easily searchable. With an appropriate infrastructure, sophisticated approaches such as machine learning can be used to learn and predict patterns in texts. In this thesis, such models were built to extract functional compound-protein relationships from sentences of PubMed abstracts, using support vector machines with two different kernels. The approach reached an F1 score of around 80 % on a newly developed and annotated benchmark data set. Text mining also makes it possible to connect textual information efficiently to other sources, such as structures of chemical substances and sequences of proteins, by mapping their synonyms in texts to unique identifiers in specific databases. ChemIDplus is one such information repository and includes expert annotations. It was used to predict the toxicity of small molecules based on their median lethal dose. Four machine learning classifiers (decision tree, random forest, artificial neural network, and support vector machine) reached accuracies of up to 91 % with different sets of molecular descriptors. The best result was achieved by the random forest approach, with an area under the curve of around 97 % on a clearly separated data set. Synchronising user-annotated data with information sources such as textual and structural identifiers is a complex task; this thesis describes it for StreptomeDB, a database containing diverse information about the bacterial genus Streptomyces. 
The presented update pipeline added around 1,600 structures, produced by around 600 Streptomyces strains, to the new database version, which also contains a range of curated synthesis pathways and activities. The presented results show that combining machine learning with automated text mining and manual curation is a valuable approach that links published information and generates new knowledge.