
Ghayoomi M. From HPSG-based Persian treebanking to parsing: machine learning for data annotation

  • pdf file
  • size 4,32 MB
Free University of Berlin, 2014. — 261 p.
Parsing is a step in natural language understanding that identifies the words of a sentence and the grammatical relations among them. Statistical parsers require a set of annotated data, called a treebank, to learn the grammar of a language and to apply the learnt model to new, unseen data. Such annotated data is not available for all languages, and developing it is time-consuming, tedious, and expensive. In this dissertation, we propose a method for treebanking from scratch using machine learning. We first propose a bootstrapping approach to initialize the data annotation process, with the aim of reducing the human intervention needed to annotate the data. After developing a small data set, we use it to train a statistical parser. This small data set suffers from data sparsity at both the lexical level and the level of syntactic constructions, so a parser trained on it might perform poorly in a real application. To resolve the data sparsity problem at the lexical level, we propose an unsupervised word clustering approach that provides a more coarse-grained representation of the lexical items. To resolve the data sparsity problem at the syntactic construction level, we propose active learning, a promising supervised method for seeking informative samples in a data pool. Data annotated through active learning helps a learner reach performance similar to that of a learner trained on the complete set of annotated data. Consequently, active learning can substantially reduce the amount of annotated data required.
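The abstract does not say which criterion is used to pick informative samples, so the following is only a minimal sketch of one common choice, entropy-based uncertainty sampling over a pool of unannotated sentences. The function names, the toy sentences, and the probability table standing in for a parser's confidence scores are all hypothetical and are not taken from the dissertation.

```python
import math
from typing import Callable, Dict, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(
    pool: List[str],
    predict_proba: Callable[[str], Sequence[float]],
    batch_size: int,
) -> List[str]:
    """Return the batch_size pool items whose predictions have the highest entropy."""
    ranked = sorted(pool, key=lambda s: entropy(predict_proba(s)), reverse=True)
    return ranked[:batch_size]


# Hypothetical stand-in for a parser: each unannotated sentence maps to a
# made-up probability distribution over its candidate analyses.
toy_scores: Dict[str, List[float]] = {
    "sentence A": [0.98, 0.01, 0.01],  # model is confident
    "sentence B": [0.40, 0.35, 0.25],  # model is very unsure
    "sentence C": [0.70, 0.20, 0.10],  # model is somewhat unsure
}

# Pick the two sentences the model is least sure about; in an active
# learning loop these would be sent to a human annotator, added to the
# training set, and the model retrained before the next round.
selected = select_most_uncertain(
    list(toy_scores), toy_scores.__getitem__, batch_size=2
)
print(selected)  # ['sentence B', 'sentence C']
```

In a real setting the lookup table would be replaced by scores from the parser trained on the small bootstrapped data set, and the selection step would be repeated after each retraining round.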