Murfi H. Machine learning for text indexing: concept extraction, keyword extraction and tag recommendation

Technical University of Berlin, 2010. — 108 p.

Full-text indexing has drawbacks, mainly semantic issues such as synonymy and polysemy, so alternative approaches have been considered to improve its performance. These alternatives include latent semantic indexing, keyword indexing, social indexing (Web 2.0) and linked data-based indexing (Semantic Web). The aim of this dissertation is to investigate applications of machine learning methods to these alternative approaches. The application areas are concept extraction, keyword extraction and tag recommendation.

Firstly, we propose a new learning method called two-level learning hierarchy (TLLH) to extract concepts from tagged textual contents. This learning method processes the two existing textual sources, i.e. the user-created tags and the textual contents, separately. At the lower level, concepts and concept-document relationships are discovered from the user-created tags by the non-negative matrix factorization (NMF) algorithm. Given these relationships, the concepts are populated at the higher level with terms from the textual contents. We expect this method to be successful because the hidden document structures are discovered from tags collectively created by users who understand the semantic content of the documents. Another advantage is that the NMF algorithm operates on more compact and cleaner data representations. Concept extraction from the textual contents, on the other hand, is handled by the non-negative least squares (NNLS) algorithm, which is much more efficient than the NMF algorithm. Moreover, the TLLH approach may have richer vocabularies because it can combine vocabularies from the user-created tags and the textual contents. Therefore, this approach is not only more reliable but also more efficient than the standard one-level learning hierarchy (OLLH), which extracts concepts only from the textual contents.

Next, we apply the extracted concepts to keyword extraction: we propose a new keyword extraction method called concept-based keyword extraction (CBKE). Its basic idea is that a term of a document is important if the term is associated with important concepts of the document and is itself important in the document. One advantage of the method is its flexibility regarding the characteristics of the learning data: it can operate on learning data either with or without manually assigned keywords. Finally, we apply our proposed CBKE methods to content-based tag recommendation in folksonomies. The results show that the tag recommendations achieve competitive performance in the ECML PKDD Discovery Challenge 2009.
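
The two-level idea described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the toy matrices, the function names `nmf` and `populate_concepts`, the matrix orientation (documents as rows) and the choice of Lee-Seung multiplicative updates for NMF are all assumptions made for illustration. The lower level factors a document-tag matrix to obtain document-concept weights; the higher level keeps those weights fixed and fits non-negative concept-term weights with one NNLS problem per term.

```python
import numpy as np
from scipy.optimize import nnls


def nmf(A, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix A (docs x tags) as W @ H.

    Uses Lee-Seung multiplicative updates for the Frobenius objective
    (an assumed choice; the source only says "NMF algorithm").
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape
    W = rng.random((n, k)) + eps  # document-concept weights
    H = rng.random((k, m)) + eps  # concept-tag weights
    for _ in range(n_iter):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H


def populate_concepts(W, X):
    """Fit non-negative concept-term weights T so that W @ T approximates X.

    W is held fixed (docs x k); X is a document-term matrix (docs x terms).
    Each term column is an independent NNLS problem.
    """
    T = np.zeros((W.shape[1], X.shape[1]))
    for j in range(X.shape[1]):
        T[:, j], _ = nnls(W, X[:, j])
    return T


# Hypothetical toy data: 4 documents, 4 tags, 3 terms.
tags = np.array([[2, 1, 0, 0],
                 [1, 2, 0, 0],
                 [0, 0, 2, 1],
                 [0, 0, 1, 2]], dtype=float)
terms = np.array([[3, 0, 1],
                  [2, 0, 0],
                  [0, 3, 1],
                  [0, 2, 0]], dtype=float)

W, H = nmf(tags, k=2)             # lower level: concepts from tags
T = populate_concepts(W, terms)   # higher level: terms per concept
print(np.round(T, 2))
```

In this sketch the expensive factorization runs only on the smaller tag matrix, while the text vocabulary is handled by independent least-squares fits, which mirrors the efficiency argument made in the abstract.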