CLUSTERING DOCUMENT TREES BASED ON SIMILARITY MEASURE

Authors

  • Manasa Sudha Akula*, Prajna Bodapati, Shashi Mogalla

Abstract

With rapid changes in technology, data mining and data warehousing are gaining prominence in computing. Retrieving information within large organizations is becoming a tedious task, and data mining now offers many powerful and innovative techniques for addressing the information retrieval problem. This paper introduces a novel approach for clustering text documents based on frequent subtrees. Document trees are constructed by extracting the noun hypernym relationships for every word in a text document, using the WordNet 2.1 lexical reference. This technique goes beyond traditional text mining approaches, which are based on frequent keyword occurrences: it can cluster documents even when they have no words in common. The key idea is to automate the clustering mechanism by discovering frequent subtrees across the document trees. To identify frequent subtree occurrences in the constructed document trees, a closed frequent substructure mining approach is employed. This approach uses depth-first search to discover all frequent subtrees without candidate generation or false-positive pruning, which accelerates the mining process. Clusters are then formed based on a similarity measure. The paper thus focuses on frequent subtree mining based on noun hypernyms to form clusters, yielding an automated system for easily searching, organizing, and maintaining voluminous text documents.
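
The pipeline described above (hypernym-based document trees, frequent substructures, similarity-based clusters) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a tiny hand-written hypernym map (`HYPERNYMS`, hypothetical) in place of WordNet 2.1, and it counts frequent parent-child edges, the smallest possible subtrees, instead of running closed frequent subtree mining. The function names, the Jaccard similarity, and the single-link grouping with a threshold are likewise illustrative assumptions.

```python
from itertools import combinations

# Hypothetical toy hypernym map (child -> parent), standing in for the
# WordNet noun hypernym hierarchy. Not taken from the paper.
HYPERNYMS = {
    "dog": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "mammal", "mammal": "animal", "animal": "entity",
    "car": "motor_vehicle", "truck": "motor_vehicle",
    "motor_vehicle": "vehicle", "vehicle": "artifact", "artifact": "entity",
}


def hypernym_chain(word):
    """Return the chain of concepts from the root down to `word`."""
    chain = [word]
    while chain[-1] in HYPERNYMS:
        chain.append(HYPERNYMS[chain[-1]])
    return list(reversed(chain))


def document_tree(words):
    """Represent a document tree as its set of (parent, child) edges."""
    edges = set()
    for word in words:
        if word not in HYPERNYMS:
            continue  # word unknown to the toy lexicon
        chain = hypernym_chain(word)
        edges.update(zip(chain, chain[1:]))
    return edges


def frequent_edges(trees, min_support):
    """Edges occurring in at least `min_support` document trees.

    Simplification: single edges only, not closed frequent subtrees.
    """
    counts = {}
    for tree in trees:
        for edge in tree:
            counts[edge] = counts.get(edge, 0) + 1
    return {edge for edge, n in counts.items() if n >= min_support}


def similarity(tree_a, tree_b, frequent):
    """Jaccard similarity over the frequent edges of two trees (assumed measure)."""
    a, b = tree_a & frequent, tree_b & frequent
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, min_support=2, threshold=0.5):
    """Group documents whose similarity reaches `threshold` (single-link)."""
    trees = [document_tree(doc) for doc in docs]
    frequent = frequent_edges(trees, min_support)
    parent = list(range(len(docs)))  # union-find forest over document indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(docs)), 2):
        if similarity(trees[i], trees[j], frequent) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(docs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    docs = [["dog"], ["cat"], ["car"], ["truck"]]
    print(cluster(docs))
```

In this toy run, the "dog" and "cat" documents share no words but share the ancestor edges leading to "carnivore", so they fall into one group, while "car" and "truck" form another. This mirrors the abstract's claim that documents can be clustered even without words in common.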

Published

2013-10-12