Text Clustering by Concept Similarity Frequency Score

S. Prapurna, Moghal Nisar Ahmed Baig, R. Paripurna Chander


Text document clustering has been studied intensively because of its important role in text mining and information
retrieval. The high dimensionality caused by the large number of words is a persistent problem in vector space
model clustering. However, a text document is not only a collection of words (a “bag of words”) but also a
collection of concepts. Therefore, if the term-document matrix can be converted into a concept-document matrix, the
reduction in dimension will be significant. This paper explores experimentally the transformation of the matrix and the
performance of concept-based clustering. The concept-document matrix was constructed using cluster centers. Three
clustering models were selected: hierarchical, partitional, and hybrid. Four similarity techniques (Group Average,
Complete Link, Single Link, and Cluster Center) were tried for hierarchical clustering, K-Means and Bisecting K-Means
for partitional clustering, and Buckshot for hybrid clustering. Collections of 500-800 manually categorized news texts
were used to test these algorithms, with the F-measure as the criterion of clustering performance. Results show that
concept-based clustering can notably improve clustering performance, from 80% to 90%, compared with word-based clustering.
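
The two ideas in the abstract, collapsing a term-document matrix into a concept-document matrix through cluster centers and scoring a cluster with the F-measure, can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function names, the toy data, and the term-to-concept assignment (which would normally come from a clustering step) are all assumptions made for illustration, and the F-measure shown is the basic per-cluster, per-category form.

```python
# Sketch only: helper names, toy data, and the term-to-concept assignment
# are illustrative assumptions, not the paper's published procedure.

def concept_document_matrix(term_doc, term_to_concept, n_concepts):
    """Collapse a term-document matrix into a concept-document matrix.

    term_doc[t][d] is the weight of term t in document d.
    term_to_concept[t] is the concept (cluster) index assigned to term t.
    Each concept row is the cluster center (mean) of its member term rows.
    """
    n_docs = len(term_doc[0])
    sums = [[0.0] * n_docs for _ in range(n_concepts)]
    counts = [0] * n_concepts
    for row, c in zip(term_doc, term_to_concept):
        counts[c] += 1
        for d, w in enumerate(row):
            sums[c][d] += w
    # Empty concepts get an all-zero row instead of dividing by zero.
    return [
        [s / counts[c] if counts[c] else 0.0 for s in sums[c]]
        for c in range(n_concepts)
    ]


def f_measure(cluster, category):
    """F-measure of one cluster (set of doc ids) against one category."""
    overlap = len(cluster & category)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(category)
    return 2 * precision * recall / (precision + recall)


# Toy data: 4 terms x 3 documents, grouped into 2 hypothetical concepts.
term_doc = [
    [2, 0, 1],  # "goal"
    [4, 0, 1],  # "match"
    [0, 3, 0],  # "election"
    [0, 1, 2],  # "vote"
]
concepts = concept_document_matrix(term_doc, [0, 0, 1, 1], 2)
print(concepts)                    # 2 concept rows instead of 4 term rows
print(f_measure({0, 2}, {0}))      # cluster {0, 2} scored against category {0}
```

In this toy case the matrix shrinks from four rows to two, which is the dimension reduction the abstract refers to; documents would then be clustered on the columns of the smaller matrix.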


Document similarity, text classification, text mining, concept similarity, term frequency, document categorization
