
Journal of University of Chinese Academy of Sciences ›› 2006, Vol. 23 ›› Issue (5): 640-646. DOI: 10.7523/j.issn.2095-6134.2006.5.012


Quality Evaluation of Text Clustering Algorithms

刘务华; 罗铁坚; 王文杰   

  1. Graduate University of Chinese Academy of Sciences, Beijing 100080
  • Received: 1900-01-01 Revised: 1900-01-01 Published: 2006-09-15

Quality Evaluation for Three Textual Document Clustering Algorithms

LIU Wu-Hua, LUO Tie-Jian, WANG Wen-Jie   

  1. Graduate University of Chinese Academy of Sciences, Beijing 100039
  • Received:1900-01-01 Revised:1900-01-01 Published:2006-09-15

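Abstract (Chinese version, translated): Text clustering is one of the effective means of building an instance of a classification scheme for a large-scale text collection. This paper discusses how to evaluate clustering quality quantitatively using standard classification test collections, and compares the k-Means clustering algorithm, the STC (Suffix Tree Clustering) algorithm, and an Ant-Based clustering algorithm experimentally. Analysis of the experimental results shows that the STC algorithm clusters well because it takes the phrase characteristics of text fully into account when processing documents; that the results of the Ant-Based clustering algorithm are strongly affected by its input parameters; and that introducing text characteristics into the Ant clustering algorithm can improve the quality of its clustering results.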

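Keywords (Chinese version, translated): text clustering, quality evaluation, validity verification, suffix tree clustering, Ant-Based clustering, k-Means clustering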

Abstract: Textual document clustering is one of the effective approaches to establishing a classification instance for a huge set of textual documents. Clustering validation, or quality evaluation, techniques can be used to assess the efficiency and effectiveness of a clustering algorithm. This paper presents quality evaluation criteria of two kinds, external and internal. Based on these criteria, we assess three typical textual document clustering algorithms experimentally. The comparison shows that the STC (Suffix Tree Clustering) algorithm performs better than the k-Means and Ant-Based clustering algorithms. STC's better performance comes from taking linguistic properties into account when processing the documents. The performance of the Ant-Based clustering algorithm varies with its input parameters. Adopting linguistic properties is necessary to improve the performance of Ant-Based text clustering.

Key words: Textual document clustering, Quality evaluation, Clustering validation, STC, Ant-Based Clustering, k-Means Clustering
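
The abstract describes evaluating clustering quality against a standard classification test collection, that is, with an external criterion. The abstract does not name the specific measures used in the paper, so the following is only a minimal sketch that assumes purity, one common external measure, as an illustration. The function name `purity` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def purity(labels, clusters):
    """Fraction of documents that belong to the majority class of their cluster.

    labels   -- gold category of each document (from the labeled test set)
    clusters -- cluster id assigned to each document by the algorithm
    """
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must have the same length")
    if not labels:
        raise ValueError("need at least one document")

    # Count how often each gold category occurs inside each cluster.
    by_cluster = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        by_cluster[cluster][label] += 1

    # Each cluster contributes the size of its largest category.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


# Hypothetical toy data: six documents with gold categories and cluster ids.
gold = ["sport", "sport", "sport", "econ", "econ", "econ"]
pred = [0, 0, 1, 1, 1, 1]
score = purity(gold, pred)  # cluster 0 contributes 2, cluster 1 contributes 3
```

Purity rises trivially as the number of clusters grows (one cluster per document always scores 1.0), so it is only meaningful when the compared algorithms produce a similar number of clusters.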
