
Journal of University of Chinese Academy of Sciences ›› 2005, Vol. 22 ›› Issue (5): 554-559. DOI: 10.7523/j.issn.2095-6134.2005.5.004

• Review •

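A Fast Text Classification Method Based on k-Nearest Neighbors

ZHANG Qing-Guo1, ZHANG Hong-Wei2, ZHANG Jun-Yu1

  1. Department of Mathematics, Graduate School of the Chinese Academy of Sciences, Beijing 100049;
  2. Optical Memory National Engineering Research Center, Tsinghua University, Beijing 100084
  • Received: 2004-08-09 Revised: 2004-11-08 Published: 2005-09-15
  • Corresponding author: ZHANG Qing-Guo, E-mail: qgzhang@mails.gscas.ac.cn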

A Fast Text Categorization Approach Based on k-Nearest Neighbor

ZHANG Qing-Guo1, ZHANG Hong-Wei2, ZHANG Jun-Yu1   

  1. Department of Mathematics, Graduate School of the Chinese Academy of Sciences, Beijing 100049, China;
    2. Optical Memory National Engineering Research Center, Tsinghua University, Beijing 100084, China
  • Received:2004-08-09 Revised:2004-11-08 Published:2005-09-15

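Abstract (from the Chinese):

The k-nearest neighbor method is a simple and effective approach to text classification. With a very large training set, however, the exhaustive search for the globally optimal neighbors that the traditional method requires is practically infeasible. Accelerating the search for the k nearest neighbors is therefore the key to making the method practical. This paper proposes a fast text classification method based on k-nearest neighbor that ensures fast and effective classification on massive data sets. Experimental results show that the method performs significantly better than the traditional method.

Key words: text classification, k-nearest neighbor, multidimensional index, similarity retrieval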

Abstract:

k-Nearest Neighbor (k-NN) is one of the simplest and most effective algorithms for text categorization. However, k-NN search requires intensive similarity computations; for a large training set in particular, searching the whole set is unacceptable. Speeding up k-NN search is therefore key to making k-NN categorization useful in practice. This paper presents a fast text categorization approach based on k-NN that can classify textual documents quickly and efficiently even when searching a very large training set. Experiments show that the new algorithm greatly improves performance.

Key words: text categorization, k-Nearest Neighbor(k-NN), multidimensional index, similarity retrieval
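
To make the cost that the abstract refers to concrete, here is a minimal sketch of the exhaustive k-NN baseline, in which every query is compared against every training document. It is not the paper's accelerated method, whose indexing scheme is not described on this page. The bag-of-words vectors, the cosine similarity, the majority vote, the names `vectorize`, `cosine`, and `knn_classify`, and the toy training set are all assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (an assumed, deliberately simple representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query, training, k=3):
    """Exhaustive k-NN: one similarity computation per training document."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), label) for doc, label in training]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])  # majority vote among the k nearest
    return votes.most_common(1)[0][0]


# Hypothetical toy training set, for illustration only.
training = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("investors buy shares bonds", "finance"),
    ("team wins football match", "sports"),
    ("player scores goal in match", "sports"),
    ("coach praises team defense", "sports"),
]

print(knn_classify("football team scores goal", training, k=3))
```

Each call performs one similarity computation per training document, so the work per query grows linearly with the size of the training set. That linear scan is the step that an index-based approach such as the one named in the keywords would aim to avoid.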
