
Journal of University of Chinese Academy of Sciences, 2009, Vol. 26, Issue (3): 400-407. DOI: 10.7523/j.issn.2095-6134.2009.3.015

• Articles

Comparative study on text representation schemes in Chinese text classification

ZHANG Ai-Hua1,2, JING Ji-Wu2, XIANG Ji2

  1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China;
  2. State Key Laboratory of Information Security, Graduate University of the Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2008-10-13 Revised: 2008-11-07 Published: 2009-05-15
  • Corresponding author: ZHANG Ai-Hua
  • Supported by: National High Technology Research and Development Program of China (863 Program, 2006AA01Z454)

Abstract:

We study text representation methods for Chinese text classification and propose a framework for analyzing the factors involved in representing Chinese text. By analyzing experimental results on three data sets, we determine how each representation factor affects classification performance. Splitting text directly into individual Chinese characters already yields fairly good classification results. Simple word segmentation without a large dictionary, segmentation with a large dictionary, and more complex segmentation algorithms differ little in their effect on classification. Using only binary (0/1) values to indicate whether a feature occurs in a text also gives fairly good results. Choosing reasonable vector values, for example by applying a suitable normalization algorithm, can improve classification accuracy considerably. These conclusions provide guiding principles for subsequent applications.

Key words: Chinese text classification, text representation, vectorization
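
The representation factors named in the abstract (character-level splitting, binary feature values, and normalized vector values) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy documents, and the choice of L2 normalization are assumptions made for illustration, since the abstract does not specify which normalization algorithm was used.

```python
import math
from collections import Counter


def char_tokens(text):
    """Split text into single characters, skipping whitespace (no dictionary needed)."""
    return [ch for ch in text if not ch.isspace()]


def build_vocab(docs):
    """Map each distinct character in the corpus to a column index."""
    vocab = {}
    for doc in docs:
        for tok in char_tokens(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab


def binary_vector(doc, vocab):
    """0/1 representation: 1.0 if the character occurs in the document, else 0.0."""
    vec = [0.0] * len(vocab)
    for tok in set(char_tokens(doc)):
        if tok in vocab:
            vec[vocab[tok]] = 1.0
    return vec


def tf_vector(doc, vocab):
    """Raw term-frequency representation: occurrence count of each character."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(char_tokens(doc)).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (assumed normalization scheme)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


# Hypothetical toy corpus, for illustration only.
docs = ["信息安全", "文本分类 文本表示"]
vocab = build_vocab(docs)
binary = binary_vector(docs[1], vocab)
tf_norm = l2_normalize(tf_vector(docs[1], vocab))
```

Either `binary` or `tf_norm` could then be passed to any standard classifier; comparing the two on the same data is one way to examine the effect of the vector-value factor in isolation.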
