Journal of University of Chinese Academy of Sciences

2009, Vol. 26, Issue (3): 400-407. DOI: 10.7523/j.issn.2095-6134.2009.3.015

• Research Articles •

Comparative study on text representation schemes in Chinese text classification

ZHANG Ai-Hua1,2, JING Ji-Wu2, XIANG Ji2   

  1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China;
    2. State Key Laboratory of Information Security, Graduate University of the Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2008-10-13 Revised: 2008-11-07 Online: 2009-05-15

Abstract:

We investigate representation methods for text classification, propose a framework for analyzing Chinese text representation algorithms, and measure how individual text representation factors affect classification performance. Using single Chinese characters as features directly gives better results than expected. Classification performance differs little whether articles are segmented with a smaller dictionary, a larger dictionary, or even a more complicated segmentation algorithm. Representing each feature with only a 0/1 value, indicating whether it is present in a text, still gives acceptable results. We also find that choosing reasonable vector values, for example through a suitable formalization algorithm, can greatly improve classification performance. These conclusions provide guidance for further applications.
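
The representation choices named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names (`char_features`, `build_vocab`, `binary_vector`, `l2_normalized_tf`) and the toy documents are hypothetical, and L2-normalized term frequency is only an assumed stand-in for the vector-value scheme, which the abstract does not specify.

```python
import math
from collections import Counter


def char_features(text):
    """Use single characters as features (no word segmentation)."""
    return [ch for ch in text if not ch.isspace()]


def build_vocab(docs):
    """Map every distinct character in the corpus to a vector index."""
    chars = sorted({ch for doc in docs for ch in char_features(doc)})
    return {ch: i for i, ch in enumerate(chars)}


def binary_vector(text, vocab):
    """0/1 representation: 1 if the feature occurs in the text, else 0."""
    vec = [0] * len(vocab)
    for ch in char_features(text):
        if ch in vocab:
            vec[vocab[ch]] = 1
    return vec


def l2_normalized_tf(text, vocab):
    """Term-frequency vector scaled to unit length (assumed scheme)."""
    counts = Counter(ch for ch in char_features(text) if ch in vocab)
    vec = [0.0] * len(vocab)
    for ch, count in counts.items():
        vec[vocab[ch]] = float(count)
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


# Toy corpus, made up for illustration only.
docs = ["信息安全", "文本分类分类"]
vocab = build_vocab(docs)
binary = binary_vector(docs[1], vocab)
weighted = l2_normalized_tf(docs[1], vocab)
```

Either vector could then be passed to any standard classifier. The binary form discards how often a character occurs, while the normalized form keeps relative frequency and removes the effect of document length.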

Key words: Chinese text classification, text representation, vectorization
