
Journal of University of Chinese Academy of Sciences ›› 2009, Vol. 26 ›› Issue (5): 703-711. DOI: 10.7523/j.issn.2095-6134.2009.5.017

• Article •

Fast dictionary mechanism for Chinese word segmentation

WU Jing-Jing1,2, JING Ji-Wu2, NIE Xiao-Feng2, WANG Ping-Jian2

  1. Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei 230027, China;
  2. State Key Laboratory of Information Security, Graduate University of the Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2008-10-16 Revised: 2009-04-21 Published: 2009-09-15
  • Corresponding author: WU Jing-Jing
  • Supported by:

    National High Technology Research and Development Program of China (863 Program) (2006AA01Z454), the National Information Security 242 Program (2005B23), and the National Natural Science Foundation of China (60573015)

Abstract:

With the growth of global networking through the Internet, the volume of text in Chinese and other native languages is increasing rapidly. Because these character-based languages lack explicit word separators, word segmentation is a prerequisite for processing them, and its performance affects the whole system. A survey of current Chinese word segmentation mechanisms shows that the key to fast segmentation is the recognition of single- and double-character words. Based on this observation, we propose a lexicon-based mechanism for Chinese word segmentation named double-character-and-long-word hash indexing (DCLWHI), which improves segmentation by making lookups of single- and double-character words more efficient. Experiments show that, compared with traditional lexicon mechanisms, DCLWHI improves the speed and efficiency of word segmentation without additional memory overhead while achieving the same accuracy.

Key words: real-time text processing, Chinese word segmentation, lexicon-based segmentation, double-character-and-long-word hash indexing (DCLWHI)
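
The abstract names the mechanism but does not describe its data structure, so the following is only a minimal sketch of the general idea, not the authors' DCLWHI implementation. It assumes a lexicon in which double-character words are resolved with a single hash probe and longer words are bucketed under their first two characters, combined with ordinary forward maximum matching. The names `TwoCharHashLexicon`, `longest_match`, and `segment`, as well as the sample word list, are hypothetical.

```python
from collections import defaultdict


class TwoCharHashLexicon:
    """Illustrative lexicon (an assumption, not the paper's layout):
    two-character words live in one hash set, and longer words are
    bucketed by their first two characters."""

    def __init__(self, words):
        self.double = set()             # two-character words
        self.longer = defaultdict(set)  # first two chars -> words of length >= 3
        self.max_len = 1
        for w in words:
            if len(w) == 2:
                self.double.add(w)
            elif len(w) > 2:
                self.longer[w[:2]].add(w)
            self.max_len = max(self.max_len, len(w))

    def longest_match(self, text, start):
        """Return the longest lexicon word that starts at text[start]."""
        prefix = text[start:start + 2]
        if len(prefix) == 2:
            # Long words are examined only if their two-character prefix is known.
            candidates = self.longer.get(prefix)
            if candidates:
                limit = min(self.max_len, len(text) - start)
                for n in range(limit, 2, -1):
                    if text[start:start + n] in candidates:
                        return text[start:start + n]
            if prefix in self.double:
                return prefix
        # Fall back to a single character, whether or not it is a lexicon word.
        return text[start]


def segment(text, lexicon):
    """Forward maximum matching over the whole text."""
    out, i = [], 0
    while i < len(text):
        word = lexicon.longest_match(text, i)
        out.append(word)
        i += len(word)
    return out


lexicon = TwoCharHashLexicon(["中文", "分词", "中文分词", "词典", "机制"])
print(segment("中文分词词典机制", lexicon))  # ['中文分词', '词典', '机制']
```

In this sketch, text made up mostly of one- and two-character words never enters the long-word loop, which is one plausible way a two-character hash index could speed up lookups; the paper's actual design and measurements may differ.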
