Journal: 《计算机工程》 (Computer Engineering)

New Word Identification Based on an Iterative Algorithm

         

Abstract

New word identification is an important foundation of Chinese information processing, but the strong word-forming capacity of Chinese characters makes new words difficult to detect. Inspired by the duality principle, this paper proposes a new word identification method based on an iterative algorithm. The target corpus is first segmented and tagged with parts of speech; string frequencies are then counted in two scanning passes and repeated patterns are extracted. Taking word-structure features into account, the method iteratively applies features of the repeated patterns, namely mutual information (MI), left (right) entropy, and the average right (left) entropy of the left (right) neighbors, to obtain a list of candidate new words. The candidate list is filtered one final time against a Chinese word collocation library to produce the final list of new words. Experimental results show that the method achieves a P@10 of 100% and raises P@100 to 90%, and that the average right (left) entropy of the left (right) neighbors improves the accuracy of new word identification to some extent.
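Two of the features named in the abstract, mutual information and left (right) entropy, can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `neighbor_entropy` and `bigram_features`, the `min_count` threshold, the restriction to two-token patterns, and the use of pointwise mutual information with base-2 logarithms are all assumptions made for illustration. The abstract does not define the average right (left) entropy of the left (right) neighbors precisely, so that feature is left out.

```python
import math
from collections import Counter, defaultdict


def neighbor_entropy(neighbors):
    """Shannon entropy (in bits) of a Counter of neighboring tokens."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def bigram_features(tokens, min_count=2):
    """Score repeated two-token patterns (hypothetical helper, not from the paper).

    Returns {(x, y): (count, pmi, left_entropy, right_entropy)} for every
    adjacent pair that occurs at least `min_count` times.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    # Collect the tokens seen immediately to the left and right of each pair.
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        if i > 0:
            left[pair][tokens[i - 1]] += 1
        if i + 2 < len(tokens):
            right[pair][tokens[i + 2]] += 1

    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    features = {}
    for pair, count in bigrams.items():
        if count < min_count:
            continue  # keep only repeated patterns
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))  # pointwise mutual information
        features[pair] = (
            count,
            pmi,
            neighbor_entropy(left[pair]),
            neighbor_entropy(right[pair]),
        )
    return features


if __name__ == "__main__":
    # Toy input: single characters stand in for segmented tokens.
    for pair, scores in bigram_features(list("xabyzabwqabv")).items():
        print(pair, scores)
```

A pattern with high mutual information and high entropy on both sides is internally cohesive and appears in varied contexts, which is the usual reason such features are used to rank candidate words. Any thresholds and the iteration schedule would have to come from the full paper.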
