Journal: 《计算机应用》 (Chinese-language journal)

A Microblog New Word Detection Method Integrating Rules and Statistics

         

Abstract

Microblog new words follow highly varied and complex formation rules, and extracting them with the traditional C/NC-value method has two main problems: the boundaries of the identified new words are often inaccurate, and low-frequency new words are frequently not detected correctly. To address these problems, a method is proposed that integrates manually designed heuristic rules, a modified C/NC-value method and a Conditional Random Field (CRF) model. On the one hand, the heuristic rules classify and generalize microblog new words and describe how they are formed in terms of Part Of Speech (POS), character types and ideographic symbols; the rules were summarized manually by observing a large number of microblog documents. On the other hand, the traditional C/NC-value method is modified by introducing word frequency, branch entropy, mutual information and other statistical features to reconstruct the NC-value objective function, with the aim of improving both the boundary accuracy of identified new words and the detection accuracy for low-frequency new words. Finally, a CRF model is used to train on and identify new words. The experimental results show that, compared with traditional methods, the proposed method effectively improves the F value of microblog new word detection.
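As a rough illustration of the statistical features named in the abstract (word frequency, branch entropy and mutual information), the following Python sketch computes them for character n-gram candidates. It is not the paper's modified NC-value objective function, which the abstract does not spell out. The function names (`candidate_stats`, `branch_entropy`, `min_pmi`), the choice of the minimum over left and right entropy, and the probability estimate based on total character count are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def candidate_stats(sentences, max_len=4):
    """Count character n-grams and the characters seen to their left and right."""
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    total_chars = 0
    for s in sentences:
        total_chars += len(s)
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                gram = s[i:i + n]
                freq[gram] += 1
                if i > 0:
                    left[gram][s[i - 1]] += 1
                if i + n < len(s):
                    right[gram][s[i + n]] += 1
    return freq, left, right, total_chars


def entropy(counter):
    """Shannon entropy (base 2) of a neighbor-character distribution."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def branch_entropy(gram, left, right):
    """Assumed combination: the smaller of left and right neighbor entropy.

    A low value suggests the candidate is usually embedded in a longer unit,
    so its boundary is probably wrong.
    """
    return min(entropy(left.get(gram, Counter())),
               entropy(right.get(gram, Counter())))


def min_pmi(gram, freq, total_chars):
    """Lowest pointwise mutual information over all binary splits of the candidate.

    Probabilities are estimated as count / total_chars (a simplification).
    Requires len(gram) >= 2 and a candidate that occurs in the corpus.
    """
    def p(x):
        return freq[x] / total_chars

    return min(
        math.log2(p(gram) / (p(gram[:k]) * p(gram[k:])))
        for k in range(1, len(gram))
    )
```

On a toy corpus such as `["我喜欢给力的电影", "这个真给力啊", "太给力了", "给力不给力"]`, the candidate "给力" occurs five times with four distinct neighbors on each side, so it scores high on all three features. In a pipeline like the one described, such scores could feed a ranking function or serve as CRF features, but how the paper combines them is not stated in the abstract.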
