
METHOD AND APPARATUS FOR DISCOVERING NEW WORD

Abstract

Embodiments of the present invention relate to a method and apparatus for discovering new words. The method comprises: extracting morphemes from a target text in a target text library to construct a morpheme set H, counting the occurrence frequency of each morpheme, and representing each morpheme together with its frequency as a two-tuple to form a two-tuple set T; calculating a context association degree d for each subset w of a morpheme ti, and collecting the subsets w whose d value is greater than or equal to a preset association degree threshold to form a first candidate word set Ws; calculating a support degree and a confidence degree for each morpheme ti, and collecting the morphemes ti whose support degree and confidence degree are both greater than or equal to their corresponding minimum thresholds to form a second candidate word set Wt; and taking the intersection of Ws and Wt as a candidate new word set Wh, filtering Wh, and extracting the new words and saving them as a new word set W. Because the embodiments combine information entropy analysis with association rule analysis, the accuracy of new word discovery can be improved.
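The pipeline in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions that the abstract does not confirm: morphemes are taken to be single characters, candidate subsets w are adjacent character pairs, the context association degree d is the smaller of the left and right neighbour entropies, support is a pair's share of all pairs, confidence is the pair count divided by the count of its first character, and the final filter simply drops words already in a known dictionary. The function name `discover_new_words`, the helper `entropy`, and the parameters `d_min`, `sup_min`, and `conf_min` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of the distribution described by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def discover_new_words(texts, known_words, d_min=1.0, sup_min=0.05, conf_min=0.5):
    # Step 1 (assumption: a morpheme is one character).
    # freq maps morpheme -> frequency, playing the role of the two-tuple set T.
    freq = Counter(ch for text in texts for ch in text)

    # Count adjacent morpheme pairs (the assumed subsets w) and their neighbours.
    pair_count = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for text in texts:
        for i in range(len(text) - 1):
            w = text[i:i + 2]
            pair_count[w] += 1
            # "^" and "$" are placeholder neighbours at the text boundaries.
            left[w][text[i - 1] if i > 0 else "^"] += 1
            right[w][text[i + 2] if i + 2 < len(text) else "$"] += 1
    total_pairs = sum(pair_count.values())

    # Step 2 (assumed definition of d): min of left and right neighbour entropy.
    Ws = {w for w in pair_count
          if min(entropy(left[w]), entropy(right[w])) >= d_min}

    # Step 3 (assumed definitions): support and confidence of each pair.
    Wt = {w for w, c in pair_count.items()
          if c / total_pairs >= sup_min and c / freq[w[0]] >= conf_min}

    # Step 4: intersect the two candidate sets, then filter (assumed: drop known words).
    Wh = Ws & Wt
    return Wh - set(known_words)
```

With the toy corpus `["xaby", "pabq", "mabn", "kabz"]` and an empty dictionary, the pair `ab` is the only candidate that passes both the entropy test and the support and confidence test, so the function returns `{"ab"}`. The thresholds here are arbitrary and would need tuning on real text.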
