
Word boundary probability estimation device and method, probabilistic language model construction device and method, kana-kanji conversion device and method, and unknown word model construction method


Abstract

Word n-gram probabilities are calculated with high accuracy when the training corpus, a store of a vast quantity of sample sentences, consists of a first corpus, which is relatively small and contains manually segmented word information, and a second corpus, which is relatively large. The vocabulary, including contextual information, is expanded from words occurring in the relatively small first corpus to words occurring in the relatively large second corpus by using word n-gram probabilities estimated from an unknown word model and the raw corpus. The first corpus (word-segmented) is used for calculating n-grams and the probability that the position between two adjacent characters is a boundary between two words (the segmentation probability). The second corpus (word-unsegmented), to which probabilistic word boundaries are assigned based on information in the first corpus, is used for calculating word n-grams.
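The two-corpus procedure described in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not state: the segmentation probability is estimated per pair of adjacent characters by relative frequency, unseen pairs fall back to a fixed `default` value, candidate words are capped at `max_len` characters, and only unigram (n = 1) expected counts are computed. The function names `boundary_probs` and `expected_word_counts` are hypothetical, chosen for illustration only.

```python
from collections import defaultdict


def boundary_probs(segmented):
    """Estimate P(word boundary | left char, right char) by relative frequency.

    `segmented` is the small first corpus: a list of sentences, each a list of words.
    (Assumption: conditioning on the adjacent character pair.)
    """
    boundary = defaultdict(int)
    total = defaultdict(int)
    for words in segmented:
        chars = "".join(words)
        # Character offsets at which a word boundary occurs inside the sentence.
        cuts = set()
        pos = 0
        for w in words[:-1]:
            pos += len(w)
            cuts.add(pos)
        for i in range(1, len(chars)):
            pair = (chars[i - 1], chars[i])
            total[pair] += 1
            boundary[pair] += i in cuts
    return {pair: boundary[pair] / total[pair] for pair in total}


def expected_word_counts(raw, probs, default=0.5, max_len=4):
    """Expected unigram counts over the large unsegmented second corpus.

    A substring counts as a word with weight
    P(boundary before) * prod(1 - P(boundary inside)) * P(boundary after).
    """
    counts = defaultdict(float)
    for sent in raw:
        n = len(sent)
        # p[i] is the boundary probability just before character i;
        # the start and end of a sentence are certain boundaries.
        p = [1.0]
        p += [probs.get((sent[i - 1], sent[i]), default) for i in range(1, n)]
        p += [1.0]
        for start in range(n):
            inside = 1.0  # probability that no boundary falls inside the word
            for end in range(start + 1, min(n, start + max_len) + 1):
                if end - start > 1:
                    inside *= 1.0 - p[end - 1]
                counts[sent[start:end]] += p[start] * inside * p[end]
    return counts


if __name__ == "__main__":
    first_corpus = [["ab", "c"], ["a", "bc"]]   # toy word-segmented data
    second_corpus = ["abc"]                      # toy unsegmented data
    probs = boundary_probs(first_corpus)
    for word, count in sorted(expected_word_counts(second_corpus, probs).items()):
        print(word, count)
```

Dividing each expected count by the sum of all expected counts gives a unigram probability estimate. Extending the sketch to higher-order n-grams would mean accumulating the same boundary-weighted products over sequences of adjacent candidate words.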
