APPARATUS AND METHOD FOR ESTIMATING WORD BOUNDARY PROBABILITY, APPARATUS AND METHOD FOR CONSTRUCTING PROBABILISTIC LANGUAGE MODEL, APPARATUS AND METHOD FOR KANA-KANJI CONVERSION, AND METHOD FOR CONSTRUCTING UNKNOWN WORD MODEL

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus and a technique that improve recognition accuracy in natural language processing by calculating word n-gram probabilities with high precision, making effective use of both a first corpus in which words are separated from one another and a second corpus in which they are not.

SOLUTION: The first corpus (words separated) is used to calculate n-grams and the probability that the position between two adjacent characters is a word boundary (the division probability). The second corpus (words unseparated) is assigned probabilistic word boundaries based on information from the first corpus and is then used in the calculation of word n-grams. To obtain these probabilistic word boundaries, the division probabilities calculated from the first corpus are assigned to every position between characters in the second corpus. An unknown-word model based on character units models the correspondence between each character and its reading, character by character. In this way, a model of kana-kanji conversion for unknown words is proposed.

COPYRIGHT: (C)2006,JPO&NCIPI
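The two-corpus procedure in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not say what the division probability is conditioned on, so the sketch assumes it depends only on the pair of adjacent characters, and it covers only unigram (n = 1) expected counts. The function names, the `max_len` cap, and the 0.5 default for unseen character pairs are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def boundary_probabilities(segmented_sentences):
    """Estimate division probabilities from the first (segmented) corpus.

    Assumption: the probability is conditioned on the (left, right)
    character pair. Each sentence is a list of word strings.
    """
    boundary = defaultdict(int)
    total = defaultdict(int)
    for words in segmented_sentences:
        chars, ends_word = [], []
        for word in words:
            for j, ch in enumerate(word):
                chars.append(ch)
                ends_word.append(j == len(word) - 1)
        for i in range(len(chars) - 1):
            pair = (chars[i], chars[i + 1])
            total[pair] += 1
            if ends_word[i]:
                boundary[pair] += 1
    return {pair: boundary[pair] / total[pair] for pair in total}


def expected_word_counts(text, probs, max_len=4, default=0.5):
    """Expected unigram counts for one unsegmented sentence.

    A substring counts as a word with weight
    P(boundary before) * prod(1 - P(boundary inside)) * P(boundary after).
    `default` (hypothetical) is used for character pairs never seen
    in the segmented corpus.
    """
    n = len(text)
    # p[i]: probability of a boundary just before text[i];
    # the sentence start and end are treated as certain boundaries.
    p = [1.0] + [probs.get((text[i - 1], text[i]), default)
                 for i in range(1, n)] + [1.0]
    counts = defaultdict(float)
    for start in range(n):
        no_inner_boundary = 1.0
        for end in range(start + 1, min(n, start + max_len) + 1):
            counts[text[start:end]] += p[start] * no_inner_boundary * p[end]
            if end < n:
                no_inner_boundary *= 1.0 - p[end]
    return counts


# Toy data: ASCII letters stand in for characters of an unsegmented script.
probs = boundary_probabilities([["ab", "c"], ["ab", "ab"]])
counts = expected_word_counts("abc", probs)
```

Summing such fractional counts over the whole second corpus, and adding the ordinary counts from the first corpus, would give the statistics from which word n-gram probabilities are estimated; extending the weight to consecutive substrings would give higher-order n-grams.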

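Bibliographic data

  • Publication number: JP2006031295A

  • Publication date: 2006-02-02

  • Original format: PDF

  • Applicant/patentee: INTERNATL BUSINESS MACH CORP IBM

  • Application number: JP20040207864

  • Inventors: MORI SHINSUKE; TAKUMA DAISUKE

  • Filing date: 2004-07-14

  • Classification: G06F17/27; G10L15/18

  • Country: JP

  • Added to database: 2022-08-21 21:52:58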
