Journal: Computer Speech and Language

Statistical language modeling based on variable-length sequences


Abstract

In natural language, and especially in spontaneous speech, people often group words into phrases that become usual expressions. This happens for phonological reasons (to make pronunciation easier) or semantic ones (a block of words is easier to remember when it is assigned a single meaning). Classical language models do not adequately account for such phrases. A better approach is to model some word sequences as if they were individual dictionary elements: the sequences are treated as additional vocabulary entries, over which the language models are then computed. In this paper, we present a method for automatically retrieving the most relevant phrases from a corpus of written sentences. The originality of our approach lies in the fact that the phrases are extracted from a linguistically tagged corpus; the phrases obtained are therefore linguistically viable. To measure the contribution of classes to phrase retrieval, we implemented the same algorithm without classes; the class-based method outperformed the class-free one by 11%. Our approach uses information-theoretic criteria that ensure high statistical consistency and make the decision to select a candidate sequence optimal with respect to the language perplexity. We propose several variants of the language model, with and without word sequences; among them, we present a model in which the trigger pairs are linguistically more significant. We show that the use of sequences decreases the word error rate and improves the normalized perplexity. For instance, the best sequence model improves the perplexity by 16%, and the accuracy of our dictation system (MAUD) by approximately 14%. Experiments, in terms of perplexity and recognition rate, were carried out on a vocabulary of 20,000 words extracted from a 43-million-word corpus comprising two years of the French newspaper Le Monde. The acoustic model (HMM) is trained on the Bref80 corpus.
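The core idea of treating word sequences as additional vocabulary entries can be illustrated with a minimal sketch. The paper's actual selection criterion is perplexity-driven and operates on a linguistically tagged corpus; as a simplified stand-in, the sketch below scores adjacent word pairs by pointwise mutual information (one common information-theoretic criterion) and promotes high-scoring pairs to phrase entries. All names and thresholds here are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from math import log2

def extract_phrases(corpus, min_count=2, threshold=1.0):
    """Select adjacent word pairs to treat as single vocabulary entries.

    A pair (w1, w2) is promoted to a phrase when it occurs at least
    `min_count` times and its pointwise mutual information
        PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
    exceeds `threshold`. This is a simplified stand-in for the
    perplexity-based criterion used in the paper.
    """
    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(
        (sent[i], sent[i + 1])
        for sent in corpus
        for i in range(len(sent) - 1)
    )
    n = sum(unigrams.values())  # total token count

    phrases = set()
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        # PMI computed from counts: P(w1,w2)*n / (P(w1)*P(w2)*n) form
        pmi = log2(count * n / (unigrams[w1] * unigrams[w2]))
        if pmi > threshold:
            phrases.add((w1, w2))
    return phrases

# Hypothetical toy corpus: "new york" recurs and should be merged.
corpus = [
    ["new", "york", "is", "big"],
    ["new", "york", "city"],
    ["big", "city"],
]
print(extract_phrases(corpus))  # {('new', 'york')}
```

Once such pairs are selected, each occurrence in the corpus can be rewritten as a single token (e.g. `new_york`) and the n-gram model re-estimated over the enlarged vocabulary; the procedure can be iterated to grow variable-length sequences from merged pairs.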
