Journal: Computer Speech and Language

Statistical language modeling based on variable-length sequences


Abstract

In natural language, and especially in spontaneous speech, people often group words into phrases that become usual expressions. This happens for phonological reasons (to make pronunciation easier) or semantic ones (a block of words is easier to remember when it is assigned a single meaning). Classical language models do not adequately account for such phrases. A better approach is to model some word sequences as if they were individual dictionary elements: the sequences are treated as additional vocabulary entries, over which the language models are then computed. In this paper, we present a method for automatically retrieving the most relevant phrases from a corpus of written sentences. The originality of our approach lies in the fact that the phrases are extracted from a linguistically tagged corpus; the phrases obtained are therefore linguistically viable. To measure the contribution of classes to phrase retrieval, we implemented the same algorithm without classes; the class-based method outperformed the class-free one by 11%. Our approach uses information-theoretic criteria that ensure high statistical consistency and make the decision to select a candidate sequence optimal with respect to the language perplexity. We propose several variants of the language model, with and without word sequences; among them, we present a model in which the trigger pairs are linguistically more significant. We show that the use of sequences decreases the word error rate and improves the normalized perplexity. For instance, the best sequence model improves the perplexity by 16%, and the accuracy of our dictation system (MAUD) by approximately 14%. Experiments, in terms of perplexity and recognition rate, were carried out on a vocabulary of 20,000 words extracted from a 43-million-word corpus comprising two years of the French newspaper Le Monde. The acoustic model (HMM) is trained on the Bref80 corpus.
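The core idea of treating word sequences as additional vocabulary entries can be illustrated with a minimal sketch. The paper's actual selection criterion is perplexity-driven and operates on a linguistically tagged corpus; as a simplified stand-in, the sketch below scores adjacent word pairs by pointwise mutual information (one common information-theoretic criterion) and promotes high-scoring pairs to phrase entries. All names and thresholds here are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter
from math import log2

def extract_phrases(corpus, min_count=2, threshold=1.0):
    """Select adjacent word pairs to treat as single vocabulary entries.

    A pair (w1, w2) is promoted to a phrase when it occurs at least
    `min_count` times and its pointwise mutual information
        PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
    exceeds `threshold`. This is a simplified stand-in for the
    perplexity-based criterion used in the paper.
    """
    unigrams = Counter(w for sent in corpus for w in sent)
    bigrams = Counter(
        (sent[i], sent[i + 1])
        for sent in corpus
        for i in range(len(sent) - 1)
    )
    n = sum(unigrams.values())  # total token count

    phrases = set()
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        # PMI computed from counts: P(w1,w2)*n / (P(w1)*P(w2)*n) form
        pmi = log2(count * n / (unigrams[w1] * unigrams[w2]))
        if pmi > threshold:
            phrases.add((w1, w2))
    return phrases

# Hypothetical toy corpus: "new york" recurs and should be merged.
corpus = [
    ["new", "york", "is", "big"],
    ["new", "york", "city"],
    ["big", "city"],
]
print(extract_phrases(corpus))  # {('new', 'york')}
```

Once such pairs are selected, each occurrence in the corpus can be rewritten as a single token (e.g. `new_york`) and the n-gram model re-estimated over the enlarged vocabulary; the procedure can be iterated to grow variable-length sequences from merged pairs.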
