Journal: Journal of Computers

Topic Mining based on Word Posterior Probability in Spoken Document



Abstract

Speech recognition systems represent their results in three forms: one-best, N-best, and lattice. Because a lattice retains multiple paths, which reduces the impact of recognition errors, it is widely used today. However, a lattice contains a large amount of redundancy, which increases the complexity of the algorithms built on it. In addition, decoding follows the maximum a posteriori (MAP) criterion, which guarantees only that the posterior probability of the whole sentence is maximal. Because MAP does not imply the highest syllable recognition rate, the confusion network is introduced into the topic mining system. Clustering in the confusion network adopts the minimum word error criterion, which suits topic mining because the word is the smallest meaningful unit in Chinese and word information matters most in topic mining. This paper proposes a simplified confusion network generation algorithm to handle some problems caused by insertion errors during recognition. Based on the confusion network, a word list extraction approach is then proposed in which a dictionary is used to judge whether consecutive arcs in the confusion sets form a word. At this stage, erroneous words produced by recognition errors can be corrected to some extent. After the competition step of word list extraction on the confusion network, a final word list with posterior probabilities is obtained. These posterior probabilities can further be incorporated into the topic mining system. SVD and NMF are used to decompose the term-document matrix built on the word list from the confusion network. Experiments show that the proposed confusion-network-based approach performs better than the one-best and N-best approaches. Moreover, the modified weighting, which incorporates the posterior probability into the term-document matrix, further improves system performance.
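
The dictionary-based word list extraction and the posterior-weighted term-document matrix described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract gives no formulas, so the data layout (a confusion network as a list of confusion sets mapping each candidate unit to its posterior), the product rule for a multi-arc word's posterior, the use of a maximum to resolve competing candidates, and the names `extract_words` and `term_document_matrix` are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


def extract_words(network, dictionary, max_len=2):
    """Collect dictionary words formed by arcs of consecutive confusion sets.

    network: list of confusion sets, each a dict {unit: posterior}.
    Returns {word: posterior}. The word posterior is ASSUMED to be the
    product of its arc posteriors; the abstract does not specify this.
    """
    scores = defaultdict(float)
    for start in range(len(network)):
        for length in range(1, max_len + 1):
            window = network[start:start + length]
            if len(window) < length:
                break  # ran past the end of the network
            # Enumerate every path through the consecutive confusion sets.
            for combo in product(*(slot.items() for slot in window)):
                word = "".join(unit for unit, _ in combo)
                if word not in dictionary:
                    continue
                prob = 1.0
                for _, posterior in combo:
                    prob *= posterior
                # ASSUMED competition rule: keep the best score per word.
                scores[word] = max(scores[word], prob)
    return dict(scores)


def term_document_matrix(docs):
    """Build a term-by-document matrix weighted by word posteriors.

    docs: list of {word: posterior}, one dict per spoken document.
    Returns (terms, matrix) where matrix[i][j] is the weight of terms[i]
    in document j, and 0.0 if the term does not occur there.
    """
    terms = sorted({word for doc in docs for word in doc})
    matrix = [[doc.get(term, 0.0) for doc in docs] for term in terms]
    return terms, matrix


# Hypothetical toy input: four confusion sets over Chinese characters.
network = [
    {"语": 0.7, "雨": 0.3},
    {"音": 0.9, "因": 0.1},
    {"识": 0.8, "十": 0.2},
    {"别": 1.0},
]
dictionary = {"语音", "识别", "雨", "十"}

word_list = extract_words(network, dictionary)
terms, matrix = term_document_matrix([word_list])
```

The resulting matrix is the kind of input that an SVD or NMF routine would then decompose; that step is left out here because it needs a numerical library.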
