Auto segmentation for Malay Speech Corpus

Abstract

This paper deals with the automatic segmentation of a Malay continuous speech database. Automatic segmentation is the process of dividing speech into a sequence of discrete units, with particular characteristics remaining constant within each one. In terms of quality, manual segmentation is the best method. However, for a large database, manual speech segmentation and labeling become a tremendous task that is time consuming and error prone. Moreover, even when the database is segmented by an expert, the segmentation rules may be subjective and not reproducible, and different linguistic experts may produce inconsistent results. An automated segmentation procedure was therefore developed to segment the large-scale database consistently, with a satisfactory level of quality. Automated segmentation of Malay syllables is not a difficult task because all syllables in Malay are pronounced almost equally and, like English, Malay is not a tonal language. Manipulating and identifying the segment boundaries in Malay is straightforward and easy to understand. For the segmentation, an HMM-based approach with an adapted Viterbi forced-alignment technique is used. Composite HMMs with Baum-Welch re-estimation were used to ease the process of phonetic segmentation. All the data from the database were fed directly into the segmentation tool, without a separately prepared sample for pre-training. The database script consists of 1000 sentences: 620 were selected from primary school Malay language textbooks, and 380 were composed using the 70% highest-frequency words appearing in 10 million words of online digital text. This script configuration yields a phonetically balanced database that covers all the vowels and consonants. An objective evaluation method is used to assess performance. The autosegmentation results were verified to determine their accuracy and overall quality. They were also tested perceptually and found to be of satisfactorily high quality.
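The abstract does not describe how the Viterbi forced alignment was adapted, so the following is only a minimal sketch of generic forced alignment, not the authors' method. It assumes the transcript has already been expanded into an ordered list of units (phones or syllables) and that trained HMMs have supplied a per-frame log-likelihood for each unit. The function name `forced_align` and the table layout are hypothetical, and each unit is collapsed to a single state for brevity.

```python
import math


def forced_align(log_lik):
    """Generic Viterbi forced alignment over a left-to-right unit sequence.

    log_lik[t][s] is the (assumed, externally supplied) log-likelihood of
    frame t under unit s, with units listed in transcript order.
    Returns the start frame of each unit.
    """
    n_frames = len(log_lik)
    n_units = len(log_lik[0])
    if n_frames < n_units:
        raise ValueError("need at least one frame per unit")

    neg_inf = -math.inf
    score = [[neg_inf] * n_units for _ in range(n_frames)]
    # advanced[t][s] is True when the best path enters unit s at frame t.
    advanced = [[False] * n_units for _ in range(n_frames)]
    score[0][0] = log_lik[0][0]  # the path must begin in the first unit

    for t in range(1, n_frames):
        for s in range(n_units):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                score[t][s] = move + log_lik[t][s]
                advanced[t][s] = True
            else:
                score[t][s] = stay + log_lik[t][s]

    # Backtrack from the last unit at the last frame to recover boundaries.
    starts = [0] * n_units
    s = n_units - 1
    for t in range(n_frames - 1, 0, -1):
        if advanced[t][s]:
            starts[s] = t
            s -= 1
    return starts


# Toy input: 6 frames, 3 units; -1.0 marks the unit each frame fits best.
example = [
    [-1.0, -5.0, -5.0],
    [-1.0, -5.0, -5.0],
    [-5.0, -1.0, -5.0],
    [-5.0, -1.0, -5.0],
    [-5.0, -1.0, -5.0],
    [-5.0, -5.0, -1.0],
]
boundaries = forced_align(example)  # start frame of each unit
```

Because the unit order is fixed by the transcript, the search only decides where each boundary falls, which is what makes the output reproducible in a way that manual labeling is not.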
