Auto segmentation for Malay Speech Corpus

Abstract

This paper addresses the automatic segmentation of a Malay continuous speech database. Automatic segmentation is the process of dividing speech into a sequence of discrete units, within each of which particular characteristics remain constant. In terms of quality, manual segmentation would be the best method. However, for a large database, manual segmentation and labeling become a tremendous task that is time-consuming and error-prone. Moreover, even when the database is segmented by an expert, the segmentation rules may be subjective and not reproducible, and different linguistic experts may produce inconsistent results. An automated segmentation procedure was therefore devised to segment the large-scale database consistently and with a satisfactory level of quality. Automated segmentation of Malay syllables is not a difficult task, because all syllables in Malay are pronounced almost equally and, like English, Malay is not a tonal language. Identifying and manipulating segment boundaries in Malay is therefore straightforward. Segmentation uses a hidden Markov model (HMM) based approach with an adapted Viterbi forced-alignment technique. Composite HMMs with Baum-Welch re-estimation were used to ease phonetic segmentation. All data from the database were fed directly into the segmentation tool, without a previously trained sample set for pre-training. The database script consists of 1000 sentences: 620 were selected from primary school Malay language textbooks, and 380 were constructed using the 70% highest-frequency words in a 10-million-word online digital text. This script design yields a phonetically balanced database that covers all vowels and consonants. Performance was assessed with an objective evaluation method: the automatic segmentation output was verified to determine its accuracy and overall quality. The output was also tested perceptually and found to be of satisfactorily high quality.
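
The Viterbi forced-alignment step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one state per transcribed unit, precomputed per-frame log-likelihoods, and no transition probabilities, whereas the paper uses composite HMMs trained with Baum-Welch re-estimation. The function name `forced_align` and the input layout `loglik[t][s]` are hypothetical.

```python
import math


def forced_align(loglik):
    """Align frames to a known, ordered sequence of units.

    loglik[t][s] is the log-likelihood of frame t under the s-th unit of
    the transcription (assumed to be computed elsewhere). Returns one
    (start_frame, end_frame_exclusive) pair per unit, in order.
    """
    n_frames = len(loglik)
    n_units = len(loglik[0])
    if n_frames < n_units:
        raise ValueError("need at least one frame per unit")

    neg_inf = -math.inf
    # score[t][s]: best log-likelihood of any monotonic path ending in unit s at frame t
    score = [[neg_inf] * n_units for _ in range(n_frames)]
    # entered[t][s]: True if the best path entered unit s exactly at frame t
    entered = [[False] * n_units for _ in range(n_frames)]
    score[0][0] = loglik[0][0]  # the path must start in the first unit

    for t in range(1, n_frames):
        for s in range(n_units):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                score[t][s] = move + loglik[t][s]
                entered[t][s] = True
            else:
                score[t][s] = stay + loglik[t][s]

    # Backtrace from the last unit at the last frame to recover boundaries.
    segments = []
    s = n_units - 1
    end = n_frames
    for t in range(n_frames - 1, 0, -1):
        if entered[t][s]:
            segments.append((t, end))
            end = t
            s -= 1
    segments.append((0, end))
    segments.reverse()
    return segments


if __name__ == "__main__":
    # Toy scores: frames 0-1 favor unit 0, frames 2-4 unit 1, frame 5 unit 2.
    toy = [
        [-1, -5, -5],
        [-1, -5, -5],
        [-5, -1, -5],
        [-5, -1, -5],
        [-5, -1, -5],
        [-5, -5, -1],
    ]
    print(forced_align(toy))  # [(0, 2), (2, 5), (5, 6)]
```

Frame indices convert to time by multiplying by the frame shift of the acoustic front end, a value the abstract does not state. In a full system the per-frame scores would come from the trained HMM emission densities rather than from a hand-written table.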


Bibliographic details

Source
World Multi-Conference on Systemics, Cybernetics and Informatics | 2012 | 4 pages
Authors
Tan Tian Swee; Ting Chee Ming; Chin Wee Lip; Lau Chee Yong; Sh-Hussain Salleh
Original format: PDF
Chinese Library Classification: N945-53

