Conference: Advances in Multimedia Information Processing - PCM 2008

Subword Lexical Chaining for Automatic Story Segmentation in Chinese Broadcast News


Abstract

We present a subword lexical chaining approach to automatic story segmentation of Chinese broadcast news (BN). Conventional lexical chains link words related by cohesion (e.g., word repetition), and points where many chains start or end are indicative of story boundaries. However, the inevitable speech recognition errors in BN transcripts may destroy word cohesion, resulting in word match failures. We show that Chinese subwords (characters and syllables) are robust for lexical matching in errorful automatic speech recognition (ASR) transcripts. This motivates us to detect story boundaries on chains formed from character and syllable n-gram units. Experimental results on the TDT2 Mandarin corpus show that chaining by character unigrams gives the best story segmentation performance, with a relative F-measure improvement of 6.06% over conventional word chaining. Integrating multiple scales (words and subwords) yields further improvement. For example, fusion by voting across scales achieves an F-measure gain of 9.04% over words.
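
The chaining idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the chaining criteria or the boundary detection rule, so the function names (`char_ngrams`, `build_chains`, `boundary_scores`, `segment`), the `max_gap` and `threshold` parameters, and the simple count-based scoring are all assumptions made for illustration. Repeated character n-grams are linked into chains, and a gap between sentences is scored by how many chains end just before it plus how many start just after it.

```python
from collections import defaultdict


def char_ngrams(sentence, n=1):
    """Return the set of character n-grams in a sentence, ignoring whitespace."""
    chars = [c for c in sentence if not c.isspace()]
    return {"".join(chars[i:i + n]) for i in range(len(chars) - n + 1)}


def build_chains(sentences, n=1, max_gap=2):
    """Link repetitions of the same n-gram into chains.

    Two occurrences are linked when they are at most `max_gap` sentences
    apart (an assumed criterion). Returns (start, end) sentence indices
    for every chain that spans at least two sentences.
    """
    positions = defaultdict(list)
    for idx, sent in enumerate(sentences):
        for unit in char_ngrams(sent, n):
            positions[unit].append(idx)

    chains = []
    for occurrences in positions.values():
        start = prev = occurrences[0]
        for idx in occurrences[1:]:
            if idx - prev > max_gap:
                # Gap too large: close the current chain, open a new one.
                if prev > start:
                    chains.append((start, prev))
                start = idx
            prev = idx
        if prev > start:
            chains.append((start, prev))
    return chains


def boundary_scores(chains, num_sentences):
    """scores[i] counts chains ending at sentence i plus chains starting at i + 1."""
    scores = [0] * (num_sentences - 1)
    for start, end in chains:
        if end < num_sentences - 1:
            scores[end] += 1
        if start > 0:
            scores[start - 1] += 1
    return scores


def segment(sentences, n=1, max_gap=2, threshold=2):
    """Return indices of sentences hypothesized to begin a new story."""
    chains = build_chains(sentences, n, max_gap)
    scores = boundary_scores(chains, len(sentences))
    return [i + 1 for i, s in enumerate(scores) if s >= threshold]


# Toy transcript: two sentences about the stock market, two about a typhoon.
toy = ["股市上涨", "股市收盘", "台风登陆", "台风减弱"]
print(segment(toy))  # [2]: a new story is hypothesized to start at sentence index 2
```

Character unigrams (`n=1`) are used here because the abstract reports them as the best-performing single scale. Syllable units would require a pronunciation lexicon, which this sketch does not include.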
