ACM Transactions on Asian Language Information Processing

Bigram Language Models and Reevaluation Strategy for Improved Recognition of Online Handwritten Tamil Words


Abstract

This article describes a postprocessing strategy for online, handwritten, isolated Tamil words. It contributes to two issues rarely addressed in the online Indic word recognition literature, namely, the use of (1) language models exploiting the idiosyncrasies of Indic scripts and (2) expert classifiers for the disambiguation of confused symbols. The input word is first segmented into its individual symbols, which are recognized using a primary support vector machine (SVM) classifier. Thereafter, we enhance the recognition accuracy by utilizing (i) a bigram language model at the symbol or character level and (ii) expert classifiers for reevaluating and disambiguating the different sets of confused symbols. The symbol-level bigram model is used in a traditional Viterbi framework. The concept of a character comprising multiple symbols is unique to Dravidian languages such as Tamil. This multi-symbol feature of Tamil characters has been exploited in proposing a novel, prefix-tree-based character-level bigram model that does not use Viterbi search; instead, it reduces the search space for each input symbol based on its left context. For disambiguating confused symbols, a dynamic time-warping approach is proposed to automatically identify the parts of the online trace that discriminate between the confused classes. Fine classification of these regions by dedicated expert SVMs reduces the confusion between such symbols. The integration of segmentation, the prefix-tree-based language model, and the disambiguation of confused symbols is evaluated on a set of 15,000 handwritten isolated online Tamil words. Our results show recognition accuracies of 93.0% and 81.6% at the symbol and word level, respectively, compared to the baseline classifier performance of 88.4% and 65.1%.
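
To make the symbol-level step concrete, the following is a minimal sketch of Viterbi decoding with a bigram model over per-symbol classifier candidates. It is not the authors' implementation: the function name `viterbi_bigram`, the romanized symbol labels, the `floor` smoothing constant, and all probabilities are assumptions invented for illustration. The abstract only states that a symbol-level bigram model is used in a traditional Viterbi framework.

```python
import math


def viterbi_bigram(candidates, bigram, start="<s>", floor=1e-6):
    """Return the most probable symbol sequence.

    candidates: one dict per segmented symbol, mapping a candidate label
                to the primary classifier's probability for that label.
    bigram:     dict mapping (previous_label, label) to P(label | previous_label).
    floor:      hypothetical smoothing value for unseen pairs or zero scores.
    """
    # best[label] = (log score of the best path ending in label, that path)
    best = {start: (0.0, [])}
    for position in candidates:
        new_best = {}
        for label, p_classifier in position.items():
            emission = math.log(max(p_classifier, floor))
            # Pick the predecessor that maximizes path score plus transition.
            score, path = max(
                (prev_score + math.log(bigram.get((prev, label), floor)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            new_best[label] = (score + emission, path + [label])
        best = new_best
    return max(best.values())[1]


# Hypothetical two-symbol word: the classifier alone prefers "ka" then "la".
candidates = [{"ka": 0.6, "ta": 0.4}, {"la": 0.55, "va": 0.45}]
bigram = {
    ("<s>", "ka"): 0.5, ("<s>", "ta"): 0.5,
    ("ka", "la"): 0.05, ("ka", "va"): 0.6,
    ("ta", "la"): 0.3, ("ta", "va"): 0.1,
}
decoded = viterbi_bigram(candidates, bigram)
print(decoded)
```

In this toy input, taking the top classifier score at each position would yield "ka", "la", but the bigram model makes "ka" followed by "va" the better-scoring path overall. This is the kind of correction a language model can contribute during postprocessing.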
