Source: Journal of Computers

Challenges of Diacritical Marker or Hudhaa Character in Tokenization of Oromo Text



Abstract

Tokenization in natural language processing is the task of identifying every token in a text. For languages like Oromo, which have so far received little attention in language processing, tokenization cannot be overlooked. This paper reports on an Oromo tokenizer that we designed and tested to handle the challenge posed by the diacritical marker Hudhaa, a sign that represents the in-word glottal sound in the Oromo language. We studied the effect of writing Hudhaa with the acute accent rather than with confusable marks such as the right quote. Accuracy is a prime factor in evaluating any Natural Language Processing (NLP) system, so we measured the accuracy of our system on 1.2 MB of Oromo text data (9,686 sentences containing 164,932 words); the algorithm achieved an accuracy of 99.94%.
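To make the ambiguity concrete, the following is a minimal sketch, not the authors' algorithm, of how an in-word Hudhaa might be separated from a quotation mark. It assumes that a mark flanked by letters on both sides is Hudhaa and rewrites it as the acute accent (U+00B4) before splitting. The names `HUDHAA_VARIANTS`, `normalize_hudhaa`, and `tokenize`, the set of variant characters, and the sample input are all illustrative assumptions.

```python
import re

# Assumption: characters a writer might type for Hudhaa
# (ASCII apostrophe, right single quote, acute accent).
HUDHAA_VARIANTS = "'\u2019\u00b4"
HUDHAA = "\u00b4"  # acute accent used as the single normalized form

LETTER = r"[^\W\d_]"  # any Unicode letter

# A variant mark with a letter immediately before and after it.
_IN_WORD = re.compile(
    rf"(?<={LETTER})[{re.escape(HUDHAA_VARIANTS)}](?={LETTER})"
)

# Words (optionally containing normalized Hudhaa), digit runs,
# or single punctuation characters.
_TOKEN = re.compile(rf"{LETTER}+(?:{HUDHAA}{LETTER}+)*|\d+|[^\w\s]")


def normalize_hudhaa(text):
    """Rewrite in-word variant marks as the acute accent."""
    return _IN_WORD.sub(HUDHAA, text)


def tokenize(text):
    """Split text into tokens, keeping in-word Hudhaa inside the word."""
    return _TOKEN.findall(normalize_hudhaa(text))


if __name__ == "__main__":
    print(tokenize("Inni 'bu\u2019aa' jedhe."))
```

In this sketch the quotes surrounding a word stay separate tokens because they are not flanked by letters on both sides, while the mark inside the word is kept as part of it. A real tokenizer would need further rules, for example for a closing quote typed directly before another word.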
