Challenges of Diacritical Marker or Hudhaa Character in Tokenization of Oromo Text

Abraham Tesso Nedjo; Degen Huang; Xiaoxia Liu

Abstract

Tokenization in natural language processing is the task of identifying every token in a text. For languages such as Oromo, for which little language-processing work has been done so far, tokenization cannot be overlooked. This paper reports on an Oromo tokenizer that we designed and tested to accommodate the challenge posed by the diacritical marker Hudhaa, a sign that represents the in-word glottal sound in the Oromo language. We studied the effect of writing Hudhaa with an acute accent rather than with other, potentially confusing marks such as the right quotation mark. Accuracy is a prime factor in evaluating any natural language processing (NLP) system, so we measured the accuracy of our system on 1.2 MB of Oromo text (9,686 sentences containing 164,932 words); the algorithm achieved an accuracy of 99.94%.
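To make the Hudhaa problem concrete, here is a minimal Python sketch. It is not the authors' algorithm, which the abstract does not spell out. It assumes that Hudhaa may be typed as a straight apostrophe, a right quotation mark, a backtick, or an acute accent; that a mark counts as Hudhaa only when it sits between two letters; and that every such mark is normalized to the acute accent before splitting. The names `HUDHAA_VARIANTS`, `normalize_hudhaa`, and `tokenize` are hypothetical.

```python
import re

# Assumed set of characters a writer might type for Hudhaa (illustrative only).
HUDHAA_VARIANTS = "'\u2019`\u00B4"
ACUTE = "\u00B4"  # acute accent, used here as the single normalized Hudhaa mark

# A variant counts as Hudhaa only when it sits between two letters.
_IN_WORD = re.compile(
    rf"(?<=[^\W\d_])[{re.escape(HUDHAA_VARIANTS)}](?=[^\W\d_])"
)

# A token is a run of letters/digits, optionally joined by in-word acute accents,
# or any single punctuation or symbol character.
_TOKEN = re.compile(rf"[^\W_]+(?:{ACUTE}[^\W_]+)*|[^\w\s]|_")


def normalize_hudhaa(text: str) -> str:
    """Rewrite every in-word Hudhaa variant as an acute accent."""
    return _IN_WORD.sub(ACUTE, text)


def tokenize(text: str) -> list[str]:
    """Split text into tokens, keeping Hudhaa inside its word."""
    return _TOKEN.findall(normalize_hudhaa(text))


if __name__ == "__main__":
    # The outer quotes stay separate tokens; the in-word mark stays in the word.
    print(tokenize("Inni 'har'a' dhufe."))
```

Once the in-word mark is a distinct character, a quotation mark at a word boundary can no longer be confused with Hudhaa, which is one way to read the motivation the abstract gives for preferring the acute accent.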


Bibliographic details

Source: Journal of Computers, 2014, Issue 7, 9 pages
Authors: Abraham Tesso Nedjo; Degen Huang; Xiaoxia Liu
Original format: PDF
Classification: Computing technology, computer technology
Keywords: Diacritical Marker; Glottal; Hudhaa; Oromo; Tokenization

