Journal: ACM Transactions on Asian Language Information Processing

Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications


Abstract

Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question answering. Recognizing its importance, researchers have developed a plethora of NER techniques for Western and Asian languages. However, although Urdu has over 490 million speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this article makes four key contributions. First, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated NEs. The corpus contains at least twice as many manually tagged NEs as any existing Urdu NER corpus. Second, we have generated six new word embeddings by applying three techniques (fastText, Word2vec, and GloVe) to two corpora of Urdu text. These are the only publicly available embeddings for the Urdu language besides the Urdu word embeddings recently released by Facebook. Third, we have pioneered the application of deep learning techniques, NN and RNN, to Urdu named entity recognition. Finally, we have run 32 different experiments, each with 10 folds, using combinations of a traditional supervised learning technique and deep learning techniques, seven types of word embeddings, and two different Urdu NER datasets. Based on an analysis of the results, we provide several insights into the effectiveness of deep learning techniques, the impact of word embeddings, and variations across datasets.
