Urdu Named Entity Recognition: Corpus Generation and Deep Learning Applications

Kanwal Safia; Malik Kamran; Shahzad Khurram; Aslam Faisal; Nawaz Zubair

Abstract

Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question-answering systems. Given this importance, a plethora of NER techniques have been developed for Western and Asian languages. However, although Urdu has over 490 million speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this article makes four key contributions. First, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated NEs. The corpus contains at least twice as many manually tagged NEs as any existing Urdu NER corpus. Second, we have generated six new word embeddings using three different techniques, fastText, Word2vec, and GloVe, on two corpora of Urdu text. Apart from the Urdu word embeddings recently released by Facebook, these are the only publicly available embeddings for the Urdu language. Third, we have pioneered the application of deep learning techniques, NN and RNN, to Urdu named entity recognition. Finally, we have performed 32 different experiments, each over 10 folds, using combinations of a traditional supervised learning technique and deep learning techniques, seven types of word embeddings, and two different Urdu NER datasets. Based on the analysis of the results, we provide several valuable insights into the effectiveness of deep learning techniques, the impact of word embeddings, and variations across datasets.

Bibliographic details

Source: ACM transactions on Asian language information processing, 2020, No. 1, pp. 8.1-8.13 (13 pages)
Authors: Kanwal Safia; Malik Kamran; Shahzad Khurram; Aslam Faisal; Nawaz Zubair
Affiliation: Univ Punjab Coll Informat Technol Old Campus Lahore Pakistan
Original format: PDF
Language: English
Keywords: Resource poor languages; deep learning; Urdu NER corpus; Word2vec; fastText; word embeddings

Similar literature

1. Deep recurrent neural networks with word embeddings for Urdu named entity recognition [J]. Wahab Khan, Ali Daud, Fahd Alotaibi, ETRI journal. 2020, No. 1
2. Myanmar named entity corpus and its use in syllable-based neural named entity recognition [J]. Hsu Myat Mo, Khin Mar Soe. International Journal of Electrical and Computer Engineering. 2020, No. 2
3. Urdu Named Entity Recognition and Classification System Using Artificial Neural Network [J]. Muhammad Kamran Malik. ACM transactions on Asian language information processing. 2018, No. 1
4. Maximum Entropy based Urdu Named Entity Recognition [C]. Fatima Riaz, Muhammad Waqas Anwar, Humaira Muqades. International Conference on Engineering and Emerging Technologies. 2020
5. Improving Search via Named Entity Recognition in Morphologically Rich Languages: A Case Study in Urdu [D]. Riaz, Kashif H. 2018
6. Deep learning for named entity recognition on Chinese electronic medical records: Combining deep transfer learning with multitask bi-directional LSTM RNN [O]. Xishuang Dong, Shanta Chowdhury, Lijun Qian, 2015
7. Deep learning for named entity recognition on Chinese electronic medical records: Combining deep transfer learning with multitask bi-directional LSTM RNN [O]. Xishuang Dong, Shanta Chowdhury, Lijun Qian, 2019
