Journal of computer sciences

Named Entity Recognition for Kannada using Gazetteers list with Conditional Random Fields


Abstract

Named Entities (NEs) in sentences are essential for building Natural Language Processing (NLP) applications that perform Information Extraction (IE) from large corpora. However, generating a large corpus is challenging for resource-poor languages such as Kannada, and no annotated corpus is available online. Annotating NEs with pre-defined classes faces two challenges: NEs are morphologically joined with other words, and spelling variations are frequent in Kannada words. Sentence structure varies with the morphology, parts of speech (POS) and chunking of a language, and these parameters differ from one language to another. To address these challenges, a novel application system is proposed to identify NEs in Kannada using a large corpus of 73,676 tokens. The Named Entity Recognition (NER) system consists of a robust POS tagger and a Noun Phrase (NP) chunker developed for generic data. Five gazetteer lists were created from many orthographic patterns for each word. Context information, namely the previous two words, the next two words, word morphology and the gazetteer lists, was added to the feature lists. A unigram-bigram template was designed and incorporated into Conditional Random Fields (CRFs) to generate conditional feature functions. The proposed system achieved F-measures of 86.85% on gold test data and 71.01% on newspaper data.
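The context features named in the abstract (previous two words, next two words, word morphology and gazetteer membership) might be assembled per token roughly as in the sketch below. This is only an illustration under stated assumptions, not the authors' implementation: the gazetteer names and entries, the feature keys, the `<BOS>`/`<EOS>` padding symbols and the use of character suffixes as a stand-in for morphology are all hypothetical, and the tokens are romanized placeholders rather than items from the paper's corpus. The POS tagger, the NP chunker and the unigram-bigram CRF template are not reproduced here.

```python
from typing import Dict, List, Set

# Hypothetical gazetteer lists. The paper uses five; their names and contents
# are not given in the abstract, so these two are placeholders.
GAZETTEERS: Dict[str, Set[str]] = {
    "person": {"kuvempu"},
    "location": {"bengaluru", "mysuru"},
}


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for tokens[i] from a window of +/-2 words."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "w0": word,
        # Assumption: trailing character n-grams as a crude morphology proxy.
        "suffix2": word[-2:],
        "suffix3": word[-3:],
    }
    # Context: previous two and next two words, padded at sentence edges.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        key = f"w{offset:+d}"  # yields "w-2", "w-1", "w+1", "w+2"
        if j < 0:
            feats[key] = "<BOS>"
        elif j >= len(tokens):
            feats[key] = "<EOS>"
        else:
            feats[key] = tokens[j]
    # One boolean flag per gazetteer list.
    for name, entries in GAZETTEERS.items():
        feats[f"gaz_{name}"] = word.lower() in entries
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Feature dicts for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

A list of such per-token dictionaries is the kind of input a CRF toolkit typically expects for one sentence; how the features are combined into unigram and bigram feature functions depends on the template, which the abstract does not specify.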
