Locating Complex Named Entities in Web Text


Abstract

Named Entity Recognition (NER) is the task of locating and classifying names in text. In previous work, NER was limited to a small number of predefined entity classes (e.g., people, locations, and organizations). However, NER on the Web is a far more challenging problem. Complex names (e.g., film or book titles) can be very difficult to pick out precisely from text. Further, the Web contains a wide variety of entity classes, which are not known in advance. Thus, hand-tagging examples of each entity class is impractical. This paper investigates a novel approach to the first step in Web NER: locating complex named entities in Web text. Our key observation is that named entities can be viewed as a species of multiword units, which can be detected by accumulating n-gram statistics over the Web corpus. We show that this statistical method's F1 score is 50% higher than that of supervised techniques including Conditional Random Fields (CRFs) and Conditional Markov Models (CMMs) when applied to complex names. The method also outperforms CMMs and CRFs by 117% on entity classes absent from the training data. Finally, our method outperforms a semi-supervised CRF by 73%.
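
The idea of treating names as multiword units detected from n-gram statistics can be illustrated with a minimal sketch. The abstract does not say which statistic the authors use, so everything below is an assumption for illustration: the cohesion score is symmetric conditional probability (the product of the two conditional probabilities of an adjacent word pair, estimated from counts), the corpus is a four-sentence toy stand-in for Web-scale counts, and the function names, `threshold`, and `min_count` are hypothetical.

```python
from collections import Counter


def ngram_counts(sentences):
    """Count unigrams and adjacent bigrams over tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def scp(w1, w2, unigrams, bigrams):
    """Symmetric conditional probability P(w2|w1) * P(w1|w2) from raw counts."""
    pair = bigrams[(w1, w2)]
    if pair == 0:
        return 0.0
    return (pair / unigrams[w1]) * (pair / unigrams[w2])


def locate_names(tokens, unigrams, bigrams, threshold=0.4, min_count=2):
    """Return multiword spans that start and end with a capitalized token
    and whose adjacent pairs are all frequent and cohesive.

    threshold and min_count are illustrative values, not tuned settings.
    """
    names = []
    i = 0
    while i < len(tokens):
        if not tokens[i][0].isupper():
            i += 1
            continue
        j = i
        # Extend the span while the next adjacent pair passes both checks.
        while (j + 1 < len(tokens)
               and bigrams[(tokens[j], tokens[j + 1])] >= min_count
               and scp(tokens[j], tokens[j + 1], unigrams, bigrams) >= threshold):
            j += 1
        # Drop trailing lowercase words so the span ends on a capitalized token.
        end = j
        while end > i and not tokens[end][0].isupper():
            end -= 1
        if end > i:
            names.append(" ".join(tokens[i:end + 1]))
        i = end + 1
    return names


# Toy corpus standing in for n-gram counts gathered over the Web.
corpus = [
    "I watched Gone with the Wind last night".split(),
    "Gone with the Wind is a long film".split(),
    "we went with the kids to a film".split(),
    "the wind was cold last night".split(),
]
unigrams, bigrams = ngram_counts(corpus)

print(locate_names(corpus[0], unigrams, bigrams))  # ['Gone with the Wind']
```

Note that the lowercase words "with" and "the" are kept inside the span because the statistics link them to their capitalized neighbors, which is what lets this kind of approach pick up titles that a capitalization-only rule would split. No labeled examples are involved, which is consistent with the abstract's point that hand-tagging every entity class is impractical.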
