Source: IEEE International Conference on Bioinformatics and Biomedicine

Generating features for named entity recognition by learning prototypes in semantic space: The case of de-identifying health records



Abstract

Creating sufficiently large annotated resources for supervised machine learning, and doing so for every problem and every domain, is prohibitively expensive. Techniques that leverage large amounts of unlabeled data, which are often readily available, may decrease the amount of data that needs to be annotated to obtain a certain level of performance, as well as improve performance when large annotated resources are indeed available. Here, the development of one such method is presented, where semantic features are generated by exploiting the available annotations to learn prototypical (vector) representations of each named entity class in semantic space, constructed by employing a model of distributional semantics (random indexing) over a large, unannotated, in-domain corpus. Binary features that describe whether a given word belongs to a specific named entity class are provided to the learning algorithm; the feature values are determined by calculating the (cosine) distance in semantic space to each of the learned prototype vectors and ascertaining whether they are below or above a given threshold, set to optimize F-score. The proposed method is evaluated empirically in a series of experiments, where the case is health-record de-identification, a task that involves identifying protected health information (PHI) in text. It is shown that a conditional random fields model with access to the generated semantic features, in addition to a set of orthographic and syntactic features, significantly outperforms, in terms of F-score, a baseline model without access to the semantic features. Moreover, the quality of the features is further improved by employing a number of slightly different models of distributional semantics in an ensemble. Finally, the way in which the features are generated allows one to optimize them for various F-scores, giving some degree of control to trade off precision and recall. Methods that are able to improve performance on named entity recognition tasks by exploiting large amounts of unlabeled data may substantially reduce costs involved in creating annotated resources for every domain and every problem.
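
The feature-generation step described in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not confirm: the prototype is taken to be the centroid of the vectors of annotated words in a class, the threshold is chosen from the distances observed on the annotated words, and the word vectors are made-up 2-D toy values rather than output of random indexing. All function names (`learn_prototype`, `tune_threshold`, `semantic_feature`) are hypothetical.

```python
import numpy as np


def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); defined as 1.0 if either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0.0:
        return 1.0
    return float(1.0 - np.dot(u, v) / denom)


def learn_prototype(class_vectors):
    # ASSUMPTION: the prototype is the centroid of the annotated words' vectors.
    return np.mean(np.asarray(class_vectors, dtype=float), axis=0)


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta from counts; beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta * beta
    denom = (1.0 + b2) * tp + b2 * fn + fp
    return (1.0 + b2) * tp / denom if denom else 0.0


def tune_threshold(distances, labels, beta=1.0):
    """Return the observed distance that maximizes F-beta when used as a cutoff.

    A word counts as predicted-positive when its distance is <= the cutoff.
    """
    best_t, best_f = 0.0, -1.0
    for t in sorted(set(distances)):
        tp = sum(1 for d, y in zip(distances, labels) if d <= t and y)
        fp = sum(1 for d, y in zip(distances, labels) if d <= t and not y)
        fn = sum(1 for d, y in zip(distances, labels) if d > t and y)
        score = f_beta(tp, fp, fn, beta)
        if score > best_f:
            best_t, best_f = t, score
    return best_t, best_f


def semantic_feature(word_vector, prototype, threshold):
    """Binary feature: True if the word lies within the threshold of the prototype."""
    return bool(cosine_distance(word_vector, prototype) <= threshold)


# Toy 2-D "semantic space" (illustrative values only).
vectors = {
    "anna": np.array([1.0, 0.1]),
    "erik": np.array([0.9, 0.2]),
    "lund": np.array([0.1, 1.0]),
    "fever": np.array([0.0, 1.0]),
}
is_name = {"anna": True, "erik": True, "lund": False, "fever": False}

# Learn the prototype for the hypothetical "name" class from its annotated words.
prototype = learn_prototype([vectors[w] for w in vectors if is_name[w]])

# Tune the cutoff on the annotated words, here for F1 (beta=1.0).
words = list(vectors)
distances = [cosine_distance(vectors[w], prototype) for w in words]
labels = [is_name[w] for w in words]
threshold, best_f = tune_threshold(distances, labels, beta=1.0)

features = {w: semantic_feature(vectors[w], prototype, threshold) for w in words}
```

Passing a different `beta` to `tune_threshold` is one way to picture the precision/recall trade-off mentioned in the abstract: a larger `beta` favors cutoffs that admit more words (higher recall), a smaller one favors tighter cutoffs (higher precision). In the setting the abstract describes, one such binary feature per named entity class would be supplied to the sequence model alongside the orthographic and syntactic features.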
