Journal: Information Processing & Management
Learning entity-centric document representations using an entity facet topic model

Abstract

Learning semantic representations of documents is essential for various downstream applications, including text classification and information retrieval. Entities, as important sources of information, have played a crucial role in supporting latent representations of documents. In this work, we hypothesize that entities are not monolithic concepts; instead, they have multiple aspects, and different documents may discuss different aspects of a given entity. Given that, we argue that, from an entity-centric point of view, a document related to multiple entities should (a) be represented differently for different entities (multiple entity-centric representations), and (b) have each entity-centric representation reflect the specific aspects of the entity discussed in the document. We pose the following research questions: (1) can we confirm that entities have multiple aspects, with different aspects reflected in different documents, (2) can we learn a representation of entity aspects from a collection of documents, and a representation of documents based on the multiple entities and their aspects as reflected in the documents, (3) does this novel representation improve algorithm performance in downstream applications, and (4) what is a reasonable number of aspects per entity? To answer these questions, we model each entity using multiple aspects (entity facets), where each entity facet is represented as a mixture of latent topics. Then, given a document associated with multiple entities, we assume multiple entity-centric representations, where each entity-centric representation is a mixture of entity facets for each entity. Finally, a novel graphical model, the Entity Facet Topic Model (EFTM), is proposed in order to learn entity-centric document representations, entity facets, and latent topics.
Through experimentation we confirm that (1) entities are multi-faceted concepts which we can model and learn, (2) a multi-faceted entity-centric modeling of documents can lead to effective representations, which (3) can have an impact in downstream applications, and (4) considering a small number of facets is effective enough. In particular, we visualize entity facets within a set of documents, and demonstrate that different sets of documents indeed reflect different facets of entities. Further, we demonstrate that the proposed entity facet topic model generates better document representations in terms of perplexity, compared to state-of-the-art document representation methods. Moreover, we show that the proposed model outperforms baseline methods in the application of multi-label classification. Finally, we study the impact of EFTM's parameters and find that a small number of facets better captures entity-specific topics, which confirms the intuition that, on average, an entity has a small number of facets reflected in documents.
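
The hierarchy described in the abstract (a document holds one mixture over facets per entity, each facet is a mixture over latent topics, and each topic is a distribution over words) can be illustrated with a small generative sketch. This is not the authors' EFTM specification, which the abstract does not give: the symmetric Dirichlet priors, the uniform choice of an entity for each word, and every name below (`dirichlet`, `categorical`, `generate_document`, `entity_a`, `entity_b`) are assumptions made purely for illustration.

```python
import random


def dirichlet(alpha, size, rng):
    """Sample a probability vector from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(probs, rng):
    """Draw an index according to the given probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(entities, n_words, facet_topics, topic_words, rng, alpha=0.5):
    """Generate one document as a list of (entity, facet, topic, word_id) tuples."""
    # One entity-centric representation per entity: a mixture over that
    # entity's facets, specific to this document.
    doc_facets = {
        e: dirichlet(alpha, len(facet_topics[e]), rng) for e in entities
    }
    tokens = []
    for _ in range(n_words):
        e = rng.choice(entities)                  # assumption: uniform over entities
        f = categorical(doc_facets[e], rng)       # facet from the document's mixture
        z = categorical(facet_topics[e][f], rng)  # topic from the facet's mixture
        w = categorical(topic_words[z], rng)      # word from the topic
        tokens.append((e, f, z, w))
    return doc_facets, tokens


# Toy sizes, chosen arbitrarily for the illustration.
N_TOPICS, VOCAB_SIZE, N_FACETS = 4, 20, 2
ENTITIES = ["entity_a", "entity_b"]

rng = random.Random(0)
# Each latent topic is a distribution over the vocabulary.
topic_words = [dirichlet(0.5, VOCAB_SIZE, rng) for _ in range(N_TOPICS)]
# Each entity has a few facets; each facet is a mixture of latent topics.
facet_topics = {
    e: [dirichlet(0.5, N_TOPICS, rng) for _ in range(N_FACETS)] for e in ENTITIES
}

doc_facets, tokens = generate_document(ENTITIES, 30, facet_topics, topic_words, rng)
```

Here `doc_facets` plays the role of the multiple entity-centric representations of a single document: two documents mentioning the same entity can weight its facets differently while sharing the same facet-to-topic mixtures. Learning these quantities from observed text would require an inference procedure, which this sketch does not attempt.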
