Journal: Information Processing & Management

Field independent probabilistic model for clustering multi-field documents


Abstract

We propose a new finite mixture model for clustering multi-field documents, such as scientific literature with distinct fields: title, abstract, keywords, main text and references. This probabilistic model, which we call the field independent clustering model (FICM), incorporates the distinct word distribution of each field, both to integrate the discriminative abilities of the fields and to select the most suitable component probabilistic model for each field. We evaluated FICM by clustering three-field (title, abstract and MeSH) biomedical documents from the TREC 2004 and 2005 Genomics tracks and two-field (title and abstract) news reports from Reuters-21578. In these experiments FICM outperformed the classical multinomial model and the multivariate Bernoulli model, and the difference was statistically significant on all three collections. These results indicate that, by considering the characteristics of each field, FICM outperformed widely used probabilistic models for document clustering. We further showed that a component model consistent with the nature of its corresponding field achieved better performance, and that considering the diversity of model settings gave a further improvement. An extended abstract of parts of the work presented in this paper has appeared in Zhu et al. [Zhu, S., Takigawa, I., Zhang, S., & Mamitsuka, H. (2007). A probabilistic model for clustering text documents with multiple fields. In Proceedings of the 29th European conference on information retrieval, ECIR 2007. Lecture notes in computer science (Vol. 4425, pp. 331-342)].
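The abstract gives no equations, so the following is only a minimal sketch of the general idea it describes, not the authors' FICM formulation. It assumes that, given the cluster, a document's fields are independent, so the per-field log-likelihoods add, and that each field may use either a multinomial or a multivariate Bernoulli component. The function names, the smoothing constant `alpha`, the optional `init` argument and the toy data are all hypothetical choices made for illustration.

```python
import math
import random

def field_log_lik(x, theta, kind):
    """Log-likelihood of one field's word counts x under one cluster's parameters."""
    if kind == "multinomial":
        # The multinomial coefficient is dropped: it is identical for every cluster.
        return sum(c * math.log(t) for c, t in zip(x, theta))
    # Multivariate Bernoulli: only word presence or absence matters.
    return sum(math.log(t) if c > 0 else math.log(1.0 - t) for c, t in zip(x, theta))

def estimate_field(X, resp_k, kind, alpha=1.0):
    """Smoothed M-step for one field and one cluster (alpha is an assumed prior count)."""
    vocab = len(X[0])
    if kind == "multinomial":
        w = [alpha + sum(r * x[v] for r, x in zip(resp_k, X)) for v in range(vocab)]
        total = sum(w)
        return [wv / total for wv in w]
    n_k = sum(resp_k)
    return [(alpha + sum(r for r, x in zip(resp_k, X) if x[v] > 0)) / (n_k + 2.0 * alpha)
            for v in range(vocab)]

def fit_field_mixture(fields, kinds, k, n_iter=30, seed=0, init=None):
    """EM for a mixture whose document likelihood factorizes over fields (a sketch)."""
    n = len(fields[0])
    rng = random.Random(seed)
    if init is None:
        # Random soft assignments, normalized per document.
        resp = []
        for _ in range(n):
            row = [rng.random() + 1e-3 for _ in range(k)]
            s = sum(row)
            resp.append([r / s for r in row])
    else:
        resp = [list(row) for row in init]
    for _ in range(n_iter):
        # M-step: smoothed mixing weights and one parameter set per (field, cluster).
        col = [[row[j] for row in resp] for j in range(k)]
        pi = [(sum(c) + 1.0) / (n + k) for c in col]
        theta = [[estimate_field(X, col[j], kind) for j in range(k)]
                 for X, kind in zip(fields, kinds)]
        # E-step: per-field log-likelihoods add under the independence assumption.
        for i in range(n):
            logp = [math.log(pi[j]) + sum(field_log_lik(X[i], theta[f][j], kinds[f])
                                          for f, X in enumerate(fields))
                    for j in range(k)]
            m = max(logp)
            w = [math.exp(l - m) for l in logp]
            s = sum(w)
            resp[i] = [x / s for x in w]
    labels = [max(range(k), key=lambda j: row[j]) for row in resp]
    return labels, resp

# Toy two-field corpus (invented data): 6 documents, 4-word vocabulary per field.
titles = [[1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 0, 0],
          [0, 0, 1, 1], [0, 0, 0, 1], [0, 0, 1, 1]]
abstracts = [[3, 2, 0, 0], [2, 2, 0, 0], [4, 1, 0, 0],
             [0, 0, 2, 3], [0, 0, 3, 1], [0, 0, 2, 2]]
labels, resp = fit_field_mixture([titles, abstracts], ["bernoulli", "multinomial"], k=2)
```

Pairing a Bernoulli component with the short title field and a multinomial component with the longer abstract field is an arbitrary choice here; the abstract reports only that matching the component model to the nature of each field helped, without saying which pairing was used.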
