Journal: Information Retrieval
A goodness of fit test approach in information retrieval

Abstract

In many probabilistic modeling approaches to Information Retrieval, we are interested in estimating how well a document model "fits" the user's information need (the query model). In statistics, meanwhile, goodness of fit tests are well-established techniques for assessing assumptions about the underlying distribution of a data set. Supposing that the query terms are randomly distributed across the documents of the collection, what we actually want to know is whether the query terms occur in a particular document more frequently than chance would predict. This can be quantified by goodness of fit tests. In this paper, we present a new document ranking technique based on Chi-square goodness of fit tests. Given the null hypothesis that there is no association between the query terms q and the document d beyond chance occurrences, we perform a Chi-square goodness of fit test to assess this hypothesis and calculate the corresponding Chi-square values. Our retrieval formula ranks the documents in the collection according to these calculated Chi-square values. The method was evaluated over the entire TREC test collection on disks 4 and 5, using the topics of the TREC-7 and TREC-8 conferences (50 topics each). It performs well, consistently outperforming the classical OKAPI term frequency weighting formula, although it remains below KL-Divergence from the language modeling approach. Despite this, we believe that the technique is an important non-parametric way of thinking about retrieval. It offers the possibility of trying simple alternative retrieval formulas within the framework of goodness-of-fit statistical tests, and of modeling the data in various ways by estimating or assigning any arbitrary theoretical distribution over terms.
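
The ranking idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's actual retrieval formula, so everything here is an assumption: the function names `chi_square_score` and `rank_documents` are hypothetical, the expected count of a term is taken to be the document length times the term's relative frequency in the collection, and the score is the plain Pearson statistic summed over the query terms.

```python
from collections import Counter


def chi_square_score(query_terms, doc_tokens, coll_freq, coll_len):
    """Pearson chi-square statistic over the query terms for one document.

    Assumption (not from the paper): under the null hypothesis a term's
    expected count in the document is doc_len * coll_freq[term] / coll_len.
    """
    observed_counts = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    score = 0.0
    for term in set(query_terms):
        expected = doc_len * coll_freq.get(term, 0) / coll_len
        if expected == 0:
            # Term never occurs in the collection: it carries no evidence.
            continue
        observed = observed_counts[term]
        score += (observed - expected) ** 2 / expected
    return score


def rank_documents(query_terms, docs):
    """Return (doc_id, score) pairs sorted by descending chi-square value."""
    coll_freq = Counter()
    for tokens in docs.values():
        coll_freq.update(tokens)
    coll_len = sum(coll_freq.values())
    scored = [
        (doc_id, chi_square_score(query_terms, tokens, coll_freq, coll_len))
        for doc_id, tokens in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy collection, invented purely for illustration.
docs = {
    "d1": ["chi", "square", "test", "chi"],
    "d2": ["language", "model", "retrieval", "model"],
    "d3": ["chi", "retrieval", "okapi", "weighting"],
}
ranking = rank_documents(["chi", "square"], docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

One caveat of this plain form is that the statistic is unsigned: a document in which a query term occurs less often than expected also receives a positive contribution. How the paper deals with this is not stated in the abstract, so the sketch leaves it unaddressed.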