Journal: ACM Transactions on Information Systems

Retrieval Evaluation Measures that Agree with Users' SERP Preferences: Traditional, Preference-based, and Diversity Measures


Abstract

We examine the "goodness" of ranked retrieval evaluation measures in terms of how well they align with users' Search Engine Result Page (SERP) preferences for web search. The SERP preferences cover 1,127 topic-SERP-SERP triplets extracted from the NTCIR-9 INTENT task, reflecting the views of 15 different assessors. Each assessor made two SERP preference judgements for each triplet: one in terms of relevance and the other in terms of diversity. For each evaluation measure, we compute the Agreement Rate (AR) of each triplet: the proportion of assessors that agree with the measure's SERP preference. We then compare the mean ARs of the measures as well as those of best/median/worst assessors using Tukey HSD tests. Our first experiment compares traditional ranked retrieval measures based on the SERP relevance preferences: we find that normalised Discounted Cumulative Gain (nDCG) and intentwise Rank-biased Utility (iRBU) perform best in that they are the only measures that are statistically indistinguishable from our best assessor; nDCG also statistically significantly outperforms our median assessor. Our second experiment utilises 119,646 document preferences that we collected for a subset of the above topic-SERP-SERP triplets (containing 894 triplets) to compare preference-based evaluation measures as well as traditional ones. Again, we evaluate them based on the SERP relevance preferences. The results suggest that measures such as wpref5 are the most promising among the preference-based measures considered, although they underperform the best traditional measures such as nDCG on average. Our third experiment compares diversified search measures based on the SERP diversity preferences as well as the SERP relevance preferences, and it shows that D#-measures are clearly the most reliable: in particular, D#-nDCG and D#-RBP statistically significantly outperform the median assessor and all intent-aware measures; they also outperform the recently proposed RBU on average. Also, in terms of agreement with SERP diversity preferences, D#-nDCG statistically significantly outperforms RBU. Hence, if IR researchers want to use evaluation measures that align well with users' SERP preferences, then we recommend nDCG and iRBU for traditional search, and D#-measures such as D#-nDCG for diversified search. As for the document preference-based measures that we have examined, we do not have a strong reason to recommend them over traditional measures like nDCG, since they align slightly less well with users' SERP preferences despite their quadratic assessment cost.
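
The Agreement Rate described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the "A"/"B"/"tie" labels, the toy scores, and the rule that a tied measure agrees with no assessor are all assumptions made for illustration, since the abstract does not say how ties are handled.

```python
from statistics import mean


def measure_preference(score_a, score_b):
    """Return the SERP that the measure prefers: 'A', 'B', or 'tie'."""
    if score_a > score_b:
        return "A"
    if score_b > score_a:
        return "B"
    return "tie"


def agreement_rate(score_a, score_b, assessor_prefs):
    """Proportion of assessors whose SERP preference matches the measure's.

    Assumption: assessors always answer 'A' or 'B', so a measure that
    scores both SERPs equally agrees with nobody (AR = 0).
    """
    pref = measure_preference(score_a, score_b)
    return sum(p == pref for p in assessor_prefs) / len(assessor_prefs)


def mean_agreement_rate(triplets):
    """Mean AR over (score_a, score_b, assessor_prefs) triplets."""
    return mean(agreement_rate(a, b, prefs) for a, b, prefs in triplets)


# Hypothetical toy data: measure scores for SERP A and SERP B on one topic,
# followed by each assessor's preferred SERP.
triplets = [
    (0.75, 0.5, ["A", "A", "B", "A"]),  # measure prefers A; 3 of 4 agree
    (0.25, 0.5, ["B", "A"]),            # measure prefers B; 1 of 2 agree
]
print(mean_agreement_rate(triplets))  # 0.625
```

In the study, a mean AR of this kind is computed for each measure and for the best, median, and worst assessors, and the means are then compared with Tukey HSD tests; that statistical step is left out of the sketch.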
