Retrieval Evaluation Measures that Agree with Users' SERP Preferences: Traditional, Preference-based, and Diversity Measures

Sakai Tetsuya; Zeng Zhaohao



Abstract

We examine the "goodness" of ranked retrieval evaluation measures in terms of how well they align with users' Search Engine Result Page (SERP) preferences for web search. The SERP preferences cover 1,127 topic-SERP-SERP triplets extracted from the NTCIR-9 INTENT task, reflecting the views of 15 different assessors. Each assessor made two SERP preference judgements for each triplet: one in terms of relevance and the other in terms of diversity. For each evaluation measure, we compute the Agreement Rate (AR) of each triplet: the proportion of assessors that agree with the measure's SERP preference. We then compare the mean ARs of the measures as well as those of best/median/worst assessors using Tukey HSD tests. Our first experiment compares traditional ranked retrieval measures based on the SERP relevance preferences: we find that normalised Discounted Cumulative Gain (nDCG) and intentwise Rank-biased Utility (iRBU) perform best in that they are the only measures that are statistically indistinguishable from our best assessor; nDCG also statistically significantly outperforms our median assessor. Our second experiment utilises 119,646 document preferences that we collected for a subset of the above topic-SERP-SERP triplets (containing 894 triplets) to compare preference-based evaluation measures as well as traditional ones. Again, we evaluate them based on the SERP relevance preferences. The results suggest that measures such as wpref5 are the most promising among the preference-based measures considered, although they underperform the best traditional measures such as nDCG on average. Our third experiment compares diversified search measures based on the SERP diversity preferences as well as the SERP relevance preferences, and it shows that D#-measures are clearly the most reliable: in particular, D#-nDCG and D#-RBP statistically significantly outperform the median assessor and all intent-aware measures; they also outperform the recently proposed RBU on average. 
Also, in terms of agreement with SERP diversity preferences, D#-nDCG statistically significantly outperforms RBU. Hence, if IR researchers want to use evaluation measures that align well with users' SERP preferences, then we recommend nDCG and iRBU for traditional search, and D#-measures such as D#-nDCG for diversified search. As for the document preference-based measures that we have examined, we do not have a strong reason to recommend them over traditional measures like nDCG, since they align slightly less well with users' SERP preferences despite their quadratic assessment cost.
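
The Agreement Rate described in the abstract can be illustrated with a minimal sketch. It assumes each triplet reduces to one preference label from the measure ("L" or "R", for the left or right SERP) plus one label per assessor; the function names, the label encoding, and the toy data are hypothetical and not taken from the paper. How the authors handle ties between the two SERPs is not stated in the abstract, so ties are not modelled here.

```python
from statistics import mean


def agreement_rate(measure_pref, assessor_prefs):
    """AR of one triplet: the proportion of assessors whose SERP
    preference matches the preference implied by the measure."""
    if not assessor_prefs:
        raise ValueError("need at least one assessor preference")
    agree = sum(1 for pref in assessor_prefs if pref == measure_pref)
    return agree / len(assessor_prefs)


def mean_agreement_rate(triplets):
    """Mean AR over (measure_pref, assessor_prefs) pairs, one per triplet."""
    return mean(agreement_rate(m, prefs) for m, prefs in triplets)


# Hypothetical toy data, not from the paper: three triplets, three assessors each.
toy_triplets = [
    ("L", ["L", "L", "R"]),  # 2 of 3 assessors agree with the measure
    ("R", ["R", "R", "R"]),  # all assessors agree
    ("L", ["R", "R", "L"]),  # 1 of 3 assessors agrees
]

if __name__ == "__main__":
    print(round(mean_agreement_rate(toy_triplets), 3))
```

The per-triplet ARs of several measures computed this way are what the abstract's Tukey HSD tests would compare; the sketch stops at the mean and does not perform the significance test.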


Bibliographic details

Source
ACM Transactions on Information Systems | 2021, Issue 2 | pp. 14.1-14.35 | 35 pages
Authors
Sakai Tetsuya; Zeng Zhaohao
Author affiliations

Waseda Univ Shinjuku Ku 3-4-1 Okubo Tokyo 1698555 Japan;

Waseda Univ Shinjuku Ku 3-4-1 Okubo Tokyo 1698555 Japan;

Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
Original format: PDF
Language: English
Chinese Library Classification
Keywords
Document preferences; evaluation measures; preference assessments; search engine result pages; search result diversification; SERP preferences;


