Annual Meeting of the Association for Computational Linguistics

Studying Summarization Evaluation Metrics in the Appropriate Scoring Range

Abstract

In summarization, automatic evaluation metrics are usually compared based on their ability to correlate with human judgments. Unfortunately, the few existing human judgment datasets were created as by-products of the manual evaluations performed during the DUC/TAC shared tasks. However, modern systems are typically better than the best systems submitted at the time of these shared tasks. We show that, surprisingly, evaluation metrics which behave similarly on these datasets (the average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate. This is problematic because the metrics disagree, yet we cannot decide which one to trust. We therefore call for collecting human judgments on high-scoring summaries, as this would resolve the debate over which metrics to trust. It would also be greatly beneficial for further improving summarization systems and metrics alike.
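As a minimal illustration (not the authors' code), the sketch below shows one way to quantify the kind of disagreement the abstract describes: compute a rank correlation (here Kendall's tau) between two automatic metrics over all summaries, and again restricted to a high-scoring subset. The metric score arrays and the 80th-percentile cutoff are hypothetical placeholders chosen only for the example.

    import numpy as np
    from scipy.stats import kendalltau

    rng = np.random.default_rng(0)

    # Hypothetical per-summary scores from two automatic metrics
    # (e.g. two ROUGE variants); real studies would use actual metric outputs.
    metric_a = rng.uniform(0.0, 1.0, size=500)
    metric_b = 0.7 * metric_a + 0.3 * rng.uniform(0.0, 1.0, size=500)

    def agreement(scores_a, scores_b, mask):
        """Kendall's tau between two metrics on the subset selected by `mask`."""
        tau, _ = kendalltau(scores_a[mask], scores_b[mask])
        return tau

    all_summaries = np.ones_like(metric_a, dtype=bool)
    # Assumed cutoff: top 20% of summaries under metric A as the "high-scoring range".
    high_scoring = metric_a >= np.quantile(metric_a, 0.8)

    print("agreement over the full scoring range:", agreement(metric_a, metric_b, all_summaries))
    print("agreement in the high-scoring range:  ", agreement(metric_a, metric_b, high_scoring))

Restricting the correlation to the top-scoring subset typically lowers it, which mirrors the paper's observation that metrics agreeing on average-quality summaries can diverge sharply where current systems operate.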
