Annual Meeting of the Association for Computational Linguistics

Studying Summarization Evaluation Metrics in the Appropriate Scoring Range

Abstract

In summarization, automatic evaluation metrics are usually compared based on their ability to correlate with human judgments. Unfortunately, the few existing human judgment datasets were created as by-products of the manual evaluations performed during the DUC/TAC shared tasks. However, modern systems are typically better than the best systems submitted at the time of these shared tasks. We show that, surprisingly, evaluation metrics which behave similarly on these datasets (the average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate. This is problematic because the metrics disagree, yet we cannot decide which one to trust. We therefore call for collecting human judgments on high-scoring summaries, as this would resolve the debate over which metrics to trust. It would also be greatly beneficial for further improving summarization systems and metrics alike.
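As a minimal illustration (not the authors' code), the sketch below shows one way to quantify the kind of disagreement the abstract describes: compute a rank correlation (here Kendall's tau) between two automatic metrics over all summaries, and again restricted to a high-scoring subset. The metric score arrays and the 80th-percentile cutoff are hypothetical placeholders chosen only for the example.

    import numpy as np
    from scipy.stats import kendalltau

    rng = np.random.default_rng(0)

    # Hypothetical per-summary scores from two automatic metrics
    # (e.g. two ROUGE variants); real studies would use actual metric outputs.
    metric_a = rng.uniform(0.0, 1.0, size=500)
    metric_b = 0.7 * metric_a + 0.3 * rng.uniform(0.0, 1.0, size=500)

    def agreement(scores_a, scores_b, mask):
        """Kendall's tau between two metrics on the subset selected by `mask`."""
        tau, _ = kendalltau(scores_a[mask], scores_b[mask])
        return tau

    all_summaries = np.ones_like(metric_a, dtype=bool)
    # Assumed cutoff: top 20% of summaries under metric A as the "high-scoring range".
    high_scoring = metric_a >= np.quantile(metric_a, 0.8)

    print("agreement over the full scoring range:", agreement(metric_a, metric_b, all_summaries))
    print("agreement in the high-scoring range:  ", agreement(metric_a, metric_b, high_scoring))

Restricting the correlation to the top-scoring subset typically lowers it, which mirrors the paper's observation that metrics agreeing on average-quality summaries can diverge sharply where current systems operate.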
