Invited Talk: Evaluating Natural Language Generation Systems

Abstract

Natural Language Generation (NLG) systems have different characteristics from other NLP systems, which affects how they are evaluated. In particular, it can be difficult to meaningfully evaluate NLG texts by comparing them against gold-standard reference texts, because (A) there are usually many possible texts which are acceptable to users and (B) some NLG systems produce texts which are better (as judged by human users) than human-written corpus texts. Partly for these reasons, the NLG community places much more emphasis on human-based evaluations than most other areas of NLP. I will discuss the various ways in which NLG systems are evaluated, focusing on human-based evaluations. These typically either measure the success of generated texts at achieving a goal (e.g., measuring how many people change their behaviour after reading behaviour-change texts produced by an NLG system), or ask human subjects to rate various aspects of generated texts (such as readability, accuracy, and appropriateness), often on Likert scales. I will use examples from evaluations I have carried out, and highlight some of the lessons I have learnt, including the importance of reporting negative results, the difference between laboratory and real-world evaluations, and the need to look at worst-case as well as average-case performance. I hope my talk will be interesting and relevant to anyone who is interested in the evaluation of NLP systems.
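
As a minimal sketch of the last lesson mentioned in the abstract (looking at worst-case as well as average-case performance), the Python snippet below aggregates Likert ratings per generated text and reports both the average and the lowest-rated text. The text names, aspect names, ratings, and the `summarize` helper are all hypothetical, invented for illustration, and not taken from the talk.

```python
from statistics import mean

# Hypothetical 5-point Likert ratings (1 = worst, 5 = best), three raters
# per generated text. All values are invented for illustration only.
ratings = {
    "text_a": {"readability": [5, 4, 5], "accuracy": [4, 4, 5]},
    "text_b": {"readability": [4, 5, 4], "accuracy": [5, 4, 4]},
    "text_c": {"readability": [2, 1, 2], "accuracy": [1, 2, 1]},
}


def summarize(ratings, aspect):
    """Return (average, worst_text, worst_score) over per-text mean ratings."""
    # Average the raters' scores for each text first.
    per_text = {name: mean(scores[aspect]) for name, scores in ratings.items()}
    # The worst case is the text with the lowest mean rating.
    worst_text = min(per_text, key=per_text.get)
    average = mean(list(per_text.values()))
    return average, worst_text, per_text[worst_text]


for aspect in ("readability", "accuracy"):
    average, worst_text, worst_score = summarize(ratings, aspect)
    print(f"{aspect}: average={average:.2f}, worst={worst_score:.2f} ({worst_text})")
```

In this invented data, two well-rated texts keep the average moderate even though one text is rated very poorly, which is the kind of case an average-only report would hide.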

Bibliographic details

Source
Conference on the North American Chapter of the Association for Computational Linguistics: Human Language Technologies | 2016 | xx-xx | 1 page
Author
Ehud Reiter
Original format: PDF
