Guided Dialog Policy Learning: Reward Estimation for Multi-Domain Task-Oriented Dialog

Abstract

Dialog policy decides what and how a task-oriented dialog system will respond, and plays a vital role in delivering effective conversations. Many studies apply Reinforcement Learning to learn a dialog policy with a reward function, which requires elaborate design and pre-specified user goals. With the growing need to handle complex goals across multiple domains, such manually designed reward functions cannot scale to the complexity of real-world tasks. To this end, we propose Guided Dialog Policy Learning, a novel algorithm based on Adversarial Inverse Reinforcement Learning for joint reward estimation and policy optimization in multi-domain task-oriented dialog. The proposed approach estimates the reward signal and infers the user goal from the dialog sessions. The reward estimator evaluates state-action pairs so that it can guide the dialog policy at each dialog turn. Extensive experiments on a multi-domain dialog dataset show that the dialog policy guided by the learned reward function achieves remarkably higher task success than state-of-the-art baselines.
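The per-turn guidance described above follows the standard AIRL construction, where a learned scoring function f over state-action pairs is combined with the policy's action probability to form a discriminator, and the resulting reward is used at every dialog turn. A minimal sketch of that idea, assuming a hypothetical linear scorer `f_omega` and illustrative state/action dimensions (none of these names or sizes come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, ACTION_DIM = 8, 4

# Hypothetical linear reward estimator f_omega(s, a) over a state-action pair.
W = rng.normal(scale=0.1, size=(STATE_DIM + ACTION_DIM,))

def f_omega(state, action_onehot):
    """Score a state-action pair (the learned reward estimator)."""
    return float(W @ np.concatenate([state, action_onehot]))

def airl_discriminator(state, action_onehot, log_pi):
    """AIRL-style discriminator: D = exp(f) / (exp(f) + pi(a|s))."""
    f = f_omega(state, action_onehot)
    return np.exp(f) / (np.exp(f) + np.exp(log_pi))

def turn_reward(state, action_onehot, log_pi):
    """Per-turn reward: log D - log(1 - D), which simplifies to f - log pi."""
    d = airl_discriminator(state, action_onehot, log_pi)
    return np.log(d) - np.log(1.0 - d)

# One simulated dialog turn: the estimator gives the policy a dense,
# per-turn signal instead of a sparse hand-designed task reward.
s = rng.normal(size=STATE_DIM)
a = np.eye(ACTION_DIM)[1]      # one-hot system action
log_pi = np.log(0.25)          # policy probability of that action
r = turn_reward(s, a, log_pi)
```

In this formulation the reward reduces algebraically to f(s, a) - log pi(a|s), so the estimator rewards actions the scorer favors relative to how likely the current policy already is to take them.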

