Advanced Robotics: The International Journal of the Robotics Society of Japan

Integration of imitation learning using GAIL and reinforcement learning using task-achievement rewards via probabilistic graphical model

Abstract

The integration of reinforcement learning (RL) and imitation learning (IL) is an important problem that has long been studied in the field of intelligent robotics. RL optimizes policies to maximize the cumulative reward, whereas IL attempts to extract general knowledge about the trajectories demonstrated by experts, i.e., demonstrators. Because each has its own drawbacks, many methods that combine them and compensate for each set of drawbacks have been explored thus far. However, many of these methods are heuristic and lack a solid theoretical basis. This paper presents a new theory for integrating RL and IL by extending the probabilistic graphical model (PGM) framework for RL, control as inference. We develop a new PGM for RL with multiple types of rewards, called the probabilistic graphical model for Markov decision processes with multiple optimality emissions (pMDP-MO). Furthermore, we demonstrate that the integrated learning method of RL and IL can be formulated as probabilistic inference of policies on the pMDP-MO by considering the discriminator in generative adversarial imitation learning (GAIL) as an additional optimality emission. We adapt GAIL and the task-achievement reward to our proposed framework, achieving significantly better performance than policies trained with baseline methods.
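To make the described idea concrete, the following is a minimal sketch (not from the paper) of how a GAIL discriminator output can be treated as an additional optimality emission alongside a task-achievement reward under control as inference; the callable names task_reward_fn and discriminator and the weighting coefficients are illustrative assumptions.

import numpy as np

def combined_reward(state, action, task_reward_fn, discriminator,
                    w_task=1.0, w_imit=1.0):
    """Hypothetical sketch: combine a task-achievement reward with an
    imitation reward derived from a GAIL discriminator, treating each as a
    separate optimality emission. task_reward_fn and discriminator are
    assumed callables, not part of the paper's published code."""
    # Task-achievement optimality: p(O_task = 1 | s, a) is proportional to
    # exp(r_task(s, a)), so r_task serves as its log-likelihood term.
    r_task = task_reward_fn(state, action)

    # Imitation optimality: the discriminator D(s, a) estimates the
    # probability that (s, a) came from the expert; log D(s, a) acts as
    # the imitation reward, as in GAIL-style adversarial IL.
    d = discriminator(state, action)
    r_imitation = np.log(np.clip(d, 1e-8, 1.0))

    # Under control as inference, independent optimality variables factor,
    # so their log-probabilities (rewards) simply add.
    return w_task * r_task + w_imit * r_imitation

In this reading, policy inference on a model with two optimality emissions reduces to RL on the summed reward, which is one way the abstract's integration of GAIL and task-achievement rewards can be interpreted.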
