Inverse reinforcement learning using Dynamic Policy Programming

Abstract

This paper proposes a novel model-free inverse reinforcement learning method based on density ratio estimation under the framework of Dynamic Policy Programming. We show that the logarithm of the ratio between the optimal policy and the baseline policy is represented by the state-dependent cost and the value function. We propose using density ratio estimation methods to estimate the density ratio of the policies, and regularized least squares to estimate the state-dependent cost and the value function that satisfy this relation. Our method avoids computing integrals such as the partition function. Simple numerical simulations of grid-world navigation, car driving, and pendulum swing-up show its superiority over conventional methods.
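As a concrete illustration of the pipeline the abstract describes, below is a minimal sketch in Python (NumPy and scikit-learn). It assumes the KL-regularized form of the relation, ln(pi(y|x)/b(y|x)) = r(x) + gamma*V(y) - V(x); the exact expression in the paper may differ. It also substitutes a specific density ratio estimator (probabilistic classification via logistic regression) and a toy chain environment; the environment, the policies, and helper names such as transitions and onehot_pair are illustrative assumptions, not taken from the paper.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    N = 10        # chain states 0..N-1; the goal sits at state N-1 (assumption)
    GAMMA = 0.95  # discount factor (assumption)

    def sample_next(x, policy):
        # baseline b: uniform left/right; "optimal" pi: biased toward the goal
        p = [0.5, 0.5] if policy == "b" else [0.2, 0.8]
        a = rng.choice([-1, 1], p=p)
        return int(np.clip(x + a, 0, N - 1))

    def transitions(policy, n):
        # uniform start states so the state marginals of the two datasets
        # match and the joint density ratio reduces to the policy ratio
        xs = rng.integers(0, N, size=n)
        ys = np.array([sample_next(int(x), policy) for x in xs])
        return xs, ys

    def onehot_pair(xs, ys):
        # tabular feature map for a transition (x, y): [e_x ; e_y]
        Z = np.zeros((len(xs), 2 * N))
        Z[np.arange(len(xs)), xs] = 1.0
        Z[np.arange(len(xs)), N + ys] = 1.0
        return Z

    n = 5000
    xs_pi, ys_pi = transitions("pi", n)
    xs_b, ys_b = transitions("b", n)

    # Step 1: density ratio estimation by probabilistic classification.
    # With equally sized samples, the log-odds of a classifier separating
    # pi-transitions (label 1) from b-transitions (label 0) estimate
    # ln p_pi(x,y)/p_b(x,y), which equals ln pi(y|x)/b(y|x) here.
    Z = np.vstack([onehot_pair(xs_pi, ys_pi), onehot_pair(xs_b, ys_b)])
    labels = np.concatenate([np.ones(n), np.zeros(n)])
    clf = LogisticRegression(C=10.0, max_iter=1000).fit(Z, labels)
    g = clf.decision_function(onehot_pair(xs_pi, ys_pi))  # estimated ln pi/b

    # Step 2: regularized least squares.
    # Fit g(x, y) ~= r(x) + GAMMA*V(y) - V(x); parameters theta = [r ; V].
    m = len(xs_pi)
    A = np.zeros((m, 2 * N))
    A[np.arange(m), xs_pi] = 1.0          # r(x)
    A[np.arange(m), N + ys_pi] += GAMMA   # + GAMMA * V(y)
    A[np.arange(m), N + xs_pi] -= 1.0     # - V(x)
    lam = 1e-3
    theta = np.linalg.solve(A.T @ A + lam * np.eye(2 * N), A.T @ g)
    r_hat, V_hat = theta[:N], theta[N:]
    print("estimated reward (up to shaping):", np.round(r_hat, 2))
    print("estimated value  (up to a constant):", np.round(V_hat, 2))

Note that only differences of V and sums with r enter the assumed relation, so r and V are identified only up to shaping and constant terms; the L2 penalty pins down one representative solution, mirroring the role of the regularized least squares step the abstract mentions. No partition function or other integral is evaluated anywhere in the pipeline.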
