IFAC PapersOnLine

Demonstration Guided Actor-Critic Deep Reinforcement Learning for Fast Teaching of Robots in Dynamic Environments



Abstract

Using direct reinforcement learning (RL) to accomplish a task can be very inefficient, especially in robotic settings where interactions with the environment are lengthy and costly. Learning from expert demonstration (LfD) is an alternative approach that achieves better performance in an RL setting and greatly improves sample efficiency. We propose a novel demonstration learning framework for actor-critic based algorithms. First, we put forward an environment pre-training paradigm that initializes the model parameters without interacting with the target environment, which effectively avoids the cold-start problem in deep RL scenarios. Second, we design a general-purpose LfD framework for most mainstream actor-critic RL algorithms that include a policy network and a value function, such as PPO, SAC, TRPO, and A3C. Third, we build a dedicated model training platform to perform human-robot interaction and numerical experimentation. We evaluate the method in six MuJoCo simulated locomotion environments and on our robot control simulation platform. Results show that several epochs of pre-training improve the agent's performance in the early stage of training, and the final converged performance of the RL algorithm is further boosted by the external demonstrations. Overall, the proposed method improves sample efficiency by 30%. Our demonstration pipeline makes full use of the exploration property of the RL algorithm and is feasible for fast teaching of robots in dynamic environments.
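The abstract does not detail the pre-training procedure, so the following is only a minimal sketch of one plausible reading: a behavior-cloning warm start of the actor from expert (state, action) pairs before standard actor-critic fine-tuning (e.g. with PPO or SAC). All class names, dimensions, and the synthetic demonstration data below are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumed, not the paper's code): warm-start an actor-critic
# policy from demonstrations via behavior cloning, then hand the model to a
# regular RL algorithm for fine-tuning in the target environment.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        # Policy network (actor) and value function (critic), as in PPO/SAC-style agents.
        self.actor = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.critic = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.actor(obs), self.critic(obs)

def pretrain_from_demonstrations(model, demo_obs, demo_act, epochs=10, lr=1e-3):
    """Behavior-cloning warm start: regress the actor onto demonstrated actions
    so the agent avoids a cold start before touching the target environment."""
    opt = torch.optim.Adam(model.actor.parameters(), lr=lr)
    for _ in range(epochs):
        pred_act = model.actor(demo_obs)
        loss = nn.functional.mse_loss(pred_act, demo_act)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

# Usage with random placeholder demonstrations (stand-ins for expert data
# collected on a MuJoCo locomotion task):
obs_dim, act_dim, n_demo = 17, 6, 256
model = ActorCritic(obs_dim, act_dim)
demo_obs = torch.randn(n_demo, obs_dim)
demo_act = torch.randn(n_demo, act_dim)
model = pretrain_from_demonstrations(model, demo_obs, demo_act)
# ...then continue with PPO/SAC/TRPO fine-tuning in the target environment.
```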