IEEE Transactions on Neural Networks and Learning Systems

An Equivalence Between Adaptive Dynamic Programming With a Critic and Backpropagation Through Time



Abstract

We consider the adaptive dynamic programming technique called Dual Heuristic Programming (DHP), which is designed to learn a critic function, when using learned model functions of the environment. DHP is designed for optimizing control problems in large and continuous state spaces. We extend DHP into a new algorithm that we call Value-Gradient Learning, ${\rm VGL}(\lambda)$, and prove equivalence of an instance of the new algorithm to Backpropagation Through Time for Control with a greedy policy. Not only does this equivalence provide a link between these two different approaches, but it also enables our variant of DHP to have guaranteed convergence, under certain smoothness conditions and a greedy policy, when using a general smooth nonlinear function approximator for the critic. We consider several experimental scenarios including some that prove divergence of DHP under a greedy policy, which contrasts against our proven-convergent algorithm.
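To make the idea concrete, here is a minimal, hypothetical sketch of a value-gradient critic update on a 1-D linear-quadratic toy problem. Everything here (the dynamics `f(x,u) = x + u`, the cost `r = -(x^2 + u^2)`, the linear critic `G(x) = w*x`, and all parameter values) is an illustrative assumption, not the paper's actual algorithm, notation, or experiments; only the λ-blending of the bootstrapped critic gradient (as in DHP) with the backpropagated gradient target (as in BPTT) loosely mirrors the VGL(λ) idea described in the abstract.

```python
gamma, lam, alpha = 0.9, 0.5, 0.1   # discount, blend factor, learning rate

def reward_grad(x):
    # dr/dx for the toy cost-style reward r(x, u) = -(x^2 + u^2)
    return -2.0 * x

def step(x, u):
    # Known (learned, in DHP's setting) model: f(x, u) = x + u
    return x + u

def greedy_action(x, w):
    # First-order greedy condition with critic G(x) = w*x:
    # dr/du + gamma * G(f(x,u)) * df/du = -2u + gamma*w*(x + u) = 0
    return gamma * w * x / (2.0 - gamma * w)

def vgl_lambda_sweep(x0, w, T=20):
    # Forward rollout under the greedy policy.
    xs, us = [x0], []
    for _ in range(T):
        u = greedy_action(xs[-1], w)
        us.append(u)
        xs.append(step(xs[-1], u))
    # Backward pass: lambda blends the backpropagated gradient target
    # (BPTT-like) with the critic's own bootstrap estimate (DHP-like).
    Gp = w * xs[-1]  # terminal target taken from the critic itself
    for t in reversed(range(T)):
        blended = lam * Gp + (1.0 - lam) * (w * xs[t + 1])
        Gp = reward_grad(xs[t]) + gamma * 1.0 * blended  # df/dx = 1 here
        # Move the critic's gradient estimate G(x;w) = w*x toward the target.
        w += alpha * (Gp - w * xs[t]) * xs[t]
    return w

w = 0.0
for _ in range(50):
    w = vgl_lambda_sweep(1.0, w)
```

With λ = 1 the backward pass reduces to a pure backpropagated (BPTT-style) target, while λ = 0 bootstraps entirely from the critic, which is the DHP-style limit; the weight `w` here settles to a negative slope, as expected for a cost-to-go gradient on this toy problem.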
