Frontiers in Neurorobotics

Evolving Robust Policy Coverage Sets in Multi-Objective Markov Decision Processes Through Intrinsically Motivated Self-Play

Abstract

Many real-world decision-making problems involve multiple conflicting objectives that cannot be optimized simultaneously without a compromise. Such problems are known as multi-objective Markov decision processes, and they constitute a significant challenge for conventional single-objective reinforcement learning methods, especially when an optimal compromise cannot be determined beforehand. Multi-objective reinforcement learning methods address this challenge by finding an optimal coverage set of non-dominated policies that can satisfy any user preference for solving the problem. However, this comes at the cost of computational complexity, long training times, and a lack of adaptability to non-stationary environment dynamics. Addressing these limitations requires adaptive methods that can solve the problem in an online and robust manner. In this paper, we propose a novel developmental method that exploits adversarial self-play between an intrinsically motivated preference exploration component and a policy coverage set optimization component; the latter robustly evolves a convex coverage set of policies that solve the problem under the preferences proposed by the former. We experimentally demonstrate the effectiveness of the proposed method in comparison to state-of-the-art multi-objective reinforcement learning methods in both stationary and non-stationary environments.
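The abstract describes two interacting components: an intrinsically motivated explorer that proposes preference vectors, and an optimizer that evolves a convex coverage set (CCS) of policies, i.e. the set of policies that are optimal for at least one linear scalarization w·V of the vector-valued return. The sketch below is a minimal, hypothetical illustration of that adversarial loop on a toy multi-objective bandit rather than a full MDP; the names `PreferenceExplorer` and `CoverageSetOptimizer` and the learning-progress intrinsic reward are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of the two-component self-play loop on a toy
# multi-objective bandit (not the paper's actual implementation).
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: 5 actions, each with a 2-objective expected reward vector.
# The last action is Pareto-dominated and should drop out of the CCS.
TRUE_RETURNS = np.array([[1.0, 0.0], [0.8, 0.5], [0.5, 0.8],
                         [0.0, 1.0], [0.3, 0.3]])

def sample_reward(action):
    """Noisy vector-valued reward for an action."""
    return TRUE_RETURNS[action] + rng.normal(0.0, 0.1, size=2)

class CoverageSetOptimizer:
    """Estimates vector returns and keeps the policies (here: actions)
    that are optimal for at least one linear preference: a CCS estimate."""
    def __init__(self, n_actions, n_objectives):
        self.means = np.zeros((n_actions, n_objectives))
        self.counts = np.zeros(n_actions)

    def improve(self, w, episodes=20):
        """Epsilon-greedy learning under the scalarized reward w . r."""
        best = 0.0
        for _ in range(episodes):
            if rng.random() < 0.2:
                a = int(rng.integers(len(self.counts)))
            else:
                a = int(np.argmax(self.means @ w))
            r = sample_reward(a)
            self.counts[a] += 1
            self.means[a] += (r - self.means[a]) / self.counts[a]
            best = max(best, float(w @ self.means[a]))
        return best

    def coverage_set(self, weight_grid):
        """Actions maximizing w . V for some w on a grid."""
        return sorted({int(np.argmax(self.means @ w)) for w in weight_grid})

class PreferenceExplorer:
    """Intrinsically motivated proposer: favours preference regions where
    the optimizer's scalarized value is still improving."""
    def __init__(self, n_candidates=11):
        w1 = np.linspace(0.0, 1.0, n_candidates)
        self.candidates = np.stack([w1, 1.0 - w1], axis=1)
        self.progress = np.ones(n_candidates)   # optimistic initialization
        self.last_value = np.zeros(n_candidates)

    def propose(self):
        # Sample a preference proportionally to recent learning progress.
        p = self.progress / self.progress.sum()
        self.idx = rng.choice(len(self.candidates), p=p)
        return self.candidates[self.idx]

    def feedback(self, value):
        # Intrinsic reward = improvement of the scalarized value.
        gain = max(value - self.last_value[self.idx], 1e-3)
        self.progress[self.idx] = 0.9 * self.progress[self.idx] + 0.1 * gain
        self.last_value[self.idx] = value

explorer = PreferenceExplorer()
optimizer = CoverageSetOptimizer(n_actions=5, n_objectives=2)
for _ in range(300):                      # adversarial self-play loop
    w = explorer.propose()                # explorer proposes a preference
    value = optimizer.improve(w)          # optimizer adapts policies to it
    explorer.feedback(value)              # explorer rewarded by progress
print("estimated coverage set (actions):",
      optimizer.coverage_set(explorer.candidates))
```

In the paper's setting the optimizer would maintain actual policies over a multi-objective MDP rather than per-action estimates, and non-stationarity would be handled by running this loop continually so the coverage set keeps adapting.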
