Journal: International Journal of Information Technology & Decision Making

Directed Exploration in Black-Box Optimization for Multi-Objective Reinforcement Learning

Abstract

Real-world problems usually involve the optimization of multiple, possibly conflicting, objectives. These problems may be addressed by Multi-Objective Reinforcement Learning (MORL) techniques. MORL is a generalization of standard Reinforcement Learning (RL) in which the single reward signal is extended to multiple signals, one for each objective; it is the process of learning policies that optimize multiple objectives simultaneously. In these problems, directional/gradient information can be useful to guide the exploration toward progressively better behaviors. However, traditional policy-gradient approaches have two main drawbacks: they require a batch of episodes to properly estimate the gradient information (which reduces the learning speed), and they use stochastic policies, which could have a disastrous impact on the safety of the learning system. In this paper, we present a novel population-based MORL algorithm for problems in which the underlying objectives are reasonably smooth. It has two main characteristics: fast computation of the gradient information for each objective through the use of neighboring solutions, and the use of this information to carry out a geometric partition of the search space and thus direct the exploration to promising areas. Finally, the algorithm is evaluated and compared to policy-gradient MORL algorithms on different multi-objective problems: the water reservoir and the biped walking problem (the latter both in simulation and on a real robot).
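The abstract does not say how gradients are obtained from neighboring solutions, so the following is only an illustrative sketch of one common way to do it, not the paper's method. It assumes each solution in the population is a policy parameter vector with one already-evaluated return per objective, and fits a first-order model to the differences between a solution and its neighbors by least squares. The names `estimate_gradients` and `toy_objectives` are hypothetical, and the toy objectives are a stand-in for real policy evaluations.

```python
import numpy as np


def estimate_gradients(center, center_returns, neighbors, neighbor_returns):
    """Least-squares estimate of one gradient per objective (illustrative only).

    center:           (d,)   policy parameters of the solution of interest
    center_returns:   (m,)   its return on each of the m objectives
    neighbors:        (k, d) parameters of k nearby solutions
    neighbor_returns: (k, m) their returns on each objective
    Returns an (m, d) array; row j approximates the gradient of objective j.
    """
    d_theta = np.asarray(neighbors, dtype=float) - np.asarray(center, dtype=float)
    d_ret = np.asarray(neighbor_returns, dtype=float) - np.asarray(center_returns, dtype=float)
    # First-order model: d_ret ≈ d_theta @ G.T, solved for G.T in the least-squares sense.
    g_transposed, *_ = np.linalg.lstsq(d_theta, d_ret, rcond=None)
    return g_transposed.T


def toy_objectives(theta):
    """Hypothetical stand-in for policy evaluation: two linear, conflicting objectives."""
    theta = np.asarray(theta, dtype=float)
    return np.array([2.0 * theta[0] - theta[1], -theta[0] + 3.0 * theta[1]])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    center = np.array([0.5, -0.2])
    # Six nearby solutions, as might already exist in a population.
    neighbors = center + 0.05 * rng.standard_normal((6, 2))
    returns = np.array([toy_objectives(n) for n in neighbors])
    print(estimate_gradients(center, toy_objectives(center), neighbors, returns))
```

Because no extra episodes are run beyond those already used to evaluate the population, an estimate of this kind is cheap, which matches the abstract's emphasis on fast gradient computation. It is only reliable when the objectives are reasonably smooth and the neighbors span the parameter space; for the linear toy objectives above the fit is exact up to floating-point error.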
