Conference: International Conference on Artificial Intelligence Planning and Scheduling; 2000 Apr 14-17; Breckenridge, CO (US)

Approximate Solutions to Factored Markov Decision Processes via Greedy Search in the Space of Finite State Controllers

Abstract

In stochastic planning problems formulated as factored Markov decision processes (MDPs), also called dynamic belief network MDPs (DBN-MDPs) (Boutilier, Dean, & Hanks 1999), finding the best policy (or conditional plan) is NP-hard. One source of difficulty is that the number of conditionals required to specify the policy can grow exponentially in the size of the MDP's representation. Several recent algorithms find an approximate policy by restricting the representation of conditionals using decision trees. We propose an alternative policy representation for factored MDPs in terms of finite-state machine (FSM) controllers. Since in practice we are forced to limit the number of conditionals, we claim there is a benefit to using FSM controllers: they can use their internal state to maintain context information that might otherwise require a large conditional table or decision tree. Although the optimal policy might not be representable as a finite-state controller with a fixed amount of memory, we will be satisfied with finding a "good" policy; to that end, we derive a stochastic greedy-search algorithm based on recent developments in reinforcement learning (Baird & Moore 1999) and then demonstrate its performance in some example domains.
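The controller-plus-greedy-search idea in the abstract can be sketched in a few lines of Python. This is a simplified stand-in, not the paper's algorithm: the paper derives a stochastic gradient-based search from Baird & Moore's (1999) framework, whereas the sketch below uses plain hill-climbing over a deterministic controller's tables. The `AlternatingEnv` toy domain and all class and function names are illustrative assumptions, chosen only to show why internal controller state can substitute for a large conditional table.

```python
import random


class FSMController:
    """A finite-state controller: a policy with bounded internal memory.

    Rather than a conditional table over the full (factored) MDP state,
    the controller keeps a small set of internal nodes; the action taken
    and the next internal node depend only on the current node and the
    latest observation.
    """

    def __init__(self, n_nodes, actions, observations, seed=0):
        rng = random.Random(seed)
        self.n_nodes = n_nodes
        self.actions = list(actions)
        # Action emitted at each internal node.
        self.action_of = {q: rng.choice(self.actions) for q in range(n_nodes)}
        # Deterministic node transition: (node, observation) -> next node.
        self.next_node = {(q, o): rng.randrange(n_nodes)
                          for q in range(n_nodes) for o in observations}

    def rollout(self, env, horizon):
        """Total reward from running the controller in env for `horizon` steps."""
        q, obs, total = 0, env.reset(), 0.0
        for _ in range(horizon):
            a = self.action_of[q]
            obs, r = env.step(a)
            total += r
            q = self.next_node[(q, obs)]
        return total


class AlternatingEnv:
    """Hypothetical toy domain with aliased observations: the agent sees the
    same observation 'o' every step, but reward requires alternating actions
    a, b, a, b, ...  A memoryless policy scores at most half the reward;
    a 2-node controller can track the phase in its internal state."""

    def reset(self):
        self.expect = "a"
        return "o"

    def step(self, action):
        r = 1.0 if action == self.expect else 0.0
        self.expect = "b" if self.expect == "a" else "a"
        return "o", r


def greedy_search(ctrl, env, horizon, iters=1000, seed=1):
    """Greedy local search: propose one change to the controller's tables and
    keep it iff the rollout value does not decrease (ties are accepted so the
    search can drift across plateaus)."""
    rng = random.Random(seed)
    best = ctrl.rollout(env, horizon)
    for _ in range(iters):
        if rng.random() < 0.5:  # mutate an action entry
            q = rng.randrange(ctrl.n_nodes)
            old, ctrl.action_of[q] = ctrl.action_of[q], rng.choice(ctrl.actions)
            val = ctrl.rollout(env, horizon)
            if val >= best:
                best = val
            else:
                ctrl.action_of[q] = old  # revert
        else:  # mutate a node-transition entry
            key = rng.choice(list(ctrl.next_node))
            old, ctrl.next_node[key] = ctrl.next_node[key], rng.randrange(ctrl.n_nodes)
            val = ctrl.rollout(env, horizon)
            if val >= best:
                best = val
            else:
                ctrl.next_node[key] = old  # revert
    return best


if __name__ == "__main__":
    env = AlternatingEnv()
    ctrl = FSMController(n_nodes=2, actions=["a", "b"], observations=["o"])
    print(greedy_search(ctrl, env, horizon=20))
```

Over a horizon of 20 steps, any memoryless policy in this domain earns at most 10.0, while a 2-node controller that alternates its internal node can in principle collect the full 20.0; the search typically recovers such a controller, which is the memory advantage the abstract claims for FSM policies.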
