
Action Selection for MDPs: Anytime AO* Versus UCT


Abstract

In the presence of non-admissible heuristics, A* and other best-first algorithms can be converted into anytime optimal algorithms over OR graphs by simply continuing the search after the first solution is found. The same trick, however, does not work for best-first algorithms over AND/OR graphs, which must be able to expand leaf nodes of the explicit graph that are not necessarily part of the best partial solution. Anytime optimal variants of AO* must thus address an exploration-exploitation tradeoff: they cannot just "exploit", they must keep exploring as well. In this work, we develop one such variant of AO* and apply it to finite-horizon MDPs. This Anytime AO* algorithm eventually delivers an optimal policy while using non-admissible random heuristics that can be sampled, as when the heuristic is the cost of a base policy that can be sampled with rollouts. We then test Anytime AO* for action selection over large infinite-horizon MDPs that cannot be solved with existing off-line heuristic search and dynamic programming algorithms, and compare it with UCT.
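
As a rough illustration of two ideas named in the abstract, the Python sketch below shows (1) a heuristic obtained by sampling the cost of a base policy with a rollout, and (2) a tip-selection rule that sometimes expands a node outside the best partial solution. The toy chain MDP, the names `rollout_cost`, `select_tip`, `step`, and `base_policy`, and the parameter `explore_prob` are all illustrative assumptions and are not taken from the paper; the abstract does not say how exploration is scheduled, so the fixed-probability rule here is only one plausible choice.

```python
import random


def step(state, action, rng):
    """Hypothetical toy chain MDP: state 0 is an absorbing, zero-cost goal.

    Each step taken from a non-goal state costs 1. The chosen move
    succeeds with probability 0.8; otherwise the state is unchanged.
    """
    if state == 0:
        return 0, 0.0
    if rng.random() < 0.8:
        state = max(0, state + action)
    return state, 1.0


def base_policy(state):
    """Hypothetical base policy: always try to move toward the goal."""
    return -1


def rollout_cost(state, policy, step_fn, horizon, rng):
    """Sample one rollout of `policy` for `horizon` steps and return its cost.

    The result is a random heuristic value: it estimates the cost of the
    base policy and is not guaranteed to be admissible.
    """
    total = 0.0
    for _ in range(horizon):
        action = policy(state)
        state, cost = step_fn(state, action, rng)
        total += cost
    return total


def select_tip(best_tips, other_tips, explore_prob, rng):
    """Choose the next tip node to expand.

    best_tips:  tips inside the current best partial solution (exploit).
    other_tips: tips of the explicit graph outside it (explore).
    With probability `explore_prob` (an assumed parameter) a tip outside
    the best partial solution is chosen, if one exists.
    """
    if not best_tips and not other_tips:
        return None  # nothing left to expand
    if other_tips and (not best_tips or rng.random() < explore_prob):
        return rng.choice(other_tips)
    return rng.choice(best_tips)


rng = random.Random(0)
h = rollout_cost(3, base_policy, step, 10, rng)
tip = select_tip(["in_1", "in_2"], ["out_1"], 0.5, rng)
```

Setting `explore_prob` to 0 in this sketch gives pure exploitation: only tips of the best partial solution are ever expanded, which is the behaviour the abstract says is insufficient for an anytime optimal variant.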
