Temporal-Difference Search in Computer Go



Abstract

Temporal-difference (TD) learning is one of the most successful and broadly applied solutions to the reinforcement learning problem; it has been used to achieve master-level play in chess, checkers and backgammon. Monte-Carlo tree search is a recent algorithm for simulation-based search, which has been used to achieve master-level play in Go. We have introduced a new approach to high-performance planning (Silver, Sutton, and Müller 2012). Our method, TD search, combines TD learning with simulation-based search. Like Monte-Carlo tree search, it updates value estimates by learning online from simulated experience. Like TD learning, it uses value function approximation and bootstrapping to generalise efficiently between related states. We applied TD search to the game of 9 × 9 Go, using a million binary features matching simple patterns of stones. Without any explicit search tree, our approach outperformed a vanilla Monte-Carlo tree search with the same number of simulations. When combined with a simple alpha-beta search, our program also outperformed all traditional (pre-Monte-Carlo) search and machine learning programs on the 9 × 9 Computer Go Server.
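The core learning rule the abstract describes, TD updates over a linear value function of binary features, can be sketched as follows. This is an illustrative toy, not the paper's Go implementation: the environment (a small random walk), the feature count, and the step size are all assumptions chosen so the example runs in a few lines; the paper instead uses roughly a million binary pattern features over 9 × 9 Go positions.

```python
import random

# Sketch of TD(0) with linear value-function approximation over
# binary features, learning online from simulated experience.
# N_FEATURES, ALPHA, and the random-walk environment are toy
# assumptions, not the paper's actual setup.

N_FEATURES = 8   # one feature per state here; the paper uses ~1M pattern features
ALPHA = 0.1      # step size
GAMMA = 1.0      # undiscounted episodic task

weights = [0.0] * N_FEATURES

def features(state):
    """One-hot binary features: a stand-in for Go stone patterns."""
    phi = [0.0] * N_FEATURES
    phi[state] = 1.0
    return phi

def value(state):
    """Linear value estimate: dot product of weights and features."""
    return sum(w * x for w, x in zip(weights, features(state)))

def td0_update(state, reward, next_state, terminal):
    """TD(0): move V(s) toward the bootstrapped target r + gamma * V(s')."""
    target = reward + (0.0 if terminal else GAMMA * value(next_state))
    delta = target - value(state)  # TD error
    phi = features(state)
    for i in range(N_FEATURES):
        weights[i] += ALPHA * delta * phi[i]

# Learn from simulated episodes of a random walk over states 0..7,
# starting at 3; reaching state 7 yields reward 1, state 0 yields 0.
random.seed(0)
for episode in range(2000):
    s = 3
    while True:
        s2 = s + random.choice([-1, 1])
        terminal = s2 in (0, N_FEATURES - 1)
        r = 1.0 if s2 == N_FEATURES - 1 else 0.0
        td0_update(s, r, s2, terminal)
        if terminal:
            break
        s = s2
```

After training, `value(s)` approximates the probability of reaching the right end from state `s` (the true values are `s / 7` for this walk). The key property the abstract relies on is that the update touches only the weights of active features, so learning in one state generalises to every related state sharing those features.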
