JMLR: Workshop and Conference Proceedings

Active Exploration in Markov Decision Processes


Abstract

We introduce the active exploration problem in Markov decision processes (MDPs). Each state of the MDP is characterized by a random value, and the learner should gather samples to estimate the mean value of each state as accurately as possible. As in active exploration for multi-armed bandits (MAB), states may have different levels of noise, so the higher the noise, the more samples are needed. Since the noise level is initially unknown, we need to trade off exploring the environment to estimate the noise against exploiting these estimates to compute a policy that maximizes the accuracy of the mean predictions. We introduce a novel learning algorithm for this problem, showing that active exploration in MDPs may be significantly more difficult than in MAB. We also derive a heuristic procedure to mitigate the negative effect of slowly mixing policies. Finally, we validate our findings on simple numerical simulations.
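The abstract does not spell out the paper's algorithm, but the bandit analogue it invokes is easy to demonstrate. Below is a minimal Python sketch (an illustrative assumption, not the authors' MDP method) of variance-adaptive sample allocation in a MAB: after estimating each arm's noise, every new sample goes to the arm whose mean estimate currently has the largest estimated error, so noisier arms end up sampled more often.

```python
import numpy as np

def active_exploration_mab(means, stds, budget, seed=0):
    """Illustrative sketch of the MAB analogue from the abstract: noisier
    arms (states) need more samples. Each pull goes to the arm whose mean
    estimate is currently least accurate (largest estimated sigma^2 / n).
    This is a generic allocation heuristic, not the paper's algorithm."""
    rng = np.random.default_rng(seed)
    n_arms = len(means)
    counts = np.zeros(n_arms, dtype=int)
    sums = np.zeros(n_arms)
    sq_sums = np.zeros(n_arms)

    # Initialization: pull each arm twice so a variance estimate exists.
    for arm in range(n_arms):
        for _ in range(2):
            x = rng.normal(means[arm], stds[arm])
            counts[arm] += 1
            sums[arm] += x
            sq_sums[arm] += x ** 2

    for _ in range(budget - 2 * n_arms):
        mu_hat = sums / counts
        var_hat = np.maximum(sq_sums / counts - mu_hat ** 2, 1e-12)
        # Estimated squared error of each mean estimate is sigma^2 / n;
        # sample the arm whose estimate is currently the worst.
        arm = int(np.argmax(var_hat / counts))
        x = rng.normal(means[arm], stds[arm])
        counts[arm] += 1
        sums[arm] += x
        sq_sums[arm] += x ** 2

    return sums / counts, counts

if __name__ == "__main__":
    mu_hat, counts = active_exploration_mab(
        means=[0.0, 1.0, 2.0], stds=[0.1, 1.0, 3.0], budget=300)
    print("estimated means:", mu_hat)
    print("samples per arm:", counts)  # noisier arms get more samples
```

The arms here play the role of the MDP's states; the extra difficulty in the MDP setting, per the abstract, is that the learner can only reach a state through the chain's transition dynamics, so slowly mixing policies delay the intended allocation.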

