We have presented a simple, principled technique for computing a valid Dec-POMDP policy to serve as an initial policy for reinforcement learning. Furthermore, we have demonstrated how to learn such policies in a model-free manner, and we have shown on two benchmark problems that these initial policies can improve the outcome of alternating Q-learning. These results are encouraging and suggest that such policies may also be useful in other algorithms, both existing and future, that require an initial policy.