A method for active learning is introduced which uses Gittins multi-armed bandit allocation indices to select actions that optimally trade off exploration and exploitation so as to maximize expected payoff. We apply the Gittins method to continuous action spaces by using the C4.5 algorithm to learn a mapping from state (or perception of state) and action to the success or failure of the action when taken in that state. The leaves of the resulting tree form a finite set of alternatives over the continuous space of actions. The action selected is drawn from the leaf that, among those consistent with the perceived state, has the highest Gittins index. We illustrate the technique with a simulated robot learning task for grasping objects, in which each grasping trial can be lengthy and it is desirable to reduce unnecessary experiments. In the grasping simulation, the Gittins index approach shows a statistically significant performance improvement over the Interval Estimation action selection heuristic, with little increase in computational cost. The method also has the advantage of providing a principled way to choose the exploration parameter based on the expected number of repetitions of the task.
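As a rough illustration only, and not the paper's implementation, the sketch below approximates the Gittins index of a Bernoulli arm with a Beta posterior via the retirement formulation (bisection on a retirement reward, with a finite-horizon dynamic program for the continuation value), then selects among a handful of hypothetical tree leaves. The discount factor, horizon, leaf names, and per-leaf success/failure counts are all assumptions; the learned C4.5 tree is stood in for by a plain dictionary of leaf statistics.

```python
from functools import lru_cache

def gittins_index(successes, failures, beta=0.9, horizon=60, tol=1e-4):
    """Approximate Gittins index of a Bernoulli arm with a
    Beta(successes, failures) posterior (generic approximation,
    not the paper's procedure).

    The index is the largest retirement reward lam at which continuing
    to sample the arm is still optimal; it is located by bisection,
    with the continuation value computed by dynamic programming
    truncated after `horizon` further pulls.
    """
    def continuing_beats_retiring(lam):
        retire = lam / (1.0 - beta)  # discounted value of retiring at reward lam

        @lru_cache(maxsize=None)
        def value(s, f):
            # s, f = additional successes/failures observed since the root.
            if s + f >= horizon:      # truncation: forced retirement
                return retire
            p = (successes + s) / (successes + s + failures + f)
            cont = (p * (1.0 + beta * value(s + 1, f))
                    + (1.0 - p) * beta * value(s, f + 1))
            return max(retire, cont)

        return value(0, 0) > retire + 1e-12

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if continuing_beats_retiring(mid):
            lo = mid   # index exceeds mid: continuing is still worthwhile
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical leaf statistics: (successes, failures) of grasp trials
# pooled in each leaf of a learned tree; names are purely illustrative.
leaves = {"leaf_a": (3, 1), "leaf_b": (1, 1), "leaf_c": (1, 4)}

# Select the leaf with the highest index under a Beta(1+s, 1+f) posterior.
best = max(leaves,
           key=lambda k: gittins_index(1 + leaves[k][0], 1 + leaves[k][1]))
```

Note how the index exceeds the posterior mean for poorly explored leaves, which is exactly the exploration bonus the abstract alludes to; unlike a hand-tuned Interval Estimation bound, the bonus here falls out of the discount factor, which can in turn be tied to the expected number of task repetitions.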