
Discussion

The examples we have shown demonstrate the ability of hidden-state Q-learning methods, using an instance-based utility representation, to learn where to look to discriminate target from distractor patterns. Conceptually, the system constructs an action-selection mechanism that operates by retrieving prior experiences whose recent action/observation history is similar to that of the current time point. Since the action/observation history captures both the state of the user and the state of the perception system (e.g., where the active camera is looking), the system builds predictive models that combine what the user will do next with where to look to confirm the relevant part of the hidden state.
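
To make the instance-based selection mechanism concrete, the following is a minimal sketch (not the implementation used in our experiments) of utility estimation by nearest-neighbor matching over recent action/observation histories. The match length K, neighborhood size NEIGHBORS, and the Experience fields are illustrative assumptions.

    # Sketch: instance-based Q-value estimation over recent
    # action/observation histories (assumed parameters, not the
    # system's exact data structures).
    from dataclasses import dataclass
    from typing import List, Tuple

    K = 4          # how many recent (action, observation) pairs to compare
    NEIGHBORS = 8  # how many nearest stored experiences to average over

    @dataclass
    class Experience:
        history: List[Tuple[int, int]]  # (action, observation) pairs, oldest first
        action: int                     # action taken at this time point
        q: float                        # current utility estimate for that action

    def suffix_match(h1, h2, k=K):
        """Count how many of the last k (action, observation) pairs agree,
        scanning backwards from the most recent time point."""
        score = 0
        for a, b in zip(reversed(h1[-k:]), reversed(h2[-k:])):
            if a != b:
                break
            score += 1
        return score

    def estimate_q(memory: List[Experience], history, action, n=NEIGHBORS):
        """Average the stored utilities of the n experiences whose recent
        histories best match the current one and that took `action`."""
        candidates = [e for e in memory if e.action == action]
        if not candidates:
            return 0.0
        candidates.sort(key=lambda e: suffix_match(e.history, history), reverse=True)
        best = candidates[:n]
        return sum(e.q for e in best) / len(best)

    def select_action(memory, history, actions):
        """Greedy action selection from the instance-based utility estimates."""
        return max(actions, key=lambda a: estimate_q(memory, history, a))

In this sketch the stored experiences implicitly encode the hidden state: two time points with the same recent history of foveations and observations are treated as equivalent for the purpose of predicting utility.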

Because our system is based on reinforcement learning, it can learn from delayed rewards. This is important, since reward is delivered only upon performance of the recognition action, which occurs at the end of a sequence of foveation actions. The key problem faced by the system is learning correct utility values for the foveation actions taken in the states before any reward is generated; were the learning system to model only the expected instantaneous reward, it could never learn foveation behavior.
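
The propagation of the delayed recognition reward to earlier foveation steps follows the standard one-step Q-learning backup. The sketch below is illustrative: the learning rate ALPHA, discount GAMMA, and the table-based representation are assumptions for exposition, whereas the system described above stores utilities per experience instance rather than in a table.

    # Sketch: one-step Q-learning backup that carries the delayed
    # recognition reward back through earlier foveation steps
    # (assumed parameters and tabular storage, for illustration only).
    ALPHA = 0.1   # learning rate
    GAMMA = 0.9   # discount factor

    def q_backup(q_table, state, action, reward, next_state, next_actions):
        """Move Q(state, action) toward reward + GAMMA * max_a' Q(next_state, a').
        Reward is zero for foveation steps and nonzero only when the final
        recognition action is performed, so utility reaches earlier steps
        only through the bootstrapped max over successor actions."""
        best_next = max((q_table.get((next_state, a), 0.0) for a in next_actions),
                        default=0.0)
        old = q_table.get((state, action), 0.0)
        q_table[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)
        return q_table[(state, action)]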

The major limitation of our system as presented lies in its use of unstructured, random search, which leads to a large amount of initial training time. We are exploring ways to structure the search so that new experience is focused toward likely goal locations in the utility landscape. We also note that it would be straightforward to allow a teacher to give examples of correct performance and to use them as starting points for learning (or to improve learning in progress); this possibility remains a topic for future work.

The use of a reinforcement/reward paradigm offers considerable flexibility; beyond active pattern recognition, one can envision a range of interaction regimes to which this learning framework applies. Overt actions in an interface could be included in the POMDP action set, with reward conditioned directly on high-level task performance. Much as general-purpose optimization frameworks have proven powerful for scene description and structure recovery, we believe that reinforcement protocols hold promise for modeling a wide range of performance in interactive systems. Our results demonstrate initial progress on this path.

