A natural approach to developing control architectures for complex tasks is to design multiple simple components, each addressing one aspect of the task domain, and provide a framework for combining the results. When an agent must learn these components by interacting with the environment, as in a Markov decision problem, the designer must take care not to sacrifice an optimal solution for a convenient representation. We propose an architecture that extends the well-known Q-learning framework. It permits each component of a controller to compute a value function corresponding to a particular reward signal in the environment, with a supervisor executing the policy that optimizes the average value over all components. Each component can thus learn an estimate of its true value function in advance and improve it while the supervisor executes the policy implied by its agents; in certain circumstances, this leads to an optimal policy. Our research applies this approach to complex motor learning problems, which require agents to balance a variety of constraints. We also plan to adapt this method to current techniques that allow researchers to specify partial programs solving Markov decision problems.
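
To make the architecture concrete, the following is a minimal sketch, not the implementation used in this work. It assumes a tabular setting with hashable states and a small discrete action set, epsilon-greedy exploration, and an on-policy (Sarsa-style) update in which each component bootstraps from the action the supervisor actually executes next. None of these choices are specified in the text above, and the names `Component`, `Supervisor`, `average_value`, and `greedy_action` are hypothetical.

```python
import random
from collections import defaultdict


class Component:
    """One component: a tabular value estimate for a single reward signal."""

    def __init__(self, alpha=0.1, gamma=0.9):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.alpha = alpha           # learning rate (assumed)
        self.gamma = gamma           # discount factor (assumed)

    def update(self, s, a, r, s_next, a_next):
        # Assumption: on-policy update toward this component's own reward r,
        # bootstrapping from the action a_next chosen by the supervisor.
        target = r + self.gamma * self.q[(s_next, a_next)]
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])


class Supervisor:
    """Executes the policy that maximizes the average component value."""

    def __init__(self, components, actions, epsilon=0.1, seed=0):
        self.components = list(components)
        self.actions = list(actions)
        self.epsilon = epsilon       # exploration rate (assumed)
        self.rng = random.Random(seed)

    def average_value(self, s, a):
        total = sum(c.q[(s, a)] for c in self.components)
        return total / len(self.components)

    def greedy_action(self, s):
        return max(self.actions, key=lambda a: self.average_value(s, a))

    def act(self, s):
        # Epsilon-greedy exploration around the averaged values.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.greedy_action(s)

    def learn(self, s, a, rewards, s_next, a_next):
        # rewards[i] is the reward signal observed by component i.
        for component, r in zip(self.components, rewards):
            component.update(s, a, r, s_next, a_next)
```

In use, the supervisor would call `act` to choose an action, observe one reward per component, choose the next action with `act`, and then call `learn` with both actions so that every component improves its estimate under the policy actually being executed.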