| [1] |
Justin A. Boyan and Michael L. Littman.
Packet routing in dynamically changing networks: A reinforcement
learning approach.
In Advances in Neural Information Processing Systems 6, pages
671-678, San Francisco CA, 1994. Morgan Kaufmann. [ bib | .ps ]
This is the original paper that started this line of research. Cost is defined as the time a packet spends in the local queue plus its transmission time. Routing estimates are learned with basic Q-learning style updates. The authors mention that function approximation with a neural network was tried but did not help. No effort is made to solve the cycle or count-to-infinity problem. |
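The Q-learning style update described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the table layout and the names `make_table`, `q_routing_update`, `queue_delay`, `transmit_delay` and `alpha` are hypothetical.

```python
from collections import defaultdict

def make_table():
    # Q[node][(dest, neighbor)] = estimated delivery time via that neighbor.
    # Unseen entries default to 0.0 (an assumption made for this sketch).
    return defaultdict(lambda: defaultdict(float))

def q_routing_update(Q, node, dest, neighbor, neighbor_links,
                     queue_delay, transmit_delay, alpha=0.5):
    """One update at `node` after forwarding a packet for `dest` to `neighbor`.

    `neighbor_links` lists the neighbor's own outgoing neighbors (non-empty).
    """
    if neighbor == dest:
        remaining = 0.0  # the packet has arrived, no further delay
    else:
        # The neighbor reports its best remaining-time estimate.
        remaining = min(Q[neighbor][(dest, z)] for z in neighbor_links)
    # Cost is queueing delay plus transmission time, as in the annotation.
    target = queue_delay + transmit_delay + remaining
    old = Q[node][(dest, neighbor)]
    Q[node][(dest, neighbor)] = old + alpha * (target - old)
    return Q[node][(dest, neighbor)]
```

Each router only needs its own table and one number reported back by the chosen neighbor, which is what makes the scheme local.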
| [2] |
P. Marbach, O. Mihatsch, M. Schulte, and J. N. Tsitsiklis.
Reinforcement learning for call admission control and routing in
integrated service networks.
In Advances in Neural Information Processing Systems 11, 1999. [ bib | .ps ]
The problem of call admission and routing in a telecommunication network is addressed. There is a fixed number of service types, each generating requests for a certain amount of bandwidth between certain nodes in the network. The links in the network have fixed bandwidth. The problem is to decide whether to admit a given request (call admission) and, if admitted, how to route it between the given nodes subject to capacity constraints on the links. The problem is modelled as a discounted reward problem and solved using a distributed version of the TD(0) algorithm. The reward is distributed over links. The global value function is a sum of local value functions whose parameters are disjoint. The local state features at each link are the number of ongoing calls of each service type on that link. This decomposition allows training and decision making to be decentralized. |
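The decomposition described above can be illustrated with a small sketch. It assumes linear local value functions, a discrete-time TD(0) step, and a per-link share of the reward; these choices and the names `local_value`, `global_value` and `td0_link_update` are assumptions for illustration and may differ from the paper's formulation.

```python
def local_value(weights, counts):
    # Linear local value: one weight per service type, multiplied by the
    # number of ongoing calls of that type on this link.
    return sum(w * c for w, c in zip(weights, counts))

def global_value(all_weights, all_counts):
    # Global value is the sum of the local values; parameters are disjoint,
    # so each link owns its own weight vector.
    return sum(local_value(all_weights[l], all_counts[l]) for l in all_weights)

def td0_link_update(weights, counts, next_counts, reward, gamma=0.9, lr=0.1):
    """TD(0) step for a single link, using only that link's features and reward."""
    delta = (reward
             + gamma * local_value(weights, next_counts)
             - local_value(weights, counts))
    # For a linear value function the gradient w.r.t. weights is the feature vector.
    return [w + lr * delta * c for w, c in zip(weights, counts)]
```

Because each update touches only one link's weights and features, every link can train independently of the others.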
| [3] |
Timothy X. Brown.
Low power wireless communication via reinforcement learning.
In Advances in Neural Information Processing Systems 12, pages
893-899. MIT Press, 2000. [ bib | .ps ]
This paper deals with the case of a single mobile, power-constrained device (such as a laptop) transmitting data to a fixed, power-unconstrained base station. The reward criterion is application based, and the goal is to maximize expected reward while also maximizing battery life. Battery life is modelled by a discount factor that depends on the probability of ending up in the battery-exhausted state. The state of the system is decomposed into the state of the channel (a Gilbert-Elliott model is proposed, but the simulations assume error-free transmission), the state of the data-generating application (ON/OFF), the state of the mobile radio (ON/OFF/TX [transmitting]), and the state of the mobile and base station packet queues. The learning algorithm is Q-learning with state aggregation based on a reduced set of features (since the entire state is not observable at the mobile device). Use of POMDPs is suggested but not implemented. |
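Q-learning with state aggregation, as mentioned above, can be sketched as below. The feature choice in `aggregate`, the state dictionary keys, and the constant `gamma` are all assumptions for illustration; the paper's discount depends on the battery-exhaustion probability, which this sketch does not model.

```python
from collections import defaultdict

def aggregate(state):
    # Keep only features assumed observable at the mobile device and
    # bucket the mobile queue length; the base station queue is dropped.
    return (state["radio"], state["app_on"], min(state["mobile_queue"], 3))

def q_update(Q, state, action, reward, next_state, actions,
             gamma=0.95, alpha=0.1):
    """Standard Q-learning step applied to aggregated states."""
    s, s2 = aggregate(state), aggregate(next_state)
    best_next = max(Q[(s2, a)] for a in actions)
    Q[(s, action)] += alpha * (reward + gamma * best_next - Q[(s, action)])
    return Q[(s, action)]
```

Here `Q` is expected to be a `defaultdict(float)`, so all full states that map to the same aggregate share a single table entry.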
| [4] |
Nigel Tao, Jonathan Baxter, and Lex Weaver.
A multi-agent, policy gradient approach to network routing.
In Proceedings of the Eighteenth International Conference on
Machine Learning, 2001. [ bib | .ps ]
Each router is viewed as a single independent agent that makes its decisions according to a locally parameterized stochastic policy. Reward is the negative of the total trip time. The aim is to maximize the expected long-term average reward. Local policy gradient updates involve the final destination of the packet, the neighbor it was routed to, the local parameters and the sum of the (delayed) reward signal received. To penalize cycles, an ad hoc mechanism is used that stores the last two nodes visited in the packet. If a cycle is detected, a large negative reward is given. |
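A local policy gradient step of the kind described above might look like the following. This is a simplified REINFORCE-style sketch with a softmax policy, not the algorithm used in the paper; the parameter layout `theta[dest][neighbor]` and the function names are hypothetical.

```python
import math

def softmax(prefs):
    # Numerically stable softmax over a dict of neighbor -> preference.
    m = max(prefs.values())
    exps = {n: math.exp(v - m) for n, v in prefs.items()}
    total = sum(exps.values())
    return {n: e / total for n, e in exps.items()}

def policy_gradient_step(theta, dest, chosen, reward, lr=0.1):
    """Update one router's parameters after routing a packet for `dest`
    to neighbor `chosen` and later receiving `reward` (negative trip time)."""
    probs = softmax(theta[dest])
    for n in theta[dest]:
        # Gradient of log pi(chosen) w.r.t. theta[dest][n] for a softmax policy.
        grad = (1.0 if n == chosen else 0.0) - probs[n]
        theta[dest][n] += lr * reward * grad
    return theta
```

A large negative reward, such as the cycle penalty, lowers the probability of the neighbor that was chosen.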
| [5] |
Leonid Peshkin and Virginia Savova.
Reinforcement learning for adaptive routing.
In Proceedings of the International Joint Conference on Neural
Networks, 2002. [ bib | .ps ]
Each router's policy is stochastic and locally parametrized. A "temperature" parameter is included so that a deterministic policy is obtained in the limit. Reward is distributed once per epoch and depends on the average routing time (packets carry their time of origin). A discounted reward criterion is used, and parameters are updated by gradient ascent on the local value function. Random initialization gave poor results, so the policy is initialized with shortest-path routing and an epsilon-greedy policy is followed during training. The temperature and learning rate are held constant at this stage. |
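The role of the temperature parameter can be seen in a small Boltzmann policy sketch. The function name `boltzmann` and the preference values are assumptions; the paper's exact parametrization may differ.

```python
import math

def boltzmann(prefs, temperature):
    """Action probabilities from preferences; temperature must be positive."""
    scaled = [p / temperature for p in prefs]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

With a high temperature the probabilities are close to uniform, and as the temperature approaches zero nearly all probability mass moves to the highest-preference action, which gives the deterministic policy in the limit.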
| [6] |
Ying Zhang and Markus P. J. Fromherz.
Search-based adaptive routing strategies for sensor networks.
In AAAI-04 Workshop on Sensor Networks, 2004. [ bib | .pdf ]
The only paper so far that explicitly addresses routing in sensor networks using RL. The asymmetric, probabilistic nature of links is explicitly recognized (the neighborhood relation is not symmetric), and the need for power-aware routing is expressed. The algorithm runs in three phases: initialization, forwarding and confirmation. The forwarding phase is essentially a policy improvement stage and performs Q-learning updates. Some convergence results are proved for the static network case. Simulations are done using Prowler. |