- Title: Rainbow: Combining Improvements in Deep Reinforcement Learning
- Authors: Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, David Silver
- Link: https://arxiv.org/abs/1710.02298
- Tags: Neural Network, reinforcement
- Year: 2017
-
What
- They combine several previous improvements to DQN-based deep reinforcement learning into one algorithm ("Rainbow").
- The combination beats the previous methods by a good margin.
- They analyze via ablations which of the used improvements has the most influence on the result.
-
How
- They use the following improvements (minimal code sketches for each are given after this list):
- Double Q-learning
- Uses two networks during training: an online network and a target network.
- The online network selects the best next action, the target network evaluates it; this decoupling reduces the overestimation of Q-values.
- The target network is usually a copy of the online network with frozen parameters that are only updated periodically.
- Prioritized Replay
- Samples experiences from the replay memory with probability proportional to their TD error (difference between predicted and target Q-value), so surprising transitions are replayed more often.
- Newly added experiences get the maximum priority, so recent experiences tend to be replayed soon after being collected.
- Dueling Networks
- Splits the Q-value prediction into a value stream (how good the state is, independent of the action) and an advantage stream (how much better each action is than the others); the two streams are summed, minus the mean advantage, to form the Q-values.
- Multi-Step Learning
- Replaces the one-step target
  reward + gamma * Q(next state, next action)
  with the n-step target
  reward + gamma * reward(step 2) + ... + gamma^(n-1) * reward(step n) + gamma^n * Q(n-th next state, n-th next action).
- I.e. per training example, the rewards of the next n steps are used directly and the Q-function is only applied after the n-th step.
- Distributional RL
- Instead of regressing a single expected Q-value per action, the network predicts a full distribution over returns, trained like a classification problem.
- The range of possible returns is partitioned into N bins (atoms) and the network outputs a probability for each bin; this avoids collapsing multimodal return distributions to a single mean value, a common weakness of plain regression.
- Noisy Nets
- Replaces the epsilon-greedy exploration strategy and its hand-tuned exploration factor.
- Instead, fully connected layers get random noise added to their weights, and the scale of that noise is learned by the network.
- As the network becomes more confident in its Q-value predictions, it can learn to scale the noise down, so exploration decreases automatically.
- They combine all of the mentioned methods.
- For Prioritized Replay, transitions are prioritized by the KL loss of the distributional predictions instead of the absolute TD error (to account for Distributional RL).
- Training
- They start training after 80k frames.
- They use Adam.
- When using epsilon-greedy instead of Noisy Nets, they anneal epsilon to 0.01 over the first 250k frames (a linear schedule is sketched below).
- For multi-step learning they use n=3 steps.
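
A minimal sketch of the double Q-learning target, assuming PyTorch and hypothetical `online_net` / `target_net` modules that map a batch of states to per-action Q-values:

```python
import torch

def double_q_target(online_net, target_net, reward, next_state, done, gamma=0.99):
    """Double Q-learning target: the online net picks the action, the target net evaluates it."""
    with torch.no_grad():
        # The online network selects the greedy next action ...
        next_action = online_net(next_state).argmax(dim=1, keepdim=True)
        # ... but the frozen target network evaluates that action.
        next_q = target_net(next_state).gather(1, next_action).squeeze(1)
        # No bootstrapping on terminal transitions.
        return reward + gamma * next_q * (1.0 - done.float())
```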
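
A minimal sketch of proportional prioritized replay, assuming NumPy; the class and argument names are made up, and a real implementation uses a sum-tree instead of renormalizing the full priority array:

```python
import numpy as np

class PrioritizedReplay:
    """Toy proportional prioritized replay buffer."""

    def __init__(self, capacity, alpha=0.5, beta=0.4):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta
        self.buffer, self.priorities = [], []

    def add(self, transition):
        # New transitions get the current maximum priority so they are seen at least once.
        max_p = max(self.priorities, default=1.0)
        self.buffer.append(transition)
        self.priorities.append(max_p)
        if len(self.buffer) > self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)

    def sample(self, batch_size):
        # Sampling probability is proportional to priority^alpha.
        p = np.asarray(self.priorities) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=p)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        w = (len(self.buffer) * p[idx]) ** (-self.beta)
        w /= w.max()
        return idx, [self.buffer[i] for i in idx], w

    def update_priorities(self, idx, errors, eps=1e-6):
        for i, e in zip(idx, errors):
            self.priorities[i] = abs(float(e)) + eps
```

In Rainbow itself the priorities come from the KL loss of the distributional predictions rather than the absolute TD error, as noted above.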
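
A minimal sketch of a dueling head, assuming PyTorch; the hidden layer size of 256 is an arbitrary choice for illustration:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling architecture: separate value and advantage streams, combined into Q-values."""

    def __init__(self, n_features, n_actions):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(), nn.Linear(256, 1))
        self.advantage = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, x):
        v = self.value(x)      # (batch, 1): how good the state is
        a = self.advantage(x)  # (batch, n_actions): how good each action is relative to the others
        # Subtracting the mean advantage makes the value/advantage decomposition identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```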
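
A minimal sketch of the n-step target in plain Python; the function name and arguments are made up for illustration:

```python
def n_step_target(rewards, bootstrap_q, gamma=0.99, n=3):
    """Discounted sum of the next n rewards plus the discounted Q-value after n steps.

    rewards:     the n immediate rewards r_t, ..., r_{t+n-1}
    bootstrap_q: Q(state after n steps, action after n steps), 0 if the episode ended earlier
    """
    assert len(rewards) == n
    target = sum(gamma ** k * r for k, r in enumerate(rewards))
    return target + gamma ** n * bootstrap_q

# The rewards of the next 3 steps are used directly, the Q-function only after the 3rd step.
print(n_step_target([1.0, 0.0, 1.0], bootstrap_q=5.0))  # 1 + 0.99**2 + 0.99**3 * 5
```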
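
A minimal sketch of a C51-style distributional output head, assuming PyTorch; the atom count and value range follow the usual C51 defaults (51 atoms on [-10, 10]), and the projection of the target distribution is omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0
support = torch.linspace(V_MIN, V_MAX, N_ATOMS)  # the discretized return values (bins)

class DistributionalHead(nn.Module):
    def __init__(self, n_features, n_actions):
        super().__init__()
        self.n_actions = n_actions
        self.fc = nn.Linear(n_features, n_actions * N_ATOMS)

    def forward(self, x):
        logits = self.fc(x).view(-1, self.n_actions, N_ATOMS)
        probs = F.softmax(logits, dim=2)          # one return distribution per action
        q_values = (probs * support).sum(dim=2)   # expected return, used for action selection
        return probs, q_values

# Training minimizes the cross-entropy between the predicted distribution and a
# target distribution projected onto the same support (projection not shown here).
```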
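
A minimal sketch of a noisy linear layer, assuming PyTorch; this is the simpler independent-Gaussian-noise variant rather than the factorised noise used in the paper:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyLinear(nn.Module):
    """Linear layer whose weights get additive Gaussian noise; the noise scale (sigma) is learned."""

    def __init__(self, in_features, out_features, sigma0=0.017):
        super().__init__()
        bound = 1 / math.sqrt(in_features)
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma0))
        self.mu_b = nn.Parameter(torch.zeros(out_features))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma0))

    def forward(self, x):
        # Fresh noise each forward pass; if sigma is learned towards 0, exploration fades out.
        eps_w = torch.randn_like(self.sigma_w)
        eps_b = torch.randn_like(self.sigma_b)
        return F.linear(x, self.mu_w + self.sigma_w * eps_w, self.mu_b + self.sigma_b * eps_b)
```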
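
A minimal sketch of the linear epsilon schedule for the epsilon-greedy comparison, in plain Python; the starting value of 1.0 is an assumption:

```python
def epsilon_at(frame, start=1.0, end=0.01, anneal_frames=250_000):
    """Linearly anneal epsilon from `start` to `end` over the first `anneal_frames` frames."""
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)

print(epsilon_at(0))        # 1.0
print(epsilon_at(125_000))  # 0.505
print(epsilon_at(500_000))  # 0.01
```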
-
Results
- They evaluate Rainbow on 57 Atari games.
- Rainbow beats each of the other methods used on its own, both in learning speed and in final performance.
- It performs far better than the classic DQN approach.
- Average performance: (figure)
- Ablation results: (figure)
- Learning curves by game: (figure)