Turing Talk

1:

Thank you for that introduction and thanks for the invitation to speak today. It's really a pleasure to have the chance to tell you a bit about our work on multi-agent reinforcement learning.

2:

Presumably I don't need to explain the importance of multi-agent systems for this audience. This talk focuses entirely on cooperative multi-agent systems, and in particular on how to do cooperative multi-agent reinforcement learning. The cooperative setting is all about solving coordination problems: how to get a team of agents to work together to achieve a common goal. This setting is important because a large number of real-world multi-agent systems are cooperative, like a team of robots in a warehouse, or contain subgroups that can be usefully modelled as such, like a fleet of self-driving cars sharing public roads with human drivers.

Before we can get into some of the methods we've developed, we have to get clear on the exact problem setting. This is a common source of confusion because, once you depart from the simplicity of the single-agent setting, there's an explosion of possible settings, each with different assumptions about the agents, what they can observe, what actions they can take, and how they are rewarded. At one point, I got so frustrated with people misunderstanding our setting that I asked Jakob Foerster, who at the time was a PhD student in my lab: can you just make a single slide that clearly encapsulates the problem setting we're working on?

3:

Well, this is what he came up with. You can judge for yourself whether he succeeded. The setting is of course multi-agent and cooperative, as I've discussed.

It's also partially observable, which means each agent has its own private and partial view of the global state. As we'll see, that turns out to be crucial, because it's partial observability that makes the setting truly multi-agent.

In addition, we require that the policies learned by the agents can be executed in a decentralised fashion, which means each agent conditions only on its history of private observations, not on those of other agents.

However, we assume that learning takes place in a simulator or other safe setting like a laboratory, such that the learning process can be centralised. The agents can share parameters, observations, gradients, whatever they want. There are no rules, as long as the resulting policies are amenable to decentralised execution.

4:

To formalise this setting, let's start with the basics, the single-agent MDP. I'll assume you're familiar with it but just note that in our notation the action is denoted with u, as a will later be used to refer to the agent.

We have transition and reward functions, the return is a discounted sum of rewards, and value functions represent conditional expected returns.
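
(For reference, one standard way to write this, not necessarily the exact notation on the slide:)

\[ s_{t+1} \sim P(\cdot \mid s_t, u_t), \qquad R_t = \sum_{k=0}^{\infty} \gamma^k r(s_{t+k}, u_{t+k}), \qquad Q^{\pi}(s, u) = \mathbb{E}_{\pi}\left[ R_t \mid s_t = s,\, u_t = u \right] \]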

5:

Now, the simplest way to make this setting multi-agent is to just add a separate action space for each agent. So every agent sees the global state, but can select its own action. The a in the superscript here indicates which agent is taking the action.

The transition and reward functions are the same as before but now condition on the joint action, which is the vector of action choices of each agent.
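
(Roughly, and not necessarily in the slide's notation, the only change is that the action becomes a joint action:)

\[ \mathbf{u} = (u^1, \ldots, u^n), \qquad s_{t+1} \sim P(\cdot \mid s_t, \mathbf{u}_t), \qquad r(s_t, \mathbf{u}_t) \]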

As you may have already noticed, there's nothing fundamentally multi-agent about the multi-agent MDP. In fact, we can think of it as just a single-agent MDP with a factored action space. In other words, taking an action means specifying a whole vector, as if a single puppeteer agent were controlling all the robots or whatever from above.

6:

That's why partial observability is crucial to making the setting truly multi-agent, and this can be modelled with the decentralised partially observable Markov decision process, or Dec-POMDP.

In addition to the elements already introduced, we have an observation function that conditions not only on the global state but also on the agent, such that each agent has in general a different private and partial view of the world.

Due to partial observability, the agents will generally want to condition on their entire action-observation history tau, and learning aims to produce a set of decentralised policies in which each agent doesn't condition on anything besides its private history.
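
(In symbols, roughly and not necessarily in the slide's exact notation:)

\[ o^a_t = O(s_t, a), \qquad \tau^a_t = (o^a_1, u^a_1, \ldots, o^a_t), \qquad u^a_t \sim \pi^a(u^a \mid \tau^a_t) \]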

This constraint can be motivated in two ways. There's natural decentralisation, in which real-world communication or sensory constraints require decentralisation. But there's also artificial decentralisation, in which no such constraints exist inherently but we as the designers artificially impose them in order to make learning more tractable, for example by forcing each agent to consider only a local field of view.

And of course, as I mentioned, we're performing centralised learning of decentralised policies. So we can do whatever we like during training as long as the result is a set of policies that obey this decentralisation constraint.

I've become a bit of an evangelist for this setting because a core belief of mine is that making progress on hard problems is often about making the right assumptions. We're looking for assumptions that give us a lot of leverage on the problem but which still mostly or approximately hold in the real world. And this assumption meets those criteria. We don't deploy robots tabula rasa: we train them in a simulator or laboratory first, and centralising that process is a powerful tool in learning coordinated behaviour among cooperative agents.

7:

The simplest approach we can take algorithmically is called independent learning. This was first proposed as independent Q-learning back in 1993, and the idea is that each agent simply learns independently with its own Q-function that conditions on its private observation history and individual action. Nothing is centralised, there is no attempt to learn a joint value function that conditions on the joint action, and each agent essentially treats the other agents as if they were part of the environment.

Of course we can do the same thing with an actor-critic approach, where each agent has its own actor and critic.

If we have centralised training, then an obvious improvement is to share parameters across agents during learning, which can speed learning and improve generalisation. The agents can still behave differently because they receive different inputs, and those inputs can even include an agent index, so the agents can behave arbitrarily heterogeneously. It's natural to ask whether such learning should still be called independent, but it is still independent in the important sense that the value functions condition only on private observations and individual actions, with no joint value functions.
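
(To make the parameter-sharing idea concrete, here is a minimal PyTorch-style sketch, not our actual implementation; the network sizes, tensor shapes, and names are illustrative, and in practice the agent network would be recurrent so it can condition on the whole observation history. The key point is that the shared Q-network conditions only on an agent's private observation plus its agent index, and each agent bootstraps off its own max, with no joint value function anywhere.)

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedQNet(nn.Module):
    # One Q-network shared by all agents; a one-hot agent index is appended
    # to the private observation so agents can still behave differently.
    def __init__(self, obs_dim, n_actions, n_agents, hidden=64):
        super().__init__()
        self.n_agents = n_agents
        self.net = nn.Sequential(
            nn.Linear(obs_dim + n_agents, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs, agent_id):
        one_hot = F.one_hot(agent_id, self.n_agents).float()
        return self.net(torch.cat([obs, one_hot], dim=-1))

def iql_loss(qnet, target_qnet, batch, gamma=0.99):
    # Batch tensors have shape [batch, n_agents, ...]; rew and done are the
    # shared team signals, broadcast per agent.
    obs, act, rew, next_obs, done, agent_id = batch
    q_taken = qnet(obs, agent_id).gather(-1, act.unsqueeze(-1)).squeeze(-1)
    with torch.no_grad():
        # Each agent bootstraps off the max of its OWN Q-values only.
        next_max = target_qnet(next_obs, agent_id).max(dim=-1).values
        target = rew + gamma * (1.0 - done) * next_max
    return F.mse_loss(q_taken, target)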

Obviously this is a naive approach. A key limitation is that because each agent treats other agents as part of its environment, if those agents are also learning, then the environment from that agent's perspective becomes nonstationary and convergence guarantees go out the window. In addition, because there's no attempt to learn a joint value function, the synergistic value of coordination is not represented, making it hard to learn coordinated behaviour.

8:

One way to do better is to take an actor-critic approach and centralise the critic. So the critic conditions on the global state, the joint history, and maybe even the joint action. Centralising the critic makes sense because it's only needed during training. Once you deploy, you can discard the critic and just use the policies to act. Anything that you discard before deployment is a great candidate for centralisation.

This also makes explicit the motivation for an actor-critic approach. Actor-critic methods are appealing anytime you have what I call a hard greedification problem, that is, when finding the greedy action wrt the value function is nontrivial. The classic example is continuous action spaces, and this is the typical setting in which actor-critic methods are used. But here we have another hard greedification problem, because we have a centralised value function from which we need to derive decentralised policies. Actor-critic methods can do that by having each actor incrementally update its policy by following a policy gradient estimated from the same centralised critic, as shown in the figure.
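
(Schematically, each agent a follows a gradient of this form; this is the standard centralised-critic policy gradient, written here for reference rather than copied from the slide:)

\[ \nabla_{\theta^a} J \approx \mathbb{E}\left[ \nabla_{\theta^a} \log \pi^a(u^a \mid \tau^a)\, Q(s, \mathbf{u}) \right] \]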

9:

However, learning a centralised value function over a complex action space can be challenging. A crucial idea to address this is to learn factored value functions instead. Factored value functions have a long history in reinforcement learning and an even longer one in decision-theoretic planning. The idea is to represent the value function as a sum of local value functions, each of which depends on the observations and actions of only a subset of the agents.

This can be modelled in a coordination graph, which is just a factor graph where the factors are the local value functions and the variables are the agents. Just like a probabilistic graphical model, a coordination graph captures conditional independence properties. Here, for example, if agent 1 knows the action of agent 2, it can select its own action without caring what agent 3 does.

Such a factorisation reduces the number of parameters to be learned, thereby speeding learning and improving generalisation. It also makes it tractable to maximise over the joint action space, since your favourite message-passing algorithms for performing MAP inference in probabilistic graphical models can be reused to efficiently find the maximising joint action.
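
(In symbols, the factorisation looks roughly like this, where each factor e touches only the subset of agents in the joint action \(\mathbf{u}^e\):)

\[ Q(s, \mathbf{u}) \approx \sum_{e} Q_e\left(s, \mathbf{u}^{e}\right) \]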

10:

DeepMind was the first to deep learn-ify this idea in an approach called value decomposition networks. VDN uses the most extreme form of factorisation, with one factor per agent, yielding this disconnected factor graph.
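
(That is, the joint value is just a sum of per-agent utilities:)

\[ Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{a=1}^{n} Q_a(\tau^a, u^a) \]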

11:

While this is obviously a highly restrictive factorisation, it has an important side effect of enabling a total decentralisation of the max and argmax. Because each factor involves only one agent, we can compute the max over joint actions by just performing a max over each agent's individual action space separately and then summing them.

Similarly, we can compute the global argmax by performing a separate argmax for each agent and then compiling the resulting actions into a vector.
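
(Concretely:)

\[ \max_{\mathbf{u}} Q_{tot}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{a} \max_{u^a} Q_a(\tau^a, u^a), \qquad \arg\max_{\mathbf{u}} Q_{tot} = \Big( \arg\max_{u^1} Q_1(\tau^1, u^1), \ldots, \arg\max_{u^n} Q_n(\tau^n, u^n) \Big) \]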

What this means is that with a VDN factorisation, we no longer have a hard greedification problem. Finding the greedy action wrt a value function is easy again, so we are no longer compelled to take an actor-critic approach. Instead, we can just use Q-learning.

Here we have a DQN loss function where the Q-function is centralised. Thanks to the VDN factorisation, this maximisation can be performed efficiently and, on deployment, action selection trivially decentralises since it requires only the decentralised argmax we just discussed.
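
(The loss on the slide has roughly this shape; this is my reconstruction rather than the slide's exact notation, with \(\theta^{-}\) denoting the target-network parameters:)

\[ \mathcal{L}(\theta) = \mathbb{E}\left[ \Big( r + \gamma \max_{\mathbf{u}'} Q_{tot}(\boldsymbol{\tau}', \mathbf{u}'; \theta^{-}) - Q_{tot}(\boldsymbol{\tau}, \mathbf{u}; \theta) \Big)^{2} \right] \]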

12:

Which brings us at last to QMIX, a method we developed a few years ago that has proved quite useful in practice in this setting. The main idea was to try to preserve this handy property of decentralising the argmax, while loosening the restrictive representation imposed by VDN's extreme factorisation. We can do that by leveraging the simple observation that to preserve the decentralisability of the argmax, it suffices to enforce the condition that for all agents, the partial derivative of the joint value function with respect to that agent's individual value function is nonnegative.
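
(That is, the monotonicity constraint:)

\[ \frac{\partial Q_{tot}}{\partial Q_a} \geq 0 \quad \forall a \]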

Now, one potential source of confusion here is that because we're considering discrete action spaces, these individual Q-values are only defined for a set of discrete points, so what does it even mean to take the derivative wrt them?

This figure illustrates what's really happening. When we compute the centralised value from the individual values, we do so using a mixing function, which we can think of as taking real-valued continuous inputs. So this colour gradient here shows a mixing function that obeys the monotonicity constraint I just discussed.

It's this mixing function whose partial derivatives must be nonnegative in order to obey the monotonicity constraint. In VDN the mixing function is just a summation, but the point is that any monotonic function will do. Of course, in practice, the mixing function is only ever supplied with a discrete set of inputs, corresponding to the individual Q-values for each agent's actions, which, thanks to the monotonicity constraint, can be individually maximised over.

13:

Now, when the students first pitched this idea to me, I was pretty skeptical. In fact, I was convinced it would never work. My reasoning was that if you preserve the decentralisability of the argmax, then you are still saddled with the key limitation of VDN: that it can't represent the benefit of coordination. By definition, if each agent can select its action in a vacuum then there can't be any benefit to coordinating with other agents.

This point is illustrated in the normal-form games shown here. On the left, we have a game whose value function is both linear and monotonic, so both VDN and QMIX can represent it. In the middle, we have a game that is nonlinear but still monotonic, so QMIX can represent it but VDN cannot. And on the right, we have a game that is both nonlinear and nonmonotonic, so neither VDN nor QMIX can represent it. The point is that only the game on the right involves coordination, because only there does one agent's choice depend on the other agent's choice.

The question then becomes: should we care about the games in the middle, because it's these games that QMIX can represent better than VDN? My claim was: not really, because even if VDN couldn't represent the middle game exactly, it could approximate it with the value function from the left game, and when we perform greedy action selection we'd get exactly the same result as with QMIX.

14:

The students however had an insight that I had overlooked. Their point was that the value function is not just used for action selection, but also for bootstrapping. The loss is a mean squared error between the Q-value and a target, and that target is computed by bootstrapping off the Q-value at the next state, as shown in red.

So even if a monotonic mixing function doesn't select different actions than a linear mixing function in a given state, it can better estimate the value of that state, which results in less bootstrapping error and better action selection in earlier states.

15:

This two-step game illustrates this point. In the first step, the red agent's action is irrelevant and the blue agent's action determines whether in the second step they go to state 2A or 2B. In 2A, the payoff is 7 regardless of their actions, while in 2B the payoff is 8 but only if they both select the right action.

16:

Let's see what happens when we apply VDN and QMIX to this game. VDN can accurately represent the value of 2A but not 2B. In 2B, it correctly identifies the best action, but underestimates its value. Crucially, these errors in 2B propagate back via bootstrapping to result in errors in the value function for the first step, leading the blue agent to suboptimally choose to transition to 2A.

By contrast, QMIX can represent both 2A and 2B correctly, which via bootstrapping leads to lower error in the value function for the first step, and so the blue agent optimally chooses to transition to 2B.

17:

So how do we actually enforce the monotonicity constraint? QMIX does so using three networks: an agent network, a mixing network, and a hypernetwork. The middle part of the figure shows the basic setup: the agent networks, which share parameters, produce the individual Q-values. These are then fed into the mixing network, which is constrained to have nonnegative weights to ensure monotonicity, and produces the joint Q-value.

On the right is a closer look at the agent network, which is just a conventional deep network with feedforward and recurrent layers.

On the left is a closer look at the mixing network, whose weights are not learned directly but instead specified as the output of a separate hypernetwork that conditions on the global state, so that the mixing of the individual Q-values can also depend on that state. This is allowed because we only use the mixing network in the training phase. The reason for the hypernetwork is to allow the value function to condition more flexibly on this global state. Without the hypernetwork, the relationship between the state and the value would have to be monotonic because of the nonnegative weights. With a hypernetwork, QMIX can in principle specify an arbitrarily different mixing function for every state.
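
(Here is a minimal PyTorch-style sketch of what such a mixing network can look like. It follows the structure just described but is not our exact implementation; the hidden sizes and names are illustrative, and the agent networks are omitted. The key detail is the absolute value applied to the hypernetwork outputs, which keeps the mixing weights nonnegative and hence the mixing monotonic in each agent's Q-value; the biases are left unconstrained.)

import torch
import torch.nn as nn
import torch.nn.functional as F

class Mixer(nn.Module):
    # Mixes per-agent Q-values into Q_tot; its weights are produced by
    # hypernetworks that condition on the global state.
    def __init__(self, n_agents, state_dim, embed_dim=32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim), nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: [batch, n_agents], state: [batch, state_dim]
        bs = agent_qs.size(0)
        qs = agent_qs.view(bs, 1, self.n_agents)
        # abs() makes the mixing weights nonnegative, enforcing monotonicity.
        w1 = torch.abs(self.hyper_w1(state)).view(bs, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(bs, 1, self.embed_dim)
        hidden = F.elu(torch.bmm(qs, w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(bs, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(bs, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2
        return q_tot.view(bs, 1)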

In the execution phase, we discard the mixing network and each agent selects actions greedily wrt its individual Q-values, which, thanks to the monotonicity constraint, is guaranteed to maximise the joint Q-function.

18:

Now, in these plots we see the max over the estimated Q-values on nine random matrix games, for both QMIX and VDN, compared to the true max shown with the dashed line. These plots show essentially that the students were right: QMIX is consistently better than VDN at approximating the max over Q-values, and this is the quantity that in a sequential setting would be used for bootstrapping.

19:

Of course, that's just a sanity check. For the proper evaluation of QMIX, we use the StarCraft Multi-Agent Challenge, or SMAC, a suite of cooperative multi-agent RL benchmark tasks we created based on the popular real-time strategy game StarCraft 2.

As we know from supervised learning and from single-agent reinforcement learning, good benchmarks are crucial for driving progress in the field. That's why we created SMAC and open-sourced it, along with PyMARL, our software engineering framework that includes implementations of our and other key cooperative MARL algorithms and makes it easy to extend these algorithms and build new ones.

In StarCraft, human players compete against each other or against the game AI to gather resources, build buildings and armies, and defeat opponents. You've probably heard about AlphaStar, DeepMind's highly successful StarCraft-playing agent. While AlphaStar also uses StarCraft 2, its setting is only superficially similar to SMAC. AlphaStar considers the full game, but it does so with a centralised policy, a single puppeteer agent that directs all the units, as in a multi-agent MDP; it also involves fully competitive aspects, as it uses self-play techniques to train against a suite of evolving opponents.

In SMAC, we want to benchmark Dec-POMDPs, so we focus on micromanagement, which is the fine-grained control of individual units, and we create a fully cooperative setting by fixing the opponent's policy to that of the game AI. Most importantly, there is no puppeteer; instead, each unit is controlled by a separate agent.

20:

As we know, to be truly multi-agent requires partial observability, which SMAC introduces by limiting the sight range of each agent, as shown in the figure.

21:

SMAC consists of a number of different maps, shown here. We have symmetric maps, where both teams have the same type and number of agents. Then we have maps where both teams have the same type of agents but the opponent has more of them. And finally we have asymmetric maps, where the two teams have different types of agents.

22:

The original QMIX paper just reported results on a few maps, but now that we have SMAC, we can evaluate across all 14 maps. I don't want to bore you with dozens of plots, so I'll just show you this summary plot, which shows the number of maps out of 14 for which each method has the best-performing policy at each point during training.

And again, the students were right: QMIX's richer mixing function really pays off. This hump in the middle indicates that QMIX tends to learn faster than the other methods, and while those other methods eventually catch up on a few maps, the right part of the figure shows that QMIX often learns substantially better final policies.

Note that to have the best policy on a map, we require it to be epsilon better than the alternatives, so even when QMIX is not winning on a map, that doesn't imply that it's losing. Typically it just means that it's tying, and this is usually because there are some very difficult maps on which none of the methods make much progress.

The alternative methods include both independent Q-learning and VDN, as well as COMA, an actor-critic method that we also developed. COMA uses the idea of a centralised critic I mentioned before and also has some other innovations, like a clever baseline to reduce variance in the policy gradient estimates that arises from what we call the multi-agent credit assignment problem. It's more sophisticated algorithmically than QMIX; in fact, the COMA paper actually did win a best paper award at AAAI. But when you start doing careful benchmarking across many maps, it's clear that QMIX, despite its simplicity, is much stronger empirically.

We've tried out a number of ideas to further improve performance on StarCraft, as have researchers in several other labs, but with mixed success, as QMIX often proves surprisingly hard to beat, retaining competitive performance on a number of maps.

23:

Let's take a closer look at the factors that contribute to QMIX's performance. Here are the results of some ablation experiments where we compare QMIX and VDN to QMIX-NS, in which the mixing network does not condition on the global state, and VDN-S, in which VDN includes a state-dependent bias in its linear mixing.

The plots show the median test-win percentage across independent runs for each method throughout training on three different maps.

These results show that conditioning on the state is an important factor in performance, at least on some maps, and that doing so with a state-dependent bias is not as good as the more flexible approach in QMIX involving a hypernetwork.

24:

And here we have another ablation experiment where we compare QMIX to QMIX-Lin, where the mixing function is restricted to be one linear layer. Again we're plotting median test-win percentage. As expected, the original QMIX with nonlinear mixing performs noticeably better.

25:

OK, but here's where things get weird. If we actually look at the learned mixing functions, they look very close to being linear. We've done this analysis on a number of maps, but I'm showing here just an example on the 2 colossi vs 64 zerglings map because it's obviously easier to visualise when there are only two agents. The left shows the mixing function for the initial state and the right shows it for the state at time step 50.

26:

So what is going on here? To figure it out, we created yet another ablation called QMIX-2Lin, which like QMIX has two layers in its mixing function, but like QMIX-Lin has only linear layers.

Now, your machine learning textbook will tell you that putting linear layers on top of each other won't increase representational capacity, because a linear combination of linear functions is still just a linear function. However, what your textbook won't tell you, but is probably obvious to any deep learning practitioner, is that adding such layers can greatly affect the learning dynamics, often favourably.

That's what's illustrated in this plot. This is not actually an RL experiment; it's just a regression task of predicting fixed Q-values, so the y-axis is just a mean-squared error.

You can see that QMIX-2Lin learns much faster than QMIX-Lin even though they both have linear mixing, and it even matches the performance of QMIX with nonlinear mixing.

27:

And sure enough, this result holds up in SMAC when we actually do RL. The performance of QMIX and QMIX-2Lin is quite similar, and substantially better than that of QMIX-Lin, when we again consider median test-win percentage.

28:

We can try to encourage QMIX to learn nonlinear mixing functions by changing the activation function from ELU to Tanh, and indeed we do see more nonlinearity in the learned mixing functions, as shown at the top here. However, as the bottom plots show, the effect on performance is modest at best.

29:

So the takeaways from these experiments are as follows. Value factorisation is highly effective in these tasks. Flexibly conditioning on the state, not just using a state-dependent bias, is also important. And it's important to richly parameterise the mixing function, as VDN, or even QMIX with a single linear layer, is not sufficient.

However, as much as we might wish it was otherwise, nonlinear mixing does not seem to be important, at least not in SMAC.

OK, I'm going to turn it over to Tabish now, who's going to talk about how we can make QMIX even better.