[Docs] Fix multi-agent tutorial #1599

Merged (5 commits, Oct 4, 2023)
1 change: 0 additions & 1 deletion examples/multiagent/sac.py
@@ -258,7 +258,6 @@ def train(cfg: "DictConfig"): # noqa: F821
loss_vals["loss_actor"]
+ loss_vals["loss_alpha"]
+ loss_vals["loss_qvalue"]
+ loss_vals["loss_alpha"]
)

loss_value.backward()
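For context (not part of the diff): the deleted line was a duplicated `loss_alpha` term, so the SAC example was counting that loss component twice. A minimal sketch of the corrected aggregation, assuming the surrounding `loss_value = (...)` assignment implied by the `loss_value.backward()` call shown above:

    # Sketch: each SAC loss component is summed exactly once before backprop.
    loss_value = (
        loss_vals["loss_actor"]
        + loss_vals["loss_alpha"]
        + loss_vals["loss_qvalue"]
    )
    loss_value.backward()
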
50 changes: 22 additions & 28 deletions tutorials/sphinx-tutorials/multiagent_ppo.py
@@ -253,12 +253,11 @@
#
#

print("action_spec:", env.action_spec)
print("reward_spec:", env.reward_spec)
print("done_spec:", env.done_spec)
print("action_spec:", env.full_action_spec)
print("reward_spec:", env.full_reward_spec)
print("done_spec:", env.full_done_spec)
print("observation_spec:", env.observation_spec)


######################################################################
# Using the commands just shown we can access the domain of each value.
# Doing this we can see that all specs apart from done have a leading shape ``(num_vmas_envs, n_agents)``.
@@ -270,35 +269,20 @@
# In fact, specs that have the additional agent dimension
# (i.e., they vary for each agent) will be contained in an inner "agents" key.
#
- # To access the full structure of the specs we can use
- #
-
- print("full_action_spec:", env.input_spec["full_action_spec"])
- print("full_reward_spec:", env.output_spec["full_reward_spec"])
- print("full_done_spec:", env.output_spec["full_done_spec"])
-
- ######################################################################
# As you can see the reward and action spec present the "agent" key,
# meaning that entries in tensordicts belonging to those specs will be nested in an "agents" tensordict,
# grouping all per-agent values.
#
- # To quickly access the key for each of these values in tensordicts, we can simply ask the environment for the
- # respective key, and
+ # To quickly access the keys for each of these values in tensordicts, we can simply ask the environment for the
+ # respective keys, and
# we will immediately understand which are per-agent and which shared.
# This info will be useful in order to tell all other TorchRL components where to find each value
#

print("action_key:", env.action_key)
print("reward_key:", env.reward_key)
print("done_key:", env.done_key)

######################################################################
# To tie it all together, we can see that passing these keys to the full specs gives us the leaf domains
#
print("action_keys:", env.action_keys)
print("reward_keys:", env.reward_keys)
print("done_keys:", env.done_keys)

- assert env.action_spec == env.input_spec["full_action_spec"][env.action_key]
- assert env.reward_spec == env.output_spec["full_reward_spec"][env.reward_key]
- assert env.done_spec == env.output_spec["full_done_spec"][env.done_key]

######################################################################
# Transforms
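As a side note on the snippet above (a sketch, not part of the diff; it only uses the `env.full_*_spec` attributes and `env.*_keys` lists printed there): indexing the full specs with these keys recovers the leaf domains, which is what the removed asserts were checking one entry at a time.

    # Per-agent entries come back under ("agents", ...) keys, shared ones
    # (such as "done") live at the root of the composite spec.
    for key in env.action_keys:
        print(key, env.full_action_spec[key])
    for key in env.reward_keys:
        print(key, env.full_reward_spec[key])
    for key in env.done_keys:
        print(key, env.full_done_spec[key])
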
@@ -615,6 +599,9 @@
action=env.action_key,
sample_log_prob=("agents", "sample_log_prob"),
value=("agents", "state_value"),
+ # These last 2 keys will be expanded to match the reward shape
+ done=("agents", "done"),
+ terminated=("agents", "terminated"),
)


@@ -649,11 +636,18 @@
episode_reward_mean_list = []
for tensordict_data in collector:
tensordict_data.set(
("next", "done"),
("next", "agents", "done"),
tensordict_data.get(("next", "done"))
.unsqueeze(-1)
- .expand(tensordict_data.get(("next", env.reward_key)).shape),
- ) # We need to expand the done to match the reward shape (this is expected by the value estimator)
+ .expand(tensordict_data.get_item_shape(("next", env.reward_key))),
+ )
+ tensordict_data.set(
+ ("next", "agents", "terminated"),
+ tensordict_data.get(("next", "terminated"))
+ .unsqueeze(-1)
+ .expand(tensordict_data.get_item_shape(("next", env.reward_key))),
+ )
+ # We need to expand the done and terminated to match the reward shape (this is expected by the value estimator)
Comment on lines 637 to +650
Contributor

what is the best way to approach this?
A common criticism of our tutorials is that they are too complex; I wonder what will be judged as the most complex thing here: (1) boilerplate code such as this, which IMO has little instructive value, or (2) an ad-hoc transform that may be a bit too custom to be part of the core lib.

Contributor Author @matteobettini, Oct 4, 2023

Yeah, I agree; what I can do is:

  1. leave as is
  2. make the transform in the tutorial
  3. make the transform in core
  4. We could also consider allowing rewards and dones with different shapes and expanding in the value functionals

What is your preference?

Contributor

  4. seems like a stretch; these classes are already very complex and hard to maintain

I think these operations are actually twofold: (1) rename without deleting and (2) unsqueeze + expand.
We have a transform that does rename, one that does unsqueeze. We just need one that does expand (or expand_as).

The quickest way forward is this:
Merge as is, open an issue with a feature request for the expand_as transform. Wdyt?
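
A rough sketch of what factoring that boilerplate out of the training loop could look like (illustrative only; the helper name is hypothetical and it reuses the tensordict calls already present in the diff above):

    def _expand_done_to_reward_shape(tensordict_data, env):
        # The value estimator expects done/terminated entries shaped like the
        # per-agent reward, so unsqueeze and expand them under ("next", "agents", ...).
        reward_shape = tensordict_data.get_item_shape(("next", env.reward_key))
        for key in ("done", "terminated"):
            tensordict_data.set(
                ("next", "agents", key),
                tensordict_data.get(("next", key)).unsqueeze(-1).expand(reward_shape),
            )

Wrapped as a Transform, the same rename + unsqueeze + expand composition discussed above could remove the boilerplate from the tutorial entirely.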


with torch.no_grad():
GAE(
@@ -688,7 +682,7 @@
collector.update_policy_weights_()

# Logging
done = tensordict_data.get(("next", "done"))
done = tensordict_data.get(("next", "agents", "done"))
episode_reward_mean = (
tensordict_data.get(("next", "agents", "episode_reward"))[done].mean().item()
)