Algorithmic Expansion to PPO #462
base: main
Conversation
- re-add buffer_size to rollout buffer and include warning if we exceed it - switch default values to config reading - started debugging policy update - adjusted dimensions of output
- Established fully decentralized Critic for PPO - Added debug statements - Changed activation layer of Actor Neural Network from softsign to sigmoid
- added mc as the mean of the distribution from which the action_logits (softmax, [-1, 1]) are subtracted
- reset buffer for proper on-policy learning
…would always be recalculated with the gradient steps, which is not in accordance with the Spinning Up implementation - also added normalisation of advantages in accordance with MAPPO - used obs collection for central critic from MATD3 to enable proper multi-agent learning - renamed epoch to gradient step
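A minimal sketch of that advantage normalisation (MAPPO-style, computed once before the gradient steps; variable names are illustrative, not the actual buffer attributes):

```python
import torch as th

def normalize_advantages(advantages: th.Tensor) -> th.Tensor:
    # Normalise once per policy update, outside the gradient-step loop,
    # so returns/advantages are not recomputed at every gradient step.
    return (advantages - advantages.mean()) / (advantages.std() + 1e-8)
```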
… do not define two distributions, one in get_actions and one in the policy update, as this is prone to mistakes
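One way to keep a single source of truth, sketched here under the assumption of a Gaussian policy with a learnable log standard deviation (not the exact code of this PR):

```python
import torch as th
from torch.distributions import Normal

def policy_distribution(mean: th.Tensor, log_std: th.Tensor) -> Normal:
    # Reused by both get_actions (sampling + log prob of the sampled action)
    # and the policy update (new log prob, entropy), so the two definitions
    # can never drift apart.
    return Normal(mean, log_std.exp())
```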
Current version looks promising: (a somewhat less restrained woohoo)
One open problem is the backpropagation of the loss when we use a gradient_step number > 1, because we then get tensor errors for some reason... Need to discuss that with @nick-harder
Pull Request
Related Issue
Closes #239
Description
This pull request introduces key improvements to the PPO implementation, specifically in buffering, action handling, and role-specific configurations.
Original paper: https://arxiv.org/abs/1707.06347
Based on implementation in https://github.com/marlbenchmark/on-policy/blob/main/onpolicy/algorithms/r_mappo/algorithm/r_actor_critic.py
PPO-Clip: https://stable-baselines.readthedocs.io/en/master/modules/ppo1.html
Paper that shows handling of non-stationarity: https://arxiv.org/abs/2207.05742
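For reference, the clipped surrogate objective from the PPO paper that the update step should optimise:

$$
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
$$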
Below is an overview of the changes by file:
Changes Proposed
Buffer.py
- add(): Automatically expands the buffer size when called, triggering expand_buffer().
- reset(): Fully resets the buffer.
- get(): Returns all necessary data for policy updates.
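A minimal single-agent sketch of the behaviour described above (array layout, warning text, and growth factor are assumptions, not the actual Buffer.py code):

```python
import logging
import numpy as np

logger = logging.getLogger(__name__)

class RolloutBufferSketch:
    def __init__(self, buffer_size: int, obs_dim: int, act_dim: int):
        self.buffer_size = buffer_size
        self.pos = 0
        self.observations = np.zeros((buffer_size, obs_dim), dtype=np.float32)
        self.actions = np.zeros((buffer_size, act_dim), dtype=np.float32)
        self.log_probs = np.zeros((buffer_size, 1), dtype=np.float32)
        self.rewards = np.zeros((buffer_size, 1), dtype=np.float32)

    def expand_buffer(self, extra: int) -> None:
        # Grow every array and warn, because exceeding the configured size
        # usually means buffer_size and the simulation horizon do not match.
        logger.warning("Rollout buffer size %d exceeded, expanding by %d", self.buffer_size, extra)
        for name in ("observations", "actions", "log_probs", "rewards"):
            arr = getattr(self, name)
            pad = np.zeros((extra, arr.shape[1]), dtype=arr.dtype)
            setattr(self, name, np.concatenate([arr, pad]))
        self.buffer_size += extra

    def add(self, obs, action, log_prob, reward) -> None:
        if self.pos >= self.buffer_size:
            self.expand_buffer(extra=self.buffer_size)  # double on overflow
        self.observations[self.pos] = obs
        self.actions[self.pos] = action
        self.log_probs[self.pos] = log_prob
        self.rewards[self.pos] = reward
        self.pos += 1

    def reset(self) -> None:
        # On-policy: drop everything after each policy update.
        self.pos = 0

    def get(self):
        # Return only the filled part, as needed for the policy update.
        return (
            self.observations[: self.pos],
            self.actions[: self.pos],
            self.log_probs[: self.pos],
            self.rewards[: self.pos],
        )
```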
learning_utils.py
- Functions that calculate the required buffer sizes are currently commented out for reference.
Algorithm-specific Files: matd3.py and ppo.py
- get_actions() method: defined on LearningStrategy within base.py and assigned accordingly. For MATD3 it returns action & noise; for PPO it returns action & log probability.
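An illustrative contrast of the two return conventions (free functions with simplified signatures; in the code they are methods assigned to the strategy):

```python
import torch as th
from torch.distributions import Normal

def get_actions_matd3(actor, obs: th.Tensor, noise_std: float = 0.1):
    # MATD3: deterministic actor output plus exploration noise.
    action = actor(obs)
    noise = th.randn_like(action) * noise_std
    return (action + noise).clamp(-1, 1), noise

def get_actions_ppo(actor, log_std: th.Tensor, obs: th.Tensor):
    # PPO: sample from the policy distribution and keep the log probability,
    # which update_policy later needs for the importance ratio.
    dist = Normal(actor(obs), log_std.exp())
    action = dist.sample()
    return action, dist.log_prob(action).sum(dim=-1)
```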
learning_strategies.py
- calculate_bids(): handles the second return value of get_actions as extra_info, interpreted based on its shape, and stores it in unit.outputs.
World.py
- add_rl_units_operator(): checks whether train_freq exceeds the end - start time set in the config.
Loader_csv.py
- run_learning(): creates a ReplayBuffer or RolloutBuffer based on world.learning_role.rl_algorithm_name. noise_scale is added solely for MATD3. terminate conditions are applied. The buffer is handed over via inter_episodic_data.
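A sketch of the described selection logic (buffer classes injected to keep the snippet self-contained; argument names are assumptions):

```python
def select_buffer(rl_algorithm_name: str, rollout_buffer_cls, replay_buffer_cls, **buffer_kwargs):
    # PPO is on-policy and gets the RolloutBuffer; MATD3 keeps the off-policy
    # ReplayBuffer (and is the only algorithm that also needs a noise_scale).
    if rl_algorithm_name == "ppo":
        return rollout_buffer_cls(**buffer_kwargs)
    return replay_buffer_cls(**buffer_kwargs)
```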
Learning_unit_operator.py
- Adjusted write_learning_to_output and write_to_learning_role to be compatible with PPO.
ppo.py
- The update_policy() method remains incomplete and may require further development or revision.
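Since update_policy() is still open, here is a hedged sketch of a plain single-agent PPO-Clip update it would roughly have to perform (the actor/critic interfaces, the shared optimizer, actor.log_std, and the coefficient names are assumptions, not the PR's code):

```python
import torch as th
import torch.nn.functional as F
from torch.distributions import Normal

def ppo_update(actor, critic, optimizer, obs, actions, old_log_probs, returns,
               clip_ratio: float = 0.2, gradient_steps: int = 10,
               vf_coef: float = 0.5, ent_coef: float = 0.01):
    # obs, actions, old_log_probs, returns come from the rollout buffer and
    # must be detached, i.e. carry no graph from the collection phase.
    with th.no_grad():
        advantages = returns - critic(obs).squeeze(-1)
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    for _ in range(gradient_steps):
        dist = Normal(actor(obs), actor.log_std.exp())
        new_log_probs = dist.log_prob(actions).sum(dim=-1)
        ratio = th.exp(new_log_probs - old_log_probs)

        # Clipped surrogate objective (PPO-Clip).
        surr1 = ratio * advantages
        surr2 = th.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
        policy_loss = -th.min(surr1, surr2).mean()

        value_loss = F.mse_loss(critic(obs).squeeze(-1), returns)
        entropy = dist.entropy().sum(dim=-1).mean()

        loss = policy_loss + vf_coef * value_loss - ent_coef * entropy
        optimizer.zero_grad()
        loss.backward()  # fresh graph each iteration
        optimizer.step()
```

Because the distribution, ratios, and losses are rebuilt inside the loop and all stored rollout tensors are detached, each backward() call sees a fresh graph, which is the usual way to avoid the tensor errors with gradient_steps > 1 mentioned above.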
Discussion Points
PPO Mini-Batches: Mini-batch sampling could be implemented analogously to sample() from ReplayBuffer. RolloutBuffer currently stores observations, actions, and log probabilities before calling update_policy. This data could then be processed further by the critic (see the sampling sketch below).
The critic is not called in get_actions, to reduce dependencies. Instead, the critic is used only within update_policy, though other implementations handle this differently. This setup allows get_actions to be used independently of the critic.
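Regarding mini-batches, a purely illustrative helper analogous to sample() on ReplayBuffer; the caller would slice the arrays returned by RolloutBuffer.get() with these index chunks:

```python
import numpy as np

def minibatch_indices(n_samples: int, batch_size: int):
    # Shuffle once per pass, then yield index chunks for slicing the
    # observation/action/log-prob arrays of the rollout buffer.
    indices = np.random.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield indices[start:start + batch_size]
```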
Learning_role.py: Once all required parameters for PPO are identified, they need to be added during the initialization in learning_role.py. Currently, however, the default values are defined in several places; we tried to make that more consistent.
Training Frequency: Works with hourly intervals shorter than one episode, but stability could improve with a setting that allows exceeding the episode length, mainly when we simulate only very short episodes to prevent overfitting. Learning_role resets after each episode, limiting the ability to extend intervals.
Loader_csv.py – Run_learning() Additional Parameters: Add any additional parameters to the config and subsequent stages as required.
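To illustrate "switch default values to config reading" and where PPO defaults could live, a hypothetical helper (all parameter names and default values here are invented for illustration, not the actual config keys):

```python
def read_ppo_params(learning_config: dict) -> dict:
    # Hypothetical defaults, read once during initialization instead of
    # being scattered across several modules.
    return {
        "gamma": learning_config.get("gamma", 0.99),
        "clip_ratio": learning_config.get("clip_ratio", 0.2),
        "gradient_steps": learning_config.get("gradient_steps", 10),
        "entropy_coef": learning_config.get("entropy_coef", 0.01),
        "value_coef": learning_config.get("value_coef", 0.5),
    }
```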
Address complex bid RL: learning_strategies.py has been updated; the corresponding changes for learning_advanced_orders.py are still open.
Config Adjustments: The configuration has only been updated in example_02a_tiny, with a shortened time period for testing. This should be adjusted in the main branch once the implementation is validated, and other configurations will need updating to be compatible with PPO.
Testing
[Describe the testing you've done, including any specific test cases or scenarios]
Checklist
Please check all applicable items:
- Documentation is updated where necessary (including doc folder updates)
- New dependencies are added to pyproject.toml
- A note in doc/release_notes.rst of the upcoming release is included
Additional Notes (if applicable)
[Any additional information, concerns, or areas you want reviewers to focus on]
Screenshots (if applicable)
[Add screenshots to demonstrate visual changes]