
Algorithmic Expansion to PPO #462

Draft · wants to merge 22 commits into main
Conversation

@kim-mskw (Contributor) commented on Oct 28, 2024

Pull Request

Related Issue

Closes #239

Description

This pull request introduces key improvements to the PPO implementation, specifically in buffering, action handling, and role-specific configurations.
Original paper: https://arxiv.org/abs/1707.06347
Based on the implementation in https://github.com/marlbenchmark/on-policy/blob/main/onpolicy/algorithms/r_mappo/algorithm/r_actor_critic.py
PPO-Clip: https://stable-baselines.readthedocs.io/en/master/modules/ppo1.html
Paper on handling non-stationarity: https://arxiv.org/abs/2207.05742

Below is an overview of the changes by file:

Changes Proposed

  • Buffer.py

    • Introduced a dynamic buffer for PPO so the buffer size no longer has to be derived from each configuration (see the sketch below).
      • add(): Expands the buffer via expand_buffer() once the current capacity is reached.
      • A warning is issued when the expected memory demand exceeds the available space.
      • reset(): Fully resets the buffer.
      • get(): Returns all data needed for a policy update.
      • Note: To revert to a static buffer, pre-configured methods in learning_utils.py calculate the required buffer sizes; they are currently commented out for reference.
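A minimal sketch of such a dynamically growing on-policy buffer, assuming NumPy storage and a capacity-doubling strategy; the class and method names mirror the description above, while the initial size, stored fields, and memory threshold are illustrative assumptions:

```python
import warnings

import numpy as np


class RolloutBuffer:
    """Dynamically growing on-policy buffer (illustrative sketch)."""

    def __init__(self, obs_dim: int, act_dim: int, initial_size: int = 1024):
        self.capacity = initial_size
        self.pos = 0
        self.observations = np.zeros((initial_size, obs_dim), dtype=np.float32)
        self.actions = np.zeros((initial_size, act_dim), dtype=np.float32)
        self.log_probs = np.zeros(initial_size, dtype=np.float32)
        self.rewards = np.zeros(initial_size, dtype=np.float32)

    def expand_buffer(self, extra: int) -> None:
        """Grow all arrays by `extra` rows and warn if the allocation gets large."""
        def _grow(arr: np.ndarray) -> np.ndarray:
            return np.concatenate([arr, np.zeros((extra, *arr.shape[1:]), dtype=arr.dtype)])

        new_bytes = (self.capacity + extra) * 4 * (
            self.observations.shape[1] + self.actions.shape[1] + 2
        )
        if new_bytes > 1e9:  # illustrative threshold of ~1 GB
            warnings.warn("Rollout buffer is expected to exceed ~1 GB of memory.")

        self.observations = _grow(self.observations)
        self.actions = _grow(self.actions)
        self.log_probs = _grow(self.log_probs)
        self.rewards = _grow(self.rewards)
        self.capacity += extra

    def add(self, obs, action, log_prob, reward) -> None:
        if self.pos >= self.capacity:
            self.expand_buffer(self.capacity)  # double the capacity when full
        self.observations[self.pos] = obs
        self.actions[self.pos] = action
        self.log_probs[self.pos] = log_prob
        self.rewards[self.pos] = reward
        self.pos += 1

    def reset(self) -> None:
        self.pos = 0  # on-policy data is discarded after every update

    def get(self):
        valid = slice(0, self.pos)
        return (
            self.observations[valid],
            self.actions[valid],
            self.log_probs[valid],
            self.rewards[valid],
        )
```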
  • Algorithm-specific Files: matd3.py and ppo.py

    • Implemented get_actions() method:
      • Initialized in LearningStrategy within base.py and assigned accordingly.
      • Each algorithm returns a different second value (see the sketch below):
        • MATD3: returns the action and the exploration noise.
        • PPO: returns the action and its log probability.
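A hedged sketch of the two return signatures, written as standalone functions for illustration (in the PR the method lives on the strategy classes; the Gaussian actor with a learnable log standard deviation is an assumption):

```python
import torch as th
from torch.distributions import Normal


def get_actions_matd3(actor: th.nn.Module, obs: th.Tensor, noise_scale: float = 0.1):
    """MATD3 variant: deterministic action plus exploration noise, both returned."""
    action = actor(obs)
    noise = th.normal(0.0, noise_scale, size=action.shape)
    return (action + noise).clamp(-1.0, 1.0), noise


def get_actions_ppo(actor: th.nn.Module, obs: th.Tensor, log_std: th.Tensor):
    """PPO variant: stochastic action plus its log probability (needed for PPO-Clip)."""
    dist = Normal(actor(obs), log_std.exp())
    action = dist.sample()
    return action, dist.log_prob(action).sum(dim=-1)
```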
  • learning_strategies.py

    • calculate_bids():
      • Processes the second return value of get_actions() as extra_info, dispatching on its shape (see the sketch below).
      • Stores this value in unit.outputs.
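A sketch of how that dispatch could look; the output keys used here are hypothetical placeholders, only the shape-based distinction follows the description above:

```python
import torch as th


def store_extra_info(unit_outputs: dict, action: th.Tensor, extra_info: th.Tensor, start) -> None:
    """Route get_actions()' second return value depending on its shape.

    MATD3 returns per-dimension exploration noise (same shape as the action),
    while PPO returns a single log probability per observation.
    """
    if extra_info.shape == action.shape:
        unit_outputs["exploration_noise"][start] = extra_info  # MATD3 case
    else:
        unit_outputs["rl_log_probs"][start] = extra_info  # PPO case
```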
  • World.py

    • add_rl_units_operator():
      • Issues a warning if train_freq exceeds the simulation horizon (end - start) set in the config (see the sketch below).
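A minimal sketch of that check, assuming train_freq is given as a pandas-compatible frequency string:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def check_train_freq(train_freq: str, start: pd.Timestamp, end: pd.Timestamp) -> None:
    """Warn if the training interval is longer than the simulated horizon."""
    if pd.Timedelta(train_freq) > end - start:
        logger.warning(
            "train_freq (%s) exceeds the simulation horizon (%s); "
            "no policy update would be triggered within one run.",
            train_freq,
            end - start,
        )
```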
  • Loader_csv.py

    • run_learning():
      • Initializes the buffer as ReplayBuffer or RolloutBuffer based on world.learning_role.rl_algorithm_name (see the sketch below).
        • Contains a commented-out static buffer implementation for future use, referencing data loaded from configs if needed.
      • noise_scale is added solely for MATD3.
      • Validation requirements are now compatible with PPO, which doesn’t gather initial experience.
      • Algorithm-specific termination conditions applied.
      • Resets the buffer via inter_episodic_data.
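A hedged sketch of the selection logic; the constructor signatures are placeholders, only the class names and the rl_algorithm_name switch come from the description above:

```python
def create_buffer(world, obs_dim: int, act_dim: int, buffer_size: int):
    """Pick the buffer type by algorithm name (constructor arguments are illustrative)."""
    if world.learning_role.rl_algorithm_name == "ppo":
        # On-policy: grows dynamically and is reset via inter_episodic_data after each update.
        return RolloutBuffer(obs_dim=obs_dim, act_dim=act_dim)
    # Off-policy MATD3: fixed-size replay buffer; exploration noise (noise_scale) applies here only.
    return ReplayBuffer(buffer_size=buffer_size, obs_dim=obs_dim, act_dim=act_dim)
```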
  • Learning_unit_operator.py

    • Adjusted write_learning_to_output and write_to_learning_role to be compatible with PPO.
  • ppo.py

    • update_policy() method remains incomplete and may require further development or revision (one possible direction is sketched below).
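A minimal sketch of one clipped-surrogate (PPO-Clip) gradient step, assuming a Gaussian actor with a learnable log_std attribute, a single optimizer over actor and critic parameters, and simple return-minus-value advantages; this illustrates the cited algorithm, not the PR's final implementation:

```python
import torch as th


def ppo_update_step(actor, critic, optimizer, obs, actions, old_log_probs, returns,
                    clip_ratio=0.2, vf_coef=0.5, ent_coef=0.01):
    """One PPO-Clip gradient step with advantage normalisation (as in MAPPO)."""
    values = critic(obs).squeeze(-1)
    advantages = returns - values.detach()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    dist = th.distributions.Normal(actor(obs), actor.log_std.exp())
    log_probs = dist.log_prob(actions).sum(dim=-1)
    ratio = th.exp(log_probs - old_log_probs)

    # Clipped surrogate objective from https://arxiv.org/abs/1707.06347
    surrogate = th.min(
        ratio * advantages,
        th.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages,
    )
    policy_loss = -surrogate.mean()
    value_loss = (returns - values).pow(2).mean()
    entropy = dist.entropy().sum(dim=-1).mean()

    loss = policy_loss + vf_coef * value_loss - ent_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return policy_loss.item(), value_loss.item()
```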

Discussion Points

  • PPO Mini-Batches:

    • Decision needed on whether to use mini-batches for optimization (currently not in use).
      • Advantages of Mini-Batches: Reduced memory and computational requirements during optimization.
      • Resources for Reference:
      • For mini-batch sampling, we could use a method similar to sample() from ReplayBuffer (see the sketch below).
      • Currently proceeding without mini-batches for a minimum viable product (MVP).
    • The RolloutBuffer currently stores observations, actions, and log probabilities before update_policy is called; the critic can then process this data during the update.
      • We decided not to involve the critic in get_actions to reduce dependencies: the critic is used only within update_policy, so get_actions can be used independently of it. Other implementations handle this differently.
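A sketch of what such mini-batch sampling could look like, assuming shuffled index batches over the collected rollout (function name and batch size are illustrative):

```python
import numpy as np


def minibatch_indices(rollout_length: int, batch_size: int, rng=None):
    """Yield shuffled index batches over the rollout, analogous to ReplayBuffer.sample()."""
    rng = rng if rng is not None else np.random.default_rng()
    order = rng.permutation(rollout_length)
    for start in range(0, rollout_length, batch_size):
        yield order[start:start + batch_size]


# Usage inside update_policy(): iterate over mini-batches instead of the full rollout.
# for batch in minibatch_indices(rollout_length=len(observations), batch_size=64):
#     obs_b, act_b, logp_b = observations[batch], actions[batch], log_probs[batch]
```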
  • Learning_role.py: Once all required PPO parameters are identified, they need to be added during initialization in learning_role.py. The default values are currently defined in several places; we tried to make this more consistent.

  • Training Frequency: Works with hourly intervals shorter than one episode, but stability could improve with a setting that allows the training interval to exceed the episode length, mainly when we simulate only very short episodes to prevent overfitting.

    • Currently, Learning_role resets after each episode, limiting the ability to extend intervals.
  • Loader_csv.py – Run_learning()

    • Define a termination condition for PPO based on the surrogate loss, in addition to the current early-stopping criteria (see the sketch below):
      • The surrogate loss in PPO monitors the policy updates: very small values indicate the policy is approaching optimality, while very large values indicate instability.
      • Training can then halt early, based on post-update evaluations, if the surrogate loss shows little improvement or signs of instability.
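A hedged sketch of such a termination check; the thresholds and the additional approximate-KL criterion are illustrative assumptions:

```python
def should_stop_training(surrogate_losses: list[float], approx_kl: float,
                         loss_tol: float = 1e-4, target_kl: float = 0.02) -> bool:
    """Decide after a policy update whether PPO training can terminate early.

    Stops when the surrogate loss barely changes between updates (policy near-optimal)
    or when the approximate KL divergence suggests the last update was too large (unstable).
    """
    if len(surrogate_losses) >= 2:
        if abs(surrogate_losses[-1] - surrogate_losses[-2]) < loss_tol:
            return True
    return approx_kl > target_kl
```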
  • Additional Parameters: Add any additional parameters to the config and subsequent stages as required (a hypothetical PPO parameter set is sketched below).
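A hypothetical collection of PPO-specific defaults that would need one authoritative place (for example during Learning_role initialization); every name and value here is an assumption, not the PR's configuration:

```python
# Hypothetical PPO defaults, kept in a single place instead of scattered across modules.
PPO_DEFAULTS = {
    "gamma": 0.99,           # discount factor
    "gae_lambda": 0.95,      # advantage-estimation smoothing
    "clip_ratio": 0.2,       # PPO-Clip range
    "entropy_coef": 0.01,    # exploration bonus
    "value_coef": 0.5,       # critic loss weight
    "gradient_steps": 10,    # gradient steps per rollout
    "learning_rate": 3e-4,
    "train_freq": "24h",     # how often update_policy is triggered
}
```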

  • Address RL with complex bids:

    • Currently, only the version in learning_strategies.py has been updated.
    • Future updates should also address the version in learning_advanced_orders.py.
  • Config Adjustments: The configuration has only been updated in example_02a_tiny, with a shortened time period for testing. This should be adjusted in the main branch once the implementation is validated, and other configurations will need updating to be compatible with PPO.

Testing

[Describe the testing you've done, including any specific test cases or scenarios]

Checklist

Please check all applicable items:

  • Code changes are sufficiently documented (docstrings, inline comments, doc folder updates)
  • New unit tests added for new features or bug fixes
  • Existing tests pass with the changes
  • Reinforcement learning examples are operational (for DRL-related changes)
  • Code tested with both local and Docker databases
  • Code follows project style guidelines and best practices
  • Changes are backwards compatible, or deprecation notices added
  • New dependencies added to pyproject.toml
  • A note for the release notes doc/release_notes.rst of the upcoming release is included
  • Consent to release this PR's code under the GNU Affero General Public License v3.0

Additional Notes (if applicable)

[Any additional information, concerns, or areas you want reviewers to focus on]

Screenshots (if applicable)

[Add screenshots to demonstrate visual changes]

ufqjh and others added 20 commits September 11, 2024 14:25
- re-add buffer_size to rollout buffer and include warning if we exceed it
- switch default values to config reading
- started debugging policy update
- adjusted dimensions of output
- Established fully decentralized Critic for PPO
- Added debug statements
- Changed activation layer of Actor Neural Network from softsign to sigmoid
- added mc as the mean of the distribution from which action_logits (softmax, [-1, 1]) are subtracted
- reset buffer for proper on-policy learning
…would always be calculated anew with the gradient steps, which is not in accordance with the Spinning Up implementation

- also added normalisation of advantages in accordance with MAPPO
- used obs collection for central critic from MATD3 to enable proper multi-agent learning
- renamed epoch to gradient step
… do not define two distributions, one in get_actions and one in the policy update, as this is prone to mistakes
@kim-mskw (Contributor, Author) commented on Oct 30, 2024

Current version looks promising (a somewhat less restrained woohoo):
We see some convergence but no profitable dispatch yet. This could be due to the hyperparameter settings, though. For example, the oscillation suggests gradient updates that are too large, which sometimes "jump" too far in the wrong direction or even overshoot the optimum. Hence, the updates should be clipped more aggressively!

[image: training results plot]

@kim-mskw (Contributor, Author) commented on Oct 30, 2024

One open problem is the backpropagation of the loss when we use a gradient_step number > 1, because we then get tensor errors for some reason... Need to discuss that with @nick-harder
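A hedged guess at the remedy, assuming the errors are the usual "trying to backward through the graph a second time" kind: detach everything taken from the rollout buffer and rebuild the distribution (and thus the graph) in every gradient step. The names below follow the update sketch above and are placeholders:

```python
import torch as th

# Detach rollout data so no stale computation graph is reused across gradient steps.
old_log_probs = old_log_probs.detach()
advantages = advantages.detach()

for _ in range(gradient_steps):
    # A fresh forward pass builds a new graph for each gradient step.
    dist = th.distributions.Normal(actor(obs), actor.log_std.exp())
    log_probs = dist.log_prob(actions).sum(dim=-1)
    ratio = th.exp(log_probs - old_log_probs)
    loss = -th.min(
        ratio * advantages,
        th.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages,
    ).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```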

Successfully merging this pull request may close this issue: Develop new learning algorithm: PPO