
Algorithmic Expansion to PPO #462

Draft · wants to merge 22 commits into main
Conversation

@kim-mskw (Contributor) commented on Oct 28, 2024

Pull Request

Related Issue

Closes #239

Description

This pull request introduces key improvements to the PPO implementation, specifically in buffering, action handling, and role-specific configurations.
Original paper: https://arxiv.org/abs/1707.06347
Based on the implementation in https://github.com/marlbenchmark/on-policy/blob/main/onpolicy/algorithms/r_mappo/algorithm/r_actor_critic.py
PPO-Clip: https://stable-baselines.readthedocs.io/en/master/modules/ppo1.html
Paper on handling non-stationarity: https://arxiv.org/abs/2207.05742

Below is an overview of the changes by file:

Changes Proposed

  • Buffer.py

    • Introduced a dynamic buffer for PPO so the buffer size no longer has to be derived from each configuration (see the sketch below).
      • add(): Expands the buffer via expand_buffer() once the current capacity is reached.
      • A warning is issued when the expected memory demand exceeds the available space.
      • reset(): Fully resets the buffer.
      • get(): Returns all data needed for a policy update.
      • Note: To revert to a static buffer, pre-configured methods in learning_utils.py calculate the required buffer sizes; they are currently commented out for reference.
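A minimal sketch of such a dynamically growing on-policy buffer, assuming NumPy storage and a capacity-doubling strategy; the class and method names mirror the description above, while the initial size, stored fields, and memory threshold are illustrative assumptions:

```python
import warnings

import numpy as np


class RolloutBuffer:
    """Dynamically growing on-policy buffer (illustrative sketch)."""

    def __init__(self, obs_dim: int, act_dim: int, initial_size: int = 1024):
        self.capacity = initial_size
        self.pos = 0
        self.observations = np.zeros((initial_size, obs_dim), dtype=np.float32)
        self.actions = np.zeros((initial_size, act_dim), dtype=np.float32)
        self.log_probs = np.zeros(initial_size, dtype=np.float32)
        self.rewards = np.zeros(initial_size, dtype=np.float32)

    def expand_buffer(self, extra: int) -> None:
        """Grow all arrays by `extra` rows and warn if the allocation gets large."""
        def _grow(arr: np.ndarray) -> np.ndarray:
            return np.concatenate([arr, np.zeros((extra, *arr.shape[1:]), dtype=arr.dtype)])

        new_bytes = (self.capacity + extra) * 4 * (
            self.observations.shape[1] + self.actions.shape[1] + 2
        )
        if new_bytes > 1e9:  # illustrative threshold of ~1 GB
            warnings.warn("Rollout buffer is expected to exceed ~1 GB of memory.")

        self.observations = _grow(self.observations)
        self.actions = _grow(self.actions)
        self.log_probs = _grow(self.log_probs)
        self.rewards = _grow(self.rewards)
        self.capacity += extra

    def add(self, obs, action, log_prob, reward) -> None:
        if self.pos >= self.capacity:
            self.expand_buffer(self.capacity)  # double the capacity when full
        self.observations[self.pos] = obs
        self.actions[self.pos] = action
        self.log_probs[self.pos] = log_prob
        self.rewards[self.pos] = reward
        self.pos += 1

    def reset(self) -> None:
        self.pos = 0  # on-policy data is discarded after every update

    def get(self):
        valid = slice(0, self.pos)
        return (
            self.observations[valid],
            self.actions[valid],
            self.log_probs[valid],
            self.rewards[valid],
        )
```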
  • Algorithm-specific Files: matd3.py and ppo.py

    • Implemented get_actions() method:
      • Initialized in LearningStrategy within base.py and assigned accordingly.
      • Each algorithm returns a different second value (see the sketch below):
        • MATD3: returns the action and the exploration noise.
        • PPO: returns the action and its log probability.
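A hedged sketch of the two return signatures, written as standalone functions for illustration (in the PR the method lives on the strategy classes; the Gaussian actor with a learnable log standard deviation is an assumption):

```python
import torch as th
from torch.distributions import Normal


def get_actions_matd3(actor: th.nn.Module, obs: th.Tensor, noise_scale: float = 0.1):
    """MATD3 variant: deterministic action plus exploration noise, both returned."""
    action = actor(obs)
    noise = th.normal(0.0, noise_scale, size=action.shape)
    return (action + noise).clamp(-1.0, 1.0), noise


def get_actions_ppo(actor: th.nn.Module, obs: th.Tensor, log_std: th.Tensor):
    """PPO variant: stochastic action plus its log probability (needed for PPO-Clip)."""
    dist = Normal(actor(obs), log_std.exp())
    action = dist.sample()
    return action, dist.log_prob(action).sum(dim=-1)
```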
  • learning_strategies.py

    • calculate_bids():
      • Processes the second return value of get_actions() as extra_info, dispatching on its shape (see the sketch below).
      • Stores this value in unit.outputs.
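A sketch of how that dispatch could look; the output keys used here are hypothetical placeholders, only the shape-based distinction follows the description above:

```python
import torch as th


def store_extra_info(unit_outputs: dict, action: th.Tensor, extra_info: th.Tensor, start) -> None:
    """Route get_actions()' second return value depending on its shape.

    MATD3 returns per-dimension exploration noise (same shape as the action),
    while PPO returns a single log probability per observation.
    """
    if extra_info.shape == action.shape:
        unit_outputs["exploration_noise"][start] = extra_info  # MATD3 case
    else:
        unit_outputs["rl_log_probs"][start] = extra_info  # PPO case
```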
  • World.py

    • add_rl_units_operator():
      • Issues a warning if train_freq exceeds the simulation horizon (end - start) set in the config (see the sketch below).
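A minimal sketch of that check, assuming train_freq is given as a pandas-compatible frequency string:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def check_train_freq(train_freq: str, start: pd.Timestamp, end: pd.Timestamp) -> None:
    """Warn if the training interval is longer than the simulated horizon."""
    if pd.Timedelta(train_freq) > end - start:
        logger.warning(
            "train_freq (%s) exceeds the simulation horizon (%s); "
            "no policy update would be triggered within one run.",
            train_freq,
            end - start,
        )
```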
  • Loader_csv.py

    • run_learning():
      • Initializes the buffer as ReplayBuffer or RolloutBuffer based on world.learning_role.rl_algorithm_name (see the sketch below).
        • Contains a commented-out static buffer implementation for future use, referencing data loaded from configs if needed.
      • noise_scale is added solely for MATD3.
      • Validation requirements are now compatible with PPO, which doesn’t gather initial experience.
      • Algorithm-specific termination conditions applied.
      • Resets the buffer via inter_episodic_data.
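A hedged sketch of the selection logic; the constructor signatures are placeholders, only the class names and the rl_algorithm_name switch come from the description above:

```python
def create_buffer(world, obs_dim: int, act_dim: int, buffer_size: int):
    """Pick the buffer type by algorithm name (constructor arguments are illustrative)."""
    if world.learning_role.rl_algorithm_name == "ppo":
        # On-policy: grows dynamically and is reset via inter_episodic_data after each update.
        return RolloutBuffer(obs_dim=obs_dim, act_dim=act_dim)
    # Off-policy MATD3: fixed-size replay buffer; exploration noise (noise_scale) applies here only.
    return ReplayBuffer(buffer_size=buffer_size, obs_dim=obs_dim, act_dim=act_dim)
```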
  • Learning_unit_operator.py

    • Adjusted write_learning_to_output and write_to_learning_role to be compatible with PPO.
  • ppo.py

    • update_policy() method remains incomplete and may require further development or revision (one possible direction is sketched below).
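A minimal sketch of one clipped-surrogate (PPO-Clip) gradient step, assuming a Gaussian actor with a learnable log_std attribute, a single optimizer over actor and critic parameters, and simple return-minus-value advantages; this illustrates the cited algorithm, not the PR's final implementation:

```python
import torch as th


def ppo_update_step(actor, critic, optimizer, obs, actions, old_log_probs, returns,
                    clip_ratio=0.2, vf_coef=0.5, ent_coef=0.01):
    """One PPO-Clip gradient step with advantage normalisation (as in MAPPO)."""
    values = critic(obs).squeeze(-1)
    advantages = returns - values.detach()
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    dist = th.distributions.Normal(actor(obs), actor.log_std.exp())
    log_probs = dist.log_prob(actions).sum(dim=-1)
    ratio = th.exp(log_probs - old_log_probs)

    # Clipped surrogate objective from https://arxiv.org/abs/1707.06347
    surrogate = th.min(
        ratio * advantages,
        th.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages,
    )
    policy_loss = -surrogate.mean()
    value_loss = (returns - values).pow(2).mean()
    entropy = dist.entropy().sum(dim=-1).mean()

    loss = policy_loss + vf_coef * value_loss - ent_coef * entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return policy_loss.item(), value_loss.item()
```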

Discussion Points

  • PPO Mini-Batches:

    • Decision needed on whether to use mini-batches for optimization (currently not in use).
      • Advantages of Mini-Batches: Reduced memory and computational requirements during optimization.
      • Resources for Reference:
      • For mini-batch sampling, we could use a method similar to sample() from ReplayBuffer (see the sketch below).
      • Currently proceeding without mini-batches for a minimum viable product (MVP).
    • The RolloutBuffer currently stores observations, actions, and log probabilities before update_policy is called; the critic can then process this data during the update.
      • We decided not to involve the critic in get_actions to reduce dependencies: the critic is used only within update_policy, so get_actions can be used independently of it. Other implementations handle this differently.
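A sketch of what such mini-batch sampling could look like, assuming shuffled index batches over the collected rollout (function name and batch size are illustrative):

```python
import numpy as np


def minibatch_indices(rollout_length: int, batch_size: int, rng=None):
    """Yield shuffled index batches over the rollout, analogous to ReplayBuffer.sample()."""
    rng = rng if rng is not None else np.random.default_rng()
    order = rng.permutation(rollout_length)
    for start in range(0, rollout_length, batch_size):
        yield order[start:start + batch_size]


# Usage inside update_policy(): iterate over mini-batches instead of the full rollout.
# for batch in minibatch_indices(rollout_length=len(observations), batch_size=64):
#     obs_b, act_b, logp_b = observations[batch], actions[batch], log_probs[batch]
```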
  • Learning_role.py: Once all required PPO parameters are identified, they need to be added during initialization in learning_role.py. The default values are currently defined in several places; we tried to make this more consistent.

  • Training Frequency: Works with hourly intervals shorter than one episode, but stability could improve with a setting that allows the training interval to exceed the episode length, mainly when we simulate only very short episodes to prevent overfitting.

    • Currently, Learning_role resets after each episode, limiting the ability to extend intervals.
  • Loader_csv.py – Run_learning()

    • Define a termination condition for PPO based on the surrogate loss, in addition to the current early-stopping criteria (see the sketch below):
      • The surrogate loss in PPO monitors the policy updates: very small values indicate the policy is approaching optimality, while very large values indicate instability.
      • Training can then halt early, based on post-update evaluations, if the surrogate loss shows little improvement or signs of instability.
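A hedged sketch of such a termination check; the thresholds and the additional approximate-KL criterion are illustrative assumptions:

```python
def should_stop_training(surrogate_losses: list[float], approx_kl: float,
                         loss_tol: float = 1e-4, target_kl: float = 0.02) -> bool:
    """Decide after a policy update whether PPO training can terminate early.

    Stops when the surrogate loss barely changes between updates (policy near-optimal)
    or when the approximate KL divergence suggests the last update was too large (unstable).
    """
    if len(surrogate_losses) >= 2:
        if abs(surrogate_losses[-1] - surrogate_losses[-2]) < loss_tol:
            return True
    return approx_kl > target_kl
```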
  • Additional Parameters: Add any additional parameters to the config and subsequent stages as required (a hypothetical PPO parameter set is sketched below).
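A hypothetical collection of PPO-specific defaults that would need one authoritative place (for example during Learning_role initialization); every name and value here is an assumption, not the PR's configuration:

```python
# Hypothetical PPO defaults, kept in a single place instead of scattered across modules.
PPO_DEFAULTS = {
    "gamma": 0.99,           # discount factor
    "gae_lambda": 0.95,      # advantage-estimation smoothing
    "clip_ratio": 0.2,       # PPO-Clip range
    "entropy_coef": 0.01,    # exploration bonus
    "value_coef": 0.5,       # critic loss weight
    "gradient_steps": 10,    # gradient steps per rollout
    "learning_rate": 3e-4,
    "train_freq": "24h",     # how often update_policy is triggered
}
```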

  • Address RL with complex bids:

    • Currently, only the version in learning_strategies.py has been updated.
    • Future updates should also address the version in learning_advanced_orders.py.
  • Config Adjustments: The configuration has only been updated in example_02a_tiny, with a shortened time period for testing. This should be adjusted in the main branch once the implementation is validated, and other configurations will need updating to be compatible with PPO.

Testing

[Describe the testing you've done, including any specific test cases or scenarios]

Checklist

Please check all applicable items:

  • Code changes are sufficiently documented (docstrings, inline comments, doc folder updates)
  • New unit tests added for new features or bug fixes
  • Existing tests pass with the changes
  • Reinforcement learning examples are operational (for DRL-related changes)
  • Code tested with both local and Docker databases
  • Code follows project style guidelines and best practices
  • Changes are backwards compatible, or deprecation notices added
  • New dependencies added to pyproject.toml
  • A note for the release notes doc/release_notes.rst of the upcoming release is included
  • Consent to release this PR's code under the GNU Affero General Public License v3.0

Additional Notes (if applicable)

[Any additional information, concerns, or areas you want reviewers to focus on]

Screenshots (if applicable)

[Add screenshots to demonstrate visual changes]

ufqjh and others added 20 commits September 11, 2024 14:25
- re-add buffer_size to rollout buffer and include warning if we exceed it
- switch default values to config reading
- started debugging policy update
- adjusted dimensions of output
- Established fully decentralized Critic for PPO
- Added debug statements
- Changed activation layer of Actor Neural Network from softsign to sigmoid
- added mc as the mean of the distribution from which action_logits (softmax, [-1, 1]) are subtracted
- reset buffer for proper on-policy learning
…would always be calculated anew with the gradient steps, which is not in accordance with the Spinning Up implementation

- also added normalisation of advantages in accordance with MAPPO
- used obs collection for central critic from MATD3 to enable proper multi-agent learning
- renamed epoch to gradient step
… do not define two distributions, one in get_actions and one in the policy update, as this is prone to mistakes
@kim-mskw (Contributor, Author) commented on Oct 30, 2024

Current version looks promising (a somewhat less restrained woohoo):
We see some convergence but no profitable dispatch yet. This could be due to the hyperparameter settings, though. For example, the oscillation suggests gradient updates that are too large, which sometimes "jump" too far in the wrong direction or even overshoot the optimum. Hence, the updates should be clipped more aggressively!

[image: training results plot]

@kim-mskw (Contributor, Author) commented on Oct 30, 2024

One open problem is the backpropagation of the loss when we use a gradient_step number > 1, because we then get tensor errors for some reason... Need to discuss that with @nick-harder
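A hedged guess at the remedy, assuming the errors are the usual "trying to backward through the graph a second time" kind: detach everything taken from the rollout buffer and rebuild the distribution (and thus the graph) in every gradient step. The names below follow the update sketch above and are placeholders:

```python
import torch as th

# Detach rollout data so no stale computation graph is reused across gradient steps.
old_log_probs = old_log_probs.detach()
advantages = advantages.detach()

for _ in range(gradient_steps):
    # A fresh forward pass builds a new graph for each gradient step.
    dist = th.distributions.Normal(actor(obs), actor.log_std.exp())
    log_probs = dist.log_prob(actions).sum(dim=-1)
    ratio = th.exp(log_probs - old_log_probs)
    loss = -th.min(
        ratio * advantages,
        th.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages,
    ).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```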

Successfully merging this pull request may close this issue: Develop new learning algorithm: PPO