From 162d4e9d0abf76eb252ff82a86aa8740bd12eb9d Mon Sep 17 00:00:00 2001
From: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
Date: Sun, 5 Jan 2025 11:40:31 +0800
Subject: [PATCH] Update the_n_implementation_details_of_rlhf_with_ppo.md

---
 the_n_implementation_details_of_rlhf_with_ppo.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/the_n_implementation_details_of_rlhf_with_ppo.md b/the_n_implementation_details_of_rlhf_with_ppo.md
index 115507595e..9350cc6a54 100644
--- a/the_n_implementation_details_of_rlhf_with_ppo.md
+++ b/the_n_implementation_details_of_rlhf_with_ppo.md
@@ -350,7 +350,7 @@ In this section, we will delve into details, such as layer initialization, data
 - The code adds a per-token KL penalty ([lm_human_preferences/train_policy.py#L150-L153](https://github.com/openai/lm-human-preferences/blob/cbfd210bb8b08f6bc5c26878c10984b90f516c66/lm_human_preferences/train_policy.py#L150-L153)) to the rewards, in order to discourage the policy from deviating too far from the original policy.
     - Using `"usually, he would"` as an example, it gets tokenized to `[23073, 11, 339, 561]`. Say we use `[23073]` as the query and `[11, 339, 561]` as the response. Then under the default `gpt2` parameters, the response tokens will have log probabilities under the reference policy of `logprobs=[-3.3213, -4.9980, -3.8690]`.
     - During the first PPO update epoch and minibatch update, the active policy will have the same log probabilities `new_logprobs=[-3.3213, -4.9980, -3.8690]`, so the per-token KL penalty would be `kl = new_logprobs - logprobs = [0., 0., 0.]`.
-    - However, after the first gradient backward pass, we could have `new_logprob=[3.3213, -4.9980, -3.8690]` , so the per-token KL penalty becomes `kl = new_logprobs - logprobs = [-0.3315, -0.0426, 0.6351]`
+    - However, after the first gradient backward pass, we could have `new_logprob=[-3.6528, -5.0406, -3.2339]` , so the per-token KL penalty becomes `kl = new_logprobs - logprobs = [-0.3315, -0.0426, 0.6351]`
     - Then the `non_score_reward = beta * kl`, where `beta` is the KL penalty coefficient \\(\beta\\), and it is added to the `score` obtained from the reward model to create the `rewards` used for training. The `score` is only given at the end of the episode; it could look like `[0.4,]`, and we have `rewards = [beta * -0.3315, beta * -0.0426, beta * 0.6351 + 0.4]`.
 9. **Per-minibatch reward and advantage whitening, with optional mean shifting**
     1. OAI implements a `whiten` function that looks like below, basically normalizing the `values` by subtracting its mean and then dividing by its standard deviation. Optionally, `whiten` can shift back the mean of the whitened `values` with `shift_mean=True`.
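
As a quick sanity check on the patched numbers, here is a minimal sketch (not code from the blog post or from `lm-human-preferences`) that reproduces the per-token KL penalty arithmetic described in the hunk above. The log probabilities and the final `score` of `0.4` are the values quoted in the post; the concrete `beta = 0.15` is a hypothetical placeholder, not the coefficient OAI actually uses.

```python
# Sketch of the per-token KL penalty arithmetic from the example above.
# The logprob values are the ones quoted in the post; `beta` is a made-up
# placeholder coefficient, not a value taken from lm-human-preferences.
import torch

beta = 0.15   # hypothetical KL penalty coefficient
score = 0.4   # reward-model score, given only at the end of the episode

# Reference-policy logprobs of the response tokens [11, 339, 561] under gpt2
logprobs = torch.tensor([-3.3213, -4.9980, -3.8690])
# Active-policy logprobs after the first gradient backward pass (the patched values)
new_logprobs = torch.tensor([-3.6528, -5.0406, -3.2339])

kl = new_logprobs - logprobs    # per-token KL penalty estimate
non_score_reward = beta * kl    # sign convention as written in the post
rewards = non_score_reward.clone()
rewards[-1] += score            # the score is added only to the last token

print(kl.tolist())       # ≈ [-0.3315, -0.0426, 0.6351]
print(rewards.tolist())  # ≈ [-0.0497, -0.0064, 0.4953] with beta = 0.15
```

The printed `kl` matches the `[-0.3315, -0.0426, 0.6351]` vector in the surrounding context lines, which is exactly the consistency the patch restores: with the old `new_logprob=[3.3213, -4.9980, -3.8690]` values, the stated `kl` could not follow from `new_logprobs - logprobs`.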