Hi,
I am training the humanoid agent to walk using the DeepMimic environment.
While the policy trains, the terminal prints a total_reward for each episode, i.e. the sum of the rewards over all time steps of that episode.
Then, after 40 episodes or so (one iteration), the terminal prints Train_Return and Test_Return values.
How do these values relate to the episodes' total_reward? I tried manually computing both the mean and a discounted sum with a lambda value of 0.95, but neither result comes close to the 34.6 and 40 reported as the train and test returns.
Iteration example:
total_reward= 304.7070782231422
total_reward= 330.99031506948995
total_reward= 280.0899647972968
total_reward= 334.5682120720093
total_reward= 290.48607507379035
total_reward= 296.15922621557917
total_reward= 284.28576137796716
total_reward= 318.3853249960263
total_reward= 281.1954503632689
total_reward= 291.4806676815156
total_reward= 267.16685971352155
total_reward= 296.23791982481396
total_reward= 276.2614277167039
total_reward= 347.34052600783497
total_reward= 307.8560193518319
total_reward= 318.5019787110523
total_reward= 283.2503021802854
total_reward= 302.85406186996715
total_reward= 292.4121275202293
total_reward= 302.8634105168602
total_reward= 295.9168667624474
total_reward= 334.2352692753193
total_reward= 294.5151168261536
total_reward= 290.95920850744614
total_reward= 306.1276442673896
total_reward= 308.0391413994197
total_reward= 285.98186639238116
total_reward= 308.9052466366138
total_reward= 291.3991421620044
total_reward= 286.29836297186966
total_reward= 314.10028170590556
total_reward= 254.62273445146707
total_reward= 290.8562960379172
total_reward= 272.62704895129536
total_reward= 325.0583622036573
total_reward= 273.9253170482888
Model saved to: /home/.local/lib/python3.6/site-packages/pybullet_data/data/policies/humanoid3d/agent0_model.ckpt
Agent 0
| Iteration | 71830 |
| Wall_Time | 134 |
| Samples | 313248246 |
| Train_Return | 34.6 |
| Test_Return | 40 |
| State_Mean | 0.107 |
| State_Std | 2.63 |
| Goal_Mean | 0 |
| Goal_Std | 0 |
| Exp_Rate | 0.2 |
| Exp_Noise | 0.05 |
| Exp_Temp | 0.001 |
| Critic_Loss | 0.00183 |
| Critic_Stepsize | 0.01 |
| Actor_Loss | 0.327 |
| Actor_Stepsize | 2.5e-06 |
| Clip_Frac | 0.251 |
| Adv_Mean | -0.19361387 |
| Adv_Std | 0.77200764 |
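To make the question concrete, here is the back-of-the-envelope calculation I tried. Since the terminal only prints per-episode totals (not per-step rewards), and I don't know the exact episode length, the episode length T below is a guess on my part and the discounted figure is only a rough estimate:

```python
import numpy as np

# total_reward values from the iteration above (first few shown;
# I used all ~35 of them, which average out to roughly 300)
total_rewards = np.array([304.71, 330.99, 280.09, 334.57, 290.49])

# Mean episode reward -- about 300, nowhere near Train_Return = 34.6
print("mean total_reward:", total_rewards.mean())

# Rough discounted-return estimate, assuming an approximately constant
# per-step reward over an episode of T steps. T is hypothetical; I don't
# know the actual episode length in control steps.
T = 500            # assumed episode length (my guess)
lam = 0.95         # the lambda / discount value I tried
r_step = total_rewards.mean() / T                  # average per-step reward
disc_return = r_step * (1 - lam ** T) / (1 - lam)  # geometric-series sum
print("discounted estimate:", disc_return)         # ~12, also far from 34.6 or 40
```

Neither number lands anywhere near 34.6 or 40, so I'm clearly misunderstanding what quantity Train_Return and Test_Return actually measure.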
Any help would be much appreciated. I want to understand the plots produced by the plot_return.py script, but they don't make much sense to me yet.
Thanks!