Questions about the reward function #45
-
Hello!
Replies: 4 comments 15 replies
-
I like this question, because I am thinking about these things right now for my own, different but related, project.
-
Hi!
In the TrackMania pipeline, the only reward is indeed the number of points of the demo trajectory passed during the previous timestep. In theory, this reward encompasses everything: a trajectory that is better than the demo trajectory yields a higher reward, banging into walls yields a lower reward, etc.
It is true, though, that SAC struggles to understand that ramming into walls is not a great idea. Several other works have noticed this, and they usually add an artificial punishment for collisions to avoid the issue altogether. In TrackMania, the issue can be alleviated with hyperparameter tuning. I believe the residual difficulty essentially comes from non-Markovness and could be alleviated with a recurrent model (in other words, at the moment the car doesn't know whether it crashed 1 second ago, and I have the feeling that a crash somehow affects its acceleration for a while in TrackMania).
You can also look at what Laurens did for the competition: he has a hack that detects collisions on simple tracks so that he can penalize them directly.
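To make the reward described above concrete, here is a minimal sketch of a "points passed along the demo trajectory" reward. It is not the pipeline's actual code: the class name `TrajectoryReward`, the `look_ahead` window, and the optional `collision_penalty` / `collided` flag are all assumptions introduced for illustration. The penalty only mirrors the general idea of punishing collisions directly, and it is disabled by default.

```python
import math


class TrajectoryReward:
    """Hypothetical sketch: reward = demo-trajectory points passed since the last step."""

    def __init__(self, trajectory, look_ahead=10, collision_penalty=0.0):
        # trajectory: sequence of points recorded from a demo run (assumed format)
        self.trajectory = [tuple(p) for p in trajectory]
        self.look_ahead = look_ahead                # how far ahead to search (assumption)
        self.collision_penalty = collision_penalty  # 0.0 means "no artificial penalty"
        self.index = 0                              # last trajectory point reached

    def reset(self):
        self.index = 0

    def step(self, position, collided=False):
        # Search only forward from the current index, within the look-ahead window.
        end = min(self.index + self.look_ahead + 1, len(self.trajectory))
        best = min(
            range(self.index, end),
            key=lambda i: math.dist(self.trajectory[i], position),
        )
        reward = float(best - self.index)  # number of points passed this timestep
        self.index = best
        if collided:
            # Optional, hypothetical shaping term; needs an external collision signal.
            reward -= self.collision_penalty
        return reward
```

With this shape, a car that stalls against a wall simply stops collecting points (reward 0 per step), which is why the plain reward already discourages crashes in theory. The explicit penalty only makes that signal sharper.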
-
Thank you for the explanation. Sad that I didn't notice someone had already worked on collision detection; I had to edit an Openplanet plugin myself to get the collision data :D. I will look into this hack, because I see some flaws in the solution I found.
-
I was wondering which variable in the wandb graphs indicates the reward progress. I get that you can see training progress from the loss functions, but is the debug_r graph the right one to look at for reward info?