Questions about the reward function #45

veczieds · 2023-04-13T07:12:10Z

veczieds
Apr 13, 2023

Hello!
I had some questions about the reward function used in the project.
I'm collecting data about the collision count during training for my own research. What I experienced while training SAC was that at some point the car decided that, in corners, it is effective just to ram into the wall and stick to it :D. So I decided to look at the reward function to see whether that behavior is incentivized, but I couldn't fully understand from the text how the function works. As I understand it, the reward trajectory is split into points, and the agent is then rewarded for the points passed in a time-step. Now my question is: is there any incentive for the car to stick to the trajectory besides the fact that you can complete the track faster with better trajectories? (For example, is there a direct reward for the car driving right on top of the trajectory?)
I was also trying to find the reward function in the code, but had no luck (possibly I just didn't notice it, because I'm new to this stuff). Maybe you could point me to where to look for it.

Answered by yannbouteiller

NDR008 · 2023-04-13T08:47:59Z

NDR008
Apr 13, 2023

I like this question, because I am thinking about these things right now for my own different/related project.
My thoughts in generic terms (without experience on this matter yet):
I think the points and time aspects are required to have a non-naïve reward (i.e. to actually go as fast as possible in the direction of the finish).
I think if you make following the trajectory the reward, you would end up with something that nearly follows the trajectory (which would not require AI per se). But making it a part of the reward might be interesting, and the outcome would depend on how strong the trajectory reward is against the other rewards. You could argue, though, that this is cheating in the sense that it is no longer a pure robotics-like problem.
I intend to investigate this (hopefully) in my project, because I am looking at the human-like experience of driving a car.
Human drivers first learn the text-book / recommended driving-line / trajectory, and later learn how to optimize beyond it (find alternative lines that yield more reward, such as progressing through the track even faster).
So I am thinking of how to use a trajectory-linked reward as the equivalent of "first learn the text-book / recommended driving-line / trajectory".
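
One way to picture the trade-off described above is a blended reward: the progress term minus a weighted distance to the reference line. This is only a sketch under assumptions. The function name `blended_reward`, the parameter `line_weight`, and the use of the nearest trajectory point as the distance measure are all hypothetical and are not taken from tmrl.

```python
import numpy as np


def blended_reward(progress, position, trajectory, line_weight=0.1):
    """Progress reward minus a weighted distance to the reference line.

    progress: reward already earned from progressing along the track.
    position: (3,) array, current car position.
    trajectory: (N, 3) array of points on the reference line.
    line_weight: hypothetical weight; 0.0 disables the line-following term.
    """
    # Distance from the car to the nearest point of the reference line.
    distance = float(np.min(np.linalg.norm(trajectory - position, axis=1)))
    return progress - line_weight * distance
```

With `line_weight=0.0` this reduces to the pure progress reward, and a large `line_weight` pushes the agent toward simply tracking the line. Decreasing `line_weight` over the course of training would be one possible way to mimic "learn the text-book line first, then optimize beyond it".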

0 replies

yannbouteiller · 2023-04-13T14:48:12Z

yannbouteiller
Apr 13, 2023
Maintainer

Hi!

So, in the TrackMania pipeline, the only reward is indeed the number of points passed from the demo trajectory during the previous timestep. This reward encompasses everything in theory: a trajectory that is better than the demo trajectory yields a higher reward, banging into walls yields a lower reward, etc.
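
To make the idea concrete, here is a minimal sketch of a "points passed" reward. It is not the tmrl implementation: the function name `progress_reward`, the `max_lookahead` window, and the nearest-point matching rule are assumptions made for illustration.

```python
import numpy as np


def progress_reward(trajectory, position, current_index, max_lookahead=10):
    """Return (reward, new_index) for one time-step.

    trajectory: (N, 3) array of points sampled along the demo trajectory.
    position: (3,) array, current car position.
    current_index: index of the trajectory point reached at the previous step.
    max_lookahead: hypothetical cap on how many points ahead are searched.
    """
    # Only search forward from the last reached point, within a bounded window.
    stop = min(current_index + max_lookahead + 1, len(trajectory))
    window = trajectory[current_index:stop]
    # Match the car to the closest trajectory point in that window.
    distances = np.linalg.norm(window - position, axis=1)
    new_index = current_index + int(np.argmin(distances))
    # The reward is the number of trajectory points passed during this step.
    reward = float(new_index - current_index)
    return reward, new_index
```

In this sketch the lateral distance to the trajectory is used only to match the car to a point. It never enters the reward itself, so a car sliding along a wall is still rewarded as long as it keeps passing points, and it gets zero reward when it is stuck.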

It is true that SAC struggles at understanding that ramming into walls is not a great idea though. Several other works have noticed this, and they usually artificially add a punishment for collisions to avoid this issue altogether. In TrackMania, this issue can be alleviated with hyperparameter tuning though, and I believe the residual difficulty comes from non-Markovness essentially and could be alleviated with a recurrent model (in other words, ATM the car doesn't know whether it crashed 1 second ago, and I have the feeling that it somehow affects its acceleration for a while in TrackMania).

You can also look at what Laurens did for the competition: he has a hack that detects collisions on simple tracks so that he can penalize them directly.
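
For illustration, an explicit collision punishment could look like the sketch below. It is not Laurens's hack, whose details are not described here. The speed-drop heuristic, the function names, and the threshold and penalty values are all assumptions.

```python
def detect_collision(prev_speed, speed, drop_threshold=0.3):
    """Flag a collision when the speed falls by more than drop_threshold
    (as a fraction of the previous speed) within a single time-step."""
    if prev_speed <= 0.0:
        return False
    return (prev_speed - speed) / prev_speed > drop_threshold


def penalized_reward(progress, prev_speed, speed, collision_penalty=5.0):
    """Subtract a fixed penalty from the progress reward on a flagged collision."""
    if detect_collision(prev_speed, speed):
        return progress - collision_penalty
    return progress
```

A heuristic like this would also fire on hard braking, so the threshold would need tuning per track, and a real collision signal from the game would be preferable where one is available.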

1 reply

yannbouteiller Apr 13, 2023
Maintainer

PS: The reward function for TrackMania is defined here

veczieds · 2023-04-13T14:53:53Z

veczieds
Apr 13, 2023
Author

Thank you for the explanation. Sad that I didn't notice someone had already worked on collision detection. I had to edit an Openplanet plugin myself for the collision data :D. I will look into this hack, because I see some flaws in the solution I found.

0 replies

veczieds · 2023-05-08T13:39:08Z

veczieds
May 8, 2023
Author

I was wondering which variable from the wandb graphs indicates the reward progress. I get that you can see training progress from the loss functions, but is the debug_r graph the right one to look at for the reward info?

14 replies

veczieds May 8, 2023
Author

tmrl - 0.5.1

yannbouteiller May 8, 2023
Maintainer

Well, my bad, I had inadvertently pushed the yb/dev branch on PyPI instead of master. Version 0.5.2 should fix this.

veczieds May 8, 2023
Author

Understood. Will be waiting for 0.5.2 then :D

yannbouteiller May 8, 2023
Maintainer

Well I just published it on PyPI, it should be there already

veczieds May 8, 2023
Author

oh, that was fast :D. Will try it out
