doc(gry): add tutorial for RM #241
@abstractmethod
def estimate(self, data: list) -> Any:
"""
Return the estimated reward values
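I think the comment here could be more detailed. For example, this function overwrites the reward values in data with the values given by the reward model. Note that this makes the original reward values in data unrecoverable, so extra care is needed when using it.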
OK, I'll revise it right away.
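The behavior raised in this thread can be illustrated with a minimal sketch. It assumes a hypothetical `ToyRewardModel` whose `estimate` overwrites `reward` in place; the class, its constant prediction, and the plain-float rewards are illustrative assumptions, not DI-engine code. Copying the batch with `copy.deepcopy` before the call keeps the environment rewards available.

```python
import copy
from typing import Any, Dict, List


class ToyRewardModel:
    """Hypothetical reward model that always predicts a fixed reward."""

    def __init__(self, predicted_reward: float = 0.5) -> None:
        self.predicted_reward = predicted_reward

    def estimate(self, data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Overwrites each item's "reward" in place, so the caller's original values are lost.
        for item in data:
            item["reward"] = self.predicted_reward
        return data


batch = [{"obs": [0.0, 1.0], "action": 1, "reward": 1.0}]
# Keep an untouched copy before calling estimate() if the environment reward is still needed.
original = copy.deepcopy(batch)
ToyRewardModel().estimate(batch)
```

After the call, `batch` carries the model's reward while `original` still holds the value from the environment.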
from ding.entry import serial_pipeline_reward_model_offpolicy

# The main config and create config you want to train with
# cooptrain_reward = True means the reward model is trained at the same time as the policy
Maybe add which algorithms fit this case and need co-training?
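To make the `cooptrain_reward` flag concrete, the sketch below shows one possible shape of a co-training loop. It does not reproduce `serial_pipeline_reward_model_offpolicy`; `run_pipeline`, `CountingRewardModel`, and `collect_fn` are hypothetical stand-ins chosen for illustration.

```python
from typing import Any, Callable, Dict, List

Transition = Dict[str, Any]


class CountingRewardModel:
    """Hypothetical reward model that only records how it is used."""

    def __init__(self) -> None:
        self.buffer: List[Transition] = []
        self.train_calls = 0

    def collect_data(self, data: List[Transition]) -> None:
        self.buffer.extend(data)

    def train(self) -> None:
        self.train_calls += 1

    def estimate(self, data: List[Transition]) -> List[Transition]:
        # Return relabeled copies so the collected transitions stay unchanged.
        return [{**item, "reward": 0.0} for item in data]


def run_pipeline(
    reward_model: CountingRewardModel,
    collect_fn: Callable[[], List[Transition]],
    iterations: int,
    cooptrain_reward: bool = True,
) -> List[Transition]:
    """Toy loop: collect, optionally update the reward model, then relabel rewards."""
    policy_batches: List[Transition] = []
    for _ in range(iterations):
        new_data = collect_fn()
        if cooptrain_reward:
            # Co-training: the reward model is updated alongside the policy.
            reward_model.collect_data(new_data)
            reward_model.train()
        # The policy would be trained on the relabeled transitions here.
        policy_batches.extend(reward_model.estimate(new_data))
    return policy_batches


rm = CountingRewardModel()
relabeled = run_pipeline(rm, lambda: [{"obs": 0, "action": 0, "reward": 1.0}], iterations=3)
```

With `cooptrain_reward=False` the same loop would skip the reward model update and only use it for relabeling, which is the case for a reward model that was trained beforehand.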
**Adding a Reward Model to reinforcement learning training**
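I think the next few paragraphs lack an overall summary. For example, a top-level heading such as "How to use a Reward Model in reinforcement learning" should be added, with several points under it: how to define and add a reward model, how to train a reward model, and how to use a reward model to predict rewards.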
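In reinforcement learning, a Reward Model is a model that evaluates the agent's behavior. Its input is the agent's observation and action, and its output is a scalar reward value.

In DI-engine, the Reward Model is a component we provide. All Reward Model classes inherit from an abstract base class named **BaseRewardModel**. In this class, we define the most basic Reward Model functionality as follows.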
There should be a space between English and Chinese text. This affects many places, and all of them need to be fixed.
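For readers of this thread, the interface described in the quoted paragraph might look roughly like the skeleton below. Only the method names `estimate`, `train`, `collect_data`, and `load_expert_data` come from the diff under review; the class name `SketchBaseRewardModel`, the `ConstantRewardModel` subclass, and all method bodies are assumptions for illustration and not the actual `BaseRewardModel` source.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class SketchBaseRewardModel(ABC):
    """Hypothetical stand-in for the abstract base class discussed above."""

    @abstractmethod
    def estimate(self, data: list) -> Any:
        """Return the estimated reward for each item in data."""

    @abstractmethod
    def train(self) -> None:
        """Update the reward model from the stored data."""

    @abstractmethod
    def collect_data(self, data: list) -> None:
        """Store externally supplied training data inside the model."""

    def load_expert_data(self, data: list) -> None:
        """Optional hook, only needed when training from expert data."""


class ConstantRewardModel(SketchBaseRewardModel):
    """Smallest possible concrete subclass: predicts the mean stored reward."""

    def __init__(self) -> None:
        self.rewards: List[float] = []
        self.mean_reward = 0.0

    def collect_data(self, data: list) -> None:
        self.rewards.extend(item["reward"] for item in data)

    def train(self) -> None:
        if self.rewards:
            self.mean_reward = sum(self.rewards) / len(self.rewards)

    def estimate(self, data: list) -> Any:
        return [self.mean_reward for _ in data]


model = ConstantRewardModel()
model.collect_data([{"reward": 1.0}, {"reward": 3.0}])
model.train()
estimates = model.estimate([{"obs": 0}, {"obs": 1}])
```

Marking the three core methods as abstract means a subclass that forgets one of them fails at instantiation rather than later during training.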
How to add a new Reward Model
-------------------------------

In the previous section, we introduced how to use the existing reward models. Next, we show how to add a new reward model and which conventions it needs to follow.
Should "reward model" keep its initial capitals throughout the document? This needs to be made consistent.
def __init__(self, config: EasyDict, device: str, tb_logger: 'SummaryWriter') -> None:
"""
Initializes the RM; called when the RM is created. Note that if expert data needs to be imported (you can write a self.load_expert_data() method)
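Here I think we should only cover the attributes that must be initialized, such as cfg, reward_model, and tb_logger. Things like counters differ between algorithms and may not exist at all, so users can define them themselves.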
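Trains the whole RM and adds content to the logger. It should take the following form:
1. The internal method _train() performs the actual training and returns the values to be added to the logger
2. The corresponding content is added to the logger
for _ in range(self.cfg.update_per_collect):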
example:
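One way to fill in this example is sketched below, following the two steps in the quoted docstring. `update_per_collect` comes from the diff; `StubLogger` (a stand-in exposing an `add_scalar(tag, value, global_step)` method in the style of a TensorBoard `SummaryWriter`), `SketchRewardModel`, the `_train` return value, and the tag name are assumptions.

```python
from types import SimpleNamespace
from typing import List, Tuple


class StubLogger:
    """Hypothetical logger that records add_scalar calls instead of writing event files."""

    def __init__(self) -> None:
        self.records: List[Tuple[str, float, int]] = []

    def add_scalar(self, tag: str, value: float, global_step: int) -> None:
        self.records.append((tag, value, global_step))


class SketchRewardModel:
    def __init__(self, cfg: SimpleNamespace, tb_logger: StubLogger) -> None:
        self.cfg = cfg
        self.tb_logger = tb_logger
        self.train_iter = 0

    def _train(self) -> float:
        # Placeholder for one optimization step; returns the loss to be logged.
        return 1.0 / (self.train_iter + 1)

    def train(self) -> None:
        for _ in range(self.cfg.update_per_collect):
            # 1. The internal method performs the update and returns the value to log.
            loss = self._train()
            # 2. The returned value is added to the logger.
            self.tb_logger.add_scalar("reward_model/loss", loss, self.train_iter)
            self.train_iter += 1


logger = StubLogger()
rm = SketchRewardModel(SimpleNamespace(update_per_collect=3), logger)
rm.train()
```

Keeping the optimization step in `_train()` and the logging in `train()` lets each algorithm change its update rule without touching the logging code.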
@abstractmethod
def collect_data(self, data) -> None:
"""
Collect the training data required by the RM
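"Collect data" seems a bit ambiguous. It would be better to say that the training data passed in from outside is stored inside the RM for training.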
def load_expert_data(self, data) -> None:
"""
Load expert data; this only needs to be implemented when the Reward Model is trained with expert data
Add examples, e.g. the Inverse RL algorithms Guided Cost Learning and T-REX.
Introduction to Reward Model
-------------------------------

**Basic concepts of the Reward Model**
A figure is needed to show where the RM sits in the standard RL pipeline.
pass

def collect_data(self, data: list) -> None:
"""
Please add the indentation.
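This method is used during coop-train to add new data to the RM (it is not used for pretrain)
The data passed in should be a list of dicts,
and each dict must contain the following (document special cases in the comments; using assert to check this before any computation is recommended)
{"obs": torch.tensor, "next_obs": torch.tensor, "action": torch.tensor, "reward": torch.tensor}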
torch.Tensor
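The recommendation in the quoted docstring to validate inputs with `assert` could be sketched as follows. The required keys are taken from that docstring; `REQUIRED_KEYS`, `BufferingRewardModel`, and `train_data` are hypothetical names, and the values are plain Python objects so the sketch runs without PyTorch (a type check against `torch.Tensor` is deliberately left out here).

```python
from typing import Any, Dict, List

REQUIRED_KEYS = ("obs", "next_obs", "action", "reward")


class BufferingRewardModel:
    """Hypothetical reward model that only validates and stores incoming data."""

    def __init__(self) -> None:
        self.train_data: List[Dict[str, Any]] = []

    def collect_data(self, data: List[Dict[str, Any]]) -> None:
        # Validate before doing any computation, as the docstring recommends.
        assert isinstance(data, list), "data must be a list of dicts"
        for item in data:
            missing = [key for key in REQUIRED_KEYS if key not in item]
            assert not missing, f"missing keys: {missing}"
        self.train_data.extend(data)


rm = BufferingRewardModel()
rm.collect_data([{"obs": [0.0], "next_obs": [1.0], "action": 0, "reward": 1.0}])
```

Because every item is checked before `extend` is called, a malformed batch is rejected as a whole and nothing from it reaches the stored training data.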