I have a question about your calculation of the gradient noise scale (GNS). You use different classes for the calculation (GradientNoiseScale, AdamGradientNoiseScale). As far as I understand, the preconditioner differs between the two: GradientNoiseScale always uses a preconditioner of 1 (the identity), while AdamGradientNoiseScale uses an adjusted preconditioner.
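To make sure we mean the same thing by "adjusted", here is my guess at the Adam preconditioner; this is an assumption on my part (the usual bias-corrected second-moment scaling), not taken from your AdamGradientNoiseScale implementation:

import torch

def adam_preconditioner(exp_avg_sq, step, beta2=0.999, eps=1e-8):
    # My assumed diagonal preconditioner 1 / (sqrt(v_hat) + eps), built from
    # Adam's second-moment state (exp_avg_sq). My reading is that
    # GradientNoiseScale effectively replaces this with all-ones.
    v_hat = exp_avg_sq / (1.0 - beta2 ** step)  # bias correction
    return 1.0 / (v_hat.sqrt() + eps)

# Example: a parameter with a uniform second-moment estimate after 100 steps.
precond = adam_preconditioner(torch.full((10,), 0.01), step=100)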
I have three questions regarding this:
1. For which optimizers does the code work, i.e. deliver correct results, and what would need to be done to calculate the GNS correctly when using other optimizers? Is it correct that it only works for SGD, Adam, and Adagrad?
2. To what extent does the scheduler or scaling rule influence the calculation of the GNS? I ask because the scaling rule seems to be the criterion that decides how the GNS is calculated.
3. If I use an optimizer other than Adam or AdamW, the "normal" GradientNoiseScale class is used to calculate the GNS (with a preconditioner of 1). Does this work for all other optimizers, such as SGD, LAMB, and others, or is it only valid for SGD? (A preconditioner of 1 should be the vanilla-SGD case from the original GNS paper, "An Empirical Model of Large-Batch Training"; see the short sketch right after this list.)
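For reference, this is what I mean by the unpreconditioned ("simple") estimator; it is my own sketch of the formulas from Appendix A of the paper, not code from your repository:

import torch

def simple_gns(grad_small, grad_big, b_small, b_big):
    # grad_small / grad_big: flattened gradients averaged over b_small and
    # b_big examples respectively, with b_small < b_big.
    sq_small = grad_small.pow(2).sum()
    sq_big = grad_big.pow(2).sum()
    # Unbiased estimate of |G|^2 (squared norm of the true gradient).
    g2 = (b_big * sq_big - b_small * sq_small) / (b_big - b_small)
    # Unbiased estimate of tr(Sigma) (sum of per-example gradient variances).
    trace_sigma = (sq_small - sq_big) / (1.0 / b_small - 1.0 / b_big)
    return trace_sigma / g2  # B_simple, the gradient noise scale

Nothing in this estimator depends on the optimizer, which is why I read the preconditioner-1 path as the vanilla-SGD case; please correct me if your GradientNoiseScale computes something different.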
Here is the relevant code:
if not scaling_rule and (isinstance(optimizer, torch.optim.Adam) or
                         isinstance(optimizer, torch.optim.AdamW)):
    self.scaling_rule = AdamScale()
else:
    self.scaling_rule = scaling_rule or AdaScale()
if isinstance(scaling_rule, AdamScale):
    self.gns = AdamGradientNoiseScale(self, optimizer,
                                      mp_scaler=mp_scaler)
else:
    self.gns = GradientNoiseScale(self, optimizer, mp_scaler=mp_scaler)
self.scaling_rule.initialize(self, optimizer, patch_optimizer=True)
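To state question 2 more precisely, here is a minimal, self-contained replica of the branch above. AdaScale and AdamScale are stand-ins, not your implementations, and I mirror the pasted code literally, including that the second isinstance check looks at the scaling_rule argument rather than self.scaling_rule:

import torch

class AdaScale:
    pass

class AdamScale(AdaScale):
    pass

def pick_gns_class(optimizer, scaling_rule=None):
    # Mirrors the pasted branch: the scaling rule is chosen from the optimizer
    # type, but the GNS class is chosen from the *argument* scaling_rule.
    if not scaling_rule and isinstance(optimizer, (torch.optim.Adam, torch.optim.AdamW)):
        rule = AdamScale()
    else:
        rule = scaling_rule or AdaScale()
    gns = "AdamGradientNoiseScale" if isinstance(scaling_rule, AdamScale) else "GradientNoiseScale"
    return type(rule).__name__, gns

model = torch.nn.Linear(4, 2)
sgd = torch.optim.SGD(model.parameters(), lr=0.1)
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3)

print(pick_gns_class(sgd))                              # ('AdaScale', 'GradientNoiseScale')
print(pick_gns_class(adamw))                            # ('AdamScale', 'GradientNoiseScale')
print(pick_gns_class(adamw, scaling_rule=AdamScale()))  # ('AdamScale', 'AdamGradientNoiseScale')

If the second case is intended (Adam/AdamW without an explicit scaling_rule still using the unpreconditioned GradientNoiseScale), that would already partly answer question 2; if the check is actually meant to look at self.scaling_rule, that would change my reading.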
I would appreciate any kind of help.