
How to calculate R1 & R2 mentioned in the paper? #2

Open
jaytimbadia opened this issue Apr 17, 2022 · 1 comment

Comments

@jaytimbadia

I have not fully understood R1 and R2 yet, but I really want to.

As I understand it:

Larger R1 -> the curvature is denser, meaning the loss function has many ups and downs (a rough landscape with lots of hills and valleys), so signSGD's majority vote may get confused here, and maybe plain SGD should be used instead. Is that right?

Larger R2 -> more noise in the feature set?

Also, what does "dense gradients" mean: larger in magnitude, or that the components of the gradient vector are all roughly the same size?

Deep optimisation tricks like this don't have much accessible literature, and the maths is too complicated to understand just by looking at the paper.

Please help if possible by explaining what the components of R1 & R2 are.

Jay

@jxbz
Owner

jxbz commented May 25, 2022

Hi Jay,

Sorry for the late reply.

The idea in this paper was that phi measures the sparseness / denseness of a vector. When the vector is "dense" (meaning most of the components are similar in magnitude) then phi is close to one. On the other hand, when the vector is "sparse" (meaning a few components are much larger than the others) then phi is close to zero.

This means that phi(L) was supposed to measure whether the function is very curvy in just a few directions ( phi(L) ≈ zero ) or in lots of directions ( phi(L) ≈ one ). Similarly phi(sigma) measures whether the stochastic gradient is noisy in just a few directions or in lots of directions. And phi(g) measures the same property for the expected gradient.
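As a rough numerical illustration (a minimal sketch, assuming phi here is the density measure phi(v) = ||v||_1^2 / (d * ||v||_2^2) for a d-dimensional vector v, as defined in the paper), you could compute it like this:

```python
import numpy as np

def phi(v):
    """Density measure phi(v) = ||v||_1^2 / (d * ||v||_2^2).

    Ranges from 1/d (a single nonzero component, "sparse")
    up to 1 (all components equal in magnitude, "dense").
    Undefined for the zero vector.
    """
    v = np.asarray(v, dtype=float)
    d = v.size
    return np.sum(np.abs(v)) ** 2 / (d * np.sum(v ** 2))

dense = np.ones(1000)        # every component has the same magnitude
sparse = np.zeros(1000)
sparse[0] = 1.0              # one dominant component

print(phi(dense))   # 1.0
print(phi(sparse))  # 0.001 (= 1/d)
```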

In hindsight, it's not obvious that assumption 2 in that paper is a good model of curvature for deep neural networks. My more recent work has attempted to design better notions of curvature for neural nets. See for instance this paper: https://arxiv.org/abs/2002.03432.

In general, our understanding of the optimisation theory of deep neural nets is still evolving. I hope we have better and simpler math to describe it soon.

Jeremy
