I have not understood them properly yet, but I really want to.
As I understand it:
more R1 -> denser curvature, meaning the loss function has more ups and downs, i.e. a rough landscape with many hills and valleys, so signSGD may get confused in the majority vote here, right? So should we use SGD instead?
Also, does more R2 -> more noise in the feature set?
Also, what does "dense gradients" mean: gradients that are larger in magnitude, or a gradient vector whose components all have roughly the same value?
I find such deep optimisation tricks for improving training hard to study; there is not much accessible literature, and the maths is too complicated to understand just by reading the paper.
Please help, if possible, by explaining what the components in R1 & R2 are.
Jay
The idea in this paper was that phi measures the sparseness / denseness of a vector. When the vector is "dense" (meaning most of the components are similar in magnitude) then phi is close to one. On the other hand, when the vector is "sparse" (meaning a few components are much larger than the others) then phi is close to zero.
This means that phi(L) was supposed to measure whether the function is very curvy in just a few directions ( phi(L) ≈ zero ) or in lots of directions ( phi(L) ≈ one ). Similarly phi(sigma) measures whether the stochastic gradient is noisy in just a few directions or in lots of directions. And phi(g) measures the same property for the expected gradient.
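For concreteness, here is a small numerical sketch of that behaviour. It assumes the density measure has the form phi(v) = ||v||_1^2 / (d * ||v||_2^2), as defined in the signSGD paper; the exact normalisation there may differ, so treat this as illustrative only.

```python
import numpy as np

def phi(v):
    """Assumed density measure from the signSGD paper:
    phi(v) = ||v||_1^2 / (d * ||v||_2^2), where d = len(v).
    Close to 1 for dense vectors, close to 1/d for sparse ones."""
    v = np.asarray(v, dtype=float)
    d = v.size
    return np.sum(np.abs(v)) ** 2 / (d * np.sum(v ** 2))

d = 1000
dense = np.ones(d)            # every component has the same magnitude
sparse = np.zeros(d)
sparse[0] = 1.0               # a single non-zero component dominates

print(phi(dense))   # 1.0   -> "dense": phi close to one
print(phi(sparse))  # 0.001 -> "sparse": phi close to zero (exactly 1/d here)
```

So "dense gradients" in this sense is about the shape of the vector (most components comparable in magnitude), not about the overall magnitude of the gradient.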
In hindsight, it's not obvious that assumption 2 in that paper is a good model of curvature for deep neural networks. My more recent work has attempted to design better notions of curvature for neural nets. See for instance this paper: https://arxiv.org/abs/2002.03432.
In general, our understanding of optimisation theory of deep neural nets is still evolving. I hope we have better and simpler math to describe it soon.