- AD, Gradient descent and Backpropagation
- Numerical differentiation
- Directed Acyclic Graphs
- The chain rule
- Taylor series expansion
- Limits and continuity
- Partial derivatives
- Optimization
- The Gradient descent algorithm
- The Backpropagation algorithm
- Feed forward neural networks
- Activation functions, Autograd/JAX
- Dual numbers in AD
- Forward mode AD
- Forward mode AD table construction
- Symbolic differentiation
- Simple differentiation
- The Beta-Binomial model
Q. Differentiable functions
- What does it mean when a function is differentiable?
- Give an example of when a function doesn’t have a derivative at a point.
- Give an example of non-differentiable functions that are frequently used in machine learning. How do we do backpropagation if those functions aren’t differentiable?
Answer
Q. Convexity
- What does it mean for a function to be convex or concave? Draw it.
- Why is convexity desirable in an optimization problem?
- Show that the cross-entropy loss function is convex.
Answer
Q.
- Write the formula for the finite difference rule used in numerical differentiation.
- What is the main problem with this formula?
- Indicate one problem with software tools that utilize numerical differentiation and successive operations on floating-point numbers.
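For illustration, a minimal NumPy sketch (my own example, not the book's code) of the central-difference rule and its sensitivity to the step size $h$:

```python
import numpy as np

def central_diff(f, x, h=1e-5):
    # (f(x + h) - f(x - h)) / (2h): truncation error is O(h^2),
    # but too small an h amplifies floating-point cancellation.
    return (f(x + h) - f(x - h)) / (2 * h)

f = np.exp  # true derivative at x = 1 is e
for h in (1e-1, 1e-5, 1e-13):
    approx = central_diff(f, 1.0, h)
    print(h, approx, abs(approx - np.e))  # the error shrinks, then grows again
```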
Answer
Q.
- Given a function $f(x)$ and a point $a$, define the instantaneous rate of change of $f(x)$ at $a$.
- What other commonly used name does the instantaneous rate of change have?
- Given a function $f(x)$ and a point $a$, define the tangent line of $f(x)$ at $a$.
Answer
Q.
- State the definition of the derivative $f'(c)$ of a function $f(x)$ at $x = c$.
- With respect to the DAG depicted in Figure 5.3:
Figure 5.3: An expression graph for $g(x)$. Constants are shown in gray, crossed out since derivatives should not be propagated to constant operands.
  - (a) Traverse the graph and find the function $g(x)$ it represents.
  - (b) Using the definition of the derivative, find $g'(9)$.
Answer
Q.
- With respect to the expression graph depicted in Figure 5.4, traverse the graph and find the function $g(x)$ it represents.
Figure 5.4: An expression graph for $g(x)$. Constants are shown in gray, crossed out since derivatives should not be propagated to constant operands.
- Using the definition of the derivative, find the derivative of $g(x)$.
Answer
Q.
- The chain rule is a key concept in differentiation. Define it.
- Elaborate how the chain rule is utilized in the context of neural networks.
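For instance (a standard worked example, not from the original text), for $h(x) = \sin(x^2)$ the chain rule gives
$$ h'(x) = \cos(x^2)\cdot\frac{d}{dx}x^2 = 2x\cos(x^2), $$
which is exactly the pattern backpropagation applies layer by layer through a network.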
Answer
Q. Find the Taylor series expansion for:
$\frac{1}{1-x}$, $e^x$, $\sin(x)$, $\cos(x)$
Answer
Q. Find the Taylor series expansion for:
Answer
Q. Find the Taylor series expansion centered at
Answer
Q. Find the
Answer
Q. At
Answer
Q. Find the following limits:
$\lim_{x \to 3}\frac{e^{x^3} - e^{27}}{3x - 9}$, $\lim_{x \to 0}\frac{e^{x^2} - x - 1}{3\cos x - x - 3}$, $\lim_{x \to \infty}\frac{x - \ln x}{x^{1/100} + 4}$
Answer
Q.
- True or false: when applying a partial derivative, there are two variables considered constants: the dependent and the independent variable.
- Given $g(x, y)$, find its partial derivative with respect to $x$:
Answer
Q. How can we use the Hessian (second derivative matrix) to test for critical points?
Answer
Q. The gradient of a two-dimensional function is given by
$$ \nabla f(x, y) = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) $$
- Find the gradient of the function:
- Given the function:
evaluate it at
Answer
Q. Find the partial derivatives of:
Answer
Q. Find the partial derivatives of:
Answer
Q. Consider
Answer
Q. Consider
- Differentiate $f(x)$ and draw a conclusion about its behavior.
- Differentiate once again and discuss the concavity of the function $f(x)$.
Answer
Q. Consider the function
and find its maximum, minimum, and saddle points.
Answer
Q. The gradient descent algorithm can be utilized for the minimization of convex functions. Finding stationary points is required in order to minimize a convex function. A very simple approach for finding stationary points is to start at an arbitrary point, move against the gradient at that point (i.e. in the direction of steepest descent) to the next point, and repeat until converging to a stationary point. (A minimal sketch of this procedure appears after the questions below.)
- What is the term used to describe the vector of all partial derivatives of a function $f(x)$?
- Complete the sentence: when searching for a minimum, if the derivative is positive, the function is increasing/decreasing.
- The function $f(x) = x^2$, as depicted in Figure 5.5, has the derivative $f'(x) = 2x$. Evaluated at $x = −1$, the derivative equals $f'(−1) = −2$; at $x = −1$ the function is decreasing as $x$ gets larger. What will happen if, in order to find a minimum using gradient descent, we increase (or decrease) $x$ by the size of the gradient, and then repeatedly keep jumping?
Figure 5.5: The $x^2$ function.
- How can this phenomenon be alleviated?
- True or False: The gradient descent algorithm is guaranteed to find a local minimum if the learning rate is correctly decreased and a finite local minimum exists.
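A minimal sketch of the procedure described above (my own illustration, not the book's solution), showing how too large a step makes the iterates jump back and forth across the minimum of $f(x) = x^2$:

```python
def gradient_descent(grad, x0, lr, steps=5):
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)   # move against the gradient
        print(round(x, 4))
    return x

grad = lambda x: 2 * x          # derivative of f(x) = x**2

gradient_descent(grad, x0=-1.0, lr=0.1)   # converges towards 0
gradient_descent(grad, x0=-1.0, lr=1.0)   # step too large: jumps between -1 and 1
```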
Answer
Q.
- Is the data in the following table linearly separable?
| X_1 | X_2 | Y |
|---|---|---|
| 1 | 1 | + |
| 12 | 12 | - |
| 4 | 5 | - |
| 12 | 12 | + |
- What is the loss function for linear regression?
- What is the gradient descent algorithm to minimize a function $f(x)$?
Answer
Q. Let
Answer
Q. Given the function
Answer
Q. Given a logistic discriminant classifier:
$$ p(y = 1|x) = σ(w^T x) $$
where the sigmoid function is given by:
$$ σ(z) = \frac{1}{1 + e^{-z}} $$
The logistic loss for a training sample $x_i$ with label $y_i \in \{-1, +1\}$ is given by $L(y_i, x_i; w) = -\log p(y_i|x_i)$.
- Show that $p(y = −1|x) = σ(−w^T x)$.
- Show that $\nabla_w L(y_i, x_i; w) = −y_i(1 − p(y_i|x_i))x_i$.
- Show that $L(y_i, x_i; w)$ is convex.
Answer
Q. Most ML algorithms in use today rely on first-order derivatives (gradients) to construct the next training iteration. (A one-dimensional sketch of a second-order update appears after the questions below.)
- How can we use second-order derivatives for training models?
- What are the pros and cons of second-order optimization?
- Why don't we see more second-order optimization in practice?
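As a hedged illustration (a one-dimensional sketch under my own assumptions, not a production optimizer), Newton's method uses the second derivative to rescale the gradient step:

```python
def newton_step(df, d2f, x):
    # x_{k+1} = x_k - f'(x_k) / f''(x_k): a curvature-scaled update
    return x - df(x) / d2f(x)

# Minimise f(x) = x**4, with f'(x) = 4x**3 and f''(x) = 12x**2
x = 3.0
for _ in range(6):
    x = newton_step(lambda t: 4 * t**3, lambda t: 12 * t**2, x)
    print(round(x, 6))  # each step multiplies x by 2/3, heading towards the minimum at 0
```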
Answer
Q.
- During the training of an ANN, a sigmoid layer applies the sigmoid function to every element in the forward pass, while in the backward pass the chain rule is utilized as part of the backpropagation algorithm. With respect to the backpropagation algorithm, given the sigmoid activation function $σ(x) = \frac{e^x}{1+e^x}$ and a cost function $J$, annotate each part of equation (5.21).
- Code snippet 5.6 provides a pure Python-based (i.e. not using Autograd) implementation of the forward pass for the sigmoid function. Complete the backward pass that directly computes the analytical gradients.
    import numpy as np

    class Sigmoid:
        def forward(self, x):
            self.x = x
            return 1 / (1 + np.exp(-x))

        def backward(self, grad):
            grad_input = [???]
            return grad_input
Answer
Q. This question deals with the effect of customized transfer functions. Consider a neural network with hidden units that use
Answer
Q.
- True or false: in Autograd, if any input tensor of an operation has requires_grad=True, the computation will be tracked. After computing the backward pass, a gradient w.r.t. this tensor is accumulated into its .grad attribute.
- True or false: in Autograd, multiple calls to backward will sum up previously computed gradients if they are not zeroed.
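A minimal PyTorch sketch (my own illustration, using only standard torch calls) of the two behaviours asked about above:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

y = x ** 2      # tracked, because an input requires grad
y.backward()
print(x.grad)   # tensor(4.)

z = x ** 2      # second backward pass without zeroing the gradient
z.backward()
print(x.grad)   # tensor(8.) -- gradients accumulate into .grad
```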
Answer
Q. Your friend, a veteran of the DL community, wants to use logistic regression and implement custom activation functions using Autograd. Logistic regression is used when the variable $y$ that we want to predict can only take on discrete values (i.e. classification). Considering a binary classification problem ($y = 0$ or $y = 1$), the hypothesis function can be defined so that it is bounded between $[0, 1]$, in which case we use some form of logistic function, such as the sigmoid function. Other, more efficient functions exist, such as the ReLU (Rectified Linear Unit), which is discussed later. Note: the weights in Figure 5.8 are only meant for illustration purposes and are not part of the solution.
Figure 5.8: A typical binary classification problem.
- Given the sigmoid function $g(z) = \frac{1}{1 + e^{-z}}$, what is the expression for the corresponding hypothesis in logistic regression?
- What is the decision boundary?
- What does $h_\Theta(x) = 0.8$ mean?
- Using an Autograd-based Python program, implement both the forward and backward pass for the sigmoid activation function and evaluate its derivative at $x = 1$.
- Using an Autograd-based Python program, implement both the forward and backward pass for the ReLU activation function and evaluate its derivative at $x = 1$.
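For reference, a minimal sketch of the HIPS Autograd pattern (shown here for tanh rather than the requested activations, so the exercise itself is left open):

```python
import autograd.numpy as np   # thinly wrapped NumPy
from autograd import grad

def tanh_act(x):
    return np.tanh(x)

dtanh = grad(tanh_act)            # returns a function computing the derivative
print(tanh_act(1.0), dtanh(1.0))  # ~0.7616, ~0.4200 (= 1 - tanh(1)**2)
```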
Answer
Q. For real values, the inverse hyperbolic tangent (arctanh) is given by:
$$ \operatorname{arctanh}(x) = \frac{1}{2}\ln\left(\frac{1+x}{1-x}\right), \quad |x| < 1. $$
On the other hand, the derivative is:
$$ \frac{d}{dx}\operatorname{arctanh}(x) = \frac{1}{1-x^2}. $$
Your friend, a veteran of the DL community, wants to implement a custom activation function for the arctanh function.
- Use the NumPy array $[[0.37, 0.192, 0.571]]$ as input and evaluate the result using pure Python.
- Use the PyTorch $torch.autograd.Function$ class to implement a custom Function that implements the forward pass for the arctanh function in Python.
- Use the PyTorch $torch.autograd.Function$ class to implement a custom Function that implements the backward pass for the arctanh function in Python.
- Name the class ArtanhFunction and, using the gradcheck method from $torch.autograd$, verify that your numerical values match the analytical values computed by gradcheck. Remember that the function must be invoked via its $.apply(x)$ method so that it can be used by Autograd.
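A generic sketch of the $torch.autograd.Function$ pattern (shown here for a simple square function, not for arctanh, which is left as the exercise):

```python
import torch

class SquareFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)       # stash inputs needed for the backward pass
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x     # chain rule: dL/dx = dL/dy * 2x

# gradcheck compares numerical and analytical gradients (double precision required)
x = torch.randn(3, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(SquareFunction.apply, (x,)))  # True
```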
Answer
Q.
- Explain how AD uses floating point numerical rather than symbolic expressions.
- Explain the notion of DN as introduced by ([2]).
- What arithmetic operations are possible on DN?.
- Explain the relationship between a Taylor series and DN.
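A minimal dual-number sketch (my own illustration, not the code from the original text), showing how the derivative rides along in the second component:

```python
class Dual:
    def __init__(self, val, dot=0.0):
        self.val = val   # function value
        self.dot = dot   # derivative (tangent) component

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u'v + uv'
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__

# Differentiate f(x) = x**2 + 3x at x = 2 by seeding dot = 1:
x = Dual(2.0, 1.0)
y = x * x + 3 * x
print(y.val, y.dot)  # 10.0 7.0  (f(2) = 10, f'(2) = 2*2 + 3 = 7)
```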
Answer
Q.
- Expand the following function using DN:
- With respect to the expression graph depicted in Figure 5.9:
Figure 5.9: An expression graph for $g(x)$. Constants are shown in gray, crossed out since derivatives should not be propagated to constant operands.
  - (a) Traverse the graph and find the function $g(x)$ it represents.
  - (b) Expand the function $g(x)$ using DN.
- Show that the general identity
$$ g(x + \dot{x}d) = g(x) + g'(x)\dot{x}d $$
holds in this particular case too.
- Using the derived DN, evaluate the function $g(x)$ at $x = 2$.
- Using an Autograd-based Python program, implement the function and evaluate its derivative at $x = 2$.
Answer
Q. With respect to the expression graph depicted in Figure 5.10:
Figure 5.10: An expression graph for $g(x)$. Constants are shown in gray, crossed out since derivatives should not be propagated to constant operands.
- Traverse the graph and find the function $g(x)$ it represents.
- Expand the function $g(x)$ using DN.
- Using the derived DN, evaluate the function $g(x)$ at $x = 5$.
- Using an Autograd-based Python program, implement the function and evaluate its derivative at $x = 5$.
Answer
Q. When differentiating a function using forward-mode AD, the value of an expression and its derivative can be computed from the corresponding directed acyclic graph by propagating numerical values along it.
- Find the function $g(A, B, C)$ represented by the expression graph in Figure 5.11.
Figure 5.11: A computation graph for $g(A, B, C)$.
- Find the partial derivatives of the function $g$.
Answer
Q. Answer the following given that a computational graph of a function has N inputs and M outputs.
- True or False?
  - (a) Forward and reverse mode AD always yield the same result.
  - (b) In reverse mode AD there are fewer operations (time) and less space for intermediates (memory).
  - (c) The cost of forward mode grows with N.
  - (d) The cost of reverse mode grows with M.
Answer
Q.
- Given the function:
Answer
Q. Answer the following questions:
- Which differentiation method is inherently prone to rounding errors?
- Define the term symbolic differentiation.
Answer
Q. Answer the following questions:
- Implement the sigmoid function
$$ σ(x) = \frac{1}{1 + e^{-x}} $$
symbolically using a Python-based SymPy program.
- Differentiate the sigmoid function using SymPy and compare the result with the analytical derivative
$$ σ'(x) = σ(x)(1 − σ(x)). $$
- Using SymPy, evaluate the gradient of the sigmoid function at $x = 0$.
- Using SymPy, plot the resulting gradient of the sigmoid function.
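For reference, a minimal SymPy sketch of the symbolic-differentiation workflow (shown here for tanh rather than the sigmoid, which is left as the exercise):

```python
import sympy as sp

x = sp.symbols('x')
f = sp.tanh(x)

df = sp.diff(f, x)          # symbolic derivative: 1 - tanh(x)**2
print(sp.simplify(df))

print(df.subs(x, 0))        # gradient evaluated at x = 0 -> 1

# sp.plot(df, (x, -5, 5))   # uncomment to plot the gradient
```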
Answer
Q. You will most likely not be given such a long programming task during a face-to-face interview. Nevertheless, an extensive home programming assignment is typically given at many of the start-ups I am familiar with. You should allocate approximately four to six hours to completely answer all questions in this problem.
We discussed the Beta-Binomial model extensively in chapter 3. Recall that the Beta-Binomial distribution is frequently used in Bayesian statistics to model the number of successes in $n$ trials. We now employ SymPy to do the same; demonstrate computationally how a prior distribution is updated to develop into a posterior distribution after observing the data, via the relationship of the Beta-Binomial distribution.
Provided the probability of success, the number of successes after $n$ trials follows a binomial distribution. Note that the beta distribution is a conjugate prior for the parameter of the binomial distribution. In this case, the likelihood function is binomial, and a beta prior distribution yields a beta posterior distribution. Recall that for the Beta-Binomial distribution the following relationships exist (5.27):

| Prior of θ | Likelihood | Posterior of θ | Posterior Mean |
|---|---|---|---|
| Beta(a, b) | Binomial(n, θ) | Beta(a + x, b + n − x) | (a + x)/(a + b + n − x) |
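As a hedged illustration of the conjugate update in (5.27) (a short SymPy sketch with arbitrary values $a=2$, $b=7$, $n=10$, $x=3$; it is not the full solution to the tasks below):

```python
import sympy as sp

theta = sp.symbols('theta', positive=True)
a, b, n, x = 2, 7, 10, 3

# Beta(a, b) prior and Binomial(n, theta) likelihood
prior = sp.gamma(a + b) / (sp.gamma(a) * sp.gamma(b)) * theta**(a - 1) * (1 - theta)**(b - 1)
likelihood = sp.binomial(n, x) * theta**x * (1 - theta)**(n - x)

posterior = sp.simplify(prior * likelihood)
# Up to a normalising constant this is theta**(a + x - 1) * (1 - theta)**(b + n - x - 1),
# i.e. a Beta(a + x, b + n - x) density, as the table above states.
print(posterior)
```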
- Likelihood: the starting point for our inference problem is the likelihood, the probability of the observed data. Find the likelihood function symbolically using SymPy. Convert the SymPy representation to a purely NumPy-based callable function with a lambda expression. Evaluate the likelihood function at $θ = 0.5$ with 50 successful trials out of 100.
- Prior: the Beta distribution. Define the Beta distribution, which will act as our prior distribution, symbolically using SymPy. Convert the SymPy representation to a purely NumPy-based callable function. Evaluate the Beta distribution at $θ = 0.5, a = 2, b = 7$.
- Plot the Beta distribution using the NumPy-based function.
- Posterior: find the posterior distribution by multiplying our Beta prior by the Binomial likelihood symbolically using SymPy. Convert the SymPy representation to a purely NumPy-based callable function. Evaluate the posterior distribution at $θ = 0.5, a = 2, b = 7$.
- Plot the posterior distribution using the NumPy-based function.
- Show that the posterior distribution has the same functional dependence on $θ$ as the prior, i.e. that it is just another Beta distribution.
- Given the prior $Beta(θ|a = 2, b = 7) = 56θ(1 − θ)^6$ and the likelihood $Bin(r = 3|n = 50, θ) = 19600\,θ^3(1 − θ)^{47}$, find the resulting posterior distribution and plot it.