Homework 6

Homework Submission Workflow

When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.

Important: Failure to do any of the following will result in lost points:

  • Submit one PDF file and one notebook per group

  • Enter all the group members in your PDF Gradescope submission, not just in your documents. Look at this Piazza Post if you don't know how to do this

  • Match each answer (not question!) with the appropriate page in Gradescope

  • Avoid large blank spaces in your PDF file

Part 1: Newton's Method

The file hw6.py that comes with this assignment defines the function lineSearch we developed in homework 5, as well as three functions that compute the value, gradient, and Hessian of the Rosenbrock function we encountered in that assignment. These three functions are called Rosenbrock, RosenGrad, and RosenHess, and each takes a numpy array $\mathbf{z}\in\mathbb{R}^2$ as its argument. They return, respectively, a scalar, a numpy array of shape $(2,)$, and a numpy array of shape $(2, 2)$. The file hw6.py also defines a floating-point value small to be used as a default value for numbers to be considered "small."

Problem 1.1

Write a function with header

def Newton(f, g, H, z0, maxK=1000, delta=hw6.small, normalize=False):

that takes three functions f, g, H, which compute the value, gradient, and Hessian of some function $f$, and a starting point z0 (a one-dimensional numpy array), and finds a minimum of $f$ by Newton's method. The optional parameter maxK is the maximum number of iterations, and delta is a threshold on the Euclidean norm of $\mathbf{z}_{k+1} - \mathbf{z}_{k}$: the algorithm stops when that norm falls below delta. See the programming notes for the meaning of normalize.

Show your code and the value zNewton[-1] found by running Newton on the Rosenbrock function, starting at $\mathbf{z}_0 = (-1.2, 1)$. Also print out the number of points in zNewton.

Programming Notes

  • For this part, you may not use any library or other code that implements Newton's method, either exactly or substantively.

  • Say import hw6 to import the code from hw6.py.

  • Whenever the Hessian fails to be positive semidefinite, your code should fall back to a single call to lineSearch, rather than computing the Newton step. Remember to pass delta to lineSearch.

  • Rather than returning the value $\mathbf{z}^{\ast}$ found for $\arg\min_{\mathbf{z}} f(\mathbf{z})$, the function Newton should return a list of numpy arrays containing the points $\mathbf{z}_0,\ldots, \mathbf{z}_m$ encountered at each step of running Newton. Here, $\mathbf{z}_0$ is the initial point, and $m$ is the number of iterations, so if zNewton is the name for the value returned by Newton, then the minimum it found is zNewton[-1].

  • Newton returns the list of values encountered even when it fails to find a minimum within maxK iterations.

  • Even if the Hessian is positive semidefinite, it can be degenerate (that is, it can have zero determinant). In that case, the function numpy.linalg.solve would fail. To address this problem, solve the system with the pseudo-inverse. Specifically, to solve

$$A\mathbf{z} = \mathbf{b}$$

you would write

z = np.dot(np.linalg.pinv(A), b)

  • The keyword argument normalize will be set to True in a later problem, for reasons we will examine. For now, right after finding the new value of $\mathbf{z}$ in the current iteration of Newton's method (call it z1), add code that checks whether normalize is True and, if so, normalizes the value to have norm $1$:

      if normalize:
          if not np.all(z1 == 0):
              z1 /= np.linalg.norm(z1)
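
For concreteness, here is a minimal sketch of how these notes fit together. It is only a sketch: the positive-semidefiniteness test and the exact way delta is passed to lineSearch are assumptions that you should check against your own homework 5 code.

    import numpy as np
    import hw6

    def Newton(f, g, H, z0, maxK=1000, delta=hw6.small, normalize=False):
        zs = [np.array(z0, dtype=float)]        # history of points, starting with z_0
        for _ in range(maxK):
            z = zs[-1]
            Hz = H(z)
            if np.min(np.linalg.eigvalsh(Hz)) >= 0:
                # Hessian is positive semidefinite: take a Newton step,
                # using the pseudo-inverse in case Hz is degenerate
                z1 = z - np.dot(np.linalg.pinv(Hz), g(z))
            else:
                # otherwise fall back to a single line search
                # (the exact signature of hw6.lineSearch is an assumption)
                z1 = hw6.lineSearch(f, g, z, delta=delta)
            if normalize:                        # normalize right after finding the new value
                if not np.all(z1 == 0):
                    z1 = z1 / np.linalg.norm(z1)
            zs.append(z1)
            if np.linalg.norm(z1 - z) < delta:   # stop when the step becomes small
                break
        return zs                                # full history, even if maxK is exhausted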

Problem 1.2

How does the number of iterations from Newton compare with the number you found for steepest descent in homework 5 on the same optimization problem? You need not give exact numbers.

Part 2: The Geometry of Newton's Method

In this part, you will plot the history of steps taken by Newton in a way that clarifies the method's geometry. This problem is also a good exercise in geometry and linear algebra. A theoretical prelude is needed first.

The Geometry of Newton's Method in Two Dimensions

Let $f(\mathbf{z})$ be a function from $\mathbb{R}^2$ to $\mathbb{R}$, let $\mathbf{z}_0$ be the initial point in Newton's method, and define

$$ f_0 = f(\mathbf{z}_0) \;\;\;\;,\;\;\;\; \mathbf{g}_0 = \nabla f(\mathbf{z}_0) \;\;\;\;,\;\;\;\; H_0 = H_f(\mathbf{z}_0) $$ to be the value, gradient, and Hessian of $f$ at $\mathbf{z}_0$. Throughout this part, assume that $H_0$ is positive definite (a stronger condition than positive semidefinite).

Further, let $\mathbf{z}_1$ be the value of $\mathbf{z}$ obtained with one Newton step from $\mathbf{z}_0$, and let

$$ \Delta\mathbf{z} = \mathbf{z} - \mathbf{z}_0\;. $$

Then the method relies on the following approximation of $f$:

$$ f(\mathbf{z}) \approx g_2(\mathbf{z}) = f_0 + \mathbf{g}_0^{T} \Delta\mathbf{z} + \frac{1}{2} \Delta\mathbf{z}^T H_0 \Delta\mathbf{z} \;. $$

The isocontour of $g_2$ through $\mathbf{z}_0$ is the ellipse $E$ with equation

$$ g_2(\mathbf{z}) = f_0 \;, $$

that is,

$$ \frac{1}{2} \Delta\mathbf{z}^T H_0 \Delta\mathbf{z} + \mathbf{g}_0^{T} \Delta\mathbf{z} = 0\;. \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(1) $$

By construction, the minimum of $g_2$ is at $\mathbf{z}_1$: setting the gradient of $g_2$ to zero gives $H_0\,\Delta\mathbf{z} = -\mathbf{g}_0$, whose solution is exactly the Newton step $\mathbf{z}_1 = \mathbf{z}_0 - H_0^{-1}\mathbf{g}_0$.

Let $\mathbf{i}$ and $\mathbf{j}$ be two unit vectors on the plane that point along the axes of the ellipse $E$. Also, let $\ell_1, \ell_2$ be the half-lengths of the axes (a half-length is half the length). If we define the two $2\times 2$ matrices

$$ R = \left[\begin{array}{c}\mathbf{i}^T\\\mathbf{j}^T \end{array}\right] \;\;\;\;\text{and}\;\;\;\; L = \left[\begin{array}{cc}\ell_1 & 0\\0 & \ell_2 \end{array}\right] $$

then the ellipse has equation

$$ (\mathbf{z} - \mathbf{z}_1)^T R^T L^{-2} R (\mathbf{z} - \mathbf{z}_1) = 1\;. \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;(2) $$

To compare equations (1) and (2), which represent the same ellipse, we find the eigenvalues and eigenvectors of $H_0$, perhaps by using the function numpy.linalg.eigh, since $H_0$ is symmetric. Let $d$ and $V$ be the arrays returned by this function (read the manual). The columns of $V$ are the unit eigenvectors of $H_0$, and the vector $d$ contains the corresponding eigenvalues. Because of this, and by the definition of eigenvalues and eigenvectors, we can write

$$ H_0 = V D V^T $$ where $D$ is a diagonal matrix with the entries of $d$ on the diagonal.
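
As a concrete, self-contained reference for numpy's conventions here (eigenvalues in ascending order, eigenvectors as columns), not code you are required to show:

    import numpy as np

    H0 = np.array([[2.0, 1.0],
                   [1.0, 3.0]])        # example of a symmetric, positive definite Hessian
    d, V = np.linalg.eigh(H0)          # d: eigenvalues in ascending order
                                       # V: column V[:, i] is the unit eigenvector for d[i]
    # sanity check: V @ np.diag(d) @ V.T reproduces H0 up to rounding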

Since equations (1) and (2) represent the same ellipse, the rows of $R$ and the columns of $V$ must point along the same axes, and therefore be the same vectors up to signs. Because these matrices always show up in pairs in the equations above, the signs cancel, so we can ignore them and write

$$ R = V^T\;. $$

Scaling, on the other hand, is different in equations (1) and (2), so the two diagonal matrices $L^{-2}$ and $D$ differ by a multiplicative factor $\alpha$:

$$ \alpha L^{-2} = D\;. $$

We could find $\alpha$ by algebraic manipulation, but there is an easier way: Rewrite equation (2) in terms of $V$ and $D$, and then require $\mathbf{z}_0$ to satisfy the equation, since $\mathbf{z}_0\in E$ by construction. This yields

$$ \frac{1}{\alpha}\ \Delta_1^T\, V\, D\, V^T \Delta_1 = 1 $$

that is,

$$ \alpha = \Delta_1^T\, H_0 \Delta_1 $$

where $\Delta_1 = \mathbf{z}_1 - \mathbf{z}_0$.
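
Combining this value of $\alpha$ with $\alpha L^{-2} = D$ gives the half-lengths of the axes explicitly:

$$ L = \sqrt{\alpha}\, D^{-1/2}\;, \;\;\;\;\text{that is,}\;\;\;\; \ell_i = \sqrt{\alpha / d_i} \;\;\;\;\text{for } i = 1, 2\;, $$

where $d_1, d_2$ are the entries of $d$ (all positive, since $H_0$ is positive definite).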

Drawing Newton Steps

The idea is now to display each Newton step by drawing the isocontour $E_k$ through $\mathbf{z}_k$ at every iteration. For notational simplicity, redefine

$$ \mathbf{z}_0 = \mathbf{z}_k \;\;\;\;\text{and}\;\;\;\; \mathbf{z}_1 = \mathbf{z}_{k+1} \;\;\;\;\text{and}\;\;\;\; E = E_k\;. $$

To draw $E$, we use the function matplotlib.patches.Ellipse (read the manual), which requires the center of $E$, the angle $\theta$ formed by the first axis of $E$ with the horizontal axis of the reference system, and the full lengths width and height of the axes (not the half-lengths!).

The point $\mathbf{z}_0$ is on the ellipse $E$, and the point $\mathbf{z}_1$ is its center.

Since the first column of $V$ is the unit vector $\mathbf{i}$ along the first axis of $E$, and $\theta$ is the angle that this axis forms with the horizontal, we see that (using Python subscripts to avoid confusion)

$$ V_{00} = \cos\theta\;\;\;\;\text{and}\;\;\;\; V_{10} = \sin\theta $$

so that we can find $\theta$ by using the function numpy.arctan2:

theta = numpy.arctan2(V[1][0], V[0][0])

However, this function returns $\theta$ in radians and in the interval $(-\pi, \pi]$, while matplotlib.patches.Ellipse requires an angle in degrees and in the interval $[0, 360)$, so a conversion is needed.
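
One possible way to do the conversion (an illustrative helper, not required code):

    import numpy as np

    def axisAngleDegrees(V):
        # angle of the first eigenvector (first column of V) with the horizontal axis
        theta = np.arctan2(V[1][0], V[0][0])   # radians, in (-pi, pi]
        return np.degrees(theta) % 360         # degrees, mapped into [0, 360)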

Problem 2.1

Write a function with header

def ellipse(z0, z1, Hessian, color='red'):

that takes two consecutive points z0 and z1 on a Newton optimization path, together with a function Hessian that computes the Hessian of some function from $\mathbb{R}^2$ to $\mathbb{R}$, and returns the ellipse $E$ associated with that step.

Show your code and display the ellipse for the first Newton step of your optimization in the previous part.

Programming Notes

  • Use the default color for the ellipse.

  • Display your ellipse on top of the contours of the Rosenbrock function for values of the two components of $\mathbf{z}$ between $-4$ and $3$. This set of contours is in the same style as the one you drew in homework 5, and is computed by the function RosenContours in hw6.py. Look at the code to see what this function returns.

  • Also draw a red dot at $\mathbf{z}_0$ and a red cross (x) at $\mathbf{z}_1$.
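
Putting the geometry above together, a minimal sketch of ellipse might look as follows. The choice to return a matplotlib.patches.Ellipse patch (to be added to the axes with add_patch) is an assumption, and you should check the axis lengths and angle against your own derivation.

    import numpy as np
    from matplotlib.patches import Ellipse

    def ellipse(z0, z1, Hessian, color='red'):
        H0 = Hessian(z0)                              # Hessian at the current point
        d, V = np.linalg.eigh(H0)                     # eigenvalues d, unit eigenvectors in the columns of V
        delta1 = z1 - z0
        alpha = np.dot(delta1, np.dot(H0, delta1))    # alpha = Delta_1^T H_0 Delta_1
        ell = np.sqrt(alpha / d)                      # half-lengths, from alpha L^{-2} = D
        theta = np.degrees(np.arctan2(V[1][0], V[0][0])) % 360   # first-axis angle in degrees, in [0, 360)
        return Ellipse(xy=z1, width=2 * ell[0], height=2 * ell[1],
                       angle=theta, fill=False, color=color)

Under these assumptions, the patch for the first step would be displayed with something like plt.gca().add_patch(ellipse(zNewton[0], zNewton[1], hw6.RosenHess)), after drawing the contours and before plotting the red dot and cross.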

Problem 2.2

Make a new plot that displays the ellipses for the first four steps of the Newton optimization path you found in the previous part of this assignment.

Show your code and the resulting plot.

Programming Notes

  • The ellipses are in a single diagram, on top of the contour diagram produced by RosenContours.

  • The four ellipses should be in red, green, blue, and orange, in this order.

  • For each ellipse, superimpose a dot at $\mathbf{z}_k$ and a cross (x) at $\mathbf{z}_{k+1}$, both in the same color as the ellipse. This will cause some dots and crosses for different ellipses to be drawn in the same position. This is intended.

Part 3: Binary Linear Classifiers

The file hw6.py also defines a function makeData that returns a training set T and a test set S if you call

T, S = hw6.makeData()

The data points are images of handwritten digits, just as in a previous assignment. However, only the digits 4 and 9 are used, so the classification problem is binary. Digit 4 is given label 0, and digit 9 is given label 1.

Section 2 of the class notes on binary linear predictors defines the cross-entropy loss, the score function, and the risk function for a binary logistic-regression classifier with parameter vector $\mathbf{v}\in\mathbb{R}^m$, where $m = d+1$ and $\mathbf{x}\in\mathbb{R}^d$. The notes in that Section also define the gradient and Hessian of the risk. Read and understand that Section carefully.

Problem 3.1

Define functions with headers

def lossXE(y, p):
def score(x, v):

that implement the cross-entropy loss and score function for a logistic-regression classifier.

Show your code, and a single diagram that shows the two plots of lossXE(0, p) and lossXE(1, p) for p between $10^{-3}$ and $1 - 10^{-3}$.

Programming Notes

  • Use natural logarithms.

  • If the score is too close to either 0 or 1, then the cross-entropy loss may become undefined. Also, if the argument to the score function is too small or too large, the exp function can cause numerical overflow or underflow. To prevent both problems, if

$$ a = b + \mathbf{w}^T\mathbf{x} $$

in the score function, then clip the value of a between -10 and 10:

a = max(-10, min(10, a))
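
As a point of reference, here is a minimal sketch under these conventions; the layout of $\mathbf{v}$ as $(b, \mathbf{w})$ with the bias first is an assumption, so check it against the class notes.

    import numpy as np

    def lossXE(y, p):
        # cross-entropy loss for true label y (0 or 1) and predicted probability p
        return -(y * np.log(p) + (1 - y) * np.log(1 - p))

    def score(x, v):
        # logistic-regression score sigma(b + w^T x), with v = (b, w) assumed
        a = v[0] + np.dot(v[1:], x)
        a = max(-10, min(10, a))        # clip to avoid overflow and probabilities of exactly 0 or 1
        return 1 / (1 + np.exp(-a))

For the diagram, one can evaluate lossXE(0, p) and lossXE(1, p) on np.linspace(1e-3, 1 - 1e-3, 200) and draw both curves in a single matplotlib figure.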

Problem 3.2

Define functions with headers

def riskXE(v):
def riskXEGrad(v):
def riskXEHess(v):

that implement the cross-entropy risk on set T (loaded by hw6.makeData()), its gradient, and Hessian, following the formulas in the notes.

Then run your own Newton to train the optimal parameter vector $\mathbf{v}$ on T, starting with a vector $\mathbf{v}_0$ of all zeros.

Show your code and give the number of iterations to convergence.

Programming Notes

  • For compatibility with Newton, these functions use T as defined in the environment, rather than taking T as a parameter. This is not best coding practice, because T is hard-wired into the functions and cannot be changed, but it simplifies your task.

  • Use the default values for the keyword arguments of Newton, except that normalize is set to True (more on this point in a later problem).

  • Remember that Newton outputs the whole history, and this makes it easy to tell the number of iterations.

  • Better numerical methods exist. This problem has purely pedagogical value.
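
As a hedged sketch of the shape these functions take, here are the standard logistic-regression expressions. The data layout (T unpacked into an $n\times d$ matrix X and a length-$n$ label vector y, with a synthetic stand-in so the snippet runs) and the bias-first parameter convention are assumptions; the formulas in the class notes take precedence.

    import numpy as np

    # synthetic stand-in for the training set; replace X and y with the actual contents of T
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                         # n x d data matrix
    y = (X[:, 0] > 0).astype(float)                       # n labels in {0, 1}
    Xt = np.hstack((np.ones((X.shape[0], 1)), X))         # augment with a leading 1 for the bias term

    def riskXE(v):
        p = 1 / (1 + np.exp(-np.clip(Xt @ v, -10, 10)))   # clipped scores, as in Problem 3.1
        return np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p)))

    def riskXEGrad(v):
        p = 1 / (1 + np.exp(-np.clip(Xt @ v, -10, 10)))
        return Xt.T @ (p - y) / len(y)                    # (1/n) sum (p_i - y_i) x_i

    def riskXEHess(v):
        p = 1 / (1 + np.exp(-np.clip(Xt @ v, -10, 10)))
        return (Xt * (p * (1 - p))[:, None]).T @ Xt / len(y)   # (1/n) sum p_i (1 - p_i) x_i x_i^T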

Problem 3.3

The solution is obtained with a variant of Newton's method that normalizes the value of $\mathbf{z}$ at every iteration, because the parameter normalize is set to True. In this application of the method, the unknown vector $\mathbf{z}$ really represents $\mathbf{v}$, the vector of parameters of the logistic regressor.

Normalization changes the method quite substantially, and some of the guarantees of Newton's method no longer apply. We ignore this issue in this assignment.

Why do we normalize the value of $\mathbf{v}$ at every iteration? Answer this question by explaining why the norm of $\mathbf{v}$ is meaningless for this problem.

Problem 3.4

Write functions with headers

def h(x, v):
def risk01(v, S):

that implement the logistic-regression classifier $h$ with parameter $\mathbf{v}$ and the zero-one risk on set $S$.

Show your code and the values of the zero-one training and test risks for the optimal parameter vector $\mathbf{v}^{\ast}$ you found in Problem 3.2. Express these risks as percentages.
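
A minimal sketch, reusing score from the sketch in Problem 3.1; treating S as an iterable of (x, y) pairs is an assumption to adapt to what hw6.makeData() actually returns.

    def h(x, v):
        # predict label 1 when the estimated probability is at least 1/2, else label 0
        return 1 if score(x, v) >= 0.5 else 0

    def risk01(v, S):
        # zero-one risk: fraction of misclassified samples (multiply by 100 to report a percentage)
        errors = sum(h(x, v) != y for x, y in S)
        return errors / len(S)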