COMPSCI 371 Homework 3¶

Homework Submission Workflow¶

When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.

Note on Exam-Style Problems¶

Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style. These problems are similar in format and content to those in the exams, and they are graded like the other problems. Please use these problems as examples of what may come up in an exam.

As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.

Problem 0 (3 points)¶

Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?

Note¶

This question requires no explicit answer from you, and no space is provided for an answer in the template file.

However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope of the location of each answer.

Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.

Part 1: Gradient Descent with Analytic Line Search¶

As discussed in class, line search, a component of some variants of gradient descent, is a numerical procedure for finding a local minimum of a function $f(\mathbf{z})\ :\ \mathbb{R}^m \rightarrow \mathbb{R}$ restricted to a particular direction $\mathbf{p}\in\mathbb{R}^m$ from a point $\mathbf{z}_0\in\mathbb{R}^m$. In other words, line search finds a minimum $\alpha^\ast$ of $$ \phi(\alpha) = f(\mathbf{z}_0 + \alpha \mathbf{p}) \;. $$

If the function $f$ is simple, we may be able to find the minimum of $\phi(\alpha)$ by hand using calculus, by finding the solution of $\phi'(\alpha) = 0$ that is nearest to $\alpha = 0$ (here the prime denotes the first derivative). In that case, we say that we perform an analytic line search, rather than a numerical one. An example of such a function $f\ :\ \mathbb{R}^2 \rightarrow \mathbb{R}$ is as follows:

$$ f(x, y) = \sin(x^2 y^2 - 1) $$

where for notational simplicity we write

$$ \mathbf{z} = \left[\begin{array}{c} x \\ y \end{array}\right]\;. $$
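As a purely illustrative aside (not needed to solve the problems below), the sketch that follows shows what the restriction $\phi(\alpha) = f(\mathbf{z}_0 + \alpha \mathbf{p})$ looks like in code. The function f, the point z0, and the direction p in this sketch are made-up placeholders, deliberately different from the sine function above.

import numpy as np

def f(z):
    return z[0] ** 2 + 3. * z[1] ** 2    # placeholder function, for illustration only

z0 = np.array([1., 1.])                  # placeholder starting point
p = np.array([-1., -1.])                 # placeholder search direction

def phi(alpha):
    # restriction of f to the line through z0 with direction p
    return f(z0 + alpha * p)

print(phi(0.), phi(0.5))                 # f at z0, and f half a unit step along p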

Note¶

Problems in this part of this assignment ask you to give numerical values for some quantities. Your numerical results will typically be approximate. Give them with three decimals after the period. It is fine to use Python or a calculator for the numerical computations when needed, but you should not use any software that performs gradient descent or line search in this part of the assignment.

Problem 1.1 (Exam Style)¶

Perform one step of analytic gradient descent on the function $f$ given above and starting at point

$$ \mathbf{z}_0 = \left[\begin{array}{c} x_0 \\ y_0 \end{array}\right] = \left[\begin{array}{c} 1 \\ 1 \end{array}\right]\;. $$

Specifically, find an algebraic expression for $\alpha_0$, the positive value of $\alpha$ that yields the local minimum of $\phi(\alpha)$ that is closest to $\alpha = 0$ (that is, the first local minimum you encounter as $\alpha$ grows away from zero in the search direction). Then convert that expression to a decimal number. Finally, give the numerical values of $\mathbf{z}_1$ (that is, the vector $\mathbf{z}$ that corresponds to $\alpha_0$) and of $f(\mathbf{z}_1)$.

Show the main steps of your calculations.

Problem 1.2¶

This problem gives you an opportunity to check your answers for the previous problem.

The function scipy.optimize.golden implements essentially the line-search method described in the class notes on optimization, except that in each iteration the appropriate interval of the bracketing triple is split according to the golden ratio, rather than into two equal parts. The golden ratio idea is described in Appendix A of the notes, but you may ignore the distinction between the two methods for this problem.

Use the function scipy.optimize.golden to write a function with the following header:

def line_search(f, g, z0):

The argument f is a function $f(\mathbf{z}) : \mathbb{R}^m \rightarrow \mathbb{R}$ and g is a function $g(\mathbf{z}) : \mathbb{R}^m \rightarrow \mathbb{R}^m$ that returns the gradient of $f$ at $\mathbf{z}$. The numpy vector z0 is an initial value for $\mathbf{z}$.

The function line_search returns a triple of values z1, f1, ne. The numpy vector z1 is the point $\mathbf{z}_1\in\mathbb{R}^m$ reached through a line search starting at z0. The scalar f1 is $f(\mathbf{z}_1)$. The integer ne is the number of times that scipy.optimize.golden called the function f.

Then write the two functions f and g for the function $f$ given in this part of the assignment and show the result of calling line_search with the value of $\mathbf{z}_0$ given in Problem 1.1.

Printout Format¶

Your result should be printed on a single line with the following format:

f(xxx) = xxx -> f(xxx) = xxx in xxx evaluations

where the strings xxx are replaced (in this order) by the numerical values of $\mathbf{z}_0$, $f(\mathbf{z}_0)$, $\mathbf{z}_1$, $f(\mathbf{z}_1)$, and the number of function evaluations ne returned by line_search. All floating-point values should be printed in fixed-point notation with up to three decimal digits after the period (so anything smaller than 0.0005 becomes 0.). For numpy arrays this format can be achieved with the following context manager:

with np.printoptions(precision=3, suppress=True):
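For instance, the following sketch prints a line in the required format. The values of z0, f0, z1, f1, and ne below are made-up placeholders, not the answer to this problem, and the scalar formatting shown is just one possible way to limit the output to three decimal digits.

import numpy as np

z0, f0 = np.array([1., 2.]), 5.                 # placeholder values only
z1, f1, ne = np.array([0.25, 0.5]), 0.3125, 30  # placeholder values only
with np.printoptions(precision=3, suppress=True):
    print(f'f({z0}) = {f0:.3f} -> f({z1}) = {f1:.3f} in {ne} evaluations')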

Programming Notes¶

  • Please read the manual to understand what scipy.optimize.golden requires and returns.
  • The function scipy.optimize.golden has an optional parameter brack that lets you set the three values $(a, b, c)$ of the initial bracketing triple or just the first two values $(a, b)$ (using the same notation as in the class notes). Set brack=(0., 1.e-3).
  • If the optional parameter full_output to scipy.optimize.golden is set to True, the function returns a triple $(\alpha_0, \phi(\alpha_0), n)$ where $\alpha_0$ is the step that scipy.optimize.golden finds, $\phi$ is the function $\mathbb{R} \rightarrow \mathbb{R}$ you pass to it, and $n$ is the number of evaluations of $\phi$. A small usage sketch appears after this list.
  • Note that scipy.optimize.golden returns a descent rate $\alpha_0$ while your line_search should return $\mathbf{z}_1$.
  • If scipy.optimize.golden fails, your function line_search should return z0 and f0 in lieu of z1 and f1. No need to issue a warning in that case (although you would do this in production-quality code).
  • Use your analytical result to write the function g.
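As a usage sketch of the options mentioned above (and not a solution to this problem), the snippet below minimizes a made-up one-dimensional placeholder function with scipy.optimize.golden, using brack=(0., 1.e-3) and full_output=True.

from scipy import optimize

def phi(alpha):
    return (alpha - 0.3) ** 2   # placeholder function with a minimum at alpha = 0.3

alpha_0, phi_0, n_evals = optimize.golden(phi, brack=(0., 1.e-3), full_output=True)
print(alpha_0, phi_0, n_evals)  # approximately 0.3, 0.0, and the number of evaluations of phi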

Problem 1.3 (Exam Style)¶

If you did everything correctly in the previous two problems, the analytic and numerical answers should be consistent with each other.

However, this is not guaranteed in general. That is, for an arbitrary function $f$, the result of the numerical line search may differ from the solution of the analytic line search as defined in Problem 1.1, even if perfect arithmetic were used. Explain briefly why.

Problem 1.4 (Exam Style)¶

By construction, the value $\alpha_0$ you found in problem 1.1 is a local minimum of $\phi(\alpha)$. Is the corresponding point $\mathbf{z}_1$ a local minimum for $f(\mathbf{z})$? Prove your answer.

Note¶

A global minimum is also a local minimum. The type of local minimum (strong or weak) does not matter.

Problem 1.5 (Exam Style)¶

The analytic line search performed in Problem 1.1 is myopic, in that it is content to find the first local minimum of $\phi(\alpha)$ for $\alpha > 0$.

Since the calculation is analytic, we can do better. Specifically, use calculus to find the first global minimum of $\phi(\alpha)$ (for $\alpha > 0$) for the same function $f(x, y)$ examined above and starting at the same point $\mathbf{z}_0$.

As before, give $\alpha_0$, $\mathbf{z}_1$, and $f(\mathbf{z}_1)$. For $\alpha_0$ give both an algebraic expression and a numerical value with three decimal digits after the period. For the other quantities just give numerical values with three decimal digits after the period.

Show the main steps of the calculation where it differs from what you did in Problem 1.1.

Problem 1.6 (Exam Style)¶

Is the point $\mathbf{z}_1$ you found in Problem 1.5 a global minimum for $f(\mathbf{z})$? If not, why not? If it is, is this the only global minimum for $f(\mathbf{z})$? Explain your answers as succinctly as you can.

Part 2: Gradient Descent¶

In this part of the assignment you will implement a basic version of gradient descent and then use it to explore the effects of various choices and parameters. Since you will use your code repeatedly in this part, it is important that you develop it with care, to avoid issues in later problems.

It is also important to understand the distinction between the number of iterations of gradient descent and the number of function evaluations. An iteration occurs every time you replace the current value $\mathbf{z}_k$ of the independent variable with a new value $\mathbf{z}_{k+1}$ in the main loop of gradient descent and then compute a new descent direction. If you use a descent rate $\alpha$ (fixed or otherwise), each iteration involves just one evaluation of the function $f$ being optimized, so the number of function evaluations in this case equals the number of iterations. We do not count gradient evaluations as function evaluations in this part.

If on the other hand you use line search, a single run of line search involves many function evaluations, so in this case the number of function evaluations per iteration can be much greater than one. The function line_search you wrote tells you this number through its return value ne.

A Simple Sandbox¶

The simplest nontrivial function we can use to experiment with gradient descent is a 2D paraboloid,

$$ f(\mathbf{z}) = f(x, y) = a x^2 + b y^2 \;. $$

When $a$ and $b$ are very different we say that the paraboloid is elongated. We will use an elongation $b/a = 10$ in all our experiments.

Of course we know the minimum of this function, but our descent algorithm does not. In addition, every smooth function looks like some paraboloid once you get close enough to a local minimum, so seeing what happens for this function is instructive.

Here is a function that lets you make both this function $f$ and its gradient $g$ for a given pair of coefficients.

In [1]:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
In [2]:
def paraboloid(z, axes):
    # f(x, y) = a x^2 + b y^2, where z = (x, y) and axes = (a, b)
    z2 = z * z
    p = axes[0] * z2[0] + axes[1] * z2[1]
    return p
In [3]:
def g_paraboloid(z, axes):
    # gradient of the paraboloid above: (2 a x, 2 b y)
    g = 2 * z * axes
    return g
In [4]:
def make_bowl_functions(axes):
    def f(z):
        return paraboloid(z, axes)

    def g(z):
        return g_paraboloid(z, axes)

    return f, g

Doing it this way ensures that $f$ and $g$ use the same parameters, since you can then write

In [5]:
bowl, g_bowl = make_bowl_functions((0.1, 1.))

We can draw isocontours of $f$ to visualize the function, as done below. We can also draw the unit circle on the $\mathbf{z}$ plane; this will turn out to be useful later.

In [6]:
def show_function(p, x_range=(-1.2, 1.2), y_range=(-1.2, 1.2), n=101, levels=10):
    # Draw isocontours of the two-variable function p over the given ranges and mark the origin.
    x = np.linspace(x_range[0], x_range[1], n)
    y = np.linspace(y_range[0], y_range[1], n)
    xx, yy = np.meshgrid(x, y)
    zz = np.stack((xx, yy), axis=0)
    pp = p(zz)
    plt.contour(x, y, pp, levels=levels)
    plt.plot(0., 0., '.', ms=8)
    plt.axis('square')
    plt.xticks([])
    plt.yticks([])
In [7]:
def draw_circle(center=(0., 0.), radius=1., samples=300, color='r'):
    omega = np.linspace(0., 2. * np.pi, samples)
    x, y = np.cos(omega), np.sin(omega)
    x = radius * x + center[0]
    y = radius * y + center[1]
    plt.plot(x, y, color=color)
In [8]:
show_function(bowl)
draw_circle()

Problem 2.1¶

Write a function with the following header:

def gradient_descent(
    f, g, z0,
    alpha=0.,
    momentum=0.,
    max_iterations=1000,
    min_gradient=1.e-6,
    min_step=1.e-8,
    stop_radius=0.,
    record=False
):

that implements several styles of gradient descent and, if record is set to True, remembers all the points $\mathbf{z}_k$ the descent traverses, as well as the corresponding function values $f(\mathbf{z}_k)$.

Arguments¶

  • The arguments f, g, and z0 are as for line_search.
  • If alpha has the default value 0 or is negative, the function gradient_descent uses line_search at each iteration. If alpha has a positive value, then gradient_descent uses that descent rate instead of performing a line search. The descent rate is fixed, $\alpha_k = \alpha = $ alpha .
  • If momentum is positive, then gradient_descent also uses momentum with a fixed value $\mu_k = \mu =$ momentum .
  • If the number of iterations exceeds max_iterations, the function gradient_descent simply returns the result of running that number of iterations. It is not necessary to print a warning message in that case (although that would be done in production-quality code).
  • The function gradient_descent stops if any of the following conditions are met:

    • The magnitude of the gradient of $f$ at the current point $\mathbf{z}_k$ falls below min_gradient.
    • The magnitude $\|\mathbf{z}_k - \mathbf{z}_{k-1}\|$ of the last step falls below min_step.
    • The current point $\mathbf{z}_k$ is within stop_radius from the origin (that is, if $\|\mathbf{z}_k\| < $ stop_radius). This is a peculiar stopping criterion that would make little sense in general code. It is only used in this assignment.

      The first two criteria are checked before a new point is computed (either by line search or with a descent rate). The third criterion is checked after a new point is computed. A small sketch of these tests, assuming Euclidean norms, appears after this list.

  • See Return Values below for the meaning of record.
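The following is a minimal sketch of how the three stopping tests above might be expressed, assuming that "magnitude" means the Euclidean norm (np.linalg.norm); the function names and variables are placeholders, not part of the required interface.

import numpy as np

def small_gradient(gk, min_gradient):
    return np.linalg.norm(gk) < min_gradient        # first stopping criterion

def small_step(zk, zk_old, min_step):
    return np.linalg.norm(zk - zk_old) < min_step   # second stopping criterion

def within_stop_radius(zk, stop_radius):
    return np.linalg.norm(zk) < stop_radius         # third stopping criterion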

Return Values¶

  • If record is False, the function gradient_descent returns a triple with the point $\mathbf{z}_k$ it found, the corresponding function value $f(\mathbf{z}_k)$, and the total number of function evaluations.
  • If record is True, the function gradient_descent returns a quadruple. The first three values are as above and the fourth value is a list [(z0, f0), (z1, f1), ...] of points and function values the function encountered during descent. Make sure that the first item in the list is the starting point and the last item is the final point.

Programming Notes¶

  • Set stop_radius=0.01 so that the descent does not take too many iterations. This is essentially to save you time.
  • Use default values for the parameters of gradient_descent except as explicitly noted.
  • Make sure you append copies of z0, z1, ... to the list if record is True. Otherwise, all entries would end up referring to the same point.
  • Analogous considerations hold if you store old values for later comparison. For instance, write z0_old = z0.copy(). If you were to write z0_old = z0 instead, then every time you change z0 in place the value of z0_old would change with it, as the short illustration after this list shows.
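Here is a minimal illustration of this numpy aliasing pitfall; the variable names are placeholders.

import numpy as np

z0 = np.array([1., 1.])
alias = z0            # alias and z0 refer to the same array
snapshot = z0.copy()  # snapshot is an independent copy

z0 -= 0.1             # in-place update of z0
print(alias)          # [0.9 0.9] -- changed along with z0
print(snapshot)       # [1. 1.]   -- unchanged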

What to Show¶

Show your code and the paths z0, z1, ... traversed by gradient_descent on the bowl function, using line search and starting at several points on the unit circle. Specifically, start at the points with coordinates

$$ \mathbf{z}_0 = (x_0, y_0) = (\cos\omega, \sin\omega) \;\;\;\text{where}\;\;\; \omega = 0, 18, 36, \ldots, 90 \;\;\text{degrees} \;. $$
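One possible way to generate these starting points is sketched below; the variable names are just suggestions.

import numpy as np

omegas = np.deg2rad(np.arange(0, 91, 18))                     # 0, 18, ..., 90 degrees in radians
starts = np.stack((np.cos(omegas), np.sin(omegas)), axis=1)   # one starting point (x0, y0) per row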

Then add a brief paragraph commenting on the results. What do you observe and what are some reasons for that?

Output Format¶

  • Show each of the six plots in a panel of a plt.subplot(2, 3, k).
  • The starting point of each run should be highlighted with a cross. For instance, if your starting point $\mathbf{z}_0$ is in numpy array z_start, do

      plt.plot(z_start[0], z_start[1], 'x', ms=12)
  • Each subplot should have a title with the angle $\omega$ in degrees, and the entire figure should have a title (use plt.suptitle) with the values of descent rate and momentum (which are both zero for this problem but will change later). A possible figure layout is sketched after this list.
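For orientation only, here is one possible skeleton for the figure layout described above. The calls to gradient_descent and the plotting of each path are omitted, the titles are only examples, and in your solution this scaffolding would live inside the show_paths function described below.

plt.figure(figsize=(12, 8), tight_layout=True)
for k, omega_deg in enumerate(range(0, 91, 18)):
    plt.subplot(2, 3, k + 1)
    show_function(bowl)          # isocontours of the bowl
    draw_circle()                # unit circle of starting points
    # ... run gradient descent from this starting point and plot the path here ...
    plt.title(f'omega = {omega_deg} degrees')
plt.suptitle('descent rate 0.0, momentum 0.0')
plt.show()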

Code Format¶

Encapsulate the code that does all of this (including making a figure and its subplots and drawing the bowl and the unit circle in each subplot) into a function with header

def show_paths(alpha=0., momentum=0.):

This will turn out to be useful later, as you will be asked to repeat the same experiment with different values of descent rate and momentum. Of course, make sure you pass these parameters to gradient_descent in your show_paths function.

For this problem, just run show_paths().

Problem 2.2¶

To gain sharper insight into the scenario from the previous problem, we can plot the number of function evaluations as a function of the angle $\omega$ of the starting point. We will sample $\omega$ more finely now.

Write code that plots $n_e(\omega)$, the number of function evaluations starting at $\mathbf{z}_0(\omega) = (\cos\omega, \sin\omega)$, under the same circumstances as in Problem 2.1.

Show your code and a plot for values of $\omega$ one degree apart in $[0, 90]$ degrees (so the plot spans 91 points). To keep the number of evaluations reasonable, set stop_radius=0.01 as before. Label the axes and include a title with the values of the descent rate and momentum (both zero for now).

Then comment in a brief paragraph on how your plot is consistent with the paths you plotted in Problem 2.1. Include some comments on the actual number of evaluations for key values of $\omega$.

Code Format¶

Encapsulate the code that does all of this into a function with header

def show_evaluations(alpha=0., momentum=0.):

which you will use again later.

However, unlike show_paths, the function show_evaluations should not include plt.figure, plt.subplot, or plt.show commands. For this problem, call your code as follows:

plt.figure(figsize=(8, 8), tight_layout=True)
show_evaluations()
plt.show()

In later problems, you will place a few plots like these side by side, also using plt.subplot commands. This is why you want to write show_evaluations without those commands.

Problem 2.3¶

Repeat the experiments in the previous two problems for zero momentum and descent rates alpha equal to 0.01, 0.5, and 1.

Specifically, first show the three resulting number-of-evaluations plots in the three panels of a plt.subplot(1, 3, k). Then show the paths as three separate figures (in three separate notebook cells for easier pagination). Use the same parameters as before except for alpha.

Also comment briefly on what goes wrong for descent rates 0.01 and 1, and compare (in broad strokes) the number of function evaluations for a descent rate of 0.5 with the number of evaluations used by line search.

Problem 2.4¶

Repeat the experiments in Problem 2.3 for a descent rate of 0.5 and momentum values of 0.1, 0.5, and 0.9.

Then comment briefly on the advantages and possible pitfalls of different momentum values.