When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.
Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style. These are problems whose format and contents are similar to those in the exams, and are graded like the other problems. Please use these problems as examples of what may come up in an exam.
As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.
Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?
This question requires no explicit answer from you, and no space is provided for an answer in the template file.
However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope of the location of each answer.
Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.
As discussed in class, line search, a component of some variants of gradient descent, is a numerical procedure for finding a local minimum of a function $f(\mathbf{z})\ :\ \mathbb{R}^m \rightarrow \mathbb{R}$ restricted to a particular direction $\mathbf{p}\in\mathbb{R}^m$ from a point $\mathbf{z}_0\in\mathbb{R}^m$. In other words, line search finds a minimum $\alpha^\ast$ of $$ \phi(\alpha) = f(\mathbf{z}_0 + \alpha \mathbf{p}) \;. $$
If the function $f$ is simple, we may be able to find the minimum of $\phi(\alpha)$ by hand using calculus, by finding the solution of $\phi'(\alpha) = 0$ that is nearest to $\alpha = 0$ (here the prime denotes the first derivative). In that case, we say that we perform an analytic line search, rather than a numerical one. An example of such a function $f\ :\ \mathbb{R}^2 \rightarrow \mathbb{R}$ is as follows:
$$ f(x, y) = \sin(x^2 y^2 - 1) $$
where for notational simplicity we write
$$ \mathbf{z} = \left[\begin{array}{c} x \\ y \end{array}\right]\;. $$
Problems in this part of this assignment ask you to give numerical values for some quantities. Your numerical results will typically be approximate. Give them with three decimals after the period. It is fine to use Python or a calculator for the numerical computations when needed, but you should not use any software that performs gradient descent or line search in this part of the assignment.
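If you want to spot-check your hand computations, a few lines of Python suffice. The sketch below only evaluates $\phi$ numerically (it performs no gradient descent or line search); the names z0 and p are placeholders, with p standing for whatever search direction a problem calls for.
import numpy as np

def f(z):
    # the function used in this part: f(x, y) = sin(x^2 y^2 - 1)
    x, y = z
    return np.sin(x ** 2 * y ** 2 - 1.)

def phi(alpha, z0, p):
    # phi(alpha) = f(z0 + alpha * p), the restriction of f to a line
    return f(z0 + alpha * p)

z0 = np.array([1., 1.])
p = np.array([-1., -1.])      # an example direction only, not necessarily the one you need
print(phi(0., z0, p))         # f(z0) = sin(0) = 0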
Perform one step of analytic gradient descent on the function $f$ given above and starting at point
$$ \mathbf{z}_0 = \left[\begin{array}{c} x_0 \\ y_0 \end{array}\right] = \left[\begin{array}{c} 1 \\ 1 \end{array}\right]\;. $$
Specifically, find an algebraic expression for $\alpha_0$, the positive value of $\alpha$ that yields the local minimum of $\phi(\alpha)$ that is closest to $\alpha = 0$ (that is, the first local minimum you encounter as $\alpha$ grows away from zero in the search direction). Then convert that expression to a decimal number. Finally, give the numerical values of $\mathbf{z}_1$ (that is, the vector $\mathbf{z}$ that corresponds to $\alpha_0$) and of $f(\mathbf{z}_1)$.
Show the main steps of your calculations.
This problem gives you an opportunity to check your answers for the previous problem.
The function scipy.optimize.golden implements essentially the line search method described in the class notes on optimization, except that in each iteration the appropriate interval of the bracketing triple is split according to the golden ratio, rather than into two equal parts. The golden ratio idea is described in Appendix A of the notes, but you may ignore the distinction between the two methods for this problem.
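For intuition only, a generic textbook golden-section search looks like the sketch below; the function name golden_section_min is made up here, and the details may differ from both the notes' method and scipy's implementation.
import math

def golden_section_min(phi, a, c, tol=1.e-8):
    # Textbook golden-section search on [a, c]; assumes phi has a single minimum there.
    invphi = (math.sqrt(5.) - 1.) / 2.          # 1 / golden ratio, about 0.618
    b, d = c - invphi * (c - a), a + invphi * (c - a)
    while abs(c - a) > tol:
        if phi(b) < phi(d):                     # minimum lies in [a, d]
            c = d
        else:                                   # minimum lies in [b, c]
            a = b
        b, d = c - invphi * (c - a), a + invphi * (c - a)
    return (a + c) / 2.

print(golden_section_min(lambda t: (t - 0.3) ** 2, 0., 1.))   # about 0.3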
Use the function scipy.optimize.golden to write a function with the following header:
def line_search(f, g, z0):
The argument f is a function $f(\mathbf{z}) : \mathbb{R}^m \rightarrow \mathbb{R}$ and g is a function $g(\mathbf{z}) : \mathbb{R}^m \rightarrow \mathbb{R}^m$ that returns the gradient of $f$ at $\mathbf{z}$. The numpy vector z0 is an initial value for $\mathbf{z}$.
The function line_search returns a triple of values z1, f1, ne. The numpy vector z1 is the point $\mathbf{z}_1\in\mathbb{R}^m$ reached through a line search starting at z0. The scalar f1 is $f(\mathbf{z}_1)$. The integer ne is the number of times that scipy.optimize.golden called the function f.
Then write the two functions f and g for the function $f$ given in this part of the assignment and show the result of calling line_search with the value of $\mathbf{z}_0$ given in Problem 1.1.
Your result should be printed on a single line with the following format:
f(xxx) = xxx -> f(xxx) = xxx in xxx evaluations
where the strings xxx are replaced (in this order) by the numerical values of $\mathbf{z}_0$, $f(\mathbf{z}_0)$, $\mathbf{z}_1$, $f(\mathbf{z}_1)$, and the number of function evaluations ne returned by line_search. All floating-point values should be printed in fixed-point notation with up to three decimal digits after the period (so anything smaller than 0.0005 becomes 0.). For numpy arrays this format can be achieved with the following context manager:
with np.printoptions(precision=3, suppress=True):
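For instance, a snippet along the following lines (the array and the scalar here are made-up values) prints in the required style:
import numpy as np

z = np.array([0.70710678, 0.70710678])
value = 0.0004
with np.printoptions(precision=3, suppress=True):
    print(f"f({z}) = {value:.3f}")      # prints f([0.707 0.707]) = 0.000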
A few programming notes:
- Make sure you understand what scipy.optimize.golden requires and returns.
- scipy.optimize.golden has an optional parameter brack that lets you set the three values $(a, b, c)$ of the initial bracketing triple, or just the first two values $(a, b)$ (using the same notation as in the class notes). Set brack=(0., 1.e-3).
- If the optional argument full_output to scipy.optimize.golden is set to True, the function returns a triple $(\alpha_0, \phi(\alpha_0), n)$ where $\alpha_0$ is the step that scipy.optimize.golden finds, $\phi$ is the function $\mathbb{R} \rightarrow \mathbb{R}$ you pass to it, and $n$ is the number of evaluations of $\phi$ (a short example follows these notes).
- scipy.optimize.golden returns a descent rate $\alpha_0$, while your line_search should return $\mathbf{z}_1$.
- If scipy.optimize.golden fails, your function line_search should return z0 and f0 in lieu of z1 and f1. No need to issue a warning in that case (although you would do this in production-quality code).
- Remember that the search direction for the line search is obtained from the gradient function g.
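As a concrete illustration of the call just described, the toy quadratic below stands in for the $\phi$ you must build from f, g, and z0:
from scipy import optimize

def phi(alpha):
    # toy stand-in for f(z0 + alpha * p)
    return (alpha - 0.3) ** 2

alpha0, phi0, ne = optimize.golden(phi, brack=(0., 1.e-3), full_output=True)
print(alpha0, phi0, ne)                 # alpha0 should come out close to 0.3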
If you did everything correctly in the previous two problems, the analytic and numerical answers should be consistent with each other.
However, this is not guaranteed in general. That is, for arbitrary functions $f$, the result of the numerical line search may differ from the solution of the analytic line search as defined in Problem 1.1, even if perfect arithmetic were used. Explain briefly why.
By construction, the value $\alpha_0$ you found in problem 1.1 is a local minimum of $\phi(\alpha)$. Is the corresponding point $\mathbf{z}_1$ a local minimum for $f(\mathbf{z})$? Prove your answer.
A global minimum is also a local minimum. The type of local minimum (strong or weak) does not matter.
The analytic line search performed in Problem 1.1 is myopic, in that it is content to find the first local minimum of $\phi(\alpha)$ for $\alpha > 0$.
Since the calculation is analytic, we can do better. Specifically, use calculus to find the first global minimum of $\phi(\alpha)$ (for $\alpha > 0$) for the same function $f(x, y)$ examined above and starting at the same point $\mathbf{z}_0$.
As before, give $\alpha_0$, $\mathbf{z}_1$, and $f(\mathbf{z}_1)$. For $\alpha_0$ give both an algebraic expression and a numerical value with three decimal digits after the period. For the other quantities just give numerical values with three decimal digits after the period.
Show the main steps of the calculation where it differs from what you did in Problem 1.1.
Is the point $\mathbf{z}_1$ you found in Problem 1.5 a global minimum for $f(\mathbf{z})$? If not, why not? If it is, is this the only global minimum for $f(\mathbf{z})$? Explain your answers as succinctly as you can.
In this part of the assignment you will implement a basic version of gradient descent and then use it to explore the effects of various choices and parameters. Since you will use your code repeatedly in this part, it is important that you develop it with care, to avoid issues in later problems.
It is also important to understand the distinction between iterations of steepest descent and number of function evaluations. An iteration occurs every time you replace the current value $\mathbf{z}_k$ of the independent variable with a new value $\mathbf{z}_{k+1}$ in the main loop of gradient descent and then compute a new descent direction. If you use a descent rate $\alpha$ (fixed or otherwise), each iteration involves just one call of the function $f$ being optimized, so the number of function evaluations in this case is the same as the number of iterations. We do not count gradient evaluations as involving function evaluations in this part.
If on the other hand you use line search, a single run of line search involves many function evaluations, so in this case the number of function evaluations per iteration can be much greater than one. The function line_search you wrote tells you this number through its return value ne.
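If you want to double-check this bookkeeping, one option is to count calls yourself; the wrapper below is a hypothetical helper, not something the assignment requires.
def make_counted(f):
    # Wrap f so that every call increments a shared counter.
    counter = {'ne': 0}
    def f_counted(z):
        counter['ne'] += 1
        return f(z)
    return f_counted, counter

f_counted, counter = make_counted(lambda z: z * z)
f_counted(3.)
print(counter['ne'])                    # 1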
The simplest nontrivial function we can use to experiment with gradient descent is a 2D paraboloid,
$$ f(\mathbf{z}) = f(x, y) = a x^2 + b y^2 \;. $$
When $a$ and $b$ are very different we say that the paraboloid is elongated. We will use an elongation $b/a = 10$ in all our experiments.
Of course we know the minimum of this function, but our descent algorithm does not. In addition, every smooth function looks like some paraboloid once you get close enough to a local minimum, so seeing what happens for this function is instructive.
Here is a function that lets you make both this function $f$ and its gradient $g$.
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
def paraboloid(z, axes):
    z2 = z * z
    p = axes[0] * z2[0] + axes[1] * z2[1]
    return p

def g_paraboloid(z, axes):
    g = 2 * z * axes
    return g

def make_bowl_functions(axes):
    def f(z):
        return paraboloid(z, axes)
    def g(z):
        return g_paraboloid(z, axes)
    return f, g
Doing it this way ensures that $f$ and $g$ use the same parameters, since you can then write
bowl, g_bowl = make_bowl_functions((0.1, 1.))
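As a quick sanity check of the closures (the test point is arbitrary):
z = np.array([1., 1.])
print(bowl(z))        # 0.1 * 1 + 1.0 * 1 = 1.1
print(g_bowl(z))      # [0.2 2. ]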
We can draw isocontours of $f$ to visualize the function, as done below. We can also draw the unit circle on the $\mathbf{z}$ plane. This will turn out useful later.
def show_function(p, x_range=(-1.2, 1.2), y_range=(-1.2, 1.2), n=101, levels=10):
    x = np.linspace(x_range[0], x_range[1], n)
    y = np.linspace(y_range[0], y_range[1], n)
    xx, yy = np.meshgrid(x, y)
    zz = np.stack((xx, yy), axis=0)
    pp = p(zz)
    plt.contour(x, y, pp, levels=levels)
    plt.plot(0., 0., '.', ms=8)
    plt.axis('square')
    plt.xticks([])
    plt.yticks([])

def draw_circle(center=(0., 0.), radius=1., samples=300, color='r'):
    omega = np.linspace(0., 2. * np.pi, samples)
    x, y = np.cos(omega), np.sin(omega)
    x = radius * x + center[0]
    y = radius * y + center[1]
    plt.plot(x, y, color=color)
show_function(bowl)
draw_circle()
Write a function with the following header:
def gradient_descent(
        f, g, z0,
        alpha=0.,
        momentum=0.,
        max_iterations=1000,
        min_gradient=1.e-6,
        min_step=1.e-8,
        stop_radius=0.,
        record=False
):
that implements several styles of gradient descent and, if record is set to True, remembers all the points $\mathbf{z}_k$ the descent traverses, as well as the corresponding function values $f(\mathbf{z}_k)$.
- f, g, and z0 are as for line_search.
- If alpha has the default value 0 or is negative, the function gradient_descent uses line_search at each iteration. If alpha has a positive value, then gradient_descent uses that descent rate instead of performing a line search. The descent rate is fixed, $\alpha_k = \alpha =$ alpha.
- If momentum is positive, then gradient_descent also uses momentum with a fixed value $\mu_k = \mu =$ momentum.
- If the descent does not converge within max_iterations iterations, the function gradient_descent simply returns the result of running that number of iterations. It is not necessary to print a warning message in that case (although that would be done in production-quality code).
- The function gradient_descent stops if any of the following conditions are met:
  - The magnitude of the gradient at the current point is smaller than min_gradient.
  - The magnitude of the most recent step is smaller than min_step.
  - The current point $\mathbf{z}_k$ is within stop_radius from the origin (that is, if $\|\mathbf{z}_k\| <$ stop_radius). This is a peculiar stopping criterion that would make little sense in general code. It is only used in this assignment.
  The first two criteria are checked before a new point is computed (either by line search or with a descent rate). The third criterion is checked after a new point is computed.
- What gradient_descent returns depends on the value of record.
- If record is False, the function gradient_descent returns a triple with the point $\mathbf{z}_k$ it found, the corresponding function value $f(\mathbf{z}_k)$, and the total number of function evaluations.
- If record is True, the function gradient_descent returns a quadruple. The first three values are as above and the fourth value is a list [(z0, f0), (z1, f1), ...] of points and function values the function encountered during descent. Make sure that the first item in the list is the starting point and the last item is the final point.
- In the experiments below, set stop_radius=0.01 so the descent does not iterate too much. This is essentially to save you time.
- Use the default values of the other parameters of gradient_descent except as explicitly noted.
- Append copies of the points z0, z1, ... to the list if record is True. Otherwise, all values would end up being the same point.
- For instance, to remember the previous point before updating it, write z0_old = z0.copy(). If you were to write z0_old = z0, then every time you change z0 the value of z0_old would change with it.

Show your code and the paths z0, z1, ... traversed by gradient_descent on the bowl function using line search and starting from scratch at several points on the unit circle. Specifically, start at the points with coordinates
Then add a brief paragraph commenting on the results. What do you observe and what are some reasons for that?
- Draw each path in its own panel of a single figure, selecting the panel with plt.subplot(2, 3, k).
- The starting point of each run should be highlighted with a cross. For instance, if your starting point $\mathbf{z}_0$ is in numpy array z_start, do
plt.plot(z_start[0], z_start[1], 'x', ms=12)
- Each subplot should have a title with the angle $\omega$ in degrees, and the entire figure should have a title (use plt.suptitle) with the values of descent rate and momentum (which are both zero for this problem but will change later). A small sketch of these plotting details follows these notes.
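Here is a small sketch of these plotting details for a single panel, assuming the functions defined earlier (show_function, draw_circle, bowl) are available; the angle omega_deg and the panel index are placeholder values:
import numpy as np
from matplotlib import pyplot as plt

omega_deg = 45.
omega = np.radians(omega_deg)
z_start = np.array([np.cos(omega), np.sin(omega)])   # starting point on the unit circle

plt.subplot(2, 3, 1)                                 # panel index 1 as an example
show_function(bowl)
draw_circle()
plt.plot(z_start[0], z_start[1], 'x', ms=12)         # highlight the starting point
plt.title('{:.0f} degrees'.format(omega_deg))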
Encapsulate the code that does all of this (including making a figure and its subplots and drawing the bowl and the unit circle in each subplot) into a function with header
def show_paths(alpha=0., momentum=0.):
This will turn out to be useful later, as you will be asked to repeat the same experiment with different values of descent rate and momentum. Of course, make sure you pass these parameters to gradient_descent in your show_paths function.
For this problem, just run show_paths().
To get sharper insight into the scenario from the previous problem we can plot the number of function evaluations as a function of the angle $\omega$ of the starting point. We will sample $\omega$ more finely now.
Write code that plots $n_e(\omega)$, the number of function evaluations starting at $\mathbf{z}_0(\omega) = (\cos\omega, \sin\omega)$, under the same circumstances as in Problem 2.1.
Show your code and a plot for values of $\omega$ one degree apart in $[0, 90]$ (so the plot is done over 91 points). To keep the number of evaluations reasonable, set stop_radius=0.01 as before. Label the axes and include a title with the values of the descent rate and momentum (both zero for now).
Then comment in a brief paragraph in what way your plot is consistent with the paths you plotted in Problem 2.1. Include some comments on the actual number of evaluations for key values of $\omega$.
Encapsulate the code that does all of this into a function with header
def show_evaluations(alpha=0., momentum=0.):
which you will use again later.
However, unlike show_paths, the function show_evaluations should not include plt.figure, plt.subplot, or plt.show commands. For this problem, call your code as follows:
plt.figure(figsize=(8, 8), tight_layout=True)
show_evaluations()
plt.show()
In later problems, you will put a few plots like these side by side, also using plt.subplot commands. This is why you should write show_evaluations without these commands.
Repeat the experiments in the previous two problems for zero momentum and descent rates alpha equal to 0.01, 0.5, 1. Specifically, first show the three resulting number-of-evaluation plots in the three panels of a plt.subplot(1, 3). Then show the paths as three separate figures (in three separate notebook cells for easier pagination). Use the same parameters as before except for alpha.
Also comment briefly on what goes wrong for descent rates 0.01 and 1, and compare (in broad strokes) the number of function evaluations for a descent rate of 0.5 with the number of evaluations used by line search.
Repeat the experiments in Problem 2.3 for a descent rate 0.5 and momentum equal to 0.1, 0.5, 0.9.
Then comment briefly on the advantages and possible pitfalls of different momentum values.
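For reference, one common form of the fixed-momentum update is the heavy-ball rule below; the class notes may define the update slightly differently, so follow their definition in your code.
$$ \mathbf{z}_{k+1} = \mathbf{z}_k - \alpha\,\nabla f(\mathbf{z}_k) + \mu\,(\mathbf{z}_k - \mathbf{z}_{k-1}) \;. $$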