COMPSCI 527 Homework 2¶

Homework Submission Workflow¶

When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.

Note on Exam-Style Problems¶

Instead of handing out sample exams in preparation for the midterm and final, homework assignments include several problems that are marked Exam Style . These are problems whose format and contents are similar to those in the exams, and are graded like the other problems. Please use these problems as examples of what may come up in an exam.

As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.

Problem 0 (3 points)¶

Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?

Note¶

This question requires no explicit answer from you, and no space is provided for an answer in the template file.

However, the rubric entry for this problem in Gradescope will be used to decuct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope for the location of each answer.

Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.

Part 1: Bilinear Interpolation¶

Problem 1.1 (Exam Style)¶

Let four adjacent pixels in a black-and-white image have the following values

$$ \begin{eqnarray*} I[10, 20] = 60 &\phantom{wide space}& I[10, 21] = 100 \\ I[11, 20] = 80 &\phantom{wide space}& I[11, 21] = 40 \end{eqnarray*} $$

In these expressions, the first index for $I$ is the row index and the second is the column index.

Give the value of $I[u, v]$ obtained by bilinear interpolation for row coordinate $u = 10.4$ and column coordinate $v = 20.8$. Show your calculations.

Problem 1.2¶

Write a Python function with header

def interpolate(i, r, c):

that takes an array i of shape (rows, columns) and two non-empty arrays r c of floating-point row, column coordinates respectively. The two arrays have arbitrary shapes, except that the following predicate must be true:

r.shape == c.shape`

The entries in these arrays must satisfy the following predicate:

np.all(r >= 0) and np.all(r < rows - 1) and \
np.all(c >= 0) and np.all(c < columns - 1)

so that they can be interpreted as legitimate row, column indices into i.

The function interpolate returns a single array j of the same shape as r (and therefore c) with the values of i obtained by bilinear interpolation at the (floating-point) positions specified by r and c.

Follow the programming notes below to get credit for your code.

Then print the two arrays that interpolate returns when called with the following image img and (i) row, column arrays r1, c1 and then (ii) row, column arrays r2, c2.

In [1]:
import numpy as np
In [2]:
def ramp(rows, columns, r_slope, c_slope):
    return np.outer(np.ones(rows), c_slope * np.arange(columns)) + \
           np.outer(r_slope * np.arange(rows), np.ones(columns))
In [3]:
img = ramp(5, 5, 2., 3.)
In [4]:
r1 = np.linspace(0.2, 3.8, num=4)
c1 = np.linspace(2., 3.5, num=4)
In [5]:
r2 = np.array([[0.4, 3.1, 1.9], [0.0, 2.1, 3.7], [3.5, 1.0, 0.3]])
c2 = np.array([[2.0, 1.3, 3.2], [0.9, 1.4, 2.9], [1.8, 2.0, 0.7]])

Programming Notes¶

  • You may only import numpy for this problem. No other packages are allowed. In addition, you may not use any interpolation functions provided in numpy. In other words, you must write your function from scratch.
  • Include assert statements that check for the conditions on r and c given above.
  • Convert the array i to float inside the function (using i.astype(float)). It may already have this format, but it is safer to convert anyway in case it is not.
  • There are many ways to work with arrays of arbitrary shapes like r and c. A particularly simple technique is to first ravel all the arrays and then reshape the result to the desired shape. If you use this method, it is cleaner to define a tiny helper function that given two integer row, column coordinates returns the corresponding index in the raveled array. However, the technique you use is up to you.
  • Do not use any explicit for loops, but use numpy array operations instead.
  • If you use the function numpy.floor, keep in mind that numpy.floor(3.2) is conceptually an integer but is still represented as a float. If you use that number for indexing it must be an int, so you should say numpy.floor(3.2).astype(int).

Problem 1.3¶

Let us now use the function interpolate developed above to carve a disk out of an image. A disk is a filled circle.

To this end, first write a function with the following header:

def disk(center, radius, num_rho, num_theta):

This function takes a numpy vector center with two floating-point numbers; a floating-point number radius; and two integers num_rho and num_theta. The arguments center and radius represent a circle and center is in row, column coordinates.

The function disk first makes a grid $(\rho, \theta)$ of radii and angles. Each of $\rho$ and $\theta$ is an array of shape (num_rho, num_theta) and the two arrays together represent a grid of points that cover the given circle in polar coordinates and with num_rho equally spaced samples in the $\rho$ coordinate (varying from $0$ to radius) and num_theta equally spaced samples in the $\theta$ coordinate (varying from $0$ to $2\pi$).

For instance, disk((3, 2), 1.5, 7, 12) would make two arrays that represent the nodes of the following mesh:

spider web

Note that there are 11 spokes are visible even if the last argument is 12. This is because the first value of $\theta$ is 0 and the last one is $2\pi$, so the two directions coincide. The arrays have 12 columns so that the circle is "closed."

The function disk then converts these polar $(\rho, \theta)$ coordinates to Cartesian $(r, c)$ coordinates, with the row coordinate $r$ pointing vertically down and the column coordinate $c$ pointing horizontally to the right. The function returns these Cartesian coordinates, stored in two arrays each of the same shape as that of $\rho$ or $\theta$.

The following code retrieves a black-and-white image cafe.png and a Python file display.py with a few display utility functions.

In [6]:
import urllib.request
from os import path as osp


def retrieve(file_name, semester='spring23', homework=2):
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci527/homework/{}/{}'
        url = fmt.format(semester, homework, file_name)
        urllib.request.urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))
In [7]:
retrieve('cafe.png')
retrieve('display.py')
Downloaded file cafe.png
Downloaded file display.py

Show your code and the result of the following calls (un-commented). These calls are also included in the homework template file.

In [8]:
from matplotlib import pyplot as plt
from display import show_image_excerpt
%matplotlib inline
In [9]:
# img = plt.imread('cafe.png')
# center = [img.shape[0] // 2, img.shape[1] // 2]
# radius = img.shape[0] // 4
# r_disk, c_disk = disk(center, radius, 100, 300)
# d = interpolate(img, r_disk, c_disk)
# show_image_excerpt(img, r_disk, c_disk, d)

Part 2: Blurring and Down-Sampling¶

The code below retrieves a Python file smoothing.py that contains some utilities to work with Gaussian pyramids. That file also contains a function match_smoothers which is run below.

In [10]:
retrieve('smoothing.py')
Downloaded file smoothing.py
In [11]:
from smoothing import match_smoothers

img = plt.imread('cafe.png')
down_sampling_rounds, max_blurring_rounds = 3, 40
match_smoothers(img, down_sampling_rounds, max_blurring_rounds)
best match occurs after 24 rounds of blurring

Here is an explanation of how this plot is made. Feel free to inspect the code in smoothing.py to understand any unclear part of this explanation.

Suppose that we want to down-sample an image by a factor $\phi = 1/2$, so that the resulting image has half the rows and half the columns as the original one. We will keep this factor fixed throughout this part.

Blurring through a Pyramid¶

For down-sampling, the down function you studied in the class notes first blurs an image with a Gaussian to prevent aliasing to occur. The function then takes every other pixel from the blurred image in each direction, an operation called subsampling. In code (taken straight out of smoothing.py):

def down(x):
    smooth = blur(x)
    return smooth[::2, ::2]

Please pay attention to the difference between "subsampling" and "down-sampling." We have

down-sampling = blurring + subsampling

Thus, small = down(img) has a quarter of the pixels in img (half in each dimension). Because of smoothing, small also contains less detail than img.

The image small can be made to have the same size as img by running

p_blurred_1 = up(small)

The up operator, which you also saw in the class notes and is implemented in smoothing.py, restores resolution (in the sense of "number of pixels") by bilinear interpolation. However, up cannot reintroduce the detail that blurring removed. Thus, p_blurred_1 is still a blurred version of img, even if it has the same number of pixels as img.

If you now want to make a version of img that is even more blurred, you can apply down a number of times, that is, traverse several levels of a pyramid, and then apply up to the result the same number of times, that is, traverse the pyramid in the reverse direction. The prefix p in p_blurred_1 stands for "pyramid." The suffix 1 means that we blur just once. If we go through n levels we obtain p_blurred_n. The image below is a block diagram for p_blurred_3. The meaning of the symbols should be obvious.

blurring through a pyramid

Blurring through a Stack¶

A different way of blurring an image is more straightforward: Just apply blur as many times as desired. In this way, there is no change in resolution (number of pixels), so we call the sequence of increasingly blurred versions of the image a stack rather than a pyramid.

We call the resulting blurred image s_blurred_k if we apply blur k times. The prefix s can stand for "stack" (as opposed to "pyramid") or for "straightforward" if you prefer.

Comparing Blurring Styles¶

The function match_smoothers is called above on the image cafe.png you saw earlier. It first produces p_blurred_3, because the argument down_sampling_rounds is set to 3. It then computes s_blurred_k for k ranging between 1 and 40 (the value of the argument max_blurring_rounds).

For each value of k, a central portion of p_blurred_3 is compared with the corresponding portion of s_blurred_k by taking the 2-norm of the difference between these two portions. The reason for only looking at a central portion is to stay away from image boundaries, where padding may cause unwanted artifacts that would complicate the analysis.

The plot above is the value of this 2-norm as a function of k.

The conclusion we can draw from the plot above is as follows:

It takes about 24 rounds of blur to achieve approximately the same effect as 3 rounds of down and 3 rounds of up.

The value of the minimum is not zero mainly because the bilinear interpolation performed by up does not cancel subsampling (that is, taking every other pixel in each dimension) exactly.

If it did, then it would take 21 rounds of blur rather than 24.

Problem 2.1 (Exam Style)¶

Your job here is to explain the relationship between the numbers 3 and 24 (or rather 21) in the conclusion above. In other words, why do you need to blur an image about 21 times to achieve an effect that is similar to that of (i) blurring and subsampling 3 times, and then (ii) interpolating your way back up through the 3 levels of the pyramid?

Notes¶

  • As noted above, the theoretical value one would obtain if up were to reverse subsampling exactly is 21, so your explanation should come up with this value instead of 24.
  • Here are a few helpful facts, which you may use without proof:
    • Bilinear interpolation is approximately the inverse of subsampling (not of down-sampling!). Assume that this inverse is exact for your argument.
    • Convolution is associative: $a \ast (b \ast c) = (a \ast b) \ast c$.
    • Convolving an image with a Gaussian kernel with parameter $\sigma_1$ and then convolving the result with a Gaussian kernel with parameter $\sigma_2$ is equivalent to convolving the original signal with a Gaussian kernel with parameter $\sqrt{\sigma_1^2 + \sigma_2^2}$.
    • up(blur(i[::2, ::2])) where the Gaussian in blur has parameter $\sigma$ is approximately equivalent to just blurring i with a Gaussian with parameter $2\sigma$. Assume that this equivalence is exact: subsampling + blurring with $\sigma$ + up-sampling = blurring with $2\sigma$.
  • You may want to start at the right end of the block diagram for p_blurred_3 shown earlier and reason your way towards the left to transform the diagram into a chain of blur operators only.
  • Please be clear in your explanation. Figure out what to say, write it down, then edit for clarity and good logical flow. No formal math is needed.

Part 3: The Laplacian Pyramid¶

Successive layers in a Gaussian pyramid contain coarser and coarser image features because of the blur operator in down. Another way of saying this is to say that Gaussian smoothing eliminates high spatial frequencies from the image, in the sense that sharp details involve rapid transitions of image brightness from one pixel to the next (high spatial frequencies), and these are reduced by Gaussian blurring.

Because of this behavior, the Gaussian filter (convolution with a Gaussian kernel) is called a low-pass filter: It stops some high frequencies and lets lower frequencies (more slowly varying signals) go through to the output unharmed.

In some scenarios it is also useful to look at the part of the image that a Gaussian filter blocks, that is, the higher spatial frequencies that do not make it through the filter.

As an example, the Figure below shows

  • part of a satellite image img,
  • a low-pass version low = up(up(up(down(down(down(img)))))) (this is the same transformation that produced p_blurred_3 in the previous Part of this assigment), and
  • the high-pass image, that is, the difference high = img - low.

low-pass and high-pass image components

By-and-large, the clouds dominate the low-pass image, while the finer detail on the ground dominates the high-pass image.

While the separation is not perfect, this example is meant to suggest that looking at the contents of an image in different frequency bands may be useful in various scenarios.

If the town in the satellite image above were on rolling hills, it is conceivable that the shading the sun creates on the hills would lead to image features at intermediate frequencies: Neither as high as the buildings nor as low as the clouds. The hill image would be a band-pass image.

To create these band-pass images one can start from a Gaussian pyramid and compute appropriate differences. For instance, with three levels:

low_1 = down(img)
    band_1 = img - up(low_1)
low_2 = down(low_1)
    band_2 = low_1 - up(low_2)
low_3 = down(low_2)
    band_3 = low_2 - up(low_3)

The original image img together with the three low-pass images low_1, low_2, low_3 form the Gaussian pyramid. The image band_1 is what we called the "high-pass image." We call it band_1 simply for notational consistency, so we can easily write a for loop over the levels when we have an arbitrary number of levels in the pyramid instead of three.

The three images band_1, band_2, band_3 encode image details at three different levels of detail (think of buildings, hills, and clouds).

Problem 3.1 (Exam Style)¶

The three-level pipeline above involves seven images: img, low_1, low_2, low_3, band_1, band_2, band_3 .

Show that it is possible to reconstruct img exactly (up to rounding errors) if only the four images band_1, band_2, band_3, low_3 are available.

Note¶

This is a simple high-school algebra problem: Solve the equations above in reverse order:

  • Write low_2 in terms of images that are available (and using available operators).
  • Write low_1 in terms of images that are available at this point (so low_2 is included).
  • Write img in terms of images that are available at this point.

Just show the resulting equations in the order in which you solved them:

low_2 = ...
low_1 = ...
img = ...

Problem 3.2¶

A Laplacian pyramid is a pyramid of the type above that contains only the images that are necessary to reconstruct the input image img. In the three-level case, these images are band_1, band_2, band_3, low_3.

More generally, one can keep creating $k$ levels in the pyramid until down-sampling is no longer possible because the image low_k is too small.

To make this problem as simple as possible, we assume that images have shape $(2^k, 2^k)$, so that the pyramid has exactly $k$ levels.

To check for this condition you can import the function check_image_size from smoothing (you retrieved smoothing.py earlier).

The resulting pyramid contains the band-pass images band_1 $\ldots$ band_k as well as the low-pass image low_k.

Since each application of down shrinks the image by half, the low-pass image low_k contains a single pixel. Also, since all Gaussians in down are properly normalized, the value of low_k is the average of all pixel values in the original image.

If we put all these images into a Python list, we can compute the Laplacian pyramid as follows.

In [12]:
from smoothing import check_image_size, up, down
In [13]:
def pyramid(x):
    d = check_image_size(x)[0]
    p = []
    for level in range(d):
        lower = down(x)
        p.append(x - up(lower))
        x = lower
    p.append(x)
    return p

The function composite imported from display places all the images in a pyramid into a single composite image for display with show_image, which can also be imported from display (you retrieved display.py earlier). A function overlay_grid (also from display) adds some yellow lines to make the separation between the various images in the pyramid clearer. For example:

In [14]:
from display import composite, show_image, overlay_grid
In [15]:
img = plt.imread('cafe.png')
pyr = pyramid(img)
com = composite(pyr)
plt.figure(figsize=(12, 8), tight_layout=True)
show_image(com)
overlay_grid(com.shape[0])
plt.show()

The lower-left corner of the composite image is painted with the value of the single pixel in low_k.

Your Task¶

Write a Python function with header

def reconstruct(p):

that takes a Laplacian pyramid as constructed with pyramid and reconstructs the original image.

Keep your code as simple as posible. Show your code and the result of the following calls (after uncommenting). These calls are also available in the homework template file.

In [16]:
#from display import show_image
#
# img = plt.imread('cafe.png')
# pyr = pyramid(img)
# rec = reconstruct(pyr)
# show_image(rec, fig=True)

Part 4: Line Search¶

Let $\mathbf{z} = (x, y)$ and

$$ f(\mathbf{z}) = 2 x^2 + xy + 3 y^2 $$

and define

$$ \mathbf{z}_0 = (1, 0) \;. $$

Problem 4.1 (Exam Style)¶

Write an algebraic expression in terms of $x$ and $y$ for the gradient

$$ \mathbf{g}(\mathbf{z}) = \nabla \mathbf{f}(\mathbf{z}) $$

and give the numerical value of $\mathbf{g}(\mathbf{z}_0)$.

Problem 4.2 (Exam Style)¶

Suppose that gradient descent is performed on $f$ starting at $\mathbf{z}_0$. The first iteration of gradient descent determines the next point $\mathbf{z}_1$ by line search.

Give an algebraic expression in terms of $\alpha\in\mathbb{R}^+$ for the function $h(\alpha)$ that is minimized for that line search. Simplify your expression as much as possible.

Then, assume that line search is continued until the difference between the third and the first item of the bracketing triple is at most $10^{-6}$. Give the numerical value $\alpha_0$ that line search would return in that case rounding to four decimal digits after the period. Also give the numerical values of the point $\mathbf{z}_1$ and of the value $f(\mathbf{z}_1)$, rounding to the same number of decimal digits for each.

Important: Do not actually perform line search to answer this question. It is possible to answer the question by calculus alone. Show your calculations.

Problem 4.3 (Exam Style)¶

Line search during minimization of a certain function (unrelated to those in the previous problems) leads to minimization of the function

$$ h(\alpha) = 2 \alpha^3 - \alpha^2 -\alpha + 1 \;. $$

The first bracketing triple is found to be

$$ a, b, c = 0, \frac{1}{4}, 1 \;. $$

Give the second bracketing triple. Show your reasoning.

Problem 4.4¶

Bracketing Triple Shrinkage¶

The class notes on optimization give pseudo-code for the procedure that shrinks a bracketing triple once (page 9). Turn that pseudo-code into a Python function with header

def shrink_triple(a, b, c, h):

where h is the function $h(\cdot)$.

Bracketing Triple Intialization¶

The paragraph at the top of page 9 describes a way to initialize the bracketing triple. Turn that description into a Python function with header

def initial_triple(h, step=1.e-4, min_step=1.e-12, max_step=1.e12):

In this header:

  • step is the initial step, which is denoted by $\alpha_1$ in the notes.

  • For the initialization to work, it is necessary that $h(\alpha_1) < h(\alpha_0)$. If on the other hand $h(\alpha_1) \geq h(\alpha_0)$, try again with step divided by 10. This can be achieved by a recursive call initial_triple(h, step / 10.). Stop these attempts if step goes below min_step. In that case return the triple None, None, None to indicate failure.

  • If $h(\alpha_1) < h(\alpha_0)$, we are in business. Then, as long as $h(\alpha_i) \leq h(\alpha_{i-1})$, step is multiplied by 10 and $\alpha_{i+1} = \alpha_i + $ step, and the proper conditions for success are checked. This search succeeds if these conditions are met before step exceeds max_step. In that case, return the bracketing triple found.

  • If the conditions are not met by the deadline (that is, for step no greater than max_step), return None, None, min_c, where min_c is the smallest value $h(\alpha_i)$ encountered up to that point: This means that the search for a bracketing triple has failed, but the value min_c is still an acceptable value for $\alpha$, because $h($ min_c $) < h(0)$ and so some descent has been ensured in the value of the function.

Line Search¶

You now have all the ingredients for line search. Write a Python function with header

def line_search(h, precision=1.e-6):

that uses initial_triple and shrink_triple and returns a value of $\alpha$ within precision from a local minimum, or returns $\alpha = $c if initial_triple returns None, None, c.

To return an appropriate value once $c - a$ is at most equal to precision, just return b, the second value of the bracketing triple.

If no value of $\alpha$ can be found, line_search returns None.

Your Task¶

Write the functions above, show your code, and the values of $\alpha$ returned by the calls below (uncommented). These calls are also available in the homework template file.

One of the three calls should fail outright (return None), one should return a value resulting from failed initialization (when initial_triple returns None, None, c), and one should succeed with an appropriate value. Think about which is which, to check your code. No need to let us know.

Do not use numpy or any other import for this code, just straight Python. Leave all keyword arguments to their default values in all functions.

In [17]:
# print(line_search(lambda x: (x - 1) ** 2))
# print(line_search(lambda x: x ** 2))
# print(line_search(lambda x: -x))

Part 5: Gradient Descent¶

In this part you will write code that uses either line search (which you developed in Part 4) or a fixed learning rate to minimize a function $f(\mathbf{z})$ of two variables (that is, $\mathbf{z}\in\mathbb{R}^2$) by gradient descent.

Restricting this exercise to a two-dimensional domain lets you plot the path $\mathbf{z}_0, \mathbf{z}_1, \ldots $ traversed during descent, and also the sequence of function values $f(\mathbf{z}_0), f(\mathbf{z}_1), \ldots $ (of course, this last sequence can be plotted regardless of the dimensionality of the domain).

Code will also keep track of how many times the function being minimized is called, to determine the cost of each full run of gradient descent. This monitoring of a function's behavior is implemented here through Python decorators and generators. Don't worry if you don't know much about these concepts, as most of the work on those aspects of the code is done for you.

While there is a bit of reading involved here, you will learn about programming techniques that come in handy in a variety of situations.

Decorators¶

One way to keep track of descent paths and function call counts would be to include code that does that in your implementation of gradient descent. However, this would clutter the code and make it difficult to read and debug. Instead, Python decorators and generators let you separate what the function is supposed to do from the code that monitors any aspects of its behavior.

As an example, the function you will minimize is decorated with the @Count decorator as follows, to monitor how many times it is called. Remember that you already retrieved display.py, from which these decorators are imported.

In [18]:
from display import Display, Count, print_outcome
In [19]:
@Count
def rosenbrock(z):
    a, b = z[1:] - z[:-1] ** 2, 1 - z[:-1]
    f = np.sum(100. * a ** 2 + b ** 2, axis=0)
    g = np.zeros(z.shape)
    g[:-1] = - 400. * z[:-1] * a - 2 * b
    g[1:] += 200 * a
    return f, g

You can then simply reset the call counter by saying

In [20]:
rosenbrock.reset_counter()

and then call the function a few times:

In [21]:
for _ in range(5):
    rosenbrock(np.random.rand(2))

To find out how many times the function was called do

In [22]:
print(rosenbrock.calls())
5

Note that the definition of rosenbrock knows nothing about reset_counter or calls. All of that is done for you in the decorator @Count. Thus, the code of rosenbrock is kept clean of anything that does not pertain to the function itself.

You won't even have to worry about calling reset_counter or calls yourself, since display.py also includes a print_outcome function, discussed later, that both prints the number of calls and resets the call counter. So all you have to do is to call print_outcome, as you will be instructed to do later.

The description of the @Count decorator above is just to let you know what is going on. Of course, you can always peruse display.py to see how this is all implemented.

The @Display Decorator and Python Generators¶

The @Display decorator plots the gradient descent path and function values, superimposed on a plot of the iso-contours of the function in some rectangle of the (two-dimensional) domain. This decorator does require a bit of collaboration on your part, as discussed next.

Initializing the Decorator¶

You will be asked to write a function gd that searches for a local minimum of a function by gradient descent. The @Display decorator for gd needs to know which function to draw the iso-contours of, and in which rectangle.

Because of this, before you invoke gd on a new function you have to say

gd.set_function(fct, box)

where fct is the function to be minimized and box is a list of the left, right, lower, and upper bounds of the rectangle. More on this later.

Again, it is not your job to write set_function, that is done for you in the decorator, which is defined in display.py.

Generators¶

The @Display decorator also needs to be given the opportunity to temporarily stop execution of gd, see the current status of the search (that is, be told the values of $\mathbf{z}_k$ and $f(\mathbf{z}_k)$ which the decorator needs in order to plot the search path), and then resume execution of gd.

To this end, you will be shown how to transform your gd, which is initially a function, into a generator, which lets @Display intrude as described. This transformation is simple and merely involves the following:

  • Changing the word return in your definition of gd into yield. Yielding differs from returning in that control is yielded only temporarily to the caller (which is the @Display decorator). While the caller takes over, gd still remembers all of its state, so that when the caller calls gd again this function can just continue execution as if nothing had happened.
  • Adding yield statements wherever your code changes the values of $\mathbf{z}_k$ or $f(\mathbf{z}_k)$, so that the caller knows to record these values.

Problem 5.1¶

Write a generator with decorated header

@Display
def gd(f, z, learning_rate=None, min_step=1.e-6,
    min_drop=1.e-8, max_iter=100000):

that performs gradient descent on the function f starting at point z.

Instructions are given under the Programming Notes below on how to make the generator. You will first write a regular function, then transform it to a decorated generator. So don't worry about the decorator and generator aspects for now. Just start by writing a regular function that satisfies the following specifications.

Specifications¶

  • If learning_rate is None, your function performs line search at each iteration, calling the function line_search you developed earlier.

  • Otherwise, learning_rate must be a positive floating point number, and gd will use that as the learning rate, to be kept fixed throughout all iterations.

  • The function runs at most max_iter iterations, and it will therefore contain a loop like this:

      for k in range(max_iter):
    
    

    where k is the iteration number.

  • You break out of the loop (do not return from inside the loop) if either
    • the Euclidean norm of the step $\mathbf{z}_k - \mathbf{z}_{k-1}$ falls below min_step, or
    • the value drop $f(\mathbf{z}_{k-1}) - f(\mathbf{z}_k)$ (note the order) falls below min_drop, or
    • the maximum number of iterations has been reached.
  • The search fails if the maximum number of iterations is reached or if the value drop is negative (that is, the function value increases instead of decreasing) or if line_search fails to find an acceptable value.
  • If failure occurs, the function gd returns None, None.
  • Upon success, the function gd returns the pair fz, z, where z is the current position $\mathbf{z}_k$ and fz is the corresponding value $f(\mathbf{z}_k)$.
  • Always use the default values of the parameters max_iter, min_step, max_drop when you run gd in this assignment.
  • Do not call the function to be minimized unnecessarily. That is, think of this function as being expensive to run. You want to keep runs to a minimum.

Programming Notes¶

  • To write the gd generator, first comment out the decoration: # @Display then write gd as a function (that is, use return rather than yield).

  • When done, debug your function carefully and test it.

  • When you write gd make sure that it has the following structure. This will make it straightforward to turn the function into a generator later.

def gd(...)
    # Compute the initial value of the function and its gradient
    # and make a copy of z to prevent side effects.
    (fz, gz), z = f(z), z.copy()

    for k in range(max_iter):
        ...

        # Compute the descent step
        ...

        # Take the step
        z -= step

        # Check if done
        if ...:
            # IMPORTANT: Do not return here.
            # Instead, break out of the loop
            break

    # This should be the ONLY return in the entire function
    return fz, z

Transforming your Function into a Generator¶

Once you are happy with how gd works as a function, transform it into a generator by changing return to yield and adding extra yield statements as specified below.

  • Replace the word return to yield in the last line.
  • Insert two more statements yield fz, z as follows:
    • After the statement that evaluates f the first time (before the for loop).
    • In front of the if block that terminates the for loop.

That's all there is to it. In this way, your code yields to the @Display decorator every time z and fz change, so that @Display can keep track of these values and then plot them when gd is done.

Decorating your Function¶

To decorate your function, just uncomment the line

# @Display

that you commented out. If you run the function again as shown below, the decorator will display two plots that show how $\mathbf{z}_k$ and $f(\mathbf{z}_k)$ change over time.

Calling gd¶

Here is the code for calling gd. This code is also available in the homework template file (also commented out).

Rather than using a for loop on the different learning rates, the calls are separated out into different notebook cells for easier formatting of the PDF file.

In [23]:
# # Part of the function domain to plot contours in
# box = [-1, 2, -1, 2]

# # Tell the decorator which function to use and in what box
# gd.set_function(rosenbrock, box)

# # Starting point for gradient descent
# z0 = np.array([-1 / 2, 1])

# # Run gd with line search
# value, position = gd(rosenbrock, z0)
# print_outcome('line_search', position, value, rosenbrock)
In [24]:
## Run gd again with three different, fixed learning rates
In [25]:
# rate = 0.00001
# value, position = gd(rosenbrock, z0, learning_rate=rate)
# print_outcome('learning rate {}'.format(rate),
#               position, value, rosenbrock)
In [26]:
# rate = 0.001
# value, position = gd(rosenbrock, z0, learning_rate=rate)
# print_outcome('learning rate {}'.format(rate),
#               position, value, rosenbrock)
In [27]:
# rate = 0.002
# value, position = gd(rosenbrock, z0, learning_rate=rate)
# print_outcome('learning rate {}'.format(rate),
#               position, value, rosenbrock)
In [28]:
# rate = 0.003
# value, position = gd(rosenbrock, z0, learning_rate=rate)
# print_outcome('learning rate {}'.format(rate),
#               position, value, rosenbrock)

The Rosenbrock Function¶

  • The function rosenbrock defined earlier takes a numpy array z of shape (2,) (that is, a two-dimensional vector) and returns both the value fz of the function and its gradient gz at z.

  • The function rosenbrock has a unique minimum (which is therefore global) at $\mathbf{z}^\ast = (1, 1)$, so you'll know when your implementation of gd succeeds.

  • To be precise, the definition of rosenbrock above takes inputs z of any dimensionality. However, we will only use it with two-dimensional vectors z.

Your Task¶

  • As you see, your only task is to write gd correctly and then transform it to a generator.
  • Show your code and the result of the test runs above. Two of the runs are supposed to fail. In those cases, the decorator places a question mark at the last point visited during search.

Problem 5.2 (Exam Style)¶

Answer the following questions about the results you obtained in the previous problem. Please number your answers and keep them in the same order as the questions.

Note that for all successful runs of gd above you can determine the number of iterations, at least approximately, by looking at the abscissa of the function value plots. The number of function calls, on the other hand, is printed out explicily.

  1. Explain why the search with the smallest fixed learning rate above failed.

  2. Explain why the search with the greatest fixed learning rate above failed.

  3. How could the code for gd be improved so it can recover from the problem that occurred with an excessively large fixed learning rate? (No need to implement this improvement.)

  4. What is the average number of function calls per iteration, approximately, for the run with line search above? An iteration corresponds to one increment of the for loop variable k.

  5. What is the average number of function calls per iteration, approximately, in the successful runs with a fixed learning rate above?

  6. What are the main trade-offs in selecting a good learning rate when the rate is fixed?

  7. Since line search does not always lead to the most efficient search, what could be a reason for preferring that method to a fixed learning rate? This question is a bit open-ended, and different valid answers are possible.