When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.
Instead of handing out sample exams in preparation for the midterm and final, the homework assignments include several problems marked Exam Style. These problems are similar in format and content to exam problems, and are graded like the other problems. Please use them as examples of what may come up in an exam.
As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.
Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?
This question requires no explicit answer from you, and no space is provided for an answer in the template file.
However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope of the location of each answer.
Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.
We revisit gradient descent in the context of functions that are averages of other functions. This will let us set up some code that we can use for Stochastic Gradient Descent (SGD) later in this assignment.
More importantly for this Part, we take a function $f$ of a large number of variables, which makes $f$ relatively expensive to compute, and restrict it to a two-dimensional subspace of the domain of $f$, so we can plot the path that gradient descent traverses when minimizing the restriction.
Let us retrieve a file function.py and import from it the implementation f of a function $f:\mathbb{R}^d \rightarrow \mathbb{R}$ that is an average of $N$ functions $\phi_n$:

$$ f(\mathbf{w}) = \frac{1}{N} \sum_{n=0}^{N-1} \phi_n(\mathbf{w}) \;. $$
This is a rather "large" function, in that $d = 7510$ and $N = 1500$. What the function computes is irrelevant to this Part.
We also retrieve a file display.py
with some visualization utilities and a file gd.py
with a reference implementation of gradient descent.
import urllib.request
from os import path as osp

def retrieve(file_name, semester='spring23', homework=3):
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci527/homework/{}/{}'
        url = fmt.format(semester, homework, file_name)
        urllib.request.urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))

retrieve('function.py')
retrieve('display.py')
retrieve('gd.py')
Using previously downloaded file function.py
Using previously downloaded file display.py
Using previously downloaded file gd.py
from function import f, start_over, print_outcome
from display import Display, show_function
from gd import gd
import numpy as np
If you call

z, g = f(w, indices)

where w is some numpy array of $d$ floating-point numbers and indices is a list of integers in $[0, N)$, then f returns the value z and gradient g of the function

$$ f_B(\mathbf{w}) = \frac{1}{|B|} \sum_{n\in B} \phi_n(\mathbf{w}) $$

where $B$ is the set of integers in indices and $|B|$ denotes the cardinality of $B$.
This additional argument to f
will help in SGD, where we need to compute averages over mini-batches $B$ rather than on the entire set $[0, N)$.
Of course, to compute the value and gradient on the entire set of indices $[0, N)$ you just say
z, g = f(w, np.arange(n_indices))
where n_indices
is $N = 1500$.
The generator gd imported from gd.py above is, more or less, the implementation of gradient descent from the last homework assignment. Look at the function gd in the file just retrieved and note the @Display decorator and the yield statements in place of return statements. If you forgot what this means, please refer back to Homework 2.
The decorator is a bit simpler now in that it only shows the descent path and function values rather than also displaying iso-contours of the function.
The figure below shows plots produced by gd
run on a function fr
of two variables that you will implement in Problem 1.1. This figure is just to give you an idea of what to expect in that problem.
Write a function with header

def fr(v, indices):

This function restricts the function $f$ defined above to dimensions 6420 and 6424 of the input space of $f$ around a point w0 $\in\mathbb{R}^{d}$ (with $d = 7510$) given later on.

The role of the second parameter indices for fr is the same as that for f.

The restriction fr of f is defined as follows:

$$ f_r(\mathbf{v}) = f(\mathbf{w}_0 + v_0\, \mathbf{e}_{6420} + v_1\, \mathbf{e}_{6424}) $$

where $v_0$ and $v_1$ are the two entries of the parameter v and $\mathbf{e}_i$ denotes the $i$-th standard basis vector of $\mathbb{R}^d$.
Then do the following:

1. Show your code for fr.
2. Use show_function to plot fr on a grid of values for $\mathbf{v}$ in the square $[-3, 3]^2$ when w0 $\in\mathbb{R}^{7510}$ is the point returned by the call to start_over in a code cell below.
3. Run gd on fr starting at v0 = np.array([2., 1.]) and print out the outcome. Use default parameters for all named arguments to gd.

Notes:

- fr must "know" about values it depends on (such as f or w0 or 6420) even if these are not passed explicitly as parameters. In Python, this is actually very easy: if those values are known when fr is called, then fr knows about them as well, as illustrated by add_number below.

def add_number(x):
    return x + number

number = 3
add_number(2)

5

- Make sure you have run all the code given so far (including the retrieve and import statements introduced earlier).
- start_over also returns n_indices, which equals 1500.
- coords contains the indices of the two coordinates associated to $\mathbf{v}$ in the definition of $f_r$. Use coords rather than hard-wiring those two indices into your code.

%matplotlib inline
w0, n_indices = start_over()
coords = [6420, 6424]
# def fr(v, indices):
# ...
# show_function(fr, n_indices)
# v0 = np.array([2., 1.])
# start_over(fr, gd);
# f_star, v_star = gd(fr, v0, np.arange(n_indices))
# print_outcome(f_star, v_star)
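If you are unsure how to structure fr, here is a minimal sketch, assuming the additive restriction written above and the names w0, coords, and f already defined in this notebook; treat it as a starting point, not a reference solution:

# Minimal sketch of fr (assumes w0, coords, and f are defined in this notebook,
# and that the restriction adds v to the two coordinates listed in coords).
def fr(v, indices):
    w = w0.copy()                    # leave w0 itself untouched
    w[coords] = w[coords] + v        # perturb the two restricted coordinates
    z, g = f(w, indices)             # value and full d-dimensional gradient of f
    return z, g[coords]              # chain rule: keep the two relevant entries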
If you were to compute a local minimum of fr
by gradient descent starting at $\mathbf{v}_0 = (2, -2)$, what value $\mathbf{v}^\ast$ do you think this computation would find, approximately? Justify your answer briefly.
You can answer this question by inspection of the plot that show_function
produced in the previous problem, and it is not necessary to actually run gradient descent.
You may use gd
to check your answers (do not submit those checks), but you still need to justify your answers with reference to the plot made by show_function
, as if you had not run gd
.
The horizontal coordinate in that plot is $v_0$ and the vertical coordinate is $v_1$.
Grid points are 0.2 units apart. It is OK if your answer is off by a grid cell or so.
If you were to start at $\mathbf{v}_1' = (2, 1)$ instead, would gradient descent find a point where $f_r$ has a lower value than it had at the point $\mathbf{v}^\ast$ you found in the previous problem? Justify your answer briefly.
The same Notes as in Problem 1.2 hold here, including the bullet on the style of your justification.
We now write a generator sgd
that implements Stochastic Gradient Descent (SGD). The generator is decorated with @Display
, just as gd
was, so we can see the path traversed when the function depends on just two variables.
The general idea of SGD is quite simple: Split the indices into batches and run Gradient Descent (GD) on averages estimated on each batch, rather than on the entire index set. When all the indices have been visited once, an epoch ends and the next one begins. At the beginning of each epoch the indices are randomly permuted, so that batches are different in each epoch.
In its simplest form SGD does not have a termination check, because it would be too expensive to determine if the overall average is small enough. Instead, SGD simply runs for a prespecified number of epochs. This is the version we implement here.
Since minimizing an average of many functions of many variables can take very long, it is common for the algorithm to display intermediate values of $f(\mathbf{w})$, so that whoever runs it can monitor progress. To this end, it is important to make these displays meaningful. This topic is discussed next.
We then consider which learning rate to use in SGD.
Because of how the @Display
decorator works, sgd
must be a generator, that is, use yield
statements rather than return
statements.
We need to think a bit carefully about what @Display
should record, and therefore what sgd
should yield. We think this through for the function $f$, with the understanding that all considerations hold also for its restriction $f_r$.
SGD approximates the overall average $f(\mathbf{w})$ with batch averages $f_B(\mathbf{w})$. "Overall" means "computed from all the data."
Since the batch averages are noisy estimates of $f(\mathbf{w})$, it is preferable to display the overall average $f(\mathbf{w}_e)$ at the end of each epoch, rather than the batch averages $f_B(\mathbf{w}_b)$ at the end of each batch.
However, it would be too expensive to call $f$ on the entire index set at the end of each epoch to compute $f(\mathbf{w}_e)$. Instead, this value is approximated as follows.
Suppose for simplicity that the total number $N$ of indices is an integer multiple of the batch size $s = |B|$. Then there are $b = N / s$ batches of equal size. Let $B_\beta$ be batch number $\beta$ for $\beta = 0,\ldots, b-1$, so that $f_{B_\beta}(\mathbf{w})$ is the average over batch $B_\beta$,
$$ f_{B_\beta}(\mathbf{w}) = \frac{1}{s} \sum_{n\in B_\beta} \phi_n(\mathbf{w}) \;. $$

For any $\mathbf{w}\in\mathbb{R}^d$ we can write

$$ f(\mathbf{w}) = \frac{1}{N} \sum_{n=0}^{N-1} \phi_n(\mathbf{w}) = \frac{1}{bs} \sum_{\beta=0}^{b-1} \sum_{n\in B_\beta} \phi_n(\mathbf{w}) = \frac{1}{bs} \sum_{\beta=0}^{b-1} s f_{B_\beta}(\mathbf{w}) = \frac{1}{b} \sum_{\beta=0}^{b-1} f_{B_\beta}(\mathbf{w}) \;. $$

In words, the full average at $\mathbf{w}$ is the average of the batch averages at $\mathbf{w}$.
Let now $f_\beta$ be the batch average computed on batch number $\beta$ during an epoch of SGD. The epoch average $f_e$ for that epoch is the average of these batch averages:
$$ f_e = \frac{1}{b} \sum_{\beta=0}^{b-1} f_\beta \;. $$

In practice, $f_e$ can be computed in one of two ways:

- Method 1: Accumulate the plain sum of the $f_\beta$ over the epoch and divide the total by $b$ at the end.
- Method 2: Accumulate the values $f_\beta / b$ over the epoch, so that the accumulator already holds the average when the epoch ends.

The two methods are of course mathematically equivalent. However, the first one can lead to very large values for the sum of all the $f_\beta$. Because of this, the second method is preferable.
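As a concrete illustration (with made-up numbers for the $f_\beta$), the two methods differ only in where the division by $b$ happens:

b = 4
batch_averages = [1.0, 2.0, 3.0, 2.0]    # made-up values of f_beta for one epoch

# Method 1: accumulate the plain sum, divide once at the end
f_sum = 0.
for f_beta in batch_averages:
    f_sum += f_beta                      # this running sum can grow very large
f_epoch_1 = f_sum / b

# Method 2: divide each term by b before accumulating (preferred)
f_epoch_2 = 0.
for f_beta in batch_averages:
    f_epoch_2 += f_beta / b

print(f_epoch_1, f_epoch_2)              # both print 2.0 here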
If GD uses a learning rate $\alpha_e$ to multiply by the gradient of $f$ on the entire index set, it makes sense for SGD to use a learning rate $\alpha$ for each mini-batch that is equal to

$$ \alpha = \frac{\alpha_e}{b} $$

where $b$ is the number of batches. Because of this, $\alpha_e$ is called the epoch learning rate and $\alpha$ is called the batch learning rate.

In this way, GD and SGD move by roughly comparable amounts in each epoch: GD takes a single step of $\alpha_e$ times the full gradient, while SGD takes $b$ steps of $\alpha$ times a batch gradient, and the batch gradients are noisy estimates of the full gradient.
In this assignment, the default learning rate $\alpha_e$ is set (rather aggressively) to 3 for both gd
and sgd
. Do not change these default values.
Explain briefly and clearly why the epoch average $f_e$ in a given epoch is not the same as the average $f(\mathbf{w}_e)$ over all the data, where $\mathbf{w}_e$ is the value of the independent variable $\mathbf{w}$ at the end of that epoch.
Also, does $f_e$ typically over-estimate or under-estimate $f(\mathbf{w}_e)$, and why?
Write a function whose definition starts as follows:
@Display
def sgd(fun, v, n_idx, batch_size, epoch_alpha=3., n_epochs=5):
    batch_size = min(batch_size, n_idx)
    rng = np.random.default_rng(0)
    v = v.copy()
and implements Stochastic Gradient Descent in compliance with the notes below.
yield Statements

Consistently with the discussion in the preamble to this part, your sgd generator will have exactly two yield statements, placed as follows:
At the end of each epoch, a statement of the form
yield f_epoch, v
where f_epoch
is the epoch average and v
is the current value of the independent variable. Make sure that this yield
is inside the epoch loop but outside the batch loop.
At the very end of sgd, a yield statement preceded by a computation of the true function value (the average over the full index set) at the final v. Thus, your function ends like this (assuming that v is the current value of the independent variable vector $\mathbf{w}$):
f_final = fun(v, np.arange(n_idx))[0]
yield f_final, v
Make sure that these two statements are outside all loops.
Notes:

- fun, v, and n_idx specify the function to be minimized (this will be $f_r$ in your experiments), the starting value of the vector of independent variables, and the number of indices in the definition of the function (that is, n_idx is $N=1500$).
- batch_size is the number of indices in each mini-batch. The first assignment in the body of sgd above makes sure that batch_size is no greater than n_idx. You may assume that n_idx is an integer multiple of batch_size.
- The random number generator rng defined above uses the same seed for all of us, so your results will be comparable when we grade them.
- epoch_alpha is the epoch learning rate. Compute the batch learning rate alpha from this.
- At the beginning of each epoch, create a scrambled vector idx that contains all the indices as follows:

idx = rng.permutation(n_idx)

Mini-batches are then formed by taking the first chunk of batch_size indices out of idx, and then the second chunk, and so forth.
Making a copy of v
, as done above, makes sure that the value of v
the caller passes to sgd
is not modified.
Remember, there is no termination check. Just go through n_epochs
epochs.
Use Method 2 to compute the epoch average.
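To make the structure concrete, here is one way the pieces could fit together. This is a sketch that follows the notes above (it relies on the Display and numpy imports from earlier cells, assumes n_idx is a multiple of batch_size, and uses Method 2), not a polished reference implementation:

@Display
def sgd(fun, v, n_idx, batch_size, epoch_alpha=3., n_epochs=5):
    batch_size = min(batch_size, n_idx)
    rng = np.random.default_rng(0)
    v = v.copy()
    b = n_idx // batch_size                  # number of batches per epoch
    alpha = epoch_alpha / b                  # batch learning rate
    for epoch in range(n_epochs):
        idx = rng.permutation(n_idx)         # scramble the indices
        f_epoch = 0.
        for beta in range(b):
            batch = idx[beta * batch_size:(beta + 1) * batch_size]
            f_batch, g = fun(v, batch)       # batch average and gradient
            v = v - alpha * g                # one SGD step
            f_epoch += f_batch / b           # Method 2 running average
        yield f_epoch, v                     # end of epoch
    f_final = fun(v, np.arange(n_idx))[0]    # true function value at the final v
    yield f_final, v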
Before running sgd
, you must run
start_over(fr, sgd)
This will perform various essential housekeeping chores.
Then do the following:

1. Show your code for sgd.
2. Explain briefly what inputs make sgd and gd do the exact same thing.
3. Give sgd appropriate inputs so that it gives the same result as gd for the experiment you ran in Problem 1.1 (same fr, same v0), and use print_outcome to show the outcome, as done above for gd. Use default parameters for all named arguments to sgd except n_epochs.

Notes:

- The experiment in task 3 requires you to determine the number of iterations gd ran for convergence, since sgd has no termination condition. The generator gd prints out that number; just look at the printed output in Problem 1.1.
- All the decimal digits that print_outcome displays must coincide in your outcome values from gd and sgd. In addition, the number of calls of $\phi_n$ functions (also printed by print_outcome) should be the same for the two experiments.
Now run your experiment (with start_over(fr, sgd)
, a run of sgd
, and one of print_outcome
) with batch size 100. All the other parameters remain the same, including fr
as the function to optimize, v0
as the starting point, n_indices
, and n_epochs
.
Then answer the following questions:
- In each epoch, how many times does sgd take a step (that is, how many times does v change), approximately?
- How does this number compare with the corresponding count for the run of sgd in problem 2.2?
- How do the two runs compare in the number of calls of $\phi_n$ functions (as printed by print_outcome)?
).Your answers to some of the questions above hopefully suggest that the running time ought to be about the same for the two runs of sgd
if the main computational expense is a call of a $\phi_n$ function.
On the other hand, you may have noticed that the second run of SGD is actually slower than the first. So SGD seems to be slower than its GD-equivalent version. This is surprising, since one of the motivations for SGD is efficiency given a fixed storage budget (that is, how much data we can fit into a GPU).
However, the storage budget is not fixed in these experiments. We are not running on the GPU, and the CPU can easily hold all the data at the same time. Thus, the storage-efficiency argument is moot in this small test case.
Other advantages of SGD still hold. For instance, the randomness in the estimates of the batch gradients in SGD will still get SGD out of any saddle points it may encounter, while GD would get stuck there since the gradient would remain zero.
To explain why SGD is slower than the GD-equivalent run of SGD in the last two experiments, note that the main computational expense here is not a call of a $\phi_n$ function. This would indeed be the main expense if all your code were implemented in a low-level language like C.
In Python, on the other hand, the main expense is by far managing the infrastructure used by the numpy
package. Thus, Python running times are essentially meaningless if the algorithms are eventually written in a low level language, as they should be for efficiency, and if storage efficiency is not paramount.
The entries $z_i$ of the output
$$ \mathbf{z} = (z_0, \ldots, z_{e-1}) = f(\mathbf{x}) $$

of a neural network $f \ :\ \mathbb{R}^d \rightarrow \mathbb{R}^e$ are called the logits.
We consider the following two uses of this network:

- By itself, as a regressor that returns an estimate $\hat{y} = \mathbf{z}$ of some vector whose true value is $y$.
- As a component of a classifier $\hat{y} = h(\mathbf{x})$ with classes in $Y = \{0, \ldots, e-1\}$, implemented by predicting the class with the highest logit:

$$ \hat{y} = h(\mathbf{x}) = \arg\max_{y\in Y} z_y \;. $$
Given a training set
$$ T = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\} $$

the network is trained by finding a minimum $\hat{\mathbf{w}}$ of the empirical training risk $L_T$ on $T$:

$$ L_T(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, f(\mathbf{x}_n))\;. $$

Note that $y_n$ is a vector with $e$ entries for the regressor and a scalar for the classifier.
In the expression above the vector $\mathbf{w}$ collects the weights of the network and $\ell$ is a suitable loss function.
Specifically, the regressor is trained with the quadratic loss
$$ \ell(\mathbf{y}, \mathbf{z}) = \|\mathbf{y} - \mathbf{z}\|^2 \;:\; \mathbb{R}^e \times \mathbb{R}^e \rightarrow \mathbb{R} $$

where $\|\cdot\|$ denotes the Euclidean norm and we use a bold letter for the true value $y = \mathbf{y}$ to remind ourselves that this is a vector.
The classifier, on the other hand, is trained with the cross-entropy loss applied to the soft-max transformation of the logits:
$$ \ell(y, \mathbf{z}) = -\log [\boldsymbol{\delta}(y)^T \boldsymbol{\sigma}(\mathbf{z})] \;:\; \mathbb{R} \times \mathbb{R}^e \rightarrow \mathbb{R}\;. $$We use natural logarithms here.
In the last expression for $\ell$ above, $\boldsymbol{\delta}(y)$ is the one-hot encoding of the true label $y$. This encoding is a vector with $e$ entries: Entry $y$ is one and all others are zero.
For instance, if $e = 10$ and $y = 3$ we have
$$ \boldsymbol{\delta}(3) = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0) \;. $$

The symbol $\boldsymbol{\sigma}$ represents the soft-max function:

$$ \boldsymbol{\sigma}(\mathbf{z}) = \frac{e^{\mathbf{z}}}{\sum_{i=0}^{e-1} e^{z_i}} \;. $$

Because entry $y$ of $\boldsymbol{\delta}(y)$ is one and the other entries are zero, the expression for the loss for the classifier simplifies to the following:
$$ \ell(y, \mathbf{z}) = -\log \sigma_y(\mathbf{z}) = -\log \frac{e^{z_y}}{\sum_{i=0}^{e-1} e^{z_i}} = -z_y + \log \sum_{i=0}^{e-1} e^{z_i} \;. $$

Write a function with header
def quadratic_loss(y, z):
that returns the quadratic loss for a vector regressor. That is, your function should work for y
and z
of any positive length, as long as the two lengths are equal.
Then consider the scalar version, $e = 1$ and two training samples $(\mathbf{x}_0, y_0)$ and $(\mathbf{x}_1, y_1)$ with true values $y_0 = 0$ and $y_1 = 1$.
Show a single diagram with plots of the two loss functions $\ell(y_0, z)$ and $\ell(y_1, z)$ as $z$ ranges between -3 and 3.
It is OK to use explicit for
loops.
Your diagram should include a legend that shows which plot is for which true value.
Label the axes meaningfully.
You may want to encapsulate your plotting into a function, since you will be asked to make a similar plot for the cross-entropy loss.
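As a sketch of both pieces (the loss and the plot), assuming numpy arrays or lists of equal length for y and z; adapt freely, for instance by encapsulating the plotting as suggested above:

import numpy as np
from matplotlib import pyplot as plt

def quadratic_loss(y, z):
    d = np.asarray(y, dtype=float) - np.asarray(z, dtype=float)
    return np.dot(d, d)                      # squared Euclidean norm

# Scalar case: plot the loss for the two true values y0 = 0 and y1 = 1
zs = np.linspace(-3., 3., 200)
for y_true in (0., 1.):
    losses = [quadratic_loss([y_true], [z]) for z in zs]
    plt.plot(zs, losses, label='true value y = {}'.format(int(y_true)))
plt.xlabel('output z')
plt.ylabel('quadratic loss')
plt.legend()
plt.show()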
Let $e = 2$. Show that in this case the soft-max function of the vector $\mathbf{z} = (z_0, z_1)$ depends only on the difference $z_0 - z_1$. That is, if
$$ \mathbf{p} = (p_0, p_1) = \boldsymbol{\sigma}((z_0, z_1)) $$

then there exist two functions $s_0$ and $s_1$ from $\mathbb{R}$ to $\mathbb{R}$ such that

$$ p_0 = s_0(d) \;\;\;\;\;\;\text{and}\;\;\;\;\;\; p_1 = s_1(d) \;\;\;\;\;\;\text{where}\;\;\;\;\;\; d = z_0 - z_1\;. $$

Your proof should be constructive, that is, you should specify the two functions $s_0$ and $s_1$.
Just write out the definition of soft-max for each of the two entries in the binary case and manipulate the resulting expressions.
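For instance, one possible route, which essentially carries out this hint, is to divide the numerator and denominator of each entry by $e^{z_0}$ or $e^{z_1}$, respectively:

$$ p_0 = \frac{e^{z_0}}{e^{z_0} + e^{z_1}} = \frac{1}{1 + e^{z_1 - z_0}} = \frac{1}{1 + e^{-d}} = s_0(d) \;\;\;\;\;\;\text{and}\;\;\;\;\;\; p_1 = \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{1}{1 + e^{d}} = s_1(d) \;. $$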
Write a function with header
def cross_entropy_loss(y, z):
that returns the cross-entropy loss for a classifier with any positive number of classes as a function of true label y
and logits z
.
Your function should work when z
is a vector of any positive length $e$ and the true label y
is any integer scalar in $[0, e-1]$.
Then consider the binary version, $e = 2$ and two training samples $(\mathbf{x}_0, y_0)$ and $(\mathbf{x}_1, y_1)$ with true values $y_0 = 0$ and $y_1 = 1$.
Show a single diagram with plots of the loss functions for these two true values as the difference $d = z_0 - z_1$ ranges between -3 and 3.
It is OK to use explicit for
loops.
Your diagram should include a legend that shows which plot is for which true value.
Label the axes meaningfully.
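A possible sketch, using the simplified expression $-z_y + \log \sum_i e^{z_i}$ derived above; subtracting $\max_i z_i$ inside the log-sum-exp is a standard guard against overflow and does not change the value:

import numpy as np
from matplotlib import pyplot as plt

def cross_entropy_loss(y, z):
    z = np.asarray(z, dtype=float)
    m = z.max()                              # stabilize the exponentials
    return -z[y] + m + np.log(np.sum(np.exp(z - m)))

# Binary case: plot the loss against d = z0 - z1; fixing z1 = 0 is legitimate
# because the loss depends only on d (see the previous problem)
ds = np.linspace(-3., 3., 200)
for y_true in (0, 1):
    losses = [cross_entropy_loss(y_true, [d, 0.]) for d in ds]
    plt.plot(ds, losses, label='true label y = {}'.format(y_true))
plt.xlabel('logit difference d = z0 - z1')
plt.ylabel('cross-entropy loss')
plt.legend()
plt.show()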
A layer in a neural network has an input of shape $w\times h\times c$, where $c$ is the number of input channels, and an output of shape $w\times h\times d$, where $d$ is the number of output channels.

Give expressions for the numbers $m_1$ and $m_2$ of weights for this layer in each of the following cases:

- $m_1$: the layer is fully connected;
- $m_2$: the layer is convolutional with kernels of size $k\times k$.

Your expressions should be in terms of $w, h, c, d, k$.
The tiny neural network in the figure below has two layers, three neurons, and an input vector $\mathbf{x}$ with three components $x_1, x_2, x_3$. Neurons $n_1$ and $n_2$ have a ReLU nonlinearity while $n_3$ is linear (that is, it has no non-linearity). The network gains are shown near the edges and the biases are $b_1, b_2, b_3$.
Thus, the relation $\hat{y} = h(\mathbf{x})$ between input vector $\mathbf{x} = (x_1, x_2, x_3)$ and output $\hat{y}$ from the network is as follows:
$$ \begin{eqnarray*} a_1 &=& u_{11} x_1 + u_{12} x_2 + u_{13} x_3 + b_1 \\ a_2 &=& u_{21} x_1 + u_{22} x_2 + u_{23} x_3 + b_2 \\ z_1 &=& \max(0, a_1) \\ z_2 &=& \max(0, a_2) \\ \hat{y} &=& v_1 z_1 + v_2 z_2 + b_3 \end{eqnarray*} $$

The variables $a_1$ and $a_2$ in these expressions denote the activations of neurons $n_1$ and $n_2$, that is, the values of their outputs before they are passed through the ReLU $\rho(a) = \max(0, a)$. Making $a_1$ and $a_2$ explicit will simplify your work later on.
The numerical values of the weights for this network are as follows:
$$ \begin{array}{llll} u_{11} = 3 & u_{12} = -1 & u_{13} = 0 & b_1 = -4 \\ u_{21} = -1 & u_{22} = 1 & u_{23} = 3 & b_2 = 1 \\ v_1 = 4 & v_2 = 1 & b_3 = 0 \end{array} $$

We collect all the network weights into a vector $\mathbf{w}$ in the following order:

$$ \mathbf{w} = (u_{11}, u_{12}, u_{13}, b_1, u_{21}, u_{22}, u_{23}, b_2, v_1, v_2, b_3) \in\mathbb{R}^{11} \;. $$

Given an input vector $\mathbf{x}$ with true output value $y$, the quadratic loss is used to measure estimation errors. This loss is defined as follows:

$$ \ell(y, \mathbf{x}) = \frac{1}{2} [h(\mathbf{x}) - y]^2 \;. $$
The division by 2 is introduced here for mathematical convenience, since it cancels a 2 that comes from differentiating the square when computing gradients.
To run a forward pass through a network means to compute $\hat{y} = h(\mathbf{x})$ for a specific input $\mathbf{x}$.
Run a forward pass through the network in the figure for input $$ \mathbf{x} = (-1, -2, 2) $$ giving, in this order, the numerical values of $a_1$, $a_2$, $z_1$, $z_2$, and $\hat{y}$. You will need these values in the next problem.
Finally, give the value of the quadratic loss if the true value for this input is $y = 2$.
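You should do this computation by hand, but a few lines of Python can double-check your arithmetic; the numbers below are copied from the problem statement:

import numpy as np

x = np.array([-1., -2., 2.])
u1, b1 = np.array([3., -1., 0.]), -4.        # weights of neuron n1
u2, b2 = np.array([-1., 1., 3.]), 1.         # weights of neuron n2
v1, v2, b3 = 4., 1., 0.                      # weights of neuron n3

a1 = u1 @ x + b1                             # activation of n1
a2 = u2 @ x + b2                             # activation of n2
z1, z2 = max(0., a1), max(0., a2)            # ReLU outputs
y_hat = v1 * z1 + v2 * z2 + b3               # network output
loss = 0.5 * (y_hat - 2.) ** 2               # quadratic loss for y = 2
print(a1, a2, z1, z2, y_hat, loss)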
Back-propagating through the network for a particular input $\mathbf{x}$ and set of values for the network weights in $\mathbf{w}$ means computing the gradient
$$ \mathbf{g} = \frac{\partial\ell}{\partial\mathbf{w}} \in \mathbb{R}^{11} $$

for that input and those weights.
The calculation is done as described in the class notes.
In the specific instance, the order of calculations is as follows, where $g_\alpha$ is shorthand for $\frac{\partial\ell}{\partial\alpha}$:
$$ g_{\hat{y}}, g_{b_3}, g_{v_2}, g_{v_1}, g_{z_2}, g_{z_1}, g_{a_2}, g_{a_1}, g_{b_2}, g_{u_{23}}, g_{u_{22}}, g_{u_{21}}, g_{b_1}, g_{u_{13}}, g_{u_{12}}, g_{u_{11}} $$

Back-propagate through the given network for the numerical input and true label values given above. Once you are done, also give the vector $g_{\mathbf{w}} = \frac{\partial\ell}{\partial\mathbf{w}}$ in numerical form, where the entries are now ordered as they are in the vector $\mathbf{w}$ defined in the previous problem.
This time, do show your calculations, using the format given in the homework template file. The first few lines in the template are done for you, so you see what format we are looking for.
You may use the expression $\sigma(a)$ to denote a sub-derivative of the ReLU function $\rho(a)$. One possible definition of this sub-derivative is $\sigma(a) = 1$ for $a > 0$ and $\sigma(a) = 0$ otherwise.
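If you want to sanity-check your hand-computed gradient (do not submit this check), a central finite-difference approximation over the 11 weights is a quick test. This sketch re-implements the forward pass and loss as a function of $\mathbf{w}$:

import numpy as np

def loss(w, x=np.array([-1., -2., 2.]), y=2.):
    # w = (u11, u12, u13, b1, u21, u22, u23, b2, v1, v2, b3)
    z1 = max(0., w[0:3] @ x + w[3])          # ReLU output of n1
    z2 = max(0., w[4:7] @ x + w[7])          # ReLU output of n2
    y_hat = w[8] * z1 + w[9] * z2 + w[10]    # network output
    return 0.5 * (y_hat - y) ** 2

w = np.array([3., -1., 0., -4., -1., 1., 3., 1., 4., 1., 0.])
g = np.empty_like(w)
delta = 1.e-6
for i in range(len(w)):                      # perturb one weight at a time
    wp, wm = w.copy(), w.copy()
    wp[i] += delta
    wm[i] -= delta
    g[i] = (loss(wp) - loss(wm)) / (2. * delta)
print(g)                                     # compare with back-propagation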
In parts 1 and 2 of this assignment you worked with a mystery function $f$ and its restriction to a two-dimensional subspace.
Here is what $f$ actually is: The input $\mathbf{w}$ is a set of $d=7510$ gain and bias values for a small neural network. The value returned by $f$ is the average cross-entropy loss on an annotated training data set of $N = 1500$ hand-written digits. The gradient that $f$ returns is computed by back-propagation on this training set.
The network has just two layers and a soft-max transformation at the output. The two layers are fully connected. A network with this structure (cascade of fully-connected layers) is called a Multi-Layer Perceptron.
The input layer takes as its input the 64 pixel values of certain tiny images. This layer has 100 neurons, and therefore 6400 gains and 100 biases, for a total of 6500 weights. The 100 outputs are passed through a ReLU. The second layer takes the 100 values from the output of the ReLU and produces 10 values, called logits, one logit per possible digit value in $Y = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$. There is no ReLU here. So the second layer has 1000 gains and 10 biases, for a total of 1010 weights.
The output from the second layer is passed to the soft-max to produce 10 softmax scores. The soft-max transformation is not learnable, so it has no parameters.
In total, the number of weights is therefore $6500 + 1010 = 7510$, the length of $\mathbf{w}$.
The training data consists of 1500 black-and-white images of handwritten digits from the MNIST database, resized to $8\times 8$ pixel images so you can finish this assignment in a reasonable amount of time.
A separate set of 297 images from the same set, similarly resized, is reserved for testing. You can load and look at a subset of the data as follows:
from function import get_data
data = get_data()
from matplotlib import pyplot as plt

t = data['training']
plt.figure(figsize=(8, 8), tight_layout=True)
rows = 6
for k in range(rows ** 2):
    plt.subplot(rows, rows, k + 1)
    img = t['x'][k].reshape((8, 8))
    plt.imshow(img, cmap=plt.cm.gray)
    plt.axis('image')
    plt.axis('off')
    plt.title('{}'.format(t['y'][k]))
plt.show()
Note that data['training']['x']
gives you all 1500 images in a $1500\times 64$ numpy array, one image per row, which the code above reshapes into an $8\times 8$ array.
The labels are digits between 0 and 9 in data['training']['y']
.
You can get the test data similarly via data['test']
.
However, our interaction with the data in the problems below is more abstract, and occurs through vectors of indices.
Specifically, you can first just train your system by minimizing f
, the function from part 1 of this assignment, starting from a random set of network weights. We start with a less aggressive learning rate and more epochs, given the much higher dimensionality of the weight space (uncomment below to execute once you have sgd
).
# w0, n_indices = start_over(f, sgd)
# f_star, w_star = sgd(f, w0, n_indices, n_indices,
#                      epoch_alpha=0.5, n_epochs=500)
# print_outcome(f_star, w_star)
Of course, the @Display
decorator is now unable to print the path in the space of all weights, as this space has dimension 7510, so it only prints the function values.
You can get a predictor function from the resulting vector w_star
of weights like this (uncomment):
# from function import get_score_predictor
# predict_scores = get_score_predictor(w_star)
The function predict_scores
does not take an image as input. Instead, it takes a vector of indices and a string that is either 'training'
or 'test'
. For instance, the following call prints a $3\times 10$ array of positive real values (uncomment):
# idx, which_set = [10, 25, 40], 'training'
# with np.printoptions(precision=2):
# print(predict_scores(idx, which_set))
The ten numbers in the first row are the soft-max scores for the ten digit categories (ordered 0 through 9). A more direct answer (what is the digit for each of the input images) can then be obtained by determining the category with the highest score in each row (uncomment):
# scores = predict_scores(idx, which_set)
# print(np.argmax(scores, axis=1))
While we could in principle remove the soft-max and work directly with the logits, we don't have access to the logits given how the network is constructed, so we'll work with the soft-max scores instead. The soft-max function is monotonic, so working with logits or soft-max scores is the same in terms of classification.
If you wanted to check the results visually you could do the following (uncomment):
# plt.figure(figsize=(8, 3), tight_layout=True)
# for plot, (index, score) in enumerate(zip(idx, scores)):
#     plt.subplot(1, len(idx), plot + 1)
#     image = data[which_set]['x'][index].reshape(8, 8)
#     y_hat, y = np.argmax(score), data[which_set]['y'][index]
#     plt.imshow(image, cmap=plt.cm.gray)
#     plt.axis('off')
#     plt.title('true {}, predicted {}'.format(y, y_hat))
# plt.show()
Write functions with headers
def predict(h, d, set_type):
def error_rate(y, y_hat):
def evaluate(h, d, set_type):
The function predict
takes a score predictor h
(such as predict_scores
), a data set (such as data
), and a string set_type
that is either 'training'
or 'test'
and returns a numpy
vector y_hat
with the digit labels inferred from the predictor's output for all of the data in the specified set (training or test set).
The function error_rate
takes two numpy
vectors y
and y_hat
of true and predicted digit labels (these are vectors of integers in $[0, 9]$) and returns the fraction of times that an entry in y
differs from the corresponding entry in y_hat
. So the output is a real number between 0 and 1 inclusive.
The inputs to the function evaluate
are the same as for predict
. The function evaluate
prints the error rate on the specified set (training or test set) as a percentage value (so you need to multiply the output from error_rate
by 100).
The format of the printout is
'{} error rate {:.2f} percent'
where the first bracket is replaced by set_type
and the second by the percentage (with two decimal digits after the period).
Show your code and the output from the calls below, uncommented.
Do not use explicit for
loops in your code.
Then answer the following questions:
Does the network achieve good accuracy on the training set?
Does the network generalize well? Justify your answer briefly.
# evaluate(predict_scores, data, 'training')
# evaluate(predict_scores, data, 'test')
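One possible loop-free sketch of the three functions, assuming the data layout described above (np.arange(n) selects every item in the chosen set):

def predict(h, d, set_type):
    n = len(d[set_type]['y'])                # number of items in the set
    scores = h(np.arange(n), set_type)       # one row of scores per item
    return np.argmax(scores, axis=1)         # highest-scoring digit per row

def error_rate(y, y_hat):
    return np.mean(y != y_hat)               # fraction of mismatched labels

def evaluate(h, d, set_type):
    rate = error_rate(d[set_type]['y'], predict(h, d, set_type))
    print('{} error rate {:.2f} percent'.format(set_type, 100 * rate))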
The confusion matrix for a classifier with output set $Y$ with cardinality $m$ and on a data set $D$ is an $m\times m$ matrix $C$ whose entry $c_{tp}$ in row $t$ and column $p$ is the number of entries in $D$ whose true label is $t$ and for which the classifier predicts label $p$. In our case $m = 10$.
Write a function with header
def confusion_matrix(h, d, set_type):
that takes the same arguments as the functions predict
and evaluate
in the previous problem and returns the confusion matrix for the given score predictor, data set, and set type.
Show your code and the output from the calls below, uncommented.
It is OK to use nested loops on $t$ and $p$ in your code, since these indices range over small sets, but use no other explicit for
loops.
Make sure the confusion matrix has integer type.
# print(confusion_matrix(predict_scores, data, 'training'))
# print(confusion_matrix(predict_scores, data, 'test'))
The package sklearn
has a similar function. However, for credit, you need to write confusion_matrix
without using that function from sklearn
or similar functions from other packages.
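A sketch that stays within the constraints above (nested loops over $t$ and $p$ only, and the predict function from the previous problem):

def confusion_matrix(h, d, set_type):
    y = d[set_type]['y']                     # true labels
    y_hat = predict(h, d, set_type)          # predicted labels
    m = 10                                   # number of classes
    c = np.zeros((m, m), dtype=int)          # integer-typed, as required
    for t in range(m):                       # true label t
        for p in range(m):                   # predicted label p
            c[t, p] = np.sum((y == t) & (y_hat == p))
    return c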
Then give expressions in terms of the entries $c_{tp}$ and shape parameter $m$ of $C$ for empirical estimates of the following probabilities:

No need to justify your formulas if you are confident about them.