When you submit your work, follow the instructions on the submission workflow page scrupulously for full credit.
Instead of handing out sample exams in preparation for the midterm and final, the homework assignments include several problems marked Exam Style. These problems are similar in format and content to exam problems, and are graded like the other problems. Please use them as examples of what may come up in an exam.
As a further exercise in preparation for exams, you may want to think of other plausible problems with a similar style. If you collect these problems in a separate notebook as the course progresses, you will have a useful sample you can use to rehearse for the exams.
Is this assignment properly formatted in the PDF submission, and did you tell Gradescope accurately where each answer is?
This question requires no explicit answer from you, and no space is provided for an answer in the template file.
However, the rubric entry for this problem in Gradescope will be used to deduct points for poor formatting of the assignment (PDF is not readable, code or text is clipped, math is not typeset, ...) and for incorrect indications in Gradescope of the location of each answer.
Sorry we have to do this, but poor formatting and bad answer links in Gradescope make our grading much harder.
We revisit gradient descent in the context of functions that are averages of other functions. This will let us set up some code that we can use for Stochastic Gradient Descent (SGD) later in this assignment.
More importantly for this Part, we take a function $f$ of a large number of variables, which makes $f$ relatively expensive to compute, and restrict it to a two-dimensional subspace of the domain of $f$, so we can plot the path that gradient descent traverses when minimizing the restriction.
Let us retrieve a file function.py and import from it the implementation f of a function $f:\mathbb{R}^d \rightarrow \mathbb{R}$ that is an average of $N$ functions $\phi_n$:

$$ f(\mathbf{w}) = \frac{1}{N} \sum_{n=0}^{N-1} \phi_n(\mathbf{w}) \;. $$
This is a rather "large" function, in that $d = 7510$ and $N = 1500$. What the function computes is irrelevant to this Part.
We also retrieve a file display.py
with some visualization utilities and a file gd.py
with a reference implementation of gradient descent.
import urllib.request
from os import path as osp

def retrieve(file_name, semester='spring23', homework=3):
    if osp.exists(file_name):
        print('Using previously downloaded file {}'.format(file_name))
    else:
        fmt = 'https://www2.cs.duke.edu/courses/{}/compsci527/homework/{}/{}'
        url = fmt.format(semester, homework, file_name)
        urllib.request.urlretrieve(url, file_name)
        print('Downloaded file {}'.format(file_name))

retrieve('function.py')
retrieve('display.py')
retrieve('gd.py')
Using previously downloaded file function.py
Using previously downloaded file display.py
Using previously downloaded file gd.py
from function import f, start_over, print_outcome
from display import Display, show_function
from gd import gd
import numpy as np
If you call

z, g = f(w, indices)

where w is some numpy array of $d$ floating-point numbers and indices is a list of integers in $[0, N)$, then f returns the value z and gradient g of the function

$$ f_B(\mathbf{w}) = \frac{1}{|B|} \sum_{n\in B} \phi_n(\mathbf{w}) $$

where $B$ is the set of integers in indices and $|B|$ denotes the cardinality of $B$.
This additional argument to f
will help in SGD, where we need to compute averages over mini-batches $B$ rather than on the entire set $[0, N)$.
Of course, to compute the value and gradient on the entire set of indices $[0, N)$ you just say
z, g = f(w, np.arange(n_indices))
where n_indices
is $N = 1500$.
The generator gd imported from gd.py above is, more or less, the implementation of gradient descent from the last homework assignment. Look at the function gd in the file just retrieved and note the @Display decorator and the yield statements in place of return statements. If you forgot what this means, please refer back to Homework 2.
The decorator is a bit simpler now in that it only shows the descent path and function values rather than also displaying iso-contours of the function.
The figure below shows plots produced by gd
run on a function fr
of two variables that you will implement in Problem 1.1. This figure is just to give you an idea of what to expect in that problem.
Write a function with header

def fr(v, indices):

This function restricts the function $f$ defined above to dimensions 6420 and 6424 of the input space of $f$ around a point w0 $\in\mathbb{R}^{d}$ (with $d = 7510$) given later on.

The role of the second parameter indices for fr is the same as that for f.

The restriction fr of f is defined as follows:

$$ f_r(\mathbf{v}) = f(\mathbf{w}_0 + v_0\, \mathbf{e}_{6420} + v_1\, \mathbf{e}_{6424}) $$

where $v_0$ and $v_1$ are the two entries of the parameter v and $\mathbf{e}_i$ denotes the $i$-th standard basis vector of $\mathbb{R}^d$.
Then do the following:

1. Show your code for fr.
2. Use show_function to plot fr on a grid of values for $\mathbf{v}$ in the square $[-3, 3]^2$ when w0 $\in\mathbb{R}^{7510}$ is the point returned by the call to start_over in a code cell below.
3. Run gd on fr starting at v0 = np.array([2., 1.]) and print out the outcome. Use default parameters for all named arguments to gd.

Notes:

- fr must "know" about values it depends on (such as f or w0 or 6420) even if these are not passed explicitly as parameters. In Python, this is actually very easy: if those values are known when fr is called, then fr knows about them as well, as illustrated by add_number below.

def add_number(x):
    return x + number

number = 3
add_number(2)

5

- Make sure you have run all the code given so far (including the retrieve and import statements introduced earlier).
- start_over also returns n_indices, which equals 1500.
- coords contains the indices of the two coordinates associated to $\mathbf{v}$ in the definition of $f_r$. Use coords rather than hard-wiring those two indices into your code.

%matplotlib inline
w0, n_indices = start_over()
coords = [6420, 6424]
# def fr(v, indices):
# ...
# show_function(fr, n_indices)
# v0 = np.array([2., 1.])
# start_over(fr, gd);
# f_star, v_star = gd(fr, v0, np.arange(n_indices))
# print_outcome(f_star, v_star)
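If you are unsure how to structure fr, here is a minimal sketch, assuming the additive restriction written above and the names w0, coords, and f already defined in this notebook; treat it as a starting point, not a reference solution:

# Minimal sketch of fr (assumes w0, coords, and f are defined in this notebook,
# and that the restriction adds v to the two coordinates listed in coords).
def fr(v, indices):
    w = w0.copy()                    # leave w0 itself untouched
    w[coords] = w[coords] + v        # perturb the two restricted coordinates
    z, g = f(w, indices)             # value and full d-dimensional gradient of f
    return z, g[coords]              # chain rule: keep the two relevant entries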
If you were to compute a local minimum of fr
by gradient descent starting at $\mathbf{v}_0 = (2, -2)$, what value $\mathbf{v}^\ast$ do you think this computation would find, approximately? Justify your answer briefly.
You can answer this question by inspection of the plot that show_function
produced in the previous problem, and it is not necessary to actually run gradient descent.
You may use gd
to check your answers (do not submit those checks), but you still need to justify your answers with reference to the plot made by show_function
, as if you had not run gd
.
The horizontal coordinate in that plot is $v_0$ and the vertical coordinate is $v_1$.
Grid points are 0.2 units apart. It is OK if your answer is off by a grid cell or so.
If you were to start at $\mathbf{v}_1' = (2, 1)$ instead, would gradient descent find a point where $f_r$ has a lower value than it had at the point $\mathbf{v}^\ast$ you found in the previous problem? Justify your answer briefly.
The same Notes as in Problem 1.2 hold here, including the bullet on the style of your justification.
We now write a generator sgd
that implements Stochastic Gradient Descent (SGD). The generator is decorated with @Display
, just as gd
was, so we can see the path traversed when the function depends on just two variables.
The general idea of SGD is quite simple: Split the indices into batches and run Gradient Descent (GD) on averages estimated on each batch, rather than on the entire index set. When all the indices have been visited once, an epoch ends and the next one begins. At the beginning of each epoch the indices are randomly permuted, so that batches are different in each epoch.
In its simplest form SGD does not have a termination check, because it would be too expensive to determine if the overall average is small enough. Instead, SGD simply runs for a prespecified number of epochs. This is the version we implement here.
Since minimizing an average of many functions of many variables can take very long, it is common for the algorithm to display intermediate values of $f(\mathbf{w})$, so that whoever runs it can monitor progress. To this end, it is important to make these displays meaningful. This topic is discussed next.
We then consider which learning rate to use in SGD.
Because of how the @Display
decorator works, sgd
must be a generator, that is, use yield
statements rather than return
statements.
We need to think a bit carefully about what @Display
should record, and therefore what sgd
should yield. We think this through for the function $f$, with the understanding that all considerations hold also for its restriction $f_r$.
SGD approximates the overall average $f(\mathbf{w})$ with batch averages $f_B(\mathbf{w})$. "Overall" means "computed from all the data."
Since the batch averages are noisy estimates of $f(\mathbf{w})$, it is preferable to display the overall average $f(\mathbf{w}_e)$ at the end of each epoch, rather than the batch averages $f_B(\mathbf{w}_b)$ at the end of each batch.
However, it would be too expensive to call $f$ on the entire index set at the end of each epoch to compute $f(\mathbf{w}_e)$. Instead, this value is approximated as follows.
Suppose for simplicity that the total number $N$ of indices is an integer multiple of the batch size $s = |B|$. Then there are $b = N / s$ batches of equal size. Let $B_\beta$ be batch number $\beta$ for $\beta = 0,\ldots, b-1$, so that $f_{B_\beta}(\mathbf{w})$ is the average over batch $B_\beta$,
$$ f_{B_\beta}(\mathbf{w}) = \frac{1}{s} \sum_{n\in B_\beta} \phi_n(\mathbf{w}) \;. $$

For any $\mathbf{w}\in\mathbb{R}^d$ we can write

$$ f(\mathbf{w}) = \frac{1}{N} \sum_{n=0}^{N-1} \phi_n(\mathbf{w}) = \frac{1}{bs} \sum_{\beta=0}^{b-1} \sum_{n\in B_\beta} \phi_n(\mathbf{w}) = \frac{1}{bs} \sum_{\beta=0}^{b-1} s f_{B_\beta}(\mathbf{w}) = \frac{1}{b} \sum_{\beta=0}^{b-1} f_{B_\beta}(\mathbf{w}) \;. $$

In words, the full average at $\mathbf{w}$ is the average of the batch averages at $\mathbf{w}$.
Let now $f_\beta$ be the batch average computed on batch number $\beta$ during an epoch of SGD. The epoch average $f_e$ for that epoch is the average of these batch averages:
$$ f_e = \frac{1}{b} \sum_{\beta=0}^{b-1} f_\beta \;. $$

In practice, $f_e$ can be computed in one of two ways:

- Method 1: Accumulate the plain sum of the $f_\beta$ over the epoch and divide the total by $b$ at the end.
- Method 2: Accumulate the values $f_\beta / b$ over the epoch, so that the accumulator already holds the average when the epoch ends.

The two methods are of course mathematically equivalent. However, the first one can lead to very large values for the sum of all the $f_\beta$. Because of this, the second method is preferable.
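As a concrete illustration (with made-up numbers for the $f_\beta$), the two methods differ only in where the division by $b$ happens:

b = 4
batch_averages = [1.0, 2.0, 3.0, 2.0]    # made-up values of f_beta for one epoch

# Method 1: accumulate the plain sum, divide once at the end
f_sum = 0.
for f_beta in batch_averages:
    f_sum += f_beta                      # this running sum can grow very large
f_epoch_1 = f_sum / b

# Method 2: divide each term by b before accumulating (preferred)
f_epoch_2 = 0.
for f_beta in batch_averages:
    f_epoch_2 += f_beta / b

print(f_epoch_1, f_epoch_2)              # both print 2.0 here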
If GD uses a learning rate $\alpha_e$ to multiply by the gradient of $f$ on the entire index set, it makes sense for SGD to use a learning rate $\alpha$ for each mini-batch that is equal to

$$ \alpha = \frac{\alpha_e}{b} $$

where $b$ is the number of batches. Because of this, $\alpha_e$ is called the epoch learning rate and $\alpha$ is called the batch learning rate.

In this way, GD and SGD move by roughly comparable amounts in each epoch: GD takes a single step of $\alpha_e$ times the full gradient, while SGD takes $b$ steps of $\alpha$ times a batch gradient, and the batch gradients are noisy estimates of the full gradient.
In this assignment, the default learning rate $\alpha_e$ is set (rather aggressively) to 3 for both gd
and sgd
. Do not change these default values.
Explain briefly and clearly why the epoch average $f_e$ in a given epoch is not the same as the average $f(\mathbf{w}_e)$ over all the data, where $\mathbf{w}_e$ is the value of the independent variable $\mathbf{w}$ at the end of that epoch.
Also, does $f_e$ typically over-estimate or under-estimate $f(\mathbf{w}_e)$, and why?
Write a function whose definition starts as follows:
@Display
def sgd(fun, v, n_idx, batch_size, epoch_alpha=3., n_epochs=5):
    batch_size = min(batch_size, n_idx)
    rng = np.random.default_rng(0)
    v = v.copy()
and implements Stochastic Gradient Descent in compliance with the notes below.
yield Statements

Consistently with the discussion in the preamble to this part, your sgd generator will have exactly two yield statements, placed as follows:
At the end of each epoch, a statement of the form
yield f_epoch, v
where f_epoch
is the epoch average and v
is the current value of the independent variable. Make sure that this yield
is inside the epoch loop but outside the batch loop.
At the very end of sgd, a yield statement preceded by a computation of the true function value (the average over the full index set) at the final v. Thus, your function ends like this (assuming that v is the current value of the independent variable vector $\mathbf{w}$):
f_final = fun(v, np.arange(n_idx))[0]
yield f_final, v
Make sure that these two statements are outside all loops.
Notes:

- fun, v, and n_idx specify the function to be minimized (this will be $f_r$ in your experiments), the starting value of the vector of independent variables, and the number of indices in the definition of the function (that is, n_idx is $N=1500$).
- batch_size is the number of indices in each mini-batch. The first assignment in the body of sgd above makes sure that batch_size is no greater than n_idx. You may assume that n_idx is an integer multiple of batch_size.
- The random number generator rng defined above uses the same seed for all of us, so your results will be comparable when we grade them.
- epoch_alpha is the epoch learning rate. Compute the batch learning rate alpha from this.
- At the beginning of each epoch, create a scrambled vector idx that contains all the indices as follows:

idx = rng.permutation(n_idx)

Mini-batches are then formed by taking the first chunk of batch_size indices out of idx, and then the second chunk, and so forth.
Making a copy of v
, as done above, makes sure that the value of v
the caller passes to sgd
is not modified.
Remember, there is no termination check. Just go through n_epochs
epochs.
Use Method 2 to compute the epoch average.
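To make the structure concrete, here is one way the pieces could fit together. This is a sketch that follows the notes above (it relies on the Display and numpy imports from earlier cells, assumes n_idx is a multiple of batch_size, and uses Method 2), not a polished reference implementation:

@Display
def sgd(fun, v, n_idx, batch_size, epoch_alpha=3., n_epochs=5):
    batch_size = min(batch_size, n_idx)
    rng = np.random.default_rng(0)
    v = v.copy()
    b = n_idx // batch_size                  # number of batches per epoch
    alpha = epoch_alpha / b                  # batch learning rate
    for epoch in range(n_epochs):
        idx = rng.permutation(n_idx)         # scramble the indices
        f_epoch = 0.
        for beta in range(b):
            batch = idx[beta * batch_size:(beta + 1) * batch_size]
            f_batch, g = fun(v, batch)       # batch average and gradient
            v = v - alpha * g                # one SGD step
            f_epoch += f_batch / b           # Method 2 running average
        yield f_epoch, v                     # end of epoch
    f_final = fun(v, np.arange(n_idx))[0]    # true function value at the final v
    yield f_final, v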
Before running sgd
, you must run
start_over(fr, sgd)
This will perform various essential housekeeping chores.
Then do the following:

1. Show your code for sgd.
2. Explain briefly what inputs make sgd and gd do the exact same thing.
3. Give sgd appropriate inputs so that it gives the same result as gd for the experiment you ran in Problem 1.1 (same fr, same v0), and use print_outcome to show the outcome, as done above for gd. Use default parameters for all named arguments to sgd except n_epochs.

Notes:

- The experiment in task 3 requires you to determine the number of iterations gd ran for convergence, since sgd has no termination condition. The generator gd prints out that number; just look at the printed output in Problem 1.1.
- All the decimal digits that print_outcome displays must coincide in your outcome values from gd and sgd. In addition, the number of calls of $\phi_n$ functions (also printed by print_outcome) should be the same for the two experiments.
Now run your experiment (with start_over(fr, sgd)
, a run of sgd
, and one of print_outcome
) with batch size 100. All the other parameters remain the same, including fr
as the function to optimize, v0
as the starting point, n_indices
, and n_epochs
.
Then answer the following questions:
- In each epoch, how many times does sgd take a step (that is, how many times does v change), approximately?
- How does this number compare with the corresponding count for the run of sgd in problem 2.2?
- How do the two runs compare in the number of calls of $\phi_n$ functions (as printed by print_outcome)?
).Your answers to some of the questions above hopefully suggest that the running time ought to be about the same for the two runs of sgd
if the main computational expense is a call of a $\phi_n$ function.
On the other hand, you may have noticed that the second run of SGD is actually slower than the first. So SGD seems to be slower than its GD-equivalent version. This is surprising, since one of the motivations for SGD is efficiency given a fixed storage budget (that is, how much data we can fit into a GPU).
However, the storage budget is not fixed in these experiments. We are not running on the GPU, and the CPU can easily hold all the data at the same time. Thus, the storage-efficiency argument is moot in this small test case.
Other advantages of SGD still hold. For instance, the randomness in the estimates of the batch gradients in SGD will still get SGD out of any saddle points it may encounter, while GD would get stuck there since the gradient would remain zero.
To explain why SGD is slower than the GD-equivalent run of SGD in the last two experiments, note that the main computational expense here is not a call of a $\phi_n$ function. This would indeed be the main expense if all your code were implemented in a low-level language like C.
In Python, on the other hand, the main expense is by far managing the infrastructure used by the numpy
package. Thus, Python running times are essentially meaningless if the algorithms are eventually written in a low level language, as they should be for efficiency, and if storage efficiency is not paramount.
The entries $z_i$ of the output
$$ \mathbf{z} = (z_0, \ldots, z_{e-1}) = f(\mathbf{x}) $$

of a neural network $f \ :\ \mathbb{R}^d \rightarrow \mathbb{R}^e$ are called the logits.
We consider the following two uses of this network:

- By itself, as a regressor that returns an estimate $\hat{y} = \mathbf{z}$ of some vector whose true value is $y$.
- As a component of a classifier $\hat{y} = h(\mathbf{x})$ with classes in $Y = \{0, \ldots, e-1\}$, implemented by predicting the class with the highest logit:

$$ \hat{y} = h(\mathbf{x}) = \arg\max_{y\in Y} z_y \;. $$
Given a training set
$$ T = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_N, y_N)\} $$

the network is trained by finding a minimum $\hat{\mathbf{w}}$ of the empirical training risk $L_T$ on $T$:

$$ L_T(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, f(\mathbf{x}_n))\;. $$

Note that $y_n$ is a vector with $e$ entries for the regressor and a scalar for the classifier.
In the expression above the vector $\mathbf{w}$ collects the weights of the network and $\ell$ is a suitable loss function.
Specifically, the regressor is trained with the quadratic loss
$$ \ell(\mathbf{y}, \mathbf{z}) = \|\mathbf{y} - \mathbf{z}\|^2 \;:\; \mathbb{R}^e \times \mathbb{R}^e \rightarrow \mathbb{R} $$

where $\|\cdot\|$ denotes the Euclidean norm and we use a bold letter for the true value $y = \mathbf{y}$ to remind ourselves that this is a vector.
The classifier, on the other hand, is trained with the cross-entropy loss applied to the soft-max transformation of the logits:
$$ \ell(y, \mathbf{z}) = -\log [\boldsymbol{\delta}(y)^T \boldsymbol{\sigma}(\mathbf{z})] \;:\; \mathbb{R} \times \mathbb{R}^e \rightarrow \mathbb{R}\;. $$We use natural logarithms here.
In the last expression for $\ell$ above, $\boldsymbol{\delta}(y)$ is the one-hot encoding of the true label $y$. This encoding is a vector with $e$ entries: Entry $y$ is one and all others are zero.
For instance, if $e = 10$ and $y = 3$ we have
$$ \boldsymbol{\delta}(3) = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0) \;. $$

The symbol $\boldsymbol{\sigma}$ represents the soft-max function:

$$ \boldsymbol{\sigma}(\mathbf{z}) = \frac{e^{\mathbf{z}}}{\sum_{i=0}^{e-1} e^{z_i}} \;. $$

Because entry $y$ of $\boldsymbol{\delta}(y)$ is one and the other entries are zero, the expression for the loss for the classifier simplifies to the following:
$$ \ell(y, \mathbf{z}) = -\log \sigma_y(\mathbf{z}) = -\log \frac{e^{z_y}}{\sum_{i=0}^{e-1} e^{z_i}} = -z_y + \log \sum_{i=0}^{e-1} e^{z_i} \;. $$

Write a function with header
def quadratic_loss(y, z):
that returns the quadratic loss for a vector regressor. That is, your function should work for y
and z
of any positive length, as long as the two lengths are equal.
Then consider the scalar version, $e = 1$ and two training samples $(\mathbf{x}_0, y_0)$ and $(\mathbf{x}_1, y_1)$ with true values $y_0 = 0$ and $y_1 = 1$.
Show a single diagram with plots of the two loss functions $\ell(y_0, z)$ and $\ell(y_1, z)$ as $z$ ranges between -3 and 3.
It is OK to use explicit for
loops.
Your diagram should include a legend that shows which plot is for which true value.
Label the axes meaningfully.
You may want to encapsulate your plotting into a function, since you will be asked to make a similar plot for the cross-entropy loss.
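As a sketch of both pieces (the loss and the plot), assuming numpy arrays or lists of equal length for y and z; adapt freely, for instance by encapsulating the plotting as suggested above:

import numpy as np
from matplotlib import pyplot as plt

def quadratic_loss(y, z):
    d = np.asarray(y, dtype=float) - np.asarray(z, dtype=float)
    return np.dot(d, d)                      # squared Euclidean norm

# Scalar case: plot the loss for the two true values y0 = 0 and y1 = 1
zs = np.linspace(-3., 3., 200)
for y_true in (0., 1.):
    losses = [quadratic_loss([y_true], [z]) for z in zs]
    plt.plot(zs, losses, label='true value y = {}'.format(int(y_true)))
plt.xlabel('output z')
plt.ylabel('quadratic loss')
plt.legend()
plt.show()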
Let $e = 2$. Show that in this case the soft-max function of the vector $\mathbf{z} = (z_0, z_1)$ depends only on the difference $z_0 - z_1$. That is, if
$$ \mathbf{p} = (p_0, p_1) = \boldsymbol{\sigma}((z_0, z_1)) $$

then there exist two functions $s_0$ and $s_1$ from $\mathbb{R}$ to $\mathbb{R}$ such that

$$ p_0 = s_0(d) \;\;\;\;\;\;\text{and}\;\;\;\;\;\; p_1 = s_1(d) \;\;\;\;\;\;\text{where}\;\;\;\;\;\; d = z_0 - z_1\;. $$

Your proof should be constructive, that is, you should specify the two functions $s_0$ and $s_1$.
Just write out the definition of soft-max for each of the two entries in the binary case and manipulate the resulting expressions.
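For instance, one possible route, which essentially carries out this hint, is to divide the numerator and denominator of each entry by $e^{z_0}$ or $e^{z_1}$, respectively:

$$ p_0 = \frac{e^{z_0}}{e^{z_0} + e^{z_1}} = \frac{1}{1 + e^{z_1 - z_0}} = \frac{1}{1 + e^{-d}} = s_0(d) \;\;\;\;\;\;\text{and}\;\;\;\;\;\; p_1 = \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{1}{1 + e^{d}} = s_1(d) \;. $$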
Write a function with header
def cross_entropy_loss(y, z):
that returns the cross-entropy loss for a classifier with any positive number of classes as a function of true label y
and logits z
.
Your function should work when z
is a vector of any positive length $e$ and the true label y
is any integer scalar in $[0, e-1]$.
Then consider the binary version, $e = 2$ and two training samples $(\mathbf{x}_0, y_0)$ and $(\mathbf{x}_1, y_1)$ with true values $y_0 = 0$ and $y_1 = 1$.
Show a single diagram with plots of the loss functions for these two true values as the difference $d = z_0 - z_1$ ranges between -3 and 3.
It is OK to use explicit for
loops.
Your diagram should include a legend that shows which plot is for which true value.
Label the axes meaningfully.
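A possible sketch, using the simplified expression $-z_y + \log \sum_i e^{z_i}$ derived above; subtracting $\max_i z_i$ inside the log-sum-exp is a standard guard against overflow and does not change the value:

import numpy as np
from matplotlib import pyplot as plt

def cross_entropy_loss(y, z):
    z = np.asarray(z, dtype=float)
    m = z.max()                              # stabilize the exponentials
    return -z[y] + m + np.log(np.sum(np.exp(z - m)))

# Binary case: plot the loss against d = z0 - z1; fixing z1 = 0 is legitimate
# because the loss depends only on d (see the previous problem)
ds = np.linspace(-3., 3., 200)
for y_true in (0, 1):
    losses = [cross_entropy_loss(y_true, [d, 0.]) for d in ds]
    plt.plot(ds, losses, label='true label y = {}'.format(y_true))
plt.xlabel('logit difference d = z0 - z1')
plt.ylabel('cross-entropy loss')
plt.legend()
plt.show()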
A layer in a neural network has an input of shape $w\times h\times c$, where $c$ is the number of input channels, and an output of shape $w\times h\times d$, where $d$ is the number of output channels.

Give expressions for the numbers $m_1$ and $m_2$ of weights for this layer in each of the following cases:

- $m_1$: the layer is fully connected;
- $m_2$: the layer is convolutional with kernels of size $k\times k$.

Your expressions should be in terms of $w, h, c, d, k$.
The tiny neural network in the figure below has two layers, three neurons, and an input vector $\mathbf{x}$ with three components $x_1, x_2, x_3$. Neurons $n_1$ and $n_2$ have a ReLU nonlinearity while $n_3$ is linear (that is, it has no non-linearity). The network gains are shown near the edges and the biases are $b_1, b_2, b_3$.
Thus, the relation $\hat{y} = h(\mathbf{x})$ between input vector $\mathbf{x} = (x_1, x_2, x_3)$ and output $\hat{y}$ from the network is as follows:
$$ \begin{eqnarray*} a_1 &=& u_{11} x_1 + u_{12} x_2 + u_{13} x_3 + b_1 \\ a_2 &=& u_{21} x_1 + u_{22} x_2 + u_{23} x_3 + b_2 \\ z_1 &=& \max(0, a_1) \\ z_2 &=& \max(0, a_2) \\ \hat{y} &=& v_1 z_1 + v_2 z_2 + b_3 \end{eqnarray*} $$

The variables $a_1$ and $a_2$ in these expressions denote the activations of neurons $n_1$ and $n_2$, that is, the values of their outputs before they are passed through the ReLU $\rho(a) = \max(0, a)$. Making $a_1$ and $a_2$ explicit will simplify your work later on.
The numerical values of the weights for this network are as follows:
$$ \begin{array}{llll} u_{11} = 3 & u_{12} = -1 & u_{13} = 0 & b_1 = -4 \\ u_{21} = -1 & u_{22} = 1 & u_{23} = 3 & b_2 = 1 \\ v_1 = 4 & v_2 = 1 & b_3 = 0 \end{array} $$

We collect all the network weights into a vector $\mathbf{w}$ in the following order:

$$ \mathbf{w} = (u_{11}, u_{12}, u_{13}, b_1, u_{21}, u_{22}, u_{23}, b_2, v_1, v_2, b_3) \in\mathbb{R}^{11} \;. $$

Given an input vector $\mathbf{x}$ with true output value $y$, the quadratic loss is used to measure estimation errors. This loss is defined as follows:

$$ \ell(y, \mathbf{x}) = \frac{1}{2} [h(\mathbf{x}) - y]^2 \;. $$
The division by 2 is introduced here for mathematical convenience, since it cancels a 2 that comes from differentiating the square when computing gradients.
To run a forward pass through a network means to compute $\hat{y} = h(\mathbf{x})$ for a specific input $\mathbf{x}$.
Run a forward pass through the network in the figure for input $$ \mathbf{x} = (-1, -2, 2) $$ giving, in this order, the numerical values of $a_1$, $a_2$, $z_1$, $z_2$, and $\hat{y}$. You will need these values in the next problem.
Finally, give the value of the quadratic loss if the true value for this input is $y = 2$.
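You should do this computation by hand, but a few lines of Python can double-check your arithmetic; the numbers below are copied from the problem statement:

import numpy as np

x = np.array([-1., -2., 2.])
u1, b1 = np.array([3., -1., 0.]), -4.        # weights of neuron n1
u2, b2 = np.array([-1., 1., 3.]), 1.         # weights of neuron n2
v1, v2, b3 = 4., 1., 0.                      # weights of neuron n3

a1 = u1 @ x + b1                             # activation of n1
a2 = u2 @ x + b2                             # activation of n2
z1, z2 = max(0., a1), max(0., a2)            # ReLU outputs
y_hat = v1 * z1 + v2 * z2 + b3               # network output
loss = 0.5 * (y_hat - 2.) ** 2               # quadratic loss for y = 2
print(a1, a2, z1, z2, y_hat, loss)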
Back-propagating through the network for a particular input $\mathbf{x}$ and set of values for the network weights in $\mathbf{w}$ means computing the gradient
$$ \mathbf{g} = \frac{\partial\ell}{\partial\mathbf{w}} \in \mathbb{R}^{11} $$

for that input and those weights.
The calculation is done as described in the class notes.
In the specific instance, the order of calculations is as follows, where $g_\alpha$ is shorthand for $\frac{\partial\ell}{\partial\alpha}$:
$$ g_{\hat{y}}, g_{b_3}, g_{v_2}, g_{v_1}, g_{z_2}, g_{z_1}, g_{a_2}, g_{a_1}, g_{b_2}, g_{u_{23}}, g_{u_{22}}, g_{u_{21}}, g_{b_1}, g_{u_{13}}, g_{u_{12}}, g_{u_{11}} $$

Back-propagate through the given network for the numerical input and true label values given above. Once you are done, also give the vector $g_{\mathbf{w}} = \frac{\partial\ell}{\partial\mathbf{w}}$ in numerical form, where the entries are now ordered as they are in the vector $\mathbf{w}$ defined in the previous problem.
This time, do show your calculations, using the format given in the homework template file. The first few lines in the template are done for you, so you see what format we are looking for.
You may use the expression $\sigma(a)$ to denote a sub-derivative of the ReLU function $\rho(a)$. One possible definition of this sub-derivative is $\sigma(a) = 1$ for $a > 0$ and $\sigma(a) = 0$ otherwise.
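If you want to sanity-check your hand-computed gradient (do not submit this check), a central finite-difference approximation over the 11 weights is a quick test. This sketch re-implements the forward pass and loss as a function of $\mathbf{w}$:

import numpy as np

def loss(w, x=np.array([-1., -2., 2.]), y=2.):
    # w = (u11, u12, u13, b1, u21, u22, u23, b2, v1, v2, b3)
    z1 = max(0., w[0:3] @ x + w[3])          # ReLU output of n1
    z2 = max(0., w[4:7] @ x + w[7])          # ReLU output of n2
    y_hat = w[8] * z1 + w[9] * z2 + w[10]    # network output
    return 0.5 * (y_hat - y) ** 2

w = np.array([3., -1., 0., -4., -1., 1., 3., 1., 4., 1., 0.])
g = np.empty_like(w)
delta = 1.e-6
for i in range(len(w)):                      # perturb one weight at a time
    wp, wm = w.copy(), w.copy()
    wp[i] += delta
    wm[i] -= delta
    g[i] = (loss(wp) - loss(wm)) / (2. * delta)
print(g)                                     # compare with back-propagation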
In parts 1 and 2 of this assignment you worked with a mystery function $f$ and its restriction to a two-dimensional subspace.
Here is what $f$ actually is: The input $\mathbf{w}$ is a set of $d=7510$ gain and bias values for a small neural network. The value returned by $f$ is the average cross-entropy loss on an annotated training data set of $N = 1500$ hand-written digits. The gradient that $f$ returns is computed by back-propagation on this training set.
The network has just two layers and a soft-max transformation at the output. The two layers are fully connected. A network with this structure (cascade of fully-connected layers) is called a Multi-Layer Perceptron.
The input layer takes as its input the 64 pixel values of certain tiny images. This layer has 100 neurons, and therefore 6400 gains and 100 biases, for a total of 6500 weights. The 100 outputs are passed through a ReLU. The second layer takes the 100 values from the output of the ReLU and produces 10 values, called logits, one logit per possible digit value in $Y = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$. There is no ReLU here. So the second layer has 1000 gains and 10 biases, for a total of 1010 weights.
The output from the second layer is passed to the soft-max to produce 10 softmax scores. The soft-max transformation is not learnable, so it has no parameters.
In total, the number of weights is therefore $6500 + 1010 = 7510$, the length of $\mathbf{w}$.
The training data consists of 1500 black-and-white images of handwritten digits from the MNIST database, resized to $8\times 8$ pixel images so you can finish this assignment in a reasonable amount of time.
A separate set of 297 images from the same set, similarly resized, is reserved for testing. You can load and look at a subset of the data as follows:
from function import get_data
data = get_data()
from matplotlib import pyplot as plt

t = data['training']
plt.figure(figsize=(8, 8), tight_layout=True)
rows = 6
for k in range(rows ** 2):
    plt.subplot(rows, rows, k + 1)
    img = t['x'][k].reshape((8, 8))
    plt.imshow(img, cmap=plt.cm.gray)
    plt.axis('image')
    plt.axis('off')
    plt.title('{}'.format(t['y'][k]))
plt.show()
Note that data['training']['x']
gives you all 1500 images in a $1500\times 64$ numpy array, one image per row, which the code above reshapes into an $8\times 8$ array.
The labels are digits between 0 and 9 in data['training']['y']
.
You can get the test data similarly via data['test']
.
However, our interaction with the data in the problems below is more abstract, and occurs through vectors of indices.
Specifically, you can first just train your system by minimizing f
, the function from part 1 of this assignment, starting from a random set of network weights. We start with a less aggressive learning rate and more epochs, given the much higher dimensionality of the weight space (uncomment below to execute once you have sgd
).
# w0, n_indices = start_over(f, sgd)
# f_star, w_star = sgd(f, w0, n_indices, n_indices,
#                      epoch_alpha=0.5, n_epochs=500)
# print_outcome(f_star, w_star)
Of course, the @Display
decorator is now unable to print the path in the space of all weights, as this space has dimension 7510, so it only prints the function values.
You can get a predictor function from the resulting vector w_star
of weights like this (uncomment):
# from function import get_score_predictor
# predict_scores = get_score_predictor(w_star)
The function predict_scores
does not take an image as input. Instead, it takes a vector of indices and a string that is either 'training'
or 'test'
. For instance, the following call prints a $3\times 10$ array of positive real values (uncomment):
# idx, which_set = [10, 25, 40], 'training'
# with np.printoptions(precision=2):
# print(predict_scores(idx, which_set))
The ten numbers in the first row are the soft-max scores for the ten digit categories (ordered 0 through 9). A more direct answer (what is the digit for each of the input images) can then be obtained by determining the category with the highest score in each row (uncomment):
# scores = predict_scores(idx, which_set)
# print(np.argmax(scores, axis=1))
While we could in principle remove the soft-max and work directly with the logits, we don't have access to the logits given how the network is constructed, so we'll work with the soft-max scores instead. The soft-max function is monotonic, so working with logits or soft-max scores is the same in terms of classification.
If you wanted to check the results visually you could do the following (uncomment):
# plt.figure(figsize=(8, 3), tight_layout=True)
# for plot, (index, score) in enumerate(zip(idx, scores)):
#     plt.subplot(1, len(idx), plot + 1)
#     image = data[which_set]['x'][index].reshape(8, 8)
#     y_hat, y = np.argmax(score), data[which_set]['y'][index]
#     plt.imshow(image, cmap=plt.cm.gray)
#     plt.axis('off')
#     plt.title('true {}, predicted {}'.format(y, y_hat))
# plt.show()
Write functions with headers
def predict(h, d, set_type):
def error_rate(y, y_hat):
def evaluate(h, d, set_type):
The function predict
takes a score predictor h
(such as predict_scores
), a data set (such as data
), and a string set_type
that is either 'training'
or 'test'
and returns a numpy
vector y_hat
with the digit labels inferred from the predictor's output for all of the data in the specified set (training or test set).
The function error_rate
takes two numpy
vectors y
and y_hat
of true and predicted digit labels (these are vectors of integers in $[0, 9]$) and returns the fraction of times that an entry in y
differs from the corresponding entry in y_hat
. So the output is a real number between 0 and 1 inclusive.
The inputs to the function evaluate
are the same as for predict
. The function evaluate
prints the error rate on the specified set (training or test set) as a percentage value (so you need to multiply the output from error_rate
by 100).
The format of the printout is
'{} error rate {:.2f} percent'
where the first bracket is replaced by set_type
and the second by the percentage (with two decimal digits after the period).
Show your code and the output from the calls below, uncommented.
Do not use explicit for
loops in your code.
Then answer the following questions:
Does the network achieve good accuracy on the training set?
Does the network generalize well? Justify your answer briefly.
# evaluate(predict_scores, data, 'training')
# evaluate(predict_scores, data, 'test')
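One possible loop-free sketch of the three functions, assuming the data layout described above (np.arange(n) selects every item in the chosen set):

def predict(h, d, set_type):
    n = len(d[set_type]['y'])                # number of items in the set
    scores = h(np.arange(n), set_type)       # one row of scores per item
    return np.argmax(scores, axis=1)         # highest-scoring digit per row

def error_rate(y, y_hat):
    return np.mean(y != y_hat)               # fraction of mismatched labels

def evaluate(h, d, set_type):
    rate = error_rate(d[set_type]['y'], predict(h, d, set_type))
    print('{} error rate {:.2f} percent'.format(set_type, 100 * rate))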
The confusion matrix for a classifier with output set $Y$ with cardinality $m$ and on a data set $D$ is an $m\times m$ matrix $C$ whose entry $c_{tp}$ in row $t$ and column $p$ is the number of entries in $D$ whose true label is $t$ and for which the classifier predicts label $p$. In our case $m = 10$.
Write a function with header
def confusion_matrix(h, d, set_type):
that takes the same arguments as the functions predict
and evaluate
in the previous problem and returns the confusion matrix for the given score predictor, data set, and set type.
Show your code and the output from the calls below, uncommented.
It is OK to use nested loops on $t$ and $p$ in your code, since these indices range over small sets, but use no other explicit for
loops.
Make sure the confusion matrix has integer type.
# print(confusion_matrix(predict_scores, data, 'training'))
# print(confusion_matrix(predict_scores, data, 'test'))
The package sklearn
has a similar function. However, for credit, you need to write confusion_matrix
without using that function from sklearn
or similar functions from other packages.
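A sketch that stays within the constraints above (nested loops over $t$ and $p$ only, and the predict function from the previous problem):

def confusion_matrix(h, d, set_type):
    y = d[set_type]['y']                     # true labels
    y_hat = predict(h, d, set_type)          # predicted labels
    m = 10                                   # number of classes
    c = np.zeros((m, m), dtype=int)          # integer-typed, as required
    for t in range(m):                       # true label t
        for p in range(m):                   # predicted label p
            c[t, p] = np.sum((y == t) & (y_hat == p))
    return c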
Then give expressions in terms of the entries $c_{tp}$ and shape parameter $m$ of $C$ for empirical estimates of the following probabilities:

No need to justify your formulas if you are confident about them.